Cluster comparison using BlastP • geneviewer

Intro

This tutorial describes how we can use geneviewer to identify and visualize homologous genes across clusters by performing a BlastP alignment. To demonstrate this we use a GenBank file that contains the genes involved in the biosynthesis of erythromycin A from Saccharopolyspora erythraea and compare it too several homologous clusters from different species which were identified using antiSMASH. For guidance on importing clusters from GenBank files, please refer to this guide.

Materials

The .gbk files and the additional gene info can be downloaded from the geneviewer-tutorials repository. For visualization, the geneviewer package is required. Sequence alignment is performed using the Biostrings and pwalign packages that can be downloaded from Bioconductor. Optionally, the parallel package can be utilized to increase processing times.

devtools::install_github("nvelden/geneviewer")
BiocManager::install("Biostrings")
BiocManager::install("pwalign")
# Optional but recommended for speeding up processing
install.packages("parallel") 

library(geneviewer)
library(Biostrings)
library(pwalign)
library(parallel)

Loading cluster information

The cluster information needed to run the BlastP can be loaded directly from the .gbk files by pointing to the folder path that contains the files and running the protein_blast() function.

# change the path to the folder where the .gbk files are saved
folder_path <- "~/path/to/folder/"

Alternatively, the protein blast can be performed using a data.frame that contains the start, end, strand, translation as well as a unique identifier for each gene and cluster as shown below.

	protein_id	translation	cluster	strand	start	end
1	AAU93784.2	MIASAPGEEVSLRAVCAAAGVQMPTLYHFFGNKQGLTDAVVERGFDLYLAEKSAGESTGDPIQDIRDGWDAHVAFGLRNPGFYTLMYGKVRPGYSPSAQARPSELLRELTRKAHEQGRLCTTPEQAAAHVLVTNIGVTLRQIVLATEDRDLSIAVREGVIAAITGSASSSPSLQADVRRVLEHAVTQRDLLGGEETALLTKWLSRLLAVGSDG	BGC0000054	complement	228	869
2	AAU93785.1	MSLQYSCIADTCSYSLDIEGVTMSTHTLEGKTVLVAGGAKNLGGLISRQAAEAGANVALHYNSASTRADAEETLAAVEAAGGKGVILTGDLTVPDNVARLFADAESALGRVDVAVNTVGKVLRKPIADTTEAEYDAMFDINSKAAYFFIKEAGRHVADGGKVITIVTSLLGAFTDGYSTYAGGKSPVEHFTRAAAKEFAERGISVTAIAPGPMDTPFFYGQETPERVEFHRSQAMGGQLTQIEDISPIVRFLATEGWWITGQTIFANGGYTTR	BGC0000054	forward	925	1746
4	AAU93786.2	MPSASRRGRRTLHADCDSRAQPENPSACLPWHPPPSSTRRTNVPQSRPASGQRRATPESPSSSSSRPTSRYAATSGGVALVGLATRSAIEQHPTTSGTAATQLSTSEIGCPRGATEQRHEHEQARPDAQTPRRPDRPCRRTPPPIVRRRPARRRGAGDEPGIRRRTCLATLRDAVDPRGPDGLAG	BGC0000054	forward	2172	2729
5	AAU93787.1	MRGIILAGGTGSRLHPITRGISKQLVPVYDKPMVYYPLSTLMLAGIRDILIITSPHEAELFVRLLGDGSAFGVNLSYAVQPTPDGLAQAFLIGESHLGQEPAALVLGDNIFHGTGMGYQLRRHTQTEVAAIFGYRTSNPQAYGVVEIGEDGRAISLEEKPQRPRSPYAVPGLYFYPDDVVEHAKTLVPSARGELEITDLNRLYLQEGRLHVEVLPRGTAWFDTGTFDSLNDASNFVRTIEARQGTLVGSPEEVAWRVGFLSDAELLERAERHSSSTYGPYLRALLDEGDDAHDPYRV	BGC0000054	forward	2920	3813
6	AAU93788.2	MTIDVPFLDLDAAYVELRPQIDAAVSGVLASGRYLLGAQTEAFEAEFAAACSAAHCVTVGSGCDALELSLTALGVGPGDEVVVPAHTFIATWLAVSRCGARPVPVEPSPDGYLVDVEAVEAAITPRTAAILVVHLYGEVADLAAIRRVADRHGLALVEDAAQSTGARGRDGAVVGSGSTAAAFSFYPGKNLGALGDGGAVVTADDDLARRVRLLRNYGSVAKYVHEVRGTNSRLDEIQAAVLRAKLPLLDAWNARRATIAALYLDRLEEASGVRLPPHDQGRSAWHLFVVRCHDRETLQRELTRHGVETLVHYPTPVHLSAAYADHGHRAGSLPHAERLAREVLSLPMGPHLPLPAAERVADLVAVLAARP	BGC0000054	forward	3822	4937
7	AAU93789.1	MRARQLRVAGGWEFTPTVHHDARGVFASPLQEEAFVAAVGHPFPVAQTNHIVSDRGVLRGVHFTASPGQSKYVHCAAGRALDVMIDVRTDSPTFGTWEAVELDPRRCNALYFPIGTAHAFMALEPGTIMSYLVSTPYDAELELAIDPFDPELALPWPADVPALLSGRDEVAMSFAQARAGGLLPRSGPRMVAEGVEP	BGC0000054	complement	4934	5527

Run BlastP

In this tutorial, we will directly input the folder path into the protein_blast() function to load our data. We’ll select BGC0000055 as our query cluster and conduct a BlastP analysis to find the homologous in the other clusters. We use 30 as the minimum identity threshold. Performing the BlastP analysis with this dataset can take several minutes so we set parallel processing to TRUE. For smaller datasets or if the parallel package is not installed, set parallel processing to FALSE.

BlastP_results <- geneviewer::protein_blast(
  folder_path,
  query = "BGC0000055",
  id = "protein_id", # Name of column containing gene identifiers. 
  cluster = "cluster", # Name of column containing cluster identifiers. 
  identity = 30,
  parallel = TRUE
)

After running the BlastP, the dataset will contain three additional columns. These columns are:

BlastP: Indicates the top BlastP match found within the query cluster, characterized by its unique protein id.
identity: The percentage identity to the BlastP hit.
similarity: The percentage similarity to the BlastP hit.
score: Synteny score between clusters based on those used in antiSMASH/MultiGeneBlast.

protein_id	region	cluster	strand	start	end	rowID	identity	similarity	BlastP	score
CAM00053.1	1..381	BGC0000055	forward	1	381	1	100	100	No Hit	22.5
CAM00054.1	441..1634	BGC0000055	forward	441	1634	2	100	100	CAM00054.1	22.5
CAM00055.1	complement(1589..2191)	BGC0000055	complement	1589	2191	3	100	100	CAM00055.1	22.5
CAM00056.1	complement(2199..3668)	BGC0000055	complement	2199	3668	4	100	100	CAM00056.1	22.5
CAM00057.1	complement(3706..4911)	BGC0000055	complement	3706	4911	5	100	100	CAM00057.1	22.5
CAM00058.1	complement(4908..6371)	BGC0000055	complement	4908	6371	6	100	100	CAM00058.1	22.5

We can visualize the results using geneviewer. In the graph the clusters are ordered based on the synteny score. The genes are colored by the BlastP hit found in the query cluster. Genes that remain uncolored did not have any significant homologous. By hoovering over the genes one can see the percentage identity and similarity.

GC_chart(
  data = BlastP_results,
  cluster = "cluster",
  strand = "strand",
  group = "BlastP",
  height = "400px"
) %>%
GC_clusterLabel() %>%
GC_legend(FALSE)

Color by gene function

Rather than assigning colors to genes based on their BlastP matches, we can color them by gene function. To do this we can bind the extra gene information specific to our query cluster from the BGC0000055_info.csv file.

# change the path to the folder where the .gbk files are saved
BGC0000055_info <- "~/path/to/folder/BGC0000055_info.csv"

protein_id	Gene	Position	Product	Functions
CAM00053.1		778214 - 778594 (+)	putative erythromycin esterase	Other
CAM00054.1	eryK	778654 - 779847 (+)	cytochrome P450 Erythromycin B/D C-12 hydroxylase	Tailoring (Hydroxylation)
CAM00055.1	eryBVII	779802 - 780404 (-)	dTDP-4-deoxyglucose 3	Precursor biosynthesis
CAM00056.1	eryCV	780412 - 781881 (-)	EryCV NDP-4,6-dideoxyhexose 3,4-enoyl reductase	Precursor biosynthesis
CAM00057.1	eryCIV	781919 - 783124 (-)	eryCIV NDP-6-deoxyhexose 3,4-dehydratase	Precursor biosynthesis
CAM00058.1	eryBVI	783121 - 784584 (-)	NDP-4-keto-6-deoxy-glucose 2,3-dehydratase	Precursor biosynthesis

By executing a left_join() operation with the BlastP_results dataset on the BlastP field, we can add the extra information to the BlastP results.

BlastP_results_functions <- left_join(BlastP_results, BGC0000055_info, by = dplyr::join_by(BlastP == protein_id ) )

We can now use the Function category to color the genes. We’ll also add gene names, links, and BlastP identity values to the cluster information.

GC_chart(
  data = BlastP_results_functions,
  cluster = "cluster",
  group = "Functions",
  strand = "strand",
  height = "600px"
) %>%
GC_labels(label = "Gene", cluster = 1) %>% 
GC_links(group = "BlastP",  measure = "identity") %>%  
GC_clusterLabel() %>%
GC_legend(TRUE)

Alternatively, the GC_links() function allows for highlighting connections between specific genes by utilizing the value1 and value2 parameters. To gain insights into the cluster sizes, we can adjust the axis_type to range. For additional styling and coloring options, refer to the GC_links() and GC_scale documentation.

GC_chart(
  data = BlastP_results_functions,
  cluster = "cluster",
  group = "Functions",
  strand = "strand",
  height = "600px"
) %>%
GC_labels(label = "Gene", cluster = 1) %>% 
GC_scale(axis_type = "range") %>%
GC_links(group = "Gene",  
         value1 = c("eryAI", "eryAII", "eryAIII"), 
         value2 = c("eryAI", "eryAII", "eryAIII"), 
         use_group_colors = TRUE
         ) %>%  
GC_clusterLabel() %>%
GC_legend(TRUE)