Intro
This tutorial describes how we can use geneviewer
to
identify and visualize homologous genes across clusters by performing a
BlastP alignment. To demonstrate this we use a GenBank file that
contains the genes involved in the biosynthesis of erythromycin A from
Saccharopolyspora erythraea and compare it too several
homologous clusters from different species which were identified using
antiSMASH.
For guidance on importing clusters from GenBank files, please refer to
this guide.
Materials
The .gbk files and the additional gene info can be downloaded from
the geneviewer-tutorials
repository. For visualization, the geneviewer
package is
required. Sequence alignment is performed using the
Biostrings
and pwalign
packages that can be
downloaded from Bioconductor. Optionally, the parallel
package can be utilized to increase processing times.
devtools::install_github("nvelden/geneviewer")
BiocManager::install("Biostrings")
BiocManager::install("pwalign")
# Optional but recommended for speeding up processing
install.packages("parallel")
library(geneviewer)
library(Biostrings)
library(pwalign)
library(parallel)
Loading cluster information
The cluster information needed to run the BlastP can be loaded
directly from the .gbk files by pointing to the folder path that
contains the files and running the protein_blast()
function.
# change the path to the folder where the .gbk files are saved
folder_path <- "~/path/to/folder/"
Alternatively, the protein blast can be performed using a data.frame that contains the start, end, strand, translation as well as a unique identifier for each gene and cluster as shown below.
protein_id | translation | cluster | strand | start | end | |
---|---|---|---|---|---|---|
1 | AAU93784.2 | MIASAPGEEVSLRAVCAAAGVQMPTLYHFFGNKQGLTDAVVERGFDLYLAEKSAGESTGDPIQDIRDGWDAHVAFGLRNPGFYTLMYGKVRPGYSPSAQARPSELLRELTRKAHEQGRLCTTPEQAAAHVLVTNIGVTLRQIVLATEDRDLSIAVREGVIAAITGSASSSPSLQADVRRVLEHAVTQRDLLGGEETALLTKWLSRLLAVGSDG | BGC0000054 | complement | 228 | 869 |
2 | AAU93785.1 | MSLQYSCIADTCSYSLDIEGVTMSTHTLEGKTVLVAGGAKNLGGLISRQAAEAGANVALHYNSASTRADAEETLAAVEAAGGKGVILTGDLTVPDNVARLFADAESALGRVDVAVNTVGKVLRKPIADTTEAEYDAMFDINSKAAYFFIKEAGRHVADGGKVITIVTSLLGAFTDGYSTYAGGKSPVEHFTRAAAKEFAERGISVTAIAPGPMDTPFFYGQETPERVEFHRSQAMGGQLTQIEDISPIVRFLATEGWWITGQTIFANGGYTTR | BGC0000054 | forward | 925 | 1746 |
4 | AAU93786.2 | MPSASRRGRRTLHADCDSRAQPENPSACLPWHPPPSSTRRTNVPQSRPASGQRRATPESPSSSSSRPTSRYAATSGGVALVGLATRSAIEQHPTTSGTAATQLSTSEIGCPRGATEQRHEHEQARPDAQTPRRPDRPCRRTPPPIVRRRPARRRGAGDEPGIRRRTCLATLRDAVDPRGPDGLAG | BGC0000054 | forward | 2172 | 2729 |
5 | AAU93787.1 | MRGIILAGGTGSRLHPITRGISKQLVPVYDKPMVYYPLSTLMLAGIRDILIITSPHEAELFVRLLGDGSAFGVNLSYAVQPTPDGLAQAFLIGESHLGQEPAALVLGDNIFHGTGMGYQLRRHTQTEVAAIFGYRTSNPQAYGVVEIGEDGRAISLEEKPQRPRSPYAVPGLYFYPDDVVEHAKTLVPSARGELEITDLNRLYLQEGRLHVEVLPRGTAWFDTGTFDSLNDASNFVRTIEARQGTLVGSPEEVAWRVGFLSDAELLERAERHSSSTYGPYLRALLDEGDDAHDPYRV | BGC0000054 | forward | 2920 | 3813 |
6 | AAU93788.2 | MTIDVPFLDLDAAYVELRPQIDAAVSGVLASGRYLLGAQTEAFEAEFAAACSAAHCVTVGSGCDALELSLTALGVGPGDEVVVPAHTFIATWLAVSRCGARPVPVEPSPDGYLVDVEAVEAAITPRTAAILVVHLYGEVADLAAIRRVADRHGLALVEDAAQSTGARGRDGAVVGSGSTAAAFSFYPGKNLGALGDGGAVVTADDDLARRVRLLRNYGSVAKYVHEVRGTNSRLDEIQAAVLRAKLPLLDAWNARRATIAALYLDRLEEASGVRLPPHDQGRSAWHLFVVRCHDRETLQRELTRHGVETLVHYPTPVHLSAAYADHGHRAGSLPHAERLAREVLSLPMGPHLPLPAAERVADLVAVLAARP | BGC0000054 | forward | 3822 | 4937 |
7 | AAU93789.1 | MRARQLRVAGGWEFTPTVHHDARGVFASPLQEEAFVAAVGHPFPVAQTNHIVSDRGVLRGVHFTASPGQSKYVHCAAGRALDVMIDVRTDSPTFGTWEAVELDPRRCNALYFPIGTAHAFMALEPGTIMSYLVSTPYDAELELAIDPFDPELALPWPADVPALLSGRDEVAMSFAQARAGGLLPRSGPRMVAEGVEP | BGC0000054 | complement | 4934 | 5527 |
Run BlastP
In this tutorial, we will directly input the folder path into
the protein_blast()
function to load our data. We’ll select
BGC0000055 as our query cluster and conduct a BlastP analysis to find
the homologous in the other clusters. We use 30 as the minimum identity
threshold. Performing the BlastP analysis with this dataset can take
several minutes so we set parallel processing to TRUE. For smaller
datasets or if the parallel
package is not installed, set
parallel processing to FALSE
.
BlastP_results <- geneviewer::protein_blast(
folder_path,
query = "BGC0000055",
id = "protein_id", # Name of column containing gene identifiers.
cluster = "cluster", # Name of column containing cluster identifiers.
identity = 30,
parallel = TRUE
)
After running the BlastP, the dataset will contain three additional columns. These columns are:
- BlastP: Indicates the top BlastP match found within the query cluster, characterized by its unique protein id.
- identity: The percentage identity to the BlastP hit.
- similarity: The percentage similarity to the BlastP hit.
- score: Synteny score between clusters based on those used in antiSMASH/MultiGeneBlast.
protein_id | region | cluster | strand | start | end | rowID | identity | similarity | BlastP | score |
---|---|---|---|---|---|---|---|---|---|---|
CAM00053.1 | 1..381 | BGC0000055 | forward | 1 | 381 | 1 | 100 | 100 | No Hit | 22.5 |
CAM00054.1 | 441..1634 | BGC0000055 | forward | 441 | 1634 | 2 | 100 | 100 | CAM00054.1 | 22.5 |
CAM00055.1 | complement(1589..2191) | BGC0000055 | complement | 1589 | 2191 | 3 | 100 | 100 | CAM00055.1 | 22.5 |
CAM00056.1 | complement(2199..3668) | BGC0000055 | complement | 2199 | 3668 | 4 | 100 | 100 | CAM00056.1 | 22.5 |
CAM00057.1 | complement(3706..4911) | BGC0000055 | complement | 3706 | 4911 | 5 | 100 | 100 | CAM00057.1 | 22.5 |
CAM00058.1 | complement(4908..6371) | BGC0000055 | complement | 4908 | 6371 | 6 | 100 | 100 | CAM00058.1 | 22.5 |
We can visualize the results using geneviewer
. In the
graph the clusters are ordered based on the synteny score. The genes are
colored by the BlastP hit found in the query cluster. Genes that remain
uncolored did not have any significant homologous. By hoovering over the
genes one can see the percentage identity and similarity.
GC_chart(
data = BlastP_results,
cluster = "cluster",
strand = "strand",
group = "BlastP",
height = "400px"
) %>%
GC_clusterLabel() %>%
GC_legend(FALSE)
Color by gene function
Rather than assigning colors to genes based on their BlastP matches,
we can color them by gene function. To do this we can bind the extra
gene information specific to our query cluster from the
BGC0000055_info.csv
file.
# change the path to the folder where the .gbk files are saved
BGC0000055_info <- "~/path/to/folder/BGC0000055_info.csv"
protein_id | Gene | Position | Product | Functions |
---|---|---|---|---|
CAM00053.1 | 778214 - 778594 (+) | putative erythromycin esterase | Other | |
CAM00054.1 | eryK | 778654 - 779847 (+) | cytochrome P450 Erythromycin B/D C-12 hydroxylase | Tailoring (Hydroxylation) |
CAM00055.1 | eryBVII | 779802 - 780404 (-) | dTDP-4-deoxyglucose 3 | Precursor biosynthesis |
CAM00056.1 | eryCV | 780412 - 781881 (-) | EryCV NDP-4,6-dideoxyhexose 3,4-enoyl reductase | Precursor biosynthesis |
CAM00057.1 | eryCIV | 781919 - 783124 (-) | eryCIV NDP-6-deoxyhexose 3,4-dehydratase | Precursor biosynthesis |
CAM00058.1 | eryBVI | 783121 - 784584 (-) | NDP-4-keto-6-deoxy-glucose 2,3-dehydratase | Precursor biosynthesis |
By executing a left_join()
operation with the
BlastP_results
dataset on the BlastP
field, we
can add the extra information to the BlastP results.
BlastP_results_functions <- left_join(BlastP_results, BGC0000055_info, by = dplyr::join_by(BlastP == protein_id ) )
We can now use the Function
category to color the genes.
We’ll also add gene names, links, and BlastP identity values to the
cluster information.
GC_chart(
data = BlastP_results_functions,
cluster = "cluster",
group = "Functions",
strand = "strand",
height = "600px"
) %>%
GC_labels(label = "Gene", cluster = 1) %>%
GC_links(group = "BlastP", measure = "identity") %>%
GC_clusterLabel() %>%
GC_legend(TRUE)
Alternatively, the GC_links() function allows for highlighting
connections between specific genes by utilizing the value1 and value2
parameters. To gain insights into the cluster sizes, we can adjust the
axis_type to range. For additional styling and coloring options, refer
to the GC_links() and GC_scale
documentation.
GC_chart(
data = BlastP_results_functions,
cluster = "cluster",
group = "Functions",
strand = "strand",
height = "600px"
) %>%
GC_labels(label = "Gene", cluster = 1) %>%
GC_scale(axis_type = "range") %>%
GC_links(group = "Gene",
value1 = c("eryAI", "eryAII", "eryAIII"),
value2 = c("eryAI", "eryAII", "eryAIII"),
use_group_colors = TRUE
) %>%
GC_clusterLabel() %>%
GC_legend(TRUE)