Loading Gene Clusters From Genbank Files
Source:vignettes/LoadGenBankFiles.Rmd
LoadGenBankFiles.Rmd
Intro
This tutorial demonstrates creating gene cluster visualizations from
GenBank files using geneviewer
. For guidance on importing
clusters from FASTA files see: link.
You can obtain GenBank files containing genomic data from a variety of
platforms such as NCBI, Ensembl, UCSC, antiSMASH
and MIBiG. While
geneviewer
is designed to handle GenBank files from these
sources (please report any compatibility issues on Github), this
tutorial specifically aims to replicate a gene cluster visualization
found on the MIBiG
website.
Materials
The .gbk files can be downloaded from the MIBiG website following the below instructions or directly from the geneviewer-tutorials repository.
Loading GenBank file
The GenBank files from MIBiG include additional details, enabling us to color-code genes by function, which we will use to reproduce the visualizations they show on the website. Begin by downloading a GenBank file from the MIBiG website. For the below tutorial we will use the genbank file for gene cluster BGC0000001. Simply click the “Download GenBank file” link and save the file.
We can load the file into R using the read_gbk
function
from geneviewer
. This will create a list containing all the
sections and features found in the GenBank file.
library(geneviewer)
# change the path to where you have saved the file
file_path <- "~/path/to/file/BGC0000001.gbk"
gbk <- read_gbk(file_path)
View(gbk) # Inspect the list in Rstudio
We can inspect the .gbk using the View pane in Rstudio (use
View(gbk)
) or by simply opening the .gbk file in a text
editor. For the visualization we are interested in the start and end
position of each gene as well as their identifiers and functions. We can
find these under FEATURES > CDS. Lets print the first CDS to see what
information it contains.
CDS_1 <- gbk[["BGC0000001"]][["FEATURES"]][["CDS"]][[1]]
df <- data.frame(values = CDS_1)
df
values | |
---|---|
codon_start | 1 |
gene | orfP |
gene_functions | biosynthetic-additional (rule-based-clusters) PCMT |
gene_kind | biosynthetic-additional |
product | protein methyltransferase |
protein_id | AEK75490.1 |
sec_met_domain | PCMT (E-value: 4.2e-28, bitscore: 89.4, seeds: 9, tool: rule-based-clusters |
transl_table | 11 |
translation | MTELDRAFDAVPAPIYTHHERHGETVHRSAPESIRRELAA |
region | 1..1083 |
We can bind the information of each CDS into a data.frame using the
gbk_features_to_df
function. In the below code we define
the feature and the keys we are interested in and bind it into a
data.frame. We set process_region to TRUE
which will
extract the strand, start and end from the region key.
cds_df <- gbk_features_to_df(
gbk,
feature = "CDS",
keys = c("region", "gene", "protein_id", "gene_kind", "product"),
process_region = TRUE
)
gene | protein_id | gene_kind | product | cluster | strand | start | end |
---|---|---|---|---|---|---|---|
orfP | AEK75490.1 | biosynthetic-additional | protein methyltransferase | BGC0000001 | forward | 1 | 1083 |
NA | AEK75491.1 | NA | putative hydrolase | BGC0000001 | forward | 1208 | 1834 |
abyR | AEK75492.1 | regulatory | pathway-specific SARP activator | BGC0000001 | forward | 1887 | 2633 |
abyX | AEK75493.1 | biosynthetic-additional | cytochrome P450 | BGC0000001 | complement | 2646 | 3836 |
abyH | AEK75494.1 | regulatory | LuxR family transcriptional regulator | BGC0000001 | complement | 3927 | 6596 |
abyI | AEK75495.1 | regulatory | pathway-specific SARP activator | BGC0000001 | forward | 6952 | 7710 |
Cluster Visualization
We can see that there are some NA values in the gene_kind column so we will replace those with “other”. After we have done that we are ready to make the cluster visualization.
# replace NA with "other" for gene_kind
cds_df$gene_kind[is.na(cds_df$gene_kind)] <- "other"
# Make the cluster chart
chart <- GC_chart(
cds_df,
start = "start",
end = "end",
group = "gene_kind",
strand = "strand",
height = "130px"
)
chart
To make the chart more similar to the one we find on the MIBiG website we can add a title, alter the coloring and appearance of the genes and add a scale. In addition, we alter the tooltip to display the gene, product and region on hoover.
colors <- list(
"biosynthetic" = "#810E15",
"biosynthetic-additional" = "#F16D75",
"transport" = "#6495ED",
"regulatory" = "#2E8B57",
"other" = "#808080"
)
chart %>%
GC_genes(
marker = "boxarrow",
marker_size = "small"
) %>%
GC_title(
"BGC0000001: abyssomicin C biosynthetic gene cluster
from <i>Verrucosispora maris</i> AB-18-032",
align = "left",
titleFont = list(
fontSize = "12px"
)
) %>%
GC_scale() %>%
GC_color(customColors = colors) %>%
GC_legend(order = names(colors)) %>%
GC_tooltip(
formatter = "
<b>Gene:</b> {gene}<br>
<b>Product:</b> {product}<br>
<b>Region:</b>{start} - {end}
"
)
Loading Multiple GenBank files
The above functions to read and transfom .gbk files can also be used to load multiple .gbk files from a folder and shown the clusters side by side. To demonstrate this we can go to the cluster page for BGC0000001 and navigate to the “KnownClusterBlast” page. Click on the homologous clusters and download the .gbk files for BGC0001694, BGC0001492, BGC0001288 and BGC0000133.
We place all the.gbk files in the same directory and can load them all together as we have done previously.
library(geneviewer)
# change the path to the folder containing the .gbk files
folder_path <- "~/path/to/folder/"
gbk <- read_gbk(folder_path)
View(gbk) # Inspect the list in Rstudio
We now have a list of our .gbk files containing the different
sections and features. As before we can use the
gbk_features_to_df
function to make a data.frame of the
features. This time besides the CDS we also make a data.frame of the
source such that we can add the Organism name and genomic region as
title for each cluster.
cds_clusters_df <- gbk_features_to_df(
gbk_list,
feature = "CDS",
keys = c("region", "gene", "protein_id", "gene_kind", "product"),
process_region = TRUE
)
source_clusters_df <- gbk_features_to_df(gbk_list, feature = "source")
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0001004
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0001204
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0002124
gene | protein_id | gene_kind | product | cluster | strand | start | end |
---|---|---|---|---|---|---|---|
orfP | AEK75490.1 | biosynthetic-additional | protein methyltransferase | BGC0000001 | forward | 1 | 1083 |
NA | AEK75491.1 | other | putative hydrolase | BGC0000001 | forward | 1208 | 1834 |
abyR | AEK75492.1 | regulatory | pathway-specific SARP activator | BGC0000001 | forward | 1887 | 2633 |
abyX | AEK75493.1 | biosynthetic-additional | cytochrome P450 | BGC0000001 | complement | 2646 | 3836 |
abyH | AEK75494.1 | regulatory | LuxR family transcriptional regulator | BGC0000001 | complement | 3927 | 6596 |
abyI | AEK75495.1 | regulatory | pathway-specific SARP activator | BGC0000001 | forward | 6952 | 7710 |
We can now plot the graph but this time we specify the “cluster” column as identifier to distinguish the clusters and add cluster labels and titles.
# replace NA with "other" for gene_kind
cds_clusters_df$gene_kind[is.na(cds_clusters_df$gene_kind)] <- "other"
colors <- list(
"biosynthetic" = "#810E15",
"biosynthetic-additional" = "#F16D75",
"transport" = "#6495ED",
"regulatory" = "#2E8B57",
"other" = "#808080"
)
# Make the cluster chart
GC_chart(
cds_clusters_df,
start = "start",
end = "end",
group = "gene_kind",
strand = "strand",
cluster = "cluster",
height = "800px"
) %>%
GC_clusterTitle(
y = 5,
title = paste0("<i>", source_clusters_df$organism, "</i>: ", source_clusters_df$region),
align = "left",
titleFont = list(fontSize = 12),
) %>%
GC_clusterLabel(names(gbk_list)) %>%
GC_genes(
marker = "boxarrow",
marker_size = "small"
) %>%
GC_scaleBar(title = "1kb", scaleBarUnit = 1000) %>%
GC_color(customColors = colors) %>%
GC_legend(order = names(colors)) %>%
GC_tooltip(
formatter = "
<b>Gene:</b> {gene}<br>
<b>ProteinID:</b> {protein_id}<br>
<b>Product:</b> {product}<br>
<b>Region:</b>{start} - {end}
"
)