Skip to contents

Intro

This tutorial demonstrates creating gene cluster visualizations from GenBank files using geneviewer. For guidance on importing clusters from FASTA files see: link. You can obtain GenBank files containing genomic data from a variety of platforms such as NCBI, Ensembl, UCSC, antiSMASH and MIBiG. While geneviewer is designed to handle GenBank files from these sources (please report any compatibility issues on Github), this tutorial specifically aims to replicate a gene cluster visualization found on the MIBiG website.

Materials

The .gbk files can be downloaded from the MIBiG website following the below instructions or directly from the geneviewer-tutorials repository.

Loading GenBank file

The GenBank files from MIBiG include additional details, enabling us to color-code genes by function, which we will use to reproduce the visualizations they show on the website. Begin by downloading a GenBank file from the MIBiG website. For the below tutorial we will use the genbank file for gene cluster BGC0000001. Simply click the “Download GenBank file” link and save the file.

We can load the file into R using the read_gbk function from geneviewer. This will create a list containing all the sections and features found in the GenBank file.

library(geneviewer)
# change the path to where you have saved the file
file_path <- "~/path/to/file/BGC0000001.gbk"
gbk <- read_gbk(file_path)
View(gbk) # Inspect the list in Rstudio

We can inspect the .gbk using the View pane in Rstudio (use View(gbk)) or by simply opening the .gbk file in a text editor. For the visualization we are interested in the start and end position of each gene as well as their identifiers and functions. We can find these under FEATURES > CDS. Lets print the first CDS to see what information it contains.

CDS_1 <- gbk[["BGC0000001"]][["FEATURES"]][["CDS"]][[1]] 
df <- data.frame(values = CDS_1)
df
values
codon_start 1
gene orfP
gene_functions biosynthetic-additional (rule-based-clusters) PCMT
gene_kind biosynthetic-additional
product protein methyltransferase
protein_id AEK75490.1
sec_met_domain PCMT (E-value: 4.2e-28, bitscore: 89.4, seeds: 9, tool: rule-based-clusters
transl_table 11
translation MTELDRAFDAVPAPIYTHHERHGETVHRSAPESIRRELAA
region 1..1083

We can bind the information of each CDS into a data.frame using the gbk_features_to_df function. In the below code we define the feature and the keys we are interested in and bind it into a data.frame. We set process_region to TRUE which will extract the strand, start and end from the region key.

cds_df <- gbk_features_to_df(
  gbk, 
  feature = "CDS",
  keys = c("region", "gene", "protein_id", "gene_kind", "product"), 
  process_region = TRUE
  )
gene protein_id gene_kind product cluster strand start end
orfP AEK75490.1 biosynthetic-additional protein methyltransferase BGC0000001 forward 1 1083
NA AEK75491.1 NA putative hydrolase BGC0000001 forward 1208 1834
abyR AEK75492.1 regulatory pathway-specific SARP activator BGC0000001 forward 1887 2633
abyX AEK75493.1 biosynthetic-additional cytochrome P450 BGC0000001 complement 2646 3836
abyH AEK75494.1 regulatory LuxR family transcriptional regulator BGC0000001 complement 3927 6596
abyI AEK75495.1 regulatory pathway-specific SARP activator BGC0000001 forward 6952 7710

Cluster Visualization

We can see that there are some NA values in the gene_kind column so we will replace those with “other”. After we have done that we are ready to make the cluster visualization.

# replace NA with "other" for gene_kind
cds_df$gene_kind[is.na(cds_df$gene_kind)] <- "other"

# Make the cluster chart
chart <- GC_chart(
  cds_df,
  start = "start",
  end = "end",
  group = "gene_kind",
  strand = "strand",
  height = "130px"
)
chart

To make the chart more similar to the one we find on the MIBiG website we can add a title, alter the coloring and appearance of the genes and add a scale. In addition, we alter the tooltip to display the gene, product and region on hoover.

colors <- list(
  "biosynthetic" = "#810E15", 
  "biosynthetic-additional" = "#F16D75", 
  "transport" = "#6495ED",
  "regulatory" = "#2E8B57", 
  "other" = "#808080"
)

chart %>%
  GC_genes(
    marker = "boxarrow",
    marker_size = "small"
  ) %>%
  GC_title(
    "BGC0000001: abyssomicin C biosynthetic gene cluster
    from <i>Verrucosispora maris</i> AB-18-032",
    align = "left",
    titleFont = list(
      fontSize = "12px"
    )
    ) %>%
  GC_scale() %>%
  GC_color(customColors = colors) %>%
  GC_legend(order = names(colors)) %>%
  GC_tooltip(
    formatter = "
    <b>Gene:</b> {gene}<br>
    <b>Product:</b> {product}<br>
    <b>Region:</b>{start} - {end}
    "
  )

Loading Multiple GenBank files

The above functions to read and transfom .gbk files can also be used to load multiple .gbk files from a folder and shown the clusters side by side. To demonstrate this we can go to the cluster page for BGC0000001 and navigate to the “KnownClusterBlast” page. Click on the homologous clusters and download the .gbk files for BGC0001694, BGC0001492, BGC0001288 and BGC0000133.

We place all the.gbk files in the same directory and can load them all together as we have done previously.

library(geneviewer)
# change the path to the folder containing the .gbk files
folder_path <- "~/path/to/folder/"
gbk <- read_gbk(folder_path)
View(gbk) # Inspect the list in Rstudio

We now have a list of our .gbk files containing the different sections and features. As before we can use the gbk_features_to_df function to make a data.frame of the features. This time besides the CDS we also make a data.frame of the source such that we can add the Organism name and genomic region as title for each cluster.

cds_clusters_df <- gbk_features_to_df(
  gbk_list, 
  feature = "CDS",
  keys = c("region", "gene", "protein_id", "gene_kind", "product"), 
  process_region = TRUE
  )
source_clusters_df <- gbk_features_to_df(gbk_list, feature = "source")
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0001004
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0001204
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0002124
gene protein_id gene_kind product cluster strand start end
orfP AEK75490.1 biosynthetic-additional protein methyltransferase BGC0000001 forward 1 1083
NA AEK75491.1 other putative hydrolase BGC0000001 forward 1208 1834
abyR AEK75492.1 regulatory pathway-specific SARP activator BGC0000001 forward 1887 2633
abyX AEK75493.1 biosynthetic-additional cytochrome P450 BGC0000001 complement 2646 3836
abyH AEK75494.1 regulatory LuxR family transcriptional regulator BGC0000001 complement 3927 6596
abyI AEK75495.1 regulatory pathway-specific SARP activator BGC0000001 forward 6952 7710

We can now plot the graph but this time we specify the “cluster” column as identifier to distinguish the clusters and add cluster labels and titles.

# replace NA with "other" for gene_kind
cds_clusters_df$gene_kind[is.na(cds_clusters_df$gene_kind)] <- "other"

colors <- list(
  "biosynthetic" = "#810E15", 
  "biosynthetic-additional" = "#F16D75", 
  "transport" = "#6495ED",
  "regulatory" = "#2E8B57", 
  "other" = "#808080"
)

# Make the cluster chart
GC_chart(
  cds_clusters_df,
  start = "start",
  end = "end",
  group = "gene_kind",
  strand = "strand",
  cluster = "cluster",
  height = "800px"
) %>%
    GC_clusterTitle(
    y = 5,  
    title = paste0("<i>", source_clusters_df$organism, "</i>: ", source_clusters_df$region),
    align = "left",
    titleFont = list(fontSize = 12),
    ) %>%
  GC_clusterLabel(names(gbk_list)) %>%
  GC_genes(
    marker = "boxarrow",
    marker_size = "small"
  ) %>%
  GC_scaleBar(title = "1kb", scaleBarUnit = 1000) %>%
  GC_color(customColors = colors) %>%
  GC_legend(order = names(colors)) %>%
  GC_tooltip(
    formatter = "
    <b>Gene:</b> {gene}<br>
    <b>ProteinID:</b> {protein_id}<br>
    <b>Product:</b> {product}<br>
    <b>Region:</b>{start} - {end}
    "
  )