Loading Gene Clusters From Genbank Files • geneviewer

Intro

This tutorial demonstrates creating gene cluster visualizations from GenBank files using geneviewer. For guidance on importing clusters from FASTA files see: link. You can obtain GenBank files containing genomic data from a variety of platforms such as NCBI, Ensembl, UCSC, antiSMASH and MIBiG. While geneviewer is designed to handle GenBank files from these sources (please report any compatibility issues on Github), this tutorial specifically aims to replicate a gene cluster visualization found on the MIBiG website.

Materials

The .gbk files can be downloaded from the MIBiG website following the below instructions or directly from the geneviewer-tutorials repository.

Loading GenBank file

The GenBank files from MIBiG include additional details, enabling us to color-code genes by function, which we will use to reproduce the visualizations they show on the website. Begin by downloading a GenBank file from the MIBiG website. For the below tutorial we will use the genbank file for gene cluster BGC0000001. Simply click the “Download GenBank file” link and save the file.

We can load the file into R using the read_gbk function from geneviewer. This will create a list containing all the sections and features found in the GenBank file.

library(geneviewer)
# change the path to where you have saved the file
file_path <- "~/path/to/file/BGC0000001.gbk"
gbk <- read_gbk(file_path)
View(gbk) # Inspect the list in Rstudio

We can inspect the .gbk using the View pane in Rstudio (use View(gbk)) or by simply opening the .gbk file in a text editor. For the visualization we are interested in the start and end position of each gene as well as their identifiers and functions. We can find these under FEATURES > CDS. Lets print the first CDS to see what information it contains.

CDS_1 <- gbk[["BGC0000001"]][["FEATURES"]][["CDS"]][[1]] 
df <- data.frame(values = CDS_1)
df

	values
codon_start	1
gene	orfP
gene_functions	biosynthetic-additional (rule-based-clusters) PCMT
gene_kind	biosynthetic-additional
product	protein methyltransferase
protein_id	AEK75490.1
sec_met_domain	PCMT (E-value: 4.2e-28, bitscore: 89.4, seeds: 9, tool: rule-based-clusters
transl_table	11
translation	MTELDRAFDAVPAPIYTHHERHGETVHRSAPESIRRELAA
region	1..1083

We can bind the information of each CDS into a data.frame using the gbk_features_to_df function. In the below code we define the feature and the keys we are interested in and bind it into a data.frame. We set process_region to TRUE which will extract the strand, start and end from the region key.

cds_df <- gbk_features_to_df(
  gbk, 
  feature = "CDS",
  keys = c("region", "gene", "protein_id", "gene_kind", "product"), 
  process_region = TRUE
  )

gene	protein_id	gene_kind	product	cluster	strand	start	end
orfP	AEK75490.1	biosynthetic-additional	protein methyltransferase	BGC0000001	forward	1	1083
NA	AEK75491.1	NA	putative hydrolase	BGC0000001	forward	1208	1834
abyR	AEK75492.1	regulatory	pathway-specific SARP activator	BGC0000001	forward	1887	2633
abyX	AEK75493.1	biosynthetic-additional	cytochrome P450	BGC0000001	complement	2646	3836
abyH	AEK75494.1	regulatory	LuxR family transcriptional regulator	BGC0000001	complement	3927	6596
abyI	AEK75495.1	regulatory	pathway-specific SARP activator	BGC0000001	forward	6952	7710

Cluster Visualization

We can see that there are some NA values in the gene_kind column so we will replace those with “other”. After we have done that we are ready to make the cluster visualization.

# replace NA with "other" for gene_kind
cds_df$gene_kind[is.na(cds_df$gene_kind)] <- "other"

# Make the cluster chart
chart <- GC_chart(
  cds_df,
  start = "start",
  end = "end",
  group = "gene_kind",
  strand = "strand",
  height = "130px"
)
chart

To make the chart more similar to the one we find on the MIBiG website we can add a title, alter the coloring and appearance of the genes and add a scale. In addition, we alter the tooltip to display the gene, product and region on hoover.

colors <- list(
  "biosynthetic" = "#810E15", 
  "biosynthetic-additional" = "#F16D75", 
  "transport" = "#6495ED",
  "regulatory" = "#2E8B57", 
  "other" = "#808080"
)

chart %>%
  GC_genes(
    marker = "boxarrow",
    marker_size = "small"
  ) %>%
  GC_title(
    "BGC0000001: abyssomicin C biosynthetic gene cluster
    from <i>Verrucosispora maris</i> AB-18-032",
    align = "left",
    titleFont = list(
      fontSize = "12px"
    )
    ) %>%
  GC_scale() %>%
  GC_color(customColors = colors) %>%
  GC_legend(order = names(colors)) %>%
  GC_tooltip(
    formatter = "
    <b>Gene:</b> {gene}<br>
    <b>Product:</b> {product}<br>
    <b>Region:</b>{start} - {end}
    "
  )

Loading Multiple GenBank files

The above functions to read and transform .gbk files can also be used to load multiple .gbk files from a folder and shown the clusters side by side. To demonstrate this we can go to the cluster page for BGC0000001 and navigate to the “KnownClusterBlast” page. Click on the homologous clusters and download the .gbk files for BGC0001694, BGC0001492, BGC0001288 and BGC0000133.

We place all the.gbk files in the same directory and can load them all together as we have done previously.

library(geneviewer)
# change the path to the folder containing the .gbk files
folder_path <- "~/path/to/folder/"
gbk <- read_gbk(folder_path)
View(gbk) # Inspect the list in Rstudio

We now have a list of our .gbk files containing the different sections and features. As before we can use the gbk_features_to_df function to make a data.frame of the features. This time besides the CDS we also make a data.frame of the source such that we can add the Organism name and genomic region as title for each cluster.

cds_clusters_df <- gbk_features_to_df(
  gbk_list, 
  feature = "CDS",
  keys = c("region", "gene", "protein_id", "gene_kind", "product"), 
  process_region = TRUE
  )
source_clusters_df <- gbk_features_to_df(gbk_list, feature = "source")
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0001004
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0001204
#> Warning in gbk_features_to_df(gbk_list, feature = "source"): Feature source not
#> found in cluster BGC0002124

gene	protein_id	gene_kind	product	cluster	strand	start	end
orfP	AEK75490.1	biosynthetic-additional	protein methyltransferase	BGC0000001	forward	1	1083
NA	AEK75491.1	other	putative hydrolase	BGC0000001	forward	1208	1834
abyR	AEK75492.1	regulatory	pathway-specific SARP activator	BGC0000001	forward	1887	2633
abyX	AEK75493.1	biosynthetic-additional	cytochrome P450	BGC0000001	complement	2646	3836
abyH	AEK75494.1	regulatory	LuxR family transcriptional regulator	BGC0000001	complement	3927	6596
abyI	AEK75495.1	regulatory	pathway-specific SARP activator	BGC0000001	forward	6952	7710

We can now plot the graph but this time we specify the “cluster” column as identifier to distinguish the clusters and add cluster labels and titles.

# replace NA with "other" for gene_kind
cds_clusters_df$gene_kind[is.na(cds_clusters_df$gene_kind)] <- "other"

colors <- list(
  "biosynthetic" = "#810E15", 
  "biosynthetic-additional" = "#F16D75", 
  "transport" = "#6495ED",
  "regulatory" = "#2E8B57", 
  "other" = "#808080"
)

# Make the cluster chart
GC_chart(
  cds_clusters_df,
  start = "start",
  end = "end",
  group = "gene_kind",
  strand = "strand",
  cluster = "cluster",
  height = "800px"
) %>%
    GC_clusterTitle(
    y = 5,  
    title = paste0("<i>", source_clusters_df$organism, "</i>: ", source_clusters_df$region),
    align = "left",
    titleFont = list(fontSize = 12),
    ) %>%
  GC_clusterLabel(names(gbk_list)) %>%
  GC_genes(
    marker = "boxarrow",
    marker_size = "small"
  ) %>%
  GC_scaleBar(title = "1kb", scaleBarUnit = 1000) %>%
  GC_color(customColors = colors) %>%
  GC_legend(order = names(colors)) %>%
  GC_tooltip(
    formatter = "
    <b>Gene:</b> {gene}<br>
    <b>ProteinID:</b> {protein_id}<br>
    <b>Product:</b> {product}<br>
    <b>Region:</b>{start} - {end}
    "
  )