Skip to contents

Intro

This tutorial demonstrates creating gene cluster visualizations from FASTA files using geneviewer. For guidance on importing clusters from GenBank files see: link. geneviewer parses FASTA headers to extract key-value pairs, denoted by =. For instance, in the below FASTA header it identifies ‘protein’ as ‘putative phosphoenolpyruvate synthase’ and similarly processes other pairs. The ‘location’ tag is crucial because it allows the extraction of ‘start’, ‘end’, and ‘strand’ information for each gene.

>lcl|HQ3864.1_prot_ADX475.1_1 protein=putative phosphoenolpyruvate synthase protein_id=ADX66475.1 location=complement(<1..2247) gbkey=CDS
MRATGLVRGRAAKRFGRAGGAGDAIGQCRATGHDCLGGSAMIEQYVWDLHEVDETQVAVVGGKGAHLGGL
SRIEGIRVPAGFCVTTDAFRRIMAEAPSIDDGLDQLSRLNPDDREAIRTLSAQIRRTIEGIAIPGDLAAA
ITRALARLGEHAACAVRSSATAEDLPTASFAGQQDTYLNVVGPTAILQHVSRCWASLFTERAVTYRQRNG

Materials

The FASTA files uses for the tutorial were originally downloaded from the NCBI website but can be downloaded directly from the geneviewer-tutorials repository.

Loading FASTA files

The example FASTA files used for this tutorial are each from a different organism and holds the genes of a cluster that produces a polyene antifungal agent. Download the files and place them under the same folder. We can read the files into R using the read_fasta function from geneviewer. The function uses the Biostrings library so we will need to load this as well. The function will parse all the FASTA headers and create a data.frame from the extracted information. We have set sequence to FALSE since we do not need the sequences to create the cluster visualization.

library(geneviewer)
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# 
# BiocManager::install("Biostrings")
library(Biostrings)

# change the path to the folder where you have saved the file
file_path <- "~/path/to/folder/"
fasta_df <- read_fasta(file_path, sequence = FALSE)
View(fasta_df) # Inspect the data.frame in Rstudio

The data.frame holds the protein IDs, start, end and strand information for each gene. In addition there is the cluster column that tells us from which file each gene is coming.

protein protein_id location gbkey gene start end strand cluster locus_tag
putative ADX66475.1 complement(<1..2247) CDS NA 1 2247 complement BGC0000108 NA
putative ADX66476.1 complement(3301..3870) CDS sigD 3301 3870 complement BGC0000108 NA
conserved ADX66477.1 complement(4356..5282) CDS calU-c 4356 5282 complement BGC0000108 NA
ScnRII ADX66474.1 complement(5576..6154) CDS scnRII 5576 6154 complement BGC0000108 NA
ScnRI ADX66458.1 6555..10085 CDS scnRI 6555 10085 forward BGC0000108 NA
putative ADD85130.2 10250..11626 CDS scnK 10250 11626 forward BGC0000108 NA

Cluster Visualization

Using the data.frame we created from the FASTA files we can now easily make the cluster visualization.

# Make the cluster chart
chart <- GC_chart(
  fasta_df,
  start = "start",
  end = "end",
  strand = "strand",
  cluster = "cluster",
  height = "400px"
) %>% 
  GC_clusterLabel(unique(fasta_df$cluster))
chart

Uniprot Keywords

We can get additional information for each gene using the Uniprot.ws package. We make the connection to the database using the UniProt.ws function. Using keytypes(up) we can see all the keys we can use to query the database and with columns(up) we can see all the different fields we can get in return. Our gene IDs are EMBL-GenBank-DDBJ_CDS identifiers. For this tutorial we want to return the keywords associated with each gene such that we can use it to group the genes based on their function.

library(UniProt.ws)

up <- UniProt.ws::UniProt.ws()

keytypes(up)
columns(up)

keywords <- UniProt.ws::select(
  up, 
  column = c("keyword"), 
  keys = fasta_df$protein_id, 
  keytype = "EMBL-GenBank-DDBJ_CDS"
  )

This will return a data.frame with our original identifiers, uniprotIDs and the keywords associated to each gene.

From Entry Keywords
ADX66475.1 F1CLA2 Pyruvate
ADX66476.1 F1CLA3 Transcription;Transcription regulation
ADX66477.1 F1CLA4 NA
ADX66474.1 B9V2R3 DNA-binding
ADX66458.1 F1CLA6 DNA-binding;Two-component regulatory system
ADD85130.2 D4P8Q9 Glycosyltransferase;Transferase

We now do a bit of data wrangling with the help of the dplyr package performing the following steps:

  • We split the keywords for each gene on “;”
  • Count the occurrence of each keyword
  • For each gene keep only the highest occurring keyword
  • Replace any keyword that has less then 10 occurrences to “Other”
  • Bind the keywords to our fasta_df by protein ID
  • Replace any genes without keyword to Other
  • Select only the columns we need for our visualization
library(dplyr)

keywords_count <- keywords %>%
  # Separate the Keywords into different rows
  tidyr::separate_rows(Keywords, sep = ";") %>%
  # Replace NA values with 'Other'
  mutate(Keywords = replace(Keywords, is.na(Keywords), "Other")) %>%
  # Add a Count column that counts each Keyword
  add_count(Keywords)

# Keep top ranking keywords for each protein ID
keywords_count <- keywords_count %>%
  group_by(From) %>%
  filter(rank(-n, ties.method = "first") == 1) %>%
  ungroup()

# Replace any keyword that has a count less then 10 with "Other"
keywords_count <- keywords_count %>%
  mutate(Keywords = ifelse(n < 10, "Other", Keywords))

# Bind Keywords back to fasta_df
fasta_df_with_keywords <- 
  left_join(fasta_df, keywords_count, by = c("protein_id" = "From")) %>% 
  mutate(Keywords = replace(Keywords, is.na(Keywords), "Other")) %>%
  select(cluster, start, end, strand, protein_id, protein, Keywords, n)
cluster start end strand protein_id protein Keywords n
BGC0000108 1 2247 complement ADX66475.1 putative Other 1
BGC0000108 3301 3870 complement ADX66476.1 putative Other 4
BGC0000108 4356 5282 complement ADX66477.1 conserved Other 9
BGC0000108 5576 6154 complement ADX66474.1 ScnRII DNA-binding 23
BGC0000108 6555 10085 forward ADX66458.1 ScnRI DNA-binding 23
BGC0000108 10250 11626 forward ADD85130.2 putative Transferase 44

We can now make our visual again but use the keywords to group the genes. We alter the order of the legend such that it is in alphabetical order and the Other group is placed last and finally we customize the tooltip to also show the protein Id on hoover.


key_legend <- c(setdiff(sort(unique(fasta_df_with_keywords$Keywords)), "Other"), "Other")

# Make the cluster chart
chart <- GC_chart(
  fasta_df_with_keywords,
  start = "start",
  end = "end",
  strand = "strand",
  group = "Keywords",
  cluster = "cluster",
  height = "400px"
) %>% 
  GC_clusterLabel(unique(fasta_df$cluster)) %>%
  GC_legend(order = key_legend) %>%
  GC_tooltip(
    formatter = "
    <b>ProteinID:</b> {protein_id}<br>
    <b>Region:</b> {start} - {end}
    "
  )
chart

For further customization options like setting a title, altering the colors adding gene links and much more see Get Started and Examples.