Loading Gene Clusters From FASTA Files • geneviewer

Intro

This tutorial demonstrates creating gene cluster visualizations from FASTA files using geneviewer. For guidance on importing clusters from GenBank files see: link. geneviewer parses FASTA headers to extract key-value pairs, denoted by =. For instance, in the below FASTA header it identifies ‘protein’ as ‘putative phosphoenolpyruvate synthase’ and similarly processes other pairs. The ‘location’ tag is crucial because it allows the extraction of ‘start’, ‘end’, and ‘strand’ information for each gene.

>lcl|HQ3864.1_prot_ADX475.1_1 protein=putative phosphoenolpyruvate synthase protein_id=ADX66475.1 location=complement(<1..2247) gbkey=CDS
MRATGLVRGRAAKRFGRAGGAGDAIGQCRATGHDCLGGSAMIEQYVWDLHEVDETQVAVVGGKGAHLGGL
SRIEGIRVPAGFCVTTDAFRRIMAEAPSIDDGLDQLSRLNPDDREAIRTLSAQIRRTIEGIAIPGDLAAA
ITRALARLGEHAACAVRSSATAEDLPTASFAGQQDTYLNVVGPTAILQHVSRCWASLFTERAVTYRQRNG

Materials

The FASTA files uses for the tutorial were originally downloaded from the NCBI website but can be downloaded directly from the geneviewer-tutorials repository.

Loading FASTA files

The example FASTA files used for this tutorial are each from a different organism and holds the genes of a cluster that produces a polyene antifungal agent. Download the files and place them under the same folder. We can read the files into R using the read_fasta function from geneviewer. The function uses the Biostrings library so we will need to load this as well. The function will parse all the FASTA headers and create a data.frame from the extracted information. We have set sequence to FALSE since we do not need the sequences to create the cluster visualization.

library(geneviewer)
# if (!require("BiocManager", quietly = TRUE))
#     install.packages("BiocManager")
# 
# BiocManager::install("Biostrings")
library(Biostrings)

# change the path to the folder where you have saved the file
file_path <- "~/path/to/folder/"
fasta_df <- read_fasta(file_path, sequence = FALSE)
View(fasta_df) # Inspect the data.frame in Rstudio

The data.frame holds the protein IDs, start, end and strand information for each gene. In addition there is the cluster column that tells us from which file each gene is coming.

protein	protein_id	location	gbkey	gene	start	end	strand	cluster	locus_tag
putative	ADX66475.1	complement(<1..2247)	CDS	NA	1	2247	complement	BGC0000108	NA
putative	ADX66476.1	complement(3301..3870)	CDS	sigD	3301	3870	complement	BGC0000108	NA
conserved	ADX66477.1	complement(4356..5282)	CDS	calU-c	4356	5282	complement	BGC0000108	NA
ScnRII	ADX66474.1	complement(5576..6154)	CDS	scnRII	5576	6154	complement	BGC0000108	NA
ScnRI	ADX66458.1	6555..10085	CDS	scnRI	6555	10085	forward	BGC0000108	NA
putative	ADD85130.2	10250..11626	CDS	scnK	10250	11626	forward	BGC0000108	NA

Cluster Visualization

Using the data.frame we created from the FASTA files we can now easily make the cluster visualization.

# Make the cluster chart
chart <- GC_chart(
  fasta_df,
  start = "start",
  end = "end",
  strand = "strand",
  cluster = "cluster",
  height = "400px"
) %>% 
  GC_clusterLabel(unique(fasta_df$cluster))
chart

Uniprot Keywords

We can get additional information for each gene using the Uniprot.ws package. We make the connection to the database using the UniProt.ws function. Using keytypes(up) we can see all the keys we can use to query the database and with columns(up) we can see all the different fields we can get in return. Our gene IDs are EMBL-GenBank-DDBJ_CDS identifiers. For this tutorial we want to return the keywords associated with each gene such that we can use it to group the genes based on their function.

library(UniProt.ws)

up <- UniProt.ws::UniProt.ws()

keytypes(up)
columns(up)

keywords <- UniProt.ws::select(
  up, 
  column = c("keyword"), 
  keys = fasta_df$protein_id, 
  keytype = "EMBL-GenBank-DDBJ_CDS"
  )

This will return a data.frame with our original identifiers, uniprotIDs and the keywords associated to each gene.

From	Entry	Keywords
ADX66475.1	F1CLA2	ATP-binding;Nucleotide-binding;Pyruvate
ADX66476.1	F1CLA3	Sigma factor;Transcription;Transcription regulation
ADX66477.1	F1CLA4	NA
ADX66474.1	B9V2R3	NA
ADX66458.1	F1CLA6	DNA-binding;Transcription;Transcription regulation;Two-component regulatory system
ADD85130.2	D4P8Q9	Glycosyltransferase;Transferase

We now do a bit of data wrangling with the help of the dplyr package performing the following steps:

We split the keywords for each gene on “;”
Count the occurrence of each keyword
For each gene keep only the highest occurring keyword
Replace any keyword that has less then 10 occurrences to “Other”
Bind the keywords to our fasta_df by protein ID
Replace any genes without keyword to Other
Select only the columns we need for our visualization

library(dplyr)

keywords_count <- keywords %>%
  # Separate the Keywords into different rows
  tidyr::separate_rows(Keywords, sep = ";") %>%
  # Replace NA values with 'Other'
  mutate(Keywords = replace(Keywords, is.na(Keywords), "Other")) %>%
  # Add a Count column that counts each Keyword
  add_count(Keywords)

# Keep top ranking keywords for each protein ID
keywords_count <- keywords_count %>%
  group_by(From) %>%
  filter(rank(-n, ties.method = "first") == 1) %>%
  ungroup()

# Replace any keyword that has a count less then 10 with "Other"
keywords_count <- keywords_count %>%
  mutate(Keywords = ifelse(n < 10, "Other", Keywords))

# Bind Keywords back to fasta_df
fasta_df_with_keywords <- 
  left_join(fasta_df, keywords_count, by = c("protein_id" = "From")) %>% 
  mutate(Keywords = replace(Keywords, is.na(Keywords), "Other")) %>%
  select(cluster, start, end, strand, protein_id, protein, Keywords, n)

cluster	start	end	strand	protein_id	protein	Keywords	n
BGC0000108	1	2247	complement	ADX66475.1	putative	Nucleotide-binding	22
BGC0000108	3301	3870	complement	ADX66476.1	putative	Other	8
BGC0000108	4356	5282	complement	ADX66477.1	conserved	Other	11
BGC0000108	5576	6154	complement	ADX66474.1	ScnRII	Other	11
BGC0000108	6555	10085	forward	ADX66458.1	ScnRI	Other	8
BGC0000108	10250	11626	forward	ADD85130.2	putative	Transferase	47

We can now make our visual again but use the keywords to group the genes. We alter the order of the legend such that it is in alphabetical order and the Other group is placed last and finally we customize the tooltip to also show the protein Id on hoover.


key_legend <- c(setdiff(sort(unique(fasta_df_with_keywords$Keywords)), "Other"), "Other")

# Make the cluster chart
chart <- GC_chart(
  fasta_df_with_keywords,
  start = "start",
  end = "end",
  strand = "strand",
  group = "Keywords",
  cluster = "cluster",
  height = "400px"
) %>% 
  GC_clusterLabel(unique(fasta_df$cluster)) %>%
  GC_legend(order = key_legend) %>%
  GC_tooltip(
    formatter = "
    <b>ProteinID:</b> {protein_id}<br>
    <b>Region:</b> {start} - {end}
    "
  )
chart

For further customization options like setting a title, altering the colors adding gene links and much more see Get Started and Examples.