Intro
This tutorial demonstrates creating gene cluster visualizations from
FASTA files using geneviewer
. For guidance on importing
clusters from GenBank files see: link.
geneviewer
parses FASTA headers to extract key-value pairs,
denoted by =. For instance, in the below FASTA header it identifies
‘protein’ as ‘putative phosphoenolpyruvate synthase’ and similarly
processes other pairs. The ‘location’ tag is crucial because it allows
the extraction of ‘start’, ‘end’, and ‘strand’ information for each
gene.
>lcl|HQ3864.1_prot_ADX475.1_1 protein=putative phosphoenolpyruvate synthase protein_id=ADX66475.1 location=complement(<1..2247) gbkey=CDS
MRATGLVRGRAAKRFGRAGGAGDAIGQCRATGHDCLGGSAMIEQYVWDLHEVDETQVAVVGGKGAHLGGL
SRIEGIRVPAGFCVTTDAFRRIMAEAPSIDDGLDQLSRLNPDDREAIRTLSAQIRRTIEGIAIPGDLAAA
ITRALARLGEHAACAVRSSATAEDLPTASFAGQQDTYLNVVGPTAILQHVSRCWASLFTERAVTYRQRNG
Materials
The FASTA files uses for the tutorial were originally downloaded from the NCBI website but can be downloaded directly from the geneviewer-tutorials repository.
Loading FASTA files
The example FASTA files used for this tutorial are each from a
different organism and holds the genes of a cluster that produces a
polyene antifungal agent. Download the files and place them under the
same folder. We can read the files into R using the
read_fasta
function from geneviewer
. The
function uses the Biostrings
library so we will need to
load this as well. The function will parse all the FASTA headers and
create a data.frame from the extracted information. We have set sequence
to FALSE
since we do not need the sequences to create the
cluster visualization.
library(geneviewer)
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
#
# BiocManager::install("Biostrings")
library(Biostrings)
# change the path to the folder where you have saved the file
file_path <- "~/path/to/folder/"
fasta_df <- read_fasta(file_path, sequence = FALSE)
View(fasta_df) # Inspect the data.frame in Rstudio
The data.frame holds the protein IDs, start, end and strand information for each gene. In addition there is the cluster column that tells us from which file each gene is coming.
protein | protein_id | location | gbkey | gene | start | end | strand | cluster | locus_tag |
---|---|---|---|---|---|---|---|---|---|
putative | ADX66475.1 | complement(<1..2247) | CDS | NA | 1 | 2247 | complement | BGC0000108 | NA |
putative | ADX66476.1 | complement(3301..3870) | CDS | sigD | 3301 | 3870 | complement | BGC0000108 | NA |
conserved | ADX66477.1 | complement(4356..5282) | CDS | calU-c | 4356 | 5282 | complement | BGC0000108 | NA |
ScnRII | ADX66474.1 | complement(5576..6154) | CDS | scnRII | 5576 | 6154 | complement | BGC0000108 | NA |
ScnRI | ADX66458.1 | 6555..10085 | CDS | scnRI | 6555 | 10085 | forward | BGC0000108 | NA |
putative | ADD85130.2 | 10250..11626 | CDS | scnK | 10250 | 11626 | forward | BGC0000108 | NA |
Cluster Visualization
Using the data.frame we created from the FASTA files we can now easily make the cluster visualization.
# Make the cluster chart
chart <- GC_chart(
fasta_df,
start = "start",
end = "end",
strand = "strand",
cluster = "cluster",
height = "400px"
) %>%
GC_clusterLabel(unique(fasta_df$cluster))
chart
Uniprot Keywords
We can get additional information for each gene using the
Uniprot.ws
package. We make the connection to the database
using the UniProt.ws
function. Using
keytypes(up)
we can see all the keys we can use to query
the database and with columns(up)
we can see all the
different fields we can get in return. Our gene IDs are
EMBL-GenBank-DDBJ_CDS identifiers. For this tutorial we want to return
the keywords associated with each gene such that we can use it to group
the genes based on their function.
library(UniProt.ws)
up <- UniProt.ws::UniProt.ws()
keytypes(up)
columns(up)
keywords <- UniProt.ws::select(
up,
column = c("keyword"),
keys = fasta_df$protein_id,
keytype = "EMBL-GenBank-DDBJ_CDS"
)
This will return a data.frame with our original identifiers, uniprotIDs and the keywords associated to each gene.
From | Entry | Keywords |
---|---|---|
ADX66475.1 | F1CLA2 | Pyruvate |
ADX66476.1 | F1CLA3 | Sigma factor;Transcription;Transcription regulation |
ADX66477.1 | F1CLA4 | NA |
ADX66474.1 | B9V2R3 | NA |
ADX66458.1 | F1CLA6 | DNA-binding;Transcription;Transcription regulation;Two-component regulatory system |
ADD85130.2 | D4P8Q9 | Glycosyltransferase;Transferase |
We now do a bit of data wrangling with the help of the
dplyr
package performing the following steps:
- We split the keywords for each gene on “;”
- Count the occurrence of each keyword
- For each gene keep only the highest occurring keyword
- Replace any keyword that has less then 10 occurrences to “Other”
- Bind the keywords to our fasta_df by protein ID
- Replace any genes without keyword to Other
- Select only the columns we need for our visualization
library(dplyr)
keywords_count <- keywords %>%
# Separate the Keywords into different rows
tidyr::separate_rows(Keywords, sep = ";") %>%
# Replace NA values with 'Other'
mutate(Keywords = replace(Keywords, is.na(Keywords), "Other")) %>%
# Add a Count column that counts each Keyword
add_count(Keywords)
# Keep top ranking keywords for each protein ID
keywords_count <- keywords_count %>%
group_by(From) %>%
filter(rank(-n, ties.method = "first") == 1) %>%
ungroup()
# Replace any keyword that has a count less then 10 with "Other"
keywords_count <- keywords_count %>%
mutate(Keywords = ifelse(n < 10, "Other", Keywords))
# Bind Keywords back to fasta_df
fasta_df_with_keywords <-
left_join(fasta_df, keywords_count, by = c("protein_id" = "From")) %>%
mutate(Keywords = replace(Keywords, is.na(Keywords), "Other")) %>%
select(cluster, start, end, strand, protein_id, protein, Keywords, n)
cluster | start | end | strand | protein_id | protein | Keywords | n |
---|---|---|---|---|---|---|---|
BGC0000108 | 1 | 2247 | complement | ADX66475.1 | putative | Other | 1 |
BGC0000108 | 3301 | 3870 | complement | ADX66476.1 | putative | Other | 7 |
BGC0000108 | 4356 | 5282 | complement | ADX66477.1 | conserved | Other | 16 |
BGC0000108 | 5576 | 6154 | complement | ADX66474.1 | ScnRII | Other | 16 |
BGC0000108 | 6555 | 10085 | forward | ADX66458.1 | ScnRI | Other | 7 |
BGC0000108 | 10250 | 11626 | forward | ADD85130.2 | putative | Transferase | 49 |
We can now make our visual again but use the keywords to group the genes. We alter the order of the legend such that it is in alphabetical order and the Other group is placed last and finally we customize the tooltip to also show the protein Id on hoover.
key_legend <- c(setdiff(sort(unique(fasta_df_with_keywords$Keywords)), "Other"), "Other")
# Make the cluster chart
chart <- GC_chart(
fasta_df_with_keywords,
start = "start",
end = "end",
strand = "strand",
group = "Keywords",
cluster = "cluster",
height = "400px"
) %>%
GC_clusterLabel(unique(fasta_df$cluster)) %>%
GC_legend(order = key_legend) %>%
GC_tooltip(
formatter = "
<b>ProteinID:</b> {protein_id}<br>
<b>Region:</b> {start} - {end}
"
)
chart
For further customization options like setting a title, altering the colors adding gene links and much more see Get Started and Examples.