Loading Gene Clusters From Generic Feature Format (GFF) Files
Source: vignettes/LoadGFF.Rmd
Intro
This tutorial demonstrates how to load several viral genomes from
Generic Feature Format (GFF) files and visualize them as gene clusters
using geneviewer. You can obtain .gff files from a variety of platforms
such as NCBI, Ensembl or UCSC.
Materials
The .gff files were retrieved from the NCBI website by searching with their respective GenBank identifiers and downloading the records in GFF3 format. Alternatively, the files can be downloaded directly from the geneviewer-tutorials repository.
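For orientation, a GFF3 record is a single tab-separated line with nine columns, the last of which holds semicolon-separated key=value attributes. The base R sketch below splits one such record into its parts. The record is invented for illustration and is not taken from the tutorial files, and the code only illustrates the file format; it is not a description of how read_gff works internally.

```r
# A made-up GFF3 record (nine tab-separated columns); not from the tutorial files
record <- paste(
  c("seq1", "Genbank", "CDS", "100", "400", ".", "-", "0",
    "ID=cds-1;Name=protein_1"),
  collapse = "\t"
)

# Column names as defined by the GFF3 format
gff3_columns <- c("seqid", "source", "type", "start", "end",
                  "score", "strand", "phase", "attributes")

# Split the line on tabs and name the pieces
values <- strsplit(record, "\t", fixed = TRUE)[[1]]
names(values) <- gff3_columns

# Split the attributes column into key=value pairs
pairs <- strsplit(
  strsplit(values[["attributes"]], ";", fixed = TRUE)[[1]],
  "=",
  fixed = TRUE
)
attrs <- setNames(
  vapply(pairs, `[`, character(1), 2), # values
  vapply(pairs, `[`, character(1), 1)  # keys
)

print(values[c("type", "start", "end", "strand")])
print(attrs)
```

Fields such as type, start, end and strand correspond to fixed columns, while a field such as Name is stored in the attributes column.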
Loading GFF files
We can load the .gff files into R using the read_gff function from the geneviewer package. To do this, we can either load each file individually by specifying its file path, or load all files at once by specifying the directory that contains all the .gff files.
In the example below, we load all .gff files from a specified
directory, using the fields parameter to select specific fields for
loading. If no fields are specified, all fields are loaded by default.
Using the dplyr package, we filter the data to keep only entries whose
type column is CDS. In addition, we add an extra column, Virus, which
maps each GenBank ID in the filename column to its corresponding viral
name.
library(geneviewer)
library(dplyr)

# Change the path to where you have saved the
# file or the directory containing all .gff files
folder_path <- "~/path/to/folder/"

gff <- read_gff(
  folder_path,
  fields = c("source", "type", "start", "end", "strand", "Name")
) %>%
  dplyr::filter(type == "CDS")

# Add viral names
virus_names <- c(
  GU071086 = "Marseillevirus",
  HQ113105.1 = "Lausannevirus",
  KF261120 = "Cannesvirus",
  KF483846 = "Tunisvirus"
)
# Store the viral name in a new Virus column so the gene Name is not overwritten
gff <- gff %>% mutate(Virus = virus_names[filename])

View(gff) # Inspect the data frame in RStudio
source | type | start | end | strand | Name | filename | Virus |
---|---|---|---|---|---|---|---|
Genbank | CDS | 293 | 1039 | - | ADB03794.1 | GU071086 | Marseillevirus |
Genbank | CDS | 1146 | 1802 | - | ADB03795.1 | GU071086 | Marseillevirus |
Genbank | CDS | 1950 | 2714 | - | ADB03796.1 | GU071086 | Marseillevirus |
Genbank | CDS | 2737 | 3210 | - | ADB03797.1 | GU071086 | Marseillevirus |
Genbank | CDS | 3242 | 3520 | - | ADB03798.1 | GU071086 | Marseillevirus |
Genbank | CDS | 3566 | 4048 | - | ADB03799.1 | GU071086 | Marseillevirus |
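The Virus column relies on indexing a named character vector by name: each value in filename is looked up among the names of virus_names, and identifiers without a match yield NA. A minimal base R sketch of this lookup follows. The toy data frame and its rows are invented stand-ins for the output of read_gff, and the sketch assumes filename is a character column rather than a factor.

```r
virus_names <- c(
  GU071086 = "Marseillevirus",
  KF261120 = "Cannesvirus"
)

# Toy data frame standing in for the loaded GFF data; the rows are invented
toy <- data.frame(
  Name = c("gene_a", "gene_b", "gene_c"),
  filename = c("GU071086", "KF261120", "UNKNOWN"),
  stringsAsFactors = FALSE
)

# Look up each filename by name; unname() drops the vector names from the result
toy$Virus <- unname(virus_names[toy$filename])

# Rows whose filename has no entry in virus_names end up with NA
print(toy)
print(sum(is.na(toy$Virus)))
```

Checking for NA values after the lookup is a quick way to spot filenames that are missing from the mapping.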
We can now visualize the genomic data using geneviewer.
The chart displays the start and end positions of the genes for each
viral genome, oriented according to strand. A custom title is added to
the chart, and the gene clusters are labeled with their viral names. The
axis type of the scale is set to range to facilitate size comparisons
among the genomes. Additionally, the visual representation of the genes
is simplified by removing outlines (strokes) and using smaller markers.
As a final touch, we change the tooltip to display only the gene Name.
GC_chart(gff,
  start = "start",
  end = "end",
  strand = "strand",
  cluster = "Virus",
  height = "400px"
) %>%
  GC_title(
    "<i>Marseilleviridae</i> viral genomes",
    height = "40px"
  ) %>%
  GC_clusterLabel() %>%
  GC_scale(axis_type = "range") %>%
  GC_genes(
    group = "filename",
    stroke = "none",
    marker_size = "small"
  ) %>%
  GC_tooltip(formatter = "<b>Gene:</b> {Name}")