Intro
This tutorial guides you through the process of performing genome
alignments using MUMmer, an alignment
tool widely used for comparing genomic sequences. As an example we will
align the plasmids from serveral species of B. thuringiensis.
Following the alignment, we’ll show you how to visualize the results
using geneviewer
.
Materials
The .gbk files can be downloaded directly from GenBank using accession numbers CP000486.1, CP009599.1, CP003188.1 and CP009638.1. Alternatively, you can obtain the GenBank files including all alignment files directly from the geneviewer-tutorials repository.
MUMmer alignment
MUMmer needs to be installed on your local machine to execute
alignments. For detailed installation instructions, visit the MUMmer
website. Once MUMmer is installed, you can perform the alignment
using geneviewer
with the mummer_alignment()
function. By default it will run the alignment in order of the files in
the folder.
Important MUMmer requires file paths without spaces.
library(geneviewer)
# Change the path to the folder where the .gbk files are saved
folder_path <- "~/path/to/folder/"
By default, the mummer_alignment()
function will perform
pair wise alignments starting with the first file in the folder. To
adjust the order in which files are aligned, assign a vector of file
names to the cluster
variable. By default nucleotide
sequences are translated into proteins using “promer” and a
“many-to-many” mapping alignment is performed. For the most suitable
alignment method for your project, please carefully review the MUMmer documentation. To
learn how to modify the default settings, consult the documentation by
running ?mummer_alignment()
.
alignment <-
mummer_alignment(
path = folder_path,
cluster = NULL,
maptype = "many-to-many",
seqtype = "protein"
)
The alignment generates a data frame that details the start and end
positions of each aligned region between the reference (cluster1) and
query (cluster2), along with identity and similarity scores. This data
frame can be directly loaded into the GC_links()
function.
start1 | end1 | start2 | end2 | length1 | length2 | identity | similarity | stop | frame1 | frame2 | tag1 | tag2 | cluster1 | cluster2 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1_CP000486.1.1 | 3 | 51542 | 51540 | 1 | 51540 | 51540 | 100.00 | 100.00 | 5.53 | 3 | -2 | CP000486.1 | CP009599.1 | 1_CP000486.1 | 2_CP009599.1 |
1_CP000486.1.2 | 55939 | 51545 | 51543 | 55937 | 4395 | 4395 | 100.00 | 100.00 | 6.01 | -1 | 3 | CP000486.1 | CP009599.1 | 1_CP000486.1 | 2_CP009599.1 |
2_CP009599.1.1 | 1013 | 3202 | 47346 | 45157 | 2190 | 2190 | 92.33 | 94.38 | 6.37 | 2 | -3 | CP009599.1 | CP003188.1 | 2_CP009599.1 | 3_CP003188.1 |
2_CP009599.1.2 | 3354 | 4118 | 44995 | 44267 | 765 | 729 | 57.98 | 74.32 | 2.33 | 3 | -2 | CP009599.1 | CP003188.1 | 2_CP009599.1 | 3_CP003188.1 |
2_CP009599.1.3 | 4113 | 9041 | 44270 | 39342 | 4929 | 4929 | 94.83 | 96.41 | 4.66 | 3 | -1 | CP009599.1 | CP003188.1 | 2_CP009599.1 | 3_CP003188.1 |
2_CP009599.1.4 | 9772 | 10803 | 38441 | 37422 | 1032 | 1020 | 61.78 | 72.70 | 0.29 | 1 | -1 | CP009599.1 | CP003188.1 | 2_CP009599.1 | 3_CP003188.1 |
Synteny gene cluster chart
To map the alignment onto the genome, we must first load the cluster information from the .gbk files. For detailed instructions, please refer to the tutorial Load Gene Cluster data from GenBank Files. We add an extra column called “CDS” to the data which will be used as key to color the genes.
folder_path <- "~/path/to/folder/"
gbk <- read_gbk(folder_path)
cds_df <- gbk_features_to_df(
gbk,
feature = "CDS",
keys = c("region", "gene", "protein_id", "note", "product")
)
cds_df$type <- "CDS"
region | gene | protein_id | product | note | cluster | strand | start | end | type |
---|---|---|---|---|---|---|---|---|---|
complement(665..1603) | terS | ABK88159.1 | phage terminase small subunit | NA | 1_CP000486.1 | complement | 665 | 1603 | CDS |
complement(2151..2975) | uvrC | ABK88160.1 | possible excinuclease ABC subunit | NA | 1_CP000486.1 | complement | 2151 | 2975 | CDS |
complement(2988..3599) | res | ABK88161.1 | resolvase | NA | 1_CP000486.1 | complement | 2988 | 3599 | CDS |
complement(4075..4305) | NA | ABK88162.1 | conserved hypothetical protein | possible phage-related protein, N-terminal region | 1_CP000486.1 | complement | 4075 | 4305 | CDS |
complement(4394..4720) | NA | ABK88163.1 | hypothetical protein | NA | 1_CP000486.1 | complement | 4394 | 4720 | CDS |
complement(5796..6185) | NA | ABK88164.1 | phage-related protein | NA | 1_CP000486.1 | complement | 5796 | 6185 | CDS |
We can load the cluster data into the GC_chart()
and the
alignment data into the GC_links()
function. In addition,
we change the axis_type to “range”, make the genes a little smaller and
add cluster labels. For more customization options see Get
started.
Important: The gene cluster data like the alignment data is loaded in the order of the files in the folder. If a custom order was used for the MUMer alignment make sure to sort the “cluster” column in the cluster data in the same way.
GC_chart(
data = cds_df,
group = "cluster",
strand = "strand",
cluster = "cluster"
) %>%
GC_links(
data = alignment,
measure = "identity", # similarity/none
) %>%
GC_scale(axis_type = "range") %>%
GC_cluster(separate_strands = TRUE) %>%
GC_genes(
group = "type",
marker_size = "small"
) %>%
GC_clusterLabel(
title = c("pALH1", "pBFQ", "F837_55", "pBFI_4"),
width = 50
) %>%
GC_tooltip(
formatter = "<b>{protein_id}</b><br>{start} - {end}"
) %>%
GC_legend(FALSE)