Genome alignment using MUMmer • geneviewer

Intro

This tutorial guides you through the process of performing genome alignments using MUMmer, an alignment tool widely used for comparing genomic sequences. As an example we will align the plasmids from several species of B. thuringiensis. Following the alignment, we’ll show you how to visualize the results using geneviewer.

Materials

The .gbk files can be downloaded directly from GenBank using accession numbers CP000486.1, CP009599.1, CP003188.1 and CP009638.1. Alternatively, you can obtain the GenBank files including all alignment files directly from the geneviewer-tutorials repository.

MUMmer alignment

MUMmer needs to be installed on your local machine to execute alignments. For detailed installation instructions, visit the MUMmer website. Once MUMmer is installed, you can perform the alignment using geneviewer with the mummer_alignment() function. By default it will run the alignment in order of the files in the folder.

Important MUMmer requires file paths without spaces.

library(geneviewer)
# Change the path to the folder where the .gbk files are saved
folder_path <- "~/path/to/folder/"

By default, the mummer_alignment() function will perform pair wise alignments starting with the first file in the folder. To adjust the order in which files are aligned, assign a vector of file names to the cluster variable. By default nucleotide sequences are translated into proteins using “promer” and a “many-to-many” mapping alignment is performed. For the most suitable alignment method for your project, please carefully review the MUMmer documentation. To learn how to modify the default settings, consult the documentation by running ?mummer_alignment().

alignment <- 
  mummer_alignment(
    path = folder_path,
    cluster = NULL,
    maptype = "many-to-many",
    seqtype = "protein"
  )

The alignment generates a data frame that details the start and end positions of each aligned region between the reference (cluster1) and query (cluster2), along with identity and similarity scores. This data frame can be directly loaded into the GC_links() function.

	start1	end1	start2	end2	length1	length2	identity	similarity	stop	frame1	frame2	tag1	tag2	cluster1	cluster2
1_CP000486.1.1	3	51542	51540	1	51540	51540	100.00	100.00	5.53	3	-2	CP000486.1	CP009599.1	1_CP000486.1	2_CP009599.1
1_CP000486.1.2	55939	51545	51543	55937	4395	4395	100.00	100.00	6.01	-1	3	CP000486.1	CP009599.1	1_CP000486.1	2_CP009599.1
2_CP009599.1.1	1013	3202	47346	45157	2190	2190	92.33	94.38	6.37	2	-3	CP009599.1	CP003188.1	2_CP009599.1	3_CP003188.1
2_CP009599.1.2	3354	4118	44995	44267	765	729	57.98	74.32	2.33	3	-2	CP009599.1	CP003188.1	2_CP009599.1	3_CP003188.1
2_CP009599.1.3	4113	9041	44270	39342	4929	4929	94.83	96.41	4.66	3	-1	CP009599.1	CP003188.1	2_CP009599.1	3_CP003188.1
2_CP009599.1.4	9772	10803	38441	37422	1032	1020	61.78	72.70	0.29	1	-1	CP009599.1	CP003188.1	2_CP009599.1	3_CP003188.1

Synteny gene cluster chart

To map the alignment onto the genome, we must first load the cluster information from the .gbk files. For detailed instructions, please refer to the tutorial Load Gene Cluster data from GenBank Files. We add an extra column called “CDS” to the data which will be used as key to color the genes.

folder_path <- "~/path/to/folder/"
gbk <- read_gbk(folder_path)
cds_df <- gbk_features_to_df(
  gbk, 
  feature = "CDS",
  keys = c("region", "gene", "protein_id", "note", "product")
  )
cds_df$type <- "CDS"

region	gene	protein_id	product	note	cluster	strand	start	end	type
complement(665..1603)	terS	ABK88159.1	phage terminase small subunit	NA	1_CP000486.1	complement	665	1603	CDS
complement(2151..2975)	uvrC	ABK88160.1	possible excinuclease ABC subunit	NA	1_CP000486.1	complement	2151	2975	CDS
complement(2988..3599)	res	ABK88161.1	resolvase	NA	1_CP000486.1	complement	2988	3599	CDS
complement(4075..4305)	NA	ABK88162.1	conserved hypothetical protein	possible phage-related protein, N-terminal region	1_CP000486.1	complement	4075	4305	CDS
complement(4394..4720)	NA	ABK88163.1	hypothetical protein	NA	1_CP000486.1	complement	4394	4720	CDS
complement(5796..6185)	NA	ABK88164.1	phage-related protein	NA	1_CP000486.1	complement	5796	6185	CDS

We can load the cluster data into the GC_chart() and the alignment data into the GC_links() function. In addition, we change the axis_type to “range”, make the genes a little smaller and add cluster labels. For more customization options see Get started.

Important: The gene cluster data like the alignment data is loaded in the order of the files in the folder. If a custom order was used for the MUMer alignment make sure to sort the “cluster” column in the cluster data in the same way.

GC_chart(
  data = cds_df,
  group = "cluster",
  strand = "strand",
  cluster = "cluster"
) %>%
GC_links(
  data = alignment,
  measure = "identity", # similarity/none
) %>%
GC_scale(axis_type = "range") %>%
GC_cluster(separate_strands = TRUE) %>%
GC_genes(
  group = "type",
  marker_size = "small"
) %>%    
GC_clusterLabel(
  title = c("pALH1", "pBFQ", "F837_55", "pBFI_4"), 
  width = 50
  ) %>%
GC_tooltip(
  formatter = "<b>{protein_id}</b><br>{start} - {end}" 
) %>%
GC_legend(FALSE)