Skip to contents

Intro

This tutorial shows how to use biomaRt to load transcript data from Ensembl and visualize it using geneviewer. As a practical example, we will retrieve the exon and UTR regions of five different splice variants of the BRCA1 gene and display the results with geneviewer.

Loading Transcript data

Transcript data for the BRCA1 gene can be loaded using the biomaRt package, with the Ensembl gene ID obtained from the Ensembl website. In this example, we first load the necessary libraries. Next, we specify the Ensembl gene ID for BRCA1 and establish a connection to the Ensembl database using useMart(). Finally, we retrieve all transcript IDs associated with the BRCA1 gene using the getBM() function.

library(biomaRt)
library(geneviewer)
library(dplyr)


ensembl_gene_id <- c("ENSG00000012048")
mart = useMart("ensembl", dataset = "hsapiens_gene_ensembl")

# Retrieve all transcript IDs for the gene
transcrips <- getBM(attributes = c("ensembl_gene_id", "ensembl_transcript_id"),
                              filters = "ensembl_gene_id",
                              values = ensembl_gene_id,
                              mart = mart)
ensembl_gene_id ensembl_transcript_id
ENSG00000012048 ENST00000497488
ENSG00000012048 ENST00000489037
ENSG00000012048 ENST00000478531
ENSG00000012048 ENST00000357654
ENSG00000012048 ENST00000473961
ENSG00000012048 ENST00000477152

Now that we have all transcript IDs associated with the BRCA1 gene, we can retrieve the start and end positions of the exons, as well as the 5’ and 3’ UTRs. Additionally, we will add the strand information for each transcript. As an example, we only retrieve the information for the first 5 transcript IDs. We use the getBM function to retrieve the start and end positions for each feature separately. We rename the start and end columns, allowing us to combine all results into a single data frame using bind_rows from the dplyr package. Finally, we remove any rows with NA in the start column and add the strand information for each transcript.


transcript_IDs <- transcrips$ensembl_transcript_id[1:5]

# Get exon details
exon_attributes <- c("ensembl_transcript_id", "exon_chrom_start", "exon_chrom_end")
exon_results <- getBM(attributes = exon_attributes,
                      filters = "ensembl_transcript_id",
                      values = transcript_IDs,
                      mart = mart) %>%
  dplyr::mutate(type = "exon", start = exon_chrom_start, end = exon_chrom_end) %>%
  dplyr::select(ensembl_transcript_id, type, start, end)

# Get 3' UTR details
utr3_attributes <- c("ensembl_transcript_id", "3_utr_start", "3_utr_end")
utr3_results <- getBM(attributes = utr3_attributes,
                      filters = "ensembl_transcript_id",
                      values = transcript_IDs,
                      mart = mart) %>%
  dplyr::mutate(type = "3_utr", start = `3_utr_start`, end = `3_utr_end`) %>%
  dplyr::select(ensembl_transcript_id, type, start, end)

# Get 5' UTR details
utr5_attributes <- c("ensembl_transcript_id", "5_utr_start", "5_utr_end")
utr5_results <- getBM(attributes = utr5_attributes,
                      filters = "ensembl_transcript_id",
                      values = transcript_IDs,
                      mart = mart) %>%
  dplyr::mutate(type = "5_utr", start = `5_utr_start`, end = `5_utr_end`) %>%
  dplyr::select(ensembl_transcript_id, type, start, end)

# Bind features
transcript_attributes <- bind_rows(exon_results, utr3_results, utr5_results) %>%
  filter(!is.na(start))

# Add strand
transcript_strand <- getBM(attributes = c("ensembl_transcript_id", "strand"),
                           filters = "ensembl_transcript_id",
                           values = transcript_IDs,
                           mart = mart) 

# Add transcript strand
transcript_attributes <- 
  dplyr::left_join(
    transcript_attributes, 
    transcript_strand, 
    by = "ensembl_transcript_id"
    )
ensembl_transcript_id type start end strand
ENST00000357654 exon 43090944 43091032 -1
ENST00000357654 exon 43082404 43082575 -1
ENST00000357654 exon 43076488 43076614 -1
ENST00000357654 exon 43074331 43074521 -1
ENST00000357654 exon 43070928 43071238 -1
ENST00000357654 exon 43067608 43067695 -1

Now that we have the start, end and strand information of the transcripts we can visualize the results using geneviewer and the GC_transcript function. Note that the intron positions are calculated based on the exon positions. To ensure the exons are recognized, the type column with the feature name must be specified and include “exon” in its name.

GC_chart(
  data = transcript_attributes, 
  start = "start",
  end = "end",
  strand = "strand",
  height = "600px"
) %>%
  GC_transcript(
    transcript = "ensembl_transcript_id",
    type = "type"
  )

We can further customize the chart by adding a title with the transcript name and a footer with the relative abundance of each transcript. To change the color, alter the appearance of specific features, and explore many more options, see the Get Started guide or the GC_transcripts documentation by running ?GC_transcripts in the console.

GC_chart(
  data = transcript_attributes, 
  height = "600px"
  ) %>%
  GC_transcript(
    transcript = "ensembl_transcript_id", 
    strand = "strand", 
    type = "type",
  ) %>%
  GC_clusterTitle(
    title = unique(transcript_attributes$ensembl_transcript_id),
    titleFont = list(
      fontSize = "12px"
    )
  ) %>%
  GC_clusterFooter(
    title = c("0.8%", "0.1%", "0.07%", "0.02%", "0.01%"),
    align = "center"
  )