Perform Protein BLAST Analysis Within Specified Clusters

This function conducts a BLAST analysis for protein sequences within specified clusters. It pairs every protein in the query cluster with every protein in the other clusters, performs a pairwise alignment for each pair, calculates sequence identity and similarity, and filters out pairs that fall below a minimum identity threshold.
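
The pairing and filtering steps described above can be sketched in base R. This is a minimal illustration, not the function's implementation. The `translation` column name, the toy sequences, and the `toy_identity()` helper are assumptions made for the example. The helper simply counts matching positions from left to right and performs no alignment, whereas the function itself relies on pairwise alignment.

```r
# Toy input; the `translation` column name and the sequences are hypothetical.
proteins <- data.frame(
  protein_id  = c("A1", "A2", "B1", "B2"),
  cluster     = c("cluster A", "cluster A", "cluster B", "cluster B"),
  translation = c("MKTAYIAKQR", "MSLLTEVETY", "MKTAYIAKQR", "MKTWYIAKQR"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for an alignment-based identity score:
# percent of positions that match when sequences are compared left to right.
toy_identity <- function(a, b) {
  a <- strsplit(a, "")[[1]]
  b <- strsplit(b, "")[[1]]
  n <- min(length(a), length(b))
  100 * sum(a[seq_len(n)] == b[seq_len(n)]) / max(length(a), length(b))
}

query_name <- "cluster A"
query <- proteins[proteins$cluster == query_name, ]
other <- proteins[proteins$cluster != query_name, ]

# Every query protein paired with every protein from the other clusters.
idx <- expand.grid(q = seq_len(nrow(query)), o = seq_len(nrow(other)))
pairs <- data.frame(
  query_id = query$protein_id[idx$q],
  hit_id   = other$protein_id[idx$o],
  identity = mapply(toy_identity,
                    query$translation[idx$q],
                    other$translation[idx$o],
                    USE.NAMES = FALSE),
  stringsAsFactors = FALSE
)

# Keep only pairs at or above the minimum identity threshold.
hits <- pairs[pairs$identity >= 30, ]
print(hits)
```

With the toy data, only the pairs involving `A1` pass the threshold of 30. The `A2` pairs share a single matching position and are dropped.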

Usage

protein_blast(
  data,
  query,
  id = "protein_id",
  start = "start",
  end = "end",
  cluster = "cluster",
  genes = NULL,
  identity = 30,
  parallel = TRUE
)

Arguments

data: A dataframe, or a character vector specifying the path to .gbk files. When a character vector is provided, the .gbk files are read and processed into a dataframe. A dataframe must contain columns for unique protein identifiers, cluster identifiers, protein sequences, and the start and end positions of each gene.
query: The name of the query cluster to be used for BLAST comparisons.
id: The name of the column that contains the unique protein (gene) identifiers. Defaults to "protein_id".
start: The name of the column specifying the start positions of genes. Defaults to "start".
end: The name of the column specifying the end positions of genes. Defaults to "end".
cluster: The name of the column specifying the cluster names. Defaults to "cluster".
genes: An optional vector of gene identifiers to include in the analysis. Defaults to NULL.
identity: Minimum sequence identity, in percent, for BLAST hits to be considered significant. Defaults to 30.
parallel: Logical indicating whether to use parallel processing for alignments. Defaults to TRUE.
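
When a dataframe is passed instead of file paths, it can help to confirm that the columns named by `id`, `start`, `end`, and `cluster` exist before calling the function. The following is a minimal sketch under stated assumptions: `check_columns()` is a hypothetical helper, not part of the package, and the `translation` column is an assumed name for the protein sequences, since the sequence column name is not given above.

```r
# Toy dataframe in the documented shape; `translation` is an assumed name.
genes_df <- data.frame(
  protein_id  = c("A1", "B1"),
  cluster     = c("cluster A", "cluster B"),
  start       = c(1L, 1L),
  end         = c(300L, 450L),
  translation = c("MKTAYIAKQR", "MKTWYIAKQR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: stop early if any of the named columns is absent.
check_columns <- function(data, id = "protein_id", start = "start",
                          end = "end", cluster = "cluster") {
  needed <- c(id, start, end, cluster)
  absent <- setdiff(needed, names(data))
  if (length(absent) > 0) {
    stop("Missing column(s): ", paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

check_columns(genes_df)
```

If the dataframe uses different column names, pass them through the same arguments, for example `check_columns(genes_df, id = "locus_tag")`, which stops with an error here because the toy data has no such column.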

Value

A modified version of the input `data` dataframe, including additional columns for BLAST results (identity, similarity).

Note

This function relies on the Biostrings and pwalign packages for sequence alignment and on the dplyr package for data manipulation. Ensure these packages are installed and loaded into your R session.
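
One way to check for these dependencies before running the analysis is sketched below. It only reports what is missing and installs nothing. The variable names are illustrative.

```r
# Packages named in the note above.
required <- c("Biostrings", "pwalign", "dplyr")

# requireNamespace() returns FALSE, without raising an error,
# when a package is not installed.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
not_installed <- required[!installed]

if (length(not_installed) > 0) {
  message("Install before use: ", paste(not_installed, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

Biostrings and pwalign are distributed through Bioconductor, so they are typically installed with `BiocManager::install()`, while dplyr is available from CRAN via `install.packages()`.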

Examples

if (FALSE) {
  # Read the .gbk files in a folder and compare "cluster A" against the rest
  path_to_folder <- "path/to/gbk/folder/"
  data_updated <- protein_blast(
    path_to_folder,
    id = "protein_id",
    query = "cluster A",
    identity = 30
  )
}