This function conducts a BLAST analysis for protein sequences within specified clusters. It generates all possible protein combinations between a query cluster and other clusters, performs pairwise alignments, calculates sequence identity and similarity, and filters results based on a minimum identity threshold.
Usage
protein_blast(
  data,
  query,
  id = "protein_id",
  start = "start",
  end = "end",
  cluster = "cluster",
  genes = NULL,
  identity = 30,
  parallel = TRUE
)
Arguments
- data
A data frame, or a character vector of paths to .gbk files. When paths are provided, the files are read and processed into a data frame. The data frame must contain columns for unique protein identifiers, cluster identifiers, protein sequences, and the start and end positions of each gene.
- query
The name of the query cluster to be used for BLAST comparisons.
- id
The name of the column that contains the unique protein (gene) identifiers. Defaults to "protein_id".
- start
The name of the column specifying the start positions of genes. Defaults to "start".
- end
The name of the column specifying the end positions of genes. Defaults to "end".
- cluster
The name of the column specifying the cluster names. Defaults to "cluster".
- genes
An optional vector of gene identifiers to include in the analysis. Defaults to NULL.
- identity
Minimum sequence identity, as a percentage, that a BLAST hit must reach to be kept in the results. Defaults to 30.
- parallel
Logical indicating whether to use parallel processing for alignments. Defaults to TRUE.
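As a rough illustration of the input the arguments above describe, the following sketch builds a small data frame with one column per required field. The sequence column name `translation`, along with all identifiers, coordinates, and sequences, is a made-up assumption for illustration and is not taken from this page. The call is wrapped in `if (FALSE)` because it only works once the package that provides `protein_blast()` is loaded.

```r
# Toy input: one row per protein. Column names match the argument defaults;
# `translation` is an ASSUMED name for the protein sequence column.
toy_data <- data.frame(
  protein_id  = c("p1", "p2", "p3", "p4"),
  cluster     = c("cluster A", "cluster A", "cluster B", "cluster B"),
  start       = c(1, 400, 1, 350),
  end         = c(300, 900, 280, 800),
  translation = c("MKTAYIAK", "MSLLTEVE", "MKTAHIAR", "GQSWPNCD"),
  stringsAsFactors = FALSE
)

# Not run: requires the package providing protein_blast() to be attached.
if (FALSE) {
  result <- protein_blast(
    toy_data,
    query    = "cluster A",
    id       = "protein_id",
    start    = "start",
    end      = "end",
    cluster  = "cluster",
    identity = 30,
    parallel = FALSE
  )
}
```

Setting `parallel = FALSE` here is only a choice for a small example; the documented default is `TRUE`.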
Value
A modified version of the input `data` dataframe, including additional columns for BLAST results (identity, similarity).
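The steps named in the description (pairing query proteins with proteins from other clusters, scoring identity, filtering by a threshold) can be sketched in base R as follows. This is not the function's actual implementation: the toy sequences are assumed to be equal-length and already aligned, the hypothetical helper `percent_identity()` stands in for a real pairwise alignment, the alignment length is used as the denominator (one of several conventions), and the `>=` comparison is an assumption. Similarity is left out because it would need a substitution matrix.

```r
# Toy proteins; sequences are illustrative and treated as already aligned.
proteins <- data.frame(
  protein_id = c("q1", "q2", "t1", "t2"),
  cluster    = c("A", "A", "B", "B"),
  sequence   = c("MKTAYIAK", "MSLLTEVE", "MKTAHIAR", "GQSWPNCD"),
  stringsAsFactors = FALSE
)

query_cluster <- "A"
min_identity  <- 30

# Hypothetical helper: percent of positions with identical residues.
percent_identity <- function(a, b) {
  a_chars <- strsplit(a, "")[[1]]
  b_chars <- strsplit(b, "")[[1]]
  stopifnot(length(a_chars) == length(b_chars))
  100 * sum(a_chars == b_chars) / length(a_chars)
}

# Every query protein paired with every protein from the other clusters.
pairs <- expand.grid(
  query_id  = proteins$protein_id[proteins$cluster == query_cluster],
  target_id = proteins$protein_id[proteins$cluster != query_cluster],
  stringsAsFactors = FALSE
)

# Look up sequences by identifier and score each pair.
seqs <- setNames(proteins$sequence, proteins$protein_id)
pairs$identity <- mapply(
  function(q, t) percent_identity(seqs[[q]], seqs[[t]]),
  pairs$query_id, pairs$target_id,
  USE.NAMES = FALSE
)

# Keep only pairs at or above the threshold (assumed inclusive).
hits <- pairs[pairs$identity >= min_identity, ]
print(hits)
```

With these toy sequences only the `q1`/`t1` pair passes the threshold, with 6 of 8 positions matching (75% identity).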