Read Protein Sequences from FASTA Files — read

This function reads protein sequences from a single FASTA file or from every FASTA file in a directory. It parses metadata from the FASTA headers, where each item is a key-value pair joined by an equals sign (`=`). For example, from the header '>protein1 [gene=scnD] [protein=ScnD]' it extracts the key 'gene' with the value 'scnD' and the key 'protein' with the value 'ScnD'.
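
To make the header format concrete, here is a minimal base R sketch of how such key-value pairs could be pulled out of a single header line. The helper `parse_header_keys()` is hypothetical and is not part of this package; it assumes each pair is wrapped in square brackets as in the example header, and it may differ from how `read_fasta()` parses headers internally.

```r
# Hypothetical helper for illustration only; not the package's implementation.
parse_header_keys <- function(header) {
  # Match every bracketed "[key=value]" group in the header line
  pattern <- "\\[([^=\\]]+)=([^\\]]*)\\]"
  groups <- regmatches(header, gregexpr(pattern, header, perl = TRUE))[[1]]
  # Drop the surrounding brackets, then split each group at the first "="
  inner <- substr(groups, 2, nchar(groups) - 1)
  keys <- sub("=.*$", "", inner)
  values <- sub("^[^=]*=", "", inner)
  stats::setNames(as.list(values), keys)
}

header <- ">protein1 [gene=scnD] [protein=ScnD]"
parsed <- parse_header_keys(header)
parsed$gene     # "scnD"
parsed$protein  # "ScnD"
```

A header without any bracketed pairs yields an empty list under this sketch.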

Usage

read_fasta(fasta_path, sequence = TRUE, keys = NULL, file_extension = "fasta")

Arguments

fasta_path: Path to the FASTA file or directory containing FASTA files.
sequence: Logical; if `TRUE` (the default), the protein sequences are included in the returned data frame.
keys: An optional character vector naming the header keys to retain in the final data frame. If `NULL` (the default), all keys found in the headers are included.
file_extension: Extension of the FASTA files to read when `fasta_path` is a directory (default is `"fasta"`).
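
As an illustration of how `file_extension` might select files when `fasta_path` is a directory, the sketch below lists files by extension with base R. The helper `list_fasta_files()` is hypothetical, and the exact matching rule used by `read_fasta()` is an assumption here; the point is only that a value such as "fa" would not pick up files ending in ".fasta" under this kind of matching.

```r
# Hypothetical helper for illustration only; read_fasta() may match differently.
list_fasta_files <- function(dir_path, file_extension = "fasta") {
  # Build a pattern that matches names ending in ".<file_extension>"
  pattern <- paste0("\\.", file_extension, "$")
  list.files(dir_path, pattern = pattern, full.names = TRUE)
}

# Demonstrate on a temporary directory containing three empty files
demo_dir <- tempfile("fasta_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir, c("a.fasta", "b.fa", "notes.txt")))

basename(list_fasta_files(demo_dir))                         # "a.fasta"
basename(list_fasta_files(demo_dir, file_extension = "fa"))  # "b.fa"
```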

Value

A data frame with one column for each key extracted from the FASTA headers. If `sequence = TRUE`, the protein sequences are included as well.

Details

The Biostrings package is required to run this function.
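
Biostrings is distributed through Bioconductor rather than CRAN. If it is not yet installed, the usual Bioconductor route is via the BiocManager package, sketched below; this assumes a working internet connection and a compatible R version.

```r
# Install BiocManager from CRAN if it is missing, then install Biostrings
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
BiocManager::install("Biostrings")
```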

Examples

if (FALSE) {
# Read sequences from a single FASTA file
sequences_df <- read_fasta("path/to/single_file.fasta")

# Read all sequences from a directory of FASTA files
sequences_df <- read_fasta("path/to/directory/", file_extension = "fa")

# Keep only selected header keys and include the protein sequences
# (sequence = TRUE is the default and is shown here for clarity)
sequences_df <- read_fasta("path/to/directory/", sequence = TRUE, keys = c("gene", "protein"))
}