This function reads protein sequences from a single FASTA file or from all FASTA files within a directory. It parses metadata in the FASTA headers that is written as key-value pairs separated by an equals sign `=`. For example, from the header '>protein1 [gene=scnD] [protein=ScnD]', it extracts 'gene' as a key with the value 'scnD', and 'protein' as a key with the value 'ScnD'.
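
The header parsing described above can be illustrated with a minimal base R sketch. The helper `parse_header_pairs()` is hypothetical and not part of the package; it assumes that each key-value pair is enclosed in square brackets, as in the example header, and the package's actual implementation may differ.

```r
# Hypothetical helper (not part of the package): extract "[key=value]" pairs
# from a single FASTA header using base R only.
parse_header_pairs <- function(header) {
  # Find every bracketed token of the form [key=value]
  pattern <- "\\[[^=\\]]+=[^\\]]*\\]"
  tokens <- regmatches(header, gregexpr(pattern, header, perl = TRUE))[[1]]
  # Drop the surrounding square brackets
  tokens <- substr(tokens, 2, nchar(tokens) - 1)
  # Split each token at its first "=" into key and value
  keys <- sub("=.*$", "", tokens)
  values <- sub("^[^=]*=", "", tokens)
  stats::setNames(values, keys)
}

pairs <- parse_header_pairs(">protein1 [gene=scnD] [protein=ScnD]")
pairs[["gene"]]     # "scnD"
pairs[["protein"]]  # "ScnD"
```

The result is a named character vector, so each key can later become a column name in a data frame.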

Usage

read_fasta(fasta_path, sequence = TRUE, keys = NULL, file_extension = "fasta")

Arguments

fasta_path

Path to the FASTA file or directory containing FASTA files.

sequence

Logical; if `TRUE`, the protein sequences are included in the returned data frame.

keys

An optional character vector of keys from the FASTA headers to retain in the final data frame. If `NULL` (the default), all keys found in the headers are included.

file_extension

Extension of the FASTA files to be read from the directory (default is 'fasta').

Value

A data frame with one column for each piece of information extracted from the FASTA headers and, if `sequence = TRUE`, the protein sequences.
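
As a rough illustration of how per-header key-value pairs could be arranged into such a data frame, and how a `keys` filter could narrow the columns, consider the sketch below. The helper `pairs_to_df()`, the `id` column, and the choice to fill absent keys with `NA` are all assumptions made for this example; `read_fasta()` may name its columns and handle missing keys differently.

```r
# Hypothetical input: one named character vector of header metadata per record
records <- list(
  protein1 = c(gene = "scnD", protein = "ScnD"),
  protein2 = c(gene = "scnE")
)

# Hypothetical helper (not part of the package): one row per record,
# one column per key, NA where a record lacks a key.
pairs_to_df <- function(records, keys = NULL) {
  all_keys <- unique(unlist(lapply(records, names)))
  # Optionally keep only the requested keys
  if (!is.null(keys)) all_keys <- intersect(keys, all_keys)
  # Indexing by name yields NA for keys a record does not have
  rows <- lapply(records, function(r) unname(r[all_keys]))
  df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(df) <- all_keys
  cbind(id = names(records), df, stringsAsFactors = FALSE)
}

pairs_to_df(records)                 # columns: id, gene, protein
pairs_to_df(records, keys = "gene")  # columns: id, gene
```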

Details

The Biostrings package is required to run this function.
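
Because Biostrings is distributed through Bioconductor rather than CRAN, it can help to check that it is available before calling the function. The sketch below writes a small temporary FASTA file and, if Biostrings is installed, reads it with `Biostrings::readAAStringSet()`. Whether `read_fasta()` uses this particular reader internally is an assumption; the file contents and variable names are made up for illustration.

```r
# Write a tiny FASTA file to a temporary location (illustrative data only)
fasta_file <- tempfile(fileext = ".fasta")
writeLines(c(">protein1 [gene=scnD] [protein=ScnD]", "MKTAYIAKQR"), fasta_file)

if (requireNamespace("Biostrings", quietly = TRUE)) {
  aa <- Biostrings::readAAStringSet(fasta_file)
  headers <- names(aa)           # header lines without the leading ">"
  sequences <- as.character(aa)  # sequences as plain character strings
} else {
  message('Biostrings is not installed; try BiocManager::install("Biostrings")')
}
```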

Examples

if (FALSE) {
# Read sequences from a single FASTA file
sequences_df <- read_fasta("path/to/single_file.fasta")

# Read all sequences from a directory of FASTA files
sequences_df <- read_fasta("path/to/directory/", file_extension = "fa")

# Read sequences and include the protein sequences in the output
sequences_df <- read_fasta("path/to/directory/", sequence = TRUE)
}