R/getPfamAnnotationForPDBFilenames.R
get_pfam_annotation_for_targets.Rd
This function retrieves Pfam domain annotations for a given set of PDB target structures. The targets are expected to be results from a local Foldseek search against the PDB database. The function extracts the relevant PDB assembly and chain identifiers, queries the RCSB API, and returns the Pfam descriptions associated with the identified polymer entities.
get_pfam_annotation_for_targets(targets, batch_size = 200)
targets: A character vector of PDB filenames with assembly and chain information, as returned by Foldseek in the "target" column.
batch_size: An integer specifying the number of targets to process in one batch (default: 200). The API calls to the RCSB server are made in batches to avoid overloading the server.
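To illustrate what processing targets in batches of batch_size might look like, here is a minimal sketch in base R. The helper name split_into_batches() and the placeholder target names are hypothetical and are not part of the package; the function's actual batching logic may differ.

```r
# Hypothetical helper (not part of the package): split a character vector
# of targets into consecutive batches of at most `batch_size` elements.
split_into_batches <- function(targets, batch_size = 200) {
  stopifnot(is.character(targets), batch_size >= 1)
  if (length(targets) == 0) {
    return(list())
  }
  # Element i goes into batch ceiling(i / batch_size).
  batch_index <- ceiling(seq_along(targets) / batch_size)
  unname(split(targets, batch_index))
}

# Placeholder target names, used only to show the batch sizes.
targets <- sprintf("t%03d", 1:450)
batches <- split_into_batches(targets, batch_size = 200)
lengths(batches)  # 200 200 50
```

Each element of batches could then be passed to one API request, so that no single request carries more than batch_size identifiers.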
A data frame with columns:
target: The original target name from Foldseek output.
rcsb_id: The RCSB identifier for the matched polymer entity. Note: This uses the "label_asym_id" instead of the "auth_asym_id" used in the target name.
title: The title of the PDB entry associated with the entity.
pfam_description: The description of the Pfam family associated with the entity.
The function operates in the following steps:
Extracts PDB assembly and chain IDs from the target names.
Queries the RCSB API to retrieve corresponding polymer entity identifiers.
Filters results to match Foldseek output.
Retrieves Pfam annotations for the identified entities.
Merges results into a structured data frame.
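The first step above can be sketched as follows. This is not the package's implementation of get_assembly_id_from_target(); it assumes that target names follow the pattern "<pdb id>-assembly<number>.cif.gz_<chain>" (for example "4hhb-assembly1.cif.gz_A") and that the RCSB assembly identifier has the form "<PDB ID>-<number>". The helper name parse_foldseek_target() is hypothetical.

```r
# Hypothetical helper: split a Foldseek PDB target name into an RCSB-style
# assembly ID and the author chain ID (auth_asym_id).
# Assumed input format: "<pdb id>-assembly<number>.cif.gz_<chain>".
parse_foldseek_target <- function(targets) {
  pattern <- "^([0-9][A-Za-z0-9]{3})-assembly([0-9]+)\\.cif(\\.gz)?_(.+)$"
  parts <- regmatches(targets, regexec(pattern, targets))
  rows <- lapply(seq_along(targets), function(i) {
    m <- parts[[i]]
    if (length(m) == 0) {
      # Name does not follow the assumed pattern: keep the row, fill with NA.
      return(data.frame(target = targets[i],
                        assembly_id = NA_character_,
                        auth_asym_id = NA_character_,
                        stringsAsFactors = FALSE))
    }
    # m[1] is the full match; m[2] PDB ID, m[3] assembly number,
    # m[4] optional ".gz" suffix, m[5] chain.
    data.frame(target = targets[i],
               assembly_id = paste0(toupper(m[2]), "-", m[3]),
               auth_asym_id = m[5],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

parsed <- parse_foldseek_target(c("4hhb-assembly1.cif.gz_A",
                                  "1tim-assembly1.cif.gz_B",
                                  "not_a_pdb_target"))
print(parsed)
```

Keeping the chain as auth_asym_id matters for the later filtering step, because the rcsb_id returned by the API is based on label_asym_id and the two naming schemes do not always agree.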
Internally, the function calls:
get_assembly_id_from_target(): Extracts assembly ID from PDB target name.
get_polymer_info(): Queries the RCSB API for Pfam domain information.
The Research Collaboratory for Structural Bioinformatics (RCSB) provides structural and functional annotations for macromolecules stored in the Protein Data Bank (PDB). Foldseek is a fast search tool for comparing protein structures. When searching for similar structures, Foldseek returns a table that includes a "target" column. For searches against the PDB, this column contains the PDB filename of each hit, which is not easily interpretable on its own.
This function automates the extraction of Pfam domain annotations for these target PDB filenames returned by Foldseek, using the RCSB GraphQL API.
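As a rough illustration of what a GraphQL request for Pfam descriptions might look like, the sketch below only builds a query string in base R. The helper build_pfam_query() is hypothetical, and the field names used in the query (polymer_entities, entity_ids, rcsb_id, entry, struct, title, pfams, rcsb_pfam_description) are assumptions that should be checked against the current RCSB Data API schema; the query the function actually sends may be different.

```r
# Hypothetical helper: build a GraphQL query string asking for the entry
# title and Pfam descriptions of a set of polymer entities.
# All field names below are assumptions; verify them against the RCSB schema.
build_pfam_query <- function(entity_ids) {
  stopifnot(is.character(entity_ids), length(entity_ids) > 0)
  # Quote each ID and join them into a GraphQL list literal.
  id_list <- paste0('"', entity_ids, '"', collapse = ", ")
  sprintf(
    paste(
      "{ polymer_entities(entity_ids: [%s]) {",
      "rcsb_id",
      "entry { struct { title } }",
      "pfams { rcsb_pfam_description }",
      "} }"
    ),
    id_list
  )
}

query <- build_pfam_query(c("4HHB_1", "4HHB_2"))
cat(query, "\n")
```

The resulting string could then be sent to the RCSB Data API GraphQL endpoint (https://data.rcsb.org/graphql) with an HTTP client of your choice, one batch of identifiers per request.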