Introduction to rgbio
Overview
rgbio provides performant reading and writing operations
for GenBank (.gb/.gbk/.gbff) files in R via an interface to the
high-performance gb-io Rust
crate. It is designed to be fast and memory-efficient while providing
R-friendly data structures.
Why rgbio?
- the only way to directly write GenBank files from R (to my knowledge)
- much faster reading of GenBank files (~10x-30x faster than other packages in my benchmarks)
- reading into and writing from both tidy objects (e.g. tibbles/data.frames) and “Bioconductor Sequence Infrastructure” objects (e.g. DNAStringSet, GRanges).
- robust parsing via the gb-io Rust crate
- extensively tested on ~50 diverse GenBank files with many edge cases.
Installation
The rgbio package is not yet available on CRAN because it depends
on a Rust crate. You can install it from the R-universe repository
without installing Rust or any Rust toolchain, since pre-built binaries
are available for Windows, macOS, and Linux.
install.packages("rgbio",
repos = c("https://richardstoeckl.r-universe.dev",
"https://cloud.r-project.org"))
If there is no pre-built binary available for your system, or you
want the latest development version, you can install rgbio
from GitHub, provided you have the Rust toolchain installed. You can
find information on how to install Rust at https://github.com/r-rust/hellorust.
# install.packages("remotes")
remotes::install_github("richardstoeckl/rgbio")
Basic Usage
Loading the Package
library(rgbio)
Writing and Reading (Tidy Workflow)
To write a GenBank file in tidy mode, you typically provide:
1. Sequences: A named character vector (or DNAStringSet).
2. Features: A data.frame with columns type, start, end, strand, qualifiers.
3. Metadata: A list, data.frame, or DataFrame with record-level attributes.
Let’s create a minimal example sequence.
# 1. The sequence
seq_dna <- "ATGCGTACGTTAGC"
# 2. Metadata
metadata <- list(
definition = "Synthetic Example Sequence",
accession = "EX0001",
version = "1",
molecule_type = "DNA",
topology = "linear",
division = "SYN",
date = "01-JAN-2023"
)
# 3. Features
# Note: 'qualifiers' must be a list column where each element is a named character vector.
features_df <- data.frame(
type = c("source", "gene", "CDS"),
start = c(1L, 1L, 1L),
end = c(14L, 14L, 14L),
strand = c("+", "+", "+"),
stringsAsFactors = FALSE
)
features_df$qualifiers <- list(
c(organism = "Synthetic Organism", mol_type = "genomic DNA"),
c(gene = "exampleGene"),
c(gene = "exampleGene", product = "hypothetical protein", translation = "MRTS")
)
# Preview features
print(features_df)
#> type start end strand qualifiers
#> 1 source 1 14 + Synthetic Organism, genomic DNA
#> 2 gene 1 14 + exampleGene
#> 3 CDS 1 14 + exampleGene, hypothetical protein, MRTS
Now, write it to a temporary file:
Reading Back in Tidy Format
Reading is straightforward. read_gbk parses the file and
can return tidy tables.
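A minimal sketch of the write and read steps as one round trip. The argument names `file`, `sequences`, `features`, and `metadata` are taken from the `write_gbk()` call in the Bioconductor section of this vignette; `format = "tidy"` is an assumption made by analogy with `format = "bioconductor"`. The `demo_*` objects are small stand-ins so the chunk runs on its own, and the `requireNamespace()` guard only skips the I/O when rgbio is not installed.

```r
# Small stand-in inputs (hypothetical names, mirroring the example above).
demo_seqs <- c(EX0001 = "ATGCGTACGTTAGC")   # named character vector
demo_feats <- data.frame(
  type = "gene", start = 1L, end = 14L, strand = "+",
  stringsAsFactors = FALSE
)
# 'qualifiers' is a list column of named character vectors.
demo_feats$qualifiers <- list(c(gene = "exampleGene"))
demo_meta <- list(definition = "Synthetic Example Sequence", accession = "EX0001")

tmp_file <- tempfile(fileext = ".gbk")

if (requireNamespace("rgbio", quietly = TRUE)) {
  # Write the record, then read it back.
  rgbio::write_gbk(
    file = tmp_file,
    sequences = demo_seqs,
    features = demo_feats,
    metadata = demo_meta
  )
  # Assumption: "tidy" is an accepted value of `format`.
  demo_records <- rgbio::read_gbk(tmp_file, format = "tidy")
  names(demo_records)
}
```

With the objects defined earlier, the equivalent calls would presumably pass `c(EX0001 = seq_dna)`, `features_df`, and `metadata`, and store the result of `read_gbk()` in `records`.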
Inspecting the Data
The returned object has three components matching what we wrote.
Metadata:
str(records$metadata)
#> tibble [1 × 13] (S3: tbl_df/tbl/data.frame)
#> $ record_id : chr "EX0001"
#> $ name : chr "EX0001"
#> $ definition : chr "Synthetic Example Sequence"
#> $ accession : chr "EX0001"
#> $ version : chr "1"
#> $ keywords :List of 1
#> ..$ : chr(0)
#> $ source : chr ""
#> $ organism : chr NA
#> $ molecule_type: chr "DNA"
#> $ topology : chr "linear"
#> $ division : chr "SYN"
#> $ date : chr "01-JAN-2023"
#> $ references :List of 1
#> ..$ : list()
Sequence:
records$sequences$sequence[[1]]
#> [1] "ATGCGTACGTTAGC"
Features:
The features are returned as a tidy data.frame.
print(records$features)
#> # A tibble: 3 × 6
#> record_id type start end strand qualifiers
#> <chr> <chr> <int> <int> <chr> <I<list>>
#> 1 EX0001 source 1 14 + <chr [2]>
#> 2 EX0001 gene 1 14 + <chr [1]>
#> 3 EX0001 CDS 1 14 + <chr [3]>
Writing and Reading (Bioconductor Workflow)
You can also use Bioconductor-native classes for input and output.
seqs_bioc <- Biostrings::DNAStringSet(c(EX0002 = "ATGCGGTTAA"))
gr <- GenomicRanges::GRanges(
seqnames = "EX0002",
ranges = IRanges::IRanges(start = c(1L, 1L), end = c(10L, 10L)),
strand = c("+", "+")
)
S4Vectors::mcols(gr)$type <- c("source", "gene")
S4Vectors::mcols(gr)$qualifiers <- list(
c(organism = "Synthetic Organism", mol_type = "genomic DNA"),
c(gene = "exampleGene2")
)
meta_bioc <- S4Vectors::DataFrame(
definition = "Bioconductor input example",
accession = "EX0002",
molecule_type = "DNA"
)
tmp_bioc <- tempfile(fileext = ".gb")
write_gbk(
file = tmp_bioc,
sequences = seqs_bioc,
features = gr,
metadata = meta_bioc
)
#> [1] TRUE
bioc_out <- read_gbk(tmp_bioc, format = "bioconductor")
class(bioc_out$sequences)
#> [1] "DNAStringSet"
#> attr(,"package")
#> [1] "Biostrings"
class(bioc_out$features)
#> [1] "GRanges"
#> attr(,"package")
#> [1] "GenomicRanges"
class(bioc_out$metadata)
#> [1] "DFrame"
#> attr(,"package")
#> [1] "S4Vectors"
Minimum Required Information for write_gbk()
Absolute minimum required inputs:
- file: output file path.
- sequences: non-empty named character vector or DNAStringSet with non-empty sequence strings.
Everything else is optional:
- features: optional (NULL is valid).
- metadata: optional (NULL is valid).
If omitted, rgbio fills required record-level fields
using sequence names:
- name, definition, accession default to the record name.
- molecule_type defaults to "DNA".
Practical note:
- append = TRUE requires that file already exists and is a valid GenBank file.
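The minimum described above can be sketched as follows. `has_valid_sequences()` is a hypothetical helper, not part of rgbio, that checks the stated requirements for a character vector (it does not handle DNAStringSet input). The `write_gbk()` calls pass `NULL` for the optional arguments, as the list above says is valid, and the guard skips the I/O when rgbio is not installed.

```r
# Hypothetical helper (not part of rgbio): TRUE if x is a non-empty,
# fully named character vector with no empty or missing sequences.
has_valid_sequences <- function(x) {
  is.character(x) && length(x) > 0L &&
    !is.null(names(x)) && !anyNA(names(x)) && all(nzchar(names(x))) &&
    !anyNA(x) && all(nzchar(x))
}

seqs_min <- c(rec1 = "ATGCGT", rec2 = "GGCCTTAA")
stopifnot(has_valid_sequences(seqs_min))

out_file <- tempfile(fileext = ".gbk")

if (requireNamespace("rgbio", quietly = TRUE)) {
  # Only the required inputs; record-level fields fall back to the names.
  rgbio::write_gbk(file = out_file, sequences = seqs_min,
                   features = NULL, metadata = NULL)

  # Appending is only attempted once the file exists.
  if (file.exists(out_file)) {
    rgbio::write_gbk(file = out_file, sequences = c(rec3 = "TTAACC"),
                     features = NULL, metadata = NULL, append = TRUE)
  }
}
```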
Supported Metadata Fields
The following metadata fields are supported by
write_gbk() and returned by read_gbk():
- name (Locus name)
- definition
- accession
- version
- keywords (character vector)
- source
- organism
- molecule_type (e.g., “DNA”)
- division
- topology (“linear” or “circular”)
- date (format: DD-MON-YYYY)
- references (list of references; each reference may include description, authors, consortium, title, journal, pubmed, remark)
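As an illustration, a metadata list using these fields might look like the sketch below. `genbank_date()` is a hypothetical helper, not part of rgbio, that formats a date as DD-MON-YYYY. The shape of each reference (a named list of character fields) is an assumption, and the reference content is a placeholder.

```r
# Hypothetical helper (not part of rgbio): format a date as DD-MON-YYYY.
# month.abb is always English, so the result does not depend on the locale.
genbank_date <- function(d) {
  d <- as.Date(d)
  sprintf(
    "%s-%s-%s",
    format(d, "%d"),
    toupper(month.abb[as.integer(format(d, "%m"))]),
    format(d, "%Y")
  )
}

metadata_full <- list(
  name = "EX0001",
  definition = "Synthetic Example Sequence",
  accession = "EX0001",
  version = "1",
  keywords = c("synthetic", "example"),   # character vector
  source = "synthetic construct",
  organism = "Synthetic Organism",
  molecule_type = "DNA",
  division = "SYN",
  topology = "linear",
  date = genbank_date("2023-01-01"),
  # Assumption: each reference is a named list of character fields.
  references = list(
    list(
      authors = "Doe J.",                 # placeholder values
      title = "An example reference",
      journal = "Unpublished"
    )
  )
)

metadata_full$date
#> [1] "01-JAN-2023"
```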
Advanced: Complex Locations
When reading GenBank files, rgbio preserves feature
locations as GenBank location expressions produced by the parser,
including patterns such as:
- join(1..10,20..30)
- complement(100..200)
- fuzzy bounds such as <5..>120
In other words, rgbio focuses on faithful
I/O of location syntax rather than fully symbolic location
algebra.
For advanced manipulations (interval arithmetic, set operations,
transcript/CDS composition), use Bioconductor range tooling on the
GRanges output or parse location strings with specialized
utilities.
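For simple cases, a location string can be summarised with base R alone. The sketch below uses a hypothetical helper, `summarise_location()`, which is not part of rgbio. It only extracts the outer bounds and two flags. It does not model per-segment strands inside a join, and it would misread remote locations that carry an accession prefix with digits.

```r
# Hypothetical helper (not part of rgbio): outer bounds and flags for a
# single GenBank location expression.
summarise_location <- function(loc) {
  stopifnot(is.character(loc), length(loc) == 1L)
  # Every run of digits in the expression, in order of appearance.
  coords <- as.integer(regmatches(loc, gregexpr("[0-9]+", loc))[[1]])
  stopifnot(length(coords) > 0L)
  list(
    start = min(coords),
    end = max(coords),
    complement = grepl("complement(", loc, fixed = TRUE),
    fuzzy = grepl("[<>]", loc)
  )
}

locs <- c("join(1..10,20..30)", "complement(100..200)", "<5..>120")

# One row per location expression.
loc_table <- do.call(
  rbind,
  lapply(locs, function(l) as.data.frame(summarise_location(l)))
)
loc_table$location <- locs
loc_table
```

Anything beyond this, such as working with the individual segments of a join, is better handled with the range tooling mentioned above.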
Performance
rgbio leverages Rust’s zero-copy parsing where possible
and efficient string handling to outperform pure R implementations,
especially for large multi-record GenBank files. See the benchmarks
article for the full results and methodology.
Disclaimer
Important note: This was, and still is, a project for me to play
around with agentic coding, and it was written primarily by LLMs (“AI”)
under my direction. Nevertheless, it provides real value, as it is one of
the few ways to write GenBank files from R and one of the most
performant ways to read GenBank files into R. It uses the very robust Rust
gb-io crate and is tested against ~50 diverse GenBank files
with many edge cases.
This library is provided under the MIT License. The gb-io Rust crate was written by David Leslie and is licensed under the terms of the MIT License.
This project is in no way affiliated, sponsored, or otherwise endorsed by the original gb-io authors.