About
Every transcriptome listed on this site was assembled de novo from publicly available RNA-seq short reads using Semblans, an end-to-end pipeline for automated transcriptome assembly and processing.
The assembly pipeline
Semblans (pronounced “semblance”) is a C++ tool that abstracts the quality-control, reconstitution, and post-processing steps of transcriptome assembly from raw short-read sequences to annotated coding sequences. It leverages C++ data streaming to pass large sequence data between successive program calls without costly reformatting on disk, and it has been tested across a range of computing architectures including the NSF ACCESS cyberinfrastructure ecosystem.
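The streaming idea can be illustrated with a minimal sketch. Semblans itself is written in C++, so the Python below is only an analogy: every name in it (`FastqRecord`, `read_fastq`, `drop_short`, `drop_matching`) and both filter criteria are hypothetical and are not taken from the Semblans code base. It shows records flowing lazily from one step to the next without an intermediate file being written between steps.

```python
import io
from typing import Iterable, Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One four-line FASTQ entry (hypothetical container for this sketch)."""
    header: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Lazily yield records from a FASTQ text stream, one at a time."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of stream (or a blank line) stops iteration
            return
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data here
        quality = handle.readline().rstrip("\n")
        yield FastqRecord(header, sequence, quality)


def drop_short(records: Iterable[FastqRecord], min_len: int) -> Iterator[FastqRecord]:
    """Illustrative step 1: pass through reads of at least min_len bases."""
    for rec in records:
        if len(rec.sequence) >= min_len:
            yield rec


def drop_matching(records: Iterable[FastqRecord],
                  unwanted: Iterable[str]) -> Iterator[FastqRecord]:
    """Illustrative step 2: drop reads containing any unwanted subsequence."""
    unwanted = list(unwanted)
    for rec in records:
        if not any(u in rec.sequence for u in unwanted):
            yield rec


# Three toy reads held in memory; a real run would stream from a file.
raw = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nACG\n+\nIII\n"
    "@r3\nTTTTACGT\n+\nIIIIIIII\n"
)

# Steps are chained directly; each record is handed on as soon as it is read.
pipeline = drop_matching(drop_short(read_fastq(raw), min_len=5), ["TTTT"])
kept = [rec.header for rec in pipeline]
print(kept)
```

Because each step is a generator, only one record is in flight at a time, which is the property that makes chaining steps cheap for large read sets.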
The pipeline is organised into three successive phases, each of which can be run individually or in tandem:
- Preprocessing. Reads are first screened with FastQC. Single-base sequencing errors are corrected with Rcorrector (reads flagged as “unfixable” are removed), synthetic adapter sequences are trimmed with Trimmomatic when detected, foreign reads are filtered out via the k-mer-based taxonomic classification of Kraken2, and overrepresented sequences identified by a second FastQC pass are removed.
- Assembly. Cleaned reads are assembled de novo into a preliminary set of contigs with Trinity.
- Postprocessing. Chimeric contigs are detected via a BLASTX-compatible DIAMOND alignment against a user-supplied reference proteome and removed following the Yang & Smith (2013) procedure. Read support is quantified with Salmon, transcripts are clustered with Corset to extract a gene-level approximation of the data, and a final BLASTP-compatible DIAMOND alignment guides TransDecoder in predicting the final coding regions and peptides. Optional functional annotation is performed with HMMER against the PANTHER peptide database.
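The three phases and their steps can be summarised as a small data structure. This is a hedged sketch for orientation only: the phase keys (`preprocess`, `assemble`, `postprocess`), the `PIPELINE` table, and the `plan` helper are hypothetical names chosen for illustration and do not reflect the Semblans command-line interface or configuration format. The step descriptions restate the list above and run no tools.

```python
from typing import Dict, List, Sequence

# Ordered phases and steps, restated from the description above.
# Keys and layout are assumptions made for this sketch.
PIPELINE: Dict[str, List[str]] = {
    "preprocess": [
        "FastQC: initial read screening",
        "Rcorrector: correct single-base errors, remove unfixable reads",
        "Trimmomatic: trim adapter sequences when detected",
        "Kraken2: filter out foreign reads",
        "FastQC: identify overrepresented sequences for removal",
    ],
    "assemble": [
        "Trinity: de novo assembly into preliminary contigs",
    ],
    "postprocess": [
        "DIAMOND (BLASTX-compatible): detect chimeric contigs for removal",
        "Salmon: quantify read support",
        "Corset: cluster transcripts to gene level",
        "DIAMOND (BLASTP-compatible) + TransDecoder: predict coding regions",
        "HMMER vs PANTHER: optional functional annotation",
    ],
}


def plan(phases: Sequence[str]) -> List[str]:
    """Return the steps for the requested phases, always in pipeline order."""
    unknown = set(phases) - set(PIPELINE)
    if unknown:
        raise ValueError(f"unknown phase(s): {sorted(unknown)}")
    # Iterating over PIPELINE (insertion-ordered) keeps the canonical order
    # regardless of the order in which phases were requested.
    return [
        step
        for name, steps in PIPELINE.items()
        if name in phases
        for step in steps
    ]


# One phase on its own, or several in tandem.
assembly_only = plan(["assemble"])
pre_and_post = plan(["postprocess", "preprocess"])
```

Requesting `["postprocess", "preprocess"]` still lists the preprocessing steps first, mirroring the fixed order of the phases even when only some of them are run.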
In benchmarking, Semblans produced higher-quality assemblies, as measured by TransRate score, for 98 of the 101 short-read runs tested. The runs were drawn from the 1KP project and from a Caryophyllales-focused study.
For documentation, source code, and binaries, see the project on GitHub and the project wiki.
Citing Semblans
If you use any of the transcriptomes listed on this site, please cite the Semblans paper:
Miles D Woodcock-Girard, Eric C Bretz, Holly M Robertson, Karolis Ramanauskas, Jarrad T Hampton-Marcell, Joseph F Walker, Semblans: automated assembly and processing of RNA-seq data, Bioinformatics, Volume 41, Issue 1, January 2025, btaf003, https://doi.org/10.1093/bioinformatics/btaf003
- Publication: doi.org/10.1093/bioinformatics/btaf003
- Source code: github.com/gladshire/Semblans
- Documentation: github.com/gladshire/Semblans/wiki