Introduction to de novo transcriptome assembly
De novo transcriptome assembly using RNA-seq is an efficient way to gather sequence-level information on transcripts, estimate expression levels and identify SNPs. It has critical applications in fields such as plant breeding and lncRNA research, where reference genome sequences are absent, incomplete, inadequately annotated or impractical to obtain because of tetra- or hexaploidy. In these cases, transcriptomics is an efficient way to gain insight into the genome: RNA is extracted from various tissues and conditions, followed by reverse transcription and cDNA synthesis, NGS library preparation and sequencing. De novo assembly of the sequencing reads then provides contigs, or candidate transcripts. Reads can be mapped back to the assembled transcriptome to estimate gene expression levels, call variants and collect read-mapping statistics. These statistics (proportion of reads mapped, broken read pairs, etc.) serve as quality metrics of the de novo assembly.
Figure 1. RNA-seq workflow for de novo assembly of transcriptomes with applications in plant breeding, lncRNA identification and virome metatranscriptomics.
Applications in plant genomics, lncRNA discovery in zebrafish and virus discovery in metatranscriptomics of the microbiome in peatland
For plant breeders pursuing genotyping-by-sequencing approaches, establishing de novo transcriptome assemblies can be a fruitful strategy, as it generates in-depth knowledge of the germplasm transcriptome(s) for selecting parents for new crosses and introgressing novel alleles from exotic germplasms. It is also a cost-effective way to discover SNPs in breeding programs.
Vendramin et al. (2019) prepared cDNA libraries from four tissues and, after sequencing and removing reads originating from E. coli, mitochondria or chloroplasts, built de novo assemblies from the pooled total of 384 million reads from these four libraries. QIAGEN CLC and Velvet-Oases were benchmarked against each other at various parameter settings, and completeness of the assemblies was assessed by comparison with independent samples of wheat genes. To quote the authors: “The assembly performed with CLC with k-mer 64 and all four tissues together was selected due to the best overall features including N50, contig length and total size assembled. […] One of the most commonly used tools, Trinity, was tested as well; however, after a few tests it was abandoned since the results obtained did not significantly improve the quality of the assembly, coupled with an unreasonable request of resources.”
Honaas et al. (2016) conducted a detailed comparison of transcriptome assemblies for Arabidopsis thaliana and Oryza generated using six de novo assemblers: QIAGEN CLC, Trinity, SOAPdenovo-Trans, Oases, ABySS and NextGENe. A careful evaluation revealed that the Trinity, QIAGEN CLC and SOAPdenovo-Trans assemblers were equivalent to one another and superior to Oases, ABySS and NextGENe.
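Benchmarks like the ones above usually compare contiguity statistics such as N50, which the Vendramin et al. quote names as a selection criterion. As a minimal sketch, assuming the common definition of N50 (the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled bases), it can be computed as follows. The function name n50 and the toy lengths are illustrative and do not come from the cited studies.

```python
def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running_total = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length
    raise ValueError("N50 is undefined for an empty assembly")


# Toy assembly with five contigs (hypothetical lengths).
toy_n50 = n50([100, 200, 300, 400, 500])
```

N50 rewards long contigs but says nothing about correctness, which is why the studies above also look at completeness and read mapping.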
In a compelling case of inadequate genome annotation, Valenzuela-Muñoz et al. (2019) identified a total of 12,165 putative lncRNA sequences in the zebrafish kidney transcriptome. QIAGEN CLC was used for de novo assembly of RNA-seq data from SVCV-challenged wild-type and rag1 mutant zebrafish to identify candidate lncRNAs involved in innate immunity in vertebrates. This approach allows non-annotated transcripts to be identified while differential gene expression is measured in the conditions of interest. GO terms of nearby protein-coding genes were used to infer gene ontology by proxy.
Stough et al. (2018) used metatranscriptomics to describe the diversity and activity of viruses infecting microbes within the Sphagnum peat bog. RNA extraction from Sphagnum plant stems, reverse transcription, library construction and sequencing were followed by de novo assembly into metatranscriptomes using QIAGEN CLC. Contigs were screened for the presence of conserved virus gene markers. With this elaborate bioinformatics pipeline, the authors identified a treasure trove of previously undocumented phages, single-stranded RNA (ssRNA) viruses, nucleocytoplasmic large DNA viruses (NCLDV), and virophage or polinton-like viruses, and also built co-occurrence networks based on expression levels. This innovative approach is likely to be useful for identifying virus diversity and interactions in understudied clades of the microbiome.
Algorithmic steps in de novo transcript assembly
The de novo assembler of QIAGEN CLC Genomics Workbench uses de Bruijn graphs to represent overlapping reads, a common approach for short-read de novo assembly that handles large numbers of reads efficiently. Two parameters govern the construction and resolution of the de Bruijn graph, and both can be adjusted to accommodate assumptions about sequencing errors and repeat sizes: the k-mer size for graph construction and the bubble size for graph resolution.
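To make the role of the k-mer size concrete, here is a minimal sketch of a textbook de Bruijn graph in which every k-mer in a read becomes an edge between its (k-1)-mer prefix and suffix. This is a generic illustration, not the QIAGEN CLC implementation; the function names build_de_bruijn_graph and branching_nodes and the toy reads are assumptions made for the example.

```python
from collections import defaultdict


def build_de_bruijn_graph(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read; each k-mer is one edge.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def branching_nodes(graph):
    """Nodes with more than one successor, where a bubble or repeat may start."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two toy reads that differ at a single base (a SNP or a sequencing error).
reads = ["ACGTTGCA", "ACGTAGCA"]
graph = build_de_bruijn_graph(reads, k=4)
branches = branching_nodes(graph)
```

In this toy case the two reads share the node CGT, diverge into two paths and meet again at GCA, which is the bubble shape that graph resolution has to deal with. Real assemblers also handle reverse complements, k-mer coverage and error pruning, all of which are omitted here.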
Figure 2. A bubble caused by a heterozygous SNP or a sequencing error.
Figure 3. The central node represents the repeat region that is represented twice in the genome. The neighboring nodes represent the flanking regions of this repeat in the genome.
To strike a balance between error-induced misassembly in long k-mer assemblies and repeat-induced misassembly in short k-mer assemblies, the QIAGEN CLC assembler sets the default k-mer size (word size) automatically as a function of the total amount of input data: the more reads, the higher the k-mer size. For de novo assembly of 1 Gbp of sequence, the default k-mer size is 22 (see the word size 22 row below), and it increases by one for every tripling of the input sequence volume.
word size 12: 0 bp – 30000 bp
word size 13: 30001 bp – 90002 bp
word size 14: 90003 bp – 270008 bp
word size 15: 270009 bp – 810026 bp
word size 16: 810027 bp – 2430080 bp
word size 17: 2430081 bp – 7290242 bp
word size 18: 7290243 bp – 21870728 bp
word size 19: 21870729 bp – 65612186 bp
word size 20: 65612187 bp – 196836560 bp
word size 21: 196836561 bp – 590509682 bp
word size 22: 590509683 bp – 1771529048 bp
word size 23: 1771529049 bp – 5314587146 bp
word size 24: 5314587147 bp – 15943761440 bp
word size 25: 15943761441 bp – 47831284322 bp
word size 26: 47831284323 bp – 143493852968 bp
word size 27: 143493852969 bp – 430481558906 bp
word size 28: 430481558907 bp – 1291444676720 bp
word size 29: 1291444676721 bp – 3874334030162 bp
word size 30: 3874334030163 bp – 11623002090488 bp
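The ranges in the table follow a simple pattern: each upper bound is three times the previous one plus two, starting from 30000 bp for word size 12. The sketch below reproduces the table from that observation. It is inferred from the listed values only and is not the assembler's source code; the function name default_word_size is hypothetical, and continuing the pattern beyond word size 30 up to the stated maximum of 64 is an assumption.

```python
MIN_WORD_SIZE = 12
MAX_WORD_SIZE = 64          # maximum k-mer size stated in the text
FIRST_UPPER_BOUND = 30_000  # upper bound (bp) of the word size 12 row


def default_word_size(total_bp):
    """Return the default word size for a given amount of input sequence (bp)."""
    if total_bp < 0:
        raise ValueError("total_bp must be non-negative")
    word_size = MIN_WORD_SIZE
    upper_bound = FIRST_UPPER_BOUND
    # Each row's upper bound is 3 * previous + 2, as observed in the table.
    # Rows beyond word size 30 are an extrapolation (assumption).
    while total_bp > upper_bound and word_size < MAX_WORD_SIZE:
        upper_bound = 3 * upper_bound + 2
        word_size += 1
    return word_size


# 1 Gbp of input falls in the word size 22 row.
one_gbp_word_size = default_word_size(1_000_000_000)
```

Equivalently, the upper bound for word size k is 30001 × 3^(k − 12) − 1 bp, which is why the default grows by one for every tripling of the input.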
With high-quality short reads such as Illumina reads, long k-mers are preferred, as this maximizes repeat resolution without being prone to error-induced bubbles, as shown in Figure 2. The maximum allowed QIAGEN CLC assembler k-mer size is 64. This limit has been added, as gains in terms of assembly quality with k-mers larger than 64 will be marginal, whereas computational requirements (memory, speed and space) become prohibitive.
Resources for de novo transcript assembly
The de novo assembly algorithm is a generic tool in the QIAGEN CLC Genomics Workbench and is equally applicable to transcript and genome assemblies. QIAGEN CLC Workbenches come with ready-to-use resources for reference (the manual) and quick start (the tutorial), in addition to detailed discussions in the form of whitepapers and application notes.
The de novo assembly section of the QIAGEN CLC Genomics Workbench manual contains a ‘best practices’ section on how to obtain the best results by 1) preparing high-quality input data and monitoring quality control output, 2) setting parameters of the de novo assembly algorithm, and 3) evaluating the quality of the assembly and refining it.
The de novo assembly tutorial provides a step-by-step guide to performing de novo assembly, using a bacterial genome as an example. In addition to quality-control monitoring and de novo assembly parameter settings, the tutorial explains the value of mapping reads back to the contigs. This allows the read mapping to be interrogated for ‘broken pairs’: paired-end reads that originate from the same molecule in the library but end up in separate contigs, indicating problems with those contigs.
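The read-mapping metrics described here can be summarized with a few counts. The sketch below is a simplified illustration that uses the definition of a broken pair given above (both mates mapped, but to different contigs). The PairMapping record and the mapping_summary function are hypothetical and are not part of any QIAGEN CLC interface; a real mapper's definition of broken pairs may also consider mate distance and orientation.

```python
from typing import NamedTuple, Optional


class PairMapping(NamedTuple):
    """Contig each mate of a read pair mapped to (None if unmapped)."""
    mate1_contig: Optional[str]
    mate2_contig: Optional[str]


def mapping_summary(pairs):
    """Return the fraction of mapped reads and the fraction of broken pairs."""
    if not pairs:
        raise ValueError("at least one read pair is required")
    mapped_reads = sum(
        (p.mate1_contig is not None) + (p.mate2_contig is not None) for p in pairs
    )
    # Broken pair (simplified): both mates mapped, but to different contigs.
    broken_pairs = sum(
        1
        for p in pairs
        if p.mate1_contig is not None
        and p.mate2_contig is not None
        and p.mate1_contig != p.mate2_contig
    )
    return {
        "mapped_read_fraction": mapped_reads / (2 * len(pairs)),
        "broken_pair_fraction": broken_pairs / len(pairs),
    }


# Toy data: one proper pair, one broken pair, one half-mapped, one unmapped.
toy_pairs = [
    PairMapping("contig_1", "contig_1"),
    PairMapping("contig_1", "contig_2"),
    PairMapping("contig_3", None),
    PairMapping(None, None),
]
summary = mapping_summary(toy_pairs)
```

A low mapped fraction or a high broken-pair fraction suggests revisiting the assembly parameters or the input data quality.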
The CLC de novo assembly whitepaper contains the details of the implementation of the algorithm, in addition to many benchmarks.
The de novo assembler tool of the QIAGEN CLC Genomics Workbench is an easy-to-use, versatile and computationally efficient tool with applications in transcript assembly for plant genomics, and discovery of new lncRNAs and RNA viruses.
Honaas, L.A. et al. (2016) Selecting superior de novo transcriptome assemblies: Lessons learned by leveraging the best plant genome. PLoS ONE 11(1): e0146062.
Stough, J. et al. (2018) Diversity of active viral infections within the Sphagnum microbiome. Appl Environ Microbiol 84(23): e01124-18.
Valenzuela-Muñoz, V. et al. (2019) Comparative modulation of lncRNAs in wild-type and rag1-heterozygous mutant zebrafish exposed to immune challenge with spring viraemia of carp virus (SVCV). Sci Rep 9: 14174.
Vendramin, V. et al. (2019) Genomic tools for durum wheat breeding: de novo assembly of Svevo transcriptome and SNP discovery in elite germplasm. BMC Genomics 20: 278.