During the current pandemic, continually monitoring viral genomes for new mutations has become fundamental to guiding decisions. The combined efforts of labs across the world have generated enormous amounts of SARS-CoV-2 sequencing data, which must be analyzed to place them in the broader context of the pandemic.
QIAGEN has several resources to support SARS-CoV-2 data analysis. These include the CoV-2 Insights Service for genomic surveillance, which offers full bioinformatics analysis for QIAGEN’s QIAseq SARS-CoV-2 Primer Panel, Ion Torrent’s Ion AmpliSeq SARS-CoV-2 Research Panel or the Illumina panels. However, if your panel data is not currently supported by this prebuilt solution, don’t worry. You can easily analyze any panel data by creating a simple workflow in QIAGEN CLC Genomics Workbench.
Here we show an example of building a workflow to process the long reads generated by Oxford Nanopore Technologies sequencing, using QIAGEN CLC Genomics Workbench with the Long Read Support plugin. The workflow is simple and processes the reads through to variant calls. Using sequencing data from the University of Exeter (Baker et al., 2020), we provide an example analysis by examining the mutation signatures at different time points in the pandemic.
Trim and map reads, and call and filter variants
Download reference data as a GenBank file from NCBI and extract annotations using the ‘Convert to Tracks’ tool.
Figure 1 shows an example of a simple workflow, which consists of the following steps:
- Trim Reads: Here, we trim for quality. You can optionally remove adapters, depending on the sequencing protocol of your data.
- Map Long Reads to Reference: The workflow uses the default settings and reference genome MN908947.3.
- Trim primers of mapped reads: Here, we use ‘Trim Primers and Their Dimers from Mapping’ to un-align the primer part of the mapped reads. We use the primer file associated with the ARTIC protocol (https://artic.network).
- Variant calling: For this, we use ‘Fixed Ploidy Variant Detection’ with ploidy = 1. We use this variant caller instead of the ‘Low-Frequency Variant Detection’ tool because of the high error rate of Nanopore reads. For the same reason, we set the minimum coverage to 30 and the minimum count to 5.
- Remove frameshifting indels and low-frequency single nucleotide variants (SNVs): Three rounds of ‘Filter on Custom Criteria’ identify the indels to discard: variants of type Insertion or Deletion, with a length of 1, 2, 4 or 5 (lengths that shift the reading frame), and with coverage below 30x. We then use ‘Filter against Known Variants’ to remove these from the final variant track. Finally, we use ‘Remove Marginal Variants’ to remove SNVs with a frequency below 70% and quality below 10.
- The variant tracks from all input samples are finally collected and compared in a track list.
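The filtering logic in the last steps can be sketched outside the Workbench to make the criteria explicit. The following is a minimal illustration in Python, not the Workbench implementation: the `Variant` class, its field names and the toy variant calls are hypothetical, and the sketch assumes that an SNV failing either the frequency or the quality threshold is dropped, which is one reading of the step above.

```python
from dataclasses import dataclass
from typing import List

# Indel lengths treated as frameshifting in the workflow description.
FRAMESHIFT_LENGTHS = {1, 2, 4, 5}


@dataclass(frozen=True)
class Variant:
    """Hypothetical minimal variant record (not a CLC data type)."""
    position: int     # 1-based reference position
    ref: str          # reference allele, "" for an insertion
    alt: str          # alternative allele, "" for a deletion
    coverage: int     # read depth at the site
    frequency: float  # allele frequency in percent
    quality: float    # average base quality

    @property
    def indel_length(self) -> int:
        # 0 for substitutions, otherwise the number of inserted/deleted bases.
        return abs(len(self.ref) - len(self.alt))

    @property
    def is_snv(self) -> bool:
        return len(self.ref) == 1 and len(self.alt) == 1


def is_flagged_indel(v: Variant, max_coverage: int = 30) -> bool:
    """Indel of frameshifting length with coverage below the threshold."""
    return v.indel_length in FRAMESHIFT_LENGTHS and v.coverage < max_coverage


def is_marginal_snv(v: Variant, min_frequency: float = 70.0,
                    min_quality: float = 10.0) -> bool:
    """SNV failing the frequency or quality threshold (assumed 'either')."""
    return v.is_snv and (v.frequency < min_frequency or v.quality < min_quality)


def filter_variants(variants: List[Variant]) -> List[Variant]:
    """Keep variants that are neither flagged indels nor marginal SNVs."""
    return [v for v in variants
            if not is_flagged_indel(v) and not is_marginal_snv(v)]


# Toy calls for illustration only; these are not real SARS-CoV-2 variants.
calls = [
    Variant(100, "A", "G", 120, 98.0, 35.0),    # kept: well-supported SNV
    Variant(200, "C", "T", 80, 40.0, 30.0),     # dropped: frequency below 70%
    Variant(300, "AT", "", 25, 90.0, 30.0),     # dropped: 2 bp deletion, low coverage
    Variant(400, "GCA", "", 150, 95.0, 30.0),   # kept: in-frame 3 bp deletion
]
kept = filter_variants(calls)
print([v.position for v in kept])
```

Keeping the thresholds as function parameters mirrors the way they are exposed as tool settings in the workflow, so they can be adjusted for data with a different error profile.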
Visualizing mutations in a track list
The called variants can be visualized in a track list using the reference genome and the variant tracks. This view makes it easy to monitor new mutations. An amino acid track helps distinguish synonymous from non-synonymous mutations. In Figure 2, we show a subset of nine variant tracks from samples collected at various time points in the pandemic. The tracks span from March 2020 to December 2020 and are sorted chronologically by sample collection date from top to bottom. Here, we can see that variants accumulate over time.
The latest data set from December 12, 2020 is a sequencing run of the B.1.1.7 strain. We identify the strain by adding the amino acid changes to the track list. In Figure 3, two of the variants characteristic of this strain can be seen, namely N501Y and P681H in the spike protein.
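To illustrate how amino acid labels such as N501Y relate to nucleotide coordinates on the reference, here is a small sketch in Python. It is not part of the Workbench workflow. The spike coding-region coordinates (21563 to 25384 on MN908947.3) and the nucleotide positions used in the example (23063 for N501Y, 23604 for P681H) are assumptions taken from the commonly reported annotation rather than from this article, and the function name is hypothetical.

```python
from typing import Tuple

# Assumed 1-based, inclusive coordinates of the spike (S) coding region
# on reference MN908947.3; verify against your own annotation track.
SPIKE_CDS_START = 21563
SPIKE_CDS_END = 25384


def spike_residue(genome_pos: int) -> Tuple[int, int]:
    """Map a 1-based genome position to (residue number, position in codon).

    Both returned values are 1-based. Raises ValueError if the position
    lies outside the assumed spike coding region.
    """
    if not SPIKE_CDS_START <= genome_pos <= SPIKE_CDS_END:
        raise ValueError(f"position {genome_pos} is outside the spike CDS")
    offset = genome_pos - SPIKE_CDS_START
    return offset // 3 + 1, offset % 3 + 1


# Positions commonly reported for the two B.1.1.7 changes discussed above.
print(spike_residue(23063))  # residue 501, first codon position
print(spike_residue(23604))  # residue 681, second codon position
```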
We visualize the viral evolution in a SNP tree, using one of the oldest samples (2020-03-25) as the root.
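As a rough intuition for what a SNP tree summarizes, the sketch below counts pairwise differences between the variant sets of samples. This is only an illustration in Python and is not the algorithm the Workbench uses to build the tree. The sample names and variants are made-up toy data, and `snp_distance` is a hypothetical helper.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

# A variant is identified here by (reference position, alternative allele).
VariantKey = Tuple[int, str]


def snp_distance(a: Set[VariantKey], b: Set[VariantKey]) -> int:
    """Number of variants present in exactly one of the two samples."""
    return len(a ^ b)


# Toy data for illustration only; not taken from the sequencing runs above.
samples: Dict[str, Set[VariantKey]] = {
    "early": {(100, "T")},
    "mid": {(100, "T"), (200, "G")},
    "late": {(100, "T"), (200, "G"), (300, "A")},
}

# Pairwise distances between all samples, keyed by sorted sample-name pairs.
distances = {
    (x, y): snp_distance(samples[x], samples[y])
    for x, y in combinations(sorted(samples), 2)
}

for pair, dist in distances.items():
    print(pair, dist)
```

In this toy example the sample with the fewest variants sits closest to the reference, which is why an early sample is a natural choice of root.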
As you can see, constructing a workflow for the analysis of SARS-CoV-2 variants in QIAGEN CLC Genomics Workbench is quick and easy. For a sample of 400,000 reads, the entire workflow shown here runs from input to variant calling in less than 5 minutes on a standard laptop, so it scales efficiently to large numbers of samples.
Baker, Dave J. et al. (2020) CoronaHiT: Large scale multiplexing of SARS-CoV-2 genomes using Nanopore sequencing. bioRxiv 2020.06.24