Food and feed safety is a main concern for food authorities, centers of disease control, departments of agriculture and public health laboratories, which surveil and act on epidemiological outbreaks of foodborne pathogens such as Salmonella, Listeria, Vibrio, E. coli, Shigella, Campylobacter and Cronobacter reported by hospitals and doctors. Typically, the detective work involved in tracing an outbreak includes examining pre-infection food intake of patients, drinking water, supermarket receipts and social media postings of dining-out events. Laboratory steps involve culturing bacteria from suspected sources using specialized bacterial growth media to isolate the causal agent, followed by strain typing. Once the source of contamination has been identified, it is eliminated from the food chain, often through a product recall by the company producing the food. At this stage, the damage has been done, and the impacts are wide. Consumers have been harmed, in the worst cases fatally. Brand reputation and immediate financial returns are diminished. A lot of food is wasted. For this reason, there is a strong industry trend to have in-house facilities and expertise to establish bacterial baselines and monitor any deviations from them, enabling early detection and avoiding costly outbreaks. Other use cases include correct labeling and fraud detection of food items.
In addition to culturing and PCR-based identification methods, NGS-based approaches to sample characterization are gaining traction, namely whole genome sequencing of isolates and taxonomic profiling of bacterial communities. Food quality laboratories are now routinely equipped with desktop sequencing machines from Illumina or Ion Torrent and portable devices from Oxford Nanopore to provide the sequences.
Taxonomic profiling of the bacterial community involves sampling and sequencing DNA, either directly using whole metagenome shotgun approaches or after an intermediate PCR amplification step targeting the 16S rRNA genes. The bioinformatics pipelines for the NGS data analysis vary according to the approach taken (Jagadeesan et al., 2019).
Whole genome sequencing
Whole genome sequencing (WGS) of isolates consists of quality control, read trimming and assembly, bacterial characterization, strain typing, antimicrobial resistance characterization, variant calling, phylogenetic analysis and visualization tasks.
A recent German consortium effort for genome-based surveillance of Salmonella enterica isolates using Illumina sequencing technology lists 15 open-source bioinformatics software tools needed for the WGS analysis, whereas a recent paper by the Institute of Food Safety and Analytical Sciences, Nestlé Research, lists 13 tools for their corresponding pipeline. Web-based bioinformatics services tailored to the NGS analysis of food-borne pathogens include the US-based GenomeTrakr and the Danish Evergreen pipelines. For food scientists and microbiologists who are not bioinformaticians, it is time-consuming and impractical to learn these programs, let alone install, tie together, version-control and maintain them. A single, integrated platform that is easy to use, install and maintain is much preferred. QIAGEN CLC Genomics Workbench Premium has tailored tools for all steps along the pipeline and is designed for bench scientists to use without bioinformatics expertise. Workflows that tie these steps together are also available, so that execution takes only a few mouse clicks in the graphical user interface. Scaling to enterprise levels is relatively straightforward with the QIAGEN CLC Genomics Server or QIAGEN CLC Genomics Cloud Engine software.
Figure 1. A schematic workflow for the analysis of NGS reads generated by whole genome sequencing of isolates.
The tutorial “Typing and Epidemiological Clustering of Common Pathogens” includes an example workflow for analyzing NGS data from isolated and cultivated bacterial samples using QIAGEN CLC Genomics Workbench. Using Illumina data from 47 cultured Salmonella enterica isolates, the workflow identifies the best matching reference and its taxonomy, performs NGS-based multilocus sequence typing (MLST), finds antimicrobial resistance genes, identifies potential contaminants in a sample and performs outbreak analysis based on SNP trees. The databases needed for workflow execution are also provided and include Salmonella and Staphylococcus genome references, MLST schemes and antimicrobial resistance gene databases. The workflow can easily be adapted to other bacterial species or modified to perform other tasks or search additional databases. The tutorial also demonstrates how to work with many samples to create both k-mer trees and SNP trees and display these in the context of metadata. Metadata can be added and displayed on trees as described in the “Phylogenetic Trees and Metadata” tutorial.
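To give a feel for what a k-mer tree is built from, the following minimal Python sketch computes pairwise distances between sequences from their shared k-mer content. It is an illustration only, not the algorithm used by QIAGEN CLC Genomics Workbench: the Jaccard distance, the value of k, the function names and the toy sequences are all assumptions made for this example.

```python
from itertools import combinations


def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A & B| / |A | B|; 0.0 means identical k-mer content."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(genomes: dict, k: int = 4) -> dict:
    """Distance for every pair of named sequences (names sorted alphabetically)."""
    profiles = {name: kmers(seq, k) for name, seq in genomes.items()}
    return {
        (x, y): jaccard_distance(profiles[x], profiles[y])
        for x, y in combinations(sorted(profiles), 2)
    }


# Hypothetical toy sequences standing in for assembled isolate genomes.
toy_genomes = {
    "isolate_A": "ACGTACGTGGA",
    "isolate_B": "ACGTACGTGGA",
    "isolate_C": "TTTTCCCCAAAA",
}
distances = pairwise_distances(toy_genomes, k=4)
for pair, d in distances.items():
    print(pair, round(d, 3))
```

A distance matrix of this kind can be passed to any standard tree-building method. Real genomes call for a much larger k and usually a sampled or hashed subset of k-mers to keep memory use manageable.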
Taxonomic profiling of bacterial communities
For monitoring microbial communities along the food processing chain, the bacterial isolation and genome typing workflow is often impractical, as the largely manual laboratory process of sample culturing does not scale well and is strongly biased toward identifying the “usual suspects” among species that can be cultured. In contrast, culture-independent approaches permit high-throughput automation while also providing unbiased information on the microbial composition in the samples. Two approaches are widely used: amplicon-based profiling and whole shotgun metagenomics.
Amplicon-based profiling is based on sequencing highly conserved regions of bacterial genomes at the 16S rRNA locus (ITS for fungi), clustering the resulting NGS reads into pseudo-species called Operational Taxonomic Units (OTUs), and computing the abundance of each OTU. In reference-based OTU clustering, a database provides the taxonomy assignment for the OTUs. OTUs can also be constructed de novo from reads without a valid match in the reference database, providing evidence for as yet unknown bacterial (or fungal) species. The PCR amplification ensures a highly sensitive assay. With relatively few sequences, a representative and reproducible taxonomic profile of the samples can be obtained, making this approach highly cost-effective and scalable. The tutorial “OTU Clustering Using Workflows” provides a workflow for analyzing NGS data from soil samples using QIAGEN CLC Genomics Workbench and visualizing the results using zoomable sunburst and bar chart plots.
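The clustering and abundance steps can be sketched in a few lines of Python. This is a deliberately simplified, hypothetical example and not the method implemented in any particular tool: it assumes reads of equal length that are already aligned, measures identity as the fraction of matching positions, and uses a 90% threshold only so that the toy 10-base reads behave sensibly (97% is a common choice for real 16S data).

```python
from collections import Counter


def identity(a: str, b: str) -> float:
    """Fraction of matching positions; assumes equal-length, pre-aligned reads."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_otu_clustering(reads: list, threshold: float = 0.97) -> dict:
    """Assign each read to the first centroid within the threshold,
    otherwise start a new OTU with that read as its centroid."""
    centroids = []
    counts = Counter()
    # Visit the most frequent unique sequences first so they become centroids.
    for seq, n in Counter(reads).most_common():
        for centroid in centroids:
            if identity(seq, centroid) >= threshold:
                counts[centroid] += n
                break
        else:
            centroids.append(seq)
            counts[seq] += n
    return dict(counts)


def relative_abundance(counts: dict) -> dict:
    """Convert raw OTU read counts into fractions that sum to 1."""
    total = sum(counts.values())
    return {otu: n / total for otu, n in counts.items()}


# Hypothetical toy reads: one variant differs from the main sequence by one base.
toy_reads = ["ACGTACGTAC"] * 3 + ["ACGTACGTAT"] + ["TTTTTTTTTT"] * 2
otu_counts = greedy_otu_clustering(toy_reads, threshold=0.9)
otu_abundance = relative_abundance(otu_counts)
print(otu_counts)
```

In a reference-based setting, the centroids would first be seeded from the reference database, and only reads that match none of them would open new, unnamed OTUs.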
Whole shotgun metagenomic taxonomic profiling
A more direct approach that does not rely on PCR (and hence avoids many potential PCR-associated biases) is based on whole shotgun sequencing of metagenomic DNA followed by taxonomic profiling. This is done by mapping the reads to a representative microbiome reference database and reporting the taxonomic levels of the references to which reads map, with the percentage of reads mapped to a given reference serving as a proxy for the abundance of that species in the microbiome. Evidence for unknown microbial species is contained in the reads that do not match the reference database(s). Metagenomic assembly and binning of such reads allows for the construction of metagenome-assembled genomes (MAGs) that can be incorporated into one’s reference database and serve as quality markers.
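The "percentage of mapped reads as abundance proxy" idea can be illustrated with a short Python sketch. It assumes the read mapping has already been done by some aligner and that each read comes with the identifier of the reference it mapped to (or `None` if it mapped nowhere). The function name, reference identifiers and species labels below are hypothetical.

```python
from collections import Counter
from typing import Optional


def taxonomic_profile(assignments: list, taxonomy: dict) -> tuple:
    """Summarize read-to-reference assignments per taxon.

    assignments: one entry per read, a reference ID or None if unmapped.
    taxonomy:    maps each reference ID to a taxon name.
    Returns (percentage of mapped reads per taxon, number of unmapped reads).
    """
    mapped = [taxonomy[ref] for ref in assignments if ref is not None]
    unmapped = len(assignments) - len(mapped)
    total = len(mapped)
    if total == 0:
        return {}, unmapped
    counts = Counter(mapped)
    profile = {taxon: 100.0 * n / total for taxon, n in counts.items()}
    return profile, unmapped


# Hypothetical reference IDs; two references belong to the same species.
toy_taxonomy = {
    "ref1": "Lactococcus lactis",
    "ref2": "Lactococcus lactis",
    "ref3": "Pseudomonas fluorescens",
}
toy_assignments = ["ref1", "ref1", "ref2", "ref3", None]
profile, n_unmapped = taxonomic_profile(toy_assignments, toy_taxonomy)
print(profile, n_unmapped)
```

The unmapped reads are counted separately rather than discarded, since they are the input for the assembly and binning step that can reveal species missing from the reference database.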
The tutorial “Taxonomic Profiling of Whole Shotgun Metagenomic Data” demonstrates the taxonomic analysis to monitor the effect of antibiotic treatment of two subjects’ gut microbiota in a time series experiment. For metagenomic assembly and binning of contigs, the “QC, Assemble and Bin Pangenomes” workflow template is provided in the software. Constructing and maintaining the databases is explained in the “Creating and using annotated sequences as microbial reference data” tutorial.
A common source of error in whole shotgun metagenomic approaches is reads originating from the “food matrix”. In practice, the bulk of the reads are usually derived from the host genome or, in the case of fermented products, from the starter culture. Hence, a filtering step to remove these should be included in the analysis. The “Taxonomic Profiling” tool includes this optional filter. Beck et al. (2021) list 31 commonly used food and feed “matrix filtering genomes” that should be used as “decoy” reference(s) in this step. Including these matrix references will also reduce false-positive findings and speed up the read mapping step, as exact matches of reads to a reference are found much faster than approximate matches.
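Conceptually, the matrix filter amounts to setting aside every read whose best hit is one of the decoy references before the profile is computed. The Python sketch below shows that idea on toy data. It is not the implementation of the “Taxonomic Profiling” tool, and the read IDs, reference names and function name are hypothetical.

```python
def split_matrix_reads(read_hits: dict, matrix_refs: set) -> tuple:
    """Separate reads whose best hit is a food-matrix (decoy) reference.

    read_hits:   maps read ID to its best-hit reference ID (None if unmapped).
    matrix_refs: reference IDs that represent the food matrix.
    Returns (kept, removed) as two dictionaries with the same structure.
    """
    kept, removed = {}, {}
    for read_id, ref in read_hits.items():
        if ref in matrix_refs:
            removed[read_id] = ref
        else:
            kept[read_id] = ref
    return kept, removed


# Hypothetical best hits for four reads from a dairy sample.
toy_hits = {
    "read1": "Bos_taurus",
    "read2": "Listeria_monocytogenes",
    "read3": None,
    "read4": "Bos_taurus",
}
kept, removed = split_matrix_reads(toy_hits, matrix_refs={"Bos_taurus"})
matrix_fraction = len(removed) / len(toy_hits)
print(sorted(kept), sorted(removed), matrix_fraction)
```

Reporting the fraction of reads removed is a useful sanity check: a sudden drop in the matrix fraction for a given product type may itself indicate a sampling or extraction problem.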
The secret to success
The choice of reference data is key to the success of taxonomic profiling approaches using NGS. If a given species in the microbiome is not represented in the reference data, this will lead to false-negative findings. If a species is absent from the reference data, yet reads originating from it map to the genome of an unrelated species with similar genomic regions, this will lead to false-positive findings. Both can happen if the reference databases used to perform taxonomic profiling are not representative of the habitat studied. In the food safety NGS area, false negatives will result in overlooked problems, and false positives will trigger unnecessary alerts. For this reason, generic reference databases that try to capture the entire (rarefied) tree of life, regardless of habitat, may be a poor choice. Instead, habitat-specific reference databases are becoming the new standard. RVDB for virus references, and ProGenomes2 and MGnify for a wide range of microbial communities, are recent examples of such databases.
Food and feed monitoring laboratories have to set up the laboratory procedures involved in sampling along the production chain, nucleic acid extraction, library preparation and sequencing. The bioinformatics analysis must then be set up correspondingly. QIAGEN CLC Genomics Workbench not only supports this approach but also provides a single point of entry to bioinformatics: a universal and flexible toolset that can be executed as workflows for large-scale analyses, without having to learn how to use, install and maintain various open-source tools and databases.
Use case and example workflow
Scientists at a global dairy company are using QIAGEN CLC Genomics Workbench with Oxford Nanopore sequencing technology to perform whole metagenome shotgun analysis, both to establish a “normal” microbiome community baseline for dairy products and to monitor deviations from it that are associated with food spoilage. Advantages are the availability of a library preparation kit, sequencing technology and plug-and-play software and, most importantly, speed of analysis: turnaround times improve from weeks with culture-based approaches for slow-growing, cold-adapted bacterial species to only a few days. The workflow used is depicted in Figure 2.
Figure 2. The bioinformatics workflow for the analysis of whole metagenome data. The reference data consists of known “food matrix” genomes in addition to food spoilers. The “Not annotated” reads can be de novo assembled and used as queries to find new spoilers or matrix associated reference genomes and included in the reference collection for future use. Strain typing, AMR, virulence and plasmid characterization can be easily plugged in as part of the “Iterate per taxonomy” workflow.
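One simple way to quantify a deviation from a baseline profile is a community dissimilarity measure. The Python sketch below uses Bray-Curtis dissimilarity between two relative-abundance profiles. This is an assumption made for illustration and not necessarily the measure used by the dairy company or by the software. The alert threshold of 0.3, the function names and the species in the toy profiles are likewise hypothetical.

```python
def bray_curtis(p: dict, q: dict) -> float:
    """Bray-Curtis dissimilarity between two abundance profiles.

    0.0 means identical composition, 1.0 means no taxa in common.
    """
    taxa = set(p) | set(q)
    numerator = sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in taxa)
    denominator = sum(p.get(t, 0.0) + q.get(t, 0.0) for t in taxa)
    return numerator / denominator if denominator else 0.0


def deviates_from_baseline(baseline: dict, sample: dict, threshold: float = 0.3) -> bool:
    """Flag a sample whose dissimilarity to the baseline exceeds the threshold."""
    return bray_curtis(baseline, sample) > threshold


# Hypothetical relative abundances (fractions summing to 1).
baseline = {"Lactococcus lactis": 0.9, "Streptococcus thermophilus": 0.1}
normal_sample = {"Lactococcus lactis": 0.9, "Streptococcus thermophilus": 0.1}
suspect_sample = {"Lactococcus lactis": 0.5, "Pseudomonas fluorescens": 0.5}

normal_score = bray_curtis(baseline, normal_sample)
suspect_score = bray_curtis(baseline, suspect_sample)
print(normal_score, suspect_score)
print(deviates_from_baseline(baseline, suspect_sample))
```

In practice, the baseline would be built from many samples per product, and the threshold would be calibrated against the natural batch-to-batch variation rather than chosen up front.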
Barretto et al. (2021) Genome sequencing applied to pathogen source tracking in food industry: Key considerations for robust bioinformatics data analysis and reliable results interpretation. Genes 12, 275.