Manual curation in the age of automation: 4 reasons why COSMIC remains the gold standard database


QIAGEN Digital Insights

As the technology revolution rages on, the age-old battle rears its head: old vs. new, computers vs. people, tried-and-tested vs. innovative. But when it comes to a somatic mutation resource, which one is best for you?

Before delving into details, it’s vital to understand that automation and manual curation might work with the same data, but they play in completely different tournaments. It’s almost like comparing Formula 1 and NASCAR – they’re both about cars, but their rules, quality, audience, and specifications are entirely different.

When scientists, bioinformaticians, and clinicians look for a somatic mutation resource to support their next-generation sequencing (NGS) data analysis and precision oncology activities, the priorities are accuracy, transparency, and flexibility. Only one database ticks all of these boxes, and that’s COSMIC, the Catalogue of Somatic Mutations in Cancer.

On close inspection, it’s clear that COSMIC is the only major player with the scope and breadth to deliver on all three priorities. Our analysis provides four key reasons why manual curation is the ‘gold standard’ and will maintain that place on the podium for a while yet.

4 key reasons why COSMIC remains the gold standard
  1. Accuracy and efficiency-driving features

COSMIC deploys high-precision data curation methods. It comprises information from almost 1.5 million cancer samples, manually curated from more than 28,000 peer-reviewed papers by PhD-level experts with decades of experience. These experts perform exhaustive literature searches to select papers, from which they reorganize, interpret, standardize, and catalog mutation data, phenotype information, and clinical details. To date, manual curation remains the gold standard for associating genotypic and phenotypic data, as it is the most precise approach and delivers higher-quality data.

On the other hand, advanced machine learning tools have been developed to accelerate the retrieval of variant evidence from scientific literature. However, the quality and accuracy of the genomic data they extract sometimes leave much to be desired. For instance, crowdsourcing and artificial intelligence (AI) applications show significant error rates when associating variants with the correct genes. Text-mining approaches also miss a significant number of disease-associated mutations and produce false-positive article associations.

  2. Quality over quantity

In precision oncology, quality is far more important than quantity, which is why we choose manual curation over other options.

Natural language processing (NLP) and machine learning (ML) approaches facilitate “seeding” relationships from articles to describe genotypic and phenotypic relationships. While these approaches let AI-driven databases scale the indexing of PubMed articles, they do not provide the precision needed to curate deep, unstructured biological, phenotypic, and complex clinical data, including graphics, full text, and supplementary material.

However, deep, manual curation does. Consequently, human judgment remains critical to analyzing and capturing complex relationships, interactions, and contradictory evidence. The high-touch, human review process COSMIC employs ensures high accuracy, high specificity, relevance, context, and consistency in data.

  3. Transparency – data you can trust

In COSMIC, every data point is traceable to the source, and data processing is documented. The data sources that feed into COSMIC to characterize cancer samples and mutations include peer-reviewed papers, targeted gene-screening panels, genome-wide screen data, and cancer cell line omics data — all of which users have full access to and can use as preferred.

  4. Manual curation gives users phenotype context

COSMIC’s cancer histology/phenotype classification system is unique and is the world’s most comprehensive cancer phenotype classification linked to somatic mutations. Aside from loosely following the World Health Organization’s (WHO) classification system, COSMIC goes into more detail and anticipates classifications ahead of their approval by the WHO.

COSMIC presents the cancer site and the cancer histology separately, each in four levels: e.g. lung/left lower lobe/ns/ns for the site and carcinoma/adenocarcinoma/ns/ns for the histology.
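For readers who work with these classifications in code, here is a minimal Python sketch of how a four-level path such as the examples above could be represented. It is an illustration only, not COSMIC’s own API or file format: the class name `ClassificationPath`, its methods, and the reading of “ns” as a “not specified” placeholder are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Tuple

# Assumption: "ns" marks a level that is not specified for the sample.
NOT_SPECIFIED = "ns"
# The article states that site and histology are each given in four levels.
LEVELS = 4


@dataclass(frozen=True)
class ClassificationPath:
    """Hypothetical container for one slash-separated, four-level path."""

    levels: Tuple[str, ...]

    @classmethod
    def parse(cls, text: str) -> "ClassificationPath":
        # Split on "/" and normalise whitespace and case for each level.
        parts = tuple(part.strip().lower() for part in text.split("/"))
        if len(parts) != LEVELS:
            raise ValueError(
                f"expected {LEVELS} levels, got {len(parts)}: {text!r}"
            )
        return cls(parts)

    def specified(self) -> Tuple[str, ...]:
        # Keep only the levels that carry real information.
        return tuple(level for level in self.levels if level != NOT_SPECIFIED)


# Site and histology are kept as two separate paths, as in the article.
site = ClassificationPath.parse("lung/left lower lobe/ns/ns")
histology = ClassificationPath.parse("carcinoma/adenocarcinoma/ns/ns")

print(site.specified())       # ('lung', 'left lower lobe')
print(histology.specified())  # ('carcinoma', 'adenocarcinoma')
```

Keeping site and histology as two independent paths mirrors the separation described above, so a query can filter on either dimension without having to parse the other.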


Figure 2. Example of how human curation ensures that all important demographic, tumor and sample data are captured for the database


Bottom line

In clinical genomics, data quality standards are high as outputs are only as reliable as the evidence used to obtain them.

Text-mining tools are helpful in identifying relevant information, but to rely solely on them is to leave the door open to misleading evidence and to missing critical evidence altogether.

COSMIC shuts, locks, and seals that door. A team of expert variant scientists updates COSMIC three times a year. They continually review biomedical literature by hand to classify variants and harmonize differences in nomenclature across gene transcripts. COSMIC’s experts focus on continuous curation and variant reclassification, never relying on updates from external entities. It is a database suited to deep, accurate, and thorough explorations of human genetics.

All of this is why COSMIC is a vital addition to any precision oncology toolbox.


Want to learn more about COSMIC?

With over 71 million somatic mutations, COSMIC is the world’s largest expert-curated somatic mutation database and is trusted by over 20,000 users. Learn more about the industry-leading database here, where you can explore features, watch videos, and request a complimentary demonstration.
