Big data getting bigger


QIAGEN Digital Insights

PLoS Biology has published an interesting paper about big data in genomics from lead author Zachary Stephens, senior author Gene Robinson, and their collaborators at the University of Illinois at Urbana-Champaign and Cold Spring Harbor Laboratory.

The perspective offers a bold vision of how data generation and management in the field are projected to grow. The authors compare genomic data with other domains known for heavy data production (astronomy, Twitter, and YouTube) and offer solid documentation for their theory that, by 2025, genomics could outpace all of them. Check it out: “Big Data: Astronomical or Genomical?”

One of the areas they focused on was data analysis, a category that’s near and dear to us. Stephens et al. call out variant interpretation as one of the most computationally intensive processes for genomic data. Projecting out to the number of genomes that could be available by 2025, they write, “Variant calling on 2 billion genomes per year, with 100,000 CPUs in parallel, would require methods that process 2 genomes per CPU-hour, three-to-four orders of magnitude faster than current capabilities.”
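The throughput figure in that quote is easy to sanity-check with back-of-the-envelope arithmetic. Here is a minimal sketch in Python; it assumes every CPU runs around the clock for a full 365-day year, which is our simplification rather than anything stated in the paper, and the function name is purely illustrative.

```python
# Back-of-the-envelope check of the quoted throughput figure.
# Assumption (ours, not the paper's): all CPUs are busy 24 hours a day,
# 365 days a year, with no downtime or scheduling overhead.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def required_genomes_per_cpu_hour(genomes_per_year, cpus):
    """Return the per-CPU rate needed to keep up with the yearly volume."""
    cpu_hours_per_year = cpus * HOURS_PER_YEAR
    return genomes_per_year / cpu_hours_per_year


# Figures taken from the quote: 2 billion genomes per year, 100,000 CPUs.
rate = required_genomes_per_cpu_hour(2_000_000_000, 100_000)
print(f"{rate:.2f} genomes per CPU-hour")  # prints "2.28 genomes per CPU-hour"
```

Under that assumption the result lands at roughly 2 genomes per CPU-hour, in line with the number the authors cite. Any idle time on the CPUs would push the required rate higher.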

Another great point in the scientists’ perspective was their insistence on data sharing across labs and institutions. “For precision medicine and similar efforts to be most effective, genomes and related ’omics data need to be shared and compared in huge numbers,” the authors write. “If we do not commit as a scientific community to sharing now, we run the risk of establishing thousands of isolated, private data collections, each too underpowered to allow subtle signals to be extracted.”

We heartily support this statement, and are proud to be co-founders of a leading initiative aimed at facilitating this kind of sharing: the Allele Frequency Community (AFC). When we first conceived the community, data sharing was one of our most important goals. That’s why we adopted a share-and-share-alike approach for AFC, letting all scientists use the data as long as they share their own allele frequency data in exchange. This proviso has led to remarkably fast growth for the community, which makes the resource steadily more valuable to everyone using it. We think there are opportunities to use a similar approach for other types of genomic data and hope others are inspired to try it.

Kudos to Stephens et al. for a thought-provoking commentary that has gotten the whole field talking about what the future of genomics might look like!
