When I bought my first laptop in 2005, it came with a free 64MB flash drive*, which I thought was pretty awesome. Given the rate at which genomic data generation has increased in the past decade, the storage capacity of that flash drive is laughable today. In their new PLOS Biology paper, Stephens et al. talk about genomics as a Big Data science, compare it to other Big Data domains (Astronomy, YouTube, and Twitter, specifically), and project where genomics is headed in the next decade in terms of data acquisition, storage, distribution, and analysis.
The amount of genomic data generated globally doubles approximately every seven months. If that historical rate holds, worldwide annual sequencing capacity will exceed 1 zettabase pair (Zbp) before the year 2025. To put that in perspective, 1 Zbp = 1 trillion gigabase pairs = 1,000,000,000,000,000,000,000 base pairs. In comparison, by 2025, the amount of data uploaded to YouTube and Twitter is projected to be a paltry 1-2 exabytes/year and 1.36 petabytes/year, respectively.
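The projection is just compound doubling, and it's easy to sanity-check. A minimal sketch, assuming a steady seven-month doubling time and starting from the paper's estimated 2015 worldwide capacity of roughly 35 petabase pairs per year (the exact starting figure is my assumption for illustration, not the paper's model):

```python
# Back-of-the-envelope projection: exponential growth with a fixed doubling time.
# Starting capacity (~35 Pbp/year in 2015) and the 7-month doubling period are
# taken as given; real growth could be faster or slower.

MONTHS_PER_DOUBLING = 7
START_YEAR = 2015
START_CAPACITY_BP = 35e15  # ~35 petabase pairs per year in 2015 (assumption)

def projected_capacity(year):
    """Base pairs sequenced per year under steady exponential growth."""
    months = (year - START_YEAR) * 12
    return START_CAPACITY_BP * 2 ** (months / MONTHS_PER_DOUBLING)

cap_2025 = projected_capacity(2025)
print(f"Projected 2025 capacity: {cap_2025:.2e} bp/year")
print(f"That is about {cap_2025 / 1e21:.1f} zettabase pairs per year")
```

Ten years is about 17 doublings at this rate, which lands well past 1 Zbp/year, consistent with the paper's "exceeds 1 Zbp" claim.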
Projected data storage requirements for all four Big Data domains reviewed by Stephens et al. are enormous. For genomic data, storage requirements depend in part on the accuracy of the data collected. “For every 3 billion bases of human genome sequence, 30-fold more data (~100 gigabases) must be collected because of errors in sequencing, base calling, and genome alignment. This means that as much as 2–40 exabytes of storage capacity will be needed by 2025 just for the human genomes.” Theoretically, as sequencing technology improves, less storage space will be needed for any given project. Stephens et al. also suggest that analyzing data in real time as they are collected, for example, calling sequence variants or inferring transcript expression levels on the fly, may negate the need to store the raw data altogether.
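The exabyte estimate follows from simple multiplication. A rough sketch of the arithmetic, assuming a 2-bit-per-base encoding (enough for A/C/G/T) and the paper's projected range of 100 million to 2 billion genomes sequenced by 2025; the per-base storage cost is my simplifying assumption:

```python
# Rough arithmetic behind the 2-40 exabyte storage estimate.
# Assumptions: 30x coverage per genome, 2 bits per base stored
# (quality scores and metadata would add more; compression could subtract).

GENOME_SIZE_BP = 3e9   # one human genome, ~3 gigabase pairs
COVERAGE = 30          # ~30-fold redundancy to overcome sequencing errors
BYTES_PER_BASE = 0.25  # 2 bits per base (A/C/G/T), an illustrative choice

raw_bases = GENOME_SIZE_BP * COVERAGE          # ~90-100 gigabases per genome
bytes_per_genome = raw_bases * BYTES_PER_BASE  # ~22.5 GB per genome

# The paper projects 100 million to 2 billion genomes sequenced by 2025:
for genomes in (100e6, 2e9):
    total_eb = genomes * bytes_per_genome / 1e18
    print(f"{genomes:.0e} genomes -> ~{total_eb:.1f} EB")
```

Under these assumptions the totals land at roughly 2 EB on the low end and tens of exabytes on the high end, in line with the quoted 2–40 EB range.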
The distribution of sequence data happens on a scale that ranges from a few hundred base pairs to multi-terabyte data sets. As genomic data accumulate, the bandwidth required to upload and download the data also increases. Stephens et al. suggest that the future of genomics is one in which data storage and analysis will happen remotely in the cloud, obviating the need to move data around. With more and more data moving into cloud storage, security safeguards like authentication and encryption become much more important, particularly for genome sequence data associated with personal health.
Variant calling and whole genome alignment are among the most computationally intensive genomic analyses. For example, whole genome alignment between mouse and human requires about 100 CPU hours. Although computational resources continue to improve, Stephens et al. suggest the real bottleneck in Big Data genome analyses might be the input/output (I/O) hardware that moves data between where it is stored and where it is processed.
The future of genomics and Big Data
In addition to the potential of genomic data to answer consequential questions in evolutionary biology, the next decade will undoubtedly be exciting in terms of how we learn to more efficiently collect, transfer, and analyze Big (genomic) Data.
Because genomics poses unique challenges in terms of data acquisition, distribution, storage, and especially analysis, waiting for innovations from outside our field is unlikely to be sufficient. We must face these challenges ourselves, starting with integrating data science into graduate, undergraduate, and high-school curricula to train the next generations of quantitative biologists, bioinformaticians, and computer scientists and engineers.
Zachary D. Stephens, Skylar Y. Lee, Faraz Faghri, Roy H. Campbell, Chengxiang Zhai, Miles J. Efron, Ravishankar Iyer, Michael C. Schatz, Saurabh Sinha, Gene E. Robinson (2015) Big Data: Astronomical or Genomical? PLOS Biology. DOI: 10.1371/journal.pbio.1002195
*I still have the flash drive! For some reason, I can’t bring myself to part with it, although for obvious reasons I no longer use it.