A new algorithm for processing DNA sequence data, STITCH, could lower costs for studies of genetic variation within species by reconstructing, or “imputing”, the sequences of individual samples within a larger dataset.
The ongoing proliferation of high-throughput (or, ugh, “next generation”) DNA sequencing methods has made it much cheaper and easier to collect data on the DNA sequence variation in a population than it was even a few years ago — but the cost of sequencing can still add up, particularly for samples of hundreds of individuals.
Most HT sequencing systems produce data as millions of short “reads” of DNA sequence, no more than a couple hundred bases long, scattered randomly over the genome. Geneticists often discuss such data in terms of a total volume spread over the length of an entire genome sequence. Sprinkle a dozen gigabases (Gb) of short reads, like snippets of ticker tape, over a 3Gb genome, and they’ll pile up four reads deep. That “depth of coverage” means that, on average, every one of those three billion DNA bases has been sequenced four times. In general, deeper coverage means greater confidence in the genome sequence that emerges from combining those multiple layers of sequence snippets, and better hope of putting together long stretches of continuous sequence. Even though HTS makes DNA sequencing much cheaper on a cost-per-base basis, sequencing to useful depth of coverage across whole genomes adds up quickly.
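The depth-of-coverage arithmetic above is just total read volume divided by genome size. A minimal sketch, with a hypothetical helper name (`mean_depth`) chosen only for illustration:

```python
def mean_depth(total_read_bases: float, genome_size: float) -> float:
    """Average number of times each base is sequenced."""
    return total_read_bases / genome_size


GB = 1e9  # one gigabase

# A dozen gigabases of reads over a 3Gb genome
depth = mean_depth(12 * GB, 3 * GB)
print(f"{depth:.0f}x coverage")  # 4x coverage
```

This is an average only; real reads land unevenly, so some bases get more than four reads and others get none.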
Geneticists have worked out a lot of hacks to reduce those costs for samples of many individuals. Methods like RADseq or targeted sequence capture can identify variable sites across a subset of the genome (selected either randomly, or from regions with particular qualities, like protein-coding genes), but may miss a lot of potentially interesting information. Genotyping arrays can cheaply produce data from many individual samples, but, again, they can only “cover” a small subset of most genomes, and designing an array requires a set of known variable sites, which might only work well for a single population. Pooling population samples can get you estimates of local allele frequencies at many sites with relatively little sequencing cost, but sacrifices individual genotypes. It’s possible to fill in the gaps left by most of these methods, given a set of high-quality sequences that can inform “genotype imputation.” If you’re only starting to understand the scope of variation within a species, though, none of these options gets you truly whole-genome data.
The new program STITCH (Sequencing To Imputation Through Constructing Haplotypes) promises to solve many, if not all, of these issues by imputing missing information from very low coverage sequence data, without the need for a high-coverage reference sample. The STITCH algorithm estimates how many unique haplotypes — continuous DNA sequences — were carried by the ancestors of the individuals sampled by a given dataset, then determines how those ancestral haplotypes were re-mixed by recombination to produce the fragmentary individual sequences in the sample.
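To make the "re-mixed ancestral haplotypes" idea concrete, here is a toy sketch. It is not the STITCH implementation: it assumes the ancestral haplotypes are already known (STITCH has to estimate them), treats each sample as haploid, and uses a plain Viterbi path through a simple copying model. The function name and the `switch_prob` and `error` parameters are made up for illustration.

```python
import math


def impute_haploid(haplotypes, observed, switch_prob=0.01, error=0.01):
    """Toy imputation: find the most likely mosaic of known haplotypes.

    haplotypes: list of equal-length allele strings (at least two of them)
    observed:   dict mapping site index -> allele seen in a read (sparse)
    """
    k = len(haplotypes)
    n_sites = len(haplotypes[0])
    log_stay = math.log(1 - switch_prob)
    log_switch = math.log(switch_prob / (k - 1))

    def log_emit(h, s):
        if s not in observed:  # no read covers this site
            return 0.0
        match = haplotypes[h][s] == observed[s]
        return math.log(1 - error) if match else math.log(error)

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_switch

    # Viterbi forward pass over sites
    score = [log_emit(h, 0) for h in range(k)]
    back = []
    for s in range(1, n_sites):
        new_score, pointers = [], []
        for h in range(k):
            best = max(range(k), key=lambda p: score[p] + log_trans(p, h))
            pointers.append(best)
            new_score.append(score[best] + log_trans(best, h) + log_emit(h, s))
        score = new_score
        back.append(pointers)

    # Trace back the best path of haplotype labels
    state = max(range(k), key=lambda h: score[h])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()

    # Fill in every site from the haplotype chosen at that site
    return "".join(haplotypes[h][s] for s, h in enumerate(path))


ancestors = ["0000000000", "1111111111"]
# Reads cover only two of the ten sites
print(impute_haploid(ancestors, {1: "1", 8: "1"}))  # 1111111111
```

With reads supporting one ancestor at the start and the other at the end, the path switches haplotypes somewhere in the uncovered middle, which is the recombination "re-mixing" in miniature.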
Robert Davies and colleagues, the program’s creators, report that using 0.15x-depth of sequencing on over 2,000 mouse samples, STITCH returned imputed genotype data that agreed with the results of a genotyping array to a correlation of 0.97, and with results from 10x whole-genome sequencing to 0.95 — though the correlation with deeper sequencing falls off for loci with rare alleles. Still, that’s reconstructing whole-genome information from a dataset in which actual sequence data only accounts for a bit more than one of every seven bases of each actual genome in the sample.
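For a rough check on that last figure, assume reads fall uniformly at random across the genome, so the number of reads covering any base is approximately Poisson-distributed with mean equal to the depth. That assumption is mine, not something taken from Davies et al., and the helper name is hypothetical.

```python
import math


def fraction_covered(depth: float) -> float:
    """Expected fraction of bases hit by at least one read (Poisson model)."""
    return 1 - math.exp(-depth)


frac = fraction_covered(0.15)
print(f"{frac:.1%} of bases covered, about 1 in {1 / frac:.1f}")
```

Under this model the covered fraction at 0.15x comes out close to one base in seven, slightly below the naive 15% because some reads overlap.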
STITCH can work with less data than that, too. Davies et al. subsampled their data down to lower per-sample coverage, and continued to get genome-wide correlations to array data of more than 0.9 with coverage as low as 0.06x, as long as they used all 2,000 samples. Generally, reducing the number of unique individuals sequenced had a bigger impact on STITCH’s accuracy than reducing the depth of sequencing coverage, which makes sense: larger samples of individuals increase the odds that all the variable sites in a population are covered in at least one sample, even if the odds are low that any one of those sites is sequenced in any particular individual. That need for large sample sizes might offset the apparent benefits of low per-sample coverage. To cover 2,000 mouse genomes, at about 3Gb apiece, at a depth of 0.06x, we’re still talking about 360Gb of short-read data.
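Both points in that paragraph, the total data volume and the advantage of many samples, can be sketched with a few lines of arithmetic. The second function again assumes uniformly random read placement (a Poisson approximation that is my simplification), and both function names are hypothetical.

```python
import math


def total_sequence_gb(n_samples: int, genome_gb: float, depth: float) -> float:
    """Total short-read data needed, in gigabases."""
    return n_samples * genome_gb * depth


def expected_samples_covering_site(n_samples: int, depth: float) -> float:
    """Expected number of samples with at least one read at a given site."""
    return n_samples * (1 - math.exp(-depth))


print(round(total_sequence_gb(2000, 3, 0.06)), "Gb")  # 360 Gb

for n in (200, 2000):
    hits = expected_samples_covering_site(n, 0.06)
    print(f"{n} samples: ~{hits:.0f} cover any given site")
```

Any single mouse has only about a 6% chance of having a read at a given site, but across 2,000 mice more than a hundred of them should, which is the pooled information the imputation draws on.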
The same authors, and some additional collaborators, took their method for a test-drive in a companion study, using STITCH to run genome-wide association for 200 different traits measured in just shy of 1,900 mice from the test dataset, using the 0.15x sequence dataset. From that data, STITCH imputed almost 6 million single-nucleotide polymorphisms, in the process estimating that the mice in the sample were all descended from individuals carrying just four unique haplotypes — consistent with a recent population bottleneck for the mice, which were obtained from commercial providers. The analysis identified statistically significant quantitative trait loci for 92 of the traits, in many cases mapping to individual genes.
STITCH may, one day, let us tackle big studies of population genetic diversity more quickly and cheaply than we could otherwise, which is precisely what Davies et al. envision. But you can’t just grab a bunch of genetic samples from a brand new study site, sequence them in a single run on an Illumina HiSeq 2500, and feed the results into STITCH. The algorithm looks pretty computationally intensive, and computation time increases with the number of ancestral haplotypes. In species that haven’t seen the same sort of bottleneck as semi-domesticated lab mice (or humans, the source of the other test data set Davies et al. employ), that’s a potential issue. Also, though STITCH could let molecular ecologists leapfrog the intensive, expensive sequencing needed to develop a high-quality haplotype reference panel for every population they study, it still requires a high-quality reference genome. This new method may shorten the road from no genomic resources to large samples of diversity in populations and whole species, but it won’t let you leapfrog all the way in one step.
Davies, R. W., J. Flint, S. Myers, and R. Mott. 2016. Rapid genotype imputation from sequence without reference panels. Nature Genetics, doi: 10.1038/ng.3594
Nicod, J., R. W. Davies, N. Cai, C. Hassett, L. Goodstadt, C. Cosgrove, B. K. Yee, V. Lionikaite, R. E. McIntyre, C. A. Remme, E. M. Lodder, J. S. Gregory, T. Hough, R. Joynson, H. Phelps, B. Nell, C. Rowe, J. Wood, A. Walling, N. Bopp, A. Bhomra, P. Hernandez-Pliego, J. Callebert, R. M. Aspden, N. P. Talbot, P. A. Robbins, M. Harrison, M. Fray, J.-M. Launay, Y. M. Pinto, D. A. Blizard, C. R. Bezzina, D. J. Adams, P. Franken, T. Weaver, S. Wells, S. D. M. Brown, P. K. Potter, P. Klenerman, A. Lionikas, R. Mott, and J. Flint. 2016. Genome-wide association of multiple complex traits in outbred mice by ultra-low-coverage sequencing. Nature Genetics, doi: 10.1038/ng.3595