How many markers does it take to make a dataset “genomic”?

(Image: "One size fits all. Welcome to the '80s." Actually, no. Flickr: Stephan van Es)

A new paper in Ecology Letters by Matthew Fitzpatrick and Stephen Keller proposes to use a class of statistical methods developed for understanding the distribution of species in different environments to understand the distribution of genetic variants in different environments. It’s an interesting idea, and a cool possible point of cross-pollination between the emerging field of landscape genomics and the more established field of community ecology. But there was one sentence that got my dander up:

In analyses we have run on a computer with a 2.95 GHz quad-core processor, [this analysis method] took 0.5 h to analyse 2314 SNP loci, which is not unreasonable compared to other genetic analysis methods for detecting local adaptation.

Half an hour to analyze more than 2,000 genetic markers might sound “not unreasonable” to the authors, but if the computation time necessary for their analysis scales linearly with the number of markers, that implies a run time of 500 hours—almost 21 days—for a 2 million SNP dataset such as the one I’ve been lucky to play with in my postdoctoral research. (And, depending on how exactly the method incorporates data from multiple loci, computation time might increase quite a bit faster than linearly with additional markers.) Yes, we’d hope to have more powerful computing resources for a larger dataset, and maybe the analysis can be parallelized to spread the work across many linked computer processors—but that is still not what I’d call an efficient analysis. For comparison, a much simpler genotype-environment association scan I did with 2 million SNPs in 200 lines of Medicago truncatula took about 8 hours, if I recall correctly.
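The back-of-the-envelope extrapolation above is easy to sketch. A minimal example, assuming (optimistically, as noted) that runtime scales linearly with the number of loci:

```python
# Hedged sketch: extrapolate runtime assuming it scales linearly with loci count.
# Base figures (0.5 h for 2,314 SNPs) are from the quoted paper; the post's
# "500 hours" rounds the denominator down to roughly 2,000 loci, so the exact
# arithmetic here comes out slightly lower (~432 h).

def extrapolate_runtime_hours(base_loci, base_hours, target_loci):
    """Scale an observed runtime linearly with the number of loci analyzed."""
    return base_hours * (target_loci / base_loci)

hours = extrapolate_runtime_hours(base_loci=2314, base_hours=0.5,
                                  target_loci=2_000_000)
print(f"~{hours:.0f} hours, or about {hours / 24:.0f} days")
```

If the method's cost grows faster than linearly in the number of loci, this sketch is a lower bound, not an estimate.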

I don’t necessarily blame Fitzpatrick and Keller, and I don’t mean to single them out. I think this points up a bigger issue that ecological genetics is facing as we enter the era of high-throughput sequencing. Specifically, it’s that genetic datasets that aren’t actually all that big, when you spread them across a whole genome, feel big because they’re still orders of magnitude bigger than anything we’ve been able to collect before. See, for another example, the paper I discussed here recently, which looked for loci that determine survival to adulthood in salmon: more than 5,500 SNPs sounds like a lot, but they’re just a small sample in a 3 billion-nucleotide genome.

To be clear, hundreds or thousands of markers provide very good power for estimating the degree and timing of genetic isolation between populations, or identifying distinct genetic clusters, or reconstructing historical relationships between species. But when you’re “scanning” the genome for loci responsible for a particular phenotype, or under differential selection in different environments, even thousands of markers are going to miss a lot in almost any eukaryotic genome. This is one of the major points made by my postdoctoral mentor, Peter Tiffin, and Jeff Ross-Ibarra in a review article that is now available as a preprint: how densely markers are distributed across the genome, not just the total number of markers, makes a big difference.
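Marker density is easy to make concrete. A rough sketch, using the 3-billion-nucleotide genome from the salmon example as an illustrative yardstick (genome sizes vary widely; the point is only the order of magnitude):

```python
# Rough sketch: average gap between adjacent markers, assuming they are spread
# evenly across a 3 Gb genome (the salmon example's size; illustrative only).

def avg_spacing_kb(genome_bp, n_markers):
    """Mean distance between evenly spaced markers, in kilobases."""
    return genome_bp / n_markers / 1000

GENOME_BP = 3_000_000_000
for label, n in [("~2,300 SNPs", 2314),
                 ("~5,500 SNPs", 5500),
                 ("2 million SNPs", 2_000_000)]:
    print(f"{label}: one marker per ~{avg_spacing_kb(GENOME_BP, n):,.0f} kb")
```

At ~2,300 markers the average gap is over a megabase, versus a couple of kilobases for a 2-million-SNP dataset; whether a given spacing suffices for a scan depends on how far linkage disequilibrium extends around a selected site.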

A 2,000-marker dataset is “genomic” in the sense that it captures the overall demographic history of the sample. But for many “genomic” analyses, what we really care about is the bits of the genome that deviate from that history. To find those we need a comb with many, many more teeth.


Fitzpatrick M.C. and S.R. Keller. 2014. Ecological genomics meets community-level modelling of biodiversity: mapping the genomic landscape of current and future environmental adaptation, Ecology Letters. doi: 10.1111/ele.12376.

Tiffin P, Ross-Ibarra J. 2014. Advances and limits of using population genetics to understand local adaptation. PeerJ PrePrints 2:e488v1 doi: 10.7287/peerj.preprints.488v1.


About Jeremy Yoder

Jeremy Yoder is an Assistant Professor of Biology at California State University, Northridge. He also blogs at Denim and Tweed, and tweets under the handle @jbyoder.