How many markers does it take to make a dataset “genomic”?

[Image caption: "One size fits all. Welcome to the 80's." Actually, no. (Flickr: Stephan van Es)]

A new paper in Ecology Letters by Matthew Fitzpatrick and Stephen Keller proposes using a class of statistical methods developed for understanding the distribution of species in different environments to understand the distribution of genetic variants in different environments. It’s an interesting idea, and a cool possible point of cross-pollination between the emerging field of landscape genomics and the more established field of community ecology. But there was one sentence that got my dander up:

In analyses we have run on a computer with a 2.95 GHz quad-core processor, [this analysis method] took 0.5 h to analyse 2314 SNP loci, which is not unreasonable compared to other genetic analysis methods for detecting local adaptation.

Half an hour to analyze more than 2,000 genetic markers might sound “not unreasonable” to the authors, but if the computation time necessary for their analysis scales linearly with the number of markers, that implies a run time of 500 hours—almost 21 days—for a 2 million SNP dataset such as the one I’ve been lucky to play with in my postdoctoral research. (And, depending on how exactly the method incorporates data from multiple loci, computation time might increase quite a bit faster than linearly with additional markers.) Yes, we’d hope to have more powerful computing resources for a larger dataset, and maybe the analysis can be parallelized to spread the work across many linked computer processors—but that is still not what I’d call an efficient analysis. For comparison, a much simpler genotype-environment association scan I did with 2 million SNPs in 200 lines of Medicago truncatula took about 8 hours, if I recall correctly.
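
That figure is just a straight linear extrapolation from the runtime Fitzpatrick and Keller report; here is a back-of-the-envelope sketch of the arithmetic, assuming (perhaps optimistically) that computation time grows no faster than linearly with the number of loci:

```python
# Back-of-the-envelope extrapolation of the reported runtime, assuming the
# method's computation time scales linearly with the number of SNP loci.
# (If it grows faster than linearly, these figures are underestimates.)

reported_hours = 0.5       # runtime reported by Fitzpatrick & Keller
reported_loci = 2314       # SNP loci in their analysis
target_loci = 2_000_000    # a genome-scale SNP dataset like the Medicago one

hours = reported_hours * target_loci / reported_loci
print(f"~{hours:.0f} hours, or ~{hours / 24:.0f} days")  # ~432 hours, ~18 days

# Rounding the 2,314 reported loci down to an even 2,000 gives the ~500 hour
# (almost 21 day) figure used above; either way, it amounts to weeks per run.
```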

I don’t necessarily blame Fitzpatrick and Keller, and I don’t mean to single them out. I think this points up a bigger issue that ecological genetics is facing as we enter the era of high-throughput sequencing: genetic datasets that aren’t actually all that big when you spread them across a whole genome feel big because they’re still orders of magnitude bigger than anything we’ve been able to collect before. See, for another example, the paper I discussed here recently, which looked for loci that determine survival to adulthood in salmon: more than 5,500 SNPs sounds like a lot, but they’re just a small sample in a 3 billion-nucleotide genome.

To be clear, hundreds or thousands of markers provide very good power for estimating the degree and timing of genetic isolation between populations, or identifying distinct genetic clusters, or reconstructing historical relationships between species. But when you’re “scanning” the genome for loci responsible for a particular phenotype, or under differential selection in different environments, even thousands of markers are going to miss a lot in almost any eukaryotic genome. This is one of the major points made by my postdoctoral mentor, Peter Tiffin, and Jeff Ross-Ibarra in a review article that is now available as a preprint: how densely markers are distributed across the genome, not just their total number, makes a big difference.
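
To put rough numbers on that density point, here is a quick sketch of the sort of back-of-the-envelope calculation involved; the 3-gigabase genome size echoes the salmon example above, but the 10 kb window of useful linkage around each marker is an arbitrary assumption for illustration, not a value from either paper:

```python
# Rough marker-density arithmetic: how much of a genome a marker set can "see",
# assuming each marker is informative about a fixed window of flanking sequence.
# The 3 Gb genome size echoes the salmon example above; the 10 kb window of
# useful linkage disequilibrium is an arbitrary illustrative assumption.

genome_bp = 3_000_000_000   # ~3 billion nucleotides
ld_window_bp = 10_000       # assumed span of useful linkage around each marker

for n_markers in (2_314, 5_500, 2_000_000):
    spacing_kb = genome_bp / n_markers / 1_000                  # average gap between markers
    covered = min(1.0, n_markers * ld_window_bp / genome_bp)    # crude upper bound on coverage
    print(f"{n_markers:>9,} markers: one per ~{spacing_kb:,.1f} kb, "
          f"at most {covered:.1%} of the genome within {ld_window_bp // 1_000} kb of a marker")
```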

A 2,000-marker dataset is “genomic” in the sense that it captures the overall demographic history of the sample. But for many “genomic” analyses, what we really care about is the bits of the genome that deviate from that history. To find those we need a comb with many, many more teeth.

Reference

Fitzpatrick MC, Keller SR. 2014. Ecological genomics meets community-level modelling of biodiversity: mapping the genomic landscape of current and future environmental adaptation. Ecology Letters. doi: 10.1111/ele.12376.

Tiffin P, Ross-Ibarra J. 2014. Advances and limits of using population genetics to understand local adaptation. PeerJ PrePrints 2:e488v1. doi: 10.7287/peerj.preprints.488v1.

About Jeremy Yoder

Jeremy Yoder is an Assistant Professor of Biology at California State University, Northridge. He also blogs at Denim and Tweed, and tweets under the handle @jbyoder.

  • Noah Reid

    I might argue that nothing should be considered “genomic” unless linkage is explicitly dealt with. If you’re applying a method to 2 million SNPs that is fundamentally the same as one you would apply to 200 SNPs, then maybe there’s no need for a term other than “genetic”.

    • Yeah, maybe some sort of standard maximum recombination rate or minimum r^2 between the average pair of markers in the dataset? I could get behind that.
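
      A very rough sketch of what checking a criterion like that might look like, using the mean pairwise r^2 among markers; the random genotype matrix and the 0.2 cutoff here are purely illustrative assumptions, not anything anyone has actually proposed:

```python
import numpy as np

# Sketch of the criterion floated above: mean pairwise r^2 among markers,
# compared against some agreed-on minimum. The random genotypes and the 0.2
# cutoff are illustrative only; for truly unlinked markers the expected r^2
# is roughly 1 / (number of individuals), so this toy dataset should fail.
# (In a real dataset you might restrict to physically neighbouring pairs.)

rng = np.random.default_rng(1)
genotypes = rng.integers(0, 3, size=(200, 500))   # 200 individuals x 500 SNPs, coded 0/1/2

r2 = np.corrcoef(genotypes, rowvar=False) ** 2    # marker-by-marker squared correlations
pairs = r2[np.triu_indices_from(r2, k=1)]         # each pair counted once, diagonal excluded

mean_r2 = pairs.mean()
verdict = "dense enough" if mean_r2 >= 0.2 else "too sparse"
print(f"mean pairwise r^2 = {mean_r2:.3f}; by a (made-up) 0.2 cutoff, {verdict} to call 'genomic'")
```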

  • Todd R. Gack

    You know, 21 days isn’t really all that long. How much time does it take to collect, extract, analyze, etc the samples to get 2 million SNPs in the first place? I don’t see the logic in then using “a simpler method” to analyze those data just because it takes only 8 hours – a savings of a paltry 13 hours over a superior method that provides better inference. And the authors state 30 minutes “is not unreasonable *compared to other genetic analysis methods* for detecting local adaptation”.

    • Okay, so I’ll start by noting that the time difference between 21 days and 8 hours is not 13 hours, but rather more than 20 days. I’ll nevertheless assume you meant what you said in the opening sentence, and respond to that:

      You’re not wrong that 21 days is less time than it took to collect the MHP dataset. But 21 days for one iteration of the analysis is really something of an under-estimate for the time it’d take to perform a study with the method in question and a 2 million SNP dataset, because, unless you are a much cleverer person than I am (and indeed cleverer than any of the other large-dataset-using biologists I know), you are going to run it more than one time.

      That’s because at first you’re learning how to run the analysis, and there are a million ways to screw up just about any large-scale analysis, some of which won’t be apparent until you’ve done all your pilot runs with small sample datasets, think you’re all set, pull the trigger on the 21-day job, and find out, three weeks later, that you screwed up and the results are useless. Then it’s because you finally got it to work but you look at the results and they don’t make any sense so you try it all over again with another environmental dataset. Then it’s because you discover that there’s a systematic bias associated with very rare variants and the only way to remove those from the results set is to create a new filtered input dataset and re-run the whole thing from scratch.

      None of those are issues that I actually know will turn up in an analysis with the method proposed by Fitzpatrick and Keller, because, no, I haven’t tried it! They’re just the kind of things that happen when you’re doing this kind of work. Genomic data analysis is messy and iterative and introducing a step in the process that involves waiting for days is enough of a pain in the butt that I want to be damned sure the method that takes three weeks to run is going to tell me something that I can’t learn from an alternative method that takes less than 2% of that time.

      I do think what Fitzpatrick and Keller describe is interesting and probably worth the effort to make it more efficient—but, yes, based on my reading of the paper, there are multiple comparable “genetic analysis methods for detecting local adaptation” that will work a lot faster, which is exactly the opposite of what that sentence says. Hence my objection.

    • Larry Van Nostren

      Umm…you botched the math there Todd.

  • Well, of course, no one has lost by betting on increasing availability of computational resources, at least over the long term. But that doesn’t mean that computational power isn’t limiting. I still see, for example, studies using Bayesian methods that admit to having estimated parameters from a non-stationary distribution because the algorithm didn’t get to stationarity after a month or more of computing time.

    It is also certainly possible that the computation time for this particular algorithm increases less than linearly with the number of loci in the dataset. (That would make it the first such case in my personal experience!) But we don’t know one way or the other, because whoever edited the paper allowed the authors to make an assertion about what is practical for genome-scale data without actually testing their method on genome-scale data.
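
    Short of demanding a full genome-scale benchmark, one cheap empirical check would be to time the method on nested subsets of loci and fit a power law to the timings; a minimal sketch, where run_analysis is a hypothetical stand-in for whatever method is actually under test:

```python
import time
import numpy as np

# Estimate how runtime scales with the number of loci by timing nested subset
# sizes and fitting log(time) against log(n_loci). run_analysis is a
# hypothetical placeholder; swap in a call to the real method being tested.

def run_analysis(n_loci, n_samples=200):
    genotypes = np.random.rand(n_loci, n_samples)   # stand-in data
    return genotypes.T @ genotypes                  # stand-in work that grows with n_loci

sizes = [1_000, 2_000, 4_000, 8_000, 16_000]
times = []
for n in sizes:
    start = time.perf_counter()
    run_analysis(n)
    times.append(time.perf_counter() - start)

# The slope estimates the scaling exponent: ~1 is linear, ~2 quadratic, etc.
slope, _ = np.polyfit(np.log(sizes), np.log(times), 1)
print(f"estimated scaling exponent: {slope:.2f}")
```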