Isolating isolation by distance

Linanthus parryae population

At its most basic level, population genetics is about looking for patterns. Patterns of discontinuity that might indicate barriers to dispersal. Patterns of association that might indicate local adaptation. Or, better yet, individual loci that violate the patterns seen in the rest of the genome, because those loci are likely to be interesting—i.e., recent targets of natural selection.

So naturally, it should worry us if the patterns we see in our data don’t mean what we think they mean. That’s the problem posed by Patrick Meirmans in a recent perspective article for Molecular Ecology. Meirmans argues that many of the most common methods we use to look for population genetic patterns, and to detect loci that violate those patterns, can be confounded by one of the oldest established patterns of them all: isolation by distance.

Isolation by distance (IBD) is a simple consequence of limited dispersal across space, which Sewall Wright described almost seventy years ago: pairs of populations close to each other will be more genetically similar to each other than populations farther away from each other, not because of any selective need for those genetic similarities, but just because individual critters, or their seeds, or pollen, or larvae are less likely to travel longer distances.

IBD, or population structure? Or both?

There’s a simple, well-known test for IBD: a Mantel test for a significant relationship between the genetic distances between sampling sites and their physical/geographic distances. But Meirmans points out that, generally speaking, once population geneticists perform a Mantel test and find evidence of IBD, we don’t do anything about it. Surveying 72 Molecular Ecology papers published in 2011 that include a Mantel test for IBD, Meirmans found that a large majority detected IBD, and that most of those papers then didn’t use that discovery to inform the rest of the analyses they performed.
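For readers who haven't run one, the logic of a Mantel test fits in a few lines of plain Python. What follows is a minimal sketch on made-up data, not anything taken from Meirmans's paper: eight hypothetical sites along a transect, with genetic distance built as a linear function of geographic distance plus noise. The function names and toy values are assumptions for illustration, and a real analysis would normally lean on an established package.

```python
import math
import random


def upper_triangle(matrix):
    """Flatten the upper triangle (i < j) of a square matrix into a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def mantel(gen, geo, permutations=999, seed=1):
    """One-sided Mantel test for a positive genetic/geographic correlation."""
    rng = random.Random(seed)
    n = len(gen)
    geo_vec = upper_triangle(geo)
    r_obs = pearson(upper_triangle(gen), geo_vec)
    hits = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        # Permute rows and columns together so each site keeps its own distances.
        shuffled = [[gen[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if pearson(upper_triangle(shuffled), geo_vec) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


# Toy data (hypothetical): sites evenly spaced along a line, pure IBD plus noise.
positions = list(range(8))
n = len(positions)
noise = random.Random(42)
geo = [[abs(a - b) for b in positions] for a in positions]
gen = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        gen[i][j] = gen[j][i] = 0.02 * geo[i][j] + noise.uniform(0, 0.02)

r, p = mantel(gen, geo)
print(f"Mantel r = {r:.3f}, permutation p = {p:.3f}")
```

The key design point is that whole sites are shuffled, not individual pairwise distances, because the pairwise values in a distance matrix are not independent of one another.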

That’s a problem because, as Meirmans demonstrates, IBD can be easy to conflate with one of the first patterns most population geneticists test for in a new data set: population structure. He simulates genetic data under two scenarios: one in which populations are clustered in two sharply divided groups, and one in which populations are continuously distributed across a landscape. In the first case, allele frequencies change rapidly at the point of division, as you’d get in real life if there’s some sort of environmental barrier to dispersal or change in the selective environment that means migrants across the line don’t do very well. In the second case, IBD is the only force acting on spatial variation in allele frequencies, and they change in a gentle slope from one end of the landscape to the other.

But these two different scenarios look very similar when viewed through the lens of certain standard population genetics analyses. Mantel tests showed a qualitatively similar profile of genetic correlation with distance in both the sharp-transition landscape and the IBD-only landscape. Likewise, when Meirmans divided the IBD-only landscape into eastern and western regions, then ran an analysis of molecular variance (AMOVA), he found that a significant portion of genetic variation was attributable just to the east-west division. Meirmans doesn’t do a similar demonstration with the popular clustering analysis Structure, though he notes that Structure has been previously reported to be confused by IBD. (I would have liked to see an explicit test of Structure within the same framework as AMOVA, myself.)
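To see why an arbitrary east-west split can soak up so much variation, here is a deliberately stripped-down sketch. It is not an AMOVA and not Meirmans's simulation: it assumes a single locus whose allele frequency follows a smooth linear cline across 20 hypothetical sites (standing in for the IBD-only landscape) and compares it with a hard step (standing in for a barrier), then asks what fraction of the among-site variance in allele frequency falls between the two halves.

```python
def among_group_share(freqs, split):
    """Fraction of among-site variance in allele frequency explained by a two-group split."""
    west, east = freqs[:split], freqs[split:]
    grand_mean = sum(freqs) / len(freqs)
    ss_total = sum((f - grand_mean) ** 2 for f in freqs)
    ss_between = sum(
        len(group) * (sum(group) / len(group) - grand_mean) ** 2
        for group in (west, east)
    )
    return ss_between / ss_total


n_sites = 20

# Assumed IBD-like pattern: frequency rises gently from 0.2 to 0.8, no barrier anywhere.
cline = [0.2 + 0.6 * i / (n_sites - 1) for i in range(n_sites)]

# Assumed barrier pattern: frequency jumps from 0.2 to 0.8 at the midpoint.
barrier = [0.2] * 10 + [0.8] * 10

share_cline = among_group_share(cline, 10)
share_barrier = among_group_share(barrier, 10)
print(f"smooth cline: {share_cline:.2f} of variance falls between halves")
print(f"hard barrier: {share_barrier:.2f} of variance falls between halves")
```

In this toy setup, the barrier-free cline still puts about three quarters of the among-site variance between the two halves, which is the flavor of result that makes a grouping look meaningful when the only thing going on is a gradient.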

So let’s say you’ve collected genetic data from sites on either side of a line you think might be biologically significant—a pretty standard-issue population genetics study. You run your data through Structure, and find two clusters of collection sites that line up pretty well with that Line of Hypothesized Biological Significance. As a followup, you conduct an AMOVA with the collection sites grouped according to their placement by Structure, and you find that the clusters explain a significant fraction of the total genetic variation in your data set. Therefore, you conclude that the LHBS is, in fact, a significant barrier to dispersal.

Except that as we’ve just discussed, everything you’ve just found could be a consequence of simple IBD plus the fact that you’ve structured your sampling so that your LHBS happens to bisect the landscape you’re studying. And just to add to the frustration, even if you’d started out by testing for IBD before you started with all of the tests for population structure, a significant result in a Mantel test for IBD wouldn’t necessarily mean that population structure wasn’t there.

I can’t help but think this puts the working population geneticist in what you might call a Chinatown dilemma.

“It’s IBD. It’s population structure. It’s IBD and population structure!”

Lying outliers

But wait, it gets worse.

Meirmans follows up with simulations of a kind of analysis that’s becoming very popular as big, next-generation sequencing datasets become more and more accessible for people studying non-model organisms (i.e., most of us): outlier analysis. This is the class of analyses where, given population data from hundreds or thousands of loci, we “scan” for loci with an unusually strong association between their alleles and some environmental variable of interest (Meirmans tests the approach implemented in SAM) or for loci that show unusually high or low differentiation (like FDIST, which uses FST as its measure of differentiation). Essentially, both of these approaches assume that if a locus falls outside of the 95% confidence interval (established by a big sample of other loci or coalescent simulations or what have you), then it’s probably in the tail of the distribution because natural selection has been acting on it.

Simulating populations evolving under IBD—without any selection at all—across a map of Scandinavia, Meirmans tested for associations with real climate data for the mapped region. He not only found that loci ended up in the tail of the association distribution (in SAM) or the differentiation distribution (in FDIST)—he found that both analyses identified an excess of loci with p ≤ 0.05. In the case of one SAM analysis, upwards of 30% of the simulated loci had a p-value below the traditional threshold for “significance.” For SAM, this is because the spatial pattern arising from IBD—greater genetic differentiation from one end of the Scandinavian Peninsula to the other—lines up with the major north-south axis along which most major climate variables change across the region. For FDIST, Meirmans attributes the excess of significant loci to violations of the population genetic model underlying the coalescent simulations used to identify outliers.

Living with IBD

So what then should we do, as we sit down to analyze a new table of microsatellite or SNP genotypes from our favorite critters? Meirmans’s advice comes down to “watch out for IBD.” To deal with the Chinatown dilemma, he proposes testing for IBD within population clusters identified by AMOVA or a clustering algorithm, with the caveat that subdividing your data will reduce statistical power. Meirmans also reports that a partial Mantel test can be used to test whether geography (and thereby IBD) contributes to apparent clustering—by testing to see whether an association between a matrix of cluster assignments and genetic distances disappears when controlled for geographic distance.
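The partial Mantel idea can be sketched as follows, again in plain Python on invented data. The assumptions are loud ones: eight hypothetical sites on a transect, two clusters of four, and genetic distances simulated either as pure IBD or as IBD plus a fixed penalty for crossing the cluster boundary. The partial correlation formula is the standard first-order one; the function names and parameter values are illustrative only.

```python
import math
import random


def flatten(matrix):
    """Upper triangle (i < j) of a square matrix as a list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation between x and y after controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def partial_mantel(gen, cluster, geo, permutations=999, seed=1):
    """One-sided test: do clusters predict genetic distance beyond geography?"""
    rng = random.Random(seed)
    n = len(gen)
    c_vec, g_vec = flatten(cluster), flatten(geo)
    r_obs = partial_r(flatten(gen), c_vec, g_vec)
    hits = 1  # the observed arrangement counts as one permutation
    order = list(range(n))
    for _ in range(permutations):
        rng.shuffle(order)
        shuffled = [[gen[order[i]][order[j]] for j in range(n)] for i in range(n)]
        if partial_r(flatten(shuffled), c_vec, g_vec) >= r_obs:
            hits += 1
    return r_obs, hits / (permutations + 1)


# Toy layout (hypothetical): 8 sites on a line, first four in one cluster.
positions = list(range(8))
labels = [0, 0, 0, 0, 1, 1, 1, 1]
n = len(positions)
geo = [[abs(a - b) for b in positions] for a in positions]
cluster = [[int(a != b) for b in labels] for a in labels]  # 1 = different clusters


def simulate(barrier_effect, seed):
    """Genetic distances: IBD slope + optional barrier penalty + noise."""
    rng = random.Random(seed)
    gen = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = 0.02 * geo[i][j] + barrier_effect * cluster[i][j] + rng.uniform(0, 0.02)
            gen[i][j] = gen[j][i] = d
    return gen


ibd_only = simulate(0.0, seed=42)
with_barrier = simulate(0.1, seed=42)

r_simple = pearson(flatten(ibd_only), flatten(cluster))
r_ibd, p_ibd = partial_mantel(ibd_only, cluster, geo)
r_barrier, p_barrier = partial_mantel(with_barrier, cluster, geo)
print(f"IBD only, ignoring geography: r = {r_simple:.2f}")
print(f"IBD only, controlling for geography: r = {r_ibd:.2f} (p = {p_ibd:.3f})")
print(f"IBD + barrier, controlling for geography: r = {r_barrier:.2f} (p = {p_barrier:.3f})")
```

In the IBD-only case the clusters correlate with genetic distance simply because cross-cluster pairs are farther apart on average; once geography is controlled for, that association should mostly fall away, whereas the simulated barrier should survive the correction.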

It might also be possible to test for a difference in the slope of the IBD relationship for pairs of collection sites from different clusters and pairs of sites from the same cluster—if the clusters are on either side of a true barrier to dispersal, you’d expect that genetic distance would increase more rapidly with geographical distance when making comparisons across your LHBS.
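As a rough sketch of that slope comparison, the snippet below fits separate least-squares slopes of genetic distance on geographic distance for within-cluster and between-cluster pairs. Everything here is assumed for illustration: the eight-site transect, the two clusters, and especially the choice to simulate the barrier as a steeper rate of divergence for pairs that cross it (0.05 per unit distance versus 0.02).

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


positions = list(range(8))          # hypothetical sites along a transect
labels = [0, 0, 0, 0, 1, 1, 1, 1]   # hypothetical cluster assignments
rng = random.Random(7)

within, between = [], []
for i in range(len(positions)):
    for j in range(i + 1, len(positions)):
        geo = abs(positions[i] - positions[j])
        crosses = labels[i] != labels[j]
        rate = 0.05 if crosses else 0.02  # assumed: divergence accrues faster across the line
        gen = rate * geo + rng.uniform(0, 0.02)
        (between if crosses else within).append((geo, gen))

slope_within = ols_slope(*zip(*within))
slope_between = ols_slope(*zip(*between))
print(f"within-cluster slope:  {slope_within:.3f}")
print(f"between-cluster slope: {slope_between:.3f}")
```

Because pairwise distances share sites and so are not independent, the difference between the two slopes would need a permutation scheme (shuffling sites, as in a Mantel test) before anyone attached a p-value to it.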

For the outlier analyses, Meirmans advocates approaches that explicitly take into account geographic location, rather than simply measuring association with environmental variables or differentiation in a vacuum. One such method, spatial ancestry analysis (spa), was recently published in Nature Genetics, and although it uses a possibly over-simple model of allele frequency change across space, it looks like a promising start.

But at a much more fundamental level, Meirmans’s examination of outlier analyses is a reminder that there’s nothing magical about p ≤ 0.05, which I’d hope everyone reading this already knows. Whatever your criterion for detecting “unusual” loci in a genetic dataset, it’s important to make sure that what you choose to call outliers are actually outliers in the distribution of your data as a whole, and to understand that identifying outliers isn’t necessarily the same thing as positively identifying targets of selection. It’s really only a way to pick out a subset of loci for further, in-depth analysis: “candidate loci,” in the jargon of association genetics, which has learned some of these lessons already.

In the end, if all your data shows is that allele frequencies change with geographic location: Forget it Jake, it’s (probably) IBD.

References

Meirmans, P.G. 2012. The trouble with isolation by distance. Molecular Ecology 21: 2839-2846. DOI: 10.1111/j.1365-294X.2012.05578.x.

Wright, S. 1943. Isolation by distance. Genetics 28: 114-138. PMCID: PMC1209196.

Yang, W.-Y., Novembre, J., Eskin, E. & Halperin, E. 2012. A model-based approach for analysis of spatial structure in genetic data. Nature Genetics 44: 725-731. DOI: 10.1038/ng.2285.


About Jeremy Yoder

Jeremy Yoder is a postdoctoral associate in the Department of Plant Biology at the University of Minnesota. He also blogs at Denim and Tweed and Nothing in Biology Makes Sense!, and tweets under the handle @jbyoder.
This entry was posted in methods, population genetics.
  • Noah

    Shouldn’t it be possible to come up with a test that incorporates isolation by distance in the generation of a null expectation in order to look for outliers? I mean, isn’t this a case of p<0.05 not being the problem, but rather your measurement of p itself?

    Also: that was the most jarring video I've ever seen in a science blog post.

    And finally: I think "Living with IBD" is the title of a pamphlet I once saw in a doctor's office.

  • http://www.denimandtweed.com/ jeremyyoder

    Agreed on both counts, Noah. And, really, that’s what the analysis in spa is supposed to do, as I understand it. It fits a function to the distribution of alleles across latitude and longitude for each locus, and then identifies loci for which the allele frequencies change with unusual “sharpness.” I’ve played around with it a little bit for one of the data sets I’m working on, and I’m still not sure what I think of the approach.

    And, re: the “Chinatown” clip … maybe I’ve watched a little too much noir.

  • http://gcbias.org Graham

    Nice post.

    Note that Bayenv, by fitting a pretty general model to the covariance of allele frequencies across populations, can account for IBD-type models when looking for environmental correlations. So some of the tools we need are already out there. We specifically avoided trying to provide a p-value-like measure in this framework, as doing so assumes that the null model is appropriate, and that clearly is rarely true. In general I prefer to rank loci and then seek alternative ways [e.g., enrichments in certain functional categories] to judge biological significance. We have a paper on a new version of Bayenv that performs Fst-like statistics while accounting for population structure that we’ll release on the arXiv shortly.

  • fernando

    Hi,
    I am using variation partitioning (varpart in the R package vegan) to detrend linear relationships with several interacting factors at once, namely in a symbiotic system modeled as Y = f(X1*X2*X3),
    where
    Y = genetic distance matrix of endosymbiont populations
    X1 = genetic distance matrix of host populations
    X2 = geographic variables
    X3 = climate variables.

    In a relatively flexible way, it lets you test how much of the apparent structure is explained by distance itself (IBD), how much can be attributed to other environmental factors, and of course how much to their intersection. It is really handy as a control strategy. Used descriptively, it can show how much of the signal attributed to other factors can be blurred by geographic patterns.

    The setup is not perfect, as it is not really straightforward to use with distance matrices. In a past manuscript I used PCA to reduce the distance matrices to a discrete number of variables. This is a problem because the degrees of freedom of the analysis are then low (number of populations - 1). So with 10 populations, 9 variables would always explain 100% of the variability in the response no matter what.

    However, I guess that a distance matrix (well half of it) can also be transformed into a vector and each pairwise distance can be used as a separate observation.

    Maybe it is not a great tool, maybe it is. I hope it is useful for anyone out there.
