On (mis)interpreting STRUCTURE/ADMIXTURE results

STRUCTURE (Pritchard et al. 2000), ADMIXTURE (Alexander et al. 2009), and similar programs are among the most cited software in modern population genomics. These algorithms estimate allele frequencies and admixture proportions under the premise that sampled genotypes are derived from “K” ancestral populations. They have been widely used to (1) detect and estimate population structure, (2) quantify ancestral admixture, and (3) build the basis for complex hypotheses about how populations evolved. However, interpreting the results of these methods has often been contentious (see Gilbert et al. 2012, Lawson et al. 2012), mostly over what the “K” ancestral populations actually represent. Additionally, different evolutionary scenarios can produce very similar STRUCTURE/ADMIXTURE output. For example, Falush et al. 2016 describe three alternate scenarios that are hard to tell apart from the bar plots alone: one of recent admixture, one of admixture with an unsampled/unobservable “ghost” population, and one with a recent bottleneck (see also the related conversations on Twitter).
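To make the premise concrete, here is a minimal sketch of the likelihood that admixture-style models work with, assuming biallelic loci and genotypes coded as allele counts (0, 1, or 2). The names `Q`, `P`, `individual_allele_frequency`, and `genotype_log_likelihood`, and all the toy numbers, are my own illustrative choices and are not taken from either program; the real software also has to estimate `Q` and `P`, which this sketch does not attempt.

```python
import math

def individual_allele_frequency(q, P, locus):
    """Expected allele frequency for one individual at one locus:
    the ancestral frequencies weighted by that individual's admixture proportions."""
    return sum(q[k] * P[k][locus] for k in range(len(P)))

def genotype_log_likelihood(genotypes, Q, P):
    """Binomial log-likelihood of allele counts (0/1/2) given Q and P.

    genotypes[i][j]: allele count for individual i at locus j
    Q[i][k]: admixture proportion of individual i from ancestral population k
    P[k][j]: allele frequency at locus j in ancestral population k
    """
    total = 0.0
    for i, row in enumerate(genotypes):
        for j, g in enumerate(row):
            f = individual_allele_frequency(Q[i], P, j)
            total += (math.log(math.comb(2, g))
                      + g * math.log(f)
                      + (2 - g) * math.log(1.0 - f))
    return total

# Toy data (hypothetical): K = 2 ancestral populations, 2 loci, 3 individuals.
P = [[0.9, 0.1],
     [0.2, 0.8]]
genotypes = [[2, 0],   # looks like population 0
             [0, 2],   # looks like population 1
             [1, 1]]   # looks like an even mixture
Q_good = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Q_swapped = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

loglik_good = genotype_log_likelihood(genotypes, Q_good, P)
loglik_swapped = genotype_log_likelihood(genotypes, Q_swapped, P)
print(loglik_good > loglik_swapped)
```

The point of the toy comparison is only that the model scores a set of admixture proportions by how well the implied per-individual allele frequencies explain the genotypes. Nothing in that score says the K populations ever existed as discrete groups, which is where the interpretation problems start.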

In their new preprint, Falush et al. 2016 describe a goodness-of-fit assessment that compares the admixture model results against a matrix factorization of “chromosome painting” palettes (Lawson et al. 2012), which use haplotype information to estimate ancestry across individuals. The residuals between the two sets of results can then be plotted, and they differ visibly across evolutionary scenarios. For instance, a scenario of recent admixture shows no discernible residuals, whereas admixture with a ghost population produces residuals indicating that admixture proportions are underestimated in the admixed population.
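The residual idea can be sketched in a few lines. This is a simplified illustration and not the authors' implementation: I assume each individual's palette is a vector of copying proportions from a set of donor groups, and that the palettes of the ancestral populations (`B` below) are already known, whereas the preprint estimates them as part of the factorization. All names and numbers here are hypothetical.

```python
def predicted_palettes(Q, B):
    """Palette each individual should have if the admixture model fit perfectly:
    the ancestral palettes in B weighted by the admixture proportions in Q."""
    n_donors = len(B[0])
    return [[sum(q[k] * B[k][d] for k in range(len(B))) for d in range(n_donors)]
            for q in Q]

def palette_residuals(C, Q, B):
    """Observed palettes minus the palettes predicted from Q and B."""
    predicted = predicted_palettes(Q, B)
    return [[obs - pred for obs, pred in zip(obs_row, pred_row)]
            for obs_row, pred_row in zip(C, predicted)]

# Hypothetical toy values: 2 ancestral populations, 2 donor groups, 3 individuals.
B = [[0.8, 0.2],
     [0.3, 0.7]]
Q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

# Case 1: the third individual's palette is exactly the 50/50 mixture.
C_consistent = [[0.8, 0.2], [0.3, 0.7], [0.55, 0.45]]
# Case 2: the third individual copies more from donor group 0 than Q implies.
C_misfit = [[0.8, 0.2], [0.3, 0.7], [0.65, 0.35]]

residuals_consistent = palette_residuals(C_consistent, Q, B)
residuals_misfit = palette_residuals(C_misfit, Q, B)
print(residuals_consistent[2], residuals_misfit[2])
```

In the first case the residuals are zero up to floating-point error, as expected when the admixture model describes the data well. In the second case the residuals for the third individual are non-zero, which is the kind of signal a residual plot makes visible.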

Figure 1 of Falush et al. 2016 showing three simulated models, estimated admixture proportions, and residual plots of admixture proportions and estimated ancestries. Image courtesy: http://biorxiv.org/content/biorxiv/early/2016/07/28/066431.full.pdf

Falush et al. 2016 also demonstrate the new method on an empirical data set from Ari blacksmiths and cultivators. The origins of these ethnic groups in Ethiopia have been debated, with some studies pointing to recent admixture from neighboring ethnic groups and others to a recent bottleneck. Their analyses using the new method support the recent bottleneck scenario, which model-based demographic analyses have also found to be more plausible. The authors also stress the importance of adequately sampling individuals from the populations of focal interest when using STRUCTURE/ADMIXTURE, illustrating the point by adding Melanesians to analyses of large human population genetic data sets.

Overall, these results show that in recent history, genetic drift has been at least as important in shaping variation within these populations as admixture. A simple history comprising a differentiation phase followed by a mixture phase is false and inferences based on this model are liable to be misleading. Other, qualitatively different scenarios should also be considered, such as one in which the processes of mixture and divergence in ancient history were similar to those in recent history and the differentiation into four major ancestries reflects sustained differences in connectedness between populations.

References:

Falush, D., van Dorp, L. and Lawson, D., 2016. A tutorial on how (not) to over-interpret STRUCTURE/ADMIXTURE bar plots. bioRxiv, p.066431.

Lawson, D.J., Hellenthal, G., Myers, S. and Falush, D., 2012. Inference of population structure using dense haplotype data. PLoS Genet, 8(1), p.e1002453.

Pritchard, J.K., Stephens, M. and Donnelly, P., 2000. Inference of population structure using multilocus genotype data. Genetics, 155(2), pp.945-959.

Alexander, D.H., Novembre, J. and Lange, K., 2009. Fast model-based estimation of ancestry in unrelated individuals. Genome research, 19(9), pp.1655-1664.

Gilbert, K.J., Andrew, R.L., Bock, D.G., Franklin, M.T., Kane, N.C., Moore, J.S., Moyers, B.T., Renaut, S., Rennison, D.J., Veen, T. and Vines, T.H., 2012. Recommendations for utilizing and reporting population genetic analyses: the reproducibility of genetic clustering using the program structure. Molecular Ecology, 21(20), pp.4925-4930.

About Arun Sethuraman

I am a computational biologist, and I build statistical models and tools for population genetics. I am particularly interested in studying the dynamics of structured populations, genetic admixture, and ancestral demography.