Genes … in … space!

Geography (A) and genetics (B). Figure 2 from Wang et al. 2012.

It’s something of a classic result in human population genomics: Go out and genotype thousands of people at thousands of genetic markers. (This is getting easier to do every day.) Then summarize the genetic variation at your thousands of markers using Principal Components Analysis, which is a method for transforming that genetic data set into values on several statistically independent “PC axes.” Plot the transformed summary values for each of your samples on the first two such PC axes, and you’ll probably see that the scatterplot looks strikingly like the map of the places where you collected the samples.

Like the figure you see at the top of this post. It suggests that the geographical space from which you collected samples has left an imprint on the genes of the people you sampled. Genes … in … space! Which, to take a cue from Scicurious, calls for a clip from the Muppet Show:

This pattern has turned up across Europe, the site of the 2008 study that is probably the first to note it, as well as in samples collected from countries as small as Japan and Finland, from across Africa and Asia, and even worldwide. The obvious takeaway from this recurring correspondence between genes and geography is that, even for a highly mobile species like Homo sapiens, geography matters: processes like isolation by distance, and barriers to movement like the Mediterranean Sea, have left their marks on our genomes.

Of course, that makes humans no different from any other living thing on the planet. As scientists, we’re interested not just in establishing that patterns exist in nature, but in identifying things that don’t fit those patterns. One way to start doing that would be to compare this genes-matching-geography pattern in different parts of the world, and even to pick out individual samples and groups of samples that don’t match geography very well.

Like Native Americans. Here’s the geography-versus-genes comparison for humans worldwide:

Figure 1 from Wang et al. 2012

Samples from different geographic/ethnic groupings are plotted in different symbols and colors; Native North and South Americans are all colored purple. You can see that in the genetic-variation plot on the right, the purple symbols are quite a bit closer to the pink symbols for east Asian samples than they are in actual geographic space. That probably reflects the fact that humans colonized the Americas from Asia relatively recently in our history as a species. (You can see similar gene-geography disparity in the samples from Europe and the Middle East.)

But this raises the question: what counts as “enough” gene-geography disparity to be really worth followup study? We know a lot about human history already, so we can pretty easily come up with ideas about why Native Americans aren’t as genetically distant from Asia as they are geographically distant. But what if you were doing this sort of analysis with a species whose history you didn’t know so well? You’d definitely want to investigate more if you saw the kind of gene-geography mismatch seen in Native Americans, but what about the degree seen in Europe? Or in central Asia?

To start to make that kind of judgement, you really need a measurement of that mismatch, not just a general sense of how things look. And that brings us to the new study, published recently in PLOS Genetics, from which I borrowed the gene-geography maps in this post—it’s an attempt to measure gene-geography similarity in human genomic data.

The authors, Chaolong Wang, Sebastian Zöllner, and Noah Rosenberg, compiled existing human genetic datasets for several different geographic regions, and filtered the data so that every dataset contained the same set of markers: more than 30 thousand SNPs. Wang et al. then performed PCA on each regional dataset using the same procedure. Finally, they turned to a statistical tool designed to answer the key question: are these two plots more similar than we’d expect by chance?

That tool is a Procrustes analysis, which is named after a character in Greek mythology who killed people by forcing them to fit an iron-framed bed, either stretching them or cutting off limbs. (Anyone who thinks torture porn is a disturbing new trend hasn’t read much classical mythology.) The Procrustes analysis stretches and rotates one graph to try to match it up with a second graph, and then summarizes the difference between the graphs in terms of the average distance between corresponding points when the two graphs are overlaid.
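To make that concrete, here is a minimal Python sketch of a two-dimensional Procrustes fit. It is not the code Wang et al. used; the function names (`standardize`, `procrustes`) and the trick of treating each point as a complex number are my own choices for illustration. Both plots are centered and rescaled to the same overall size, the PCA plot is rotated (and mirrored, if that fits better) onto the map, and what is left over is D, the sum of squared distances between corresponding points. Note that this is the usual least-squares version, which sums squared distances rather than averaging raw ones. The sketch also reports sqrt(1 - D), a similarity that runs from 0 to 1; I'm assuming a score of that general form.

```python
import math

def standardize(points):
    """Center 2-D points on their centroid and scale to unit sum of squares."""
    z = [complex(x, y) for x, y in points]
    centroid = sum(z) / len(z)
    z = [p - centroid for p in z]
    size = math.sqrt(sum(abs(p) ** 2 for p in z))
    if size == 0:
        raise ValueError("points must not all coincide")
    return [p / size for p in z]

def procrustes(geo, pcs):
    """Fit pcs onto geo; return (fitted pcs, residual D, similarity)."""
    if len(geo) != len(pcs):
        raise ValueError("need one PC point per geographic point")
    a, b = standardize(geo), standardize(pcs)
    best = None
    # Try the PCA plot as-is, then mirrored (complex conjugate).
    for candidate in (b, [q.conjugate() for q in b]):
        # Least-squares complex factor: multiplying by c rotates and rescales.
        c = sum(p * q.conjugate() for p, q in zip(a, candidate))
        fitted = [c * q for q in candidate]
        d = sum(abs(p - f) ** 2 for p, f in zip(a, fitted))
        if best is None or d < best[1]:
            best = (fitted, d)
    fitted, d = best
    similarity = math.sqrt(max(0.0, 1.0 - d))
    return [(f.real, f.imag) for f in fitted], d, similarity
```

Calling `procrustes(map_coords, pc_coords)` on two lists of (x, y) pairs, given in the same sample order, returns the fitted points (in the standardized map's coordinates), D, and the similarity. A PCA plot that is just a rotated, resized, or mirrored copy of the map scores 1.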

That gave Wang et al. a numeric measurement of gene-geography similarity for each data set they considered. By imitating Procrustes in yet another way, they developed a way to identify groups of samples for which genes and geography were unusually out of alignment. That is, they divided each dataset by geographic region of origin (or ethnicity), then systematically removed each such set of samples, re-ran the PCA and Procrustes analysis, and checked to see how the Procrustes score had changed. If a set of samples had particularly mismatched genes and geography, cutting it out of the analysis should result in a better Procrustes score.
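The leave-one-out loop itself is simple enough to sketch in Python. This is an illustration under my own assumptions, not the authors' code: `leave_one_group_out`, `group_of`, and `score` are hypothetical names, and `score` stands in for whatever function re-runs the full pipeline (PCA, then Procrustes) on a subset of samples and returns a similarity where higher means a better gene-geography match.

```python
def leave_one_group_out(samples, group_of, score):
    """Rank groups by how much dropping each one improves the score.

    samples:  any list of sample records.
    group_of: function returning a sample's population label.
    score:    function that re-runs the whole analysis on a list of
              samples and returns a similarity (higher = better match).
    """
    baseline = score(samples)
    changes = {}
    for group in sorted({group_of(s) for s in samples}):
        # Drop every sample from this group, then re-score the rest.
        kept = [s for s in samples if group_of(s) != group]
        changes[group] = score(kept) - baseline
    # Largest improvement first: these groups contribute most to mismatch.
    return sorted(changes.items(), key=lambda item: item[1], reverse=True)
```

The group at the top of the returned list is the one whose removal raised the score the most, which is the sense in which it is "unusually out of alignment."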

The results are maybe not that shocking: every set of samples the team examined had better gene-geography Procrustes scores than expected by chance (as estimated from a permutation analysis). The dataset with the lowest Procrustes score, based on samples collected across eastern Asia, had pretty extensive gene-geography mismatch.
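For the curious, a permutation analysis of this kind can be sketched in a few lines of Python. This is an illustration under my own assumptions, not the authors' procedure: I'm taking "chance" to mean shuffling which map location goes with which sample, and the names `similarity` and `permutation_p_value`, the default of 1,000 shuffles, and the fixed random seed are all placeholders. The `similarity` helper is a compact two-dimensional Procrustes score between 0 and 1.

```python
import math
import random

def similarity(a, b):
    """Procrustes-style similarity (0 to 1) between two 2-D point lists."""
    def standardize(points):
        z = [complex(x, y) for x, y in points]
        mean = sum(z) / len(z)
        z = [p - mean for p in z]
        norm = math.sqrt(sum(abs(p) ** 2 for p in z))
        return [p / norm for p in z]
    u, v = standardize(a), standardize(b)
    # Best fit by rotation and scaling, with and without mirroring.
    rotated = abs(sum(p * q.conjugate() for p, q in zip(u, v)))
    mirrored = abs(sum(p * q for p, q in zip(u, v)))
    return min(1.0, max(rotated, mirrored))

def permutation_p_value(geo, pcs, n_perm=1000, seed=0):
    """Return (observed similarity, permutation p-value)."""
    rng = random.Random(seed)
    observed = similarity(geo, pcs)
    shuffled = list(geo)
    hits = 0
    for _ in range(n_perm):
        # Break the link between samples and locations, then re-score.
        rng.shuffle(shuffled)
        if similarity(shuffled, pcs) >= observed:
            hits += 1
    # Count the observed arrangement as one of the possible shuffles.
    return observed, (hits + 1) / (n_perm + 1)
```

A small p-value means that few random pairings of samples and locations match the map as well as the real pairing does.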

More interesting is the leave-one-out approach for identifying sets of samples contributing most to gene-geography mismatch. In the worldwide data set, this method identified Native Americans and natives of Oceania (Melanesia and Papua New Guinea)—both regions relatively recently colonized from Asia—as contributing most strongly to gene-geography mismatch. In Europe, excluding Italian samples led to the biggest improvement in Procrustes score; in Africa, samples from members of the Maasai were major contributors to mismatch. Again, most of these cases are not surprising, since we have a lot of prior knowledge about humans; but this is a pretty good prototype analysis for identifying “interesting” populations in a sample from a species whose history is less well recorded.

So the big contribution from this study is mostly that it provides a new kind of measuring stick—the Procrustes analysis—to quantify mismatch between geography and population genetic structure. It may not yield any groundbreaking insights into human evolutionary history, but as those of us who study less charismatic critters begin to find ourselves with tens of thousands of SNPs at our analytic fingertips, I expect we’ll see this approach used quite a bit.


Novembre, J., Johnson, T., Bryc, K., Kutalik, Z., Boyko, A. R., Auton, A., Indap, A., et al. 2008. Genes mirror geography within Europe. Nature 456(7218):98–101. DOI: 10.1038/nature07331

Wang, C., Zöllner, S., & Rosenberg, N. 2012. A quantitative comparison of the similarity between genes and geography in worldwide human populations. PLOS Genetics 8(8):e1002886. DOI: 10.1371/journal.pgen.1002886


About Jeremy Yoder

Jeremy Yoder is an Assistant Professor of Biology at California State University, Northridge. He also blogs at Denim and Tweed, and tweets under the handle @jbyoder.
