North America is full of new arrivals. Europeans colonized the continent starting half a millennium ago, displaced and eradicated native populations, and brought enslaved workers from Africa with them — and further immigrants have followed ever since. This mass movement of people is a huge complication for studies of human population genetics, but it’s also an opportunity to study how that movement is reflected in the diversity of the people who now live in North America. One study of people in the Caribbean, for instance, found the effects of colonization and the slave trade, but also evidence of migration across the region that pre-dated both.
An important tool for studying the complex human history of North America is emerging from a consumer trend you’ve probably heard about on a couple thousand podcast sponsorship messages — personalized genetic analysis. Services like 23andMe and Ancestry.com offer genome-wide genotyping and comparison to geographically-specific samples to identify your ancestors’ origins, and both companies ask customers to volunteer their data for research. Data collected by 23andMe allowed comparison of genetic ancestry to racial and ethnic identity that reveals how slippery the relationship between race and biology really can be. Now, a study of AncestryDNA customers helps link the history of colonization and migration across North America to individual Americans’ family histories.
The study, published recently in Nature Communications, combines genetic samples from more than 700,000 AncestryDNA customers with family trees they created using the Ancestry.com service (which was all Ancestry used to do, before personal genetics came along). The authors, led by AncestryDNA scientist Eunjung Han, identified relationships between samples using identity-by-descent, the other IBD in population genetics. DNA sequences from two different people are identical by descent if they are inherited from a shared ancestor. As anyone who’s heard J.B.S. Haldane’s quip about laying down his life for siblings knows, we share more DNA sequence with people we’re more closely related to. That isn’t enough to confidently identify relatedness, though: we all probably share a lot of individual genetic markers with people who are not particularly close relatives. Identity-by-descent is, in this case, identified by long continuous stretches of shared DNA sequence. This takes advantage of the fact that recombination breaks up stretches of shared sequence each generation. I not only share about 50% of my DNA sequence with my brother; that shared sequence comprises large stretches of whole chromosomes. In comparison, the sequence I share with a niece or nephew, or with one of my grandparents, is both a smaller portion of our respective genomes and in smaller pieces.
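To put rough numbers on how recombination shortens shared stretches, here is a minimal Python sketch of a standard textbook approximation. It is not a calculation from Han et al. It assumes two people related through a single shared ancestor, with crossovers falling at random at a rate of one per 100 centimorgans (cM) per meiosis, and it ignores chromosome ends. The function names are made up for illustration.

```python
import math


def mean_ibd_segment_cm(meioses):
    """Approximate mean length (in cM) of a shared segment.

    `meioses` is the total number of meioses on the path linking two people
    through one shared ancestor (for example, 4 for first cousins).
    Assumes crossovers occur at random, one per 100 cM per meiosis.
    """
    if meioses < 1:
        raise ValueError("meioses must be at least 1")
    return 100.0 / meioses


def prob_stretch_unbroken(length_cm, meioses):
    """Chance that a stretch of `length_cm` survives `meioses` meioses
    without any crossover, under the same random-crossover assumption."""
    return math.exp(-meioses * length_cm / 100.0)


# Each extra generation of separation adds meioses and shrinks the segments.
for label, m in [("first cousins", 4), ("second cousins", 6), ("third cousins", 8)]:
    print(label, round(mean_ibd_segment_cm(m), 1), "cM on average")
```

Under these assumptions the typical shared segment shrinks steadily as the shared ancestor gets more distant, which is why long shared stretches point to recent common ancestry.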
So, two AncestryDNA samples that share one or more long stretches of DNA sequence are likely to have inherited them from a recent common ancestor. Han et al. calculated the length of continuous shared sequence between samples. If a pair of samples shared a particularly large quantity of sequence, they linked them together. Building many such links produced a network of genotypes that reflected the relatedness of every individual in the dataset. Within that network, Han et al. identified clusters of samples sharing more IBD links with each other than with the rest of the dataset. They repeated this analysis within each cluster to identify smaller groupings. They then annotated each cluster and sub-cluster with information from family trees submitted by the AncestryDNA participants, and by comparison to genomic datasets from native populations around the world, to identify both the geographic origins of the clusters within the U.S., Canada, and Mexico, and their likely ancestry prior to colonization.
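The link-then-cluster idea can be sketched in a few lines of Python. This is a toy illustration, not the authors' pipeline: it swaps in plain connected components where the study looked for groups with more links inside than outside, and the sample names, shared lengths, and thresholds are all invented.

```python
from collections import defaultdict


def link_pairs(ibd_cm, min_cm):
    """Keep pairs whose total shared IBD length (in cM) meets the threshold."""
    return [pair for pair, cm in ibd_cm.items() if cm >= min_cm]


def components(samples, edges):
    """Group samples into connected components with a small union-find."""
    parent = {s: s for s in samples}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for s in samples:
        groups[find(s)].add(s)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])


def cluster(samples, ibd_cm, min_cm):
    """Link pairs within `samples` that share enough IBD, then group them."""
    members = set(samples)
    within = {p: cm for p, cm in ibd_cm.items()
              if p[0] in members and p[1] in members}
    return components(members, link_pairs(within, min_cm))


# Hypothetical total shared IBD (cM) between pairs of samples.
ibd = {("A", "B"): 40, ("B", "C"): 35, ("A", "C"): 13,
       ("C", "D"): 14, ("D", "E"): 50, ("A", "F"): 3}

top_level = cluster("ABCDEF", ibd, min_cm=12)
sub_level = cluster(top_level[0], ibd, min_cm=30)
print(top_level)  # [['A', 'B', 'C', 'D', 'E'], ['F']]
print(sub_level)  # [['A', 'B', 'C'], ['D', 'E']]
```

Re-running the same step inside a cluster with a stricter threshold stands in for the repeated analysis that yields sub-clusters. A real analysis at this scale would use a proper community-detection method instead of connected components.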
The first level of clustering identified five large groups that accounted for a majority of the samples. The authors broke the next-level clusters into four different types. “Intact immigrant groups,” like African Americans, French Canadians, and Scandinavians, retained detectable shared ancestry and higher-than-average similarity to a native population elsewhere in the world. “Continental admixed groups,” mostly Hispanic and Latin@-identified participants, had ancestry from both Native American and European populations. “Assimilated immigrant groups,” the largest clusters, had mixed Western European ancestry, and were not strongly differentiated from each other, except by their ancestral origins within North America. Finally, “post-migration isolated groups” had ancestral origins within North America, but were nevertheless distinct clusters, indicating histories distinct from the rest of the post-colonial population. These included the Amish, Mormons, and people from Appalachia.
The authors visualize the distributions of cluster members’ geographic origins using a clever, but slightly unintuitive mapping approach, as seen above. Points on the map indicate locations not for the AncestryDNA customers, but for the ancestors in their family trees: each point is colored to indicate the identity-by-descent cluster that has the most family trees tracing to or through that location, and points supported by more family trees are larger. Thus, a dense scatter of large points of a given color indicates many family trees from the same genomic cluster, all linked to a specific geographic region. This highlights some features of the data that make a lot of sense to any student of North American history: Scandinavian family trees concentrated in Minnesota, or a cluster with family-tree roots in both the Canadian Maritime provinces and Louisiana (the Acadians).
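The coloring rule described here is easy to state in code. The Python sketch below follows the description in this post rather than the paper's actual procedure: it picks a winner by plain counts of family trees, and the record layout and all names in it are hypothetical.

```python
from collections import defaultdict


def map_points(records):
    """Summarize family-tree locations into one map point per location.

    `records` is an iterable of (tree_id, cluster, location) tuples, one per
    ancestor location in a family tree (a hypothetical layout).
    Returns {location: (winning_cluster, number_of_supporting_trees)}.
    """
    trees = defaultdict(lambda: defaultdict(set))
    for tree_id, cluster, location in records:
        trees[location][cluster].add(tree_id)  # count each tree once

    points = {}
    for location, by_cluster in trees.items():
        counts = {c: len(ids) for c, ids in by_cluster.items()}
        # Ties go to the alphabetically first cluster, to stay deterministic.
        best = max(sorted(counts), key=counts.get)
        points[location] = (best, counts[best])
    return points


records = [
    ("t1", "Scandinavian", "Minnesota"),
    ("t2", "Scandinavian", "Minnesota"),
    ("t3", "Acadian", "Minnesota"),
    ("t3", "Acadian", "Louisiana"),
    ("t3", "Acadian", "Nova Scotia"),
    ("t4", "Acadian", "Nova Scotia"),
]
for place, (cluster, n_trees) in sorted(map_points(records).items()):
    print(place, cluster, n_trees)
```

The winning cluster sets the point's color and the count of supporting trees sets its size. Scoring each cluster by how over-represented it is at a location, relative to the rest of the dataset, would be a natural variation on this count-based rule.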
It also reveals some surprises. The locations for the “assimilated immigrant” clusters (the U.S. northeast, the state of Pennsylvania, the lower Midwest and Appalachia, and the upland and lowland southeastern U.S.) form strong east-west bands with little north-south spillover, suggesting (the authors propose) cultural isolation that persisted even as Europeans colonized farther west. Another odd feature of the data is that the cluster best corresponding to African Americans has its family-tree locations restricted to the U.S. southeast — this does not mean that all the AncestryDNA samples from African Americans were from the South, but rather that more northerly locations were not unusually common in their family trees, compared to the rest of the dataset.
Han et al. are at pains to explain that the population genomic structure they’ve traced does not reflect particularly strong “genetic isolation” in the sense we usually think of it in natural or human populations, even for groups that we think of as “isolated”, like the Amish. The clustering analysis isn’t identifying separate patches of forest within the AncestryDNA sample — it’s more like picking individual leaves from a unified forest canopy and trying to find out which share the same roots, somewhere down at ground level. As we collect more and finer-grained human genetic data, we’ll come closer to tracing the branches of human ancestry directly, and we’ll likely find even more ways in which they’ve been shaped by our history.
Han E, P Carbonetto, RE Curtis, Y Wang, JM Granka, J Byrnes, K Noto, AR Kermany, NM Myres, MJ Barber, KA Rand, S Song, T Roman, E Battat, E Elyashiv, H Guturu, EL Hong, KG Chahine, and CA Ball. 2017. Clustering of 770,000 genomes reveals post-colonial population structure of North America. Nat. Commun. 8:14238. doi: 10.1038/ncomms14238
Leslie S, B Winney, G Hellenthal, D Davison, A Boumertit, T Day, K Hutnik, EC Royrvik, B Cunliffe, Wellcome Trust Case Control Consortium, International Multiple Sclerosis Genetics Consortium, DJ Lawson, D Falush, C Freeman, M Pirinen, S Myers, M Robinson, P Donnelly, and W Bodmer. 2015. The fine-scale genetic structure of the British population. Nature 519:309–314. doi: 10.1038/nature14230
Moreno-Estrada A, S Gravel, F Zakharia, JL McCauley, JK Byrnes, et al. 2013. Reconstructing the population genetic history of the Caribbean. PLOS Genetics 9(11): e1003925. doi: 10.1371/journal.pgen.1003925
Ralph P and G Coop. 2013. The geography of recent genetic ancestry across Europe. PLOS Biology 11(5): e1001555. doi: 10.1371/journal.pbio.1001555