Finding hidden structure in uneven data

If you are a population geneticist, your work might include sampling a bunch of individuals and figuring out who is related to whom. Seems simple, right? But before you can ask questions about differences or similarities between groups, you have to understand what actually constitutes a group in the first place.
A methodological stalwart of “how many groups do I have?” analyses is the program STRUCTURE, which has been cranking out its ubiquitous bar plots for more than fifteen years (and to the tune of >10,000 citations). As you can imagine, a program as widely applied as STRUCTURE has been examined, questioned, and improved many times over.
For example, the basic application has been made faster and faster. Additionally, both simulations and empirical investigations have revealed caveats for these analyses (like avoiding close relatives, large temporal variation in sampling, isolation-by-distance scenarios, etc.). Overall though, STRUCTURE is still being used all the time and most people seem fine with it.
Given the history of studying the effectiveness of STRUCTURE, I was surprised to see this new paper by Sebastien Puechmaille in Molecular Ecology Resources, titled “The program STRUCTURE does not reliably recover the correct population structure when sampling is uneven: sub-sampling and new estimators alleviate the problem”.
Now I’m no scholar on the history of population-clustering techniques, but are you telling me that no one ever asked if having uneven sample sizes makes a difference? Say you sample an insect species from three genetically distinct populations, but have uneven sample sizes (N = 30, 12, 12). According to Puechmaille’s simulations, you are more likely to see the populations with lower sample sizes lumped together (even if they really are very different), and the population with the large sample size may be unnecessarily split into multiple groups.
The solution to this problem ends up being (somewhat) equally simple: keep sample sizes relatively even, subsample large groups, and use a variety of estimators to help you pick the “most supported” number of groups. In fact, Puechmaille offers up a suite of new estimators (MedMeaK, MaxMeaK, MedMedK, MaxMedK) that you can add to your arsenal.
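To make the subsampling idea concrete, here is a minimal sketch in Python using the N = 30, 12, 12 example from above. It is not Puechmaille's procedure or anything built into STRUCTURE. It just assumes you already have a sampling-site label for each individual and randomly trims every group down to the size of the smallest one. The function name `subsample_evenly` and the individual IDs are made up for illustration.

```python
import random
from collections import defaultdict


def subsample_evenly(samples, seed=0):
    """Randomly trim every group to the size of the smallest group.

    samples: list of (individual_id, group_label) pairs, where the label is
    an a priori grouping such as sampling site (an assumption of this sketch).
    Returns a dict mapping group_label -> sorted list of retained IDs.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible

    # Collect individuals by their group label.
    by_group = defaultdict(list)
    for individual, group in samples:
        by_group[group].append(individual)

    # The smallest group sets the target size for all groups.
    target = min(len(members) for members in by_group.values())

    # Draw `target` individuals without replacement from each group.
    return {
        group: sorted(rng.sample(members, target))
        for group, members in by_group.items()
    }


# Hypothetical data set: three sampling sites with N = 30, 12, 12.
samples = (
    [(f"A{i}", "A") for i in range(30)]
    + [(f"B{i}", "B") for i in range(12)]
    + [(f"C{i}", "C") for i in range(12)]
)

even = subsample_evenly(samples)
print({group: len(members) for group, members in even.items()})
# prints {'A': 12, 'B': 12, 'C': 12}
```

Because one random draw throws away data from the large group, it is probably worth repeating the draw with several seeds and checking whether the inferred number of groups is stable across subsamples.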
With that out of the way, we can get to the really important questions, like why does population structure matter at all?
Puechmaille, S. J. (2016). The program STRUCTURE does not reliably recover the correct population structure when sampling is uneven: sub‐sampling and new estimators alleviate the problem. Molecular Ecology Resources. DOI: 10.1111/1755-0998.12512
