Opening Pandora’s box: PSMC and population structure

Essentially, all models are wrong, but some are useful. — George Box

Population size change vs population structure. Source: Orozco-terWengel, Heredity 2016

Population size change vs population structure. Source: Orozco-terWengel, Heredity 2016

Publication of the Li and Durbin’s 2011 paper titled “Inference of human population history from individual whole-genome sequences” was a milestone in the inference of demography.

By allowing the estimation of population dynamics from a single diploid genome, Li and Drubin’s pairwise sequentially Markovian coalescent (PSMC) model is perfectly suited for the genomic era of “less is more”, i.e. sequencing whole genomes of a few individuals rather than sequencing few loci of many individuals.

“The distribution of the time since the most recent common ancestor (TMRCA) between two alleles in an individual provides information about the history of change in population size over time.” (Li and Durbin 2011)

PSMC uses the coalescent approach to estimate changes in population size. Each diploid genome is a collection of hundreds of thousands independent loci. Estimating TMRCA of the two alleles at each locus is used to create a TMRCA distribution across the genome. And since the rate of coalescent events is inversely proportional to effective population size (Ne), PSMC identifies periods of Ne change. For example, when many loci coalesce at the same time, it is a sign of small Ne at that time.

This approach is becoming extremely popular in whole genome studies and is of a particular interest in ancient DNA and conservation genomics. Among others, it has been applied to study demographic history of the giant panda (Zhao et al. 2012), passenger pigeon (Hung et al. 2014) and the woolly mammoth (Palkopoulou et al. 2015).

However, PSMC has several considerable limitations that should be kept in mind.

  • It doesn’t recover sudden changes in Ne
  • Nor does it recover recent changes, e.g. younger than 10,000 years BP in humans (Li and Durbin 2011).
  • Simulation suggest that it also performs worse in case of very ancient changes in Ne (Mazet et al. 2015).
  • Using incorrect mutation rate or generation time can cause bias in the interpretation.
  • The change in Ne in a PSMC plot can be actually caused by population structure.

68435285Mind the population structure

The last point is of a particular interest to the group around Lounès Chikhi. It seems that by the growing list of papers on the topic of demographic inference and population structure (Chikhi et al. 2010, Heller et al. 2013, Paz-Vinas et al. 2013, Mazet et al. 2015, Mazet et al. 2016), he is trying to start a movement for a better treatment of population structure in population genomics.

The effect of population structure is well-known principle, but I was surprised by the rather small number of papers that try to address the topic practically (without Chikhi’s name in the author list). Does anybody else care? The papers by Nielsen and Beaumont (2009), Peter et al. (2010) and recently, Boitard et al. (2016) are some of those that do.

Here is a nice explanation of the problem by Nielsen and Beaumont (2009):

“It turns out that population structure very easily can be confused with changes in the population size. Imagine a model in which there are multiple different demes, but all individuals in the sample are obtained from only one deme. Most of the gene copies in the deme will find a common ancestor relatively recently, but a few may be descendents of migrants from other demes. These genes copies will not find a most recent common ancestor with the other gene copies in the deme until the lineages ancestral to the gene copies have migrated into the same deme and then coalesced. Depending on the number of demes, their population size, and the migration rate, this may take a long time. In terms of pairwise differences, most will be very small, but a few will be very large. This is exactly the pattern we would expect to observe if there had been a strong reduction in population size, or there had been a bottleneck.”

“Separating population size variation from population structure”

Mazet, Rodríguez and Chikhi (2015) used a maximum likelihood-based approach to compare between two models, Single Step Population Size Change (SSPSC) and Structured Symmetrical Island (StSI) model.

In case of SSPSC:

  • If a bottleneck occurred very recently, the coalescence rate will be mostly determined by the pre-bottleneck population size
  • If a bottleneck is very ancient, the coalescence rate will be mostly determined by the post-bottleneck population size

In case of StSI:

  • If migration rate is low, the coalescence rate will mostly depend on population size
  • If migration rate is high, the coalescence rate will depend on the whole metapopulation size

They showed that with the increasing number of independent loci studied, the parameters of both models can be resolved with increasing precision. What is interesting, they also showed that with 10,000 independent loci it is possible to differentiate between the two models, SSPSC and StSI, even if the bottleneck is very old.

However, the population size change model of a single-step change is very simplistic and cannot be really used for real data.

“On the importance of being structured”

In the follow-up paper, Mazet et al. (2016) introduce a theoretical framework with a parameter called the inverse instantaneous coalesce rate (IICR), which is “a trajectory of instantaneous ‘population sizes’ that fully explains complex models without loss of information”.

“As coalescence rates are expected to be inversely related to effective population sizes, it may seem natural to see the IICR as an ‘instantaneous population size’. However, we stress that the IICR is equivalent to a population size only in panmictic models. For models incorporating population structure the IICR exhibits a temporal trajectory that can be strongly disconnected from the real demographic history (that is, identifying a decrease when the population size was actually constant or increasing).”

Their simulations showed that the trajectories inferred by PSMC and IICR align well and PSMC thus does not estimate a population size change, but IICR. Both PSMC and IICR identified a population decrease or increase when sampling in the same deme or in different demes, respectively, even though the population size was constant.

Inferred population size changes under population structure and two sampling schemes. Source: Mazet et al. 2016, Heredity

Inferred population size changes under population structure and two sampling schemes. Source: Mazet et al. 2016, Heredity

Mazet and colleagues thus indirectly question the correctness of all the published PSMC plots showing population bottlenecks. Does it mean we have to give up on PSMC? Hopefully not, and it’s not what Mazet et al. (2016) are suggesting either. It’s more of a call for a healthy dose of skepticism and critical evaluation of one’s results when neglecting population structure from a model.

At least 18X and max 25% of missing data

PSMC is a fantastic method and its frequent usage in recent publications proves it. However, there is one more “but” in this post.

Nadachowska-Brzyska et al. (2016) recently published a methodically interesting PSMC analysis, where they highlight the importance of considering the biases caused by the quality of data. They evaluated the influence of different filters on PSMC estimates of 200 genomes of 4 species.

Based on the comparisons, they set the following (rather strict) cut off values for running a reliable PSMC analysis:

  • A mean genome-wide coverage of at least 18X
  • No more than 25% of missing data
  • A per-site filter of ≥10 reads

References

Boitard S, Rodríguez W, Jay F, Mona S, Austerlitz F (2016) Inferring Population Size History from Large Samples of Genome-Wide Molecular Data – An Approximate Bayesian Computation Approach. PLoS Genet 12(3): e1005877. doi:10.1371/journal.pgen.1005877

Chikhi L, Sousa VC, Luisi P, Goossens B, Beaumont MA (2010) The Confounding Effects of Population Structure, Genetic Diversity and the Sampling Scheme on the Detection and Quantification of Population Size Changes. Genetics 186(3):983–995. doi: 10.1534/genetics.110.118661

Heller H, Chikhi L, Siegismund HR (2013) The confounding effect of population structure on Bayesian skyline plot inferences of demographic history. PLoS One 8: e62992. http://dx.doi.org/10.1371/journal.pone.0062992

Hung CM, Shaner PJL, Zink RM, Liu WC, Chu TC, Huang WS, et al. (2014) Drastic population fluctuations explain the rapid extinction of the passenger pigeon. PNAS 111(29):10636–10641. doi: 10.1073/pnas.1401526111

Mazet O, Rodriguez W, Chikhi L (2015). Demographic inference using genetic data from a single individual: Separating population size variation from population structure. Theor Popul Biol 104: 46–58. http://dx.doi.org/10.1016/j.tpb.2015.06.003

Mazet O, Rodriguez W, Grusea S, Boitard S, Chikhi L (2016) On the importance of being structured: instantaneous coalescence rates and human evolution—lessons for ancestral population size inference. Heredity 116: 362-371. doi:10.1038/hdy.2015.104

Li H, Durbin R (2011) Inference of human population history from individual wholegenome sequences. Nature 475: 493–496. doi:10.1038/nature10231

Nadachowska-Brzyska K, Burri R, Smeds L, Ellegren H (2016) PSMC analysis of effective population sizes in molecular ecology and its application to black-and-white Ficedula flycatchers. Mol Ecol 25: 1058–1072. doi:10.1111/mec.13540

Nielsen R, Beaumont MA (2009) Statistical inferences in phylogeography. Mol Ecol 18: 1034–1047. doi: 10.1111/j.1365-294X.2008.04059.x

Orozco-terWengel P (2016) The devil is in the details: the effect of population structure on demographic inference. Heredity 116: 349–350. doi:10.1038/hdy.2016.9

Palkopoulou E, Mallick S, Skoglund P, Enk J, Rohland N, Li H, et al. (2015) Complete genomes reveal signatures of demographic and genetic declines in the woolly mammoth. Curr Biol 25:1395–400. doi:10.1016/j.cub.2015.04.007

Paz-Vinas I, Quéméré E, Chikhi L, Loot G, Blanchet S (2013) The demographic history of populations experiencing asymmetric gene flow: combining simulated and empirical data. Mol Ecol 22: 3279–3291. doi: 10.1111/mec.12321

Peter BM, Wegmann D, Excoffier L (2010) Distinguishing between population bottleneck and population subdivision by a Bayesian model choice procedure. Mol Ecol 19: 4648–4660. doi: 10.1111/j.1365-294X.2010.04783.x

Zhao S, Zheng P, Dong S, Zhan X, Wu Q, Guo X et al. (2013) Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation. Nat Genet 45: 67–71. doi:10.1038/ng.2494

Share

About Patrícia Pečnerová

I am a PhD candidate at Stockholm University. I study paleogenomics of the last population of the woolly mammoth before its extinction and I am mainly interested in combining ancient DNA, population and conservation genomics to trace pre-extinction changes in genetic diversity.
This entry was posted in bioinformatics, methods, Paleogenomics, population genetics, theory and tagged , , , , . Bookmark the permalink.
  • Frank Hailer

    Great summary, well done!
    Just a comment: you should include the Orozco-terWengel reference in your reference list. “His” figure at the top of your article is the one spread on twitter etc, so I’m sure he’d like to be actually cited. 🙂

    • Thank you for your remark. I must have missed it since I don’t refer to the paper in the text. At least the figure was properly cited all along. I’ve now also added the reference into the reference list.

  • Indianadiez

    I see, so the Nadachowska-Brzyska cut off values basically are telling us ‘do not use PSMC on ancient genomes’.

  • asif zubair

    Thank you for the well-written post. However, I think in the post you refer to Nadachowska-Brzyska’s 2016 paper, but I think the references point to the 2013 paper. You may want to fix that, unless it is the 2013 paper that you meant. Thanks!

    • Correct, thank you for pointing that out! Looks like I should be using EndNote even for writing my Molecular Ecologist posts 🙂

  • Ketil

    Great rewiew, thanks. I’ve of course seen PSMC used all over the place, but been very uncertain as to how accurate it really is. Is there any work at all that compares the analysis to actual empirical population data (whatever that might be)?

    Also, are the studies investigating how sequencing errors or assembly artifiacts might bias the analysis? How about effects of adaptive evolution or variations in mutational load over time?

    So lots of questions still, but with your post here, I at least have a place to start reading. So thanks again.

    -k