Bayesian Markov chain Monte Carlo in population genetics


Bayes’s Theorem in neon. Photo by Matt Buck.


This is a guest post by Arun Sethuraman, a postdoctoral associate with Jody Hey in the Department of Biology at Temple University, where he studies statistical models for divergence population genetics. You can also find him on Twitter and on his short story blog.
Prompted by the great response to my talk at Evolution 2014 on “Parallel MCMC and inference of ancient demography under the IM model”, I wanted to throw together a quick and dirty review of Bayesian MCMC for the uninitiated: why and how it works, and some of the important open problems surrounding the method and its implementation.
Some introductory jargon:
Target distribution: The probability distribution to be approximated or evaluated; usually either a model likelihood function (the probability of the data given a set of parameters) or a posterior distribution (the probability of the parameters given the data). In the population genetic context, the data are invariably genotypic (SNPs, STRs, allozymes, etc.), while the parameters estimated by the methods discussed below include phase, allele frequencies, admixture proportions, effective population sizes, population divergence times, the number of subpopulations, TMRCAs, migration rates, etc.
Stationary/Equilibrium distribution: The sampling distribution of a Markov chain once it has attained stationarity; the joint distribution of any n random variables sampled from the chain, {X_1, X_2, …, X_n}, is the same as the joint distribution of {X_{1+m}, X_{2+m}, …, X_{n+m}}, m steps ahead. In other words, the Markov chain is stationary in time. After the chain reaches stationarity, its sampling distribution is the target distribution.
Posterior, Likelihood, Prior: A prior distribution encodes a priori knowledge about the distribution of a random variable, in this case a parameter of interest. For instance, one may assume that migration rates (m) are uniformly distributed between 0 and 1 (as in IM, MDIV, etc.), or that the number of subpopulations K is uniformly distributed between 1 and 5 (as in STRUCTURE). The likelihood is the probability of the data (observed), given the parameters of interest (unobserved), viewed as a function of those parameters. The posterior distribution is the distribution of the parameters given the data, and by Bayes’ theorem it is proportional to the product of the likelihood and the prior, as the sketch below makes concrete.
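To make that relationship concrete, here is a minimal grid-approximation sketch in Python. The toy data (7 “migrant” alleles out of 50 sampled) and the binomial model are invented purely for illustration, not taken from any of the programs mentioned in this post:

```python
import numpy as np
from scipy.stats import binom

# Hypothetical toy data, invented for illustration: 7 "migrant" alleles
# observed out of 50 sampled.
n_sampled, n_migrant = 50, 7

# Uniform(0, 1) prior on a migration-like proportion m, evaluated on a grid.
m_grid = np.linspace(0.001, 0.999, 999)
prior = np.ones_like(m_grid)  # flat prior

# Likelihood: probability of the data, given each candidate value of m.
likelihood = binom.pmf(n_migrant, n_sampled, m_grid)

# Bayes' theorem: posterior is proportional to likelihood * prior.
unnormalized = likelihood * prior
posterior = unnormalized / np.trapz(unnormalized, m_grid)

print("posterior mean of m:", np.trapz(m_grid * posterior, m_grid))
```

A grid works here only because there is a single parameter; with the dozens of parameters (and latent genealogies) in the programs below, the grid explodes combinatorially, which is exactly why MCMC is needed.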
Bayesian Markov chain Monte Carlo (MCMC), in its many versions, is a family of algorithms for sampling from a target distribution using Markov chains whose stationary (equilibrium) distribution approximates that target. While algorithms implementing Bayesian MCMC have been around for a while, their utility in population/evolutionary genetics was thrown wide open by a seminal paper by Kuhner, Yamato and Felsenstein (LAMARC, 1995). This was followed by several methods that use MCMC algorithms to compute posterior densities of their parameters, given genetic data (MDIV – Nielsen and Wakeley 2001, IM – Hey and Nielsen 2007, IMa2 – Hey 2010, STRUCTURE – Pritchard et al. 2000, BAPS – Corander et al. 2004, BEAST – Drummond and Rambaut 2007, PHASE – Stephens et al. 2001, STRUCTURAMA – Huelsenbeck and Andolfatto 2007, MCMCcoal – Rannala and Yang 2003, MrBayes – Ronquist and Huelsenbeck 2003, MIGRATE – Beerli and Felsenstein 2001, etc.). All of these methods share the same inner machinery: a model that yields the distribution of the parameters to be estimated, given the data (the posterior distribution, in Bayesian parlance), and/or the likelihood of the data (the probability of the data, given the parameters).
Now suppose we write either the posterior density or the likelihood in terms of a realization of the data (say G, generic for the genealogy, as in genealogy samplers). If we devise a smart way of sampling values of G, we can approximate either the posterior or the likelihood and use it to infer the parameters of interest. All of these methods assume a model for the data under the parameters of interest, sample using one of several schemes (Metropolis-Hastings, Gibbs sampling, etc.), and perform inference of the parameters under the model.
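As a bare-bones illustration of the Metropolis-Hastings step, here is a sketch in Python. The target is a stand-in (a standard normal log density); in a real genealogy sampler, evaluating the target would involve the far more expensive likelihood of the data given the parameters and a genealogy G:

```python
import numpy as np

rng = np.random.default_rng(42)

def log_target(theta):
    # Stand-in log target: a standard normal density (up to a constant).
    # A real sampler would evaluate log P(data | theta, G) + log prior here.
    return -0.5 * theta ** 2

def metropolis_hastings(n_iter, step=1.0, theta0=0.0):
    theta = theta0
    samples = np.empty(n_iter)
    for i in range(n_iter):
        proposal = theta + rng.normal(0.0, step)  # symmetric random-walk proposal
        # Accept with probability min(1, target(proposal) / target(current)).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(theta):
            theta = proposal
        samples[i] = theta  # record the current state, accepted or not
    return samples

samples = metropolis_hastings(20_000)
print("approximate posterior mean:", samples[5_000:].mean())  # discard burn-in
```

The key property is that the chain’s stationary distribution is the target itself, so after burn-in the recorded samples behave like (correlated) draws from the posterior.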
More jargon to watch for:
Burn-in: The (expected number of) initial MCMC iterations before the chain begins sampling from the stationary distribution; samples from this phase are discarded.
Heating value/Temperature of chain: In a Metropolis coupling framework, a “heated” chain samples from a flattened version of the posterior density surface, which lets it move easily between peaks, while the “cold” chain concentrates near the peaks. One samples only from the cold chain.
Convergence: A measure of the approach to the stationary distribution. Chains are said to have converged once they are sampling from the stationary distribution.
Mixing: How efficiently the chain explores the MCMC state space. Chains are mixing poorly if they stay stuck at a particular sampled parameter value for long stretches of the run.
In one sentence: for your method to give you any good parameter estimates, you need to ensure that your chains have burned in, are mixing well, and have converged to your target distribution, i.e., run it long enough! (But how long is long enough? One common check is sketched below.)
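A widely used convergence check is the Gelman-Rubin potential scale reduction factor (R-hat), computed across several chains started from over-dispersed points. A minimal sketch, reusing the illustrative metropolis_hastings function from the snippet above:

```python
import numpy as np

def gelman_rubin(chains):
    # R-hat for a list of equal-length 1-D chains; values near 1.0
    # suggest the chains are all sampling from the same (stationary)
    # distribution.
    chains = np.asarray(chains)                # shape: (m chains, n samples)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled variance estimate
    return np.sqrt(var_hat / W)

# Three chains from over-dispersed starting points, burn-in discarded.
chains = [metropolis_hastings(20_000, theta0=s)[5_000:] for s in (-10.0, 0.0, 10.0)]
print("R-hat:", gelman_rubin(chains))  # values well above ~1.1 signal trouble
```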
Beyond the basics, MCMC methods come with many issues, as highlighted by several authors (Hey 2011, Gilbert et al. 2012, etc.). Almost all of them have to do with convergence. In a utopian world with infinite computational time, a single Markov chain would eventually converge to its equilibrium distribution even when estimating a large number of parameters. But I promised that this would be a post about quick and dirty methods, and a rather neat extension of MCMC to the parallel paradigm, suggested by Altekar et al. (2004), works out pretty well. Several of the methods cited above have implemented it over the years, and many more are in the works, including my own parallel version of IMa2 (Hey and Nielsen 2007, Hey 2010).
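To show what Metropolis coupling buys you, here is a serial Python sketch of the scheme with the Altekar et al. (2004) exchange rule; the parallel version simply puts each heated chain on its own processor and turns the swap into a message exchange. The temperatures and the stand-in target are illustrative assumptions, not settings from IMa2 or any other program:

```python
import numpy as np

rng = np.random.default_rng(1)

def log_target(theta):
    # Stand-in log target, as in the earlier sketch.
    return -0.5 * theta ** 2

def mc3(n_iter, temps=(1.0, 0.5, 0.25), step=1.0):
    # Metropolis-coupled MCMC: chain k samples target(theta)**temps[k];
    # temps[0] == 1.0 is the cold chain, the only one we record.
    thetas = np.zeros(len(temps))
    cold_samples = np.empty(n_iter)
    for i in range(n_iter):
        # Within-chain Metropolis update at each temperature.
        for k, beta in enumerate(temps):
            prop = thetas[k] + rng.normal(0.0, step)
            if np.log(rng.uniform()) < beta * (log_target(prop) - log_target(thetas[k])):
                thetas[k] = prop
        # The "exchange rule": attempt to swap the states of two
        # randomly chosen chains (Altekar et al. 2004).
        j, k = rng.choice(len(temps), size=2, replace=False)
        log_ratio = (temps[j] - temps[k]) * (log_target(thetas[k]) - log_target(thetas[j]))
        if np.log(rng.uniform()) < log_ratio:
            thetas[j], thetas[k] = thetas[k], thetas[j]
        cold_samples[i] = thetas[0]
    return cold_samples

samples = mc3(20_000)
print("cold-chain mean:", samples[5_000:].mean())
```

A successful swap lets a state discovered by a hot chain on a distant peak migrate down to the cold chain, which is what rescues mixing on multi-peaked posteriors.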
But even so, many questions were raised at and after my talk that a lot of us population geneticists are dealing with:
a) Is there a trade-off between the number of processors, the number of chains, and the time required to converge?
b) How do we really assess mixing of chains in a parallel context?
c) What if one processor is chugging along really slowly while the rest of the machines are way ahead? How does this affect mixing, convergence, and computation time?
d) Could we relax the “exchange rule” in a Metropolis coupling context? (See Altekar et al. 2004 for details on this rule.)
e) How do we minimize communication between processors to save on overhead during synchronization and swap attempts (in a Metropolis coupling context)?
f) How big is too big for Bayesian MCMC methods? A handful of loci and populations? What happens to inference in a reasonable amount of time with larger data?
I shall busy myself with these questions over the next few weeks and hope for some Bayes’ theorem neon bulbs in my head (thanks to Peter Beerli).
Also, feel free to contact me for a generic OpenMPI/C++ and C implementation of Metropolis Coupled MCMC – a little tweak to set up all your distributions, and you should be all set to run your MCMC method in parallel!
References
Altekar G., J. P. Huelsenbeck & F. Ronquist (2004). Parallel Metropolis coupled Markov chain Monte Carlo for Bayesian phylogenetic inference. Bioinformatics, 20 (3): 407-415. DOI: http://dx.doi.org/10.1093/bioinformatics/btg427
Gilbert K. J., D. G. Bock, M. T. Franklin, N. C. Kane, J.-S. Moore, B. T. Moyers, S. Renaut, D. J. Rennison, T. Veen & T. H. Vines (2012). Recommendations for utilizing and reporting population genetic analyses: the reproducibility of genetic clustering using the program STRUCTURE. Molecular Ecology, 21 (20): 4925-4930. DOI: http://dx.doi.org/10.1111/j.1365-294x.2012.05754.x
Hey J. (2010). Isolation with migration models for more than two populations. Molecular Biology and Evolution, 27 (4): 905-920. DOI: http://dx.doi.org/10.1093/molbev/msp296
Kuhner M. K., J. Yamato & J. Felsenstein (1995). Estimating effective population size and mutation rate from sequence data using Metropolis-Hastings sampling. Genetics, 140 (4): 1421-1430. PMID: http://www.ncbi.nlm.nih.gov/pubmed/7498781
