Approximate Bayesian computation without the wait

Approximate Bayesian computation (abc) is arguably one of the most exciting, and quickly developing, tools available to modern population geneticists. Because it combines large-scale simulation with the comparison of summaries computed on simulated datasets and on an observed dataset, the potential of abc for population genetic inference is massive. The catch is that as scenario complexity and dataset size increase, time and computing resources become the limiting factors in conducting abc effectively and appropriately. Estoup et al. 2012 [1], in a recent issue of Molecular Ecology Resources, describe the use of linear discriminant analysis (LDA) to transform simulation summaries (Ss) as part of a larger abc analysis, and in so doing report a “speed gain around a factor of 100”. Wow!

Likelihood-free inference

At the center of much of population genetic methodology is the identification of an appropriate likelihood function. More or less, it is the likelihood function that specifies the relevant parameters, and their relationship to one another, necessary to explain a process or model. Specifying and evaluating the parameters of a given likelihood, and its fit to a given genetic dataset, can be extremely difficult. Moreover, newer and still-growing datasets highlight the familiar mismatch between most population genetic models and empirical reality. Of course, that isn’t to say the models aren’t useful.

Abc functions upon the rationale that the likelihood might be approximated through the use of simulation and simulation summary statistics [2], and that the fit of a model to a dataset can be evaluated by comparing Ss computed from simulated scenarios with those same summaries computed on an observed, empirical dataset. In theory, simulation summaries are selected to provide maximal distinction amongst competing models. In practice, identifying these summaries isn’t always easy, and is the object of continued research [3]. Additionally, determining the so-called “distance” (sometimes described as the posterior probability for a given abc analysis) between model summaries and an observed dataset has been a focus of debate, though the most common approach, and the one applied by Estoup et al. 2012, is polychotomous logistic regression [4].
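
To make that rationale concrete, here is a minimal sketch of abc model choice by simple rejection. Everything in it is an assumption for illustration: the two "scenarios" are toy Gaussian models rather than demographic simulations, the function names are hypothetical, and scenario probabilities are estimated as the share of accepted simulations coming from each scenario, which is a simpler estimator than the polychotomous logistic regression used by Estoup et al. 2012.

```python
import random
from statistics import fmean, pvariance

def simulate(scenario, rng, n=50):
    """Toy stand-in for a simulator: the two scenarios differ only in their mean."""
    mu = 0.0 if scenario == 0 else 1.0
    return [rng.gauss(mu, 1.0) for _ in range(n)]

def summaries(data):
    """Reduce a dataset to a small vector of summary statistics (Ss)."""
    return (fmean(data), pvariance(data))

def distance(s1, s2):
    """Euclidean distance between two summary vectors (unscaled, for simplicity)."""
    return sum((a - b) ** 2 for a, b in zip(s1, s2)) ** 0.5

def abc_model_choice(observed, n_sims=4000, keep=200, seed=1):
    """Estimate scenario probabilities from the simulations closest to the data."""
    rng = random.Random(seed)
    s_obs = summaries(observed)
    table = []
    for _ in range(n_sims):
        scenario = rng.randrange(2)  # uniform prior over the two scenarios
        s_sim = summaries(simulate(scenario, rng))
        table.append((distance(s_sim, s_obs), scenario))
    table.sort()  # smallest distances first
    accepted = [scenario for _, scenario in table[:keep]]
    return [accepted.count(k) / keep for k in range(2)]

# Pretend the "observed" dataset was generated under scenario 1.
observed = simulate(1, random.Random(42))
probs = abc_model_choice(observed)
print(probs)
```

No likelihood is ever written down in this sketch. The simulator and the distance between summaries do all of the work, which is also why the choice of Ss matters so much.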

Do It Yourself-ABC

Estoup et al. 2012 utilize a version of the integrated package Do It Yourself-ABC (DIYABC [5, 6]) in order to conduct their full abc analysis. While abc is generally applicable to any model of research interest, DIYABC focuses on inferences of population divergences, admixtures, and population size changes. The scenarios treated in Estoup et al. 2012 include both (1) simulated datasets in which the details of the model are known explicitly and (2) a previously published microsatellite dataset [7, 8], both of which were derived in order to infer the world-wide invasion dynamics of Harmonia axyridis (a coccinellid beetle, commonly referred to as the invasive ladybird). Abc is particularly applicable to understanding the complex dynamics of biological invasions, which are characterized by multiple, heterogeneous introductions, and often include multiple population bottlenecks and/or expansions of varying intensities. Simply put, a single likelihood function isn’t going to cut it.

The exact details of the scenarios under study can be found in Estoup et al. 2012. For (1), a total of ten different scenarios were compared using a total of 86 Ss, while (2) included anywhere between 10 and 28 scenarios (the analysis is sequential in the sense that multiple invasions took place across different continents, each of which received its own abc analysis to determine the base scenario on which to build the next step of the invasion), and between 86 and 223 Ss. These summaries included common population genetic measures such as mean number of alleles per locus, expected heterozygosity, allele size variance, pairwise FST values, etc.
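
For readers who haven't computed these by hand, here is a small sketch of three of the summaries named above, for a single population sample of microsatellite data. The toy data and function names are hypothetical, and the formulas are the textbook ones (Nei's unbiased gene diversity and the sample variance of allele sizes); I'm not claiming these are exactly the estimators DIYABC uses internally.

```python
from collections import Counter
from statistics import fmean, variance

# Hypothetical toy sample: repeat counts of the gene copies scored at each locus.
sample = {
    "locus_1": [10, 10, 12, 12],
    "locus_2": [8, 8, 8, 8],
}

def expected_heterozygosity(alleles):
    """Nei's unbiased gene diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(alleles)
    sum_sq = sum((count / n) ** 2 for count in Counter(alleles).values())
    return n / (n - 1) * (1.0 - sum_sq)

def summary_statistics(loci):
    """Average three per-locus summaries across all loci."""
    per_locus = list(loci.values())
    return {
        "mean_alleles": fmean([len(set(a)) for a in per_locus]),
        "mean_heterozygosity": fmean([expected_heterozygosity(a) for a in per_locus]),
        "mean_size_variance": fmean([variance(a) for a in per_locus]),
    }

ss = summary_statistics(sample)
print(ss)
```

With several populations, pairwise statistics such as FST are added for every pair, which is how the number of Ss climbs so quickly.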

Now to the interesting part. The exact derivation and proof for LDA transformation can be found in the supplementary material of Estoup et al. 2012. For our present purposes, LDA as applied by DIYABC will reduce some number J of Ss to K-1 independent variables that maximize the differences amongst the K compared scenarios (with the assumption that J>K). This point is important for theoretical purposes, but as a practical matter, the larger the number of scenarios and Ss utilized for a given analysis, the greater the computational time required to evaluate posterior probabilities of individual scenarios with polychotomous logistic regression (the relationship is generally non-linear and increasing). For Estoup et al. 2012, (1) and (2) are reduced to 9 transformed LDA Ss and between 9 and 27 transformed LDA Ss, respectively. This reduction in the number of Ss will allow a researcher to estimate posterior probabilities of a larger number of more complex demo-genetic models via polychotomous logistic regression in a fraction of the time. That’s just the start of the computational savings though.
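
The J to K-1 reduction can be sketched in a few lines. This is a generic textbook LDA (eigenvectors of the within-scatter inverse times the between-scatter) written with NumPy, which I'm assuming is installed. It is not the DIYABC code, and the simulated "reference table" and the name `lda_axes` are made up for illustration.

```python
import numpy as np

def lda_axes(stats, labels):
    """Return the K-1 discriminant axes for a table of summaries with K scenarios."""
    stats = np.asarray(stats, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    n_stats = stats.shape[1]
    grand_mean = stats.mean(axis=0)
    s_within = np.zeros((n_stats, n_stats))
    s_between = np.zeros((n_stats, n_stats))
    for c in classes:
        block = stats[labels == c]
        mean_c = block.mean(axis=0)
        centered = block - mean_c
        s_within += centered.T @ centered            # scatter around scenario mean
        diff = (mean_c - grand_mean)[:, None]
        s_between += len(block) * (diff @ diff.T)    # scatter of scenario means
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(s_within) @ s_between)
    order = np.argsort(eigvals.real)[::-1][: len(classes) - 1]
    return eigvecs[:, order].real

# Hypothetical reference table: K = 3 scenarios, J = 6 Ss, 200 simulations each.
rng = np.random.default_rng(0)
n_scenarios, n_stats, n_per = 3, 6, 200
centers = rng.normal(size=(n_scenarios, n_stats))
stats = np.vstack([rng.normal(loc=centers[k], size=(n_per, n_stats))
                   for k in range(n_scenarios)])
labels = np.repeat(np.arange(n_scenarios), n_per)

axes = lda_axes(stats, labels)
projected = stats @ axes   # 6 Ss per simulation become 2 LDA variables
print(projected.shape)
```

The observed Ss would be projected onto the same `axes` before any comparison, so the regression step only ever sees K-1 columns instead of J.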

Concerns regarding the use of abc for model comparison were raised by Robert et al. 2011 [10], and hinge upon the use of insufficient summary statistics. The basis of the concern is discussed within, and suggestions for alleviating these concerns are described for DIYABC. Under the option of “Evaluate confidence in scenario choice”, DIYABC will conduct posterior model estimation in which the true model is known: a dataset is simulated under the true model, with scenario parameters either fixed or drawn from the prior specification, and the Ss from this simulation are compared to the Ss from the remaining scenarios in the same manner used to compute posterior model probabilities. If the model with the highest posterior probability is consistently not the one used to generate the Ss, there is limited power to conduct the model choice central to the abc analysis. Evaluating the Type I and II error rates can be extremely computationally intensive, as polychotomous logistic regression must be conducted iteratively. Depending upon the number of scenarios and Ss being compared, evaluating each scenario can take weeks on a reasonably powerful machine. The same computational advance that LDA provides for evaluating posterior model probabilities applies to each iteration, turning a week-long analysis into one that takes a day or even a few hours. Less time, a smaller carbon footprint, what could be better?
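
The logic of that confidence check can be sketched as follows. As before, this is a toy under stated assumptions: the scenarios are Gaussian stand-ins, the names are hypothetical, and the scenario is chosen by a majority vote among the nearest simulations rather than by logistic regression. Here I take Type I error to mean rejecting the focal scenario when it is true, and Type II error to mean choosing the focal scenario when another one is true.

```python
import random
from statistics import fmean, pvariance

def simulate_summaries(scenario, rng, n=30):
    """Toy simulator returning Ss directly; scenarios differ in their mean."""
    mu = (0.0, 0.5)[scenario]
    data = [rng.gauss(mu, 1.0) for _ in range(n)]
    return (fmean(data), pvariance(data))

def build_reference_table(n_sims, rng):
    """Simulate Ss under scenarios drawn from a uniform prior."""
    table = []
    for _ in range(n_sims):
        scenario = rng.randrange(2)
        table.append((scenario, simulate_summaries(scenario, rng)))
    return table

def choose_scenario(s_obs, table, keep=100):
    """Pick the scenario most represented among the closest simulations."""
    def sq_dist(s):
        return sum((a - b) ** 2 for a, b in zip(s, s_obs))
    nearest = sorted(table, key=lambda row: sq_dist(row[1]))[:keep]
    votes = [scenario for scenario, _ in nearest]
    return max(range(2), key=votes.count)

def error_rates(focal, table, rng, n_pods=100):
    """Estimate Type I and Type II error for a focal scenario from test datasets."""
    other = 1 - focal
    rejected_when_true = sum(
        choose_scenario(simulate_summaries(focal, rng), table) != focal
        for _ in range(n_pods))
    chosen_when_false = sum(
        choose_scenario(simulate_summaries(other, rng), table) == focal
        for _ in range(n_pods))
    return rejected_when_true / n_pods, chosen_when_false / n_pods

rng = random.Random(7)
table = build_reference_table(2000, rng)
type_1, type_2 = error_rates(0, table, rng)
print(type_1, type_2)
```

Every test dataset needs its own pass through the model choice step, which is exactly where a cheaper step pays off many times over.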

Results?  

The results with and without LDA were highly congruent, while the required computational resources were much smaller with the application of LDA. Excellent!

I was extremely impressed by the improvement described in Estoup et al. 2012. Having recently completed an application of DIYABC with my colleagues Kim Gilbert and Steve Keller to compare 10 different scenarios describing the invasion dynamics of the angiosperm Silene latifolia, I wanted to try out DIYABC with the option for LDA. Implementing the LDA version of the analysis was extremely easy, as the necessary inputs are the same and require only the addition of the LDA transform step (it took seconds). I haven’t completed the confidence in scenario choice analysis yet. Early results look highly consistent with the untransformed Ss (I’ll update this soon). The model with the highest probability was the same with LDA-transformed and untransformed Ss. However, the LDA-transformed Ss took a fraction of the time to evaluate.

Do you have a DIYABC dataset around that you analyzed without LDA transformation? If so, I think it could be quite interesting to run this new version of DIYABC on it and post the results here! I’ll post my updated results soon.

Caveats and Conclusions

Even with LDA transformation, abc analyses are quite computationally intensive. Simulating a large number of scenarios can take quite a lot of time. As stated above, DIYABC is an integrated program and capable of a full abc analysis. While powerful, DIYABC focuses on only a subset of analyses that will likely be of interest to molecular biologists. Many other abc methodologies focusing on different sorts of research questions exist. LDA transformation is just one framework currently being explored for dimension reduction of Ss; others are described in Blum et al. 2012 [9].

Two of my favorite sources for updates on abc research are:

Christian Robert’s Xi’an’s Og

Scott Sisson’s ABC_Research Twitter account

References

1 Estoup, A., Lombaert, E., Marin, J.-M., Guillemaud, T., Pudlo, P., Robert, C. P., & Cornuet, J.-M. (2012). Estimation of demo-genetic model probabilities with Approximate Bayesian Computation using linear discriminant analysis on summary statistics. Molecular Ecology Resources, 12(5), 846–855. doi:10.1111/j.1755-0998.2012.03153.x

2 Csilléry, K., Blum, M. G. B., Gaggiotti, O. E., & François, O. (2010). Approximate Bayesian Computation (ABC) in practice. Trends in Ecology & Evolution, 25(7), 410–418. doi:10.1016/j.tree.2010.04.001

3 Fearnhead, P., & Prangle, D. (2012). Constructing summary statistics for approximate Bayesian computation: semi-automatic approximate Bayesian computation. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3), 419–474. doi:10.1111/j.1467-9868.2011.01010.x

4 Beaumont, M. A. (2008). Joint determination of topology, divergence time and immigration in population trees. In S. Matsumura, P. Forster, & C. Renfrew (Eds.), Simulation, Genetics and Human Prehistory (pp. 135–154). Cambridge, UK: McDonald Institute for Archaeological Research.

5 Cornuet, J.-M., Santos, F., Beaumont, M. A., Robert, C. P., Marin, J.-M., Balding, D. J., Guillemaud, T., & Estoup, A. (2008). Inferring population history with DIY ABC: a user-friendly approach to approximate Bayesian computation. Bioinformatics, 24(23), 2713–2719. doi:10.1093/bioinformatics/btn514

6 This package isn’t currently available via the website, but is available upon request from Arnaud Estoup: estoup@supagro.inra.fr

7 Lombaert, E., Guillemaud, T., Thomas, C. E., Lawson Handley, L. J., Li, J., Wang, S., ... Estoup, A. (2011). Inferring the origin of populations introduced from a genetically structured native range by approximate Bayesian computation: case study of the invasive ladybird Harmonia axyridis. Molecular Ecology, 20(22), 4654–4670. doi:10.1111/j.1365-294X.2011.05322.x

8 http://datadryad.org/resource/doi:10.5061/dryad.7m0b37bn

9 Blum, M. G. B., Nunes, M. A., Prangle, D., & Sisson, S. A. (2012, February 16). A comparative review of dimension reduction methods in approximate Bayesian computation. arXiv.org.

10 Robert, C. P., Cornuet, J.-M., Marin, J.-M., & Pillai, N. S. (2011). Lack of confidence in approximate Bayesian computation model choice. Proceedings of the National Academy of Sciences, 108(37), 15112–15117. doi:10.1073/pnas.1102900108
