Picking the ripest model with PHRAPL

Authors are free to use my awful logo, no charge.

To study patterns of genetic variation is to consider scale. The choices an investigator makes when designing a study can produce such a beautiful breadth of evolutionary patterns: from populations to species, from local to continental, from ancient to contemporary. The fields that combine to describe these different scales are sometimes disparate and sometimes highly integrative. Fields like population genetics have benefitted from extensive theory that has been reproduced in the field and the lab. Fields like phylogenetics have rapidly expanded thanks to the power provided by expanding sequencing/computing technology.
Phylogeography, the discipline in between population genetics and phylogenetics, gets the advantages of both fields along with twice the headaches. A lot of things can happen while populations shrink, expand, differentiate, and come together again. This complexity makes fitting analyses that were designed for different scales a challenge. For example, most phylogenetic methods used in a phylogeographic context will show the evolutionary relationship between taxa. However, these techniques aren’t necessarily designed to consider gene flow between groups, which can alter conclusions. Similar limitations happen in the opposite direction when an analytical method for population genetics is scaled up to answer phylogeographic questions.
If you are interested in understanding the demographic history of a taxon, this all adds up to a lot of parameters to consider: population size changes, migration, divergence, drift, among others. The number of evolutionary scenarios that could have shaped a specific pattern of genetic variation can be staggering. Therefore, phylogeographers have often limited themselves to only the models they would most expect given the information they already possess. As you can imagine, this is often a tenuous situation, especially in cases where exploratory analyses are needed.

A new paper was released last week over at bioRxiv that aims to tackle this limitation using an approach called PHRAPL (Phylogeographic Inference Using Approximate Likelihoods):

Since no single model incorporates every possible evolutionary process, researchers rely on intuition to choose the models that they use to analyze their data. Here, we develop an approach to circumvent this reliance on intuition.

I want a 3D-printed demographic model. You want one. We all want one.

Now, I’ve been hearing about PHRAPL for a while through conference tweets and mentions from colleagues. I’ve also come to realize that I had no idea what PHRAPL actually does. Most embarrassing of all? One of the project leaders for the program, Bryan Carstens, has an office just one floor above mine. To finally get some perspective on this new approach, I braved the arduous journey up one flight of stairs to ask.
Bryan told me that PHRAPL began with a good idea that he didn’t think was possible. What started at dinner between himself and lead author Brian O’Meara evolved into an NSF grant and multiple years of project development.
The analogy that the authors make about PHRAPL is that it fills a niche like the program Modeltest, but for phylogeography. Modeltest and its derivatives are used to select the most appropriate model of nucleotide substitution for sequences before they are used to construct phylogenies. Similarly, PHRAPL is used to select the best-fitting model of demographic history for gene trees before they are used for parameter estimation or other downstream analyses. However, I’d abandon this analogy when it comes to thinking about the consequences of making the wrong choice. Choosing an incorrect nucleotide substitution model can cause problems, but I’d wager that choosing the wrong demographic model results in a greater degree of lost inference.

PHRAPL compares the topology of gene trees estimated from empirical data to those simulated under various demographic models. It then approximates the probability of the data given those demographic models by calculating the proportion of times that simulated gene tree topologies match the empirical topologies (O’Meara 2010), and adopts a multiple model inference framework (Burnham and Anderson, 2004) to quantify the support for each model in the comparison set.
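
To make the quoted idea concrete, here is a minimal Python sketch of the two steps it describes: approximating a likelihood as the proportion of simulated topologies that match each observed one, then turning the resulting scores into Akaike weights. It is a toy under stated assumptions, not PHRAPL's actual implementation. The function names, the use of plain strings as topology labels, the floor for unseen topologies, the parameter counts, and the simulated tree counts are all made up for illustration.

```python
import math
from collections import Counter

def approx_log_likelihood(observed, simulated):
    """Sum of log match proportions for the observed topologies.

    Each topology is a hashable label (here a Newick-like string that is
    assumed to be written in one canonical form).
    """
    if not simulated:
        raise ValueError("need at least one simulated topology")
    counts = Counter(simulated)
    n = len(simulated)
    # Assumption: a topology never seen in simulation gets a small floor
    # probability so the log is defined. This is an illustrative choice.
    floor = 1.0 / (2 * n)
    total = 0.0
    for topo in observed:
        p = counts[topo] / n
        total += math.log(p if p > 0 else floor)
    return total

def akaike_weights(log_liks, n_params):
    """Akaike weights from log-likelihoods and per-model parameter counts."""
    aic = {m: 2 * n_params[m] - 2 * ll for m, ll in log_liks.items()}
    best = min(aic.values())
    rel = {m: math.exp(-0.5 * (a - best)) for m, a in aic.items()}
    norm = sum(rel.values())
    return {m: r / norm for m, r in rel.items()}

# Hypothetical data: three observed gene trees and ten simulated trees per model.
observed = ["((A,B),C)", "((A,B),C)", "((A,C),B)"]
simulated = {
    "isolation_only": ["((A,B),C)"] * 8 + ["((A,C),B)"] + ["((B,C),A)"],
    "migration_only": ["((A,B),C)"] * 4 + ["((A,C),B)"] * 3 + ["((B,C),A)"] * 3,
}
n_params = {"isolation_only": 1, "migration_only": 1}  # made-up counts

log_liks = {m: approx_log_likelihood(observed, s) for m, s in simulated.items()}
weights = akaike_weights(log_liks, n_params)
```

With these invented numbers the isolation-only model gets a weight of 4/7 (about 0.57) and the migration-only model 3/7 (about 0.43). The point is that support is spread across models rather than handed to a single winner.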

The idea behind this is simple enough: take all the potential demographic models for however many groups and figure out which fit the data best. Before PHRAPL, all those potential models were a computational nightmare. O’Meara and colleagues use a combination of several shortcuts (population label switching, putting parameter values in a grid format, incorporating non-uniform distribution of gene tree probabilities, among others) to make this process tractable. Best of all, it’s (relatively) fast! A large set of models took around a week on average to complete.
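
One of those shortcuts, evaluating parameters on a grid, can be sketched with a toy example. This is not PHRAPL's code: the simulator, the function names, the grid values, and the observed trees are all assumptions chosen for illustration. The toy model has two samples from population A and one from population B, separated by a divergence time `tau` in coalescent units. Going back in time, the two A lineages coalesce before the split with probability 1 - exp(-tau). Otherwise all three lineages join in random order in the ancestral population.

```python
import math
import random
from collections import Counter

TOPOLOGIES = ["((A1,A2),B)", "((A1,B),A2)", "((A2,B),A1)"]

def simulate_topology(tau, rng):
    """Draw one gene tree topology under the toy isolation-only model."""
    # Waiting time for A1 and A2 to coalesce is exponential with rate 1.
    if rng.expovariate(1.0) < tau:
        return TOPOLOGIES[0]
    # No coalescence before the split: all three resolutions equally likely.
    return rng.choice(TOPOLOGIES)

def grid_search(observed, tau_grid, n_sims=20000, seed=1):
    """Score every tau on the grid by the match-proportion log-likelihood."""
    rng = random.Random(seed)
    scores = {}
    for tau in tau_grid:
        counts = Counter(simulate_topology(tau, rng) for _ in range(n_sims))
        ll = 0.0
        for topo in observed:
            # Assumption: unseen topologies count as half a match so the
            # log stays defined.
            ll += math.log(max(counts[topo], 0.5) / n_sims)
        scores[tau] = ll
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical data: nine concordant gene trees and one discordant tree.
observed = [TOPOLOGIES[0]] * 9 + [TOPOLOGIES[1]]
best_tau, scores = grid_search(observed, [0.1, 0.5, 1.0, 2.0, 4.0])
```

For this toy model the concordant topology has probability 1 - (2/3)exp(-tau), so 90% concordance points to a `tau` near 1.9, and the grid search settles on the nearest grid value, 2.0. A fixed grid trades resolution for a known, bounded number of simulations per model, which matters when the model set is large.
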
Other approaches for ranking demographic models, mainly Approximate Bayesian Computation (ABC), may not be as efficient in selecting the optimal model from among a large set of possible models. Carstens suggests that PHRAPL would work well in combination with other methods such as ABC or fastsimcoal2, which are likely to provide very accurate parameter estimates once PHRAPL has identified a particular model.
In addition to using PHRAPL on simulated data, O’Meara and colleagues reanalyze data from 19 other studies that used either a phylogenetic (*BEAST) or an isolation-with-migration (IMa2) approach. Interestingly, the type of model chosen using PHRAPL didn’t necessarily match that used by the authors of these other studies, indicating that the wide net cast by PHRAPL might catch unexpected models that sneak by otherwise.

Our reanalysis of empirical datasets highlights the utility of phylogeographic model selection by demonstrating that the intuition of researchers (inclusive to some of the authors of this paper) is sometimes flawed in choosing the models used to analyze data from empirical systems.

Figure 3 from O’Meara et al. (2015) compares the weighted probabilities (gray dots) for 19 datasets as they relate to other types of analysis methods (percentage of studies using each method at tips of triangle). Few models show support for a strictly isolation-only, isolation-with-migration, or migration-only model. While most studies relied on isolation-only methods, PHRAPL suggests an under-appreciation of gene flow in phylogeographic studies.

Modeltest, the analogy that the authors use to explain PHRAPL’s utility, is quickly approaching 18,000 citations. Has phylogeography found its next important tool? We’ll have to wait on the next coalescent event (of scientific opinion).

By allowing a direct probabilistic assessment of nearly any coalescent model to the empirical data, PHRAPL represents a substantial addition to the methodological toolbox available to phylogeographers.

O’Meara, B. C., Jackson, N. D., Morales-Garcia, A. E., & Carstens, B. C. (2015). Phylogeographic Inference Using Approximate Likelihoods. bioRxiv, 025353.
