Machine learning for model selection in population genomics
The application of model-based methods in phylogeography helped the field transition from a more qualitative, overlay-a-tree-on-a-map discipline to one that tests hypotheses in robust statistical frameworks. Many researchers have embraced approximate Bayesian computation (ABC) for model selection because computing the likelihood of complex population models is often intractable. ABC is easy to use and provides a posterior distribution but, like all methods, it has weaknesses. For example, ABC requires a very large amount of simulated data, and choosing which summary statistics and rejection threshold to use is difficult.
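To make the roles of the simulations, the summary statistic, and the rejection threshold concrete, here is a minimal sketch of ABC rejection sampling in Python. It is a toy illustration, not a population genetic analysis: the simulator draws from a normal distribution instead of a coalescent model, and the names simulate, summary, and abc_rejection, along with the prior range and the threshold epsilon, are assumptions made up for this example.

```python
import random
import statistics


def simulate(theta, n, rng):
    """Toy simulator (an assumption): n draws from Normal(theta, 1)."""
    return [rng.gauss(theta, 1.0) for _ in range(n)]


def summary(data):
    """Summary statistic: here simply the sample mean."""
    return statistics.fmean(data)


def abc_rejection(observed, n_sims, epsilon, rng):
    """Keep prior draws whose simulated summary falls within epsilon
    of the observed summary; the kept draws approximate the posterior."""
    s_obs = summary(observed)
    accepted = []
    for _ in range(n_sims):
        theta = rng.uniform(-5.0, 5.0)  # draw a parameter from a flat prior
        s_sim = summary(simulate(theta, len(observed), rng))
        if abs(s_sim - s_obs) <= epsilon:  # the rejection threshold
            accepted.append(theta)
    return accepted


rng = random.Random(42)
observed = simulate(2.0, 50, rng)  # stand-in "observed" data, true theta = 2
posterior = abc_rejection(observed, n_sims=20000, epsilon=0.1, rng=rng)
print(len(posterior), "accepted draws; posterior mean:",
      round(statistics.fmean(posterior), 2))
```

Even in this toy case only a small fraction of the 20,000 simulations is kept, which hints at why ABC needs so much simulated data. Shrinking epsilon makes the approximation tighter but discards even more simulations, and a poorly chosen summary statistic would throw away information no threshold can recover.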
In a recent preprint posted on bioRxiv, Sara Sheehan and Yun Song present a likelihood-free inference framework for population genomics that applies deep learning, an active area of machine learning research. Their aim is to jointly infer natural selection and changes in population size, two processes that can leave similar signatures in the genome, and they test their method on simulated data and on empirical data for Drosophila melanogaster. Sheehan and Song also provide accessible introductions to ABC and deep learning in their preprint that are well worth the read.
Comparing their deep learning results with those from ABCtoolbox, Sheehan and Song found that deep learning produced better estimates of recent population size changes, while both methods had similar accuracies for population size changes in the more distant past. In the simulated data, ancient population size changes were the least accurately estimated, and some hard sweeps were misidentified as neutral. Additional testing by the authors showed that this error occurred because many of the sweeps had not gone to completion, which explains why some regions appeared neutral (see Table 5 in the preprint). In the empirical Drosophila data, the deep learning method detected a post-bottleneck expansion, consistent with previously published results that place the start of the range expansion of Drosophila melanogaster out of sub-Saharan Africa at around 15,000 years ago. In analyses restricted to genomic regions classified with probability greater than 0.9999, 47 hard sweeps, 69 soft sweeps, and 18 regions under balancing selection were detected.
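The probability cutoff in that last step can be sketched in a few lines of Python. This is only an illustration of the filtering idea, assuming each genomic region comes with one probability per class; the function name count_confident_calls, the class labels, and all of the probabilities below are made up for the example and are not values from the preprint.

```python
from collections import Counter


def count_confident_calls(regions, threshold=0.9999):
    """Count regions whose most probable class exceeds the threshold."""
    counts = Counter()
    for class_probs in regions:
        # Pick the class with the highest probability for this region.
        label, prob = max(class_probs.items(), key=lambda item: item[1])
        if prob > threshold:
            counts[label] += 1
    return counts


# Hypothetical per-region class probabilities (illustrative values only).
regions = [
    {"neutral": 0.00002, "hard": 0.99995, "soft": 0.00002, "balancing": 0.00001},
    {"neutral": 0.60, "hard": 0.25, "soft": 0.10, "balancing": 0.05},
    {"neutral": 0.00001, "hard": 0.00002, "soft": 0.99996, "balancing": 0.00001},
    {"neutral": 0.00003, "hard": 0.00003, "soft": 0.00003, "balancing": 0.99991},
    {"neutral": 0.0005, "hard": 0.9990, "soft": 0.0003, "balancing": 0.0002},
]

calls = count_confident_calls(regions)
print(dict(calls))
```

With a cutoff this strict, the second and fifth regions are dropped even though each has a clear most probable class, so the resulting counts are best read as a conservative set of high-confidence calls.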
The application of deep learning to phylogeographic model selection has several promising future directions. These include using deep learning to select informative statistics for a subsequent analysis such as ABC, and combining the strengths of coalescent theory with those of machine learning to create even more robust methods of inference in population genomics.
Sheehan and Song (2015) Deep learning for population genetic inference. bioRxiv DOI: 10.1101/028175