Next-generation sequencing (NGS) has put gobs of sequence data in the hands of molecular biologists, and those data are measurably advancing our prospects for a fully resolved Tree of Life. Nearly simultaneously, however, we have realized (not surprisingly) that every NGS dataset has unique properties, such as the number of loci you can expect to generate, the variability of these loci, their usefulness at either shallow or deep timescales, etc.
A question being posed of all types of NGS datasets is: how do missing data affect phylogenetic inference? This is not a new topic, but it has recently taken on new fervency in genomic-scale studies, where missing data are commonplace. The Molecular Ecologist has blogged a bit about this recently as well (see here and here). Because my group is currently using ultraconserved elements (UCEs) to address phylogenomics of several different mammal taxa, I was curious whether any consensus was emerging on how missing data in these particular markers impact phylogenetic inference.
To satisfy this curiosity, I conducted a brief – but taxonomically and methodologically representative – survey of recent literature, focusing specifically on papers with datasets characterized by some percent of missing loci, and where these were analyzed alongside more complete datasets. Some common themes emerged. First, in concatenated analyses (both ML and Bayesian), including more UCE loci at the expense of more missing data nearly always increases branch support. Also, in the majority of papers I read, missing data impact topology only minimally, and often not at all. This is consistent with some previous assertions (albeit ones based on single empirical datasets) that a relatively high amount of missing UCE data (20-50%) may not greatly affect historical inferences.
Second, UCE-based species trees built from summary coalescent or quartet approaches appear slightly more sensitive to missing data, both in terms of topology and support values. Still, the topological variation observed is often small. Moreover, when anomalous or highly incongruent trees are recovered, they are usually the ones built from highly complete (sometimes 100% complete) datasets. This might be expected, because target capture of UCEs yields far fewer loci than some other methods, such as RADseq. Also, UCEs are by their nature often minimally variable. Therefore, a low tolerance for missing data can lead to exclusion of a large proportion of loci (occasionally >90%) and, depending on the system, final datasets with pretty low levels of phylogenetic signal.
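To make the trade-off concrete, here is a minimal Python sketch of completeness filtering on a toy dataset. The locus names, taxon labels, and the helper functions `filter_by_occupancy` and `missing_fraction` are all hypothetical and written only for illustration; they are not taken from any of the surveyed papers or from a particular UCE pipeline. The sketch assumes each locus is described simply by the set of taxa in which it was recovered.

```python
def filter_by_occupancy(loci, n_taxa, min_occupancy):
    """Keep loci recovered in at least `min_occupancy` (0-1) of the taxa."""
    return {
        name: taxa
        for name, taxa in loci.items()
        if len(taxa) / n_taxa >= min_occupancy
    }


def missing_fraction(loci, n_taxa):
    """Fraction of empty locus-by-taxon cells in the retained matrix."""
    if not loci:
        return 0.0
    present = sum(len(taxa) for taxa in loci.values())
    return 1 - present / (len(loci) * n_taxa)


# Toy data (hypothetical): which of four taxa were recovered for each locus.
toy_loci = {
    "uce-1": {"A", "B", "C", "D"},
    "uce-2": {"A", "B", "C"},
    "uce-3": {"A", "B"},
    "uce-4": {"A", "B", "C", "D"},
    "uce-5": {"B", "C", "D"},
}

for threshold in (1.0, 0.75, 0.5):
    kept = filter_by_occupancy(toy_loci, n_taxa=4, min_occupancy=threshold)
    print(threshold, len(kept), round(missing_fraction(kept, 4), 3))
```

In this toy case, demanding 100% completeness keeps only two of five loci, while allowing loci present in 75% of taxa keeps four of them. Real UCE datasets have thousands of loci, but the same arithmetic is what drives the large exclusions described above.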
So how are species tree methods best utilized with incomplete UCE datasets? This is definitely a fine line, because additional evidence from other types of genomic data suggests summary coalescent methods in particular (such as ASTRAL) perform better when missing data are minimized. One solution is to choose the most phylogenetically informative loci, and to tolerate some small level of missing data in those loci. This could have the effect of maximizing returns when data are incomplete. The optimal level of missing data for such an approach is likely less than that permissible under concatenation, but exact numbers are still hard to come by, and these probably differ depending on the system. Given the significant recent advances in summary and quartet methodologies, future work that characterizes the performance of these approaches for UCEs in the presence of different amounts of missing data will be a ripe research area to pursue.
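One way to picture the "most informative loci, with a little missing data" idea is the rough Python sketch below. It rests on assumptions of my own: it uses the count of parsimony-informative sites as a stand-in for phylogenetic informativeness (other measures could be substituted), it treats a taxon as missing when it is absent from a locus alignment, and the function names, thresholds, and toy alignments are all hypothetical.

```python
from collections import Counter


def parsimony_informative_sites(alignment):
    """Count columns with at least two nucleotide states, each in >= 2 taxa."""
    seqs = [seq.upper() for seq in alignment.values()]
    count = 0
    for column in zip(*seqs):
        # Gaps and ambiguity codes are ignored; only A, C, G, T are counted.
        states = Counter(base for base in column if base in "ACGT")
        if sum(1 for n in states.values() if n >= 2) >= 2:
            count += 1
    return count


def select_loci(alignments, n_taxa, min_occupancy, top_k):
    """Rank loci that pass the occupancy filter by informative-site count."""
    eligible = [
        (parsimony_informative_sites(aln), name)
        for name, aln in alignments.items()
        if len(aln) / n_taxa >= min_occupancy
    ]
    # Most informative first; ties broken alphabetically by locus name.
    eligible.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in eligible[:top_k]]


# Toy alignments (hypothetical) for five taxa, t1..t5.
toy_alignments = {
    "uce-a": {"t1": "ACGTAC", "t2": "ACGTAC", "t3": "ATGTCC",
              "t4": "ATGTCC", "t5": "ATGTAC"},
    "uce-b": {"t1": "AACG", "t2": "AACG", "t3": "TTGG", "t4": "TTGG"},
    "uce-c": {"t1": "AAAA", "t2": "AAAA", "t3": "AAAT",
              "t4": "AAAA", "t5": "AAAA"},
    "uce-d": {"t1": "ACGT", "t2": "TGCA"},
}

print(select_loci(toy_alignments, n_taxa=5, min_occupancy=0.75, top_k=2))
print(select_loci(toy_alignments, n_taxa=5, min_occupancy=1.0, top_k=2))
```

Here the most informative toy locus, `uce-b`, is missing one taxon, so it is retained only when a small amount of missing data is tolerated. Under a strict 100% completeness rule it is dropped and the invariant-looking `uce-c` takes its place.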