The 2017 standalone meeting of the Society of Systematic Biologists included expert-led debates on major issues in molecular systematics. Didn’t make it to Baton Rouge? Don’t worry – Melissa DeBiasse and I report on some of the main points (and our favorite lightning talks) from Day 2 of the meeting. Be sure to read our rundown of Day 1 of this great meeting as well.
Emily argued that the species tree, composed of and generated from individual gene trees, is the framework by which we understand the evolution of species and clades. Matt, on the other hand, argued that focusing on a single bifurcating species tree topology clouds our understanding of the evolution of species and clades.
Under a species tree paradigm, we resolve or do away with the gene tree discordance. Under a gene tree paradigm, we keep that discordance and put it back into the analysis to get at the evolution of traits.
We could take the view that species trees are not real, that they don’t exist in nature but are constructs we create as scientists. Put another way, should we expect that the complexities of evolution can be expressed in a single, bifurcating tree? Even if we decide species trees are not real, they can still be useful models.
So which is more important? The gene tree or the species tree? On one hand, the species phylogeny is what most of us are ultimately interested in. We need the species tree because it is the reference against which we interpret horizontal gene transfer and incomplete lineage sorting. On the other hand, gene tree variation can tell us important things about the evolutionary process, such as rates of evolution, the extent of deep coalescence, and the true extent of hybridization across different species and temporal scales. Genes and gene trees underlie the traits in which we are often interested, although this idea is complicated by the fact that traits may follow a gene tree due to convergent evolution.
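One simple way to keep gene tree variation in view, rather than collapsing it into a single topology, is to ask what fraction of gene trees contain each clade of a candidate species tree. The Python sketch below is only an illustration of that idea, not a method presented at the meeting. It assumes rooted trees that have already been reduced to sets of clades, that every gene tree samples every taxon, and it uses made-up taxa A to D and a hypothetical helper named `clade_support`.

```python
from typing import Dict, FrozenSet, List, Set

Clade = FrozenSet[str]


def clade_support(species_clades: List[Clade],
                  gene_trees: List[Set[Clade]]) -> Dict[Clade, float]:
    """Fraction of gene trees that contain each species-tree clade.

    Assumes every gene tree samples every taxon; real data with missing
    taxa would need to count only the gene trees that can speak to a clade.
    """
    n = len(gene_trees)
    return {
        clade: sum(1 for tree in gene_trees if clade in tree) / n
        for clade in species_clades
    }


# Hypothetical species tree (((A,B),C),D), written as its internal clades.
species = [frozenset("AB"), frozenset("ABC")]

# Three hypothetical gene trees; the second groups B with C instead of A.
genes = [
    {frozenset("AB"), frozenset("ABC")},
    {frozenset("BC"), frozenset("ABC")},
    {frozenset("AB"), frozenset("ABC")},
]

support = clade_support(species, genes)
for clade, fraction in support.items():
    print(sorted(clade), round(fraction, 2))
```

A clade with low support across gene trees is exactly the kind of discordance the gene tree view would keep in the analysis instead of resolving away.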
Missing data are commonplace in the phylogenomic era, due to stochasticity or taxon- or sample-specific biases in capture or sequencing. To begin to understand how missing data affect phylogenomic inference, we can draw on systematics literature from the past several decades. Not all of these studies may be applicable to phylogenomic-scale datasets, however, and thus ongoing theoretical and simulation work is critical.
To be safe, why not just exclude missing data, especially when you might have gobs of (more-complete) data? Because not all data are created equal! Accurate inferences are still possible from loci that have missing data, provided those loci have phylogenetic signal. Some have argued that it is only this signal that is important; as long as at least some complete loci are included, large amounts of missing data do not hinder inferences.
On the other hand, throwing all data in the phylogenetic “grinder” can have repercussions. Adding lots of missing data can expand phylogenetic space (the distribution of possible trees and their branch lengths) in complex ways, and may also complicate the likelihood surface. Also, if there is minimal overlap between “full” datasets and those with missing data, then inferences can be impacted, for example, by creating rogue taxa. See Mark Holder’s slides for more on these ideas.
An important distinction to make for any dataset is whether 1) data are missing at random with respect to taxa, or 2) entire taxa are missing from some loci. The effects of missing data on tree inference are likely more severe if your data fit model #2, because loci with missing taxa cannot even pretend to guide placement of those taxa. Most datasets likely include a mixture of sources of missing data, and sensitivity analyses will be key.
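To make the two kinds of missing data concrete, here is a minimal Python sketch that separates them for each taxon: the fraction of loci from which the taxon is absent altogether, and the fraction of missing sites in the loci where it is present. Everything here is an assumption for illustration: the toy matrix, the taxon names, the choice of `?`, `-` and `N` as missing-site symbols, and the hypothetical helper `missingness`.

```python
from typing import Dict, List, Optional, Tuple

# Characters treated as missing sites in this sketch (an assumption).
MISSING = set("?-N")

# locus -> {taxon -> sequence}; None or an absent key means the taxon
# was not recovered for that locus at all.
Matrix = Dict[str, Dict[str, Optional[str]]]


def missingness(matrix: Matrix,
                taxa: List[str]) -> Dict[str, Tuple[float, float]]:
    """Per taxon: (fraction of loci absent, fraction of sites missing)."""
    summary = {}
    for taxon in taxa:
        absent = 0
        missing_sites = 0
        total_sites = 0
        for seqs in matrix.values():
            seq = seqs.get(taxon)
            if seq is None:
                absent += 1  # whole taxon missing from this locus
                continue
            total_sites += len(seq)
            missing_sites += sum(1 for ch in seq.upper() if ch in MISSING)
        site_fraction = missing_sites / total_sites if total_sites else 0.0
        summary[taxon] = (absent / len(matrix), site_fraction)
    return summary


# Hypothetical two-locus dataset.
toy = {
    "locus1": {"A": "ACGT", "B": "AC??", "C": None},
    "locus2": {"A": "ACGT", "B": "ACGT"},
}

report = missingness(toy, ["A", "B", "C"])
for taxon, (absent_fraction, site_fraction) in report.items():
    print(taxon, absent_fraction, site_fraction)
```

In this toy case taxon B has scattered missing sites, while taxon C is absent from every locus, which is the situation where no locus can inform its placement. A summary like this is one possible starting point for the sensitivity analyses mentioned above.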
Although we typically view missing data as a bad thing, when mining genomes or transcriptomes for loci, missing data could tell us about the evolution of gene gain and loss in gene families and across taxa.
Finally, Tracy Heath dropped some perspective on missing data. Attempts at using molecular data to date the Tree of Life are themselves a consequence of an imperfect fossil record (read: missing data). However, the fossil record will always be incomplete. Thus we must keep trying to understand how a (comparatively small) amount of missing genetic data impacts topology, branch lengths, and molecular date estimates.
Bryan Carstens discussed his lab’s (macro)comparative approach to understanding phylogeographic patterns and their drivers.
Emily Ellis talked about her work that supports higher diversification rates in clades with bioluminescent courtship displays.
Lyndon Coghill is reviving work on the covarion substitution model, asking if it can improve phylogenomic inferences using markers such as UCEs.
Sarah Jacobs showed with her work on the Castilleja ambigua species complex that power analyses help assess whether a dataset has enough statistical power to support species delimitation.