Comparing your options for phylogenomic data

The choices for current-generation (last generation?) molecular markers are grouped in two primary camps.
First, the “reduced representation” methods: take some DNA, cut it up with specific enzymes, tag those pieces, read the sequences. These methods produce lots and lots of single nucleotide polymorphisms (SNPs) and can be used for just about any taxon your heart desires. The most common acronyms you’ve read are probably RADseq (restriction site associated DNA sequencing) and ddRADseq (double digest restriction site associated DNA sequencing).
Second, the “targeted enrichment” methods: buy some probes that attach to highly conserved areas of DNA and sequence them along with the flanking regions. These methods provide loci that are more likely to be found across divergent taxa, which both expands the scope of phylogenetic questions and reduces missing data. The most common acronyms you’ve read are probably UCE (ultra-conserved elements) and exon-cap (exon-capture or anchored hybrid enrichment).
Even though these molecular resources are relatively new, it is incredible how quickly the “word on the street” pigeonholes certain tools for certain questions. For example, maybe you’ve been told that you can’t use RADseq for phylogenetic questions. Maybe you’ve heard that UCEs are only helpful for the deepest of nodes in a phylogeny. I’m not sure how these statements are perpetuated, but a recent pre-print from Rupert Collins and Tomas Hrbek may be a good starting point for those intrepid researchers who are asking themselves what molecular data will take them to the promised land of phylogenetic inference.
The authors downloaded 23 complete primate genomes in order to manufacture reduced representation (RADseq and ddRADseq) and targeted enrichment (UCE and exon-capture) datasets. RADseq and ddRADseq data were collected by simulating where the enzymes used in those protocols would cut pieces of DNA from the whole genomes. In a perfect world, these pieces would be the same as what you would be left with if the entire RAD protocol were conducted at your lab bench. UCE and exon-capture data were collected by converting those whole genomes into separate BLAST databases so the authors could search for the specific probes associated with either approach. In addition to these four methods, Collins and Hrbek obtained two additional datasets for some of the primate species, one based on Sanger-sequenced exons and one based on mitochondrial DNA.
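If "simulating where the enzymes would cut" sounds abstract, here is a minimal Python sketch of the idea. It is not the authors' pipeline: the enzymes (EcoRI and MspI), the read length, the size-selection window, and the toy sequence are all assumptions chosen for illustration, and strand orientation is ignored for simplicity.

```python
import re

def cut_positions(seq, site, offset):
    """Return 0-based positions where an enzyme cuts `seq`.

    `site` is the recognition sequence; `offset` is how many bases into
    the site the cut falls (1 for EcoRI, G^AATTC, and for MspI, C^CGG).
    """
    # A lookahead pattern also finds overlapping recognition sites.
    pattern = f"(?={re.escape(site)})"
    return [m.start() + offset for m in re.finditer(pattern, seq)]

def rad_loci(seq, site, offset, read_len):
    """Single-digest RAD: one read-length tag on each side of every cut."""
    loci = []
    for pos in cut_positions(seq, site, offset):
        upstream = seq[max(0, pos - read_len):pos]
        downstream = seq[pos:pos + read_len]
        loci.extend(tag for tag in (upstream, downstream) if tag)
    return loci

def ddrad_fragments(seq, rare, common, min_len, max_len):
    """Double-digest RAD: keep fragments that are bounded by one cut from
    each enzyme and whose length falls inside the size-selection window."""
    cuts = [(p, "rare") for p in cut_positions(seq, *rare)]
    cuts += [(p, "common") for p in cut_positions(seq, *common)]
    cuts.sort()
    fragments = []
    for (start, left), (end, right) in zip(cuts, cuts[1:]):
        if left != right and min_len <= end - start <= max_len:
            fragments.append(seq[start:end])
    return fragments

# Hypothetical toy "genome" and enzyme choices, for illustration only.
ECORI = ("GAATTC", 1)
MSPI = ("CCGG", 1)
toy_genome = "AAAA" + "GAATTC" + "TTTTTTTT" + "CCGG" + "AAAA" + "GAATTC" + "GG"

toy_rad = rad_loci(toy_genome, *ECORI, read_len=4)
toy_ddrad = ddrad_fragments(toy_genome, ECORI, MSPI, min_len=10, max_len=20)
```

The size-selection window in `ddrad_fragments` is what makes the double digest so much more selective: a fragment has to sit between two different cut sites and be the right length, so far fewer pieces of the genome qualify.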
Four general characteristics were compared across the four (plus two extra) sets of loci:

  1. the number of recovered loci and proportions of missing data
  2. topological uncertainty and statistical support for resolving nodes
  3. consistency in branch length estimates
  4. phylogenetic informativeness
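
To make the first item concrete, here is a small Python sketch of how locus recovery and missing data could be tallied from a set of alignments. The data layout (a dictionary of loci, each holding aligned sequences per taxon), the function name, and the toy sequences are assumptions for illustration, not the format or code used in the study.

```python
def matrix_completeness(alignments, taxa):
    """Summarise a locus-by-taxon data matrix.

    `alignments` maps locus name -> {taxon: aligned sequence}. Sequences
    within a locus are assumed to be aligned to the same length.
    """
    missing_chars = set("-?Nn")
    dropped_cells = 0   # taxon/locus combinations with no sequence at all
    missing_sites = 0   # gap or ambiguous sites, plus sites lost to dropout
    total_sites = 0
    for seqs in alignments.values():
        length = len(next(iter(seqs.values())))
        for taxon in taxa:
            total_sites += length
            seq = seqs.get(taxon)
            if seq is None:
                dropped_cells += 1
                missing_sites += length
            else:
                missing_sites += sum(ch in missing_chars for ch in seq)
    return {
        "n_loci": len(alignments),
        "dropout_rate": dropped_cells / (len(alignments) * len(taxa)),
        "missing_site_rate": missing_sites / total_sites,
    }

# Hypothetical toy data: three taxa, two loci, one locus missing for lemur.
toy_taxa = ["human", "chimp", "lemur"]
toy_alignments = {
    "locus1": {"human": "ACGT", "chimp": "ACGT", "lemur": "AC-T"},
    "locus2": {"human": "GGCC", "chimp": "GGNC"},
}
toy_summary = matrix_completeness(toy_alignments, toy_taxa)
```

Keeping dropout (a whole locus missing for a taxon) separate from missing sites (holes within a recovered locus) is a design choice made here so the two kinds of missing data can be compared between methods.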

The results aren’t surprising in many ways. Methods that target conserved sites (UCE and exon-capture) produce loci that are more likely to be recovered across taxa. Reduced representation methods produce a greater number of loci.
However, for most of the nodes on the primate tree, all four methods effectively resolved the “real” solution. Everybody wins!

Figure 6 from Collins and Hrbek: mean divergence time estimates (relaxed clock) with 95% credible intervals at selected nodes


Within the reduced-representation methods, ddRADseq produces far fewer loci than RADseq, but allows investigators to greatly increase the number of samples sequenced at the same coverage. The smaller set of ddRADseq loci wasn’t able to resolve the oldest nodes on the primate tree, but both methods produced similar clade ages and levels of phylogenetic informativeness:

When compared to the results from the sequence capture methods, it is possible that the RADseq protocol generated more data than was actually necessary for resolving the phylogeny over the time scales studied here. However, the RADseq and ddRADseq data also have much higher relative, and in the case of RADseq also absolute information content, and thus are likely a better choice for resolving relationships at the population to species boundaries.
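
The loci-versus-samples trade-off between RADseq and ddRADseq is easy to see with some back-of-the-envelope arithmetic. The sketch below assumes reads are spread evenly across loci and samples, which real libraries never quite manage, and every number in it (run size, locus counts, coverage) is hypothetical rather than taken from Collins and Hrbek.

```python
def samples_per_run(total_reads, n_loci, target_coverage):
    """Number of samples that fit in one sequencing run if every locus in
    every sample should reach `target_coverage` reads (even spread assumed)."""
    reads_per_sample = n_loci * target_coverage
    return total_reads // reads_per_sample

# Hypothetical numbers, for illustration only.
run_reads = 200_000_000
coverage = 20
rad_samples = samples_per_run(run_reads, n_loci=100_000, target_coverage=coverage)
ddrad_samples = samples_per_run(run_reads, n_loci=10_000, target_coverage=coverage)
```

With these made-up figures, cutting the number of loci tenfold lets ten times as many samples share the same run at the same per-locus coverage.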

There were fewer differences between the sequence capture methods, as both confidently produced the correct tree. Exon-capture methods produced fewer (~1/4) loci compared to UCE, but those exon-capture loci had lower dropout rates and lower numbers of missing sites:

Comparing the UCE and exon-cap protocols, the latter provided the most complete data matrix, was least affected by phylogenetic divergence between taxa, and also displayed the most reliable, constant rate PI [phylogenetic informativeness] profile for molecular dating. With their greater degree of standardization and lower anonymity of the loci, both protocols also offer a more reliable solution to data sharing.

Okay, lots of small differences (including cost!), but who cares? If the taxa you study vary widely in divergence rate, have speciated rapidly (whether long ago or recently), or present some other evolutionary scenario that makes life difficult for a phylogeneticist, you will care.
Inevitably, unavoidably, inescapably, the right choice depends on the question you’re asking. Once you know that, Collins and Hrbek’s in silico study provides a nice starting point for finding your in situ data.
 
Cited
Collins, R. A., & Hrbek, T. (2015). An in silico comparison of reduced-representation and sequence-capture protocols for phylogenomics. bioRxiv, 032565.
 
