RADseq and missing data: some considerations

Unlike Sanger sequencing, where loci are directly targeted for each individual and sequencing errors are relatively rare, massively multilocus datasets from next generation sequencing platforms are characterized by large amounts of missing data. This is particularly true for restriction digest based (RADseq) approaches, where data are lost at every stage from the lab bench to the computer (Figure 1). During RADseq library preparation, for instance, mutations at cutsites may directly generate null alleles, or newly-mutated cutsites within loci may reduce fragment size and thus cause allelic dropout when these fragments are lost during size selection. During sequencing, the random allocation of a finite number of reads across numerous loci and individuals results in discrepancies in coverage. And during data processing and assembly, decisions about the total number of variant sites allowed (sequence identity) and minimum number of reads required to properly genotype each individual for a locus (coverage threshold) further prune your matrix.

Figure 1 from Huang and Knowles (2016), highlight the origin of missing data from library preparation to sequence assembly.

Figure 1 from Huang and Knowles (2016), highlighting the origin of missing data during steps from library preparation to sequence assembly.

Which means if you have a pile of RADseq data — as most of us do, these days — it’s necessarily going to be patchy. But what are the effects of missing data on inferences, and how should they be handled to best reduce their biases? Though major questions remain, the following four studies offer some insight.

Allelic dropout results in overestimates of genetic variation: An obvious starting point to understanding the implications of missing RADseq data is exploring how allelic dropout (ADO) from mutations at cutsites affect population genetic inferences. In a 2012 paper in Molecular Ecology, Matthew Gautier and colleagues asked just that, mathematically deriving the influence of ADO on allele frequencies and using a RADseq dataset simulated under a coalescent model to explore its influence on expected heterozygosity and FST. Their major takeaways: the frequency of ADO depends on mutation rate and effective population size, and ADO results in overestimates genetic variation both within and between populations. What to do about it? Gautier et al. suggest that a “practical solution might consist in detecting and removing the RAD loci characterized by high ADO frequency (say with fr =0.5).”

Or does it introduce genealogical biases and underestimate diversity?: In a similar study based on coalescent simulations as well as an in silico digest of Drosophila genomes, Arnold et al. (2013) expanded on Gautier and crew’s work to explore the effect of missing data on common summary statistics (Pi, Theta, Tajima’s D, and FST.) But though their results are consistent with the earlier study in some ways — ADO can strongly influence population genetic summary statistic values, and enrich particular genealogical histories  — their interpretation differs. Specifically, their analysis found FST was largely reliable, but suggested ADO will result in underestimates heterozygosity. They claim this discrepancy results from the fact that Gautier et al. only considered segregating sites after their in silico digests, which would necessarily result in an inflation of minor allele frequency (and thus heterozygosity) because ADO is more likely to occur on the major allele haplotype. What do they say to do about it? “[Identifying] and pruning loci with incomplete sampling will be important in any RADseq experiment aimed at accurately estimating commonly used summary statistics”

But simply excluding missing data introduces its own biases: Gautier et al. and Arnold et al.’s advice is intuitively conservative. After all, if you’re worried about the influence of missing data on your study, it makes sense to limit the amount of null alleles you include in your matrix. But according to a recent paper in Systematic Biology by Huang and Knowles (first available online in 2014, and included in this month’s special issue on phylogenomic data in systematics), this too may be problematic. In their simulation study, Huang and Knowles discovered that stringent tolerances for the amount of missing data reduced the mutational spectrum represented in sampled loci, at the expense of those with the highest mutation rates (which makes sense, as they’d be most likely to feature a polymorphism at a cutsite in certain individuals and thus see more ADO). Additionally, trees estimated from “conservatively selected” loci (e.g., those present in a majority of individuals) were less accurate than those estimated from randomly selected loci. What to do about? Unfortunately, they believe their “results suggest that general rules-of-thumb are unlikely,” but highlight the importance of working to understand the particular factors influence missing data in your study design.

And what about phylogenies? Though initially use primarily in population genetic studies, RADseq data has become increasingly popular in phylogenetic reconstruction, particularly at shallower time scales. As part of a larger study exploring best practices for using RADseq data in phylogenetics, Leaché et al (2015) expanded on Huang and Knowles’ investigation of the influence of missing data on topology by characterizing its influence on branch length and bootstrap support. Using simulations and RADseq data for Phyrnosomatid lizards, they found that loci with more missing data resulted in 1) discordant topologies, 2) increased branch length errors and 3) lower bootstrap support, while loci with less missing data provide higher bootstrap support for shorter branches. What to do about it? Like Huang and Knowles, they believe there’s no silver bullet: “These problems make the selection of the “best” assembly a difficult problem, since the decision is likely to be a balance between obtaining a large number of SNPs and minimizing missing data.”

Figure 4 from Leaché et al (2015): the influence of missing data on phylogenetic tree parameters.

Figure 4 from Leaché et al (2015): the influence of missing data on phylogenetic tree parameters.

All of which is a lot to think about. What papers have I missed? Let me know in the in the comments.


Arnold, B., Corbett-Deitg, R. B., Hartl, D., Bomblies, K. 2013. RADseq underestimates diversity and introduces genealogical biases due to nonrandom haplotype sampling. Molecular Ecology. DOI: 10.1111/mec.12276

Gautier, M., Gharbi, K., Cezard, T., Foucaud, J., Kerdelhué, C., Pudlo, P., Cornuet, J., Estoup, A. 2013. The effect of RAD allele dropout on the estimation of genetic variation within and between populations. Molecular Ecology. DOI: 10.1111/mec.12089

Huang, H., Knowles, L. L. 2016. Unforeseen Consequences of Excluding Missing Data from Next-Generation Sequences: Simulation Study of RAD Sequences. Systematic Biology. DOI: 10.1093/sysbio/syu046

Leaché, A. D., Banbury, B. L., Felsenstein, J., Nieto-Montes de Oca, A., Stamatakis, A. 2015. Short Tree, Long Tree, Right Tree, Wrong Tree: New Acquisition Bias Corrections for Inferring SNP Phylogenies. Systematic Biology. DOI:10.1093/sysbio/syv053



About Ethan Linck

I'm a Ph.D. Candidate at the Department of Biology and the Burke Museum of Natural History, University of Washington, Seattle. I'm interested in population genetics, speciation, and natural history, mostly in birds.
This entry was posted in bioinformatics, genomics, methods, Molecular Ecology, the journal, next generation sequencing, phylogenetics, population genetics, theory and tagged , , , . Bookmark the permalink.