Mike Sovic on what comes AftrRAD
We’ve recently been highlighting some discussions comparing different protocols for restriction site-associated DNA sequencing (RADseq). We’ve seen the pros and cons of multiple techniques, but what happens when you finally have thousands of shiny SNPs sitting on your hard drive? Multiple programs have been developed to handle RADseq data, the newest of which called AftrRAD, was just released this week and described in Molecular Ecology Resources.
The lead author on the AftrRAD paper is Dr. Mike Sovic, who just so happens to be down the hall from me in the department of Evolution, Ecology, and Organismal Biology here at Ohio State. In addition to being incredibly knowledgeable about all things RADseq, Mike is genuinely accessible and always willing to apply his experience to others’ scientific challenges. He was kind enough to sit down and answer some questions about AftrRAD and the current state of RADseq:
Can you compare and contrast AftrRAD with other similar programs?
AftrRAD is comparable to Stacks and PyRAD, which are currently two of the more widely-used programs for analyzing RADseq data. Stacks is nice because it’s fast, but one downside is that it doesn’t account for indel variation. PyRAD accounts for indels, but can be computationally demanding, especially as datasets get large. It can perform parallel analyses to help compensate for this, but of course, this requires computational resources that allow for these analyses.
AftrRAD could be thought of as filling a niche in between. Like PyRAD, it accounts for indel variation, but does so in a more efficient way. It’s not as fast as Stacks, but it’s efficient enough that many datasets can be run on a laptop in reasonable timeframes.
Each program seems to have a certain connotation for its use. PyRAD is often talked about in the context of phylogenetics and Stacks is often tied to studies of hybridization/genomics. Are these reputations tied to actual methodological differences?
At some level, yes, but how important these differences are might depend on your system and the specific questions you want to address. For example, when divergence among samples increases, the likelihood of indel variation at loci also increases (think population genetic versus phylogenetic scales). I can imagine a scenario where using Stacks to analyze a high-indel dataset could result in some bias, as loci with indels will likely be oversplit, and in turn considered separate monomorphic loci. This could affect certain parameter estimates such as overall heterozygosity, and programs such as AftrRAD and PyRAD might be better choices in this scenario. However, if you were using that same high-indel dataset to identify SNPs for identifying species hybrids, any of the programs will likely provide plenty of informative markers to do the job. In this case, the computational efficiency associated with something like Stacks might be attractive.
I’m using indels as a specific example here, but there are other factors to consider too. These might include how the program deals with low-frequency alleles in the population (i.e. singletons), what methods are used to score genotypes once loci are assembled (i.e. maximum likelihood approaches vs. binomial tests), and how paralogous loci are identified. Again, it’s probably best to think about the methods the programs use in the context of the specific question(s) you’re trying to answer.
What are the most overlooked aspects to using RADseq (data collection, library prep, assembly, analysis, etc)?
The first thing that comes to mind is the quality of the library prep. In general, it seems people often think a lot about what program to use to analyze the data, and what suite of parameter values are best when running the program, and these things are important. However, our experience with RADseq data has suggested that the quality of the library prep and sequencing can have as much, or more, of an impact on the overall quality of a dataset than what analysis program you use or what parameter values you set in that program. In the user manual for AftrRAD we suggest a few things to look for to try to identify any potential problems with the libraries themselves.
At last year’s Evolution meeting, there was a lot of chatter over beers about missing data. What’s your perspective?
In our own datasets, we’ve seen varying levels of missing data. Early on, these often extended far beyond what you might reasonably expect from polymorphism at restriction sites. In many cases, it turned out that we had alleles and/or loci dropping out of the library prior to amplification, and some relatively small tweaks to our library prep protocol improved our datasets substantially (many thanks to a number of colleagues including Jeff DaCosta for helping improve these protocols). However, we sometimes still see relatively high levels of missing data in specific samples, or across libraries as a whole. This may be related to DNA quality or low sequencing coverage.
One thing to keep in mind is that when missing data is common, there is likely to be a related bias towards homozygosity at loci for which genotypes are scored. This is because whatever phenomenon is causing the missing data can also result in the loss of one of the two alleles at a truly heterozygous locus, leading to an incorrect homozygous call. It’s important to assess how much missing data might be in your dataset and understand how any systematic biases related to this missing data might influence the inferences you draw. We’ve tried to include output in AftrRAD that helps evaluate levels and patterns of missing data, and as I mentioned in the last question, we provide some suggestions in the user manual for specific patterns you might look for to identify it and determine its causes.
What is the process like to take software like this from idea to release?
We never really set out with a goal of developing software for general use. As a whole, we (myself and the rest of the Gibbs lab) tend to be much more interested in the biological questions these bioinformatic tools help us address. We started writing scripts when we began to generate genomic-scale data a few years ago. At the time, there weren’t many good options for analyzing RADseq data, so the scripts were written almost out of necessity, and were really just intended for us. There was certainly a steep learning curve in doing this, as neither Tony (Fries) nor I had much of a programming background to speak of (this is probably very evident to any real programmer who takes a look at some of our scripts). By the time other reasonable options became available for analyzing the data, we were pretty comfortable with the pipeline we had put together.
It took a while, but we eventually got around to cleaning it up and packaging it in a form for more general use. In the meantime, we were able to run it on a fairly diverse array of RADseq datasets from across several taxa, which was helpful in trying to identify things that needed tweaked in the scripts.
Now that AftrRAD is released, what does the future hold for you and/or the program?
Well, hopefully it doesn’t hold a lot of troubleshooting and bug fixing if folks start using the program. I’m in a great post-doc position right now where we’re addressing a number of questions relating to the conservation and management of various taxa. Developing bioinformatic resources is an emphasis only to the extent that it’s necessary to address those questions. So, realistically, while there’s a long list of features we’d like to add to try to add to AftrRAD, most of them are not immediately necessary for what we’re doing, and so probably won’t get done in the near future.
For now, I’ll do my best to keep up with fixing any minor bugs that arise, and then whenever I move on from my current position, I’ll maybe start thinking about more substantive additions/changes to AftrRAD. This will be driven largely by how much the program is used between now and then, and what demand there might be for any specific additional features.
Sovic M.G., Fries A.C. & Gibbs H.L. (2015). AftrRAD: A pipeline for accurate and efficient de novo assembly of RADseq data , Molecular Ecology Resources, n/a-n/a. DOI: http://dx.doi.org/10.1111/1755-0998.12378