Q&A: Julian Catchen helps us dig into STACKS – Part II

As promised, below is part II of our interview with Julian Catchen. These questions focus more on the specifics of using Stacks (i.e., user-related questions). Please see the first post if you are interested in a general overview. Even more information, including tutorials and a manual, can be found at the Stacks website. Without further ado…
What kind of data can Stacks accommodate? Is it better at accommodating certain types of data (e.g., Illumina vs. 454)?
Stacks is definitely optimized for Illumina data. The amount of sequence the Illumina platform produces (150-200 million reads per lane) is ideal for multiplexing lots of samples together, while the types of errors produced are manageable using quality scores and Stacks’ likelihood-based SNP calling model. In principle, you can use 454 data, but despite its longer reads, its lack of sequencing depth makes it impractical and prohibitively expensive. We haven’t yet had a lot of experience with Ion Torrent data, and I have heard that a few people are using Illumina MiSeq data with Stacks.

Is there an optimal tradeoff between sequencing depth and number of individuals multiplexed? Is there some sort of a priori (or pilot-run) analysis that users could employ to determine how many individuals they could and/or should barcode and multiplex?
Based on our experience, the SNP model works best with a minimum of 35x coverage. If you have lower coverage, you can still get good results, but you will have more errors in SNP calling. It is important to do a back-of-the-envelope calculation of how many samples you can multiplex in a lane. You will generally lose about 20% of your reads when you cull your data using quality scores. So, if you assume 150-160 million reads, 35 reads per restriction enzyme cut site, and X cut sites, you can do the algebra to determine the number of individuals you can multiplex. If you don’t know X, you can either look at a closely related species with a reference genome to estimate X, or do a pilot run with a small sample of your data – perhaps a colleague can “spike in” a sample to see what it’s going to look like.
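
To make that back-of-the-envelope calculation concrete, here is a minimal sketch in Python. The lane yield, retention rate, target depth, and cut-site count are placeholder assumptions for illustration – substitute your own estimates.

```python
# Back-of-the-envelope multiplexing estimate (illustrative numbers only).
# All values below are assumptions; replace them with your own estimates.
reads_per_lane = 150_000_000   # assumed Illumina lane yield
retention = 0.80               # assume ~20% of reads lost to quality filtering
target_depth = 35              # desired coverage per cut site
cut_sites = 100_000            # X: estimated restriction sites, e.g. from a related genome

usable_reads = reads_per_lane * retention
reads_per_sample = target_depth * cut_sites
max_samples = usable_reads // reads_per_sample

print(f"Usable reads per lane: {usable_reads:,.0f}")
print(f"Reads needed per sample: {reads_per_sample:,}")
print(f"Samples you could multiplex: {max_samples:.0f}")
```

With these example numbers, roughly 34 individuals would fit in a lane; halving the estimated cut-site count would roughly double that.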
What is a good barcoding strategy when samples have been collected from multiple populations? What about multiple species from multiple populations?
Illumina requires a certain amount of barcode diversity within the first three nucleotides of inline barcodes, so people designing barcode oligos should take care to provide it. A second consideration is that if you design your barcodes to be more than two nucleotides apart in sequence space, then Stacks can correct sequencing errors in the barcode sequence (this can’t be done if a single sequencing error can flip one legitimate barcode into a second legitimate one).
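
As a quick sanity check on that spacing criterion, you can compute all pairwise Hamming distances in a candidate barcode set. This is a generic sketch rather than part of Stacks, and the example barcodes below are made up.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length barcodes differ."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical inline barcodes; real sets are usually designed or vendor-supplied.
barcodes = ["ACGTA", "ACTGC", "TGCAT", "GATCC"]

# A minimum pairwise distance of 3 allows single-error correction,
# because one error can never turn one legitimate barcode into another.
min_dist = min(hamming(a, b) for a, b in combinations(barcodes, 2))
print(f"Minimum pairwise distance: {min_dist}")
print("Single-error correction possible" if min_dist >= 3 else "Barcodes too close")
```

A minimum distance of three or more guarantees that a read carrying one barcode error is still closest to a single legitimate barcode.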
Some people are using inline barcodes of varying lengths to increase the nucleotide diversity. This adds an additional step to demultiplex the data but it isn’t too hard to handle. And now, a number of people are starting to use Illumina index barcodes (they appear in the FASTQ header of the output reads) and Stacks can handle these as easily as inline barcodes.
Finally, a number of people are starting to use combinatorial barcodes (pairs of barcodes, one on each read). We could have a long discussion on whether it’s worth it to use these barcodes. Using them obviously requires one to do paired-end sequencing instead of single-end sequencing, which is typically twice as expensive. But, Stacks can handle these data as well.
If you are multiplexing large numbers of samples together, it can be difficult to properly combine them at the bench – pipettes are only so accurate at very small volumes. In a library containing 100 samples, typically one or two of them will not end up with any appreciable reads after sequencing due to multiplexing error at the bench. It can pay off to qPCR individual samples, or sub-pools of samples before doing the final pooling of the restriction enzyme library to ensure that they are combined at the proper volumes.
How should users best post-process the loci called by Stacks? Is there a way to filter the loci based on the likelihood scores, or does that occur behind the scenes? Can you provide some rules of thumb?
Loci generated in a Stacks analysis can be grouped into two broad categories: those that have sufficient evidence to be called heterozygous or homozygous, and those that contain too much sequencing error, or too few reads, to tell the state of the locus. In my opinion it doesn’t make sense to rank all of the loci from ‘best’ to ‘worst’, because the analysis does not provide enough information for such a detailed ranking.
With that said, one can obtain loci from the first group (those that have sufficient evidence) in two ways. First, you can require a certain level of statistical significance (alpha) from calls made by the maximum likelihood SNP model in each individual. Satisfying a smaller value for alpha requires higher depth of coverage. Second, we can use the population to provide additional evidence that a SNP call is accurate: we require that each SNP be found in a certain number of individuals in the population. SNPs that are found over and over in individuals are quite reliable. For SNPs that satisfy each of these two requirements, it’s difficult to further rank one as better than another.
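
The population-level requirement can be pictured with a small sketch. The data layout below (a mapping from each SNP to per-individual genotype calls) is invented for illustration and is not Stacks’ output format; in Stacks this kind of filtering is applied by the pipeline’s population-level options rather than by hand.

```python
# Conceptual sketch of population-level SNP filtering (not Stacks' actual code
# or file format): keep a SNP only if it was called in enough individuals.
# calls[snp_id] is a hypothetical mapping of individual -> genotype (None = no call).
calls = {
    "locus12_pos45": {"ind1": "A/G", "ind2": "A/G", "ind3": None,  "ind4": "A/A"},
    "locus98_pos10": {"ind1": None,  "ind2": "C/T", "ind3": None,  "ind4": None},
}

min_individuals = 3  # require the SNP to be confidently called in >= 3 individuals

kept = {
    snp: genotypes
    for snp, genotypes in calls.items()
    if sum(g is not None for g in genotypes.values()) >= min_individuals
}
print(sorted(kept))  # ['locus12_pos45']
```

Here only the SNP seen in at least three individuals survives the filter.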
Is it acceptable for users to first filter their own data?
Users can filter their own data. Of course, being biased, I would assert that our filtering program, process_radtags is fast, accurate and handles many different data types, so why would you want to? 🙂
Stacks can incorporate a reference genome. Can this also help for species that are closely related, but not identical to, the reference genome?
Reference genomes are great if you have one, particularly if you plan to run thousands of samples. Since each de novo sample you build will bring a small amount of error into Stacks’ catalog of loci, if you process hundreds of samples you will get lots of error in the catalog. Obviously, aligning reads to the reference genome before calling stacks will act as a big filter keeping many of those erroneous reads out. The flip side of that is that if you are dealing with many different populations, some amount of population-specific genomic sequence will not be in the reference and you will lose that data (we see this consistently if you run the pipeline with and without the reference).
A reference can also be useful for closely-related species, but it depends on how closely-related. You will see a drop off in the number of reads you can align to the reference proportional to evolutionary distance, so its usefulness depends on how much data you can still align.
How would you recommend a user with paired-end data use Stacks?
There are at least three ways to use paired-end data. First, if you use the RAD-seq protocol, and hence you sheared the molecules in the RAD library, you can assemble the paired-end reads associated with a stack into a short contig. This is very useful if you have also generated RNA-seq data, as you can then match the RAD-tag and its paired-end contig against the RNA-seq data giving you RAD markers near or within genes. All three of these types of data can then be compared against other organisms to identify gene orthologs. With this type of data you can then look for conserved synteny between genomes (see our spotted gar paper for an example).
Many users would like to take these paired-end contigs and call SNPs within them, enabling one to extend the haplotype from the single-end read into the paired-end contig. This is not yet possible within the Stacks pipeline, although we are experimenting right now with the best way to do it. Users with decent informatics skills could do this themselves. Not only does it give additional potential for SNPs (particularly in organisms with low heterozygosity), but it could allow one to differentiate restriction enzyme loci that are merged together due to a recent whole genome duplication (such as in salmon or a number of plants). However, there are a number of technical challenges to calling SNPs consistently in all samples (because of varying levels of coverage in the paired-end region).
A third way to deal with paired-end data is to do a double digestion. In this case both the single-end and paired-end reads are anchored with a restriction enzyme cut site. Stacks can currently handle these data similarly to how it handles single-enzyme data.
It may not be clear to users what the best minimum sequencing depth of each stack should be: clearly this may vary with the overall depth of sequencing, the number of cut sites in each data set, etc. On the other hand, maybe changing this parameter has little effect on the output, since secondary reads are later re-incorporated into stacks?
Our latest paper shows data on how varying this minimum sequencing depth affects stack formation. Long story short, you want to raise this parameter enough to avoid convergent sequencing error from the Illumina machine. For example, given 50 Illumina reads from a particular locus, you happen to get the same error three times at position 94. You now have three identical reads that look like an allele at that locus. The SNP calling model can handle this type of error; however, if you have really high depth of coverage, this can happen more than once at a single locus. When merging loci together, at high depths you could have one or two of these error alleles along with the two true alleles giving four total alleles at the locus. This then starts to look like a confounded locus that the software could throw out before it gets to the SNP calling stage. So, it’s important to investigate how the parameters affect your data and to realize that more sequencing depth isn’t always better, as you add additional error with additional depth.
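
The intuition that convergent error scales with depth can be roughed out with a simple binomial calculation. This is only an illustration; the 0.5% per-base error rate and the threshold of three identical reads are placeholder assumptions, not values taken from Stacks.

```python
from math import comb

def prob_at_least_k_same_error(depth: int, k: int, error_rate: float) -> float:
    """Probability that at least k of `depth` reads show one particular wrong base
    at a position, assuming errors are split evenly across the three wrong bases."""
    p = error_rate / 3.0
    return sum(comb(depth, i) * p**i * (1 - p)**(depth - i) for i in range(k, depth + 1))

# Illustrative numbers: 0.5% per-base error rate, spurious 'allele' = 3 identical errors.
for depth in (35, 50, 200, 500):
    print(depth, round(prob_at_least_k_same_error(depth, 3, 0.005), 5))
```

At 50x coverage the chance at any single position is tiny, but at several hundred reads it becomes appreciable, and summed over the ~100 positions of a locus a spurious “allele” starts to become likely – which is the effect described above.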
It seems like there could be differences in output based upon the maximum distance allowed between stacks. Is this possible? Should users try several values? If so, what would be a good range of values to change?
If you were to visualize all the restriction enzyme loci produced from a genome, there will be a group of them that are legitimately distinct in sequence space. That is, given the 100 nucleotides that make up the locus, you would have to change scores of nucleotides to convert one locus into another. Among these loci, increasing the maximum distance between stacks will have no negative effect.
There is a second set of loci that are close in sequence space. Given just a handful of sequence changes you could flip one locus into another. Adjusting the maximum distance parameter will give you more or less of these loci along with a certain level of error where you incorrectly merge two distinct loci. Finally, there will be a group of loci constructed from repetitive DNA. Each of these loci could have scores of alleles merged together from all over the genome, despite what you set the maximum distance to.
So, you need to experiment to see how many loci from the middle group you can capture. Typically, given 100 bp in each locus, you could expect one to three, or possibly four, SNPs at a locus.
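
Conceptually, the maximum-distance parameter acts like a mismatch threshold for deciding when two putative loci should merge. The toy sequences below are invented purely to illustrate how raising the threshold captures more diverged pairs while also risking the merging of genuinely distinct loci; this is not how Stacks itself stores or compares stacks.

```python
def mismatches(a: str, b: str) -> int:
    """Count nucleotide differences between two same-length locus sequences."""
    return sum(x != y for x, y in zip(a, b))

# Hypothetical 20 bp consensus sequences standing in for putative loci.
locus_a = "ACGTACGTACGTACGTACGT"
locus_b = "ACGTACGTACGAACGTACGT"   # 1 difference: plausibly two alleles of one locus
locus_c = "ACGTTCGAACGAACGTTCGT"   # 4 differences: possibly a distinct locus

for M in (2, 4, 6):
    merge_ab = mismatches(locus_a, locus_b) <= M
    merge_ac = mismatches(locus_a, locus_c) <= M
    print(f"M={M}: merge A/B={merge_ab}, merge A/C={merge_ac}")
```

With M=2 only the near-identical pair merges; by M=4 the more diverged pair merges as well, which may represent two real alleles or two different loci – exactly the trade-off described above.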
Related to the above question, what are the effects of varying the number of mismatches between samples to create loci (between individuals as opposed to within individuals)?
This depends heavily on the evolutionary distance between your populations. You have to consider how many fixed differences could exist between populations.
Which user-defined choices have the biggest effect on downstream analyses? Should users try multiple options? Do these choices have large effects (i.e., there generally is one best set of options) or small effects (i.e., regardless of the options chosen, the output is generally the same)? This question relates to all options (lumberjack stacks, the deleveraging algorithm, etc.).
I think there is a sweet spot that will give good results. There will always be loci on the edges of your parameter set that you wish you could capture. You can jump down the rabbit hole and become obsessed with perfecting your data. But in reality, how many solid loci do you need to reconstruct to answer your biological question? Does your genetic map need 17,000 markers, or will 15,000 suffice? Do you need 7,000 highly reliable SNPs for your population genomics analysis, or will 5,000 do? Get a good set and then get on with your life. Trying to disentangle repetitive DNA can ruin your life… 🙂
Thanks for the opportunity to talk about Stacks!
Thanks Julian!!!
