Q&A: Julian Catchen helps us dig into STACKS
Julian Catchen is a post-doctoral researcher at the University of Oregon, where he uses computational solutions to facilitate the analysis of next-generation sequencing data. Prior to obtaining his PhD, Julian worked for both Intel and IBM, experiences that no doubt prepared him well for his future endeavors. One recent endeavor is the program STACKS. I have previously provided a general overview of the program here, but there is no better resource than Julian to answer a wide array of questions. In the first of this two-part series, Julian answers some general and overview-type questions. Later in the week, I’ll post the second part of the interview, which will focus on user-specific issues.
Can you briefly describe what the program Stacks was designed to do and who could benefit from using it?
Stacks is designed to handle short-read sequence data anchored to the genome at specific locations – reads that will stack together. This type of data is typically produced by digesting genomic DNA with a restriction enzyme, although newer protocols are experimenting with engineered constructs. Digesting DNA with a restriction enzyme is equivalent to sampling the genome every few kilobases and creates a reduced representation of the genome. The nice thing about this process is that it can be repeated in lots of different individuals returning nearly the same samples across everyone. And that’s where Stacks comes in – the software is designed to reconstruct the loci created by the restriction enzyme digestion in hundreds or thousands of individuals from multiple populations. The software will call SNPs within those loci and track the alleles segregating at each locus in all the populations.
This SNP data is useful in a number of analyses, such as (1) using SNPs as markers to make genetic maps. Short-read sequencing has made genetic maps incredibly useful again: you can use them for QTL analysis, to help scaffold assemblies of newly sequenced genomes, or to investigate the structural changes in a genome across different populations. (2) These SNPs are useful for population genomic studies: Stacks will calculate a number of population genetic statistics across the genome such as π,FIS, and FST. (3) Another emerging use is for phylogeography. Stacks can identify SNPs that are fixed within populations but varying between them in order to build phylogenetic trees. This is a new type of data for phylogeneticists and as such there are open questions as to how restriction enzyme data behaves in phylogenetic models – but some neat studies are being done. (4) Finally, Stacks SNP data has been used for conservation genetics, such as finding SNPs supported in many populations for building genotyping chips.
What is the history of Stacks?
A collaboration between Eric Johnson’s and Bill Cresko’s labs at the University of Oregon resulted in the introduction of RAD-seq, and these labs were analyzing large datasets generated from the protocol. Within this environment, Angel Amores and I initially designed Stacks starting in 2009 while working to build a genetic map of the spotted gar in John Postlethwait’s lab. At the time our Illumina reads were 36bp in length and you would get 10 million reads in a lane from a good sequencing run. Stacks was initially a collection of Perl scripts, but we found out very quickly that it was not a robust way to handle the volume of data and so I redesigned the core in C++. Angel and I were both thrilled when we tested our first lane of Illumina data from our mapping cross. Our first linkage map was several hundred centimorgans in length and contained something like ten RAD markers. But hey, it worked!
Around that same time in Bill Cresko’s lab, postdocs Paul Hohenlohe and Susie Bassham were working on population genomics statistics to assay genetic differentiation using RAD-seq in marine and freshwater stickleback fish. When I joined Bill’s lab as a postdoc in 2010 we began incorporating these population genomic capabilities into Stacks in order to investigate very large populations of stickleback (Susie was the one who actually came up with the name ‘Stacks’). Our latest paper looks at 590 Oregon fish in nine populations (and our newest unpublished dataset tops 1,000 individuals).
Do you see Stacks as a way for users with varied levels of bioinformatics experience to move expediently from raw sequencing data to useable output?
We had two major design goals while building Stacks. First, we wanted to design a pipeline that was fully integrated. That is, it would not require the use of external components or programs for its primary job. This gave us full control over the internal algorithms, allowing us to optimize code based on the biological aspects specific to reduced representation data and it allowed us to plan properly for the scale of the data we wanted to analyze – by parallelizing our algorithms, for example.
The second major design goal was to provide software that didn’t operate like a black box. To conduct successful experiments the researcher has to understand their data – how biological factors, or experimental design affected the data produced by the sequencer, and how the data responds to different parameter settings in the software. This requires visualization, including methods to see these thousands of loci and genotypes. To accomplish this we designed a database with a web-based interface.
Initially we thought that Stacks would be used by teams of people – one set of individuals, say those doing system administration for an institute or sequencing center would install and take care of the database components (it’s based on a common open source design model called a LAMP application). We then imagined the biologists would run the pipeline and view their results via the web. Things haven’t turned out as we expected, and we find a lot of users don’t have any institutional support and struggle to get the web interface installed. This causes users to sometimes just plow forward without looking at their data at all, jumping directly from raw Illumina reads to genotypes with no idea if their experiments were successful at the bench. So, we are rethinking some of these aspects at the moment.
Do you have any upcoming advancements to Stacks in the works?
Oh yes, we have a long list of features planned. If only we had a team of programmers, unit testers, and documentation writers… But here are some of the things we are working on. We’re writing a graphical interface for Stacks for use on Apple’s OS X. The program looks a little like Apple’s Mail program, but instead of mail messages, it shows Stacks loci. It will be a basic viewing application initially, you’ll still run the Stacks pipeline on a large computer, but you can then download your Stacks folder and view it in the application – no databases or web servers required. We are planning a new stage for the pipeline that runs after all of the loci are assembled in each individual. This new stage will look at loci across all the populations and make automated corrections based on the strength of the SNP calls in each individual (we currently have a limited form of this feature in our genotype program for building genetic maps). We are also doing some experimentation right now to extend Stacks’ haplotype calls from single-end reads to include the paired-end contigs as well. Finally, we are still doing lots of work on our population genetics statistics – trying to get a solid set of useful statistics (with bootstrapped significance levels) that run fast and on very large data sets.
What is the best way for interested users to learn about Stacks?
Read our papers! Our initial Stacks paper was published in G3 in 2011 and we have a second paper in review that includes information on the population genomics aspects of the pipeline. Check out the Stacks website, and consider joining the Google Groups mailing list . We provide a few tutorials on the website to try things out. Also, we do try to teach Stacks at a few courses throughout the year, so interested people can keep an eye out for those (the last two years we have taught at the genomics course in Cesky Krumlov, and we have previously taught at NESCent and at Michigan State).
What sort of data collection would you recommend for someone just starting out (single vs. paired end reads; read length; platform etc.)? Do you have a favorite RAD-Seq lab protocol that’s posted on the web somewhere?
Well, I’m biased, but I still think the best bang for the buck is the original RAD-seq protocol. Doing single-end RAD sequencing is cheap, and you can look at lots of samples. If you want to do paired-end sequencing on your samples (or a subset of them) the shearing involved in the protocol allows you to assemble paired-end contigs, which can be very useful.
If you have to pool your data, due to low amounts of DNA in a single individual, you should consider how that will affect your downstream data and your ability to call SNPs.
A number of people have started using a variety of double-digest protocols in order to avoid the shearing part of the RAD protocol. We experimented with this a few years ago and were not satisfied that it gave as good results. The primary problem is repeatability: depending on the combination of restriction enzymes you use, you often fail to recover the same restriction sites in multiple individuals making it hard to analyze the data at a population level. (Here is one paper describing these types of problems)
That’s it for now. Stay tuned for part II later this week…..