A Nice opinion on confronting uncertainty and modeling it for GBS data

Just over a week ago, I had the opportunity to work in Chris Nice’s lab at Texas State University. I was accompanied by one of our MS students, Ben, and my colleague, Erik Sotka, to prep libraries for a genomic survey of a certain alga I have a penchant for writing about. We were also there to prep a library with Torrance Hanley, a postdoc in the Kimbro and Hughes labs at Northeastern.

Chris walked us through each step as we embarked on our first population genomic projects. We got to talking about analyses and issues I’ve written about before, as well as cases in which Bayesian approaches such as STRUCTURE may not be appropriate (i.e., when there are strong departures from HWE) and possible ways to get around this in the future!

I asked Chris to offer his opinion and write a small piece for TME. Et voilà 

Population genomics is certainly progressing as a field, and there seem to be about as many ways to do things as there are labs doing them. Several methods for library construction have been reviewed recently with some good discussions (Andrews & Luikart, 2014; Puritz et al., 2014; Andrews et al., 2014). One area that has not received as much attention is the downstream analytical details: what to do once you have your sequence reads.

In reading recent papers, it seems clear that there are differing philosophies arising from the fact that next-generation sequence data, especially from reduced-representation genotyping-by-sequencing (GBS) protocols, have forced molecular ecologists to confront notions of genotype uncertainty. Stochasticity arising from library preparation and the sequencing process means there is substantial variation in coverage depth per locus and per individual in GBS data sets. This, in turn, means there can be uncertainty about the genotype for an individual at a particular gene region. A popular approach to dealing with this uncertainty is to filter the data to minimize it. This means throwing away data below a coverage threshold and keeping only those markers for which there are many sequence reads.

An alternative is to confront that uncertainty and to model it. The central idea is that we are sampling the underlying genotype with GBS sequence reads, with all the attendant issues concerning sampling. Thought of this way, it makes sense (to me, at least) to treat the problem of genotyping from a modeling perspective like any other inference problem. In this context, some important contributions seem to have received less attention than I think they deserve.
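
To make the "reads as a sample of the genotype" idea concrete, here is a minimal sketch for a single biallelic site. It is an illustration only, not the model from any of the papers cited here: the binomial read model, the 1% error rate, the Hardy-Weinberg prior, and the function names are all assumptions made for the example.

```python
from math import comb


def genotype_likelihoods(n_ref, n_alt, error=0.01):
    """P(read counts | g) for g = 0, 1, 2 copies of the alternate allele."""
    n = n_ref + n_alt
    # Chance that a single read shows the alternate allele under each genotype
    # (assumed symmetric sequencing error; heterozygotes sampled 50:50).
    p_alt_by_genotype = (error, 0.5, 1.0 - error)
    return [comb(n, n_alt) * p ** n_alt * (1.0 - p) ** n_ref
            for p in p_alt_by_genotype]


def genotype_posterior(n_ref, n_alt, alt_freq, error=0.01):
    """Posterior genotype probabilities under an assumed Hardy-Weinberg prior."""
    q = alt_freq
    prior = ((1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2)
    likelihoods = genotype_likelihoods(n_ref, n_alt, error)
    joint = [lik * pr for lik, pr in zip(likelihoods, prior)]
    total = sum(joint)
    return [j / total for j in joint]


# Same observation (only reference reads), two very different depths.
low = genotype_posterior(n_ref=2, n_alt=0, alt_freq=0.5)
high = genotype_posterior(n_ref=20, n_alt=0, alt_freq=0.5)
print("2 reads :", [round(p, 4) for p in low])
print("20 reads:", [round(p, 4) for p in high])
```

With only two reference reads, the heterozygote still holds roughly a third of the posterior probability, whereas at 20 reads it is negligible. A hard coverage threshold would discard the first observation; a model keeps it and carries its uncertainty forward.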

For example, Zach Gompert and Alex Buerkle proposed a statistical framework to account for genotype uncertainty in GBS data (Gompert & Buerkle, 2011; Buerkle & Gompert, 2013) using hierarchical Bayesian approaches. I think these models and others (e.g., Nielsen et al., 2012) provide a clear path for data analysis: the uncertainty about genotype is accounted for and propagated through the higher levels of the model’s hierarchy, so that summary statistics (e.g., allele frequencies, Fst, hybrid index) include credible intervals that reflect the underlying uncertainty stemming from variation in coverage and sequencing error. (Alex Buerkle discussed the issues and approaches to genotype uncertainty in an American Genetics Association workshop on “Population Genomics for Nonmodel Taxa” in 2013.)
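
As a toy illustration of how genotype uncertainty can propagate into a credible interval, the sketch below estimates one allele frequency directly from read counts by summing over each individual's unknown genotype. It is far simpler than the hierarchical models cited above and is not their implementation: the single locus, the grid approximation, the flat prior on allele frequency, the Hardy-Weinberg genotype prior, and the made-up read counts are all assumptions.

```python
from math import comb


def read_likelihood(n_ref, n_alt, g, error=0.01):
    """P(read counts | genotype g), with g = copies of the alternate allele."""
    p_alt = (error, 0.5, 1.0 - error)[g]
    return comb(n_ref + n_alt, n_alt) * p_alt ** n_alt * (1.0 - p_alt) ** n_ref


def allele_freq_posterior(read_counts, grid_size=201, error=0.01):
    """Grid posterior for the alternate-allele frequency (flat prior assumed)."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = []
    for q in grid:
        prior_g = ((1.0 - q) ** 2, 2.0 * q * (1.0 - q), q ** 2)
        w = 1.0
        for n_ref, n_alt in read_counts:
            # Marginalize over the individual's unknown genotype.
            w *= sum(read_likelihood(n_ref, n_alt, g, error) * prior_g[g]
                     for g in range(3))
        weights.append(w)
    total = sum(weights)
    return grid, [w / total for w in weights]


def credible_interval(grid, posterior, mass=0.95):
    """Equal-tailed interval read off the cumulative grid posterior."""
    tail = (1.0 - mass) / 2.0
    cumulative, lower, upper = 0.0, None, None
    for q, w in zip(grid, posterior):
        cumulative += w
        if lower is None and cumulative >= tail:
            lower = q
        if upper is None and cumulative >= 1.0 - tail:
            upper = q
    return lower, upper


# Hypothetical (ref, alt) read counts for four individuals at one locus.
low_cov = [(1, 1), (2, 0), (0, 1), (1, 0)]
high_cov = [(10, 10), (20, 0), (0, 10), (10, 0)]
ci_low = credible_interval(*allele_freq_posterior(low_cov))
ci_high = credible_interval(*allele_freq_posterior(high_cov))
print("low coverage 95% CI :", ci_low)
print("high coverage 95% CI:", ci_high)
```

The low-coverage individuals are not thrown out; they simply contribute less information, which shows up as a wider interval on the allele frequency.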

A recent paper by Mandeville et al. (2015) in Molecular Ecology illustrates this approach. Not only is this a very interesting paper exploring geographic variation in reproductive isolation in repeated hybrid zones in fish, but the authors also use a clustering algorithm that accounts for genotype uncertainty. The algorithm extends the hierarchical Bayesian ideas mentioned above and is based on the STRUCTURE algorithm (Pritchard et al., 2000; Falush et al., 2003, 2007), but modified to handle GBS data. The model, called ENTROPY, takes genotype likelihoods from variant calling via SAMtools/BCFtools as the starting point and, as in STRUCTURE, provides a clustering solution for varying numbers of populations (k). Output includes the assignment probabilities as well as genotype probabilities for all individuals at all loci, along with credible intervals for these estimated parameters.
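
For readers who have not worked with genotype likelihoods before: variant callers such as BCFtools typically write them to the VCF PL field as phred-scaled values, one per genotype. The sketch below shows that conversion only. It is not ENTROPY's input code or file format, and the function name and example values are hypothetical.

```python
def pl_to_likelihoods(pl_values):
    """Convert phred-scaled likelihoods (VCF PL) to relative likelihoods summing to 1."""
    # PL = -10 * log10(likelihood), scaled so the best genotype has PL = 0.
    raw = [10.0 ** (-pl / 10.0) for pl in pl_values]
    total = sum(raw)
    return [r / total for r in raw]


# Hypothetical PL values for genotypes 0/0, 0/1, 1/1 at a low-coverage site.
example_pl = [0, 6, 60]
likelihoods = pl_to_likelihoods(example_pl)
print([round(x, 4) for x in likelihoods])
```

A hard genotype call would report this individual as 0/0 and stop there; keeping all three values preserves the fact that the heterozygote is far from ruled out.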

Figure 4 from Mandeville et al. (2015), in which they estimated posterior distributions of admixture proportion (q) for each individual using ENTROPY, for k=2 to k=8 genetic clusters.

This provides a powerful approach for population genomics using GBS data (Mandeville et al., 2015). Bayesian inference does require more computational time than other approaches to GBS data analysis. It also requires some familiarity with Bayesian methods and might not be applicable to all situations. On the other hand, modeling genotype uncertainty lets you take advantage of lower-coverage data, so researchers can potentially use more of their data rather than filtering it away (Buerkle & Gompert, 2013). This alone might justify paying more attention to these models.

References

Andrews, K. R., Hohenlohe, P. A., Miller, M. R., Hand, B. K., Seeb, J. E. & Luikart, G. (2014). Trade-offs and utility of alternative RADseq methods: Reply to Puritz et al. Molecular Ecology, 23, 5943-5946.

Andrews, K. R. & Luikart, G. (2014). Recent novel approaches for population genomics data analysis. Molecular Ecology, 23, 1661-1667.

Buerkle, C. A. & Gompert, Z. (2013). Population genomics based on low coverage sequencing: how low should we go? Molecular Ecology, 22, 3028-3035.

Falush, D., Stephens, M. & Pritchard, J. K. (2003). Inference of population structure using multilocus genotype data: linked loci and correlated allele frequencies. Genetics, 164, 1567-1587.

Falush, D., Stephens, M. & Pritchard, J. K. (2007). Inference of population structure using multilocus genotype data: dominant markers and null alleles. Molecular Ecology Notes, 7, 574-578.

Gompert, Z. & Buerkle, C. A. (2011). A hierarchical Bayesian model for next-generation population genomics. Genetics, 187, 903-917.

Mandeville, E. G., Parchman, T. L., McDonald, D. B. & Buerkle, C. A. (2015). Highly variable reproductive isolation among pairs of Catostomus species. Molecular Ecology, 24, 1856-1872.

Nielsen, R., Korneliussen, T., Albrechtsen, A., Li, Y. & Wang, J. (2012). SNP calling, genotype calling, and sample allele frequency estimation from new-generation sequencing data. PLoS One, 7, e37558.

Pritchard, J. K., Stephens, M. & Donnelly, P. (2000). Inference of population structure using multilocus genotype data. Genetics, 155, 945-959.

Puritz, J. B., Matz, M. V., Toonen, R. J., Weber, J. N., Bolnick, D. I. & Bird, C. E. (2014). Demystifying the RAD fad. Molecular Ecology, 23, 5937-5942.