A population genetic R-evolution

Uphill, both ways, in the snow, without shoes … quite apt when thinking of the dark days, in the not too distant past, in which a separate input file was needed for each popgen analysis in order to use a handful of separate programs (often for idiosyncratic reasons).
Add a complex life cycle into the mix, such as an alternation between haploid and diploid free-living phases, and you can multiply the number of input files by two. Yet, then you’d have to maintain a list of programs that only were compatible with diploids and the ones that will take diploids and haploids, but separately.
The reality is that not all organisms fit nicely into a diploid-only or even haploid-only box.
When I began my PhD and my first foray into the population dynamics of haploid-diploid seaweeds, GenAlEx (Peakall and Smouse 2006, 2012) was a revelation in terms of ease of use. There was one input sheet (though still per ploidy) and it could be stored with all output sheets in the same, albeit massive, Excel file, reducing the seemingly endless array of individual input and output files.
As 2015 dawns, the brave new world of population genetic analyses in R may make the multiple popgen input files of yesteryear a relict, not unlike floppy disks or Beta-decks.
For population geneticists, and especially those with a penchant for organisms that don’t conform, R is a limitless palette with a much larger popgen repertoire than before.

While curating, analyzing and writing what is amounting to be a magnum opus on an invasive seaweed, my conversion to R began in earnest this past autumn. Perhaps I am a wee bit behind the times, but as I was trying to break free of the love-hate relationship with multiple popgen programs, I stumbled across the R package poppr (Kamvar et al. 2014) and it became an important leap in my personal R-evolution.
Around the same time, I began contributing to The Molecular Ecologist and figured I’d highlight this great set of tools now conveniently available in R. Though perhaps not revolutionary in and of themselves, this package is an important milestone in the population genetic analyses in R.
A few weeks ago, I spoke with two of the great minds behind poppr, Niklaus Grünwald and Zhian Kamvar. They told me that the design of the package arose due to a series of serendipitous events:

  • R had become much easier to use
  • the Grünwald lab needed the tools to analyze population genetics in clonal organisms
  • the adgenet and vegan packages were available (Jombart 2008, Jombart and Ahmed 2011, Oksanen et al. 2013)
  • Kamvar had begun his dissertation research.

The humorous phrases at the start of each section in the poppr vignette lightened moments of sheer exasperation. And so, as I sing poppr’s praises, I’m going to shamelessly borrow some of the headings.
Can you take me hier(archy)?

© elitedaily.com

© elitedaily.com

Hierarchical sampling adds a layer of complexity to the analysis of any set of populations. Previously, this wasn’t so easy, even in R, in which a separate data set was needed, composed of different sampling levels, accompanied by separate analyses for each flavor of hierarchical levels. According to Kamvar et al. (2014):

The number of data sets undergo a factorial increase with each hierarchical level, therefore, increasing the chances of human error in data reformatting and analysis.  Thus, tools are needed for analysis of population data across hierarchies or subsets of data.

Enter the hierarchy slot, stage left.
poppr introduces the genclone class in order to make working with hierarchies simpler. It is built off the adegenet’s genind object, but in addition to genotypic information, it also includes population hierarchy. This enables you to define a desired hierarchy by simply providing a formula.
Thus, there is no longer a need to make up a new file when you want to exclude one potentially odd population, or set of individuals. A quick line of code … et voila the offending samples are removed.
Clone-ing around

© starwarsreport.com

© starwarsreport.com

You might not know if your population of interest is undergoing clonal reproduction. Traditionally, one hallmark of asexual reproduction, or clonality, is linkage disequilibrium among loci (e.g., Milgroom 1996, Sosa et al. 1999). One way to do this is to calculate the index of association in combination with a resampling of the dataset in order to obtain a null distribution with the underlying expectation of random mating (Agapow and Burt 2001). This has been applied in oomycetes (e.g., Goss et al. 2014; Grünwald and Hoheisel 2006) as well as in seaweeds (e.g., Guillemin et al. 2008, Krueger-Hadfield 2011). However, the program that was previously used, Multilocus (Agapow and Burt 2001), is no longer supported.
Regardless, in order to obtain reproducible research in Multilocus you would have to write down every step of the way to your multilocus LD estimates … what you did, how you did it and what the results were (and even possibly what tea you were drinking that brought favor with the input file gods such that your input file would work). Try to collaborate with a data set and you’d have to use the same version as your collaborator(s). Pointing and clicking is not as inherently reproducible as a well-annotated R markdown file.
Enter clone correction, center stage
One potential hiccup with clonal populations is the inclusion of many repeated multilocus genotypes (MLGs). This will bias estimates of different popgen parameters, such as allele frequencies. Previously, removal of repeated MLGs was done by hand. This is hazardous under the best of circumstances (i.e., a small dataset).
Clone correction in poppr allows you to keep just one MLG by correcting down to the lowest level of any defined population hierarchy. Thus, the hierarchy aspects of poppr are applicable beyond sexual, dare I say, “normal” organisms. This is a huge advance for those of us analyzing data sets including clones, followed by analyses with clones removed.
Once you’ve got the dataset you want for your populations (be they sexual, clonal or partially sexual and clonal), then you can perform a plethora of different analyses, such as summary statistics (including expected MLG based on rarefaction, Shannon-Weiner Index, expected heterozygosity, evenness, etc.).
If there is clonality occurring in a population, then you expect significant LD. Conversely, if the population is sexual, then you expect no LD. In poppr, you can easily calculate the index of association IA and to investigate the reproductive mode of your population(s) by the convenient function poppr().
Summary table produced by the poppr function (from Kamvar et al. 2014)

Summary table produced by the poppr() function (from Kamvar et al. 2014).

Every day I’m shuffling (data sets)
© Johnny Blood, Flickr

© Johnny Blood, Flickr

But, you ask, how can you determine whether your population multilocus estimates are significant. Or, for that matter, what about bootstrap support for your Bruvo’s genetic distance (Bruvo et al. 2004).
Enter permutations and bootstrap resampling, stage right.
For IA and , poppr provides four permutation algorithms in which the data are randomly shuffled at each locus, thus unlinking them. The Multilocus style permutation will shuffle genotypes at one locus, while maintaining the heterozygosity and allelic structure. However, poppr provides three new methods in which alleles sort independently at each locus, reflecting the null expectation in panmictic populations.
Linkage disequilibrium. Visualizations of tests for linkage disequilibrium, where observed values (blue dashed lines) of IA and rbarD are compared to histograms showing results of 999 permutations using permutation. (A) Results are shown for a sexual population and (B) for a clonal (from Kamvar et al. 2014).

Linkage disequilibrium. Visualizations of tests for linkage disequilibrium, where observed values (blue dashed lines) of IA and rbarD are compared to histograms showing results of 999 permutations using permutation. (A) Results are shown for a sexual population and (B) for a clonal population (from Kamvar et al. 2014).

Bruvo’s genetic distance was implemented in the R package polysat (Clark and Jasieniuk 2011), but this package is optimized for polyploids. In polysat, homozygous diploids will be collapsed into a single allele and it will attempt to determine the second allele in comparison to a heterozygous individual. Thus, in haploids and diploids in which we know the allelic dosage, poppr does not suffer from a bias when calculating true genetic distances.
The other novelty in poppr is the introduction of bootstrap support for Bruvo’s genetic distance. Graphing of minimum spanning networks and dendograms with bootstrap support were not previously available in other R packages.
Dendrogram based on genetic distance. UPGMA tree produced from Bruvo’s distance with 1000 bootstrap replicates (node values greater than 50% are shown). (from Kamvar et al. 2014).

Dendrogram based on genetic distance. UPGMA tree produced from Bruvo’s distance with 1000 bootstrap replicates (node values greater than 50% are shown; from Kamvar et al. 2014).

Getting into the mix … and … bringing something to the table
© tablespoon.com

© tablespoon.com

Soon it will be possible to incorporate MLG boundaries within poppr, namely in the form of Psex, in which one can assign the probability that a second encounter with a MLG is due to clonality or different sexual events (Parks and Werth 1993, Arnaud-Haond et al. 2007).
Kamvar and Grünwald also discussed the temptation to scale up to genomic data in which sliding window analyses for whole-genome SNP data could be implemented.
For those just embarking on the R language or population genetic analyses, there is also an excellent primer for population genetic analyses written by Grünwald, Kamvar and Sydney E. Everhart. The primer

provides a valuable tool for tackling the nitty-gritty analysis of populations that don’t conform to textbook genetics.

But I’d add, that it’s a great primer for any organism or anyone getting their feet wet in popgen analyses and R.
Likely this is a case of preaching to the converted, but the ease with which one can manipulate data while simultaneously reproducing said analyses is making popgen a lot easier. poppr represents an important advance for popgen analyses, from sexual to clonal.
Useful links:
The Grünwald Lab
Poppr vignette
Poppr on CRAN
Population genetics in R primer
Poppr on Github
Poppr Google group
Arnaud-Haond S, Duarte CM, Alberto F, Serrao EA (2007). Standardizing methods to address clonality in population studies. Mol Ecol 16: 5115-5139.  
Agapow PM, Burt A. (2001) Indices of multilocus linkage disequilibrium. Molecular Ecology Notes 1:101-102.
Bruvo R, Michiels NK, D’Souza TG, Schulenburg H. (2004) A simple method for the calculation of microsatellite genotype distances irrespective of ploidy level. Mol Ecol 13:2101-2106.
Clark L, Jasieniuk M (2011) Polysat: an R package for polyploid microsatellite analysis. Mol Ecol Res 11:562-566.
Goss, E. M., et al. (2014) The Irish potato famine pathogen Phytophthora infestans originated in central Mexico rather than the Andes. PNAS 111: 8791-8796.
Grünwald NJ, Hoheisel G (2006) Hierarchical analysis of diversity, selfing, and genetic differentiation in populations of the oomycete Aphanomyces euteiches. Phytopathology 96:1134-1141.
Guillemin ML, Faugeron S, Destombe C, et al. (2008) Genetic variation in wild and cultivated populations of the haploid-diploid red alga Gracilaria chilensis: How farming practices favor asexual reproduction and heterozygosity. Evolution, 62, 1500–1519.
Jombart T (2008) Adegenet: a R package for the multivariate analysis of genetic markers. Bioinformatics 24:1403-1405.
Jombart T, Ahmed I (2011) Adegenet 1.3-1: new tools for the analysis of genome-wide SNP data. Bioinformatics 27:3070-3071.
Kamvar ZN, Tabima JF, Grünwald NJ (2014) Poppr: an R package for genetic analysis of populations with clonal, partially clonal, and/or sexual reproduction. PeerJ e281.
Krueger-Hadfield SA (2011). Structure des populations chez l’algue rouge haploid-diploïde Chondrus crispus: système de reproduction, différenciation génétique et épidémiologie. UPMC-Sorbonne Universités and la Pontificia Universidad Católica de Chile. PhD Thesis.
Milgroom MG (1996) Recombination and the multilocus structure of fungal populations. Annual Review of Phytopathology 34:457-477
Oksanen J, Blanchet FG, Kindt R, Legendre P, Minchin PR, O’Hara RB, Simpson GL, Solymos P, Stevens MHH, Wagner H (2013) vegan: Community Ecology Package. R package version 2.0-7.
Parks JC, Werth CR (1993) A study of spatial features of clones in a population of bracken fern, Pteridium aquilinum (Dennstaedtiaceae). Am J Bot 80: 537-544.
Peakall R, Smouse PE (2006) GENALEX 6.2: genetic analysis in Excel. Population genetic software for teaching and research. Mol Ecol Notes 6: 288–295.
Peakall R, Smouse PE (2012) GenAlEx 6.5: genetic analysis in Excel. Population genetic software for teaching and research-an update. Bioinformatics 28: 2537-2539.
Sosa, P.A., Valero, M., Batista, F. & Gonzalez-Perez, M. A. Genetic structure of natural populations of Gelidium species: A re-evaluation of results. J. Appl. Phycol. 10:279-84.

This entry was posted in howto, methods, population genetics, R, software, Uncategorized. Bookmark the permalink.