Ploidy, dear reader, is something that I think about literally all the time. It impacts every facet of my research from the field to the bench to the stats used to analyze data sets. It’s been simultaneously the greatest and the worst aspect underlying the majority of my work thus far.
Anyone who deals with things more complicated than a diploid understands the difficulty. We absolutely have to correctly distinguish individuals with different ploidy levels if we want to accurately genotype and estimate allele frequencies in population genetic studies. Diploidized haploids don’t reflect the true allele frequencies, nor do tetraploids that are treated as diploids.
State-of-the-art ploidy-ing techniques include flow cytometry (FCM), which determines ploidy level by quantifying nuclear DNA content. There are now high-throughput FCM techniques as well as methods for dried tissue. The downside is that these techniques require the right instrumentation and sufficient tissue, which might not be available if samples are poorly preserved or limited.
Microsatellites have also been used to determine ploidy: haploids should have one allele, diploids two, and polyploids more. However, ploidy detection depends entirely on the population's allelic richness, the number of loci, and the genotyping error rate.
Recently, high-throughput sequencing has joined the ploidier's toolkit. Sequence data can be used to determine allelic copy numbers and ploidy levels, but these approaches often require high-coverage data sets (10x to 50x; see Gompert and Mock 2017 for a review of these recent approaches).
Such high coverage isn't the norm when GBS is used to assess population-level genetic variation; we sacrifice coverage (e.g., 2x) for lots and lots of individuals. If we go as low as we can go for a GBS study, can we detect individual ploidy levels?
Gompert and Mock (2017) hypothesized that with the availability of thousands of biallelic SNPs, "rates of heterozygosity, allelic ratios and multi-SNP haplotypes [should] differ among cytotypes, and that this signal could be harnessed to assign cytotype to individuals."
They tested a method for discriminating among diploids, triploids and tetraploids based on rates of heterozygosity and allelic ratios. They assumed that polyploidy resulted from
"the production of unreduced gametes that are derived from different chromosomes (e.g., that result from nondisjunction during meiosis 1)."
Tetraploids that result from nondisjunction in meiosis 2 will be indistinguishable from diploids in terms of heterozygosity and allelic richness, as will tetraploids that originate from somatic doubling. For those, techniques that don't rely on any type of marker data, like FCM, will be necessary.
The premise of their method rests on the assumption that individuals with higher ploidy levels should be heterozygous at a greater proportion of SNPs and allelic ratios or proportions should reflect ploidy levels.
So, diploids should have one copy of each allele and have allelic ratios of 1:1.
Triploids should have one copy of one allele and two copies of the other allele with ratios of 1:2 or 2:1.
Tetraploids should harbor one, two, or three copies of an allele, with ratios of 1:3, 2:2 (equivalent to 1:1), or 3:1.
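These expectations are easy to see in a toy simulation. The sketch below (a hypothetical illustration, not the gbs2ploidy model itself) draws read counts at heterozygous SNPs from a binomial distribution, where the chance a read carries the reference allele equals the fraction of chromosome copies carrying it:

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate_allelic_proportions(ploidy, ref_copies, coverage, n_snps=2000):
    """Simulate reference-allele read proportions at heterozygous SNPs.

    Toy sketch: error-free reads, each drawn independently from one of
    the individual's chromosome copies.
    """
    p_ref = ref_copies / ploidy
    ref_reads = rng.binomial(coverage, p_ref, size=n_snps)
    return ref_reads / coverage

# Diploid 1:1, triploid 2:1, and tetraploid 3:1 heterozygotes
for ploidy, ref in [(2, 1), (3, 2), (4, 3)]:
    props = simulate_allelic_proportions(ploidy, ref, coverage=30)
    print(f"{ploidy}x: expected {ref / ploidy:.2f}, observed mean {props.mean():.2f}")
```

With decent coverage, the mean observed proportion sits right on the expected 1/2, 2/3, or 3/4, which is the signal the method exploits.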
Gompert and Mock simulated data and then tested their technique in natural populations of aspen that are known to contain both diploid and triploid individuals.
Figure 1 shows the outline of their method that is now implemented in the R package gbs2ploidy available from CRAN.
Their method worked well with their simulated data sets and agreed with the microsatellite and FCM results previously generated in aspens.
In the aspen data set, individual heterozygosity covaried with sequence coverage, but by using the residuals obtained from regressing heterozygosity on mean coverage, they were able to remove this dependency.
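The residual trick is ordinary least squares: fit heterozygosity as a linear function of mean coverage, then work with what the fit leaves over. A minimal sketch on made-up numbers (the coefficients and noise levels here are hypothetical, not the aspen values):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-individual summaries: heterozygosity that drifts
# upward with mean sequence coverage, as observed in the aspen data.
n = 100
coverage = rng.uniform(1, 10, n)
heterozygosity = 0.02 * coverage + rng.normal(0.2, 0.02, n)

# Regress heterozygosity on mean coverage and keep the residuals;
# OLS residuals are uncorrelated with the regressor by construction.
slope, intercept = np.polyfit(coverage, heterozygosity, 1)
residuals = heterozygosity - (intercept + slope * coverage)

print("raw corr:     ", np.corrcoef(coverage, heterozygosity)[0, 1])
print("residual corr:", np.corrcoef(coverage, residuals)[0, 1])
```

The raw correlation is strong; the residual correlation is essentially zero, so downstream clustering no longer confuses "more reads" with "more heterozygous."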
This also highlights the difference between their method and that of Margarido and Heckerman (2015), which is implemented in the computer program ConPADE. In the latter, heterozygosity is ignored and allelic proportions are assumed to match theoretical expectations.
With high-coverage data (>10x), that assumption holds, but with low-coverage data, not so much (see Figure 3).
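Why low coverage breaks the assumption is worth spelling out: at 2x, the only observable per-SNP proportions are 0, 0.5, and 1, no matter the true allelic ratio. A quick hypothetical sketch for a triploid 2:1 heterozygote:

```python
import numpy as np

rng = np.random.default_rng(3)

def observed_proportions(true_p, coverage, n_snps=5000):
    """Observed reference-read proportions at het SNPs of a given depth."""
    reads = rng.binomial(coverage, true_p, size=n_snps)
    return reads / coverage

# A triploid 2:1 heterozygote has a true allelic proportion of 2/3.
for cov in (2, 50):
    props = observed_proportions(2 / 3, cov)
    print(f"{cov}x: spread around 2/3 = {props.std():.3f}")
```

At 2x the observed proportions are coarse and wildly spread around 2/3; at 50x they cluster tightly on it, which is why high-coverage methods can take the theoretical ratios at face value and low-coverage ones cannot.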
Allelic proportions varied among individuals, with most SNPs showing a 1:1 allelic proportion and a modest number showing each of the other five allelic proportions, 2:1 being the most common of those.
Most individuals were assigned with high confidence to one of the groups identified by k-means clustering (Step 6 in Figure 1); only 4 individuals had a maximum assignment probability of <0.95. Likewise, classifications using their GBS method mostly matched (89.9%) the FCM and microsatellite designations (see Figure 6). Finally, in most cases they assigned the same ploidy to different ramets of the same genet, including one really large clone.
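The clustering step itself is nothing exotic. The sketch below runs plain k-means (Lloyd's algorithm) on made-up per-individual summaries, a coverage-corrected heterozygosity and the fraction of het SNPs near a 2:1 ratio; the feature values and cluster positions are hypothetical, standing in for Step 6 rather than reproducing it:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical summaries: diploids low on both axes, triploids high.
diploids = rng.normal([0.0, 0.2], 0.03, size=(40, 2))
triploids = rng.normal([0.1, 0.6], 0.03, size=(20, 2))
X = np.vstack([diploids, triploids])

def kmeans(X, k=2, iters=50):
    """Plain Lloyd's algorithm, seeded with the first and last points."""
    centers = X[[0, -1]].copy()
    for _ in range(iters):
        # Assign each point to its nearest center, then update centers.
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

labels = kmeans(X)
```

With well-separated cytotypes the two clusters recover the diploid/triploid split exactly; the hard part in practice, as the authors note, is deciding which cluster is which cytotype.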
Their method seems really promising, but they highlight a few cases when it might be a bit limited.
First, with low-coverage data, it might be difficult to equate groups from k-means clustering with specific cytotypes, particularly when the ploidy levels present aren't known. This problem could be resolved by including a validated set of reference samples, thereby avoiding step 6 in the methodology.
Second, k-means clustering can miss rare cytotypes. This problem could rear its ugly head if multiple populations that differ in diversity levels are included, but it could be resolved with a reference data set that includes the rare cytotypes.
Third, if cytotype covaries with diversity levels or inbreeding, heterozygosity and allelic proportions could yield contrasting patterns and confound the analyses. They suggest inferring cytotypes based solely on allelic proportions.
Their method does require that individuals are heterozygous at a reasonable number of loci, but barring extreme inbreeding or selfing, this shouldn't really be an issue with GBS data sets.
With the R package ready to go, it seems like this will be super useful for a variety of species that range from diploid to n-ploid.