Inferring kinship from low coverage sequencing data

The kinship structure of the Noble and Most Ancient House of BLACK (from the “Harry Potter” series)

Knowing the relatedness structure of your population is essential for pretty much any study. Until recently, the only way to determine the kinship structure was to have a detailed pedigree or to estimate relatedness (poorly) using microsatellites. The spurt of genomic data coming out of many non-model systems and wild populations has drastically increased the number of markers that can be used to infer kinship, which should allow for more accurate estimates of kinship. Indeed, many programs take advantage the large number of variants that can be genotyped using high throughput sequencing technology (e.g., PLINK and KING). However these programs all assume that every genotype is known with high confidence. In other words, it doesn’t take into account the uncertainty of each genotype call that inevitably comes with low coverage sequencing. A new biorxiv preprint from Mikhail Lipatov and colleagues details a method that incorporates this uncertainty to allow for more accurate inference of kinship in low coverage sequencing data.

First, a bit of background:

When you use your favorite genotype caller (GATK, mpileup, VARSCAN, freeBayes, whatever), your end product is a vcf (“variant call format”) file. At any given biallelic variable site in this vcf file, there is an associated likelihood that an individual/sample has a genotype of 0 (homozygous for the reference allele), 1 (heterozygous), or 2 (homozygous for the alternate allele). Essentially, it’s a measure of how confident you are in the individual being a 0, 1, or 2. In high coverage (~50X) sequencing data, there is typically a very high likelihood for one of the three genotypes, so you are pretty confident in that being the true genotype. In low coverage (<10x) sequencing data, there is often a very small difference in the likelihoods of two (or three) genotypes. In this case, you shouldn’t choose just one genotype for that sample at that site because there is a relatively large chance that you could be wrong.
The paper

Lipatov et al used a maximum likelihood framework to incorporate the uncertainty that comes with genotypes inferred from low coverage sequencing. There is a bunch of math in the paper, but I’m not going to get into that. I’ll just focus on their results. They tested out their method on three sets of data: 1) simulated genotypes from a simple pedigree (Fig. 1 in the manuscript), 2) sequencing data from five members of CEPH pedigree 1463, and 3) sequencing data from 2,535 individuals from the 1000 genomes phase 3 data.
Overall, their method, lcMLkin, was pretty accurate and unbiased even down to ~2x coverage (the lowest they went: which means, on average, you sequence each allele once at each position in the genome). First, in the simulated data, they found a pretty strong relationship between their relatedness estimates and the expected coefficient of relatedness:

Figure 2 from Lipatov et al: Coefficient of relatedness, r, estimated by our method from simulated 2,10 and 20X coverage data versus the known r. Blue dots are estimates using only the genotype with the highest likelihood (i.e. a deterministic genotype) and red dots are estimates from summing over all possible genotypes in lcMLkin.

It gets even better when you plot the estimated coefficient of relatedness vs. the estimated k0 (the probability of two diploid individuals sharing 0 alleles that are identical by descent). So much so that they are able to distinguish parent-offspring dyads from full siblings (both r=0.5), as well as most 2nd degree (e.g., Bellatrix and Draco Malfoy) and some 3rd degree (e.g., Irma Crabbe and Draco Malfoy) relatives:
Figure 3 from Lipatov et al: r versus k0 from simulated 2x coverage. Blue=full siblings, red=parent-offspring, green=2nd degree, orange=3rd degree, sky blue=3rd degree, pale orange=5th degree, pink=unrelated

Figure 3 from Lipatov et al: r versus k0 from simulated 2x coverage. Blue=full siblings, red=parent-offspring, green=2nd degree, orange=3rd degree, sky blue=3rd degree, pale orange=5th
degree, pink=unrelated

The method works well with some real data, but is not quite as clean as the simulated results:
CEPH pedigree estimates (only 5 individuals)

CEPH pedigree estimates (only 5 individuals)

1000 genomes phase 3 individuals (unknown coverage, but they authors assume that it is low...)

1000 genomes phase 3 individuals (unknown coverage)

The takeaway

Incorporating confidence of genotype calls into your estimation of relatedness is pretty important and certainly increases accuracy (just look at how bad the methods using PLINK did with the low coverage data). And you can do this using the new method, lcMLkin.
Final note: Authors, if you’re reading this, I think you mixed up your legends for figures 4 and 5. As written the top one (using the most likely genotype) is the best, which isn’t what is stated in the text.
Lipatov, M., Sanjeev, K., Patro, R., & Veeramah, K. (2015). Maximum Likelihood Estimation of Biological Relatedness from Low Coverage Sequencing Data. bioRxiv. Cold Spring Harbor Labs Journals. doi:10.1101/023374

This entry was posted in Uncategorized. Bookmark the permalink.