Patterns of Linkage Disequilibrium (LD) across a genome has multiple implications for a population’s ancestral demography. For instance, population bottlenecks predictably result in increased LD, LD between SNP’s in loci under natural selection affect each others rates of adaptive evolution, selfing/inbreeding populations accumulate LD, etc (for an excellent review, see Slatkin 2008). Patterns of LD thus allow estimation of allele ages, patterns of ancestral admixture, natural selection, among other demographic processes.
In exploring some quick and dirty ways to understand/determine patterns of LD across the genome, here’s a simple tutorial for plotting LD in R (with a little help from PLINK).
As with some earlier posts, I am using the 1000 Genome Data as an example – here we explore variation in LD between the CEU (Central European), and YRI (Yoruba) individuals for whom pedigree information is available.
To obtain PED, MAP files:
The 1000 Genomes Project has a neat tool to obtain PED (pedigree), and SNP information files from VCF (variant call format) files – more information on this can be found here.
I downloaded PED and “.info” files for the example Chromosome 6 files for the CEU and YRI populations, between chromosome coordinates 6:46620015 and 6:46620998.
The “.info” file has only two columns (containing the name of the SNP, and the coordinate) and has to be modified to make a MAP file that can then be used with PLINK (http://pngu.mgh.harvard.edu/~purcell/plink/) to obtain LD statistics.
To do this, edit the file, and add a first column with the chromosome number (here 6), and a third column with 0’s. For more information on MAP files, see this link.
Now to calculate LD statistics (here ‘r’), I use PLINK at the command line (in Unix here) with:
$ ./plink --file ceu --noweb --r --allow-no-sex $ mv plink.ld ceu.ld $ ./plink --file yri --noweb --r --allow-no-sex $ mv plink.ld yri.ld
This should produce two files, “ceu.ld”, and “yri.ld”, which contain pairwise LD estimates across all SNP’s in each MAP file. For more information on the “r” statistic, and how to estimate LD across different window lengths in PLINK, I refer you to this page.
Now onto plotting these in R:
yri<-read.table("yri.ld",header=TRUE) ceu<-read.table("ceu.ld",header=TRUE) yri1<-yri[order(yri$BP_B-yri$BP_A),] ceu1<-ceu[order(ceu$BP_B-ceu$BP_A),] plot(yri1$BP_B-yri1$BP_A,yri1$R^2,type="l" ,col="red",ylim=c(0,max(ceu1$R^2,yri$R^2)), lwd=2,xlab="Distance between SNPs (bp)", ylab="Correlation") points(ceu1$BP_B-ceu1$BP_A,ceu1$R^2,type="l",col="blue",lwd=2) legend(500,0.15,c("CEU","YRI"),lwd=c(2,2),col=c("blue","red"))
And voila! A simple LD plot – you should be able to play around with cut-off lines, plotting multiple populations, etc. from hereon.
Slatkin, Montgomery. “Linkage disequilibrium—understanding the evolutionary past and mapping the medical future.” Nature Reviews Genetics 9.6 (2008): 477-485.
Purcell S, Neale B, Todd-Brown K, Thomas L, Ferreira MAR, Bender D, Maller J, Sklar P, de Bakker PIW, Daly MJ & Sham PC (2007) PLINK: a toolset for whole-genome association and population-based linkage analysis. American Journal of Human Genetics, 81.