Characterizing differentiation across individual genomes sampled from different populations can be very informative of the demographic processes that resulted in the differentiation in the first place. Manhattan plots have grown to be very popular representations of genome-wide differentiation statistics in recent literature. And what’s better? They’re surprisingly easy to make in R! In this post, I describe making these plots from scratch – starting with a VCF (Variant Call Format) file, which contains genotype information (and other meta data) across genomic positions.
As an example, I downloaded the variant calls for Chromosome 22 from the Phase 3 of the 1000 genome project (see link), and estimated Weir and Cockerham estimates of Fst for two populations (GBR – Great Britain, and YRI – Yoruba, a total of 199 individuals out of 2504) using VCFTools. The .weir.fst file produced by VCFTools contains pairwise Fst values for your specified window size. To do this, I used the command:
vcftools –vcf chr22.vcf –keep allindivs –out gbryri –weir-fst-pop gbrindivs –weir-fst-pop yriindivs
where allindivs is a file with individual ID’s of all individuals from the GBR, and YRI populations, and gbrindivs, and yriindivs are files with individual labels from GBR, and YRI respectively. I pulled out unique ID’s for all these individuals from the meta information made available on the same FTP site. Now onto plotting!
While neat Manhattan plots can be created just by using R’s plot(), or qplot() functions, I found Stephen Turner’s “qqman” package to be very handy, and easy to use. Just as an example, I randomly replaced some of the chromosome 22 values from the output file above with chromosome number 1-3. Ideally, when you’re analyzing whole genome/transcriptome VCF’s, this shouldn’t be a problem. I also subset the data to avoid lines with NA values.
Thereon, install the “qqman” package, and plot by:
install.packages(“qqman”) library(qqman) fst<-read.table(“fst”, header=TRUE) fstsubset<-fst[complete.cases(fst),] SNP<-c(1:(nrow(fstsubset)-1)) mydf<-data.frame(SNP,fstsubset) manhattan(mydf,chr="CHROM",bp="POS",p="WEIR_AND_COCKERHAM_FST" ,snp="SNP",logp=FALSE,ylab=”Weir and Cockerham Fst”)
And voila! You have your genome wide Fst Manhattan plot! The “qqman” package also has plenty of options for changing the color, displaying chromosomes, etc, which Stephen Turner explains in his blog here. Good luck!