Calculating pair-wise, unbiased Fst with R:

Calculating Weir and Cockerham’s FST is very useful because it is unbiased with respect to sample size (Weir and Cockerham 1984).  Without adjusting allele frequency estimates with respect to sample sizes, estimates of FST can be upwardly biased (see Waples 1998; note that precision may still be low).  Of course there has been lively debate about the utility and appropriate usage of FST (see Meirmans and Hedrick 2011 and Whitlock 2011 as just a few recent examples).  Nevertheless, FST still remains a commonly used metric that will likely be used for years to come (if only for comparative purposes).  There are a large number of stand-alone programs that can calculate unbiased F statistics (too many to mention), and they all do an excellent job.  However, if you are working with genetic data in R, then it can be useful to calculate this metric directly, without having to export your data to a standalone program.  This can be particularly useful if you are employing a large number of iterations/simulations etc.

The way that I calculate pair-wise theta is by using the R library Geneclust (Ancelet and Guillot).  Geneclust does much more than calculate FST, but I found this library to be particularly useful at achieving these ends.  I have run across a few other packages that calculate FST in R, but not Weir and Cockerham’s unbiased formulation (Please correct me in the comments if you know of other libraries).  To install Geneclust in R, simply type “install.packages(c(“Geneclust”,”deldir”,”fields”,”spatial”)”.  Below is the code I use to calculate pair-wise Fst values.  As with the allele frequency example, I have again used the Beech data as a nice example (here modified as a tab-delimited text file: DataMEC-11-0816.R1).   I have checked these values against FSTAT and they give identical results, but more double-checking couldn’t hurt.   I hope you find this useful – and please feel free to proffer suggestions, recommendations, opinions etc.

Script posted here:  Fst_script.txt

#This script calculates pair-wise theta.
#written by Mark Christie on 5/13/2012
#must install Geneclust library, run code: install.packages(c("Geneclust","deldir","fields","spatial")
#See provided text file for file format
 
setwd("C:/POPS/")  #Set the working directory. Can be anything you like
beechdata <- read.table("DataMEC-11-0816.R1.txt", header=TRUE, sep="\t", na.strings="NA", dec=".", strip.white=TRUE)  #load text file
data1 <- beechdata[,-2]     #remove column with clone id (do not run this line on your own data, if you do not have this extra column)
library(Geneclust)           #load Geneclust
gtypes=data1[,3:ncol(data1)]  #index only genotype data
gtypes1=matrix(as.integer(as.matrix(gtypes)),nrow(gtypes),ncol(gtypes))       #neccessary data formating
gtypes2=FormatGenotypes(gtypes1)   #more data formatting
pop.mbrship=data1[,2]               #population membership for individuals
pop.mbrship=as.numeric(pop.mbrship)  #conver to numeric (I couldn't get the function to work properly without this step)
npopmax=length(unique(pop.mbrship))   #number of populations
fst=Fst(gtypes2$genotypes,gtypes2$allele.numbers,pop.mbrship,npopmax)   #providing necessary parameters for Fst calculation in Geneclust
fst=fst$Pairwise.Fst     #isolate pair-wise fst    (note: if you type "fst" before this line, you will also get  Fit and  Fis, which is useful)
  
popnames=unique(data1[,2])   #population names
FST=cbind(as.character(popnames),fst)         #add population names
FST=rbind(c("",as.character(popnames)),FST)    #add population names
FST

write.table(FST,file="FST.txt",row.names=FALSE,col.names=FALSE,sep="\t",append=FALSE)  #write results to a file
This entry was posted in methods, population genetics, software. Bookmark the permalink.