Calculating pair-wise, unbiased Fst with R:

Calculating Weir and Cockerham’s FST is very useful because it is unbiased with respect to sample size (Weir and Cockerham 1984).  Without adjusting allele frequency estimates with respect to sample sizes, estimates of FST can be upwardly biased (see Waples 1998; note that precision may still be low).  Of course there has been lively debate about the utility and appropriate usage of FST (see Meirmans and Hedrick 2011 and Whitlock 2011 as just a few recent examples).  Nevertheless, FST still remains a commonly used metric that will likely be used for years to come (if only for comparative purposes).  There are a large number of stand-alone programs that can calculate unbiased F statistics (too many to mention), and they all do an excellent job.  However, if you are working with genetic data in R, then it can be useful to calculate this metric directly, without having to export your data to a standalone program.  This can be particularly useful if you are employing a large number of iterations/simulations etc.

The way that I calculate pair-wise theta is by using the R library Geneclust (Ancelet and Guillot).  Geneclust does much more than calculate FST, but I found this library to be particularly useful at achieving these ends.  I have run across a few other packages that calculate FST in R, but not Weir and Cockerham’s unbiased formulation (Please correct me in the comments if you know of other libraries).  To install Geneclust in R, simply type “install.packages(c(“Geneclust”,”deldir”,”fields”,”spatial”)”.  Below is the code I use to calculate pair-wise Fst values.  As with the allele frequency example, I have again used the Beech data as a nice example (here modified as a tab-delimited text file: DataMEC-11-0816.R1).   I have checked these values against FSTAT and they give identical results, but more double-checking couldn’t hurt.   I hope you find this useful – and please feel free to proffer suggestions, recommendations, opinions etc.

Script posted here:  Fst_script.txt

Share
Posted in methods, population genetics, software | Leave a comment

2012 Update to the NGS Field Guide

After many months of essentially no updates worth mentioning, Ion Torrent announced their Proton sequencer on the morning of January 10th, followed later that same day by an announcement from Illumina of their upgraded HiSeq 2500. This of course, caused some significant heartburn to all of us who were several months into the process of writing NSF equipment grant proposals (and had 2 weeks to modify them).

This was followed by the “megaton announcement” of the gridION and minION nanopore sequencers at AGBT in February. Now, we’ll have to see if anyone gets a sequencer funded from all those NSF proposals written prior to Oxford Nanopore’s announcement.

After some grieving, I put together the update for the 2012 Field Guide. I have done my best to make the comparisons apples to apples, but that’s pretty tough to do with the nanopore sequencers (there is no set run time or number of reads, length distributions are unknown, etc.). So, you will see that I’ve had to be kind of creative in some places, and the numbers may well be overly optimistic.

Nick Loman has some nice summaries of the Proton and Nanopore sequencers. Keith Robinson’s OMICS! OMICS! posts on the nanopore sequencers, Ion Torrent, and MiSeq (as well as others) are also insightful.

Hopefully these updated Field Guide tables will be of some utility to folks:

2012 NGS Field Guide

Share
Posted in next generation sequencing | Leave a comment

Calculating allele frequencies in R……

Here is a simple annotated script to quickly calculate, output, and graph allele frequencies in R.  Here I have  downloaded data (via dryad) from Lander et al. 2011 of a European Beech data set genotyped at 13 microsatellite markers.  I have used a single population as the example data set.

Here is the output from the script at one locus:

Locus allele count frequency
mfc7 107 2 0.001751
mfc7 111 699 0.612084
mfc7 115 2 0.001751
mfc7 117 150 0.131349
mfc7 119 2 0.001751
mfc7 123 55 0.048161
mfc7 125 68 0.059545
mfc7 127 85 0.074431
mfc7 129 76 0.06655
mfc7 131 3 0.002627

and here is a plot of allele frequencies at a different locus:  Allele frequency plot

Please keep in mind that this is only one example of how to calculate allele frequencies.  Feel free to add your own suggestions or improvements in the comments.  If anybody has a script that calculates unbiased allele frequencies (with respect to sample size), I would be happy to post that as well.

Share
Posted in bioinformatics, howto, software | 1 Comment

MoleculaR analyses with R:

R is a powerful data analysis environment that has a large number of useful features.  Chief among them are: (1) it is open source and freely distributed, meaning you can download and install it on any computer you have access to, (2) it runs on Windows, Macs, and Linux machines, (3) it is continually updated, monitored, and improved upon by a dedicated team of computer scientists, statisticians, and researchers, (4) the underlying algorithms in the written functions are quite adaptable, somewhat forgiving, and remarkably efficient, (5) the syntax of the language is very intuitive (newbies may scoff at this, so you may have to take my word on this) and (6) there are a vast array of user-contributed packages that allow for many specialized forms of data analysis (two random examples: multivariate analyses [ade4], map drawing [maptools]).

One attribute of R that unfortunately makes people choose other platforms is that it is largely a command line interface.  This means that there is very little pointing and clicking involved.  It also means that there is some syntax that has to be learned, which can intimidate those new to coding.  I will admit that there certainly is a learning period involved, but the advantages are well worth the effort.  With the command-line interface you can perform nearly any analysis that you can think up.  As a molecular ecologist this is an invaluable skill as genetic data often must be analyzed in novel or non-traditional ways.  My best advice is simply to dive in!  Immersing yourself in the language (i.e., forcing yourself to do all analyses in R) for a short period of time is the quickest way to learn.  Why not be productive at the same time?  Try to run an ANOVA or t-test on some data you have been waiting to analyze.

In the next few posts I will provide some simple walkthroughs to (1) import data and calculate allele frequencies from genotype data, (2) calculate Weir and Cockerham’s unbiased Fst, (3) look at some multivariate packages for population genetics and genomics and (4) explore ways to create publication-quality figures.  Below are some vital links for getting started with R.  Keep in mind that there is a very large community of R users and much help can be found online with a search engine.   Please feel free to recommend other resources and/or tips in the comments:

The main R webpage:  http://www.r-project.org/

What R is all about:  http://www.r-project.org/about.html

The R card: very useful for learning syntax.

R introductory manuals.

R commander: a point and click graphical user interface for R, which simultaneously displays the code in R – great for users intimidated by the command line.

TINN-R : A essential text editor for Windows users**.

RSeek: A search field that only returns results relevant to R.

QuickR:  Useful walkthroughs for some basic functions/calculations.

The R Journal: useful tidbits for more advanced users.

**(Using a text editor that interfaces with R is a great idea.  That way you can save, easily manipulate, and send code to R.  If you don’t save your code, you will forget it.  It is also a great way to document all of your analytical steps in a study such that you can go back and easily repeat what you did years from now).

Share
Posted in howto, software | 2 Comments

Great post about a Mol Ecol paper

Noah Reid over at Nothing in Biology has written a wide ranging and very interesting post about a recent Mol Ecol paper on inbreeding depression in bighorn sheep. I particularly liked this part:

“I found this paper fascinating in that it is the first I’ve read that tries to examine the genomic architecture of this important phenomenon. As the authors note, similar analyses have been conducted in the past, but they are often directed at specifically trying to find loci that cause incompatibility between populations in the context of speciation, rather than examining this complex phenomenon.”

The whole post is here. There’s also an excellent perspective piece by Zach Gompert that accompanies the Miller et al. paper, and both will very likely appear in issue 21.6 (mid March).

Share
Posted in Uncategorized | Leave a comment

Do famous researchers have biased perceptions of peer review?

I thought some of you would be interested in this:

http://scholarlykitchen.sspnet.org/2012/02/01/the-famous-grouse-do-prominent-scientists-have-biased-perceptions-of-peer-review/

Share
Posted in Uncategorized | Leave a comment

A penny for your method: AMPure Substitute

This week may stand as the most exciting week of the year for decreasing the cost of PCR cleanup, DNA cleanup, and library prep. Why, you ask? Because a great paper was published over in Genome Research by Nadin Rohland and David Reich entitled “Cost-effective, high-throughput DNA sequencing libraries for multiplexed target capture”.

Among several great nuggets within this paper is what the authors describe as a “[replacement for] a commonly used commercial kit (AMPure XP kit) for SPRI-clean-up steps with a home-made mix”, and in the supplementary information (available to everyone), they provide the components of this home-made mix.

Below, I’ve added a gel image illustrating differences between this home-brew mix and commercial AMPure XP – I refer to the home-brew mix by the name of the beads in solution (SeraMag Speed Beads). The numbers above each lane show the ratio of AMPure or SeraMag Speed Beads used to product being cleaned – in this case, the product being “cleaned” is the same ladder on the far left of each image.

Share
Posted in howto, methods, next generation sequencing | Leave a comment

Winter break

Since all the bloggers are heading back to their homes and log fires for the Christmas break, we’re going to put the blog on hold until early in 2012… we hope to see you then!

Share
Posted in Uncategorized | Leave a comment

Lego my regex

Regular expressions are something that pretty much everyone working with more than a handful of data should take the time to learn (a handful being around 500 lines). They can easily improve your life, particularly if you’ve ever had to make multiple, mind-numbing changes over more than about 50 lines of an excel file, for instance. They can also be terribly frustrating to learn and often are similar in appearance to Sumerian for the uninitiated. All that’s changing – Zed Shaw’s excellent series of books now delves into regular expression territory (the tutorials for other languages are also very, very good), and it’s a learning exercise that’s worth looking into (keep in mind that this is an early “alpha” version that is evolving):

http://regex.learncodethehardway.org/

Decent knowledge of regular expressions combined with a very good text editor and perhaps a data manipulation tool can fundamentally change the way you work, increase repeatability, and improve your science.

Share
Posted in bioinformatics | Leave a comment

A penny for your method: Nutator

This one requires a little ingenuity on your part (and perhaps some craft, duct tape, construction, and/or welding skills).

A nutator (AKA nutating mixer or rotating mixer) is a gently rocking/rotating platform useful for continuously and gently mixing samples. They are particularly useful if you’re doing bead-binding sorts of operations – like binding biotinylated oligos to streptavidin-coated beads. This is a pretty common workflow for lots of tasks including one that is increasingly useful… target enrichment (AKA sequence capture or solution hybridization).

Continue reading

Share
Posted in howto, methods, next generation sequencing | Leave a comment