Calculating genetic differentiation with R

http://evolution.berkeley.edu/evosite/evo101/IIIDGeneticdrift.shtml

http://evolution.berkeley.edu/evosite/evo101/IIIDGeneticdrift.shtml

As molecular ecologists, it is often necessary and useful to calculate some measure of genetic differentiation. This is often accomplished with metrics such as Wright’s Fst an or an unbiased analog (e.g., Weir & Cockerham’s Fst; G’st etc.). In addition to calculating global estimates, we often want to calculate estimates of genetic differentiation between pairs of populations. Furthermore, we often need to calculate similar measures of genetic differentiation despite having different marker types, different levels of polymorphism, different amounts of missing data, different sampling schemes, and vastly different questions. Despite all of these differences, we know that there are many useful metrics for measuring genetic differentiation and we also probably don’t want to write the code for each estimator from the ground up. Many of the estimators for genetic differentiation have been implemented in packages available for R. Each package typically has its own syntax and can take some time to fully work through (particularly without examples).

Thus, we thought it would be useful for the molecular ecology community to share in a communal GitHub repository focused on calculating genetic differentiation in R. Other languages will, of course, be accepted too but it seems like R has the greatest diversity of metrics. Over time, this will hopefully become a valuable resource. As an example, I have posted an R script that calculates ubiased global Fst using the package Hierfstat. The repository is available at The Molecular Ecologist organization on GitHub under the repository ‘Genetic Differentiation’. When testing your code, it would be best to try it out on two standardized input files (one for SNPs and one for microsatellites). The example data sets currently contain individuals from 4 populations at 100 SNPs and 30 microsatellites (they are simulated data). It would be useful to have a submissions of scripts that are: 1.) short and simple, 2.) heavily commented (using ‘#’ in R), and 3.) produce useful output for both of the example data sets.

Because this is a work in progress, please feel free to suggest or submit other ‘standard’ example data sets and other ideas (e.g., currently the test data sets do not have missing data). Please comment below (I can easily update this text) and also please send me your code snippets to my email (and I will curate, test, and upload them) or you should be able to add scripts directly to the repository.

Below is a log of all the current entries:

Global FST: Hierfstat

RedditDiggMendeleyPocketShare and Enjoy

About Mark Christie

Mark Christie is a post-doctoral fellow in the Department of Zoology at Oregon State University.
This entry was posted in population genetics, R, software. Bookmark the permalink.
  • http://dyerlab.bio.vcu.edu Rodney J Dyer

    You may want to look at http://dyerlab.github.io/gstudio/ as well.

  • Kevin Keenan

    Hi Mark,

    I like the idea of a concerted effort to make the task of calculating population genetic statistics easier. I especially like the focus on R too. However, there is an argument to be made that the current implementation of this effort is not the best way forward. Instead, an approach whereby an existing package/project is extended to meet community needs would probably be much more productive, with the added benefit of having a strong core for code continuity etc. (a must for a long-term maintainable project).

    I am the developer of a population genetics package, ‘diveRsity’ in R. As it stands, diveRsity has become much more than a genetic differentiation calculator, but its origins lie there. Personally, I think that calculating differentiation using the ‘diffCalc’ function is trivial. For example, if the user has a genepop file named “myfile.gen” in their working directory, the following code will result in the calculation of Fst (Weir & Cockerham, 1984), Gst (Nei & Chesser, 1983), G’st (Hedrick, 2005), G”st (Meirmans & Hedrick, 2011) and D (Jost, 2008) for all loci, globally, locus pairwise and pairwise (all loci):

    ### R code
    # load diveRsity
    library(diveRsity)
    # calculate stats
    diff_stats <- diffCalc(infile = "myfile.gen", outfile = "myresults", fst = TRUE,
    pairwise = TRUE)
    ### END

    The function returns result both to the R workspace and writes conveniently formatted plain text to file. With the addition of a few more arguments, users can calculation bias corrected 95% CIs for locus, global and pairwise differentiation statistics, even making use of parallel processing without any additional R coding knowledge. By making use of highly optimised code as well as C++, diveRsity provide a powerful and memory efficient (100,000+ loci can be analysed) solution to calculating genetic differentiation. I doubt that this level of simplicity or efficiency would be possible under the current suggest solution.

    While what you are suggesting is commendable, it think efforts might be much better spent improving already existing projects. To this end, the diveRsity project is on github (http://goo.gl/2xv7iB) and is under constant development. I would be very happy to hear from others about improvements, code contributions etc.

    Let me know what you think? I would be more than happy to consider improvements for future releases of the package.

    Sincerely

    Kevin Keenan
    QUB

    • http://dyerlab.bio.vcu.edu Rodney J Dyer

      So I think there is a problem with this kind of approach. Fundamentally, the reason someone may use one more more of these measures should rely upon their understanding of the parameter and the biology of their system. There are differences in assumptions and derivations for each. Weir & Cockerham’s theta, an estimator of Wright Fst, is an analysis of variance approach. Wrights original Fst=frac{sigma_p^2}{p(1-p)} is a bit different. In fact, AMOVA is just a multilocus W&C theta where you sum the numerators and denominators across loci before estimating the index. Gst is the equilibrium estimator and the G’st and Dest parameters were created for loci with lots of alleles. It doesn’t make sense to simply hitting the user with all of them and then wading through the mess that follows. This is one of the reasons so many people are absolutely confused by the real statistical differences between these parameters.

      • Kevin Keenan

        Rodney, I agree completely that the use of any of these statistics should be informed. My decision to adopt the behaviour I have in diveRsity is based on the assumption that it is. In cases where my assumption is correct, the behaviour of the package will have no negative effect since the user gets everything they need with a single command, and they can simply ignore inappropriate statistics. I may be naive on this point, but in cases where my assumption is incorrect, is the problem really that a user can easily generate a range of statistics, or does it lie elsewhere? Should software be designed to account for lack of expertise? Or should we hold researcher to a higher standard than is often expected. If you want to do popgen analysis, I think you should know how to. That’s not to say that I don’t feel a responsibility to minimise incorrect usage. Providing resources and examples, where newcomers can educate themselves on these statistics, is essential, and I do that as best I can. I should say that your gstudio tutorial page does an excellent job of this too.

        • http://dyerlab.bio.vcu.edu Rodney J Dyer

          Increasingly there is a notion that there is a ‘set’ of analyses that everyone who ever does any genetic marker work needs to include. Sit down and look through MolEcol and see how many of the published papers REALLY needed to run STRUCTURE (did they really have a hypothesis of separate groups) or Fst+Gst+AMOVA (do they really need to interpret that many different ways of looking at the same question). Soon these things become ingrained, and reviewers are saying to my students, run X, Y, & Z because everyone else is (independent of its relevance for the hypotheses being studied).

  • Mark Christie

    Hi Kevin,

    Thanks for your comments and information about diveRsity. I agree that it will be useful to have a package that can calculate a large number of indices simultaneously. I also think that even though certain R packages may overlap with one another, they will each also contribute something that is unique. For example, one aspect of Hierfstat that I like is the way that it calculates significance for pairwise measures of genetic differentiation. Undoubtedly there are other subtle differences between packages. An ideal solution would be some sort of wrapper or ‘metapackage’ that could execute script from a number of packages. I know that in practice one can use the ‘require’ call in R, but I am not sure how that would work and invariably people would cite only the metapackage and not the original contributors work.

    If you send me a pull request I could easily add you the repository. What are your thoughts on the two test files? Would a genepop format be more appropriate? Maybe I should write a post on what the best standardized file format for population genetics should be.

    Cheers,

    Mark

    • Guest

      PLINK format or VCF format would be much more convenient for SNPs than the (nonstandard?) Structure-like format that is currently being used, as there are a large number of R packages that handle PLINK/VCF for data i/o. (Those formats require marker position data, but since Fst doesn’t depend on marker location, you can just assign arbitrary positions if the real genomic location is unknown.)

      For microsatellites, Structure format and GenePop format are both highly annoying. Namely because:
      1) In strict Structure format, populations must be given numerical IDs, whereas I want to be able to give them meaningful alphanumeric names.
      2) In Genepop format, it is impossible to name your populations at all (so many programs just adopt the name of the first individual in the population).

      Spagedi format is quite good, and it is easily converted to from other formats using PGDSpider, GenAlEx, etc. But if you’re going to put your data in Spagedi format anyway, you might as well just use Spagedi to do the Fst analysis in the first place. Unless you want to use a more exotic version of Fst or genetic distance than the ones Spagedi offers.

    • Aaron Adamack

      On the use of two test files, based on my experience with the development of PopGenReport, I’d suggest using running at least 5-10 different datasets through your functions. It’s amazing the number of different ways users can find to break your program. One example being the enthusiastic honours student who was trying to calculate pairwise Fsts across several populations including one that had only a single individual.

  • Aaron Adamack

    Have you taken a look at the mmod package by David Winter? It already does several measures of differentiation.

  • Aaron Adamack

    As a follow-up, mmod use the genind data format that I think was implemented in adegenet. There are several functions in the adegenet and PopGenReport packages that are capable of importing data from a variety of common sources. adegenet includes several datasets that could be used as standards and the PopGenReport package has the bilby dataset that could be used as a standard. Note that the bilby dataset is based on real data, but we made a few modifications for demonstration purposes.

    • Mark Christie

      Thanks Aaron! I agree that a larger number of test data sets will be useful. I will take a look through the adegenet documentation to see what sort of conversions can be applied and i will update this post accordingly. It sounds like plink and/or vcf formats are popular too. Seems like an R package similar to Convert or Create would be useful…

  • Pingback: Science online, take the stairs edition | Jeremy Yoder