Using R to run parallel analyses of population genetic data in STRUCTURE: ParallelStructure

The structure of human populations across Eurasia, as estimated by Rosenberg et al (2002)

In this guest post, Francois Besnier explains how to use ParallelStructure, his new R package for running STRUCTURE analyses in parallel computing environments.

To start with, thanks to The Molecular Ecologist blog team (Tim and Jeremy) for the invitation to write a post about the R package ParallelStructure. To briefly introduce myself: I am a postdoctoral researcher at the Institute of Marine Research in Bergen, Norway.

Among other things, our institute is in charge of monitoring the genetic structure of salmon populations in Norway, which often means working with fairly large data sets (thousands of individuals, dozens of populations) to assess population admixture and perform individual assignment tests in STRUCTURE. In such situations, we are often limited by computation time. Thus, to make the best of our computing resources, we wrote some scripts to run STRUCTURE analyses in parallel on multi-core computers. As this might be useful for a broader range of purposes, those scripts were encapsulated in an R package called ParallelStructure, which is available on R-forge and described in Besnier & Glover 2013.

So why an R package to run STRUCTURE in parallel?

STRUCTURE is not implemented to run in parallel. In short: it does not make any use of our fancy 4- or 8-core computers. [Ed: Most desktop and laptop computers now have multiple processor cores.] It is of course possible to make use of multi-core computers by opening several STRUCTURE windows manually and running two or more analyses simultaneously. This works well enough if the number of jobs is not too large, but it is not the most efficient solution. If you want to run analyses on 4 or more processors, it soon feels like you are juggling windows rather than making the most efficient use of your time.

It would thus be very useful to have a tool that automates the distribution of analyses among cores/CPUs, and R sounded like the most relevant platform to implement it for the following reasons:

  1. It is free, open source, and available on Windows, Mac OS, and Linux systems.
  2. A lot of population genetics software is implemented in R, so most population geneticists are already familiar with the environment.

How does it work?

In the past few years, R has gained some excellent libraries for parallel computing. Some of them work better on Unix-like systems, while others are a better fit for Windows. For this reason, ParallelStructure consists of two almost identical functions: MPI_structure() and parallel_structure(). The arguments and output are identical for the two functions; however, they rely on two different packages for parallel computing.

MPI_structure() relies on the Rmpi package, which is available on both platforms, but whose installation can be tricky on Linux. parallel_structure() relies on the mclapply() function from the parallel package, which has been part of the base R installation since version 2.14. Unfortunately, mclapply() does not work under Windows, nor in the graphical R environment (you must run R from the shell). We therefore recommend MPI_structure() under Windows and Mac OS, while parallel_structure() is an easier solution on Linux as it does not require installing the Rmpi package.
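To illustrate the mechanism that parallel_structure() builds on, here is a minimal sketch of mclapply() farming independent jobs out to several cores. The run_job() function is a toy stand-in for launching one STRUCTURE run, not part of the package:

```r
# Minimal sketch of mclapply()-style parallelism from the parallel package.
# run_job() is a toy stand-in for one STRUCTURE job.
library(parallel)

run_job <- function(i) i^2

# mclapply() forks worker processes on Unix-like systems; on Windows it
# only supports mc.cores = 1, hence the fallback.
cores <- if (.Platform$OS.type == "windows") 1L else 2L
res <- mclapply(1:4, run_job, mc.cores = cores)
unlist(res)
```

ParallelStructure does essentially this, with each call launching one back-end STRUCTURE process instead of a toy computation.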

In a nutshell, the script relies on the command-line (back-end) version of STRUCTURE. The user defines a set of jobs in a separate text input file that I usually call a “joblist”. The joblist file contains one line per job and five columns:

  1. A job ID, e.g. “T1”
  2. A comma-separated list of populations
  3. Value of parameter K, the hypothesized number of genotype clusters
  4. Number of iterations of burnin
  5. Number of post-burn-in iterations
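As a concrete sketch, here is how such a joblist could be written from R. The job IDs, population lists, and run lengths below are made up for illustration:

```r
# Sketch: building a small joblist in R. Each row is one job:
# job ID, comma-separated populations, K, burn-in, post-burn-in iterations.
# All values are illustrative, not taken from the package's example data.
jobs <- rbind(
  c("T1", "1,2,3", "2", "1000", "10000"),
  c("T2", "4,5",   "3", "1000", "10000")
)
# write() fills rows of ncolumns values; t() makes each matrix row one line.
write(t(jobs), ncolumns = ncol(jobs), file = "joblist_demo.txt")
```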

For each job, the R script generates one temporary data file and one parameter file for STRUCTURE, then executes the shell command “structure -m” on a separate core/CPU.

What does it do for you?

Besides setting parallel runs of STRUCTURE on multi-core computers, ParallelStructure includes some functionalities to help the user deal with large datasets. It is for example possible to run STRUCTURE on a subset of populations from the main dataset without generating a new data file containing only the populations of interest. In ParallelStructure a temporary data file is automatically generated for each job. This data file contains only the population(s) of interest for the given job.

The list of populations to be analyzed in a given job can also be replaced by the character string “pairwise.matrix”. In that case, all pairs of populations are analyzed.
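For instance, a joblist line requesting all pairwise analyses could be written like this (the job ID and run lengths are again illustrative):

```r
# Sketch: one job whose population list is the "pairwise.matrix" keyword,
# which tells ParallelStructure to analyze every pair of populations.
pairwise_job <- c("T_pairs", "pairwise.matrix", "2", "1000", "10000")
write(pairwise_job, ncolumns = length(pairwise_job),
      file = "joblist_pairwise.txt")
```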

On the output side, ParallelStructure writes the standard STRUCTURE output files (“_f”) in a separate directory specified by the user, along with the standard “_q” files if printqhat=1. If plot_output=1, ParallelStructure will also write an individual assignment plot in PDF format to the same directory. Finally, a summary of all jobs specified in the joblist file is written to the working directory. This summary is labeled “results_summary”, formatted as a .csv file, and contains the main parameters and statistics of each job.
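Since the summary is a plain .csv file, it is easy to post-process in R. The sketch below first mocks up a tiny summary file, because the exact column names (job ID, K, log-likelihood here) are assumed for illustration; check the header of the results_summary file that ParallelStructure actually writes:

```r
# Sketch: reading a job summary back into R and picking the job with the
# highest (hypothetical) log-likelihood. The file and its columns are
# mocked here; the real results_summary layout may differ.
writeLines(c("job,k,ln_prob",
             "T1,2,-1234.5",
             "T2,3,-1200.1"),
           "results_summary_demo.csv")
res <- read.csv("results_summary_demo.csv")
best <- res[which.max(res$ln_prob), ]
best$job
```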

An example

ParallelStructure is distributed with an example dataset and joblist, which can be loaded by typing:

data(structure_data)

and

data(structure_jobs)

structure_data contains microsatellite data at 9 loci for 987 individuals divided into 8 populations. The data are formatted as standard STRUCTURE input, with one column of individual IDs and one column of population IDs, followed by nine columns of genotype data, and two rows per individual. (Data files with one row per individual can be used with the parameter onerowperind=1.)
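To make the two-rows-per-individual layout concrete, here is a toy fragment in the same shape. The IDs and allele sizes are invented, and a real file would have nine genotype columns rather than two:

```r
# Sketch of the STRUCTURE input layout: individual ID, population ID,
# then one column per locus, with two rows (one per allele) per individual.
# All values are invented for illustration.
toy <- rbind(
  c("ind1", "1", "120", "104"),
  c("ind1", "1", "122", "106"),
  c("ind2", "1", "120", "104"),
  c("ind2", "1", "124", "108")
)
write(t(toy), ncolumns = ncol(toy), file = "toy_structure_data.txt")
```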

structure_jobs contains 20 jobs labeled T1 to T20. Each job ID is followed by a comma-separated list of populations, a value of K, a number of burn-in iterations, and a number of iterations to be performed. You can see that not all jobs include all the populations in the dataset.

Both data and joblist need to be written in a text file in the working directory:

write(t(structure_jobs), ncol=length(structure_jobs[1,]), file='joblist1.txt')

and

write(t(structure_data), ncol=length(structure_data[1,]), file='example_data.txt')

It is also convenient to have a separate folder to store the results files, e.g. “structure_results”.

Before running the analyses, it is important to find the path to the back-end STRUCTURE executable. On Mac OS this is: /Applications/Structure.app/Contents/Resources/Java/bin/

And on Windows: c:/Program Files (x86)/Structure2.3.4/bin/

To make sure you have found the right one: opening the STRUCTURE program in this directory (by double-clicking) should open a command-line window displaying the following message:

Can't open the file "mainparams".
Exiting the program due to error(s) listed above.

If it opens a front-end STRUCTURE window, you are looking in the wrong place.

We then assign STRUCTURE path as

STR_path="c:/Program Files (x86)/Structure2.3.4/bin/"

for Windows users, or

STR_path='/Applications/Structure.app/Contents/Resources/Java/bin/' 

for Mac users.

Running the 20 jobs defined in joblist on 4 CPUs can be achieved by typing:

MPI_structure(structure_path=STR_path, joblist='joblist1.txt', n_cpu=4, infile='example_data.txt', outpath='structure_results/', numinds=987, numloci=9, printqhat=1, plot_output=1)

As the numbers of iterations and burn-in in the example joblist are small (1,000 and 10,000), the analysis will take only a few minutes. We used the parameters printqhat=1 and plot_output=1, so the structure_results folder will contain both “_f” and “_q” files, as well as individual assignment plots in PDF format.

In our performance testing, ParallelStructure sped up analyses by a factor of 3 on a 4-core computer and by a factor of 6 on 8 cores. I hope our little scripts are useful for other STRUCTURE users and help you save some time in your analyses. The package has been tested on a variety of datasets and should work with most standard STRUCTURE input files. If it doesn't work with yours, I will do my best to help.

Thank you for reading!

Besnier F and KA Glover. 2013. ParallelStructure: A R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers. PLoS ONE 8: e70651. doi:10.1371/journal.pone.0070651.

About Jeremy Yoder

Jeremy Yoder is a postdoctoral associate in the Department of Plant Biology at the University of Minnesota. He also blogs at Denim and Tweed and Nothing in Biology Makes Sense!, and tweets under the handle @jbyoder.
This entry was posted in howto, population genetics, R, software, STRUCTURE.
  • Andrey Yurchenko

    I use it on a Linux cluster through the command line and have one problem: I cannot edit the extraparams file! I would like to change the noadmix and linkage options, but I don't understand how to do this. If I start the software as in the guide, everything works well. If I try to change anything, nothing works. How can I edit the noadmix and linkage options?

    • francois besnier

      Hi Andrey,
      Thanks for reporting the problem you encountered!

      It is not surprising that you cannot edit the extraparams file: so far, ParallelStructure is implemented in such a way that all parameters are edited in the mainparams file by the R script.

      If you want to set the linkage and noadmix parameters, you can specify them in the ParallelStructure call. For example:

      MPI_structure(structure_path=my_path, joblist='joblist1.txt', n_cpu=4, infile='my_data.txt',
      outpath='structure_results/', numinds=987, numloci=9, printqhat=1, noadmix=1, linkage=1)

      The linkage and noadmix parameters will then be written to the mainparams file, and STRUCTURE should run just fine with this.

      Now, I have to warn you that I have not tested ParallelStructure with all possible parameter settings. I actually haven't played with the noadmix and linkage options yet.

      Nevertheless, I am pretty confident that setting noadmix to 1 instead of the default 0 should not be any problem.

      Linkage is a bit more complicated, as it may imply a slightly different input file format that might make ParallelStructure crash. If you encounter such a problem, I would be very grateful if you forwarded me an example of your input files. I could then improve the package to make it work with the linkage file format as well.

  • Ali Tahir

    Hi Jeremy,

    I am trying to start ParallelStructure on Linux using example file

    > setwd("/home/atahir/Analysis/Str_parallel")
    > require(ParallelStructure)
    Loading required package: ParallelStructure
    > data(structure_data)
    > data(structure_jobs)
    > write(t(structure_jobs), ncol=length(structure_jobs[1,]), file='joblist1.txt')
    > write(t(structure_data), ncol=length(structure_data[1,]), file='example_data.txt')
    > STR_path='/home/atahir/Software/Structure/bin/'
    > parallel_structure(structure_path=STR_path, joblist='joblist1.txt', n_cpu=4, infile='example_data.txt', outpath='structure_results/', numinds=987, numloci=9, printqhat=1, plot_output=1)

    but I am getting following error message

    [1] "starting work at 2014-03-20 17:36:13"
    sh: line 1: 29662 Segmentation fault (core dumped) /home/atahir/Software/Structure/bin/structure -m parameter_job_T1 > /dev/null

    Warning message:
    In selectChildren(ac, 1) : error ‘Interrupted system call’ in select
    sh: line 1: 29721 Segmentation fault (core dumped) /home/atahir/Software/Structure/bin/structure -m parameter_job_T14 > /dev/null

    Could you please suggest anything to solve this issue? Thank you.

    Ali

    • http://www.denimandtweed.com/ Jeremy Yoder

      Hi, Ali. Although I’m the one who posted this article, please note that it’s a guest post from one of the creators of ParallelStructure, Francois Besnier. I would suggest that you get in touch with him (or whoever is credited as corresponding author on the paper) for help with technical issues.