Using R to run parallel analyses of population genetic data in STRUCTURE: ParallelStructure

The structure of human populations across Eurasia, as estimated by Rosenberg et al (2002)
The structure of human populations across Eurasia, as estimated by Rosenberg et al (2002)

In this guest post, Francois Besnier explains how to use ParallelStructure, his new R package for running STRUCTURE analyses in parallel computing environments.

To start with, thanks to The Molecular Ecologist blog team (Tim and Jeremy) for the invitation to write post about the R package ParallelStructure. Briefly introducing myself: I am a post-doctoral researcher at the Institute of Marine Research in Bergen, Norway.

Among other things, our institute is in charge of monitoring the genetic structure of salmon populations in Norway, which often puts us in the situation to play with fairly large data sets (thousands of individuals, dozens of populations) to assess population admixture and perform individual assignment tests in STRUCTURE. In such situations, we are often limited by the computation time. Thus, to make the best out of our computer resources, we wrote some scripts to run STRUCTRURE analyses in parallel on multi-core computers. As this might be of use for a broader range of purposes, those scripts were encapsulated in a R package called ParallelStructure, which is available on R-forge, and described in Besnier & Glover 2013.

So why a R package to run STRUCTURE in parallel?

STRUCTURE is not implemented to run in parallel. In short: it does not make any use of our fancy 4 (8) core/CPU computers. [Ed: Most desktop and laptop computers now have multiple processor cores.] It is of course possible to make use of multi-core computers by opening several windows of STRUCTURE manually, and run two or more analyses simultaneously. This may work just as well if the amount of jobs is not too large, but it is not the most efficient solution. If you want to run analyses on 4 processors (or more), it soon feels like you are doing juggling with windows, and not making the most efficient use of your time.

It would thus be very useful to have a tool that automates the distribution of analyses among cores/CPUs, and R sounded like the most relevant platform to implement it for the following reasons:

  1. It is free, open source and available on both windows, Mac OS and Linux systems
  2. A lot of population genetics software is implemented in R, thus, most population geneticists are already familiar with this environment.

How does it work?

In the past few years, R has been provided with some excellent libraries for parallel computing. Some of them work better under Unix OS while others fit better Windows OS. For this reason ParallelStructure consists of two almost identical functions: MPI_structure() and parallel_structure(). The arguments and output are identical for these two functions; however, they rely on two different packages for parallel computing.

MPI_structure() relies on the Rmpi package, which is available for both platforms, but its installation might be more tricky on Linux. Parallel_structure() relies on the mclapply() function from the Parallel package, which has been part of the basic R installation since version 2.14. Unfortunately, mclapply() does not work under Windows OS, nor on the graphical R environment (you must run R from the shell). We thus recommend you use MPI_structure() under Windows and Mac OS, while parallel_structure() is an easier solution for Linux as it does not require installing Rmpi package.

In a nutshell, the script relies on the command line (back end) version of STRUCTURE. The user defines a set of jobs in a separate text input file that I usually call “joblist”. The joblist file contains one line per job and five columns:

  1. A job ID, e.g. “T1”
  2. A comma-separated list of populations
  3. Value of parameter K, the hypothesized number of genotype clusters
  4. Number of iterations of burnin
  5. Number of post-burn-in iterations

For each job, the R script generates: one temporary data file, one parameter file for STRUCTURE, and executes the shell command “structure –m ” on a separate core/CPU.

What does it do for you?

Besides setting parallel runs of STRUCTURE on multi-core computers, ParallelStructure includes some functionalities to help the user deal with large datasets. It is for example possible to run STRUCTURE on a subset of populations from the main dataset without generating a new data file containing only the populations of interest. In ParallelStructure a temporary data file is automatically generated for each job. This data file contains only the population(s) of interest for the given job.

The list of populations to be analyzed in a given job can also be replaced by the string character “pairwise.matrix”. In such case, all pairs of populations are analyzed.

On the output corner, ParallelStructure writes the standard STRUCTURE output files “_f” in a separate directory specified by the user, along with standard “_q” files if printqhat=1. If plotoutput=1 ParallelStructure will also write an individual assignment graphic output file in .PDF format in the same directory. Finally, a summary of all jobs specified in the joblist file is written in the working directory. This summary is labeled “results_summary”, formatted as .csv file, and contains the main parameters and statistics of each job.

An example

ParallelStructure is distributed with an example data and joblist, which can be called by typing:

data(structure_data)

and

data(structure_jobs)

structure_data contains microsatellite data on 9 loci for 987 individuals divided in 8 populations. The data is formatted as standard STRUCTURE input with one column of individual IDs, one column of populations IDs followed by nine columns of genotype data, and two rows per individuals. (Data files with one row per individual can be used with parameter onerowperind=1)

Joblist contains 20 jobs labeled T1 to T20. Each job ID is followed by a comma separated list of populations, a value of K, a number of burnin and number of iterations to be performed. You can see that not all the jobs include all the populations of the dataset.

Both data and joblist need to be written in a text file in the working directory:

write(t(structure_jobs), ncol=length(structure_jobs[1,]), file='joblist1.txt')

and

write(t(structure_data), ncol=length(structure_data[1,]), file='example_data.txt')

It is also convenient to have a separate folder to store the results files, e.g “structure_results”.
Before running the analyses, it is important to get the path to the executable back end version of STRUCTURE.

This is in Mac OS: Applications/Structure.app/Contents/Resources/Java/bin/

And in Windows: c:/Program Files (x86)/Structure2.3.4/bin/

To make sure we found the right one: opening the STRUCTURE program contained in this directory (by double clicking) should open a command line window displaying the following message:

Can't open the file "mainparams".
Exiting the program due to error(s) listed above.

If it opens a front-end STRUCTURE window, you are looking in the wrong place. We then assign STRUCTURE path as

STR_path="c:/Program Files (x86)/Structure2.3.4/bin/"

for Windows users; Or

STR_path='/Applications/Structure.app/Contents/Resources/Java/bin/'

for Mac users. Running the 20 jobs defined in joblist on 4 CPUs can be achieved by typing:

MPI_structure(structure_path=STR_path, joblist='joblist1.txt', n_cpu=4, infile='example_data.txt', outpath='structure_results/', numinds=987, numloci=9, printqhat=1, plot_output=1)

As the number of iterations and burnin are small in the example joblist (1000 and 10,000), the analysis will only take a few minutes. We used the parameters printqhat=1 and plot_output = 1, therefore the structure_results folder will contain both “_f” “_q” files as well as individual assignment plots in .pdf format.

From our performance testing, ParallelStructure can speed up the analyses by a factor 3 on a 4-core computer and by a factor 6 on 8 cores. I hope that our little scripts are useful for other STRUCTURE users and help you save some time in your analyses. It has been tested on a variety of datasets and would work with most standard input files for STRUCTURE. If it doesn’t work with yours, I will do my best to help.

Thank you for reading!

Reference

Besnier F and KA Glover. 2013. ParallelStructure: A R package to distribute parallel runs of the population genetics program STRUCTURE on multi-core computers. PLoS ONE 8: e70651. doi:10.1371/journal.pone.0070651.

About Jeremy Yoder

Jeremy B. Yoder is an Associate Professor of Biology at California State University Northridge, studying the evolution and coevolution of interacting species, especially mutualists. He is a collaborator with the Joshua Tree Genome Project and the Queer in STEM study of LGBTQ experiences in scientific careers. He has written for the website of Scientific American, the LA Review of Books, the Chronicle of Higher Education, The Awl, and Slate.
This entry was posted in howto, population genetics, R, software, STRUCTURE. Bookmark the permalink.