Data Archiving: the nitty-gritty

One question we get asked a lot is ‘which data should I make available?’, so I thought it would be helpful to put together a checklist for the broad types of data we receive at Mol Ecol.

The theme throughout is that you generally want to prepare a tab delimited text file with one row per individual/sample, giving all the corresponding data as columns (where it came from, museum voucher numbers, any phenotypic data, the genotype data). Other data that went into the analyses but can’t be given on an individual-by-individual basis (e.g. environmental data for each site) can be provided in separate files.
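
For concreteness, here is a minimal sketch of what such a file might look like (all column names and values are hypothetical; genotype columns would follow in the same way, as in the microsatellite example further down):

    indiv_id  site   latitude  longitude  voucher_no  body_mass_g
    IND001    SiteA  -33.8575  121.8890   MV-10234    42.1
    IND002    SiteA  -33.8575  121.8890   MV-10235    39.8
    IND003    SiteB  -34.0120  122.1031   NA          44.0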

NB: when providing information on where samples were collected, please be as precise as possible. Coordinates obtained from a GPS unit would be ideal.

For taxa where locations cannot be disclosed for conservation reasons, we encourage authors to transform their location data such that the original locations cannot be recovered but the spatial arrangement of the sampling sites is preserved. For example, one could express the coordinates relative to a randomly chosen origin and then rotate them by an undisclosed angle (a sketch of this approach follows).
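
As a minimal sketch of that idea (my illustration, not a prescribed method; the function name is made up), assuming the coordinates are already in a projected planar system such as UTM (rotating raw latitude/longitude would distort distances):

    import math
    import random

    def obscure_locations(coords):
        """Translate sampling sites to a randomly chosen origin and
        rotate them by a random angle. Relative distances and angles
        between sites are preserved, but the true locations cannot be
        recovered without the origin and angle, which should be kept
        private (do not archive them).
        coords: list of (x, y) pairs in a projected coordinate system."""
        xs = [x for x, _ in coords]
        ys = [y for _, y in coords]
        # private parameters: a random origin inside the sites'
        # bounding box and a random rotation angle
        ox = random.uniform(min(xs), max(xs))
        oy = random.uniform(min(ys), max(ys))
        theta = random.uniform(0.0, 2.0 * math.pi)
        transformed = []
        for x, y in coords:
            dx, dy = x - ox, y - oy  # shift to the private origin
            rx = dx * math.cos(theta) - dy * math.sin(theta)
            ry = dx * math.sin(theta) + dy * math.cos(theta)
            transformed.append((rx, ry))
        return transformed

Because pairwise distances are preserved, archiving the transformed coordinates still lets readers reproduce spatial analyses such as isolation by distance.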

Next, we really encourage people to upload all the input files used in the data analysis: the final dataset can be many steps removed from the raw data (especially for DNA sequence data), and providing these files makes your paper that much more useful. (This in turn makes it more likely that other researchers will use your data in their own work, which means more citations and more publications for you.) In a similar vein, including all your analysis scripts is also encouraged, as these allow others to see exactly what you’ve done without you having to spend a huge amount of time explaining it in words in the methods; the same applies to parameter input files for programs such as Structure and IMa2 (a brief sketch of one such file follows).
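
For readers unfamiliar with these files, here is an abridged sketch of what a Structure mainparams file looks like (the values are invented for illustration; consult the Structure documentation for the full set of directives):

    #define INFILE   genotypes.txt  // name of the input data file
    #define OUTFILE  results        // name of the output file
    #define NUMINDS  120            // number of individuals
    #define NUMLOCI  12             // number of loci
    #define PLOIDY   2              // ploidy of the data
    #define MISSING  -9             // code used for missing genotypes
    #define MAXPOPS  3              // assumed number of populations (K)
    #define BURNIN   10000          // length of the burn-in period
    #define NUMREPS  20000          // number of MCMC reps after burn-in

Archiving this alongside the genotype file means anyone can rerun the analysis with exactly your settings.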

Lastly, a readme file is a really useful thing to include with your archived data: it can explain things like the meaning of column headers in the data file, units, precise localities, indicators for missing data, codes for categorical variables, etc. (see Whitlock 2010 for more on this). You don’t need to mention the readme file in your data accessibility paragraph (it should be in the same place as the data!), but I’ve included it in the lists below as a reminder.
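
For example, a readme.txt for a genotype dataset might read something like this (contents hypothetical):

    Data file: genotypes.txt (tab delimited; -9 = missing data)
    Columns:
      indiv_id    unique individual identifier
      site        sampling site code; coordinates are in sites.txt
      mass_g      body mass in grams
      locX_a1/a2  allele sizes in base pairs at locus X
    Categorical codes: sex is M = male, F = female, U = unknown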

I’ve organized the lists below by data type, with the suggested contents of each file given in square brackets. Each entry in a list should appear as an element of the Data Accessibility paragraph (with the exception of the readme file).

a) other data types:

  • Whenever possible, museum voucher numbers should be included for all stored specimens
  • Information about sampling locations and their related variables (e.g. site-level environmental data) should be made available in a separate file

b) DNA sequence data:

  • GenBank/EMBL/DDBJ accession numbers for all unique DNA sequences
  • a file with data on each individual [indiv ID, sampling location, accession number for the sequence at gene 1, gene 2, etc.] (see the example after this list)
  • readme.txt file explaining the contents of the above file

OR

  • a POPSET accession number from NCBI: this should include one accession number for each individual in the sample, and each accession should explicitly state which individual the sequence came from and its population of origin.

optional:
– TreeBASE study number
– DNA sequence alignments (although these are really useful)
– Analysis input files for programs like IMa2, as there are many steps between raw sequences and phased, aligned haplotypes
– Any scripts used to analyse the data (please check the copyright status before making these available)
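
As promised above, an example of the per-individual file (individual IDs and accession numbers here are placeholders, not real accessions):

    indiv_id  sampling_location  COI_accession  cytb_accession
    IND001    SiteA              XX000001       XX000101
    IND002    SiteA              XX000002       XX000102
    IND003    SiteB              XX000003       XX000103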

c) microsatellite data:

  • a file with data on each individual [indiv ID, sampling location, genotype at locus 1, genotype at locus 2, etc.] (see the example after this list)
  • readme.txt file explaining the contents of the above file
  • Analysis input files for e.g. Structure (including parameter files)
  • Any scripts used to analyse the data (please check the copyright status before making these available)

optional:
– Trace files and an accompanying readme explaining the scoring decisions (these would need to go onto Dryad)
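
As an example of the per-individual genotype file (one row per individual, two columns per locus for diploid data; all values hypothetical, with -9 marking missing genotypes). The same layout works for the SNP file in section (d):

    indiv_id  site   loc1_a1  loc1_a2  loc2_a1  loc2_a2
    IND001    SiteA  152      156      210      214
    IND002    SiteA  152      160      210      210
    IND003    SiteB  148      156      -9       -9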

d) SNP data:

  • a file with data on each individual [indiv ID, sampling location, genotype at locus 1, genotype at locus 2, etc.]
  • readme.txt file explaining the contents of the above file
  • TreeBASE study number, if applicable (this is useful but not essential).
  • Analysis input files for e.g. Structure (including parameter files)
  • Any scripts used to analyse the data (please check the copyright status before making these available)

e) microarray data:

  • all the information required under the MIAME 2.0 standard should be available in a public archive.
  • Analysis input files
  • Any scripts used to analyse the data (please check the copyright status before making these available)

f) Next Generation Sequencing data:

The scale and complexity of NGS data means that extra care is needed when archiving it for future generations of researchers. In particular, the steps taken to convert the raw reads into the final dataset may not be fully reproducible even with exactly the same methods, and hence it makes sense to archive the data at several stages of its analysis. This area is worthy of an entire post of its own, but I’ve tried to summarise the main types of data worth archiving below (thanks to Nolan Kane and Sebastien Renaut for advice on this).

  • raw read data from next generation sequencing is an important resource, and the raw reads should be stored on e.g. the SRA or the ENA. NB: please provide a table linking each sample to its corresponding individual barcode adaptor when this technology has been used (see the example table at the end of this post).

  • the sequence alignment (e.g. .sam/.bam/.ace file) should be publicly archived whenever possible, although this file can be very large (e.g. if the reads are aligned to an existing genome). The reference genome/transcriptome/gene of interest will also need to be available. Dryad is a good place for these files.

  • at the very least, the final data file that the analyses were based on should be publicly available (e.g. SNP calls, indel calls, expression values for each sample).

  • the scripts used to generate the final dataset from the raw reads (very useful but not absolutely necessary)
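
Finally, as mentioned under the raw reads bullet above, the barcode table might look like this (run names and adaptor sequences are invented for illustration):

    run_id  barcode_adaptor  indiv_id  site
    RUN01   ACGTAC           IND001    SiteA
    RUN01   TGCATG           IND002    SiteA
    RUN02   ACGTAC           IND003    SiteB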