Data Archiving: the nitty-gritty

The joint data archiving policy has been in place for a year now, and it’s been very encouraging to see how receptive everyone has been to this change. One question we get asked a lot is ‘which data should I make available?’, so I thought it would be helpful to put together a checklist for the broad types of data we receive at Mol Ecol.

The theme throughout is that you generally want to be preparing a tab delimited text file with one row per individual/sample, and then giving all the corresponding data as columns (where it came from, any phenotypic data, the genotype data). The other data that went into the analyses that can’t be given on an individual by individual basis (e.g. environmental data for each site) can be provided in separate files.
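As a rough illustration of that layout, here is a minimal Python sketch that writes a tab-delimited table with one row per individual. The individual IDs, site names, phenotype column and genotype values are all invented for the example, and "NA" is just one possible missing-data code; none of this is prescribed by the policy.

```python
import csv
import io

# Hypothetical records: every ID, site name and value here is made up for illustration.
individuals = [
    {"indiv_id": "IND001", "site": "SiteA", "body_length_mm": "52.1",
     "locus1": "152/156", "locus2": "201/201"},
    {"indiv_id": "IND002", "site": "SiteA", "body_length_mm": "NA",
     "locus1": "152/152", "locus2": "201/205"},
    {"indiv_id": "IND003", "site": "SiteB", "body_length_mm": "48.7",
     "locus1": "NA", "locus2": "205/205"},
]


def write_individual_table(rows, handle):
    """Write one header line, then one tab-delimited row per individual."""
    fieldnames = list(rows[0].keys())
    writer = csv.DictWriter(handle, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Written to an in-memory buffer here; a real file opened with
# open("individuals.txt", "w", newline="") would work the same way.
buffer = io.StringIO()
write_individual_table(individuals, buffer)
table_text = buffer.getvalue()
print(table_text)
```

Keeping the sampling location as a column in this file makes it easy to link each individual to any site-level data supplied separately.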

Next, we really encourage people to upload all the input files used in the data analysis: the final data set can be many steps removed from the raw data (especially DNA sequence data), and providing these files makes your paper that much more useful. (This in turn makes it more likely that other people will ask to include it in their research, which equals more citations and more publications for you.) In a similar vein, including all your analysis scripts is also encouraged, as these allow others to see exactly what you’ve done without spending a huge amount of time trying to explain it in words in the methods; the same applies to parameter input files for e.g. Structure and IMa2.

Lastly, a readme file is a really useful thing to include with your archived data: it can explain things like the meaning of column headers in the data file, units, precise localities, indicators for missing data, codes for categorical variables, etc. (see Whitlock 2010 for more on this). You don’t need to specify the location of the readme file in your data accessibility paragraph (it should be in the same place as the data!), but I’ve included it in the lists below as a reminder…
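One way to keep a readme in step with the data file is to generate it from a small table of column descriptions and check that every column is covered. The sketch below is only an illustration under assumed names: the file name individuals.txt, the column names, the descriptions and the "NA" missing-data code are all hypothetical.

```python
# Hypothetical column descriptions; adapt the names, units and codes to your own data.
column_descriptions = {
    "indiv_id": "unique identifier for each individual",
    "site": "sampling location (see the separate site file for coordinates)",
    "body_length_mm": "body length in millimetres",
    "locus1": "microsatellite genotype, two allele sizes in base pairs separated by '/'",
}
MISSING_CODE = "NA"  # assumed missing-data indicator


def build_readme(data_filename, descriptions, missing_code):
    """Return readme text listing each column and the missing-data code."""
    lines = [f"README for {data_filename}", "", "Columns:"]
    for name, description in descriptions.items():
        lines.append(f"  {name}: {description}")
    lines += ["", f"Missing data are coded as {missing_code}."]
    return "\n".join(lines) + "\n"


def undocumented_columns(header, descriptions):
    """List the data-file columns that the readme does not yet explain."""
    return [column for column in header if column not in descriptions]


readme_text = build_readme("individuals.txt", column_descriptions, MISSING_CODE)
data_header = ["indiv_id", "site", "body_length_mm", "locus1", "locus2"]
missing = undocumented_columns(data_header, column_descriptions)

print(readme_text)
print("Undocumented columns:", missing)
```

Here the check flags locus2, which appears in the data header but has no description yet.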

I’ve organized the lists below in terms of the different data types, with the suggested contents of the files themselves in square brackets. Each entry of the list should appear as an element of the Data Accessibility paragraph (with the exception of the readme file).

a) DNA sequence data:

  • Genbank/EMBL/DDBJ accession numbers for all unique DNA sequences
  • a file with data on each individual [indiv ID, sampling location, accession number for sequence at gene 1, gene 2 etc]
  • readme.txt file explaining the contents of the above file

OR

  • a POPSET accession number from NCBI- this should have one accession number for each individual in the sample, and the accession should explicitly state which individual and population of origin the sequence came from.

optional:
- TreeBASE study number
- DNA sequence alignments (although these are really useful)
- Analysis input files for programs like IMa2, as there are a lot of steps between raw sequence and phased and aligned haplotypes
- Any scripts used to analyse the data (please check the copyright status before making these available)

b) microsatellite data:

  • a file with data on each individual [indiv ID, sampling location, genotype at locus 1, genotype at locus 2 etc]
  • readme.txt file explaining the contents of the above file
  • Analysis input files for e.g. Structure (including parameter files)
  • Any scripts used to analyse the data (please check the copyright status before making these available)

optional:
- Trace files and an accompanying readme explaining the scoring decisions (these would need to go onto Dryad)

c) SNP data:

  • a file with data on each individual [indiv ID, sampling location, genotype at locus 1, genotype at locus 2 etc]
  • readme.txt file explaining the contents of the above file
  • TreeBASE study number, if applicable (this is useful but not essential).
  • Analysis input files for e.g. Structure (including parameter files)
  • Any scripts used to analyse the data (please check the copyright status before making these available)

d) microarray data:

  • all the information required under the MIAME 2.0 protocols should be available on a public archive.
  • Analysis input files
  • Any scripts used to analyse the data (please check the copyright status before making these available)

e) Next Generation Sequencing data:

The scale and complexity of NGS data means that extra care is needed when archiving it for future generations of researchers. In particular, the steps taken to convert the raw reads into the final dataset may not be fully reproducible even with exactly the same methods, and hence it makes sense to archive the data at several stages of its analysis. This area is worthy of an entire post of its own, but I’ve tried to summarise the main types of data worth archiving below (thanks to Nolan Kane and Sebastien Renaut for advice on this).

  • raw read data from next generation sequencing is an important resource, but it’s hard to tell whether public archives will continue to accept this type of data. NCBI’s Short Read Archive has said that it will stop accepting new data at the start of September 2011, whereas the ENA’s Sequence Read Archive currently says that it will continue taking NGS data (see here). If there is no publicly-funded place to store raw read data, they can be kept on institutional or personal servers; in this case archiving is desirable but not essential under the policy.

  • the sequence alignment (e.g. .sam/.bam/.ace file) should be publicly archived whenever possible, although this file can be very large (e.g. if the reads are aligned to an existing genome). The reference genome/transcriptome/gene of interest will also need to be available.

  • at the very least, the final data file that the analyses were based on should be publicly available (e.g. SNP calls, indel calls, expression values for each sample).

  • the scripts used to generate the final dataset from the raw reads (very useful but not absolutely necessary)

f) other data types:

  • information about sampling locations and their related variables (e.g. site level environmental data) should be made available in a separate file
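As a sketch of how a separate site file relates to the per-individual file, the Python below reads two small tab-delimited tables and links them on the sampling location. The site names, coordinates, temperature values and column names are invented for illustration; the useful part is the check that every site named in the individual file also appears in the site file.

```python
import csv
import io

# Hypothetical file contents; in practice these would be read from the archived files.
site_file = (
    "site\tlatitude\tlongitude\tmean_annual_temp_c\n"
    "SiteA\t49.26\t-123.25\t10.4\n"
    "SiteB\t50.12\t-122.95\t7.9\n"
)
individual_file = (
    "indiv_id\tsite\n"
    "IND001\tSiteA\n"
    "IND002\tSiteB\n"
    "IND003\tSiteC\n"
)


def read_tsv(text):
    """Parse tab-delimited text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


# Index the site-level rows by site name so each individual can be matched to its site.
sites = {row["site"]: row for row in read_tsv(site_file)}
individuals = read_tsv(individual_file)

# Sites referenced by individuals but absent from the site file.
unmatched = sorted({row["site"] for row in individuals if row["site"] not in sites})

# Individuals combined with their site-level variables (unmatched ones are skipped).
merged = [{**row, **sites[row["site"]]} for row in individuals if row["site"] in sites]

print("Sites missing from the site file:", unmatched)
for row in merged:
    print(row["indiv_id"], row["site"], row["mean_annual_temp_c"])
```

In this made-up example SiteC is reported as missing, which is the kind of mismatch worth fixing before the files are archived.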

7 Responses to Data Archiving: the nitty-gritty

  1. Pingback: Data archiving guide | The Molecular Ecologist

  2. Wei-Ning Bai says:

    I have a problem: when I submit data to Dryad, in the first step, when I select Molecular Ecology and enter Manuscript Number MEC-11-1207, it always says ‘invalid Manuscript Number’. What am I doing wrong?

    • Tim Vines says:

      Hi Wei-Ning,

      Dryad only accepts data from accepted papers, and hence they need us (the journal) to tell them when a paper has been accepted. I’ll send you an email with a link to a Dryad upload patch.

      Tim

  3. James Hereward says:

    I think that data sharing is a good idea in general, and genbank is obviously an exceptional example of the benefits of such an approach, but having dealt with a lot of microsatellite datasets over the last few years I’m not sure that genotypes alone are sufficient. In my experience most of the errors that creep into microsatellite datasets come from mis-scoring, whereas with sequence data the scoring of bases is these days well automated and accurate. I think I would be much happier re-using someone else’s data if I could access the trace files, and an accompanying document justifying why one allele was called and not another. This may sound like an overly technical comment, but I’ve seen first hand the effects of calling errors on results, and if the data is to be held for posterity, I think that it would be highly beneficial that the raw data that inevitably gets lost when a student graduates is also held in Dryad.

    • Tim Vines says:

      Hi James,

      Thanks for this comment. I agree that the scoring of microsats can be contentious. I’ve added it to the list of topics to be discussed for this year’s editorial board meeting, and for the moment I’ll add it as an optional file to the above.

      Tim

  4. James Hereward says:

    The new (2008) primer note summary articles may have been a bit ahead of their time, considering the variety of methods deployed then and the amount of work that was still going into developing microsatellites. Now it seems that the format and the tomato database are somewhat outdated. Seeing as everyone is doing NGS for marker discovery these days, and generating primer pairs for 100s-1000s of putative markers in a short run of a computer program and a few thousand $$$ in sequencing it seems to me that there is a need to archive the results of these analyses. Now that the SRA http://www.ncbi.nlm.nih.gov/sra seems to be back “on” perhaps people could just archive all their raw reads. However, most students that are generating these markers are only really interested in the set that they end up using, and this is all that is provided for in the current MER system. The raw reads invariably don’t end up on SRA (which is a shame), and again all that data goes with the student. What I would like to see mandated as a minimum for MER is:
    Any microsatellite note using NGS data has to bank all the msat containing reads in a MER supported database (dryad or Genbank or something else).
    Or preferably the output data from a program such as the QDD pipeline has to be supplied with the note as supporting information (probably in Dryad/tomato).
    This would help enormously with the cross amplification of markers across species, for example if I want to obtain NGS data for a species, and another species in the same genus has been done already (and all the relevant data is stored @ tomato/dryad), then I could check the markers against each other using standard bioinformatics and design primers that would work for both.

    • Tim Vines says:

      Hi James,

      Thanks for this comment too- we needed to take a few days to think about how best to respond. With respect to your idea that “any microsatellite note using NGS data has to bank all the msat containing reads in a MER supported database”, we do already require that the sequences of the successful loci are available in the paper, and we can certainly recommend that the microsatellite containing reads be archived somewhere. Dryad could be an ideal location for the latter, but one practical difficulty is that individual PGR notes are not ‘published’, and hence Dryad cannot create an entry for them. The ME Resources primer database is very much oriented towards archiving individual primer pairs for particular species rather than hosting large quantities of relatively unfiltered sequence data, but this is something we can explore.

      One alternative solution is to develop a lot more markers from each NGS run and submit these as a full Resource Article, as these can be given a Dryad patch.
