Data Archiving: the nitty-gritty

One question we get asked a lot is ‘which data should I make available?’, so I thought it would be helpful to put together a checklist for the broad types of data we receive at Mol Ecol.

The theme throughout is that you generally want to be preparing a tab delimited text file with one row per individual/sample, and then giving all the corresponding data as columns (where it came from, any phenotypic data, the genotype data). The other data that went into the analyses that can’t be given on an individual by individual basis (e.g. environmental data for each site) can be provided in separate files.
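As a rough illustration of that layout, here is a minimal Python sketch that writes one row per individual to a tab-delimited file. The column names, the "NA" missing-data code and the example values are all invented for illustration and are not a required format.

```python
import csv
import io

# Hypothetical column names: one row per individual, everything else as columns.
FIELDS = ["indiv_id", "site", "latitude", "longitude",
          "body_length_mm", "locus1", "locus2"]


def write_individual_table(rows, handle):
    """Write one tab-delimited row per individual, using NA for missing values."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS, delimiter="\t",
                            restval="NA", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)


# Invented example records; the second individual has no phenotype measurement.
rows = [
    {"indiv_id": "A01", "site": "Lake1", "latitude": 49.2606,
     "longitude": -123.246, "body_length_mm": 41.5,
     "locus1": "152/156", "locus2": "201/201"},
    {"indiv_id": "A02", "site": "Lake1", "latitude": 49.2606,
     "longitude": -123.246, "locus1": "152/152", "locus2": "201/205"},
]

buffer = io.StringIO()
write_individual_table(rows, buffer)
table_text = buffer.getvalue()
```

To write to disk instead, pass a file opened with `open("individuals.txt", "w", newline="")` in place of the `StringIO` buffer (the file name is again just an example).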

NB When providing information on where samples were collected, please be as precise as possible. The coordinates as obtained from a GPS unit would be ideal.

For taxa where locations cannot be disclosed for conservation reasons, we encourage authors to transform their location data such that the original locations cannot be calculated but the spatial arrangement of the sampling sites is still the same. For example, one could give the coordinates relative to a randomly chosen origin, and then rotate them by an undisclosed number of degrees.
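One way this transformation might be done is sketched below in Python. It assumes the coordinates have already been projected into metres (e.g. UTM eastings and northings), since rotating raw latitude/longitude values would distort distances; the function name and the example sites are hypothetical.

```python
import math
import random


def obscure_sites(coords, rng=None):
    """Re-express projected (x, y) site coordinates relative to a random
    origin, then rotate them by a random angle that is never reported."""
    rng = rng or random.Random()
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    # Randomly chosen origin (here drawn from within the sampled area).
    origin_x = rng.uniform(min(xs), max(xs))
    origin_y = rng.uniform(min(ys), max(ys))
    # Undisclosed rotation angle, in radians.
    angle = rng.uniform(0.0, 2.0 * math.pi)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    transformed = []
    for x, y in coords:
        dx, dy = x - origin_x, y - origin_y
        transformed.append((dx * cos_a - dy * sin_a,
                            dx * sin_a + dy * cos_a))
    return transformed


# Invented sites forming a 300 m x 400 m right-angled triangle.
sites = [(500000.0, 4649776.0), (500300.0, 4649776.0), (500000.0, 4650176.0)]
hidden = obscure_sites(sites, random.Random(1))
```

Because the transformation is only a shift and a rotation, the distances between sites are unchanged, so analyses that depend on spatial arrangement (e.g. isolation by distance) should still be repeatable from the archived coordinates.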

Next, we really encourage people to upload all the input files used in the data analysis: the final data set can be many steps removed from the raw data (especially DNA sequence data), and providing these files makes your paper that much more useful. (This in turn makes it more likely that other people will ask to include it in their research, which equals more citations and more publications for you.) In a similar vein, including all your analysis scripts is also encouraged, as these allow others to see exactly what you’ve done without you having to spend a huge amount of time explaining it in words in the methods; the same applies to parameter input files for e.g. Structure and IMa2.

Lastly, a readme file is a really useful thing to include with your archived data- this can explain things like the meaning of column headers in the data file, units, precise localities, indicators for missing data, codes for categorical variables etc (see Whitlock 2010 for more on this). You don’t need to specify the location of the readme file in your data accessibility paragraph (it should be in the same place as the data!), but I’ve included it in the lists below as a reminder.
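For anyone who prefers to generate the readme alongside the data file, a minimal Python sketch follows. The file name, column names, descriptions and missing-data code are all hypothetical; the point is only that every column header, unit and code gets a line of explanation.

```python
def build_readme(data_file, column_notes, missing_code="NA"):
    """Return readme text explaining each column of an archived data file."""
    lines = [
        f"Readme for {data_file}",
        "",
        f"Missing data are coded as: {missing_code}",
        "",
        "Columns:",
    ]
    for column, note in column_notes.items():
        lines.append(f"  {column}: {note}")
    return "\n".join(lines) + "\n"


# Hypothetical column descriptions for an individual-level data file.
column_notes = {
    "indiv_id": "unique identifier for each individual",
    "site": "name of the sampling site",
    "latitude": "decimal degrees, as recorded by the GPS unit",
    "longitude": "decimal degrees, as recorded by the GPS unit",
    "sex": "categorical: M = male, F = female, U = unknown",
    "body_length_mm": "body length in millimetres",
}

readme_text = build_readme("individuals.txt", column_notes)
```

The resulting text can then be saved as readme.txt in the same place as the data file.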

I’ve organized the lists below in terms of the different data types, with the suggested contents of the files themselves in square brackets. Each entry of the list should appear as an element of the Data Accessibility paragraph (with the exception of the readme file).

a) DNA sequence data:

  • GenBank/EMBL/DDBJ accession numbers for all unique DNA sequences
  • a file with data on each individual [indiv ID, sampling location, accession number for sequence at gene 1, gene 2 etc]
  • readme.txt file explaining the contents of the above file

OR

  • a POPSET accession number from NCBI- this should have one accession number for each individual in the sample, and the accession should explicitly state which individual and population of origin the sequence came from.

optional:

  • TreeBASE study number
  • DNA sequence alignments (although these are really useful)
  • Analysis input files for programs like IMa2, as there are a lot of steps between raw sequence and phased and aligned haplotypes
  • Any scripts used to analyse the data (please check the copyright status before making these available)

b) microsatellite data:

  • a file with data on each individual [indiv ID, sampling location, genotype at locus 1, genotype at locus 2 etc]
  • readme.txt file explaining the contents of the above file
  • Analysis input files for e.g. Structure (including parameter files)
  • Any scripts used to analyse the data (please check the copyright status before making these available)

optional:

  • Trace files and an accompanying readme explaining the scoring decisions (these would need to go onto Dryad)

c) SNP data:

  • a file with data on each individual [indiv ID, sampling location, genotype at locus 1, genotype at locus 2 etc]
  • readme.txt file explaining the contents of the above file
  • TreeBASE study number, if applicable (this is useful but not essential).
  • Analysis input files for e.g. Structure (including parameter files)
  • Any scripts used to analyse the data (please check the copyright status before making these available)

d) microarray data:

  • all the information required under the MIAME 2.0 protocols should be available on a public archive.
  • Analysis input files
  • Any scripts used to analyse the data (please check the copyright status before making these available)

e) Next Generation Sequencing data:

The scale and complexity of NGS data means that extra care is needed when archiving it for future generations of researchers. In particular, the steps taken to convert the raw reads into the final dataset may not be fully reproducible even with exactly the same methods, and hence it makes sense to archive the data at several stages of its analysis. This area is worthy of an entire post of its own, but I’ve tried to summarise the main types of data worth archiving below (thanks to Nolan Kane and Sebastien Renaut for advice on this).

  • raw read data from next generation sequencing are an important resource, but it’s hard to tell whether public archives will continue to accept this type of data. NCBI’s Short Read Archive has said that it will stop accepting new data at the start of September 2011, whereas the ENA’s Sequence Read Archive currently says that it will continue taking NGS data (see here). If there is no publicly-funded place to store raw read data, they can be kept on institutional or personal servers; in this case archiving is desirable but not essential under the policy.

  • the sequence alignment (e.g. .sam/.bam/.ace file) should be publicly archived whenever possible, although this file can be very large (e.g. if the reads are aligned to an existing genome). The reference genome/transcriptome/gene of interest will also need to be available.

  • at the very least, the final data file that the analyses were based on should be publicly available (e.g. SNP calls, indel calls, expression values for each sample).

  • the scripts used to generate the final dataset from the raw reads (very useful but not absolutely necessary)

f) other data types:

  • information about sampling locations and their related variables (e.g. site level environmental data) should be made available in a separate file

  • Wei-Ning Bai

    I have a problem: when I submit data in datadryad, in the first step, when I select Molecular Ecology and enter Manuscript Number MEC-11-1207, it always says “invalid Manuscript Number”. What am I doing wrong?

    • Tim Vines

      Hi Wei-Ning,

      Dryad only accepts data from accepted papers, and hence they need us (the journal) to tell them when a paper has been accepted. I’ll send you an email with a link to a Dryad upload patch.

      Tim

      • Nancy

        Hi Tim Vines, I have the same problem, what can I do?

        • Tim Vines

          Send me an email with your manuscript number (to managing.editor@molecol.com) and we’ll send you a Dryad access link. This only works once your paper has been submitted to Molecular Ecology.

  • James Hereward

    I think that data sharing is a good idea in general, and GenBank is obviously an exceptional example of the benefits of such an approach, but having dealt with a lot of microsatellite datasets over the last few years I’m not sure that genotypes alone are sufficient. In my experience most of the errors that creep into microsatellite datasets come from mis-scoring, whereas with sequence data the scoring of bases is these days well automated and accurate. I think I would be much happier re-using someone else’s data if I could access the trace files, and an accompanying document justifying why one allele was called and not another. This may sound like an overly technical comment, but I’ve seen first hand the effects of calling errors on results, and if the data is to be held for posterity, I think that it would be highly beneficial if the raw data that inevitably gets lost when a student graduates were also held in Dryad.

    • Tim Vines

      Hi James,

      Thanks for this comment- I agree that the scoring of microsats can be contentious. I’ve added it to the list of topics to be discussed at this year’s editorial board meeting, and for the moment I’ll add it as an optional file to the above.

      Tim

  • James Hereward

    The new (2008) primer note summary articles may have been a bit ahead of their time, considering the variety of methods deployed then and the amount of work that was still going into developing microsatellites. Now it seems that the format and the tomato database are somewhat outdated. Seeing as everyone is doing NGS for marker discovery these days, and generating primer pairs for 100s-1000s of putative markers in a short run of a computer program and a few thousand $$$ in sequencing it seems to me that there is a need to archive the results of these analyses. Now that the SRA http://www.ncbi.nlm.nih.gov/sra seems to be back “on” perhaps people could just archive all their raw reads. However, most students that are generating these markers are only really interested in the set that they end up using, and this is all that is provided for in the current MER system. The raw reads invariably don’t end up on SRA (which is a shame), and again all that data goes with the student. What I would like to see mandated as a minimum for MER is:
    Any microsatellite note using NGS data has to bank all the msat containing reads in a MER supported database (dryad or Genbank or something else).
    Or preferably the output data from a program such as the QDD pipeline has to be supplied with the note as supporting information (probably in Dryad/tomato).
    This would help enormously with the cross amplification of markers across species, for example if I want to obtain NGS data for a species, and another species in the same genus has been done already (and all the relevant data is stored @ tomato/dryad), then I could check the markers against each other using standard bioinformatics and design primers that would work for both.

    • Tim Vines

      Hi James,

      Thanks for this comment too- we needed to take a few days to think about how best to respond. With respect to your idea that “any microsatellite note using NGS data has to bank all the msat containing reads in a MER supported database”, we do already require that the sequences of the successful loci are available in the paper, and we can certainly recommend that the microsatellite containing reads be archived somewhere. Dryad could be an ideal location for the latter, but one practical difficulty is that individual PGR notes are not ‘published’, and hence Dryad cannot create an entry for them. The ME Resources primer database is very much oriented towards archiving individual primer pairs for particular species rather than hosting large quantities of relatively unfiltered sequence data, but this is something we can explore.

      One alternative solution is to develop a lot more markers from each NGS run and submit these as a full Resource Article, as these can be given a Dryad patch.

  • Armando Sunny

    I have the same problem as Wei-Ning Bai. Can you also send me an email with the link to a Dryad upload patch?

    Thank you very much.

    Sincerely

    Sunny

  • Tim Vines

    Thanks. My reply is more or less the same as well: if your paper gets a positive decision (reconsider after revision, accept minor revisions or accept), and your Data Accessibility statement says that you want to use Dryad, we’ll send you an access link.

    For initial submissions, we just ask for a draft Data Accessibility statement- this doesn’t need to have the actual DOIs or GenBank accession numbers in it. For example, the DA section for your paper could read:

    Sampling data: Table S1
    Microsatellite data: Dryad entry XXX

    and then you’d add the Dryad DOI once it got a positive decision and we sent you the link.

  • Cynthia Riginos

    What about encouraging georeferencing? Often source locations of genotypes/alleles/haplotypes are unclear or imprecise.

    • Tim Vines

      Hi Cynthia,

      Thanks for bringing this up- I’ve added something to the guidelines above.

      Tim

  • ramesh krishnan

    Data archiving is a good idea. Is it necessary to give GenBank accession numbers for all microsatellite sequences? Sometimes we get sequences from our partner institutes, and in those cases we may not have the control to ask them to submit those sequences to GenBank. But we can provide all the details regarding primer sequences, annealing temperature and all the information necessary to use those primers.
