Data Archiving in ME Resources

One question we get asked a lot is ‘which data should I make available?’, so here is a checklist for the different types of paper we get at Molecular Ecology Resources. The theme throughout is to make available all the data needed to evaluate whatever is presented in the paper.

Methodological and Statistical Advances

Lab or Field Methods

a) DNA sequences

All new sequences should be on GenBank (or a similar archive) and the accession numbers given in the paper. The sequence alignments should also be available, with the sample/individual ID given for each sequence matching the ID used for that sample in the other tables. There should also be a table giving the sampling details for each individual, and preferably the GenBank numbers for their sequences.
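One way to check that the IDs in an alignment match those in the sampling table is sketched below. This is only an illustration: it assumes a FASTA-format alignment and a CSV sampling table with a column called sample_id, and the function names, column name and toy IDs are all hypothetical.

```python
import csv
import io


def fasta_ids(handle):
    """Return the set of sequence IDs (first word after '>') in a FASTA file."""
    return {
        line[1:].split()[0]
        for line in handle
        if line.startswith(">") and len(line.strip()) > 1
    }


def table_ids(handle, column="sample_id"):
    """Return the set of IDs in one column of a CSV sampling table."""
    return {row[column].strip() for row in csv.DictReader(handle)}


def compare_ids(alignment_handle, table_handle, column="sample_id"):
    """Return (IDs only in the alignment, IDs only in the table), both sorted."""
    in_alignment = fasta_ids(alignment_handle)
    in_table = table_ids(table_handle, column)
    return sorted(in_alignment - in_table), sorted(in_table - in_alignment)


# Toy data standing in for real files (all IDs are made up for illustration).
alignment = io.StringIO(">IND01\nACGT\n>IND02\nACGA\n>IND04\nACGG\n")
table = io.StringIO("sample_id,site\nIND01,SiteA\nIND02,SiteA\nIND03,SiteB\n")

missing_from_table, missing_from_alignment = compare_ids(alignment, table)
print("In alignment but not in table:", missing_from_table)
print("In table but not in alignment:", missing_from_alignment)
```

With real files, the two io.StringIO objects would be replaced by open file handles. Any ID reported in either list points to a mismatch worth fixing before the files are archived.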

b) RT-qPCR data

Given the large capacity (~10 GB) of archives like Dryad, you might want to archive the raw curves. More importantly, the individual expression data for each gene (including any pertinent reference/housekeeping genes) should be made available. Data for standards and r² extrapolation should also be included.

c) PCR-based assays

These are experiments where fragment lengths (e.g., microsatellites), allele-specific fluorescence (e.g., SNP assays), or band presence/absence are the data of interest. A spreadsheet giving the results for every individual or sample analyzed should be made available (no matter how big or small the file is). Summary tables are fine for the main paper, but the individual genotypes need to be made available, with each allele clearly identified.
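As an illustration of what "each allele clearly identified" could look like, the sketch below writes one row per individual and one column per allele for diploid data. The layout, the function name, the locus names, the allele sizes and the NA code for missing data are all assumptions made for the example, not a required format.

```python
import csv
import io

# Hypothetical diploid microsatellite calls: sample -> locus -> (allele 1, allele 2).
# None marks a failed amplification.
genotypes = {
    "IND01": {"LocA": (152, 156), "LocB": (201, 201)},
    "IND02": {"LocA": (152, 152), "LocB": (None, None)},
}


def write_genotype_table(genotypes, handle, missing="NA"):
    """Write one row per individual with two explicitly named columns per locus."""
    loci = sorted({locus for calls in genotypes.values() for locus in calls})
    header = ["sample_id"] + [f"{locus}_allele{i}" for locus in loci for i in (1, 2)]
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(header)
    for sample_id in sorted(genotypes):
        row = [sample_id]
        for locus in loci:
            alleles = genotypes[sample_id].get(locus, (None, None))
            row.extend(missing if allele is None else allele for allele in alleles)
        writer.writerow(row)


buffer = io.StringIO()
write_genotype_table(genotypes, buffer)
table_text = buffer.getvalue()
print(table_text)
```

A plain CSV like this can be opened in any spreadsheet program and read by most population genetics tools after minor conversion. Whatever missing-data code is used should be stated in the accompanying README.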

Statistical Advances

a) Scripts and code

If you’re presenting a new analytical method, any code or scripts you use need to be publicly available. GitHub or a similar site would be fine; Dryad can be used if the CC0 license is appropriate.

b) Simulations and simulated datasets

The code used to generate the simulations must be available, as then readers can see exactly how the data were generated. Storing the simulated data themselves is optional, but authors do sometimes put these on Dryad.
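Archived simulation code is most useful when a reader can regenerate exactly the same data, which usually means recording the random seed alongside the other parameters. The sketch below illustrates the idea with a toy simulation of genetic drift; the model, function name and parameter values are purely illustrative and are not taken from any particular paper.

```python
import random


def simulate_drift(n_individuals, start_freq, generations, seed):
    """Toy Wright-Fisher drift: allele frequency trajectory in a diploid population."""
    rng = random.Random(seed)  # private generator, so the seed fully determines the run
    n_copies = 2 * n_individuals
    freq = start_freq
    trajectory = [freq]
    for _ in range(generations):
        # Binomial sampling of gene copies for the next generation.
        count = sum(1 for _ in range(n_copies) if rng.random() < freq)
        freq = count / n_copies
        trajectory.append(freq)
    return trajectory


# Keeping every setting (including the seed) in one place makes it easy to archive.
params = {"n_individuals": 50, "start_freq": 0.5, "generations": 20, "seed": 2024}

run_a = simulate_drift(**params)
run_b = simulate_drift(**params)
print("Runs identical:", run_a == run_b)
```

Because the same seed gives the same trajectory, archiving the script and its parameter values is enough for readers to regenerate the simulated data themselves.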

c) Example datasets

If you illustrate the method with example data, the data must be made publicly available. For methods implemented in R, an R data object (e.g. an .RData file) is a good option for this.

Permanent Genetic Resources

Large-scale sequencing resources

The key here is that readers should be able to fully evaluate whatever resource you are presenting. For example, a study describing validated SNPs should provide all the data for the resource itself, as well as the genotypes of the individuals that the loci were tested in. If the loci have been used to genotype a large sample for a different project, you don’t need to provide the entire dataset, just enough that reasonable tests for Hardy-Weinberg equilibrium (HWE) etc. can be performed.
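To show the kind of check a reader might run on archived genotypes, here is a minimal sketch of a chi-square test for HWE at a single biallelic locus. It assumes genotype counts are already tallied and uses the chi-square approximation with one degree of freedom; exact tests are generally preferable for small samples or rare alleles. The function name and the example counts are hypothetical.

```python
import math


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one biallelic locus.

    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    if n == 0:
        raise ValueError("no genotypes supplied")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1.0 - p
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Upper tail of a chi-square distribution with 1 df equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Made-up counts: the first set matches HWE exactly, the second lacks heterozygotes.
chi2_fit, p_fit = hwe_chi_square(25, 50, 25)
chi2_dev, p_dev = hwe_chi_square(40, 20, 40)
print(chi2_fit, p_fit)
print(chi2_dev, p_dev)
```

A test like this needs the individual genotypes (or at least per-locus genotype counts), which is why summary statistics alone are not enough to evaluate the loci.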

a) Raw next-generation sequence data

When possible, the raw NGS data should be archived on NCBI’s Sequence Read Archive or a similar public database (e.g. the European Nucleotide Archive (ENA) at EMBL-EBI). Dryad can also be used to store raw sequence data, but there are additional charges for entries above 10 GB.

b) Scripts and pipelines

The process used to convert the raw sequence data into the published resource should be completely transparent. This means that all custom scripts and code should be made available, along with the parameters and settings used for the standard pipeline programs.
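One lightweight way to record the parameters and settings is a small machine-readable file archived with the scripts. The sketch below writes such a record as JSON; the program names, version strings and parameter names are placeholders invented for the example and do not refer to real software.

```python
import json

# Hypothetical record of each pipeline step: which program, which version, which settings.
pipeline_record = [
    {
        "step": 1,
        "program": "example_trimmer",  # placeholder name, not a real tool
        "version": "0.0.0",
        "parameters": {"min_quality": 20, "min_length": 50},
    },
    {
        "step": 2,
        "program": "example_snp_caller",  # placeholder name, not a real tool
        "version": "0.0.0",
        "parameters": {"min_depth": 10, "min_allele_freq": 0.05},
    },
]


def dump_record(record):
    """Serialise the pipeline record as indented JSON with stable key order."""
    return json.dumps(record, indent=2, sort_keys=True)


record_text = dump_record(pipeline_record)
restored = json.loads(record_text)  # round trip confirms nothing was lost
print(record_text)
```

A plain-text file like this can sit next to the custom scripts on GitHub or Dryad, so readers can see which settings were used at each step without searching the Methods section.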

c) Final resource files

Since these are the subject of the paper, the final resources should clearly be freely available. This includes the feature list of microarray chips, the primers, probes and flanking sequences of SNPs (with GenBank accessions), and a full description of microsatellites. As mentioned above, any datasets produced when testing these loci should also be available; these include genotypes from test populations or RT-qPCR data.

DNA barcoding and molecular diagnostics

When presenting barcoding data as a resource, it is essential that readers be able to link the taxonomic identification to the DNA sequences and to a reference specimen. This is most easily achieved with a table listing the specimens, the sampling details, the museum accession number for the specimen itself, the GenBank numbers for all of its DNA sequences and (if applicable) the BOLD ID. The sequence alignments should also be made available. If the paper contains phylogenetic trees, one good solution is to add the trees and the alignments to TreeBASE.

Computer Programs

As with the ‘Statistical Advances’ papers above, these should give a link to the program itself (which must be freely available), and to any other code or empirical data used in the paper. Any link that is provided should be stable and expected to persist for the long term so that readers are not led to closed sites. GitHub is a potential long-term solution for providing downloadable programs.

a) Scripts and code

If you’re presenting a new analytical method, any code or scripts you use need to be publicly available. GitHub or a similar site would be fine; Dryad can be used if the CC0 license is appropriate.

b) Simulations and simulated datasets

The code used to generate the simulations must be available, as then readers can see exactly how the data were generated. Storing the simulated data themselves is optional, but authors do sometimes put these on Dryad.

c) Example datasets

If you illustrate the method with example data, the data must be made publicly available. For methods implemented in R, an R data object (e.g. an .RData file) is a good option for this.