A tale of two Dryad submissions

As it happens, the last two scientific papers I’ve had accepted for publication are also the first two for which my first-authorial duties included substantial journal-mandated archiving of supporting data (beyond uploading a handful of DNA sequences to GenBank). The two journals publishing these papers each require authors to upload supporting data to a public repository, and both strongly suggested that the repository be Dryad.

And that’s about where the similarities end. The differences, I think, suggest that there’s still some work to be done before journals, authors, and public data archives settle on a set of standard procedures that will make the data collected in publicly funded scientific research easily available with a minimum of fuss.

The first paper went to Systematic Biology. It’s a phylogenetic analysis using data collected by the Medicago HapMap Project. (I described the results briefly over at Nothing in Biology Makes Sense! last week.) Like most genome projects, the MHP has its own infrastructure for making its data available, and I’d cited that website in the MS and figured we’d done our duty. But the afternoon after I submitted the manuscript, I received an e-mail from the editorial office reminding me that (1) Systematic Biology expects authors to upload supplementary figures (the MS had a couple) to Dryad and (2) we were also expected to make supporting data available to reviewers at the time of submission … so why not put the data in the same Dryad package?


Medicago truncatula, the focus of the Medicago HapMap Project.

Well, I was embarrassed at having misunderstood those requirements, and the fact of the matter was that most of the phylogenetic analysis was based on files formatted differently (filtered, collated) than what the MHP would put online. So I sent the files to Dryad. And all was well, until the process of peer review necessitated revisions. Revisions meant changes to those supplementary figures. Because the manuscript was in review, Dryad had the data package under embargo: I could upload new versions of the images, but I couldn’t delete the versions they were supposed to replace. Over two thorough and constructive rounds of review, three versions of some supplemental figures piled up at Dryad.

The folks at Dryad cleaned everything up once the manuscript was accepted, but this offended my sense of tidiness. Asking reviewers to wade through a pile of past versions seems distinctly unhelpful. What if I’d had to change the supporting data files as well? Speaking of those files: in the course of a review process marked by some outstandingly thorough, helpful input from reviewers (dozens of pages of it), I’m pretty sure none of the reviewers did anything with the supporting data files. One of the key issues in the review was which specific kind of analysis was most appropriate for a genome-wide SNP data set. But actually replicating much of the analysis would’ve taken days, and I don’t think the reviewers needed to do that to evaluate the manuscript.

Contrast this with the other paper, which is in press at The Journal of Evolutionary Biology. It uses morphological and microsatellite data from populations of Joshua trees and their pollinators scattered across the Mojave Desert to determine whether the pollinators’ preferences shape gene flow between Joshua tree populations. While the review process for this manuscript also centered on which analysis would best apply our data to that question, the reviewers never asked to examine the data directly.


Joshua trees.

It wasn’t until the paper had been accepted for publication that JEB sent me an e-mail asking if I wanted to archive the supporting data at Dryad. I said I did, and the editorial office sent me a link to the upload page. It was pretty obvious what I needed to archive: after a little sorting, I uploaded files containing microsatellite genotypes for the trees and their pollinators, more files containing measurements of some key traits for each species, and one last file containing a table of latitude and longitude coordinates for all the collection sites. Everything reflects the data supporting the manuscript as it will be published, and it’s ready for other folks to dig in, if they’re interested, as soon as the paper is released online.

I’m in favor of public data archiving on principle (I’ve made use of open data, and of course there’s empirical support for its benefits), and I’m happy to put my work-time where my mouth is. But journal policies differ in how easy they make archiving, and in how well they maximize its benefits before and after publication. So, as an author, which of these approaches did I prefer?

On a purely emotional level, I much preferred being asked to archive my data after the paper was accepted for publication than before review had even begun. At Systematic Biology, archiving is another step added to the already tedious and nerve-wracking process of formatting and submitting a manuscript that might or might not be accepted and might or might not undergo substantial changes in the course of peer review. (Please note that I find the submission process for every journal tedious and nerve-wracking!) It also strikes me as problematic to ask authors to post data that will be archived elsewhere, even if it’s in a somewhat different form; proliferating versions of the same dataset can only confuse people looking for data if they don’t start by following links from the associated journal articles.

And why on Earth is it necessary to freeze files once they’re uploaded to a data package that’s under review?

On the other hand, after review, I was in a good mood and ready to do whatever I was asked, if it’d bring me closer to a final, typeset publication. It didn’t matter that I couldn’t change files once I’d uploaded them, because I could upload exactly the versions associated with the final, accepted paper. It’s all much tidier.

I can see the argument for archiving before review: reviewers should have access to the data supporting the manuscripts they’re evaluating. But I wonder how often reviewers make use of such access. I try to be a careful and thorough reviewer myself, when called upon, but I doubt I’ll ever take the time to completely re-run an analysis just to check the results reported in a manuscript. If I did, it’d probably be because I already have serious questions about the manuscript. In which case, how much difference could re-analysis make to my recommendation?

I think I’m probably safe in assuming that most of our readers agree with me that data archiving is a good and worthy thing. But given that we agree on this point, how should we then make it happen?


About Jeremy Yoder

Jeremy Yoder is a postdoctoral associate in the Department of Plant Biology at the University of Minnesota. He also blogs at Denim and Tweed and Nothing in Biology Makes Sense!, and tweets under the handle @jbyoder.
This entry was posted in data archiving and peer review.
  • http://www.facebook.com/tim.vines.71 Tim Vines

    I have to agree with Syst Bio in asking for the specific data for your paper. The Medicago site is going to be continuously updated in the coming years, so finding the exact data that were used in your paper is going to get harder and harder. For readers trying to understand your results, there’s no replacement for being able to access the actual data, rather than the broader resource from which it was derived.

    Letting authors delete files for ‘in review’ Dryad entries would be great, although this shouldn’t be possible for public entries (!).

    • http://www.denimandtweed.com/ Jeremy Yoder

      I think I do agree with the requirement to post the dataset as used for the paper—but still I think it’d be ideal to have it at the MHP website, where it can be linked to the larger datasets from which it’s derived. Or, if I wanted to do even more work, I suppose the true ideal would be to post not the derived data, but the pipeline that produced it from the original MHP data …

      And, agreed re: public entries! Once the paper’s out in print, the supplemental data should be as settled as the published article. (Which is to say, changes mean a retraction or a correction or something at that scale.)

      • http://www.facebook.com/tim.vines.71 Tim Vines

        Having the actual dataset on the MHP would be fine too, as long as you could provide a stable url in the paper itself.

        • http://www.facebook.com/tim.vines.71 Tim Vines

          PS not sure our FASEB paper is the best reference for the benefits of archiving data – Heather Piwowar has a much better list in her ‘Data citation’ group on Mendeley (http://www.mendeley.com/groups/544621/data-citation/)

  • http://twitter.com/boopsboops Rupert A. Collins

    I’m not sure why journals can’t have a two-step process for data access/archiving: on submission you MUST make your data available, but it doesn’t matter where (journal website, department website, dropbox, megaupload, whatever). Then on acceptance, you transfer the final files to a repository such as dryad.

    • http://www.facebook.com/tim.vines.71 Tim Vines

      I think the issue with a two step process is that it’s normally going to be more trouble than a one step process. Why not just upload it as a review entry to Dryad when you initially submit, edit it as need be, and then make it public?

  • Nicholas Crawford

    I imagine that at some point Dryad and others will simply make submissions version-controlled repositories (e.g., like GitHub). When you submit, you’d just tag a commit appropriately. A place to start could be by modifying Gitorious (http://gitorious.org/). I believe that Brant and Tim and I discussed this a few years back.
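    A minimal sketch of that tag-a-commit workflow, assuming plain git and hypothetical file names (neither Dryad nor Gitorious works this way today):

    ```shell
    set -e
    # Hypothetical: a dataset kept under version control, with a tag
    # marking the exact snapshot associated with each stage of the paper.
    repo=$(mktemp -d)
    cd "$repo"
    git init --quiet
    git config user.name "Author"
    git config user.email "author@example.com"

    echo "site,locus,allele" > genotypes.csv    # stand-in data file
    git add genotypes.csv
    git commit --quiet -m "Data as submitted for review"
    git tag submitted                            # the version reviewers saw

    echo "TV1,locusA,212" >> genotypes.csv       # revisions during review
    git commit --quiet -am "Data as accepted for publication"
    git tag accepted                             # the version behind the paper

    git tag                                      # prints: accepted, submitted
    ```

    An archive built this way could serve any tagged snapshot, so the "in review" and "accepted" versions would coexist without the embargoed-deletion problem described in the post.
    
    
    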

    • http://www.facebook.com/tim.vines.71 Tim Vines

      I wince whenever anyone discusses gits because it’s a very vivid pejorative in the UK. This does sound like an idea worth exploring though.