Best Practices for Scientific Computing…And Molecular Ecology?

Source: http://xkcd.com/292

Update

Best Practices for Scientific Computing has now been published in PLoS Biology!

Computers and computational techniques have significantly advanced the molecular ecologist’s toolbox for answering interesting and complex questions about a range of biological systems, model or otherwise. Imagine: not so long ago, getting access to a piece of software required sending a floppy disk to another researcher or developer, maybe with some return postage, and then waiting. I can barely wait for my browser to load at the Mountain Lake Biological Station, where I spend a good bit of time in the summer! As you can imagine, it’s not uncommon, then or now, for source code to be unavailable to other researchers. The causes are surely multifarious, and a number of the resulting problems are documented in an editorial by Zeeya Merali in Nature1. The article describes a pervasive problem: difficulties in that most important of scientific enterprises, replication of results. To say that these were simply “pilot” errors on the part of researchers is an oversimplification of the problem.

I first became aware of this particular piece while participating in a course at the University of Virginia called Introduction to Computation as a Research Tool. The course, and the University of Virginia Alliance for Computational Science and Engineering more generally, aims to provide an outstanding resource for the University’s researchers, most of whom, presumably, are not computer scientists (the course I’m referring to, while listed as part of the Computer Science Dept. offerings, is only open to non-computer science researchers). A major focus of the course is providing best practice guidelines for building, using, and sharing software resources across a range of research areas and computational frameworks. Barring signing up for University of Virginia courses, one might wonder how best to distribute lessons in best practices in scientific computing.

Aruliah et al.2 just released a pre-, or e-print on arXiv.org, titled, you guessed it, Best Practices for Scientific Computing! In it, Aruliah et al.2 describe a number of practices which, if adhered to, will likely alleviate a number of the issues described by Merali. I really like this piece for a number of reasons, not the least of which is the first point: Write programs for people, not computers. Being able to read through a piece of code and understand it, whether it comes from another researcher or, particularly, from oneself, is reassuring.
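To make that first point a bit more concrete, here is a minimal sketch in Python (my choice for illustration; nothing here comes from Aruliah et al.). The function name, variable names, and allele frequencies are all hypothetical. The idea is only that descriptive names and a one-line docstring let a reader follow the calculation without decoding it.

```python
def expected_heterozygosity(allele_frequencies):
    """Return expected heterozygosity: 1 minus the sum of squared allele frequencies."""
    sum_of_squares = sum(frequency ** 2 for frequency in allele_frequencies)
    return 1.0 - sum_of_squares


# Hypothetical allele frequencies at a single locus (made up for this example).
locus_frequencies = [0.5, 0.3, 0.2]
print(expected_heterozygosity(locus_frequencies))
```

Compare that with the same calculation written as `1 - sum(x**2 for x in f)` buried in the middle of a long script: it computes the same thing, but the reader has to work out what `f` holds and why.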

The second point suggests [Automation] of Repetitive Tasks. You may not always know outright the command, or sequence of commands, needed to manipulate a particular piece of data, but UNIX and R probably have it (I know my fellow contributors will have more to say about this in future posts). It’s in these particular repetitive tasks that a computer algorithm will quickly exceed human aptitude and patience. Would you rather use a seed counter or count them yourself? Which would you trust more?
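As a rough illustration of handing a repetitive task to the computer, here is a small Python sketch that counts sequence records across several FASTA-formatted samples. The sample names and sequences are invented and held in memory so the example runs on its own; in practice you would read them from files.

```python
def count_fasta_records(fasta_text):
    """Count sequence records by counting header lines, which start with '>'."""
    return sum(1 for line in fasta_text.splitlines() if line.startswith(">"))


# Hypothetical samples, kept inline so the sketch is self-contained.
samples = {
    "site_A": ">ind1\nACGT\n>ind2\nACGA\n",
    "site_B": ">ind1\nACGT\n",
}

# The same counting step is applied to every sample without retyping anything.
record_counts = {name: count_fasta_records(text) for name, text in samples.items()}

for name, count in sorted(record_counts.items()):
    print(name, count)
```

Two samples are easy to count by eye. Two hundred are not, and the loop does not get bored or lose its place.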

Keeping a consistent and detailed history of each step of an experiment or study is at the very basis of a good lab notebook. Computing shouldn’t be any different, and so Using the Computer to Record History is a natural step for any researcher. Better yet, recording such a history will prevent you from having to keep so much information in what can be an already packed memory (what was that deadline I had?).
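Here is one way the lab-notebook idea might look in practice: a short Python sketch that builds a record of when an analysis ran, with which interpreter, and with which settings. The parameter names and values are hypothetical and are not the options of any real program; the point is only that the computer, not your memory, keeps the record.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def build_run_record(parameters):
    """Collect when, how, and with what settings an analysis was run."""
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "python_version": platform.python_version(),
        "command_line": sys.argv,
        "parameters": parameters,
    }


# Hypothetical analysis settings, invented for this example.
run_parameters = {"num_populations": 3, "burn_in": 10000, "random_seed": 42}
run_record = build_run_record(run_parameters)

# Printed here; in a real project this would be saved next to the results.
print(json.dumps(run_record, indent=2))
```

Saving a record like this alongside every output file means the question "which settings produced this figure?" has an answer six months later.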

Making incremental [steps] as part of a research project is familiar to most molecular ecologists. Jumping straight from DNA extraction to fragment analysis without assessing DNA concentration/purity and marker verification will lead to a great deal of confusion when amplification occurs inconsistently or population genetic summaries fail to meet theoretical expectations. In fact, throw in some contamination concerns or potential for faulty reagents, and the matrix of tests becomes formidable. Verification of each step iteratively, and in many cases verification from a lab mate or collaborator, tends to contain these types of compounding concerns. When generating an analytical function, the research community has a plethora of outlets by which particular steps of an analysis might be verified (e.g. see R’s CRAN Task Views).
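The same step-by-step checking carries over to code. Below is a minimal Python sketch, using made-up diploid genotypes, in which each small function is verified against a result worked out by hand before the next one is built on top of it. The function names are hypothetical.

```python
from collections import Counter


def count_alleles(genotypes):
    """Step 1: tally alleles across diploid genotypes given as (allele, allele) pairs."""
    return Counter(allele for genotype in genotypes for allele in genotype)


def allele_frequencies(allele_counts):
    """Step 2: convert allele counts to frequencies."""
    total = sum(allele_counts.values())
    return {allele: count / total for allele, count in allele_counts.items()}


# Hypothetical genotypes, small enough to check by hand.
genotypes = [("A", "A"), ("A", "B"), ("B", "B"), ("A", "B")]

# Verify step 1 before relying on it in step 2.
counts = count_alleles(genotypes)
assert dict(counts) == {"A": 4, "B": 4}

# Verify step 2 against the hand calculation (4 of 8 alleles each).
frequencies = allele_frequencies(counts)
assert frequencies == {"A": 0.5, "B": 0.5}
print(frequencies)
```

If a later summary statistic looks wrong, the checks narrow down where to look, much as a gel or a concentration reading does at the bench.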

Directly related to the previous point, Use [of] Version Control is imperative, and a way in which one can reconstruct a developing idea. Individual iterations of a given project are tracked, and in so doing “versions” of the project are recorded. Simply put, rather than overwriting previous versions of a script, a temporal record of change is kept. This seems familiar, doesn’t it? Biologists are incorrigible hoarders. My research focuses on plant population genetics, and has been enriched directly and indirectly through the historical collections of herbaria. In these collections, one can see species delineations and descriptions in both stable and shifting states, describing the process by which we continue to understand relationships of extinct and extant species. Of the versioning systems described by Aruliah et al.2, I’m most excited about the use of GitHub (I’ll have more to say about this particular resource in future posts). Combining the standards of the open-source git versioning system and a very intuitive web interface, GitHub allows for powerful versioning of both individual and collaborative projects.

The next point covered by Aruliah et al.2 is particularly relevant to the R statistical software community: Don’t repeat yourself (or others). Building off existing functions can save time and confusion. Like incremental steps, building off of vetted functions will allow for a considerable reduction in potential for errors.

Plan for Mistakes: Defensive programming suggests particular concerns, but one that came to mind for me was software updates. Imagine your program in an ever evolving context of other developing programs and languages. This particular point seems difficult to generalize. Programming for longevity seems language specific, and languages such as C/C++ seem most robust to me at this point. What do you guys think?
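Software updates aside, the most common form of defensive programming is simply checking inputs before using them. Here is a minimal Python sketch of that idea; the function name, the tolerance, and the example values are my own assumptions for illustration.

```python
import math


def validate_allele_frequencies(frequencies, tolerance=1e-6):
    """Raise ValueError unless the frequencies look like a valid distribution."""
    if not frequencies:
        raise ValueError("no allele frequencies supplied")
    for frequency in frequencies:
        # This comparison also rejects NaN, which fails every range test.
        if not 0.0 <= frequency <= 1.0:
            raise ValueError(f"frequency out of range: {frequency}")
    total = sum(frequencies)
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"frequencies sum to {total}, expected 1.0")
    return list(frequencies)


# A valid input passes through unchanged.
print(validate_allele_frequencies([0.5, 0.3, 0.2]))

# An invalid input fails loudly instead of quietly producing a wrong answer.
try:
    validate_allele_frequencies([0.7, 0.7])
except ValueError as error:
    print("rejected:", error)
```

A check like this costs a few lines and turns a silent, wrong result into an error message that points at the problem.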

The eighth (of 10; we’re almost there!) point concerns optimization. I mentioned above my impatience with computational time. I’m always looking for software and hardware to decrease the wait time for a given computation. [Optimizing] Software Only After It Works Correctly seems obvious from the outset, but it’s easy to get into a rush. Of particular interest and utility is the use of multi-core processors and CPU-GPU computing. Both offer advances in computational efficiency, assuming inference integrity is maintained.
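One way to keep inference integrity while chasing speed is to hold on to the slow, obviously correct version and check any rewrite against it. The Python sketch below does that for counting differences between two aligned sequences of equal length. Both function names and the test sequences are hypothetical, and whether the second version is actually faster on your data is something to measure, not assume.

```python
def pairwise_differences_reference(seq_a, seq_b):
    """Straightforward version: compare equal-length sequences position by position."""
    differences = 0
    for position in range(len(seq_a)):
        if seq_a[position] != seq_b[position]:
            differences += 1
    return differences


def pairwise_differences_fast(seq_a, seq_b):
    """Candidate rewrite: sum the mismatches over paired bases."""
    return sum(base_a != base_b for base_a, base_b in zip(seq_a, seq_b))


# Made-up equal-length sequence pairs, including the empty edge case.
test_pairs = [("ACGT", "ACGT"), ("ACGT", "ACGA"), ("AAAA", "TTTT"), ("", "")]

# The rewrite is only trusted once it agrees with the reference everywhere.
for seq_a, seq_b in test_pairs:
    expected = pairwise_differences_reference(seq_a, seq_b)
    assert pairwise_differences_fast(seq_a, seq_b) == expected

print("fast version agrees with the reference on all test pairs")
```

The same pattern applies when moving a computation to multiple cores or a GPU: the original result is the standard the faster one has to match.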

Document the Design and Purpose of Code…and Conduct Code Reviews are the last two points and can be combined quite naturally. Doing a good job at the one will certainly aid in the other. Early on in an experiment, it’s easy to imagine you’re simply recording the obvious. As someone who has attempted to reconstruct a former lab member’s work, I can say that simply isn’t the case. Having constant feedback while developing a protocol, from reviewers of multiple backgrounds, allows for the clearest explanation.

The above points shouldn’t necessarily be taken as a comprehensive list, but to my mind seem a natural protocol to be followed by molecular ecologists developing their own computational methods. As Tim mentioned earlier regarding the archiving of datasets, and Gilbert et al. 2012 3 showed for the use of the program STRUCTURE, best practices are important for the integrity of molecular biology research.

Feel free to comment about the above points. I think the more discussion we have about particular workflows, the easier they will be to implement. Also check out the discussion of Aruliah et al.2 at Software Carpentry (I found this particular link at http://haldanessieve.org/, a great resource for e-prints focusing on a broad range of population genetics research).

1 Merali, Z. (2010, October 14). Computational science: …Error. Nature, pp. 775–777. doi:10.1038/467775a

2 Aruliah, D. A., Brown, C. T., Hong, N. P. C., Davis, M., Guy, R. T., Haddock, S. H. D., et al. (2012, September 30). Best Practices for Scientific Computing. arXiv.org.

3 Gilbert, K. J., Andrew, R. L., Bock, D. G., Franklin, M. T., Moore, B., Kane, N. C., Rennison, D. J., et al. 2012. Recommendations for utilizing and reporting population genetic analyses: the reproducibility of genetic clustering using the program STRUCTURE. doi: 10.1111/j.1365-294X.2012.05754.x
