Don’t trust your data: reviewing Bioinformatics Data Skills


Image by Tau Zero

There is little debate on the importance of bioinformatics for the present and future of science. As molecular ecologists, we are likely more aware of this than researchers in most disciplines because of the data explosion that has accompanied the wide application of next-generation sequencing methods. However, many of you (like me!) might be caught in an awkward area of bioinformatics expertise: too late to have these basics included in your undergraduate/graduate courses and too early to hire a freelance bioinformatician with your fat grant.

So there you are, staring at a bunch of FASTA files wondering how to use someone’s poorly documented Python scripts. Maybe you have a tear in your eye and worry in your heart. You want to get your data from point A to B, but you realize that you are trying to use a tool without any understanding of its underlying concepts, and the whole thing is written in a foreign language. Imagine implementing an ANOVA without any understanding of the normal distribution, with all the software menus in Russian.

This problem is at the core of Vince Buffalo’s Bioinformatics Data Skills: learning bioinformatics by first learning the general philosophy of working with data computationally. His new book isn’t a cookbook or guide to help you get a gene-annotation pipeline running. Instead, Buffalo takes the time to clearly explain the underlying reasons behind some of bioinformatics’ peculiarities.

“The solution is to approach bioinformatics as a bioinformatician does: try stuff, and assess the results. In this way, bioinformatics is just about having the skills to experiment with data using a computer and understand your results. The experimental part is easy; this comes naturally to most scientists. The limiting factor for most biologists is having the data skills to freely experiment and work with large data on a computer. The goal of this book is to teach you the bioinformatics data skills necessary to allow you to experiment with data on a computer as easily as you would run experiments in the lab.”

Buffalo’s goal is to make your work reproducible and robust. Instead of patching together some scripts until they work for your dataset, try building in checks and safeguards that ensure errors and bugs are reliably detected. Instead of keeping your data and scripts in a set of ad hoc folders, plan your data management and store your work remotely, and often, using version control.
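As a rough illustration of what built-in checks can look like (this is not an example from the book), here is a minimal Python sketch of a FASTA reader that refuses to continue when the input looks wrong. The function name `read_fasta` and the allowed alphabet `VALID_BASES` are assumptions made for this sketch; real data with IUPAC ambiguity codes or protein sequences would need a different alphabet.

```python
# Assumption for this sketch: plain DNA only, no IUPAC ambiguity codes.
VALID_BASES = set("ACGTN")


def read_fasta(lines):
    """Parse FASTA records from an iterable of lines, failing loudly on bad input."""
    sequences = {}
    name = None
    for lineno, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError(f"line {lineno}: header has no sequence name")
            name = fields[0]  # use the first word of the header as the name
            if name in sequences:
                raise ValueError(f"line {lineno}: duplicate sequence name {name!r}")
            sequences[name] = ""
        else:
            if name is None:
                raise ValueError(f"line {lineno}: sequence data before first header")
            chunk = line.upper()
            unexpected = set(chunk) - VALID_BASES
            if unexpected:
                raise ValueError(
                    f"line {lineno}: unexpected characters {sorted(unexpected)}"
                )
            sequences[name] += chunk
    # A header followed by no sequence usually means a truncated or broken file.
    empty = [n for n, seq in sequences.items() if not seq]
    if empty:
        raise ValueError(f"records with no sequence: {empty}")
    return sequences
```

Because the function accepts any iterable of lines, it can be handed an open file handle directly. The point is not the parser itself but the habit: a script that stops with a clear message on a duplicate name or a stray character is easier to trust than one that silently produces output.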

“Luckily adopting practices that will make your project reproducible also helps solve these problems. In this sense, good practices in bioinformatics (and scientific computing in general) both make life easier and lead to reproducible projects. The reason for this is simple: if each step of your project is designed to be both re-run (possibly with different data) and is well-documented, it’s already well on its way to being reproducible.”

There are a lot of options out there for learning bioinformatics skills on your own, but Buffalo’s text is an attractive combination of the strengths of other books I’ve picked up but have ultimately put down for being too limited in scope. First and foremost: the text is clear and borders on conversational for the most part. Buffalo does use plenty of jargon throughout, but this seems inevitable given some of the material. Second, the basis of the book is the underlying philosophy of data skills without too much attachment to a particular programming language. There is a specific chapter on the R environment, but most of the book should prepare you to apply similar ideas with whatever scripting flavor you prefer. Third, despite the generalities necessitated by a focus on data skills, this book is ultimately grounded in working with molecular data. All in all, this makes for a thoughtful, intermediate-level book for molecular ecologists who want more ability in bioinformatics, whether that means writing their own tools, interpreting and adapting those created by others, or even just writing a paper with a collaborator.

The organization comes in three parts. Part one includes the ideological basics that I described above. Part two focuses on basic skills for doing general bioinformatics like using the UNIX shell, connecting to remote machines, using version control, and accessing data. Part three is all about practicing variations of those basics using range, sequence, and alignment data. The structure of the book lends itself to nonlinearity. You can certainly hop directly from Part two’s “Remedial UNIX Shell” chapter to the “UNIX Data Tools” chapter in Part three.

Throughout both the ideological chapters and the methodological chapters, Buffalo continues to remind the reader why the basic data skill concepts he is presenting are so beneficial, even if intimidating to begin with. In chapter 7 (UNIX Data Tools), he tells the story of an exchange between famous computer scientists Donald Knuth and Doug McIlroy in which Knuth solves a magazine’s programming challenge to showcase the effectiveness of literate programming (programming written as text with code interspersed). Knuth’s seven-page solution was then one-upped by McIlroy’s own: just six simple lines of UNIX script. Buffalo uses this example to highlight the flexibility and strength of UNIX’s modular design, but details like these throughout the book also help the reader glimpse behind the curtain of computing and see an interesting field in its own right.
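The challenge in that exchange was to print the most frequent words in a text, and the modular idea can be sketched as a chain of small, single-purpose steps. What follows is a Python analogue written for this illustration, not McIlroy’s shell code: the function names are hypothetical, and breaking ties alphabetically is a choice made here.

```python
import re
from collections import Counter


def split_words(text):
    """Break the text into runs of letters, discarding everything else."""
    return re.findall(r"[A-Za-z]+", text)


def lowercase(words):
    """Fold case so that 'The' and 'the' count as the same word."""
    return [w.lower() for w in words]


def count(words):
    """Tally how many times each word occurs."""
    return Counter(words)


def rank(counts):
    """Order by descending count; ties are broken alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def most_frequent(text, k):
    """Chain the steps: each one does a single job and feeds the next."""
    return rank(count(lowercase(split_words(text))))[:k]


print(most_frequent("The cat and the dog and the bird", 2))
# prints [('the', 3), ('and', 2)]
```

Each step can be tested, replaced, or reused on its own, which is the property Buffalo credits for the flexibility of UNIX tools.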

Bioinformatics Data Skills does make some assumptions about the reader’s prior knowledge, but these assumptions are conveniently summarized in the Preface. I have rewritten them here with an honest assessment of my comfort with each in parentheses:

  1. You know a scripting language (know R, dabbled in Python)
  2. You know how to use a text editor (Yep)
  3. You have basic UNIX command line skills (very basic)
  4. You have a basic understanding of biology (I sure hope so)
  5. You have a basic understanding of regular expressions (No, but found these sections easy to grasp)
  6. You know how to get help and read documentation (Yep)
  7. You can manage your computer systems (No, but have usually had a systems administrator)

I’m picturing myself as somewhat representative of your average grad student in Molecular Ecology. While I didn’t meet 100% of the requirements, I was comfortable reading through from front to back.

Bottom line: Bioinformatics Data Skills won’t suddenly allow you to finish a lingering bioinformatics project, but it may help you do something more valuable: change your thinking on how to start the next one.

NOTE: The copy I’ve read for this review is an unreleased version that is currently under copy-editing. There are differences between this version and the one that can be bought as a preliminary release from O’Reilly. The full print version should be available in June according to the publisher.



About Rob Denton

I'm a PhD Candidate at The Ohio State University in the Department of Evolution, Ecology, and Organismal Biology. I'm most interested in understanding the evolutionary/ecological consequences of strange reproduction in salamanders (unisexual Ambystoma). Topics I'm likely to write about: population and landscape genetics, mitonuclear interactions, polyploidy, and reptiles/amphibians.