Docker: making our bioinformatics easier and more reproducible

This is a guest post by Alicia Mastretta-Yanes, a CONACYT Research Fellow assigned to CONABIO, Mexico. Her research uses molecular ecology and genomic tools to examine how historical climate fluctuations, human management and domestication have affected species distributions. You can find more about her research on her website: http://mastrettayanes-lab.org. She tweets about reproducible research, genomics and cycling in Mexico City as @AliciaMstt.
I decided to write this entry while reading the Results of the Molecular Ecologist’s Survey on High-Throughput Sequencing, because it stated that 89% (n=260) of molecular ecologists working with High-Throughput Sequencing are performing the bioinformatic analyses themselves. I could not think of a better place to share a tool that I think anyone performing bioinformatic analyses should know: Docker. I will explain what Docker is in a moment, but first let me state why I think we all should turn our eyes to it.
I suspect most of that 89% have little previous training in computer science. At least I hadn’t when I jumped into using ddRAD for my PhD; my PhD friends were in a similar situation, and so are my students. There are tons of papers out there presenting cool biological results from genomic data, so we must be a clever lot, capable of learning how to perform bioinformatic analyses. If you learned, or are learning, bioinformatics then you likely know that the first challenge is not understanding how to use a command line program, but actually installing the damn thing and all its (never-ending) dependencies (maybe you have access to an HPCC and the cluster admin does that for you, but still, you end up having to install some stuff on your personal or lab computer). If so, you likely also know that installing something can mess up something else. You may have left a Linux computer out of service for a couple of panicked days, or you may have had to perform a fresh install of your Mac’s OS (or you want to, but that would mean figuring out again how to install that precious software that cost you so much to get running). As if this were not enough, just yesterday they released a new version of that software you already have installed, and you would like to upgrade, if only you were not afraid of sharks:

https://xkcd.com/349/

Success. (xkcd)


The solution to all this comes in the shape of a nice blue whale called Docker:
[Image: Docker logo]
Docker is an open-source (yay!) engine that automates the deployment of applications into containers, so that you can install inside such a container the software you want, along with everything it needs to run: file system, code, system tools, obscure Perl libraries, etc. It does this by “adding an application deployment engine on top of a virtualised container execution environment”, which means that it is similar to VirtualBox, except that it runs on top of an operating system’s kernel and does not require an emulation layer. That makes it an incredibly light, fast and efficient environment in which to run your code, which is why it quickly became a hot topic among coders.

Advantages for molecular ecologists? Many. First: you can install and run a given piece of bioinformatic software (you name it) on any operating system (Mac, Linux or Windows >7, 64-bit) in such a way that the installation is independent of anything else, including the host OS. Meaning you can do all the sudo you want without breaking your computer anymore (hooray!). Second: you can add to your beautiful-reproducible-research the ultimate step of making the whole system where your code was run reproducible.
Convinced? Here is a very easy Docker installation and first steps tutorial.
Would you like to see some action? Assuming you have Docker installed, this is how to download and run the latest version of Ubuntu from your terminal:
First pull the Ubuntu image from Docker Hub, a Docker repository of common software and operating systems.

$ docker pull ubuntu
Using default tag: latest
latest: Pulling from library/ubuntu
5a132a7e7af1: Pull complete
fd2731e4c50c: Pull complete
28a2f68d1120: Pull complete
a3ed95caeb02: Pull complete
Digest: sha256:4e85ebe01d056b43955250bbac22bdb8734271122e3c78d21e55ee235fc6802d
Status: Downloaded newer image for ubuntu:latest

(That pulled the latest version, but we could also have specified which one we wanted, e.g. docker pull ubuntu:14.04.)
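For reproducibility it can help to go one step further than a tag. A tag such as 14.04 can later point to a rebuilt image, whereas the Digest line in the pull output identifies one exact image. The sketch below shows both forms. The small run helper is only an illustration device (not part of Docker): it prints each command unless DRY_RUN=0 is set. The digest is simply copied from the output above, assuming the registry still serves that exact image.

```shell
#!/bin/sh
# Illustration helper: print the command by default, run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

IMAGE="ubuntu"
# Digest copied from the pull output above; a pull on another day may report a different one.
DIGEST="sha256:4e85ebe01d056b43955250bbac22bdb8734271122e3c78d21e55ee235fc6802d"

# Pull by tag: convenient, but the image behind a tag can change over time.
run docker pull "$IMAGE:14.04"

# Pull by digest (NAME@DIGEST): always refers to one exact image.
run docker pull "$IMAGE@$DIGEST"
```

Noting the digest alongside your scripts is one way to record exactly which image an analysis was run in.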
We then run a container from that image:

$ docker run -it ubuntu /bin/bash
root@4ff0be4995f0:/#
root@4ff0be4995f0:/# ls
bin   dev  home  lib64  mnt  proc  run   srv  tmp  var
boot  etc  lib   media  opt  root  sbin  sys  usr
root@4ff0be4995f0:/# echo "Hello world!"
Hello world!

And that’s it: root@4ff0be4995f0:/# means that we are root inside an Ubuntu container (with ID 4ff0be4995f0) and that we can do anything we would do from an Ubuntu terminal, in this example a simple ls and the classic echo "Hello world!". We can install all the software we want there, use it for assembling a transcriptome or whatever, exit and come back to it again. We can also access data outside of the container (for that you need to mount a volume when you run the container).
Getting the latest version of Ubuntu up and running with two lines of code is nice, right? The same can be done for some computational biology software thanks to two projects, Biodocker and Bioboxes.
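Mounting a volume and coming back to a container could look roughly like the sketch below. The host directory name data and the mount point /data are hypothetical, chosen only for illustration, and the run helper is an illustration device that prints each command unless DRY_RUN=0 is set. The container ID is the one from the session above; yours will differ.

```shell
#!/bin/sh
# Illustration helper: print the command by default, run it when DRY_RUN=0.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Hypothetical paths: change them to wherever your data actually lives.
HOST_DIR="${HOST_DIR:-$PWD/data}"   # directory on your own computer
CONTAINER_DIR="/data"               # where it will appear inside the container

# -v HOST:CONTAINER mounts the host directory inside a new container.
run docker run -it -v "$HOST_DIR:$CONTAINER_DIR" ubuntu /bin/bash

# "docker run" always creates a new container. To return to an existing one
# after exiting, list all containers and restart the one you want by its ID.
run docker ps -a
run docker start -ai 4ff0be4995f0
```

Files written to /data from inside the container end up in the host directory, so results survive even if the container is later deleted.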

They are basically writing Dockerfiles for software commonly used in bioinformatics. A Dockerfile is a text file that contains all the commands, in order, needed to build a given image. So if you want to install Bowtie 1.1.2 you can just pull the image from the Biodocker repository:

$ docker pull biodckrdev/bowtie:1.1.2
1.1.2: Pulling from biodckrdev/bowtie

bunch of output not worth sharing

Status: Downloaded newer image for biodckrdev/bowtie:1.1.2

And then we can run it as a process using docker:

$ docker run biodckrdev/bowtie:1.1.2 bowtie
No index, query, or output file specified!
Usage:
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]
  <m1>    Comma-separated list of files containing upstream mates (or the
          sequences themselves, if -c is set) paired with mates in <m2>
  <m2>    Comma-separated list of files containing downstream mates (or the
          sequences themselves if -c is set) paired with mates in <m1>
  <r>     Comma-separated list of files containing Crossbow-style reads.  Can be
          a mixture of paired and unpaired.  Specify "-" for stdin.
  <s>     Comma-separated list of files containing unpaired reads, or the
          sequences themselves, if -c is set.  Specify "-" for stdin.
  <hit>   File to write hits to (default: stdout)
...

If you clicked the links above you may have noticed that Biodocker and Bioboxes are missing several programs that you may work with. The nice thing is that we can all contribute by writing Dockerfiles, and developers could include a Dockerfile as an install option.
There is much more to Docker than what I introduced here. For instance, you can build a cluster using Docker Swarm. Docker is a hot topic among computer scientists and developers for a reason. It has only just started to be used by the biology community, but it has huge potential. I think we will be seeing more and more of Docker soon, as it seems to be the natural next step to make our work easier and our bioinformatic analyses more reproducible.
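To give an idea of what writing a Dockerfile involves, here is a minimal sketch that writes one to a scratch directory. It is not the Dockerfile used by Biodocker or Bioboxes. It assumes that bowtie is available as an apt package on the chosen Ubuntu release, and the image name my-bowtie is made up. The image is only built if you set DO_BUILD=1 and Docker is installed.

```shell
#!/bin/sh
# Scratch directory for the build (hypothetical; any empty directory works).
BUILD_DIR="${BUILD_DIR:-$(mktemp -d)}"

# Write a minimal Dockerfile. Each instruction adds one step to the image.
cat > "$BUILD_DIR/Dockerfile" <<'EOF'
# Start from a pinned base image so that rebuilds use the same OS release
FROM ubuntu:14.04
# Install the program; "bowtie" as an apt package name is an assumption
RUN apt-get update && apt-get install -y bowtie && rm -rf /var/lib/apt/lists/*
# Command run by default when a container starts from this image
CMD ["bowtie"]
EOF

# Build only on request, and only if Docker is actually installed.
if [ "${DO_BUILD:-0}" = "1" ] && command -v docker >/dev/null 2>&1; then
    docker build -t my-bowtie "$BUILD_DIR"
else
    echo "Dockerfile written to $BUILD_DIR/Dockerfile"
fi
```

Once built, the image could be run in the same way as the Biodocker one above, for example with docker run my-bowtie bowtie.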
