Riding the Elephant

I recently received my first batch of reads from a single paired-end lane run on an Illumina Hi-Seq instrument. This batch totaled about 20 billion basepairs of DNA sequence, and the associated data files a combined 55.4 gigs of text. By the time I finish my PhD I expect to run between 10 to 20 Hi-Seq lanes, and some quick calculations suggests I may exceed one terabyte of raw data. When I consider all the intermediate files I’ll produce, my PhD project could easily consume 2-3 terabytes of storage.

Although it’s a daunting amount, the storage is not the limiting factor. These days, the tricky part is processing the data in a timely manner. A computer program that just skims through all the lines in a 55.4 gig file takes about one hour to run. A more complicated analysis such as sorting, filtering, or removing duplicates takes even longer. As I continue to collect data, the time consumed waiting for analyses to finish becomes more and more inconvenient. Furthermore, if Next-Gen sequencing follows a sequencing version of Moore’s law (where the number of base-pairs of sequence I get for $2500 doubles every two years) then I could expect to get about 640 billion base-pairs for $2500 in 2021. Analyzing this much data will be rather inconvenient with the current software tools.

Luckily, biologists are not the only people with this problem. Large internet search companies like Google or Yahoo need to quickly create sorted indexes of billions of webpages stored in very large files. To manage all this data, Google developed a file system that handles large files easily and a software framework for dividing files into smaller, more easily digestible pieces. Basically, they speed up the indexing of very large files by dividing the data amongst many computers, removing duplicates (etc.), and recombining the data into a sorted and indexed file. It’s fast. Yahoo recently used this method to sort 1 terabyte of data in 63 seconds as part of the Sort Benchmark challenge. This filesystem and software framework are publicly and freely available from Apache and are know as respectively as HDFS and MapReduce or collectively as Hadoop.


However, due to its unique file system (HDFS), Hadoop requires dedicated machines to run at full capacity. Most universities have some sort of scientific computer cluster, but as of this time Cornell is the only University (that I’m aware of) that offers a dedicated Hadoop computer cluster for its researchers. Luckily, those of us without access to Cornell’s cluster can use Amazon’s. Yup, Amazon provides access to its Hadoop cluster for a small fee, or even for free if you apply for an Amazon Educational Research Grant.

Until now I’ve been referring to Hadoop generally, however, MapReduce is the crux of the framework. MapReduce, in its simplest form, requires that you write two pieces of code: a mapping function and a reducing function (see Figure 1.). The mapping function takes as input lines of the big-data-file and outputs modified lines, as well as associated integers (e.g., key-value pairs). The mapping function is run on many processors, and the data it receives is actually a subset of the total big-data-file. Thus the data it outputs is sent to a reducing function that processes it by key and produces a final output file.

Map Reduce Schematic

MapReduce has a some additional characteristics that make it ideal for processing large files such as those produced by Next-Gen sequencing. Firstly, although the MapReduce framework is written in Java, the mapping and reducing functions may be written in Python, Perl, Ruby, C, or C++, using Hadoop Streaming. This means that you likely do not have to learn a new computer language to write MapReduce algorithms. Secondly, MapReduce is also “robust to processor failure.” With traditional Grid Computing frameworks such as Sun/Oracle Grid Engine or Platform, if one processor fails or hangs the analysis either fails entirely or pauses until that processor recovers. Unfortunately, cluster processors often fail or hang. For example, in some of my non-hadoop code 95% of the processes finish in 30 minutes, but the whole analysis can take as long as 5 hours because 2 or 3 processes get scheduled on bad processors. It is certainly possible to write code to check for hung processors and restart processes on new processors, but MapReduce does this automatically. Furthermore, as your processes finish, MapReduce duplicates the remaining processes on unused processors so that the first one to finish ends the analysis. Lastly, when it comes to publishing a paper describing your software, Hadoop is useful because your code should run without modification on any Hadoop cluster. Perhaps, more usefully, you could create a website, MyRNA is a good example, that allows users to submit their data to Amazon’s cluster for analysis.

This is not to say that Hadoop is not without problems. The most inconvenient issue is getting your data to the Hadoop cluster. Uploading a 55.4 gb file to Amazon would take about 1.7 days on a fast 3 mbps connection. Real-world speeds will almost certainly be slower. In November, however, Amazon introduced a multipart upload system that allows you to upload different parts of the same file simultaneously and subsequently assemble them on the cluster. This speeds things up considerably. Another potential issue is that the standard sequence format from Next-Gen instruments involves multiple lines per sequence (e.g., Fastq format) and MapReduce operates line by line. However, it is possible to get one-line formats such as the Illumina SCARF single line format.

I foresee Hadoop being used for a lot of basic genomic methods such as post-processing of reads and summary statistic calculations. For example Hadoop could be used for separating reads by barcode tags, identifying duplicate reads (e.g., PCR artifacts), or calculating mean GC counts and average base-pair position quality score. Future directions will likely involve integration with Galaxy Genome Browser for which there is already an Amazon implementation: Galaxy Cloudman. Other uses could include SNP calling or tree building from RAD-tag or similar genetic markers.
Of couse Hadoop is not limited to genetic data. Any project that produces huge data files may benefit from Hadoop and MapReduce. Thus I imagine, many people will want to write their own MapReduce code. My next post will explain how to do this using Python and a new MapReduce Python Module called Dumbo.

Lastly, you’re probably wondering my goofy title and the name Hadoop came from. I think the inventor of Hadoop, Doug Cutting, explains it best:

…the name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such.

Special thanks goes out to Reuben Kabel who took a look at an early draft of this post and made some very helpful suggestions.

Share

About Nicholas Crawford

I'm a computational genomics Post Doctoral Fellow at the California Academy of Sciences. I'm working on a number of projects including vertebrate systematics and the genomics of adaptation in lizards, heliconius butterflies, and Hawaiian drosophila.
This entry was posted in bioinformatics, next generation sequencing, software and tagged , , , , . Bookmark the permalink.