A comparison of bioinformatics programming languages

If you program enough, it can change the way you look at the world…

The times are a-changin' and most molecular ecologists and evolutionary biologists are no longer asking themselves, “Should I learn a programming language?”, but rather “Which programming language should I learn?”. A variety of programming languages are used by the bioinformatics community, and the number of bioinformatics-compatible languages is on the rise. As such, it can be a little daunting to decide which programming languages to master. Judging from the online forums I have perused, many professional programmers will insist that you should pick whichever language works best for each particular purpose.  I somewhat agree with that sentiment, but how many languages can you realistically expect to learn?  Furthermore, it is often more efficient to be an expert in a handful of languages than to be an intermediate-level programmer in a greater number of languages.  On the flip side, being dogmatically attached to a single language can be detrimental to productivity. For statistical and quantitative work, I prefer R because it is open source.  I also like the Linux command line, both as a glue to bind analyses together and for quick data management tasks.  But what language should you use for all those other bioinformatics-type tasks that you need to accomplish (e.g., filtering reads, mapping reads, parsing BLAST files, identifying SNPs)?

A paper by Fourment and Gillings provides a nice comparison of languages commonly used in bioinformatics.  In this paper, the programming languages are divided into scripting languages (Perl and Python), semi-compiled languages (Java and C#), and fully compiled languages (C and C++).  Perl and Python programs are (typically) compiled each time before they run, and they are often not compiled to the same extent as C and C++ programs (but see PyPy for Python).  This means that, once written, C and C++ programs typically run faster and require less memory.  Like most things in life, however, there is a tradeoff: C and C++ programs usually require more lines of code because more details have to be specified in each program.  In other words, time spent developing, writing, and debugging code is traded against the time that the program takes to run to completion.  This tradeoff is nicely illustrated in Figures 1 and 5 from the paper.

Figures 1 and 5 from Fourment and Gillings, which illustrate the tradeoff between lines of code written and the speed at which a global alignment program runs to completion. Notice that the compiled and semi-compiled languages run much faster, but can take more lines of code to write. The “semi-compiled” languages (C# and Java) do not necessarily take more lines (though see notes on Perl below).
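
To give a sense of what the benchmarked task involves, here is a minimal Python sketch of global alignment scoring using the standard Needleman-Wunsch dynamic-programming approach. It is not the code from Fourment and Gillings; the function name and the scoring values (match +1, mismatch -1, gap -2) are assumptions chosen purely for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    # Row 0 of the scoring matrix: the empty prefix of `a` aligned
    # against each prefix of `b` costs one gap per character.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: the first i characters of `a` aligned against nothing.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            # Align a[i-1] with b[j-1] (match or mismatch) ...
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # ... or align a[i-1] with a gap ...
            up = prev[j] + gap
            # ... or align b[j-1] with a gap.
            left = curr[j - 1] + gap
            curr.append(max(diag, up, left))
        prev = curr  # only the previous row is needed for the score
    return prev[-1]
```

Calling global_alignment_score("ACGT", "ACT") fills in a 5 x 4 table one row at a time. The nested loop is exactly the kind of tight numerical work where compiled languages tend to pull ahead, even though the Python version is short.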

I would wager that there are a number of Perl gurus who could substantially reduce the number of lines of code in the Perl program depicted in the Figure above.  However, the authors of the paper understandably wanted the programs to be readable and easy to document.  This is, in fact, a common complaint about Perl:  it can be unreadable, and a nightmare for anyone but the original programmer to comprehend.  Below are four lines of Perl that have been purposefully obfuscated, but which illustrate the need to program carefully in Perl.

@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print

These four lines actually represent a fairly sophisticated program, but they are difficult to decipher.  Many Perl users respond to this concern by rightly claiming that it is up to the programmer to provide clear, concise code and appropriate comments and documentation.  I have spent the last two years learning and working with Perl, and when I first started I was guilty of creating strange-looking code.  If I went back to a program after a few months, it could take me quite a long time to figure out what I had written.  When working with Perl now, I comment on almost every line and write detailed comments at the top of the script.  For some reason, this process of reflection helps me write cleaner, more concise code.

The choice of programming language is also simply a matter of personal taste.  Personally, I neither like nor dislike programming in Perl; I am somewhat ambivalent about it.  However, I really love programming in R.  I love the structure, the syntax, the clever-but-simple ways to optimize code, etc.  Recently, I have begun using Python and have found it to be similar, in many respects, to programming in R.  I find that I am now using Python more often than Perl for the simple reason that I find it to be a more enjoyable experience.  Unfortunately, the only way to figure this out is to spend time working with both languages.  Perl and Python both have bioinformatics libraries ready for use so that you don’t have to reinvent the wheel: BioPerl and Biopython.

And while I am on the subject of reinventing the wheel  – despite what everyone will tell you – it can be a good thing to occasionally reinvent the wheel when it comes to becoming proficient with a programming language.  Obviously, once you have become an advanced programmer it is a waste of time to recreate well-designed code, but you are only going to become an expert by starting with simple programs and building up from there.  Why not create a script to filter your Illumina reads?  Sure, there are hundreds of them out there – but you may not understand how to create more sophisticated scripts until you give it a shot yourself.
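
As a starting point for that exercise, here is a minimal sketch of a read filter in Python. It assumes four-line FASTQ records with Phred+33 quality encoding. The function names and the thresholds (a mean quality of 20 and a minimum length of 4) are made up for illustration, so treat it as a first draft to improve on rather than a replacement for an established tool.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            # End of file (a blank line where a header should be also stops here).
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string, assuming Phred+33 encoding."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle, min_mean_q=20, min_length=4):
    """Yield only the records that are long enough and of sufficient mean quality."""
    for header, seq, qual in read_fastq(handle):
        if len(seq) >= min_length and mean_quality(qual) >= min_mean_q:
            yield header, seq, qual


# Tiny in-memory example: one good read, one low-quality read, one short read.
demo = io.StringIO(
    "@read1\nACGTACGT\n+\nIIIIIIII\n"
    "@read2\nACGTACGT\n+\n!!!!!!!!\n"
    "@read3\nAC\n+\nII\n"
)
kept = [header for header, seq, qual in filter_reads(demo)]
```

Here kept ends up holding only "@read1", because read2 fails the quality threshold and read3 is too short. Writing even something this small forces you to think about file formats, quality encodings, and edge cases, which is the point of the exercise.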

P.S.  I have not used C or Java enough to comment on them.  If you have used these (or different) programming languages  – please add your experiences with these languages to the comments.  Do you enjoy using them?  Why did you pick them, etc.?

 References:

Fourment, M. and Gillings, M.R. 2008. A comparison of common programming languages used in bioinformatics. BMC Bioinformatics 9: 82.

See also:

Dudley, J.T. and Butte, A.J. 2009. A quick guide for developing effective bioinformatics programming skills. PLoS Computational Biology 5(12).


About Mark Christie

Mark Christie is a post-doctoral fellow in the Department of Zoology at Oregon State University.
  • http://sites.google.com/site/hoban3/ Sean Hoban

    I use Java, and highly recommend it. There are a variety of approaches to creating loops and conditional statements, it is relatively straightforward to read and write files, and it is very easy to create and manage arrays. Another important advantage is that there are a huge number of examples and code snippets scattered across the internet for beginners and advanced users alike. Programming for either command-line executables or GUIs is also easy and intuitive. I think Java has also been designed to be very good at preventing run-time errors. I tried picking up C and Java simultaneously and found it much easier to lay out a program in Java than in C. I use R sometimes, but I mostly use Java because it’s easier to interact with files and write much more complex code. Well, those are my thoughts. I do hope to get into Python sometime soon.

  • Tim Vines

    Does Mathematica count as a language or a program? The lab I did my PhD in was all M’ca, all the time. As a low-level user I found it had a much shallower learning curve than R, especially when it came to manipulating lists (R is a complete jerk with lists).

  • Mark Christie

    I still haven’t completely mastered the power of lists in R. My understanding is that they hold many different data types (data.frames, vectors, matrices etc) and so would be useful in large projects. Any avid R list users care to comment?

  • http://www.unionx.net/ unionx

    Java is good, but I don’t recommend it for bioinformatics tasks. The JVM takes a long time to start, and numerical computation in Java is not as good as in Python or R.

    • Mark Christie

      Although I do not use Java myself, one interesting thing I have noticed is that the people running it on our cluster are almost always (1) running it in parallel and (2) using very few computational resources. My guess is that doing initial development in R or Python is a good idea, but that moving to Java or C may be worthwhile when you start scaling up your applications.

      • http://www.unionx.net/ unionx

        I know some bioinfo guys who just write batch scripts to do some calculations. I am not sure whether they need to build an online service. Yes, Java is very good for online services, and I use Clojure for that.

  • Jon Puritz

    I still rely on others to actually write the heavy-duty analysis code, but I find bash incredibly easy and useful for analysis pipelines. I highly recommend that every bioinformatician be familiar with what bash and basic Unix commands can do for data manipulation.

  • Eric Thomas

    I used to wave the Java and C++ flags high, but with solid libraries like Biopython and SciPy it’s hard to justify the time you would need to replicate a lot of this in Java. Python is just quick and can handle most things you need. The built-in GUI toolkit (tkinter) doesn’t have as much going for it as Java’s, but it usually more than fills the needs of a basic program. After doing this for about a year and a half, I have all but fully converted to Python.

  • Matt

    For use-once research code that filters or formats data, anyone using C or Java doesn’t value their own time and likely just doesn’t know any other languages. I know all of the languages mentioned in this article with the exception of C#, which I have only played with, because portability matters to me. Each has its place, but for day-to-day data munging only Python and Perl are viable options, with Perl genuinely nicer in syntax for shell-script activities (e.g. no significant white space in if statements). For stats and plots, R or Python with statsmodels and matplotlib are both great. Personally I try to stick more with Python because it’s just a lot less clunky in syntax IMHO and more useful to know if you ever decide to leave science. If you wish to write a large-scale application that others will contribute to and that is highly algorithmic rather than focused on data processing, Java or Python are equally viable. If you want something that’s solving some serious problems, maybe in a large combinatorial space, you want to be using C with MPI or OpenMP at a minimum; if you didn’t already know this then you aren’t solving really serious combinatorial problems! Most of us aren’t, unless we are dealing with short read assembly or phylogenetic tree search. The most valuable thing is your own time; line count isn’t a great measure of productivity. In Java you copy-paste the same 50-100 lines every time, so once you have some of your own libs written it’s not too bad. Ultimately, for really simple stuff Perl cannot be beaten, since you can inline whilst in the shell: perl -lane 'print $F[2]*$F[4]' < input.tsv. For this sort of task (stripping the third and fifth columns of a table and multiplying them, or running any other function, such as a sequence comparison), Perl mastery cannot be beaten. Converting a history of one-liners like this into a full script at a later date doesn’t take much effort either.

    The real issue is that I haven’t met a single person in bioinfo who wasn’t comp-sci trained get beyond “intermediate” in any single language. The OP suggests you need a lot of time to learn many languages. That isn’t true: after a deep understanding of two languages with varied syntax, it gets very easy to learn more. If you know only one language (other than perhaps C), it’s not possible to be a true master, since you lack understanding of how things might be working underneath the high-level syntax. In Java, for example, if you don’t understand what a reference is, and that the GC waits for all references to some memory to vanish before collecting it, you can create memory leaks much like those in C. I don’t think it’s important that bioinfo people are masters of programming; what is important to realise is that you won’t ever really be a master programmer, as this takes real training and experience. Someone who is graceful with a hammer at home isn’t a professional architect and construction worker, and that’s OK. The important thing is to pick a hammer for nails and a screwdriver for screws rather than always using the hammer!

  • Edward Kirton

    I’ve been working in bioinformatics for over a decade and have used a dozen languages over the years, including the ones discussed above. The bread-and-butter coding of bioinformaticians is writing scripts which wrap powerful third-party programs and manipulate files, often to create a pipeline (usually on the cluster). For this, the best are Perl, Python, bash, and maybe C#. Each has pros and cons. Start with one of these and learn good coding practices (e.g. use of version control like Git, good documentation habits, test-driven development, agile project management, etc.). Which to learn? I recommend you use whichever one you can get good coaching on. Do you have someone at work who is willing to answer questions, do code reviews, and do pair programming with you?