A comparison of bioinformatics programming languages

If you program enough, it can change the way you look at the world…

The times are a-changin', and most molecular ecologists and evolutionary biologists are no longer asking themselves, “Should I learn a programming language?”, but rather, “Which programming language should I learn?” The bioinformatics community uses a variety of programming languages, and the number of bioinformatics-compatible languages is on the rise. As such, it can be a little daunting to decide which ones to master. From my perusing of various online forums, many professional programmers insist that you should pick the language that works best for each particular purpose. I somewhat agree with that sentiment, but how many languages can you realistically expect to learn? Furthermore, it is often more efficient to be an expert in a handful of languages than to be an intermediate-level programmer in many. On the flip side, being dogmatically attached to a single language can be detrimental to productivity. For statistical and quantitative work, I prefer R because it is open source. I also like the Linux command line, both as a glue to bind analyses together and for quick data management tasks. But what language should you use for all those other bioinformatics-type tasks that you need to accomplish (e.g., filtering reads, mapping reads, parsing BLAST files, identifying SNPs)?

A paper by Fourment and Gillings provides a nice comparison of languages commonly used in bioinformatics. In this paper, the programming languages are divided into scripting languages (Perl and Python), semi-compiled languages (Java and C#), and fully compiled languages (C and C++). Perl and Python programs are (typically) compiled each time they are run, and they are usually not compiled to the same extent as C and C++ programs (but see PyPy for Python). This means that, once written, C and C++ programs typically run faster and require less memory. Like most things in life, however, this comes at a cost: C and C++ programs usually require more lines of code because more details have to be specified in each program. Thus there is a tradeoff between the time spent developing, writing, and debugging code and the time that the program takes to run to completion. This tradeoff is nicely illustrated in Figures 1 and 5 from the paper.

Figures 1 and 5 from Fourment and Gillings, which illustrate the tradeoff between lines of code written and the speed at which a global alignment program runs to completion. Notice that the compiled and semi-compiled languages run much faster than the scripting languages, but that the fully compiled languages can take more lines of code to write. The “semi-compiled” languages (C# and Java) do not necessarily take more lines (though see notes on Perl below).
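To give a feel for the "few lines, slower runtime" end of that tradeoff, here is a minimal Python sketch of a global alignment score (the Needleman-Wunsch dynamic program). This is not the code benchmarked by Fourment and Gillings; the function name and the scoring values (match, mismatch, gap) are my own assumptions, chosen only for illustration.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global (Needleman-Wunsch) alignment score of a and b.

    The scoring scheme is an illustrative assumption: a linear gap penalty
    and a simple match/mismatch score.
    """
    # prev holds the previous row of the dynamic-programming table;
    # row 0 corresponds to aligning a prefix of b against nothing but gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: the first i characters of a aligned against gaps only.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # gap inserted in b
            left = curr[j - 1] + gap  # gap inserted in a
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

Only two rows of the table are kept in memory, so the sketch returns the score but not the alignment itself. A C or C++ version of the same loop would need explicit types and memory management, which is where the extra lines of code come from.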

I would wager that there are a number of Perl gurus who could substantially reduce the number of lines of code in the Perl program depicted in the figure above. However, the authors of the paper understandably wanted the programs to be readable and easy to document. This is, in fact, a common complaint about Perl: it can be unreadable, and a nightmare for anyone but the original programmer to comprehend. Below are four lines of Perl that have been purposefully obfuscated, but which illustrate the need to program carefully in Perl.

@P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{
@p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord
($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&&
close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print

These four lines actually represent a fairly sophisticated program, but they are difficult to decipher. Many Perl users answer this complaint by rightly pointing out that it is up to the programmer to provide clear, concise code along with appropriate comments and documentation. I have spent the last two years learning and working with Perl, and when I first started I was guilty of creating strange-looking code. If I went back to a program after a few months, it could take me quite a long time to figure out what I had written. When working with Perl now, I comment on almost every line and write a detailed description at the top of each script. For some reason, this process of reflection helps me write cleaner, more concise code.

The choice of programming language is also partly a matter of personal taste. Personally, I neither like nor dislike programming in Perl; I am ambivalent about it. However, I really love programming in R. I love the structure, the syntax, the clever-but-simple ways to optimize code, and so on. Recently, I have begun using Python and have found it to be similar, in many respects, to programming in R. I now use Python more often than Perl for the simple reason that I find it more enjoyable. Unfortunately, the only way to figure this out for yourself is to spend time working with both languages. Perl and Python both have ready-made bioinformatics libraries so that you don’t have to reinvent the wheel: BioPerl and Biopython.

And while I am on the subject of reinventing the wheel: despite what everyone will tell you, it can be a good thing to occasionally reinvent the wheel when you are trying to become proficient in a programming language. Obviously, once you have become an advanced programmer it is a waste of time to recreate well-designed code, but you are only going to become an expert by starting with simple programs and building up from there. Why not create a script to filter your Illumina reads? Sure, there are hundreds of them out there, but you may not understand how to create more sophisticated scripts until you give it a shot yourself.
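As a starting point for that kind of exercise, here is a minimal Python sketch of a read filter. It rests on several assumptions of mine that you should check against your own data: each FASTQ record occupies exactly four lines, quality scores use the Phred+33 encoding, and the thresholds (a mean quality of 20 and a minimum length of 4) are placeholders rather than recommendations. The function names are hypothetical, not taken from any existing tool.

```python
import io


def read_fastq(handle):
    """Yield (header, sequence, quality) from a FASTQ handle.

    Assumes exactly four lines per record and stops at the first empty header.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        seq = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header, seq, qual


def mean_quality(qual, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    if not qual:
        return 0.0
    return sum(ord(c) - offset for c in qual) / len(qual)


def filter_reads(handle, min_mean_q=20, min_length=4):
    """Yield only the reads that pass the length and mean-quality thresholds."""
    for header, seq, qual in read_fastq(handle):
        if len(seq) >= min_length and mean_quality(qual) >= min_mean_q:
            yield header, seq, qual


if __name__ == "__main__":
    # Two made-up reads: 'I' encodes Phred 40 and '!' encodes Phred 0.
    demo = io.StringIO(
        "@read1\nACGTACGT\n+\nIIIIIIII\n"
        "@read2\nACGTACGT\n+\n!!!!!!!!\n"
    )
    for header, seq, qual in filter_reads(demo):
        print(header, seq)
```

Writing even something this small forces you to learn what a FASTQ record looks like and how quality scores are encoded, which is the point of the exercise. For real analyses, an established, well-tested tool is still the safer choice.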

P.S. I have not used C or Java enough to comment on them. If you have used these (or other) programming languages, please share your experiences in the comments. Do you enjoy using them? Why did you pick them?

References:

Fourment, M. and Gillings, M.R. 2008. A comparison of common programming languages used in bioinformatics. BMC Bioinformatics 9: 82.

See also:

Dudley, J.T. and Butte, A.J. 2009. A quick guide for developing effective bioinformatics programming skills. PLoS Computational Biology 5(12).

About Mark Christie

Mark Christie is an assistant professor in the Department of Biological Sciences and Department of Forestry & Natural Resources at Purdue University.