Essential programmatic tools

There are essentials, and there are essentials. Here are several bioinformatic tools that I use on a daily basis (with a Python bias):

  • Python – rapidly becoming the go-to high-level language in biology. If you aren’t happy with Python, try Perl or Ruby. If you aren’t happy with high-level languages, there’s always C. If you aren’t happy with C (or C++ or Fortran), there are any number of functional programming languages. Treat each language as a particular tool for a particular job: no single language is well suited to every task. Python is a very general-purpose language and is distributed in formats suited to a number of architectures.

  • Kent Source – probably the most useful and legendary package of bioinformatics code currently available for large-scale manipulation of genomic data. It is written in C and is very fast. It requires compilation and is available as a zip archive or via git.

  • PyFasta – an optimized library for rapidly accessing massive FASTA files. Also available at Bitbucket.

  • Biopython – one of the oldest bioinformatics libraries for Python, and a large one with a great deal of functionality. Available in a number of formats at the Biopython downloads page.

  • oursql – sooner or later, you are going to find that you need access to a database. Typically, that will be a MySQL database (although I also like PostgreSQL). I prefer oursql because of its buffering. Available from Launchpad.
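
To give a feel for why an indexed FASTA reader is so much faster than re-parsing a massive file, here is a minimal sketch of the general idea: scan the file once, record where each sequence starts, then seek straight to the bases you want. This is not PyFasta's implementation or API. The function names `build_index` and `fetch` are hypothetical, and the sketch assumes every sequence line in a record (except possibly the last) has the same width.

```python
import io


def build_index(handle):
    """Scan a binary FASTA handle once.

    Returns {name: (seq_offset, seq_length, bases_per_line, bytes_per_line)}.
    Assumes fixed-width sequence lines within each record (hypothetical helper).
    """
    index = {}
    name = None
    offset = 0
    for line in handle:
        if line.startswith(b">"):
            # The record name is the first word of the header line.
            name = line[1:].split()[0].decode()
            # The sequence starts right after the header line.
            index[name] = [offset + len(line), 0, 0, 0]
        elif name is not None and line.strip():
            entry = index[name]
            bases = len(line.rstrip(b"\r\n"))
            if entry[2] == 0:
                # First sequence line sets the line geometry.
                entry[2], entry[3] = bases, len(line)
            entry[1] += bases
        offset += len(line)
    return {key: tuple(value) for key, value in index.items()}


def fetch(handle, index, name, start, end):
    """Return bases [start, end) of one record (0-based), reading only those bytes."""
    seq_offset, seq_length, bases_per_line, bytes_per_line = index[name]
    end = min(end, seq_length)
    if start >= end:
        return ""

    def byte_pos(i):
        # Translate a base coordinate into a byte offset in the file.
        return seq_offset + (i // bases_per_line) * bytes_per_line + i % bases_per_line

    handle.seek(byte_pos(start))
    raw = handle.read(byte_pos(end - 1) - byte_pos(start) + 1)
    # Drop the line breaks that fall inside the requested span.
    return raw.replace(b"\n", b"").replace(b"\r", b"").decode()


# Usage with an in-memory file; a real file would be opened with mode "rb".
fasta = io.BytesIO(b">chr1 test\nACGTAC\nGTTTGA\nCC\n>chr2\nGGGG\n")
fasta_index = build_index(fasta)
print(fetch(fasta, fasta_index, "chr1", 4, 9))
```

The index is tiny compared with the sequence data, so it can be built once and reused, and each lookup costs one seek and one short read regardless of how large the file is.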

Runners Up

I find these interesting, but I have yet to put them into the daily rotation:

  • Pygr – bills itself as a “scalable bioinformatics interface”. Provides a number of wonderful ways to access the data and database of your choice.

  • sqlalchemy – a great SQL toolkit and object-relational mapper for Python. Works with a vast array of databases and database APIs.

  • disco – a framework that implements the map-reduce algorithm (manuscript) for Python programmers, using an Erlang engine. It is highly fault-tolerant and well suited to massive processing jobs where the problem can be parallelized using the map-reduce approach. It also provides some handy tools like Discodex and the discodb object.
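
For anyone who has not met map-reduce before, the shape of the approach can be sketched in plain Python. This is a single-process illustration only, not Disco's API: the function names are hypothetical, and a real framework would spread the map and reduce steps across many machines and handle failures for you. The example counts k-mers across a set of sequences.

```python
from collections import defaultdict
from itertools import chain


def map_kmers(sequence, k=3):
    """Map step: emit a (kmer, 1) pair for every k-mer in one sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k], 1


def shuffle(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: combine all values for one key into a single total."""
    return key, sum(values)


def count_kmers(sequences, k=3):
    """Run map, shuffle, and reduce in sequence on one machine."""
    mapped = chain.from_iterable(map_kmers(s, k) for s in sequences)
    grouped = shuffle(mapped)
    return dict(reduce_counts(key, values) for key, values in grouped.items())


# Usage: each sequence could be mapped independently, which is what makes
# the problem parallelizable.
print(count_kmers(["ACGAC", "CGA"], k=2))
```

The point is that `map_kmers` only ever sees one sequence and `reduce_counts` only ever sees one key, so neither step needs to know about the rest of the data.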
