Essential programmatic tools

There are essentials, and there are essentials. Here are several bioinformatic tools that I use on a daily basis (with a Python bias):

  • Python – rapidly becoming the go-to high-level language in biology. If you aren’t happy with Python, try Perl or Ruby. If you aren’t happy with high-level languages, there’s always C. If you are not happy with C (or C++ or Fortran), there are any number of functional programming languages. Treat each language as a tool for a particular job; no single language is well suited to every task. Python is a very general language and is available in many formats for a number of architectures.

  • Kent Source – probably the most useful (and legendary) package of bioinformatics code currently available for large-scale genomic data manipulation. It is written in C and is very fast. It requires compilation and is available as a zip archive or via git.

  • PyFasta – an optimized library for rapidly accessing massive fasta files. Also available at bitbucket.

  • biopython – one of the oldest bioinformatics libraries for Python, it is a large library with a great deal of functionality. Available in a number of formats at the biopython downloads page.

  • oursql – sooner or later, you are going to find that you need access to a database. Typically, that will be a MySQL database (although I also like PostgreSQL). I prefer oursql because of its buffering. Available from Launchpad.

Runners Up

I find these interesting, but I have yet to put them into the daily rotation:

  • Pygr – bills itself as a “scalable bioinformatics interface”. Provides a number of wonderful ways to access the data and database of your choice.

  • sqlalchemy – a great SQL toolkit and object-relational mapper for Python. Works with a vast array of databases and database APIs.

  • disco – a Python framework that implements the map-reduce approach (manuscript) on top of an Erlang engine. It is highly fault-tolerant and well suited to massive processing jobs where the problem can be parallelized with map-reduce. It also provides some handy tools like Discodex and the discodb object.
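
To make the map-reduce idea concrete, here is a minimal sketch in plain Python. It does not use disco or its API; it only illustrates the pattern on a single machine, using the standard library. The function names and the toy reads are hypothetical, chosen for illustration.

```python
from collections import Counter
from functools import reduce


def map_count_bases(sequence):
    """Map step: count the bases in one sequence, independently of all others."""
    return Counter(sequence.upper())


def reduce_counts(total, partial):
    """Reduce step: merge one partial count into the running total."""
    total.update(partial)
    return total


def count_bases(sequences):
    """Apply the map step to every sequence, then fold the partial results."""
    partial_counts = map(map_count_bases, sequences)
    return reduce(reduce_counts, partial_counts, Counter())


# Hypothetical toy reads; real input would come from a sequence file.
reads = ["ACGT", "acga", "TTGN"]
totals = count_bases(reads)
print(dict(totals))
```

Because each call to the map step touches only its own sequence, those calls could in principle run on different workers, and only the small partial counts would need to be merged. That independence is what makes a problem a good fit for a framework like disco.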


About Brant Faircloth

I'm an Assistant Researcher at the University of California - Los Angeles. My interests include mating behavior, social behavior, the (immuno-)genetic basis of mate choice, genomics of non-model organisms, metagenomics, computer programming, and the integration of molecular and field biology.