Poorly updated databases will affect your results

If you’re anything like me, your research is heavily dependent on the many wonderful database resources available online. NCBI, UniProtKB, Ensembl, Swiss-Prot, EMBL-EBI, and many other sites and organizations offer highly useful (and often curated) molecular information. Can you imagine having a few nucleotide sequences and not being able to BLAST them against a public database? Many of these resources are updated continuously, some even daily.

Lina Wadi and her colleagues analyzed the expansion of gene annotations in the public databases of the Gene Ontology Consortium and Reactome, and found that the number of pathways and processes had doubled in the last seven years. This growth is good news, and it is driven by the massive influx of new data in the last couple of years.


What is highly worrying, however, is that the great majority of publications using gene ontology and pathway enrichment analyses make use of tools that haven’t been updated in many years. According to Wadi et al., 80% of the publications they screened from 2015 used outdated software that only captured 20% of the pathway enrichments available in the current gene annotations.

Of the 21 pathway enrichment tools they surveyed, DAVID was by far the most popular, used in 2,500 publications in 2015 and representing 71% of all software citations. But the gene ontology database in DAVID has not been updated since 2010! This means that if you’re using an outdated tool like DAVID, you may miss around 80% of the pathway enrichments that current gene annotations would reveal, which can strongly affect your results and conclusions.


Fortunately, this problem is easy to avoid. Either use existing pathway enrichment tools that update their databases continuously, or download the very latest database yourself to use in your analyses. My personal strategy is to first download the latest annotation database, run the analyses with various parameters, and evaluate the results. Then, when I’m confident I won’t alter the analyses any more, I download the latest annotation information again and run through everything one last time. Seeing how quickly databases are expanding nowadays, we need to make sure that we’re not using outdated information.
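One way to make the "refresh before the final run" habit harder to forget is to let the analysis script check the age of its local annotation file. The following Python snippet is only a minimal sketch and is not tied to any particular enrichment tool: the function name `is_stale`, the 30-day threshold, and the example file name are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from pathlib import Path


def is_stale(path, max_age_days=30, now=None):
    """Return True if the file is missing or older than max_age_days.

    The file's modification time is used as a stand-in for the download
    date. This is an assumption: copying or touching the file resets it.
    """
    path = Path(path)
    if not path.exists():
        return True
    if now is None:
        now = datetime.now(timezone.utc)
    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return now - modified > timedelta(days=max_age_days)


if __name__ == "__main__":
    # Hypothetical file name; replace with your own annotation file.
    if is_stale("annotations.gaf"):
        print("Annotation file is missing or old: download a fresh copy.")
```

A check like this does not replace a tool that updates its own database, but it gives a visible reminder before the final run.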

In the spirit of reproducibility, remember to keep all your software updated, and to note the version of each tool and the date on which you downloaded each database.
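Keeping these notes by hand is easy to forget, so it can help to write them automatically whenever a database is downloaded. Here is a minimal Python sketch of that idea, using only the standard library. The function names, the `provenance.jsonl` log file, and the fields recorded are my own assumptions for illustration, not a standard format.

```python
import hashlib
import json
import platform
from datetime import date
from pathlib import Path


def provenance_record(db_path, source, tool_versions, downloaded=None):
    """Build a dict describing one downloaded database file."""
    digest = hashlib.sha256()
    with open(db_path, "rb") as handle:
        # Hash in chunks so large annotation files are not loaded at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return {
        "file": Path(db_path).name,
        "source": source,  # where the file came from, as free text
        "downloaded": (downloaded or date.today()).isoformat(),
        "sha256": digest.hexdigest(),
        "python": platform.python_version(),
        "tools": dict(tool_versions),  # tool name -> version, typed in by you
    }


def append_record(record, log_path="provenance.jsonl"):
    """Append the record as one JSON line, leaving earlier entries intact."""
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
```

Calling `append_record(provenance_record(path, source, versions))` after each download gives you a dated log to cite in your methods section, and the checksum lets you confirm later that two runs really used the same file.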

 

Reference: Wadi et al. 2016. Impact of knowledge accumulation on pathway enrichment analysis. bioRxiv. doi: http://dx.doi.org/10.1101/049288


About Elin Videvall

Elin is a PhD candidate in the Molecular Ecology and Evolution Lab, Lund University, Sweden. She studies birds and their microbes by analysing genomes, transcriptomes, and microbiomes. You can find her on Twitter: @ElinVidevall