Handling microbial contamination in NGS data

Until recently, I had given little thought to the potential for unwanted microbial contamination in high throughput sequence data. I suspect that if you’re a molecular ecologist who doesn’t primarily study microbes or work with ancient DNA, you’re in a similar boat: when I did a quick search of the documentation of three popular bioinformatics pipelines for handing the RADseq and UCE data most of us deal with (Pyrad, Phyluce, and Stacks) I found no hits for specific features dealing with exogenous microbial DNA.

Is it in your data? Photo credit Rocky Mountain Laboratories, NIAID, NIH

I was therefore surprised to learn — during a recent project involving DNA from both fresh tissues and historical museum specimens — that almost all my samples of both origins contained reads from Escherichia coli and other bacteria. Of course, it’s easy to see how this can occur, even with stringent wet-lab protocols. After all, the tissue we extract DNA from is rarely sterilized, samples may contain symbionts, errors are made, etc. Which may not be a big deal: while the potential for contamination to bias downstream analyses is well documented, it’s not clear how significant trace quantities of E. coli or other microbes are for the majority of studies of plants or vertebrates. But if you are concerned for one reason or other, are working with ancient DNA, or just want to follow best practices, what tools are available to you? Here’s what my collaborator Zach Hanna and I dug up:

Cleaning raw reads

The obvious starting point and most precise approach and for minimizing contamination concerns is to identify and remove reads prior to assembly and any other downstream analysis, eliminating the slim but non-zero possibility that some number assemblies become chimeras of target and contaminant sequence. The most comprehensive tool I’m aware of for this end is DeconSeq, which automatically detects and removes sequence contaminations from genomic and metagenomic datasets by attempting to align your .fasta or .fastq files to its in-house databases of human, microbial, and viral reference genomes. DeconSeq can either be run through an interactive web interface (capped at 600MB file size limit), or downloaded and installed locally (though this takes up a lot of space).

Alternatively, with .fasta file(s) for a custom collection of putative contaminants, you can apply an read aligner tool such as bowtie2 relatively simply.

First, you’ll need to index a file (contaminants.fasta) containing sequences of all contaminants:

bowtie-build contaminants.fasta IDX/contaminants_idx

Then, you can use the --un flag to write all raw reads from your sample (raw1.fq) that could not be aligned to this collection to a new, contaminant-free fastq file (nocontaminants.fq):

bowtie --un nocontaminants.fq IDX/contaminants_idx raw1.fq

Simpler yet is running the wrapper script 2-ScrubReads from Berkeley’s CGRL QB3 denovoTargetCapture bioinformatics pipeline to automate this process, which includes the same method contaminant removal along with adapter trimming and other quality control measures. This example, from the pipeline’s documentation, checks reads against the E. coli reference genome e_coli_K12.fasta:

ke@NGS:~/Desktop/SeqCap/data$ 2-ScrubReads -f ~/Desktop/SeqCap/data/rawdata/raw/pre-clean/ -o ~/Desktop/SeqCap/data/cleaned_data/ -a ~/Desktop/SeqCap/denovoTargetCapture/associated_files/Adapters.fasta -b ~/Desktop/SeqCap/denovoTargetCapture/associated_files/libInfo.txt -t /home/ke/Desktop/SeqCap/programs/Trimmomatic-0.32/trimmomatic-0.32.jar -c ~/Desktop/SeqCap/denovoTargetCapture/associated_files/ecoli/e_coli_K12.fasta -e 200 -m 15 -z

Removing contaminants from assemblies

Though eliminating raw reads will almost always be the best tactic to take, there may be circumstances in which you wish to identify and remove contaminant contigs following assembly. Inputting .fasta files of your assemblies to DeconSeq should still work for this goal, as will blastn searches with your sample of interest (sample1) as query against a custom blastn database (contaminants) from a .fasta of potential contaminant sequences (contaminants.fasta):

makeblastdb -in contaminants.fasta -parse_seqids -dbtype nucl
blastn -db contaminants -query sample1.fasta -outfmt 10 -num_alignments 5 -max_hsps 1 -out blast-output.m6

A more thorough (if somewhat storage-intensive) alternative to this approach is to perform the search against a local copy of the entire NCBI GenBank nucleotide database (nt):

blastn -db nt -query sample1.fasta -outfmt 10 -num_alignments 5 -max_hsps 1 -out blast-output.m6

This will produce an .m6 file with a huge number of potential alignments, as you might expect, and will next require a method of filtering results to identify contaminants. If you’re working with vertebrates, Zach and others wrote a script explicitly for this purpose. (Note that this relies on NCBI’s now-obsolete GI sequence identifier, and will be less effective on versions of the GenBank database downloaded after September 2016.)

One way or other, once you’ve identified a list of contigs you’d like to remove, you’ll need to cut them from your assemblies. While you can do this manually with command-F, or doubtlessly come up with a clever unix hack. (I ended up writing a set of R scripts to make the process simpler, with functions to collate a list of contaminant contig IDs from multiple BLAST search outputs and then remove them from either a single fasta file or a directory of fasta files.)

Removing SNPs from contaminant contigs

Finally, there are circumstances in which you may find yourself needing to remove contaminant sites from a SNP matrix. If your matrix was generated by aligning raw reads to a reliable reference genome which is itself free of contaminants, you should have no problems. But if you are aligning reads to a “pseudo-reference” generated from de novo assembly of one or several of your samples, as is common when working with non-model organisms, there’s always the possibility that reads will align with contaminants in this psuedo-reference. It’s more likely than not that depth cutoffs and other filtering steps will eliminate these sites. But if you’re being thorough and have identified contaminant contigs in your pseudo-reference genome with one of the methods described above, it’s preferable to leave these contigs where they are during alignment, to reduce the chance of mismatches. Then, you can remove all sites from suspect sequences (C7961234;C7963448;C8091874) on a set of .bam files using samtools / bioawk on a sorted .bam file (sorted-piledup.bam) generated by samtools mpileup function…

$ samtools view -h sorted-piledup.bam | awk '{if($3 != "C7961234" && $3 != "C7963448" && $3 != "C8091874" && $2 != "SN:C7961234" && $2 != "SN:C7963448" && $2 != "SN:C8091874"){print $0}}' | samtools view -Sb - > output-NoContamNoMito.bam

….or, from a .vcf file, using vcftools and the --not-chr flag:

vcftools --vcf input.vcf --not-chr C7961234 --not-chr C7963448 --not-chr C8091874 --recode --recode-INFO-all --out nocontamination


Our own approach was applied to hyRAD libraries, lacking a widely-used bioinformatics pipeline, and was necessarily a little cobbled together. My hunch is that plenty of researchers have a more formal set of protocols for dealing with contamination. We’d love to hear them in the comments.


Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. 2008. BLAST+: architecture and applications. BMC Bioinformatics. DOI:10.1186/1471-2105-10-42

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST, McVean G, Durbin R, 1000 Genomes Project Analysis Group. 2011. The variant call format and VCFtools. Bioinformatics. DOI:10.1093/bioinformatics/btr330

Henderson JB, Hanna ZR. GItaxidIsVert. DOI: 10.5281/zenodo.163737

Langmead B, Salzberg SL. 2012. Fast gapped-read alignment with Bowtie 2. Nature Methods. DOI: 10.1038/nmeth.1923

Laurin-Lemay S, Brinkmann H, Philippe H. 2012. Origin of land plants revisited in the light of sequence contamination and missing data. Current Biology. DOI: 10.1016/j.cub.2012.06.013

Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R; 1000 Genome Project Data Processing Subgroup. 2009. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics. DOI:10.1093/bioinformatics/btp352

About Ethan Linck

I'm a Ph.D. Candidate at the Department of Biology and the Burke Museum of Natural History, University of Washington, Seattle. My research uses museum specimens and genomic data to analyze and archive avian biodiversity and evolution, particularly in western North America and Melanesia.
This entry was posted in bioinformatics, genomics, howto, next generation sequencing, technical and tagged , , , . Bookmark the permalink.