Fast Identification and Removal of Sequence Contamination from Genomic and Metagenomic Datasets

By Robert Schmieder and Robert Edwards

Abstract

High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants is a necessary and required step for all sequencing projects.

We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets (150 bp mean read length). DeconSeq is publicly available as standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used for the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we conducted an analysis of possible human DNA contamination in 202 previously published microbial and viral metagenomes and found possible contamination in 145 (72%) metagenomes with as high as 64% contaminating sequences.

This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of the non-human associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/
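
The classification step described in the abstract (flag reads that match a contaminant genome, but keep those with a better match to a non-contaminant genome) can be illustrated with a minimal sketch. This is not DeconSeq's actual code: the `Hit` record, the database labels "remove" and "retain", the category names, and the identity and coverage thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One alignment of a read against a reference database (hypothetical record)."""
    read_id: str
    database: str    # "remove" = contaminant genomes, "retain" = non-contaminant genomes
    identity: float  # percent identity of the alignment, 0-100
    coverage: float  # percent of the read covered by the alignment, 0-100


def classify_reads(read_ids, hits, min_identity=94.0, min_coverage=90.0):
    """Label each read as "clean", "contaminant", or "retained".

    The thresholds are illustrative assumptions, not values taken from the abstract.
    """
    # Keep the best passing (identity, coverage) pair per read and database.
    best = {}
    for hit in hits:
        if hit.identity < min_identity or hit.coverage < min_coverage:
            continue  # alignment too weak to count as a match
        key = (hit.read_id, hit.database)
        score = (hit.identity, hit.coverage)
        if key not in best or score > best[key]:
            best[key] = score

    labels = {}
    for read_id in read_ids:
        remove = best.get((read_id, "remove"))
        retain = best.get((read_id, "retain"))
        if remove is None:
            labels[read_id] = "clean"        # no convincing contaminant match
        elif retain is not None and retain > remove:
            labels[read_id] = "retained"     # better match to a non-contaminant genome
        else:
            labels[read_id] = "contaminant"
    return labels


example_hits = [
    Hit("r1", "remove", 98.0, 95.0),
    Hit("r2", "remove", 95.0, 92.0),
    Hit("r2", "retain", 99.0, 97.0),
    Hit("r3", "remove", 80.0, 95.0),  # below the identity threshold
]
example_labels = classify_reads(["r1", "r2", "r3", "r4"], example_hits)
```

In this sketch, comparing the (identity, coverage) pairs is one simple way to decide which database gives the better match. A real pipeline would take these values from the aligner's output and might weigh them differently.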

Topics: Research Article
Publisher: Public Library of Science
OAI identifier: oai:pubmedcentral.nih.gov:3052304
Provided by: PubMed Central
