159 research outputs found

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

Author: A. Hintze
A. Howe
C. T. Brown
Compeau
Gans
Gilbert
Gilbert
Grabherr
Hess
Iqbal
J. M. Tiedje
J. Pell
Kelley
Liu
Mackelprang
Melsted
Miller
Pevzner
Price
Qin
R. Canino-Koning
Shi
Wooley
ZHANG
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 29/06/2012

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
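
The idea of storing an assembly graph in a Bloom filter can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name, the bit-array size, the number of hash functions, and the use of salted BLAKE2 hashes are all assumptions made for illustration, and reverse complements are ignored. The filter stores only k-mer membership; edges are implicit and are recovered by querying the eight possible one-base extensions of a k-mer, so a false positive in the filter shows up as a spurious neighbor.

```python
import hashlib


class KmerBloomFilter:
    """Toy Bloom filter over k-mers (sizes and hashing are illustrative assumptions)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # One salted BLAKE2 digest per hash function, reduced to a bit position.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                kmer.encode(), salt=seed.to_bytes(8, "little")
            ).digest()
            yield int.from_bytes(digest[:8], "little") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # True for every added k-mer; may also be True for some k-mers never added.
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(kmer)
        )


def kmers(sequence, k):
    """Yield every overlapping k-mer of a sequence."""
    return (sequence[i:i + k] for i in range(len(sequence) - k + 1))


def neighbors(bloom, kmer):
    """Return the one-base extensions of kmer that the filter reports as present."""
    found = []
    for base in "ACGT":
        for candidate in (kmer[1:] + base, base + kmer[:-1]):
            if candidate != kmer and candidate in bloom and candidate not in found:
                found.append(candidate)
    return found


# Usage: load the 5-mers of one hypothetical read, then walk one edge.
read = "ACGTTGCATGGAC"
bloom = KmerBloomFilter(num_bits=4096, num_hashes=3)
for km in kmers(read, 5):
    bloom.add(km)
successors = neighbors(bloom, "ACGTT")
```

Because a Bloom filter has no false negatives, every true edge of the graph is found; the inexactness mentioned in the abstract comes only from extra k-mers that the filter wrongly reports as present.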

arXiv.org e-Print Archive

Crossref

Recovering complete and draft population genomes from metagenome datasets.

Author: Gilbert Jack A
Sangwan Naseer
Xia Fangfang
Publication venue: eScholarship, University of California
Publication date: 01/03/2016

Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of applying them to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on the genome-wide evolution.

Woods Hole Open Access Server

Springer - Publisher Connector

PubMed Central

eScholarship - University of California

These are not the k-mers you are looking for: efficient online k-mer counting using a probabilistic data structure

Author: Brown C. Titus
Canino-Koning Rosangela
Howe Adina Chuang
Pell Jason
Zhang Qingpeng
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 14/07/2014

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory-efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix arrays, and trie structures, khmer relies entirely on a simple probabilistic data structure, a Count-Min Sketch. The Count-Min Sketch permits online updating and retrieval of k-mer counts in memory, which is necessary to support online k-mer analysis algorithms. On sparse data sets this data structure is considerably more memory efficient than any exact data structure. In exchange, the use of a Count-Min Sketch introduces a systematic overcount for k-mers; moreover, only the counts, and not the k-mers, are stored. Here we analyze the speed, the memory usage, and the miscount rate of khmer for generating k-mer frequency distributions and retrieving k-mer counts for individual k-mers. We also compare the performance of khmer to several other k-mer counting packages, including Tallymer, Jellyfish, BFCounter, DSK, KMC, Turtle and KAnalyze. Finally, we examine the effectiveness of profiling sequencing error, k-mer abundance trimming, and digital normalization of reads in the context of high khmer false positive rates. khmer is implemented in C++ wrapped in a Python interface, offers a tested and robust API, and is freely available under the BSD license at github.com/ged-lab/khmer.
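
The trade-off described above (online updates, counts only, systematic overcount) can be seen in a small Count-Min Sketch written in plain Python. This is a sketch under stated assumptions and does not use or reproduce the khmer API: the class name, the table width and depth, and the salted BLAKE2 hashing are illustrative choices.

```python
import hashlib
from collections import Counter


class CountMinSketch:
    """Toy Count-Min Sketch for k-mer counts (parameters are illustrative assumptions)."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _index(self, kmer, row):
        # A different salt per row gives each row its own hash function.
        digest = hashlib.blake2b(
            kmer.encode(), salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest[:8], "little") % self.width

    def add(self, kmer):
        # Online update: increment one counter in every row.
        for row in range(self.depth):
            self.tables[row][self._index(kmer, row)] += 1

    def count(self, kmer):
        # Collisions can only inflate a counter, so the minimum across rows
        # is the tightest estimate and is never below the true count.
        return min(
            self.tables[row][self._index(kmer, row)] for row in range(self.depth)
        )


# Usage: count 4-mers from a few hypothetical reads and compare with exact counts.
reads = ["ACGTACGTGG", "CGTACGTTAC", "ACGTACGTGG"]
k = 4
sketch = CountMinSketch(width=64, depth=4)
exact = Counter()
for read in reads:
    for i in range(len(read) - k + 1):
        kmer = read[i:i + k]
        sketch.add(kmer)
        exact[kmer] += 1
```

Only the counter tables are kept, so the k-mers themselves cannot be listed back out of the sketch, and every reported count is an upper bound on the true count.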

arXiv.org e-Print Archive

Directory of Open Access Journals

PubMed Central

Assembling large, complex environmental metagenomes

Author: Brown C. Titus
Howe Adina Chuang
Jansson Janet
Malfatti Stephanie A.
Tiedje James M.
Tringe Susannah G.
Publication date: 12/12/2012

The large volumes of sequencing data required to sample complex environments deeply pose new challenges to sequence analysis approaches. De novo metagenomic assembly effectively reduces the total amount of data to be analyzed but requires significant computational resources. We apply two pre-assembly filtering approaches, digital normalization and partitioning, to make large metagenome assemblies more computationally tractable. Using a human gut mock community dataset, we demonstrate that these methods result in assemblies nearly identical to assemblies from unprocessed data. We then assemble two large soil metagenomes from matched Iowa corn and native prairie soils. The predicted functional content and phylogenetic origin of the assembled contigs indicate significant taxonomic differences despite similar function. The assembly strategies presented are generic and can be extended to any metagenome; full source code is freely available under a BSD license. Comment: Includes supporting information
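
The abstract names digital normalization without describing it. As the technique is commonly described, a read is kept only if the median count of its k-mers, among the reads kept so far, is below a coverage cutoff. The Python sketch below works under that assumption; the function name, the exact dictionary-based counter (in place of a probabilistic one), and the parameter values are illustrative and are not taken from the paper.

```python
from collections import Counter
from statistics import median


def digital_normalize(reads, k, cutoff):
    """Keep a read only while the median count of its k-mers is below cutoff."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k: nothing to estimate coverage from
        if median(counts[km] for km in read_kmers) < cutoff:
            # Only reads that are kept contribute to the running counts.
            counts.update(read_kmers)
            kept.append(read)
    return kept


# Usage: five copies of one hypothetical read plus a single distinct read.
reads = ["ACGTTGCA"] * 5 + ["GGATCCAT"]
kept = digital_normalize(reads, k=4, cutoff=2)
```

With a cutoff of 2, the redundant copies beyond the second are discarded while the low-coverage read is retained, which is how the filter reduces data volume before assembly.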

arXiv.org e-Print Archive

eScholarship - University of California

Using cascading Bloom filters to improve the memory usage for de Brujin graphs

Author: A. Bowe
A. Kirsch
E. Porat
F.R. Blattner
J. Pell
J.R. Miller
M.G. Grabherr
P.A. Pevzner
R. Chikhi
T.C. Conway
Y. Peng
Z. Iqbal
Publication date: 01/01/2013

De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of [3] that represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory with respect to the method of [3], with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to [3]. This is, to our knowledge, the best practical representation for de Bruijn graphs. Comment: 12 pages, submitte

arXiv.org e-Print Archive

CiteSeerX

Crossref

Springer - Publisher Connector

INRIA a CCSD electronic archive server

PubMed Central

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

A review of bioinformatics tools for bio-prospecting from metagenomic sequence data

Author: Alneberg
Aziz
Bloom
Boisvert
Bowe
Brady
Brown
Buermans
Burge
Caspi
Chaudhuri
Chin
Cleary
Consortium
Corduneanu
Cowan
Delcher
Fabregat
Flicek
Glass
Goodwin
Handelsman
Haw
Hess
Hunter
Hunter
Ip
Jain
Kanehisa
Kanehisa
Kelder
Kelley
Kelley
Kislyuk
Koren
Koren
Krogh
Kurtz
Li
Li
Loman
Markowitz
Markowitz
Mikheenko
Mitchell
Mäkinen
Nagarajan
Namiki
Noguchi
Noguchi
Pell
Peng
Pevzner
Pico
Rho
Richardson
Roehe
Sato
Seemann
Strous
Sunagawa
Ter-hovhannisyan
Treangen
Urban
van Dijk
Venter
Wallace
Wang
Watson
Zerbino
Zhu
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2017

The microbiome can be defined as the community of microorganisms that live in a particular environment. Metagenomics is the practice of sequencing DNA from the genomes of all organisms present in a particular sample, and has become a common method for the study of microbiome population structure and function. Increasingly, researchers are finding novel genes encoded within metagenomes, many of which may be of interest to the biotechnology and pharmaceutical industries. However, such “bioprospecting” requires a suite of sophisticated bioinformatics tools to make sense of the data. This review summarizes the most commonly used bioinformatics tools for the assembly and annotation of metagenomic sequence data with the aim of discovering novel genes.

Crossref

Frontiers - Publisher Connector

PubMed Central

Edinburgh Research Explorer

The khmer software package: enabling efficient nucleotide sequence analysis

Author: Crusoe Michael R.
Skennerton Connor T.
Publication venue: F1000 Research Ltd.
Publication date: 25/09/2015

The khmer package is a freely available software library for working efficiently with fixed length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/

Caltech Authors