Space-efficient and exact de Bruijn graph representation based on a Bloom filter

Author: A Bowe
A Kirsch
B Chazelle
C Kingsford
C Ye
G Marçais
G Rizk
G Sacomoto
Guillaume Rizk
J Pell
JR Miller
JT Simpson
MG Grabherr
P Peterlongo
R Chikhi
R Li
Rayan Chikhi
RL Warren
RM Idury
SL Salzberg
TC Conway
Y Peng
Z Iqbal
Publication venue: 'Springer Science and Business Media LLC'

Using cascading Bloom filters to improve the memory usage for de Brujin graphs

Author: A. Bowe
A. Kirsch
E. Porat
F.R. Blattner
J. Pell
J.R. Miller
M.G. Grabherr
P.A. Pevzner
R. Chikhi
T.C. Conway
Y. Peng
Z. Iqbal
Publication date: 01/01/2013

De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing (NGS) data. Because NGS datasets are very large, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of [3], which represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory than the method of [3], with insignificant impact on construction time. At the same time, our experiments showed a better query time compared to [3]. This is, to our knowledge, the best practical representation for de Bruijn graphs.

arXiv.org e-Print Archive

CiteSeerX

Crossref

Springer - Publisher Connector

INRIA a CCSD electronic archive server

PubMed Central

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

Author: A. Hintze
A. Howe
C. T. Brown
Compeau
Gans
Gilbert
Grabherr
Hess
Iqbal
J. M. Tiedje
J. Pell
Kelley
Liu
Mackelprang
Melsted
Miller
Pevzner
Price
Qin
R. Canino-Koning
Shi
Wooley
ZHANG
Publication venue: 'Proceedings of the National Academy of Sciences'
Publication date: 29/06/2012

Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.
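
The idea of storing a de Bruijn graph in a Bloom filter can be illustrated with a minimal Python sketch. It is not the authors' implementation: the class and function names (`BloomFilter`, `build_kmer_filter`, `successors`), the choice of BLAKE2b hashing, and the default of three hash functions are assumptions made for illustration, and reverse complements are ignored. Only the k-mers (nodes) are stored, and edges are inferred by querying the four possible one-base extensions. The filter never misses a stored k-mer, but it may report k-mers that were never inserted, which is the inexactness the abstract refers to.

```python
import hashlib


class BloomFilter:
    """A plain Bloom filter over strings (illustrative, not tuned for speed)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with the hash index.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode("ascii"),
                digest_size=8,
                salt=seed.to_bytes(8, "little"),
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )


def build_kmer_filter(reads, k, expected_kmers, bits_per_kmer=4, num_hashes=3):
    """Insert every k-mer of every read into a filter sized at bits_per_kmer."""
    bloom = BloomFilter(max(1, expected_kmers * bits_per_kmer), num_hashes)
    for read in reads:
        for i in range(len(read) - k + 1):
            bloom.add(read[i:i + k])
    return bloom


def successors(bloom, kmer):
    """Return the k-mers reachable by one base; may include false positives."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]
```

With 4 bits per k-mer and three hash functions, the standard Bloom filter estimate gives a false-positive rate of roughly 15%, so `successors` can return spurious neighbors. Every true neighbor is always returned, because a Bloom filter has no false negatives.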

arXiv.org e-Print Archive

Crossref

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Author: Benoit Gaëtan
Dayris Thibault
Drezen Erwan
Lavenier Dominique
Lemaitre Claire
Rizk Guillaume
Uricaru Raluca
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 14/09/2015

The data volumes generated by next-generation sequencing (NGS) technologies are now a major concern for both data storage and transmission. This has triggered the need for more efficient methods than general-purpose compression tools, such as the widely used gzip. We present a novel reference-free method for compressing data produced by high-throughput sequencing technologies. Our approach, implemented in the software LEON, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph by storing an anchoring k-mer and a list of bifurcations. The same probabilistic de Bruijn graph is used to perform a lossy transformation of the quality scores, which yields higher compression rates without losing information pertinent to downstream analyses. LEON was run on various real sequencing datasets (whole genome, exome, RNA-seq, or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole-genome sequencing dataset, LEON divided the original file size by more than 20. LEON is open-source software, distributed under the GNU Affero GPL License, available for download at http://gatb.inria.fr/software/leon/
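
The encoding of a read as an anchoring k-mer plus a list of bifurcations can be sketched as follows. This is a simplified illustration, not LEON's actual format or code: the function names are hypothetical, an exact Python set stands in for the Bloom filter, the anchor is simply the first k-mer of the read, and every k-mer of the read is assumed to be in the graph (so sequencing errors and quality scores are not handled). A base is stored only where the graph offers more than one extension. Elsewhere the decoder can follow the unique path on its own.

```python
def kmer_set(reads, k):
    """Exact k-mer set; a stand-in for the Bloom filter (assumption)."""
    return {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}


def successors(graph, kmer):
    """Bases that extend kmer to another k-mer present in the graph."""
    return [base for base in "ACGT" if kmer[1:] + base in graph]


def encode(read, graph, k):
    """Encode a read as (anchor k-mer, bases chosen at bifurcations, length)."""
    anchor = read[:k]
    bifurcations = []
    current = anchor
    for base in read[k:]:
        options = successors(graph, current)
        if base not in options:
            raise ValueError("read leaves the graph; not handled in this sketch")
        if len(options) > 1:
            # Ambiguous extension: the decoder needs to be told which way to go.
            bifurcations.append(base)
        current = current[1:] + base
    return anchor, bifurcations, len(read)


def decode(anchor, bifurcations, length, graph):
    """Rebuild the read by walking the graph from the anchor."""
    bases = list(anchor)
    current = anchor
    pending = iter(bifurcations)
    while len(bases) < length:
        options = successors(graph, current)
        base = options[0] if len(options) == 1 else next(pending)
        bases.append(base)
        current = current[1:] + base
    return "".join(bases)
```

Swapping the exact set for a Bloom filter would keep the round trip lossless as long as the encoder and decoder query the same filter: false positives only add extra bifurcations, which cost a stored base but never change the decoded read.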

HAL-CentraleSupelec

INRIA a CCSD electronic archive server

PubMed Central

HAL-Rennes 1

Efficient String Graph Construction Algorithm

Author: Morshed S.M. Iqbal
Publication venue: 'Information Bulletin on Variable Stars (IBVS)'
Publication date: 01/05/2019

In genome assembly research, where assemblers are dominated by de Bruijn graph-based approaches, the string graph-based assembly approach is getting more attention because of its ability to losslessly retain information from sequence data. Despite the advantages a string graph provides in repeat detection and in maintaining read coherence, the high computational cost of constructing a string graph hinders its usability for genome assembly. Even though different algorithms have been proposed over the last decade for string graph construction, efficiency is still a challenge because of the large amount of sequence data generated by NGS technologies that must be processed. Therefore, in this thesis, we provide a novel, linear-time and alphabet-size-independent algorithm, SOF, which uses the properties of irreducible edges and transitive edges to efficiently construct a string graph from an overlap graph. Experimental results show that SOF is at least 2 times faster than the string graph construction algorithm provided in SGA, one of the most popular string graph-based assemblers, while maintaining almost the same memory footprint as SGA. Moreover, the availability of SOF as a subprogram in the SGA assembly pipeline will give users access to the preprocessing and postprocessing steps for genome assembly provided in SGA.
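
The distinction between irreducible and transitive edges can be illustrated with a naive transitive reduction of an overlap graph. This sketch is not the SOF algorithm and is not linear time: it uses a simplified definition in which an edge u -> w is transitive whenever some v exists with u -> v and v -> w, and it ignores overlap lengths and read orientation. The function name and the dictionary-of-sets graph representation are assumptions made for the example.

```python
def reduce_transitive_edges(overlaps):
    """Drop edges u -> w that are implied by a two-step path u -> v -> w.

    overlaps maps each read id to the set of read ids it overlaps with
    (directed). The edges that remain are the irreducible ones, which is
    what a string graph keeps.
    """
    transitive = set()
    for u, targets in overlaps.items():
        for v in targets:
            if v == u:
                continue
            for w in overlaps.get(v, ()):
                # Skip self-loops so they cannot mark u -> v itself as transitive.
                if w != v and w in targets:
                    transitive.add((u, w))
    return {
        u: {v for v in targets if (u, v) not in transitive}
        for u, targets in overlaps.items()
    }
```

For three reads where r1 overlaps r2 and r3, and r2 overlaps r3, the edge r1 -> r3 carries no extra information and is removed, leaving the chain r1 -> r2 -> r3.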

University of Central Florida (UCF): STARS (Showcase of Text, Archives, Research & Scholarship)