31 research outputs found

    Using cascading Bloom filters to improve the memory usage for de Bruijn graphs

    De Bruijn graphs are widely used in bioinformatics for processing next-generation sequencing data. Due to the very large size of NGS datasets, it is essential to represent de Bruijn graphs compactly, and several approaches to this problem have been proposed recently. In this work, we show how to reduce the memory required by the algorithm of [3], which represents de Bruijn graphs using Bloom filters. Our method requires 30% to 40% less memory than the method of [3], with insignificant impact on construction time. At the same time, our experiments showed better query times than [3]. This is, to our knowledge, the best practical representation of de Bruijn graphs.
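
    For intuition, the following minimal Python sketch shows the query-side logic of such a cascade, assuming a simple double-hashing Bloom filter and alternating levels ending in a small exact set; it is an illustration, not the authors' implementation, and the sizing of each level from the expected number of critical false positives is omitted.

```python
import hashlib

class BloomFilter:
    """Plain Bloom filter using Kirsch-Mitzenmacher double hashing."""
    def __init__(self, n_bits, n_hashes):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, item):
        d = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(d[:8], "big")
        h2 = int.from_bytes(d[8:16], "big") | 1
        return [(h1 + i * h2) % self.n_bits for i in range(self.n_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

class CascadingBloomFilter:
    """levels[0] holds the true k-mers; levels[1] the critical false
    positives of levels[0]; levels[2] the true k-mers still passing
    levels[1]; and so on, down to a small exact set, so that queries
    from the indexed neighborhood are answered exactly."""
    def __init__(self, levels, exact_tail):
        self.levels = levels          # list of BloomFilter, built elsewhere
        self.exact_tail = exact_tail  # plain set terminating the cascade

    def __contains__(self, kmer):
        for i, bf in enumerate(self.levels):
            if kmer not in bf:
                # Missing from an even level (true-k-mer side): absent.
                # Missing from an odd level (false-positive side): present.
                return i % 2 == 1
        in_tail = kmer in self.exact_tail
        # The exact tail plays the role of the next level in the alternation.
        return in_tail if len(self.levels) % 2 == 0 else not in_tail
```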

    Recovering complete and draft population genomes from metagenome datasets

    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of application to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
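
    As a toy illustration of the coverage-covariation signal described above (not any published binner, which would typically also use sequence composition), the sketch below clusters contigs by their normalized per-sample coverage profiles; the k-means details are illustrative assumptions.

```python
import numpy as np

def bin_contigs_by_coverage(coverage, n_bins, n_iter=100, seed=0):
    """coverage: (n_contigs, n_samples) matrix of mean per-sample depth.
    Returns one bin label per contig from k-means on log-scaled,
    row-normalized profiles, so absolute abundance is factored out."""
    rng = np.random.default_rng(seed)
    x = np.log1p(coverage.astype(float))
    x /= np.linalg.norm(x, axis=1, keepdims=True) + 1e-9
    centers = x[rng.choice(len(x), n_bins, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((x[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for b in range(n_bins):
            if np.any(labels == b):
                centers[b] = x[labels == b].mean(axis=0)
    return labels

# Toy usage: contigs 0-2 covary across four samples (one genome),
# contigs 3-4 follow a different profile (another genome).
cov = np.array([[10, 50, 5, 20], [12, 60, 6, 24], [9, 45, 5, 18],
                [40, 2, 30, 1], [44, 3, 33, 1]])
print(bin_contigs_by_coverage(cov, n_bins=2))  # typically [0 0 0 1 1], up to relabeling
```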

    DIDA: Distributed Indexing Dispatched Alignment

    One essential application in bioinformatics that is affected by the high-throughput sequencing data deluge is sequence alignment, where nucleotide or amino acid sequences are queried against targets to find regions of close similarity. When there are many queries and/or the targets are large, the alignment process becomes computationally challenging. This is usually addressed by preprocessing techniques, where the queries and/or targets are indexed for fast access while searching for matches. When the target is static, such as an established reference genome, the cost of indexing is amortized by reusing the generated index. However, when the targets are non-static, such as contigs in the intermediate steps of a de novo assembly process, a new index must be computed for each run. To address such scalability problems, we present DIDA, a novel framework that distributes the indexing and alignment tasks into smaller subtasks over a cluster of compute nodes. It provides a workflow beyond the common practice of embarrassingly parallel implementations. DIDA is a cost-effective, scalable and modular framework for the sequence alignment problem in terms of memory usage and runtime. It can be employed for large-scale alignments against draft genomes and intermediate stages of de novo assembly runs. The DIDA source code, sample files and user manual are available through http://www.bcgsc.ca/platform/bioinfo/software/dida. The software is released under the British Columbia Cancer Agency (BCCA) license and is free for academic use.
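
    The partition/index/dispatch pattern can be sketched as follows; this is a schematic single-process Python illustration, not DIDA's actual code, and the seed length and round-robin partitioning are assumptions.

```python
from collections import defaultdict

K = 11  # seed length; an illustrative parameter

def seeds(seq, k=K):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def build_partition_indexes(targets, n_parts):
    """Round-robin targets into n_parts partitions; each partition keeps
    its own seed -> [(target_id, offset)] index (one index per node)."""
    indexes = [defaultdict(list) for _ in range(n_parts)]
    for tid, t in enumerate(targets):
        idx = indexes[tid % n_parts]
        for i in range(len(t) - K + 1):
            idx[t[i:i + K]].append((tid, i))
    return indexes

def dispatch_and_align(query, indexes):
    """Send the query only to partitions sharing at least one seed with it,
    and report seed hits (a real aligner would extend these into alignments)."""
    hits = []
    qseeds = seeds(query)
    for part, idx in enumerate(indexes):
        if qseeds & idx.keys():  # the dispatch filter
            for s in qseeds:
                hits += [(part, tid, off) for tid, off in idx.get(s, [])]
    return hits

targets = ["ACGTACGTACGTTTTGGG", "TTTTCCCCGGGGAAAACGT"]
indexes = build_partition_indexes(targets, n_parts=2)
print(dispatch_and_align("ACGTACGTACGT", indexes))  # hits in partition 0 only
```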

    EXFI: Exon and splice graph prediction without a reference genome

    For population genetic studies in nonmodel organisms, it is important to use every available source of genomic information. This paper presents EXFI, a Python pipeline that predicts the splice graph and exon sequences using an assembled transcriptome and raw whole-genome sequencing reads. The main algorithm uses Bloom filters to remove reads that are not part of the transcriptome, predicts the intron-exon boundaries, calls exons from the assembly, and generates the underlying splice graph. The results are returned in GFA1 format, which encodes both the predicted exon sequences and how they are connected to form transcripts.
    Funding: Basque Government, predoctoral grant PRE_2017_2_0169 and grant IT558-1.
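
    One way to picture the boundary-prediction step: transcript k-mers that span an exon-exon junction are absent from the raw genomic reads, so maximal runs of genome-supported k-mers delimit exons. In the sketch below, a plain Python set stands in for the Bloom filter, and the splitting and GFA1 emission rules are illustrative assumptions rather than EXFI's exact algorithm.

```python
K = 5  # toy k-mer size

def predict_exons(transcript, genome_kmers, k=K):
    """Return (start, end) spans of maximal runs of transcript k-mers
    found in genome_kmers, extended to full sequence coordinates."""
    present = [transcript[i:i + k] in genome_kmers
               for i in range(len(transcript) - k + 1)]
    exons, start = [], None
    for i, p in enumerate(present + [False]):  # sentinel closes the last run
        if p and start is None:
            start = i
        elif not p and start is not None:
            exons.append((start, i - 1 + k))  # last supported k-mer spans k bases
            start = None
    return exons

def to_gfa1(transcript_id, transcript, exons):
    """Emit GFA1-style S (segment) and L (link) records for the exons."""
    lines = []
    for n, (s, e) in enumerate(exons):
        lines.append(f"S\t{transcript_id}.e{n}\t{transcript[s:e]}")
    for n in range(len(exons) - 1):
        lines.append(f"L\t{transcript_id}.e{n}\t+\t{transcript_id}.e{n+1}\t+\t0M")
    return "\n".join(lines)

# Toy usage: the genome covers both exons but not the junction k-mers.
genome = "AAAAACCCCC" + "NNNN" + "GGGGGTTTTT"   # intron shown as N's
genome_kmers = {genome[i:i + K] for i in range(len(genome) - K + 1)}
tx = "AAAAACCCCC" + "GGGGGTTTTT"                # spliced transcript
exons = predict_exons(tx, genome_kmers)
print(exons)                  # [(0, 10), (10, 20)]
print(to_gfa1("tx1", tx, exons))
```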

    Speeding up NGS software development

    The analysis of NGS data remains a time- and space-consuming task. Many efforts have been made to provide efficient data structures for indexing the terabytes of data generated by fast sequencing machines (suffix arrays, the Burrows-Wheeler transform, Bloom filters, etc.). Mappers, genome assemblers, SNP callers, etc., make intensive use of these data structures to keep their memory footprint as low as possible. The overall efficiency of NGS software comes from a smart combination of how data are represented in computer memory and how they are processed by the available processing units inside a processor. Developing such software is thus a real challenge, as it requires a large spectrum of competences, from high-level data structure and algorithm concepts to tiny implementation details. We have developed a C++ library, called GATB (Genomic Assembly and Analysis Tool Box), to speed up the design of NGS algorithms. This library offers a panel of high-level optimized building blocks. The underlying data structure is the de Bruijn graph, and the general parallelism model is multithreading. The GATB library targets standard computing resources such as current multicore processors (laptop computers, small servers) with a few GB of memory. Hence, from a high-level C++ API, NGS software designers can rapidly elaborate their own software based on state-of-the-art algorithms and data structures of the domain. To demonstrate the efficiency of the GATB library, several NGS tools have been designed, such as a contig assembler (Minia), a read corrector (Bloocoo) and an SNP discovery tool (DiscoSNP). The GATB library is written in C++ and is available at http://gatb.inria.fr under the GNU Affero GPL license.
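
    As a toy, language-shifted illustration of the building-block style described above (GATB itself is a C++ API; nothing below is GATB code), the sketch splits reads across worker threads, counts k-mers per chunk, and merges the counts, i.e., the node multiset of a de Bruijn graph.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

K = 4  # toy k-mer size

def count_chunk(reads, k=K):
    """Count all k-mers in one chunk of reads."""
    c = Counter()
    for r in reads:
        for i in range(len(r) - k + 1):
            c[r[i:i + k]] += 1
    return c

def parallel_kmer_counts(reads, n_threads=4):
    """Split reads round-robin, count per chunk in a thread pool, merge.
    (CPython's GIL limits real speedup here; GATB gets its parallelism
    from native C++ threads -- this only mirrors the programming model.)"""
    chunks = [reads[i::n_threads] for i in range(n_threads)]
    total = Counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total

reads = ["ACGTACGT", "CGTACGTA", "TTTTACGT"]
print(parallel_kmer_counts(reads).most_common(3))
```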

    Efficient Reconciliation of Genomic Datasets of High Similarity

    We apply Invertible Bloom Lookup Tables (IBLTs) to the comparison of k-mer sets originating from large DNA sequence datasets. We show that for similar datasets, IBLTs provide a more space-efficient and, at the same time, more accurate method for estimating the Jaccard similarity of the underlying k-mer sets, compared to MinHash, the go-to sketching technique for efficient pairwise similarity estimation. This is achieved by combining IBLTs with k-mer sampling based on syncmers, which constitute a context-independent alternative to minimizers and provide an unbiased estimator of Jaccard similarity. A key property of our method is that the data structures involved require space proportional to the difference of the k-mer sets and are independent of the size of the sets themselves. As another application, we show how our ideas can be applied to efficiently compute (an approximation of) the k-mers that differ between two datasets, still using space proportional only to their number. We experimentally illustrate our results on both simulated and real data (SARS-CoV-2 and Streptococcus pneumoniae genomes).
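
    A compact sketch of the scheme: sample closed syncmers from each dataset, insert them into one IBLT each, subtract the tables cell-wise, and peel the symmetric difference out; with d = |A Δ B| recovered and the (cheaply tracked) set sizes, Jaccard is (|A|+|B|-d)/(|A|+|B|+d). All parameters below (k, s, table size, hash choices) are illustrative assumptions, and the code is plain Python, not the paper's implementation.

```python
import hashlib

def h(x, salt):
    d = hashlib.blake2b(f"{salt}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(d, "big")

def closed_syncmers(seq, k=15, s=5):
    """Keep k-mers whose smallest s-mer is at the first or last offset;
    the decision depends only on the k-mer itself (context-independent)."""
    out = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        smers = [kmer[j:j + s] for j in range(k - s + 1)]
        m = min(smers)
        if smers[0] == m or smers[-1] == m:
            out.add(kmer)
    return out

ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(kmer):
    v = 1  # leading 1 preserves length information
    for b in kmer:
        v = v * 4 + ENC[b]
    return v

class IBLT:
    """Each cell keeps (count, xor of keys, xor of checksums); after
    subtracting B's table from A's, only keys in the symmetric
    difference survive and can be peeled from count == +/-1 cells."""
    def __init__(self, m=128, r=3):
        self.m, self.r = m, r
        self.count, self.keysum, self.chksum = [0] * m, [0] * m, [0] * m

    def _cells(self, key):
        return {h(key, i) % self.m for i in range(self.r)}

    def insert(self, key, sign=1):
        for c in self._cells(key):
            self.count[c] += sign
            self.keysum[c] ^= key
            self.chksum[c] ^= h(key, "chk")

    def subtract(self, other):
        for c in range(self.m):
            self.count[c] -= other.count[c]
            self.keysum[c] ^= other.keysum[c]
            self.chksum[c] ^= other.chksum[c]

    def peel(self):
        """After self = A - B, recover (keys only in A, keys only in B)."""
        a_only, b_only, progress = set(), set(), True
        while progress:
            progress = False
            for c in range(self.m):
                if self.count[c] in (1, -1) and self.chksum[c] == h(self.keysum[c], "chk"):
                    key, sign = self.keysum[c], self.count[c]
                    (a_only if sign == 1 else b_only).add(key)
                    self.insert(key, -sign)  # remove it; may free other cells
                    progress = True
        return a_only, b_only

# Usage: two near-identical sequences; table size tracks the difference.
s1 = "ACGGTTACGTAGCATTACGGATTTGCAGCATGGA" * 2
s2 = s1[:-6] + "TTACGG"
A, B = closed_syncmers(s1), closed_syncmers(s2)
ta, tb = IBLT(), IBLT()
for x in A: ta.insert(encode(x))
for x in B: tb.insert(encode(x))
ta.subtract(tb)
a_only, b_only = ta.peel()
d = len(a_only) + len(b_only)  # |A xor B|, recovered from the IBLT alone
jac = (len(A) + len(B) - d) / (len(A) + len(B) + d)  # set sizes are cheap counters
print(f"sampled syncmers: {len(A)} vs {len(B)}, difference {d}, Jaccard ~ {jac:.3f}")
```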