20 research outputs found

BrownieAligner : accurate alignment of Illumina sequencing data to de Bruijn graphs

Author: Fostier Jan
Heydari Mahdi
Miclotte Giles
Van de Peer Yves
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2018

Background: Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string. Results: We present a branch and bound alignment algorithm that uses the seed-and-extend paradigm to accurately align short Illumina reads to a graph. Given a seed, the algorithm greedily explores all branches of the tree until the optimal alignment path is found. To reduce the search space we compute upper bounds to the alignment score for each branch and discard the branch if it cannot improve the best solution found so far. Additionally, by using a two-pass alignment strategy and a higher-order Markov model, paths in the de Bruijn graph that do not represent a subsequence in the original reference genome are discarded from the search procedure. Conclusions: BrownieAligner is applied to both synthetic and real datasets. It generally outperforms other state-of-the-art tools in terms of accuracy, while having similar runtime and memory requirements. Our results show that using the higher-order Markov model in BrownieAligner improves the accuracy, while the branch and bound algorithm reduces runtime. BrownieAligner is written in standard C++11 and released under the GPL license. BrownieAligner relies on multithreading to take advantage of multi-core/multi-CPU systems.
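The branch and bound idea in the abstract above can be illustrated with a minimal sketch. It is not BrownieAligner's implementation: it assumes a toy graph with one character per node, match/mismatch scoring without gaps, and no Markov model. The function name `best_path_alignment`, the `graph`/`labels` dictionaries and the scoring values are all hypothetical.

```python
def best_path_alignment(graph, labels, start, read, match=1, mismatch=-1):
    """Align `read` along paths that begin at `start`.

    graph  : dict mapping node -> list of successor nodes (assumed toy input)
    labels : dict mapping node -> single character carried by that node
    Returns (best_score, best_path); (-inf, None) if no path is long enough.
    """
    best = {"score": float("-inf"), "path": None}

    def extend(node, pos, score, path):
        # Score the character on this node against the current read position.
        score += match if labels[node] == read[pos] else mismatch
        path = path + [node]
        pos += 1
        if pos == len(read):
            if score > best["score"]:
                best["score"], best["path"] = score, path
            return
        # Upper bound: assume every remaining read character matches.
        if score + match * (len(read) - pos) <= best["score"]:
            return  # the branch cannot improve the best solution found so far
        # Greedy order: visit successors that match the next character first.
        for nxt in sorted(graph.get(node, []), key=lambda n: labels[n] != read[pos]):
            extend(nxt, pos, score, path)

    extend(start, 0, 0, [])
    return best["score"], best["path"]


if __name__ == "__main__":
    toy_graph = {0: [1, 2], 1: [3], 2: [3]}
    toy_labels = {0: "A", 1: "C", 2: "G", 3: "T"}
    print(best_path_alignment(toy_graph, toy_labels, 0, "AGT"))
```

Visiting the most promising successor first tends to establish a high best score early, which makes the upper-bound test prune more of the remaining branches.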

Ghent University Academic Bibliography

Directory of Open Access Journals

UPSpace at the University of Pretoria

Sparse Dynamic Programming on DAGs with Small Width

Author: Chikhi Rayan
Gagie Travis
Kuosmanen Anna
Mäkinen Veli
Paavilainen Topi
Tomescu Alexandru I.
Publication date: 01/05/2019

The minimum path cover problem asks us to find a minimum-cardinality set of paths that cover all the nodes of a directed acyclic graph (DAG). We study the case when the size k of a minimum path cover is small, that is, when the DAG has a small width. This case is motivated by applications in pan-genomics, where the genomic variation of a population is expressed as a DAG. We observe that classical alignment algorithms exploiting sparse dynamic programming can be extended to the sequence-against-DAG case by mimicking the algorithm for sequences on each path of a minimum path cover and handling an evaluation order anomaly with reachability queries. Namely, we introduce a general framework for DAG-extensions of sparse dynamic programming. This framework produces algorithms that are slower than their counterparts on sequences only by a factor k. We illustrate this on two classical problems extended to DAGs: longest increasing subsequence and longest common subsequence. For the former, we obtain an algorithm with running time O(k|E| log |V|). This matches the optimal solution to the classical problem variant when the input sequence is modeled as a path. We obtain an analogous result for the longest common subsequence problem. We then apply this technique to the co-linear chaining problem, which is a generalization of the above two problems. The algorithm for this problem turns out to be more involved, needing further ingredients, such as an FM-index tailored for large alphabets and a two-dimensional range search tree modified to support range maximum queries. We also study a general sequence-to-DAG alignment formulation that allows affine gap costs in the sequence.
The main ingredient of the proposed framework is a new algorithm for finding a minimum path cover of a DAG (V, E) in O(k|E| log |V|) time, improving all known time-bounds when k is small and the DAG is not too dense. In addition to boosting the sparse dynamic programming framework, an immediate consequence of this new minimum path cover algorithm is an improved space/time tradeoff for reachability queries in arbitrary directed graphs.
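For orientation, the classical sequence-only variant of longest increasing subsequence that the abstract compares against can be sketched as follows. This is the standard O(n log n) textbook method on a plain sequence, not the DAG extension from the paper; the function name `lis_length` is chosen here for illustration.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of the longest strictly increasing subsequence of `seq`."""
    # tails[i] holds the smallest possible last value of an increasing
    # subsequence of length i + 1 seen so far; tails stays sorted.
    tails = []
    for x in seq:
        i = bisect_left(tails, x)  # first position whose value is >= x
        if i == len(tails):
            tails.append(x)        # x extends the longest subsequence
        else:
            tails[i] = x           # x gives a smaller tail for length i + 1
    return len(tails)


if __name__ == "__main__":
    print(lis_length([3, 1, 4, 1, 5, 9, 2, 6]))
```

Each element costs one binary search, which is where the logarithmic factor in the running time comes from.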

Helsingin yliopiston digitaalinen arkisto

BdBG: a bucket-based method for compressing genome sequencing data with dynamic de Bruijn graphs

Author: Junyi Li
Rongjie Wang
Tianyi Zang
Yadong Wang
Yang Bai
Publication venue: 'PeerJ'
Publication date: 01/10/2018

Dramatic increases in data produced by next-generation sequencing (NGS) technologies demand data compression tools for saving storage space. However, effective and efficient data compression for genome sequencing data has remained an unresolved challenge in NGS data studies. In this paper, we propose a novel alignment-free and reference-free compression method, BdBG, which is the first to compress genome sequencing data with dynamic de Bruijn graphs based on the data after bucketing. Compared with existing de Bruijn graph methods, BdBG stores only a list of bucket indexes and bifurcations for the raw read sequences, and this feature can effectively reduce storage space. Experimental results on several genome sequencing datasets show the effectiveness of BdBG over three state-of-the-art methods. BdBG is written in Python and is open-source software distributed under the MIT license, available for download at https://github.com/rongjiewang/BdBG
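To make the term "bifurcation" concrete, here is a minimal sketch of a de Bruijn graph built from reads, reporting the k-mers that have more than one possible next character. It does not reproduce BdBG's bucketing or its encoding, which the abstract does not describe; the function names and the toy reads are assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_successors(reads, k):
    """Map each k-mer to the set of characters that follow it in any read."""
    succ = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            succ[read[i:i + k]].add(read[i + k])
    return succ


def bifurcations(succ):
    """Return only the k-mers with more than one distinct successor."""
    return {kmer: sorted(nxt) for kmer, nxt in succ.items() if len(nxt) > 1}


if __name__ == "__main__":
    toy_reads = ["ACGTA", "ACGGA"]
    print(bifurcations(de_bruijn_successors(toy_reads, 3)))
```

On an unbranched stretch of the graph the next character is determined by the current k-mer, so, as a general observation about de Bruijn graph encodings, only the choices made at branching k-mers need to be recorded to retrace a read.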

Directory of Open Access Journals

Computational pan-genomics: status, promises and challenges

Author: Abeel Thomas
Alkan Can
Baaijens Jasmijn
Bakker Paul
Boeva Valentina
Bonnal Raoul
Chiaromonte Francesca
Chikhi Rayan
Ciccarelli Francesca
Cijvat Robin
Datema Erwin
Dijkstra Louis
Duijn Cornelia
Dutilh Bas
Eichler Evan
El-Kebir Mohammed
Ernst Corinna
Eskin Eleazar
Garrison Erik
Ghaffaari Ali
Guryev Victor
Kersey Paul
Klau Gunnar
Kloosterman Wigard
Korbel Jan
Lameijer Eric-Wubbo
Langmead Benjamin
Marschall Tobias
Martin Marcel
Marz Manja
Medvedev Paul
Mu John
Mäkinen Veli
Neerincx Pieter
Novak Adam
Ouwens Klaasjan
Paten Benedict
Peterlongo Pierre
Pisanti Nadia
Porubsky David
Rahmann Sven
Raphael Benjamin
Reinert Knut
Ridder Dick
Ridder Jeroen
Rivals Eric
Sanders Ashley
Schlesner Matthias
Schulz-Trieglaff Ole
Schönhuth Alexander
Sheikhizadeh Siavash
Shneider Carl
Smit Sandra
The Computational Pan-Genomics Consortium
Valenzuela Daniel
Vandin Fabio
Wang Jiayin
Wessels Lodewyk
Ye Kai
Zhang Ying
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2018

Many disciplines, from human genetics and oncology to plant breeding, microbiology and virology, commonly face the challenge of analyzing rapidly increasing numbers of genomes. In the case of Homo sapiens, the number of sequenced genomes will approach hundreds of thousands in the next few years. Simply scaling up established bioinformatics pipelines will not be sufficient for leveraging the full potential of such rich genomic data sets. Instead, novel, qualitatively different computational methods and paradigms are needed. We will witness the rapid extension of computational pan-genomics, a new sub-area of research in computational biology. In this article, we generalize existing definitions and understand a pan-genome as any collection of genomic sequences to be analyzed jointly or to be used as a reference. We examine already available approaches to construct and use pan-genomes, discuss the potential benefits of future technologies and methodologies and review open challenges from the vantage point of the above-mentioned biological disciplines. As a prominent example for a computational paradigm shift, we particularly highlight the transition from the representation of reference genomes as strings to representations as graphs. We outline how this and other challenges from different application domains translate into common computational problems, point out relevant bioinformatics techniques and identify open problems in computer science. With this review, we aim to increase awareness that a joint approach to computational pan-genomics can help address many of the problems currently faced in various domains.

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

EUR Research Repository

HAL-MINES ParisTech

Archivio della ricerca della Scuola Superiore Sant'Anna

Radboud Repository

HAL-Rennes 1

Bit-parallel sequence-to-graph alignment

Author: Marschall Tobias
Mäkinen Veli
Rautiainen Mikko
Publication date: 01/01/2019

Motivation: Graphs are commonly used to represent sets of sequences. Either edges or nodes can be labeled by sequences, so that each path in the graph spells a concatenated sequence. Examples include graphs to represent genome assemblies, such as string graphs and de Bruijn graphs, and graphs to represent a pan-genome and hence the genetic variation present in a population. Being able to align sequencing reads to such graphs is a key step for many analyses and its applications include genome assembly, read error correction and variant calling with respect to a variation graph. Results: We generalize two linear sequence-to-sequence algorithms to graphs: the Shift-And algorithm for exact matching and Myers' bitvector algorithm for semi-global alignment. These linear algorithms are both based on processing w sequence characters with a constant number of operations, where w is the word size of the machine (commonly 64), and achieve a speedup of up to w over naive algorithms. For a graph with |V| nodes and |E| edges and a sequence of length m, our bitvector-based graph alignment algorithm reaches a worst case runtime of O(|V| + ⌈m/w⌉ |E| log w) for acyclic graphs and O(|V| + m |E| log w) for arbitrary cyclic graphs. We apply it to five different types of graphs and observe a speedup between 3-fold and 20-fold compared with a previous (asymptotically optimal) alignment algorithm.
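As background for the abstract above, the linear (sequence-to-sequence) Shift-And algorithm can be sketched in a few lines. This is the classical textbook version, not the graph generalization from the paper. Python integers have arbitrary precision, so the sketch does not model the fixed machine word size w; the function name `shift_and` is chosen here for illustration.

```python
def shift_and(pattern, text):
    """Return the start positions of all exact occurrences of `pattern` in `text`."""
    m = len(pattern)
    if m == 0:
        return []
    # masks[c] has bit i set exactly when pattern[i] == c.
    masks = {}
    for i, c in enumerate(pattern):
        masks[c] = masks.get(c, 0) | (1 << i)
    accept = 1 << (m - 1)  # bit that signals a full-pattern match
    state = 0              # bit i set: pattern[:i + 1] matches ending here
    hits = []
    for j, c in enumerate(text):
        # Shift all partial matches by one, start a new one, keep those that fit c.
        state = ((state << 1) | 1) & masks.get(c, 0)
        if state & accept:
            hits.append(j - m + 1)
    return hits


if __name__ == "__main__":
    print(shift_and("ACA", "ACACA"))
```

All prefixes of the pattern are advanced with one shift and one AND per text character, which is the source of the word-level parallelism the abstract refers to.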

Helsingin yliopiston digitaalinen arkisto

MPG.PuRe

Influenza classification from short reads with VAPOR facilitates robust mapping pipelines and zoonotic strain detection for routine surveillance applications

Author: Brown C.M.
Bull M.J.
Connor T.R.
Corden S.
Moore C.
Southgate B.
Southgate J.A.
Watkins J.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 15/03/2020

Motivation: Influenza viruses represent a global public health burden due to annual epidemics and pandemic potential. Due to a rapidly evolving RNA genome, inter-species transmission, intra-host variation, and noise in short-read data, reads can be lost during mapping, and de novo assembly can be time-consuming and result in misassembly. We assessed read loss during mapping, and designed a graph-based classifier, VAPOR, for selecting mapping references, assembly validation, and detection of strains of non-human origin. Results: Standard human reference viruses were insufficient for mapping diverse influenza samples in simulation. VAPOR retrieved references for 257 real whole genome sequencing (WGS) samples with a mean of >99.8% identity to assemblies, and increased the proportion of mapped reads by up to 13.3% compared to standard references. VAPOR has the potential to improve the robustness of bioinformatics pipelines for surveillance and could be adapted to other RNA viruses.

Online Research @ Cardiff

Fully-sensitive Seed Finding in Sequence Graphs Using a Hybrid Index

Author: Ghaffaari A.
Marschall T.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/2019

MPG.PuRe

Using Minimum Path Cover to Boost Dynamic Programming on DAGs : Co-linear Chaining Extended

Author: A Amir
A Limasset
AI Tomescu
AM Novak
C-P Schnorr
D Belazzougui
D Eppstein
D Haussler
DM Church
DR Fulkerson
E Cohen
G Navarro
HV Jagadish
J Sirén
JE Hopcroft
K Park
M Abouelhoda
M Vyverman
R Patro
R Rizzi
R Uricaru
RK Ahuja
S Felsner
S Heber
S Wandelt
SC Ntafos
T Shibuya
V Mäkinen
VV Vazirani
Publication venue: Springer International Publishing AG
Publication date: 29/01/2018

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

HAL Descartes

Helsingin yliopiston digitaalinen arkisto

Hal-Diderot

Efficient Privacy-Preserving Viral Strain Classification via k-mer Signatures and FHE

Author: Adi Akavia
Ben Galili
Hayim Shaul
Mor Weiss
Zohar Yakhini
Publication venue: International Association for Cryptologic Research (IACR)
Publication date: 03/01/2023

With the development of sequencing technologies, viral strain classification -- which is critical for many applications, including disease monitoring and control -- has become widely deployed. Typically, a lab (client) holds a viral sequence, and requests classification services from a centralized repository of labeled viral sequences (server). However, such "classification as a service" raises privacy concerns. In this paper we propose a privacy-preserving viral strain classification protocol that allows the client to obtain classification services from the server, while maintaining complete privacy of the client's viral strains. The privacy guarantee is against active servers, and the correctness guarantee is against passive ones. We implemented our protocol and performed extensive benchmarks, showing that it obtains almost perfect accuracy (99.8%-100%) and microAUC (0.999), and high efficiency (amortized per-sequence client and server runtimes of 4.95 ms and 0.53 ms, respectively, and 0.21 MB communication). In addition, we present an extension of our protocol that guarantees server privacy against passive clients, and provide an empirical evaluation showing that this extension provides the same high accuracy and microAUC, with amortized per-sequence overhead of only a few milliseconds in client and server runtime, and 0.3 MB in communication complexity. Along the way, we develop an enhanced packing technique in which two reals are packed in a single complex number, with support for homomorphic inner products of vectors of ciphertexts. We note that while similar packing techniques were used before, they only supported additions and multiplication by constants.
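The closing remark about packing can be illustrated on plaintext values. The sketch below involves no encryption and does not reproduce the paper's enhanced technique; it only shows why packing two reals into one complex number works directly for addition and for multiplication by a real constant, while a plain product of two packed values mixes the components. The helper names `pack` and `unpack` are hypothetical.

```python
def pack(a, b):
    """Pack two reals into one complex number: a in the real part, b in the imaginary part."""
    return complex(a, b)


def unpack(z):
    """Recover the two reals from a packed value."""
    return z.real, z.imag


if __name__ == "__main__":
    x = pack(1.5, -2.0)
    y = pack(0.5, 4.0)

    # Addition acts on both slots independently.
    print(unpack(x + y))      # slot-wise sums: 1.5 + 0.5 and -2.0 + 4.0

    # Multiplication by a real constant also acts slot-wise.
    print(unpack(3.0 * x))    # 3.0 * 1.5 and 3.0 * -2.0

    # A product of two packed values is NOT slot-wise:
    # (a + bi)(c + di) = (ac - bd) + (ad + bc)i mixes the two slots.
    print(unpack(x * y))
```

The cross terms in the last product are the reason an inner product over packed values needs extra handling beyond what addition and constant multiplication require.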

Cryptology ePrint Archive