Search CORE

44 research outputs found

Toward Optimal Fingerprint Indexing for Large Scale Genomics

Author: Cazaux Bastien
Limasset Antoine
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 22nd International Workshop on Algorithms in Bioinformatics (WABI 2022)
Publication date: 01/01/2022
Field of study

Motivation. To keep up with the scale of genomic databases, several methods rely on local sensitive hashing methods to efficiently find potential matches within large genome collections. Existing solutions rely on Minhash or Hyperloglog fingerprints and require reading the whole index to perform a query. Such solutions can not be considered scalable with the growing amount of documents to index. Results. We present NIQKI, a novel structure with well-designed fingerprints that lead to theoretical and practical query time improvements, outperforming state-of-the-art by orders of magnitude. Our contribution is threefold. First, we generalize the concept of Hyperminhash fingerprints in (h,m)-HMH fingerprints that can be tuned to present the lowest false positive rate given the expected sub-sampling applied. Second, we provide a structure able to index any kind of fingerprints based on inverted indexes that provide optimal queries, namely linear with the size of the output. Third, we implemented these approaches in a tool dubbed NIQKI that can index and calculate pairwise distances for over one million bacterial genomes from GenBank in a few days on a small cluster. We show that our approach can be orders of magnitude faster than state-of-the-art with comparable precision. We believe this approach can lead to tremendous improvements, allowing fast queries and scaling on extensive genomic databases

Dagstuhl Research Online Publication Server

Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

Author: Cazaux Bastien
Kosolobov Dmitry
Norri Tuukka
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018)
Publication date: 01/01/2018
Field of study

Given a threshold L and a set R = {R_1, ..., R_m} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into set P of disjoint segments such that each segment [a,b] in P has length at least L and the number d(a,b)=|{R_i[a,b] : 1 <= i <= m}| of distinct substrings at segment [a,b] is minimized over [a,b] in P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b) : [a,b] in P} founder sequences representing the original R such that crossovers happen only at segment boundaries. We give an optimal O(mn) time algorithm to solve the problem, improving over earlier O(mn^2). This improvement enables to exploit the algorithm on a pan-genomic setting of input strings being aligned haplotype sequences of complete human chromosomes, with a goal of finding a representative set of references that can be indexed for read alignment and variant calling. We implemented the new algorithm and give some experimental evidence on the practicality of the approach on this pan-genomic setting

Dagstuhl Research Online Publication Server

Linear time minimum segmentation enables scalable founder reconstruction

Author: Cazaux Bastien
Kosolobov Dmitry
Mäkinen Veli
Norri Tuukka
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 17/05/2019
Field of study

Abstract Background We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set

{\mathcal {R}} = \{R_1, \ldots , R_m\}

R = { R 1 , … , R m } of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1, n] into set P of disjoint segments such that each segment

[a,b] \in P

[ a , b ] ∈ P has length at least L and the number

d(a,b)=|\{R_i[a,b] :1\le i \le m\}|

d ( a , b ) = | { R i [ a , b ] : 1 ≤ i ≤ m } | of distinct substrings at segment [a, b] is minimized over

[a,b] \in P

[ a , b ] ∈ P . The distinct substrings in the segments represent founder blocks that can be concatenated to form

\max \{ d(a,b) :[a,b] \in P \}

max { d ( a , b ) : [ a , b ] ∈ P } founder sequences representing the original

{\mathcal {R}}

R such that crossovers happen only at segment boundaries. Results We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier

O(mn^2)

O ( m n 2 ) . Conclusions Our improvement enables to apply the formulation on an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences

Helsingin yliopiston digitaalinen arkisto

Minimum Segmentation for Pan-genomic Founder Reconstruction in Linear Time

Author: Cazaux Bastien
Kosolobov Dmitry
Mäkinen Veli
Norri Tuukka
Publication venue: Schloss Dagstuhl Leibniz Center for Informatics
Publication date: 01/01/2018
Field of study

Peer reviewe

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

Linear time minimum segmentation enables scalable founder reconstruction

Author: Cazaux Bastien
Kosolobov Dmitry
Mäkinen Veli
Norri Tuukka
Publication venue
Publication date: 01/01/2019
Field of study

Background: We study a preprocessing routine relevant in pan-genomic analyses: consider a set of aligned haplotype sequences of complete human chromosomes. Due to the enormous size of such data, one would like to represent this input set with a few founder sequences that retain as well as possible the contiguities of the original sequences. Such a smaller set gives a scalable way to exploit pan-genomic information in further analyses (e.g. read alignment and variant calling). Optimizing the founder set is an NP-hard problem, but there is a segmentation formulation that can be solved in polynomial time, defined as follows. Given a threshold L and a set R={R1,...,Rm} of m strings (haplotype sequences), each having length n, the minimum segmentation problem for founder reconstruction is to partition [1,n] into set P of disjoint segments such that each segment [a,b]P has length at least L and the number d(a,b)=|{Ri[a,b]:1im}| of distinct substrings at segment [a,b] is minimized over [a,b]P. The distinct substrings in the segments represent founder blocks that can be concatenated to form max{d(a,b):[a,b]P} founder sequences representing the original R such that crossovers happen only at segment boundaries. Results: We give an O(mn) time (i.e. linear time in the input size) algorithm to solve the minimum segmentation problem for founder reconstruction, improving over an earlier O(mn2). Conclusions: Our improvement enables to apply the formulation on an input of thousands of complete human chromosomes. We implemented the new algorithm and give experimental evidence on its practicality. The implementation is available in https://github.com/tsnorri/founder-sequences.Peer reviewe

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin

Helsingin yliopiston digitaalinen arkisto

Linear Time Construction of Indexable Founder Block Graphs

Author: Cazaux Bastien
Equi Massimo
Mäkinen Veli
Norri Tuukka
Tomescu Alexandru
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/01/2020
Field of study

Peer reviewe

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Helsingin yliopiston digitaalinen arkisto

Founder reconstruction enables scalable and seamless pangenomic analysis

Author: Cazaux Bastien
Dönges Saska
Mäkinen Veli
Norri Tuukka
Valenzuela Daniel
Publication venue
Publication date: 15/12/2021
Field of study

Motivation: Variant calling workflows that utilize a single reference sequence are the de facto standard elementary genomic analysis routine for resequencing projects. Various ways to enhance the reference with pangenomic information have been proposed, but scalability combined with seamless integration to existing workflows remains a challenge. Results: We present PanVC with founder sequences, a scalable and accurate variant calling workflow based on a multiple alignment of reference sequences. Scalability is achieved by removing duplicate parts up to a limit into a founder multiple alignment, that is then indexed using a hybrid scheme that exploits general purpose read aligners. Our implemented workflow uses GATK or BCFtools for variant calling, but the various steps of our workflow (e.g. vcf2multialign tool, founder reconstruction) can be of independent interest as a basis for creating novel pangenome analysis workflows beyond variant calling.Peer reviewe

Helsingin yliopiston digitaalinen arkisto

A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs

Author: Cazaux Bastien
Park Kunsoo
Park Sangsoo
Park Sung Gwan
Rivals Eric
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 32nd Annual Symposium on Combinatorial Pattern Matching (CPM 2021)
Publication date: 01/01/2021
Field of study

The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. A best known algorithm constructs HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of lengths of the strings in P. In this paper we present a new algorithm to construct HOG in O(||P||) time and space. Hence, the construction time and space of HOG are better than those of the overlap graph, which are O(||P|| + n^2).Comment: 8 pages, 2 figures, submitted to CPM 202

arXiv.org e-Print Archive

SNU Open Repository and Archive

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot