Combined Data Structure for Previous- and Next-Smaller-Values
Let A[1..n] be a static array storing n elements from a totally ordered set. We present a data structure of optimal size, at most 2.54n + o(n) bits, that allows us to answer the following queries on A in constant time, without accessing A: (1) previous smaller value queries, where given an index i, we wish to find the first index j to the left of i at which A[j] is strictly smaller than A[i], and (2) next smaller value queries, which search to the right of i. As an additional bonus, our data structure also allows us to answer a third kind of query: given indices i <= j, find the position of the minimum in A[i..j]. Our data structure has direct consequences for the space-efficient storage of suffix trees.

Comment: to appear in Theoretical Computer Science
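The queries above can be illustrated with the classic stack-based linear-time computation over a plain array. This is only a sketch of what the queries return, not the succinct constant-time structure the paper contributes (which answers them without storing the array at all):

```python
def psv_nsv(A):
    """For each index i (0-based here), compute the nearest index to the
    left / right holding a strictly smaller value.  psv[i] == -1 and
    nsv[i] == len(A) mean "no such index"."""
    n = len(A)
    psv, nsv = [-1] * n, [n] * n
    stack = []  # indices whose values are strictly increasing
    for i in range(n):
        while stack and A[stack[-1]] >= A[i]:
            stack.pop()
        if stack:                # top is the nearest strictly smaller value on the left
            psv[i] = stack[-1]
        stack.append(i)
    stack = []
    for i in range(n - 1, -1, -1):
        while stack and A[stack[-1]] >= A[i]:
            stack.pop()
        if stack:                # top is the nearest strictly smaller value on the right
            nsv[i] = stack[-1]
        stack.append(i)
    return psv, nsv

psv, nsv = psv_nsv([3, 1, 4, 1, 5, 9, 2, 6])
# psv == [-1, -1, 1, -1, 3, 4, 3, 6]
# nsv == [1, 8, 3, 8, 6, 6, 8, 8]
```

The range-minimum query of the abstract is closely related: the minimum in A[i..j] is the unique position p in [i, j] whose PSV lies before i and whose NSV lies after j.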
Lightweight Lempel-Ziv Parsing
We introduce a new approach to LZ77 factorization that uses O(n/d) words of
working space and O(dn) time for any d >= 1 (for polylogarithmic alphabet
sizes). We also describe carefully engineered implementations of alternative
approaches to lightweight LZ77 factorization. Extensive experiments show that
the new algorithm is superior in most cases, particularly at the lowest memory
levels and for highly repetitive data. As a part of the algorithm, we describe new methods for computing matching statistics which may be of independent interest.

Comment: 12 pages
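For reference, the LZ77 factorization that the abstract's algorithm computes can be defined by a simple greedy rule: each phrase is either a fresh literal or the longest prefix of the remaining suffix that starts at an earlier position. The quadratic-time sketch below shows the output format only; it is not the lightweight O(n/d)-space algorithm of the paper:

```python
def lz77_factorize(s):
    """Greedy LZ77 factorization.  Each factor is (char, 0) for a literal
    or (pos, length) for the longest match starting at an earlier position
    (the source may overlap the phrase, as usual for LZ77).  Quadratic-time
    reference code, not the paper's lightweight algorithm."""
    factors = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_pos = 0, 0
        for j in range(i):                 # every earlier starting position
            l = 0
            while i + l < n and s[j + l] == s[i + l]:
                l += 1
            if l > best_len:
                best_len, best_pos = l, j
        if best_len == 0:
            factors.append((s[i], 0))      # literal phrase
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors

# lz77_factorize("abababab") -> [('a', 0), ('b', 0), (0, 6)]
```

Note the self-referential third factor: it copies six characters starting at position 0 even though only "abab" has been produced so far, which is permitted because the copy proceeds left to right.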
Rpair: Rescaling RePair with Rsync
Data compression is a powerful tool for managing massive but repetitive datasets, especially schemes such as grammar-based compression that support computation over the data without decompressing it. In the best case such a scheme takes a dataset so big that it must be stored on disk and shrinks it enough that it can be stored and processed in internal memory. Even then, however, the scheme is essentially useless unless it can be built from the original dataset reasonably quickly while keeping the dataset on disk. In this paper we show how we can preprocess such datasets with context-triggered piecewise hashing so that afterwards we can apply RePair and other grammar-based compressors more easily. We first give our algorithm, then show how a variant of it can be used to approximate the LZ77 parse, then leverage that to prove theoretical bounds on compression, and finally give experimental evidence that our approach is competitive in practice.
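Context-triggered piecewise hashing (content-defined chunking) picks cut points where a rolling hash of the last few bytes matches a fixed bit pattern, so identical content tends to produce identical chunks regardless of where it sits in the file. The following is a minimal sketch of the idea only; the window size, hash, and mask are illustrative and are not the paper's configuration:

```python
def cdc_chunks(data, window=16, mask=(1 << 6) - 1):
    """Content-defined chunking with a polynomial rolling hash over the
    last `window` bytes.  A chunk boundary is declared whenever
    hash & mask == mask (expected chunk length about 2**6 bytes).
    Because boundaries depend only on local content, long substrings
    shared by two inputs yield identical chunks, which is what lets a
    grammar compressor such as RePair reuse work across them."""
    BASE, MOD = 257, (1 << 31) - 1
    hi_pow = pow(BASE, window, MOD)        # weight of the byte leaving the window
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = (h * BASE + b) % MOD
        if i >= window:                    # drop the byte that left the window
            h = (h - data[i - window] * hi_pow) % MOD
        if (h & mask) == mask:             # content-triggered cut point
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Concatenating the chunks always reproduces the input; and since the hash rolls over local content only, inserting a prefix into a file shifts the early cut points but leaves the later ones (and hence the later chunks) unchanged once the windows re-synchronize.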
Indexing Finite Language Representation of Population Genotypes
With the recent advances in DNA sequencing, it is now possible to have complete genomes of individuals sequenced and assembled. This rich and focused genotype information can be used for different population-wide studies, now for the first time directly at the whole-genome level. We propose a way to index population genotype information together with the complete genome sequence, so that one can use the index to efficiently align a given sequence to the genome with all plausible genotype recombinations taken into account. This is achieved by converting a multiple alignment of individual genomes into a finite automaton recognizing all strings that can be read from the alignment by switching the source sequence at any position. The finite automaton is indexed with an extension of the Burrows-Wheeler transform to allow pattern search inside the plausible recombinant sequences. The size of the index stays limited because of the high similarity of the individual genomes. The index finds applications in variation calling and in primer design. In a variation calling experiment, we found about 1.0% of matches to novel recombinants with exact matching alone, and up to 2.4% with approximate matching.

Comment: This is the full version of the paper that was presented at WABI 2011. The implementation is available at http://www.cs.helsinki.fi/group/suds/gcsa
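The language recognized by the automaton can be modeled very simply for a gapless alignment: since the source sequence may be switched at any position, a recombinant is any string that picks, at each column, the character of some aligned row. The toy code below (with made-up sequences) illustrates why exact matching can already hit recombinants present in no single input genome; the actual construction handles gaps and indexes the automaton with a BWT extension, which this sketch does not attempt:

```python
def recombinant_columns(alignment):
    """For a gapless multiple alignment (rows of equal length), the
    column-wise character sets describe all plausible recombinants:
    switching rows at any position picks one row's character per column.
    A toy model of the paper's automaton, without gaps or BWT indexing."""
    return [set(col) for col in zip(*alignment)]

def occurs_in_recombinant(columns, pattern):
    """Does `pattern` match some window of consecutive columns, i.e.
    occur in at least one plausible recombinant?"""
    m = len(pattern)
    return any(all(pattern[k] in columns[j + k] for k in range(m))
               for j in range(len(columns) - m + 1))

rows = ["ACGTAC", "ACCTAC", "ATGTGC"]   # hypothetical aligned sequences
cols = recombinant_columns(rows)
# "TCT" occurs in a recombinant (switch rows between columns) even though
# it is a substring of none of the three input rows.
```

This overcounting relative to the input rows is exactly the point of the index: alignments against plausible recombinants are found with ordinary pattern search.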
Suffix Sorting via Matching Statistics
Funding Information: Academy of Finland grants 339070 and 351150. Publisher Copyright: © Zsuzsanna Lipták, Francesco Masillo, and Simon J. Puglisi.

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experimental evidence with a prototype implementation (a tool we call sacamats) shows that on string collections with highly similar strings we can construct the suffix array in time competitive with or faster than the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.
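Matching statistics, the central object in the abstract above, are easy to state: MS[i] is the length of the longest prefix of s[i..] that occurs somewhere in the reference r. The naive sketch below defines the values only; it is neither the paper's compressed representation nor its fast heuristic:

```python
def matching_statistics(s, r):
    """MS[i] = length of the longest prefix of s[i:] occurring as a
    substring of the reference r.  Naive definition-level code; real
    implementations use a suffix automaton or FM-index of r instead of
    repeated substring tests."""
    ms = []
    for i in range(len(s)):
        l = 0
        while i + l < len(s) and s[i:i + l + 1] in r:
            l += 1
        ms.append(l)
    return ms

# matching_statistics("baba", "abab") -> [3, 3, 2, 1]
```

A useful property visible even in this tiny example is that MS[i+1] >= MS[i] - 1, which is what makes the values highly compressible for collections of near-identical strings.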