44 research outputs found

Automata and Differentiable Words

Author: Brlek
Béal
Carpi
Crochemore
Dekking
Gabriele Fici
Jean-Marc Fédou
Kimberling
Kolakoski
Lepistö
Weakley
Publication venue: 'Elsevier BV'
Publication date: 01/01/2012

We exhibit the construction of a deterministic automaton that, given k > 0, recognizes the (regular) language of k-differentiable words. Our approach follows a scheme of Crochemore et al. based on minimal forbidden words. We extend this construction to the case of $C^\infty$-words, i.e., words differentiable arbitrarily many times. We thus obtain an infinite automaton for representing the set of $C^\infty$-words. We derive a classification of $C^\infty$-words induced by the structure of the automaton. Then, we introduce a new framework for dealing with $C^\infty$-words, based on a three-letter alphabet. This allows us to define a compacted version of the automaton, which we use to prove that every $C^\infty$-word admits a repetition in $C^\infty$ whose length is polynomially bounded. Comment: Accepted for publication
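The abstract above does not define differentiability, so the following is only a minimal sketch under an assumed definition (the usual one for words over {1, 2} in the Kolakoski-sequence literature): a word is differentiable if it has no run longer than 2, and its derivative is the sequence of run lengths, with a first or last run of length 1 discarded. The function names `derivative` and `is_k_differentiable` are hypothetical, and this brute-force check is not the automaton construction of the paper.

```python
from itertools import groupby


def derivative(word):
    """Return the derivative of a word over {'1', '2'}, or None if the
    word is not differentiable (assumed definition: no run longer than 2)."""
    runs = [len(list(group)) for _, group in groupby(word)]
    if any(r > 2 for r in runs):
        return None
    # A first or last run of length 1 may be a cut-off run of length 2,
    # so it is discarded (assumption, see the lead-in).
    if runs and runs[0] == 1:
        runs = runs[1:]
    if runs and runs[-1] == 1:
        runs = runs[:-1]
    return "".join(str(r) for r in runs)


def is_k_differentiable(word, k):
    """Check whether the word can be differentiated k times in a row."""
    for _ in range(k):
        word = derivative(word)
        if word is None:
            return False
    return True


if __name__ == "__main__":
    print(derivative("22112122"))
    print(is_k_differentiable("112211", 1), is_k_differentiable("112211", 2))
```

Under this definition, a word is a $C^\infty$-word when repeated differentiation never fails, which for a finite word means it eventually reaches the empty word.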

arXiv.org e-Print Archive

CiteSeerX

Elsevier - Publisher Connector

Crossref

Archivio istituzionale della ricerca - Università di Palermo

Asymptotic Optimality of Antidictionary Codes

Author: Morita Hiroyoshi
Ota Takahiro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/06/2010

An antidictionary code is a lossless compression algorithm using an antidictionary, which is a set of minimal words that do not occur as substrings in an input string. The code was proposed by Crochemore et al. in 2000, and its asymptotic optimality has been proved only with respect to a specific information source, called a balanced binary source, that is, a binary Markov source in which a state transition occurs with probability 1/2 or 1. In this paper, we prove the optimality of both static and dynamic antidictionary codes with respect to a stationary ergodic Markov source on a finite alphabet such that a state transition occurs with probability $p$ ($0 < p \leq 1$). Comment: 5 pages, to appear in the proceedings of 2010 IEEE International Symposium on Information Theory (ISIT2010)
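To make the notion of an antidictionary concrete, here is a brute-force sketch that lists the minimal words absent from a string: a word is minimal when it does not occur but its longest proper prefix and longest proper suffix both do. The function name `minimal_forbidden_words` and the explicit `alphabet` parameter are assumptions made for illustration; the sketch enumerates all substrings, so it only suits short inputs, and it does not implement the encoder itself.

```python
def minimal_forbidden_words(text, alphabet):
    """Return the set of minimal absent words of `text` over `alphabet`.

    Brute force: collects every substring of `text`, so it is only meant
    for small examples.
    """
    n = len(text)
    factors = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    factors.add("")  # the empty word occurs in every string

    result = set()
    # A single letter that never occurs is minimal by definition.
    for letter in alphabet:
        if letter not in factors:
            result.add(letter)
    # A longer word f + b is minimal when f occurs, (f + b)[1:] occurs,
    # and f + b itself does not.
    for f in factors:
        if not f:
            continue
        for b in alphabet:
            candidate = f + b
            if candidate not in factors and candidate[1:] in factors:
                result.add(candidate)
    return result


if __name__ == "__main__":
    print(sorted(minimal_forbidden_words("aab", "ab")))
```

For the text `aab` over the alphabet {a, b}, the words `ba`, `bb` and `aaa` are absent while all of their proper factors occur, so they form the antidictionary.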

arXiv.org e-Print Archive

Crossref

Fully-online Construction of Suffix Trees for Multiple Texts

Author: Arimura Hiroki
Inenaga Shunsuke
Takagi Takuya
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 27th Annual Symposium on Combinatorial Pattern Matching (CPM 2016)
Publication date: 01/01/2016

We consider fully-online construction of indexing data structures for multiple texts. Let T = {T_1, ..., T_K} be a collection of texts. By fully-online, we mean that a new character can be appended to any text in T at any time. This is a natural generalization of semi-online construction of indexing data structures for multiple texts in which, after a new character is appended to the kth text T_k, its previous texts T_1, ..., T_{k-1} will remain static. Our fully-online scenario arises when we maintain dynamic indexes for multi-sensor data. Let N and sigma denote the total length of texts in T and the alphabet size, respectively. We first show that the algorithm by Blumer et al. [Theoretical Computer Science, 40:31-55, 1985] to construct the directed acyclic word graph (DAWG) for T can readily be extended to our fully-online setting, retaining O(N log sigma)-time and O(N)-space complexities. Then, we give a sophisticated fully-online algorithm which constructs the suffix tree for T in O(N log sigma) time and O(N) space. A key idea of this algorithm is synchronized maintenance of the DAWG and the suffix tree.

Dagstuhl Research Online Publication Server

Compression with the tudocomp Framework

Author: Dinklage Patrick
Fischer Johannes
Sadakane Kunihiko
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 16th International Symposium on Experimental Algorithms (SEA 2017)
Publication date: 01/01/2017

We present a framework facilitating the implementation and comparison of text compression algorithms. We evaluate its features by a case study on two novel compression algorithms based on the Lempel-Ziv compression schemes that perform well on highly repetitive texts.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

Indexable Elastic Founder Graphs of Minimum Height

Author: Mäkinen Veli
Rizzo Nicola
Publication venue: Schloss Dagstuhl - Leibniz-Zentrum für Informatik
Publication date: 01/06/2022

Indexable elastic founder graphs have been recently proposed as a data structure for genomics applications supporting fast pattern matching queries. Consider segmenting a multiple sequence alignment MSA[1..m,1..n] into b blocks MSA[1..m,1..j₁], MSA[1..m,j₁+1..j₂], …, MSA[1..m,j_{b-1}+1..n]. The resulting elastic founder graph (EFG) is obtained by merging in each block the strings that are equivalent after the removal of gap symbols, taking the strings as the nodes of the block and the original MSA connections as edges. We call an elastic founder graph indexable if a node label occurs as a prefix of only those paths that start from a node of the same block. Equi et al. (ISAAC 2021) showed that such EFGs support fast pattern matching and studied their construction maximizing the number of blocks and minimizing the maximum length of a block, but left open the case of minimizing the maximum number of distinct strings in a block, which we call graph height. For the simplified gapless setting, we give an O(mn) time algorithm to find a segmentation of an MSA minimizing the height of the resulting indexable founder graph, by combining previous results in segmentation algorithms and founder graphs. For the general setting, the known techniques yield a linear-time parameterized solution on constant alphabet Σ, taking time O(m n² log|Σ|) in the worst case, so we study the refined measure of prefix-aware height, which omits counting strings that are prefixes of another considered string. The indexable EFG minimizing the maximum prefix-aware height provides a lower bound for the original height: by exploiting suffix trees built from the MSA rows and the data structure answering weighted ancestor queries in constant time of Belazzougui et al. (CPM 2021), we give an O(mn)-time algorithm for the optimal EFG under this alternative height. Peer reviewed

Helsingin yliopiston digitaalinen arkisto

Text Indexing and Searching in Sublinear Time

Author: Munro J. Ian
Navarro Gonzalo
Nekrich Yakov
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 31st Annual Symposium on Combinatorial Pattern Matching (CPM 2020)
Publication date: 15/07/2019

We introduce the first index that can be built in o(n) time for a text of length n, and can also be queried in o(q) time for a pattern of length q. On an alphabet of size σ, our index uses O(n log σ) bits, is built in O(n log σ / √(log n)) deterministic time, and computes the number of occurrences of the pattern in time O(q/log_σ n + log n log_σ n). Each such occurrence can then be found in O(log n) time. Other trade-offs between the space usage and the cost of reporting occurrences are also possible.

arXiv.org e-Print Archive

Michigan Technological University

Dagstuhl Research Online Publication Server

On compact directed acyclic word graphs

Author: A. Blumer
B. Clift
C. Lefevre
E. McCreight
E. Ukkonen
J. Karkkainen
M. Crochemore
U. Manber
Publication venue: 'Springer Science and Business Media LLC'
Publication date

Crossref

Faster Approximate String Matching for Short Patterns

Author: A. Andersson
A.H. Wright
D. Gusfield
D. Harel
D.E. Knuth
E. Ukkonen
E.W. Myers
F.T. Leighton
G. Myers
G. Navarro
G.M. Landau
H. Hyyrö
K.E. Batcher
M. Farach-Colton
M.A. Bender
P. Bille
P. Sellers
Philip Bille
R. Baeza-Yates
R. Cole
R.A. Baeza-Yates
R.A. Wagner
S. Albers
S. Alstrup
S. Wu
S.C. Sahinalp
T. Hagerup
T.H. Cormen
V.L. Arlazarov
W. Masek
Z. Galil
Publication venue
Publication date: 17/03/2011

We study the classical approximate string matching problem, that is, given strings $P$ and $Q$ and an error threshold $k$, find all ending positions of substrings of $Q$ whose edit distance to $P$ is at most $k$. Let $P$ and $Q$ have lengths $m$ and $n$, respectively. On a standard unit-cost word RAM with word size $w \geq \log n$ we present an algorithm using time $O(nk \cdot \min(\frac{\log^2 m}{\log n},\frac{\log^2 m\log w}{w}) + n)$. When $P$ is short, namely, $m = 2^{o(\sqrt{\log n})}$ or $m = 2^{o(\sqrt{w/\log w})}$, this improves the previously best known time bounds for the problem. The result is achieved using a novel implementation of the Landau-Vishkin algorithm based on tabulation and word-level parallelism. Comment: To appear in Theory of Computing Systems
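As a reference point for the problem statement, the sketch below solves the same task with the textbook dynamic program usually attributed to Sellers, running in O(nm) time. It is not the tabulated Landau-Vishkin variant described in the abstract, and the function name `approx_match_end_positions` is hypothetical. Positions are reported 1-based, as the index of the last character of a matching substring of Q.

```python
def approx_match_end_positions(P, Q, k):
    """Return the 1-based ending positions j in Q such that some substring
    of Q ending at j has edit distance at most k to P.

    Column-by-column dynamic program in O(len(P) * len(Q)) time.
    """
    m = len(P)
    # col[i] = minimum edit distance between P[:i] and a substring of Q
    # ending at the current text position; row 0 is always 0 because a
    # match may start anywhere in Q.
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(Q, start=1):
        prev_diag = col[0]  # value for (i - 1, j - 1)
        col[0] = 0
        for i in range(1, m + 1):
            above_left = prev_diag
            prev_diag = col[i]  # save (i, j - 1) before overwriting
            col[i] = min(
                above_left + (P[i - 1] != c),  # match or substitution
                col[i] + 1,                    # extra character in Q
                col[i - 1] + 1,                # character of P left out
            )
        if col[m] <= k:
            ends.append(j)
    return ends


if __name__ == "__main__":
    print(approx_match_end_positions("abc", "xabcx", 0))
    print(approx_match_end_positions("abc", "xabcx", 1))
```

With P = `abc` and Q = `xabcx`, only position 4 qualifies for k = 0, while k = 1 also admits positions 3 and 5.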

arXiv.org e-Print Archive

Crossref

Online Research Database In Technology