61 research outputs found
Faster Approximate Pattern Matching: A Unified Approach
Approximate pattern matching is a natural and well-studied problem on strings: Given a text T, a pattern P, and a threshold k, find (the starting positions of) all substrings of T that are at distance at most k from P. We consider the two most fundamental string metrics: the Hamming distance and the edit distance. Under the Hamming distance, we search for substrings of T that have at most k mismatches with P, while under the edit distance, we search for substrings of T that can be transformed to P with at most k edits. Exact occurrences of P in T have a very simple structure: If we assume for simplicity that |T| ≤ 3|P|/2 and trim T so that P occurs both as a prefix and as a suffix of T, then both P and T are periodic with a common period. However, an analogous characterization for the structure of occurrences with up to k mismatches was proved only recently by Bringmann et al. [SODA'19]: Either there are O(k²) k-mismatch occurrences of P in T, or both P and T are at Hamming distance O(k) from strings with a common period O(|P|/k). We tighten this characterization by showing that there are O(k) k-mismatch occurrences in the case when the pattern is not (approximately) periodic, and we lift it to the edit distance setting, where we tightly bound the number of k-edit occurrences by O(k²) in the non-periodic case. Our proofs are constructive and let us obtain a unified framework for approximate pattern matching for both considered distances. We showcase the generality of our framework with results for the fully-compressed setting (where T and P are given as a straight-line program) and for the dynamic setting (where we extend a data structure of Gawrychowski et al. [SODA'18]).
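To make the two occurrence notions above concrete, here is a small brute-force sketch (the function names are ad hoc, and this is only an illustration of the definitions, not the paper's framework):

```python
# Brute-force reference for the two occurrence notions defined above.
# Purely illustrative; the paper's structural results and algorithms are far more involved.

def hamming_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Starting positions i such that text[i:i+len(pattern)] has at most k mismatches."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1)
            if sum(a != b for a, b in zip(text[i:i + m], pattern)) <= k]

def edit_distance_at_most(s: str, p: str, k: int) -> bool:
    """Textbook O(|s|*|p|) edit-distance DP; True iff the distance is at most k."""
    prev = list(range(len(p) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(p, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1] <= k

def edit_occurrences(text: str, pattern: str, k: int) -> list[int]:
    """Starting positions of substrings of text within edit distance k of pattern."""
    m, out = len(pattern), []
    for i in range(len(text)):
        # a substring within k edits of the pattern has length in [m-k, m+k]
        for j in range(i + max(m - k, 0), min(len(text), i + m + k) + 1):
            if edit_distance_at_most(text[i:j], pattern, k):
                out.append(i)
                break
    return out

print(hamming_occurrences("abcabcaxc", "abc", 1))  # -> [0, 3, 6]
print(edit_occurrences("abcabxabc", "abc", 1))
```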
Faster Pattern Matching under Edit Distance
We consider the approximate pattern matching problem under the edit distance. Given a text of length n, a pattern of length m, and a threshold k, the task is to find the starting positions of all substrings of the text that can be transformed to the pattern with at most k edits. More than 20 years ago, Cole and Hariharan [SODA'98, J. Comput.'02] gave an O(n + k^4·n/m)-time algorithm for this classic problem, and this runtime has not been improved since. Here, we present an algorithm that runs in time O(n + k^{3.5}√(log m log k)·n/m), thus breaking through this long-standing barrier. In the case where n^{1/4+ε} ≤ k ≤ n^{2/5−ε} for some arbitrarily small positive constant ε, our algorithm improves over the state-of-the-art by polynomial factors: it is polynomially faster than both the algorithm of Cole and Hariharan and the classic O(kn)-time algorithm of Landau and Vishkin [STOC'86, J. Algorithms'89]. We observe that the bottleneck case of the alternative O(n + k^4·n/m)-time algorithm of Charalampopoulos, Kociumaka, and Wellnitz [FOCS'20] is when the text and the pattern are (almost) periodic. Our new algorithm reduces this case to a new dynamic problem (Dynamic Puzzle Matching), which we solve by building on tools developed by Tiskin [SODA'10, Algorithmica'15] for the so-called seaweed monoid of permutation matrices. Our algorithm relies only on a small set of primitive operations on strings and thus also applies to the fully-compressed setting (where text and pattern are given as straight-line programs) and to the dynamic setting (where we maintain a collection of strings under creation, splitting, and concatenation), improving over the state of the art.
Faster pattern matching under edit distance: a reduction to dynamic puzzle matching and the Seaweed Monoid of permutation matrices
We consider the approximate pattern matching problem under the edit distance. Given a text T of length n, a pattern P of length m, and a threshold k, the task is to find the starting positions of all substrings of T that can be transformed to P with at most k edits. More than 20 years ago, Cole and Hariharan [SODA’98, J. Comput.’02] gave an O(n + k^4·n/m)-time algorithm for this classic problem, and this runtime has not been improved since.
Here, we present an algorithm that runs in time O(n + k^{3.5}√(log m log k) · n/m), thus breaking through this longstanding barrier. In the case where n^{1/4+ε} ≤ k ≤ n^{2/5−ε} for some arbitrarily small positive constant ε, our algorithm improves over the state-of-the-art by polynomial factors: it is polynomially faster than both the algorithm of Cole and Hariharan and the classic O(kn)-time algorithm of Landau and Vishkin [STOC'86, J. Algorithms'89].
We observe that the bottleneck case of the alternative O(n + k^4·n/m)-time algorithm of Charalampopoulos, Kociumaka, and Wellnitz [FOCS'20] is when the text and the pattern are (almost) periodic. Our new algorithm reduces this case to a new Dynamic Puzzle Matching problem, which we solve by building on tools developed by Tiskin [SODA'10, Algorithmica'15] for the so-called seaweed monoid of permutation matrices. Our algorithm relies only on a small set of primitive operations on strings and thus also applies to the fully-compressed setting (where text and pattern are given as straight-line programs) and to the dynamic setting (where we maintain a collection of strings under creation, splitting, and concatenation), improving over the state of the art.
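For reference, the notion of a k-edit occurrence used in the two entries above can be pinned down with the textbook approximate-matching dynamic program (commonly attributed to Sellers). The sketch below runs in O(nm) time, so it is far slower than the algorithms discussed, and the function name is ad hoc:

```python
# Textbook dynamic program for approximate string matching: row 0 is kept at 0
# so a match may start anywhere in the text. It reports ending positions of
# substrings within edit distance k of the pattern.

def k_edit_occurrence_ends(text: str, pattern: str, k: int) -> list[int]:
    """Ending positions j such that some substring of text ending at j
    can be transformed into pattern with at most k edits."""
    m = len(pattern)
    # dp[i] = minimal edit distance between pattern[:i] and some suffix of text[:j]
    dp = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, 1):
        prev_diag, dp[0] = dp[0], 0  # substrings may start anywhere: row 0 is free
        for i in range(1, m + 1):
            prev_diag, dp[i] = dp[i], min(
                dp[i] + 1,                          # delete a text character
                dp[i - 1] + 1,                      # insert a pattern character
                prev_diag + (pattern[i - 1] != c))  # match or substitute
        if dp[m] <= k:
            ends.append(j)
    return ends

# Example: occurrences of "abcd" with at most 1 edit inside "xxabdxxabcdxx".
print(k_edit_occurrence_ends("xxabdxxabcdxx", "abcd", 1))
```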
Wavelet Trees Meet Suffix Trees
We present an improved wavelet tree construction algorithm and discuss its applications to a number of rank/select problems for integer keys and strings. Given a string of length n over an alphabet of size σ, our method builds the wavelet tree in O(n log σ/√(log n)) time, improving upon the state-of-the-art algorithm by a factor of √(log n). As a consequence, given an array of n integers we can construct in O(n√(log n)) time a data structure consisting of O(n) machine words and capable of answering rank/select queries for the subranges of the array in O(log n/log log n) time. This improves the query time compared to Chan and Pătraşcu and the construction time compared to Brodal et al. Next, we switch to stringological context and propose a novel notion of wavelet suffix trees. For a string w of length n, this data structure occupies O(n) words, takes O(n√(log n)) time to construct, and simultaneously captures the combinatorial structure of substrings of w while enabling efficient top-down traversal and binary search. In particular, with a wavelet suffix tree we are able to answer in O(log |x|) time the following two natural analogues of rank/select queries for suffixes of substrings: for substrings x and y of w count the number of suffixes of x that are lexicographically smaller than y, and for a substring x of w and an integer k, find the k-th lexicographically smallest suffix of x. We further show that wavelet suffix trees allow us to compute a run-length-encoded Burrows-Wheeler transform of a substring x of w in O(s log |x|) time, where s denotes the length of the resulting run-length encoding. This answers a question by Cormode and Muthukrishnan, who considered an analogous problem for Lempel-Ziv compression.
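As a point of reference for the rank/select queries mentioned above, the sketch below implements a plain, unoptimized wavelet tree over an integer array with the naive construction; it is not the paper's faster construction, and the wavelet suffix tree is not reproduced here.

```python
# A plain wavelet tree over an integer array (naive construction), assuming a
# non-empty array at the root. Only the rank query is shown.

class WaveletTree:
    def __init__(self, data, lo=None, hi=None):
        self.lo = min(data) if lo is None else lo
        self.hi = max(data) if hi is None else hi
        self.left = self.right = None
        if self.lo == self.hi or not data:
            return  # leaf (single symbol) or empty subtree
        mid = (self.lo + self.hi) // 2
        # prefix_zeros[i] = how many of the first i values are routed to the left child
        self.prefix_zeros = [0]
        for x in data:
            self.prefix_zeros.append(self.prefix_zeros[-1] + (x <= mid))
        self.left = WaveletTree([x for x in data if x <= mid], self.lo, mid)
        self.right = WaveletTree([x for x in data if x > mid], mid + 1, self.hi)

    def rank(self, value, i):
        """Number of occurrences of `value` among the first i array entries."""
        if i == 0 or not (self.lo <= value <= self.hi):
            return 0
        if self.lo == self.hi:
            return i
        zeros = self.prefix_zeros[i]
        if value <= (self.lo + self.hi) // 2:
            return self.left.rank(value, zeros)
        return self.right.rank(value, i - zeros)

# Usage: count occurrences of 5 in the prefix a[0:7].
a = [3, 5, 1, 5, 9, 2, 5, 7]
wt = WaveletTree(a)
print(wt.rank(5, 7))  # -> 3
```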
Normal, Abby Normal, Prefix Normal
A prefix normal word is a binary word with the property that no substring has more 1s than the prefix of the same length. This class of words is important in the context of binary jumbled pattern matching. In this paper we present results about the number of prefix normal words of length n, giving upper and lower bounds on this quantity. We introduce efficient algorithms for testing the prefix normal property and a "mechanical algorithm" for computing prefix normal forms. We also include games which can be played with prefix normal words. In these games Alice wishes to stay normal but Bob wants to drive her "abnormal" -- we discuss which parameter settings allow Alice to succeed.
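A direct quadratic-time check of the defining property, included only to make the definition concrete (the paper's testing algorithms are more efficient):

```python
# A binary word w is prefix normal iff, for every length l, no length-l factor
# contains more 1s than the length-l prefix. This is a direct, quadratic test.

def is_prefix_normal(w: str) -> bool:
    n = len(w)
    prefix_ones = [0] * (n + 1)
    for i, c in enumerate(w, 1):
        prefix_ones[i] = prefix_ones[i - 1] + (c == '1')
    for length in range(1, n + 1):
        max_ones = max(prefix_ones[i + length] - prefix_ones[i]
                       for i in range(n - length + 1))
        if max_ones > prefix_ones[length]:
            return False
    return True

print(is_prefix_normal("110101"))  # True
print(is_prefix_normal("101101"))  # False: factor "11" has two 1s, but the prefix "10" has only one
```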
On Maximal Unbordered Factors
Given a string of length n, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between n and the length of the maximal unbordered factor of a string of length n. We prove a lower bound on the expected length of the maximal unbordered factor of a string of length n over an alphabet of size σ (for sufficiently large values of n). As an application of this result, we propose a new algorithm for computing the maximal unbordered factor of a string.
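For illustration, a simple quadratic-time routine that computes a maximal unbordered factor straight from the definition via the standard border array (KMP failure function); this is not the paper's algorithm:

```python
# A border of a string is a nonempty proper prefix that is also a suffix; a
# factor is unbordered if it has none. The routine below returns a longest
# unbordered factor by brute force over all starting positions.

def border_array(s: str) -> list[int]:
    """b[l] = length of the longest proper border of s[:l] (KMP failure function)."""
    b = [0] * (len(s) + 1)
    k = 0
    for i in range(1, len(s)):
        while k > 0 and s[i] != s[k]:
            k = b[k]
        if s[i] == s[k]:
            k += 1
        b[i + 1] = k
    return b

def maximal_unbordered_factor(w: str) -> str:
    best = ""
    for i in range(len(w)):
        b = border_array(w[i:])
        for length in range(len(w) - i, 0, -1):
            if b[length] == 0:           # w[i:i+length] has no border
                if length > len(best):
                    best = w[i:i + length]
                break                    # shorter factors from i cannot beat this one
    return best

print(maximal_unbordered_factor("abaab"))  # -> "baa"; "abaab" itself has border "ab"
```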
Indexing weighted sequences: Neat and efficient
In a weighted sequence, for every position of the sequence and every letter of the alphabet a probability of occurrence of this letter at this position is specified. Weighted sequences are commonly used to represent imprecise or uncertain data, for example in molecular biology, where they are known under the name of Position Weight Matrices. Given a probability threshold 1/z, we say that a string P of length m occurs in a weighted sequence X at position i if the product of probabilities of the letters of P at positions i, …, i+m−1 in X is at least 1/z. In this article, we consider an indexing variant of the problem, in which we are to pre-process a weighted sequence to answer multiple pattern matching queries. We present an O(nz)-time construction of an O(nz)-sized index for a weighted sequence of length n that answers pattern matching queries in the optimal O(m+Occ) time, where Occ is the number of occurrences reported. The cornerstone of our data structure is a novel construction of a family of ⌊z⌋ strings that carries the information about all the strings that occur in the weighted sequence with a sufficient probability. We thus improve the most efficient previously known index by Amir et al. (Theor. Comput. Sci., 2008), which has size and construction time O(nz² log z), preserving optimal query time. On the way we develop a new, more straightforward index for the so-called property matching problem. We provide an open-source implementation of our data structure and present experimental results using both synthetic and real data. Our construction also allows us to obtain a significant improvement over the complexities of the approximate variant of the weighted index presented by Biswas et al. at EDBT 2016 and an improvement of the space complexity of their general index. We also present applications of our index.
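The occurrence condition above is easy to check directly. The sketch below models a weighted sequence as a list of letter-to-probability maps (a toy representation chosen for illustration, not the paper's index) and tests the condition verbatim:

```python
# Pattern P occurs at position i of weighted sequence X with threshold 1/z if
# the product of the per-position probabilities of P's letters is at least 1/z.

def occurs_at(X, P: str, i: int, z: float) -> bool:
    prob = 1.0
    for j, letter in enumerate(P):
        prob *= X[i + j].get(letter, 0.0)
        if prob < 1.0 / z:      # early exit: the product can only decrease
            return False
    return True

def occurrences(X, P: str, z: float) -> list[int]:
    return [i for i in range(len(X) - len(P) + 1) if occurs_at(X, P, i, z)]

# A toy weighted sequence of length 4 over {a, b}:
X = [{'a': 0.9, 'b': 0.1},
     {'a': 0.5, 'b': 0.5},
     {'a': 0.2, 'b': 0.8},
     {'a': 1.0}]
print(occurrences(X, "ab", 4))  # positions where Pr["ab"] >= 1/4, here [0, 1]
```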
Searching of gapped repeats and subrepetitions in a word
A gapped repeat is a factor of the form uvu, where u and v are nonempty words. The period of the gapped repeat is defined as |uv|. The gapped repeat is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its period. The gapped repeat is called α-gapped if its period is not greater than α|u|. A δ-subrepetition is a factor whose exponent is less than 2 but not less than 1+δ (the exponent of a factor is the quotient of its length and its minimal period). The δ-subrepetition is maximal if it cannot be extended to the left or to the right by at least one letter while preserving its minimal period. We reveal a close relation between maximal gapped repeats and maximal subrepetitions. Moreover, we prove upper bounds on the number of maximal α-gapped repeats and on the number of maximal δ-subrepetitions in a word of length n. Using the obtained upper bounds, we propose algorithms for finding all maximal α-gapped repeats and all maximal δ-subrepetitions in a word of length n. For the algorithm finding all maximal α-gapped repeats we bound the time complexity both for the case of constant alphabet size and for the general case. For finding all maximal δ-subrepetitions we propose two algorithms: for the first we again bound the time complexity for constant and general alphabet sizes, and for the second we bound the expected time complexity.
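For illustration, a brute-force enumeration of maximal α-gapped repeats taken straight from the definitions above (far slower than the algorithms proposed in the paper; the function name is ad hoc):

```python
# Enumerate maximal alpha-gapped repeats uvu: left copy u at position i, right
# copy u at position i + p, gap v nonempty (so |u| < p), period p <= alpha*|u|,
# and no extension possible on either side.

def maximal_alpha_gapped_repeats(w: str, alpha: float) -> list[tuple[int, int, int]]:
    n = len(w)
    result = []  # triples (i, p, l): copies w[i:i+l] and w[i+p:i+p+l], period p
    for p in range(2, n):                 # period |uv| >= 2 since u and v are nonempty
        for i in range(n - p):
            if w[i] != w[i + p]:
                continue
            if i > 0 and w[i - 1] == w[i + p - 1]:
                continue                  # extendable to the left -> not maximal
            l = 0
            while i + p + l < n and w[i + l] == w[i + p + l]:
                l += 1                    # extend the arms as far as they match
            if 0 < l < p and p <= alpha * l:
                result.append((i, p, l))
    return result

print(maximal_alpha_gapped_repeats("abaabab", 3))  # e.g. u = "ab" at positions 0 and 5
```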
- …