45 research outputs found

The k-mismatch problem revisited

Authors: Raphaël Clifford, Allyx Fontaine, Ely Porat, Benjamin Sach, Tatiana Starikovskaya
Publication date: 27/08/2015

We revisit the complexity of one of the most basic problems in pattern matching. In the k-mismatch problem we must compute the Hamming distance between a pattern of length m and every m-length substring of a text of length n, as long as that Hamming distance is at most k. Where the Hamming distance is greater than k at some alignment of the pattern and text, we simply output "No". We study this problem in both the standard offline setting and also as a streaming problem. In the streaming k-mismatch problem the text arrives one symbol at a time and we must give an output before processing any future symbols. Our main results are as follows: 1) Our first result is a deterministic $O(n k^2 \log k / m + n \,\text{polylog}\, m)$ time offline algorithm for k-mismatch on a text of length n. This is a factor of k improvement over the fastest previous result of this form from SODA 2000 by Amihood Amir et al. 2) We then give a randomised and online algorithm which runs in the same time complexity but requires only $O(k^2 \,\text{polylog}\, m)$ space in total. 3) Next we give a randomised $(1+\epsilon)$-approximation algorithm for the streaming k-mismatch problem which uses $O(k^2 \,\text{polylog}\, m / \epsilon^2)$ space and runs in $O(\text{polylog}\, m / \epsilon^2)$ worst-case time per arriving symbol. 4) Finally we combine our new results to derive a randomised $O(k^2 \,\text{polylog}\, m)$ space algorithm for the streaming k-mismatch problem which runs in $O(\sqrt{k} \log k + \text{polylog}\, m)$ worst-case time per arriving symbol. This improves the best previous space complexity for streaming k-mismatch from FOCS 2009 by Benny Porat and Ely Porat by a factor of k. We also improve the time complexity of this previous result by an even greater factor to match the fastest known offline algorithm (up to logarithmic factors).
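To make the problem statement above concrete, here is a minimal brute-force sketch of the k-mismatch output format. It is not the algorithm from the paper: it runs in O(nm) time in the worst case, and the function name `k_mismatch_naive` is a hypothetical choice made for this illustration.

```python
from typing import List, Union


def k_mismatch_naive(text: str, pattern: str, k: int) -> List[Union[int, str]]:
    """Brute-force k-mismatch (illustrative baseline, not the paper's method).

    For every alignment i, report the Hamming distance between the pattern
    and text[i:i+m] if it is at most k, and the string "No" otherwise.
    """
    n, m = len(text), len(pattern)
    out: List[Union[int, str]] = []
    for i in range(n - m + 1):
        mismatches = 0
        for a, b in zip(text[i:i + m], pattern):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # distance already exceeds k, stop counting early
        out.append(mismatches if mismatches <= k else "No")
    return out


if __name__ == "__main__":
    print(k_mismatch_naive("abcabc", "abd", 1))
```

The early exit after k+1 mismatches reflects the fact that the exact distance is not needed once it exceeds k.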

arXiv.org e-Print Archive

Crossref

Explore Bristol Research

Approximating Approximate Pattern Matching

Authors: Jan Studený, Przemysław Uznański
Publication date: 01/01/2019

Given a text $T$ of length $n$ and a pattern $P$ of length $m$, the approximate pattern matching problem asks for computation of a particular distance function between $P$ and every $m$-substring of $T$. We consider a $(1\pm\varepsilon)$ multiplicative approximation variant of this problem, for the $\ell_p$ distance function. In this paper, we describe two $(1+\varepsilon)$-approximate algorithms with a runtime of $\widetilde{O}(\frac{n}{\varepsilon})$ for all (constant) non-negative values of $p$. For constant $p \ge 1$ we show a deterministic $(1+\varepsilon)$-approximation algorithm. Previously, such a runtime was known only for the case of $\ell_1$ distance, by Gawrychowski and Uznański [ICALP 2018], and only with a randomized algorithm. For constant $0 \le p \le 1$ we show a randomized algorithm for the $\ell_p$ distance, thereby providing a smooth tradeoff between algorithms of Kopelowitz and Porat [FOCS 2015, SOSA 2018] for Hamming distance (case of $p=0$) and of Gawrychowski and Uznański for $\ell_1$ distance.
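As a point of reference for the quantities being approximated, the following sketch computes the exact values by brute force in O(nm) time. It makes two assumptions that the abstract does not state: the reported value is the sum of $|T[i+j] - P[j]|^p$ without the final $1/p$ root, and for $p = 0$ each nonzero difference contributes 1, so that the result is the Hamming distance. The function name `lp_power_distances` is hypothetical.

```python
from typing import List, Sequence


def lp_power_distances(text: Sequence[float],
                       pattern: Sequence[float],
                       p: float) -> List[float]:
    """Exact sum of |T[i+j] - P[j]|^p for every alignment i (brute force).

    Assumptions for this sketch: no 1/p root is taken, and a zero
    difference always contributes 0, so p = 0 counts mismatches.
    """
    m = len(pattern)

    def term(d: float) -> float:
        if d == 0:
            return 0.0
        return 1.0 if p == 0 else float(d) ** p

    return [
        sum(term(abs(text[i + j] - pattern[j])) for j in range(m))
        for i in range(len(text) - m + 1)
    ]


if __name__ == "__main__":
    print(lp_power_distances([1, 2, 3, 4], [1, 3], 1))
    print(lp_power_distances([1, 2, 3, 4], [1, 3], 0))
```

A $(1+\varepsilon)$-approximate algorithm would be expected to return, for each alignment, a value within the stated multiplicative factor of these exact ones.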

arXiv.org e-Print Archive

Repository for Publications and Research Data

Pattern Matching in Multiple Streams

Authors: A. Amir, D. Breslauer, F. Ergun, G.M. Landau, H. Karloff, K. Abrahamson, M. Ružić, R. Clifford, T.S. Jayram, Z. Galil
Publication date: 01/01/2012

We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching. Comment: 13 pages, 1 figure
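The multiple-streams model described above can be illustrated with a deliberately naive baseline. This is not the method from the paper: it keeps the last m symbols of every stream, so it uses O(ms) space and O(m) time per arriving symbol, which is exactly the cost the paper's O(m+s) and O(m+ks) bounds improve on. The class name `NaiveMultiStreamMatcher` and its `arrive` method are hypothetical names for this sketch.

```python
from collections import deque
from typing import Deque, Dict, Hashable, Optional


class NaiveMultiStreamMatcher:
    """Baseline for k-mismatch in multiple streams (illustration only)."""

    def __init__(self, pattern: str, k: int = 0) -> None:
        self.pattern = pattern
        self.k = k
        # One sliding window of the last m symbols per stream: O(m*s) space.
        self.windows: Dict[Hashable, Deque[str]] = {}

    def arrive(self, stream_id: Hashable, symbol: str) -> Optional[int]:
        """Process one symbol for one stream.

        Returns the Hamming distance between the pattern and the last m
        symbols of that stream if it is at most k, otherwise None.
        """
        m = len(self.pattern)
        window = self.windows.setdefault(stream_id, deque(maxlen=m))
        window.append(symbol)  # the oldest symbol is dropped automatically
        if len(window) < m:
            return None
        mismatches = sum(a != b for a, b in zip(window, self.pattern))
        return mismatches if mismatches <= self.k else None


if __name__ == "__main__":
    matcher = NaiveMultiStreamMatcher("ab", k=0)
    for sid, sym in [(0, "a"), (1, "a"), (0, "b"), (1, "c")]:
        print(sid, sym, matcher.arrive(sid, sym))
```

Setting `k=0` gives exact matching; the interleaved arrivals in the usage example show that each stream is matched independently.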

arXiv.org e-Print Archive

Crossref

Warwick Research Archives Portal Repository

Longest Common Extensions in Sublinear Space

Authors: A Amir, D Gusfield, D Harel, EW Myers, G Manacher, GM Landau, MG Main, NJ Fine, P Bille, R Cole, R Kolpakov, RM Karp
Publication date: 01/01/2015

The longest common extension problem (LCE problem) is to construct a data structure for an input string $T$ of length $n$ that supports LCE$(i,j)$ queries. Such a query returns the length of the longest common prefix of the suffixes starting at positions $i$ and $j$ in $T$. This classic problem has a well-known solution that uses $O(n)$ space and $O(1)$ query time. In this paper we show that for any trade-off parameter $1 \leq \tau \leq n$, the problem can be solved in $O(\frac{n}{\tau})$ space and $O(\tau)$ query time. This significantly improves the previously best known time-space trade-offs, and almost matches the best known time-space product lower bound. Comment: An extended abstract of this paper has been accepted to CPM 201
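The query itself is simple to state in code. The sketch below answers an LCE query by direct character comparison, using no extra space beyond the string and up to O(n) time per query; it is a baseline for understanding the definition, not the data structure from the paper. It assumes 0-based positions, and the name `lce_naive` is hypothetical.

```python
def lce_naive(t: str, i: int, j: int) -> int:
    """Length of the longest common prefix of t[i:] and t[j:].

    Direct comparison: no preprocessing, O(n) worst-case query time.
    Positions are 0-based (an assumption of this sketch).
    """
    n = len(t)
    length = 0
    while i + length < n and j + length < n and t[i + length] == t[j + length]:
        length += 1
    return length


if __name__ == "__main__":
    t = "abcabcab"
    print(lce_naive(t, 0, 3))  # suffixes "abcabcab" and "abcab"
```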

arXiv.org e-Print Archive

Crossref

Copenhagen University Research Information System

Online Research Database In Technology

Checking whether a word is Hamming-isometric in linear time

Authors: Marie-Pierre Béal, Maxime Crochemore
Publication date: 23/07/2021

A finite word $f$ is Hamming-isometric if for any two words $u$ and $v$ of the same length avoiding $f$, $u$ can be transformed into $v$ by changing one by one all the letters on which $u$ differs from $v$, in such a way that all of the new words obtained in this process also avoid $f$. Words which are not Hamming-isometric have been characterized as words having a border with two mismatches. We derive from this characterization a linear-time algorithm to check whether a word is Hamming-isometric. It is based on pattern matching algorithms with $k$ mismatches. Lee-isometric words over a four-letter alphabet have been characterized as words having a border with two Lee-errors. We derive from this characterization a linear-time algorithm to check whether a word over an alphabet of size four is Lee-isometric. Comment: A second algorithm for checking whether a word is Hamming-isometric is added using the result given in reference [5
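The notion of a border with two mismatches can be illustrated with a quadratic-time scan. This sketch assumes the simplest reading of the phrase: a length $\ell$ with $1 \le \ell < |f|$ such that the prefix and the suffix of $f$ of length $\ell$ differ in exactly two positions. The abstract does not give the precise definition, so the paper's condition may carry further requirements, and this is not its linear-time algorithm. The function name `two_mismatch_borders` is hypothetical.

```python
from typing import List


def two_mismatch_borders(f: str) -> List[int]:
    """Lengths l (1 <= l < len(f)) where the length-l prefix and suffix
    of f differ in exactly two positions.

    Quadratic-time illustration under an assumed definition; the result
    should be read as a list of candidates, not as a full isometry test.
    """
    result: List[int] = []
    for length in range(1, len(f)):
        prefix = f[:length]
        suffix = f[len(f) - length:]
        mismatches = sum(a != b for a, b in zip(prefix, suffix))
        if mismatches == 2:
            result.append(length)
    return result


if __name__ == "__main__":
    print(two_mismatch_borders("101"))
    print(two_mismatch_borders("11"))
```

Each candidate length corresponds to one alignment of $f$ against itself, which is why pattern matching with $k$ mismatches (here $k = 2$) is a natural tool for speeding up the scan.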

arXiv.org e-Print Archive

HAL-Ecole des Ponts ParisTech