1,799 research outputs found

Small space and streaming pattern matching with k edits

Author: Kociumaka Tomasz
Porat Ely
Starikovskaya Tatiana
Publication date: 10/06/2021

In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortised time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a $\mathrm{poly}(\log n)$ factor.) This answers a decade-old question: since the discovery of a $\mathrm{poly}(k\log n)$-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no $\mathrm{poly}(k\log n)$-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $P$ is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. In order to develop the fully streaming algorithm, we introduce a new edit distance sketch parametrised by integers $n\ge k$. For any string of length at most $n$, the sketch is of size $\tilde{O}(k^2)$ and it can be computed with an $\tilde{O}(k^2)$-space streaming algorithm. Given the sketches of two strings, in $\tilde{O}(k^3)$ time we can compute their edit distance or certify that it is larger than $k$. This result improves upon $\tilde{O}(k^8)$-size sketches of Belazzougui and Zhang [FOCS 2016] and very recent $\tilde{O}(k^3)$-size sketches of Jin, Nelson, and Wu [STACS 2021].
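
To make the problem statement concrete, here is a minimal sketch of the classical dynamic-programming baseline (often attributed to Sellers) for pattern matching with $k$ edits. It is not the streaming algorithm of the abstract: it keeps one column of $m+1$ values and spends $O(m)$ time per text character, whereas the paper's bounds depend only on $k$ and $\log n$. The function name and the choice to report 0-based end positions are assumptions made for illustration.

```python
def approx_match_end_positions(pattern: str, text: str, k: int) -> list[int]:
    """Return every 0-based index j such that some substring of `text`
    ending at j (inclusive) is within edit distance k of `pattern`."""
    m = len(pattern)
    # col[i] = min edit distance between pattern[:i] and any substring of
    # the text ending at the current position (initially: the empty text).
    col = list(range(m + 1))
    ends = []
    for j, c in enumerate(text):
        prev_diag = col[0]  # D[i-1][j-1]; col[0] stays 0 (a match may start anywhere)
        for i in range(1, m + 1):
            old = col[i]  # D[i][j-1], needed as the next diagonal value
            col[i] = min(
                prev_diag + (pattern[i - 1] != c),  # match or substitution
                old + 1,                            # extra text character
                col[i - 1] + 1,                     # unmatched pattern character
            )
            prev_diag = old
        if col[m] <= k:
            ends.append(j)
    return ends
```

For example, `approx_match_end_positions("abc", "xxabcxx", 1)` reports the end positions of `ab`, `abc`, and `abcx`.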

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Edit Distance: Sketching, Streaming and Document Exchange

Author: Belazzougui Djamal
Zhang Qin
Publication date: 14/07/2016

We show that in the document exchange problem, where Alice holds $x \in \{0,1\}^n$ and Bob holds $y \in \{0,1\}^n$, Alice can send Bob a message of size $O(K(\log^2 K+\log n))$ bits such that Bob can recover $x$ using the message and his input $y$ if the edit distance between $x$ and $y$ is no more than $K$, and output "error" otherwise. Both the encoding and decoding can be done in time $\tilde{O}(n+\mathsf{poly}(K))$. This result significantly improves the previous communication bounds under polynomial encoding/decoding time. We also show that in the referee model, where Alice and Bob hold $x$ and $y$ respectively, they can compute sketches of $x$ and $y$ of sizes $\mathsf{poly}(K \log n)$ bits (the encoding), and send to the referee, who can then compute the edit distance between $x$ and $y$ together with all the edit operations if the edit distance is no more than $K$, and output "error" otherwise (the decoding). To the best of our knowledge, this is the first result for sketching edit distance using $\mathsf{poly}(K \log n)$ bits. Moreover, the encoding phase of our sketching algorithm can be performed by scanning the input string in one pass. Thus our sketching algorithm also implies the first streaming algorithm for computing edit distance and all the edits exactly using $\mathsf{poly}(K \log n)$ bits of space.

Comment: Full version of an article to be presented at the 57th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2016)
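
The decoding contract described above (return the edit distance when it is at most $K$, report an error otherwise) can be illustrated with a plain thresholded dynamic program that sees both strings in full. This is only a sketch of that contract under the assumption that `None` stands in for "error"; it is not the sketching or document exchange scheme of the abstract, and the function name is hypothetical.

```python
def bounded_edit_distance(x: str, y: str, K: int):
    """Return the edit distance between x and y if it is at most K,
    otherwise None (standing in for the "error" output)."""
    n, m = len(x), len(y)
    if abs(n - m) > K:
        return None  # the length difference alone already forces more than K edits
    INF = K + 1  # every value above K is capped to this sentinel
    prev = [j if j <= K else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        # Only cells with |i - j| <= K can hold a value of at most K.
        lo, hi = max(0, i - K), min(m, i + K)
        if lo == 0:
            cur[0] = i
        for j in range(max(1, lo), hi + 1):
            cur[j] = min(
                prev[j - 1] + (x[i - 1] != y[j - 1]),  # match or substitution
                prev[j] + 1,                           # delete x[i-1]
                cur[j - 1] + 1,                        # insert y[j-1]
                INF,
            )
        prev = cur
    return prev[m] if prev[m] <= K else None
```

Only the diagonal band of width $2K+1$ is evaluated in each row, although this simple version still allocates a full row of length $m+1$ per step.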

arXiv.org e-Print Archive

Crossref

Small-Space Algorithms for the Online Language Distance Problem for Palindromes and Squares

Author: Bathie Gabriel
Kociumaka Tomasz
Starikovskaya Tatiana
Publication date: 26/09/2023

We study the online variant of the language distance problem for two classical formal languages, the language of palindromes and the language of squares, and for the two most fundamental distances, the Hamming distance and the edit (Levenshtein) distance. In this problem, defined for a fixed formal language $L$, we are given a string $T$ of length $n$, and the task is to compute the minimal distance to $L$ from every prefix of $T$. We focus on the low-distance regime, where one must compute only the distances smaller than a given threshold $k$. In this work, our contribution is twofold:

- First, we show streaming algorithms, which access the input string $T$ only through a single left-to-right scan. Both for palindromes and squares, our algorithms use $O(k \cdot \mathrm{poly}\log n)$ space and time per character in the Hamming-distance case and $O(k^2 \cdot \mathrm{poly}\log n)$ space and time per character in the edit-distance case. These algorithms are randomised by necessity, and they err with probability inverse-polynomial in $n$.
- Second, we show deterministic read-only online algorithms, which are also provided with read-only random access to the already processed characters of $T$. Both for palindromes and squares, our algorithms use $O(k \cdot \mathrm{poly}\log n)$ space and time per character in the Hamming-distance case and $O(k^4 \cdot \mathrm{poly}\log n)$ space and amortised time per character in the edit-distance case.

Comment: Accepted to ISAAC'2
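
As a point of reference for the problem definition, the following is a naive baseline for the Hamming-distance case with palindromes: for each prefix, count the mirrored positions that disagree. It stores the whole string and takes quadratic time overall, so it has none of the space guarantees claimed in the abstract. The function name, and the convention of reporting `None` once a prefix needs more than `k` substitutions, are assumptions for illustration.

```python
def prefix_palindrome_hamming(text: str, k: int) -> list:
    """For every non-empty prefix of `text`, return its Hamming distance to
    the nearest palindrome of the same length, or None if that distance
    exceeds k."""
    out = []
    for i in range(1, len(text) + 1):
        mismatches = 0
        # Each mirrored pair (p, i-1-p) that disagrees costs one substitution.
        for p in range(i // 2):
            if text[p] != text[i - 1 - p]:
                mismatches += 1
                if mismatches > k:
                    break  # low-distance regime: stop counting past the threshold
        out.append(mismatches if mismatches <= k else None)
    return out
```

For instance, `prefix_palindrome_hamming("abcd", 1)` returns `[0, 1, 1, None]`, since the full string needs two substitutions to become a palindrome.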

arXiv.org e-Print Archive

Approximate Similarity Search Under Edit Distance Using Locality-Sensitive Hashing

Author: McCauley Samuel
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 24th International Conference on Database Theory (ICDT 2021)
Publication date: 08/07/2020

Edit distance similarity search, also called approximate pattern matching, is a fundamental problem with widespread database applications. The goal of the problem is to preprocess n strings of length d, to quickly answer queries q of the form: if there is a database string within edit distance r of q, return a database string within edit distance cr of q. Previous approaches to this problem either rely on very large (superconstant) approximation ratios c, or very small search radii r. Outside of a narrow parameter range, these solutions are not competitive with trivially searching through all n strings. In this work we give a simple and easy-to-implement hash function that can quickly answer queries for a wide range of parameters. Specifically, our strategy can answer queries in time Õ(d 3^r n^{1/c}). The best known practical results require c ≫ r to achieve any correctness guarantee; meanwhile, the best known theoretical results are very involved and difficult to implement, and require query time that can be loosely bounded below by 24^r. Our results significantly broaden the range of parameters for which there exist nontrivial theoretical bounds, while retaining the practicality of a locality-sensitive hash function.

arXiv.org e-Print Archive

Dagstuhl Research Online Publication Server

File Updates Under Random/Arbitrary Insertions And Deletions

Author: Cadambe Viveck
Jaggi Sidharth
Médard Muriel
Schwartz Moshe
Wang Qiwen
Publication date: 27/02/2015

A client/encoder edits a file, as modeled by an insertion-deletion (InDel) process. An old copy of the file is stored remotely at a data-centre/decoder, and is also available to the client. We consider the problem of throughput- and computationally-efficient communication from the client to the data-centre, to enable the server to update its copy to the newly edited file. We study two models for the source files/edit patterns: the random pre-edit sequence left-to-right random InDel (RPES-LtRRID) process, and the arbitrary pre-edit sequence arbitrary InDel (APES-AID) process. In both models, we consider the regime in which the number of insertions/deletions is a small (but constant) fraction of the original file. For both models we prove information-theoretic lower bounds on the best possible compression rates that enable file updates. Conversely, our compression algorithms use dynamic programming (DP) and entropy coding, and achieve rates that are approximately optimal.

Comment: The paper is an extended version of our paper to appear at ITW 201

arXiv.org e-Print Archive

DSpace@MIT

Crossref