14 research outputs found

Universal Compressed Text Indexing

Authors: Navarro Gonzalo; Prezza Nicola
Publication date: 06/09/2018

The rise of repetitive datasets has lately generated a lot of interest in compressed self-indexes based on dictionary compression, a rich and heterogeneous family that exploits text repetitions in different ways. For each such compression scheme, several different indexing solutions have been proposed in the last two decades. To date, the fastest indexes for repetitive texts are based on the run-length compressed Burrows-Wheeler transform and on the Compact Directed Acyclic Word Graph. The most space-efficient indexes, on the other hand, are based on the Lempel-Ziv parsing and on grammar compression. Indexes for more universal schemes such as collage systems and macro schemes have not yet been proposed. Very recently, Kempa and Prezza [STOC 2018] showed that all dictionary compressors can be interpreted as approximation algorithms for the smallest string attractor, that is, a set of text positions capturing all distinct substrings. Starting from this observation, in this paper we develop the first universal compressed self-index, that is, the first indexing data structure based on string attractors, which can therefore be built on top of any dictionary-compressed text representation. Let $\gamma$ be the size of a string attractor for a text of length $n$. Our index takes $O(\gamma\log(n/\gamma))$ words of space and supports locating the $occ$ occurrences of any pattern of length $m$ in $O(m\log n + occ\log^{\epsilon}n)$ time, for any constant $\epsilon>0$. This is, in particular, the first index for general macro schemes and collage systems. Our result shows that the relation between indexing and compression is much deeper than what was previously thought: the simple property standing at the core of all dictionary compressors is sufficient to support fast indexed queries.

Comment: Fixed with reviewer's comment
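To make the notion of a string attractor concrete, the following is a minimal brute-force sketch of the definition quoted in the abstract: a set of text positions such that every distinct substring has at least one occurrence containing one of those positions. The function name `is_string_attractor` and the use of 0-based positions are assumptions made for illustration; this is a definition checker only, not the index described in the paper, and it takes time polynomial in the text length.

```python
def is_string_attractor(text, positions):
    """Check the string-attractor definition by brute force.

    `positions` is a set of 0-based text positions (an assumption of this
    sketch). The set is an attractor if every distinct substring of `text`
    has at least one occurrence text[i:j] with i <= p < j for some p.
    """
    n = len(text)
    all_substrings = set()
    covered = set()
    for i in range(n):
        for j in range(i + 1, n + 1):
            sub = text[i:j]
            all_substrings.add(sub)
            # This occurrence counts if it spans at least one chosen position.
            if any(i <= p < j for p in positions):
                covered.add(sub)
    return covered == all_substrings


if __name__ == "__main__":
    print(is_string_attractor("abab", {0, 1}))  # both letters are covered
    print(is_string_attractor("abab", {0}))     # no occurrence of "b" is covered
```

In this sketch, `{0, 1}` is accepted for `"abab"` while `{0}` is rejected, because neither occurrence of `"b"` contains position 0.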

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Repositorio Académico de la Universidad de Chile

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Decompressing Lempel-Ziv Compressed Text

Authors: Bille Philip; Ettienne Mikko Berggren; Gagie Travis; Gørtz Inge Li; Prezza Nicola
Publication date: 04/11/2019

We consider the problem of decompressing the Lempel-Ziv 77 representation of a string $S$ of length $n$ using a working space as close as possible to the size $z$ of the input. The folklore solution for the problem runs in $O(n)$ time but requires random access to the whole decompressed text. Another folklore solution is to convert LZ77 into a grammar of size $O(z\log(n/z))$ and then stream $S$ in linear time. In this paper, we show that $O(n)$ time and $O(z)$ working space can be achieved for constant-size alphabets. On general alphabets of size $\sigma$, we describe (i) a trade-off achieving $O(n\log^\delta \sigma)$ time and $O(z\log^{1-\delta}\sigma)$ space for any $0\leq \delta\leq 1$, and (ii) a solution achieving $O(n)$ time and $O(z\log\log (n/z))$ space. The latter solution, in particular, dominates both folklore algorithms for the problem. Our solutions can, more generally, extract any specified subsequence of $S$ with little overhead on top of the linear running time and working space. As an immediate corollary, we show that our techniques yield improved results for pattern matching problems on LZ77-compressed text.
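The first folklore solution mentioned in the abstract can be sketched as follows. The phrase format `(source, length, next_char)` and the function name `lz77_decode` are assumptions made for illustration, since the abstract does not fix an encoding. The sketch shows why this approach needs random access to the whole decompressed text: every copy reads from an arbitrary earlier position of the output buffer, so the buffer of size $n$ must be kept in memory.

```python
def lz77_decode(phrases):
    """Naive LZ77 decoding with random access to the decoded text.

    Each phrase is a hypothetical triple (source, length, next_char):
    copy `length` characters starting at 0-based position `source` of the
    output produced so far, then append `next_char` (None for no literal).
    Runs in time linear in the output length, but keeps the whole output.
    """
    out = []
    for source, length, next_char in phrases:
        # Copy one character at a time so that a phrase may overlap itself.
        for k in range(length):
            out.append(out[source + k])
        if next_char is not None:
            out.append(next_char)
    return "".join(out)


if __name__ == "__main__":
    # Three phrases; the last one copies four characters and overlaps itself.
    print(lz77_decode([(0, 0, "a"), (0, 0, "b"), (0, 4, "a")]))
```

The results in the abstract concern avoiding exactly this full-size buffer; the sketch only illustrates the baseline they improve on.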

arXiv.org e-Print Archive

Crossref

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Online Research Database In Technology

Approximating edit distance in the fully dynamic model

Authors: Kociumaka Tomasz; Mukherjee Anish; Saha Barna
Publication venue: IEEE
Publication date: 22/12/2023

Warwick Research Archives Portal Repository

Approximating Edit Distance in the Fully Dynamic Model

Authors: Kociumaka Tomasz; Mukherjee Anish; Saha Barna
Publication date: 14/07/2023

The edit distance is a fundamental measure of sequence similarity, defined as the minimum number of character insertions, deletions, and substitutions needed to transform one string into the other. Given two strings of length at most $n$, simple dynamic programming computes their edit distance exactly in $O(n^2)$ time, which is also the best possible (up to subpolynomial factors) assuming the Strong Exponential Time Hypothesis (SETH). The last few decades have seen tremendous progress in edit distance approximation, where the runtime has been brought down to subquadratic, near-linear, and even sublinear at the cost of approximation. In this paper, we study the dynamic edit distance problem, where the strings change dynamically as the characters are substituted, inserted, or deleted over time. Each change may happen at any location of either of the two strings. The goal is to maintain the (exact or approximate) edit distance of such dynamic strings while minimizing the update time. The exact edit distance can be maintained in $\tilde{O}(n)$ time per update (Charalampopoulos, Kociumaka, Mozes; 2020), which is again tight assuming SETH. Unfortunately, even with the unprecedented progress in edit distance approximation in the static setting, strikingly little is known regarding dynamic edit distance approximation. Using off-the-shelf tools, it is possible to achieve an $O(n^{c})$-approximation in $n^{0.5-c+o(1)}$ update time for any constant $c\in [0,\frac16]$. Improving upon this trade-off remains open. The contribution of this work is a dynamic $n^{o(1)}$-approximation algorithm with amortized expected update time of $n^{o(1)}$. In other words, we bring the approximation-ratio and update-time product down to $n^{o(1)}$. Our solution utilizes an elegant framework of precision sampling tree for edit distance approximation (Andoni, Krauthgamer, Onak; 2010).

Comment: Accepted to FOCS 202
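The "simple dynamic programming" referred to in the abstract is the textbook quadratic-time recurrence, sketched below for the static setting. The function name `edit_distance` is an assumption, and the sketch is not the dynamic or approximate algorithm of the paper. It keeps only two rows of the table, so it uses linear space while still taking $O(n^2)$ time.

```python
def edit_distance(a, b):
    """Exact edit distance via the textbook dynamic programme.

    prev[j] holds the distance between the first i-1 characters of `a`
    and the first j characters of `b`; cur is the row for i characters.
    """
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]  # deleting all i characters of a's prefix
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
```

Recomputing this table from scratch after every update costs quadratic time per change, which is the baseline that the dynamic algorithms discussed in the abstract improve on.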

arXiv.org e-Print Archive

Breaking the $O(n)$-Barrier in the Construction of Compressed Suffix Arrays

Authors: Kempa Dominik; Kociumaka Tomasz
Publication date: 23/06/2021

The suffix array, describing the lexicographic order of suffixes of a given text, is the central data structure in string algorithms. The suffix array of a length-$n$ text uses $\Theta(n \log n)$ bits, which is prohibitive in many applications. To address this, Grossi and Vitter [STOC 2000] and, independently, Ferragina and Manzini [FOCS 2000] introduced space-efficient versions of the suffix array, known as the compressed suffix array (CSA) and the FM-index. For a length-$n$ text over an alphabet of size $\sigma$, these data structures use only $O(n \log \sigma)$ bits. Immediately after their discovery, they almost completely replaced plain suffix arrays in practical applications, and a race started to develop efficient construction procedures. Yet, after more than 20 years, even for $\sigma=2$, the fastest algorithm remains stuck at $O(n)$ time [Hon et al., FOCS 2003], which is slower by a $\Theta(\log n)$ factor than the lower bound of $\Omega(n / \log n)$ (following simply from the necessity to read the entire input). We break this long-standing barrier with a new data structure that takes $O(n \log \sigma)$ bits, answers suffix array queries in $O(\log^{\epsilon} n)$ time, and can be constructed in $O(n\log \sigma / \sqrt{\log n})$ time using $O(n\log \sigma)$ bits of space. Our result is based on several new insights into the recently developed notion of string synchronizing sets [STOC 2019]. In particular, compared to their previous applications, we eliminate orthogonal range queries, replacing them with new queries that we dub prefix rank and prefix selection queries. As a further demonstration of our techniques, we present a new pattern-matching index that simultaneously minimizes the construction time and the query time among all known compact indexes (i.e., those using $O(n \log \sigma)$ bits).

Comment: 41 page
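As a reminder of what a plain suffix array stores, here is a deliberately naive sketch that sorts suffix start positions by the suffixes they denote. The function name `naive_suffix_array` is an assumption; this is neither the compressed suffix array nor the construction algorithm from the abstract, and it is far slower than the bounds discussed there. It only illustrates the object being compressed: $n$ integers, hence $\Theta(n \log n)$ bits.

```python
def naive_suffix_array(text):
    """Return the start positions of all suffixes in lexicographic order.

    Sorting by slices compares whole suffixes, so this is far from
    linear time; it is meant purely as an executable definition.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


if __name__ == "__main__":
    sa = naive_suffix_array("banana")
    for i in sa:
        # Print each position next to the suffix it denotes.
        print(i, "banana"[i:])
```

For `"banana"` the suffixes in sorted order are `a`, `ana`, `anana`, `banana`, `na`, `nana`, so the array is `[5, 3, 1, 0, 4, 2]`.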

arXiv.org e-Print Archive

Small space and streaming pattern matching with k edits

Authors: Kociumaka Tomasz; Porat Ely; Starikovskaya Tatiana
Publication date: 10/06/2021

In this work, we revisit the fundamental and well-studied problem of approximate pattern matching under edit distance. Given an integer $k$, a pattern $P$ of length $m$, and a text $T$ of length $n \ge m$, the task is to find substrings of $T$ that are within edit distance $k$ from $P$. Our main result is a streaming algorithm that solves the problem in $\tilde{O}(k^5)$ space and $\tilde{O}(k^8)$ amortised time per character of the text, providing answers correct with high probability. (Hereafter, $\tilde{O}(\cdot)$ hides a $\mathrm{poly}(\log n)$ factor.) This answers a decade-old question: since the discovery of a $\mathrm{poly}(k\log n)$-space streaming algorithm for pattern matching under Hamming distance by Porat and Porat [FOCS 2009], the existence of an analogous result for edit distance remained open. Up to this work, no $\mathrm{poly}(k\log n)$-space algorithm was known even in the simpler semi-streaming model, where $T$ comes as a stream but $P$ is available for read-only access. In this model, we give a deterministic algorithm that achieves slightly better complexity. In order to develop the fully streaming algorithm, we introduce a new edit distance sketch parametrised by integers