Sublinear Algorithms for Approximating String Compressibility
We raise the question of approximating the compressibility of a string with respect to a fixed compression scheme, in sublinear time. We study this question in detail for two popular lossless compression schemes: run-length encoding (RLE) and a variant of Lempel-Ziv (LZ77), and present sublinear algorithms for approximating compressibility with respect to both schemes. We also give several lower bounds that show that our algorithms for both schemes cannot be improved significantly.
Our investigation of LZ77 yields results whose interest goes beyond the initial questions we set out to study. In particular, we prove combinatorial structural lemmas that relate the compressibility of a string with respect to LZ77 to the number of distinct short substrings contained in it (its ℓth subword complexity, for small ℓ). In addition, we show that approximating the compressibility with respect to LZ77 is related to approximating the support size of a distribution.
Funding: National Science Foundation (U.S.) (Award CCF-1065125); National Science Foundation (U.S.) (Award CCF-0728645); Marie Curie International Reintegration Grant PIRG03-GA-2008-231077; Israel Science Foundation (Grant 1147/09); Israel Science Foundation (Grant 1675/09)
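For intuition, the RLE side of the problem admits a very short sampling sketch: the RLE length of a string is, up to the per-run encoding cost, its number of runs, and run boundaries can be estimated by sampling positions. This is only an illustration of the sampling idea, not the authors' algorithm; the sample size `m` is a free parameter of the sketch.

```python
import random

def estimate_rle_runs(s: str, m: int) -> float:
    """Estimate the number of runs in s (a proxy for its RLE length)
    by sampling m positions uniformly at random.

    A position i >= 1 starts a new run iff s[i] != s[i-1]; the run
    count equals 1 + (number of such boundaries), so the sampled
    boundary fraction yields an additive-error estimate in O(m) time,
    independent of len(s).
    """
    n = len(s)
    if n <= 1:
        return float(n)
    hits = 0
    for _ in range(m):
        i = random.randrange(1, n)   # sample a candidate boundary
        if s[i] != s[i - 1]:
            hits += 1
    return 1 + (n - 1) * hits / m
```

With m = Θ(1/ε²) samples, a standard Chernoff-bound argument shows the boundary fraction, and hence the run count, is estimated to within additive error εn with constant probability.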
05291 Abstracts Collection -- Sublinear Algorithms
From 17.07.05 to 22.07.05, the Dagstuhl Seminar
05291 ``Sublinear Algorithms'' was held
in the International Conference and Research Center (IBFI),
Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
Substring Complexity in Sublinear Space
Shannon's entropy is a definitive lower bound for statistical compression.
Unfortunately, no such clear measure exists for the compressibility of
repetitive strings. Thus, ad-hoc measures are employed to estimate the
repetitiveness of strings, e.g., the size of the Lempel-Ziv parse or the
number of equal-letter runs of the Burrows-Wheeler transform. A more recent
one is the size γ of a smallest string attractor. Unfortunately, Kempa
and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et
al. [LATIN 2020] considered a new measure that is based on the function
counting the cardinalities of the sets of substrings of each length ℓ of
a string T of length n, also known as the substring complexity d_ℓ(T).
This new measure is defined as δ(T) = max{d_ℓ(T)/ℓ : ℓ ≥ 1} and lower
bounds all the measures previously considered. In particular, δ ≤ γ
always holds, and δ can be computed in O(n) time using O(n) working
space. Kociumaka et al. showed that if δ is given, one can construct an
O(δ log(n/δ))-sized representation of T supporting efficient direct
access and efficient pattern matching queries on T. Given that for highly
compressible strings, δ is significantly smaller than n, it is natural
to pose the following question: Can we compute δ efficiently using
sublinear working space?
It is straightforward to show that any algorithm computing δ using
O(b) space requires Ω(n²/b) time through a reduction
from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present
the following results: an O(n³/b²)-time and
O(b)-space algorithm to compute δ, for any b ∈ [1, n]; and
an Õ(n²/b)-time and Õ(b)-space algorithm to
compute δ, for any b ∈ [n^{2/3}, n].
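For concreteness, the measure in question is δ(T) = max_{ℓ≥1} d_ℓ(T)/ℓ, where d_ℓ(T) counts the distinct length-ℓ substrings of T. A naive baseline is only a few lines; its working space is far from sublinear, which is exactly what the paper's algorithms improve:

```python
def substring_complexity_delta(t: str) -> float:
    """Compute delta(t) = max over ell >= 1 of d_ell(t) / ell, where
    d_ell(t) is the number of distinct substrings of t of length ell.

    Naive baseline: O(n^2) time and up to O(n * ell) working space per
    length, via a set of all length-ell substrings.
    """
    n = len(t)
    best = 0.0
    for ell in range(1, n + 1):
        d_ell = len({t[i:i + ell] for i in range(n - ell + 1)})
        best = max(best, d_ell / ell)
    return best
```

For example, δ("abab") = 2, attained at ℓ = 1 (two distinct letters), while for the maximally repetitive "aaaa" every d_ℓ is 1 and δ = 1.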
Analysis and experimental evaluation of an approximation algorithm for the length of an optimal Lempel-Ziv parsing
We examine a previously known sublinear-time algorithm for approximating the length of a string’s optimal (i.e. shortest) Lempel-Ziv parsing (a.k.a. LZ77 factorization). This length is a measure of compressibility under the LZ77 compression algorithm, so the algorithm also estimates a string’s compressibility. The algorithm’s approximation approach is based on a connection between optimal Lempel-Ziv parsing length and the number of distinct substrings of different lengths in a string. Some aspects of the algorithm are described more explicitly than in earlier work, including the constraints on its input and how to distinguish between strings with short vs. long optimal parsings in sublinear time; several proofs (and pseudocode listings) are also more detailed than in earlier work. An implementation of the algorithm is provided.
We experimentally investigate the algorithm’s practical usefulness for estimating the compressibility of large collections of data. The algorithm is run on real-world data under a wide range of approximation parameter settings, and the accuracy of the resulting estimates is evaluated. The estimates turn out to be consistently highly inaccurate, albeit always inside the stated probabilistic error bounds. We conclude that the algorithm is not promising as a practical tool for estimating compressibility. We also examine the empirical connection between optimal parsing length and the number of distinct substrings of different lengths. The latter turns out to be a surprisingly accurate predictor of the former within our test data, which suggests avenues for future work.
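The two quantities being compared above are easy to compute exactly on small inputs. The sketch below pairs a naive O(n²)-time greedy self-referential LZ77 parse (factors may copy from overlapping earlier sources) with the distinct-substring counts d_ℓ; it is a toy baseline for sanity checks, not the sublinear-time estimator examined in the paper:

```python
def lz77_factor_count(t: str) -> int:
    """Length of the greedy self-referential LZ77 parse of t.

    Each factor is the longest prefix of the remaining suffix that
    also occurs starting at an earlier position (the source may
    overlap the factor itself), or a single fresh character.
    Naive O(n^2)-time implementation.
    """
    n, i, z = len(t), 0, 0
    while i < n:
        l = 1
        # bounding str.find to t[0 : i+l-1] forces any reported
        # occurrence of t[i:i+l] to start strictly before position i
        while i + l <= n and t.find(t[i:i + l], 0, i + l - 1) != -1:
            l += 1
        i += max(1, l - 1)   # the longest earlier match has length l-1
        z += 1
    return z

def distinct_substring_count(t: str, ell: int) -> int:
    """d_ell(t): number of distinct substrings of t of length ell."""
    return len({t[i:i + ell] for i in range(len(t) - ell + 1)})
```

For example, (ab)^k parses into the three factors a, b, (ab)^{k-1} for every k ≥ 2, while d_2 stays at 2, illustrating in miniature why distinct-substring counts track parsing length on repetitive inputs.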
Compressibility-Aware Quantum Algorithms on Strings
Sublinear time quantum algorithms have been established for many fundamental
problems on strings. This work demonstrates that new, faster quantum algorithms
can be designed when the string is highly compressible. We focus on two popular
and theoretically significant compression algorithms -- the Lempel-Ziv77
algorithm (LZ77) and the Run-length-encoded Burrows-Wheeler Transform (RL-BWT),
and obtain the results below.
We first provide a quantum algorithm running in Õ(√(zn)) time
for finding the LZ77 factorization of an input string with z
factors. Combined with multiple existing results, this yields an
Õ(√(rn))-time quantum algorithm for finding the RL-BWT encoding
with r BWT runs. Note that r = Θ̃(z). We complement these
results with lower bounds proving that our algorithms are optimal (up to
polylog factors).
Next, we study the problem of compressed indexing, where we provide an
Õ(√(zn))-time quantum algorithm for constructing a recently
designed Õ(z)-space structure with equivalent capabilities as the
suffix tree. This data structure is then applied to numerous problems to obtain
sublinear-time quantum algorithms when the input is highly compressible. For
example, we show that the longest common substring of two strings of total
length n can be computed in Õ(√(zn)) time, where z is the
number of factors in the LZ77 factorization of their concatenation. This beats
the best known Õ(n^{2/3})-time quantum algorithm when z is
sufficiently small.
07411 Abstracts Collection -- Algebraic Methods in Computational Complexity
From 07.10. to 12.10., the Dagstuhl Seminar 07411 ``Algebraic Methods in Computational Complexity'' was held in the International Conference and Research Center (IBFI),
Schloss Dagstuhl.
During the seminar, several participants presented their current
research, and ongoing work and open problems were discussed. Abstracts of
the presentations given during the seminar as well as abstracts of
seminar results and ideas are put together in this paper. The first section
describes the seminar topics and goals in general.
Links to extended abstracts or full papers are provided, if available.
GPU-accelerated k-mer counting
K-mer counting is the process of building a histogram of all substrings of length k of an input string S. The problem itself is quite simple, but counting k-mers efficiently for a very large input string is a difficult task that has been researched extensively. In recent years the performance of k-mer counting algorithms has improved significantly, and there have been efforts to use graphics processing units (GPUs) for k-mer counting. The goal of this thesis was to design, implement and benchmark a GPU-accelerated k-mer counting algorithm, SNCGPU. The results showed that SNCGPU compares reasonably well to the Gerbil k-mer counting algorithm on a mid-range desktop computer, but does not utilize the resources of a high-end computing platform as efficiently. The implementation of SNCGPU is available as open-source software.
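Stripped of all performance engineering, the task itself fits in a line; a minimal single-threaded CPU baseline for reference (the GPU work is about doing this at scale, with hashing, partitioning, and device memory management):

```python
from collections import Counter

def count_kmers(s: str, k: int) -> Counter:
    """Build the k-mer histogram of s: a count for every substring of
    length k. Single-threaded baseline; GPU counters such as SNCGPU
    and Gerbil parallelize exactly this histogram construction."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))
```

A string of length n contributes n − k + 1 (not necessarily distinct) k-mers, so `count_kmers("ACGTACGT", 4)` holds five 4-mers, with "ACGT" counted twice.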
Indexing Highly Repetitive String Collections
Two decades ago, a breakthrough in indexing string collections made it
possible to represent them within their compressed space while at the same time
offering indexed search functionalities. As this new technology permeated
through applications like bioinformatics, the string collections experienced
growth that outpaces Moore's Law and challenges our ability to handle them
even in compressed form. It turns out, fortunately, that many of these rapidly
growing string collections are highly repetitive, so that their information
content is orders of magnitude lower than their plain size. The statistical
compression methods used for classical collections, however, are blind to this
repetitiveness, and therefore a new set of techniques has been developed in
order to properly exploit it. The resulting indexes form a new generation of
data structures able to handle the huge repetitive string collections that we
are facing.
In this survey we cover the algorithmic developments that have led to these
data structures. We describe the distinct compression paradigms that have been
used to exploit repetitiveness, the fundamental algorithmic ideas that form the
base of all the existing indexes, and the various structures that have been
proposed, comparing them both in theoretical and practical aspects. We conclude
with the current challenges in this fascinating field.