78 research outputs found

    Optimal Encodings for Range Min-Max and Top-k

    No full text
    In this paper we consider various encoding problems for range queries on arrays. In these problems, the goal is that the encoding occupies the information-theoretic minimum space required to answer a particular set of range queries. Given an array A[1..n], a range top-k query on an arbitrary range [i,j] \subseteq [1,n] asks us to return the ordered set of indices \{l_1, ..., l_k\} such that A[l_m] is the m-th largest element in A[i..j]. We present optimal encodings for range top-k queries, as well as for a new problem which we call range min-max, in which the goal is to return the indices of both the minimum and maximum element in a range.
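
    To make the query semantics concrete, here is a brute-force reference in Python (a minimal sketch for illustration only, not the paper's space-optimal encoding; indices are 0-based and ranges inclusive):

        def range_top_k(A, i, j, k):
            # Indices l_1, ..., l_k such that A[l_m] is the m-th largest
            # element of A[i..j]. Brute force, O(r log r) for a range of
            # length r; no succinct encoding involved.
            ranked = sorted(range(i, j + 1), key=lambda l: A[l], reverse=True)
            return ranked[:k]

        def range_min_max(A, i, j):
            # Indices of both the minimum and the maximum element in A[i..j].
            rng = range(i, j + 1)
            return min(rng, key=lambda l: A[l]), max(rng, key=lambda l: A[l])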

    Encodings of Range Maximum-Sum Segment Queries and Applications

    No full text
    Given an array A containing arbitrary (positive and negative) numbers, we consider the problem of supporting range maximum-sum segment queries on A: i.e., given an arbitrary range [i,j], return the subrange [i',j'] \subseteq [i,j] such that the sum of the numbers in A[i'..j'] is maximized. Chen and Chao [Disc. App. Math. 2007] presented a data structure for this problem that occupies {\Theta}(n) words, can be constructed in {\Theta}(n) time, and supports queries in {\Theta}(1) time. Our first result is that if only the indices [i',j'] are desired (rather than the maximum sum achieved in that subrange), then it is possible to reduce the space to {\Theta}(n) bits, regardless of the numbers stored in A, while retaining the same construction and query time. We also improve the best known space lower bound for any data structure that supports range maximum-sum segment queries from n bits to 1.89113n - {\Theta}(lg n) bits, for sufficiently large values of n. Finally, we provide a new application of this data structure which simplifies a previously known linear-time algorithm for finding k-covers: i.e., given an array A of n numbers and a number k, find k disjoint subranges [i_1,j_1], ..., [i_k,j_k] such that the total sum of all the numbers in the subranges is maximized.
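
    As a single-query baseline over the whole array, Kadane's algorithm finds the maximizing indices in one linear scan; this sketch (my own, 0-based) only fixes the semantics, whereas the paper supports arbitrary subranges [i,j] from a {\Theta}(n)-bit encoding:

        def max_sum_segment(A):
            # Return (i', j') maximizing sum(A[i'..j']) over the whole array.
            # Kadane's algorithm: O(n) time for one query on one range.
            best_sum, best = float('-inf'), (0, 0)
            cur_sum, cur_start = 0, 0
            for j, x in enumerate(A):
                if cur_sum <= 0:
                    cur_sum, cur_start = x, j   # restart the running segment
                else:
                    cur_sum += x                # extend the running segment
                if cur_sum > best_sum:
                    best_sum, best = cur_sum, (cur_start, j)
            return best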

    Weighted ancestors in suffix trees

    Full text link
    The classical, ubiquitous predecessor problem is to construct a data structure for a set of integers that supports fast predecessor queries. Its generalization to weighted trees, a.k.a. the weighted ancestor problem, has been extensively explored and successfully reduced to the predecessor problem. It is known that any solution for both problems with an input set from a polynomially bounded universe that preprocesses a weighted tree in O(n polylog(n)) space requires \Omega(log log n) query time. Perhaps the most important and frequent application of the weighted ancestor problem is for suffix trees. It has been a long-standing open question whether the weighted ancestor problem has better bounds for suffix trees. We answer this question positively: we show that a suffix tree built for a text w[1..n] can be preprocessed using O(n) extra space, so that queries can be answered in O(1) time. Thus we improve the running times of several applications. Our improvement is based on a number of data structure tools and a periodicity-based insight into the combinatorial structure of a suffix tree. Comment: 27 pages, LNCS format. A condensed version will appear in ESA 2014.
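
    To pin down the query, a naive parent-pointer walk (an illustrative sketch, assuming node weights strictly increase from the root, as string depths do in a suffix tree, and that d does not exceed the weight of v):

        class Node:
            def __init__(self, weight, parent=None):
                self.weight = weight    # e.g. string depth in a suffix tree
                self.parent = parent    # None at the root

        def weighted_ancestor(v, d):
            # Highest ancestor of v whose weight is still at least d.
            # Naive O(depth) walk; the paper answers this in O(1) on suffix
            # trees after O(n)-space preprocessing.
            while v.parent is not None and v.parent.weight >= d:
                v = v.parent
            return v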

    Compressed Membership for NFA (DFA) with Compressed Labels is in NP (P)

    Get PDF
    In this paper, a compressed membership problem for finite automata, both deterministic and non-deterministic, with compressed transition labels is studied. The compression is represented by straight-line programs (SLPs), i.e. context-free grammars generating exactly one string. A novel technique of dealing with SLPs is introduced: the SLPs are recompressed, so that substrings of the input text are encoded in the SLPs labelling the transitions of the NFA (DFA) in the same way as in the SLP representing the input text. To this end, the SLPs are locally decompressed and then recompressed in a uniform way. Furthermore, such recompression induces only small changes in the automaton; in particular, the size of the automaton remains polynomial. Using this technique it is shown that compressed membership for NFA with compressed labels is in NP, thus confirming the conjecture of Plandowski and Rytter and extending the partial result of Lohrey and Mathissen; as it is already known that this problem is NP-hard, we settle its exact computational complexity. Moreover, the same technique applied to compressed membership for DFA with compressed labels yields that this problem is in P; previously, only the trivial PSPACE upper bound was known for it.
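
    Since everything here hinges on what an SLP is, a minimal sketch (my own toy encoding, not from the paper): every nonterminal has exactly one rule, so the grammar derives exactly one string, and n rules can derive a string of length exponential in n.

        # Rule i is either a terminal character or a pair of earlier rules,
        # so expansion always terminates.
        slp = [
            ('a',),     # X0 -> a
            ('b',),     # X1 -> b
            (0, 1),     # X2 -> X0 X1   derives "ab"
            (2, 2),     # X3 -> X2 X2   derives "abab"
        ]

        def expand(slp, i):
            rule = slp[i]
            if len(rule) == 1:          # terminal rule
                return rule[0]
            left, right = rule
            return expand(slp, left) + expand(slp, right)

        print(expand(slp, 3))           # "abab"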

    A Combinatorial Approach to Collapsing Words

    Get PDF

    RLE Edit Distance in Near Optimal Time

    Get PDF
    We show that the edit distance between two run-length encoded strings of compressed lengths m and n, respectively, can be computed in O(mn log(mn)) time. This improves the previous record by a factor of O(n/log(mn)). The running time of our algorithm is within subpolynomial factors of optimal, subject to the standard SETH-hardness assumption. This effectively closes a line of algorithmic research started in 1993.
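
    For reference, run-length encoding and the compressed lengths m and n it produces (a minimal sketch):

        from itertools import groupby

        def rle(s):
            # "aaabbc" -> [('a', 3), ('b', 2), ('c', 1)]; the number of runs
            # is the compressed length that m and n above refer to.
            return [(ch, len(list(run))) for ch, run in groupby(s)]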

    Substring Complexity in Sublinear Space

    Get PDF
    Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel-Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size \gamma of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing \gamma is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure that is based on the function S_T counting the cardinalities of the sets of substrings of each length of T, also known as the substring complexity. This new measure is defined as \delta = \sup\{S_T(k)/k : k \geq 1\} and lower bounds all the measures previously considered. In particular, \delta \leq \gamma always holds, and \delta can be computed in O(n) time using \Omega(n) working space. Kociumaka et al. showed that if \delta is given, one can construct an O(\delta \log(n/\delta))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings \delta is significantly smaller than n, it is natural to pose the following question: can we compute \delta efficiently using sublinear working space? It is straightforward to show that any algorithm computing \delta using O(b) space requires \Omega(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an O(n^3/b^2)-time and O(b)-space algorithm to compute \delta, for any b \in [1,n]; and an \tilde{O}(n^2/b)-time and O(b)-space algorithm to compute \delta, for any b \in [n^{2/3},n].
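
    A direct implementation of the definition makes the space issue visible (a brute-force sketch of my own; it materialises up to quadratically many substrings, i.e. exactly the \Omega(n) or worse working space the paper's algorithms avoid):

        def delta(T):
            # delta = sup over k >= 1 of S_T(k)/k, where S_T(k) is the number
            # of distinct length-k substrings of T. The sup is attained for
            # some k <= n, since S_T(k) = 0 beyond that.
            n = len(T)
            return max(len({T[i:i + k] for i in range(n - k + 1)}) / k
                       for k in range(1, n + 1))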

    On Maximal Unbordered Factors

    Get PDF
    Given a string S of length n, its maximal unbordered factor is the longest factor which does not have a border. In this work we investigate the relationship between n and the length of the maximal unbordered factor of S. We prove that for alphabets of size \sigma \ge 5 the expected length of the maximal unbordered factor of a string of length n is at least 0.99n (for sufficiently large values of n). As an application of this result, we propose a new algorithm for computing the maximal unbordered factor of a string. Comment: Accepted to the 26th Annual Symposium on Combinatorial Pattern Matching (CPM 2015).
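
    To fix the terminology, a border of a word is a nonempty proper prefix that is also a suffix; a brute-force sketch (purely to pin down the definitions, not the paper's algorithm):

        def has_border(w):
            # True if some nonempty proper prefix of w equals a suffix of w.
            return any(w[:l] == w[-l:] for l in range(1, len(w)))

        def maximal_unbordered_factor(s):
            # Longest factor (substring) of s with no border, found by trying
            # lengths from longest to shortest.
            n = len(s)
            for length in range(n, 0, -1):
                for i in range(n - length + 1):
                    w = s[i:i + length]
                    if not has_border(w):
                        return w
            return ""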

    The Dynamic k-Mismatch Problem

    Get PDF
    The text-to-pattern Hamming distances problem asks to compute the Hamming distances between a given pattern of length m and all length-m substrings of a given text of length n \ge m. We focus on the k-mismatch version of the problem, where a distance needs to be returned only if it does not exceed a threshold k. We assume n \le 2m (in general, one can partition the text into overlapping blocks). In this work, we show data structures for the dynamic version of this problem supporting two operations: an update performs a single-letter substitution in the pattern or the text, and a query, given an index i, returns the Hamming distance between the pattern and the text substring starting at position i, or reports that it exceeds k. First, we show a data structure with \tilde{O}(1) update and \tilde{O}(k) query time. Then we show that \tilde{O}(k) update and \tilde{O}(1) query time is also possible. These two provide an optimal trade-off for the dynamic k-mismatch problem with k \le \sqrt{n}: we prove that, conditioned on the strong 3SUM conjecture, one cannot simultaneously achieve k^{1-\Omega(1)} time for all operations. For k \ge \sqrt{n}, we give another lower bound, conditioned on the Online Matrix-Vector conjecture, that excludes algorithms taking n^{1/2-\Omega(1)} time per operation. This is tight for constant-sized alphabets: Clifford et al. (STACS 2018) achieved \tilde{O}(\sqrt{n}) time per operation in that case, but with \tilde{O}(n^{3/4}) time per operation for large alphabets. We improve and extend this result with an algorithm that, given 1 \le x \le k, achieves update time \tilde{O}(n/k + \sqrt{nk/x}) and query time \tilde{O}(x). In particular, for k \ge \sqrt{n}, an appropriate choice of x yields \tilde{O}(\sqrt[3]{nk}) time per operation, which is \tilde{O}(n^{2/3}) when no threshold k is provided.
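
    A naive query makes the interface precise (an illustrative 0-based sketch of my own; the paper's data structures reach the \tilde{O}(k)/\tilde{O}(1) trade-offs above instead of scanning):

        def k_mismatch(pattern, text, i, k):
            # Hamming distance between pattern and text[i : i+m], reported
            # only if it does not exceed k (None otherwise). O(m) per query.
            m = len(pattern)
            dist = 0
            for a, b in zip(pattern, text[i:i + m]):
                if a != b:
                    dist += 1
                    if dist > k:
                        return None     # distance exceeds the threshold k
            return dist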

    Optimal Computation of Avoided Words

    Get PDF
    The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of w, denoted by std(w), effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length k > 2 is a \rho-avoided word in x if std(w) \leq \rho, for a given threshold \rho < 0. Notice that such a word may be completely absent from x. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large k. In this article, we propose an O(n)-time and O(n)-space algorithm to compute all \rho-avoided words of length k in a given sequence x of length n over a fixed-sized alphabet. We also present a time-optimal O(\sigma n)-time and O(\sigma n)-space algorithm to compute all \rho-avoided words (of any length) in a sequence of length n over an alphabet of size \sigma. Furthermore, we provide a tight asymptotic upper bound for the number of \rho-avoided words and the expected length of the longest one. We make available an open-source implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.
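
    To illustrate the quantity being computed, a brute-force sketch assuming the maximal-order Markov estimate E(w) = f(prefix) f(suffix) / f(infix) common in the avoided-words literature; the paper's exact definition and normalisation may differ:

        from math import sqrt

        def occ(x, u):
            # Number of (possibly overlapping) occurrences of u in x.
            return sum(x.startswith(u, i) for i in range(len(x) - len(u) + 1))

        def std_w(x, w):
            # Deviation of observed from expected frequency of w, |w| > 2.
            # Assumed model: E(w) = f(w[:-1]) * f(w[1:]) / f(w[1:-1]).
            expected = occ(x, w[:-1]) * occ(x, w[1:]) / max(occ(x, w[1:-1]), 1)
            return (occ(x, w) - expected) / max(sqrt(expected), 1.0)

        # w is rho-avoided in x if std_w(x, w) <= rho for a threshold rho < 0.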