Search CORE

97 research outputs found

Size-constrained Weighted Ancestors with Applications

Author: Bille Philip
Nekrich Yakov
Pissis Solon P.
Publication venue
Publication date: 27/11/2023
Field of study

The weighted ancestor problem on a rooted node-weighted tree

T

is a generalization of the classic predecessor problem: construct a data structure for a set of integers that supports fast predecessor queries. Both problems are known to require

\Omega(\log\log n)

time for queries provided

\mathcal{O}(n\text{ poly} \log n)

space is available, where

n

is the input size. The weighted ancestor problem has attracted a lot of attention by the combinatorial pattern matching community due to its direct application to suffix trees. In this formulation of the problem, the nodes are weighted by string depth. This attention has culminated in a data structure for weighted ancestors in suffix trees with

\mathcal{O}(1)

query time and an

\mathcal{O}(n)

-time construction algorithm [Belazzougui et al., CPM 2021]. In this paper, we consider a different version of the weighted ancestor problem, where the nodes are weighted by any function

\textsf{weight}

that maps the nodes of

T

to positive integers, such that

\textsf{weight}(u)\le \textsf{size}(u)

for any node

u

and

\textsf{weight}(u_1)\le \textsf{weight}(u_2)

if node

u_1

is a descendant of node

u_2

, where

\textsf{size}(u)

is the number of nodes in the subtree rooted at

u

. In the size-constrained weighted ancestor (SWAQ) problem, for any node

u

T

and any integer

k

, we are asked to return the lowest ancestor

w

u

with weight at least

k

. We show that for any rooted tree with

n

nodes, we can locate node

w

\mathcal{O}(1)

time after

\mathcal{O}(n)

-time preprocessing. In particular, this implies a data structure for the SWAQ problem in suffix trees with

\mathcal{O}(1)

query time and

\mathcal{O}(n)

-time preprocessing, when the nodes are weighted by

\textsf{weight}

. We also show several string-processing applications of this result

arXiv.org e-Print Archive

Recommended from our members

Text Indexing for Long Patterns: Anchors are All you Need

Author: Ayad L
Loukides G
Pissis S
Publication venue: Association for Computing Machinery
Publication date: 10/07/2023
Field of study

PVLDB Artifact Availability: The source code, data, and/or other artifacts have been made available at https://github.com/lorrainea/BDA- index.Copyright © 2023 the owner/author(s). In many real-world database systems, a large fraction of the data is represented by strings: sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (i) index space; (ii) query time; (iii) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound l on the length of the queried patterns --- which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors, by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index.European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreements No 872539 and 956229, respectively; and by UKRI through REPHRAIN (EP/V011189/1)

Brunel University Research Archive

Substring Complexity in Sublinear Space

Author: Bernardini Giulia
Fici Gabriele
Gawrychowski Paweł
Pissis Solon P.
Publication venue
Publication date: 16/07/2020
Field of study

Shannon's entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad-hoc measures are employed to estimate the repetitiveness of strings, e.g., the size

z

of the Lempel-Ziv parse or the number

r

of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size

\gamma

of a smallest string attractor. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing

\gamma

is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure that is based on the function

S_T

counting the cardinalities of the sets of substrings of each length of

T

, also known as the substring complexity. This new measure is defined as

\delta= \sup\{S_T(k)/k, k\geq 1\}

and lower bounds all the measures previously considered. In particular,

\delta\leq \gamma

always holds and

\delta

can be computed in

\mathcal{O}(n)

time using

\Omega(n)

working space. Kociumaka et al. showed that if

\delta

is given, one can construct an

\mathcal{O}(\delta \log \frac{n}{\delta})

-sized representation of

T

supporting efficient direct access and efficient pattern matching queries on

T

. Given that for highly compressible strings,

\delta

is significantly smaller than

n

, it is natural to pose the following question: Can we compute

\delta

efficiently using sublinear working space? It is straightforward to show that any algorithm computing

\delta

using

\mathcal{O}(b)

space requires

\Omega(n^{2-o(1)}/b)

time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We present the following results: an

\mathcal{O}(n^3/b^2)

-time and

\mathcal{O}(b)

-space algorithm to compute

\delta

, for any

b\in[1,n]

; and an

\tilde{\mathcal{O}}(n^2/b)

-time and

\mathcal{O}(b)

-space algorithm to compute

\delta

, for any

b\in[n^{2/3},n]

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Trieste

CWI's Institutional Repository

Comparing Elastic-Degenerate Strings: Algorithms, Lower Bounds, and Applications

Author: Gabory Esteban
Mwaniki Moses Njagi
Pisanti Nadia
Pissis Solon P.
Radoszewski Jakub
Sweering Michelle
Zuba Wiktor
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Publication date: 01/01/2023
Field of study

An elastic-degenerate (ED) string T is a sequence of n sets T[1], . . ., T[n] containing m strings in total whose cumulative length is N. We call n, m, and N the length, the cardinality and the size of T, respectively. The language of T is defined as L(T) = {S1 · · · Sn : Si ∈ T[i] for all i ∈ [1, n]}. ED strings have been introduced to represent a set of closely-related DNA sequences, also known as a pangenome. The basic question we investigate here is: Given two ED strings, how fast can we check whether the two languages they represent have a nonempty intersection? We call the underlying problem the ED String Intersection (EDSI) problem. For two ED strings T1 and T2 of lengths n1 and n2, cardinalities m1 and m2, and sizes N1 and N2, respectively, we show the following: There is no O((N1N2)1−ϵ)-time algorithm, thus no O ((N1m2 + N2m1)1−ϵ)-time algorithm and no O ((N1n2 + N2n1)1−ϵ)-time algorithm, for any constant ϵ > 0, for EDSI even when T1 and T2 are over a binary alphabet, unless the Strong Exponential-Time Hypothesis is false. There is no combinatorial O((N1 + N2)1.2−ϵf(n1, n2))-time algorithm, for any constant ϵ > 0 and any function f, for EDSI even when T1 and T2 are over a binary alphabet, unless the Boolean Matrix Multiplication conjecture is false. An O(N1 log N1 log n1 + N2 log N2 log n2)-time algorithm for outputting a compact (RLE) representation of the intersection language of two unary ED strings. In the case when T1 and T2 are given in a compact representation, we show that the problem is NP-complete. An O(N1m2 + N2m1)-time algorithm for EDSI. An Õ(N1ω−1n2 + N2ω−1n1)-time algorithm for EDSI, where ω is the exponent of matrix multiplication; the Õ notation suppresses factors that are polylogarithmic in the input size. We also show that the techniques we develop have applications outside of ED string comparison

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

Faster algorithms for longest common substring

Author: Charalampopoulos P. (Panagiotis)
Kociumaka T. (Tomasz)
Pissis S. (Solon)
Radoszewski J. (Jakub)
Publication venue
Publication date: 01/01/2021
Field of study

In the classic longest common substring (LCS) problem, we are given two strings S and T, each of length at most n, over an alphabet of size σ, and we are asked to find a longest string occurring as a fragment of both S and T. Weiner, in his seminal paper that introduced the suffix tree, presented an (n log σ)-time algorithm for this problem [SWAT 1973]. For polynomially-bounded integer alphabets, the linear-time construction of suffix trees by Farach yielded an (n)-time algorithm for the LCS problem [FOCS 1997]. However, for small alphabets, this is not necessarily optimal for the LCS problem in the word RAM model of computation, in which the strings can be stored in (n log σ/log n) space and read in (n log σ/log n) time. We show that, in this model, we can compute an LCS in time (n log σ / √{log n}), which is sublinear in n if σ = 2^{o(√{log n})} (in particular, if σ = (1)), using optimal space (n log σ/log n). We then lift our ideas to the problem of computing a k-mismatch LCS, which has received considerable attention in recent years. In this problem, the aim is to compute a longest substring of S that occurs in T with at most k mismatches. Flouri et al. showed how to compute a 1-mismatch LCS in (n log n) time [IPL 2015]. Thankachan et al. extended this result to computing a k-mismatch LCS in (n log^k n) time for k = (1) [J. Comput. Biol. 2016]. We show an (n log^{k-1/2} n)-time algorithm, for any constant integer k > 0 and irrespective of the alphabet size, using (n) space as the previous approaches. We thus notably break through the well-known n log^k n barrier, which stems from a recursive heavy-path decomposition technique that was first introduced in the seminal paper of Cole et al. [STOC 2004] for string indexing with k errors. </p

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot

Text indexing for long patterns: Anchors are all you need

Author: Ayad L.A.K. (Lorraine)
Loukides G. (Grigorios)
Pissis S. (Solon)
Publication venue
Publication date: 10/07/2023
Field of study

In many real-world database systems, a large fraction of the data is represented by strings: Sequences of letters over some alphabet. This is because strings can easily encode data arising from different sources. It is often crucial to represent such string datasets in a compact form but also to simultaneously enable fast pattern matching queries. This is the classic text indexing problem. The four absolute measures anyone should pay attention to when designing or implementing a text index are: (ⅰ) index space; (ⅱ) query time;(ⅲ) construction space; and (iv) construction time. Unfortunately, however, most (if not all) widely-used indexes (e.g., suffix tree, suffix array, or their compressed counterparts) are not optimized for all four measures simultaneously, as it is difficult to have the best of all four worlds. Here, we take an important step in this direction by showing that text indexing with locally consistent anchors (lc-anchors) offers remarkably good performance in all four measures, when we have at hand a lower bound ℓ on the length of the queried patterns — which is arguably a quite reasonable assumption in practical applications. Specifically, we improve on the construction of the index proposed by Loukides and Pissis, which is based on bidirectional string anchors (bd-anchors), a new type of lc-anchors,by: (i) designing an average-case linear-time algorithm to compute bd-anchors; and (ii) developing a semi-external-memory implementation to construct the index in small space using near-optimal work. We then present an extensive experimental evaluation, based on the four measures, using real benchmark datasets. The results show that, for long patterns, the index constructed using our improved algorithms compares favorably to all classic indexes: (compressed) suffix tree; (compressed) suffix array; and the FM-index

CWI's Institutional Repository

String Covering: A Survey

Author: Mhaskar Neerja
Smyth W. F.
Publication venue
Publication date: 21/11/2022
Field of study

The study of strings is an important combinatorial field that precedes the digital computer. Strings can be very long, trillions of letters, so it is important to find compact representations. Here we first survey various forms of one potential compaction methodology, the cover of a given string x, initially proposed in a simple form in 1990, but increasingly of interest as more sophisticated variants have been discovered. We then consider covering by a seed; that is, a cover of a superstring of x. We conclude with many proposals for research directions that could make significant contributions to string processing in future

arXiv.org e-Print Archive

Episciences.org

Internal Shortest Absent Word Queries in Constant Time and Linear Space

Author: Badkobeh Golnaz
Charalampopoulos Panagiotis
Kosolobov Dmitry
Pissis Solon,
Publication venue: HAL CCSD
Publication date: 05/07/2021
Field of study

International audienceGiven a string T of length n over an alphabet Σ ⊂ {1, 2,. .. , n O(1) } of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T [i] • • • T [j] of T. We present an O(n)-space data structure that answers such queries in constant time and can be constructed in O(n log σ n) time

INRIA a CCSD electronic archive server

Internal shortest absent word queries

Author: Badkobeh G. (Golnaz)
Charalampopoulos P. (Panagiotis)
Pissis S. (Solon)
Publication venue
Publication date: 01/01/2021
Field of study

Given a string T of length n over an alphabet Σ ⊂ {1, 2, . . . , nO(1)} of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i] · · · T[j] of T. For any positive integer k ∈ [1, log logσ n], we present an O((n/k) · log logσ n)-size data structure, which can be constructed in O(n logσ n) time, and answers queries in time O(log logσ k)

VU Research Portal

CWI's Institutional Repository

Dagstuhl Research Online Publication Server

Elastic-Degenerate String Matching with 1 Error

Author: Bernardini Giulia
Gabory Estéban
Pissis Solon P.
Stougie Leen
Sweering Michelle
Zuba Wiktor
Publication venue
Publication date: 01/01/2022
Field of study

An elastic-degenerate string is a sequence of

n

finite sets of strings of total length

N

, introduced to represent a set of related DNA sequences, also known as a pangenome. The ED string matching (EDSM) problem consists in reporting all occurrences of a pattern of length

m

in an ED text. This problem has recently received some attention by the combinatorial pattern matching community, culminating in an

\tilde{\mathcal{O}}(nm^{\omega-1})+\mathcal{O}(N)

-time algorithm [Bernardini et al., SIAM J. Comput. 2022], where

\omega

denotes the matrix multiplication exponent and the

\tilde{\mathcal{O}}(\cdot)

notation suppresses polylog factors. In the

k

-EDSM problem, the approximate version of EDSM, we are asked to report all pattern occurrences with at most

k

errors.

k

-EDSM can be solved in

\mathcal{O}(k^2mG+kN)

time, under edit distance, or

\mathcal{O}(kmG+kN)

time, under Hamming distance, where

G

denotes the total number of strings in the ED text [Bernardini et al., Theor. Comput. Sci. 2020]. Unfortunately,

G

is only bounded by

N

, and so even for

k=1

, the existing algorithms run in

\Omega(mN)

time in the worst case. In this paper we show that

1

-EDSM can be solved in

\mathcal{O}((nm^2 + N)\log m)

\mathcal{O}(nm^3 + N)

time under edit distance. For the decision version, we present a faster

\mathcal{O}(nm^2\sqrt{\log m} + N\log\log m)

-time algorithm. We also show that

1

-EDSM can be solved in

\mathcal{O}(nm^2 + N\log m)

time under Hamming distance. Our algorithms for edit distance rely on non-trivial reductions from

1

-EDSM to special instances of classic computational geometry problems (2d rectangle stabbing or 2d range emptiness), which we show how to solve efficiently. In order to obtain an even faster algorithm for Hamming distance, we rely on employing and adapting the

k

-errata trees for indexing with errors [Cole et al., STOC 2004].Comment: This is an extended version of a paper accepted at LATIN 202

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Trieste

VU Research Portal

CWI's Institutional Repository

INRIA a CCSD electronic archive server