Tight Upper and Lower Bounds on Suffix Tree Breadth
The suffix tree - the compacted trie of all the suffixes of a string - is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string S of length n, how many nodes ν_S(d) can there be at (string) depth d in its suffix tree? We prove that ν(n, d) = max_{S∈Σⁿ} ν_S(d) is O((n/d) log(n/d)), and show that this bound is asymptotically tight, describing strings for which ν_S(d) is Ω((n/d) log(n/d)).
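As a concrete (and deliberately naive) illustration of the quantity ν_S(d), not the paper's technique: with a terminator $, a length-d string is a node of the suffix tree exactly when it is a right-branching substring (followed by at least two distinct characters somewhere in S$) or the length-d suffix of S$ (a leaf). A brute-force sketch, with our own function name nodes_at_depth:

```python
def nodes_at_depth(s: str, d: int) -> int:
    """Count suffix tree nodes of s$ at string depth d, brute force:
    right-branching length-d substrings plus the length-d suffix (a leaf)."""
    t = s + "$"                          # unique terminator
    following = {}                       # length-d substring -> set of next chars
    for i in range(len(t) - d):
        following.setdefault(t[i:i + d], set()).add(t[i + d])
    branching = sum(1 for ext in following.values() if len(ext) >= 2)
    leaf = 1 if 1 <= d <= len(t) else 0  # exactly one suffix has length d
    return branching + leaf

# Example: nodes_at_depth("abab", 2) == 2 (branching node "ab", leaf "b$")
```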
Checking whether a word is Hamming-isometric in linear time
A finite word f is Hamming-isometric if for any two words u and v of the same length avoiding f, u can be transformed into v by changing one by one all the letters on which u differs from v, in such a way that all the new words obtained in this process also avoid f. Words which are not Hamming-isometric have been characterized as words having a border with two mismatches. We derive from this characterization a linear-time algorithm to check whether a word is Hamming-isometric. It is based on pattern matching algorithms with mismatches. Lee-isometric words over a four-letter alphabet have been characterized as words having a border with two Lee-errors. We derive from this characterization a linear-time algorithm to check whether a word over an alphabet of size four is Lee-isometric.
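As a brute-force illustration of the characterization above (the paper's algorithm instead runs in linear time via pattern matching with mismatches), one can simply test every border length for exactly two mismatches. A minimal sketch, with our own function name is_hamming_isometric:

```python
def is_hamming_isometric(f: str) -> bool:
    """Quadratic brute-force check of the stated characterization:
    f is NOT Hamming-isometric iff some proper prefix and suffix of
    equal length differ in exactly two positions (a 2-error border)."""
    n = len(f)
    for ell in range(1, n):  # candidate border length
        mismatches = sum(1 for a, b in zip(f[:ell], f[n - ell:]) if a != b)
        if mismatches == 2:
            return False  # found a border with two mismatches
    return True
```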
On Suffix Tree Breadth
The suffix tree - the compacted trie of all the suffixes of a string - is the most important and widely-used data structure in string processing. We consider a natural combinatorial question about suffix trees: for a string S of length n, how many nodes ν_S(d) can there be at (string) depth d in its suffix tree? We prove that ν(n, d) = max_{S∈Σⁿ} ν_S(d) is O((n/d) log n), and show that this bound is almost tight, describing strings for which ν_S(d) is Ω((n/d) log(n/d)).
Dichotomic Selection on Words: A Probabilistic Analysis
The paper studies the behaviour of selection algorithms that are based on dichotomy principles. On input an ordered list L and a query element x not in L, they return the interval of L to which x belongs. We focus here on the case of words, where dichotomy principles lead to a selection algorithm designed by Crochemore, Hancart and Lecroq, which appears to be "quasi-optimal". We perform a probabilistic analysis of this algorithm that exhibits its quasi-optimality on average.
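For intuition, the selection task itself can be sketched with a plain dichotomic search (this is not the Crochemore-Hancart-Lecroq algorithm, which in addition reuses common-prefix information between comparisons to achieve quasi-optimality); the function name locate_interval is ours:

```python
import bisect

def locate_interval(words, x):
    """Given a sorted list of words and a word x not in it, return the
    pair (words[i-1], words[i]) delimiting the interval x falls into,
    with None at either end of the list."""
    i = bisect.bisect_left(words, x)
    lo = words[i - 1] if i > 0 else None
    hi = words[i] if i < len(words) else None
    return lo, hi

# Example: locate_interval(["ab", "ba", "bb"], "ac") == ("ab", "ba")
```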
Longest Common Abelian Factors and Large Alphabets
Two strings X and Y are considered Abelian equal if the letters of X can be permuted to obtain Y (and vice versa). Recently, Alatabbi et al. (2015) considered the longest common Abelian factor problem, in which we are asked to find the length of the longest Abelian-equal factor present in a given pair of strings. They provided an algorithm that uses O(σn²) time and O(σn) space, where n is the length of the pair of strings and σ is the alphabet size. In this paper we describe an algorithm that uses O(n² log² n log* n) time and O(n log² n) space, significantly improving Alatabbi et al.'s result unless the alphabet is small. Our algorithm makes use of techniques for maintaining a dynamic set of strings under split, join, and equality testing (Mehlhorn et al., Algorithmica 17(2), 1997).
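To make the problem statement concrete (a brute-force sketch, nowhere near the bounds above): two factors are Abelian equal exactly when their Parikh vectors, i.e., letter-frequency vectors, coincide, so one can slide a Parikh-vector window of each length over both strings and look for a shared vector. The function name lcaf is ours:

```python
def lcaf(x: str, y: str) -> int:
    """Length of the longest common Abelian factor of x and y,
    via sliding Parikh-vector windows; O(n^2 * sigma) time."""
    alphabet = sorted(set(x) | set(y))
    index = {c: i for i, c in enumerate(alphabet)}

    def windows(s, ell):                 # Parikh vectors of all length-ell factors
        vec = [0] * len(alphabet)
        for i, c in enumerate(s):
            vec[index[c]] += 1
            if i >= ell:
                vec[index[s[i - ell]]] -= 1
            if i >= ell - 1:
                yield tuple(vec)

    for ell in range(min(len(x), len(y)), 0, -1):
        seen = set(windows(x, ell))
        if any(v in seen for v in windows(y, ell)):
            return ell
    return 0

# Example: lcaf("abcd", "dcxa") == 2 ("cd" and "dc" are Abelian equal)
```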
Substring Complexity in Sublinear Space
Shannon’s entropy is a definitive lower bound for statistical compression. Unfortunately, no such clear measure exists for the compressibility of repetitive strings. Thus, ad hoc measures are employed to estimate the repetitiveness of strings, e.g., the size z of the Lempel–Ziv parse or the number r of equal-letter runs of the Burrows-Wheeler transform. A more recent one is the size γ of a smallest string attractor. Let T be a string of length n. A string attractor of T is a set of positions of T capturing the occurrences of all the substrings of T. Unfortunately, Kempa and Prezza [STOC 2018] showed that computing γ is NP-hard. Kociumaka et al. [LATIN 2020] considered a new measure of compressibility that is based on the function S_T(k) counting the number of distinct substrings of length k of T, also known as the substring complexity of T. This new measure is defined as δ = sup{S_T(k)/k : k ≥ 1} and lower bounds all the relevant ad hoc measures previously considered. In particular, δ ≤ γ always holds and δ can be computed in O(n) time using Θ(n) working space. Kociumaka et al. showed that one can construct an O(δ log(n/δ))-sized representation of T supporting efficient direct access and efficient pattern matching queries on T. Given that for highly compressible strings δ is significantly smaller than n, it is natural to pose the following question: Can we compute δ efficiently using sublinear working space? It is straightforward to show that in the comparison model, any algorithm computing δ using O(b) space requires Ω(n^{2-o(1)}/b) time through a reduction from the element distinctness problem [Yao, SIAM J. Comput. 1994]. We thus investigate whether we can match this lower bound. We address this algorithmic challenge by showing the following bounds to compute δ:
- O((n³ log b)/b²) time using O(b) space, for any b ∈ [1, n], in the comparison model.
- Õ(n²/b) time using Õ(b) space, for any b ∈ [√n, n], in the word RAM model.
This gives an Õ(n^{1+ε})-time and Õ(n^{1-ε})-space algorithm to compute δ, for any 0 < ε ≤ 1/2. Let us remark that our algorithms compute S_T(k), for all k, within the same complexities.
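For reference, δ is straightforward to compute when space is not a concern: enumerate the distinct substrings of each length k and maximize S_T(k)/k (for a finite string the supremum is attained at some k ≤ n). The Θ(n²)-space sketch below, with our own function name delta, is exactly the kind of computation the paper's algorithms avoid:

```python
def delta(t: str) -> float:
    """delta = max over k of S_T(k)/k, where S_T(k) is the number of
    distinct length-k substrings of t; Theta(n^2) time and space."""
    n = len(t)
    best = 0.0
    for k in range(1, n + 1):
        s_k = len({t[i:i + k] for i in range(n - k + 1)})  # S_T(k)
        best = max(best, s_k / k)
    return best

# Example: delta("abaababa") == 2.0 (attained at k = 1: S_T(1) = 2)
```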