String attractors and combinatorics on words
The notion of string attractor was recently introduced in [Prezza, 2017] and studied in [Kempa and Prezza, 2018] to provide a unifying framework for known dictionary-based compressors. A string attractor for a word w = w[1]w[2] · · · w[n] is a subset Γ of the positions 1, . . ., n, such that every distinct factor of w has an occurrence crossing at least one of the elements of Γ. While finding the smallest string attractor for a word is an NP-complete problem, it has been proved in [Kempa and Prezza, 2018] that dictionary compressors can be interpreted as algorithms approximating the smallest string attractor for a given word. In this paper we explore the notion of string attractor from a combinatorial point of view, by focusing on several families of finite words. The results presented in the paper suggest that the notion of string attractor can be used to define new tools to investigate combinatorial properties of words.
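The definition above can be checked by brute force on short words. The following is a minimal sketch, not taken from the paper: the function name `is_string_attractor` is hypothetical, positions are 1-based as in the abstract, and the enumeration of all factors makes it suitable only for small examples.

```python
def is_string_attractor(w, gamma):
    """Return True if gamma (a set of 1-based positions) is a string attractor of w."""
    n = len(w)
    # All distinct non-empty factors (substrings) of w.
    factors = {w[i:j] for i in range(n) for j in range(i + 1, n + 1)}
    for f in factors:
        m = len(f)
        covered = False
        start = w.find(f)
        while start != -1:
            # This occurrence spans the 1-based positions start+1 .. start+m.
            if any(start + 1 <= p <= start + m for p in gamma):
                covered = True
                break
            start = w.find(f, start + 1)  # next (possibly overlapping) occurrence
        if not covered:
            return False
    return True
```

For instance, on the word abab the set {2, 3} passes the check, while {2} does not, because no occurrence of the factor a crosses position 2.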
Clustering words
We characterize words which cluster under the Burrows-Wheeler transform as
those words $w$ such that $ww$ occurs in a trajectory of an interval exchange
transformation, and build examples of clustering words.
On the Impact of Morphisms on BWT-Runs
Morphisms are widely studied combinatorial objects that can be used for generating infinite families of words. In the context of Information theory, injective morphisms are called (variable length) codes. In Data compression, morphisms, combined with parsing techniques, have recently been used to define new mechanisms to generate repetitive words. Here, we show that the repetitiveness induced by applying a morphism to a word can be captured by a compression scheme based on the Burrows-Wheeler Transform (BWT). In fact, we prove that, unlike other compression-based repetitiveness measures, the measure r_bwt (which counts the number of equal-letter runs produced by applying BWT to a word) strongly depends on the applied morphism. In more detail, we characterize the binary morphisms that preserve the value of r_bwt(w), when applied to any binary word w containing both letters. They are precisely the Sturmian morphisms, which are well-known objects in Combinatorics on words. Moreover, we prove that it is always possible to find a binary morphism that, when applied to any binary word containing both letters, increases the number of BWT equal-letter runs by a given (even) number. In addition, we derive a method for constructing arbitrarily large families of binary words on which BWT produces a given (even) number of new equal-letter runs. Such results are obtained by using a new class of morphisms that we call Thue-Morse-like. Finally, we show that there exist binary morphisms μ for which it is possible to find words w such that the difference r_bwt(μ(w))-r_bwt(w) is arbitrarily large.
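The effect of a morphism on r_bwt can be observed on small words. The following sketch makes assumptions that are not from the paper: the BWT is computed by sorting all rotations without an end-of-string marker, and the helper names (`bwt`, `count_runs`, `apply_morphism`) are hypothetical. The Fibonacci morphism stands in for a Sturmian morphism, and the classical Thue-Morse morphism is used only for contrast.

```python
def bwt(w):
    """BWT of w via sorted rotations (no end-of-string marker; an assumption)."""
    rotations = sorted(w[i:] + w[:i] for i in range(len(w)))
    return "".join(rot[-1] for rot in rotations)

def count_runs(s):
    """Number of maximal equal-letter runs in s."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])

def apply_morphism(images, w):
    """Apply the morphism given as a letter -> image dictionary to the word w."""
    return "".join(images[c] for c in w)

FIBONACCI = {"a": "ab", "b": "a"}    # a standard Sturmian morphism
THUE_MORSE = {"a": "ab", "b": "ba"}  # the Thue-Morse morphism, for contrast

w = "aab"
r_w = count_runs(bwt(w))                                # 2
r_fib = count_runs(bwt(apply_morphism(FIBONACCI, w)))   # 2
r_tm = count_runs(bwt(apply_morphism(THUE_MORSE, w)))   # 4
```

On this word the Fibonacci morphism leaves the number of runs unchanged, while the Thue-Morse morphism raises it from 2 to 4. A single example illustrates the statements in the abstract but of course proves nothing.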
Novel Results on the Number of Runs of the Burrows-Wheeler-Transform
The Burrows-Wheeler-Transform (BWT), a reversible string transformation, is
one of the fundamental components of many current data structures in string
processing. It is central in data compression, as well as in efficient query
algorithms for sequence data, such as webpages, genomic and other biological
sequences, or indeed any textual data. The BWT lends itself well to compression
because its number of equal-letter-runs (usually referred to as $r$) is often
considerably lower than that of the original string; in particular, it is well
suited for strings with many repeated factors. In fact, much attention has been
paid to the parameter $r$ as a measure of repetitiveness, especially to evaluate
the performance in terms of both space and time of compressed indexing data
structures.
In this paper, we investigate $\rho(v)$, the ratio of $r$ and of the number
of runs of the BWT of the reverse of $v$. Kempa and Kociumaka [FOCS 2020] gave
the first non-trivial upper bound as $\rho(v) = O(\log^2 n)$, for any string $v$
of length $n$. However, nothing is known about the tightness of this upper
bound. We present infinite families of binary strings for which $\rho(v) = \Theta(\log n)$ holds, thus giving the first non-trivial lower bound on
$\rho(n)$, the maximum over all strings of length $n$.
Our results suggest that $r$ is not an ideal measure of the repetitiveness of
the string, since the number of repeated factors is invariant between the
string and its reverse. We believe that there is a more intricate relationship
between the number of runs of the BWT and the string's combinatorial
properties.
Comment: 14 pages, 2 figures
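The ratio studied in this abstract, between the number of BWT runs of a string and that of its reverse, can be computed directly for small inputs. This is a minimal sketch under stated assumptions: the BWT is taken over sorted rotations with no end-of-string marker, which may differ from the paper's convention, the input is assumed non-empty, and the names `bwt_runs` and `run_ratio` are hypothetical.

```python
from fractions import Fraction

def bwt_runs(v):
    """Number of equal-letter runs in the rotation-based BWT of v."""
    rotations = sorted(v[i:] + v[:i] for i in range(len(v)))
    last = [rot[-1] for rot in rotations]
    return sum(1 for i, c in enumerate(last) if i == 0 or c != last[i - 1])

def run_ratio(v):
    """Runs of BWT(v) divided by runs of BWT(reverse of v); v must be non-empty."""
    return Fraction(bwt_runs(v), bwt_runs(v[::-1]))
```

For example, `run_ratio("aabc")` is 3/4: the BWT of aabc is caab (3 runs), while the BWT of its reverse cbaa is baca (4 runs). Both strings have exactly the same repeated factors up to reversal.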
Clustering and Arnoux-Rauzy words
We characterize the clustering of a word under the Burrows-Wheeler transform
in terms of the resolution of a bounded number of bispecial factors belonging
to the language generated by all its powers. We use this criterion to compute,
in every given Arnoux-Rauzy language on three letters, an explicit bound $K$
such that each word of length at least $K$ is not clustering; this bound is
sharp for a set of Arnoux-Rauzy languages including the Tribonacci one. In the
other direction, we characterize all standard Arnoux-Rauzy clustering words,
and all perfectly clustering Arnoux-Rauzy words. We extend some results to
episturmian languages, characterizing those which produce infinitely many
clustering words, and to larger alphabets.
On the Structure of Bispecial Sturmian Words
A balanced word is one in which any two factors of the same length contain
the same number of each letter of the alphabet up to one. Finite binary
balanced words are called Sturmian words. A Sturmian word is bispecial if it
can be extended to the left and to the right with both letters remaining a
Sturmian word. There is a deep relation between bispecial Sturmian words and
Christoffel words, which are the digital approximations of Euclidean segments in
the plane. In 1997, J. Berstel and A. de Luca proved that \emph{palindromic}
bispecial Sturmian words are precisely the maximal internal factors of
\emph{primitive} Christoffel words. We extend this result by showing that
bispecial Sturmian words are precisely the maximal internal factors of
\emph{all} Christoffel words. Our characterization allows us to give an
enumerative formula for bispecial Sturmian words. We also investigate the
minimal forbidden words for the language of Sturmian words.
Comment: arXiv admin note: substantial text overlap with arXiv:1204.167
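The balance condition that opens this abstract is easy to test by brute force. The sketch below is an illustration only: the function name `is_balanced` is hypothetical, and the check compares letter counts across all factors of each length, which is adequate for short words but not efficient.

```python
from collections import Counter

def is_balanced(w):
    """True if any two equal-length factors of w differ by at most one in each letter count."""
    n = len(w)
    alphabet = set(w)
    for length in range(1, n + 1):
        # Letter counts of every factor of the current length.
        counts = [Counter(w[i:i + length]) for i in range(n - length + 1)]
        for letter in alphabet:
            occurrences = [c[letter] for c in counts]
            if max(occurrences) - min(occurrences) > 1:
                return False
    return True
```

For instance, aabab is balanced, hence a finite Sturmian word in the sense of the abstract, whereas aabb is not: its factors aa and bb contain two and zero occurrences of a, respectively.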