Words and forbidden factors
Given a finite or infinite word v, we consider the set M(v) of minimal forbidden factors of v. We show that the set M(v) is of fundamental importance in determining the structure of the word v. In the case of a finite word w, we consider two parameters that are related to the size of M(w): the first counts the minimal forbidden factors of w and the second gives the length of the longest minimal forbidden factor of w. We derive sharp upper and lower bounds for both parameters. We also prove that the second parameter is related to the minimal period of the word w. We are further interested in the algorithmic point of view: we design linear-time algorithms for the following two problems: (i) given w, construct the set M(w) and, conversely, (ii) given M(w), reconstruct the word w. In the case of an infinite word x, we consider the following two functions: g_x, which counts, for each n, the allowed factors of x of length n, and f_x, which counts, for each n, the minimal forbidden factors of x of length n. We address the following general problem: what information about the structure of x can be derived from the pair (g_x, f_x)? We prove that these two functions characterize, up to the automorphism exchanging the two letters, the language of factors of each single infinite Sturmian word.
A Characterization of Bispecial Sturmian Words
A finite Sturmian word w over the alphabet {a,b} is left special (resp. right
special) if aw and bw (resp. wa and wb) are both Sturmian words. A bispecial
Sturmian word is a Sturmian word that is both left and right special. We show
as a main result that bispecial Sturmian words are exactly the maximal internal
factors of Christoffel words, which are words coding the digital approximations
of segments in the Euclidean plane. This result is an extension of the known
relation between central words and primitive Christoffel words. Our
characterization allows us to give an enumerative formula for bispecial
Sturmian words. We also investigate the minimal forbidden words for the set of
Sturmian words. Comment: Accepted to MFCS 201
Linear-time Computation of Minimal Absent Words Using Suffix Array
An absent word of a word y of length n is a word that does not occur in y. It
is a minimal absent word if all its proper factors occur in y. Minimal absent
words have been computed in genomes of organisms from all domains of life;
their computation provides a fast alternative for measuring approximation in
sequence comparison. There exists an O(n)-time and O(n)-space algorithm for
computing all minimal absent words on a fixed-sized alphabet based on the
construction of suffix automata (Crochemore et al., 1998). No implementation of
this algorithm is publicly available. There also exists an O(n^2)-time and
O(n)-space algorithm for the same problem based on the construction of suffix
arrays (Pinho et al., 2009). An implementation of this algorithm was also
provided by the authors and is currently the fastest available. In this
article, we bridge this unpleasant gap by presenting an O(n)-time and
O(n)-space algorithm for computing all minimal absent words based on the
construction of suffix arrays. Experimental results using real and synthetic
data show that the respective implementation outperforms the one by Pinho et
al.
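The definition of a minimal absent word given in this abstract can be checked directly on small inputs. The sketch below is a brute-force illustration only, not the linear-time suffix-array algorithm the abstract announces; the function name `minimal_absent_words` and the optional `alphabet` parameter are assumptions made for this example. It relies on the fact that a word a·u·b is a minimal absent word exactly when a·u and u·b both occur in y and a·u·b does not.

```python
def minimal_absent_words(y, alphabet=None):
    """Brute-force set of minimal absent words of y (illustrative only)."""
    if alphabet is None:
        alphabet = sorted(set(y))
    n = len(y)
    # Every factor (substring) of y, including the empty word.
    factors = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    # A letter that never occurs is a minimal absent word of length 1.
    maws = {a for a in alphabet if a not in factors}
    # Longer candidates have the form a + u + b where u is a factor of y.
    for u in factors:
        for a in alphabet:
            if a + u not in factors:
                continue
            for b in alphabet:
                if u + b in factors and a + u + b not in factors:
                    maws.add(a + u + b)
    return maws


if __name__ == "__main__":
    print(sorted(minimal_absent_words("abaab", "ab")))
```

The running time is polynomial but far from linear, so this is suitable only for sanity checks on short words.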
Optimal Computation of Avoided Words
The deviation of the observed frequency of a word w from its expected
frequency in a given sequence x is used to determine whether or not the word
is avoided. This concept is particularly useful in DNA linguistic analysis. The
value of the standard deviation of w, denoted by std(w), effectively
characterises the extent of a word by its edge contrast in the context in which
it occurs. A word w of length k > 2 is a \rho-avoided word in x if
std(w) \leq \rho, for a given threshold \rho < 0. Notice that such a word
may be completely absent from x. Hence computing all such words naïvely
can be a very time-consuming procedure, in particular for large k. In this
article, we propose an O(n)-time and O(n)-space algorithm to compute all
\rho-avoided words of length k in a given sequence x of length n over a
fixed-sized alphabet. We also present a time-optimal O(\sigma n)-time and
O(\sigma n)-space algorithm to compute all \rho-avoided words (of any
length) in a sequence of length n over an alphabet of size \sigma.
Furthermore, we provide a tight asymptotic upper bound for the number of
\rho-avoided words and the expected length of the longest one. We make
available an open-source implementation of our algorithm. Experimental results,
using both real and synthetic data, show the efficiency of our implementation.
On the Structure of Bispecial Sturmian Words
A balanced word is one in which any two factors of the same length contain
the same number of each letter of the alphabet up to one. Finite binary
balanced words are called Sturmian words. A Sturmian word is bispecial if it
can be extended to the left and to the right with both letters remaining a
Sturmian word. There is a deep relation between bispecial Sturmian words and
Christoffel words, which are the digital approximations of Euclidean segments in
the plane. In 1997, J. Berstel and A. de Luca proved that \emph{palindromic}
bispecial Sturmian words are precisely the maximal internal factors of
\emph{primitive} Christoffel words. We extend this result by showing that
bispecial Sturmian words are precisely the maximal internal factors of
\emph{all} Christoffel words. Our characterization allows us to give an
enumerative formula for bispecial Sturmian words. We also investigate the
minimal forbidden words for the language of Sturmian words. Comment: arXiv admin note: substantial text overlap with arXiv:1204.167
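The definitions in this abstract (balanced word, finite Sturmian word, bispecial word) translate into a short brute-force check. The following is a minimal sketch under the abstract's definitions; the helper names `is_balanced` and `is_bispecial` are assumptions for illustration and do not come from the paper, and the Christoffel-word characterization itself is not used.

```python
def is_balanced(w):
    """True if any two factors of w of equal length differ by at most one
    in the number of occurrences of each letter."""
    n = len(w)
    for length in range(1, n + 1):
        for c in set(w):
            counts = [w[i:i + length].count(c) for i in range(n - length + 1)]
            if max(counts) - min(counts) > 1:
                return False
    return True


def is_bispecial(w, alphabet="ab"):
    """True if w can be extended by every letter, on the left and on the
    right, while remaining balanced (that is, finite Sturmian)."""
    return all(is_balanced(a + w) and is_balanced(w + a) for a in alphabet)


if __name__ == "__main__":
    for w in ["", "a", "ab", "aba", "aab"]:
        print(repr(w), is_bispecial(w))
```

Note that "ab" passes the test although it is not a palindrome, which is consistent with the abstract's point that bispecial Sturmian words form a larger class than the palindromic ones.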
Minimal Forbidden Factors of Circular Words
Minimal forbidden factors are a useful tool for investigating properties of
words and languages. Two factorial languages are distinct if and only if they
have different (antifactorial) sets of minimal forbidden factors. There exist
algorithms for computing the minimal forbidden factors of a word, as well as of
a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an
algorithm that, given the trie recognizing a finite antifactorial language M,
computes a DFA recognizing the language whose set of minimal forbidden factors
is M. In the same paper, they showed that the obtained DFA is minimal if the
input trie recognizes the minimal forbidden factors of a single word. We
generalize this result to the case of a circular word. We discuss several
combinatorial properties of the minimal forbidden factors of a circular word.
As a byproduct, we obtain a formal definition of the factor automaton of a
circular word. Finally, we investigate the case of minimal forbidden factors of
the circular Fibonacci words. Comment: To appear in Theoretical Computer Science
Suffix conjugates for a class of morphic subshifts
Let A be a finite alphabet and f: A^* --> A^* be a morphism with an iterative
fixed point f^\omega(\alpha), where \alpha is in A. Consider the subshift (X,
T), where X is the shift orbit closure of f^\omega(\alpha) and T: X --> X is
the shift map. Let S be a finite alphabet that is in bijective correspondence
via a mapping c with the set of nonempty suffixes of the images f(a) for a in
A. Let \mathcal{S} \subset S^N be the set of infinite words s = (s_n)_{n\geq 0}
such that \pi(s):= c(s_0)f(c(s_1)) f^2(c(s_2))... is in X. We show that if f is
primitive and f(A) is a suffix code, then there exists a mapping H: \mathcal{S} -->
\mathcal{S} such that (\mathcal{S}, H) is a topological dynamical system and \pi: (\mathcal{S}, H)
--> (X, T) is a conjugacy; we call (\mathcal{S}, H) the suffix conjugate of (X, T). In
the special case when f is the Fibonacci or the Thue-Morse morphism, we show
that the subshift (\mathcal{S}, T) is sofic, that is, the language of \mathcal{S} is regular.
Cyclic Complexity of Words
We introduce and study a complexity function on words c_x(n), called
\emph{cyclic complexity}, which counts the number of conjugacy classes of
factors of length n of an infinite word x. We extend the well-known
Morse-Hedlund theorem to the setting of cyclic complexity by showing that a
word is ultimately periodic if and only if it has bounded cyclic complexity.
Unlike most complexity functions, cyclic complexity distinguishes between
Sturmian words of different slopes. We prove that if x is a Sturmian word and
y is a word having the same cyclic complexity as x, then, up to renaming
letters, x and y have the same set of factors. In particular, y is also
Sturmian of slope equal to that of x. Since c_x(n) = 1 for some n \geq 1
implies x is periodic, it is natural to consider the quantity
\liminf_{n \to \infty} c_x(n). We show that if x is a Sturmian word,
then \liminf_{n \to \infty} c_x(n) = 2. We prove however that this is
not a characterization of Sturmian words by exhibiting a restricted class of
Toeplitz words, including the period-doubling word, which also verify this same
condition on the limit infimum. In contrast we show that, for the Thue-Morse
word t, \liminf_{n \to \infty} c_t(n) = +\infty. Comment: To appear in Journal of Combinatorial Theory, Series
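Cyclic complexity, as defined in this abstract, counts conjugacy classes (classes under rotation) of the factors of a given length. A rough way to experiment with it is to count classes among the factors seen in a long finite prefix; this is only an approximation of the value for the infinite word, valid when the prefix is long enough to contain every factor of the chosen length. The names `fibonacci_prefix`, `canonical_rotation` and `cyclic_complexity` are assumptions introduced for this sketch.

```python
def fibonacci_prefix(length):
    """Prefix of the infinite Fibonacci word abaababa... of the given length."""
    prev, cur = "b", "a"
    while len(cur) < length:
        prev, cur = cur, cur + prev
    return cur[:length]


def canonical_rotation(u):
    """Lexicographically least rotation, used as the class representative."""
    return min(u[i:] + u[:i] for i in range(len(u)))


def cyclic_complexity(prefix, n):
    """Number of conjugacy classes among the length-n factors of prefix."""
    factors = {prefix[i:i + n] for i in range(len(prefix) - n + 1)}
    return len({canonical_rotation(u) for u in factors})


if __name__ == "__main__":
    x = fibonacci_prefix(500)
    print([cyclic_complexity(x, n) for n in range(1, 13)])
```

Using the least rotation as a representative keeps the code short; for long factors a linear-time canonical-rotation routine would be a better choice.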
Correlations of minimal forbidden factors of the Fibonacci word
If u and v are two words, the correlation of u over v is a binary
word that encodes all possible overlaps between u and v. This concept was
introduced by Guibas and Odlyzko as a key element of their method for
enumerating the number of words of length n over a given alphabet that avoid
a given set of forbidden factors. In this paper we characterize the pairwise
correlations between the minimal forbidden factors of the infinite Fibonacci
word. Comment: 11 pages
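The correlation described in this abstract can be computed directly. The sketch below assumes the usual Guibas-Odlyzko convention: the result has the same length as the first word, and its i-th bit is 1 exactly when the second word, placed under the first with its leftmost letter at position i, agrees with it on the overlapping segment. The function name `correlation` is an assumption, and the paper may fix a slightly different convention.

```python
def correlation(u, v):
    """Correlation of u over v as a binary string of length len(u)."""
    bits = []
    for i in range(len(u)):
        # Length of the segment where u (from position i) and v overlap.
        k = min(len(u) - i, len(v))
        bits.append("1" if u[i:i + k] == v[:k] else "0")
    return "".join(bits)


if __name__ == "__main__":
    print(correlation("cabcabc", "abcabcde"))
    print(correlation("abaab", "abaab"))  # autocorrelation
```

Taking both arguments equal gives the autocorrelation of a word, whose 1-bits mark its periods.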
Minimal Absent Words in Rooted and Unrooted Trees
We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet \Sigma of cardinality \sigma. We show that the set MAW(T) of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality O(n\sigma) (resp. O(n^2\sigma)), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time O(n + |MAW(T)|) (resp. O(n^2 + |MAW(T)|)), assuming an integer alphabet of size polynomial in n.