23 research outputs found

    Words and forbidden factors

    Given a finite or infinite word v, we consider the set M(v) of minimal forbidden factors of v. We show that the set M(v) is of fundamental importance in determining the structure of the word v. In the case of a finite word w we consider two parameters that are related to the size of M(w): the first counts the minimal forbidden factors of w and the second gives the length of the longest minimal forbidden factor of w. We derive sharp upper and lower bounds for both parameters. We also prove that the second parameter is related to the minimal period of the word w. We are further interested in the algorithmic point of view: we design linear-time algorithms for the following two problems: (i) given w, construct the set M(w) and, conversely, (ii) given M(w), reconstruct the word w. In the case of an infinite word x, we consider the following two functions: g_x, which counts, for each n, the allowed factors of x of length n, and f_x, which counts, for each n, the minimal forbidden factors of x of length n. We address the following general problem: what information about the structure of x can be derived from the pair (g_x, f_x)? We prove that these two functions characterize, up to the automorphism exchanging the two letters, the language of factors of each single infinite Sturmian word.

    A Characterization of Bispecial Sturmian Words

    A finite Sturmian word w over the alphabet {a,b} is left special (resp. right special) if aw and bw (resp. wa and wb) are both Sturmian words. A bispecial Sturmian word is a Sturmian word that is both left and right special. We show as a main result that bispecial Sturmian words are exactly the maximal internal factors of Christoffel words, which are words coding the digital approximations of segments in the Euclidean plane. This result is an extension of the known relation between central words and primitive Christoffel words. Our characterization allows us to give an enumerative formula for bispecial Sturmian words. We also investigate the minimal forbidden words for the set of Sturmian words. Comment: Accepted to MFCS 201

    Linear-time Computation of Minimal Absent Words Using Suffix Array

    An absent word of a word y of length n is a word that does not occur in y. It is a minimal absent word if all its proper factors occur in y. Minimal absent words have been computed in genomes of organisms from all domains of life; their computation provides a fast alternative for measuring approximation in sequence comparison. There exists an O(n)-time and O(n)-space algorithm for computing all minimal absent words on a fixed-sized alphabet based on the construction of suffix automata (Crochemore et al., 1998). No implementation of this algorithm is publicly available. There also exists an O(n^2)-time and O(n)-space algorithm for the same problem based on the construction of suffix arrays (Pinho et al., 2009). An implementation of this algorithm was also provided by the authors and is currently the fastest available. In this article, we bridge this unpleasant gap by presenting an O(n)-time and O(n)-space algorithm for computing all minimal absent words based on the construction of suffix arrays. Experimental results using real and synthetic data show that the respective implementation outperforms the one by Pinho et al.
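
The definition of a minimal absent word in the abstract above can be made concrete with a brute-force sketch. This is not the linear-time suffix-array algorithm the abstract describes; it is a naive enumeration meant only to illustrate the definition. The function names and the explicit `alphabet` parameter are assumptions introduced for this example.

```python
def factors(y):
    """Return the set of all factors (substrings) of y, including the empty word."""
    return {y[i:j] for i in range(len(y) + 1) for j in range(i, len(y) + 1)}


def minimal_absent_words(y, alphabet):
    """Naively compute the minimal absent words of y over the given alphabet.

    A word of length >= 2 written as a + u + b (a, b letters) is a minimal
    absent word exactly when a + u and u + b occur in y but a + u + b does not,
    because the set of factors is closed under taking factors.
    A single letter is a minimal absent word when it does not occur in y.
    """
    occurring = factors(y)
    result = set()

    # Length-1 case: letters of the alphabet that never occur in y.
    for a in alphabet:
        if a not in occurring:
            result.add(a)

    # Longer case: extend each nonempty factor a + u by one letter b on the right.
    for left in occurring:
        if not left:
            continue
        u = left[1:]
        for b in alphabet:
            if u + b in occurring and left + b not in occurring:
                result.add(left + b)
    return result


print(sorted(minimal_absent_words("abaab", "ab")))
```

The sketch stores every factor explicitly, so it uses quadratic space and is only suitable for short words.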

    Optimal Computation of Avoided Words

    The deviation of the observed frequency of a word w from its expected frequency in a given sequence x is used to determine whether or not the word is avoided. This concept is particularly useful in DNA linguistic analysis. The value of the standard deviation of w, denoted by std(w), effectively characterises the extent of a word by its edge contrast in the context in which it occurs. A word w of length k > 2 is a ρ-avoided word in x if std(w) ≤ ρ, for a given threshold ρ < 0. Notice that such a word may be completely absent from x. Hence computing all such words naïvely can be a very time-consuming procedure, in particular for large k. In this article, we propose an O(n)-time and O(n)-space algorithm to compute all ρ-avoided words of length k in a given sequence x of length n over a fixed-sized alphabet. We also present a time-optimal O(σn)-time and O(σn)-space algorithm to compute all ρ-avoided words (of any length) in a sequence of length n over an alphabet of size σ. Furthermore, we provide a tight asymptotic upper bound for the number of ρ-avoided words and the expected length of the longest one. We make available an open-source implementation of our algorithm. Experimental results, using both real and synthetic data, show the efficiency of our implementation.

    On the Structure of Bispecial Sturmian Words

    A balanced word is one in which any two factors of the same length contain the same number of each letter of the alphabet up to one. Finite binary balanced words are called Sturmian words. A Sturmian word is bispecial if it can be extended to the left and to the right with both letters remaining a Sturmian word. There is a deep relation between bispecial Sturmian words and Christoffel words, which are the digital approximations of Euclidean segments in the plane. In 1997, J. Berstel and A. de Luca proved that palindromic bispecial Sturmian words are precisely the maximal internal factors of primitive Christoffel words. We extend this result by showing that bispecial Sturmian words are precisely the maximal internal factors of all Christoffel words. Our characterization allows us to give an enumerative formula for bispecial Sturmian words. We also investigate the minimal forbidden words for the language of Sturmian words. Comment: arXiv admin note: substantial text overlap with arXiv:1204.167
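
The balance condition and the bispecial property defined in the abstract above can be checked directly on short words. The following is a brute-force sketch, not a method from the paper; the function names are hypothetical, and the default alphabet "ab" is an assumption.

```python
def is_balanced(w):
    """Check that any two factors of w of equal length differ by at most one
    in the number of occurrences of each letter."""
    letters = set(w)
    for n in range(1, len(w) + 1):
        # All distinct factors of length n.
        facs = {w[i:i + n] for i in range(len(w) - n + 1)}
        for a in letters:
            counts = [f.count(a) for f in facs]
            if max(counts) - min(counts) > 1:
                return False
    return True


def is_bispecial(w, alphabet="ab"):
    """Check that w stays balanced when extended by either letter on either side.

    Over a binary alphabet this follows the definition in the abstract:
    a finite Sturmian (balanced) word is bispecial if aw, bw, wa and wb
    are all Sturmian.
    """
    if not is_balanced(w):
        return False
    return all(is_balanced(c + w) and is_balanced(w + c) for c in alphabet)


print(is_bispecial("aba"), is_bispecial("aab"))
```

Here "aba" passes all four extensions, while "aab" fails because "aabb" contains the factors "aa" and "bb", whose letter counts differ by two.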

    Minimal Forbidden Factors of Circular Words

    Minimal forbidden factors are a useful tool for investigating properties of words and languages. Two factorial languages are distinct if and only if they have different (antifactorial) sets of minimal forbidden factors. There exist algorithms for computing the minimal forbidden factors of a word, as well as of a regular factorial language. Conversely, Crochemore et al. [IPL, 1998] gave an algorithm that, given the trie recognizing a finite antifactorial language M, computes a DFA recognizing the language whose set of minimal forbidden factors is M. In the same paper, they showed that the obtained DFA is minimal if the input trie recognizes the minimal forbidden factors of a single word. We generalize this result to the case of a circular word. We discuss several combinatorial properties of the minimal forbidden factors of a circular word. As a byproduct, we obtain a formal definition of the factor automaton of a circular word. Finally, we investigate the case of minimal forbidden factors of the circular Fibonacci words. Comment: To appear in Theoretical Computer Science

    Suffix conjugates for a class of morphic subshifts

    Let A be a finite alphabet and f: A^* --> A^* be a morphism with an iterative fixed point f^\omega(\alpha), where \alpha is in A. Consider the subshift (X, T), where X is the shift orbit closure of f^\omega(\alpha) and T: X --> X is the shift map. Let S be a finite alphabet that is in bijective correspondence via a mapping c with the set of nonempty suffixes of the images f(a) for a in A. Let calS, a subset of S^N, be the set of infinite words s = (s_n)_{n\geq 0} such that \pi(s) := c(s_0) f(c(s_1)) f^2(c(s_2))... is in X. We show that if f is primitive and f(A) is a suffix code, then there exists a mapping H: calS --> calS such that (calS, H) is a topological dynamical system and \pi: (calS, H) --> (X, T) is a conjugacy; we call (calS, H) the suffix conjugate of (X, T). In the special case when f is the Fibonacci or the Thue-Morse morphism, we show that the subshift (calS, T) is sofic, that is, the language of calS is regular.

    Cyclic Complexity of Words

    We introduce and study a complexity function on words c_x(n), called cyclic complexity, which counts the number of conjugacy classes of factors of length n of an infinite word x. We extend the well-known Morse-Hedlund theorem to the setting of cyclic complexity by showing that a word is ultimately periodic if and only if it has bounded cyclic complexity. Unlike most complexity functions, cyclic complexity distinguishes between Sturmian words of different slopes. We prove that if x is a Sturmian word and y is a word having the same cyclic complexity as x, then up to renaming letters, x and y have the same set of factors. In particular, y is also Sturmian of slope equal to that of x. Since c_x(n) = 1 for some n ≥ 1 implies x is periodic, it is natural to consider the quantity liminf_{n→∞} c_x(n). We show that if x is a Sturmian word, then liminf_{n→∞} c_x(n) = 2. We prove however that this is not a characterization of Sturmian words by exhibiting a restricted class of Toeplitz words, including the period-doubling word, which also verify this same condition on the limit infimum. In contrast we show that, for the Thue-Morse word t, liminf_{n→∞} c_t(n) = +∞. Comment: To appear in Journal of Combinatorial Theory, Series
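
To illustrate the counting behind cyclic complexity, here is a small sketch that counts conjugacy classes among the length-n factors of a finite word. The abstract defines c_x(n) for an infinite word x; applying the same count to a finite word is an assumption made for this example, and on a finite prefix it can only see the factors that occur in that prefix. The function names are hypothetical.

```python
def least_rotation(u):
    """Return the lexicographically smallest rotation (conjugate) of a nonempty word u.

    Two words are conjugate exactly when they have the same least rotation,
    so this value serves as a canonical representative of the conjugacy class.
    """
    return min(u[i:] + u[:i] for i in range(len(u)))


def cyclic_complexity(w, n):
    """Count conjugacy classes among the factors of length n (n >= 1) of the finite word w."""
    facs = {w[i:i + n] for i in range(len(w) - n + 1)}
    return len({least_rotation(f) for f in facs})


# Factors of length 3 of "abaab" are aba, baa, aab, which are all conjugate.
print([cyclic_complexity("abaab", n) for n in (1, 2, 3)])
```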

    Correlations of minimal forbidden factors of the Fibonacci word

    If u and v are two words, the correlation of u over v is a binary word that encodes all possible overlaps between u and v. This concept was introduced by Guibas and Odlyzko as a key element of their method for enumerating the number of words of length n over a given alphabet that avoid a given set of forbidden factors. In this paper we characterize the pairwise correlations between the minimal forbidden factors of the infinite Fibonacci word. Comment: 11 pages
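
A minimal sketch of how such a correlation can be computed follows. The abstract does not state the bit convention, so the one used here is an assumption: the result has length |u|, and bit i (counting from the left, starting at 0) is 1 when placing v under u with its first letter below position i of u makes the overlapping segments agree. The function name is hypothetical.

```python
def correlation(u, v):
    """Return the correlation of u over v as a binary string of length len(u).

    Bit i is '1' when v, shifted so that it starts under position i of u,
    agrees with u on the overlapping segment, and '0' otherwise.
    """
    bits = []
    for i in range(len(u)):
        # Length of the segment where the shifted v overlaps u.
        overlap = min(len(u) - i, len(v))
        bits.append("1" if u[i:i + overlap] == v[:overlap] else "0")
    return "".join(bits)


# Autocorrelation: the 1-bits mark the shifts at which a suffix of the word
# is also a prefix of it.
print(correlation("abracadabra", "abracadabra"))
print(correlation("aab", "ab"))
```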

    Minimal Absent Words in Rooted and Unrooted Trees

    We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp. unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp. unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n