7 research outputs found

    Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets

    The directed acyclic word graph (DAWG) of a string y is the smallest (partial) DFA which recognizes all suffixes of y and has only O(n) nodes and edges. We present the first O(n)-time algorithm for computing the DAWG of a given string y of length n over an integer alphabet of polynomial size in n. We also show that a straightforward modification to our DAWG construction algorithm leads to the first O(n)-time algorithm for constructing the affix tree of a given string y over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. As an application of our O(n)-time DAWG construction algorithm, we show that the set MAW(y) of all minimal absent words of y can be computed in optimal O(n + |MAW(y)|) time and O(n) working space for integer alphabets.
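    To make the DAWG definition above concrete, here is a minimal sketch of the classical online suffix-automaton construction. It is not the linear-time integer-alphabet algorithm announced in the abstract: it relies on Python dictionaries for transitions, so its running time depends on hashing. The names `State`, `build_dawg`, and `accepts_suffix` are illustrative assumptions.

```python
class State:
    """One DAWG node: longest string length, suffix link, outgoing edges."""
    __slots__ = ("length", "link", "next")

    def __init__(self, length=0, link=-1):
        self.length = length
        self.link = link
        self.next = {}


def build_dawg(y):
    """Build the DAWG of y with the classical online construction.

    Assumption: this is a textbook sketch, not the paper's algorithm.
    Returns (states, accepting), where accepting is the set of state
    indices reached by the suffixes of y (including the empty suffix).
    """
    states = [State()]
    last = 0
    for ch in y:
        cur = len(states)
        states.append(State(states[last].length + 1))
        p = last
        # Add the new edge along the suffix-link chain until one exists.
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].length + 1 == states[q].length:
                states[cur].link = q
            else:
                # Split q: the clone keeps the shorter strings of q.
                clone = len(states)
                c = State(states[p].length + 1, states[q].link)
                c.next = dict(states[q].next)
                states.append(c)
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    # Accepting states are exactly those on the suffix-link path from last.
    accepting = set()
    p = last
    while p != -1:
        accepting.add(p)
        p = states[p].link
    return states, accepting


def accepts_suffix(states, accepting, w):
    """Return True if w is a suffix of the string the DAWG was built from."""
    s = 0
    for ch in w:
        s = states[s].next.get(ch)
        if s is None:
            return False
    return s in accepting
```

    For y = "abcbc", the automaton accepts "bc" and "cbc" but rejects "cb", which is a factor of y without being a suffix.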

    Linear-time Computation of DAWGs, Symmetric Indexing Structures, and MAWs for Integer Alphabets

    The directed acyclic word graph (DAWG) of a string y of length n is the smallest (partial) DFA which recognizes all suffixes of y with only O(n) nodes and edges. In this paper, we show how to construct the DAWG for the input string y from the suffix tree for y, in O(n) time for integer alphabets of polynomial size in n. In so doing, we first describe a folklore algorithm which, given the suffix tree for y, constructs the DAWG for the reversed string of y in O(n) time. Then, we present our algorithm that builds the DAWG for y in O(n) time for integer alphabets, from the suffix tree for y. We also show that a straightforward modification to our DAWG construction algorithm leads to the first O(n)-time algorithm for constructing the affix tree of a given string y over an integer alphabet. Affix trees are a text indexing structure supporting bidirectional pattern searches. We then discuss how our constructions can lead to linear-time algorithms for building other text indexing structures, such as linear-size suffix tries and symmetric CDAWGs, in the case of integer alphabets. As a further application of our O(n)-time DAWG construction algorithm, we show that the set MAW(y) of all minimal absent words (MAWs) of y can be computed in optimal, input- and output-sensitive O(n + |MAW(y)|) time and O(n) working space for integer alphabets. Comment: This is an extended version of the paper "Computing DAWGs and Minimal Absent Words in Linear Time for Integer Alphabets" from MFCS 201

    Constructing Antidictionaries in Output-Sensitive Space

    A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1, y_2, ..., y_k over an alphabet Σ, we are asked to compute the set M^ℓ_{y_1#...#y_k} of minimal absent words of length at most ℓ of the word y = y_1#y_2#...#y_k, where # ∉ Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n = |y| using any of the many available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y, which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_{y_1#...#y_N}|| = o(n) for all N ∈ [1,k]. For instance, in the human genome, n ≈ 3 × 10^9 but ||M^12_{y_1#...#y_k}|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all of M^ℓ_{y_1}, ..., M^ℓ_{y_1#...#y_k} can be computed in O(kn + ∑_{N=1}^{k} ||M^ℓ_{y_1#...#y_N}||) total time using O(MaxIn + MaxOut) space, where MaxIn is the length of the longest word in {y_1, ..., y_k} and MaxOut = max{||M^ℓ_{y_1#...#y_N}|| : N ∈ [1,k]}. Proof-of-concept experimental results are also provided, confirming our theoretical findings and justifying our contribution.
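    The definition of a minimal absent word given above can be checked directly on small inputs. The following is a brute-force sketch for a single word y only, assuming no separator # and no incremental processing; it takes quadratic space and is not the output-sensitive algorithm of the abstract. The function name `minimal_absent_words` and its parameters are illustrative assumptions. It uses the fact that all proper factors of x occur in y exactly when the two factors obtained by dropping the first or the last letter of x occur in y.

```python
def minimal_absent_words(y, alphabet, max_len):
    """Brute-force set of minimal absent words of y with length <= max_len.

    Assumption: illustrative only; it stores every factor of y, so it is
    suitable for short strings, not genome-scale inputs.
    """
    n = len(y)
    # All factors (substrings) of y, including the empty word.
    factors = {y[i:j] for i in range(n + 1) for j in range(i, n + 1)}
    maws = set()
    if max_len >= 1:
        # A single letter is a MAW when it does not occur in y at all,
        # because its only proper factor is the empty word.
        maws.update(a for a in alphabet if a not in factors)
    for f in factors:
        if not f or len(f) + 1 > max_len:
            continue
        for b in alphabet:
            x = f + b
            # x[:-1] == f occurs by construction; also require x[1:] to
            # occur and x itself to be absent.
            if x not in factors and x[1:] in factors:
                maws.add(x)
    return maws
```

    For y = "abaab" over the alphabet {a, b}, this yields {"bb", "aaa", "bab", "aaba"} with max_len = 4, and drops "aaba" when max_len = 3.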