9 research outputs found

    Asymptotic Optimality of Antidictionary Codes

    Full text link
    An antidictionary code is a lossless compression algorithm using an antidictionary which is a set of minimal words that do not occur as substrings in an input string. The code was proposed by Crochemore et al. in 2000, and its asymptotic optimality has been proved with respect to only a specific information source, called balanced binary source that is a binary Markov source in which a state transition occurs with probability 1/2 or 1. In this paper, we prove the optimality of both static and dynamic antidictionary codes with respect to a stationary ergodic Markov source on finite alphabet such that a state transition occurs with probability p(0<p≤1)p (0 < p \leq 1).Comment: 5 pages, to appear in the proceedings of 2010 IEEE International Symposium on Information Theory (ISIT2010

    Minimal Absent Words in Rooted and Unrooted Trees

    Get PDF
    We extend the theory of minimal absent words to (rooted and unrooted) trees, having edges labeled by letters from an alphabet of cardinality. We show that the set of minimal absent words of a rooted (resp.&nbsp;unrooted) tree T with n nodes has cardinality (resp.), and we show that these bounds are realized. Then, we exhibit algorithms to compute all minimal absent words in a rooted (resp.&nbsp;unrooted) tree in output-sensitive time (resp. assuming an integer alphabet of size polynomial in n

    Compression by Substring Enumeration Using Sorted Contingency Tables

    Get PDF
    This paper proposes two variants of improved Compression by Substring Enumeration (CSE) with a finite alphabet. In previous studies on CSE, an encoder utilizes inequalities which evaluate the number of occurrences of a substring or a minimal forbidden word (MFW) to be encoded. The inequalities are derived from a contingency table including the number of occurrences of a substring or an MFW. Moreover, codeword length of a substring and an MFW grows with the difference between the upper and lower bounds deduced from the inequalities, however the lower bound is not tight. Therefore, we derive a new tight lower bound based on the contingency table and consequently propose a new CSE algorithm using the new inequality. We also propose a new encoding order of substrings and MFWs based on a sorted contingency table such that both its row and column marginal total are sorted in descending order instead of a lexicographical order used in previous studies. We then propose a new CSE algorithm which is the first proposed CSE algorithm using the new encoding order. Experimental results show that compression ratios of all files of the Calgary corpus in the proposed algorithms are better than those of a previous study on CSE with a finite alphabet. Moreover, compression ratios under the second proposed CSE get better than or equal to that under a well-known compressor for 11 files amongst 14 files in the corpus

    Constructing Antidictionaries of Long Texts in Output-Sensitive Space

    Get PDF
    A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y1, … , yk over an alphabet Σ, we are asked to compute the set M{y1,…,yk}ℓ of minimal absent words of length at most ℓ of the collection {y1, … , yk}. The set M{y1,…,yk}ℓ contains all the words x such that x is absent from all the words of the collection while there exist i,j, such that the maximal proper suffix of x is a factor of yi and the maximal proper prefix of x is a factor of yj. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. Indeed, the set Myℓ of minimal absent words of a word y is equal to M{y1,…,yk}ℓ for any decomposition of y into a collection of words y1, … , yk such that there is an overlap of length at least ℓ − 1 between any two consecutive words in the collection. This computation generally requires Ω(n) space for n = |y| using any of the plenty available O(n) -time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ∥M{y1,…,yN}ℓ∥=o(n), for all N ∈ [1,k], where ∥S∥ denotes the sum of the lengths of words in set S. For instance, in the human genome, n ≈ 3 × 109 but ∥M{y1,…,yk}12∥≈106. We consider a constant-sized alphabet for stating our results. We show that allMy1ℓ,…,M{y1,…,yk}ℓ can be computed in O(kn+∑N=1k∥M{y1,…,yN}ℓ∥) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in {y1, … , yk} and MaxOut=max{∥M{y1,…,yN}ℓ∥:N∈[1,k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution

    Constructing Antidictionaries in Output-Sensitive Space

    Get PDF
    A word x that is absent from a word y is called minimal if all its proper factors occur in y. Given a collection of k words y_1,y_2,...,y_k over an alphabet Σ, we are asked to compute the set M^ℓ_y_1#...#y_k of minimal absent words of length at most ℓ of word y=y_1#y_2#...#y_k, #∉Σ. In data compression, this corresponds to computing the antidictionary of k documents. In bioinformatics, it corresponds to computing words that are absent from a genome of k chromosomes. This computation generally requires Ω(n) space for n=|y| using any of the plenty available O(n)-time algorithms. This is because an Ω(n)-sized text index is constructed over y which can be impractical for large n. We do the identical computation incrementally using output-sensitive space. This goal is reasonable when ||M^ℓ_y_1#...#y_N||=o(n), for all N∈[1,k]. For instance, in the human genome, n ≈ 3× 10^9 but ||M^12_y_1#...#y_k|| ≈ 10^6. We consider a constant-sized alphabet for stating our results. We show that all M^ℓ_y_1,...,M^ℓ_y_1#...#y_k can be computed in O(kn+∑^k_N=1||M^ℓ_y_1#...#y_N||) total time using O(MaxIn+MaxOut) space, where MaxIn is the length of the longest word in {y_1,...,y_k} and MaxOut={||M^ℓ_y_1#...#y_N||:N∈[1,k]}. Proof-of-concept experimental results are also provided confirming our theoretical findings and justifying our contribution

    Internal Shortest Absent Word Queries in Constant Time and Linear Space

    Get PDF
    International audienceGiven a string T of length n over an alphabet Σ ⊂ {1, 2,. .. , n O(1) } of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T [i] • • • T [j] of T. We present an O(n)-space data structure that answers such queries in constant time and can be constructed in O(n log σ n) time

    Internal shortest absent word queries

    Get PDF
    Given a string T of length n over an alphabet Σ ⊂ {1, 2, . . . , nO(1)} of size σ, we are to preprocess T so that given a range [i, j], we can return a representation of a shortest string over Σ that is absent in the fragment T[i] · · · T[j] of T. For any positive integer k ∈ [1, log logσ n], we present an O((n/k) · log logσ n)-size data structure, which can be constructed in O(n logσ n) time, and answers queries in time O(log logσ k)

    Querying and Efficiently Searching Large, Temporal Text Corpora

    Get PDF
    corecore