
    On the Complexity of the Smallest Grammar Problem over Fixed Alphabets

    In the smallest grammar problem, we are given a word w and we want to compute a preferably small context-free grammar G for the singleton language {w} (where the size of a grammar is the sum of the sizes of its rules, and the size of a rule is measured by the length of its right side). It is known that, for unbounded alphabets, the decision variant of this problem is NP-hard and the optimisation variant does not allow a polynomial-time approximation scheme, unless P = NP. We settle the long-standing open problem of whether these hardness results also hold for the more realistic case of a constant-size alphabet. More precisely, it is shown that the smallest grammar problem remains NP-complete (and its optimisation version is APX-hard), even if the alphabet is fixed and has size at least 17. The corresponding reduction is robust in the sense that it also works for an alternative size measure of grammars that is commonly used in the literature (i.e., a size measure that also takes the number of rules into account), and it also allows us to conclude that even computing the number of rules required by a smallest grammar is a hard problem. On the other hand, if the number of nonterminals (or, equivalently, the number of rules) is bounded by a constant, then the smallest grammar problem can be solved in polynomial time, which is shown by encoding it as a problem on graphs with interval structure. However, treating the number of rules as a parameter (in terms of parameterised complexity) yields W[1]-hardness. Furthermore, we present an O(3^|w|) exact exponential-time algorithm, based on dynamic programming. These three main questions are also investigated for 1-level grammars, i.e., grammars for which only the start rule contains nonterminals on the right side, thus investigating the impact of the “hierarchical depth” of grammars on the complexity of the smallest grammar problem. In this regard, we obtain for 1-level grammars similar, but slightly stronger, results.
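    To make the two size measures concrete, here is a minimal sketch (our own illustration, not code from the paper) of a straight-line grammar for a singleton language together with both measures described above; the dict representation and function names are our own.

```python
# A minimal sketch (not from the paper) of the two grammar-size measures
# described above. A grammar is a dict mapping each nonterminal to its
# right-hand side, a tuple of terminals and nonterminals; "S" is the start.
def grammar_size(rules):
    """Sum of right-hand-side lengths (the abstract's primary measure)."""
    return sum(len(rhs) for rhs in rules.values())

def grammar_size_with_rules(rules):
    """Alternative measure that also counts the number of rules."""
    return grammar_size(rules) + len(rules)

def expand(rules, symbol="S"):
    """Derive the unique word generated by a straight-line grammar."""
    if symbol not in rules:                      # terminal symbol
        return symbol
    return "".join(expand(rules, s) for s in rules[symbol])

# Example: a grammar for the singleton language {"abababab"}.
g = {"S": ("A", "A"), "A": ("a", "b", "a", "b")}
assert expand(g) == "abababab"
print(grammar_size(g), grammar_size_with_rules(g))   # 6 8
```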

    Searching for Smallest Grammars on Large Sequences and Application to DNA

    Motivated by the inference of the structure of genomic sequences, we address here the smallest grammar problem. In previous work, we introduced a new perspective on this problem, splitting the task into two different optimization problems: choosing which words will be considered constituents of the final grammar and finding a minimal parsing with these constituents. Here we focus on making these ideas applicable to large sequences. First, we improve the complexity of existing algorithms by using the concept of maximal repeats when choosing which substrings will be the constituents of the grammar. Then, we improve the size of the grammars by cautiously adding a minimal parsing optimization step. Together, these approaches enable us to propose new practical algorithms that return smaller grammars (up to 10%) in approximately the same amount of time as their competitors on a classical set of genomic sequences and on whole genomes of model organisms.
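    The minimal-parsing subproblem mentioned above reduces to a shortest-path computation over text positions. The sketch below is our own illustration with hypothetical names (it is not the authors' code and ignores their maximal-repeat machinery): given a fixed constituent set, it finds the fewest constituents whose concatenation equals the word.

```python
# Hypothetical sketch of a "minimal parsing" step: a DP over text
# positions, where best[i] is the fewest constituents covering word[:i].
def minimal_parsing(word, constituents):
    n = len(word)
    best = [float("inf")] * (n + 1)
    back = [None] * (n + 1)          # constituent ending at position i
    best[0] = 0
    for i in range(n):
        if best[i] == float("inf"):
            continue
        for c in constituents:
            if word.startswith(c, i) and best[i] + 1 < best[i + len(c)]:
                best[i + len(c)] = best[i] + 1
                back[i + len(c)] = c
    # Reconstruct the parse right-to-left.
    parse, i = [], n
    while i > 0:
        parse.append(back[i])
        i -= len(back[i])
    return parse[::-1]

# Single letters are included, so a parse always exists.
print(minimal_parsing("abababab", {"ab", "abab", "a", "b"}))
# ['abab', 'abab']
```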

    Regular Expression Search on Compressed Text

    We present an algorithm for searching for regular expression matches in compressed text. The algorithm reports the number of matching lines in the uncompressed text in time linear in the size of its compressed version. We define efficient data structures that yield nearly optimal complexity bounds and provide a sequential implementation, zearch, that requires up to 25% less time than the state of the art.
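    One standard idea behind pattern matching on grammar-compressed text (sketched below as our own illustration, with no claim that this is zearch's actual implementation) is to run a finite automaton over a straight-line grammar by composing, for each nonterminal, the state-to-state map its expansion induces; each rule is then processed once, giving time linear in the compressed size.

```python
# Sketch: simulate a DFA over a grammar-compressed string by composing
# per-nonterminal state maps. DFA and grammar are toy examples of ours.
from functools import lru_cache

# DFA over {a, b} accepting strings containing "ab"; states 0, 1, 2 (2 final).
delta = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 2,
         (2, "a"): 2, (2, "b"): 2}

# Straight-line grammar deriving "abababab".
rules = {"S": ("A", "A"), "A": ("B", "B"), "B": ("a", "b")}

@lru_cache(maxsize=None)
def state_map(symbol):
    """Map each DFA state to the state reached after reading exp(symbol)."""
    if symbol not in rules:                       # terminal
        return tuple(delta[(q, symbol)] for q in range(3))
    m = tuple(range(3))                           # identity map
    for s in rules[symbol]:
        sub = state_map(s)
        m = tuple(sub[q] for q in m)              # compose left to right
    return m

print(state_map("S")[0] == 2)   # True: the derived text matches "ab"
```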

    Approximation of grammar-based compression via recompression

    In this paper we present a simple linear-time algorithm constructing a context-free grammar of size O(g log(N/g)) for the input string, where N is the size of the input string and g the size of the optimal grammar generating this string. The algorithm works for arbitrary-size alphabets, but the running time is linear assuming that the alphabet Σ of the input string can be identified with numbers from {1, ..., N^c} for some constant c. Otherwise, an additional cost of O(N log |Σ|) is needed. Algorithms with such approximation guarantees and running times are known; the novelty of this paper is the particular simplicity of the algorithm as well as of its analysis, which uses a general technique of recompression recently introduced by the author. Furthermore, contrary to previous results, this work uses the LZ representation of the input string neither in the construction nor in the analysis.
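    Recompression is built from two elementary operations applied in rounds: block compression, which replaces each maximal run of one letter by a fresh symbol, and pair compression, which replaces every occurrence of a fixed pair of distinct letters. The sketch below is our own toy illustration of those two operations, not the paper's algorithm; since each round shrinks the string by roughly a constant factor, O(log N) rounds suffice.

```python
# Toy sketch of the two recompression operations. Each replacement becomes
# a grammar rule, so the recorded rules form the output grammar.
import itertools

fresh = itertools.count()

def block_compress(word, rules):
    out, i = [], 0
    while i < len(word):
        j = i
        while j < len(word) and word[j] == word[i]:
            j += 1
        if j - i > 1:                       # maximal run of length >= 2
            x = f"X{next(fresh)}"
            rules[x] = word[i:j]            # rule X -> a^k
            out.append(x)
        else:
            out.append(word[i])
        i = j
    return out

def pair_compress(word, a, b, rules):
    x = f"X{next(fresh)}"
    rules[x] = [a, b]                       # rule X -> ab
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
            out.append(x); i += 2
        else:
            out.append(word[i]); i += 1
    return out

rules = {}
w = list("aabab")
w = block_compress(w, rules)                # ['X0', 'b', 'a', 'b']
w = pair_compress(w, "a", "b", rules)       # ['X0', 'b', 'X1']
print(w, rules)
```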

    On vocabulary size of grammar-based codes

    We discuss inequalities holding between the vocabulary size, i.e., the number of distinct nonterminal symbols in a grammar-based compression of a string, and the excess length of the respective universal code, i.e., the code-based analog of algorithmic mutual information. The aim is to strengthen inequalities which were discussed in a weaker form in linguistics but which shed some light on the redundancy of efficiently computable codes. The main contribution of the paper is a construction of universal grammar-based codes for which the excess lengths can be bounded easily.

    Computing LZ77 in Run-Compressed Space

    In this paper, we show that the LZ77 factorization of a text T ∈ Σ^n can be computed in O(R log n) bits of working space and O(n log R) time, R being the number of runs in the Burrows-Wheeler transform of T reversed. For extremely repetitive inputs, the working space can be as low as O(log n) bits: exponentially smaller than the text itself. As a direct consequence of our result, we show that a class of repetition-aware self-indexes based on a combination of the run-length encoded BWT and LZ77 can be built in asymptotically optimal O(R + z) words of working space, z being the size of the LZ77 parsing.
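    For reference, the object being computed is the LZ77 factorization itself: each phrase is either a fresh letter or the longest match starting at an earlier position. The naive O(n^2)-time sketch below is our own illustration of that definition and makes no attempt at the paper's space bounds.

```python
# Naive LZ77 factorization: at each position, take the longest previous
# occurrence (possibly overlapping the current one) or emit a fresh letter.
def lz77(text):
    phrases, i = [], 0
    while i < len(text):
        best_len, best_src = 0, -1
        for j in range(i):                       # scan earlier positions
            k = 0
            while i + k < len(text) and text[j + k] == text[i + k]:
                k += 1
            if k > best_len:
                best_len, best_src = k, j
        if best_len == 0:
            phrases.append((None, 0, text[i]))   # fresh letter
            i += 1
        else:
            phrases.append((best_src, best_len, None))
            i += best_len
    return phrases

print(lz77("abababab"))
# [(None, 0, 'a'), (None, 0, 'b'), (0, 6, None)] -- self-referential phrase
```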

    Universal Coding and Prediction on Martin-Löf Random Points

    We perform an effectivization of classical results concerning universal coding and prediction for stationary ergodic processes over an arbitrary finite alphabet. That is, we lift the well-known almost-sure statements to statements about Martin-Löf random sequences. Most of this work is quite mechanical, but along the way we complete a result of Ryabko from 2008 by showing that each universal probability measure in the sense of universal coding induces a universal predictor in the prequential sense. Surprisingly, the effectivization of this implication holds true provided the universal measure does not ascribe too low conditional probabilities to individual symbols. As an example, we show that the Prediction by Partial Matching (PPM) measure satisfies this requirement. In the almost-sure setting, the requirement is superfluous.
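    One simple way to guarantee that conditional probabilities are never too low is to mix a base predictor with the uniform distribution over the alphabet. The sketch below is our own illustration of that idea, not a construction from the paper and not PPM; all names are ours.

```python
# Mixing any base predictor with the uniform distribution bounds every
# conditional probability below by eps / |alphabet|.
def mix_with_uniform(base_cond_prob, alphabet, eps=0.01):
    def cond_prob(symbol, history):
        p = base_cond_prob(symbol, history)
        return (1 - eps) * p + eps / len(alphabet)
    return cond_prob

# Toy base predictor: Laplace-smoothed frequency of the symbol in history.
def laplace(symbol, history, alphabet="ab"):
    return (history.count(symbol) + 1) / (len(history) + len(alphabet))

predict = mix_with_uniform(laplace, "ab")
print(predict("a", "aaab"))   # close to laplace's 4/6, never below 0.005
```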