2,194 research outputs found

    Which Regular Expression Patterns are Hard to Match?

    Full text link
    Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. A classic algorithm for these problems constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an O(mn)O(mn) running time (where mm is the length of the pattern and nn is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, word break problem etc. In this paper, we show that the complexity of regular expression matching can be characterized based on its {\em depth} (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., [aabbc][a|ab|bc]^*. This corresponds to the so-called {\em word break} problem, for which a dynamic programming algorithm with a runtime of (roughly) O(nm)O(n\sqrt{m}) is known. We show that the latter bound is not tight and improve the runtime to O(nm0.44)O(nm^{0.44\ldots})

    Efficient LZ78 factorization of grammar compressed text

    Full text link
    We present an efficient algorithm for computing the LZ78 factorization of a text, where the text is represented as a straight line program (SLP), which is a context free grammar in the Chomsky normal form that generates a single string. Given an SLP of size nn representing a text SS of length NN, our algorithm computes the LZ78 factorization of TT in O(nN+mlogN)O(n\sqrt{N}+m\log N) time and O(nN+m)O(n\sqrt{N}+m) space, where mm is the number of resulting LZ78 factors. We also show how to improve the algorithm so that the nNn\sqrt{N} term in the time and space complexities becomes either nLnL, where LL is the length of the longest LZ78 factor, or (Nα)(N - \alpha) where α0\alpha \geq 0 is a quantity which depends on the amount of redundancy that the SLP captures with respect to substrings of SS of a certain length. Since m=O(N/logσN)m = O(N/\log_\sigma N) where σ\sigma is the alphabet size, the latter is asymptotically at least as fast as a linear time algorithm which runs on the uncompressed string when σ\sigma is constant, and can be more efficient when the text is compressible, i.e. when mm and nn are small.Comment: SPIRE 201

    Prospects and limitations of full-text index structures in genome analysis

    Get PDF
    The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article brings a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, but also their practical limitations, are explained and compared

    Noiseless compression using non-Markov models

    Get PDF
    Adaptive data compression techniques can be viewed as consisting of a model specified by a database common to the encoder and decoder, an encoding rule and a rule for updating the model to ensure that the encoder and decoder always agree on the interpretation of the next transmission. The techniques which fit this framework range from run-length coding, to adaptive Huffman and arithmetic coding, to the string-matching techniques of Lempel and Ziv. The compression obtained by arithmetic coding is dependent on the generality of the source model. For many sources, an independent-letter model is clearly insufficient. Unfortunately, a straightforward implementation of a Markov model requires an amount of space exponential in the number of letters remembered. The Directed Acyclic Word Graph (DAWG) can be constructed in time and space proportional to the text encoded, and can be used to estimate the probabilities required for arithmetic coding based on an amount of memory which varies naturally depending on the encoded text. The tail of that portion of the text which was encoded is the longest suffix that has occurred previously. The frequencies of letters following these previous occurrences can be used to estimate the probability distribution of the next letter. Experimental results indicate that compression is often far better than that obtained using independent-letter models, and sometimes also significantly better than other non-independent techniques

    Detecting One-variable Patterns

    Full text link
    Given a pattern p=s1x1s2x2sr1xr1srp = s_1x_1s_2x_2\cdots s_{r-1}x_{r-1}s_r such that x1,x2,,xr1{x,x}x_1,x_2,\ldots,x_{r-1}\in\{x,\overset{{}_{\leftarrow}}{x}\}, where xx is a variable and x\overset{{}_{\leftarrow}}{x} its reversal, and s1,s2,,srs_1,s_2,\ldots,s_r are strings that contain no variables, we describe an algorithm that constructs in O(rn)O(rn) time a compact representation of all PP instances of pp in an input string of length nn over a polynomially bounded integer alphabet, so that one can report those instances in O(P)O(P) time.Comment: 16 pages (+13 pages of Appendix), 4 figures, accepted to SPIRE 201
    corecore