Which Regular Expression Patterns are Hard to Match?
Regular expressions constitute a fundamental notion in formal language theory
and are frequently used in computer science to define search patterns. A
classic algorithm for these problems (matching and membership testing)
constructs and simulates a non-deterministic finite automaton corresponding to
the expression, resulting in an $O(mn)$ running time (where $m$ is the length
of the pattern and $n$ is the length of the text). This running time can be
improved slightly (by a polylogarithmic factor), but no significantly faster
solutions are known. At the same time, much faster algorithms exist for various
special cases of regular expressions, including dictionary matching, wildcard
matching, subset matching, the word break problem, etc.
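A minimal sketch of that classic simulation, assuming the epsilon-NFA is given explicitly as transition tables; the tiny automaton for a(b|c)* and all names below are illustrative, not the paper's code. The set of active states has size O(m), and it is advanced once per text character, giving the O(mn) bound.

    def eps_closure(states, eps):
        # Expand a set of states by following epsilon transitions.
        stack, seen = list(states), set(states)
        while stack:
            s = stack.pop()
            for t in eps.get(s, ()):
                if t not in seen:
                    seen.add(t)
                    stack.append(t)
        return seen

    def nfa_accepts(text, delta, eps, start, accepting):
        active = eps_closure({start}, eps)
        for ch in text:  # one pass over the text, touching O(m) states per character
            moved = {t for s in active for t in delta.get((s, ch), ())}
            active = eps_closure(moved, eps)
        return bool(active & accepting)

    # Hand-built NFA for a(b|c)*: 0 --a--> 1, 1 --b--> 1, 1 --c--> 1; state 1 accepts.
    delta = {(0, 'a'): [1], (1, 'b'): [1], (1, 'c'): [1]}
    print(nfa_accepts("abcb", delta, {}, 0, {1}))   # True
    print(nfa_accepts("ba", delta, {}, 0, {1}))     # False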
In this paper, we show that the complexity of regular expression matching can
be characterized based on its {\em depth} (when interpreted as a formula). Our
results hold for expressions involving concatenation, OR, Kleene star and
Kleene plus. For regular expressions of depth two (involving any combination of
the above operators), we show the following dichotomy: matching and membership
testing can be solved in near-linear time, except for "concatenations of
stars", which cannot be solved in strongly sub-quadratic time assuming the
Strong Exponential Time Hypothesis (SETH). For regular expressions of depth
three the picture is more complex. Nevertheless, we show that all problems can
either be solved in strongly sub-quadratic time, or cannot be solved in
strongly sub-quadratic time assuming SETH.
An intriguing special case of membership testing involves regular expressions
of the form "a star of an OR of concatenations", e.g., $(a|ab|c)^*$. This
corresponds to the so-called {\em word break} problem, for which a dynamic
programming algorithm with a runtime of (roughly) $O(n\sqrt{m})$ is known. We
show that the latter bound is not tight and improve the runtime to
$O(nm^{0.44\ldots})$.
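The word break special case also admits a simple textbook dynamic program, sketched below for concreteness; this is not the paper's improved algorithm, and the function name and toy dictionary are illustrative.

    def word_break(t, dictionary):
        # reachable[i] is True iff t[:i] can be split into dictionary words.
        n = len(t)
        words = set(dictionary)
        max_len = max((len(w) for w in words), default=0)
        reachable = [False] * (n + 1)
        reachable[0] = True
        for i in range(1, n + 1):
            for j in range(max(0, i - max_len), i):
                if reachable[j] and t[j:i] in words:
                    reachable[i] = True
                    break
        return reachable[n]

    # Membership in "a star of an OR of concatenations", e.g. (a|ab|c)*:
    print(word_break("aabc", ["a", "ab", "c"]))   # True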
Efficient LZ78 factorization of grammar compressed text
We present an efficient algorithm for computing the LZ78 factorization of a
text, where the text is represented as a straight line program (SLP), which is
a context free grammar in the Chomsky normal form that generates a single
string. Given an SLP of size $n$ representing a text $T$ of length $N$, our
algorithm computes the LZ78 factorization of $T$ in $O(n\sqrt{N} + m\log N)$
time and $O(n\sqrt{N} + m)$ space, where $m$ is the number of resulting LZ78
factors. We also show how to improve the algorithm so that the $n\sqrt{N}$ term
in the time and space complexities becomes either $nL$, where $L$ is the length
of the longest LZ78 factor, or $(N-\alpha)$, where $\alpha \geq 0$ is a quantity
which depends on the amount of redundancy that the SLP captures with respect to
substrings of $T$ of a certain length. Since $m = O(N/\log_\sigma N)$, where
$\sigma$ is the alphabet size, the latter is asymptotically at least as fast as
a linear time algorithm which runs on the uncompressed string when $\sigma$ is
constant, and can be more efficient when the text is compressible, i.e., when
$m$ and $n$ are small.
Comment: SPIRE 201
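For reference, a minimal sketch of the classic LZ78 factorization computed directly on an uncompressed string; the paper's contribution is to obtain the same factorization from the SLP without fully decompressing it. The function name is illustrative.

    def lz78_factorize(s):
        # Each factor is (index of the longest previously seen factor that is a prefix, next char).
        dictionary = {"": 0}          # factor string -> factor index
        factors = []
        current = ""
        for c in s:
            if current + c in dictionary:
                current += c
            else:
                factors.append((dictionary[current], c))
                dictionary[current + c] = len(dictionary)
                current = ""
        if current:                   # text ends inside a known factor
            factors.append((dictionary[current[:-1]], current[-1]))
        return factors

    print(lz78_factorize("abababa"))   # [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a')]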
Prospects and limitations of full-text index structures in genome analysis
The combination of incessant advances in sequencing technology producing large amounts of data and innovative bioinformatics approaches, designed to cope with this data flood, has led to new interesting results in the life sciences. Given the magnitude of sequence data to be processed, many bioinformatics tools rely on efficient solutions to a variety of complex string problems. These solutions include fast heuristic algorithms and advanced data structures, generally referred to as index structures. Although the importance of index structures is generally known to the bioinformatics community, the design and potency of these data structures, as well as their properties and limitations, are less understood. Moreover, the last decade has seen a boom in the number of variant index structures featuring complex and diverse memory-time trade-offs. This article provides a comprehensive state-of-the-art overview of the most popular index structures and their recently developed variants. Their features, interrelationships, the trade-offs they impose, and their practical limitations are explained and compared.
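As a toy illustration of the kind of full-text index such overviews cover, the sketch below builds a plain suffix array and counts pattern occurrences by binary search; genome-scale indexes (e.g., the FM-index or compressed suffix arrays) answer the same queries in far less space. The naive construction and all names are purely illustrative.

    from bisect import bisect_left, bisect_right

    def build_suffix_array(text):
        # Naive O(n^2 log n) construction; linear-time algorithms exist but are longer.
        return sorted(range(len(text)), key=lambda i: text[i:])

    def count_occurrences(text, sa, pattern):
        # Suffixes starting with `pattern` form one contiguous block of the suffix array.
        prefixes = [text[i:i + len(pattern)] for i in sa]   # materialized only for this sketch
        return bisect_right(prefixes, pattern) - bisect_left(prefixes, pattern)

    genome = "ACGTACGTGACG"
    sa = build_suffix_array(genome)
    print(count_occurrences(genome, sa, "ACG"))   # 3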
Noiseless compression using non-Markov models
Adaptive data compression techniques can be viewed as consisting of a model specified by a database common to the encoder and decoder, an encoding rule and a rule for updating the model to ensure that the encoder and decoder always agree on the interpretation of the next transmission. The techniques which fit this framework range from run-length coding, to adaptive Huffman and arithmetic coding, to the string-matching techniques of Lempel and Ziv. The compression obtained by arithmetic coding is dependent on the generality of the source model. For many sources, an independent-letter model is clearly insufficient. Unfortunately, a straightforward implementation of a Markov model requires an amount of space exponential in the number of letters remembered. The Directed Acyclic Word Graph (DAWG) can be constructed in time and space proportional to the text encoded, and can be used to estimate the probabilities required for arithmetic coding based on an amount of memory which varies naturally depending on the encoded text. The context used is the longest suffix of the text encoded so far that has also occurred previously; the frequencies of the letters following these previous occurrences can be used to estimate the probability distribution of the next letter. Experimental results indicate that compression is often far better than that obtained using independent-letter models, and sometimes also significantly better than other non-independent techniques.
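The prediction rule described here can be made concrete with a deliberately naive sketch: find the longest suffix of the already-encoded text that has occurred earlier, and tally the letters that followed those earlier occurrences. The paper does this in time and space proportional to the text via a DAWG; the quadratic scan below only shows what is being estimated, and the function name is illustrative.

    from collections import Counter

    def next_letter_counts(history):
        # Longest suffix of `history` that also occurs earlier in `history`,
        # with counts of the letters that followed those earlier occurrences.
        for length in range(len(history) - 1, 0, -1):
            suffix = history[-length:]
            follow = Counter()
            pos = history.find(suffix, 0, len(history) - 1)   # occurrences ending before the end
            while pos != -1:
                follow[history[pos + length]] += 1
                pos = history.find(suffix, pos + 1, len(history) - 1)
            if follow:
                return suffix, follow
        return "", Counter(history)   # no repeated suffix: fall back to plain letter frequencies

    context, counts = next_letter_counts("abracadabra")
    print(context, counts)   # abra Counter({'c': 1})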
Detecting One-variable Patterns
Given a pattern $p = s_1 x_1 s_2 x_2 \cdots s_r x_r s_{r+1}$ such that
$x_1, x_2, \ldots, x_r \in \{x, x^R\}$, where $x$ is a
variable and $x^R$ its reversal, and $s_1, s_2, \ldots, s_{r+1}$
are strings that contain no variables, we describe an
algorithm that constructs in $O(rn)$ time a compact representation of all $P$
instances of $p$ in an input string of length $n$ over a polynomially bounded
integer alphabet, so that one can report those instances in $O(P)$ time.
Comment: 16 pages (+13 pages of Appendix), 4 figures, accepted to SPIRE 201
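To make the problem statement concrete, the brute-force sketch below enumerates all instances of a one-variable pattern (constant strings alternating with the variable x or its reversal xR) by trying every non-empty substring of the text as the value of x. It is far slower than the paper's compact-representation algorithm, and all names are illustrative.

    def one_variable_instances(pattern, text):
        # `pattern` is a list of parts: constant strings, "x" (the variable) or "xR" (its reversal),
        # e.g. ["", "x", "b", "xR", ""] stands for x . b . reverse(x).
        instances = set()
        n = len(text)
        for i in range(n):
            for j in range(i + 1, n + 1):
                x = text[i:j]                      # candidate (non-empty) value for the variable
                s = "".join(x if part == "x" else x[::-1] if part == "xR" else part
                            for part in pattern)
                start = text.find(s)
                while start != -1:                 # record every position where this instance occurs
                    instances.add((start, x))
                    start = text.find(s, start + 1)
        return instances

    print(one_variable_instances(["", "x", "b", "xR", ""], "aabaa"))   # {(0, 'aa'), (1, 'a')}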