53,684 research outputs found
Which Regular Expression Patterns are Hard to Match?
Regular expressions constitute a fundamental notion in formal language theory
and are frequently used in computer science to define search patterns. A
classic algorithm for these problems constructs and simulates a
non-deterministic finite automaton corresponding to the expression, resulting
in an running time (where is the length of the pattern and is
the length of the text). This running time can be improved slightly (by a
polylogarithmic factor), but no significantly faster solutions are known. At
the same time, much faster algorithms exist for various special cases of
regular expressions, including dictionary matching, wildcard matching, subset
matching, word break problem etc.
In this paper, we show that the complexity of regular expression matching can
be characterized based on its {\em depth} (when interpreted as a formula). Our
results hold for expressions involving concatenation, OR, Kleene star and
Kleene plus. For regular expressions of depth two (involving any combination of
the above operators), we show the following dichotomy: matching and membership
testing can be solved in near-linear time, except for "concatenations of
stars", which cannot be solved in strongly sub-quadratic time assuming the
Strong Exponential Time Hypothesis (SETH). For regular expressions of depth
three the picture is more complex. Nevertheless, we show that all problems can
either be solved in strongly sub-quadratic time, or cannot be solved in
strongly sub-quadratic time assuming SETH.
An intriguing special case of membership testing involves regular expressions
of the form "a star of an OR of concatenations", e.g., . This
corresponds to the so-called {\em word break} problem, for which a dynamic
programming algorithm with a runtime of (roughly) is known. We
show that the latter bound is not tight and improve the runtime to
Pattern matching in compilers
In this thesis we develop tools for effective and flexible pattern matching.
We introduce a new pattern matching system called amethyst. Amethyst is not
only a generator of parsers of programming languages, but can also serve as an
alternative to tools for matching regular expressions.
Our framework also produces dynamic parsers. Its intended use is in the
context of IDE (accurate syntax highlighting and error detection on the fly).
Amethyst offers pattern matching of general data structures. This makes it a
useful tool for implementing compiler optimizations such as constant folding,
instruction scheduling, and dataflow analysis in general.
The parsers produced are essentially top-down parsers. Linear time complexity
is obtained by introducing the novel notion of structured grammars and
regularized regular expressions. Amethyst uses techniques known from compiler
optimizations to produce effective parsers.Comment: master thesi
MatchPy: A Pattern Matching Library
Pattern matching is a powerful tool for symbolic computations, based on the
well-defined theory of term rewriting systems. Application domains include
algebraic expressions, abstract syntax trees, and XML and JSON data.
Unfortunately, no lightweight implementation of pattern matching as general and
flexible as Mathematica exists for Python Mathics,MacroPy,patterns,PyPatt.
Therefore, we created the open source module MatchPy which offers similar
pattern matching functionality in Python using a novel algorithm which finds
matches for large pattern sets more efficiently by exploiting similarities
between patterns.Comment: arXiv admin note: substantial text overlap with arXiv:1710.0007
Efficient Pattern Matching in Python
Pattern matching is a powerful tool for symbolic computations. Applications
include term rewriting systems, as well as the manipulation of symbolic
expressions, abstract syntax trees, and XML and JSON data. It also allows for
an intuitive description of algorithms in the form of rewrite rules. We present
the open source Python module MatchPy, which offers functionality and
expressiveness similar to the pattern matching in Mathematica. In particular,
it includes syntactic pattern matching, as well as matching for commutative
and/or associative functions, sequence variables, and matching with
constraints. MatchPy uses new and improved algorithms to efficiently find
matches for large pattern sets by exploiting similarities between patterns. The
performance of MatchPy is investigated on several real-world problems
Regular Languages meet Prefix Sorting
Indexing strings via prefix (or suffix) sorting is, arguably, one of the most
successful algorithmic techniques developed in the last decades. Can indexing
be extended to languages? The main contribution of this paper is to initiate
the study of the sub-class of regular languages accepted by an automaton whose
states can be prefix-sorted. Starting from the recent notion of Wheeler graph
[Gagie et al., TCS 2017]-which extends naturally the concept of prefix sorting
to labeled graphs-we investigate the properties of Wheeler languages, that is,
regular languages admitting an accepting Wheeler finite automaton.
Interestingly, we characterize this family as the natural extension of regular
languages endowed with the co-lexicographic ordering: when sorted, the strings
belonging to a Wheeler language are partitioned into a finite number of
co-lexicographic intervals, each formed by elements from a single Myhill-Nerode
equivalence class. Moreover: (i) We show that every Wheeler NFA (WNFA) with
states admits an equivalent Wheeler DFA (WDFA) with at most
states that can be computed in time. This is in sharp contrast with
general NFAs. (ii) We describe a quadratic algorithm to prefix-sort a proper
superset of the WDFAs, a -time online algorithm to sort acyclic
WDFAs, and an optimal linear-time offline algorithm to sort general WDFAs. By
contribution (i), our algorithms can also be used to index any WNFA at the
moderate price of doubling the automaton's size. (iii) We provide a
minimization theorem that characterizes the smallest WDFA recognizing the same
language of any input WDFA. The corresponding constructive algorithm runs in
optimal linear time in the acyclic case, and in time in the
general case. (iv) We show how to compute the smallest WDFA equivalent to any
acyclic DFA in nearly-optimal time.Comment: added minimization theorems; uploaded submitted version; New version
with new results (W-MH theorem, linear determinization), added author:
Giovanna D'Agostin
On the Complexity of Exact Pattern Matching in Graphs: Binary Strings and Bounded Degree
Exact pattern matching in labeled graphs is the problem of searching paths of
a graph that spell the same string as the pattern . This
basic problem can be found at the heart of more complex operations on variation
graphs in computational biology, of query operations in graph databases, and of
analysis operations in heterogeneous networks, where the nodes of some paths
must match a sequence of labels or types. We describe a simple conditional
lower bound that, for any constant , an -time or an -time algorithm for exact pattern
matching on graphs, with node labels and patterns drawn from a binary alphabet,
cannot be achieved unless the Strong Exponential Time Hypothesis (SETH) is
false. The result holds even if restricted to undirected graphs of maximum
degree three or directed acyclic graphs of maximum sum of indegree and
outdegree three. Although a conditional lower bound of this kind can be somehow
derived from previous results (Backurs and Indyk, FOCS'16), we give a direct
reduction from SETH for dissemination purposes, as the result might interest
researchers from several areas, such as computational biology, graph database,
and graph mining, as mentioned before. Indeed, as approximate pattern matching
on graphs can be solved in time, exact and approximate matching are
thus equally hard (quadratic time) on graphs under the SETH assumption. In
comparison, the same problems restricted to strings have linear time vs
quadratic time solutions, respectively, where the latter ones have a matching
SETH lower bound on computing the edit distance of two strings (Backurs and
Indyk, STOC'15).Comment: Using Lemma 12 and Lemma 13 might to be enough to prove Lemma 14.
However, the proof of Lemma 14 is correct if you assume that the graph used
in the reduction is a DAG. Hence, since the problem is already quadratic for
a DAG and a binary alphabet, it has to be quadratic also for a general graph
and a binary alphabe
From Regular Expression Matching to Parsing
Given a regular expression and a string , the regular expression
parsing problem is to determine if matches and if so, determine how it
matches, e.g., by a mapping of the characters of to the characters in .
Regular expression parsing makes finding matches of a regular expression even
more useful by allowing us to directly extract subpatterns of the match, e.g.,
for extracting IP-addresses from internet traffic analysis or extracting
subparts of genomes from genetic data bases. We present a new general
techniques for efficiently converting a large class of algorithms that
determine if a string matches regular expression into algorithms that
can construct a corresponding mapping. As a consequence, we obtain the first
efficient linear space solutions for regular expression parsing
On-line construction of position heaps
We propose a simple linear-time on-line algorithm for constructing a position
heap for a string [Ehrenfeucht et al, 2011]. Our definition of position heap
differs slightly from the one proposed in [Ehrenfeucht et al, 2011] in that it
considers the suffixes ordered from left to right. Our construction is based on
classic suffix pointers and resembles the Ukkonen's algorithm for suffix trees
[Ukkonen, 1995]. Using suffix pointers, the position heap can be extended into
the augmented position heap that allows for a linear-time string matching
algorithm [Ehrenfeucht et al, 2011].Comment: to appear in Journal of Discrete Algorithm
- …