628 research outputs found
Joining Extractions of Regular Expressions
Regular expressions with capture variables, also known as "regex formulas,"
extract relations of spans (interval positions) from text. These relations can
be further manipulated via Relational Algebra as studied in the context of
document spanners, Fagin et al.'s formal framework for information extraction.
We investigate the complexity of querying text by Conjunctive Queries (CQs) and
Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds
(NP-completeness and W[1]-hardness) from the relational world also hold in our
setting; in particular, hardness hits already single-character text! Yet, the
upper bounds from the relational world do not carry over. Unlike the relational
world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source
of hardness is that it may be intractable to instantiate the relation defined
by a regex formula, simply because it has an exponential number of tuples. Yet,
we are able to establish general upper bounds. In particular, UCQs can be
evaluated with polynomial delay, provided that every CQ has a bounded number of
atoms (while unions and projection can be arbitrary). Furthermore, UCQ
evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the
parameter is the size of the UCQ
Deterministic Regular Expressions with Back-References
Most modern libraries for regular expression matching allow back-references (i.e. repetition operators) that substantially increase expressive power, but also lead to intractability. In order to find a better balance between expressiveness and tractability, we combine these with the notion of determinism for regular expressions used in XML DTDs and XML Schema. This includes the definition of a suitable automaton model, and a generalization of the Glushkov construction
Derivative Based Extended Regular Expression Matching Supporting Intersection, Complement and Lookarounds
Regular expressions are widely used in software. Various regular expression
engines support different combinations of extensions to classical regular
constructs such as Kleene star, concatenation, nondeterministic choice (union
in terms of match semantics). The extensions include e.g. anchors, lookarounds,
counters, backreferences. The properties of combinations of such extensions
have been subject of active recent research.
In the current paper we present a symbolic derivatives based approach to
finding matches to regular expressions that, in addition to the classical
regular constructs, also support complement, intersection and lookarounds (both
negative and positive lookaheads and lookbacks). The theory of computing
symbolic derivatives and determining nullability given an input string is
presented that shows that such a combination of extensions yields a match
semantics that corresponds to an effective Boolean algebra, which in turn opens
up possibilities of applying various Boolean logic rewrite rules to optimize
the search for matches.
In addition to the theoretical framework we present an implementation of the
combination of extensions to demonstrate the efficacy of the approach
accompanied with practical examples
A Logic for Document Spanners
Document spanners are a formal framework for information extraction that was introduced by [Fagin, Kimelfeld, Reiss, and Vansummeren, J.ACM, 2015]. One of the central models in this framework are core spanners, which are based on regular expressions with variables that are then extended with an algebra. As shown by [Freydenberger and Holldack, ICDT, 2016], there is a connection between core spanners and EC^{reg}, the existential theory of concatenation with regular constraints. The present paper further develops this connection by defining SpLog, a fragment of EC^{reg} that has the same expressive power as core spanners. This equivalence extends beyond equivalence of expressive power, as we show the existence of polynomial time conversions between this fragment and core spanners. This even holds for variants of core spanners that are based on automata instead of regular expressions. Applications of this approach include an alternative way of defining relations for spanners, insights into the relative succinctness of various classes of spanner representations, and a pumping lemma for core spanners
Extended Regular Expressions: Succinctness and Decidability
Most modern implementations of regular expression engines allow the use of variables (also called back references). The resulting extended regular expressions (which, in the literature, are also called practical regular expressions, rewbr, or regex) are able to express non-regular languages.
The present paper demonstrates that extended regular-expressions cannot be minimized effectively (neither with respect to length, nor number of variables), and that the tradeoff in size between extended and ``classical\u27\u27 regular expressions is not bounded by any recursive function. In addition to this, we prove the undecidability of several decision problems (universality, equivalence, inclusion, regularity, and cofiniteness) for extended regular expressions. Furthermore, we show that all these results hold even if the extended regular expressions contain only a single variable
A logic for document spanners
Document spanners are a formal framework for information extraction that was introduced by
Fagin, Kimelfeld, Reiss, and Vansummeren (PODS 2013, JACM 2015). One of the central models in this framework are core spanners, which are based on regular expressions with variables that are then extended with an algebra. As shown by Freydenberger and Holldack (ICDT 2016), there is a connection between core spanners and ECreg, the existential theory of concatenation with
regular constraints. The present paper further develops this connection by defining SpLog, a fragment of ECreg that has the same expressive power as core spanners. This equivalence extends beyond equivalence of expressive power, as we show the existence of polynomial time conversions between this fragment and core spanners. This even holds for variants of core spanners that are based on automata instead of regular expressions. Applications of this approach include an
alternative way of defining relations for spanners, insights into the relative succinctness of various classes of spanner representations, and a pumping lemma for core spanners
Linear Matching of JavaScript Regular Expressions
Modern regex languages have strayed far from well-understood traditional
regular expressions: they include features that fundamentally transform the
matching problem. In exchange for these features, modern regex engines at times
suffer from exponential complexity blowups, a frequent source of
denial-of-service vulnerabilities in JavaScript applications. Worse, regex
semantics differ across languages, and the impact of these divergences on
algorithmic design and worst-case matching complexity has seldom been
investigated.
This paper provides a novel perspective on JavaScript's regex semantics by
identifying a larger-than-previously-understood subset of the language that can
be matched with linear time guarantees. In the process, we discover several
cases where state-of-the-art algorithms were either wrong (semantically
incorrect), inefficient (suffering from superlinear complexity) or excessively
restrictive (assuming certain features could not be matched linearly). We
introduce novel algorithms to restore correctness and linear complexity. We
further advance the state-of-the-art in linear regex matching by presenting the
first nonbacktracking algorithms for matching lookarounds in linear time: one
supporting captureless lookbehinds in any regex language, and another
leveraging a JavaScript property to support unrestricted lookaheads and
lookbehinds. Finally, we describe new time and space complexity tradeoffs for
regex engines. All of our algorithms are practical: we validated them in a
prototype implementation, and some have also been merged in the V8 JavaScript
implementation used in Chrome and Node.js
Inside the class of REGEX Languages
We study different possibilities of combining the concept of homomorphic replacement with regular expressions in order to investigate the class of languages given by extended regular expressions with backreferences (REGEX). It is shown in which regard existing and natural ways to do this fail to reach the expressive power of REGEX. Furthermore, the complexity of the membership problem for REGEX with a bounded number of backreferences is considered
- …