Joining Extractions of Regular Expressions
Regular expressions with capture variables, also known as "regex formulas,"
extract relations of spans (interval positions) from text. These relations can
be further manipulated via Relational Algebra as studied in the context of
document spanners, Fagin et al.'s formal framework for information extraction.
We investigate the complexity of querying text by Conjunctive Queries (CQs) and
Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds
(NP-completeness and W[1]-hardness) from the relational world also hold in our
setting; in particular, hardness already holds for single-character text! Yet the
upper bounds from the relational world do not carry over. Unlike in the relational
world, acyclic CQs, and even gamma-acyclic CQs, are hard to evaluate. The source
of hardness is that it may be intractable to instantiate the relation defined
by a regex formula, simply because it has an exponential number of tuples. Yet,
we are able to establish general upper bounds. In particular, UCQs can be
evaluated with polynomial delay, provided that every CQ has a bounded number of
atoms (while unions and projection can be arbitrary). Furthermore, UCQ
evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the
parameter is the size of the UCQ.
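To make the notion of a span relation concrete, here is a minimal Python sketch. It is not the evaluation method of the paper: it materializes relations by brute force, which is the strategy the abstract identifies as a source of intractability. The helper names `unary_spanner` and `natural_join` are hypothetical, spans are written as 0-based half-open pairs (i, j) rather than in the notation of the spanner literature, and only simple patterns without anchors or lookarounds are assumed.

```python
import re
from itertools import product


def unary_spanner(doc, pattern, var):
    """Brute-force evaluation of a one-variable regex formula on doc.

    Conceptually this is the formula  Sigma* var{pattern} Sigma*:
    it returns one mapping {var: (i, j)} for every span doc[i:j]
    that matches `pattern` in full.
    """
    compiled = re.compile(pattern)
    n = len(doc)
    return [
        {var: (i, j)}
        for i in range(n + 1)
        for j in range(i, n + 1)
        if compiled.fullmatch(doc, i, j)
    ]


def natural_join(left, right):
    """Join two span relations on their shared variables (nested loops)."""
    joined = []
    for a, b in product(left, right):
        if all(a[v] == b[v] for v in a.keys() & b.keys()):
            joined.append({**a, **b})
    return joined


doc = "aab"
runs_of_a = unary_spanner(doc, "a+", "x")   # spans holding a run of 'a'
single_b = unary_spanner(doc, "b", "y")     # spans holding one 'b'
pairs = natural_join(runs_of_a, single_b)   # no shared variable: a product
```

Even in this toy, a single variable can already produce quadratically many spans, and each additional variable multiplies the size of the relation. This is the blow-up that motivates enumeration with bounded delay instead of full materialization.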
Constant-Delay Enumeration for Nondeterministic Document Spanners
We consider the information extraction framework known as document spanners,
and study the problem of efficiently computing the results of the extraction
from an input document, where the extraction task is described as a sequential
variable-set automaton (VA). We pose this problem in the setting of enumeration
algorithms, where we can first run a preprocessing phase and must then produce
the results with a small delay between any two consecutive results. Our goal is
to have an algorithm which is tractable in combined complexity, i.e., in the
sizes of the input document and the VA; while ensuring the best possible data
complexity bounds in the input document size, i.e., constant delay in the
document size. Several recent works at PODS'18 proposed such algorithms but
with linear delay in the document size or with an exponential dependency on
the size of the (generally nondeterministic) input VA. In particular, Florenzano et
al. suggest that our desired runtime guarantees cannot be met for general
sequential VAs. We refute this and show that, given a nondeterministic
sequential VA and an input document, we can enumerate the mappings of the VA on
the document with the following bounds: the preprocessing is linear in the
document size and polynomial in the size of the VA, and the delay is
independent of the document and polynomial in the size of the VA. The resulting
algorithm thus achieves tractability in combined complexity and the best
possible data complexity bounds. Moreover, it is rather easy to describe, in
particular for the restricted case of so-called extended VAs. Finally, we
evaluate our algorithm empirically using a prototype implementation.
Comment: 29 pages. Extended version of arXiv:1807.09320. Integrates all
corrections following reviewer feedback. Outside of some minor formatting
differences and tweaks, this paper is the same as the paper to appear in the
ACM TODS journal.
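The two-phase contract in this abstract (preprocessing linear in the document, then a delay that does not depend on the document) can be illustrated with a toy in Python. This is only a sketch for one fixed, hard-coded extraction: all spans that start with 'a' and end with 'b'. It is not the algorithm of the paper, which handles arbitrary nondeterministic sequential VAs. The function names `preprocess` and `enumerate_spans` are hypothetical, and spans are 0-based half-open pairs.

```python
def preprocess(doc):
    """One linear pass over doc.

    Returns the positions of 'a' and, for every index k, the smallest
    index >= k that holds 'b' (or None if there is none).
    """
    n = len(doc)
    next_b = [None] * (n + 1)
    for k in range(n - 1, -1, -1):
        next_b[k] = k if doc[k] == "b" else next_b[k + 1]
    a_positions = [k for k, ch in enumerate(doc) if ch == "a"]
    return a_positions, next_b


def enumerate_spans(index):
    """Yield every span (i, j) with doc[i] == 'a' and doc[j - 1] == 'b'.

    Between two consecutive outputs only a constant number of pointer
    jumps happen, regardless of the document length.
    """
    a_positions, next_b = index
    for i in a_positions:
        k = next_b[i + 1]
        if k is None:
            return  # no 'b' after this 'a', hence none after any later 'a'
        while k is not None:
            yield (i, k + 1)
            k = next_b[k + 1]


index = preprocess("abab")
results = list(enumerate_spans(index))
```

The output can be quadratic in the document length while the index stays linear, which is why the delay between results, rather than total running time, is the measure of interest here.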
Grammars for Document Spanners
We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones.
Splitting Spanner Atoms: A Tool for Acyclic Core Spanners
This paper investigates regex CQs with string equalities (SERCQs), a subclass of core spanners. As shown by Freydenberger, Kimelfeld, and Peterfreund (PODS 2018), these queries are intractable, even if restricted to acyclic queries. This previous result defines acyclicity by treating regex formulas as atoms. In contrast, we propose an alternative definition by converting SERCQs into FC-CQs, that is, conjunctive queries in FC, a logic based on word equations. We introduce a way to decompose word equations of unbounded arity into a conjunction of binary word equations. If the result of the decomposition is acyclic, then evaluation and enumeration of results become tractable. The main result of this work is an algorithm that decides in polynomial time whether an FC-CQ can be decomposed into an acyclic FC-CQ. We also give an efficient conversion from synchronized SERCQs to FC-CQs with regular constraints. As a consequence, tractability results for acyclic relational CQs directly translate to a large class of SERCQs.
Ranked Enumeration of MSO Logic on Words
In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are processed on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user.
In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows expressing MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and adds ranking to them. The main technical result of the paper is an algorithm for enumerating all the solutions of formulae in increasing order of cost efficiently, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm lies in its use of functional data structures, in particular, an extension of functional Brodal queues suited to the ranked enumeration of MSO on words.
Complexity bounds for relational algebra over document spanners
We investigate the complexity of evaluating queries in Relational Algebra
(RA) over the relations extracted by regex formulas (i.e., regular expressions
with capture variables) over text documents. Such queries, also known as the
regular document spanners, were shown to have an evaluation with polynomial
delay for every positive RA expression (i.e., consisting of only natural joins,
projections and unions); here, the RA expression is fixed and the input
consists of both the regex formulas and the document. In this work, we explore
the implication of two fundamental generalizations. The first is adopting the
"schemaless" semantics for spanners, as proposed and studied by Maturana et al.
The second is going beyond the positive RA to allowing the difference operator.
We show that each of the two generalizations introduces computational hardness:
it is intractable to compute the natural join of two regex formulas under the
schemaless semantics, and the difference between two regex formulas under both
the ordinary and schemaless semantics. Nevertheless, we propose and analyze
syntactic constraints, on the RA expression and the regex formulas at hand,
such that the expressive power is fully preserved and, yet, evaluation can be
done with polynomial delay. Unlike the previous work on RA over regex formulas,
our technique is not (and provably cannot be) based on the static compilation
of regex formulas, but rather on an ad-hoc compilation into an automaton that
incorporates both the query and the document. This approach also allows us to
include black-box extractors in the RA expression.
Conjunctive Queries for Logic-Based Information Extraction
This thesis offers two logic-based approaches to conjunctive queries in the
context of information extraction. The first and main approach is the
introduction of conjunctive query fragments of the logics FC and FC[REG],
denoted as FC-CQ and FC[REG]-CQ respectively. FC is a first-order logic based
on word equations, where the semantics are defined by limiting the universe to
the factors of some finite input word. FC[REG] is FC extended with regular
constraints. The second approach is to consider the dynamic complexity of FC.
Comment: Based on the author's PhD thesis and contains work from two
conference publications (arXiv:2104.04758, arXiv:1909.10869) which are joint
work with Dominik D. Freydenberger.
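The idea that FC restricts its universe to the factors of the input word can be illustrated with a small brute-force Python sketch. It is only an assumption-laden toy, not a construction from the thesis: the helper names `factors` and `squares` are hypothetical, and the single word equation evaluated here (x equals y followed by y) is chosen purely for illustration.

```python
def factors(word):
    """All factors (contiguous substrings) of word, including the empty word."""
    n = len(word)
    return {word[i:j] for i in range(n + 1) for j in range(i, n + 1)}


def squares(word):
    """Factors x satisfying the word equation x = y y for some factor y.

    Both x and y range over the factors of `word` only, mirroring the
    restricted universe described in the abstract.
    """
    universe = factors(word)
    return sorted(x for x in universe for y in universe if x == y + y)


result = squares("banana")
```

Because every variable ranges over a finite set of at most quadratically many factors, a fixed equation like this one can be checked by exhaustive search; the conjunctive query fragments mentioned in the abstract concern doing this efficiently when several such equations are combined.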