Constant-Delay Enumeration for Nondeterministic Document Spanners
We consider the information extraction framework known as document spanners,
and study the problem of efficiently computing the results of the extraction
from an input document, where the extraction task is described as a sequential
variable-set automaton (VA). We pose this problem in the setting of enumeration
algorithms, where we can first run a preprocessing phase and must then produce
the results with a small delay between any two consecutive results. Our goal is
to have an algorithm which is tractable in combined complexity, i.e., in the
sizes of the input document and the VA; while ensuring the best possible data
complexity bounds in the input document size, i.e., constant delay in the
document size. Several recent works at PODS'18 proposed such algorithms but
with linear delay in the document size or with an exponential dependency on the
size of the (generally nondeterministic) input VA. In particular, Florenzano et
al. suggest that our desired runtime guarantees cannot be met for general
sequential VAs. We refute this and show that, given a nondeterministic
sequential VA and an input document, we can enumerate the mappings of the VA on
the document with the following bounds: the preprocessing is linear in the
document size and polynomial in the size of the VA, and the delay is
independent of the document and polynomial in the size of the VA. The resulting
algorithm thus achieves tractability in combined complexity and the best
possible data complexity bounds. Moreover, it is rather easy to describe, in
particular for the restricted case of so-called extended VAs. Finally, we
evaluate our algorithm empirically using a prototype implementation.
Comment: 29 pages. Extended version of arXiv:1807.09320. Integrates all
corrections following reviewer feedback. Outside of some minor formatting
differences and tweaks, this paper is the same as the paper to appear in the
ACM TODS journal
Grammars for Document Spanners
We propose a new grammar-based language for defining information-extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones
Complexity bounds for relational algebra over document spanners
We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex
formulas (i.e., regular expressions with capture variables)
over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation
with polynomial delay for every positive RA expression (i.e.,
consisting of only natural joins, projections and unions);
here, the RA expression is fixed and the input consists of
both the regex formulas and the document. In this work, we
explore the implication of two fundamental generalizations.
The first is adopting the "schemaless" semantics for spanners, as proposed and studied by Maturana et al. The second
is going beyond the positive RA to allowing the difference
operator.
We show that each of the two generalizations introduces
computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under
both the ordinary and schemaless semantics. Nevertheless,
we propose and analyze syntactic constraints, on the RA
expression and the regex formulas at hand, such that the
expressive power is fully preserved and, yet, evaluation can
be done with polynomial delay. Unlike the previous work on
RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton
that incorporates both the query and the document. This
approach also allows us to include black-box extractors in
the RA expression
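To make the schemaless natural join mentioned in this abstract concrete, here is a minimal Python sketch. It assumes a span is a `(start, end)` pair and a mapping is a dictionary from variable names to spans; the function names and the sample relations are hypothetical and are not taken from the paper. Under the schemaless semantics, mappings in one relation may be defined on different variables, and two mappings join when they agree on every variable they share. The sketch is a naive quadratic join for illustration only, not the polynomial-delay technique the abstract describes.

```python
from itertools import product

# Assumption: a span is a (start, end) pair of document positions, and a
# mapping is a dict from variable names to spans. Under the schemaless
# semantics, mappings in the same relation may have different domains.

def compatible(m1, m2):
    """Return True if the two mappings agree on every variable they share."""
    return all(m2[v] == s for v, s in m1.items() if v in m2)

def natural_join(r1, r2):
    """Naive join: merge every compatible pair of mappings, without repeats."""
    out = []
    for m1, m2 in product(r1, r2):
        if compatible(m1, m2):
            merged = {**m1, **m2}
            if merged not in out:
                out.append(merged)
    return out

# Hypothetical relations, as if extracted from the document "ab cd".
r1 = [{"x": (0, 2)}, {"x": (3, 5), "y": (0, 2)}]
r2 = [{"y": (0, 2)}, {"x": (0, 2), "y": (3, 5)}]
joined = natural_join(r1, r2)
```

Here the last pair of mappings disagrees on `x`, so it contributes nothing to the join, while a mapping that leaves `y` undefined joins with any assignment of `y`.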
A Purely Regular Approach to Non-Regular Core Spanners
The regular spanners (characterised by vset-automata) are closed under the
algebraic operations of union, join and projection, and have desirable
algorithmic properties. The core spanners (introduced by Fagin, Kimelfeld,
Reiss, and Vansummeren (PODS 2013, JACM 2015) as a formalisation of the core
functionality of the query language AQL used in IBM's SystemT) additionally
need string equality selections and it has been shown by Freydenberger and
Holldack (ICDT 2016, Theory of Computing Systems 2018) that this leads to high
complexity and even undecidability of the typical problems in static analysis
and query evaluation. We propose an alternative approach to core spanners: by
incorporating the string-equality selections directly into the regular language
that represents the underlying regular spanner (instead of treating it as an
algebraic operation on the table extracted by the regular spanner), we obtain a
fragment of core spanners that, while having slightly weaker expressive power
than the full class of core spanners, arguably still covers the intuitive
applications of string equality selections for information extraction and has
much better upper complexity bounds for the typical problems in static analysis
and query evaluation
Ranked Enumeration of MSO Logic on Words
In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are treated on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user.
In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows expressing MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and adds ranking to them. The main technical result of the paper is an algorithm for enumerating all the solutions of formulae in increasing order of cost efficiently, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm is based on using functional data structures, in particular, by extending functional Brodal queues to suit the ranked enumeration of MSO on words
Splitting Spanner Atoms: A Tool for Acyclic Core Spanners
This paper investigates regex CQs with string equalities (SERCQs), a subclass of core spanners. As shown by Freydenberger, Kimelfeld, and Peterfreund (PODS 2018), these queries are intractable, even if restricted to acyclic queries. This previous result defines acyclicity by treating regex formulas as atoms. In contrast to this, we propose an alternative definition by converting SERCQs into FC-CQs - conjunctive queries in FC, a logic that is based on word equations. We introduce a way to decompose word equations of unbounded arity into a conjunction of binary word equations. If the result of the decomposition is acyclic, then evaluation and enumeration of results become tractable. The main result of this work is an algorithm that decides in polynomial time whether an FC-CQ can be decomposed into an acyclic FC-CQ. We also give an efficient conversion from synchronized SERCQs to FC-CQs with regular constraints. As a consequence, tractability results for acyclic relational CQs directly translate to a large class of SERCQs
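The idea of splitting a word equation of unbounded arity into binary word equations can be sketched as follows. This is only an illustration under stated assumptions: it uses one fixed, right-nested bracketing with fresh variables named `z1`, `z2`, and so on (hypothetical names, assumed not to clash with the query's own variables), whereas the paper's algorithm decides whether some decomposition yields an acyclic query. The function name `decompose` is likewise an assumption.

```python
def decompose(lhs, rhs):
    """Split the equation lhs = rhs[0] rhs[1] ... rhs[n-1] into binary ones.

    Each returned item (left, (a, b)) stands for the equation left = a . b.
    Fresh variables z1, z2, ... are hypothetical helper names.
    """
    if len(rhs) <= 2:
        return [(lhs, tuple(rhs))]
    equations = []
    current = lhs
    for i in range(len(rhs) - 2):
        z = f"z{i + 1}"                      # fresh variable for the remaining suffix
        equations.append((current, (rhs[i], z)))
        current = z
    equations.append((current, (rhs[-2], rhs[-1])))
    return equations

# x = y1 y2 y3 y4 becomes x = y1 z1, z1 = y2 z2, z2 = y3 y4.
binary = decompose("x", ["y1", "y2", "y3", "y4"])
```

Whether the resulting conjunction is acyclic depends on how the variables are shared across all atoms of the query, which this sketch does not check.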