Recursive Programs for Document Spanners
A document spanner models a program for Information Extraction (IE) as a function that takes as input a text document (a string over a finite alphabet) and produces a relation of spans (intervals in the document) over a predefined schema. A well-studied language for expressing spanners is that of the regular spanners: relational algebra over regex formulas, which are regular expressions with capture variables. Equivalently, the regular spanners are the ones expressible in non-recursive Datalog over regex formulas (which extract relations that constitute the extensional database). This paper explores the expressive power of recursive Datalog over regex formulas. We show that such programs can express precisely the document spanners computable in polynomial time. We compare this expressiveness to known formalisms such as the closure of regex formulas under the relational algebra and string equality. Finally, we extend our study to a recently proposed framework that generalizes both the relational model and the document spanners.
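As a rough illustration of the notions in this abstract, the sketch below treats a one-variable regex formula as a function from a document to a set of spans, and then applies a string-equality selection to the result. The helper names `spans_matching` and `string_equal_pairs` are hypothetical, spans are written as 0-based half-open Python offsets (an assumption of this sketch, not the paper's notation), and the brute-force enumeration is only meant to show the semantics, not an efficient evaluation.

```python
import re

def spans_matching(doc, pattern):
    """Return every span (i, j) such that doc[i:j] fully matches `pattern`.

    Spans are 0-based, half-open offsets (an assumption of this sketch).
    """
    rx = re.compile(pattern)
    n = len(doc)
    return {(i, j)
            for i in range(n + 1)
            for j in range(i, n + 1)
            if rx.fullmatch(doc[i:j])}

def string_equal_pairs(doc, spans):
    """Pairs of distinct spans that cover equal substrings of doc."""
    return {(s, t)
            for s in spans
            for t in spans
            if s != t and doc[s[0]:s[1]] == doc[t[0]:t[1]]}

doc = "ab a"
words = spans_matching(doc, r"[a-z]+")    # unary span relation
equal = string_equal_pairs(doc, words)    # binary span relation
```

Here the two occurrences of "a" sit at different positions, so they are different spans that the string-equality selection relates to each other.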
Complexity bounds for relational algebra over document spanners
We investigate the complexity of evaluating queries in Relational Algebra (RA) over the relations extracted by regex formulas (i.e., regular expressions with capture variables) over text documents. Such queries, also known as the regular document spanners, were shown to have an evaluation with polynomial delay for every positive RA expression (i.e., consisting of only natural joins, projections and unions); here, the RA expression is fixed and the input consists of both the regex formulas and the document. In this work, we explore the implication of two fundamental generalizations. The first is adopting the “schemaless” semantics for spanners, as proposed and studied by Maturana et al. The second is going beyond the positive RA to allowing the difference operator.
We show that each of the two generalizations introduces computational hardness: it is intractable to compute the natural join of two regex formulas under the schemaless semantics, and the difference between two regex formulas under both the ordinary and schemaless semantics. Nevertheless, we propose and analyze syntactic constraints, on the RA expression and the regex formulas at hand, such that the expressive power is fully preserved and, yet, evaluation can be done with polynomial delay. Unlike the previous work on RA over regex formulas, our technique is not (and provably cannot be) based on the static compilation of regex formulas, but rather on an ad-hoc compilation into an automaton that incorporates both the query and the document. This approach also allows us to include black-box extractors in the RA expression.
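To make the operators named in this abstract concrete, here is a minimal sketch of natural join, projection, and difference over span relations that have already been materialized. It is a naive evaluation for illustration only and is not the polynomial-delay technique the abstract describes. The representation (each tuple as a frozenset of variable/span pairs) and the names `mapping`, `natural_join`, `project`, and `difference` are assumptions of this sketch; the difference shown is plain set difference, and union would simply be `r | s`.

```python
def mapping(**assignments):
    """Build one tuple: an immutable assignment of variables to spans."""
    return frozenset(assignments.items())

def natural_join(r, s):
    """Combine every pair of mappings that agree on their shared variables."""
    out = set()
    for m1 in r:
        for m2 in s:
            d1, d2 = dict(m1), dict(m2)
            if all(d1[v] == d2[v] for v in d1.keys() & d2.keys()):
                out.add(frozenset({**d1, **d2}.items()))
    return out

def project(r, variables):
    """Keep only the listed variables in each mapping."""
    return {frozenset((v, span) for v, span in m if v in variables) for m in r}

def difference(r, s):
    """Mappings of r that do not occur in s (plain set difference)."""
    return r - s

r = {mapping(x=(0, 1), y=(1, 2)), mapping(x=(0, 1), y=(2, 3))}
s = {mapping(y=(1, 2), z=(3, 4))}
joined = natural_join(r, s)
```

Because the join only compares variables that both mappings assign, the same function also accepts mappings with different sets of variables, which is the situation the schemaless semantics allows.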