1 research outputs found
Skyline Operators for Document Spanners
When extracting a relation of spans (intervals) from a text document, a
common practice is to filter out tuples of the relation that are deemed
dominated by others. The domination rule is defined as a partial order that
varies along different systems and tasks. For example, we may state that a
tuple is dominated by tuples which extend it by assigning additional
attributes, or assigning larger intervals. The result of filtering the relation
would then be the skyline according to this partial order. As this filtering
may remove most of the extracted tuples, we study whether we can improve the
performance of the extraction by compiling the domination rule into the
extractor.
To this aim, we introduce the skyline operator for declarative information
extraction tasks expressed as document spanners. We show that this operator can
be expressed via regular operations when the domination partial order can
itself be expressed as a regular spanner, which covers several natural
domination rules. Yet, we show that the skyline operator incurs a computational
cost (under combined complexity). First, there are cases where the operator
requires an exponential blowup on the number of states needed to represent the
spanner as a sequential variable-set automaton. Second, the evaluation may
become computationally hard. Our analysis more precisely identifies classes of
domination rules for which the combined complexity is tractable or intractable.Comment: 42 pages. Submitte