Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task.
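The verification idea described above (learn structure from positive examples, then flag extractions that no longer fit) can be illustrated with a minimal sketch. This is not the algorithm from the paper: the token classes, the `signature` function, and the `min_match` threshold are all assumptions chosen for illustration, and the phone numbers are made-up sample data.

```python
import re
from collections import Counter


def token_type(token: str) -> str:
    """Map a token to a coarse syntactic class (hypothetical classes)."""
    if token.isdigit():
        return "NUM"
    if token.isalpha():
        return "CAP" if token[0].isupper() else "LOWER"
    if token.isalnum():
        return "ALNUM"
    return "PUNCT"


def signature(value: str) -> tuple:
    """Return the sequence of token classes for one extracted value."""
    tokens = re.findall(r"[A-Za-z0-9]+|[^\sA-Za-z0-9]", value)
    return tuple(token_type(t) for t in tokens)


def learn_signatures(positive_examples):
    """Collect the signatures seen in known-good extractions only."""
    return Counter(signature(v) for v in positive_examples)


def looks_broken(learned, extracted, min_match=0.8):
    """Flag the wrapper when too few new values match a learned signature."""
    if not extracted:
        return True
    matches = sum(1 for v in extracted if signature(v) in learned)
    return matches / len(extracted) < min_match


# Made-up sample data: values a phone-number wrapper extracted while it worked.
learned = learn_signatures(["(310) 555-1234", "(213) 555-9876"])
print(looks_broken(learned, ["(626) 555-4321"]))                # same structure
print(looks_broken(learned, ["<td>Phone</td>", "Click here"]))  # format changed
```

Only positive examples are needed because the check compares new extractions against previously observed structure; a real system would likely need more tolerant, statistical matching than exact signature equality.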
An Integration-Oriented Ontology to Govern Evolution in Big Data Ecosystems
Big Data architectures allow heterogeneous data from multiple sources to be
stored and processed flexibly in their original format. The structure of those
data, commonly supplied by means of REST APIs, is continuously evolving. Thus,
data analysts need to adapt their analytical processes after each API release.
This gets more challenging when performing an integrated or historical
analysis. To cope with such complexity, in this paper, we present the Big Data
Integration ontology, the core construct to govern the data integration process
under schema evolution by systematically annotating it with information
regarding the schema of the sources. We present a query rewriting algorithm
that, using the annotated ontology, converts queries posed over the ontology to
queries over the sources. To cope with syntactic evolution in the sources, we
present an algorithm that semi-automatically adapts the ontology upon new
releases. This guarantees that ontology-mediated queries correctly retrieve
data from the most recent schema version and that historical queries remain
correct. A functional and performance evaluation on real-world APIs is
performed to validate our approach.
Comment: Preprint submitted to Information Systems. 35 page
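To make the rewriting and adaptation steps more concrete, here is a minimal sketch that replaces the annotated ontology with a flat dictionary per API release. It is a simplified stand-in, not the paper's algorithm: the feature names (`userName`, `playCount`), the version labels, and the functions `rewrite`, `rewrite_all_versions`, and `add_release` are all hypothetical.

```python
from typing import Dict, List

# Hypothetical structure: release label -> {ontology feature -> source attribute}.
Mappings = Dict[str, Dict[str, str]]


def rewrite(mappings: Mappings, features: List[str], version: str) -> List[str]:
    """Rewrite a projection over ontology features into source attributes."""
    mapping = mappings[version]
    missing = [f for f in features if f not in mapping]
    if missing:
        raise KeyError(f"features not available in {version}: {missing}")
    return [mapping[f] for f in features]


def rewrite_all_versions(mappings: Mappings, features: List[str]) -> Dict[str, List[str]]:
    """Produce one rewritten projection per release, as a historical query would need."""
    return {version: rewrite(mappings, features, version) for version in mappings}


def add_release(mappings: Mappings, previous: str, new_version: str,
                renames: Dict[str, str]) -> Mappings:
    """Derive a new release's mapping from a previous one plus attribute renames.

    `renames` maps old source attribute names to new ones; attributes that are
    not listed are assumed unchanged.
    """
    updated = {feature: renames.get(attr, attr)
               for feature, attr in mappings[previous].items()}
    return {**mappings, new_version: updated}


# Made-up example: release v2 renames the source attribute "plays".
mappings = {"v1": {"userName": "user", "playCount": "plays"}}
mappings = add_release(mappings, "v1", "v2", {"plays": "play_count"})
print(rewrite(mappings, ["userName", "playCount"], "v2"))
print(rewrite_all_versions(mappings, ["userName", "playCount"]))
```

Keeping one mapping per release, rather than overwriting the old one, is what lets the same ontology-level query be answered both against the latest schema and against earlier versions.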