Mapping-equivalence and oid-equivalence of single-function object-creating conjunctive queries
Conjunctive database queries have been extended with a mechanism for object
creation to capture important applications such as data exchange, data
integration, and ontology-based data access. Object creation generates new
object identifiers in the result that do not belong to the set of constants in
the source database. The new object identifiers can also be seen as Skolem
terms. Hence, object-creating conjunctive queries can also be regarded as
restricted second-order tuple-generating dependencies (SO tgds), considered in
the data exchange literature.
In this paper, we focus on the class of single-function object-creating
conjunctive queries, or sifo CQs for short. We give a new characterization for
oid-equivalence of sifo CQs that is simpler than the one given by Hull and
Yoshikawa and places the problem in the complexity class NP. Our
characterization is based on Cohen's equivalence notions for conjunctive
queries with multiplicities. We also solve the logical entailment problem for
sifo CQs, showing that this problem also belongs to NP. Results by Pichler et
al. have shown that logical equivalence for more general classes of SO tgds is
either undecidable or decidable with as yet unknown complexity upper bounds.
Comment: This revised version has been accepted on 11 January 2016 for
publication in The VLDB Journal.
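To make the notions concrete, here is a minimal illustrative example (our own, not taken from the paper) of a sifo CQ and its reading as a restricted SO tgd:

```latex
% A sifo CQ uses a single Skolem function f for object creation.
% Illustrative example: invent one new object per department value d,
% grouping the employees e of that department under it.
Q(f(d), e) \leftarrow \mathit{Emp}(e, d)
% The same query read as a (restricted) second-order tgd:
\exists f\, \forall e\, \forall d\, \bigl(\mathit{Emp}(e, d) \rightarrow Q(f(d), e)\bigr)
```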
Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance
Successful data-driven science requires complex data engineering pipelines to
clean, transform, and alter data in preparation for machine learning, and
robust results can only be achieved when each step in the pipeline can be
justified, and its effect on the data explained. In this framework, our aim is
to provide data scientists with facilities to gain an in-depth understanding of
how each step in the pipeline affects the data, from the raw input to training
sets ready to be used for learning. Starting from an extensible set of data
preparation operators commonly used within a data science setting, in this work
we present a provenance management infrastructure for generating, storing, and
querying very granular accounts of data transformations, at the level of
individual elements within datasets whenever possible. Then, from the formal
definition of a core set of data science preprocessing operators, we derive a
provenance semantics embodied by a collection of templates expressed in PROV, a
standard model for data provenance. Using those templates as a reference, our
provenance generation algorithm generalises to any operator with observable
input/output pairs. We provide a prototype implementation of an
application-level provenance capture library to produce, in a semi-automatic
way, complete provenance documents that account for the entire pipeline. We
report on the ability of our implementations to capture provenance in real ML
benchmark pipelines and over TPC-DI synthetic data. We finally show how the
collected provenance can be used to answer a suite of provenance benchmark
queries that underpin some common pipeline inspection questions, as expressed
on the Data Science Stack Exchange.
Comment: 37 pages, 27 figures, submitted to a journal.
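To give a flavour of what a PROV-encoded provenance record for a single preprocessing step can look like, here is a minimal sketch using the prov Python package; the operator, namespace, and identifiers are invented for illustration and are not the paper's actual templates.

```python
# Minimal sketch: element-level PROV for one (hypothetical) imputation step.
# Names and namespace are illustrative, not the paper's actual templates.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace('ex', 'http://example.org/pipeline/')

# The preprocessing operator is modelled as an activity.
impute = doc.activity('ex:impute_mean_age')

# Dataset-level entities: the input and output dataframes.
df_in = doc.entity('ex:df_v1')
df_out = doc.entity('ex:df_v2')
doc.used(impute, df_in)
doc.wasGeneratedBy(df_out, impute)
doc.wasDerivedFrom(df_out, df_in, activity=impute)

# Element-level entities: the individual cell that was changed.
cell_in = doc.entity('ex:df_v1/row42/age')   # value was missing
cell_out = doc.entity('ex:df_v2/row42/age')  # imputed with the column mean
doc.wasDerivedFrom(cell_out, cell_in, activity=impute)

print(doc.get_provn())  # serialise in PROV-N notation
```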
Fine-grained provenance for high-quality data science
In this work we analyze the typical operations of data preparation within a machine learning process, and provide infrastructure for generating very granular provenance records from it, at the level of individual elements within a dataset. Our contributions include: (i) the formal definition of a core set of preprocessing operators, (ii) the definition of provenance patterns for each of them, and (iii) a prototype implementation of an application-level provenance capture library that works alongside Python.
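As a rough sketch of how application-level capture over observable input/output pairs can work (an assumption-laden illustration, not the paper's library), one can diff the cells of a pandas dataframe before and after an operator runs:

```python
# Illustrative sketch (not the paper's library): capture element-level
# provenance for any pandas operator with observable input/output pairs,
# by diffing cells that share an index/column position.
import pandas as pd

def capture(op, df: pd.DataFrame):
    """Run operator `op`, return its output plus per-cell derivation records."""
    out = op(df.copy())
    records = []
    for r in df.index.intersection(out.index):
        for c in df.columns.intersection(out.columns):
            before, after = df.at[r, c], out.at[r, c]
            if pd.isna(before) and pd.isna(after):
                continue
            kind = 'unchanged' if before == after else 'transformed'
            records.append({'row': r, 'col': c, 'kind': kind})
    return out, records

# Usage: impute missing ages with the column mean.
df = pd.DataFrame({'age': [30.0, None, 50.0]})
out, prov = capture(lambda d: d.fillna(d.mean(numeric_only=True)), df)
```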
Processing keyword queries under access limitations
The Deep Web consists of data that are accessible through Web pages but not readily indexable by search engines, as they are returned in dynamic pages. In this paper we propose a framework for accessing Deep Web sources, represented as relational tables with so-called access limitations, with keyword-based queries. We formalize the notion of optimal answer and propose methods for query processing. To the best of our knowledge, ours is the first systematic approach to keyword search in such a context.
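The following sketch illustrates the general idea of querying under access limitations (the representation and fixpoint loop are illustrative assumptions, not the paper's algorithm): each source can only be invoked with its input attributes bound, so keyword constants seed a chain of accesses that is iterated until no new values appear:

```python
# Illustrative sketch: Deep Web sources as relations callable only with
# their input attributes bound (a "binding pattern").
from itertools import product

def make_source(table, in_attrs, out_attrs):
    def access(bound):  # bound: dict input_attr -> value
        return [t for t in table
                if all(t[a] == bound[a] for a in in_attrs)]
    return in_attrs, out_attrs, access

authors = [{'name': 'Smith', 'paper': 'p1'}, {'name': 'Lee', 'paper': 'p2'}]
papers  = [{'paper': 'p1', 'title': 'Deep Web Search'}]

sources = [make_source(authors, ['name'], ['paper']),
           make_source(papers,  ['paper'], ['title'])]

def answer(keywords):
    known = set(keywords)          # values extracted so far
    results = []
    changed = True
    while changed:                 # fixpoint over reachable accesses
        changed = False
        for in_attrs, out_attrs, access in sources:
            for combo in product(known, repeat=len(in_attrs)):
                for t in access(dict(zip(in_attrs, combo))):
                    for v in t.values():
                        if v not in known:
                            known.add(v); changed = True
                    if t not in results:
                        results.append(t)
    return results

print(answer({'Smith'}))  # reaches p1, then its title via the second source
```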
How I Learned to Stop Worrying and Love NoSQL Databases
The absence of a schema in NoSQL databases can disorient traditional database specialists and can make the design activity in this context a leap of faith. However, we show in this paper that an effective design methodology for NoSQL systems supporting the scalability, performance, and consistency of next-generation Web applications can indeed be devised. The approach is based on NoAM (NoSQL Abstract Model), a novel abstract data model for NoSQL databases, which is used to specify a system-independent representation of the application data. This intermediate representation can then be implemented in target NoSQL databases, taking into account their specific features.
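A minimal sketch of the underlying idea (the partitioning strategy shown is just one illustrative choice, not the full NoAM definition): an application aggregate is split into entries whose keys map directly onto a key-value store:

```python
# Illustrative sketch: one application aggregate, partitioned into entries
# addressed by (collection, aggregate id, entry key), which map directly
# onto the keys of a key-value store.
player = {               # one application aggregate
    'username': 'mary',
    'score': 42,
    'games': ['g1', 'g2'],
}

def to_entries(collection, agg_id, aggregate):
    """Partition an aggregate into key-value entries (one per top-level field)."""
    return {f'{collection}/{agg_id}/{field}': value
            for field, value in aggregate.items()}

kv_store = to_entries('Player', 'mary', player)
# {'Player/mary/username': 'mary', 'Player/mary/score': 42,
#  'Player/mary/games': ['g1', 'g2']}
```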
Two Approaches to the Integration of Heterogeneous Data Warehouses
In this paper we address the problem of integrating independent and possibly heterogeneous data warehouses, a problem that has received little attention so far but arises very often in practice. We start by tackling the basic issue of matching heterogeneous dimensions and provide a number of general properties that a dimension matching should fulfill. We then propose two different approaches to the problem of integration that try to enforce matchings satisfying these properties. The first approach refers to a scenario of loosely coupled integration, in which we just need to identify the common information between data sources and perform join operations over the original sources. The goal of the second approach is the derivation of a materialized view built by merging the sources, and refers to a scenario of tightly coupled integration in which queries are performed against the view. We also illustrate the architecture and functionality of a practical system that we have developed to demonstrate the effectiveness of our integration strategies.
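As a toy illustration of the loosely coupled scenario (schemas and data invented for the example): when two warehouses' time dimensions match at the month level, the finer-grained source is rolled up to that common level before a drill-across join:

```python
# Illustrative sketch: loosely coupled integration of two warehouses whose
# time dimensions differ in granularity (day vs. month).
import pandas as pd

daily   = pd.DataFrame({'day': ['2024-01-05', '2024-01-20', '2024-02-03'],
                        'revenue': [4, 6, 9]})
monthly = pd.DataFrame({'month': ['2024-01', '2024-02'], 'revenue': [7, 9]})

# Roll the finer source up to the matched common level (day -> month).
daily['month'] = daily['day'].str[:7]
rolled = daily.groupby('month', as_index=False)['revenue'].sum()

# Join over the common level to answer cross-warehouse queries.
merged = rolled.merge(monthly, on='month', suffixes=('_src1', '_src2'))
print(merged)
```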