Functional Dependencies Unleashed for Scalable Data Exchange
We address the problem of efficiently evaluating target functional
dependencies (fds) in the Data Exchange (DE) process. Target fds naturally
occur in many DE scenarios, including the ones in Life Sciences in which
multiple source relations need to be structured under a constrained target
schema. However, despite their wide use, target fds' evaluation is still a
bottleneck in the state-of-the-art DE engines. Systems relying on an all-SQL
approach typically do not support target fds unless additional information is
provided. Alternatively, DE engines that do include these dependencies
typically pay the price of a significant drop in performance and scalability.
In this paper, we present a novel chase-based algorithm that can efficiently
handle arbitrary fds on the target. Our approach essentially relies on
exploiting the interactions between source-to-target (s-t) tuple-generating
dependencies (tgds) and target fds. This allows us to tame the size of the
intermediate chase results by carefully ordering chase steps that
interleave fds and (chosen) tgds. As a direct consequence, we substantially
reduce the fd application scope, often a central cause of the dramatic
overhead induced by target fds. Moreover, reasoning on dependency interaction
further leads us to interesting parallelization opportunities, yielding
additional scalability gains. We provide a proof-of-concept implementation of
our chase-based algorithm and an experimental study aiming at gauging its
scalability with respect to a number of parameters, among which the size of
source instances and the number of dependencies of each tested scenario.
Finally, we empirically compare with the latest DE engines, and show that our
algorithm outperforms them.
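To make the role of a target fd concrete, the following is a minimal sketch of chasing a single fd over a target relation that contains labeled nulls. It is not the algorithm of the paper above; the `Null` class, the `chase_fd` function and the sample gene/protein relation are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Null:
    """A labeled null, as produced by existential variables in s-t tgds."""
    label: str


class ChaseFailure(Exception):
    """Raised when an fd would force two distinct constants to be equal."""


def chase_fd(rows, lhs, rhs):
    """Chase `rows` (a list of dicts) with the fd lhs -> rhs until fixpoint.

    `lhs` is a list of attribute names, `rhs` a single attribute name.
    """
    rows = [dict(r) for r in rows]  # work on a copy
    changed = True
    while changed:
        changed = False
        seen = {}  # lhs values -> rhs value of the first matching row
        for r in rows:
            key = tuple(r[a] for a in lhs)
            if key not in seen:
                seen[key] = r[rhs]
                continue
            a, b = seen[key], r[rhs]
            if a == b:
                continue
            if isinstance(a, Null):
                old, new = a, b
            elif isinstance(b, Null):
                old, new = b, a
            else:
                raise ChaseFailure(f"{key}: {a!r} conflicts with {b!r}")
            # Equate the two values by replacing the null everywhere.
            for t in rows:
                for attr in list(t):
                    if t[attr] == old:
                        t[attr] = new
            changed = True
            break  # restart the scan on the updated instance
    unique = []
    for r in rows:  # drop duplicates created by the unification
        if r not in unique:
            unique.append(r)
    return unique


target = [
    {"gene": "g1", "protein": Null("N1")},
    {"gene": "g1", "protein": "p53"},
    {"gene": "g2", "protein": Null("N1")},
]
chased = chase_fd(target, ["gene"], "protein")
```

Here the null `N1` is unified with the constant `p53`, so the first two rows collapse into one. The sketch restarts its scan after every unification, which is simple but quadratic, and illustrates why the scope of fd application matters for performance.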
Distribution Constraints: The Chase for Distributed Data
This paper introduces a declarative framework to specify and reason about distributions of data over computing nodes in a distributed setting. More specifically, it proposes distribution constraints, which are tuple- and equality-generating dependencies (tgds and egds) extended with node variables ranging over computing nodes. In particular, they can express co-partitioning constraints and constraints about range-based data distributions by using comparison atoms. The main technical contribution is the study of the implication problem of distribution constraints. While implication is undecidable in general, relevant fragments of so-called data-full constraints are exhibited for which the corresponding implication problems are complete for EXPTIME, PSPACE and NP. These results yield bounds on deciding parallel-correctness for conjunctive queries in the presence of distribution constraints.
Propagating functional dependencies with conditions
The dependency propagation problem is to determine, given a view defined on data sources and a set of dependencies on the sources, whether another dependency is guaranteed to hold on the view. This paper investigates dependency propagation for recently proposed conditional functional dependencies (CFDs). The need for this study is evident in data integration, exchange and cleaning since dependencies on data sources often only hold conditionally on the view. We investigate dependency propagation for views defined in various fragments of relational algebra, CFDs as view dependencies, and for source dependencies given as either CFDs or traditional functional dependencies (FDs). (a) We establish lower and upper bounds, all matching, ranging from PTIME to undecidable. These not only provide the first results for CFD propagation, but also extend the classical work of FD propagation by giving new complexity bounds in the presence of finite domains. (b) We provide the first algorithm for computing a minimal cover of all CFDs propagated via SPC views; the algorithm has the same complexity as one of the most efficient algorithms for computing a cover of FDs propagated via a projection view, despite the increased expressive power of CFDs and SPC views. (c) We experimentally verify that the algorithm is efficient.
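As a rough illustration of the propagation question in its simplest form, the sketch below tests whether a plain FD holds on a projection view of a single source relation, using the textbook attribute-closure computation. It covers neither CFDs, finite domains, nor selection and product; the function names and the sample schema are assumptions made for this example.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) set pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def propagates_to_projection(fd, source_fds, view_attrs):
    """True if the FD `fd` is guaranteed on the projection onto `view_attrs`."""
    lhs, rhs = fd
    # The FD must only mention attributes that survive the projection ...
    if not (set(lhs) | set(rhs)) <= set(view_attrs):
        return False
    # ... and must be implied by the source FDs.
    return set(rhs) <= closure(lhs, source_fds)


source_fds = [({"A"}, {"B"}), ({"B"}, {"C"})]
view = {"A", "C"}  # the view projects out attribute B
a_to_c = propagates_to_projection(({"A"}, {"C"}), source_fds, view)
c_to_a = propagates_to_projection(({"C"}, {"A"}), source_fds, view)
```

With these inputs `A -> C` propagates to the view through the projected-out attribute `B`, while `C -> A` does not.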
Relational to RDF Data Exchange in Presence of a Shape Expression Schema
We study the relational to RDF data exchange problem, where the target constraints are specified using a Shape Expression schema (ShEx). We investigate two fundamental problems: 1) consistency, which is checking whether a given data exchange setting admits a solution for every source instance, and 2) constructing a universal solution, which is a solution that represents the space of all solutions. We propose to use typed IRI constructors in source-to-target tuple-generating dependencies to create the IRIs of the RDF graph from the values in the relational instance, and we translate ShEx into a set of target dependencies. We also identify data exchange settings that are key covered, a property that is decidable and guarantees consistency. Furthermore, we show that this property is a sufficient and necessary condition for the existence of universal solutions for a practical subclass of weakly-recursive ShEx.
On Chase Termination Beyond Stratification
We study the termination problem of the chase algorithm, a central tool in
various database problems such as the constraint implication problem,
Conjunctive Query optimization, rewriting queries using views, data exchange,
and data integration. The basic idea of the chase is, given a database instance
and a set of constraints as input, to fix constraint violations in the database
instance. It is well-known that, for an arbitrary set of constraints, the chase
does not necessarily terminate (in general, it is even undecidable whether it
does). Addressing this issue, we review the limitations of existing
sufficient termination conditions for the chase and develop new techniques that
allow us to establish weaker sufficient conditions. In particular, we introduce
two novel termination conditions called safety and inductive restriction, and
use them to define the so-called T-hierarchy of termination conditions. We then
study the interrelations of our termination conditions with previous conditions
and the complexity of checking our conditions. This analysis leads to an
algorithm that checks membership in a level of the T-hierarchy and accounts for
the complexity of termination conditions. As another contribution, we study the
problem of data-dependent chase termination and present sufficient termination
conditions w.r.t. fixed instances. They might guarantee termination although
the chase does not terminate in the general case. As an application of our
techniques beyond those already mentioned, we transfer our results into the
field of query answering over knowledge bases where the chase on the underlying
database may not terminate, making existing algorithms applicable to broader
classes of constraints. Comment: Technical Report of VLDB 2009 conference version
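The difference between general and data-dependent termination can be seen on a single constraint. The sketch below runs a bounded standard chase with the tgd E(x, y) -> exists z. E(y, z); the function name, the step bound and the naming of fresh nulls are assumptions made for illustration and do not implement any of the termination conditions from the paper.

```python
from itertools import count


def chase_successor(edges, max_steps):
    """Chase a binary relation E with the tgd E(x, y) -> exists z. E(y, z).

    Returns (instance, terminated). `max_steps` bounds the number of chase
    steps, since the chase may run forever on some inputs.
    """
    edges = set(edges)
    fresh = count(1)  # generator of labels for fresh nulls
    steps = 0
    while True:
        sources = {x for x, _ in edges}
        # A trigger is active for every y that has no outgoing edge yet.
        violating = sorted(y for _, y in edges if y not in sources)
        if not violating:
            return edges, True
        if steps == max_steps:
            return edges, False
        edges.add((violating[0], f"_n{next(fresh)}"))
        steps += 1


# On a cycle the constraint is already satisfied, so the chase stops at once.
cyclic, cyclic_done = chase_successor({("a", "b"), ("b", "a")}, max_steps=5)

# On a single edge each step creates a new null that itself needs a successor.
path, path_done = chase_successor({("a", "b")}, max_steps=5)
```

The same tgd therefore terminates on the first instance and exhausts the step budget on the second, which is the kind of behavior that termination conditions for fixed instances aim to capture.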
Consistent Query Answers in the Presence of Universal Constraints
The framework of consistent query answers and repairs has been introduced to
alleviate the impact of inconsistent data on the answers to a query. A repair
is a minimally different consistent instance and an answer is consistent if it
is present in every repair. In this article we study the complexity of
consistent query answers and repair checking in the presence of universal
constraints.
We propose an extended version of the conflict hypergraph which captures
all repairs w.r.t. a set of universal constraints. We show that repair
checking is in PTIME for the class of full tuple-generating dependencies and
denial constraints, and we present a polynomial repair algorithm. This
algorithm is sound, i.e. it always produces a repair, and also complete, i.e.
every repair can be constructed. Next, we present a polynomial-time algorithm
computing consistent answers to ground quantifier-free queries in the presence
of denial constraints, join dependencies, and acyclic full tuple-generating
dependencies. Finally, we show that extending the class of constraints leads to
intractability. For arbitrary full tuple-generating dependencies, consistent
query answering becomes coNP-complete. For arbitrary universal constraints,
consistent query answering is $\Pi_2^p$-complete and repair checking is
coNP-complete. Comment: Submitted to Information Systems
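The definitions of repair and consistent answer can be illustrated on the simplest case, a key constraint treated as a denial constraint, where repairs are the maximal conflict-free subsets of the instance. The brute-force sketch below is exponential and is not the polynomial algorithm described above; the function names and the sample relation are assumptions.

```python
from itertools import combinations


def conflicts(t1, t2):
    """Key violation: same key (first position) but different tuples."""
    return t1[0] == t2[0] and t1 != t2


def repairs(instance):
    """All maximal conflict-free subsets of `instance` (brute force)."""
    tuples = sorted(instance)
    found = []
    # Enumerate larger subsets first, so that every consistent subset seen
    # later can be tested for maximality against the repairs found so far.
    for k in range(len(tuples), -1, -1):
        for subset in combinations(tuples, k):
            if any(conflicts(a, b) for a, b in combinations(subset, 2)):
                continue
            s = frozenset(subset)
            if not any(s < r for r in found):
                found.append(s)
    return found


def consistent_tuples(instance):
    """Tuples present in every repair."""
    return set(frozenset.intersection(*repairs(instance)))


db = {("alice", "NY"), ("alice", "LA"), ("bob", "SF")}
all_repairs = repairs(db)
certain = consistent_tuples(db)
```

The two tuples for `alice` conflict, so there are two repairs, and only the tuple for `bob` is a consistent answer to the query asking for all tuples.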
Analyses and Validation of Conditional Dependencies with Built-in Predicates
This paper proposes a natural extension of conditional functional dependencies (CFDs [14]) and conditional inclusion dependencies (CINDs [8]), denoted by CFD(p)s and CIND(p)s, respectively, by specifying patterns of data values with ≠, <, ≤, > and ≥ predicates. As data quality rules, CFD(p)s and CIND(p)s are able to capture errors that commonly arise in practice but cannot be detected by CFDs and CINDs. We establish two sets of results for central technical problems associated with CFD(p)s and CIND(p)s. (a) One concerns the satisfiability and implication problems for CFD(p)s and CIND(p)s, taken separately or together. These are important for, e.g., deciding whether data quality rules are dirty themselves, and for removing redundant rules. We show that despite the increased expressive power, the static analyses of CFD(p)s and CIND(p)s retain the same complexity as their CFD and CIND counterparts. (b) The other concerns validation of CFD(p)s and CIND(p)s. We show that given a set Sigma of CFD(p)s and CIND(p)s on a database D, a set of SQL queries can be automatically generated that, when evaluated against D, return all tuples in D that violate some dependencies in Sigma. This provides commercial DBMS with an immediate capability to detect errors based on CFD(p)s and CIND(p)s.
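The idea of detecting violations with generated SQL can be sketched for a heavily simplified rule: a constant pattern with built-in predicates on the left-hand side that forces a constant pattern on the right-hand side. This is not the generation scheme of the paper; the `violation_query` helper, the `item` table and the sample rule are assumptions, and the sketch expects non-empty patterns and ignores NULL values.

```python
import sqlite3

ALLOWED_OPS = {"=", "<>", "<", "<=", ">", ">="}


def violation_query(table, lhs_pattern, rhs_pattern):
    """Build SQL for tuples that match the LHS pattern but not the RHS pattern.

    Patterns are non-empty lists of (column, operator, constant). Table and
    column names are trusted here; only constants are bound as parameters.
    """
    params = []

    def render(preds):
        parts = []
        for col, op, const in preds:
            if op not in ALLOWED_OPS:
                raise ValueError(f"unsupported operator: {op}")
            parts.append(f"{col} {op} ?")
            params.append(const)
        return " AND ".join(parts)

    lhs = render(lhs_pattern)  # rendered first, so its parameters come first
    rhs = render(rhs_pattern)
    return f"SELECT * FROM {table} WHERE {lhs} AND NOT ({rhs})", params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER, kind TEXT, price REAL, shipping REAL)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?, ?, ?)",
    [(1, "book", 25, 0), (2, "book", 30, 3.5), (3, "book", 10, 3.5), (4, "toy", 40, 5)],
)

# Hypothetical rule: books priced at 20 or more must ship for free.
sql, params = violation_query(
    "item",
    lhs_pattern=[("kind", "=", "book"), ("price", ">=", 20)],
    rhs_pattern=[("shipping", "=", 0)],
)
violations = conn.execute(sql, params).fetchall()
```

Only the item with id 2 matches the left-hand pattern while breaking the right-hand one. Rules that compare pairs of tuples would need a self-join, which this sketch does not attempt.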
Canonical queries as a query answering device (Information Science)
Issued as Annual reports [nos. 1-2], and Final report, Project no. G-36-60