Functional Dependencies Unleashed for Scalable Data Exchange
We address the problem of efficiently evaluating target functional
dependencies (fds) in the Data Exchange (DE) process. Target fds naturally
occur in many DE scenarios, including those in the Life Sciences, in which
multiple source relations need to be structured under a constrained target
schema. However, despite their wide use, the evaluation of target fds remains a
bottleneck in state-of-the-art DE engines. Systems relying on an all-SQL
approach typically do not support target fds unless additional information is
provided. Alternatively, DE engines that do include these dependencies
typically pay the price of a significant drop in performance and scalability.
In this paper, we present a novel chase-based algorithm that can efficiently
handle arbitrary fds on the target. Our approach essentially relies on
exploiting the interactions between source-to-target (s-t) tuple-generating
dependencies (tgds) and target fds. This allows us to tame the size of the
intermediate chase results through a careful ordering of chase steps that
interleaves fds and (chosen) tgds. As a direct consequence, we significantly
reduce the scope of fd application, often a central cause of the dramatic
overhead induced by target fds. Moreover, reasoning on dependency interaction
further leads us to interesting parallelization opportunities, yielding
additional scalability gains. We provide a proof-of-concept implementation of
our chase-based algorithm and an experimental study aiming at gauging its
scalability with respect to a number of parameters, among which the size of
source instances and the number of dependencies of each tested scenario.
Finally, we compare our approach empirically with the latest DE engines and
show that our algorithm outperforms them.
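The central ingredient described above, namely applying a target fd as an equality-generating chase step that either unifies labelled nulls or detects a hard conflict, can be sketched as follows. This is a minimal illustration, not the paper's actual algorithm; the null-naming convention and the dict-based tuple encoding are assumptions made for the example.

```python
# Minimal sketch of one fd chase step. Nulls are modelled as strings
# starting with "_N"; everything else is a constant.
# (Illustrative only; not the paper's actual algorithm.)

def is_null(v):
    return isinstance(v, str) and v.startswith("_N")

def apply_fd(tuples, lhs, rhs):
    """Enforce the fd lhs -> rhs on a list of dict-tuples.
    Returns the repaired instance, or raises on a constant clash."""
    subst = {}  # accumulated null substitutions

    def resolve(v):
        while v in subst:
            v = subst[v]
        return v

    seen = {}  # lhs key -> chosen rhs value
    for t in tuples:
        key = tuple(resolve(t[a]) for a in lhs)
        val = resolve(t[rhs])
        if key in seen:
            prev = seen[key]
            if prev == val:
                continue
            if is_null(val):
                subst[val] = prev       # unify the null with the earlier value
            elif is_null(prev):
                subst[prev] = val
                seen[key] = val
            else:
                raise ValueError("chase failure: constant clash")
        else:
            seen[key] = val
    return [{a: resolve(v) for a, v in t.items()} for t in tuples]
```

For instance, enforcing `emp -> dept` on `[{"emp": "a", "dept": "_N1"}, {"emp": "a", "dept": "Sales"}]` replaces the null `_N1` by the constant `Sales` in both tuples.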
Composition with Target Constraints
It is known that the composition of schema mappings, each specified by
source-to-target tgds (st-tgds), can be specified by a second-order tgd (SO
tgd). We consider the question of what happens when target constraints are
allowed. Specifically, we consider the question of specifying the composition
of standard schema mappings (those specified by st-tgds, target egds, and a
weakly acyclic set of target tgds). We show that SO tgds, even with the
assistance of arbitrary source constraints and target constraints, cannot
specify in general the composition of two standard schema mappings. Therefore,
we introduce source-to-target second-order dependencies (st-SO dependencies),
which are similar to SO tgds, but allow equations in the conclusion. We show
that st-SO dependencies (along with target egds and target tgds) are sufficient
to express the composition of every finite sequence of standard schema
mappings, and further, every st-SO dependency specifies such a composition. In
addition to this expressive power, we show that st-SO dependencies enjoy other
desirable properties. In particular, they have a polynomial-time chase that
generates a universal solution. This universal solution can be used to find the
certain answers to unions of conjunctive queries in polynomial time. It is easy
to show that the composition of an arbitrary number of standard schema mappings
is equivalent to the composition of only two standard schema mappings. We show
that surprisingly, the analogous result holds also for schema mappings
specified by just st-tgds (no target constraints). This is proven by showing
that every SO tgd is equivalent to an unnested SO tgd (one where there is no
nesting of function symbols). Similarly, we prove unnesting results for st-SO
dependencies, with the same types of consequences.
Comment: This paper is an extended version of: M. Arenas, R. Fagin, and A.
Nash. Composition with Target Constraints. In 13th International Conference
on Database Theory (ICDT), pages 129-142, 201
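As a hedged illustration of the dependency classes involved (the relation names Emp, Mgr, and MgrName are invented for this example, not taken from the paper): an SO tgd existentially quantifies function symbols and uses them inside relational atoms, while an st-SO dependency may additionally assert an equation in its conclusion.

```latex
% SO tgd: every employee has a manager, named by the Skolem function f
\exists f \,\forall e \,\big(\mathit{Emp}(e) \rightarrow \mathit{Mgr}(e, f(e))\big)

% st-SO dependency: an equation f(e) = m in the conclusion
\exists f \,\forall e \,\forall m \,\big(\mathit{Emp}(e) \wedge
  \mathit{MgrName}(e, m) \rightarrow f(e) = m\big)
```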
Semantics for Non-Monotone Queries in Data Exchange and Data Integration
A fundamental question in data exchange and data integration is how to answer queries that are posed against the target schema, or the global schema, respectively. While the certain answers semantics has proved to be adequate for answering monotone queries, the question concerning an appropriate semantics for non-monotone queries turned out to be more difficult. This article surveys approaches and semantics for answering non-monotone queries in data exchange and data integration.
Algorithms for Core Computation in Data Exchange
We describe the state of the art in the area of core computation for data exchange. Two main approaches are considered: post-processing core computation, applied to a canonical universal solution constructed by chasing a given schema mapping, and direct core computation, where the mapping is first rewritten in order to create core universal solutions by chasing it.
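The post-processing route can be pictured as a search for the smallest endomorphic image of a (tiny) canonical universal solution, which for finite instances yields the core. Real core-computation algorithms are far more refined; the brute-force sketch below only illustrates the idea, under an assumed tuple encoding with labelled nulls.

```python
from itertools import product

def core(instance):
    """Brute-force post-processing core computation for a tiny instance.
    Tuples are (relation, v1, v2, ...); nulls are strings starting "_N".
    Exponential search over null mappings; a sketch, not an optimized
    algorithm."""
    nulls = sorted({v for t in instance for v in t[1:] if v.startswith("_N")})
    values = sorted({v for t in instance for v in t[1:]})
    best = set(instance)
    for image in product(values, repeat=len(nulls)):
        h = dict(zip(nulls, image))
        mapped = {(t[0],) + tuple(h.get(v, v) for v in t[1:]) for t in instance}
        if mapped <= set(instance) and len(mapped) < len(best):
            best = mapped  # a smaller endomorphic image was found
    return best
```

On the instance {R(a, _N1), R(a, b)}, mapping the null _N1 to b gives the single-tuple image {R(a, b)}, which is the core.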
Implementation of Tuned Schema Merging Approach
Schema merging is the process of integrating multiple data sources into a GCS (Global Conceptual Schema). It is pivotal to various application domains, like data warehousing and multi-databases. Schema merging requires the identification of corresponding elements, which is done through a schema matching process in which corresponding elements across multiple data sources are identified by comparing the data sources with each other. In this way, for a given set of data sources and the correspondences between them, different possibilities for creating the GCS can be achieved. In applications like multi-databases and data warehousing, new data sources keep joining in, and schema merging approaches usually expand GCS relations horizontally or vertically as they do. As a result of such expansions, an unbalanced GCS is created, which either produces too many NULL values in response to global queries or, because of too many joins, causes poor query processing. In this paper, a novel approach, TuSMe (Tuned Schema Merging), is introduced to overcome the above-mentioned issue by developing a balanced GCS that controls both the vertical and horizontal expansion of GCS relations. The approach employs a weighting mechanism in which weights are assigned to the individual attributes of the GCS. These weights reflect the connectedness of GCS attributes in accordance with the attributes of the principal data sources. Moreover, the overall strength of the GCS can be scrutinized by combining these weights. A prototype implementation of TuSMe shows significant improvement against other contemporary state-of-the-art approaches.
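The weighting mechanism can be pictured roughly as follows. The abstract does not give TuSMe's exact formulas, so the connectedness measure below (the fraction of data sources contributing a corresponding attribute, averaged into a relation strength) is a hypothetical stand-in for illustration only.

```python
# Hypothetical sketch in the spirit of TuSMe's weighting mechanism:
# the weight of a GCS attribute is the fraction of data sources that
# contribute a corresponding attribute, and a relation's strength is
# the mean of its attribute weights. The paper's formulas may differ.

def attribute_weights(gcs_attrs, correspondences, n_sources):
    """correspondences: GCS attribute -> set of source ids matching it."""
    return {a: len(correspondences.get(a, set())) / n_sources
            for a in gcs_attrs}

def relation_strength(weights):
    """Mean attribute weight of one GCS relation."""
    return sum(weights.values()) / len(weights) if weights else 0.0
```

An attribute matched by every source gets weight 1.0, so relations padded with rarely matched attributes (the ones producing NULLs under global queries) show up with low strength.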
On Chase Termination Beyond Stratification
We study the termination problem of the chase algorithm, a central tool in
various database problems such as the constraint implication problem,
Conjunctive Query optimization, rewriting queries using views, data exchange,
and data integration. The basic idea of the chase is, given a database instance
and a set of constraints as input, to fix constraint violations in the database
instance. It is well-known that, for an arbitrary set of constraints, the chase
does not necessarily terminate (in general, it is even undecidable if it does
or not). Addressing this issue, we review the limitations of existing
sufficient termination conditions for the chase and develop new techniques that
allow us to establish weaker sufficient conditions. In particular, we introduce
two novel termination conditions called safety and inductive restriction, and
use them to define the so-called T-hierarchy of termination conditions. We then
study the interrelations of our termination conditions with previous conditions
and the complexity of checking our conditions. This analysis leads to an
algorithm that checks membership in a level of the T-hierarchy and accounts for
the complexity of termination conditions. As another contribution, we study the
problem of data-dependent chase termination and present sufficient termination
conditions w.r.t. fixed instances. They might guarantee termination although
the chase does not terminate in the general case. As an application of our
techniques beyond those already mentioned, we transfer our results into the
field of query answering over knowledge bases where the chase on the underlying
database may not terminate, making existing algorithms applicable to broader
classes of constraints.
Comment: Technical Report of VLDB 2009 conference version.
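The basic chase idea described above, namely repeatedly fixing constraint violations while possibly never terminating, can be sketched for a toy class of single-atom tgds. The encoding and the explicit step budget are assumptions made for this illustration, not part of the paper.

```python
from itertools import count

def chase(instance, tgds, max_steps=1000):
    """Naive chase sketch for tgds of the shape
        body_rel(x, y) -> exists z: head_rel(y, z).
    Violations are repaired with fresh labelled nulls; the step cap
    reflects the fact that the chase need not terminate in general."""
    instance = set(instance)
    fresh = count()
    for _ in range(max_steps):
        changed = False
        for body_rel, head_rel in tgds:
            for (rel, _x, y) in sorted(instance):
                if rel != body_rel:
                    continue
                # violation: no head tuple whose first argument is y
                if not any(t[0] == head_rel and t[1] == y for t in instance):
                    instance.add((head_rel, y, f"_N{next(fresh)}"))
                    changed = True
        if not changed:
            return instance
    raise RuntimeError("chase did not terminate within step budget")
```

With the tgd E(x, y) -> exists z: F(y, z) the chase of {E(a, b)} terminates after adding one F-tuple, whereas E(x, y) -> exists z: E(y, z) keeps generating fresh nulls forever and exhausts the budget.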
Tree Projections and Structural Decomposition Methods: The Power of Local Consistency and Larger Islands of Tractability
Evaluating conjunctive queries and solving constraint satisfaction problems
are fundamental problems in database theory and artificial intelligence,
respectively. These problems are NP-hard, so that several research efforts have
been made in the literature for identifying tractable classes, known as islands
of tractability, as well as for devising clever heuristics for solving
efficiently real-world instances. Many heuristic approaches are based on
enforcing on the given instance a property called local consistency, where (in
database terms) each tuple in every query atom matches at least one tuple in
every other query atom. Interestingly, it turns out that, for many well-known
classes of queries, such as for the acyclic queries, enforcing local
consistency is even sufficient to solve the given instance correctly. However,
the precise power of such a procedure was unclear, except for some very
restricted cases. The paper provides full answers to the long-standing questions about the
precise power of algorithms based on enforcing local consistency. The classes
of instances where enforcing local consistency turns out to be a correct
query-answering procedure are however not efficiently recognizable. In fact,
the paper finally focuses on certain subclasses defined in terms of the novel
notion of greedy tree projections. These latter classes are shown to be
efficiently recognizable and strictly larger than most islands of tractability
known so far, both in the general case of tree projections and for specific
structural decomposition methods.
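Enforcing local consistency in the pairwise sense used above amounts to repeated semi-join reductions until a fixpoint: every tuple of an atom that matches no tuple of some other atom on their shared variables is deleted. A minimal sketch, assuming each atom is given as a (variable tuple, tuple set) pair:

```python
def enforce_local_consistency(atoms):
    """Pairwise-consistency sketch: repeatedly delete from each atom any
    tuple that matches no tuple of some other atom on shared variables.
    atoms: list of (variable_tuple, set_of_tuples)."""
    changed = True
    while changed:
        changed = False
        for i in range(len(atoms)):
            vi, ri = atoms[i]
            for j in range(len(atoms)):
                if i == j:
                    continue
                vj, rj = atoms[j]
                shared = [v for v in vi if v in vj]
                if not shared:
                    continue
                pi = [vi.index(v) for v in shared]
                pj = [vj.index(v) for v in shared]
                keep = {tuple(t[k] for k in pj) for t in rj}
                survivors = {t for t in ri if tuple(t[k] for k in pi) in keep}
                if survivors != ri:
                    ri = survivors
                    atoms[i] = (vi, survivors)  # semi-join reduction
                    changed = True
    return atoms
```

For the acyclic query R(x, y), S(y, z), the reduction removes every R-tuple whose y-value has no partner in S, which, for acyclic queries, is already enough to decide the instance correctly.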
What is the IQ of your data transformation system?
Mapping and translating data across different representations is a crucial problem in information systems. Many formalisms and tools are currently used for this purpose, to the point that developers typically face a difficult question: “what is the right tool for my translation task?” In this paper, we introduce several techniques that contribute to answer this question. Among these, a fairly general definition of a data transformation system, a new and very efficient similarity measure to evaluate the outputs produced by such a system, and a metric to estimate user efforts. Based on these techniques, we are able to compare a wide range of systems on many translation tasks, to gain interesting insights about their effectiveness, and, ultimately, about their “intelligence”.
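Comparing the outputs of two transformation systems on the same task can be pictured with a simple baseline: tuple-level Jaccard similarity, with labelled nulls normalized so that differently named nulls still compare equal. This is not the paper's similarity measure, only an illustrative stand-in.

```python
# A hedged stand-in for comparing two transformation outputs: Jaccard
# similarity over the produced tuples, with labelled nulls (strings
# starting "_N") replaced by a placeholder so that differently named
# nulls compare equal. NOT the paper's measure; a baseline illustration.

def normalize(tuples):
    return {tuple("NULL" if str(v).startswith("_N") else v for v in t)
            for t in tuples}

def jaccard(out_a, out_b):
    a, b = normalize(out_a), normalize(out_b)
    return len(a & b) / len(a | b) if a | b else 1.0
```

Two outputs that differ only in how they name their nulls score 1.0, while outputs disagreeing on constants are penalized.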