
    Functional Dependencies Unleashed for Scalable Data Exchange

    We address the problem of efficiently evaluating target functional dependencies (fds) in the Data Exchange (DE) process. Target fds naturally occur in many DE scenarios, including those in the Life Sciences in which multiple source relations need to be structured under a constrained target schema. However, despite their wide use, the evaluation of target fds is still a bottleneck in state-of-the-art DE engines. Systems relying on an all-SQL approach typically do not support target fds unless additional information is provided, while DE engines that do include these dependencies typically pay the price of a significant drop in performance and scalability. In this paper, we present a novel chase-based algorithm that can efficiently handle arbitrary fds on the target. Our approach relies on exploiting the interactions between source-to-target (s-t) tuple-generating dependencies (tgds) and target fds. This allows us to tame the size of the intermediate chase results by carefully ordering chase steps that interleave fds and (chosen) tgds. As a direct consequence, we significantly reduce the fd application scope, often a central cause of the dramatic overhead induced by target fds. Moreover, reasoning on dependency interaction leads us to interesting parallelization opportunities, yielding additional scalability gains. We provide a proof-of-concept implementation of our chase-based algorithm and an experimental study gauging its scalability with respect to a number of parameters, including the size of the source instances and the number of dependencies in each tested scenario. Finally, we empirically compare with the latest DE engines and show that our algorithm outperforms them.
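
    To make the interplay between s-t tgds and target fds concrete, here is a minimal, hypothetical chase-step sketch in Python. It is not the paper's algorithm; the relation names Emp/Works and the fd dept → manager are invented for illustration.

```python
# Minimal sketch (not the paper's algorithm): chase a source instance with one
# s-t tgd, then enforce a target fd by equating values. Relation and attribute
# names are illustrative only.

from itertools import count

fresh = count()  # generator of labeled nulls N0, N1, ...

def chase_st_tgd(source_emp):
    """s-t tgd: Emp(name, dept) -> exists M. Works(name, dept, M).
    Each source tuple produces a target tuple with a fresh null for the manager."""
    return [(name, dept, f"N{next(fresh)}") for (name, dept) in source_emp]

def enforce_fd(works, lhs=1, rhs=2):
    """Target fd: dept -> manager. Whenever two tuples agree on position `lhs`,
    equate their `rhs` values (here we simply keep the first value seen; a real
    chase would prefer constants over nulls and fail on two distinct constants)."""
    rep = {}
    out = []
    for t in works:
        rep.setdefault(t[lhs], t[rhs])
        out.append(t[:rhs] + (rep[t[lhs]],) + t[rhs + 1:])
    return out

source = [("ann", "it"), ("bob", "it"), ("eve", "hr")]
print(enforce_fd(chase_st_tgd(source)))   # ann and bob end up with the same manager null
```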

    Composition with Target Constraints

    It is known that the composition of schema mappings, each specified by source-to-target tgds (st-tgds), can be specified by a second-order tgd (SO tgd). We consider the question of what happens when target constraints are allowed. Specifically, we consider the question of specifying the composition of standard schema mappings (those specified by st-tgds, target egds, and a weakly acyclic set of target tgds). We show that SO tgds, even with the assistance of arbitrary source constraints and target constraints, cannot in general specify the composition of two standard schema mappings. Therefore, we introduce source-to-target second-order dependencies (st-SO dependencies), which are similar to SO tgds but allow equations in the conclusion. We show that st-SO dependencies (along with target egds and target tgds) are sufficient to express the composition of every finite sequence of standard schema mappings, and further, that every st-SO dependency specifies such a composition. In addition to this expressive power, we show that st-SO dependencies enjoy other desirable properties. In particular, they have a polynomial-time chase that generates a universal solution. This universal solution can be used to find the certain answers to unions of conjunctive queries in polynomial time. It is easy to show that the composition of an arbitrary number of standard schema mappings is equivalent to the composition of only two standard schema mappings. We show that, surprisingly, the analogous result also holds for schema mappings specified by just st-tgds (no target constraints). This is proven by showing that every SO tgd is equivalent to an unnested SO tgd (one where there is no nesting of function symbols). Similarly, we prove unnesting results for st-SO dependencies, with the same types of consequences. Comment: This paper is an extended version of: M. Arenas, R. Fagin, and A. Nash. Composition with Target Constraints. In 13th International Conference on Database Theory (ICDT), pages 129-142, 2010.
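
    To fix intuition, the shapes involved can be sketched as follows. These are schematic, standard forms; the relation names Emp, Mgr, and SelfMgr are illustrative and not taken from the paper.

```latex
% An SO tgd existentially quantifies function symbols over a conjunction of
% universally quantified implications, e.g.
\[
\exists f \,\forall e \,\bigl(\mathrm{Emp}(e) \rightarrow \mathrm{Mgr}(e, f(e))\bigr)
\]
% An st-SO dependency has the same overall shape, but its conclusions may also
% contain equalities between terms, e.g. a conjunct such as
\[
\forall e \,\bigl(\mathrm{SelfMgr}(e) \rightarrow f(e) = e\bigr)
\]
```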

    Semantics for Non-Monotone Queries in Data Exchange and Data Integration

    A fundamental question in data exchange and data integration is how to answer queries posed against the target schema or the global schema, respectively. While the certain answers semantics has proved to be adequate for answering monotone queries, finding an appropriate semantics for non-monotone queries has turned out to be more difficult. This article surveys approaches and semantics for answering non-monotone queries in data exchange and data integration.
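
    For background, the certain answers semantics mentioned above is the standard one: a tuple is an answer only if it is returned over every solution (respectively, every legal global instance). This is a textbook definition, not specific to this survey.

```latex
\[
\mathit{certain}(Q, I) \;=\; \bigcap_{J \,\in\, \mathit{Sol}(I)} Q(J)
\]
```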

    Algorithms for Core Computation in Data Exchange

    We describe the state of the art in the area of core computation for data exchange. Two main approaches are considered: post-processing core computation, which is applied to a canonical universal solution constructed by chasing a given schema mapping, and direct core computation, where the mapping is first rewritten so that chasing it creates core universal solutions directly.
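
    The post-processing route can be illustrated by the classic naive procedure: repeatedly look for an endomorphism of the canonical universal solution that avoids some tuple, and shrink the instance to its image. The sketch below is a brute-force Python illustration over a single relation with an invented naming convention for nulls; it is not one of the optimized algorithms discussed in the survey.

```python
# Naive post-processing core computation (illustrative sketch only).

from itertools import product

def is_null(v):
    return isinstance(v, str) and v.startswith("_N")   # convention: nulls look like "_N1"

def homomorphism(inst, target):
    """Search for a mapping h (identity on constants) with h(inst) contained in target."""
    nulls = sorted({v for t in inst for v in t if is_null(v)})
    values = sorted({v for t in target for v in t})
    for combo in product(values, repeat=len(nulls)):        # brute-force assignment of nulls
        h = dict(zip(nulls, combo))
        image = {tuple(h.get(v, v) for v in t) for t in inst}
        if image <= target:
            return image
    return None

def core(inst):
    inst = set(inst)
    changed = True
    while changed:
        changed = False
        for t in list(inst):
            image = homomorphism(inst, inst - {t})
            if image is not None:                           # t is redundant: retract onto image
                inst = image
                changed = True
                break
    return inst

# A canonical universal solution with a redundant null tuple (one binary relation):
I = {("a", "_N1"), ("a", "b")}
print(core(I))   # {("a", "b")}
```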

    Implementation of Tuned Schema Merging Approach

    Schema merging is the process of integrating multiple data sources into a GCS (Global Conceptual Schema). It is pivotal to various application domains, like data warehousing and multi-databases. Schema merging requires the identification of corresponding elements, which is done through the schema matching process, in which corresponding elements across multiple data sources are identified by comparing these data sources with each other. In this way, for a given set of data sources and the correspondences between them, different possibilities for creating the GCS can be achieved. In applications like multi-databases and data warehousing, new data sources keep joining in, and schema merging approaches usually expand GCS relations horizontally or vertically as they do. As a result of such expansions, an unbalanced GCS is created, which either produces too many NULL values in response to global queries or suffers from poor query processing as a result of too many joins. In this paper, a novel approach, TuSMe (Tuned Schema Merging), is introduced to overcome the above-mentioned issue by developing a balanced GCS, which is able to control both the vertical and horizontal expansion of GCS relations. The approach employs a weighting mechanism in which weights are assigned to the individual attributes of the GCS. These weights reflect the connectedness of GCS attributes with respect to the attributes of the principal data sources. Moreover, the overall strength of the GCS can be assessed by combining these weights. A prototype implementation of TuSMe shows significant improvement over other contemporary state-of-the-art approaches.
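
    The weighting idea can be illustrated with a toy sketch. The exact formula used by TuSMe is not reproduced here, and the attribute and source names below are invented.

```python
# Hypothetical illustration of the weighting mechanism: score each GCS attribute
# by how many source attributes correspond to it, then summarize a GCS relation
# by combining the attribute weights.

def attribute_weights(correspondences, num_sources):
    """correspondences: {gcs_attribute: set of (source, attribute) pairs}."""
    return {attr: len(srcs) / num_sources for attr, srcs in correspondences.items()}

corr = {
    "Customer.name": {("db1", "cust_name"), ("db2", "client"), ("db3", "name")},
    "Customer.fax":  {("db1", "fax")},
}
weights = attribute_weights(corr, num_sources=3)
print(weights)                                   # name: 1.0, fax: ~0.33
print(sum(weights.values()) / len(weights))      # crude "strength" of the GCS relation
```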

    On Chase Termination Beyond Stratification

    We study the termination problem of the chase algorithm, a central tool in various database problems such as the constraint implication problem, conjunctive query optimization, rewriting queries using views, data exchange, and data integration. The basic idea of the chase is, given a database instance and a set of constraints as input, to fix constraint violations in the database instance. It is well known that, for an arbitrary set of constraints, the chase does not necessarily terminate (in general, it is even undecidable whether it does). Addressing this issue, we review the limitations of existing sufficient termination conditions for the chase and develop new techniques that allow us to establish weaker sufficient conditions. In particular, we introduce two novel termination conditions called safety and inductive restriction, and use them to define the so-called T-hierarchy of termination conditions. We then study the interrelations of our termination conditions with previous conditions and the complexity of checking our conditions. This analysis leads to an algorithm that checks membership in a level of the T-hierarchy and accounts for the complexity of termination conditions. As another contribution, we study the problem of data-dependent chase termination and present sufficient termination conditions with respect to fixed instances; these may guarantee termination even when the chase does not terminate in the general case. As an application of our techniques beyond those already mentioned, we transfer our results into the field of query answering over knowledge bases where the chase on the underlying database may not terminate, making existing algorithms applicable to broader classes of constraints. Comment: Technical report of the VLDB 2009 conference version.
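
    As a baseline for such sufficient conditions, the sketch below implements the classical weak-acyclicity test, a well-known condition from the data exchange literature (the paper's safety and inductive restriction conditions are weaker sufficient conditions, i.e. they accept more dependency sets). The tgd encoding and the example are invented for illustration.

```python
# Classical weak-acyclicity test: build the dependency graph over positions and
# check that no cycle goes through a "special" edge (one created by an
# existentially quantified head variable). Illustrative sketch.

from collections import defaultdict

def weakly_acyclic(tgds):
    """tgds: list of (body_atoms, head_atoms, existential_vars),
    where an atom is (relation_name, tuple_of_variable_names)."""
    normal, special = set(), set()
    for body, head, ex in tgds:
        body_pos = defaultdict(set)                          # variable -> positions in the body
        for rel, args in body:
            for i, v in enumerate(args):
                body_pos[v].add((rel, i))
        head_vars = {v for _, args in head for v in args}
        exported = [x for x in body_pos if x in head_vars]   # vars shared by body and head
        for rel, args in head:
            for i, v in enumerate(args):
                for x in exported:
                    for p in body_pos[x]:
                        if v == x:
                            normal.add((p, (rel, i)))        # value copied to this head position
                        elif v in ex:
                            special.add((p, (rel, i)))       # fresh null created at this position
    edges = defaultdict(set)
    for a, b in normal | special:
        edges[a].add(b)

    def reaches(src, dst, seen=()):
        return src == dst or any(n not in seen and reaches(n, dst, seen + (src,))
                                 for n in edges[src])

    # weakly acyclic iff no cycle passes through a special edge
    return not any(reaches(q, p) for (p, q) in special)

# E(x, y) -> exists z. E(y, z): not weakly acyclic, so the chase may run forever.
tgd = ([("E", ("x", "y"))], [("E", ("y", "z"))], {"z"})
print(weakly_acyclic([tgd]))    # False
```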

    Tree Projections and Structural Decomposition Methods: The Power of Local Consistency and Larger Islands of Tractability

    Evaluating conjunctive queries and solving constraint satisfaction problems are fundamental problems in database theory and artificial intelligence, respectively. These problems are NP-hard, so several research efforts have been made in the literature to identify tractable classes, known as islands of tractability, as well as to devise clever heuristics for solving real-world instances efficiently. Many heuristic approaches are based on enforcing on the given instance a property called local consistency, where (in database terms) each tuple in every query atom matches at least one tuple in every other query atom. Interestingly, it turns out that, for many well-known classes of queries, such as acyclic queries, enforcing local consistency is even sufficient to solve the given instance correctly. However, the precise power of such a procedure was unclear, except for some very restricted cases. This paper provides full answers to the long-standing questions about the precise power of algorithms based on enforcing local consistency. The classes of instances where enforcing local consistency turns out to be a correct query-answering procedure are, however, not efficiently recognizable. In fact, the paper finally focuses on certain subclasses defined in terms of the novel notion of greedy tree projections. These latter classes are shown to be efficiently recognizable and strictly larger than most islands of tractability known so far, both in the general case of tree projections and for specific structural decomposition methods.
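
    To make the local-consistency procedure described above concrete, here is a small Python sketch that repeatedly removes tuples lacking a matching tuple in some other atom, until a fixpoint is reached. Atom and variable names are invented for illustration.

```python
# Enforce pairwise local consistency: every surviving tuple of every atom must
# agree, on shared variables, with at least one tuple of every other atom.

def consistent(t1, vars1, t2, vars2):
    """Do two tuples agree on all variables the two atoms share?"""
    a1, a2 = dict(zip(vars1, t1)), dict(zip(vars2, t2))
    return all(a1[v] == a2[v] for v in set(vars1) & set(vars2))

def enforce_local_consistency(atoms):
    """atoms: {atom_name: (variable_list, set_of_tuples)} for a conjunctive query."""
    changed = True
    while changed:
        changed = False
        for a in atoms:
            vars_a, tups_a = atoms[a]
            for b in atoms:
                if a == b:
                    continue
                vars_b, tups_b = atoms[b]
                keep = {t for t in tups_a
                        if any(consistent(t, vars_a, u, vars_b) for u in tups_b)}
                if keep != tups_a:
                    tups_a = keep            # semijoin-style filtering
                    changed = True
            atoms[a] = (vars_a, tups_a)
    return atoms

# Acyclic query R(x, y), S(y, z): here local consistency alone is enough to decide
# whether the query has an answer (cf. the abstract above).
atoms = {"R": (["x", "y"], {(1, 2), (3, 4)}),
         "S": (["y", "z"], {(2, 5)})}
print(enforce_local_consistency(atoms))      # the tuple (3, 4) of R is removed
```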

    What is the IQ of your data transformation system?

    Mapping and translating data across different representations is a crucial problem in information systems. Many formalisms and tools are currently used for this purpose, to the point that developers typically face a difficult question: “what is the right tool for my translation task?” In this paper, we introduce several techniques that contribute to answering this question. Among these are a fairly general definition of a data transformation system, a new and very efficient similarity measure to evaluate the outputs produced by such a system, and a metric to estimate user effort. Based on these techniques, we are able to compare a wide range of systems on many translation tasks, to gain interesting insights about their effectiveness, and, ultimately, about their “intelligence”.
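
    As a toy illustration of the kind of output comparison involved, the sketch below computes a crude tuple-level F-measure between a produced and an expected target instance. The paper's actual similarity measure is more refined and is not reproduced here; the data is invented.

```python
# Crude tuple-level F-measure between the instance produced by a transformation
# system and an expected target instance (illustrative only).

def f_measure(produced, expected):
    produced, expected = set(produced), set(expected)
    if not produced or not expected:
        return 0.0
    tp = len(produced & expected)                 # correctly produced tuples
    if tp == 0:
        return 0.0
    precision = tp / len(produced)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)

expected = {("ann", "it"), ("bob", "hr")}
produced = {("ann", "it"), ("bob", "it"), ("eve", "hr")}
print(round(f_measure(produced, expected), 2))    # 0.4
```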