46 research outputs found

    Causality and the semantics of provenance

    Provenance, or information about the sources, derivation, custody or history of data, has been studied recently in a number of contexts, including databases, scientific workflows and the Semantic Web. Many provenance mechanisms have been developed, motivated by informal notions such as influence, dependence, explanation and causality. However, there has been little study of whether these mechanisms formally satisfy appropriate policies or even how to formalize relevant motivating concepts such as causality. We contend that mathematical models of these concepts are needed to justify and compare provenance techniques. In this paper we review a theory of causality based on structural models that has been developed in artificial intelligence, and describe work in progress on a causal semantics for provenance graphs. Comment: Workshop submissio

    Storing Auxiliary Data for Efficient Maintenance and Lineage Tracing of Complex Views

    As views in a data warehouse become more complex, the view maintenance process can become very complicated and potentially very inefficient. Storing auxiliary views in the warehouse can reduce the complexity and improve the efficiency of view maintenance, and the same auxiliary views can help in efficiently answering lineage tracing queries over the warehouse views. In this paper, we study the problem of selecting auxiliary views to materialize in order to minimize the total view maintenance and lineage tracing cost. We consider relational views with arbitrary use of aggregation operators, and we define an initial search space for our optimization problem based on a normal form for such view definitions. We present several auxiliary view selection algorithms, and to study their performance we conduct experiments using the TPC-D benchmark in addition to synthetic view definitions and statistics. The results of our experiments show: (1) the exhaustive algorithm that selects the optimal set of auxiliary views is far too expensive in many cases; (2) two heuristic algorithms that we present select good (often optimal) sets of auxiliary views in a much shorter time; (3) even auxiliary views selected by a very simple algorithm can significantly reduce the overall view maintenance and lineage tracing cost.

    Lineage Tracing for General Data Warehouse Transformations

    Data warehousing systems integrate information from operational data sources into a central repository to enable analysis and mining of the integrated information. During the integration process, source data typically undergoes a series of transformations, which may vary from simple algebraic operations or aggregations to complex "data cleansing" procedures. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. We formally define the lineage tracing problem in the presence of general data warehouse transformations, and we present algorithms for lineage tracing in this environment. Our tracing procedures take advantage of known structure or properties of transformations when present, but also work in the absence of such information. Our results can be used as the basis for a lineage tracing tool in a general warehousing setting, and also can guide the design of data warehouses that enable efficient lineage tracing.

    Lineage Tracing in a Data Warehousing System (Demonstration Proposal)

    A data warehousing system collects data from multiple distributed sources and stores the integrated information as materialized views in a local data warehouse. Users then perform data analysis and mining on the warehouse views. Figure 1 shows the basic architecture of a data warehousing system. In many cases, the warehouse view contents alone are not sufficient for in-depth analysis. It is often useful to be able to "drill through" from interesting (or potentially erroneous) view data to the original source data that derived the view data. For a given view data item, identifying the exact set of base data items that produced the view data item is termed the view data lineage problem. Motivation for and applications of lineage tracing in a warehousing environment are provided in [2]. In the context of the WHIPS data warehousing project at Stanford [3], we have developed a complete prototype that performs efficient and consistent lineage tracing. Some commercial data warehousing systems support schema-level lineage tracing, or provide specialized drill-down and/or drill-through facilities for multi-dimensional warehouse views. Our lineage tracing prototype supports more fine-grained instance-level lineage tracing for arbitrarily complex relational views, including aggregation. Our prototype automatically generates lineag

    Lineage Tracing in a Data Warehousing System

    e system applies the tracing procedures to the source tables and/or auxiliary views to obtain the lineage results and show the specific view data derivation process. 1 Lineage Tracing System 1.1 Lineage Example Given a view data item I, the exact set of source data that produced I is called I's lineage. We use an example to illustrate the concepts; a full formalization of the problem along with solutions and algorithms is given in [2]. Consider a financial data warehouse with the three source tables shown in Figure 3. A view Promising (Figure 4) is defined to contain all "promising" industries, where an industry is regarded as promising if some stock in that industry is gaining money over all purchases, and the stock has a price-earnings ratio below 40. Over our sample source data the view contains two tuples, ⟨computer⟩ and ⟨m

    Practical Lineage Tracing in Data Warehouses

    We consider the view data lineage problem in a warehousing environment: For a given data item in a materialized warehouse view, we want to identify the set of source data items that produced the view item. We formalize the problem and present a lineage tracing algorithm for relational views with aggregation. Based on our tracing algorithm, we propose a number of schemes for storing auxiliary views that enable consistent and efficient lineage tracing in a multisource data warehouse. We report on a performance study of the various schemes, identifying which schemes perform best in which settings. Based on our results, we have implemented a lineage tracing package in the WHIPS data warehousing system prototype at Stanford. With this package, users can select view tuples of interest, then efficiently "drill down" to examine the source data that produced them. 1 Introduction Data warehousing systems collect data from multiple distributed sources, integrate the information as materialized v..

    Tracing the Lineage of View Data in a Warehousing Environment

    We consider the view data lineage problem in a warehousing environment: For a given data item in a materialized warehouse view, we want to identify the set of source data items that produced the view item. We formally define the lineage problem, develop lineage tracing algorithms for relational views with aggregation, and propose mechanisms for performing consistent lineage tracing in a multi-source data warehousing environment. Our results can form the basis of a tool that allows analysts to browse warehouse data, select view tuples of interest, then "drill-through" to examine the exact source tuples that produced the view tuples of interest. 1 Introduction In a data warehousing system, materialized views over source data are defined, computed, and stored in the warehouse to answer queries about the source data (which may be stored in distributed and legacy systems) in an integrated and efficient way [CD97, Wid95]. Typically, on-line analytical processing and mining (OLAP and..
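
    The view data lineage problem described in these abstracts can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not taken from the papers: the SALES table, the MIN_AMOUNT filter, and the functions build_view and trace_lineage are hypothetical. The view is a simple select-then-aggregate over one source table, and the lineage of a view tuple is taken to be the source rows that pass the filter and fall into that tuple's group.

```python
from collections import defaultdict

# Hypothetical source table of (store, item, amount) rows; not from the papers.
SALES = [
    ("north", "pen", 5),
    ("north", "ink", 20),
    ("south", "pen", 7),
    ("south", "ink", 1),
    ("north", "pad", 2),
]

MIN_AMOUNT = 3  # hypothetical selection predicate: keep rows with amount >= 3


def build_view(rows):
    """Materialize the view: total amount per store over rows passing the filter."""
    totals = defaultdict(int)
    for store, _item, amount in rows:
        if amount >= MIN_AMOUNT:
            totals[store] += amount
    return dict(totals)


def trace_lineage(rows, store):
    """Return the source rows that contributed to the view tuple for `store`."""
    return [r for r in rows if r[0] == store and r[2] >= MIN_AMOUNT]


view = build_view(SALES)
north_lineage = trace_lineage(SALES, "north")
```

    In this sketch the lineage is recomputed by rescanning the source table. The abstracts above instead discuss storing auxiliary views so that tracing stays efficient and consistent when the sources are remote or change over time.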