630 research outputs found
Lineage tracing in data warehousing systems : a design and implementation
Data warehouse, as the foundation of decision support system, is critical for the managers to make decisions. It is different with operational database. Data warehouse reads data from multiple operational databases instead of getting the data from the end user transaction input. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. Enabling lineage tracing in a data warehouse environment has several benefits and applications, including in-depth data analysis and data mining, authorization management, efficient warehouse recovery, etc. In this report, we firstly introduce the basic concept and architecture of data warehouse, as well as the development tools and methods about data warehouse. Secondly, we discuss the lineage tracing problems and challenges in the data warehousing system, and then use an example to present the algorithms and procedure of lineage tracing. As well, we will present our design and implementation of a prototype system called LTI, to demonstrate the lineage tracing procedures using an inventory system as a data warehouse system. We also developed various graphical user interfaces required to facilitate interacting with the system in order to update the source databases in the LTI system. Finally, we will show the experimentation of using our LTI system through tracing inventory and sales order data in the data warehouse syste
Using schema transformation pathways for data lineage tracing
With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. Tracing the lineage of the integrated data is one of the problems being addressed in data warehousing research. This paper presents a data lineage tracing approach based on schema transformation pathways. Our approach is not limited to one specific data model or query language, and would be useful in any data transformation/integration framework based on sequences of primitive schema transformations
Općeniti postupak za i ntegracijsko testiranje ETL procedura
In order to attain a certain degree of confidence in the quality of the data in the data warehouse it is necessary to perform a series of tests. There are many components (and aspects) of the data warehouse that can be tested, and in this paper we focus on the ETL procedures. Due to the complexity of ETL process, ETL procedure tests are usually custom written, having a very low level of reusability. In this paper we address this issue and work towards establishing a generic procedure for integration testing of certain aspects of ETL procedures. In this approach, ETL procedures are treated as a black box and are tested by comparing their inputs and outputs – datasets. Datasets from three locations are compared: datasets from the relational source(s), datasets from the staging area and datasets from the data warehouse. Proposed procedure is generic and can be implemented on any data warehouse employing dimensional model and having relational database(s) as a source. Our work pertains only to certain aspects of data quality problems that can be found in DW systems. It provides a basic testing foundation or augments existing data warehouse system’s testing capabilities. We comment on proposed mechanisms both in terms of full reload and incremental loading.Kako bi se ostvarila određena razina povjerenja u kvalitetu podataka potrebno je obaviti niz provjera. Postoje brojne komponente (i aspekti) skladišta podataka koji se mogu testirati. U ovom radu smo se usredotočili na testiranje ETL procedura. S obzirom na složenost sustava skladišta podataka, testovi ETL procedura se pišu posebno za svako skladište podataka i rijetko se mogu ponovo upotrebljavati. Ovdje se obrađuje taj problem i predlaže općenita procedura za integracijsko testiranje određ enih aspekata ETL procedura. Predloženi pristup tretira ETL procedure kao crnu kutiju, te se procedure testiraju tako što se uspoređuju ulazni i izlazni skupovi podataka. Uspoređuju se skupovi podataka s tri lokacije: podaci iz izvorišta podataka, podaci iz konsolidiranog pripremnog područja te podaci iz skladišta podataka. Predložena procedura je općenita i može se primijeniti na bilo koje skladište podatka koje koristi dimenzijski model pri čemu podatke dobavlja iz relacijskih baza podataka. Predložene provjere se odnose samo na određene aspekte problema kvalitete podataka koji se mogu pojaviti u sustavu skladišta podataka, te služe za uspostavljanje osnovnog skupa provjera ili uvećanje mogućnosti provjere postojećih sustava. Predloženi postupak se komentira u kontekstu potpunog i inkrementalnog učitavanja podataka u skladište podataka
Discovering data lineage in data warehouse : methods and techniques for tracing the origins of data in data-warehouse
A data warehouse enables enterprise-wide analysis and reporting functionality that is usually used to support decision-making. Data warehousing system integrates data from different data sources. Typically, the data are extracted from different data sources, then transformed several times and integrated before they are finally
stored in the central repository. The extraction and transformation processes vary widely - both in theory and between solution providers. Some are generic, others are tailored to users' transformation and reporting requirements through hand-coded solutions. Most research related to data integration is focused on this area, i.e., on the transformation of data. Since data in a data warehouse undergo various complex transformation processes, often at many different levels and in many stages, it is very important to be able to ensure the quality of the data that the data warehouse
contains. The objective of this thesis is to study and compare existing approaches (methods and techniques) for tracing data lineage, and to propose a data lineage solution specific to a business enterprise data warehouse
Causality and the semantics of provenance
Provenance, or information about the sources, derivation, custody or history
of data, has been studied recently in a number of contexts, including
databases, scientific workflows and the Semantic Web. Many provenance
mechanisms have been developed, motivated by informal notions such as
influence, dependence, explanation and causality. However, there has been
little study of whether these mechanisms formally satisfy appropriate policies
or even how to formalize relevant motivating concepts such as causality. We
contend that mathematical models of these concepts are needed to justify and
compare provenance techniques. In this paper we review a theory of causality
based on structural models that has been developed in artificial intelligence,
and describe work in progress on a causal semantics for provenance graphs.Comment: Workshop submissio
Measuring Data Believability: A Provenance Approach
Data quality is crucial for operational efficiency
and sound decision making. This paper focuses on
believability, a major aspect of quality, measured
along three dimensions: trustworthiness,
reasonableness, and temporality. We ground our
approach on provenance, i.e. the origin and
subsequent processing history of data. We present our
provenance model and our approach for computing
believability based on provenance metadata. The
approach is structured into three increasingly complex
building blocks: (1) definition of metrics for assessing
the believability of data sources, (2) definition of
metrics for assessing the believability of data resulting
from one process run and (3) assessment of
believability based on all the sources and processing
history of data. We illustrate our approach with a
scenario based on Internet data. To our knowledge,
this is the first work to develop a precise approach to
measuring data believability and making explicit use of
provenance-based measurements
Database Queries that Explain their Work
Provenance for database queries or scientific workflows is often motivated as
providing explanation, increasing understanding of the underlying data sources
and processes used to compute the query, and reproducibility, the capability to
recompute the results on different inputs, possibly specialized to a part of
the output. Many provenance systems claim to provide such capabilities;
however, most lack formal definitions or guarantees of these properties, while
others provide formal guarantees only for relatively limited classes of
changes. Building on recent work on provenance traces and slicing for
functional programming languages, we introduce a detailed tracing model of
provenance for multiset-valued Nested Relational Calculus, define trace slicing
algorithms that extract subtraces needed to explain or recompute specific parts
of the output, and define query slicing and differencing techniques that
support explanation. We state and prove correctness properties for these
techniques and present a proof-of-concept implementation in Haskell.Comment: PPDP 201
A Brief Tour through Provenance in Scientific Workflows and Databases
Within computer science, the term provenance has multiple meanings, due to different motivations, perspectives, and assumptions prevalent in the respective communities. This chapter provides a high-level “sightseeing tour” of some of those different notions and uses of provenance in scientific workflows and databases.Ope
- …