
630 research outputs found

Lineage tracing in data warehousing systems : a design and implementation

Author: Xu Jiu
Publication date: 01/01/2003

A data warehouse, as the foundation of a decision support system, is critical for managers making decisions. It differs from an operational database: a data warehouse reads data from multiple operational databases rather than from end-user transaction input. In a warehousing environment, the data lineage problem is that of tracing warehouse data items back to the original source items from which they were derived. Enabling lineage tracing in a data warehouse environment has several benefits and applications, including in-depth data analysis and data mining, authorization management, and efficient warehouse recovery. In this report, we first introduce the basic concepts and architecture of data warehouses, as well as the tools and methods used to develop them. Second, we discuss the problems and challenges of lineage tracing in data warehousing systems, and then use an example to present the algorithms and procedure of lineage tracing. We also present the design and implementation of a prototype system called LTI, which demonstrates the lineage tracing procedures using an inventory system as the data warehouse system. In addition, we developed the graphical user interfaces needed to interact with the system in order to update the source databases in the LTI system. Finally, we describe experiments with the LTI system that trace inventory and sales order data in the data warehouse system.

Concordia University Research Repository

Using schema transformation pathways for data lineage tracing

Author: A. Woodruff
C. Faloutsos
H. Fan
J. Albert
L. Zamboulis
M. Boyd
P. Buneman
P. McBrien
P.A. Bernstein
Y. Cui
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. Tracing the lineage of the integrated data is one of the problems being addressed in data warehousing research. This paper presents a data lineage tracing approach based on schema transformation pathways. Our approach is not limited to one specific data model or query language, and would be useful in any data transformation/integration framework based on sequences of primitive schema transformations.

CiteSeerX

Crossref

Birkbeck Institutional Research Online

A generic procedure for integration testing of ETL procedures (Općeniti postupak za integracijsko testiranje ETL procedura)

Author: Igor Mekterović
Ljiljana Brkić
Mirta Baranović
Publication venue: KoREMA - Croatian Society for Communications, Computing, Electronics, Measurement and Control
Publication date: 01/01/2011

To attain a certain degree of confidence in the quality of the data in a data warehouse, it is necessary to perform a series of tests. Many components (and aspects) of the data warehouse can be tested; in this paper we focus on the ETL procedures. Because of the complexity of the ETL process, ETL procedure tests are usually custom written and have a very low level of reusability. We address this issue and work towards establishing a generic procedure for integration testing of certain aspects of ETL procedures. In this approach, ETL procedures are treated as a black box and are tested by comparing their input and output datasets. Datasets from three locations are compared: datasets from the relational source(s), datasets from the staging area, and datasets from the data warehouse. The proposed procedure is generic and can be implemented on any data warehouse that employs a dimensional model and has relational database(s) as a source. Our work pertains only to certain aspects of the data quality problems that can be found in DW systems. It provides a basic testing foundation or augments an existing data warehouse system's testing capabilities. We comment on the proposed mechanisms in terms of both full reload and incremental loading.

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Discovering data lineage in data warehouse : methods and techniques for tracing the origins of data in data-warehouse

Author: Webjørnsen Roselie Bandibas
Publication date: 01/01/2005

A data warehouse enables enterprise-wide analysis and reporting functionality that is usually used to support decision-making. A data warehousing system integrates data from different data sources. Typically, the data are extracted from different data sources, then transformed several times and integrated before they are finally stored in the central repository. The extraction and transformation processes vary widely, both in theory and between solution providers. Some are generic; others are tailored to users' transformation and reporting requirements through hand-coded solutions. Most research related to data integration is focused on this area, i.e., on the transformation of data. Since data in a data warehouse undergo various complex transformation processes, often at many different levels and in many stages, it is very important to be able to ensure the quality of the data that the data warehouse contains. The objective of this thesis is to study and compare existing approaches (methods and techniques) for tracing data lineage, and to propose a data lineage solution specific to a business enterprise data warehouse.

NORA - Norwegian Open Research Archives

Improving business intelligence traceability and accountability: an integrated framework of BI product and metacontent map

Author: Chee Chin-Hoong
Gao Shijia
Richards G
Yeoh William
Publication venue: 'IGI Global'
Publication date: 01/09/2014

Deakin Research Online

Causality and the semantics of provenance

Provenance, or information about the sources, derivation, custody or history of data, has been studied recently in a number of contexts, including databases, scientific workflows and the Semantic Web. Many provenance mechanisms have been developed, motivated by informal notions such as influence, dependence, explanation and causality. However, there has been little study of whether these mechanisms formally satisfy appropriate policies or even how to formalize relevant motivating concepts such as causality. We contend that mathematical models of these concepts are needed to justify and compare provenance techniques. In this paper we review a theory of causality based on structural models that has been developed in artificial intelligence, and describe work in progress on a causal semantics for provenance graphs. Comment: Workshop submission

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

Liquid: unifying nearline and offline big data integration

Author: Castro Fernandez R
Koshy J
Kreps J
Lin D
Narkhede N
Pietzuch PR
Rao J
Riccomini C
Wang G
Publication date: 01/10/2014

Spiral - Imperial College Digital Repository

Measuring Data Believability: A Provenance Approach

Author: Madnick Stuart E.
Prat Nicolas
Publication date: 01/01/2007

Data quality is crucial for operational efficiency and sound decision making. This paper focuses on believability, a major aspect of quality, measured along three dimensions: trustworthiness, reasonableness, and temporality. We ground our approach in provenance, i.e., the origin and subsequent processing history of data. We present our provenance model and our approach for computing believability based on provenance metadata. The approach is structured into three increasingly complex building blocks: (1) definition of metrics for assessing the believability of data sources, (2) definition of metrics for assessing the believability of data resulting from one process run, and (3) assessment of believability based on all the sources and processing history of data. We illustrate our approach with a scenario based on Internet data. To our knowledge, this is the first work to develop a precise approach to measuring data believability that makes explicit use of provenance-based measurements.

DSpace@MIT

Crossref

Database Queries that Explain their Work

Author: Acar Umut A.
Ahmed Amal
Cheney James
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2014

Provenance for database queries or scientific workflows is often motivated as providing explanation, increasing understanding of the underlying data sources and processes used to compute the query, and reproducibility, the capability to recompute the results on different inputs, possibly specialized to a part of the output. Many provenance systems claim to provide such capabilities; however, most lack formal definitions or guarantees of these properties, while others provide formal guarantees only for relatively limited classes of changes. Building on recent work on provenance traces and slicing for functional programming languages, we introduce a detailed tracing model of provenance for multiset-valued Nested Relational Calculus, define trace slicing algorithms that extract subtraces needed to explain or recompute specific parts of the output, and define query slicing and differencing techniques that support explanation. We state and prove correctness properties for these techniques and present a proof-of-concept implementation in Haskell. Comment: PPDP 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

INRIA a CCSD electronic archive server

Edinburgh Research Explorer

A Brief Tour through Provenance in Scientific Workflows and Databases

Author: Bertram Ludäscher
Publication date: 03/03/2016

Within computer science, the term provenance has multiple meanings, due to different motivations, perspectives, and assumptions prevalent in the respective communities. This chapter provides a high-level “sightseeing tour” of some of those different notions and uses of provenance in scientific workflows and databases.

Illinois Digital Environment for Access to Learning and Scholarship Repository