
    Using schema transformation pathways for data lineage tracing

    With the increasing amount and diversity of information available on the Internet, there has been a huge growth in information systems that need to integrate data from distributed, heterogeneous data sources. Tracing the lineage of the integrated data is one of the problems being addressed in data warehousing research. This paper presents a data lineage tracing approach based on schema transformation pathways. Our approach is not limited to one specific data model or query language, and would be useful in any data transformation/integration framework based on sequences of primitive schema transformations.
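    The paper's formalism is not reproduced in this abstract, so the following is only a minimal sketch of the backward-tracing idea, assuming each primitive transformation records which source constructs its output was derived from; all names are illustrative, not from the paper.

```python
# A minimal sketch of lineage tracing over a schema transformation pathway,
# assuming each primitive step records the source construct(s) its output
# was derived from. All names here are illustrative.

from dataclasses import dataclass, field

@dataclass
class Transformation:
    """One primitive schema transformation in the pathway."""
    name: str                                   # e.g. "renameRel", "addRel"
    output: str                                 # construct produced by this step
    inputs: list = field(default_factory=list)  # constructs it was derived from

def trace_lineage(construct: str, pathway: list) -> set:
    """Walk the pathway backwards, expanding each construct into the
    source constructs it was derived from."""
    frontier = {construct}
    for step in reversed(pathway):
        if step.output in frontier:
            frontier.discard(step.output)
            frontier.update(step.inputs)
    return frontier  # constructs of the original source schema(s)

pathway = [
    Transformation("addRel", "CustOrders", ["Customers", "Orders"]),
    Transformation("renameRel", "Sales", ["CustOrders"]),
]
print(trace_lineage("Sales", pathway))  # {'Customers', 'Orders'}
```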

    Using domain ontologies to help track data provenance

    Outline: motivating example; POESIA ontologies and ontological coverages; ontological estimation of data provenance; ontological nets for data integration; data integration operators; data reconciling through articulation of ontologies; semantic workflows; related work; conclusions.

    A framework for detecting unnecessary industrial data in ETL processes

    Extract, transform and load (ETL) is a critical process used by industrial organisations to shift data from one database to another, such as from an operational system to a data warehouse. With the increasing amount of data stored by industrial organisations, some ETL processes can take in excess of 12 hours to complete; this can leave decision makers stranded while they wait for the data needed to support their decisions. After the ETL processes have been designed, data requirements inevitably change, and much of the data that goes through the ETL process may never be used or needed. This paper therefore proposes a framework for dynamically detecting and predicting unnecessary data and preventing it from slowing down ETL processes, either by removing it entirely or by deprioritising it. Other advantages of the framework include being able to prioritise data-cleansing tasks and to determine which data should be processed first and placed into fast-access memory. We show existing example algorithms that can be used for each component of the framework, and present some initial testing results as part of our research to determine whether the framework can help to reduce ETL time.
    This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/INDIN.2014.694555
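    The paper leaves the choice of detection algorithm open, so the following is only a hedged illustration of one possible usage-based detector; the query log, column names and threshold below are invented. Source columns are scored by how often downstream queries reference them, and unused ones are deferred or dropped from the load.

```python
# A minimal sketch of usage-based detection of unnecessary ETL data,
# assuming a log of queries run against the warehouse is available.
# Names and the threshold are illustrative, not from the paper.

import re
from collections import Counter

def column_usage(query_log: list, columns: list) -> Counter:
    """Count how often each known column name appears in the query log."""
    usage = Counter()
    for query in query_log:
        for col in columns:
            if re.search(rf"\b{re.escape(col)}\b", query, re.IGNORECASE):
                usage[col] += 1
    return usage

def partition_columns(columns, usage, threshold=1):
    """Split columns into those worth loading now and those to
    deprioritise or drop from the ETL run."""
    keep = [c for c in columns if usage[c] >= threshold]
    defer = [c for c in columns if usage[c] < threshold]
    return keep, defer

log = ["SELECT customer_id, total FROM sales", "SELECT total FROM sales"]
cols = ["customer_id", "total", "fax_number"]
keep, defer = partition_columns(cols, column_usage(log, cols))
print(keep, defer)  # ['customer_id', 'total'] ['fax_number']
```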

    A generic procedure for the integration testing of ETL procedures

    In order to attain a certain degree of confidence in the quality of the data in the data warehouse, it is necessary to perform a series of tests. There are many components (and aspects) of the data warehouse that can be tested; in this paper we focus on the ETL procedures. Due to the complexity of the ETL process, ETL procedure tests are usually custom written and have a very low level of reusability. In this paper we address this issue and work towards establishing a generic procedure for the integration testing of certain aspects of ETL procedures. In this approach, ETL procedures are treated as a black box and are tested by comparing their inputs and outputs: datasets. Datasets from three locations are compared: datasets from the relational source(s), datasets from the staging area, and datasets from the data warehouse. The proposed procedure is generic and can be implemented on any data warehouse employing a dimensional model and having relational database(s) as a source. Our work pertains only to certain aspects of the data quality problems that can be found in DW systems. It provides a basic testing foundation or augments an existing data warehouse system's testing capabilities. We comment on the proposed mechanisms in terms of both full reload and incremental loading.
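    A minimal sketch of this black-box comparison, assuming the same logical dataset can be pulled from each of the three locations (row formats and names below are invented): each dataset is reduced to a row count plus an order-independent checksum, and the three fingerprints are compared.

```python
# A minimal sketch of the black-box dataset comparison: the same logical
# dataset from the source, the staging area and the warehouse is reduced
# to a fingerprint, and the fingerprints are compared. Row formats are
# placeholders, not from the paper.

import hashlib

def dataset_fingerprint(rows) -> tuple:
    """Row count plus an order-independent checksum of the rows."""
    digest = 0
    count = 0
    for row in rows:
        count += 1
        h = hashlib.sha256(repr(row).encode()).digest()
        digest ^= int.from_bytes(h[:8], "big")  # XOR is order-independent
    return count, format(digest, "016x")

def compare_locations(source_rows, staging_rows, dw_rows) -> bool:
    fps = [dataset_fingerprint(r) for r in (source_rows, staging_rows, dw_rows)]
    return fps[0] == fps[1] == fps[2]

src = [("c1", 10), ("c2", 20)]
stg = [("c2", 20), ("c1", 10)]   # different order, same content
dwh = [("c1", 10), ("c2", 20)]
print(compare_locations(src, stg, dwh))  # True
```

    Under full reload the comparison runs over the complete datasets; under incremental loading the same check could be restricted to the newly loaded delta.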

    bdbms -- A Database Management System for Biological Data

    Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) annotation and provenance management, including storage, indexing, manipulation, and querying of annotations and provenance as first-class objects in bdbms; (2) local dependency tracking, to track the dependencies and derivations among data items; (3) update authorization, to support data curation via content-based rather than identity-based authorization; and (4) new access methods and their supporting operators for pattern matching on various types of compressed biological data. This paper presents the design of bdbms along with the techniques proposed to support these functionalities, including an extension to SQL. We also outline some open issues in building bdbms.
    Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA.
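    The paper's SQL extension is not shown in this abstract, so the following is only a rough Python illustration of the local dependency tracking idea, with invented item names: when a source item changes, everything derived from it is transitively flagged as outdated.

```python
# A hedged sketch of local dependency tracking among data items: each
# derived item registers its sources, and an update to a source flags
# all (transitive) derivations as outdated. Item names are invented.

from collections import defaultdict

class DependencyTracker:
    def __init__(self):
        self.derived_from = defaultdict(set)  # item -> items it depends on
        self.outdated = set()

    def register(self, item: str, sources: set) -> None:
        self.derived_from[item] = set(sources)

    def notify_update(self, changed: str) -> None:
        """Mark every item that (transitively) depends on `changed`."""
        for item, sources in self.derived_from.items():
            if changed in sources and item not in self.outdated:
                self.outdated.add(item)
                self.notify_update(item)  # propagate transitively

t = DependencyTracker()
t.register("gene_annotation_42", {"sequence_7"})
t.register("pathway_summary", {"gene_annotation_42"})
t.notify_update("sequence_7")
print(t.outdated)  # {'gene_annotation_42', 'pathway_summary'}
```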

    Discovering data lineage in a data warehouse: methods and techniques for tracing the origins of data in a data warehouse

    A data warehouse enables enterprise-wide analysis and reporting functionality that is usually used to support decision-making. A data warehousing system integrates data from different data sources. Typically, the data are extracted from different data sources, then transformed several times and integrated before they are finally stored in the central repository. The extraction and transformation processes vary widely, both in theory and between solution providers. Some are generic; others are tailored to users' transformation and reporting requirements through hand-coded solutions. Most research related to data integration is focused on this area, i.e., on the transformation of data. Since data in a data warehouse undergo various complex transformation processes, often at many different levels and in many stages, it is very important to be able to ensure the quality of the data that the data warehouse contains. The objective of this thesis is to study and compare existing approaches (methods and techniques) for tracing data lineage, and to propose a data lineage solution specific to a business enterprise data warehouse.

    Impliance: A Next Generation Information Management Appliance

    While the database management system has been remarkably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems? In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely the ideas of: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data - unstructured as well as structured - in a uniform way; (c) achieving scale-out by exploiting simple, massively parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises.
    Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA.

    Ontology for Application Development

    The chapter describes the process of ontology development for different subject domains in application design. It analyses existing approaches to ontology development for realising software platforms in several subject domains. An example is given of developing an ontology model for a telecom operator's billing system based on description logic. For designing the ontology model, it is proposed to use two formal theories, description logic and set theory, which make it possible to systematise data and knowledge, to organise search and navigation, and to describe informational and computational resources according to meta-notion standards.
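    As a loose, hypothetical illustration of the description-logic side of such a model (the billing concepts below are invented, not taken from the chapter), a small terminology can be represented as named concepts with subsumption axioms plus a subsumption check.

```python
# A hedged sketch of a tiny description-logic terminology (TBox):
# named concepts, subsumption axioms, and a subsumption query.
# The billing concepts are illustrative only.

class TBox:
    """A tiny terminology: concept -> set of direct superconcepts."""
    def __init__(self):
        self.parents = {}

    def subclass_of(self, concept: str, parent: str) -> None:
        self.parents.setdefault(concept, set()).add(parent)

    def subsumes(self, general: str, specific: str) -> bool:
        """Does `general` subsume `specific` (specific is-a general)?"""
        if general == specific:
            return True
        return any(self.subsumes(general, p)
                   for p in self.parents.get(specific, ()))

tbox = TBox()
tbox.subclass_of("PrepaidAccount", "Account")
tbox.subclass_of("Account", "BillingEntity")
print(tbox.subsumes("BillingEntity", "PrepaidAccount"))  # True
```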

    The POESIA approach to data and service integration on the Semantic Web

    Advisor: Claudia Bauzer Medeiros. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação.
    POESIA (Processes for Open-Ended Systems for Information Analysis), the approach proposed in this work, supports the construction of complex processes that involve the integration and analysis of data from several sources, particularly in scientific applications. This approach is centered on two types of Semantic Web mechanisms: scientific workflows, to specify and compose Web services; and domain ontologies, to enable semantic interoperability and management of data and processes. The main contributions of this thesis are: (i) a theoretical framework to describe, discover and compose data and services on the Web, including rules to check the semantic consistency of resource compositions; (ii) ontology-based methods to help data integration and estimate data provenance in cooperative processes on the Web; (iii) partial implementation and validation of the proposal in a real application in the domain of agricultural planning, analysing the benefits and the efficiency and scalability limitations of current Semantic Web technology when faced with large volumes of data.
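    As a hedged sketch of the consistency rules from contribution (i), the example below checks that composed Web services fit together semantically by verifying that each service's consumed ontology terms are available at its point in the workflow; the service names and terms are invented, chosen only to echo the agricultural-planning validation domain.

```python
# A minimal sketch of composition consistency checking, assuming each
# Web service advertises the ontology terms it consumes and produces.
# Services and terms are illustrative, not from the thesis.

from dataclasses import dataclass

@dataclass
class Service:
    name: str
    consumes: frozenset
    produces: frozenset

def composition_is_consistent(pipeline: list, initial_terms: frozenset) -> bool:
    """Check that every service's inputs are covered by the terms
    available at that point in the pipeline."""
    available = set(initial_terms)
    for svc in pipeline:
        if not svc.consumes <= available:
            return False  # semantic mismatch at this step
        available |= svc.produces
    return True

soil = Service("SoilMap", frozenset({"Region"}), frozenset({"SoilType"}))
crop = Service("CropPlanner", frozenset({"SoilType", "Climate"}),
               frozenset({"CropPlan"}))
print(composition_is_consistent([soil, crop],
                                frozenset({"Region", "Climate"})))  # True
```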