
    A framework for detecting unnecessary industrial data in ETL processes

    Extract, transform, and load (ETL) is a critical process used by industrial organisations to move data from one database to another, such as from an operational system to a data warehouse. With the increasing amount of data stored by industrial organisations, some ETL processes can take in excess of 12 hours to complete; this can leave decision makers stranded while they wait for the data needed to support their decisions. After the ETL processes have been designed, data requirements inevitably change, and much of the data that goes through the ETL process may never be used or needed. This paper therefore proposes a framework for dynamically detecting and predicting unnecessary data and preventing it from slowing down ETL processes, either by removing it entirely or by deprioritising it. Other advantages of the framework include the ability to prioritise data cleansing tasks and to determine which data should be processed first and placed into fast-access memory. We show existing example algorithms that can be used for each component of the framework, and present some initial testing results as part of our research to determine whether the framework can help to reduce ETL time. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/INDIN.2014.694555
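    The paper points to existing algorithms for each framework component rather than prescribing one. As a minimal illustration of the core idea (not the authors' implementation), the following Python sketch counts how often each warehouse column appears in a log of analytical queries and partitions the columns into a "hot" set to load first and a "cold" set to deprioritise or drop; the query log, column names, and threshold are hypothetical.

```python
from collections import Counter

def column_usage(query_log, warehouse_columns):
    """Count how often each warehouse column appears in the query log.
    Columns that never appear are candidates for removal or deferral.
    A real implementation would parse the SQL instead of substring-matching."""
    usage = Counter()
    for query in query_log:
        for col in warehouse_columns:
            if col in query.lower():
                usage[col] += 1
    return usage

def partition_columns(query_log, warehouse_columns, threshold=0):
    usage = column_usage(query_log, warehouse_columns)
    hot = [c for c in warehouse_columns if usage[c] > threshold]    # load first
    cold = [c for c in warehouse_columns if usage[c] <= threshold]  # defer or drop
    return hot, cold

# Hypothetical query log: only two of the four columns are ever requested.
log = ["SELECT sales_total, region FROM fact_sales", "SELECT region FROM fact_sales"]
cols = ["sales_total", "region", "operator_id", "sensor_raw"]
print(partition_columns(log, cols))  # (['sales_total', 'region'], ['operator_id', 'sensor_raw'])
```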

    A generic procedure for integration testing of ETL procedures

    In order to attain a certain degree of confidence in the quality of the data in the data warehouse, it is necessary to perform a series of tests. There are many components (and aspects) of the data warehouse that can be tested, and in this paper we focus on the ETL procedures. Due to the complexity of the ETL process, ETL procedure tests are usually custom written, having a very low level of reusability. In this paper we address this issue and work towards establishing a generic procedure for integration testing of certain aspects of ETL procedures. In this approach, ETL procedures are treated as a black box and are tested by comparing their input and output datasets. Datasets from three locations are compared: datasets from the relational source(s), datasets from the staging area, and datasets from the data warehouse. The proposed procedure is generic and can be implemented on any data warehouse employing a dimensional model and having relational database(s) as a source. Our work pertains only to certain aspects of the data quality problems that can be found in DW systems. It provides a basic testing foundation or augments an existing data warehouse system's testing capabilities. We comment on the proposed mechanisms in terms of both full reload and incremental loading.
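    Since the procedure treats ETL as a black box and compares datasets at the source, staging area, and warehouse, its core check can be stated compactly. The sketch below is an assumed realisation, not the paper's code: it compares the sets of business keys seen at each of the three locations. For a full reload all source keys are compared; for incremental loading the same comparison would be restricted to the keys of the current batch.

```python
def dataset_diff(source_rows, staging_rows, warehouse_rows, key):
    """Black-box ETL check: compare business keys across the three locations.
    Keys lost between stages indicate rows dropped by the ETL procedures."""
    src = {r[key] for r in source_rows}
    stg = {r[key] for r in staging_rows}
    dwh = {r[key] for r in warehouse_rows}
    return {
        "lost_in_staging": src - stg,
        "lost_in_warehouse": stg - dwh,
        "unexpected_in_warehouse": dwh - src,
    }

# Hypothetical datasets: order 3 is dropped in staging; order 9 has no source.
source = [{"order_id": 1}, {"order_id": 2}, {"order_id": 3}]
staging = [{"order_id": 1}, {"order_id": 2}]
warehouse = [{"order_id": 1}, {"order_id": 2}, {"order_id": 9}]
print(dataset_diff(source, staging, warehouse, "order_id"))
# {'lost_in_staging': {3}, 'lost_in_warehouse': set(), 'unexpected_in_warehouse': {9}}
```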

    Using domain ontologies to help track data provenance

    Motivating example. POESIA ontologies and ontological coverages. Ontological estimation of data provenance. Ontological nets for data integration. Data integration operators. Data reconciling through articulation of ontologies. Semantic workflows. Related work. Conclusions

    Impliance: A Next Generation Information Management Appliance

    Although database technology has been remarkably successful in building a large market and adapting to the changes of the last three decades, its impact on the broader market of information management is surprisingly limited. If we were to design an information management system from scratch, based upon today's requirements and hardware capabilities, would it look anything like today's database systems? In this paper, we introduce Impliance, a next-generation information management system consisting of hardware and software components integrated to form an easy-to-administer appliance that can store, retrieve, and analyze all types of structured, semi-structured, and unstructured information. We first summarize the trends that will shape information management for the foreseeable future. Those trends imply three major requirements for Impliance: (1) to be able to store, manage, and uniformly query all data, not just structured records; (2) to be able to scale out as the volume of this data grows; and (3) to be simple and robust in operation. We then describe four key ideas that are uniquely combined in Impliance to address these requirements, namely the ideas of: (a) integrating software and off-the-shelf hardware into a generic information appliance; (b) automatically discovering, organizing, and managing all data, unstructured as well as structured, in a uniform way; (c) achieving scale-out by exploiting simple, massively parallel processing; and (d) virtualizing compute and storage resources to unify, simplify, and streamline the management of Impliance. Impliance is an ambitious, long-term effort to define simpler, more robust, and more scalable information systems for tomorrow's enterprises. Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works, and make commercial use of the work, but you must attribute the work to the author and CIDR 2007, 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, USA
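    Idea (b), managing unstructured and structured data uniformly, is the most unusual of the four. The toy Python sketch below is purely illustrative (Impliance is a hardware/software appliance, not a library): it answers a single query term against structured records and free-text documents alike, which is the behaviour requirement (1) above asks for.

```python
# Hypothetical mixed corpus: one structured record, one unstructured document.
records = [{"type": "order", "customer": "ACME", "total": 120}]
documents = ["Email from ACME complaining about order delays."]

def uniform_query(term):
    """Match one term against structured and unstructured data uniformly."""
    term = term.lower()
    hits = [r for r in records if term in str(list(r.values())).lower()]
    hits += [d for d in documents if term in d.lower()]
    return hits

print(uniform_query("acme"))  # returns both the order record and the email
```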

    Architecture for Provenance Systems

    This document covers the logical and process architectures of provenance systems. The logical architecture identifies key roles and their interactions, whereas the process architecture discusses distribution and security. A fundamental aspect of our presentation is its technology-independent nature, which makes it reusable: the principles that are exposed in this document may be applied to different technologies
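    Because the document is deliberately technology-independent, any code can only be one hypothetical realisation of its roles. The Python sketch below assumes two of the roles described above, asserting actors and a provenance store: actors record assertions about the interactions they take part in, and the store can later be queried for the history of a data item. All names are illustrative.

```python
from dataclasses import dataclass, field

@dataclass
class ProvenanceStore:
    """Hypothetical 'provenance store' role: a queryable log of assertions
    that actors make about the interactions they take part in."""
    assertions: list = field(default_factory=list)

    def record(self, sender, receiver, data_item, description):
        self.assertions.append({"sender": sender, "receiver": receiver,
                                "data_item": data_item, "description": description})

    def history_of(self, data_item):
        """Return every recorded interaction that mentions the data item."""
        return [a for a in self.assertions if a["data_item"] == data_item]

store = ProvenanceStore()
store.record("extractor", "cleanser", "customer.csv", "raw extract handed over")
store.record("cleanser", "loader", "customer.csv", "nulls repaired, duplicates removed")
for step in store.history_of("customer.csv"):
    print(step["sender"], "->", step["receiver"], ":", step["description"])
```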

    Discovering data lineage in data warehouses: methods and techniques for tracing the origins of data in a data warehouse

    A data warehouse enables enterprise-wide analysis and reporting functionality that is usually used to support decision-making. A data warehousing system integrates data from different data sources. Typically, the data are extracted from different data sources, then transformed several times and integrated before they are finally stored in the central repository. The extraction and transformation processes vary widely, both in theory and between solution providers. Some are generic, others are tailored to users' transformation and reporting requirements through hand-coded solutions. Most research related to data integration is focused on this area, i.e., on the transformation of data. Since data in a data warehouse undergo various complex transformation processes, often at many different levels and in many stages, it is very important to be able to ensure the quality of the data that the data warehouse contains. The objective of this thesis is to study and compare existing approaches (methods and techniques) for tracing data lineage, and to propose a data lineage solution specific to a business enterprise data warehouse
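    One standard family of lineage techniques is annotation-based tracing, where each row carries a trail of the transformations applied to it. As a minimal sketch of the general idea (not the thesis's proposed solution), the Python below wraps rows with a lineage list that every transformation step appends to; the step and source names are hypothetical.

```python
def with_lineage(rows, source_name):
    """Wrap each extracted row with a lineage trail naming its origin."""
    return [{"data": r, "lineage": [source_name]} for r in rows]

def transform(rows, fn, step_name):
    """Apply a transformation while appending the step to each row's trail."""
    return [{"data": fn(r["data"]), "lineage": r["lineage"] + [step_name]}
            for r in rows]

rows = with_lineage([{"amount": "10"}, {"amount": "25"}], "crm.orders")
rows = transform(rows, lambda d: {"amount": int(d["amount"])}, "cast_amount_to_int")
rows = transform(rows, lambda d: {"amount": d["amount"] * 100}, "convert_to_cents")
print(rows[0]["lineage"])  # ['crm.orders', 'cast_amount_to_int', 'convert_to_cents']
```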

    The POESIA approach to data and service integration on the Semantic Web

    Advisor: Claudia Bauzer Medeiros. Doctoral thesis, Universidade Estadual de Campinas, Instituto de Computação. POESIA (Processes for Open-Ended Systems for Information Analysis), the approach proposed in this work, supports the construction of complex processes that involve the integration and analysis of data from several sources, particularly in scientific applications. This approach is centered on two types of semantic Web mechanisms: scientific workflows, to specify and compose Web services; and domain ontologies, to enable semantic interoperability and management of data and processes. The main contributions of this thesis are: (i) a theoretical framework to describe, discover, and compose data and services on the Web, including rules to check the semantic consistency of compositions of these resources; (ii) methods based on domain ontologies to help data integration and estimate data provenance in cooperative processes on the Web; (iii) partial implementation and validation of the proposal in a real application in the domain of agricultural planning, analyzing the benefits and the efficiency and scalability limitations of current semantic Web technology when faced with large volumes of data
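    Contribution (i) includes rules for checking the semantic consistency of compositions. A minimal sketch of one such rule, under assumptions not taken from the thesis (a toy is-a ontology and the rule that a producer's output term must be the consumer's input term or a specialisation of it), might look as follows:

```python
# Toy is-a ontology: each term maps to its more general parent term.
ONTOLOGY = {"rainfall_mm": "precipitation", "precipitation": "weather_data",
            "soil_ph": "soil_data"}

def generalisations(term):
    """Yield the term and all of its ancestors in the ontology."""
    while term is not None:
        yield term
        term = ONTOLOGY.get(term)

def compatible(output_term, input_term):
    """Consistency rule: the producer's output term must be the consumer's
    input term or one of its specialisations."""
    return input_term in set(generalisations(output_term))

print(compatible("rainfall_mm", "weather_data"))  # True: rainfall is weather data
print(compatible("soil_ph", "weather_data"))      # False: inconsistent composition
```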

    Improving Organizational Decision Making Using a SAF-T based Business Intelligence System

    Today, companies need to adapt quickly to business changes and react to customers' tendencies and market demands in an unpredictable environment. In this field, analytical systems represent an important asset that every company should have and use. Data Warehousing Systems (DWS) support companies' analytical needs; however, the development and integration of the data systems is a critical part. Due to the specificities of the data involved, each DWS is unique, which compromises the use of reusable components or even of pre-built solutions. In this paper, we propose a standard skeleton for a DWS based on Portuguese audit tax documents (SAF-T (PT)). These documents represent a standardized procedure for every Portuguese company, providing the necessary data about billing, accounting, and taxation. Thus, they can provide the foundations for a standard data representation from which to create a DWS that can subsequently be explored by analytical techniques to generate useful insights
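    SAF-T (PT) files are XML audit files, so the first step of such a DWS is flattening their documents into fact rows. The Python sketch below shows the idea on a heavily simplified invoice fragment; real SAF-T (PT) files follow the official schema and namespace and carry many more elements, so the element names here should be read as illustrative.

```python
import xml.etree.ElementTree as ET

# Heavily simplified SAF-T (PT)-like fragment, for illustration only.
SAFT_SAMPLE = """
<AuditFile>
  <SourceDocuments><SalesInvoices>
    <Invoice>
      <InvoiceNo>FT 2023/001</InvoiceNo>
      <InvoiceDate>2023-01-15</InvoiceDate>
      <CustomerID>C042</CustomerID>
      <DocumentTotals><GrossTotal>123.00</GrossTotal></DocumentTotals>
    </Invoice>
  </SalesInvoices></SourceDocuments>
</AuditFile>
"""

def invoices_to_fact_rows(xml_text):
    """Flatten sales invoices into rows for a hypothetical billing fact table."""
    root = ET.fromstring(xml_text)
    return [{
        "invoice_no": inv.findtext("InvoiceNo"),
        "date_key": inv.findtext("InvoiceDate"),
        "customer_key": inv.findtext("CustomerID"),
        "gross_total": float(inv.findtext("DocumentTotals/GrossTotal")),
    } for inv in root.iter("Invoice")]

print(invoices_to_fact_rows(SAFT_SAMPLE))
```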

    Knowledge visualizations: a tool to achieve optimized operational decision making and data integration

    The overabundance of data created by modern information systems (IS) has led to a breakdown in cognitive decision-making. Without authoritative source data, commanders' decision-making processes are hindered as they attempt to paint an accurate shared operational picture (SOP). Further impeding the decision-making process is the lack of proper interface interaction to provide a visualization that aids in the extraction of the most relevant and accurate data. Utilizing a decision support system (DSS) to present visualizations based on OLAP-cube-integrated data allows decision-makers to rapidly glean information and build their situation awareness (SA). This yields a competitive advantage to the organization, whether in garrison or in combat. Additionally, OLAP cube data integration enables analysis to be performed on an organization's data flows. This analysis is used to identify the critical path of data throughout the organization. Linking a decision-maker to the authoritative data along this critical path eliminates the many decision layers in a hierarchical command structure that can introduce latency or error into the decision-making process. Furthermore, the organization gains an integrated SOP from which to rapidly build SA and make effective and efficient decisions. http://archive.org/details/knowledgevisuali1094545877 Outstanding Thesis. Major, United States Marine Corps; Captain, United States Marine Corps. Approved for public release; distribution is unlimited
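    The OLAP cube integration the thesis relies on boils down to aggregating a fact table along chosen dimensions. As a minimal, hypothetical sketch (the fact rows, dimension names, and measure below are invented for illustration), one cube view can be computed like this:

```python
from collections import defaultdict

def roll_up(facts, dims, measure):
    """Aggregate a fact table along the chosen dimensions (one OLAP cube view)."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)

# Invented example: decision latency per reporting unit.
facts = [
    {"unit": "alpha", "report": "fuel", "hours_to_decision": 4.0},
    {"unit": "alpha", "report": "ammo", "hours_to_decision": 2.5},
    {"unit": "bravo", "report": "fuel", "hours_to_decision": 6.0},
]
print(roll_up(facts, ["unit"], "hours_to_decision"))
# {('alpha',): 6.5, ('bravo',): 6.0}
```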

    A Social Media Tax Data Warehouse to Manage the Underground Economy

    © 2018 IEEE. Social media can provide a wealth of information valuable to tax administrators in managing the underground economy. This paper proposes a data warehouse design to integrate social media data into tax analytics processes. The warehouse is designed to support modern tax administration strategies that encourage self-regulation and voluntary compliance by shaping public opinion, improving services and developing inclusive tax policies. The warehouse also incorporates the use of social media analytics to support tax evasion detection and enforcement activities such as compliance risk assessment, audits, inspections and investigations
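    The paper specifies a warehouse design rather than an implementation, but the compliance risk assessment it mentions can be sketched as a simple aggregation over such a warehouse. In the hypothetical Python below, social media signal facts are summed per taxpayer and checked against a declared-activity dimension; every signal name and weight is invented for illustration.

```python
# Hypothetical fact rows: social media signals linked to taxpayer keys.
posts = [
    {"taxpayer_id": "T1", "signal": "advertises_services", "weight": 0.6},
    {"taxpayer_id": "T1", "signal": "cash_only_posts", "weight": 0.3},
    {"taxpayer_id": "T2", "signal": "advertises_services", "weight": 0.6},
]
declared = {"T1": False, "T2": True}  # dimension: declared business activity

def compliance_risk(posts, declared):
    """Sum signal weights per taxpayer; declared activity clears the score."""
    risk = {}
    for p in posts:
        risk[p["taxpayer_id"]] = risk.get(p["taxpayer_id"], 0.0) + p["weight"]
    return {t: (0.0 if declared.get(t) else round(s, 2)) for t, s in risk.items()}

print(compliance_risk(posts, declared))  # {'T1': 0.9, 'T2': 0.0}
```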