35,542 research outputs found

    A Review of Contemporary Data Quality Issues in Data Warehouse ETL Environment

    In today’s scenario, extraction-transformation-loading (ETL) tools have become important pieces of software responsible for integrating heterogeneous information from several sources. Carrying out the ETL process is potentially complex, hard, and time-consuming. Organisations nowadays are concerned about vast quantities of data, and data quality is concerned with technical issues in the data warehouse environment. Research in the last few decades has laid increasing stress on data quality issues in the data warehouse ETL process. Data quality can be ensured by cleaning the data prior to loading it into the warehouse. Since the data is collected from various sources, it arrives in various formats; standardising these formats and cleaning the data are prerequisites for a clean data warehouse environment. Data quality attributes such as accuracy, correctness, consistency, and timeliness are required for a knowledge discovery process. The purpose of this state-of-the-art research work is to address data quality issues at all stages of data warehousing, namely 1) data sources, 2) data integration, 3) data staging, and 4) data warehouse modelling and schematic design, and to formulate a descriptive classification of their causes. The discovered knowledge is used to repair data deficiencies. This work proposes a framework for the quality of extraction, transformation, and loading of data into a warehouse.
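
    As a toy illustration of the cleaning-before-loading step the abstract argues for, the sketch below standardises heterogeneous date formats and drops records that violate simple completeness and accuracy rules. The field names, source formats, and rules are invented for illustration and are not taken from the paper.

```python
from datetime import datetime

# Toy cleaning step: standardise source date formats and drop records that
# fail simple quality rules before they are loaded into the warehouse.
# Field names, formats, and rules are illustrative assumptions.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y")

def standardise_date(value: str) -> str | None:
    """Try each known source format; return an ISO-8601 date or None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

def clean_record(record: dict) -> dict | None:
    """Return a standardised record, or None if a quality rule is violated."""
    order_date = standardise_date(record.get("order_date", ""))
    if order_date is None:                    # completeness / correctness
        return None
    try:
        amount = float(record.get("amount", ""))
    except (TypeError, ValueError):
        return None
    if amount < 0:                            # accuracy / consistency
        return None
    return {
        "customer_id": str(record.get("customer_id", "")).strip().upper(),
        "order_date": order_date,
        "amount": round(amount, 2),
    }

if __name__ == "__main__":
    raw = [
        {"customer_id": " c042 ", "order_date": "12/01/2023", "amount": "19.9"},
        {"customer_id": "C043", "order_date": "not a date", "amount": "5.0"},
    ]
    cleaned = [clean_record(r) for r in raw]
    print([c for c in cleaned if c is not None])
```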

    A family of experiments to validate measures for UML activity diagrams of ETL processes in data warehouses

    In data warehousing, Extract, Transform, and Load (ETL) processes are in charge of extracting the data from the data sources that will be contained in the data warehouse. Their design and maintenance is thus a cornerstone in any data warehouse development project. Due to their relevance, the quality of these processes should be formally assessed early in development in order to avoid populating the data warehouse with incorrect data. To this end, this paper presents a set of measures with which to evaluate the structural complexity of ETL process models at the conceptual level. The study is accompanied by the application of formal frameworks and by a family of experiments whose aims are, respectively, to theoretically and empirically validate the proposed measures. Our experiments show that the use of these measures can help designers predict the effort associated with the maintenance tasks of ETL processes and make ETL process models more usable. Our work is based on Unified Modeling Language (UML) activity diagrams for modeling ETL processes, and on the Framework for the Modeling and Evaluation of Software Processes (FMESP) for the definition and validation of the measures.
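
    As a rough illustration of what count-based structural complexity measures over a conceptual ETL model might look like, the sketch below computes simple node and edge counts for a toy activity-diagram representation. The measure names (NA, ND, NE), the data structure, and the example diagram are assumptions for illustration, not the measures defined in the paper or in FMESP.

```python
from dataclasses import dataclass

@dataclass
class ActivityDiagram:
    """Toy stand-in for a UML activity diagram of an ETL process."""
    activities: set[str]
    decisions: set[str]                  # decision/merge nodes
    edges: set[tuple[str, str]]          # control/object flows

def structural_measures(d: ActivityDiagram) -> dict[str, int]:
    """Return simple size-based complexity counts for the model."""
    return {
        "NA": len(d.activities),         # number of activity nodes
        "ND": len(d.decisions),          # number of decision nodes
        "NE": len(d.edges),              # number of flow edges
    }

if __name__ == "__main__":
    etl = ActivityDiagram(
        activities={"Extract orders", "Filter nulls", "Aggregate", "Load fact table"},
        decisions={"Valid row?"},
        edges={
            ("Extract orders", "Filter nulls"),
            ("Filter nulls", "Valid row?"),
            ("Valid row?", "Aggregate"),
            ("Aggregate", "Load fact table"),
        },
    )
    print(structural_measures(etl))
```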

    An i2b2-based, generalizable, open source, self-scaling chronic disease registry

    Objective: Registries are a well-established mechanism for obtaining high-quality, disease-specific data, but are often highly project-specific in their design, implementation, and policies for data use. In contrast to the conventional model of centralized data contribution, warehousing, and control, we design a self-scaling registry technology for collaborative data sharing, based upon the widely adopted Integrating Biology & the Bedside (i2b2) data warehousing framework and the Shared Health Research Information Network (SHRINE) peer-to-peer networking software. Materials and methods: Focusing our design around the creation of a scalable solution for collaboration within multi-site disease registries, we leverage the i2b2 and SHRINE open source software to create a modular, ontology-based, federated infrastructure that provides research investigators full ownership of and access to their contributed data while supporting permissioned yet robust data sharing. We accomplish these objectives via web services supporting peer-group overlays, group-aware data aggregation, and administrative functions. Results: The 56-site Childhood Arthritis & Rheumatology Research Alliance (CARRA) Registry and the 3-site Harvard Inflammatory Bowel Diseases Longitudinal Data Repository now utilize i2b2 self-scaling registry technology (i2b2-SSR). This platform, extensible to the federation of multiple projects within and between research networks, encompasses >6000 subjects at sites throughout the USA. Discussion: We utilize the i2b2-SSR platform to minimize technical barriers to collaboration while enabling fine-grained control over data sharing. Conclusions: The implementation of i2b2-SSR for the multi-site, multi-stakeholder CARRA Registry has established a digital infrastructure for community-driven research data sharing in pediatric rheumatology in the USA. We envision i2b2-SSR as a scalable, reusable solution facilitating interdisciplinary research across diseases.
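
    The sketch below gives a deliberately simplified picture of group-aware data aggregation across federated registry sites, where each site answers count queries only for peer groups it has joined. It does not use the real i2b2 or SHRINE web-service APIs; all class names, group names, and counts are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Site:
    """Toy federated registry site with permissioned peer-group membership."""
    name: str
    peer_groups: set[str]
    cohort_counts: dict[str, int]   # query id -> local patient count

    def answer(self, group: str, query_id: str) -> int | None:
        """Return the local count only if the site belongs to the peer group."""
        if group not in self.peer_groups:
            return None
        return self.cohort_counts.get(query_id, 0)

def aggregate(sites: list[Site], group: str, query_id: str) -> int:
    """Sum counts over the sites that participate in the given peer group."""
    answers = (s.answer(group, query_id) for s in sites)
    return sum(a for a in answers if a is not None)

if __name__ == "__main__":
    sites = [
        Site("Site A", {"carra"}, {"jia_cohort": 120}),
        Site("Site B", {"carra", "ibd"}, {"jia_cohort": 85}),
        Site("Site C", {"ibd"}, {"jia_cohort": 40}),   # not in the CARRA group
    ]
    print(aggregate(sites, "carra", "jia_cohort"))      # 205
```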

    Big Data guided Digital Petroleum Ecosystems for Visual Analytics and Knowledge Management

    The North West Shelf (NWS), interpreted as a Total Petroleum System (TPS), is a Westralian super basin with active onshore and offshore basins through which shelf, slope, and deep-oceanic geological events are construed. In addition to their data associativity, TPSs emerge with geographic connectivity through the phenomenon of a digital petroleum ecosystem. The super basin comprises a multitude of sub-basins; each basin is associated with several petroleum systems, and each system comprises multiple oil and gas fields with either known or unknown areal extents. Such hierarchical ontologies make connections between the attribute relationships of diverse petroleum systems. Moreover, the NWS offers scope for storing volumes of instances in a data-warehousing environment for analysis and for motivating new business opportunities. Furthermore, the big exploration data, characterized as heterogeneous and multidimensional, can complicate the data integration process, precluding the interpretation of data views drawn from TPS metadata in new knowledge domains. The research objective is to develop an integrated framework that can unify the exploration and other interrelated multidisciplinary data into holistic TPS metadata for visualization and valued interpretation. The digital petroleum ecosystem is prototyped as a digital oil field solution with a multitude of big data tools. Big data associated with the elements and processes of petroleum systems are examined using prototype solutions. With the conceptual framework of Digital Petroleum Ecosystems and Technologies (DPEST), we manage the interconnectivity between diverse petroleum systems and their linked basins. The ontology-based data warehousing and mining articulations ascertain collaboration through data artefacts and the coexistence between different petroleum systems and their linked oil and gas fields, which benefits explorers. The connectivity between systems further provides presentable exploration data views, improving visualization and interpretation. The metadata, with meta-knowledge in diverse knowledge domains of digital petroleum ecosystems, ensures the quality of untapped reservoirs and their associativity between Westralian basins.
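
    A minimal sketch of the basin, petroleum system, and field hierarchy described above, with a simple roll-up over known field extents. The classes, the basin name used in the example, and the figures are illustrative assumptions rather than the paper's actual ontology or data.

```python
from dataclasses import dataclass

@dataclass
class Field:
    name: str
    discovered: bool
    area_km2: float | None    # None when the areal extent is unknown

@dataclass
class PetroleumSystem:
    name: str
    fields: list[Field]

@dataclass
class Basin:
    name: str
    systems: list[PetroleumSystem]

def known_area(basin: Basin) -> float:
    """Roll up the known areal extent of all fields in a basin."""
    return sum(
        f.area_km2
        for s in basin.systems
        for f in s.fields
        if f.area_km2 is not None
    )

if __name__ == "__main__":
    example_basin = Basin("Example sub-basin", [
        PetroleumSystem("System-1", [
            Field("Field-A", True, 32.5),
            Field("Field-B", False, None),
        ]),
    ])
    print(known_area(example_basin))   # 32.5
```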

    Eco-efficient supply chain networks: Development of a design framework and application to a real case study

    © 2015 Taylor & Francis. This paper presents a supply chain network design framework that is based on multi-objective mathematical programming and that can identify 'eco-efficient' configuration alternatives that are both efficient and ecologically sound. This work is original in that it encompasses the environmental impact of both transportation and warehousing activities. We apply the proposed framework to a real-life case study (i.e. Lindt & Sprüngli) for the distribution of chocolate products. The results show that cost-driven network optimisation may lead to beneficial effects for the environment and that a minor increase in distribution costs can be offset by a major improvement in environmental performance. This paper contributes to the body of knowledge on eco-efficient supply chain design and closes the missing link between model-based methods and empirical applied research. It also generates insights into the growing debate on the trade-off between the economic and environmental performance of supply chains, supporting organisations in the eco-efficient configuration of their supply chains.
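
    As a simplified illustration of the eco-efficiency trade-off described above, the sketch below filters candidate network configurations down to the set that is Pareto-optimal on cost and emissions. The configurations and figures are invented, and the paper itself uses multi-objective mathematical programming rather than enumeration of a fixed candidate list.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Configuration:
    """Candidate supply chain network design with two objective values."""
    name: str
    cost: float        # distribution cost, e.g. EUR/year
    emissions: float   # transport + warehousing impact, e.g. tonnes CO2e/year

def pareto_front(candidates: list[Configuration]) -> list[Configuration]:
    """Keep configurations that are not dominated on both cost and emissions."""
    front = []
    for c in candidates:
        dominated = any(
            o.cost <= c.cost and o.emissions <= c.emissions
            and (o.cost < c.cost or o.emissions < c.emissions)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return front

if __name__ == "__main__":
    options = [
        Configuration("Centralised, 1 warehouse", 9.0e6, 4200),
        Configuration("Regional, 3 warehouses", 9.3e6, 3100),
        Configuration("Local, 8 warehouses", 11.5e6, 3300),   # dominated
    ]
    for cfg in pareto_front(options):
        print(cfg.name, cfg.cost, cfg.emissions)
```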