Towards a Data Quality Framework for Heterogeneous Data
Every industry produces significant data output as a by-product of its working processes, and with the recent advent of big data mining and integrated data warehousing there is a clear case for a robust methodology for assessing data quality to support sustainable and consistent processing. In this paper a review of Data Quality (DQ) across multiple domains is conducted in order to draw connections between their methodologies. This critical review suggests that, within the process of DQ assessment of heterogeneous data sets, the constituent data are rarely treated as separate types of data requiring alternate quality assessment frameworks. We discuss the need for such a directed DQ framework, outline the opportunities foreseen in this research area, and propose to address the problem through degrees of heterogeneity.
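The proposed direction, grading DQ assessment by degrees of heterogeneity, admits a simple computational shape. Below is a minimal, hypothetical sketch in Python; the dimensions, weights, and the SourceProfile fields are illustrative assumptions, not definitions from the paper:

```python
# Hypothetical sketch: scoring data quality per source, weighted by an
# assumed "degree of heterogeneity". Weights and dimensions are illustrative.
from dataclasses import dataclass

@dataclass
class SourceProfile:
    name: str
    completeness: float   # fraction of non-null values, 0..1
    consistency: float    # fraction of records passing cross-field rules
    heterogeneity: float  # assumed degree of heterogeneity, 0 (uniform)..1 (highly mixed)

def dq_score(src: SourceProfile) -> float:
    """Blend DQ dimensions; more heterogeneous sources get stricter weighting."""
    base = 0.5 * src.completeness + 0.5 * src.consistency
    # Penalise heterogeneity: highly mixed sources warrant a dedicated pass.
    return base * (1.0 - 0.3 * src.heterogeneity)

sources = [
    SourceProfile("sensor_feed", completeness=0.98, consistency=0.95, heterogeneity=0.1),
    SourceProfile("scraped_docs", completeness=0.70, consistency=0.60, heterogeneity=0.9),
]
for s in sources:
    print(f"{s.name}: DQ = {dq_score(s):.2f}")
```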
A Review of Contemporary Data Quality Issues in Data Warehouse ETL Environment
In today's scenario, Extraction-Transformation-Loading (ETL) tools have become important pieces of software responsible for integrating heterogeneous information from several sources. Carrying out the ETL process is potentially complex, hard, and time consuming. Organisations nowadays are concerned with vast quantities of data, and data quality is bound up with technical issues in the data warehouse environment. Research in the last few decades has laid increasing stress on data quality issues in the data warehouse ETL process. Data quality can be ensured by cleaning the data prior to loading it into the warehouse. Since the data is collected from various sources, it arrives in various formats; standardising those formats and cleaning the data are prerequisites for a clean data warehouse environment. Data quality attributes such as accuracy, correctness, consistency, and timeliness are required for a knowledge discovery process. The purpose of this state-of-the-art research is to address data quality issues at all stages of data warehousing, 1) data sources, 2) data integration, 3) data staging, 4) data warehouse modelling and schematic design, and to formulate a descriptive classification of their causes. The discovered knowledge is used to repair data deficiencies. This work proposes a framework for quality of extraction, transformation, and loading of data into a warehouse.
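To make the pre-load cleaning step concrete, here is a minimal Python sketch of standardising formats and enforcing basic accuracy, consistency, and timeliness checks before loading; the field names, date formats, and rejection rules are assumptions for illustration, not the framework proposed in the paper:

```python
# Minimal sketch of a pre-load cleaning step in an ETL pipeline.
from datetime import datetime

def clean_record(raw: dict) -> dict | None:
    """Standardise formats and drop records that fail basic quality checks."""
    rec = {k: v.strip() if isinstance(v, str) else v for k, v in raw.items()}
    # Accuracy/correctness: reject rows missing a mandatory key.
    if not rec.get("customer_id"):
        return None
    # Consistency: normalise dates from mixed source formats to ISO 8601.
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y"):
        try:
            rec["order_date"] = datetime.strptime(rec["order_date"], fmt).date().isoformat()
            break
        except (ValueError, KeyError):
            continue
    else:
        return None  # date unparseable in every known format: exclude from the load
    return rec

staged = [{"customer_id": "C42", "order_date": "31/01/2024"},
          {"customer_id": "",    "order_date": "2024-01-31"}]
warehouse_ready = [r for r in (clean_record(x) for x in staged) if r]
print(warehouse_ready)  # only the standardised, valid record survives
```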
A family of experiments to validate measures for UML activity diagrams of ETL processes in data warehouses
In data warehousing, Extract, Transform, and Load (ETL) processes are in charge of extracting the data from the data sources that will be contained in the data warehouse. Their design and maintenance is thus a cornerstone of any data warehouse development project. Given their relevance, the quality of these processes should be formally assessed early in development in order to avoid populating the data warehouse with incorrect data. To this end, this paper presents a set of measures with which to evaluate the structural complexity of ETL process models at the conceptual level. The study is accompanied by the application of formal frameworks and by a family of experiments whose aims are, respectively, to theoretically and empirically validate the proposed measures. Our experiments show that the use of these measures can help designers predict the effort associated with the maintenance tasks of ETL processes and make ETL process models more usable. Our work is based on Unified Modeling Language (UML) activity diagrams for modeling ETL processes, and on the Framework for the Modeling and Evaluation of Software Processes (FMESP) for the definition and validation of the measures.
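As an illustration of what structural-complexity measures over a conceptual ETL model can look like, the following Python sketch counts activities, flows, and decision points in a directed-graph rendering of an activity diagram; these generic counts are stand-ins, not the FMESP measures validated in the paper:

```python
# Hypothetical sketch: size/branching counts over an ETL process model
# represented as a directed graph (activity-diagram style).
from collections import defaultdict

class ActivityDiagram:
    def __init__(self):
        self.edges: dict[str, list[str]] = defaultdict(list)

    def add_flow(self, src: str, dst: str) -> None:
        self.edges[src].append(dst)

    def nodes(self) -> set[str]:
        return set(self.edges) | {d for ds in self.edges.values() for d in ds}

    def complexity(self) -> dict[str, int]:
        n_nodes = len(self.nodes())
        n_edges = sum(len(ds) for ds in self.edges.values())
        # Decision points: nodes with more than one outgoing flow.
        branches = sum(1 for ds in self.edges.values() if len(ds) > 1)
        return {"activities": n_nodes, "flows": n_edges, "branches": branches}

etl = ActivityDiagram()
etl.add_flow("extract_orders", "check_nulls")
etl.add_flow("check_nulls", "standardise_dates")
etl.add_flow("check_nulls", "reject_log")     # branching raises complexity
etl.add_flow("standardise_dates", "load_fact_table")
print(etl.complexity())  # {'activities': 5, 'flows': 4, 'branches': 1}
```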
An i2b2-based, generalizable, open source, self-scaling chronic disease registry
Objective: Registries are a well-established mechanism for obtaining high quality, disease-specific data, but are often highly project-specific in their design, implementation, and policies for data use. In contrast to the conventional model of centralized data contribution, warehousing, and control, we design a self-scaling registry technology for collaborative data sharing, based upon the widely adopted Integrating Biology & the Bedside (i2b2) data warehousing framework and the Shared Health Research Information Network (SHRINE) peer-to-peer networking software. Materials and methods: Focusing our design around creation of a scalable solution for collaboration within multi-site disease registries, we leverage the i2b2 and SHRINE open source software to create a modular, ontology-based, federated infrastructure that provides research investigators full ownership and access to their contributed data while supporting permissioned yet robust data sharing. We accomplish these objectives via web services supporting peer-group overlays, group-aware data aggregation, and administrative functions. Results: The 56-site Childhood Arthritis & Rheumatology Research Alliance (CARRA) Registry and 3-site Harvard Inflammatory Bowel Diseases Longitudinal Data Repository now utilize i2b2 self-scaling registry technology (i2b2-SSR). This platform, extensible to federation of multiple projects within and between research networks, encompasses >6000 subjects at sites throughout the USA. Discussion: We utilize the i2b2-SSR platform to minimize technical barriers to collaboration while enabling fine-grained control over data sharing. Conclusions: The implementation of i2b2-SSR for the multi-site, multi-stakeholder CARRA Registry has established a digital infrastructure for community-driven research data sharing in pediatric rheumatology in the USA. We envision i2b2-SSR as a scalable, reusable solution facilitating interdisciplinary research across diseases.
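The group-aware aggregation idea can be sketched in a few lines. The Python below is not the i2b2 or SHRINE API; site names, payloads, and the consent model are invented for illustration:

```python
# Illustrative sketch of group-aware aggregation across federated registry
# sites, in the spirit of the peer-group overlays described for i2b2-SSR.
from dataclasses import dataclass, field

@dataclass
class Site:
    name: str
    peer_groups: set[str]                                 # groups this site has joined
    counts: dict[str, int] = field(default_factory=dict)  # cohort -> local count

def aggregate(sites: list[Site], group: str, cohort: str) -> int:
    """Sum cohort counts only over sites that consented to the peer group."""
    return sum(s.counts.get(cohort, 0) for s in sites if group in s.peer_groups)

sites = [
    Site("site_A", {"carra"}, {"jia_patients": 180}),
    Site("site_B", {"carra", "ibd"}, {"jia_patients": 95}),
    Site("site_C", {"ibd"}, {"jia_patients": 40}),  # not in the CARRA group
]
# site_C retains ownership of its data: it is excluded from this group query.
print(aggregate(sites, "carra", "jia_patients"))  # 275
```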
Big Data guided Digital Petroleum Ecosystems for Visual Analytics and Knowledge Management
The North West Shelf (NWS), interpreted as a Total Petroleum System (TPS), is a super Westralian basin with active onshore and offshore basins through which shelf, slope, and deep-oceanic geological events are construed. In addition to their data associativity, TPS elements emerge with geographic connectivity through the phenomena of a digital petroleum ecosystem. The super basin has a multitude of sub-basins; each basin is associated with several petroleum systems, and each system comprises multiple oil and gas fields with either known or unknown areal extents. Such hierarchical ontologies make connections between attribute relationships of diverse petroleum systems. Besides, the NWS offers scope for storing volumes of instances in a data-warehousing environment to analyse and to motivate the creation of new business opportunities. Furthermore, the big exploration data, characterized as heterogeneous and multidimensional, can complicate the data integration process, precluding interpretation of data views drawn from TPS metadata in new knowledge domains. The research objective is to develop an integrated framework that can unify the exploration and other interrelated multidisciplinary data into holistic TPS metadata for visualization and valued interpretation. The digital petroleum ecosystem is prototyped as a digital oil field solution with a multitude of big data tools. Big data associated with the elements and processes of petroleum systems are examined using prototype solutions. With the conceptual framework of Digital Petroleum Ecosystems and Technologies (DPEST), we manage the interconnectivity between diverse petroleum systems and their linked basins. The ontology-based data warehousing and mining articulations ascertain collaboration through data artefacts, and the coexistence between different petroleum systems and their linked oil and gas fields benefits explorers. The connectivity between systems further provides presentable exploration data views, improving visualization and interpretation. The metadata with meta-knowledge in diverse knowledge domains of digital petroleum ecosystems ensures the quality of untapped reservoirs and their associativity between Westralian basins.
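The basin-to-system-to-field hierarchy described above maps naturally onto a navigable ontology. A minimal Python sketch follows, with class and attribute names that are illustrative assumptions rather than the DPEST schema:

```python
# Minimal sketch of the basin -> petroleum-system -> field hierarchy as a
# navigable ontology. Names and the example basin content are illustrative.
from dataclasses import dataclass, field

@dataclass
class OilGasField:
    name: str
    areal_extent_km2: float | None  # None models an unknown areal extent

@dataclass
class PetroleumSystem:
    name: str
    fields: list[OilGasField] = field(default_factory=list)

@dataclass
class Basin:
    name: str
    systems: list[PetroleumSystem] = field(default_factory=list)

nws_basin = Basin("example_sub_basin", [
    PetroleumSystem("example_system", [
        OilGasField("known_field", 800.0),
        OilGasField("unnamed_prospect", None),  # extent not yet established
    ]),
])

# Hierarchical traversal: roll field metadata up to the basin level.
for system in nws_basin.systems:
    known = [f for f in system.fields if f.areal_extent_km2 is not None]
    print(f"{nws_basin.name}/{system.name}: "
          f"{len(known)}/{len(system.fields)} fields with known extents")
```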
Eco-efficient supply chain networks: Development of a design framework and application to a real case study
This paper presents a supply chain network design framework that is based on multi-objective mathematical programming and that can identify 'eco-efficient' configuration alternatives, i.e. those that are both economically efficient and ecologically sound. The work is original in that it encompasses the environmental impact of both transportation and warehousing activities. We apply the proposed framework to a real-life case study (Lindt & Sprüngli) for the distribution of chocolate products. The results show that cost-driven network optimisation may lead to beneficial effects for the environment and that a minor increase in distribution costs can be offset by a major improvement in environmental performance. This paper contributes to the body of knowledge on eco-efficient supply chain design and closes a missing link between model-based methods and empirical applied research. It also generates insights into the growing debate on the trade-off between the economic and environmental performance of supply chains, supporting organisations in the eco-efficient configuration of their supply chains.
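The eco-efficiency trade-off can be illustrated with a toy weighted-sum formulation of the bi-objective problem. The sketch below uses scipy's linprog on an invented two-route model; all coefficients are assumptions, and the paper's actual multi-objective programme is far richer:

```python
# Illustrative weighted-sum sketch of a bi-objective (cost vs. emissions)
# network design decision. Coefficients and the single constraint are invented.
from scipy.optimize import linprog

# Decision variables: flow shipped through two candidate warehouse routes.
cost = [4.0, 6.0]   # EUR per unit via route 1, route 2
co2 = [9.0, 3.0]    # kg CO2 per unit (the cheap route pollutes more)
demand = 100.0      # units that must be delivered in total

for w in (0.0, 0.5, 1.0):  # w = weight on the environmental objective
    c = [(1 - w) * cost[i] + w * co2[i] for i in range(2)]
    res = linprog(c, A_eq=[[1.0, 1.0]], b_eq=[demand], bounds=[(0, None)] * 2)
    x = res.x
    print(f"w={w:.1f}: flows={x.round(1)}, "
          f"cost={cost[0]*x[0] + cost[1]*x[1]:.0f} EUR, "
          f"CO2={co2[0]*x[0] + co2[1]*x[1]:.0f} kg")
# Sweeping w traces the cost/emissions trade-off: a small cost increase
# (routing via route 2) buys a large emissions reduction, as the paper argues.
```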
- …