
    Discovering Data Quality Problems

    Existing methodologies for identifying data quality problems are typically user-centric, where data quality requirements are first determined in a top-down manner following well-established design guidelines, organizational structures and data governance frameworks. In the current data landscape, however, users are often confronted with new, unexplored datasets that they may not have any ownership of, but that are perceived to have relevance and potential to create value for them. Such repurposed datasets can be found in government open data portals, data markets and several publicly available data repositories. In such scenarios, applying top-down data quality checking approaches is not feasible, as the consumers of the data have no control over its creation and governance. Hence, data consumers – data scientists and analysts – need to be empowered with data exploration capabilities that allow them to investigate and understand the quality of such datasets to facilitate well-informed decisions on their use. This research aims to develop such an approach for discovering data quality problems using generic exploratory methods that can be effectively applied in settings where data creation and use are separated. The approach, named LANG, is developed through a Design Science approach on the basis of semiotics theory and data quality dimensions. LANG is empirically validated in terms of the soundness of the approach, its repeatability and its generalizability.
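    The abstract above calls for generic, dataset-agnostic exploratory methods for consumers who do not control data creation. A minimal sketch of what such exploratory profiling could look like is shown below; it is not the LANG approach itself, and the column names, sample rows, and chosen indicators (completeness, distinctness) are illustrative assumptions only.

```python
# Hypothetical sketch of generic exploratory quality profiling for a
# repurposed dataset (e.g. an open-data extract). Computes per-column
# completeness and distinctness ratios without any prior knowledge of
# the dataset's governance or schema. All sample data is invented.

def profile_columns(rows):
    """Return per-column completeness and distinctness ratios."""
    if not rows:
        return {}
    n = len(rows)
    report = {}
    for col in rows[0].keys():
        values = [r.get(col) for r in rows]
        non_empty = [v for v in values if v not in (None, "")]
        report[col] = {
            "completeness": len(non_empty) / n,   # share of filled cells
            "distinctness": len(set(non_empty)) / n,  # share of unique values
        }
    return report

# Example: a small extract with a sparsely filled 'contact' column.
rows = [
    {"agency": "A", "budget": "100", "contact": ""},
    {"agency": "B", "budget": "250", "contact": ""},
    {"agency": "C", "budget": "",    "contact": "c@x.org"},
]
report = profile_columns(rows)
print(report["contact"]["completeness"])  # 1 of 3 rows filled
```

    Indicators like these give a consumer a first, bottom-up view of fitness for use before committing to a dataset.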

    An intelligent linked data quality dashboard

    This paper describes a new intelligent, data-driven dashboard for linked data quality assessment. The development goal was to assist data quality engineers in interpreting data quality problems found when evaluating a dataset using a metrics-based data quality assessment. This required construction of a graph linking the problematic things identified in the data, the assessment metrics and the source data. This context and supporting user interfaces help the user to understand data quality problems. An analysis widget also helped the user identify the root cause of multiple problems. This supported the user in identifying and prioritizing the problems that needed to be fixed in order to improve data quality. The dashboard was shown to be useful for users cleaning data. A user evaluation was performed with both expert and novice data quality engineers.
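    The core idea above, linking each detected problem to both the metric that flagged it and the source record it concerns, and then ranking records by accumulated problems, can be sketched in a few lines. This is an illustrative approximation, not the paper's implementation; all identifiers are hypothetical.

```python
# Minimal sketch of root-cause prioritisation over (problem, metric,
# source record) links, the kind of graph context a quality dashboard
# can expose. Problem IDs, metric names and record IDs are invented.
from collections import Counter

problem_links = [
    ("p1", "completeness", "record:42"),
    ("p2", "datatype",     "record:42"),
    ("p3", "completeness", "record:7"),
    ("p4", "uniqueness",   "record:42"),
]

# Count how many distinct problems each source record accumulates;
# records with many problems are likely root-cause candidates.
by_record = Counter(rec for _, _, rec in problem_links)
ranked = by_record.most_common()
print(ranked[0])  # record:42 accumulates the most problems
```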

    Medical Big Data and Big Data Quality Problems

    Medical big data has generated much excitement in recent years, and for good reason. It can be an invaluable resource for researchers in general and insurers in particular. This Article, however, argues that users of medical big data must proceed with caution and recognize the data's considerable limitations and shortcomings. These can consist of data errors, missing information, lack of standardization, record fragmentation, software problems, and other flaws. The Article analyzes a variety of data quality problems. It also formulates recommendations to address these deficiencies, including data audits, workforce and technical solutions, and regulatory approaches.

    Data quality problems in TPC-DI based data integration processes

    Many data-driven organisations need to integrate data from multiple, distributed and heterogeneous resources for advanced data analysis. A data integration system is an essential component for collecting data into a data warehouse or other data analytics systems. There are various alternative data integration systems, created in-house or provided by vendors. Hence, it is necessary for an organisation to compare and benchmark them when choosing a suitable one to meet its requirements. Recently, TPC-DI was proposed as the first industrial benchmark for evaluating data integration systems. When using this benchmark, we find some typical data quality problems in the TPC-DI data source, such as multi-meaning attributes and inconsistent data schemas, which could delay or even fail the data integration process. This paper explains the processes of this benchmark and summarises typical data quality problems identified in the TPC-DI data source. Furthermore, in order to prevent data quality problems and proactively manage data quality, we propose a set of practical guidelines for researchers and practitioners to conduct data quality management when using the TPC-DI benchmark.
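    One of the problem classes named above, inconsistent data schemas across sources, lends itself to a simple automated check before integration begins. The sketch below is a hedged illustration under assumed attribute names and types; it is not taken from the paper or the TPC-DI specification.

```python
# Hedged sketch: detect attributes that are declared with conflicting
# types across different integration sources, one of the schema
# inconsistencies that can delay or fail an integration run.
# Source names, attributes and types are invented for illustration.

def find_schema_conflicts(schemas):
    """schemas: {source_name: {attribute: type}} -> conflicting attributes."""
    seen = {}  # attribute -> {type: [sources declaring it]}
    for source, attrs in schemas.items():
        for attr, typ in attrs.items():
            seen.setdefault(attr, {}).setdefault(typ, []).append(source)
    # An attribute with more than one declared type is a conflict.
    return {a: t for a, t in seen.items() if len(t) > 1}

schemas = {
    "CustomerMgmt.xml": {"tax_id": "string",  "dob": "date"},
    "Prospect.csv":     {"tax_id": "integer", "dob": "date"},
}
conflicts = find_schema_conflicts(schemas)
print(sorted(conflicts))  # ['tax_id']
```

    Running such checks up front turns a silent integration failure into an explicit, fixable finding.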

    Using Archival Data Sources to Conduct Nonprofit Accounting Research

    Research in nonprofit accounting is steadily increasing as more data becomes available. In an effort to broaden awareness of the data sources and ensure the quality of nonprofit research, we discuss archival data sources available to nonprofit researchers, data issues, and potential resolutions to those problems. Overall, our paper should raise awareness of data sources in the nonprofit area, increase research production, and enhance the quality of nonprofit research.

    Representing Dataset Quality Metadata using Multi-Dimensional Views

    Data quality is commonly defined as fitness for use. The problem of identifying the quality of data is faced by many data consumers. Data publishers often do not have the means to identify quality problems in their data. To make the task easier for both stakeholders, we have developed the Dataset Quality Ontology (daQ). daQ is a core vocabulary for representing the results of quality benchmarking of a linked dataset. It represents quality metadata as multi-dimensional and statistical observations using the Data Cube vocabulary. Quality metadata are organised as a self-contained graph, which can, e.g., be embedded into linked open datasets. We discuss the design considerations, give examples of extending daQ with custom quality metrics, and present use cases such as analysing data versions, browsing datasets by quality, and link identification. We finally discuss how data cube visualisation tools enable data publishers and consumers to better analyse the quality of their data. Comment: Preprint of a paper submitted to the forthcoming SEMANTiCS 2014, 4-5 September 2014, Leipzig, Germany.
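    To make the "quality metadata as Data Cube observations" idea concrete, the sketch below generates a Turtle-like observation for one metric result. The property and class names (`daq:computedOn`, `daq:metric`, `daq:value`) are simplified approximations in the spirit of daQ, not the authoritative vocabulary; consult the daQ specification for the exact terms.

```python
# Illustrative sketch: serialise one quality-metric result as a
# Data Cube-style observation attached to a dataset URI. The daq:
# property names used here are assumptions, not the official vocabulary.

def quality_observation(dataset_uri, metric, value):
    """Return a Turtle-like string for a single quality observation."""
    return (
        "[] a qb:Observation ;\n"
        f"   daq:computedOn <{dataset_uri}> ;\n"
        f"   daq:metric daq:{metric} ;\n"
        f"   daq:value {value} .\n"
    )

obs = quality_observation("http://example.org/ds1", "Completeness", 0.87)
print(obs)
```

    Because each result is an observation in a cube, standard Data Cube tooling can slice quality metadata by metric, dataset version, or computation time.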

    An Investigation Into HPLC Data Quality Problems

    This report summarizes the analyses and results produced by a five-member investigative team of Government, university, and industry experts, established by NASA HQ. The team examined data quality problems associated with high performance liquid chromatography (HPLC) analyses of pigment concentrations in seawater samples produced by the San Diego State University (SDSU) Center for Hydro-Optics and Remote Sensing (CHORS). This report shows CHORS did not validate the methods used before placing them into service to analyze field samples for NASA principal investigators (PIs), even though the HPLC literature had contained easily accessible method validation procedures, and the importance of implementing them, for more than a decade. In addition, there were so many sources of significant variance in the CHORS methodologies that the HPLC system rarely operated within performance criteria capable of producing the requisite data quality. The investigative team recommends to a) not correct the data, b) make all the data that was temporarily sequestered available for scientific use, and c) label the affected data with an appropriate warning, e.g., "These data are not validated and should not be used as the sole basis for a scientific result, conclusion, or hypothesis; independent corroborating evidence is required."