Discovering Data Quality Problems
Existing methodologies for identifying data quality problems are typically user-centric, where data quality requirements are first determined in a top-down manner following well-established design guidelines, organizational structures and data governance frameworks. In the current data landscape, however, users are often confronted with new, unexplored datasets that they may not have any ownership of, but that are perceived to have relevance and potential to create value for them. Such repurposed datasets can be found in government open data portals, data markets and several publicly available data repositories. In such scenarios, applying top-down data quality checking approaches is not feasible, as the consumers of the data have no control over its creation and governance. Hence, data consumers – data scientists and analysts – need to be empowered with data exploration capabilities that allow them to investigate and understand the quality of such datasets to facilitate well-informed decisions on their use. This research aims to develop such an approach for discovering data quality problems using generic exploratory methods that can be effectively applied in settings where data creation and use are separated. The approach, named LANG, is developed through a Design Science approach on the basis of semiotics theory and data quality dimensions. LANG is empirically validated in terms of soundness of the approach, its repeatability and generalizability.
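The abstract does not include LANG's implementation; as a minimal sketch of the kind of generic exploratory profiling it describes for a repurposed dataset, one might compute per-column quality indicators such as completeness and distinctness. The column names, data values and check set below are illustrative assumptions, not part of LANG itself:

```python
# Minimal sketch of exploratory data-quality profiling over a dataset
# the consumer has no ownership of. Columns and values are illustrative.

def profile_column(values):
    """Report completeness and distinctness for one column."""
    total = len(values)
    missing = sum(1 for v in values if v in (None, "", "NA"))
    distinct = len(set(values))
    return {
        "completeness": (total - missing) / total if total else 0.0,
        "distinct_ratio": distinct / total if total else 0.0,
    }

dataset = {
    "id": [1, 2, 3, 4],
    "city": ["Leeds", "Leeds", "", "York"],
}

report = {col: profile_column(vals) for col, vals in dataset.items()}
for col, stats in report.items():
    print(col, stats)
```

Profiles like these let a data consumer flag suspicious columns (low completeness, implausibly low or high distinctness) before committing to using a dataset, without any access to its creation or governance process.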
An intelligent linked data quality dashboard
This paper describes a new intelligent, data-driven dashboard for linked data quality assessment. The development goal was to assist data quality engineers in interpreting data quality problems found when evaluating a dataset using a metrics-based data quality assessment. This required construction of a graph linking the problematic things identified in the data, the assessment metrics and the source data. This context and supporting user interfaces help the user to understand data quality problems. An analysis widget also helped the user identify the root cause of multiple problems. This supported the user in identifying and prioritizing the problems that need to be fixed in order to improve data quality. The dashboard was shown to be useful for users cleaning data. A user evaluation was performed with both expert and novice data quality engineers.
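The graph the abstract describes links problematic things to the source data they came from, so that a single source node touching many problems surfaces as a likely root cause. A toy sketch of that root-cause idea (node names are invented for illustration, not taken from the paper):

```python
# Sketch of root-cause ranking over a problem graph: each edge links a
# data quality problem to the source node it was found in. A source
# node shared by many problems is a candidate root cause.
from collections import Counter

# edges: (problematic_thing, source_node_it_came_from) - illustrative
edges = [
    ("missing_label_p1", "resource_A"),
    ("bad_datatype_p2", "resource_A"),
    ("broken_link_p3", "resource_A"),
    ("missing_label_p4", "resource_B"),
]

# Count how many problems each source node participates in.
by_source = Counter(src for _, src in edges)
root_cause, n = by_source.most_common(1)[0]
print(root_cause, n)  # resource_A appears in 3 problems
```

Ranking sources this way supports the prioritization the paper mentions: fixing `resource_A` would resolve three problems at once, so it should be cleaned first.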
Data Quality Problems and Proactive Data Quality Management in Data-Warehouse-Systems
The abstract is included in the text
Medical Big Data and Big Data Quality Problems
Medical big data has generated much excitement in recent years and for good reason. It can be an invaluable resource for researchers in general and insurers in particular. This Article, however, argues that users of medical big data must proceed with caution and recognize the data’s considerable limitations and shortcomings. These can consist of data errors, missing information, lack of standardization, record fragmentation, software problems, and other flaws. The Article analyzes a variety of data quality problems. It also formulates recommendations to address these deficiencies, including data audits, workforce and technical solutions, and regulatory approaches.
Data quality problems in TPC-DI based data integration processes
Many data-driven organisations need to integrate data from multiple, distributed and heterogeneous resources for advanced data analysis. A data integration system is an essential component to collect data into a data warehouse or other data analytics systems. There are various alternative data integration systems, created in-house or provided by vendors. Hence, it is necessary for an organisation to compare and benchmark them when choosing a suitable one to meet its requirements. Recently, TPC-DI was proposed as the first industrial benchmark for evaluating data integration systems. When using this benchmark, we find some typical data quality problems in the TPC-DI data source, such as multi-meaning attributes and inconsistent data schemas, which could delay or even fail the data integration process. This paper explains the processes of this benchmark and summarises typical data quality problems identified in the TPC-DI data source. Furthermore, in order to prevent data quality problems and proactively manage data quality, we propose a set of practical guidelines for researchers and practitioners to conduct data quality management when using the TPC-DI benchmark.
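One of the problem classes the abstract names, inconsistent data schemas across source files, can be detected before integration by checking whether the same attribute appears with different representations in different sources. The field names and records below are invented for illustration and are not actual TPC-DI data:

```python
# Sketch of a pre-integration schema-consistency check: flag attributes
# that appear with different value types across source files. Field
# names and records are illustrative, not actual TPC-DI data.

def find_type_conflicts(sources):
    """Map attribute name -> set of value type names seen across sources,
    keeping only attributes with more than one type."""
    seen = {}
    for records in sources.values():
        for rec in records:
            for field, value in rec.items():
                seen.setdefault(field, set()).add(type(value).__name__)
    return {f: types for f, types in seen.items() if len(types) > 1}

sources = {
    "CustomerMgmt.xml": [{"tax_id": "123-45-6789", "tier": 1}],
    "Customer.txt":     [{"tax_id": 123456789,     "tier": "1"}],
}

print(find_type_conflicts(sources))
```

Running such a check as a guard step in the integration pipeline is one concrete way to act on the paper's guideline of managing data quality proactively rather than discovering conflicts mid-load.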
Using Archival Data Sources to Conduct Nonprofit Accounting Research
Research in nonprofit accounting is steadily increasing as more data becomes available. In an effort to broaden the awareness of the data sources and ensure the quality of nonprofit research, we discuss archival data sources available to nonprofit researchers, data issues, and potential resolutions to those problems. Overall, our paper should raise awareness of data sources in the nonprofit area, increase production, and enhance the quality of nonprofit research.
Representing Dataset Quality Metadata using Multi-Dimensional Views
Data quality is commonly defined as fitness for use. The problem of identifying quality of data is faced by many data consumers. Data publishers often do not have the means to identify quality problems in their data. To make the task for both stakeholders easier, we have developed the Dataset Quality Ontology (daQ). daQ is a core vocabulary for representing the results of quality benchmarking of a linked dataset. It represents quality metadata as multi-dimensional and statistical observations using the Data Cube vocabulary. Quality metadata are organised as a self-contained graph, which can, e.g., be embedded into linked open datasets. We discuss the design considerations, give examples for extending daQ by custom quality metrics, and present use cases such as analysing data versions, browsing datasets by quality, and link identification. We finally discuss how data cube visualisation tools enable data publishers and consumers to better analyse the quality of their data.

Comment: Preprint of a paper submitted to the forthcoming SEMANTiCS 2014, 4-5 September 2014, Leipzig, Germany.
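To illustrate the shape of such quality metadata, a hypothetical daQ-style observation could be expressed in Turtle roughly as follows. The `daq:` namespace URI and the property names (`daq:metric`, `daq:computedOn`, `daq:value`) are assumptions inferred from the abstract and should be checked against the published daQ vocabulary; the metric, dataset and value are invented:

```
# Hypothetical quality observation in the daQ/Data Cube style.
# Namespace and property names are assumptions, not verified against daQ.
@prefix daq: <http://purl.org/eis/vocab/daq#> .
@prefix qb:  <http://purl.org/linked-data/cube#> .
@prefix ex:  <http://example.org/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:obs1 a qb:Observation ;
    daq:metric ex:dereferenceabilityMetric ;
    daq:computedOn ex:myDataset ;
    daq:value "0.93"^^xsd:double .
```

Modelling each metric result as a `qb:Observation` is what lets generic Data Cube visualisation tools slice and compare quality values across datasets and versions, as the abstract describes.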
An Investigation Into HPLC Data Quality Problems
This report summarizes the analyses and results produced by a five-member investigative team of Government, university, and industry experts, established by NASA HQ. The team examined data quality problems associated with high performance liquid chromatography (HPLC) analyses of pigment concentrations in seawater samples produced by the San Diego State University (SDSU) Center for Hydro-Optics and Remote Sensing (CHORS). This report shows CHORS did not validate the methods used before placing them into service to analyze field samples for NASA principal investigators (PIs), even though the HPLC literature contained easily accessible method validation procedures, and the importance of implementing them, more than a decade ago. In addition, there were so many sources of significant variance in the CHORS methodologies that the HPLC system rarely operated within performance criteria capable of producing the requisite data quality. It is the recommendation of the investigative team to a) not correct the data, b) make all the data that was temporarily sequestered available for scientific use, and c) label the affected data with an appropriate warning, e.g., "These data are not validated and should not be used as the sole basis for a scientific result, conclusion, or hypothesis--independent corroborating evidence is required."