5,787 research outputs found

    Scalable and Holistic Qualitative Data Cleaning

    Get PDF
    Data quality is one of the most important problems in data management, since dirty data often leads to inaccurate analytics results and wrong business decisions. Poor data across businesses and the government costs the U.S. economy $3.1 trillion a year, according to a report by InsightSquared in 2012, and data scientists reportedly spend 60% of their time cleaning and organizing data, according to a survey published in Forbes in 2016. We therefore need effective and efficient techniques to reduce the human effort involved in data cleaning. Data cleaning activities usually consist of two phases: error detection and error repair. Error detection techniques can generally be classified as either quantitative or qualitative. Quantitative error detection techniques use statistical and machine learning methods to identify abnormal behavior and errors, and have mostly been studied in the context of outlier detection. Qualitative error detection techniques, on the other hand, rely on descriptive approaches to specify the patterns or constraints of a legal data instance. One common way of specifying those patterns or constraints is with data quality rules expressed in an integrity constraint language; errors are then captured by identifying violations of the specified rules. This dissertation focuses on the challenges of detecting and repairing qualitative errors. To clean a dirty dataset with rule-based qualitative data cleaning techniques, we first need data quality rules that reflect the semantics of the data. Since obtaining such rules by consulting domain experts is usually a time-consuming process, we need automatic techniques to discover them. We show how to mine data quality rules expressed in the formalism of denial constraints (DCs). We choose DCs as the formal integrity constraint language because they can express many real-life data quality rules while still admitting an efficient discovery algorithm. Since error detection often requires pairwise comparison of tuples, a quadratic cost that is expensive for large datasets, we present a strategy that distributes the error detection workload across a cluster of machines in a parallel shared-nothing computing environment. Our distribution strategy aims to minimize, across all machines, both the maximum computation cost and the maximum communication cost, the two main types of cost one needs to consider in a shared-nothing environment. For repairing qualitative errors, we propose a holistic data cleaning technique that accumulates evidence from a broad spectrum of data quality rules and suggests possible data updates in a holistic manner. Compared with previous piecemeal data repairing approaches, the holistic approach produces more accurate data updates because it captures the interactions between different errors in one representation and aims to generate updates that fix as many errors as possible.
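    The rule-based error detection described above can be illustrated with a small hypothetical sketch (in Python, not the dissertation's actual system): a denial constraint stating that two tuples may not share a zip code while disagreeing on the state, checked by the quadratic pairwise comparison that the proposed distribution strategy is designed to spread across machines. The attribute names and sample tuples are illustrative only.

        # Minimal sketch: detect violations of one hypothetical denial constraint (DC):
        #   forall t1, t2: NOT (t1.zip = t2.zip AND t1.state != t2.state)
        from itertools import combinations

        def dc_violations(relation):
            """Return all tuple pairs that jointly violate the example DC.

            The pairwise comparison is O(n^2), which is the quadratic cost that
            motivates distributing error detection across a shared-nothing cluster.
            """
            violations = []
            for t1, t2 in combinations(relation, 2):
                if t1["zip"] == t2["zip"] and t1["state"] != t2["state"]:
                    violations.append((t1["id"], t2["id"]))
            return violations

        if __name__ == "__main__":
            data = [
                {"id": 1, "zip": "10001", "state": "NY"},
                {"id": 2, "zip": "10001", "state": "NJ"},  # violates the DC together with tuple 1
                {"id": 3, "zip": "60601", "state": "IL"},
            ]
            print(dc_violations(data))  # -> [(1, 2)]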

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Full text link
    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date there has been no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning in isolation, without considering how the data is consumed by downstream ML analytics. We propose the CleanML study, which systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations, and we put forward multiple research directions for researchers. Comment: published in ICDE 202
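    As a concrete illustration of the statistical control mentioned above, the following sketch applies the Benjamini-Yekutieli procedure to a batch of p-values using statsmodels; the p-values are made-up placeholders, not results from the CleanML study.

        # Controlling the false discovery rate with the Benjamini-Yekutieli (BY) procedure.
        # The p-values below are hypothetical placeholders for per-comparison test results.
        import numpy as np
        from statsmodels.stats.multitest import multipletests

        p_values = np.array([0.001, 0.008, 0.03, 0.04, 0.20])

        reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_by")

        for p, p_adj, significant in zip(p_values, p_adjusted, reject):
            print(f"raw p={p:.3f}  BY-adjusted p={p_adj:.3f}  significant={significant}")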

    Luzzu - A Framework for Linked Data Quality Assessment

    Full text link
    With the increasing adoption and growth of the Linked Open Data cloud [9], with RDFa, Microformats and other ways of embedding data into ordinary Web pages, and with initiatives such as schema.org, the Web is currently being complemented with a Web of Data. This Web of Data shares many characteristics with the original Web of Documents, including widely varying quality. This heterogeneity makes it challenging to determine the quality of the data published on the Web and to make this information explicit to data consumers. The main contribution of this article is LUZZU, a quality assessment framework for Linked Open Data. Apart from providing quality metadata and quality problem reports that can be used for data cleaning, LUZZU is extensible: third-party metrics can easily be plugged into the framework. The framework does not rely on SPARQL endpoints and is thus free of the problems that come with them, such as query timeouts. Another advantage over SPARQL-based quality assessment frameworks is that metrics implemented in LUZZU can have more complex functionality than triple matching. Using the framework, we performed a quality assessment of a number of statistical linked datasets available on the LOD cloud. For this evaluation, 25 metrics from ten different dimensions were implemented.
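    To make the plug-in idea concrete, here is a hypothetical Python sketch of a streaming quality metric; Luzzu itself is a Java framework, and its actual metric interface is not reproduced here. The metric sees each triple once as the dataset is parsed, so no SPARQL endpoint is required, and it reports a value at the end. The file name dataset.ttl is a placeholder.

        # Toy quality metric in the spirit of a pluggable, streaming assessment:
        # the fraction of subjects that are HTTP(S) IRIs. Uses rdflib only for parsing.
        from rdflib import Graph

        class HttpSubjectsMetric:
            def __init__(self):
                self.subjects = set()
                self.http_subjects = set()

            def compute(self, subject, predicate, obj):
                # Called once per streamed triple; accumulates counts incrementally.
                self.subjects.add(subject)
                if str(subject).startswith(("http://", "https://")):
                    self.http_subjects.add(subject)

            def metric_value(self):
                return len(self.http_subjects) / len(self.subjects) if self.subjects else 1.0

        if __name__ == "__main__":
            graph = Graph()
            graph.parse("dataset.ttl", format="turtle")  # placeholder input file
            metric = HttpSubjectsMetric()
            for s, p, o in graph:
                metric.compute(s, p, o)
            print("HTTP-subject ratio:", metric.metric_value())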

    Reinventing the toilet: academic research meets design practice in the pursuit of an effective sanitation solution for all

    Get PDF
    This paper outlines the research and design development process undertaken to create a user-centred sanitation design solution for developing countries as part of the Bill & Melinda Gates Foundation ‘Reinventing the toilet’ challenge. The context of the Gates Foundation challenge is outlined, as well as the development of the project within the University that led to the formation of a multidisciplinary team from Loughborough Design School and the Water, Engineering and Development Centre (WEDC) within the School of Civil and Building Engineering. This team would not only develop an innovative engineering solution in the form of a hydrothermal carbonisation reactor (HCR) for processing human waste, but also develop the innovative user-centred, ergonomic and industrially designed front-end toilet solution that is the main focus of this paper. The user-centred research methods used both within Loughborough Design School and in the field in developing countries are examined, as are the design development methods used for the ideation and development of a range of prototypes exhibited at the Foundation’s Reinvent the Toilet Fair in Delhi, India in early 2014. The paper concludes with a review of the success of the project so far and the challenges that lie ahead as the project moves from the prototype development phase to real-world field testing in China towards the end of 2014.

    Data Governance on Data Platforms : Designing Playbook for Data Platforms

    Get PDF
    The amount and value of data are increasing, and data can be seen as one of the key enterprise assets. Organizations have started to build data platforms to make data from all sources available for use in cross-organization and cross-industry business ecosystems. The constraints on data are no longer the amount of data or the technology; they are the structures and processes. Data needs governing like any other enterprise asset, yet the data professionals interviewed for this research had rarely experienced successful data governance. The academic literature on data governance frameworks is fragmented and lacks holistic guidance on how data governance on data platforms should be designed, implemented, and monitored. This master’s thesis addresses that lack of holistic guidance through design science research. The research supports the theoretical framework and advances it with an analysis of 13 semi-structured interviews with data professionals. It produces a canvas tool, the Playbook for Data Platforms, to help people working with data design successful data governance in the data platform context. The canvas tool aims to increase data value and minimize data-related cost and risk in the platform context. The research also considers how cloud data affects the data governance framework. As a result, 12 areas and 31 guiding questions are proposed for inclusion in the data governance framework on data platforms.