
    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied data cleaning in isolation, without considering how the data is consumed by downstream ML analytics. We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we control the false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results systematically to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers. Comment: published in ICDE 202
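The Benjamini-Yekutieli step-up procedure mentioned in the abstract can be sketched in a few lines. This is a minimal illustration of the general procedure, not the CleanML code; the function name `benjamini_yekutieli`, the default `alpha`, and the sample p-values are assumptions made for the example.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null hypothesis is rejected.

    Hypothetical helper for illustration; controls the false discovery rate
    at level alpha under arbitrary dependence between the tests.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the hypotheses, ordered by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k with p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject the k_max hypotheses with the smallest p-values.
    reject = [False] * m
    for idx in order[:k_max]:
        reject[idx] = True
    return reject


# Made-up p-values, e.g. from four paired comparisons of cleaned vs. dirty data.
decisions = benjamini_yekutieli([0.001, 0.5, 0.01, 0.04], alpha=0.05)
```

The harmonic factor makes the thresholds stricter than those of the Benjamini-Hochberg procedure, which is the price paid for not assuming independence between tests.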

    Bootstrapping statistical inferences of decomposition methods for gender earnings differentials

    Applying the standard bootstrapping technique, with corrections for heteroskedasticity, to a sample from the 1997 Urban Household Survey in China, this paper tests (1) whether the commonly used decomposition methods for gender earnings differentials give significantly different results, and (2) whether the explained component is significantly different from the unexplained component (commonly referred to as discrimination) within each decomposition method. Based on a national data set, the empirical results indicate some significant differences in both tests. The results imply that the proposed bootstrapping technique can serve as a guideline for choosing among decomposition methods without losing important information, and for evaluating the relative importance of the decomposition components under any chosen method.
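Test (2) can be illustrated with a rough sketch: bootstrap the difference between the explained and unexplained components and check whether its confidence interval excludes zero. Everything below is an assumption made for illustration rather than the paper's method: a two-group Oaxaca-Blinder style decomposition with a single regressor, group A's coefficients as the reference, pairs resampling within each group as a simple stand-in for a heteroskedasticity-robust scheme, and a percentile interval. All function names are hypothetical.

```python
import random
from statistics import mean


def ols_fit(xs, ys):
    """Least-squares intercept and slope for a single regressor."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx  # raises ZeroDivisionError if all x are identical
    return y_bar - slope * x_bar, slope


def decompose(group_a, group_b):
    """Split the mean outcome gap (A minus B) into explained and unexplained parts.

    Each group is a list of (x, y) pairs, e.g. (years of schooling, log earnings).
    """
    xa, ya = zip(*group_a)
    xb, yb = zip(*group_b)
    a0, a1 = ols_fit(xa, ya)
    b0, b1 = ols_fit(xb, yb)
    explained = a1 * (mean(xa) - mean(xb))           # gap in characteristics
    unexplained = (a0 - b0) + (a1 - b1) * mean(xb)   # gap in coefficients
    return explained, unexplained


def bootstrap_difference(group_a, group_b, n_boot=1000, level=0.95, seed=0):
    """Percentile interval for (explained - unexplained) via pairs resampling."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        sample_a = rng.choices(group_a, k=len(group_a))
        sample_b = rng.choices(group_b, k=len(group_b))
        try:
            explained, unexplained = decompose(sample_a, sample_b)
        except ZeroDivisionError:
            continue  # skip degenerate resamples with no variation in x
        diffs.append(explained - unexplained)
    diffs.sort()
    n = len(diffs)
    lower = diffs[int((1 - level) / 2 * n)]
    upper = diffs[min(n - 1, int((1 + level) / 2 * n))]
    return lower, upper
```

If the returned interval excludes zero, the two components differ significantly at the chosen level under these assumptions. Comparing decomposition methods, as in test (1), would follow the same pattern with the difference between two methods' estimates as the bootstrapped statistic.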

    Loss prevention for hog farmers: Insurance, on-farm biosecurity practices, and vaccination

    Using agricultural household survey data and claim records from insurers for the year 2009, this paper analyzes hog producers' choice of means of loss prevention and identifies the relationships among biosecurity practices, vaccination, and hog insurance. By combining one probit and two structural equations, we adopt three-stage estimations on a mixed-process model to obtain the results. The findings indicate that biosecurity practices provide the basic infrastructure for operating pig farms and complement both the use of quality vaccines and the uptake of hog insurance. In addition, there is a strong substitution relationship between vaccine quality and demand for hog insurance. Hog farmers who implement better biosecurity practices are more likely to seek high-quality vaccines or buy into hog insurance schemes, but not both. For households with hog insurance, better biosecurity status, better management practices, and higher-quality vaccines significantly help to reduce loss ratios. However, we also find a moral hazard effect, in that higher premium expenditure by the insured households might induce larger loss ratios. Keywords: biosecurity, hog insurance, loss prevention, vaccine