2,912 research outputs found

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Full text link
    Data quality affects machine learning (ML) model performances, and data scientists spend considerable amount of time on data cleaning before model training. However, to date, there does not exist a rigorous study on how exactly cleaning affects ML -- ML community usually focuses on developing ML algorithms that are robust to some particular noise types of certain distributions, while database (DB) community has been mostly studying the problem of data cleaning alone without considering how data is consumed by downstream ML analytics. We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both commonly used algorithms in practice as well as state-of-the-art solutions in academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we also control false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers.Comment: published in ICDE 202

    Learning Local Features Using Boosted Trees for Face Recognition

    Get PDF
    Face recognition is fundamental to a number of significant applications that include but not limited to video surveillance and content based image retrieval. Some of the challenges which make this task difficult are variations in faces due to changes in pose, illumination and deformation. This dissertation proposes a face recognition system to overcome these difficulties. We propose methods for different stages of face recognition which will make the system more robust to these variations. We propose a novel method to perform skin segmentation which is fast and able to perform well under different illumination conditions. We also propose a method to transform face images from any given lighting condition to a reference lighting condition using color constancy. Finally we propose methods to extract local features and train classifiers using these features. We developed two algorithms using these local features, modular PCA (Principal Component Analysis) and boosted tree. We present experimental results which show local features improve recognition accuracy when compared to accuracy of methods which use global features. The boosted tree algorithm recursively learns a tree of strong classifiers by splitting the training data in to smaller sets. We apply this method to learn features on the intrapersonal and extra-personal feature space. Once trained each node of the boosted tree will be a strong classifier. We used this method with Gabor features to perform experiments on benchmark face databases. Results clearly show that the proposed method has better face recognition and verification accuracy than the traditional AdaBoost strong classifier
    • …
    corecore