CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there is no rigorous study of how exactly
cleaning affects ML: the ML community usually focuses on developing ML
algorithms that are robust to particular noise types of certain
distributions, while the database (DB) community has mostly studied the
problem of data cleaning alone, without considering how data is consumed by
downstream ML analytics. We propose a CleanML study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both
algorithms commonly used in practice and state-of-the-art solutions from the
academic literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.
Comment: published in ICDE 202
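The abstract names the Benjamini-Yekutieli (BY) procedure for controlling the false discovery rate across many hypothesis tests. As a rough illustration only, the step-up rule can be sketched as follows; the function name `benjamini_yekutieli`, the default `alpha`, and the sample p-values are assumptions made for this example and are not taken from the CleanML code.

```python
def benjamini_yekutieli(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the null is rejected.

    Illustrative sketch of the BY step-up rule (hypothetical helper,
    not the CleanML implementation).
    """
    m = len(p_values)
    if m == 0:
        return []
    # Harmonic correction c(m) = sum_{i=1..m} 1/i, which makes the
    # procedure valid under arbitrary dependence between the tests.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k such that p_(k) <= k * alpha / (m * c(m)).
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Made-up p-values, for illustration only.
print(benjamini_yekutieli([0.001, 0.008, 0.039, 0.041, 0.6]))
# [True, True, False, False, False]
```

Compared with the Benjamini-Hochberg rule, the only difference is the extra factor c(m) in the threshold, which makes BY more conservative.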
Learning Local Features Using Boosted Trees for Face Recognition
Face recognition is fundamental to a number of significant applications, including but not limited to video surveillance and content-based image retrieval. Some of the challenges that make this task difficult are variations in faces due to changes in pose, illumination, and deformation. This dissertation proposes a face recognition system to overcome these difficulties. We propose methods for different stages of face recognition that make the system more robust to these variations. We propose a novel skin segmentation method that is fast and performs well under different illumination conditions. We also propose a method that uses color constancy to transform face images from any given lighting condition to a reference lighting condition. Finally, we propose methods to extract local features and train classifiers using these features. We developed two algorithms using these local features: modular PCA (Principal Component Analysis) and boosted tree. We present experimental results showing that local features improve recognition accuracy compared with methods that use global features.
The boosted tree algorithm recursively learns a tree of strong classifiers by splitting the training data into smaller sets. We apply this method to learn features on the intrapersonal and extra-personal feature space. Once trained, each node of the boosted tree is a strong classifier. We used this method with Gabor features to perform experiments on benchmark face databases. Results clearly show that the proposed method has better face recognition and verification accuracy than the traditional AdaBoost strong classifier.
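The recursive idea described above, a tree in which each node is a boosted strong classifier trained on a progressively smaller subset, can be sketched roughly as follows. This is not the dissertation's implementation: the abstract does not say how the data is split or which weak learners are used, so the sketch assumes decision stumps as weak learners, a split on the node's own predicted label, and toy one-dimensional data in place of Gabor features. All function names and parameters are hypothetical.

```python
import math


def stump_out(x, f, thr, pol):
    """Decision stump: predict pol if x[f] >= thr, otherwise -pol."""
    return pol if x[f] >= thr else -pol


def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for f in range(len(X[0])):
        for thr in sorted({x[f] for x in X}):
            for pol in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_out(x, f, thr, pol) != yi)
                if best is None or err < best[0]:
                    best = (err, f, thr, pol)
    return best


def adaboost(X, y, rounds=5):
    """Train a strong classifier as a weighted vote of stumps (labels are +1/-1)."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, f, thr, pol = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, f, thr, pol))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_out(x, f, thr, pol))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model


def strong_predict(model, x):
    score = sum(a * stump_out(x, f, thr, pol) for a, f, thr, pol in model)
    return 1 if score >= 0 else -1


def build_tree(X, y, depth=0, max_depth=2, min_size=4):
    """Train a strong classifier at this node, then recurse on each side.

    Assumption: samples are routed by the node's predicted label.
    """
    node = {"model": adaboost(X, y), "left": None, "right": None}
    if depth >= max_depth:
        return node
    preds = [strong_predict(node["model"], x) for x in X]
    for side, label in (("left", -1), ("right", 1)):
        sub = [(x, yi) for x, yi, p in zip(X, y, preds) if p == label]
        mixed = len({yi for _, yi in sub}) == 2
        # Recurse only on a strictly smaller subset that still has both classes.
        if mixed and min_size <= len(sub) < len(X):
            node[side] = build_tree([x for x, _ in sub], [yi for _, yi in sub],
                                    depth + 1, max_depth, min_size)
    return node


def tree_predict(node, x):
    """Follow the strong classifiers down the tree; the last node decides."""
    p = strong_predict(node["model"], x)
    child = node["left"] if p == -1 else node["right"]
    return tree_predict(child, x) if child is not None else p


# Toy data, for illustration only.
X = [[0], [1], [2], [3], [4], [5], [6], [7]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]
tree = build_tree(X, y)
print([tree_predict(tree, x) for x in X])
```

On data that a single strong classifier already separates, the tree stays a single node; child nodes are only grown where a node's subset still mixes both classes.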