15 research outputs found

    A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels

    The recent success of deep neural networks is powered in part by large-scale, well-labeled training data. However, it is a daunting task to laboriously annotate an ImageNet-like dataset. In contrast, it is fairly convenient, fast, and cheap to collect training images from the Web along with their noisy labels. This signifies the need for alternative approaches to training deep neural networks with such noisy labels. Existing methods tackling this problem either try to identify and correct the wrong labels or reweight the data terms in the loss function according to the inferred noise rates. Both strategies inevitably incur errors for some of the data points. In this paper, we contend that it is actually better to ignore the labels of some data points than to keep them if those labels are incorrect, especially when the noise rate is high. After all, wrong labels can mislead a neural network to a bad local optimum. We suggest a two-stage framework for learning from noisy labels. In the first stage, we identify a small portion of images from the noisy training set whose labels are correct with high probability; the noisy labels of the other images are ignored. In the second stage, we train a deep neural network in a semi-supervised manner. This framework effectively takes advantage of the whole training set while relying on only the portion of its labels that are most likely correct. Experiments on three datasets verify the effectiveness of our approach, especially when the noise rate is high.
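    The abstract does not spell out how the high-confidence subset is found or which semi-supervised learner is used, so the sketch below is only illustrative: it assumes small-loss selection as the stage-one criterion and scikit-learn's LabelSpreading for stage two, with integer class labels 0..K-1. The function name `two_stage_fit` and the `keep_fraction` default are hypothetical, not from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import LabelSpreading

def two_stage_fit(X, y_noisy, keep_fraction=0.2):
    # Stage 1: score each (x, label) pair with a preliminary classifier and
    # keep the small-loss samples, whose given labels are most likely correct.
    # Assumes y_noisy holds integer class labels 0..K-1 with every class present.
    probe = LogisticRegression(max_iter=1000).fit(X, y_noisy)
    proba = probe.predict_proba(X)
    loss = -np.log(proba[np.arange(len(y_noisy)), y_noisy] + 1e-12)
    keep = np.argsort(loss)[: int(keep_fraction * len(y_noisy))]

    # Stage 2: drop the remaining labels (-1 marks "unlabeled") and train a
    # semi-supervised model on the full feature set.
    y_semi = np.full(len(y_noisy), -1, dtype=int)
    y_semi[keep] = y_noisy[keep]
    return LabelSpreading(kernel="knn", n_neighbors=7).fit(X, y_semi)
```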

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Data quality affects machine learning (ML) model performance, and data scientists spend a considerable amount of time on data cleaning before model training. However, to date, there has been no rigorous study of how exactly cleaning affects ML: the ML community usually focuses on developing ML algorithms that are robust to particular noise types of certain distributions, while the database (DB) community has mostly studied the problem of data cleaning alone, without considering how the data is consumed by downstream ML analytics. We propose CleanML, a study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both algorithms commonly used in practice and state-of-the-art solutions from the academic literature). We control the randomness in the ML experiments using statistical hypothesis testing, and we control the false discovery rate using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations, and we put forward multiple research directions for researchers.
    Comment: published in ICDE 202
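    To make the multiple-testing step concrete, here is a minimal, self-contained sketch of the Benjamini-Yekutieli step-up procedure the study uses to control the false discovery rate. The p-values in the example call are placeholders, not results from the paper, and the helper name is mine; statsmodels exposes the same procedure as `multipletests(..., method="fdr_by")`.

```python
import numpy as np

def benjamini_yekutieli(pvals, alpha=0.05):
    """Return a boolean mask of rejected null hypotheses at FDR level alpha."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))       # harmonic correction for arbitrary dependence
    thresholds = alpha * np.arange(1, m + 1) / (m * c_m)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest rank i with p_(i) <= threshold_i
        reject[order[: k + 1]] = True
    return reject

# Placeholder p-values, for illustration only.
print(benjamini_yekutieli([0.001, 0.02, 0.04, 0.30, 0.80]))
```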

    Multi-Label Noise Robust Collaborative Learning Model for Remote Sensing Image Classification

    The development of accurate methods for multi-label classification (MLC) of remote sensing (RS) images is one of the most important research topics in RS. Methods based on Deep Convolutional Neural Networks (CNNs) have shown strong performance gains on RS MLC problems. However, CNN-based methods usually require a large number of reliable training images annotated with multiple land-cover class labels, and collecting such data is time-consuming and costly. To address this problem, publicly available thematic products, which can include noisy labels, can be used to annotate RS images at zero labeling cost. However, multi-label noise (which can be associated with wrong and missing label annotations) can distort the learning process of the MLC algorithm. The detection and correction of label noise are challenging tasks, especially in a multi-label scenario, where each image can be associated with more than one label. To address this problem, we propose a novel noise-robust collaborative multi-label learning (RCML) method to alleviate the adverse effects of multi-label noise during the training phase of the CNN model. RCML identifies, ranks, and excludes noisy multi-labels in RS images based on three main modules: 1) a discrepancy module; 2) a group lasso module; and 3) a swap module. The discrepancy module ensures that the two networks learn diverse features while producing the same predictions. The task of the group lasso module is to detect the potentially noisy labels assigned to the multi-labeled training images, while the swap module is devoted to exchanging the ranking information between the two networks. Unlike existing methods that make assumptions about the noise distribution, the proposed RCML does not make any prior assumption about the type of noise in the training set. Our code is publicly available online: http://www.noisy-labels-in-rs.org
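    RCML's three modules are described only at a high level above, so the PyTorch fragment below is a heavily simplified, assumed approximation of the collaborative idea rather than the authors' implementation (which is at the URL above): two networks exchange their small-loss (image, label) entries as a stand-in for the group-lasso ranking and swap modules, and a prediction-consistency term stands in for the discrepancy module. The names `collaborative_step`, `keep_ratio`, and `cons_w` are hypothetical.

```python
import torch
import torch.nn.functional as F

def collaborative_step(net_a, net_b, opt_a, opt_b, x, y, keep_ratio=0.8, cons_w=0.1):
    """One step on a multi-label batch: x is an image batch, y a {0, 1} float tensor."""
    logits_a, logits_b = net_a(x), net_b(x)                   # shape (batch, num_labels)
    loss_a = F.binary_cross_entropy_with_logits(logits_a, y, reduction="none")
    loss_b = F.binary_cross_entropy_with_logits(logits_b, y, reduction="none")

    # "Swap": each network trains on the (image, label) entries that the *other*
    # network scores as most plausible, i.e. its small-loss entries.
    k = max(1, int(keep_ratio * y.numel()))
    mask_for_b = torch.zeros_like(loss_a).view(-1)
    mask_for_b[loss_a.view(-1).topk(k, largest=False).indices] = 1.0
    mask_for_a = torch.zeros_like(loss_b).view(-1)
    mask_for_a[loss_b.view(-1).topk(k, largest=False).indices] = 1.0

    # Crude stand-in for the discrepancy idea: encourage the two prediction
    # vectors to agree (the feature-diversity part is omitted in this sketch).
    consistency = F.mse_loss(torch.sigmoid(logits_a), torch.sigmoid(logits_b))

    total = (loss_a.view(-1) * mask_for_a).sum() / mask_for_a.sum() \
          + (loss_b.view(-1) * mask_for_b).sum() / mask_for_b.sum() \
          + cons_w * consistency
    opt_a.zero_grad(); opt_b.zero_grad()
    total.backward()
    opt_a.step(); opt_b.step()
    return total.item()
```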

    Experience: Quality benchmarking of datasets used in software effort estimation

    Data is a cornerstone of empirical software engineering (ESE) research and practice. It underpins numerous process and project management activities, including the estimation of development effort and the prediction of the likely location and severity of defects in code. Serious questions have been raised, however, over the quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have been noted as being especially prevalent. Other quality issues, although also potentially important, have received less attention. In this study, we assess the quality of 13 datasets that have been used extensively in research on software effort estimation. The quality issues considered in this article draw on a taxonomy that we published previously, based on a systematic mapping of data quality issues in ESE. Our contributions are as follows: (1) an evaluation of the “fitness for purpose” of these commonly used datasets and (2) an assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template that could be used both to improve the ESE data collection/submission process and to evaluate other such datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the availability and use of higher-quality datasets.

    Classification with Measurement Error in Covariates Or Response, with Application to Prostate Cancer Imaging Study

    The research is motivated by the prostate cancer imaging study conducted at the University of Western Ontario to classify cancer status using multiple in-vivo images. The prostate cancer histological image and the in-vivo images are subject to misalignment in the co-registration procedure, which can be viewed as measurement error in the covariates or the response. We investigate methods to correct this problem. The first proposed method corrects the predicted class probability when the data have misclassified labels. The correction equation is derived from the relationship between the true response and the error-prone response, and the probability of the observed class label is adjusted so that it is close to the probability of the true label. A model can then be built with the corrected class probability and the covariates for prediction purposes. A weighted model method is proposed to construct classifiers with an error-prone response: a weight is assigned to each data point according to its position, which indicates the data point's reliability. We propose weighted models for different machine learning classifiers, such as logistic regression, SVM, KNN, and classification trees. The weighted model incorporates the weight for each instance in the model-building procedure, and the weighted classifiers trained with the error-prone data can be used for future prediction. The misalignment in the co-registration procedure can also be treated as measurement error in the covariates. A weighted data reconstruction method is proposed to deal with the corrupted covariates; it combines two moment reconstruction forms under different assumptions, and we incorporate the weights of the data to build adjusted variables that replace the error-prone covariates. The classifiers can then be trained on the reconstructed data set. Numerical studies were carried out to assess the performance of each method, and the methods were applied to the prostate cancer imaging study. The results show that all methods significantly resolved the misalignment problem.
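    The thesis's exact correction equation and weighting scheme cannot be reproduced from the abstract alone, so the snippet below only illustrates the two general ideas in their simplest binary form: the textbook misclassification relationship between the error-prone probability P(Y*=1|x) and the true probability P(Y=1|x) under known sensitivity and specificity, and a classifier fit with per-point reliability weights via scikit-learn's `sample_weight`. The function names and the assumption of known error rates are mine, not the thesis's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def corrected_probability(p_star, sensitivity, specificity):
    """Invert P(Y*=1|x) = (1 - specificity) + (sensitivity + specificity - 1) * P(Y=1|x)."""
    p_true = (np.asarray(p_star) - (1.0 - specificity)) / (sensitivity + specificity - 1.0)
    return np.clip(p_true, 0.0, 1.0)

def weighted_logistic(X, y_error_prone, reliability):
    # Down-weight observations whose labels are less trustworthy; `reliability`
    # is a hypothetical per-point weight in [0, 1] supplied by the analyst.
    return LogisticRegression(max_iter=1000).fit(X, y_error_prone, sample_weight=reliability)
```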

    DECODE: Deep Confidence Network for Robust Image Classification

    Recent years have witnessed the success of deep convolutional neural networks for image classification and many related tasks. It should be pointed out, however, that existing training strategies assume a clean dataset for model learning. On elaborately constructed benchmark datasets, deep networks have yielded promising performance under this assumption. In real-world applications, however, it is burdensome and expensive to collect sufficient clean training samples. On the other hand, collecting noisily labeled samples is much more economical and practical, especially with the rapidly increasing amount of visual data on the Web. Unfortunately, the accuracy of current deep models may drop dramatically even with 5% to 10% label noise. Therefore, enabling classification that is robust to label noise has become a crucial issue in data-driven deep learning approaches. In this paper, we propose a DEep COnfiDEnce network, DECODE, to address this issue. In particular, based on the distribution of mislabeled data, we adopt a confidence evaluation module that is able to determine the confidence that a sample is mislabeled. With this confidence, we further use a weighting strategy to assign different weights to different samples so that the model pays less attention to low-confidence data, which is more likely to be noise. In this way, the deep model is more robust to label noise. DECODE is designed to be general, so that it can be easily combined with existing architectures. We conduct extensive experiments on several datasets, and the results validate that DECODE can improve the accuracy of deep models trained with noisy data.
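    How DECODE's confidence evaluation module derives its scores is not detailed in the abstract, so the PyTorch fragment below only illustrates the weighting step it feeds into: per-sample cross-entropy scaled by a confidence value in [0, 1], so that low-confidence (likely mislabeled) samples contribute little to the gradient. The name `confidence_weighted_loss` and the normalisation choice are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, noisy_labels, confidence):
    """logits: (N, C) scores; noisy_labels: (N,) int64; confidence: (N,) values in [0, 1]."""
    per_sample = F.cross_entropy(logits, noisy_labels, reduction="none")
    w = confidence.detach()            # confidence scores are treated as fixed weights here
    return (w * per_sample).sum() / w.sum().clamp_min(1e-8)
```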