A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels
The recent success of deep neural networks is powered in part by large-scale
well-labeled training data. However, it is a daunting task to laboriously
annotate an ImageNet-like dataset. In contrast, it is fairly convenient,
fast, and cheap to collect training images from the Web along with their noisy
labels. This signifies the need for alternative approaches to training deep
neural networks using such noisy labels. Existing methods tackling this problem
either try to identify and correct the wrong labels or reweigh the data terms
in the loss function according to the inferred noise rates. Both strategies
inevitably incur errors for some of the data points. In this paper, we contend
that it is actually better to ignore the labels of some of the data points than
to keep them if the labels are incorrect, especially when the noise rate is
high. After all, the wrong labels could mislead a neural network to a bad local
optimum. We suggest a two-stage framework for learning from noisy labels.
In the first stage, we identify a small portion of images from the noisy
training set whose labels are correct with high probability. The noisy
labels of the other images are ignored. In the second stage, we train a deep
neural network in a semi-supervised manner. This framework effectively takes
advantage of the whole training set and yet only a portion of its labels that
are most likely correct. Experiments on three datasets verify the effectiveness
of our approach, especially when the noise rate is high.
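The first stage can be sketched in a few lines. The selection rule below (the function name, the `keep_frac` parameter, and the use of a preliminary model's class probabilities) is our own illustrative choice, not the paper's exact criterion:

```python
import numpy as np

def select_confident_subset(probs, noisy_labels, keep_frac=0.2):
    """Stage 1 (sketch): keep only the samples whose noisy label agrees
    with a high model probability; the labels of the rest are ignored
    and those samples are treated as unlabeled in stage 2."""
    # confidence that the given noisy label is correct
    conf = probs[np.arange(len(noisy_labels)), noisy_labels]
    n_keep = max(1, int(keep_frac * len(noisy_labels)))
    keep = np.argsort(-conf)[:n_keep]          # most confident samples
    mask = np.zeros(len(noisy_labels), dtype=bool)
    mask[keep] = True
    return mask  # True -> keep label; False -> unlabeled in stage 2
```

In stage 2, any semi-supervised method can then consume the kept labels plus the unlabeled remainder of the training set.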
CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks
Data quality affects machine learning (ML) model performance, and data
scientists spend a considerable amount of time on data cleaning before model
training. However, to date, there does not exist a rigorous study on how
exactly cleaning affects ML -- the ML community usually focuses on developing
ML algorithms that are robust to particular noise types of certain
distributions, while the database (DB) community has mostly studied the
problem of data cleaning alone, without considering how data is consumed by
downstream ML analytics. We propose CleanML, a study that systematically
investigates the impact of data cleaning on ML classification tasks. The
open-source and extensible CleanML study currently includes 14 real-world
datasets with real errors, five common error types, seven different ML models,
and multiple cleaning algorithms for each error type (including both commonly
used algorithms in practice as well as state-of-the-art solutions in academic
literature). We control the randomness in ML experiments using statistical
hypothesis testing, and we also control false discovery rate in our experiments
using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a
systematic way to derive many interesting and nontrivial observations. We also
put forward multiple research directions for researchers.
Comment: published in ICDE 202
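The false-discovery-rate step the abstract mentions is the standard Benjamini-Yekutieli step-up procedure, which can be sketched as follows (a minimal implementation of the standard procedure, not the study's experiment harness):

```python
import numpy as np

def benjamini_yekutieli(pvals, alpha=0.05):
    """Benjamini-Yekutieli step-up procedure: controls the false
    discovery rate at level `alpha` under arbitrary dependence
    between the hypotheses."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    c_m = np.sum(1.0 / np.arange(1, m + 1))      # harmonic correction term
    order = np.argsort(p)
    thresh = alpha * np.arange(1, m + 1) / (m * c_m)
    below = p[order] <= thresh
    # step-up: reject the k smallest p-values, where k is the largest
    # rank whose p-value falls under its threshold
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True
    return reject
```

The harmonic factor `c_m` is what distinguishes BY from the less conservative Benjamini-Hochberg procedure, which assumes independence or positive dependence.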
Multi-Label Noise Robust Collaborative Learning Model for Remote Sensing Image Classification
The development of accurate methods for multi-label classification (MLC) of
remote sensing (RS) images is one of the most important research topics in RS.
Methods based on Deep Convolutional Neural Networks (CNNs) have shown strong
performance gains in RS MLC problems. However, CNN-based methods usually
require a high number of reliable training images annotated by multiple
land-cover class labels. Collecting such data is time-consuming and costly. To
address this problem, publicly available thematic products, which can
include noisy labels, can be used to annotate RS images at zero labeling
cost. However, multi-label noise (which can be associated with wrong and
missing label annotations) can distort the learning process of the MLC
algorithm. The detection and correction of label noise are challenging tasks,
especially in a multi-label scenario, where each image can be associated with
more than one label. To address this problem, we propose a novel noise robust
collaborative multi-label learning (RCML) method to alleviate the adverse
effects of multi-label noise during the training phase of the CNN model. RCML
identifies, ranks and excludes noisy multi-labels in RS images based on three
main modules: 1) discrepancy module; 2) group lasso module; and 3) swap module.
The discrepancy module ensures that the two networks learn diverse features,
while producing the same predictions. The task of the group lasso module is to
detect the potentially noisy labels assigned to the multi-labeled training
images, while the swap module task is devoted to exchanging the ranking
information between two networks. Unlike existing methods that make assumptions
about the noise distribution, our proposed RCML does not make any prior
assumption about the type of noise in the training set. Our code is publicly
available online: http://www.noisy-labels-in-rs.org
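The rank-and-exclude idea behind the group lasso module can be illustrated as follows. This is only a sketch of the principle (label entries on which both networks strongly disagree with the annotation are ranked as potentially noisy and excluded); the function, parameter names, and the plain cross-entropy ranking are our own, not RCML's actual formulation:

```python
import numpy as np

def rank_noisy_labels(p1, p2, y, drop_frac=0.2):
    """Rank label entries by how strongly the two networks' predictions
    disagree with the multi-label annotation, and exclude the worst ones.
    p1, p2: (n, L) sigmoid outputs of the two networks; y: (n, L) 0/1
    multi-label annotations."""
    eps = 1e-12
    # per-label binary cross-entropy, averaged over the two networks
    bce = lambda p: -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    loss = 0.5 * (bce(p1) + bce(p2))
    n_drop = int(drop_frac * loss.size)
    flat = np.argsort(-loss, axis=None)[:n_drop]      # highest-loss entries
    mask = np.ones(loss.shape, dtype=bool)
    mask[np.unravel_index(flat, loss.shape)] = False  # False -> exclude label
    return mask
```

In the actual method, the swap module then exchanges this ranking information between the two networks so that each network is trained on labels vetted by its peer.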
Experience: Quality benchmarking of datasets used in software effort estimation
Data is a cornerstone of empirical software engineering (ESE) research and practice. Data underpin numerous
process and project management activities, including the estimation of development effort and the prediction
of the likely location and severity of defects in code. Serious questions have been raised, however, over the
quality of the data used in ESE. Data quality problems caused by noise, outliers, and incompleteness have
been noted as being especially prevalent. Other quality issues, although also potentially important, have
received less attention. In this study, we assess the quality of 13 datasets that have been used extensively
in research on software effort estimation. The quality issues considered in this article draw on a taxonomy
that we published previously based on a systematic mapping of data quality issues in ESE. Our contributions
are as follows: (1) an evaluation of the “fitness for purpose” of these commonly used datasets and (2) an
assessment of the utility of the taxonomy in terms of dataset benchmarking. We also propose a template
that could be used to both improve the ESE data collection/submission process and to evaluate other such
datasets, contributing to enhanced awareness of data quality issues in the ESE community and, in time, the
availability and use of higher-quality datasets.
Classification with Measurement Error in Covariates Or Response, with Application to Prostate Cancer Imaging Study
The research is motivated by the prostate cancer imaging study conducted at the University of Western Ontario to classify cancer status using multiple in-vivo images. The prostate cancer histological image and the in-vivo images are subject to misalignment in the co-registration procedure, which can be viewed as measurement error in covariates or response. We investigate methods to correct this problem.
The first proposed method corrects the predicted class probability when the data have misclassified labels. The correction equation is derived from the relationship between the true response and the error-prone response. The probability for the observed class label is adjusted so that it is close to the probability of the true label. A model can then be built with the corrected class probability and the covariates for prediction purposes.
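The relationship between the true and error-prone response can be sketched for binary labels as follows. The notation (flip rates `gamma01`, `gamma10`) is ours, and the paper's exact estimator may differ; this only illustrates the inversion idea:

```python
def corrected_probability(p_obs, gamma01, gamma10):
    """Sketch of the correction equation. With flip rates
    gamma01 = P(Y* = 1 | Y = 0) and gamma10 = P(Y* = 0 | Y = 1),
    the observed probability satisfies
        P(Y* = 1 | x) = (1 - gamma10) * p + gamma01 * (1 - p),
    so the true probability p is recovered by inverting this relation."""
    p = (p_obs - gamma01) / (1.0 - gamma01 - gamma10)
    return min(1.0, max(0.0, p))  # clip to [0, 1]
```

For example, with flip rates of 0.1 and 0.2, a true class probability of 0.5 is observed as 0.45; the inversion maps it back.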
A weighted model method is proposed to construct classifiers with an error-prone response. A weight is assigned to each data point according to its position, which indicates the data point's reliability. We propose weighted models for different machine learning classifiers, such as logistic regression, SVM, KNN, and classification trees. The weighted model incorporates the weight of each instance in the model-building procedure, and the weighted classifiers trained with the error-prone data can be used for future prediction.
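For logistic regression, incorporating per-instance weights amounts to scaling each point's contribution to the score equations. A minimal gradient-descent sketch (the weights here are supplied directly; in the paper they derive from each point's position in the co-registered images):

```python
import numpy as np

def weighted_logistic_fit(X, y, w, lr=0.1, steps=500):
    """Weighted logistic regression by gradient ascent on the weighted
    log-likelihood: each instance's contribution is scaled by its
    reliability weight w[i]."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current class probabilities
        grad = X.T @ (w * (y - p))            # weighted score equations
        beta += lr * grad / len(y)
    return beta
```

Setting all weights to 1 recovers ordinary logistic regression; down-weighting unreliable points shrinks their influence on the fitted coefficients.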
The misalignment in the co-registration procedure can also be treated as measurement error in covariates. A weighted data reconstruction method was proposed to deal with the corrupted covariates. The proposed method combines two moment reconstruction forms under different assumptions. We incorporated the weights of the data to build adjusted variables to replace the error-prone covariates. The classifiers can be trained on the reconstructed data set.
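The abstract does not spell out the two reconstruction forms, but the classical moment-based idea they build on is regression calibration: replace the error-prone covariate with the best linear predictor of the true one. The sketch below shows that standard idea only, under the assumption of additive error with known variance, and is not the paper's combined estimator:

```python
import numpy as np

def regression_calibration(W, var_u):
    """Classical moment reconstruction sketch: for W = X + U with
    E[U] = 0 and Var(U) = var_u known, replace W by the best linear
    predictor of X given W, matching the first two moments of X."""
    mu = W.mean()
    var_w = W.var()
    lam = max(0.0, (var_w - var_u) / var_w)   # attenuation factor
    return mu + lam * (W - mu)
```

The reconstructed variable keeps the observed mean but shrinks the excess variance contributed by the measurement error.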
Numerical studies were carried out to assess the performance of each method, and the methods were applied to the prostate cancer imaging study. The results show that all methods significantly alleviated the misalignment problem.
Data cleaning techniques for software engineering data sets
This thesis was submitted for the degree of Doctor of Philosophy and awarded by Brunel University.
Data quality is an important issue which has been addressed and recognised in research communities such as data warehousing, data mining and information systems. It has been agreed that poor data quality will impact the quality of results of analyses and that it will therefore impact decisions made on the basis of these results. Empirical software engineering has neglected the issue of data quality to some extent. This fact poses the question of how researchers in empirical software engineering can trust their results without addressing the quality of the analysed data. One widely accepted definition describes data quality as `fitness for purpose', and poor data quality can be addressed either by introducing preventative measures or by applying means to cope with data quality issues. The research presented in this thesis addresses the latter, with a special focus on noise handling.
Three noise handling techniques, which utilise decision trees, are proposed for application to software engineering data sets. Each technique represents a noise handling approach: robust filtering, where training and test sets are the same; predictive filtering, where training and test sets are different; and filtering and polish, where noisy instances are corrected. The techniques were first evaluated in two different investigations by applying them to a large real-world software engineering data set. In the first investigation the techniques' ability to improve predictive accuracy under differing noise levels was tested. All three techniques improved predictive accuracy in comparison to the do-nothing approach. Filtering and polish was the most successful technique at improving predictive accuracy. The second investigation, utilising the large real-world software engineering data set, tested the techniques' ability to identify instances with implausible values. These instances were flagged for the purpose of evaluation before applying the three techniques. Robust filtering and predictive filtering decreased the number of instances with implausible values, but substantially decreased the size of the data set too. The filtering and polish technique actually increased the number of implausible values, but it did not reduce the size of the data set.
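The predictive-filtering approach (train on one part of the data, flag the held-out instances the model misclassifies) can be sketched as below. To stay self-contained, a 1-nearest-neighbour rule stands in for the decision trees used in the thesis, and all names are illustrative:

```python
import numpy as np

def predictive_filter(X, y, k_folds=2):
    """Predictive filtering sketch: split the data into folds, train on
    the other folds, and flag held-out instances whose label disagrees
    with the prediction as candidate noise."""
    n = len(y)
    folds = np.arange(n) % k_folds
    noisy = np.zeros(n, dtype=bool)
    for f in range(k_folds):
        tr, te = folds != f, folds == f
        for i in np.nonzero(te)[0]:
            # 1-NN prediction from the training folds
            d = np.sum((X[tr] - X[i]) ** 2, axis=1)
            pred = y[tr][np.argmin(d)]
            noisy[i] = pred != y[i]
    return noisy   # True -> candidate noisy instance, to be filtered out
```

Robust filtering would instead train and test on the same set, and filtering and polish would replace the flagged labels with the model's predictions rather than discarding the instances.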
Since the data set contained historical software project data, it was not possible to know the real extent of noise detected. This led to the production of simulated software engineering data sets, which were modelled on the real data set used in the previous evaluations to ensure domain specific characteristics. These simulated versions of the data set were then injected with noise, such that the real extent of the noise was known. After the noise injection the three noise handling techniques were applied to allow evaluation. This procedure of simulating software engineering data sets combined the incorporation of domain specific characteristics of the real world with the control over the simulated data. This is seen as a special strength of this evaluation approach.
The results of the evaluation on the simulated data showed that none of the techniques performed well. Robust filtering and filtering and polish performed very poorly and, based on the results of this evaluation, would not be recommended for the task of noise reduction. Predictive filtering was the best-performing technique in this evaluation, but it did not perform particularly well either.
An exhaustive systematic literature review has been carried out investigating to what extent the empirical software engineering community has considered data quality. The findings showed that the issue of data quality has been largely neglected by the empirical software engineering community.
The work in this thesis highlights an important gap in empirical software engineering. It provides clarification of, and distinction between, the terms noise and outliers. Noise and outliers overlap, but they are fundamentally different. Since noise and outliers are often treated the same by noise handling techniques, a clarification of the two terms was necessary.
To investigate the capabilities of noise handling techniques, a single investigation was deemed insufficient. The reasons for this are that the distinction between noise and outliers is not trivial, and that the investigated noise cleaning techniques are derived from traditional noise handling techniques in which noise and outliers are combined. Therefore, three investigations were undertaken to assess the effectiveness of the three presented noise handling techniques. Each investigation should be seen as part of a multi-pronged approach.
This thesis also highlights possible shortcomings of current automated noise handling techniques. The poor performance of the three techniques led to the conclusion that noise handling should be integrated into a data cleaning process in which the input of domain knowledge and the replicability of the cleaning process are ensured.
DECODE: Deep Confidence Network for Robust Image Classification
The recent years have witnessed the success of deep convolutional neural networks for image classification and many related tasks. It should be pointed out that the existing training strategies assume there is a clean dataset for model learning. In elaborately constructed benchmark datasets, deep networks have yielded promising performance under this assumption. However, in real-world applications, it is burdensome and expensive to collect sufficient clean training samples. On the other hand, collecting noisily labeled samples is much more economical and practical, especially with the rapidly increasing amount of visual data on the Web. Unfortunately, the accuracy of current deep models may drop dramatically even with 5% to 10% label noise. Therefore, enabling label-noise-resistant classification has become a crucial issue in data-driven deep learning approaches. In this paper, we propose a DEep COnfiDEnce network, DECODE, to address this issue. In particular, based on the distribution of mislabeled data, we adopt a confidence evaluation module which is able to determine the confidence that a sample is mislabeled. With this confidence, we further use a weighting strategy to assign different weights to different samples so that the model pays less attention to low-confidence data, which is more likely to be noisy. In this way, the deep model is more robust to label noise. DECODE is designed to be general so that it can be easily combined with existing architectures. We conduct extensive experiments on several datasets, and the results validate that DECODE can improve the accuracy of deep models trained with noisy data.
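The weighting strategy can be sketched as a confidence-scaled cross-entropy. The confidence values below are taken as given; in DECODE they come from the learned confidence evaluation module, and the function name is ours:

```python
import numpy as np

def confidence_weighted_loss(probs, labels, confidence):
    """Per-sample cross-entropy scaled by the estimated confidence that
    each label is correct, so low-confidence (likely mislabeled) samples
    contribute little to the training signal."""
    eps = 1e-12
    ce = -np.log(probs[np.arange(len(labels)), labels] + eps)
    return np.sum(confidence * ce) / (np.sum(confidence) + eps)
```

Zeroing the confidence of a mislabeled sample removes its large loss term, which is what keeps the model from being pulled toward the wrong label.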