Identifying Mislabeled Training Data
This paper presents a new approach to identifying and eliminating mislabeled
training instances for supervised learning. The goal of this approach is to
improve classification accuracies produced by learning algorithms by improving
the quality of the training data. Our approach uses a set of learning
algorithms to create classifiers that serve as noise filters for the training
data. We evaluate single-algorithm, majority-vote, and consensus filters on five
datasets that are prone to labeling errors. Our experiments illustrate that
filtering significantly improves classification accuracy for noise levels up to
30 percent. An analytical and empirical evaluation of the precision of our
approach shows that consensus filters are conservative at throwing away good
data at the expense of retaining bad data and that majority filters are better
at detecting bad data at the expense of throwing away good data. This suggests
that for situations in which there is a paucity of data, consensus filters are
preferable, whereas majority vote filters are preferable for situations with an
abundance of data.
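The filtering scheme described above translates directly into code. Below is a minimal sketch using scikit-learn; the paper does not prescribe a library, and the three base learners here are illustrative choices, not necessarily the ones evaluated in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def filter_noise(X, y, mode="majority", cv=5):
    """Remove suspected mislabeled instances using an ensemble of filters."""
    learners = [
        DecisionTreeClassifier(random_state=0),
        KNeighborsClassifier(n_neighbors=5),
        LogisticRegression(max_iter=1000),
    ]
    # Out-of-fold predictions: each instance is judged by models that
    # never saw it during training.
    preds = np.stack([cross_val_predict(clf, X, y, cv=cv) for clf in learners])
    votes_against = (preds != y).sum(axis=0)
    # Majority filter: flag if most learners disagree with the label.
    # Consensus filter: flag only if every learner disagrees.
    threshold = len(learners) if mode == "consensus" else len(learners) // 2 + 1
    suspect = votes_against >= threshold
    return X[~suspect], y[~suspect]
```

Consistent with the precision analysis, the unanimity threshold of the consensus filter rarely discards good data but lets more bad labels through, while the majority threshold does the opposite.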
Cost-sensitive elimination of mislabeled training data
© 2017 Elsevier Inc. Accurately labeling training data plays a critical role in supervised learning tasks. Since labels in practical applications may be erroneous for various reasons, a wide range of algorithms have been developed to eliminate mislabeled data. These algorithms can make two types of errors: identifying a noise-free instance as mislabeled, or identifying a mislabeled instance as noise-free. These errors incur different costs depending on the training dataset and the application. However, the cost variations are usually ignored, so existing methods are not cost-optimal. In this work, the novel problem of cost-sensitive mislabeled-data filtering is studied. By wrapping a cost-minimizing procedure, we propose a prototype cost-sensitive ensemble-learning-based mislabeled-data filtering algorithm, named CSENF. Building on CSENF, we further propose two novel algorithms: the cost-sensitive repeated majority filtering algorithm (CSRMF) and the cost-sensitive repeated consensus filtering algorithm (CSRCF). Compared to CSENF, these two algorithms estimate the mislabeling probability of each training instance more confidently, and therefore incur less cost than CSENF and cost-blind mislabeling filters. Empirical and theoretical evaluations on a set of benchmark datasets illustrate the superior performance of the proposed methods.
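The CSENF, CSRMF, and CSRCF algorithms themselves are not spelled out in this abstract, but the cost-minimizing decision they wrap can be sketched: given an estimated mislabeling probability for an instance, discard it only when the expected cost of keeping a bad label exceeds the expected cost of losing a good one. The cost parameters below are illustrative placeholders, not the paper's notation.

```python
def should_discard(p_mislabeled: float,
                   cost_retain_bad: float,
                   cost_drop_good: float) -> bool:
    """Cost-sensitive filtering decision for one training instance.

    Keeping the instance costs cost_retain_bad if it is mislabeled
    (probability p_mislabeled); dropping it costs cost_drop_good if it
    was actually clean (probability 1 - p_mislabeled). Discard when the
    expected cost of keeping exceeds the expected cost of dropping.
    """
    expected_keep = p_mislabeled * cost_retain_bad
    expected_drop = (1.0 - p_mislabeled) * cost_drop_good
    return expected_keep > expected_drop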
Classification with class noises through probabilistic sampling
© 2017 Accurately labeling training data plays a critical role in supervised learning tasks. Because labels in practical applications may be erroneous for various reasons, a wide range of algorithms have been developed to identify and remove mislabeled data. In essence, these algorithms adopt a strategy of one-zero sampling (OSAM), wherein a sample is selected and retained only if it is recognized as clean. OSAM makes two types of errors: identifying a clean sample as mislabeled and discarding it, or identifying a mislabeled sample as clean and retaining it. Both errors can degrade classification performance. To improve classification accuracy, this paper proposes a novel probabilistic sampling (PSAM) scheme. In PSAM, a cleaner sample has a greater chance of being selected, where the degree of cleanliness is measured by the confidence in the label. To estimate the confidence value accurately, a probabilistic multiple-voting idea is proposed that assigns a high confidence value to a clean sample and a low confidence value to a mislabeled sample. Finally, we demonstrate that PSAM effectively improves classification accuracy over existing OSAM methods.
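As a sketch of the two ideas in this abstract: the confidence of each observed label can come from soft multiple voting (averaging the probability an ensemble assigns to that label), and selection then becomes a coin flip weighted by that confidence rather than a hard keep/discard rule. Function and array names below are illustrative, not the paper's.

```python
import numpy as np

def label_confidence(probas, y):
    """Soft multiple voting: average, over ensemble members, the predicted
    probability of the observed label. probas has shape
    (n_members, n_samples, n_classes), columns indexed by class id."""
    return probas[:, np.arange(len(y)), y].mean(axis=0)

def probabilistic_sample(X, y, confidence, seed=0):
    """PSAM-style selection: retain each instance with probability equal
    to its label confidence, so cleaner samples survive more often."""
    rng = np.random.default_rng(seed)
    keep = rng.random(len(y)) < confidence
    return X[keep], y[keep]
```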
Implementable Quantum Classifier for Nonlinear Data
In this Letter, we propose a quantum machine learning scheme for the classification of classical nonlinear data. The main ingredients of our method are a variational quantum perceptron (VQP) and a quantum generalization of classical ensemble learning. Our VQP employs parameterized quantum circuits to learn a Grover search (or amplitude amplification) operation with classical optimization, and can achieve a quadratic speedup in query complexity compared to its classical counterparts. We show how the trained VQP can be used to predict future data with […] query complexity. Ultimately, a stronger nonlinear classifier, the so-called quantum ensemble learning (QEL) classifier, can be established by combining a set of weak VQPs produced using a subsampling method. The subsampling method has two significant advantages. First, all weak VQPs employed in QEL can be trained in parallel; therefore, the query complexity of QEL equals that of each weak VQP multiplied by […]. Second, it dramatically reduces the runtime complexity of the encoding circuits that map classical data to a quantum state, because each subsampled dataset can be significantly smaller than the original dataset given to QEL. This arguably provides a satisfactory solution to one of the most criticized issues in quantum machine learning proposals. To conclude, we perform two numerical experiments for our VQP and QEL, implemented in Python with the pyQuil library. Our experiments show that excellent performance can be achieved using a very small quantum circuit size that is implementable under current quantum hardware development. Specifically, given a nonlinear synthetic dataset with […] features per example, the trained QEL can classify test examples sampled away from the decision boundaries using single- and two-qubit quantum gates with […] accuracy.
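The quantum subroutines are beyond a short sketch, but the subsampling ensemble that defines QEL has a simple classical skeleton, shown below. Here weak_fit and weak_predict stand in for training and querying one VQP; they are assumptions for illustration, not the paper's interface.

```python
import numpy as np

def train_qel_like(X, y, n_members, subsample_frac, weak_fit, seed=0):
    """Subsampling ensemble: each weak learner (a VQP in the paper) is
    fit on an independent small subsample. Members are independent, so
    they can be trained in parallel, and each data-encoding circuit only
    has to load a small subsample rather than the full dataset."""
    rng = np.random.default_rng(seed)
    size = max(1, int(subsample_frac * len(y)))
    members = []
    for _ in range(n_members):
        idx = rng.choice(len(y), size=size, replace=False)
        members.append(weak_fit(X[idx], y[idx]))
    return members

def qel_predict(members, weak_predict, X):
    """Combine weak +/-1 votes by the sign of their average."""
    votes = np.stack([weak_predict(m, X) for m in members])
    return np.sign(votes.mean(axis=0))
```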
Improved label noise identification by exploiting unlabeled data
© 2017 IEEE. In machine learning, the available training samples are not always perfect: some labels are corrupted, and these are called label noises. Label noise can reduce accuracy and increase model complexity. To mitigate its detrimental effect, noise filtering is widely used to identify label noises and remove them prior to learning. Almost all existing works focus only on the mislabeled training set and ignore the existence of unlabeled data, even though unlabeled data are easily accessible in many applications. In this work, we explore how to utilize these unlabeled data to strengthen the noise-filtering effect. To this end, we propose a method named MFUDCM (Multiple Filtering with the aid of Unlabeled Data using Confidence Measurement). The method applies a novel multiple soft majority-voting idea to make use of unlabeled data. In addition, MFUDCM is expected to identify mislabeled data more accurately by using the concept of multiple voting. Finally, the validity of MFUDCM is confirmed by experiments and by comparisons with other methods.
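The abstract does not give MFUDCM's exact procedure; the sketch below illustrates one plausible reading of "multiple soft voting aided by unlabeled data": pseudo-label the unlabeled pool with an initial model, then let several bootstrap models vote softly on each observed label and flag low-confidence labels. All names and thresholds are illustrative assumptions.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

def soft_vote_filter(X, y, X_unlabeled, n_views=5, threshold=0.5, seed=0):
    """Return a boolean mask of suspected label noises in (X, y)."""
    rng = np.random.default_rng(seed)
    # Step 1: pseudo-label the unlabeled pool to enlarge the training set.
    base = LogisticRegression(max_iter=1000).fit(X, y)
    X_all = np.vstack([X, X_unlabeled])
    y_all = np.concatenate([y, base.predict(X_unlabeled)])

    # Step 2: soft multiple voting over bootstrap replicates.
    confidence = np.zeros(len(y))
    for _ in range(n_views):
        idx = rng.choice(len(y_all), size=len(y_all), replace=True)
        clf = clone(base).fit(X_all[idx], y_all[idx])
        cols = np.searchsorted(clf.classes_, y)  # map labels to columns
        confidence += clf.predict_proba(X)[np.arange(len(y)), cols]
    confidence /= n_views
    return confidence < threshold  # True -> suspected mislabeled
```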
A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels
The recent success of deep neural networks is powered in part by large-scale
well-labeled training data. However, it is a daunting task to laboriously
annotate an ImageNet-like dataset. In contrast, it is fairly convenient,
fast, and cheap to collect training images from the Web along with their noisy
labels. This signifies the need for alternative approaches to training deep
neural networks using such noisy labels. Existing methods tackling this problem
either try to identify and correct the wrong labels or reweight the data terms
in the loss function according to the inferred noise rates. Both strategies
inevitably incur errors for some of the data points. In this paper, we contend
that it is actually better to ignore the labels of some of the data points than
to keep them if the labels are incorrect, especially when the noise rate is
high. After all, wrong labels can mislead a neural network into a bad local
optimum. We suggest a two-stage framework for learning from noisy labels.
In the first stage, we identify a small portion of images from the noisy
training set of which the labels are correct with a high probability. The noisy
labels of the other images are ignored. In the second stage, we train a deep
neural network in a semi-supervised manner. This framework effectively takes
advantage of the whole training set and yet only a portion of its labels that
are most likely correct. Experiments on three datasets verify the effectiveness
of our approach, especially when the noise rate is high.
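A condensed sketch of the two stages, with deliberately simple stand-ins: out-of-fold confidence as the stage-one selection criterion and plain self-training as the stage-two semi-supervised learner. The paper's actual network and semi-supervised method are not specified in this abstract.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def two_stage_train(X, y, conf_threshold=0.9, cv=5):
    # Stage 1: trust a label only if out-of-fold predictions assign it
    # a high probability; ignore the labels of everything else.
    proba = cross_val_predict(LogisticRegression(max_iter=1000),
                              X, y, cv=cv, method="predict_proba")
    cols = np.searchsorted(np.unique(y), y)
    trusted = proba[np.arange(len(y)), cols] >= conf_threshold

    # Stage 2: semi-supervised step (here: simple self-training) that
    # uses the whole training set but only the trusted labels.
    model = LogisticRegression(max_iter=1000).fit(X[trusted], y[trusted])
    y_pseudo = model.predict(X[~trusted])
    X_all = np.vstack([X[trusted], X[~trusted]])
    y_all = np.concatenate([y[trusted], y_pseudo])
    return model.fit(X_all, y_all)
```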