Active Learning from Imperfect Labelers
We study active learning where the labeler can not only return incorrect
labels but also abstain from labeling. We consider different noise and
abstention conditions of the labeler. We propose an algorithm which utilizes
abstention responses, and analyze its statistical consistency and query
complexity under fairly natural assumptions on the noise and abstention rate of
the labeler. This algorithm is adaptive in the sense that it automatically
requests fewer queries from a more informed or less noisy labeler. We couple our
algorithm with lower bounds to show that, under some technical conditions, it
achieves nearly optimal query complexity.
Comment: To appear in NIPS 201
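As an illustrative sketch (not the paper's algorithm), the idea that abstentions are themselves informative can be seen in a toy one-dimensional threshold search: an abstention signals that the queried point lies near the decision boundary, so the learner narrows its search interval instead of discarding the response. The function names and the abstention model below are assumptions for illustration only.

```python
def locate_threshold(labeler, lo=0.0, hi=1.0, budget=30):
    """Binary search for a 1-D decision threshold with a labeler that may
    abstain. An abstention (None) says the queried point lies near the
    boundary, so we shrink the interval around it rather than discarding
    the response. The final error is roughly bounded by the width of the
    labeler's abstention band."""
    for _ in range(budget):
        mid = (lo + hi) / 2.0
        y = labeler(mid)              # 0, 1, or None (abstain)
        if y is None:
            width = (hi - lo) / 4.0   # narrow symmetrically around mid
            lo, hi = mid - width, mid + width
        elif y == 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Hypothetical labeler: true threshold at 0.6, abstains within 0.02 of it.
def toy_labeler(x):
    return None if abs(x - 0.6) < 0.02 else int(x > 0.6)

estimate = locate_threshold(toy_labeler)  # close to 0.6
```

Note how the abstention branch converges faster than ignoring abstentions would: each abstention quarters the interval instead of merely halving it, which mirrors the paper's point that an adaptive learner can issue fewer queries when the labeler is more informative.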
A Full Probabilistic Model for Yes/No Type Crowdsourcing in Multi-Class Classification
Crowdsourcing has become widely used in supervised scenarios where training
sets are scarce and difficult to obtain. Most crowdsourcing models in the
literature assume labelers can provide answers to full questions. In
classification contexts, full questions require a labeler to discern among all
possible classes. Unfortunately, discernment is not always easy in realistic
scenarios. Labelers may not be experts in differentiating all classes. In this
work, we provide a full probabilistic model for a shorter type of query. Our
shorter queries require only "yes" or "no" responses. Our model estimates a
joint posterior distribution over matrices describing labelers' confusions,
together with the posterior probability of the class of every object. We
developed an approximate inference approach that combines Monte Carlo sampling
with Black Box Variational Inference, deriving the necessary gradients. We
built two realistic crowdsourcing scenarios to test our model. The first
scenario involves queries about irregular astronomical time series. The second
scenario relies on the image classification of animals. We achieved results
that are comparable with those of full-query crowdsourcing. Furthermore, we
show that modeling labelers' failures plays an important role in estimating
true classes. Finally, we provide the community with two real datasets obtained
from our crowdsourcing experiments. All our code is publicly available.
Comment: SIAM International Conference on Data Mining (SDM19), 9 official pages, 5 supplementary pages
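The yes/no aggregation step can be sketched as follows. This is a deliberate simplification, not the paper's full joint model: instead of inferring the labelers' confusion parameters with Black Box Variational Inference, it fixes each labeler's true- and false-positive rates and computes the class posterior by Bayes' rule. All function and parameter names are hypothetical.

```python
import numpy as np

def class_posterior(prior, answers, tpr, fpr):
    """Posterior over K classes from yes/no crowd answers.

    prior   : (K,) prior class probabilities
    answers : list of (labeler, asked_class, response), response in {0, 1}
    tpr[l]  : P(labeler l answers "yes" | asked class IS the true class)
    fpr[l]  : P(labeler l answers "yes" | asked class is NOT the true class)
    """
    log_post = np.log(np.asarray(prior, dtype=float))
    for l, k, r in answers:
        for c in range(len(log_post)):
            p_yes = tpr[l] if c == k else fpr[l]
            log_post[c] += np.log(p_yes if r == 1 else 1.0 - p_yes)
    post = np.exp(log_post - log_post.max())  # stable normalization
    return post / post.sum()

# One fairly reliable labeler (hypothetical rates), three classes.
post = class_posterior(prior=[1/3, 1/3, 1/3],
                       answers=[(0, 1, 1),   # asked "class 1?", said yes
                                (0, 0, 0)],  # asked "class 0?", said no
                       tpr=[0.9], fpr=[0.1])
```

Two short answers already concentrate most of the posterior mass on class 1, which illustrates why modeling each labeler's failure rates matters when aggregating partial yes/no evidence.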
Repeated Labeling Using Multiple Noisy Labelers
This paper addresses the repeated acquisition of labels for data items
when the labeling is imperfect. We examine the improvement (or lack
thereof) in data quality via repeated labeling, and focus especially on
the improvement of training labels for supervised induction. With the
outsourcing of small tasks becoming easier, for example via Amazon's
Mechanical Turk, it often is possible to obtain less-than-expert
labeling at low cost. With low-cost labeling, preparing the unlabeled
part of the data can become considerably more expensive than labeling.
We present repeated-labeling strategies of increasing complexity, and
show several main results. (i) Repeated-labeling can improve label
quality and model quality, but not always. (ii) When labels are noisy,
repeated labeling can be preferable to single labeling even in the
traditional setting where labels are not particularly cheap. (iii) As
soon as the cost of processing the unlabeled data is not free, even the
simple strategy of labeling everything multiple times can give
considerable advantage. (iv) Repeatedly labeling a carefully chosen set
of points is generally preferable, and we present a set of robust
techniques that combine different notions of uncertainty to select data
points for which quality should be improved. The bottom line: the
results show clearly that when labeling is not perfect, selective
acquisition of multiple labels is a strategy that data miners should
have in their repertoire. For certain label-quality/cost regimes, the
benefit is substantial.
This work was supported by the National Science Foundation under Grant
No. IIS-0643846, by an NSERC Postdoctoral Fellowship, and by an NEC
Faculty Fellowship.
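The core trade-off behind result (i) — repeated labeling improves quality, "but not always" — has a simple closed form for binary labels under the idealized assumption of independent, equally accurate labelers (an assumption for illustration, not the paper's empirical setting): the probability that a majority vote over n labels is correct.

```python
from math import comb

def majority_correct(p, n):
    """Probability that a majority vote over n independent binary labels,
    each correct with probability p, returns the correct label (n odd)."""
    need = n // 2 + 1  # votes needed for a majority
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(need, n + 1))
```

For p = 0.7, five labels lift per-item accuracy to about 0.84; for p = 0.4, majority voting over five labels is worse than a single label. This is exactly why repeated labeling can help or hurt depending on the label-quality regime.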
Active Learning with Noisy Labelers for Improving Classification Accuracy of Connected Vehicles
Machine learning has emerged as a promising paradigm for enabling connected, automated vehicles to autonomously cruise the streets and react to unexpected situations. Reacting to such situations requires accurate classification of uncommon events, which in turn depends on the selection of large, diverse, and high-quality training data. In fact, the data available at a vehicle (e.g., photos of road signs) may be affected by errors or have different levels of resolution and freshness. To tackle this challenge, we propose an active learning framework that, leveraging the information collected through onboard sensors as well as received from other vehicles, effectively deals with scarce and noisy data. Given the information received from neighboring vehicles, our solution: (i) selects which vehicles can reliably generate high-quality training data, and (ii) obtains a reliable subset of data to add to the training set by trading off between two essential features, i.e., quality and diversity. The results, obtained with different real-world datasets, demonstrate that our framework significantly outperforms state-of-the-art solutions, providing high classification accuracy with a limited bandwidth requirement for the data exchange between vehicles.
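A minimal sketch of the quality/diversity trade-off in step (ii), assuming a simple additive score (the abstract does not specify the actual selection rule, so this is an illustrative stand-in; `features`, `quality`, and `alpha` are hypothetical names):

```python
import numpy as np

def select_batch(features, quality, k, alpha=0.5):
    """Greedily pick k samples, scoring each candidate as a weighted sum of
    its estimated label quality and its distance to the closest sample
    already selected (a diversity bonus)."""
    selected = []
    for _ in range(k):
        best_i, best_score = None, -np.inf
        for i in range(len(features)):
            if i in selected:
                continue
            div = (min(np.linalg.norm(features[i] - features[j])
                       for j in selected)
                   if selected else 1.0)
            score = alpha * quality[i] + (1 - alpha) * div
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected

# Two near-duplicate points near the origin, two far away.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.1]])
qual = [0.9, 0.8, 0.7, 0.6]
batch = select_batch(feats, qual, k=2)
```

Quality alone would pick the two near-duplicates (indices 0 and 1); the diversity term instead trades a little quality for coverage of the feature space, which is the essence of selecting a small, informative subset under a bandwidth budget.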