Exploiting Class Learnability in Noisy Data
In many domains, collecting sufficient labeled training data for supervised
machine learning requires easily accessible but noisy sources, such as
crowdsourcing services or tagged Web data. Noisy labels occur frequently in
data sets harvested via these means, sometimes resulting in entire classes of
data on which learned classifiers generalize poorly. For real-world
applications, we argue that it can be beneficial to avoid training on such
classes entirely. In this work, we aim to explore the classes in a given data
set, and guide supervised training to spend time on a class proportional to its
learnability. By focusing the training process, we aim to improve model
generalization on classes with a strong signal. To that end, we develop an
online algorithm that works in conjunction with a classifier and its training
algorithm, iteratively selecting training data for the classifier based on how
well it appears to generalize on each class. Testing our approach on a variety
of data sets, we show our algorithm learns to focus on classes for which the
model has low generalization error relative to strong baselines, yielding a
classifier with good performance on learnable classes.
Comment: Accepted to AAAI 201
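
The idea of spending training time on a class in proportion to its learnability can be illustrated with a minimal sketch. This is not the paper's algorithm: it is a simplified stand-in that assumes per-class held-out accuracy is used as the learnability estimate, and the names `class_sampling_weights` and `select_batch` are hypothetical.

```python
import random


def class_sampling_weights(val_accuracy):
    """Map per-class held-out accuracy to sampling probabilities.

    Assumption: accuracy on held-out data stands in for "learnability".
    Classes with higher accuracy receive proportionally more weight.
    """
    total = sum(val_accuracy.values())
    if total == 0:
        # No class shows any signal yet: fall back to uniform sampling.
        return {c: 1.0 / len(val_accuracy) for c in val_accuracy}
    return {c: a / total for c, a in val_accuracy.items()}


def select_batch(examples_by_class, weights, batch_size, rng):
    """Draw a training batch, choosing each example's class by its weight.

    Assumes every class in `weights` has at least one example.
    """
    classes = list(weights)
    probs = [weights[c] for c in classes]
    chosen = rng.choices(classes, weights=probs, k=batch_size)
    return [(rng.choice(examples_by_class[c]), c) for c in chosen]


# Toy data: "cat" looks learnable, "noise" does not.
examples_by_class = {"cat": ["c1", "c2", "c3"], "noise": ["n1", "n2"]}
weights = class_sampling_weights({"cat": 0.9, "noise": 0.1})
batch = select_batch(examples_by_class, weights, batch_size=8, rng=random.Random(0))
```

In an online setting, one would retrain on each selected batch, re-estimate the per-class accuracies, and recompute the weights, so that sampling drifts toward classes on which the model generalizes well.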