One-Class Classification: Taxonomy of Study and Review of Techniques
One-class classification (OCC) algorithms aim to build classification models
when the negative class is either absent, poorly sampled or not well defined.
This situation constrains the learning of efficient classifiers, because the
class boundary must be defined using only knowledge of the positive class. The OCC
problem has been considered and applied under many research themes, such as
outlier/novelty detection and concept learning. In this paper we present a
unified view of the general problem of OCC by presenting a taxonomy of study
for OCC problems, which is based on the availability of training data,
the algorithms used and the application domains. We further delve into each
of the categories of the proposed taxonomy and present a comprehensive
literature review of the OCC algorithms, techniques and methodologies with a
focus on their significance, limitations and applications. We conclude our
paper by discussing some open research problems in the field of OCC and present
our vision for future research.
Comment: 24 pages + 11 pages of references, 8 figures
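To make the idea of learning a boundary from positive examples alone concrete, here is a minimal sketch. It is not one of the techniques surveyed in the paper: it assumes a simple centroid-plus-distance-threshold rule, and the names `fit_one_class`, `is_positive` and the `quantile` parameter are hypothetical.

```python
import math

def fit_one_class(positives, quantile=0.95):
    """Fit a centroid and a distance threshold from positive examples only."""
    dim = len(positives[0])
    centroid = [sum(p[i] for p in positives) / len(positives) for i in range(dim)]
    dists = sorted(math.dist(p, centroid) for p in positives)
    # Threshold: the training distance at the chosen quantile (assumed rule).
    idx = min(len(dists) - 1, int(quantile * len(dists)))
    return centroid, dists[idx]

def is_positive(x, model):
    """Accept x if it lies within the learned distance of the centroid."""
    centroid, threshold = model
    return math.dist(x, centroid) <= threshold

# Toy data: no negative examples are used anywhere during fitting.
positives = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.2), (0.9, 0.8)]
model = fit_one_class(positives, quantile=1.0)
near = is_positive((1.05, 1.0), model)
far = is_positive((3.0, 3.0), model)
```

Lowering `quantile` tightens the boundary at the cost of rejecting some genuine positives, which is the usual trade-off when no negative data is available to calibrate against.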
Get Another Label? Improving Data Quality and Data Mining
This paper addresses the repeated acquisition of labels for data items when the labeling is imperfect. We examine the improvement (or lack thereof) in data quality via repeated labeling, and focus especially on the improvement of training labels for supervised induction. With the outsourcing of small tasks becoming easier, for example via Rent-A-Coder or Amazon's Mechanical Turk, it often is possible to obtain less-than-expert labeling at low cost. With low-cost labeling, preparing the unlabeled part of the data can become considerably more expensive than labeling. We present repeated-labeling strategies of increasing complexity and show several main results: (i) Repeated-labeling can improve label and model quality, but not always. (ii) When labels are noisy, repeated labeling can be preferable to single labeling even in the traditional setting where labels are not particularly cheap. (iii) As soon as the cost of processing the unlabeled data is not free, even the simple strategy of labeling everything multiple times can give considerable advantage. (iv) Repeatedly labeling a carefully chosen set of points is generally preferable, and we present a robust technique that combines different notions of uncertainty to select data points for which quality should be improved. The bottom line: the results show clearly that when labeling is not perfect, selective acquisition of multiple labels is a strategy that data miners should have in their repertoire; for certain label-quality/cost regimes, the benefit is substantial.
NYU, Stern School of Business, IOMS Department, Center for Digital Economy Research
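Result (i), that repeated labeling helps "but not always", can be illustrated with a short calculation. The sketch below assumes binary labels, independent labelers that all share the same accuracy `p`, and majority voting over an odd number `n` of labels; these simplifications and the function name `majority_vote_accuracy` are assumptions for illustration, not the paper's experimental setup.

```python
from math import comb

def majority_vote_accuracy(p, n):
    """Probability that the majority of n independent binary labels is correct,
    when each label is correct with probability p (n must be odd)."""
    if n % 2 == 0:
        raise ValueError("use an odd n to avoid ties")
    # Sum the binomial probabilities of more than half the labels being correct.
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(n // 2 + 1, n + 1))

good = majority_vote_accuracy(0.7, 3)   # labelers better than chance
bad = majority_vote_accuracy(0.4, 3)    # labelers worse than chance
```

Under these assumptions, three labels from 70%-accurate labelers give a more reliable majority label than a single label, whereas three labels from 40%-accurate labelers give a less reliable one.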
Repeated Labeling Using Multiple Noisy Labelers
This paper addresses the repeated acquisition of labels for data items
when the labeling is imperfect. We examine the improvement (or lack
thereof) in data quality via repeated labeling, and focus especially on
the improvement of training labels for supervised induction. With the
outsourcing of small tasks becoming easier, for example via Amazon's
Mechanical Turk, it often is possible to obtain less-than-expert
labeling at low cost. With low-cost labeling, preparing the unlabeled
part of the data can become considerably more expensive than labeling.
We present repeated-labeling strategies of increasing complexity, and
show several main results. (i) Repeated-labeling can improve label
quality and model quality, but not always. (ii) When labels are noisy,
repeated labeling can be preferable to single labeling even in the
traditional setting where labels are not particularly cheap. (iii) As
soon as the cost of processing the unlabeled data is not free, even the
simple strategy of labeling everything multiple times can give
considerable advantage. (iv) Repeatedly labeling a carefully chosen set
of points is generally preferable, and we present a set of robust
techniques that combine different notions of uncertainty to select data
points for which quality should be improved. The bottom line: the
results show clearly that when labeling is not perfect, selective
acquisition of multiple labels is a strategy that data miners should
have in their repertoire. For certain label-quality/cost regimes, the
benefit is substantial.
This work was supported by the National Science Foundation under Grant
No. IIS-0643846, by an NSERC Postdoctoral Fellowship, and by an NEC
Faculty Fellowship.
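Result (iv), selective relabeling, can be sketched as picking the item whose current set of labels is most uncertain. The score used here, the entropy of a Laplace-smoothed vote fraction, is a stand-in chosen for brevity and is not the technique presented in the paper, which combines several notions of uncertainty. The names `vote_entropy` and `pick_next` and the toy vote counts are hypothetical.

```python
import math

def vote_entropy(pos, neg):
    """Entropy (in bits) of the Laplace-smoothed positive fraction of the votes."""
    q = (pos + 1) / (pos + neg + 2)  # smoothing keeps q strictly inside (0, 1)
    return -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

def pick_next(votes):
    """Return the id of the item whose current label is most uncertain."""
    return max(votes, key=lambda item: vote_entropy(*votes[item]))

# Toy data: item id -> (positive votes, negative votes) collected so far.
votes = {"a": (5, 0), "b": (2, 2), "c": (3, 1)}
chosen = pick_next(votes)
```

With these counts the evenly split item `"b"` is selected, so the next label is spent where the existing labels disagree most rather than on an item that is already labeled consistently.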
A review of domain adaptation without target labels
Domain adaptation has become a prominent problem setting in machine learning
and related fields. This review asks the question: how can a classifier learn
from a source domain and generalize to a target domain? We present a
categorization of approaches, divided into what we refer to as sample-based,
feature-based and inference-based methods. Sample-based methods focus on
weighting individual observations during training based on their importance to
the target domain. Feature-based methods revolve around mapping, projecting
and representing features such that a source classifier performs well on the
target domain. Inference-based methods incorporate adaptation into the
parameter estimation procedure, for instance through constraints on the
optimization procedure. Additionally, we review a number of conditions that
allow for formulating bounds on the cross-domain generalization error. Our
categorization highlights recurring ideas and raises questions important to
further research.
Comment: 20 pages, 5 figures
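The sample-based idea of weighting observations by their importance to the target domain can be sketched with density-ratio weights, w(x) = p_target(x) / p_source(x). The example below assumes one-dimensional Gaussian source and target densities that are known exactly, which is rarely the case in practice, where the ratio has to be estimated. All names and numbers are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

def importance_weights(xs, source, target):
    """w(x) = p_target(x) / p_source(x) for each source sample x."""
    return [gaussian_pdf(x, *target) / gaussian_pdf(x, *source) for x in xs]

def weighted_error(losses, weights):
    """Importance-weighted average loss, normalised by the sum of weights."""
    return sum(l * w for l, w in zip(losses, weights)) / sum(weights)

source = (0.0, 1.0)           # assumed (mean, std) of the source domain
target = (1.0, 1.0)           # assumed (mean, std) of the target domain
xs = [-1.0, 0.0, 1.0, 2.0]    # source samples
losses = [0, 0, 1, 1]         # hypothetical per-sample 0/1 losses of a classifier
weights = importance_weights(xs, source, target)
estimate = weighted_error(losses, weights)
```

Source samples that are more typical of the target domain receive larger weights, so the weighted estimate leans on the classifier's mistakes in the region where the target data concentrates, and here it comes out higher than the unweighted source error of 0.5.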