16 research outputs found
Locally Non-linear Embeddings for Extreme Multi-label Learning
The objective in extreme multi-label learning is to train a classifier that
can automatically tag a novel data point with the most relevant subset of
labels from an extremely large label set. Embedding based approaches make
training and prediction tractable by assuming that the training label matrix is
low-rank and hence the effective number of labels can be reduced by projecting
the high dimensional label vectors onto a low dimensional linear subspace.
Still, leading embedding approaches have been unable to deliver high prediction
accuracies or scale to large problems as the low rank assumption is violated in
most real world applications.
This paper develops the X-One classifier to address both limitations. The
main technical contribution in X-One is a formulation for learning a small
ensemble of local distance preserving embeddings which can accurately predict
infrequently occurring (tail) labels. This allows X-One to break free of the
traditional low-rank assumption and boost classification accuracy by learning
embeddings which preserve pairwise distances between only the nearest label
vectors.
We conducted extensive experiments on several real-world as well as benchmark
data sets and compared our method against state-of-the-art methods for extreme
multi-label classification. Experiments reveal that X-One can make
significantly more accurate predictions then the state-of-the-art methods
including both embeddings (by as much as 35%) as well as trees (by as much as
6%). X-One can also scale efficiently to data sets with a million labels which
are beyond the pale of leading embedding methods
Online and Stochastic Gradient Methods for Non-decomposable Loss Functions
Modern applications in sensitive domains such as biometrics and medicine
frequently require the use of non-decomposable loss functions such as
precision@k, F-measure etc. Compared to point loss functions such as
hinge-loss, these offer much more fine grained control over prediction, but at
the same time present novel challenges in terms of algorithm design and
analysis. In this work we initiate a study of online learning techniques for
such non-decomposable loss functions with an aim to enable incremental learning
as well as design scalable solvers for batch problems. To this end, we propose
an online learning framework for such loss functions. Our model enjoys several
nice properties, chief amongst them being the existence of efficient online
learning algorithms with sublinear regret and online to batch conversion
bounds. Our model is a provable extension of existing online learning models
for point loss functions. We instantiate two popular losses, prec@k and pAUC,
in our model and prove sublinear regret bounds for both of them. Our proofs
require a novel structural lemma over ranked lists which may be of independent
interest. We then develop scalable stochastic gradient descent solvers for
non-decomposable loss functions. We show that for a large family of loss
functions satisfying a certain uniform convergence property (that includes
prec@k, pAUC, and F-measure), our methods provably converge to the empirical
risk minimizer. Such uniform convergence results were not known for these
losses and we establish these using novel proof techniques. We then use
extensive experimentation on real life and benchmark datasets to establish that
our method can be orders of magnitude faster than a recently proposed cutting
plane method.Comment: 25 pages, 3 figures, To appear in the proceedings of the 28th Annual
Conference on Neural Information Processing Systems, NIPS 201
1-Bit Matrix Completion
In this paper we develop a theory of matrix completion for the extreme case
of noisy 1-bit observations. Instead of observing a subset of the real-valued
entries of a matrix M, we obtain a small number of binary (1-bit) measurements
generated according to a probability distribution determined by the real-valued
entries of M. The central question we ask is whether or not it is possible to
obtain an accurate estimate of M from this data. In general this would seem
impossible, but we show that the maximum likelihood estimate under a suitable
constraint returns an accurate estimate of M when ||M||_{\infty} <= \alpha, and
rank(M) <= r. If the log-likelihood is a concave function (e.g., the logistic
or probit observation models), then we can obtain this maximum likelihood
estimate by optimizing a convex program. In addition, we also show that if
instead of recovering M we simply wish to obtain an estimate of the
distribution generating the 1-bit measurements, then we can eliminate the
requirement that ||M||_{\infty} <= \alpha. For both cases, we provide lower
bounds showing that these estimates are near-optimal. We conclude with a suite
of experiments that both verify the implications of our theorems as well as
illustrate some of the practical applications of 1-bit matrix completion. In
particular, we compare our program to standard matrix completion methods on
movie rating data in which users submit ratings from 1 to 5. In order to use
our program, we quantize this data to a single bit, but we allow the standard
matrix completion program to have access to the original ratings (from 1 to 5).
Surprisingly, the approach based on binary data performs significantly better