Sparse Continuous Distributions and Fenchel-Young Losses
Exponential families are widely used in machine learning, including many
distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet,
Poisson, and categorical distributions via the softmax transformation).
Distributions in each of these families have fixed support. In contrast, for
finite domains, recent work on sparse alternatives to softmax (e.g., sparsemax,
α-entmax, and fusedmax) has led to distributions with varying support.
This paper develops sparse alternatives to continuous distributions, based on
several technical contributions: First, we define Ω-regularized
prediction maps and Fenchel-Young losses for arbitrary domains (possibly
countably infinite or continuous). For linearly parametrized families, we show
that minimization of Fenchel-Young losses is equivalent to moment matching of
the statistics, generalizing a fundamental property of exponential families.
When Ω is a Tsallis negentropy with parameter α, we obtain
``deformed exponential families,'' which include α-entmax and sparsemax
(α = 2) as particular cases. For quadratic energy functions, the resulting
densities are β-Gaussians, an instance of elliptical distributions that
contain as particular cases the Gaussian, biweight, triweight, and Epanechnikov
densities, and for which we derive closed-form expressions for the variance,
Tsallis entropy, and Fenchel-Young loss. When Ω is a total variation or
Sobolev regularizer, we obtain a continuous version of the fusedmax. Finally,
we introduce continuous-domain attention mechanisms, deriving efficient
gradient backpropagation algorithms for α ∈ {1, 2}. Using
these algorithms, we demonstrate our sparse continuous distributions for
attention-based audio classification and visual question answering, showing
that they allow attending to time intervals and compact regions.
Comment: JMLR 2022 camera ready version. arXiv admin note: text overlap with arXiv:2006.0721
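The finite-domain maps this paper generalizes are easy to state concretely. Below is a minimal sketch (illustrative, not code from the paper; assumes NumPy) of sparsemax, the α = 2 member of the entmax family mentioned above: a Euclidean projection of a score vector onto the probability simplex, which can assign exactly zero probability to low-scoring entries.

import numpy as np

def sparsemax(z):
    # Project scores z onto the probability simplex (Martins & Astudillo, 2016):
    # the result max(z - tau, 0) can be exactly zero, giving varying support.
    z = np.asarray(z, dtype=float)
    z_sorted = np.sort(z)[::-1]                # scores in decreasing order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    # support size: largest k with 1 + k * z_(k) > sum of the top-k scores
    k_star = k[1 + k * z_sorted > cumsum][-1]
    tau = (cumsum[k_star - 1] - 1.0) / k_star  # normalizing threshold
    return np.maximum(z - tau, 0.0)

print(sparsemax([1.0, 0.9, 0.1]))  # [0.55 0.45 0.  ] -- the tail is pruned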
A Primal-Dual Convergence Analysis of Boosting
Boosting combines weak learners into a predictor with low empirical risk. Its
dual constructs a high entropy distribution upon which weak learners and
training labels are uncorrelated. This manuscript studies this primal-dual
relationship under a broad family of losses, including the exponential loss of
AdaBoost and the logistic loss, revealing:
- Weak learnability aids the whole loss family: for any ε > 0,
O(ln(1/ε)) iterations suffice to produce a predictor with empirical
risk ε-close to the infimum;
- The circumstances granting the existence of an empirical risk minimizer may
be characterized in terms of the primal and dual problems, yielding a new proof
of the known rate O(ln(1/ε));
- Arbitrary instances may be decomposed into the above two, granting rate
O(1/ε), with a matching lower bound provided for the logistic loss.
Comment: 40 pages, 8 figures; the NIPS 2011 submission "The Fast Convergence of Boosting" is a brief presentation of the primary results; compared with the JMLR version, this arXiv version has hyperref and some formatting tweak
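As a toy illustration of this primal-dual view (a sketch under simplifying assumptions, not the manuscript's analysis: the weak learners are given up front as a fixed ±1 prediction matrix H), greedy coordinate descent on the exponential loss recovers AdaBoost, and the example weights w below are exactly the dual distribution on which the chosen weak learner is then decorrelated from the labels.

import numpy as np

def boost_exponential_loss(H, y, rounds=100):
    # H: (n, m) matrix of weak-learner predictions in {-1, +1}
    # y: (n,) labels in {-1, +1}
    n, m = H.shape
    alpha = np.zeros(m)                        # weights of the combined predictor
    for _ in range(rounds):
        w = np.exp(-y * (H @ alpha))           # per-example exponential losses
        w /= w.sum()                           # dual (high-entropy) distribution
        edges = (w * y) @ H                    # correlation of each learner with y under w
        j = np.argmax(np.abs(edges))           # weak learner with the largest edge
        eps = 0.5 * (1.0 - np.abs(edges[j]))   # its weighted error
        if eps <= 0.0 or eps >= 0.5:           # perfect learner, or no edge left
            break
        # exact line search for the exponential loss with ±1-valued learners
        alpha[j] += 0.5 * np.log((1.0 - eps) / eps) * np.sign(edges[j])
    return alpha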
A Modern Introduction to Online Learning
In this monograph, I introduce the basic concepts of Online Learning through
a modern view of Online Convex Optimization. Here, online learning refers to
the framework of regret minimization under worst-case assumptions. I present
first-order and second-order algorithms for online learning with convex losses,
in Euclidean and non-Euclidean settings. All the algorithms are clearly
presented as instantiations of Online Mirror Descent or
Follow-The-Regularized-Leader and their variants. Particular attention is given
to the issue of tuning the parameters of the algorithms and learning in
unbounded domains, through adaptive and parameter-free online learning
algorithms. Non-convex losses are dealt with through convex surrogate losses and
through randomization. The bandit setting is also briefly discussed, touching
on the problem of adversarial and stochastic multi-armed bandits. These notes
do not require prior knowledge of convex analysis and all the required
mathematical tools are rigorously explained. Moreover, all the proofs have been
carefully chosen to be as simple and as short as possible.
Comment: Fixed more typos, added more history bits, added local norms bounds for OMD and FTR
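As a concrete instance of the framework described here (a minimal sketch, assuming NumPy; the quadratic losses in the usage lines are an invented example, not from the monograph), online subgradient descent is Online Mirror Descent with the Euclidean regularizer ψ(x) = ||x||²/2: play x_t, observe the loss, and step against its subgradient.

import numpy as np

def online_gradient_descent(loss_grads, x0, eta):
    # Online Mirror Descent with the Euclidean regularizer: with gradients
    # bounded by G and eta ~ 1/sqrt(T), the regret against any fixed
    # comparator grows as O(sqrt(T)).
    x = np.asarray(x0, dtype=float)
    for grad_t in loss_grads:     # loss_t is revealed only after we play x_t
        g = grad_t(x)             # subgradient of loss_t at the point played
        x = x - eta * g           # gradient/mirror step
    return x

# Invented usage: quadratic losses (x - z_t)^2 arriving one at a time.
targets = [1.0, 0.5, 2.0, 1.5]
x_final = online_gradient_descent(
    [lambda x, z=z: 2.0 * (x - z) for z in targets], x0=0.0, eta=0.1)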
Optimization with Sparsity-Inducing Penalties
Sparse estimation methods are aimed at using or obtaining parsimonious
representations of data or models. They were first dedicated to linear variable
selection but numerous extensions have now emerged such as structured sparsity
or kernel selection. It turns out that many of the related estimation problems
can be cast as convex optimization problems by regularizing the empirical risk
with appropriate non-smooth norms. The goal of this paper is to present from a
general perspective optimization tools and techniques dedicated to such
sparsity-inducing penalties. We cover proximal methods, block-coordinate
descent, reweighted ℓ2-penalized techniques, working-set and homotopy
methods, as well as non-convex formulations and extensions, and provide an
extensive set of experiments to compare various algorithms from a computational
point of view.
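As a pointer to what the simplest of these tools looks like in practice (a hedged sketch, assuming NumPy; the lasso instance and the names below are illustrative, not the paper's code), proximal gradient descent for ℓ1-regularized least squares alternates a gradient step on the smooth term with the soft-thresholding proximal operator of the penalty, which produces exact zeros.

import numpy as np

def soft_threshold(v, t):
    # Proximal operator of t * ||.||_1: shrink each coordinate toward zero.
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def proximal_gradient_lasso(A, b, lam, steps=500):
    # Solve min_x 0.5 * ||A x - b||^2 + lam * ||x||_1 by ISTA.
    L = np.linalg.norm(A, 2) ** 2             # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        grad = A.T @ (A @ x - b)              # gradient of the smooth term
        x = soft_threshold(x - grad / L, lam / L)
    return x

# Illustrative usage: recover a sparse vector from noisy linear measurements.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 100))
x_true = np.zeros(100)
x_true[:5] = 1.0
x_hat = proximal_gradient_lasso(A, A @ x_true + 0.01 * rng.standard_normal(50), lam=0.1)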