Convex Analysis and Optimization with Submodular Functions: a Tutorial
Set-functions appear in many areas of computer science and applied
mathematics, such as machine learning, computer vision, operations research or
electrical networks. Among these set-functions, submodular functions play an
important role, similar to convex functions on vector spaces. In this tutorial,
the theory of submodular functions is presented, in a self-contained way, with
all results shown from first principles. A good knowledge of convex analysis is
assumed.
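As a toy illustration of the central definition (not taken from the tutorial; the small graph and helper names below are made up), a set-function F is submodular when F(A) + F(B) >= F(A ∪ B) + F(A ∩ B) for all subsets, and the cut function of a graph is a classical example:

    import itertools

    # Cut function of a small undirected graph: F(A) = number of edges with
    # exactly one endpoint in A. Cut functions are a standard submodular example.
    edges = [(0, 1), (1, 2), (2, 3), (0, 3), (1, 3)]
    ground_set = range(4)

    def cut(subset):
        s = set(subset)
        return sum((u in s) != (v in s) for u, v in edges)

    # Exhaustively check F(A) + F(B) >= F(A | B) + F(A & B) on all subset pairs.
    subsets = [frozenset(c) for r in range(5)
               for c in itertools.combinations(ground_set, r)]
    assert all(cut(a) + cut(b) >= cut(a | b) + cut(a & b)
               for a in subsets for b in subsets)
    print("cut is submodular on this graph")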
Graph kernels between point clouds
Point clouds are sets of points in two or three dimensions. Most kernel
methods for learning on sets of points have not yet dealt with the specific
geometrical invariances and practical constraints associated with point clouds
in computer vision and graphics. In this paper, we present extensions of graph
kernels for point clouds, which make it possible to use kernel methods for such
objects as shapes, line drawings, or any three-dimensional point clouds. In order to
design rich and numerically efficient kernels with as few free parameters as
possible, we use kernels between covariance matrices and their factorizations
on graphical models. We derive polynomial time dynamic programming recursions
and present applications to recognition of handwritten digits and Chinese
characters from few training examples.
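As a simplified illustration of one ingredient mentioned above, a kernel between covariance matrices of two point clouds can be as basic as a Gaussian kernel on their Frobenius distance; the paper's kernels are richer, combining graph structure with factorizations on graphical models, so the function below is only a hypothetical sketch:

    import numpy as np

    def covariance_kernel(cloud_x, cloud_y, sigma=1.0):
        # Compare two point clouds through their empirical covariance matrices,
        # using a Gaussian kernel on the Frobenius distance between them.
        cx = np.cov(cloud_x, rowvar=False)
        cy = np.cov(cloud_y, rowvar=False)
        return np.exp(-np.linalg.norm(cx - cy, "fro") ** 2 / (2 * sigma ** 2))

    rng = np.random.default_rng(0)
    x = rng.normal(size=(100, 3))   # a 3D point cloud with 100 points
    y = rng.normal(size=(120, 3))   # clouds of different sizes are fine
    print(covariance_kernel(x, y))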
Sharp analysis of low-rank kernel matrix approximations
We consider supervised learning problems within the positive-definite kernel
framework, such as kernel ridge regression, kernel logistic regression or the
support vector machine. With kernels leading to infinite-dimensional feature
spaces, a common practical limiting difficulty is the necessity of computing
the kernel matrix, which most frequently leads to algorithms with running time
at least quadratic in the number of observations n, i.e., O(n^2). Low-rank
approximations of the kernel matrix are often considered as they allow the
reduction of running time complexities to O(p^2 n), where p is the rank of the
approximation. The practicality of such methods thus depends on the required
rank p. In this paper, we show that in the context of kernel ridge regression,
for approximations based on a random subset of columns of the original kernel
matrix, the rank p may be chosen to be linear in the degrees of freedom
associated with the problem, a quantity which is classically used in the
statistical analysis of such methods, and is often seen as the implicit number
of parameters of non-parametric estimators. This result enables simple
algorithms that have sub-quadratic running time complexity, but provably
exhibit the same predictive performance as existing algorithms, for any given
problem instance, and not only for worst-case situations.
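A minimal sketch of the column-sampling (Nystrom-type) scheme discussed above, applied to kernel ridge regression: only an n-by-p block of the kernel matrix is formed and a reduced p-by-p system is solved. The kernel choice, the uniform sampling, and the small jitter term are illustrative assumptions, not the paper's exact estimator:

    import numpy as np

    def gaussian_kernel(a, b, bandwidth=1.0):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))

    def column_sampling_krr(x, y, p, lam, rng):
        # Rank-p approximation from a random subset of p columns; forming and
        # solving the reduced problem costs O(p^2 n) rather than O(n^2) or more.
        n = len(x)
        idx = rng.choice(n, size=p, replace=False)
        K_np = gaussian_kernel(x, x[idx])          # n x p block of the kernel matrix
        K_pp = K_np[idx]                           # p x p block on the sampled points
        # Reduced objective ||y - K_np a||^2 + lam * a' K_pp a, solved in closed form.
        A = K_np.T @ K_np + lam * K_pp + 1e-10 * np.eye(p)
        alpha = np.linalg.solve(A, K_np.T @ y)
        return x[idx], alpha

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, size=(500, 1))
    y = np.sin(3 * x[:, 0]) + 0.1 * rng.normal(size=500)
    centers, alpha = column_sampling_krr(x, y, p=50, lam=1e-3, rng=rng)
    y_hat = gaussian_kernel(x, centers) @ alpha
    print("training mean squared error:", np.mean((y - y_hat) ** 2))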
Exploring Large Feature Spaces with Hierarchical Multiple Kernel Learning
For supervised and unsupervised learning, positive definite kernels make it
possible to use large and potentially infinite-dimensional feature spaces with a
computational cost that only depends on the number of observations. This is
usually done through the penalization of predictor functions by Euclidean or
Hilbertian norms. In this paper, we explore penalizing by sparsity-inducing
norms such as the l1-norm or the block l1-norm. We assume that the kernel
decomposes into a large sum of individual basis kernels which can be embedded
in a directed acyclic graph; we show that it is then possible to perform kernel
selection through a hierarchical multiple kernel learning framework, in
polynomial time in the number of selected kernels. This framework is naturally
applied to nonlinear variable selection; our extensive simulations on
synthetic datasets and datasets from the UCI repository show that efficiently
exploring the large feature space through sparsity-inducing norms leads to
state-of-the-art predictive performance.
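For reference, the block l1-norm mentioned above (a group-lasso-type penalty) simply sums the Euclidean norms of disjoint blocks of coefficients, which is what drives entire blocks, and hence entire kernels, to zero. The sketch below is generic and not the paper's hierarchical DAG construction:

    import numpy as np

    def block_l1_norm(w, groups):
        # Sum of Euclidean norms of the blocks: zero blocks contribute nothing,
        # so minimizing a loss plus this penalty tends to discard whole blocks.
        return sum(np.linalg.norm(w[g]) for g in groups)

    w = np.array([0.0, 0.0, 1.5, -0.2, 0.0])
    groups = [[0, 1], [2, 3], [4]]
    print(block_l1_norm(w, groups))   # only the second block contributes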
Structured sparsity-inducing norms through submodular functions
Sparse methods for supervised learning aim at finding good linear predictors
from as few variables as possible, i.e., with small cardinality of their
supports. This combinatorial selection problem is often turned into a convex
optimization problem by replacing the cardinality function by its convex
envelope (tightest convex lower bound), in this case the L1-norm. In this
paper, we investigate more general set-functions than the cardinality, that may
incorporate prior knowledge or structural constraints which are common in many
applications: namely, we show that for nondecreasing submodular set-functions,
the corresponding convex envelope can be obtained from its Lovász extension, a
common tool in submodular analysis. This defines a family of polyhedral norms,
for which we provide generic algorithmic tools (subgradients and proximal
operators) and theoretical results (conditions for support recovery or
high-dimensional inference). By selecting specific submodular functions, we can
give a new interpretation to known norms, such as those based on
rank-statistics or grouped norms with potentially overlapping groups; we also
define new norms, in particular ones that can be used as non-factorial priors
for supervised learning.
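The Lovász extension itself has a simple greedy formula: sort the components of the input in decreasing order and weight the marginal gains of the set-function along that order. The sketch below is a generic textbook evaluation with a made-up example function, not code from the paper:

    import numpy as np

    def lovasz_extension(F, w):
        # Greedy formula: with indices ordered so that w[j1] >= w[j2] >= ...,
        # f(w) = sum_k w[jk] * (F({j1, ..., jk}) - F({j1, ..., j(k-1)})).
        order = np.argsort(-w)
        value, prev, selected = 0.0, 0.0, []
        for j in order:
            selected.append(int(j))
            current = F(frozenset(selected))
            value += w[j] * (current - prev)
            prev = current
        return value

    # Example: F(A) = min(|A|, 1) is nondecreasing and submodular, and the
    # associated norm, the Lovász extension evaluated at |w|, is the l_inf norm.
    F = lambda A: min(len(A), 1)
    w = np.array([0.3, -2.0, 1.1])
    print(lovasz_extension(F, np.abs(w)), np.max(np.abs(w)))   # both print 2.0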
A Convex Relaxation for Weakly Supervised Classifiers
This paper introduces a general multi-class approach to weakly supervised
classification. Inferring the labels and learning the parameters of the model
is usually done jointly through a block-coordinate descent algorithm such as
expectation-maximization (EM), which may lead to local minima. To avoid this
problem, we propose a cost function based on a convex relaxation of the
soft-max loss. We then propose an algorithm specifically designed to
efficiently solve the corresponding semidefinite program (SDP). Empirically,
our method compares favorably to standard ones on different datasets for
multiple instance learning and semi-supervised learning as well as on
clustering tasks. Comment: Appears in Proceedings of the 29th International
Conference on Machine Learning (ICML 2012).
From Averaging to Acceleration, There is Only a Step-size
We show that accelerated gradient descent, averaged gradient descent and the
heavy-ball method for non-strongly-convex problems may be reformulated as
constant parameter second-order difference equation algorithms, where stability
of the system is equivalent to convergence at rate O(1/n^2), where n is the
number of iterations. We provide a detailed analysis of the eigenvalues of the
corresponding linear dynamical system, showing various oscillatory and
non-oscillatory behaviors, together with a sharp stability result with explicit
constants. We also consider the situation where noisy gradients are available,
where we extend our general convergence result, which suggests an alternative
algorithm (i.e., with different step sizes) that exhibits the good aspects of
both averaging and acceleration.
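For concreteness, the constant-parameter second-order recursions referred to above take the generic two-term form x_{n+1} = x_n - gamma * g(x_n) + delta * (x_n - x_{n-1}); the heavy-ball method is the most direct instance. Below is a generic heavy-ball sketch on a quadratic, with illustrative parameter values rather than the ones derived in the paper:

    import numpy as np

    def heavy_ball(grad, x0, step, momentum, iters):
        # Two-term recursion with constant coefficients: the next iterate
        # depends on the current gradient and on the previous displacement.
        x_prev, x = x0.copy(), x0.copy()
        for _ in range(iters):
            x, x_prev = x - step * grad(x) + momentum * (x - x_prev), x
        return x

    rng = np.random.default_rng(0)
    G = rng.normal(size=(5, 5))
    A = G.T @ G + 0.1 * np.eye(5)                # positive definite quadratic
    b = rng.normal(size=5)
    x_star = np.linalg.solve(A, b)
    x = heavy_ball(lambda z: A @ z - b, np.zeros(5),
                   step=0.01, momentum=0.9, iters=5000)
    print("distance to the minimizer:", np.linalg.norm(x - x_star))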
Non-strongly-convex smooth stochastic approximation with convergence rate O(1/n)
We consider the stochastic approximation problem where a convex function has
to be minimized, given only the knowledge of unbiased estimates of its
gradients at certain points, a framework which includes machine learning
methods based on the minimization of the empirical risk. We focus on problems
without strong convexity, for which all previously known algorithms achieve a
convergence rate for function values of O(1/n^{1/2}). We consider and analyze
two algorithms that achieve a rate of O(1/n) for classical supervised learning
problems. For least-squares regression, we show that averaged stochastic
gradient descent with constant step-size achieves the desired rate. For
logistic regression, this is achieved by a simple novel stochastic gradient
algorithm that (a) constructs successive local quadratic approximations of the
loss functions, while (b) preserving the same running time complexity as
stochastic gradient descent. For these algorithms, we provide a non-asymptotic
analysis of the generalization error (in expectation, and also in high
probability for least-squares), and run extensive experiments on standard
machine learning benchmarks showing that they often outperform existing
approaches.
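A minimal sketch of the least-squares case described above: constant-step-size stochastic gradient descent with Polyak-Ruppert averaging of the iterates. The step-size below is scaled by an empirical estimate of E||x||^2, an illustrative choice in the spirit of the analysis rather than its exact prescription:

    import numpy as np

    def averaged_sgd_least_squares(X, y, step):
        # One pass over the data with a constant step-size; the returned
        # predictor is the running average of the iterates, not the last one.
        n, d = X.shape
        w = np.zeros(d)
        w_bar = np.zeros(d)
        for i in range(n):
            g = (X[i] @ w - y[i]) * X[i]        # unbiased gradient of the squared loss
            w -= step * g
            w_bar += (w - w_bar) / (i + 1)      # running average of the iterates
        return w_bar

    rng = np.random.default_rng(0)
    n, d = 10000, 10
    X = rng.normal(size=(n, d))
    w_true = rng.normal(size=d)
    y = X @ w_true + 0.1 * rng.normal(size=n)
    R2 = np.mean(np.sum(X ** 2, axis=1))        # empirical estimate of E||x||^2
    w_hat = averaged_sgd_least_squares(X, y, step=1.0 / (4.0 * R2))
    print("parameter error:", np.linalg.norm(w_hat - w_true))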
AdaBatch: Efficient Gradient Aggregation Rules for Sequential and Parallel Stochastic Gradient Methods
We study a new aggregation operator for gradients coming from a mini-batch
for stochastic gradient (SG) methods that allows a significant speed-up in the
case of sparse optimization problems. We call this method AdaBatch and it only
requires a few lines of code change compared to regular mini-batch SGD
algorithms. We provide theoretical insight into how this new class of algorithms
performs and show that it is equivalent to an implicit
per-coordinate rescaling of the gradients, similarly to what Adagrad methods
can do. In theory and in practice, this new aggregation makes it possible to keep
the same sample efficiency as SG methods while increasing the batch size.
Experimentally, we also show that in the case of smooth convex optimization,
our procedure can even obtain a better loss when increasing the batch size for
a fixed number of samples. We then apply this new algorithm to obtain a
parallelizable stochastic gradient method that is synchronous but allows
speed-up on par with Hogwild! methods as convergence does not deteriorate with
the increase of the batch size. The same approach can be used to make
mini-batch provably efficient for variance-reduced SG methods such as SVRG.
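The aggregation rule itself can be stated in a few lines: instead of dividing the summed mini-batch gradient by the batch size, each coordinate is divided by the number of gradients in the batch that are non-zero on that coordinate. The sketch below is a reconstruction of that idea from the description above, not the authors' reference implementation:

    import numpy as np

    def adabatch_aggregate(batch_gradients):
        # Per-coordinate rescaling: coordinates touched by only a few sparse
        # gradients are not diluted by the full batch size.
        G = np.asarray(batch_gradients)          # shape (batch_size, d)
        counts = np.count_nonzero(G, axis=0)     # samples touching each coordinate
        total = G.sum(axis=0)
        return np.divide(total, counts, out=np.zeros_like(total), where=counts > 0)

    g1 = np.array([1.0, 0.0, 2.0])               # two sparse per-sample gradients
    g2 = np.array([3.0, 0.0, 0.0])
    print(adabatch_aggregate([g1, g2]))          # [2. 0. 2.] vs. [2. 0. 1.] for plain averaging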