Model Consistency for Learning with Mirror-Stratifiable Regularizers
Low-complexity non-smooth convex regularizers are routinely used to impose
some structure (such as sparsity or low rank) on the coefficients of linear
predictors in supervised learning. Model consistency then consists in selecting
the correct structure (for instance support or rank) by regularized empirical
risk minimization.
It is known that model consistency holds under appropriate non-degeneracy
conditions. However, such conditions typically fail for highly correlated
designs, and it is observed that regularization methods tend to select larger
models.
In this work, we provide the theoretical underpinning of this behavior using
the notion of mirror-stratifiable regularizers. This class of regularizers
encompasses the most well-known ones in the literature, including the l1 or
trace norms. It brings into play a pair of primal-dual models, which in turn
allows one to locate the structure of the solution using a specific dual
certificate.
We also show how this analysis applies to optimal solutions of the learning
problem, as well as to the iterates computed by a certain class of stochastic
proximal-gradient algorithms.
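To make the dual-certificate viewpoint concrete, here is a minimal numerical sketch for the l1 case: solve a small Lasso problem with plain proximal gradient and read the model off the certificate eta = X^T(y - X b)/lam, whose saturated coordinates (|eta_i| close to 1) locate the selected structure. This is a generic illustration under standard Lasso assumptions, not the paper's notation; all variable names are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista_lasso(X, y, lam, n_iter=3000):
    """Plain proximal gradient (ISTA) for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the smooth part
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft_threshold(b - step * (X.T @ (X @ b - y)), step * lam)
    return b

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
b_true = np.zeros(20)
b_true[:3] = 1.0
y = X @ b_true + 0.01 * rng.standard_normal(50)
lam = 0.5

b_hat = ista_lasso(X, y, lam)
# Dual certificate for the l1 norm: eta = X^T (y - X b_hat) / lam.
# At the optimum |eta_i| <= 1 everywhere and eta_i = sign(b_i) on the support,
# so the saturation set {i : |eta_i| ~ 1} is the model located by the certificate
# (it always contains the support of the solution).
eta = X.T @ (y - X @ b_hat) / lam
print("model from certificate  :", np.flatnonzero(np.abs(eta) > 1 - 1e-3))
print("support of the solution :", np.flatnonzero(np.abs(b_hat) > 1e-8))
```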
A Variance-Reduced and Stabilized Proximal Stochastic Gradient Method with Support Identification Guarantees for Structured Optimization
This paper introduces a new proximal stochastic gradient method with variance
reduction and stabilization for minimizing the sum of a convex stochastic
function and a group sparsity-inducing regularization function. Since the
method may be viewed as a stabilized version of the recently proposed algorithm
PStorm, we call our algorithm S-PStorm. Our analysis shows that S-PStorm enjoys
strong convergence guarantees. In particular, we prove an upper bound on the
number of iterations required by S-PStorm before its iterates correctly
identify (with high probability) an optimal support (i.e., the zero and nonzero
structure of an optimal solution). Most algorithms in the literature with such
a support identification property use variance reduction techniques that
require either periodically evaluating an exact gradient or storing a history
of stochastic gradients. Unlike these methods, S-PStorm achieves variance
reduction without requiring either of these, which is advantageous. Moreover,
our support-identification result for S-PStorm shows that, with high
probability, an optimal support will be identified correctly in all iterations
whose index is above a threshold. We believe that this type of result is new to
the literature, since the few other existing results prove that the optimal
support is identified with high probability at each iteration with a
sufficiently large index (meaning that the optimal support might be identified
in some iterations, but not in others). Numerical experiments on regularized
logistic loss problems show that S-PStorm outperforms existing methods in
various metrics that measure how efficiently and robustly iterates of an
algorithm identify an optimal support.
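As a rough illustration of the kind of update involved, the sketch below combines a STORM-style recursive gradient estimator with the proximal operator of a group-sparsity regularizer. This is not the authors' S-PStorm (which additionally includes the stabilization and the parameter choices analyzed in the paper); all names, constants, and the toy problem are hypothetical.

```python
import numpy as np

def prox_group_l1(v, groups, t):
    """Proximal operator of t * sum_g ||v_g||_2 (block soft-thresholding)."""
    out = v.copy()
    for g in groups:
        norm_g = np.linalg.norm(v[g])
        out[g] = 0.0 if norm_g <= t else (1.0 - t / norm_g) * v[g]
    return out

def storm_style_prox_step(x, d_prev, grad_new, grad_prev_same_sample,
                          beta, alpha, groups, lam):
    """One proximal step with a STORM-style recursive gradient estimator:
        d_k = g(x_k; xi_k) + (1 - beta) * (d_{k-1} - g(x_{k-1}; xi_k)),
        x_{k+1} = prox_{alpha * lam * r}(x_k - alpha * d_k).
    Generic sketch only; not the stabilized S-PStorm update from the paper."""
    d = grad_new + (1.0 - beta) * (d_prev - grad_prev_same_sample)
    x_next = prox_group_l1(x - alpha * d, groups, alpha * lam)
    return x_next, d

# Hypothetical usage on a small group-sparse logistic regression problem.
rng = np.random.default_rng(1)
n, p = 200, 12
groups = [np.arange(i, i + 3) for i in range(0, p, 3)]  # 4 disjoint groups of size 3
A = rng.standard_normal((n, p))
x_true = np.zeros(p)
x_true[groups[0]] = 1.0
b = np.sign(A @ x_true + 0.1 * rng.standard_normal(n))

def logistic_grad(x, i):
    """Gradient of the logistic loss of sample i (an unbiased gradient estimate)."""
    margin = np.clip(b[i] * (A[i] @ x), -30.0, 30.0)
    return -b[i] * A[i] / (1.0 + np.exp(margin))

alpha, beta, lam = 0.05, 0.1, 0.02
x_prev = np.zeros(p)
x = np.zeros(p)
d = logistic_grad(x, int(rng.integers(n)))
for _ in range(5000):
    i = int(rng.integers(n))
    x_next, d = storm_style_prox_step(x, d, logistic_grad(x, i),
                                      logistic_grad(x_prev, i),
                                      beta, alpha, groups, lam)
    x_prev, x = x, x_next

# The group support of the iterate is read off from its nonzero blocks.
print([gi for gi, g in enumerate(groups) if np.linalg.norm(x[g]) > 1e-10])
```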
Screening for Sparse Online Learning
Sparsity-promoting regularizers are widely used to impose low-complexity
structure (e.g., the l1-norm for sparsity) on the regression coefficients in
supervised learning. In the realm of deterministic optimization, the sequences
generated by iterative algorithms (such as proximal gradient descent) exhibit
"finite activity identification", namely, they can identify the low-complexity
structure in a finite number of iterations. However, most online algorithms
(such as proximal stochastic gradient descent) do not have this property, owing
to the vanishing step size and the non-vanishing variance of the stochastic
gradients. In this paper, we show how combining online algorithms with a
screening rule makes it possible to eliminate useless features of the iterates
and thereby enforce finite activity identification. One consequence is that, when combined with any
convergent online algorithm, sparsity properties imposed by the regularizer can
be exploited for computational gains. Numerically, significant acceleration can
be obtained.
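As an illustration of the screening step described above, the sketch below implements a standard gap-safe screening test for l1-regularized least squares: given any iterate (for instance, the current iterate of an online algorithm), it certifies a set of features as zero at the optimum so they can be discarded. The exact rule used in the paper may differ; the function name, data, and thresholds here are illustrative assumptions.

```python
import numpy as np

def gap_safe_screen_lasso(X, y, beta, lam):
    """Gap-safe screening for 0.5 * ||y - X beta||^2 + lam * ||beta||_1.

    Given any primal iterate `beta`, build a feasible dual point, compute the
    duality gap, and return a boolean mask of features that are certified to be
    zero at the optimum and can therefore be safely removed from the problem."""
    residual = y - X @ beta
    # Rescale the natural dual candidate residual/lam so that it is dual feasible.
    theta = residual / max(lam, np.max(np.abs(X.T @ residual)))
    primal = 0.5 * residual @ residual + lam * np.sum(np.abs(beta))
    dual = 0.5 * y @ y - 0.5 * lam ** 2 * np.sum((theta - y / lam) ** 2)
    radius = np.sqrt(2.0 * max(primal - dual, 0.0)) / lam
    # Feature j is certified inactive if |x_j^T theta| + radius * ||x_j|| < 1.
    return np.abs(X.T @ theta) + radius * np.linalg.norm(X, axis=0) < 1.0

# Hypothetical usage: periodically shrink the feature set during online iterations.
# With a crude iterate few features may be eliminated; as the iterate improves,
# the duality gap shrinks and more features are screened out.
rng = np.random.default_rng(2)
X = rng.standard_normal((100, 30))
y = X[:, :5] @ np.ones(5) + 0.1 * rng.standard_normal(100)
beta_iterate = np.zeros(30)          # stand-in for the current online iterate
inactive = gap_safe_screen_lasso(X, y, beta_iterate, lam=5.0)
X_reduced = X[:, ~inactive]          # later updates only touch surviving features
print("features eliminated:", int(inactive.sum()), "of", X.shape[1])
```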
- …