Optimization with Sparsity-Inducing Penalties
Sparse estimation methods are aimed at using or obtaining parsimonious
representations of data or models. They were first dedicated to linear variable
selection but numerous extensions have now emerged such as structured sparsity
or kernel selection. It turns out that many of the related estimation problems
can be cast as convex optimization problems by regularizing the empirical risk
with appropriate non-smooth norms. The goal of this paper is to present from a
general perspective optimization tools and techniques dedicated to such
sparsity-inducing penalties. We cover proximal methods, block-coordinate
descent, reweighted l_2-penalized techniques, working-set and homotopy
methods, as well as non-convex formulations and extensions, and provide an
extensive set of experiments to compare various algorithms from a computational
point of view
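As a rough illustration of the proximal methods mentioned in this abstract, the sketch below applies a basic proximal gradient iteration (ISTA) to an l_1-regularized least-squares problem. It is a minimal example under stated assumptions, not the paper's implementation: the function names `soft_threshold` and `ista`, the dense list-of-lists matrix format, and the fixed step size are all choices made here for illustration. The step size is assumed to be at most 1/L, where L is the largest eigenvalue of A^T A.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar v."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(A, b, lam, step, n_iter=500):
    """Minimize 0.5 * ||A x - b||^2 + lam * ||x||_1 by proximal gradient.

    A is a list of rows, b a list; step is assumed to be <= 1/L.
    """
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iter):
        # Residual r = A x - b
        r = [sum(A[i][j] * x[j] for j in range(n_cols)) - b[i]
             for i in range(n_rows)]
        # Gradient of the smooth part: A^T r
        g = [sum(A[i][j] * r[i] for i in range(n_rows))
             for j in range(n_cols)]
        # Gradient step followed by the l1 proximal (soft-thresholding) step
        x = [soft_threshold(x[j] - step * g[j], step * lam)
             for j in range(n_cols)]
    return x


# Illustrative data: with A equal to the identity, the minimizer is simply
# the soft-thresholded observation vector.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [3.0, 0.5]
x_hat = ista(A, b, lam=1.0, step=1.0)
```

The second coordinate is set exactly to zero by the thresholding step, which is the sparsity-inducing effect of the non-smooth penalty.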
Proximal Methods for Hierarchical Sparse Coding
Sparse coding consists in representing signals as sparse linear combinations
of atoms selected from a dictionary. We consider an extension of this framework
where the atoms are further assumed to be embedded in a tree. This is achieved
using a recently introduced tree-structured sparse regularization norm, which
has proven useful in several applications. This norm leads to regularized
problems that are difficult to optimize, and we propose in this paper efficient
algorithms for solving them. More precisely, we show that the proximal operator
associated with this norm is computable exactly via a dual approach that can be
viewed as the composition of elementary proximal operators. Our procedure has a
complexity linear, or close to linear, in the number of atoms, and allows the
use of accelerated gradient techniques to solve the tree-structured sparse
approximation problem at the same computational cost as traditional ones using
the L1-norm. Our method is efficient and scales gracefully to millions of
variables, which we illustrate in two types of applications: first, we consider
fixed hierarchical dictionaries of wavelets to denoise natural images. Then, we
apply our optimization tools in the context of dictionary learning, where
learned dictionary elements naturally organize in a prespecified arborescent
structure, leading to a better performance in reconstruction of natural image
patches. When applied to text documents, our method learns hierarchies of
topics, thus providing a competitive alternative to probabilistic topic models
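The composition of elementary proximal operators described in this abstract can be sketched as follows, assuming the l_2 variant of the tree-structured norm, in which each group contains a node together with all of its descendants and every group has the same weight. The names `group_shrink` and `tree_prox` and the dictionary-based tree encoding are hypothetical choices made for this example. Groups are processed from the leaves up to the root, so each group is shrunk only after all of its subgroups.

```python
import math


def group_shrink(x, idx, t):
    """In-place proximal operator of t * ||x[idx]||_2 (group soft-thresholding)."""
    norm = math.sqrt(sum(x[i] ** 2 for i in idx))
    scale = 0.0 if norm <= t else 1.0 - t / norm
    for i in idx:
        x[i] *= scale


def tree_prox(v, children, root, lam):
    """Apply group shrinkage to every (node + descendants) group, leaves first.

    children maps a node index to the list of its child indices (assumed format).
    """
    x = list(v)

    def visit(node):
        # Collect the group rooted at this node, handling subtrees first
        group = [node]
        for child in children.get(node, []):
            group.extend(visit(child))
        group_shrink(x, group, lam)
        return group

    visit(root)
    return x


# Illustrative tree: node 0 is the root, nodes 1 and 2 are its children.
children = {0: [1, 2]}
x_tree = tree_prox([3.0, 4.0, 0.5], children, root=0, lam=1.0)
```

Each call to `group_shrink` costs time proportional to the group size, so this naive version is not the linear-time procedure the abstract refers to; it only illustrates the leaves-to-root ordering.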
Proximal Gradient methods with Adaptive Subspace Sampling
Many applications in machine learning or signal processing involve nonsmooth
optimization problems. This nonsmoothness brings a low-dimensional structure to
the optimal solutions. In this paper, we propose a randomized proximal gradient
method harnessing this underlying structure. We introduce two key components:
i) a random subspace proximal gradient algorithm; ii) an identification-based
sampling of the subspaces. Their interplay brings a significant performance
improvement on typical learning problems in terms of dimensions explored
Learning Hierarchical and Topographic Dictionaries with Structured Sparsity
Recent work in signal processing and statistics has focused on defining new
regularization functions, which not only induce sparsity of the solution, but
also take into account the structure of the problem. We present in this paper a
class of convex penalties introduced in the machine learning community, which
take the form of a sum of l_2 and l_infinity-norms over groups of variables.
They extend the classical group-sparsity regularization in the sense that the
groups possibly overlap, allowing more flexibility in the group design. We
review efficient optimization methods to deal with the corresponding inverse
problems, and their application to the problem of learning dictionaries of
natural image patches: On the one hand, dictionary learning has indeed proven
effective for various signal processing tasks. On the other hand, structured
sparsity provides a natural framework for modeling dependencies between
dictionary elements. We thus consider a structured sparse regularization to
learn dictionaries embedded in a particular structure, for instance a tree or a
two-dimensional grid. In the latter case, the results we obtain are similar to
the dictionaries produced by topographic independent component analysis
Rigorous optimization recipes for sparse and low rank inverse problems with applications in data sciences
Many natural and man-made signals can be described as having a few degrees of freedom relative to their size due to natural parameterizations or constraints; examples include bandlimited signals, collections of signals observed from multiple viewpoints in a network-of-sensors, and per-flow traffic measurements of the Internet. Low-dimensional models (LDMs) mathematically capture the inherent structure of such signals via combinatorial and geometric data models, such as sparsity, unions-of-subspaces, low-rankness, manifolds, and mixtures of factor analyzers, and are emerging to revolutionize the way we treat inverse problems (e.g., signal recovery, parameter estimation, or structure learning) from dimensionality-reduced or incomplete data. Assuming our problem resides in an LDM space, in this thesis we investigate how to integrate such models in convex and non-convex optimization algorithms for significant gains in computational complexity. We mostly focus on two LDMs: sparsity and low-rankness. We study trade-offs and their implications to develop efficient and provable optimization algorithms, and--more importantly--to exploit convex and combinatorial optimization that can enable cross-pollination of decades of research in both
Accelerating greedy coordinate descent methods
We introduce and study two algorithms to accelerate greedy coordinate descent in theory and in practice: Accelerated Semi-Greedy Coordinate Descent (ASCD) and Accelerated Greedy Coordinate Descent (AGCD). On the theory side, our main results are for ASCD: We show that ASCD achieves O(1/k^2) convergence, and it also achieves accelerated linear convergence for strongly convex functions. On the empirical side, while both AGCD and ASCD outperform Accelerated Randomized Coordinate Descent on most instances in our numerical experiments, we note that AGCD significantly outperforms the other two methods in our experiments, in spite of a lack of theoretical guarantees for this method. To complement this empirical finding for AGCD, we present an explanation why standard proof techniques for acceleration cannot work for AGCD, and we introduce a technical condition under which AGCD is guaranteed to have accelerated convergence. Finally, we confirm that this technical condition holds in our numerical experiments
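For context, the baseline that ASCD and AGCD accelerate can be sketched as plain (non-accelerated) greedy coordinate descent, which at each step updates the coordinate with the largest gradient magnitude. The example below is not ASCD or AGCD; it is a minimal sketch on a strongly convex quadratic, and the function name `greedy_cd` and the test problem are assumptions made for illustration.

```python
def greedy_cd(Q, c, n_iter=200):
    """Greedy coordinate descent for f(x) = 0.5 * x^T Q x - c^T x.

    Q is assumed symmetric positive definite, given as a list of rows.
    """
    n = len(c)
    x = [0.0] * n
    for _ in range(n_iter):
        # Full gradient g = Q x - c
        g = [sum(Q[i][j] * x[j] for j in range(n)) - c[i] for i in range(n)]
        # Greedy rule: pick the coordinate with the largest |gradient|
        j = max(range(n), key=lambda i: abs(g[i]))
        # Exact minimization along coordinate j (step size 1 / Q[j][j])
        x[j] -= g[j] / Q[j][j]
    return x


# Illustrative problem whose exact minimizer is [1, 1].
Q = [[2.0, 1.0], [1.0, 2.0]]
c = [3.0, 3.0]
x_cd = greedy_cd(Q, c)
```

Computing the full gradient at every step is what makes the greedy rule more expensive per iteration than randomized coordinate selection; the sketch makes no attempt to amortize that cost.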