From Symmetry to Geometry: Tractable Nonconvex Problems
As science and engineering have become increasingly data-driven, the role of
optimization has expanded to touch almost every stage of the data analysis
pipeline, from signal and data acquisition to modeling and prediction. The
optimization problems encountered in practice are often nonconvex. While
challenges vary from problem to problem, one common source of nonconvexity is
nonlinearity in the data or measurement model. Nonlinear models often exhibit
symmetries, creating complicated, nonconvex objective landscapes, with multiple
equivalent solutions. Nevertheless, simple methods (e.g., gradient descent)
often perform surprisingly well in practice.
The goal of this survey is to highlight a class of tractable nonconvex
problems, which can be understood through the lens of symmetries. These
problems exhibit a characteristic geometric structure: local minimizers are
symmetric copies of a single "ground truth" solution, while other critical
points occur at balanced superpositions of symmetric copies of the ground
truth, and exhibit negative curvature in directions that break the symmetry.
This structure enables efficient methods to obtain global minimizers. We
discuss examples of this phenomenon arising from a wide range of problems in
imaging, signal processing, and data analysis. We highlight the key role of
symmetry in shaping the objective landscape and discuss the different roles of
rotational and discrete symmetries. This area is rich with observed phenomena
and open problems; we close by highlighting directions for future research.
Comment: review paper submitted to SIAM Review; 34 pages, 10 figures.
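To make the advertised geometry concrete, here is a minimal numerical sketch (a toy example of our own choosing, not taken from the survey) using real-valued generalized phase retrieval, whose symmetry is the sign flip x -> -x: gradient descent from a generic initialization lands near one of the two symmetric copies of the ground truth, while the balanced superposition x = 0 is a critical point whose Hessian has negative curvature. Dimensions, step size, and iteration count are illustrative.

```python
# Toy illustration (not from the survey): sign symmetry in real phase retrieval.
# Measurements y_i = (a_i^T x*)^2 are invariant under x* -> -x*, so the loss
# f(x) = (1/4m) * sum_i ((a_i^T x)^2 - y_i)^2 has two symmetric global minimizers.
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 400
x_true = rng.standard_normal(n)
x_true /= np.linalg.norm(x_true)         # unit-norm ground truth (illustrative)
A = rng.standard_normal((m, n))          # measurement vectors a_i as rows
y = (A @ x_true) ** 2                    # phaseless (signless) measurements

def grad(x):
    r = (A @ x) ** 2 - y                 # residuals
    return (A.T @ (r * (A @ x))) / m     # gradient of f at x

x = rng.standard_normal(n)
x /= np.linalg.norm(x)                   # generic unit-norm initialization
for _ in range(3000):
    x -= 0.05 * grad(x)

err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
print("distance to nearest symmetric copy of x*:", err)   # typically small

# The balanced superposition (x* + (-x*))/2 = 0 is a critical point; its
# Hessian -(1/m) * sum_i y_i a_i a_i^T has only nonpositive eigenvalues.
H0 = -(A.T * y) @ A / m
print("largest Hessian eigenvalue at 0:", np.linalg.eigvalsh(H0).max())
```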
Nonconvex Recovery of Low-complexity Models
Today, in the era of big data, there is a pressing need for efficient, scalable, and robust optimization methods to analyze the data we create and collect. Although convex methods offer tractable solutions with global optimality, heuristic nonconvex methods are often more attractive in practice due to their superior efficiency and scalability. Moreover, to better represent the data, the mathematical models we build today are much more complicated, which often results in highly nonlinear and nonconvex optimization problems. Both of these challenges require us to go beyond convex optimization. While nonconvex optimization is extraordinarily successful in practice, guaranteeing the correctness of nonconvex methods is, unlike in the convex case, notoriously difficult. In theory, even finding a local minimum of a general nonconvex function is NP-hard, never mind the global minimum.
This thesis aims to bridge the gap between the practice and theory of nonconvex optimization by developing global optimality guarantees for nonconvex problems arising in real-world engineering applications, together with provable, efficient nonconvex optimization algorithms. First, the thesis shows that for certain nonconvex problems we can construct a model-specialized initialization that is close to the optimal solution, so that simple and efficient methods provably converge to the global solution at a linear rate. These problems include sparse basis learning and convolutional phase retrieval. In addition, the work has led to the discovery of a broader class of nonconvex problems, the so-called ridable-saddle functions. These problems possess a characteristic structure in which (i) all local minima are global, and (ii) the energy landscape does not have any "flat" saddle points. More interestingly, when the data are large and random, the thesis shows that many real-world problems are indeed ridable saddle; these include complete dictionary learning and generalized phase retrieval. For each of these problems, the benign geometric structure allows us to obtain global recovery guarantees using efficient optimization methods with arbitrary initialization.
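As a concrete, generic illustration of the "initialize close to the solution, then refine" strategy, the sketch below uses the standard spectral initialization for real-valued generalized phase retrieval followed by plain gradient descent. This is a textbook recipe chosen by us for illustration, not the thesis's specific construction for convolutional phase retrieval or sparse basis learning; the problem sizes and step size are arbitrary.

```python
# Sketch: spectral initialization + gradient descent for real phase retrieval.
# A generic instance of "construct an initialization close to the solution, then
# refine with a simple first-order method" (illustrative sizes and step size).
import numpy as np

rng = np.random.default_rng(1)
n, m = 50, 2000
x_true = rng.standard_normal(n)
A = rng.standard_normal((m, n))
y = (A @ x_true) ** 2

# Spectral initialization: top eigenvector of Y = (1/m) * sum_i y_i a_i a_i^T,
# scaled so that ||x0||^2 matches mean(y), an estimate of ||x_true||^2.
Y = (A.T * y) @ A / m
eigvals, eigvecs = np.linalg.eigh(Y)
x = np.sqrt(y.mean()) * eigvecs[:, -1]

def grad(x):
    r = (A @ x) ** 2 - y
    return (A.T @ (r * (A @ x))) / m

step = 0.1 / y.mean()                    # crude step size, illustrative only
for t in range(201):
    if t % 50 == 0:
        err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
        print(f"iter {t:3d}  distance to +/- x_true: {err:.3e}")  # shrinks geometrically
    x -= step * grad(x)
```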
When Are Nonconvex Optimization Problems Not Scary?
Nonconvex optimization is NP-hard, even when the goal is only to compute a local minimizer. In applied disciplines, however, nonconvex problems abound, and simple algorithms, such as gradient descent and alternating direction methods, are often surprisingly effective. The ability of simple algorithms to find high-quality solutions for practical nonconvex problems remains largely mysterious.
This thesis focuses on a class of nonconvex optimization problems which CAN be solved to global optimality with polynomial-time algorithms. This class covers natural nonconvex formulations of central problems in signal processing, machine learning, and statistical estimation, such as sparse dictionary learning (DL), generalized phase retrieval (GPR), and orthogonal tensor decomposition. For each of the listed problems, the nonconvex formulation and optimization lead to novel and often improved computational guarantees.
This class of nonconvex problems has two distinctive features: (i) all local minimizers are also global, so obtaining any local minimizer solves the optimization problem; and (ii) around each saddle point or local maximizer, the function has negative directional curvature; in other words, around these points the Hessian matrices have negative eigenvalues. We call smooth functions with these two properties (qualitative) X functions, and derive concrete quantities and strategies to help verify the properties, particularly for functions with random inputs or parameters. As practical examples, we establish that certain natural nonconvex formulations of complete DL and GPR are X functions with concrete parameters.
Optimizing X functions amounts to finding any local minimizer. With generic initializations, typical iterative methods at best guarantee convergence to a critical point that might be a saddle point or local maximizer. Interestingly, the X structure allows a number of iterative methods to escape from saddle points and local maximizers and efficiently find a local minimizer, without special initializations. We choose to describe and analyze the second-order trust-region method (TRM), which seems to yield the strongest computational guarantees. Intuitively, second-order methods can exploit the Hessian to extract negative curvature directions around saddle points and local maximizers, and hence are able to escape from the saddles and local maximizers of X functions. We state the TRM in a Riemannian optimization framework to accommodate practical manifold-constrained problems. For DL and GPR, we show that under technical conditions, the TRM algorithm finds a global minimizer in a polynomial number of steps from arbitrary initializations.
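The role of negative curvature can be seen in a deliberately simple Euclidean sketch (ours, not the Riemannian TRM analyzed in the thesis): at a strict saddle the gradient vanishes, but the most negative Hessian eigenvector still provides a descent direction, so a second-order method can move off the saddle and then descend to one of the symmetric minimizers.

```python
# Toy escape-from-saddle step via negative curvature (a plain Euclidean sketch,
# not the Riemannian trust-region method; it only illustrates the key idea).
import numpy as np

# f(x) = 0.25*(x1^2 - 1)^2 + 0.5*x2^2: minimizers at (+-1, 0) (two symmetric
# copies) and a strict saddle at the origin, the balanced point between them.
def grad(x):
    return np.array([(x[0] ** 2 - 1.0) * x[0], x[1]])

def hess(x):
    return np.array([[3.0 * x[0] ** 2 - 1.0, 0.0], [0.0, 1.0]])

x = np.zeros(2)                           # start exactly at the saddle
for _ in range(100):
    g = grad(x)
    if np.linalg.norm(g) < 1e-8:          # gradient carries no information here,
        lam, vecs = np.linalg.eigh(hess(x))
        if lam[0] < -1e-8:                # ...but negative curvature does:
            x = x + 0.5 * vecs[:, 0]      # step along the escape direction
        else:
            break                         # second-order stationary point: stop
    else:
        x = x - 0.2 * g                   # ordinary gradient step otherwise
print("converged to", x)                  # one of the minimizers (+-1, 0)
```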
Sparse Coding and Autoencoders
In "Dictionary Learning" one tries to recover incoherent matrices (typically overcomplete and whose columns are assumed
to be normalized) and sparse vectors with a small
support of size for some while having access to observations
where . In this work we undertake a rigorous
analysis of whether gradient descent on the squared loss of an autoencoder can
solve the dictionary learning problem. The "Autoencoder" architecture we
consider is a mapping with a single
ReLU activation layer of size .
Under very mild distributional assumptions on , we prove that the norm
of the expected gradient of the standard squared loss function is
asymptotically (in sparse code dimension) negligible for all points in a small
neighborhood of . This is supported with experimental evidence using
synthetic data. We also conduct experiments to suggest that is a local
minimum. Along the way we prove that a layer of ReLU gates can be set up to
automatically recover the support of the sparse codes. This property holds
independent of the loss function. We believe that it could be of independent
interest.Comment: In this new version of the paper with a small change in the
distributional assumptions we are actually able to prove the asymptotic
criticality of a neighbourhood of the ground truth dictionary for even just
the standard squared loss of the ReLU autoencoder (unlike the regularized
loss in the older version
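The support-recovery property of a ReLU layer can be checked numerically in a simplified setting (our own toy construction with an illustrative threshold, rather than the paper's exact gating argument): with an incoherent dictionary and a nonnegative sparse code, ReLU(A^T y - eps) is typically nonzero exactly on the support of the code.

```python
# Toy check (our simplification, not the paper's exact construction): a ReLU
# layer with weights A^T and a fixed bias recovers the support of a nonnegative
# sparse code x* from y = A x* when the dictionary columns are incoherent.
import numpy as np

rng = np.random.default_rng(2)
n, h, k = 1024, 2048, 5                  # signal dim, code dim, sparsity
A = rng.standard_normal((n, h))
A /= np.linalg.norm(A, axis=0)           # unit-norm, mutually incoherent columns

support = rng.choice(h, size=k, replace=False)
x_true = np.zeros(h)
x_true[support] = 1.0                    # nonnegative sparse code
y = A @ x_true                           # observation

eps = 0.5                                # illustrative bias/threshold
z = np.maximum(A.T @ y - eps, 0.0)       # ReLU(A^T y - eps)

recovered = set(np.nonzero(z)[0])
print("true support:     ", sorted(support))
print("recovered support:", sorted(recovered))
print("exact support recovery:", recovered == set(support))
```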
Neural Collapse with Normalized Features: A Geometric Analysis over the Riemannian Manifold
When training overparameterized deep networks for classification tasks, it
has been widely observed that the learned features exhibit a so-called "neural
collapse" phenomenon. More specifically, for the output features of the
penultimate layer, for each class the within-class features converge to their
means, and the means of different classes exhibit a certain tight frame
structure, which is also aligned with the last layer's classifier. As feature
normalization in the last layer becomes a common practice in modern
representation learning, in this work we theoretically justify the neural
collapse phenomenon for normalized features. Based on an unconstrained feature
model, we simplify the empirical loss function in a multi-class classification
task into a nonconvex optimization problem over the Riemannian manifold by
constraining all features and classifiers to the sphere. In this context, we
analyze the nonconvex landscape of the Riemannian optimization problem over the
product of spheres, showing a benign global landscape in the sense that the
only global minimizers are the neural collapse solutions while all other
critical points are strict saddles with negative curvature. Experimental
results on practical deep networks corroborate our theory and demonstrate that
better representations can be learned faster via feature normalization.
Comment: The first two authors contributed to this work equally; 38 pages, 13 figures. Accepted at NeurIPS'22.
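The normalized unconstrained feature model is easy to prototype. The sketch below (our simplification: projected gradient descent with renormalization rather than the paper's Riemannian machinery, and arbitrary dimensions, temperature, and learning rate) trains free features and a classifier constrained to the unit sphere with a cross-entropy loss, and then checks the neural-collapse prediction that class means spread out with pairwise cosine close to -1/(K-1).

```python
# Minimal normalized unconstrained-feature-model sketch: projected gradient
# descent with renormalization onto the sphere (a simplification of the
# Riemannian formulation; dimensions, temperature, and learning rate are ours).
import numpy as np

rng = np.random.default_rng(3)
K, n_per, d = 4, 20, 16                  # classes, samples per class, feature dim
N = K * n_per
labels = np.repeat(np.arange(K), n_per)
Y = np.eye(K)[labels].T                  # one-hot labels, shape (K, N)

def normalize(M, axis):
    return M / np.linalg.norm(M, axis=axis, keepdims=True)

H = normalize(rng.standard_normal((d, N)), 0)   # free features (columns) on the sphere
W = normalize(rng.standard_normal((K, d)), 1)   # classifier rows on the sphere
tau, lr = 10.0, 0.1                      # inverse temperature, step size

for _ in range(5000):
    logits = tau * (W @ H)               # (K, N)
    P = np.exp(logits - logits.max(0))
    P /= P.sum(0)                        # softmax over classes
    G = tau * (P - Y) / N                # d(average cross-entropy)/d(W @ H)
    gW, gH = G @ H.T, W.T @ G            # Euclidean gradients
    W = normalize(W - lr * gW, 1)        # gradient step + retraction to sphere
    H = normalize(H - lr * gH, 0)

means = normalize(np.stack([H[:, labels == k].mean(1) for k in range(K)], 1), 0)
cos = means.T @ means
print("pairwise cosines of class means:", np.round(cos[~np.eye(K, dtype=bool)], 3))
print("neural collapse predicts:       ", round(-1.0 / (K - 1), 3))
```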