On the Optimization Landscape of Tensor Decompositions
Non-convex optimization with local search heuristics has been widely used in
machine learning, achieving many state-of-the-art results. It becomes increasingly
important to understand why these methods can work for NP-hard problems on typical
data. The landscape of many objective functions in learning has been
conjectured to have the geometric property that "all local optima are
(approximately) global optima", and thus they can be solved efficiently by
local search algorithms. However, establishing such a property can be very
difficult.
In this paper, we analyze the optimization landscape of the random
over-complete tensor decomposition problem, which has many applications in
unsupervised learning, especially in learning latent variable models. In
practice, it can be efficiently solved by gradient ascent on a non-convex
objective. We show that for any small constant $\epsilon > 0$, among the set of
points with function values $(1+\epsilon)$-factor larger than the expectation
of the function, all the local maxima are approximate global maxima.
Previously, the best-known result only characterizes the geometry in small
neighborhoods around the true components. Our result implies that even with an
initialization that is barely better than a random guess, the gradient ascent
algorithm is guaranteed to solve this problem.
Our main technique uses the Kac-Rice formula and random matrix theory. To the
best of our knowledge, this is the first time the Kac-Rice formula has been successfully
applied to counting the number of local minima of a highly structured random
polynomial with dependent coefficients.
Comment: Best paper in the NIPS 2016 Workshop on Nonconvex Optimization for
Machine Learning: Theory and Practice. In submission.
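A minimal sketch of gradient ascent for this problem, assuming the standard over-complete objective $f(x) = \sum_i \langle a_i, x \rangle^4$ maximized over the unit sphere with random unit components $a_i$ (this particular objective, step size, and problem size are illustrative assumptions, not details taken from the abstract):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                                 # dimension and number of components (over-complete: n > d)
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # random unit components a_1, ..., a_n

def grad(x):
    # Euclidean gradient of f(x) = sum_i <a_i, x>^4, i.e. 4 * sum_i <a_i, x>^3 a_i
    return 4 * A.T @ ((A @ x) ** 3)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                         # random initialization on the sphere
eta = 0.3
for _ in range(2000):
    g = grad(x)
    g -= (g @ x) * x                           # project onto the tangent space of the sphere
    x += eta * g
    x /= np.linalg.norm(x)                     # retract back to the unit sphere

# a correlation close to 1 with some a_i means the iterate sits near a component
print("max |<a_i, x>|:", np.abs(A @ x).max())
```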
PGDOT -- Perturbed Gradient Descent Adapted with Occupation Time
This paper develops further the idea of perturbed gradient descent (PGD), by
adapting the perturbation with the history of states via the notion of occupation
time. The proposed algorithm, perturbed gradient descent adapted with
occupation time (PGDOT), is shown to converge at least as fast as the PGD
algorithm and is guaranteed to avoid getting stuck at saddle points. The
analysis is corroborated by empirical studies, in which a mini-batch version of
PGDOT is shown to outperform alternatives such as mini-batch gradient descent,
Adam, AMSGrad, and RMSProp in training multilayer perceptrons (MLPs). In
particular, the mini-batch PGDOT manages to escape saddle points whereas these
alternatives fail.
Comment: 15 pages, 7 figures, 1 table
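A toy sketch of the perturbed-gradient-descent pattern on a function with a saddle point. The occupation-time adaptation shown here (perturbation magnitude growing with the number of consecutive iterations spent near a stationary point) is a hypothetical stand-in for the paper's rule, which is not reproduced from the source:

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(z):
    # gradient of f(x, y) = x**4/4 - x**2/2 + y**2/2, which has a saddle point at the origin
    x, y = z
    return np.array([x**3 - x, y])

def pgd_occupation(z0, eta=0.1, radius=1e-3, steps=300, grad_tol=1e-2):
    # Perturbed gradient descent where the perturbation grows with the time spent
    # near a stationary point (a hypothetical stand-in for the paper's
    # occupation-time adaptation, not the rule from the paper).
    z = z0.copy()
    occupation = 0
    for _ in range(steps):
        g = grad(z)
        if np.linalg.norm(g) < grad_tol:           # possibly stuck near a stationary point
            occupation += 1
            z = z + radius * (1 + occupation) * rng.standard_normal(2)
        else:
            occupation = 0
            z = z - eta * g
    return z

z0 = np.array([0.0, 0.5])   # plain gradient descent from here converges to the saddle at (0, 0)
print("final iterate:", pgd_occupation(z0))   # lands near one of the minima (+-1, 0)
```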
Smoothed Analysis of Discrete Tensor Decomposition and Assemblies of Neurons
We analyze linear independence of rank one tensors produced by tensor powers
of randomly perturbed vectors. This enables efficient decomposition of sums of
high-order tensors. Our analysis builds upon [BCMV14] but allows for a wider
range of perturbation models, including discrete ones. We give an application
to recovering assemblies of neurons.
Assemblies are large sets of neurons representing specific memories or
concepts. The size of the intersection of two assemblies has been shown in
experiments to represent the extent to which these memories co-occur or these
concepts are related; the phenomenon is called association of assemblies. This
suggests that an animal's memory is a complex web of associations, and poses
the problem of recovering this representation from cognitive data. Motivated by
this problem, we study the following more general question: Can we reconstruct
the Venn diagram of a family of sets, given the sizes of their $\ell$-wise
intersections? We show that as long as the family of sets is randomly
perturbed, it is enough for the number of measurements to be polynomially
larger than the number of nonempty regions of the Venn diagram to fully
reconstruct the diagram.
Comment: To appear in NIPS 2018
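A toy illustration of why intersection sizes determine a Venn diagram in principle: each $k$-wise intersection size is a linear measurement of the region sizes, so for a small family the regions can be read off from a linear system. This construction is for illustration only and is not the paper's smoothed-analysis argument:

```python
import numpy as np
from itertools import combinations, product

n_sets = 3
# one Venn-diagram region per nonempty membership pattern over the sets
patterns = [p for p in product([0, 1], repeat=n_sets) if any(p)]
rng = np.random.default_rng(2)
true_sizes = rng.integers(1, 20, size=len(patterns))   # hidden region sizes

rows, measurements = [], []
for k in range(1, n_sets + 1):
    for subset in combinations(range(n_sets), k):
        # |intersection of the sets in `subset`| = sum of the regions lying inside all of them
        row = [all(p[i] == 1 for i in subset) for p in patterns]
        rows.append(row)
        measurements.append(np.array(row) @ true_sizes)

A = np.array(rows, dtype=float)
b = np.array(measurements, dtype=float)
recovered, *_ = np.linalg.lstsq(A, b, rcond=None)
print("recovered region sizes:", np.round(recovered).astype(int))
print("true region sizes:     ", true_sizes)
```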
Time-varying Autoregression with Low Rank Tensors
We present a windowed technique to learn parsimonious time-varying
autoregressive models from multivariate time series. This unsupervised method
uncovers interpretable spatiotemporal structure in data via non-smooth and
non-convex optimization. In each time window, we assume the data follow a
linear model parameterized by a system matrix, and we model this stack of
potentially different system matrices as a low rank tensor. Because of its
structure, the model is scalable to high-dimensional data and can easily
incorporate priors such as smoothness over time. We find the components of the
tensor using alternating minimization and prove that any stationary point of
this algorithm is a local minimum. We demonstrate on a synthetic example that
our method identifies the true rank of a switching linear system in the
presence of noise. We illustrate our model's utility and superior scalability
over extant methods when applied to several synthetic and real-world examples:
two types of time-varying linear systems, worm behavior, sea surface
temperature, and monkey brain datasets.
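A rough sketch of the windowed setup: per-window system matrices are estimated by least squares, and their stack is inspected for low (tensor) rank. The data, window length, and rank here are illustrative assumptions, and the paper's alternating-minimization fit of the low-rank model itself is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_stable(d, rho=0.95):
    # random system matrix rescaled to spectral radius rho (keeps the simulation bounded)
    M = rng.standard_normal((d, d))
    return rho * M / np.max(np.abs(np.linalg.eigvals(M)))

d, window, n_windows = 5, 100, 8
A_regimes = [random_stable(d), random_stable(d)]       # two dynamical regimes

# simulate x_{t+1} = A x_t + noise, switching regimes halfway through
X = [rng.standard_normal(d)]
for w in range(n_windows):
    A = A_regimes[0] if w < n_windows // 2 else A_regimes[1]
    for _ in range(window):
        X.append(A @ X[-1] + 0.05 * rng.standard_normal(d))
X = np.array(X)

# per-window least-squares estimates of the system matrices
A_hat = []
for w in range(n_windows):
    seg = X[w * window:(w + 1) * window + 1]
    past, future = seg[:-1].T, seg[1:].T
    A_hat.append(future @ np.linalg.pinv(past))
T = np.array(A_hat)                                    # tensor of shape (n_windows, d, d)

# the stack of system matrices is (numerically) low rank: its unfolding has
# roughly two dominant singular values, matching the two regimes
s = np.linalg.svd(T.reshape(n_windows, -1), compute_uv=False)
print("singular values of the unfolded tensor:", np.round(s, 3))
```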
Are ResNets Provably Better than Linear Predictors?
A residual network (or ResNet) is a standard deep neural net architecture,
with state-of-the-art performance across numerous applications. The main
premise of ResNets is that they allow the training of each layer to focus on
fitting just the residual between the previous layer's output and the target output.
Thus, we should expect that the trained network is no worse than what we can
obtain if we remove the residual layers and train a shallower network instead.
However, due to the non-convexity of the optimization problem, it is not at all
clear that ResNets indeed achieve this behavior, rather than getting stuck at
some arbitrarily poor local minimum. In this paper, we rigorously prove that
arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the
sense that the optimization landscape contains no local minima with value above
what can be obtained with a linear predictor (namely a 1-layer network).
Notably, we show this under minimal or no assumptions on the precise network
architecture, data distribution, or loss function used. We also provide a
quantitative analysis of approximate stationary points for this problem.
Finally, we show that with a certain tweak to the architecture, training the
network with standard stochastic gradient descent achieves an objective value
close to or better than that of any linear predictor.
Comment: Comparison to previous arXiv version: minor changes to incorporate
comments of NIPS 2018 reviewers (main results are unaffected).
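A small numeric illustration of the premise, assuming residual units of the form $h_{l+1} = h_l + V_l \tanh(W_l h_l)$ with a linear output layer (an assumed architecture for illustration): zeroing the residual blocks reproduces a linear predictor exactly, so the best ResNet parameters are at least as good as the best linear predictor. The paper's contribution concerns the local minima of the non-convex training objective, which this sketch does not address:

```python
import numpy as np

rng = np.random.default_rng(4)

def resnet_predict(x, blocks, w):
    # residual chain h_{l+1} = h_l + V_l * tanh(W_l h_l); final prediction w^T h_L
    h = x
    for W, V in blocks:
        h = h + V @ np.tanh(W @ h)
    return w @ h

d, n, depth = 10, 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# best linear predictor (ordinary least squares)
w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
lin_loss = np.mean((X @ w_lin - y) ** 2)

# a ResNet whose residual blocks are zeroed out computes exactly the linear predictor
blocks = [(rng.standard_normal((d, d)), np.zeros((d, d))) for _ in range(depth)]
res_pred = np.array([resnet_predict(x, blocks, w_lin) for x in X])
res_loss = np.mean((res_pred - y) ** 2)

print("linear predictor loss:        ", lin_loss)
print("ResNet (zeroed residuals) loss:", res_loss)   # identical by construction
```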
Depth with Nonlinearity Creates No Bad Local Minima in ResNets
In this paper, we prove that depth with nonlinearity creates no bad local
minima in a type of arbitrarily deep ResNets with arbitrary nonlinear
activation functions, in the sense that the values of all local minima are no
worse than the global minimum value of corresponding classical machine-learning
models, and are guaranteed to further improve via residual representations. As
a result, this paper provides an affirmative answer to an open question stated
in a paper in the conference on Neural Information Processing Systems 2018.
This paper advances the optimization theory of deep learning only for ResNets
and not for other network architectures.
A theory on the absence of spurious solutions for nonconvex and nonsmooth optimization
We study the set of continuous functions that admit no spurious local optima
(i.e. local minima that are not global minima) which we term \textit{global
functions}. They satisfy various powerful properties for analyzing nonconvex
and nonsmooth optimization problems. For instance, they satisfy a theorem akin
to the fundamental uniform limit theorem in analysis regarding continuous
functions. Global functions are also endowed with useful properties regarding
the composition of functions and change of variables. Using these new results,
we show that a class of nonconvex and nonsmooth optimization problems arising
in tensor decomposition applications are global functions. This is the first
result concerning nonconvex methods for nonsmooth objective functions. Our
result provides a theoretical guarantee for the widely-used $\ell_1$ norm to
avoid outliers in nonconvex optimization.
Comment: 22 pages, 13 figures
Synchronization of Kuramoto Oscillators in Dense Networks
We study synchronization properties of systems of Kuramoto oscillators. The
problem can also be understood as a question about the properties of an energy
landscape created by a graph. More formally, let $G = (V, E)$ be a connected graph
on $n$ vertices and let $(a_{ij})_{i,j=1}^{n}$ denote its adjacency matrix. Let the function
$f : \mathbb{T}^n \to \mathbb{R}$ be given by
$$ f(\theta_1, \dots, \theta_n) = \sum_{i,j=1}^{n} a_{ij} \cos(\theta_i - \theta_j). $$
This function has a global maximum when $\theta_i = \theta$ for all $1 \le i \le n$.
It is known that if every vertex is connected to at least $\mu(n-1)$ other
vertices for $\mu$ sufficiently large, then every local maximum is global.
Taylor proved this for $\mu \ge 0.9395$ and Ling, Xu \& Bandeira improved this
to $\mu \ge 0.7929$. We give a slight improvement to $\mu \ge 0.7889$.
Townsend, Stillman \& Strogatz suggested that the critical value might be $\mu_c = 0.75$.
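A minimal numerical sketch of this energy landscape: gradient ascent on $f$ for a dense random graph, checking that the phases synchronize. The graph density, step size, and iteration count are illustrative choices, not values from the abstract:

```python
import numpy as np

rng = np.random.default_rng(5)

n, p = 60, 0.95
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # dense random undirected graph, zero diagonal

def grad(theta):
    # gradient of f(theta) = sum_{i,j} a_ij cos(theta_i - theta_j):
    # df/dtheta_i = -2 * sum_j a_ij sin(theta_i - theta_j)
    return -2 * np.sum(A * np.sin(theta[:, None] - theta[None, :]), axis=1)

theta = rng.uniform(0, 2 * np.pi, n)         # random initial phases
for _ in range(4000):
    theta = theta + 0.005 * grad(theta)      # gradient ascent on the energy

# the order parameter |mean(exp(i theta))| is close to 1 when the phases synchronize
print("order parameter:", np.abs(np.mean(np.exp(1j * theta))))
```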
Notes on computational-to-statistical gaps: predictions using statistical physics
In these notes we describe heuristics to predict computational-to-statistical
gaps in certain statistical problems. These are regimes in which the underlying
statistical problem is information-theoretically possible although no efficient
algorithm exists, rendering the problem essentially unsolvable for large
instances. The methods we describe here are based on mature, albeit
non-rigorous, tools from statistical physics.
These notes are based on a lecture series given by the authors at the Courant
Institute of Mathematical Sciences in New York City, on May 16th, 2017.
Comment: 22 pages, 2 figures
Towards the optimal construction of a loss function without spurious local minima for solving quadratic equations
The problem of finding a vector $x$ which obeys a set of quadratic equations
$|a_k^\top x|^2 = y_k$, $k = 1, \dots, m$, plays an important role in many
applications. In this paper we consider the case when both $x$ and $a_k$ are
real-valued vectors of length $n$. A new loss function is constructed for this
problem, which combines the smooth quadratic loss function with an activation
function. Under the Gaussian measurement model, we establish that with high
probability the target solution $x$ is the unique local minimizer (up to a
global phase factor) of the new loss function provided $m \gtrsim n$. Moreover,
the loss function always has a negative directional curvature around its saddle
points.
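For context, a sketch of the measurement model and of the plain smooth quadratic loss $\frac{1}{m}\sum_k (|a_k^\top z|^2 - y_k)^2$ under Gaussian measurements; the paper's new loss, which composes this with an activation function, is not reproduced here, and the sizes, initialization, and step size below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

n, m = 20, 200                         # signal length and number of measurements
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                 # target solution (recoverable only up to sign)
A = rng.standard_normal((m, n))        # Gaussian measurement vectors a_k
y = (A @ x) ** 2                       # quadratic measurements y_k = |a_k^T x|^2

def grad(z):
    # gradient of L(z) = (1/m) * sum_k ((a_k^T z)^2 - y_k)^2
    r = (A @ z) ** 2 - y
    return (4 / m) * A.T @ (r * (A @ z))

z = 0.1 * rng.standard_normal(n)       # small random initialization
for _ in range(5000):
    z = z - 0.01 * grad(z)             # plain gradient descent on the quadratic loss

err = min(np.linalg.norm(z - x), np.linalg.norm(z + x))
print("distance to the target up to global sign:", err)
```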