On the Optimization Landscape of Tensor Decompositions
Non-convex optimization with local search heuristics has been widely used in
machine learning, achieving many state-of-the-art results. It becomes increasingly
important to understand why these methods can work for NP-hard problems on typical
data. The landscape of many objective functions in learning has been
conjectured to have the geometric property that "all local optima are
(approximately) global optima", and thus they can be solved efficiently by
local search algorithms. However, establishing such a property can be very
difficult.
In this paper, we analyze the optimization landscape of the random
over-complete tensor decomposition problem, which has many applications in
unsupervised learning, especially in learning latent variable models. In
practice, it can be efficiently solved by gradient ascent on a non-convex
objective. We show that for any small constant $\epsilon > 0$, among the set of
points with function values $(1+\epsilon)$-factor larger than the expectation
of the function, all the local maxima are approximate global maxima.
Previously, the best-known result only characterizes the geometry in small
neighborhoods around the true components. Our result implies that even with an
initialization that is barely better than a random guess, the gradient ascent
algorithm is guaranteed to solve this problem.
Our main technique uses the Kac-Rice formula and random matrix theory. To the
best of our knowledge, this is the first time the Kac-Rice formula has been successfully
applied to counting the number of local minima of a highly structured random
polynomial with dependent coefficients.
Comment: Best paper in the NIPS 2016 Workshop on Nonconvex Optimization for
Machine Learning: Theory and Practice. In submission.
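A minimal sketch of gradient ascent for this problem, assuming the standard over-complete objective $f(x) = \sum_i \langle a_i, x \rangle^4$ maximized over the unit sphere with random unit components $a_i$ (this particular objective, step size, and problem size are illustrative assumptions, not details taken from the abstract):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 50, 200                                 # dimension and number of components (over-complete: n > d)
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)  # random unit components a_1, ..., a_n

def grad(x):
    # Euclidean gradient of f(x) = sum_i <a_i, x>^4, i.e. 4 * sum_i <a_i, x>^3 a_i
    return 4 * A.T @ ((A @ x) ** 3)

x = rng.standard_normal(d)
x /= np.linalg.norm(x)                         # random initialization on the sphere
eta = 0.3
for _ in range(2000):
    g = grad(x)
    g -= (g @ x) * x                           # project onto the tangent space of the sphere
    x += eta * g
    x /= np.linalg.norm(x)                     # retract back to the unit sphere

# a correlation close to 1 with some a_i means the iterate sits near a component
print("max |<a_i, x>|:", np.abs(A @ x).max())
```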
PGDOT -- Perturbed Gradient Descent Adapted with Occupation Time
This paper develops further the idea of perturbed gradient descent (PGD), by
adapting the perturbation with the history of states via the notion of occupation
time. The proposed algorithm, perturbed gradient descent adapted with
occupation time (PGDOT), is shown to converge at least as fast as the PGD
algorithm and is guaranteed to avoid getting stuck at saddle points. The
analysis is corroborated by empirical studies, in which a mini-batch version of
PGDOT is shown to outperform alternatives such as mini-batch gradient descent,
Adam, AMSGrad, and RMSProp in training multilayer perceptrons (MLPs). In
particular, the mini-batch PGDOT manages to escape saddle points whereas these
alternatives fail.
Comment: 15 pages, 7 figures, 1 table
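A toy sketch of the perturbed-gradient-descent pattern on a function with a saddle point. The occupation-time adaptation shown here (perturbation magnitude growing with the number of consecutive iterations spent near a stationary point) is a hypothetical stand-in for the paper's rule, which is not reproduced from the source:

```python
import numpy as np

rng = np.random.default_rng(1)

def grad(z):
    # gradient of f(x, y) = x**4/4 - x**2/2 + y**2/2, which has a saddle point at the origin
    x, y = z
    return np.array([x**3 - x, y])

def pgd_occupation(z0, eta=0.1, radius=1e-3, steps=300, grad_tol=1e-2):
    # Perturbed gradient descent where the perturbation grows with the time spent
    # near a stationary point (a hypothetical stand-in for the paper's
    # occupation-time adaptation, not the rule from the paper).
    z = z0.copy()
    occupation = 0
    for _ in range(steps):
        g = grad(z)
        if np.linalg.norm(g) < grad_tol:           # possibly stuck near a stationary point
            occupation += 1
            z = z + radius * (1 + occupation) * rng.standard_normal(2)
        else:
            occupation = 0
            z = z - eta * g
    return z

z0 = np.array([0.0, 0.5])   # plain gradient descent from here converges to the saddle at (0, 0)
print("final iterate:", pgd_occupation(z0))   # lands near one of the minima (+-1, 0)
```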
Smoothed Analysis of Discrete Tensor Decomposition and Assemblies of Neurons
We analyze linear independence of rank one tensors produced by tensor powers
of randomly perturbed vectors. This enables efficient decomposition of sums of
high-order tensors. Our analysis builds upon [BCMV14] but allows for a wider
range of perturbation models, including discrete ones. We give an application
to recovering assemblies of neurons.
Assemblies are large sets of neurons representing specific memories or
concepts. The size of the intersection of two assemblies has been shown in
experiments to represent the extent to which these memories co-occur or these
concepts are related; the phenomenon is called association of assemblies. This
suggests that an animal's memory is a complex web of associations, and poses
the problem of recovering this representation from cognitive data. Motivated by
this problem, we study the following more general question: Can we reconstruct
the Venn diagram of a family of sets, given the sizes of their $\ell$-wise
intersections? We show that as long as the family of sets is randomly
perturbed, it is enough for the number of measurements to be polynomially
larger than the number of nonempty regions of the Venn diagram to fully
reconstruct the diagram.
Comment: To appear in NIPS 2018
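A toy illustration of why intersection sizes determine a Venn diagram in principle: each $k$-wise intersection size is a linear measurement of the region sizes, so for a small family the regions can be read off from a linear system. This construction is for illustration only and is not the paper's smoothed-analysis argument:

```python
import numpy as np
from itertools import combinations, product

n_sets = 3
# one Venn-diagram region per nonempty membership pattern over the sets
patterns = [p for p in product([0, 1], repeat=n_sets) if any(p)]
rng = np.random.default_rng(2)
true_sizes = rng.integers(1, 20, size=len(patterns))   # hidden region sizes

rows, measurements = [], []
for k in range(1, n_sets + 1):
    for subset in combinations(range(n_sets), k):
        # |intersection of the sets in `subset`| = sum of the regions lying inside all of them
        row = [all(p[i] == 1 for i in subset) for p in patterns]
        rows.append(row)
        measurements.append(np.array(row) @ true_sizes)

A = np.array(rows, dtype=float)
b = np.array(measurements, dtype=float)
recovered, *_ = np.linalg.lstsq(A, b, rcond=None)
print("recovered region sizes:", np.round(recovered).astype(int))
print("true region sizes:     ", true_sizes)
```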
Time-varying Autoregression with Low Rank Tensors
We present a windowed technique to learn parsimonious time-varying
autoregressive models from multivariate time series. This unsupervised method
uncovers interpretable spatiotemporal structure in data via non-smooth and
non-convex optimization. In each time window, we assume the data follow a
linear model parameterized by a system matrix, and we model this stack of
potentially different system matrices as a low rank tensor. Because of its
structure, the model is scalable to high-dimensional data and can easily
incorporate priors such as smoothness over time. We find the components of the
tensor using alternating minimization and prove that any stationary point of
this algorithm is a local minimum. We demonstrate on a synthetic example that
our method identifies the true rank of a switching linear system in the
presence of noise. We illustrate our model's utility and superior scalability
over extant methods when applied to several synthetic and real-world examples:
two types of time-varying linear systems, worm behavior, sea surface
temperature, and monkey brain datasets.
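A rough sketch of the windowed setup: per-window system matrices are estimated by least squares, and their stack is inspected for low (tensor) rank. The data, window length, and rank here are illustrative assumptions, and the paper's alternating-minimization fit of the low-rank model itself is not reproduced:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_stable(d, rho=0.95):
    # random system matrix rescaled to spectral radius rho (keeps the simulation bounded)
    M = rng.standard_normal((d, d))
    return rho * M / np.max(np.abs(np.linalg.eigvals(M)))

d, window, n_windows = 5, 100, 8
A_regimes = [random_stable(d), random_stable(d)]       # two dynamical regimes

# simulate x_{t+1} = A x_t + noise, switching regimes halfway through
X = [rng.standard_normal(d)]
for w in range(n_windows):
    A = A_regimes[0] if w < n_windows // 2 else A_regimes[1]
    for _ in range(window):
        X.append(A @ X[-1] + 0.05 * rng.standard_normal(d))
X = np.array(X)

# per-window least-squares estimates of the system matrices
A_hat = []
for w in range(n_windows):
    seg = X[w * window:(w + 1) * window + 1]
    past, future = seg[:-1].T, seg[1:].T
    A_hat.append(future @ np.linalg.pinv(past))
T = np.array(A_hat)                                    # tensor of shape (n_windows, d, d)

# the stack of system matrices is (numerically) low rank: its unfolding has
# roughly two dominant singular values, matching the two regimes
s = np.linalg.svd(T.reshape(n_windows, -1), compute_uv=False)
print("singular values of the unfolded tensor:", np.round(s, 3))
```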
Are ResNets Provably Better than Linear Predictors?
A residual network (or ResNet) is a standard deep neural net architecture,
with state-of-the-art performance across numerous applications. The main
premise of ResNets is that they allow the training of each layer to focus on
fitting just the residual between the previous layer's output and the target output.
Thus, we should expect that the trained network is no worse than what we can
obtain if we remove the residual layers and train a shallower network instead.
However, due to the non-convexity of the optimization problem, it is not at all
clear that ResNets indeed achieve this behavior, rather than getting stuck at
some arbitrarily poor local minimum. In this paper, we rigorously prove that
arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the
sense that the optimization landscape contains no local minima with value above
what can be obtained with a linear predictor (namely a 1-layer network).
Notably, we show this under minimal or no assumptions on the precise network
architecture, data distribution, or loss function used. We also provide a
quantitative analysis of approximate stationary points for this problem.
Finally, we show that with a certain tweak to the architecture, training the
network with standard stochastic gradient descent achieves an objective value
close to or better than that of any linear predictor.
Comment: Comparison to previous arXiv version: minor changes to incorporate
comments of NIPS 2018 reviewers (main results are unaffected).
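A small numeric illustration of the premise, assuming residual units of the form $h_{l+1} = h_l + V_l \tanh(W_l h_l)$ with a linear output layer (an assumed architecture for illustration): zeroing the residual blocks reproduces a linear predictor exactly, so the best ResNet parameters are at least as good as the best linear predictor. The paper's contribution concerns the local minima of the non-convex training objective, which this sketch does not address:

```python
import numpy as np

rng = np.random.default_rng(4)

def resnet_predict(x, blocks, w):
    # residual chain h_{l+1} = h_l + V_l * tanh(W_l h_l); final prediction w^T h_L
    h = x
    for W, V in blocks:
        h = h + V @ np.tanh(W @ h)
    return w @ h

d, n, depth = 10, 200, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

# best linear predictor (ordinary least squares)
w_lin, *_ = np.linalg.lstsq(X, y, rcond=None)
lin_loss = np.mean((X @ w_lin - y) ** 2)

# a ResNet whose residual blocks are zeroed out computes exactly the linear predictor
blocks = [(rng.standard_normal((d, d)), np.zeros((d, d))) for _ in range(depth)]
res_pred = np.array([resnet_predict(x, blocks, w_lin) for x in X])
res_loss = np.mean((res_pred - y) ** 2)

print("linear predictor loss:        ", lin_loss)
print("ResNet (zeroed residuals) loss:", res_loss)   # identical by construction
```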
Depth with Nonlinearity Creates No Bad Local Minima in ResNets
In this paper, we prove that depth with nonlinearity creates no bad local
minima in a type of arbitrarily deep ResNets with arbitrary nonlinear
activation functions, in the sense that the values of all local minima are no
worse than the global minimum value of corresponding classical machine-learning
models, and are guaranteed to further improve via residual representations. As
a result, this paper provides an affirmative answer to an open question stated
in a paper in the conference on Neural Information Processing Systems 2018.
This paper advances the optimization theory of deep learning only for ResNets
and not for other network architectures.
A theory on the absence of spurious solutions for nonconvex and nonsmooth optimization
We study the set of continuous functions that admit no spurious local optima
(i.e. local minima that are not global minima) which we term \textit{global
functions}. They satisfy various powerful properties for analyzing nonconvex
and nonsmooth optimization problems. For instance, they satisfy a theorem akin
to the fundamental uniform limit theorem in analysis regarding continuous
functions. Global functions are also endowed with useful properties regarding
the composition of functions and change of variables. Using these new results,
we show that a class of nonconvex and nonsmooth optimization problems arising
in tensor decomposition applications are global functions. This is the first
result concerning nonconvex methods for nonsmooth objective functions. Our
result provides a theoretical guarantee for the widely-used $\ell_1$ norm to
avoid outliers in nonconvex optimization.
Comment: 22 pages, 13 figures
Synchronization of Kuramoto Oscillators in Dense Networks
We study synchronization properties of systems of Kuramoto oscillators. The
problem can also be understood as a question about the properties of an energy
landscape created by a graph. More formally, let $G = (V, E)$ be a connected graph
on $n$ vertices and let $(a_{ij})_{i,j=1}^{n}$ denote its adjacency matrix. Let the function
$f : \mathbb{T}^n \to \mathbb{R}$ be given by
$$ f(\theta_1, \dots, \theta_n) = \sum_{i,j=1}^{n} a_{ij} \cos(\theta_i - \theta_j). $$
This function has a global maximum when $\theta_i = \theta$ for all $1 \le i \le n$.
It is known that if every vertex is connected to at least $\mu(n-1)$ other
vertices for $\mu$ sufficiently large, then every local maximum is global.
Taylor proved this for $\mu \ge 0.9395$ and Ling, Xu \& Bandeira improved this
to $\mu \ge 0.7929$. We give a slight improvement to $\mu \ge 0.7889$.
Townsend, Stillman \& Strogatz suggested that the critical value might be $\mu_c = 0.75$.
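A minimal numerical sketch of this energy landscape: gradient ascent on $f$ for a dense random graph, checking that the phases synchronize. The graph density, step size, and iteration count are illustrative choices, not values from the abstract:

```python
import numpy as np

rng = np.random.default_rng(5)

n, p = 60, 0.95
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1)
A = A + A.T                                  # dense random undirected graph, zero diagonal

def grad(theta):
    # gradient of f(theta) = sum_{i,j} a_ij cos(theta_i - theta_j):
    # df/dtheta_i = -2 * sum_j a_ij sin(theta_i - theta_j)
    return -2 * np.sum(A * np.sin(theta[:, None] - theta[None, :]), axis=1)

theta = rng.uniform(0, 2 * np.pi, n)         # random initial phases
for _ in range(4000):
    theta = theta + 0.005 * grad(theta)      # gradient ascent on the energy

# the order parameter |mean(exp(i theta))| is close to 1 when the phases synchronize
print("order parameter:", np.abs(np.mean(np.exp(1j * theta))))
```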
Notes on computational-to-statistical gaps: predictions using statistical physics
In these notes we describe heuristics to predict computational-to-statistical
gaps in certain statistical problems. These are regimes in which the underlying
statistical problem is information-theoretically possible although no efficient
algorithm exists, rendering the problem essentially unsolvable for large
instances. The methods we describe here are based on mature, albeit
non-rigorous, tools from statistical physics.
These notes are based on a lecture series given by the authors at the Courant
Institute of Mathematical Sciences in New York City, on May 16th, 2017.
Comment: 22 pages, 2 figures
Towards the optimal construction of a loss function without spurious local minima for solving quadratic equations
The problem of finding a vector $x$ which obeys a set of quadratic equations
$|a_k^\top x|^2 = y_k$, $k = 1, \dots, m$, plays an important role in many
applications. In this paper we consider the case when both $x$ and $a_k$ are
real-valued vectors of length $n$. A new loss function is constructed for this
problem, which combines the smooth quadratic loss function with an activation
function. Under the Gaussian measurement model, we establish that with high
probability the target solution $x$ is the unique local minimizer (up to a
global phase factor) of the new loss function provided $m \gtrsim n$. Moreover,
the loss function always has a negative directional curvature around its saddle
points.
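For context, a sketch of the measurement model and of the plain smooth quadratic loss $\frac{1}{m}\sum_k (|a_k^\top z|^2 - y_k)^2$ under Gaussian measurements; the paper's new loss, which composes this with an activation function, is not reproduced here, and the sizes, initialization, and step size below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)

n, m = 20, 200                         # signal length and number of measurements
x = rng.standard_normal(n)
x /= np.linalg.norm(x)                 # target solution (recoverable only up to sign)
A = rng.standard_normal((m, n))        # Gaussian measurement vectors a_k
y = (A @ x) ** 2                       # quadratic measurements y_k = |a_k^T x|^2

def grad(z):
    # gradient of L(z) = (1/m) * sum_k ((a_k^T z)^2 - y_k)^2
    r = (A @ z) ** 2 - y
    return (4 / m) * A.T @ (r * (A @ z))

z = 0.1 * rng.standard_normal(n)       # small random initialization
for _ in range(5000):
    z = z - 0.01 * grad(z)             # plain gradient descent on the quadratic loss

err = min(np.linalg.norm(z - x), np.linalg.norm(z + x))
print("distance to the target up to global sign:", err)
```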