
    On the Optimization Landscape of Tensor Decompositions

    Non-convex optimization with local search heuristics has been widely used in machine learning, achieving many state-of-the-art results. It becomes increasingly important to understand why these heuristics work for NP-hard problems on typical data. The landscape of many objective functions in learning has been conjectured to have the geometric property that "all local optima are (approximately) global optima", and thus they can be solved efficiently by local search algorithms. However, establishing such a property can be very difficult. In this paper, we analyze the optimization landscape of the random over-complete tensor decomposition problem, which has many applications in unsupervised learning, especially in learning latent variable models. In practice, it can be efficiently solved by gradient ascent on a non-convex objective. We show that for any small constant $\epsilon > 0$, among the set of points with function values a $(1+\epsilon)$-factor larger than the expectation of the function, all the local maxima are approximate global maxima. Previously, the best-known result only characterizes the geometry in small neighborhoods around the true components. Our result implies that even with an initialization that is barely better than a random guess, the gradient ascent algorithm is guaranteed to solve this problem. Our main technique uses the Kac-Rice formula and random matrix theory. To the best of our knowledge, this is the first time the Kac-Rice formula has been successfully applied to counting the number of local minima of a highly structured random polynomial with dependent coefficients. Comment: Best paper in the NIPS 2016 Workshop on Nonconvex Optimization for Machine Learning: Theory and Practice. In submission.
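
    The abstract does not spell out the objective being ascended, so the following is a minimal sketch assuming the commonly studied fourth-order objective $f(u)=\sum_i \langle a_i, u\rangle^4$ over the unit sphere with random unit-norm components $a_i$; the dimensions, step size, and iteration count are illustrative choices, not taken from the paper.

```python
# Sketch only: projected gradient ascent on f(u) = sum_i <a_i, u>^4 over the
# unit sphere, an assumed stand-in for the paper's non-convex objective.
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 100                                   # over-complete: more components than dimensions
A = rng.standard_normal((m, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)    # random unit-norm components a_i

def gradient(u):
    return 4 * A.T @ ((A @ u) ** 3)              # gradient of sum_i (a_i . u)^4

u = rng.standard_normal(d)
u /= np.linalg.norm(u)                           # random initialization on the sphere
for _ in range(1000):
    u = u + 0.5 * gradient(u)                    # gradient ascent step
    u /= np.linalg.norm(u)                       # project back onto the unit sphere

# Largest correlation with a true component (close to 1 if a component was recovered).
print(np.max(np.abs(A @ u)))
```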

    PGDOT -- Perturbed Gradient Descent Adapted with Occupation Time

    This paper develops further the idea of perturbed gradient descent (PGD) by adapting the perturbation to the history of states via the notion of occupation time. The proposed algorithm, perturbed gradient descent adapted with occupation time (PGDOT), is shown to converge at least as fast as the PGD algorithm and is guaranteed to avoid getting stuck at saddle points. The analysis is corroborated by empirical studies, in which a mini-batch version of PGDOT is shown to outperform alternatives such as mini-batch gradient descent, Adam, AMSGrad, and RMSProp in training multilayer perceptrons (MLPs). In particular, the mini-batch PGDOT manages to escape saddle points whereas these alternatives fail. Comment: 15 pages, 7 figures, 1 table.
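
    The abstract does not specify how occupation time enters the perturbation, so the sketch below only illustrates the general perturbed-gradient-descent mechanism; the history-dependent noise scale standing in for the occupation-time adaptation is an assumption, not the paper's rule.

```python
# Sketch of perturbed gradient descent; the occupation-time adaptation is only
# mimicked by a made-up history-dependent noise scale.
import numpy as np

def perturbed_gd(grad, x0, lr=0.1, tol=1e-3, radius=0.1, steps=500, cooldown=50, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    occupation = 0                        # crude proxy for time accumulated near flat regions
    since_perturb = cooldown
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < tol and since_perturb >= cooldown:
            occupation += 1               # spent another visit near a (possible) saddle
            scale = radius * min(1.0, 0.2 * occupation)      # assumed history-dependent scaling
            x = x + scale * rng.standard_normal(x.shape)     # noise kicks the iterate off the saddle
            since_perturb = 0
        else:
            x = x - lr * g                # ordinary gradient descent step
            since_perturb += 1
    return x

# f(x, y) = (x^2 - 1)^2 + y^2 has a saddle at the origin and minima at (+-1, 0);
# started exactly at the saddle, plain gradient descent would stay put.
grad_f = lambda p: np.array([4 * p[0] * (p[0] ** 2 - 1), 2 * p[1]])
print(perturbed_gd(grad_f, [0.0, 0.0]))   # lands near one of the two minima
```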

    Smoothed Analysis of Discrete Tensor Decomposition and Assemblies of Neurons

    We analyze linear independence of rank one tensors produced by tensor powers of randomly perturbed vectors. This enables efficient decomposition of sums of high-order tensors. Our analysis builds upon [BCMV14] but allows for a wider range of perturbation models, including discrete ones. We give an application to recovering assemblies of neurons. Assemblies are large sets of neurons representing specific memories or concepts. The size of the intersection of two assemblies has been shown in experiments to represent the extent to which these memories co-occur or these concepts are related; the phenomenon is called association of assemblies. This suggests that an animal's memory is a complex web of associations, and poses the problem of recovering this representation from cognitive data. Motivated by this problem, we study the following more general question: Can we reconstruct the Venn diagram of a family of sets, given the sizes of their $\ell$-wise intersections? We show that as long as the family of sets is randomly perturbed, it is enough for the number of measurements to be polynomially larger than the number of nonempty regions of the Venn diagram to fully reconstruct the diagram. Comment: To appear in NIPS 2018.
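
    As a toy illustration of the reconstruction question (ignoring the smoothed-analysis aspect), the sketch below uses an assumed family of three sets with made-up region sizes: each $\ell$-wise intersection size is a linear function of the Venn-region sizes, so the regions can be recovered by solving a linear system.

```python
# Toy example: intersection sizes are sums of Venn-region sizes, so the
# region sizes solve a linear system built from subset relations.
import numpy as np
from itertools import combinations

k = 3
patterns = [frozenset(s) for r in range(1, k + 1)
            for s in combinations(range(k), r)]             # the 7 nonempty Venn regions
true_sizes = np.array([5.0, 3.0, 2.0, 4.0, 1.0, 2.0, 6.0])  # made-up region sizes

subfamilies = patterns                                      # measure every l-wise intersection, l = 1..k
M = np.array([[1.0 if T <= R else 0.0 for R in patterns] for T in subfamilies])
measurements = M @ true_sizes                               # size of the intersection indexed by T

recovered, *_ = np.linalg.lstsq(M, measurements, rcond=None)
print(np.allclose(recovered, true_sizes))                   # True: the diagram is reconstructed
```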

    Time-varying Autoregression with Low Rank Tensors

    We present a windowed technique to learn parsimonious time-varying autoregressive models from multivariate time series. This unsupervised method uncovers interpretable spatiotemporal structure in data via non-smooth and non-convex optimization. In each time window, we assume the data follow a linear model parameterized by a system matrix, and we model this stack of potentially different system matrices as a low rank tensor. Because of its structure, the model is scalable to high-dimensional data and can easily incorporate priors such as smoothness over time. We find the components of the tensor using alternating minimization and prove that any stationary point of this algorithm is a local minimum. We demonstrate on a synthetic example that our method identifies the true rank of a switching linear system in the presence of noise. We illustrate our model's utility and superior scalability over extant methods when applied to several synthetic and real-world examples: two types of time-varying linear systems, worm behavior, sea surface temperature, and monkey brain datasets.
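
    A minimal sketch of the windowed setup on simulated data follows; the low-rank step is a crude truncated SVD of one tensor unfolding standing in for the paper's alternating minimization over tensor factors, and all dimensions and noise levels are made up.

```python
# Sketch only: per-window least squares plus a truncated-SVD compression of the
# stacked system matrices (not the paper's alternating-minimization algorithm).
import numpy as np

rng = np.random.default_rng(0)
d, window, n_windows = 4, 50, 6
A_true = [0.9 * np.eye(d) + 0.05 * rng.standard_normal((d, d)) for _ in range(n_windows)]

# Simulate a switching linear system x_{t+1} = A_w x_t + noise within each window.
X = [rng.standard_normal(d)]
for w in range(n_windows):
    for _ in range(window):
        X.append(A_true[w] @ X[-1] + 0.1 * rng.standard_normal(d))
X = np.array(X)

# Fit one system matrix per window by least squares: rows satisfy y ~ x @ A_w.T.
A_hat = []
for w in range(n_windows):
    Xw, Yw = X[w * window:(w + 1) * window], X[w * window + 1:(w + 1) * window + 1]
    A_hat.append(np.linalg.lstsq(Xw, Yw, rcond=None)[0].T)
A_hat = np.array(A_hat)                          # stack of system matrices, shape (n_windows, d, d)

# Compress the stack with a rank-2 truncation of its mode-1 unfolding.
U, S, Vt = np.linalg.svd(A_hat.reshape(n_windows, d * d), full_matrices=False)
A_low = ((U[:, :2] * S[:2]) @ Vt[:2]).reshape(n_windows, d, d)
print(np.linalg.norm(A_low - A_hat) / np.linalg.norm(A_hat))   # relative truncation error
```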

    Are ResNets Provably Better than Linear Predictors?

    A residual network (or ResNet) is a standard deep neural net architecture, with state-of-the-art performance across numerous applications. The main premise of ResNets is that they allow the training of each layer to focus on fitting just the residual of the previous layer's output and the target output. Thus, we should expect that the trained network is no worse than what we can obtain if we remove the residual layers and train a shallower network instead. However, due to the non-convexity of the optimization problem, it is not at all clear that ResNets indeed achieve this behavior, rather than getting stuck at some arbitrarily poor local minimum. In this paper, we rigorously prove that arbitrarily deep, nonlinear residual units indeed exhibit this behavior, in the sense that the optimization landscape contains no local minima with value above what can be obtained with a linear predictor (namely a 1-layer network). Notably, we show this under minimal or no assumptions on the precise network architecture, data distribution, or loss function used. We also provide a quantitative analysis of approximate stationary points for this problem. Finally, we show that with a certain tweak to the architecture, training the network with standard stochastic gradient descent achieves an objective value close to or better than that of any linear predictor. Comment: Comparison to previous arXiv version: minor changes to incorporate comments of NIPS 2018 reviewers (main results are unaffected).
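
    The premise can be made concrete with a toy residual architecture (an illustrative choice, not necessarily the exact setting analyzed in the paper): with a linear readout on top of residual units, zeroing the residual branches reduces the network exactly to a linear predictor, so the model class contains every linear predictor.

```python
# Toy residual network with a linear readout; zeroing the residual branches
# recovers exactly the linear predictor w^T x.
import numpy as np

rng = np.random.default_rng(0)
d, depth = 10, 4
W = [0.1 * rng.standard_normal((d, d)) for _ in range(depth)]   # first layer of each residual branch
V = [0.1 * rng.standard_normal((d, d)) for _ in range(depth)]   # second layer of each residual branch
w = rng.standard_normal(d)                                      # final linear readout

def resnet(x, W, V, w):
    h = x
    for Wl, Vl in zip(W, V):
        h = h + Vl @ np.maximum(Wl @ h, 0.0)   # residual unit: identity plus a ReLU branch
    return w @ h

x = rng.standard_normal(d)
zeros = [np.zeros((d, d))] * depth
print(np.isclose(resnet(x, W, zeros, w), w @ x))   # True: the linear predictor is representable
```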

    Depth with Nonlinearity Creates No Bad Local Minima in ResNets

    In this paper, we prove that depth with nonlinearity creates no bad local minima in a type of arbitrarily deep ResNets with arbitrary nonlinear activation functions, in the sense that the values of all local minima are no worse than the global minimum value of corresponding classical machine-learning models, and are guaranteed to further improve via residual representations. As a result, this paper provides an affirmative answer to an open question stated in a paper at the conference on Neural Information Processing Systems 2018. This paper advances the optimization theory of deep learning only for ResNets and not for other network architectures.

    A theory on the absence of spurious solutions for nonconvex and nonsmooth optimization

    We study the set of continuous functions that admit no spurious local optima (i.e., local minima that are not global minima), which we term \textit{global functions}. They satisfy various powerful properties for analyzing nonconvex and nonsmooth optimization problems. For instance, they satisfy a theorem akin to the fundamental uniform limit theorem in analysis regarding continuous functions. Global functions are also endowed with useful properties regarding the composition of functions and change of variables. Using these new results, we show that a class of nonconvex and nonsmooth optimization problems arising in tensor decomposition applications are global functions. This is the first result concerning nonconvex methods for nonsmooth objective functions. Our result provides a theoretical guarantee for the widely used $\ell_1$ norm to avoid outliers in nonconvex optimization. Comment: 22 pages, 13 figures.

    Synchronization of Kuramoto Oscillators in Dense Networks

    We study synchronization properties of systems of Kuramoto oscillators. The problem can also be understood as a question about the properties of an energy landscape created by a graph. More formally, let $G=(V,E)$ be a connected graph and let $(a_{ij})_{i,j=1}^{n}$ denote its adjacency matrix. Let the function $f:\mathbb{T}^n \rightarrow \mathbb{R}$ be given by $$f(\theta_1, \dots, \theta_n) = \sum_{i,j=1}^{n} a_{ij} \cos(\theta_i - \theta_j).$$ This function has a global maximum when $\theta_i = \theta$ for all $1 \leq i \leq n$. It is known that if every vertex is connected to at least $\mu(n-1)$ other vertices for $\mu$ sufficiently large, then every local maximum is global. Taylor proved this for $\mu \geq 0.9395$ and Ling, Xu & Bandeira improved this to $\mu \geq 0.7929$. We give a slight improvement to $\mu \geq 0.7889$. Townsend, Stillman & Strogatz suggested that the critical value might be $\mu_c = 0.75$.
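
    A quick numerical illustration of this landscape statement, on an assumed dense random graph with average degree around $0.9(n-1)$: gradient ascent on $f$ from random phases should end at the synchronized configuration, whose value $\sum_{i,j} a_{ij}$ is the global maximum.

```python
# Gradient ascent on f(theta) = sum_{i,j} a_ij cos(theta_i - theta_j)
# for a dense random graph; parameters are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 0.9
A = (rng.random((n, n)) < p).astype(float)
A = np.triu(A, 1)
A = A + A.T                                     # symmetric 0/1 adjacency matrix, zero diagonal

def energy(theta):
    return np.sum(A * np.cos(theta[:, None] - theta[None, :]))

def grad(theta):
    return -2.0 * np.sum(A * np.sin(theta[:, None] - theta[None, :]), axis=1)

theta = rng.uniform(0.0, 2.0 * np.pi, n)        # random initial phases
for _ in range(3000):
    theta += 0.01 * grad(theta)                 # gradient ascent on the energy landscape

print(energy(theta), np.sum(A))                 # reached value vs. the global maximum
```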

    Notes on computational-to-statistical gaps: predictions using statistical physics

    In these notes we describe heuristics to predict computational-to-statistical gaps in certain statistical problems. These are regimes in which the underlying statistical problem is information-theoretically possible although no efficient algorithm exists, rendering the problem essentially unsolvable for large instances. The methods we describe here are based on mature, albeit non-rigorous, tools from statistical physics. These notes are based on a lecture series given by the authors at the Courant Institute of Mathematical Sciences in New York City, on May 16th, 2017. Comment: 22 pages, 2 figures.

    Towards the optimal construction of a loss function without spurious local minima for solving quadratic equations

    The problem of finding a vector $x$ which obeys a set of quadratic equations $|a_k^\top x|^2 = y_k$, $k=1,\cdots,m$, plays an important role in many applications. In this paper we consider the case when both $x$ and $a_k$ are real-valued vectors of length $n$. A new loss function is constructed for this problem, which combines the smooth quadratic loss function with an activation function. Under the Gaussian measurement model, we establish that with high probability the target solution $x$ is the unique local minimizer (up to a global phase factor) of the new loss function provided $m \gtrsim n$. Moreover, the loss function always has a negative directional curvature around its saddle points.
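
    The new loss is not given in the abstract, so the sketch below uses the generic smooth quadratic loss $\frac{1}{4m}\sum_k\big((a_k^\top z)^2-y_k\big)^2$ with a standard spectral initialization, purely to illustrate the measurement model; it is not the paper's construction.

```python
# Sketch only: gradient descent on the plain quadratic loss for real-valued
# quadratic equations y_k = (a_k^T x)^2 with Gaussian measurement vectors.
import numpy as np

rng = np.random.default_rng(0)
n, m = 20, 200                                   # number of measurements grows linearly with n
A = rng.standard_normal((m, n))                  # Gaussian measurement vectors a_k as rows
x_true = rng.standard_normal(n)
y = (A @ x_true) ** 2                            # observed quadratic measurements

def grad(z):
    r = (A @ z) ** 2 - y
    return A.T @ (r * (A @ z)) / m               # gradient of (1/4m) * sum_k r_k^2

# Spectral initialization: top eigenvector of (1/m) sum_k y_k a_k a_k^T.
Y = (A.T * y) @ A / m
z = np.linalg.eigh(Y)[1][:, -1] * np.sqrt(np.mean(y))

for _ in range(2000):
    z -= 0.01 * grad(z)                          # plain gradient descent

# Up to a global sign, z should recover x_true.
err = min(np.linalg.norm(z - x_true), np.linalg.norm(z + x_true)) / np.linalg.norm(x_true)
print(err)
```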