
    Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods

    Training neural networks is a challenging non-convex optimization problem, and backpropagation or gradient descent can get stuck in spurious local optima. We propose a novel algorithm based on tensor decomposition for guaranteed training of two-layer neural networks. We provide risk bounds for our proposed method, with a polynomial sample complexity in the relevant parameters, such as the input dimension and the number of neurons. While learning arbitrary target functions is NP-hard, we provide transparent conditions on the function and the input for learnability. Our training method is based on tensor decomposition, which provably converges to the global optimum under a set of mild non-degeneracy conditions. It consists of simple, embarrassingly parallel linear and multilinear operations, and is competitive with standard stochastic gradient descent (SGD) in terms of computational complexity. Thus, we propose a computationally efficient method with guaranteed risk bounds for training neural networks with one hidden layer. Comment: The tensor decomposition analysis is expanded, and the analysis of ridge regression is added for recovering the parameters of the last layer of the neural network.
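
    Below is a minimal, illustrative sketch (not the authors' full pipeline) of the kind of cross-moment tensor such a method decomposes, assuming standard Gaussian inputs so that the third-order score function has a simple closed form; all variable names and parameter values here are illustrative assumptions. The CP components of the resulting tensor align with the hidden-layer weight directions up to sign and scale.

```python
# Sketch only: cross-moment tensor E[y * S3(x)] for Gaussian inputs,
# where S3 is the third-order score function of N(0, I_d). By Stein's
# identity this tensor is (approximately) symmetric and low-rank, with
# components along the hidden-layer weights of the two-layer network.
import numpy as np

def empirical_cross_moment(X, y):
    """Estimate E[y * S3(x)] from samples X (n x d) and labels y (n,)."""
    n, d = X.shape
    I = np.eye(d)
    T = np.einsum('n,ni,nj,nk->ijk', y, X, X, X) / n
    m = (y[:, None] * X).mean(axis=0)        # E[y * x]
    T -= np.einsum('i,jk->ijk', m, I)        # subtract x_i * delta_jk term
    T -= np.einsum('j,ik->ijk', m, I)        # subtract x_j * delta_ik term
    T -= np.einsum('k,ij->ijk', m, I)        # subtract x_k * delta_ij term
    return T

# Synthetic two-layer network: y = sum_h lambda_h * sigmoid(w_h . x)
rng = np.random.default_rng(0)
d, k, n = 10, 3, 20000                       # illustrative sizes
W = rng.standard_normal((k, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
lam = rng.uniform(0.5, 1.5, size=k)
X = rng.standard_normal((n, d))
y = (lam * (1.0 / (1.0 + np.exp(-X @ W.T)))).sum(axis=1)

T_hat = empirical_cross_moment(X, y)
# T_hat can then be handed to any CP decomposition routine (e.g. the
# power-iteration sketch under the next abstract) to recover the rows
# of W up to sign and scale.
```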

    Guaranteed Non-Orthogonal Tensor Decomposition via Alternating Rank-1 Updates

    In this paper, we provide local and global convergence guarantees for recovering CP (CANDECOMP/PARAFAC) tensor decompositions. The main step of the proposed algorithm is a simple alternating rank-1 update, which is the alternating version of the tensor power iteration adapted for asymmetric tensors. Local convergence guarantees are established for third-order tensors of rank $k$ in $d$ dimensions, when $k = o(d^{1.5})$ and the tensor components are incoherent. Thus, we can recover overcomplete tensor decompositions. We also strengthen the results to global convergence guarantees under the stricter rank condition $k \le \beta d$ (for an arbitrary constant $\beta > 1$) through a simple initialization procedure in which the algorithm is initialized with the top singular vectors of random tensor slices. Furthermore, approximate local convergence guarantees for $p$-th order tensors are also provided under the rank condition $k = o(d^{p/2})$. The guarantees also include a tight perturbation analysis given a noisy tensor. Comment: We have added an additional sub-algorithm to remove the (approximate) residual error left after the tensor power iteration.
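
    As a rough illustration of the main step, the following is a minimal sketch of an alternating rank-1 update loop on a synthetic third-order tensor. It uses plain random initialization rather than the slice-based initialization behind the global guarantees, and all names, sizes, and iteration counts are illustrative.

```python
# Sketch only: alternating rank-1 updates (asymmetric tensor power
# iteration) to recover one CP component of a third-order tensor.
import numpy as np

def alternating_rank1(T, iters=100, seed=0):
    """Recover one rank-1 component (u, v, w, weight) of a 3rd-order tensor."""
    rng = np.random.default_rng(seed)
    d1, d2, d3 = T.shape
    u = rng.standard_normal(d1); u /= np.linalg.norm(u)
    v = rng.standard_normal(d2); v /= np.linalg.norm(v)
    w = rng.standard_normal(d3); w /= np.linalg.norm(w)
    for _ in range(iters):
        u = np.einsum('ijk,j,k->i', T, v, w); u /= np.linalg.norm(u)
        v = np.einsum('ijk,i,k->j', T, u, w); v /= np.linalg.norm(v)
        w = np.einsum('ijk,i,j->k', T, u, v); w /= np.linalg.norm(w)
    weight = np.einsum('ijk,i,j,k->', T, u, v, w)
    return u, v, w, weight

# Example: a synthetic rank-2 asymmetric tensor with random (incoherent) components.
rng = np.random.default_rng(1)
d, k = 20, 2
A = rng.standard_normal((d, k)); A /= np.linalg.norm(A, axis=0)
B = rng.standard_normal((d, k)); B /= np.linalg.norm(B, axis=0)
C = rng.standard_normal((d, k)); C /= np.linalg.norm(C, axis=0)
T = np.einsum('ir,jr,kr->ijk', A, B, C)
u, v, w, lam = alternating_rank1(T)
# Under incoherence, (u, v, w) typically converges to one of the component
# triples up to sign; the recovered rank-1 term can be deflated from T
# before re-running to find the remaining components.
```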

    Polynomial-time Tensor Decompositions with Sum-of-Squares

    We give new algorithms based on the sum-of-squares method for tensor decomposition. Our results improve the best known running times from quasi-polynomial to polynomial for several problems, including decomposing random overcomplete 3-tensors and learning overcomplete dictionaries with constant relative sparsity. We also give the first robust analysis for decomposing overcomplete 4-tensors in the smoothed analysis model. A key ingredient of our analysis is to establish small spectral gaps in moment matrices derived from solutions to sum-of-squares relaxations. To enable this analysis, we augment sum-of-squares relaxations with spectral analogs of maximum entropy constraints. Comment: to appear in FOCS 2016.

    Dictionary Learning and Tensor Decomposition via the Sum-of-Squares Method

    We give a new approach to the dictionary learning (also known as "sparse coding") problem of recovering an unknown $n \times m$ matrix $A$ (for $m \geq n$) from examples of the form $y = Ax + e$, where $x$ is a random vector in $\mathbb{R}^m$ with at most $\tau m$ nonzero coordinates, and $e$ is a random noise vector in $\mathbb{R}^n$ with bounded magnitude. For the case $m = O(n)$, our algorithm recovers every column of $A$ within arbitrarily good constant accuracy in time $m^{O(\log m / \log(\tau^{-1}))}$, in particular achieving polynomial time if $\tau = m^{-\delta}$ for any $\delta > 0$, and time $m^{O(\log m)}$ if $\tau$ is (a sufficiently small) constant. Prior algorithms with comparable assumptions on the distribution required the vector $x$ to be much sparser (at most $\sqrt{n}$ nonzero coordinates), and there were intrinsic barriers preventing these algorithms from applying for denser $x$. We achieve this by designing an algorithm for noisy tensor decomposition that can recover, under quite general conditions, an approximate rank-one decomposition of a tensor $T$, given access to a tensor $T'$ that is $\tau$-close to $T$ in the spectral norm (when considered as a matrix). To our knowledge, this is the first algorithm for tensor decomposition that works in the constant spectral-norm noise regime, where there is no guarantee that the local optima of $T$ and $T'$ have similar structures. Our algorithm is based on a novel approach to using and analyzing the Sum of Squares semidefinite programming hierarchy (Parrilo 2000, Lasserre 2001), and it can be viewed as an indication of the utility of this very general and powerful tool for unsupervised learning problems.
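
    To make the model concrete, here is a minimal sketch that generates samples from the $y = Ax + e$ model described above, with at most $\tau m$ nonzero coordinates in $x$ and bounded-magnitude noise. The distributional choices and parameter values are illustrative assumptions, not the paper's exact setting.

```python
# Sketch only: data generation for the sparse-coding / dictionary
# learning model y = A x + e with an n x m dictionary A (m >= n).
import numpy as np

rng = np.random.default_rng(0)
n, m, tau = 64, 128, 0.05           # illustrative sizes and sparsity level
A = rng.standard_normal((n, m))
A /= np.linalg.norm(A, axis=0)      # unit-norm dictionary columns

def sample(num, noise_mag=0.01):
    """Draw `num` examples y = A x + e from the sparse-coding model."""
    s = max(1, int(tau * m))                            # at most tau*m nonzeros
    Y = np.empty((num, n))
    for t in range(num):
        x = np.zeros(m)
        support = rng.choice(m, size=s, replace=False)
        x[support] = rng.choice([-1.0, 1.0], size=s)    # e.g. Rademacher values
        e = rng.uniform(-noise_mag, noise_mag, size=n)  # bounded-magnitude noise
        Y[t] = A @ x + e
    return Y

Y = sample(1000)
# Recovering the columns of A from Y (up to sign and permutation) is the
# dictionary learning task; the paper reduces it to a noisy tensor
# decomposition solved via the sum-of-squares hierarchy.
```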

    Convolutional Dictionary Learning through Tensor Factorization

    Tensor methods have emerged as a powerful paradigm for consistent learning of many latent variable models, such as topic models, independent component analysis, and dictionary learning. Model parameters are estimated via CP decomposition of the observed higher-order input moments. However, in many domains, additional invariances such as shift invariance exist, enforced via models such as convolutional dictionary learning. In this paper, we develop novel tensor decomposition algorithms for parameter estimation of convolutional models. Our algorithm is based on the popular alternating least squares method, but with efficient projections onto the space of stacked circulant matrices. Our method is embarrassingly parallel and consists of simple operations such as fast Fourier transforms and matrix multiplications. Our algorithm converges to the dictionary much faster and more accurately than alternating minimization over filters and activation maps.
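
    As a rough illustration of the circulant structure this algorithm exploits, the sketch below shows a generic Frobenius-norm projection onto circulant matrices (wrapped-diagonal averaging) together with an FFT-based circulant matrix-vector product. It is not the authors' full algorithm, and all names are illustrative.

```python
# Sketch only: nearest-circulant projection and FFT-based circulant
# matrix-vector products, the basic operations behind the ALS scheme
# with stacked-circulant projections described above.
import numpy as np

def project_to_circulant(M):
    """First column c of the circulant matrix closest to M in Frobenius
    norm, obtained by averaging each wrapped diagonal of M."""
    n = M.shape[0]
    return np.array([np.mean([M[(i + k) % n, i] for i in range(n)])
                     for k in range(n)])

def circulant_matvec(c, x):
    """circ(c) @ x computed with FFTs: ifft(fft(c) * fft(x))."""
    return np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)).real

# Quick check against an explicit circulant matrix.
rng = np.random.default_rng(0)
n = 8
M = rng.standard_normal((n, n))
c = project_to_circulant(M)
C = np.column_stack([np.roll(c, j) for j in range(n)])   # explicit circ(c)
x = rng.standard_normal(n)
assert np.allclose(C @ x, circulant_matvec(c, x))
# Roughly, in the setting of the abstract, each ALS step solves a least
# squares problem and then projects the blocks of the solution onto the
# (stacked) circulant structure in this way, with products done via FFTs.
```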