
    Theory III: Dynamics and Generalization in Deep Networks

    The key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity -- such as an explicit regularization term -- in the training of deep networks for classification. We will show that a classical form of norm control -- though a hidden one -- is present in deep networks trained with gradient descent techniques on exponential-type losses. In particular, gradient descent induces a dynamics of the normalized weights which converges for $t \to \infty$ to an equilibrium corresponding to a minimum norm (or maximum margin) solution. For sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics converges to one of several margin maximizers, with the margin monotonically increasing towards a limit stationary point of the flow. In the usual case of stochastic gradient descent, most of the stationary points are likely to be convex minima corresponding to a constrained minimizer -- the network with normalized weights -- which corresponds to vanishing regularization. The solution has zero generalization gap, for fixed architecture, asymptotically for $N \to \infty$, where $N$ is the number of training examples. Our approach extends some of the original results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. We believe that the elusive complexity control we describe is responsible for the puzzling empirical finding of good predictive performance by deep networks, despite overparametrization.
    Comment: 47 pages, 11 figures. This replaces previous versions of Theory III that appeared on arXiv [arXiv:1806.11379, arXiv:1801.00173] or on the CBMM site. v5: changes throughout the paper to the presentation, tightening some of the statements.
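
    A minimal sketch of the claimed dynamics in the linear-model case that the paper extends to deep networks (our illustration, not the authors' code; data and hyperparameters are made up): gradient descent on an exponential loss over separable data, printing the margin of the normalized weights, which increases toward the max-margin value.

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, 5))
        y = np.sign(X @ rng.normal(size=5))        # linearly separable labels
        w = 0.01 * rng.normal(size=5)
        lr = 0.01
        for t in range(20001):
            m = y * (X @ w)                        # per-example margins
            grad = -(X * (y * np.exp(-m))[:, None]).sum(axis=0)
            w -= lr * grad                         # GD on sum_i exp(-y_i x_i . w)
            if t % 5000 == 0:
                w_hat = w / np.linalg.norm(w)      # normalized weights
                print(t, (y * (X @ w_hat)).min())  # normalized margin grows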

    Theoretical insights into the optimization landscape of over-parameterized shallow neural networks

    In this paper we study the problem of learning a shallow artificial neural network that best fits a training data set. We study this problem in the over-parameterized regime, where the number of observations is fewer than the number of parameters in the model. We show that with quadratic activations the optimization landscape of training such shallow neural networks has certain favorable characteristics that allow globally optimal models to be found efficiently using a variety of local search heuristics. This result holds for arbitrary training data of input/output pairs. For differentiable activation functions we also show that gradient descent, when suitably initialized, converges at a linear rate to a globally optimal model. This result focuses on a realizable model where the inputs are chosen i.i.d. from a Gaussian distribution and the labels are generated according to planted weight coefficients.
    Comment: Section 3 on numerical experiments is added. Theorems 2.1 and 2.2 are improved to apply to almost all input data (not just Gaussian inputs). Related work section is expanded. The paper is accepted for publication in IEEE Transactions on Information Theory (2018).
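
    A hedged reproduction of this setting (sizes, stepsize, and initialization are our own choices, not the paper's): gradient descent on a quadratic-activation network with a planted teacher, in the regime where observations are fewer than parameters.

        import numpy as np

        rng = np.random.default_rng(1)
        n, d, k = 80, 20, 30                   # n observations < d*k parameters
        X = rng.normal(size=(n, d))
        W_true = rng.normal(size=(k, d)) / np.sqrt(d)
        y = ((X @ W_true.T) ** 2).sum(axis=1)  # planted quadratic-activation labels

        W = 0.01 * rng.normal(size=(k, d))     # small random initialization
        lr = 1e-3
        for t in range(10001):
            Z = X @ W.T                        # (n, k) pre-activations
            r = (Z ** 2).sum(axis=1) - y       # residuals
            W -= lr * 2 * (Z * r[:, None]).T @ X / n   # grad of 0.5*mean(r^2)
            if t % 2000 == 0:
                print(t, 0.5 * np.mean(r ** 2))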

    Distributed coordination for nonsmooth convex optimization via saddle-point dynamics

    This paper considers continuous-time coordination algorithms for networks of agents that seek to collectively solve a general class of nonsmooth convex optimization problems with an inherent distributed structure. Our algorithm design builds on the characterization of the solutions of the nonsmooth convex program as saddle points of an augmented Lagrangian. We show that the associated saddle-point dynamics are asymptotically correct but, in general, not distributed because of the presence of a global penalty parameter. This motivates the design of a discontinuous saddle-point-like algorithm that enjoys the same convergence properties and is fully amenable to distributed implementation. Our convergence proofs rely on the identification of a novel global Lyapunov function for saddle-point dynamics. This novelty also allows us to identify mild convexity and regularity conditions on the objective function that guarantee the exponential convergence rate of the proposed algorithms for convex optimization problems subject to equality constraints. Various examples illustrate our discussion.
    Comment: 20 pages.
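
    For intuition, a simple Euler discretization (ours, with an illustrative smooth objective and equality constraints, not the paper's nonsmooth setting) of the augmented-Lagrangian saddle-point dynamics that the algorithm design starts from:

        import numpy as np

        rng = np.random.default_rng(2)
        A = rng.normal(size=(3, 6))
        b = rng.normal(size=3)
        rho, dt = 1.0, 0.01                    # penalty parameter and Euler step
        x, lam = np.zeros(6), np.zeros(3)
        for _ in range(20000):
            resid = A @ x - b
            # L(x, lam) = 0.5*||x||^2 + lam.(Ax - b) + (rho/2)*||Ax - b||^2
            x -= dt * (x + A.T @ lam + rho * A.T @ resid)   # xdot = -grad_x L
            lam += dt * resid                               # lamdot = +grad_lam L
        print("constraint violation:", np.linalg.norm(A @ x - b))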

    Adaptive norms for deep learning with regularized Newton methods

    We investigate the use of regularized Newton methods with adaptive norms for optimizing neural networks. This approach can be seen as a second-order counterpart of adaptive gradient methods, which we here show to be interpretable as first-order trust region methods with ellipsoidal constraints. In particular, we prove that the preconditioning matrix used in RMSProp and Adam satisfies the necessary conditions for provable convergence of second-order trust region methods with standard worst-case complexities on general non-convex objectives. Furthermore, we run experiments across different neural architectures and datasets to find that the ellipsoidal constraints consistently outperform their spherical counterpart both in terms of number of backpropagations and asymptotic loss value. Finally, we find comparable performance to state-of-the-art first-order methods in terms of backpropagations, but further advances in hardware are needed to render Newton methods competitive in terms of computational time.
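
    A toy sketch of the core idea (our reading, on a 2-D function rather than a network): the RMSProp accumulator defines a diagonal matrix D, and the second-order method takes regularized Newton steps s = -(H + sigma*D)^{-1} g, i.e. steps measured in the ellipsoidal norm induced by D. The safeguard loop is a crude stand-in for a proper trust-region update.

        import numpy as np

        def f(x): return 0.25 * (x ** 4).sum() - (x ** 2).sum()   # toy non-convex
        def grad(x): return x ** 3 - 2 * x
        def hess(x): return np.diag(3 * x ** 2 - 2)

        x = np.array([1.8, -0.4])
        v = np.zeros(2)
        beta, eps = 0.9, 1e-8
        for _ in range(50):
            g = grad(x)
            v = beta * v + (1 - beta) * g ** 2
            D = np.diag(np.sqrt(v) + eps)      # RMSProp-style adaptive norm
            sigma = 1.0
            while True:                        # crude trust-region safeguard
                M = hess(x) + sigma * D
                if np.all(np.linalg.eigvalsh(M) > 0):
                    s = np.linalg.solve(M, -g)
                    if f(x + s) <= f(x):
                        break
                sigma *= 2                     # shrink the ellipsoid
            x = x + s
        print(x, f(x))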

    Sinkhorn Algorithm for Lifted Assignment Problems

    Recently, Sinkhorn's algorithm was applied for approximately solving linear programs emerging from optimal transport very efficiently. This was accomplished by formulating a regularized version of the linear program as a Bregman projection problem onto the polytope of doubly-stochastic matrices, and then computing the projection using the efficient Sinkhorn algorithm, which is based on alternating closed-form Bregman projections onto the larger polytopes of row-stochastic and column-stochastic matrices. In this paper we suggest a generalization of this algorithm for solving a well-known lifted linear programming relaxation of the Quadratic Assignment Problem (QAP), known as the Johnson-Adams (JA) relaxation. First, an efficient algorithm for Bregman projection onto the JA polytope by alternating closed-form Bregman projections onto one-sided local polytopes is devised. The one-sided polytopes can be seen as a high-dimensional, generalized version of the row/column-stochastic polytopes. Second, a new method for solving the original linear programs using the Bregman projections onto the JA polytope is developed and shown to be more accurate and numerically stable than the standard approach of driving the regularizer to zero. The resulting algorithm is considerably more scalable than standard linear solvers and is able to solve significantly larger linear programs.
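
    The baseline the paper generalizes is the classical Sinkhorn iteration for the entropically regularized transport LP: alternating closed-form Bregman projections onto the row- and column-stochastic polytopes. A minimal version (cost matrix, marginals, and epsilon are illustrative):

        import numpy as np

        rng = np.random.default_rng(3)
        n = 50
        C = rng.random((n, n))                 # cost matrix
        r = np.full(n, 1 / n)                  # row marginals
        c = np.full(n, 1 / n)                  # column marginals
        eps = 0.05                             # entropic regularization
        K = np.exp(-C / eps)                   # Gibbs kernel
        u, v = np.ones(n), np.ones(n)
        for _ in range(500):
            u = r / (K @ v)                    # project onto row-stochastic polytope
            v = c / (K.T @ u)                  # project onto column-stochastic polytope
        P = u[:, None] * K * v[None, :]        # approximate transport plan
        print("row-marginal error:", np.abs(P.sum(axis=1) - r).max())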

    On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks

    Empirical risk minimization (ERM) is ubiquitous in machine learning and underlies most supervised learning methods. While there has been a large body of work on algorithms for various ERM problems, the exact computational complexity of ERM is still not understood. We address this issue for multiple popular ERM problems including kernel SVMs, kernel ridge regression, and training the final layer of a neural network. In particular, we give conditional hardness results for these problems based on complexity-theoretic assumptions such as the Strong Exponential Time Hypothesis. Under these assumptions, we show that there are no algorithms that solve the aforementioned ERM problems to high accuracy in sub-quadratic time. We also give similar hardness results for computing the gradient of the empirical loss, which is the main computational burden in many non-convex learning tasks.
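
    The quadratic barrier can be felt directly in code: even a single loss or gradient evaluation for a kernel method touches all of the roughly n^2 Gram-matrix entries. A toy Gaussian-kernel ridge example of our own, only to make the cost concrete:

        import numpy as np

        rng = np.random.default_rng(4)
        n, d = 300, 10
        X = rng.normal(size=(n, d))
        y = rng.normal(size=n)
        alpha = rng.normal(size=n)             # representer coefficients
        lam = 0.1

        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-sq)                        # n x n Gram matrix: Theta(n^2) work
        # gradient of (1/2n)*||K a - y||^2 + (lam/2)*a.K a -- again Theta(n^2)
        grad = K @ (K @ alpha - y) / n + lam * K @ alpha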

    Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization

    A very popular approach for solving stochastic optimization problems is the stochastic gradient descent method (SGD). Although the SGD iteration is computationally cheap and the practical performance of this method may be satisfactory under certain circumstances, there is recent evidence of its convergence difficulties and instability for inappropriate choices of parameters. To avoid these drawbacks naturally introduced by the SGD scheme, stochastic proximal point algorithms have been recently considered in the literature. We introduce a new variant of the stochastic proximal point method (SPP) for solving stochastic convex optimization problems subject to an (in)finite intersection of constraints satisfying a linear regularity type condition. For the newly introduced SPP scheme we prove new nonasymptotic convergence results. In particular, for convex and Lipschitz continuous objective functions, we prove nonasymptotic estimates for the rate of convergence in terms of the expected value function gap of order $\mathcal{O}(1/k^{1/2})$, where $k$ is the iteration counter. We also derive better nonasymptotic bounds for the rate of convergence in terms of the expected quadratic distance from the iterates to the optimal solution for smooth strongly convex objective functions, which in the best case is of order $\mathcal{O}(1/k)$. Since these convergence rates can be attained by our SPP algorithm only under some natural restrictions on the stepsize, we also introduce a restarting variant of the SPP method that overcomes these difficulties and derive the corresponding nonasymptotic convergence rates. Numerical evidence supports the effectiveness of our methods in real-world problems.
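
    A hedged sketch of one SPP-style iteration (our construction, not the paper's scheme): for least squares the prox of a single sampled term has a closed form, and the constraint handling is mimicked here by projecting onto one sampled half-space per iteration.

        import numpy as np

        rng = np.random.default_rng(5)
        m, d = 200, 10
        A = rng.normal(size=(m, d))
        x_true = rng.normal(size=d)
        b = A @ x_true                         # consistent least-squares data
        G = rng.normal(size=(20, d))
        h = G @ x_true                         # constraints G x <= h hold at x_true
        x = np.zeros(d)
        for k in range(1, 20001):
            gamma = 1.0 / np.sqrt(k)           # vanishing stepsize
            i = rng.integers(m)
            a, bi = A[i], b[i]
            # exact prox of f_i(x) = 0.5*(a.x - b_i)^2 with parameter gamma
            x = x - gamma * a * (a @ x - bi) / (1 + gamma * (a @ a))
            j = rng.integers(G.shape[0])
            viol = G[j] @ x - h[j]
            if viol > 0:                       # project onto {x : G[j].x <= h[j]}
                x -= viol * G[j] / (G[j] @ G[j])
        print("distance to solution:", np.linalg.norm(x - x_true))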

    Fine-grained Optimization of Deep Neural Networks

    In recent studies, several asymptotic upper bounds on the generalization errors of deep neural networks (DNNs) have been theoretically derived. These bounds are functions of several norms of the weights of the DNNs, such as the Frobenius and spectral norms, and they are computed for weights grouped according to either input or output channels of the DNNs. In this work, we conjecture that if we can impose multiple constraints on the weights of DNNs to upper bound the norms of the weights, and train the DNNs with these weights, then we can attain empirical generalization errors closer to the derived theoretical bounds, and improve the accuracy of the DNNs. To this end, we pose two problems. First, we aim to obtain weights whose different norms are all upper bounded by a constant number, e.g. 1.0. To achieve these bounds, we propose a two-stage renormalization procedure: (i) normalization of the weights according to the different norms used in the bounds, and (ii) reparameterization of the normalized weights to set a constant and finite upper bound on their norms. In the second problem, we consider training DNNs with these renormalized weights. To this end, we first propose a strategy to construct joint spaces (manifolds) of weights according to different constraints in DNNs. Next, we propose a fine-grained SGD algorithm (FG-SGD) for optimization on the weight manifolds to train DNNs with assurance of convergence to minima. Experimental results show that the image classification accuracy of baseline DNNs can be boosted using FG-SGD on collections of manifolds identified by multiple constraints.
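
    One plausible reading of the two-stage procedure, for a single weight matrix (our guess at the mechanics, not the authors' implementation; theta is a hypothetical trainable scale):

        import numpy as np

        def renormalize(W, theta=0.0):
            # stage (i): normalize by the Frobenius norm; since ||W||_F >=
            # ||W||_2, both norms of W_hat are then <= 1
            W_hat = W / np.linalg.norm(W, 'fro')
            # stage (ii): reparameterize with a scale squashed into (0, 1)
            g = 1.0 / (1.0 + np.exp(-theta))
            return g * W_hat                   # every norm stays <= 1.0

        W = np.random.default_rng(6).normal(size=(64, 32))
        W_r = renormalize(W)
        print(np.linalg.norm(W_r, 'fro'), np.linalg.norm(W_r, 2))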

    Distributed model predictive control for continuous-time nonlinear systems based on suboptimal ADMM

    The paper presents a distributed model predictive control (DMPC) scheme for continuous-time nonlinear systems based on the alternating direction method of multipliers (ADMM). A stopping criterion in the ADMM algorithm limits the iterations, and therefore the required communication effort during the distributed MPC solution, at the expense of a suboptimal solution. Stability results are presented for the suboptimal DMPC scheme under two different ADMM convergence assumptions. In particular, it is shown that the required iterations in each ADMM step are bounded, which is also confirmed in simulation studies.
    Comment: 26 pages, 7 figures.
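
    The suboptimal-ADMM idea can be illustrated on a toy consensus problem (our example, unrelated to the paper's MPC formulation): a loose residual-based stopping criterion ends the ADMM loop after few iterations, trading optimality for communication.

        import numpy as np

        c = [np.array([1.0, 0.0]), np.array([0.0, 2.0])]   # agents' local targets
        rho, tol = 1.0, 1e-2                   # loose tol -> few iterations
        x = [np.zeros(2), np.zeros(2)]         # local copies
        u = [np.zeros(2), np.zeros(2)]         # scaled duals
        z = np.zeros(2)                        # consensus variable
        for it in range(100):
            # local prox: argmin 0.5*||x - c_i||^2 + (rho/2)*||x - z + u_i||^2
            x = [(c[i] + rho * (z - u[i])) / (1 + rho) for i in range(2)]
            z_old, z = z, sum(x[i] + u[i] for i in range(2)) / 2
            u = [u[i] + x[i] - z for i in range(2)]
            r = max(np.linalg.norm(x[i] - z) for i in range(2))   # primal residual
            s = rho * np.linalg.norm(z - z_old)                   # dual residual
            if r < tol and s < tol:            # stopping criterion bounds iterations
                break
        print(it, z)                           # suboptimal consensus near (0.5, 1.0)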

    Gradient Dynamic Approach to the Tensor Complementarity Problem

    A nonlinear gradient dynamic approach for solving the tensor complementarity problem (TCP) is presented. Theoretical analysis shows that each of the defined dynamical system models ensures convergence. The computer simulation results further substantiate that the considered dynamical system can solve the TCP.
    Comment: 18 pages. arXiv admin note: text overlap with arXiv:1804.00406 by other authors.
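
    A heavily hedged illustration of a dynamical-system approach to complementarity (entirely our construction, not the paper's models, and convergence requires structure on the tensor): a projected dynamics whose equilibria satisfy x = max(0, x - F(x)), the natural-residual condition, with a cubic tensor map F.

        import numpy as np

        rng = np.random.default_rng(7)
        n = 4
        A = rng.random((n, n, n))              # third-order tensor, positive entries
        q = -rng.random(n)

        def F(x):                              # F(x)_i = (A x x)_i + q_i
            return np.einsum('ijk,j,k->i', A, x, x) + q

        x = np.full(n, 0.5)
        dt = 1e-2
        for _ in range(50000):
            x += dt * (np.maximum(0.0, x - F(x)) - x)   # projected dynamics
        print("natural residual:", np.linalg.norm(x - np.maximum(0.0, x - F(x))))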