Theory III: Dynamics and Generalization in Deep Networks
The key to generalization is controlling the complexity of the network.
However, there is no obvious control of complexity -- such as an explicit
regularization term -- in the training of deep networks for classification. We
will show that a classical form of norm control -- but kind of hidden -- is
present in deep networks trained with gradient descent techniques on
exponential-type losses. In particular, gradient descent induces a dynamics of
the normalized weights which converge for $t \to \infty$ to an equilibrium
which corresponds to a minimum norm (or maximum margin) solution. For
sufficiently large but finite $\rho$ -- and thus finite $t$ -- the dynamics
converges to one of several margin maximizers, with the margin monotonically
increasing towards a limit stationary point of the flow. In the usual case of
stochastic gradient descent, most of the stationary points are likely to be
convex minima corresponding to a constrained minimizer -- the network with
normalized weights -- which corresponds to vanishing regularization. The
solution has zero generalization gap, for fixed architecture, asymptotically
for $n \to \infty$, where $n$ is the number of training examples. Our approach
extends some of the original results of Srebro from linear networks to deep
networks and provides a new perspective on the implicit bias of gradient
descent. We believe that the elusive complexity control we describe is
responsible for the puzzling empirical finding of good predictive performance
by deep networks, despite overparametrization.
Comment: 47 pages, 11 figures. This replaces previous versions of Theory III
that appeared on arXiv [arXiv:1806.11379, arXiv:1801.00173] or on the CBMM
site. v5: changes throughout the paper to the presentation and tightening of
some of the statements.
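The implicit norm control described in this abstract can be seen in the simplest exponential-type setting: a linear model trained with gradient descent on separable data under the logistic loss, where the margin of the normalized weights grows toward the maximum margin. A minimal numerical sketch (data, initialization, and step size are our own illustration, not from the paper):

```python
import numpy as np

# Two linearly separable points; the max-margin direction is the bisector (1,1)/sqrt(2).
X = np.array([[2.0, 1.0], [-1.0, -2.0]])
y = np.array([1.0, -1.0])

w = np.array([1.0, 0.0])          # arbitrary initialization
lr = 0.1
margins = []
for step in range(5000):
    scores = y * (X @ w)
    # Gradient of the logistic (exponential-type) loss sum_i log(1 + exp(-y_i w.x_i))
    grad = -(X * (y / (1.0 + np.exp(scores)))[:, None]).sum(axis=0)
    w -= lr * grad
    # Margin of the *normalized* weights: min_i y_i <w/||w||, x_i>
    margins.append((y * (X @ w)).min() / np.linalg.norm(w))

print(margins[0], margins[-1])
```

Although no explicit regularizer is present, the normalized margin increases during training toward the max-margin value $3/\sqrt{2} \approx 2.12$ for this instance.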
Theoretical insights into the optimization landscape of over-parameterized shallow neural networks
In this paper we study the problem of learning a shallow artificial neural
network that best fits a training data set. We study this problem in the
over-parameterized regime, where the number of observations is fewer than the
number of parameters in the model. We show that with quadratic activations the
optimization landscape of training such shallow neural networks has certain
favorable characteristics that allow globally optimal models to be found
efficiently using a variety of local search heuristics. This result holds for
an arbitrary training data set of input/output pairs. For differentiable activation
functions we also show that gradient descent, when suitably initialized,
converges at a linear rate to a globally optimal model. This result focuses on
a realizable model where the inputs are chosen i.i.d. from a Gaussian
distribution and the labels are generated according to planted weight
coefficients.
Comment: Section 3 on numerical experiments is added. Theorems 2.1 and 2.2 are
improved to apply to almost all input data (not just Gaussian inputs).
Related work section is expanded. The paper is accepted for publication in
IEEE Transactions on Information Theory (2018).
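The realizable setting in this abstract is easy to reproduce at toy scale: gradient descent on a one-hidden-layer network with quadratic activations, trained on Gaussian inputs labeled by planted weights, with more parameters than observations. A minimal sketch (sizes, seed, and step size are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 2, 3, 4                        # k*d = 6 parameters > n = 4 observations
X = rng.normal(size=(n, d))              # i.i.d. Gaussian inputs (realizable model)
W_star = np.array([[1.0, 0.5], [-0.4, 0.8]])   # planted weight coefficients
y = ((X @ W_star.T) ** 2).sum(axis=1)    # labels from the planted quadratic network

W = 0.3 * rng.normal(size=(k, d))        # small random initialization
lr = 0.005
init_loss = ((((X @ W.T) ** 2).sum(axis=1) - y) ** 2).mean()
for _ in range(40000):
    resid = ((X @ W.T) ** 2).sum(axis=1) - y
    # Gradient of the mean squared error of f(x) = sum_j (w_j . x)^2
    grad = (4.0 / n) * (W @ X.T * resid) @ X
    W -= lr * grad
final_loss = ((((X @ W.T) ** 2).sum(axis=1) - y) ** 2).mean()
print(init_loss, final_loss)
```

Plain gradient descent from a random start drives the training loss to (numerically) zero, consistent with the benign-landscape result for quadratic activations.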
Distributed coordination for nonsmooth convex optimization via saddle-point dynamics
This paper considers continuous-time coordination algorithms for networks of
agents that seek to collectively solve a general class of nonsmooth convex
optimization problems with an inherent distributed structure. Our algorithm
design builds on the characterization of the solutions of the nonsmooth convex
program as saddle points of an augmented Lagrangian. We show that the
associated saddle-point dynamics are asymptotically correct but, in general,
not distributed because of the presence of a global penalty parameter. This
motivates the design of a discontinuous saddle-point-like algorithm that enjoys
the same convergence properties and is fully amenable to distributed
implementation. Our convergence proofs rely on the identification of a novel
global Lyapunov function for saddle-point dynamics. This novelty also allows us
to identify mild convexity and regularity conditions on the objective function
that guarantee the exponential convergence rate of the proposed algorithms for
convex optimization problems subject to equality constraints. Various examples
illustrate our discussion.
Comment: 20 pages.
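In the equality-constrained case, the saddle-point dynamics described here amount to gradient descent in the primal variable and gradient ascent in the dual variable of the Lagrangian. A small forward-Euler sketch on a toy instance of our own choosing:

```python
import numpy as np

# min 0.5*||x - c||^2  subject to  sum(x) = 1, via saddle-point dynamics on
# the Lagrangian L(x, lam) = 0.5*||x - c||^2 + lam * (A @ x - b).
c = np.array([1.0, 2.0, 3.0])
A = np.ones((1, 3))
b = np.array([1.0])

x = np.zeros(3)
lam = np.zeros(1)
h = 0.05                                  # Euler step for the continuous-time flow
for _ in range(4000):
    x_dot = -(x - c) - A.T @ lam          # descent in the primal variable
    lam_dot = A @ x - b                   # ascent in the dual variable
    x, lam = x + h * x_dot, lam + h * lam_dot

print(x)  # KKT solution: lam = 5/3, x = c - 5/3 = (-2/3, 1/3, 4/3)
```

Because the objective here is strongly convex, the flow converges exponentially to the saddle point, mirroring the exponential-rate conditions identified in the paper.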
Adaptive norms for deep learning with regularized Newton methods
We investigate the use of regularized Newton methods with adaptive norms for
optimizing neural networks. This approach can be seen as a second-order
counterpart of adaptive gradient methods, which we here show to be
interpretable as first-order trust region methods with ellipsoidal constraints.
In particular, we prove that the preconditioning matrix used in RMSProp and
Adam satisfies the necessary conditions for provable convergence of
second-order trust region methods with standard worst-case complexities on
general non-convex objectives. Furthermore, we run experiments across different
neural architectures and datasets to find that the ellipsoidal constraints
consistently outperform their spherical counterpart, both in terms of number of
backpropagations and asymptotic loss value. Finally, we find comparable
performance to state-of-the-art first-order methods in terms of
backpropagations, but further advances in hardware are needed to render Newton
methods competitive in terms of computational time.
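The first-order view mentioned in this abstract can be made concrete: the diagonal preconditioner built from squared gradients defines the ellipsoid within which each step is trusted. A toy sketch of an RMSProp-style update on an ill-conditioned quadratic (hyperparameters are illustrative):

```python
import numpy as np

def f_grad(x):
    return np.array([100.0 * x[0], x[1]])   # gradient of 0.5*(100*x1^2 + x2^2)

x = np.array([1.0, 1.0])
v = np.zeros(2)                              # running second-moment estimate
beta, eps = 0.9, 1e-8
for t in range(2000):
    g = f_grad(x)
    v = beta * v + (1.0 - beta) * g ** 2
    lr = 0.05 / np.sqrt(t + 1.0)             # decaying step size
    # Up to scaling, this step minimizes g.s over the ellipsoid
    # s^T diag(sqrt(v)) s <= Delta^2 -- the first-order trust-region view.
    x = x - lr * g / (np.sqrt(v) + eps)

loss = 0.5 * (100.0 * x[0] ** 2 + x[1] ** 2)
print(x, loss)
```

The per-coordinate scaling by `sqrt(v)` adapts the trust-region shape to the curvature seen along each axis, which is exactly what a spherical constraint cannot do.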
Sinkhorn Algorithm for Lifted Assignment Problems
Recently, Sinkhorn's algorithm was applied for approximately solving linear
programs emerging from optimal transport very efficiently. This was
accomplished by formulating a regularized version of the linear program as a
Bregman projection problem onto the polytope of doubly-stochastic matrices, and
then computing the projection using the efficient Sinkhorn algorithm, which is
based on alternating closed-form Bregman projections on the larger polytopes of
row-stochastic and column-stochastic matrices. In this paper we suggest a
generalization of this algorithm for solving a well-known lifted linear
programming relaxation of the Quadratic Assignment Problem (QAP), known as the
Johnson-Adams (JA) relaxation. First, an efficient algorithm for Bregman
projection onto the JA polytope by alternating closed-form Bregman projections
onto one-sided local polytopes is devised. The one-sided polytopes can be seen
as a high-dimensional, generalized version of the row/column-stochastic
polytopes. Second, a new method for solving the original linear programs using
the Bregman projections onto the JA polytope is developed and shown to be more
accurate and numerically stable than the standard approach of driving the
regularizer to zero. The resulting algorithm is considerably more scalable than
standard linear solvers and is able to solve significantly larger linear
programs.
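The building block this paper generalizes is the classical Sinkhorn iteration itself: alternating closed-form Bregman projections onto the row-stochastic and column-stochastic polytopes. A minimal sketch for entropy-regularized optimal transport (instance and regularization strength are our own):

```python
import numpy as np

def sinkhorn(C, r, c, eps=0.5, iters=200):
    """Sinkhorn iterations for the entropy-regularized OT linear program
    min <C, P> + eps*KL(P) subject to P having marginals r and c."""
    K = np.exp(-C / eps)                # Gibbs kernel
    u = np.ones_like(r)
    v = np.ones_like(c)
    for _ in range(iters):
        u = r / (K @ v)                 # closed-form projection onto row constraints
        v = c / (K.T @ u)               # closed-form projection onto column constraints
    return u[:, None] * K * v[None, :]  # approximately doubly-stochastic plan

rng = np.random.default_rng(1)
C = rng.random((4, 4))                  # random cost matrix
r = np.full(4, 0.25)
c = np.full(4, 0.25)
P = sinkhorn(C, r, c)
print(P.sum(axis=1), P.sum(axis=0))
```

The paper's contribution replaces the row/column-stochastic polytopes with the one-sided local polytopes of the JA relaxation while keeping each projection in closed form.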
On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks
Empirical risk minimization (ERM) is ubiquitous in machine learning and
underlies most supervised learning methods. While there has been a large body
of work on algorithms for various ERM problems, the exact computational
complexity of ERM is still not understood. We address this issue for multiple
popular ERM problems including kernel SVMs, kernel ridge regression, and
training the final layer of a neural network. In particular, we give
conditional hardness results for these problems based on complexity-theoretic
assumptions such as the Strong Exponential Time Hypothesis. Under these
assumptions, we show that there are no algorithms that solve the aforementioned
ERM problems to high accuracy in sub-quadratic time. We also give similar
hardness results for computing the gradient of the empirical loss, which is the
main computational burden in many non-convex learning tasks.
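The quadratic bottleneck is easy to see in a concrete instance such as kernel ridge regression (the function names below are our own illustration): every gradient evaluation involves $n \times n$ matrix-vector products, i.e. $\Theta(n^2)$ work, which is what the conditional hardness results say cannot be improved to high accuracy in general.

```python
import numpy as np

def gauss_kernel(X):
    """n x n Gaussian kernel matrix: already Theta(n^2) time and space to form."""
    sq = (X ** 2).sum(axis=1)
    return np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T))

def krr_objective(K, y, a, lam=0.1):
    r = K @ a - y
    return 0.5 * r @ r + 0.5 * lam * a @ (K @ a)

def krr_gradient(K, y, a, lam=0.1):
    # Each term is an n x n matrix-vector product, so one gradient
    # evaluation costs Theta(n^2) -- the empirical-loss gradient the
    # hardness results address.
    return K @ (K @ a - y) + lam * (K @ a)
```

A finite-difference check confirms the gradient formula; the point is that the cost, not the correctness, is the obstruction.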
Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization
A very popular approach for solving stochastic optimization problems is the
stochastic gradient descent method (SGD). Although the SGD iteration is
computationally cheap and the practical performance of this method may be
satisfactory under certain circumstances, there is recent evidence of its
convergence difficulties and instability under inappropriate parameter choices.
To avoid these drawbacks naturally introduced by the SGD scheme, the stochastic
proximal point algorithms have been recently considered in the literature. We
introduce a new variant of the stochastic proximal point method (SPP) for
solving stochastic convex optimization problems subject to an (in)finite
intersection of constraints satisfying a linear regularity type condition. For
the newly introduced SPP scheme we prove new nonasymptotic convergence results.
In particular, for convex and Lipschitz continuous objective functions, we
prove nonasymptotic estimates for the rate of convergence in terms of the
expected value function gap of order $\mathcal{O}(1/\sqrt{k})$, where $k$ is the
iteration counter. We also derive better nonasymptotic bounds for the rate of
convergence in terms of expected quadratic distance from the iterates to the
optimal solution for smooth strongly convex objective functions, which in the
best case is of order $\mathcal{O}(1/k)$. Since these convergence rates can be
attained by our SPP algorithm only under some natural restrictions on the
stepsize, we also introduce a restarting variant of the SPP method that overcomes
these difficulties and derive the corresponding nonasymptotic convergence
rates. Numerical evidence supports the effectiveness of our methods in
real-world problems.
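For a single least-squares component the proximal subproblem has a closed form, which makes the flavor of an SPP iteration easy to sketch (the instance and stepsize below are illustrative, not the paper's scheme):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                       # consistent system: every f_i is minimized at x_true

x = np.zeros(3)
mu = 1.0                             # proximal stepsize
for _ in range(3000):
    i = rng.integers(20)             # sample one component f_i(x) = 0.5*(a_i.x - b_i)^2
    a = A[i]
    # Closed-form proximal step: argmin_z f_i(z) + ||z - x||^2 / (2*mu),
    # obtained via the Sherman-Morrison formula.
    x = x - mu * a * (a @ x - b[i]) / (1.0 + mu * a @ a)

print(np.linalg.norm(x - x_true))
```

Unlike an SGD step, the proximal step is implicitly damped by the factor $1/(1 + \mu\|a_i\|^2)$, which is the source of the stability advantages the abstract mentions.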
Fine-grained Optimization of Deep Neural Networks
In recent studies, several asymptotic upper bounds on generalization errors
of deep neural networks (DNNs) have been theoretically derived. These bounds are
functions of several norms of weights of the DNNs, such as the Frobenius and
spectral norms, and they are computed for weights grouped according to either
input or output channels of the DNNs. In this work, we conjecture that if we
can impose multiple constraints on weights of DNNs to upper bound the norms of
the weights, and train the DNNs with these weights, then we can attain
empirical generalization errors closer to the derived theoretical bounds, and
improve accuracy of the DNNs.
To this end, we pose two problems. First, we aim to obtain weights whose
different norms are all upper bounded by a constant number, e.g. 1.0. To
achieve these bounds, we propose a two-stage renormalization procedure: (i)
normalization of weights according to different norms used in the bounds, and
(ii) reparameterization of the normalized weights to set a constant and finite
upper bound of their norms. In the second problem, we consider training DNNs
with these renormalized weights. To this end, we first propose a strategy to
construct joint spaces (manifolds) of weights according to different
constraints in DNNs. Next, we propose a fine-grained SGD algorithm (FG-SGD) for
optimization on the weight manifolds to train DNNs with assurance of
convergence to minima. Experimental results show that image classification
accuracy of baseline DNNs can be boosted using FG-SGD on collections of
manifolds identified by multiple constraints.
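The two-stage idea for a single weight matrix can be sketched as follows (our own illustration of the renormalization, with the bound fixed to 1.0 as in the abstract; the paper's actual procedure operates on joint weight manifolds during training):

```python
import numpy as np

def renormalize(W, bound=1.0):
    """Two-stage renormalization sketch:
    (i) normalize weights grouped by output channel (rows) to unit l2 norm;
    (ii) rescale so the spectral norm is also capped by a constant bound."""
    W = W / np.maximum(np.linalg.norm(W, axis=1, keepdims=True), 1e-12)
    s = np.linalg.norm(W, 2)              # largest singular value
    return W * min(1.0, bound / s)

rng = np.random.default_rng(4)
W = 3.0 * rng.normal(size=(6, 10))        # e.g. 6 output channels, 10 inputs
R = renormalize(W)
print(np.linalg.norm(R, 2), np.linalg.norm(R, axis=1).max())
```

After the two stages, both the per-channel norms and the spectral norm are bounded by the same constant, so the norm-based generalization bounds evaluate to controlled values.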
Distributed model predictive control for continuous-time nonlinear systems based on suboptimal ADMM
The paper presents a distributed model predictive control (DMPC) scheme for
continuous-time nonlinear systems based on the alternating direction method of
multipliers (ADMM). A stopping criterion in the ADMM algorithm limits the
iterations and therefore the required communication effort during the
distributed MPC solution at the expense of a suboptimal solution. Stability
results are presented for the suboptimal DMPC scheme under two different ADMM
convergence assumptions. In particular, it is shown that the required
iterations in each ADMM step are bounded, which is also confirmed in simulation
studies.
Comment: 26 pages, 7 figures.
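The ADMM machinery with a residual-based stopping criterion can be sketched on the simplest consensus problem (a scalar two-agent instance of our own; the paper treats continuous-time nonlinear dynamics):

```python
import numpy as np

# Two agents minimize f1(x) + f2(x) with f_i(x) = 0.5*(x - a_i)^2 under a
# consensus constraint; iterations stop once the ADMM residuals fall below
# tol, bounding the communication effort at the price of suboptimality.
a = np.array([1.0, 5.0])         # local data held by the two agents
rho, tol = 1.0, 1e-8
x = np.zeros(2)                   # local copies of the decision variable
u = np.zeros(2)                   # scaled dual variables
z = 0.0                           # consensus variable
iters = 0
for k in range(1000):
    x = (a + rho * (z - u)) / (1.0 + rho)   # local (distributed) x-updates
    z_old, z = z, np.mean(x + u)            # consensus (averaging) step
    u = u + x - z                           # dual updates
    iters = k + 1
    primal = np.linalg.norm(x - z)          # primal residual
    dual = rho * abs(z - z_old)             # dual residual
    if max(primal, dual) < tol:             # stopping criterion
        break

print(z, iters)   # consensus value approaches (a1 + a2) / 2 = 3
```

The stopping criterion caps the number of inner iterations, and hence the communication rounds, in exchange for an approximately optimal consensus value.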
Gradient Dynamic Approach to the Tensor Complementarity Problem
A nonlinear gradient dynamic approach for solving the tensor complementarity
problem (TCP) is presented. Theoretical analysis shows that each of the defined
dynamical system models ensures the convergence performance. The computer
simulation results further substantiate that the considered dynamical system
can solve the tensor complementarity problem (TCP).Comment: 18pages. arXiv admin note: text overlap with arXiv:1804.00406 by
other author
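For order $m = 2$ the TCP reduces to the classical linear complementarity problem, where the flavor of such dynamical-system methods is easy to sketch (our own toy instance and dynamics; the paper treats general tensors):

```python
import numpy as np

# LCP(A, q): find x >= 0 with A x + q >= 0 and x.(A x + q) = 0, i.e. the
# m = 2 special case of the TCP. We integrate the projected dynamical system
#   dx/dt = P_+(x - gamma*(A x + q)) - x
# by forward Euler; its equilibria are exactly the LCP solutions.
A = np.array([[2.0, 0.0], [0.0, 3.0]])
q = np.array([-2.0, 1.0])
gamma, h = 0.2, 0.2

x = np.array([3.0, 3.0])
for _ in range(500):
    x = x + h * (np.maximum(x - gamma * (A @ x + q), 0.0) - x)

w = A @ x + q
print(x, x @ w)   # solution is x = (1, 0), with complementarity x.w = 0
```

With `A` positive definite, the discretized flow is a contraction, so the trajectory converges to the unique solution from any starting point, illustrating the convergence behavior the abstract reports.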