17,895 research outputs found
Generalization Error Bounds of Gradient Descent for Learning Over-parameterized Deep ReLU Networks
Empirical studies show that gradient-based methods can learn deep neural
networks (DNNs) with very good generalization performance in the
over-parameterization regime, where DNNs can easily fit a random labeling of
the training data. Very recently, a line of work explains in theory that with
over-parameterization and proper random initialization, gradient-based methods
can find the global minima of the training loss for DNNs. However, existing
generalization error bounds are unable to explain the good generalization
performance of over-parameterized DNNs. The major limitation of most existing
generalization bounds is that they are based on uniform convergence and are
independent of the training algorithm. In this work, we derive an
algorithm-dependent generalization error bound for deep ReLU networks, and show
that under certain assumptions on the data distribution, gradient descent (GD)
with proper random initialization is able to train a sufficiently
over-parameterized DNN to achieve arbitrarily small generalization error. Our
work sheds light on explaining the good generalization performance of
over-parameterized deep neural networks.Comment: 27 pages. This version simplifies the proof and improves the
presentation in Version 3. In AAAI 202
Hybrid Deterministic-Stochastic Methods for Data Fitting
Many structured data-fitting applications require the solution of an
optimization problem involving a sum over a potentially large number of
measurements. Incremental gradient algorithms offer inexpensive iterations by
sampling a subset of the terms in the sum. These methods can make great
progress initially, but often slow as they approach a solution. In contrast,
full-gradient methods achieve steady convergence at the expense of evaluating
the full objective and gradient on each iteration. We explore hybrid methods
that exhibit the benefits of both approaches. Rate-of-convergence analysis
shows that by controlling the sample size in an incremental gradient algorithm,
it is possible to maintain the steady convergence rates of full-gradient
methods. We detail a practical quasi-Newton implementation based on this
approach. Numerical experiments illustrate its potential benefits.Comment: 26 pages. Revised proofs of Theorems 2.6 and 3.1, results unchange
Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization
Relative to the large literature on upper bounds on complexity of convex
optimization, lesser attention has been paid to the fundamental hardness of
these problems. Given the extensive use of convex optimization in machine
learning and statistics, gaining an understanding of these complexity-theoretic
issues is important. In this paper, we study the complexity of stochastic
convex optimization in an oracle model of computation. We improve upon known
results and obtain tight minimax complexity estimates for various function
classes
- …