The Extended Regularized Dual Averaging Method for Composite Optimization
We present extended regularized dual averaging (XRDA), a new algorithm for solving composite optimization problems which generalizes the regularized dual averaging (RDA) method. The main novelty of the method is that it allows more flexible control of the backward step size. For instance, the backward step size for RDA grows without bound, while for XRDA the backward step size can be kept bounded.
Optimal Approximation of Zonoids and Uniform Approximation by Shallow Neural Networks
We study the following two related problems. The first is to determine to what error an arbitrary zonoid can be approximated in the Hausdorff distance by a sum of line segments. The second is to determine optimal approximation rates in the uniform norm for shallow ReLU neural networks on their variation spaces. The first of these problems has been solved in most dimensions, but in the remaining cases a logarithmic gap between the best upper and lower bounds remains. We close this gap, which completes the solution in all dimensions. For the second problem, our techniques significantly improve upon existing approximation rates in certain regimes, and enable uniform approximation of both the target function and its derivatives.
Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data
We study the interpolation power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at datapoints in the unit ball which are separated by a prescribed minimal distance. We prove a lower bound on the number of parameters required in the regime where this separation distance is exponentially small in the number of datapoints; the bound is sharp in this regime since it matches the number of parameters which is always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints. Finally, as an application, we give a lower bound on the approximation rates that deep ReLU neural networks can achieve for Sobolev spaces at the embedding endpoint.
Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev and Besov Spaces
We work on the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in Sobolev and Besov spaces on this domain, with error measured in an $L_p$ norm. This problem is important when studying the application of neural networks in a variety of fields, including scientific computing and signal processing, and has previously been solved only in special cases. Our contribution is to provide a complete solution for all smoothness and integrability indices for which the corresponding Sobolev or Besov space compactly embeds into the $L_p$ space in which the error is measured. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime, where the error is measured in a stronger norm than the one defining the smoothness class. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.
A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spaces
We consider gradient flow/gradient descent and heavy ball/accelerated
gradient descent optimization for convex objective functions. In the gradient
flow case, we prove the following:
1. If the objective does not have a minimizer, the convergence can be arbitrarily slow.
2. If the objective does have a minimizer, the excess energy is integrable/summable in time. In particular, it decays to zero faster than $1/t$ as $t \to \infty$.
3. In Hilbert spaces, this is optimal: the excess energy can decay to zero as slowly as any given function which is monotone decreasing and integrable at infinity, even for a fixed quadratic objective.
4. In finite dimension (or, more generally, for all gradient flow curves of finite length), this is not optimal: we prove that there are convex, monotone decreasing, integrable functions which decrease to zero more slowly than the excess energy along the gradient flow of any convex function in finite dimension. For instance, we establish an explicit, quantitatively faster decay law satisfied by every gradient flow of a convex function in finite dimension.
This improves on the commonly reported $O(1/t)$ rate and provides a sharp characterization of the energy decay law. We also note that it is impossible to establish such a rate for an arbitrary comparison function, even asymptotically.
Similar results are obtained in related settings for (1) discrete-time gradient descent, (2) stochastic gradient descent with multiplicative noise, and (3) the heavy ball ODE. In the case of stochastic gradient descent, the summability of the excess energy is used to prove that the objective values converge to the infimum almost surely, an improvement on the almost sure convergence along a subsequence which follows from the decay estimate alone.
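A toy numerical illustration (not from the paper) of the discrete-time analogue of claims 2 and 4 above: for gradient descent on a convex objective with a minimizer, the excess energy is summable, so k times the excess energy tends to zero. The objective and step size below are assumptions chosen purely for illustration.

```python
# Gradient descent on the convex (but not strongly convex) function
# f(x) = 0.25*sum(x^4), whose infimum 0 is attained at x = 0.
import numpy as np

def f(x):
    return 0.25 * np.sum(x**4)

def grad_f(x):
    return x**3

x = np.array([1.0, -0.8, 0.5])
h = 0.1                      # step size; small enough for monotone descent here
excess = []
for k in range(1, 100001):
    excess.append(f(x))      # excess energy f(x_k) - inf f, with inf f = 0
    x = x - h * grad_f(x)

excess = np.array(excess)
ks = np.arange(1, len(excess) + 1)
# The partial sums approach a finite limit and k*excess decays toward zero,
# consistent with the summability and faster-than-1/t decay discussed above.
print("partial sums of excess energy:", excess.cumsum()[[99, 9999, 99999]])
print("k * excess at k = 100, 10^4, 10^5:", (ks * excess)[[99, 9999, 99999]])
```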
Sharp Convergence Rates for Matching Pursuit
We study the fundamental limits of matching pursuit, or the pure greedy
algorithm, for approximating a target function by a sparse linear combination
of elements from a dictionary. When the target function is contained in the
variation space corresponding to the dictionary, many impressive works over the
past few decades have obtained upper and lower bounds on the error of matching
pursuit, but they do not match. The main contribution of this paper is to close
this gap and obtain a sharp characterization of the decay rate of matching
pursuit. Specifically, we construct a worst case dictionary which shows that
the existing best upper bound cannot be significantly improved. It turns out
that, unlike other greedy algorithm variants, the convergence rate is suboptimal and is determined by the solution to a certain non-linear equation. This enables us to conclude that any amount of shrinkage improves matching pursuit in the worst case.
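A minimal sketch (not the paper's worst-case construction) of matching pursuit, i.e. the pure greedy algorithm, over a finite dictionary, with an optional shrinkage factor in the coefficient update; shrink = 1 is pure matching pursuit, while shrink < 1 corresponds to the kind of shrinkage mentioned above. The random dictionary is an illustrative assumption.

```python
import numpy as np

def matching_pursuit(D, target, n_iters=100, shrink=1.0):
    """Greedily approximate target by a sparse combination of the columns of D.

    D      -- (dim, n_atoms) array whose columns are unit-norm dictionary atoms
    target -- vector of length dim
    shrink -- fraction of the optimal coefficient applied at each step
    Returns the coefficient vector and the residual-norm history.
    """
    residual = target.astype(float)
    coeffs = np.zeros(D.shape[1])
    history = []
    for _ in range(n_iters):
        inner = D.T @ residual             # correlations with all atoms
        j = np.argmax(np.abs(inner))       # best-matching atom
        step = shrink * inner[j]
        coeffs[j] += step
        residual -= step * D[:, j]
        history.append(np.linalg.norm(residual))
    return coeffs, history

# Example with a random overcomplete dictionary (illustrative setup only).
rng = np.random.default_rng(0)
D = rng.standard_normal((40, 200))
D /= np.linalg.norm(D, axis=0)             # normalize atoms to unit norm
target = D[:, :5] @ rng.standard_normal(5) # target in the span of a few atoms
coeffs, history = matching_pursuit(D, target, n_iters=200)
print("relative residual after 200 steps:", history[-1] / np.linalg.norm(target))
```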