    The Extended Regularized Dual Averaging Method for Composite Optimization

    We present a new algorithm, extended regularized dual averaging (XRDA), for solving composite optimization problems; it is a generalization of the regularized dual averaging (RDA) method. The main novelty of the method is that it allows more flexible control of the backward step size. For instance, the backward step size for RDA grows without bound, while for XRDA the backward step size can be kept bounded.
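
    As background for the backward-step-size remark, the following is a minimal NumPy sketch of standard $\ell_1$-regularized RDA (not the authors' XRDA). The function name `l1_rda`, the interface `grad_fn`, the choice $\beta_t = \gamma\sqrt{t}$, and reading $t/\beta_t$ as the backward step size are illustrative assumptions.

```python
import numpy as np

def l1_rda(grad_fn, x0, steps=1000, lam=0.1, gamma=1.0):
    """Minimal sketch of l1-regularized dual averaging (RDA).

    Hypothetical interface: grad_fn(x) returns a (sub)gradient of the smooth
    part of the objective at x; lam is the l1 penalty, gamma scales beta_t.
    """
    x = np.asarray(x0, dtype=float)
    g_bar = np.zeros_like(x)              # running average of subgradients
    for t in range(1, steps + 1):
        g_bar += (grad_fn(x) - g_bar) / t
        beta_t = gamma * np.sqrt(t)
        # Closed-form solution of the RDA step with h(x) = 0.5*||x||^2:
        #   x_{t+1} = argmin_x  <g_bar, x> + lam*||x||_1 + (beta_t/t)*h(x)
        # The factor t/beta_t plays the role of the backward step size; here
        # it equals sqrt(t)/gamma and grows without bound, the behaviour the
        # abstract says XRDA is designed to relax.
        shrink = np.maximum(np.abs(g_bar) - lam, 0.0) * np.sign(g_bar)
        x = -(t / beta_t) * shrink
    return x
```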

    Optimal Approximation of Zonoids and Uniform Approximation by Shallow Neural Networks

    We study the following two related problems. The first is to determine to what error an arbitrary zonoid in $\mathbb{R}^{d+1}$ can be approximated in the Hausdorff distance by a sum of $n$ line segments. The second is to determine optimal approximation rates in the uniform norm for shallow ReLU$^k$ neural networks on their variation spaces. The first of these problems has been solved for $d \neq 2,3$, but when $d = 2,3$ a logarithmic gap between the best upper and lower bounds remains. We close this gap, which completes the solution in all dimensions. For the second problem, our techniques significantly improve upon existing approximation rates when $k \geq 1$, and enable uniform approximation of both the target function and its derivatives.
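
    For reference, a shallow ReLU$^k$ network is a function of the form $f(x) = \sum_{i=1}^n a_i \max(w_i \cdot x + b_i, 0)^k$. The sketch below simply evaluates such a network with randomly drawn parameters; all names and the parameter setup are illustrative, not taken from the paper.

```python
import numpy as np

def shallow_relu_k(x, weights, biases, coeffs, k=1):
    """Evaluate a shallow ReLU^k network  f(x) = sum_i a_i * relu(w_i.x + b_i)^k.

    x: (m, d) array of input points; weights: (n, d); biases, coeffs: (n,).
    """
    pre = x @ weights.T + biases            # (m, n) pre-activations
    return np.maximum(pre, 0.0) ** k @ coeffs

# Example: a random width-50 ReLU^2 network evaluated on points in the unit ball of R^3.
rng = np.random.default_rng(0)
d, n = 3, 50
x = rng.normal(size=(10, d))
x /= np.maximum(np.linalg.norm(x, axis=1, keepdims=True), 1.0)  # project into the unit ball
f_vals = shallow_relu_k(x, rng.normal(size=(n, d)), rng.normal(size=n),
                        rng.normal(size=n) / n, k=2)
```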

    Sharp Lower Bounds on Interpolation by Deep ReLU Neural Networks at Irregularly Spaced Data

    We study the interpolation power of deep ReLU neural networks. Specifically, we consider the question of how efficiently, in terms of the number of parameters, deep ReLU networks can interpolate values at $N$ datapoints in the unit ball which are separated by a distance $\delta$. We show that $\Omega(N)$ parameters are required in the regime where $\delta$ is exponentially small in $N$, which gives the sharp result in this regime since $O(N)$ parameters are always sufficient. This also shows that the bit-extraction technique used to prove lower bounds on the VC dimension cannot be applied to irregularly spaced datapoints. Finally, as an application we give a lower bound on the approximation rates that deep ReLU neural networks can achieve for Sobolev spaces at the embedding endpoint.
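
    As a toy illustration of the easy direction, that $O(N)$ parameters always suffice, the sketch below interpolates $N$ one-dimensional points with a single hidden layer of ReLU units whose size is linear in $N$. This is a standard 1D construction, not the paper's argument, and the helper names are made up for the example.

```python
import numpy as np

def relu_interpolant(xs, ys):
    """Coefficients of a one-hidden-layer ReLU net interpolating (xs, ys) in 1D.

    Returns (bias, knots, coeffs) so that
        f(x) = bias + sum_i coeffs[i] * max(x - knots[i], 0),
    which uses O(N) parameters for N points.
    """
    order = np.argsort(xs)
    xs, ys = np.asarray(xs, float)[order], np.asarray(ys, float)[order]
    slopes = np.diff(ys) / np.diff(xs)          # slope on each interval
    coeffs = np.diff(slopes, prepend=0.0)       # slope change introduced at each knot
    return ys[0], xs[:-1], coeffs

def evaluate(x, bias, knots, coeffs):
    return bias + np.maximum(np.subtract.outer(x, knots), 0.0) @ coeffs

xs, ys = np.array([0.0, 0.1, 0.4, 0.9]), np.array([1.0, -2.0, 0.5, 3.0])
bias, knots, coeffs = relu_interpolant(xs, ys)
assert np.allclose(evaluate(xs, bias, knots, coeffs), ys)
```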

    Optimal Approximation Rates for Deep ReLU Neural Networks on Sobolev and Besov Spaces

    Let $\Omega = [0,1]^d$ be the unit cube in $\mathbb{R}^d$. We study the problem of how efficiently, in terms of the number of parameters, deep neural networks with the ReLU activation function can approximate functions in the Sobolev spaces $W^s(L_q(\Omega))$ and Besov spaces $B^s_r(L_q(\Omega))$, with error measured in the $L_p(\Omega)$ norm. This problem is important when studying the application of neural networks in a variety of fields, including scientific computing and signal processing, and has previously been solved only when $p = q = \infty$. Our contribution is to provide a complete solution for all $1 \leq p, q \leq \infty$ and $s > 0$ for which the corresponding Sobolev or Besov space compactly embeds into $L_p$. The key technical tool is a novel bit-extraction technique which gives an optimal encoding of sparse vectors. This enables us to obtain sharp upper bounds in the non-linear regime where $p > q$. We also provide a novel method for deriving $L_p$-approximation lower bounds based upon VC-dimension when $p < \infty$. Our results show that very deep ReLU networks significantly outperform classical methods of approximation in terms of the number of parameters, but that this comes at the cost of parameters which are not encodable.
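
    For concreteness, the error metric used throughout is $\|f - f_n\|_{L_p(\Omega)} = \big(\int_{[0,1]^d} |f - f_n|^p\,dx\big)^{1/p}$. The snippet below is an illustrative Monte Carlo estimator of this quantity on the unit cube; it is a generic helper under assumed names, not code from the paper.

```python
import numpy as np

def lp_error(f_target, f_approx, d, p=2.0, n_samples=100_000, seed=0):
    """Monte Carlo estimate of || f_target - f_approx ||_{L_p([0,1]^d)}."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n_samples, d))
    diff = np.abs(f_target(x) - f_approx(x))
    if np.isinf(p):
        return diff.max()                 # crude sampled proxy for the sup norm
    return np.mean(diff ** p) ** (1.0 / p)

# Example: L_2 error of the zero approximant to f(x) = prod_j sin(pi * x_j) on [0,1]^2.
err = lp_error(lambda x: np.prod(np.sin(np.pi * x), axis=1),
               lambda x: np.zeros(len(x)), d=2, p=2.0)
```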

    A qualitative difference between gradient flows of convex functions in finite- and infinite-dimensional Hilbert spaces

    We consider gradient flow/gradient descent and heavy ball/accelerated gradient descent optimization for convex objective functions. In the gradient flow case, we prove the following: 1. If $f$ does not have a minimizer, the convergence $f(x_t) \to \inf f$ can be arbitrarily slow. 2. If $f$ does have a minimizer, the excess energy $f(x_t) - \inf f$ is integrable/summable in time. In particular, $f(x_t) - \inf f = o(1/t)$ as $t \to \infty$. 3. In Hilbert spaces, this is optimal: $f(x_t) - \inf f$ can decay to $0$ as slowly as any given function which is monotone decreasing and integrable at $\infty$, even for a fixed quadratic objective. 4. In finite dimension (or more generally, for all gradient flow curves of finite length), this is not optimal: we prove that there are convex monotone decreasing integrable functions $g(t)$ which decrease to zero slower than $f(x_t) - \inf f$ for the gradient flow of any convex function on $\mathbb{R}^d$. For instance, we show that any gradient flow $x_t$ of a convex function $f$ in finite dimension satisfies $\liminf_{t\to\infty} \big(t \cdot \log^2(t) \cdot \{f(x_t) - \inf f\}\big) = 0$. This improves on the commonly reported $O(1/t)$ rate and provides a sharp characterization of the energy decay law. We also note that it is impossible to establish a rate $O(1/(t\phi(t)))$ for any function $\phi$ which satisfies $\lim_{t\to\infty} \phi(t) = \infty$, even asymptotically. Similar results are obtained in related settings for (1) discrete time gradient descent, (2) stochastic gradient descent with multiplicative noise and (3) the heavy ball ODE. In the case of stochastic gradient descent, the summability of $\mathbb{E}[f(x_n) - \inf f]$ is used to prove that $f(x_n) \to \inf f$ almost surely, an improvement on the convergence almost surely up to a subsequence which follows from the $O(1/n)$ decay estimate.
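
    The following is a minimal numerical sketch of claim 2 in the discrete-time setting: for a convex objective with a minimizer, $t \cdot (f(x_t) - \inf f)$ should tend to $0$. The objective $f(x) = x^4$, the initial point, and the step size are arbitrary illustrative choices, not taken from the paper.

```python
import numpy as np

# Gradient descent on the convex function f(x) = x**4 (inf f = 0) with a fixed
# small step size; the printed quantity t * (f(x_t) - inf f) shrinks toward 0,
# consistent with the o(1/t) decay of the excess energy.
f = lambda x: x ** 4
grad = lambda x: 4 * x ** 3

x, step = 1.0, 1e-3
for t in range(1, 100_001):
    x -= step * grad(x)
    if t % 20_000 == 0:
        print(f"t = {t:6d}   t * (f(x_t) - inf f) = {t * f(x):.3e}")
```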

    Sharp Convergence Rates for Matching Pursuit

    We study the fundamental limits of matching pursuit, or the pure greedy algorithm, for approximating a target function by a sparse linear combination of elements from a dictionary. When the target function is contained in the variation space corresponding to the dictionary, many impressive works over the past few decades have obtained upper and lower bounds on the error of matching pursuit, but they do not match. The main contribution of this paper is to close this gap and obtain a sharp characterization of the decay rate of matching pursuit. Specifically, we construct a worst-case dictionary which shows that the existing best upper bound cannot be significantly improved. It turns out that, unlike other greedy algorithm variants, the convergence rate is suboptimal and is determined by the solution to a certain non-linear equation. This enables us to conclude that any amount of shrinkage improves matching pursuit in the worst case.
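
    For readers unfamiliar with the algorithm, here is a minimal NumPy sketch of matching pursuit over a finite dictionary of unit-norm atoms, with an optional shrinkage factor on the greedy coefficient. The finite-dimensional setup and all names are illustrative stand-ins for the general dictionaries considered in the paper.

```python
import numpy as np

def matching_pursuit(target, dictionary, n_iter=100, shrinkage=1.0):
    """Pure greedy algorithm / matching pursuit over a finite dictionary.

    dictionary: (m, K) array whose columns are unit-norm atoms.
    shrinkage < 1 gives the shrinkage variant mentioned in the abstract.
    Returns the approximation and the residual norm after each step.
    """
    residual = np.array(target, dtype=float)
    approx = np.zeros_like(residual)
    errors = []
    for _ in range(n_iter):
        inner = dictionary.T @ residual          # <residual, g> for every atom g
        k = np.argmax(np.abs(inner))             # greedy atom selection
        step = shrinkage * inner[k]
        approx += step * dictionary[:, k]
        residual -= step * dictionary[:, k]
        errors.append(np.linalg.norm(residual))
    return approx, errors

# Example: random unit-norm atoms in R^20 approximating a random target.
rng = np.random.default_rng(0)
D = rng.normal(size=(20, 200))
D /= np.linalg.norm(D, axis=0)
target = rng.normal(size=20)
_, errs = matching_pursuit(target, D, n_iter=50)
```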