
    Sharp Representation Theorems for ReLU Networks with Precise Dependence on Depth

    We prove sharp dimension-free representation results for neural networks with $D$ ReLU layers under square loss for a class of functions $\mathcal{G}_D$ defined in the paper. These results capture the precise benefits of depth in the following sense: 1. The rates for representing the class of functions $\mathcal{G}_D$ via $D$ ReLU layers are sharp up to constants, as shown by matching lower bounds. 2. For each $D$, $\mathcal{G}_D \subseteq \mathcal{G}_{D+1}$, and as $D$ grows the class $\mathcal{G}_D$ contains progressively less smooth functions. 3. If $D^{\prime} < D$, then the approximation rate for the class $\mathcal{G}_D$ achieved by depth-$D^{\prime}$ networks is strictly worse than that achieved by depth-$D$ networks. This constitutes a fine-grained characterization of the representation power of feedforward networks of arbitrary depth $D$ and number of neurons $N$, in contrast to existing representation results, which either require $D$ to grow quickly with $N$ or assume that the function being represented is highly smooth; in the latter case similar rates can be obtained with a single nonlinear layer. Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions, and indeed the main technical novelty is to fully exploit the fact that deep networks can produce highly oscillatory functions with few activation functions.
    Comment: 12 pages, 1 figure (surprisingly short, isn't it?)
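
    The oscillation phenomenon behind the main technical novelty can be illustrated with the classical tent-map (sawtooth) construction; this is a standard textbook illustration, not the construction used in the paper. A minimal NumPy sketch:

        import numpy as np

        def relu(x):
            return np.maximum(x, 0.0)

        def hat(x):
            # Triangle ("tent") map on [0, 1] built from two ReLUs:
            # 2*relu(x) - 4*relu(x - 0.5) rises to 1 at x = 0.5, then falls back to 0.
            return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

        def sawtooth(x, depth):
            # Composing the hat map `depth` times yields a sawtooth with 2**(depth-1)
            # teeth (roughly 2**depth - 1 turning points) while using only 2*depth
            # ReLU activations in total: oscillation count grows geometrically in depth.
            for _ in range(depth):
                x = hat(x)
            return x

        xs = np.linspace(0.0, 1.0, 10001)
        for D in (1, 3, 6):
            ys = sawtooth(xs, D)
            # Count turning points via sign changes of finite-difference slopes.
            turns = np.sum(np.diff(np.sign(np.diff(ys))) != 0)
            print(f"depth {D}: ~{turns} turning points")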

    High-Order Approximation Rates for Shallow Neural Networks with Cosine and ReLU$^k$ Activation Functions

    We study the approximation properties of shallow neural networks with an activation function which is a power of the rectified linear unit. Specifically, we consider the dependence of the approximation rate on the dimension and on the smoothness, in the spectral Barron space, of the underlying function $f$ to be approximated. We show that as the smoothness index $s$ of $f$ increases, shallow neural networks with ReLU$^k$ activation function obtain an improved approximation rate up to a best possible rate of $O(n^{-(k+1)}\log(n))$ in $L^2$, independent of the dimension $d$. The significance of this result is that the activation function ReLU$^k$ is fixed independent of the dimension, while for classical methods the degree of polynomial approximation or the smoothness of the wavelets used would have to increase in order to take advantage of the dimension-dependent smoothness of $f$. In addition, we derive improved approximation rates for shallow neural networks with cosine activation function on the spectral Barron space. Finally, we prove lower bounds showing that the approximation rates attained are optimal under the given assumptions.
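
    For concreteness, the approximant class in question is a single hidden layer with ReLU$^k$ activation; the activation is the same for every input dimension $d$. A minimal NumPy sketch with placeholder random weights (not the approximants constructed in the paper):

        import numpy as np

        def relu_k(x, k):
            # Power of the rectified linear unit: max(x, 0)**k (k = 1 is the usual ReLU).
            return np.maximum(x, 0.0) ** k

        def shallow_relu_k_net(x, W, b, a, k):
            # One hidden layer with n neurons and ReLU^k activation:
            #   f(x) = sum_i a_i * relu_k(<w_i, x> + b_i)
            # x: (d,) input, W: (n, d) inner weights, b: (n,) biases, a: (n,) outer weights.
            return a @ relu_k(W @ x + b, k)

        # Hypothetical example: n = 50 random neurons in dimension d = 10 with k = 2.
        rng = np.random.default_rng(0)
        d, n, k = 10, 50, 2
        W = rng.normal(size=(n, d))
        b = rng.normal(size=n)
        a = rng.normal(size=n) / n
        x = rng.normal(size=d)
        print(shallow_relu_k_net(x, W, b, a, k))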

    Depth separation beyond radial functions

    High-dimensional depth separation results for neural networks show that certain functions can be efficiently approximated by two-hidden-layer networks but not by one-hidden-layer ones in high dimension $d$. Existing results of this type mainly focus on functions with an underlying radial or one-dimensional structure, which are usually not encountered in practice. The first contribution of this paper is to extend such results to a more general class of functions, namely functions with a piecewise oscillatory structure, by building on the proof strategy of (Eldan and Shamir, 2016). We complement these results by showing that, if the domain radius and the rate of oscillation of the objective function are constant, then approximation by one-hidden-layer networks holds at a $\mathrm{poly}(d)$ rate for any fixed error threshold. A common theme in the proof of such results is the fact that one-hidden-layer networks fail to approximate high-energy functions whose Fourier representation is spread over the domain; existing approximation results by one-hidden-layer neural networks, on the other hand, rely on the function having a sparse Fourier representation. The choice of the domain also represents a source of gaps between upper and lower approximation bounds. Focusing on a fixed approximation domain, namely the sphere $\mathbb{S}^{d-1}$ in dimension $d$, we provide a characterization both of functions that are efficiently approximable by one-hidden-layer networks and of functions that provably are not, in terms of their Fourier expansion.
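
    The qualitative phenomenon, that one-hidden-layer (random-feature) approximation degrades as the target's oscillation rate grows, can be seen in a toy experiment. The ridge target, the direction v, and the random-feature fit below are illustrative choices only, not the paper's piecewise oscillatory construction or its lower-bound argument:

        import numpy as np

        rng = np.random.default_rng(0)
        d, n_train, n_neurons = 20, 5000, 200

        # Inputs sampled uniformly on the sphere S^{d-1}.
        X = rng.normal(size=(n_train, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)

        v = np.zeros(d)
        v[0] = 1.0  # hypothetical fixed direction for the toy target

        def target(X, freq):
            # Oscillatory ridge function sin(freq * <v, x>): a stand-in for the
            # oscillatory targets in depth-separation results.
            return np.sin(freq * X @ v)

        # One-hidden-layer network in the random-features regime: fix random ReLU
        # features, fit only the outer layer by least squares.
        W = rng.normal(size=(n_neurons, d))
        b = rng.uniform(-1.0, 1.0, size=n_neurons)
        Phi = np.maximum(X @ W.T + b, 0.0)

        for freq in (1, 5, 25, 125):
            y = target(X, freq)
            coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            err = np.mean((Phi @ coef - y) ** 2) / np.mean(y ** 2)
            print(f"frequency {freq:4d}: relative squared error {err:.3f}")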

    A Corrective View of Neural Networks: Representation, Memorization and Learning

    We develop a corrective mechanism for neural network approximation: the total available nonlinear units are divided into multiple groups; the first group approximates the function under consideration, the second group approximates the error in approximation produced by the first group and corrects it, the third group approximates the error produced by the first and second groups together, and so on. This technique yields several new representation and learning results for neural networks. First, we show that two-layer neural networks in the random features (RF) regime can memorize arbitrary labels for arbitrary points under a Euclidean distance separation condition using $\tilde{O}(n)$ ReLUs, which is optimal in $n$ up to logarithmic factors. Next, we give a powerful representation result for two-layer neural networks with ReLUs and smoothed ReLUs, which can achieve a squared error of at most $\epsilon$ with $O(C(a,d)\epsilon^{-1/(a+1)})$ units for $a \in \mathbb{N}\cup\{0\}$ when the function is smooth enough (roughly when it has $\Theta(ad)$ bounded derivatives). In certain cases $d$ can be replaced with an effective dimension $q \ll d$. Previous results of this type implement Taylor series approximation using deep architectures. We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions. Lastly, we obtain the first $O(\mathrm{subpoly}(1/\epsilon))$ upper bound on the number of neurons required for a two-layer network to learn low-degree polynomials up to squared error $\epsilon$ via gradient descent. Even though deep networks can express these polynomials with $O(\mathrm{polylog}(1/\epsilon))$ neurons, the best learning bounds on this problem require $\mathrm{poly}(1/\epsilon)$ neurons.
    Comment: Contains 2 figures (you heard that right!), V2 removes dimension dependence in memorization bound
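
    A schematic NumPy sketch of the corrective mechanism as described above, with each group fit to the residual left by the previous groups. Here the groups are fit by least squares over random ReLU features, whereas the paper's construction is explicit, so this is an illustration of the idea only:

        import numpy as np

        def fit_group(X, residual, n_units, rng):
            # One "group" of ReLU units in the random-features regime: random inner
            # weights, outer weights fit to the current residual by least squares.
            d = X.shape[1]
            W = rng.normal(size=(n_units, d))
            b = rng.uniform(-1.0, 1.0, size=n_units)
            Phi = np.maximum(X @ W.T + b, 0.0)
            coef, *_ = np.linalg.lstsq(Phi, residual, rcond=None)
            return lambda Z: np.maximum(Z @ W.T + b, 0.0) @ coef

        def corrective_approximation(X, y, n_groups, units_per_group, seed=0):
            # Corrective mechanism (schematic): group 1 approximates y, group 2
            # approximates the residual left by group 1, group 3 the residual left
            # by groups 1 and 2 together, and so on; the model is the sum of groups.
            rng = np.random.default_rng(seed)
            groups, residual = [], y.copy()
            for _ in range(n_groups):
                g = fit_group(X, residual, units_per_group, rng)
                groups.append(g)
                residual = residual - g(X)
            return lambda Z: sum(g(Z) for g in groups)

        # Hypothetical usage on a smooth target in dimension d = 5.
        rng = np.random.default_rng(1)
        X = rng.normal(size=(2000, 5))
        y = np.sin(X[:, 0]) * np.exp(-X[:, 1] ** 2)
        model = corrective_approximation(X, y, n_groups=4, units_per_group=100)
        print("train MSE:", np.mean((model(X) - y) ** 2))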