
    Sharp Representation Theorems for ReLU Networks with Precise Dependence on Depth

    We prove sharp dimension-free representation results for neural networks with $D$ ReLU layers under square loss for a class of functions $\mathcal{G}_D$ defined in the paper. These results capture the precise benefits of depth in the following sense: 1. The rates for representing the class of functions $\mathcal{G}_D$ via $D$ ReLU layers are sharp up to constants, as shown by matching lower bounds. 2. For each $D$, $\mathcal{G}_D \subseteq \mathcal{G}_{D+1}$, and as $D$ grows the class $\mathcal{G}_D$ contains progressively less smooth functions. 3. If $D^{\prime} < D$, then the approximation rate for the class $\mathcal{G}_D$ achieved by depth-$D^{\prime}$ networks is strictly worse than that achieved by depth-$D$ networks. This constitutes a fine-grained characterization of the representation power of feedforward networks of arbitrary depth $D$ and number of neurons $N$, in contrast to existing representation results, which either require $D$ to grow quickly with $N$ or assume that the function being represented is highly smooth; in the latter case similar rates can be obtained with a single nonlinear layer. Our results confirm the prevailing hypothesis that deeper networks are better at representing less smooth functions, and indeed the main technical novelty is to fully exploit the fact that deep networks can produce highly oscillatory functions with few activation functions.
    Comment: 12 pages, 1 figure (surprisingly short, isn't it?)
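
    The oscillation phenomenon behind the main technical novelty can be illustrated with the classical tent-map (sawtooth) construction; this is a standard textbook illustration, not the construction used in the paper. A minimal NumPy sketch:

        import numpy as np

        def relu(x):
            return np.maximum(x, 0.0)

        def hat(x):
            # Triangle ("tent") map on [0, 1] built from two ReLUs:
            # 2*relu(x) - 4*relu(x - 0.5) rises to 1 at x = 0.5, then falls back to 0.
            return 2.0 * relu(x) - 4.0 * relu(x - 0.5)

        def sawtooth(x, depth):
            # Composing the hat map `depth` times yields a sawtooth with 2**(depth-1)
            # teeth (roughly 2**depth - 1 turning points) while using only 2*depth
            # ReLU activations in total: oscillation count grows geometrically in depth.
            for _ in range(depth):
                x = hat(x)
            return x

        xs = np.linspace(0.0, 1.0, 10001)
        for D in (1, 3, 6):
            ys = sawtooth(xs, D)
            # Count turning points via sign changes of finite-difference slopes.
            turns = np.sum(np.diff(np.sign(np.diff(ys))) != 0)
            print(f"depth {D}: ~{turns} turning points")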

    High-Order Approximation Rates for Shallow Neural Networks with Cosine and ReLU$^k$ Activation Functions

    We study the approximation properties of shallow neural networks with an activation function which is a power of the rectified linear unit. Specifically, we consider the dependence of the approximation rate on the dimension and on the smoothness, in the spectral Barron space, of the underlying function $f$ to be approximated. We show that as the smoothness index $s$ of $f$ increases, shallow neural networks with ReLU$^k$ activation function obtain an improved approximation rate up to a best possible rate of $O(n^{-(k+1)}\log(n))$ in $L^2$, independent of the dimension $d$. The significance of this result is that the activation function ReLU$^k$ is fixed independent of the dimension, while for classical methods the degree of polynomial approximation or the smoothness of the wavelets used would have to increase in order to take advantage of the dimension-dependent smoothness of $f$. In addition, we derive improved approximation rates for shallow neural networks with cosine activation function on the spectral Barron space. Finally, we prove lower bounds showing that the approximation rates attained are optimal under the given assumptions.
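
    For concreteness, the approximant class in question is a single hidden layer with ReLU$^k$ activation; the activation is the same for every input dimension $d$. A minimal NumPy sketch with placeholder random weights (not the approximants constructed in the paper):

        import numpy as np

        def relu_k(x, k):
            # Power of the rectified linear unit: max(x, 0)**k (k = 1 is the usual ReLU).
            return np.maximum(x, 0.0) ** k

        def shallow_relu_k_net(x, W, b, a, k):
            # One hidden layer with n neurons and ReLU^k activation:
            #   f(x) = sum_i a_i * relu_k(<w_i, x> + b_i)
            # x: (d,) input, W: (n, d) inner weights, b: (n,) biases, a: (n,) outer weights.
            return a @ relu_k(W @ x + b, k)

        # Hypothetical example: n = 50 random neurons in dimension d = 10 with k = 2.
        rng = np.random.default_rng(0)
        d, n, k = 10, 50, 2
        W = rng.normal(size=(n, d))
        b = rng.normal(size=n)
        a = rng.normal(size=n) / n
        x = rng.normal(size=d)
        print(shallow_relu_k_net(x, W, b, a, k))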

    Depth separation beyond radial functions

    High-dimensional depth separation results for neural networks show that certain functions can be efficiently approximated by two-hidden-layer networks but not by one-hidden-layer ones in high dimension $d$. Existing results of this type mainly focus on functions with an underlying radial or one-dimensional structure, which are usually not encountered in practice. The first contribution of this paper is to extend such results to a more general class of functions, namely functions with a piecewise oscillatory structure, by building on the proof strategy of (Eldan and Shamir, 2016). We complement these results by showing that, if the domain radius and the rate of oscillation of the objective function are constant, then approximation by one-hidden-layer networks holds at a $\mathrm{poly}(d)$ rate for any fixed error threshold. A common theme in the proof of such results is the fact that one-hidden-layer networks fail to approximate high-energy functions whose Fourier representation is spread over the domain; existing approximation results by one-hidden-layer neural networks, on the other hand, rely on the function having a sparse Fourier representation. The choice of the domain also represents a source of gaps between upper and lower approximation bounds. Focusing on a fixed approximation domain, namely the sphere $\mathbb{S}^{d-1}$ in dimension $d$, we provide a characterization both of functions that are efficiently approximable by one-hidden-layer networks and of functions that provably are not, in terms of their Fourier expansion.
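
    The qualitative phenomenon, that one-hidden-layer (random-feature) approximation degrades as the target's oscillation rate grows, can be seen in a toy experiment. The ridge target, the direction v, and the random-feature fit below are illustrative choices only, not the paper's piecewise oscillatory construction or its lower-bound argument:

        import numpy as np

        rng = np.random.default_rng(0)
        d, n_train, n_neurons = 20, 5000, 200

        # Inputs sampled uniformly on the sphere S^{d-1}.
        X = rng.normal(size=(n_train, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)

        v = np.zeros(d)
        v[0] = 1.0  # hypothetical fixed direction for the toy target

        def target(X, freq):
            # Oscillatory ridge function sin(freq * <v, x>): a stand-in for the
            # oscillatory targets in depth-separation results.
            return np.sin(freq * X @ v)

        # One-hidden-layer network in the random-features regime: fix random ReLU
        # features, fit only the outer layer by least squares.
        W = rng.normal(size=(n_neurons, d))
        b = rng.uniform(-1.0, 1.0, size=n_neurons)
        Phi = np.maximum(X @ W.T + b, 0.0)

        for freq in (1, 5, 25, 125):
            y = target(X, freq)
            coef, *_ = np.linalg.lstsq(Phi, y, rcond=None)
            err = np.mean((Phi @ coef - y) ** 2) / np.mean(y ** 2)
            print(f"frequency {freq:4d}: relative squared error {err:.3f}")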

    A Corrective View of Neural Networks: Representation, Memorization and Learning

    We develop a corrective mechanism for neural network approximation: the total available nonlinear units are divided into multiple groups; the first group approximates the function under consideration, the second group approximates the error in approximation produced by the first group and corrects it, the third group approximates the error produced by the first and second groups together, and so on. This technique yields several new representation and learning results for neural networks. First, we show that two-layer neural networks in the random features (RF) regime can memorize arbitrary labels for arbitrary points under a Euclidean distance separation condition using $\tilde{O}(n)$ ReLUs, which is optimal in $n$ up to logarithmic factors. Next, we give a powerful representation result for two-layer neural networks with ReLUs and smoothed ReLUs, which can achieve a squared error of at most $\epsilon$ with $O(C(a,d)\epsilon^{-1/(a+1)})$ units for $a \in \mathbb{N}\cup\{0\}$ when the function is smooth enough (roughly when it has $\Theta(ad)$ bounded derivatives). In certain cases $d$ can be replaced with an effective dimension $q \ll d$. Previous results of this type implement Taylor series approximation using deep architectures. We also consider three-layer neural networks and show that the corrective mechanism yields faster representation rates for smooth radial functions. Lastly, we obtain the first $O(\mathrm{subpoly}(1/\epsilon))$ upper bound on the number of neurons required for a two-layer network to learn low-degree polynomials up to squared error $\epsilon$ via gradient descent. Even though deep networks can express these polynomials with $O(\mathrm{polylog}(1/\epsilon))$ neurons, the best learning bounds on this problem require $\mathrm{poly}(1/\epsilon)$ neurons.
    Comment: Contains 2 figures (you heard that right!), V2 removes dimension dependence in memorization bound
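
    A schematic NumPy sketch of the corrective mechanism as described above, with each group fit to the residual left by the previous groups. Here the groups are fit by least squares over random ReLU features, whereas the paper's construction is explicit, so this is an illustration of the idea only:

        import numpy as np

        def fit_group(X, residual, n_units, rng):
            # One "group" of ReLU units in the random-features regime: random inner
            # weights, outer weights fit to the current residual by least squares.
            d = X.shape[1]
            W = rng.normal(size=(n_units, d))
            b = rng.uniform(-1.0, 1.0, size=n_units)
            Phi = np.maximum(X @ W.T + b, 0.0)
            coef, *_ = np.linalg.lstsq(Phi, residual, rcond=None)
            return lambda Z: np.maximum(Z @ W.T + b, 0.0) @ coef

        def corrective_approximation(X, y, n_groups, units_per_group, seed=0):
            # Corrective mechanism (schematic): group 1 approximates y, group 2
            # approximates the residual left by group 1, group 3 the residual left
            # by groups 1 and 2 together, and so on; the model is the sum of groups.
            rng = np.random.default_rng(seed)
            groups, residual = [], y.copy()
            for _ in range(n_groups):
                g = fit_group(X, residual, units_per_group, rng)
                groups.append(g)
                residual = residual - g(X)
            return lambda Z: sum(g(Z) for g in groups)

        # Hypothetical usage on a smooth target in dimension d = 5.
        rng = np.random.default_rng(1)
        X = rng.normal(size=(2000, 5))
        y = np.sin(X[:, 0]) * np.exp(-X[:, 1] ** 2)
        model = corrective_approximation(X, y, n_groups=4, units_per_group=100)
        print("train MSE:", np.mean((model(X) - y) ** 2))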