89 research outputs found

    Approximation in $L^p(\mu)$ with deep ReLU neural networks

    We discuss the expressive power of neural networks which use the non-smooth ReLU activation function $\varrho(x) = \max\{0,x\}$ by analyzing the approximation theoretic properties of such networks. The existing results mainly fall into two categories: approximation using ReLU networks with a fixed depth, or using ReLU networks whose depth increases with the approximation accuracy. After reviewing these findings, we show that the results concerning networks with fixed depth, which up to now only consider approximation in $L^p(\lambda)$ for the Lebesgue measure $\lambda$, can be generalized to approximation in $L^p(\mu)$ for any finite Borel measure $\mu$. In particular, the generalized results apply in the usual setting of statistical learning theory, where one is interested in approximation in $L^2(\mathbb{P})$, with the probability measure $\mathbb{P}$ describing the distribution of the data.
    Comment: Accepted for presentation at SampTA 201
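    To make the measure-dependent notion of error concrete, here is a minimal Python sketch (not taken from the paper; the network, target, and choice of $\mu$ are hypothetical toy choices): it builds a small fixed-depth ReLU network and estimates its $L^p(\mu)$ approximation error by Monte Carlo, with $\mu$ a standard Gaussian standing in for a data distribution $\mathbb{P}$.

```python
# Minimal sketch (not from the paper): Monte Carlo estimate of the L^p(mu)
# approximation error of a small fixed-depth ReLU network. Here mu is a
# standard Gaussian, standing in for the data distribution P of statistical
# learning theory; the network and target are hypothetical toy choices.
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def relu_net(x, W1, b1, W2, b2):
    # One hidden layer: x -> W2 . relu(W1 * x + b1) + b2 (scalar input x).
    return relu(x[:, None] * W1 + b1) @ W2 + b2

# Weights chosen so the network represents f(x) = |x| exactly,
# since |x| = relu(x) + relu(-x).
W1, b1 = np.array([1.0, -1.0]), np.zeros(2)
W2, b2 = np.array([1.0, 1.0]), 0.0
f = np.abs  # target function

def lp_mu_error(samples, p=2):
    # ( (1/n) * sum_i |f(X_i) - N(X_i)|^p )^(1/p), with X_i drawn from mu.
    diff = np.abs(f(samples) - relu_net(samples, W1, b1, W2, b2))
    return np.mean(diff ** p) ** (1.0 / p)

X = rng.normal(size=10_000)   # X_i ~ mu (here a probability measure)
print(lp_mu_error(X, p=2))    # ~0: this network represents |x| exactly
```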

    Is Stochastic Gradient Descent Near Optimal?

    The success of neural networks over the past decade has established them as effective models for many relevant data generating processes. Statistical theory on neural networks indicates graceful scaling of sample complexity. For example, Jeon & Van Roy (arXiv:2203.00246) demonstrate that, when data is generated by a ReLU teacher network with $W$ parameters, an optimal learner needs only $\tilde{O}(W/\epsilon)$ samples to attain expected error $\epsilon$. However, existing computational theory suggests that, even for single-hidden-layer teacher networks, to attain small error for all such teacher networks, the computation required to achieve this sample complexity is intractable. In this work, we fit single-hidden-layer neural networks to data generated by single-hidden-layer ReLU teacher networks with parameters drawn from a natural distribution. We demonstrate that stochastic gradient descent (SGD) with automated width selection attains small expected error with a number of samples and total number of queries both nearly linear in the input dimension and width. This suggests that SGD nearly achieves the information-theoretic sample complexity bounds of Jeon & Van Roy (arXiv:2203.00246) in a computationally efficient manner. An important difference between our positive empirical results and the negative theoretical results is that the latter address worst-case error of deterministic algorithms, while our analysis centers on expected error of a stochastic algorithm.
    Comment: arXiv admin note: substantial text overlap with arXiv:2203.00246
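    As a rough illustration of the teacher-student setup described above, the following hypothetical Python sketch fits a one-hidden-layer ReLU student to data generated by a one-hidden-layer ReLU teacher with Gaussian parameters, using plain single-sample SGD on the squared loss. The Gaussian input distribution and fixed student width are simplifying assumptions; this is not the paper's automated width selection procedure.

```python
# Rough sketch (assumptions: Gaussian inputs, Gaussian teacher parameters,
# a fixed student width instead of the paper's automated width selection).
# Fit a one-hidden-layer ReLU student to teacher-generated data with plain
# single-sample SGD on the squared loss and report the held-out error.
import numpy as np

rng = np.random.default_rng(0)
d, teacher_width, student_width = 8, 4, 32
n_train, n_test, lr, epochs = 20_000, 5_000, 0.01, 5

relu = lambda z: np.maximum(0.0, z)

# Teacher network with parameters drawn from a "natural" (Gaussian) distribution.
Wt = rng.normal(size=(d, teacher_width)) / np.sqrt(d)
at = rng.normal(size=teacher_width)
teacher = lambda X: relu(X @ Wt) @ at

# Data generated by the teacher.
Xtr, Xte = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
ytr, yte = teacher(Xtr), teacher(Xte)

# Student: one hidden ReLU layer with output weights a_s.
Ws = rng.normal(size=(d, student_width)) / np.sqrt(d)
a_s = rng.normal(size=student_width) / np.sqrt(student_width)

for _ in range(epochs):
    for i in rng.permutation(n_train):
        x, y = Xtr[i], ytr[i]
        h = relu(x @ Ws)                                 # hidden activations
        err = h @ a_s - y                                # prediction error
        grad_a = err * h                                 # d(0.5*err^2)/d(a_s)
        grad_W = err * np.outer(x, (x @ Ws > 0) * a_s)   # d(0.5*err^2)/d(Ws)
        a_s -= lr * grad_a
        Ws -= lr * grad_W

test_mse = np.mean((relu(Xte @ Ws) @ a_s - yte) ** 2)
print(f"held-out squared error: {test_mse:.4f}")
```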