89 research outputs found
Approximation in $L^p(\mu)$ with deep ReLU neural networks
We discuss the expressive power of neural networks which use the non-smooth
ReLU activation function by analyzing the
approximation theoretic properties of such networks. The existing results
mainly fall into two categories: approximation using ReLU networks with a fixed
depth, or using ReLU networks whose depth increases with the approximation
accuracy. After reviewing these findings, we show that the results concerning
networks with fixed depth, which up to now only consider approximation in
$L^p$ with respect to the Lebesgue measure, can be generalized to
approximation in $L^p(\mu)$ for any finite Borel measure $\mu$. In particular,
the generalized results apply in the usual setting of statistical learning
theory, where one is interested in approximation in $L^p(\mathbb{P})$, with the
probability measure $\mathbb{P}$ describing the distribution of the data.
Comment: Accepted for presentation at SampTA 2019
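
A minimal sketch of the setting described above, not taken from the paper: the weights, the target function, and the two measures (uniform on the unit square standing in for the Lebesgue case, a Gaussian standing in for a data distribution $\mathbb{P}$) are all hypothetical choices, and the $L^p(\mu)$ error of a fixed-depth ReLU network is estimated by Monte Carlo over samples from $\mu$.

    # Illustrative sketch only: estimate the L^p(mu) approximation error of a
    # fixed-depth ReLU network by Monte Carlo, for two choices of the measure mu.
    import numpy as np

    rng = np.random.default_rng(0)

    def relu(x):
        return np.maximum(x, 0.0)

    def relu_net(x, W1, b1, W2, b2):
        # One hidden layer of ReLU units followed by a linear readout.
        return relu(x @ W1 + b1) @ W2 + b2

    # Hypothetical target function and network weights, chosen only for illustration.
    target = lambda x: np.sin(np.pi * x[:, 0]) * np.cos(np.pi * x[:, 1])
    d, width = 2, 64
    W1 = rng.normal(size=(d, width)); b1 = rng.normal(size=width)
    W2 = rng.normal(size=(width, 1)) / width; b2 = np.zeros(1)

    def lp_error(sample_from_mu, p=2, n=100_000):
        # Monte Carlo estimate of ||target - net||_{L^p(mu)} using samples X ~ mu.
        X = sample_from_mu(n)
        diff = np.abs(target(X) - relu_net(X, W1, b1, W2, b2).ravel())
        return np.mean(diff ** p) ** (1.0 / p)

    # Lebesgue-like case: mu = uniform measure on [0,1]^d.
    err_uniform = lp_error(lambda n: rng.uniform(0.0, 1.0, size=(n, d)))
    # Statistical-learning case: mu = distribution of the data, here a Gaussian.
    err_data = lp_error(lambda n: rng.normal(0.0, 0.3, size=(n, d)) + 0.5)
    print(f"L^2 error w.r.t. uniform measure:    {err_uniform:.3f}")
    print(f"L^2 error w.r.t. data distribution:  {err_data:.3f}")

The point of the two calls is only to make the abstract's distinction concrete: the same network is scored against two different finite Borel measures, and nothing in the error estimate requires the measure to be Lebesgue.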
Is Stochastic Gradient Descent Near Optimal?
The success of neural networks over the past decade has established them as
effective models for many relevant data generating processes. Statistical
theory on neural networks indicates graceful scaling of sample complexity. For
example, Jeon & Van Roy (arXiv:2203.00246) demonstrate that, when data is
generated by a ReLU teacher network with $W$ parameters, an optimal learner
needs only $\tilde{O}(W/\epsilon)$ samples to attain expected error $\epsilon$.
However, existing computational theory suggests that, even for
single-hidden-layer teacher networks, to attain small error for all such
teacher networks, the computation required to achieve this sample complexity is
intractable. In this work, we fit single-hidden-layer neural networks to data
generated by single-hidden-layer ReLU teacher networks with parameters drawn
from a natural distribution. We demonstrate that stochastic gradient descent
(SGD) with automated width selection attains small expected error with a number
of samples and total number of queries both nearly linear in the input
dimension and width. This suggests that SGD nearly achieves the
information-theoretic sample complexity bounds of Jeon & Van Roy
(arXiv:2203.00246) in a computationally efficient manner. An important
difference between our positive empirical results and the negative theoretical
results is that the latter address worst-case error of deterministic
algorithms, while our analysis centers on expected error of a stochastic
algorithm.
Comment: arXiv admin note: substantial text overlap with arXiv:2203.00246
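
A rough teacher-student sketch of the setup described above, not the authors' code: data are generated by a single-hidden-layer ReLU teacher with randomly drawn parameters, a student of each candidate width is fit by plain mini-batch SGD, and the width with the lowest validation error is kept as a simple stand-in for automated width selection. Every dimension, width, learning rate, and distribution below is an assumption made for illustration.

    # Illustrative teacher-student sketch: random ReLU teacher, SGD-trained
    # students of several widths, width chosen by validation error.
    import numpy as np

    rng = np.random.default_rng(0)
    d, teacher_width, n_train, n_val = 8, 16, 4000, 1000

    def relu(x):
        return np.maximum(x, 0.0)

    # Teacher with parameters drawn from a simple (hypothetical) distribution.
    Wt = rng.normal(size=(d, teacher_width)) / np.sqrt(d)
    vt = rng.normal(size=teacher_width) / np.sqrt(teacher_width)
    teacher = lambda X: relu(X @ Wt) @ vt

    X_train = rng.normal(size=(n_train, d)); y_train = teacher(X_train)
    X_val = rng.normal(size=(n_val, d));     y_val = teacher(X_val)

    def fit_student(width, lr=0.05, epochs=30, batch=32):
        # Mini-batch SGD on the squared error of a one-hidden-layer ReLU student
        # (constant factors in the gradient are folded into the learning rate).
        W = rng.normal(size=(d, width)) / np.sqrt(d)
        v = rng.normal(size=width) / np.sqrt(width)
        for _ in range(epochs):
            perm = rng.permutation(n_train)
            for i in range(0, n_train, batch):
                idx = perm[i:i + batch]
                X, y = X_train[idx], y_train[idx]
                H = relu(X @ W)                    # hidden activations
                err = H @ v - y                    # residuals
                grad_v = H.T @ err / len(idx)
                grad_W = X.T @ ((err[:, None] * v) * (H > 0)) / len(idx)
                v -= lr * grad_v
                W -= lr * grad_W
        return W, v

    best = None
    for width in (4, 8, 16, 32, 64):               # candidate student widths
        W, v = fit_student(width)
        val_mse = np.mean((relu(X_val @ W) @ v - y_val) ** 2)
        print(f"width={width:3d}  validation MSE={val_mse:.4f}")
        if best is None or val_mse < best[0]:
            best = (val_mse, width)
    print(f"selected width: {best[1]}")

The sample budget here is fixed; the paper's claim concerns how the number of samples and SGD queries needed for small expected error scales with the input dimension and teacher width, which this sketch does not measure.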