Posterior Concentration for Sparse Deep Learning
Spike-and-Slab Deep Learning (SS-DL) is a fully Bayesian alternative to
Dropout for improving generalizability of deep ReLU networks. This new type of
regularization enables provable recovery of smooth input-output maps with
unknown levels of smoothness. Indeed, we show that the posterior distribution
concentrates at the near minimax rate for $\alpha$-H\"older smooth maps,
performing as well as if we knew the smoothness level ahead of time.
Our result sheds light on architecture design for deep neural networks, namely
the choice of depth, width and sparsity level. These network attributes
typically depend on unknown smoothness in order to be optimal. We obviate this
constraint with the fully Bayes construction. As an aside, we show that SS-DL
does not overfit in the sense that the posterior concentrates on smaller
networks with fewer (up to the optimal number of) nodes and links. Our results
provide new theoretical justifications for deep ReLU networks from a Bayesian
point of view
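For readers unfamiliar with the prior family named in the title, a generic spike-and-slab construction places on each network weight a mixture of a point mass at zero and a continuous slab; the schematic form below is standard and illustrative, not necessarily the exact prior and hyper-priors used in the paper:

\[
  w_j \mid \gamma_j \;\sim\; \gamma_j\,\widetilde{\pi}(w_j) + (1-\gamma_j)\,\delta_0(w_j),
  \qquad \gamma_j \mid \theta \sim \mathrm{Bernoulli}(\theta),
\]

where $\widetilde{\pi}$ is the slab density, $\delta_0$ is the spike at zero, and $\theta$ governs the expected fraction of active nodes and links; posterior mass on configurations with many $\gamma_j = 0$ is what yields the sparse, non-overfitting networks described above.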
Better Approximations of High Dimensional Smooth Functions by Deep Neural Networks with Rectified Power Units
Deep neural networks with rectified linear units (ReLU) are getting more and
more popular due to their universal representation power and successful
applications. Some theoretical progress regarding the approximation power of
deep ReLU network for functions in Sobolev space and Korobov space have
recently been made by [D. Yarotsky, Neural Network, 94:103-114, 2017] and [H.
Montanelli and Q. Du, SIAM J Math. Data Sci., 1:78-92, 2019], etc. In this
paper, we show that deep networks with rectified power units (RePU) can give
better approximations for smooth functions than deep ReLU networks. Our
analysis is based on classical polynomial approximation theory and efficient
algorithms proposed in this paper to convert polynomials into deep RePU
networks of optimal size with no approximation error. Compared to the results
on ReLU networks, the sizes of the RePU networks required to approximate functions
in Sobolev and Korobov spaces with an error tolerance $\varepsilon$ are, by
our constructive proofs, in general smaller, by an explicit multiplicative factor,
than the sizes of the corresponding ReLU networks constructed in most of the
existing literature.
Compared to the classical results of Mhaskar [Mhaskar, Adv. Comput. Math.
1:61-80, 1993], our constructions use fewer activation functions and are
numerically more stable; they can serve as good initializations of deep RePU
networks that are further trained to break the limit of linear approximation theory.
The functions represented by RePU networks are themselves smooth, so they
naturally fit settings in which derivatives enter the loss function.
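As a concrete illustration of why RePU units reproduce polynomials exactly (a standard algebraic identity, stated here for convenience rather than quoted from the paper): with the rectified quadratic unit $\sigma_2(x) = \max(0, x)^2$,

\[
  x^2 = \sigma_2(x) + \sigma_2(-x), \qquad
  xy = \tfrac{1}{4}\bigl[\sigma_2(x+y) + \sigma_2(-x-y) - \sigma_2(x-y) - \sigma_2(-x+y)\bigr],
\]

so squares and products, and hence arbitrary polynomials by composition, are realized by small RePU networks with no approximation error; this is the mechanism behind converting polynomial approximants into RePU networks of optimal size.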
On the Convergence of Gradient Descent Training for Two-layer ReLU-networks in the Mean Field Regime
We describe a necessary and sufficient condition for the convergence to
minimum Bayes risk when training two-layer ReLU-networks by gradient descent in
the mean field regime with omni-directional initial parameter distribution.
This article extends recent results of Chizat and Bach to ReLU-activated
networks and to the situation in which there are no parameters which exactly
achieve MBR. The condition does not depend on the initialization of parameters
and concerns only the weak convergence of the realization of the neural
network, not its parameter distribution
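For context, the mean-field description referenced here is the standard one (notation chosen for illustration): a two-layer ReLU network is written as an integral over a distribution $\mu$ on parameters,

\[
  f_\mu(x) = \int a\,\max(0, \langle w, x\rangle + b)\, d\mu(a, w, b),
\]

with a width-$m$ network recovered when $\mu$ is the empirical measure of its $m$ neurons; gradient descent on the neurons induces a flow on $\mu$, and the convergence condition in the abstract is stated in terms of the realized function $f_\mu$ rather than of $\mu$ itself.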
Deep ReLU networks overcome the curse of dimensionality for bandlimited functions
We prove a theorem concerning the approximation of bandlimited multivariate
functions by deep ReLU networks for which the curse of dimensionality is
overcome. Our theorem is based on a result by Maurey and on the ability of deep
ReLU networks to approximate Chebyshev polynomials and analytic functions
efficiently
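As background (standard definitions, not specific to this paper): a function is bandlimited with bandwidth $B$ if its Fourier transform is supported in the ball of radius $B$, i.e. $f(x) = \int_{|\xi| \le B} \hat f(\xi)\, e^{i\langle \xi, x\rangle}\, d\xi$, and Maurey-type arguments show that any element of the closed convex hull of functions bounded by $M$ in a Hilbert space can be approximated by a convex combination of $N$ of them with error at most $M/\sqrt{N}$, a rate with no explicit dependence on the input dimension.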
Solving Irregular and Data-enriched Differential Equations using Deep Neural Networks
Recent work has introduced a simple numerical method for solving partial
differential equations (PDEs) with deep neural networks (DNNs). This paper
reviews and extends the method while applying it to analyze one of the most
fundamental features in numerical PDEs and nonlinear analysis: irregular
solutions. First, the Sod shock tube solution to compressible Euler equations
is discussed, analyzed, and then compared to conventional finite element and
finite volume methods. These methods are extended to consider performance
improvements and simultaneous parameter space exploration. Next, a shock
solution to compressible magnetohydrodynamics (MHD) is solved for, and used in
a scenario where experimental data is utilized to enhance a PDE system that is
\emph{a priori} insufficient to validate against the observed/experimental
data. This is accomplished by enriching the model PDE system with source terms
and using supervised training on synthetic experimental data. The resulting DNN
framework for PDEs demonstrates remarkable ease of system prototyping and
natural integration of large data sets (be they synthetic or experimental),
while simultaneously enabling single-pass exploration of the entire parameter
space.
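To make the reviewed method concrete, the following is a minimal sketch of the DNN-for-PDE idea described above, assuming PyTorch is available; the toy advection equation, the learned source-term network, and the synthetic data are illustrative stand-ins, not the Euler or MHD systems studied in the paper.

    # Minimal PINN-style sketch: a network u(x, t) is trained on a PDE residual
    # for u_t + u_x = s(x) plus a supervised data-fit term, with s(x) a learned
    # source-term enrichment (all choices here are illustrative assumptions).
    import torch

    torch.manual_seed(0)

    def mlp(sizes):
        layers = []
        for a, b in zip(sizes[:-1], sizes[1:]):
            layers += [torch.nn.Linear(a, b), torch.nn.Tanh()]
        return torch.nn.Sequential(*layers[:-1])  # drop the final activation

    u_net = mlp([2, 32, 32, 1])   # approximates u(x, t)
    s_net = mlp([1, 16, 1])       # learned source-term enrichment s(x)

    def grad(f, x):
        return torch.autograd.grad(f, x, torch.ones_like(f), create_graph=True)[0]

    opt = torch.optim.Adam(list(u_net.parameters()) + list(s_net.parameters()), lr=1e-3)

    # Synthetic "experimental" measurements of u at scattered (x, t) points.
    xd, td = torch.rand(200, 1), torch.rand(200, 1)
    ud = torch.sin(2 * torch.pi * (xd - td))        # stand-in for observed data

    for it in range(2000):
        x = torch.rand(500, 1, requires_grad=True)  # collocation points
        t = torch.rand(500, 1, requires_grad=True)
        u = u_net(torch.cat([x, t], dim=1))
        residual = grad(u, t) + grad(u, x) - s_net(x)   # u_t + u_x - s(x)
        loss = (residual ** 2).mean() \
             + ((u_net(torch.cat([xd, td], dim=1)) - ud) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()

The two loss terms play the roles described in the abstract: the first enforces the (possibly enriched) PDE at sampled collocation points, while the second drives the source-term enrichment toward the synthetic experimental data.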
A Theoretical Analysis of Deep Neural Networks and Parametric PDEs
We derive upper bounds on the complexity of ReLU neural networks
approximating the solution maps of parametric partial differential equations.
In particular, we use the inherent low-dimensionality of the solution manifold,
without any knowledge of its concrete shape, to obtain approximation rates which
are significantly superior to those provided by classical neural network
approximation results. Concretely, we use the existence of a small reduced
basis to construct, for a large variety of parametric partial differential
equations, neural networks that yield approximations of the parametric solution
maps in such a way that the sizes of these networks essentially only depend on
the size of the reduced basis
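Schematically (notation chosen here for illustration), the reduced-basis argument works as follows: if the solution manifold $\{u(y) : y \in \mathcal{P}\}$ admits a basis $\psi_1, \dots, \psi_m$ with

\[
  \sup_{y \in \mathcal{P}} \; \inf_{c \in \mathbb{R}^m} \Bigl\| u(y) - \sum_{i=1}^{m} c_i \psi_i \Bigr\| \le \varepsilon
\]

for a small $m$, then a neural network only has to emulate the coefficient map $y \mapsto (c_1(y), \dots, c_m(y))$, so its size is governed by $m$ and the target accuracy rather than by the dimension of the full discretization.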
Kolmogorov Width Decay and Poor Approximators in Machine Learning: Shallow Neural Networks, Random Feature Models and Neural Tangent Kernels
We establish a scale separation of Kolmogorov width type between subspaces of
a given Banach space under the condition that a sequence of linear maps
converges much faster on one of the subspaces. The general technique is then
applied to show that reproducing kernel Hilbert spaces are poor
$L^2$-approximators for the class of two-layer neural networks in high
dimension, and that multi-layer networks with small path norm are poor
approximators for certain Lipschitz functions, also in the $L^2$-topology
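For reference, the Kolmogorov width underlying this scale separation is the standard quantity (definition stated here for convenience): for a set $K$ in a Banach space $X$,

\[
  d_n(K)_X = \inf_{\dim V \le n} \; \sup_{f \in K} \; \inf_{g \in V} \|f - g\|_X,
\]

the best worst-case error achievable by any $n$-dimensional linear subspace $V$; slow decay of $d_n$ for the class of two-layer networks is what makes fixed linear feature spaces, such as RKHSs and random-feature models, poor approximators of that class.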
Approximation Rates for Neural Networks with General Activation Functions
We prove some new results concerning the approximation rate of neural
networks with general activation functions. Our first result concerns the rate
of approximation of a two-layer neural network with a polynomially decaying
non-sigmoidal activation function. We extend the dimension independent
approximation rates previously obtained to this new class of activation
functions. Our second result gives a weaker, but still dimension independent,
approximation rate for a larger class of activation functions, removing the
polynomial decay assumption. This result applies to any bounded, integrable
activation function. Finally, we show that a stratified sampling approach can
be used to improve the approximation rate for polynomially decaying activation
functions under mild additional assumptions
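The dimension-independent rates referred to here have the generic Maurey/Barron form (schematic; the precise norm, constant, and function class follow the paper's setting): for a target $f$ in the relevant class there is a two-layer network $f_N$ with $N$ neurons such that

\[
  \|f - f_N\|_{L^2} \le \frac{C_f}{\sqrt{N}},
\]

with $C_f$ depending on the function class but not explicitly on the input dimension; the stratified-sampling result of the abstract improves on this rate under the stated additional assumptions.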
Techniques for Gradient Based Bilevel Optimization with Nonsmooth Lower Level Problems
We propose techniques for approximating bilevel optimization problems with
non-smooth lower level problems that can have a non-unique solution. To this
end, we substitute the expression of a minimizer of the lower level
minimization problem with an iterative algorithm that is guaranteed to converge
to a minimizer of the problem. Using suitable non-linear proximal distance
functions, the update mappings of such an iterative algorithm can be made
differentiable, even though the minimization problem itself is
non-smooth
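A minimal sketch of the unrolling idea, assuming PyTorch; the l1-regularized least-squares lower level, the ISTA inner solver, and the validation-style upper objective are generic illustrations, not the paper's specific non-linear proximal-distance construction.

    # Bilevel optimization by differentiating through an unrolled lower-level
    # solver: the upper-level variable is the (log) l1 weight, the lower level
    # is solved by K proximal-gradient (ISTA) steps, and autograd through the
    # unrolled steps supplies the hypergradient.
    import torch

    torch.manual_seed(0)
    A = torch.randn(30, 10)
    x_true = torch.zeros(10); x_true[:3] = 1.0
    b = A @ x_true + 0.01 * torch.randn(30)     # noisy measurements (lower level)
    b_val = A @ x_true                          # noiseless target (upper level)

    log_lam = torch.zeros((), requires_grad=True)          # upper-level variable
    outer_opt = torch.optim.Adam([log_lam], lr=0.05)
    step = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2   # ISTA step size

    for outer_it in range(100):
        lam = log_lam.exp()
        x = torch.zeros(10)
        for _ in range(50):                     # unrolled, differentiable inner solver
            z = x - step * (A.T @ (A @ x - b))
            x = torch.sign(z) * torch.clamp(z.abs() - step * lam, min=0.0)
        upper_loss = ((A @ x - b_val) ** 2).mean()
        outer_opt.zero_grad(); upper_loss.backward(); outer_opt.step()

The soft-thresholding update is differentiable almost everywhere in the upper-level variable, which is what makes backpropagating through the unrolled iterations meaningful; the paper's non-linear proximal distances serve the same purpose for a broader family of non-smooth lower-level problems.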
Rates of Convergence of Spectral Methods for Graphon Estimation
This paper studies the problem of estimating the graphon model, the
underlying generating mechanism of a network. Graphon estimation arises in many
applications such as predicting missing links in networks and learning user
preferences in recommender systems. The graphon model deals with a random graph
of $n$ vertices such that each pair of vertices $i$ and $j$ is connected
independently with probability $\rho\, f(x_i, x_j)$, where $x_i$ is the
unknown $d$-dimensional label of vertex $i$, $f$ is an unknown symmetric
function, and $\rho$ is a scaling parameter characterizing the graph sparsity.
Recent studies have identified the minimax error rate of estimating the graphon
from a single realization of the random graph. However, there exists a wide gap
between the known error rates of computationally efficient estimation
procedures and the minimax optimal error rate.
Here we analyze a spectral method, namely the universal singular value
thresholding (USVT) algorithm, in the relatively sparse regime with the average
vertex degree $n\rho = \Omega(\log n)$. When $f$ belongs to a H\"{o}lder or Sobolev
space with smoothness index $\alpha$, we show the error rate of USVT is at most
$(n\rho)^{-2\alpha/(2\alpha+d)}$, approaching the minimax optimal error
rate as $\alpha$ increases. Furthermore, when $f$
is analytic, we show the error rate of USVT is at most
$\log^{d}(n\rho)/(n\rho)$. In the special case of the stochastic block model with $k$
blocks, the error rate of USVT is at most $k/(n\rho)$, which is larger than the
minimax optimal error rate by at most a multiplicative factor $k/\log k$. This
coincides with the computational gap observed for community detection. A key
step of our analysis is to derive the eigenvalue decay rate of the edge
probability matrix using piecewise polynomial approximations of the graphon
function
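A schematic NumPy implementation of the generic USVT recipe may help make the spectral method concrete; the threshold constant, the density estimate, and the clipping range below are illustrative choices, not the exact scaling used in the paper's analysis.

    # Universal singular value thresholding (USVT): SVD the adjacency matrix,
    # keep only singular values above ~sqrt(n * density), and clip the
    # reconstruction back to valid edge probabilities.
    import numpy as np

    def usvt(adj, tau_const=2.01):
        """adj: symmetric 0/1 adjacency matrix of shape (n, n)."""
        n = adj.shape[0]
        p_hat = adj.mean()                                   # crude density estimate
        u, s, vt = np.linalg.svd(adj, full_matrices=False)
        tau = tau_const * np.sqrt(n * max(p_hat, 1.0 / n))   # threshold ~ sqrt(n * rho)
        keep = s >= tau
        m_hat = (u[:, keep] * s[keep]) @ vt[keep, :]         # low-rank reconstruction
        return np.clip(m_hat, 0.0, 1.0)                      # project onto [0, 1]

    # Tiny usage example on a 2-block stochastic block model.
    rng = np.random.default_rng(0)
    n = 200
    labels = rng.integers(0, 2, size=n)
    probs = np.where(labels[:, None] == labels[None, :], 0.3, 0.05)
    adj = rng.binomial(1, probs)
    adj = np.triu(adj, 1); adj = adj + adj.T                 # symmetrize, zero diagonal
    print(np.abs(usvt(adj) - probs).mean())                  # average estimation error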
- …