
    Posterior Concentration for Sparse Deep Learning

    Spike-and-Slab Deep Learning (SS-DL) is a fully Bayesian alternative to Dropout for improving generalizability of deep ReLU networks. This new type of regularization enables provable recovery of smooth input-output maps with unknown levels of smoothness. Indeed, we show that the posterior distribution concentrates at the near-minimax rate for $\alpha$-Hölder smooth maps, performing as well as if we knew the smoothness level $\alpha$ ahead of time. Our result sheds light on architecture design for deep neural networks, namely the choice of depth, width and sparsity level. These network attributes typically depend on unknown smoothness in order to be optimal. We obviate this constraint with the fully Bayes construction. As an aside, we show that SS-DL does not overfit in the sense that the posterior concentrates on smaller networks with fewer (up to the optimal number of) nodes and links. Our results provide new theoretical justifications for deep ReLU networks from a Bayesian point of view.
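    The abstract does not spell out the prior, but a common spike-and-slab construction places a Bernoulli inclusion indicator on each weight, with a point mass at zero (spike) and a Gaussian slab. Below is a minimal illustrative sketch of one prior draw of a sparse two-layer ReLU network; the inclusion probability `theta` and slab scale are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_spike_and_slab(shape, theta=0.2, slab_scale=1.0):
    """Draw weights from a spike-and-slab prior: each entry is 0 with
    probability 1 - theta (spike) and N(0, slab_scale^2) with probability theta (slab)."""
    gamma = rng.random(shape) < theta           # Bernoulli inclusion indicators
    slab = rng.normal(0.0, slab_scale, shape)   # slab component
    return gamma * slab

def relu(x):
    return np.maximum(x, 0.0)

# One prior draw of a sparse two-layer ReLU network f(x) = W2 relu(W1 x + b1) + b2.
d, width = 10, 64
W1 = sample_spike_and_slab((width, d))
b1 = sample_spike_and_slab((width,))
W2 = sample_spike_and_slab((1, width))
b2 = np.zeros(1)

x = rng.normal(size=d)
y = W2 @ relu(W1 @ x + b1) + b2
print("active weights:", int((W1 != 0).sum() + (W2 != 0).sum()), "output:", y)
```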

    Better Approximations of High Dimensional Smooth Functions by Deep Neural Networks with Rectified Power Units

    Deep neural networks with rectified linear units (ReLU) are becoming more and more popular due to their universal representation power and successful applications. Some theoretical progress regarding the approximation power of deep ReLU networks for functions in Sobolev space and Korobov space has recently been made by [D. Yarotsky, Neural Networks, 94:103-114, 2017] and [H. Montanelli and Q. Du, SIAM J. Math. Data Sci., 1:78-92, 2019], among others. In this paper, we show that deep networks with rectified power units (RePU) can give better approximations of smooth functions than deep ReLU networks. Our analysis is based on classical polynomial approximation theory and on efficient algorithms, proposed in this paper, that convert polynomials into deep RePU networks of optimal size with no approximation error. Compared to the results on ReLU networks, the sizes of RePU networks required to approximate functions in Sobolev space and Korobov space with an error tolerance $\varepsilon$, by our constructive proofs, are in general $\mathcal{O}(\log\frac{1}{\varepsilon})$ times smaller than the sizes of the corresponding ReLU networks constructed in most of the existing literature. Compared to the classical results of Mhaskar [Mhaskar, Adv. Comput. Math., 1:61-80, 1993], our constructions use fewer activation functions and are numerically more stable; they can serve as good initializations of deep RePU networks and be trained further to break the limit of linear approximation theory. The functions represented by RePU networks are smooth, so they naturally fit in settings where derivatives are involved in the loss function. Comment: 28 pages, 4 figures
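    As a hint of why polynomials convert into RePU networks with no approximation error, the standard identities $x^2 = \sigma_2(x) + \sigma_2(-x)$ and $xy = \tfrac{1}{4}\big((x+y)^2 - (x-y)^2\big)$, with $\sigma_2(x) = \max(0,x)^2$, can be checked numerically. This is only the basic building block, not the paper's full conversion algorithm.

```python
import numpy as np

def repu(x, s=2):
    """Rectified power unit: max(0, x)**s (s=1 recovers ReLU)."""
    return np.maximum(x, 0.0) ** s

def square(x):
    # x^2 realized exactly by two ReQU (s=2) units.
    return repu(x) + repu(-x)

def product(x, y):
    # xy = ((x+y)^2 - (x-y)^2) / 4, each square built from two ReQU units.
    return 0.25 * (square(x + y) - square(x - y))

rng = np.random.default_rng(1)
x, y = rng.normal(size=1000), rng.normal(size=1000)
print(np.max(np.abs(square(x) - x**2)))    # exact up to round-off (~1e-15)
print(np.max(np.abs(product(x, y) - x*y)))
```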

    On the Convergence of Gradient Descent Training for Two-layer ReLU-networks in the Mean Field Regime

    We describe a necessary and sufficient condition for convergence to the minimum Bayes risk (MBR) when training two-layer ReLU networks by gradient descent in the mean field regime with an omni-directional initial parameter distribution. This article extends recent results of Chizat and Bach to ReLU-activated networks and to the situation in which there are no parameters that exactly achieve MBR. The condition does not depend on the initialization of the parameters and concerns only the weak convergence of the realization of the neural network, not its parameter distribution.
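    For orientation, a mean-field parameterization averages $m$ neurons with a $1/m$ factor in front of the output. The toy sketch below trains such a two-layer ReLU network by plain gradient descent; the width, target function, and step size are illustrative assumptions, and the snippet does not verify the paper's MBR condition.

```python
import torch

torch.manual_seed(0)
m, d, n = 2048, 5, 256                      # width, input dimension, sample size

# Mean-field parameterization: f(x) = (1/m) * sum_i a_i * relu(<w_i, x> + b_i)
a = torch.randn(m, requires_grad=True)
w = torch.randn(m, d, requires_grad=True)
b = torch.randn(m, requires_grad=True)

X = torch.randn(n, d)
y = torch.sin(X[:, 0])                      # toy regression target

def f(X):
    return torch.relu(X @ w.T + b) @ a / m

# Step size scaled with the width m, matching the usual mean-field time scaling.
opt = torch.optim.SGD([a, w, b], lr=1e-2 * m)
for step in range(2000):
    opt.zero_grad()
    loss = torch.mean((f(X) - y) ** 2)
    loss.backward()
    opt.step()
print("final training loss:", float(loss))
```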

    Deep ReLU networks overcome the curse of dimensionality for bandlimited functions

    We prove a theorem concerning the approximation of bandlimited multivariate functions by deep ReLU networks for which the curse of dimensionality is overcome. Our theorem is based on a result by Maurey and on the ability of deep ReLU networks to approximate Chebyshev polynomials and analytic functions efficiently.
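    A toy one-dimensional illustration of the building block behind such results: a one-hidden-layer ReLU network can represent the piecewise-linear interpolant of a Chebyshev polynomial exactly, and the interpolation error decays roughly like $N^{-2}$ in the number of hinges. This is only a didactic sketch, not the Maurey-based high-dimensional construction of the paper.

```python
import numpy as np

T5 = np.polynomial.chebyshev.Chebyshev([0, 0, 0, 0, 0, 1])   # Chebyshev polynomial T_5

def relu(x):
    return np.maximum(x, 0.0)

def relu_interpolant(f, N):
    """One-hidden-layer ReLU network realizing the piecewise-linear interpolant
    of f at N+1 uniform nodes on [-1, 1]."""
    t = np.linspace(-1.0, 1.0, N + 1)
    v = f(t)
    slopes = np.diff(v) / np.diff(t)
    # f_hat(x) = v[0] + slopes[0]*relu(x - t[0]) + sum_k (slopes[k]-slopes[k-1])*relu(x - t[k])
    coeff = np.concatenate(([slopes[0]], np.diff(slopes)))
    return lambda x: v[0] + relu(x[:, None] - t[None, :-1]) @ coeff

xs = np.linspace(-1, 1, 10001)
for N in (8, 16, 32, 64):
    err = np.max(np.abs(relu_interpolant(T5, N)(xs) - T5(xs)))
    print(N, err)        # sup-norm error decays roughly like 1/N^2
```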

    Solving Irregular and Data-enriched Differential Equations using Deep Neural Networks

    Recent work has introduced a simple numerical method for solving partial differential equations (PDEs) with deep neural networks (DNNs). This paper reviews and extends the method while applying it to analyze one of the most fundamental features in numerical PDEs and nonlinear analysis: irregular solutions. First, the Sod shock tube solution to the compressible Euler equations is discussed, analyzed, and compared to conventional finite element and finite volume methods. These methods are extended to consider performance improvements and simultaneous parameter space exploration. Next, a shock solution to compressible magnetohydrodynamics (MHD) is solved for and used in a scenario where experimental data are utilized to enhance a PDE system that is a priori insufficient to validate against the observed/experimental data. This is accomplished by enriching the model PDE system with source terms and using supervised training on synthetic experimental data. The resulting DNN framework for PDEs demonstrates a remarkable ease of system prototyping and natural integration of large data sets (whether synthetic or experimental), while simultaneously enabling single-pass exploration of the entire parameter space. Comment: 21 pages, 14 figures, 3 tables
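    A DNN-based PDE solver of this flavor typically minimizes the PDE residual, plus boundary and initial conditions, at sampled collocation points. The sketch below does this for a smooth 1D Poisson problem; the architecture, optimizer, and loss weighting are illustrative assumptions, far simpler than the shock-capturing Euler/MHD setups treated in the paper.

```python
import math
import torch

torch.manual_seed(0)

# MLP ansatz u_theta(x) for -u''(x) = pi^2 sin(pi x) on (0, 1), u(0) = u(1) = 0,
# whose exact solution is u(x) = sin(pi x).
net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(5000):
    x = torch.rand(128, 1, requires_grad=True)            # interior collocation points
    u = net(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    d2u = torch.autograd.grad(du.sum(), x, create_graph=True)[0]
    residual = -d2u - math.pi ** 2 * torch.sin(math.pi * x)
    xb = torch.tensor([[0.0], [1.0]])                      # boundary collocation points
    loss = (residual ** 2).mean() + (net(xb) ** 2).mean()  # residual + boundary penalty
    opt.zero_grad(); loss.backward(); opt.step()

xt = torch.linspace(0, 1, 101).unsqueeze(1)
print("max error:", float((net(xt) - torch.sin(math.pi * xt)).abs().max()))
```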

    A Theoretical Analysis of Deep Neural Networks and Parametric PDEs

    We derive upper bounds on the complexity of ReLU neural networks approximating the solution maps of parametric partial differential equations. In particular, we use the inherent low dimensionality of the solution manifold, without any knowledge of its concrete shape, to obtain approximation rates which are significantly superior to those provided by classical neural network approximation results. Concretely, we use the existence of a small reduced basis to construct, for a large variety of parametric partial differential equations, neural networks that yield approximations of the parametric solution maps in such a way that the sizes of these networks essentially only depend on the size of the reduced basis.
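    A minimal sketch of the reduced-basis idea that the result builds on: compress snapshots of a parametric solution map into a small basis, then learn the map from parameters to the reduced coefficients, so the network size is governed by the reduced dimension rather than the grid resolution. The toy "solver", basis size, and architecture below are illustrative assumptions.

```python
import numpy as np
import torch

torch.manual_seed(0)
rng = np.random.default_rng(0)

# Toy parametric "solution map" mu -> u(., mu), a stand-in for a parametric PDE solver.
xs = np.linspace(0.0, 1.0, 200)
def solve(mu):
    return np.sin(mu[0] * np.pi * xs) * np.exp(-mu[1] * xs)

# 1) Build a reduced basis from snapshots via SVD (proper orthogonal decomposition).
mus = rng.uniform([1.0, 0.0], [3.0, 2.0], size=(200, 2))
S = np.stack([solve(mu) for mu in mus], axis=1)           # snapshot matrix (n_x, n_snapshots)
U, _, _ = np.linalg.svd(S, full_matrices=False)
r = 8
V = U[:, :r]                                              # reduced basis with r modes

# 2) Learn the map from the parameter mu to the r reduced-basis coefficients.
coeffs = torch.tensor((V.T @ S).T, dtype=torch.float32)   # (n_snapshots, r) targets
P = torch.tensor(mus, dtype=torch.float32)
net = torch.nn.Sequential(torch.nn.Linear(2, 64), torch.nn.ReLU(), torch.nn.Linear(64, r))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(3000):
    loss = torch.mean((net(P) - coeffs) ** 2)
    opt.zero_grad(); loss.backward(); opt.step()

# The surrogate mu -> V @ net(mu) is sized by r, not by the grid resolution.
mu_test = np.array([2.2, 0.7])
u_hat = V @ net(torch.tensor(mu_test, dtype=torch.float32)).detach().numpy()
print("max error:", np.max(np.abs(u_hat - solve(mu_test))))
```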

    Kolmogorov Width Decay and Poor Approximators in Machine Learning: Shallow Neural Networks, Random Feature Models and Neural Tangent Kernels

    We establish a scale separation of Kolmogorov width type between subspaces of a given Banach space under the condition that a sequence of linear maps converges much faster on one of the subspaces. The general technique is then applied to show that reproducing kernel Hilbert spaces are poor $L^2$-approximators for the class of two-layer neural networks in high dimension, and that multi-layer networks with small path norm are poor approximators for certain Lipschitz functions, also in the $L^2$-topology.
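    For reference, the Kolmogorov width in the title is the standard quantity below; the paper's contribution is a scale-separation statement between subspaces, which the bare definition does not capture.

```latex
% Kolmogorov n-width of a set K in a Banach space X: the best worst-case error
% achievable by approximating K with a single n-dimensional linear subspace V_n.
\[
  d_n(K, X) \;=\; \inf_{\substack{V_n \subset X \\ \dim V_n = n}}
  \;\sup_{f \in K}\;\inf_{g \in V_n} \,\| f - g \|_{X}.
\]
```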

    Approximation Rates for Neural Networks with General Activation Functions

    We prove some new results concerning the approximation rates of neural networks with general activation functions. Our first result concerns the rate of approximation of a two-layer neural network with a polynomially decaying, non-sigmoidal activation function. We extend the dimension-independent approximation rates previously obtained to this new class of activation functions. Our second result gives a weaker, but still dimension-independent, approximation rate for a larger class of activation functions, removing the polynomial decay assumption. This result applies to any bounded, integrable activation function. Finally, we show that a stratified sampling approach can be used to improve the approximation rate for polynomially decaying activation functions under mild additional assumptions.
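    The dimension-independent rates referred to here typically arise from a Monte Carlo (Maurey-type) sampling argument applied to an integral representation of the target function. Below is a standard sketch of that mechanism, not the paper's refined stratified-sampling bound.

```latex
% Standard Maurey/Monte Carlo mechanism behind dimension-independent rates:
% if the target admits the integral representation
\[
  f(x) \;=\; \int a(\omega)\,\sigma(\langle \omega, x \rangle)\, d\pi(\omega),
\]
% then drawing \omega_1, \dots, \omega_n \sim \pi i.i.d. and forming
\[
  f_n(x) \;=\; \frac{1}{n} \sum_{i=1}^{n} a(\omega_i)\,\sigma(\langle \omega_i, x \rangle)
\]
% yields, for a constant C depending on the representation but not on the dimension,
\[
  \mathbb{E}\,\| f - f_n \|_{L^2}^{2} \;\le\; \frac{C}{n},
\]
% i.e. an O(n^{-1/2}) rate for a two-layer network with n activation units.
```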

    Techniques for Gradient Based Bilevel Optimization with Nonsmooth Lower Level Problems

    We propose techniques for approximating bilevel optimization problems with non-smooth lower-level problems that can have a non-unique solution. To this end, we replace the minimizer of the lower-level problem with an iterative algorithm that is guaranteed to converge to such a minimizer. Using suitable non-linear proximal distance functions, the update mappings of such an iterative algorithm can be differentiable, notwithstanding the fact that the minimization problem itself is non-smooth.
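    A minimal sketch of the unrolling idea: replace the lower-level argmin with a fixed number of iterations of a convergent algorithm and backpropagate the upper-level loss through them. The sketch below uses plain ISTA with soft-thresholding on a lasso-type lower-level problem, rather than the non-linear proximal distance functions proposed in the paper; the problem sizes and target are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Lower-level problem: x*(lam) = argmin_x 0.5*||A x - b||^2 + lam*||x||_1.
# Replace the argmin by T unrolled ISTA iterations and differentiate the
# upper-level loss through them to obtain a gradient with respect to lam.
A = torch.randn(30, 20)
b = torch.randn(30)
x_target = torch.randn(20)                      # toy upper-level target
lam = torch.tensor(0.5, requires_grad=True)     # upper-level variable

def soft_threshold(z, t):
    return torch.sign(z) * torch.clamp(z.abs() - t, min=0.0)

def lower_level(lam, T=200):
    step = 1.0 / torch.linalg.matrix_norm(A, ord=2) ** 2   # 1 / Lipschitz constant
    x = torch.zeros(20)
    for _ in range(T):                          # unrolled, differentiable iterations
        x = soft_threshold(x - step * (A.T @ (A @ x - b)), step * lam)
    return x

opt = torch.optim.Adam([lam], lr=1e-2)
for it in range(100):
    upper_loss = torch.sum((lower_level(lam) - x_target) ** 2)
    opt.zero_grad(); upper_loss.backward(); opt.step()
    with torch.no_grad():
        lam.clamp_(min=0.0)                     # keep the regularization weight nonnegative
print("lam:", float(lam), "upper loss:", float(upper_loss))
```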

    Rates of Convergence of Spectral Methods for Graphon Estimation

    This paper studies the problem of estimating the graphon model, the underlying generating mechanism of a network. Graphon estimation arises in many applications such as predicting missing links in networks and learning user preferences in recommender systems. The graphon model deals with a random graph of $n$ vertices such that each pair of vertices $i$ and $j$ is connected independently with probability $\rho \times f(x_i, x_j)$, where $x_i$ is the unknown $d$-dimensional label of vertex $i$, $f$ is an unknown symmetric function, and $\rho$ is a scaling parameter characterizing the graph sparsity. Recent studies have identified the minimax error rate of estimating the graphon from a single realization of the random graph. However, there exists a wide gap between the known error rates of computationally efficient estimation procedures and the minimax optimal error rate. Here we analyze a spectral method, namely the universal singular value thresholding (USVT) algorithm, in the relatively sparse regime with average vertex degree $n\rho = \Omega(\log n)$. When $f$ belongs to a Hölder or Sobolev space with smoothness index $\alpha$, we show the error rate of USVT is at most $(n\rho)^{-2\alpha/(2\alpha+d)}$, approaching the minimax optimal error rate $\log(n\rho)/(n\rho)$ for $d=1$ as $\alpha$ increases. Furthermore, when $f$ is analytic, we show the error rate of USVT is at most $\log^d(n\rho)/(n\rho)$. In the special case of the stochastic block model with $k$ blocks, the error rate of USVT is at most $k/(n\rho)$, which is larger than the minimax optimal error rate by at most a multiplicative factor $k/\log k$. This coincides with the computational gap observed for community detection. A key step of our analysis is to derive the eigenvalue decay rate of the edge probability matrix using piecewise polynomial approximations of the graphon function $f$.
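    A minimal sketch of the USVT estimator analyzed here, run on a toy stochastic block model. The threshold constant $(2+\eta)\sqrt{n\hat{p}}$ and the diagonal handling follow common practice and are simplifying assumptions rather than the exact procedure of the paper.

```python
import numpy as np

def usvt(A, eta=0.01):
    """Universal singular value thresholding for estimating the edge-probability
    matrix from a single observed adjacency matrix A."""
    n = A.shape[0]
    p_hat = A.mean()                               # plug-in estimate of the edge density
    tau = (2.0 + eta) * np.sqrt(n * p_hat)         # threshold for the singular values
    U, s, Vt = np.linalg.svd(A)
    s_thr = np.where(s >= tau, s, 0.0)             # keep only the large singular values
    M_hat = (U * s_thr) @ Vt
    return np.clip(M_hat, 0.0, 1.0)                # project entries back to [0, 1]

# Toy check on a 2-block stochastic block model.
rng = np.random.default_rng(0)
n = 600
z = rng.integers(0, 2, n)                          # community labels
P = np.where(z[:, None] == z[None, :], 0.10, 0.02) # true edge probabilities
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1); A = A + A.T                     # symmetric adjacency, zero diagonal
print(np.sqrt(np.mean((usvt(A) - P) ** 2)))        # root mean-squared error per entry
```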