Deep Neural Networks with Trainable Activations and Controlled Lipschitz Constant
We introduce a variational framework to learn the activation functions of
deep neural networks. Our aim is to increase the capacity of the network while
controlling an upper-bound of the actual Lipschitz constant of the input-output
relation. To that end, we first establish a global bound for the Lipschitz
constant of neural networks. Based on the obtained bound, we then formulate a
variational problem for learning activation functions. Our variational problem
is infinite-dimensional and is not computationally tractable. However, we prove
that there always exists a solution that has continuous and piecewise-linear
(linear-spline) activations. This reduces the original problem to a
finite-dimensional minimization where an l1 penalty on the parameters of the
activations favors the learning of sparse nonlinearities. We numerically
compare our scheme with standard ReLU networks and their variations, PReLU and
LeakyReLU, and we empirically demonstrate the practical aspects of our
framework.
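The global bound the abstract refers to can be illustrated with the standard product bound (a generic sketch, not necessarily the paper's exact bound): the Lipschitz constant of a feed-forward network is at most the product of the layers' spectral norms times the activations' Lipschitz constants.

```python
import numpy as np

def lipschitz_upper_bound(weights, act_lipschitz=1.0):
    """Product upper bound on the Lipschitz constant of
    x -> W_L s(W_{L-1} s(... s(W_1 x))): the spectral norms of the
    weight matrices times the activation's Lipschitz constant for
    each hidden layer.  Standard and generally loose, but global."""
    bound = 1.0
    for i, W in enumerate(weights):
        bound *= np.linalg.norm(W, 2)        # largest singular value
        if i < len(weights) - 1:             # activation after hidden layers only
            bound *= act_lipschitz
    return bound

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((16, 8)), rng.standard_normal((4, 16))]
print(lipschitz_upper_bound(Ws))             # 1-Lipschitz activations (e.g. ReLU)
```

Constraining each factor in this product is one way to control the network's overall Lipschitz constant while the activations are being learned.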
Generalization Error in Deep Learning
Deep learning models have lately shown great performance in various fields
such as computer vision, speech recognition, speech translation, and natural
language processing. However, alongside their state-of-the-art performance, the
source of their generalization ability remains generally unclear.
Thus, an important question is what makes deep neural networks able to
generalize well from the training set to new data. In this article, we provide
an overview of the existing theory and bounds for the characterization of the
generalization error of deep neural networks, combining both classical and more
recent theoretical and empirical results.
When Can Nonconvex Optimization Problems be Solved with Gradient Descent? A Few Case Studies
Gradient descent and related algorithms are ubiquitously used to solve optimization problems arising in machine learning and signal processing. In many cases, these problems are nonconvex, yet such simple algorithms are still effective. In an attempt to better understand this phenomenon, we study a number of nonconvex problems, proving that they can be solved efficiently with gradient descent. We will consider complete, orthogonal dictionary learning, and present a geometric analysis allowing us to obtain efficient convergence rates for gradient descent that hold with high probability. We also show that similar geometric structure is present in other nonconvex problems such as generalized phase retrieval.
Turning next to neural networks, we will also calculate conditions on certain classes of networks under which signals and gradients propagate through the network in a stable manner during the initial stages of training. Initialization schemes derived using these calculations allow training recurrent networks on long sequence tasks, and in the case of networks with low precision activation functions they make explicit a tradeoff between the reduction in precision and the maximal depth of a model that can be trained with gradient descent.
We finally consider manifold classification with a deep feed-forward neural network, for a particularly simple configuration of the manifolds. We provide an end-to-end analysis of the training process, proving that under certain conditions on the architectural hyperparameters of the network, it can successfully classify any point on the manifolds with high probability given a sufficient number of independent samples from the manifold, in a timely manner. Our analysis relates the depth and width of the network to its fitting capacity and statistical regularity, respectively, in the early stages of training.
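The phenomenon studied here can be seen in miniature: plain gradient descent on the nonconvex least-squares objective of generalized phase retrieval typically recovers the signal up to a global sign. The snippet below is an illustrative sketch with synthetic data and our own step-size choices, not the thesis's analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 10, 200                                # signal dimension, number of measurements
x_true = rng.standard_normal(n)
x_true /= np.linalg.norm(x_true)
A = rng.standard_normal((m, n))
y = (A @ x_true) ** 2                         # magnitude-only (phaseless) measurements

def loss(x):
    return np.mean(((A @ x) ** 2 - y) ** 2) / 4.0

def grad(x):
    r = (A @ x) ** 2 - y
    return (A.T @ (r * (A @ x))) / m

x0 = rng.standard_normal(n)                   # random initialization on the sphere
x0 /= np.linalg.norm(x0)
x = x0.copy()
for _ in range(5000):
    x -= 0.02 * grad(x)                       # plain gradient descent

# distance to x_true modulo the inherent global-sign ambiguity
err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true))
```

With enough measurements the landscape has no spurious local minima with high probability, which is why such a naive scheme can succeed despite nonconvexity.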
Connections Between Numerical Algorithms for PDEs and Neural Networks
We investigate numerous structural connections between numerical algorithms for partial differential equations (PDEs) and
neural architectures. Our goal is to transfer the rich set of mathematical foundations from the world of PDEs to neural networks.
Besides structural insights, we provide concrete examples and experimental evaluations of the resulting architectures. Using
the example of generalised nonlinear diffusion in 1D, we consider explicit schemes, acceleration strategies thereof, implicit
schemes, and multigrid approaches. We connect these concepts to residual networks, recurrent neural networks, and U-net
architectures. Our findings inspire a symmetric residual network design with provable stability guarantees and justify the
effectiveness of skip connections in neural networks from a numerical perspective. Moreover, we present U-net architectures
that implement multigrid techniques for learning efficient solutions of partial differential equation models, and motivate
uncommon design choices such as trainable nonmonotone activation functions. Experimental evaluations show that the
proposed architectures save half of the trainable parameters and can thus outperform standard ones with the same model
complexity. Our considerations serve as a basis for explaining the success of popular neural architectures and provide a
blueprint for developing new mathematically well-founded neural building blocks.
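The correspondence between an explicit diffusion step and a symmetric residual block can be made concrete. The sketch below (illustrative; the discretization and parameter choices are ours) implements one explicit step of 1D nonlinear diffusion, u <- u - tau * K^T phi(K u), which has exactly the structure of a symmetric residual block with activation phi.

```python
import numpy as np

n, tau = 64, 0.2                              # grid size, time-step size
# K: forward-difference matrix (discrete derivative, Neumann-style boundary)
K = np.zeros((n - 1, n))
for i in range(n - 1):
    K[i, i], K[i, i + 1] = -1.0, 1.0

def phi(s, lam=1.0):
    """Perona-Malik-type flux s * g(s**2) with diffusivity 1/(1+(s/lam)**2):
    a nonmonotone 'activation' arising from the diffusion viewpoint."""
    return s / (1.0 + (s / lam) ** 2)

def diffusion_step(u):
    """One explicit diffusion step  u <- u - tau * K.T @ phi(K @ u),
    structurally a symmetric residual block with shared weights K, K.T."""
    return u - tau * (K.T @ phi(K @ u))

u0 = np.sign(np.sin(np.linspace(0.0, 4.0 * np.pi, n)))  # piecewise-constant signal
u = u0.copy()
for _ in range(10):
    u = diffusion_step(u)
```

Because the inner and outer weights are transposes of each other, stability arguments from explicit diffusion schemes carry over, and the block conserves the signal's mean since each row of K sums to zero.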
Lifted Regression/Reconstruction Networks
In this work we propose lifted regression/reconstruction networks (LRRNs), which combine lifted neural networks with a guaranteed Lipschitz continuity property for the output layer. Lifted neural networks explicitly optimize an energy model to infer the unit activations and therefore—in contrast to standard feed-forward neural networks—allow bidirectional feedback between layers. So far lifted neural networks have been modelled around standard feed-forward architectures. We propose to take further advantage of the feedback property by letting the layers simultaneously perform regression and reconstruction. The resulting lifted network architecture makes it possible to control the desired amount of Lipschitz continuity, which is an important feature for obtaining adversarially robust regression and classification methods. We analyse and numerically demonstrate applications for unsupervised and supervised learning.
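The key mechanism, inferring activations by minimizing an energy rather than computing them feed-forward, can be sketched in a toy two-layer setting. The energy below and its projected-gradient solver are our simplification for illustration, not the exact LRRN formulation.

```python
import numpy as np

def lifted_activations(x, W1, W2, t, rho=1.0, eta=0.05, iters=200):
    """Infer nonnegative hidden activations z by projected gradient descent
    on the lifted energy  E(z) = ||z - W1 x||^2 + rho * ||t - W2 z||^2.
    Unlike a feed-forward pass (z = relu(W1 x)), the output-layer term
    feeds back into z, so information flows in both directions."""
    z = np.maximum(W1 @ x, 0.0)               # feed-forward initialization
    for _ in range(iters):
        g = 2.0 * (z - W1 @ x) - 2.0 * rho * (W2.T @ (t - W2 @ z))
        z = np.maximum(z - eta * g, 0.0)      # projected gradient step
    return z
```

The energy is convex in z, so for a small enough step size each projected step decreases it, and the inferred activations end up at least as good (in energy) as the feed-forward ones.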
Global Convergence of SGD For Logistic Loss on Two Layer Neural Nets
In this note, we demonstrate a first-of-its-kind provable convergence of SGD
to the global minima of appropriately regularized logistic empirical risk of
depth-2 nets -- for arbitrary data and with any number of gates with
adequately smooth and bounded activations like sigmoid and tanh. We also prove
an exponentially fast convergence rate for continuous-time SGD that also
applies to smooth unbounded activations like SoftPlus. Our key idea is to show
the existence of Frobenius-norm-regularized logistic loss functions on
constant-sized neural nets which are "Villani functions", which allows us to
build on recent progress in analyzing SGD on such objectives.
Comment: 18 pages, 1 figure. arXiv admin note: substantial text overlap with arXiv:2210.1145
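A toy version of the setting, minibatch SGD on the Frobenius-norm-regularized logistic empirical risk of a two-layer net with sigmoid gates, can be sketched as follows. The data, hyperparameters, and architecture details are our own illustrative choices, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, width, lam = 200, 5, 8, 0.05            # samples, input dim, gates, reg. weight
X = rng.standard_normal((n, d))
y = (X @ rng.standard_normal(d) > 0).astype(float)   # binary labels

W = 0.5 * rng.standard_normal((width, d))     # inner-layer weights
a = rng.standard_normal(width) / width        # outer-layer weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reg_loss(W, a):
    """Logistic empirical risk plus a Frobenius-norm regularizer."""
    p = sigmoid(sigmoid(X @ W.T) @ a)
    eps = 1e-12
    ce = -np.mean(y * np.log(p + eps) + (1.0 - y) * np.log(1.0 - p + eps))
    return ce + 0.5 * lam * (np.sum(W ** 2) + np.sum(a ** 2))

def grads(W, a, idx):
    """Minibatch gradients via backprop through both sigmoid layers."""
    Xb, yb = X[idx], y[idx]
    h = sigmoid(Xb @ W.T)                     # hidden activations
    p = sigmoid(h @ a)
    e = p - yb                                # derivative of CE w.r.t. the logit
    gW = (np.outer(e, a) * h * (1.0 - h)).T @ Xb / len(idx) + lam * W
    ga = h.T @ e / len(idx) + lam * a
    return gW, ga

loss0 = reg_loss(W, a)
for _ in range(3000):                         # minibatch SGD
    idx = rng.integers(0, n, size=32)
    gW, ga = grads(W, a, idx)
    W -= 0.2 * gW
    a -= 0.2 * ga
```

The regularizer is what makes the objective well-behaved enough for the global-convergence analysis; empirically, SGD drives this regularized loss well below its initial value.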
NAIS-Net: Stable Deep Networks from Non-Autonomous Differential Equations
This paper introduces Non-Autonomous Input-Output Stable Network (NAIS-Net),
a very deep architecture where each stacked processing block is derived from a
time-invariant non-autonomous dynamical system. Non-autonomy is implemented by
skip connections from the block input to each of the unrolled processing stages
and allows stability to be enforced so that blocks can be unrolled adaptively
to a pattern-dependent processing depth. NAIS-Net induces non-trivial,
Lipschitz input-output maps, even for an infinite unroll length. We prove that
the network is globally asymptotically stable so that for every initial
condition there is exactly one input-dependent equilibrium assuming tanh units,
and multiple stable equilibria for ReL units. An efficient implementation that
enforces the stability under derived conditions for both fully-connected and
convolutional layers is also presented. Experimental results show how NAIS-Net
exhibits stability in practice, yielding a significant reduction in
generalization gap compared to ResNets.
Comment: NIPS 201
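The two defining ingredients, re-injecting the block input at every unrolled stage and enforcing a stable state update, can be sketched as below. Building the state matrix as A = -(R^T R) - eps*I is one simple way to make its symmetric part negative definite; it is a sketch of the stability idea, not the paper's exact derived conditions.

```python
import numpy as np

def nais_block(u, R, B, b, eps=0.5, h=0.2, tol=1e-6, max_unroll=2000):
    """Unroll a NAIS-Net-style block  x_{k+1} = x_k + h * tanh(A x_k + B u + b)
    with the block input u re-injected at every stage (non-autonomy).
    A = -(R.T @ R) - eps * I has a negative-definite symmetric part, giving
    a stable update, and unrolling stops once the state settles, i.e. a
    pattern-dependent processing depth."""
    dim = R.shape[1]
    A = -(R.T @ R) - eps * np.eye(dim)
    x = np.zeros(dim)
    for k in range(max_unroll):
        dx = h * np.tanh(A @ x + B @ u + b)
        x = x + dx
        if np.linalg.norm(dx) < tol:          # state has reached its equilibrium
            return x, k + 1
    return x, max_unroll
```

Each input u drives the state toward its own input-dependent equilibrium, which is the mechanism behind the adaptive unroll depth described in the abstract.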