6,074 research outputs found
Transport Analysis of Infinitely Deep Neural Network
We investigated the feature map inside deep neural networks (DNNs) by
tracking the transport map. We are interested in the role of depth (why do DNNs
perform better than shallow models?) and the interpretation of DNNs (what do
intermediate layers do?). Despite the rapid development of their applications,
DNNs remain analytically unexplained because the hidden layers are nested and
the parameters are not faithful. Inspired by the integral representation of
shallow NNs, which is the continuum limit of the width, or the hidden unit
number, we developed the flow representation and transport analysis of DNNs.
The flow representation is the continuum limit of the depth or the hidden layer
number, and it is specified by an ordinary differential equation with a vector
field. We interpret an ordinary DNN as a transport map, or an Euler broken line
approximation of the flow. Technically speaking, a dynamical system is a
natural model for the nested feature maps. In addition, it opens a new way to
the coordinate-free treatment of DNNs by avoiding the redundant parametrization
of DNNs. Following Wasserstein geometry, we analyze a flow in three aspects:
dynamical system, continuity equation, and Wasserstein gradient flow. A key
finding is that we specified a series of transport maps of the denoising
autoencoder (DAE). Starting from the shallow DAE, this paper develops three
topics: the transport map of the deep DAE, the equivalence between the stacked
DAE and the composition of DAEs, and the development of the double continuum
limit or the integral representation of the flow representation. As partial
answers to the research questions, we found that deeper DAEs converge faster
and the extracted features are better; in addition, a deep Gaussian DAE
transports mass to decrease the Shannon entropy of the data distribution.
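
A minimal numerical sketch of the flow representation (the sizes and the vector field below are illustrative assumptions, not the paper's model): a residual layer x + h*v(x) is exactly one Euler step of the ODE dx/dt = v(x), so a deep stack of such layers traces the Euler broken line of the flow.

import numpy as np

# Sketch: a stack of residual layers as an Euler broken-line approximation
# of the flow dx/dt = v(x). The vector field v is a hypothetical stand-in
# (a fixed random two-layer map), not a trained DNN.
rng = np.random.default_rng(0)
d, width = 2, 16
W1 = rng.normal(size=(width, d)) / np.sqrt(d)
W2 = rng.normal(size=(d, width)) / np.sqrt(width)

def v(x):
    """Vector field of the flow (assumed form)."""
    return W2 @ np.tanh(W1 @ x)

def transport(x, depth, T=1.0):
    """Euler broken line: `depth` residual layers with step h = T/depth."""
    h = T / depth
    for _ in range(depth):
        x = x + h * v(x)          # one residual block = one Euler step
    return x

x0 = rng.normal(size=d)
for depth in (1, 10, 100, 1000):  # the broken line converges as depth grows
    print(depth, transport(x0, depth))
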
Kolmogorov Width Decay and Poor Approximators in Machine Learning: Shallow Neural Networks, Random Feature Models and Neural Tangent Kernels
We establish a scale separation of Kolmogorov width type between subspaces of
a given Banach space under the condition that a sequence of linear maps
converges much faster on one of the subspaces. The general technique is then
applied to show that reproducing kernel Hilbert spaces are poor
$L^2$-approximators for the class of two-layer neural networks in high
dimension, and that multi-layer networks with small path norm are poor
approximators for certain Lipschitz functions, also in the $L^2$-topology.
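
For reference, the Kolmogorov $n$-width of a class $K$ inside a Banach space $X$, the quantity whose decay is at stake here, is (standard definition)

\[ d_n(K; X) = \inf_{\dim V \le n} \, \sup_{f \in K} \, \inf_{g \in V} \, \|f - g\|_X, \]

the best worst-case error achievable by an $n$-dimensional linear subspace of $X$; roughly, a "poor approximator" is a class whose error against the target class decays slowly in this sense.
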
Transportation analysis of denoising autoencoders: a novel method for analyzing deep neural networks
The feature map obtained from the denoising autoencoder (DAE) is investigated
by determining transportation dynamics of the DAE, which is a cornerstone for
deep learning. Despite the rapid development in its application, deep neural
networks remain analytically unexplained, because the feature maps are nested
and parameters are not faithful. In this paper, we address the problem of
formulating the nested complex of parameters by regarding the feature map as a
transport map. Even when a feature map has different dimensions between input
and output, we can regard it as a transportation map by considering that both
the input and output spaces are embedded in a common high-dimensional space. In
addition, the trajectory is a geometric object and thus, is independent of
parameterization. In this manner, transportation can be regarded as a universal
character of deep neural networks. By determining and analyzing the
transportation dynamics, we can understand the behavior of a deep neural
network. In this paper, we investigate a fundamental case of deep neural
networks: the DAE. We derive the transport map of the DAE, and reveal that the
infinitely deep DAE transports mass to decrease a certain quantity, such as
entropy, of the data distribution. These results, though analytically simple,
shed light on the correspondence between deep neural networks and Wasserstein
gradient flows.
Comment: Accepted at NIPS 2017 workshop on Optimal Transport & Machine Learning (OTML2017)
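
A hedged numerical sketch of the entropy claim for Gaussian data, using the standard closed form for the optimal Gaussian DAE (an assumption of this sketch, not code from the paper):

import numpy as np

# For noise variance s2, the optimal DAE acts as the transport map
#   r(x) = x + s2 * d/dx log p_s2(x),
# where p_s2 is the noise-smoothed data density (standard score form, assumed).
# For data N(0, v) this gives r(x) = x * v / (v + s2), a contraction toward
# the mean, so the Gaussian entropy 0.5*log(2*pi*e*var) of the transported
# mass decreases.
rng = np.random.default_rng(0)
v, s2 = 1.0, 0.1
x = rng.normal(scale=np.sqrt(v), size=100_000)

def gaussian_entropy(samples):
    return 0.5 * np.log(2 * np.pi * np.e * samples.var())

r = x + s2 * (-x / (v + s2))   # one DAE layer as a transport map
print("entropy before:", gaussian_entropy(x))
print("entropy after :", gaussian_entropy(r))
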
A Convex Duality Framework for GANs
A generative adversarial network (GAN) is a minimax game between a generator
mimicking the true model and a discriminator distinguishing the samples
produced by the generator from the real training samples. Given an
unconstrained discriminator able to approximate any function, this game reduces
to finding the generative model minimizing a divergence measure, e.g. the
Jensen-Shannon (JS) divergence, to the data distribution. However, in practice
the discriminator is constrained to be in a smaller class $\mathcal{F}$ such as
neural nets. Then, a natural question is how the divergence minimization
interpretation changes as we constrain $\mathcal{F}$. In this work, we address
this question by developing a convex duality framework for analyzing GANs. For
a convex set $\mathcal{F}$, this duality framework interprets the original GAN
formulation as finding the generative model with minimum JS-divergence to the
distributions penalized to match the moments of the data distribution, with the
moments specified by the discriminators in $\mathcal{F}$. We show that this
interpretation more generally holds for f-GAN and Wasserstein GAN. As a
byproduct, we apply the duality framework to a hybrid of f-divergence and
Wasserstein distance. Unlike the f-divergence, we prove that the proposed
hybrid divergence changes continuously with the generative model, which
suggests regularizing the discriminator's Lipschitz constant in f-GAN and
vanilla GAN. We numerically evaluate the power of the suggested regularization
schemes for improving GAN training performance.
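
As a concrete instance of the suggested regularization, here is a minimal sketch of one standard way to control a discriminator's Lipschitz constant, a WGAN-GP-style gradient penalty; the paper's exact scheme may differ, and disc, real, fake are placeholder names (2-D feature batches assumed).

import torch

def gradient_penalty(disc, real, fake, weight=10.0):
    """WGAN-GP-style penalty keeping ||grad_x disc(x)|| near 1 on interpolates."""
    eps = torch.rand(real.size(0), 1, device=real.device)
    mix = (eps * real + (1.0 - eps) * fake).requires_grad_(True)
    out = disc(mix)
    grads, = torch.autograd.grad(out.sum(), mix, create_graph=True)
    return weight * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

# Illustrative use in the discriminator step:
#   d_loss = fake_scores.mean() - real_scores.mean() + gradient_penalty(D, x, x_fake)
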
On the Convergence of Gradient Descent Training for Two-layer ReLU-networks in the Mean Field Regime
We describe a necessary and sufficient condition for the convergence to
minimum Bayes risk when training two-layer ReLU-networks by gradient descent in
the mean field regime with omni-directional initial parameter distribution.
This article extends recent results of Chizat and Bach to ReLU-activated
networks and to the situation in which there are no parameters which exactly
achieve MBR. The condition does not depend on the initialization of parameters
and concerns only the weak convergence of the realization of the neural
network, not its parameter distribution.
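
For context, an informal statement of the mean-field parameterization (standard in this literature; details may differ from the paper): the network realization is identified with a probability distribution $\mu$ over single-unit parameters,

\[ f_\mu(x) = \int a \, \max(w \cdot x, 0) \, d\mu(a, w), \]

and gradient descent on a wide finite network approximates a Wasserstein gradient flow on $\mu$; the convergence condition above concerns $f_\mu$, the realization, rather than $\mu$ itself.
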
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
One of the mysteries in the success of neural networks is that randomly
initialized first-order methods like gradient descent can achieve zero training
loss even though the objective function is non-convex and non-smooth. This
paper demystifies this surprising phenomenon for two-layer fully connected ReLU
activated neural networks. For an $m$ hidden node shallow neural network with
ReLU activation and $n$ training data, we show that as long as $m$ is large enough
and no two inputs are parallel, randomly initialized gradient descent converges
to a globally optimal solution at a linear convergence rate for the quadratic
loss function.
Our analysis relies on the following observation: over-parameterization and
random initialization jointly restrict every weight vector to be close to its
initialization for all iterations, which allows us to exploit a strong
convexity-like property to show that gradient descent converges at a global
linear rate to the global optimum. We believe these insights are also useful in
analyzing deep models and other first-order methods.
Comment: ICLR 2019
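
A toy numerical sketch of the statement (widths, step size, and data are illustrative assumptions; this is not the paper's construction): with a wide hidden layer, fixed +/-1 output weights, and unit-norm inputs, plain gradient descent drives the quadratic training loss toward zero.

import numpy as np

rng = np.random.default_rng(0)
n, d, m = 20, 5, 2000                          # samples, input dim, hidden width
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit inputs; no two parallel a.s.
y = rng.normal(size=n)

W = rng.normal(size=(m, d))                    # trained first layer
a = rng.choice([-1.0, 1.0], size=m)            # fixed output layer (common simplification)

lr = 0.05
for t in range(1001):
    H = np.maximum(X @ W.T, 0.0)               # (n, m) hidden activations
    err = H @ a / np.sqrt(m) - y               # residuals
    if t % 200 == 0:
        print(t, 0.5 * np.sum(err ** 2))       # loss shrinks roughly geometrically
    act = (X @ W.T > 0).astype(float)          # ReLU gates
    grad_W = ((err[:, None] * act) * a[None, :]).T @ X / np.sqrt(m)
    W -= lr * grad_W
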
On Scalable and Efficient Computation of Large Scale Optimal Transport
Optimal Transport (OT) naturally arises in many machine learning
applications, yet the heavy computational burden limits its widespread use.
To address the scalability issue, we propose an implicit generative
learning-based framework called SPOT (Scalable Push-forward of Optimal
Transport). Specifically, we approximate the optimal transport plan by a
pushforward of a reference distribution, and cast the optimal transport problem
into a minimax problem. We then can solve OT problems efficiently using primal
dual stochastic gradient-type algorithms. We also show that we can recover the
density of the optimal transport plan using neural ordinary differential
equations. Numerical experiments on both synthetic and real datasets illustrate
that SPOT is robust and has favorable convergence behavior. SPOT also allows us
to efficiently sample from the optimal transport plan, which benefits
downstream applications such as domain adaptation.
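
For orientation, the problem being reparameterized is the Kantorovich formulation

\[ \min_{\pi \in \Pi(\mu, \nu)} \int c(x, y) \, d\pi(x, y), \]

over couplings $\pi$ of the two marginals $\mu$ and $\nu$. The SPOT construction described above replaces $\pi$ by a push-forward $(G_1, G_2)_{\#}\rho$ of a reference distribution $\rho$ and enforces the marginal constraints with multiplier (discriminator) terms, yielding the minimax problem solved by primal dual stochastic gradient methods; the exact Lagrangian is a detail of the paper not reproduced here.
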
The global optimum of shallow neural network is attained by ridgelet transform
We prove that the global minimum of the backpropagation (BP) training problem
of neural networks with an arbitrary nonlinear activation is given by the
ridgelet transform. A series of computational experiments show that there
exists an interesting similarity between the scatter plot of hidden parameters
in a shallow neural network after the BP training and the spectrum of the
ridgelet transform. By introducing a continuous model of neural networks, we
reduce the training problem to a convex optimization in an infinite dimensional
Hilbert space, and obtain the explicit expression of the global optimizer via
the ridgelet transform.
Comment: under review
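
For reference, the ridgelet transform and its reconstruction formula take the form, up to normalization,

\[ \mathcal{R}f(a, b) = \int_{\mathbb{R}^d} f(x) \, \overline{\psi(a \cdot x - b)} \, dx, \qquad f(x) = \iint \mathcal{R}f(a, b) \, \eta(a \cdot x - b) \, da \, db, \]

for an admissible pair $(\psi, \eta)$. The reconstruction integral is the continuous model of the hidden layer: it replaces the finite sum $\sum_j c_j \eta(a_j \cdot x - b_j)$ with an integral over a coefficient function, which is what makes the training problem convex in that function.
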
Integral Equations and Machine Learning
As both light transport simulation and reinforcement learning are ruled by
the same Fredholm integral equation of the second kind, reinforcement learning
techniques may be used for photorealistic image synthesis: Efficiency may be
dramatically improved by guiding light transport paths by an approximate
solution of the integral equation that is learned during rendering. In the
light of the recent advances in reinforcement learning for playing games, we
investigate the representation of an approximate solution of an integral
equation by artificial neural networks and derive a loss function for that
purpose. The resulting Monte Carlo and quasi-Monte Carlo methods train neural
networks with standard information instead of linear information and naturally
are able to generate an arbitrary number of training samples. The methods are
demonstrated for applications in light transport simulation.
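
The shared structure referred to above is a Fredholm integral equation of the second kind,

\[ f(x) = g(x) + \int k(x, y) \, f(y) \, dy: \]

in light transport, $f$ is the radiance field and $g$ the emitted radiance (the rendering equation); in reinforcement learning, $f$ is the value function and $g$ the immediate reward (the Bellman equation, with the expectation as the integral operator).
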
Measure, Manifold, Learning, and Optimization: A Theory Of Neural Networks
We present a formal measure-theoretical theory of neural networks (NN) built
on probability coupling theory. Our main contributions are summarized as
follows.
* Built on the formalism of probability coupling theory, we derive an
algorithm framework, named Hierarchical Measure Group and Approximate System
(HMGAS), nicknamed S-System, that is designed to learn the complex
hierarchical, statistical dependency in the physical world.
* We show that NNs are special cases of S-System when the probability kernels
assume certain exponential family distributions. Activation functions are
derived formally. We further endow geometry on NNs through information
geometry, show that intermediate feature spaces of NNs are stochastic
manifolds, and prove that "distance" between samples is contracted as layers
stack up.
* S-System shows that NNs are inherently stochastic, and under a set of realistic
boundedness and diversity conditions, it enables us to prove that for
large-size nonlinear deep NNs with a class of losses, including the hinge loss, all
local minima are global minima with zero loss error, and regions around the
minima are flat basins where all eigenvalues of Hessians are concentrated
around zero, using tools and ideas from mean field theory, random matrix
theory, and nonlinear operator equations.
* The S-System, the information-geometry structure, and the optimization behaviors
combined complete the analogy between the Renormalization Group (RG) and NNs. They
show that a NN is a complex adaptive system that estimates the statistical
dependency of microscopic objects, e.g., pixels, at multiple scales. Unlike the
clear-cut physical quantities produced by RG in physics, e.g., temperature, NNs
renormalize/recompose manifolds that emerge through learning/optimization and
divide the sample space into highly semantically meaningful groups that are
dictated by supervised labels (in supervised NNs).