On the Expressive Power of Deep Polynomial Neural Networks
We study deep neural networks with polynomial activations, particularly their
expressive power. For a fixed architecture and activation degree, a polynomial
neural network defines an algebraic map from weights to polynomials. The image
of this map is the functional space associated to the network, and it is an
irreducible algebraic variety upon taking closure. This paper proposes the
dimension of this variety as a precise measure of the expressive power of
polynomial neural networks. We obtain several theoretical results regarding
this dimension as a function of architecture, including an exact formula for
high activation degrees, as well as upper and lower bounds on layer widths in
order for deep polynomial networks to fill the ambient functional space. We
also present computational evidence that it is profitable in terms of
expressiveness for layer widths to increase monotonically and then decrease
monotonically. Finally, we link our study to favorable optimization properties
when training weights, and we draw intriguing connections with tensor and
polynomial decompositions.
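To make the proposed measure concrete, the dimension of the functional variety can be estimated numerically as the generic rank of the Jacobian of the weights-to-coefficients map. The sketch below (my own illustration, not the paper's code) does this for a tiny hypothetical architecture with activation x -> x**r; widths and degree are chosen purely for illustration.

```python
# Minimal sketch: estimate the dimension of the functional variety of a
# one-hidden-layer polynomial network as the generic rank of the Jacobian of
# the map from weights to output-polynomial coefficients.
import itertools
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, r = 2, 3, 2                      # widths and activation degree x -> x**r
exps = [e for e in itertools.product(range(r + 1), repeat=d_in) if sum(e) == r]
pts = rng.standard_normal((len(exps), d_in))     # generic interpolation points
vand = np.array([[np.prod(p ** np.array(e)) for e in exps] for p in pts])

def coeffs(w):
    """Weights -> coefficient vector of f(x) = W2 (W1 x)**r on the monomial basis."""
    W1 = w[:d_hidden * d_in].reshape(d_hidden, d_in)
    W2 = w[d_hidden * d_in:].reshape(1, d_hidden)
    vals = np.array([(W2 @ (W1 @ p) ** r).item() for p in pts])
    return np.linalg.solve(vand, vals)           # interpolation recovers the coefficients

w0 = rng.standard_normal(d_hidden * d_in + d_hidden)   # generic weight vector
eps = 1e-6
J = np.column_stack([(coeffs(w0 + eps * e) - coeffs(w0 - eps * e)) / (2 * eps)
                     for e in np.eye(len(w0))])        # finite-difference Jacobian
print("estimated dimension of the functional variety:", np.linalg.matrix_rank(J, tol=1e-4))
```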
Transport Analysis of Infinitely Deep Neural Network
We investigated the feature map inside deep neural networks (DNNs) by
tracking the transport map. We are interested in the role of depth (why do DNNs
perform better than shallow models?) and the interpretation of DNNs (what do
intermediate layers do?). Despite the rapid development of their applications,
DNNs remain analytically unexplained because the hidden layers are nested and
the parameters are not faithful. Inspired by the integral representation of
shallow NNs, which is the continuum limit of the width, or the hidden unit
number, we developed the flow representation and transport analysis of DNNs.
The flow representation is the continuum limit of the depth or the hidden layer
number, and it is specified by an ordinary differential equation with a vector
field. We interpret an ordinary DNN as a transport map, or an Euler broken line
approximation of the flow. Technically speaking, a dynamical system is a
natural model for the nested feature maps. In addition, it opens a new way to
the coordinate-free treatment of DNNs by avoiding the redundant parametrization
of DNNs. Following Wasserstein geometry, we analyze a flow in three aspects:
dynamical system, continuity equation, and Wasserstein gradient flow. A key
finding is that we specified a series of transport maps of the denoising
autoencoder (DAE). Starting from the shallow DAE, this paper develops three
topics: the transport map of the deep DAE, the equivalence between the stacked
DAE and the composition of DAEs, and the development of the double continuum
limit or the integral representation of the flow representation. As partial
answers to the research questions, we found that deeper DAEs converge faster
and the extracted features are better; in addition, a deep Gaussian DAE
transports mass to decrease the Shannon entropy of the data distribution.
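The "Euler broken line" view can be illustrated with a toy computation (my own sketch, not the paper's code): a residual network with L layers and step h = T/L is the Euler discretization of the flow of a vector field, and deepening the network recovers the continuous flow.

```python
# Minimal sketch: a residual network as an Euler broken-line approximation of
# the flow of a vector field v(x, t); here v is a toy contracting field, where
# a trained DNN would instead use learned layers.
import numpy as np

def v(x, t):
    return -x                      # toy vector field; its flow map is x(T) = exp(-T) x(0)

def resnet_forward(x0, T=1.0, L=8):
    """Depth-L 'network': x_{k+1} = x_k + h * v(x_k, t_k)."""
    h, x = T / L, x0.copy()
    for k in range(L):
        x = x + h * v(x, k * h)    # one residual block = one Euler step
    return x

x0 = np.array([1.0, -2.0, 0.5])
exact = np.exp(-1.0) * x0          # flow map of v at time T = 1
for L in (2, 8, 32, 128):
    err = np.linalg.norm(resnet_forward(x0, L=L) - exact)
    print(f"L={L:4d}  error={err:.2e}")
# As the depth L grows, the broken line converges to the continuous flow
# (the "continuum limit of depth" described above).
```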
Machine Learning on Graphs: A Model and Comprehensive Taxonomy
There has been a surge of recent interest in learning representations for
graph-structured data. Graph representation learning methods have generally
fallen into three main categories, based on the availability of labeled data.
The first, network embedding (such as shallow graph embedding or graph
auto-encoders), focuses on learning unsupervised representations of relational
structure. The second, graph regularized neural networks, leverages graphs to
augment neural network losses with a regularization objective for
semi-supervised learning. The third, graph neural networks, aims to learn
differentiable functions over discrete topologies with arbitrary structure.
However, despite the popularity of these areas, there has been surprisingly
little work on unifying the three paradigms. Here, we aim to bridge the gap
between graph neural networks, network embedding and graph regularization
models. We propose a comprehensive taxonomy of representation learning methods
for graph-structured data, aiming to unify several disparate bodies of work.
Specifically, we propose a Graph Encoder Decoder Model (GRAPHEDM), which
generalizes popular algorithms for semi-supervised learning on graphs (e.g.
GraphSage, Graph Convolutional Networks, Graph Attention Networks), and
unsupervised learning of graph representations (e.g. DeepWalk, node2vec, etc)
into a single consistent approach. To illustrate the generality of this
approach, we fit over thirty existing methods into this framework. We believe
that this unifying view both provides a solid foundation for understanding the
intuition behind these methods and enables future research in the area.
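The shared encoder-decoder template can be sketched in a few lines (my own illustration, not the GRAPHEDM reference code): encode nodes to embeddings Z, decode pairwise scores DEC(Z) that should match the adjacency matrix, and train on a reconstruction loss. The shallow-embedding special case below uses a learned embedding table as the "encoder" and an inner-product decoder; the graph and hyperparameters are toy choices.

```python
# Minimal sketch of an encoder-decoder view of unsupervised graph embedding.
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[0, 1, 1, 0],            # toy adjacency matrix (4-node graph)
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
n, d = A.shape[0], 2
Z = 0.1 * rng.standard_normal((n, d))  # encoder output: one embedding per node

def decode(Z):
    return 1.0 / (1.0 + np.exp(-Z @ Z.T))   # DEC: sigmoid of pairwise inner products

lr = 1.0
for step in range(2000):                     # gradient descent on binary cross-entropy
    P = decode(Z)
    grad = 2.0 * (P - A) @ Z / n ** 2        # d/dZ of the mean cross-entropy loss
    Z -= lr * grad
print(np.round(decode(Z), 2))                # reconstructed edge probabilities
```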
Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach
The Fisher information matrix (FIM) is a fundamental quantity to represent
the characteristics of a stochastic model, including deep neural networks
(DNNs). The present study reveals novel statistics of FIM that are universal
among a wide class of DNNs. To this end, we use random weights and large width
limits, which enables us to utilize mean field theories. We investigate the
asymptotic statistics of the FIM's eigenvalues and reveal that most of them are
close to zero while the maximum eigenvalue takes a huge value. Because the
landscape of the parameter space is defined by the FIM, it is locally flat in
most dimensions, but strongly distorted in others. Moreover, we demonstrate the
potential usage of the derived statistics in learning strategies. First, small
eigenvalues that induce flatness can be connected to a norm-based capacity
measure of generalization ability. Second, the maximum eigenvalue that induces
the distortion enables us to quantitatively estimate an appropriately sized
learning rate for gradient methods to converge.
Comment: Accepted at AISTATS 2019. Main text: 10 pages, 2 figures. Supplementary material: 9 pages, 2 figures, typos corrected.
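The skewed spectrum described above is easy to observe empirically. The sketch below (my own illustration, not the paper's derivation) builds the empirical FIM of a random one-hidden-layer tanh network under a Gaussian output model and prints summary statistics of its eigenvalues; sizes are arbitrary.

```python
# Minimal sketch: empirical Fisher information matrix of a random-weight
# one-hidden-layer tanh network, F = (1/N) sum_n g_n g_n^T with g_n = grad f(x_n).
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 10, 50, 1000                      # input dim, width, number of inputs
W1 = rng.standard_normal((m, d)) / np.sqrt(d)
w2 = rng.standard_normal(m) / np.sqrt(m)
X = rng.standard_normal((N, d))

grads = []
for x in X:                                 # per-example gradient of f(x) = w2 . tanh(W1 x)
    h = np.tanh(W1 @ x)
    g_W1 = np.outer(w2 * (1.0 - h ** 2), x)
    grads.append(np.concatenate([g_W1.ravel(), h]))
G = np.array(grads)
F = G.T @ G / N                             # empirical FIM

eig = np.linalg.eigvalsh(F)
print("max eigenvalue    :", eig[-1])
print("median eigenvalue :", np.median(eig))
print("fraction < max/1e3:", np.mean(eig < eig[-1] / 1e3))
# Typically the bulk of the spectrum is tiny compared with the top eigenvalue,
# matching the "locally flat in most dimensions, strongly distorted in others" picture.
```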
Critical Points of Neural Networks: Analytical Forms and Landscape Properties
Due to the success of deep learning in solving a variety of challenging
machine learning tasks, there is rising interest in understanding the loss
functions used to train neural networks from a theoretical perspective. In particular,
the properties of critical points and the landscape around them are of
importance to determine the convergence performance of optimization algorithms.
In this paper, we provide a full (necessary and sufficient) characterization of
the analytical forms for the critical points (as well as global minimizers) of
the square loss functions for various neural networks. We show that the
analytical forms of the critical points characterize the values of the
corresponding loss functions as well as the necessary and sufficient conditions
to achieve global minimum. Furthermore, we exploit the analytical forms of the
critical points to characterize the landscape properties for the loss functions
of these neural networks. One particular conclusion is that the loss function
of linear networks has no spurious local minima, while the loss function of
one-hidden-layer nonlinear networks with the ReLU activation function does have
local minima that are not global minima.
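The linear-network part of this claim is easy to probe numerically (my own experiment, not the paper's proofs): gradient descent on the square loss of a two-layer linear network reaches the same global loss value from many random initializations, consistent with the absence of spurious local minima.

```python
# Minimal sketch: two-layer linear network, square loss, several random starts.
import numpy as np

rng = np.random.default_rng(0)
d, k, N = 5, 3, 200
X = rng.standard_normal((N, d))
Y = X @ rng.standard_normal((d, 1)) + 0.1 * rng.standard_normal((N, 1))

def train(seed, steps=5000, lr=0.01):
    r = np.random.default_rng(seed)
    W1, W2 = r.standard_normal((d, k)), r.standard_normal((k, 1))
    for _ in range(steps):
        E = X @ W1 @ W2 - Y                          # residuals of f(x) = (W1 W2)^T x
        gW1 = X.T @ E @ W2.T / N                     # gradients of (1/2N) ||E||^2
        gW2 = W1.T @ X.T @ E / N
        W1, W2 = W1 - lr * gW1, W2 - lr * gW2
    return float(np.mean((X @ W1 @ W2 - Y) ** 2))

print([round(train(s), 4) for s in range(5)])        # same loss value from every start
```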
Visualizing the Loss Landscape of Neural Nets
Neural network training relies on our ability to find "good" minimizers of
highly non-convex loss functions. It is well-known that certain network
architecture designs (e.g., skip connections) produce loss functions that train
easier, and well-chosen training parameters (batch size, learning rate,
optimizer) produce minimizers that generalize better. However, the reasons for
these differences, and their effects on the underlying loss landscape, are not
well understood. In this paper, we explore the structure of neural loss
functions, and the effect of loss landscapes on generalization, using a range
of visualization methods. First, we introduce a simple "filter normalization"
method that helps us visualize loss function curvature and make meaningful
side-by-side comparisons between loss functions. Then, using a variety of
visualizations, we explore how network architecture affects the loss landscape,
and how training parameters affect the shape of minimizers.
Comment: NIPS 2018 (extended version, 10.5 pages); code is available at
https://github.com/tomgoldstein/loss-landscap
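The core of the filter-normalization idea fits in a short sketch (my own re-implementation of the idea, not the authors' released code): draw a random direction with the same shape as the weights, rescale each filter (here, each row) to match the norm of the corresponding filter in the trained weights, and evaluate the loss along that ray. The model and data below are toy stand-ins.

```python
# Minimal sketch: filter-normalized 1D slice of a loss landscape.
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 10, 32, 256
X = rng.standard_normal((N, d))
y = np.sin(X[:, 0])                                   # toy regression target

W1 = rng.standard_normal((m, d)) / np.sqrt(d)         # stand-in for trained weights
w2 = rng.standard_normal(m) / np.sqrt(m)

def loss(W1_, w2_):
    return float(np.mean((np.tanh(X @ W1_.T) @ w2_ - y) ** 2))

# Random direction, filter-normalized: each row of D1 gets the norm of the
# corresponding row of W1; the output weights are treated as one "filter".
D1 = rng.standard_normal(W1.shape)
D1 *= np.linalg.norm(W1, axis=1, keepdims=True) / np.linalg.norm(D1, axis=1, keepdims=True)
d2 = rng.standard_normal(w2.shape)
d2 *= np.linalg.norm(w2) / np.linalg.norm(d2)

for a in np.linspace(-1.0, 1.0, 9):                   # 1D slice of the landscape
    print(f"alpha={a:+.2f}  loss={loss(W1 + a * D1, w2 + a * d2):.4f}")
```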
Mathematics of Deep Learning
Recently there has been a dramatic increase in the performance of recognition
systems due to the introduction of deep architectures for representation
learning and classification. However, the mathematical reasons for this success
remain elusive. This tutorial will review recent work that aims to provide a
mathematical justification for several properties of deep networks, such as
global optimality, geometric stability, and invariance of the learned
representations.
Is deeper better? It depends on locality of relevant features
It has been recognized that a heavily overparameterized artificial neural
network exhibits surprisingly good generalization performance in various
machine-learning tasks. Recent theoretical studies have attempted to unveil
the mystery of overparameterization. In most of those previous works, the
overparameterization is achieved by increasing the width of the network, while
the effect of increasing the depth has remained less well understood. In this
work, we investigate the effect of increasing the depth within an
overparameterized regime. To gain insight into the advantage of depth, we
introduce local and global labels as abstract but simple classification rules.
It turns out that the locality of the relevant feature for a given
classification rule plays a key role; our experimental results suggest that
deeper is better for local labels, whereas shallower is better for global
labels. We also compare the results of finite networks with those of the neural
tangent kernel (NTK), which is equivalent to an infinitely wide network with a
proper initialization and an infinitesimal learning rate. It is shown that the
NTK does not correctly capture the depth dependence of the generalization
performance, which indicates the importance of feature learning rather than
lazy learning.
Comment: 13+4 pages
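The kernel the finite networks are compared against can be written down directly. Below is a minimal sketch (my own illustration, not the paper's experiments) of the empirical tangent kernel K(x, x') = <grad_theta f(x), grad_theta f(x')> for a wide one-hidden-layer network; in the infinite-width limit this kernel stays fixed during training, which is the "lazy learning" regime referred to above.

```python
# Minimal sketch: empirical neural tangent kernel of a wide tanh network.
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 2000                                     # wide layer to mimic the NTK regime
W1 = rng.standard_normal((m, d)) / np.sqrt(d)      # NTK-style initialization scaling
w2 = rng.standard_normal(m) / np.sqrt(m)

def grad_theta(x):
    """Gradient of f(x) = w2 . tanh(W1 x) with respect to all parameters."""
    h = np.tanh(W1 @ x)
    return np.concatenate([np.outer(w2 * (1 - h ** 2), x).ravel(), h])

X = rng.standard_normal((4, d))                    # a few probe inputs
G = np.stack([grad_theta(x) for x in X])
K = G @ G.T                                        # empirical NTK Gram matrix
print(np.round(K, 3))
```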
A Differential Topological View of Challenges in Learning with Feedforward Neural Networks
Among many unsolved puzzles in theories of Deep Neural Networks (DNNs), there
are three most fundamental challenges that highly demand solutions, namely,
expressibility, optimisability, and generalisability. Although there has been
significant progress in seeking answers using various theories, e.g.
information bottleneck theory, sparse representation, statistical inference,
Riemannian geometry, etc., so far there is no single theory that is able to
provide solutions to all these challenges. In this work, we propose to engage
the theory of differential topology to address the three problems. By modelling
the dataset of interest as a smooth manifold, DNNs can be considered as
compositions of smooth maps between smooth manifolds. Specifically, our work
offers a differential topological view of loss landscape of DNNs, interplay
between width and depth in expressibility, and regularisations for
generalisability. Finally, in the setting of deep representation learning, we
further apply the quotient topology to investigate the architecture of DNNs,
which makes it possible to capture nuisance factors in the data with respect to a specific
learning task.
Comment: 17 pages, 3 figures
Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models
With an eye toward understanding complexity control in deep learning, we
study how infinitesimal regularization or gradient descent optimization leads to
margin-maximizing solutions in both homogeneous and non-homogeneous models,
extending previous work that focused on infinitesimal regularization only in
homogeneous models. To this end we study the limit of loss minimization with a
diverging norm constraint (the "constrained path"), relate it to the limit of a
"margin path" and characterize the resulting solution. For non-homogeneous
ensemble models, whose output is a sum of homogeneous sub-models, we show that
this solution discards the shallowest sub-models if they are unnecessary. For
homogeneous models, we show convergence to a "lexicographic max-margin
solution", and provide conditions under which max-margin solutions are also
attained as the limit of unconstrained gradient descent.
Comment: ICML camera-ready version
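One standard way to write the two paths being compared, in my own notation rather than the paper's, is as follows, for training data (x_i, y_i), a model f(w; x), and an exponential-type loss:

```latex
% Schematic definitions of the constrained path and the margin path (my notation).
\begin{align*}
  w^{\mathrm{con}}(B) &\in \arg\min_{\|w\| \le B} \; \sum_i \exp\bigl(-y_i f(w; x_i)\bigr)
      && \text{(constrained path)} \\
  w^{\mathrm{mar}}(B) &\in \arg\max_{\|w\| \le B} \; \min_i \, y_i f(w; x_i)
      && \text{(margin path)}
\end{align*}
% The question studied above is whether these paths, suitably normalized, converge to the
% same (lexicographic) max-margin direction as B grows, and when unconstrained gradient
% descent attains that same limit.
```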