Equivalent and Approximate Transformations of Deep Neural Networks
Two networks are equivalent if they produce the same output for any given
input. In this paper, we study the possibility of transforming a deep neural
network to another network with a different number of units or layers, which
can be either equivalent, a local exact approximation, or a global linear
approximation of the original network. On the practical side, we show that
certain rectified linear units (ReLUs) can be safely removed from a network if
they are always active or inactive for any valid input. If we only need an
equivalent network for a smaller domain, then more units can be removed and
some layers collapsed. On the theoretical side, we constructively show that any
feed-forward ReLU network can be transformed into a 2-hidden-layer shallow
network, with a fixed number of units, that is a global linear approximation of
the original. This result strikes a balance between the growing number of units
required for arbitrary approximation with a single hidden layer and the known
upper bound of $\lceil \log_2(d+1) \rceil + 1$ layers for exact representation,
where $d$ is the input dimension. While the
transformed network may require an exponential number of units to capture the
activation patterns of the original network, we show that it can be made
substantially smaller by only accounting for the patterns that define linear
regions. Based on experiments with ReLU networks on the MNIST dataset, we found
that $\ell_1$-regularization and adversarial training significantly reduce the
number of linear regions as the number of stable units increases due to weight
sparsity. Therefore, we can also intentionally train ReLU networks to allow for
effective lossless compression and approximation.
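As a minimal sketch of the stability test behind this kind of pruning, assuming
a fully connected ReLU layer and a box-shaped input domain (the interval bound
propagation and the merge step below are illustrative, not the paper's exact
procedure):

    import numpy as np

    def relu_unit_stability(W, b, x_lo, x_hi):
        """Interval bounds on pre-activations W @ x + b for x in [x_lo, x_hi].

        Returns boolean masks of units that are stably inactive (can be
        dropped) or stably active (the ReLU acts as the identity)."""
        W_pos, W_neg = np.maximum(W, 0.0), np.minimum(W, 0.0)
        lo = W_pos @ x_lo + W_neg @ x_hi + b   # lower bound of pre-activation
        hi = W_pos @ x_hi + W_neg @ x_lo + b   # upper bound of pre-activation
        return hi <= 0.0, lo >= 0.0

    # Toy example: a 4-unit layer on inputs in [-1, 1]^3.
    rng = np.random.default_rng(0)
    W, b = rng.normal(size=(4, 3)), np.array([-10.0, 10.0, 0.0, 0.0])
    dead, always_on = relu_unit_stability(W, b, -np.ones(3), np.ones(3))
    # Dead units are removed outright; always-on units can be folded into the
    # next layer, because ReLU(z) = z on their entire input range.
    print(dead, always_on)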
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have the
potential to speed up training of neural networks by correcting for the
curvature of the loss function. Unfortunately, the exact natural gradient is
impractical to compute for large models, and most approximations either require
an expensive iterative procedure or make crude approximations to the curvature.
We present Kronecker Factors for Convolution (KFC), a tractable approximation
to the Fisher matrix for convolutional networks based on a structured
probabilistic model for the distribution over backpropagated derivatives.
Similarly to the recently proposed Kronecker-Factored Approximate Curvature
(K-FAC), each block of the approximate Fisher matrix decomposes as the
Kronecker product of small matrices, allowing for efficient inversion. KFC
captures important curvature information while still yielding comparably
efficient updates to stochastic gradient descent (SGD). We show that the
updates are invariant to commonly used reparameterizations, such as centering
of the activations. In our experiments, approximate natural gradient descent
with KFC was able to train convolutional networks several times faster than
carefully tuned SGD. Furthermore, it was able to train the networks in 10-20
times fewer iterations than SGD, suggesting its potential applicability in a
distributed setting.
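A small numerical illustration of why Kronecker-factored Fisher blocks admit
efficient inversion, using generic random factors rather than KFC's actual
statistics: applying (A ⊗ G)^{-1} to a vectorized gradient reduces to two small
solves.

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 4, 3
    # Small SPD Kronecker factors standing in for the input-side and
    # output-side statistics of one layer's approximate Fisher block.
    A = np.eye(m) + 0.1 * rng.normal(size=(m, m)); A = A @ A.T
    G = np.eye(n) + 0.1 * rng.normal(size=(n, n)); G = G @ G.T
    V = rng.normal(size=(n, m))            # gradient for an n x m weight matrix

    # Naive: invert the full (n*m) x (n*m) block F = kron(A, G).
    F = np.kron(A, G)
    naive = np.linalg.solve(F, V.reshape(-1, order="F")).reshape((n, m), order="F")

    # Kronecker trick: (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}).
    cheap = np.linalg.solve(G, V) @ np.linalg.inv(A)

    print(np.allclose(naive, cheap))       # True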
A Kernel Theory of Modern Data Augmentation
Data augmentation, a technique in which a training set is expanded with
class-preserving transformations, is ubiquitous in modern machine learning
pipelines. In this paper, we seek to establish a theoretical framework for
understanding data augmentation. We approach this from two directions: First,
we provide a general model of augmentation as a Markov process, and show that
kernels appear naturally with respect to this model, even when we do not employ
kernel classification. Next, we analyze more directly the effect of
augmentation on kernel classifiers, showing that data augmentation can be
approximated by first-order feature averaging and second-order variance
regularization components. These frameworks both serve to illustrate the ways
in which data augmentation affects the downstream learning model, and the
resulting analyses provide novel connections between prior work in invariant
kernels, tangent propagation, and robust optimization. Finally, we provide
several proof-of-concept applications showing that our theory can be useful for
accelerating machine learning workflows, such as reducing the amount of
computation needed to train using augmented data, and predicting the utility of
a transformation prior to training.
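As a toy illustration of the decomposition mentioned above, the following
sketch checks, for a linear model with squared loss and Gaussian input jitter
standing in for the augmentation (all of these choices are assumptions, not the
paper's setting), that the expected augmented loss splits into a
feature-averaging term plus a variance-regularization term:

    import numpy as np

    rng = np.random.default_rng(0)
    d, K = 5, 200_000
    w = rng.normal(size=d)
    x, y = rng.normal(size=d), 1.0

    def phi(v):                      # feature map (identity, for simplicity)
        return v

    # Augmentation: add Gaussian jitter to the input, K sampled copies.
    aug = x + 0.3 * rng.normal(size=(K, d))
    feats = np.apply_along_axis(phi, 1, aug)

    # Expected squared loss over augmentations (Monte Carlo estimate).
    mc_loss = np.mean((feats @ w - y) ** 2)

    # First-order term (feature averaging) + second-order term (variance
    # regularization).
    mean_f = feats.mean(axis=0)
    cov_f = np.cov(feats, rowvar=False)
    approx = (mean_f @ w - y) ** 2 + w @ cov_f @ w

    print(mc_loss, approx)           # nearly identical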
Improving Variational Inference with Inverse Autoregressive Flow
The framework of normalizing flows provides a general strategy for flexible
variational inference of posteriors over latent variables. We propose a new
type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast
to earlier published flows, scales well to high-dimensional latent spaces. The
proposed flow consists of a chain of invertible transformations, where each
transformation is based on an autoregressive neural network. In experiments, we
show that IAF significantly improves upon diagonal Gaussian approximate
posteriors. In addition, we demonstrate that a novel type of variational
autoencoder, coupled with IAF, is competitive with neural autoregressive models
in terms of attained log-likelihood on natural images, while allowing
significantly faster synthesis.
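A minimal sketch of a single IAF step with a stand-in autoregressive network (a
masked linear map), showing the gated update and the cheap log-determinant; the
actual method stacks several such steps with learned MADE-style networks:

    import numpy as np

    def iaf_step(z, W_m, W_s, b_m, b_s):
        """One inverse autoregressive flow transformation.

        The autoregressive constraint is enforced with strictly
        lower-triangular weights, so m[i] and s[i] depend only on z[:i]."""
        L = np.tril(np.ones_like(W_m), k=-1)   # strict lower-triangular mask
        m = (W_m * L) @ z + b_m                # shift
        s = (W_s * L) @ z + b_s                # pre-gate
        sigma = 1.0 / (1.0 + np.exp(-s))       # gate in (0, 1)
        z_new = sigma * z + (1.0 - sigma) * m  # gated, numerically stable update
        log_det = np.sum(np.log(sigma))        # log |det dz_new/dz|
        return z_new, log_det

    rng = np.random.default_rng(0)
    d = 4
    z = rng.normal(size=d)                     # sample from the base posterior
    params = [rng.normal(size=(d, d)) * 0.1 for _ in range(2)] + [np.zeros(d), np.ones(d)]
    z_new, log_det = iaf_step(z, *params)
    print(z_new, log_det)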
Sum-of-Squares Polynomial Flow
The triangular map is a recent construct in probability theory that allows one to
transform any source probability density function to any target density
function. Based on triangular maps, we propose a general framework for
high-dimensional density estimation, by specifying one-dimensional
transformations (equivalently conditional densities) and appropriate
conditioner networks. This framework (a) reveals the commonalities and
differences of existing autoregressive and flow based methods, (b) allows a
unified understanding of the limitations and representation power of these
recent approaches and, (c) motivates us to uncover a new Sum-of-Squares (SOS)
flow that is interpretable, universal, and easy to train. We perform several
synthetic experiments on various density geometries to demonstrate the benefits
(and short-comings) of such transformations. SOS flows achieve competitive
results in simulations and on several real-world datasets.
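A minimal sketch of the one-dimensional SOS transformation underlying the flow:
integrating a sum of squared polynomials yields a monotonically increasing map.
The coefficients below are arbitrary, and numerical quadrature stands in for
the closed-form polynomial integral:

    import numpy as np

    def sos_transform(x, coeffs, c=0.0, n_grid=2001):
        """T(x) = c + integral_0^x sum_k p_k(u)^2 du.

        The integrand is a sum of squares, so it is non-negative and T is
        monotonically increasing in x."""
        u = np.linspace(0.0, x, n_grid)
        integrand = sum(np.polyval(a, u) ** 2 for a in coeffs)
        return c + np.trapz(integrand, u)

    # Two degree-2 polynomials as the square-root factors.
    coeffs = [np.array([0.5, -0.2, 1.0]), np.array([0.1, 0.3, 0.0])]
    xs = np.linspace(-2.0, 2.0, 5)
    print([sos_transform(x, coeffs) for x in xs])   # strictly increasing in x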
A probabilistic framework for multi-view feature learning with many-to-many associations via neural networks
We propose Probabilistic Multi-view Graph Embedding (PMvGE), a simple framework
for multi-view feature learning with many-to-many associations that generalizes
various existing multi-view methods. PMvGE is a probabilistic
model for predicting new associations via graph embedding of the nodes of data
vectors with links of their associations. Multi-view data vectors with
many-to-many associations are transformed by neural networks to feature vectors
in a shared space, and the probability of new association between two data
vectors is modeled by the inner product of their feature vectors. While
existing multi-view feature learning techniques can handle only one of
many-to-many associations or non-linear transformations, PMvGE handles both
simultaneously. By combining Mercer's theorem and the universal approximation
theorem, we prove that PMvGE learns a wide class of similarity measures across
views. Our likelihood-based estimator enables efficient computation of
non-linear transformations of data vectors in large-scale datasets by minibatch
SGD, and numerical experiments illustrate that PMvGE outperforms existing
multi-view methods.
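A minimal sketch of the association model described above, with two small
view-specific networks mapping data vectors into a shared space and a sigmoid
of their inner product standing in for the link probability (network shapes and
the link function are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    d_x, d_y, d_shared, h = 6, 4, 3, 8

    # View-specific networks (one hidden ReLU layer each), mapping each view's
    # data vectors into a shared d_shared-dimensional space.
    W1x, W2x = rng.normal(size=(h, d_x)) * 0.3, rng.normal(size=(d_shared, h)) * 0.3
    W1y, W2y = rng.normal(size=(h, d_y)) * 0.3, rng.normal(size=(d_shared, h)) * 0.3

    def f_x(x): return W2x @ np.maximum(W1x @ x, 0.0)
    def f_y(y): return W2y @ np.maximum(W1y @ y, 0.0)

    def assoc_prob(x, y):
        """Probability of an association, modeled via the inner product of the
        two feature vectors (sigmoid link assumed for illustration)."""
        score = f_x(x) @ f_y(y)
        return 1.0 / (1.0 + np.exp(-score))

    x, y = rng.normal(size=d_x), rng.normal(size=d_y)
    print(assoc_prob(x, y))
    # Training would maximize the likelihood of observed and unobserved links
    # over minibatches with SGD, as described in the abstract.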
Deep Component Analysis via Alternating Direction Neural Networks
Despite a lack of theoretical understanding, deep neural networks have
achieved unparalleled performance in a wide range of applications. On the other
hand, shallow representation learning with component analysis is associated
with rich intuition and theory, but smaller capacity often limits its
usefulness. To bridge this gap, we introduce Deep Component Analysis (DeepCA),
an expressive multilayer model formulation that enforces hierarchical structure
through constraints on latent variables in each layer. For inference, we
propose a differentiable optimization algorithm implemented using recurrent
Alternating Direction Neural Networks (ADNNs) that enable parameter learning
using standard backpropagation. By interpreting feed-forward networks as
single-iteration approximations of inference in our model, we provide both a
novel theoretical perspective for understanding them and a practical technique
for constraining predictions with prior knowledge. Experimentally, we
demonstrate performance improvements on a variety of tasks, including
single-image depth prediction with sparse output constraints.
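A minimal sketch of the unrolled-inference view mentioned above, using
projected gradient steps with a nonnegativity constraint in place of the
paper's ADMM-based ADNN updates; note that a single iteration from a zero
initialization reduces to an ordinary feed-forward ReLU layer:

    import numpy as np

    def infer_code(x, W, n_iters=50, step=0.1):
        """Projected gradient inference for min_z 0.5*||x - W z||^2 s.t. z >= 0."""
        z = np.zeros(W.shape[1])
        for _ in range(n_iters):
            grad = W.T @ (W @ z - x)
            z = np.maximum(z - step * grad, 0.0)   # gradient step + projection
        return z

    rng = np.random.default_rng(0)
    W = rng.normal(size=(8, 5)) / np.sqrt(8)
    x = rng.normal(size=8)

    # One iteration from zero with unit step is exactly a feed-forward ReLU layer.
    one_step = np.maximum(W.T @ x, 0.0)
    print(np.allclose(infer_code(x, W, n_iters=1, step=1.0), one_step))  # True
    print(infer_code(x, W))   # more iterations refine the constrained code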
Data-Dependent Path Normalization in Neural Networks
We propose a unified framework for neural net normalization, regularization
and optimization, which includes Path-SGD and Batch-Normalization and
interpolates between them across two different dimensions. Through this
framework we investigate the issue of invariance of the optimization, data
dependence, and the connection with natural gradients.
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
We propose an efficient method for approximating natural gradient descent in
neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC).
K-FAC is based on an efficiently invertible approximation of a neural network's
Fisher information matrix which is neither diagonal nor low-rank, and in some
cases is completely non-sparse. It is derived by approximating various large
blocks of the Fisher (corresponding to entire layers) as being the Kronecker
product of two much smaller matrices. While only several times more expensive
to compute than the plain stochastic gradient, the updates produced by K-FAC
make much more progress optimizing the objective, which results in an algorithm
that can be much faster than stochastic gradient descent with momentum in
practice. And unlike some previously proposed approximate
natural-gradient/Newton methods which use high-quality non-diagonal curvature
matrices (such as Hessian-free optimization), K-FAC works very well in highly
stochastic optimization regimes. This is because the cost of storing and
inverting K-FAC's approximation to the curvature matrix does not depend on the
amount of data used to estimate it, which is a feature typically associated
only with diagonal or low-rank approximations to the curvature matrix.
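A minimal sketch of the per-layer preconditioning this describes, for a fully
connected layer: the Fisher block is approximated by the Kronecker product of
input second moments and backpropagated-gradient second moments, so the update
needs only two small matrix solves (the damping and batch statistics below are
illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    batch, d_in, d_out = 64, 10, 7

    a = rng.normal(size=(batch, d_in))     # layer inputs (activations from below)
    g = rng.normal(size=(batch, d_out))    # backpropagated gradients w.r.t. outputs
    dW = g.T @ a / batch                   # ordinary gradient for the weight matrix

    # Kronecker factors of the approximate Fisher block for this layer.
    A = a.T @ a / batch                    # input second moments    (d_in  x d_in)
    G = g.T @ g / batch                    # gradient second moments (d_out x d_out)

    lam = 1e-2                             # damping added to each factor
    A_d = A + lam * np.eye(d_in)
    G_d = G + lam * np.eye(d_out)

    # Approximate natural-gradient direction: G^{-1} dW A^{-1},
    # equivalently (A ⊗ G)^{-1} applied to vec(dW).
    precond = np.linalg.solve(G_d, dW) @ np.linalg.inv(A_d)

    W = rng.normal(size=(d_out, d_in))
    lr = 0.1
    W -= lr * precond                      # K-FAC-style parameter update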
Understanding Deep Convolutional Networks
Deep convolutional networks provide state-of-the-art classification and
regression results on many high-dimensional problems. We review their
architecture, which scatters data with a cascade of linear filter weights and
non-linearities. A mathematical framework is introduced to analyze their
properties. Computations of invariants involve multiscale contractions, the
linearization of hierarchical symmetries, and sparse separations. Applications
are discussed.
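A minimal sketch of the cascade structure described above (linear filtering,
modulus non-linearity, averaging, repeated over layers), using arbitrary 1-D
filters in place of a proper wavelet filter bank:

    import numpy as np

    def scattering_1d(x, filters, depth=2):
        """Averaged scattering-style coefficients of a 1-D signal.

        Each layer convolves every path with every filter and takes the
        modulus; coefficients are the means (a crude low-pass) of each path."""
        coeffs, layer = [np.mean(x)], [x]
        for _ in range(depth):
            nxt = []
            for sig in layer:
                for h in filters:
                    u = np.abs(np.convolve(sig, h, mode="same"))  # |x * psi|
                    coeffs.append(np.mean(u))                      # low-pass output
                    nxt.append(u)
            layer = nxt
        return np.array(coeffs)

    rng = np.random.default_rng(0)
    t = np.linspace(0, 1, 256)
    x = np.sin(2 * np.pi * 5 * t) + 0.1 * rng.normal(size=t.size)
    # Two toy band-pass filters standing in for a wavelet filter bank.
    filters = [np.array([1.0, 0.0, -1.0]), np.array([1.0, -2.0, 1.0])]
    print(scattering_1d(x, filters))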