8,570 research outputs found
Overparameterized ReLU Neural Networks Learn the Simplest Models: Neural Isometry and Exact Recovery
The practice of deep learning has shown that neural networks generalize
remarkably well even with an extreme number of learned parameters. This appears
to contradict traditional statistical wisdom, in which a trade-off between
model complexity and fit to the data is essential. We set out to resolve this
discrepancy from a convex optimization and sparse recovery perspective. We
consider the training and generalization properties of two-layer ReLU networks
with standard weight decay regularization. Under certain regularity assumptions
on the data, we show that ReLU networks with an arbitrary number of parameters
learn only simple models that explain the data. This is analogous to the
recovery of the sparsest linear model in compressed sensing. For ReLU networks
and their variants with skip connections or normalization layers, we present
isometry conditions that ensure the exact recovery of planted neurons. For
randomly generated data, we show the existence of a phase transition in
recovering planted neural network models. The situation is simple: whenever the
ratio between the number of samples and the dimension exceeds a numerical
threshold, the recovery succeeds with high probability; otherwise, it fails
with high probability. Surprisingly, ReLU networks learn simple and sparse
models even when the labels are noisy. The phase transition phenomenon is
confirmed through numerical experiments.
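The phase transition lends itself to a simple simulation. The sketch below is a toy illustration, not the authors' experiment: it assumes a single planted ReLU neuron, a width-50 two-layer ReLU network trained in PyTorch with an explicit weight-decay penalty, hand-picked optimizer settings, and a cosine-similarity threshold of 0.99 as the recovery criterion. It sweeps the sample-to-dimension ratio n/d and reports how often the planted neuron is recovered.

```python
import torch

def recovery_rate(n, d, trials=10, width=50, lam=1e-3, steps=2000, lr=0.05):
    """Fraction of trials in which the planted neuron is (approximately) recovered."""
    hits = 0
    for _ in range(trials):
        w_star = torch.randn(d)
        w_star = w_star / w_star.norm()                  # planted unit-norm ReLU neuron
        X = torch.randn(n, d)
        y = torch.relu(X @ w_star)                       # noiseless planted-model labels

        W = (0.1 * torch.randn(width, d)).requires_grad_(True)  # first-layer weights
        a = (0.1 * torch.randn(width)).requires_grad_(True)     # second-layer weights
        opt = torch.optim.Adam([W, a], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            pred = torch.relu(X @ W.T) @ a
            # squared loss plus standard weight-decay regularization
            loss = ((pred - y) ** 2).mean() + lam * (W.pow(2).sum() + a.pow(2).sum())
            loss.backward()
            opt.step()

        # Recovery criterion (an assumption): the hidden unit with the largest
        # output weight aligns with w_star up to cosine similarity 0.99.
        k = a.abs().argmax()
        cos = torch.nn.functional.cosine_similarity(W[k], w_star, dim=0)
        hits += int(cos.item() > 0.99)
    return hits / trials

# Sweep the sample-to-dimension ratio n/d and look for the failure-to-success jump.
d = 20
for ratio in (1, 2, 4, 8, 16):
    print(f"n/d = {ratio:2d}  recovery rate = {recovery_rate(ratio * d, d):.2f}")
```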
Sparse Coding and Autoencoders
In "Dictionary Learning" one tries to recover incoherent matrices (typically overcomplete and whose columns are assumed
to be normalized) and sparse vectors with a small
support of size for some while having access to observations
where . In this work we undertake a rigorous
analysis of whether gradient descent on the squared loss of an autoencoder can
solve the dictionary learning problem. The "Autoencoder" architecture we
consider is a mapping with a single
ReLU activation layer of size .
Under very mild distributional assumptions on , we prove that the norm
of the expected gradient of the standard squared loss function is
asymptotically (in sparse code dimension) negligible for all points in a small
neighborhood of . This is supported with experimental evidence using
synthetic data. We also conduct experiments to suggest that is a local
minimum. Along the way we prove that a layer of ReLU gates can be set up to
automatically recover the support of the sparse codes. This property holds
independent of the loss function. We believe that it could be of independent
interest.
Comment: In this new version of the paper, with a small change in the
distributional assumptions, we are able to prove the asymptotic
criticality of a neighbourhood of the ground-truth dictionary for even just
the standard squared loss of the ReLU autoencoder (unlike the regularized
loss in the older version).
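The support-recovery claim is easy to probe numerically. The toy check below is not the paper's code or proof; the dimensions, the coefficient range, and the ReLU threshold eps are assumptions picked so that the incoherence of a random dictionary is sufficient. It builds a column-normalized dictionary A*, a nonnegative sparse code x*, and verifies that the single ReLU layer ReLU(A*^T y - eps) fires exactly on the support of x*.

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, s = 400, 512, 3                       # signal dim, code dim, support size (~ h^p)

A = rng.standard_normal((n, h))
A /= np.linalg.norm(A, axis=0)              # normalized columns; incoherent w.h.p.

x = np.zeros(h)
support = rng.choice(h, size=s, replace=False)
x[support] = rng.uniform(1.0, 1.5, size=s)  # nonnegative sparse code x*
y = A @ x                                   # observation y = A* x*

eps = 0.5                                   # ReLU bias/threshold (an assumption)
hidden = np.maximum(A.T @ y - eps, 0.0)     # single ReLU layer of size h

print(set(np.flatnonzero(hidden)) == set(support))  # support recovered exactly
```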
Alternating Back-Propagation for Generator Network
This paper proposes an alternating back-propagation algorithm for learning
the generator network model. The model is a non-linear generalization of factor
analysis. In this model, the mapping from the continuous latent factors to the
observed signal is parametrized by a convolutional neural network. The
alternating back-propagation algorithm iterates the following two steps: (1)
Inferential back-propagation, which infers the latent factors by Langevin
dynamics or gradient descent. (2) Learning back-propagation, which updates the
parameters given the inferred latent factors by gradient descent. The gradient
computations in both steps are powered by back-propagation, and they share most
of their code in common. We show that the alternating back-propagation
algorithm can learn realistic generator models of natural images, video
sequences, and sounds. Moreover, it can also be used to learn from incomplete
or indirect training data.
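A minimal sketch of the two alternating steps, under assumptions of convenience (a small fully connected generator in place of the paper's convolutional network, synthetic stand-in data, and hand-picked Langevin and learning step sizes). It shows how both steps are driven by the same automatic differentiation machinery, which is the sense in which they share most of their code.

```python
import torch
import torch.nn as nn

latent_dim, data_dim, sigma = 8, 64, 0.3
G = nn.Sequential(nn.Linear(latent_dim, 128), nn.ReLU(), nn.Linear(128, data_dim))
opt = torch.optim.Adam(G.parameters(), lr=1e-3)

Y = torch.randn(200, data_dim)             # stand-in "observed signals"
Z = torch.zeros(200, latent_dim)           # persistent latent factors, one row per example

def log_joint(z, y):
    # log p(z) + log p(y | z) up to constants: Gaussian prior, Gaussian noise model
    return -0.5 * (z ** 2).sum() - ((y - G(z)) ** 2).sum() / (2 * sigma ** 2)

delta = 0.05                               # Langevin step size (an assumption)
for epoch in range(50):
    # (1) Inferential back-propagation: Langevin dynamics on the latent factors.
    for _ in range(10):
        z = Z.clone().requires_grad_(True)
        grad = torch.autograd.grad(log_joint(z, Y), z)[0]
        Z = (Z + 0.5 * delta ** 2 * grad + delta * torch.randn_like(Z)).detach()

    # (2) Learning back-propagation: gradient step on the parameters given inferred Z.
    opt.zero_grad()
    recon = ((Y - G(Z)) ** 2).sum() / (2 * sigma ** 2)
    recon.backward()
    opt.step()

print("final reconstruction error:", ((Y - G(Z)) ** 2).mean().item())
```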