Input and Weight Space Smoothing for Semi-supervised Learning
We propose regularizing the empirical loss for semi-supervised learning by
acting on both the input (data) space and the weight (parameter) space. We
show that the two are not equivalent but complementary: one affects the
minimality of the resulting representation, the other its insensitivity to
nuisance variability. We propose a method to perform such
smoothing, which combines known input-space smoothing with a novel weight-space
smoothing, based on a min-max (adversarial) optimization. The resulting
Adversarial Block Coordinate Descent (ABCD) algorithm performs gradient ascent
with a small learning rate for a random subset of the weights, and standard
gradient descent on the remaining weights in the same mini-batch. It achieves
performance comparable to the state of the art without resorting to heavy data
augmentation, using a relatively simple architecture.
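As a rough illustration of the update rule described in the abstract, the following Python fragment performs one ABCD-style step on a single mini-batch, assuming a PyTorch model and a cross-entropy task loss. The hyperparameter names (descent_lr, ascent_lr, ascent_fraction) and the per-entry random mask are illustrative choices, not details taken from the paper.

import torch
import torch.nn.functional as F

def abcd_step(model, x, y, descent_lr=0.1, ascent_lr=0.01, ascent_fraction=0.1):
    # One mini-batch update: a randomly chosen subset of weight entries takes a
    # small gradient-ascent step, while the remaining weights take a standard
    # gradient-descent step on the same loss.
    loss = F.cross_entropy(model(x), y)
    model.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is None:
                continue
            # Illustrative choice: mark a random fraction of this tensor's
            # entries for ascent, the rest for descent.
            ascent_mask = (torch.rand_like(p) < ascent_fraction).float()
            step = (ascent_lr * ascent_mask - descent_lr * (1.0 - ascent_mask)) * p.grad
            p.add_(step)
    return loss.item()
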
Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting
We introduce the Kronecker factored online Laplace approximation for
overcoming catastrophic forgetting in neural networks. The method is grounded
in a Bayesian online learning framework, where we recursively approximate the
posterior after every task with a Gaussian, leading to a quadratic penalty on
changes to the weights. The Laplace approximation requires calculating the
Hessian around a mode, which is typically intractable for modern architectures.
In order to make our method scalable, we leverage recent block-diagonal
Kronecker factored approximations to the curvature. Our algorithm achieves over
90% test accuracy across a sequence of 50 instantiations of the permuted MNIST
dataset, substantially outperforming related methods for overcoming
catastrophic forgetting.
Comment: 13 pages, 6 figures
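A minimal sketch of the resulting objective, assuming PyTorch and using a diagonal curvature approximation as a readable stand-in for the block-diagonal Kronecker-factored curvature used in the paper; prev_params, curvature, and lam are illustrative names, not the paper's notation.

import torch
import torch.nn.functional as F

def penalized_loss(model, x, y, prev_params, curvature, lam=1.0):
    # Task loss plus a quadratic penalty on deviations from the previous
    # posterior mode, weighted by the accumulated curvature estimate.
    loss = F.cross_entropy(model(x), y)
    penalty = 0.0
    for name, p in model.named_parameters():
        penalty = penalty + (curvature[name] * (p - prev_params[name]) ** 2).sum()
    return loss + 0.5 * lam * penalty
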
A jamming transition from under- to over-parametrization affects loss landscape and generalization
We argue that in fully-connected networks a phase transition delimits the
over- and under-parametrized regimes where fitting can or cannot be achieved.
Under some general conditions, we show that this transition is sharp for the
hinge loss. In the whole over-parametrized regime, poor minima of the loss are
not encountered during training since the number of constraints to satisfy is
too small to hamper minimization. Our findings support a link between this
transition and the generalization properties of the network: as we increase the
number of parameters of a given model, starting from an under-parametrized
network, we observe that the generalization error displays three phases: (i)
initial decay, (ii) increase until the transition point, where it displays a
cusp, and (iii) slow decay toward a constant for the rest of the
over-parametrized regime. Thereby we identify the region where the classical
phenomenon of over-fitting takes place, and the region where the model keeps
improving, in line with previous empirical observations for modern neural
networks.
Comment: arXiv admin note: text overlap with arXiv:1809.0934
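The hinge loss enters because the data are fit exactly when every margin constraint is satisfied, and the number of still-violated constraints is the quantity that vanishes on the over-parametrized side of the transition. A minimal sketch, assuming labels in {-1, +1} and a scalar network output (a simplification of the setting studied in the paper):

import torch

def hinge_loss_and_violations(outputs, labels, margin=1.0):
    # Mean hinge loss and the number of examples still violating the margin
    # constraint y * f(x) >= margin; zero violations means the data are fit.
    gaps = margin - labels * outputs.squeeze(-1)
    violations = int((gaps > 0).sum().item())
    return torch.clamp(gaps, min=0.0).mean(), violations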