587 research outputs found
Input and Weight Space Smoothing for Semi-supervised Learning
We propose regularizing the empirical loss for semi-supervised learning by
acting on both the input (data) space, and the weight (parameter) space. We
show that the two are not equivalent, and in fact are complementary, one
affecting the minimality of the resulting representation, the other
insensitivity to nuisance variability. We propose a method to perform such
smoothing, which combines known input-space smoothing with a novel weight-space
smoothing, based on a min-max (adversarial) optimization. The resulting
Adversarial Block Coordinate Descent (ABCD) algorithm performs gradient ascent
with a small learning rate for a random subset of the weights, and standard
gradient descent on the remaining weights in the same mini-batch. It achieves
comparable performance to the state-of-the-art without resorting to heavy data
augmentation, using a relatively simple architecture
The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects
Understanding the behavior of stochastic gradient descent (SGD) in the
context of deep neural networks has raised lots of concerns recently. Along
this line, we study a general form of gradient based optimization dynamics with
unbiased noise, which unifies SGD and standard Langevin dynamics. Through
investigating this general optimization dynamics, we analyze the behavior of
SGD on escaping from minima and its regularization effects. A novel indicator
is derived to characterize the efficiency of escaping from minima through
measuring the alignment of noise covariance and the curvature of loss function.
Based on this indicator, two conditions are established to show which type of
noise structure is superior to isotropic noise in term of escaping efficiency.
We further show that the anisotropic noise in SGD satisfies the two conditions,
and thus helps to escape from sharp and poor minima effectively, towards more
stable and flat minima that typically generalize well. We systematically design
various experiments to verify the benefits of the anisotropic noise, compared
with full gradient descent plus isotropic diffusion (i.e. Langevin dynamics).Comment: ICML 2019 camera read
- …