180 research outputs found
Regularizing Recurrent Networks - On Injected Noise and Norm-based Methods
Advancements in parallel processing have led to a surge in multilayer
perceptron (MLP) applications and deep learning over the past decades.
Recurrent Neural Networks (RNNs) add representational power to feedforward
MLPs by providing a way to handle sequential data. However, RNNs are hard to
train with conventional error backpropagation because of the difficulty of
relating inputs across many time steps. Regularization approaches from the
MLP literature, such as dropout and noisy weight training, have been
insufficiently applied and tested on simple RNNs. Moreover, solutions have
been proposed to improve convergence in RNNs, but not enough to improve their
ability to remember long-term dependencies.
In this study, we empirically evaluate the remembering and generalization
ability of RNNs on polyphonic music datasets. The models are trained with
injected noise, random dropout, and norm-based regularizers, and their
respective performances are compared to well-initialized plain RNNs and
advanced regularization methods such as fast dropout. We conclude with
evidence that training with noise does not improve performance, as
conjectured by several earlier works on RNN optimization.
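As a rough illustration of the kind of noise-injected training evaluated
here, the sketch below adds Gaussian noise to the recurrent weights of a
vanilla RNN during training; the cell form, noise level, and array shapes are
illustrative assumptions rather than the study's exact setup.

```python
import numpy as np

def rnn_forward_with_weight_noise(x_seq, h0, W_xh, W_hh, b, sigma=0.01,
                                  training=True):
    """Unroll a vanilla tanh RNN, perturbing the recurrent weights with
    Gaussian noise at training time (one of the regularizers compared above)."""
    W = W_hh + sigma * np.random.randn(*W_hh.shape) if training else W_hh
    h, states = h0, []
    for x_t in x_seq:                        # x_seq: iterable of input vectors
        h = np.tanh(x_t @ W_xh + h @ W + b)  # noisy recurrence
        states.append(h)
    return states
```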
Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations
We propose zoneout, a novel method for regularizing RNNs. At each timestep,
zoneout stochastically forces some hidden units to maintain their previous
values. Like dropout, zoneout uses random noise to train a pseudo-ensemble,
improving generalization. But by preserving instead of dropping hidden units,
gradient information and state information are more readily propagated through
time, as in feedforward stochastic depth networks. We perform an empirical
investigation of various RNN regularizers, and find that zoneout gives
significant performance improvements across tasks. We achieve competitive
results with relatively simple models in character- and word-level language
modelling on the Penn Treebank and Text8 datasets, and combining zoneout with
recurrent batch normalization yields state-of-the-art results on permuted
sequential MNIST.
Comment: David Krueger and Tegan Maharaj contributed equally to this work.
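As a concrete illustration, a zoneout step for a vanilla RNN cell might look
as follows; the cell form, zoneout probability, and test-time averaging are a
minimal sketch based on the description above, not the authors' reference
implementation.

```python
import numpy as np

def zoneout_step(h_prev, x_t, W_xh, W_hh, b, p_zoneout=0.15, training=True):
    """One RNN step with zoneout: each hidden unit keeps its previous value
    with probability p_zoneout instead of adopting the new candidate value."""
    h_new = np.tanh(x_t @ W_xh + h_prev @ W_hh + b)
    if training:
        # Bernoulli mask: 1 = preserve the previous value, 0 = use the update.
        keep = (np.random.rand(*h_prev.shape) < p_zoneout).astype(h_prev.dtype)
        return keep * h_prev + (1.0 - keep) * h_new
    # At test time, use the expected update (analogous to dropout's rescaling).
    return p_zoneout * h_prev + (1.0 - p_zoneout) * h_new
```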
Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model
Recent advances in conditional recurrent language modelling have mainly
focused on network architectures (e.g., attention mechanism), learning
algorithms (e.g., scheduled sampling and sequence-level training) and novel
applications (e.g., image/video description generation, speech recognition,
etc.). On the other hand, we notice that decoding algorithms and strategies
have not been investigated as much, and it has become standard to use greedy
or beam
search. In this paper, we propose a novel decoding strategy motivated by an
earlier observation that nonlinear hidden layers of a deep neural network
stretch the data manifold. The proposed strategy is embarrassingly
parallelizable without any communication overhead, while improving an existing
decoding algorithm. We extensively evaluate it with attention-based neural
machine translation on the task of En->Cz translation.
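One plausible reading of the strategy described above, simplified and
consistent with the abstract, is to run several independent greedy decodings,
each perturbing the decoder's hidden state with annealed Gaussian noise, and
keep the highest-scoring hypothesis. The decode_greedy hook, noise schedule,
and sample count below are illustrative assumptions.

```python
import numpy as np

def noisy_parallel_decode(decode_greedy, src, num_samples=8, sigma0=0.5):
    """Run num_samples independent noisy greedy decodings (trivially
    parallelizable) and return the hypothesis with the best model score.
    decode_greedy(src, noise_fn) is a hypothetical hook that applies
    noise_fn(h, t) to the decoder state at step t and returns
    (hypothesis, log_probability)."""
    best_hyp, best_score = None, -np.inf
    for _ in range(num_samples):
        def noise_fn(h, t):
            # Noise variance decays with the decoding step (illustrative schedule).
            return h + np.random.normal(0.0, sigma0 / (t + 1), size=h.shape)
        hyp, score = decode_greedy(src, noise_fn)
        if score > best_score:
            best_hyp, best_score = hyp, score
    return best_hyp
```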
Deconstructing the Ladder Network Architecture
Manual labeling of data is, and will remain, a costly endeavor. For this
reason, semi-supervised learning remains a topic of practical importance. The
recently proposed Ladder Network is one such approach that has proven to be
very successful. In addition to the supervised objective, the Ladder Network
also adds an unsupervised objective corresponding to the reconstruction costs
of a stack of denoising autoencoders. Although the empirical results are
impressive, the Ladder Network has many components intertwined, whose
contributions are not obvious in such a complex architecture. In order to help
elucidate and disentangle the different ingredients in the Ladder Network
recipe, this paper presents an extensive experimental investigation of variants
of the Ladder Network in which we replace or remove individual components to
gain more insight into their relative importance. We find that all of the
components are necessary for achieving optimal performance, but they do not
contribute equally. For semi-supervised tasks, we conclude that the most
important contribution is made by the lateral connection, followed by the
application of noise, and finally the choice of what we refer to as the
`combinator function' in the decoder path. We also find that as the number of
labeled training examples increases, the lateral connections and reconstruction
criterion become less important, with most of the improvement in generalization
being due to the injection of noise in each layer. Furthermore, we present a
new type of combinator function that outperforms the original design in both
fully- and semi-supervised tasks, reducing record test error rates on
Permutation-Invariant MNIST to 0.57% for the supervised setting, and to 0.97%
and 1.0% for semi-supervised settings with 1000 and 100 labeled examples,
respectively.
Comment: Proceedings of the 33rd International Conference on Machine
Learning, New York, NY, USA, 2016.
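For orientation, the Ladder Network objective that these ablations dissect
combines a supervised cross-entropy term with layer-wise denoising
reconstruction costs; the notation below is a standard summary of the
original Ladder Network formulation rather than a formula quoted from this
paper:

\[
C \;=\; -\frac{1}{N}\sum_{n=1}^{N} \log P\!\left(\tilde{y}_n = y_n \mid x_n\right)
\;+\; \sum_{l=0}^{L} \lambda_l \,\bigl\| z^{(l)} - \hat{z}^{(l)} \bigr\|^{2},
\]

where \(z^{(l)}\) is the clean encoder activation at layer \(l\),
\(\hat{z}^{(l)}\) is its reconstruction in the decoder path (produced by the
combinator function from the noisy activation and the top-down signal), and
\(\lambda_l\) weights the reconstruction cost of layer \(l\).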
Adding noise to the input of a model trained with a regularized objective
Regularization is a well-studied problem in the context of neural networks.
It is usually used to improve generalization performance when the number of
input samples is relatively small or heavily contaminated with noise. A
parametric model can be regularized in several ways, including early stopping
(Morgan and Bourlard, 1990), weight decay, and output smoothing, all of which
are used to avoid overfitting during training of the considered model. From a
Bayesian point of view, many regularization techniques
correspond to imposing certain prior distributions on model parameters (Krogh
and Hertz, 1991). Using Bishop's approximation (Bishop, 1995) of the objective
function when a restricted type of noise is added to the input of a parametric
function, we derive the higher-order terms of the Taylor expansion and analyze
the coefficients of the regularization terms induced by the noisy input. In
particular, we study the effect on generalization performance of penalizing
the Hessian of the mapping function with respect to the input. We also show
how this coefficient can be controlled independently by explicitly penalizing
the Jacobian of the mapping function on corrupted inputs.
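As a brief reminder of the expansion this builds on (Bishop, 1995): for
additive isotropic Gaussian input noise \(\varepsilon \sim \mathcal{N}(0,
\sigma^{2} I)\) and a squared-error loss on a scalar-valued mapping \(f\), a
second-order Taylor expansion gives, up to terms of order \(\sigma^{4}\),

\[
\mathbb{E}_{\varepsilon}\bigl[(f(x+\varepsilon)-y)^{2}\bigr]
\;\approx\;
(f(x)-y)^{2}
\;+\; \sigma^{2}\,\bigl\|\nabla_{x} f(x)\bigr\|^{2}
\;+\; \sigma^{2}\,(f(x)-y)\,\operatorname{tr}\!\bigl(\nabla_{x}^{2} f(x)\bigr),
\]

so input noise induces a Jacobian (first-order) penalty together with a
Hessian-dependent term. The scalar-output, squared-error setting is chosen
here for compactness and is an assumption about the presentation, not a
restriction stated in the abstract.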
Regularizing Deep Neural Networks by Noise: Its Interpretation and Optimization
Overfitting is one of the most critical challenges in deep neural networks,
and there are various types of regularization methods to improve generalization
performance. Injecting noise into hidden units during training, e.g., dropout,
is known to be a successful regularizer, but it is still not clear why such
training techniques work well in practice or how to maximize their benefit in
the presence of two conflicting objectives: fitting the true data
distribution and preventing overfitting through regularization. This paper addresses
the above issues by 1) interpreting that the conventional training methods with
regularization by noise injection optimize the lower bound of the true
objective and 2) proposing a technique to achieve a tighter lower bound using
multiple noise samples per training example in a stochastic gradient descent
iteration. We demonstrate the effectiveness of our idea in several computer
vision applications.
Comment: NIPS 2017 camera-ready.
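A compact way to state the two points above: with training noise
\(\varepsilon\) (for example a dropout mask), conventional training maximizes
\(\mathbb{E}_{\varepsilon}[\log p(y \mid x, \varepsilon)]\), which by
Jensen's inequality lower-bounds the marginal likelihood; averaging \(K\)
noise samples per example inside the logarithm gives a tighter bound. The
notation is a paraphrase of the idea, not a formula quoted from the paper:

\[
\mathbb{E}_{\varepsilon}\bigl[\log p(y \mid x, \varepsilon)\bigr]
\;\le\;
\mathbb{E}_{\varepsilon_{1:K}}\!\Bigl[\log \tfrac{1}{K}\textstyle\sum_{k=1}^{K}
p(y \mid x, \varepsilon_{k})\Bigr]
\;\le\;
\log \mathbb{E}_{\varepsilon}\bigl[p(y \mid x, \varepsilon)\bigr].
\]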
Regularization for Deep Learning: A Taxonomy
Regularization is one of the crucial ingredients of deep learning, yet the
term regularization has various definitions, and regularization methods are
often studied separately from each other. In our work we present a systematic,
unifying taxonomy to categorize existing methods. We distinguish methods that
affect data, network architectures, error terms, regularization terms, and
optimization procedures. We do not provide all details about the listed
methods; instead, we present an overview of how the methods can be sorted into
meaningful categories and sub-categories. This helps reveal links and
fundamental similarities between them. Finally, we include practical
recommendations both for users and for developers of new regularization
methods.
GAR: An efficient and scalable Graph-based Activity Regularization for semi-supervised learning
In this paper, we propose a novel graph-based approach for semi-supervised
learning problems, which considers an adaptive adjacency of the examples
throughout the unsupervised portion of the training. Adjacency of the examples
is inferred using the predictions of a neural network model which is first
initialized by a supervised pretraining. These predictions are then updated
according to a novel unsupervised objective which regularizes another
adjacency, now linking the output nodes. Regularizing the adjacency of the
output nodes, inferred from the predictions of the network, creates an easier
optimization problem and ultimately drives the predictions of the network
toward the optimal embedding. Overall, the proposed framework provides an
effective and scalable graph-based solution that is natural to the
operational mechanism of deep neural networks. Our results show performance
comparable with state-of-the-art generative approaches for semi-supervised
learning in an easier-to-train, low-cost framework.
The many faces of deep learning
Deep learning has sparked a network of mutual interactions between different
disciplines and AI. Naturally, each discipline focuses and interprets the
workings of deep learning in different ways. This diversity of perspectives on
deep learning, from neuroscience to statistical physics, is a rich source of
inspiration that fuels novel developments in the theory and applications of
machine learning. In this perspective, we collect and synthesize different
intuitions scattered across several communities as to how deep learning works.
In particular, we will briefly discuss the different perspectives that
disciplines across mathematics, physics, computation, and neuroscience take on
how deep learning does its tricks. Our discussion on each perspective is
necessarily shallow due to the multiple views that had to be covered. The
deepness in this case should come from putting all these faces of deep learning
together in the reader's mind, so that one can look at the same problem from
different angles.
Comment: 18 pages.
Mean Shift Rejection: Training Deep Neural Networks Without Minibatch Statistics or Normalization
Deep convolutional neural networks are known to be unstable during training
at high learning rates unless normalization techniques are employed. Normalizing
weights or activations allows the use of higher learning rates, resulting in
faster convergence and higher test accuracy. Batch normalization requires
minibatch statistics that approximate the dataset statistics but this incurs
additional compute and memory costs and causes a communication bottleneck for
distributed training. Weight normalization and initialization-only schemes do
not achieve comparable test accuracy.
We introduce a new understanding of the cause of training instability and
provide a technique that is independent of normalization and minibatch
statistics. Our approach treats training instability as a spatial common mode
signal which is suppressed by placing the model on a channel-wise zero-mean
isocline that is maintained throughout training. Firstly, we apply channel-wise
zero-mean initialization of filter kernels with overall unity kernel magnitude.
At each training step we modify the gradients of spatial kernels so that their
weighted channel-wise mean is subtracted in order to maintain the common mode
rejection condition. This prevents the onset of mean shift. This new technique
allows direct training of the test graph so that training and test models are
identical. We also demonstrate that injecting random noise throughout the
network during training improves generalization. This is based on the idea
that, as a side effect, batch normalization performs deep data augmentation by
injecting minibatch noise due to the weakness of the dataset approximation.
Our technique achieves higher accuracy compared to batch normalization and
for the first time shows that minibatches and normalization are unnecessary for
state-of-the-art training.
Comment: under review at ECAI 2020.
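A very rough sketch of the two ingredients described above follows; which
axes are centred, the per-filter normalization, and the omission of the
gradient weighting are simplifying assumptions rather than the paper's exact
procedure.

```python
import numpy as np

def init_zero_mean_unit_filters(out_ch, in_ch, k):
    """Illustrative init: each kernel channel has zero spatial mean and each
    filter has unit overall magnitude."""
    w = np.random.randn(out_ch, in_ch, k, k)
    w -= w.mean(axis=(2, 3), keepdims=True)          # channel-wise zero mean
    norms = np.linalg.norm(w.reshape(out_ch, -1), axis=1)
    return w / norms.reshape(-1, 1, 1, 1)            # unit magnitude per filter

def reject_common_mode(grad):
    """Illustrative gradient hook: subtract the spatial mean of each kernel
    channel so that the zero-mean (common-mode rejection) condition is
    preserved by the parameter update."""
    return grad - grad.mean(axis=(2, 3), keepdims=True)
```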