9,763 research outputs found
Network of Experts for Large-Scale Image Categorization
We present a tree-structured network architecture for large scale image
classification. The trunk of the network contains convolutional layers
optimized over all classes. At a given depth, the trunk splits into separate
branches, each dedicated to discriminating a different subset of classes. Each
branch acts as an expert classifying a set of categories that are difficult to
tell apart, while the trunk provides common knowledge to all experts in the
form of shared features. The training of our "network of experts" is completely
end-to-end: the partition of categories into disjoint subsets is learned
simultaneously with the parameters of the network trunk and the experts are
trained jointly by minimizing a single learning objective over all classes. The
proposed structure can be built from any existing convolutional neural network
(CNN). We demonstrate its generality by adapting 4 popular CNNs for image
categorization into the form of networks of experts. Our experiments on
CIFAR100 and ImageNet show that in every case our method yields a substantial
improvement in accuracy over the base CNN, and gives the best result achieved
so far on CIFAR100. Finally, the improvement in accuracy comes at little
additional cost: compared to the base network, the training time is only
moderately increased and the number of parameters is comparable or in some
cases even lower.
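As a rough illustration of the trunk-and-branches layout described above, the PyTorch sketch below routes shared convolutional features into several expert heads and concatenates their scores into a single vector over all classes. The layer sizes, number of experts, and the fixed class-to-expert assignment are illustrative assumptions; the paper learns the partition of categories end-to-end, which this sketch does not implement.

```python
# Minimal sketch of a trunk-plus-experts architecture in PyTorch; sizes and the
# class-to-expert assignment are illustrative, not the paper's configuration.
import torch
import torch.nn as nn

class NetworkOfExperts(nn.Module):
    def __init__(self, num_experts=10, classes_per_expert=10):
        super().__init__()
        # Shared trunk: convolutional features optimized over all classes.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4),
        )
        # One branch per expert, each scoring only its own subset of classes.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Flatten(), nn.Linear(128 * 16, 256), nn.ReLU(),
                          nn.Linear(256, classes_per_expert))
            for _ in range(num_experts)
        ])

    def forward(self, x):
        shared = self.trunk(x)
        # Concatenating the expert outputs gives one score vector over all
        # classes, so a single loss trains trunk and experts jointly.
        return torch.cat([expert(shared) for expert in self.experts], dim=1)

logits = NetworkOfExperts()(torch.randn(2, 3, 32, 32))  # shape (2, 100)
```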
Adjusting for Dropout Variance in Batch Normalization and Weight Initialization
We show how to adjust for the variance introduced by dropout with corrections
to weight initialization and Batch Normalization, yielding higher accuracy.
Though dropout can preserve the expected input to a neuron between train and
test, the variance of the input differs. We thus propose a new weight
initialization by correcting for the influence of dropout rates and an
arbitrary nonlinearity's influence on variance through simple corrective
scalars. Since Batch Normalization trained with dropout estimates the variance
of a layer's incoming distribution with some inputs dropped, the variance also
differs between train and test. After training a network with Batch
Normalization and dropout, we simply update Batch Normalization's variance
moving averages with dropout off and obtain state-of-the-art accuracy on
CIFAR-10 and CIFAR-100 without data augmentation.
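A minimal sketch of the post-training correction described above, assuming a PyTorch model and data loader as placeholders: with dropout switched off, Batch Normalization's running statistics are re-estimated by running forward passes in training mode without any parameter updates.

```python
# Re-estimate Batch Normalization's moving averages with dropout disabled, so
# the stored variance matches the test-time input distribution.
import torch

def refresh_bn_statistics(model, train_loader, device="cpu"):
    model.train()  # BN layers update running statistics only in training mode.
    # Disable dropout so the re-estimated variance reflects undropped inputs.
    for module in model.modules():
        if isinstance(module, (torch.nn.Dropout, torch.nn.Dropout2d)):
            module.eval()
    with torch.no_grad():  # statistics only, no gradient updates
        for inputs, _ in train_loader:
            model(inputs.to(device))
    model.eval()
```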
The effect of Target Normalization and Momentum on Dying ReLU
Optimizing parameters with momentum, normalizing data values, and using
rectified linear units (ReLUs) are popular choices in neural network (NN)
regression. Despite their popularity, ReLUs can collapse to a constant
function and "die", effectively removing their contribution from the model.
While some mitigations are known, the underlying reasons of ReLUs dying during
optimization are currently poorly understood. In this paper, we consider the
effects of target normalization and momentum on dying ReLUs. We find
empirically that unit variance targets are well motivated and that ReLUs die
more easily when the target variance approaches zero. To further investigate this
matter, we analyze a discrete-time linear autonomous system, and show
theoretically how this relates to a model with a single ReLU and how common
properties can result in dying ReLU. We also analyze the gradients of a
single-ReLU model to identify saddle points and regions corresponding to dying
ReLU and how parameters evolve into these regions when momentum is used.
Finally, we show empirically that this problem persists, and is aggravated, in
deeper models, including residual networks.
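The NumPy snippet below is not part of the paper's analysis; it is a small diagnostic under the assumption that a ReLU unit counts as "dead" when its pre-activation is non-positive for every input in a sample, so its output and gradient contribution are identically zero.

```python
# Measure the fraction of dead ReLU units from a matrix of pre-activations.
import numpy as np

def dead_relu_fraction(pre_activations):
    """pre_activations: array of shape (num_samples, num_units)."""
    active = (pre_activations > 0).any(axis=0)   # unit fired at least once
    return 1.0 - active.mean()                   # fraction of dead units

rng = np.random.default_rng(0)
z = rng.normal(loc=-2.0, scale=0.5, size=(1000, 64))  # strongly negative inputs
print(dead_relu_fraction(z))  # close to 1.0: almost every unit is dead
```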
Gradient-based Hyperparameter Optimization through Reversible Learning
Tuning hyperparameters of learning algorithms is hard because gradients are
usually unavailable. We compute exact gradients of cross-validation performance
with respect to all hyperparameters by chaining derivatives backwards through
the entire training procedure. These gradients allow us to optimize thousands
of hyperparameters, including step-size and momentum schedules, weight
initialization distributions, richly parameterized regularization schemes, and
neural network architectures. We compute hyperparameter gradients by exactly
reversing the dynamics of stochastic gradient descent with momentum.
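The paper obtains exact hypergradients by reversing SGD-with-momentum so the training trajectory need not be stored; the sketch below illustrates the simpler, memory-hungry variant of the same idea, differentiating a validation loss through a fully unrolled inner training loop with respect to the learning rate. The tiny one-parameter quadratic problem is an illustrative assumption.

```python
# Hypergradient by backpropagating a validation loss through unrolled SGD.
import torch

x_train, x_val = torch.tensor(2.0), torch.tensor(3.0)   # toy "datasets"
log_lr = torch.tensor(-2.0, requires_grad=True)          # hyperparameter to tune

w = torch.tensor(0.0, requires_grad=True)                # inner model parameter
for _ in range(20):                                      # unrolled training loop
    train_loss = (w * x_train - 1.0) ** 2
    # create_graph=True keeps the inner gradient differentiable w.r.t. log_lr.
    grad_w, = torch.autograd.grad(train_loss, w, create_graph=True)
    w = w - torch.exp(log_lr) * grad_w                   # SGD step stays in the graph

val_loss = (w * x_val - 1.5) ** 2
val_loss.backward()              # hypergradient: d(val_loss) / d(log_lr)
print(log_lr.grad)
```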
Adding Gradient Noise Improves Learning for Very Deep Networks
Deep feedforward and recurrent networks have achieved impressive results in
many perception and language processing applications. This success is partially
attributed to architectural innovations such as convolutional and long
short-term memory networks. The main motivation for these architectural
innovations is that they capture better domain knowledge, and importantly are
easier to optimize than more basic architectures. Recently, more complex
architectures such as Neural Turing Machines and Memory Networks have been
proposed for tasks including question answering and general computation,
creating a new set of optimization challenges. In this paper, we discuss a
low-overhead and easy-to-implement technique of adding gradient noise which we
find to be surprisingly effective when training these very deep architectures.
The technique not only helps to avoid overfitting, but also can result in lower
training loss. This method alone allows a fully-connected 20-layer deep network
to be trained with standard gradient descent, even starting from a poor
initialization. We see consistent improvements for many complex models,
including a 72% relative reduction in error rate over a carefully-tuned
baseline on a challenging question-answering task, and a doubling of the number
of accurate binary multiplication models learned across 7,000 random restarts.
We encourage further application of this technique to additional complex modern
architectures.
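A minimal sketch of the technique on top of a PyTorch optimizer: after backpropagation, Gaussian noise is added to every gradient before the update, with a variance that decays as eta / (1 + t)^0.55 following the paper's annealing schedule; the model, optimizer, and loss arguments are placeholders.

```python
# One SGD step with annealed Gaussian gradient noise added before the update.
import torch

def noisy_training_step(model, optimizer, loss, step, eta=0.01, gamma=0.55):
    optimizer.zero_grad()
    loss.backward()
    std = (eta / (1 + step) ** gamma) ** 0.5   # noise scale decays over training
    with torch.no_grad():
        for p in model.parameters():
            if p.grad is not None:
                p.grad.add_(torch.randn_like(p.grad) * std)
    optimizer.step()
```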
A Walk with SGD
We present novel empirical observations regarding how stochastic gradient
descent (SGD) navigates the loss landscape of over-parametrized deep neural
networks (DNNs). These observations expose the qualitatively different roles of
learning rate and batch-size in DNN optimization and generalization.
Specifically we study the DNN loss surface along the trajectory of SGD by
interpolating the loss surface between parameters from consecutive
iterations and tracking various metrics during training. We find that
the loss interpolation between parameters before and after each training
iteration's update is roughly convex with a minimum (the "valley floor") in
between for most of the training. Based on this and other metrics, we deduce
that for most of the training update steps, SGD moves in valley like regions of
the loss surface by jumping from one valley wall to another at a height above
the valley floor. This 'bouncing between walls at a height' mechanism helps SGD
traverse a larger distance for small batch sizes and large learning rates,
which we find play qualitatively different roles in the dynamics. While a large
learning rate maintains a large height from the valley floor, a small batch
size injects noise facilitating exploration. We find this mechanism is crucial
for generalization because the valley floor has barriers and this exploration
above the valley floor allows SGD to quickly travel far away from the
initialization point (without being affected by barriers) and find flatter
regions, corresponding to better generalization.
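A minimal sketch of the interpolation probe described above, assuming placeholder model snapshots, loss function, and mini-batch: the loss is evaluated at points on the line segment between the parameters before and after a single SGD update.

```python
# Evaluate the training loss along the line between consecutive SGD iterates.
import copy
import torch

def interpolated_losses(model_before, model_after, loss_fn, batch, steps=11):
    xs, ys = batch
    params_a = [p.detach() for p in model_before.parameters()]
    params_b = [p.detach() for p in model_after.parameters()]
    probe = copy.deepcopy(model_before)
    losses = []
    for alpha in torch.linspace(0.0, 1.0, steps):
        with torch.no_grad():
            for p, a, b in zip(probe.parameters(), params_a, params_b):
                p.copy_((1 - alpha) * a + alpha * b)   # point on the segment
            losses.append(loss_fn(probe(xs), ys).item())
    return losses   # roughly convex with an interior minimum, per the paper
```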
WNGrad: Learn the Learning Rate in Gradient Descent
Adjusting the learning rate schedule in stochastic gradient methods is an
important unresolved problem which requires tuning in practice. If certain
parameters of the loss function such as smoothness or strong convexity
constants are known, theoretical learning rate schedules can be applied.
However, in practice, such parameters are not known, and the loss function of
interest is not convex in any case. The recently proposed batch normalization
reparametrization is widely adopted in most neural network architectures today
because, among other advantages, it is robust to the choice of Lipschitz
constant of the gradient in loss function, allowing one to set a large learning
rate without worry. Inspired by batch normalization, we propose a general
nonlinear update rule for the learning rate in batch and stochastic gradient
descent so that the learning rate can be initialized at a high value, and is
subsequently decreased according to gradient observations along the way. The
proposed method is shown to achieve robustness to the relationship between the
learning rate and the Lipschitz constant, and near-optimal convergence rates in
both the batch and stochastic settings (O(1/T) for smooth loss in the batch
setting, and O(1/√T) for convex loss in the stochastic setting). We
also show through numerical evidence that such robustness of the proposed
method extends to highly nonconvex and possibly non-smooth loss functions in
deep learning problems. Our analysis establishes a first theoretical
understanding of the observed robustness of batch normalization and weight
normalization.
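For concreteness, the sketch below follows the WNGrad-style recursion b_{k+1} = b_k + ||g_k||^2 / b_k with step size 1 / b_{k+1}, so the learning rate starts high and decreases according to the gradients observed along the way; the quadratic test objective and the initial value b_0 are illustrative assumptions.

```python
# Gradient descent whose learning rate 1/b adapts to the observed gradients.
import numpy as np

def wngrad(grad_fn, x0, b0=1.0, num_steps=500):
    x, b = np.asarray(x0, dtype=float), float(b0)
    for _ in range(num_steps):
        g = grad_fn(x)
        b = b + np.dot(g, g) / b   # larger gradients shrink the learning rate faster
        x = x - g / b
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is simply x.
print(wngrad(lambda x: x, x0=[5.0, -3.0]))   # approaches the minimizer at 0
```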
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks
Neural network pruning techniques can reduce the parameter counts of trained
networks by over 90%, decreasing storage requirements and improving
computational performance of inference without compromising accuracy. However,
contemporary experience is that the sparse architectures produced by pruning
are difficult to train from the start, which would similarly improve training
performance.
We find that a standard pruning technique naturally uncovers subnetworks
whose initializations made them capable of training effectively. Based on these
results, we articulate the "lottery ticket hypothesis:" dense,
randomly-initialized, feed-forward networks contain subnetworks ("winning
tickets") that - when trained in isolation - reach test accuracy comparable to
the original network in a similar number of iterations. The winning tickets we
find have won the initialization lottery: their connections have initial
weights that make training particularly effective.
We present an algorithm to identify winning tickets and a series of
experiments that support the lottery ticket hypothesis and the importance of
these fortuitous initializations. We consistently find winning tickets that are
less than 10-20% of the size of several fully-connected and convolutional
feed-forward architectures for MNIST and CIFAR10. Above this size, the winning
tickets that we find learn faster than the original network and reach higher
test accuracy.
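A simplified sketch of extracting a winning ticket in PyTorch, using one-shot global magnitude pruning (the paper typically prunes per layer and often iteratively); `train(model, mask)` is an assumed helper that trains the model while keeping masked weights at zero.

```python
# Train, prune the smallest-magnitude weights, rewind survivors to their
# original initialization, and retrain the sparse subnetwork.
import copy
import torch

def find_winning_ticket(model, train, prune_fraction=0.8):
    initial_state = copy.deepcopy(model.state_dict())   # remember the lottery draw
    train(model, mask=None)                             # train the dense network

    # Global magnitude pruning: drop the smallest prune_fraction of all weights.
    all_weights = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    threshold = torch.quantile(all_weights, prune_fraction)
    mask = {name: (p.detach().abs() > threshold).float()
            for name, p in model.named_parameters()}

    # Rewind the surviving weights to their original initialization.
    model.load_state_dict(initial_state)
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(mask[name])
    train(model, mask=mask)                              # retrain the subnetwork
    return model, mask
```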
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
We present weight normalization: a reparameterization of the weight vectors
in a neural network that decouples the length of those weight vectors from
their direction. By reparameterizing the weights in this way we improve the
conditioning of the optimization problem and we speed up convergence of
stochastic gradient descent. Our reparameterization is inspired by batch
normalization but does not introduce any dependencies between the examples in a
minibatch. This means that our method can also be applied successfully to
recurrent models such as LSTMs and to noise-sensitive applications such as deep
reinforcement learning or generative models, for which batch normalization is
less well suited. Although our method is much simpler, it still provides much
of the speed-up of full batch normalization. In addition, the computational
overhead of our method is lower, permitting more optimization steps to be taken
in the same amount of time. We demonstrate the usefulness of our method on
applications in supervised image recognition, generative modelling, and deep
reinforcement learning.
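A minimal sketch of the reparameterization for a single linear layer in PyTorch: each weight vector is expressed as w = g * v / ||v||, so optimization acts on the scale g and the direction v separately (PyTorch also ships this as torch.nn.utils.weight_norm).

```python
# A weight-normalized linear layer: length from g, direction from v.
import torch
import torch.nn as nn

class WeightNormLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.v = nn.Parameter(torch.randn(out_features, in_features) * 0.05)
        self.g = nn.Parameter(torch.ones(out_features))   # per-unit scale
        self.bias = nn.Parameter(torch.zeros(out_features))

    def forward(self, x):
        w = self.g.unsqueeze(1) * self.v / self.v.norm(dim=1, keepdim=True)
        return torch.nn.functional.linear(x, w, self.bias)

layer = WeightNormLinear(8, 4)
print(layer(torch.randn(2, 8)).shape)   # torch.Size([2, 4])
```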
Resnet in Resnet: Generalizing Residual Architectures
Residual networks (ResNets) have recently achieved state-of-the-art on
challenging computer vision tasks. We introduce Resnet in Resnet (RiR): a deep
dual-stream architecture that generalizes ResNets and standard CNNs and is
easily implemented with no computational overhead. RiR consistently improves
performance over ResNets, outperforms architectures with similar amounts of
augmentation on CIFAR-10, and establishes a new state-of-the-art on CIFAR-100
- …
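The listing is truncated here, so the sketch below is only one reading of the dual-stream idea: a generalized residual block keeps a residual stream with an identity shortcut and a transient stream without one, and each stream receives convolutions from both; the channel counts and exact wiring are illustrative assumptions rather than the paper's configuration.

```python
# A dual-stream (residual + transient) block in the spirit of Resnet in Resnet.
import torch
import torch.nn as nn

class GeneralizedResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        def conv():
            return nn.Conv2d(channels, channels, 3, padding=1)
        self.r_to_r, self.t_to_r = conv(), conv()   # into the residual stream
        self.r_to_t, self.t_to_t = conv(), conv()   # into the transient stream
        self.relu = nn.ReLU()

    def forward(self, r, t):
        r_next = self.relu(self.r_to_r(r) + self.t_to_r(t) + r)  # identity shortcut
        t_next = self.relu(self.r_to_t(r) + self.t_to_t(t))      # no shortcut
        return r_next, t_next

block = GeneralizedResidualBlock(16)
r, t = block(torch.randn(1, 16, 8, 8), torch.randn(1, 16, 8, 8))
```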