31,520 research outputs found
HyperAdam: A Learnable Task-Adaptive Adam for Network Training
Deep neural networks are traditionally trained using human-designed
stochastic optimization algorithms, such as SGD and Adam. Recently, the
approach of learning to optimize network parameters has emerged as a promising
research topic. However, these learned black-box optimizers sometimes do not
fully exploit the experience embodied in human-designed optimizers and therefore
have limited generalization ability. In this paper, a new optimizer, dubbed
\textit{HyperAdam}, is proposed that combines the idea of "learning to
optimize" with the traditional Adam optimizer. Given a network for training, its
parameter update in each iteration generated by HyperAdam is an adaptive
combination of multiple updates generated by Adam with varying decay rates. The
combination weights and decay rates in HyperAdam are adaptively learned
depending on the task. HyperAdam is modeled as a recurrent neural network with
AdamCell, WeightCell and StateCell. It is shown to achieve state-of-the-art
performance for training various networks, such as multilayer perceptrons, CNNs,
and LSTMs.
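The abstract does not spell out the cell structure, but the core idea of combining several Adam-style updates with different decay rates can be sketched as below. This is a minimal illustration only: the decay-rate grid and the uniform combination weights are placeholder assumptions standing in for the learned, task-adaptive components.

```python
import numpy as np

def hyperadam_like_update(grad, state, betas2=(0.9, 0.99, 0.999),
                          beta1=0.9, lr=1e-3, eps=1e-8):
    """Illustrative combination of several Adam candidate updates computed
    with different second-moment decay rates. In HyperAdam the combination
    weights come from a learned recurrent network; here they are uniform."""
    t = state["t"] = state.get("t", 0) + 1
    m = state["m"] = beta1 * state.get("m", 0.0) + (1 - beta1) * grad
    m_hat = m / (1 - beta1 ** t)

    v_list = state.setdefault("v", [0.0] * len(betas2))
    candidates = []
    for i, beta2 in enumerate(betas2):
        v_list[i] = beta2 * v_list[i] + (1 - beta2) * grad ** 2
        v_hat = v_list[i] / (1 - beta2 ** t)
        candidates.append(m_hat / (np.sqrt(v_hat) + eps))

    # Placeholder for the learned, task-adaptive weights.
    weights = np.ones(len(betas2)) / len(betas2)
    return -lr * sum(w * u for w, u in zip(weights, candidates))
```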
Three Mechanisms of Weight Decay Regularization
Weight decay is one of the standard tricks in the neural network toolbox, but
the reasons for its regularization effect are poorly understood, and recent
results have cast doubt on the traditional interpretation in terms of L2
regularization. Literal weight decay has been shown to outperform L2
regularization for optimizers for which the two differ. We empirically investigate
weight decay for three optimization algorithms (SGD, Adam, and K-FAC) and a
variety of network architectures. We identify three distinct mechanisms by
which weight decay exerts a regularization effect, depending on the particular
optimization algorithm and architecture: (1) increasing the effective learning
rate, (2) approximately regularizing the input-output Jacobian norm, and (3)
reducing the effective damping coefficient for second-order optimization. Our
results provide insight into how to improve the regularization of neural
networks.
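The distinction between literal weight decay and L2 regularization only matters for optimizers that rescale gradients. A minimal sketch of the two update rules for a generic preconditioned step is given below; the scalar `precond` is an illustrative stand-in for whatever rescaling the optimizer applies, not the paper's formulation.

```python
def step_l2(w, grad, precond, lr, lam):
    # L2 regularization: the penalty gradient lam * w is added to the loss
    # gradient and therefore passes through the preconditioner.
    return w - lr * precond * (grad + lam * w)

def step_decoupled(w, grad, precond, lr, lam):
    # Literal (decoupled) weight decay: the parameters are shrunk directly,
    # so the decay term is not rescaled by the preconditioner.
    return w - lr * precond * grad - lr * lam * w

# For plain SGD (precond == 1) the two rules coincide; for adaptive or
# second-order methods they differ, which is the case the paper studies.
```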
Adam Induces Implicit Weight Sparsity in Rectifier Neural Networks
In recent years, deep neural networks (DNNs) have been applied to various
machine learning tasks, including image recognition, speech recognition, and
machine translation. However, large DNN models are needed to achieve
state-of-the-art performance, and their size exceeds the capabilities of edge devices. Model
reduction is thus needed for practical use. In this paper, we point out that
deep learning automatically induces group sparsity of weights, in which all
weights connected to an output channel (node) are zero, when training DNNs
under the following three conditions: (1) rectified-linear-unit (ReLU)
activations, (2) an L2-regularized objective function, and (3) the Adam
optimizer. Next, we analyze this behavior both theoretically and
experimentally, and propose a simple model reduction method: eliminate the zero
weights after training the DNN. In experiments on MNIST and CIFAR-10 datasets,
we demonstrate the sparsity with various training setups. Finally, we show that
our method can efficiently reduce the model size and performs well relative to
methods that use a sparsity-inducing regularizer.
Comment: 8 pages, 7 figures, 6 tables, 2018 17th IEEE International Conference
on Machine Learning and Applications (ICMLA).
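The reduction step described above, removing units whose incoming weights have been driven to zero, can be sketched as follows for a fully connected layer. The tolerance and the layer shapes are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def prune_zero_units(W, b, W_next, tol=1e-8):
    """Drop hidden units whose incoming weights and bias are all (near) zero
    after training, and remove the matching columns of the next layer.
    Shapes: W is (out, in), b is (out,), W_next is (next_out, out)."""
    alive = np.any(np.abs(W) > tol, axis=1) | (np.abs(b) > tol)
    return W[alive], b[alive], W_next[:, alive]
```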
Neuromodulated Learning in Deep Neural Networks
In the brain, learning signals change over time and synaptic location, and
are applied based on the learning history at the synapse, in the complex
process of neuromodulation. Learning in artificial neural networks, on the
other hand, is shaped by hyper-parameters set before learning starts, which
remain static throughout learning, and which are uniform for the entire
network. In this work, we propose a method of deep artificial neuromodulation
which applies the concepts of biological neuromodulation to stochastic gradient
descent. Evolved neuromodulatory dynamics modify learning parameters at each
layer in a deep neural network over the course of the network's training. We
show that the same neuromodulatory dynamics can be applied to different models
and can scale to new problems not encountered during evolution. Finally, we
examine the evolved neuromodulation, showing that evolution found dynamic,
location-specific learning strategies.
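The evolved dynamics themselves are not given in the abstract; the general mechanism, a per-layer and time-varying rescaling of the learning rate applied to SGD, might be sketched as below, with the `modulate` function standing in for the evolved neuromodulatory rule.

```python
def modulated_sgd_step(params_by_layer, grads_by_layer, base_lr, step, modulate):
    """SGD with a per-layer, time-varying learning-rate scale.
    `modulate(layer_index, step)` is a placeholder for the evolved dynamics,
    which in the paper also depend on the learning history at each layer."""
    for i, (w, g) in enumerate(zip(params_by_layer, grads_by_layer)):
        w -= base_lr * modulate(i, step) * g  # in-place update of each layer
```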
Implementation of Fruits Recognition Classifier using Convolutional Neural Network Algorithm for Observation of Accuracies for Various Hidden Layers
Fruit recognition using Deep Convolutional Neural Network (CNN) is one of the
most promising applications in computer vision. In recent times, deep learning
based classifications are making it possible to recognize fruits from images.
However, fruit recognition remains difficult for stacked fruits on a
weighing scale because of their complexity and similarity. In this paper, a fruit
recognition system using CNN is proposed. The proposed method uses deep
learning techniques for the classification. We have used Fruits-360 dataset for
the evaluation. From this dataset, we have assembled a subset that
contains 17,823 images from 25 different categories. The images are divided
into training and test dataset. Moreover, for the classification accuracies, we
have used various combinations of hidden layer and epochs for different cases
and made a comparison between them. The overall performance losses of the
network for the different cases were also observed. Finally, we have achieved the best
test accuracy of 100% and a training accuracy of 99.79%.
Comment: 4 pages, 5 figures.
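A small CNN whose depth is controlled by a single argument, in the spirit of the comparison across numbers of hidden layers described above, could look like the sketch below. The channel sizes and pooling choices are illustrative assumptions, not the authors' exact architecture.

```python
import torch.nn as nn

def build_fruit_cnn(num_hidden_conv=2, num_classes=25, in_channels=3):
    """Small CNN whose depth is set by `num_hidden_conv`, so accuracy can be
    compared across different numbers of hidden convolutional layers."""
    layers, channels = [], in_channels
    for i in range(num_hidden_conv):
        out = 32 * (i + 1)
        layers += [nn.Conv2d(channels, out, kernel_size=3, padding=1),
                   nn.ReLU(),
                   nn.MaxPool2d(2)]
        channels = out
    # Global pooling avoids hard-coding the input resolution.
    layers += [nn.AdaptiveAvgPool2d(1), nn.Flatten(),
               nn.Linear(channels, num_classes)]
    return nn.Sequential(*layers)
```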
Neumann Optimizer: A Practical Optimization Algorithm for Deep Neural Networks
Progress in deep learning is slowed by the days or weeks it takes to train
large models. The natural solution of using more hardware is limited by
diminishing returns, and leads to inefficient use of additional resources. In
this paper, we present a large batch, stochastic optimization algorithm that is
both faster than widely used algorithms for fixed amounts of computation, and
also scales up substantially better as more computational resources become
available. Our algorithm implicitly computes the inverse Hessian of each
mini-batch to produce descent directions; we do so without either an explicit
approximation to the Hessian or Hessian-vector products. We demonstrate the
effectiveness of our algorithm by successfully training large ImageNet models
(Inception-V3, Resnet-50, Resnet-101 and Inception-Resnet-V2) with mini-batch
sizes of up to 32000 with no loss in validation error relative to current
baselines, and no increase in the total number of steps. At smaller mini-batch
sizes, our optimizer improves the validation error in these models by 0.8-0.9%.
Alternatively, we can trade off this accuracy to reduce the number of training
steps needed by roughly 10-30%. Our work is practical and easily usable by
others -- only one hyperparameter (learning rate) needs tuning, and
furthermore, the algorithm is as computationally cheap as the commonly used
Adam optimizer.
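For background only (this is the standard identity the optimizer's name alludes to, not the paper's specific algorithm): the Neumann series expresses a matrix inverse as a power series, which is the sense in which an inverse Hessian can be applied implicitly rather than formed explicitly.

```latex
% Neumann series: for a matrix A with spectral radius less than 1,
% the inverse of (I - A) is the sum of its powers, so an inverse product
% can in principle be approximated by a truncated sum of repeated applications.
\[
  (I - A)^{-1} \;=\; \sum_{k=0}^{\infty} A^{k}, \qquad \rho(A) < 1 .
\]
```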
Training Deep Neural Network in Limited Precision
Energy and resource efficient training of DNNs will greatly extend the
applications of deep learning. However, there are three major obstacles which
mandate accurate calculation in high precision. In this paper, we tackle two of
them related to the loss of gradients during parameter update and
backpropagation through a softmax nonlinearity layer in low precision training.
We implement SGD with Kahan summation, employing an additional parameter to
virtually extend the bit-width of the parameters for a reliable parameter
update. We also propose a simple guideline to help select the appropriate
bit-width for the last FC layer followed by a softmax nonlinearity layer. It
determines the lower bound of the required bit-width based on the class size of
the dataset. Extensive experiments on various network architectures and
benchmarks verify the effectiveness of the proposed technique for low
precision training.
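A minimal sketch of a Kahan-compensated SGD update of the kind described above: the extra per-parameter buffer captures the low-order bits that would otherwise be lost when a small update is added to a low-precision weight. The dtypes and function shape are illustrative, not the paper's implementation.

```python
import numpy as np

def kahan_sgd_step(w, c, grad, lr):
    """One SGD step with Kahan (compensated) summation.
    w    : low-precision parameters (e.g. float16)
    c    : compensation buffer, same shape and dtype as w
    Returns the updated (w, c)."""
    update = (-lr * grad).astype(w.dtype)
    y = update - c          # subtract the error carried over from previous steps
    t = w + y               # low-precision add; low-order bits of y are lost here
    c = (t - w) - y         # recover what was lost, to be re-applied next step
    return t, c
```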
Learning to learn by gradient descent by gradient descent
The move from hand-designed features to learned features in machine learning
has been wildly successful. In spite of this, optimization algorithms are still
designed by hand. In this paper we show how the design of an optimization
algorithm can be cast as a learning problem, allowing the algorithm to learn to
exploit structure in the problems of interest in an automatic way. Our learned
algorithms, implemented by LSTMs, outperform generic, hand-designed competitors
on the tasks for which they are trained, and also generalize well to new tasks
with similar structure. We demonstrate this on a number of tasks, including
simple convex problems, training neural networks, and styling images with
neural art.
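The core mechanism is a coordinate-wise recurrent network that maps each parameter's gradient (plus its hidden state) to an update, and is itself trained by unrolling the optimizee's loss over many steps. A minimal sketch follows; the hidden size is arbitrary and the paper's gradient preprocessing and meta-training loop are omitted.

```python
import torch
import torch.nn as nn

class LearnedOptimizer(nn.Module):
    """Coordinate-wise LSTM optimizer: every parameter coordinate is a
    separate batch element sharing the same small LSTM."""
    def __init__(self, hidden_size=20):
        super().__init__()
        self.lstm = nn.LSTMCell(1, hidden_size)
        self.out = nn.Linear(hidden_size, 1)

    def forward(self, grad_flat, state=None):
        # grad_flat: (num_params,) flattened gradient of the optimizee
        h, c = self.lstm(grad_flat.unsqueeze(-1), state)
        update = self.out(h).squeeze(-1)
        return update, (h, c)

# Optimizee step: theta <- theta + update; the optimizer's own weights are
# trained by backpropagating through many such unrolled steps (not shown).
```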
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
Training large deep neural networks on massive datasets is computationally
very challenging. There has been a recent surge of interest in using large batch
stochastic optimization methods to tackle this issue. The most prominent
algorithm in this line of research is LARS, which by employing layerwise
adaptive learning rates trains ResNet on ImageNet in a few minutes. However,
LARS performs poorly for attention models like BERT, indicating that its
performance gains are not consistent across tasks. In this paper, we first
study a principled layerwise adaptation strategy to accelerate training of deep
neural networks using large mini-batches. Using this strategy, we develop a new
layerwise adaptive large batch optimization technique called LAMB; we then
provide convergence analysis of LAMB as well as LARS, showing convergence to a
stationary point in general nonconvex settings. Our empirical results
demonstrate the superior performance of LAMB across various tasks such as BERT
and ResNet-50 training with very little hyperparameter tuning. In particular,
for BERT training, our optimizer enables use of very large batch sizes of 32868
without any degradation of performance. By increasing the batch size to the
memory limit of a TPUv3 Pod, BERT training time can be reduced from 3 days to
just 76 minutes (Table 1). The LAMB implementation is available at
https://github.com/tensorflow/addons/blob/master/tensorflow_addons/optimizers/lamb.py
Comment: Published as a conference paper at ICLR 2020.
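A minimal sketch of a LAMB-style layerwise step: an Adam-like direction with decoupled weight decay is computed per layer and then rescaled by the trust ratio between the weight norm and the update norm. Constants and the exact clamping of the trust ratio vary between implementations; this follows the commonly described form, not necessarily the linked code.

```python
import numpy as np

def lamb_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    """One LAMB-style update for a single layer's weight tensor."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    r = m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w   # Adam direction + decay
    w_norm, r_norm = np.linalg.norm(w), np.linalg.norm(r)
    trust = w_norm / r_norm if w_norm > 0 and r_norm > 0 else 1.0
    return w - lr * trust * r, m, v
```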
Deep Learning Scaling is Predictable, Empirically
Deep learning (DL) creates impactful advances following a virtuous recipe:
model architecture search, creating large training data sets, and scaling
computation. It is widely believed that growing training sets and models should
improve accuracy and result in better products. As DL application domains grow,
we would like a deeper understanding of the relationships between training set
size, computational scale, and model accuracy improvements to advance the
state-of-the-art.
This paper presents a large scale empirical characterization of
generalization error and model size growth as training sets grow. We introduce
a methodology for this measurement and test four machine learning domains:
machine translation, language modeling, image processing, and speech
recognition. Our empirical results show power-law generalization error scaling
across a breadth of factors, resulting in power-law exponents---the "steepness"
of the learning curve---yet to be explained by theoretical work. Further, model
improvements only shift the error but do not appear to affect the power-law
exponent. We also show that model size scales sublinearly with data size. These
scaling relationships have significant implications on deep learning research,
practice, and systems. They can assist model debugging, setting accuracy
targets, and decisions about data set growth. They can also guide computing
system design and underscore the importance of continued computational scaling.
Comment: 19 pages, 11 figures.
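The kind of measurement the paper describes amounts to fitting a power law, error(m) = a * m^b, to validation error versus training-set size, which is a linear regression in log-log space. A minimal sketch with placeholder data values (not the paper's results) is shown below.

```python
import numpy as np

# Placeholder measurements: training-set sizes and matching validation errors.
sizes = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
errors = np.array([0.30, 0.24, 0.19, 0.15, 0.12])

# Fit log(error) = log(a) + b * log(size); b is the power-law exponent,
# i.e. the "steepness" of the learning curve.
b, log_a = np.polyfit(np.log(sizes), np.log(errors), 1)
print(f"exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.3f}")
```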