SHADE: Information Based Regularization for Deep Learning
Regularization is a central challenge in training deep neural networks. In this
paper, we propose a new information-theory-based regularization scheme named
SHADE for SHAnnon DEcay. The originality of the approach is to define a prior
based on conditional entropy, which explicitly decouples the learning of
invariant representations in the regularizer and the learning of correlations
between inputs and labels in the data fitting term. Our second contribution is
to derive a stochastic version of the regularizer compatible with deep
learning, resulting in a tractable training scheme. We empirically validate the
ability of our approach to improve classification performance compared with
common regularization schemes on several standard architectures.
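As a rough illustration of the decoupling described above, here is a minimal PyTorch-style sketch; the class_conditional_penalty function is only an illustrative proxy for a conditional-entropy prior (it penalizes within-class feature variance), not the actual SHADE estimator, and all names are hypothetical.

    import torch
    import torch.nn.functional as F

    def class_conditional_penalty(features, labels, num_classes):
        # Illustrative proxy for a conditional-entropy prior: penalize the
        # within-class variance of the representation, which encourages
        # features that are invariant given the label.
        penalty = features.new_zeros(())
        for c in range(num_classes):
            mask = labels == c
            if mask.sum() > 1:
                penalty = penalty + features[mask].var(dim=0, unbiased=False).mean()
        return penalty / num_classes

    def shade_style_loss(logits, features, labels, num_classes, lam=0.1):
        # Data-fitting term (input-label correlations) plus a decoupled
        # regularizer acting only on the representation (invariance).
        return F.cross_entropy(logits, labels) + lam * class_conditional_penalty(
            features, labels, num_classes)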
Dynamic Sparse Training: Find Efficient Sparse Network From Scratch With Trainable Masked Layers
We present a novel network pruning algorithm called Dynamic Sparse Training
that can jointly find the optimal network parameters and sparse network
structure in a unified optimization process with trainable pruning thresholds.
These thresholds are adjusted dynamically, in a fine-grained and layer-wise manner, via
backpropagation. We demonstrate that our dynamic sparse training algorithm can
easily train very sparse neural network models with little performance loss
using the same number of training epochs as dense models. Dynamic Sparse
Training achieves state-of-the-art performance compared with other sparse
training algorithms on various network architectures. Additionally, we have
several surprising observations that provide strong evidence for the
effectiveness and efficiency of our algorithm. These observations reveal the
underlying problems of traditional three-stage pruning algorithms and show how
our algorithm can guide the design of more compact network architectures.
Comment: ICLR 2020, camera ready version
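A minimal PyTorch-style sketch of a trainable masked layer in this spirit is shown below; the hard step with a straight-through surrogate gradient is an assumption about how the threshold receives gradients, not necessarily the paper's exact formulation, and the class names are hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BinaryStep(torch.autograd.Function):
        # Hard step in the forward pass, straight-through-style surrogate
        # gradient in the backward pass so the threshold stays trainable.
        @staticmethod
        def forward(ctx, x):
            return (x > 0).float()

        @staticmethod
        def backward(ctx, grad_output):
            return grad_output  # simple straight-through surrogate

    class MaskedLinear(nn.Module):
        def __init__(self, in_features, out_features):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
            self.bias = nn.Parameter(torch.zeros(out_features))
            # One trainable pruning threshold per output unit (fine-grained,
            # layer-wise thresholds adjusted by backpropagation).
            self.threshold = nn.Parameter(torch.zeros(out_features, 1))

        def forward(self, x):
            mask = BinaryStep.apply(self.weight.abs() - self.threshold)
            return F.linear(x, self.weight * mask, self.bias)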
Network Deconvolution
Convolution is a central operation in Convolutional Neural Networks (CNNs),
which applies a kernel to overlapping regions shifted across the image.
However, because of the strong correlations in real-world image data,
convolutional kernels are in effect re-learning redundant data. In this work,
we show that this redundancy makes neural network training challenging, and
propose network deconvolution, a procedure which optimally removes pixel-wise
and channel-wise correlations before the data is fed into each layer. Network
deconvolution can be efficiently calculated at a fraction of the computational
cost of a convolution layer. We also show that the deconvolution filters in the
first layer of the network resemble the center-surround structure found in
biological neurons in the visual regions of the brain. Filtering with such
kernels results in a sparse representation, a desired property that has been
missing in the training of neural networks. Learning from the sparse
representation promotes faster convergence and superior results without the use
of batch normalization. We apply our network deconvolution operation to 10
modern neural network models by replacing batch normalization within each.
Extensive experiments show that the network deconvolution operation is able to
deliver performance improvement in all cases on the CIFAR-10, CIFAR-100, MNIST,
Fashion-MNIST, Cityscapes, and ImageNet datasets.
Comment: ICLR 2020
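A simplified sketch of the channel-wise decorrelation step (ZCA-style whitening of a layer's input) is given below; the full network deconvolution also removes pixel-wise correlations and uses more efficient approximations, so this is only an illustration.

    import torch

    def channel_deconvolve(x, eps=1e-5):
        # x: (N, C, H, W). Whiten across channels with an inverse square root
        # of the channel covariance (ZCA-style) -- an approximation of the
        # channel-wise decorrelation described in the abstract.
        n, c, h, w = x.shape
        flat = x.permute(1, 0, 2, 3).reshape(c, -1)        # (C, N*H*W)
        flat = flat - flat.mean(dim=1, keepdim=True)
        cov = flat @ flat.t() / flat.shape[1] + eps * torch.eye(c, device=x.device, dtype=x.dtype)
        eigvals, eigvecs = torch.linalg.eigh(cov)
        inv_sqrt = eigvecs @ torch.diag(eigvals.clamp_min(eps).rsqrt()) @ eigvecs.t()
        white = inv_sqrt @ flat
        return white.reshape(c, n, h, w).permute(1, 0, 2, 3)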
Soft Threshold Weight Reparameterization for Learnable Sparsity
Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus
of maximizing prediction accuracy given an overall parameter budget. Existing
methods rely on uniform or heuristic non-uniform sparsity budgets which have
sub-optimal layer-wise parameter allocation resulting in a) lower prediction
accuracy or b) higher inference cost (FLOPs). This work proposes Soft Threshold
Reparameterization (STR), a novel use of the soft-threshold operator on DNN
weights. STR smoothly induces sparsity while learning pruning thresholds,
thereby obtaining a non-uniform sparsity budget. Our method achieves
state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and
MobileNetV1 on ImageNet-1K), and, additionally, learns non-uniform budgets that
empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy
over existing results by up to 10% in the ultra sparse (99%) regime and can
also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is
a simple mechanism which learns effective sparsity budgets that contrast with
popular heuristics. Code, pretrained models and sparsity budgets are at
https://github.com/RAIVNLab/STR.
Comment: 19 pages, 10 figures, published at the International Conference on
Machine Learning (ICML) 2020
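A minimal sketch of the soft-threshold reparameterization on a linear layer is shown below, assuming the threshold function g(s) is a sigmoid of a learnable per-layer parameter; the class name and initial value are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class STRLinear(nn.Module):
        # Soft-threshold reparameterization: the effective weight is
        # sign(W) * relu(|W| - g(s)), with a learnable threshold parameter s.
        def __init__(self, in_features, out_features, init_s=-5.0):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
            self.bias = nn.Parameter(torch.zeros(out_features))
            self.s = nn.Parameter(torch.tensor(init_s))  # per-layer threshold parameter

        def forward(self, x):
            threshold = torch.sigmoid(self.s)            # g(s), assumed sigmoid
            sparse_w = torch.sign(self.weight) * F.relu(self.weight.abs() - threshold)
            return F.linear(x, sparse_w, self.bias)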
Visualizing the Loss Landscape of Neural Nets
Neural network training relies on our ability to find "good" minimizers of
highly non-convex loss functions. It is well-known that certain network
architecture designs (e.g., skip connections) produce loss functions that train
easier, and well-chosen training parameters (batch size, learning rate,
optimizer) produce minimizers that generalize better. However, the reasons for
these differences, and their effects on the underlying loss landscape, are not
well understood. In this paper, we explore the structure of neural loss
functions, and the effect of loss landscapes on generalization, using a range
of visualization methods. First, we introduce a simple "filter normalization"
method that helps us visualize loss function curvature and make meaningful
side-by-side comparisons between loss functions. Then, using a variety of
visualizations, we explore how network architecture affects the loss landscape,
and how training parameters affect the shape of minimizers.
Comment: NIPS 2018 (extended version, 10.5 pages), code is available at
https://github.com/tomgoldstein/loss-landscape
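A rough sketch of filter normalization for a one-dimensional loss slice is given below, assuming a PyTorch model: each filter of a random direction is rescaled to match the norm of the corresponding filter of the trained weights before the loss is evaluated along that direction. Zeroing bias/BatchNorm directions is a common simplification, not necessarily the paper's exact choice.

    import torch

    def filter_normalized_direction(model):
        # Random direction d with each filter rescaled so that
        # ||d_f|| == ||w_f|| for the corresponding filter of the model.
        direction = []
        for p in model.parameters():
            d = torch.randn_like(p)
            if p.dim() > 1:
                for d_f, w_f in zip(d, p.detach()):
                    d_f.mul_(w_f.norm() / (d_f.norm() + 1e-10))
            else:
                d.zero_()  # ignore bias/BN directions (a common simplification)
            direction.append(d)
        return direction

    def loss_along_direction(model, direction, loss_fn, alphas):
        # Evaluate the loss at theta + alpha * d for each alpha; loss_fn(model)
        # should return a scalar loss on a fixed batch or dataset.
        base = [p.detach().clone() for p in model.parameters()]
        values = []
        with torch.no_grad():
            for alpha in alphas:
                for p, b, d in zip(model.parameters(), base, direction):
                    p.copy_(b + alpha * d)
                values.append(float(loss_fn(model)))
            for p, b in zip(model.parameters(), base):
                p.copy_(b)
        return values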
AdaBatch: Adaptive Batch Sizes for Training Deep Neural Networks
Training deep neural networks with Stochastic Gradient Descent, or its
variants, requires careful choice of both learning rate and batch size. While
smaller batch sizes generally converge in fewer training epochs, larger batch
sizes offer more parallelism and hence better computational efficiency. We have
developed a new training approach that, rather than statically choosing a
single batch size for all epochs, adaptively increases the batch size during
the training process. Our method delivers the convergence rate of small batch
sizes while achieving performance similar to large batch sizes. We analyse our
approach using the standard AlexNet, ResNet, and VGG networks operating on the
popular CIFAR-10, CIFAR-100, and ImageNet datasets. Our results demonstrate
that learning with adaptive batch sizes can improve performance by factors of
up to 6.25 on 4 NVIDIA Tesla P100 GPUs while changing accuracy by less than 1%
relative to training with fixed batch sizes.
Comment: 14 pages
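As an illustration of an increasing batch-size schedule of this kind, a small sketch follows (doubling at fixed epoch intervals); the interval, base size, and cap are arbitrary choices, not the AdaBatch rule itself.

    def adaptive_batch_size(epoch, base_batch=128, double_every=20, max_batch=2048):
        # Start small for fast convergence, grow for parallel efficiency.
        return min(base_batch * (2 ** (epoch // double_every)), max_batch)

    # Example schedule: epochs 0-19 use 128, epochs 20-39 use 256, and so on.
    for epoch in range(0, 80, 10):
        print(epoch, adaptive_batch_size(epoch))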
SYQ: Learning Symmetric Quantization For Efficient Deep Neural Networks
Inference for state-of-the-art deep neural networks is computationally
expensive, making them difficult to deploy on constrained hardware
environments. An efficient way to reduce this complexity is to quantize the
weight parameters and/or activations during training by approximating their
distributions with a limited-entry codebook. At very low precisions, such as
binary or ternary networks with 1-8-bit activations, the information loss from
quantization leads to significant accuracy degradation due to large gradient
mismatches between the forward and backward functions. In this paper, we
introduce a quantization method to reduce this loss by learning a symmetric
codebook for particular weight subgroups. These subgroups are determined based
on their locality in the weight matrix, such that the hardware simplicity of
the low-precision representations is preserved. Empirically, we show that
symmetric quantization can substantially improve accuracy for networks with
extremely low-precision weights and activations. We also demonstrate that this
representation imposes little or no additional hardware cost compared with more
coarse-grained approaches. Source code is available at
https://www.github.com/julianfaraone/SYQ.
Comment: Published as a conference paper at the 2018 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR)
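A simplified sketch of symmetric ternary quantization with per-subgroup scaling factors is given below; here each weight-matrix row is treated as a subgroup, which is a coarser grouping than the locality-based subgroups described in the abstract.

    import torch

    def symmetric_ternary_quantize(weight, threshold_ratio=0.05):
        # weight: (out_features, in_features). Each row (subgroup) gets its own
        # symmetric codebook {-alpha, 0, +alpha}; alpha is the mean magnitude
        # of the surviving entries in that row.
        q = torch.zeros_like(weight)
        for i, row in enumerate(weight):
            t = threshold_ratio * row.abs().max()
            mask = (row.abs() > t).float()
            alpha = (row.abs() * mask).sum() / mask.sum().clamp_min(1.0)
            q[i] = alpha * torch.sign(row) * mask
        return q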
Improving Deep Neural Network with Multiple Parametric Exponential Linear Units
Activation functions are crucial to the recent successes of deep neural
networks. In this paper, we first propose a new activation function, Multiple
Parametric Exponential Linear Units (MPELU), aiming to generalize and unify the
rectified and exponential linear units. As the generalized form, MPELU shares
the advantages of Parametric Rectified Linear Unit (PReLU) and Exponential
Linear Unit (ELU), leading to better classification performance and convergence
property. In addition, weight initialization is very important for training very
deep networks. Existing methods laid a solid foundation for networks using
rectified linear units, but not for exponential linear units. This paper
complements the current theory and extends it to a wider range. Specifically,
we put forward an initialization method that enables training of very deep networks
using exponential linear units. Experiments demonstrate that the proposed
initialization not only helps the training process but also leads to better
generalization performance. Finally, utilizing the proposed activation function
and initialization, we present a deep MPELU residual architecture that achieves
state-of-the-art performance on the CIFAR-10/100 datasets. The code is
available at https://github.com/Coldmooon/Code-for-MPELU
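A minimal PyTorch-style sketch of the MPELU activation with learnable per-channel parameters is shown below; the initial values and the channel-wise parameterization are assumptions for illustration. Fixing alpha = beta = 1 recovers ELU, and for small beta the negative branch is approximately linear, PReLU-like.

    import torch
    import torch.nn as nn

    class MPELU(nn.Module):
        # f(x) = x                     if x > 0
        #        alpha*(exp(beta*x)-1) if x <= 0, with learnable alpha and beta.
        def __init__(self, num_channels, alpha=1.0, beta=1.0):
            super().__init__()
            self.alpha = nn.Parameter(torch.full((num_channels,), alpha))
            self.beta = nn.Parameter(torch.full((num_channels,), beta))

        def forward(self, x):
            # x: (N, C, H, W); parameters broadcast over the channel dimension.
            a = self.alpha.view(1, -1, 1, 1)
            b = self.beta.view(1, -1, 1, 1)
            return torch.where(x > 0, x, a * (torch.exp(b * x) - 1))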
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
High network communication cost for synchronizing gradients and parameters is
the well-known bottleneck of distributed training. In this work, we propose
TernGrad that uses ternary gradients to accelerate distributed deep learning in
data parallelism. Our approach requires only three numerical levels {-1,0,1},
which can aggressively reduce the communication time. We mathematically prove
the convergence of TernGrad under the assumption of a bound on gradients.
Guided by the bound, we propose layer-wise ternarizing and gradient clipping to
improve its convergence. Our experiments show that applying TernGrad on AlexNet
does not incur any accuracy loss and can even improve accuracy. The accuracy
loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a
performance model is proposed to study the scalability of TernGrad. Experiments
show significant speed gains for various deep neural networks. Our source code
is available.
Comment: NIPS 2017 Oral
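A rough sketch of stochastic gradient ternarization is shown below: each gradient tensor is mapped to values in {-s, 0, +s} with probabilities chosen so the result is unbiased in expectation. The layer-wise ternarizing and gradient clipping mentioned in the abstract are omitted here.

    import torch

    def ternarize_gradient(grad):
        # TernGrad-style stochastic ternarization: values in {-s, 0, +s} with
        # s = max|g| and P(keep) = |g| / s, so the expectation equals g.
        s = grad.abs().max()
        if s == 0:
            return torch.zeros_like(grad)
        prob = grad.abs() / s
        keep = torch.bernoulli(prob)
        return s * torch.sign(grad) * keep

    # Each worker would ternarize its local gradients before communication, e.g.:
    # for p in model.parameters():
    #     p.grad = ternarize_gradient(p.grad)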
PILAE: A Non-gradient Descent Learning Scheme for Deep Feedforward Neural Networks
In this work, a non-gradient-descent learning scheme is proposed for deep
feedforward neural networks (DNNs). As is well known, autoencoders can be used
as the building blocks of a multi-layer perceptron (MLP) deep neural network, so
the MLP is taken as an example to illustrate the proposed pseudoinverse learning
algorithm for autoencoder (PILAE) training. The PILAE
with low-rank approximation is a non-gradient-based learning algorithm: the
encoder weight matrix is set to the low-rank approximation of the pseudoinverse
of the input matrix, while the decoder weight matrix is calculated by the
pseudoinverse learning algorithm. It is worth noting that only a few
network-structure hyperparameters need to be tuned, so the proposed algorithm
can be regarded as a quasi-automated training algorithm suitable for research on
autonomous machine learning. The experimental results show that the proposed
learning scheme for DNNs achieves a better trade-off between training efficiency
and classification accuracy.
Comment: This work is our effort toward realizing AutoML
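A simplified NumPy sketch of one pseudoinverse-learning autoencoder layer is given below; the particular truncated-SVD encoder and the ridge term in the decoder solve are illustrative choices, not the exact PILAE formulation.

    import numpy as np

    def pilae_layer(X, hidden_dim, reg=1e-3):
        # X: (n_features, n_samples). One PILAE-style layer, simplified:
        # the encoder weight comes from a truncated-SVD factor of the
        # pseudoinverse of X, and the decoder is solved in closed form with a
        # regularized pseudoinverse -- no gradient descent anywhere.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        k = min(hidden_dim, int((s > 1e-10).sum()))
        W_enc = np.diag(1.0 / s[:k]) @ U[:, :k].T      # (k, n_features)
        H = np.tanh(W_enc @ X)                         # hidden codes (k, n_samples)
        # Decoder: argmin_W ||W H - X||^2 + reg ||W||^2, solved in closed form.
        W_dec = X @ H.T @ np.linalg.inv(H @ H.T + reg * np.eye(k))
        return W_enc, W_dec, H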