Soft Threshold Weight Reparameterization for Learnable Sparsity
Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus
of maximizing prediction accuracy given an overall parameter budget. Existing
methods rely on uniform or heuristic non-uniform sparsity budgets which have
sub-optimal layer-wise parameter allocation resulting in a) lower prediction
accuracy or b) higher inference cost (FLOPs). This work proposes Soft Threshold
Reparameterization (STR), a novel use of the soft-threshold operator on DNN
weights. STR smoothly induces sparsity while learning pruning thresholds
thereby obtaining a non-uniform sparsity budget. Our method achieves
state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and
MobileNetV1 on ImageNet-1K), and, additionally, learns non-uniform budgets that
empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy
over existing results by up to 10% in the ultra sparse (99%) regime and can
also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is
a simple mechanism which learns effective sparsity budgets that contrast with
popular heuristics. Code, pretrained models and sparsity budgets are at
https://github.com/RAIVNLab/STR.
Comment: 19 pages, 10 figures, Published at International Conference on Machine Learning (ICML) 2020
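To make the mechanism concrete, here is a minimal PyTorch sketch of a linear layer whose weights pass through a soft-threshold with a learnable per-layer threshold. This is an illustration only, not the authors' released code (see the linked repository); the parameterization of the threshold via a sigmoid and the initialization value are assumptions for the example.

```python
# Minimal sketch of a soft-threshold reparameterized linear layer (illustrative,
# not the authors' implementation). The threshold alpha = sigmoid(s) is learned
# jointly with the weights, so each layer finds its own sparsity level.
import torch
import torch.nn as nn
import torch.nn.functional as F

class STRLinear(nn.Module):
    def __init__(self, in_features, out_features, s_init=-10.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # One learnable threshold parameter per layer (non-uniform budget across layers).
        self.s = nn.Parameter(torch.tensor(s_init))

    def sparse_weight(self):
        # Soft-threshold operator: sign(w) * relu(|w| - alpha).
        alpha = torch.sigmoid(self.s)
        return torch.sign(self.weight) * F.relu(self.weight.abs() - alpha)

    def forward(self, x):
        return F.linear(x, self.sparse_weight(), self.bias)

layer = STRLinear(128, 64)
y = layer(torch.randn(4, 128))
sparsity = (layer.sparse_weight() == 0).float().mean().item()
print(y.shape, f"sparsity={sparsity:.2%}")
```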
A Classification Supervised Auto-Encoder Based on Predefined Evenly-Distributed Class Centroids
Classic variational autoencoders, built on standard function approximators, are used to learn complex data distributions; in particular, VAEs have shown promise on many complex tasks. In this paper, a new autoencoder model, the classification supervised autoencoder (CSAE) based on predefined evenly-distributed class centroids (PEDCC), is proposed. Our method uses the PEDCC of the latent variables to train the network, ensuring maximization of the inter-class distance and minimization of the intra-class distance. Instead of learning the mean/variance of the latent variable distribution and applying the reparameterization of VAE, the latent variables of CSAE are used directly for classification and as input to the decoder. In addition, a new loss function is proposed that incorporates the classification loss. Based on the basic structure of the universal autoencoder, we simultaneously achieve good encoding, decoding, and classification results together with good model generalization. These theoretical advantages are reflected in the experimental results.
Comment: 16 pages, 12 figures, 4 tables
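The sketch below illustrates the kind of combined objective described above: the encoder output is pulled toward a fixed class centroid while the same latent code drives reconstruction. It is an assumption-laden toy version; in particular, the rows of a random orthogonal matrix stand in for the actual PEDCC construction, and the loss weighting is arbitrary.

```python
# Illustrative sketch of a CSAE-style objective (not the paper's exact formulation):
# latent codes are regressed toward fixed, (approximately) evenly distributed class
# centroids, and the same codes feed the decoder for reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_classes, latent_dim = 10, 32

# Stand-in for PEDCC: orthonormal rows of a random orthogonal matrix.
centroids = torch.linalg.qr(torch.randn(latent_dim, latent_dim))[0][:num_classes]

encoder = nn.Sequential(nn.Flatten(), nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, latent_dim))
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, 784))

def csae_loss(x, labels, lam=1.0):
    z = encoder(x)                                # latent code, no reparameterization step
    recon = decoder(z)                            # decoder sees the latent code directly
    loss_rec = F.mse_loss(recon, x.flatten(1))    # reconstruction term
    loss_cls = F.mse_loss(z, centroids[labels])   # pull z toward its class centroid
    return loss_rec + lam * loss_cls

x = torch.rand(8, 1, 28, 28)
labels = torch.randint(0, num_classes, (8,))
print(csae_loss(x, labels).item())
```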
Dynamic Channel Pruning: Feature Boosting and Suppression
Making deep convolutional neural networks more accurate typically comes at
the cost of increased computational and memory resources. In this paper, we
reduce this cost by exploiting the fact that the importance of features
computed by convolutional layers is highly input-dependent, and propose feature
boosting and suppression (FBS), a new method to predictively amplify salient
convolutional channels and skip unimportant ones at run-time. FBS introduces
small auxiliary connections to existing convolutional layers. In contrast to
channel pruning methods which permanently remove channels, it preserves the
full network structures and accelerates convolution by dynamically skipping
unimportant input and output channels. FBS-augmented networks are trained with
conventional stochastic gradient descent, making FBS readily applicable to many
state-of-the-art CNNs. We compare FBS to a range of existing channel pruning
and dynamic execution schemes and demonstrate large improvements on ImageNet
classification. Experiments show that FBS yields substantial compute savings on VGG-16 and ResNet-18, both with only a small top-5 accuracy loss.
Comment: 14 pages, 5 figures, 4 tables, published as a conference paper at ICLR 2019
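A minimal sketch of the idea follows: a tiny auxiliary path predicts per-channel saliency from the input, the top-k channels are amplified, and the rest are zeroed so their computation can be skipped. The pooling-plus-linear predictor and the fixed keep count are simplifications, not the paper's exact design.

```python
# Minimal sketch of feature boosting and suppression around a conv layer
# (illustrative; the paper's auxiliary predictor and gating details differ).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FBSConv(nn.Module):
    def __init__(self, in_ch, out_ch, k_keep):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        # Small auxiliary connection: global pool -> linear -> per-output-channel saliency.
        self.saliency = nn.Linear(in_ch, out_ch)
        self.k_keep = k_keep  # number of output channels kept per input instance

    def forward(self, x):
        pooled = x.mean(dim=(2, 3))                    # (N, in_ch) feature descriptor
        scores = F.relu(self.saliency(pooled))         # predicted channel saliency
        # Winner-take-all: boost the k most salient channels, suppress the rest.
        kth = scores.topk(self.k_keep, dim=1).values[:, -1:]
        gate = scores * (scores >= kth).float()        # boosted gains, zeros elsewhere
        return self.conv(x) * gate[:, :, None, None]   # zeroed channels could be skipped

m = FBSConv(16, 32, k_keep=8)
print(m(torch.randn(2, 16, 24, 24)).shape)
```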
Dynamic Routing Networks
The deployment of deep neural networks in real-world applications is mostly
restricted by their high inference costs. Extensive efforts have been made to
improve the accuracy with expert-designed or algorithm-searched architectures.
However, the incremental improvement is typically achieved with increasingly
more expensive models that only a small portion of input instances really need.
Inference with a static architecture that processes all input instances via the
same transformation would thus incur unnecessary computational costs.
Therefore, customizing the model capacity in an instance-aware manner is much
needed for higher inference efficiency. In this paper, we propose Dynamic
Routing Networks (DRNets), which support efficient instance-aware inference by
routing the input instance to only necessary transformation branches selected
from a candidate set of branches for each connection between transformation
nodes. The branch selection is dynamically determined via the corresponding
branch importance weights, which are first generated from lightweight
hypernetworks (RouterNets) and then recalibrated with Gumbel-Softmax before the
selection. Extensive experiments show that DRNets can substantially reduce parameter size and FLOPs during inference while maintaining prediction performance comparable to state-of-the-art architectures.
Comment: 10 pages, 3 figures, 3 tables
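The sketch below shows the routing pattern described above in its simplest form: a lightweight router produces branch importance weights from pooled input features, Gumbel-Softmax recalibrates them into a hard selection, and only the chosen branch contributes. The candidate branches and the pooling-based router are illustrative assumptions, not the DRNets architecture.

```python
# Illustrative sketch of dynamic branch selection with Gumbel-Softmax
# (not the DRNets implementation; the "RouterNet" here is pooling + one linear layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoutedBlock(nn.Module):
    def __init__(self, channels, num_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2) for k in (1, 3, 5)
        )
        # Lightweight hypernetwork producing branch importance weights.
        self.router = nn.Linear(channels, num_branches)

    def forward(self, x, tau=1.0):
        logits = self.router(x.mean(dim=(2, 3)))              # (N, num_branches)
        # Hard one-hot sample with straight-through gradients.
        sel = F.gumbel_softmax(logits, tau=tau, hard=True)    # (N, num_branches)
        outs = torch.stack([b(x) for b in self.branches], 1)  # (N, B, C, H, W)
        # At inference, only the selected branch would actually be computed.
        return (outs * sel[:, :, None, None, None]).sum(dim=1)

block = RoutedBlock(16)
print(block(torch.randn(2, 16, 32, 32)).shape)
```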
Fisher-Bures Adversary Graph Convolutional Networks
In a graph convolutional network, we assume that the graph is generated with respect to some observation noise. During learning, we make small random perturbations of the graph and try to improve generalization. Based on quantum information geometry, these perturbations can be characterized by the eigendecomposition of the graph Laplacian matrix. We try to minimize the loss with respect to the perturbed graph while making the perturbation effective in terms of the Fisher information of the neural network. Our proposed model can
consistently improve graph convolutional networks on semi-supervised node
classification tasks with reasonable computational overhead. We present three
different geometries on the manifold of graphs: the intrinsic geometry measures
the information theoretic dynamics of a graph; the extrinsic geometry
characterizes how such dynamics can affect externally a graph neural network;
the embedding geometry is for measuring node embeddings. These new analytical
tools are useful in developing a good understanding of graph neural networks
and fostering new techniques.
Comment: Published in UAI 2019
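As a rough illustration of perturbing a graph through its Laplacian spectrum, the NumPy sketch below injects noise into the eigenvalues and reassembles a perturbed Laplacian. The plain Gaussian noise model is an assumption for the example; the paper instead derives the perturbation from Fisher/Bures information geometry.

```python
# Illustrative sketch: perturb a graph in the spectral domain of its Laplacian.
import numpy as np

def perturbed_laplacian(adj, sigma=0.05, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj                                   # combinatorial graph Laplacian
    evals, evecs = np.linalg.eigh(lap)                # eigendecomposition (symmetric)
    noisy = np.clip(evals + sigma * rng.standard_normal(evals.shape), 0.0, None)
    return evecs @ np.diag(noisy) @ evecs.T           # reassembled perturbed Laplacian

adj = np.array([[0, 1, 1, 0],
                [1, 0, 1, 0],
                [1, 1, 0, 1],
                [0, 0, 1, 0]], dtype=float)
print(perturbed_laplacian(adj).round(3))
```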
Routing Networks and the Challenges of Modular and Compositional Computation
Compositionality is a key strategy for addressing combinatorial complexity
and the curse of dimensionality. Recent work has shown that compositional
solutions can be learned and offer substantial gains across a variety of
domains, including multi-task learning, language modeling, visual question
answering, machine comprehension, and others. However, such models present
unique challenges during training when both the module parameters and their
composition must be learned jointly. In this paper, we identify several of
these issues and analyze their underlying causes. Our discussion focuses on
routing networks, a general approach to this problem, and examines empirically
the interplay of these challenges and a variety of design decisions. In
particular, we consider the effect of how the algorithm decides on module
composition, how the algorithm updates the modules, and whether the algorithm uses regularization.
Balanced Sparsity for Efficient DNN Inference on GPU
In trained deep neural networks, unstructured pruning can reduce redundant weights to lower storage cost. However, it requires customized hardware to speed up practical inference. Another line of work accelerates sparse model inference on general-purpose hardware by adopting coarse-grained sparsity to prune or regularize consecutive weights for efficient computation, but this often sacrifices model accuracy. In this paper, we propose a novel fine-grained sparsity approach, balanced sparsity, to achieve high model accuracy efficiently on commercial hardware. Our approach adapts to the high-parallelism property of GPUs, showing strong potential for sparsity in the wide deployment of deep learning services. Experimental results show that balanced sparsity achieves up to 3.1x practical speedup for model inference on GPU, while retaining the same high model accuracy as fine-grained sparsity.
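The sketch below shows the balanced pruning pattern in its simplest form: each weight-matrix row is split into equal-size blocks and only the k largest-magnitude weights per block are kept, so every block (and hence every GPU thread group) holds the same number of nonzeros. Block size and keep count are illustrative choices.

```python
# Illustrative sketch of balanced fine-grained pruning.
import torch

def balanced_prune(weight, block_size, keep_per_block):
    rows, cols = weight.shape
    assert cols % block_size == 0
    blocks = weight.reshape(rows, cols // block_size, block_size)
    # k-th largest magnitude within each block is the local pruning threshold.
    thresh = blocks.abs().topk(keep_per_block, dim=-1).values[..., -1:]
    mask = (blocks.abs() >= thresh).float()
    return (blocks * mask).reshape(rows, cols)

w = torch.randn(4, 16)
pruned = balanced_prune(w, block_size=8, keep_per_block=2)
print((pruned != 0).sum(dim=1))   # every row keeps 2 nonzeros in each 8-wide block
```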
The State of Sparsity in Deep Neural Networks
We rigorously evaluate three state-of-the-art techniques for inducing
sparsity in deep neural networks on two large-scale learning tasks: Transformer
trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet.
Across thousands of experiments, we demonstrate that complex techniques
(Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression
rates on smaller datasets perform inconsistently, and that simple magnitude
pruning approaches achieve comparable or better results. Additionally, we
replicate the experiments performed by Frankle & Carbin (2018) and Liu et al. (2018) at scale and show that unstructured sparse architectures learned
through pruning cannot be trained from scratch to the same test set performance
as a model trained with joint sparsification and optimization. Together, these
results highlight the need for large-scale benchmarks in the field of model
compression. We open-source our code, top performing model checkpoints, and
results of all hyperparameter configurations to establish rigorous baselines
for future work on compression and sparsification.
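For reference, the simple magnitude-pruning baseline the study finds competitive amounts to zeroing the smallest-magnitude fraction of weights, usually applied gradually during training. A minimal sketch, with the single-shot variant shown for brevity:

```python
# Minimal sketch of magnitude pruning to a target sparsity level
# (gradual schedules apply this in small increments over the course of training).
import torch

def magnitude_prune(weight, sparsity):
    """Zero out the smallest-magnitude `sparsity` fraction of the weights."""
    threshold = torch.quantile(weight.abs().flatten(), sparsity)
    mask = (weight.abs() > threshold).float()
    return weight * mask, mask

w = torch.randn(256, 256)
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(f"achieved sparsity: {(pruned == 0).float().mean():.2%}")
```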
ReSet: Learning Recurrent Dynamic Routing in ResNet-like Neural Networks
Neural Networks are powerful Machine Learning tools that show outstanding performance in Computer Vision, Natural Language Processing, and Artificial Intelligence. In particular, the recently proposed ResNet architecture and its modifications produce state-of-the-art results in image classification problems. ResNet and most previously proposed architectures have a fixed structure and apply the same transformation to all input images. In this work,
we develop a ResNet-based model that dynamically selects Computational Units
(CU) for each input object from a learned set of transformations. Dynamic
selection allows the network to learn a sequence of useful transformations and
apply only the required units to predict the image label. We compare our model to the ResNet-38 architecture and achieve better results than the original ResNet on the CIFAR-10.1 test set. While examining the produced paths, we discovered that the network learned different routes for images from different classes and similar routes for similar images.
Comment: Published in Proceedings of The 10th Asian Conference on Machine Learning, http://proceedings.mlr.press/v95/kemaev18a.htm
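The sketch below illustrates the per-instance unit selection described above: a small recurrent controller inspects pooled features and picks one Computational Unit from a shared set at each step. It is a forward-pass illustration only; the hard argmax choice and the GRU controller are assumptions, and the paper trains the selection differently.

```python
# Illustrative sketch of dynamic per-instance selection of Computational Units.
import torch
import torch.nn as nn

class DynamicResNetBlock(nn.Module):
    def __init__(self, channels, num_units=4, num_steps=3):
        super().__init__()
        self.units = nn.ModuleList(
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU())
            for _ in range(num_units)
        )
        self.controller = nn.GRUCell(channels, channels)
        self.selector = nn.Linear(channels, num_units)
        self.num_steps = num_steps

    def forward(self, x):
        h = x.mean(dim=(2, 3))                      # controller state from pooled features
        route = []
        for _ in range(self.num_steps):
            h = self.controller(x.mean(dim=(2, 3)), h)
            idx = self.selector(h).argmax(dim=1)    # hard choice of one CU per instance
            # Apply the chosen unit per instance (simple loop kept for clarity).
            x = torch.stack([self.units[i](x[b:b + 1]).squeeze(0)
                             for b, i in enumerate(idx.tolist())])
            route.append(idx)
        return x, torch.stack(route, dim=1)         # features and the learned route

block = DynamicResNetBlock(16)
out, route = block(torch.randn(2, 16, 8, 8))
print(out.shape, route)
```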
Improving Variational Inference with Inverse Autoregressive Flow
The framework of normalizing flows provides a general strategy for flexible
variational inference of posteriors over latent variables. We propose a new
type of normalizing flow, inverse autoregressive flow (IAF), that, in contrast
to earlier published flows, scales well to high-dimensional latent spaces. The
proposed flow consists of a chain of invertible transformations, where each
transformation is based on an autoregressive neural network. In experiments, we
show that IAF significantly improves upon diagonal Gaussian approximate
posteriors. In addition, we demonstrate that a novel type of variational
autoencoder, coupled with IAF, is competitive with neural autoregressive models
in terms of attained log-likelihood on natural images, while allowing
significantly faster synthesis.
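A minimal sketch of a single IAF step follows: an autoregressive (masked) network outputs a shift m and gate s from z, and the numerically stable update z ← σ(s)⊙z + (1−σ(s))⊙m keeps the Jacobian triangular, so the log-determinant is just the sum of log σ(s). The single masked linear layer stands in for the MADE-style network used in practice.

```python
# Minimal sketch of one inverse autoregressive flow step (illustrative).
import torch
import torch.nn as nn

class MaskedLinear(nn.Linear):
    def __init__(self, dim):
        super().__init__(dim, 2 * dim)
        mask = torch.tril(torch.ones(dim, dim), diagonal=-1)   # strictly autoregressive
        self.register_buffer("mask", mask.repeat(2, 1))        # same mask for m and s

    def forward(self, z):
        return nn.functional.linear(z, self.weight * self.mask, self.bias)

def iaf_step(z, masked_net):
    m, s = masked_net(z).chunk(2, dim=-1)
    gate = torch.sigmoid(s)
    z_new = gate * z + (1.0 - gate) * m
    log_det = torch.log(gate).sum(dim=-1)   # cheap log|det J| from the triangular Jacobian
    return z_new, log_det

dim = 8
net = MaskedLinear(dim)
z = torch.randn(4, dim)
z_new, log_det = iaf_step(z, net)
print(z_new.shape, log_det.shape)
```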