Joint Multi-Dimension Pruning
We present joint multi-dimension pruning (named JointPruning), a new perspective on pruning a network along three crucial dimensions simultaneously: spatial resolution, depth, and channel width. The joint strategy can find a better configuration than previous studies that focused on a single dimension, because our method is optimized collaboratively across the three dimensions in a single end-to-end training run. Moreover, each dimension can achieve better performance by cooperating with the other two. Our method is realized through an adapted stochastic gradient estimation. Extensive experiments on the large-scale ImageNet dataset across a variety of network architectures (MobileNet V1&V2 and ResNet) demonstrate the effectiveness of the proposed method. For instance, we achieve significant margins of 2.5% and 2.6% improvement over the state-of-the-art approach on the already compact MobileNet V1&V2 under an extremely large compression ratio.
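Since exact gradients are unavailable for discrete pruning choices, a gradient of the validation loss with respect to the pruning configuration has to be estimated stochastically. The sketch below shows one generic way such an estimator could look, using a smoothed finite-difference scheme over a continuous vector of spatial/depth/channel pruning ratios; `loss_fn` and the ratio parameterization are hypothetical placeholders, not the authors' implementation.

```python
import numpy as np

def estimate_pruning_gradient(loss_fn, ratios, sigma=0.05, n_samples=8,
                              rng=np.random.default_rng()):
    """Smoothed finite-difference estimate of d(loss)/d(ratios).

    ratios: array of pruning ratios, e.g. [spatial, depth, channel] in (0, 1].
    loss_fn: evaluates the validation loss of the network pruned to `ratios`.
    """
    grad = np.zeros_like(ratios)
    for _ in range(n_samples):
        eps = rng.standard_normal(ratios.shape)
        delta = loss_fn(ratios + sigma * eps) - loss_fn(ratios - sigma * eps)
        grad += delta / (2.0 * sigma) * eps
    return grad / n_samples

# usage sketch: ratios -= lr * estimate_pruning_gradient(validation_loss, ratios)
```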
Parsimonious Deep Learning: A Differential Inclusion Approach with Global Convergence
Over-parameterization is ubiquitous in training neural networks: it benefits both optimization, in seeking global optima, and generalization, in reducing prediction error. However, compact networks are desired in many real-world applications, and direct training of small networks may become trapped in local optima. In this paper, instead of pruning or distilling an over-parameterized model into a compact one, we propose a parsimonious learning approach based on differential inclusions of inverse scale spaces, which generates a family of models from simple to complex with better efficiency and interpretability than stochastic gradient descent in exploring the model space. It enjoys a simple discretization, the Split Linearized Bregman Iterations, with provable global convergence: from any initialization, the algorithmic iterations converge to a critical point of the empirical risk. One may exploit the proposed method to boost the complexity of neural networks progressively. Numerical experiments on MNIST, CIFAR-10/100, and ImageNet show that the method is promising for training large-scale models with favorable interpretability. Comment: 25 pages, 7 figures.
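As a rough illustration of the flavor of such iterations, here is a minimal numpy sketch of a (non-split) linearized Bregman iteration for a generic differentiable loss with an l1 sparsity bias; the exact split formulation, step sizes, and variable coupling used in the paper differ, so this is only a sketch under simplified assumptions.

```python
import numpy as np

def soft_threshold(z, thresh):
    """Proximal map of the l1 norm (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)

def linearized_bregman(grad_fn, dim, alpha=0.01, kappa=10.0, n_iters=1000):
    """Linearized Bregman iteration: the dual variable z accumulates gradients,
    while the primal variable w stays sparse and grows from simple to complex."""
    z = np.zeros(dim)
    w = np.zeros(dim)
    for _ in range(n_iters):
        z -= alpha * grad_fn(w)           # gradient step on the dual variable
        w = kappa * soft_threshold(z, 1)  # sparse primal variable
    return w
```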
The Impact of Neural Network Overparameterization on Gradient Confusion and Stochastic Gradient Descent
This paper studies how neural network architecture affects the speed of
training. We introduce a simple concept called gradient confusion to help
formally analyze this. When gradient confusion is high, stochastic gradients
produced by different data samples may be negatively correlated, slowing down
convergence. But when gradient confusion is low, data samples interact
harmoniously, and training proceeds quickly. Through theoretical and
experimental results, we demonstrate how the neural network architecture
affects gradient confusion, and thus the efficiency of training. Our results
show that, for popular initialization techniques, increasing the width of
neural networks leads to lower gradient confusion, and thus faster model
training. On the other hand, increasing the depth of neural networks has the
opposite effect. Our results indicate that alternate initialization techniques
or networks using both batch normalization and skip connections help reduce the
training burden of very deep networks. Comment: ICML 2020 camera-ready version.
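As a concrete illustration of the quantity being discussed, the sketch below estimates gradient confusion as the most negative pairwise inner product among per-sample gradient vectors; the function name and the use of plain numpy arrays are illustrative assumptions, not the paper's code.

```python
import numpy as np

def gradient_confusion(per_sample_grads):
    """Most negative pairwise inner product among per-sample gradients.

    per_sample_grads: (n_samples, n_params) array, one flattened gradient per sample.
    Large negative values mean gradients conflict and SGD steps interfere.
    """
    inner = per_sample_grads @ per_sample_grads.T
    off_diag = inner[~np.eye(len(inner), dtype=bool)]
    return off_diag.min()
```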
A Selective Overview of Deep Learning
Deep learning has arguably achieved tremendous success in recent years. In
simple words, deep learning uses the composition of many nonlinear functions to
model the complex dependency between input features and labels. While neural
networks have a long history, recent advances have greatly improved their
performance in computer vision, natural language processing, etc. From the
statistical and scientific perspective, it is natural to ask: What is deep
learning? What are the new characteristics of deep learning, compared with
classical methods? What are the theoretical foundations of deep learning? To
answer these questions, we introduce common neural network models (e.g.,
convolutional neural nets, recurrent neural nets, generative adversarial nets)
and training techniques (e.g., stochastic gradient descent, dropout, batch
normalization) from a statistical point of view. Along the way, we highlight
new characteristics of deep learning (including depth and over-parametrization)
and explain their practical and theoretical benefits. We also sample recent
results on theories of deep learning, many of which are only suggestive. While
a complete understanding of deep learning remains elusive, we hope that our
perspectives and discussions serve as a stimulus for new statistical research.
Ensemble Model Patching: A Parameter-Efficient Variational Bayesian Neural Network
Two main obstacles preventing the widespread adoption of variational Bayesian
neural networks are the high parameter overhead that makes them infeasible on
large networks, and the difficulty of implementation, which can be thought of
as "programming overhead." MC dropout [Gal and Ghahramani, 2016] is popular
because it sidesteps these obstacles. Nevertheless, dropout is often harmful to
model performance when used in networks with batch normalization layers [Li et
al., 2018], which are an indispensable part of modern neural networks. We
construct a general variational family for ensemble-based Bayesian neural
networks that encompasses dropout as a special case. We further present two
specific members of this family that work well with batch normalization layers,
while retaining the benefits of low parameter and programming overhead,
comparable to non-Bayesian training. Our proposed methods improve predictive
accuracy and achieve almost perfect calibration on a ResNet-18 trained with
ImageNet.
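For context on the MC dropout baseline the abstract builds on, here is a minimal PyTorch sketch of Monte Carlo dropout prediction (keep dropout layers active at test time and average several stochastic forward passes); this illustrates the baseline, not the paper's proposed variational family.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=20):
    """Monte Carlo dropout: sample predictions with dropout enabled at test time."""
    model.eval()
    for m in model.modules():
        if isinstance(m, (nn.Dropout, nn.Dropout2d, nn.Dropout3d)):
            m.train()  # keep only dropout layers in stochastic mode
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1)
                             for _ in range(n_samples)])
    return probs.mean(dim=0), probs.var(dim=0)  # predictive mean and variance
```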
A Simple Baseline for Bayesian Uncertainty in Deep Learning
We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose
approach for uncertainty representation and calibration in deep learning.
Stochastic Weight Averaging (SWA), which computes the first moment of
stochastic gradient descent (SGD) iterates with a modified learning rate
schedule, has recently been shown to improve generalization in deep learning.
With SWAG, we fit a Gaussian using the SWA solution as the first moment and a
low rank plus diagonal covariance also derived from the SGD iterates, forming
an approximate posterior distribution over neural network weights; we then
sample from this Gaussian distribution to perform Bayesian model averaging. We
empirically find that SWAG approximates the shape of the true posterior, in
accordance with results describing the stationary distribution of SGD iterates.
Moreover, we demonstrate that SWAG performs well on a wide variety of tasks,
including out-of-sample detection, calibration, and transfer learning, in comparison to many popular alternatives including MC dropout, KFAC Laplace, SGLD, and temperature scaling. Comment: Published at NeurIPS 2019.
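A minimal numpy sketch of the SWAG idea described above: collect SGD iterates, form the SWA mean plus a diagonal and low-rank covariance estimate, and sample weights for Bayesian model averaging. The scaling constants follow the commonly cited SWAG formulation but should be treated as assumptions here.

```python
import numpy as np

def swag_fit(iterates):
    """Fit a SWAG posterior from a (K, d) array of SGD weight iterates (K > 1)."""
    mean = iterates.mean(axis=0)                     # SWA solution (first moment)
    sq_mean = (iterates ** 2).mean(axis=0)
    diag_var = np.maximum(sq_mean - mean ** 2, 0.0)  # diagonal covariance part
    dev = iterates - mean                            # (K, d) low-rank deviations
    return mean, diag_var, dev

def swag_sample(mean, diag_var, dev, rng=np.random.default_rng()):
    """Draw one weight sample from the low-rank-plus-diagonal Gaussian posterior."""
    k, d = dev.shape
    z1 = rng.standard_normal(d)
    z2 = rng.standard_normal(k)
    return (mean
            + np.sqrt(diag_var) * z1 / np.sqrt(2.0)
            + dev.T @ z2 / np.sqrt(2.0 * (k - 1)))

# Bayesian model averaging: evaluate the network at several swag_sample(...) draws
# and average the resulting predictions.
```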
Understanding the Energy and Precision Requirements for Online Learning
It is well known that the precision of the data, hyperparameters, and internal representations employed in a learning system directly impacts its energy, throughput, and latency. The precision requirements of the training algorithm are also important for systems that learn on the fly. Prior work has shown that the data and hyperparameters can be quantized heavily without incurring much penalty in classification accuracy compared to floating-point implementations. These works suffer from two key limitations. First, they assume uniform precision for the classifier and the training algorithm, and thus miss the opportunity to reduce precision further. Second, prior works are empirical studies. In this article, we overcome both limitations by deriving analytical lower bounds on the precision requirements of the commonly employed stochastic gradient descent (SGD) online learning algorithm in the specific context of a support vector machine (SVM). Lower bounds on the data precision are derived in terms of the desired classification accuracy and the precision of the hyperparameters used in the classifier. Additionally, lower bounds on the hyperparameter precision in the SGD training algorithm are obtained. These bounds are validated using both synthetic data and the UCI breast cancer dataset. Finally, the impact of these precisions on the energy consumption of a fixed-point SVM with online training is studied. Comment: 14 pages, 5 figures, 4 of which have 2 subfigures.
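To make the setting concrete, here is a minimal sketch of one fixed-point SGD step on an L2-regularized hinge loss, with a uniform quantizer applied to the weights; the bit widths, clipping range, and update form are illustrative assumptions rather than the article's derived bounds.

```python
import numpy as np

def quantize(x, bits, x_max=1.0):
    """Uniform fixed-point quantization of x to `bits` bits over [-x_max, x_max)."""
    levels = 2 ** (bits - 1)
    step = x_max / levels
    return np.clip(np.round(x / step), -levels, levels - 1) * step

def sgd_svm_step(w, x, y, lr=0.01, lam=1e-3, w_bits=8):
    """One SGD step on the regularized hinge loss, keeping the weights quantized.

    x: feature vector, y: label in {-1, +1}, w: current (quantized) weight vector.
    """
    margin = y * np.dot(w, x)
    grad = lam * w - (y * x if margin < 1.0 else 0.0)
    return quantize(w - lr * grad, w_bits)
```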
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
High network communication cost for synchronizing gradients and parameters is
the well-known bottleneck of distributed training. In this work, we propose
TernGrad that uses ternary gradients to accelerate distributed deep learning in
data parallelism. Our approach requires only three numerical levels {-1,0,1},
which can aggressively reduce the communication time. We mathematically prove
the convergence of TernGrad under the assumption of a bound on gradients.
Guided by the bound, we propose layer-wise ternarizing and gradient clipping to
improve its convergence. Our experiments show that applying TernGrad on AlexNet
does not incur any accuracy loss and can even improve accuracy. The accuracy
loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a
performance model is proposed to study the scalability of TernGrad. Experiments
show significant speed gains for various deep neural networks. Our source code
is available. Comment: NIPS 2017 Oral.
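A minimal numpy sketch of the stochastic ternarization step the abstract describes (layer-wise scaling with a Bernoulli keep mask); gradient clipping and the distributed aggregation are omitted, and details may differ from the released implementation.

```python
import numpy as np

def ternarize(grad, rng=np.random.default_rng()):
    """TernGrad-style stochastic ternarization of one layer's gradient."""
    s = np.max(np.abs(grad))            # layer-wise scaler
    if s == 0:
        return np.zeros_like(grad)
    p = np.abs(grad) / s                # probability of keeping a nonzero level
    b = rng.random(grad.shape) < p      # Bernoulli keep mask
    return s * np.sign(grad) * b        # values in {-s, 0, +s}
```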
Federated Learning with Cooperating Devices: A Consensus Approach for Massive IoT Networks
Federated learning (FL) is emerging as a new paradigm for training machine learning models in distributed systems. Rather than sharing, and thereby disclosing, the training dataset with a server, the model parameters (e.g., neural network weights and biases) are optimized collectively by large populations of interconnected devices acting as local learners. FL can be applied to power-constrained IoT devices with slow and sporadic connections. In addition, it does not require data to be exported to third parties, preserving privacy. Despite these benefits, a main limitation of existing approaches is their centralized optimization, which relies on a server for aggregation and fusion of local parameters; this creates a single point of failure and scaling issues as the network size grows. This paper proposes a fully distributed (or server-less) learning approach: the proposed FL algorithms leverage the cooperation of devices that perform data operations inside the network by interleaving local computations with mutual interactions via consensus-based methods. The approach lays the groundwork for integrating FL within 5G-and-beyond networks characterized by decentralized connectivity and computing, with intelligence distributed over the end devices. The proposed methodology is verified on experimental datasets collected inside an industrial IoT environment. Comment: This work received support from the CHIST-ERA III Grant RadioSense
(Big Data and Process Modelling for the Smart Industry - BDSI). The paper has
been accepted for publication in the IEEE Internet of Things Journal. The
current arXiv contains an additional Appendix C that describes the database
and the Python scripts. Published version:
http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=8950073&isnumber=670252
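As a rough illustration of the consensus-based alternative to server-side aggregation, the sketch below performs one gossip/consensus step in which each device mixes its local parameter vector with those of its neighbors; the mixing rule, step size, and how it interleaves with local SGD are simplifying assumptions, not the exact algorithms proposed in the paper.

```python
import numpy as np

def consensus_step(params, adjacency, eps=0.2):
    """One consensus round over the device graph.

    params: list of per-device parameter vectors (all the same shape).
    adjacency: (n, n) 0/1 matrix; adjacency[i, j] = 1 if devices i and j are linked.
    eps: consensus step size (keep below 1 / max node degree for stability).
    """
    updated = []
    for i, w_i in enumerate(params):
        neighbors = np.flatnonzero(adjacency[i])
        w_new = w_i + eps * sum(params[j] - w_i for j in neighbors)
        updated.append(w_new)
    return updated

# Decentralized FL loop (sketch): each device runs a few local SGD steps on its
# own data, then all devices run consensus_step(...) to fuse their parameters.
```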
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
With the breakthroughs in deep learning, recent years have witnessed a boom in artificial intelligence (AI) applications and services, spanning from personal assistants to recommendation systems to video/audio surveillance. More recently, with the proliferation of mobile computing and the Internet of Things (IoT), billions of mobile and IoT devices are connected to the Internet, generating vast amounts of data at the network edge. Driven by this trend, there is an urgent need to push the AI frontier to the network edge so as to fully unleash the potential of edge big data. To meet this demand, edge computing, an emerging paradigm that pushes computing tasks and services from the network core to the network edge, has been widely recognized as a promising solution. The resulting new interdisciplinary field, edge AI or edge intelligence, is beginning to receive a tremendous amount of interest. However, research on edge intelligence is still in its infancy, and a dedicated venue for exchanging its recent advances is highly desired by both the computer systems and artificial intelligence communities. To this end, we conduct a comprehensive survey of recent research efforts on edge intelligence. Specifically, we first review the background and motivation for running artificial intelligence at the network edge. We then provide an overview of the overarching architectures, frameworks, and emerging key technologies for deep learning model training and inference at the network edge. Finally, we discuss future research opportunities on edge intelligence. We believe that this survey will attract escalating attention, stimulate fruitful discussions, and inspire further research ideas on edge intelligence. Comment: Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, and Junshan Zhang, "Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing," Proceedings of the IEEE.