6,076 research outputs found
A Hitchhiker's Guide On Distributed Training of Deep Neural Networks
Deep learning has led to tremendous advancements in the field of Artificial
Intelligence. One caveat however is the substantial amount of compute needed to
train these deep learning models. Training on a benchmark dataset like ImageNet
with a single modern GPU can take up to a week; distributing training across
multiple machines has been observed to bring this time down drastically. Recent
work has reduced ImageNet training time to as little as 4 minutes by using a
cluster of 2048 GPUs. This paper surveys the various
algorithms and techniques used to distribute training and presents the current
state of the art for a modern distributed training framework. More
specifically, we explore the synchronous and asynchronous variants of
distributed Stochastic Gradient Descent, various all-reduce gradient
aggregation strategies, and best practices for obtaining higher throughput and
lower latency on a cluster, such as mixed-precision training, large-batch
training, and gradient compression.
Comment: 14 pages
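As an illustration of the all-reduce aggregation strategies the survey covers, here is a minimal sketch (not from the paper) of a ring all-reduce over simulated workers: n − 1 scatter-reduce steps accumulate each gradient chunk around the ring, then n − 1 all-gather steps circulate the fully reduced chunks, so every worker ends with the sum while each link only ever carries one chunk per step.

```python
import numpy as np

def ring_allreduce(chunks):
    """Simulated ring all-reduce: n workers each hold n gradient chunks."""
    n = len(chunks)
    data = [[np.array(c, dtype=float) for c in worker] for worker in chunks]
    # scatter-reduce: at step s, worker i sends chunk (i - s) % n to worker i+1
    for s in range(n - 1):
        sends = [(i, (i - s) % n, data[i][(i - s) % n].copy()) for i in range(n)]
        for i, c, payload in sends:
            data[(i + 1) % n][c] += payload
    # worker i now owns the fully reduced chunk (i + 1) % n
    # all-gather: at step s, worker i forwards chunk (i + 1 - s) % n to worker i+1
    for s in range(n - 1):
        sends = [(i, (i + 1 - s) % n, data[i][(i + 1 - s) % n].copy()) for i in range(n)]
        for i, c, payload in sends:
            data[(i + 1) % n][c] = payload
    return data

# three workers, each holding three one-element gradient chunks
reduced = ring_allreduce([[[1.0], [2.0], [3.0]],
                          [[4.0], [5.0], [6.0]],
                          [[7.0], [8.0], [9.0]]])
# every worker ends with the element-wise sums [12.0], [15.0], [18.0]
```

The bandwidth per worker is independent of the number of workers, which is why ring-style all-reduce underpins most large-cluster gradient aggregation.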
Variational Adaptive-Newton Method for Explorative Learning
We present the Variational Adaptive Newton (VAN) method which is a black-box
optimization method especially suitable for explorative-learning tasks such as
active learning and reinforcement learning. Similar to Bayesian methods, VAN
estimates a distribution that can be used for exploration, but requires
computations that are similar to continuous optimization methods. Our
theoretical contribution reveals that VAN is a second-order method that unifies
existing methods in distinct fields of continuous optimization, variational
inference, and evolution strategies. Our experimental results show that VAN
performs well on a wide variety of learning tasks. This work presents a
general-purpose explorative-learning method that has the potential to improve
learning in areas such as active learning and reinforcement learning.
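To give a feel for the flavor of update the abstract describes, here is a heavily simplified sketch (my reading, not the paper's algorithm): maintain a Gaussian N(μ, Σ) over parameters, move the precision toward the local Hessian, and take a covariance-preconditioned, Newton-like step on the mean; the distribution can then be sampled for exploration. The exact VAN update differs in its details.

```python
import numpy as np

def van_style_step(mu, prec, grad_fn, hess_fn, beta=0.3):
    """One simplified step: blend the precision (inverse covariance) toward
    the local Hessian, then take a covariance-preconditioned gradient step."""
    g, H = grad_fn(mu), hess_fn(mu)
    prec = (1 - beta) * prec + beta * H
    mu = mu - beta * np.linalg.solve(prec, g)
    return mu, prec

# toy quadratic objective f(theta) = 0.5 theta^T A theta - b^T theta
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad_fn = lambda th: A @ th - b
hess_fn = lambda th: A

mu, prec = np.zeros(2), np.eye(2)
for _ in range(200):
    mu, prec = van_style_step(mu, prec, grad_fn, hess_fn)
# mu approaches the minimizer A^{-1} b, and samples from N(mu, prec^{-1})
# could be used for exploration, e.g. in active or reinforcement learning
```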
Understanding the Energy and Precision Requirements for Online Learning
It is well known that the precision of the data, hyperparameters, and internal
representations employed in learning systems directly impacts their energy,
throughput, and latency. The precision requirements for the training algorithm
are also important for systems that learn on-the-fly. Prior work has shown that
the data and hyperparameters can be quantized heavily without incurring much
penalty in classification accuracy when compared to floating point
implementations. These works suffer from two key limitations. First, they
assume uniform precision for the classifier and for the training algorithm and
thus miss out on the opportunity to further reduce precision. Second, prior
works are empirical studies. In this article, we overcome both these
limitations by deriving analytical lower bounds on the precision requirements
of the commonly employed stochastic gradient descent (SGD) on-line learning
algorithm in the specific context of a support vector machine (SVM). Lower
bounds on the data precision are derived in terms of the the desired
classification accuracy and precision of the hyperparameters used in the
classifier. Additionally, lower bounds on the hyperparameter precision in the
SGD training algorithm are obtained. These bounds are validated using both a
synthetic dataset and the UCI breast cancer dataset. Additionally, the impact
of these precisions on the energy consumption of a fixed-point SVM with on-line
training is studied.
Comment: 14 pages, 5 figures, 4 of which have 2 subfigures
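The setting can be pictured with a small simulation (an illustrative sketch, not the article's derivation or code): train a linear SVM with hinge-loss SGD while keeping the weights in a uniform fixed-point format of a chosen bit width, and check that classification accuracy survives the reduced precision. The quantizer, toy data, and hyperparameters below are my own choices.

```python
import numpy as np

def quantize(x, bits, clip=1.0):
    """Uniform fixed-point quantizer: values clipped to [-clip, clip] and
    rounded to a grid with 2^(bits-1) steps per side."""
    step = clip / 2 ** (bits - 1)
    levels = np.clip(np.round(x / step), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return levels * step

def sgd_svm(X, y, w_bits, lr=0.05, lam=0.01, epochs=20, seed=0):
    """Hinge-loss SGD for a linear SVM with weights stored in fixed point."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(y)):
            margin = y[i] * (X[i] @ w)
            g = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            w = quantize(w - lr * g, w_bits)   # weights re-quantized every update
    return w

# separable toy data standing in for the UCI set
rng = np.random.default_rng(1)
y = rng.choice([-1.0, 1.0], size=200)
X = 0.3 * rng.normal(size=(200, 2)) + y[:, None]
w8 = sgd_svm(X, y, w_bits=8)
acc8 = np.mean(np.sign(X @ w8) == y)   # accuracy stays high despite 8-bit weights
```

Sweeping `w_bits` downward in such a simulation is the empirical counterpart of the analytical lower bounds the article derives.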
ATOMO: Communication-efficient Learning via Atomic Sparsification
Distributed model training suffers from communication overheads due to
frequent gradient updates transmitted between compute nodes. To mitigate these
overheads, several studies propose the use of sparsified stochastic gradients.
We argue that these are facets of a general sparsification method that can
operate on any possible atomic decomposition. Notable examples include
element-wise, singular value, and Fourier decompositions. We present ATOMO, a
general framework for atomic sparsification of stochastic gradients. Given a
gradient, an atomic decomposition, and a sparsity budget, ATOMO gives a random
unbiased sparsification of the atoms that minimizes variance. We show that
recent methods such as QSGD and TernGrad are special cases of ATOMO, and that
sparsifying the singular value decomposition of neural network gradients,
rather than their coordinates, can lead to significantly faster distributed
training.
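The core idea, unbiased sparsification, is easy to sketch for the simplest (element-wise) atomic decomposition: keep atom i with probability p_i and rescale the survivors by 1/p_i, so the expectation equals the original gradient. The probabilities below are an arbitrary illustrative choice, not ATOMO's variance-minimizing allocation.

```python
import numpy as np

def sparsify(coeffs, probs, rng):
    """Keep atom i with probability p_i and rescale by 1/p_i, so the
    sparsified gradient is an unbiased estimate of the original."""
    keep = rng.random(coeffs.shape) < probs
    out = np.zeros_like(coeffs)
    out[keep] = coeffs[keep] / probs[keep]
    return out

rng = np.random.default_rng(0)
g = np.array([0.8, -0.2, 0.05, 0.5])                    # element-wise atoms
p = np.minimum(1.0, 2.0 * np.abs(g) / np.abs(g).sum())  # ~2 atoms kept on average
mean_est = np.mean([sparsify(g, p, rng) for _ in range(20000)], axis=0)
# mean_est approaches g: dropping atoms costs variance, not bias
```

Swapping element-wise atoms for singular vectors of the gradient matrix gives the SVD variant the abstract highlights.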
Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge Computing
With the breakthroughs in deep learning, the recent years have witnessed a
booming of artificial intelligence (AI) applications and services, spanning
from personal assistants to recommendation systems to video/audio surveillance.
More recently, with the proliferation of mobile computing and the Internet of
Things (IoT), billions of mobile and IoT devices are connected to the Internet,
generating zillions of bytes of data at the network edge. Driven by this trend,
there is an urgent need to push the AI frontiers to the network
edge so as to fully unleash the potential of the edge big data. To meet this
demand, edge computing, an emerging paradigm that pushes computing tasks and
services from the network core to the network edge, has been widely recognized
as a promising solution. The resulting interdiscipline, edge AI or edge
intelligence, is beginning to receive a tremendous amount of interest. However,
research on edge intelligence is still in its infancy, and a dedicated venue
for exchanging recent advances in edge intelligence is highly desired
by both the computer system and artificial intelligence communities. To this
end, we conduct a comprehensive survey of the recent research efforts on edge
intelligence. Specifically, we first review the background and motivation for
artificial intelligence running at the network edge. We then provide an
overview of the overarching architectures, frameworks, and emerging key
technologies for deep learning model training and inference at the network
edge. Finally, we discuss future research opportunities on edge intelligence.
We believe that this survey will attract escalating attention, stimulate
fruitful discussions, and inspire further research ideas on edge intelligence.
Comment: Zhi Zhou, Xu Chen, En Li, Liekang Zeng, Ke Luo, and Junshan Zhang,
"Edge Intelligence: Paving the Last Mile of Artificial Intelligence with Edge
Computing," Proceedings of the IEEE
Model compression as constrained optimization, with application to neural nets. Part II: quantization
We consider the problem of deep neural net compression by quantization: given
a large reference net, we want to quantize its real-valued weights using a
codebook with a small number of entries so that the training loss of the
quantized net is minimal. The codebook can be optimally learned jointly with
the net, or fixed,
as for binarization or ternarization approaches. Previous work has quantized
the weights of the reference net, or incorporated rounding operations in the
backpropagation algorithm, but this has no guarantee of converging to a
loss-optimal, quantized net. We describe a new approach based on the recently
proposed framework of model compression as constrained optimization
\citep{Carreir17a}. This results in a simple iterative "learning-compression"
algorithm, which alternates a step that learns a net of continuous weights with
a step that quantizes (or binarizes/ternarizes) the weights, and is guaranteed
to converge to a local optimum of the loss for quantized nets. We develop
algorithms for an adaptive codebook or a (partially) fixed codebook. The latter
includes binarization, ternarization, powers-of-two and other important
particular cases. We show experimentally that we can achieve much higher
compression rates than previous quantization work (even using just 1 bit per
weight) with negligible loss degradation.
Comment: 33 pages, 15 figures, 3 tables
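A rough sketch of the alternation the abstract describes (my toy instantiation, with a trivial quadratic "training loss" standing in for a real net): the C step fits a small adaptive codebook to the current weights by 1-D k-means, and the L step re-trains the continuous weights under a quadratic penalty pulling them toward their quantized version, with the penalty weight μ increased over iterations.

```python
import numpy as np

def c_step(w, k, iters=20):
    """C step: 1-D k-means codebook for the weights; returns quantized weights."""
    codebook = np.quantile(w, np.linspace(0, 1, k))
    for _ in range(iters):
        assign = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                codebook[j] = w[assign == j].mean()
    return codebook[assign], codebook

def l_step(w, theta, grad_loss, mu, steps=20):
    """L step: minimize loss(w) + (mu/2) * ||w - theta||^2 by gradient descent."""
    lr = 1.0 / (1.0 + mu)   # step size matched to this toy quadratic's curvature
    for _ in range(steps):
        w = w - lr * (grad_loss(w) + mu * (w - theta))
    return w

rng = np.random.default_rng(0)
w_ref = rng.normal(size=50)               # "pretrained" reference weights
grad_loss = lambda w: w - w_ref           # gradient of 0.5 * ||w - w_ref||^2

w = w_ref.copy()
theta, codebook = c_step(w, k=4)
for t in range(20):
    mu = 0.1 * 1.5 ** t                   # slowly tighten the quantization penalty
    w = l_step(w, theta, grad_loss, mu)
    theta, codebook = c_step(w, k=4)
# theta uses only 4 distinct weight values yet stays close to w_ref
```

Freezing the codebook to {-1, +1} or {-1, 0, +1} instead of re-fitting it in the C step gives the binarization and ternarization special cases.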
Scaling Distributed Training of Flood-Filling Networks on HPC Infrastructure for Brain Mapping
Mapping all the neurons in the brain requires automatic reconstruction of
entire cells from volume electron microscopy data. The flood-filling network
(FFN) architecture has demonstrated leading performance for segmenting
structures from this data. However, the training of the network is
computationally expensive. In order to reduce the training time, we implemented
synchronous and data-parallel distributed training using the Horovod library,
which is different from the asynchronous training scheme used in the published
FFN code. We demonstrated that our distributed training scaled well up to 2048
Intel Knights Landing (KNL) nodes on the Theta supercomputer. Our trained
models achieved a similar level of inference performance but took less training
time than previous methods. Our study on the effects of different batch
sizes on FFN training suggests ways to further improve training efficiency. Our
findings on optimal learning rate and batch sizes agree with previous works.
Comment: 9 pages, 10 figures
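The synchronous, data-parallel scheme can be pictured with a few simulated workers (a sketch of the general pattern, not the authors' Horovod/TensorFlow setup): each worker computes a gradient on its shard, the gradients are averaged — the role an allreduce plays in Horovod — and every replica applies the identical update, so the replicas never diverge.

```python
import numpy as np

def sync_step(w, shards, grad_fn, lr):
    """One synchronous data-parallel step: per-shard gradients are averaged
    (the job of an allreduce) and all replicas apply the same update."""
    grads = [grad_fn(w, X, y) for X, y in shards]
    return w - lr * np.mean(grads, axis=0)

# toy least-squares model standing in for the flood-filling network
grad_fn = lambda w, X, y: 2.0 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(64, 3))
y = X @ w_true
shards = [(X[i::4], y[i::4]) for i in range(4)]   # 4 equal-size simulated workers

w = np.zeros(3)
for _ in range(300):
    w = sync_step(w, shards, grad_fn, lr=0.1)
# with equal shards, the averaged gradient equals the full-batch gradient,
# so training matches single-machine SGD with a 4x larger batch
```

Scaling the worker count therefore scales the effective batch size, which is why the batch-size study in the paper matters for distributed efficiency.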
Addressing the Loss-Metric Mismatch with Adaptive Loss Alignment
In most machine learning training paradigms, a fixed, often handcrafted, loss
function is assumed to be a good proxy for an underlying evaluation metric. In
this work we assess this assumption by meta-learning an adaptive loss function
to directly optimize the evaluation metric. We propose a sample efficient
reinforcement learning approach for adapting the loss dynamically during
training. We empirically show how this formulation improves performance by
simultaneously optimizing the evaluation metric and smoothing the loss
landscape. We verify our method in metric learning and classification
scenarios, showing considerable improvements over the state-of-the-art on a
diverse set of tasks. Importantly, our method is applicable to a wide range of
loss functions and evaluation metrics. Furthermore, the learned policies are
transferable across tasks and data, demonstrating the versatility of the
method.
Comment: Accepted to ICML 201
Online Learning to Sample
Stochastic Gradient Descent (SGD) is one of the most widely used techniques
for online optimization in machine learning. In this work, we accelerate SGD by
adaptively learning how to sample the most useful training examples at each
time step. First, we show that SGD can be used to learn the best possible
sampling distribution of an importance sampling estimator. Second, we show that
the sampling distribution of an SGD algorithm can be estimated online by
incrementally minimizing the variance of the gradient. The resulting algorithm
- called Adaptive Weighted SGD (AW-SGD) - maintains a set of parameters to
optimize, as well as a set of parameters to sample learning examples. We show
that AW-SGD yields faster convergence in three different applications: (i) image
classification with deep features, where the sampling of images depends on
their labels, (ii) matrix factorization, where rows and columns are not sampled
uniformly, and (iii) reinforcement learning, where the optimized and
exploration policies are estimated at the same time; here our approach
corresponds to an off-policy gradient algorithm.
Comment: Update: removed convergence theorem and proof as there is an error.
Submitted to UAI 201
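The key building block — an importance-sampled gradient estimate whose variance the sampler is trained to shrink — can be sketched directly. This shows only the unbiased weighting and the variance objective, not AW-SGD's learned sampling parameters; the data are a toy 1-D stand-in for per-example gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
grads = rng.exponential(size=100)        # per-example gradient magnitudes (toy)

def estimate(probs, n=50000):
    """Draw examples i ~ probs and weight by 1/(N * p_i): unbiased for the
    full average gradient no matter which sampling distribution is used."""
    idx = rng.choice(len(grads), size=n, p=probs)
    est = grads[idx] / (len(grads) * probs[idx])
    return est.mean(), est.var()

uniform = np.full(100, 0.01)
adaptive = grads / grads.sum()           # sample proportional to |gradient|

mean_u, var_u = estimate(uniform)
mean_a, var_a = estimate(adaptive)
# both means match grads.mean() (unbiasedness), but the adaptive sampler's
# estimator variance collapses: in this 1-D toy it is driven essentially to zero
```

AW-SGD's contribution is to learn such a sampling distribution online with SGD, incrementally minimizing exactly this gradient variance.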
A Survey on Methods and Theories of Quantized Neural Networks
Deep neural networks are the state-of-the-art methods for many real-world
tasks, such as computer vision, natural language processing and speech
recognition. For all their popularity, deep neural networks are also criticized
for consuming a lot of memory and draining the battery life of devices during
training and inference. This makes it hard to deploy these models on mobile or
embedded devices which have tight resource constraints. Quantization is
recognized as one of the most effective approaches to satisfy the extreme
memory requirements that deep neural network models demand. Instead of adopting
the 32-bit floating-point format to represent weights, quantized representations
store weights using more compact formats such as integers or even binary
numbers. Despite a possible degradation in predictive performance, quantization
provides a potential solution to greatly reduce the model size and the energy
consumption. In this survey, we give a thorough review of different aspects of
quantized neural networks. Current challenges and trends of quantized neural
networks are also discussed.
Comment: 17 pages, 8 figures
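As a concrete (illustrative, not from the survey) instance of the compact formats discussed, here is symmetric uniform quantization of weights to signed integers plus a common 1-bit binary scheme (sign times mean magnitude):

```python
import numpy as np

def quantize_weights(w, bits):
    """Symmetric uniform quantization: store weights as signed 'bits'-bit
    integers plus one float scale; dequantize by multiplying back."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    q = np.round(w / scale).clip(-qmax - 1, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
w = rng.normal(size=1000)
q8, s8 = quantize_weights(w, bits=8)   # int8 storage: 4x smaller than float32
w8 = q8 * s8                           # dequantized weights for inference
err8 = np.max(np.abs(w - w8))          # at most half a quantization step

w1 = np.sign(w) * np.abs(w).mean()     # 1-bit binary weights with a float scale
```

The trade-off the survey examines is visible here: 8-bit weights reconstruct the originals almost exactly, while 1-bit weights shrink the model by ~32x at the cost of much larger representation error.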