The Convergence of Sparsified Gradient Methods
Distributed training of massive machine learning models, in particular deep
neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.
Several families of communication-reduction methods, such as quantization,
large-batch methods, and gradient sparsification, have been proposed. To date,
gradient sparsification methods - where each node sorts gradients by magnitude,
and only communicates a subset of the components, accumulating the rest locally
- are known to yield some of the largest practical gains. Such methods can
reduce the amount of communication per step by up to three orders of magnitude,
while preserving model accuracy. Yet, this family of methods currently has no
theoretical justification.
This is the question we address in this paper. We prove that, under analytic
assumptions, sparsifying gradients by magnitude with local error correction
provides convergence guarantees, for both convex and non-convex smooth
objectives, for data-parallel SGD. The main insight is that sparsification
methods implicitly maintain bounds on the maximum impact of stale updates,
thanks to selection by magnitude. Our analysis and empirical validation also
reveal that these methods do require analytical conditions to converge well,
justifying existing heuristics.
Comment: NIPS 2018 - Advances in Neural Information Processing Systems;
Authors in alphabetic order
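The mechanism described in the abstract above (select components by magnitude, communicate only those, accumulate the rest locally) can be illustrated with a minimal single-node sketch. This is not the paper's algorithm or code: the function name `topk_sparsify_step`, the plain-list representation, and the choice of a fixed `k` are assumptions made only for illustration.

```python
def topk_sparsify_step(grad, residual, k):
    """One local step of magnitude-based sparsification with error accumulation.

    Hypothetical helper for illustration; not taken from the paper.
    """
    # Add the locally accumulated (previously unsent) components to the fresh gradient.
    acc = [g + r for g, r in zip(grad, residual)]
    # Indices of the k entries with the largest magnitude.
    top = sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k]
    # Only these components would be communicated to the other nodes.
    sent = {i: acc[i] for i in top}
    # Everything that was not sent stays in the residual for later steps.
    new_residual = [0.0 if i in sent else a for i, a in enumerate(acc)]
    return sent, new_residual


residual = [0.0, 0.0, 0.0, 0.0]
sent1, residual = topk_sparsify_step([0.5, -2.0, 0.25, 1.0], residual, k=1)
sent2, residual = topk_sparsify_step([0.5, 0.0, 0.25, 1.0], residual, k=1)
```

In the second call, the component at index 3 is selected because its accumulated value (the unsent 1.0 from the first step plus the new 1.0) now has the largest magnitude. This shows how delayed components are eventually transmitted rather than dropped.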
AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training
Highly distributed training of Deep Neural Networks (DNNs) on future compute
platforms (offering hundreds of TeraOps/s of computational capacity) is expected to
be severely communication constrained. To overcome this limitation, new
gradient compression techniques are needed that are computationally friendly,
applicable to a wide variety of layers seen in Deep Neural Networks and
adaptable to variations in network architectures as well as their
hyper-parameters. In this paper we introduce a novel technique - the Adaptive
Residual Gradient Compression (AdaComp) scheme. AdaComp is based on localized
selection of gradient residues and automatically tunes the compression rate
depending on local activity. We show excellent results on a wide spectrum of
state-of-the-art Deep Learning models in multiple domains (vision, speech,
language), datasets (MNIST, CIFAR10, ImageNet, BN50, Shakespeare), optimizers
(SGD with momentum, Adam) and network parameters (number of learners,
minibatch-size etc.). Exploiting both sparsity and quantization, we demonstrate
end-to-end compression rates of ~200X for fully-connected and recurrent layers,
and ~40X for convolutional layers, without any noticeable degradation in model
accuracies.
Comment: IBM Research AI, 9 pages, 7 figures, AAAI18 accepted
Asynchronous spiking neurons, the natural key to exploit temporal sparsity
Inference of Deep Neural Networks for streaming signal (video/audio) processing on edge devices is still challenging. Unlike most state-of-the-art inference engines, which are efficient for static signals, our brain is optimized for real-time dynamic signal processing. We believe one important feature of the brain (asynchronous stateful processing) is the key to its excellence in this domain. In this work, we show how asynchronous processing with stateful neurons allows exploitation of the existing sparsity in natural signals. This paper explains three different types of sparsity and proposes an inference algorithm which exploits all of them in the execution of already trained networks. Our experiments in three different applications (handwritten digit recognition, autonomous steering and hand-gesture recognition) show that this model of inference reduces the number of required operations for sparse input data by one to two orders of magnitude. Additionally, due to fully asynchronous processing, this type of inference can be run on fully distributed and scalable neuromorphic hardware platforms.
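One way to picture how stateful neurons can exploit temporal sparsity is a layer that remembers its last input and output and does work only for inputs that changed. The sketch below is an assumption-laden illustration of that general idea, not the inference algorithm proposed in the paper: the class name `DeltaLinear`, the `threshold` parameter, and the operation counter `ops` are all hypothetical.

```python
class DeltaLinear:
    """Stateful linear layer that updates its output only for changed inputs.

    Hypothetical illustration of temporal sparsity; not the paper's algorithm.
    """

    def __init__(self, weights):
        self.weights = weights                      # weights[j][i]: input i -> output j
        self.last_input = [0.0] * len(weights[0])   # state: last propagated input
        self.output = [0.0] * len(weights)          # state: running output
        self.ops = 0                                # multiply-accumulate counter

    def step(self, x, threshold=0.0):
        for i, xi in enumerate(x):
            delta = xi - self.last_input[i]
            if abs(delta) <= threshold:
                continue                            # unchanged input: no work done
            self.last_input[i] = xi
            for j, row in enumerate(self.weights):
                self.output[j] += row[i] * delta    # propagate only the change
                self.ops += 1
        return list(self.output)


layer = DeltaLinear([[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]])
first = layer.step([1.0, 1.0, 1.0])    # all three inputs are new: 6 operations
second = layer.step([1.0, 2.0, 1.0])   # only input 1 changed: 2 more operations
```

The second frame costs 2 multiply-accumulates instead of the 6 a dense recomputation would need, which is the kind of saving the abstract attributes to sparse input data.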
- …