The Convergence of Sparsified Gradient Methods

Author: Alistarh Dan
Hoefler Torsten
Johansson Mikael
Khirirat Sarit
Konstantinov Nikola
Renggli Cédric
Publication date: 01/01/2018

Distributed training of massive machine learning models, in particular deep neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace. Several families of communication-reduction methods, such as quantization, large-batch methods, and gradient sparsification, have been proposed. To date, gradient sparsification methods - where each node sorts gradients by magnitude, and only communicates a subset of the components, accumulating the rest locally - are known to yield some of the largest practical gains. Such methods can reduce the amount of communication per step by up to three orders of magnitude, while preserving model accuracy. Yet, this family of methods currently has no theoretical justification. This is the question we address in this paper. We prove that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD. The main insight is that sparsification methods implicitly maintain bounds on the maximum impact of stale updates, thanks to selection by magnitude. Our analysis and empirical validation also reveal that these methods do require analytical conditions to converge well, justifying existing heuristics.

Comment: NIPS 2018 - Advances in Neural Information Processing Systems; Authors in alphabetic orde
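
The scheme this abstract describes (each node keeps only the largest-magnitude gradient components for communication and accumulates the rest locally) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `top_k_sparsify` and `SparsifyingWorker`, the parameter `k`, and the example values are assumptions made for illustration, and the actual communication and SGD update steps are omitted.

```python
def top_k_sparsify(values, k):
    """Split a vector into its k largest-magnitude entries and the remainder."""
    # Indices of the k entries with the largest absolute value.
    order = sorted(range(len(values)), key=lambda i: abs(values[i]), reverse=True)
    keep = set(order[:k])
    sparse = [v if i in keep else 0.0 for i, v in enumerate(values)]
    residual = [0.0 if i in keep else v for i, v in enumerate(values)]
    return sparse, residual


class SparsifyingWorker:
    """Hypothetical worker that sends top-k components and keeps the rest locally."""

    def __init__(self, dim, k):
        self.k = k
        self.residual = [0.0] * dim  # locally accumulated, not yet communicated

    def compress(self, gradient):
        # Local error correction: add back what earlier steps left unsent.
        corrected = [g + r for g, r in zip(gradient, self.residual)]
        sparse, self.residual = top_k_sparsify(corrected, self.k)
        return sparse  # only these components would be communicated


worker = SparsifyingWorker(dim=4, k=1)
first = worker.compress([0.5, -2.0, 0.25, 0.125])
second = worker.compress([0.75, 0.125, 0.25, 0.0])
print(first)            # [0.0, -2.0, 0.0, 0.0]
print(second)           # [1.25, 0.0, 0.0, 0.0]
print(worker.residual)  # [0.0, 0.125, 0.5, 0.125]
```

In the second step, the first component is sent only because the 0.5 left over from the first step was added back to it, which is the local accumulation the abstract refers to.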

arXiv.org e-Print Archive

IST Austria: PubRep (Institute of Science and Technology)

AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training

Author: Agrawal Ankur
Brand Daniel
Chen Chia-Yu
Choi Jungwook
Gopalakrishnan Kailash
Zhang Wei
Publication date: 07/12/2017

Highly distributed training of Deep Neural Networks (DNNs) on future compute platforms (offering hundreds of TeraOps/s of computational capacity) is expected to be severely communication constrained. To overcome this limitation, new gradient compression techniques are needed that are computationally friendly, applicable to a wide variety of layers seen in Deep Neural Networks, and adaptable to variations in network architectures as well as their hyper-parameters. In this paper we introduce a novel technique - the Adaptive Residual Gradient Compression (AdaComp) scheme. AdaComp is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity. We show excellent results on a wide spectrum of state-of-the-art Deep Learning models in multiple domains (vision, speech, language), datasets (MNIST, CIFAR10, ImageNet, BN50, Shakespeare), optimizers (SGD with momentum, Adam) and network parameters (number of learners, minibatch size, etc.). Exploiting both sparsity and quantization, we demonstrate end-to-end compression rates of ~200X for fully-connected and recurrent layers, and ~40X for convolutional layers, without any noticeable degradation in model accuracies.

Comment: IBM Research AI, 9 pages, 7 figures, AAAI18 accepte

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Asynchronous spiking neurons, the natural key to exploit temporal sparsity

Author: Cavalcante Holanda Priscila
Dhoedt Bart
Hoseini Sahar
Khoei Mina A.
Leroux Sam
Linares-Barranco Bernabe
Moreira Orlando
Serrano-Gotarredona Teresa
Simoens Pieter
Tapson Jonathan
Yousefzadeh Amirreza
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2019

Inference of Deep Neural Networks for stream signal (Video/Audio) processing in edge devices is still challenging. Unlike most state-of-the-art inference engines, which are efficient for static signals, our brain is optimized for real-time dynamic signal processing. We believe one important feature of the brain (asynchronous stateful processing) is the key to its excellence in this domain. In this work, we show how asynchronous processing with stateful neurons allows exploitation of the existing sparsity in natural signals. This paper explains three different types of sparsity and proposes an inference algorithm which exploits all types of sparsity in the execution of already trained networks. Our experiments in three different applications (Handwritten digit recognition, Autonomous Steering and Hand-Gesture recognition) show that this model of inference reduces the number of required operations for sparse input data by one to two orders of magnitude. Additionally, due to fully asynchronous processing, this type of inference can be run on fully distributed and scalable neuromorphic hardware platforms.
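
One way to picture how stateful neurons can exploit temporal sparsity is a layer that remembers its last input and accumulated output, and does work only for inputs that changed between frames. The sketch below is an assumption made for illustration, not the inference algorithm proposed in the paper: the class `DeltaLinearLayer`, its `threshold` parameter, and the example weights are all hypothetical, and nonlinearities and the other types of sparsity mentioned in the abstract are left out.

```python
class DeltaLinearLayer:
    """Hypothetical stateful linear layer driven by input changes only."""

    def __init__(self, weights, threshold=0.0):
        self.weights = weights                     # weights[j][i]: input i -> output j
        self.threshold = threshold                 # changes at or below this are ignored
        self.last_input = [0.0] * len(weights[0])  # last value seen per input
        self.state = [0.0] * len(weights)          # accumulated output per neuron
        self.operations = 0                        # multiply-accumulates performed

    def step(self, frame):
        for i, value in enumerate(frame):
            delta = value - self.last_input[i]
            if abs(delta) <= self.threshold:
                continue  # input did not change: no event, no computation
            self.last_input[i] = value
            for j, row in enumerate(self.weights):
                self.state[j] += row[i] * delta
                self.operations += 1
        return list(self.state)


layer = DeltaLinearLayer([[1.0, 2.0, 0.5], [0.0, -1.0, 1.0]])
out1 = layer.step([1.0, 0.0, 2.0])
out2 = layer.step([1.0, 0.0, 3.0])  # only the third input changed
print(out1, out2, layer.operations)  # [2.0, 2.0] [2.5, 3.0] 6
```

A dense evaluation of the same two frames would take 12 multiply-accumulates. Here the count is 6, because inputs that stay at zero or repeat their previous value trigger no work.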

Ghent University Academic Bibliography