The Convergence of Sparsified Gradient Methods
Distributed training of massive machine learning models, in particular deep
neural networks, via Stochastic Gradient Descent (SGD) is becoming commonplace.
Several families of communication-reduction methods, such as quantization,
large-batch methods, and gradient sparsification, have been proposed. To date,
gradient sparsification methods - where each node sorts gradients by magnitude,
and only communicates a subset of the components, accumulating the rest locally
- are known to yield some of the largest practical gains. Such methods can
reduce the amount of communication per step by up to three orders of magnitude,
while preserving model accuracy. Yet, this family of methods currently has no
theoretical justification.
This is the question we address in this paper. We prove that, under analytic
assumptions, sparsifying gradients by magnitude with local error correction
provides convergence guarantees, for both convex and non-convex smooth
objectives, for data-parallel SGD. The main insight is that sparsification
methods implicitly maintain bounds on the maximum impact of stale updates,
thanks to selection by magnitude. Our analysis and empirical validation also
reveal that these methods do require analytical conditions to converge well,
justifying existing heuristics.
Comment: NIPS 2018 - Advances in Neural Information Processing Systems;
Authors in alphabetic order
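The scheme described in this abstract (each node keeps only the largest-magnitude gradient components and accumulates the rest locally) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the names `topk_sparsify` and `sparsified_sgd_step`, the learning rate `lr`, and the choice to average the sparse updates across nodes are all hypothetical.

```python
import numpy as np

def topk_sparsify(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    sparse = np.zeros_like(vec)
    if k > 0:
        # argpartition places the indices of the k largest |values| last
        idx = np.argpartition(np.abs(vec), -k)[-k:]
        sparse[idx] = vec[idx]
    return sparse

def sparsified_sgd_step(params, grads, residuals, k, lr):
    """One data-parallel step with top-k sparsification and local error correction.

    grads:     list of per-node gradient vectors
    residuals: list of per-node accumulators (updated in place)
    """
    update = np.zeros_like(params)
    for node, grad in enumerate(grads):
        acc = residuals[node] + lr * grad   # add what was not sent earlier
        sent = topk_sparsify(acc, k)        # only these k entries are communicated
        residuals[node] = acc - sent        # the remainder stays on the node
        update += sent
    return params - update / len(grads)     # assumption: average over nodes
```

Because the unsent components stay in `residuals`, a coordinate that is repeatedly skipped grows in the accumulator until its magnitude is large enough to be selected.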
L0-ARM: Network Sparsification via Stochastic Binary Optimization
We consider network sparsification as an L0-norm regularized binary
optimization problem, where each unit of a neural network (e.g., a weight,
neuron, or channel) is attached to a stochastic binary gate, whose
parameters are jointly optimized with original network parameters. The
Augment-Reinforce-Merge (ARM), a recently proposed unbiased gradient estimator,
is investigated for this binary optimization problem. Compared to the hard
concrete gradient estimator from Louizos et al., ARM demonstrates superior
performance in pruning network architectures while retaining almost the same
accuracy as the baseline methods. Similar to the hard concrete estimator, ARM
also enables conditional computation during model training but with improved
effectiveness due to the exact binary stochasticity. Thanks to the flexibility
of ARM, many smooth or non-smooth parametric functions, such as scaled sigmoid
or hard sigmoid, can be used to parameterize this binary optimization problem
and the unbiasedness of the ARM estimator is retained, while the hard concrete
estimator has to rely on the hard sigmoid function to achieve conditional
computation and thus accelerated training. Extensive experiments on multiple
public datasets demonstrate state-of-the-art pruning rates with almost the same
accuracy as the baseline methods. The resulting algorithm, L0-ARM, sparsifies
the Wide-ResNet models on CIFAR-10 and CIFAR-100 while the hard concrete
estimator cannot. The code is publicly available at
https://github.com/leo-yangli/l0-arm.
Comment: Published as a conference paper at ECML 201
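To illustrate how a gradient can pass through exactly binary gates, here is a minimal Monte Carlo sketch of the ARM estimator as it is commonly stated for Bernoulli gates with a sigmoid parameterization. It is not taken from the linked repository. The function names, the batched signature of `f`, and the sample count are assumptions made for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def arm_gradient(f, phi, num_samples, rng):
    """Estimate d/dphi of E[f(z)] where z_i ~ Bernoulli(sigmoid(phi_i)).

    f must accept an array of shape (num_samples, d) of binary gates and
    return one scalar per row (assumption for this sketch).
    """
    u = rng.uniform(size=(num_samples,) + phi.shape)
    z_a = (u > sigmoid(-phi)).astype(float)   # first gate sample
    z_b = (u < sigmoid(phi)).astype(float)    # antithetic gate sample, same u
    diff = f(z_a) - f(z_b)                    # shape (num_samples,)
    # ARM: average of (f(z_a) - f(z_b)) * (u - 1/2) over the samples
    return np.mean(diff[:, None] * (u - 0.5), axis=0)
```

For a linear objective `f(z) = z @ w`, the exact gradient is `w * p * (1 - p)` with `p = sigmoid(phi)`, which gives a simple way to check the estimate numerically.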
- …