Gradient Descent with Compressed Iterates
We propose and analyze a new type of stochastic first order method: gradient
descent with compressed iterates (GDCI). GDCI in each iteration first
compresses the current iterate using a lossy randomized compression technique,
and subsequently takes a gradient step. This method is a distillation of a key
ingredient in the current practice of federated learning, where a model needs
to be compressed by a mobile device before it is sent back to a server for
aggregation. Our analysis provides a step towards closing the gap between the
theory and practice of federated learning, and opens the possibility for many
extensions.
Comment: NeurIPS 2019 Workshop on Federated Learning for Data Privacy and
Confidentiality. 10 pages, 1 algorithm, 1 theorem, 5 lemmas.
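As a rough illustration of the GDCI update described above, the sketch below (NumPy, with a hypothetical `random_sparsify` compressor and a toy quadratic objective) compresses the iterate with an unbiased randomized sparsifier and then takes a gradient step evaluated at the compressed point; the paper's exact update rule and compressor class may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_sparsify(x, keep_prob=0.5):
    """Lossy randomized compressor: keep each coordinate with probability
    keep_prob and rescale the survivors so the compression is unbiased."""
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)

def gdci(grad, x0, lr=0.1, steps=200):
    """Gradient descent with compressed iterates: compress the current
    iterate, then take a gradient step evaluated at the compressed point
    (one plausible reading of the update; see the paper for the exact form)."""
    x = x0.copy()
    for _ in range(steps):
        xc = random_sparsify(x)   # lossy compression of the iterate
        x = x - lr * grad(xc)     # gradient step using the compressed iterate
    return x

# Toy problem f(x) = 0.5 * ||x||^2, so grad(x) = x; the iterates shrink toward 0.
print(gdci(lambda x: x, np.ones(4)))
```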
A Closer Look at Codistillation for Distributed Training
Codistillation has been proposed as a mechanism to share knowledge among
concurrently trained models by encouraging them to represent the same function
through an auxiliary loss. This contrasts with the more commonly used
fully-synchronous data-parallel stochastic gradient descent methods, where
different model replicas average their gradients (or parameters) at every
iteration and thus maintain identical parameters. We investigate codistillation
in a distributed training setup, complementing previous work which focused on
extremely large batch sizes. Surprisingly, we find that even at moderate batch
sizes, models trained with codistillation can perform as well as models trained
with synchronous data-parallel methods, despite using a much weaker
synchronization mechanism. These findings hold across a range of batch sizes
and learning rate schedules, as well as different kinds of models and datasets.
Obtaining this level of accuracy, however, requires properly accounting for the
regularization effect of codistillation, which we highlight through several
empirical observations. Overall, this work contributes to a better
understanding of codistillation and how to best take advantage of it in a
distributed computing environment.
Comment: Under review.
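The auxiliary loss described above can be pictured as a task loss plus a distillation term pulling each replica toward a peer replica's predictions. The PyTorch snippet below is a minimal sketch, not the paper's exact objective; `alpha` and `temperature` are hypothetical hyperparameters.

```python
import torch
import torch.nn.functional as F

def codistillation_loss(logits, targets, peer_logits, alpha=0.5, temperature=1.0):
    """Task loss plus an auxiliary distillation term that pulls this replica's
    predictions toward those of a (possibly stale) peer replica."""
    task = F.cross_entropy(logits, targets)
    distill = F.kl_div(
        F.log_softmax(logits / temperature, dim=-1),
        F.softmax(peer_logits.detach() / temperature, dim=-1),
        reduction="batchmean",
    )
    return task + alpha * distill

# Toy usage: in codistillation, `peer_logits` would come from another
# concurrently trained replica and be refreshed only occasionally,
# instead of synchronizing gradients at every step.
logits = torch.randn(8, 10, requires_grad=True)
peer_logits = torch.randn(8, 10)
targets = torch.randint(0, 10, (8,))
codistillation_loss(logits, targets, peer_logits).backward()
```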
APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm
Adam is an important optimization algorithm for guaranteeing efficiency and
accuracy when training many important tasks such as BERT and ImageNet. However,
Adam is generally not compatible with information (gradient) compression
technology. Therefore, the communication usually becomes the bottleneck for
parallelizing Adam. In this paper, we propose a communication-efficient {\bf
A}DAM-{\bf p}reconditioned {\bf M}omentum SGD algorithm, named APMSqueeze,
based on an error-compensated method for compressing gradients. The proposed
algorithm achieves convergence efficiency similar to Adam in terms of epochs,
but significantly reduces the running time per epoch. In terms of end-to-end
performance (including the full-precision pre-conditioning step), APMSqueeze
can provide a substantial speed-up, depending on network bandwidth. We also
provide a theoretical analysis of the convergence and efficiency.
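A minimal sketch of the error-compensated compression ingredient mentioned above is given below (NumPy, with a hypothetical top-k compressor); the Adam-style preconditioning and momentum that give APMSqueeze its name are omitted.

```python
import numpy as np

def topk(x, k):
    """Keep only the k largest-magnitude entries (a simple biased compressor)."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

class ErrorCompensatedCompressor:
    """Error compensation: the residual dropped by the compressor in one round
    is added back to the signal in the next round, so compression errors do not
    accumulate. This sketches only the compression side of the method."""

    def __init__(self, dim, k):
        self.err = np.zeros(dim)
        self.k = k

    def compress(self, g):
        corrected = g + self.err        # re-inject what was dropped last time
        sent = topk(corrected, self.k)  # what actually gets communicated
        self.err = corrected - sent     # remember the new residual
        return sent

comp = ErrorCompensatedCompressor(dim=8, k=2)
print(comp.compress(np.arange(8.0)), comp.err)
```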
Compressed Gradient Tracking Methods for Decentralized Optimization with Linear Convergence
Communication compression techniques are of growing interest for solving the
decentralized optimization problem under limited communication, where the
global objective is to minimize the average of local cost functions over a
multi-agent network using only local computation and peer-to-peer
communication. In this paper, we first propose a novel compressed gradient
tracking algorithm (C-GT) that combines gradient tracking technique with
communication compression. In particular, C-GT is compatible with a general
class of compression operators that unifies both unbiased and biased
compressors. We show that C-GT inherits the advantages of gradient
tracking-based algorithms and achieves linear convergence rate for strongly
convex and smooth objective functions. In the second part of this paper, we
propose an error feedback based compressed gradient tracking algorithm
(EF-C-GT) to further improve the algorithm efficiency for biased compression
operators. Numerical examples complement the theoretical findings and
demonstrate the efficiency and flexibility of the proposed algorithms.
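To make the gradient tracking ingredient concrete, the following NumPy sketch shows the standard (uncompressed) gradient tracking recursion on a toy problem; C-GT additionally compresses the quantities exchanged between neighbors, which this sketch leaves out.

```python
import numpy as np

def gradient_tracking(grads, W, x0, lr=0.1, steps=200):
    """Plain gradient tracking over a network with mixing matrix W.

    grads: list of per-agent gradient functions grad f_i(x)
    W:     doubly stochastic mixing matrix (n x n)
    """
    n = len(grads)
    x = np.tile(x0, (n, 1)).astype(float)
    g = np.array([grads[i](x[i]) for i in range(n)])
    y = g.copy()                        # tracker of the average gradient
    for _ in range(steps):
        x_new = W @ x - lr * y          # consensus step + descent along tracker
        g_new = np.array([grads[i](x_new[i]) for i in range(n)])
        y = W @ y + g_new - g           # update tracker with the gradient change
        x, g = x_new, g_new
    return x.mean(axis=0)

# Toy example: agent i holds f_i(x) = 0.5 * ||x - b_i||^2.
b = [np.array([1.0, -1.0]), np.array([3.0, 0.0]), np.array([-1.0, 2.0])]
grads = [lambda x, bi=bi: x - bi for bi in b]
print(gradient_tracking(grads, np.full((3, 3), 1 / 3), np.zeros(2)))  # ~ mean of b
```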
Periodic Stochastic Gradient Descent with Momentum for Decentralized Training
Decentralized training has been actively studied in recent years. Although a
wide variety of methods have been proposed, the decentralized momentum SGD
method is still underexplored. In this paper, we propose a novel periodic
decentralized momentum SGD method, which employs the momentum schema and
periodic communication for decentralized training. These two strategies,
together with the topology of the decentralized training system, make the
theoretical convergence analysis of our method challenging. We address this
problem and provide the condition under which our method achieves linear
speedup with respect to the number of workers. Furthermore, we
also introduce a communication-efficient variant to reduce the communication
cost in each communication round. The condition for achieving the linear
speedup is also provided for this variant. To the best of our knowledge, these
two methods are the first to achieve these theoretical results in their
respective settings. We conduct extensive experiments to verify the
performance of our proposed two methods, and both of them have shown superior
performance over existing methods.
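A minimal sketch of the periodic-communication idea follows, assuming a simple heavy-ball momentum buffer and a doubly stochastic mixing matrix; details such as whether momentum buffers are also mixed follow the paper and may differ from this sketch.

```python
import numpy as np

def periodic_momentum_sgd(grads, W, x0, lr=0.05, beta=0.9, period=5, rounds=60):
    """Each worker runs local momentum SGD and parameters are only averaged
    with neighbours (via the mixing matrix W) once per period."""
    n = len(grads)
    x = np.tile(x0, (n, 1)).astype(float)
    m = np.zeros_like(x)
    for _ in range(rounds):
        for _ in range(period):
            g = np.array([grads[i](x[i]) for i in range(n)])
            m = beta * m + g            # heavy-ball momentum buffer
            x = x - lr * m              # purely local step, no communication
        x = W @ x                       # one communication round per period
    return x.mean(axis=0)

# Toy example: worker i holds f_i(x) = 0.5 * ||x - b_i||^2.
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
grads = [lambda x, bi=bi: x - bi for bi in b]
print(periodic_momentum_sgd(grads, np.full((3, 3), 1 / 3), np.zeros(2)))  # ~ mean of b
```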
Communication-Efficient Decentralized Optimization Over Time-Varying Directed Graphs
We study decentralized optimization tasks carried out by a collection of
agents, each having access only to a local cost function; the agents, who can
communicate over a time-varying directed network, aim to minimize the sum of
those functions. In practical settings, communication constraints impose a
limit on the amount of information that can be exchanged between the agents. We
propose communication-efficient algorithms for decentralized convex
optimization and its special case, distributed average consensus, that rely on
sparsification of local updates exchanged between neighboring agents in the
network. Message sparsification alters column-stochasticity of the mixing
matrices of directed networks, a property that plays an important role in
establishing convergence of decentralized learning tasks. We show that by
locally modifying the mixing matrices, the proposed framework achieves an
$\mathcal{O}(\frac{\ln T}{\sqrt{T}})$ convergence rate in general decentralized
optimization settings, and a geometric convergence rate in the average
consensus problem. Experimental results on synthetic and real datasets show
efficacy of the proposed algorithms.
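The toy NumPy example below only illustrates the problem the abstract points to: with a column-stochastic mixing matrix, sparsifying the transmitted messages means the effective per-coordinate mixing weights no longer sum to one down each column. The paper's local modification of the mixing matrices, which repairs this, is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(1)

def sparsify(x, k):
    """Send only the k largest-magnitude coordinates of a message."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

# Column-stochastic mixing matrix for a small directed network: entry A[i, j]
# is the weight agent j assigns to the copy of its state sent to agent i.
A = np.array([[0.5, 0.0, 0.3],
              [0.5, 0.6, 0.0],
              [0.0, 0.4, 0.7]])
assert np.allclose(A.sum(axis=0), 1.0)

x = rng.normal(size=(3, 4))             # one row of parameters per agent

exact = A @ x                           # full-precision mixing
sparse = np.zeros_like(x)
for i in range(3):
    for j in range(3):
        if A[i, j] > 0:
            sparse[i] += A[i, j] * sparsify(x[j], k=2)   # sparsified messages

# Per coordinate, the received weights no longer sum to one down each column,
# which is the loss of column-stochasticity discussed in the abstract.
print(np.abs(exact - sparse).max())
```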
Communication-Efficient Distributed Stochastic AUC Maximization with Deep Neural Networks
In this paper, we study distributed algorithms for large-scale AUC
maximization with a deep neural network as a predictive model. Although
distributed learning techniques have been investigated extensively in deep
learning, they are not directly applicable to stochastic AUC maximization with
deep neural networks due to its striking differences from standard loss
minimization problems (e.g., cross-entropy). Towards addressing this challenge,
we propose and analyze a communication-efficient distributed optimization
algorithm based on a {\it non-convex concave} reformulation of the AUC
maximization, in which the communication of both the primal variable and the
dual variable between each worker and the parameter server only occurs after
multiple steps of gradient-based updates in each worker. Compared with the
naive parallel version of an existing algorithm that computes stochastic
gradients at individual machines and averages them for updating the model
parameters, our algorithm requires far fewer communication rounds
and still achieves a linear speedup in theory. To the best of our knowledge,
this is the \textbf{first} work that solves the {\it non-convex concave
min-max} problem for AUC maximization with deep neural networks in a
communication-efficient distributed manner while still maintaining the linear
speedup property in theory. Our experiments on several benchmark datasets show
the effectiveness of our algorithm and also confirm our theory.
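The communication pattern described above, several local primal-descent/dual-ascent steps followed by infrequent averaging of both variables, can be sketched generically as below on a toy saddle problem; the actual algorithm operates on a specific non-convex concave AUC reformulation that this sketch does not reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_minmax(grad_w, grad_a, w0, a0, n_workers=4, lr=0.05,
                 local_steps=8, rounds=60):
    """Local stochastic primal-descent / dual-ascent steps on each worker,
    with primal and dual variables averaged only once per round."""
    w = np.tile(w0, (n_workers, 1)).astype(float)
    a = np.tile(a0, (n_workers, 1)).astype(float)
    for _ in range(rounds):
        for _ in range(local_steps):
            gw = np.array([grad_w(wi, ai) for wi, ai in zip(w, a)])
            ga = np.array([grad_a(wi, ai) for wi, ai in zip(w, a)])
            noise = rng.normal(scale=0.01, size=w.shape)
            w = w - lr * (gw + noise)   # local primal descent (stochastic)
            a = a + lr * ga             # local dual ascent
        w[:] = w.mean(axis=0)           # one communication round:
        a[:] = a.mean(axis=0)           # average primal and dual variables
    return w[0], a[0]

# Toy saddle problem f(w, a) = 0.5||w||^2 + w.a - 0.5||a||^2, saddle at (0, 0).
print(local_minmax(lambda w, a: w + a, lambda w, a: w - a,
                   np.ones(3), np.ones(3)))
```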
PowerGossip: Practical Low-Rank Communication Compression in Decentralized Deep Learning
Lossy gradient compression has become a practical tool to overcome the
communication bottleneck in centrally coordinated distributed training of
machine learning models. However, algorithms for decentralized training with
compressed communication over arbitrary connected networks have been more
complicated, requiring additional memory and hyperparameters. We introduce a
simple algorithm that directly compresses the model differences between
neighboring workers using low-rank linear compressors. Inspired by the
PowerSGD algorithm for centralized deep learning,
this algorithm uses power iteration steps to maximize the information
transferred per bit. We prove that our method requires no additional
hyperparameters, converges faster than prior methods, and is asymptotically
independent of both the network and the compression. Out of the box, these
compressors perform on par with state-of-the-art tuned compression algorithms
in a series of deep learning benchmarks.
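Below is a sketch of the low-rank power-iteration compression primitive, assuming a rank-1 compressor applied to a generic matrix `delta` (in PowerGossip this would be the difference between neighboring workers' parameters for one layer); reusing the previous direction as a warm start lets repeated rounds refine the approximation.

```python
import numpy as np

def rank1_power_compress(delta, q_prev):
    """One power-iteration step producing a rank-1 approximation of `delta`.
    Only the two factor vectors p and q would need to be communicated."""
    p = delta @ q_prev
    p /= (np.linalg.norm(p) + 1e-12)
    q = delta.T @ p
    return p, q, np.outer(p, q)          # low-rank reconstruction of delta

rng = np.random.default_rng(0)
delta = rng.normal(size=(64, 32))        # stand-in for a model difference
q = rng.normal(size=32)                  # warm start reused across rounds
for _ in range(5):
    p, q, approx = rank1_power_compress(delta, q)
print(np.linalg.norm(delta - approx) / np.linalg.norm(delta))  # relative error
```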
Adaptive Serverless Learning
With the emergence of distributed data, training machine learning models in
a serverless manner has attracted increasing attention in recent years.
Numerous training approaches have been proposed in this regime, such as
decentralized SGD. However, existing decentralized algorithms focus only on
standard SGD, which may not be suitable for some applications, such as deep
factorization machines, where features are highly sparse and categorical and
an adaptive training algorithm is needed. In this paper, we propose a
novel adaptive decentralized training approach, which can compute the learning
rate from data dynamically. To the best of our knowledge, this is the first
adaptive decentralized training approach. Our theoretical results reveal that
the proposed algorithm can achieve linear speedup with respect to the number of
workers. Moreover, to reduce the communication overhead, we further
propose a communication-efficient adaptive decentralized training approach,
which can also achieve linear speedup with respect to the number of workers.
Finally, extensive experiments on different tasks have confirmed the
effectiveness of our two proposed approaches.
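One way to picture "computing the learning rate from data" in a decentralized setting is the NumPy sketch below, where each worker keeps Adam-style moment estimates locally and gossips parameters with its neighbors every step; this is an assumption-laden illustration, not the paper's algorithm.

```python
import numpy as np

def decentralized_adaptive(grads, W, x0, lr=0.02, b1=0.9, b2=0.999,
                           eps=1e-8, steps=500):
    """Each worker maintains its own first/second moment estimates (as in
    Adam), takes an adaptive local step, and then mixes parameters via W."""
    n = len(grads)
    x = np.tile(x0, (n, 1)).astype(float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = np.array([grads[i](x[i]) for i in range(n)])
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g ** 2
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        x = W @ (x - lr * m_hat / (np.sqrt(v_hat) + eps))  # adaptive step + gossip
    return x.mean(axis=0)

# Toy example: worker i holds f_i(x) = 0.5 * ||x - b_i||^2.
b = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
grads = [lambda x, bi=bi: x - bi for bi in b]
print(decentralized_adaptive(grads, np.full((3, 3), 1 / 3), np.zeros(2)))
# workers reach (approximate) consensus near the joint minimizer [1, 1]
```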
Linear Convergent Decentralized Optimization with Compression
Communication compression has become a key strategy to speed up distributed
optimization. However, existing decentralized algorithms with compression
mainly focus on compressing DGD-type algorithms. They are unsatisfactory in
terms of convergence rate, stability, and the capability to handle
heterogeneous data. Motivated by primal-dual algorithms, this paper proposes
the first \underline{L}in\underline{EA}r convergent \underline{D}ecentralized
algorithm with compression, LEAD. Our theory describes the coupled dynamics of
the inexact primal and dual update as well as compression error, and we provide
the first consensus error bound in such settings without assuming bounded
gradients. Experiments on convex problems validate our theoretical analysis,
and empirical study on deep neural nets shows that LEAD is applicable to
non-convex problems.
Comment: ICLR 2021 (International Conference on Learning Representations).
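Several linearly convergent compressed methods, LEAD among them, rely on compressing the difference between the current iterate and a reference state that sender and receiver keep in sync. The sketch below shows only that communication primitive, with a hypothetical random sparsifier and reference step size `alpha`; it is not LEAD's full primal-dual update.

```python
import numpy as np

rng = np.random.default_rng(1)

def sparsify(x, keep=0.5):
    """Random sparsification with rescaling (one admissible compressor)."""
    mask = rng.random(x.shape) < keep
    return np.where(mask, x / keep, 0.0)

class DifferenceCompressor:
    """Send a compressed version of the iterate's difference from a shared
    reference state, then move the reference toward the iterate. Both sides
    update the reference identically, so the receiver's estimate improves
    across rounds."""

    def __init__(self, dim, alpha=0.5):
        self.ref = np.zeros(dim)          # mirrored on the receiver
        self.alpha = alpha                # must be matched to compressor quality

    def transmit(self, x):
        q = sparsify(x - self.ref)        # only this compressed vector is sent
        self.ref = self.ref + self.alpha * q
        return self.ref                   # receiver's current estimate of x

dc = DifferenceCompressor(dim=6)
x = np.linspace(-1.0, 1.0, 6)
for _ in range(20):
    est = dc.transmit(x)
print(np.abs(est - x).max())              # estimate approaches x over rounds
```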