15 research outputs found
A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks
Distributed synchronous stochastic gradient descent (S-SGD) has been widely
used in training large-scale deep neural networks (DNNs), but it typically
requires very high communication bandwidth between computational workers (e.g.,
GPUs) to exchange gradients iteratively. Recently, Top-k sparsification
techniques have been proposed to reduce the volume of data to be exchanged
among workers. Top-k sparsification can zero out a significant portion of
gradients without impacting the model convergence. However, the sparse
gradients should be transferred with their irregular indices, which makes the
sparse gradients aggregation difficult. Current methods that use AllGather to
accumulate the sparse gradients have a communication complexity of O(kP),
where P is the number of workers, which is inefficient on low bandwidth
networks with a large number of workers. We observe that not all top-k
gradients from P workers are needed for the model update, and therefore we
propose a novel global Top-k (gTop-k) sparsification mechanism to address
the problem. Specifically, we choose the global top-k largest absolute values of
gradients from P workers, instead of accumulating all local top-k gradients
to update the model in each iteration. The gradient aggregation method based on
gTop-k sparsification reduces the communication complexity from O(kP) to
O(k log P). Through extensive experiments on different DNNs, we verify that
gTop-k S-SGD has nearly consistent convergence performance with S-SGD, and it
has only slight degradations on generalization performance. In terms of scaling
efficiency, we evaluate gTop-k on a cluster with 32 GPU machines which are
interconnected with 1 Gbps Ethernet. The experimental results show that our
method achieves higher scaling efficiency than S-SGD and an
improvement over the existing Top-k S-SGD.
Comment: 10 pages. Add discussion with more experimental results. To appear at
the ICDCS2019 workshop
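The global top-k idea in the abstract above can be illustrated with a small sketch. It assumes a tree-style pairwise reduction in which every message carries at most k values, which is one way to obtain roughly log2(P) communication rounds. The helper names (`top_k`, `merge_top_k`, `gtop_k`) and the toy gradients are hypothetical and are not taken from the paper's code.

```python
from typing import Dict, List

SparseGrad = Dict[int, float]  # index -> gradient value


def top_k(dense: List[float], k: int) -> SparseGrad:
    """Keep the k entries with the largest absolute value (local top-k)."""
    idx = sorted(range(len(dense)), key=lambda i: abs(dense[i]), reverse=True)[:k]
    return {i: dense[i] for i in idx}


def merge_top_k(a: SparseGrad, b: SparseGrad, k: int) -> SparseGrad:
    """Sum two sparse gradients, then keep only the k largest magnitudes."""
    summed = dict(a)
    for i, v in b.items():
        summed[i] = summed.get(i, 0.0) + v
    keep = sorted(summed, key=lambda i: abs(summed[i]), reverse=True)[:k]
    return {i: summed[i] for i in keep}


def gtop_k(local: List[SparseGrad], k: int) -> SparseGrad:
    """Pairwise reduction over workers: about log2(P) rounds, k values per message."""
    level = list(local)
    while len(level) > 1:
        nxt = [merge_top_k(level[i], level[i + 1], k)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired worker moves up unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]


# Toy gradients for P = 4 workers (made-up values for illustration only).
grads = [
    [0.9, -0.1, 0.0, 0.2],
    [0.8, 0.0, -0.3, 0.1],
    [0.1, 0.05, -0.7, 0.0],
    [0.0, 0.6, -0.6, 0.1],
]
k = 2
result = gtop_k([top_k(g, k) for g in grads], k)
```

By contrast, an AllGather-based scheme would collect all P local top-k sets on every worker, which is where the O(kP) cost comes from.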
FEDZIP: A Compression Framework for Communication-Efficient Federated Learning
Federated Learning marks a turning point in the implementation of
decentralized machine learning (especially deep learning) for wireless devices
by protecting users' privacy and safeguarding raw data from third-party access.
It assigns the learning process independently to each client. First, clients
locally train a machine learning model based on local data. Next, clients
transfer local updates of model weights and biases (training data) to a server.
Then, the server aggregates updates (received from clients) to create a global
learning model. However, the continuous transfer between clients and the server
increases communication costs and is inefficient from a resource utilization
perspective due to the large number of parameters (weights and biases) used by
deep learning models. The cost of communication becomes a greater concern when
the number of contributing clients and communication rounds increases. In this
work, we propose a novel framework, FedZip, that significantly decreases the
size of updates while transferring weights from the deep learning model between
clients and their servers. FedZip implements Top-z sparsification, uses
quantization with clustering, and implements compression with three different
encoding methods. FedZip outperforms state-of-the-art compression frameworks,
reaches compression rates of up to 1085x, and preserves up to 99% of bandwidth
and 99% of energy for clients during communication.
Understanding Top-k Sparsification in Distributed Deep Learning
Distributed stochastic gradient descent (SGD) algorithms are widely deployed
in training large-scale deep learning models, while the communication overhead
among workers becomes the new system bottleneck. Recently proposed gradient
sparsification techniques, especially Top-k sparsification with error
compensation (TopK-SGD), can significantly reduce the communication traffic
without an obvious impact on the model accuracy. Some theoretical studies have
been carried out to analyze the convergence property of TopK-SGD. However,
existing studies do not dive into the details of the Top-k operator in gradient
sparsification and use relaxed bounds (e.g., exact bound of Random-k) for
analysis; hence the derived results cannot well describe the real convergence
performance of TopK-SGD. To this end, we first study the gradient distributions
of TopK-SGD during the training process through extensive experiments. We then
theoretically derive a tighter bound for the Top-k operator. Finally, we
exploit the property of gradient distribution to propose an approximate top-k
selection algorithm, which is computing-efficient for GPUs, to improve the
scaling efficiency of TopK-SGD by significantly reducing the computing
overhead. Codes are available at:
https://github.com/hclhkbu/GaussianK-SGD
Comment: 14 pages
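Top-k sparsification with error compensation, as named in the abstract above, can be sketched for a single worker as follows. This is a minimal illustration that assumes the usual residual-accumulation form, in which entries that are not sent are carried over to the next iteration. The function name and the toy gradients are hypothetical, and the sketch uses an exact sort rather than the approximate selection the abstract proposes.

```python
from typing import List, Tuple


def top_k_with_residual(
    grad: List[float], residual: List[float], k: int
) -> Tuple[List[float], List[float]]:
    """Return (sparsified gradient to send, residual kept locally)."""
    # Add back what was left unsent in earlier iterations.
    acc = [g + r for g, r in zip(grad, residual)]
    # Exact top-k by magnitude.
    keep = set(sorted(range(len(acc)), key=lambda i: abs(acc[i]), reverse=True)[:k])
    sparse = [acc[i] if i in keep else 0.0 for i in range(len(acc))]
    new_residual = [0.0 if i in keep else acc[i] for i in range(len(acc))]
    return sparse, new_residual


residual = [0.0, 0.0, 0.0]
sent1, residual = top_k_with_residual([1.0, 0.4, 0.3], residual, k=1)
sent2, residual = top_k_with_residual([0.1, 0.4, 0.3], residual, k=1)
```

In the second call the index that was skipped in the first iteration wins the selection, because its accumulated value now exceeds the fresh gradient at index 0.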
APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm
Adam is an important optimization algorithm for guaranteeing efficiency and
accuracy when training on many important tasks such as BERT and ImageNet. However,
Adam is generally not compatible with information (gradient) compression
technology. Therefore, the communication usually becomes the bottleneck for
parallelizing Adam. In this paper, we propose a communication efficient
ADAM preconditioned Momentum SGD algorithm, named APMSqueeze,
which compresses gradients through an error compensated method. The proposed
algorithm achieves a similar convergence efficiency to Adam in terms of epochs,
but significantly reduces the running time per epoch. In terms of end-to-end
performance (including the full-precision pre-condition step), APMSqueeze is
sometimes able to provide a speed-up, depending on network
bandwidth. We also conduct theoretical analysis on the convergence and
efficiency.
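Error-compensated compression, which the abstract above relies on, can be sketched in a few lines. The abstract does not specify the compressor, so this example assumes a sign compressor scaled by the mean absolute value. That choice, and the function names, are illustrative assumptions rather than the APMSqueeze implementation.

```python
from typing import List, Tuple


def sign_compress(v: List[float]) -> List[float]:
    """Assumed compressor: one sign per entry plus a shared scale."""
    scale = sum(abs(x) for x in v) / len(v)
    return [scale if x >= 0 else -scale for x in v]


def compress_with_error_feedback(
    v: List[float], error: List[float]
) -> Tuple[List[float], List[float]]:
    """Compress (v + previous error) and remember what compression lost."""
    corrected = [a + e for a, e in zip(v, error)]
    compressed = sign_compress(corrected)
    new_error = [c - q for c, q in zip(corrected, compressed)]
    return compressed, new_error


compressed, error = compress_with_error_feedback([0.5, -1.5, 1.0], [0.0, 0.0, 0.0])
```

The returned `error` is added to the next vector before compression, so information lost in one step is re-injected in later steps instead of being discarded.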
Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees
To reduce the long training time of large deep neural network (DNN) models,
distributed synchronous stochastic gradient descent (S-SGD) is commonly used on
a cluster of workers. However, the speedup brought by multiple workers is
limited by the communication overhead. Two approaches, namely pipelining and
gradient sparsification, have been separately proposed to alleviate the impact
of communication overheads. Yet, the gradient sparsification methods can only
initiate the communication after the backpropagation, and hence miss the
pipelining opportunity. In this paper, we propose a new distributed
optimization method named LAGS-SGD, which combines S-SGD with a novel
layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every
worker selects a small set of "significant" gradients from each layer
independently whose size can be adaptive to the communication-to-computation
ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of
overlapping communications with computations, while the adaptive nature of
LAGS-SGD makes it flexible to control the communication time. We prove that
LAGS-SGD has convergence guarantees and it has the same order of convergence
rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments
are conducted to verify the analytical assumption and the convergence
performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that
LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD without
losing obvious model accuracy.
Comment: 8 pages. To appear at ECAI 202
An Incentive-Based Mechanism for Volunteer Computing using Blockchain
The rise of fast communication media both at the core and at the edge has
resulted in unprecedented numbers of sophisticated and intelligent wireless IoT
devices. Tactile Internet has enabled the interaction between humans and
machines within their environment to achieve revolutionized solutions both on
the move and in real-time. Many applications such as intelligent autonomous
self-driving, smart agriculture and industrial solutions, and self-learning
multimedia content filtering and sharing have become attainable through
cooperative, distributed and decentralized systems, namely, volunteer
computing. This article introduces a blockchain-enabled resource sharing and
service composition solution through volunteer computing. Device resource,
computing, and intelligence capabilities are advertised in the environment to
be made discoverable and available for sharing with the aid of blockchain
technology. Incentives in the form of on-demand service availability are given
to resource and service providers to ensure fair and balanced cooperative
resource usage. Blockchains are formed whenever a service request is initiated
with the aid of fog and mobile edge computing (MEC) devices to ensure secure
communication and service delivery for the participants. Using both volunteer
computing techniques and tactile internet architectures, we devise a fast and
reliable service provisioning framework that relies on a reinforcement learning
technique. Simulation results show that the proposed solution can achieve high
reward distribution, increased number of blockchain formations, reduced delays,
and balanced resource usage among participants, under the premise of high IoT
device availability.
Comment: 22 pages, 12 Figures, 1 Table. Accepted. ACM Transactions on Internet
Technology
1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed
To train large models (like BERT and GPT-3) with hundreds or even thousands
of GPUs, the communication has become a major bottleneck, especially on
commodity systems with limited-bandwidth TCP interconnect networks. On one side,
large-batch optimization such as the LAMB algorithm was proposed to reduce the
number of communications. On the other side, communication compression
algorithms such as 1-bit SGD and 1-bit Adam help to reduce the volume of each
communication. However, we find that simply using one of the techniques is not
sufficient to solve the communication challenge, especially on low-bandwidth
Ethernet networks. Motivated by this we aim to combine the power of large-batch
optimization and communication compression, but we find that existing
compression strategies cannot be directly applied to LAMB due to its unique
adaptive layerwise learning rates. To this end, we design a new
communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to
support adaptive layerwise learning rates even when communication is
compressed. In addition, we introduce a new system implementation for
compressed communication using the NCCL backend of PyTorch distributed, which
improves both usability and performance compared to the existing MPI-based
implementation. For the BERT-Large pre-training task with batch sizes from 8K to
64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with
NCCL-based backend is able to achieve up to 4.6x communication volume
reduction, up to 2.8x end-to-end speedup (in terms of number of training
samples per second), and the same convergence speed (in terms of number of
pre-training samples to reach the same accuracy on fine-tuning tasks) compared
to uncompressed LAMB.
Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach
Federated learning (FL) is an emerging technique for training machine
learning models using geographically dispersed data collected by local
entities. It includes local computation and synchronization steps. To reduce
the communication overhead and improve the overall efficiency of FL, gradient
sparsification (GS) can be applied, where instead of the full gradient, only a
small subset of important elements of the gradient is communicated. Existing
work on GS uses a fixed degree of gradient sparsity for i.i.d.-distributed data
within a datacenter. In this paper, we consider an adaptive degree of sparsity and
non-i.i.d. local datasets. We first present a fairness-aware GS method which
ensures that different clients provide a similar amount of updates. Then, with
the goal of minimizing the overall training time, we propose a novel online
learning formulation and algorithm for automatically determining the
near-optimal communication and computation trade-off that is controlled by the
degree of gradient sparsity. The online learning algorithm uses an estimated
sign of the derivative of the objective function, which gives a regret bound
that is asymptotically equal to the case where the exact derivative is available.
Experiments with real datasets confirm the benefits of our proposed approaches,
showing an improvement in model accuracy for a finite training time.
Comment: Accepted at IEEE ICDCS 202
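The abstract above describes an online learning algorithm that adjusts the degree of sparsity using only an estimated sign of a derivative. A rough sketch of that idea follows, assuming a projected sign-descent update with a constant step. The objective, the sign estimator `toy_sign`, the step size, and the bounds are all hypothetical placeholders, not the paper's formulation.

```python
from typing import Callable


def clip(value: float, low: float, high: float) -> float:
    """Project a value onto the interval [low, high]."""
    return max(low, min(high, value))


def sign_based_search(
    k0: float,
    step: float,
    k_min: float,
    k_max: float,
    estimate_sign: Callable[[float], int],
    rounds: int,
) -> float:
    """Move k against the estimated derivative sign, staying within bounds."""
    k = k0
    for _ in range(rounds):
        k = clip(k - step * estimate_sign(k), k_min, k_max)
    return k


def toy_sign(k: float) -> int:
    # Hypothetical objective (k - 30)^2, whose derivative has the sign of (k - 30).
    return (k > 30) - (k < 30)


k_final = sign_based_search(10.0, 5.0, 1.0, 100.0, toy_sign, rounds=10)
```

Only the sign of the derivative is used, so the update does not need the derivative's magnitude, which matches the kind of information the abstract says is available.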
A flexible framework for communication-efficient machine learning: from HPC to IoT
With the increasing scale of machine learning tasks, it has become essential
to reduce the communication between computing nodes. Early work on gradient
compression focused on the bottleneck between CPUs and GPUs, but
communication-efficiency is now needed in a variety of different system
architectures, from high-performance clusters to energy-constrained IoT
devices. In the current practice, compression levels are typically chosen
before training and settings that work well for one task may be vastly
suboptimal for another dataset on another architecture. In this paper, we
propose a flexible framework which adapts the compression level to the true
gradient at each iteration, maximizing the improvement in the objective
function that is achieved per communicated bit. Our framework is easy to adapt
from one technology to the next by modeling how the communication cost depends
on the compression level for the specific technology. Theoretical results and
practical experiments indicate that the automatic tuning strategies
significantly increase communication efficiency on several state-of-the-art
compression schemes.
Comment: 27 pages, 11 figures, 1 table
Communication-Efficient Decentralized Learning with Sparsification and Adaptive Peer Selection
Distributed learning techniques such as federated learning have enabled
multiple workers to train machine learning models together to reduce the
overall training time. However, current distributed training algorithms
(centralized or decentralized) suffer from the communication bottleneck on
multiple low-bandwidth workers (also on the server under the centralized
architecture). Although decentralized algorithms generally have lower
communication complexity than the centralized counterpart, they still suffer
from the communication bottleneck for workers with low network bandwidth. To
deal with the communication problem while being able to preserve the
convergence performance, we introduce a novel decentralized training algorithm
with the following key features: 1) It does not require a parameter server to
maintain the model during training, which avoids the communication pressure on
any single peer. 2) Each worker only needs to communicate with a single peer at
each communication round with a highly compressed model, which can
significantly reduce the communication traffic on the worker. We theoretically
prove that our sparsification algorithm still preserves convergence properties.
3) Each worker dynamically selects its peer at different communication rounds
to better utilize the bandwidth resources. We conduct experiments with
convolutional neural networks on 32 workers to verify the effectiveness of our
proposed algorithm compared to seven existing methods. Experimental results
show that our algorithm significantly reduces the communication traffic and
generally selects relatively high bandwidth peers.