
15 research outputs found

A Distributed Synchronous SGD Algorithm with Global Top-$k$ Sparsification for Low Bandwidth Networks

Author: Chu Xiaowen
Huang Xiang
Shi Shaohuai
Tang Zhenheng
Wang Qiang
Wang Yuxin
Zhao Kaiyong
Publication date: 17/04/2019

Distributed synchronous stochastic gradient descent (S-SGD) has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-$k$ sparsification techniques have been proposed to reduce the volume of data to be exchanged among workers. Top-$k$ sparsification can zero-out a significant portion of gradients without impacting the model convergence. However, the sparse gradients should be transferred with their irregular indices, which makes the sparse gradients aggregation difficult. Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of $O(kP)$, where $P$ is the number of workers, which is inefficient on low bandwidth networks with a large number of workers. We observe that not all top-$k$ gradients from $P$ workers are needed for the model update, and therefore we propose a novel global Top-$k$ (gTop-$k$) sparsification mechanism to address the problem. Specifically, we choose global top-$k$ largest absolute values of gradients from $P$ workers, instead of accumulating all local top-$k$ gradients to update the model in each iteration. The gradient aggregation method based on gTop-$k$ sparsification reduces the communication complexity from $O(kP)$ to $O(k\log P)$. Through extensive experiments on different DNNs, we verify that gTop-$k$ S-SGD has nearly consistent convergence performance with S-SGD, and it has only slight degradations on generalization performance. In terms of scaling efficiency, we evaluate gTop-$k$ on a cluster with 32 GPU machines which are interconnected with 1 Gbps Ethernet. The experimental results show that our method achieves $2.7$-$12\times$ higher scaling efficiency than S-SGD and $1.1$-$1.7\times$ improvement than the existing Top-$k$ S-SGD.

Comment: 10 pages. Add discussion with more experimental results. To appear at the ICDCS2019 worksho
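The aggregation idea in the abstract above can be illustrated with a small single-process simulation. This is only a sketch, not the authors' implementation: it assumes a binary-tree style pairwise merge among simulated workers (worker count a power of two), where each merge sums two sparse gradients and re-selects the k largest magnitudes, so every message carries at most k entries and the tree has log2(P) rounds. The function names `local_topk`, `merge_topk` and `gtopk_reduce` are hypothetical.

```python
import numpy as np

def local_topk(grad, k):
    """Return {index: value} for the k largest-magnitude entries of grad."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    return {int(i): float(grad[i]) for i in idx}

def merge_topk(a, b, k):
    """Sum two sparse gradients, then keep only the k largest magnitudes."""
    merged = dict(a)
    for i, v in b.items():
        merged[i] = merged.get(i, 0.0) + v
    top = sorted(merged.items(), key=lambda kv: abs(kv[1]), reverse=True)[:k]
    return dict(top)

def gtopk_reduce(sparse_grads, k):
    """Pairwise tree reduction (assumes a power-of-two number of workers).

    Returns the final sparse gradient and the number of merge rounds.
    """
    level = list(sparse_grads)
    rounds = 0
    while len(level) > 1:
        level = [merge_topk(level[i], level[i + 1], k)
                 for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

# Simulated example: 8 workers, 100-dimensional gradients, k = 5.
rng = np.random.default_rng(0)
workers = [local_topk(rng.normal(size=100), 5) for _ in range(8)]
result, rounds = gtopk_reduce(workers, 5)
```

Because entries are discarded at every merge, the result of such a pairwise scheme may differ from the exact top-k of the fully summed gradient; the sketch only shows why the message size stays at k per round.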

arXiv.org e-Print Archive

FEDZIP: A Compression Framework for Communication-Efficient Federated Learning

Author: Alizadeh-Shabdiz Farshid
Fadaeieslam Mohammad Javad
Homayounfar Morteza
Malekijoo Amirhossein
Malekijou Hanieh
Rawassizadeh Reza
Publication date: 02/02/2021

Federated Learning marks a turning point in the implementation of decentralized machine learning (especially deep learning) for wireless devices by protecting users' privacy and safeguarding raw data from third-party access. It assigns the learning process independently to each client. First, clients locally train a machine learning model based on local data. Next, clients transfer local updates of model weights and biases (training data) to a server. Then, the server aggregates updates (received from clients) to create a global learning model. However, the continuous transfer between clients and the server increases communication costs and is inefficient from a resource utilization perspective due to the large number of parameters (weights and biases) used by deep learning models. The cost of communication becomes a greater concern when the number of contributing clients and communication rounds increases. In this work, we propose a novel framework, FedZip, that significantly decreases the size of updates while transferring weights from the deep learning model between clients and their servers. FedZip implements Top-z sparsification, uses quantization with clustering, and implements compression with three different encoding methods. FedZip outperforms state-of-the-art compression frameworks and reaches compression rates up to 1085x, and preserves up to 99% of bandwidth and 99% of energy for clients during communication
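The first two stages named in the abstract above (Top-z sparsification followed by clustering-based quantization) can be sketched as follows. This is an illustrative approximation under stated assumptions, not FedZip's code: the clustering is a plain 1-D Lloyd's k-means with three centroids, the encoding stage is omitted, and the helper names `top_z_sparsify` and `cluster_quantize` are hypothetical.

```python
import numpy as np

def top_z_sparsify(update, z):
    """Keep the z largest-magnitude entries of a flat update; zero the rest."""
    out = np.zeros_like(update)
    idx = np.argpartition(np.abs(update), -z)[-z:]
    out[idx] = update[idx]
    return out

def cluster_quantize(values, n_clusters=3, n_iter=20):
    """Replace each value by its nearest centroid (1-D Lloyd's k-means)."""
    centroids = np.linspace(values.min(), values.max(), n_clusters)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(n_clusters):
            if np.any(labels == c):          # skip empty clusters
                centroids[c] = values[labels == c].mean()
    labels = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids[labels]

# Example: sparsify a simulated update, then quantize the surviving values.
rng = np.random.default_rng(0)
update = rng.normal(size=200)
sparse = top_z_sparsify(update, 20)
quantized = cluster_quantize(sparse[sparse != 0])
```

After these two steps the update holds only a handful of distinct values at a few positions, which is what makes a subsequent entropy-style encoding effective.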

arXiv.org e-Print Archive

Understanding Top-k Sparsification in Distributed Deep Learning

Author: Cheung Ka Chun
Chu Xiaowen
See Simon
Shi Shaohuai
Publication date: 20/11/2019

Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-$k$ sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of Top-$k$ operator in gradient sparsification and use relaxed bounds (e.g., exact bound of Random-$k$) for analysis; hence the derived results cannot well describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through extensive experiments. We then theoretically derive a tighter bound for the Top-$k$ operator. Finally, we exploit the property of gradient distribution to propose an approximate top-$k$ selection algorithm, which is computing-efficient for GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead. Codes are available at: https://github.com/hclhkbu/GaussianK-SGD.

Comment: 14 page
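One way to picture an approximate top-k selection that exploits the gradient distribution is a threshold derived from a fitted distribution instead of a sort. The sketch below is an assumption-laden illustration, not the code from the linked repository: it assumes the gradient entries are roughly Gaussian, estimates the mean and standard deviation, and picks the two-sided threshold whose expected tail mass is k/d. The function name `gaussian_threshold_select` is hypothetical.

```python
from statistics import NormalDist
import numpy as np

def gaussian_threshold_select(grad, k):
    """Select roughly k entries of a flat gradient without sorting.

    Assumes entries are approximately Gaussian. The number of selected
    entries is only approximately k.
    """
    d = grad.size
    mu = float(grad.mean())
    sigma = float(grad.std())
    # Two-sided tail: P(|g - mu| > z * sigma) = k / d  =>  z = Phi^{-1}(1 - k/(2d))
    z = NormalDist().inv_cdf(1.0 - k / (2.0 * d))
    idx = np.flatnonzero(np.abs(grad - mu) > z * sigma)
    return idx, grad[idx]

# Example on simulated near-zero-mean gradients.
rng = np.random.default_rng(0)
grad = rng.normal(0.0, 0.01, size=100_000)
idx, vals = gaussian_threshold_select(grad, 1000)
```

The selection costs one pass for the statistics and one pass for the comparison, which is the kind of operation that maps well to GPUs; the price is that the selected count can deviate from k when the real distribution is not Gaussian.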

arXiv.org e-Print Archive

APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

Author: Gan Shaoduo
He Yuxiong
Lian Xiangru
Liu Ji
Rajbhandari Samyam
Tang Hanlin
Zhang Ce
Publication date: 27/08/2020

Adam is an important optimization algorithm for guaranteeing efficiency and accuracy when training many important tasks such as BERT and ImageNet. However, Adam is generally not compatible with information (gradient) compression technology. Therefore, the communication usually becomes the bottleneck for parallelizing Adam. In this paper, we propose a communication-efficient Adam-preconditioned Momentum SGD algorithm, named APMSqueeze, through an error-compensated method for compressing gradients. The proposed algorithm achieves a similar convergence efficiency to Adam in terms of epochs, but significantly reduces the running time per epoch. In terms of end-to-end performance (including the full-precision pre-condition step), APMSqueeze is sometimes able to provide a speed-up of up to $2$-$10\times$, depending on network bandwidth. We also conduct theoretical analysis on the convergence and efficiency
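The error-compensation idea mentioned in the abstract above can be shown in isolation. The sketch below is a generic error-feedback loop, not the APMSqueeze algorithm itself: the compressor is assumed to be a scaled sign operator, and the class name `ErrorCompensatedCompressor` is hypothetical. What it shows is that whatever the compressor drops in one step is added back before compressing in the next step.

```python
import numpy as np

def sign_compress(x):
    """Assumed compressor: keep only signs, scaled by the mean magnitude."""
    return np.mean(np.abs(x)) * np.sign(x)

class ErrorCompensatedCompressor:
    """Carries the compression residual forward to the next step."""

    def __init__(self, dim):
        self.error = np.zeros(dim)

    def step(self, grad):
        corrected = grad + self.error        # add back what was lost before
        compressed = sign_compress(corrected)
        self.error = corrected - compressed  # remember what is lost now
        return compressed

# Example: the transmitted values plus the final residual equal the true sum.
rng = np.random.default_rng(0)
comp = ErrorCompensatedCompressor(16)
grads = [rng.normal(size=16) for _ in range(10)]
sent = sum(comp.step(g) for g in grads)
```

The telescoping identity `sent + comp.error == sum(grads)` is the reason error compensation keeps the accumulated update close to the uncompressed one.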

arXiv.org e-Print Archive

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

Author: Chu Xiaowen
Shi Shaohuai
Tang Zhenheng
Wang Qiang
Zhao Kaiyong
Publication date: 01/03/2020

To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, the gradient sparsification methods can only initiate the communication after the backpropagation, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker independently selects a small set of "significant" gradients from each layer, and the size of this set can adapt to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of overlapping communications with computations, while the adaptive nature of LAGS-SGD makes it flexible to control the communication time. We prove that LAGS-SGD has convergence guarantees and that it has the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD without an obvious loss of model accuracy.

Comment: 8 pages. To appear at ECAI 202
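The layer-wise selection described in this abstract can be sketched as a per-layer top-k with a separate density for each layer. This is a minimal illustration under assumptions: the abstract does not say how the per-layer set size is derived from the communication-to-computation ratio, so the `densities` mapping is taken as a given (hypothetical) input, and the function name `layerwise_topk` is also hypothetical.

```python
import numpy as np

def layerwise_topk(layer_grads, densities):
    """Select the largest-magnitude gradients of each layer independently.

    layer_grads: {layer_name: ndarray}
    densities:   {layer_name: fraction of entries to keep} (assumed input)
    Returns {layer_name: (flat_indices, values)}.
    """
    selected = {}
    for name, grad in layer_grads.items():
        flat = grad.ravel()
        k = max(1, int(np.ceil(densities[name] * flat.size)))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        selected[name] = (idx, flat[idx])
    return selected

# Example with two small layers and different densities.
layers = {
    "conv": np.array([[1.0, -5.0], [3.0, 0.5]]),
    "fc": np.array([0.1, -0.2, 0.3, 4.0]),
}
picked = layerwise_topk(layers, {"conv": 0.5, "fc": 0.25})
```

Because each layer is handled on its own, a layer's selected gradients could in principle be sent as soon as that layer's backward pass finishes, which is the overlap opportunity the abstract refers to.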

arXiv.org e-Print Archive

An Incentive-Based Mechanism for Volunteer Computing using Blockchain

Author: Aloqaily Moayad
Jararweh Yaser
Ridhawi Ismaeel Al
Publication date: 24/09/2020

The rise of fast communication media both at the core and at the edge has resulted in unprecedented numbers of sophisticated and intelligent wireless IoT devices. Tactile Internet has enabled the interaction between humans and machines within their environment to achieve revolutionized solutions both on the move and in real-time. Many applications such as intelligent autonomous self-driving, smart agriculture and industrial solutions, and self-learning multimedia content filtering and sharing have become attainable through cooperative, distributed and decentralized systems, namely, volunteer computing. This article introduces a blockchain-enabled resource sharing and service composition solution through volunteer computing. Device resource, computing, and intelligence capabilities are advertised in the environment to be made discoverable and available for sharing with the aid of blockchain technology. Incentives in the form of on-demand service availability are given to resource and service providers to ensure fair and balanced cooperative resource usage. Blockchains are formed whenever a service request is initiated with the aid of fog and mobile edge computing (MEC) devices to ensure secure communication and service delivery for the participants. Using both volunteer computing techniques and tactile internet architectures, we devise a fast and reliable service provisioning framework that relies on a reinforcement learning technique. Simulation results show that the proposed solution can achieve high reward distribution, increased number of blockchain formations, reduced delays, and balanced resource usage among participants, under the premise of high IoT device availability.

Comment: 22 pages, 12 Figures, 1 Table. Accepted. ACM Transaction On Internet Technolog

arXiv.org e-Print Archive

1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

Author: Awan Ammar Ahmad
He Yuxiong
Li Conglong
Rajbhandari Samyam
Tang Hanlin
Publication date: 13/04/2021

To train large models (like BERT and GPT-3) with hundreds or even thousands of GPUs, communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP interconnect networks. On one side, large-batch optimization such as the LAMB algorithm was proposed to reduce the number of communications. On the other side, communication compression algorithms such as 1-bit SGD and 1-bit Adam help to reduce the volume of each communication. However, we find that simply using one of the techniques is not sufficient to solve the communication challenge, especially on low-bandwidth Ethernet networks. Motivated by this, we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates even when communication is compressed. In addition, we introduce a new system implementation for compressed communication using the NCCL backend of PyTorch distributed, which improves both usability and performance compared to the existing MPI-based implementation. For the BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with the NCCL-based backend is able to achieve up to 4.6x communication volume reduction, up to 2.8x end-to-end speedup (in terms of number of training samples per second), and the same convergence speed (in terms of number of pre-training samples to reach the same accuracy on fine-tuning tasks) compared to uncompressed LAMB

arXiv.org e-Print Archive

Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach

Author: Han Pengchao
Leung Kin K.
Wang Shiqiang
Publication date: 20/03/2020

Federated learning (FL) is an emerging technique for training machine learning models using geographically dispersed data collected by local entities. It includes local computation and synchronization steps. To reduce the communication overhead and improve the overall efficiency of FL, gradient sparsification (GS) can be applied, where instead of the full gradient, only a small subset of important elements of the gradient is communicated. Existing work on GS uses a fixed degree of gradient sparsity for i.i.d.-distributed data within a datacenter. In this paper, we consider an adaptive degree of sparsity and non-i.i.d. local datasets. We first present a fairness-aware GS method which ensures that different clients provide a similar amount of updates. Then, with the goal of minimizing the overall training time, we propose a novel online learning formulation and algorithm for automatically determining the near-optimal communication and computation trade-off that is controlled by the degree of gradient sparsity. The online learning algorithm uses an estimated sign of the derivative of the objective function, which gives a regret bound that is asymptotically equal to the case where the exact derivative is available. Experiments with real datasets confirm the benefits of our proposed approaches, showing up to $40\%$ improvement in model accuracy for a finite training time.

Comment: Accepted at IEEE ICDCS 202

arXiv.org e-Print Archive

A flexible framework for communication-efficient machine learning: from HPC to IoT

Author: Aytekin Arda
Johansson Mikael
Khirirat Sarit
Magnússon Sindri
Publication date: 17/06/2020

With the increasing scale of machine learning tasks, it has become essential to reduce the communication between computing nodes. Early work on gradient compression focused on the bottleneck between CPUs and GPUs, but communication-efficiency is now needed in a variety of different system architectures, from high-performance clusters to energy-constrained IoT devices. In the current practice, compression levels are typically chosen before training and settings that work well for one task may be vastly suboptimal for another dataset on another architecture. In this paper, we propose a flexible framework which adapts the compression level to the true gradient at each iteration, maximizing the improvement in the objective function that is achieved per communicated bit. Our framework is easy to adapt from one technology to the next by modeling how the communication cost depends on the compression level for the specific technology. Theoretical results and practical experiments indicate that the automatic tuning strategies significantly increase communication efficiency on several state-of-the-art compression schemes.

Comment: 27 pages, 11 figures, 1 tabl
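The "improvement per communicated bit" criterion in this abstract can be made concrete for one compressor. The sketch below is an assumption-based illustration, not the paper's framework: it takes top-k sparsification as the compressor, uses the squared-gradient energy kept by the top k entries as a stand-in for the objective improvement, and accepts any user-supplied cost model. The names `best_sparsity_level` and `cost_fn` are hypothetical.

```python
import numpy as np

def best_sparsity_level(grad, cost_fn):
    """Pick k maximizing (energy kept by top-k) / (communication cost of k).

    grad:    flat gradient vector
    cost_fn: assumed cost model, mapping k to a positive cost (e.g. bits)
    """
    sq = np.sort(grad ** 2)[::-1]       # squared entries, largest first
    gain = np.cumsum(sq)                # gain[k-1] = energy kept by top-k
    ks = np.arange(1, grad.size + 1)
    costs = np.array([cost_fn(k) for k in ks], dtype=float)
    return int(ks[np.argmax(gain / costs)])

# Example: a fixed per-message overhead plus a per-entry cost (made-up numbers).
grad = np.array([10.0, 1.0, 0.1, 0.1])
k_star = best_sparsity_level(grad, lambda k: 64 + 40 * k)
```

With a gradient dominated by one entry, the chosen k is small; with a flat gradient and a large fixed overhead, sending more entries per message becomes the better trade-off. The cost function is where a specific technology (HPC interconnect, wireless IoT link) would be modeled.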

arXiv.org e-Print Archive

Communication-Efficient Decentralized Learning with Sparsification and Adaptive Peer Selection

Author: Chu Xiaowen
Shi Shaohuai
Tang Zhenheng
Publication date: 22/02/2020

Distributed learning techniques such as federated learning have enabled multiple workers to train machine learning models together to reduce the overall training time. However, current distributed training algorithms (centralized or decentralized) suffer from the communication bottleneck on multiple low-bandwidth workers (also on the server under the centralized architecture). Although decentralized algorithms generally have lower communication complexity than their centralized counterparts, they still suffer from the communication bottleneck for workers with low network bandwidth. To deal with the communication problem while preserving the convergence performance, we introduce a novel decentralized training algorithm with the following key features: 1) It does not require a parameter server to maintain the model during training, which avoids the communication pressure on any single peer. 2) Each worker only needs to communicate with a single peer at each communication round with a highly compressed model, which can significantly reduce the communication traffic on the worker. We theoretically prove that our sparsification algorithm still preserves convergence properties. 3) Each worker dynamically selects its peer at different communication rounds to better utilize the bandwidth resources. We conduct experiments with convolutional neural networks on 32 workers to verify the effectiveness of our proposed algorithm compared to seven existing methods. Experimental results show that our algorithm significantly reduces the communication traffic and generally selects relatively high-bandwidth peers

arXiv.org e-Print Archive