15 research outputs found

    A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks

    Distributed synchronous stochastic gradient descent (S-SGD) has been widely used in training large-scale deep neural networks (DNNs), but it typically requires very high communication bandwidth between computational workers (e.g., GPUs) to exchange gradients iteratively. Recently, Top-k sparsification techniques have been proposed to reduce the volume of data to be exchanged among workers. Top-k sparsification can zero-out a significant portion of gradients without impacting the model convergence. However, the sparse gradients should be transferred with their irregular indices, which makes the sparse gradients aggregation difficult. Current methods that use AllGather to accumulate the sparse gradients have a communication complexity of O(kP), where P is the number of workers, which is inefficient on low bandwidth networks with a large number of workers. We observe that not all top-k gradients from P workers are needed for the model update, and therefore we propose a novel global Top-k (gTop-k) sparsification mechanism to address the problem. Specifically, we choose global top-k largest absolute values of gradients from P workers, instead of accumulating all local top-k gradients to update the model in each iteration. The gradient aggregation method based on gTop-k sparsification reduces the communication complexity from O(kP) to O(k log P). Through extensive experiments on different DNNs, we verify that gTop-k S-SGD has nearly consistent convergence performance with S-SGD, and it has only slight degradations on generalization performance. In terms of scaling efficiency, we evaluate gTop-k on a cluster with 32 GPU machines which are interconnected with 1 Gbps Ethernet. The experimental results show that our method achieves 2.7-12× higher scaling efficiency than S-SGD and 1.1-1.7× improvement than the existing Top-k S-SGD. Comment: 10 pages. Add discussion with more experimental results.
To appear at the ICDCS2019 worksho
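
The O(k log P) aggregation described in the abstract above can be pictured as a pairwise tree reduction in which every merge keeps only k entries. The following is a minimal single-process sketch under that assumption; the function names (local_top_k, merge_top_k, gtop_k) are hypothetical, residual accumulation is omitted, and this is not the authors' implementation.

```python
import heapq
import random


def local_top_k(grad, k):
    """Keep the k largest-magnitude entries of a dense gradient as {index: value}."""
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    return {i: grad[i] for i in idx}


def merge_top_k(a, b, k):
    """Add two sparse gradients, then keep only the k largest-magnitude sums."""
    summed = dict(a)
    for i, v in b.items():
        summed[i] = summed.get(i, 0.0) + v
    keep = heapq.nlargest(k, summed, key=lambda i: abs(summed[i]))
    return {i: summed[i] for i in keep}


def gtop_k(sparse_grads, k):
    """Simulated tree reduction: about log2(P) rounds, each moving at most k entries."""
    level = list(sparse_grads)
    rounds = 0
    while len(level) > 1:
        nxt = [merge_top_k(level[i], level[i + 1], k)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:  # an unpaired worker passes its data through unchanged
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


if __name__ == "__main__":
    random.seed(0)
    P, k, dim = 8, 4, 64  # illustrative sizes, not taken from the paper
    workers = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(P)]
    result, rounds = gtop_k([local_top_k(g, k) for g in workers], k)
    print(sorted(result), rounds)
```

Because each merge truncates to k entries, the data sent per round stays at k values plus indices, whereas an AllGather of all local top-k sets grows with P.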

    FEDZIP: A Compression Framework for Communication-Efficient Federated Learning

    Federated Learning marks a turning point in the implementation of decentralized machine learning (especially deep learning) for wireless devices by protecting users' privacy and safeguarding raw data from third-party access. It assigns the learning process independently to each client. First, clients locally train a machine learning model based on local data. Next, clients transfer local updates of model weights and biases (training data) to a server. Then, the server aggregates updates (received from clients) to create a global learning model. However, the continuous transfer between clients and the server increases communication costs and is inefficient from a resource utilization perspective due to the large number of parameters (weights and biases) used by deep learning models. The cost of communication becomes a greater concern when the number of contributing clients and communication rounds increases. In this work, we propose a novel framework, FedZip, that significantly decreases the size of updates while transferring weights from the deep learning model between clients and their servers. FedZip implements Top-z sparsification, uses quantization with clustering, and implements compression with three different encoding methods. FedZip outperforms state-of-the-art compression frameworks and reaches compression rates up to 1085x, and preserves up to 99% of bandwidth and 99% of energy for clients during communication

    Understanding Top-k Sparsification in Distributed Deep Learning

    Distributed stochastic gradient descent (SGD) algorithms are widely deployed in training large-scale deep learning models, while the communication overhead among workers becomes the new system bottleneck. Recently proposed gradient sparsification techniques, especially Top-k sparsification with error compensation (TopK-SGD), can significantly reduce the communication traffic without an obvious impact on the model accuracy. Some theoretical studies have been carried out to analyze the convergence property of TopK-SGD. However, existing studies do not dive into the details of Top-k operator in gradient sparsification and use relaxed bounds (e.g., exact bound of Random-k) for analysis; hence the derived results cannot well describe the real convergence performance of TopK-SGD. To this end, we first study the gradient distributions of TopK-SGD during the training process through extensive experiments. We then theoretically derive a tighter bound for the Top-k operator. Finally, we exploit the property of gradient distribution to propose an approximate top-k selection algorithm, which is computing-efficient for GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead. Codes are available at: https://github.com/hclhkbu/GaussianK-SGD. Comment: 14 page
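
One way to read "approximate top-k selection" from a gradient distribution is to replace the exact sort with a threshold estimated from the mean and standard deviation. The sketch below assumes the gradient entries are roughly Gaussian (an assumption made here for illustration; the abstract does not state the distribution) and uses a hypothetical function name, approx_top_k. It is not the code from the linked repository.

```python
import random
from statistics import NormalDist, fmean, pstdev


def approx_top_k(grad, k):
    """Select roughly k large-magnitude entries without sorting.

    Assumes grad is approximately Gaussian: the two-sided tail that holds
    a k/d fraction of the mass starts z standard deviations from the mean.
    """
    d = len(grad)
    mu, sigma = fmean(grad), pstdev(grad)
    z = NormalDist().inv_cdf(1.0 - k / (2.0 * d))  # requires 0 < k < 2d
    threshold = z * sigma
    return {i: g for i, g in enumerate(grad) if abs(g - mu) > threshold}


if __name__ == "__main__":
    random.seed(0)
    grad = [random.gauss(0.0, 0.01) for _ in range(20000)]
    selected = approx_top_k(grad, 200)
    print(len(selected))  # close to 200, but not exactly 200 in general
```

The selection is a single pass over the gradient, which is the kind of operation that maps well to GPUs; the cost is that the number of selected entries is only approximately k.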

    APMSqueeze: A Communication Efficient Adam-Preconditioned Momentum SGD Algorithm

    Adam is an important optimization algorithm for guaranteeing efficiency and accuracy when training many important tasks such as BERT and ImageNet. However, Adam is generally not compatible with information (gradient) compression technology. Therefore, the communication usually becomes the bottleneck for parallelizing Adam. In this paper, we propose a communication-efficient Adam-preconditioned momentum SGD algorithm, named APMSqueeze, which compresses gradients through an error-compensated method. The proposed algorithm achieves a similar convergence efficiency to Adam in terms of epochs, but significantly reduces the running time per epoch. In terms of end-to-end performance (including the full-precision pre-condition step), APMSqueeze is able to provide a speed-up of sometimes up to 2-10×, depending on network bandwidth. We also conduct theoretical analysis on the convergence and efficiency
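
Error-compensated compression, as mentioned in the abstract above, generally means adding the previous round's compression error back before compressing again. The sketch below shows that generic mechanism with a scaled 1-bit sign compressor; the compressor choice and the names sign_compress and ErrorFeedback are assumptions for illustration, and the Adam preconditioning and momentum parts of APMSqueeze are not modeled.

```python
def sign_compress(vec):
    """1-bit style compressor: keep only each sign, scaled by the mean magnitude."""
    scale = sum(abs(x) for x in vec) / len(vec)
    return [scale if x >= 0 else -scale for x in vec]


class ErrorFeedback:
    """Carries the compression error of one step into the next step."""

    def __init__(self, dim):
        self.residual = [0.0] * dim

    def step(self, grad):
        # Add back what earlier rounds failed to transmit.
        corrected = [g + r for g, r in zip(grad, self.residual)]
        sent = sign_compress(corrected)
        # Remember what this round failed to transmit.
        self.residual = [c - s for c, s in zip(corrected, sent)]
        return sent


if __name__ == "__main__":
    ef = ErrorFeedback(3)
    for grad in ([0.5, -1.0, 0.25], [0.1, 0.2, -0.3]):
        print(ef.step(grad), ef.residual)
```

By construction, the sum of everything sent plus the current residual equals the sum of the true gradients, so no gradient information is permanently discarded, only delayed.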

    Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

    To reduce the long training time of large deep neural network (DNN) models, distributed synchronous stochastic gradient descent (S-SGD) is commonly used on a cluster of workers. However, the speedup brought by multiple workers is limited by the communication overhead. Two approaches, namely pipelining and gradient sparsification, have been separately proposed to alleviate the impact of communication overheads. Yet, the gradient sparsification methods can only initiate the communication after the backpropagation, and hence miss the pipelining opportunity. In this paper, we propose a new distributed optimization method named LAGS-SGD, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme. In LAGS-SGD, every worker selects a small set of "significant" gradients from each layer independently whose size can be adaptive to the communication-to-computation ratio of that layer. The layer-wise nature of LAGS-SGD opens the opportunity of overlapping communications with computations, while the adaptive nature of LAGS-SGD makes it flexible to control the communication time. We prove that LAGS-SGD has convergence guarantees and it has the same order of convergence rate as vanilla S-SGD under a weak analytical assumption. Extensive experiments are conducted to verify the analytical assumption and the convergence performance of LAGS-SGD. Experimental results on a 16-GPU cluster show that LAGS-SGD outperforms the original S-SGD and existing sparsified S-SGD without losing obvious model accuracy. Comment: 8 pages. To appear at ECAI 202

    An Incentive-Based Mechanism for Volunteer Computing using Blockchain

    The rise of fast communication media both at the core and at the edge has resulted in unprecedented numbers of sophisticated and intelligent wireless IoT devices. Tactile Internet has enabled the interaction between humans and machines within their environment to achieve revolutionized solutions both on the move and in real-time. Many applications such as intelligent autonomous self-driving, smart agriculture and industrial solutions, and self-learning multimedia content filtering and sharing have become attainable through cooperative, distributed and decentralized systems, namely, volunteer computing. This article introduces a blockchain-enabled resource sharing and service composition solution through volunteer computing. Device resource, computing, and intelligence capabilities are advertised in the environment to be made discoverable and available for sharing with the aid of blockchain technology. Incentives in the form of on-demand service availability are given to resource and service providers to ensure fair and balanced cooperative resource usage. Blockchains are formed whenever a service request is initiated with the aid of fog and mobile edge computing (MEC) devices to ensure secure communication and service delivery for the participants. Using both volunteer computing techniques and tactile internet architectures, we devise a fast and reliable service provisioning framework that relies on a reinforcement learning technique. Simulation results show that the proposed solution can achieve high reward distribution, increased number of blockchain formations, reduced delays, and balanced resource usage among participants, under the premise of high IoT device availability. Comment: 22 pages, 12 Figures, 1 Table. Accepted. ACM Transaction On Internet Technolog

    1-bit LAMB: Communication Efficient Large-Scale Large-Batch Training with LAMB's Convergence Speed

    To train large models (like BERT and GPT-3) with hundreds or even thousands of GPUs, the communication has become a major bottleneck, especially on commodity systems with limited-bandwidth TCP interconnects network. On one side large-batch optimization such as LAMB algorithm was proposed to reduce the number of communications. On the other side, communication compression algorithms such as 1-bit SGD and 1-bit Adam help to reduce the volume of each communication. However, we find that simply using one of the techniques is not sufficient to solve the communication challenge, especially on low-bandwidth Ethernet networks. Motivated by this we aim to combine the power of large-batch optimization and communication compression, but we find that existing compression strategies cannot be directly applied to LAMB due to its unique adaptive layerwise learning rates. To this end, we design a new communication-efficient algorithm, 1-bit LAMB, which introduces a novel way to support adaptive layerwise learning rates even when communication is compressed. In addition, we introduce a new system implementation for compressed communication using the NCCL backend of PyTorch distributed, which improves both usability and performance compared to existing MPI-based implementation. For BERT-Large pre-training task with batch sizes from 8K to 64K, our evaluations on up to 256 GPUs demonstrate that 1-bit LAMB with NCCL-based backend is able to achieve up to 4.6x communication volume reduction, up to 2.8x end-to-end speedup (in terms of number of training samples per second), and the same convergence speed (in terms of number of pre-training samples to reach the same accuracy on fine-tuning tasks) compared to uncompressed LAMB

    Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach

    Federated learning (FL) is an emerging technique for training machine learning models using geographically dispersed data collected by local entities. It includes local computation and synchronization steps. To reduce the communication overhead and improve the overall efficiency of FL, gradient sparsification (GS) can be applied, where instead of the full gradient, only a small subset of important elements of the gradient is communicated. Existing work on GS uses a fixed degree of gradient sparsity for i.i.d.-distributed data within a datacenter. In this paper, we consider adaptive degree of sparsity and non-i.i.d. local datasets. We first present a fairness-aware GS method which ensures that different clients provide a similar amount of updates. Then, with the goal of minimizing the overall training time, we propose a novel online learning formulation and algorithm for automatically determining the near-optimal communication and computation trade-off that is controlled by the degree of gradient sparsity. The online learning algorithm uses an estimated sign of the derivative of the objective function, which gives a regret bound that is asymptotically equal to the case where exact derivative is available. Experiments with real datasets confirm the benefits of our proposed approaches, showing up to 40% improvement in model accuracy for a finite training time. Comment: Accepted at IEEE ICDCS 202

    A flexible framework for communication-efficient machine learning: from HPC to IoT

    With the increasing scale of machine learning tasks, it has become essential to reduce the communication between computing nodes. Early work on gradient compression focused on the bottleneck between CPUs and GPUs, but communication-efficiency is now needed in a variety of different system architectures, from high-performance clusters to energy-constrained IoT devices. In the current practice, compression levels are typically chosen before training and settings that work well for one task may be vastly suboptimal for another dataset on another architecture. In this paper, we propose a flexible framework which adapts the compression level to the true gradient at each iteration, maximizing the improvement in the objective function that is achieved per communicated bit. Our framework is easy to adapt from one technology to the next by modeling how the communication cost depends on the compression level for the specific technology. Theoretical results and practical experiments indicate that the automatic tuning strategies significantly increase communication efficiency on several state-of-the-art compression schemes. Comment: 27 pages, 11 figures, 1 tabl

    Communication-Efficient Decentralized Learning with Sparsification and Adaptive Peer Selection

    Distributed learning techniques such as federated learning have enabled multiple workers to train machine learning models together to reduce the overall training time. However, current distributed training algorithms (centralized or decentralized) suffer from the communication bottleneck on multiple low-bandwidth workers (also on the server under the centralized architecture). Although decentralized algorithms generally have lower communication complexity than the centralized counterpart, they still suffer from the communication bottleneck for workers with low network bandwidth. To deal with the communication problem while being able to preserve the convergence performance, we introduce a novel decentralized training algorithm with the following key features: 1) It does not require a parameter server to maintain the model during training, which avoids the communication pressure on any single peer. 2) Each worker only needs to communicate with a single peer at each communication round with a highly compressed model, which can significantly reduce the communication traffic on the worker. We theoretically prove that our sparsification algorithm still preserves convergence properties. 3) Each worker dynamically selects its peer at different communication rounds to better utilize the bandwidth resources. We conduct experiments with convolutional neural networks on 32 workers to verify the effectiveness of our proposed algorithm compared to seven existing methods. Experimental results show that our algorithm significantly reduces the communication traffic and generally selects relatively high bandwidth peers