SAFA: a semi-asynchronous protocol for fast federated learning with low overhead
Federated learning (FL) has attracted increasing attention as a promising approach to driving a vast number of end devices with artificial intelligence. However, it is very challenging to guarantee the efficiency of FL considering the unreliable nature of end devices while the cost of device-server communication cannot be neglected. In this paper, we propose SAFA, a semi-asynchronous FL protocol, to address the problems in federated learning such as low round efficiency and poor convergence rate in extreme conditions (e.g., clients dropping offline frequently). We introduce novel designs in the steps of model distribution, client selection and global aggregation to mitigate the impacts of stragglers, crashes and model staleness in order to boost efficiency and improve the quality of the global model. We have conducted extensive experiments with typical machine learning tasks. The results demonstrate that the proposed protocol is effective in terms of shortening federated round duration, reducing local resource wastage, and improving the accuracy of the global model at an acceptable communication cost.
Making Asynchronous Stochastic Gradient Descent Work for Transformers
Asynchronous stochastic gradient descent (SGD) is attractive from a speed
perspective because workers do not wait for synchronization. However, the
Transformer model converges poorly with asynchronous SGD, resulting in
substantially lower quality compared to synchronous SGD. To investigate why
this is the case, we isolate differences between asynchronous and synchronous
methods to investigate batch size and staleness effects. We find that summing
several asynchronous updates, rather than applying them immediately, restores
convergence behavior. With this hybrid method, Transformer training for a neural
machine translation task reaches a near-convergence level 1.36x faster in
single-node multi-GPU training with no impact on model quality.
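The idea of summing several asynchronous updates before applying them can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `apply_accumulated`, the plain-list parameter representation, and the fixed learning rate are all assumptions made for illustration. The sketch shows only the bookkeeping; in real asynchronous training the benefit comes from workers computing gradients against fewer, less stale parameter versions.

```python
from typing import List


def apply_accumulated(
    params: List[float],
    updates: List[List[float]],
    accumulate: int,
    lr: float,
) -> List[float]:
    """Sum every `accumulate` incoming gradients, then apply the sum in one step.

    Hypothetical helper: `updates` stands in for gradients arriving from
    asynchronous workers in arrival order.
    """
    params = list(params)            # do not mutate the caller's list
    buffer = [0.0] * len(params)     # running sum of pending gradients
    pending = 0
    for grad in updates:
        for i, g in enumerate(grad):
            buffer[i] += g
        pending += 1
        if pending == accumulate:
            # One parameter update for the whole group of gradients.
            for i in range(len(params)):
                params[i] -= lr * buffer[i]
            buffer = [0.0] * len(params)
            pending = 0
    # Leftover gradients (fewer than `accumulate`) are not applied in this sketch.
    return params


if __name__ == "__main__":
    print(apply_accumulated([1.0, 2.0], [[1.0, 0.0], [1.0, 2.0], [5.0, 5.0]], 2, 0.5))
```

With `accumulate=1` the function degenerates to applying each update immediately, which corresponds to the plain asynchronous baseline described in the abstract.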
Accelerating Neural Network Training with Distributed Asynchronous and Selective Optimization (DASO)
With increasing data and model complexities, the time required to train neural networks has become prohibitively large. To address the exponential rise in training time, users are turning to data parallel neural networks (DPNN) and large-scale distributed resources on computer clusters. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes with blocking communication operations after each forward-backward pass. This synchronization is the central algorithmic bottleneck. We introduce the Distributed Asynchronous and Selective Optimization (DASO) method, which leverages multi-GPU compute node architectures to accelerate network training while maintaining accuracy. DASO uses a hierarchical and asynchronous communication scheme comprised of node-local and global networks while adjusting the global synchronization rate during the learning process. We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks, as compared to current optimized data parallel training methods.
Robust and Communication-Efficient Collaborative Learning
We consider a decentralized learning problem, where a set of computing nodes
aim at solving a non-convex optimization problem collaboratively. It is
well-known that decentralized optimization schemes face two major system
bottlenecks: stragglers' delay and communication overhead. In this paper, we
tackle these bottlenecks by proposing a novel decentralized and gradient-based
optimization algorithm named QuanTimed-DSGD. Our algorithm stands on two
main ideas: (i) we impose a deadline on the local gradient computations of each
node at each iteration of the algorithm, and (ii) the nodes exchange quantized
versions of their local models. The first idea adds robustness to straggling nodes
and the second improves communication efficiency. The key technical
contribution of our work is to prove that with non-vanishing noises for
quantization and stochastic gradients, the proposed method exactly converges to
the global optimum for convex loss functions, and finds a first-order
stationary point in non-convex scenarios. Our numerical evaluations of the
QuanTimed-DSGD on training benchmark datasets, MNIST and CIFAR-10, demonstrate
speedups of up to 3x in run-time, compared to state-of-the-art decentralized
optimization methods.
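The two ideas named in this abstract, a deadline on local gradient computation and the exchange of quantized models, can be sketched as follows. This is a simplified illustration rather than the published algorithm: scalar gradients, simulated per-sample compute times in place of a wall clock, and an unbiased stochastic rounding quantizer are all assumptions, and the names `deadline_gradient` and `stochastic_quantize` are hypothetical.

```python
import math
import random
from typing import Sequence


def deadline_gradient(
    grads: Sequence[float], times: Sequence[float], deadline: float
) -> float:
    """Average the per-sample gradients that finish before the deadline.

    `times[i]` is the assumed (simulated) cost of computing `grads[i]`.
    A slow node simply contributes fewer samples instead of delaying others.
    """
    total, used, elapsed = 0.0, 0, 0.0
    for g, t in zip(grads, times):
        if elapsed + t > deadline:
            break                    # out of time: stop computing
        elapsed += t
        total += g
        used += 1
    return total / used if used else 0.0


def stochastic_quantize(x: float, step: float, rng: random.Random) -> float:
    """Round x to a multiple of `step`, up or down at random.

    The probability of rounding up equals the fractional position of x
    between the two grid points, so the result equals x in expectation.
    """
    low = math.floor(x / step) * step
    p_up = (x - low) / step
    return low + step if rng.random() < p_up else low


if __name__ == "__main__":
    rng = random.Random(0)
    g = deadline_gradient([1.0, 3.0, 10.0], [1.0, 1.0, 5.0], deadline=3.0)
    print(g, stochastic_quantize(0.3, 0.25, rng))
```

In a full decentralized loop, each node would presumably send the quantized value of its local model to its neighbors and combine the received values with its own deadline-limited gradient; the mixing weights and step sizes are left out here because the abstract does not specify them.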
FedGSM: Efficient Federated Learning for LEO Constellations with Gradient Staleness Mitigation
Recent advancements in space technology have equipped low Earth orbit (LEO)
satellites with the capability to perform complex functions and run AI
applications. Federated Learning (FL) on LEO satellites enables collaborative
training of a global ML model without the need for sharing large datasets.
However, intermittent connectivity between satellites and ground stations can
lead to stale gradients and unstable learning, thereby limiting learning
performance. In this paper, we propose FedGSM, a novel asynchronous FL
algorithm that introduces a compensation mechanism to mitigate gradient
staleness. FedGSM leverages the deterministic and time-varying topology of the
orbits to offset the negative effects of staleness. Our simulation results
demonstrate that FedGSM outperforms state-of-the-art algorithms for both IID
and non-IID datasets, underscoring its effectiveness and advantages. We also
investigate the effect of system parameters. Comment: 5 pages, 6 figures