Accelerating Parallel Stochastic Gradient Descent via Non-blocking Mini-batches
State-of-the-art decentralized SGD algorithms can overcome the bandwidth bottleneck at the parameter server by using communication collectives such as Ring All-Reduce for synchronization. While the parameter updates in distributed SGD may happen asynchronously, there is still a synchronization barrier to ensure that the local training epoch at every learner is complete before the learners can advance to the next epoch. The delays incurred waiting for the slowest learners (stragglers) remain a problem in the synchronization steps of these state-of-the-art decentralized frameworks. In this paper, we propose (de)centralized Non-blocking SGD, which addresses the straggler problem in a heterogeneous environment. The main idea of Non-blocking SGD is to split the original batch into mini-batches, then accumulate the gradients and update the model based on the finished mini-batches. The Non-blocking idea can be implemented on top of decentralized algorithms including Ring All-Reduce, D-PSGD, and MATCHA to solve the straggler problem. Moreover, using gradient accumulation to update the model also guarantees convergence and avoids gradient staleness. A run-time analysis with random straggler delays and with the computational efficiency/throughput of devices is also presented to show the advantage of Non-blocking SGD. Experiments on a suite of datasets and deep learning networks validate the theoretical analyses and demonstrate that Non-blocking SGD speeds up training and accelerates convergence. Compared with state-of-the-art decentralized asynchronous algorithms such as D-PSGD and MATCHA, Non-blocking SGD takes up to 2x less time to reach the same training loss in a heterogeneous environment.
Comment: 12 pages, 4 figures
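The batch-splitting and accumulation step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the `finished_mask` abstraction standing in for the synchronization deadline, and the learning rate are all assumptions made for the example.

```python
import numpy as np

def nonblocking_sgd_step(w, minibatches, grad_fn, finished_mask, lr=0.1):
    """One non-blocking update: the original batch has been split into
    mini-batches, and only gradients from mini-batches that finished before
    the synchronization deadline (finished_mask[i] == True) are accumulated,
    so stragglers neither block the step nor contribute stale gradients."""
    grads = [grad_fn(w, X, y)
             for (X, y), done in zip(minibatches, finished_mask) if done]
    if not grads:
        return w  # no mini-batch finished in time; skip this update
    g = np.mean(grads, axis=0)  # accumulate (average) finished gradients
    return w - lr * g

# Toy usage: least-squares gradient on two mini-batches, one straggler.
grad_fn = lambda w, X, y: 2 * X.T @ (X @ w - y) / len(y)
X1 = np.array([[1.0], [1.0]]); y1 = np.array([1.0, 1.0])
X2 = np.array([[1.0], [1.0]]); y2 = np.array([1.0, 1.0])
w0 = np.array([0.0])
w1 = nonblocking_sgd_step(w0, [(X1, y1), (X2, y2)], grad_fn,
                          finished_mask=[True, False])
```

Because the update averages only the gradients that actually arrived, the step direction is always computed from fresh gradients, which is the staleness-avoidance property the abstract claims for gradient accumulation.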
An efficient algorithm for data parallelism based on stochastic optimization
Deep neural network models can achieve better performance on numerous machine learning tasks by increasing the depth of the model and the amount of training data. However, these measures proportionally raise the cost of training. Accelerating the training of deep neural network models in a distributed computing environment has therefore become the strategy most often adopted by developers to cope with the large training overhead. Stochastic gradient descent (SGD) is one of the most widely used techniques for training network models, but it is prone to gradient staleness when parallelized, which harms overall convergence. Most existing solutions target high-performance nodes whose performance varies little; few studies have considered cluster environments in high-performance computing (HPC), where the performance of individual nodes varies substantially. To address these difficulties, a dynamic batch size stochastic gradient descent approach based on performance-aware technology (DBS-SGD) is proposed. By assessing the processing capacity of each node, this method dynamically allocates each node's minibatch, guaranteeing that the per-iteration update time is essentially the same across nodes and reducing the average gradient staleness per node. The proposed approach can effectively solve the gradient-staleness problem of the asynchronous update strategy. MNIST and CIFAR-10, two widely used image classification benchmarks, are employed as training data sets, and the approach is compared with the asynchronous stochastic gradient descent (ASGD) technique. The experimental findings demonstrate that the proposed algorithm performs better than the existing algorithms.
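The performance-aware allocation idea can be sketched as below. The paper's exact allocation rule is not given in the abstract; this example assumes the simplest proportional scheme, where each node's mini-batch size is proportional to its measured throughput, so per-iteration compute times roughly equalize. The function name and the throughput values are illustrative assumptions.

```python
def allocate_minibatches(total_batch, throughputs):
    """Split a global batch across nodes in proportion to each node's
    measured throughput (samples/second), so that every node takes
    roughly the same time per iteration -- the performance-aware idea
    behind a dynamic-batch-size scheme like DBS-SGD."""
    total = sum(throughputs)
    sizes = [int(total_batch * t / total) for t in throughputs]
    # Assign any rounding remainder to the fastest node.
    fastest = max(range(len(sizes)), key=lambda i: throughputs[i])
    sizes[fastest] += total_batch - sum(sizes)
    return sizes

# One node twice as fast as the other two: it gets half the batch.
sizes = allocate_minibatches(128, [100, 50, 50])
```

With equal per-sample cost, node i then spends roughly `sizes[i] / throughputs[i]` seconds per iteration, which is the same constant for every node, so no node's gradient arrives many iterations late.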