Developing a loss prediction-based asynchronous stochastic gradient descent algorithm for distributed training of deep neural networks
Training deep neural networks is a computation-intensive and time-consuming task. Asynchronous Stochastic Gradient Descent (ASGD) is an effective way to accelerate training because it allows the network to be trained in a distributed fashion, but it suffers from delayed gradient updates. A recent notable work called DC-ASGD improves the performance of ASGD by compensating for the delay using a cheap approximation of the Hessian matrix. DC-ASGD works well with a short delay; however, its performance drops considerably as the delay between the workers and the server increases. In real-life large-scale distributed training, the gradient delay experienced by a worker is usually high and volatile. In this paper, we propose a novel algorithm called LC-ASGD that compensates for the delay based on loss prediction. It effectively extends the delay duration that the compensation mechanism can tolerate. Specifically, LC-ASGD uses additional models that reside in the parameter server and predict the loss, based on historical losses collected from each worker, to compensate for the delay. The algorithm is evaluated on popular networks and benchmark datasets. The experimental results show that LC-ASGD significantly improves over existing methods, especially when the networks are trained with a large number of workers.
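To make the delayed-gradient issue concrete, the following is a minimal sketch of a parameter-server update with delay compensation in the style of DC-ASGD. It assumes the commonly cited form of that update, in which the element-wise square of the gradient stands in for the diagonal of the Hessian; the abstract itself does not give the formula. The function names, the learning rate, and the coefficient `lam` are hypothetical, and the sketch does not implement LC-ASGD, whose loss-prediction models are not described here.

```python
# Sketch of a delay-compensated ASGD server step (DC-ASGD style).
# Assumption: the Hessian is approximated element-wise by g * g, so the
# compensated gradient is g + lam * g * g * (w_current - w_stale).
# All names and default values below are illustrative, not from the paper.

def compensate_delayed_gradient(grad, w_stale, w_current, lam=0.04):
    """Adjust a gradient computed at w_stale so it better matches w_current."""
    return [
        g + lam * g * g * (wc - ws)
        for g, ws, wc in zip(grad, w_stale, w_current)
    ]


def server_update(w_current, grad, w_stale, lr=0.1, lam=0.04):
    """Apply one SGD step on the server using the compensated gradient."""
    compensated = compensate_delayed_gradient(grad, w_stale, w_current, lam)
    return [w - lr * c for w, c in zip(w_current, compensated)]


if __name__ == "__main__":
    # A worker read the weights at w_stale; by the time its gradient
    # arrives, other workers have moved the server weights to w_current.
    w_stale = [1.0]
    w_current = [1.5]
    grad = [2.0]
    print(server_update(w_current, grad, w_stale, lr=0.1, lam=0.5))
```

When the worker's copy of the weights is not stale (`w_stale == w_current`), the correction term vanishes and the step reduces to plain SGD. The correction grows with the gap between the two weight vectors, which is consistent with the abstract's remark that this kind of compensation degrades as the delay increases.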
Do optimization methods in deep learning applications matter?
With advances in deep learning, exponential data growth and increasing model
complexity, developing efficient optimization methods is attracting much
research attention. Several implementations favor Conjugate Gradient
(CG) and Stochastic Gradient Descent (SGD) as practical and elegant
solutions for achieving quick convergence; however, these optimization
processes also present many limitations in learning across deep learning applications.
Recent research is exploring higher-order optimization functions as better
approaches, but these present very complex computational challenges for
practical use. Comparing first and higher-order optimization functions, our
experiments reveal that Levenberg-Marquardt (LM) achieves significantly
better convergence but suffers from very long processing times,
increasing the training complexity of both classification and reinforcement
learning problems. Our experiments compare off-the-shelf optimization
functions (CG, SGD, LM and L-BFGS) on the standard CIFAR, MNIST, CartPole and
FlappyBird tasks. The paper presents arguments on which optimization
functions to use and, further, which functions would benefit from
parallelization efforts to improve pretraining time and learning rate
convergence.
Federated Optimization: Distributed Machine Learning for On-Device Intelligence
We introduce a new and increasingly relevant setting for distributed
optimization in machine learning, where the data defining the optimization are
unevenly distributed over an extremely large number of nodes. The goal is to
train a high-quality centralized model. We refer to this setting as Federated
Optimization. In this setting, communication efficiency is of the utmost
importance and minimizing the number of rounds of communication is the
principal goal.
A motivating example arises when we keep the training data locally on users'
mobile devices instead of logging it to a data center for training. In
federated optimization, the devices are used as compute nodes performing
computation on their local data in order to update a global model. We suppose
that we have an extremely large number of devices in the network --- as many as
the number of users of a given service, each of which has only a tiny fraction
of the total data available. In particular, we expect the number of data points
available locally to be much smaller than the number of devices. Additionally,
since different users generate data with different patterns, it is reasonable
to assume that no device has a representative sample of the overall
distribution.
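The setting described above can be illustrated with a toy communication round. This is a generic sketch, assuming a simple scheme in which each device takes one local gradient step and the server averages the results weighted by local data size; it is not the algorithm proposed in the paper, which the abstract does not specify. The scalar model `y ≈ w * x`, the function names, and the learning rate are all hypothetical.

```python
# Toy federated round: devices keep their data, only model values travel.
# Assumption: weighted averaging of one-step local updates; this is a
# generic illustration, not the paper's method.

def local_update(w, data, lr=0.1):
    """One gradient step on mean squared error for y ≈ w * x, on-device."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad


def federated_round(w_global, devices, lr=0.1):
    """Server step: average local updates, weighted by local sample count."""
    total = sum(len(data) for data in devices)
    return sum(
        len(data) * local_update(w_global, data, lr) for data in devices
    ) / total


if __name__ == "__main__":
    # Unevenly distributed data: each device holds only a few points.
    devices = [[(1.0, 2.0)], [(2.0, 4.0), (1.0, 2.0)]]
    w = 0.0
    for _ in range(50):  # each iteration is one round of communication
        w = federated_round(w, devices)
    print(w)  # approaches 2.0, the slope that generated the data
```

Each loop iteration corresponds to one round of communication, the quantity the abstract identifies as the main cost to minimize.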
We show that existing algorithms are not suitable for this setting, and
propose a new algorithm which shows encouraging experimental results for sparse
convex problems. This work also sets a path for future research needed in the
context of federated optimization.
Comment: 38 pages