Distributed Machine Learning Framework: New Algorithms and Theoretical Foundation
Machine learning is gaining fresh momentum and has helped enhance not only many industrial and professional processes but also our everyday lives. Its recent success relies heavily on the surge of big data, big models, and big computing. However, inefficient algorithms restrict the application of machine learning to big data mining tasks. In terms of big data, serious concerns such as communication overhead and data privacy must be rigorously addressed when we train models using large amounts of data located on multiple devices. In terms of big models, training a model that is too big to fit on a single device remains an underexplored research area. To address these challenging problems, this thesis focuses on designing new large-scale machine learning models and efficient optimization and training methods for big data mining, and on presenting new discoveries in both theory and applications.
For the challenges raised by big data, we proposed several new asynchronous distributed stochastic gradient descent and coordinate descent methods for efficiently solving convex and non-convex problems. We also designed new large-batch training methods for deep learning models that significantly reduce computation time while achieving better generalization performance. For the challenges raised by big models, we scaled up deep learning models by parallelizing the layer-wise computations with a theoretical guarantee; this is the first algorithm to break the locking imposed by backpropagation, so that training of large models can be dramatically accelerated.
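To make the asynchronous idea concrete, here is a minimal sketch of lock-free asynchronous SGD in the Hogwild! style on a toy least-squares problem. The problem, worker layout, and hyperparameters are all illustrative assumptions, not the thesis's actual algorithms; the point is only that multiple threads update a shared parameter vector without synchronization.

```python
import threading

import numpy as np

# Toy noise-free least-squares problem, so the optimum is exactly w_true.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true

w = np.zeros(5)  # shared parameter vector, updated by all threads without a lock


def worker(w, rows, lr=0.01, epochs=50):
    """Each worker runs SGD steps on its own data shard, writing the
    shared parameter vector in place (races are tolerated, not prevented)."""
    for _ in range(epochs):
        for i in rows:
            grad = 2.0 * (X[i] @ w - y[i]) * X[i]
            w -= lr * grad  # in-place, racy update


# Shard the rows across four asynchronous workers.
shards = np.array_split(np.arange(len(X)), 4)
threads = [threading.Thread(target=worker, args=(w, s)) for s in shards]
for t in threads:
    t.start()
for t in threads:
    t.join()

mse = float(np.mean((X @ w - y) ** 2))
print("final mean squared error:", mse)
```

Because the system is consistent and the step size is small, the racy updates still converge; in real distributed settings the same idea is applied across machines, where stale reads and communication overhead become the central concerns.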
Layer-Wise Partitioning and Merging for Efficient and Scalable Deep Learning
Deep Neural Network (DNN) models are usually trained sequentially from one
layer to another, which causes forward, backward, and update locking problems,
leading to poor performance in terms of training time. Existing parallel
strategies to mitigate these problems provide suboptimal runtime performance.
In this work, we have proposed a novel layer-wise partitioning and merging,
forward and backward pass parallel framework that provides better training
performance. The novelty of the proposed work consists of 1) a layer-wise
partitioning and merging model that minimises communication overhead between
devices without the memory cost of existing strategies during the training
process; and 2) a forward and backward pass parallelisation and optimisation
that addresses the update locking problem and minimises the total training
cost. Experimental evaluation on real use cases shows that the proposed method
outperforms state-of-the-art approaches in terms of training speed, achieving
almost linear speedup without compromising the accuracy of the non-parallel
approach.
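The partition-and-merge idea can be sketched as follows. The layer names, per-layer compute costs, activation sizes, and the greedy heuristics below are all hypothetical illustrations, not the paper's actual model: adjacent layers whose boundary activation would be expensive to communicate are merged into one block, and the merged chain is then split into contiguous, compute-balanced groups, one per device.

```python
# Hypothetical layer list: (name, compute_cost, output_activation_size).
layers = [
    ("conv1", 4.0, 64), ("relu1", 0.5, 64), ("conv2", 4.0, 32),
    ("relu2", 0.5, 32), ("fc1", 2.0, 16), ("fc2", 1.0, 4),
]


def merge_cheap_boundaries(layers, threshold=48):
    """Fuse a layer into its predecessor when the boundary activation is
    large (costly to ship between devices), so no split falls there."""
    merged = [list(layers[0])]
    for name, cost, out in layers[1:]:
        if merged[-1][2] >= threshold:   # previous output too big to communicate
            merged[-1][0] += "+" + name  # fuse into one block
            merged[-1][1] += cost
            merged[-1][2] = out
        else:
            merged.append([name, cost, out])
    return [tuple(m) for m in merged]


def partition(blocks, k):
    """Greedily split the merged blocks into k contiguous groups of
    roughly equal total compute."""
    target = sum(c for _, c, _ in blocks) / k
    groups, cur, acc = [], [], 0.0
    for b in blocks:
        cur.append(b)
        acc += b[1]
        if acc >= target and len(groups) < k - 1:
            groups.append(cur)
            cur, acc = [], 0.0
    groups.append(cur)
    return groups


blocks = merge_cheap_boundaries(layers)
parts = partition(blocks, k=2)
for device, group in enumerate(parts):
    print(f"device {device}: {[name for name, _, _ in group]}")
```

Because merging happens before partitioning, every cut between device groups falls at a boundary with a small activation, which is the intuition behind reducing communication overhead without replicating layers (and hence without the memory cost of replication-based strategies).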