High-Performance Distributed ML at Scale through Parameter Server Consistency Models
As Machine Learning (ML) applications increase in data size and model
complexity, practitioners turn to distributed clusters to satisfy the increased
computational and memory demands. Unfortunately, effective use of clusters for
ML requires considerable expertise in writing distributed code, while
highly-abstracted frameworks like Hadoop have not, in practice, approached the
performance seen in specialized ML implementations. The recent Parameter Server
(PS) paradigm is a middle ground between these extremes, allowing easy
conversion of single-machine parallel ML applications into distributed ones,
while maintaining high throughput through relaxed "consistency models" that
allow inconsistent parameter reads. However, due to insufficient theoretical
study, it is not clear which of these consistency models can really ensure
correct ML algorithm output; at the same time, there remain many
theoretically-motivated but undiscovered opportunities to maximize
computational throughput. Motivated by this challenge, we study both the
theoretical guarantees and empirical behavior of iterative-convergent ML
algorithms in existing PS consistency models. We then use the gleaned insights
to improve a consistency model using an "eager" PS communication mechanism, and
implement it as a new PS system that enables ML algorithms to reach their
solution more quickly. Comment: 19 pages, 2 figures
DBS: Dynamic Batch Size For Distributed Deep Neural Network Training
Synchronous strategies with data parallelism, such as Synchronous Stochastic
Gradient Descent (S-SGD) and model averaging methods, are widely used in
distributed training of Deep Neural Networks (DNNs), largely owing to their
easy implementation and promising performance. In particular, each worker of
the cluster hosts a copy of the DNN and an evenly divided share of the dataset
with a fixed mini-batch size, to keep the training of DNNs convergent. Under
these strategies, workers with different computational capabilities must wait
for each other because of synchronization and network transmission delays,
which inevitably leaves the high-performance workers wasting computation.
Consequently, the utilization of the cluster is relatively low. To alleviate
this issue, we propose the Dynamic Batch Size (DBS) strategy for the
distributed training of DNNs. Specifically, the performance of each worker is
first evaluated based on its results in the previous epoch, and then the batch
size and dataset partition are dynamically adjusted according to the worker's
current performance, thereby improving the utilization of the cluster. To
verify the effectiveness of the proposed strategy, extensive experiments have
been conducted, and the experimental results indicate that the proposed
strategy can fully utilize the performance of the cluster, reduce the training
time, and remain robust to disturbance from irrelevant tasks. Furthermore, a
rigorous theoretical analysis is provided to prove the convergence of the
proposed strategy. Comment: The latest version of this article has been accepted by IEEE TETC
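The adjustment step described in this abstract can be illustrated with a minimal sketch. It assumes that a worker's performance is its measured throughput (samples per second) in the previous epoch and that a fixed global batch is split in proportion to that throughput. The abstract does not give the actual update rule, so the function name `rebalance_batch_sizes`, its parameters, and the proportional rule are all hypothetical and should not be read as the authors' method.

```python
from typing import List


def rebalance_batch_sizes(samples_done: List[int],
                          epoch_seconds: List[float],
                          global_batch: int) -> List[int]:
    """Split global_batch across workers in proportion to measured throughput.

    Hypothetical illustration only: proportional allocation is an assumption,
    not the update rule from the DBS paper.
    """
    if not samples_done or len(samples_done) != len(epoch_seconds):
        raise ValueError("need one (samples, seconds) pair per worker")
    if any(t <= 0 for t in epoch_seconds):
        raise ValueError("epoch durations must be positive")

    # Throughput of each worker in the previous epoch (samples per second).
    throughput = [n / t for n, t in zip(samples_done, epoch_seconds)]
    total = sum(throughput)
    if total <= 0:
        raise ValueError("at least one worker must have processed samples")

    # Ideal (fractional) share of the global batch for each worker.
    exact = [global_batch * r / total for r in throughput]
    sizes = [int(x) for x in exact]

    # Hand the samples lost to rounding down to the workers with the
    # largest fractional remainders, so the shares sum to global_batch.
    leftover = global_batch - sum(sizes)
    by_remainder = sorted(range(len(exact)),
                          key=lambda i: exact[i] - sizes[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    # Two workers processed the same number of samples; the second took
    # three times as long, so it receives a smaller share of the batch.
    print(rebalance_batch_sizes([1000, 1000], [10.0, 30.0], 128))
```

Under the same assumption, the resulting proportions could also be used to size each worker's dataset shard for the next epoch, so that faster workers receive both larger batches and more data.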
AI Technical Considerations: Data Storage, Cloud usage and AI Pipeline
Artificial intelligence (AI), especially deep learning, requires vast amounts
of data for training, testing, and validation. Collecting these data and the
corresponding annotations requires the implementation of imaging biobanks that
provide access to these data in a standardized way. This requires careful
design and implementation based on the current standards and guidelines and
complying with the current legal restrictions. However, the realization of
proper imaging data collections is not sufficient to train, validate and deploy
AI as resource demands are high and require a careful hybrid implementation of
AI pipelines both on-premise and in the cloud. This chapter aims to help the
reader when technical considerations have to be made about the AI environment
by providing a technical background of different concepts and implementation
aspects involved in data storage, cloud usage, and AI pipelines.