Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging
Deep learning at scale is dominated by communication time. Distributing
samples across nodes usually yields the best performance, but poses scaling
challenges due to global information dissemination and load imbalance across
uneven sample lengths. State-of-the-art decentralized optimizers mitigate the
problem, but require more iterations to achieve the same accuracy as their
globally-communicating counterparts. We present Wait-Avoiding Group Model
Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global
communication via subgroup weight exchange. The key insight is a combination of
algorithmic changes to the averaging scheme and the use of a group allreduce
operation. We prove the convergence of WAGMA-SGD, and empirically show that it
retains convergence rates similar to Allreduce-SGD. For evaluation, we train
ResNet-50 on ImageNet; Transformer for machine translation; and deep
reinforcement learning for navigation at scale. Compared with state-of-the-art
decentralized SGD variants, WAGMA-SGD significantly improves training
throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves
the fastest time-to-solution (e.g., the highest score using the shortest
training time for Transformer).
Comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 202
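The core idea above, replacing the global allreduce with an average over a small, rotating subgroup of workers so that no worker waits on a global barrier, can be sketched as a toy NumPy simulation. This is an illustrative sketch, not the authors' implementation; the rotation scheme and group size are assumptions standing in for the paper's grouping schedule.

```python
import numpy as np

def group_average_step(weights, group_size, step):
    """One WAGMA-style averaging round: each worker averages its model
    weights only within its subgroup instead of across all workers.

    weights: list of per-worker parameter vectors (np.ndarray)
    group_size: number of workers per averaging group
    step: iteration counter, used here to rotate group membership
    """
    n = len(weights)
    # Rotate group membership each step so information still spreads
    # globally over time (an assumed stand-in for the paper's schedule).
    order = np.roll(np.arange(n), step)
    new_weights = [w.copy() for w in weights]
    for g in range(0, n, group_size):
        members = order[g:g + group_size]
        avg = np.mean([weights[i] for i in members], axis=0)
        for i in members:
            new_weights[i] = avg
    return new_weights
```

Repeated rounds with rotated membership drive all workers toward the global average while each round only communicates within a group of `group_size` workers.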
Dynamic Optimization Techniques for Resource-Efficient Execution of Distributed Machine Learning
Thesis (Ph.D.) -- Seoul National University Graduate School: Dept. of Computer Science and Engineering, College of Engineering, February 2020. Advisor: Byung-Gon Chun.
Machine Learning (ML) systems are widely used to extract insights from data. Ever-increasing dataset sizes and model complexity gave rise to many efforts towards efficient distributed machine learning systems. One of the popular approaches to supporting large-scale data and complicated models is the parameter server (PS) approach. In this approach, a training job runs with distributed worker and server tasks, where workers iteratively compute gradients to update the global model parameters that are kept in servers.
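The worker/server split described above can be illustrated with a minimal single-process sketch. The class and function names are hypothetical; real PS systems shard parameters across many server tasks and run workers asynchronously over the network.

```python
import numpy as np

class ParameterServer:
    """Toy server task: holds the global parameters and applies updates."""
    def __init__(self, dim, lr=0.1):
        self.params = np.zeros(dim)
        self.lr = lr

    def pull(self):
        # Workers fetch the current global model.
        return self.params.copy()

    def push(self, grad):
        # Workers send gradients; the server applies an SGD step.
        self.params -= self.lr * grad

def worker_step(server, x, y):
    """Toy worker task: pull parameters, compute a gradient on its data
    partition (least-squares loss here), push the gradient back."""
    w = server.pull()
    grad = x.T @ (x @ w - y) / len(y)
    server.push(grad)
```

Iterating `worker_step` over the workers' data partitions converges the server-held parameters, mirroring the pull-compute-push loop described in the abstract.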
To improve PS system performance, this dissertation proposes two solutions that automatically optimize resource efficiency and system performance. First, we propose a solution that optimizes the resource configuration and workload partitioning of distributed ML training on a PS system. To find the best configuration, we build an Optimizer based on a cost model that works with online metrics. To efficiently apply the Optimizer's decisions, we design our runtime to be elastic, performing reconfiguration in the background with minimal overhead.
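A cost-model-driven optimizer of the kind described, which predicts iteration time from online metrics and searches over worker/server splits, might look like the following sketch. The cost terms and the grid search are illustrative assumptions, not the dissertation's actual model.

```python
from itertools import product

def iteration_cost(workers, servers, comp_per_sample, samples,
                   params, bandwidth):
    """Rough per-iteration time: compute split across workers, plus
    parameter exchange with the servers (pull and push)."""
    compute = comp_per_sample * samples / workers
    communicate = 2 * params / (bandwidth * min(workers, servers))
    return compute + communicate

def best_configuration(total_machines, **metrics):
    """Grid-search the worker/server split that minimizes modeled cost,
    given online metrics observed from the running job."""
    best = None
    for workers, servers in product(range(1, total_machines), repeat=2):
        if workers + servers != total_machines:
            continue
        cost = iteration_cost(workers, servers, **metrics)
        if best is None or cost < best[1]:
            best = ((workers, servers), cost)
    return best[0]
```

The real system additionally repartitions the workload and reconfigures the running job in the background; this sketch only covers the search step.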
The second solution optimizes the scheduling of resources and tasks of multiple ML training jobs in a shared cluster. Specifically, we co-locate jobs with complementary resource use to increase resource utilization, while executing their tasks in fine-grained units to avoid resource contention. To alleviate memory pressure from co-located jobs, we enable dynamic spill/reload of data, which adaptively changes the ratio of data between disk and memory.
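The dynamic spill/reload mechanism can be sketched as a simple policy that spills data blocks to disk when co-located jobs raise memory pressure and reloads them when headroom returns. The function name and the pressure thresholds are hypothetical illustrations of the idea.

```python
def rebalance(data_blocks, in_memory, memory_pressure, high=0.8, low=0.5):
    """Decide how many data blocks to move between memory and disk.

    data_blocks: total number of blocks owned by the job
    in_memory: how many blocks are currently memory-resident
    memory_pressure: fraction of machine memory in use (0.0-1.0)
    Returns (spill, reload): block counts to move to / from disk.
    """
    if memory_pressure > high:
        # Under pressure from co-located jobs: spill half of the
        # resident blocks to disk.
        return in_memory // 2, 0
    if memory_pressure < low and in_memory < data_blocks:
        # Headroom available: reload a bounded batch of spilled blocks.
        return 0, min(data_blocks - in_memory, data_blocks // 4)
    return 0, 0
```

Invoked periodically per job, this adapts the disk/memory ratio to the current co-location, which is the behavior the abstract describes.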
We build a working system that implements our approaches. The above two solutions are implemented in the same system and share the runtime part, which can dynamically migrate jobs between machines and reallocate machine resources. We evaluate our system with popular ML applications to verify the effectiveness of our solutions; it shows superior performance improvements over existing systems.
Chapter 1. Introduction
1.1 Distributed Machine Learning on Parameter Servers
1.2 Automating System Configuration of Distributed Machine Learning
1.3 Scheduling of Multiple Distributed Machine Learning Jobs
1.4 Contributions
1.5 Dissertation Structure
Chapter 2. Background
Chapter 3. Automating System Configuration of Distributed Machine Learning
3.1 System Configuration Challenges
3.2 Finding Good System Configuration
3.2.1 Cost Model
3.2.2 Cost Formulation
3.2.3 Optimization
3.3 Cruise
3.3.1 Optimizer
3.3.2 Elastic Runtime
3.4 Evaluation
3.4.1 Experimental Setup
3.4.2 Finding Baselines with Grid Search
3.4.3 Optimization in the Homogeneous Environment
3.4.4 Utilizing Opportunistic Resources
3.4.5 Optimization in the Heterogeneous Environment
3.4.6 Reconfiguration Speed
3.5 Related Work
3.6 Summary
Chapter 4. A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs
4.1 Resource Under-utilization Problems in PS ML Training
4.2 Harmony Overview
4.3 Multiplexing ML Jobs
4.3.1 Fine-grained Execution with Subtasks
4.3.2 Dynamic Grouping of Jobs
4.3.3 Dynamic Data Reloading
4.4 Evaluation
4.4.1 Baselines
4.4.2 Experimental Setup
4.4.3 Performance Comparison
4.4.4 Performance Breakdown
4.4.5 Workload Sensitivity Analysis
4.4.6 Accuracy of the Performance Model
4.4.7 Performance and Scalability of the Scheduling Algorithm
4.4.8 Dynamic Data Reloading
4.5 Discussion
4.6 Related Work
4.7 Summary
Chapter 5. Conclusion
5.1 Summary
5.2 Future Work
5.2.1 Other Communication Architecture Support
5.2.2 Deep Learning & GPU Resource Support
Abstract (in Korean)
Regularized Bottleneck with Early Labeling
Small IoT devices, such as drones and lightweight battery-powered robots, are emerging as a major platform for the deployment of AI/ML capabilities. Autonomous and semi-autonomous device operation relies on the systematic use of deep neural network models for solving complex tasks, such as image classification. The challenging restrictions of these devices in terms of computing capabilities, network connectivity, and power consumption are the main limits to the accuracy of latency-sensitive inferences. This paper presents ReBEL, a split computing architecture enabling the dynamic remote offload of partial computations or, alternatively, a local approximate labeling based on a jointly-trained classifier. Our approach combines elements of head network distillation, early exit classification, and bottleneck injection with the goal of reducing the average end-to-end latency of AI/ML inference on constrained IoT devices.
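The early-labeling decision at the bottleneck, label locally when the on-device head classifier is confident and offload to the remote tail otherwise, can be sketched as follows. This is an illustrative NumPy sketch; the confidence threshold and function names are assumptions, not ReBEL's API.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a logit vector.
    e = np.exp(z - z.max())
    return e / e.sum()

def infer(head_logits, offload_fn, threshold=0.9):
    """Early labeling: return the head classifier's label if its softmax
    confidence clears the threshold; otherwise pay the network cost and
    offload to the remote tail model via offload_fn."""
    probs = softmax(head_logits)
    if probs.max() >= threshold:
        return int(probs.argmax()), "local"
    return offload_fn(head_logits), "remote"
```

Raising the threshold trades average latency for accuracy: fewer samples exit early, so more incur the offload cost but receive the full model's prediction.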
SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient
Many deep learning applications benefit from using large models with billions
of parameters. Training these models is notoriously expensive due to the need
for specialized HPC clusters. In this work, we consider alternative setups for
training large models: using cheap "preemptible" instances or pooling existing
resources from multiple regions. We analyze the performance of existing
model-parallel algorithms in these conditions and find configurations where
training larger models becomes less communication-intensive. Based on these
findings, we propose SWARM parallelism, a model-parallel training algorithm
designed for poorly connected, heterogeneous and unreliable devices. SWARM
creates temporary randomized pipelines between nodes that are rebalanced in
case of failure. We empirically validate our findings and compare SWARM
parallelism with existing large-scale training approaches. Finally, we combine
our insights with compression strategies to train a large Transformer language
model with 1B shared parameters (approximately 13B before sharing) on
preemptible T4 GPUs with less than 200Mb/s network.
Comment: Accepted to International Conference on Machine Learning (ICML) 2023. 25 pages, 8 figures
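The temporary randomized pipelines can be sketched as follows: each pipeline stage has a pool of interchangeable workers, each microbatch picks one live worker per stage at random, and a failed worker is simply excluded so the stage rebalances onto the remaining pool. This is a toy model of the idea, not the SWARM implementation.

```python
import random

def run_pipeline(stage_pools, process, failed=frozenset()):
    """Push one microbatch through a randomized pipeline.

    stage_pools: list of worker-id lists, one pool per pipeline stage
    process: fn(worker_id, stage, value) -> value, the stage computation
    failed: ids of crashed workers; they are skipped, rebalancing the
            stage onto the pool's surviving members.
    """
    value = 0
    route = []
    for stage, pool in enumerate(stage_pools):
        alive = [w for w in pool if w not in failed]
        if not alive:
            raise RuntimeError(f"stage {stage} has no live workers")
        worker = random.choice(alive)  # temporary, per-batch routing
        value = process(worker, stage, value)
        route.append(worker)
    return value, route
```

Because routing is chosen fresh for every microbatch, slow or dead devices stop receiving work without any global reconfiguration, which is the property the abstract highlights for unreliable, preemptible nodes.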
- …