
    Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

    Full text link
    Deep learning at scale is dominated by communication time. Distributing samples across nodes usually yields the best performance, but poses scaling challenges due to global information dissemination and load imbalance across uneven sample lengths. State-of-the-art decentralized optimizers mitigate the problem, but require more iterations to achieve the same accuracy as their globally-communicating counterparts. We present Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The key insight is a combination of algorithmic changes to the averaging scheme and the use of a group allreduce operation. We prove the convergence of WAGMA-SGD, and empirically show that it retains convergence rates similar to Allreduce-SGD. For evaluation, we train ResNet-50 on ImageNet; Transformer for machine translation; and deep reinforcement learning for navigation at scale. Compared with state-of-the-art decentralized SGD variants, WAGMA-SGD significantly improves training throughput (e.g., 2.1x on 1,024 GPUs for reinforcement learning), and achieves the fastest time-to-solution (e.g., the highest score using the shortest training time for Transformer).
    Comment: Published in IEEE Transactions on Parallel and Distributed Systems (IEEE TPDS), vol. 32, no. 7, pp. 1725-1739, 1 July 2021.
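
    As a rough illustration of the mechanism described above, the following minimal numpy simulation averages model replicas inside rotating subgroups instead of globally. The group size, averaging period, rotation schedule, and toy quadratic loss are illustrative assumptions, not details of the WAGMA-SGD implementation.

    import numpy as np

    WORKERS, GROUP_SIZE, STEPS, LR, AVG_PERIOD = 8, 4, 100, 0.1, 4
    rng = np.random.default_rng(0)
    weights = rng.normal(size=(WORKERS, 10))  # one model replica per worker

    def local_gradient(w, rng):
        # toy quadratic loss ||w||^2 / 2 plus gradient noise
        return w + 0.01 * rng.normal(size=w.shape)

    for step in range(STEPS):
        for i in range(WORKERS):
            weights[i] -= LR * local_gradient(weights[i], rng)
        if step % AVG_PERIOD == 0:
            # rotate group membership so information still spreads globally over time
            order = np.roll(np.arange(WORKERS), step // AVG_PERIOD)
            for g in range(0, WORKERS, GROUP_SIZE):
                members = order[g:g + GROUP_SIZE]
                # stand-in for a group allreduce: average only within the subgroup
                weights[members] = weights[members].mean(axis=0)

    print("max replica divergence:", np.ptp(weights, axis=0).max())

    Because only GROUP_SIZE workers synchronize per averaging step, no worker ever waits on a global collective, which is the property the paper's group allreduce is designed to preserve at scale.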

    Large Scale Sparse Neural Networks

    Get PDF

    ๋ถ„์‚ฐ ๊ธฐ๊ณ„ ํ•™์Šต์˜ ์ž์› ํšจ์œจ์ ์ธ ์ˆ˜ํ–‰์„ ์œ„ํ•œ ๋™์  ์ตœ์ ํ™” ๊ธฐ์ˆ 

    Get PDF
    ํ•™์œ„๋…ผ๋ฌธ(๋ฐ•์‚ฌ)--์„œ์šธ๋Œ€ํ•™๊ต ๋Œ€ํ•™์› :๊ณต๊ณผ๋Œ€ํ•™ ์ปดํ“จํ„ฐ๊ณตํ•™๋ถ€,2020. 2. ์ „๋ณ‘๊ณค.Machine Learning(ML) systems are widely used to extract insights from data. Ever increasing dataset sizes and model complexity gave rise to many efforts towards ef๏ฌcient distributed machine learning systems. One of the popular approaches to support large scale data and complicated models is the parameter server (PS) approach. In this approach, a training job runs with distributed worker and server tasks, where workers iteratively compute gradients to update the global model parameters that are kept in servers. To improve the PS system performance, this dissertation proposes two solutions that automatically optimize resource ef๏ฌciency and system performance. First, we propose a solution that optimizes the resource con๏ฌguration and workload partitioning of distributed ML training on PS system. To ๏ฌnd the best con๏ฌguration, we build an Optimizer based on a cost model that works with online metrics. To ef๏ฌciently apply decisions by Optimizer, we design our runtime elastic to perform recon๏ฌguration in the background with minimal overhead. The second solution optimizes the scheduling of resources and tasks of multiple ML training jobs in a shared cluster. Speci๏ฌcally, we co-locate jobs with complementary resource use to increase resource utilization, while executing their tasks with ๏ฌne-grained unit to avoid resource contention. To alleviate memory pressure by co-located jobs, we enable dynamic spill/reload of data, which adaptively changes the ratio of data between disk and memory. We build a working system that implements our approaches. The above two solutions are implemented in the same system and share the runtime part that can dynamically migrate jobs between machines and reallocate machine resources. We evaluate our system with popular ML applications to verify the effectiveness of our solutions.๊ธฐ๊ณ„ ํ•™์Šต ์‹œ์Šคํ…œ์€ ๋ฐ์ดํ„ฐ์— ์ˆจ๊ฒจ์ง„ ์˜๋ฏธ๋ฅผ ๋ฝ‘์•„๋‚ด๊ธฐ ์œ„ํ•ด ๋„๋ฆฌ ์‚ฌ์šฉ๋˜๊ณ  ์žˆ๋‹ค. ๋ฐ์ดํ„ฐ์…‹์˜ ํฌ๊ธฐ์™€ ๋ชจ๋ธ์˜ ๋ณต์žก๋„๊ฐ€ ์–ด๋Š๋•Œ๋ณด๋‹ค ์ปค์ง์— ๋”ฐ๋ผ ํšจ์œจ์ ์ธ ๋ถ„์‚ฐ ๊ธฐ๊ณ„ ํ•™์Šต ์‹œ์Šคํ…œ์„์œ„ํ•œ ๋งŽ์€ ๋…ธ๋ ฅ๋“ค์ด ์ด๋ฃจ์–ด์ง€๊ณ  ์žˆ๋‹ค. ํŒŒ๋ผ๋ฏธํ„ฐ ์„œ๋ฒ„ ๋ฐฉ์‹์€ ๊ฑฐ๋Œ€ํ•œ ์Šค์ผ€์ผ์˜ ๋ฐ์ดํ„ฐ์™€ ๋ณต์žกํ•œ ๋ชจ๋ธ์„ ์ง€์›ํ•˜๊ธฐ ์œ„ํ•œ ์œ ๋ช…ํ•œ ๋ฐฉ๋ฒ•๋“ค ์ค‘ ํ•˜๋‚˜์ด๋‹ค. ์ด ๋ฐฉ์‹์—์„œ, ํ•™์Šต ์ž‘์—…์€ ๋ถ„์‚ฐ ์›Œ์ปค์™€ ์„œ๋ฒ„๋“ค๋กœ ๊ตฌ์„ฑ๋˜๊ณ , ์›Œ์ปค๋“ค์€ ํ• ๋‹น๋œ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋กœ๋ถ€ํ„ฐ ๋ฐ˜๋ณต์ ์œผ๋กœ ๊ทธ๋ ˆ๋””์–ธํŠธ๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ์„œ๋ฒ„๋“ค์— ๋ณด๊ด€๋œ ๊ธ€๋กœ๋ฒŒ ๋ชจ๋ธ ํŒŒ ๋ผ๋ฏธํ„ฐ๋“ค์„ ์—…๋ฐ์ดํŠธํ•œ๋‹ค. ํŒŒ๋ผ๋ฏธํ„ฐ ์„œ๋ฒ„ ์‹œ์Šคํ…œ์˜ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด, ์ด ๋…ผ๋ฌธ์—์„œ๋Š” ์ž๋™์ ์œผ๋กœ ์ž์› ํšจ์œจ์„ฑ๊ณผ ์‹œ์Šคํ…œ ์„ฑ๋Šฅ์„ ์ตœ์ ํ™”ํ•˜๋Š” ๋‘๊ฐ€์ง€์˜ ํ•ด๋ฒ•์„ ์ œ์•ˆํ•œ๋‹ค. ์ฒซ๋ฒˆ์งธ ํ•ด๋ฒ•์€, ํŒŒ๋ผ๋ฏธํ„ฐ ์‹œ์Šคํ…œ์—์„œ ๋ถ„์‚ฐ ๊ธฐ๊ณ„ ํ•™์Šต์„ ์ˆ˜ํ–‰์‹œ์— ์ž์› ์„ค์ • ๋ฐ ์›Œํฌ๋กœ๋“œ ๋ถ„๋ฐฐ๋ฅผ ์ž๋™ํ™”ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. ์ตœ๊ณ ์˜ ์„ค์ •์„ ์ฐพ๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” ์˜จ๋ผ์ธ ๋ฉ”ํŠธ๋ฆญ์„ ์‚ฌ์šฉํ•˜๋Š” ๋น„์šฉ ๋ชจ๋ธ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ํ•˜๋Š” Optimizer๋ฅผ ๋งŒ๋“ค์—ˆ๋‹ค. Optimizer์˜ ๊ฒฐ์ •์„ ํšจ์œจ์ ์œผ๋กœ ์ ์šฉํ•˜๊ธฐ ์œ„ํ•ด, ์šฐ๋ฆฌ๋Š” ๋Ÿฐํƒ€์ž„์„ ๋™์  ์žฌ์„ค์ •์„ ์ตœ์†Œ์˜ ์˜ค๋ฒ„ํ—ค๋“œ๋กœ ๋ฐฑ๊ทธ๋ผ์šด๋“œ์—์„œ ์ˆ˜ํ–‰ํ•˜๋„๋ก ๋””์ž์ธํ–ˆ๋‹ค. ๋‘๋ฒˆ์งธ ํ•ด๋ฒ•์€ ๊ณต์œ  ํด๋Ÿฌ์Šคํ„ฐ ์ƒํ™ฉ์—์„œ ์—ฌ๋Ÿฌ ๊ฐœ์˜ ๊ธฐ๊ณ„ ํ•™์Šต ์ž‘์—…์˜ ์„ธ๋ถ€ ์ž‘์—… ๊ณผ ์ž์›์˜ ์Šค์ผ€์ฅด๋ง์„ ์ตœ์ ํ™”ํ•œ ๊ฒƒ์ด๋‹ค. 
๊ตฌ์ฒด์ ์œผ๋กœ, ์šฐ๋ฆฌ๋Š” ์„ธ๋ถ€ ์ž‘์—…๋“ค์„ ์„ธ๋ฐ€ํ•œ ๋‹จ์œ„๋กœ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ์จ ์ž์› ๊ฒฝ์Ÿ์„ ์–ต์ œํ•˜๊ณ , ์„œ๋กœ๋ฅผ ๋ณด์™„ํ•˜๋Š” ์ž์› ์‚ฌ์šฉ ํŒจํ„ด์„ ๋ณด์ด๋Š” ์ž‘์—…๋“ค์„ ๊ฐ™์€ ์ž์›์— ํ•จ๊ป˜ ์œ„์น˜์‹œ์ผœ ์ž์› ํ™œ์šฉ์œจ์„ ๋Œ์–ด์˜ฌ๋ ธ๋‹ค. ํ•จ๊ป˜ ์œ„์น˜ํ•œ ์ž‘์—…๋“ค์˜ ๋ฉ”๋ชจ๋ฆฌ ์••๋ ฅ์„ ๊ฒฝ๊ฐ์‹œํ‚ค๊ธฐ ์œ„ํ•ด ์šฐ๋ฆฌ๋Š” ๋™์ ์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ๋””์Šคํฌ๋กœ ๋‚ด๋ ธ๋‹ค๊ฐ€ ๋‹ค์‹œ ๋ฉ”๋ชจ๋ฆฌ๋กœ ์ฝ์–ด์˜ค๋Š” ๊ธฐ๋Šฅ์„ ์ง€์›ํ•จ๊ณผ ๋™์‹œ์—, ๋””์Šคํฌ์™€ ๋ฉ”๋ชจ๋ฆฌ๊ฐ„์˜ ๋ฐ์ดํ„ฐ ๋น„์œจ์„ ์ƒํ™ฉ์— ๋งž๊ฒŒ ์‹œ์Šคํ…œ์ด ์ž๋™์œผ๋กœ ๋งž์ถ”๋„๋ก ํ•˜์˜€๋‹ค. ์œ„์˜ ํ•ด๋ฒ•๋“ค์„ ์‹ค์ฒดํ™”ํ•˜๊ธฐ ์œ„ํ•ด, ์‹ค์ œ ๋™์ž‘ํ•˜๋Š” ์‹œ์Šคํ…œ์„ ๋งŒ๋“ค์—ˆ๋‹ค. ๋‘๊ฐ€์ง€์˜ ํ•ด๋ฒ•์„ ํ•˜๋‚˜์˜ ์‹œ์Šคํ…œ์— ๊ตฌํ˜„ํ•จ์œผ๋กœ์จ, ๋™์ ์œผ๋กœ ์ž‘์—…์„ ๋จธ์‹  ๊ฐ„์— ์˜ฎ๊ธฐ๊ณ  ์ž์›์„ ์žฌํ• ๋‹นํ•  ์ˆ˜ ์žˆ๋Š” ๋Ÿฐํƒ€์ž„์„ ๊ณต์œ ํ•œ๋‹ค. ํ•ด๋‹น ์†”๋ฃจ์…˜๋“ค์˜ ํšจ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•ด, ์ด ์‹œ์Šคํ…œ์„ ๋งŽ์ด ์‚ฌ์šฉ๋˜๋Š” ๊ธฐ๊ณ„ ํ•™์Šต ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜์œผ๋กœ ์‹คํ—˜ํ•˜์˜€๊ณ  ๊ธฐ์กด ์‹œ์Šคํ…œ๋“ค ๋Œ€๋น„ ๋›ฐ์–ด๋‚œ ์„ฑ๋Šฅ ํ–ฅ์ƒ์„ ๋ณด์—ฌ์ฃผ์—ˆ๋‹ค.Chapter1. Introduction 1 1.1 Distributed Machine Learning on Parameter Servers 1 1.2 Automating System Conguration of Distributed Machine Learning 2 1.3 Scheduling of Multiple Distributed Machine Learning Jobs 3 1.4 Contributions 5 1.5 Dissertation Structure 6 Chapter2. Background 7 Chapter3. Automating System Conguration of Distributed Machine Learning 10 3.1 System Conguration Challenges 11 3.2 Finding Good System Conguration 13 3.2.1 Cost Model 13 3.2.2 Cost Formulation 15 3.2.3 Optimization 16 3.3 Cruise 18 3.3.1 Optimizer 19 3.3.2 Elastic Runtime 21 3.4 Evaluation 26 3.4.1 Experimental Setup 26 3.4.2 Finding Baselines with Grid Search 28 3.4.3 Optimization in the Homogeneous Environment 28 3.4.4 Utilizing Opportunistic Resources 30 3.4.5 Optimization in the Heterogeneous Environment 31 3.4.6 Reconguration Speed 32 3.5 Related Work 33 3.6 Summary 34 Chapter4 A Scheduling Framework Optimized for Multiple Distributed Machine Learning Jobs 36 4.1 Resource Under-utilization Problems in PS ML Training 37 4.2 Harmony Overview 42 4.3 Multiplexing ML Jobs 43 4.3.1 Fine-grained Execution with Subtasks 44 4.3.2 Dynamic Grouping of Jobs 45 4.3.3 Dynamic Data Reloading 54 4.4 Evaluation 56 4.4.1 Baselines 56 4.4.2 Experimental Setup 57 4.4.3 Performance Comparison 59 4.4.4 Performance Breakdown 59 4.4.5 Workload Sensitivity Analysis 61 4.4.6 Accuracy of the Performance Model 63 4.4.7 Performance and Scalability of the Scheduling Algorithm 64 4.4.8 Dynamic Data Reloading 66 4.5 Discussion 67 4.6 Related Work 67 4.7 Summary 70 Chapter5 Conclusion 71 5.1 Summary 71 5.2 Future Work 71 5.2.1 Other Communication Architecture Support 71 5.2.2 Deep Learning & GPU Resource Support 72 ์š”์•ฝ 81Docto

    Regularized Bottleneck with Early Labeling

    Get PDF
    Small IoT devices, such as drones and lightweight battery-powered robots, are emerging as a major platform for the deployment of AI/ML capabilities. Autonomous and semi-autonomous device operation relies on the systematic use of deep neural network models for solving complex tasks, such as image classification. The challenging restrictions of these devices in terms of computing capabilities, network connectivity, and power consumption are the main limits to the accuracy of latency-sensitive inferences. This paper presents ReBEL, a split computing architecture enabling the dynamic remote offload of partial computations or, alternatively, a local approximate labeling based on a jointly-trained classifier. Our approach combines elements of head network distillation, early exit classification, and bottleneck injection with the goal of reducing the average end-to-end latency of AI/ML inference on constrained IoT devices.
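
    The early-labeling decision the abstract describes can be sketched as a simple confidence test at the bottleneck. Everything in the sketch below is an illustrative assumption (random toy weights, a tanh bottleneck, a 0.8 threshold, a stubbed offload), not ReBEL's trained models or offloading protocol.

    import numpy as np

    rng = np.random.default_rng(2)
    W_head = rng.normal(size=(32, 8))   # on-device head network -> 8-d bottleneck
    W_exit = rng.normal(size=(8, 10))   # lightweight early-exit classifier

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def offload_to_server(features):
        # stand-in for shipping the compact bottleneck tensor to the remote tail
        return int(rng.integers(10))

    def infer(x, threshold=0.8):
        bottleneck = np.tanh(x @ W_head)        # cheap local computation
        probs = softmax(bottleneck @ W_exit)
        if probs.max() >= threshold:
            # early labeling: confident enough to skip the network entirely
            return int(probs.argmax()), "local"
        return offload_to_server(bottleneck), "offloaded"

    label, path = infer(rng.normal(size=32))
    print(label, path)

    The design point this illustrates is that only the small bottleneck features, never the raw input, cross the network, and even that transfer is skipped whenever the early classifier is confident.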

    SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

    Full text link
    Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous, and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200 Mb/s network bandwidth.
    Comment: Accepted to International Conference on Machine Learning (ICML) 2023. 25 pages, 8 figures.
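
    The idea of temporary randomized pipelines can be sketched with per-stage peer pools. In the toy sketch below, each microbatch routes through one randomly chosen live peer per stage, and a failed peer is simply dropped from its pool so later microbatches route around it; the pool sizes and failure model are illustrative assumptions, not SWARM's actual rebalancing logic.

    import random

    random.seed(3)
    # four pipeline stages, each served by a pool of interchangeable peers
    stages = [{f"s{k}_peer{i}" for i in range(3)} for k in range(4)]

    def route_microbatch():
        # build a temporary pipeline: one random live peer per stage
        return [random.choice(sorted(pool)) for pool in stages]

    def handle_failure(stage_idx, peer):
        # drop the dead peer; the surviving pool members absorb its load
        stages[stage_idx].discard(peer)

    for mb in range(5):
        path = route_microbatch()
        if mb == 2:  # simulate a preempted instance at stage 1
            handle_failure(1, path[1])
        print(f"microbatch {mb} via {path}")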
    • โ€ฆ
    corecore