Efficient Execution of Machine Learning Workloads in GPU Environments

Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2023. Advisor: Byung-Gon Chun.

Machine learning (ML) workloads are becoming increasingly important in many types of real-world applications. We attribute this trend to the development of software systems for ML, which have facilitated the widespread adoption of heterogeneous accelerators such as GPUs. Today's ML software stack has made great improvements in terms of efficiency; however, not all use cases are well supported. In this dissertation, we study how to improve the execution efficiency of ML workloads on GPUs from a software system perspective. We identify workloads where current systems for ML are inefficient in utilizing GPUs and devise new system techniques that handle those workloads efficiently.
We first present Nimble, an ML execution engine equipped with carefully optimized GPU scheduling. The proposed scheduling techniques improve execution efficiency by up to 22.34×. Second, we propose Orca, an inference serving system specialized for Transformer-based generative models. By incorporating new scheduling and batching techniques, Orca significantly outperforms state-of-the-art systems, achieving a 36.9× throughput improvement at the same level of latency. The last topic of this dissertation is WindTunnel, a framework that translates classical ML pipelines into neural networks, providing GPU training capabilities for classical ML workloads. WindTunnel also allows joint training of pipeline components via backpropagation, resulting in improved accuracy over both the original pipelines and neural network baselines.
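The batching idea the abstract attributes to Orca can be illustrated with a toy simulation: instead of holding a batch fixed until every request in it finishes, the engine re-forms the batch after every model iteration, so completed sequences free their slots immediately for queued requests. The sketch below is a minimal illustration under assumed names (`serve`, `max_batch_size`), not Orca's actual implementation.

```python
from collections import deque

def serve(requests, max_batch_size):
    """Toy iteration-level scheduler: one "iteration" generates one token
    for every active request; finished requests retire immediately and
    queued requests join the batch, rather than waiting for the whole
    batch to drain. `requests` is a list of (request_id, tokens_needed)
    pairs; returns the total number of iterations used."""
    queue = deque(requests)
    active = {}          # request_id -> tokens still to generate
    iterations = 0
    while queue or active:
        # Admit queued requests up to the batch-size limit.
        while queue and len(active) < max_batch_size:
            rid, length = queue.popleft()
            active[rid] = length
        # One model iteration: every active request emits one token.
        iterations += 1
        for rid in list(active):
            active[rid] -= 1
            if active[rid] == 0:
                del active[rid]   # retire now, freeing a slot at once
    return iterations
```

With requests of lengths 3, 1, and 2 and a batch size of 2, this scheduler finishes in 3 iterations, whereas request-level batching would need 5 (the short request would hold its slot idle until its batch-mate finished), which is the source of the throughput gap the abstract describes.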
Chapter 1 Introduction 1
1.1 Motivation 1
1.2 Dissertation Overview 2
1.3 Previous Publications 4
1.4 Roadmap 5
Chapter 2 Background 6
2.1 ML Workloads 6
2.2 The GPU Execution Model 7
2.3 GPU Scheduling in ML Frameworks 8
2.4 Engine Scheduling in Inference Servers 10
2.5 Inference Procedure of Generative Models 11
Chapter 3 Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning 17
3.1 Introduction 17
3.2 Motivation 21
3.3 System Design 24
3.3.1 Ahead-of-time (AoT) Scheduling 25
3.3.2 Stream Assignment Algorithm 28
3.4 Evaluation 32
3.4.1 Inference Latency 36
3.4.2 Impact of Multi-stream Execution 36
3.4.3 Training Throughput 37
3.5 Summary 38
Chapter 4 Orca: A Distributed Serving System for Transformer-Based Generative Models 40
4.1 Introduction 40
4.2 Challenges and Proposed Solutions 44
4.3 Orca System Design 51
4.3.1 Distributed Architecture 51
4.3.2 Scheduling Algorithm 54
4.4 Implementation 60
4.5 Evaluation 61
4.5.1 Engine Microbenchmark 63
4.5.2 End-to-end Performance 66
4.6 Summary 71
Chapter 5 WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model 72
5.1 Introduction 72
5.2 Pipeline Translation 78
5.2.1 Translating Arithmetic Operators 80
5.2.2 Translating Algorithmic Operators: GBDT 81
5.2.3 Translating Algorithmic Operators for Categorical Features 85
5.2.4 Fine-Tuning 87
5.3 Implementation 87
5.4 Experiments 88
5.4.1 Experimental Setup 89
5.4.2 Overall Performance 94
5.4.3 Ablation Study 95
5.5 Summary 98
Chapter 6 Related Work 99
Chapter 7 Conclusion 105
Bibliography 107
Appendix A Appendix: Nimble 131
A.1 Proofs on the Stream Assignment Algorithm of Nimble 131
A.1.1 Proof of Theorem 1 132
A.1.2 Proof of Theorem 2 134
A.1.3 Proof of Theorem 3 135
A.1.4 Time Complexity Analysis 137
A.2 Evaluation Results on Various GPUs 139
A.3 Evaluation Results on Different Training Batch Sizes 139