3,459 research outputs found
Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
This paper introduces Tiramisu, a polyhedral framework designed to generate
high-performance code for multiple platforms, including multicores, GPUs, and
distributed machines. Tiramisu introduces a scheduling language with novel
extensions to explicitly manage the complexities that arise when targeting
these systems. The framework is designed for the areas of image processing,
stencils, linear algebra and deep learning. Tiramisu has two main features: it
relies on a flexible representation based on the polyhedral model and it has a
rich scheduling language allowing fine-grained control of optimizations.
Tiramisu uses a four-level intermediate representation that allows full
separation between the algorithms, loop transformations, data layouts, and
communication. This separation simplifies targeting multiple hardware
architectures with the same algorithm. We evaluate Tiramisu by writing a set of
image processing, deep learning, and linear algebra benchmarks and compare them
with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu
matches or outperforms existing compilers and libraries on different hardware
architectures, including multicore CPUs, GPUs, and distributed machines.
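Tiramisu itself is a C++ embedded DSL, so the following Python toy (the `Schedule` and `run` names are hypothetical, not Tiramisu's API) only illustrates the separation the abstract describes: the algorithm fixes what is computed, while an independent schedule decides loop transformations and parallelism.

```python
# Toy illustration of Tiramisu-style algorithm/schedule separation.
# These classes are hypothetical stand-ins, not Tiramisu's C++ API.
from concurrent.futures import ThreadPoolExecutor

def blur_algorithm(src, i, j):
    # Algorithm: *what* to compute for element (i, j), with no
    # commitment to loop order, tiling, or parallelism.
    return (src[i][j] + src[i][j + 1] + src[i][j + 2]) / 3.0

class Schedule:
    """Schedule: *how* to run the loop nest over the same algorithm."""
    def __init__(self, tile=1, parallel=False):
        self.tile = tile          # row-tile size (a loop transformation)
        self.parallel = parallel  # run tiles on a thread pool

def run(algorithm, src, rows, cols, sched):
    out = [[0.0] * cols for _ in range(rows)]
    def do_tile(r0):
        for i in range(r0, min(r0 + sched.tile, rows)):
            for j in range(cols):
                out[i][j] = algorithm(src, i, j)
    starts = range(0, rows, sched.tile)
    if sched.parallel:
        with ThreadPoolExecutor() as pool:
            list(pool.map(do_tile, starts))
    else:
        for r0 in starts:
            do_tile(r0)
    return out

# Same algorithm, two schedules: a serial baseline and a tiled,
# parallel variant for a multicore target.
src = [[float(i + j) for j in range(18)] for i in range(16)]
a = run(blur_algorithm, src, 16, 16, Schedule())
b = run(blur_algorithm, src, 16, 16, Schedule(tile=4, parallel=True))
assert a == b  # schedules change performance, not results
```

In Tiramisu the same decoupling extends across its four IR levels, so data layout and communication can also be retargeted without touching the algorithm.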
Efficient Execution of Machine Learning Workloads in GPU Environments
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Dept. of Computer Science and Engineering, February 2023. Advisor: Byung-Gon Chun.

Machine learning (ML) workloads are becoming increasingly important in many types of real-world applications. We attribute this trend to the development of software systems for ML, which have facilitated the widespread adoption of heterogeneous accelerators such as GPUs. Today's ML software stack has made great improvements in terms of efficiency; however, not all use cases are well supported. In this dissertation, we study how to improve the execution efficiency of ML workloads on GPUs from a software system perspective. We identify workloads where current systems for ML are inefficient in utilizing GPUs and devise new system techniques that handle those workloads efficiently.

We first present Nimble, an ML execution engine equipped with carefully optimized GPU scheduling. The proposed scheduling techniques can improve execution efficiency by up to 22.34×. Second, we propose Orca, an inference serving system specialized for Transformer-based generative models. By incorporating new scheduling and batching techniques, Orca significantly outperforms state-of-the-art systems, achieving a 36.9× throughput improvement at the same level of latency. The last topic of this dissertation is WindTunnel, a framework that translates classical ML pipelines into neural networks, providing GPU training capabilities for classical ML workloads. WindTunnel also allows joint training of pipeline components via backpropagation, resulting in improved accuracy over the original pipeline and neural network baselines.
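As a rough sketch of the multi-stream GPU execution that Nimble's stream assignment automates, independent branches of a model can be issued on separate CUDA streams in stock PyTorch. This illustrates the general technique only; it is not Nimble's implementation, which assigns streams ahead of time from the model's dependency graph.

```python
# Sketch: issuing independent GPU work on parallel CUDA streams,
# the execution style that Nimble's stream assignment automates.
# Stock PyTorch APIs; not Nimble's implementation.
import torch

assert torch.cuda.is_available()
x = torch.randn(64, 1024, device="cuda")
w1 = torch.randn(1024, 1024, device="cuda")
w2 = torch.randn(1024, 1024, device="cuda")

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
main = torch.cuda.current_stream()

# Both side streams must see x fully materialized before starting.
s1.wait_stream(main)
s2.wait_stream(main)

with torch.cuda.stream(s1):
    y1 = x @ w1          # branch 1, runs on stream s1
with torch.cuda.stream(s2):
    y2 = x @ w2          # branch 2, may overlap with branch 1

# Join: the main stream waits for both branches before combining.
main.wait_stream(s1)
main.wait_stream(s2)
y = y1 + y2
torch.cuda.synchronize()
```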
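Orca's scheduling granularity can likewise be sketched: instead of admitting and retiring requests a whole batch at a time, the serving loop re-forms the batch at every model iteration, so finished sequences leave immediately and queued ones join. The toy below uses hypothetical names and a stand-in for the model's forward pass; it is not Orca's code.

```python
# Toy of iteration-level scheduling for generative inference: the
# batch is rebuilt every iteration, so short requests free their
# slots early instead of waiting for the longest request to finish.
from collections import deque

class Request:
    def __init__(self, rid, max_new_tokens):
        self.rid = rid
        self.generated = 0
        self.max_new_tokens = max_new_tokens

    def done(self):
        return self.generated >= self.max_new_tokens

def run_iteration(batch):
    # Stand-in for one forward pass: every scheduled request
    # produces exactly one new token.
    for req in batch:
        req.generated += 1

def serve(queue, max_batch=4):
    running, completed = [], []
    while queue or running:
        # Iteration-level admission: fill free slots immediately,
        # without waiting for the current batch to drain.
        while queue and len(running) < max_batch:
            running.append(queue.popleft())
        run_iteration(running)
        # Retire finished requests right away, freeing their slots.
        completed += [r for r in running if r.done()]
        running = [r for r in running if not r.done()]
    return completed

queue = deque(Request(i, (i % 3) + 1) for i in range(8))
done = serve(queue)
print([r.rid for r in done])  # short requests finish early; slots are reused
```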
Table of contents:
Chapter 1 Introduction
1.1 Motivation
1.2 Dissertation Overview
1.3 Previous Publications
1.4 Roadmap
Chapter 2 Background
2.1 ML Workloads
2.2 The GPU Execution Model
2.3 GPU Scheduling in ML Frameworks
2.4 Engine Scheduling in Inference Servers
2.5 Inference Procedure of Generative Models
Chapter 3 Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning
3.1 Introduction
3.2 Motivation
3.3 System Design
3.3.1 Ahead-of-time (AoT) Scheduling
3.3.2 Stream Assignment Algorithm
3.4 Evaluation
3.4.1 Inference Latency
3.4.2 Impact of Multi-stream Execution
3.4.3 Training Throughput
3.5 Summary
Chapter 4 Orca: A Distributed Serving System for Transformer-Based Generative Models
4.1 Introduction
4.2 Challenges and Proposed Solutions
4.3 Orca System Design
4.3.1 Distributed Architecture
4.3.2 Scheduling Algorithm
4.4 Implementation
4.5 Evaluation
4.5.1 Engine Microbenchmark
4.5.2 End-to-end Performance
4.6 Summary
Chapter 5 WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model
5.1 Introduction
5.2 Pipeline Translation
5.2.1 Translating Arithmetic Operators
5.2.2 Translating Algorithmic Operators: GBDT
5.2.3 Translating Algorithmic Operators for Categorical Features
5.2.4 Fine-Tuning
5.3 Implementation
5.4 Experiments
5.4.1 Experimental Setup
5.4.2 Overall Performance
5.4.3 Ablation Study
5.5 Summary
Chapter 6 Related Work
Chapter 7 Conclusion
Bibliography
Appendix A: Nimble
A.1 Proofs on the Stream Assignment Algorithm of Nimble
A.1.1 Proof of Theorem 1
A.1.2 Proof of Theorem 2
A.1.3 Proof of Theorem 3
A.1.4 Time Complexity Analysis
A.2 Evaluation Results on Various GPUs
A.3 Evaluation Results on Different Training Batch Sizes
- …