
    Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

    This paper introduces Tiramisu, a polyhedral framework designed to generate high-performance code for multiple platforms, including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions to explicitly manage the complexities that arise when targeting these systems. The framework is designed for the areas of image processing, stencils, linear algebra, and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model, and it has a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithms, loop transformations, data layouts, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and comparing them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.
    Comment: arXiv admin note: substantial text overlap with arXiv:1803.0041
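    The separation Tiramisu draws between an algorithm and its schedule can be illustrated with a small sketch. The Python below is only conceptual (Tiramisu itself exposes its scheduling language through a C++ API): the same computation runs under a default schedule and under a tiled schedule, and only the loop structure changes, not the algorithm.

```python
# Conceptual sketch of algorithm/schedule separation (not Tiramisu's API).
# The "algorithm" says WHAT to compute; each "schedule" decides the loops.

def algorithm(a, b, i, j):
    return a[i][j] + b[i][j]

def run_default(a, b, n):
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[i][j] = algorithm(a, b, i, j)
    return out

def run_tiled(a, b, n, tile=4):
    # Same algorithm, different schedule: loops tiled for cache locality.
    out = [[0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, n, tile):
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, n)):
                    out[i][j] = algorithm(a, b, i, j)
    return out

if __name__ == "__main__":
    n = 8
    a = [[i + j for j in range(n)] for i in range(n)]
    b = [[i * j for j in range(n)] for i in range(n)]
    # Both schedules must produce identical results.
    assert run_default(a, b, n) == run_tiled(a, b, n)
```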

    GPU ν™˜κ²½μ—μ„œ λ¨Έμ‹ λŸ¬λ‹ μ›Œν¬λ‘œλ“œμ˜ 효율적인 μ‹€ν–‰

    Doctoral dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2023. Advisor: Byung-Gon Chun.

    Machine learning (ML) workloads are becoming increasingly important in many types of real-world applications. We attribute this trend to the development of software systems for ML, which have facilitated the widespread adoption of heterogeneous accelerators such as GPUs. Today's ML software stack has made great improvements in terms of efficiency; however, not all use cases are well supported. In this dissertation, we study how to improve the execution efficiency of ML workloads on GPUs from a software system perspective. We identify workloads where current systems for ML are inefficient in utilizing GPUs and devise new system techniques that handle those workloads efficiently.

    We first present Nimble, an ML execution engine equipped with carefully optimized GPU scheduling. The proposed scheduling techniques can be used to improve execution efficiency by up to 22.34Γ—. Second, we propose Orca, an inference serving system specialized for Transformer-based generative models. By incorporating new scheduling and batching techniques, Orca significantly outperforms state-of-the-art systems, achieving a 36.9Γ— throughput improvement at the same level of latency. The last topic of this dissertation is WindTunnel, a framework that translates classical ML pipelines into neural networks, providing GPU training capabilities for classical ML workloads. WindTunnel also allows joint training of pipeline components via backpropagation, resulting in improved accuracy over the original pipeline and neural network baselines.
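    As a rough illustration of the kind of multi-stream GPU execution Nimble's scheduling enables, the sketch below uses PyTorch's stream API to overlap two independent branches of a network. It is not Nimble's engine: a real scheduler must also assign streams automatically and reason about allocator/stream interactions, which is part of what the dissertation addresses.

```python
import torch

# Minimal multi-stream sketch: two independent branches are launched on
# separate CUDA streams so their kernels can overlap, then joined.
assert torch.cuda.is_available()

x = torch.randn(64, 256, device="cuda")
branch_a = torch.nn.Linear(256, 256).cuda()
branch_b = torch.nn.Linear(256, 256).cuda()

s_a = torch.cuda.Stream()
s_b = torch.cuda.Stream()

# Side streams must wait for work already queued on the default stream.
s_a.wait_stream(torch.cuda.current_stream())
s_b.wait_stream(torch.cuda.current_stream())

with torch.cuda.stream(s_a):
    out_a = branch_a(x)
with torch.cuda.stream(s_b):
    out_b = branch_b(x)

# Join: the default stream waits for both branches before combining them.
torch.cuda.current_stream().wait_stream(s_a)
torch.cuda.current_stream().wait_stream(s_b)
out = out_a + out_b
torch.cuda.synchronize()
```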
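    Orca's key idea, iteration-level scheduling in which the batch is re-formed at every model iteration, can be sketched with a toy scheduler loop. Everything below (the Request class, model_step, the batch limit) is a hypothetical stand-in for illustration, not Orca's implementation: the point is only that finished sequences leave immediately and queued requests join without waiting for the whole batch to drain.

```python
from collections import deque

class Request:
    def __init__(self, rid, max_new_tokens):
        self.rid = rid
        self.generated = 0
        self.max_new_tokens = max_new_tokens

def model_step(batch):
    # Stand-in for one Transformer decoding iteration over the batch.
    for req in batch:
        req.generated += 1

def serve(pending, max_batch=4):
    running = []
    while pending or running:
        # Admit new requests up to the batch limit at EVERY iteration,
        # rather than only when the previous batch fully completes.
        while pending and len(running) < max_batch:
            running.append(pending.popleft())
        model_step(running)
        # Retire finished requests immediately instead of batch-draining.
        for r in running:
            if r.generated >= r.max_new_tokens:
                print(f"request {r.rid} finished after {r.generated} tokens")
        running = [r for r in running if r.generated < r.max_new_tokens]

serve(deque(Request(i, 2 + i) for i in range(6)))
```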
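    WindTunnel's translation can be illustrated on a minimal pipeline: a fitted scaler and logistic regression become a single differentiable torch module whose stages are then fine-tuned jointly by backpropagation. This is a sketch of the idea only; WindTunnel's actual operator translations (e.g., for GBDTs and categorical features) are more involved.

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Fit a classical two-stage pipeline on synthetic data.
X = np.random.randn(256, 8).astype(np.float32)
y = (X[:, 0] > 0).astype(np.int64)
scaler = StandardScaler().fit(X)
clf = LogisticRegression().fit(scaler.transform(X), y)

class TranslatedPipeline(torch.nn.Module):
    """Both pipeline stages as one differentiable module."""
    def __init__(self):
        super().__init__()
        # The scaler becomes a learnable affine stage, initialized from
        # its fitted statistics so the translation is accuracy-preserving.
        self.shift = torch.nn.Parameter(torch.tensor(scaler.mean_, dtype=torch.float32))
        self.scale = torch.nn.Parameter(torch.tensor(scaler.scale_, dtype=torch.float32))
        self.linear = torch.nn.Linear(8, 1)
        with torch.no_grad():
            self.linear.weight.copy_(torch.tensor(clf.coef_, dtype=torch.float32))
            self.linear.bias.copy_(torch.tensor(clf.intercept_, dtype=torch.float32))

    def forward(self, x):
        return self.linear((x - self.shift) / self.scale)

# Joint fine-tuning: gradients now flow through BOTH stages at once.
model = TranslatedPipeline()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.BCEWithLogitsLoss()
xb = torch.tensor(X)
yb = torch.tensor(y, dtype=torch.float32).unsqueeze(1)
for _ in range(20):
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()

with torch.no_grad():
    acc = ((model(xb) > 0).squeeze(1).long() == torch.tensor(y)).float().mean()
print(f"accuracy after joint fine-tuning: {acc:.3f}")
```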
λ˜ν•œ WindTunnel은 gradient backpropagation을 톡해 νŒŒμ΄ν”„λΌμΈμ˜ μ—¬λŸ¬ μš”μ†Œλ₯Ό ν•œ λ²ˆμ— κ³΅λ™μœΌλ‘œ ν•™μŠ΅ ν•  수 있으며, 이λ₯Ό 톡해 νŒŒμ΄ν”„λΌμΈμ˜ 정확도λ₯Ό 더 ν–₯μƒμ‹œν‚¬ 수 μžˆμŒμ„ ν™•μΈν•˜μ˜€λ‹€.Chapter 1 Introduction 1 1.1 Motivation 1 1.2 Dissertation Overview 2 1.3 Previous Publications 4 1.4 Roadmap 5 Chapter 2 Background 6 2.1 ML Workloads 6 2.2 The GPU Execution Model 7 2.3 GPU Scheduling in ML Frameworks 8 2.4 Engine Scheduling in Inference Servers 10 2.5 Inference Procedure of Generative Models 11 Chapter 3 Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning 17 3.1 Introduction 17 3.2 Motivation 21 3.3 System Design 24 3.3.1 Ahead-of-time (AoT) Scheduling 25 3.3.2 Stream Assignment Algorithm 28 3.4 Evaluation 32 3.4.1 Inference Latency 36 3.4.2 Impact of Multi-stream Execution 36 3.4.3 Training Throughput 37 3.5 Summary 38 Chapter 4 Orca: A Distributed Serving System for Transformer-Based Generative Models 40 4.1 Introduction 40 4.2 Challenges and Proposed Solutions 44 4.3 Orca System Design 51 4.3.1 Distributed Architecture 51 4.3.2 Scheduling Algorithm 54 4.4 Implementation 60 4.5 Evaluation 61 4.5.1 Engine Microbenchmark 63 4.5.2 End-to-end Performance 66 4.6 Summary 71 Chapter 5 WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model 72 5.1 Introduction 72 5.2 Pipeline Translation 78 5.2.1 Translating Arithmetic Operators 80 5.2.2 Translating Algorithmic Operators: GBDT 81 5.2.3 Translating Algorithmic Operators for Categorical Features 85 5.2.4 Fine-Tuning 87 5.3 Implementation 87 5.4 Experiments 88 5.4.1 Experimental Setup 89 5.4.2 Overall Performance 94 5.4.3 Ablation Study 95 5.5 Summary 98 Chapter 6 Related Work 99 Chapter 7 Conclusion 105 Bibliography 107 Appendix A Appendix: Nimble 131 A.1 Proofs on the Stream Assignment Algorithm of Nimble 131 A.1.1 Proof of Theorem 1 132 A.1.2 Proof of Theorem 2 134 A.1.3 Proof of Theorem 3 135 A.1.4 Time Complexity Analysis 137 A.2 Evaluation Results on Various GPUs 139 A.3 Evaluation Results on Different Training Batch Sizes 139λ°•