
    Adaptive Transactional Memories: Performance and Energy Consumption Tradeoffs

    Energy efficiency is becoming a pressing issue, especially in large data centers, where it entails, at the same time, a non-negligible management cost, an increased probability of hardware faults, and a significant environmental footprint. In this paper, we study how Software Transactional Memory (STM) can provide benefits in terms of both power saving and overall application execution performance. This is related to the fact that encapsulating shared-data accesses within transactions gives the STM middleware the freedom to both ensure consistency and reduce the actual data contention, the latter having been shown to affect the overall power needed to complete the application's execution. We have selected a set of self-adaptive extensions to existing STM middlewares (namely, TinySTM and R-STM) to show how self-adapting computation can better capture the actual degree of parallelism and/or logical contention on shared data, further enhancing the intrinsic benefits provided by STM. Of course, this benefit comes at a cost: the execution time required by the proposed approaches to precisely tune the execution parameters for reducing power consumption and enhancing execution performance. Nevertheless, the results provided here show that adaptivity is a strictly necessary requirement for reducing energy consumption in STM systems: without it, no acceptable level of energy efficiency can be reached at all.
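
    The tuning loop this abstract alludes to can be pictured with a short, self-contained sketch. The controller below is a hypothetical stand-in (a hill-climbing loop that throttles the number of active worker threads based on observed commit throughput), and the CAS-based "transaction" merely plays the role of an STM commit; none of this is the TinySTM or R-STM API.

        // Sketch of an adaptive concurrency controller for an STM-like workload.
        // Hypothetical stand-ins throughout: the CAS loop models a transactional
        // commit, and "parking" a thread models lowering the concurrency level.
        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <thread>
        #include <vector>

        std::atomic<long> commits{0}, aborts{0};
        std::atomic<int> active_threads{4};  // concurrency level being tuned
        std::atomic<bool> stop{false};

        void worker(int id, std::atomic<long>& shared) {
            while (!stop.load(std::memory_order_relaxed)) {
                // Threads above the current concurrency level stay parked.
                if (id >= active_threads.load(std::memory_order_relaxed)) {
                    std::this_thread::sleep_for(std::chrono::milliseconds(1));
                    continue;
                }
                long v = shared.load();
                if (shared.compare_exchange_weak(v, v + 1))
                    commits.fetch_add(1, std::memory_order_relaxed);
                else
                    aborts.fetch_add(1, std::memory_order_relaxed);  // "rollback"
            }
        }

        int main() {
            const int kMaxThreads = 8;
            std::atomic<long> shared{0};
            std::vector<std::thread> pool;
            for (int i = 0; i < kMaxThreads; ++i)
                pool.emplace_back(worker, i, std::ref(shared));

            long prev = -1;
            int dir = +1;  // hill-climbing direction: grow or shrink the pool
            for (int epoch = 0; epoch < 20; ++epoch) {
                commits = 0;
                aborts = 0;
                std::this_thread::sleep_for(std::chrono::milliseconds(50));
                long c = commits.load();
                std::printf("threads=%d commits=%ld aborts=%ld\n",
                            active_threads.load(), c, aborts.load());
                if (c < prev) dir = -dir;  // throughput dropped: reverse course
                prev = c;
                int next = active_threads.load() + dir;
                if (next < 1 || next > kMaxThreads) {
                    dir = -dir;
                    next = active_threads.load() + dir;
                }
                active_threads.store(next);
            }
            stop = true;
            for (auto& t : pool) t.join();
        }

    In a real deployment the commit/abort counters would come from the STM middleware's own statistics, and the measurement epoch would need to be long enough to smooth out workload phases.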

    GPU ํ™˜๊ฒฝ์—์„œ ๋จธ์‹ ๋Ÿฌ๋‹ ์›Œํฌ๋กœ๋“œ์˜ ํšจ์œจ์ ์ธ ์‹คํ–‰

    Doctoral dissertation -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, February 2023. Advisor: Byung-Gon Chun.
    Machine learning (ML) workloads are becoming increasingly important in many types of real-world applications. We attribute this trend to the development of software systems for ML, which have facilitated the widespread adoption of heterogeneous accelerators such as GPUs. Today's ML software stack has made great improvements in terms of efficiency; however, not all use cases are well supported. In this dissertation, we study how to improve the execution efficiency of ML workloads on GPUs from a software system perspective. We identify workloads where current systems for ML are inefficient in utilizing GPUs and devise new system techniques that handle those workloads efficiently. We first present Nimble, an ML execution engine equipped with carefully optimized GPU scheduling. The proposed scheduling techniques improve execution efficiency by up to 22.34×. Second, we propose Orca, an inference serving system specialized for Transformer-based generative models. By incorporating new scheduling and batching techniques, Orca significantly outperforms state-of-the-art systems, achieving a 36.9× throughput improvement at the same level of latency. The last topic of this dissertation is WindTunnel, a framework that translates classical ML pipelines into neural networks, providing GPU training capabilities for classical ML workloads. WindTunnel also allows joint training of pipeline components via backpropagation, resulting in improved accuracy over both the original pipeline and neural network baselines.
    Table of contents:
    Chapter 1 Introduction -- Motivation; Dissertation Overview; Previous Publications; Roadmap
    Chapter 2 Background -- ML Workloads; The GPU Execution Model; GPU Scheduling in ML Frameworks; Engine Scheduling in Inference Servers; Inference Procedure of Generative Models
    Chapter 3 Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning -- Introduction; Motivation; System Design (Ahead-of-time (AoT) Scheduling; Stream Assignment Algorithm); Evaluation (Inference Latency; Impact of Multi-stream Execution; Training Throughput); Summary
    Chapter 4 Orca: A Distributed Serving System for Transformer-Based Generative Models -- Introduction; Challenges and Proposed Solutions; Orca System Design (Distributed Architecture; Scheduling Algorithm); Implementation; Evaluation (Engine Microbenchmark; End-to-end Performance); Summary
    Chapter 5 WindTunnel: Towards Differentiable ML Pipelines Beyond a Single Model -- Introduction; Pipeline Translation (Translating Arithmetic Operators; Translating Algorithmic Operators: GBDT; Translating Algorithmic Operators for Categorical Features; Fine-Tuning); Implementation; Experiments (Experimental Setup; Overall Performance; Ablation Study); Summary
    Chapter 6 Related Work
    Chapter 7 Conclusion
    Bibliography
    Appendix A Nimble -- Proofs on the Stream Assignment Algorithm (Theorems 1-3; Time Complexity Analysis); Evaluation Results on Various GPUs; Evaluation Results on Different Training Batch Sizes
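
    To give a concrete flavor of the GPU scheduling problem Nimble addresses: independent operators in a model's dataflow graph can be assigned to different GPU streams ahead of time, with synchronization needed only on cross-stream edges. The sketch below runs a deliberately simplified greedy assignment over a toy DAG; it is not Nimble's actual synchronization-minimizing algorithm, just an illustration of the idea.

        // Toy ahead-of-time stream assignment for an operator DAG (nodes are
        // assumed to be numbered in topological order). Greedy rule: the first
        // successor inherits its parent's stream, other successors get fresh
        // streams; cross-stream edges require a synchronization event.
        #include <cstdio>
        #include <vector>

        int main() {
            // Toy DAG: 0 -> {1, 2}, 1 -> 3, 2 -> 3 (ops 1 and 2 can overlap).
            const int n = 4;
            std::vector<std::vector<int>> succ = {{1, 2}, {3}, {3}, {}};
            std::vector<int> stream(n, -1);
            int next_stream = 0;
            stream[0] = next_stream++;
            for (int u = 0; u < n; ++u) {
                bool first = true;
                for (int v : succ[u]) {
                    if (stream[v] == -1) {
                        stream[v] = first ? stream[u] : next_stream++;
                    } else if (stream[v] != stream[u]) {
                        // A real runtime would record an event here so that the
                        // consumer stream waits on the producer stream.
                        std::printf("sync needed: op%d (stream %d) -> op%d (stream %d)\n",
                                    u, stream[u], v, stream[v]);
                    }
                    first = false;
                }
            }
            for (int v = 0; v < n; ++v)
                std::printf("op%d runs on stream %d\n", v, stream[v]);
        }

    Because the assignment is computed once, ahead of time, the per-iteration scheduling overhead that normally serializes small GPU operators disappears, which is where engines like Nimble get their speedups.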

    Towards Intelligent Runtime Framework for Distributed Heterogeneous Systems

    Scientific applications strive for increased memory and computing performance, requiring massive amounts of data and time to produce results. Applications utilize large-scale, parallel computing platforms with advanced architectures to accommodate their needs. However, developing performance-portable applications for modern, heterogeneous platforms requires substantial effort and expertise in both the application and systems domains. This is even more relevant for unstructured applications, whose workflow is not statically predictable due to their heavily data-dependent nature. One possible solution to this problem is the introduction of an intelligent Domain-Specific Language (iDSL) that transparently helps to maintain correctness, hides the idiosyncrasies of low-level hardware, and scales applications. An iDSL includes domain-specific language constructs, a compilation toolchain, and a runtime providing task scheduling, data placement, and workload balancing across and within heterogeneous nodes. In this work, we focus on the runtime framework. We introduce a novel design and extension of a runtime framework, the Parallel Runtime Environment for Multicore Applications (PREMA). In response to ever-increasing intra- and inter-node concurrency, the runtime system supports efficient task scheduling and workload balancing at both levels while allowing the development of custom policies. Moreover, the new framework provides abstractions supporting the utilization of heterogeneous distributed nodes consisting of CPUs and GPUs and is extensible to other devices. We demonstrate that by utilizing this work, an application (or the iDSL) can scale its performance on heterogeneous exascale-era supercomputers with minimal effort. A future goal for this framework (out of the scope of this thesis) is to be integrated with machine learning to further improve its decision-making and performance. As a bridge to this goal, since the framework is under development, we experiment with data from nuclear physics particle accelerators and demonstrate the significant improvements achieved by utilizing machine learning in the hit-based track reconstruction process.
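
    To make the intra-node balancing idea concrete, here is a minimal work-stealing sketch of the kind of load balancing such a runtime automates: workers prefer their own queue and steal from a victim when idle. It is illustrative only; PREMA's actual tasking, data-placement, and distributed-memory interfaces are richer than this and also span GPUs and remote nodes.

        // Minimal intra-node load balancing: per-worker task queues with
        // stealing from victims when a worker's own queue runs dry.
        #include <cstdio>
        #include <deque>
        #include <functional>
        #include <mutex>
        #include <thread>
        #include <vector>

        struct Worker {
            std::deque<std::function<void()>> q;
            std::mutex m;
        };

        int main() {
            const int kWorkers = 4;
            std::vector<Worker> workers(kWorkers);
            // Deliberately imbalanced submission: every task lands on worker 0.
            for (int t = 0; t < 16; ++t)
                workers[0].q.push_back([t] { std::printf("task %d done\n", t); });

            std::vector<std::thread> pool;
            for (int i = 0; i < kWorkers; ++i) {
                pool.emplace_back([&workers, i, kWorkers] {
                    while (true) {
                        std::function<void()> task;
                        // Own queue first, then scan the others as victims.
                        for (int k = 0; k < kWorkers && !task; ++k) {
                            Worker& v = workers[(i + k) % kWorkers];
                            std::lock_guard<std::mutex> g(v.m);
                            if (!v.q.empty()) {
                                task = std::move(v.q.front());
                                v.q.pop_front();
                            }
                        }
                        if (!task) return;  // all queues drained: exit
                        task();
                    }
                });
            }
            for (auto& t : pool) t.join();
        }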

    Tuning the Level of Concurrency in Software Transactional Memory: An Overview of Recent Analytical, Machine Learning and Mixed Approaches

    Synchronization transparency offered by Software Transactional Memory (STM) must not come at the expense of run-time efficiency, thus demanding from the STM designer the inclusion of mechanisms properly oriented to performance and other quality indexes. In particular, one core issue to cope with in STM is exploiting parallelism while avoiding thrashing phenomena due to excessive transaction rollbacks, caused by excessively high levels of contention on logical resources, namely concurrently accessed data portions. One means to address run-time efficiency consists in dynamically determining the best-suited level of concurrency (number of threads) to be employed for running the application (or specific application phases) on top of the STM layer. For too-low levels of concurrency, parallelism can be hampered. Conversely, over-dimensioning the concurrency level may give rise to the aforementioned thrashing phenomena caused by excessive data contention, an aspect which also has repercussions on energy efficiency. In this chapter, we overview a set of recent techniques aimed at building "application-specific" performance models that can be exploited to dynamically tune the level of concurrency to the best-suited value. Although they share some base concepts in modeling system performance versus the degree of concurrency, these techniques rely on disparate methods, such as machine learning or analytic methods (or combinations of the two), and achieve different tradeoffs in terms of the relation between the precision of the performance model and the latency of model instantiation. Implications of the different tradeoffs in real-life scenarios are also discussed.
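
    As a toy instance of the model-based flavor of these techniques, the sketch below fits a simple concave curve, throughput(k) = a + b*k + c*k^2, to a handful of profiled (thread count, throughput) samples and selects the concurrency level at the fitted maximum. The sample numbers are invented, and published approaches use considerably richer models (e.g., abort-probability analysis or learned regressors); this only illustrates the "profile, model, then tune" loop.

        // Toy model-based concurrency tuning: quadratic least-squares fit of
        // throughput vs. thread count, then pick the thread count at the peak.
        #include <cstdio>

        static double det3(double m[3][3]) {
            return m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                 - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                 + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]);
        }

        int main() {
            // Profiled (thread count, committed transactions/sec) samples.
            const int n = 5;
            double k[n] = {1, 2, 4, 8, 16};
            double y[n] = {1.0, 1.9, 3.2, 3.0, 1.5};
            // Normal equations for the fit y = a + b*k + c*k^2.
            double S[5] = {0}, T[3] = {0};  // S[j] = sum k^j, T[j] = sum y*k^j
            for (int i = 0; i < n; ++i) {
                double p = 1.0;
                for (int j = 0; j < 5; ++j) {
                    S[j] += p;
                    if (j < 3) T[j] += y[i] * p;
                    p *= k[i];
                }
            }
            double A[3][3] = {{S[0], S[1], S[2]},
                              {S[1], S[2], S[3]},
                              {S[2], S[3], S[4]}};
            double D = det3(A), coef[3];
            for (int c = 0; c < 3; ++c) {  // Cramer's rule, column c swapped
                double M[3][3];
                for (int r = 0; r < 3; ++r)
                    for (int cc = 0; cc < 3; ++cc)
                        M[r][cc] = (cc == c) ? T[r] : A[r][cc];
                coef[c] = det3(M) / D;
            }
            // Vertex of the concave parabola = estimated best thread count.
            double kstar = (coef[2] < 0) ? -coef[1] / (2 * coef[2]) : k[n - 1];
            std::printf("fit: %.3f + %.3f*k + %.4f*k^2 -> run with ~%d threads\n",
                        coef[0], coef[1], coef[2], (int)(kstar + 0.5));
        }

    The tradeoff the chapter discusses shows up even here: a cheap parametric fit instantiates almost instantly from few samples but captures the throughput curve coarsely, whereas richer models need more profiling before they can steer the concurrency level reliably.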