2,761 research outputs found

    ๋™์‹œ์— ์‹คํ–‰๋˜๋Š” ๋ณ‘๋ ฌ ์ฒ˜๋ฆฌ ์–ดํ”Œ๋ฆฌ์ผ€์ด์…˜๋“ค์„ ์œ„ํ•œ ๋ณ‘๋ ฌ์„ฑ ๊ด€๋ฆฌ

    Thesis (Ph.D.) -- Graduate School of Seoul National University, College of Engineering, Department of Electrical and Computer Engineering, August 2020. Bernhard Egger.
    Running multiple parallel jobs on the same multicore machine is becoming increasingly important for improving the utilization of the given hardware resources. While co-location of parallel jobs is common practice, it remains a challenge for current parallel runtime systems to efficiently execute multiple parallel applications simultaneously. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical core resources. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, and the management cost grows with an increasing number of co-located applications. In this thesis, we focus on improving the runtime performance of co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to applications, is considered a promising and scalable approach for executing co-located applications. Despite the growing importance of spatial scheduling, two fundamental research issues remain with this approach. First, spatial scheduling requires runtime support for parallel applications to run efficiently under a spatial core allocation that can change at runtime. Second, the scheduler needs to assign the proper number of core resources to each application, depending on the applications' performance characteristics, for better runtime performance. To this end, this thesis presents three novel runtime-level techniques to efficiently execute co-located parallel applications with spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP parallel applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to the varying availability of core resources. It allows parallel applications to run efficiently under changing core resource availability, in contrast to conventional runtime systems that do not adjust the degree of parallelism of the application. Second, this thesis introduces an analytical performance model that can estimate the resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and employ queueing theory to model the memory performance. The queueing-system-based approach allows us to estimate the performance using closed-form equations and hardware performance counters. Third, we present a core allocation framework to manage core resources between co-located parallel applications. With analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage generally leads to better performance than conventional core allocation policies that maximize only CPU usage.
The presented core allocation framework optimizes the utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems, based on the cooperative parallel runtime support and the analytical model.
Contents:
1 Introduction
  1.1 Motivation
  1.2 Background
    1.2.1 The OpenMP Runtime System
    1.2.2 Target Multi-Socket Multicore Systems
  1.3 Contributions
    1.3.1 Cooperative Runtime Systems
    1.3.2 Performance Modeling
    1.3.3 Parallelism Management
  1.4 Related Work
    1.4.1 Cooperative Runtime Systems
    1.4.2 Performance Modeling
    1.4.3 Parallelism Management
  1.5 Organization of this Thesis
2 Dynamic Spatial Scheduling with Cooperative Runtime Systems
  2.1 Overview
  2.2 Malleable Workloads
  2.3 Cooperative OpenMP Runtime System
    2.3.1 Cooperative User-Level Tasking
    2.3.2 Cooperative Dynamic Loop Scheduling
  2.4 Experimental Results
    2.4.1 Standalone Application Performance
    2.4.2 Performance in Spatial Core Allocation
  2.5 Discussion
    2.5.1 Contributions
    2.5.2 Limitations and Future Work
    2.5.3 Summary
3 Performance Modeling of Parallel Loops using Queueing Systems
  3.1 Overview
  3.2 Background
    3.2.1 Queueing Models
    3.2.2 Insights on Performance Modeling of Parallel Loops
    3.2.3 Performance Analysis
  3.3 Queueing Systems for Multi-Socket Multicores
    3.3.1 Hierarchical Queueing Systems
    3.3.2 Computing the Parameter Values
  3.4 The Speedup Prediction Model
    3.4.1 The Speedup Model
    3.4.2 Implementation
  3.5 Evaluation
    3.5.1 64-core AMD Opteron Platform
    3.5.2 72-core Intel Xeon Platform
  3.6 Discussion
    3.6.1 Applicability of the Model
    3.6.2 Limitations of the Model
    3.6.3 Summary
4 Maximizing System Utilization via Parallelism Management
  4.1 Overview
  4.2 Background
    4.2.1 Modeling Performance Metrics
    4.2.2 Our Resource Management Policy
  4.3 NuPoCo: Parallelism Management for Co-Located Parallel Loops
    4.3.1 Online Performance Model
    4.3.2 Managing Parallelism
  4.4 Evaluation of NuPoCo
    4.4.1 Evaluation Scenario 1
    4.4.2 Evaluation Scenario 2
  4.5 MOCA: An Evolutionary Approach to Core Allocation
    4.5.1 Evolutionary Core Allocation
    4.5.2 Model-Based Allocation
  4.6 Evaluation of MOCA
  4.7 Discussion
    4.7.1 Contributions and Limitations
    4.7.2 Summary
5 Conclusion and Future Work
  5.1 Conclusion
  5.2 Future Work
    5.2.1 Improving Multi-Objective Core Allocation
    5.2.2 Co-Scheduling of Parallel Jobs for HPC Systems
A Additional Experiments for the Performance Model
  A.1 Memory Access Distribution and Poisson Distribution
    A.1.1 Memory Access Distribution
    A.1.2 Kolmogorov-Smirnov Test
  A.2 Additional Performance Modeling Results
    A.2.1 Results with Intel Hyperthreading
    A.2.2 Results with Cooperative User-Level Tasking
    A.2.3 Results with Other Loop Schedulers
    A.2.4 Results with Different Numbers of Memory Nodes
B Other Research Contributions of the Author
  B.1 Compiler and Runtime Support for Integrated CPU-GPU Systems
  B.2 Modeling NUMA Architectures with Stochastic Tool
  B.3 Runtime Environment for a Manycore Architecture
Abstract (in Korean)
Acknowledgements
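
    The malleability idea can be illustrated at its coarsest level with plain OpenMP: re-query the core allocation between parallel regions and resize the thread team to match. This is only a boundary-level sketch; the cooperative runtime described above also adapts inside parallel regions via user-level tasking and cooperative loop scheduling. The AVAILABLE_CORES variable is a made-up stand-in for the allocator's interface, not part of the thesis:

        #include <omp.h>
        #include <stdio.h>
        #include <stdlib.h>

        /* Hypothetical stand-in for the cooperative runtime's channel to the
         * core allocator; here we simply read an environment variable between
         * parallel regions. */
        static int available_cores(int fallback) {
            const char *s = getenv("AVAILABLE_CORES");  /* illustrative name */
            int n = s ? atoi(s) : fallback;
            return n > 0 ? n : fallback;
        }

        int main(void) {
            double sum = 0.0;
            for (int step = 0; step < 4; step++) {
                /* Re-negotiate the degree of parallelism before each region. */
                int n = available_cores(omp_get_num_procs());
                omp_set_num_threads(n);
                #pragma omp parallel for reduction(+:sum)
                for (int i = 0; i < 1000000; i++)
                    sum += 1.0 / (1.0 + i);
                printf("step %d ran with %d threads\n", step, n);
            }
            printf("sum = %f\n", sum);
            return 0;
        }

    The queueing-based performance model also admits a small worked example. The sketch below collapses the thesis's hierarchical queueing systems into a single M/M/1 queue for the whole memory subsystem, so it is a simplification for illustration; every input value is a hypothetical placeholder for what would normally be derived from hardware performance counters:

        #include <stdio.h>

        /* Mean sojourn time of a request in an M/M/1 queue with arrival rate
         * lambda and service rate mu (valid only while lambda < mu). */
        static double mm1_response_time(double lambda, double mu) {
            return 1.0 / (mu - lambda);
        }

        /* Estimated runtime of a parallel loop on n cores: iterations are split
         * evenly; each iteration spends t_cpu seconds computing and issues m
         * memory requests that wait in the shared memory queue. */
        static double loop_time(int n, double iters, double t_cpu, double m,
                                double req_rate_per_core, double mu) {
            double wait = mm1_response_time(n * req_rate_per_core, mu);
            return (iters / n) * (t_cpu + m * wait);
        }

        int main(void) {
            /* Hypothetical, counter-like inputs. */
            double iters = 1e6, t_cpu = 1e-6, m = 4.0;
            double req_rate = 1e6;  /* memory requests/s issued per core */
            double mu = 4e7;        /* memory service rate, requests/s   */
            double t1 = loop_time(1, iters, t_cpu, m, req_rate, mu);
            for (int n = 1; n <= 32; n *= 2)
                printf("cores=%2d  predicted speedup=%.2f\n",
                       n, t1 / loop_time(n, iters, t_cpu, m, req_rate, mu));
            return 0;
        }

    With these placeholder numbers the predicted speedup flattens as the memory queue approaches saturation, which is the kind of behavior the core allocation framework exploits when it declines to give an application more cores.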

    GPUs as Storage System Accelerators

    Massively multicore processors, such as Graphics Processing Units (GPUs), provide, at a comparable price, one order of magnitude higher peak performance than traditional CPUs. This drop in the cost of computation, like any order-of-magnitude drop in the cost per unit of performance for a class of system components, creates the opportunity to redesign systems and to explore new ways to engineer them to recalibrate the cost-to-performance relation. This project explores the feasibility of harnessing GPUs' computational power to improve the performance, reliability, or security of distributed storage systems. In this context, we present the design of a storage system prototype that uses GPU offloading to accelerate a number of computationally intensive primitives based on hashing, and introduce techniques to efficiently leverage the processing power of GPUs. We evaluate the performance of this prototype under two configurations: as a content-addressable storage system that facilitates online similarity detection between successive versions of the same file, and as a traditional system that uses hashing to preserve data integrity. Further, we evaluate the impact of offloading to the GPU on competing applications' performance. Our results show that this technique can bring tangible performance gains without negatively impacting the performance of concurrently running applications.
    Comment: IEEE Transactions on Parallel and Distributed Systems, 201
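
    The content-addressable configuration mentioned above rests on standard block-hashing mechanics: split a file into blocks, hash each block, and compare two versions by their block hashes. A minimal CPU-side sketch of that comparison, using the tiny FNV-1a hash purely as a stand-in for the cryptographic hashing the prototype offloads to the GPU (the block size and all names are illustrative, not taken from the paper):

        #include <stdint.h>
        #include <stdio.h>
        #include <string.h>

        #define BLOCK_SIZE 4096  /* illustrative chunk size */

        /* FNV-1a: a toy stand-in for the GPU-offloaded hashing primitives. */
        static uint64_t fnv1a(const unsigned char *p, size_t n) {
            uint64_t h = 1469598103934665603ULL;
            while (n--) { h ^= *p++; h *= 1099511628211ULL; }
            return h;
        }

        /* Hash every BLOCK_SIZE chunk of buf into out; returns chunk count. */
        static size_t hash_blocks(const unsigned char *buf, size_t len,
                                  uint64_t *out) {
            size_t nblocks = (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
            for (size_t i = 0; i < nblocks; i++) {
                size_t off = i * BLOCK_SIZE;
                size_t n = len - off < BLOCK_SIZE ? len - off : BLOCK_SIZE;
                out[i] = fnv1a(buf + off, n);
            }
            return nblocks;
        }

        /* Similarity between two versions: fraction of block positions whose
         * hashes are unchanged. */
        static double similarity(const uint64_t *a, size_t na,
                                 const uint64_t *b, size_t nb) {
            size_t common = na < nb ? na : nb, same = 0;
            for (size_t i = 0; i < common; i++) same += (a[i] == b[i]);
            size_t total = na > nb ? na : nb;
            return total ? (double)same / total : 1.0;
        }

        int main(void) {
            static unsigned char v1[3 * BLOCK_SIZE], v2[3 * BLOCK_SIZE];
            memset(v1, 'a', sizeof v1);
            memcpy(v2, v1, sizeof v2);
            memset(v2 + BLOCK_SIZE, 'b', 100);  /* edit one block of version 2 */
            uint64_t h1[3], h2[3];
            size_t n1 = hash_blocks(v1, sizeof v1, h1);
            size_t n2 = hash_blocks(v2, sizeof v2, h2);
            printf("similarity = %.2f\n", similarity(h1, n1, h2, n2));
            return 0;
        }

    On a GPU, the per-block hashes are what gets computed in bulk; everything else here stays on the host.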

    TimeTrader: Exploiting Latency Tail to Save Datacenter Energy for On-line Data-Intensive Applications

    Datacenters running on-line, data-intensive applications (OLDIs) consume significant amounts of energy. However, reducing their energy is challenging due to their tight response time requirements. A key aspect of OLDIs is that each user query goes to all or many of the nodes in the cluster, so that the overall time budget is dictated by the tail of the replies' latency distribution; replies see latency variations both in the network and in compute. Previous work proposes to achieve load-proportional energy by slowing down the computation at lower datacenter loads based directly on response times (i.e., at lower loads, the proposal exploits the average slack in the time budget provisioned for the peak load). In contrast, we propose TimeTrader to reduce energy by exploiting the latency slack in the sub-critical replies which arrive before the deadline (e.g., 80% of replies are 3-4x faster than the tail). This slack is present at all loads and subsumes the previous work's load-related slack. While the previous work shifts the leaves' response time distribution to consume the slack at lower loads, TimeTrader reshapes the distribution at all loads by slowing down individual sub-critical nodes without increasing missed deadlines. TimeTrader exploits slack in both the network and compute budgets. Further, TimeTrader leverages Earliest Deadline First scheduling to largely decouple critical requests from the queuing delays of sub-critical requests, which can then be slowed down without hurting critical requests. A combination of real-system measurements and at-scale simulations shows that, without adding to missed deadlines, TimeTrader saves 15-19% and 41-49% energy at 90% and 30% loading, respectively, in a datacenter with 512 nodes, whereas previous work saves 0% and 31-37%.
    Comment: 13 pages
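
    The reshaping TimeTrader performs boils down to a per-reply slack computation: how much can this sub-critical reply be slowed before it becomes critical? A minimal sketch of that decision, assuming a leaf that knows the query's deadline and an estimate of its own remaining service time (the names, the safety margin, and the cap on the slowdown are illustrative; the actual system also budgets for network variation and uses EDF scheduling to protect critical requests):

        #include <stdio.h>

        /* Choose a slowdown factor for one reply: stretch execution to consume
         * the slack before the deadline; replies with no slack (the tail) run
         * at full speed. */
        static double slowdown_factor(double now, double deadline,
                                      double est_service, double margin) {
            double slack = deadline - now - est_service - margin;
            if (slack <= 0.0)
                return 1.0;               /* critical: do not slow down */
            double f = (est_service + slack) / est_service;
            return f > 3.0 ? 3.0 : f;     /* cap at the slowest DVFS state */
        }

        int main(void) {
            double deadline = 50.0, margin = 2.0;   /* milliseconds */
            printf("fast reply: %.2fx slower\n",
                   slowdown_factor(0.0, deadline, 10.0, margin));
            printf("tail reply: %.2fx slower\n",
                   slowdown_factor(0.0, deadline, 49.0, margin));
            return 0;
        }

    The fast reply absorbs a large slowdown (and hence a lower power state) while the near-deadline reply is left untouched, which is how the distribution is reshaped without adding missed deadlines.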

    Managing Communication Latency-Hiding at Runtime for Parallel Programming Languages and Libraries

    This work introduces a runtime model for managing communication with support for latency-hiding. The model enables non-computer-science researchers to exploit communication latency-hiding techniques seamlessly. For compiled languages, it is often possible to create efficient schedules for communication, but this is not the case for interpreted languages. By maintaining data dependencies between scheduled operations, it is possible to aggressively initiate communication and lazily evaluate tasks, allowing maximal time for the communication to finish before entering a wait state. We implement a heuristic of this model in DistNumPy, an auto-parallelizing version of numerical Python that allows sequential NumPy programs to run on distributed memory architectures. Furthermore, we present performance comparisons for eight benchmarks with and without automatic latency-hiding. The results show that our model reduces the time spent waiting for communication by as much as a factor of 27, from a maximum of 54% to only 2% of the total execution time, in a stencil application.
    Comment: PREPRINT
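
    The schedule such a runtime builds automatically is the classic overlap pattern that one would otherwise write by hand: initiate the transfer early, compute on independent data, and wait only when a dependent task is evaluated. A hand-written sketch of that pattern in C with MPI, for comparison with the automatic version (buffer names and sizes are illustrative; run with two ranks, e.g. mpirun -np 2):

        #include <mpi.h>
        #include <stdio.h>

        #define N 1024

        int main(int argc, char **argv) {
            int rank, size;
            double halo[N] = {0}, interior[N];
            MPI_Request req;

            MPI_Init(&argc, &argv);
            MPI_Comm_rank(MPI_COMM_WORLD, &rank);
            MPI_Comm_size(MPI_COMM_WORLD, &size);
            if (size < 2) { MPI_Finalize(); return 1; }

            /* Initiate communication as early as possible (non-blocking). */
            if (rank == 0)
                MPI_Isend(halo, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);
            else if (rank == 1)
                MPI_Irecv(halo, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &req);

            /* Compute on data that does not depend on the transfer; this
             * work hides the communication latency. */
            for (int i = 0; i < N; i++)
                interior[i] = 0.5 * i;

            /* Wait lazily, only when the dependent computation must run. */
            if (rank <= 1)
                MPI_Wait(&req, MPI_STATUS_IGNORE);
            if (rank == 1)
                printf("halo received; interior[1] = %f\n", interior[1]);

            MPI_Finalize();
            return 0;
        }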

    BriskStream: Scaling Data Stream Processing on Shared-Memory Multicore Architectures

    We introduce BriskStream, an in-memory data stream processing system (DSPS) specifically designed for modern shared-memory multicore architectures. BriskStream's key contribution is an execution plan optimization paradigm, namely RLAS, which takes the relative location (i.e., NUMA distance) of each pair of producer-consumer operators into consideration. We propose a branch-and-bound based approach with three heuristics to resolve the resulting nontrivial optimization problem. The experimental evaluation demonstrates that BriskStream yields much higher throughput and better scalability than existing DSPSs on multicore architectures when processing different types of workloads.
    Comment: To appear in SIGMOD'19
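
    The optimization RLAS performs can be pictured as scoring candidate operator placements against a NUMA distance matrix. The sketch below evaluates one such score, assuming a fixed operator-to-socket assignment and weighting each producer-consumer edge by its tuple rate; this is a generic illustration of NUMA-distance-aware placement, not BriskStream's exact objective function (the distance values imitate the style of numactl --hardware output):

        #include <stdio.h>

        #define SOCKETS 4
        #define EDGES   3

        /* Relative inter-socket NUMA distances (illustrative values). */
        static const int dist[SOCKETS][SOCKETS] = {
            {10, 16, 16, 22},
            {16, 10, 22, 16},
            {16, 22, 10, 16},
            {22, 16, 16, 10},
        };

        /* One producer -> consumer edge of the streaming pipeline. */
        struct edge { int producer, consumer; double tuples_per_sec; };

        /* Placement cost: NUMA distance between each pair's sockets,
         * weighted by the tuple rate on that edge. Lower is better. */
        static double placement_cost(const struct edge *e, int n,
                                     const int *socket_of_op) {
            double cost = 0.0;
            for (int i = 0; i < n; i++)
                cost += e[i].tuples_per_sec *
                        dist[socket_of_op[e[i].producer]]
                            [socket_of_op[e[i].consumer]];
            return cost;
        }

        int main(void) {
            /* A 4-operator pipeline 0 -> 1 -> 2 -> 3, illustrative rates. */
            struct edge pipeline[EDGES] = {
                {0, 1, 5e6}, {1, 2, 3e6}, {2, 3, 1e6}
            };
            int colocated[4] = {0, 0, 0, 0};  /* all operators on socket 0 */
            int scattered[4] = {0, 3, 1, 2};  /* spread across sockets     */
            printf("co-located cost: %.0f\n",
                   placement_cost(pipeline, EDGES, colocated));
            printf("scattered  cost: %.0f\n",
                   placement_cost(pipeline, EDGES, scattered));
            return 0;
        }

    A branch-and-bound search like the one described above enumerates assignments while pruning any partial placement whose cost bound is already worse than the best complete placement found so far.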

    Exploiting partial reconfiguration through PCIe for a microphone array network emulator

    Current Microelectromechanical Systems (MEMS) technology enables the deployment of relatively low-cost wireless sensor networks composed of MEMS microphone arrays for accurate sound source localization. However, evaluating and selecting the most accurate and power-efficient network topology is not trivial when considering dynamic MEMS microphone arrays. Although software simulators are usually considered, they involve highly computation-intensive tasks which require hours to days to complete. In this paper, we present an FPGA-based platform to emulate a network of microphone arrays. Our platform provides a controlled, simulated acoustic environment able to evaluate the impact of different network configurations such as the number of microphones per array, the network's topology, or the detection method used. Data fusion techniques, combining the data collected by each node, are used in this platform. The platform is designed to exploit the FPGA's partial reconfiguration feature to increase the flexibility of the network emulator as well as to increase performance thanks to the use of the high-bandwidth PCI Express interface. On the one hand, the network emulator achieves higher flexibility by partially reconfiguring the nodes' architecture at runtime. On the other hand, a set of strategies and heuristics to properly use partial reconfiguration allows the emulation to be accelerated by exploiting execution parallelism. Several experiments are presented to demonstrate some of the capabilities of our platform and the benefits of using partial reconfiguration.

    Adaptive Knobs for Resource Efficient Computing

    Performance demands of emerging domains such as artificial intelligence, machine learning and vision, and the Internet of Things continue to grow. Meeting such requirements on modern multi/many-core systems with higher power densities, fixed power and energy budgets, and thermal constraints exacerbates the run-time management challenge. This leaves open the problem of extracting the required performance within the power and energy limits while also ensuring thermal safety. Existing architectural solutions, including asymmetric and heterogeneous cores and custom acceleration, improve performance-per-watt in specific design-time and static scenarios. However, satisfying applications' performance requirements under dynamic and unknown workload scenarios, subject to varying system dynamics of power, temperature, and energy, requires intelligent run-time management. Adaptive strategies are necessary for maximizing resource efficiency, considering i) the diverse requirements and characteristics of concurrent applications, ii) dynamic workload variation, iii) core-level heterogeneity, and iv) power, thermal, and energy constraints. This dissertation proposes such adaptive techniques for efficient run-time resource management to maximize performance within fixed budgets under unknown and dynamic workload scenarios. The resource management strategies proposed in this dissertation comprehensively consider application and workload characteristics and the variable effect of power actuation on performance to make proactive and appropriate allocation decisions. Specific contributions include i) a run-time mapping approach to improve power budgets for higher throughput, ii) thermal-aware performance boosting for efficient utilization of the power budget and higher performance, iii) approximation as a run-time knob exploiting accuracy-performance trade-offs for maximizing performance under power caps at minimal loss of accuracy, and iv) coordinated approximation for heterogeneous systems through joint actuation of dynamic approximation and power knobs for performance guarantees with minimal power consumption. The approaches presented in this dissertation focus on adapting existing mapping techniques, performance-boosting strategies, and software and dynamic approximations to meet performance requirements while simultaneously considering system constraints. The proposed strategies are compared against relevant state-of-the-art run-time management frameworks to qualitatively evaluate their efficacy.
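
    Of the contributions listed above, iii) is the most self-contained to sketch: treat the approximation level as an actuator and the power cap as the constraint. The following is a minimal feedback-loop illustration with hypothetical sensor and actuator stubs; the fake power model and all names stand in for the platform interfaces a real implementation would use (e.g. RAPL-style power readings):

        #include <stdio.h>

        #define MAX_APPROX 5  /* coarsest acceptable accuracy */

        /* Hypothetical sensor stub: pretend more approximation means less
         * work and therefore less power drawn. */
        static double read_power_watts(int approx_level) {
            return 95.0 - 8.0 * approx_level;
        }

        /* One step of a simple controller: approximate more while over the
         * cap, recover accuracy when comfortably under it. */
        static int control_step(int level, double power, double cap) {
            if (power > cap && level < MAX_APPROX)
                return level + 1;   /* over budget: raise approximation */
            if (power < cap - 10.0 && level > 0)
                return level - 1;   /* headroom: recover accuracy */
            return level;
        }

        int main(void) {
            double cap = 70.0;      /* illustrative power cap in watts */
            int level = 0;
            for (int t = 0; t < 6; t++) {
                double p = read_power_watts(level);
                printf("t=%d level=%d power=%.1fW\n", t, level, p);
                level = control_step(level, p, cap);
            }
            return 0;
        }

    With these made-up numbers the loop settles at the lowest approximation level that satisfies the cap, which is the "minimal loss of accuracy" behavior the contribution aims for.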