
    SCALO: Scalability-Aware Parallelism Orchestration for Multi-Threaded Workloads

    This article contributes a solution for orchestrating concurrent application execution to increase throughput. SCALO monitors co-executing applications at runtime to evaluate their scalability.
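
    SCALO's decision logic is not spelled out in this excerpt, so the following is only a minimal sketch of the general idea, under assumptions of my own: probe a parallel loop at increasing thread counts and keep the largest count whose parallel efficiency stays above a threshold. The loop body is a stand-in kernel and pick_thread_count is a hypothetical helper, not part of SCALO.

```c
/* Hypothetical scalability probe (not SCALO itself): time one
 * parallel loop at 1, 2, 4, ... threads and keep the largest
 * thread count whose parallel efficiency (speedup / threads)
 * stays above a threshold. Compile with -fopenmp. */
#include <omp.h>

int pick_thread_count(double *a, long n, int max_threads,
                      double min_efficiency) {
    double t1 = 0.0;
    int best = 1;
    for (int p = 1; p <= max_threads; p *= 2) {
        double t0 = omp_get_wtime();
        #pragma omp parallel for num_threads(p)
        for (long i = 0; i < n; i++)
            a[i] = a[i] * 1.0001 + 1.0;   /* stand-in kernel */
        double t = omp_get_wtime() - t0;
        if (p == 1) t1 = t;               /* sequential baseline */
        if (t1 / (t * p) >= min_efficiency)
            best = p;
    }
    return best;
}
```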

    ForestGOMP: an efficient OpenMP environment for NUMA architectures

    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid remote memory access penalties. Directive-based programming languages such as OpenMP can greatly help to perform such a distribution by providing programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into scheduling hints related to thread-memory affinity issues. These hints enable dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. Several experiments show that mixed solutions (migrating both threads and data) outperform work-stealing-based balancing strategies and next-touch-based data distribution policies, and they suggest further optimizations.
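
    ForestGOMP performs this kind of placement automatically inside the runtime, driven by OpenMP structure and hardware topology. For illustration only, here is a rough sketch of the thread/data pinning it automates, written against the libnuma C API (link with -lnuma); alloc_on_local_node is a hypothetical helper, not a ForestGOMP interface.

```c
/* Sketch of manual NUMA placement (the kind of work a NUMA-aware
 * runtime automates): bind the calling thread to a node and
 * allocate its working set on the same node. Link with -lnuma. */
#include <numa.h>
#include <stdlib.h>

double *alloc_on_local_node(size_t n, int node) {
    if (numa_available() < 0)            /* no NUMA support: plain malloc */
        return malloc(n * sizeof(double));
    numa_run_on_node(node);              /* pin the calling thread */
    /* allocate pages on the same node; release with numa_free() */
    return numa_alloc_onnode(n * sizeof(double), node);
}
```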

    Parallelism Management for Concurrently Executing Parallel Applications

    Ph.D. dissertation, Seoul National University, Department of Electrical and Computer Engineering, August 2020 (advisor: Bernhard Egger). Running multiple parallel jobs on the same multicore machine is becoming increasingly important for improving the utilization of the given hardware resources. While co-location of parallel jobs is common practice, it remains a challenge for current parallel runtime systems to execute multiple parallel applications simultaneously and efficiently. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical cores. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, and the management cost grows with the number of co-located applications. This thesis focuses on improving runtime performance for co-located parallel applications. The first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which assigns distinct core resources to each application, is a promising and scalable approach for executing co-located applications, but it raises two fundamental research issues. First, it requires runtime support so that parallel applications run efficiently under spatial core allocations that can change at runtime. Second, the scheduler needs to assign the proper number of cores to each application, depending on the application's performance characteristics, to achieve good runtime performance. To this end, this thesis presents three novel runtime-level techniques for efficiently executing co-located parallel applications under spatial scheduling. First, it presents a cooperative runtime technique that provides malleable parallel execution for OpenMP applications: malleable execution means that an application can dynamically adapt its degree of parallelism to the varying core resource availability, allowing it to run efficiently as core availability changes, unlike conventional runtime systems that never adjust an application's degree of parallelism. Second, the thesis introduces an analytical performance model that estimates the resource utilization and performance of parallel programs as a function of the provided core resources. Observing that the performance of parallel loops is typically limited by memory performance, the model employs queueing theory to describe the memory system; this queueing-based approach estimates performance from closed-form equations and hardware performance counters. Third, the thesis presents a core allocation framework that manages core resources among co-located parallel applications. The analytical modeling shows that maximizing both CPU utilization and memory bandwidth usage generally leads to better performance than conventional core allocation policies that maximize CPU usage alone. Built on the cooperative parallel runtime support and the analytical model, the presented core allocation framework optimizes the utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems.
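
    The thesis implements malleability through cooperative user-level tasking inside the runtime; a much cruder approximation, adapting only at parallel-region boundaries, can be sketched in plain OpenMP. Here available_cores() is a hypothetical hook standing in for the spatial scheduler's current allocation, not an interface from the thesis.

```c
/* Crude sketch of malleable execution: re-read the current core
 * allocation before each parallel region and size the thread team
 * accordingly. available_cores() is hypothetical (e.g., backed by
 * shared memory written by the spatial scheduler). */
#include <omp.h>

extern int available_cores(void);   /* hypothetical scheduler hook */

void malleable_step(double *a, long n) {
    int p = available_cores();      /* may differ from the last region */
    #pragma omp parallel for num_threads(p) schedule(dynamic, 4096)
    for (long i = 0; i < n; i++)
        a[i] += 1.0;
}
```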

    Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

    A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that make up the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of a heterogeneous system. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically reducing response time and energy consumption. It is designed around several features: it is dynamic, adaptive, guided, and effortless, as it does not require the user to supply any parameters, adapting to the behaviour of each kernel at runtime. To evaluate Sigmoid's performance, it has been implemented in Maat, a system abstraction library. Experimental results with different kernel types show that Sigmoid exhibits excellent performance, reaching a utilization of 90%, together with energy savings of up to 20%, while always reducing programming effort compared to OpenCL and facilitating portability to other heterogeneous machines. This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence.
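
    The excerpt does not reproduce Sigmoid's splitting formula, so the sketch below shows only the baseline it refines: dividing an iteration space among devices in proportion to their measured throughput. All names are illustrative, and a real implementation would feed the resulting ranges to per-device OpenCL kernel launches.

```c
/* Proportional split (baseline idea, not Sigmoid's actual policy):
 * device d gets a share of the n iterations proportional to its
 * measured throughput; the last device absorbs rounding leftovers. */
void split_work(const double *throughput, int ndev, long n,
                long *offset, long *count) {
    double total = 0.0;
    for (int d = 0; d < ndev; d++)
        total += throughput[d];
    long given = 0;
    for (int d = 0; d < ndev; d++) {
        count[d] = (d == ndev - 1)
                 ? n - given
                 : (long)((double)n * throughput[d] / total);
        offset[d] = given;
        given += count[d];
    }
}
```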

    Dynamic Task and Data Placement over NUMA Architectures: an OpenMP Runtime Perspective

    Exploiting the full computational power of current hierarchical multiprocessor machines requires a very careful distribution of threads and data among the underlying non-uniform architecture so as to avoid memory access penalties. Directive-based programming languages such as OpenMP provide programmers with an easy way to structure the parallelism of their application and to transmit this information to the runtime system. Our runtime, which is based on a multi-level thread scheduler combined with a NUMA-aware memory manager, converts this information into "scheduling hints" to solve thread/memory affinity issues. It enables dynamic load distribution guided by application structure and hardware topology, thus helping to achieve performance portability. First experiments show that mixed solutions (migrating both threads and data) outperform next-touch-based data distribution policies and open possibilities for new optimizations.
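
    For reference, the next-touch policies used as the comparison baseline relocate pages on the next access; the classic first-touch counterpart can be exploited directly in OpenMP by initializing data with the same static schedule as the compute loop, so each page lands on the node of the thread that will later use it. A minimal sketch:

```c
/* First-touch initialization: with a static schedule, thread t
 * first writes exactly the pages it will process later, so the
 * OS places those pages on t's NUMA node. */
#include <omp.h>

void init_first_touch(double *a, long n) {
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; i++)
        a[i] = 0.0;   /* first write decides page placement */
}
```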

    Parallel programming models and tools for many-core platforms

    The trade-off between power consumption, performance, programmability, and portability drives all computing industry designs, particularly in the mobile and embedded systems domains. Two design paradigms have proven particularly promising in this context: architectural heterogeneity and many-core processors. Parallel programming models are key to effectively harnessing the computational power of heterogeneous many-core SoCs. This thesis presents a set of techniques and HW/SW extensions that enable performance improvements and simplify programmability for heterogeneous many-core platforms. The contributions span the entire software stack for many-core platforms, from hardware abstraction layers running on top of bare metal to programming models, and from hardware extensions for efficient parallelism support to middleware that enables optimized resource management within many-core platforms. First, we present mechanisms to decrease parallelism overheads in parallel programming runtimes for many-core platforms, targeting fine-grain parallelism. Second, we present programming model support that enables the offload of computational kernels within heterogeneous many-core systems. Third, we present a novel approach to dynamically sharing and managing many-core platforms when multiple applications coded with different programming models execute concurrently. All these contributions were validated on STMicroelectronics STHORM, a real embodiment of a state-of-the-art many-core system. Hardware extensions and architectural explorations were evaluated using VirtualSoC, a SystemC-based cycle-accurate simulator of many-core platforms.
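
    The offload support in the thesis is built on STHORM-specific runtime layers; as a generic illustration of the same kernel-offload pattern in a standard programming model, an OpenMP 4.x target region (assuming a device-capable compiler) looks like this:

```c
/* Generic kernel offload in standard OpenMP (illustrative only;
 * the thesis uses STHORM-specific extensions): copy a[0:n] to the
 * device, run the loop there, and copy the results back. */
void offload_scale(float *a, int n, float s) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; i++)
        a[i] *= s;
}
```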

    Optimization of performance and energy efficiency in massively parallel systems

    Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications. Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed.
EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions and molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that adds co-execution support to the oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications. Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant) and the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3 project, "European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology" (G.A. No. 671697), from the European Union's Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The hybrid programming models work of Chapter 4 (Integration II) has been partially performed under the project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS).
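
    Both runtimes implement far richer scheduling policies; the core dynamic co-execution idea, though, can be sketched with a shared atomic counter from which each device-driving worker pulls fixed-size chunks until the range is exhausted, so faster devices naturally take more work. process_chunk is a hypothetical per-device kernel launcher, not part of EngineCL or CoexecutorRuntime.

```c
/* Dynamic co-execution sketch: workers (one per device) grab chunks
 * from a shared atomic counter; throughput differences between
 * devices translate automatically into different chunk counts.
 * process_chunk() is a hypothetical per-device launcher. */
#include <stdatomic.h>

#define CHUNK 65536L

extern void process_chunk(int device, long begin, long end);

static atomic_long next_iter;   /* reset to 0 before each kernel */

void worker(int device, long n) {
    for (;;) {
        long begin = atomic_fetch_add(&next_iter, CHUNK);
        if (begin >= n)
            break;
        long end = (begin + CHUNK < n) ? begin + CHUNK : n;
        process_chunk(device, begin, end);
    }
}
```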

    Argobots: A Lightweight Low-Level Threading and Tasking Framework

    In the past few decades, a number of user-level threading and tasking models have been proposed in the literature to address the shortcomings of OS-level threads, primarily with respect to cost and flexibility. Current state-of-the-art user-level threading and tasking models, however, are either too specific to particular applications or architectures, or insufficiently powerful and flexible. In this paper, we present Argobots, a lightweight, low-level threading and tasking framework designed as a portable and performant substrate for high-level programming models and runtime systems. Argobots offers a carefully designed execution model that balances generality of functionality with a rich set of controls that allow specialization by end users or high-level programming models. We describe the design, implementation, and performance characterization of Argobots and present integrations with three high-level models: OpenMP, MPI, and co-located I/O services. Evaluations show that (1) Argobots, while providing richer capabilities, is competitive with existing simpler generic threading runtimes; (2) our OpenMP runtime offers more efficient interoperability capabilities than production OpenMP runtimes do; (3) when MPI interoperates with Argobots instead of Pthreads, it enjoys reduced synchronization costs and better latency-hiding capabilities; and (4) I/O services with Argobots reduce interference with co-located applications while achieving performance competitive with that of a Pthreads approach.
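
    As a concrete taste of the framework's C API, a minimal example in the spirit of the abstract spawns a few user-level threads (ULTs) onto the primary execution stream's pool (API names as in recent Argobots releases; error checking omitted):

```c
/* Minimal Argobots usage sketch: initialize the library, get the
 * primary execution stream's pool, spawn ULTs into it, join and
 * free them, then finalize. */
#include <abt.h>
#include <stdio.h>

static void hello(void *arg) {
    printf("hello from ULT %ld\n", (long)arg);
}

int main(int argc, char **argv) {
    ABT_xstream xstream;
    ABT_pool pool;
    ABT_thread ults[4];

    ABT_init(argc, argv);
    ABT_xstream_self(&xstream);                    /* primary stream */
    ABT_xstream_get_main_pools(xstream, 1, &pool);

    for (long i = 0; i < 4; i++)
        ABT_thread_create(pool, hello, (void *)i,
                          ABT_THREAD_ATTR_NULL, &ults[i]);
    for (int i = 0; i < 4; i++) {
        ABT_thread_join(ults[i]);
        ABT_thread_free(&ults[i]);
    }

    ABT_finalize();
    return 0;
}
```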