
    Scheduling Dynamic OpenMP Applications over Multicore Architectures

    Approaching the theoretical performance of hierarchical multicore machines requires a very careful distribution of threads and data over the underlying non-uniform architecture in order to minimize cache misses and NUMA penalties. While it is acknowledged that OpenMP can enhance the quality of thread scheduling on such architectures in a portable way, by conveying precious information about the affinities between threads and data to the underlying runtime system, most OpenMP runtime systems are actually unable to efficiently support highly irregular, massively parallel applications on NUMA machines. In this paper, we present a thread scheduling policy suited to the execution of OpenMP programs featuring irregular and massive nested parallelism over hierarchical architectures. Our policy enforces a distribution of threads that maximizes the proximity of threads belonging to the same parallel section, and uses a NUMA-aware work-stealing strategy when load balancing is needed. It has been developed as a plug-in to the ForestGOMP OpenMP platform. We demonstrate the efficiency of our approach with a highly irregular recursive OpenMP program resulting from the generic parallelization of a surface reconstruction application, achieving a speedup of 14 on a 16-core machine with no application-level optimization.
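    The central heuristic here, NUMA-aware work stealing, ranks steal victims by distance: an idle core first tries the task deques of cores on its own NUMA node and only then reaches across the machine, so stolen tasks stay near their data. The Python sketch below illustrates only that victim ordering, under assumed names (node_of, per-core deques, a flat core-to-node mapping); it is an illustration, not ForestGOMP's implementation.

        import collections

        class NumaStealScheduler:
            """Minimal sketch of NUMA-aware victim selection for work stealing."""

            def __init__(self, num_nodes, cores_per_node):
                self.cores_per_node = cores_per_node
                self.num_cores = num_nodes * cores_per_node
                self.deques = [collections.deque() for _ in range(self.num_cores)]

            def node_of(self, core):
                return core // self.cores_per_node  # assumed flat topology

            def push(self, core, task):
                self.deques[core].append(task)      # owner works at the tail

            def pop_or_steal(self, core):
                if self.deques[core]:
                    return self.deques[core].pop()  # fast path: local task
                # Rank victims by NUMA distance, same node first, so a stolen
                # task stays close to the data it was spawned near.
                victims = sorted(
                    (v for v in range(self.num_cores)
                     if v != core and self.deques[v]),
                    key=lambda v: self.node_of(v) != self.node_of(core))
                if victims:
                    return self.deques[victims[0]].popleft()  # steal the oldest
                return None

    In a real runtime the steal would use atomic operations on concurrent deques; only the victim ordering is the point of the sketch.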

    Improving the scalability of parallel N-body applications with an event driven constraint based execution model

    The scalability and efficiency of graph applications are significantly constrained by conventional systems and their supporting programming models. Technology trends like multicore, manycore, and heterogeneous system architectures are introducing further challenges and possibilities for emerging application domains such as graph applications. This paper explores the space of effective parallel execution of ephemeral graphs that are dynamically generated, using the Barnes-Hut algorithm to exemplify dynamic workloads. The workloads are expressed using the semantics of ParalleX, an execution model for Exascale computing. For comparison, results using conventional execution model semantics are also presented. We find that the advanced semantics for Exascale computing improve load balancing at runtime and enable automatic parallelism discovery, improving efficiency.
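    The ParalleX semantics referred to above replace static decomposition with dynamically spawned work (futures and lightweight threads), which suits the irregular, ephemeral trees that Barnes-Hut rebuilds every timestep. As a rough analogue only, the hedged Python sketch below evaluates forces over a tiny, hypothetical 1-D Barnes-Hut tree with one future-like task per body, letting the pool balance the irregular traversal costs; it is not the paper's ParalleX code.

        from concurrent.futures import ThreadPoolExecutor
        from dataclasses import dataclass

        THETA = 0.5  # opening criterion: larger means coarser approximation

        @dataclass
        class Node:
            mass: float
            com: float            # center of mass (1-D position, G = 1 units)
            size: float           # spatial extent of the cell
            children: tuple = ()  # empty tuple for leaves

        def force(node, x):
            """Approximate attraction of `node` on a unit-mass body at x."""
            d = node.com - x
            if node.children and (d == 0 or node.size / abs(d) >= THETA):
                return sum(force(c, x) for c in node.children)  # open the cell
            if d == 0:
                return 0.0  # the body itself: no self-force
            return node.mass / (d * d) * (1.0 if d > 0 else -1.0)

        def all_forces(root, bodies, workers=4):
            # One task per body: the pool, not a static decomposition,
            # balances the irregular per-body traversal costs.
            with ThreadPoolExecutor(max_workers=workers) as pool:
                return list(pool.map(lambda b: force(root, b), bodies))

        leaf = lambda m, x: Node(m, x, 0.0)
        root = Node(3.0, 2.0, 4.0, (leaf(1, 0.0), leaf(1, 2.0), leaf(1, 4.0)))
        print(all_forces(root, [0.0, 2.0, 4.0]))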

    Architectural support for task dependence management with flexible software scheduling

    The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence-tracking operations to the DMU while still performing task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x less area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains. This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P, TIN2016-76635-C2-2-R and TIN2016-81840-REDT), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union's Horizon 2020 research and innovation programme under grant agreements No. 671697 and No. 671610. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047.
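    The costly bookkeeping that TDM offloads to the DMU is, in essence, remembering which task last writes each memory region and counting each new task's unsatisfied inputs. The sketch below is a hypothetical, sequential Python model of that bookkeeping (flow dependences only, with output renaming assumed), not the paper's ISA extensions or hardware design; the paper's contribution is doing this in hardware while scheduling stays in software.

        class DependenceTracker:
            """Sequential model of DMU-style dependence bookkeeping (RAW only)."""

            def __init__(self):
                self.last_writer = {}  # address -> task that last writes it
                self.pending = {}      # task -> count of unfinished producers
                self.waiters = {}      # task -> consumers blocked on it
                self.done = set()

            def submit(self, task, inputs, outputs):
                """Register a task; return True if it is immediately ready."""
                deps = {self.last_writer[a] for a in inputs
                        if a in self.last_writer
                        and self.last_writer[a] not in self.done}
                self.pending[task] = len(deps)
                for d in deps:
                    self.waiters.setdefault(d, []).append(task)
                for a in outputs:
                    self.last_writer[a] = task  # later readers depend on us
                return not deps

            def complete(self, task):
                """Retire a task; return the consumers that became ready."""
                self.done.add(task)
                ready = []
                for w in self.waiters.pop(task, []):
                    self.pending[w] -= 1
                    if self.pending[w] == 0:
                        ready.append(w)
                return ready

        tr = DependenceTracker()
        print(tr.submit("t1", inputs=[], outputs=["x"]))     # True: no producers
        print(tr.submit("t2", inputs=["x"], outputs=["y"]))  # False: waits on t1
        print(tr.complete("t1"))                             # ['t2'] is now ready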

    CoreTSAR: Task Scheduling for Accelerator-aware Runtimes

    Heterogeneous supercomputers that incorporate computational accelerators such as GPUs are increasingly popular due to their high peak performance, energy efficiency, and comparatively low cost. Unfortunately, the programming models and frameworks designed to extract performance from all computational units still lack the flexibility of their CPU-only counterparts. Accelerated OpenMP improves this situation by supporting natural migration of OpenMP code from CPUs to a GPU. However, these implementations currently lose one of OpenMP's best features, its flexibility: typical OpenMP applications can run on any number of CPUs, but GPU implementations do not transparently employ multiple GPUs on a node or a mix of GPUs and CPUs. To address these shortcomings, we present CoreTSAR, our runtime library for dynamically scheduling tasks across heterogeneous resources, and propose straightforward extensions that incorporate this functionality into Accelerated OpenMP. We show that our approach can provide nearly linear speedup to four GPUs over using only CPUs or one GPU, while increasing the overall flexibility of Accelerated OpenMP.
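    A common way to realize this kind of co-scheduling, and presumably the spirit of CoreTSAR's adaptive policies, is ratio-based splitting: each device's share of the next chunk of iterations is made proportional to the throughput it achieved on the previous one. The Python sketch below shows only that proportional rebalancing step, with invented names and numbers; it is not CoreTSAR's actual scheduler.

        def rebalance(shares, times):
            """Split the next pass so each device finishes at the same time.

            shares: iterations each device ran in the last pass
            times:  wall time each device took (seconds)
            """
            rates = [s / t for s, t in zip(shares, times)]  # iterations/second
            total = sum(shares)
            return [round(total * r / sum(rates)) for r in rates]

        # Hypothetical pass over 2 CPUs + 2 GPUs where the GPUs ran 4x faster:
        print(rebalance([250, 250, 250, 250], [4.0, 4.0, 1.0, 1.0]))
        # -> [100, 100, 400, 400]: the slow devices shed work

    Repeating this step each pass converges toward the measured device speeds, which is how near-linear multi-GPU speedups become reachable without per-application tuning.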

    Parallelism Management for Concurrently Executing Parallel Applications

    Doctoral dissertation, Seoul National University Graduate School, College of Engineering, Department of Electrical and Computer Engineering, August 2020. Advisor: Bernhard Egger. Running multiple parallel jobs on the same multicore machine is becoming more important to improve utilization of the given hardware resources. While co-location of parallel jobs is common practice, it still remains a challenge for current parallel runtime systems to efficiently execute multiple parallel applications simultaneously. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance even when given full use of all physical core resources. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, and the management costs grow with an increasing number of co-located applications. In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to applications, is considered a promising and scalable approach for executing co-located applications. Despite the growing importance of spatial scheduling, there are still two fundamental research issues with this approach. First, spatial scheduling requires runtime support for parallel applications to run efficiently in spatial core allocations that can change at runtime. Second, the scheduler needs to assign the proper number of core resources to applications, depending on each application's performance characteristics, for better runtime performance. To this end, this thesis presents three novel runtime-level techniques to efficiently execute co-located parallel applications with spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP parallel applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to the varying core resource availability; it allows parallel applications to run efficiently under changing core resource availability, compared to conventional runtime systems that do not adjust the degree of parallelism of the application. Second, this thesis introduces an analytical performance model that can estimate resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and employ queueing theory to model the memory performance. The queueing-system-based approach allows us to estimate the performance by using closed-form equations and hardware performance counters (a rough single-queue illustration follows this record). Third, we present a core allocation framework to manage core resources between co-located parallel applications. With analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage generally leads to better performance compared to conventional core allocation policies that maximize only CPU usage.
    The presented core allocation framework optimizes utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems, based on the cooperative parallel runtime support and the analytical model.
    Table of Contents:
    1 Introduction
      1.1 Motivation
      1.2 Background
        1.2.1 The OpenMP Runtime System
        1.2.2 Target Multi-Socket Multicore Systems
      1.3 Contributions
        1.3.1 Cooperative Runtime Systems
        1.3.2 Performance Modeling
        1.3.3 Parallelism Management
      1.4 Related Work
        1.4.1 Cooperative Runtime Systems
        1.4.2 Performance Modeling
        1.4.3 Parallelism Management
      1.5 Organization of this Thesis
    2 Dynamic Spatial Scheduling with Cooperative Runtime Systems
      2.1 Overview
      2.2 Malleable Workloads
      2.3 Cooperative OpenMP Runtime System
        2.3.1 Cooperative User-Level Tasking
        2.3.2 Cooperative Dynamic Loop Scheduling
      2.4 Experimental Results
        2.4.1 Standalone Application Performance
        2.4.2 Performance in Spatial Core Allocation
      2.5 Discussion
        2.5.1 Contributions
        2.5.2 Limitations and Future Work
        2.5.3 Summary
    3 Performance Modeling of Parallel Loops using Queueing Systems
      3.1 Overview
      3.2 Background
        3.2.1 Queueing Models
        3.2.2 Insights on Performance Modeling of Parallel Loops
        3.2.3 Performance Analysis
      3.3 Queueing Systems for Multi-Socket Multicores
        3.3.1 Hierarchical Queueing Systems
        3.3.2 Computing the Parameter Values
      3.4 The Speedup Prediction Model
        3.4.1 The Speedup Model
        3.4.2 Implementation
      3.5 Evaluation
        3.5.1 64-core AMD Opteron Platform
        3.5.2 72-core Intel Xeon Platform
      3.6 Discussion
        3.6.1 Applicability of the Model
        3.6.2 Limitations of the Model
        3.6.3 Summary
    4 Maximizing System Utilization via Parallelism Management
      4.1 Overview
      4.2 Background
        4.2.1 Modeling Performance Metrics
        4.2.2 Our Resource Management Policy
      4.3 NuPoCo: Parallelism Management for Co-Located Parallel Loops
        4.3.1 Online Performance Model
        4.3.2 Managing Parallelism
      4.4 Evaluation of NuPoCo
        4.4.1 Evaluation Scenario 1
        4.4.2 Evaluation Scenario 2
      4.5 MOCA: An Evolutionary Approach to Core Allocation
        4.5.1 Evolutionary Core Allocation
        4.5.2 Model-Based Allocation
      4.6 Evaluation of MOCA
      4.7 Discussion
        4.7.1 Contributions and Limitations
        4.7.2 Summary
    5 Conclusion and Future Work
      5.1 Conclusion
      5.2 Future Work
        5.2.1 Improving Multi-Objective Core Allocation
        5.2.2 Co-Scheduling of Parallel Jobs for HPC Systems
    A Additional Experiments for the Performance Model
      A.1 Memory Access Distribution and Poisson Distribution
        A.1.1 Memory Access Distribution
        A.1.2 Kolmogorov-Smirnov Test
      A.2 Additional Performance Modeling Results
        A.2.1 Results with Intel Hyperthreading
        A.2.2 Results with Cooperative User-Level Tasking
        A.2.3 Results with Other Loop Schedulers
        A.2.4 Results with Different Number of Memory Nodes
    B Other Research Contributions of the Author
      B.1 Compiler and Runtime Support for Integrated CPU-GPU Systems
      B.2 Modeling NUMA Architectures with Stochastic Tool
      B.3 Runtime Environment for a Manycore Architecture
    Abstract (in Korean)
    Acknowledgements
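    As a rough, single-queue illustration of the queueing idea in Chapter 3 (the thesis itself builds a more detailed hierarchical model), the Python sketch below treats memory as an open M/M/1 queue: n cores issue requests at an aggregate rate n*lam, the memory services them at rate mu, and the resulting residence time inflates every iteration of a memory-bound loop, flattening its speedup. All parameter values are invented.

        def iteration_time(n, compute, reqs, lam, mu):
            """Mean time per loop iteration with n cores sharing one memory queue.

            compute: pure compute time per iteration (s)
            reqs:    memory requests issued per iteration
            lam:     request rate of a single core (1/s)
            mu:      memory service rate (1/s)
            """
            arrival = n * lam                  # aggregate arrival rate
            assert arrival < mu, "memory saturated: M/M/1 model not applicable"
            residence = 1.0 / (mu - arrival)   # M/M/1 mean time in system
            return compute + reqs * residence

        def speedup(n, **kw):
            return n * iteration_time(1, **kw) / iteration_time(n, **kw)

        # A hypothetical memory-bound loop: speedup flattens well before 16 cores.
        for n in (1, 2, 4, 8, 16):
            print(n, round(speedup(n, compute=1e-7, reqs=100, lam=1e8, mu=2e9), 2))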

    A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

    As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated, or new algorithms have to be developed, in order to take advantage of the architectural features of these new processors. Fine-grain parallelism becomes a major requirement and introduces the need for loose synchronization in the parallel execution of an operation. This paper presents algorithms for the Cholesky, LU and QR factorizations where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks that completely hides the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and with vendor implementations.
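    To make the tile formulation concrete (shown here for Cholesky; the LU and QR variants follow the same pattern), the sketch below writes the factorization as a loop nest of four tile kernels whose tile arguments are exactly the dependencies a dynamic runtime would track in order to execute them out of order. The sequential Python/NumPy code is an illustration of the task structure, not the paper's scheduler, and assumes the tile size divides the matrix size.

        import numpy as np

        def tiled_cholesky(A, nb):
            """In-place tiled Cholesky (lower) of a symmetric positive-definite A.

            Each kernel call is one 'task'; the tiles it reads and writes are the
            dependencies a dynamic scheduler would use to run tasks out of order.
            Assumes nb divides A.shape[0].
            """
            n = A.shape[0] // nb
            T = lambda i, j: A[i*nb:(i+1)*nb, j*nb:(j+1)*nb]  # view of tile (i, j)
            for k in range(n):
                # POTRF(k): factor the diagonal tile.
                T(k, k)[:] = np.linalg.cholesky(T(k, k))
                for i in range(k + 1, n):
                    # TRSM(i,k): panel solve against L(k,k); depends on POTRF(k).
                    T(i, k)[:] = np.linalg.solve(T(k, k), T(i, k).T).T
                for i in range(k + 1, n):
                    # SYRK(i,k): trailing update; depends on TRSM(i,k).
                    T(i, i)[:] -= T(i, k) @ T(i, k).T
                    for j in range(k + 1, i):
                        # GEMM(i,j,k): depends on TRSM(i,k) and TRSM(j,k).
                        T(i, j)[:] -= T(i, k) @ T(j, k).T
            return np.tril(A)

        # Sanity check against the unblocked factorization.
        M = np.random.rand(8, 8)
        A = M @ M.T + 8 * np.eye(8)
        print(np.allclose(tiled_cholesky(A.copy(), 2), np.linalg.cholesky(A)))

    Only POTRF sits on the critical path; every TRSM, SYRK and GEMM at step k can run as soon as its input tiles are ready, which is what lets a dynamic scheduler hide the sequential part.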