1,743 research outputs found

    Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

    The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The task graph of the factorization step is made available to the two runtimes, giving them the opportunity to process and optimize its traversal in order to maximize the algorithm's efficiency on the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler and of the PaRSEC and StarPU frameworks, in different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve results comparable to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver in heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer. (Comment: Heterogeneity in Computing Workshop, 2014)
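    As a rough illustration of what handing the factorization's task graph to a runtime looks like, the sketch below expresses a tiled Cholesky factorization as a DAG of tasks. It uses OpenMP task dependences as a generic stand-in for the task-insertion APIs of PaRSEC and StarPU, and the tile kernels are empty stubs; none of this is PaStiX's actual code.

        #include <cstdio>
        #include <vector>

        // Hypothetical tile kernels (stubs); a real solver would call BLAS/LAPACK here.
        static void potrf(double*, int) { /* factor diagonal tile */ }
        static void trsm(const double*, double*, int) { /* triangular solve */ }
        static void syrk(const double*, double*, int) { /* symmetric rank-k update */ }
        static void gemm(const double*, const double*, double*, int) { /* general update */ }

        // The factorization is expressed as tasks with data dependences; the
        // runtime is then free to reorder ready tasks and map them onto the hardware.
        void cholesky(double **A, int nt, int nb) {  // A[i*nt + j] points to tile (i, j)
          #pragma omp parallel
          #pragma omp single
          for (int k = 0; k < nt; ++k) {
            #pragma omp task depend(inout: A[k*nt + k])
            potrf(A[k*nt + k], nb);
            for (int i = k + 1; i < nt; ++i) {
              #pragma omp task depend(in: A[k*nt + k]) depend(inout: A[i*nt + k])
              trsm(A[k*nt + k], A[i*nt + k], nb);
            }
            for (int i = k + 1; i < nt; ++i) {
              #pragma omp task depend(in: A[i*nt + k]) depend(inout: A[i*nt + i])
              syrk(A[i*nt + k], A[i*nt + i], nb);
              for (int j = k + 1; j < i; ++j) {
                #pragma omp task depend(in: A[i*nt + k], A[j*nt + k]) depend(inout: A[i*nt + j])
                gemm(A[i*nt + k], A[j*nt + k], A[i*nt + j], nb);
              }
            }
          }
        }

        int main() {
          const int nt = 3, nb = 4;
          std::vector<double> tiles(nt * nt * nb * nb, 1.0);
          std::vector<double*> ptrs(nt * nt);
          for (int i = 0; i < nt * nt; ++i) ptrs[i] = &tiles[i * nb * nb];
          cholesky(ptrs.data(), nt, nb);
          std::printf("task DAG executed\n");
        }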

    A compiler approach to scalable concurrent program design

    The programmer's most powerful tool for controlling complexity in program design is abstraction. We seek to use abstraction in the design of concurrent programs, so as to separate design decisions concerned with decomposition, communication, synchronization, mapping, granularity, and load balancing. This paper describes programming and compiler techniques intended to facilitate this design strategy. The programming techniques are based on a core programming notation with two important properties: the ability to separate concurrent programming concerns, and extensibility with reusable programmer-defined abstractions. The compiler techniques are based on a simple transformation system together with a set of compilation transformations and portable run-time support. The transformation system allows programmer-defined abstractions to be defined as source-to-source transformations that convert abstractions into the core notation. The same transformation system is used to apply compilation transformations that incrementally transform the core notation toward an abstract concurrent machine. This machine can be implemented on a variety of concurrent architectures using simple run-time support. The transformation, compilation, and run-time system techniques have been implemented and are incorporated in a public-domain program development toolkit. This toolkit operates on a wide variety of networked workstations, multicomputers, and shared-memory multiprocessors. It includes a program transformer, concurrent compiler, syntax checker, debugger, performance analyzer, and execution animator. A variety of substantial applications have been developed using the toolkit, in areas such as climate modeling and fluid dynamics.
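    The following minimal sketch, with entirely invented names, illustrates the layering the paper describes: a two-primitive core notation (spawn/sync) standing in for the abstract concurrent machine, and a programmer-defined abstraction (parallel_map) written as if a source-to-source transformer had already rewritten it into that notation.

        #include <cstddef>
        #include <cstdio>
        #include <functional>
        #include <thread>
        #include <vector>

        // Core notation of the hypothetical abstract concurrent machine: two
        // primitives, spawn and sync, that the compilation transformations target.
        struct Core {
          std::vector<std::thread> ts;
          void spawn(std::function<void()> f) { ts.emplace_back(std::move(f)); }
          void sync() { for (auto &t : ts) t.join(); ts.clear(); }
        };

        // What a source-to-source transformer might emit for a reusable abstraction
        // such as parallel_map(f, xs): the abstraction is rewritten into the core
        // notation above (granularity control would be a later transformation).
        template <class F, class T>
        void parallel_map(F f, std::vector<T> &xs) {
          Core core;
          for (std::size_t i = 0; i < xs.size(); ++i)
            core.spawn([&, i] { f(xs[i]); });  // one task per element
          core.sync();
        }

        int main() {
          std::vector<int> xs{1, 2, 3, 4};
          parallel_map([](int &x) { x *= 2; }, xs);
          for (int x : xs) std::printf("%d ", x);  // prints: 2 4 6 8
          std::printf("\n");
        }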

    07361 Abstracts Collection -- Programming Models for Ubiquitous Parallelism

    From 02.09. to 07.09.2007, the Dagstuhl Seminar 07361 "Programming Models for Ubiquitous Parallelism" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided where available.

    Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos (Performance and Energy-Efficiency Optimization in Massively Parallel Systems)

    Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, and are present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity means that they are usually used under the task paradigm and the host-device programming model, which strongly penalizes accelerator utilization and system energy consumption, and makes it difficult to adapt applications. Co-execution allows all devices to cooperate on the same problem simultaneously, consuming less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, which significantly complicates programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address generally conflicting objectives: usability and programmability are improved, while ensuring greater system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed.
    EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions and molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that adds co-execution support to the oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications. Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant) and the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) of the European Union's Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The work on hybrid programming models in Chapter 4 (Integration II) has been partially performed under the project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS).
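    A minimal sketch of the co-execution idea common to both runtimes: a single data-parallel range is split across devices in proportion to their throughput. Device names and throughput figures are illustrative placeholders; EngineCL and CoexecutorRuntime measure and adapt these at runtime.

        #include <cstddef>
        #include <iostream>
        #include <string>
        #include <utility>
        #include <vector>

        struct Device { std::string name; double throughput; };  // items/s, assumed measured

        // Proportional split of the index range [0, n) by relative throughput.
        std::vector<std::pair<std::size_t, std::size_t>>
        split(std::size_t n, const std::vector<Device> &devs) {
          double total = 0;
          for (const auto &d : devs) total += d.throughput;
          std::vector<std::pair<std::size_t, std::size_t>> ranges;
          std::size_t begin = 0;
          for (std::size_t i = 0; i < devs.size(); ++i) {
            std::size_t len = (i + 1 == devs.size())
                ? n - begin  // the last device takes the remainder
                : static_cast<std::size_t>(n * devs[i].throughput / total);
            ranges.push_back({begin, begin + len});
            begin += len;
          }
          return ranges;
        }

        int main() {
          std::vector<Device> devs{{"cpu", 1.0}, {"gpu", 4.0}};  // illustrative figures
          for (auto [b, e] : split(1000, devs))
            std::cout << "[" << b << ", " << e << ")\n";  // cpu: [0,200), gpu: [200,1000)
        }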

    A scalable architecture for ordered parallelism

    We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits. We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51--122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3--18×. (National Science Foundation (U.S.) Award CAREER-145299)
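    The sketch below is a sequential, software-only emulation of the timestamped-task model, not Swarm's hardware: tasks are dispatched in timestamp order and may enqueue new tasks with later timestamps, shown here for single-source shortest paths with the tentative distance as the timestamp.

        #include <cstdint>
        #include <functional>
        #include <iostream>
        #include <limits>
        #include <queue>
        #include <utility>
        #include <vector>

        struct Edge { int to; std::uint64_t w; };

        // Tasks are (timestamp, vertex) pairs, dispatched in timestamp order;
        // each task may create new tasks with later timestamps.
        std::vector<std::uint64_t>
        sssp(const std::vector<std::vector<Edge>> &g, int src) {
          const auto INF = std::numeric_limits<std::uint64_t>::max();
          std::vector<std::uint64_t> dist(g.size(), INF);
          using Task = std::pair<std::uint64_t, int>;
          std::priority_queue<Task, std::vector<Task>, std::greater<Task>> pq;
          pq.push({0, src});
          while (!pq.empty()) {
            auto [ts, v] = pq.top(); pq.pop();
            if (ts >= dist[v]) continue;  // v was already settled by an earlier task
            dist[v] = ts;
            for (auto [u, w] : g[v]) pq.push({ts + w, u});
          }
          return dist;
        }

        int main() {
          std::vector<std::vector<Edge>> g(3);
          g[0] = {{1, 5}, {2, 9}};
          g[1] = {{2, 2}};
          for (auto d : sssp(g, 0)) std::cout << d << " ";  // prints: 0 5 7
          std::cout << "\n";
        }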

    Parallelism Management for Co-Located Parallel Applications

    Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2020. Advisor: Bernhard Egger. Running multiple parallel jobs on the same multicore machine is becoming more important to improve utilization of the given hardware resources. While co-location of parallel jobs is common practice, it still remains a challenge for current parallel runtime systems to efficiently execute multiple parallel applications simultaneously. Conventional parallelization runtimes such as OpenMP generate a fixed number of worker threads, typically as many as there are cores in the system, to utilize all physical core resources. On such runtime systems, applications may not achieve their peak performance when given full use of all physical core resources. Moreover, the OS kernel needs to manage all worker threads generated by all running parallel applications, which may incur huge management costs as the number of co-located applications increases. In this thesis, we focus on improving runtime performance for co-located parallel applications. To achieve this goal, the first idea of this work is to use spatial scheduling to execute multiple co-located parallel applications simultaneously. Spatial scheduling, which provides distinct core resources to applications, is considered a promising and scalable approach for executing co-located applications. Despite the growing importance of spatial scheduling, there are still two fundamental research issues with this approach. First, spatial scheduling requires runtime support for parallel applications to run efficiently under spatial core allocations that can change at runtime. Second, the scheduler needs to assign the proper number of core resources to applications, depending on the applications' performance characteristics, for better runtime performance. To this end, in this thesis, we present three novel runtime-level techniques to efficiently execute co-located parallel applications with spatial scheduling. First, we present a cooperative runtime technique that provides malleable parallel execution for OpenMP parallel applications. Malleable execution means that applications can dynamically adapt their degree of parallelism to the varying core resource availability. It allows parallel applications to run efficiently under changing core resource availability, compared to conventional runtime systems that do not adjust the degree of parallelism of the application. Second, this thesis introduces an analytical performance model that can estimate resource utilization and the performance of parallel programs as a function of the provided core resources. We observe that the performance of parallel loops is typically limited by memory performance, and employ queueing theory to model the memory performance. The queueing-system-based approach allows us to estimate the performance by using closed-form equations and hardware performance counters. Third, we present a core allocation framework to manage core resources between co-located parallel applications. With analytical modeling, we observe that maximizing both CPU utilization and memory bandwidth usage can generally lead to better performance compared to conventional core allocation policies that maximize only CPU usage.
    The presented core allocation framework optimizes the utilization of the multi-dimensional resources of CPU cores and memory bandwidth on multi-socket multicore systems, based on the cooperative parallel runtime support and the analytical model. Thesis outline: 1 Introduction; 2 Dynamic Spatial Scheduling with Cooperative Runtime Systems; 3 Performance Modeling of Parallel Loops using Queueing Systems; 4 Maximizing System Utilization via Parallelism Management; 5 Conclusion and Future Work; Appendix A Additional Experiments for the Performance Model; Appendix B Other Research Contributions of the Author.
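    As a rough sketch of what malleable execution means at the code level, the loop below re-reads the number of granted cores before each parallel region and adapts its degree of parallelism; the file-based scheduler interface is an invented stand-in for the thesis's cooperative runtime and spatial scheduler (build with -fopenmp).

        #include <cstdio>
        #include <fstream>
        #include <omp.h>

        // Hypothetical scheduler channel: the number of currently granted cores,
        // published by a (fictional) spatial scheduler in a plain file.
        static int granted_cores(int fallback) {
          std::ifstream f("/tmp/granted_cores");
          int n = 0;
          return (f >> n && n > 0) ? n : fallback;
        }

        int main() {
          const int N = 1 << 20;
          static double a[1 << 20] = {0};
          for (int step = 0; step < 100; ++step) {
            int nthreads = granted_cores(omp_get_max_threads());  // re-read the allocation
            #pragma omp parallel for num_threads(nthreads) schedule(dynamic, 4096)
            for (int i = 0; i < N; ++i)
              a[i] = a[i] * 0.5 + 1.0;  // stand-in parallel loop body
          }
          std::printf("done: a[0] = %f\n", a[0]);
        }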

    Just-In-Time Locality and Percolation for Optimizing Irregular Applications on a Manycore Architecture


    Sigmoid: An auto-tuned load balancing algorithm for heterogeneous systems

    A challenge that heterogeneous system programmers face is leveraging the performance of all the devices that integrate the system. This paper presents Sigmoid, a new load balancing algorithm that efficiently co-executes a single OpenCL data-parallel kernel on all the devices of a heterogeneous system. Sigmoid splits the workload proportionally to the capabilities of the devices, drastically reducing response time and energy consumption. It is designed around several features: it is dynamic, adaptive, guided and effortless, as it does not require the user to supply any parameter, adapting to the behaviour of each kernel at runtime. To evaluate Sigmoid's performance, it has been implemented in Maat, a system abstraction library. Experimental results with different kernel types show that Sigmoid exhibits excellent performance, reaching a utilization of 90%, together with energy savings of up to 20%, always reducing programming effort compared to OpenCL, and facilitating portability to other heterogeneous machines. This work has been supported by the Spanish Science and Technology Commission under contract PID2019-105660RB-C22 and the European HiPEAC Network of Excellence.
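    In the spirit of Sigmoid (though not its actual algorithm), the sketch below shows a guided, throughput-adaptive split: each device repeatedly claims a chunk sized by the remaining work and its measured speed, with speeds re-estimated from observed execution times. Two host threads stand in for the devices; all figures are illustrative.

        #include <algorithm>
        #include <atomic>
        #include <chrono>
        #include <cstdio>
        #include <thread>

        int main() {
          const long total = 1'000'000;
          std::atomic<long> next{0};                    // next unclaimed item
          std::atomic<double> speed[2] = {1.0, 4.0};    // items/us, initial guesses
          auto worker = [&](int dev) {
            for (;;) {
              long begin = next.load();
              if (begin >= total) break;
              double s0 = speed[0].load(), s1 = speed[1].load();
              double mine = (dev == 0) ? s0 : s1;
              long remaining = total - begin;
              // Guided-style chunk: a fraction of the remaining work, scaled by
              // this device's share of the total measured speed.
              long chunk = std::max<long>(1024,
                  static_cast<long>(0.1 * remaining * mine / (s0 + s1)));
              begin = next.fetch_add(chunk);
              if (begin >= total) break;
              long end = std::min(begin + chunk, total);
              auto t0 = std::chrono::steady_clock::now();
              volatile double sink = 0;
              for (long i = begin; i < end; ++i) sink += i * 1e-9;  // stand-in kernel
              double us = std::chrono::duration<double, std::micro>(
                              std::chrono::steady_clock::now() - t0).count();
              speed[dev].store(0.5 * mine + 0.5 * (end - begin) / (us + 1e-3));  // smoothed
            }
          };
          std::thread a(worker, 0), b(worker, 1);
          a.join(); b.join();
          std::printf("processed %ld items\n", total);
        }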

    Structuring the execution of OpenMP applications for multicore architectures

    The now commonplace multi-core chips have introduced, by design, a deep hierarchy of memory and cache banks within parallel computers, as a tradeoff between the user friendliness of shared memory on the one side, and memory access scalability and efficiency on the other. However, getting high performance out of such machines requires a dynamic mapping of application tasks and data onto the underlying architecture. Moreover, depending on the application behavior, this mapping should favor cache affinity, memory bandwidth, computation synchrony, or a combination of these. The great challenge is then to perform this hardware-dependent mapping in a portable, abstract way. To meet this need, we propose a new, hierarchical approach to the execution of OpenMP threads on multicore machines. Our ForestGOMP runtime system dynamically generates structured trees out of OpenMP programs and collects relationship information about threads and data as well. This information is used, together with scheduling hints and hardware counter feedback, by the scheduler to select the most appropriate thread and data distribution. ForestGOMP features a high-level platform for developing and tuning portable thread schedulers. We present several applications for which we developed specific scheduling policies that achieve excellent speedups on 16-core machines.
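    The nesting structure ForestGOMP exploits can be sketched with plain OpenMP: an outer team spread across NUMA nodes and inner teams packed close to their node, forming the tree a topology-aware scheduler maps onto the machine. ForestGOMP derives such trees automatically and applies its own scheduling policies; the proc_bind clauses below are standard OpenMP, not its API (build with -fopenmp).

        #include <cstdio>
        #include <omp.h>

        int main() {
          omp_set_max_active_levels(2);  // allow one level of nested parallelism
          #pragma omp parallel num_threads(2) proc_bind(spread)   // outer team: one thread per NUMA node
          {
            int node = omp_get_thread_num();
            #pragma omp parallel num_threads(4) proc_bind(close)  // inner team: threads near their node
            {
              std::printf("node %d, worker %d\n", node, omp_get_thread_num());
            }
          }
        }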