4 research outputs found

    MIPP: A Microbenchmark Suite for Performance, Power, and Energy Consumption Characterization of GPU architectures

    Get PDF
    GPU-accelerated applications are becoming increasingly common in high-performance computing as well as in low-power heterogeneous embedded systems. Nevertheless, GPU programming is a challenging task, especially if a GPU application has to be tuned to fully take advantage of the GPU architectural configuration. Even more challenging is the application tuning by considering power and energy consumption, which have emerged as first-order design constraints in additionto performance. Solving bottlenecks of a GPU application such as high thread divergence or poor memory coalescing have a different impact on the overall performance, power and energy consumption. Such an impact also depends on the GPU device on which the application is run. This paper presents a suite of microbenchmarks, which are specialized chunks of GPU code that exercise specific device components (e.g., arithmetic instruction units, shared memory, cache, DRAM, etc.) and that provide the actual characteristics of such components in terms of throughput, power, and energy consumption. The suite aims at enriching standard profiler information and guiding the GPU application tuning on a specific GPU architecture by considering all three design constraints (i.e., power, performance, energy consumption). The paper presents the results obtained by applying the proposed suite to characterize two different GPU devices and to understand how application tuning may impact differently on them

    Power-aware Performance Tuning of GPU Applications Through Microbenchmarking

    Get PDF
    Tuning GPU applications is a very challenging task as any source-code optimization can sensibly impact performance, power, and energy consumption of the GPU device. Such an impact also depends on the GPU on which the application is run. This paper presents a suite of microbenchmarks that provides the actual characteristics of specific GPU device components (e.g., arithmetic instruction units, memories, etc.) in terms of throughput, power, and energy consumption. It shows how the suite can be combined to standard profiler information to efficiently drive the application tuning by considering the three design constraints (power, performance, energy consumption) and the characteristics of the target GPU device

    High-Performance and Power-Aware Graph Processing on GPUs

    Get PDF
    Graphs are a common representation in many problem domains, including engineering, finance, medicine, and scientific applications. Different problems map to very large graphs, often involving millions of vertices. Even though very efficient sequential implementations of graph algorithms exist, they become impractical when applied on such actual very large graphs. On the other hand, graphics processing units (GPUs) have become widespread architectures as they provide massive parallelism at low cost. Parallel execution on GPUs may achieve speedup up to three orders of magnitude with respect to the sequential counterparts. Nevertheless, accelerating efficient and optimized sequential algorithms and porting (i.e., parallelizing) their implementation to such many-core architectures is a very challenging task. The task is made even harder since energy and power consumption are becoming constraints in addition, or in same case as an alternative, to performance. This work aims at developing a platform that provides (I) a library of parallel, efficient, and tunable implementations of the most important graph algorithms for GPUs, and (II) an advanced profiling model to analyze both performance and power consumption of the algorithm implementations. The platform goal is twofold. Through the library, it aims at saving developing effort in the parallelization task through a primitive-based approach. Through the profiling framework, it aims at customizing such primitives by considering both the architectural details and the target efficiency metrics (i.e., performance or power)

    Optimizaci贸n del rendimiento y la eficiencia energ茅tica en sistemas masivamente paralelos

    Get PDF
    RESUMEN Los sistemas heterog茅neos son cada vez m谩s relevantes, debido a sus capacidades de rendimiento y eficiencia energ茅tica, estando presentes en todo tipo de plataformas de c贸mputo, desde dispositivos embebidos y servidores, hasta nodos HPC de grandes centros de datos. Su complejidad hace que sean habitualmente usados bajo el paradigma de tareas y el modelo de programaci贸n host-device. Esto penaliza fuertemente el aprovechamiento de los aceleradores y el consumo energ茅tico del sistema, adem谩s de dificultar la adaptaci贸n de las aplicaciones. La co-ejecuci贸n permite que todos los dispositivos cooperen para computar el mismo problema, consumiendo menos tiempo y energ铆a. No obstante, los programadores deben encargarse de toda la gesti贸n de los dispositivos, la distribuci贸n de la carga y la portabilidad del c贸digo entre sistemas, complicando notablemente su programaci贸n. Esta tesis ofrece contribuciones para mejorar el rendimiento y la eficiencia energ茅tica en estos sistemas masivamente paralelos. Se realizan propuestas que abordan objetivos generalmente contrapuestos: se mejora la usabilidad y la programabilidad, a la vez que se garantiza una mayor abstracci贸n y extensibilidad del sistema, y al mismo tiempo se aumenta el rendimiento, la escalabilidad y la eficiencia energ茅tica. Para ello, se proponen dos motores de ejecuci贸n con enfoques completamente distintos. EngineCL, centrado en OpenCL y con una API de alto nivel, favorece la m谩xima compatibilidad entre todo tipo de dispositivos y proporciona un sistema modular extensible. Su versatilidad permite adaptarlo a entornos para los que no fue concebido, como aplicaciones con ejecuciones restringidas por tiempo o simuladores HPC de din谩mica molecular, como el utilizado en un centro de investigaci贸n internacional. Considerando las tendencias industriales y enfatizando la aplicabilidad profesional, CoexecutorRuntime proporciona un sistema flexible centrado en C++/SYCL que dota de soporte a la co-ejecuci贸n a la tecnolog铆a oneAPI. Este runtime acerca a los programadores al dominio del problema, posibilitando la explotaci贸n de estrategias din谩micas adaptativas que mejoran la eficiencia en todo tipo de aplicaciones.ABSTRACT Heterogeneous systems are becoming increasingly relevant, due to their performance and energy efficiency capabilities, being present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity implies that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications. Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming. This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant), the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union鈥檚 Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The Integration II: Hybrid programming models of Chapter 4 has been partially performed under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS)
    corecore