4 research outputs found
Pro++: A Profiling Framework for Primitive-based GPU Programming
Parallelizing software applications through the use of existing optimized primitives is a common trend that mediates the complexity of manual parallelization and avoids less efficient directive-based programming models. Parallel primitive libraries allow software engineers to map any sequential code to a target many-core architecture by identifying the most computationally intensive code sections and mapping them onto one or more existing primitives. On the other hand, the spread of this primitive-based programming model and the diversity of GPU architectures have led to a large and increasing number of third-party libraries, which often provide different implementations of the same primitive, each optimized for a specific architecture. From the developer's point of view, this shifts the actual problem of parallelizing the software application to selecting, among the several implementations, the most efficient primitives for the target platform. This paper presents Pro++, a profiling framework for GPU primitives that allows measuring the implementation quality of a given primitive by considering the characteristics of the target architecture. The framework collects the information provided by a standard GPU profiler and combines it into optimization criteria. The criteria evaluations are weighted to distinguish the impact of each optimization on the overall quality of the primitive implementation. The paper shows how the different weights have been tuned through the analysis of five of the most widespread primitive libraries, and how the framework has eventually been applied to improve the implementation performance of two standard and widespread primitives.
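The core idea of combining profiler counters into weighted optimization criteria can be sketched as follows. This is an illustrative sketch only, not the actual Pro++ API: the metric names, criteria, and weights are invented for the example.

```python
# Hypothetical sketch of Pro++-style weighted quality scoring.
# Each criterion maps raw profiler counters to a score in [0, 1];
# the weights model each optimization's impact on overall quality.

def occupancy_score(metrics):
    # Fraction of the GPU's warp slots the kernel keeps busy.
    return metrics["achieved_occupancy"]

def coalescing_score(metrics):
    # Ratio of requested to actually transferred global-memory bytes:
    # 1.0 means fully coalesced accesses, lower means wasted bandwidth.
    return metrics["requested_bytes"] / metrics["transferred_bytes"]

def quality(metrics, weights):
    # Weighted average of the criteria evaluations.
    criteria = {
        "occupancy": occupancy_score(metrics),
        "coalescing": coalescing_score(metrics),
    }
    total = sum(weights.values())
    return sum(weights[name] * criteria[name] for name in criteria) / total

profile = {"achieved_occupancy": 0.75,
           "requested_bytes": 4096, "transferred_bytes": 8192}
print(quality(profile, {"occupancy": 0.6, "coalescing": 0.4}))  # 0.65
```

Tuning then amounts to adjusting the weight vector until the scores rank a set of reference implementations in the order their measured performance dictates.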
A Code Generation Framework for Targeting Optimized Library Calls for Multiple Platforms
Directive-based programming approaches such as OpenMP and OpenACC have gained popularity due to their ease of programming. These programming models typically involve adding compiler directives to code sections such as loops in order to parallelize them for execution on multicore CPUs or GPUs. However, one problem with this approach is that existing compilers generate code directly from the annotated sections and do not make use of hardware-specific architectural features. As a result, the generated code is unable to fully exploit the capabilities of the underlying hardware. Alternatively, we propose a code generation framework in which linear algebraic operations in the annotated code are recognized, extracted, and mapped to optimized, vendor-provided, platform-specific library calls. We demonstrate that such an approach can result in better performance than the code generated by existing compilers. This is substantiated by experimental results on multicore CPUs and GPUs.
ASTAR (Agency for Sci., Tech. and Research, S'pore). Accepted version.
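The lowering step described above can be sketched as a mapping from recognized operations to per-platform library calls. The sketch below is illustrative of the idea only (the dispatch table and the `lower` helper are invented for this example, not the actual framework); `cblas_dgemm` and `cublasDgemm` are real BLAS/cuBLAS entry points used here as target names.

```python
# Recognized linear-algebra operation -> optimized call per target platform.
LIBRARY_CALLS = {
    ("matmul", "cpu"): "cblas_dgemm",   # vendor BLAS on multicore CPUs
    ("matmul", "gpu"): "cublasDgemm",   # cuBLAS on NVIDIA GPUs
    ("dot",    "cpu"): "cblas_ddot",
    ("dot",    "gpu"): "cublasDdot",
}

def lower(op, target):
    """Map a recognized operation to an optimized library call, falling
    back to compiler-generated loop code when no mapping exists."""
    return LIBRARY_CALLS.get((op, target), f"generated_loop_{op}")

print(lower("matmul", "gpu"))   # cublasDgemm
print(lower("stencil", "cpu"))  # generated_loop_stencil (fallback)
```

The fallback path mirrors the paper's premise: only the recognized linear-algebraic sections are redirected to the library, while everything else keeps the compiler-generated code.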
Design and implementation of a probability-based dynamic implementation selector
The progressive adoption of heterogeneous platforms as a way to increase the performance of computing systems has created the need for solutions that let applications make efficient use of all the components found in these systems. Because these elements vary in nature and differ in performance depending on the operations they execute, different approaches are being published to exploit the capabilities of each resource while providing developers with tools that ease this task.
Against this background, the present work provides a new solution to cover these needs through the development of an implementation selector whose selection algorithm is based on a probabilistic model developed by Javier Fernández Muñoz, professor at Universidad Carlos III de Madrid.
Thus, given a heterogeneous system with multiple processing units of different kinds and a source code that specifies an implementation for each unit, the presented solution provides the mechanisms needed to identify those implementations and the devices on which they must run, and finally selects which implementation is the most suitable in terms of execution time.
The development follows the agile SCRUM methodology, and a comparative analysis against the "versioning" scheduler of the OmpSs framework is carried out to evaluate the selector's effectiveness.
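A selector of this kind can be sketched as follows. This is a minimal illustration of the general idea, not the probabilistic model by Javier Fernández Muñoz that the thesis uses: here each device keeps a running mean of observed execution times, and the selector picks a device with probability inversely proportional to that mean, so faster implementations are chosen more often while slower ones still get occasional runs that keep their estimates fresh.

```python
import random

class Selector:
    def __init__(self, devices, prior=1.0):
        # Per-device [total_time, runs]; the prior keeps every device
        # selectable before any real measurement exists.
        self.stats = {d: [prior, 1] for d in devices}

    def mean(self, d):
        total, runs = self.stats[d]
        return total / runs

    def pick(self):
        # Sample a device with probability proportional to 1 / mean time.
        weights = {d: 1.0 / self.mean(d) for d in self.stats}
        r = random.random() * sum(weights.values())
        for d, w in weights.items():
            r -= w
            if r <= 0:
                return d
        return d  # guard against floating-point rounding

    def record(self, d, elapsed):
        self.stats[d][0] += elapsed
        self.stats[d][1] += 1

random.seed(0)                 # make the sketch reproducible
sel = Selector(["cpu", "gpu"])
for _ in range(100):
    d = sel.pick()
    sel.record(d, 0.02 if d == "gpu" else 0.10)  # simulated timings
print(sel.mean("gpu"), sel.mean("cpu"))  # gpu's estimate converges lower
```

The positive feedback is intentional: as a device's estimated time drops, it is sampled more often, which in turn refines its estimate, while the inverse weighting prevents any device from being starved entirely.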
High-Performance and Power-Aware Graph Processing on GPUs
Graphs are a common representation in many problem domains, including engineering, finance, medicine, and scientific applications. Many problems map to very large graphs, often involving millions of vertices. Even though very efficient sequential implementations of graph algorithms exist, they become impractical when applied to such very large graphs. On the other hand, graphics processing units (GPUs) have become widespread architectures as they provide massive parallelism at low cost. Parallel execution on GPUs may achieve speedups of up to three orders of magnitude with respect to the sequential counterparts. Nevertheless, accelerating efficient and optimized sequential algorithms and porting (i.e., parallelizing) their implementations to such many-core architectures is a very challenging task. The task is made even harder since energy and power consumption are becoming constraints in addition to, or in some cases as an alternative to, performance. This work aims at developing a platform that provides (i) a library of parallel, efficient, and tunable implementations of the most important graph algorithms for GPUs, and (ii) an advanced profiling model to analyze both performance and power consumption of the algorithm implementations. The platform's goal is twofold. Through the library, it aims at saving development effort in the parallelization task through a primitive-based approach. Through the profiling framework, it aims at customizing such primitives by considering both the architectural details and the target efficiency metrics (i.e., performance or power).
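A representative primitive in such libraries is level-synchronous breadth-first search, the formulation GPU graph frameworks typically parallelize. The sketch below shows the algorithmic structure only; the per-frontier loops are sequential Python standing in for data-parallel kernel launches, and the function name is invented for illustration.

```python
from collections import defaultdict

def bfs_levels(edges, source):
    # Build an undirected adjacency list.
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    level = {source: 0}
    frontier = [source]
    depth = 0
    # Each iteration expands the whole frontier; on a GPU this is one
    # kernel launch with a thread (or warp) per frontier vertex.
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                if v not in level:        # on a GPU: atomic visited-flag update
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
print(bfs_levels(edges, 0))  # {0: 0, 1: 1, 2: 1, 3: 2, 4: 3}
```

The tuning the work describes concerns exactly these parallelized steps, e.g. how frontier vertices are assigned to threads and how visited flags are updated, which affect both performance and power draw.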