
    Pro++: A Profiling Framework for Primitive-based GPU Programming

    Parallelizing software applications through existing optimized primitives is a common trend that mediates between the complexity of manual parallelization and the use of less efficient directive-based programming models. Parallel primitive libraries allow software engineers to map any sequential code to a target many-core architecture by identifying the most computationally intensive code sections and mapping them onto one or more existing primitives. On the other hand, the spread of such primitive-based programming models and the diversity of GPU architectures have led to a large and increasing number of third-party libraries, which often provide different implementations of the same primitive, each one optimized for a specific architecture. From the developer's point of view, this shifts the actual problem from parallelizing the software application to selecting, among the several implementations, the most efficient primitives for the target platform. This paper presents Pro++, a profiling framework for GPU primitives that measures the implementation quality of a given primitive by considering the characteristics of the target architecture. The framework collects the information provided by a standard GPU profiler and combines it into optimization criteria. The criteria evaluations are weighted to distinguish the impact of each optimization on the overall quality of the primitive implementation. The paper shows how the tuning of the different weights was conducted through the analysis of five of the most widespread primitive libraries, and how the framework was eventually applied to improve the implementation performance of two standard and widespread primitives.
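The weighting scheme described above can be sketched as a normalized weighted sum of per-criterion evaluations. The criterion names, values, and weights below are illustrative assumptions, not Pro++'s actual metrics:

```python
# Illustrative sketch (not Pro++'s actual API): combining normalized
# GPU-profiler-derived criteria into a single weighted quality score.

def primitive_quality(metrics, weights):
    """Weighted average of criterion scores, each assumed to lie in [0, 1]."""
    assert set(metrics) == set(weights)
    total_weight = sum(weights.values())
    return sum(weights[c] * metrics[c] for c in metrics) / total_weight

# Hypothetical criteria derived from standard GPU profiler counters.
metrics = {
    "occupancy": 0.85,          # achieved / theoretical occupancy
    "memory_coalescing": 0.70,  # fraction of coalesced global accesses
    "branch_efficiency": 0.95,  # non-divergent branch ratio
}
# Hypothetical weights expressing each criterion's impact on overall quality.
weights = {"occupancy": 2.0, "memory_coalescing": 3.0, "branch_efficiency": 1.0}

score = primitive_quality(metrics, weights)
```

Because the result stays in [0, 1], scores of different primitive implementations can be compared directly once the weights have been tuned.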

    A Code Generation Framework for Targeting Optimized Library Calls for Multiple Platforms

    Directive-based programming approaches such as OpenMP and OpenACC have gained popularity due to their ease of programming. These programming models typically involve adding compiler directives to code sections such as loops in order to parallelize them for execution on multicore CPUs or GPUs. However, one problem with this approach is that existing compilers generate code directly from the annotated sections and do not make use of hardware-specific architectural features. As a result, the generated code is unable to fully exploit the capabilities of the underlying hardware. Alternatively, we propose a code generation framework in which linear algebraic operations in the annotated code are recognized, extracted, and mapped to optimized vendor-provided, platform-specific library calls. We demonstrate that this approach can yield better performance than code generated by existing compilers, as substantiated by experimental results on multicore CPUs and GPUs.
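The mapping step can be illustrated with a minimal table-driven sketch. The pattern names and dispatch logic are assumptions for illustration; `cblas_dgemm` and `cublasDgemm` are real BLAS/cuBLAS entry points such a framework could target:

```python
# Hypothetical sketch of the lowering step: once a linear-algebra pattern
# has been recognized in an annotated loop nest, it is replaced by an
# optimized, platform-specific library call.

PLATFORM_CALLS = {
    ("matmul", "cpu"): "cblas_dgemm",   # e.g. Intel MKL / OpenBLAS
    ("matmul", "gpu"): "cublasDgemm",   # NVIDIA cuBLAS
    ("dot",    "cpu"): "cblas_ddot",
    ("dot",    "gpu"): "cublasDdot",
}

def lower_to_library_call(pattern, target):
    """Return the library call for a recognized pattern, or None if there
    is no match and the compiler should fall back to directive-based code."""
    return PLATFORM_CALLS.get((pattern, target))

call = lower_to_library_call("matmul", "gpu")
```

The fallback path matters in practice: loops that do not match any known pattern still compile through the ordinary directive-based route.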

    Design and Implementation of a Dynamic Implementation Selector Based on Probabilities

    The steady adoption of heterogeneous platforms as a way to increase the performance of computing systems has created the need for solutions that let applications use every component of such systems efficiently. Since each element in this kind of platform delivers different performance depending on the operations assigned to it, new approaches are emerging that exploit the capabilities of each resource while giving developers tools that ease this task. Against this background, the present work provides a new solution to these needs: an implementation selector whose selection algorithm is based on a probabilistic model developed by professor Javier Fernández Muñoz of Universidad Carlos III de Madrid. Given a heterogeneous system with multiple processing units of different kinds and a source code that specifies an implementation for each unit, the proposed solution identifies those implementations and the devices they must run on, and selects the implementation that is most suitable in terms of execution time. The development followed the SCRUM agile methodology, and a comparative analysis against the "versioning" scheduler of the OmpSs framework was carried out to evaluate the selector's effectiveness.
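A minimal sketch of such a selector follows. The inverse-time weighting is an assumption for illustration, not the actual probabilistic model from the thesis: each implementation is chosen with probability inversely proportional to its running-average execution time, learned from observed runs:

```python
import random

# Illustrative probabilistic selector (assumed scheme, not the thesis's
# actual algorithm): faster implementations are chosen more often, while
# slower ones are still occasionally re-sampled so their estimates stay fresh.

class ProbabilisticSelector:
    def __init__(self, devices):
        self.avg_time = {d: 1.0 for d in devices}  # optimistic prior (seconds)
        self.runs = {d: 0 for d in devices}

    def choose(self):
        """Sample a device with probability proportional to 1 / avg_time."""
        devices = list(self.avg_time)
        weights = [1.0 / self.avg_time[d] for d in devices]
        return random.choices(devices, weights=weights)[0]

    def record(self, device, elapsed):
        """Fold an observed execution time into the running average."""
        self.runs[device] += 1
        n = self.runs[device]
        self.avg_time[device] += (elapsed - self.avg_time[device]) / n

sel = ProbabilisticSelector(["cpu", "gpu"])
sel.record("cpu", 4.0)   # CPU implementation took 4 s
sel.record("gpu", 1.0)   # GPU implementation took 1 s
# The GPU implementation is now about 4x more likely to be chosen next.
```

Sampling rather than always picking the current best keeps the selector adaptive when relative device performance changes across inputs.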

    High-Performance and Power-Aware Graph Processing on GPUs

    Graphs are a common representation in many problem domains, including engineering, finance, medicine, and scientific applications. Many problems map to very large graphs, often involving millions of vertices. Even though very efficient sequential implementations of graph algorithms exist, they become impractical when applied to such very large graphs. On the other hand, graphics processing units (GPUs) have become widespread architectures, as they provide massive parallelism at low cost. Parallel execution on GPUs may achieve speedups of up to three orders of magnitude over the sequential counterparts. Nevertheless, accelerating efficient and optimized sequential algorithms by porting (i.e., parallelizing) their implementations to such many-core architectures is a very challenging task. The task is made even harder since energy and power consumption are becoming constraints in addition to, or in some cases as an alternative to, performance. This work aims at developing a platform that provides (i) a library of parallel, efficient, and tunable implementations of the most important graph algorithms for GPUs, and (ii) an advanced profiling model to analyze both the performance and the power consumption of the algorithm implementations. The goal of the platform is twofold. Through the library, it aims at saving development effort in the parallelization task through a primitive-based approach. Through the profiling framework, it aims at customizing such primitives by considering both the architectural details and the target efficiency metrics (i.e., performance or power).
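For reference, the kind of algorithm such a library would provide can be shown as a minimal sequential BFS; on a GPU, the queue-based loop below would be replaced by a parallel, frontier-based primitive tuned for either performance or power:

```python
from collections import deque

# Minimal sequential BFS for reference. A GPU graph library of the kind
# described above would expose this as a tunable frontier-based primitive,
# processing all vertices of each frontier in parallel.

def bfs_levels(adj, source):
    """Return the BFS level (hop distance) of every vertex reachable
    from `source` in the adjacency-list graph `adj`."""
    level = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v in adj[u]:
            if v not in level:
                level[v] = level[u] + 1
                frontier.append(v)
    return level

# Tiny example graph: 0 -> {1, 2}, 1 -> {3}, 2 -> {3}.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
levels = bfs_levels(adj, 0)
```

On millions of vertices this sequential loop becomes the bottleneck the abstract describes, which is exactly where the parallel primitive pays off.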