
    Multi-Stage Programming for GPUs in Modern C++ using PACXX

    Writing and optimizing programs for high performance on systems with GPUs remains a challenging task even for expert programmers. One promising optimization technique is to evaluate parts of the program upfront on the CPU and embed the computed results in the GPU code, allowing for more aggressive compiler optimizations. This technique is known as multi-stage programming and has been shown to yield significant performance benefits. Unfortunately, to achieve such optimizations in current GPU programming models like OpenCL, programmers are forced to manipulate the GPU source code as plain strings, which is error-prone and type-unsafe. In this paper we describe PACXX, a GPU programming approach using modern C++ standards with convenient features such as type deduction, lambda expressions, and algorithms from the Standard Template Library (STL). Using PACXX, a GPU program is written as a single C++ program, rather than two distinct host and kernel programs. We extend PACXX with an easy-to-use and type-safe API for multi-stage programming that avoids the pitfalls of string manipulation. Using just-in-time compilation techniques, PACXX generates efficient GPU code at runtime. Our evaluation shows that PACXX makes writing multi-stage code easier and safer than currently possible. Using two detailed application studies we show that multi-stage programming can significantly outperform equivalent non-staged programs. Furthermore, we show that PACXX generates code with performance comparable to industrial-strength OpenCL compilers.
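    The staging idea behind this approach can be sketched in plain C++. The stage() helper below is a hypothetical stand-in rather than the actual PACXX API: in a real multi-stage setting the staged value would be evaluated on the host first and embedded into the JIT-compiled GPU kernel as a literal constant, enabling loop unrolling and constant folding; here it is only a host-side placeholder so the sketch compiles and runs on its own.

        #include <cstdio>
        #include <vector>

        // Hypothetical stand-in for a multi-staging marker: in a staged GPU
        // compiler the argument would be evaluated now (stage 1, on the CPU)
        // and baked into the generated kernel as a literal constant (stage 2).
        template <typename T>
        T stage(T value) { return value; }

        int main() {
            std::vector<float> data(1024, 1.0f);

            // The vector size is known before the kernel is JIT-compiled, so a
            // staging compiler could specialize the kernel for exactly this
            // bound, unrolling the loop and removing runtime bounds checks.
            const std::size_t n = stage(data.size());

            float sum = 0.0f;
            for (std::size_t i = 0; i < n; ++i)  // on the GPU, this loop body would be the kernel
                sum += data[i] * 2.0f;

            std::printf("sum = %f\n", sum);
            return 0;
        }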

    Towards Composable GPU Programming: Programming GPUs with Eager Actions and Lazy Views

    In this paper, we advocate a composable approach to programming systems with Graphics Processing Units (GPUs): programs are developed as compositions of generic, reusable patterns. Current GPU programming approaches either rely on low-level, monolithic code without patterns (CUDA and OpenCL), which achieves high performance at the cost of cumbersome and error-prone programming, or they improve programmability by using pattern-based abstractions (e.g., Thrust) but pay a performance penalty due to inefficient implementations of pattern composition. We develop a C++ API for GPU programming with STL-style patterns and its compiler-based implementation. Our API gives application developers native C++ means (views and actions) to specify precisely which pattern compositions should be automatically fused during code generation into a single efficient GPU kernel, thereby ensuring high target performance. We implement our approach by extending the range-v3 library, which is currently being developed for the forthcoming C++ standards. Composable programming in our approach is done exclusively in standard C++14, with STL algorithms used as patterns which we re-implemented in parallel for GPUs. Our compiler implementation is based on the LLVM and Clang frameworks, and we use advanced multi-stage programming techniques for aggressive runtime optimizations. We experimentally evaluate our approach using a set of benchmark applications and a real-world case study from the area of image processing. Our codes achieve performance competitive with monolithic CUDA implementations, and we outperform pattern-based codes written using Nvidia's Thrust.
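    The distinction between lazy views and eager evaluation that the paper builds on can be illustrated with standard C++ ranges. Note this is only a CPU-side sketch of the composition style using C++20 std::ranges; the paper itself targets C++14 with the range-v3 library and generates GPU kernels from such pipelines.

        #include <cstdio>
        #include <numeric>
        #include <ranges>
        #include <vector>

        int main() {
            std::vector<int> xs(1000);
            std::iota(xs.begin(), xs.end(), 1);

            // Lazy views: nothing is computed here; the two transformations are
            // merely composed into a single pipeline description.
            auto pipeline = xs
                | std::views::transform([](int x) { return x * x; })
                | std::views::filter([](int x) { return x % 2 == 0; });

            // Eager step: the fused pipeline is evaluated in a single pass.
            // In the paper's approach this evaluation is code-generated as one
            // GPU kernel instead of a chain of temporary buffers.
            long long sum = 0;
            for (int v : pipeline) sum += v;

            std::printf("sum = %lld\n", sum);
            return 0;
        }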

    Multi-Device Controllers: A Library To Simplify The Parallel Heterogeneous Programming

    Current HPC clusters are composed of several machines with different computation capabilities and different kinds and families of accelerators. Programming efficiently for these heterogeneous systems has become an important challenge. There are many proposals to simplify the programming and management of accelerator devices, as well as hybrid programming that mixes accelerators and CPU cores. However, in many cases portability compromises efficiency on different devices, and there are details concerning the coordination of different types of devices that must still be tackled by the programmer. In this work, we introduce the Multi-Controller, an abstract entity implemented in a library that coordinates the management of heterogeneous devices, including accelerators with different capabilities and sets of CPU cores. Our proposal improves on state-of-the-art solutions, simplifying data partitioning, mapping, and the transparent deployment of both simple generic kernels, portable across different device types, and specialized implementations defined and optimized using specific native or vendor programming models (such as CUDA for NVIDIA GPUs, or OpenMP for CPU cores). The runtime system automatically selects and deploys the most appropriate implementation of each kernel for each device, managing data movements and hiding the launch details. The results of an experimental study with five case studies indicate that our abstraction allows the development of flexible and highly efficient programs that adapt to the heterogeneous environment. This work was supported by MICINN (Spain) and the ERDF program of the European Union: HomProg-HetSys project (TIN2014-58876-P), CAPAP-H6 (TIN2016-81840-REDT), and COST Action IC1305: Network for Sustainable Ultrascale Computing (NESUS).
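    The core idea of pairing a portable generic kernel with optional device-specific specializations and letting a runtime choose between them can be sketched as follows. All names here (Controller, register_generic, register_specialized, launch) are illustrative assumptions, not the library's actual API, and the devices are simulated on the host.

        #include <cstdio>
        #include <functional>
        #include <map>
        #include <string>
        #include <utility>
        #include <vector>

        // Illustrative device categories; the real library also distinguishes
        // accelerator families and groups of CPU cores.
        enum class Device { CpuCores, CudaGpu };

        // Minimal stand-in for a multi-device controller: it stores one generic
        // and, optionally, one specialized implementation per kernel name and
        // picks the most specific one available for the target device.
        class Controller {
        public:
            using Kernel = std::function<void(std::vector<float>&)>;

            void register_generic(const std::string& name, Kernel k) {
                generic_[name] = std::move(k);
            }
            void register_specialized(const std::string& name, Device d, Kernel k) {
                specialized_[{name, d}] = std::move(k);
            }
            void launch(const std::string& name, Device d, std::vector<float>& data) {
                auto it = specialized_.find({name, d});
                if (it != specialized_.end()) it->second(data);  // vendor-optimized version
                else generic_.at(name)(data);                    // portable fallback
            }

        private:
            std::map<std::string, Kernel> generic_;
            std::map<std::pair<std::string, Device>, Kernel> specialized_;
        };

        int main() {
            Controller ctrl;
            ctrl.register_generic("scale", [](std::vector<float>& v) {
                for (auto& x : v) x *= 2.0f;  // plain C++ loop, runs on any device type
            });
            // A CUDA-specialized version registered for Device::CudaGpu would
            // launch a real GPU kernel here instead.
            std::vector<float> data(4, 1.0f);
            ctrl.launch("scale", Device::CpuCores, data);
            std::printf("data[0] = %f\n", data[0]);
            return 0;
        }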

    HALO 1.0: A Hardware-agnostic Accelerator Orchestration Framework for Enabling Hardware-agnostic Programming with True Performance Portability for Heterogeneous HPC

    This paper presents HALO 1.0, an open-ended, extensible multi-agent software framework that implements a set of proposed hardware-agnostic accelerator orchestration (HALO) principles. HALO implements a novel compute-centric message passing interface (C^2MPI) specification for enabling the performance-portable execution of a hardware-agnostic host application across heterogeneous accelerators. Experimental results from evaluating eight widely used HPC subroutines on Intel Xeon E5-2620 CPUs, Intel Arria 10 GX FPGAs, and NVIDIA GeForce RTX 2080 Ti GPUs show that HALO 1.0 allows a unified control flow for host programs to run across all the computing devices with a consistently top performance portability score, which is up to five orders of magnitude higher than that of the OpenCL-based solution.
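    A purely illustrative sketch of the compute-centric idea follows: the host describes what to compute as a message and hands it to an accelerator-side agent, so the same host control flow targets any device. The types and function names below are assumptions for illustration only and are not part of the C^2MPI specification.

        #include <cstdio>
        #include <string>
        #include <vector>

        // Illustrative compute-centric request: it names the subroutine and
        // carries its operands, without exposing any device-specific API.
        struct ComputeRequest {
            std::string subroutine;
            std::vector<double> operands;
        };

        // Stand-in for an accelerator-side agent; a real agent would translate
        // the request into CUDA, OpenCL, or FPGA-specific calls.
        double run_on_accelerator(const std::string& device, const ComputeRequest& req) {
            double acc = 0.0;
            for (double x : req.operands) acc += x;  // pretend this is the subroutine
            std::printf("[%s] executed %s -> %f\n", device.c_str(), req.subroutine.c_str(), acc);
            return acc;
        }

        int main() {
            ComputeRequest req{"vector_sum", {1.0, 2.0, 3.0}};
            // The same hardware-agnostic host loop targets every device kind.
            for (const std::string device : {"cpu", "fpga", "gpu"})
                run_on_accelerator(device, req);
            return 0;
        }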

    Easing parallel programming on heterogeneous systems

    The most common way to run HPC (High-Performance Computing) applications in reasonable execution times and in a scalable way is through the use of parallel computing systems. The current trend in HPC systems is to include, in the same execution machine, several computing devices of different types and architectures. However, using them imposes specific challenges on the programmer. A programmer must be an expert in the existing tools and abstractions for distributed memory, in the programming models for shared-memory systems, and in the programming models specific to each type of co-processor, in order to create hybrid programs that can efficiently exploit all the capabilities of the machine. Currently, all these problems must be solved by the programmer, making the programming of a heterogeneous machine a real challenge. This thesis addresses several of the main problems related to parallel programming of highly heterogeneous and distributed systems. It makes proposals that solve problems ranging from the creation of codes that are portable across different types of devices, accelerators, and architectures, while still achieving maximum efficiency, to the problems that appear in distributed-memory systems concerning communication and the partitioning of data structures.

    Relajaciones de ejecución definidas por el usuario para la mejora de la programabilidad en computación paralela de altas prestaciones [User-defined execution relaxations to improve programmability in high-performance parallel computing]

    Doctoral thesis, Universidad Complutense de Madrid, Facultad de Informática, defended on 22-11-2019. This thesis proposes the development and implementation of a new programming model based on execution relaxations and focused on High-Performance Parallel Computing. Specifically, the main goals of the thesis are:
    1. Advocate a development methodology in which users define the basic computing units (tasks), together with a set of relaxations in, possibly, multiple dimensions. These relaxations are translated, at execution time, into expanded (and complex) scheduling opportunities depending on the underlying architectural features, yielding improvements in terms of the desired output metrics (e.g., performance or energy consumption).
    2. Abstract users away from the complexity of the underlying heterogeneous hardware, delegating the proper exploitation of the expanded scheduling choices to a system software component (typically referred to as a runtime). This piece of software, armed with knowledge of the static architectural characteristics and the dynamic status of the hardware at execution time, exploits the combinations considered optimal among the relaxations proposed by the user for each task ready for execution.
    3. Extend this abstraction to describe both computing systems, by means of executor/allocator hierarchies that describe the heterogeneous computing architecture, and applications, in terms of sets of interdependent tasks. In addition, the relations between executors and tasks are categorized into a new task-executor taxonomy, which motivates ambiguity-free HPC programming frontends based on the STSE (Single Task - Single Executor) classification, distinguished from fully automated runtime backends.
    4. Propose a new programming model (STEEL), based on the previous ideas, that gathers features considered basic for future task-based programming models, namely performance, composability, expressivity, and hard-to-misuse interfaces.
    5. Specify an API to support the STEEL programming model, together with a runtime implementation leveraging techniques and programming paradigms supported by modern C++, illustrating its flexibility, ease of use, and performance impact by means of simple use cases and examples.
    Hence, the proposed methodology stands for a clear and strict separation of concerns between the actors involved in a parallel execution: the user/code and the underlying hardware. This kind of abstraction allows expert knowledge to be delegated from the user to the system software (runtime) in a systematic way, and facilitates the integration of mechanisms to automate optimizations, adapting performance to the specificities of the heterogeneous parallel architecture on which the code is instantiated and executed. From this perspective, the thesis designs, implements, and validates mechanisms to perform a so-called complexity formalization, classifying many actions that are currently done by the user and building a framework in which these complexities can be delegated to the runtime system. The delegation of these decisions is already a step forward towards the next generation of programming models seeking performance, expressivity, programmability, and portability...
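    A minimal C++ sketch of the task-plus-relaxations methodology follows. All names (Task, Relaxation, schedule) are hypothetical illustrations of the idea described above, not the STEEL API; a real runtime would turn the declared relaxations into concrete device, ordering, or precision choices.

        #include <cstdio>
        #include <functional>
        #include <string>
        #include <vector>

        // Hypothetical relaxation descriptor: the user states which constraints
        // the runtime may relax (e.g., placement or precision) instead of
        // hard-coding a single execution strategy.
        struct Relaxation {
            std::string dimension;  // e.g., "placement", "precision"
            std::string allowed;    // e.g., "any-device", "float32-ok"
        };

        // Hypothetical task: a basic computing unit plus its relaxations.
        struct Task {
            std::string name;
            std::function<void()> body;
            std::vector<Relaxation> relaxations;
        };

        // Toy "runtime": it only reports the scheduling freedom it was given.
        void schedule(const Task& t) {
            std::printf("scheduling %s with %zu relaxation(s)\n",
                        t.name.c_str(), t.relaxations.size());
            t.body();
        }

        int main() {
            Task saxpy{"saxpy",
                       [] { std::puts("  ...running kernel body"); },
                       {{"placement", "any-device"}, {"precision", "float32-ok"}}};
            schedule(saxpy);
            return 0;
        }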