23 research outputs found

    Autotuning and Self-Adaptability in Concurrency Libraries

    Autotuning is an established technique for optimizing the performance of parallel applications. However, programmers must prepare applications for autotuning, which is tedious and error-prone coding work. We demonstrate how applications become ready for autotuning with few or no modifications by extending Threading Building Blocks (TBB), a library for parallel programming, with autotuning. The extended TBB library optimizes all application-independent tuning parameters fully automatically. We compare manual effort, autotuning overhead, and performance gains on 17 examples. While some examples benefit only slightly, others speed up by 28% over standard TBB. Comment: Presented at the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281).
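The tuning loop described above, measuring candidate values of an application-independent parameter and keeping the fastest, can be sketched in plain C++. The names `chunked_sum` and `autotune_grain` are illustrative, not part of TBB or the extended library; a real autotuner would vary such parameters inside the library itself:

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical tunable kernel: sum a vector in chunks of `grain` elements.
double chunked_sum(const std::vector<double>& v, std::size_t grain) {
    double total = 0.0;
    for (std::size_t i = 0; i < v.size(); i += grain)
        total = std::accumulate(v.begin() + i,
                                v.begin() + std::min(i + grain, v.size()),
                                total);
    return total;
}

// Empirical search: run the kernel once per candidate grain size and
// keep the value with the lowest measured runtime.
std::size_t autotune_grain(const std::vector<double>& v,
                           const std::vector<std::size_t>& candidates) {
    std::size_t best = candidates.front();
    auto best_time = std::chrono::steady_clock::duration::max();
    for (std::size_t g : candidates) {
        auto t0 = std::chrono::steady_clock::now();
        volatile double sink = chunked_sum(v, g);  // prevent elision
        (void)sink;
        auto dt = std::chrono::steady_clock::now() - t0;
        if (dt < best_time) { best_time = dt; best = g; }
    }
    return best;
}
```

The measured-search approach is what makes the technique application-independent: the library never needs to know what the kernel computes, only how long it takes.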

    Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations

    Initially driven by a strong need for increased computational performance in science and engineering, heterogeneous systems have become ubiquitous, and they are getting increasingly complex. The single-processor era has given way to multi-core processors, which have quickly been surrounded by satellite devices aiming to increase the throughput of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable Gate Arrays, or other specialized processors, have very different architectures. This puts an enormous strain on programming models and software developers to take full advantage of the computing power at hand. Because of this diversity, and because the flexibility and portability needed to optimize for each target individually are unachievable in practice, heterogeneous systems typically remain vastly under-utilized. In this thesis, we explore two distinct ways to tackle this problem. Providing automated, non-intrusive methods in the form of compiler tools, and implementing efficient abstractions to automatically tune parameters for a restricted domain, are two complementary approaches investigated to better utilize compute resources in heterogeneous systems. First, we explore a fully automated compiler-based approach, where a runtime system analyzes the computation flow of an OpenCL application and optimizes it across multiple compute kernels. This method can be deployed on any existing application transparently and replaces the significant software-engineering effort otherwise spent tuning an application for a particular system. We show that this technique achieves speedups of up to 3x over unoptimized code and an average of 1.4x over manually optimized code for highly dynamic applications. Second, a library-based approach is designed to provide a high-level abstraction for complex problems in a specific domain: stencil computation. Using domain-specific techniques, the underlying framework optimizes the code aggressively. We show that even in a restricted domain, automatic tuning mechanisms and robust architectural abstraction are necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling of various applications to multiple GPUs, with a speedup of up to 1.9x on two GPUs and 3.6x on four.
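The stencil abstraction described above can be illustrated with a minimal sketch: the user writes only the per-point update, and the framework owns iteration and boundary handling. `apply_stencil` is a hypothetical name, not the thesis's actual API; a real framework would additionally handle device mapping, halo exchange, and aggressive domain-specific optimization:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal 1-D stencil framework: the caller supplies only the per-point
// function of (left, center, right); boundaries are left unchanged.
template <typename F>
std::vector<double> apply_stencil(const std::vector<double>& in, F point_fn) {
    std::vector<double> out(in);
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = point_fn(in[i - 1], in[i], in[i + 1]);
    return out;
}
```

Because the iteration structure is owned by the framework rather than the user, the same user code can be retargeted, for example tiled, vectorized, or distributed across GPUs, without modification, which is exactly what makes domain-restricted autotuning feasible.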

    Optimization of performance and energy efficiency in massively parallel systems (Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos)

    Heterogeneous systems are becoming increasingly relevant due to their performance and energy-efficiency capabilities, and are present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity means they are usually programmed under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, and also makes applications harder to adapt. Co-execution allows all devices to cooperate on the same problem simultaneously, consuming less time and energy. However, programmers must handle all device management, workload distribution, and code portability between systems, which significantly complicates programming. This thesis contributes techniques to improve performance and energy efficiency in these massively parallel systems. The proposals address generally conflicting objectives: usability and programmability are improved while ensuring greater system abstraction and extensibility, and at the same time performance, scalability, and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed. EngineCL, focused on OpenCL and offering a high-level API, provides an extensible modular system and favors maximum compatibility across all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions and molecular dynamics HPC simulators, such as the one used in an international research center. Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that adds co-execution support to the oneAPI technology. This runtime brings programmers closer to the problem domain, enabling dynamic adaptive strategies that improve efficiency in all types of applications. Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant) and the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union's Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The work of Chapter 4 (Integration II: Hybrid programming models) has been partially performed under project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS).
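At its core, co-execution means partitioning one iteration space across devices that compute concurrently. A minimal host-only sketch, with two threads standing in for CPU and GPU, and a hypothetical `gpu_share` ratio in place of a real load-balancing strategy such as the dynamic adaptive ones these runtimes implement:

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Static co-execution sketch: one problem, two "devices" (two host
// threads), each given a share of the iteration space proportional to
// its assumed relative speed.
void coexecute(std::vector<int>& data, double gpu_share) {
    const std::size_t split =
        static_cast<std::size_t>(data.size() * gpu_share);
    auto work = [&data](std::size_t lo, std::size_t hi) {
        for (std::size_t i = lo; i < hi; ++i) data[i] *= 2;
    };
    std::thread gpu(work, 0, split);  // stand-in for the accelerator
    work(split, data.size());        // host computes its own share
    gpu.join();
}
```

The hard part a real runtime solves is choosing and adapting the split at run time, since a fixed ratio wastes whichever device finishes first.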

    Towards efficient exploitation of GPUs: a methodology for mapping index-digit algorithms

    GPU computing represented a major step forward, bringing high-performance computing to commodity hardware. Feature-rich parallel languages like CUDA and OpenCL reduced the programming complexity. However, to fully exploit their computing power, specialized parallel algorithms are required. Moreover, the complex GPU memory hierarchy and highly threaded architecture make programming a difficult task even for experienced programmers. Owing to the novelty of GPU programming, general-purpose libraries are scarce and parallel versions of algorithms are not always readily available. Instead of focusing on the parallelization of particular algorithms, this thesis proposes a general methodology applicable to most divide-and-conquer problems with a butterfly structure that can be formulated through the Index-Digit representation. First, the different performance factors of the GPU architecture are analyzed. Next, several optimization techniques are studied and a series of modular, reusable building blocks is designed, which are then used to create the different algorithms. Finally, the optimal resource balance is studied, and through a mapping-vector representation and operator algebra the algorithms are tuned for the desired configurations. Despite the focus on programmability and flexibility, the resulting implementations offer very competitive performance, surpassing well-known state-of-the-art libraries.
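The Index-Digit representation treats an index as a vector of digits and derives data mappings by permuting those digits. The simplest instance is the bit-reversal order of radix-2 FFT butterflies, sketched here as a toy illustration of the representation, not of the thesis's full methodology:

```cpp
#include <cassert>
#include <cstddef>

// Index-digit idea in miniature: an index into 2^bits elements is a
// vector of binary digits; reversing those digits yields the classic
// bit-reversal permutation used by radix-2 FFT butterflies.
std::size_t reverse_digits(std::size_t idx, unsigned bits) {
    std::size_t out = 0;
    for (unsigned b = 0; b < bits; ++b) {
        out = (out << 1) | (idx & 1);  // shift in the lowest digit
        idx >>= 1;
    }
    return out;
}
```

Generalizing from "reverse the digits" to arbitrary digit permutations, expressed as mapping vectors and composed algebraically, is what lets one methodology cover a whole family of butterfly algorithms and tune their thread/memory layouts.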

    Tools for improving performance portability in heterogeneous environments

    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01. Parallel computing is currently partially dominated by the availability of heterogeneous devices. These devices differ from each other in aspects such as the instruction set they execute, the number and type of computing units they offer, or the structure of their memory systems. In recent years, languages, libraries, and extensions have appeared that allow a parallel code to be written once and run on a wide variety of devices, OpenCL being the most widespread solution of this kind. However, functional portability does not imply performance portability. Thus, one of the problems that remains open in this field is achieving automatic performance portability, that is, the ability to automatically tune a given code for any device where it will be executed so that it obtains good performance. This thesis develops three solutions to tackle this problem, all based on typical source-to-source optimizations for heterogeneous devices. Both the set of optimizations to apply and the way they are applied depend on different optimization parameters, whose values have to be tuned for each specific device. The first solution is OCLoptimizer, a source-to-source optimizer that can optimize annotated OpenCL kernels with the help of configuration files that guide the optimization process. The tool optimizes kernels for a specific device, and it can also automate the generation of functional host codes when a single kernel is optimized. The two remaining solutions are built on top of the Heterogeneous Programming Library (HPL), a C++ framework that provides an easy and portable way to exploit heterogeneous computing systems. The first of these uses the run-time code-generation capabilities of HPL to build a self-optimizing matrix multiplication that tunes itself at run time for a specific device. The last solution is a built-in just-in-time optimizer for HPL that can optimize an HPL code for a specific device at run time. While the first two solutions use search processes to find the best values for the optimization parameters, this last alternative relies on heuristics based on general optimization strategies.
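The self-optimizing idea behind the HPL-based solutions, selecting among code variants at run time and keeping the fastest for the current device, can be sketched without HPL by timing two fixed loop orders of a matrix multiplication. All names here are illustrative; HPL's actual mechanism generates kernel code at run time rather than choosing between precompiled variants:

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major n x n matrix

// Variant 1: naive ijk loop order for C = A * B.
void mm_ijk(const Mat& A, const Mat& B, Mat& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j) {
            double s = 0.0;
            for (std::size_t k = 0; k < n; ++k) s += A[i*n+k] * B[k*n+j];
            C[i*n+j] = s;
        }
}

// Variant 2: ikj order, which streams rows of B and is often
// friendlier to caches.
void mm_ikj(const Mat& A, const Mat& B, Mat& C, std::size_t n) {
    for (auto& c : C) c = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t k = 0; k < n; ++k)
            for (std::size_t j = 0; j < n; ++j)
                C[i*n+j] += A[i*n+k] * B[k*n+j];
}

// Self-optimizing entry point: on first use, time both variants on the
// current machine and cache the winner for all later calls.
void matmul(const Mat& A, const Mat& B, Mat& C, std::size_t n) {
    using Fn = void (*)(const Mat&, const Mat&, Mat&, std::size_t);
    static Fn best = nullptr;
    if (!best) {
        Mat tmp(n * n);
        auto time_of = [&](Fn f) {
            auto t0 = std::chrono::steady_clock::now();
            f(A, B, tmp, n);
            return std::chrono::steady_clock::now() - t0;
        };
        best = (time_of(mm_ikj) <= time_of(mm_ijk)) ? mm_ikj : mm_ijk;
    }
    best(A, B, C, n);
}
```

The search-based solutions in the thesis explore a much larger space of parameters (tiling, unrolling, vectorization) in the same spirit; the heuristic-based JIT optimizer instead picks parameter values directly from general optimization rules.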

    Scalable Observation, Analysis, and Tuning for Parallel Portability in HPC

    It is desirable for general productivity that high-performance computing applications be portable to new architectures, or optimizable for new workflows and input types, without costly code interventions or algorithmic rewrites. Parallel-portability programming models offer the potential for both high performance and productivity; however, they come with a multitude of runtime parameters that can significantly affect execution performance. Selecting the optimal set of parameters, so that HPC applications perform well in different system environments and on different input data sets, is not trivial. This dissertation maps out a vision for addressing this parallel-portability challenge and then demonstrates that plan through an effective combination of observability, analysis, and in situ machine learning techniques. A platform for general-purpose observation in HPC contexts is investigated, along with support for its use in human-in-the-loop performance understanding and analysis. The dissertation culminates in a demonstration of the lessons learned, providing automated tuning of HPC applications that utilize parallel-portability frameworks.

    A Contribution to Resource-Aware Architectures for Humanoid Robots

    The goal of this work is to provide building blocks for resource-aware robot architectures. These blocks cover the data-driven generation of context-sensitive resource models, the prediction of future resource utilization, and resource-aware computer vision and motion-planning algorithms. Their implementation is based on resource-aware concepts and methodologies originating from the Transregional Collaborative Research Center "Invasive Computing" (SFB/TR 89).