Autotuning and Self-Adaptability in Concurrency Libraries
Autotuning is an established technique for optimizing the performance of
parallel applications. However, programmers must prepare applications for
autotuning, which is tedious and error-prone coding work. We demonstrate how
applications become ready for autotuning with few or no modifications by
extending Threading Building Blocks (TBB), a library for parallel programming,
with autotuning. The extended TBB library optimizes all application-independent
tuning parameters fully automatically. We compare manual effort, autotuning
overhead, and performance gains on 17 examples. While some examples benefit only
slightly, others speed up by 28% over standard TBB.
Comment: Presented at the 1st Workshop on Resource Awareness and Adaptivity in Multi-Core Computing (Racing 2014) (arXiv:1405.2281).
Enhancing productivity and performance portability of OpenCL applications on heterogeneous systems using runtime optimizations
Initially driven by a strong need for increased computational performance in science and
engineering, heterogeneous systems have become ubiquitous and they are getting increasingly
complex. The single processor era has been replaced with multi-core processors,
which have quickly been surrounded by satellite devices aiming to increase the throughput
of the entire system. These auxiliary devices, such as Graphics Processing Units, Field Programmable
Gate Arrays or other specialized processors have very different architectures.
This puts an enormous strain on programming models and software developers to take full
advantage of the computing power at hand. Because of this diversity, and because the
flexibility and portability needed to optimize for each target individually are unattainable
in practice, heterogeneous systems typically remain vastly under-utilized.
In this thesis, we explore two distinct ways to tackle this problem. Providing automated,
non-intrusive methods in the form of compiler tools and implementing efficient abstractions
to automatically tune parameters for a restricted domain are two complementary
approaches investigated to better utilize compute resources in heterogeneous systems.
First, we explore a fully automated compiler-based approach, where a runtime system
analyzes the computation flow of an OpenCL application and optimizes it across multiple
compute kernels. This method can be deployed on any existing application transparently
and replaces the significant software engineering effort spent tuning an application for a particular
system. We show that this technique achieves speedups of up to 3x over unoptimized
code and an average of 1.4x over manually optimized code for highly dynamic applications.
Second, a library-based approach is designed to provide a high-level abstraction for
complex problems in a specific domain: stencil computation. Using domain-specific techniques,
the underlying framework optimizes the code aggressively. We show that even in
a restricted domain, automatic tuning mechanisms and robust architectural abstraction are
necessary to improve performance. Using the abstraction layer, we demonstrate strong scaling
of various applications to multiple GPUs with a speedup of up to 1.9x on two GPUs
and 3.6x on four.
Optimización del rendimiento y la eficiencia energética en sistemas masivamente paralelos
ABSTRACT: Heterogeneous systems are becoming increasingly relevant due to their performance and energy-efficiency capabilities, and are present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity means that they are usually programmed under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, and also makes it difficult to adapt applications.
Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming.
This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address generally conflicting objectives: they improve usability and programmability while ensuring greater system abstraction and extensibility, and at the same time increase performance, scalability, and energy efficiency. To achieve this, two runtime systems with completely different approaches are proposed.
EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center.
Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime is a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.
Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant),
the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R
and PID2019-105660RB-C22.
This work has also been partially supported by the Mont-Blanc 3: European Scalable and
Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No.
671697) from the European Union's Horizon 2020 Research and Innovation Programme
(H2020 Programme). Some activities have also been funded by the Spanish Science and Technology
Commission under contract TIN2016-81840-REDT (CAPAP-H6 network).
The work of Chapter 4, Integration II: Hybrid programming models, was partially performed
under the Project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC
Research Innovation Action under the H2020 Programme. In particular, the author gratefully
acknowledges the support of the SPMT Department of the High Performance Computing
Center Stuttgart (HLRS).
Towards efficient exploitation of GPUs: a methodology for mapping index-digit algorithms
[Abstract]
GPU computing represented a major step forward, bringing high-performance computing
to commodity hardware. Feature-rich parallel languages like CUDA and
OpenCL reduced the programming complexity. However, to fully take advantage of
their computing power, specialized parallel algorithms are required. Moreover, the
complex GPU memory hierarchy and highly threaded architecture make programming
a difficult task even for experienced programmers. Due to the novelty of GPU
programming, common general purpose libraries are scarce and parallel versions of
the algorithms are not always readily available.
Instead of focusing on the parallelization of particular algorithms, in this thesis
we propose a general methodology applicable to most divide-and-conquer problems
with a butterfly structure which can be formulated through the Index-Digit
representation. First, we analyze the different performance factors of the GPU architecture.
Next, we study several optimization techniques and design a series of
modular and reusable building blocks, which are used to create the different
algorithms. Finally, we study the optimal resource balance and, through a mapping
vector representation and operator algebra, tune the algorithms for the desired
configurations. Despite the focus on programmability and flexibility, the resulting
implementations offer very competitive performance, surpassing other
well-known state-of-the-art libraries.
Tools for improving performance portability in heterogeneous environments
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
[Abstract]
Parallel computing is currently dominated in part by the availability of heterogeneous
devices. These devices differ from each other in aspects such as the
instruction set they execute, the number and type of computing units that
they offer, or the structure of their memory systems. In recent years, languages,
libraries, and extensions have appeared that allow writing parallel code once and
running it on a wide variety of devices, OpenCL being the most widespread solution of
this kind. However, functional portability does not imply performance portability.
Thus, one of the problems that remains open in this field is achieving automatic
performance portability: that is, the ability to automatically tune a given code for
any device where it will be executed so that it obtains good performance. This
thesis develops three different solutions to tackle this problem. All three
are based on typical source-to-source optimizations for heterogeneous devices. Both
the set of optimizations to apply and the way they are applied depend on different
optimization parameters, whose values have to be tuned for each specific device.
The first solution is OCLoptimizer, a source-to-source optimizer that can optimize
annotated OpenCL kernels with the help of configuration files that guide the
optimization process. The tool optimizes kernels for a specific device, and it is also
able to automate the generation of functional host codes when only a single kernel
is optimized.
The two remaining solutions are built on top of the Heterogeneous Programming
Library (HPL), a C++ framework that provides an easy and portable way to exploit
heterogeneous computing systems. The first of these solutions uses the run-time
code generation capabilities of HPL to generate a self-optimizing version of a matrix
multiplication that can optimize itself at run time for a specific device. The last
solution is a built-in just-in-time optimizer for HPL that can optimize an HPL
code at run time for a specific device. While the first two solutions
use search processes to find the best values for the optimization parameters, this last
alternative relies on heuristics based on general optimization strategies.
Scalable Observation, Analysis, and Tuning for Parallel Portability in HPC
It is desirable for general productivity that high-performance computing applications be portable to new architectures, or can be optimized for new workflows and input types, without the need for costly code interventions or algorithmic rewrites. Parallel portability programming models provide the potential for high performance and productivity; however, they come with a multitude of runtime parameters that can have significant impact on execution performance. Selecting the optimal set of parameters, so that HPC applications perform well in different system environments and on different input data sets, is not trivial.
This dissertation maps out a vision for addressing this parallel portability challenge, and then demonstrates this plan through an effective combination of observability, analysis, and in situ machine learning techniques. A platform for general-purpose observation in HPC contexts is investigated, along with support for its use in human-in-the-loop performance understanding and analysis. The dissertation culminates in a demonstration of lessons learned in order to provide automated tuning of HPC applications utilizing parallel portability frameworks.
A Contribution to Resource-Aware Architectures for Humanoid Robots
The goal of this work is to provide building blocks for resource-aware robot architectures. These blocks cover data-driven generation of context-sensitive resource models, prediction of future resource utilization, and resource-aware computer vision and motion-planning algorithms. Their implementation is based on resource-aware concepts and methodologies originating from the Transregional Collaborative Research Center "Invasive Computing" (SFB/TR 89).