8 research outputs found
Tools for improving performance portability in heterogeneous environments
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01[Abstract]
Parallel computing is currently partially dominated by the availability of heterogeneous
devices. These devices differ from each other in aspects such as the
instruction set they execute, the number and the type of computing devices that
they offer or the structure of their memory systems. In the last years, langnages,
libraries and extensions have appeared to allow to write a parallel code once aud
run it in a wide variety of devices, OpenCL being the most widespread solution of
this kind. However, functional portability does not imply performance portability.
This way, one of the probletns that is still open in this field is to achieve automatic
performance portability. That is, the ability to automatically tune a given code for
any device where it will be execnted so that it ill obtain a good performance. This
thesis develops three different solutions to tackle this problem. The three of them
are based on typical source-to-sonrce optimizations for heterogeneous devices. Both
the set of optimizations to apply and the way they are applied depend on different
optimization parameters, whose values have to be tuned for each specific device.
The first solution is OCLoptimizer, a source-to-source optimizer that can optimize
annotated OpenCL kemels with the help of configuration files that guide the
optimization process. The tool optimizes kernels for a specific device, and it is also
able to automate the generation of functional host codes when only a single kernel
is optimized.
The two remaining solutions are built on top of the Heterogeneous Programming
Library (HPL), a C++ framework that provides an easy and portable way to exploit
heterogeneous computing systexns. The first of these solutions uses the run-time
code generation capabilities of HPL to generate a self-optimizing version of a matrix
multiplication that can optimize itself at run-time for an spedfic device. The last solutíon is the development of a built-in just-in-time optirnizer for HPL, that can
optirnize, at run-tirne, a HPL code for an specific device. While the first two solutions
use search processes to find the best values for the optimization parameters, this Iast
alternative relies on heuristics bMed on general optirnization strategies.[Resumen]
Actualmente la computación paralela se encuentra dominada parcialmente por
los múltiples dispositivos heterogéneos disponibles. Estos dispositivos difieren entre
sí en características tales como el conjunto de instrucciones que ejecutan, el número
y tipo de unidades de computación que incluyen o la estructura de sus sistemas de
memoria. Durante los últimos años han aparecido lenguajes, librerías y extensiones
que permiten escribir una única vez la versión paralela de un código y ejecutarla en
un amplio abanico de dispositivos, siendo de entre todos ellos OpenCL la solución
más extendida. Sin embargo, la portabilidad funcional no implica portabilidad de
rendimiento. Así, uno de los grandes problemas que sigue abierto en este campo
es la automatización de la portabilidad de rendimiento, es decir, la capacidad de
adaptar automáticamente un código dado para su ejecución en cualquier dispositivo
y obtener un buen rendimiento. Esta tesis aborda este problema planteando tres
soluciones diferentes al mismo. Las tres se basan en la aplicación de optimizaciones
de código a código usadas habitualmente en dispositivos heterogéneos. Tanto el
conjunto de optimizaciones a aplicar como la forma de aplicarlas dependen de varios
parámetros de optimización, cuyos valores han de ser ajustados para cada dispositivo
concreto.
La primera solución planteada es OCLoptirnizer, un optimizador de código a
código que a partir de kernels OpenCL anotados y ficheros de configuración como
apoyo, obtiene versiones optimizada de dichos kernels para un dispositivo concreto.
Además, cuando el kernel a optimizar es único, automatiza la generación de un
código de host funcional para ese kernel.
Las otras dos soluciones han sido implementadas utilizando Heterogeneous Prograrnming
LibranJ (HPL), una librería C++ que permite programar sistemas heterogéneos de forma fácil y portable. La primera de estas soluciones explota las
capacidades de generación de código en tiempo de ejecución de HPL para generar
versiones de un producto de matrices que se adaptan automáticamente en tiempo
de ejecución a las características de un dispositivo concreto. La última solución consiste
en el desarrollo e incorporación a HPL de un optimizador al vuelo, de fonna
que se puedan obtener en tiempo de ejecución versiones optimizadas de un código
HPL para un dispositivo dado. Mientras las dos primeras soluciones usan procesos
de búsqueda para encontrar los mejores valores para los parámetros de optimización,
esta última altemativa se basa para ello en heurísticas definidas a partir de
recomendaciones generales de optimización.[Resumo]
Actualmente a computación paralela atópase dominada parcialmente polos múltiples
dispositivos heteroxéneos dispoñibles. Estes dispositivos difiren entre si en características
tales como o conxunto de instruccións que executan, o número e tipo
de unidades de computación que inclúen ou a estrutura dos seus sistemas de mem~
ría. Nos últimos anos apareceron linguaxes, bibliotecas e extensións que permiten
escribir unha soa vez a versión paralela dun código e executala nun amplio abano de
dispositivos, senda de entre todos eles OpenCL a solución máis extendida. Porén, a
portabilidade funcional non implica portabilidade de rendemento. Deste xeito, uns
dos grandes problemas que segue aberto neste campo é a automatización da portabilidade
de rendemento, isto é, a capacidade de adaptar automaticamente un código
dado para a súa execución en calquera dispositivo e obter un bo rendemento. Esta
tese aborda este problema propondo tres solucións diferentes. As tres están baseadas
na aplicación de optimizacións de código a código usadas habitualmente en disp~
sitivos heteroxéneos. Tanto o conxunto de optimizacións a aplicar como a forma de
aplicalas dependen de varios parámetros de optimización para os que é preciso fixar
determinados valores en función do dispositivo concreto.
A primeira solución pro posta é OCLoptirnizer, un optimizador de código a código
que partindo de kemels OpenCL anotados e ficheiros de configuración de apoio,
obtén versións optimizadas dos devanditos kernels para un dispositivo concreto.
Amais, cando o kernel a optimizaré único, tarnén automatiza a xeración dun código
de host funcional para ese kernel.
As outras dúas solucións foron implementadas utilizando Heterogeneous Programming
Library (HPL), unha biblioteca C++ que permite programar sistemas
heteroxéneos de xeito fácil e portable. A primeira destas solucións explota as capacidades de xeración de código en tempo de execución de HPL para xerar versións
dun produto de matrices que se adaptan automaticamente ás características dun
dispositivo concreto. A última solución consiste no deseuvolvemento e incorporación
a HPL dun optimizador capaz de obter en tiempo de execución versións optimizada<;
dun código HPL para un dispositivo dado. Mentres as dúas primeiras solucións usan
procesos de procura para atopar os mellares valores para os parámetros de optimización,
esta última alternativa baséase para iso en heurísticas definidas a partir de
recomendacións xerais de optimización
Distributed programming of a hyperspectral image registration algorithm for heterogeneous GPU clusters
Producción CientíficaHyperspectral image registration is a relevant task for real-time applications such as environmental disaster management or search and rescue scenarios. The HYFMGPU algorithm was proposed as a single-GPU high-performance solution, but the need for a distributed version has arisen due to the continuous evolution of sensors that generate images with finer spatial and spectral resolutions. In a previous work, we simplified the programming of the multi-device parts of an initial MPI+CUDA multi-GPU implementation of HYFMGPU by means of Hitmap, a library to ease the programming of parallel applications based on distributed arrays. The performance of that Hitmap version was assessed in a homogeneous GPU cluster. In this paper, we extend this implementation by means of new functionalities added to the latest version of Hitmap in order to support arbitrary load distributions for multi-node heterogeneous GPU clusters. Three different load balancing layouts are tested, which prove that selecting a proper layout affects the performance of the code and how this performance is correlated with the use of the GPUs available in the cluster.This work has been funded by the Consejería de Educación of Junta de Castilla y León and the European Regional Development Fund (ERDF) program (projects PROPHET, VA082P17, and PROPHET-2, VA226P20); by the Ministerio de Economía, Industria y Competitividad of Spain (project PCAS, TIN2017-88614-R); and by the Fulbright Commission, (grant Salvador de Madariaga/Fulbright Scholar PRX17/00674)
An automatic optimizer for heterogeneous devices
Versión final aceptada de: https://doi.org/10.1016/j.future.2020.01.018This version of the article: Fernández-Fabeiro, J., Andrade, D., Fraguela, B. B., & Doallo, R. (2020). 'An automaticoptimizer for heterogeneous devices' has been accepted for publication in: Future Generation Computer Systems, 106, 572–584.
The Version of Record is available online at: https://doi.org/10.1016/j.future.2020.01.018 .[Abstract]: Codes written in a naive way seldom effectively exploit the computing resources, while writing optimized codes is usually a complex task that requires certain levels of expertise. This problem is further increased in the presence of heterogeneous devices, which present more tunable parameters than regular CPUs and high sensitivity to the optimization decisions taken. Furthermore, portability is an added concern given the wide variety of accelerators available. This paper tackles this problem adding an automatic optimizer to a library that already provides an easy and portable way to program heterogeneous devices, the Heterogeneous Programming Library (HPL). Our optimizer takes as input a simple version of a code and then tunes it for the device where it is going to be executed by performing the most usual set of optimizations applicable in heterogeneous devices. These optimizations are parametrized using a set of optimization parameters that need to be tuned for the device. The HPL library has also been equipped with an autotuner that can be used to this purpose. The effectiveness of the autotuner and the optimizer has been tested on several codes and devices. The results show that the combination of the autotuner and the optimizer make the tested codes 16 times faster on average than the original codes written by the programmer.This research was supported by the Ministry of Economy and Competitiveness of Spain and FEDER funds (80%) of the EU (TIN2016-75845-P), and by the Government of Galicia (Xunta de Galicia, Spain) co-founded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04) as well as under Xunta de Galicia and FEDER funds of the EU (Centro de Investigación de Galicia accreditation 2019–2022, ref. ED431G2019/01)Xunta de Galicia; ED431C 2017/04Xunta de Galicia; ED431G2019/0
A multi-device version of the HYFMGPU algorithm for hyperspectral scenes registration
This is a post-peer-review, pre-copyedit version of an article published in The Journal of Supercomputing. The final authenticated version is available online at: https://doi.org/10.1007/s11227-018-2689-7Hyperspectral image registration is a relevant task for real-time applications like environmental disasters management or search and rescue scenarios. Traditional algorithms were not really devoted to real-time performance, even when ported to GPUs or other parallel devices. Thus, the HYFMGPU algorithm arose as a solution to such a lack. Nevertheless, as sensors are expected to evolve and thus generate images with finer resolutions and wider wavelength ranges, a multi-GPU implementation of this algorithm seems to be necessary in a near future. This work presents a multi-device MPI + CUDA implementation of the HYFMGPU algorithm that distributes all its stages among several GPUs. This version has been validated testing it for 5 different real hyperspectral images, with sizes from about 80 MB to nearly 2 GB, achieving speedups for the whole execution of the algorithm from 1.18 × to 1.59 × in 2 GPUs and from 1.26 × to 2.58 × in 4 GPUs. The parallelization efficiencies obtained are stable around 86 % and 78 % for 2 and 4 GPUs, respectively, which proves the scalability of this multi-device versionThis work has been partially supported by: Universidad de Valladolid—Consejería de Educación of Junta de Castilla y León, Ministerio de Economía, Industria y Competitividad of Spain, and European Regional Development Fund (ERDF) program: Project PCAS (TIN2017-88614-R), Project PROPHET (VA082P17) and CAPAP-H6 network (TIN2016-81840-REDT). Universidade de Santiago de Compostela—Consellería de Cultura, Educación e Ordenación Universitaria of Xunta de Galicia (grant numbers GRC2014/008 and ED431G/08) and Ministerio de Economía, Industria y Competitividad of Spain (Grant Number TIN2016-76373-P), all co-funded by the European Regional Development Fund (ERDF) program. The work of Álvaro Ordóñez was supported by the Ministerio de Educación, Cultura y Deporte under an FPU Grant (Grant Number FPU16/03537)S
Towards a multi-device versión of the HYFMGPU Algorithm
Proceedings of the 18th International Conference on Computational and Mathematical Methods
in Science and Engineering, CMMSE 2018, July 9–14, 2018The task consisting on estimating the translation, rotation and scaling of an image with respect to another take of the same scene obtained at different times, viewpoints and/or lightning conditions is known as image registration. Applications like environmental disasters management or rescue operations depend on real-time hyperspectral images registration, but most of the current FFT-based techniques ignore such performance needs. Ordóñez et al. proposed HYFMGPU [1], a single-GPU algorithm whose performance makes it suitable for real-time use cases. As hyperspectral sensors improve, both the size of images and the wavelength ranges covered are expected to increase, so that a multi-GPU implementation is proposed to satisfy such growing needsThis work has been partially supported by Regional Government of Castilla y León (Spain) and ERDF program of European Union: PROPHET project (JCYL-VA082P17
CMMSE 2018
The task consisting on estimating the translation, rotation and scaling of an image with respect to another take of the same scene obtained at different times, viewpoints and/or lightning conditions is known as image registration. Applications like environmental disasters management or rescue operations depend on real-time hyperspectral images registration, but most of the current FFT-based techniques ignore such performance needs. Ordóñez et al. proposed HYFMGPU [1], a single-GPU algorithm whose performance makes it suitable for real-time use cases. As hyperspectral sensors improve, both the size of images and the wavelength ranges covered are expected to increase, so that a multi-GPU implementation is proposed to satisfy such growing needs.Universidad de Valladolid (Consejería de Educación of Junta de Castilla y León, Ministerio de Economía, Industria y Competitividad of Spain, and European Regional Development Fund (ERDF) program: Project PCAS (TIN2017-88614-R), Project PROPHET (VA082P17) and CAPAP-H6 network (TIN2016-81840-REDT)
High Performance Computing Systems Conference 2019
Hyperspectral image registration is a relevant task for real-time applications like environmental disasters management or search and rescue scenarios. Traditional algorithms for this problem were not really devoted to real-time performance. The HYFMGPU algorithm arose as a high-performance GPU-based solution to solve such a lack. Nevertheless, a single-GPU solution is not enough, as sensors are evolving and then generating images with finer resolutions and wider wavelength ranges. An MPI+CUDA distributed multi-GPU implementation of HYFMGPU was previously presented. However, this solution
shows the programming complexity of combining MPI with an accelerator programming model. In this paper we present a new and more abstract programming approach for this type of applications, which provides a high efficiency while simplifying the programming of the distributed code. The solution uses Hitmap, a library to ease the programming of parallel applications based on distributed arrays. It uses a more algorithm-oriented approach than MPI, including abstractions for the automatic partition and mapping of arrays at runtime with arbitrary granularity, as well as techniques to build flexible communication patterns that transparently adapt to the data partitions. We show how these abstractions apply to this application class. We present a comparison of development effort metrics between the
original MPI implementation and the one based on Hitmap, with reductions of up to 95% for the Halstead score in specific work redistribution steps. We finally present experimental results showing that these abstractions are internally implemented in a high efficient way that can reduce the overall performance time in up to 37% comparing with the original MPI implementation.Este trabajo forma parte del proyecto de investigación PCAS Grant TIN2017-88614-R y la Junta de Castilla y León, proyecto PROPHET, VA082P1
Peachy Parallel Assignments, GAMUVa (2018-2023)
Esta colección contiene seis propuestas de trabajos prácticos a desarrollar durante todo el curso en asignaturas de Computación Paralela. Cada trabajo contiene un guión, un programa secuencial y versiones preparadas para ser paralelizadas y optimizadas por los alumnos en tres modelos de programación paralela diferentes (OpenMP, MPI, CUDA/OpenCL) que presentan las tres alternativas más relevantes para el desarrollo de habilidades y competencias en esta materia. Están preparados para utilizarse opcionalmente con técnicas de gamificación proponiendo a los alumnos concursos de programación para el desarrollo de cada parte del ejercicio. Estos trabajos han sido desarrollados en el contexto GAMUVa (Grupo de Innovación Docente con Técnicas de Gamificación de la Universidad de Valladolid) financiados parcialmente por el Vicerrectorado de Innovación Docente y Transformación Digital de la Universidad de Valladolid. Han sido probados, corregidos y utilizados con éxito en los cursos 2017-2018 a 2022-2023 en la asignatura de Computación Paralela del Grado en Informática en la Universidad de Valladolid. Cada uno de ellos ha sido seleccionado como Peachy Parallel Assignment por NSF/IEEE-TPCC, National Science Foundation EEUU, the Center for Parallel and Distributed Computing Curriculum Development and Educational Resources (CDER) in conjunction with the IEEE Technical Committee on Parallel Processing (TCPP), para su presentación en los Workshop on Education for High-Performance Computing (EduHPC) de los años 2018, 2019, 2020, 2021, 2022 y 2023 respectivamente.Departamento de Informática (Arquitectura y Tecnología de Computadores, Ciencias de la Computación e Inteligencia Artificial, Lenguajes y Sistemas Informáticos)Aprendizaje basado en proyectos, opcionalmente gamificación con concursos de programaciónDocencia en Computación ParalelaProgramación, arquitecturas paralelasActiv