EPSILOD: efficient parallel skeleton for generic iterative stencil computations in distributed GPUs
Iterative stencil computations are widely used in numerical simulations. They present a high degree of parallelism, high locality and mostly-coalesced memory access patterns. Therefore, GPUs are good candidates to speed up their computation. However, the development of stencil programs that can work with huge grids in distributed systems with multiple GPUs is not straightforward, since it requires solving problems related to the partition of the grid across nodes and devices, and the synchronization and data movement across remote GPUs. In this work, we present EPSILOD, a high-productivity parallel programming skeleton for iterative stencil computations on distributed multi-GPUs, of the same or different vendors, that supports any type of n-dimensional geometric stencil of any order. It uses an abstract specification of the stencil pattern (neighbors and weights) to internally derive the data partition, synchronizations and communications. Computation is split to better overlap with communications. This paper describes the underlying architecture of EPSILOD and its main components, and presents an experimental evaluation to show the benefits of our approach, including a comparison with another state-of-the-art solution. The experimental results show that EPSILOD is faster and shows good strong and weak scalability on platforms with both homogeneous and heterogeneous types of GPUs.
Funding: Junta de Castilla y León, Ministerio de Economía, Industria y Competitividad, and the European Regional Development Fund (FEDER): project PCAS (TIN2017-88614-R) and project PROPHET-2 (VA226P20). Ministerio de Ciencia e Innovación, Agencia Estatal de Investigación and "European Union NextGenerationEU/PRTR" (MCIN/AEI/10.13039/501100011033), grant TED2021-130367B-I00. CTE-POWER and Minotauro, and the technical support provided by the Barcelona Supercomputing Center (RES-IM-2021-2-0005, RES-IM-2021-3-0024, RES-IM-2022-1-0014). Open-access publication funded by the Consorcio de Bibliotecas Universitarias de Castilla y León (BUCLE), under Operational Programme 2014ES16RFOP009 FEDER 2014-2020 of Castilla y León, action 20007-CL: Apoyo Consorcio BUCLE.
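To make the abstract stencil specification concrete, the following toy sketch (illustrative Python, not EPSILOD's actual API; the function name halo_depths is hypothetical) shows how a list of (offset, weight) neighbor pairs is enough to derive the halo depth needed on each side of a partition:

    # Toy sketch: derive per-dimension halo depths from an abstract
    # stencil specification given as (offset, weight) pairs.
    def halo_depths(neighbors):
        """neighbors: list of (offset_tuple, weight).
        Returns, per dimension, (negative_depth, positive_depth)."""
        ndim = len(neighbors[0][0])
        lo = [0] * ndim
        hi = [0] * ndim
        for offset, _weight in neighbors:
            for d, o in enumerate(offset):
                lo[d] = max(lo[d], -o)
                hi[d] = max(hi[d], o)
        return list(zip(lo, hi))

    # 2D 5-point Jacobi stencil of order 1
    five_point = [((-1, 0), 0.25), ((1, 0), 0.25),
                  ((0, -1), 0.25), ((0, 1), 0.25)]
    print(halo_depths(five_point))  # [(1, 1), (1, 1)] -> one halo layer per side

From such depths, a skeleton in this style can size the ghost regions to exchange between neighboring GPUs without the user writing any communication code.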
Guided rewriting and constraint satisfaction for parallel GPU code generation
Graphics Processing Units (GPUs) are notoriously hard to optimise manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend either on user input to choose among a set of hard-coded optimisations, or on automated exploration of the implementation search space. The former lacks extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise.
This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only.
Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewrites. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings.
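As a simplified illustration of the idea (a toy model in Python, not LIFT's actual constraint system; the level names and constraints here are invented for exposition), one can enumerate assignments of nested map levels to GPU parallelism levels and keep only those satisfying scheduling constraints:

    # Toy model: filter assignments of nested maps to GPU parallelism
    # levels using simple scheduling constraints.
    from itertools import product

    LEVELS = ["workgroup", "local", "sequential"]

    def valid(mapping):
        # constraint 1: respect the GPU hierarchy; a workgroup-level map
        # may not appear nested inside a thread-local map
        seen_local = False
        for level in mapping:
            if level == "local":
                seen_local = True
            if level == "workgroup" and seen_local:
                return False
        # constraint 2: at least one level must actually be parallel
        return any(l != "sequential" for l in mapping)

    depth = 3  # three nested maps in the input expression
    mappings = [m for m in product(LEVELS, repeat=depth) if valid(m)]
    print(len(mappings), "valid parallel mappings out of", len(LEVELS) ** depth)

A real system extracts such constraints from the program itself and hands them to a solver instead of brute-force enumeration, but the effect is the same: the search is restricted to valid parallel mappings only.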
The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation.
A comparison with the vendor-provided handwritten kernels of the ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis's contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
Automated cache optimisations of stencil computations for partial differential equations
This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils used in real-world applications are often characterised by many memory accesses and non-trivial arithmetic expressions, leading to high computational costs compared with the simple stencils used in much prior proof-of-concept work. In addition, the loop nests that express stencils on structured grids may often be complicated.
This work is highly motivated by a specific domain of stencil computations in which one of the challenges is operations that are not aligned to the structured grid ("off-the-grid" operations). These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as A[B[i]]. In addition to this challenge, these practical stencils often involve many computational fields (requiring multiple copies of the grid to be stored), complex data dependencies and imperfect loop nests.
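The non-affine access pattern described above can be illustrated with a short NumPy example (an illustration added here, not from the thesis): sources and receivers at arbitrary positions touch the grid through an indirection array, i.e. A[B[i]]-style gathers and scatters.

    import numpy as np

    grid = np.zeros(100)
    positions = np.array([13, 42, 42, 77])   # off-the-grid points mapped to grid indices
    amplitudes = np.array([1.0, 0.5, 0.25, 2.0])

    # scatter: inject source terms into the grid (np.add.at accumulates repeats)
    np.add.at(grid, positions, amplitudes)

    # gather: sample the field at receiver locations
    samples = grid[positions]
    print(samples)  # [1.   0.75 0.75 2.  ]

Because the indices in positions are only known at runtime, such accesses defeat the static dependence analysis that affine loop optimisers rely on.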
In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions.

The first part of our work aims to reduce the data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis addresses this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations. However, it is rarely used in practical applications; it appears mostly in theoretical examples that prove its efficacy. Applying temporal blocking to scientific simulations is more complex.
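As an illustration of the technique itself (a toy example added here, not taken from the thesis), the following sketch applies overlapped time-tiling to a 1D 3-point stencil: each spatial tile is widened by one ghost cell per fused time step, so several time steps are computed per tile before writing back, trading redundant halo work for cache reuse.

    import numpy as np

    def step(a):
        out = a.copy()
        out[1:-1] = 0.25 * a[:-2] + 0.5 * a[1:-1] + 0.25 * a[2:]
        return out

    def time_tiled(a, steps, tile=64, tb=8):
        for t0 in range(0, steps, tb):
            nt = min(tb, steps - t0)           # time steps fused in this round
            new = a.copy()
            for s in range(0, len(a), tile):
                e = min(s + tile, len(a))
                lo, hi = max(0, s - nt), min(len(a), e + nt)  # ghost region
                block = a[lo:hi].copy()
                for _ in range(nt):            # advance the tile nt steps locally
                    block = step(block)
                new[s:e] = block[s - lo:s - lo + (e - s)]     # keep valid interior
            a = new
        return a

    x = np.random.rand(256)
    ref = x.copy()
    for _ in range(16):
        ref = step(ref)
    assert np.allclose(time_tiled(x, 16), ref)

Real applications complicate this picture considerably, which is precisely the gap the thesis targets.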
More specifically, in this work, we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking.

In the second part of our work, we extend the first part into an automated approach targeting a wide range of simulations modelled with partial differential equations.
Since temporal blocking is error-prone, tedious to apply by hand, and highly complex to assimilate theoretically and practically, we are motivated to automate its application and to automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are written in the Devito compiler, and they are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning. Devito (www.devitoproject.org) is a Python package for implementing optimised stencil computations (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on SymPy (www.sympy.org) and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof.
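For orientation, a minimal Devito usage sketch in the style of its public tutorials (a 2D diffusion equation) is shown below; treat the exact parameter values as indicative and check specifics against the Devito documentation.

    # Minimal sketch following Devito's public diffusion tutorial.
    from devito import Grid, TimeFunction, Eq, Operator, solve

    grid = Grid(shape=(128, 128), extent=(1.0, 1.0))
    u = TimeFunction(name="u", grid=grid, space_order=2)
    u.data[0, 56:72, 56:72] = 1.0                 # initial condition: a hot square

    pde = u.dt - 0.5 * u.laplace                  # symbolic PDE: du/dt = 0.5 * lap(u)
    update = Eq(u.forward, solve(pde, u.forward)) # explicit update for the next step

    op = Operator([update])                       # generates and JIT-compiles C code
    op.apply(time_M=200, dt=1e-5)                 # run 200 time steps

The compiler passes described in this thesis slot into the lowering of such an Operator, so the user's symbolic problem definition is unchanged while the generated loops gain temporal blocking.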
We show how we automate temporal blocking code generation without user intervention, often achieving better time-to-solution. We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems.
A symbolic emulator for shuffle synthesis on the NVIDIA PTX code
Various kinds of applications take advantage of GPUs through automation tools that attempt to automatically exploit the available performance of the GPU's parallel architecture. Directive-based programming models, such as OpenACC, are one such method that easily enables parallel computing by simply attaching code annotations to loops. Such abstract models, however, often prevent programmers from making additional low-level optimizations to take advantage of the advanced architectural features of GPUs, because the actual generated computation is hidden from the application developer. This paper describes and implements a novel flexible optimization technique that operates by inserting a code emulator phase at the tail-end of the compilation pipeline. Our tool emulates the generated code using symbolic analysis by substituting dynamic information, thus allowing further low-level code optimizations to be applied. We implement our tool to support both CUDA and OpenACC directives as the frontend of the compilation pipeline, thereby enabling low-level GPU optimizations for OpenACC that were not previously possible. We demonstrate the capabilities of our tool by automating warp-level shuffle instructions that are difficult to use even for advanced GPU programmers. Lastly, evaluating our tool with a benchmark suite and complex application code, we provide a detailed study assessing the benefits of shuffle instructions across four generations of GPU architectures.
Funding: the EPEEC project from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 801051, and the Ministerio de Ciencia e Innovación-Agencia Estatal de Investigación (PID2019-107255GB-C21/AEI/10.13039/501100011033). This work has been partially carried out on the ACME cluster owned by CIEMAT and funded by the Spanish Ministry of Economy and Competitiveness project CODEC-OSE (RTI2018-096006-B-I00).
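To make the warp-shuffle mechanism concrete, here is a small Python model (an illustration of what is being synthesised, not the paper's tool) of the semantics of CUDA's __shfl_down_sync: lanes read each other's registers directly, enabling an intra-warp tree reduction with no shared memory traffic.

    # Python model of __shfl_down_sync semantics over one 32-lane warp.
    def shfl_down(regs, delta, width=32):
        # lane i receives the register of lane i + delta (unchanged if out of range)
        return [regs[i + delta] if i + delta < width else regs[i]
                for i in range(width)]

    regs = [float(i) for i in range(32)]   # one register value per lane
    offset = 16
    while offset >= 1:                     # tree reduction: 16, 8, 4, 2, 1
        regs = [a + b for a, b in zip(regs, shfl_down(regs, offset))]
        offset //= 2

    assert regs[0] == sum(range(32))       # lane 0 holds the warp-wide sum

Recognising when a generated loop can be rewritten into this register-to-register form is exactly the kind of decision that requires the dynamic information the symbolic emulator substitutes.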
A domain-extensible compiler with controllable automation of optimisations
In high performance domains like image processing, physics simulation or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation.
This thesis shows the potential of the Shine compiler to achieve domain-extensibility, controllable automation, and generate high performance code. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely: synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with competitive runtime compared to established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine achieves 6 well-known image processing optimisations, 2 of them not being supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation.
The final research contribution is to introduce sketch-guided equality saturation, a semi-automated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations of matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB of RAM.
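The notion of a sketch can be illustrated with a toy matcher (illustrative Python added here, not the thesis's implementation or syntax): a sketch is a program pattern with holes, and a candidate program satisfies it if the pattern occurs somewhere inside.

    # Toy illustration of sketches: patterns with holes ("?") that a
    # candidate program term must contain somewhere.
    HOLE = "?"

    def matches(sketch, term):
        if sketch == HOLE:
            return True
        if isinstance(sketch, tuple) and isinstance(term, tuple):
            return (len(sketch) == len(term)
                    and all(matches(s, t) for s, t in zip(sketch, term)))
        return sketch == term

    def contains(sketch, term):
        if matches(sketch, term):
            return True
        return isinstance(term, tuple) and any(contains(sketch, t) for t in term)

    # goal: the program should contain a tiled map nest, whatever its body is
    sketch = ("map", ("tile", HOLE))
    program = ("fuse", ("map", ("tile", ("map", "f"))))
    print(contains(sketch, program))  # True

Because the sketch leaves the body unspecified, it is far smaller than the full program, which is what keeps the guided equality-saturation search tractable.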
Real-time high-performance computing for embedded control systems
The real-time control systems industry is moving towards the consolidation of multiple computing systems into fewer and more powerful ones, aiming for a reduction in size, weight, and power. The increasing demand for higher performance in other critical domains, like autonomous driving, has led the industry to recently include embedded GPUs for the implementation of advanced functionalities. The highly parallel architecture of GPUs could also be leveraged in the control systems industry to develop more advanced, energy-efficient, and scalable control systems. However, the closed-source and non-deterministic nature of GPUs complicates the resource provisioning analysis required for the implementation of critical real-time systems. On the other hand, there is no sign of GPUs being integrated into the traditional development cycle of control systems, which is oriented towards a model-based design approach. Recently, some vendors of model-based design tools have extended their development frameworks with GPU code generation capabilities targeting hybrid computing platforms, so that the model-based design environment now enables the concurrent analysis of more complex and diverse functions by simulation, and automates deployment to the final target. However, there is no indication of whether these tools are well-suited for the design and development of time-sensitive systems.
Motivated by these challenges, in this thesis we contribute to the state of the art of real-time control systems towards the adoption of embedded GPUs by providing tools to facilitate the resource provisioning analysis and the integration into the model-based design development cycle. First, we present a methodology and an automated tool to extract the properties of GPU memory allocators. This tool enables the computation of the real amount of memory used by GPU applications, facilitating a correct resource provisioning analysis. Then, we present a library which allows the characterization of the use of dynamic memory in GPU applications. We use this library to characterize GPU benchmarks, and we identify memory allocation patterns that could be modified to improve performance and memory consumption when targeting embedded GPUs. Based on these results, we present a tool to optimize the use of dynamic memory in legacy GPU applications executed on embedded platforms. This tool allows us to minimize the memory consumption and memory management overhead of GPU applications without rewriting them. Afterwards, we analyze the timing of control algorithms executed on embedded GPUs, and we identify techniques to achieve an acceptable real-time behavior. Finally, we evaluate model-based design tools in terms of integration with GPU hardware and GPU code generation, and we propose improvements for the model-based generated GPU code. Then, we present a source-to-source transformation tool to automatically apply the proposed improvements.
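In the spirit of the allocator-property methodology described above, the following hypothetical probe (a sketch using PyCUDA, which is an assumption; the thesis does not specify this tooling) infers the effective granularity of the GPU memory allocator by diffing the device's reported free memory around allocations of increasing size:

    # Hypothetical probe: measure memory actually consumed per allocation.
    import pycuda.autoinit          # creates a CUDA context
    import pycuda.driver as cuda

    def real_usage(nbytes):
        free_before, _ = cuda.mem_get_info()
        buf = cuda.mem_alloc(nbytes)
        free_after, _ = cuda.mem_get_info()
        buf.free()
        return free_before - free_after   # bytes actually consumed

    for req in (1, 256, 4096, 65536, 1 << 20):
        print(f"requested {req:>8} B -> consumed {real_usage(req):>8} B")

The gap between requested and consumed bytes is exactly the allocator property that a correct resource provisioning analysis has to account for.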
New experimental and computational methods for ultrasound brain tomography
Fast, portable, and affordable neuroimaging is currently unavailable in clinical practice, hindering prevention and treatment of pathologies such as stroke, a leading cause of death and disability worldwide. Full-waveform inversion (FWI), an ultrasound-based tomographic technique, has been recently proposed as a solution to this problem, but is yet to be successfully applied experimentally. Two fundamental barriers hinder its experimental application: a lack of numerical models that accurately replicate experimental measurements, and of domain-specific software that implements FWI algorithms efficiently. Here, I address both problems, opening the door to universally available neuroimaging.
Addressing the first barrier entails finding numerical models that can explain the behaviour of the acquisition system: the transmission and reception response of the transducers, and their spatial location and orientation. As I demonstrate here, existing position-estimation methods fail when the surface of the transducers is bigger than the wavelength, a prerequisite for imaging through the skull, while available response-estimation techniques cannot achieve the precision required by full-wave methods. Therefore, I present spatial response identification, a new algorithm for transducer calibration and modelling, and show how it can be used to explain experimental devices with higher accuracy than existing methods. Additionally, I present experimental reconstructions of a tissue-mimicking phantom, achieving improved imaging quality with respect to standard calibration techniques.
The second barrier stems from the fact that FWI is mathematically challenging and orders of magnitude more computationally expensive than conventional ultrasound imaging, while there is an absence of open codes, slowing the pace of research and hindering reproducibility. Therefore, I introduce Stride, an open-source Python library that combines high-level interfaces with automatically generated, high-performance solvers and scalable parallelisation. Here, I demonstrate that Stride can achieve state-of-the-art modelling accuracy and how it can be used to image in 2D and 3D, scaling from a local workstation to a high-performance cluster.
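For readers unfamiliar with the structure of FWI, the following is a schematic, library-agnostic sketch (illustrative Python, not Stride's API): a toy linear forward operator stands in for the wave-equation solve, and its adjoint supplies the gradient of the data misfit that drives the model update.

    # Schematic FWI loop: minimise the L2 misfit between simulated and
    # observed data by gradient descent, with the gradient obtained
    # through the adjoint of the forward operator.
    import numpy as np

    def forward(model, wavelet):
        # stand-in for a wave-equation solve: linear convolutional modelling
        return np.convolve(model, wavelet, mode="same")

    def misfit_and_gradient(model, wavelet, observed):
        residual = forward(model, wavelet) - observed
        grad = np.correlate(residual, wavelet, mode="same")  # adjoint action
        return 0.5 * residual @ residual, grad

    true_model = np.zeros(64)
    true_model[30:34] = 1.0
    wavelet = np.hanning(9)
    observed = forward(true_model, wavelet)

    model = np.zeros(64)
    for it in range(200):
        f, g = misfit_and_gradient(model, wavelet, observed)
        model -= 0.05 * g / (np.abs(g).max() + 1e-12)
    print(f"final misfit: {f:.3e}")

In practice the forward operator is a full 3D acoustic solve and the adjoint is a second wave simulation, which is why domain-specific code generation and scalable parallelisation are essential.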
EuroEXA - D2.6: Final ported application software
This document describes the porting of the EuroEXA application software to the single-CRDB testbed and discusses the experience gained from the porting and optimization activities, which should be taken into account in future redesign and optimization work. It accompanies the ported application software, found in the EuroEXA private repository (https://github.com/euroexa). In particular, this document describes the status of the software for each of the EuroEXA applications, sketches the redesign and optimization strategy for each application, and discusses the issues and difficulties faced during the porting activities and the lessons learned. A few preliminary evaluation results are presented; the full evaluation will be discussed in deliverable D2.8.