31 research outputs found
Accelerating Halide on an FPGA by using CIRCT and Calyx as an intermediate step to go from high-level, software-centric IRs down to RTL
Image processing and, more generally, array processing play an essential role in modern life: from applying filters to the images that we upload to social media to running object detection algorithms on self-driving cars. Optimizing these algorithms can be complex and often results in non-portable code. The Halide language provides a simple way to write image and array processing algorithms by separating the algorithm definition (what needs to be executed) from its execution schedule (how it is executed), delivering state-of-the-art performance that exceeds hand-tuned parallel and vectorized code. Due to the inherent parallel nature of these algorithms, FPGAs present an attractive acceleration platform. While previous work has added an RTL code generator to Halide, and utilized other heterogeneous computing languages as an intermediate step, these projects are no longer maintained. MLIR is an attractive solution, allowing the generation of code that can target multiple devices, such as parallelized and vectorized CPU code, OpenMP, and CUDA. CIRCT builds on top of MLIR to convert generic MLIR code to register transfer level (RTL) languages by using Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. This thesis presents a novel flow that implements an MLIR code generator for Halide that generates RTL code, adding the necessary wrappers to execute that code on Xilinx FPGA devices. Additionally, it implements a Halide runtime using the Xilinx Runtime (XRT), enabling seamless execution of the generated Halide RTL kernels. While this thesis provides only initial support for running Halide kernels, leaving some features and optimizations unsupported, it also details the future work needed to improve the performance of the generated RTL kernels. The proposed flow serves as a foundation for further research and development in the field of hardware acceleration for image and array processing applications using Halide.
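Halide's central idea, separating what is computed from how the loop nest traverses it, can be sketched in plain Python (this is an illustrative sketch, not Halide's actual C++ API): the same algorithm definition is realized under two different schedules and produces identical results.

```python
# Conceptual sketch (plain Python, not the Halide API) of Halide's
# separation between an algorithm and its schedule.

def blur_x(inp, x, y):
    # Algorithm: what to compute at each point (a 1D box blur).
    w = len(inp[0])
    return (inp[y][max(x - 1, 0)] + inp[y][x] + inp[y][min(x + 1, w - 1)]) / 3

def realize(algorithm, inp, schedule):
    # Schedule: how to traverse the iteration domain.
    h, w = len(inp), len(inp[0])
    out = [[0.0] * w for _ in range(h)]
    for y, x in schedule(w, h):
        out[y][x] = algorithm(inp, x, y)
    return out

def row_major(w, h):
    return ((y, x) for y in range(h) for x in range(w))

def tiled(tile):
    # Visit the domain tile by tile, as a cache-friendly schedule would.
    def order(w, h):
        for ty in range(0, h, tile):
            for tx in range(0, w, tile):
                for y in range(ty, min(ty + tile, h)):
                    for x in range(tx, min(tx + tile, w)):
                        yield y, x
    return order

img = [[float(x + y) for x in range(4)] for y in range(4)]
a = realize(blur_x, img, row_major)
b = realize(blur_x, img, tiled(2))
assert a == b  # different schedules, identical results
```

In Halide proper the schedule is expressed with primitives such as `tile`, `vectorize`, and `parallel` on a `Func`, and the same decoupling is what lets a backend retarget one algorithm to CPUs, GPUs, or (as in this thesis) RTL.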
Interactive application independent data processing using synthetic filesystems
In this thesis a software system is proposed that provides transparent access to dynamically processed data, using a synthetic filesystem both for the data transfer and for interaction with the processing pipeline. Within this context the architecture for such a software solution has been designed and implemented. Using this implementation, various profiling measurements have been acquired in order to evaluate the applicability in different data processing scenarios. Usability aspects, considering the interaction with the processing pipeline, have been examined as well. The implemented software is able to generate the processing result on the fly without modification of the original input data. Access to the output data is provided by means of a common filesystem interface, without the need to implement yet another communication protocol. Within the processing pipeline the data can be accessed and modified independently of the actual input and output encoding. Currently the data can be modified using a C/C++, GLSL or Java front end. Profiling data has shown that the overhead induced by the filesystem is negligible for most usage patterns and is only critical for real-time processing with a high data throughput, e.g. video processing at or above 30 frames per second, where typically no file operations are involved.
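The core mechanism, a filesystem node whose contents are generated on demand from unmodified input data, can be sketched as follows (hypothetical names, not the thesis's code; a real implementation would expose such nodes through FUSE rather than as plain file-like objects):

```python
import io

# Illustrative sketch of a synthetic-filesystem node: reading it
# returns the output of a processing function applied to the source
# data, while the source itself is never modified.

class SyntheticFile(io.RawIOBase):
    def __init__(self, source: bytes, process):
        self._out = process(source)   # result generated on the fly
        self._pos = 0

    def readable(self):
        return True

    def read(self, size=-1):
        if size < 0:
            size = len(self._out) - self._pos
        chunk = self._out[self._pos:self._pos + size]
        self._pos += len(chunk)
        return chunk

src = b"hello world"
node = SyntheticFile(src, lambda data: data.upper())
assert node.read() == b"HELLO WORLD"
assert src == b"hello world"  # original input left unmodified
```

Because the node speaks the ordinary file interface, any existing application can consume the processed data without a dedicated communication protocol, which is exactly the property the thesis evaluates.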
Tools for improving performance portability in heterogeneous environments
Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01
[Abstract]
Parallel computing is currently partially dominated by the availability of heterogeneous devices. These devices differ from each other in aspects such as the instruction set they execute, the number and the type of computing devices that they offer, or the structure of their memory systems. In recent years, languages, libraries and extensions have appeared that allow writing a parallel code once and running it on a wide variety of devices, OpenCL being the most widespread solution of this kind. However, functional portability does not imply performance portability. Thus, one of the problems that is still open in this field is achieving automatic performance portability, that is, the ability to automatically tune a given code for any device where it will be executed so that it obtains good performance. This thesis develops three different solutions to tackle this problem. All three are based on typical source-to-source optimizations for heterogeneous devices. Both the set of optimizations to apply and the way they are applied depend on different optimization parameters, whose values have to be tuned for each specific device.
The first solution is OCLoptimizer, a source-to-source optimizer that can optimize annotated OpenCL kernels with the help of configuration files that guide the optimization process. The tool optimizes kernels for a specific device, and it is also able to automate the generation of functional host codes when only a single kernel is optimized.
The two remaining solutions are built on top of the Heterogeneous Programming Library (HPL), a C++ framework that provides an easy and portable way to exploit heterogeneous computing systems. The first of these solutions uses the run-time code generation capabilities of HPL to generate a self-optimizing version of a matrix multiplication that can optimize itself at run time for a specific device. The last solution is a built-in just-in-time optimizer for HPL that can optimize, at run time, an HPL code for a specific device. While the first two solutions use search processes to find the best values for the optimization parameters, this last alternative relies on heuristics based on general optimization strategies.
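The run-time self-tuning that the HPL-based solutions perform can be illustrated with a small sketch: try several values of an optimization parameter (here, the tile size of a blocked matrix multiplication), time each candidate, and keep the fastest. The names and the simple exhaustive search are illustrative, not HPL's actual API.

```python
import time

# Hedged sketch of run-time parameter tuning: each candidate tile size
# is timed on the actual machine, and the fastest one is selected.

def matmul_tiled(A, B, tile):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a, row, brow = A[i][k], C[i], B[k]
                        for j in range(jj, min(jj + tile, n)):
                            row[j] += a * brow[j]
    return C

def autotune(A, B, candidates):
    # Search process: measure every candidate, keep the fastest.
    best, best_t = None, float("inf")
    for tile in candidates:
        t0 = time.perf_counter()
        matmul_tiled(A, B, tile)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = tile, dt
    return best

n = 32
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i - j) for j in range(n)] for i in range(n)]
tile = autotune(A, B, [4, 8, 16, 32])
ref = matmul_tiled(A, B, n)             # one tile covering the matrix
assert matmul_tiled(A, B, tile) == ref  # tuning never changes results
```

The thesis's third solution replaces this kind of exhaustive timing search with heuristics derived from general optimization strategies, trading a little peak performance for a much cheaper tuning step.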
Simulation Intelligence: Towards a New Generation of Scientific Methods
The original "Seven Motifs" set forth a roadmap of essential methods for the
field of scientific computing, where a motif is an algorithmic method that
captures a pattern of computation and data movement. We present the "Nine
Motifs of Simulation Intelligence", a roadmap for the development and
integration of the essential algorithms necessary for a merger of scientific
computing, scientific simulation, and artificial intelligence. We call this
merger simulation intelligence (SI), for short. We argue the motifs of
simulation intelligence are interconnected and interdependent, much like the
components within the layers of an operating system. Using this metaphor, we
explore the nature of each layer of the simulation intelligence operating
system stack (SI-stack) and the motifs therein: (1) Multi-physics and
multi-scale modeling; (2) Surrogate modeling and emulation; (3)
Simulation-based inference; (4) Causal modeling and inference; (5) Agent-based
modeling; (6) Probabilistic programming; (7) Differentiable programming; (8)
Open-ended optimization; (9) Machine programming. We believe coordinated
efforts between motifs offer immense opportunity to accelerate scientific
discovery, from solving inverse problems in synthetic biology and climate
science, to directing nuclear energy experiments and predicting emergent
behavior in socioeconomic settings. We elaborate on each layer of the SI-stack,
detailing the state-of-the-art methods, presenting examples to highlight challenges
and opportunities, and advocating for specific ways to advance the motifs and
the synergies from their combinations. Advancing and integrating these
technologies can enable a robust and efficient hypothesis-simulation-analysis
type of scientific method, which we introduce with several use cases for
human-machine teaming and automated science.
Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning
Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational resource footprint and inefficient utilisation of the computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target-device, which constitutes the majority of auto-tuning compute time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target-device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost.
Simultaneously, DOPpler breaks long-held assumptions about serial candidate measurement by successfully parallelising measurements intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target-devices.
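The two cost-reduction ideas can be sketched in miniature (this is a conceptual illustration with made-up numbers, not the Trimmer or DOPpler implementations): a cheap cost model filters out candidates predicted to be slow, and only the survivors are measured, concurrently rather than serially.

```python
from concurrent.futures import ThreadPoolExecutor

# Conceptual sketch: candidate filtering by a cost model, followed by
# parallel measurement of the surviving candidates.

def filter_candidates(candidates, predict, keep):
    # Keep only the `keep` candidates with the lowest predicted cost.
    return sorted(candidates, key=predict)[:keep]

def measure_all(candidates, measure, workers=4):
    # Measure surviving candidates concurrently instead of serially.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(candidates, pool.map(measure, candidates)))

# Toy search space: candidate id -> true latency (deterministic here;
# a real auto-tuner would time tensor programs on the target device).
true_latency = {c: (c * 7) % 13 + 1 for c in range(20)}
predict = lambda c: true_latency[c] + 0.5   # an (unrealistically good) model
survivors = filter_candidates(list(true_latency), predict, keep=5)
timed = measure_all(survivors, lambda c: true_latency[c])
best = min(timed, key=timed.get)
assert best == min(true_latency, key=true_latency.get)
```

Filtering shrinks the measured set, and parallel measurement shrinks the wall-clock time per survivor; together they attack the two cost drivers the thesis identifies.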