
    Accelerating Halide on an FPGA by using CIRCT and Calyx as an intermediate step to go from high-level, software-centric IRs down to RTL

    Image processing and, more generally, array processing play an essential role in modern life: from applying filters to the images that we upload to social media to running object detection algorithms on self-driving cars. Optimizing these algorithms can be complex and often results in non-portable code. The Halide language provides a simple way to write image and array processing algorithms by separating the algorithm definition (what needs to be executed) from its execution schedule (how it is executed), delivering state-of-the-art performance that exceeds hand-tuned parallel and vectorized code. Due to the inherent parallel nature of these algorithms, FPGAs present an attractive acceleration platform. While previous work has added an RTL code generator to Halide, and utilized other heterogeneous computing languages as an intermediate step, these projects are no longer maintained. MLIR is an attractive solution, allowing the generation of code that can target multiple devices, such as parallelized and vectorized CPU code, OpenMP, and CUDA. CIRCT builds on top of MLIR to convert generic MLIR code to register transfer level (RTL) languages by using Calyx, a new intermediate language (IL) for compiling high-level programs into hardware designs. This thesis presents a novel flow: an MLIR code generator for Halide that emits RTL code, adding the necessary wrappers to execute that code on Xilinx FPGA devices. Additionally, it implements a Halide runtime using the Xilinx Runtime (XRT), enabling seamless execution of the generated Halide RTL kernels. While this thesis provides only initial support for running Halide kernels, with not all features and optimizations supported, it details the future work needed to improve the performance of the generated RTL kernels. The proposed flow serves as a foundation for further research and development in the field of hardware acceleration for image and array processing applications using Halide.
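    The algorithm/schedule separation the abstract describes can be sketched in plain Python (this is a conceptual illustration, not Halide's actual API): the algorithm defines what each output pixel is, while the schedule decides the loop structure used to compute it, without changing the result.

```python
def blur_x(img, x, y):
    """Algorithm: horizontal 3-tap box blur, clamped at the borders."""
    w = len(img[0])
    return (img[y][max(x - 1, 0)] + img[y][x] + img[y][min(x + 1, w - 1)]) // 3

def realize(algorithm, img, tile=None):
    """Schedule: evaluate the algorithm row-by-row, or in tiles."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    if tile is None:
        # Default schedule: plain row-major loops.
        for y in range(h):
            for x in range(w):
                out[y][x] = algorithm(img, x, y)
    else:
        # Tiled schedule: identical values, different traversal order.
        for ty in range(0, h, tile):
            for tx in range(0, w, tile):
                for y in range(ty, min(ty + tile, h)):
                    for x in range(tx, min(tx + tile, w)):
                        out[y][x] = algorithm(img, x, y)
    return out

img = [[(x * y) % 256 for x in range(8)] for y in range(8)]
assert realize(blur_x, img) == realize(blur_x, img, tile=4)  # schedules agree
```

    In Halide itself the schedule (tiling, vectorization, parallelization) is expressed declaratively and the compiler generates the loop nest; the point of the sketch is only that changing the schedule changes performance characteristics, never the output.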

    Interactive application independent data processing using synthetic filesystems

    In this thesis a software system is proposed that provides transparent access to dynamically processed data, using a synthetic filesystem both for data transfer and for interaction with the processing pipeline. Within this context, the architecture for such a software solution has been designed and implemented. Using this implementation, various profiling measurements have been acquired in order to evaluate its applicability in different data processing scenarios. Usability aspects of interacting with the processing pipeline have been examined as well. The implemented software is able to generate the processing result on the fly without modification of the original input data. Access to the output data is provided by means of a common filesystem interface, without the need to implement yet another communication protocol. Within the processing pipeline, the data can be accessed and modified independently of the actual input and output encoding. Currently the data can be modified using a C/C++, GLSL or Java front end. Profiling data has shown that the overhead induced by the filesystem is negligible for most usage patterns and is only critical for realtime processing with a high data throughput, e.g. video processing at or above 30 frames per second, where typically no file operations are involved.
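    The core idea, processed output generated on demand behind an ordinary file interface while the input stays untouched, can be sketched in pure Python. A real deployment would mount a synthetic filesystem (e.g. via FUSE or 9P); here a file-like object stands in for that mount, and `bytes.upper` stands in for an arbitrary processing pipeline. All names below are illustrative assumptions, not the thesis's actual implementation.

```python
import io

class ProcessedFile(io.RawIOBase):
    """Expose transformed data through the standard file API, lazily."""

    def __init__(self, source: bytes, transform):
        self._source = io.BytesIO(source)   # original input, never modified
        self._transform = transform         # stand-in for the processing pipeline

    def readable(self):
        return True

    def readinto(self, b):
        chunk = self._source.read(len(b))   # pull raw input on demand
        out = self._transform(chunk)        # process this chunk on the fly
        b[: len(out)] = out
        return len(out)                     # 0 signals EOF to the reader

raw = b"sensor data: 1 2 3\n"
f = io.BufferedReader(ProcessedFile(raw, bytes.upper))
assert f.read() == b"SENSOR DATA: 1 2 3\n"  # consumer sees only processed output
```

    A consumer reads through the normal file interface with no knowledge of the pipeline, which is exactly the "no new communication protocol" property the abstract claims.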

    Tools for improving performance portability in heterogeneous environments

    Programa Oficial de Doutoramento en Investigación en Tecnoloxías da Información. 524V01. [Abstract] Parallel computing is currently partially dominated by the availability of heterogeneous devices. These devices differ from each other in aspects such as the instruction set they execute, the number and type of computing units they offer, or the structure of their memory systems. In recent years, languages, libraries and extensions have appeared that allow writing a parallel code once and running it on a wide variety of devices, OpenCL being the most widespread solution of this kind. However, functional portability does not imply performance portability. Thus, one of the problems that is still open in this field is achieving automatic performance portability: the ability to automatically tune a given code for any device where it will be executed so that it obtains good performance. This thesis develops three different solutions to tackle this problem. All three are based on typical source-to-source optimizations for heterogeneous devices. Both the set of optimizations to apply and the way they are applied depend on different optimization parameters, whose values have to be tuned for each specific device. The first solution is OCLoptimizer, a source-to-source optimizer that can optimize annotated OpenCL kernels with the help of configuration files that guide the optimization process. The tool optimizes kernels for a specific device, and it is also able to automate the generation of functional host code when only a single kernel is optimized. The two remaining solutions are built on top of the Heterogeneous Programming Library (HPL), a C++ framework that provides an easy and portable way to exploit heterogeneous computing systems. The first of these solutions uses the run-time code generation capabilities of HPL to generate a self-optimizing version of a matrix multiplication that can optimize itself at run time for a specific device. The last solution is a built-in just-in-time optimizer for HPL that can optimize, at run time, an HPL code for a specific device. While the first two solutions use search processes to find the best values for the optimization parameters, this last alternative relies on heuristics based on general optimization strategies.
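    The empirical-search idea shared by these tools can be sketched in pure Python: the same kernel is generated for several values of an optimization parameter (here, the tile size of a matrix product) and the fastest variant is kept. The real systems tune OpenCL/HPL code on heterogeneous devices; the names and the timing-based search below are illustrative assumptions only.

```python
import time

def matmul_tiled(A, B, n, tile):
    """One point in the search space: a matrix product tiled by `tile`."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

def autotune(A, B, n, candidates):
    """Search process: time each parameter value, keep the best one."""
    best_tile, best_time = None, float("inf")
    for tile in candidates:
        start = time.perf_counter()
        matmul_tiled(A, B, n, tile)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best_tile, best_time = tile, elapsed
    return best_tile

n = 32
A = [[float(i + j) for j in range(n)] for i in range(n)]
B = [[float(i * j % 7) for j in range(n)] for i in range(n)]
best = autotune(A, B, n, candidates=[4, 8, 16, 32])
assert best in (4, 8, 16, 32)
```

    The heuristic-based alternative described for the HPL just-in-time optimizer would replace the timing loop in `autotune` with rules that pick parameter values directly from device characteristics, avoiding the cost of the search.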

    Simulation Intelligence: Towards a New Generation of Scientific Methods

    The original "Seven Motifs" set forth a roadmap of essential methods for the field of scientific computing, where a motif is an algorithmic method that captures a pattern of computation and data movement. We present the "Nine Motifs of Simulation Intelligence", a roadmap for the development and integration of the essential algorithms necessary for a merger of scientific computing, scientific simulation, and artificial intelligence. We call this merger simulation intelligence (SI), for short. We argue the motifs of simulation intelligence are interconnected and interdependent, much like the components within the layers of an operating system. Using this metaphor, we explore the nature of each layer of the simulation intelligence operating system stack (SI-stack) and the motifs therein: (1) Multi-physics and multi-scale modeling; (2) Surrogate modeling and emulation; (3) Simulation-based inference; (4) Causal modeling and inference; (5) Agent-based modeling; (6) Probabilistic programming; (7) Differentiable programming; (8) Open-ended optimization; (9) Machine programming. We believe coordinated efforts between motifs offer immense opportunity to accelerate scientific discovery, from solving inverse problems in synthetic biology and climate science, to directing nuclear energy experiments and predicting emergent behavior in socioeconomic settings. We elaborate on each layer of the SI-stack, detailing the state-of-the-art methods, presenting examples to highlight challenges and opportunities, and advocating for specific ways to advance the motifs and the synergies from their combinations. Advancing and integrating these technologies can enable a robust and efficient hypothesis-simulation-analysis type of scientific method, which we introduce with several use-cases for human-machine teaming and automated science.
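    One of the listed motifs, simulation-based inference, admits a compact illustration: when a simulator can generate data but its likelihood is intractable, parameters can still be inferred by Approximate Bayesian Computation (ABC) rejection sampling. The toy simulator, prior, summary statistic, and tolerance below are assumptions for illustration, not taken from the paper.

```python
import random
import statistics

random.seed(0)

def simulator(theta, n=50):
    """Forward model: n noisy observations around the unknown parameter."""
    return [random.gauss(theta, 1.0) for _ in range(n)]

observed = simulator(3.0)                  # pretend these came from nature
obs_mean = statistics.fmean(observed)

# ABC rejection: keep prior draws whose simulated data matches the
# observed summary statistic within a tolerance.
accepted = []
for _ in range(5000):
    theta = random.uniform(-10.0, 10.0)    # draw from a broad prior
    sim_mean = statistics.fmean(simulator(theta))
    if abs(sim_mean - obs_mean) < 0.2:
        accepted.append(theta)

posterior_mean = statistics.fmean(accepted)
assert abs(posterior_mean - 3.0) < 0.6     # inference recovers theta near 3
```

    The same pattern, replace an intractable likelihood with repeated forward simulation, underlies the merger of simulation and inference the SI-stack argues for.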

    Analysing and Reducing Costs of Deep Learning Compiler Auto-tuning

    Deep Learning (DL) is significantly impacting many industries, including automotive, retail and medicine, enabling autonomous driving, recommender systems and genomics modelling, amongst other applications. At the same time, demand for complex and fast DL models is continually growing. The most capable models tend to exhibit the highest operational costs, primarily due to their large computational footprint and the inefficient utilisation of the computational resources employed by DL systems. In an attempt to tackle these problems, DL compilers and auto-tuners emerged, automating the traditionally manual task of DL model performance optimisation. While auto-tuning improves model inference speed, it is a costly process, which limits its wider adoption within DL deployment pipelines. The high operational costs associated with DL auto-tuning have multiple causes. During operation, DL auto-tuners explore large search spaces consisting of billions of tensor programs to propose potential candidates that improve DL model inference latency. Subsequently, DL auto-tuners measure candidate performance in isolation on the target-device, which constitutes the majority of auto-tuning compute-time. Suboptimal candidate proposals, combined with their serial measurement on an isolated target-device, lead to prolonged optimisation time and reduced resource availability, ultimately reducing the cost-efficiency of the process. In this thesis, we investigate the reasons behind prolonged DL auto-tuning and quantify their impact on the optimisation costs, revealing directions for improved DL auto-tuner design. Based on these insights, we propose two complementary systems: Trimmer and DOPpler. Trimmer improves tensor program search efficacy by filtering out poorly performing candidates, and controls end-to-end auto-tuning using cost objectives, monitoring optimisation cost. Simultaneously, DOPpler breaks long-held assumptions about serial candidate measurement by successfully parallelising measurements intra-device, with minimal penalty to optimisation quality. Through extensive experimental evaluation of both systems, we demonstrate that they significantly improve the cost-efficiency of auto-tuning (by up to 50.5%) across a plethora of tensor operators, DL models, auto-tuners and target-devices.
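    The measurement bottleneck the thesis targets can be sketched as follows: auto-tuners time many candidate tensor programs, traditionally one at a time. In the sketch below a stub "measurement" stands in for running a candidate on the target device, and candidates are measured concurrently rather than serially, which is the high-level idea behind DOPpler; the code itself is an illustrative assumption, not the thesis's system.

```python
import concurrent.futures
import time

def measure(candidate_id):
    """Stub for executing one candidate tensor program on the device."""
    time.sleep(0.05)                            # stand-in for on-device run time
    return candidate_id, 1.0 / (1 + candidate_id)   # fake latency score (lower is better)

candidates = range(8)

start = time.perf_counter()
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = dict(pool.map(measure, candidates))   # measure candidates concurrently
parallel_time = time.perf_counter() - start

best = min(results, key=results.get)    # pick the candidate with the lowest latency
assert best == 7
assert parallel_time < 8 * 0.05         # well under the ~0.4 s a serial sweep would take
```

    The real difficulty, which the sketch omits, is that concurrent measurements on one device contend for its resources and can distort timings; the abstract's claim is precisely that this can be done with minimal penalty to optimisation quality.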