1,137 research outputs found

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation

    A Primer on Seq2Seq Models for Generative Chatbots

    Get PDF
    The recent spread of Deep Learning-based solutions for Artificial Intelligence and the development of Large Language Models has pushed forwards significantly the Natural Language Processing area. The approach has quickly evolved in the last ten years, deeply affecting NLP, from low-level text pre-processing tasks –such as tokenisation or POS tagging– to high-level, complex NLP applications like machine translation and chatbots. This paper examines recent trends in the development of open-domain data-driven generative chatbots, focusing on the Seq2Seq architectures. Such architectures are compatible with multiple learning approaches, ranging from supervised to reinforcement and, in the last years, allowed to realise very engaging open-domain chatbots. Not only do these architectures allow to directly output the next turn in a conversation but, to some extent, they also allow to control the style or content of the response. To offer a complete view on the subject, we examine possible architecture implementations as well as training and evaluation approaches. Additionally, we provide information about the openly available corpora to train and evaluate such models and about the current and past chatbot competitions. Finally, we present some insights on possible future directions, given the current research status

    Using Visual Programming Games to Study Novice Programmers

    Get PDF
    Enabling programmers to write correct and efficient parallel code remains an important challenge, and the prevalence of on-chip accelerators exacerbates this challenge. Novice programmers, especially those in disciplines outside of Computer Science and Computer Engineering, need to be able to write code that exploits parallelism and heterogeneity, but the frameworks for writing parallel and heterogeneous programs expect expert knowledge and experience. More effort must be put into understanding how novice programmers solve parallel problems. Unfortunately, novice programmers are difficult to study because they are, by definition, novices. We have designed a visual programming language and game-based framework for studying how novice programmers solve parallel problems. This tool was used to conduct an initial study on 95 undergraduate students with little to no prior programming experience. 71% of all volunteer participants completed the study in 48 minutes on average. This study demonstrated that novice programmers could solve parallel problems, and this framework can be used to conduct more thorough studies of how novice programmers approach parallel code

    Optimization for Deep Learning Systems Applied to Computer Vision

    Get PDF
    149 p.Since the DL revolution and especially over the last years (2010-2022), DNNs have become an essentialpart of the CV field, and they are present in all its sub-fields (video-surveillance, industrialmanufacturing, autonomous driving, ...) and in almost every new state-of-the-art application that isdeveloped. However, DNNs are very complex and the architecture needs to be carefully selected andadapted in order to maximize its efficiency. In many cases, networks are not specifically designed for theconsidered use case, they are simply recycled from other applications and slightly adapted, without takinginto account the particularities of the use case or the interaction with the rest of the system components,which usually results in a performance drop.This research work aims at providing knowledge and tools for the optimization of systems based on DeepLearning applied to different real use cases within the field of Computer Vision, in order to maximizetheir effectiveness and efficiency

    Constraint-based simulation of virtual crowds

    Get PDF
    Central to simulating pedestrian crowds is their motion and behaviour. It is required to understand how pedestrians move to simulate and predict scenarios with crowds of people. Pedestrian behaviours enhance the range of motions people can demonstrate, resulting in greater variety, believability, and accuracy. Models with complex computations and motion have difficulty in being extended with additional behaviours. This is because the structure of these models are not designed in a way that is generally compatible with collision avoidance behaviours. To address this issue, this thesis will research a possible pedestrian model that can simulate collision response with a wide range of additional behaviours. The model will do so by using constraints, a limit on the velocity of a person's movement. The proposed model will use constraints as the core computation. By describing behaviours in terms of constraints, these behaviours can be combined with the proposed model. Pedestrian simulations strike a balance between model complexity and runtime speed. Some models focus entirely on the complexity and accuracy of people, while other models focus on creating believable yet lightweight and performant simulations. Believable crowds look realistic to human observation, but do not match up to numerical analysis under scrutiny. The larger the population, and the more complex the motion of people, the slower the simulation will run. One route for improving performance of software is by using Graphical Processing Units (GPUs). GPUs are devices with theoretical performance that far outperforms equivalent multi-core CPUs. Research literature tends to focus on either the accuracy, or the performance optimisations of pedestrian crowd simulations. This suggests that there is opportunity to create more accurate models that run relatively quickly. Real time is a useful measure of model runtime. A simulation that runs in real time can be interactive and respond live to user input. By increasing the performance of the model, larger and more complex models can be simulated. This in turn increases the range of applications the model can represent. This thesis will develop a performant pedestrian simulation that runs in real time. It will explore how suitable the model is for GPU acceleration, and what performance gains can be obtained by implementing the model on the GPU

    Lattice Boltzmann Methods for Partial Differential Equations

    Get PDF
    Lattice Boltzmann methods provide a robust and highly scalable numerical technique in modern computational fluid dynamics. Besides the discretization procedure, the relaxation principles form the basis of any lattice Boltzmann scheme and render the method a bottom-up approach, which obstructs its development for approximating broad classes of partial differential equations. This work introduces a novel coherent mathematical path to jointly approach the topics of constructability, stability, and limit consistency for lattice Boltzmann methods. A new constructive ansatz for lattice Boltzmann equations is introduced, which highlights the concept of relaxation in a top-down procedure starting at the targeted partial differential equation. Modular convergence proofs are used at each step to identify the key ingredients of relaxation frequencies, equilibria, and moment bases in the ansatz, which determine linear and nonlinear stability as well as consistency orders of relaxation and space-time discretization. For the latter, conventional techniques are employed and extended to determine the impact of the kinetic limit at the very foundation of lattice Boltzmann methods. To computationally analyze nonlinear stability, extensive numerical tests are enabled by combining the intrinsic parallelizability of lattice Boltzmann methods with the platform-agnostic and scalable open-source framework OpenLB. Through upscaling the number and quality of computations, large variations in the parameter spaces of classical benchmark problems are considered for the exploratory indication of methodological insights. Finally, the introduced mathematical and computational techniques are applied for the proposal and analysis of new lattice Boltzmann methods. Based on stabilized relaxation, limit consistent discretizations, and consistent temporal filters, novel numerical schemes are developed for approximating initial value problems and initial boundary value problems as well as coupled systems thereof. In particular, lattice Boltzmann methods are proposed and analyzed for temporal large eddy simulation, for simulating homogenized nonstationary fluid flow through porous media, for binary fluid flow simulations with higher order free energy models, and for the combination with Monte Carlo sampling to approximate statistical solutions of the incompressible Euler equations in three dimensions

    Close to the metal: Towards a material political economy of the epistemology of computation

    Get PDF
    This paper investigates the role of the materiality of computation in two domains: blockchain technologies and artificial intelligence (AI). Although historically designed as parallel computing accelerators for image rendering and videogames, graphics processing units (GPUs) have been instrumental in the explosion of both cryptoasset mining and machine learning models. The political economy associated with video games and Bitcoin and Ethereum mining provided a staggering growth in performance and energy efficiency and this, in turn, fostered a change in the epistemological understanding of AI: from rules-based or symbolic AI towards the matrix multiplications underpinning connectionism, machine learning and neural nets. Combining a material political economy of markets with a material epistemology of science, the article shows that there is no clear-cut division between software and hardware, between instructions and tools, and between frameworks of thought and the material and economic conditions of possibility of thought itself. As the microchip shortage and the growing geopolitical relevance of the hardware and semiconductor supply chain come to the fore, the paper invites social scientists to engage more closely with the materialities and hardware architectures of ‘virtual’ algorithms and software

    Dynamically Finding Optimal Kernel Launch Parameters for CUDA Programs

    Get PDF
    In this thesis, we present KLARAPTOR (Kernel LAunch parameters RAtional Program estimaTOR), a freely available tool to dynamically determine the values of kernel launch parameters of a CUDA kernel. We describe a technique for building a helper program, at the compile-time of a CUDA program, that is used at run-time to determine near-optimal kernel launch parameters for the kernels of that CUDA program. This technique leverages the MWP-CWP performance prediction model, runtime data parameters, and runtime hardware parameters to dynamically determine the launch parameters for each kernel invocation. This technique is implemented within the KLARAPTOR tool, utilizing the LLVM Pass Framework and NVIDIA Nsight Compute CLI profiler. We demonstrate the effectiveness of our approach through experimentation on the PolyBench benchmark suite of CUDA kernels

    On benchmarking of deep learning systems: software engineering issues and reproducibility challenges

    Get PDF
    Since AlexNet won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012, Deep Learning (and Machine Learning/AI in general) gained an exponential interest. Nowadays, their adoption spreads over numerous sectors, like automotive, robotics, healthcare and finance. The ML advancement goes in pair with the quality improvement delivered by those solutions. However, those ameliorations are not for free: ML algorithms always require an increasing computational power, which pushes computer engineers to develop new devices capable of coping with this demand for performance. To foster the evolution of DSAs, and thus ML research, it is key to make it easy to experiment and compare them. This may be challenging since, even if the software built around these devices simplifies their usage, obtaining the best performance is not always straightforward. The situation gets even worse when the experiments are not conducted in a reproducible way. Even though the importance of reproducibility for the research is evident, it does not directly translate into reproducible experiments. In fact, as already shown by previous studies regarding other research fields, also ML is facing a reproducibility crisis. Our work addresses the topic of reproducibility of ML applications. Reproducibility in this context has two aspects: results reproducibility and performance reproducibility. While the reproducibility of the results is mandatory, performance reproducibility cannot be neglected because high-performance device usage causes cost. To understand how the ML situation is regarding reproducibility of performance, we reproduce results published for the MLPerf suite, which seems to be the most used machine learning benchmark. Because of the wide range of devices and frameworks used in different benchmark submissions, we focus on a subset of accuracy and performance results submitted to the MLPerf Inference benchmark, presenting a detailed analysis of the difficulties a scientist may find when trying to reproduce such a benchmark and a possible solution using our workflow tool for experiment reproducibility: PROVA!. We designed PROVA! to support the reproducibility in traditional HPC experiments, but we will show how we extended it to be used as a 'driver' for MLPerf benchmark applications. The PROVA! driver mode allows us to experiment with different versions of the MLPerf Inference benchmark switching among different hardware and software combinations and compare them in a reproducible way. In the last part, we will present the results of our reproducibility study, demonstrating the importance of having a support tool to reproduce and extend original experiments getting deeper knowledge about performance behaviours

    A multi-level functional IR with rewrites for higher-level synthesis of accelerators

    Get PDF
    Specialised accelerators deliver orders of magnitude higher energy-efficiency than general-purpose processors. Field Programmable Gate Arrays (FPGAs) have become the substrate of choice, because the ever-changing nature of modern workloads, such as machine learning, demands reconfigurability. However, they are notoriously hard to program directly using Hardware Description Languages (HDLs). Traditional High-Level Synthesis (HLS) tools improve productivity, but come with their own problems. They often produce sub-optimal designs and programmers are still required to write hardware-specific code, thus development cycles remain long. This thesis proposes Shir, a higher-level synthesis approach for high-performance accelerator design with a hardware-agnostic programming entry point, a multi-level Intermediate Representation (IR), a compiler and rewrite rules for optimisation. First, a novel, multi-level functional IR structure for accelerator design is described. The IRs operate on different levels of abstraction, cleanly separating different hardware concerns. They enable the expression of different forms of parallelism and standard memory features, such as asynchronous off-chip memories or synchronous on-chip buffers, as well as arbitration of such shared resources. Exposing these features at the IR level is essential for achieving high performance. Next, mechanical lowering procedures are introduced to automatically compile a program specification through Shir’s functional IRs until low-level HDL code for FPGA synthesis is emitted. Each lowering step gradually adds implementation details. Finally, this thesis presents rewrite rules for automatic optimisations around parallelisation, buffering and data reshaping. Reshaping operations pose a challenge to functional approaches in particular. They introduce overheads that compromise performance or even prevent the generation of synthesisable hardware designs altogether. This fundamental issue is solved by the application of rewrite rules. The viability of this approach is demonstrated by running matrix multiplication and 2D convolution on an Intel Arria 10 FPGA. A limited design space exploration is conducted, confirming the ability of the IR to exploit various hardware features. Using rewrite rules for optimisation, it is possible to generate high-performance designs that are competitive with highly tuned OpenCL implementations and that outperform hardware-agnostic OpenCL code. The performance impact of the optimisations is further evaluated showing that they are essential to achieving high performance, and in many cases also necessary to produce hardware that fits the resource constraints
    • …
    corecore