510 research outputs found

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend either on user input to choose a subset of hard-coded optimisations or on automated exploration of the implementation search space. The former lacks extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique that finds a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewritings. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernels of the ARM Compute Library and with the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR lends itself well to user-guided and automatic rewriting for high-performance code generation.
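The idea of treating parallel mapping as a constraint satisfaction problem can be illustrated with a deliberately tiny sketch. The loop sizes, GPU limit, mapping levels, and brute-force enumeration below are illustrative assumptions, not the thesis' actual solver or constraint set:

```python
from itertools import permutations

# Toy model: map three nested parallel loops onto GPU scheduling levels
# (workgroup, local thread, sequential), subject to a hardware limit.
# Loop sizes and the thread limit are made up for illustration.
LOOP_SIZES = {"i": 1024, "j": 64, "k": 8}
MAX_LOCAL = 256  # max threads per workgroup on our hypothetical GPU
LEVELS = ("workgroup", "local", "sequential")

def valid_mappings():
    """Enumerate loop-to-level assignments satisfying the constraints."""
    for perm in permutations(LOOP_SIZES):
        mapping = dict(zip(perm, LEVELS))
        local = [l for l, lvl in mapping.items() if lvl == "local"]
        # Constraint: loops mapped to 'local' must fit in a workgroup.
        if all(LOOP_SIZES[l] <= MAX_LOCAL for l in local):
            yield mapping

solutions = list(valid_mappings())
```

A real solver would also encode barrier placement and memory-hierarchy restrictions, but the shape of the problem, pruning a combinatorial mapping space with per-program constraints, is the same.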

    Integrated Approaches to Digital-enabled Design for Manufacture and Assembly: A Modularity Perspective and Case Study of Huoshenshan Hospital in Wuhan, China

    Get PDF
    Countries are trying to expand their healthcare capacity through advanced construction, modular innovation, digital technologies and integrated design approaches such as Design for Manufacture and Assembly (DfMA). Within the context of China, there is a need for stronger implementation of digital technologies and DfMA, as well as a knowledge gap regarding how digital-enabled DfMA is implemented. More critically, an integrated approach is needed in addition to DfMA guidelines and digital-enabled approaches. This research used a mixed-methods design. Questionnaires established the context of Huoshenshan Hospital, namely healthcare construction in China. Huoshenshan Hospital then provided a case study of the first emergency hospital built to address the uncertainty of COVID-19. This extreme project, a 1,000-bed hospital built in 10 days, implemented DfMA in healthcare construction and provides an opportunity to examine the use of modularity. A workshop with a design institution provided basic facts and insight into past practice and was followed by interviews with 18 designers, from various design disciplines, who were involved in the project. Finally, multiple archival materials were used as secondary data sources. It was found that complexity hinders building systems integration, while reinforcement relationships between multiple dimensions of modularity (across organisation-process-product-supply chain dimensions) are the underlying mechanism that allows for the reduction of complexity and the integration of building systems. Promoting integrated approaches to DfMA relies on adjusting and coupling multi-dimensional modular reinforcement relationships (namely, relationships of modular alignment, modular complement, and modular incentive). Thus, the building systems integrator can use these three approaches to increase the success of digital-enabled DfMA.

    Tools for efficient Deep Learning

    Get PDF
    In the era of Deep Learning (DL), there is a fast-growing demand for building and deploying Deep Neural Networks (DNNs) on various platforms. This thesis proposes five tools to address the challenges of designing DNNs that are efficient in time, resources and power consumption. We first present Aegis and SPGC to address the challenges of improving the memory efficiency of DL training and inference. Aegis makes mixed precision training (MPT) more stable through layer-wise gradient scaling. Empirical experiments show that Aegis can improve MPT accuracy by up to 4%. SPGC focuses on structured pruning: replacing standard convolution with group convolution (GConv) to avoid irregular sparsity. SPGC formulates GConv pruning as a channel permutation problem and proposes a novel heuristic polynomial-time algorithm. Common DNNs pruned by SPGC achieve up to 1% higher accuracy than prior work. This thesis also addresses the gap between DNN descriptions and executables, with Polygeist for software and POLSCA for hardware. Many novel techniques, e.g. statement splitting and memory partitioning, are explored and used to expand polyhedral optimisation. Polygeist speeds up sequential and parallel software execution by 2.53 and 9.47 times respectively on Polybench/C. POLSCA achieves a 1.5 times speedup over hardware designs generated directly from high-level synthesis on Polybench/C. Moreover, this thesis presents Deacon, a framework that generates FPGA-based DNN accelerators with streaming architectures and advanced pipelining techniques to address the challenges posed by heterogeneous convolutions and residual connections. Deacon provides fine-grained pipelining, graph-level optimisation, and heuristic exploration by graph colouring. Compared with prior designs, Deacon improves resource/power consumption efficiency by 1.2x/3.5x for MobileNets and 1.0x/2.8x for SqueezeNets. All these tools are open source, and some have already gained public engagement. We believe they can make efficient deep learning applications easier to build and deploy.
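Layer-wise gradient scaling of the kind Aegis applies can be sketched in a few lines: each layer gets its own power-of-two loss scale so that small gradients survive the narrow fp16 range. The function below is a minimal illustration under assumed thresholds, not the actual Aegis algorithm:

```python
import math

# fp16 overflows above ~65504 and loses precision below ~2**-14;
# a per-layer scale keeps gradient magnitudes in the safe middle.
def layer_scale(grad_abs_max: float, target: float = 1.0) -> float:
    """Pick a power-of-two scale lifting the layer's largest gradient
    magnitude toward `target` without overflowing fp16."""
    if grad_abs_max == 0.0:
        return 1.0
    scale = 2.0 ** math.floor(math.log2(target / grad_abs_max))
    # Clamp so scaled values stay inside the fp16 representable range.
    while grad_abs_max * scale > 65504.0:
        scale /= 2.0
    return scale
```

In practice the scale would be updated dynamically from observed overflows per layer; a single global scale (as in standard MPT) is the baseline this approach improves upon.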

    Improving digital image correlation in the TopoSEM Software Package

    Get PDF
    Integrated master's dissertation in Informatics Engineering. TopoSEM is a software package that aims to reconstruct the 3D surface topography of a microscopic sample from a set of 2D Scanning Electron Microscopy (SEM) images. TopoSEM is also able to produce a stability report on the calibration of the SEM hardware based solely on output images. One of the key steps in both of these workflows is the use of a Digital Image Correlation (DIC) algorithm, a no-contact imaging technique, to measure full-field displacements of an input image. A novel DIC implementation fine-tuned for 3D reconstruction was originally developed in MATLAB to satisfy the feature requirements of this project. However, near real-time usability of TopoSEM is paramount for its users, and the main barrier towards this goal is the under-performing DIC implementation. This dissertation ported the original MATLAB implementation of TopoSEM to sequential C++, and its performance was further optimised: (i) to improve memory accesses, (ii) to exploit the vector extensions available in each core of current multiprocessor chips to perform computationally intensive operations on vectors and matrices of single- and double-precision floating point values, and (iii) to further improve execution performance through parallelisation on multi-core devices, using multiple threads with a wavefront propagation scheduler. The initial MATLAB implementation took 3279.4 seconds to compute the full-field displacement of a 2576 by 2086 pixel image on a quad-core laptop. With all added improvements, the new parallel C++ version on the same laptop lowered the execution time to 1.52 seconds, achieving an overall speedup of 2158.
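The wavefront propagation scheduler mentioned above exploits a standard dependence pattern: if each cell of a 2D grid depends on its top and left neighbours, all cells on the same anti-diagonal are mutually independent and can run in parallel. A minimal sketch (grid size and dependence pattern are illustrative, not taken from the dissertation's DIC code):

```python
# Cell (r, c) depends on (r-1, c) and (r, c-1), so every cell on the
# anti-diagonal r + c = d can be processed concurrently once wave d-1
# is done. This is the scheduling structure a wavefront scheduler uses.
ROWS, COLS = 4, 6

def wavefronts(rows: int, cols: int):
    """Group grid cells into waves of mutually independent work items."""
    for d in range(rows + cols - 1):
        yield [(r, d - r) for r in range(rows) if 0 <= d - r < cols]

waves = list(wavefronts(ROWS, COLS))
```

Each wave would be dispatched to a thread pool; the number of waves (rows + cols - 1) bounds the critical path, while the widest wave bounds the achievable parallelism.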

    Horizontally distributed inference of deep neural networks for AI-enabled IoT

    Get PDF
    Motivated by the pervasiveness of artificial intelligence (AI) and the Internet of Things (IoT) in the current “smart everything” scenario, this article provides a comprehensive overview of the most recent research at the intersection of both domains. It focuses on the design and development of specific mechanisms for enabling collaborative inference across edge devices towards the in situ execution of highly complex state-of-the-art deep neural networks (DNNs), despite the resource-constrained nature of such infrastructures. In particular, the review discusses the most salient approaches conceived along those lines, elaborating on the specificities of the partitioning schemes and the parallelism paradigms explored. It provides an organised and schematic discussion of the underlying workflows and associated communication patterns, as well as the architectural aspects of the DNNs that have driven the design of such techniques, while also highlighting both the primary challenges encountered at the design and operational levels and the specific adjustments or enhancements explored in response to them.
    Agencia Estatal de Investigación | Ref. DPI2017-87494-R
    Ministerio de Ciencia e Innovación | Ref. PDC2021-121644-I00
    Xunta de Galicia | Ref. ED431C 2022/03-GR
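One of the simplest partitioning schemes surveyed in this line of work is layer-wise (pipeline-style) splitting: assign contiguous runs of layers to devices so that per-device compute is roughly balanced. The greedy sketch below uses made-up layer costs and is only a toy stand-in for the partitioners the review covers:

```python
# Toy layer-wise partitioning of a DNN chain across edge devices.
# `costs` are per-layer compute costs in arbitrary units (illustrative).
def partition(costs, n_devices):
    """Greedy contiguous split: cut whenever the running cost reaches
    an equal share of the total. Returns one list of layer indices
    per device."""
    share = sum(costs) / n_devices
    parts, current, acc = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        acc += c
        if acc >= share and len(parts) < n_devices - 1:
            parts.append(current)
            current, acc = [], 0.0
    parts.append(current)
    return parts

layers = [4, 1, 1, 4, 2, 2, 2]  # e.g. conv blocks of varying cost
plan = partition(layers, 3)
```

Real schemes must also weigh the activation traffic at each cut point and device heterogeneity, which is exactly where the communication patterns discussed in the review come in.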

    Learning to Configure Separators in Branch-and-Cut

    Full text link
    Cutting planes are crucial in solving mixed integer linear programs (MILPs), as they facilitate bound improvements on the optimal solution. Modern MILP solvers rely on a variety of separators to generate a diverse set of cutting planes by invoking the separators frequently during the solving process. This work identifies that MILP solvers can be drastically accelerated by appropriately selecting which separators to activate. As the combinatorial separator selection space poses challenges for machine learning, we learn to separate by proposing a novel data-driven strategy to restrict the selection space and a learning-guided algorithm on the restricted space. Our method predicts instance-aware separator configurations that can dynamically adapt during the solve, effectively accelerating the open-source MILP solver SCIP by improving relative solve time by up to 72% and 37% on synthetic and real-world MILP benchmarks, respectively. Our work complements recent work on learning to select cutting planes and highlights the importance of separator management.
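The two-step structure, restrict the configuration space to a small pool, then score configurations per instance, can be sketched abstractly. The separator names, pool, and linear scorer below are invented for illustration and do not reproduce the paper's learned model or SCIP's API:

```python
# A restricted pool of separator configurations (subsets of separators).
CONFIGS = [
    frozenset({"gomory"}),
    frozenset({"gomory", "clique"}),
    frozenset({"clique", "knapsack"}),
]

def score(config, features):
    """Hypothetical linear scorer: reward separators that match instance
    features (e.g. many binary variables favour clique cuts)."""
    w = {"gomory": features["frac_int"],
         "clique": features["frac_bin"],
         "knapsack": features["frac_knap"]}
    return sum(w[s] for s in config)

def choose(features):
    """Pick the highest-scoring configuration for this instance."""
    return max(CONFIGS, key=lambda c: score(c, features))
```

In the paper the pool is derived from data and the scorer is learned; the key point the sketch preserves is that selection happens over a small restricted space rather than all 2^k separator subsets.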

    Rethinking FPGA Architectures for Deep Neural Network applications

    Get PDF
    The prominence of machine-learning-powered solutions has instituted an unprecedented trend of integration into virtually all applications, with a broad range of deployment constraints from tiny embedded systems to large-scale warehouse computing machines. While recent research confirms the advantages of using contemporary FPGAs to deploy or accelerate machine learning applications, especially where latency and energy consumption are strictly limited, their architectures, optimised before the machine learning era, remain a barrier to overall efficiency and performance. Recognising this shortcoming, this thesis presents an architectural study aimed at solutions that unlock hidden potential in FPGA technology, primarily for machine learning algorithms. In particular, it shows how slight alterations to state-of-the-art architectures could significantly enhance FPGAs toward becoming more machine-learning-friendly while maintaining near-promised performance for the rest of the applications. It then presents a novel systematic approach to deriving new block architectures guided by design limitations and machine learning algorithm characteristics through benchmarking. First, through three modifications to Xilinx DSP48E2 blocks, an enhanced digital signal processing (DSP) block for important computations in embedded deep neural network (DNN) accelerators is described. Then, two tiers of modifications to FPGA logic cell architecture are explained that deliver a variety of performance and utilisation benefits with only minor area overheads. Finally, with the goal of exploring this new design space in a methodical manner, a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations is first proposed. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then suggested, together with a family of new embedded blocks called MLBlocks.
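The "nested loops over MAC operations" formulation is concrete enough to write down: matrix multiplication, the archetypal benchmark, is exactly such a loop nest, and it is this structure that a coarse-grained block must execute efficiently. A plain reference version (illustrative; the thesis works at the hardware-block level, not in software):

```python
# C[m][n] = sum_k A[m][k] * B[k][n], written as an explicit MAC nest.
# Each innermost iteration is one multiply-accumulate: the operation a
# candidate embedded block (e.g. an MLBlock) is sized and wired around.
def matmul_mac(A, B, M, N, K):
    C = [[0.0] * N for _ in range(M)]
    for m in range(M):
        for n in range(N):
            acc = 0.0
            for k in range(K):
                acc += A[m][k] * B[k][n]  # one multiply-accumulate
            C[m][n] = acc
    return C
```

Benchmarking, in this framing, amounts to measuring how different tilings and unrollings of these three loops map onto a candidate block's multipliers, adders, and register resources.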

    Hardware Acceleration of Progressive Refinement Radiosity using Nvidia RTX

    Full text link
    A vital component of photo-realistic image synthesis is the simulation of indirect diffuse reflections, which remain a quintessential hurdle that modern rendering engines struggle to overcome. Real-time applications typically pre-generate diffuse lighting information offline using radiosity to avoid performing costly computations at run-time. In this thesis we present a variant of progressive refinement radiosity that utilizes Nvidia's novel RTX technology to accelerate the process of form-factor computation without compromising visual fidelity. Through a modern implementation built on DirectX 12, we demonstrate that offloading radiosity's visibility component to RT cores significantly improves the lightmap generation process and potentially propels it into the domain of real-time rendering.
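The core loop of progressive refinement radiosity is short: repeatedly pick the patch with the most unshot radiosity and distribute it to every other patch weighted by form factor and reflectivity. The sketch below uses a hand-made form-factor matrix purely for illustration; in the thesis those factors come from RTX-accelerated visibility rays:

```python
# One progressive-refinement step over a small patch set.
# B[i]  = accumulated radiosity of patch i
# dB[i] = unshot radiosity of patch i
# F[i][j] = form factor from patch i to patch j (illustrative values)
# rho[j] = diffuse reflectivity of patch j
def shoot_iteration(B, dB, F, rho):
    """Shoot the largest unshot radiosity into the environment."""
    i = max(range(len(dB)), key=lambda j: dB[j])
    shot = dB[i]
    dB[i] = 0.0
    for j in range(len(B)):
        if j == i:
            continue
        dr = rho[j] * F[i][j] * shot  # radiosity received at patch j
        B[j] += dr
        dB[j] += dr
    return i
```

Because the expensive part of estimating each F[i][j] is the visibility test between patches, it is exactly this component that benefits from being offloaded to ray-tracing hardware.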

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Understanding and Optimizing Communication Overhead in Distributed Training

    Get PDF
    In recent years, Deep Learning models have shown great potential in many areas, including Computer Vision, Speech Recognition, and Information Retrieval. This has resulted in a growing interest in applying Deep Learning models in academia and industry. Using Deep Learning models on a specific task requires training. With the recent rapid growth in the size of Deep Learning models and datasets, training on a single accelerator can take years. To complete training within a reasonable amount of time, practitioners use multiple accelerators to speed up training (i.e., distributed training). Distributed training requires additional communication to coordinate all accelerators, and in many cases this communication becomes the bottleneck. In this thesis, we study and optimize the communication overhead in distributed training. In the first part of the thesis, we conduct measurement studies and what-if analyses to understand the relationship between the network and communication overhead. We design a trace-based simulation algorithm and test it under various network assumptions. We find that the network is under-utilized, and that achieving gradient compression ratios of up to hundreds of times is often unnecessary for data center networks. The second part of the thesis optimizes the communication overhead of distributed training without changing the semantics of the training algorithm. We design and implement MiCS, a system that significantly reduces communication overhead in public cloud environments by minimizing the communication scale. The evaluation shows that MiCS significantly outperforms existing partitioned data-parallel systems. In the last part of the thesis, we further improve the performance of MiCS for more challenging cases, e.g., long input sequences. We combine pipeline parallelism with MiCS to further reduce the overhead of inter-node communication. Besides, we propose two memory optimizations to improve memory efficiency. MiCS has been adopted by several teams inside Amazon and is available in Amazon SageMaker.
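The kind of back-of-the-envelope model that underpins such what-if analyses is easy to state for the common ring all-reduce collective: each worker sends and receives 2(n-1)/n of the gradient volume. The numbers below are illustrative, not measurements from the thesis:

```python
# Estimate ring all-reduce time for `size_bytes` of gradients over
# `n` workers connected by links of `bw` bytes/s (bandwidth-only model,
# ignoring latency and overlap with compute).
def ring_allreduce_time(size_bytes, n, bw):
    """Ring all-reduce moves 2*(n-1)/n of the data per worker."""
    return 2 * (n - 1) / n * size_bytes / bw

# e.g. 100 MB of gradients, 8 workers, 10 GB/s links
t = ring_allreduce_time(100e6, 8, 10e9)
```

Plugging trace-derived message sizes into a model like this, under varying bandwidth assumptions, is what lets one ask whether aggressive gradient compression is actually necessary on a given network.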