169 research outputs found

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation

    Approximate Computing Survey, Part I: Terminology and Software & Hardware Approximation Techniques

    Full text link
    The rapid growth of demanding applications in domains applying multimedia processing and machine learning has marked a new era for edge and cloud computing. These applications involve massive data and compute-intensive tasks, and thus, typical computing paradigms in embedded systems and data centers are stressed to meet the worldwide demand for high performance. Concurrently, the landscape of the semiconductor field in the last 15 years has constituted power as a first-class design concern. As a result, the community of computing systems is forced to find alternative design approaches to facilitate high-performance and/or power-efficient computing. Among the examined solutions, Approximate Computing has attracted an ever-increasing interest, with research works applying approximations across the entire traditional computing stack, i.e., at software, hardware, and architectural levels. Over the last decade, there is a plethora of approximation techniques in software (programs, frameworks, compilers, runtimes, languages), hardware (circuits, accelerators), and architectures (processors, memories). The current article is Part I of our comprehensive survey on Approximate Computing, and it reviews its motivation, terminology and principles, as well it classifies and presents the technical details of the state-of-the-art software and hardware approximation techniques.Comment: Under Review at ACM Computing Survey

    LoopTune: Optimizing Tensor Computations with Reinforcement Learning

    Full text link
    Advanced compiler technology is crucial for enabling machine learning applications to run on novel hardware, but traditional compilers fail to deliver performance, popular auto-tuners have long search times and expert-optimized libraries introduce unsustainable costs. To address this, we developed LoopTune, a deep reinforcement learning compiler that optimizes tensor computations in deep learning models for the CPU. LoopTune optimizes tensor traversal order while using the ultra-fast lightweight code generator LoopNest to perform hardware-specific optimizations. With a novel graph-based representation and action space, LoopTune speeds up LoopNest by 3.2x, generating an order of magnitude faster code than TVM, 2.8x faster than MetaSchedule, and 1.08x faster than AutoTVM, consistently performing at the level of the hand-tuned library Numpy. Moreover, LoopTune tunes code in order of seconds

    mAPN: Modeling, Analysis, and Exploration of Algorithmic and Parallelism Adaptivity

    Full text link
    Using parallel embedded systems these days is increasing. They are getting more complex due to integrating multiple functionalities in one application or running numerous ones concurrently. This concerns a wide range of applications, including streaming applications, commonly used in embedded systems. These applications must implement adaptable and reliable algorithms to deliver the required performance under varying circumstances (e.g., running applications on the platform, input data, platform variety, etc.). Given the complexity of streaming applications, target systems, and adaptivity requirements, designing such systems with traditional programming models is daunting. This is why model-based strategies with an appropriate Model of Computation (MoC) have long been studied for embedded system design. This work provides algorithmic adaptivity on top of parallelism for dynamic dataflow to express larger sets of variants. We present a multi-Alternative Process Network (mAPN), a high-level abstract representation in which several variants of the same application coexist in the same graph expressing different implementations. We introduce mAPN properties and its formalism to describe various local implementation alternatives. Furthermore, mAPNs are enriched with metadata to Provide the alternatives with quantitative annotations in terms of a specific metric. To help the user analyze the rich space of variants, we propose a methodology to extract feasible variants under user and hardware constraints. At the core of the methodology is an algorithm for computing global metrics of an execution of different alternatives from a compact mAPN specification. We validate our approach by exploring several possible variants created for the Automatic Subtitling Application (ASA) on two hardware platforms.Comment: 26 PAGES JOURNAL PAPE

    A Highly Optimized Skeleton for Unbalanced and Deep Divide-And-Conquer Algorithms on Multi-Core Clusters

    Get PDF
    Financiado para publicación en acceso aberto: Universidade da Coruña/CISUG [Abstract] Efficiently implementing the divide-and-conquer pattern of parallelism in distributed memory systems is very relevant, given its ubiquity, and difficult, given its recursive nature and the need to exchange tasks and data among the processors. This task is noticeably further complicated in the presence of multi-core systems, where hybrid parallelism must be exploited to attain the best performance, and when unbalanced and deep workloads are considered, as additional measures must be taken to load balance and avoid deep recursion problems. In this manuscript a parallel skeleton that fulfills all these requirements while providing high levels of usability is presented. In fact, the evaluation shows that our proposal is on average 415.32% faster than MPI codes and 229.18% faster than MPI + OpenMP benchmarks, while offering an average improvement in the programmability metrics of 131.04% over MPI alternatives and 155.18% over MPI + OpenMP solutions.This research was supported by the Ministry of Science and Innovation of Spain (PID2019-104184RB-I00 and PID2019-104834GB-I00, AEI/FEDER/EU, 10.13039/501100011033) and the predoctoral Grant of Millán Álvarez Ref. BES-2017-081320), and by the Xunta de Galicia co-founded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups (ED431C 2018/19 and ED431C 2021/30). We acknowledge also the support from the Centro Singular de Investigación de Galicia “CITIC” and the Centro Singular de Investigación en Tecnoloxías Intelixentes “CiTIUS”, funded by Xunta de Galicia and the European Union (European Regional Development Fund- Galicia 2014-2020 Program), by Grants ED431G 2019/01 and ED431G 2019/04. We also acknowledge the Centro de Supercomputación de Galicia (CESGA). Open Access funding provided thanks to the CRUE-CSIC agreement with Springer NatureXunta de Galicia; ED431C 2018/19Xunta de Galicia; ED431C 2021/30Xunta de Galicia; ED431G 2019/01Xunta de Galicia; ED431G 2019/0

    The Impact of Precision Tuning on Embedded Systems Performance: A Case Study on Field-Oriented Control

    Get PDF
    Field Oriented Control (FOC) is an industry-standard strategy for controlling induction motors and other kinds of AC-based motors. This control scheme has a very high arithmetic intensity when implemented digitally - in particular it requires the use of trigonometric functions. This requirement contrasts with the necessity of increasing the control step frequency when required, and the minimization of power consumption in applications where conserving battery life is paramount such as drones. However, it also makes FOC well suited for optimization using precision tuning techniques. Therefore, we exploit the state-of-the-art FixM methodology to optimize a miniapp simulating a typical FOC application by applying precision tuning of trigonometric functions. The FixM approach itself was extended in order to implement additional algorithm choices to enable a trade-off between execution time and code size. With the application of FixM on the miniapp, we achieved a speedup up to 278%, at a cost of an error in the output less than 0.1%

    Eco ladrillos de plástico reciclado PET para el mejoramiento de las viviendas del sector Kumamoto II Etapa, El Porvenir 2021

    Get PDF
    Se determinó que más de un millón de familias habitan una vivienda en mal estado o no cuentan con una por el alto costo de los materiales de construcción; asimismo, otro gran problema que se identificó es la cantidad de residuos plástico que se desechan diariamente, el cual representa el 10% del total de los residuos que se generan. El objetivo fue determinar la influencia de los Eco ladrillos de plástico reciclado PET para el mejoramiento de las viviendas del sector Kumamoto II Etapa, El Porvenir 2021. Es de enfoque cualitativo tipo básica no experimental – descriptivo; la muestra fue realizada a 45 familias con viviendas en estado vulnerable del sector Kumamoto II Etapa del distrito de El Porvenir. Se aplicó la técnica de encuesta para la obtención de los datos requeridos; utilizando la estadística de fiabilidad del Alfa de Cronbach con un resultado de 0.81 procesado mediante el programa SPSS versión 25. Se determinó que el precio de material para la construcción de 1.00 m2 de muro con ladrillo PET es de 34.40 soles. Se concluyó que el costo de material más bajo para la construcción de una vivienda es el ladrillo PET, el cual no requiere de aditivos de alto precio para ser unidos; siendo un material térmico, sismorresistente y de peso liviano
    corecore