19,254 research outputs found

    swTVM: Exploring the Automated Compilation for Deep Learning on Sunway Architecture

    Full text link
    The flourish of deep learning frameworks and hardware platforms has been demanding an efficient compiler that can shield the diversity in both software and hardware in order to provide application portability. Among the exiting deep learning compilers, TVM is well known for its efficiency in code generation and optimization across diverse hardware devices. In the meanwhile, the Sunway many-core processor renders itself as a competitive candidate for its attractive computational power in both scientific and deep learning applications. This paper combines the trends in these two directions. Specifically, we propose swTVM that extends the original TVM to support ahead-of-time compilation for architecture requiring cross-compilation such as Sunway. In addition, we leverage the architecture features during the compilation such as core group for massive parallelism, DMA for high bandwidth memory transfer and local device memory for data locality, in order to generate efficient code for deep learning application on Sunway. The experimental results show the ability of swTVM to automatically generate code for various deep neural network models on Sunway. The performance of automatically generated code for AlexNet and VGG-19 by swTVM achieves 6.71x and 2.45x speedup on average than hand-optimized OpenACC implementations on convolution and fully connected layers respectively. This work is the first attempt from the compiler perspective to bridge the gap of deep learning and high performance architecture particularly with productivity and efficiency in mind. We would like to open source the implementation so that more people can embrace the power of deep learning compiler and Sunway many-core processor

    A Compilation Flow for Parametric Dataflow: Programming Model, Scheduling, and Application to Heterogeneous MPSoC

    Get PDF
    International audienceEfficient programming of signal processing applications on embedded systems is a complex problem. High level models such as Synchronous dataflow (SDF) have been privileged candidates for dealing with this complexity. These models permit to express inherent application parallelism, as well as analysis for both verification and optimization. Parametric dataflow models aim at providing sufficient dynamicity to model new applications, while at the same time maintaining the high level of analyzability needed for efficient real life implementations. This paper presents a new compilation flow that targets parametric dataflows. Built on the LLVM compiler infrastructure, it offers an actor based C++ programming model to describe parametric graphs, a compilation front-end providing graph analysis features, and a retargetable back-end to map the application on real hardware. This paper gives an overview of this flow, with a specific focus on scheduling. The crucial gap between dataflow models and real hardware on which actor firing is not atomic, as well as the consequences on FIFOs sizing and execution pipelining are taken into account.The experimental results illustrate our compilation flow applied to compilation of 3GPP LTE-Advanced demodulation on a heterogeneous MPSoC with distributed scheduling features. This achieves performances similar to time-consuming hand made optimizations

    pocl: A Performance-Portable OpenCL Implementation

    Get PDF
    OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via arxi

    Optimized Compilation of Aggregated Instructions for Realistic Quantum Computers

    Full text link
    Recent developments in engineering and algorithms have made real-world applications in quantum computing possible in the near future. Existing quantum programming languages and compilers use a quantum assembly language composed of 1- and 2-qubit (quantum bit) gates. Quantum compiler frameworks translate this quantum assembly to electric signals (called control pulses) that implement the specified computation on specific physical devices. However, there is a mismatch between the operations defined by the 1- and 2-qubit logical ISA and their underlying physical implementation, so the current practice of directly translating logical instructions into control pulses results in inefficient, high-latency programs. To address this inefficiency, we propose a universal quantum compilation methodology that aggregates multiple logical operations into larger units that manipulate up to 10 qubits at a time. Our methodology then optimizes these aggregates by (1) finding commutative intermediate operations that result in more efficient schedules and (2) creating custom control pulses optimized for the aggregate (instead of individual 1- and 2-qubit operations). Compared to the standard gate-based compilation, the proposed approach realizes a deeper vertical integration of high-level quantum software and low-level, physical quantum hardware. We evaluate our approach on important near-term quantum applications on simulations of superconducting quantum architectures. Our proposed approach provides a mean speedup of 5×5\times, with a maximum of 10×10\times. Because latency directly affects the feasibility of quantum computation, our results not only improve performance but also have the potential to enable quantum computation sooner than otherwise possible.Comment: 13 pages, to apper in ASPLO

    Application Acceleration on FPGAs with OmpSs@FPGA

    Get PDF
    © 2019 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.OmpSs@FPGA is the flavor of OmpSs that allows offloading application functionality to FPGAs. Similarly to OpenMP, it is based on compiler directives. While the OpenMP specification also includes support for heterogeneous execution, we use OmpSs and OmpSs@FPGA as prototype implementation to develop new ideas for OpenMP. OmpSs@FPGA implements the tasking model with runtime support to automatically exploit all SMP and FPGA resources available in the execution platform. In this paper, we present the OmpSs@FPGA ecosystem, based on the Mercurium compiler and the Nanos++ runtime system. We show how the applications are transformed to run on the SMP cores and the FPGA. The application kernels defined as tasks to be accelerated, using the OmpSs directives are: 1) transformed by the compiler into kernels connected with the proper synchronization and communication ports, 2) extracted to intermediate files, 3) compiled through the FPGA vendor HLS tool, and 4) used to configure the FPGA. Our Nanos++ runtime system schedules the application tasks on the platform, being able to use the SMP cores and the FPGA accelerators at the same time. We present the evaluation of the OmpSs@FPGA environment with the Matrix Multiplication, Cholesky and N-Body benchmarks, showing the internal details of the execution, and the performance obtained on a Zynq Ultrascale+ MPSoC (up to 128x). The source code uses OmpSs@FPGA annotations and different Vivado HLS optimization directives are applied for acceleration.This work is partially supported by the European Union H2020 program through the EuroEXA project (grant 754337), and HiPEAC (GA 687698), by the Spanish Government through Programa Severo Ochoa (SEV-2015- 0493), by the Spanish Ministry of Science and Technology (TIN2015-65316-P) and the Departament d’Innovació Universitats i Empresa de la Generalitat de Catalunya, under project MPEXPAR: Models de Programació i Entorns d’Execució Paral·lels (2014-SGR-1051).Peer ReviewedPostprint (author's final draft

    Loop pipelining with resource and timing constraints

    Get PDF
    Developing efficient programs for many of the current parallel computers is not easy due to the architectural complexity of those machines. The wide variety of machine organizations often makes it more difficult to port an existing program than to reprogram it completely. Therefore, powerful translators are necessary to generate effective code and free the programmer from concerns about the specific characteristics of the target machine. This work focuses on techniques to be used by an important class of translators, whose objective is to transform sequential programs into equivalent more parallel programs. The transformations are performed at instruction level in order to exploit low level parallelism and increase memory locality.Most of the current applications are programmed in languages which do not allow us to express parallelism between high-level sentences (as Pascal, C or Fortran). Furthermore, a lot of applications written ten or more years ago are still used today, and it is not feasible to rewrite such applications for many reasons (not only technical reasons, but also economic ones). Translators enable programmers to write the application in a familiar sequential programming language, without concerning their selves with the architecture of the target machine. Current compilers for parallel architectures not only translate a program written on a high-level language to the appropriate machine language, but also perform some transformations in the final code in order to execute the program in a more parallel way. The transformations improve the performance in the execution of the program by making use of the knowledge that the compiler has about the machine architecture. The semantics of the program remain intact after any transformation.Experiments show that limiting parallelization to basic blocks not included in loops limits maximum speedup. This is because loops often comprise a large portion of the parallelism available to be exploited in a program. For this reason, a lot of effort has been devoted in the recent years to parallelize loop execution. Several parallel computer architectures and compilation techniques have been proposed to exploit such a parallelism at different granularities. Multiprocessors exploit coarse grained parallelism by distributing entire loop iterations to different processors. Systems oriented to the high-level synthesis (HLS) of VLSI circuits, superscalar processors and very long instruction word (VLIW) processors exploit fine-grained parallelism at instruction level. This work addresses fine-grained parallelization of loops addressed to the HLS of VLSI circuits. Two algorithms are proposed for resource constraints and for timing constraints. An algorithm to reduce the number of registers required to execute a loop in a given architecture is also proposed.Postprint (published version

    The CIAO Multi-Dialect Compiler and System: An Experimentation Workbench for Future (C)LP Systems

    Full text link
    CIAO is an advanced programming environment supporting Logic and Constraint programming. It offers a simple concurrent kernel on top of which declarative and non-declarative extensions are added via librarles. Librarles are available for supporting the ISOProlog standard, several constraint domains, functional and higher order programming, concurrent and distributed programming, internet programming, and others. The source language allows declaring properties of predicates via assertions, including types and modes. Such properties are checked at compile-time or at run-time. The compiler and system architecture are designed to natively support modular global analysis, with the two objectives of proving properties in assertions and performing program optimizations, including transparently exploiting parallelism in programs. The purpose of this paper is to report on recent progress made in the context of the CIAO system, with special emphasis on the capabilities of the compiler, the techniques used for supporting such capabilities, and the results in the áreas of program analysis and transformation already obtained with the system
    corecore