473 research outputs found

    Decoupling algorithms from schedules for easy optimization of image processing pipelines

    Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism. We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code. We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain-specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.
    Funding: National Science Foundation (U.S.) (Grant 0964004); National Science Foundation (U.S.) (Grant 0964218); National Science Foundation (U.S.) (Grant 0832997); United States. Dept. of Energy (Award DE-SC0005288); Cognex Corporation; Adobe System
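    The separation described in this abstract can be illustrated with a small sketch. This is plain Python, not Halide and not the authors' API: the names `make_input`, `build_blur`, `realize`, and `run` are hypothetical, and the two-stage blur is an assumed example pipeline. The algorithm is written once as functions over unbounded integer coordinates; the only "schedule" decision shown is whether the intermediate stage is recomputed at each use or stored.

```python
from functools import lru_cache


def make_input():
    """Hypothetical input image: a function defined at every integer (x, y)."""
    calls = {"n": 0}

    def inp(x, y):
        calls["n"] += 1  # count reads so the two schedules can be compared
        return float((3 * x + 5 * y) % 7)

    return inp, calls


def build_blur(inp, store_intermediate):
    """Algorithm: blur_y(blur_x(inp)). The flag is the only schedule choice."""

    def blur_x(x, y):
        return (inp(x - 1, y) + inp(x, y) + inp(x + 1, y)) / 3.0

    # Schedule: either store blur_x values (memoize) or recompute them per use.
    stage = lru_cache(maxsize=None)(blur_x) if store_intermediate else blur_x

    def blur_y(x, y):
        return (stage(x, y - 1) + stage(x, y) + stage(x, y + 1)) / 3.0

    return blur_y


def realize(f, width, height):
    """Evaluate a pipeline function over a finite output region."""
    return [[f(x, y) for x in range(width)] for y in range(height)]


def run(store_intermediate, width=8, height=8):
    inp, calls = make_input()
    out = realize(build_blur(inp, store_intermediate), width, height)
    return out, calls["n"]


recomputed, reads_recompute = run(store_intermediate=False)
stored, reads_stored = run(store_intermediate=True)
```

    Both runs produce the same image, and only the number of input reads differs. That is the property the abstract relies on: the schedule changes cost, not results. Real schedules also cover tiling, vectorization, and parallelism, which this sketch does not attempt.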

    Parallelization of shallow water simulations on current multi-threaded systems

    Lobeiras, J., Viñas, M., Amor, M., Fraguela, B.B., Arenaz, M., García, J., Castro, M. Parallelization of shallow water simulations on current multi-threaded systems. The International Journal of High Performance Computing Applications 27, 493–512. © 2013 The Author(s), © SAGE Publications. https://doi.org/10.1177/1094342012464800
    Abstract: In this work, several parallel implementations of a numerical model of pollutant transport on a shallow water system are presented. These parallel implementations are developed in two phases. First, the sequential code is rewritten to exploit the stream programming model. Second, the streamed code is targeted at current multi-threaded systems, in particular multi-core CPUs and modern GPUs. The performance is evaluated on a multi-core CPU using OpenMP, and on a GPU using the streaming-oriented programming language Brook+ as well as OpenCL, the standard language for heterogeneous systems.
    Funding: This work was supported by the Galician Government (Consolidation of Competitive Research Groups, Xunta de Galicia ref. 2010/6, projects INCITE08PXIB105161PR and 08TIC001206PR), the Ministry of Science and Innovation, cofunded by the FEDER funds of the European Union (grant number TIN2010-16735, and project numbers MTM2009-11923 and MTM2010-21135).
    Funders: Xunta de Galicia; INCITE08PXIB105161PR. Xunta de Galicia; 08TIC001206P

    Improving Compute & Data Efficiency of Flexible Architectures


    SQUARE: Strategic Quantum Ancilla Reuse for Modular Quantum Programs via Cost-Effective Uncomputation

    Compiling high-level quantum programs to machines that are size constrained (i.e. limited number of quantum bits) and time constrained (i.e. limited number of quantum operations) is challenging. In this paper, we present SQUARE (Strategic QUantum Ancilla REuse), a compilation infrastructure that tackles allocation and reclamation of scratch qubits (called ancilla) in modular quantum programs. At its core, SQUARE strategically performs uncomputation to create opportunities for qubit reuse. Current Noisy Intermediate-Scale Quantum (NISQ) computers and forward-looking Fault-Tolerant (FT) quantum computers have fundamentally different constraints such as data locality, instruction parallelism, and communication overhead. Our heuristic-based ancilla-reuse algorithm balances these considerations and fits computations into resource-constrained NISQ or FT quantum machines, throttling parallelism when necessary. To precisely capture the workload of a program, we propose an improved metric, the "active quantum volume," and use this metric to evaluate the effectiveness of our algorithm. Our results show that SQUARE improves the average success rate of NISQ applications by 1.47X. Surprisingly, the additional gates for uncomputation create ancilla with better locality, and result in substantially fewer swap gates and less gate noise overall. SQUARE also achieves an average reduction of 1.5X (and up to 9.6X) in active quantum volume for FT machines.
    Comment: 14 pages, 10 figures

    Automatic scheduling of image processing pipelines
