Decoupling algorithms from schedules for easy optimization of image processing pipelines
Using existing programming tools, writing high-performance image processing code requires sacrificing readability, portability, and modularity. We argue that this is a consequence of conflating what computations define the algorithm, with decisions about storage and the order of computation. We refer to these latter two concerns as the schedule, including choices of tiling, fusion, recomputation vs. storage, vectorization, and parallelism.
We propose a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high performance without sacrificing code clarity. This decoupling simplifies the algorithm specification: images and intermediate buffers become functions over an infinite integer domain, with no explicit storage or boundary conditions. Imaging pipelines are compositions of functions. Programmers separately specify scheduling strategies for the various functions composing the algorithm, which allows them to efficiently explore different optimizations without changing the algorithmic code.
We demonstrate the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide, and compiling them for ARM, x86, and GPUs. Our compiler targets SIMD units, multiple cores, and complex memory hierarchies. We demonstrate that it can handle algorithms such as a camera raw pipeline, the bilateral grid, fast local Laplacian filtering, and image segmentation. The algorithms expressed in our language are both shorter and faster than state-of-the-art implementations.
Funding: National Science Foundation (U.S.) (Grant 0964004); National Science Foundation (U.S.) (Grant 0964218); National Science Foundation (U.S.) (Grant 0832997); United States. Dept. of Energy (Award DE-SC0005288); Cognex Corporation; Adobe System
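The separation of algorithm from schedule described in this abstract can be illustrated with a small sketch. This is plain Python, not Halide and not the authors' compiler; the names `input_image`, `blur_x`, `blur_y`, `realize_inline`, and `realize_stored` are hypothetical and chosen only for illustration. The algorithm is a composition of pure functions over integer coordinates, and two different "schedules" (recompute the intermediate at every use, or compute it once and store it) produce the same output.

```python
from typing import Callable, List

Func = Callable[[int, int], float]


def input_image(x: int, y: int) -> float:
    # A synthetic image defined for every integer (x, y): no bounds, no storage.
    return float((x * 7 + y * 13) % 32)


def blur_x(x: int, y: int) -> float:
    # Algorithm, stage 1: horizontal 3-tap box blur.
    return (input_image(x - 1, y) + input_image(x, y) + input_image(x + 1, y)) / 3.0


def blur_y(x: int, y: int, producer: Func = blur_x) -> float:
    # Algorithm, stage 2: vertical 3-tap box blur of whatever produces stage 1.
    return (producer(x, y - 1) + producer(x, y) + producer(x, y + 1)) / 3.0


def realize_inline(width: int, height: int) -> List[List[float]]:
    # Schedule A: recompute blur_x at each use (maximal fusion, no intermediate buffer).
    return [[blur_y(x, y) for x in range(width)] for y in range(height)]


def realize_stored(width: int, height: int) -> List[List[float]]:
    # Schedule B: compute blur_x once over the region blur_y needs, then read from storage.
    stored = {
        (x, y): blur_x(x, y)
        for y in range(-1, height + 1)
        for x in range(width)
    }

    def lookup(x: int, y: int) -> float:
        return stored[(x, y)]

    return [[blur_y(x, y, lookup) for x in range(width)] for y in range(height)]


if __name__ == "__main__":
    # Both schedules realize the same algorithm, so the outputs agree.
    print(realize_inline(8, 6) == realize_stored(8, 6))
```

Schedule A trades redundant arithmetic for zero intermediate storage, while Schedule B does the opposite; neither choice required touching `blur_x` or `blur_y`.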
Parallelization of shallow water simulations on current multi-threaded systems
Lobeiras, J., Viñas, M., Amor, M., Fraguela, B.B., Arenaz, M., García, J., Castro, M. Parallelization of shallow water simulations on current multi-threaded systems. The International Journal of High Performance Computing Applications 27, 493–512. © 2013 The Author(s), © SAGE Publications. https://doi.org/10.1177/1094342012464800
Abstract: In this work, several parallel implementations of a numerical model of pollutant transport on a shallow water system are presented. These parallel implementations are developed in two phases. First, the sequential code is rewritten to exploit the stream programming model. And second, the streamed code is targeted for current multi-threaded systems, in particular, multi-core CPUs and modern GPUs. The performance is evaluated on a multi-core CPU using OpenMP, and on a GPU using the streaming-oriented programming language Brook+, as well as the standard language for heterogeneous systems, OpenCL.
Funding: This work was supported by the Galician Government (Consolidation of Competitive Research Groups, Xunta de Galicia ref. 2010/6, projects INCITE08PXIB105161PR and 08TIC001206PR), the Ministry of Science and Innovation, cofunded by the FEDER funds of the European Union (grant number TIN2010-16735, and project numbers MTM2009-11923 and MTM2010-21135). Xunta de Galicia; INCITE08PXIB105161PR. Xunta de Galicia; 08TIC001206P
SQUARE: Strategic Quantum Ancilla Reuse for Modular Quantum Programs via Cost-Effective Uncomputation
Compiling high-level quantum programs to machines that are size constrained
(i.e. limited number of quantum bits) and time constrained (i.e. limited number
of quantum operations) is challenging. In this paper, we present SQUARE
(Strategic QUantum Ancilla REuse), a compilation infrastructure that tackles
allocation and reclamation of scratch qubits (called ancilla) in modular
quantum programs. At its core, SQUARE strategically performs uncomputation to
create opportunities for qubit reuse.
Current Noisy Intermediate-Scale Quantum (NISQ) computers and forward-looking
Fault-Tolerant (FT) quantum computers have fundamentally different constraints
such as data locality, instruction parallelism, and communication overhead. Our
heuristic-based ancilla-reuse algorithm balances these considerations and fits
computations into resource-constrained NISQ or FT quantum machines, throttling
parallelism when necessary. To precisely capture the workload of a program, we
propose an improved metric, the "active quantum volume," and use this metric to
evaluate the effectiveness of our algorithm. Our results show that SQUARE
improves the average success rate of NISQ applications by 1.47X. Surprisingly,
the additional gates for uncomputation create ancilla with better locality, and
result in substantially fewer swap gates and less gate noise overall. SQUARE
also achieves an average reduction of 1.5X (and up to 9.6X) in active quantum
volume for FT machines. Comment: 14 pages, 10 figure
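The uncomputation idea mentioned in this abstract can be sketched with a classical simulation of reversible gates. This is only the generic compute, use, uncompute pattern on bits, not SQUARE's heuristic or its cost model; the function names and register layout below are assumptions made for illustration. A scratch bit (the ancilla) holds an intermediate AND, is consumed, and is then returned to 0 by repeating the gate that computed it, which makes it available for reuse at the cost of one extra gate.

```python
from typing import List


def toffoli(bits: List[int], c1: int, c2: int, target: int) -> None:
    # Reversible AND: flip the target when both controls are 1.
    bits[target] ^= bits[c1] & bits[c2]


def and3_with_uncompute(a: int, b: int, c: int) -> List[int]:
    # Assumed register layout: [a, b, c, ancilla, out]; ancilla and out start at 0.
    bits = [a, b, c, 0, 0]
    toffoli(bits, 0, 1, 3)  # compute: ancilla = a AND b
    toffoli(bits, 3, 2, 4)  # use: out = ancilla AND c
    toffoli(bits, 0, 1, 3)  # uncompute: same gate again resets ancilla to 0
    return bits


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            for c in (0, 1):
                print((a, b, c), and3_with_uncompute(a, b, c))
```

Whether the extra gate is worth paying to reclaim the ancilla is the kind of trade-off the abstract says SQUARE decides strategically.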
Recent Advances in Graph Partitioning
We survey recent trends in practical algorithms for balanced graph
partitioning together with applications and future research directions
- …