
    A Tiling Perspective for Register Optimization

    Register allocation is a much-studied problem. A particularly important context for optimizing register allocation is within loops, since a significant fraction of the execution time of programs is often spent inside loop code. A variety of algorithms have been proposed in the past for register allocation, but the complexity of the problem has resulted in a decoupling of several important aspects, including loop unrolling, register promotion, and instruction reordering. In this paper, we develop an approach to register allocation and promotion in a unified optimization framework that simultaneously considers the impact of loop unrolling and instruction scheduling. This is done via a novel instruction tiling approach where instructions within a loop are represented along one dimension and innermost loop iterations along the other. By exploiting the regularity along the loop dimension, and imposing essential dependence-based constraints on intra-tile execution order, the problem of optimizing register pressure is cast in a constraint programming formalism. Experimental results are provided from thousands of innermost loops extracted from the SPEC benchmarks, demonstrating improvements over the current state of the art.
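
    The quantity being optimized here is register pressure across an instruction-by-iteration tile. The sketch below is not the paper's constraint-programming formulation; it only illustrates the underlying measure (MAXLIVE over a straight-line schedule, with made-up live ranges) and why the chosen intra-tile execution order changes it.

        /* Illustrative sketch (not the paper's algorithm): estimate register
         * pressure (MAXLIVE) of a straight-line schedule obtained by unrolling
         * an innermost loop body.  Each value is a [def, last_use] interval over
         * instruction slots; MAXLIVE is the maximum number of intervals that
         * overlap any slot, a lower bound on the registers required. */
        #include <stdio.h>

        typedef struct { int def, last_use; } LiveRange;

        static int maxlive(const LiveRange *lr, int n, int slots) {
            int peak = 0;
            for (int s = 0; s < slots; ++s) {          /* scan each instruction slot */
                int live = 0;
                for (int i = 0; i < n; ++i)
                    if (lr[i].def <= s && s <= lr[i].last_use)
                        ++live;
                if (live > peak) peak = live;
            }
            return peak;
        }

        int main(void) {
            /* Hypothetical loop body unrolled twice (6 instruction slots).
             * Interleaving the two copies changes overlap, hence pressure. */
            LiveRange sequential[]  = { {0,1}, {1,2}, {3,4}, {4,5} };
            LiveRange interleaved[] = { {0,3}, {1,4}, {2,5}, {3,5} };
            printf("sequential schedule:  MAXLIVE = %d\n", maxlive(sequential, 4, 6));
            printf("interleaved schedule: MAXLIVE = %d\n", maxlive(interleaved, 4, 6));
            return 0;
        }

    Running it reports MAXLIVE of 2 for the sequential order and 4 for the interleaved one, which is the kind of trade-off the tiling formulation explores jointly with unrolling.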

    Memory optimization techniques for embedded systems

    Embedded systems have become ubiquitous, and as a result the optimization of the design and performance of programs that run on these systems remains a significant challenge for the computer systems research community. This dissertation addresses several key problems in the optimization of programs for embedded systems that use digital signal processors as the core processor. Chapter 2 develops an efficient and effective algorithm to construct a worm partition graph by finding a longest worm at each step while maintaining the legality of scheduling. Proper assignment of offsets to variables in embedded DSPs plays a key role in determining the execution time and the amount of program memory needed. Chapter 3 proposes a new approach that introduces a weight adjustment function and shows experimentally that its results are at least as good as, and often slightly better than, those of previous work. Our solutions address several problems, such as handling fragmented paths resulting from graph-based solutions, dealing with modify registers, and the effective utilization of multiple address registers. In addition to offset assignment, address register allocation is important for embedded DSPs. Chapter 4 develops a lower bound and an algorithm that can eliminate the explicit use of address register instructions in loops with array references. Scheduling of computations and the associated memory requirement are closely inter-related for loop computations. In Chapter 5, we develop a general framework for studying the trade-off between scheduling and storage requirements in nested loops that access multi-dimensional arrays. Tiling has long been used to improve the memory performance of loops. Only a sufficient condition for the legality of tiling was known previously. While it was conjectured that the sufficient condition would also become necessary for large enough tiles, there had been no precise characterization of what is large enough. Chapter 6 develops a new framework for characterizing tiling by viewing tiles as points on a lattice. This also leads to conditions under which the legality condition for tiling is both necessary and sufficient.
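
    To make the offset-assignment problem concrete, the sketch below uses the usual simplified cost model for a DSP address-generation unit: moving between consecutively accessed variables is free when their offsets differ by at most one (covered by auto-increment/decrement), and otherwise an explicit address-register update is charged. The variable names, access sequence, and layouts are hypothetical, and this is the cost model only, not the dissertation's algorithms.

        /* Illustrative cost model behind (simple) offset assignment on a DSP
         * address-generation unit: a transition between consecutively accessed
         * variables is free when their memory offsets differ by at most 1;
         * otherwise an explicit address-register load is charged. */
        #include <stdio.h>
        #include <stdlib.h>

        static int layout_cost(const int *offset_of, const int *access_seq, int len) {
            int cost = 0;
            for (int i = 1; i < len; ++i) {
                int delta = offset_of[access_seq[i]] - offset_of[access_seq[i - 1]];
                if (abs(delta) > 1)        /* not reachable by post-inc/dec */
                    ++cost;                /* explicit address-register update */
            }
            return cost;
        }

        int main(void) {
            /* Access sequence over variables a=0, b=1, c=2, d=3. */
            int seq[] = { 0, 2, 0, 2, 1, 3 };
            int alphabetical[] = { 0, 1, 2, 3 };   /* memory order a, b, c, d */
            int tuned[]        = { 0, 2, 1, 3 };   /* memory order a, c, b, d */
            printf("alphabetical layout cost: %d\n", layout_cost(alphabetical, seq, 6));
            printf("tuned layout cost:        %d\n", layout_cost(tuned, seq, 6));
            return 0;
        }

    Here the alphabetical layout costs 4 extra address-register updates while the tuned layout costs 0, which is why the choice of offsets matters for both execution time and code size.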

    Multicore-optimized wavefront diamond blocking for optimizing stencil updates

    The importance of stencil-based algorithms in computational science has focused attention on optimized parallel implementations for multilevel cache-based processors. Temporal blocking schemes leverage the large bandwidth and low latency of caches to accelerate stencil updates and approach theoretical peak performance. A key ingredient is the reduction of data traffic across slow data paths, especially the main memory interface. In this work we combine the ideas of multi-core wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches. The resulting schemes show performance advantages in bandwidth-starved situations, which are exacerbated by the high number of bytes per lattice update required by variable coefficients. Our thread-groups concept provides a controllable trade-off between concurrency and memory usage, shifting the pressure between the memory interface and the CPU. We present performance results on a contemporary Intel processor.
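
    As a point of reference for the "bytes per lattice update" argument, the sketch below is a naive variable-coefficient 1D 3-point stencil sweep, not the paper's wavefront-diamond scheme; the array names are made up. Each site update must stream three coefficient values in addition to the field load and store, which is exactly the extra memory traffic that temporal blocking tries to serve from cache across several time steps.

        /* Naive variable-coefficient 1D 3-point stencil, one time step.
         * With double precision, every update touches cl[i], cc[i], cr[i],
         * u[i-1..i+1], and unew[i], so its memory traffic per lattice site is
         * much higher than for a constant-coefficient stencil. */
        #include <stddef.h>

        void sweep(double *restrict unew, const double *restrict u,
                   const double *restrict cl, const double *restrict cc,
                   const double *restrict cr, size_t n)
        {
            for (size_t i = 1; i + 1 < n; ++i)
                unew[i] = cl[i] * u[i - 1] + cc[i] * u[i] + cr[i] * u[i + 1];
        }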

    Good approximate quantum LDPC codes from spacetime circuit Hamiltonians

    We study approximate quantum low-density parity-check (QLDPC) codes, which are approximate quantum error-correcting codes specified as the ground space of a frustration-free local Hamiltonian, whose terms do not necessarily commute. Such codes generalize stabilizer QLDPC codes, which are exact quantum error-correcting codes with sparse, low-weight stabilizer generators (i.e. each stabilizer generator acts on a few qubits, and each qubit participates in a few stabilizer generators). Our investigation is motivated by an important question in Hamiltonian complexity and quantum coding theory: do stabilizer QLDPC codes with constant rate, linear distance, and constant-weight stabilizers exist? We show that obtaining such optimal scaling of parameters (modulo polylogarithmic corrections) is possible if we go beyond stabilizer codes: we prove the existence of a family of $[[N,k,d,\varepsilon]]$ approximate QLDPC codes that encode $k = \widetilde{\Omega}(N)$ logical qubits into $N$ physical qubits with distance $d = \widetilde{\Omega}(N)$ and approximation infidelity $\varepsilon = \mathcal{O}(1/\mathrm{polylog}(N))$. The code space is stabilized by a set of 10-local noncommuting projectors, with each physical qubit participating in only $\mathcal{O}(\mathrm{polylog}\,N)$ projectors. We prove the existence of an efficient encoding map, and we show that arbitrary Pauli errors can be locally detected by circuits of polylogarithmic depth. Finally, we show that the spectral gap of the code Hamiltonian is $\widetilde{\Omega}(N^{-3.09})$ by analyzing a spacetime circuit-to-Hamiltonian construction for a bitonic sorting network architecture that is spatially local in $\mathrm{polylog}(N)$ dimensions. Comment: 51 pages, 13 figures
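
    For context, one common way to formalize the $\varepsilon$ in an $[[N,k,d,\varepsilon]]$ approximate code (a standard textbook-style definition, not quoted from this paper) is via a recovery map:

        % One standard formalization: for a code subspace C with distance d and
        % approximation infidelity eps, every noise channel E supported on fewer
        % than d qubits admits a recovery channel R such that
        \[
          \min_{|\psi\rangle \in \mathcal{C}}
            F\bigl(\,|\psi\rangle\langle\psi|,\;
                   (\mathcal{R}\circ\mathcal{E})(|\psi\rangle\langle\psi|)\bigr)
            \;\ge\; 1 - \varepsilon ,
        \]
        % so exact quantum error-correcting codes are the special case eps = 0.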

    Fast Matrix Multiplication via Compiler-only Layered Data Reorganization and Intrinsic Lowering

    The resurgence of machine learning has increased the demand for high-performance basic linear algebra subroutines (BLAS), which have long depended on libraries to achieve peak performance on commodity hardware. High-performance BLAS implementations rely on a layered approach that consists of tiling and packing layers, which (re)organize the data, and micro kernels that perform the actual computations. Creating high-performance micro kernels requires significant development effort to write tailored assembly code for each architecture. This hand-optimization task is complicated by the recent introduction of matrix engines (IBM's POWER10 MMA, Intel AMX, and Arm ME) that deliver high-performance matrix operations. This paper presents a compiler-only alternative to the use of high-performance libraries by incorporating, to the best of our knowledge for the first time, the automatic generation of the layered approach into LLVM, a production compiler. The modular design of the algorithm, such as the use of LLVM's matrix-multiply intrinsic as a clear interface between the tiling and packing layers and the micro kernel, makes it easy to retarget the code generation to multiple accelerators. The use of intrinsics enables a comprehensive performance study. On processors without hardware matrix engines, the tiling and packing layers deliver performance up to 22x faster (Intel) for small matrices, and more than 6x faster (POWER9) for large matrices, than PLuTo, a widely used polyhedral optimizer. The performance also approaches that of high-performance libraries: it is only 34% slower than OpenBLAS and on par with Eigen for large matrices. With MMA on POWER10 this solution is, for large matrices, over 2.6x faster than the vector-extension solution, matches Eigen's performance, and achieves up to 96% of BLAS peak performance.
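
    Independently of the LLVM implementation the abstract describes, the layered structure it refers to (tiling loops, a packing step, and a micro kernel) can be sketched in plain C. Tile sizes, names, the row-major layout, and the scalar micro kernel below are illustrative choices for the sketch, not the paper's intrinsic-based generated code.

        /* Layered GEMM sketch: outer tiling loops, packing of a B tile into a
         * contiguous buffer, and a tiny scalar "micro kernel".  Computes
         * C (m x n) += A (m x k) * B (k x n), all row-major; caller zeroes C. */
        #include <string.h>

        #define MC 64
        #define NC 64
        #define KC 64

        /* Micro kernel: C[mc x nc] += A_tile[mc x kc] * B_packed[kc x nc]. */
        static void micro_kernel(int mc, int nc, int kc,
                                 const double *A, int lda,
                                 const double *Bp,
                                 double *C, int ldc)
        {
            for (int i = 0; i < mc; ++i)
                for (int p = 0; p < kc; ++p) {
                    double a = A[i * lda + p];
                    for (int j = 0; j < nc; ++j)
                        C[i * ldc + j] += a * Bp[p * nc + j];
                }
        }

        /* Packing layer: copy a kc x nc tile of B (leading dimension ldb) into
         * a contiguous buffer so the micro kernel streams it with unit stride. */
        static void pack_B(int kc, int nc, const double *B, int ldb, double *Bp)
        {
            for (int p = 0; p < kc; ++p)
                memcpy(&Bp[p * nc], &B[p * ldb], (size_t)nc * sizeof(double));
        }

        void gemm_tiled(int m, int n, int k,
                        const double *A, const double *B, double *C)
        {
            static double Bp[KC * NC];          /* packing buffer (sketch only,
                                                   not thread-safe)            */
            for (int jc = 0; jc < n; jc += NC)
                for (int pc = 0; pc < k; pc += KC) {
                    int nc = (n - jc < NC) ? n - jc : NC;
                    int kc = (k - pc < KC) ? k - pc : KC;
                    pack_B(kc, nc, &B[pc * n + jc], n, Bp);
                    for (int ic = 0; ic < m; ic += MC) {
                        int mc = (m - ic < MC) ? m - ic : MC;
                        micro_kernel(mc, nc, kc, &A[ic * k + pc], k,
                                     Bp, &C[ic * n + jc], n);
                    }
                }
        }

    In library-quality and compiler-generated code the micro kernel is where architecture-specific assembly or matrix-engine intrinsics would replace the scalar loop nest, while the tiling and packing layers stay largely portable.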