
    Automatic Storage Optimization for Arrays

    Efficient memory allocation is crucial for data-intensive applications, as a smaller memory footprint ensures better cache performance and allows one to run a larger problem size given a fixed amount of main memory. In this paper, we describe a new automatic storage optimization technique to minimize the dimensionality and storage requirements of arrays used in sequences of loop nests with a predetermined schedule. We formulate the problem of intra-array storage optimization as one of finding the right storage partitioning hyperplanes: each storage partition corresponds to a single storage location. Our heuristic is driven by a dual objective function that minimizes both the dimensionality of the mapping and the extents along those dimensions. The technique is dimension-optimal for most codes encountered in practice. The storage requirements of the mappings obtained are also asymptotically better than those obtained by any existing schedule-dependent technique. Storage reduction factors and other results we report from an implementation of our technique demonstrate its effectiveness on several real-world examples drawn from the domains of image processing, stencil computations, high-performance computing, and the class of tiled codes in general.
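
    A minimal sketch (not the paper's algorithm) of what such a storage mapping looks like: when a temporary value is last read one iteration after it is written, the array can be folded onto two locations with a modulo mapping, i.e. the data space is partitioned so that each partition holds a single storage location. Names and sizes below are illustrative.

        #include <stdio.h>
        #define N 8

        int main(void) {
            /* The original temporary a[0..N-1] would need N locations; here a[i]
             * is mapped to buf[i % 2], since a[i] is last read at iteration i+1. */
            double buf[2];
            double out[N] = {0};

            for (int i = 0; i < N; i++) {
                buf[i % 2] = (double)i * i;                   /* write a[i]        */
                if (i >= 1)
                    out[i] = buf[(i - 1) % 2] + buf[i % 2];   /* read a[i-1], a[i] */
            }
            for (int i = 0; i < N; i++) printf("%g\n", out[i]);
            return 0;
        }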

    Hybrid Iterative and Model-Driven Optimization in the Polyhedral Model

    On modern architectures, a missed optimization can translate into performance degradations reaching orders of magnitude. More than ever, translating Moore's law into actual performance improvements depends on the effectiveness of the compiler. Moreover, missing an optimization and putting the blame on the programmer is not a viable strategy: we must strive for portability of performance, or the majority of the software industry will see no benefit in future many-core processors. As a consequence, an optimizing compiler must also be a parallelizing one; it must take care of the memory hierarchy and of (re)partitioning computation to best suit the target architecture. Polyhedral compilation is a program optimization and parallelization framework capable of expressing extremely complex transformation sequences. The ability to build and traverse a tractable search space of such transformations remains challenging, and existing model-based heuristics can easily be beaten in identifying profitable parallelism/locality trade-offs. We propose a hybrid iterative and model-driven algorithm for automatic tiling, fusion, distribution and parallelization of programs in the polyhedral model. Our experiments demonstrate the effectiveness of this approach, both in obtaining solid performance improvements over existing auto-parallelizing compilers and in achieving portability of performance on various modern multi-core architectures.
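
    To make the search space concrete, here is a hedged illustration of one point in it: a doubly nested loop tiled with a 32x32 tile and the outer tile loops parallelized with OpenMP. The tile size and the decision to tile and parallelize at all are exactly the kinds of choices a hybrid iterative/model-driven search explores; the routine and its parameters are made up for illustration.

        #include <omp.h>
        #define N 1024
        #define T 32

        void scale_add(double A[N][N], const double B[N][N], double alpha) {
            /* Tiled and parallelized form; all iterations are independent. */
            #pragma omp parallel for collapse(2)
            for (int ii = 0; ii < N; ii += T)
                for (int jj = 0; jj < N; jj += T)
                    for (int i = ii; i < ii + T && i < N; i++)
                        for (int j = jj; j < jj + T && j < N; j++)
                            A[i][j] = alpha * A[i][j] + B[i][j];
        }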

    AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

    Stencil computation is one of the most widely used compute patterns in high-performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
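
    A temporal blocking degree of d means d time steps of the stencil are computed per pass over a block, so the intermediate time steps never travel through external memory. The plain-C sketch below fuses two time steps of a 1-D 3-point Jacobi stencil using a small sliding window; it only illustrates the concept and omits the GPU shared-memory and register machinery that AN5D actually generates. Names and boundary handling are illustrative.

        /* Degree-2 temporal fusion for u_{t+1}[i] = (u_t[i-1]+u_t[i]+u_t[i+1])/3.
         * The intermediate time step lives only in tmp[3]; boundary points are
         * left untouched for brevity. */
        static void step2_fused(const double *in, double *out, int n) {
            double tmp[3];
            for (int i = 2; i < n - 2; i++) {
                for (int k = 0; k < 3; k++) {
                    int j = i - 1 + k;
                    tmp[k] = (in[j - 1] + in[j] + in[j + 1]) / 3.0;  /* time step t+1 */
                }
                out[i] = (tmp[0] + tmp[1] + tmp[2]) / 3.0;           /* time step t+2 */
            }
        }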

    Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures

    We present new techniques for compilation of arbitrarily nested loops with affine dependences for distributed-memory parallel architectures. Our framework is implemented as a source-level transformer that uses the polyhedral model, and generates parallel code with communication expressed with the Message Passing Interface (MPI) library. Compared to all previous approaches, ours is a significant advance either (1) with respect to the generality of input code handled, or (2) efficiency of communication code, or both. We provide experimental results on a cluster of multicores demonstrating its effectiveness. In some cases, the code we generate outperforms manually parallelized code, and in another case it is within 25% of it. To the best of our knowledge, this is the first work reporting end-to-end fully automatic distributed-memory parallelization and code generation for input programs and transformation techniques as general as those we allow.
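
    For a 1-D block-distributed stencil, the communication such generated code performs amounts to exchanging boundary ("halo") values with neighbouring ranks. The hand-written sketch below shows that pattern with standard MPI calls; it is an assumption-laden illustration, not output of the described compiler, and LOCAL_N and the function name are made up.

        #include <mpi.h>
        #define LOCAL_N 1000

        /* u[0] and u[LOCAL_N+1] are halo cells; u[1..LOCAL_N] is owned data. */
        void exchange_halos(double u[LOCAL_N + 2], MPI_Comm comm) {
            int rank, size;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);
            int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
            int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

            /* Send first owned element left, receive right neighbour's boundary. */
            MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                         &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                         comm, MPI_STATUS_IGNORE);
            /* Send last owned element right, receive left neighbour's boundary. */
            MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                         &u[0],           1, MPI_DOUBLE, left,  1,
                         comm, MPI_STATUS_IGNORE);
        }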

    Automatic Mapping of Nested Loops to FPGAs

    This paper presents a framework for automatic mapping of perfectly nested loops with constant dependences onto regular processor arrays, suitable for direct implementation on Field Programmable Gate Arrays (FPGAs). The problem is modeled as that of finding a suitable completion procedure for a full-rank linear transformation on the iteration space. The approach enables extraction of the necessary degrees of communication-free and pipelined parallelism to optimize performance under the resource constraints of limited logic resources and I/O bandwidth available on an FPGA. The generation of control signals for the custom processing elements is also addressed. Examples of automatic derivation of parallel designs for some common nested loops are provided. Experimental results on the Cray XD1 show that an FPGA-based matrix-multiplication design obtained using the framework attains significant speedup on the XD1’s attached FPGA, when compared to execution on the XD1 CPU.
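
    The full-rank iteration-space transformations involved can be illustrated at C level (rather than as an FPGA design): skewing a doubly dependent loop nest with the unimodular transform t = i + j, i = i yields a wavefront whose inner loop is dependence-free and could be mapped to parallel or pipelined processing elements. A hedged sketch under these assumptions:

        #define N 512

        /* Original nest: A[i][j] = A[i-1][j] + A[i][j-1], dependences along i and j.
         * After the skew t = i + j, both dependences cross to an earlier t, so for a
         * fixed t all iterations of the inner loop are independent. */
        void wavefront(double A[N][N]) {
            for (int t = 2; t <= 2 * (N - 1); t++) {
                int lo = (t - (N - 1) > 1)     ? t - (N - 1) : 1;
                int hi = (t - 1     < N - 1)   ? t - 1       : N - 1;
                for (int i = lo; i <= hi; i++) {   /* independent iterations */
                    int j = t - i;
                    A[i][j] = A[i - 1][j] + A[i][j - 1];
                }
            }
        }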