
    Automatic Storage Optimization for Arrays

    Efficient memory allocation is crucial for data-intensive applications, as a smaller memory footprint ensures better cache performance and allows one to run a larger problem size given a fixed amount of main memory. In this paper, we describe a new automatic storage optimization technique to minimize the dimensionality and storage requirements of arrays used in sequences of loop nests with a predetermined schedule. We formulate the problem of intra-array storage optimization as one of finding the right storage partitioning hyperplanes: each storage partition corresponds to a single storage location. Our heuristic is driven by a dual objective function that minimizes both the dimensionality of the mapping and the extents along those dimensions. The technique is dimension-optimal for most codes encountered in practice. The storage requirements of the mappings obtained are also asymptotically better than those obtained by any existing schedule-dependent technique. Storage reduction factors and other results we report from an implementation of our technique demonstrate its effectiveness on several real-world examples drawn from the domains of image processing, stencil computations, high-performance computing, and the class of tiled codes in general.
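
    A minimal sketch (not the paper's algorithm) of what such a storage mapping looks like: when a temporary value is last read one iteration after it is written, the array can be folded onto two locations with a modulo mapping, i.e. the data space is partitioned so that each partition holds a single storage location. Names and sizes below are illustrative.

        #include <stdio.h>
        #define N 8

        int main(void) {
            /* The original temporary a[0..N-1] would need N locations; here a[i]
             * is mapped to buf[i % 2], since a[i] is last read at iteration i+1. */
            double buf[2];
            double out[N] = {0};

            for (int i = 0; i < N; i++) {
                buf[i % 2] = (double)i * i;                   /* write a[i]        */
                if (i >= 1)
                    out[i] = buf[(i - 1) % 2] + buf[i % 2];   /* read a[i-1], a[i] */
            }
            for (int i = 0; i < N; i++) printf("%g\n", out[i]);
            return 0;
        }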

    Hybrid Iterative and Model-Driven Optimization in the Polyhedral Model

    On modern architectures, a missed optimization can translate into performance degradations reaching orders of magnitude. More than ever, translating Moore's law into actual performance improvements depends on the effectiveness of the compiler. Moreover, missing an optimization and putting the blame on the programmer is not a viable strategy: we must strive for portability of performance, or the majority of the software industry will see no benefit in future many-core processors. As a consequence, an optimizing compiler must also be a parallelizing one; it must take care of the memory hierarchy and of (re)partitioning computation to best suit the target architecture. Polyhedral compilation is a program optimization and parallelization framework capable of expressing extremely complex transformation sequences. The ability to build and traverse a tractable search space of such transformations remains challenging, and existing model-based heuristics can easily be beaten in identifying profitable parallelism/locality trade-offs. We propose a hybrid iterative and model-driven algorithm for automatic tiling, fusion, distribution and parallelization of programs in the polyhedral model. Our experiments demonstrate the effectiveness of this approach, both in obtaining solid performance improvements over existing auto-parallelizing compilers and in achieving portability of performance on various modern multi-core architectures.
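
    To make the search space concrete, here is a hedged illustration of one point in it: a doubly nested loop tiled with a 32x32 tile and the outer tile loops parallelized with OpenMP. The tile size and the decision to tile and parallelize at all are exactly the kinds of choices a hybrid iterative/model-driven search explores; the routine and its parameters are made up for illustration.

        #include <omp.h>
        #define N 1024
        #define T 32

        void scale_add(double A[N][N], const double B[N][N], double alpha) {
            /* Tiled and parallelized form; all iterations are independent. */
            #pragma omp parallel for collapse(2)
            for (int ii = 0; ii < N; ii += T)
                for (int jj = 0; jj < N; jj += T)
                    for (int i = ii; i < ii + T && i < N; i++)
                        for (int j = jj; j < jj + T && j < N; j++)
                            A[i][j] = alpha * A[i][j] + B[i][j];
        }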

    AN5D: Automated Stencil Framework for High-Degree Temporal Blocking on GPUs

    Stencil computation is one of the most widely used compute patterns in high-performance computing applications. Spatial and temporal blocking have been proposed to overcome the memory-bound nature of this type of computation by moving memory pressure from external memory to on-chip memory on GPUs. However, correctly implementing those optimizations while considering the complexity of the architecture and memory hierarchy of GPUs to achieve high performance is difficult. We propose AN5D, an automated stencil framework which is capable of automatically transforming and optimizing stencil patterns in a given C source code and generating corresponding CUDA code. Parameter tuning in our framework is guided by our performance model. Our novel optimization strategy reduces shared memory and register pressure in comparison to existing implementations, allowing performance scaling up to a temporal blocking degree of 10. We achieve the highest performance reported so far for all evaluated stencil benchmarks on the state-of-the-art Tesla V100 GPU.
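
    A temporal blocking degree of d means d time steps of the stencil are computed per pass over a block, so the intermediate time steps never travel through external memory. The plain-C sketch below fuses two time steps of a 1-D 3-point Jacobi stencil using a small sliding window; it only illustrates the concept and omits the GPU shared-memory and register machinery that AN5D actually generates. Names and boundary handling are illustrative.

        /* Degree-2 temporal fusion for u_{t+1}[i] = (u_t[i-1]+u_t[i]+u_t[i+1])/3.
         * The intermediate time step lives only in tmp[3]; boundary points are
         * left untouched for brevity. */
        static void step2_fused(const double *in, double *out, int n) {
            double tmp[3];
            for (int i = 2; i < n - 2; i++) {
                for (int k = 0; k < 3; k++) {
                    int j = i - 1 + k;
                    tmp[k] = (in[j - 1] + in[j] + in[j + 1]) / 3.0;  /* time step t+1 */
                }
                out[i] = (tmp[0] + tmp[1] + tmp[2]) / 3.0;           /* time step t+2 */
            }
        }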

    Compiling Affine Loop Nests for Distributed-Memory Parallel Architectures

    We present new techniques for compilation of arbitrarily nested loops with affine dependences for distributed-memory parallel architectures. Our framework is implemented as a source-level transformer that uses the polyhedral model, and generates parallel code with communication expressed with the Message Passing Interface (MPI) library. Compared to all previous approaches, ours is a significant advance either (1) with respect to the generality of input code handled, or (2) efficiency of communication code, or both. We provide experimental results on a cluster of multicores demonstrating its effectiveness. In some cases, the code we generate outperforms manually parallelized code, and in another case it is within 25% of it. To the best of our knowledge, this is the first work reporting end-to-end fully automatic distributed-memory parallelization and code generation for input programs and transformation techniques as general as those we allow.
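
    For a 1-D block-distributed stencil, the communication such generated code performs amounts to exchanging boundary ("halo") values with neighbouring ranks. The hand-written sketch below shows that pattern with standard MPI calls; it is an assumption-laden illustration, not output of the described compiler, and LOCAL_N and the function name are made up.

        #include <mpi.h>
        #define LOCAL_N 1000

        /* u[0] and u[LOCAL_N+1] are halo cells; u[1..LOCAL_N] is owned data. */
        void exchange_halos(double u[LOCAL_N + 2], MPI_Comm comm) {
            int rank, size;
            MPI_Comm_rank(comm, &rank);
            MPI_Comm_size(comm, &size);
            int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;
            int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

            /* Send first owned element left, receive right neighbour's boundary. */
            MPI_Sendrecv(&u[1],           1, MPI_DOUBLE, left,  0,
                         &u[LOCAL_N + 1], 1, MPI_DOUBLE, right, 0,
                         comm, MPI_STATUS_IGNORE);
            /* Send last owned element right, receive left neighbour's boundary. */
            MPI_Sendrecv(&u[LOCAL_N],     1, MPI_DOUBLE, right, 1,
                         &u[0],           1, MPI_DOUBLE, left,  1,
                         comm, MPI_STATUS_IGNORE);
        }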

    Automatic Mapping of Nested Loops to FPGAs

    This paper presents a framework for automatic mapping of perfectly nested loops with constant dependences onto regular processor arrays, suitable for direct implementation on Field Programmable Gate Arrays (FPGAs). The problem is modeled as that of finding a suitable completion procedure for a full-rank linear transformation on the iteration space. The approach enables extraction of the necessary degrees of communication-free and pipelined parallelism to optimize performance under the resource constraints of limited logic resources and I/O bandwidth available on an FPGA. The generation of control signals for the custom processing elements is also addressed. Examples of automatic derivation of parallel designs for some common nested loops are provided. Experimental results on the Cray XD1 show that an FPGA-based matrix-multiplication design obtained using the framework attains significant speedup on the XD1’s attached FPGA, when compared to execution on the XD1 CPU.
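
    The full-rank iteration-space transformations involved can be illustrated at C level (rather than as an FPGA design): skewing a doubly dependent loop nest with the unimodular transform t = i + j, i = i yields a wavefront whose inner loop is dependence-free and could be mapped to parallel or pipelined processing elements. A hedged sketch under these assumptions:

        #define N 512

        /* Original nest: A[i][j] = A[i-1][j] + A[i][j-1], dependences along i and j.
         * After the skew t = i + j, both dependences cross to an earlier t, so for a
         * fixed t all iterations of the inner loop are independent. */
        void wavefront(double A[N][N]) {
            for (int t = 2; t <= 2 * (N - 1); t++) {
                int lo = (t - (N - 1) > 1)     ? t - (N - 1) : 1;
                int hi = (t - 1     < N - 1)   ? t - 1       : N - 1;
                for (int i = lo; i <= hi; i++) {   /* independent iterations */
                    int j = t - i;
                    A[i][j] = A[i - 1][j] + A[i][j - 1];
                }
            }
        }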