8 research outputs found

    Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code

    Full text link
    This paper introduces Tiramisu, a polyhedral framework designed to generate high-performance code for multiple platforms, including multicores, GPUs, and distributed machines. Tiramisu introduces a scheduling language with novel extensions that explicitly manage the complexities arising when targeting these systems. The framework targets image processing, stencils, linear algebra, and deep learning. Tiramisu has two main features: it relies on a flexible representation based on the polyhedral model, and it provides a rich scheduling language allowing fine-grained control of optimizations. Tiramisu uses a four-level intermediate representation that allows full separation between the algorithm, loop transformations, data layout, and communication. This separation simplifies targeting multiple hardware architectures with the same algorithm. We evaluate Tiramisu by writing a set of image processing, deep learning, and linear algebra benchmarks and comparing them with state-of-the-art compilers and hand-tuned libraries. We show that Tiramisu matches or outperforms existing compilers and libraries on different hardware architectures, including multicore CPUs, GPUs, and distributed machines.
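
    As a rough illustration of the separation the abstract describes, the sketch below (plain Python, not Tiramisu's actual C++ API) writes a blur algorithm once and applies a tiling "schedule" as a separate rewrite of the loop structure; the computed values do not change.

        # Minimal sketch of algorithm/schedule separation; hypothetical names,
        # not Tiramisu's API.

        def blur_algorithm(inp, out, h, w):
            # Algorithm: a 3-point horizontal blur over a 2D image,
            # written without committing to any loop schedule.
            for i in range(h):
                for j in range(1, w - 1):
                    out[i][j] = (inp[i][j - 1] + inp[i][j] + inp[i][j + 1]) / 3.0

        def blur_tiled(inp, out, h, w, tile=32):
            # "Schedule": the same computation with both loops tiled.
            # Only the iteration order changes; the results are identical.
            for ii in range(0, h, tile):
                for jj in range(1, w - 1, tile):
                    for i in range(ii, min(ii + tile, h)):
                        for j in range(jj, min(jj + tile, w - 1)):
                            out[i][j] = (inp[i][j - 1] + inp[i][j] + inp[i][j + 1]) / 3.0

        if __name__ == "__main__":
            h, w = 64, 64
            inp = [[float(i * w + j) for j in range(w)] for i in range(h)]
            out_a = [[0.0] * w for _ in range(h)]
            out_b = [[0.0] * w for _ in range(h)]
            blur_algorithm(inp, out_a, h, w)
            blur_tiled(inp, out_b, h, w)
            assert out_a == out_b  # same algorithm, different schedule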

    UPIR: Toward the Design of Unified Parallel Intermediate Representation for Parallel Programming Models

    Full text link
    The complexity of heterogeneous computing architectures, together with the demand for productive and portable parallel application development, has driven parallel programming models to become more comprehensive and complex than before. Making conventional compilation technology and software infrastructure parallelism-aware has become one of the main goals of recent compiler development. In this paper, we propose the design of a unified parallel intermediate representation (UPIR) for multiple parallel programming models, enabling unified compiler transformations across those models. UPIR specifies three commonly used parallelism patterns (SPMD, data, and task parallelism), data attributes, explicit data movement and memory management, and the synchronization operations used in parallel programming. We demonstrate UPIR via a prototype implementation in the ROSE compiler: it unifies the IR for both OpenMP and OpenACC, in both C/C++ and Fortran; it unifies the transformation that lowers both OpenMP and OpenACC code to the LLVM runtime; and it exports UPIR to an LLVM MLIR dialect.
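
    The sketch below is only an illustration in plain Python dataclasses, not the actual UPIR specification: it shows one way a unified IR could represent the three parallelism patterns named in the abstract (SPMD, data, and task parallelism) plus data attributes, so that OpenMP and OpenACC front ends could lower to the same nodes. All class and field names here are hypothetical.

        from dataclasses import dataclass, field
        from typing import List, Optional

        @dataclass
        class DataAttr:
            name: str
            sharing: str              # e.g. "shared", "private", "map_to", "map_from"

        @dataclass
        class ParallelRegion:         # SPMD pattern: a team of workers executes the body
            num_workers: Optional[int]
            data: List[DataAttr] = field(default_factory=list)

        @dataclass
        class Worksharing:            # data parallelism: loop iterations split among workers
            loop_var: str
            lower: str
            upper: str
            schedule: str = "static"

        @dataclass
        class TaskRegion:             # task parallelism: a deferred unit of work
            depends: List[str] = field(default_factory=list)

        # A front end could map both "#pragma omp parallel for" and
        # "#pragma acc parallel loop" to the same ParallelRegion + Worksharing
        # pair, which one shared lowering pass then targets to a runtime.
        region = ParallelRegion(num_workers=None,
                                data=[DataAttr("a", "shared"), DataAttr("i", "private")])
        loop = Worksharing(loop_var="i", lower="0", upper="n")
        print(region, loop)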

    Array size computation under uniform overlapping and irregular accesses

    Get PDF
    The size required to store an array is crucial for an embedded system, as it affects the memory size, the energy per memory access, and the overall system cost. Existing techniques for finding the minimum number of resources required to store an array are less efficient for codes with large loops and irregularly occurring memory accesses: they either approximate the accessed parts of the array, which overestimates the required resources, or their exploration time grows with the number of distinct accessed parts of the array. We propose a methodology for computing the minimum resources required to store an array that keeps exploration time low and provides a near-optimal result for both regularly and irregularly occurring memory accesses and for overlapping writes and reads.
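
    The brute-force sketch below (plain Python, not the paper's methodology) only illustrates the quantity being minimized: an element is live from its write to its last read, and the storage actually needed is the peak number of simultaneously live elements, which can be far smaller than the declared array size when accesses overlap uniformly.

        def min_buffer_size(n_iters, writes, last_reads):
            # writes[i]     : iteration at which element i is produced
            # last_reads[i] : iteration at which element i is read for the last time
            peak = 0
            for t in range(n_iters):
                live = sum(1 for i in range(len(writes)) if writes[i] <= t <= last_reads[i])
                peak = max(peak, live)
            return peak

        # Example: a[i] is written at iteration i and last read at iteration i + 3
        # (a uniform overlapping access window), so only 4 storage locations are
        # ever needed even though the array declares 100 elements.
        n = 100
        writes = list(range(n))
        last_reads = [min(i + 3, n - 1) for i in range(n)]
        print(min_buffer_size(n, writes, last_reads))   # -> 4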

    Some advances in the polyhedral model

    Get PDF
    Department Head: L. Darrell Whitley. 2010 Summer. Includes bibliographical references.
    The polyhedral model is a mathematical formalism and a framework for the analysis and transformation of regular computations. It provides a unified approach to the optimization of computations from different application domains and is now gaining wide use in optimizing compilers and automatic parallelization. In its purest form, it is based on a declarative model in which computations are specified as equations over domains defined by polyhedral sets. This dissertation presents two results. The first is an analysis and optimization technique that simplifies such equations, reducing their asymptotic complexity. The second is an extension of the model to richer domains called Ƶ-Polyhedra. Many equational specifications in the polyhedral model contain reductions: the application of an associative and commutative operator to collections of values to produce a collection of answers. Moreover, expressions in such equations may exhibit reuse, where intermediate values computed or used at different index points are identical. We develop compiler transformations that automatically exploit this reuse and simplify the computational complexity of the specification. In general, there is an infinite set of applicable simplification transformations, and different choices may yield equivalent specifications with different asymptotic complexity. We present an algorithm for the optimal application of simplification transformations, resulting in a final specification with minimum complexity. This dissertation also presents the Ƶ-Polyhedral model, an extension of the polyhedral model to more general sets, providing a transformation framework for a larger class of regular computations. For this, we present a novel representation and interpretation of Ƶ-Polyhedra and prove a number of properties of the family of unions of Ƶ-Polyhedra that are required to extend the polyhedral model. Finally, we present value-based dependence analysis and scheduling analysis for specifications in the Ƶ-Polyhedral model; these are direct extensions of the corresponding analyses for specifications in the polyhedral model. One benefit of our results in the Ƶ-Polyhedral model is that our abstraction allows the reuse of previously developed polyhedral-model tools with straightforward pre- and post-processing.
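
    The toy example below (plain Python, not the dissertation's transformation algorithm) shows the kind of simplification meant: a reduction whose partial sums overlap across index points is rewritten to reuse them, dropping the cost of computing all prefix sums from O(N^2) to O(N).

        def prefix_sums_naive(x):
            # Equation: Y[i] = sum of X[j] for 0 <= j <= i, evaluated directly -> O(N^2).
            return [sum(x[: i + 1]) for i in range(len(x))]

        def prefix_sums_simplified(x):
            # The same equation after exploiting reuse: Y[i] = Y[i-1] + X[i] -> O(N).
            y, acc = [], 0
            for v in x:
                acc += v
                y.append(acc)
            return y

        x = [3, 1, 4, 1, 5, 9, 2, 6]
        assert prefix_sums_naive(x) == prefix_sums_simplified(x)
        print(prefix_sums_simplified(x))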

    Automatic Data and Computation Mapping for Distributed-Memory Machines.

    Get PDF
    Distributed memory parallel computers offer enormous computation power, scalability, and flexibility, but they are difficult to program, and this limits their widespread use. An important characteristic of these machines is the difference in access time between local and non-local memory: non-local memory accesses are much slower than local memory accesses. This is also a characteristic of shared memory machines, though to a lesser degree. Therefore it is essential that, as far as possible, the data a processor needs during the execution of the computation assigned to it reside in its local memory rather than in some other processor's memory. Several research projects have concluded that proper mapping of data is key to realizing the performance potential of distributed memory machines, and current language design efforts such as Fortran D and High Performance Fortran (HPF) are based on this observation. It is our thesis that for many practical codes, it is possible to derive good mappings through a combination of algorithms and systematic procedures. We view mapping as consisting of two phases: alignment followed by distribution. For the alignment phase we present three constraint-based methods: the first is based on a linear programming formulation of the problem; the second formulates alignment as a constrained optimization problem using Lagrange multipliers; the third uses a heuristic to decide which constraints to leave unsatisfied (based on the penalty of increased communication incurred in doing so) in order to find a mapping. For the distribution phase, we have developed two methods that integrate the placement of computation (loop nests in our case) with the mapping of data. For one distributed dimension, our approach finds the combination of data and computation mapping with the lowest communication overhead; this is done by choosing a loop order that allows message vectorization. In the second method, we introduce the distribution preference graph, whose operations allow us to integrate loop restructuring transformations and data mapping. These techniques produce mappings that have been used in efficient hand-coded implementations of several benchmark codes.
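
    The sketch below (plain Python, not the constraint-based or graph-based methods of the thesis) illustrates the cost a mapping induces under the owner-computes rule: with a block distribution of 1000 elements over 4 processors, a shifted reference such as b[i + 1] inside iteration i is remote only at the few block boundaries, while a perfectly aligned reference incurs no communication at all.

        def block_owner(i, n, p):
            # Owner of element i when n elements are block-distributed over p processors.
            block = (n + p - 1) // p
            return i // block

        def remote_accesses(n, p, shift):
            # Count iterations i where iteration i (run on the owner of a[i], by
            # owner-computes) reads b[i + shift] owned by another processor,
            # assuming a and b share the same block distribution.
            return sum(1 for i in range(n - shift)
                       if block_owner(i, n, p) != block_owner(i + shift, n, p))

        n, p = 1000, 4
        print(remote_accesses(n, p, shift=1))   # remote only at block boundaries -> 3
        print(remote_accesses(n, p, shift=0))   # perfectly aligned -> 0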

    Performance-aware component composition for GPU-based systems

    Full text link