46 research outputs found

    Compiling Imperfectly-nested Sparse Matrix Codes with Dependences

    Full text link
    We present compiler technology for generating sparse matrix code from (i) dense matrix code and (ii) a description of the indexing structure of the sparse matrices. This technology embeds statement instances into a Cartesian product of statement iteration and data spaces, and produces efficient sparse code by identifying common enumerations for multiple references to sparse matrices. This approach works for imperfectly-nested codes with dependences, and produces sparse code competitive with hand-written library code

    Parameterized and multi-level tiled loop generation

    Get PDF
    Department Head: L. Darrell Whitley.2010 Summer.Includes bibliographical references.Tiling is a loop transformation that decomposes computations into a set of smaller computation blocks. The transformation has been proven to be useful for many high-level program optimizations, such as data locality optimization and exploiting coarse-grained parallelism, and crucial for architecture with limited resources, such as embedded systems, GPUs, and the Cell architecture. Data locality and parallelism will continue to serve as major vehicles for achieving high performance on modern architecture in multi-core era. In parameterized tiling the size of blocks is not fixed at compile time but remains a symbolic constant so that it can be selected/changed even at runtime. Parameterized tiled loops facilitate iterative and runtime optimizations, such as iterative compilation, auto-tuning and dynamic program adaption. In this dissertation we present a collection of techniques for generating parameterized and multi-level tiled loops from affine control loops and their parallelization. The tiled loop generation problem even for perfectly nested loops has been believed to have an exponential time complexity due to the heavy machinery like Fourier-Motzkin elimination. Disproving this decade-long belief, we provide a simple technique for generating tiled loop nests even from imperfectly nested loops. Our technique for perfectly nested loops consists of only syntactic processing that is applied only once and independently to each loop bound. Our approach to imperfectly nested loops is composed of a direct extension of the tiled code generation technique for perfectly nested loops and three simple optimizations on the resulting parameterized tiled loops. The generation as well as the optimizations are achieved only with purely syntactic processing, hence loop generation time remains negligible. We also present three schemes for multi-level tiling where tiling is applied more than once. All the schemes are scalable with respect to the number of tiling levels and can be combined to achieve better performance. To facilitate parallelization of parameterized tiled loops, we generate outermost tile-loops that are perfectly nested. We also provide a technique for statically restructuring parameterized tiled loops to the wavefront scheduling on shared memory system. Because the formulation of parameterized tiling does not fit into the well established polyhedral framework, such static restructuring has been a great challenge. However, we achieve this limited restructuring through a syntactic processing without any sophisticated machinery

    Parametric Multi-Level Tiling of Imperfectly Nested Loops

    Get PDF
    International audienceTiling is a crucial loop transformation for generating high perfor- mance code on modern architectures. Efficient generation of multi- level tiled code is essential for maximizing data reuse in systems with deep memory hierarchies. Tiled loops with parametric tile sizes (not compile-time constants) facilitate runtime feedback and dynamic optimizations used in iterative compilation and automatic tuning. Previous parametric multi-level tiling approaches have been restricted to perfectly nested loops, where all assignment state- ments are contained inside the innermost loop of a loop nest. Pre- vious solutions to tiling for imperfect loop nests have only handled fixed tile sizes. In this paper, we present an approach to paramet- ric multi-level tiling of imperfectly nested loops. The tiling tech- nique generates loops that iterate over full rectangular tiles, making them amenable to compiler optimizations such as register tiling. Experimental results using a number of computational benchmarks demonstrate the effectiveness of the developed tiling approach

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS

    Cache partitioning + loop tiling: A methodology for effective shared cache management

    Get PDF
    In this paper, we present a new methodology that provides i) a theoretical analysis of the two most commonly used approaches for effective shared cache management (i.e., cache partitioning and loop tiling) and ii) a unified framework to fine tuning those two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by one order of magnitude keeping at the same time the number of arithmetical/addressing instructions in a minimal level. We also present a search space exploration analysis where our proposal is able to offer a vast deduction in the required search space

    A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details

    Get PDF
    Today’s compilers have a plethora of optimizations-transformations to choose from, and the correct choice, order as well parameters of transformations have a significant/large impact on performance; choosing the correct order and parameters of optimizations has been a long standing problem in compilation research, which until now remains unsolved; the separate sub-problems optimization gives a different schedule/binary for each sub-problem and these schedules cannot coexist, as by refining one degrades the other. Researchers try to solve this problem by using iterative compilation techniques but the search space is so big that it cannot be searched even by using modern supercomputers. Moreover, compiler transformations do not take into account the hardware architecture details and data reuse in an efficient way. In this paper, a new iterative compilation methodology is presented which reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude and thus an efficient solution is now capable to be found. The transformations are the following: loop tiling (including the number of the levels of tiling), loop unroll, register allocation, scalar replacement, loop interchange and data array layouts. The search space is reduced (a) by addressing the aforementioned transformations together as one problem and not separately, (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated over iterative compilation and gcc/icc compilers, on both embedded and general purpose processors; it achieves significant performance gains at many orders of magnitude lower compilation time

    Combining software cache partitioning and loop tiling for effective shared cache management

    Get PDF
    One of the biggest challenges in multicore platforms is shared cache management, especially for data dominant applications. Two commonly used approaches for increasing shared cache utilization are cache partitioning and loop tiling. However, state-of-the-art compilers lack of efficient cache partitioning and loop tiling methods for two reasons. First, cache partitioning and loop tiling are strongly coupled together, thus addressing them separately is simply not effective. Second, cache partitioning and loop tiling must be tailored to the target shared cache architecture details and the memory characteristics of the co-running workloads. To the best of our knowledge, this is the first time that a methodology provides i) a theoretical foundation in the above mentioned cache management mechanisms and ii) a unified framework to orchestrate these two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by an order of magnitude keeping at the same time the number of arithmetic/addressing instructions in a minimal level. We motivate this work by showcasing that cache partitioning, loop tiling, data array layouts, shared cache architecture details (i.e., cache size and associativity) and the memory reuse patterns of the executing tasks must be addressed together as one problem, when a (near)- optimal solution is requested. To this end, we present a search space exploration analysis where our proposal is able to offer a vast deduction in the required search space

    IEGen: semi-automatic generation of inspectors and executors

    Get PDF
    Department Head: L. Darrell Whitley.Includes bibliographical references (pages 74-76).Software that simulates real-world phenomena such as heat transfer over surfaces and molecular interaction is often based on irregular computational kernels. Indirect array accesses such as A[B[i]] that are found in irregular computations often exhibit a memory access pattern that does not make efficient use of the memory hierarchy, reducing performance. Additionally, indirect array accesses hinder our ability to apply loop optimizations to improve data locality or introduce parallelism at compile time. One approach to solving this problem, an inspector/executor strategy, inspects the index arrays (B) at runtime to determine the order of accesses to the data arrays (A), reorders the data and index arrays, and executes a transformed computation that accesses the data arrays in a more efficient manner. For the most part, the application of inspector/executor strategies to irregular computations has been done manually or with limited generality. This thesis presents the Inspector/Executor Generator (IEGen). This tool accepts an irregular computation specification and sequence of run-time reordering transformations to apply to that computation as input. IEGen then generates serial inspector and executor code that implements the transformed computation. We contribute an inspector intermediate representation (IR) called an Inspector Dependence Graph (IDG), a method for code generation of inspectors based on an IDG, a method for code generation of executors, and techniques for manipulating affine constraints with uninterpreted function symbol (UFS) expressions to enable code generation. We evaluate our techniques against an existing library that supports limited UFS expressions and additionally show that generalized generation of inspectors and executors that implement composed run-time reordering transformations is possible
    corecore