1,544 research outputs found

    Parameterized and multi-level tiled loop generation

    Department Head: L. Darrell Whitley. 2010 Summer. Includes bibliographical references. Tiling is a loop transformation that decomposes computations into a set of smaller computation blocks. The transformation has proven useful for many high-level program optimizations, such as data locality optimization and exploiting coarse-grained parallelism, and is crucial for architectures with limited resources, such as embedded systems, GPUs, and the Cell architecture. Data locality and parallelism will continue to serve as major vehicles for achieving high performance on modern architectures in the multi-core era. In parameterized tiling, the size of the blocks is not fixed at compile time but remains a symbolic constant, so that it can be selected or changed even at runtime. Parameterized tiled loops facilitate iterative and runtime optimizations, such as iterative compilation, auto-tuning, and dynamic program adaptation. In this dissertation we present a collection of techniques for generating parameterized and multi-level tiled loops from affine control loops, and for parallelizing them. Even for perfectly nested loops, the tiled loop generation problem has been believed to have exponential time complexity because of heavy machinery such as Fourier-Motzkin elimination. Disproving this decade-long belief, we provide a simple technique for generating tiled loop nests even from imperfectly nested loops. Our technique for perfectly nested loops consists of only syntactic processing that is applied only once and independently to each loop bound. Our approach to imperfectly nested loops is composed of a direct extension of the tiled code generation technique for perfectly nested loops and three simple optimizations on the resulting parameterized tiled loops. The generation as well as the optimizations are achieved with purely syntactic processing, hence loop generation time remains negligible.
We also present three schemes for multi-level tiling, where tiling is applied more than once. All the schemes are scalable with respect to the number of tiling levels and can be combined to achieve better performance. To facilitate parallelization of parameterized tiled loops, we generate outermost tile-loops that are perfectly nested. We also provide a technique for statically restructuring parameterized tiled loops to wavefront scheduling on shared-memory systems. Because the formulation of parameterized tiling does not fit into the well-established polyhedral framework, such static restructuring has been a great challenge. However, we achieve this limited restructuring through syntactic processing without any sophisticated machinery.
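To make the idea of a tile size that stays symbolic until runtime concrete, the following is a minimal sketch, not the dissertation's generation technique. It tiles a hand-written triangular loop nest (an affine bound `j <= i`) and checks that every choice of tile size visits the same iteration points. The function names and the parameters `si` and `sj` are hypothetical and chosen only for illustration.

```python
def original_points(n):
    """Triangular nest: for i in 0..n-1, for j in 0..i."""
    return [(i, j) for i in range(n) for j in range(i + 1)]


def tiled_points(n, si, sj):
    """Same iteration space, visited tile by tile.

    si and sj (hypothetical names) are ordinary run-time arguments, so the
    tile sizes can change between calls without rewriting the loops.
    """
    visited = []
    for it in range(0, n, si):                    # tile loop over i
        # Tile loop over j; tiles lying entirely above the diagonal are skipped.
        for jt in range(0, min(it + si, n), sj):
            for i in range(it, min(it + si, n)):          # point loop over i
                for j in range(jt, min(jt + sj, i + 1)):  # point loop over j
                    visited.append((i, j))
    return visited


if __name__ == "__main__":
    n = 7
    for si, sj in [(1, 1), (3, 2), (4, 5), (10, 10)]:
        # Tiling only reorders the points; it must not add or drop any.
        assert sorted(tiled_points(n, si, sj)) == original_points(n)
    print("all tile sizes cover the same iteration space")
```

Because the tile sizes appear only inside `min` expressions in the loop bounds, an auto-tuner could call `tiled_points` repeatedly with different sizes, which is the kind of use the abstract attributes to parameterized tiled loops.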

    Near-optimal loop tiling by means of cache miss equations and genetic algorithms

    The effectiveness of the memory hierarchy is critical for the performance of current processors. The performance of the memory hierarchy can be improved by means of program transformations such as loop tiling, a code transformation aimed at reducing capacity misses. This paper presents a novel systematic approach to near-optimal loop tiling based on an accurate data locality analysis (cache miss equations) and a powerful technique for searching the solution space that is based on a genetic algorithm. The results show that this approach can remove practically all capacity misses for all considered benchmarks. The reduction of replacement misses results in a decrease of the miss ratio that can be as significant as a factor of 7 for the matrix multiply kernel. Peer reviewed. Postprint (published version).
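The effect of tiling on capacity misses can be illustrated with a small simulation. The sketch below does not implement cache miss equations or a genetic algorithm. It assumes an idealized fully associative LRU cache that holds single matrix elements, counts misses by replaying the access trace of a matrix multiply, and uses an exhaustive search over a few candidate tile sizes as a stand-in for the paper's search procedure. All function names are hypothetical.

```python
from collections import OrderedDict


def lru_misses(trace, capacity):
    """Count misses of a fully associative LRU cache holding `capacity` elements."""
    cache = OrderedDict()
    misses = 0
    for addr in trace:
        if addr in cache:
            cache.move_to_end(addr)        # mark as most recently used
        else:
            misses += 1
            cache[addr] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used element
    return misses


def matmul_trace(n, tile):
    """Yield the element accesses of C += A * B using square tiles of size `tile`.

    With tile == n this degenerates to the untiled i-j-k loop order.
    """
    for it in range(0, n, tile):
        for jt in range(0, n, tile):
            for kt in range(0, n, tile):
                for i in range(it, min(it + tile, n)):
                    for j in range(jt, min(jt + tile, n)):
                        for k in range(kt, min(kt + tile, n)):
                            yield ("A", i, k)
                            yield ("B", k, j)
                            yield ("C", i, j)


def best_tile(n, capacity, candidates):
    """Exhaustive stand-in for a search procedure: pick the tile size with fewest misses."""
    return min(candidates, key=lambda t: lru_misses(matmul_trace(n, t), capacity))


if __name__ == "__main__":
    n, capacity = 8, 32
    for tile in (8, 4, 2):
        print(tile, lru_misses(matmul_trace(n, tile), capacity))
    print("best tile:", best_tile(n, capacity, [2, 4, 8]))
```

In this toy setting the untiled order misses on every access to `B` after the first pass, because all 64 elements of `B` are touched between two uses of the same element and the cache holds only 32. A real cost model would also have to account for cache lines, set associativity, and conflict misses, which this sketch ignores.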

    Hydra: An Accelerator for Real-Time Edge-Aware Permeability Filtering in 65nm CMOS

    Many modern video processing pipelines rely on edge-aware (EA) filtering methods. However, recent high-quality methods are challenging to run in real time on embedded hardware due to their computational load. To this end, we propose an area-efficient and real-time capable hardware implementation of a high-quality EA method. In particular, we focus on the recently proposed permeability filter (PF), which delivers promising quality and performance in the domains of HDR tone mapping, disparity estimation, and optical flow estimation. We present an efficient hardware accelerator that implements a tiled variant of the PF with low on-chip memory requirements and a significantly reduced external memory bandwidth (6.4x w.r.t. the non-tiled PF). The design has been taped out in 65 nm CMOS technology, is able to filter 720p grayscale video at 24.8 Hz, and achieves a high compute density of 6.7 GFLOPS/mm² (12x higher than embedded GPUs when scaled to the same technology node). The low area and bandwidth requirements make the accelerator highly suitable for integration into SoCs, where the silicon area budget is constrained and external memory is typically a heavily contended resource.

    Refactoring intermediately executed code to reduce cache capacity misses

    The growing memory wall requires that more attention be given to the data cache behavior of programs. In this paper, attention is given to capacity misses, i.e., the misses that occur because the cache size is smaller than the data footprint between the use and the reuse of the same data. The data footprint is measured with the reuse distance metric, by counting the distinct memory locations accessed between use and reuse. For reuse distances larger than the cache size, the associated code needs to be refactored in a way that reduces the reuse distance to below the cache size, so that the capacity misses are eliminated. In a number of simple loops, the reuse distance can be calculated analytically. However, in most cases profiling is needed to pinpoint the areas where the program needs to be transformed for better data locality. This is achieved by the reuse distance visualizer, RDVIS, which shows the intermediately executed code for critical data reuses. In addition, another tool, SLO, annotates the source program with suggestions for locality optimization. Both tools have been used to analyze and to refactor a number of SPEC2000 benchmark programs with very positive results.
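The reuse distance metric described above can be sketched directly from its definition: for each access, count the distinct locations touched since the previous access to the same location. This is a minimal quadratic-time illustration, not how RDVIS or SLO compute it. The function names are hypothetical, and the capacity-miss rule assumes a fully associative LRU cache with one location per line, where a reuse hits exactly when its distance is smaller than the cache size.

```python
def reuse_distances(trace):
    """Return, per access, the number of distinct locations accessed since the
    previous access to the same location (None for a first access)."""
    last_seen = {}
    distances = []
    for pos, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct locations strictly between the previous use and this reuse.
            between = set(trace[last_seen[addr] + 1 : pos])
            distances.append(len(between))
        else:
            distances.append(None)
        last_seen[addr] = pos
    return distances


def capacity_misses(trace, cache_size):
    """Count reuses whose distance does not fit in the assumed LRU cache."""
    return sum(
        1 for d in reuse_distances(trace) if d is not None and d >= cache_size
    )


if __name__ == "__main__":
    trace = list("abcab")
    print(reuse_distances(trace))      # first accesses have no reuse distance
    print(capacity_misses(trace, 2))   # reuses that a 2-entry cache cannot hold
    print(capacity_misses(trace, 3))   # a 3-entry cache holds both reuses
```

Refactoring in the sense of the abstract would aim to reorder the accesses so that the long distances fall below the cache size.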