30 research outputs found

    Loop Distribution and Fusion with Timing and Code Size Optimization for Embedded DSPs

    Full text link
    International Conference on Embedded and Ubiquitous Computing (EUC 2005), Nagasaki, Japan, 6-9 Dec 2005. Loop distribution and loop fusion are two effective loop transformation techniques for optimizing the execution of programs in DSP applications. In this paper, we propose a new technique combining loop distribution with direct loop fusion, which improves timing performance without jeopardizing code size. We first develop loop distribution theorems that state the legality conditions of loop distribution for multi-level nested loops. We show that if the summation of the edge weights of a dependence cycle satisfies a certain condition, then the statements involved in the cycle can be distributed; otherwise, they should be placed in the same loop after loop distribution. We then propose the technique of maximum loop distribution with direct loop fusion. The experimental results show that the execution time of the transformed loops is reduced by 21.0% and their code size by 7.0% on average, compared to the original loops.
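
    To make the transformation concrete, here is a minimal C sketch (illustrative only, not code from the paper): statement S1 participates in a loop-carried dependence on a[], while S2 is independent, so distribution may legally split them, and S2's loop becomes a candidate for direct fusion with other conformable loops. Array names and sizes are hypothetical.

    ```c
    /* Hypothetical illustration of loop distribution; arrays and
     * sizes are assumptions, not taken from the paper. */
    #define N 1024

    void original(float a[N], float b[N], float c[N]) {
        for (int i = 1; i < N; i++) {
            a[i] = a[i - 1] + b[i];  /* S1: loop-carried dependence on a[] */
            c[i] = 2.0f * b[i];      /* S2: independent of S1 */
        }
    }

    void distributed(float a[N], float b[N], float c[N]) {
        /* S1 stays in its own loop because of its dependence cycle. */
        for (int i = 1; i < N; i++)
            a[i] = a[i - 1] + b[i];
        /* S2's loop is now free to be fused with other conformable loops. */
        for (int i = 1; i < N; i++)
            c[i] = 2.0f * b[i];
    }
    ```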

    Evidence of Color Coherence Effects in W+jets Events from ppbar Collisions at sqrt(s) = 1.8 TeV

    Full text link
    We report the results of a study of color coherence effects in ppbar collisions based on data collected by the D0 detector during the 1994-1995 run of the Fermilab Tevatron Collider, at a center-of-mass energy sqrt(s) = 1.8 TeV. Initial-to-final state color interference effects are studied by examining particle distribution patterns in events with a W boson and at least one jet. The data are compared to Monte Carlo simulations with different color coherence implementations and to an analytic modified-leading-logarithm perturbative calculation based on the local parton-hadron duality hypothesis. Comment: 13 pages, 6 figures. Submitted to Physics Letters.

    Compiler Directed Parallelization of Loops in Scale for Shared-Memory Multiprocessors

    No full text

    A Comparison of Compiler Tiling Algorithms

    No full text

    Path-based reuse distance analysis

    No full text
    Profiling can effectively analyze program behavior and provide critical information for feedback-directed or dynamic optimizations. Based on memory profiling, reuse distance analysis has shown much promise in predicting data locality for a program using inputs other than the profiled ones. Both whole-program and instruction-based locality can be accurately predicted by reuse distance analysis. Reuse distance analysis abstracts a cluster of memory references for a particular instruction having similar reuse distance values into a locality pattern. Prior work has shown that a significant number of memory instructions have multiple locality patterns, a property not desirable for many instruction-based memory optimizations. This paper investigates the relationship between locality patterns and execution paths by analyzing the reuse distance distribution along each dynamic path to an instruction. Here a path is defined as the program execution trace from the previous access of a memory location to the current access. By differentiating locality patterns with the context of execution paths, the proposed analysis can expose optimization opportunities tailored only to a specific subset of paths leading to an instruction. In this paper, we present an effective method for path-based reuse distance profiling and analysis. We have observed that a significant percentage of the multiple locality patterns for an instruction can be uniquely related to a particular execution path in the program. In addition, we have also investigated the influence of inputs on the reuse distance distribution for each path/instruction pair. The experimental results show that the path-based reuse distance is highly predictable, as a function of the data size, for a set of SPEC CPU2000 programs.
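
    For intuition about the metric itself, the following C sketch (illustrative only, not the paper's profiler) computes reuse distances over a toy address trace with the classic LRU-stack method: an access's reuse distance is the number of distinct addresses touched since the previous access to the same address. The trace, stack bound, and names are assumptions.

    ```c
    /* LRU-stack sketch of reuse distance; all names and bounds are
     * hypothetical, not taken from the paper. */
    #include <stdio.h>

    #define MAX_DISTINCT 4096

    static long stack[MAX_DISTINCT];  /* addresses in LRU order, 0 = most recent */
    static int depth = 0;

    /* Returns the reuse distance of `addr`, or -1 on a cold (first) access. */
    int reuse_distance(long addr) {
        for (int i = 0; i < depth; i++) {
            if (stack[i] == addr) {
                /* Found at depth i: i distinct addresses were touched since
                 * the previous access.  Move addr back to the top. */
                for (int j = i; j > 0; j--) stack[j] = stack[j - 1];
                stack[0] = addr;
                return i;
            }
        }
        /* Cold miss: push addr on top, evicting the bottom if full. */
        for (int j = (depth < MAX_DISTINCT ? depth : MAX_DISTINCT - 1); j > 0; j--)
            stack[j] = stack[j - 1];
        stack[0] = addr;
        if (depth < MAX_DISTINCT) depth++;
        return -1;
    }

    int main(void) {
        long trace[] = {1, 2, 3, 1, 2, 1};  /* toy address trace */
        for (int k = 0; k < 6; k++)
            printf("addr %ld -> distance %d\n", trace[k], reuse_distance(trace[k]));
        return 0;  /* distances: -1 -1 -1 2 2 1 */
    }
    ```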

    Compiler Optimizations for Improving Data Locality

    No full text
    In the past decade, processor speed has become significantly faster than memory speed. Small, fast cache memories are designed to overcome this discrepancy, but they are only effective when programs exhibit data locality. In this paper, we present compiler optimizations to improve data locality based on a simple yet accurate cost model. The model computes both temporal and spatial reuse of cache lines to find desirable loop organizations. The cost model drives the application of compound transformations consisting of loop permutation, loop fusion, loop distribution, and loop reversal. We demonstrate that these program transformations are useful for optimizing many programs. To validate our optimization strategy, we implemented our algorithms and ran experiments on a large collection of scientific programs and kernels. Experiments with kernels illustrate that our model and algorithm can select and achieve the best performance. For over thirty complete applications, we executed the origi..
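
    As a hedged illustration of one transformation such a cost model drives, the C sketch below shows loop permutation improving spatial reuse of cache lines; the array names and sizes are hypothetical. C stores arrays row-major, so the innermost loop should walk the rightmost subscript.

    ```c
    /* Loop permutation sketch; arrays and bounds are assumptions,
     * not taken from the paper. */
    #define N 1024

    void column_order(double a[N][N], double b[N][N]) {
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                a[i][j] = b[i][j] + 1.0;  /* stride-N: poor spatial locality */
    }

    void row_order(double a[N][N], double b[N][N]) {
        /* After interchanging the i and j loops, consecutive iterations
         * touch adjacent elements of the same cache line. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = b[i][j] + 1.0;  /* stride-1: good spatial locality */
    }
    ```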

    RDVIS: A tool that visualizes the causes of low locality and hints program optimizations

    No full text
    We present RDVIS, a visualization tool that helps the programmer find program transformations to improve temporal data locality. We introduce a number of locality metrics that capture the necessary information, and, based on a cluster analysis of basic block vectors, the tool gives strong hints about which program transformations are needed. The visualizer allowed us to find the necessary transformations for three SPEC2000 programs in just a few minutes; after performing these transformations, the programs run on average 3 times faster on a number of different platforms.

    Loop Transformation Recipes for Code Generation and Auto-Tuning

    No full text
    In this paper, we describe transformation recipes, which provide a high-level interface to the code transformation and code generation capability of a compiler. These recipes can be generated by compiler decision algorithms or savvy software developers. This interface is part of an auto-tuning framework that explores a set of different implementations of the same computation and automatically selects the best-performing implementation. Along with the original computation, a transformation recipe specifies a range of implementations of the computation resulting from composing a set of high-level code transformations. In our system, an underlying polyhedral framework coupled with transformation algorithms takes this set of transformations, composes them and automatically generates correct code. We first describe an abstract interface for transformation recipes, which we propose to facilitate interoperability with other transformation frameworks. We then focus on the specific transformation recipe interface used in our compiler and present performance results on its application to kernel and library tuning and tuning of key computations in high-end applications. We also show how this framework can be used to generate and auto-tune parallel OpenMP or CUDA code from a high-level specification.
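
    The concrete recipe syntax is system-specific and not reproduced here; as a purely hypothetical illustration, the C code below shows the kind of tiled loop nest that a recipe composing two tiling transformations, with tile size TS exposed as an auto-tuning knob, might generate for matrix multiply. All names and parameters are assumptions.

    ```c
    /* Hypothetical output of a recipe tiling the i and j loops of a
     * matrix multiply; not the paper's actual generated code. */
    #define N  1024
    #define TS 64   /* tile size: the knob an auto-tuner would sweep */

    /* Assumes C has been zero-initialized by the caller and N % TS == 0. */
    void matmul_tiled(double C[N][N], double A[N][N], double B[N][N]) {
        for (int ii = 0; ii < N; ii += TS)
            for (int jj = 0; jj < N; jj += TS)
                for (int k = 0; k < N; k++)          /* reduction loop */
                    for (int i = ii; i < ii + TS; i++)
                        for (int j = jj; j < jj + TS; j++)
                            C[i][j] += A[i][k] * B[k][j];
    }
    ```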