48 research outputs found

    Approaching Symbolic Parallelization by Synthesis of Recurrence Decompositions

    We present GraSSP, a novel approach to automated parallelization that relies on recent advances in formal verification and synthesis. GraSSP augments an existing sequential program with additional functionality to decompose data dependencies in loop iterations, to compute partial results, and to compose them together. We show that for some classes of sequential prefix sum problems, such parallelization can be performed efficiently.
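The decompose, compute-partial-results, and compose pattern that this abstract describes can be illustrated with a hand-written chunked prefix sum. This is only a minimal sketch of the general idea, not GraSSP's synthesized output; the function name `chunked_prefix_sum` and the `num_chunks` parameter are hypothetical, and the chunks are processed sequentially here even though they are independent.

```python
from itertools import accumulate


def chunked_prefix_sum(values, num_chunks):
    """Inclusive prefix sum built from independent chunks (illustrative helper)."""
    if not values:
        return []
    size = -(-len(values) // num_chunks)  # ceiling division; assumes num_chunks >= 1
    # Decompose: split the input so each chunk has no dependency on the others.
    chunks = [values[i:i + size] for i in range(0, len(values), size)]
    # Partial results: a local prefix sum per chunk (this step could run in parallel).
    partials = [list(accumulate(chunk)) for chunk in chunks]
    # Compose: each chunk's offset is the sum of the totals of all earlier chunks.
    totals = list(accumulate(p[-1] for p in partials))
    offsets = [0] + totals[:-1]
    return [x + off for p, off in zip(partials, offsets) for x in p]


print(chunked_prefix_sum([1, 2, 3, 4, 5], 2))
```

Only the compose step carries information across chunks, which is why the per-chunk work can in principle be distributed.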

    Array Data Flow Analysis for Load-Store Optimizations in Superscalar Architectures

    The performance of scientific programs on superscalar processors can be significantly degraded by memory references that frequently arise due to load and store operations associated with array references. Therefore, register allocation techniques have been developed for allocating registers to array elements whose values are repeatedly referenced over one or more loop iterations. To place load, store, and register-to-register shift operations without introducing fully/partially redundant and dead memory operations, a detailed value flow analysis of array references is required. We present an analysis framework to efficiently solve various data flow problems required by array load-store optimizations. The framework determines the collective behavior of recurrent references spread over multiple loop iterations. We also demonstrate how our algorithms can be adapted for various fine-grain architectures.

    Fast and parallel webpage layout

    The web browser is a CPU-intensive program. Especially on mobile devices, webpages load too slowly, expending significant time in processing a document’s appearance. Due to power constraints, most hardware-driven speedups will come in the form of parallel architectures. This is also true of mobile devices such as phones. Current browsers, however, barely exploit hardware parallelism, so we are designing a parallel mobile browser. In this paper, we introduce new algorithms for CSS selector matching, layout solving, and font rendering, which represent key components for a fast layout engine. Evaluation on popular sites shows speedups as high as 80x. We also formulate layout solving with attribute grammars, enabling us not only to parallelize our algorithm but also to prove that it computes in O(log) time and without reflow.

    Focusing Processor Policies via Critical-Path Prediction

    Although some instructions hurt performance more than others, current processors typically apply scheduling and speculation as if each instruction were equally costly. Instruction cost can be naturally expressed through the critical path: if we could predict it at run-time, egalitarian policies could be replaced with cost-sensitive strategies that will grow increasingly effective as processors become more parallel. This paper introduces a hardware predictor of instruction criticality and uses it to improve performance. The predictor is both effective and simple in its hardware implementation. The effectiveness at improving performance stems from using a dependence-graph model of the microarchitectural critical path that identifies execution bottlenecks by incorporating both data and machine-specific dependences. The simplicity stems from a token-passing algorithm that computes the critical path without actually building the dependence graph. By focusing processor policies on critical instructions, our predictor enables a large class of optimizations. It can (i) give priority to critical instructions for scarce resources (functional units, ports, predictor entries); and (ii) suppress speculation on non-critical instructions, thus reducing “useless” misspeculations. We present two case studies that illustrate the potential of the two types of optimization. We show that (i) critical-path-based dynamic instruction scheduling and steering in a clustered architecture improves performance by as much as 21% (10% on average); and (ii) focusing value prediction only on critical instructions improves performance by as much as 5%, due to removing nearly half of the misspeculations.
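To make the notion of a critical path through a dependence graph concrete, the sketch below computes the longest latency-weighted chain over an explicitly built graph. It is an offline software illustration only, not the paper's token-passing hardware predictor (which avoids building the graph); the function name `critical_path`, the instruction names, and the latencies are all hypothetical, and instructions are assumed to be listed in program order with dependences pointing only to earlier instructions.

```python
def critical_path(latency, deps):
    """Return (length, path) of the longest latency-weighted dependence chain.

    latency: dict mapping instruction -> cycles, in program order.
    deps: dict mapping instruction -> list of earlier instructions it waits on.
    """
    if not latency:
        return 0, []
    finish = {}  # earliest finish time of each instruction
    pred = {}    # the dependence that determined that finish time
    for instr, cycles in latency.items():
        # The last-finishing producer decides when this instruction can start.
        last = max(deps.get(instr, []), key=lambda d: finish[d], default=None)
        start = finish[last] if last is not None else 0
        finish[instr] = start + cycles
        pred[instr] = last
    # Walk back from the instruction that finishes last.
    node = max(finish, key=finish.get)
    length = finish[node]
    path = []
    while node is not None:
        path.append(node)
        node = pred[node]
    return length, path[::-1]


# Hypothetical four-instruction example.
latency = {"load": 3, "add": 1, "mul": 2, "store": 1}
deps = {"add": ["load"], "mul": ["load"], "store": ["add", "mul"]}
print(critical_path(latency, deps))
```

In this example `add` is off the critical path, so delaying it by a cycle would not lengthen execution, which is the kind of slack a cost-sensitive policy could exploit.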