
    Encoding Mini-Graphs With Handle Prefix Outlining

    Recently proposed techniques like mini-graphs, CCA-subgraphs, and static strands exploit application-specific compound or fused instructions to reduce execution time, energy consumption, and/or processor complexity. To achieve their full potential, these techniques rely on static tools to identify common instruction sequences that make good fusion candidates. As a result, they also rely on ISA extension facilities that can encode these chosen instruction groups in a way that supports efficient execution on fusion-enabled hardware as well as compatibility across different implementations, including fusion-agnostic implementations. This paper describes handle prefix outlining, the ISA extension scheme used by mini-graph processors. Handle prefix outlining can be thought of as a hybrid of the encoding schemes used by three previous instruction aggregation techniques: PRISC, static strands, and CCA-subgraphs. It combines the best features of each scheme to deliver both full compatibility and execution efficiency on fusion-enabled processors.

    Mini-Graph Processing

    For years, single-thread performance was the dominant force driving processor development. In recent years, however, the poor scaling of single-thread superscalar performance and power concerns, coupled with the ever-increasing number of transistors available on chip, have shifted the focus from single-thread performance to thread-level parallelism running on multi-core designs. The trend is for these cores to be narrower with smaller windows. This dissertation addresses the question of how to maintain (and, ideally, improve) single-thread performance under such constraints. Mini-graph processing is a form of instruction fusion (the grouping of multiple operations into a single processing unit) that increases the instructions-per-cycle (IPC) throughput of dynamically scheduled superscalar processors in an efficient way. Mini-graphs are compiler-identified aggregates of multiple instructions that look and behave like singleton instructions at every pipeline stage except execute, where the constituent operations are retrieved and performed serially, microcode style. A mini-graph processor exploits instruction fusion to increase the efficiency of pipeline stages and structures that perform instruction bookkeeping. This dissertation describes a mini-graph architecture and evaluates it using cycle-level simulation. A superscalar processor enhanced with mini-graphs can match the performance otherwise achieved with a wider, deeper superscalar processor. Experiments show that across four benchmark suites, the addition of mini-graph processing allows a dynamically scheduled 3-wide superscalar processor to match the IPC of a 4-wide superscalar machine.

    Dataflow mini-graphs: Amplifying superscalar capacity and bandwidth

    A mini-graph is a dataflow graph that has an arbitrary internal size and shape but the interface of a singleton instruction: two register inputs, one register output, a maximum of one memory operation, and a maximum of one (terminal) control transfer. Previous work has exploited dataflow sub-graphs whose execution latency can be reduced via programmable FPGA-style hardware. In this paper we show that mini-graphs can improve performance by amplifying the bandwidths of a superscalar processor’s stages and the capacities of many of its structures without custom latency-reduction hardware. Amplification is achieved because the processor deals with a complete mini-graph via a single quasi-instruction, the handle. By constraining mini-graph structure and forcing handles to behave as much like singleton instructions as possible, the number and scope of the modifications over a conventional superscalar microarchitecture are kept to a minimum. This paper describes mini-graphs, a simple algorithm for extracting them from basic block frequency profiles, and a microarchitecture for exploiting them. Cycle-level simulation of several benchmark suites shows that mini-graphs can provide average performance gains of 2–12% over an aggressive baseline, with peak gains exceeding 40%. Alternatively, they can compensate for substantial reductions in register file and scheduler size, and in pipeline bandwidth.
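
The interface constraints stated in this abstract can be illustrated with a small check. This is only a sketch under assumptions, not the paper's extraction algorithm: the `Op` record, the `fits_minigraph_interface` function, and the `live_out` parameter are hypothetical names, and the limits are read as upper bounds (at most two external register inputs, at most one register output that is live afterwards).

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Op:
    """Hypothetical instruction record, used only for this illustration."""
    dest: Optional[str]          # destination register, or None
    srcs: Tuple[str, ...]        # source registers
    is_mem: bool = False         # load or store
    is_control: bool = False     # branch or jump


def fits_minigraph_interface(ops: Sequence[Op], live_out: Iterable[str]) -> bool:
    """Return True if `ops` respects the singleton-instruction interface."""
    written = set()
    external_inputs = set()
    for op in ops:
        # A source is an external input if no earlier op in the group wrote it.
        for reg in op.srcs:
            if reg not in written:
                external_inputs.add(reg)
        if op.dest is not None:
            written.add(op.dest)

    # Registers produced inside the group that are still needed afterwards.
    outputs = written & set(live_out)
    mem_ops = sum(1 for op in ops if op.is_mem)
    control_positions = [i for i, op in enumerate(ops) if op.is_control]
    # At most one control transfer, and it must be the final operation.
    control_ok = control_positions in ([], [len(ops) - 1])

    return (len(external_inputs) <= 2 and len(outputs) <= 1
            and mem_ops <= 1 and control_ok)
```

For instance, an add feeding a compare feeding a terminal branch, with two external registers and one live result, passes the check; adding a third external source register makes it fail.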

    Three Extensions to Register Integration

    Register integration (or just integration) is a register renaming discipline that implements instruction reuse via physical register sharing. Initially developed to perform squash reuse, the integration mechanism is a powerful reuse tool that can exploit more reuse scenarios. In this paper, we describe three extensions to the initial integration mechanism that expand its applicability and boost its performance impact. First, we extend squash reuse to general reuse. Whereas squash reuse maintains the superscalar concept of an instruction instance "owning" its output physical register, we allow multiple instructions to simultaneously and seamlessly share a single physical register. Next, we replace the PC-indexing scheme used by squash reuse with an opcode-based indexing scheme that exposes more integration opportunities. Finally, we introduce an extension called reverse integration, in which we speculatively create integration entries for the inverses of operations; for instance, when renaming an add, we create an entry for the inverse subtract. Reverse integration allows us to reuse operations that were not specified by the original program. We use reverse integration to obtain a free implementation of speculative memory bypassing for stack-pointer-based loads (register fills and restores).
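
The reverse integration idea in this abstract can be shown with a toy model. This is a sketch under assumptions, not the paper's mechanism: the `IntegrationTable` class, its `rename` method, the key layout (opcode, source physical register, immediate), and the restriction to register-immediate add and sub are all hypothetical simplifications, and the sketch omits speculation and misintegration recovery.

```python
from typing import Dict, Tuple

Key = Tuple[str, str, int]  # (opcode, source physical register, immediate)

# Inverse opcodes for the register-immediate operations modelled here.
INVERSE = {"add": "sub", "sub": "add"}


class IntegrationTable:
    """Toy opcode-indexed integration table (illustrative only)."""

    def __init__(self) -> None:
        self.entries: Dict[Key, str] = {}
        self._next_free = 100  # arbitrary starting id for fresh registers

    def _allocate(self) -> str:
        preg = f"p{self._next_free}"
        self._next_free += 1
        return preg

    def rename(self, opcode: str, src_preg: str, imm: int) -> Tuple[str, bool]:
        """Rename one instruction; return (output register, was_reused)."""
        key = (opcode, src_preg, imm)
        hit = self.entries.get(key)
        if hit is not None:
            # Integration: share the existing physical register.
            return hit, True

        dest = self._allocate()
        self.entries[key] = dest
        inverse = INVERSE.get(opcode)
        if inverse is not None:
            # Reverse entry: applying the inverse operation to the new
            # output yields the original input register.
            self.entries[(inverse, dest, imm)] = src_preg
        return dest, False
```

In this model, renaming `sub` on a stack pointer held in `p1` with immediate 16 allocates a fresh register and also records the inverse entry, so a later `add` of 16 to that fresh register is mapped back to `p1` without executing.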

    Disintermediated Active Communication
