7 research outputs found

    NoSQ: Store-Load Communication without a Store Queue

    This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine. NoSQ implements store-load communication using speculative memory bypassing (SMB), the dynamic short-circuiting of DEF-store-load-USE chains to DEF-USE chains. Whereas previous proposals used SMB as an opportunistic complement to conventional store-queue-based forwarding, NoSQ uses SMB as a store queue replacement. NoSQ relies on two supporting mechanisms. The first is an advanced store-load bypassing predictor that, for a given dynamic load, predicts whether that load will bypass and, if so, the identity of the communicating store. The second is an efficient verification mechanism for both bypassed and non-bypassed loads using in-order load re-execution with an SMB-aware store vulnerability window (SVW) filter. The primary benefit of NoSQ is a simple, fast datapath that does not contain store-load forwarding hardware; all loads get their values either from the data cache or from the register file. Experiments show that this simpler design - despite being more speculative - slightly outperforms a conventional store-queue-based design on most benchmarks (by 2% on average).
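
    A minimal sketch of the predict-then-verify flow described above, written as a sequential toy model rather than the paper's hardware: the predictor table, the store-distance encoding, and the function names are illustrative assumptions.

        # Toy, program-order model of NoSQ-style speculative memory bypassing.
        # In real hardware the communicating store would not yet be visible in
        # the data cache; here stores commit immediately, so the verification
        # step simply re-reads memory, much as commit-time re-execution would.

        class BypassPredictor:
            """Indexed by load PC; predicts the 'distance' of the communicating
            store (1 = most recent store), or None for 'read the cache'."""
            def __init__(self):
                self.table = {}

            def predict(self, load_pc):
                return self.table.get(load_pc)

            def train(self, load_pc, distance):
                self.table[load_pc] = distance

        def run(trace):
            """trace: ('st', pc, addr, value) and ('ld', pc, addr) records in
            program order. Returns the load results and the squash count."""
            pred = BypassPredictor()
            memory = {}                       # committed memory state
            recent_stores = []                # (addr, value), oldest first
            results, squashes = [], 0
            for rec in trace:
                if rec[0] == 'st':
                    _, pc, addr, value = rec
                    memory[addr] = value
                    recent_stores.append((addr, value))
                    continue
                _, pc, addr = rec
                dist = pred.predict(pc)
                if dist and dist <= len(recent_stores):
                    value = recent_stores[-dist][1]    # speculative bypass
                else:
                    value = memory.get(addr, 0)        # normal cache access
                correct = memory.get(addr, 0)          # in-order verification
                if value != correct:                   # (an SVW filter would
                    squashes += 1                      # skip most of these)
                    value = correct
                for d, (saddr, _) in enumerate(reversed(recent_stores), 1):
                    if saddr == addr:                  # retrain on the true
                        pred.train(pc, d)              # forwarding store
                        break
                results.append(value)
            return results, squashes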

    EOLE: Paving the Way for an Effective Implementation of Value Prediction

    Published at the International Symposium on Computer Architecture (ISCA) 2014; link: http://people.irisa.fr/Arthur.Perais/data/ISCA%2714_EOLE.pdf. Even in the multicore era, there is a continuous demand to increase the performance of single-threaded applications. However, the conventional path of increasing both issue width and instruction window size inevitably leads to the power wall. Value prediction (VP) was proposed in the mid-1990s as an alternative path to further enhance the performance of wide-issue superscalar processors. Still, until recently a performance-effective implementation of value prediction was considered to add tremendous complexity and power consumption to almost every stage of the pipeline. Nonetheless, recent work in the field of VP has shown that, given an efficient confidence estimation mechanism, prediction validation can be removed from the out-of-order engine and delayed until commit time. As a result, recovering from mispredictions via selective replay can be avoided and a much simpler mechanism - pipeline squashing - can be used, while the out-of-order engine remains mostly unmodified. However, VP and validation at commit time entail strong constraints on the physical register file: write ports are needed to write predicted results and read ports are needed to validate them at commit time, potentially making the overall number of ports prohibitive. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place and in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until just before commit, since predictions are validated at commit time. Consequently, a significant number of instructions - 10% to 60% in our experiments - can bypass the out-of-order engine, allowing a reduction of the issue width, which is a major contributor to both out-of-order engine complexity and register file port requirements. This reduction paves the way for a truly practical implementation of value prediction. Furthermore, since value prediction in itself usually increases performance, our resulting {Early | Out-of-Order | Late} Execution architecture (EOLE) is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.
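
    A minimal sketch of the commit-time validation idea this abstract relies on: a last-value predictor with saturating confidence counters, predictions consumed in the front-end, and a pipeline squash (modeled here as a counter) when validation at commit fails. The predictor type, threshold, and instruction format are illustrative assumptions, not EOLE's actual configuration.

        # Toy model of value prediction with validation delayed to commit.

        class LastValuePredictor:
            def __init__(self, threshold=3):
                self.last = {}       # pc -> last committed result
                self.conf = {}       # pc -> saturating confidence counter
                self.threshold = threshold

            def predict(self, pc):
                if self.conf.get(pc, 0) >= self.threshold:
                    return self.last.get(pc)    # confident: use the prediction
                return None

            def update(self, pc, result):
                if self.last.get(pc) == result:
                    self.conf[pc] = min(self.conf.get(pc, 0) + 1, 7)
                else:
                    self.conf[pc] = 0
                self.last[pc] = result

        def run(instrs):
            """instrs: list of (pc, fn, operands). Confidently predicted
            instructions would be early/late executed outside the OoO engine;
            a commit-time mismatch triggers a pipeline squash."""
            vp = LastValuePredictor()
            bypassed = squashes = 0
            for pc, fn, ops in instrs:
                predicted = vp.predict(pc)
                actual = fn(*ops)               # ground truth for validation
                if predicted is not None:
                    bypassed += 1               # skips out-of-order issue
                    if predicted != actual:
                        squashes += 1           # squash and refetch
                vp.update(pc, actual)
            return bypassed, squashes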

    Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era

    Around 2003, newly activated power constraints caused single-thread performance growth to slow dramatically. The multi-core era was born with an emphasis on explicitly parallel software. Continuing to grow single-thread performance is still important in the multi-core context, but it must be done in an energy-efficient way. One significant impediment to performance growth in both out-of-order and in-order processors is the long latency of last-level cache misses. Prior work introduced the idea of load latency tolerance---the ability to dynamically remove miss-dependent instructions from critical execution structures, continue execution under the miss, and re-execute miss-dependent instructions after the miss returns. However, previously proposed designs were unable to improve performance in an energy-efficient way---they introduced too many new large, complex structures and re-executed too many instructions. This dissertation describes a new load latency tolerant design that is both energy-efficient and applicable to both in-order and out-of-order cores. Key novel features include formulation of slice re-execution as an alternative use of multi-threading support, efficient schemes for register and memory state management, and new pruning mechanisms for drastically reducing load latency tolerance's dynamic execution overheads. Area analysis shows that energy-efficient load latency tolerance increases the footprint of an out-of-order core by a few percent, while cycle-level simulation shows that it significantly improves the performance of memory-bound programs. Energy-efficient load latency tolerance is more energy-efficient than---and synergistic with---existing performance techniques like dynamic voltage and frequency scaling (DVFS).
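
    A minimal sketch of the deferral/re-execution pattern behind load latency tolerance: a "missing" load and everything that transitively depends on it are drained into a slice buffer instead of occupying the execution structures, then re-executed in program order once the miss returns. The instruction encoding and the miss oracle are illustrative assumptions, not the dissertation's design.

        # Toy model of slice deferral and re-execution.

        def run(instrs, miss_loads):
            """instrs: list of (iid, dest, srcs, fn) in program order.
            Instructions in miss_loads are treated as long-latency misses;
            they and their dependents are deferred to a slice buffer."""
            regs = {}                 # dest -> value (unwritten regs read as 0)
            poisoned = set()          # dests whose producers were deferred
            slice_buffer = []

            for instr in instrs:
                iid, dest, srcs, fn = instr
                if iid in miss_loads or any(s in poisoned for s in srcs):
                    slice_buffer.append(instr)    # defer miss-dependent work
                    poisoned.add(dest)
                    continue
                regs[dest] = fn(*[regs.get(s, 0) for s in srcs])

            # "Miss returns": re-execute the deferred slice in program order,
            # so producers inside the slice run before their consumers.
            for iid, dest, srcs, fn in slice_buffer:
                regs[dest] = fn(*[regs.get(s, 0) for s in srcs])
                poisoned.discard(dest)
            return regs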

    Mechanisms for Unbounded, Conflict-Robust Hardware Transactional Memory

    Conventional lock implementations serialize access to critical sections guarded by the same lock, presenting programmers with a difficult tradeoff between granularity of synchronization and amount of parallelism realized. Recently, researchers have been investigating an emerging synchronization mechanism called transactional memory as an alternative to such conventional lock-based synchronization. Memory transactions have the semantics of executing in isolation from one another while in reality executing speculatively in parallel, aborting when necessary to maintain the appearance of isolation. This combination of coarse-grained isolation and optimistic parallelism has the potential to ease the tradeoff presented by lock-based programming. This dissertation studies the hardware implementation of transactional memory, making three main contributions. First, we propose the permissions-only cache, a mechanism that efficiently increases the size of transactions that can be handled in the local cache hierarchy, improving performance. Second, we propose OneTM, an unbounded hardware transactional memory system that serializes transactions that escape the local cache hierarchy. Finally, we propose RetCon, a novel conflict-detection mechanism that reduces the impact of conflicts by allowing transactions to commit with different values than those with which they executed, as long as dataflow and control-flow constraints are maintained.
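
    A minimal sketch of the general idea of relaxing abort decisions at commit time. Note the hedge: RetCon itself tracks dataflow and control-flow constraints symbolically; the simpler value-based re-check below is only an illustration of committing despite intervening writes, with all structure names assumed.

        # Toy software transaction with commit-time value validation.

        memory = {}                        # shared memory, addr -> value

        class Transaction:
            def __init__(self):
                self.read_set = {}         # addr -> value observed on read
                self.write_set = {}        # addr -> value to publish at commit

            def read(self, addr):
                if addr in self.write_set:
                    return self.write_set[addr]
                val = memory.get(addr, 0)
                self.read_set[addr] = val
                return val

            def write(self, addr, val):
                self.write_set[addr] = val

            def try_commit(self):
                # Commit succeeds if every value read still matches memory,
                # even if other transactions touched those locations meanwhile;
                # only a genuine value change forces an abort-and-retry.
                for addr, seen in self.read_set.items():
                    if memory.get(addr, 0) != seen:
                        return False
                memory.update(self.write_set)
                return True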

    Reno: A rename-based instruction optimizer

    The effectiveness of static code optimizations—including static optimizations performed “just-in-time”—is limited by some basic constraints: (i) a limited number of logical registers, (ii) a function- or region-bounded optimization scope, and (iii) the requirement that transformations be valid along all possible paths. RENO is a modified MIPS R10000-style register renaming mechanism augmented with physical register reference counting that uses map-table “short-circuiting” to implement dynamic versions of several well-known static optimizations: move elimination, common subexpression elimination, register allocation, and constant folding. Because it implements these optimizations dynamically, RENO can overcome some of the limitations faced by static compilers and apply optimizations where static compilers cannot. RENO has many more registers at its disposal—the entire physical register file. Its optimizations naturally cross function or any other compilation region boundary. And RENO performs optimizations along the dynamic path without being impacted by other, non-taken paths. If the dynamic path proves incorrect due to mispeculations, RENO optimizations are naturally rolled back along with the code they optimize. RENO unifies several previously proposed optimizations: dynamic move elimination [14] (RENOME), register integration [24] (RENOCSE), and speculative memory bypassing (the dynamic counterpart of register allocation) [14, 21, 22, 24] (RENORA). To this union, we add a new optimization: RENOCF, a dynamic version of constant folding. RENOCF extends th
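
    A minimal sketch of map-table "short-circuiting" for dynamic move elimination, the simplest of the optimizations named above: a move is removed at rename by pointing the destination's map entry at the source's physical register and bumping that register's reference count. Structure sizes and the micro-op format are illustrative assumptions; real hardware also delays freeing the old mapping until commit, which is simplified away here.

        # Toy rename stage with reference-counted move elimination.

        class RenameStage:
            def __init__(self, num_logical=32, num_physical=128):
                self.map = {r: r for r in range(num_logical)}   # logical -> physical
                self.refcount = [0] * num_physical
                for p in self.map.values():
                    self.refcount[p] += 1
                self.free = [p for p in range(num_physical) if self.refcount[p] == 0]

            def _release(self, preg):
                self.refcount[preg] -= 1
                if self.refcount[preg] == 0:
                    self.free.append(preg)      # no mappings left: recycle

            def rename(self, op, rd, rs):
                """Returns a renamed micro-op, or None if the instruction
                was optimized away at rename."""
                if op == 'move':
                    old = self.map[rd]
                    self.map[rd] = self.map[rs]          # rd shares rs's register
                    self.refcount[self.map[rs]] += 1
                    self._release(old)
                    return None                          # nothing enters the OoO core
                psrc = self.map[rs]
                pdest = self.free.pop()
                self.refcount[pdest] += 1
                self._release(self.map[rd])              # simplified: freed at rename
                self.map[rd] = pdest
                return (op, pdest, psrc)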

    Reno: A rename-based instruction optimizer

    RENO is a modified register renaming mechanism that performs optimizations on the dynamic instruction stream. RENO uses map table manipulations to implement the dynamic counterparts of several well-known static optimizations. RENO examines the dynamic instructions as they flow through rename and optimizes some of them. An optimized instruction must be indistinguishable from an executed one. Before optimizing an instruction, RENO ensures that the value it would compute is already present in the physical register file. RENO then maps the output of the optimized instruction to that register. Because the map table points to the right value, the optimized instruction can safely bypass out-of-order execution. RENO combines four optimizations into a unified framework. Move elimination optimizes moves. Common sub-expression elimination optimizes redundant operations. Register allocation optimizes stack-pointer loads. Finally, constant propagation optimizes add-immediate instructions. Although it may seem superfluous to perform these optimizations in hardware, their static counterparts are inherently limited by: (a) separate, file-level compilation, (b) conservative information about memory dependencies, (c) inability to use resources that are not visible at the architectural level, and (d) the requirement that any transformation be correct along all possible paths. The dynamic RENO versions: (a) ignore compilation boundaries, (b) can optimize speculatively, (c) can access micro-architectural resources, and (d) need to ensure correctness along the current dynamic path only. Consequently, RENO is capable of optimizing an average of 22% of dynamic instructions from highly optimized MediaBench and SPEC2000 integer programs. Despite this, RENO is a complement rather than a replacement for static optimizations because RENO-optimized instructions still have to fetch and commit—statically-optimized instructions bypass the entire pipeline. Removing instructions from the out-of-order execution stream improves processor performance via execution latency reduction and out-of-order bandwidth and capacity amplification. For a balanced 4-wide/128-instruction window pipeline, these effects convert a 22% optimization rate into average speedups of 8.3%/11.6% (SPEC2000 integer/MediaBench). In addition, RENO's out-of-order bandwidth and capacity amplification effects enable a new class of designs, which couple a wider in-order front-end and back-end with a narrower out-of-order core. Because the out-of-order core is difficult to scale, these designs deliver performance in a more complexity-effective way.
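
    A minimal sketch of the common-subexpression side of the idea, in the spirit of register integration: if an incoming operation's (opcode, physical sources, immediate) tuple matches one whose result already sits in a physical register, the destination is simply remapped to that register and no execution is needed. The table layout is an illustrative assumption, and invalidation when physical registers are recycled is omitted for brevity.

        # Toy rename-time CSE via an integration table keyed by physical inputs.

        class IntegrationTable:
            def __init__(self):
                self.table = {}     # (op, psrc_tuple, imm) -> physical dest reg

            def lookup(self, op, psrcs, imm):
                return self.table.get((op, tuple(psrcs), imm))

            def insert(self, op, psrcs, imm, pdest):
                self.table[(op, tuple(psrcs), imm)] = pdest

        def rename_with_cse(instr, map_table, itable, allocate_preg):
            """instr = (op, rd, rsrcs, imm). Returns (pdest, executed_flag)."""
            op, rd, rsrcs, imm = instr
            psrcs = [map_table[r] for r in rsrcs]
            hit = itable.lookup(op, psrcs, imm)
            if hit is not None:
                map_table[rd] = hit            # reuse the existing result
                return hit, False              # instruction bypasses execution
            pdest = allocate_preg()
            map_table[rd] = pdest
            itable.insert(op, psrcs, imm, pdest)
            return pdest, True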

    Efficient Scaling of Out-of-Order Processor Resources

    Rather than improving single-threaded performance, with the dawn of the multi-core era processor microarchitects have exploited Moore's law transistor scaling by increasing core density on a chip and increasing the number of thread contexts within a core. However, single-thread performance and efficiency are still very relevant in the power-constrained multi-core era, as increasing core counts do not yield corresponding performance improvements under real thermal and thread-level constraints. This dissertation provides a detailed study of register reference count structures and their application to both conventional and non-conventional, latency-tolerant, out-of-order processors. Prior work has incorporated reference counting, but without a detailed implementation or energy model. This dissertation presents a working implementation of reference count structures and shows that the overheads are low and can be recouped by the techniques enabled in high-performance out-of-order processors. A study of register allocation algorithms exploits register file occupancy to reduce power consumption by dynamically resizing the register file, which is especially important in the face of wider multi-threaded processors that require larger register files. Latency tolerance has been introduced as a technique to improve single-threaded performance by removing cache-miss-dependent instructions from the execution pipeline until the miss returns. This dissertation introduces a microarchitecture with a predictive approach to identify long-latency loads and reduce the energy cost and overhead of scaling the instruction window inherent in latency-tolerant microarchitectures. The key features include a front-end predictive slice-out mechanism and an in-order queue structure, along with mechanisms to reduce the energy cost and register-file usage of executing instructions. Cycle-level simulation shows improved performance and reduced energy delay for memory-bound workloads. Both techniques scale processor resources, addressing register file inefficiency and the allocation of processor resources to instructions during low-ILP regions. Ph.D., Computer Engineering -- Drexel University, 201
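
    A minimal sketch of occupancy-driven register-file resizing built on reference counts: live physical registers are tracked by their counts, and banks are notionally power-gated when occupancy stays well below the active capacity. Bank size, thresholds, and the gating interface are illustrative assumptions, not the dissertation's mechanism.

        # Toy reference-counted register file with occupancy-based resizing.

        class ResizableRegfile:
            def __init__(self, num_regs=128, bank_size=32):
                self.refcount = [0] * num_regs
                self.bank_size = bank_size
                self.active_banks = num_regs // bank_size

            def occupancy(self):
                return sum(1 for c in self.refcount if c > 0)

            def acquire(self, preg):
                self.refcount[preg] += 1          # new mapping or new reader

            def release(self, preg):
                self.refcount[preg] -= 1
                if self.refcount[preg] == 0:
                    self._maybe_resize()

            def _maybe_resize(self):
                # Shrink (gate a bank) when live registers fit in fewer banks
                # with slack; grow again when occupancy nears the active limit.
                live = self.occupancy()
                limit = self.active_banks * self.bank_size
                if live < limit - self.bank_size and self.active_banks > 1:
                    self.active_banks -= 1
                elif live > limit - 4:
                    self.active_banks = min(self.active_banks + 1,
                                            len(self.refcount) // self.bank_size)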