7 research outputs found

    NoSQ: Store-Load Communication without a Store Queue

    This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine. NoSQ implements store-load communication using speculative memory bypassing (SMB), the dynamic short-circuiting of DEF-store-load-USE chains to DEF-USE chains. Whereas previous proposals used SMB as an opportunistic complement to conventional store-queue-based forwarding, NoSQ uses SMB as a store queue replacement. NoSQ relies on two supporting mechanisms. The first is an advanced store-load bypassing predictor that, for a given dynamic load, predicts whether that load will bypass and, if so, the identity of the communicating store. The second is an efficient verification mechanism for both bypassed and non-bypassed loads using in-order load re-execution with an SMB-aware store vulnerability window (SVW) filter. The primary benefit of NoSQ is a simple, fast datapath that does not contain store-load forwarding hardware; all loads get their values either from the data cache or from the register file. Experiments show that this simpler design - despite being more speculative - slightly outperforms a conventional store-queue-based design on most benchmarks (by 2% on average).
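
    A minimal sketch of the predict-then-verify flow described above, written as a sequential toy model rather than the paper's hardware: the predictor table, the store-distance encoding, and the function names are illustrative assumptions.

        # Toy, program-order model of NoSQ-style speculative memory bypassing.
        # In real hardware the communicating store would not yet be visible in
        # the data cache; here stores commit immediately, so the verification
        # step simply re-reads memory, much as commit-time re-execution would.

        class BypassPredictor:
            """Indexed by load PC; predicts the 'distance' of the communicating
            store (1 = most recent store), or None for 'read the cache'."""
            def __init__(self):
                self.table = {}

            def predict(self, load_pc):
                return self.table.get(load_pc)

            def train(self, load_pc, distance):
                self.table[load_pc] = distance

        def run(trace):
            """trace: ('st', pc, addr, value) and ('ld', pc, addr) records in
            program order. Returns the load results and the squash count."""
            pred = BypassPredictor()
            memory = {}                       # committed memory state
            recent_stores = []                # (addr, value), oldest first
            results, squashes = [], 0
            for rec in trace:
                if rec[0] == 'st':
                    _, pc, addr, value = rec
                    memory[addr] = value
                    recent_stores.append((addr, value))
                    continue
                _, pc, addr = rec
                dist = pred.predict(pc)
                if dist and dist <= len(recent_stores):
                    value = recent_stores[-dist][1]    # speculative bypass
                else:
                    value = memory.get(addr, 0)        # normal cache access
                correct = memory.get(addr, 0)          # in-order verification
                if value != correct:                   # (an SVW filter would
                    squashes += 1                      # skip most of these)
                    value = correct
                for d, (saddr, _) in enumerate(reversed(recent_stores), 1):
                    if saddr == addr:                  # retrain on the true
                        pred.train(pc, d)              # forwarding store
                        break
                results.append(value)
            return results, squashes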

    EOLE: Paving the Way for an Effective Implementation of Value Prediction

    Published at the International Symposium on Computer Architecture (ISCA) 2014; link: http://people.irisa.fr/Arthur.Perais/data/ISCA%2714_EOLE.pdf. Even in the multicore era, there is a continuous demand to increase the performance of single-threaded applications. However, the conventional path of increasing both issue width and instruction window size inevitably leads to the power wall. Value prediction (VP) was proposed in the mid-1990s as an alternative path to further enhance the performance of wide-issue superscalar processors. Still, until recently a performance-effective implementation of value prediction was considered to add tremendous complexity and power consumption to almost every stage of the pipeline. Nonetheless, recent work in the field of VP has shown that, given an efficient confidence estimation mechanism, prediction validation can be removed from the out-of-order engine and delayed until commit time. As a result, recovering from mispredictions via selective replay can be avoided and a much simpler mechanism - pipeline squashing - can be used, while the out-of-order engine remains mostly unmodified. However, VP and validation at commit time entail strong constraints on the physical register file: write ports are needed to write predicted results and read ports are needed to validate them at commit time, potentially making the overall number of ports prohibitive. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place and in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until just before commit, since predictions are validated at commit time. Consequently, a significant number of instructions - 10% to 60% in our experiments - can bypass the out-of-order engine, allowing a reduction of the issue width, which is a major contributor to both out-of-order engine complexity and register file port requirements. This reduction paves the way for a truly practical implementation of value prediction. Furthermore, since value prediction in itself usually increases performance, our resulting {Early | Out-of-Order | Late} Execution architecture (EOLE) is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.
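
    A minimal sketch of the commit-time validation idea this abstract relies on: a last-value predictor with saturating confidence counters, predictions consumed in the front-end, and a pipeline squash (modeled here as a counter) when validation at commit fails. The predictor type, threshold, and instruction format are illustrative assumptions, not EOLE's actual configuration.

        # Toy model of value prediction with validation delayed to commit.

        class LastValuePredictor:
            def __init__(self, threshold=3):
                self.last = {}       # pc -> last committed result
                self.conf = {}       # pc -> saturating confidence counter
                self.threshold = threshold

            def predict(self, pc):
                if self.conf.get(pc, 0) >= self.threshold:
                    return self.last.get(pc)    # confident: use the prediction
                return None

            def update(self, pc, result):
                if self.last.get(pc) == result:
                    self.conf[pc] = min(self.conf.get(pc, 0) + 1, 7)
                else:
                    self.conf[pc] = 0
                self.last[pc] = result

        def run(instrs):
            """instrs: list of (pc, fn, operands). Confidently predicted
            instructions would be early/late executed outside the OoO engine;
            a commit-time mismatch triggers a pipeline squash."""
            vp = LastValuePredictor()
            bypassed = squashes = 0
            for pc, fn, ops in instrs:
                predicted = vp.predict(pc)
                actual = fn(*ops)               # ground truth for validation
                if predicted is not None:
                    bypassed += 1               # skips out-of-order issue
                    if predicted != actual:
                        squashes += 1           # squash and refetch
                vp.update(pc, actual)
            return bypassed, squashes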

    Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era

    Around 2003, newly activated power constraints caused single-thread performance growth to slow dramatically. The multi-core era was born with an emphasis on explicitly parallel software. Continuing to grow single-thread performance is still important in the multi-core context, but it must be done in an energy-efficient way. One significant impediment to performance growth in both out-of-order and in-order processors is the long latency of last-level cache misses. Prior work introduced the idea of load latency tolerance---the ability to dynamically remove miss-dependent instructions from critical execution structures, continue execution under the miss, and re-execute miss-dependent instructions after the miss returns. However, previously proposed designs were unable to improve performance in an energy-efficient way---they introduced too many new large, complex structures and re-executed too many instructions. This dissertation describes a new load latency tolerant design that is both energy-efficient and applicable to both in-order and out-of-order cores. Key novel features include formulation of slice re-execution as an alternative use of multi-threading support, efficient schemes for register and memory state management, and new pruning mechanisms for drastically reducing load latency tolerance's dynamic execution overheads. Area analysis shows that energy-efficient load latency tolerance increases the footprint of an out-of-order core by a few percent, while cycle-level simulation shows that it significantly improves the performance of memory-bound programs. Energy-efficient load latency tolerance is more energy-efficient than---and synergistic with---existing performance techniques like dynamic voltage and frequency scaling (DVFS).
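
    A minimal sketch of the deferral/re-execution pattern behind load latency tolerance: a "missing" load and everything that transitively depends on it are drained into a slice buffer instead of occupying the execution structures, then re-executed in program order once the miss returns. The instruction encoding and the miss oracle are illustrative assumptions, not the dissertation's design.

        # Toy model of slice deferral and re-execution.

        def run(instrs, miss_loads):
            """instrs: list of (iid, dest, srcs, fn) in program order.
            Instructions in miss_loads are treated as long-latency misses;
            they and their dependents are deferred to a slice buffer."""
            regs = {}                 # dest -> value (unwritten regs read as 0)
            poisoned = set()          # dests whose producers were deferred
            slice_buffer = []

            for instr in instrs:
                iid, dest, srcs, fn = instr
                if iid in miss_loads or any(s in poisoned for s in srcs):
                    slice_buffer.append(instr)    # defer miss-dependent work
                    poisoned.add(dest)
                    continue
                regs[dest] = fn(*[regs.get(s, 0) for s in srcs])

            # "Miss returns": re-execute the deferred slice in program order,
            # so producers inside the slice run before their consumers.
            for iid, dest, srcs, fn in slice_buffer:
                regs[dest] = fn(*[regs.get(s, 0) for s in srcs])
                poisoned.discard(dest)
            return regs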

    Mechanisms for Unbounded, Conflict-Robust Hardware Transactional Memory

    Conventional lock implementations serialize access to critical sections guarded by the same lock, presenting programmers with a difficult tradeoff between granularity of synchronization and amount of parallelism realized. Recently, researchers have been investigating an emerging synchronization mechanism called transactional memory as an alternative to such conventional lock-based synchronization. Memory transactions have the semantics of executing in isolation from one another while in reality executing speculatively in parallel, aborting when necessary to maintain the appearance of isolation. This combination of coarse-grained isolation and optimistic parallelism has the potential to ease the tradeoff presented by lock-based programming. This dissertation studies the hardware implementation of transactional memory, making three main contributions. First, we propose the permissions-only cache, a mechanism that efficiently increases the size of transactions that can be handled in the local cache hierarchy, improving performance. Second, we propose OneTM, an unbounded hardware transactional memory system that serializes transactions that escape the local cache hierarchy. Finally, we propose RetCon, a novel conflict-detection mechanism that reduces the impact of conflicts by allowing transactions to commit with different values than those with which they executed, as long as dataflow and control-flow constraints are maintained.
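
    A minimal sketch of the general idea of relaxing abort decisions at commit time. Note the hedge: RetCon itself tracks dataflow and control-flow constraints symbolically; the simpler value-based re-check below is only an illustration of committing despite intervening writes, with all structure names assumed.

        # Toy software transaction with commit-time value validation.

        memory = {}                        # shared memory, addr -> value

        class Transaction:
            def __init__(self):
                self.read_set = {}         # addr -> value observed on read
                self.write_set = {}        # addr -> value to publish at commit

            def read(self, addr):
                if addr in self.write_set:
                    return self.write_set[addr]
                val = memory.get(addr, 0)
                self.read_set[addr] = val
                return val

            def write(self, addr, val):
                self.write_set[addr] = val

            def try_commit(self):
                # Commit succeeds if every value read still matches memory,
                # even if other transactions touched those locations meanwhile;
                # only a genuine value change forces an abort-and-retry.
                for addr, seen in self.read_set.items():
                    if memory.get(addr, 0) != seen:
                        return False
                memory.update(self.write_set)
                return True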

    Reno: A rename-based instruction optimizer

    The effectiveness of static code optimizations—including static optimizations performed “just-in-time”—is limited by some basic constraints: (i) a limited number of logical registers, (ii) a function- or region-bounded optimization scope, and (iii) the requirement that transformations be valid along all possible paths. RENO is a modified MIPS R10000-style register renaming mechanism augmented with physical register reference counting that uses map-table “short-circuiting” to implement dynamic versions of several well-known static optimizations: move elimination, common subexpression elimination, register allocation, and constant folding. Because it implements these optimizations dynamically, RENO can overcome some of the limitations faced by static compilers and apply optimizations where static compilers cannot. RENO has many more registers at its disposal—the entire physical register file. Its optimizations naturally cross function or any other compilation region boundary. And RENO performs optimizations along the dynamic path without being impacted by other, non-taken paths. If the dynamic path proves incorrect due to mispeculations, RENO optimizations are naturally rolled back along with the code they optimize. RENO unifies several previously proposed optimizations: dynamic move elimination [14] (RENOME), register integration [24] (RENOCSE), and speculative memory bypassing (the dynamic counterpart of register allocation) [14, 21, 22, 24] (RENORA). To this union, we add a new optimization: RENOCF, a dynamic version of constant folding. RENOCF extends th
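
    A minimal sketch of map-table "short-circuiting" for dynamic move elimination, the simplest of the optimizations named above: a move is removed at rename by pointing the destination's map entry at the source's physical register and bumping that register's reference count. Structure sizes and the micro-op format are illustrative assumptions; real hardware also delays freeing the old mapping until commit, which is simplified away here.

        # Toy rename stage with reference-counted move elimination.

        class RenameStage:
            def __init__(self, num_logical=32, num_physical=128):
                self.map = {r: r for r in range(num_logical)}   # logical -> physical
                self.refcount = [0] * num_physical
                for p in self.map.values():
                    self.refcount[p] += 1
                self.free = [p for p in range(num_physical) if self.refcount[p] == 0]

            def _release(self, preg):
                self.refcount[preg] -= 1
                if self.refcount[preg] == 0:
                    self.free.append(preg)      # no mappings left: recycle

            def rename(self, op, rd, rs):
                """Returns a renamed micro-op, or None if the instruction
                was optimized away at rename."""
                if op == 'move':
                    old = self.map[rd]
                    self.map[rd] = self.map[rs]          # rd shares rs's register
                    self.refcount[self.map[rs]] += 1
                    self._release(old)
                    return None                          # nothing enters the OoO core
                psrc = self.map[rs]
                pdest = self.free.pop()
                self.refcount[pdest] += 1
                self._release(self.map[rd])              # simplified: freed at rename
                self.map[rd] = pdest
                return (op, pdest, psrc)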

    Reno: A rename-based instruction optimizer

    RENO is a modified register renaming mechanism that performs optimizations on the dynamic instruction stream. RENO uses map table manipulations to implement the dynamic counterparts of several well-known static optimizations. RENO examines the dynamic instructions as they flow through rename and optimizes some of them. An optimized instruction must be indistinguishable from an executed one. Before optimizing an instruction, RENO ensures that the value it would compute is already present in the physical register file. RENO then maps the output of the optimized instruction to that register. Because the map table points to the right value, the optimized instruction can safely bypass out-of-order execution. RENO combines four optimizations into a unified framework. Move elimination optimizes moves. Common sub-expression elimination optimizes redundant operations. Register allocation optimizes stack-pointer loads. Finally, constant propagation optimizes add-immediate instructions. Although it may seem superfluous to perform these optimizations in hardware, their static counterparts are inherently limited by: (a) separate, file-level compilation, (b) conservative information about memory dependencies, (c) inability to use resources that are not visible at the architectural level, and (d) the requirement that any transformation be correct along all possible paths. The dynamic RENO versions: (a) ignore compilation boundaries, (b) can optimize speculatively, (c) can access micro-architectural resources, and (d) need to ensure correctness along the current dynamic path only. Consequently, RENO is capable of optimizing an average of 22% of dynamic instructions from highly optimized MediaBench and SPEC2000 integer programs. Despite this, RENO is a complement rather than a replacement for static optimizations because RENO-optimized instructions still have to fetch and commit—statically-optimized instructions bypass the entire pipeline. Removing instructions from the out-of-order execution stream improves processor performance via execution latency reduction and out-of-order bandwidth and capacity amplification. For a balanced 4-wide/128-instruction window pipeline, these effects convert a 22% optimization rate into average speedups of 8.3%/11.6% (SPEC2000 integer/MediaBench). In addition, RENO's out-of-order bandwidth and capacity amplification effects enable a new class of designs, which couple a wider in-order front-end and back-end with a narrower out-of-order core. Because the out-of-order core is difficult to scale, these designs deliver performance in a more complexity-effective way.
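
    A minimal sketch of the common-subexpression side of the idea, in the spirit of register integration: if an incoming operation's (opcode, physical sources, immediate) tuple matches one whose result already sits in a physical register, the destination is simply remapped to that register and no execution is needed. The table layout is an illustrative assumption, and invalidation when physical registers are recycled is omitted for brevity.

        # Toy rename-time CSE via an integration table keyed by physical inputs.

        class IntegrationTable:
            def __init__(self):
                self.table = {}     # (op, psrc_tuple, imm) -> physical dest reg

            def lookup(self, op, psrcs, imm):
                return self.table.get((op, tuple(psrcs), imm))

            def insert(self, op, psrcs, imm, pdest):
                self.table[(op, tuple(psrcs), imm)] = pdest

        def rename_with_cse(instr, map_table, itable, allocate_preg):
            """instr = (op, rd, rsrcs, imm). Returns (pdest, executed_flag)."""
            op, rd, rsrcs, imm = instr
            psrcs = [map_table[r] for r in rsrcs]
            hit = itable.lookup(op, psrcs, imm)
            if hit is not None:
                map_table[rd] = hit            # reuse the existing result
                return hit, False              # instruction bypasses execution
            pdest = allocate_preg()
            map_table[rd] = pdest
            itable.insert(op, psrcs, imm, pdest)
            return pdest, True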

    Efficient Scaling of Out-of-Order Processor Resources

    Rather than improving single-threaded performance, with the dawn of the multi-core era processor microarchitects have exploited Moore's law transistor scaling by increasing core density on a chip and increasing the number of thread contexts within a core. However, single-thread performance and efficiency are still very relevant in the power-constrained multi-core era, as increasing core counts do not yield corresponding performance improvements under real thermal and thread-level constraints. This dissertation provides a detailed study of register reference count structures and their application to both conventional and non-conventional, latency-tolerant, out-of-order processors. Prior work has incorporated reference counting, but without a detailed implementation or energy model. This dissertation presents a working implementation of reference count structures and shows that the overheads are low and can be recouped by the techniques enabled in high-performance out-of-order processors. A study of register allocation algorithms exploits register file occupancy to reduce power consumption by dynamically resizing the register file, which is especially important in the face of wider multi-threaded processors that require larger register files. Latency tolerance has been introduced as a technique to improve single-threaded performance by removing cache-miss-dependent instructions from the execution pipeline until the miss returns. This dissertation introduces a microarchitecture with a predictive approach to identify long-latency loads and reduce the energy cost and overhead of scaling the instruction window inherent in latency-tolerant microarchitectures. The key features include a front-end predictive slice-out mechanism and an in-order queue structure, along with mechanisms to reduce the energy cost and register-file usage of executing instructions. Cycle-level simulation shows improved performance and reduced energy delay for memory-bound workloads. Both techniques scale processor resources, addressing register file inefficiency and the allocation of processor resources to instructions during low-ILP regions. Ph.D., Computer Engineering -- Drexel University, 201
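
    A minimal sketch of occupancy-driven register-file resizing built on reference counts: live physical registers are tracked by their counts, and banks are notionally power-gated when occupancy stays well below the active capacity. Bank size, thresholds, and the gating interface are illustrative assumptions, not the dissertation's mechanism.

        # Toy reference-counted register file with occupancy-based resizing.

        class ResizableRegfile:
            def __init__(self, num_regs=128, bank_size=32):
                self.refcount = [0] * num_regs
                self.bank_size = bank_size
                self.active_banks = num_regs // bank_size

            def occupancy(self):
                return sum(1 for c in self.refcount if c > 0)

            def acquire(self, preg):
                self.refcount[preg] += 1          # new mapping or new reader

            def release(self, preg):
                self.refcount[preg] -= 1
                if self.refcount[preg] == 0:
                    self._maybe_resize()

            def _maybe_resize(self):
                # Shrink (gate a bank) when live registers fit in fewer banks
                # with slack; grow again when occupancy nears the active limit.
                live = self.occupancy()
                limit = self.active_banks * self.bank_size
                if live < limit - self.bank_size and self.active_banks > 1:
                    self.active_banks -= 1
                elif live > limit - 4:
                    self.active_banks = min(self.active_banks + 1,
                                            len(self.refcount) // self.bank_size)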