2,345 research outputs found

    Reducing the LSQ and L1 Data Cache Power Consumption

    In most modern processor designs, the hardware dedicated to storing data and instructions (the memory hierarchy) has become a major consumer of power. In order to reduce this power consumption, we propose in this paper two techniques: one to filter accesses to the LSQ (Load-Store Queue) based on both timing and address information, and the other to filter accesses to the first-level data cache based on a forwarding predictor. Our simulation results show that power consumption decreases by 30-40% in each structure, with a negligible performance penalty of less than 0.1%. Presented at the V Workshop Arquitectura, Redes y Sistemas Operativos (WARSO). Red de Universidades con Carreras en Informática (RedUNCI).
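    To make the second technique concrete, here is a minimal sketch of how a forwarding predictor might gate L1 data cache accesses. The PC-indexed table of 2-bit saturating counters, its size, and the index hash are illustrative assumptions, not the paper's actual design; a misprediction simply falls back to a normal cache access, which is why the performance penalty stays small.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical PC-indexed forwarding predictor: trained on whether a load
// received its value via store-to-load forwarding from the LSQ.
class ForwardingPredictor {
    static constexpr std::size_t kEntries = 1024;  // assumed table size
    std::array<uint8_t, kEntries> counters_{};     // 2-bit saturating counters

    static std::size_t index(uint64_t load_pc) { return (load_pc >> 2) % kEntries; }

public:
    // Predict true -> the load is expected to be served by the LSQ, so the
    // L1 data cache access can be skipped, saving its dynamic energy.
    bool predictForwarded(uint64_t load_pc) const {
        return counters_[index(load_pc)] >= 2;
    }

    // Train with the actual outcome once the load resolves.
    void update(uint64_t load_pc, bool was_forwarded) {
        uint8_t& c = counters_[index(load_pc)];
        if (was_forwarded) { if (c < 3) ++c; } else { if (c > 0) --c; }
    }
};
```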

    Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era

    Around 2003, newly activated power constraints caused single-thread performance growth to slow dramatically. The multi-core era was born, with an emphasis on explicitly parallel software. Continuing to grow single-thread performance is still important in the multi-core context, but it must be done in an energy-efficient way. One significant impediment to performance growth in both out-of-order and in-order processors is the long latency of last-level cache misses. Prior work introduced the idea of load latency tolerance: the ability to dynamically remove miss-dependent instructions from critical execution structures, continue execution under the miss, and re-execute the miss-dependent instructions after the miss returns. However, previously proposed designs were unable to improve performance in an energy-efficient way: they introduced too many new large, complex structures and re-executed too many instructions. This dissertation describes a new load latency tolerant design that is energy-efficient and applicable to both in-order and out-of-order cores. Key novel features include the formulation of slice re-execution as an alternative use of multi-threading support, efficient schemes for register and memory state management, and new pruning mechanisms for drastically reducing load latency tolerance's dynamic execution overheads. Area analysis shows that energy-efficient load latency tolerance increases the footprint of an out-of-order core by a few percent, while cycle-level simulation shows that it significantly improves the performance of memory-bound programs. Energy-efficient load latency tolerance is more energy-efficient than, and synergistic with, existing performance techniques such as dynamic voltage and frequency scaling (DVFS).
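    As a rough illustration of the deferral mechanism that load latency tolerance builds on, the sketch below shows poison-bit slice extraction: instructions whose sources depend on an outstanding miss are drained into a deferred queue instead of blocking the pipeline, then re-executed when the miss returns. The structure and names are assumptions for illustration, not the dissertation's design.

```cpp
#include <bitset>
#include <cstdint>
#include <vector>

struct Inst { uint8_t src1, src2, dst; bool is_miss_load; };

// Walk the instruction window in order; miss-dependent instructions are
// deferred (the "slice"), everything else executes under the miss.
void issueOrDefer(const std::vector<Inst>& window,
                  std::vector<Inst>& deferred_slice) {
    std::bitset<64> poisoned;  // one poison bit per architectural register
    for (const Inst& in : window) {
        bool dep = poisoned[in.src1] || poisoned[in.src2];
        if (in.is_miss_load || dep) {
            poisoned[in.dst] = true;       // propagate poison to consumers
            deferred_slice.push_back(in);  // re-execute after the miss fills
        } else {
            poisoned[in.dst] = false;      // value produced now; un-poison
            // execute(in);  // miss-independent work continues under the miss
        }
    }
}
```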

    NoSQ: Store-Load Communication without a Store Queue

    This paper presents NoSQ (short for No Store Queue), a microarchitecture that performs store-load communication without a store queue and without executing stores in the out-of-order engine. NoSQ implements store-load communication using speculative memory bypassing (SMB), the dynamic short-circuiting of DEF-store-load-USE chains to DEF-USE chains. Whereas previous proposals used SMB as an opportunistic complement to conventional store-queue-based forwarding, NoSQ uses SMB as a store queue replacement. NoSQ relies on two supporting mechanisms. The first is an advanced store-load bypassing predictor that, for a given dynamic load, predicts whether that load will bypass and the identity of the communicating store. The second is an efficient verification mechanism for both bypassed and non-bypassed loads, using in-order load re-execution with an SMB-aware store vulnerability window (SVW) filter. The primary benefit of NoSQ is a simple, fast datapath that does not contain store-load forwarding hardware; all loads get their values either from the data cache or from the register file. Experiments show that this simpler design, despite being more speculative, slightly outperforms a conventional store-queue-based design on most benchmarks (by 2% on average).
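    The following is an assumed simplification of the bypass path's shape: a predictor maps a load PC to the distance (in stores) back to its communicating store, and a bypassed load then reads that store's data register directly, turning the DEF-store-load-USE chain into a DEF-USE chain. The structures here are illustrative stand-ins for the paper's predictor, and the later re-execution check is elided.

```cpp
#include <cstdint>
#include <deque>
#include <unordered_map>

struct BypassPred { bool bypass; uint32_t store_distance; };

std::unordered_map<uint64_t, BypassPred> predictor;  // PC-indexed, assumed
std::deque<uint64_t> recent_store_data_regs;         // youngest store first

// Each dispatched store records the register holding its data.
void onStoreDispatch(uint64_t data_reg) {
    recent_store_data_regs.push_front(data_reg);
}

// Returns true (and the register holding the communicating store's data)
// if the load is predicted to bypass; otherwise the load reads the cache.
bool tryBypass(uint64_t load_pc, uint64_t& data_reg) {
    auto it = predictor.find(load_pc);
    if (it == predictor.end() || !it->second.bypass) return false;
    uint32_t d = it->second.store_distance;
    if (d >= recent_store_data_regs.size()) return false;
    data_reg = recent_store_data_regs[d];  // verified later by re-execution
    return true;
}
```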

    Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced Load/Store Optimization

    A high-bandwidth, low-latency load-store unit is a critical component of a dynamically scheduled processor. Unfortunately, it is also one of the most complex and non-scalable components. Recently, several researchers have proposed techniques that simplify the core load-store unit and improve its scalability in exchange for the in-order pre-retirement re-execution of some subset of the loads in the program. We call such techniques load/store optimizations. One recent optimization attacks load queue (LQ) scalability by replacing the expensive associative search that is used to enforce intra- and inter-thread ordering with load re-execution. A second attacks store queue (SQ) scalability by speculatively filtering some load accesses and some store entries from it. The speculatively accessed, speculatively populated SQ can be made smaller and faster, but load re-execution is required to verify the speculation. A third uses a hardware table to identify redundant loads and skip their execution altogether. Redundant load elimination is highly accurate but not 100%, so re-execution is needed to flag false eliminations. Unfortunately, the inherent benefits of load/store optimizations are mitigated by re-execution itself. Re-execution contends with store retirement for cache bandwidth, and serializes load re-execution with subsequent store retirement. If a particular technique requires a sufficient number of load re-executions, the cost of these re-executions will outweigh the benefits of the technique entirely and may even produce drastic slowdowns. This is the case for the SQ technique. Store Vulnerability Window (SVW) is a new mechanism that significantly reduces the re-execution requirements of a given load/store optimization, by an average of 85% across the three load/store optimizations we study. This reduction relieves cache port contention, removes many of the dynamic serialization events that contribute the bulk of re-execution's cost, and allows these techniques to perform up to their full potential. For the scalable SQ optimization, this means the chance to perform at all; without SVW, this technique posts significant slowdowns. SVW is a simple scheme based on monotonic store sequence numbering and a novel application of Bloom filtering. The cost of an effective SVW implementation is a 1KB buffer and a 2B field per LQ entry.
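    A minimal sketch of the SVW filter follows: stores receive monotonic sequence numbers (SSNs), and a small address-hashed table records the SSN of the last committed store to each hash bucket, so a load must re-execute only if a store younger than the one it was verified against has since written (a hash of) its address. The sizes match the abstract's 1KB buffer and 2B per-LQ-entry field, but the hash function is an assumption and SSN wraparound handling is omitted for brevity.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kBuckets = 512;  // 512 x 2B entries = 1KB table
uint16_t svw[kBuckets] = {};           // SSN of last committed store per bucket
uint16_t ssn_counter = 0;              // global, monotonic store sequence number

std::size_t bucket(uint64_t addr) { return (addr >> 3) % kBuckets; }

// Every committed store stamps its address's bucket with its SSN.
void onStoreCommit(uint64_t addr) { svw[bucket(addr)] = ++ssn_counter; }

// 'verified_ssn' is the 2B per-LQ-entry field: the youngest store SSN this
// load is already known to be consistent with. If no younger store has
// written the load's bucket, re-execution is safely filtered out. Hash
// aliasing can only cause extra (false) re-executions, never missed ones.
bool mustReExecute(uint64_t load_addr, uint16_t verified_ssn) {
    return svw[bucket(load_addr)] > verified_ssn;
}
```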

    Improving processor efficiency by exploiting common-case behaviors of memory instructions

    Processor efficiency can be described with the help of a number of desirable effects or metrics, for example, performance, power, area, design complexity, and access latency. These metrics serve as valuable tools in designing new processors, and they also act as effective standards for comparing current processors. Various factors impact the efficiency of modern out-of-order processors, and one important factor is the manner in which instructions are processed through the processor pipeline. In this dissertation research, we study the impact of load and store instructions (collectively known as memory instructions) on processor efficiency, and show how to improve efficiency by exploiting common-case or predictable patterns in the behavior of memory instructions. The memory behavior patterns that we focus on are the predictability of memory dependences, the predictability of data forwarding patterns, the predictability of instruction criticality, and conservativeness in resource allocation and deallocation policies. We first design a scalable and high-performance memory dependence predictor, and then apply accurate memory dependence prediction to improve the efficiency of the fetch engine of a simultaneous multi-threaded processor. We then use predictable data forwarding patterns to eliminate power-hungry hardware in the processor with no loss in performance. We next study instruction criticality to improve processor efficiency: we examine the behavior of critical load instructions and propose applications that can be optimized using predictable load-criticality information. Finally, we explore conventional techniques for the allocation and deallocation of critical structures that process memory instructions and propose new techniques to optimize them. Our new designs have the potential to significantly reduce the power and area required by processors without losing performance, leading to more efficient processor designs. Ph.D. Committee Chair: Loh, Gabriel H.; Committee Members: Clark, Nathan; Jaleel, Aamer; Kim, Hyesoon; Lee, Hsien-Hsin S.; Prvulovic, Milo.
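    The abstract does not detail the proposed memory dependence predictor; as a stand-in, here is the classic store-set formulation of memory dependence prediction, in which loads and stores that have conflicted in the past are mapped to a common store-set ID and a dispatching load waits for the last in-flight store in its set. All structures below are illustrative of that classic scheme, not of the dissertation's design.

```cpp
#include <cstdint>
#include <unordered_map>

std::unordered_map<uint64_t, uint32_t> ssit;  // PC -> store-set ID (SSIT)
std::unordered_map<uint32_t, uint64_t> lfst;  // set ID -> last store tag (LFST)
uint32_t next_set_id = 1;

// On a memory-order violation, place both PCs in the same store set
// (real store sets also merge existing sets; omitted for brevity).
void onViolation(uint64_t load_pc, uint64_t store_pc) {
    uint32_t id = next_set_id++;
    ssit[load_pc] = ssit[store_pc] = id;
}

// Each dispatching store records itself as the last fetched store of its set.
void onStoreDispatch(uint64_t store_pc, uint64_t store_tag) {
    auto s = ssit.find(store_pc);
    if (s != ssit.end()) lfst[s->second] = store_tag;
}

// A dispatching load speculates freely unless its set has an in-flight store.
bool loadMustWait(uint64_t load_pc, uint64_t& store_tag) {
    auto s = ssit.find(load_pc);
    if (s == ssit.end()) return false;
    auto f = lfst.find(s->second);
    if (f == lfst.end()) return false;
    store_tag = f->second;  // wait until this store completes
    return true;
}
```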

    Advanced software techniques for space shuttle data management systems: Final report

    Airborne/spaceborne computer design and techniques for a space shuttle data management system.

    Affordable kilo-instruction processors

    Several factors explain the slowdown in the development of traditional high-performance single-thread processors. On the one hand, aggressive techniques such as superpipelining or out-of-order execution have a considerable impact on power consumption and design complexity. On the other hand, the increase in processor frequencies has led to a large disparity between processor speed and memory access time. Although cache memories considerably reduce the number of accesses to main memory, the remaining accesses introduce latencies large enough to considerably degrade performance. Conventional techniques such as out-of-order execution, while effective at hiding L2 cache access latencies, cannot hide latencies this large. Queues of hundreds of entries and thousands of registers would be necessary to prevent execution from stalling in the event of a main-memory access. Unfortunately, current technology cannot efficiently implement such structures monolithically: access latency, power consumption, and area would all increase considerably.

In this thesis we studied techniques that allow the processor to continue processing instructions while main-memory accesses are outstanding. For such a processor to be implementable, it must be based on structures of conventional size and feature simple control logic. The challenge lies in reconciling a distributed processor model with simple control. The design of this processor has been approached by analyzing the behavior of a processor with infinite resources. We observed that execution follows a very interesting pattern based on execution locality: in numerical codes, over 70% of all instructions do not depend on main-memory accesses. This is important because it shows that there is always a large portion of instructions that can be executed shortly after decode. It allows us to propose a new kind of processor with two execution units. The first unit, the Cache Processor, processes memory-independent instructions at high speed. The second unit, the Memory Processor, processes instructions that depend on main-memory accesses, but with much more relaxed scheduling logic, which allows it to scale to thousands of in-flight instructions. This proposal, named the Decoupled KILO-Instruction Processor (D-KIP), has several advantages: it allows a kilo-instruction processor to be built from conventional structures, and it simplifies the design because the interaction between the two execution units is minimal.

This thesis presents two implementations of this kind of processor: the original D-KIP and the Flexible Heterogeneous MultiCore (FMC). Their performance is analyzed and compared to other techniques that increase memory-level parallelism, such as prefetching and runahead execution. The FMC processor is observed to perform at the level of a conventional processor with a window of around 1500 in-flight instructions. The integration of the FMC processor into a multicore/multiprogrammed environment is then studied. The thesis concludes with the proposal of a two-level Load/Store Queue (LSQ) for this kind of processor.
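    As a rough illustration of the steering this abstract describes, the sketch below routes decoded instructions that (transitively) depend on a main-memory miss to the Memory Processor, while the roughly 70% that do not stay on the fast Cache Processor. The per-register "slow" bit used to track this dependence is an assumed simplification, not the D-KIP's actual mechanism.

```cpp
#include <bitset>
#include <cstdint>

struct DecodedInst { uint8_t src1, src2, dst; bool is_l2_miss_load; };

enum class Unit { CacheProcessor, MemoryProcessor };

std::bitset<64> slow;  // was this register produced by the Memory Processor?

Unit steer(const DecodedInst& in) {
    if (in.is_l2_miss_load || slow[in.src1] || slow[in.src2]) {
        slow[in.dst] = true;           // consumers follow it to the slow unit
        return Unit::MemoryProcessor;  // relaxed logic, thousands in flight
    }
    slow[in.dst] = false;              // value available shortly after decode
    return Unit::CacheProcessor;       // high-speed, conventional-size unit
}
```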