282 research outputs found

    Efficient register renaming and recovery for high-performance processors

    Full text link
    © © 2014 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.”Modern superscalar processors implement register renaming using either random access memory (RAM) or content-addressable memories (CAM) tables. The design of these structures should address both access time and misprediction recovery penalty. Although direct-mapped RAMs provide faster access times, CAMs are more appropriate to avoid recovery penalties. The presence of associative ports in CAMs, however, prevents them from scaling with the number of physical registers and pipeline width, negatively impacting performance, area, and energy consumption at the rename stage. In this paper, we present a new hybrid RAM CAM register renaming scheme, which combines the best of both approaches. In a steady state, a RAM provides fast and energy-efficient access to register mappings. On misspeculation, a low-complexity CAM enables immediate recovery. Experimental results show that in a four-way state-ofthe- art superscalar processor, the new approach provides almost the same performance as an ideal CAM-based renaming scheme, while dissipating only between 17% and 26% of the original energy and, in some cases, consuming less energy than purely RAM-based renaming schemes. Overall, the silicon area required to implement the hybrid RAM CAM scheme does not exceed the area required by conventional renaming mechanisms.This work was supported in part by the Spanish MINECO under Grant TIN2012-38341-C04-01.Petit Martí, SV.; Ubal Tena, R.; Sahuquillo Borrás, J.; López Rodríguez, PJ. (2014). Efficient register renaming and recovery for high-performance processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems. 22(7):1506-1514. https://doi.org/10.1109/TVLSI.2013.2270001S1506151422

    Direct instruction wakeup for out-of-order processors

    Get PDF
    Instruction queues consume a significant amount of power in high-performance processors, primarily due to instruction wakeup logic access to the queue structures. The wakeup logic delay is also a critical timing parameter. This paper proposes a new queue organization using a small number of successor pointers plus a small number of dynamically allocated full successor bit vectors for cases with a larger number of successors. The details of the new organization are described and it is shown to achieve the performance of CAM-based or full dependency matrix organizations using just one pointer per instruction plus eight full bit vectors. Only two full bit vectors are needed when two successor pointers are stored per instruction. Finally, a design and pre-layout of all critical structures in 70 nm technology was performed for the proposed organization as well as for a CAM-based baseline. The new design is shown to use 1/2 to 1/5th of the baseline instruction queue power, depending on queue size. It is also shown to use significantly less power than the full dependency matrix based design.Peer ReviewedPostprint (published version

    Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors

    Full text link
    Los procesadores superescalares actuales utilizan un reorder buffer (ROB) para contabilizar las instrucciones en vuelo. El ROB se implementa como una cola FIFO first in first out en la que las instrucciones se insertan en orden de programa después de ser decodificadas, y de la que se extraen también en orden de programa en la etapa commit. El uso de esta estructura proporciona un soporte simple para la especulación, las excepciones precisas y la reclamación de registros. Sin embargo, el hecho de retirar instrucciones en orden puede degradar las prestaciones si una operación de alta latencia está bloqueando la cabecera del ROB. Varias propuestas se han publicado atacando este problema. La mayoría utiliza retirada de instrucciones fuera de orden de forma especulativa, requiriendo almacenar puntos de recuperación (checkpoints) para restaurar un estado válido del procesador ante un fallo de especulación. Normalmente, los checkpoints necesitan implementarse con estructuras hardware costosas, y además requieren un crecimiento de otras estructuras del procesador, lo cual a su vez puede impactar en el tiempo de ciclo de reloj. Este problema afecta a muchos tipos de procesadores actuales, independientemente del número de hilos hardware (threads) y del número de núcleos de cómputo (cores) que incluyan. Esta tesis abarca el estudio de la retirada no especulativa de instrucciones fuera de orden en procesadores superescalares, multithread y multicore.Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535Palanci

    Low-cost and efficient fault detection and diagnosis schemes for modern cores

    Get PDF
    Continuous improvements in transistor scaling together with microarchitectural advances have made possible the widespread adoption of high-performance processors across all market segments. However, the growing reliability threats induced by technology scaling and by the complexity of designs are challenging the production of cheap yet robust systems. Soft error trends are haunting, especially for combinational logic, and parity and ECC codes are therefore becoming insufficient as combinational logic turns into the dominant source of soft errors. Furthermore, experts are warning about the need to also address intermittent and permanent faults during processor runtime, as increasing temperatures and device variations will accelerate inherent aging phenomena. These challenges specially threaten the commodity segments, which impose requirements that existing fault tolerance mechanisms cannot offer. Current techniques based on redundant execution were devised in a time when high penalties were assumed for the sake of high reliability levels. Novel light-weight techniques are therefore needed to enable fault protection in the mass market segments. The complexity of designs is making post-silicon validation extremely expensive. Validation costs exceed design costs, and the number of discovered bugs is growing, both during validation and once products hit the market. Fault localization and diagnosis are the biggest bottlenecks, magnified by huge detection latencies, limited internal observability, and costly server farms to generate test outputs. This thesis explores two directions to address some of the critical challenges introduced by unreliable technologies and by the limitations of current validation approaches. We first explore mechanisms for comprehensively detecting multiple sources of failures in modern processors during their lifetime (including transient, intermittent, permanent and also design bugs). Our solutions embrace a paradigm where fault tolerance is built based on exploiting high-level microarchitectural invariants that are reusable across designs, rather than relying on re-execution or ad-hoc block-level protection. To do so, we decompose the basic functionalities of processors into high-level tasks and propose three novel runtime verification solutions that combined enable global error detection: a computation/register dataflow checker, a memory dataflow checker, and a control flow checker. The techniques use the concept of end-to-end signatures and allow designers to adjust the fault coverage to their needs, by trading-off area, power and performance. Our fault injection studies reveal that our methods provide high coverage levels while causing significantly lower performance, power and area costs than existing techniques. Then, this thesis extends the applicability of the proposed error detection schemes to the validation phases. We present a fault localization and diagnosis solution for the memory dataflow by combining our error detection mechanism, a new low-cost logging mechanism and a diagnosis program. Selected internal activity is continuously traced and kept in a memory-resident log whose capacity can be expanded to suite validation needs. The solution can catch undiscovered bugs, reducing the dependence on simulation farms that compute golden outputs. Upon error detection, the diagnosis algorithm analyzes the log to automatically locate the bug, and also to determine its root cause. Our evaluations show that very high localization coverage and diagnosis accuracy can be obtained at very low performance and area costs. The net result is a simplification of current debugging practices, which are extremely manual, time consuming and cumbersome. Altogether, the integrated solutions proposed in this thesis capacitate the industry to deliver more reliable and correct processors as technology evolves into more complex designs and more vulnerable transistors.El continuo escalado de los transistores junto con los avances microarquitectónicos han posibilitado la presencia de potentes procesadores en todos los segmentos de mercado. Sin embargo, varios problemas de fiabilidad están desafiando la producción de sistemas robustos. Las predicciones de "soft errors" son inquietantes, especialmente para la lógica combinacional: soluciones como ECC o paridad se están volviendo insuficientes a medida que dicha lógica se convierte en la fuente predominante de soft errors. Además, los expertos están alertando acerca de la necesidad de detectar otras fuentes de fallos (causantes de errores permanentes e intermitentes) durante el tiempo de vida de los procesadores. Los segmentos "commodity" son los más vulnerables, ya que imponen unos requisitos que las técnicas actuales de fiabilidad no ofrecen. Estas soluciones (generalmente basadas en re-ejecución) fueron ideadas en un tiempo en el que con tal de alcanzar altos nivel de fiabilidad se asumían grandes costes. Son por tanto necesarias nuevas técnicas que permitan la protección contra fallos en los segmentos más populares. La complejidad de los diseños está encareciendo la validación "post-silicon". Su coste excede el de diseño, y el número de errores descubiertos está aumentando durante la validación y ya en manos de los clientes. La localización y el diagnóstico de errores son los mayores problemas, empeorados por las altas latencias en la manifestación de errores, por la poca observabilidad interna y por el coste de generar las señales esperadas. Esta tesis explora dos direcciones para tratar algunos de los retos causados por la creciente vulnerabilidad hardware y por las limitaciones de los enfoques de validación. Primero exploramos mecanismos para detectar múltiples fuentes de fallos durante el tiempo de vida de los procesadores (errores transitorios, intermitentes, permanentes y de diseño). Nuestras soluciones son de un paradigma donde la fiabilidad se construye explotando invariantes microarquitectónicos genéricos, en lugar de basarse en re-ejecución o en protección ad-hoc. Para ello descomponemos las funcionalidades básicas de un procesador y proponemos tres soluciones de `runtime verification' que combinadas permiten una detección de errores a nivel global. Estas tres soluciones son: un verificador de flujo de datos de registro y de computación, un verificador de flujo de datos de memoria y un verificador de flujo de control. Nuestras técnicas usan el concepto de firmas y permiten a los diseñadores ajustar los niveles de protección a sus necesidades, mediante compensaciones en área, consumo energético y rendimiento. Nuestros estudios de inyección de errores revelan que los métodos propuestos obtienen altos niveles de protección, a la vez que causan menos costes que las soluciones existentes. A continuación, esta tesis explora la aplicabilidad de estos esquemas a las fases de validación. Proponemos una solución de localización y diagnóstico de errores para el flujo de datos de memoria que combina nuestro mecanismo de detección de errores, junto con un mecanismo de logging de bajo coste y un programa de diagnóstico. Cierta actividad interna es continuamente registrada en una zona de memoria cuya capacidad puede ser expandida para satisfacer las necesidades de validación. La solución permite descubrir bugs, reduciendo la necesidad de calcular los resultados esperados. Al detectar un error, el algoritmo de diagnóstico analiza el registro para automáticamente localizar el bug y determinar su causa. Nuestros estudios muestran un alto grado de localización y de precisión de diagnóstico a un coste muy bajo de rendimiento y área. El resultado es una simplificación de las prácticas actuales de depuración, que son enormemente manuales, incómodas y largas. En conjunto, las soluciones de esta tesis capacitan a la industria a producir procesadores más fiables, a medida que la tecnología evoluciona hacia diseños más complejos y más vulnerables

    Energy Efficient Load Latency Tolerance: Single-Thread Performance for the Multi-Core Era

    Get PDF
    Around 2003, newly activated power constraints caused single-thread performance growth to slow dramatically. The multi-core era was born with an emphasis on explicitly parallel software. Continuing to grow single-thread performance is still important in the multi-core context, but it must be done in an energy efficient way. One significant impediment to performance growth in both out-of-order and in-order processors is the long latency of last-level cache misses. Prior work introduced the idea of load latency tolerance---the ability to dynamically remove miss-dependent instructions from critical execution structures, continue execution under the miss, and re-execute miss-dependent instructions after the miss returns. However, previously proposed designs were unable to improve performance in an energy-efficient way---they introduced too many new large, complex structures and re-executed too many instructions. This dissertation describes a new load latency tolerant design that is both energy-efficient, and applicable to both in-order and out-of-order cores. Key novel features include formulation of slice re-execution as an alternative use of multi-threading support, efficient schemes for register and memory state management, and new pruning mechanisms for drastically reducing load latency tolerance\u27s dynamic execution overheads. Area analysis shows that energy-efficient load latency tolerance increases the footprint of an out-of-order core by a few percent, while cycle-level simulation shows that it significantly improves the performance of memory-bound programs. Energy-efficient load latency tolerance is more energy-efficient than---and synergistic with---existing performance technique like dynamic voltage and frequency scaling (DVFS)

    Compiler-Directed Energy Savings in Superscalar Processors

    Get PDF
    Institute for Computing Systems ArchitectureSuperscalar processors contain large, complex structures to hold data and instructions as they wait to be executed. However, many of these structures consume large amounts of energy, making them hotspots requiring sophisticated cooling systems. With the trend towards larger, more complex processors, this will become more of a problem, having important implications for future technology. This thesis uses compiler-based optimisation schemes to target the issue queue and register file. These are two of the most energy consuming structures in the processor. The algorithms and hardware techniques developed in this work dynamically adapt the processor's resources to the changing program phases, turning off parts of each structure when they are unused to save dynamic and static energy. To optimise the issue queue, the compiler analysis tracks data dependences through each program procedure. It identifies the critical path through each program region and informs the hardware of the minimum number of queue entries required to prevent it slowing down. This reduces the occupancy of the queue and increases the opportunities to save energy. With just a 1.3% performance loss, 26% dynamic and 32% static energy savings are achieved. Registers can be idle for many cycles after they are last read, before they are released and put back on the free-list to be reused by another instruction. Alternatively, they can be turned off for energy savings. Early register releasing can be used to perform this operation sooner than usual, but hardware schemes must wait for the instruction redefining the relevant logical register to enter the pipeline. This thesis presents an exploration of compiler-directed early register releasing. The compiler can exactly identify the last use of each register and pass the information to the hardware, based on simple data-flow and liveness analysis. The best scheme achieves 15% dynamic and 19% static energy savings. Finally, the issue queue limiting and early register releasing schemes are combined for energy savings in both processor structures. Four different configurations are evaluated bringing 25% to 31% dynamic and 19% to 34% static issue queue energy savings and reductions of 18% to 25% dynamic and 20% to 21% static energy in the register file

    Design and implementation of an out of order execution engine of floating point arithmetic operations

    Get PDF
    In this thesis, work is undertaken towards the design in hardware description languages and implementation in FPGA of an out of order execution engine of floating point arithmetic operations. This thesis work, is part of a project called Lagarto

    Efficient Scaling of Out-of-Order Processor Resources

    Get PDF
    Rather than improving single-threaded performance, with the dawn of the multi-core era, processor microarchitects have exploited Moore's law transistor scaling by increasing core density on a chip and increasing the number of thread contexts within a core. However, single-thread performance and efficiency is still very relevant in the power-constrained multi-core era, as increasing core counts do not yield corresponding performance improvements under real thermal and thread-level constraints. This dissertation provides a detailed study of register reference count structures and its application to both conventional and non-conventional, latency tolerant, out-of-order processors. Prior work has incorporated reference counting, but without a detailed implementation or energy model. This dissertation presents a working implementation of reference count structures and shows the overheads are low and can be recouped by the techniques enabled in high-performance out-of-order processors. A study of register allocation algorithms exploits register file occupancy to reduce power consumption by dynamically resizing the register file, which is especially important in the face of wider multi-threaded processors who require larger register files. Latency tolerance has been introduced as a technique to improve single threaded performance by removing cache-miss dependent instructions from the execution pipeline until the miss returns. This dissertation introduces a microarchitecture with a predictive approach to identify long-latency loads, and reduce the energy cost and overhead of scaling the instruction window inherent in latency tolerant microarchitectures. The key features include a front-end predictive slice-out mechanism and in-order queue structure along with mechanisms to reduce the energy cost and register-file usage of executing instructions. Cycle-level simulation shows improved performance and reduced energy delay for memory-bound workloads. Both techniques scale processor resources, addressing register file inefficiency and the allocation of processor resources to instructions during low ILP regions.Ph.D., Computer Engineering -- Drexel University, 201

    A low-power cache system for high-performance processors

    Get PDF
    制度:新 ; 報告番号:甲3439号 ; 学位の種類:博士(工学) ; 授与年月日:12-Sep-11 ; 早大学位記番号:新576
    corecore