28 research outputs found

    A VHDL model of a superscalar implementation of the DLX instruction set architcture

    Get PDF
    The complexity of today\u27s microprocessors demands that designers have an extensive knowledge of superscalar design techniques; this knowledge is difficult to acquire outside of a professional design team. Presently, there are a limited number of adequate resources available for the student, both in textual and model form. The limited number of options available emphasizes the need for more models and simulators, allowing students the opportunity to learn more about superscalar designs prior to entering the work force. This thesis details the design and implementation of a superscalar version of the DLX instruction set architecture in behavioral VHDL. The branch prediction strategy, instruction issue model, and hazard avoidance techniques are all issues critical to superscalar processor design and are studied in this thesis. Preliminary test results demonstrate that the performance advantage of the superscalar processor is applicable even to short test sequences. Initial findings have shown a performance improvement of 26% to 57% for instruction sequences under 150 instructions

    Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors

    Full text link
    Los procesadores superescalares actuales utilizan un reorder buffer (ROB) para contabilizar las instrucciones en vuelo. El ROB se implementa como una cola FIFO first in first out en la que las instrucciones se insertan en orden de programa después de ser decodificadas, y de la que se extraen también en orden de programa en la etapa commit. El uso de esta estructura proporciona un soporte simple para la especulación, las excepciones precisas y la reclamación de registros. Sin embargo, el hecho de retirar instrucciones en orden puede degradar las prestaciones si una operación de alta latencia está bloqueando la cabecera del ROB. Varias propuestas se han publicado atacando este problema. La mayoría utiliza retirada de instrucciones fuera de orden de forma especulativa, requiriendo almacenar puntos de recuperación (checkpoints) para restaurar un estado válido del procesador ante un fallo de especulación. Normalmente, los checkpoints necesitan implementarse con estructuras hardware costosas, y además requieren un crecimiento de otras estructuras del procesador, lo cual a su vez puede impactar en el tiempo de ciclo de reloj. Este problema afecta a muchos tipos de procesadores actuales, independientemente del número de hilos hardware (threads) y del número de núcleos de cómputo (cores) que incluyan. Esta tesis abarca el estudio de la retirada no especulativa de instrucciones fuera de orden en procesadores superescalares, multithread y multicore.Ubal Tena, R. (2010). Out-of-Order Retirement of Instructions in Superscalar, Multithreaded, and Multicore Processors [Tesis doctoral no publicada]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/8535Palanci

    MOWER : A NEW DESIGN FOR NON-BLOCKING MISPREDICTION RECOVERY

    Get PDF
    Mower is a micro-architecture technique which targets branch misprediction penalties in superscalar processors. It speeds-up the misprediction recovery process by dynamically evicting stale instructions and fixing the RAT (Register Alias Table) using explicit branch dependency tracking. Tracking branch dependencies is accomplished by using simple bit matrices. This low-overhead technique allows overlapping of the recovery process with instruction fetching, renaming and scheduling from the correct path. Our evaluation of the mechanism indicates that it yields performance very close to ideal recovery and provides up to 5% speed-up and 2% reduction in power consumption compared to a traditional recovery mechanism using a reorder buffer and a walker. The simplicity of the mechanism should permit easy implementation of Mower in an actual processor

    Mitigating the Effect of Misspeculations in Superscalar Processors

    Get PDF
    Modern superscalar processors highly rely on the speculative execution which speculatively executes instructions and then verifies. If the prediction is different from the execution result, a misspeculation recovery is performed. Misspeculation recovery penalties still account for a substantial amount of performance reduction. This work focuses on the techniques to mitigate the effect of recovery penalties and proposes practical mechanisms which are thoroughly implemented and analyzed. In general, we can divide the misspeculation penalty into four parts: misspeculation detection delay; stale instruction elimination delay; state restoration delay and pipeline fill delay. This dissertation does not consider the detection delay, instead, we design four innovative mechanisms. Some of these mechanisms target a specific recovery delay whereas others target multiple types of delay in a unified algorithm. Mower was designed to address the stale instruction elimination delay and the state restoration delay by using a special walker. When a misprediction is detected, the walker will scan and repair the instructions which are younger than the mispredicted instruction. During the walking procedure, the correct state is restored and the stale instructions are eliminated. Based on Mower, we further simplify the design and develop a Two-Phase recovery mechanism. This mechanism uses only a basic recovery mechanism except for the case in which the retire stage was stalled by a long latency instruction. When the retire stage is stalled, the second phase is launched and the instructions in the pipeline are re-fetched. Two-Phase mechanism recovers from an earlier point in the program and overlaps the recovery penalty with the long latency penalty. In reality, some of the instructions on the wrong path can be reused during the recovery. However, such reuse of misprediction results is not easy and most of the time involves significant complexity. We design Passing Loop to reduce the pipeline fill delay. We applied our mechanism only for short forward branches which eliminates a substantial amount of complexity. In terms of memory dependence speculation and associated delays due to memory ordering violations, we develop a mechanism that optimizes store-queue-free architectures. A store-queue-free architecture experiences more memory dependence mispredictions due to its aggressive approach to speculations. A common solution is to delay the execution of an instruction which is more likely to be mispredicted. We propose a mechanism to dynamically insert predicates for comparing the address of memory instructions, which is called “Dynamic Memory Dependence Predication” (DMDP). This mechanism boosts the instruction execution to its earliest point and reduces the number of mispredictions

    Reusing cached schedules in an out-of-order processor with in-order issue logic

    Get PDF
    Modern processors use out-of-order processing logic to achieve high performance in Instructions Per Cycle (IPC) but this logic has a serious impact on the achievable frequency. In order to get better performance out of smaller transistors there is a trend to increase the number of cores per die instead of making the cores themselves bigger. Moreover, for throughput-oriented and server workloads, simpler in-order processors that allow more cores per die and higher design frequencies are becoming the preferred choice. Unfortunately, for other workloads this type of cores result in a lower single thread performance. There are many workloads where it is still important to achieve good single thread performance. In this thesis we present the ReLaSch processor. Its aim is to enable high IPC cores capable of running at high clock frequencies by processing the instructions using simple superscalar in-order issue logic and caching instruction groups that are dynamically scheduled in hardware after commit, that is, out of the critical path and only when really needed. Objective This thesis has several research goals: • Show that the dynamic scheduler of a conventional out-of-order processor does a lot of redundant work because it ignores the repetitiveness of code. • Propose a complete superscalar out-of-order architecture that reduces the amount of redundant work done by creating the schedules once in dedicated hardware, storing them in a cache of schedules and reusing the schedules as much as possible. • Place the scheduler out of the critical path of execution, which should be enabled by the reduction of work that it must do. Thus, the execution path of our proposed processor can be simpler than that of a conventional out-of-order processor. Proposal and results We present the \textbf{ReLaSch} processor, named after Reused Late Schedules, in which the creation of issue-groups is removed from the critical path of execution and uses a simple and small in-order issue logic. It just wakes-up and selects the instructions of a single issue-group each cycle, instead of processing the instructions of a whole issue queue. A new logic at the end of the conventional pipeline schedules the committed instructions. The new scheduler can be complex since it is not in the critical path of execution. The schedules are cached and whenever it is possible an rgroup is read and its instructions executed. The schedules are reused, lowering the pressure on the scheduling logic. In some cases, the ReLaSch processor is able to outperform a conventional out-of-order processor, because the post-commit scheduler has a broader vision of the code. For instance, while ReLaSch can schedule together two independent instructions that are distant in the code, a conventional out-oforder processor only issues them in the same cycle if both are in-flight. The ReLaSch processor predicts the branch targets, memory aliases and latencies at scheduling time, out of the critical path. The prediction is based on the most recent executions at scheduling time. Furthermore, most of the register renaming process is performed by the scheduler and is removed from the execution pipeline. Our experiments show that ReLaSch has the same average IPC as our reference out-of-order processor and is clearly better than the reference inorder processor (1.55 speed-up). In all cases it outperforms the in-order processor and in 23 benchmarks out of 40 it has a higher IPC than the reference out-of-order processor

    Investigation of a simultaneous multithreaded architecture

    Get PDF
    Many enhancements have been made to the traditional general purpose load-store computer architectures. Among the enhancements are memory hierarchy improvements, branch prediction, and multiple issue processors. A major problem that exists with current microprocessor design is the disparity in the much larger increase in speed of the CPU versus the moderate increase in speed accessing main memory. The simultaneous multithreaded architecture is an extension of the single-threaded architecture that helps hide the performance penalty created by long-latency instructions, branch mispredictions, and memory accesses. Simultaneous multithreaded architectures use a more flexible parallelism, which takes advantage of both instruction-level, and thread-level parallelism. The goal of this project was to design, simulate, and analyze a model of a simultaneous multithreaded architecture in order to evaluate design alternatives. The simulator was created by modifying a version of the Simple Scalar toolset, developed at the University of Wisconsin. The simulations provide documentation for an overall system performance improvement of a simulta neous multithreaded architecture. In early simulation results, performed with the same number of functional units, an improvement in the number of instructions per cycle (IPC) of between 43% and 58% was found using four threads versus a single thread. The horizontal waste rate, which measures the number of unused issue slots, was reduced between 35% and 46%. The vertical waste rate, which measures the percentage- of unused issue cycles (no issue slots used in a cycle), was reduced between 46% and 61%. These results are derived from a set of four sample programs. It was also found that increasing the number of certain functional units did not improve performance, whereas increasing the number of other types of functional units did have a significant positive impact on performance

    A low-power cache system for high-performance processors

    Get PDF
    制度:新 ; 報告番号:甲3439号 ; 学位の種類:博士(工学) ; 授与年月日:12-Sep-11 ; 早大学位記番号:新576

    Exploiting Parallelism Between Control and Data Computation

    Get PDF
    Coordinated Science Laboratory was formerly known as Control Systems LaboratoryIntel Corporatio

    Soft-error resilient on-chip memory structures

    Get PDF
    Soft errors induced by energetic particle strikes in on-chip memory structures, such as L1 data/instruction caches and register files, have become an increasing challenge in designing new generation reliable microprocessors. Due to their transient/random nature, soft errors cannot be captured by traditional verification and testing process due to the irrelevancy to the correctness of the logic. This dissertation is thus focusing on the reliability characterization and cost-effective reliable design of on-chip memories against soft errors. Due to various performance, area/size, and energy constraints in various target systems, many existing unoptimized protection schemes on cache memories may eventually prove significantly inadequate and ineffective. This work develops new lifetime models for data and tag arrays residing in both the data and instruction caches. These models facilitate the characterization of cache vulnerability of the stored items at various lifetime phases. The design methodology is further exemplified by the proposed reliability schemes targeting at specific vulnerable phases. Benchmarking is carried out to showcase the effectiveness of these approaches. The tag array demands high reliability against soft errors while the data array is fully protected in on-chip caches, because of its crucial importance to the correctness of cache accesses. Exploiting the address locality of memory accesses, this work proposes a Tag Replication Buffer (TRB) to protect information integrity of the tag array in the data cache with low performance, energy and area overheads. To provide a comprehensive evaluation of the tag array reliability, this work also proposes a refined evaluation metric, detected-without-replica-TVF (DOR-TVF), which combines the TVF and access-with-replica (AWR) analysis. Based on the DOR-TVF analysis, a TRB scheme with early write-back (TRB-EWB) is proposed, which achieves a zero DOR-TVF at a negligible performance overhead. Recent research, as well as the proposed optimization schemes in this cache vulnerability study, have focused on the design of cost-effective reliable data caches in terms of performance, energy, and area overheads based on the assumption of fixed error rates. However, for systems in operating environments that vary with time or location, those schemes will be either insufficient or over-designed for the changing error rates. This work explores the design of a self-adaptive reliable data cache that dynamically adapts its employed reliability schemes to the changing operating environments in order to maintain a target reliability. The experimental evaluation shows that the self-adaptive data cache achieves similar reliability to a cache protected by the most reliable scheme, while simultaneously minimizing the performance and power overheads. Besides the data/instruction caches, protecting the register file and its data buses is crucial to reliable computing in high-performance microprocessors. Since the register file is in the critical path of the processor pipeline, any reliable design that increases either the pressure on the register file or the register file access latency is not desirable. This work proposes to exploit narrow-width register values, which represent the majority of generated values, for making the duplicates within the same register data item. A detailed architectural vulnerability factor (AVF) analysis shows that this in-register duplication (IRD) scheme significantly reduces the AVF in the register file compared to the conventional design. The experimental evaluation also shows that IRD provides superior read-with-duplicate (RWD) and error detection/recovery rates under heavy error injection as compared to previous reliability schemes, while only incurring a small power overhead. By integrating the proposed reliable designs in data/instruction caches and register files, the vulnerability of the entire microprocessor is dramatically reduced. The new lifetime model, the self-adaptive design and the narrow-width value duplication scheme proposed in this work can also provide guidance to architects toward highly efficient reliable system design

    Dynamically tunable memory hierarchy

    Get PDF
    Journal ArticleThe widespread use of repeaters in long wires creates the possibility of dynamically sizing regular on-chip structures. We present a tunable cache and translation lookaside buffer (TLB) hierarchy that leverages repeater insertion to dynamically trade off size for speed and power consumption on a per-application phase basis using a novel configuration management algorithm. In comparison to a conventional design that is fixed at a single design point targeted to the average application, the dynamically tunable cache and TLB hierarchy can be tailored to the needs of each application phase. The configuration algorithm dynamically detects phase changes and selects a configuration based on the application's ability to tolerate different hit and miss latencies in order to improve the memory energy-delay product. We evaluate the performance and energy consumption of our approach and project the effects of technology scaling trends on our design
    corecore