Dynamic Scheduling with Partial Operand Values
Tomasulo’s algorithm creates a dynamic execution order that extracts a high degree of instruction-level parallelism from a sequential program. Modern processors create this schedule early in the pipeline, before operand values have been computed, since present-day cycle-time demands preclude inclusion of a full ALU and bypass network delay in the instruction scheduling loop. Hence, modern schedulers must predict the latency of load instructions, since load latency cannot be determined within the scheduling pipeline. Whenever load latency is mispredicted due to an unanticipated cache miss or store alias, a significant amount of power is wasted on incorrectly issued dependent instructions that are already traversing the execution pipeline. This paper exploits the prevalence of narrow operand values (i.e., ones with fewer significant bits) to solve this problem by placing a fast, narrow ALU and datapath within the scheduling loop. Virtually all load latency mispredictions can be accurately anticipated with this narrow datapath, and little power is wasted on executing incorrectly scheduled instructions. We show that such a narrow-datapath design, coupled with a novel partitioned store queue and pipelined data cache, can achieve a cycle time comparable to conventional approaches, while dramatically reducing misspeculation, saving power, and improving per-cycle performance. Finally, we show that due to the rarity of misspeculation in our architecture, a less-complex flush-based recovery scheme suffices for high performance.
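As a rough illustration of the scheduling-loop check this abstract describes, the C sketch below assumes a 16-bit narrow datapath; the names `is_narrow`, `narrow_path_confirms_schedule`, and `narrow_probe` are illustrative and not taken from the paper.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical narrow-datapath width; the paper's actual width may differ. */
#define NARROW_BITS 16

/* A 64-bit value is "narrow" when its upper bits are just copies of the
 * sign bit, i.e. it has at most NARROW_BITS significant bits. */
static bool is_narrow(int64_t v)
{
    return v == (int64_t)(int16_t)v;
}

/* Sketch of the scheduling-loop check: when a load's address operands are
 * narrow, a small, fast ALU can form the effective address inside the
 * scheduler and probe a narrow tag/alias structure, so an unanticipated
 * cache miss or store alias (a load-latency misprediction) is caught
 * before dependent instructions issue.  narrow_probe is a hypothetical
 * callback standing in for that lookup. */
static bool narrow_path_confirms_schedule(int64_t base, int64_t offset,
                                          bool (*narrow_probe)(uint64_t addr))
{
    if (!is_narrow(base) || !is_narrow(offset))
        return true;  /* wide operands: keep the conventional latency prediction */
    return narrow_probe((uint64_t)(base + offset));
}
```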
CRIB: Consolidated Rename, Issue, and Bypass
Conventional high-performance processors utilize register renaming and complex broadcast-based scheduling logic to steer instructions into a small number of heavily pipelined execution lanes. This requires multiple complex structures and repeated dependency resolution, imposing a significant dynamic power overhead. This paper advocates in-place execution of instructions, a power-saving, pipeline-free approach that consolidates rename, issue, and bypass logic into one structure, the CRIB, while simultaneously eliminating the need for a multiported register file, instead storing architected state in a simple rank of latches. CRIB achieves the high IPC of an out-of-order machine while keeping the execution core clean, simple, and low power. The datapath within a CRIB structure is purely combinational, eliminating most of the clocked elements in the core while keeping a fully synchronous yet high-frequency design. Experimental results match the IPC and cycle time of a baseline out-of-order design while reducing dynamic energy consumption by more than 60% in affected structures.
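The C sketch below is only a software analogy of the in-place, combinational dataflow the abstract describes: register values enter from a rank of latches, and each entry's result replaces that register's value for all younger entries, standing in for consolidated rename and bypass. The 4-entry size, two-source ALU format, and all names are assumptions, not the paper's design.

```c
#include <stdint.h>
#include <stdbool.h>

#define NUM_REGS     8   /* illustrative architected register count */
#define CRIB_ENTRIES 4   /* illustrative CRIB size */

/* One CRIB entry: a simple ALU op reading two architected registers and
 * writing one.  Real entries also cover loads, stores, ready bits, etc. */
typedef struct {
    bool valid;
    uint64_t (*op)(uint64_t a, uint64_t b);
    int src1, src2, dst;
} crib_entry_t;

/* Model of the purely combinational datapath: register values enter from
 * the latch rank (arch_in), and each entry that writes a register replaces
 * that register's value for all younger entries above it, which plays the
 * role of rename plus bypass.  The final values are latched back into the
 * architected state (arch_out) when the group commits. */
static void crib_evaluate(const crib_entry_t entries[CRIB_ENTRIES],
                          const uint64_t arch_in[NUM_REGS],
                          uint64_t arch_out[NUM_REGS])
{
    uint64_t regs[NUM_REGS];
    for (int r = 0; r < NUM_REGS; r++)
        regs[r] = arch_in[r];

    for (int i = 0; i < CRIB_ENTRIES; i++) {   /* oldest to youngest */
        if (!entries[i].valid)
            continue;
        uint64_t result = entries[i].op(regs[entries[i].src1],
                                        regs[entries[i].src2]);
        regs[entries[i].dst] = result;          /* in-place bypass to younger ops */
    }

    for (int r = 0; r < NUM_REGS; r++)
        arch_out[r] = regs[r];
}
```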
Narrow Width Dynamic Scheduling
To satisfy the demand for higher performance, modern processors are designed with a high degree of speculation. While speculation enhances performance, it burns power unnecessarily. The cache, store queue, and load queue are accessed associatively before a matching entry is determined, so a significant amount of power is wasted searching entries that are not picked. Modern processors speculatively schedule instructions before operand values are computed, since cycle-time demands preclude inclusion of a full ALU and bypass network delay in the instruction scheduling loop. Hence, the latency of load instructions must be predicted, since it cannot be determined within the scheduling pipeline. Whenever mispredictions occur due to an unanticipated cache miss, a significant amount of power is wasted by incorrectly issued dependent instructions. This paper exploits the prevalence of narrow operand values by placing fast, narrow ALUs, cache, and datapath within the scheduling loop. The results of this narrow datapath are used to avoid unnecessary activity in the rest of the execution core by creating opportunities to apply different energy reduction techniques. A novel approach for transforming the data cache, store queue, and load queue from associative (or set-associative) to direct-mapped saves a significant amount of power.
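A minimal sketch of the direct-mapped transformation the abstract mentions, assuming a 32-entry store queue indexed by low-order address bits; it glosses over collisions and age ordering, and all names are illustrative rather than the paper's.

```c
#include <stdint.h>
#include <stdbool.h>

#define SQ_ENTRIES 32   /* illustrative store-queue size (power of two) */

/* Illustrative direct-mapped store queue: instead of an associative search
 * of every entry, a few low-order address bits (available early from the
 * narrow datapath) select exactly one entry to check. */
typedef struct {
    bool     valid;
    uint64_t addr;
    uint64_t data;
} sq_entry_t;

static sq_entry_t store_queue[SQ_ENTRIES];

static inline unsigned sq_index(uint64_t addr)
{
    return (unsigned)((addr >> 3) & (SQ_ENTRIES - 1));  /* 8-byte granularity */
}

/* Store: claim the single entry selected by the address index. */
static void sq_insert(uint64_t addr, uint64_t data)
{
    sq_entry_t *e = &store_queue[sq_index(addr)];
    e->valid = true;
    e->addr  = addr;
    e->data  = data;
}

/* Load: probe only one entry; a single full-address compare confirms the
 * match, so only one comparator switches instead of SQ_ENTRIES of them. */
static bool sq_forward(uint64_t addr, uint64_t *data)
{
    const sq_entry_t *e = &store_queue[sq_index(addr)];
    if (e->valid && e->addr == addr) {
        *data = e->data;
        return true;
    }
    return false;
}
```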
Combating Aging with the Colt Duty Cycle Equalizer
Bias temperature instability, hot-carrier injection, and gate-oxide wearout will cause severe lifetime degradation in the performance and reliability of future CMOS devices. The design guardband to counter these negative effects will be too expensive, largely due to the worst-case behavior induced by the uneven utilization of devices on the chip. To mitigate these effects over a chip’s lifetime, this paper proposes Colt, a simple yet holistic scheme to balance the utilization of devices in a processor by equalizing the duty cycle ratio of circuits’ internal nodes and the usage frequency of devices. Colt relies on alternating true- and complement-mode operations to equalize the duty cycle ratio of signals (and thus the utilization of devices) in most datapath and storage devices. Colt also employs a pseudorandom indexing scheme to balance the usage of entries in storage structures that often exhibit highly uneven utilization of entries. Finally, an operand-swapping scheme equalizes utilization of the left and right operand datapaths. The proposed mechanisms impose trivial overhead in area, complexity, power, and performance, while recapturing 27% of aging-induced performance degradation and improving mean time to failure by an estimated 40%. Keywords: design; reliability; BTI; HCI; gate-oxide wearout
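A small C sketch of two of the mechanisms the abstract lists, under assumptions of my own: pseudorandom indexing realized as an XOR with a periodically updated key, and operand swapping for commutative operations. The key-update policy, the 64-entry size, and all names are illustrative, not Colt's actual design.

```c
#include <stdint.h>

#define TABLE_ENTRIES 64   /* illustrative storage-structure size (power of two) */

/* Pseudorandom indexing in the spirit of Colt: XOR the natural index with a
 * key that changes periodically, so logical entries map to different physical
 * entries over time and wear is spread across the structure. */
static unsigned scramble_key;   /* advanced once per epoch */

static inline unsigned colt_index(unsigned logical_index)
{
    return (logical_index ^ scramble_key) & (TABLE_ENTRIES - 1);
}

static inline void colt_advance_epoch(void)
{
    /* Simple linear-congruential update; the real update policy is an assumption. */
    scramble_key = (scramble_key * 37u + 11u) & (TABLE_ENTRIES - 1);
}

/* Operand swapping: for commutative operations, alternate which physical
 * operand path carries which source so the left and right operand datapaths
 * age at similar rates. */
static void maybe_swap_operands(uint64_t *a, uint64_t *b, unsigned cycle)
{
    if (cycle & 1u) {
        uint64_t tmp = *a;
        *a = *b;
        *b = tmp;
    }
}
```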