Razor [1-3] is a hybrid technique for dynamic detection and correction of timing errors. A combination of error-detecting circuits and micro-architectural recovery mechanisms creates a system which is robust in the face of timing errors, and can be tuned to an efficient operating point by dynamically eliminating unused guardbands.
Error detection in Razor is performed by dedicated circuits that check for late-arriving signals. Error correction is performed by the system, either through stall mechanisms or through instruction/transaction replay. Measurements on a simplified Alpha pipeline [2] showed 33% energy savings. In [3], the authors evaluated error-detection circuits on a 3-stage pipeline using artificially induced Vcc droops, showing a 32% throughput (TP) gain at the same Vcc, or a 17% Vcc reduction at equal TP.
This paper presents Razor applied to a processor with timing paths representative of an industrial design, running at frequencies over 1GHz, where fast-moving and transient timing-related events are significant. The processor implements a subset of the ARM ISA, with a micro-architecture that has balanced pipeline stages, resulting in critical memory-access and clock-gating enable paths. The design has been fabricated on a UMC [6] 65nm process, using industry-standard EDA tools, with an STA signoff frequency of 724MHz at the worst-case corner (0.9V/SS/125C). Silicon measurements on 63 samples, including split lots, show a 52% power reduction of the overall distribution for 1GHz operation. Error-rate-driven dynamic voltage scaling (DVS) and dynamic frequency scaling (DFS) schemes have been evaluated.
The micro-architecture is shown in Fig. 15.6.1. The pipeline is balanced using a combination of micro-architecture design and path equalization performed by backend tools, such that all stages have similar critical-path delay. The pipeline includes forwarding and interlock logic, which contributes to both data and control critical paths, including clock-gate enables and memory-access paths. Error recovery consists of flushing the pipeline and restarting execution from the next un-committed instruction. Razor stabilization stages, S0 and S1, delay instruction commit by two cycles. This allows the potentially metastable error signal from the ME stage to be synchronized. Forwarding paths prevent S0 and S1 from having any impact on IPC; the two stages add 2.4% power overhead.
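The flush-and-restart recovery described above can be illustrated with a toy cycle-level model. This is a minimal sketch, not the fabricated design: the function name, the fetch-to-commit depth of 6 cycles and the error-injection interface are all hypothetical, chosen only to show that a flagged error discards every un-committed instruction and that execution resumes from the oldest un-committed one.

```python
def run_with_replay(num_instructions, error_cycles, depth=6):
    """Toy in-order pipeline with flush-and-replay recovery.

    depth        -- hypothetical fetch-to-commit latency in cycles
                    (assumed to include the two stabilization stages).
    error_cycles -- set of cycle numbers at which a timing error is
                    flagged (hypothetical error injection).
    Returns (total_cycles, commit_order).
    """
    committed = []   # instruction indices, in commit order
    in_flight = []   # (instruction index, cycles left until commit)
    next_fetch = 0
    cycle = 0
    while len(committed) < num_instructions:
        cycle += 1
        if cycle in error_cycles:
            # Flush: drop everything that has not committed and restart
            # from the next un-committed instruction.
            in_flight = []
            next_fetch = len(committed)
            continue
        # Advance every in-flight instruction by one stage.
        in_flight = [(idx, left - 1) for idx, left in in_flight]
        # The oldest instruction commits once it reaches the end.
        if in_flight and in_flight[0][1] == 0:
            committed.append(in_flight.pop(0)[0])
        # Fetch one new instruction per cycle while any remain.
        if next_fetch < num_instructions:
            in_flight.append((next_fetch, depth))
            next_fetch += 1
    return cycle, committed


clean_cycles, clean_order = run_with_replay(3, set())
error_cycles_total, error_order = run_with_replay(3, {8})
print(clean_cycles, clean_order)
print(error_cycles_total, error_order)
```

In this model the commit order is unchanged by the error; only the cycle count grows, by the flushed cycle plus the time needed to refill the pipeline.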
The Transition-Detector (TD) (Fig. 15.6.2) detects errors by generating a pulse in response to a transition at the D input of a flip-flop (FF) and capturing this pulse within a window defined by a clock pulse (CP) generated from the rising clock edge. The sizing of the devices in the inverter and AND gates in the pulse generators determines the width (T_D) of the data pulse (DP). A delay on CK defines the width (T_CK) of the implicit CP, which is active when N1 and N2 are both on. Detection begins (ends) when the trailing (leading) edge of DP overlaps with the leading (trailing) edge of CP. The error-detection window is T_D + T_CK - 2T_OV, where T_OV is the minimum overlap required. The min-delay constraint is T_CK - T_OV, which is less than the high clock phase of previous designs [2, 3]. The trade-off is increased pessimism, as the point at which transitions are flagged as errors is moved earlier. For 1GHz operation, this pessimism corresponds to ~5% of the cycle time, relative to the point at which incorrect state starts to be latched.
In contrast to the RazorII FF [2], monitoring the input D, instead of the latch node, precludes the need for extra circuitry to suppress spurious error detection for genuine transitions.
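As a quick numerical check of the transition-detector timing relations, the following Python sketch evaluates the error-detection window (data-pulse width plus clock-pulse width minus twice the minimum overlap) and the min-delay constraint (clock-pulse width minus the minimum overlap). The function name and the pulse widths in the example are hypothetical values chosen for illustration, not measurements from the design, and the sanity check on the overlap is an added assumption.

```python
def td_timing(t_d, t_ck, t_ov):
    """Return (detection_window, min_delay), in the units of the inputs.

    t_d  -- width of the data pulse DP
    t_ck -- width of the implicit clock pulse CP
    t_ov -- minimum DP/CP overlap needed to register an error
    """
    # Assumed sanity check: a pulse narrower than the required overlap
    # could never be captured.
    if t_ov > min(t_d, t_ck):
        raise ValueError("minimum overlap exceeds a pulse width")
    detection_window = t_d + t_ck - 2 * t_ov
    min_delay = t_ck - t_ov
    return detection_window, min_delay


# Hypothetical pulse widths in picoseconds.
window_ps, min_delay_ps = td_timing(t_d=100, t_ck=80, t_ov=20)
print(window_ps, min_delay_ps)  # 140 60

# The ~5% pessimism quoted for 1GHz operation, expressed in picoseconds.
cycle_ps = 1000
pessimism_ps = 5 * cycle_ps // 100
print(pessimism_ps)  # 50
```

With these illustrative widths, widening DP extends the detection window without affecting the min-delay constraint, which depends only on the clock-pulse width and the minimum overlap.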
An error history (EHIST) diagnostic bit was added to each TD using an RS-latch, which is set whenever an error occurs. Reading out the EHIST bits allows identification of each TD that triggered over the course of a test. Simulation of a typical workload (WTYP) shows that the power overhead due to the TDs was 5.7% of the overall power, with a further 1.3% overhead due to min-delay buffers.
DFS experiments use an on-die Adaptive Frequency Controller (AFC), which adapts to dynamic workload variation by changing frequency in response to error rate. Figure 15.6.4 shows the AFC response for a workload with 3 phases: a NOP loop, a combined critical-path/power-virus loop (PV), and a typical workload (WTYP), running at a fixed 1V VDD. The highest frequency is measured in the NOP phase (1.2GHz) and the lowest in the PV phase (1GHz). In the WTYP phase, there are 4 distinct frequencies (1068 to 1143MHz), due to a wider range of paths being exercised compared to the synthetic test cases.
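The control law of the AFC is not given here, so the following Python sketch only illustrates one plausible error-rate-driven scheme: a simple threshold controller that steps the frequency down when the observed error rate exceeds a target and steps it up otherwise. The function names, the target error rate, the 25MHz step, the frequency limits and the step-function error model are all assumptions; the 1.2GHz and 1GHz figures are borrowed from the NOP and PV phases above purely as illustrative per-phase limits.

```python
def afc_step(freq_mhz, error_rate, target=0.001, step_mhz=25,
             f_min=800, f_max=1200):
    """One update of a hypothetical threshold frequency controller."""
    if error_rate > target:
        # Too many Razor errors: back off.
        return max(f_min, freq_mhz - step_mhz)
    # Error rate acceptable: reclaim margin.
    return min(f_max, freq_mhz + step_mhz)


def simulate(phases, start_mhz=1000):
    """Run the controller over workload phases.

    phases -- list of (limit_mhz, num_steps); errors are assumed to
              appear only when the frequency exceeds limit_mhz.
    """
    freq = start_mhz
    trace = []
    for limit_mhz, num_steps in phases:
        for _ in range(num_steps):
            # Assumed error model: no errors at or below the limit,
            # a 1% error rate above it.
            error_rate = 0.0 if freq <= limit_mhz else 0.01
            freq = afc_step(freq, error_rate)
            trace.append(freq)
    return trace


# NOP-like phase (limit 1200MHz) followed by a PV-like phase (1000MHz).
trace = simulate([(1200, 20), (1000, 20)])
print(trace[19], trace[-1])
```

A controller of this kind settles near the limit of each phase and dithers by one step around it, which is one reason a real implementation would likely add hysteresis or averaging of the error rate.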
Figure 15.6.5 shows the same 3-phase workload using an adaptive voltage controller at 1GHz for 3 samples. Using Razor with the worst-case PV code on the slowest (SS6) part requires 1.17V, while WTYP requires 1.07V, which is below the 1.1V overdrive limit of the process. Considering parametric yield implications, conventional margining without Razor requires operation above 1.2V (3% VDD margin over the point of first failure, PoFF) to achieve 100% yield at 1GHz, for reliable worst-case (WC) operation of SS6. This is unsustainable due to the power and wear-out implications of excessive overdrive. Figure 15.6.6 shows the comparison between a baseline of 1.2V and Razor-tuned voltages. The max-power for the 1.2V distribution is due to the FF5 part, and is 52% higher than the Razor distribution, with a spread of 37mW compared to 10mW.
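The "above 1.2V" figure for conventional margining can be checked with a short calculation. This sketch assumes that the 3% VDD margin is applied on top of the 1.17V that the SS6 part needs for the PV code, i.e. that 1.17V is treated as the point of first failure; that reading, and the variable names, are assumptions made for illustration.

```python
# Assumed PoFF for the slowest part (SS6) running PV code at 1GHz.
poff_v = 1.17
# Conventional VDD margin over PoFF, as quoted in the text.
margin = 0.03

margined_vdd = poff_v * (1 + margin)
print(round(margined_vdd, 4))  # 1.2051

# Overdrive limit of the process, as quoted in the text.
overdrive_limit_v = 1.1
print(margined_vdd > 1.2, margined_vdd > overdrive_limit_v)
```

Under this assumption the margined supply lands just above 1.2V, roughly 0.1V beyond the 1.1V overdrive limit.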
An alternative to dynamic adaptation is to discard slower parts or reduce the max-frequency specification. As 6 out of 22 of our TT lot samples require more than 1.1V for the PV, discarding slower parts would almost certainly impact yield. Reducing the clock frequency to a point where yield was not impacted would limit the operation frequency to 800MHz. For the same distribution Razor provides potential for an effective 100% yield point at 1GHz, with supply voltage kept at or below 1.1V for all devices, except for extremely rare use cases equivalent to the pathological WC PV code. The die photograph and implementation details are shown in Fig. 15 
