One of the most effective techniques for reducing a processor's power consumption is to reduce the supply voltage. However, reducing voltage in the presence of parameter variations can cause circuits to fail. As a result, voltage scaling is limited by a minimum voltage, often called Vccmin, below which circuits may not operate reliably. In this paper, we propose an architectural technique, based on instruction isolation, that enables a microprocessor to operate at low voltage while maintaining high-frequency operation. The instruction isolation scheme isolates the set of instructions that may not complete within the clock period at the scaled Vcc and avoids timing errors in those instructions by dynamically adapting the clock period. Compared with conventional voltage scaling, our scheme achieves an additional 13% power saving on average.
Introduction
Although today's microprocessors are much faster and far more versatile than their predecessors thanks to high-speed operation and parallelism, they also consume a great deal of power. The pipeline, which is the core of the microprocessor where all computations are performed, is one of the most power-hungry components in a processor and is often the location of hot-spots. Designing low-power pipelines for high-performance microprocessors is therefore becoming a challenging issue.
One of the most effective techniques for reducing the power consumed by a microprocessor is to reduce the supply voltage, since dynamic power is a quadratic function of the supply voltage (Vcc) and leakage is an exponential function of Vcc [1]. However, lowering Vcc also slows down the critical path of the circuit. If the supply voltage is reduced far enough, the critical path eventually becomes too slow to guarantee correct functionality of the chip. Therefore, Vccmin (the minimum voltage at which circuits can operate reliably) is a critical parameter that limits the voltage scaling of a design. Overcoming Vccmin limits allows designs to operate at lower voltages, reducing energy consumption and extending battery life for handheld and laptop products.
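The quadratic dependence of dynamic power on Vcc can be illustrated with a short calculation. The sketch below assumes the usual first-order model in which dynamic power is proportional to C * Vcc^2 * f and ignores leakage entirely; the function name and the example voltages are hypothetical and are not measurements from this work.

```python
def dynamic_power_ratio(v_scaled, v_nominal, f_scaled=1.0, f_nominal=1.0):
    """Ratio of scaled to nominal dynamic power under P ~ C * V^2 * f.

    Assumes switched capacitance and switching activity are unchanged,
    which is an idealization; leakage power is not modeled.
    """
    return (v_scaled / v_nominal) ** 2 * (f_scaled / f_nominal)


# Hypothetical example: scaling Vcc from 1.0 V to 0.8 V at constant frequency.
ratio = dynamic_power_ratio(0.8, 1.0)
print(f"dynamic power falls to {ratio:.0%} of nominal")  # 64%
```

Under this model, a 20% voltage reduction alone removes roughly a third of the dynamic power, which is why the limit imposed by Vccmin matters so much.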
The maximum clock frequency (Fmax) of a microprocessor is traditionally determined based on the maximum Vcc droop and the temperature specification. However, large current transients in the power delivery system, caused by abrupt changes in die-level switching activity, can produce instantaneous supply fluctuations such as Vcc droops and overshoots. The magnitude and duration of Vcc droops and overshoots depend on the interaction of capacitive and inductive parasitics at the board, package, and die levels with changes in current demand [2]. In addition, temperature varies with workload, environmental conditions, and the heat-removal capability of the package. These instantaneous supply fluctuations can cause circuit paths to be significantly slower or faster than expected. Therefore, an inherent Vcc guardband is required at nominal frequency and temperature to ensure correct functionality in the presence of dynamic Vcc and temperature variations.
Recently, timing-error detection and correction circuits have been proposed to monitor and recover from timing faults, enabling on-line testing of digital circuits in the presence of environmental influences and reliability concerns. These resilient circuits have been used to explore the elimination of Vcc and temperature guardbands [3] [4]. They allow aggressive voltage scaling: Vcc is scaled down further, and unexpected timing errors caused by environmental variations are corrected by recovery circuits.
This paper proposes a low-power and robust pipeline that, based on instruction isolation, provides an opportunity for aggressive voltage scaling while maintaining high-frequency operation. The technique rests on two facts: 1) different instructions exercise different pipeline stages, or different circuit paths within the same pipeline stage, resulting in different critical paths; and 2) lowering the operating clock frequency reduces the timing-error probability at a scaled Vcc, because timing errors are caused by the increased gate delay at the reduced Vcc. We isolate the set of instructions that may become critical and avoid timing errors in those instructions by dynamically adapting the clock period. This allows us to apply aggressive voltage scaling to a pipeline while maintaining the nominal clock frequency, such that off-critical instructions execute in one cycle while critical instructions are evaluated in two cycles. To our knowledge, this is the first paper to address instruction isolation as a means of aggressive voltage scaling while maintaining high-frequency operation.
In the remainder of this paper, we review previous work on aggressive voltage scaling for microprocessors (Section 2). We then introduce our instruction isolation technique and analyze its performance penalty in Section 3. Section 4 examines the error rates of a full-custom 72-bit adder and a synthesized ALU. In Section 5, we present the application of this technique to a pipeline design and evaluate it in terms of performance and power. Conclusions are drawn in Section 6.
Related Work
Wilkerson et al. investigated microprocessor caches in terms of the types of cell failures and the impact of Vcc on reliability, and proposed two architectural techniques that enable microprocessor caches to operate at scaled voltages despite very high memory cell failure rates [5]. The Word-disable scheme combines two consecutive cache lines to form a single cache line in which only non-failing words are used. The Bit-fix scheme uses a quarter of the ways in a cache set to store the positions of defective bits and the fix bits for failing bits in the other ways of the set. By adopting these two schemes, a 40% voltage reduction was achieved at the cost of sacrificing 50% and 25% of the cache capacity, respectively.
A self-adjusting voltage reduction circuit, which adjusts the Vcc of a digital circuit to the functional boundary for its speed requirements, was proposed to reduce energy consumption [6]. To find the minimum Vcc for the system, the authors implemented an equivalent critical path, a small circuit with electrical properties comparable to those of the actual critical path of the original circuit. Kuroda et al. extended this work to a microprocessor core in which Vcc is controlled by monitoring the propagation delay of a critical path within the chip, so that Vcc is set to the minimum voltage at which the chip can operate at a given clock frequency [7]. To detect the speed of the critical path, a critical path replica was implemented instead of an equivalent critical path.
Recently, researchers applied aggressive voltage scaling to linear time complexity adders that maintain high clock frequency even at scaled voltage. The idea is based on the fact that the critical paths of an adder are exercised rarely. The technique analyzes the set of critical paths of an adder and scales down Vcc such that non-critical paths can be computed without any delay failure at nominal frequency. For the critical paths, a clock stretching operation (i.e., the infrequent critical paths are evaluated in two clock cycles) is performed to prevent possible timing errors at the scaled voltage. Chen et al. applied voltage scaling to a carry-select adder based on the observation that the carry propagation through the MUX chain of carry select state is determined by the input vectors [8]. A carry length detect circuit detects long- and short-latency operations based on the input vectors, and the adder automatically works with one or two clock cycles of latency at the scaled Vcc, achieving a 44.4% power improvement.
In [9], different adder topologies were explored for possible use at high speed and scaled Vcc with variable-latency operation. Based on this analysis, the authors proposed hybrid adder designs that increase the timing slack of off-critical paths for aggressive voltage scaling. The main idea is to compute the intermediate carries faster by replacing the middle portion of the adder with a Kogge-Stone adder (a faster adder topology). Because the off-critical paths are optimized to make them faster, the supply voltage can be reduced further while maintaining similar yield, resulting in extra power savings.
Ghosh et al. proposed an adaptive pipeline design that is suitable for aggressive voltage scaling while maintaining high-frequency operation. This was achieved by critical path isolation, which selectively makes short paths faster and long paths slower to create a large timing slack between the set of long paths and the shorter paths [10]. At scaled voltage, critical path activation was predicted by pre-decoding a few primary inputs, so that possible timing errors were avoided by adaptively stretching the clock period to two cycles. They implemented a two-stage pipeline containing a 4-bit carry-look-ahead adder and a comparator, demonstrating additional power saving in the pipeline stage.
Even though critical path isolation allows aggressive voltage scaling for an adder, it may offer little additional power saving in parallel adder topologies, which are more popular in high-speed designs. According to our observation, a parallel adder is fast because it computes all paths in parallel, but this also reduces the timing slack between critical and off-critical paths in the adder. This limits the advantage of critical path isolation for the parallel adders that are used in microprocessors. Instead of critical path isolation, we propose instruction isolation, which adapts the clock frequency to individual instructions.
Instruction Isolation
In this section, we introduce instruction isolation, which enables aggressive voltage scaling for a pipeline while maintaining the nominal clock frequency. An informal and intuitive description of the operation is presented, and the performance penalty is analyzed.
Mechanism
Instruction isolation is based on the fact that different instructions exercise different parts of the circuit and have different critical paths. Thus, the dependence of the timing-error probability on Vcc also differs from instruction to instruction. Lowering the operating clock frequency reduces the timing-error probability at a scaled Vcc, because timing errors are caused by the increased gate delay at the reduced Vcc. Therefore, there is an opportunity for aggressive voltage scaling by using a different clock frequency for each instruction. However, switching the clock domain among different frequencies incurs additional hardware and control overhead, including additional latency for the clock domain transition. To simplify the implementation, a halved clock frequency is applied to the critical instructions, such that off-critical instructions execute in one cycle while critical instructions are evaluated in two cycles (halving the frequency ensures correct operation at low Vcc). This adaptive clocking allows aggressive Vcc scaling while maintaining the nominal clock frequency: off-critical instructions are executed without any performance degradation, and critical instructions are executed at half frequency to avoid timing failures. Figure 1 shows the time-space diagram of adaptive clocking for three pipelined instructions. Of these three instructions, the second generates a timing error at the scaled Vcc. Therefore, adaptive clocking is applied during the execution of the second instruction to keep the pipeline functionally correct. The second instruction is fired at cycle 2, but evaluated in cycle 4 by using the adaptive clock.
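The one-cycle versus two-cycle behavior described above can be sketched as a simple schedule. The model below is a deliberately simplified, single-stage view in which each instruction occupies one nominal cycle, or two if it is flagged as critical; the function name and the cycle numbering are assumptions for illustration and do not reproduce the exact timing of Figure 1.

```python
def clock_schedule(is_critical):
    """Return (first_cycle, last_cycle) in nominal cycles for each instruction.

    Simplified single-stage view (an assumption of this sketch):
    an off-critical instruction occupies one nominal clock cycle,
    a critical one occupies two because the clock is halved.
    Cycles are numbered from 1.
    """
    schedule = []
    cycle = 1
    for critical in is_critical:
        length = 2 if critical else 1
        schedule.append((cycle, cycle + length - 1))
        cycle += length
    return schedule


# Three instructions, the second one flagged as critical.
print(clock_schedule([False, True, False]))  # [(1, 1), (2, 3), (4, 4)]
```

Only the critical instruction is stretched; the instructions before and after it still run at the nominal clock period.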
Performance Analysis
We consider an N-stage linear pipeline in which control logic predicts the execution of instructions from the isolated instruction set, which might generate errors in the data-path under one-cycle operation at the scaled Vcc. The clock divider is enabled whenever a timing error is expected in any of the pipeline stages. Let p be the activation probability of isolated instructions and m be the number of pipeline stages that require halved-clock operation for proper functioning. To simplify the analysis, we assume that the probability of a timing violation is equal across the m pipeline stages that require halved-clock operation. Then, the probability of halved-clock operation in the pipeline in each clock cycle (p_total) is given by
The performance penalty for different values of m and different probabilities of isolated instructions is shown in Figure 2. From the equation, it can be seen that the penalty can be large for deep pipeline designs (i.e., large N), because deep pipelines tend to have well-balanced pipeline stages, resulting in an increased m. The upper bound of the performance penalty is 50%.
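The trend described above can be explored numerically. The sketch below is one possible model, not necessarily the expression used in this analysis: it assumes that each of the m stages independently holds an isolated instruction with probability p, so that p_total = 1 - (1 - p)^m, and that every halved-clock cycle costs exactly one extra nominal cycle, giving a penalty of p_total / (1 + p_total). Both assumptions and both function names are introduced here for illustration only. Under them the penalty approaches 50% as p_total approaches 1, which agrees with the stated upper bound.

```python
def p_total(p, m):
    """Probability that at least one of m stages needs the halved clock.

    Assumes the m stages are independent, each with probability p.
    """
    return 1.0 - (1.0 - p) ** m


def performance_penalty(p, m):
    """Fractional throughput loss if each halved-clock cycle costs one extra cycle."""
    pt = p_total(p, m)
    return pt / (1.0 + pt)


# Hypothetical sweep over the number of stages that need halved-clock operation.
for m in (1, 2, 4):
    print(m, round(performance_penalty(0.1, m), 3))
```

For a fixed p, the penalty grows with m, which matches the observation that deep, well-balanced pipelines suffer more.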
Error Rate Analysis
Adaptive clocking permits a microprocessor to tolerate circuit timing errors, thereby permitting operation at the scaled Vcc at the expense of reduced instruction throughput. As an initial step, we examined the error rates of a full-custom 72-bit adder and a synthesized ALU implemented in 45 nm technology. We used SPICE-level models to measure the error rates over a range of Vcc values.
Full Custom 72-bit Adder
To gain an understanding of the nature of circuit timing errors, a full-custom 72-bit adder that employs a sparse-tree architecture was analyzed [11]. Instead of generating the carry for each bit as in traditional Kogge-Stone approaches, the sparse tree generates every 4th carry.
One thousand randomly generated vectors were applied to the adder, and the error rate was computed as the fraction of sample vectors that do not complete within the clock period. Figure 3 illustrates the relationship between voltage and error rate for the 72-bit adder running with random input vectors at 3.57 GHz under different voltages and temperatures. At high Vcc, cell operating margins are large, leading to reliable operation. As the voltage drops, the adder circuits fail quite quickly, taking 40 mV to go from the point of the first error (0.7 V) to an error rate of nearly 99% (0.66 V) at a temperature of 50 C. From 0.64 V downward, none of the circuit paths can complete operation within the clock period, resulting in a 100% timing-error rate. The rapid increase in error rate is due to the sparse-tree architecture, which speeds up the adder critical path by moving a substantial portion of the carry-merge logic from the carry-tree to a non-critical side-path.
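The error-rate definition used above (the fraction of sample vectors that do not complete within the clock period) can be written down directly. In the sketch below, the per-vector delays are hypothetical placeholders standing in for SPICE-level results, and the function name is an assumption of this illustration.

```python
def error_rate(path_delays_ps, clock_ghz):
    """Fraction of input vectors whose evaluation exceeds the clock period.

    path_delays_ps: per-vector completion delay in picoseconds
                    (hypothetical values in this sketch).
    clock_ghz:      clock frequency in GHz.
    """
    period_ps = 1000.0 / clock_ghz  # 1 GHz corresponds to a 1000 ps period
    failures = sum(1 for delay in path_delays_ps if delay > period_ps)
    return failures / len(path_delays_ps)


# Hypothetical delays; at 3.57 GHz the clock period is about 280 ps.
delays = [250.0, 270.0, 285.0, 300.0]
print(error_rate(delays, 3.57))  # 0.5
```

Lowering Vcc shifts every delay upward, so the fraction above the period rises; when the delays are tightly clustered, as in a parallel adder, that rise is steep.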
As shown in Figure 4, each bank starts to fail at 0.7 V and contributes to the total timing error. The adder is fast because it computes all paths in parallel, but this also reduces the timing slack between critical and off-critical paths. Therefore, critical path isolation is not very effective for parallel adders, which are commonly used in microprocessors. Moreover, a full-custom ALU has well-balanced paths, where the critical paths are made faster and off-critical paths are slowed down.
Synthesized ALU
To gain a deeper understanding of the nature of circuit timing errors in an ALU, a synthesized ALU was implemented and evaluated. The processor used in our analysis is the original Pentium® processor¹. It is a 32-bit in-order 5-stage dual-pipeline processor supporting the IA32 instruction set [12]. The ALU was synthesized with an advanced-technology standard cell library for 1.2 GHz operation at 1.1 V. Figure 4 illustrates the error rate and power consumption of the ALU as a function of voltage. The analysis was performed with random vectors for the ADD, XOR, and AND instructions at 1.25 GHz and 110 C (we show the feasibility of instruction isolation by using these three instructions). As the supply voltage scales, the ADD instruction functions correctly down to 0.74 V, while the logical operations (XOR and AND) remain error-free down to 0.68 V. Conventional voltage scaling allows the Vcc of the ALU to be scaled to 0.74 V, saving 60% of total power consumption with correct operation. Isolating the ADD instruction allows the ALU to operate at 0.68 V by applying two-cycle operation for ADD and one-cycle operation for the other instructions, achieving additional power savings. For ADD, the ALU achieves an additional 23% power saving, since halving the operating frequency at the scaled Vcc also reduces power consumption at the cost of reduced performance. For the other instructions, instruction isolation saves an additional 7% of power consumption due to the down-shift of the scaled Vcc from 0.74 V to 0.68 V.
¹ Pentium® is a registered trademark of Intel Corporation or its subsidiaries in the United States and other countries.
Implementation
Figure 6 illustrates an application of instruction isolation in a generic in-order 5-stage pipelined machine whose stages are prefetch (PF), instruction decode (D1), address generate (D2), instruction execute (EX), and write back (WB).
Pipeline Design
A decoder circuit is integrated into the pipeline for instruction isolation. The decoder monitors the control signals that define the operations of the ALU and predicts the execution of instructions from the isolated instruction set in the EX stage. A LUT is used to store the isolated instructions, and it can be predefined at design time or updated at run time. The hardware cost is a decoder and an n×w-bit LUT, where n is the number of isolated instructions and w is the width of the control signals (w is 6 bits in the reference pipeline). The HALF FREQUENCY signal is asserted whenever an isolated instruction is about to be executed in the EX stage, preventing possible timing errors in the ALU. Halving Fclk ensures correct operation during the execution.
To allow aggressive Vcc scaling, the concept of timing-error detection and correction circuits is extended. The resilient circuits eliminate the Vcc guardband for dynamic Vcc and temperature variations [4]. An error-detection sequential (EDS) circuit monitors timing faults for on-line testing of digital circuits in the presence of environmental influences and reliability concerns. When dynamic variations induce a timing error, the error is detected and corrected to maintain proper logic functionality. Error signals from the EDS circuits in each pipeline stage are propagated to the controller (CTRL) to replay the failed instruction, and are pipelined to the WB stage to invalidate erroneous data. CTRL determines the appropriate instruction to replay based on the pipeline-error signals. In a microprocessor, the instruction replay circuit could leverage the existing replay design used to recover from a branch misprediction [13]. If a pipeline-error signal transitions to logic-HIGH, the CTRL signals the clock divider to halve Fclk while maintaining a constant high clock phase delay for min-delay protection.
After the replayed instruction finishes, the CTRL sends a VALID signal to validate the output data and signals the clock divider to resume at the target Fclk. Since the number of recovery cycles increases linearly with the number of pipeline stages, the average error recovery penalty of a microprocessor is expected to increase linearly as well.
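The decoder and LUT described above can be modeled behaviorally as a membership test on the ALU control code. The sketch below is a software illustration only, not the hardware design: the class name, the method names, and the 6-bit code used for ADD are hypothetical, since the actual control encoding is not given.

```python
def lut_size_bits(n_isolated, width_bits):
    """Storage cost of the LUT: n isolated instructions times w control bits."""
    return n_isolated * width_bits


class IsolationDecoder:
    """Behavioral model of the isolation decoder (names are hypothetical)."""

    def __init__(self, isolated_codes, width_bits=6):
        self.width_bits = width_bits
        self.lut = set()
        self.update(isolated_codes)

    def update(self, isolated_codes):
        """Replace the LUT contents, modeling a run-time update."""
        limit = 1 << self.width_bits
        codes = set(isolated_codes)
        if any(not 0 <= code < limit for code in codes):
            raise ValueError("control code does not fit in the LUT width")
        self.lut = codes

    def half_frequency(self, control_code):
        """True when the instruction entering EX is in the isolated set."""
        return control_code in self.lut


ADD_CODE = 0b000001  # hypothetical encoding, for illustration only
decoder = IsolationDecoder([ADD_CODE])
print(decoder.half_frequency(ADD_CODE), lut_size_bits(1, 6))  # True 6
```

Because the lookup depends only on the control signals, the prediction is available before the instruction reaches the EX stage, which is what allows the clock to be halved in time.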
Adaptive clocking is realized by using the clock divider and duty-cycle control circuits presented in Figure 6 [4]. An off-die signal generator with a differential pulse splitter creates the differential inputs CLKIN and CLKIN# (i.e., the inversion of CLKIN), which feed a differential amplifier that generates an intermediate clock signal. This intermediate clock signal and the output of the second MSFF are inputs to a logic-AND gate that produces the clock divider output. When the HALF FREQUENCY input is logic-LOW, the output of the second MSFF remains logic-HIGH, so CLK0 and CLKIN have the same frequency. When the HALF FREQUENCY input is asserted, the output of the second MSFF toggles every other cycle, enabling the clock divider circuit to skip every other high phase of CLKIN. The duty-cycle control is performed with a logic-AND of CLK0 and a delayed CLK0#, with CLK as the output. The delayed CLK0# determines the CLK high phase delay, as controlled via scan bits. With this duty-cycle control circuit, the CLK high phase delay remains constant at both high and low Fclk values, which is essential for min-delay protection. The CLK output is distributed throughout the pipeline.
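The gating behavior of the clock divider can be approximated at cycle granularity. The sketch below is a behavioral abstraction, not a circuit model: it represents the output of the second MSFF as a gate bit that stays high while HALF FREQUENCY is low and toggles once per CLKIN cycle while it is asserted. Which high phase is skipped first is an assumption of this sketch, and the duty-cycle control is not modeled.

```python
def clock_divider(half_frequency):
    """Per CLKIN cycle, return 1 if CLK0 passes the high phase, 0 if it is skipped.

    half_frequency: sequence of 0/1 values, one per CLKIN cycle.
    Assumption: the first high phase after assertion is the one skipped.
    """
    gate = 1  # models the MSFF output that is ANDed with the clock
    clk0 = []
    for half in half_frequency:
        if half:
            gate ^= 1  # toggle while HALF FREQUENCY is asserted
        else:
            gate = 1   # pass every high phase at nominal frequency
        clk0.append(gate)
    return clk0


# HALF FREQUENCY asserted for four CLKIN cycles in the middle of the trace.
print(clock_divider([0, 0, 1, 1, 1, 1, 0]))  # [1, 1, 0, 1, 0, 1, 1]
```

In the asserted window only every other high phase reaches CLK0, so the effective frequency is halved, and the nominal frequency resumes as soon as the signal is released.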
Simulation Methodology
In our experiments, we simulate nine categories of benchmarks. For each individual benchmark, we carefully select multiple sample traces that represent the benchmark behavior well. The benchmark samples were generated by using a cycle-accurate, execution-driven simulator running IA32 binaries, such that all instructions using the adder are recorded. The simulator is micro-operation (µOP) based, executes both user and kernel instructions, and models a detailed memory subsystem. Table 2 lists the number of traces and example benchmarks included in each category. We use instructions per cycle (IPC) as the performance metric. The IPC of each category is the geometric mean of the IPC of all traces within that category. We then normalize the IPC of each category to the baseline to show performance.
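The per-category metric described above can be computed as follows. This is a minimal sketch: the IPC values are hypothetical placeholders, the function names are assumptions, and normalizing by the geometric mean of the baseline traces of the same category is one reasonable reading of the normalization step.

```python
import math


def geometric_mean(values):
    """Geometric mean of positive values, computed in the log domain."""
    if not values:
        raise ValueError("at least one value is required")
    return math.exp(sum(math.log(v) for v in values) / len(values))


def normalized_ipc(trace_ipcs, baseline_ipcs):
    """Category IPC (geometric mean over traces) relative to the baseline."""
    return geometric_mean(trace_ipcs) / geometric_mean(baseline_ipcs)


# Hypothetical per-trace IPC values for one benchmark category.
baseline = [1.20, 0.90, 1.50]
isolated = [0.90, 0.70, 1.00]
print(round(normalized_ipc(isolated, baseline), 3))
```

The geometric mean keeps a single high-IPC trace from dominating the category result, which is why it is commonly preferred for normalized performance figures.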
Simulation Results
In this section, we evaluate the performance overhead of instruction isolation. Figure 7 shows the normalized IPC. Performance at the scaled Vcc is sensitive to the fraction of isolated instructions. With instruction isolation, IPC was reduced by 28% on average. The larger the IPC reduction, the greater the power saving achieved. However, this performance overhead is measured relative to normal operation. Moreover, because the low-voltage mode is normally used when the processor load is low, performance is not the primary concern. Figure 8 shows the power savings of conventional Vcc scaling (Baseline) and of instruction isolation compared with high-voltage operation. At the cost of the IPC reduction, instruction isolation yields an average of 13% extra power saving, demonstrating the feasibility of our scheme.
Conclusions
In this paper, we demonstrated how the minimum supply voltage (Vccmin) is a critical parameter that affects microprocessor power and reliability. We proposed a novel architectural technique that allows a processor to decrease its Vccmin to 680 mV while maintaining high-frequency operation. The instruction isolation scheme isolates the set of instructions that may become critical and avoids timing errors in those instructions by dynamically adapting the clock period. This allows us to apply aggressive voltage scaling to a pipeline while maintaining the nominal clock frequency, such that off-critical instructions execute in one cycle while critical instructions are evaluated in two cycles. The simulation results demonstrated an average of 13% extra power saving compared with conventional Vcc scaling (73% total power saving), while reducing performance by 28%.
Acknowledgment
