Abstract: Process variations and temperature variations can cause both the frequency and the leakage of a chip to deviate significantly from their expected values, thereby decreasing the yield. Adaptive Body Bias (ABB) can be used to pull the chip back into the nominal operating region. We propose using this technique to counter temperature variations as well as process variations. We present a CAD perspective on achieving process and temperature compensation using bidirectional ABB. Mathematical models are used to determine the exact amount of body bias required to optimize delay and leakage, and we provide an algorithmic flow that can be adopted for gigascale LSI systems.
I. INTRODUCTION
With technology scaling, process parameter variations and on-chip temperature variations have caused the delay and leakage of modern processors to deviate significantly from their desired values. Some dies may satisfy the delay constraint but leak too much, while others may leak nominally but fail to meet the target frequency. Thus, a significant fraction of otherwise functional dies may fail to achieve the performance goals. This has led to methodologies for post-silicon tuning to improve yield. Adaptive Body Bias (ABB) provides a viable control technique for countering the effects of on-chip variations.
Two significant contributors to on-chip variability are changes in process parameters and changes in the operating temperature. Process variations cause fluctuations in parameters such as transistor channel length, oxide thickness, and dopant concentration. These cause variations in the delay and leakage of the circuit, thereby affecting performance. On-chip temperature variations, on the other hand, change the mobilities of electrons and holes. An increase in the operating temperature decreases the mobilities, thereby decreasing the on-current (Ion), which in turn reduces the speed of the circuit. Elevated temperatures also increase the leakage current. On-chip variations can be categorized as lot-to-lot (L2L), wafer-to-wafer (W2W), die-to-die (D2D), and within-die (WID) variations [1].
Adaptive Body Bias (ABB) is a dynamic technique that helps tighten the distributions of maximum operating frequency and maximum leakage power in the presence of WID variations, and thereby helps improve the yield significantly. It was first proposed by Wann et al. in [2] and was further explored by Kuroda [3] in the design of a DSP processor. Bidirectional ABB has been shown to reduce the impact of D2D and WID parameter variations on microprocessor frequency and leakage in [1], [4], [5], [6]. Typically, devices that are slow but do not leak much can be forward body biased (FBB) to improve speed, whereas devices that are fast and leaky can be reverse body biased (RBB) to meet the leakage budget. The work in [4], [7] performs process-variation-based ABB and divides the die into a set of WID variational regions. In each region, test structures that replicate the critical path are built. The delay and leakage of these test structures are measured and used to determine the exact body bias values required to counter process variations at room temperature. Applying a WID-ABB technique for one-time compensation during the test phase, [4] shows that 100% of the dies can be salvaged, with 99% of them operating in the highest frequency bin.
Traditionally, ABB has been used only to compensate for process variations [4]-[6]. However, on-chip temperature changes can also significantly alter the delay and leakage of nanometer-scale devices, so the effects of these thermal variations must be mitigated as well. Only a limited amount of work has addressed this problem so far, such as [8], which focuses purely on temperature effects. In this work, we apply a combination of temperature-based ABB (TABB) and process-based ABB (PABB) to allow the circuit to recover from changes due to both temperature and process variations. To adaptively body bias all of our dies at all operating temperatures, we use an efficient self-adjusting mechanism that senses the operating temperature and dynamically regulates the voltages applied to the device bodies to meet the performance constraints. We propose a general architecture and an implementation scheme to achieve this.
The contribution of this paper is a strategy for determining the exact amount of bias required for process and temperature compensation through a combination of simulation, probabilistic design, and post-silicon tuning, in order to maximize the yield subject to frequency and leakage constraints. We term this method PTABB (process and temperature-based adaptive body bias). The final set of PTABB voltages that counters process and temperature variations at all operating conditions is a combination of PABB and TABB. We propose two methods to compute the TABB values: an enumeration-based method and a mathematical-model-based method. Enumeration-based TABB simulates the circuit at discrete points in the solution space and picks the best solution. In contrast, mathematically assisted TABB assumes a continuous search space and provides an exact solution using a model that captures the effect of body bias on delay and leakage, together with a simple nonlinear programming problem (NLPP) formulation. PABB can be performed by building test structures with critical path replicas in each WID variational region [4]. The exact amount of body bias needed to counter the effects of process variations at room temperature is determined by measuring the delay and leakage of the circuit and choosing the optimal solution.
The concept of using mathematical models to formulate expressions for delay and leakage, and thereby obtain exact solutions for the ABB voltages, is in itself a new and attractive approach. Compared with prior approaches that determine the required body bias at run-time by monitoring delay and leakage (listed in [9]), our scheme uses a simple look-up table (similar in concept to that used in [8]) that stores precomputed values, and hence requires only a temperature sensor to monitor on-chip temperature variations. This eliminates the need for circuits such as leakage current monitors, substrate charge injectors, and self-substrate-bias circuits, since the TABB voltages are determined at the design stage. Further, one-time compensation for process variations and run-time compensation for temperature variations are effectively combined. The generation of these additional body bias voltages and their distribution on chip are beyond the scope of this work. We present the algorithm, implementation, and results of this novel scheme in the subsequent sections.
II. CENTRAL IDEA
In this section, we present an overview of the proposed implementation. The die is partitioned into several WID variational regions, each of which is compensated separately. Our target technology uses a triple-well process, although the idea can be generalized to other processes. When the circuit is in operation, a temperature sensor detects changes in the on-chip temperature. The corresponding values of v_bn and v_bp are read from a ROM and fed as inputs to the central bias generator, which generates these voltages and distributes them to the NWELL and PWELL through the bias distribution network. The overall architecture is shown in Fig. 2.
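The run-time path described above (sensor reading, ROM lookup, bias generator) amounts to a table lookup. The sketch below is a minimal illustration, assuming the ROM is indexed by a few characterized temperatures and that a sensed temperature is rounded up to the next characterized point, so the applied FBB is never weaker than needed under negative temperature dependence. The names `ROM_TEMPS`, `ROM_BIAS`, and `lookup_bias` and the stored voltages are hypothetical placeholders, not values from this work.

```python
import bisect

# Hypothetical ROM contents for one WID region: characterized temperatures
# (degrees C) and the stored (v_bn, v_bp) pair for each. Placeholder values.
ROM_TEMPS = [25.0, 50.0, 75.0]
ROM_BIAS = [(0.0, 0.0), (0.2, 0.2), (0.5, 0.5)]


def lookup_bias(sensed_temp_c):
    """Return the (v_bn, v_bp) pair to apply at the sensed temperature.

    The reading is rounded UP to the next characterized temperature, so the
    stored FBB is never weaker than the actual temperature requires
    (assuming negative temperature dependence).
    """
    idx = bisect.bisect_left(ROM_TEMPS, sensed_temp_c)
    if idx == len(ROM_TEMPS):
        raise ValueError(f"{sensed_temp_c} C is above the characterized range")
    return ROM_BIAS[idx]


print(lookup_bias(42.0))  # between 25 C and 50 C, so the 50 C entry is used
```

Rounding up trades a little extra leakage between characterization points for guaranteed timing; a finer temperature grid reduces that overhead at the cost of more ROM entries.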
III. PTABB ALGORITHM

In this section, we present the algorithm that determines the body bias required for process and temperature variation compensation. Since we assume a triple-well process, the bodies of the NMOS and PMOS devices can be biased independently; however, the algorithm can easily be modified for a twin-well process. We present SPICE-calibrated models that express the delay and leakage in terms of the bias voltages, and we determine the optimal bias voltages based on operational constraints.
The effects of process and temperature variations on the delay of a combinational circuit can be represented as:

$$D = f(\mathbf{x}, T) \qquad (1)$$

where D is the delay of the circuit, x is a vector of process variables, and T is the operating temperature of the chip. Let x0 and T0 denote the values of the process and temperature variables under ideal conditions, where there is no variation. The increase in delay at any point (x, T) can be written as:

$$\Delta D = f(\mathbf{x}, T) - f(\mathbf{x}_0, T_0) \qquad (2)$$

where x is the vector of process variables of a particular die and T is its operating temperature. If x and T are independent variables, the effect of simultaneously varying x and T from (x0, T0) to (x1, T1) can be approximated by varying x and T individually from their original values and adding their effects, i.e.,

$$f(\mathbf{x}_1, T_1) - f(\mathbf{x}_0, T_0) \approx \left[ f(\mathbf{x}_0, T_1) - f(\mathbf{x}_0, T_0) \right] + \left[ f(\mathbf{x}_1, T_0) - f(\mathbf{x}_0, T_0) \right] \qquad (3)$$

where f(x0, T1) is the delay with temperature variations only, and f(x1, T0) is the delay with process variations only.
This independence assumption is justified because process and temperature variations act through different device-level mechanisms, so their impacts on delay can be treated as independent. Process variations affect parameters such as channel length, oxide thickness, and dopant concentration, thereby altering the delay, while temperature variations affect the mobilities of electrons and holes, which changes the on-current and hence the delay of the circuit. The results in Table I further support this assumption. The delay of a ring oscillator is measured through simulations using BPTM [10] 100 nm model files at T = 50°C and T = 75°C at the two extreme process corners: 1) the low-Vt corner, where process variations decrease the threshold voltages of both NMOS and PMOS devices by 10%; and 2) the high-Vt corner, where process variations increase both Vtn0 and Vtp0 by 10%. The column labeled Nom-Delay in Table I gives the delay at T = 25°C under ideal process conditions. The delay considering both process and temperature variations is shown in the column labeled Delay_PT, and the variation in delay calculated directly using (2) is shown in column 7. The columns labeled Delay_T and Delay_P list the delay considering temperature variations only and process variations only, respectively. The change in delay expressed as the sum of the changes due to process and thermal effects, using (3), is listed in column 8. The last column of the table shows that the difference between the two estimates is negligible compared with the actual circuit delay. Thus, we can indeed decompose the delay into a temperature-dependent term and a process-dependent term, and we use this finding to perform temperature compensation and process compensation independently of each other.
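To make the check in Table I concrete, the sketch below evaluates (2) and (3) on a toy delay model rather than BPTM/HSPICE. It assumes, hypothetically, that delay is a mobility factor (T/T0)^1.5 multiplied by an alpha-power-law drive term in Vt. For any such product model A(T)B(x), the gap between (2) and (3) is exactly the cross term (A(T1) - A(T0))(B(x1) - B(x0)), which is second-order small when both shifts are modest. The constants `ALPHA` and `MOBILITY_EXP` are illustrative assumptions; only the nominal Vtn0 comes from the text.

```python
# Toy delay model (assumption, not BPTM/HSPICE):
# delay = mobility factor(T) * alpha-power-law drive factor(Vt)
VDD = 1.0
ALPHA = 1.3          # hypothetical velocity-saturation index
MOBILITY_EXP = 1.5   # hypothetical mobility temperature exponent
T0_K = 298.15        # 25 C in kelvin
VT0 = 0.2607         # nominal NMOS threshold voltage from the text (V)


def delay(vt, temp_k):
    """Normalized delay for threshold vt (V) at temperature temp_k (K)."""
    mobility_factor = (temp_k / T0_K) ** MOBILITY_EXP  # lower mobility: slower
    drive_factor = VDD / (VDD - vt) ** ALPHA            # higher Vt: slower
    return mobility_factor * drive_factor


def decomposition_check(vt1, temp1_k):
    """Compare the direct delay change (2) with the additive split (3)."""
    d00 = delay(VT0, T0_K)
    direct = delay(vt1, temp1_k) - d00                              # (2)
    additive = (delay(VT0, temp1_k) - d00) + (delay(vt1, T0_K) - d00)  # (3)
    rel_err = abs(direct - additive) / delay(vt1, temp1_k)
    return direct, additive, rel_err


# High-Vt corner (+10%) at 75 C
direct, additive, rel_err = decomposition_check(1.1 * VT0, T0_K + 50.0)
print(f"direct={direct:.4f} additive={additive:.4f} rel_err={rel_err:.2%}")
```

With these assumed constants, the high-Vt corner at 75°C gives a mismatch of roughly 1% of the total delay, which is the same kind of comparison reported in the last column of Table I.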
A. Temperature Compensation
Generally, the delay of a circuit exhibits negative temperature dependence, i.e., the delay increases with temperature because the mobilities of electrons and holes decrease. Hence, the devices must be forward body biased to reduce the delay at higher operating temperatures, at the expense of leakage. However, at low Vdd, the reduction in Vt with temperature has a greater impact than the reduction in mobility, so an increase in temperature allows the circuit to operate faster. This effect, described as positive temperature dependence, can also be exploited for TABB, as described in [11]. In such cases, the devices may be less forward biased (or relatively reverse body biased) at higher temperatures to save leakage. We present two methodologies to determine the amount of FBB needed to meet the delay constraint while minimizing leakage, for the general case of negative temperature dependence.
B. Enumeration-Based TABB
The task of ABB compensation is to determine the optimal biases for the NWELL and PWELL that bring the delay back within specifications with minimal leakage overhead. The basic idea of enumeration is to traverse the entire search space to find this solution. Since it is infeasible to evaluate the delay and leakage over all possible values of v_bn and v_bp, we discretize the voltage levels and enumerate over a limited set of values. The maximum FBB that can be applied is limited by the turn-on voltage of the source-substrate junction diode and is process-dependent. The minimum voltage resolution is set by the designer and is constrained by the bias generation network. A method for determining the optimal values is shown in Algorithm 1. We wish to operate the circuit at the highest possible frequency, so the target delay of the circuit, D*, is determined by a simulation at the nominal temperature. Since we assume negative temperature dependence, the delay of the circuit at a higher operating temperature exceeds D*, which calls for FBB. The circuit is first simulated at the upper bound of the search space, (v_bn,max, v_bp,max)¹, to determine whether maximum FBB can pull the circuit delay back to D*. If even the maximum applicable bias fails to meet the target delay, the operating frequency of the circuit block must be reduced. Otherwise, we take this point as the initial solution and search the space for better solutions, since (v_bn,max, v_bp,max) is overkill in terms of leakage. The circuit is simulated at each bias pair, and the feasible solution with the minimum leakage is chosen. If the final leakage of the block still exceeds the allocated budget, the operating frequency is reduced (thereby increasing D*), and the enumeration is repeated.
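The enumeration steps described above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' code: it assumes a hypothetical `simulate(v_bn, v_bp)` hook returning (delay, leakage) for one bias pair (in practice an HSPICE or STA run) and the [0 V, 0.5 V], 0.1 V grid used in Section IV. Returning `None` signals that the frequency must be reduced (D* relaxed) and the enumeration repeated. The `toy_sim` block is a hypothetical stand-in used only to exercise the function.

```python
import math
from typing import Callable, List, Optional, Tuple

Bias = Tuple[float, float]
# Hypothetical hook: simulate(v_bn, v_bp) -> (delay, leakage) for one bias pair.
Simulator = Callable[[float, float], Tuple[float, float]]


def bias_levels(v_max: float, step: float) -> List[float]:
    """Discrete bias levels 0, step, ..., v_max (inclusive)."""
    n = int(round(v_max / step))
    return [round(i * step, 6) for i in range(n + 1)]


def enumerate_tabb(simulate: Simulator, d_target: float, leak_budget: float,
                   v_max: float = 0.5, step: float = 0.1) -> Optional[Bias]:
    """Return the minimum-leakage (v_bn, v_bp) meeting d_target, or None when
    the frequency must be reduced (relax d_target and call again)."""
    d_init, leak_init = simulate(v_max, v_max)   # maximum FBB first
    if d_init > d_target:
        return None                               # even max FBB is too slow
    best, best_leak = (v_max, v_max), leak_init   # feasible but leakage overkill
    levels = bias_levels(v_max, step)
    for v_bn in levels:
        for v_bp in levels:
            d, leak = simulate(v_bn, v_bp)
            if d <= d_target and leak < best_leak:
                best, best_leak = (v_bn, v_bp), leak
    if best_leak > leak_budget:
        return None                               # leakage budget violated
    return best


def toy_sim(v_bn: float, v_bp: float) -> Tuple[float, float]:
    """Hypothetical toy block: FBB speeds it up and raises leakage."""
    delay = 1.2 * (1 - 0.25 * v_bn) * (1 - 0.15 * v_bp)
    leakage = math.exp(3.0 * v_bn + 2.0 * v_bp)
    return delay, leakage


print(enumerate_tabb(toy_sim, d_target=1.0, leak_budget=100.0))
```

The nested loop performs one simulation per grid point, i.e., the quadratic cost in the number of bias levels that motivates the mathematically assisted method.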
C. Mathematically assisted TABB
Enumeration over the entire two-dimensional search space to find the optimal bias pair is costly for large circuits, since in the worst case it requires a simulation at every bias pair, for a cost of O(n²), where n is the number of available bias voltages. Hence, we propose an efficient algorithm based on a simple nonlinear programming problem (NLPP) that requires simulating the circuit for delay and leakage at only a few points to determine the exact body bias pair required. The crux of this method is as follows.

¹ The actual voltage applied to the body of the PMOS transistors is (Vdd - v_bp).
The delay and leakage of a circuit can be altered by applying a bias voltage v_bn to the body of the NMOS transistors and (Vdd - v_bp) to the body of the PMOS transistors. Since analytical expressions that quantify the effect of body bias on delay and leakage at the circuit level do not exist, we use polynomial best-fit curves to build these models. Simulation results show that second-order polynomials in v_bn and v_bp model both delay and leakage with reasonable accuracy. Thus we have the expressions:

$$D(v_{bn}, v_{bp}) = D_0 \, P_D(v_{bn}, v_{bp}) \qquad (4)$$

$$L(v_{bn}, v_{bp}) = L_0 \, P_L(v_{bn}, v_{bp}) \qquad (5)$$

where P_D and P_L are the second-order polynomial fits, normalized so that P_D(0, 0) = P_L(0, 0) = 1, and D0 and L0 are the delay and leakage values at the given operating temperature when process variation effects are ignored. Since there are two variables, v_bn and v_bp, it is desirable to model their effects individually, independently of each other, and then combine them. In other words, we wish to rewrite (4) as:

$$D(v_{bn}, v_{bp}) = D_0 \, f(v_{bn}) \, g(v_{bp}) \qquad (6)$$

We verified this decomposition on the delay of a ring oscillator (RO), and the results are shown in Fig. 3. The reference delays of the RO under body bias are measured through HSPICE simulations using BPTM 100 nm model files. The delay due to varying v_bn only (measured with the PMOS body tied to Vdd, i.e., v_bp = 0) is approximated by a second-order best-fit curve as

$$f(v_{bn}) = 1 + x_1 v_{bn} + x_2 v_{bn}^2 \qquad (7)$$

Similarly, the delay due to varying v_bp only (measured at v_bn = 0) is approximated by the polynomial g(v_bp) as

$$g(v_{bp}) = 1 + y_1 v_{bp} + y_2 v_{bp}^2 \qquad (8)$$

The new delay of the ring oscillator at any point (v_bn, v_bp) is calculated as D0 times the product of the polynomials f(v_bn) and g(v_bp). The difference between the reference values and these predicted delays, calculated at each point, is shown in Fig. 3. This difference is negligible, confirming the predicted trend. Hence (4) can be rewritten as

$$D(v_{bn}, v_{bp}) = D_0 \left(1 + x_1 v_{bn} + x_2 v_{bn}^2\right)\left(1 + y_1 v_{bp} + y_2 v_{bp}^2\right) \qquad (9)$$
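Because f and g are normalized quadratics, the separable delay model needs only five simulations: the zero-bias point plus two points on each axis, with each quadratic determined exactly by a 2x2 linear solve. The sketch below illustrates this, assuming a hypothetical `simulate_delay(v_bn, v_bp)` hook; the sample biases 0.25 V and 0.5 V and the toy delay function are illustrative choices, not the points or data used in this work.

```python
# Sketch: fit D = D0 * f(v_bn) * g(v_bp) from five simulations.
# `simulate_delay` is a hypothetical stand-in for an HSPICE/STA run.

def fit_quadratic_through_origin(v1, r1, v2, r2):
    """Solve r = c1*v + c2*v**2 for (c1, c2) from samples (v1, r1), (v2, r2)."""
    det = v1 * v2**2 - v2 * v1**2
    c1 = (r1 * v2**2 - r2 * v1**2) / det
    c2 = (v1 * r2 - v2 * r1) / det
    return c1, c2


def fit_separable_delay(simulate_delay, va=0.25, vb=0.5):
    d0 = simulate_delay(0.0, 0.0)  # zero-bias delay D0
    # f(v_bn): vary v_bn with the PMOS body at Vdd (v_bp = 0)
    x1, x2 = fit_quadratic_through_origin(
        va, simulate_delay(va, 0.0) / d0 - 1.0,
        vb, simulate_delay(vb, 0.0) / d0 - 1.0)
    # g(v_bp): vary v_bp with the NMOS body at ground (v_bn = 0)
    y1, y2 = fit_quadratic_through_origin(
        va, simulate_delay(0.0, va) / d0 - 1.0,
        vb, simulate_delay(0.0, vb) / d0 - 1.0)

    def model(v_bn, v_bp):
        f = 1.0 + x1 * v_bn + x2 * v_bn**2
        g = 1.0 + y1 * v_bp + y2 * v_bp**2
        return d0 * f * g

    return (x1, x2), (y1, y2), model


# Usage with a hypothetical, exactly separable toy delay (arbitrary units)
def toy_delay(v_bn, v_bp):
    return 100.0 * (1 - 0.30 * v_bn + 0.10 * v_bn**2) * (1 - 0.20 * v_bp + 0.05 * v_bp**2)


fx, gy, model = fit_separable_delay(toy_delay)
print(fx, gy, model(0.3, 0.4), toy_delay(0.3, 0.4))
```

Against an exactly separable toy delay the fit recovers the coefficients; on a real circuit, the residual between `model` and reference simulations plays the role of the difference plotted in Fig. 3.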
D. Process Compensation
To perform process compensation using ABB, a test structure consisting of a critical path replica is built in each WID variational region. In [4], PABB is performed by applying an NMOS bias (v_bn) from an off-chip source and automatically adapting the PMOS bias to meet the target frequency. The process is repeated for all possible values of v_bn, and the bias pair that results in the lowest leakage is chosen as the final solution. This scheme requires a 5-bit counter and a digital-to-analog converter (DAC) in the test structures to automatically determine the PMOS bias for each applied NMOS bias.
This methodology can be simplified by using external voltage sources to bias both the NWELL and PWELL, together with an NLPP formulation to determine the exact PABB values. The test structure then consists of only a critical path replica and a phase detector, as shown in Fig. 2. The NLPP formulation outlined in the previous subsection is employed to determine the exact PABB values; the coefficients in (5) and (9) are now determined by actual on-chip measurements instead of the circuit simulations used for TABB. Off-chip sources bias the wells, and the delay and leakage values are measured at a few points. The NLPP is formulated in the same manner as in (10) and (11), with D0 and L0 being the measured delay and leakage values of the WID variational region at nominal temperature. The NLPP is solved to obtain the optimal bias values, which are stored in the ROM as described in Fig. 1. When the circuit is in operation, these values are read from the ROM based on the output of the temperature sensor, and the corresponding bias values are applied to recover performance.
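The NLPP referred to as (10) and (11) is not written out in this section. A natural reading, assumed here, is: minimize L(v_bn, v_bp) subject to D(v_bn, v_bp) <= D* and 0 <= v_bn, v_bp <= v_max, with D and L the fitted models. The sketch below solves that problem under an extra assumption (delay falls and leakage rises monotonically with FBB on each well) by scanning v_bn densely and bisecting for the smallest feasible v_bp. It swaps in this boundary search for a general NLP solver; a constrained solver such as SciPy's `minimize` with `method="SLSQP"` could be used instead. All function names and the toy models are hypothetical.

```python
import math

# Sketch of the bias NLPP: minimize leakage L(v_bn, v_bp) subject to
# D(v_bn, v_bp) <= d_target and 0 <= v_bn, v_bp <= v_max.
# Assumes delay decreases and leakage increases with FBB on each well, so for
# a fixed v_bn the best v_bp is the smallest feasible one.


def min_feasible_vbp(delay_model, v_bn, d_target, v_max, tol=1e-9):
    """Smallest v_bp in [0, v_max] meeting d_target at this v_bn, or None."""
    if delay_model(v_bn, v_max) > d_target:
        return None                  # infeasible even at maximum PMOS FBB
    if delay_model(v_bn, 0.0) <= d_target:
        return 0.0                   # no PMOS bias needed
    lo, hi = 0.0, v_max              # invariant: lo infeasible, hi feasible
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if delay_model(v_bn, mid) <= d_target:
            hi = mid
        else:
            lo = mid
    return hi


def solve_bias_nlpp(delay_model, leak_model, d_target, v_max=0.5, n_grid=501):
    """Scan v_bn densely; on the delay boundary keep the minimum-leakage pair."""
    best = None
    for i in range(n_grid):
        v_bn = v_max * i / (n_grid - 1)
        v_bp = min_feasible_vbp(delay_model, v_bn, d_target, v_max)
        if v_bp is None:
            continue
        leak = leak_model(v_bn, v_bp)
        if best is None or leak < best[2]:
            best = (v_bn, v_bp, leak)
    return best  # (v_bn, v_bp, leakage), or None if no bias meets d_target


# Usage with hypothetical toy models (arbitrary units)
def toy_delay(v_bn, v_bp):
    return 1.2 * (1 - 0.25 * v_bn) * (1 - 0.15 * v_bp)


def toy_leak(v_bn, v_bp):
    return math.exp(3.0 * v_bn + 2.0 * v_bp)


print(solve_bias_nlpp(toy_delay, toy_leak, d_target=1.0))
```

For these toy models, the optimum lies at maximum NMOS FBB with roughly 0.32 V of PMOS FBB, a point between the 0.1 V levels that a discrete enumeration can only approximate.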
IV. RESULTS FOR ISCAS BENCHMARK CIRCUITS
In this section, we apply the design flow described above to 7 ISCAS combinational benchmark circuits and present the results. A static timing analyzer (STA) is implemented to determine the delay and leakage of the benchmark circuits. The library consists of 26 gates of different sizes (10 NOT, 5 NAND2, 5 NOR2, 3 NAND3, and 3 NOR3 gates) and has been characterized using HSPICE simulations with the BPTM 100 nm technology [10] at Vdd = 1.0 V. The benchmark circuits have been synthesized with this library using SIS [12]. Since each ISCAS benchmark is rather small, we consider a test case where all of the benchmarks are placed in different regions of the same chip. Specifically, we assume that each benchmark lies in a different WID variational zone and can be compensated independently of the others.
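The STA used here is not described further. As a minimal sketch of what such an analyzer computes, assuming each gate has a fixed delay and an input-state-independent leakage already looked up from the characterized library at the applied bias (load and slew dependence are ignored), the critical delay is a longest-path traversal in topological order and the leakage is a sum over gates. The netlist format and the `sta` function are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+


def sta(gates):
    """Longest-path delay and total leakage of a combinational netlist.

    gates: {name: (gate_delay, gate_leakage, [fanin names])}. Fanins that are
    not keys of `gates` are treated as primary inputs arriving at time 0.
    """
    # Predecessor map restricted to gates, as TopologicalSorter expects
    deps = {g: [f for f in fanins if f in gates] for g, (_, _, fanins) in gates.items()}
    arrival = {}
    for g in TopologicalSorter(deps).static_order():
        delay, _, fanins = gates[g]
        arrival[g] = delay + max((arrival.get(f, 0.0) for f in fanins), default=0.0)
    critical_delay = max(arrival.values())
    total_leakage = sum(leak for _, leak, _ in gates.values())
    return critical_delay, total_leakage


# Tiny hypothetical netlist with primary inputs a, b, c
netlist = {
    "n1": (1.0, 0.1, ["a", "b"]),
    "n2": (2.0, 0.2, ["b", "c"]),
    "out": (1.5, 0.3, ["n1", "n2"]),
}
print(sta(netlist))
```

Back-annotating a bias pair then amounts to re-looking up each gate's delay and leakage at that bias and re-running the traversal.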
A. Results of TABB
To determine the optimal amount of TABB required, we assume that there are no process variations and that the on-chip temperature varies from 25°C to 75°C. We choose T = 50°C and T = 75°C as the points at which we determine the optimal bias required to maintain the delay. The results obtained through the enumeration-based and mathematically assisted methods are explained below.

1) Enumeration-based TABB: We assume that the devices can be body biased within the range [0 V, 0.5 V] in steps of 0.1 V. A step of 0.1 V is chosen on the assumption that this is the finest voltage resolution the central body bias generator can produce. Thus, 6 voltage levels exist for each of v_bn and v_bp, giving 36 candidate solutions. The benchmarks are simulated at these points, and the solution that satisfies the delay constraint with the minimum leakage is chosen as the final optimal solution, based on Algorithm 1. The results are tabulated in Table II.

2) Mathematically assisted TABB: The NLPP is formulated as explained in (10) and (11), and the solutions obtained for the different temperatures are tabulated in Table II. The table shows that all benchmarks require FBB at higher operating temperatures to compensate for the increase in delay due to the reduction in mobilities. Further, most of the NLPP solutions, when snapped to the nearest discrete voltage levels, are identical to those obtained by enumeration. However, for C17 at T = 75°C, mathematically assisted TABB returns a solution that is one grid step higher in v_bn than the enumeration result, owing to slight inaccuracies in the delay-leakage model. For the same reason, for C432 at T = 50°C, mathematically assisted TABB returns a solution with lower leakage than enumeration, but it does not meet the delay requirement when back-annotated using STA.
Similarly, the solutions for C3540 and C5315 at T = 75°C are slightly inferior to the corresponding values obtained through enumeration. The final column in Table II compares the run-times of the two implementations, measured on a Linux workstation with a 2.8 GHz Pentium CPU. Mathematically assisted TABB is approximately two times faster than enumeration-based TABB at T = 50°C (with the exception of the smallest benchmark, C17), whereas enumeration outperforms it for most benchmarks at T = 75°C. This is because fewer bias pairs satisfy the delay requirement at T = 75°C, so the number of candidate solutions to enumerate is quite low. (At T = 75°C, only (v_bn, v_bp) = (0.5, 0.5) satisfies the delay requirement for C17, C880, and C1355, and enumeration is therefore more than three times faster than mathematically assisted TABB.) At T = 50°C, however, the search space for enumeration-based TABB grows, and the mathematically assisted method achieves a significant speed-up.
B. Results of PABB
While PABB is actually performed through post-silicon tuning, we emulate it here using statistical simulations to obtain an overview of the results achievable with our method. The test structure used to compensate for process variations in each WID variational zone is assumed to be a simple ring oscillator (RO). Simulations are performed on the ring oscillator using BPTM 100 nm model files [10]. The delay of the RO simulated at Vdd = 1.0 V and T = 25°C with nominal threshold voltages (Vtn0 = 0.2607 V and Vtp0 = -0.303 V) is 151 ps, while the leakage power is 5.253 nW. We wish to maintain the delay of the RO at this value, denoted by D*, despite process variations.

³ The delay and leakage numbers reported are the STA values obtained after back-annotating the bias voltages.
⁴ A minimum of 9 points is required for the leakage interpolation.
The effects of process variations on transistor threshold voltages are modeled using Gaussian distributions for Vtn0 and Vtp0. For simplicity, the statistical distribution of the threshold voltages is assumed to be the same in each WID variational region. This simplification lets us perform Monte Carlo simulations with one set of Gaussian distribution parameters and reuse the results for all benchmarks. To estimate the yield without adaptive body bias, Monte Carlo simulations with 50 runs are performed on the ring oscillator at each temperature. If the delay of the RO does not meet the target value, the die is assumed to fail the delay requirement. The number of dies that satisfy the delay requirement at each temperature is shown in Fig. 4. Only about 50% of the dies are acceptable at room temperature, and this number steadily decreases as the temperature rises. This is attributed to the threshold voltage shifts caused by process variations, which necessitate compensation using PABB. To determine the PABB voltages for each die, the delay and leakage of the test structure are characterized using the method described in Section III-D. The delay and leakage values under body bias are measured through simulations, and the second-order polynomials indicated in (10) and (11) are obtained. The NLPP is formulated and solved for each die to determine its optimal bias. All 50 dies have been successfully biased: 42 dies require RBB for PMOS and FBB for NMOS, 6 dies require RBB for both NMOS and PMOS, and the remaining 2 require FBB for both NMOS and PMOS. Most dies need FBB for PMOS to increase the speed and RBB for NMOS to minimize the leakage. This is consistent with the observation made by the authors in [1].
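The roughly 50% room-temperature yield has a simple explanation: the target D* equals the nominal delay, and with threshold voltages spread roughly symmetrically around nominal, about half the dies come out slower than nominal. The sketch below illustrates this with a hypothetical toy model, not BPTM: an alpha-power-law RO delay, Gaussian Vtn0 and |Vtp0| centered on the nominal values quoted above, an assumed 5% sigma, and a multiplicative `mobility_factor` standing in for higher temperature. Threshold voltages are drawn with `random.Random.gauss`; all constants other than the nominal Vt values are assumptions.

```python
import random

VDD = 1.0
ALPHA = 1.3                   # hypothetical alpha-power-law index
VTN0, VTP0 = 0.2607, 0.303    # nominal |Vt| values from the text (V); Vtp0 is -0.303 V
SIGMA_FRAC = 0.05             # hypothetical 1-sigma Vt spread, 5% of nominal


def ro_delay(vtn, vtp, mobility_factor=1.0):
    """Toy RO stage delay (arbitrary units): mean of NMOS- and PMOS-limited edges.

    mobility_factor > 1 models the slowdown from reduced mobility at high T.
    """
    t_fall = 1.0 / (VDD - vtn) ** ALPHA
    t_rise = 1.0 / (VDD - vtp) ** ALPHA
    return mobility_factor * 0.5 * (t_fall + t_rise)


def delay_yield(n_dies, mobility_factor=1.0, seed=0):
    """Fraction of Monte Carlo dies whose RO delay meets the 25 C nominal target."""
    rng = random.Random(seed)
    d_target = ro_delay(VTN0, VTP0)           # D* = nominal delay at 25 C
    passing = 0
    for _ in range(n_dies):
        vtn = rng.gauss(VTN0, SIGMA_FRAC * VTN0)
        vtp = rng.gauss(VTP0, SIGMA_FRAC * VTP0)
        if ro_delay(vtn, vtp, mobility_factor) <= d_target:
            passing += 1
    return passing / n_dies


for label, mf in [("25 C", 1.00), ("hotter", 1.03), ("hottest", 1.06)]:
    print(label, delay_yield(2000, mf))
```

Any temperature-driven slowdown shifts the whole delay distribution past D*, which is why the pass fraction falls as the mobility factor grows.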
The PABB values can be combined with the TABB data obtained in the previous subsection to determine the PTABB values required for each benchmark at each operating temperature, according to (12). The fraction of dies that meet the delay requirement at T = 50°C and T = 75°C for each benchmark circuit, and the type of bias required, are shown in Table III. Although not all of the dies can be recovered at T = 75°C, the yield can be significantly improved.
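Equation (12) is not reproduced in this section. Given the additive decomposition in (3), one plausible reading, assumed here rather than taken from the text, is that each ROM entry is the per-region PABB pair plus the per-temperature TABB pair, clipped to the allowed bias range. The sign convention (positive = FBB, negative = RBB), the 0.5 V limits, and the name `ptabb_table` are hypothetical.

```python
def ptabb_table(pabb, tabb_by_temp, v_fbb_max=0.5, v_rbb_max=0.5):
    """Build ROM entries {temp_C: (v_bn, v_bp)} for one WID region.

    Assumptions (hypothetical): PABB and TABB add linearly; positive values are
    FBB and negative values are RBB; sums are clipped to [-v_rbb_max, v_fbb_max].
    """
    def clip(v):
        return max(-v_rbb_max, min(v_fbb_max, v))

    return {temp: (clip(pabb[0] + t_bn), clip(pabb[1] + t_bp))
            for temp, (t_bn, t_bp) in tabb_by_temp.items()}


# Usage: a die needing NMOS FBB and PMOS RBB at room temperature (hypothetical values)
rom = ptabb_table((0.2, -0.1), {25: (0.0, 0.0), 50: (0.2, 0.2), 75: (0.5, 0.5)})
print(rom)
```

Clipping at the FBB limit is one way a die can end up unrecoverable at 75°C: once the combined request exceeds the diode-limited maximum, the remaining delay cannot be recovered by body bias alone.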
V. CONCLUSION

Temperature variations and process variations in nanometer-scale devices can cause the delay and leakage of dies to vary significantly. Bidirectional adaptive body bias can be used to improve the yield of dies over reasonable ranges of operating temperature. We propose an algorithm to compute the exact amount of body bias required for run-time compensation of thermal variations. We determine these bias values at the design stage using mathematical models and thereby eliminate the need for complex on-chip circuitry to monitor delay and leakage. We also present a unique methodology
