Variation Resilient Power Sensor With an 80-ns Response Time for Fine-Grained Power Management
Srikar Bhagavatula and Byunghoo Jung, Member, IEEE
Abstract-This paper presents real-time on-chip power and temperature sensors that provide fine-grained estimates for power consumption in systems-on-chip and also provide a concurrent temperature estimate for dynamic power and temperature management. This sensor occupies an area of 0.01 mm² in a 0.13-μm CMOS technology. With a simplified one-point calibration and a response time of 80 ns, it shows improvements in input dynamic range by ten times, response time by six times, and sensitivity by three times over previous such sensors. A compact low-power current reference with a 3-σ max TC of 127 ppm/°C and a line regulation of 1%/V is also presented, which is used in a replica-structure-based online calibration scheme, resulting in improved tolerance to variations in process, voltage, and temperature (PVT) and degradation due to supply noise as well as aging.
Index Terms-Calibration, low power, power management, power sensor, temperature sensor, voltage-to-frequency converters.
I. INTRODUCTION

The increasing prevalence of smartphones, laptops, and remote sensors has brought energy conservation to the forefront of tomorrow's electronics, while increasing computational requirements coupled with scaling technologies have resulted in a continued rise in the density of thermal power dissipation. The resulting increases in cooling costs and reliability concerns have underlined the importance of smart power and thermal management, especially in microprocessor systems. Power management techniques such as power gating, clock gating, dynamic voltage and frequency scaling, and dynamic thermal management have been proposed to meet throughput and reliability targets while reducing overall power consumption. As Fig. 1 shows, sensors for estimating power occupy a crucial place in this power management system. Accurate real-time power estimates enable the system to make better scheduling decisions in multicore environments and to better predict the optimal number of threads required to meet job deadlines subject to a power criterion.
As methods for direct online measurement of power have been unavailable, indirect methods such as performance counters [1]-[3] or temperature sensors have had to be used as proxies for estimating power [4]. Hardware performance counters (HPCs) are sets of registers built into the microprocessor to count performance events such as instructions executed per cycle, data dependencies, or cache misses, whose outputs are fit to linearized power models that are architecture dependent [1]. Estimation accuracy also depends on the choice of counters and on the representative benchmarks used to formulate relationships between HPCs and actual power consumption. These representative expressions often vary with the architecture and the processor state (voltage and frequency). Process variations are also unaccounted for in HPC-based models unless the weights are recalculated for each processor state in each chip, which incurs significant testing overhead. These HPCs are only accurate in estimating power averaged over 10 000 or more cycles, whereas the errors in estimating dynamic power consumption can be as high as 40% [2], [3]. In addition, variation in on-chip temperature can cause significant errors in the power values estimated from performance-counter-based models, and an on-chip temperature sensor often becomes essential for correcting these errors.
As a part of the power consumed is dissipated as heat, a rise in chip temperature can also be used as a proxy measure for estimating power consumption. The thermodynamics of this system is accurately modeled as a series of heat sources feeding into an RC network [4]. The rise and fall in temperature therefore mimic RC charging and discharging and can have time constants on the order of milliseconds [5]. This results in an inherent lag between the occurrence of a power event and the availability of that information in the form of a temperature change. In addition, most reported temperature sensors have conversion times that vary from tens of microseconds to a few milliseconds [6]-[9]. Hence, the overall system response time when using temperature sensors is very poor, especially compared with microprocessors that run on sub-1-ns clock periods. Moving to multicore environments further complicates the mapping of temperature readings to power dissipation because heat spreads laterally across the chip surface. Using software models to estimate power and then converting these estimates into a 3-D thermal model has been studied [10]. Techniques to improve the speed of convergence to a mathematical solution have also been explored [11]. However, these methods still rely on indirect software-based power estimates and are susceptible to ambient variations. Moreover, these techniques operate in time steps on the order of hundreds of milliseconds, which are at least six orders of magnitude too coarse.
Separate digital dynamic power meters and temperature sensors have been implemented in multicore environments for fine-grained power management [12]. However, separate circuits for temperature and power increase the area overhead, and the power meter still accounts only for dynamic power, not for static power consumption. A current sensor described in [14] introduces a small resistance in series with the current path and samples the voltage drop across the resistor to estimate power accurately. However, it has a poor temporal resolution (hertz to kilohertz) and a large area overhead (1 mm²). A sensor with a fast response time and a low area overhead was presented in [13]. However, the estimation required measurement of on-chip temperature, and even with a two-point calibration, estimates may be vulnerable to aging defects. It also has a limited dynamic range, which, although sufficient for the estimation of average power, may not be sufficient to estimate spikes in power consumption. In addition, the methods described in [13] and [14] require an external current source for calibration, which increases the test complexity.
A replica-sleep-transistor-based on-chip power sensor with a novel online calibration scheme was partially reported in [15]. This paper presents a significant extension of the report in [15], including a theoretical analysis of the effects of nonidealities on the sensor output and the role of online calibration in mitigating these effects. Additional experimental data are presented to support this analysis. An improved scheme for reference current generation is presented and evaluated by simulation, and these results are compared with some of the state-of-the-art current reference circuits. The rest of this paper is arranged as follows. Section II introduces the architecture of the proposed power sensor along with the on-chip calibration current source. Section III describes the calibration scheme. Section IV outlines the theoretically predicted effects of nonidealities on sensor output. Section V includes the experimental results, followed by the conclusion in Section VI.
II. ARCHITECTURE
A. Sensor With Replica Structure
Power gating has become a necessary design technique to reduce the effect of ever-increasing leakage power on system performance. Large transistors known as sleep transistors are used to shut down voltage supply rails to inactive circuits. However, when the circuits are active, the virtual supply rails must be within tens of millivolts of the V_dd and ground supply planes to ensure that the circuit performance does not degrade significantly. Hence, the sleep transistors operate in the triode region with low V_DS. In addition, decoupling capacitors (roughly ten times the switching capacitance of the circuit, BPC1 in Fig. 2) are added to these virtual supply rails as part of the power supply network to improve load regulation. As V_DS across a transistor operating in the triode region is proportional to the current flowing through it, we utilize this voltage drop to sense the current being drawn by the circuit under test (CUT). Fig. 2 shows the circuit schematic of the proposed power sensor, which includes a mechanism to recalibrate the estimated current according to ambient effects such as PVT variations, supply noise, and aging degradations. The basic structure of the sensor reported in [13] is retained. The sensor can be understood as a cascade of three blocks.
1) The sleep transistor M1 generates an IR drop (load current times on-resistance) that is proportional to the current flowing through it.
2) This voltage is converted into a current I_chg by a transconductance stage (Gm-stage), which consists of a source follower M3 followed by two source-degenerated common-source stages (M5 and M9).
3) The current generated by this Gm-stage (the drain current of M9) serves as the input to a current-controlled oscillator (CCO).

In this design, the CCO consists of a capacitor C_a that is initially charged to V_dd and then slowly discharged by M9 at a rate of I_chg/C_a. When the voltage at C_a reaches the inverter threshold voltage, the inverter output flips, triggering a ripple through the delay chain, which is then used to quickly recharge the capacitor to V_dd through a strong transistor M11. The output at the end of the delay line drives a T flip-flop to generate a periodic waveform with a 50% duty cycle whose frequency is directly proportional to the input current. The delay through this chain of inverters is kept small by design, so that the period of the voltage waveform at capacitor node C_a is determined primarily by the current output of the Gm-stage.
Hence, the size of capacitance C_a is driven by a tradeoff between lower power and area overheads on the one hand (favoring a smaller C_a) and the sensor's linearity on the other hand, since parasitic capacitances at this node must be significantly smaller than C_a. For such a design, the output time period of the sensor tp_m relates to the load current I_load flowing through M1 as follows:

$$\frac{1}{tp_m} = \frac{1}{tp_{z0}} + A_f \, I_{load} \tag{1}$$

where tp_z0 is the zero error, measured as the output time period at zero-load current. The frequency of the output signal increases linearly with the load current, with the current-to-frequency gain denoted by A_f. Therefore, the output pulses are shorter at higher load currents. This property can be used to obtain a faster response time at a given accuracy or to improve the accuracy at a given response time as the current load increases.
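The relaxation behavior of the CCO described above can be sketched numerically. This is a minimal Python illustration, not the authors' model: it assumes an ideal constant discharge current, an inverter that trips at a fixed threshold, and a fixed delay-plus-recharge time per ramp. The function name and all component values are hypothetical.

```python
def cco_output_period(i_chg, c_a, v_dd, v_trip, t_delay):
    """Output period (s) of an idealized relaxation CCO followed by a T flip-flop.

    i_chg   : discharge current from the Gm-stage (A)
    c_a     : timing capacitor (F)
    v_dd    : level the capacitor is recharged to (V)
    v_trip  : inverter threshold at which the ramp ends (V), assumed constant
    t_delay : inverter-chain delay plus recharge time per ramp (s)
    """
    t_ramp = c_a * (v_dd - v_trip) / i_chg  # time to discharge from v_dd to v_trip
    # The T flip-flop toggles once per ramp, so one output period spans two ramps.
    return 2.0 * (t_ramp + t_delay)


# Hypothetical values chosen only for illustration.
C_A, V_DD, V_TRIP = 100e-15, 1.2, 0.6
period_low = cco_output_period(6e-6, C_A, V_DD, V_TRIP, t_delay=0.0)
period_high = cco_output_period(12e-6, C_A, V_DD, V_TRIP, t_delay=0.0)
```

With `t_delay` set to zero, doubling the discharge current halves the output period. A nonzero `t_delay` breaks this exact proportionality, which is why the inverter-chain delay is kept small relative to the discharge time.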
From (1), it is seen that the sensor gain A_f and zero error tp_z0 are susceptible to variations in ambient conditions. In order to improve the variation resilience of our sensor, we duplicate it with two changes aimed at reducing the area overhead. The replica of the sleep transistor, M2, is scaled down by N× with respect to the original sleep transistor M1, and the capacitance in the replica CCO, C_b, is made half that of C_a. As the input to each sensor branch is the IR drop across its sleep transistor, scaling of M2 also helps in reducing the power overhead, as the same voltage drop can be achieved by a calibration current that is N× smaller. On the other hand, C_b is chosen such that even after resizing, it is significantly larger than the parasitic capacitances at this node. The output time period of this branch, tp_r, is related to the calibration current I_cal used to load this branch by the following equation:

$$\frac{1}{tp_r} = \frac{1}{tp_z} + A_f' \, I_{cal} \tag{2}$$

where A_f' is the current-to-frequency gain of the replica branch, and tp_z is the zero-current output time period of the replica sensor, which is related to tp_z0 as tp_z0 = η · tp_z. We can make the following simplifying approximations.
1) The sensor operates within its 1-dB compression point (P1dB), so its gain can be approximated as A_f over the entire region of interest. This is ensured by the very nature of a sleep transistor, which by design needs to keep the virtual voltage rail (V_Vdd) close to V_dd.
2) Variations in ambient conditions serve as common-mode inputs to the two branches, and hence, their net effect can be removed by considering the outputs of both branches.
3) The ratio η of the zero-load output periods of the main branch (tp_z0) and the replica branch (tp_z) remains constant across operating conditions. At zero load, the input to the sensor settles to V_dd. As the transconductance stages are well matched, the currents into the CCOs are equal. The output frequencies are then determined solely by the capacitances (C_a and C_b), whose ratio remains constant.

Provided we have access to a temperature-tolerant reference current I_cal, a variation-tolerant estimate of the load current can be obtained as a ratio of the outputs of the two branches as follows:

$$I_{load} = I_{cal}\,\frac{A_f'}{A_f}\cdot\frac{\dfrac{1}{tp_m} - \dfrac{1}{\eta\, tp_z}}{\dfrac{1}{tp_r} - \dfrac{1}{tp_z}} \tag{3}$$
This strategy of measuring current as a ratio of sensor outputs also eases the constraints on the sensor's linearity, enabling a design with wider input dynamic range and higher sensitivity. As noise suppression is vastly improved by online calibration, good noise immunity is achieved even without averaging the sensor output over a long time, improving the overall response time of the sensor. When the sensor branches have zero external load, the input voltage to the replica branch remains close to V_dd as the IR drop reaches 0. This voltage is level shifted by the source follower M4 to V_dd − V_tn,4, which appears at the gate of M6, whose overdrive voltage V_ov is approximately V_tn,4 − V_tp,6. If M6, M8, M10, R1, and R2 are treated as a single transconductance stage, their transconductance is given by

where gm_6, gm_8, and gm_10 are the small-signal transconductance gains of M6, M8, and M10, respectively, and γ is the ratio of gm_10 to gm_8. Hence, in this mode of operation, the drain current of M10 is I_chg,0 = V_ov · Gm_eff, where V_tp,6 is the threshold voltage of M6. This quantity varies with temperature as follows:

where c is a process-dependent constant. The second term in (5a) can be ignored for small overdrive voltages. As threshold voltages of transistors vary linearly with temperature, I_chg can be approximated to increase linearly with temperature. Therefore, the output pulse rate at zero-load current, 1/tp_z, is expected to increase linearly with temperature, providing an added functionality as an on-chip temperature sensor. The slope of this line, however, depends on process, so a two-point calibration is necessary to operate this circuit as a temperature sensor.
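The two-point calibration mentioned above amounts to fitting a straight line through two (pulse rate, temperature) pairs. The sketch below is a minimal Python illustration under the stated assumption that 1/tp_z is linear in temperature. The function names and the example periods are hypothetical and are not measured values from this work.

```python
def fit_two_point(t1_c, tp_01, t2_c, tp_02):
    """Fit T = slope * (1/tp_z) + offset from two calibration points.

    t1_c, t2_c   : calibration temperatures (degrees C)
    tp_01, tp_02 : zero-load replica periods (s) recorded at t1_c and t2_c
    """
    rate_1, rate_2 = 1.0 / tp_01, 1.0 / tp_02
    slope = (t2_c - t1_c) / (rate_2 - rate_1)   # degrees C per Hz
    offset = t1_c - slope * rate_1
    return slope, offset


def temperature_c(tp_z, slope, offset):
    """Convert a zero-load replica period (s) into a temperature estimate."""
    return slope / tp_z + offset


# Hypothetical zero-load pulse rates: 100 MHz at 25 C and 105 MHz at 75 C.
slope, offset = fit_two_point(25.0, 1 / 100e6, 75.0, 1 / 105e6)
t_est = temperature_c(1 / 102.5e6, slope, offset)   # a rate midway between the two points
```

Only the slope and offset need to be stored per chip. Any curvature in the real pulse-rate characteristic shows up as residual error away from the two calibration temperatures.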
B. Subthreshold Current Reference
A compact low-power aging-tolerant current reference with a low temperature coefficient (TC) is therefore needed to make the current estimates tolerant to any variations. A few current reference circuits for the generation of a temperature-tolerant current source have been presented [20]-[24]. Such works rely on BJTs [21], which are difficult to implement in today's technologies, or rely on an extensive calibration scheme requiring as many as six measurements at three different temperatures [24], resulting in testing overheads. Others such as [20] and [22] have footprints in excess of 0.1 mm², which increase the system costs significantly, as multiple instances of local reference currents are needed on a single die. Fig. 3 shows the basic structure of a beta-multiplier cell, which is traditionally biased in saturation. If MNa and MNb are instead biased in the subthreshold region with a drain-to-source voltage >0.1 V, the following equation holds:

$$I_D = I_{d0}\,\frac{W}{L}\,\exp\!\left(\frac{V_G - V_S - V_{th}}{n\,\phi_t}\right) \tag{6}$$
where n is the subthreshold slope factor, φ_t is the thermal voltage, W and L are the transistor width and length, respectively, I_d0 is a process-dependent constant, V_G and V_S denote the gate and source voltages, respectively, and V_th denotes the threshold voltage. If the currents in the two branches are matched and equal to I_1, (6) can be rewritten to eliminate the gate voltage V_G as

where V_ta and V_tb are the threshold voltages and W_a/L_a and W_b/L_b are the dimensions of the two transistors MNa and MNb, respectively. The first term in this equation generates a complementary-to-absolute-temperature (CTAT) current and the second term generates a proportional-to-absolute-temperature (PTAT) current. Based on this principle, a reference current can be obtained from a circuit consisting of two beta-multiplier cells, as shown in Fig. 4, with one cell generating a PTAT current from transistors with identical threshold voltages (MN3 and MN4) sized in a ratio of k:1, and a second cell generating a CTAT current from transistors with differing threshold voltages (MN7 and MN8) sized k':1. The summed current I_cal is given by
where V_t7 and V_t8 are the threshold voltages of MN7 and MN8, respectively. A calibration current with a low TC can be obtained by suitable scaling of k, k', R_1, and R_2. This current, scaled up by α, was measured at an output pad, as seen in Fig. 5. However, significant variance in the nominal current value was observed across the three samples. Monte Carlo simulations across 300 samples showed that, with a mean TC of 224 ppm/°C and a standard deviation of 196 ppm/°C, this design was susceptible to mismatches (Fig. 6). At a 95% spread in a given lot, the maximum TC could be as high as 600 ppm/°C, which corresponds to a 6% error in the estimated currents if the chip temperature changes from 0°C to 100°C. Hence, a reference current design with greater resilience to mismatches is desirable for mass production. Fig. 7 shows such a design, where MN1, a high-V_t nFET, and MN2, a low-V_t nFET, are sized 1:a and biased in subthreshold. The transistors MP1 and MP2 are biased in saturation by a single-stage opamp (OPA) with a simple startup circuit. In order to reduce power consumption, the OPA is designed in weak inversion with a high dc gain (≈40 dB) and a low bandwidth (≈10 kHz). The reference current thus generated is equal to the current in each of the two branches and is given by
The current reference circuit shown in Fig. 7 has not yet been fabricated but has been verified by simulation. Even after accounting for offsets and mismatches in the OPA, this design shows a significantly improved tolerance to mismatches, as seen in Fig. 8. With a 3-σ max TC of 128 ppm/°C, the variation in reference current in a lot is guaranteed to be less than 1% over the nominal from 0°C to 100°C for a 95% yield. This improvement can be attributed to a reduced number of matched devices (MN1, MN2, MP1, and MP2). In addition, reducing the number of beta-multiplier cells from 2 to 1 also ensures that an OPA can be used to achieve better matching. The OPA-generated bias for the pFETs (MP1 and MP2) also improves the line regulation of this circuit to 1%/V (Fig. 9) for a supply voltage ranging from 0.85 to 1.5 V. Table I compares the two presented designs with the state of the art in current reference circuits. The two designs based on a beta-multiplier cell in weak inversion show a better TC than most of the referenced works except for [23], which also comes with a significantly larger area footprint. In fact, compared with other state-of-the-art reports, this circuit has the smallest area footprint. In addition, the low power consumption and TC make it an ideal design for replication at multiple locations on an SoC as part of localized power sensors. However, the nominal value of this current is still affected by the resistor tolerances, which need to be calibrated out.
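The TC figures quoted for the two reference designs can be turned into current drift with simple arithmetic. The Python sketch below is illustrative only: it assumes a constant TC over the whole temperature span and, for the 95% figure, a normal distribution of TC across samples, which the Monte Carlo data may not follow exactly. The function names are hypothetical.

```python
def drift_percent(tc_ppm_per_c, t_min_c, t_max_c):
    """End-to-end change (percent) of a reference with constant TC over a temperature span."""
    # ppm/degC times degC gives ppm; 1 ppm equals 1e-4 percent.
    return tc_ppm_per_c * (t_max_c - t_min_c) * 1e-4


def upper_95_percent_tc(mean_tc, sigma_tc):
    """Upper edge of a two-sided 95% interval, assuming a normal distribution."""
    return mean_tc + 1.96 * sigma_tc


tc_95_first = upper_95_percent_tc(224.0, 196.0)    # close to the 600 ppm/degC quoted
drift_first = drift_percent(600.0, 0.0, 100.0)     # two-cell design, 0 to 100 degC
drift_second = drift_percent(128.0, 0.0, 100.0)    # single-cell design, 0 to 100 degC
```

Under these assumptions the first design drifts by about 6% end to end and the second by about 1.28%. If the TC is read as a box-method figure (an assumption about how it was defined), a 1.28% end-to-end span corresponds to roughly ±0.64% around the midpoint value.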
III. CALIBRATION
Equation (1) shows that the sensor is susceptible to gain and offset errors, which need to be calibrated out. Equation (3) is the mathematical representation of this calibration, which estimates both gain and offset errors at runtime using a replica branch. Thus, in order to obtain a variation-tolerant estimate of I_load from (3), the value of the calibration current I_cal and the ratio of sensor gains in the two branches (A_f'/A_f) need to be measured for each given sample (chip). This one-time measurement requires loading the main branch of the sensor with a current whose magnitude is known a priori. In order to eliminate the need for an external current source for testing, a scaled version of the on-chip reference current, α I_cal, is made available as a pad output. α I_cal is measured externally and then used as a load to the main branch. Subsequently, four measurements are taken in this setup:
1) output of the main branch with the load current α I_cal, tp_1;
2) output of the main branch at no load, tp_00;
3) output of the replica branch with load current I_cal, tp_2;
4) output of the replica branch at no load, tp_0.

This one-time, one-point, postfabrication measurement is step 1 in Fig. 10, from which two parameters, η and the gain-ratio term I_cal · A_f'/A_f, are calculated as follows and stored in memory:

Steps 2 and 3 in Fig. 10 form the core of online calibration, which involves taking three measurements from the sensor (tp_m, tp_z, and tp_r), recalling the stored values of η and the gain-ratio term, and estimating the load current according to the following equation [obtained by combining (3) and (11)]:
Steps 3a and 3b involve a recall and a simple arithmetic calculation, respectively, both of which are much faster than the sensor response time (step 2). Hence, the conversion speed of the sensor is mostly determined by the sensor response time and remains relatively unaffected by the proposed calibration scheme.
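The calibration flow described above can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' implementation: it assumes each branch follows a linear pulse-rate model (pulse rate = zero-load rate + gain × current), and the function names, the name `scale` for the stored gain-ratio term I_cal · A_f'/A_f, and all example periods are hypothetical.

```python
def one_time_calibration(tp_1, tp_00, tp_2, tp_0, alpha_i_cal):
    """Step 1: derive the two stored parameters from the four measurements.

    tp_1  : main-branch period with the measured load alpha*I_cal (s)
    tp_00 : main-branch period at no load (s)
    tp_2  : replica-branch period with I_cal (s)
    tp_0  : replica-branch period at no load (s)
    Returns (eta, scale), where scale stands for I_cal * A_f'/A_f in amperes.
    """
    eta = tp_00 / tp_0
    main_rise = 1.0 / tp_1 - 1.0 / tp_00     # equals A_f * alpha * I_cal
    replica_rise = 1.0 / tp_2 - 1.0 / tp_0   # equals A_f' * I_cal
    scale = alpha_i_cal * replica_rise / main_rise
    return eta, scale


def estimate_load(tp_m, tp_z, tp_r, eta, scale):
    """Steps 2 and 3: load current from three online period measurements."""
    main_rise = 1.0 / tp_m - 1.0 / (eta * tp_z)   # zero error removed via the replica
    replica_rise = 1.0 / tp_r - 1.0 / tp_z
    return scale * main_rise / replica_rise


# Hypothetical periods (s), generated from an ideal linear model for illustration.
eta, scale = one_time_calibration(tp_1=1 / 60e6, tp_00=1 / 50e6,
                                  tp_2=1 / 120e6, tp_0=1 / 100e6,
                                  alpha_i_cal=5e-3)
i_load = estimate_load(tp_m=1 / 70e6, tp_z=1 / 100e6, tp_r=1 / 120e6,
                       eta=eta, scale=scale)
```

Note that only the externally measured α I_cal enters the calibration. Neither I_cal itself nor the individual gains need to be known, because they appear only through the stored product.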
For temperature sensing, tp_01 and tp_02 denote the output periods of the replica branch at zero load, measured at two different temperatures T_1 and T_2, respectively.
IV. NONIDEALITIES
A. Mismatch
Mismatches can affect the sensor in one of two ways:
1) mismatch within the current reference subcircuit, which can worsen the TC of the current;
2) mismatch in the gain between the main branch and the replica branch across operating conditions.

Mismatches within the current reference circuit that lead to an increase in TC were addressed in Section II-B. Equations (1)-(3) give more insight into the effect of mismatches between the two branches. The absolute mismatch between the gains of the two branches, A_f and A_f', is addressed by calibration (11). However, across a range of inputs, the sleep transistor's response deviates from an ideal ohmic response and the sensor undergoes gain compression. As a result, the inherent sensor gain at full-scale input, A_f(I_load = I_max), will differ from its value at zero input, A_f(I_load = 0), and at an arbitrary load condition, A_f(I_load).

Yet, the replica branch is designed to work at only two load conditions:
1) at zero-load current;
2) at an input load current equal to I_cal.

Although the replica branch calibrates the sensor gain of the main branch according to (11), distortion and systematic nonlinearities introduce a second-order effect, causing the ratio I_cal · A_f'/A_f to deviate from the value calculated during one-time calibration. This may also be exacerbated by mismatches between the sensor and its replica. To alleviate this effect, sensor linearity was improved by adding source degeneration and by designing the inverter chain to have a delay much smaller than the discharging time.
Monte Carlo simulations across 300 samples were run under two different current loads, 12.5 mA (midscale) and 25 mA (full scale), to estimate the extent of nonlinearities on system accuracy. The current estimated according to (12) under full-scale load is denoted by Iload_2, and that estimated for a mid-scale load is denoted by Iload_1.
It must be noted that these estimates already account for sample-to-sample variations in gain and offset, following the calibration scheme outlined in Section III. Ideally, the ratio of these two numbers is expected to be 2, and any deviation from this ideal value can be treated as the maximum error contribution by distortions, which happens at full-scale input. Fig. 11 shows that under 3-σ mismatches, the minimum value of this ratio is 1.89 (≈5.5% error) and the nominal value is equal to 1.95 (≈2.5% error). Thus, distortion effects in the presence of mismatches between the two branches limit the full-scale accuracy of the sensor to 95% (3-σ ) and 97.5% on average.
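The error percentages quoted above follow directly from the deviation of the estimated-current ratio from its ideal value of 2. A minimal Python check, with a hypothetical function name:

```python
def full_scale_error_percent(ratio, ideal=2.0):
    """Relative shortfall (percent) of the Iload_2 / Iload_1 ratio from its ideal value."""
    return (ideal - ratio) / ideal * 100.0


worst_case_error = full_scale_error_percent(1.89)   # 3-sigma minimum ratio
nominal_error = full_scale_error_percent(1.95)      # nominal ratio
```

These evaluate to 5.5% and 2.5%, matching the figures in the text.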
B. Ambient Effects
Ambient conditions such as temperature, supply voltage level, and supply noise can be treated as common mode inputs modulating the sensor output. Evaluating (3) with respect to these variations yields the following:
As the two branches have the same voltage supply, the effects of supply voltage variation and of supply noise are fully correlated, significantly improving the sensor's power supply rejection. As the two branches are placed in close vicinity in an interdigitated layout, no temperature gradient exists between the two branches either. Hence, the gain of both the branches varies in the same direction due to all ambient effects, significantly improving the sensor's variation resilience.
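The common-mode argument can be illustrated numerically. The Python sketch below is a toy model, not the authors' analysis: it assumes both branches follow a linear pulse-rate model and that an ambient change multiplies the gains and zero-load rates of both branches by the same factor. All names and values are hypothetical.

```python
def branch_period(current_a, gain_hz_per_a, zero_rate_hz):
    """Output period (s) of one branch under the assumed linear pulse-rate model."""
    return 1.0 / (zero_rate_hz + gain_hz_per_a * current_a)


def ratio_estimate(tp_m, tp_r, tp_z, eta, scale):
    """Replica-based estimate; scale stands for I_cal * A_f'/A_f."""
    main_rise = 1.0 / tp_m - 1.0 / (eta * tp_z)
    replica_rise = 1.0 / tp_r - 1.0 / tp_z
    return scale * main_rise / replica_rise


def single_branch_estimate(tp_m, nominal_gain, nominal_zero_rate):
    """Estimate from the main branch alone, using nominal (undrifted) parameters."""
    return (1.0 / tp_m - nominal_zero_rate) / nominal_gain


# Hypothetical nominal parameters.
A_F, A_F_REPLICA = 2.0e9, 4.0e10     # Hz/A, main and replica gains
F_Z0, F_Z = 50e6, 100e6              # Hz, zero-load rates of main and replica
I_CAL = 0.5e-3                       # A
ETA = F_Z / F_Z0                     # tp_z0 / tp_z
SCALE = I_CAL * A_F_REPLICA / A_F


def compare(i_load, drift):
    """Apply the same multiplicative drift to both branches and estimate both ways."""
    tp_m = branch_period(i_load, A_F * drift, F_Z0 * drift)
    tp_r = branch_period(I_CAL, A_F_REPLICA * drift, F_Z * drift)
    tp_z = branch_period(0.0, A_F_REPLICA * drift, F_Z * drift)
    return (single_branch_estimate(tp_m, A_F, F_Z0),
            ratio_estimate(tp_m, tp_r, tp_z, ETA, SCALE))


uncorrected, corrected = compare(10e-3, drift=1.10)
```

In this toy case a 10% common drift turns a 10-mA load into a 13.5-mA reading for the single-branch estimate, while the ratio-based estimate still returns 10 mA. Real branches are only partially correlated, so the cancellation in hardware is not exact.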
C. Aging
Voltage and temperature stresses on devices act through various mechanisms such as bias temperature instability (BTI), hot carrier injection (HCI), and time-dependent dielectric breakdown to reduce performance. This is typically seen as a degradation in the threshold voltage of active devices [18] .
HCI-related degradation is strongly dependent on the electric field strength in the channel, which is closely related to the drain-source voltage. On the other hand, BTI-related degradation is gate voltage dependent. Transistors forming the Gm-cell in the sensors (M5-M10) are biased in saturation and experience similar stresses on both accounts. The sleep transistor and its replica, on the other hand, have the same BTI stress (as the gate-to-source voltage of both transistors is either V_dd or 0) but, due to their different drain-to-source voltages, undergo slightly different HCI stresses. If the aging-related degradation is modeled as a change in the gains A_f and A_f' of the two sensor branches, the output in the presence of aging degradations can be written as

The transistors in the current reference circuit are biased in subthreshold, so they are subjected to very little stress compared with the other devices. Hence, the variations in I_cal due to aging are minimal. Similarity of stress phenomena and stress values ensures high correlation between the degradations of A_f and A_f', leading to aging-tolerant current estimates. Foundry-provided device aging models were used to simulate the cumulative effects of degradations due to HCI and BTI under nominal operating conditions (10-mA load current in the main branch and I_cal turned ON). The sensor output for beginning-of-life (BoL) models is used as a baseline, and the degradation in sensor response with time is simulated and recorded for up to 10 years of operation using RelXpert [25].
Compared with the BoL model, the sensor gain degrades by as much as 5% within ten years of operation due to aging effects (Fig. 12) . These stress effects are more prominent at smaller load current as the transistors in the Gm-cell (M5, M7, and M9) operate with smaller overdrives and are therefore more susceptible to variations in threshold voltage. Replica-based online calibration can reduce these estimation errors due to aging to less than 1% even when the actual sensor gain degrades by as much as 5%.
D. AC Effects
To study the effects of transient currents on sensor response, the sensor is divided into two parts. The first part consists of a cascade of low-pass elements, starting with the bypass capacitor BPC1 on the virtual voltage rail, followed by the Gm-stage made up of a source follower and two common-source amplifiers. The cumulative effect of these low-pass elements, as seen in the current output at the drain of M9, is captured by simulation. A dc current of 5 mA with a superimposed half-sine current of 5 mA is used as the load, and the ac gain from load current to CCO input is normalized with respect to the dc gain and plotted in Fig. 13.
The simulation plot shows that for this design, the 3-dB bandwidth is greater than 200 MHz, which ensures that transients with frequencies up to hundreds of megahertz are captured at the input to the CCO. On the other hand, the capacitor at the input of CCO acts as an integrator, averaging out transients that are faster than the CCO output period. Hence, the sensor has sufficient bandwidth to track transient currents that are slower than the sensor's response time, while providing a moving average of the others that are faster. Bhagavatula and Jung [13] demonstrated that with transient current loads, the estimated output power is within ±3.5% of the average load currents during a single pulse. As the core architecture of the sensor has not been changed, the same analysis holds true.
V. MEASUREMENT RESULTS
This sensor was designed and fabricated in a 130-nm CMOS and it occupies an active area of 110 μm × 90 μm, as shown in Fig. 14 , of which the area of the reference current circuit is 30 μm×50 μm. An external current source rated at a 10-ppm accuracy for dc currents was used to generate various load currents from 0 to 25 mA and tested at 13 different temperatures ranging from −23°C to 100°C. This setup is placed in an environmental chamber with a tightly controlled temperature, which is monitored using a thermocouple placed inside the chamber, near the chip. The sensor is switched ON and readings are taken only after the thermocouple confirms that the temperature inside the chamber has reached a steady-state value for each different temperature setting.
As the current source available for testing was not rated for ac currents, the effects of transient currents had to be inferred indirectly (Section IV-D). The output pulse rate was measured as the reciprocal of the time period of the output pulses averaged over four cycles (for improved accuracy), and load currents were estimated according to (12). The difference between the estimated value of the current and that reported by the external current source is reported as error in percentage terms. As the longest time period of the output pulse is 20 ns, a current estimate is available within 80 ns at all times. Fig. 15 shows that the pulse rate varies linearly with load current with R² > 0.99 at all 39 sample points. A load current of 20 mA results in an IR drop in excess of 100 mV across the sleep transistor (at room temperature), which is equivalent to 10% of V_dd. For any given circuit block, the sleep transistors are designed such that the virtual supply rail is within tens of millivolts of V_dd, which sets the specification for the input dynamic range of the sensor. Hence, the sensor can be replicated without redesign for blocks across the chip, saving significant design effort.
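The 80-ns figure follows from averaging four output cycles of at most 20 ns each. A minimal Python sketch of the averaging and the worst-case conversion time, with hypothetical function names:

```python
def averaged_pulse_rate(periods_s):
    """Pulse rate (Hz) as the reciprocal of the mean of the measured periods."""
    return len(periods_s) / sum(periods_s)


def worst_case_response_time(longest_period_s, cycles):
    """Time (s) needed to collect the given number of cycles at the slowest pulse rate."""
    return longest_period_s * cycles


rate_at_slowest = averaged_pulse_rate([20e-9] * 4)     # four cycles of the longest period
response_time = worst_case_response_time(20e-9, 4)     # upper bound on conversion time
```

Because the pulses get shorter as the load current increases, the four-cycle window is shorter than 80 ns at any nonzero load.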
η was assumed to be invariant across operating temperatures in order to derive (12). With measurements for the three sample chips from −20°C to 120°C, Fig. 16 validates this assumption, enabling the zero-error correction of the main branch without interfering with the operation of the CUT. Fig. 17 shows that the average error in current estimation across 3 samples and 13 different temperatures was less than ±8.25% with a 3-σ error ≤±15%. This error is minimized at the operating condition where the IR drop across the replica sleep transistor (due to I_cal) equals the IR drop produced by the load current across the main sleep transistor (≈5 mA in Fig. 17). Thus, the scaling ratio N in (2) and the current overhead due to I_cal can be traded off against sensor accuracy.
At room temperature, the sensor output was tested for dc inputs from 0 to 20 mA in steps of 10 μA, and the output was monotonic to 11 bits. Fig. 18(a) and (b) shows that the equivalent integral nonlinearity (INL) is <0.5 LSB and the differential nonlinearity (DNL) is <±2 LSBs for most of the inputs, momentarily increasing to about 3 LSBs at ≈18 mA. At this point, the drain-to-source voltage of the sleep transistor is greater than 95 mV, and it is suspected that nonlinearities in the Gm-cell cause the sensor gain to drop significantly. This inference is supported by Fig. 17, where the estimation error rises faster for inputs greater than 18 mA. Note that even though the sensor is linear to 11 bits under dc test conditions, for an estimation error <10%, the transient settling target is only 5 bits in a given measurement window of 80 ns.
Fig. 19 shows the effect of supply noise on estimation accuracy. Single tones of varying amplitudes (10 mV-100 mV) were superimposed on the voltage supply while the current load was maintained at 10 mA (half of the full-scale range). The output was sampled for at least two periods of the superimposed tone or 100 ns (response time of the sensor), whichever is larger. The maximum deviation in the output frequency from the nominal case, with a clean dc supply, is reported as an error percentage in Fig. 19(a). Within this time window, the maximum deviation of the estimated current from the actual load current is reported as error (Fig. 19). Thus, the replica-based calibration is successful in suppressing the effect of supply noise on the output estimate by at least five times.
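For the dc linearity sweep described earlier in this section, a 0-to-20-mA range in 10-μA steps gives 2000 input steps, which is close to 2^11 = 2048 and is consistent with quoting linearity to 11 bits. The sketch below shows one common way to extract equivalent INL and DNL from such a uniformly stepped sweep, using an endpoint fit. Whether the reported figures were obtained with an endpoint fit or a best-fit line is not stated, so the method and the function names are assumptions.

```python
def is_monotonic(outputs):
    """True if the swept output strictly increases or strictly decreases."""
    diffs = [b - a for a, b in zip(outputs, outputs[1:])]
    return all(d > 0 for d in diffs) or all(d < 0 for d in diffs)


def inl_dnl(outputs):
    """Endpoint-fit INL and DNL, in LSBs, for a uniformly stepped sweep.

    ASSUMPTION: one input step corresponds to one LSB, and the ideal
    transfer curve is the straight line through the first and last points.
    """
    n = len(outputs)
    if n < 3:
        raise ValueError("need at least three sweep points")
    lsb = (outputs[-1] - outputs[0]) / (n - 1)  # average output step
    if lsb == 0:
        raise ValueError("sweep endpoints are identical")
    dnl = [(outputs[i + 1] - outputs[i]) / lsb - 1.0 for i in range(n - 1)]
    inl = [(outputs[i] - (outputs[0] + i * lsb)) / lsb for i in range(n)]
    return inl, dnl


# Placeholder sweep for illustration only (not measured pulse rates).
sweep = [0.0, 1.0, 2.5, 3.0, 4.0]
inl, dnl = inl_dnl(sweep)
print(is_monotonic(sweep), inl, dnl)
```

An endpoint fit forces the INL to zero at both ends of the range; a least-squares line would typically report a smaller peak INL for the same data.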
The same setup with an environmental chamber and a thermocouple was used for evaluating the design as a temperature sensor. Fig. 20 shows that the zero-load pulse rate varies linearly with temperature in all three samples tested. Outputs recorded at two temperatures, 25°C and 75°C, are used to derive temperature as a linear function of 1/tp_z. Using this equation and the measured output pulse rate, temperatures at the other measurement points are estimated. The difference between the temperature estimated by the on-chip sensor and that recorded by the thermocouple is reported as error. After this two-point calibration, temperature was estimated from the sensor output with an average error of ±0.7°C and a 3-σ error ≤3°C from −20°C to 120°C (Fig. 21), which is sufficient for thermal management in microprocessors [6].
Table II summarizes and compares the test results with the state-of-the-art sensors used for power management. With a low power overhead and fast response time, this sensor exhibits an energy-per-conversion figure of merit that is at least two orders of magnitude better than that of the state-of-the-art sensor in power estimation. With a compact design that occupies less than 0.01 mm² of active area, this sensor is considerably smaller than the others presented. A wide input dynamic range with a linearity of 11 bits implies that this sensor can estimate both static and dynamic power with high accuracy in real time. Although designs in more advanced process technologies (sub-45 nm) may show higher process and sample-to-sample variations, these variations can be treated as gain and offset errors, which are calibrated out by a combination of one-time measurement and online calibration. On the other hand, at these technology nodes, this sensor can be designed with higher sensitivity and lower area and power overheads.
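The two-point temperature calibration described in this section amounts to fitting a straight line through the zero-load pulse rates recorded at 25°C and 75°C and then evaluating that line at other measured rates. A minimal sketch follows; the function names and the pulse-rate values are hypothetical placeholders, not measured data.

```python
def two_point_calibration(rate_lo_hz, temp_lo_c, rate_hi_hz, temp_hi_c):
    """Return (slope, offset) such that T = slope * rate + offset."""
    if rate_hi_hz == rate_lo_hz:
        raise ValueError("calibration rates must differ")
    slope = (temp_hi_c - temp_lo_c) / (rate_hi_hz - rate_lo_hz)
    offset = temp_lo_c - slope * rate_lo_hz
    return slope, offset


def estimate_temperature_c(rate_hz, slope, offset):
    """Temperature estimate from the zero-load pulse rate (1/tp_z)."""
    return slope * rate_hz + offset


# Placeholder zero-load pulse rates at the two calibration temperatures.
slope, offset = two_point_calibration(50e6, 25.0, 55e6, 75.0)
t_est = estimate_temperature_c(52.5e6, slope, offset)
print(t_est)

# Error against a reference (thermocouple) reading, in degrees Celsius.
t_ref = 50.0  # hypothetical thermocouple reading
print(t_est - t_ref)
```

Because the line is pinned at the two calibration points, the residual error at other temperatures reflects how far the zero-load pulse rate departs from a strictly linear dependence on temperature.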
VI. CONCLUSION
A replica-sleep-transistor-based on-chip power sensor has been presented. With only one-point calibration, this sensor is resilient to PVT variations and aging degradations. A low-TC reference current required for online calibration has also been presented, which allows for a design with an increased input dynamic range and sensitivity. With a compact design and a fast response time, this sensor provides block-by-block power estimates in real time that are essential for fine-grained power management in both temporal and spatial domains in tomorrow's microprocessors and other SoCs.
