This raises the computational throughput per watt target for future technology nodes. SRAM arrays, which are widely used as system components such as caches and register files in both high-performance and portable systems, are becoming dominant power consumers because of their large capacity and area. Hence any reduction in cache energy can yield a considerable reduction in overall power. In this paper, we propose a novel circuit technique that uses depletion mode devices to reduce the static energy of the SRAM array in an on-chip cache by 90% without any performance impact.
I. INTRODUCTION
Static subthreshold leakage has emerged as one of the major impediments to CMOS scaling. The magnitude of the problem is reflected in the fact that the leakage current per unit transistor width is expected to increase 1000- to 10000-fold in going from the 180nm technology node to the 70nm technology node. This appears to pose a problem only for state-holding circuits, such as memory arrays that are idle for extended periods, until one considers the following. Over the same span, from the 180nm to the 70nm technology node, the on (drive) current is expected to stay constant at 750 µA/µm in order to sustain the historical speed advantage. In the 180nm technology node, the ratio of on current to leakage current per unit transistor width is approximately $7.5 \times 10^{6}$. This ratio is projected to shrink to 750 in the 70nm technology node and to 75 in the 32nm technology node. Some of the implications of this projection are as follows. In the 32nm node, the leakage current is expected to play a major role in the design of some high-fanout computing logic as well. Consider a bus with 30 clients being driven by one client. The twenty-nine off transistors (drivers) of the inactive bus clients together leak more than one third (29/75, about 38%) of the on drive current of the driving client. Similar concerns will surface for logic blocks such as decoders, multiplexors, and gate arrays. This argues that we need to develop low-leakage design styles even for computing logic with a high activity rate.
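The bus example can be checked with a few lines of arithmetic. The sketch below assumes that every bus driver has the same width, so that each off driver leaks 1/ratio of one driver's on current; the function name is ours and purely illustrative.

```python
# On/off current ratios per unit transistor width, as quoted in the text.
ON_OFF_RATIO = {"180nm": 7.5e6, "70nm": 750.0, "32nm": 75.0}


def bus_leakage_fraction(n_clients: int, on_off_ratio: float) -> float:
    """Leakage of the (n_clients - 1) off drivers as a fraction of the
    on current of the single active driver (equal widths assumed)."""
    n_off = n_clients - 1
    return n_off / on_off_ratio


for node, ratio in ON_OFF_RATIO.items():
    print(f"{node}: {bus_leakage_fraction(30, ratio):.4%} of the drive current")
```

With the 32nm ratio of 75, the 29 idle drivers account for 29/75 of the drive current, which is the "about 38%" figure in the text.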
This paper proposes a CMOS design style with low leakage characteristics. This design style belongs to a family of CMOS design styles that we name warmup CMOS. These logic styles warm up initially to charge some nodes to a certain potential. The steady-state energy savings are delivered in the "warmed-up" state, at an initial energy cost for the warmup. The specific warmup CMOS design style presented in this paper relies upon depletion mode transistors to provide the biasing that reduces the leakage. Hence we name it dep-warmup CMOS. During the active phase of the logic, the depletion mode transistors are transparent (Figure 1). During the inactive phase, however, the depletion mode transistors leak enough charge to bias the transistors to a point favorable with respect to the leakage current, forcing a low-leakage equilibrium state. We describe the dep-warmup CMOS design style in Section II.
An immediate application of the dep-warmup CMOS logic design style is in static memory arrays such as caches and register files. Microprocessors attain significant performance improvement by increasing the size and associativity of on-chip caches. For example, Intel's latest processor family, Centrino [19], has a 1MB on-chip L2 cache. Both the dynamic switching energy and the energy due to static subthreshold leakage current of on-chip caches are already significant factors in the overall power consumption of processors. The static leakage energy would overwhelm the dynamic energy for these caches with the expected 1000- to 10000-fold increase in leakage current at the 70nm technology node. This is why static leakage power reduction is one of the most important considerations in the design of both high-performance and low-standby-power circuits. As the device size shrinks, the threshold voltage $V_t$ has to be reduced to maintain the same on current. This causes the subthreshold current to increase exponentially. Table-I compares 0.18µm technology and 70nm technology with respect to on and off currents, which illustrates the magnitude of the problem.
As argued earlier, with feature size reduction the static energy component of the on-chip caches constitutes a sizeable fraction of the total processor energy. Circuit techniques such as DVS [14], ABB-MTCMOS [10], and Gated-$V_{dd}$ [11] have been proposed to reduce the leakage energy in caches. Control algorithms deploying these techniques estimate the active footprint in the cache to power it selectively. This results in a performance penalty as well as an energy penalty for switching the cache lines back and forth between the active and dormant states. Thus an efficient control algorithm is essential to extract the maximum possible energy savings. Power consumption in any digital integrated circuit is given by the equation
$$P = I_o V_{dd} + \alpha C V_{dd}^{2} f \qquad (1)$$
where $I_o$ is the leakage current, which is governed by the diode equation $I_s (e^{qV/kT} - 1)$, $V_{dd}$ is the power supply voltage, $\alpha$ is the average switching activity factor, $C$ is the total capacitance of the circuit, and $f$ is the frequency of operation. The first term in the equation corresponds to the leakage power and the second term corresponds to the dynamic switching power. With the reduction in feature sizes, $V_{dd}$ has also decreased, forcing a reduction in the threshold voltage $V_t$ of the transistors. Thus the leakage current $I_o$, which depends on $V_t$ through the diode equation above, increases [15].
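The two power terms described above (a leakage term and a dynamic switching term) can be evaluated separately to see which one dominates for a given operating point. The following is a minimal sketch; the function name and the sample operating point are hypothetical and are not taken from the paper's tables.

```python
def total_power(i_leak_a, v_dd, alpha, c_total_f, freq_hz):
    """Return (static, dynamic, total) power in watts.

    static  = I_o * V_dd               (leakage term)
    dynamic = alpha * C * V_dd^2 * f   (switching term)
    """
    static = i_leak_a * v_dd
    dynamic = alpha * c_total_f * v_dd ** 2 * freq_hz
    return static, dynamic, static + dynamic


# Hypothetical operating point, for illustration only.
static_w, dynamic_w, total_w = total_power(
    i_leak_a=1e-3, v_dd=1.0, alpha=0.1, c_total_f=1e-9, freq_hz=1e9
)
print(f"static share of total power: {static_w / total_w:.1%}")
```

Lowering $V_{dd}$ reduces the dynamic term quadratically, but, as the text notes, the accompanying reduction in $V_t$ raises the leakage current in the static term.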
A more elaborate expression for the subthreshold leakage current is given by [16, page 201],
where,
The leakage power decreases exponentially as $V_{ds}$ is reduced, owing to the Drain Induced Barrier Lowering (DIBL) effect [16]. This fact has been used by earlier researchers in DVS [14] and in the Row-by-Row Dynamic $V_{dd}$ Control (RRDV) scheme [18]. We propose a novel technique using depletion mode devices to achieve a leakage reduction of 90% in SRAM arrays without any additional control mechanism. We call our SRAM warm SRAM for the reasons explained in Section III.
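For orientation, a commonly used textbook form of the subthreshold current is sketched below. It is not necessarily the exact expression of [16]: the symbols $I_0$ (a process- and geometry-dependent prefactor), $n$ (subthreshold swing coefficient), $\eta$ (DIBL coefficient), $\gamma'$ (linearized body-effect coefficient) and $v_T = kT/q$ (thermal voltage) are assumptions introduced here for illustration.

```latex
% Generic subthreshold current model (assumed form, for illustration only)
I_{sub} = I_0 \,
  \exp\!\left( \frac{V_{gs} - V_{t0} - \gamma' V_{sb} + \eta V_{ds}}{n\, v_T} \right)
  \left( 1 - \exp\!\left( -\frac{V_{ds}}{v_T} \right) \right)
```

In this form the three handles on leakage are visible directly: a negative $V_{gs}$, a smaller $V_{ds}$ (through the DIBL term $\eta V_{ds}$), and a raised source potential (through the body-effect term $\gamma' V_{sb}$) each reduce the exponent.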
II. PROPOSED CIRCUIT TECHNIQUE
In order to achieve leakage reduction we should be able to dynamically control the voltages $V_{gs}$, $V_{ds}$ and $V_s$. We use depletion mode devices to achieve this. A depletion mode device works in the same way as an enhancement mode device, except that the device is ON even when $V_{gs}$ is zero. To understand how depletion mode devices can help in leakage current 
When the access signal ACC is HIGH, both the depletion devices are ON hence the virtual power and ground nodes 
The equilibrium condition of the circuit can be found by solving Equation 2, substituting the appropriate threshold voltages and equating all the currents. For $V_{dd}$ = 1V and a voltage difference of at least 0.3V between HIGH and LOW, this equilibrium condition requires (for the case $|V_{TdepN}| = |V_{TdepP}|$) $|V_{TdepN}|$ = 0.65V. We used the 70nm technology model files provided by BPTM [4] to perform the HSPICE [5] simulation. We chose the following threshold voltages for the HSPICE simulation: Table I shows the leakage current of a minimum sized transistor to be (99.47nA/µm) * 38nm, which is 3779.86pA. Compared to the 10pA leakage current of our scheme in Table-II, we have achieved a leakage current reduction of about 378 times, but with a penalty in performance. The performance impact of the extra NMOS access transistor in the charging path and the extra PMOS access transistor in the discharging path could be high. Since these are depletion mode devices, this impact can be managed to a great extent. Various delay parameters are listed in Table-II. There is a 54.5% increase in the average propagation delay, which may not suit high-speed logic circuits. We can reduce the delay by several means. The easiest is to make the ACC signal rise faster than the input signal or, equivalently, to raise ACC ahead of the input transitions. In this inverter, just by making ACC rise 10 times faster than the input, we reduced the propagation delay penalty to 18%. But the increase in fall time, which is limited by the PMOS, is still 76%. We can increase the width of the depletion mode transistor, which unfortunately will also increase the leakage current. This is one of the basic limitations of this circuit design style. The other limitation is the energy spent in switching the $V_{PWR}$ node. Note that whenever the input is 1 and ACC = 1, the virtual power node is pulled up to $V_{dd}$. 
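The leakage comparison quoted above is simple unit arithmetic. The sketch below reproduces it, using the 99.47nA/µm and 38nm figures from the text and the 10pA figure quoted for the proposed inverter; the helper name is ours.

```python
def off_current_pa(i_off_na_per_um: float, width_nm: float) -> float:
    """Off current in pA for a given per-width leakage and device dimension."""
    width_um = width_nm / 1000.0          # nm -> um
    return i_off_na_per_um * width_um * 1000.0  # nA -> pA


base_pa = off_current_pa(99.47, 38.0)  # conventional minimum-sized transistor
warm_pa = 10.0                         # proposed scheme, as quoted from Table-II
print(f"base leakage: {base_pa:.2f} pA, reduction: {base_pa / warm_pa:.0f}x")
```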
However, during the inactivity period (ACC = 0), the virtual power node leaks some charge to the virtual ground node, settling at a voltage approximately equal to $-V_{TdepN}$, which is about 0.65V in our design. A complete energy estimate needs to take this switching of the $V_{PWR}$ node between $V_{dd}$ and $-V_{TdepN}$ into consideration. The $V_{PWR}$ node has capacitance $C_{eq} = 2 C_{diff}$. Consider a 0.3V swing between $V_{dd}$ and $-V_{TdepN}$. The extra energy required will then be:
If the circuit is idle for a time $\Delta t$, on average, after each active period, the switching will pay off only if,
This is because, from Table I, the 70nm leakage current is 99.47nA/µm. With $L_{eff}$ = 38nm, this gives a leakage current of (0.038µm) * (99.47nA/µm) = 3.78nA. The leakage energy over time $\Delta t$ will equal $3.78 \times 10^{-9} \cdot \Delta t$. For the leakage energy to exceed the extra switching energy of $V_{PWR}$, we need approximately $\Delta t \geq 7.9 \times 10^{7} \cdot C_{diff}$ (with $\Delta t$ in seconds and $C_{diff}$ in farads). With 70 nm technology and minimum width transistors, $C_{diff}$ typically equals 0.1 fF, hence $\Delta t \geq$ 7.9ns. For typical high performance circuits, with a clock cycle time of 0.2 ns, this corresponds to an idle time of around 40 cycles. Hence the dep-warmup CMOS circuit technique can be applied if the average idle time between active periods is far greater than this break-even window. The immediate application will be in SRAM arrays, which occupy more than 50% of a typical microprocessor's area and consume more than 50% of the energy. The circuit in Figure-1 is not truly regenerative in idle mode, i.e., as the devices are cascaded, the high output
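The break-even idle time can be sketched numerically. The leakage current, $V_{dd}$ = 1V, $C_{diff}$ = 0.1fF and the 0.2ns cycle time come from the text; the coefficient 0.3 relating the extra switching energy to $C_{diff}$ is an assumption on our part, back-derived so that the quoted 7.9ns result is reproduced, and is not a formula taken from the paper.

```python
I_OFF_NA_PER_UM = 99.47   # 70nm leakage per unit width (Table I, as quoted)
L_EFF_UM = 0.038          # 38nm, as used in the text
V_DD = 1.0                # supply voltage in volts
ENERGY_COEFF_V2 = 0.3     # ASSUMPTION: E_extra = 0.3 * C_diff (units of V^2)


def break_even_idle_time(c_diff_f: float) -> float:
    """Idle time (s) after which the saved leakage energy equals the
    extra energy spent on switching the virtual power node."""
    i_leak_a = I_OFF_NA_PER_UM * L_EFF_UM * 1e-9   # about 3.78 nA
    e_extra_j = ENERGY_COEFF_V2 * c_diff_f
    return e_extra_j / (i_leak_a * V_DD)


dt_s = break_even_idle_time(0.1e-15)   # C_diff = 0.1 fF
cycles = dt_s / 0.2e-9                 # 0.2 ns clock cycle
print(f"break-even idle time: {dt_s * 1e9:.2f} ns (about {cycles:.0f} cycles)")
```

Because the break-even time scales linearly with the switched capacitance, any extra loading on the virtual power node lengthens the idle period needed before the technique pays off.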
We can study a cross-coupled inverter configuration to determine the limiting leakage energy. This will be the minimum leakage reduction possible through this method, as $V_{PWR}$ will be equal to $V_H$ and $V_{GND}$ will equal $V_L$. In this scenario, one of the factors in leakage reduction, reverse-biased $V_{gs}$, does not exist. The leakage current for this cross-coupled inverter, estimated using HSPICE [5], is 515 pA, with $V_{PWR}$ = HIGH = 742mV and $V_{GND}$ = LOW = 225mV. Even though the leakage current increases by 25 times (per inverter) when compared to the circuit in Figure-1, we still achieve a reduction of 12-15 times when compared to the original inverter. Moreover, the high and low levels are only about 500mV apart and hence require less switching energy. Further reduction in leakage current is possible if multiple cross-coupled inverters share the same depletion device pair.
III. REDUCING STATIC ENERGY IN ON-CHIP CACHES
As we saw in Section II, the immediate application of dep-warmup CMOS is in SRAM arrays. Hence, in this section, we study its application in caches. The latest processors have two or three levels of caches, namely L0, L1, and L2. L0 is closest to the CPU and L2 is closest to the main memory. To improve performance, L2 caches are now placed on-chip and sized up to 1MB. In contrast, L1 cache sizes are typically 32KB to 64KB. L1 miss rates are on average less than 2%, hence L2 is accessed very infrequently compared to L1. Given this, we validate our circuit design technique with the L1 cache, which is in the critical path. If the scheme works with L1, we can infer that it will work with L2 as well (L2 has longer idle periods and is less critical for processor performance).
We use CACTI 3.0 [13] as the base model to evaluate the performance impact. Cacti uses sub-arrays to reduce the bitline and wordline delays. For a 32KB 4-way cache with 32B Table-IV. The cache access time for this configuration is 556.1 ps, which includes the data output driver delay of 76.2 ps. From Table-IV it is clear that the data array delay does not form the critical path in cache access timing. A 32 bit processor with this cache configuration requires a tag array of 19Kbits, which is just 7.4% of the data array size. Therefore, we apply our method of static leakage reduction only to the data array, which is not in the critical path (an increase in data array delay will not impact the cache access time) and consumes more than 92% of the static power in the caches. We can use a depletion device pair per SRAM cell to reduce the leakage power as described in Section II. However, this will increase the SRAM cell size, and the area increase will outweigh the leakage power reduction. An alternative is to share the depletion mode transistors among multiple SRAM cells. The wordline access signal can be used to control the depletion devices since it already encodes the active periods of an SRAM cell. Hence, no additional control signals need be generated. There are two ways to share the depletion or the voltage clamp devices. As we saw in Table-II, the fall time $t_f$ is approximately 4 times the rise time $t_r$. We decided to increase the depletion PMOS width by a factor of 4. This leads to approximately equal $t_r$ and $t_f$. For an SRAM cell, which is a cross-coupled inverter, write time is not the bottleneck as the cross-coupling effect aids the state transition. The read time is impacted only by the discharge path. Hence increasing the depletion PMOS width alone is justified. From the Table-IV data, the data array delay is 85.2% of the tag array delay and the data bitline delay is 2.73% of the data array delay. 
Hence an increase of up to 6 times in the bitline delay will still not increase the overall cache access time. We try to exploit this fact Figure-3 . We use low-$V_t$ devices in the cross-coupled inverters and high-$V_t$ devices for the access transistors, in order to reduce the leakage due to the access transistors. Except for the access transistors, all the other transistors have $|V_t|$ = 0.2V.
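Both numerical claims above, the 19Kbit tag array and the bitline delay slack, can be checked with a short sketch. It assumes that the tag array stores only tag bits (no valid or dirty bits), that cache parameters are powers of two, and that the data and tag paths run in parallel so that the access time is set by the slower of the two; the function names are ours.

```python
def log2_exact(n: int) -> int:
    """log2 of a power of two, computed without floating point."""
    if n <= 0 or n & (n - 1):
        raise ValueError("expected a power of two")
    return n.bit_length() - 1


def tag_array_bits(cache_bytes: int, ways: int, line_bytes: int,
                   addr_bits: int = 32) -> int:
    """Total tag storage in bits (tag bits only, an assumption)."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    tag_bits = addr_bits - log2_exact(sets) - log2_exact(line_bytes)
    return lines * tag_bits


def max_bitline_scale(data_over_tag: float, bitline_over_data: float) -> float:
    """Largest factor by which the bitline delay can grow before the data
    path becomes as slow as the tag path (all other delays unchanged)."""
    bitline = bitline_over_data * data_over_tag   # in units of tag delay
    slack = 1.0 - data_over_tag
    return 1.0 + slack / bitline


bits = tag_array_bits(32 * 1024, ways=4, line_bytes=32)
share = bits / (32 * 1024 * 8)
print(f"tag array: {bits} bits ({bits // 1024} Kbits), {share:.1%} of data array")
print(f"bitline delay may grow to {max_bitline_scale(0.852, 0.0273):.2f}x")
```

Under these assumptions the bitline delay has room for slightly more than a sixfold increase, which is consistent with the margin stated in the text.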
A. Leakage Reduction
The most relevant parameters of the proposed dep-warmup CMOS cache are the leakage reduction achieved by this circuit and the steady-state voltages of the nodes. The current measurements presented in Table-V are for a single SRAM cell. The steady-state values were measured after a bit was written into the cell. When compared to the values presented in Section II for the cross-coupled inverter case, we see that $V_H$ has decreased, i.e., moved closer to $|V_{TdepN}|$. This is because one depletion mode NMOS is shared by 16 SRAM cells, which together must share the leakage current that this single device supplies; hence the leakage current per cell has to decrease. $V_L$ has also moved closer to $V_{dd} - |V_{TdepP}|$, but not as much as $V_H$. This is because we have increased the width of the depletion mode PMOS by 3 times over the minimum, so it can supply more leakage current.
We can reduce the leakage current further by increasing the sharing, but that will affect the transition delay and hence impact the performance. From Table-V it is clear that we have achieved a static power reduction of more than 23 times. However, we need to assess the performance impact as well. The factor that contributes to performance reduction is the delay in charging the $V_{PWR}$ node from 0.686V to 1.0V. In the following two sub-sections we analyze the performance impact on write and read operations.
B. Performance Impact on Write Operation
We performed HSPICE simulations on a subarray of size 128 rows by 256 columns (as given by Cacti) to study the Table-VI gives the transition delay values for both the base configuration and the warm SRAM configuration. These delays are measured from the point at which the WL signal is completely ON. It is clear from these values that the write operation is not affected by the presence of the depletion mode devices. First, we benefit from the depletion PMOS being switched on ten times faster than the access transistors. Therefore $V_{GND}$ transits to zero even before the access transistors are completely opened, allowing the bit to become zero with zero delay. Second, since the bit transits from a non-zero initial value to $V_H$, the peak current requirement for the transition is smaller. Thus the single depletion NMOS transistor is able to supply the required current for 32 inverter transitions. This also illustrates that the proposed circuit uses less energy for bit transitions.
However, the warm SRAM has one disadvantage in terms of energy: every time a bit is written, whether the bit is transitioning or not, the $V_{PWR}$ node has to be raised from $V_H$ (≈700mV) to $V_{dd}$. This also applies to the output node of an inverter that is in the $V_H$ state. Since the $V_{PWR}$ node has capacitance 518*$C_{diff}$ (per wordline) and the output node of an inverter has capacitance 768*$C_{diff}$ (per wordline), this energy will approximately equal 327.9*$C_{diff}$. For 70nm technology, $C_{diff}$ is in the range of 0.11fF. Hence the extra energy spent will be approximately 36.07 fJ. Note that this extra energy is paid only when the memory cell bit does not change state due to the write. This is because the normal write scheme does not consume any energy in such a case, whereas warm SRAM still needs 0.14 fJ per bit (where the state does not change). Hence, the extra write energy is proportional to the number of bits that do not change state. We calculated the write energy by integrating the input current. This energy is parameterized by the number of bit transitions (bits changing state). The results are shown in Table- The warm SRAM will spend at most 36.07fJ of extra energy for any access causing fewer than 64 bits to transit. When compared to the dynamic energy per cache access, which is estimated by Cacti as 0.3nJ, this extra energy is insignificant. Hence we can safely assume that it has little effect on the overall dynamic energy. 
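The extra write energy quoted above can be reproduced with a short calculation. The capacitances, $C_{diff}$ = 0.11fF, $V_H$ ≈ 0.7V and the 256 columns per wordline are from the text; the energy expression $\frac{1}{2} C (V_{dd}^2 - V_H^2)$ is our assumption, chosen because it reproduces the quoted 327.9*$C_{diff}$ figure.

```python
def recharge_energy_coeff(cap_units: float, v_dd: float, v_h: float) -> float:
    """Energy to lift cap_units * C_diff from v_h to v_dd, expressed as a
    multiple of C_diff. ASSUMPTION: E = 0.5 * C * (V_dd^2 - V_H^2)."""
    return 0.5 * cap_units * (v_dd ** 2 - v_h ** 2)


CAP_VPWR_UNITS = 518      # V_PWR node capacitance per wordline, in C_diff
CAP_OUT_UNITS = 768       # inverter output nodes per wordline, in C_diff
C_DIFF_FF = 0.11          # diffusion capacitance in fF (70nm)
BITS_PER_WORDLINE = 256   # columns in the simulated subarray

coeff = recharge_energy_coeff(CAP_VPWR_UNITS + CAP_OUT_UNITS, 1.0, 0.7)
extra_fj = coeff * C_DIFF_FF
per_bit_fj = extra_fj / BITS_PER_WORDLINE
print(f"{coeff:.1f} * C_diff = {extra_fj:.2f} fJ per wordline, "
      f"{per_bit_fj:.2f} fJ per bit")
```

Relative to the 0.3nJ dynamic energy per access estimated by Cacti, 36.07fJ is roughly 0.01%, which supports the conclusion that the overhead is negligible.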
C. Performance Impact on Read Operation
The most critical operation in a cache is the read operation, which occurs twice as often as writes. It is critical because load latency cannot be hidden: the instructions waiting on read results often stall. The store (write) latency, in contrast, can be easily hidden with write buffers, since no instructions in the immediate vicinity depend on the outcome of the write operation. As we observed from Table-IV, however, it is the tag array access that is in the critical path of a cache read. Hence we can exploit this slack to make the data array path slower without impacting the cache access time. This is the basis for the chosen depletion mode transistor widths. If we have to reduce the delay any further, we can reduce the degree of sharing, trading it for an increase in leakage current. As per Cacti, the bitline delay is defined as the delay between the time at which the wordline enable is ON and the time at which the voltage difference between bit and bitbar becomes 100mV. Cacti uses a precharge voltage of 0.7 V for the 70nm technology and estimates the bitline delay to be 11.7 ps. We varied the precharge voltage from 0.7V to 0.5V in 50mV steps to study its influence on the bitline delay for both the base circuit and the warm SRAM circuit. The results are shown in Table-VIII. The difference between Cacti's estimate and our results can be attributed to the high-$V_t$ access transistors. Since we are interested in estimating the leakage savings in warm SRAM, we used high-$V_t$ (=0.39V) access transistors in all the other reported simulation results. We used a pulse width of 200 ps for the wordline signal in our simulations. The warm SRAM bitline delays for the 0.7V and 0.65V precharge voltages are greater than this 200 ps pulse width. Hence, the bitlines did not achieve the 100mV difference. Furthermore, since the bitline delay at 0.5V closely matches Cacti's estimate, we use 0.5V as our precharge voltage for all further simulations. 
From Table-VIII, it is clear that the bitline delay increase for precharge voltages of 0.5 V and 0.55 V increases neither the cache access time nor the wave pipelined cycle time. Thus warm SRAM does not have any performance impact on cache access time. As is the case with writes, the read operation also requires extra energy to charge the nodes to $V_{dd}$. This can be avoided if we do not switch on the depletion NMOS for the read operation, but that requires additional control logic and a separate driver for the depletion NMOS transistors. It may not pay off, since the extra energy spent for charging $V_{PWR}$ and the output node is at most 36fJ (based on our prior estimate), which is insignificant when compared to the per-access dynamic energy of the cache. This energy estimate is a worst case, since the $V_{PWR}$ node acts like a capacitor and is discharged by the leakage current in the circuit. The time the node takes to reach the steady state depends on the magnitude of the leakage current. It takes more than 200ns to leak down to the steady-state value from $V_{dd}$ at room temperature (T = 25 °C), as shown in Figure-4. If the cache line is accessed again within this 200ns period, we spend less than 36fJ on charging $V_{PWR}$ and the output node to $V_{dd}$. We calculated the energy used per wordline access for different inter-access time intervals. The results are shown in Table-IX. These results are very close to our estimates. They also show that if two accesses happen within 25 ns of each other, we do not spend any extra energy compared to the base SRAM cell design. The reason why the warm SRAM energy is less than that of the base circuit at a 25 ns inter-access gap is as follows. During the read operation, the potential of the SRAM cell node that is pulling the bitline down rises by a small amount above GND, because the discharging NMOS transistor cannot conduct with $V_{ds}$ at 0V. This causes a short circuit path from $V_{dd}$ to ground. Moreover, this effect is larger in the base circuit as it is discharging more current.
D. Architecture/Program Level Energy Estimation
In the preceding sections we saw how warm SRAM can reduce the leakage energy without impacting the read/write access times of a cache. In this section, we evaluate the static leakage energy savings for a hypothetical 32KB, 4-way cache. The extra access energy for warm SRAM to charge $V_{PWR}$ and the output node up to $V_{dd}$ is accounted for in a cycle-by-cycle microarchitecture simulator, Simplescalar version 3.0 [8], which captures the average number of cache accesses. We executed the SPEC2000 [6] integer benchmarks on this simulator. Suleyman Sair et al. [9] analyzed the memory behavior of the SPEC2000 benchmarks and found the 500 million execution window that closely matches the characteristics of the whole application. We use Simplescalar 3.0 with the configuration shown in Table-X and the execution window specified in [9] to collect the statistics. In order to calculate the net energy savings, we need to account for the extra energy incurred by this method. We have so far seen one source of extra energy, i.e., the energy spent in bringing warm SRAM to the normal state. This energy penalty will be significant only if the accesses occur more than 50 ns apart. Table-XI lists the percentage of L1 cache accesses within a 50 cycle window and a 100 cycle window. From Table-XI it is evident that more than 80% of the accesses occur within 100 cycles of another access. Using this fact and the extra energy figures presented in Table-IX, we can estimate the extra energy per access. These numbers depend on the cycle time. We use both 0.2ns and 0.5ns as the cycle times to illustrate the energy savings. In both cases, 100 cycles is no more than 50ns (20ns and 50ns respectively). Hence we penalize only the accesses occurring more than 100 cycles after another access, with 33.52fJ each. We chose 0.2ns because it is the wave pipelined cache cycle time estimated by Cacti and 0.5ns because it approximates the cache access time. The energy saving is calculated with the leakage currents given in Table-V. 
For a 32KB cache, this gives a base leakage power of 1.638 mW and a warm SRAM leakage power of 68 µW. Using these figures and taking the energy penalty incurred by our method into consideration, we achieve a 90% reduction in leakage energy on average in a 32KB data array (for both 0.2 and 0.5 ns cycle times) without impacting the cache access time. Table-XII lists the energy savings for various SPEC2000 integer benchmarks for both the 0.2ns and the 0.5ns cycle time.
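The way the warm-up penalty enters the net saving can be sketched as follows. The leakage powers and the 33.52fJ penalty are from the text; the access rate and the penalized fraction used in the example call are hypothetical placeholders rather than values from Table-XI, and the sketch charges the penalty once per penalized access. It is therefore not an attempt to reproduce Table-XII.

```python
def net_leakage_saving(p_base_w, p_warm_w, cycle_time_s,
                       accesses_per_cycle, frac_penalized, e_penalty_j):
    """Fractional static-energy saving of warm SRAM relative to the base array.

    The warm-up penalty is converted to an equivalent average power:
    (penalized accesses per second) * (energy per penalized access).
    """
    penalized_per_s = accesses_per_cycle * frac_penalized / cycle_time_s
    p_penalty_w = penalized_per_s * e_penalty_j
    return 1.0 - (p_warm_w + p_penalty_w) / p_base_w


P_BASE_W = 1.638e-3    # base leakage power, 32KB cache
P_WARM_W = 68e-6       # warm SRAM leakage power
E_PENALTY_J = 33.52e-15

# Upper bound: no access ever pays the warm-up penalty.
upper_bound = net_leakage_saving(P_BASE_W, P_WARM_W, 0.2e-9, 0.0, 0.0, E_PENALTY_J)
# HYPOTHETICAL workload: 0.3 accesses per cycle, 20% of them penalized.
example = net_leakage_saving(P_BASE_W, P_WARM_W, 0.2e-9, 0.3, 0.2, E_PENALTY_J)
print(f"upper bound: {upper_bound:.1%}, example workload: {example:.1%}")
```

The upper bound follows directly from the two leakage powers; the actual saving for a benchmark depends on how often its accesses fall outside the 100 cycle window and on how many wordlines are recharged per access.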
IV. MODEL VALIDITY
In all our HSPICE simulations, we have assumed that an enhancement mode model file, with its threshold voltage modified to suit the depletion device, is able to model the depletion device's characteristics. We justify the validity of this assumption in this section. We discuss only the depletion NMOS case; however, the discussion is equally applicable to the depletion PMOS. 
Where,
Note that a depletion mode transistor has a higher body effect compared to enhancement mode devices. Hence $N_d$ and $d_I$ should be varied to get the required device characteristics. The first point of consideration is that the depletion NMOS should be cut off when $V_{sb}$ = 0.65V and $V_{gs}$ = 0V. Hence by equating (3) to -0.65V, we can solve for $N_d$ and $d_I$. The second point of consideration is that when $V_{gs}$ = 1V the gate should have a gain comparable to what is predicted by the enhancement mode model. If these two operating points can be verified, then our circuit will yield a similar result when proper depletion models are used for the devices.
There are four regions of operation in a depletion device:
1) Cut-off: the device is completely depleted at the source end.
2) Surface depletion: the surface is depleted but the buried channel exists, conducting the current.
3) Surface accumulation: as $V_{gs}$ increases beyond threshold, inversion occurs and carriers are accumulated on the surface.
4) Surface accumulation/depletion: as $V_{db}$ or $V_{sd}$ increases beyond a certain value, it depletes the channel on one of the sides, making the device behave like a saturated enhancement mode device.
Our operating points should be in the cut-off and surface accumulation regions. The device enters the surface accumulation region if $V_{gs} > V_N$, where $V_N = V_{FB} + \phi_{bi}$. The gain of the device in the surface depletion region is given by
. Even though $\mu_b$, the bulk mobility, is typically larger than the surface mobility, the factor $(1 + \sigma)$ tends to reduce the gain in the surface depletion region when compared to the enhancement mode device. Once surface accumulation occurs, the variations of channel charge with $V_{gs}$ occur at the surface, and thus the gain is determined by the surface mobility and oxide thickness only. In other words, the gain is comparable to that of the enhancement mode device. One problem with a depletion device could be that, if the implantation depth is too high, controlling the buried channel with the gate voltage alone will not be possible, as surface inversion will occur before the pinch-off condition can be reached. This can be overcome if the depth is not made very large or if the buried channel is depleted with a high $V_{sb}$. In our circuit, since both drain and source are always above 0.65 V, this condition can never occur.
The process parameters for the 70 nm technology [4] we are considering are, Note that there are various possible solutions, but we should choose the one with the lowest $\sigma$. Depending on the process and the maximum $N_d$ allowed, the device has to be scaled to take into account the decrease in the gain. In all these cases, note that $V_N < 0$, hence both $V_{gs}$ and $V_{gd}$ are greater than $V_N$. This places the device in the surface accumulation region when switched on. Hence, our assumption of validating the warm SRAM circuit technique with enhancement device models is justified. With proper depletion mode models, we can verify other characteristics of the device, such as its subthreshold leakage characteristics. However, to within a first order approximation, our circuit can achieve the claimed static energy reduction.
V. CONCLUSION
Static leakage energy is one of the biggest challenges facing the semiconductor industry in the near to intermediate term future (3-10 years). The static energy consumption grows exponentially with reduction in feature size. On-chip caches occupy a major fraction of the processor's area. This holds both for high performance and for low power embedded processors. Since static leakage energy constitutes a large fraction of cache energy, and hence of total processor energy, on-chip caches form a good target for static energy reduction techniques. We proposed, modelled, and verified a new CMOS design style, warmup CMOS. These circuits operate best when warmed up like an engine, that is, when some of the key nodes have attained a certain potential. The warm-up cost is paid only once for each activity period, and the savings more than offset it. The specific version of warmup CMOS presented in this paper is based on depletion mode devices: dep-warmup CMOS. We presented an SRAM cell design in dep-warmup CMOS and its block level implementation. The detailed SPICE simulations estimate the static leakage energy savings for L1 caches at more than 90% without any effect on performance. We are further investigating other warmup CMOS design styles and their application to broader logic blocks.
