Increasing leakage currents combined with reduced noise margins are seriously degrading the robustness of dynamic circuits. This paper describes a dynamic implementation of a 256x32b 4-read/write-port Register-File for ~6GHz operation at 1.2V in a 0.13µm technology. The pre-charged local bit-lines utilize an efficient conditional keeper technique, in which a large fraction of the keeper is turned ON only if the dynamic output remains High in the evaluation phase. Using this technique, we are able to improve upon all-low-Vt performance by 4% while maintaining dual-Vt usage. As a result, the robustness is improved by 96% and the active leakage power is reduced by 5X.
Introduction
High fan-in compact dynamic gates are often employed in performance-critical units of microprocessors and other high-performance VLSI circuits. The use of wide dynamic gates is strongly impacted by reduced noise margins and increasing leakage currents in sub-0.13µm low-Vt devices. Traditionally, floating dynamic nodes have been avoided by employing a static path through a pull-up and/or pull-down device referred to as a "keeper". For small leakage currents, weak keepers were sufficient to maintain the voltage level of pre-charged nodes without a significant impact on the performance of the dynamic gates. However, in the presence of increasingly larger leakage currents, the keepers must be sized up to compensate for the leakage, which significantly degrades the performance of dynamic circuits. Fig. 1 shows an example of a K-bit wide dynamic gate, a K-bit wide MUX, with the standard keeper PK0. To ensure correct operation during the evaluation phase (clock High), two worst-case conditions must be fulfilled:
1- Worst-case noise (where the dynamic output remains High): Vss (Low) plus a DC noise on the gates of M11, M12, ..., M1K, and Vcc (High) on the gates of M21, M22, ..., M2K.
2- Worst-case delay (High-to-Low transition): Vss on all the pull-down transistors, except the stack transistors M11 and M21, which are turned on to pull down the output node.
Thus the task of the keepers is not only to compensate for Ioff leakage currents but also for the higher sub-threshold currents caused by the potential worst-case noise on the inputs of all the pull-down devices. The same circuit (Fig. 1) effectively shows the local read path of conventional Register-Files, where read-select signals (on M11, M12, ..., M1K) select one of the storage cells. Register Files are performance-critical processor components with single-cycle read/write latency and high throughput requirements. A large number of read-select entries per port enforces the use of wide dynamic MUX structures. The elevated Ioff in sub-0.13µm technologies necessitates alternative dynamic techniques to achieve low read-path delays while simultaneously meeting robustness requirements.
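The worst-case noise condition above can be illustrated with a rough first-order estimate. The sketch below assumes the textbook sub-threshold behavior of one decade of current per sub-threshold swing of gate voltage; the function name, the per-device Ioff, the swing, and the noise amplitude are hypothetical values chosen for illustration, not data from this work.

```python
def worst_case_leakage(k_inputs, i_off, v_noise, swing):
    """Aggregate pull-down leakage (A) of a K-wide dynamic gate.

    Each of the K read-select devices leaks i_off at zero gate noise; a DC
    noise v_noise on its gate raises the sub-threshold current by one decade
    per `swing` volts (first-order model, stack effect ignored).
    """
    per_path = i_off * 10 ** (v_noise / swing)
    return k_inputs * per_path


# Hypothetical numbers, for illustration only.
K = 16            # width of the dynamic MUX
I_OFF = 100e-9    # assumed off-current per pull-down path, in amperes
SWING = 0.1       # assumed sub-threshold swing, in volts per decade

quiet = worst_case_leakage(K, I_OFF, 0.0, SWING)
noisy = worst_case_leakage(K, I_OFF, 0.1, SWING)
print(f"leakage without noise: {quiet:.2e} A")
print(f"leakage with 100 mV DC noise: {noisy:.2e} A")
```

Under these assumptions, the current a keeper must supply grows linearly with the MUX width and exponentially with the input noise, which is why keeper sizing becomes costly for wide gates.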
ISLPED '01, August 6-7, 2001
2.
A 256x32b register-file in a 1.2V, 0.13µm dual-Vt CMOS technology with copper interconnect [4] is described for 6GHz operation. Single-ended read-select and bit-line signaling is used to reduce wiring congestion, enabling 4-read, 4-write port capability in a dense layout occupying 356µm x 89µm. Fig. 3 shows the register-file bit-cell, with symmetric loading of 2 read ports on each side of the storage cell for optimal cell stability [5]. Matched pass transistors on each side of the storage cell enable single-ended write. Fig. 4 shows the local read organization. Two access transistors per word (M1 and M2) read the data in the storage cells, forming a dynamic 16-way MUX on the local read path.
Fig. 7 shows one topology of the CKP technique [1, 2]. It employs two keepers: a fixed keeper, PK1, and a conditional keeper, PK2. At the onset of the evaluation phase (clock Low-to-High), PK1 is the only active keeper. After a delay time, Tkeeper = TDelay element + TNAND, the keeper PK2 is activated only if the dynamic output is still High (Fig. 8).
[Figure: timing scheme of the read operation]
The local bit-line (LBL) and global bit-line (GBL) dynamic MUXes are susceptible to noise due to the increasingly large active leakage and potential input noise during the evaluation phase, when pre-charged dynamic nodes should stay High. LBL is more sensitive than GBL due to a smaller stored charge (0.1x compared to GBL) and a wider dynamic MUX structure (16-way for LBL vs. 8-way for GBL). The goal of this work has been to increase the robustness of the LBL operation without any considerable impact on the power consumption and performance of the Register File.
Review of Conditional Keeper Technique, CKP
As discussed in Sec. 1, larger standard keepers can increase the robustness of the read path. However, this results in significant contention, which degrades the performance and increases the power consumption. In the next section, we describe a technique that results in a significant robustness improvement and leakage reduction without any performance/area penalty or any modification of the standard bit-cell topology. Knowing the worst-case time for a potential output High-to-Low transition, the highest performance can be achieved when PK2 is activated close to or later than the worst-case clock-to-output transition, TMAX. The fixed keeper, PK1 (Fig. 7) ... Another optional but important advantage of the keeper circuit in Fig. 7 is that an inversion of the input signal to PK2 provides a domino-compatible dual output. When needed, this can save a significant amount of area and power, as a single-rail wide gate offers the same function as its dual-rail counterpart. The CKP technique [1] is an enhanced version of [2], verified at the worst-case Ioff corner of a 0.13µm technology. A special case of the general concept in [2] was later published in [6], where the standard keeper was removed. In the next section we show that the standard keeper is required to maintain robustness during the time (2-3 inversion delays) required to activate the conditional keeper.
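The activation rule of the two keepers can be summarized with a small behavioral sketch. It is not a circuit simulation: times are normalized to TMAX, keeper widths are normalized to the total keeper width, the function names are hypothetical, and PK1 is treated as on for the whole evaluation phase, which is a simplifying assumption.

```python
def pk2_is_on(t, t_keeper, node_high):
    """PK2 turns on only after Tkeeper, and only if the dynamic node is still High."""
    return t >= t_keeper and node_high


def contention_width(t, t_keeper, node_high, w_pk1, w_pk2):
    """Keeper width opposing the pull-down network at time t after the clock edge.

    Assumption: the fixed keeper PK1 is on throughout the evaluation phase.
    """
    if t < 0:
        raise ValueError("t is measured from the rising clock edge and must be >= 0")
    return w_pk1 + (w_pk2 if pk2_is_on(t, t_keeper, node_high) else 0.0)


# Normalized illustration: TMAX = 1.0, Tkeeper = 1.25, keeper split 0.25 / 0.75.
T_KEEPER = 1.25
early = contention_width(0.5, T_KEEPER, True, 0.25, 0.75)      # before Tkeeper
hold = contention_width(1.5, T_KEEPER, True, 0.25, 0.75)       # node stayed High
discharged = contention_width(1.5, T_KEEPER, False, 0.25, 0.75)  # node went Low
print(early, hold, discharged)
```

In this model a High-to-Low transition that completes before Tkeeper only ever fights the small keeper PK1, while a node that stays High is eventually held by the full keeper width.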
Simulation and Comparisons
We have replaced the standard keepers on the local bit-lines (Fig. 4) with the CKP technique (Fig. 7). As described in the previous section, the CKP technique can deliver either higher performance or higher robustness, depending on the total strength of the keepers PK1 and PK2 and on their gain ratio h = W(PK1)/W(PK2).
Robustness Analysis
The robustness analysis for the 16-bit MUX with the CKP technique must consider two different cases, during two different time slots (Fig. 9):
1- During the time Tkeeper = TDelay element + TNAND, when PK1 is the only active keeper.
2- After Tkeeper, when both PK1 and PK2 are active (when the dynamic output should remain High).
Fig. 9: Simulation waveforms at the worst-case leakage corner with an applied DC noise on the inputs of the dynamic gates. Plotted: CKP activation (output of the CKP's NAND), Tkeeper, and input noise (UGN level), over a time axis from 0.1ns to 0.5ns.
The keeper PK1 meets the target robustness during Tkeeper, as the noise on the output of the NAND gate (Fig. 8) does not exceed the final output noise of the NAND gate following the dynamic circuit with the standard keeper (Fig. 5).
After Tkeeper, both PK1 and PK2 are conditionally ON, and the DC noise robustness of the MUX with CKP is equal to that of the conventional MUX with the standard keeper PK0, since W(PK0) = W(PK1) + W(PK2). Under this criterion, the robustness of the CKP technique is comparable to that of the conventional technique. This robustness criterion is used for all comparisons between the CKP and the standard keeper technique in the next section.
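The sizing constraint above, together with the gain ratio h = W(PK1)/W(PK2), fixes both keeper widths once the standard keeper width is known. A minimal sketch follows; the function name is hypothetical and the widths are normalized to W(PK0).

```python
def split_keeper(w_pk0, h):
    """Split a standard keeper of width w_pk0 into (w_pk1, w_pk2).

    Solves W(PK1) + W(PK2) = W(PK0) together with W(PK1) / W(PK2) = h.
    """
    if w_pk0 <= 0 or h <= 0:
        raise ValueError("keeper width and ratio must be positive")
    w_pk2 = w_pk0 / (1.0 + h)
    w_pk1 = w_pk0 - w_pk2
    return w_pk1, w_pk2


# The ratio 0.25/0.75 is the one reported for the all-low-Vt comparison.
w_pk1, w_pk2 = split_keeper(1.0, 0.25 / 0.75)
print(f"W(PK1) = {w_pk1:.2f}, W(PK2) = {w_pk2:.2f}")
```

With this split, only a quarter of the standard keeper width opposes the pull-down network before Tkeeper, while the full width is available afterwards to hold a node that remains High.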
Simulation Results
The 0.13µm technology offers two threshold voltages for each device. We have simulated the local read operation of the Register-File, comparing the performance, robustness, and energy/transition of all-low-Vt (LVT) and dual-Vt (DVT) 16-bit conventional MUXes (STD) with those utilizing the CKP technique. In the dual-Vt case, the read-select pass-transistors (M1 in Fig. 4 and Fig. 7) as well as the keepers are high-Vt devices, while the rest of the devices are low-Vt. Simulations are performed at the worst-case Ioff corner of the 0.13µm technology, at Vcc=1.2V, 110°C. To verify the sensitivity of the delay and robustness of the CKP technique to any potential clock-to-data race, Tkeeper = TDelay element + TNAND was swept over a wide range at a fixed worst-case clock-to-output transition, TMAX. The X-axis in Fig. 10-12 shows the ratio Tkeeper/TMAX.
Fig. 10 shows the bit-line evaluation delay for the 16-bit LVT MUXes. For the conventional MUX (STD), the standard keeper is sized such that the UGN-level is marginally above the specified noise floor. This gives the fastest bit-line evaluation at the target robustness. At this point, the CKP-MUX uses a keeper ratio h = W(PK1)/W(PK2) = 0.25/0.75, which is sufficient to meet the iso-robustness criterion for Tkeeper/TMAX as large as 1.25. Fig. 10 shows that the CKP technique results in up to 19% higher performance at a comparable level of robustness. Further, the performance improvement is relatively insensitive to clock-to-data race. The figure suggests that the best performance improvement is achieved by activating the conditional keeper close to or after a potential worst-case output transition, with the optimum design point at Tkeeper/TMAX = 1.25, which results in maximum performance without violating the described robustness criterion.
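Since Tkeeper = TDelay element + TNAND, choosing a target ratio Tkeeper/TMAX determines the delay that the delay element has to provide. The sketch below shows this arithmetic; the function name and the TMAX and TNAND values in picoseconds are hypothetical, chosen only to illustrate the relation.

```python
def delay_element_budget(t_max_ps, t_nand_ps, ratio):
    """Delay-element delay (ps) giving Tkeeper / TMAX = ratio.

    Uses Tkeeper = TDelay element + TNAND, so
    TDelay element = ratio * TMAX - TNAND.
    """
    t_delay = ratio * t_max_ps - t_nand_ps
    if t_delay < 0:
        raise ValueError("the NAND delay alone already exceeds the target Tkeeper")
    return t_delay


# Hypothetical delays; 1.25 is the optimum ratio reported above.
budget = delay_element_budget(t_max_ps=40.0, t_nand_ps=10.0, ratio=1.25)
print(f"delay element: {budget:.1f} ps")
```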
[Figure panel: All-Low-Vt design; x-axis: Tkeeper/TMAX]
The additional power consumption due to the circuit overhead and the larger clocked load is compensated by the reduced contention during the output High-to-Low transition (Fig. 11). This results in an energy/transition comparable to that of the conventional case.
Low-Vt / Dual-Vt Simulation results
The previous all-low-Vt simulations were performed at the lowest acceptable noise margin to achieve the highest performance, where the CKP technique resulted in 19% faster bit-line evaluation. There are two main techniques to increase the robustness to a higher level: 1- Upsizing the keepers for the all-low-Vt conventional and CKP-based LBL MUXes. 2- Utilizing high-Vt (HVT) devices. Fig. 12 shows the delay-robustness trade-off for both cases. The figure shows that keeper upsizing is an inefficient technique for increasing the robustness, as it results in significant delay penalties. Still, the CKP all-low-Vt MUX maintains its relative performance benefit. However, Fig. 12 also shows that utilizing HVT devices is a much more efficient way to increase the robustness.
Comparing the low-Vt result with the dual-Vt result (the two points), we see clearly that large delay-robustness trade-offs are involved for relatively small performance improvements. Since the CKP technique is independent of Vt levels, it also results in 17% less bit-line evaluation time for the dual-Vt case. The interesting observation is that this performance improvement is about the same as the all-low-Vt performance improvement (compared to the dual-Vt design). Another important consequence of the dual-Vt scheme is that the leakage power consumption is also significantly reduced. Table 1 summarizes the performance/robustness comparisons, normalized to the standard all-low-Vt design. The dual-Vt LBL MUX with the CKP technique results in ~2X higher robustness and 5X less active leakage power consumption at a comparable level of performance (4% faster). The all-low-Vt LBL MUX with CKP results in 19% faster MUX operation and 10% faster total LBL read operation. The total LBL performance improvement is partially screened by the driver delay and the merge delay, which are fixed delay times for all the circuit alternatives. For highest performance, the CKP technique enables a total local read delay of 86ps. This allows the multi-ported Register File to operate at a 5.8GHz clock.
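The relation between the 86ps local read delay and the 5.8GHz clock can be checked with a short calculation. The sketch assumes that the local read has to complete within one clock phase, that is, half a clock cycle; this reading is an assumption that happens to reproduce the reported frequency, and the function name is hypothetical.

```python
def max_clock_ghz(read_delay_ps, phases_per_cycle=2):
    """Clock frequency (GHz) at which one clock phase equals read_delay_ps.

    Assumption: the local read occupies exactly one of `phases_per_cycle`
    equal phases of the clock period.
    """
    period_ps = phases_per_cycle * read_delay_ps
    return 1e3 / period_ps  # 1 / ps equals 1000 GHz


f_ghz = max_clock_ghz(86.0)
print(f"{f_ghz:.2f} GHz")
```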
Conclusions
In this paper we have described a dynamic implementation of a 256x32b 4-read/write-port Register-File for ~6GHz operation at 1.2V in a 0.13µm technology. The pre-charged local bit-lines utilize an efficient conditional keeper technique, CKP, in which a large fraction of the keeper is turned ON only if the dynamic output remains High in the evaluation phase. Using this technique, we are able to improve the performance of the dual-Vt-based circuit over the all-low-Vt-based one by 4%, while reducing the active leakage currents by 5X and increasing the noise robustness by ~2X. Alternatively, up to 19% higher performance at comparable robustness has been observed for all-low-Vt-based wide MUXes utilizing the CKP technique.
