ABSTRACT
INTRODUCTION
Modern microprocessors operate at clock frequencies above 3GHz and have close to 100 million transistors on die. The need for improved performance and more functionality has resulted in aggressive technology scaling. As this trend continues, device geometry, transistor threshold voltage (V_TH) and supply voltage (V_DD) are expected to scale further. This will worsen short channel effects (SCE) and increase the transistor OFF-state current (I_OFF).
In addition, higher operating frequency, leakage currents and on-die transistor count will increase total power. Some of these scaling trends are shown in Figure 1 using data from the ITRS reports (2001, 2002) [1]. The ITRS long-term projections indicate that, by the year 2016, the on-die clock frequency of high-end microprocessors might reach 29GHz while the total power consumption would be about 288W. This increase will offset the savings in switching energy obtained from technology scaling and shorten battery life for mobile devices. It will also increase the likelihood of thermal hot spots and thermal runaway during stress testing (burn-in). Consequently, the long-term reliability of high-end digital ICs may be compromised in deep submicron (DSM) technologies.
Figure 1: ITRS roadmap near and long-term projections
Several design techniques have been proposed to minimize transistor leakage and system level power consumption in high performance ICs [2-7]. These include transistor level leakage control techniques such as dual V_TH techniques [2], multi-oxide or non-minimum channel length transistors [4], reverse body bias (RBB), and the stack effect. In this paper, we present the design of a CPL-based, dual supply, 32-bit ALU and demonstrate the scaling trends of its delay, total energy and leakage power for the 180nm-65nm bulk CMOS technologies. Our 180nm results correspond to a TSMC process, while the 130nm-65nm results pertain to the Berkeley Predictive Technology Models [8, 9]. The rest of this paper is organized as follows. In section 2, we discuss the impact of supply voltage reduction on transistor leakage currents. In section 3, we present the circuit level techniques used in the ALU design to achieve low power operation. We discuss the energy-delay tradeoffs and ALU scaling trends in section 4, and section 5 concludes the paper.
Supply Scaling and Transistor Currents
The impact of supply voltage scaling on transistor leakage and ON-state saturation currents is discussed in this section. A simplified expression for the transistor OFF-state current is given by [2] :
Supply scaling reduces the transistor drain-source voltage and helps to minimize the DIBL current. The reduction in the transistor OFF-state current due to supply scaling can be expressed as [12] :
The data in Figure 2 indicate that a 30% reduction in supply voltage results in up to a 32% reduction in the I_OFF current while lowering the I_GATE component by 84%. However, this also lowers the gate overdrive voltage and reduces I_DSAT by ~48%, which may cause performance degradation in DSM logic circuits. In subsequent sections, we demonstrate the selective use of a dual supply scheme to reduce ALU total energy consumption while keeping the performance degradation within acceptable limits.
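The exponential dependence of the OFF-state current on drain-source voltage discussed above can be made concrete with a commonly used textbook form of the subthreshold current with a DIBL term. This form is an assumption for illustration and is not necessarily the expression used in [2] or [12]; the symbols I_0 (process-dependent prefactor), η (DIBL coefficient), n (subthreshold slope factor) and v_T = kT/q (thermal voltage) are introduced here and do not appear in the original text.

```latex
% OFF state: V_GS = 0 and V_DS = V_DD
I_{OFF} \approx I_0 \,
  \exp\!\left(\frac{-V_{TH} + \eta V_{DD}}{n\, v_T}\right)
  \left(1 - \exp\!\left(-\frac{V_{DD}}{v_T}\right)\right)

% Ratio for a lowered supply V_{DDL} < V_{DDH}, valid when V_{DD} \gg v_T
\frac{I_{OFF}(V_{DDL})}{I_{OFF}(V_{DDH})}
  \approx \exp\!\left(-\frac{\eta \,(V_{DDH} - V_{DDL})}{n\, v_T}\right)
```

Under this model the leakage reduction depends only on the supply difference and on η/(n v_T), so a technology with stronger DIBL would benefit more from the same supply reduction.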
Low Power Circuit Techniques for ALU Design
In this section, we focus on some of the circuit strategies adopted in this design to achieve low power ALU operation:
1. Using a reduced swing clocking scheme for the non-critical path latches and flip-flops [11],
2. Using CPL logic and C²MOS MUX-es to design the logic/shifter units, minimizing overall switching capacitance and data buffer/driver sizes,
3. Designing the 32-bit adder PG unit (propagate-generate unit) using shared clock footers and reduced buffer sizes.
Latch Design for Dual Supply Clocking
Traditionally, static latches and master-slave FFs use transmission gate (TG) based designs that have both n-MOS and p-MOS clocked transistors. In order to maintain high performance while saving total energy, our goal was to keep the datapath circuitry at the nominal supply voltage while lowering the clock swing of the latch/flip-flop clock transistors. However, under such a scheme, the TG p-MOS transistors do not turn OFF fully, resulting in static current (power) consumption. This problem is aggravated in high performance datapath designs, which normally operate at high junction temperatures (owing to high switching frequency) and use lower V_TH, and thus have exponentially higher I_OFF/µm currents. Figure 3 (a-b) shows the circuitry and energy-delay tradeoffs of a latch that uses only n-MOS clocked transistors, allowing dual supply ALU operation without static power [11]. The master-slave flip-flops used in this design were obtained by cascading two of the latches shown in Figure 3.
Swing Restored CPL-Based Logic Unit
In this paper, we use swing restored CPL logic to design the non-critical units of the ALU. Complementary pass transistor logic allows us to eliminate the p-MOS network that the static CMOS style requires to implement a logic function. This results in lower switching capacitance, smaller data buffer sizes and smaller area for the logic unit. However, CPL logic using n-MOS pass transistors produces a "weak" logic 1, so in our design we used output keepers to restore the CPL gate output signal to full swing. Figure 4 shows the use of the CPL style to implement the logic unit gates for a single bit-slice of the 32-bit ALU.
PG Unit with Clock Footer Sharing
The 32-bit adder forms the performance critical core of the ALU and is implemented using compound domino logic (CDL) [10]. The PG unit of the adder outputs the propagate (A+B) and generate (A.B) signals using dynamic gates with clocked footer transistors. In this design, we replaced the two explicit clock transistors with one common, shared transistor, as shown in Figure 5.
Figure 5: PG unit with shared clock footer for 32-bit ALU adder
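The logical role of the propagate and generate signals can be sketched in a few lines of Python. This is a bit-level behavioral model only: it uses a simple ripple carry recurrence for clarity, which is an assumption made for illustration and not the carry-merge structure of the CDL adder, and the function name `pg_add32` is hypothetical.

```python
MASK32 = 0xFFFFFFFF

def pg_add32(a, b, carry_in=0):
    """Add two 32-bit words using per-bit propagate/generate signals.

    Returns (sum, carry_out). Behavioral sketch only; the carry is
    rippled bit by bit rather than computed by a parallel carry tree.
    """
    a &= MASK32
    b &= MASK32
    carry = carry_in & 1
    result = 0
    for i in range(32):
        a_i = (a >> i) & 1
        b_i = (b >> i) & 1
        p_i = a_i | b_i            # propagate: A + B (logical OR)
        g_i = a_i & b_i            # generate:  A . B (logical AND)
        s_i = a_i ^ b_i ^ carry    # sum bit for this slice
        result |= s_i << i
        carry = g_i | (p_i & carry)  # carry into the next slice
    return result, carry

if __name__ == "__main__":
    total, cout = pg_add32(0xFFFFFFFF, 1)
    print(hex(total), cout)
```

A slice generates a carry when both inputs are 1 and propagates an incoming carry when at least one input is 1, which is why only these two signals per bit are needed by the carry logic.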
It should be noted that by applying this design strategy to the higher order adder bit-slices only (shared footer for bit slices 1 to 31, with bit slice 0 unchanged), it is possible to absorb the delay penalty in the existing slack. This allows us to obtain energy savings with minimal performance degradation for the worst-case delay vector. Table 1 indicates that this sharing saves energy at the expense of performance. For example, when the effective n-MOS clock transistor width is reduced by ~29%, the worst-case delays of the P and G signals increase by ~4ps, while the energy savings are 16% for a data activity (α) of 0.1 and 8% for α=1.
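One way to read the dependence of the savings on data activity is a two-component energy model, sketched below. The model and all of its parameter values are assumptions introduced here for illustration: gate energy is split into a clock part spent every cycle and a data part scaled by α, and footer sharing is assumed to reduce only the clock part. The values `e_clk=0.8`, `e_data=1.0` and `clk_saving=0.18` are not measured data; they were chosen so that the model reproduces the 16% and 8% figures quoted above.

```python
def energy_saving(alpha, e_clk, e_data, clk_saving):
    """Fractional energy saving of a dynamic gate when only the
    clock-related energy is reduced by the fraction clk_saving.

    alpha      -- data activity factor (0..1)
    e_clk      -- clock energy per cycle (arbitrary units, hypothetical)
    e_data     -- data energy per cycle at alpha = 1 (hypothetical)
    clk_saving -- fractional reduction of the clock energy (hypothetical)
    """
    baseline = e_clk + alpha * e_data
    return clk_saving * e_clk / baseline

if __name__ == "__main__":
    for alpha in (0.1, 1.0):
        s = energy_saving(alpha, e_clk=0.8, e_data=1.0, clk_saving=0.18)
        print(f"alpha={alpha}: saving={s:.2%}")
```

In this model the absolute energy saved is the same at any activity, so the relative saving shrinks as α grows and the data energy dominates the total.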
ALU Architecture and Design Overview
We now present an overview of the architecture of a 32-bit ALU and demonstrate the impact of the circuit techniques discussed earlier in ensuring low power operation. We also discuss the scaling trends and energy-delay tradeoffs associated with the circuit techniques for the 180nm-65nm CMOS technologies. The basic ALU architecture is shown in Figure 6 and is similar to that reported in [10]. This full-custom ALU design consists of approximately 11.5k transistors and has an operating frequency of 4.2GHz under worst-case conditions for the 65nm CMOS (Berkeley PTM) technology. The decoder unit in Figure 6 determines the actual instruction (arithmetic, logical, shift) that is executed by the ALU. Both the decoder and the logic/shift units are non-critical in terms of performance and have relaxed timings. Therefore, the decoder is realized using static CMOS logic, while the logic unit and shifter are implemented using swing-restored complementary pass transistor logic (CPL). The MUX-es at the output of the logic unit are realized using C²MOS logic (instead of transmission gates) to avoid cascaded pass transistors. The ALU critical path comprises the arithmetic unit (adder front-end MUX + 32-bit adder) and the output MUX-es. In our design, these units were implemented using compound domino logic (CDL).
Figure 7 shows the energy break-up for the ALU, averaged over 10 cycles of operation (including logic, arithmetic and shift operations). The 180nm technology simulations indicate that the entire clock network accounts for 59.4% of the ALU total energy, while the arithmetic unit consumes 15.6% (worst-case switching vector). In addition, the instruction decoder, logic unit and shifter contribute up to 12.8% of the energy, while the data energy of the input stage flip-flops and the ALU output latch contributes 12.2%.
It should be noted that the results in Figure 7 pertain to the baseline ALU design, with none of the design techniques discussed in section 3 incorporated in it. Henceforth we refer to this design as Design 1.

Energy-Delay Tradeoffs and Scaling Trends
Based on the energy break-up in Figure 7, we reduced the power supply for the latch and flip-flop units at the input-output boundaries of non-critical units such as the decoder and the input data stage. The entire data network and the rest of the clock supply/drivers for the adder, output MUX-es and latch stages were maintained at a higher supply (V_DDH). It should be noted that, for the dual supply design (Design 2), the energy reduction was obtained while keeping the worst-case ALU delay the same as in Design 1. This is possible because of the dual supply assignment strategy followed in this design, whereby all the critical unit signals were maintained at V_DDH. The scaling trends for the worst-case delays of both the ALU and the adder are shown in Figure 9.
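The leverage that a lowered clock supply has on total energy can be estimated with a back-of-the-envelope sketch. The model below is an assumption, not the authors' analysis: it treats the switching energy of a clock node driven from the low supply as proportional to the square of that supply, and it introduces a hypothetical parameter `low_supply_fraction` for the share of clock energy that is moved to the low supply. The 59.4% clock share is the 180nm Design 1 figure quoted earlier, and 0.72 is the V_DDL/V_DDH ratio quoted for the 65nm generation in the leakage discussion; combining them is purely illustrative.

```python
def clock_energy_saving(clock_share, low_supply_fraction, vddl_over_vddh):
    """Estimate the fractional reduction in total energy when part of
    the clock network runs from a lower supply.

    clock_share         -- clock energy as a fraction of total energy
    low_supply_fraction -- fraction of clock energy moved to the low
                           supply (hypothetical parameter)
    vddl_over_vddh      -- ratio of low to high supply voltage
    Assumes switching energy scales as C * V^2 on the low-supply nodes.
    """
    per_node_saving = 1.0 - vddl_over_vddh ** 2
    return clock_share * low_supply_fraction * per_node_saving

if __name__ == "__main__":
    for fraction in (0.25, 0.5, 1.0):
        saving = clock_energy_saving(0.594, fraction, 0.72)
        print(f"low-supply fraction {fraction}: total saving {saving:.1%}")
```

Because the critical-path clocking stays at V_DDH, only part of the clock energy is affected, which is why the fraction is kept as an explicit parameter rather than fixed.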
ALU Leakage Power Demand
The circuit techniques discussed in this paper allow us to lower the ALU leakage power consumption, as indicated in Figure 10. When Design 1 (Ref.) is scaled from the 130nm to the 65nm technology, there is a 27x increase in the standby mode leakage power (~30% of it gate leakage). For Design 2 with dual supply, the total leakage power reduces by 22% (32%) for the 130nm (65nm) generation. The gate leakage reduces significantly (~40%) when the power supply is lowered for Design 2, and contributes ~18% of the total ALU leakage power for the 65nm generation (V_DDL = 0.72 V_DDH).
CONCLUSION
In this paper, we presented a high performance 32-bit ALU design and adopted a dual supply strategy to minimize total energy consumption. We discussed the impact of sharing the footer clock transistors (PG unit) and CPL logic in minimizing clock and data energy. We demonstrated the scaling trends for the 180nm-65nm CMOS technologies showing reductions in ALU total energy (18%-24%) and leakage power (22%-32%) demand.
ACKNOWLEDGMENTS
The authors would like to acknowledge O. Semenov, S. Naraghi and C. Kwong from the University of Waterloo, and S. Hsu and S. Borkar from Intel Corp., for their encouragement and support.
