This paper describes a new mixed-swing topology for dual-rail domino logic that reduces energy and delay simultaneously. HSPICE simulation results for a 1-bit full adder cell show a 24% delay decrease and a 24% energy reduction for the mixed-swing topology compared to standard dual-rail domino. Energy and delay trends with supply voltage scaling are also presented for the adder cell. An 8-bit by 8-bit multiplier design with mixed-swing dual-rail domino adders is presented. Simulation results show this implementation to be 10% faster with an 18% energy savings.
INTRODUCTION
Domino CMOS [1] has become the prevailing logic family for high-performance CMOS applications, and it is used extensively in most state-of-the-art processors because of its high speed. The drawback of domino CMOS is that it provides only non-inverting functions because of its monotonic nature. Dual-rail domino logic (also known as clocked cascode voltage switch logic [2]), where both polarities of the output are generated, provides a robust solution to this problem. Other techniques either have latches built into the gate [3], [4], resulting in very finely pipelined designs, or make use of delayed clocks [5], making the design sensitive to manufacturing variations. The penalty associated with dual-rail domino logic is increased power dissipation compared to static CMOS as well as to dynamic circuit techniques that use single-ended logic gates. In this paper we explore a mixed-swing topology, wherein multiple supply voltages are used in a dual-rail domino logic gate, that offers simultaneous power and delay reductions. In Sec. 2, we present the new Mixed-Swing Dual-Rail Domino (MSDRD) topology, contrast it with conventional dual-rail domino methods, and explore the energy-delay space for a mixed-swing implementation of a dual-rail domino full adder. Sec. 3
transistors is used, the noise margin is set by Vtn and is relatively poor. To improve and control the noise margin, we employ a modified circuit topology called cross-coupled domino [6]. The cross-coupled pfets can be sized up without significantly affecting the delay through the domino gate, since they do not contend with the nmos stacks during evaluation. The input noise margin is determined by the ratio of the size of the pfet keeper to that of the nfet stack. For the circuit of Fig. 1, the keeper can be sized up until the noise margin approaches that of static CMOS. This technique improves the noise margin considerably with little impact on delay or energy. Like cross-coupled domino logic, mixed-swing dual-rail domino logic can be switched into completely static operation by keeping the clock high, so the circuit can alternate between high-speed dynamic operation and low-power static operation. This feature can also make the circuit significantly easier to test.
Gate-Level Circuit Design
In Quadrail operation, the voltage swing across the inverter is reduced by pulling up its lower rail, so that VSS_INV > VSS_LOG. Simultaneously, the voltage swing across the logic circuit is reduced by pulling down its upper rail, so that VDD_INV > VDD_LOG. This moves the VOL of the gate up by VSS_INV - VSS_LOG, which illustrates why the cross-coupled topology is essential for mixed-swing techniques to work. Note that, as with any dual-rail domino implementation, transistors can be shared between the nmos trees for the true and the complement functions to generate more compact circuits.
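The role of the true and complement nmos trees can be illustrated at the logic level. The following is a minimal behavioral sketch in Python, not a circuit model: it assumes a dual-rail full adder in which every output rail is built only from AND/OR combinations of the input rails, which is the monotonic (non-inverting) form a domino gate can evaluate. The function names and the sum-of-products decomposition are illustrative assumptions, not the transistor network of Fig. 1.

```python
from itertools import product


def dual_rail_full_adder(a, a_b, b, b_b, c, c_b):
    """Return (sum, sum_b, carry, carry_b) from true/complement input rails.

    Only AND and OR are used, so no output rail needs an inverted input:
    the complement outputs are computed from the complement input rails.
    """
    carry = (a & b) | (b & c) | (a & c)                # majority of true rails
    carry_b = (a_b & b_b) | (b_b & c_b) | (a_b & c_b)  # majority of complement rails
    # Sum is 1 for an odd number of ones, sum_b for an even number.
    s = (a & b_b & c_b) | (a_b & b & c_b) | (a_b & b_b & c) | (a & b & c)
    s_b = (a_b & b & c) | (a & b_b & c) | (a & b & c_b) | (a_b & b_b & c_b)
    return s, s_b, carry, carry_b


def check_all_inputs():
    """Exhaustively compare the dual-rail outputs with integer addition."""
    for a, b, c in product((0, 1), repeat=3):
        s, s_b, cy, cy_b = dual_rail_full_adder(a, 1 - a, b, 1 - b, c, 1 - c)
        total = a + b + c
        assert (s, cy) == (total & 1, total >> 1)
        assert (s_b, cy_b) == (1 - s, 1 - cy)
    return True
```

With all six input rails low, which corresponds to the precharged state, all four output rails stay low; during evaluation exactly one rail of each output pair rises. Several product terms appear in more than one output, which is the logic-level counterpart of sharing transistors between the two trees.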
The worst case evaluation delay through the gate is determined by the discharge delay through the dynamic circuit and the pull-up delay through the inverter. 
where C_inv is the total load capacitance driven by the inverter, C_d is the total load capacitance driven by the dynamic logic gate, α is the activity factor and f is the frequency. From these equations it can be seen that performance improvement and power reduction can be obtained simultaneously by minimizing the voltage swing in the logic block and the inverter and by maximizing the difference VDD_INV - VSS_LOG. Two different operating modes are possible for this circuit, depending on the value of VDD_LOG. When VDD_LOG = VDD_INV we have Trirail. The topology shown in Fig. 1 is called Quadrail since it has four distinct supply rails.
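The dependence of power on the two voltage swings can be sketched numerically. The Python fragment below is a first-order illustration only: it assumes the textbook C·V² switching-energy model applied separately to the inverter output and the dynamic node, which is not necessarily the exact expression used in this paper. The `Rails` class, the helper names, and the choice to form Trirail by raising only VSS_INV are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rails:
    """The four supply rails of a mixed-swing gate, in volts."""
    vdd_inv: float
    vss_inv: float
    vdd_log: float
    vss_log: float


def trirail(vdd, vss, delta):
    """Three distinct rails: VDD_LOG = VDD_INV, only VSS_INV is raised by delta."""
    return Rails(vdd_inv=vdd, vss_inv=vss + delta, vdd_log=vdd, vss_log=vss)


def quadrail(vdd, vss, delta):
    """Four distinct rails: both inner rails move by delta from the outer rails."""
    return Rails(vdd_inv=vdd, vss_inv=vss + delta, vdd_log=vdd - delta, vss_log=vss)


def switching_energy(rails, c_inv, c_d):
    """Assumed first-order energy per transition: C * (swing)^2 for each node."""
    inv_swing = rails.vdd_inv - rails.vss_inv
    log_swing = rails.vdd_log - rails.vss_log
    return c_inv * inv_swing ** 2 + c_d * log_swing ** 2


def switching_power(rails, c_inv, c_d, activity, freq):
    """Average dynamic power: activity factor * frequency * energy per transition."""
    return activity * freq * switching_energy(rails, c_inv, c_d)
```

Under this model, with the outer rails held at 2.4 V and 0 V, any positive `delta` lowers the energy of the Trirail configuration below the single-supply case and lowers the Quadrail energy further still, in line with the qualitative conclusion drawn above. The model says nothing about delay or noise margin.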
Exploring the Energy-Delay Space
To explore the energy-delay space of the mixed-swing dual-rail domino methodology in a more realistic manner, a full adder was considered. Detailed HSPICE simulations were carried out using the Level 13 BSIM1 models for the Hewlett-Packard 0.6 µm drawn CMOS process. The outer rails, VDD_INV and VSS_LOG, were fixed at 2.4 V and 0 V respectively. Fig. 2 and Fig. 3 show the energy and delay, respectively, as a function of the difference between the inner and outer rails for the Trirail and Quadrail topologies. For the Quadrail case, both inner rails, VDD_LOG and VSS_INV, were moved by the same amount from the outer rails.
Table 1 gives the energy-delay comparison between a mixed-swing dual-rail domino adder and a standard dual-rail domino adder. VDD_INV and VSS_LOG were fixed at 2.4 V and 0 V respectively. VSS_INV was set to 0.W and VDD_LOG was set to 2.2 V to give a noise margin of about 0.6 V, which is around a Vtn, the typical noise margin in single-ended domino circuits. The standard dual-rail domino adder cell is operated from a single 2.4 V power supply. As can be seen from Table 1, mixed-swing dual-rail domino delivers significant reductions in energy and delay over standard domino.
Multiplier Architecture
Multipliers are an important part of most DSP and processor cores. With escalating performance demands in both areas, domino CMOS finds increasing use in multiplier circuits. The inverting nature of the functions involved in a multiplier necessitates the use of dual-rail domino logic in order to design robust, low-latency, high-performance multipliers. Fig. 4 shows the block diagram of the multiplier to which the mixed-swing methodology was applied to demonstrate the power savings it offers. The multiplier was designed to be part of an FIR filter. It takes in 8 bits of the filter coefficient a(i), multiplies it with an 8-bit input, adds this result to 20 bits of sum and carry from the previous tap, and gives out an 8-bit answer. The delay elements between the Wallace tree and the final adder are implemented using C²MOS [10] latches, which also serve to pipeline the design. The PPgen (partial product generator) and the Booth encoder [11] were implemented in static CMOS. The Wallace tree, which is the energy- and delay-critical module in the whole design, was implemented in dual-rail domino logic. The final adder was implemented in single-ended domino logic. The filter coefficient a(i) is essentially static, i.e. it changes at a slow rate compared to the input data rate. The worst case delay is thus either through the PPgen, the Wallace tree and a latch, or through the final adder. The precharge time in this design is hidden by having all the domino circuits precharge while the PPgen block is generating the partial products.
Table 2 presents the simulation results for the mixed-swing dual-rail domino multiplier. All simulations were carried out in HSPICE using BSIM1 CMOS models for the HP 0.6 µm drawn CMOS process, with nominal models at 25 °C. Clock skew and clock buffering were not taken into account in these results.
The Wallace tree was simulated with the netlist extracted from an actual layout, while the rest of the modules were simulated without layout parasitics. The Wallace tree layout was generated automatically using the place and route tool Silicon Ensemble™. The layout for the leaf cells was also generated automatically using LAS™, a device-level layout synthesizer.
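The datapath of one filter tap described in this section can be summarized with a small behavioral model. The Python sketch below is a simplification under stated assumptions: it uses unsigned AND-array partial products instead of the Booth-encoded ones used in the design, keeps full precision instead of truncating to the design's bit widths, and ignores pipelining. All function names are hypothetical. Its purpose is only to show how layers of full adders (3:2 compressors) reduce the partial products and the incoming sum and carry words to two words for the final adder.

```python
def partial_products(coeff, x, width=8):
    """Unsigned AND-array partial products (assumption: no Booth encoding)."""
    return [(x << i) if (coeff >> i) & 1 else 0 for i in range(width)]


def compress_3_to_2(a, b, c):
    """A row of full adders: three words in, a sum word and a carry word out."""
    s = a ^ b ^ c
    cy = ((a & b) | (b & c) | (a & c)) << 1
    return s, cy


def wallace_reduce(words):
    """Apply 3:2 compressor layers until only two words remain."""
    words = list(words)
    while len(words) > 2:
        full = len(words) // 3 * 3          # words that form complete groups of 3
        nxt = []
        for i in range(0, full, 3):
            nxt.extend(compress_3_to_2(*words[i:i + 3]))
        nxt.extend(words[full:])            # leftovers pass to the next layer
        words = nxt
    while len(words) < 2:
        words.append(0)
    return words[0], words[1]


def mac_tap(coeff, x, prev_sum, prev_carry):
    """One tap: coeff * x plus the carry-save result of the previous tap."""
    words = partial_products(coeff, x) + [prev_sum, prev_carry]
    s, cy = wallace_reduce(words)
    return s + cy                           # stands in for the final adder
```

Each compressor layer preserves the arithmetic sum of its inputs, so `mac_tap(coeff, x, prev_sum, prev_carry)` equals `coeff * x + prev_sum + prev_carry` for non-negative inputs. In the design itself every 3:2 compressor bit is one of the dual-rail domino full adder cells, which is why the adder cell dominates the energy and delay of the Wallace tree.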
Experimental Results
The worst case delay is through the PPgen block, the Wallace tree and the setup time of the pipeline latch. The mixed-swing implementation of the adder cells in the Wallace tree reduces delay by 19% and energy by 23%. For the multiplier this translates into a 10% delay decrease with an 18% energy/operation reduction.
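The combined effect of a delay reduction and an energy reduction on the energy-delay product can be checked with a few lines of arithmetic. The Python sketch below assumes only the definition EDP = energy × delay and uses the rounded percentages quoted in this paper; the function name is illustrative.

```python
def edp_reduction(delay_reduction, energy_reduction):
    """Fractional reduction in energy-delay product from fractional reductions."""
    return 1.0 - (1.0 - delay_reduction) * (1.0 - energy_reduction)


if __name__ == "__main__":
    cases = {
        "stand-alone adder cell (24%, 24%)": (0.24, 0.24),
        "adder cells in Wallace tree (19%, 23%)": (0.19, 0.23),
        "multiplier (10%, 18%)": (0.10, 0.18),
    }
    for name, (d, e) in cases.items():
        print(f"{name}: EDP reduced by {edp_reduction(d, e):.1%}")
    # prints 42.2%, 37.6% and 26.2% respectively
```

With the rounded 24% figures the adder cell comes out at about 42%, close to the energy-delay product reduction reported for the cell; the small difference is presumably due to rounding of the quoted percentages.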
Conclusions
In this paper we presented a novel mixed-swing topology for dual-rail domino logic which can yield simultaneous power and delay reductions. For a simple full adder cell this method results in a 43% reduction in the energy-delay product. Energy and delay trends with voltage scaling were also presented for the adder cell. The performance improvements possible with mixed-swing dual-rail domino logic for larger blocks were demonstrated for an 8-bit by 8-bit multiplier. From these simulation experiments, the proposed mixed-swing dual-rail domino logic approach appears to offer an interesting avenue for exploration in the design of robust, high-performance, low-power digital circuits.
Acknowledgements
This work was funded in part by DARPA under Order A564, NSF under Grant MIP9408457, and SRC under Contract 068.007.
