Introduction
With the increasingly stringent demands on battery space and weight in portable multimedia devices, there is a strong need to investigate techniques for lowering the power dissipation of Digital Signal Processing (DSP) circuits. A majority of DSP circuits are signed, fixed-point, short bit-width (8-24 bit) datapath operators, specifically multipliers and/or multiplier-accumulators (MACs), and therefore substantial attention has been devoted to lowering their power consumption.
Multiple voltage techniques have been reported earlier for lowering the power dissipation by operating non-critical path gates at reduced voltages [1], [2]. These techniques employ multiple voltages while retaining the static CMOS based logic gate structure unaltered. A four-power-rail methodology called Mixed Swing QuadRail has been proposed previously to construct standard digital logic gates using multiple voltages at the gate level [3], [4]. This approach performs logic by intermixing high- and low-swing signals while driving the load capacitance at the gate outputs at reduced swings. These multiple voltage techniques use explicit high- and low-voltage supplies to offer a nearly quadratic reduction in dynamic power since there exists no DC path between the supplies. However, they have three limitations: (i) the additional off-chip supply and its associated pin requirements add to the total system cost, making the techniques economically unattractive; (ii) low-voltage off-chip supplies are prone to significant inefficiencies, particularly if the drive-current requirements are high (e.g., if the supply delivers the drive currents of many on-chip low-swing circuits), which degrades overall system power efficiency; (iii) due to the lack of any on-chip regulation (the separation between the high- and low-voltage supplies remains fixed), these techniques suffer from substantial low-voltage dispersion in delay and power across worst-case process and temperature corners, contributing significantly to parametric yield loss [5]. These are increasingly important concerns in future deep-submicron processes.
In this paper we describe a self-contained, on-chip series-regulated Mixed Swing QuadRail methodology with sleep-mode control. This technique locally generates the low-swing supply rails from the regular, high-swing supply rails, mitigating the above concerns. The next section describes details of the series-regulated Mixed Swing QuadRail technique. Section 3.0 describes the first reported implementation of a large-scale DSP datapath operator using the proposed series-regulated approach: a 16*16+36-bit DSP MAC fabricated in a commercial 0.5µm CMOS process. The same MAC is also fabricated in the off-chip regulated QuadRail approach (with explicit off-chip high- and low-voltage supplies) as well as in conventional static CMOS to study the respective power-delay trade-offs. The remaining sections describe measured results, power-delay comparisons across three additional (0.35µm, 0.25µm, 0.16µm) CMOS and fully-depleted SOI (FDSOI) processes, and manufacturability analyses of the MACs.
Series-regulated Mixed Swing QuadRail
Fig. 1(a) shows the Mixed Swing QuadRail gate topology for a (3,2) counter, consisting of a logic stage operating between the high-swing power rails (Vd1-Vs1 = Vlogic) and a driver/buffer stage operating between the low-swing power rails (Vd2-Vs2 = Vbuffer). Vlogic and Vbuffer are approximately centered to maximize high and low noise margins and to equalize rising and falling delays in either stage. PMOS devices in each stage have independent N-wells for minimal body effect on the buffer stage PMOS devices. Since our target process is single-threshold and N-well based, NMOS devices reside in the native P-substrate. Fig. 1(b) shows the proposed series-regulator circuit for local generation of the low-swing power rails (Vd2 and Vs2) from the regular, off-chip high-swing power rails (Vd1 and Vs1). The low-swing voltage is servoed to maintain a fixed ratio of off- to average on-drive current (Ioff/Ion) in the QuadRail circuit in order to balance static and dynamic power.
This achieves the same goal of minimizing total power as [6] but without mandating any process modifications. The transistor pairs (M3:M4) and (M7:M8) are ratioed Nx:1x, where 1x is the minimum-width transistor and N is the target Ion/Ioff. The PMOS devices are appropriately ratioed wider than the NMOS devices to equalize their respective drive capabilities. The current mirror devices (M1:M2) and (M5:M6) are ratioed 1:1. M9 and M10 provide the DC series path between the power rails and are sized to be able to source/sink the QuadRail circuit's peak on-drive current requirement. Three local inter-rail decoupling capacitors (C) are inserted to reduce rippling on the low-swing power rails due to simultaneous switching noise on the high- and low-swing power rails. M11 and M12 are sleep-mode enable devices that are disabled (SLP=Vs1) during normal operation. During power-down mode (SLP=Vd1), the low-swing rails are shorted to the high-swing rails, eliminating the DC path power consumption that exists during active mode.
[...] stage between the multiplier and accumulator for enhanced throughput (Fig. 2(a)). The power distribution measured on a static CMOS implementation of the MAC is shown in Fig. 2(b). The Wallace tree multiplier is the most power-critical MAC component, consuming 75% of total power. This is due to the substantial interconnect capacitances driven by the 28-transistor-based (3,2) counters [9] within the Wallace tree. In order to lower the multiplier power, three versions of the MAC are fabricated with the multiplier constructed in series-regulated QuadRail, off-chip regulated QuadRail, and conventional static CMOS to study the relative power-delay trade-offs. The final accumulator, due to its higher logic depth than the multiplier, is the most time-critical MAC component and hence sets the maximum clock frequency.
It is therefore implemented in full-swing static CMOS in all MAC versions to retain a fixed, high throughput. All three MACs have CMOS-level I/Os to enable interfacing with external CMOS circuitry without level conversion. Fig. 3(a)-(b) show the measured Wallace tree multiplier power-delay comparisons for static CMOS vs. the QuadRail methodologies over a range of operating voltages (2.5-1.5V), i.e., Vdd for CMOS and Vlogic for QuadRail. QuadRail's corresponding buffer voltages are selected to maintain an Ioff/Ion ratio of 1:150, which balances static and dynamic power within the QuadRail multiplier while meeting the target delay constraints set by the CMOS MAC. Fig. 4 shows the low-swing rail waveforms from the series-regulated QuadRail MAC at Vd1=2V, Vs1=0V. Measured peak-to-peak power/ground bounce on the low-swing power rails is confined to within 8% of the low-swing voltage with 4pF on-chip inter-rail decoupling capacitors.
Power and delay are measured across 500 pseudo-random input vectors. The off-chip regulated QuadRail approach shows energy/operation savings ranging up to 3.79X over static CMOS, with the savings increasing with voltage scaling. The savings are attributed to the following:
(i) Average point-to-point net capacitance (due to both interconnect and fanout gate loading) extracted from the Wallace tree multiplier layout is 58fF. This, coupled with the inherently high switching activities of Wallace trees, makes the effective switched capacitance per cycle substantial. A full quadratic reduction in buffer stage dynamic power is achieved due to the lowered output swing across this capacitance.
(ii) 28% of the dynamic power within the multiplier is due to short-circuit power dissipation, despite the multipliers being optimally sized to maintain steep input rise/fall times. This is comparable to the short-circuit power reported for a similar multiplier in [10]. Thus, the reduced buffer stage swing offers a nearly cubic reduction in its short-circuit power component as well, contributing to the additional energy/operation savings.
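The two scaling trends above can be sketched numerically. This is a minimal first-order illustration, not the paper's model. It assumes buffer stage dynamic energy scales with the square of the swing and short-circuit energy with the cube. The function names are hypothetical, and the example swings (Vlogic = 1.5V, Vbuffer = 0.8V) are borrowed from the operating point quoted in the manufacturability discussion. The blended figure ignores logic stage and static power, so it is not expected to reproduce the measured savings.

```python
# Illustrative first-order model of buffer-stage swing scaling.
# Assumptions (not from the paper): E_dyn ~ C * V^2, E_sc ~ V^3.

NET_CAPACITANCE_F = 58e-15  # average point-to-point net capacitance from the text


def switching_energy(capacitance_f, swing_v):
    """Energy drawn from the supply per charge/discharge cycle: E = C * V^2."""
    return capacitance_f * swing_v ** 2


def buffer_reduction_factors(v_logic, v_buffer):
    """Return (dynamic, short_circuit) reduction factors for the buffer stage.

    Dynamic energy is assumed to scale with swing^2 (quadratic) and
    short-circuit energy with swing^3 (the 'nearly cubic' trend).
    """
    ratio = v_logic / v_buffer
    return ratio ** 2, ratio ** 3


def blended_buffer_saving(v_logic, v_buffer, short_circuit_share=0.28):
    """Combine both trends, weighting short-circuit power by its quoted share.

    Covers only the buffer stage; logic-stage and static power are ignored.
    """
    dynamic, short_circuit = buffer_reduction_factors(v_logic, v_buffer)
    remaining = (1.0 - short_circuit_share) / dynamic + short_circuit_share / short_circuit
    return 1.0 / remaining


if __name__ == "__main__":
    v_logic, v_buffer = 1.5, 0.8  # example swings, assumed for illustration
    print("full-swing energy per net (J):", switching_energy(NET_CAPACITANCE_F, v_logic))
    print("low-swing energy per net (J): ", switching_energy(NET_CAPACITANCE_F, v_buffer))
    print("reduction factors:", buffer_reduction_factors(v_logic, v_buffer))
    print("blended buffer-stage saving:", blended_buffer_saving(v_logic, v_buffer))
```

Under these assumptions, the blended saving always lies between the quadratic and cubic factors, which is why a short-circuit share of 28% pushes the off-chip regulated result above a purely quadratic estimate.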
Series-regulated QuadRail offers relatively lower energy/operation savings than off-chip regulated QuadRail, due to the DC series path between the power supplies. As a result, the buffer stage dynamic power reduction factor drops from quadratic to linear. However, the nearly cubic reduction in buffer stage short-circuit power is still retained, contributing to an energy/operation savings slightly larger than linear. The savings range up to 2.55X, i.e., up to a 35% loss in savings compared to off-chip regulated QuadRail. At 67MHz/23MHz (maximum/minimum measured clock speed), the total series-regulated QuadRail MAC power (i.e., multiplier, accumulator, and registers) is 16.6mW/2.06mW. Series-regulated QuadRail's DC power disadvantage is offset by the following advantages:
(i) Standby power (152.5nW) is nearly three orders of magnitude lower than off-chip regulated QuadRail's standby power (143.8µW), because of the absence of the Vd1-Vs1 totem-pole current path during sleep mode. Further, transition between sleep and active modes is accomplished in a single clock cycle. Since transitioning to sleep mode essentially transforms QuadRail into conventional static CMOS, circuit state is still retained during standby. Thus, transitioning between sleep and active modes eliminates the need for any explicit state data transferring schemes similar to [11].
(ii) Since the additional low-voltage supply is not required, series-regulated QuadRail is a self-contained methodology that can replace static CMOS operating from a regular, high-swing supply without mandating any system-level modifications.
Fig. 5 shows the static CMOS and QuadRail MAC die microphotographs. The off-chip regulated QuadRail MAC occupies about 10% larger layout area due to the intrinsic cell-layout area penalty incurred by its dual-well requirement. The series-regulated QuadRail MAC incurs an additional 8% area penalty due to the on-chip decoupling capacitors.
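The figures quoted for the series-regulated MAC can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only. The helper names are hypothetical, and the energy-per-cycle estimate assumes one MAC operation per clock cycle and that the two power figures pair with the two clock speeds in the order given.

```python
import math

# Figures quoted in the text for the 0.5 um MACs.
STANDBY_SERIES_W = 152.5e-9     # series-regulated QuadRail standby power
STANDBY_OFFCHIP_W = 143.8e-6    # off-chip regulated QuadRail standby power
SAVINGS_OFFCHIP = 3.79          # peak energy/operation savings, off-chip regulated
SAVINGS_SERIES = 2.55           # peak energy/operation savings, series-regulated
# (clock frequency in Hz, total MAC power in W); pairing is an assumption
OPERATING_POINTS = [(67e6, 16.6e-3), (23e6, 2.06e-3)]


def orders_of_magnitude(larger, smaller):
    """Base-10 logarithm of the ratio between two quantities."""
    return math.log10(larger / smaller)


def relative_loss(reference, reduced):
    """Fractional loss of 'reduced' relative to 'reference'."""
    return 1.0 - reduced / reference


def energy_per_cycle(power_w, clock_hz):
    """Average energy per clock cycle, assuming one operation per cycle."""
    return power_w / clock_hz


if __name__ == "__main__":
    print("standby gap (decades):", orders_of_magnitude(STANDBY_OFFCHIP_W, STANDBY_SERIES_W))
    print("loss in peak savings:", relative_loss(SAVINGS_OFFCHIP, SAVINGS_SERIES))
    for clock_hz, power_w in OPERATING_POINTS:
        print(clock_hz, "Hz ->", energy_per_cycle(power_w, clock_hz), "J per cycle")
```

The standby ratio comes out at roughly 943X, just under three decades, and the peak savings figures alone imply a loss of about 33%, which sits within the quoted "up to 35%" bound.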
The power-delay comparisons are extended over three additional commercial single-threshold processes (0.35µm CMOS, 0.25µm FDSOI, and 0.16µm CMOS) to study the impact of process scaling on energy/operation savings (Fig. 6). Series-regulated QuadRail energy/operation savings increase with process scaling: up to 3.2X in the 0.35µm, 3.45X in the 0.25µm, and 3.8X in the 0.16µm processes. The 0.25µm implementation's lowest energy/operation (at Vlogic = 0.75V, Vbuffer = 0.35V) is 6pJ. This is nearly 3.3X lower than one of the lowest reported energy/operation implementations in the literature in a comparable multi-threshold 0.25µm process [10].
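As a quick consistency check on the scaling trend, the quoted savings can be tabulated by feature size. This is a small illustrative sketch with hypothetical names. The 0.5µm entry reuses the 2.55X peak savings reported earlier for the fabricated MAC, and the implied reference energy is simply the quoted 6pJ multiplied by the quoted factor of about 3.3, not a figure taken from [10].

```python
# Peak series-regulated QuadRail energy/operation savings quoted in the text,
# keyed by process feature size in micrometres.
SAVINGS_BY_PROCESS_UM = {0.5: 2.55, 0.35: 3.2, 0.25: 3.45, 0.16: 3.8}


def savings_increase_with_scaling(savings_by_feature_um):
    """True if savings grow strictly as the feature size shrinks."""
    ordered = [savings_by_feature_um[size]
               for size in sorted(savings_by_feature_um, reverse=True)]
    return all(a < b for a, b in zip(ordered, ordered[1:]))


def implied_reference_energy_pj(energy_pj, improvement_factor):
    """Energy of the comparison design implied by an 'N times lower' claim."""
    return energy_pj * improvement_factor


if __name__ == "__main__":
    print("monotonic with scaling:", savings_increase_with_scaling(SAVINGS_BY_PROCESS_UM))
    print("implied reference energy (pJ):", implied_reference_energy_pj(6.0, 3.3))
```

Read this way, the 6pJ result corresponds to a comparison point of roughly 20pJ per operation.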
Since interconnect capacitance scales slower than gate capacitance with process scaling, the Wallace tree multiplier, because of its interconnect-dominated point-to-point net capacitances, becomes more and more power-critical. This, coupled with the increasing ratios of logic to buffer swings with process scaling, makes driving the multiplier's load 
16*16+36-bit MAC Manufacturability
To study the impact of series-regulated QuadRail on manufacturability, worst-case process and temperature corner analysis is performed across industrial Slow-NMOS-Slow-PMOS and Fast-NMOS-Fast-PMOS corners on the CMOS and QuadRail multipliers in the 0.5µm process (Fig. 7). QuadRail demonstrates power*delay dispersions similar to CMOS at high voltages. With voltage scaling, the dispersion remains well controlled and at Vlogic=1.5V, Vbuffer=0.8V, the power*delay dispersion is 1.8X lower than CMOS, demonstrating improved low-voltage parametric yield. This is attributed to (i) the low-swing rails being dynamically offset across corners to maintain the target Ioff/Ion ratio, thereby signifi-
