Abstract-Reduction of power consumption is important for all high-performance digital VLSI systems. This paper reviews several approaches for low-power implementations of building blocks for digital subscriber line (DSL) systems. Low-power implementations of Reed-Solomon (RS) coders, fast Fourier transforms (FFTs), FIR filters, and equalizers are addressed, along with reduction of power consumption by use of dual supply voltages. It is shown that use of separate Galois field functional units for multiply-accumulate and degree reduction can reduce the energy consumption of RS coders dramatically. An FFT based on a hybrid feedforward and feedback commutator scheme is shown to require less area and to achieve full hardware utilization. Reduction of switching activity at one or both inputs of the multipliers is key to reducing power consumption in FIR filters and equalizers. The switching activity can be reduced by use of the transpose structure and by time-multiplexing of an unfolded filter. A well-established retiming approach can be generalized to find those noncritical gates which can be operated with lower supply voltages to reduce the overall system power consumption.
I. INTRODUCTION

DIGITAL subscriber line (DSL), as the next-generation modem technology, has received a lot of attention. DSL works on regular telephone lines and provides high-speed access which is typically two to three orders of magnitude faster than most analog modems. In practice, DSL is often written as "xDSL" to describe all the different variations of the technology, which all fall into one of two categories: asymmetric DSL (ADSL) and symmetric DSL. ADSL reserves more bandwidth for downstream (from the central office to the subscribers) and less for upstream, which makes it most suitable for internet surfers and users of remote LANs, because they typically download much more data than they send. ADSL mainly includes G.dmt ADSL (also known as full-rate ADSL), G.lite ADSL (also known as universal ADSL), and RADSL (rate adaptive ADSL). In contrast, symmetric DSL provides the same rate in both directions, and is suited to Web servers, corporate networks, and those who send out large quantities of data. The symmetric DSL family consists of HDSL (high bit-rate DSL), IDSL (ISDN DSL), and VDSL (very high bit-rate DSL). For a detailed overview of xDSL technology and its applications, readers are referred to [1]-[3].
Manuscript received July 10, 2000; revised April 20, 2001. This paper was presented in part at the IEEE CAS-COM Workshop on High-Speed Data Over Local Loops and Cables, Princeton University, Princeton, NJ, July 1999. This paper was recommended by Guest Associate Editor Y.-F. Huang.
The author is with the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455 USA (e-mail: parhi@ece.umn.edu).
Publisher Item Identifier S 1057-7122(01)08975-9.
Recently, reduction of power consumption has become an important issue in the design of high-performance digital VLSI systems. The techniques used to achieve low power consumption in digital systems span a wide range, from the algorithm and architectural levels to the logic, circuit, and device levels. In this paper, we review low-power architectural design methodologies for several key components in DSL systems.
Power consumption can be reduced by a combination of several techniques. Pipelining and parallel processing can be used to reduce power consumption by allowing the supply voltage to be lowered. Power consumption can also be reduced by reducing effective capacitance, which can be achieved by reducing the number of gates or by algorithmic strength reduction, where the number of operations in an algorithm is reduced. Power can also be reduced by reducing memory accesses. The single most effective means of reducing power consumption is clock gating, where all functional units which need not compute any useful outputs are switched off by using gated clocks. Use of multiple supply voltages and a simultaneous reduction of threshold and supply voltages are also effective in reducing power consumption. Most power-reduction approaches apply to dedicated, programmable, or FPGA systems in a dual manner. A thorough review of the various power-reduction techniques is beyond the scope of this paper; the reader is referred to [4, Ch. 17].
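As a rough illustration of how pipelining can be traded for a lower supply voltage, the sketch below assumes a first-order CMOS model that is not stated in this paper: dynamic power proportional to C·V²·f, and gate delay proportional to C·V/(V − Vt)². Under that assumption, an M-level pipelined design running at a scaled voltage β·V0 keeps the original clock period when M(βV0 − Vt)² = β(V0 − Vt)². The function names and the numeric values in the usage note are hypothetical.

```python
import math


def pipelined_voltage_scale(v0, vt, levels):
    """Solve M*(beta*v0 - vt)**2 = beta*(v0 - vt)**2 for beta.

    Assumed model (not from the paper): delay ~ C*V / (V - vt)**2, and
    pipelining by `levels` divides the critical-path capacitance by `levels`.
    """
    a = levels * v0 ** 2
    b = -(2 * levels * v0 * vt + (v0 - vt) ** 2)
    c = levels * vt ** 2
    disc = math.sqrt(b * b - 4 * a * c)
    # The larger root is the physical one: it keeps beta*v0 above vt.
    return (-b + disc) / (2 * a)


def dynamic_power(c_total, vdd, freq, activity=1.0):
    """First-order dynamic power estimate: activity * C * Vdd^2 * f."""
    return activity * c_total * vdd ** 2 * freq


if __name__ == "__main__":
    v0, vt, levels = 5.0, 0.6, 3          # hypothetical values
    beta = pipelined_voltage_scale(v0, vt, levels)
    ratio = dynamic_power(1.0, beta * v0, 1.0) / dynamic_power(1.0, v0, 1.0)
    print(f"beta = {beta:.4f}, power ratio = {ratio:.4f}")
```

Because the clock frequency and total capacitance are unchanged in this model, the power ratio is simply β², which is why even a modest voltage reduction pays off quadratically.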
In this paper, we review some generic approaches to reducing power consumption in DSL systems. Power reduction approaches in building blocks such as error-control coders based on Reed-Solomon (RS) coders, fast Fourier transforms (FFTs), FIR filters, and equalizers are considered. Power reduction by use of multiple-supply voltages is then reviewed.
The reliability of DSL systems is generally increased by using forward error correction (FEC), which is based on RS codes. There has been considerable interest in designing dedicated circuits for RS encoders and decoders [5]-[13]. Design of universal RS codecs which can operate over different finite fields with variable code rates is of great practical importance. A dramatic increase in area and hardware complexity with increasing block length and underlying field degree makes it very difficult to realize universal RS codecs using dedicated hardware. With current scaled technologies, many DSP algorithms based on binary arithmetic can be realized using domain-specific programmable digital signal processors (DS-PDSPs) optimized for targeted applications. This processor-based software approach is a very efficient design alternative, as it reduces the time-to-market and the design costs. However, due to the lack of hardware support for finite field multiplications, sophisticated RS codecs are usually slow in software. A software-based RS(16, 12) coder was presented in [14], in which general-purpose DSP processors or microprocessors were used for programming. If finite-field arithmetic were implemented in a programmable DSP datapath, universal RS codecs (and other finite-field based systems) could be easily implemented in software. In Section II of this paper, a hardware-software-codesign approach is presented with the objective of finding the best combination of hardware datapath units and software coding schemes to reduce the total system energy and energy-latency products.
ADSL systems are based on one of two modulation techniques: discrete multitone modulation (DMT) and carrierless amplitude phase/quadrature amplitude modulation (CAP/QAM). By molding smaller channels (or tones) to the characteristic of the larger channel, DMT optimizes performance over a wide range of lines. DMT modulation has been standardized in ANSI T1.413 [15]. DMT is a special form of multicarrier modulation which is realized with digital techniques using the FFT. In general, a 256-point or even higher order FFT is used, which leads to high computational complexity. Therefore, efficient implementation of the FFT processor is a key issue in ADSL where DMT modulation is used. Considerable research has been devoted to the design of high-performance dedicated FFT processors [16], [17]. The pipelined FFT processor is a class of real-time FFT architecture characterized by continuous processing of the input data which, for reasons of transmission economy, usually arrives in a word-sequential format. However, the FFT operation is very communication intensive, which calls for spatially global interconnection. Therefore, much effort on the design of FFT processors focuses on how to efficiently map the FFT algorithm to hardware to accommodate the serial input for computation. By using digit-serial arithmetic, an efficient FFT processor architecture has been developed in [18] and is reviewed in Section III.
FIR filtering is another basic operation in DSL systems. In high-performance implementations, FIR filters often require long lengths, i.e., a large number of taps. An implementation where every multiply-add operation is mapped to a multiply-adder in hardware can be prohibitively area expensive. Therefore, for area-constrained systems, folding [19]-[22], [4] or time-multiplexing is used to map multiple algorithm operations (multiplication and addition operations in an FIR filter) onto a single functional unit, resulting in an integrated circuit with low silicon area. Unfortunately, maximal resource sharing, while minimizing area, can lead to a large increase in power consumption [23], [24]. In Section IV of this paper, we review a low-power technique for the design of folded FIR filters where storage area is traded off for lower power consumption.
Finally, in Section V, we address the issue of low-power design at the gate level in CMOS circuits. It is well known that the dynamic power consumed in CMOS gates goes down quadratically with the supply voltage. By maintaining a high supply voltage for gates on the critical path and using a low supply voltage for gates off the critical path, it is possible to dramatically reduce power consumption in CMOS VLSI circuits without performance degradation. Interfacing gates operating under multiple supply voltages requires the use of level converters. Due to the nonnegligible power consumed by level converters and the substantial propagation delay they might incur, it is necessary to develop a formal model that quantifies various design parameters such as delay and power. A formal model allows us to develop efficient heuristics to address this problem. In this paper, we review a formal model and develop an efficient heuristic for addressing the use of two supply voltages for low-power CMOS VLSI circuits without performance degradation.
II. HARDWARE/SOFTWARE CODESIGN FOR RS CODECS
RS codes are among the most versatile and powerful block error-correcting codes. They are special cases of the more general BCH codes [25]. In this section, we consider the software-based implementation of RS codecs.
In practice, the arithmetic operations in RS codecs are based on the finite field $GF(2^m)$, in which addition and subtraction are bit-independent and can be computed using arrays of XOR gates, and inversion (as well as division) can be computed iteratively using multiplication. Therefore, to design DSP datapaths for finite field arithmetic, we concentrate only on the multiplication operation. In the following, we adopt a hardware-software-codesign approach to design a finite field datapath for low-energy RS codecs, in which two novel design methods are developed: 1) separate operations for low-energy vector-vector multiplication and degree reduction; and 2) a heterogeneous digit-serial datapath.
A. Separate MAC and DEGRED Units for Low-Energy Vector-Vector Multiplications
Vector-vector multiplications are among the most common computations in DSP algorithms. Consider the vector-vector multiplication over $GF(2^m)$

$$C = \sum_{i=0}^{N-1} A_i B_i \qquad (1)$$

where $A_i$ and $B_i$, $0 \le i \le N-1$, are elements of $GF(2^m)$ with word-length equal to $m$. With one parallel multiplier in the datapath, every product $A_i B_i$ incurs both a polynomial multiplication and a polynomial modulo operation. Notice, however, that in finite field vector-vector multiplication, the polynomial modulo operation can be delayed and performed only once at the last step; the intermediate result can be obtained by polynomial multiplication and accumulation only. Based on this observation, instead of using the datapath with one parallel multiplier as shown in Fig. 1(a), we can use two separate subarrays in the finite field datapath as shown in Fig. 1(b), where the two subcomputations in one finite field multiplication, the polynomial multiplication and the polynomial modulo operation, are implemented by two separate functional units and controlled by the instructions MAC and DEGRED, respectively. The intermediate products, $A_i B_i$, have word-length $2m-1$ and are obtained using polynomial multiplications only. Then, after accumulating these intermediate results, only one polynomial modulo operation is required to reduce the degree of the final result from $2m-2$ to $m-1$. As a result, only $N$ MAC operations and 1 DEGRED operation are required, and the total energy consumption can be reduced significantly.
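The delayed-reduction idea can be checked in software with a minimal sketch. It assumes the field GF(2^8) with the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11D), which is a common choice for RS codes but is not specified in this section; the function names are hypothetical and model the two functional units only at the arithmetic level, not their hardware structure.

```python
def poly_mul(a, b):
    """Carry-less (polynomial) multiplication over GF(2); no modulo step."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def degree_reduce(p, modulus=0x11D, m=8):
    """Reduce polynomial p modulo the field polynomial (assumed 0x11D)."""
    for shift in range(p.bit_length() - 1, m - 1, -1):
        if p & (1 << shift):
            p ^= modulus << (shift - m)
    return p


def gf_mul(a, b):
    """Conventional field multiplication: multiply, then reduce."""
    return degree_reduce(poly_mul(a, b))


def dot_per_product(a_vec, b_vec):
    """One modulo operation per product (parallel-multiplier style)."""
    acc = 0
    for a, b in zip(a_vec, b_vec):
        acc ^= gf_mul(a, b)
    return acc


def dot_delayed(a_vec, b_vec):
    """Accumulate unreduced products, then reduce once at the end."""
    acc = 0
    for a, b in zip(a_vec, b_vec):
        acc ^= poly_mul(a, b)      # multiply-accumulate only
    return degree_reduce(acc)      # single degree reduction


if __name__ == "__main__":
    a_vec = [0x53, 0xCA, 0x02]
    b_vec = [0x8E, 0x01, 0x80]
    print(hex(dot_per_product(a_vec, b_vec)), hex(dot_delayed(a_vec, b_vec)))
```

Both routines return the same field element because reduction modulo a fixed polynomial is linear over GF(2), so it commutes with the XOR accumulation; the delayed version simply performs that reduction once instead of once per product.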
B. Heterogeneous Digit-Serial Multiplication Datapath Units
Reduction of energy consumption by using separate bit-parallel multiply-accumulate (MAC) and degree reduction (DEGRED) functional units was addressed in the previous subsection. Here heterogeneous implementation styles are considered for the two separate functional units. In a heterogeneous digit-serial implementation, one MAC unit of digit-size and one DEGRED unit of digit-size are assumed in the DSP datapath. Only the digit-cells, MAC and DEGRED are implemented in hardware. 
Note that when , the maximum degree of the intermediate result is equal to 15 which requires degree reduction by 7 only. Hence instead of having , we set maximum value of to 7.
C. Low-Energy Two-Error-Correcting RS($n$, $k$) Codecs Over $GF(2^8)$
Based on the two design approaches, the hardware/software codesign for low-energy high-performance RS codecs can be developed. Here, two-error-correcting RS($n$, $k$) codes over $GF(2^8)$ are considered, where $n - k = 4$. (Energy consumption values are estimated using the HEAT tool proposed in [27] at a clock frequency of 50 MHz, using MOSIS 0.5-μm technology, at Volts supply voltage.) A systematic generator-matrix based method is used for RS encoding, which is based on vector-matrix multiplications over $GF(2^8)$. The Peterson-Gorenstein-Zierler algorithm [28] is used for RS decoding. The reader may refer to [28] for a detailed explanation of RS encoding and decoding algorithms.
RS codecs based on different datapaths require the same number of instructions other than MAC and DEGRED, and the same number of memory accesses. Therefore, only the energy consumption and instruction cycles related to multiplication operations in RS($n$, $k$) encoders and decoders are estimated using the HEAT tool and summarized in Table I in terms of the codeword length $n$. Table I distinguishes nonpipelined from one-level pipelined multipliers; the critical path computation time of the corresponding datapath unit(s) is given as a number of one-gate delay times; the total number of transistors is used to represent the hardware complexity of the datapath units; energy consumption values are in picojoules; and latency is in terms of the number of instruction cycles.
For RS(40, 36) encoders and decoders, we can conclude from Table I that the RS codecs based on datapaths containing (MAC8, DEGRED2) [or (MAC8, DEGRED4)] units have the lowest energy and energy-latency products. Comparing the performance of RS($n$, $k$) codecs based on the parallel datapath and the digit-serial datapath containing (MAC8, DEGRED2), we observe that the digit-serial RS encoders consume only 23.1% of the energy of the parallel approach with a slight increase in latency; the digit-serial RS decoders consume about 27.4% of the energy of the parallel approach with 1.5 times the latency. The energy-latency products of RS($n$, $k$) encoders and decoders are plotted as a function of the block length in Fig. 2, assuming the clock cycle time is the same in both datapaths. From Fig. 2 we can see that the energy-latency products of both RS encoders and RS decoders are reduced using the digit-serial approach.
Note that the critical path of each digit-cell (MAC and DEGRED) is shorter than that of the parallel multiplier. Therefore, the digit-serial datapath can be operated at a higher clock rate, which reduces the computation delay of the digit-serial RS codecs without increasing the energy consumption. (Recall that in a digital CMOS circuit, 90% of the energy consumption is due to dynamic switching and can be estimated using $E = \alpha C_L V_{dd}^2$, where $\alpha$ is the activity factor, $C_L$ is the load capacitance, and $V_{dd}$ is the supply voltage [29]. This energy value does not directly relate to the operating clock rate.) Hence, further energy-delay reduction is possible for the digit-serial approach by either operating the digit-serial system at a higher clock rate (to reduce the total computation delay) or using a lower supply voltage (to reduce the energy consumption).
III. PIPELINED FFT IMPLEMENTATION
A. Review of FFT Processors
The discrete Fourier transform (DFT) of an $N$-point sequence $x(n)$ is defined by

$$X(k) = \sum_{n=0}^{N-1} x(n) W_N^{nk}, \quad k = 0, 1, \ldots, N-1 \qquad (2)$$

where $W_N = e^{-j2\pi/N}$. There are two well-known types of FFT algorithms, called decimation-in-time (DIT) and decimation-in-frequency (DIF). For example, according to the radix-4 DIT FFT, (2) can be decomposed and expressed in matrix form as in (3). Radix-2 and Radix-4 are the most common radices used in FFT decompositions. Radix-4 decomposition is more attractive since it requires fewer multiplication operations for the FFT, reducing the number of multiplications from $O(N^2)$ for direct implementation of the DFT to only $O(N \log N)$. As (3) shows, each stage of FFT computation consists of retrieving the required data and performing the corresponding twiddle factor multiplication, followed by the multiplication by the Radix-4 butterfly matrix. Direct implementation of (3) requires three multipliers to perform the twiddle factor multiplication as shown in Fig. 3(a) [16], [30]. However, unless four input data are sampled in parallel, this architecture cannot achieve full efficiency. For most applications where the FFT processor must be interfaced to a continuous word-serial stream, it is only possible to achieve 25% hardware utilization, as there is a 4:1 mismatch between the input data rate and the bandwidth of the processor. In order to compensate for this mismatch, a fully utilized architecture based on the use of digit-serial arithmetic units has been proposed in [31]. The other way of implementing (3) is to use a single multiplier for the twiddle factor multiplication as shown in Fig. 3(b), which generates each element of the vector sequentially by computing one twiddle factor multiplication at a time. However, this scheme still suffers from low hardware utilization.
Therefore, in [32] , the radix-4 FFT algorithm was implemented using radix-2 FFT architecture such that the utilization of adders was increased from 25% to 50% The multiplier utilization is equal to 75% since one of every four factor multiplications shown in Fig. 3(b) involves the trivial factor .
B. Radix-4 Digit-Serial FFT Processors
In general, the feedforward scheme can achieve better hardware utilization while the feedback scheme requires less memory. Combining these two schemes, we can develop an FFT architecture which not only increases the hardware efficiency, but also incorporates the methodology of the feedback scheme to reduce the memory requirement [18]. Fig. 4 shows the block diagram of the proposed FFT architecture for a transformation size of 64, based on the DIT FFT because of its lower twiddle factor update rate. The details of each functional block are shown in Fig. 5. The first FFT stage, based on the feedback scheme, consists of a data distributor followed by four parallel digit-serial feedback commutator and butterfly data paths. The data distributor serves as a format converter which converts the input data format from bit-parallel into digit-serial. At the same time, it distributes the $N$-point input data equally into four parallel $N/4$-point digit-serial data streams. Following the distributor, each data stream goes through the commutator in order to perform a radix-4 butterfly operation which, as shown in Fig. 4, is actually implemented by two Radix-2 butterfly stages in order to reduce the number of adders. One of the Radix-2 butterfly outputs in this stage is fed back to the delay commutator, which corresponds to the Radix-2 version of the feedback implementation scheme shown in Fig. 3(b). This approach results in a significant saving in the memory required. All the remaining stages after the first stage adopt the feedforward scheme, similar to [31]. The detailed commutator architecture used for this scheme is shown in Fig. 5(b), where the commutator shuffles data units, like a matrix transposition operation, to align the correct output data for the subsequent butterfly operations. The architecture of this commutator is composed of a switch array which is similar to the data distributor architecture. The Radix-4 butterfly unit, as shown in Fig. 5(c), can be built by the interconnection of four radix-2 butterfly units. Since the first stage generates four digit-serial outputs continuously, the feedforward scheme in the remaining stages does not suffer from the problem of low utilization of arithmetic units. The digit-serial adders and multipliers can be fully utilized, which is not possible for the pure feedback scheme. Table II compares the hardware requirements for different FFT architectures. It is shown that the hardware utilization based on the digit-serial approach is higher than that of the bit-parallel implementation. Therefore, the number of adders required for the proposed architecture is less than half of that required in [32], [33]. The critical path of the digit-serial multiplier is shorter than that of the bit-parallel multiplier. Since multiplication is the critical operation in the complete FFT architecture, the digit-serial FFT architecture can operate at a faster speed. As for the data memory requirement, the proposed architecture needs slightly more data memory than [32]. Finally, for the twiddle factor ROM, compared with the bit-parallel FFT approach, the proposed digit-serial architecture not only reduces the access rate of the ROM, but is also suitable for ROM reduction techniques.
IV. LOW-POWER FOLDED FIR FILTER
Folding [19] , [4] or time-multiplexing is a technique for efficient resource sharing. The throughput requirement in folded architectures is met by pipelining the hardware functional units to a relevant number of levels. In this way folded architectures are able to meet both throughput and area constraints for a target application. However, maximal resource sharing can lead to an increase in power consumption [23] , [24] .
In the following, we review a low-power folded FIR filter methodology [34]. The basic strategy employed consists of maximizing the correlation of successive inputs to shared computational resources. The primary computational resources targeted in FIR filters are multipliers, as the power consumed in multipliers is significantly higher than in adders. Also, the assignment of DFG operations (nodes) to resources is always done in such a manner that the interconnect overhead is minimized. Finally, as the input data-memory to filters in DSL applications can be very large, a data-value is accessed exactly once from the data-memory. Once a data-value is accessed from memory, all operations utilizing this value are scheduled successively, one after the other.
There are two inputs to a multiplier in an FIR filter: the coefficient input and the data input. The coefficient inputs to the multiplication nodes in the DFG of an FIR filter are fixed. It is possible to find a scheduling order for the DFG multiplication nodes mapped to a functional unit so that the net switching activity at the coefficient input is minimized.
The data input, however, is variable and can at best be modeled as a stochastic input which may or may not exhibit temporal correlation. By unfolding [35] the DFG, it is possible to unearth all the multiplication nodes which share a common data input. However, the transpose DFG of an FIR filter is able to expose all these multiplication nodes simultaneously without unfolding. Therefore, the transpose FIR filter DFG was used as a starting point for developing a low-power folded FIR filter.
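The property that makes the transpose form attractive here, namely that every multiplier sees the same data sample in a given iteration, can be seen in a short behavioral sketch. This is a generic software model of direct-form and transpose-form FIR filtering, not the folded architecture of [34]; the function names are hypothetical.

```python
def fir_direct(h, x):
    """Direct-form FIR: each output multiplies the taps by different past samples."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def fir_transpose(h, x):
    """Transpose-form FIR: all multipliers share the current input sample."""
    taps = len(h)
    state = [0] * (taps - 1)   # registers between the adders
    y = []
    for sample in x:
        # Every coefficient is multiplied by the SAME data value this cycle.
        products = [c * sample for c in h]
        y.append(products[0] + (state[0] if taps > 1 else 0))
        # Shift partial sums toward the output through the adder chain.
        for i in range(taps - 2):
            state[i] = products[i + 1] + state[i + 1]
        if taps > 1:
            state[taps - 2] = products[taps - 1]
    return y


if __name__ == "__main__":
    h = [1, 2, 3]
    x = [1, 0, 0, 2]
    print(fir_direct(h, x), fir_transpose(h, x))
```

When several of these multiplications are time-multiplexed onto one hardware multiplier, the shared data operand stays constant across successive cycles, so only the coefficient operand toggles.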
A. Unfolding the Transpose FIR Filter
The DFG of the -tap transpose FIR filter in Fig. 6 has 1 input node, , 1 output node, , multiplication operation nodes, and addition operation nodes, . Unfolding a filter DFG by a factor of will result in a new -parallel filter topology which computes filter outputs simultaneously. We now outline a strategy to unfold the -tap transpose FIR filter by a factor . 1) Create copies, , , , for the input , output , multipliers and adders , respectively; 2) Connect the input to multiplier by an arc with 0 delay. Connect the multiplier to the adder by an arc with 1 delay. Connect the multiplier , to adder with 0 delays. Connect every other multiplier to adder by an arc with 0 delay. Connect the adder to adder with 1 delay for . Connect the adder to the adder by an arc with 0 delay. Connect each adder to output . Fig. 7 shows the two-unfolded version of a two-tap transpose FIR filter. The reader may refer to [4] for the detailed explanation of folding and unfolding algorithms.
B. Low Power Assignment and Scheduling for Filter Nodes
When we have hardware multiplier units, , with pipeline stages and hardware adder units, , with pipeline stages, , then we can perform scheduling and assignment as follows:
1) Unfold the transpose FIR filter by . a) Assign multiplication operations to hardware multiplier unit . b) Assign addition operations to hardware adder unit . This assignment of operation nodes in the unfolded filter to hardware units has the property that all multiplication operations assigned to a multiplier unit share a common input. In addition the regularity of the original filter topology is almost completely maintained in the hardware mapping which leads to localized communication, minimum interconnect overhead (e.g., only 2-1 multiplexors are needed), and extremely uniform and compact layout.
2) Order the coefficients of the original FIR filter in such a manner that successive scheduling of the coefficients in this order results in minimum total switching activity. We can approximate this step in the following manner:
a) Set up a graph whose vertices have a one-to-one correspondence to the coefficients of the original FIR filter. Arcs exist between every pair of vertices with an associated distance. This distance between two vertices, say vertex $u$ and vertex $v$, is proportional to the average power dissipated when the coefficient corresponding to $v$ is scheduled immediately after the coefficient corresponding to $u$. This distance measure can be obtained by simulation. Denoting the coefficient corresponding to a vertex $v$ by $c(v)$, we made a coarse approximation by taking the Hamming distance between $c(u)$ and $c(v)$.
b) Find a cost-optimal Hamiltonian tour for the graph constructed in a). As might be evident, the above problem can be solved as an instance of the Hamming Distance Traveling Salesperson Problem (TSP) [36].
The optimal Hamiltonian tour identified in b) defines the periodic scheduling order for the various multiplication operations in the FIR filter, i.e., the multiplication operations are scheduled one after the other on the functional unit according to this order, and this ordered schedule is repeated periodically. The folding order for each addition operation is likewise fixed; the particular order was selected because it gave good results in simulation for minimization of storage [34]. Fig. 8 illustrates two-unfolding of a three-tap FIR filter for mapping to two multipliers and two adders.
Fig. 8. A three-tap FIR filter is unfolded for mapping onto an architecture with two multipliers and two adders. While the two multipliers exhibit 100% hardware utilization, the two adders exhibit 67% hardware utilization. As an aside, we observe that unfolding the DFG and then folding it can lead to higher hardware utilization.
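The coefficient-ordering step can be sketched as follows. The sketch uses the Hamming distance between fixed-point coefficient words as the arc cost, as described above, but it finds the tour by exhaustive search, which is only practical for a handful of coefficients; it is an illustration and not the TSP solver used in [34] or [36]. The word length `bits` and all function names are assumptions.

```python
from itertools import permutations


def hamming(a, b, bits=8):
    """Number of differing bits between two fixed-point words."""
    mask = (1 << bits) - 1
    return bin((a ^ b) & mask).count("1")


def tour_cost(order, bits=8):
    """Total bit toggles over one period of the cyclic schedule."""
    return sum(
        hamming(order[i], order[(i + 1) % len(order)], bits)
        for i in range(len(order))
    )


def best_schedule(coeffs, bits=8):
    """Exhaustively search for a minimum-cost Hamiltonian tour.

    The first coefficient is pinned because a cyclic schedule has no
    distinguished starting point.
    """
    first, rest = coeffs[0], coeffs[1:]
    candidates = ((first,) + p for p in permutations(rest))
    return list(min(candidates, key=lambda order: tour_cost(order, bits)))


if __name__ == "__main__":
    coeffs = [0b0000, 0b1111, 0b0001, 0b1110]   # hypothetical 4-bit taps
    order = best_schedule(coeffs, bits=4)
    print(order, tour_cost(coeffs, bits=4), tour_cost(order, bits=4))
```

The cost includes the wrap-around transition from the last coefficient back to the first, since the schedule repeats every period. For realistic filter lengths a TSP heuristic would replace the exhaustive search.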
C. Folding the Unfolded Transpose Filter
As mentioned before the Hamiltonian order from step 2) in the last section specifies a scheduling order (folding order) for the multiplication operations of the unfolded filter. Therefore, corresponding to each multiplication node in the unfolded filter there corresponds a folding order . Also, as explained in the last section every addition operation node has the scheduling order . Recall that is an index which identifies 1 among functional units to which the DFG node is mapped. Due to the periodic nature of filtering operations we are able to choose a time-static (not changing with time) periodic schedule. Every multiplication operation or addition operation is scheduled exactly once in clock cycles. The clock period is chosen as the worst-case critical path of any pipeline stage among all the functional units. Due to pipelining levels in multiplier the output of the th iteration is available only in clock cycle . This output is consumed by the th iteration of adder which is scheduled at clock cycle . The chosen schedule is feasible if and only if: (4) Similar constraints can be derived for adder to adder arcs in the unfolded filter. For every arc in the unfolded filter such a corresponding constraint will be generated. Unfortunately, some of these constraints may not be satisfied in the original schedule. In order to facilitate the above schedule we can use retiming [37] , [20] , [21] . Retiming alters the iteration structure of the original unfolded filter by moving delays around while maintaining the functionality of the original filter. In retiming an integer variable is associated with every operation node in the unfolded filter, e.g., with input node with output node , with multiply node and with addition node . After retiming, the number of delays in a communication arc between two operation nodes, say from to , is changed by an amount . 
It can be shown that this alteration of the DFG while changing the iteration structure maintains the functionality of the original algorithm, [37] . The constraint in (4), therefore, gets altered to (5) This inequality is referred to as the retiming for folding constraint for the arc from to . Similar constraints can be derived for all the other arcs in the unfolded filter. The storage required in the folded architecture can be modeled approximately as a linear function of the retiming variables [21] .
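For orientation, the general form of the folding feasibility condition and its retimed counterpart can be written in the textbook notation of [4]. The symbols below are assumptions and may differ from those used in (4) and (5): an arc $e$ from node $U$ to node $V$ carries $w(e)$ delays, $P_U$ is the number of pipeline stages of the unit executing $U$, $u$ and $v$ are the folding orders, $N$ is the folding factor, and $r(\cdot)$ are the integer retiming variables.

```latex
% Iteration l of U starts at cycle N*l + u and its result is ready at N*l + u + P_U.
% It is consumed by iteration l + w(e) of V, scheduled at cycle N*(l + w(e)) + v.
\begin{align}
  D_F(U \to V) &= N\,w(e) - P_U + v - u \;\ge\; 0, \\
  w_r(e)       &= w(e) + r(V) - r(U), \\
  N\,w_r(e) - P_U + v - u \;\ge\; 0
  \;&\Longleftrightarrow\;
  r(U) - r(V) \;\le\; \left\lfloor \frac{N\,w(e) - P_U + v - u}{N} \right\rfloor .
\end{align}
```

The floor appears because the retiming variables are integers, and $D_F(U \to V)$ is also the number of storage elements the folded arc requires, which is why the storage cost can be modeled as a linear function of the retiming variables.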
In order to ensure that retiming maintains the common shared input of all the multiplication nodes mapped to the same functional unit, we require that for any such pair of multiplier nodes say and
This constraint is referred to as the low-power folding constraint. Unfortunately, unfolding by a factor of leads to a -fold increase in memory. Hence, it is not always practical to unfold a filter DFG by a factor of for folding onto multipliers, especially, when is large. Hence, it is desirable to find an optimal unfolding factor for folding multiplication operations on to multipliers in such a way that overall power consumption is minimized. In order to do this, the power savings obtained by reducing switching activity at the multiplier inputs must be weighed against the power expenditure due to the extra memory. This will help uncover an optimal unfolding factor. Low-power folding can be used to reduce the power consumption of multipliers for folded FIR filter architectures. The technique proposed in [34] formalizes an intuitive notion that fixing one of the inputs of a multiplier and increasing the correlation of successive operands in the other input will lower the power consumed in the multiplier.
V. DUAL SUPPLY VOLTAGES FOR LOW POWER
Supply-voltage reduction is one of the most effective techniques for reducing the power consumption of CMOS circuits. The major portion of the power consumed is dynamic power, which is reduced quadratically with the supply voltage $V_{dd}$ [38]. Reducing $V_{dd}$, unfortunately, leads to an increase in delay, which results in performance degradation of the entire circuit. Recently, many papers have been published on techniques to reduce $V_{dd}$ without degrading performance [39]-[42].
In [41] and [42], simple greedy heuristics are proposed for utilizing the slack available on noncritical paths and using a dual supply voltage scheme to obtain a significant reduction in power consumption. However, since [41] and [42] use greedy heuristics for this process, there is reason to believe that a heuristic derived from a formal model of the problem may give a substantially higher reduction in power consumption at the expense of CPU time. In this section, we present a formal model for the use of two or more supply voltages for reducing power consumption in CMOS circuits without degrading performance.
A. Notation
A gate-level combinational circuit can be represented by a directed acyclic graph, $G = (V, E)$, which we will refer to as a circuit-graph. Every circuit-graph node $v \in V$ represents a logic gate or a primary input, and every directed edge $e(u, v) \in E$ denotes the wire that connects the output of gate $u$ to the input of gate $v$. The fanin of a logic gate is the number of gates whose outputs are wired to the inputs of the given gate. The fanout of a gate is the total number of gates that take the output of the given gate as an input. The input wires to gates with a fanin of 0 are fed externally, and these external input junctions constitute the primary inputs, PI. The primary inputs are also modeled as nodes. The gates with a fanout of 0 constitute the primary outputs. Every node/gate has an associated delay. The maximum propagation time for a signal through the circuit-graph, including input arrival times and the path gate delays, over all paths from a primary input node to a primary output node constitutes the critical path of the circuit-graph. We now define three attributes for every node in $G$: the arrival time, the required time, and the slack. Additionally, every wire has the attribute edge-slack. For a primary input node $u \in$ PI, the arrival time is its external time of arrival. We call a circuit safe when all nodes have nonnegative slack and all wires have nonnegative edge-slack.
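The attributes above can be computed with a standard forward and backward pass over the circuit-graph. The sketch below assumes one common convention, which may differ in detail from the original definitions: the arrival time of a gate is its own delay plus the latest arrival among its fanin nodes, the required time of a primary output equals the critical path, slack is required time minus arrival time, and the edge-slack of a wire is the time by which its driving signal precedes the moment the receiving gate must start. All function and variable names are hypothetical.

```python
from graphlib import TopologicalSorter


def timing_attributes(fanin, delay, pi_arrival):
    """Return (arrival, required, slack, edge_slack) under the assumed convention.

    fanin:      dict gate -> list of driving nodes (primary inputs have no entry)
    delay:      dict gate -> gate delay
    pi_arrival: dict primary input -> external time of arrival
    """
    order = list(TopologicalSorter(fanin).static_order())  # drivers come first

    fanout = {v: [] for v in order}
    for v, drivers in fanin.items():
        for u in drivers:
            fanout[u].append(v)

    # Forward pass: arrival times.
    arrival = {}
    for v in order:
        drivers = fanin.get(v, [])
        if drivers:
            arrival[v] = delay[v] + max(arrival[u] for u in drivers)
        else:
            arrival[v] = pi_arrival[v]

    critical_path = max(arrival.values())

    # Backward pass: required times.
    required = {}
    for v in reversed(order):
        if fanout[v]:
            required[v] = min(required[w] - delay[w] for w in fanout[v])
        else:
            required[v] = critical_path

    slack = {v: required[v] - arrival[v] for v in order}
    edge_slack = {(u, v): required[v] - delay[v] - arrival[u]
                  for v, drivers in fanin.items() for u in drivers}
    return arrival, required, slack, edge_slack


def is_safe(slack, edge_slack):
    """A circuit is safe when no node slack and no edge-slack is negative."""
    return (all(s >= 0 for s in slack.values())
            and all(s >= 0 for s in edge_slack.values()))
```

For a two-gate circuit in which `g1` is driven by inputs `a` and `b`, and `g2` is driven by `g1` and input `c`, with unit gate delays and zero input arrival times, only input `c` and the wire from `c` to `g2` carry slack.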
B. Delay Balancing
A given circuit-graph can be transformed into a functionally equivalent circuit-graph by introducing an appropriate number of unit-delay buffers into the circuit in such a manner that every node has zero slack and every wire has zero edge-slack. This process is known as delay balancing. We use delay balancing as a tool to capture all the slack in the circuit. The delay buffers we use for delay balancing are fictitious entities whose only purpose is to model the slack present in the circuit. We refer to these fictitious buffers as unit delay fictitious-buffers (UDFs). Fig. 9 shows a gate-level circuit and Fig. 10 shows its delay-balanced counterpart; the "boxed" numbers on the wires of the circuit in Fig. 10 represent the number of UDFs on each wire.
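One delay-balanced configuration can be obtained directly from the arrival times, as in the following sketch. It assumes integer gate delays, so that slack can be expressed as a whole number of UDFs, and it pads each wire so that all inputs of a gate arrive together. It also pads each primary output up to the critical path, modeled here as UDFs on a hypothetical output wire. The names are illustrative, and this is only one of many possible balanced configurations.

```python
from graphlib import TopologicalSorter


def delay_balance(fanin, delay, pi_arrival):
    """Return (udfs_on_wire, udfs_on_output) for one delay-balanced configuration.

    fanin:      dict gate -> list of driving nodes (primary inputs have no entry)
    delay:      dict gate -> integer gate delay
    pi_arrival: dict primary input -> integer external time of arrival
    """
    order = list(TopologicalSorter(fanin).static_order())  # drivers come first

    # Earliest arrival time at the output of every node.
    arrival = {}
    for v in order:
        drivers = fanin.get(v, [])
        if drivers:
            arrival[v] = delay[v] + max(arrival[u] for u in drivers)
        else:
            arrival[v] = pi_arrival[v]
    critical_path = max(arrival.values())

    # Pad every wire so that all inputs of a gate arrive at the same time.
    udfs_on_wire = {(u, v): arrival[v] - delay[v] - arrival[u]
                    for v, drivers in fanin.items() for u in drivers}

    # Pad every primary output (a node that drives nothing) up to the critical path.
    has_fanout = {u for drivers in fanin.values() for u in drivers}
    udfs_on_output = {v: critical_path - arrival[v]
                      for v in order if v not in has_fanout}
    return udfs_on_wire, udfs_on_output
```

After this padding, every path from a primary input to a primary output has the same length as the critical path, so all the slack in the circuit is represented explicitly by UDF counts.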
C. UDF-Displacement
We define UDF-displacement, a circuit-graph transformation technique, as a mapping $r: V \to \mathbb{Z}$, where $\mathbb{Z}$ is the set of integers, such that the number of UDFs on a wire after UDF-displacement is related to the number of UDFs on that wire before UDF-displacement through the values of $r$ at the two nodes that the wire connects.
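Because this technique is presented as a generalization of retiming, the relation presumably takes the familiar retiming form. The following is a sketch under that assumption: the symbols $w(e)$ and $w_r(e)$ for the UDF counts before and after displacement, and the sign convention, are assumptions and are not the original notation.

```latex
\begin{equation*}
w_r\bigl(e(u,v)\bigr) = w\bigl(e(u,v)\bigr) + r(v) - r(u)
\end{equation*}
```

Under this convention, setting $r(v) = 1$ for a single gate $v$ and $r = 0$ elsewhere removes one UDF from each fanout wire of $v$ and adds one UDF to each of its fanin wires.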
A UDF-displacement is legal if and only if the resulting number of UDFs is nonnegative for all wires. However, only a subset of all the gates with the requisite slack can be switched to the lower supply voltage. Identifying the particular subset of gates that yields the maximum power benefit is the problem at hand. Note that a delay-balanced configuration of a circuit-graph tells us which gates can be switched to the lower supply voltage. A systematic way of identifying such gates, for a simplified case, is as follows.
1) Identify a gate satisfying constraint (8).
2) Switch that gate to the lower supply voltage and remove the corresponding numbers of UDFs from each of its fanout wires and from each of its fanin wires.
3) Repeat steps 1) and 2) until no gates satisfying (8) remain.
The key fact that UDF-displacement can generate all delay-balanced configurations of a circuit-graph leads to the following.
D. Dual Supply Voltages for Gate-Level Circuits
Objective: To employ UDF-displacement to find the delay-balanced configuration that identifies all the logic gates that can be switched to a lower supply voltage so as to provide a maximal reduction in power consumption while maintaining the critical path. The process of obtaining a Power Reducing Optimization with UDF-Displacement is henceforth referred to as PROUD.
PROUD can be modeled as a linear programming (LP) problem and solved as in [43].
Example V.1: For the delay-balanced circuit shown in Fig. 11(a), we can perform the UDF-displacement and obtain the circuit in Fig. 11(b), in which the AND gate is switched to the lower supply voltage.
2) The Nonintegral Delay-Difference Model: The obvious way to deal with the case where a gate has a nonintegral delay difference between the two supply voltages is to apply the ceil operator to that difference and then solve for PROUD. For example, for the voltage pair (2.5 V, 1.8 V), a gate with a delay of 1 unit at 2.5 V has a delay of 1.93 units at 1.8 V, which means a delay difference of 0.93. By using $\lceil 0.93 \rceil = 1$ instead, we can still use the above model while maintaining the critical path and obtaining near-optimal results. However, since we use 0.07 more UDFs than necessary for switching the gate to 1.8 V, our solution is potentially suboptimal. This suboptimality is caused by the ceil operator. Because the excess of 0.07 UDFs is small, the results will nevertheless almost always be optimal. On the other hand, for the voltage pair (3.3 V, 2.5 V), a gate with a delay of 1 unit at 3.3 V has a delay of 1.582 units at 2.5 V, a delay difference of 0.582. Substituting $\lceil 0.582 \rceil = 1$ may in all likelihood give suboptimal results. There is, however, a very simple solution to this problem: if we multiply all delay values in the circuit by a factor of 10, then the minimum value of the delay difference for the above example becomes $0.582 \times 10 = 5.82$. Also, the minimum gate delay, which was 1 initially, becomes 10. We can now use the ceil operator as before to get $\lceil 5.82 \rceil = 6$. The excess UDFs used to switch the gate to 2.5 V are therefore reduced from $1 - 0.582 = 0.418$ to $(6 - 5.82)/10 = 0.018$.
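The arithmetic of the scaling trick can be checked with a few lines of code. This is a minimal sketch: the function name and the `scale` parameter are hypothetical, and exact rational arithmetic is used so that the ceil operator is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil


def excess_udfs(delay_difference, scale=1):
    """Excess UDFs, in original delay units, caused by rounding up after scaling.

    delay_difference: extra gate delay at the lower supply voltage, given as a
                      decimal string or Fraction (e.g. "0.582")
    scale:            factor by which all circuit delays are multiplied
    """
    scaled = Fraction(delay_difference) * scale
    return (ceil(scaled) - scaled) / scale


if __name__ == "__main__":
    for diff, scale in [("0.93", 1), ("0.582", 1), ("0.582", 10)]:
        print(diff, scale, float(excess_udfs(diff, scale)))
```

Larger scale factors shrink the rounding excess further, but they also increase the number of UDFs the model has to handle.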
The PROUD formulation has been used to explore the use of two supply voltages for all the combinational circuits in the ISCAS85 benchmark suite [43]. The results show that it is a useful mathematical model and an efficient heuristic for handling dual supply voltages to reduce power consumption at the gate level.
VI. CONCLUSION
Several power-reduction approaches have been reviewed. Appropriate selection of datapaths can lead to a low-energy programmable implementation of an RS coder. Use of a hybrid digit-serial Radix-4 FFT architecture based on both feedforward and feedback commutator schemes eliminates the underutilization of the adders and multipliers and reduces the area of the implementation. It has been shown that appropriate unfolding and transpose-form FIR filter structures can reduce power by reducing switching activity. Finally, an optimal approach to reducing power consumption by use of dual or multiple supply voltages has been addressed. Interconnect power has been mostly neglected in the past but will become more important in the future. Tools for power estimation for circuits designed using deep submicron technologies are still not available, and future efforts will be directed toward this goal.
ACKNOWLEDGMENT
The author wishes to thank his colleagues L. Song, Y.-N. Chang, and V. Sundararajan. Most of the results presented here are a review of his joint work with them. He also wishes to thank T. Zhang for his help in preparation of this paper.
