Strategies for the design of ultra low power multipliers and multiplier-accumulators are reported. These are optimized for asynchronous applications being able to take advantage of data-dependent computation times. Nevertheless, the low power consumption can be obtained in both synchronous and asynchronous environments. Central to the energy efficiency is a dynamic-logic technique termed Conditional Evaluation which is able to exploit redundancies within the carry-save array and deliver energy consumption which is also heavily data-dependent.
INTRODUCTION
In many recently emerging portable applications of DSP circuits, maximizing battery life can be of paramount importance. In recent years, this has resulted in considerable research effort being directed at the development of VLSI design techniques for circuits with reduced power consumption. Such techniques span the design hierarchy from the algorithmic level down to process technology [1] . This paper addresses power minimization at the logic and circuit levels through reductions in circuit activity and switched capacitance.
Asynchronous circuits have also been the subject of a growing body of research. Amongst
Although relatively few asynchronous VLSI designs have been reported, in many cases data-dependent delays have been realized at the expense of low power, due to the adoption of differential logic styles such as DCVSL [2]. Whilst such styles readily facilitate Completion-Detection (CD), they are known to offer poor energy efficiency [1]. This paper reports strategies for the design of the building blocks of DSP computation, namely multipliers and multiplier-accumulators, which simultaneously achieve both ultra low-power operation and data-dependent delays. This is achieved by a synergy of architecture and circuit-level implementation, neither of which alone is responsible for the low energy consumption of the circuits.
The architecture adopted for the basic multiplication operation is based on the carry-save array (CSA) proposed in [3], which exploits data-dependent delays. This architecture has the property that much of its logic becomes redundant in proportion to the number of zeros in the multiplier operand.
An issue not investigated in [3] , however, is the potential for power reduction offered by this redundancy which can be exploited by inhibiting circuit activity within the redundant logic, as a function of input data. An implementation based on a hybrid static/dynamic logic CMOS circuit style lends itself readily to such exploitation. The feasibility of such an implementation was demonstrated by the authors in the design of an unsigned multiplier [4] using a technique we term 'Conditional Evaluation' illustrating both low-power operation as well as other benefits in terms of low device-count and high-regularity.
In this paper we extend that work by demonstrating that the synergy of architecture and implementation can be equally well applied to the design of signed multipliers and multiplier-accumulators to similar advantage.
It should be noted that whilst the data-dependency in terms of propagation delay is best exploited within an asynchronous framework, the high energy-efficiency can be obtained in both synchronous and asynchronous applications.
Section 2 describes background material underpinning the main design considerations. Section 3 describes the data-dependent carry-save architecture and its application to multiplier and Concurrent Multiplier-Accumulator (CMAC) designs.
Section 4 deals with adaptations to allow operation with 2's complement operands. Implementation and timing issues are central to obtaining the maximum energy savings of the proposed approach; these are discussed in Section 5. Simulation results and comparisons with other designs are presented in Section 6. Finally, conclusions are presented in Section 7.
BACKGROUND
In this paper, the goal of ultra low-power multiplication and multiply-accumulation has been achieved by the application of a number of different techniques, which are discussed below.
Carry-save Array Multiplication
Traditional carry-save array (CSA) multipliers comprise, in principle, an array of gated full-adder cells. In any single row of such cells, the 'gates' perform the AND of the multiplicand with a single bit of the multiplier, creating a 'bit-product'. The sum of all bit-products emerges from the last row of the array as Sum and Carry vectors which must be added to produce the final result, an operation involving full carry-propagation. This is carried out in a 'vector-merging' or Carry-Resolution Adder (CRA).
The terminology used in this paper is as follows.
An n-bit multiplicand, MD, whose MSB is $MD_{n-1}$, is multiplied by an m-bit multiplier, MR, whose MSB is $MR_{m-1}$. The product of one bit of MR with MD produces a bit-product of n bits. The kth row of the array adds the incoming partial product to its bit-product, thereby producing the kth partial product.
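The row-by-row summation described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: it operates on whole words rather than on individual full-adder cells, the function name `csa_multiply` is hypothetical, and the final `+` stands in for the carry-resolution adder.

```python
def csa_multiply(md: int, mr: int, n: int, m: int) -> int:
    """Behavioural sketch of an unsigned n-bit x m-bit carry-save array multiply.

    Hypothetical helper for illustration only; it is not the circuit itself.
    """
    assert 0 <= md < (1 << n) and 0 <= mr < (1 << m)
    sum_vec, carry_vec = 0, 0
    for k in range(m):
        # Row k: the 'gates' AND the multiplicand with multiplier bit k.
        bit_product = (md << k) if (mr >> k) & 1 else 0
        # Full-adder row: 3-to-2 compression, no carry propagation along the row.
        sum_vec, carry_vec = (
            sum_vec ^ carry_vec ^ bit_product,
            ((sum_vec & carry_vec) | (sum_vec & bit_product) | (carry_vec & bit_product)) << 1,
        )
    # Carry-resolution adder: merge the Sum and Carry vectors.
    return sum_vec + carry_vec
```

At every row the invariant `sum_vec + carry_vec == running partial product` holds, which is why only the final addition needs full carry propagation.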
The simple carry-save array is one of several approaches commonly used in multipliers to perform the summation of bit-products. Also widely used are tree structures, including those using 4:2 counters, and modified-Booth encoding which is applicable to both carry-save arrays and tree structures. Such strategies are known to offer improved worst-case latency over the simple carry-save array. However, as well as having less regularity, which complicates VLSI implementation, their architectures make the application of the energy saving conditional-evaluation technique more complex and less advantageous. Their use here, particularly in view of the attendant overheads, is not clearly justified since the primary design goal is low power.
Concurrent Multiply-Accumulation
In DSP algorithms, multiplication is very often followed by accumulation, and the time to perform a multiply-accumulate operation is often quoted as a performance measure of DSP hardware. Most commonly, multiply and addition operations are carried out separately in two cascaded hardware structures. Considerable benefits can, however, be obtained by carrying out the multiply and addition functions concurrently in the same structure, making use of the unused inputs around the periphery of a non-minimized CSA. An implementation of such a Concurrent Multiply-Accumulate (CMAC) structure was described in [5].
Since this approach exploits the otherwise unused inputs at the periphery of the CSA, the overall gate cost is considerably less than would be required to implement the addition separately. In comparison with conventional multiply-accumulate structures a reduction in area of 20% has been reported [5] . Furthermore a reduction in latency of 50% (in a synchronous environment) was obtained.
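The idea of folding the accumulation into the array can be sketched behaviourally as follows. The model is an assumption-laden simplification: the summand `d` is simply preloaded as the initial Sum vector, standing in for the otherwise unused peripheral inputs, and the name `cmac` is hypothetical.

```python
def cmac(d: int, md: int, mr: int, m: int) -> int:
    """Behavioural sketch of a concurrent multiply-accumulate: d + md * mr.

    The summand enters the carry-save array instead of a separate adder,
    so the number of carry-save rows is the same as for multiplication alone.
    """
    sum_vec, carry_vec = d, 0  # summand injected where the array has spare inputs
    for k in range(m):
        bit_product = (md << k) if (mr >> k) & 1 else 0
        sum_vec, carry_vec = (
            sum_vec ^ carry_vec ^ bit_product,
            ((sum_vec & carry_vec) | (sum_vec & bit_product) | (carry_vec & bit_product)) << 1,
        )
    # A single carry-resolution step produces the accumulated result.
    return sum_vec + carry_vec
```

In this sketch the accumulation costs no additional rows and only one carry-propagating addition, which is the intuition behind the area and latency reductions quoted from [5].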
Dynamic Logic
Use of dynamic logic for low-power operation inherently offers some advantages. In particular since each output can undergo, at most, one transition per evaluation, spurious transitions, which in static logic multipliers can account for as much as 50% of the energy consumption [6] , are eliminated. Although spurious transitions can be addressed by other static-logic methods such as delay balancing [7] , dynamic logic has the additional advantage of considerably reduced input capacitances due to the absence of a complementary logic tree.
On the other hand, in comparison with static logic, dynamic CMOS circuits have two drawbacks which tend to offset the energy advantages. The first of these is the need to charge and discharge the precharge/evaluate lines. This normally takes place once per cycle although with the conditional-evaluation technique reported in this paper, on average it happens less frequently. The second is the increased probability of output activity which results from the fact that the precharge voltage precedes each valid output voltage. Whilst for some combinational functions this puts dynamic logic at a significant disadvantage [1] , for others the increase in activity is small.
In the implementations reported here the energy benefits considerably outweigh the drawbacks. Indeed, the fact that evaluation of dynamic logic can be inhibited by a single input recommends it naturally to activity reducing schemes. To avoid the race problem, dynamic-logic carry-save arrays usually employ differential circuitry [2, 3] despite its relatively poor energy efficiency [1].
Alternatively, the power benefits of single-ended logic can be obtained by using self-timing to avoid the race [8].
For synchronous applications, a pipelined implementation of the two separated functions can deliver some increase in throughput albeit at the cost of latency, energy and area. In asynchronous applications, however, pipelining has the effect of pushing average-case performance closer to the worst-case due to starvation and blocking effects [11]. The argument in favor of using the CMAC structure in low-power applications is therefore valid for both synchronous and asynchronous circuits, but particularly so in the latter case.
The general architecture of an 8 + 4 × 4-bit unsigned CMAC structure, based on the data-dependent CSA, is shown in Figure 2.
In order to accommodate growth in wordlength it is common to provide guard-bits in the accumulator. These can be included by extending the most significant end of the CRA. As with the multiplier discussed above, the aim here is to produce an asynchronous CMAC with reduced propagation delays and energy consumption by utilizing the data-dependency of the operation.
Carry-resolution Adder
Conventional unsigned CSA multipliers require an n-bit adder for carry resolution. However, with the data-dependent CSA, the number of bits in the CRA is greater than n, requiring additional bits at the least significant end. This is because carries into the LSB of a row of the array can, if the row is in bypass mode, emerge from the array unresolved. In a stand-alone multiplier the first two rows of the CSA cannot produce (non-zero) unresolved carries and the number of bits required in the CRA is $n+m-3$ as shown in Figure 2. In the CMAC (and in certain 2's complement modifications to the multiplier described later) an additional CRA bit is required at the most significant end. Also, with the CMAC, since all rows except the first can produce unresolved carries, one further adder-bit is needed at the least significant end. These additional bits are shown shaded in the CRA of Figure 2. The CRA used is a ripple-carry adder, chosen because of its minimal power dissipation. It is fitted with completion-detection circuitry, which limits the increase in average latency due to carry rippling. The Activity-Monitoring Completion-Detection (AMCD) method [12] is used. AMCD has certain advantages over other completion-detection methods. In particular, its silicon and energy overhead is small in comparison with other techniques, especially when applied to domino CMOS [13]. A brief explanation of AMCD is given in Section 5.
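The reason completion detection pays off for a ripple-carry adder is that its settling time depends on the longest carry chain actually exercised by the operands, not on the adder width. The sketch below counts that chain for a given operand pair; the function name and the convention of counting the generating stage as length 1 are assumptions made for illustration.

```python
def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Length (in adder stages) of the longest carry ripple when adding a and b.

    A chain starts at a stage that generates a carry and is extended by each
    following stage that propagates it. Hypothetical helper for illustration.
    """
    carry = 0
    run = 0
    longest = 0
    for i in range(width):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        if ai & bi:                  # generate: a new carry starts here
            run = 1
            carry = 1
        elif (ai ^ bi) and carry:    # propagate an incoming carry one stage further
            run += 1
        else:                        # kill, or propagate condition with no carry present
            run = 0
            carry = 0
        longest = max(longest, run)
    return longest
```

For example, adding `0b1111` and `0b0001` exercises a chain through all four stages, whereas `0b0101` plus `0b1010` produces no carries at all; a completion-detecting adder can signal DONE much earlier in the second case.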
HANDLING SIGNED OPERANDS
The structure of Figure 2 performs multiplication and multiply-accumulation on unsigned numbers.
In many applications, however, the 2's complement data representation is used. The following section presents a method of adapting the above architecture to handle such operands.
Two's Complement Multiplier, MR
Array multipliers commonly use a simple modification to accommodate a 2's complement representation of MR: since the MSB of a 2's complement number carries a negative weight, instead of performing an addition of the bit-product to the partial-product in the mth (last) row of the array, a subtraction is carried out. The subtraction is often implemented by adding the 1's complement of MD together with a 1 in its LSB position.
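The last-row subtraction can be checked with a short arithmetic sketch. It assumes an unsigned n-bit MD and a 2's complement m-bit MR, works modulo $2^{n+m}$, and uses a hypothetical function name; it models the arithmetic only, not the array wiring.

```python
def multiply_signed_mr(md: int, mr: int, n: int, m: int) -> int:
    """Sketch: unsigned n-bit md times 2's complement m-bit mr.

    The last row subtracts by adding the 1's complement of md plus a 1
    at the row's LSB position. Illustrative helper, not the authors' circuit.
    """
    width = n + m
    mask = (1 << width) - 1
    mr_bits = mr & ((1 << m) - 1)
    acc = 0
    # Rows 0 .. m-2 carry positive weight: ordinary addition of bit-products.
    for k in range(m - 1):
        if (mr_bits >> k) & 1:
            acc += md << k
    # Row m-1 carries negative weight: add 1's complement of md, plus 1 at its LSB.
    if (mr_bits >> (m - 1)) & 1:
        ones_complement = (~md) & ((1 << (n + 1)) - 1)  # n+1 bits keeps the sign
        acc += (ones_complement << (m - 1)) + (1 << (m - 1))
    acc &= mask
    # Interpret the (n+m)-bit result as a 2's complement number.
    return acc - (1 << width) if acc >> (width - 1) else acc
```

The extra 1 is only added when the last row is active, which mirrors the requirement that it be injected only when that row performs an evaluation.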
The same algorithm can be implemented with the data-dependent CSA, albeit with some alternative method of adding the extra 1. Since there are no unused inputs to the CRA at the appropriate position, the addition of the 1 is carried out in the last row of the array by tying the unused input of the LSB's carry-bypass MUX to a logic 1. The 1 is therefore only selected when the row is performing an evaluation, as required.
An alternative which incurs no such penalties uses an arithmetic transform to convert each negative vector into a positive vector plus a (negative) correction term and has some similarities to the algorithm described in [14]. It is illustrated below by means of an example.
For any single binary digit, d, the Boolean identity $d = 1 - \bar{d}$, or more usefully $-d = \bar{d} - 1$, can be used to transform the negatively weighted MSB of each bit-product into its positively-weighted complement minus 1. Hence the summation of bit-products in the carry-save array can be transformed into the representation shown in Figure 3. Here, a row of 5 dots represents a 5-bit bit-product and an overbar represents the complement of a bit (shown with its correction term). By elimination of all negatively weighted bits within the array, addition of the sum and carry vectors from each row can take place without sign-extension, the MSB of the partial product being simply the most-significant Carry-out of the row. Figure 4 (a)
whilst the other requires some modification to the MSB cell itself, shown in Figure 4 (b). Although the logic of both these circuits can be minimized, their structure is retained here for clarity. Whilst these solutions are adequate for a stand-alone multiplier, it should be noted that both preclude the combining of multiply and addition functions in the CMAC structure.
The -1 -1 -1 -1 'correction term' can be dealt with most efficiently by Booth encoding it into the equivalent form -1 0 0 0 +1, whose LSB has a weight of $2^{n-1}$ (the same as the MSB of the multiplicand). Therefore this bit's addition can most simply be performed in the most significant cell of the CSA's first row, by tying its Carry-in to a logic 1 Carry-in input to the CRA. In other words the correction term is applied as:
The negatively weighted MSB of the correction term can be dealt with in the CRA by tying one of its MSB inputs to 1 and ignoring any Carry thereby produced, i.e., a modulo-two addition. sign-bits cannot be carried out as in the case of the multiplier because the unused adder inputs are now required for summand, D. Therefore, bit-product sign-bit inversion is carried out by replacing each row's most significant cell with that shown in Figure 5.
Here, the 1 required to be injected at the MSB Unlike the multiplier, the CMAC structure must anyway have circuitry (i.e., a MUX) to deal with the Carry-out from the most significant full-adder of each row. Consequently, addition of the positive bit of the correction term can be done more simply with the CMAC structure than with the multiplier, merely requiring that the top row's most significant Cin input be tied to a logic 1. The negative bit of the correction term is dealt with, as before, in the MSB of the CRA.
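The sign-bit complement transform and its correction term can be verified numerically. The sketch below is restricted, as a simplifying assumption, to a 2's complement n-bit MD and an unsigned m-bit MR. The closed form used for the correction, $-2^{n+m-1} + 2^{n-1}$, is obtained here by summing $-2^{n-1+k}$ over the m rows; for m = 4 it corresponds to the pattern -1 0 0 0 +1 with LSB weight $2^{n-1}$. The function name is hypothetical.

```python
def signed_md_product(md: int, mr: int, n: int, m: int) -> int:
    """Sketch: 2's complement n-bit md times unsigned m-bit mr.

    Each bit-product's negatively weighted MSB is replaced by its complement
    with positive weight; a single correction term restores the value.
    """
    md_bits = md & ((1 << n) - 1)
    low_mask = (1 << (n - 1)) - 1
    sign = (md_bits >> (n - 1)) & 1
    total = 0
    for k in range(m):
        mr_k = (mr >> k) & 1
        low = (md_bits & low_mask) if mr_k else 0   # positively weighted bits
        p_msb = sign & mr_k                         # MSB of the bit-product
        # Use -d = (not d) - 1: complemented MSB enters with positive weight.
        row = low | ((p_msb ^ 1) << (n - 1))
        total += row << k
    # Sum of the per-row "-1" terms: -(2^(n-1)) * (2^m - 1).
    correction = -(1 << (n + m - 1)) + (1 << (n - 1))
    return total + correction
```

Note that rows whose multiplier bit is zero contribute a complemented MSB of 1, which is cancelled exactly by that row's share of the correction term, so no sign-extension is needed anywhere in the array.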
IMPLEMENTATION AND TIMING

Interface to the Environment
The multiplier and CMAC circuits fit into the asynchronous framework for dynamic logic proposed in [15] i.e., the leading and trailing edges of the START input, initiate the evaluation and precharge phases respectively. Similarly, the leading edge of the DONE signal indicates validity of the output data and its trailing edge indicates completion of the precharge phase. Such a protocol requires that the input operands are stable before the leading edge of the START signal and remain so throughout the computation.
To ensure low-power operation, the circuits should not be left in the evaluation phase for longer than the dynamic storage time, otherwise charge leakage from dynamic storage nodes could cause short-circuit power dissipation in the domino inverters. This requirement is easily met in both synchronous and asynchronous applications, a handshake circuit suitable for the latter being described in [15] .
Carry Save Array Using Conditional Evaluation
In the implementation of [4], Conditional Evaluation was shown to be an important mechanism for power reduction, a row in bypass mode typically using less than one quarter of the energy of a row in which addition is performed.
The structure of the CSA cell is shown in Figure 6 (a). The inverters provide buffering to prevent excessive rise/fall times through the MUXs (particularly important in bypass mode). As outlined above, to maximize energy savings, single-ended dynamic logic was employed, unlike the differential implementation of [3]. The full-adder is itself composed of a cascade of two dynamic n-blocks, one which evaluates the Carry-out signal, Co, and one which uses Co to evaluate the Sum output, S. This circuit was used because of its low device count and hence low switched capacitance. The precharge/evaluate lines for the carry and sum circuits are labeled CarryEval and SumEval respectively and are common to all cells in the same row. The transistor-level schematic of the two dynamic blocks is shown in Figure 6 (b).
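A rough feel for the resulting data-dependency can be had from a first-order estimate. The sketch below assumes every evaluating row costs one unit of energy and every bypassed row costs a fixed fraction of that; the default of 0.25 is taken as an upper bound from the "less than one quarter" figure above. The CRA, timers and interface are ignored, and the function name is hypothetical.

```python
def relative_csa_energy(mr: int, m: int, bypass_ratio: float = 0.25) -> float:
    """First-order estimate of CSA energy relative to an all-ones multiplier.

    Assumption: bypass rows cost bypass_ratio of an adding row (0.25 is an
    upper bound taken from the text, not a measured value).
    """
    ones = bin(mr & ((1 << m) - 1)).count("1")   # rows that evaluate
    zeros = m - ones                              # rows in bypass mode
    return (ones + bypass_ratio * zeros) / m
```

Under these assumptions a 4-bit multiplier operand of `0b0101` gives an estimate of 0.625, and an all-zero operand gives 0.25, illustrating how the energy falls as the number of zeros in MR rises.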
The self-timed mechanism used to eliminate the dynamic-logic race uses inserted delays to postpone evaluation of each dynamic stage until all of its inputs have stabilized [8]. Each row of the CSA is controlled by a Conditional Evaluation timer circuit whose structure is shown in Figure 7. For rows whose multiplier bit is one, evaluation of the carry and sum circuits takes place when low-high transitions occur on CarryEval_H and SumEval_H, separated in time by $\tau_c$. The duration of $\tau_c$ is chosen to allow Co to settle before switching the sum circuit into evaluation mode. Similarly, the remaining timer delays, including $\tau_b$, are chosen to postpone evaluation in the following row until it is safe to do so.
TrigIn of the first row's timer is driven by the START signal which initiates computation in the array. It is a requirement of this type of circuit that the input operands are stable before the leading edge of the START signal and remain so throughout the evaluation phase. The TrigOut signal is connected to TrigIn of the following row's timer. In rows whose multiplier bit is zero, it is passed on through the MUX after delay $\tau_b$ to the following row, without firing the precharge/evaluate lines.
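The trigger chain just described can be modelled as a simple sum of per-row delays. In the sketch below, `t_eval` lumps together all delays through the timer of an evaluating row and `t_bypass` is the delay of a bypassed row; both parameters, their example values and the function name are hypothetical and not taken from the design.

```python
def csa_trigger_chain(mr: int, m: int, t_eval: float, t_bypass: float):
    """Sketch of the row-timer chain: returns (time TrigOut leaves the last row,
    list of rows whose precharge/evaluate lines fire).

    t_eval and t_bypass are placeholder delays chosen for illustration.
    """
    time = 0.0
    fired_rows = []
    for k in range(m):
        if (mr >> k) & 1:
            fired_rows.append(k)   # CarryEval and SumEval are driven in this row
            time += t_eval
        else:
            time += t_bypass       # trigger passes through the MUX only
    return time, fired_rows
```

With placeholder values `t_eval=3.0` and `t_bypass=1.0`, the operand `0b1010` fires rows 1 and 3 only and the trigger emerges after 8.0 time units, compared with 12.0 for an all-ones operand; both delay and switching activity track the number of ones in MR.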
Delays for the timer structure of Figure 7 can most simply be implemented with a minimum-sized inverter whose pull-down is connected to ground via a permanently-on n-type device. The channel length of this latter device is selected to produce the required delay. Whilst such a circuit has a very low switched capacitance, selection of the channel length requires careful simulation. Furthermore, the structural differences between the delay and the data-path it is intended to model will demand a greater margin of safety to ensure correct operation under a spread of operating conditions. An alternative to this simple delay is to use a matched delay circuit. This presents a higher switched capacitance but has the advantage that the delay more closely tracks that found in the data-path: the delay is produced using a circuit identical to that of the data-path but hard-wired to its slowest configuration.
A characteristic of the structure of Figure 7 is that rise times on CarryEval and SumEval add to the delays through the timer. If these rise times are known accurately then the timer delays can be adjusted accordingly. However, given the relatively large capacitance on these two signals, their rise times may be subject to significant variation and therefore the timer may require more conservative timing margins than would otherwise be necessary. To avoid this uncertainty the structure can be modified slightly such that the capacitances of SumEval and CarryEval are decoupled from the actual timing path by buffering. This gives tighter control of the timer delays but incurs the added risk that a large difference between the rise times of CarryEval and SumEval could cause the timing requirements to be violated. In practice, since the load and interconnect of these two signals is very similar, differences between their rise times can be kept small. The row-timer adopted for the designs reported here, uses this decoupled timer structure with matched circuits for the delays.
Asynchronous Carry Resolution Adder
The carry-resolution adder uses a similar full-adder circuit to that shown in Figure 6( since a transition has just occurred on one of its inputs. Assuming appropriate placement (granularity) of the AMs and sufficient pulse-widths to ensure overlap, the logical OR of the individual pulses indicates that the whole circuit is still in transition. This signal is generated with a wired-OR configuration and its trailing edge indicates that the circuit has completed its operation.
Granularity of the AMs is set according to [13], i.e., such that one AM covers two cascaded full-adders. To deal with the case when there are no transitions on the monitored signals, a minimum-delay-generator (MDG) is required; an AM connected to CarryEval is used in order to produce this required minimum delay.
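The role of pulse overlap and of the minimum-delay-generator can be illustrated with a timing sketch. It treats each monitored transition as producing a pulse of fixed width, ORs the pulses together with an initial MDG pulse starting at time zero, and reports the first trailing edge of the combined signal. All names and numeric values are illustrative assumptions rather than details of the AMCD circuit of [12].

```python
def amcd_done_time(transition_times, pulse_width: float, min_delay: float) -> float:
    """Sketch: time of the first trailing edge of the ORed activity signal.

    transition_times: times at which monitored nodes switch (hypothetical input).
    The MDG is modelled as a pulse from 0 to min_delay.
    """
    pulses = [(0.0, min_delay)]
    pulses += [(t, t + pulse_width) for t in sorted(transition_times)]
    pulses.sort()
    end = pulses[0][1]
    for start, stop in pulses[1:]:
        if start > end:
            # Gap between pulses: the activity signal has already fallen,
            # so completion would be signalled prematurely.
            break
        end = max(end, stop)
    return end
```

With transitions at 1.0, 2.0 and 3.0 and a pulse width of 1.5, the pulses overlap and completion is signalled at 4.5. If the transitions are at 1.0 and 5.0 instead, the signal falls at 2.5, before the last transition, which is why monitor granularity and pulse width must be chosen to guarantee overlap. With no transitions at all, the MDG alone sets the completion time.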
To avoid the dynamic race, evaluation of the sum circuits is delayed until the output from the AMCD circuit indicates that all carries have settled. Therefore SumEval should rise following a rising edge on ACT_L (the signal indicating completion), but fall when the multiplier/CMAC enters its precharge phase (during which period ACT_L remains high).
To produce these dependencies, an interlock circuit as shown in Figure 8 is used.
To produce the DONE signal, whose leading edge indicates validity of the output data, an additional delay is used, DONE being merely the CRA's SumEval signal delayed by at least the settling time of the sum circuit. Its trailing edge indicates that the precharge phase is complete. The delay was implemented here using the matched delay approach described above.
Precharge Strategy and Spurious Transitions
The time to precharge the CMAC adds directly to its cycle time and should be kept as small as possible. One of the attractions of dynamic logic for low power circuits is its immunity from spurious transitions during evaluation. Dynamic circuits using the delayed-evaluation technique can, however, be susceptible to spurious transitions when entering precharge unless the relative timing of each stage's precharge phase is appropriately controlled. This is because some logic-tree transistors are turned on during precharge and therefore an upstream gate precharging before a downstream gate, could result in a glitch on the output of the downstream gate.
To avoid this possibility, the safest, most general approach is to ensure that all stages precharge in the reverse order to their evaluation (First-Evaluate, Last-Precharge). However, this demands a more complex control circuit than the simple cascade of delays described above and, furthermore, a fairly long precharge delay is produced.
Another possible approach is to ensure that the time for a H-L transition to propagate through delay $\tau_c$ (Fig. 7) is sufficiently short that the sum circuit does not have time to react to the precharged Carry-out signal before it, too, is precharged. Therefore [16] and is based on the assumption that
In the case of computation delays, normalization has been carried out using a different algorithm. Delays taken from [8, 16] and [17] have been linearly extrapolated to a 16-bit multiplier operand. The circuits of [8, 16] and [18]
Table notes: CSA only, CRA not included. Energy benefits from non-random operands. Delay is an average value.
As can be seen, our design yields the lowest normalized energy consumption per multiplication. The energy given for the multiplier of [16] is taken from a FIR filter and uses other, filter-specific techniques to reduce operand activity. In a similar environment our circuit could be expected to consume significantly less energy.
The design of [17] is a CSA only and the significant energy and delay contributions of resolving the sum and carry vectors are omitted. A nonstandard CMOS process adapted to maintain performance at reduced supply voltage was used in [18], a fact reflected in the given delay. The other designs were based on standard CMOS processes and a performance comparison is more meaningful. In this respect our design is, on average, 19% slower than [16] but shows at least a 14% improvement in Energy*Delay. In terms of Energy*Delay² our design is approximately equivalent to [16].
CONCLUSIONS
Self-timed circuits for multiplication and multiply-accumulation using a data-dependent architecture have been implemented, laid-out and simulated at transistor-level. Adaptation of the architecture to handle 2's complement operands has been discussed and an efficient solution is provided.
Measurements indicate that the proposed design style offers considerable benefits in terms of energy consumption. These are achieved by a synergy of data-dependent architecture and implementation. Central to the energy efficiency of the implementation is the Conditional-Evaluation technique, which enables redundant dynamic-logic activity to be inhibited at very low overhead. Comparisons show this approach can achieve lower energy consumption than other reported designs.
The designs also benefit from low device count, high regularity and good testability, which facilitate VLSI implementation. Whilst the data-dependent computation times of the proposed structures can be best exploited within an asynchronous environment, the high energy-efficiency can be obtained in both synchronous and asynchronous applications.
