The design of portable, battery-operated multimedia devices requires energy-efficient multiplication circuits. This paper proposes a novel architectural technique to reduce the power consumption of digital multipliers. Unlike related approaches, which focus on reducing multiplier transition activity, we concentrate on dynamic reduction of the supply voltage. Two implementation schemes capable of dynamically adjusting a dual voltage supply to input data variation are presented. Simulations show that these schemes reduce the energy consumption of a 16 × 16-bit multiplier by 34% and 29% at peak and by 10% and 7% on average, with area overheads of 15% and 4%, respectively, while maintaining the performance of a traditional multiplier.
Introduction

Motivation
Digital array multipliers are essential arithmetic blocks for many DSP applications: convolution, filtering, discrete cosine transform (DCT), vector quantization, etc. Due to their high capacitive load and large bit-width, these structures are the most energy-consuming units in modern DSP circuits. In NEC's 16-bit SPX processor, for example, two multiplying units dissipate almost half of the total power [1]. As a result, optimizing multipliers for energy is important.
In digital CMOS circuits, charging and discharging of capacitors dominates the total energy dissipation. Given the average load capacitance (C), the supply voltage (Vdd), and the number (α) of energy-consuming signal transitions per operation, the average energy dissipation of a CMOS multiplier can be expressed by E = α · C · Vdd².
Although this energy can be lowered through any of these factors, voltage reduction offers the most drastic means of minimizing energy consumption. Unfortunately, the price to be paid is a higher delay: D ∝ V/(V − VT)², where V is the supply voltage and VT is the threshold voltage. If, however, both supply voltage and delay are dynamically varied in response to computational load demands, then the energy consumed per task can be reduced during periods of low load, while retaining peak throughput when required.
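The trade-off between energy and delay can be illustrated with a minimal numerical sketch. It assumes the usual quadratic form E = α · C · Vdd² for the switching energy and the delay relation D ∝ V/(V − VT)² quoted above. The threshold voltage of 0.6 V and the function names are assumptions made for illustration only; the 3.3 V and 2 V levels are the ones used later in the experiments.

```python
def switching_energy(alpha, c_load, v_dd):
    """Dynamic energy per operation, assuming E = alpha * C * Vdd^2."""
    return alpha * c_load * v_dd ** 2


def relative_delay(v_dd, v_t):
    """Delay up to a constant factor, using D ~ V / (V - VT)^2."""
    if v_dd <= v_t:
        raise ValueError("supply voltage must exceed the threshold voltage")
    return v_dd / (v_dd - v_t) ** 2


V_HIGH = 3.3  # volts, high supply level used in the experiments
V_LOW = 2.0   # volts, low supply level used in the experiments
V_T = 0.6     # volts, ASSUMED threshold voltage (not given in the text)

# alpha and C cancel in the ratio, so unit values are sufficient here.
energy_ratio = switching_energy(1.0, 1.0, V_LOW) / switching_energy(1.0, 1.0, V_HIGH)
delay_ratio = relative_delay(V_LOW, V_T) / relative_delay(V_HIGH, V_T)

print(f"energy at {V_LOW} V relative to {V_HIGH} V: {energy_ratio:.2f}")
print(f"delay at {V_LOW} V relative to {V_HIGH} V: {delay_ratio:.2f}")
```

Under these assumptions the switching energy per operation drops to roughly 0.37 of its high-voltage value, while the delay grows by a factor of about 2.25. Real circuits deviate from this first-order model, so the numbers indicate only the trend.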
Related Research
There has been extensive research on energy reduction in digital multipliers, with most effort devoted to reducing transition activity in the adding array. The proposed methods cover sign-magnitude representation, algebraic operation re-ordering, and self-timing [2], [3], replacing the carry-save array by a tree-based structure [4], inserting extra hardware in the array to stop spurious transitions [5]-[8], delay balancing [9], [10], using adding compressors, modified sign-extension, and coding [11], truncating the operands [12], decomposing the operands [13], applying a mixed number representation with canonical sign digit numbers [14], interchanging the multiplicands [15], multiplicand reordering and optimization [16], optimizing the adding cells [17], activating the adding cells as the evaluation wave moves within the array [18], employing a dynamic range detection unit to control the data path [19], [20], etc. Despite their differences, all these approaches have one feature in common: they focus on transition activity reduction, assuming that the supply voltage is fixed and independent of the workload. To our knowledge, architecture-driven voltage scaling [3], which has proved its efficiency in a variety of designs, has not yet been applied to multipliers. There have been attempts to change the supply voltage adiabatically, i.e. by the system clock [21]. This scheme enables the charge transfers to occur in a controlled manner, keeping the charging currents constant and thus limiting the dissipation across the active circuit devices. Any undissipated energy related to charge stored in circuit capacitance is recycled via an inductor or a network of switched capacitors. To achieve the constant (non-exponential) current waveform, the scheme requires a time-varying (sinusoidal) supply voltage. Since this voltage alternation must be slow, making fast adiabatic multipliers is almost impractical.
Contribution
This paper presents a new approach to reducing the energy consumption of digital multipliers. Unlike existing methods, the approach targets a previously unexplored degree of freedom in multiplier energy optimization, namely, the supply voltage per operation. Because the full multiplication bit-width is rarely required in real applications, the time budget of the multiplication frequently goes unused. We propose to trade this unused time for a lower supply voltage.
The paper is structured as follows. The next section presents the proposed approach and outlines its circuit implementation. Section 3 analyzes the performance. Conclusions are drawn in Sect. 4.
Proposed Approach
Main Idea
Our approach is based on three features of digital multipliers employed in media processors and DSP:
1. Fixed multiplication latency and bit-width,
2. Unevenness of delays corresponding to each bit, and
3. Unevenness of bit-utilization during operation.
A typical multiplier computes the product based on the whole bit-width (16 bits or more). Its latency is fixed by the time needed to multiply the largest operands. Furthermore, the multiplier is driven by a single supply voltage, whose level is high enough to charge and discharge the internal circuitry within the time interval, T, that satisfies the performance requirement. The circuit capacitance that has to be charged or discharged to produce a given bit of the product depends on the bit position; the Most Significant Bits (MSB) have larger capacitance than the Least Significant Bits (LSB). Consequently, the actual delay required to generate the product varies along the bit-width: the LSBs are produced faster than the MSBs.
Furthermore, we observed that media applications rarely utilize the whole bit-width of the operands because of the large number of zeros in the MSBs, especially when the sign-magnitude representation is used. As an example, consider Fig. 1, which shows the values that occurred on the DCT input during MPEG coding and the frequency of their occurrence. Because the majority of the values are small in magnitude (less than 32), representing them at full (e.g. 16-bit) resolution is unnecessary. Even though the actual bit-width of the DCT multiplicand varied with the frame type (see Fig. 2), five bits were sufficient in most cases. Notice that a similar observation has been made in [22] for speech filtering and speech recognition.
Since the actual multiplication time (Tact) is much shorter than the time interval, T, allocated for the worst case (say, 16 × 16-bit multiplication), the digital multiplier has an idle time Tidle = T − Tact whenever a short operand is used. We propose to trade the idle time for a lower supply voltage in order to save energy dissipated in the multiplier. The approach takes advantage of both the idle time that exists in the multiplier when small values are multiplied and the relation between the multiplication time and the supply voltage. In CMOS multipliers, processing delay is almost inversely proportional to the supply voltage, as shown in Fig. 3. A low supply voltage lengthens the circuit delay, while a high one accelerates the circuit operation, shortening its delay. Because a moderate decrease in the voltage drastically shrinks the energy dissipation, we propose to lower the supply voltage, and thus eliminate the idle time, whenever the multiplication bit-width (i.e. delay) is reduced. In general, the number of supply voltages can equal the number of disabled MSBs. Although efficient DC-DC level converters are already available [23], there is still some cost involved in supporting several different supply voltages. Therefore we suggest two fixed voltage levels (VH, VL), selected dynamically based on the number of zero MSBs in the multiplicands. (We assume that these supply voltages are available on the chip from a DC-DC converter [23].)
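As a rough illustration of trading idle time for voltage, the sketch below searches for the lowest supply level at which a shortened multiplication still fits into the allocated interval T. It relies on the first-order delay model D ∝ V/(V − VT)² rather than on circuit simulation, and the threshold voltage, the delay of the reduced multiplier at VH, and the search step are all assumed values chosen for illustration.

```python
def delay_scale(v, v_ref, v_t):
    """Factor by which delay grows when the supply drops from v_ref to v,
    using the first-order model D ~ V / (V - VT)^2."""
    def model(x):
        return x / (x - v_t) ** 2
    return model(v) / model(v_ref)


def lowest_voltage(t_act_at_vh, t_budget, v_high, v_t, step=0.05):
    """Scan downward from v_high and return the lowest supply voltage
    at which the stretched delay still fits into t_budget."""
    best = v_high
    n_steps = int(round((v_high - v_t) / step))
    for i in range(1, n_steps):
        v = v_high - i * step
        if v <= v_t:
            break
        if t_act_at_vh * delay_scale(v, v_high, v_t) <= t_budget:
            best = v
        else:
            break  # delay grows monotonically as v decreases
    return best


# ASSUMED numbers: 2.0 ns for the reduced multiplication at 3.3 V,
# a 4.5 ns budget, and a 0.6 V threshold voltage.
v_low = lowest_voltage(t_act_at_vh=2.0, t_budget=4.5, v_high=3.3, v_t=0.6)
print(f"lowest usable supply voltage: {v_low:.2f} V")
```

The idle time Tidle = T − Tact is consumed by the slower operation at the returned voltage. In a real design this level would be established by circuit simulation or measurement rather than by the analytical model.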
The basic idea of our approach can be formulated as follows. Instead of running the multiplier circuits at a single high voltage and then waiting until the end of the allocated (clock) time, we examine the k most significant bits of the incoming operands using zero-detection circuitry. If all of these bits are zero, we use the low voltage VL and a reduced multiplier with the MSB circuits disabled. Otherwise the high voltage VH is selected to power both the LSB and the MSB circuits and accelerate the operation. Figure 4 illustrates the idea for a 16 × 16-bit multiplier with k = 5. The patterns in this figure depict the multiplier delay; D16×16 and D5×5 are the delays of the 16 × 16-bit and 5 × 5-bit multipliers, respectively. Because of the quadratic relationship between the supply voltage and the energy consumed per multiplication, the energy saving achievable by the approach is proportional to h × (VH² − VL²), where h is the frequency of downsizing the multiplication to k × k bits.
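The decision step can be sketched behaviorally as follows. The function names are hypothetical, and the sketch assumes unsigned n-bit operands whose k most significant bits are checked for zero, so that the remaining n − k bits are multiplied at low voltage; with n = 16 and k = 11 this leaves 5-bit operands. The last function evaluates the proportionality term h × (VH² − VL²) only; it is not an absolute energy figure.

```python
def msbs_are_zero(value, n, k):
    """True if the k most significant bits of an unsigned n-bit value are zero."""
    if not 0 <= value < (1 << n):
        raise ValueError("value does not fit into n bits")
    return (value >> (n - k)) == 0


def select_mode(x, y, n=16, k=11):
    """Return 'low' (VL, reduced multiplier) when both operands have
    k zero MSBs, otherwise 'high' (VH, full multiplier)."""
    if msbs_are_zero(x, n, k) and msbs_are_zero(y, n, k):
        return "low"
    return "high"


def saving_term(h, v_high, v_low):
    """Quantity the energy saving is proportional to: h * (VH^2 - VL^2)."""
    return h * (v_high ** 2 - v_low ** 2)


# Hypothetical operand pairs, chosen only to exercise both modes.
pairs = [(3, 17), (31, 31), (32, 5), (1000, 2), (7, 0)]
modes = [select_mode(x, y) for x, y in pairs]
h = modes.count("low") / len(modes)

print(modes)
print(f"h = {h:.2f}, saving term = {saving_term(h, 3.3, 2.0):.2f}")
```

Checking the MSBs with a shift mirrors a zero-detection circuit that ORs the upper bits together. In hardware the decision would be latched in a flip-flop before the operands reach the array.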
Implementation Schemes
To realize the proposed approach in hardware, we developed two schemes, presented below.
Scheme 1. This is a simple implementation that combines an n × n-bit multiplier (driven by VH) with a k × k-bit multiplier (driven by VL), as shown in Fig. 5. The decision logic takes the k most significant bits of the incoming n-bit operands (X, Y) and detects whether they are all zero. If so, it sets the flip-flop (F) to one, choosing the low-voltage/low-precision mode and directing the multiplicands to the k × k-bit multiplier. Otherwise the n × n-bit multiplier is selected. The output of the chosen multiplier is multiplexed to the system output (register Rp). Note that the low-precision mode sets only the 2(n − k) LSBs of Rp. The other bits of the register are reset to zero in this mode without processing. Certainly, this scheme has a significant area overhead (n × n-bit multiplier, k-bit zero-detection circuit, and control logic to route the input multiplicands and select the result). However, it is fast. The delay it introduces is the sum of the delay of a 2-to-1 multiplexer and the activation delay of a three-state buffer.
Scheme 2. In contrast to the above implementation, this scheme does not require an extra multiplier. It uses the same array for both n × n-bit and (n − k) × (n − k)-bit multiplication, while dynamically changing the voltage supplied to the array and enabling or disabling some of its hardware. Figure 6 shows the circuit organization. In the full-precision mode (flip-flop F is set to zero), the multiplier operates normally, with all its circuits activated and driven by the high voltage (VH). In the low-precision mode, the high output of the flip-flop (F) selects the low supply voltage (Vdd = VL) and disables the multiplier hardware that is not necessary for the (n − k) × (n − k)-bit multiplication. Figure 7 illustrates the internal structure of a 6 × 6-bit array multiplier, modified for 2 × 2-bit (low-precision) multiplication at the low voltage VL and 6 × 6-bit multiplication at the high voltage VH. In this figure, FA denotes a 1-bit full adder, HA a half adder, "+" an adding cell, and a bold bar a three-state buffer. (The power and control lines are not shown for simplicity.) In the low-precision mode, the three-state buffers disconnect the non-patterned blocks from the input/output and power lines, leaving active only a small set of adding cells (shown in gray). Driven by the low voltage VL, these cells operate slowly, thus filling the idle time with useful work. In contrast, the high-precision mode connects the high supply voltage (VH) to all the circuits, accelerating their operation. Fed by VH, the circuit performs 6 × 6-bit multiplication in the conventional way.
Scheme 2 does not require a large circuit overhead in comparison to either the conventional design or scheme 1. Assuming that the MSB circuits are powered by a single power line, disabling them requires one three-state buffer. Cutting off the k MSBs requires k² buffers placed on the global inputs and (n − k)² buffers within the array. Additionally, (n − k)² buffers are needed to disconnect carries and sums between the disabled cells and the active cells. Thus, in total we have 2 × (n − k)² + k² + 1 buffers. However, the scheme requires that the (n − k) × (n − k) adding cells be redesigned with increased tolerance to voltage degradation. To decrease the leakage current of the three-state buffers in the high-impedance state, dual- or multi-threshold CMOS design techniques, such as [24], [25], have to be employed.
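The buffer count above can be checked with a short calculation. The function name is hypothetical, and the two parameter sets are illustrative: n = 6, k = 4 corresponds to a 6 × 6-bit array with a 2 × 2-bit low-precision core, and n = 16, k = 11 to a 16 × 16-bit array with a 5 × 5-bit core.

```python
def three_state_buffer_count(n, k):
    """Total three-state buffers for scheme 2, following the breakdown
    in the text: 1 for the MSB power line, k^2 on the global inputs,
    (n - k)^2 within the array, and (n - k)^2 on carries and sums."""
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    power_line = 1
    global_inputs = k ** 2
    inside_array = (n - k) ** 2
    carries_and_sums = (n - k) ** 2
    return power_line + global_inputs + inside_array + carries_and_sums


for n, k in [(6, 4), (16, 11)]:
    print(f"n={n}, k={k}: {three_state_buffer_count(n, k)} buffers")
```

Summing the four contributions separately makes it easy to confirm that the total equals 2 × (n − k)² + k² + 1.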
Experimental Results
To evaluate the proposed approach we designed three versions of a 16 × 16-bit radix-4 Booth multiplier, based on the traditional scheme [26] and on schemes 1 and 2 (see Sect. 2.2), respectively. All the designs were carried out using 0.35 µm CMOS ROHM technology, Synopsys Design Compiler, and Design Composer Tools from Cadence. A constraint of 4.5 ns (220 MHz clock frequency) was applied to control the multiplier's worst-case delay. Figure 8 shows the design layouts. In contrast to the traditional design (Fig. 8(a)), which is always driven by a 3.3 V supply, the designs shown in Figs. 8(b), (c) utilize two voltage supplies and operate in two voltage modes. In the high-voltage mode, both designs work as a conventional 16 × 16-bit Booth multiplier, making use of the full 16-bit operand representation and a 3.3 V supply. In the low-voltage mode, they utilize only the five LSBs of each multiplicand and a low voltage (VL).
The level of VL was determined empirically by SPICE simulation. Figure 9 shows the delays of the 5 × 5 multiplier measured for different supply voltages. The lowest voltage level that ensured correct multiplication under the given time constraint was 2 V. Using this voltage level, we then evaluated the energy dissipation of the proposed designs by SPICE simulation. Since energy strongly depends on switching activity, and consequently on input data, we evaluated energy consumption on five different data patterns (of 64 values each) observed on the DCT input during MPEG2 video coding. The patterns differed in the share of values that can be represented by five bits: the first pattern had no such values, the second 25%, the third 50%, the fourth 75%, and the fifth 100%. Figure 10 shows the results in terms of the average energy (mW/MHz) consumed by the designs on these data patterns. We observe that the energy consumption strongly depends on the input pattern. When the input values require more than 5 bits for representation, the proposed schemes consume more energy than the traditional one because of the energy overhead of the extra circuitry. However, as the occurrence of small values on the inputs increases, the proposed schemes achieve better results. Furthermore, scheme 1 saves more energy than scheme 2 for all input patterns except the first one. This input pattern requires only 16 × 16-bit multiplication, so scheme 1, with its larger hardware overhead, takes more energy. As more small numbers appear on the inputs, both proposed schemes switch their operation from 16 × 16 bits to 5 × 5 bits more frequently. While scheme 1 only needs to multiplex the pre-charged circuits and convert the voltage on the input/output of the 5 × 5 multiplier, scheme 2 has to charge and discharge the internal circuitry of the multiplier. Therefore, the more 5-bit operands appear on the inputs, the higher the energy efficiency of scheme 1.
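Table 1 shows the energy reduction ratio (RR) of the proposed schemes observed for the tested patterns. The value of RR was computed relative to the traditional design as follows: RR = (E_traditional − E_proposed) / E_traditional × 100%, where E_traditional and E_proposed denote the energy consumed by the traditional and the proposed design, respectively.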
As we see, the proposed schemes 1 and 2 can reduce multiplier energy consumption by 34% and 29%, respectively, at peak (i.e. when both operands are represented by 5 bits), while achieving 10% and 7% energy savings on average over all the patterns.
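A reduction ratio of this kind can be computed as in the sketch below. It assumes that RR is the relative energy difference with respect to the traditional design, expressed in percent. The energy values are placeholders chosen for illustration, not measured data from Table 1 or Fig. 10.

```python
def reduction_ratio(e_traditional, e_proposed):
    """Energy reduction ratio in percent, ASSUMED to be
    (E_traditional - E_proposed) / E_traditional * 100."""
    if e_traditional <= 0:
        raise ValueError("reference energy must be positive")
    return 100.0 * (e_traditional - e_proposed) / e_traditional


def average_reduction(e_traditional_list, e_proposed_list):
    """Mean reduction ratio over several input patterns."""
    ratios = [reduction_ratio(t, p)
              for t, p in zip(e_traditional_list, e_proposed_list)]
    return sum(ratios) / len(ratios)


# Placeholder energies (arbitrary units) for five hypothetical patterns.
traditional = [100.0, 100.0, 100.0, 100.0, 100.0]
proposed = [110.0, 100.0, 90.0, 80.0, 66.0]

per_pattern = [reduction_ratio(t, p) for t, p in zip(traditional, proposed)]
print(per_pattern)
print(average_reduction(traditional, proposed))
```

A negative ratio, as for the first placeholder pattern, corresponds to the case in which the extra circuitry makes a proposed scheme consume more energy than the traditional design.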
We also evaluated the efficiency of the proposed approach on data taken from the first B-type frame of the standard video benchmark "Salesman" (frame size 258 × 288 pixels), using unsigned values of the DCT error image and the DCT coefficients. The simulation revealed that, in comparison to the traditional multiplier, power is reduced on average by 21.4% with scheme 1 and by 13.2% with scheme 2.
In terms of area, the proposed schemes were larger than the conventional Booth multiplier by 14.8% (scheme 1) and 3.7% (scheme 2).
Conclusion
This paper presented a novel technique for reducing the energy consumption of digital multipliers. The technique differs from existing research by exploiting a new degree of freedom in multiplier design, namely the supply voltage per operation. By dynamically adjusting the supply voltage to the operand bit-width, the proposed schemes were able to save energy by 34% (scheme 1) and 29% (scheme 2) at peak and by 10% and 7% on average, in comparison to the traditional design. The area overhead of the proposed schemes was 15% (scheme 1) and 4% (scheme 2). In the current study we were unable to compare our approach with techniques other than the traditional (non-optimized) multiplier design because of differences in implementation and input data. To provide such a comparison, we are working on experimental implementations of several related techniques. Future work will also cover large bit-width multipliers and floating-point multipliers.
