Abstract-In this paper, a 32×32-bit low-power multi-precision multiplier is described, in which each building block can either work as an independent smaller-precision multiplier or work in parallel with the others to perform higher-precision operations. The proposed multi-precision multiplier enables voltage and frequency scaling for low-power operation while still maintaining full throughput. According to the user's throughput requirements, the dynamic voltage and frequency scaling circuits autonomously configure the multiplier to operate at the lowest possible voltage and frequency, and therefore at the lowest power consumption. By carrying out optimizations at the algorithmic and architectural levels, we have removed the silicon area and power overheads that are typically associated with reconfigurability features. The 32×32-bit low-power multi-precision multiplier has been implemented in TSMC 0.18 µm technology. Compared with fixed-width multipliers, the proposed design features around 13.8% and 30% reduction in circuit area and power, respectively. The multi-precision processing featured in this paper also enables voltage and frequency scaling, resulting in up to 68% reduction in power consumption.
I. INTRODUCTION
Recent works have devoted significant effort to decreasing the power consumption of multipliers so that they can be integrated into battery-powered portable systems. However, in most full-custom, DSP and FPGA implementations, the multiplier is designed for a fixed maximum wordlength to suit the worst-case scenario, whereas the effective wordlengths of an application vary dramatically. Using an unsuitable wordlength may cause performance degradation or inefficient use of hardware resources. In addition, minimizing the multiplier power budget requires estimating the optimal operating point, including clock frequency, supply voltage, and threshold voltage [1]. In most VLSI system designs, the supply voltage is also selected based on the worst-case scenario. To achieve an optimal power/performance ratio, a variable-precision datapath is needed to cater for various types of applications. Dynamic Voltage Scaling (DVS) can then be used to match the circuit's actual workload and further reduce the power consumption.
Several works have investigated wordlength and supply voltage optimizations. Oliver A. and Hans-Jorg [2] proposed a reconfigurable multiplier that can be partitioned into several separate, fully functional small multipliers or used as one large multiplier. Wei and Yvon [3] proposed a variable-precision multiplier that supports precisions ranging from 9 to 15 bits together with dynamic voltage scaling. Vasily and Tomoyuki [4] proposed a multiplier architecture that examines the k MSBs of the operands to decide whether the entire multiplier is used, and accordingly selects between two different supply voltages. In these multipliers, the reconfigurability function results in: (i) non-negligible silicon area and power overhead; (ii) reduced performance and throughput caused by shutting down parts of the circuit and/or using a reduced supply voltage; and (iii) restrictions and large margins that limit the range of operating conditions the multiplier can adapt to.
Dynamic voltage scaling (DVS) saves energy by scaling down the supply voltage when the processor is not fully loaded. However, finding at run time the lowest voltage that still meets the speed goal is difficult. The works mentioned above determined the minimal voltage value through offline pre-simulations. Different tasks, however, have different speed and power requirements, and a user can rarely take a single existing set of power-speed modes and use it in every application. In this paper, we propose an automatic calibration circuit to solve this problem. First, the operating frequency of the multiplier is determined according to the user's throughput requirement. The circuit then starts running at a suitable initial voltage. When successive errors occur, the automatic calibration circuit raises the voltage; when successive correct results occur, it lowers the voltage.
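The calibration policy described above can be sketched in software as follows. This is only a behavioural illustration, not the circuit itself: the class name `VoltageCalibrator`, the streak length, the voltage step and the voltage bounds are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class VoltageCalibrator:
    """Toy model of the calibration policy; every numeric default is an assumption."""
    voltage: float = 1.8      # starting supply voltage in volts (assumed)
    v_min: float = 0.9        # lowest allowed voltage (assumed)
    v_max: float = 1.8        # highest allowed voltage (assumed)
    step: float = 0.1         # voltage change per adjustment (assumed)
    streak_needed: int = 3    # successive events required before adjusting (assumed)
    _errors: int = 0
    _successes: int = 0

    def observe(self, result_correct: bool) -> float:
        """Record one multiplication outcome and return the updated voltage."""
        if result_correct:
            self._successes += 1
            self._errors = 0
            if self._successes >= self.streak_needed:
                # Successive correct results: try a lower voltage.
                self.voltage = max(self.v_min, round(self.voltage - self.step, 3))
                self._successes = 0
        else:
            self._errors += 1
            self._successes = 0
            if self._errors >= self.streak_needed:
                # Successive errors: raise the voltage.
                self.voltage = min(self.v_max, round(self.voltage + self.step, 3))
                self._errors = 0
        return self.voltage
```

With these assumed defaults, three correct results in a row lower the voltage by one step, and three erroneous results in a row raise it again.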
The rest of the paper is organized as follows. Section II describes the architecture of our 32bit×32bit multi-precision multiplier. Section III illustrates the dynamic power and speed scaling management strategy. Section IV evaluates the performance of different multiplier topologies and shows how the overhead resulting from multi-precision reconfigurability is removed. Section V gives the results and discussion. Finally, conclusions are drawn in Section VI.
II. SYSTEM OVERVIEW
The architecture of our multi-precision multiplier system is shown in Fig. 1. The 32bit×32bit multiplier is composed of 9 sub-blocks that can work separately as 9 signed/unsigned 8bit×8bit multipliers, or be configured to work together as 3 separate 16bit×16bit multipliers or as one 32bit×32bit multiplier. When the full precision (32bit×32bit) is not exercised, the supply voltage and the clock frequency are scaled down according to the shorter latency restriction and the actual workload to save power. The resulting reduction in computation speed is overcome by operating the sub-blocks in parallel to maintain the throughput.
Fig. 1. Architecture of the multi-precision multiplier system (recovered figure labels: "Computation results", "User defined throughput", "Dynamic").
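To make the throughput argument concrete, the sketch below computes the clock frequency each configuration would need to sustain a given number of multiplications per second. The numbers of parallel multipliers follow the configurations listed in this section; the assumption that every active multiplier delivers one product per clock cycle, the function name and the example throughput are illustrative and not taken from the design.

```python
# Number of multipliers operating in parallel in each precision mode,
# as listed in the system overview (operand width in bits -> multipliers).
PARALLEL_UNITS = {8: 9, 16: 3, 32: 1}


def required_clock_hz(precision_bits: int, throughput_ops_per_s: float) -> float:
    """Clock frequency needed to sustain the throughput in a given mode.

    Assumes each active multiplier delivers one product per clock cycle
    (an illustrative assumption, not a property stated for the chip).
    """
    units = PARALLEL_UNITS[precision_bits]
    return throughput_ops_per_s / units
```

Under this simplified model, a hypothetical target of 90 million multiplications per second needs a 90 MHz clock in 32-bit mode but only 30 MHz in 16-bit mode and 10 MHz in 8-bit mode, which is the headroom that voltage and frequency scaling can exploit.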
III. DYNAMIC POWER AND SPEED SCALING MANAGEMENT
In our architecture, voltage dithering is used to provide near-optimum dynamic voltage scaling with much less overhead [5]. Voltage dithering, proposed as a low-overhead implementation of DVS, uses a few power switches that toggle between a small number of voltage levels for different fractions of time to achieve an intermediate average voltage. Our test chip uses two PMOS header switches, as shown in Fig. 2. By tuning the on/off times of the two complementary switches, the dithered voltage can be set to be equivalent to the required value. The savings are only achievable if the voltage can change on the same time scale as the workload. Numerous methods for optimizing the headers are available, and most of them are designed to ensure that the delay penalty of the circuit never exceeds 10%. Fig. 3 shows the timing overhead of our dithering circuit. The dithered output settles fully within just one dithering cycle (0.01 µs), which is much shorter than an execution cycle of the multiplier.
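As a first-order illustration of how the on/off times map to an average voltage, the sketch below assumes ideal switches and two supply rails, so that the dithered average is the time-weighted mean of the two levels. The function names and the rail values used in the example are hypothetical, and the model ignores settling time and load effects.

```python
def dither_duty_cycle(v_target: float, v_low: float, v_high: float) -> float:
    """Fraction of each dithering period spent on the high rail.

    Ideal two-level model: v_avg = d * v_high + (1 - d) * v_low.
    """
    if not v_low < v_high:
        raise ValueError("v_low must be below v_high")
    if not v_low <= v_target <= v_high:
        raise ValueError("target must lie between the two rails")
    return (v_target - v_low) / (v_high - v_low)


def average_voltage(duty: float, v_low: float, v_high: float) -> float:
    """Time-averaged output voltage for a given high-rail duty cycle."""
    return duty * v_high + (1.0 - duty) * v_low
```

For example, with assumed rails of 1.2 V and 1.8 V, a 1.5 V average corresponds to a duty cycle of 0.5 in this model.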
IV. LOW POWER MULTIPLIER IMPLEMENTATION
A. Choice of Booth Encoding Algorithm
Booth encoding is used to increase the multiplier's speed by reducing the number of partial products to be added. In a radix-2 multiplier the partial products are easy to generate but difficult to compress, while radix-8 is just the opposite. Given the scale of our design, radix-4 encoding is the best choice for a low-power implementation. Simulation results show that the radix-2 structure is 13% and 8% worse than the radix-4 structure in terms of area and power, respectively (see Table I).
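The reduction in partial products can be illustrated with a small software model of standard radix-4 Booth recoding. The sketch handles signed two's-complement multipliers only, and the function names are illustrative; it is not a description of the encoder circuit used on the chip.

```python
def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Radix-4 Booth digits (each in -2..2) of a signed `bits`-wide multiplier.

    Digits are returned least significant first; `bits` must be even.
    """
    if bits % 2:
        raise ValueError("bits must be even")
    if not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("y does not fit in the given width")
    u = y & ((1 << bits) - 1)       # two's-complement bit pattern of y
    padded = u << 1                 # append the implicit bit y[-1] = 0
    digits = []
    for i in range(0, bits, 2):
        triplet = (padded >> i) & 0b111          # bits y[i+1], y[i], y[i-1]
        b_m1 = triplet & 1
        b0 = (triplet >> 1) & 1
        b1 = (triplet >> 2) & 1
        digits.append(b_m1 + b0 - 2 * b1)        # standard radix-4 recoding
    return digits


def booth_multiply(x: int, y: int, bits: int = 8) -> int:
    """Multiply by summing one partial product per Booth digit."""
    return sum(d * x * 4**k for k, d in enumerate(booth_radix4_digits(y, bits)))
```

For an 8-bit multiplier this yields four digits, and hence four partial products instead of eight; for instance, `booth_radix4_digits(7, 8)` returns `[-1, 2, 0, 0]`, which encodes 7 as 2·4 − 1.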
B. Choice of Partial Products Compression Topology
High-speed parallel multipliers are usually implemented either as array multipliers or as tree multipliers. As interconnect power dissipation becomes more and more dominant, it becomes difficult to choose between a regular, long-latency structure and an irregular, short-latency structure. To resolve this choice, we compared a CSA array multiplier with a Wallace tree multiplier. Simulation results reported in Table II show that the Wallace tree scheme is advantageous in terms of computation speed, hardware complexity and power consumption.
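The latency difference between the two structures can be illustrated by counting carry-save levels. The sketch below uses the textbook reduction rules (a Wallace tree compresses every group of three rows into two per level, while a linear array absorbs one additional row per level) and counts logic levels only, ignoring wiring and the final carry-propagate adder; the function names are illustrative.

```python
def wallace_levels(rows: int) -> int:
    """Number of 3:2 compressor levels needed to reduce `rows` operands to two."""
    levels = 0
    while rows > 2:
        # Each full group of three rows becomes two; leftover rows pass through.
        rows = 2 * (rows // 3) + rows % 3
        levels += 1
    return levels


def array_levels(rows: int) -> int:
    """Number of carry-save adder rows in a linear array reducing `rows` operands to two."""
    return max(rows - 2, 0)
```

For 16 rows of partial products, the tree needs 6 levels while the linear array needs 14, so the gap in logic depth widens as the number of rows grows.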
C. Choice of Partial Products Compression Array Type
When performing the compression, a left-to-right structure can significantly decrease power dissipation by reducing the amount of glitching in the left-hand side of the array and in the compression array as a whole (see Fig. 6). To reduce the dominant spurious transitions in the compression array, it is desirable to assign signals with a high switching probability to circuits with a short logic depth. The MSBs are more likely to be encoded as all 0s. This implies that the partial products of the MSBs should be added first in the tree array to reduce circuit switching. As Table III shows, this scheme achieves 8% and 2% reductions in power and area, respectively, without any extra operation.
The final design step combines the two's complement representation, radix-4 Booth encoding algorithm, left-to-right array scheme, and Wallace tree compression structure to build the 8bit×8bit multiplier building block. The next step would be an evaluation of the configuration or scaling overhead. 
D. Reconfigurability Overhead Evaluation
We define the 2n-bit wide multiplicand and multiplier as $X$ and $Y$, respectively, and denote their n-bit most significant and least significant halves by $X_H$, $X_L$, $Y_H$ and $Y_L$. The product of $X$ and $Y$ is expressed as:

$$X \cdot Y = X_H Y_H \, 2^{2n} + (X_H Y_L + X_L Y_H) \, 2^{n} + X_L Y_L \tag{1}$$

However, if we define [6]:

$$X' = X_H + X_L, \qquad Y' = Y_H + Y_L \tag{2}$$

equation (1) can be rewritten as:

$$X \cdot Y = X_H Y_H \, 2^{2n} + (X' Y' - X_H Y_H - X_L Y_L) \, 2^{n} + X_L Y_L \tag{3}$$

Comparing equations (1) and (3), one can note that one n-bit×n-bit multiplier and one 2n-bit adder can be removed and replaced by two n-bit adders and two (2n+2)-bit subtractors. For the 32-bit multiplier case, if we consider the complexity of a 32-bit adder and two 16-bit adders to be roughly the same, this enables us to use two 34-bit subtractors to replace a 16bit×16bit multiplier; the complexity reduction is obvious.
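The rewritten form of the product can be checked numerically. The sketch below, for unsigned operands only, splits each 2n-bit operand into halves, forms the two half-sums with n-bit additions, and recombines three smaller products using two subtractions. The function and variable names are illustrative, and Python integers stand in for the hardware datapath widths noted in the comments.

```python
def split_multiply(x: int, y: int, n: int = 16) -> int:
    """Multiply two unsigned 2n-bit operands using three smaller products."""
    mask = (1 << n) - 1
    x_h, x_l = x >> n, x & mask
    y_h, y_l = y >> n, y & mask

    high = x_h * y_h            # n x n product
    low = x_l * y_l             # n x n product
    x_sum = x_h + x_l           # n-bit addition, (n+1)-bit result
    y_sum = y_h + y_l           # n-bit addition, (n+1)-bit result
    cross = x_sum * y_sum       # product of the half-sums, up to 2n+2 bits

    middle = cross - high - low  # two (2n+2)-bit subtractions
    return (high << (2 * n)) + (middle << n) + low
```

With the default `n = 16`, `split_multiply(x, y)` agrees with `x * y` for any pair of unsigned 32-bit operands, while using three multiplications instead of the four implied by the direct expansion.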
Simulation results show that the proposed architecture achieves reductions of 13.8% in power and 30% in area compared with the fixed-width multiplier design, whereas a traditional multi-precision scheme incurs overheads of 21.5% and 36.8% in area and power, respectively (see Table IV). Further power savings can be achieved when the input signal has a relatively small magnitude.
V. RESULTS AND DISCUSSION
Because the supply voltage produced by our scaling approach is always dithering, its average value is not equivalent to a static voltage of the same value. To better align the duty cycle to the static voltage value, we

In prior work, flexibility and reconfigurability have been associated with increased silicon area and power consumption. In this paper, we have removed both the silicon area and the power penalties. The layout of the multiplier is shown in Fig. 8. The critical path of our multiplier is proportional to the actual operating wordlength. With the shortened critical path, both the supply voltage and the clock frequency can be reduced to save power. Simulation results given in Table V indicate that a power reduction of around 40-68% can be achieved under lower supply voltage and clock frequency. The implemented dynamic voltage/frequency adjustment scheme enables real-time processing at a given throughput while saving more energy than static supply voltage scheduling.
VI. CONCLUSION
Variable-latency functional units using adaptive operation precision allow aggressive supply voltage and clock frequency scaling for improved power efficiency with no performance penalty. In this paper, we proposed a multi-precision multiplier that combines variable-precision processing with scaled supply voltage and clock frequency to efficiently reduce the circuit's power consumption. Various algorithms and topologies were explored to obtain high performance while incurring no silicon area or power overhead. Reported results show that our variable-precision multiplier enables a 30% reduction in silicon area and a 13.8% reduction in power dissipation compared with fixed-precision multipliers of the same size. When operating at different precisions, it can further reduce power by around 40-68%. Our multi-precision multiplier is therefore attractive for a wide range of general-purpose low-power applications.
