An energy-efficient, power-aware design is highly desirable for DSP functions that encounter a wide diversity of operating scenarios in battery-powered wireless sensor network systems. Addressing this issue, this letter presents a low-power, power-aware, scalable pipelined Booth multiplier for DSP applications that makes use of a dynamic-range detection unit, shared common functional units, an ensemble of optimized Wallace-trees and a 4-bit array-based adder-tree.
Introduction
Energy-efficient digital signal processing (DSP) modules are becoming increasingly important in wireless sensor networks, where tens to thousands of battery-operated microsensor nodes are deployed remotely and used to relay sensing data to the end-user [1]-[4]. Given the constantly changing environments of portable devices and the extreme constraints placed on their battery lifetimes, power-aware design considerations need to be taken into account [2]. A power-aware DSP module is able to adapt its energy consumption as the energy resources of the system diminish or as performance requirements change. It is therefore advantageous to design power-aware DSP modules with power scalability hooks, such as variable bit-precision and variable memory size, so that they can be used for a variety of scenarios and changing operating conditions at each individual sensor node. DSP functions make extensive use of the multiply-and-accumulate (MAC) operation, which makes multiplication the most power-consuming task. It is therefore essential to implement power-efficient multipliers for power-aware DSP modules. The development of multipliers with a short critical path and low power consumption has been the topic of extensive recent investigation [5]-[7]. These multipliers are typically designed for a fixed maximum operand size, such as 16 bits per input. In practical sensor network applications, the actual inputs often have a small range, although the datapath hardware is designed to accommodate the maximum possible precision. For example, an 8-bit multiplication on a 16-bit multiplier can lead to serious power inefficiencies due to unnecessary signal switching. One proposed method of designing a power-aware system is the ensemble of point solutions method [2]. This method consists of multiple point solutions, each optimized for a particular scenario, with the ensemble chosen so as to cover the entire scenario space.
Studies [1], [2] indicate that the ensemble of point systems could significantly enhance the power-awareness of digital architectures at a modest cost in chip area. On a series of operands typical of speech processing applications, the ensemble multiplier consumes 57% less power than the monolithic multiplier [2], [3]. Although this method reduces the power consumption significantly, it increases the area cost. It is thus highly desirable to use common functional blocks that can be shared and reused over a range of scenarios. In this paper, we present a novel power-aware scalable Booth multiplier, which detects the dynamic range of the input operands and performs 16-bit, 8-bit or 4-bit multiplication operations accordingly.
Power-Aware Scalable Pipelined Booth Multiplier
The proposed reconfigurable Booth multiplier consists of a dynamic-range detection unit, a shared radix-4 Booth encoder, a shared configurable partial product generation unit, 16-bit and 8-bit optimized Wallace-trees for partial product summation, a 4-bit array-based adder-tree and a shared final carry-lookahead adder, as shown in Fig. 1. The dynamic-range detection unit detects the effective dynamic ranges of the input data, appropriately selects the multiplier and multiplicand operands, and then generates control signals that deactivate parts of the Wallace-trees and the array-based adder-tree to match the desired run-time data precision. Depending on the number of multiplier bits, the Booth encoder and partial product generator adjust the number of partial products generated, while keeping the unused partial product generator sections in the static condition. The control and enable signals generated by the dynamic-range detection unit select either a Wallace-tree or the 4-bit array-based adder-tree for a given multiplication operation. Thus, the components of the unused circuits are able to maintain their static state. In the static state, the previous values are held, so that no switching occurs in the unused part of the structure. To take advantage of short precision, signal gating selectively deactivates those parts of the Wallace-trees and the 4-bit array-based adder-tree that are not currently in use, which makes the proposed Booth multiplier behave like a variable-size multiplier. In the final stage, the product is generated from the outputs of the active Wallace-tree and the carry-lookahead adder. Thus, the Booth encoder, partial product generator and carry-lookahead adder can be shared and reused to process 16-bit or 8-bit multiplications. For 4-bit multiplications, the Booth encoder and partial product generator units are unused and deactivated.
The proposed Booth multiplier is implemented with five pipeline stages, with an input enable signal being used as a power-down switch for each stage.
Pipeline Gating Techniques
The latch-based clock gating technique used in each of the five pipeline stages enables the multiplier circuit to deactivate the unused part of the logic at runtime, thereby avoiding excessive power consumption in each multiplication operation. The Synopsys Power Compiler was utilized to generate the latch-based clock gating circuit [8]. The top-level pipelined configuration of the multiplier is shown in Fig. 2. Each pipeline stage in the proposed multiplier is optimized in order to minimize the amount of switching that occurs in the logic between the pipeline stages, as well as within the pipeline registers. The enable signal, "ENABLE," indicates the validity of the input operands for the multiplication operation. Thus, the first-stage pipeline registers utilize a 16-bit clock-gating D-FF, with the enable signal used as the control signal for the clock gating circuit. The pipeline registers are divided into two types, namely Type-I and Type-II. The only difference between them is an extra control unit, which is employed in the Type-II pipeline register units to eliminate any dependencies between the three multiplication modes of the proposed multiplier. The 3-bit control signal, "CTL BUS," which is the input of the control unit, is generated by the dynamic-range detection unit from the dynamic range of the input operands: the most significant bit indicates a 16-bit operand, the second bit an 8-bit operand and the least significant bit a 4-bit operand. The second Type-II pipeline register stage is partitioned into three configuration states: the most significant 8 bits of the input operand are selected only if the input range is greater than 8 bits, the middle 4 bits are selected if the input range is 16 bits or 8 bits, and the least significant 4 bits are selected if the input range is 16 bits, 8 bits or 4 bits.
All of these control signals for each gated pipeline register are generated if the enable signal is asserted high. The third pipeline stage after the Booth encoder utilizes a Type-II pipeline register. As a function of the "CTL BUS" signal, 2-way partitioning is done to generate the control signals for the eight partial product generator units. The first four partial product generators are split into two modes (16-bit and 8-bit modes), whereas the last four partial product generators are only activated when 16-bit operands are detected. The fourth pipeline stage utilizes a Type-I pipeline register, in which the appropriate partial products and the complement values, "COMP," are distributed to the 16-bit and 8-bit Wallace-tree structures. In the case of the 4-bit unsigned multiplier, the least significant 4-bits of both the multiplier and multiplicand are used. Similarly, the fifth pipeline stage utilizes the Type-I pipeline register before the final addition.
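The partitioning rules above can be summarized in a small behavioral sketch. It is an illustration only, not the gate-level control unit: the function name, the dictionary keys and the one-hot encoding of "CTL BUS" (0b100 for 16-bit, 0b010 for 8-bit, 0b001 for 4-bit) are assumptions made for this example.

```python
def register_enables(ctl_bus: int, enable: bool) -> dict:
    """Derive gating enables from the 3-bit CTL BUS (hypothetical model).

    Assumed one-hot encoding: bit 2 = 16-bit mode, bit 1 = 8-bit mode,
    bit 0 = 4-bit mode. No enable is generated unless ENABLE is high.
    """
    is16 = bool(ctl_bus & 0b100)
    is8 = bool(ctl_bus & 0b010)
    is4 = bool(ctl_bus & 0b001)
    return {
        # Operand register partitions
        "upper8": enable and is16,                    # range greater than 8 bits
        "middle4": enable and (is16 or is8),          # 16-bit or 8-bit range
        "lower4": enable and (is16 or is8 or is4),    # any non-zero range
        # Partial product generator groups
        "pp_0_3": enable and (is16 or is8),           # shared by 16-bit and 8-bit modes
        "pp_4_7": enable and is16,                    # 16-bit mode only
    }
```

For instance, in the assumed 8-bit mode (`ctl_bus = 0b010`) only the middle and lower operand partitions and the first four partial product generators are clocked; everything else holds its previous value.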
Dynamic Range Detection Unit
The dynamic-range detection unit detects the effective dynamic ranges of the input data, and then generates the control signals that are used to deactivate the appropriate parts of the Wallace-trees and the array-based adder-tree, so as to match the run-time data precision. In the proposed multiplier, the control signals not only select the data flows, but also manipulate the pipeline register units, in order to maintain the non-effective bits in their previous states and thus ensure that the functional units addressed by these data do not consume switching power. Additionally, these control signals are used to control the Wallace-trees and the array-based adder-tree. Figure 3 shows the functional blocks of the dynamic-range detection unit, which include one-detect circuits (OR gates), a comparator, multiplexers and logic gates. Each one-detect circuit is an OR gate, which asserts its output high if any of its inputs is high. Both the 16-bit wide multiplier and multiplicand input operands are sectioned into three parts, and the detection is done for the 16-bit, 8-bit and 4-bit ranges. The proposed multiplier supports three multiplication modes: 16-bit, 8-bit and 4-bit. Both the 16-bit and 8-bit multiplication modes are implemented using Booth multipliers with a shared Booth encoder, partial product generator and final adder. Because the Booth multiplier operates on signed operands, 8-bit values greater than 7F (hex) utilize the 16-bit multiplication mode. This problem is avoided in the 4-bit multiplication mode by using a 4-bit unsigned array-based adder-tree. The outputs of the three one-detect circuits of the multiplier and of the multiplicand are each grouped into a 3-bit bus, where the most significant bit represents the 16-bit multiplication mode, the second bit represents the 8-bit multiplication mode and the least significant bit represents the 4-bit multiplication mode. The two 3-bit buses are compared using a 3-bit comparator.
This comparison detects if the input multiplicand operand is greater than the input multiplier operand. The output of the comparator is used to switch the input multiplicand and multiplier operands.
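A behavioral sketch of this detection and swap is given below. It is a simplified model under stated assumptions rather than the circuit of Fig. 3: the section boundaries (bits 15..7, 6..4 and 3..0, chosen so that values above 7F hex fall into the 16-bit range), the priority one-hot bus encoding, and the names `range_bus` and `detect` are all hypothetical. The zero case follows the multiply-by-zero bypass described for the detection unit.

```python
def range_bus(x: int) -> int:
    """Return an assumed one-hot 3-bit range code for a 16-bit pattern.

    0b100 = 16-bit range, 0b010 = 8-bit range, 0b001 = 4-bit range,
    0b000 = operand is all zeros. Each test models a one-detect (OR) circuit.
    """
    x &= 0xFFFF
    if x >> 7:      # OR of bits 15..7: value exceeds 7F hex
        return 0b100
    if x >> 4:      # OR of bits 6..4
        return 0b010
    if x:           # OR of bits 3..0
        return 0b001
    return 0b000


def detect(multiplier: int, multiplicand: int):
    """Return (mode bus, multiplier, multiplicand) after the comparator swap."""
    m_bus, c_bus = range_bus(multiplier), range_bus(multiplicand)
    if m_bus == 0 or c_bus == 0:
        return 0b000, multiplier, multiplicand   # multiply-by-zero: all modes off
    if c_bus > m_bus:
        # Comparator output: multiplicand range exceeds multiplier range, so swap
        multiplier, multiplicand = multiplicand, multiplier
        m_bus = c_bus
    return m_bus, multiplier, multiplicand
```

In this model the mode is taken from the wider of the two operands after the swap, which is an interpretation of the text rather than a documented detail.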
Booth Encoder and Partial Product Generator
The radix-4 Booth encoder can generate five possible values, namely −2, −1, 0, 1 and 2 times the input datum [5], [6]. Three control signals, COMP_j, SHIFT_j and ZERO_j (j = 0, ..., 7), are generated according to the 3-bit recoding scheme shown in the Booth-recoding table [8]. The Booth encoder generates these control signals, which are used in the partial product generation unit to direct the appropriate operation (OP_i, i = 0, ..., 7) on the input multiplicand operand. Figure 4 shows the partial product generator unit, which is made configurable so that it can be shared between the 16-bit and 8-bit multiplication modes. The ZERO signal forces the output of that partial product stage to zero, the COMP signal inverts the input multiplicand operand, and the SHIFT signal shifts the input multiplicand operand left by one. The total number of partial products (PP) generated is N/2, where N is the maximum number of multiplier bits.
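As an illustration of the recoding, the following sketch models standard radix-4 Booth encoding and the ZERO, SHIFT and COMP operations in software. The exact signal encoding is an assumption (COMP for a negative digit, SHIFT for a digit of magnitude two, ZERO for a zero digit), as are the function names; the "+1" that completes the two's complement is modeled as the COMP bit added at the weight of its partial product.

```python
def booth_radix4(multiplier: int, n_bits: int = 16):
    """Return one (COMP, SHIFT, ZERO) tuple per radix-4 digit (n_bits / 2 digits)."""
    y = multiplier & ((1 << n_bits) - 1)   # two's complement bit pattern
    signals = []
    prev = 0                               # implicit bit y[-1] = 0
    for j in range(n_bits // 2):
        b0 = (y >> (2 * j)) & 1
        b1 = (y >> (2 * j + 1)) & 1
        digit = -2 * b1 + b0 + prev        # one of -2, -1, 0, 1, 2
        signals.append((digit < 0, abs(digit) == 2, digit == 0))
        prev = b1
    return signals


def booth_multiply(multiplicand: int, multiplier: int, n_bits: int = 16) -> int:
    """Sum the partial products selected by the Booth control signals."""
    total = 0
    for j, (comp, shift, zero) in enumerate(booth_radix4(multiplier, n_bits)):
        if zero:
            pp = 0                         # ZERO: output zeros
        else:
            pp = multiplicand << 1 if shift else multiplicand   # SHIFT: times two
            if comp:
                pp = ~pp                   # COMP: bitwise inversion
        # The COMP bit supplies the +1 of the two's complement
        total += (pp + int(comp)) << (2 * j)
    return total
```

With `n_bits=16` the encoder yields eight digit positions and with `n_bits=8` it yields four, matching N/2 partial products in each Booth mode.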
16-Bit and 8-Bit Wallace-Trees
The partial product summation is done in an optimized Wallace-tree structure. Each column in the partial product summation has a different number of operands to be added. The easiest way to implement this addition is to use array adders, but their critical path delay is long, being proportional to O(N). Therefore, to minimize the critical path delay and hardware size, we chose to implement the partial product summation using an optimized Wallace-tree structure.
In each column stage, equally weighted partial products and COMP inputs are added using (k:2) compressors, where k is the number of inputs. Each compressor uses (3:2) carry-save adders (CSA) as its basic building block; a CSA adds three equally weighted inputs and generates two equally weighted outputs. In the 16-bit Wallace-tree, a (9:2) compressor is used to add these inputs. Thus, at most four levels of CSAs are required for the (9:2) compressor. The output of this tree structure is two 32-bit numbers, which are then added using a carry-lookahead adder. A similar optimized Wallace-tree is also created for the summation of the partial products generated in the 8-bit multiplication mode. The Wallace-tree is the only structure in the 16-bit and 8-bit multipliers that is not shared. The 8-bit Wallace-tree has a maximum of five equally weighted inputs, which are four partial products and the COMP input from the 4th partial product generator. The output of this tree structure is two 16-bit numbers, which are then added using the final carry-lookahead adder.
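The level count quoted for the (9:2) compressor can be checked with a short sketch. The model below is a simplification that reduces a single column of k inputs with (3:2) CSAs and ignores carries arriving from neighboring columns; the function names are chosen for this example only.

```python
def csa(a: int, b: int, c: int):
    """(3:2) carry-save adder on non-negative integers: a + b + c == s + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry


def csa_levels(k: int) -> int:
    """Number of (3:2) CSA levels needed to reduce k operands to 2."""
    levels = 0
    while k > 2:
        # Every full group of three operands becomes two; leftovers pass through
        k = 2 * (k // 3) + k % 3
        levels += 1
    return levels
```

Under this model nine inputs reduce as 9, 6, 4, 3, 2, that is, four levels, while the five inputs of the 8-bit tree reduce as 5, 4, 3, 2, that is, three levels.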
Final Carry-Lookahead Adder
The final carry-lookahead adder (CLA) is shared between the 16-bit and 8-bit Wallace-trees. Two 16-bit CLAs are used, and the "CTL BUS" control signals indicating the multiplication mode determine which of these structures are selected for the final addition. If the 8-bit multiplication mode is utilized, then only the rightmost (least significant) 16-bit CLA is used; if the 16-bit multiplication mode is utilized, then both CLA structures are utilized, acting as a single 32-bit CLA structure. The 32-bit output of the 16-bit Wallace-tree is distributed between the two adder structures. The least significant 16-bit CLA structure gets its input from the output of a multiplexer that chooses the 16 bits from the 8-bit Wallace-tree during the 8-bit multiplication mode and the least significant 16 bits of the 32 bits from the 16-bit Wallace-tree during the 16-bit multiplication mode. Both CLA circuits keep all their internal values static during the 4-bit multiplication mode. However, in order to fully exploit the power efficiency of this structure, it is important to join the two CLA structures using a carry control unit. The carry control unit enables the activation and deactivation of the carry-out from the least significant 16-bit CLA structure, as shown in Fig. 5. During the 16-bit multiplication mode, the carry-out from the least significant CLA structure is passed to the carry-in of the most significant 16-bit CLA. During the 8-bit multiplication mode, this carry is disabled, because a carry-out generated by the least significant 16-bit CLA could cause the internal signals of the most significant 16-bit CLA to change unnecessarily. When either the input multiplicand or the multiplier operand is all zeros, the detection unit disables all of the multiplication modes. All of the logic is therefore bypassed during the multiply-by-zero mode to avoid unnecessary toggling. The final multiplexer makes the final selection, choosing from the 16-bit, 8-bit, 4-bit or "multiply-by-zero" mode to yield the final product.
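The role of the carry control unit can be illustrated with a minimal behavioral sketch. It models only the arithmetic of two 16-bit adders joined by a gated carry, not the lookahead logic or the held static state of the unused adder; the function name and the integer `mode` argument are assumptions for this example.

```python
MASK16 = 0xFFFF


def final_add(a: int, b: int, mode: int) -> int:
    """Add the two Wallace-tree outputs using two 16-bit adders.

    mode == 16: the low carry-out feeds the high adder (one 32-bit addition).
    mode == 8:  only the low adder is used and its carry-out is gated off.
    """
    lo = (a & MASK16) + (b & MASK16)
    if mode == 8:
        return lo & MASK16                 # carry disabled, high adder idle
    carry = lo >> 16                       # carry-out of the low 16-bit adder
    hi = ((a >> 16) & MASK16) + ((b >> 16) & MASK16) + carry
    return ((hi & MASK16) << 16) | (lo & MASK16)
```

In the 8-bit case the carry that would otherwise ripple into the upper adder is simply dropped, which mirrors the reason given above for gating it: the upper half never sees a changing input.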
Performance Measurements
The proposed scalable pipelined Booth multiplier was first modeled in Verilog HDL and functionally verified using the ModelSim simulator. Verilog HDL simulations were conducted using uniformly distributed random input test vectors with a supply voltage of 1.2 V under nominal conditions. Following its functional validation, the architecture was synthesized for appropriate time/area constraints in TSMC 0.13-µm CMOS technology using the Synopsys Design Compiler. Table 1 shows the comparison of hardware complexity and power consumption for different Booth multipliers at different input precisions (16-bit, 8-bit, 4-bit). The power analysis of the gate-level structure of the Booth multipliers was conducted using the Synopsys PrimePower tool. For the power comparison, 5-stage pipelined Booth multipliers were used to obtain almost the same clock frequency, and the same number of random input test vectors was applied for the 16-bit, 8-bit and 4-bit cases to measure the power consumption with PrimePower. For the 8-bit and 4-bit computations, the proposed scalable Booth multiplier affords reductions in power consumption of 29% and 58%, respectively, in comparison with a non-scalable multiplier. However, for the 16-bit computations, the power consumption of the proposed Booth multiplier is higher than that of the non-scalable Booth multiplier, due to the overhead logic used to render it power-aware. Globally, however, the proposed Booth multiplier is 20% more power-efficient than the non-scalable Booth multiplier, as well as offering high speed.
Conclusion
In this paper, we propose a novel power-aware scalable Booth multiplier designed to provide low power consumption in highly changing environments. Shared common functional units, an ensemble of optimized Wallace-trees and a pipelining technique were used to design the proposed power-aware Booth multiplier. Power savings relative to the non-scalable Booth multiplier have been observed in preliminary experiments. As a result, the proposed Booth multiplier has better power-scalability characteristics, and is thus more globally power-efficient than other non-scalable Booth multipliers over a variety of scenarios. The proposed multiplier architecture can be applied to other signal processing computations for applications with a variety of scenarios.
