Introduction
Multiplication is a fundamental operation in today's electronic systems. The need for high-performance, low-power multipliers is ever increasing for 3-D multimedia applications, digital image processing, and digital signal processing (DSP), where the multiplier forms part of the multiply-accumulate (MAC) unit. The reason is that multipliers contribute considerably to overall system power and performance. Multipliers are used extensively in FIR and IIR filter design in the DSP domain. A large multiplier (for example, 128 bits) may consume considerable power and chip area and may become a bottleneck for system performance.
Substantial research has been performed on optimizing the partial product generation logic, as it is the very first step in the multiplication algorithm [1], [2]. The use of Booth Encoding and Modified Booth Encoding (MBE) reduces the number of partial products to almost half [3]. Other design choices include the use of an array-based or tree-based topology. Array multipliers are compact and low power but are slower than tree multipliers. Tree multipliers themselves are classified as binary tree, balanced delay tree, overturned-staircase tree and Wallace tree. The Wallace tree is the most commonly used tree topology due to its simplicity. Some work has also been done on reducing spurious switching in multipliers by the use of latches [4]. The effects of wiring on multiplier delay are discussed in [5].
This paper concentrates on the optimization of the partial product generation logic as proposed in [2]. We have adopted and implemented that technique to form a new architecture for a 32-bit multiplier, and we show with the help of simulations that this architecture is low power and area efficient compared to an optimized conventional architecture.
The basic multiplication algorithm is shift-and-add. Each bit of the multiplier is multiplied with all bits of the multiplicand, resulting in a set of partial products. These partial products are then added using the shift-and-add algorithm. For simplicity, the multiplication of two N-bit numbers X and Y, resulting in a product P that is 2N bits long, can be described by equation (1).
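The shift-and-add idea can be illustrated with a minimal software sketch for unsigned operands. The function name and parameter names are illustrative assumptions and are not taken from the paper; the loop models the arithmetic only, not the hardware structure.

```python
def shift_and_add_multiply(multiplier: int, multiplicand: int, n: int) -> int:
    """Multiply two unsigned n-bit integers by summing shifted partial products."""
    limit = 1 << n
    if not (0 <= multiplier < limit and 0 <= multiplicand < limit):
        raise ValueError("operands must be unsigned n-bit values")
    product = 0
    for i in range(n):
        multiplier_bit = (multiplier >> i) & 1
        # ANDing one multiplier bit with every multiplicand bit yields
        # either the multiplicand itself or an all-zero row.
        partial_product = multiplicand if multiplier_bit else 0
        # Each row is weighted by 2**i (the "shift") before it is added.
        product += partial_product << i
    return product  # fits in 2n bits
```

With n = 4, multiplying 13 by 11 accumulates the rows for multiplier bits 0, 2 and 3 and gives 143.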
The partial products are produced by ANDing each bit of the multiplier with every multiplicand bit. After these partial products are produced, they are added, each shifted according to its bit weight, with the help of 3:2 compressors. The basic idea is depicted in Figure 1. In a signed multiplier, negative numbers are assumed to be in 2's complement format. The sign of the result is determined by XORing the sign bits of the multiplier and the multiplicand: if both numbers are negative, the result is positive, and if exactly one of them is negative, the result is negative. If the sign of the result is negative, all bits of the magnitude product are 2's complemented to obtain the correct result.
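The sign handling described above can be sketched as follows, assuming operands are passed as n-bit 2's complement bit patterns stored in non-negative Python integers. The function name is hypothetical, and the built-in `*` stands in for the unsigned partial-product array.

```python
def signed_multiply(x: int, y: int, n: int) -> int:
    """Multiply two n-bit 2's complement patterns; return the 2n-bit pattern."""
    mask_n = (1 << n) - 1
    mask_2n = (1 << (2 * n)) - 1
    sign_x = (x >> (n - 1)) & 1
    sign_y = (y >> (n - 1)) & 1
    # Convert negative operands to their magnitudes (invert and add 1).
    mag_x = ((~x + 1) & mask_n) if sign_x else x
    mag_y = ((~y + 1) & mask_n) if sign_y else y
    # Unsigned multiplication of the magnitudes.
    mag_p = mag_x * mag_y
    # The result is negative only when exactly one operand is negative.
    sign_p = sign_x ^ sign_y
    return ((~mag_p + 1) & mask_2n) if sign_p else mag_p
```

For example, with n = 4 the patterns 1110 (-2) and 0011 (3) produce the 8-bit pattern 11111010, which is -6.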
Modified Booth Encoding
Booth encoding is a method for reducing the number of partial products, proposed by A.D. Booth [6]. Modified Booth encoding was invented by O.L. MacSorley [7]. MBE is an enhanced form of Booth encoding. A binary number X = x_{m-1}, x_{m-2}, ..., x_0 consisting of m bits and represented in 2's complement form can be expressed mathematically as

$$X = -2^{m-1} x_{m-1} + \sum_{i=0}^{m-2} x_i 2^i \tag{2}$$

Equivalently, the representation of X in base 4 is as follows:

$$X = \sum_{i=0}^{m/2-1} d_i 4^i$$

The digits d_i are chosen from the set {-2, -1, 0, 1, 2} according to Table I.
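The base-4 recoding can be sketched in a few lines. This sketch assumes the standard radix-4 Booth digit rule d_i = -2·x_{2i+1} + x_{2i} + x_{2i-1} with x_{-1} = 0, which is what such an encoding table normally tabulates; the function name is illustrative.

```python
def booth_radix4_digits(x: int, m: int) -> list:
    """Recode an m-bit 2's complement pattern (m even) into m/2 digits in {-2..2}."""
    if m % 2 != 0:
        raise ValueError("m must be even for radix-4 recoding")
    bits = [(x >> i) & 1 for i in range(m)]
    digits = []
    for i in range(m // 2):
        # x_{2i-1} overlaps with the previous group; x_{-1} is defined as 0.
        x_prev = bits[2 * i - 1] if i > 0 else 0
        digits.append(-2 * bits[2 * i + 1] + bits[2 * i] + x_prev)
    return digits  # least significant digit first


def value_from_digits(digits: list) -> int:
    """Evaluate the sum of d_i * 4**i as a signed integer."""
    return sum(d * 4 ** i for i, d in enumerate(digits))
```

For the 6-bit pattern 001110 (14), the digits are [-2, 0, 1], and -2 + 0·4 + 1·16 = 14. Only m/2 digits are produced, which is why the number of partial products is halved.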
However, MBE has two unavoidable consequences [2]. The first is the sign extension problem; the second is negative encoding. Together, these two problems result in one additional partial product row, which requires more hardware and consequently more time. Consider a simple 8x8 multiplier and its partial products using MBE. The last neg signal (neg3) in this scheme causes another carry-save adder delay in order to generate the sums and carries before the final accumulation.
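To make the role of the neg signals concrete, the following sketch forms each MBE row as a 1's complement plus a separate neg bit. It is a behavioural model under stated assumptions: rows are sign-extended to the full 2n-bit product width, neg is derived from the sign of the Booth digit, and all names are hypothetical rather than taken from [2].

```python
def mbe_partial_products(x: int, y: int, n: int) -> list:
    """Return (row, neg, shift) tuples for multiplier x and multiplicand y.

    x and y are n-bit 2's complement patterns (n even).
    """
    mask = (1 << (2 * n)) - 1
    y_ext = y
    if (y >> (n - 1)) & 1:
        # Sign-extend the multiplicand to the 2n-bit product width.
        y_ext = y | (mask & ~((1 << n) - 1))
    rows = []
    for i in range(n // 2):
        x_prev = (x >> (2 * i - 1)) & 1 if i > 0 else 0
        digit = -2 * ((x >> (2 * i + 1)) & 1) + ((x >> (2 * i)) & 1) + x_prev
        magnitude = (abs(digit) * y_ext) & mask  # 0, Y or 2Y
        neg = 1 if digit < 0 else 0
        # A negative digit is applied as a bitwise inversion; the missing
        # "+1" is carried separately as the neg bit of this row.
        row = (~magnitude & mask) if neg else magnitude
        rows.append((row, neg, 2 * i))
    return rows


def accumulate(rows: list, n: int) -> int:
    """Sum all rows and neg bits modulo 2**(2n)."""
    mask = (1 << (2 * n)) - 1
    total = 0
    for row, neg, shift in rows:
        total += (row << shift) + (neg << shift)
    return total & mask
```

An 8x8 multiplication yields four rows and four neg bits. The neg bit of the last row has no following row to be merged into, which is the extra term that the text identifies as the source of the additional partial product row.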
Proposed solution
J.-Y. Kang et al. [2] showed that if the 2's complement of the multiplicand is generated directly, there is no need for the last neg signal, as it has already been applied while generating the 2's complement. In this paper we adopt the methodology proposed in [2] to develop a new low-power multiplier architecture.
A. Fast 2's complement method
One way to calculate the 2's complement of a number is to take its 1's complement and then add 1 to the result. Another way is to complement all the bits to the left of the rightmost 1 in the word and keep the remaining bits intact. For example, the 2's complement of 001110 (14) is 110010 (-14). The 2's complement can therefore be found by finding the conversion signals, one per bit, that control this complementation. If the conversion signal is 0, the bit is kept as it is; otherwise it is complemented. The search for conversion signals proceeds systematically: the number is first divided into groups of 2 bits and the conversion signals are found within each group; the search is then extended to 4-bit groups formed by two consecutive 2-bit groups, then two 4-bit groups are joined to form an 8-bit group, and so on. When two 2^i-bit groups are joined, the leftmost conversion signal of the right-hand group carries the collective information of its group about whether a 1 has appeared in any bit position of that group. Figure 4 shows two possible implementations of the algorithm.
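A behavioural sketch of the conversion-signal idea follows. It assumes that the conversion signal of bit i is the OR of all lower bits, and it computes these signals with a doubling prefix-OR whose span grows 1, 2, 4, 8, ..., loosely mirroring the 2-bit, 4-bit, 8-bit grouping. It is not claimed to match either circuit of Figure 4; the function names are illustrative.

```python
def conversion_signals(x: int, n: int) -> int:
    """Bit i of the result is 1 iff any of bits x[i-1..0] is 1."""
    mask = (1 << n) - 1
    prefix_or = x & mask
    span = 1
    while span < n:
        # After this step, bit i covers the OR of 2*span consecutive bits
        # ending at position i (doubling, as when two groups are joined).
        prefix_or |= (prefix_or << span) & mask
        span <<= 1
    # Shift by one so that bit i sees only the bits strictly below it.
    return (prefix_or << 1) & mask


def fast_twos_complement(x: int, n: int) -> int:
    """Complement exactly those bits whose conversion signal is 1."""
    return (x ^ conversion_signals(x, n)) & ((1 << n) - 1)
```

For the 6-bit word 001110, the conversion signals are 111100, and XORing gives 110010, in agreement with the example in the text. No carry-propagating "+1" is needed, only a logarithmic number of OR levels.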

Power-performance comparison
The two multiplier architectures have been simulated using the circuit simulator Spectre and process data from a commercially available 90-nm CMOS technology. A common test bench has been set up, where the power and performance have been evaluated. Figure 9 shows the power-delay plot of the simulation results.
Figure 9: Power-performance comparison (horizontal axis: multiplier delay in ps).
The conventional multiplier has a delay of 652 ps with 5.11 mW power dissipation at the nominal power supply (1.0 V). The proposed structure, with its more complex PPG, resulted in a delay of 1648 ps and 1.41 mW power dissipation at nominal supply. Calculating the power-delay product (PDP) for the two architectures at nominal supply gives 3.33 pJ and 2.32 pJ for the conventional and proposed architectures, respectively. In order to get a fair power comparison of the two multiplier architectures, the power supply of the conventional structure was reduced, while the power supply of the proposed multiplier was increased, until equal latency was achieved. For the same delay, the PDP of the conventional multiplier was 3.78 pJ compared to 2.01 pJ for the proposed architecture. Hence, the proposed architecture enables 47% better energy efficiency for equal throughput and latency.
1-4244-0772-9/06/$20.00 ©2006 IEEE
The main benefit of the MBE architecture is the reduced number of generated partial products. This reduces the amount of combinational logic needed to merge the partial products into the final result. The conventional architecture required a total of 58030 transistors. Although the partial product merging tree was reduced by roughly half for the proposed architecture, the additional hardware required for the MBE and 2's complement generation added to the total transistor count. The total transistor count of the proposed architecture was therefore 38510, which is still 34% less than the conventional architecture.
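The energy and transistor-count figures quoted in this section can be cross-checked with a few lines of arithmetic. The helper names are illustrative; the numbers are those reported in the text.

```python
def power_delay_product_pj(delay_ps: float, power_mw: float) -> float:
    """PDP in picojoules: 1 ps * 1 mW = 1e-15 J = 1e-3 pJ."""
    return delay_ps * power_mw * 1e-3


def relative_saving(reference: float, improved: float) -> float:
    """Fractional reduction of `improved` with respect to `reference`."""
    return (reference - improved) / reference


# Nominal supply (1.0 V)
pdp_conventional = power_delay_product_pj(652, 5.11)   # about 3.33 pJ
pdp_proposed = power_delay_product_pj(1648, 1.41)      # about 2.32 pJ

# Equal-delay comparison and transistor counts as reported in the text
energy_saving = relative_saving(3.78, 2.01)            # about 0.47
transistor_saving = relative_saving(58030, 38510)      # about 0.34
```

At nominal supply the proposed multiplier is slower but consumes less energy per operation; the 47% figure refers to the equal-delay operating points, not to the nominal ones.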
Chip area comparison
In order to analyze the chip area and the effects of wiring, we have prepared layouts of both multipliers. These are shown in Figure 10.
This reduction in chip area has been possible primarily due to the use of MBE, which reduces the number of partial products by half. The logic for the removal of the last neg signal does not contribute to the overall area reduction, because the overhead of the 2's complement logic and the 5-1 selector dominates over the gain of removing one compressor.
Conclusions
An energy-efficient 32-bit multiplier architecture has been presented. The architecture is based on a modified Booth-encoding scheme, which reduces the number of partial products by half compared to a conventional implementation. Simulation results show that for equal delay the power efficiency of the proposed architecture is improved by 47% and the area is reduced by 24% compared to a conventional implementation.
In order to completely analyze the performance and power consumption, the layout can be extended to chip level, where wiring and interconnect capacitances and pad delays are introduced.
