Abstract-In this paper, a novel unified implementation of signed/unsigned multiplication is proposed using a simple sign-control unit together with a line of multiplexers. The proposed approach is demonstrated through a 0.18μm CMOS implementation of a 32-bit signed/unsigned multiplier. Reported results show that the proposed unified signed/unsigned implementation is very compact, with only 0.45% silicon area overhead. The critical path delay of the proposed multiplier is about 3.13ns.
I. INTRODUCTION
Both unsigned and signed binary number operation instructions are essential in configurable Digital Signal Processors (DSPs) and special-purpose computers [1]. However, multipliers are typically designed for either signed or unsigned binary numbers. To the best of our knowledge, the only reported implementations of signed/unsigned multipliers are: a) a programmable signed/unsigned tree-based architecture for redundant binary arithmetic [2] and b) a signed/unsigned Booth multiplier using a 2-bit most significant bit (MSB) extension to select the mode of operation [3]. In this paper, we propose a novel programmable signed/unsigned multiplier architecture that compares favorably against prior art in terms of silicon area and power consumption. Compared with a conventional signed multiplier, the proposed scheme adds only 0.45% silicon area overhead for the implemented 32-bit signed/unsigned multiplier. This is achieved by using three stages with a sign-control unit in the first pipelined stage.
In the first stage, modified Booth encoding (MBE) [4] is utilized to reduce the number of partial product rows (PPRs) by half. Instead of the MBE-based partial product generators (PPGs) used in the bit-extension scheme, a line of multiplexers is proposed here to generate a configurable PPR for the signed and unsigned modes. The second stage comprises a two-level Wallace-tree compression structure that efficiently sums the PPRs using carry-save adders. The final two partial product rows are processed by a hybrid adder that combines a conditional carry adder (CCA) and a conditional sum adder (CSA), based on the MLCSMA algorithm [5].
The proposed signed/unsigned multiplication scheme was optimized in terms of speed, power consumption and silicon area by: a) exploring a more regular partial product array, b) developing more efficient compression methods and c) combining several types of fast adders. This paper is organized as follows. Section II introduces the signed/unsigned algorithm. Section III describes the multiplier architecture and details the VLSI implementation of each of the multiplier stages. Section IV presents implementation results and compares them with prior art. Finally, a conclusion is given in Section V.
II. ALGORITHM
The multiplier supports two modes of operation: 32-bit 2's complement operands and 32-bit unsigned binary operands. Assume the multiplication operation is $Y(m) \times X(n)$, where $Y(m)$ and $X(n)$ represent the m-bit multiplicand and the n-bit multiplier, respectively. The 2's complement representation of $Y(m)$ is
In its unsigned representation, $Y(m)$ can be written as
These two representations can be combined using (m+1) bits:
where y equals $y_{m-1}$ in the signed mode and 0 in the unsigned mode. When radix-4 Booth encoding is applied to the multiplier, the expression of the multiplier $X(n)$ in its signed form is
where $x_{-1} = 0$. The n-bit unsigned representation of $X(n)$ can also be expressed as
where $x_{-1} = 0$, $x = x_{n-1}$ and $x_{n-1} = 0$ in equation (5). Equation (5) can also represent equation (4) under the condition where
According to equation (3) and equation (5), we obtain
where
Equation (6) describes both signed and unsigned multiplication. A control unit selects the values of x and y, which define the type of the operands. A line of multiplexers implements the first term $x \times 2^n \times Y(m)$ in equation (6).
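The expressions referenced as equations (1) to (6) can be written out as in the following sketch, which is reconstructed from the definitions in the text rather than taken from the original typesetting. The primed symbols $y'$ and $x'$ are an assumed notation for the extension bits that the prose calls y and x: $y'$ is taken to be $y_{m-1}$ in signed mode and 0 in unsigned mode, and $x'$ is taken to be $x_{n-1}$ in unsigned mode and 0 in signed mode, with n assumed even.

```latex
\begin{align}
Y(m) &= -y_{m-1}\,2^{m-1} + \sum_{i=0}^{m-2} y_i\,2^{i} \tag{1}\\
Y(m) &= \sum_{i=0}^{m-1} y_i\,2^{i} \tag{2}\\
Y(m) &= -y'\,2^{m} + \sum_{i=0}^{m-1} y_i\,2^{i} \tag{3}\\
X(n) &= \sum_{i=0}^{n/2-1} \left(x_{2i-1} + x_{2i} - 2x_{2i+1}\right) 2^{2i} \tag{4}\\
X(n) &= x'\,2^{n} + \sum_{i=0}^{n/2-1} \left(x_{2i-1} + x_{2i} - 2x_{2i+1}\right) 2^{2i} \tag{5}\\
X(n) \times Y(m) &= x' \times 2^{n} \times Y(m)
  + \sum_{i=0}^{n/2-1} \left(x_{2i-1} + x_{2i} - 2x_{2i+1}\right) 2^{2i}\, Y(m) \tag{6}
\end{align}
```

As a consistency check, setting $y' = y_{m-1}$ in (3) gives $-y_{m-1}2^{m} + y_{m-1}2^{m-1} = -y_{m-1}2^{m-1}$ for the two leading terms, which recovers (1), while $y' = 0$ recovers (2). Likewise, $x' = 0$ reduces (5) to (4), and the first term of (6) is the one implemented by the line of multiplexers.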
III. VLSI IMPLEMENTATION

A. Architecture
The architecture of the proposed 32-bit signed/unsigned multiplier is shown in Fig. 1. The sign-control unit generates the MSBs of the multiplier and multiplicand and the select signal for the line of multiplexers. Meanwhile, modified Booth encoding (MBE) is used to reduce the number of PPRs by a factor of two. After the PPRs are generated, Wallace-tree structures are used to add up all PPRs efficiently in parallel. More specifically, [3:2] and [6:2] adders are combined to sum the PPRs until only two rows are left. Carry-select adders are inserted in the second stage to reduce the delay, area and power of the long fast adder in the third stage without adding delay overhead. In the last step, a fast carry-propagation adder is used to add the final two PPRs. The final adder is characterized by the fact that its input signals do not arrive simultaneously as a result of the Wallace-tree compression. Ordinary single carry-propagation adder designs, by contrast, assume that all inputs arrive simultaneously. An adder combining both CSA and CCA is therefore developed for the last stage [6].
B. Sign-control Unit
The control unit determines whether the multiplier operates on signed or unsigned numbers. This reconfigurability results in a negligible 0.45% silicon area overhead. Figure 2 shows the building blocks of the control unit. The first two AND gates pre-process the operands' MSBs and generate the correct bit value for signed or unsigned operands. The third AND gate generates the control signal for the extra 17th partial product row. Figure 3 compares the bit-extension scheme circuit with the proposed mux-based scheme circuit for generating the extra 17th partial product row. Our circuit consists of two AND gates and 33 multiplexers, while prior art requires a PPG that includes 35 XNOR gates, 2 XOR gates and 33 OAI (OR-AND-INV) gates. Our approach is thus not only more compact but also faster than the previously reported bit-extension scheme.
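The behavior of the sign-control unit and the extra multiplexer row can be illustrated with a small behavioral sketch. It is not the gate-level circuit of Fig. 2: the function names, the polarity of the `signed_mode` input, and the exact mapping of the three AND gates to outputs are assumptions made for illustration. The n/2 Booth rows are represented only by the value they sum to, namely the 2's complement reading of the multiplier times the extended multiplicand.

```python
def sign_control(x_msb: int, y_msb: int, signed_mode: int) -> tuple:
    """Behavioral model of the three AND gates (gate polarity is assumed)."""
    y_ext = signed_mode & y_msb            # extension bit of the multiplicand
    x_ext = signed_mode & x_msb            # extension bit of the multiplier
    row_sel = (1 - signed_mode) & x_msb    # select signal for the extra mux row
    return x_ext, y_ext, row_sel


def to_signed(value: int, bits: int) -> int:
    """Interpret a bit pattern as a 2's complement number."""
    return value - (1 << bits) if value >> (bits - 1) & 1 else value


def multiply(x: int, y: int, signed_mode: int, n: int = 32) -> int:
    """Unified multiply of two n-bit patterns; returns a 2n-bit pattern."""
    x_msb, y_msb = x >> (n - 1) & 1, y >> (n - 1) & 1
    _, y_ext, row_sel = sign_control(x_msb, y_msb, signed_mode)
    y_value = y - (y_ext << n)                      # (n+1)-bit multiplicand value
    booth_rows = to_signed(x, n) * y_value          # total of the n/2 Booth rows
    extra_row = (y_value << n) if row_sel else 0    # mux row: Y * 2^n or zero
    return (booth_rows + extra_row) & ((1 << (2 * n)) - 1)
```

In signed mode the select signal is always 0, so the extra row contributes nothing; in unsigned mode it adds $2^n \times Y$ whenever the multiplier MSB is 1, which compensates for the negative weight that Booth encoding assigns to that bit.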
C. Modified Booth Encoding
The modified radix-4 Booth encoding proposed in [5] was adopted to balance the critical paths of the MBE stage and the Wallace tree. The scheme is detailed in Table I, while the Booth encoder and selector circuits proposed in [5] are shown in Fig. 4(a) and (b), respectively.
[Table I: New MBE scheme proposed in [5]; column headings include $x_{2i+1}$ and $x_{2i}$.]
According to equation (4), the bit $x_{-1}$ is always 0. As a result, the PPR generated by the two LSBs $x_1$ and $x_0$ can be produced by the simplified circuit in Fig. 4(c).
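The digit selection behind radix-4 Booth encoding can be sketched as follows. This is the textbook recoding with $x_{-1} = 0$, not the gate-level encoder and selector of [5]; the function names are hypothetical and n is assumed to be even.

```python
def booth_radix4_digits(x: int, n: int) -> list:
    """Radix-4 Booth digits of an n-bit pattern x, least significant digit first."""
    bits = [0] + [(x >> k) & 1 for k in range(n)]   # bits[0] plays the role of x_{-1} = 0
    digits = []
    for i in range(n // 2):
        x_prev, x_even, x_odd = bits[2 * i], bits[2 * i + 1], bits[2 * i + 2]
        digits.append(x_prev + x_even - 2 * x_odd)  # each digit lies in {-2, ..., 2}
    return digits


def booth_value(digits: list) -> int:
    """Recombine the digits with weights 4^i (the signed reading of x)."""
    return sum(d * 4 ** i for i, d in enumerate(digits))
```

For the 4-bit pattern 0110 the sketch yields the digits [-2, 2], that is -2 + 2×4 = 6. A 32-bit multiplier produces 16 digits, which is why the number of partial product rows is halved.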
The extra partial product bit ($Neg$) at the LSB position of each partial product row for negative encoding leads to an irregular partial product array and a complex reduction tree. In the conventional MBE scheme [4], the LSB of each PPR ($P_{LSB}$) and $Neg$ have the same bit weight. Their logic equations are
where $A = x_{2i} \oplus x_{2i-1}$ and $y_{-1} = 0$. By pre-computing the sum of $P_{LSB}$ and $Neg$ and manipulating the logic equations $Neg$, $P_{LSB}$ = $Neg$ + $P_{LSB}$
The logic function is changed to
Using equation (9), the silicon area of the MBE stage is reduced and the speed of the LSB operation is improved. Note that all the optimized bits in MBE are generated no later than the other, conventional partial product bits. Figure 5 shows an example of 8-bit signed/unsigned multiplication using sign-extension protection, the optimized Booth encoding scheme for the LSB, and the mux-based signed/unsigned scheme.
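The role of the Neg bit can be illustrated with a short behavioral sketch: a negative Booth multiple is formed by inverting the bits of the positive multiple, and the missing +1 of the 2's complement negation is carried by Neg at the LSB weight. The function name `booth_row`, the fixed row `width`, and the choice Neg = 0 for a zero digit are assumptions for illustration; the sketch does not reproduce the optimized LSB logic discussed above.

```python
def booth_row(y: int, digit: int, width: int) -> tuple:
    """Return (row_bits, neg) for one Booth digit in {-2, -1, 0, 1, 2}.

    y is a non-negative bit pattern; the row is truncated to `width` bits.
    """
    mask = (1 << width) - 1
    magnitude = (abs(digit) * y) & mask     # 0, Y or 2Y
    if digit < 0:
        return ~magnitude & mask, 1         # inverted bits, +1 deferred to Neg
    return magnitude, 0
```

Adding `neg` to `row_bits` modulo $2^{width}$ gives the same pattern as `digit * y`, which is why Neg must be summed in the LSB column and why it disturbs the regularity of the partial product array.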
D. Partial Product Reduction
Traditionally, half and full adders, organized in a carry-save adder format, have been used in the partial product reduction process. However, since their inception by Weinberger [7], [4:2] adders have become a topic of significant research in the arithmetic community. They have changed the standard view of counters for partial product reduction by introducing the notion of horizontal data paths within stages of reduction. Furthermore, optimized
The MBE algorithm typically generates n/2+1 PPRs instead of n/2 due to the extra partial product bit (Neg bit) [11]. One more PPR is needed for signed/unsigned configurations in our multiplier. Instead of using the method of [11] to reduce the number of PPRs, all the MSBs of PPR for sign-protection scheme
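The principle of carry-save reduction can be sketched with [3:2] compressors alone. This is a simplification: the multiplier described here combines [3:2] and [6:2] adders, and the grouping below does not follow the actual tree of the design. Rows are assumed to be non-negative integers used as bit patterns, and the function names are hypothetical.

```python
def compress_3_2(a: int, b: int, c: int) -> tuple:
    """[3:2] carry-save step: three rows in, one sum row and one carry row out."""
    sum_row = a ^ b ^ c
    carry_row = ((a & b) | (a & c) | (b & c)) << 1   # carries move one column left
    return sum_row, carry_row


def reduce_to_two_rows(rows: list) -> list:
    """Apply [3:2] compression level by level until at most two rows remain."""
    rows = list(rows)
    while len(rows) > 2:
        next_rows = []
        # Compress every full group of three rows; no carry propagates across columns.
        for k in range(0, len(rows) - 2, 3):
            next_rows.extend(compress_3_2(rows[k], rows[k + 1], rows[k + 2]))
        # Rows that do not fill a group pass through to the next level unchanged.
        next_rows.extend(rows[len(rows) - len(rows) % 3:])
        rows = next_rows
    return rows
```

Each level preserves the total of all rows, so the two remaining rows can be handed to a single carry-propagation adder.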
E. Final Fast Addition
CSAs, CCAs and carry look-ahead adders (CLAs) can be used to implement the final fast addition. The CLA is widely used and can be easily implemented in dynamic domino CMOS logic, with the limitation that it requires full-custom design. For standard static CMOS circuits, the CCA and CSA [6] are preferred and can easily be implemented using a standard cell library. In contrast to the CSA, the CCA needs XOR logic to produce the final results. This translates into more delay than a CSA of the same bit width. The CSA needs to store both the conditional sum and the conditional carry. As a result, it uses more multiplexers than a CCA. To combine the benefits of both adders, a mixed CSA-CCA architecture was implemented to compute the final fast addition. Figure 8 shows the last-stage architecture of a 32-bit CCA followed by a 16-bit CSA, which has the same performance as a 48-bit CSA because the carry-out from the CCA is used as the final select signal of the 16-bit CSA. In this situation, all 48 result bits can be generated simultaneously.
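The selection mechanism of the mixed adder can be sketched behaviorally. The internal structure of the CCA and CSA blocks is not modeled; the sketch only shows the upper 16 bits being prepared for both possible carry-in values and then selected by the carry-out of the lower 32 bits. The function name and the decision to drop the carry out of bit 47 are assumptions.

```python
def hybrid_add48(a: int, b: int) -> int:
    """48-bit addition: 32-bit lower block, 16-bit upper block selected by its carry."""
    low_mask, high_mask = (1 << 32) - 1, (1 << 16) - 1

    low = (a & low_mask) + (b & low_mask)        # lower 32 bits (the CCA in the text)
    carry_out = low >> 32                        # becomes the final select signal

    a_hi, b_hi = (a >> 32) & high_mask, (b >> 32) & high_mask
    high_if_0 = (a_hi + b_hi) & high_mask        # conditional sum for carry-in 0
    high_if_1 = (a_hi + b_hi + 1) & high_mask    # conditional sum for carry-in 1

    high = high_if_1 if carry_out else high_if_0
    return (high << 32) | (low & low_mask)
```

Because both candidate upper sums are computed while the lower block is still resolving its carry, the upper bits are ready as soon as the select signal arrives.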
IV. RESULTS
The multiplier was modeled in Verilog HDL and synthesized using Synopsys Design Compiler (DC) with a TSMC 0.18μm 1.8V standard cell library. The synthesized netlists were fed into Cadence SOC Encounter to perform auto-placement and routing. Power consumption was estimated from the same netlist by using Synopsys PrimeTime PX to analyze switching activity with 5000 random input patterns at clock frequencies of 50 MHz and 100 MHz. The building blocks of the bit-extension and mux-based signed/unsigned schemes for 32-bit multiplication were also modeled in Verilog HDL and synthesized. The two schemes are compared in Table II. Reported results show that our scheme enables a reduction of 15.17% in silicon area and 18.64% in power consumption.
The circuit occupies 642μm×636μm for a standard-cell design using auto-placing and routing tools. We have also implemented the multiplier using a full-custom design flow. The corresponding full-custom design occupies 360μm×900μm.
Figure 9 shows the total latency "final" and the latency of the "sum" and "carry" signals produced before the final addition for all bits from 0 to 63. With the proposed multiplier architecture, bits 30 to 64 come out almost simultaneously, while bits 0 to 12 come out more slowly than in the case of a 64-bit CLA. This is explained by the fact that we used a carry-select adder to reduce area and power consumption where delay is not as critical.
Table III compares the performance of the proposed multiplier against recent implementations [10], [11]. The proposed multiplier achieves a delay as small as 3.13ns because registers are used for pipelining. This translates into a relatively larger silicon area. The power dissipation is also improved by optimizing the logic function of the MBE stage and balancing the signal paths of the tree-based parallel compression stage.
V. CONCLUSION
In this paper, we present a 32-bit×32-bit pipelined multiplier capable of carrying out both signed and unsigned operations.
The proposed novel unified signed/unsigned multiplication scheme requires only a simple sign-control unit together with a line of multiplexers, resulting in only 0.45% silicon area overhead in a 0.18μm CMOS process. The critical path delay of the proposed multiplier is about 3.13ns. The signed/unsigned multiplier was optimized in terms of speed, power consumption and silicon area by exploiting a more regular partial product array, developing more efficient compression methods and combining several types of fast adders.
