ABSTRACT
This paper presents a new two-gate-delay implementation of the Booth encoder and partial product generator, which eliminates the unnecessary glitches associated with the Booth multiplier. In addition, modified signed/unsigned (MSU) and modified sign-generate (MSG) algorithms, especially suitable for signed/unsigned multipliers, were developed in order to reduce the compression level needed in the Wallace tree, and hence reduce the multiplier hardware. Using these features reduces the multiplier array energy dissipation by about 30% and increases speed by about 10%.
INTRODUCTION
It has become imperative for reduced instruction set computers (RISC) and digital signal processors (DSP) to use less energy without sacrificing their computation throughput. Hence the parallel multiplier, as one of the key building blocks of RISC and DSP, must simultaneously address the low-power and high-speed design issues. In general there are two basic approaches to enhancing the speed of parallel multipliers: one is the Booth algorithm and the other is the use of Wallace tree compressors or counters. However, both typically lead to excessive energy dissipation [1]. When only the Wallace tree is used to compress the number of partial products [2], the multiplier array becomes very large due to the large number of gates and interconnect wires. This leads to high energy dissipation. On the other hand, when the Booth algorithm is used [3][4][5], many unnecessary glitches occur in the multiplier array as a result of the race condition between the multiplicand and the multiplier, caused by the Booth encoder and the partial product generator. This again leads to high energy dissipation.
Furthermore, when the Booth algorithm is not optimized to match the special conditions of the array, both in terms of operand sizes and in terms of sign extension bits (as, for example, when signed/unsigned multipliers are used), the result is a large increase in the number of full adders (FA) needed for the Wallace tree compressors. In this work a new implementation of the Booth encoder and partial product generator is presented that eliminates the unnecessary glitches associated with the modified Booth algorithm. In addition, it exhibits a very short delay, only two gates, from the input operands to the partial products. Combining this Booth encoder with the MSG algorithm, the MSU algorithm, and the 4-2 based Wallace tree compressors [3][4][6] leads to the fastest possible multiplier array with reduced energy dissipation.

Permission to make digital/hard copy of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. © 1997 ACM 0-89791-903-3/97/08..$3.50
THE MSU ALGORITHM
The operands of the multiplier presented here have two basic operation modes: 16-bit unsigned numbers and 16-bit signed numbers, as shown in Fig. 1, where SE is the sign extension bit. The additional (17th) bit is needed to represent the operands of both modes in two's complement. The range of the unsigned operands is $0 \le X, Y \le 2^{16} - 1$ and that of the signed operands is $-2^{15} \le X, Y \le 2^{15} - 1$. The result is in the range $-2^{31} + 2^{15} \le X \cdot Y \le 2^{32} - 2^{17} + 1$, and it can be represented in two's complement using 33 bits.

In general, the modified Booth algorithm is applicable only to two's complement operands. The bit-pair Booth algorithm is based on partitioning the multiplier into overlapping groups of 3 bits. Each group is then encoded to generate a correct partial product. The n-bit multiplier Y is written in two's complement as:

$$Y = -y_{n-1} \cdot 2^{n-1} + \sum_{i=0}^{n-2} y_i \cdot 2^{i} \qquad (1)$$

$$Y = \sum_{i=0}^{n/2-1} \left( y_{2i-1} + y_{2i} - 2 y_{2i+1} \right) \cdot 2^{2i}, \quad y_{-1} = 0 \qquad (2)$$

where the term in brackets in (2) has values in the set {-2, -1, 0, 1, 2}. Each recoded value performs a certain operation on the multiplicand X (and accordingly adds '0' or '1' to the LSB), as illustrated in Table I.
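The recoding can be illustrated with a short Python sketch. It assumes the standard bit-pair rule, in which each overlapping 3-bit group $(y_{2i+1}, y_{2i}, y_{2i-1})$ maps to the digit $y_{2i-1} + y_{2i} - 2y_{2i+1}$ with $y_{-1} = 0$. The names `booth_digits` and `booth_row` are illustrative and do not come from the paper, and Table I itself is not reproduced here.

```python
def booth_digits(y, n):
    """Radix-4 Booth digits (least significant group first) of an
    n-bit two's complement integer y. n must be even."""
    if n % 2:
        raise ValueError("n must be even")
    # (y >> k) & 1 yields two's complement bits for negative y as well.
    bits = [(y >> k) & 1 for k in range(n)]
    digits = []
    prev = 0  # y_{-1} = 0
    for i in range(n // 2):
        low, high = bits[2 * i], bits[2 * i + 1]
        digits.append(prev + low - 2 * high)  # value in {-2, ..., 2}
        prev = high
    return digits


def booth_row(x, digit, width):
    """Return (row, a) for one recoded digit (illustrative helper).

    row is |digit| * x as a width-bit pattern, inverted bitwise when the
    digit is negative; a is the bit added at the LSB to complete the
    two's complement negation.
    """
    mask = (1 << width) - 1
    magnitude = (abs(digit) * x) & mask
    if digit < 0:
        return (~magnitude) & mask, 1
    return magnitude, 0


# Usage: the digits, weighted by powers of 4, reproduce the multiplier.
y = -12345
assert sum(d * 4 ** i for i, d in enumerate(booth_digits(y, 16))) == y
```

Splitting each negative row into an inverted pattern plus a separate LSB bit is one common way to realise the "adds '0' or '1' to the LSB" behaviour described above; other realisations are possible.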
In order to achieve the fastest multiply operation, using the Wallace tree compression, two rows of 4-2 compressors and one row of 6-2 compressors should be used, resulting in an equivalent delay of 8 XOR gates [6] .
A modified signed/unsigned Booth algorithm is proposed next in order to reduce the number of rows in the array, and hence reduce hardware and increase speed. Since the operand Y in the signed mode can be represented using only 16 bits, a solution must be given for the unsigned operand mode. In this case (2) can be rewritten as:

$$Y = \sum_{i=0}^{n/2-2} \left( y_{2i-1} + y_{2i} - 2 y_{2i+1} \right) \cdot 2^{2i} + \left( y_{n-3} + y_{n-2} + 2 y_{n-1} \right) \cdot 2^{n-2} \qquad (3)$$
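The last term in brackets in (3) has values in the set {0, 1, 2, 3, 4}, and must be separated into two terms:

$$Y = \sum_{i=0}^{n/2-2} \left( y_{2i-1} + y_{2i} - 2 y_{2i+1} \right) \cdot 2^{2i} + \left( y_{n-3} + y_{n-2} \right) \cdot 2^{n-2} + y_{n-1} \cdot 2^{n-1} \qquad (4)$$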
The regular Booth algorithm is suitable only for operands with an even number of bits. Hence, for the 17-bit two's complement operands used here, an additional bit is needed for Y. The result is a 17 x 18-b multiplier array (X has 17 bits and Y has 18 bits). In this case a maximum of 10 bits can be added in the same bit position (column number 16).
Eq. (4) is the modified signed/unsigned version of the Booth algorithm. It makes a 17 x 16-b multiplier array possible.
When working in the signed mode the "regular" Booth algorithm (2) is used. In the unsigned mode, bit $y_{n-1}$ in the last Booth encoder ($y_{15}$ in the 16-bit case) should be replaced by '0'; this satisfies the term $(y_{n-3} + y_{n-2}) \cdot 2^{n-2}$ in (4). An additional partial product should be generated according to the real value of $y_{n-1}$, in order to satisfy the last term in (4), $y_{n-1} \cdot 2^{n-1}$.
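A minimal Python sketch of this mode-dependent recoding follows. It assumes the bit-pair digit rule $y_{2i-1} + y_{2i} - 2y_{2i+1}$ and models the additional partial product simply as the weight $y_{n-1} \cdot 2^{n-1}$; the function names and the `unsigned` flag are hypothetical and are not taken from the paper.

```python
def msu_decompose(y, n=16, unsigned=True):
    """Split an n-bit multiplier y into Booth digits plus an extra term.

    Signed mode: regular Booth digits, extra term is 0.
    Unsigned mode: the top bit seen by the last encoder is forced to 0,
    and its real value is returned separately as y_{n-1} * 2^(n-1).
    """
    bits = [(y >> k) & 1 for k in range(n)]
    top = bits[n - 1]
    if unsigned:
        bits[n - 1] = 0  # last Booth encoder sees '0' instead of y_{n-1}
    digits = []
    prev = 0  # y_{-1} = 0
    for i in range(n // 2):
        low, high = bits[2 * i], bits[2 * i + 1]
        digits.append(prev + low - 2 * high)
        prev = high
    extra = (top << (n - 1)) if unsigned else 0
    return digits, extra


def recombine(digits, extra):
    """Weighted sum of the digits plus the extra term."""
    return sum(d * 4 ** i for i, d in enumerate(digits)) + extra


# Usage: both modes reproduce the multiplier value with n/2 Booth digits.
assert recombine(*msu_decompose(0xFFFF, 16, unsigned=True)) == 0xFFFF
assert recombine(*msu_decompose(-1, 16, unsigned=False)) == -1
```

In the unsigned mode the last digit can only take the values 0, 1 or 2, so the extra row is the only place where the weight of $y_{n-1}$ enters the sum.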
The multiplier array obtained with the MSU algorithm is depicted in Fig. 2. The $p_i$ are the partial products, and the $a_i$ are the bits added to the LSB. The first seven rows (i = 0 to 6) are produced by the "regular" Booth algorithm. The partial products in the eighth row (i = 7) are produced either by the "regular" term given in (2), or by the modified second term given in (4). The last partial product (i = 8) is either all zeroes, when using the signed mode, or the partial product given by the AND function between $y_{15}$ and the X operand.
In this array a maximum of only 9 bits are added in the same bit position. Hence, in order to achieve the fastest multiplier operation, only one row of 9-2 compressors should be used, resulting in a delay equivalent to only 7 XOR gates [6].
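The column counts quoted above (10 bits for the 17 x 18-b array, 9 bits with MSU) can be reproduced with a small Python sketch. The layout is an assumption inferred from the text rather than taken from Fig. 2: each Booth row i is 18 bits wide and starts at column 2i, each $a_i$ sits at column 2i, the extra MSU row occupies columns 15 to 31, and sign extension bits are not counted.

```python
def column_heights(rows, lsb_bits, width=34):
    """Count how many bits land in each column.

    rows     : list of (first_column, last_column) spans, one per row
    lsb_bits : columns that receive an extra single bit (the a_i)
    """
    heights = [0] * width
    for first, last in rows:
        for col in range(first, min(last, width - 1) + 1):
            heights[col] += 1
    for col in lsb_bits:
        heights[col] += 1
    return heights


# Regular Booth on a 17 x 18-b array: nine rows, each with its own a_i.
regular = column_heights(
    rows=[(2 * i, 2 * i + 17) for i in range(9)],
    lsb_bits=[2 * i for i in range(9)],
)

# MSU on a 17 x 16-b array: eight Booth rows plus the y15 AND X row.
msu = column_heights(
    rows=[(2 * i, 2 * i + 17) for i in range(8)] + [(15, 31)],
    lsb_bits=[2 * i for i in range(8)],
)

print(max(regular), regular.index(max(regular)))
print(max(msu))
```

Under these assumptions the tallest column drops from 10 bits to 9, which is what allows a single 9-2 compression level.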
MSG ALGORITHM
In some cases, when Wallace tree compression is used, the partial products are sign-extended up to the MSB of the array. The result is a large waste of power and area due to the unnecessary compressors used for the extended sign bits.
In order to reduce the array to a rectangle, two basic sign extension methods are typically used, namely sign-propagate and sign-generate [1][7]. The sign-propagate algorithm is unsuitable for fast multiplier arrays because of its large delay, which results from the serial dependence of the sign extension of each partial product on that of the previous one.
According to the sign-generate algorithm [7] the result of adding all the sign extension bits of a 17 x 16-b multiplier can be written as:
Using the two equivalences:

The first term in (8) affects only bits 34 and above and can be omitted. When using (8), all the sign extension bits of all partial products can be replaced by the following steps:
Inverting the MSB of each partial product ($p_{17}$)
Adding '1' to the left of each partial product
Adding '1' in bit column number 17

Although this algorithm significantly reduces the number of unnecessary FA used to compress the sign extension bits when signed operands are used, the overall reduction in the number of FA needed for the multiplier array is very small when unsigned operands are also used. The reason is the increase in the number of bits in column number n+1 (column 17 in the 16-bit multiplier, depicted in Fig. 2, will have 10 bits). In order to achieve the most economical array, a modified sign-generate algorithm is presented hereafter. To eliminate the '1' in column number 17, (8) is rewritten as:
where the first term of (8) was omitted.
The last term in (9) has the value 3 or 4, depending on the value of $\bar{s}_0$. When $\bar{s}_0$ = '0' the last term is '011', or '$\bar{s}_0 s_0 s_0$', and when $\bar{s}_0$ = '1' this term equals '100', or again '$\bar{s}_0 s_0 s_0$', where $s_i$ is the sign bit of the partial product in the ith row.

To compress this array only one row of 9-2 compressors is needed. This results in a delay equivalent to 7 XOR gates, with a total of 170 FA. On the other hand, when the MSU and MSG are not used, the array compression time is equivalent to 8 XOR gates and 210 FA are used. The final adder is a 27-bit fast adder for the higher bits [8]-[10], while the lower 7 bits of the final result can be computed simultaneously with the array compression.
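The sign-generate steps and the modified first row can be checked numerically with the Python sketch below. It is a check of the steps as described in this section, not a reproduction of equations (5) to (9). It assumes eight signed rows, 18-bit partial products ($p_0$ to $p_{17}$) placed at column 2i, and a 33-bit result; all function names are illustrative.

```python
ROWS = 8        # signed Booth rows (assumed)
PP_BITS = 18    # bits p0..p17 of each partial product (assumed)
OUT_BITS = 33   # width of the final result
MSB = 1 << (PP_BITS - 1)


def exact_sum(rows):
    """Reference: add the signed row values with full sign extension."""
    return sum(v << (2 * i) for i, v in enumerate(rows)) % (1 << OUT_BITS)


def sign_generate_sum(rows):
    """Sign-generate: invert p17, put '1' left of each row, add '1' in column 17."""
    total = MSB  # the single '1' in column 17
    for i, v in enumerate(rows):
        pattern = v % (1 << PP_BITS)   # 18-bit two's complement pattern
        pattern ^= MSB                 # invert the MSB
        pattern |= 1 << PP_BITS        # '1' to the left of the row
        total += pattern << (2 * i)
    return total % (1 << OUT_BITS)


def msg_sum(rows):
    """Modified sign-generate: row 0 carries (not s0, s0, s0) in columns 19..17,
    so no separate '1' is needed in column 17."""
    total = 0
    for i, v in enumerate(rows):
        pattern = v % (1 << PP_BITS)
        s = pattern >> (PP_BITS - 1)   # sign bit of this row
        if i == 0:
            low = pattern & (MSB - 1)
            ext = ((1 - s) << 2) | (s << 1) | s
            pattern = low | (ext << (PP_BITS - 1))
        else:
            pattern ^= MSB
            pattern |= 1 << PP_BITS
        total += pattern << (2 * i)
    return total % (1 << OUT_BITS)


# Usage: all three sums agree for an arbitrary set of row values.
rows = [-3, 100, -(1 << 17), (1 << 17) - 1, 0, -1, 42, -99999]
assert exact_sum(rows) == sign_generate_sum(rows) == msg_sum(rows)
```

The three-bit field of row 0 evaluates to 4 when $s_0$ = 0 and to 3 when $s_0$ = 1, which matches the two values of the last term discussed above.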
BOOTH ENCODER AND PARTIAL PRODUCT GENERATOR
The implementation of the modified Booth algorithm is typically a major cause of energy dissipation, due to the race condition between the X and Y operands.
The most common implementation of the Booth encoder and the partial product generator is depicted in Table II. Its partial product generators are small (only 8 transistors), resulting in a small array area. This is due to the fact that in an n x n multiplier the partial product generator is placed (n/2) · (n + 1) times, while the encoder is placed only (n/2) times. Furthermore, the encoder's load is comparatively small since only one NMOS is used for each column in each row, for each of the encoder's control lines.
A more compact implementation is presented in Table III and Fig. 5 [4][5]. In this case more transistors are used for the partial product generator, but on the other hand the encoder is much simpler, and only three control lines are passed for each row. A similar implementation that is used to optimize the dimensions of the array slice can be found in [6]. The drawback of all these implementations is the unnecessary glitches caused on the partial products. This problem is best demonstrated by an example using the implementation presented in Fig. 5.

Assume the multiplier is 4 x 4-b, the current operands are X = Y = 0101, and they are changed to X = 1111 and Y = 1001. In the current cycle the two 3-bit sub-strings of the multiplier Y are $y_1 y_0 y_{-1}$ = '010' and $y_3 y_2 y_1$ = '010'. According to Table III the encoded bits are X1 = '1' and X2 = NEG = '0'. Hence, the fourth partial product in the second column (i = 1, j = 3) is $PP_{13}$ = '0'. In the next cycle the operands are changed. The left sub-string (i = 1) is $y_3 y_2 y_1$ = '100', and the new encoded bits are X1 = '0' and X2 = NEG = '1'. It is seen from Fig. 5 that the NEG signal propagates to the partial product generator without any gate delay, and hence the value of the new partial product, after one gate delay, will be $PP_{13}$ = '1'. After a further delay equal to the sum of two gate delays, the partial product will be inverted, $PP_{13}$ = '0', due to the change of the operand X ($x_3 = x_2$ = '1'). The partial product is inverted a third time, $PP_{13}$ = '1', after one additional gate delay. It should be mentioned that any change in the value of the partial products also causes a change all along the multiplier array and the final adder. Thus, in the example presented above some parts of the array will exhibit four logic state changes when no change was actually needed at all. The energy dissipation associated with the glitches in the modified Booth algorithm is an important portion of the total energy dissipation of the whole multiplier [1].

Fig. 5: Compact encoding of Booth algorithm.

The problem of spurious transitions is not unique to the implementation presented in Fig. 5; it can be verified that it appears in all the previously reported Booth encoders. Two basic approaches can be used in order to eliminate the unnecessary glitches in the Booth algorithm. One is to latch all the partial products and allow them to change only after steady state has been reached in the encoder and the partial product generator. This can be done by using a clock derived from the global clock, whose duty cycle is defined according to the slowest path in the Booth implementation. However, this approach requires a large area and dissipates a lot of energy by itself.

Table IV: Race-free encoding of Booth algorithm.
The second approach is to synchronize all the paths in the encoder and the partial product generator. It can be implemented by using a different recoding scheme, as presented in Table IV and Fig. 6. The principle here is to achieve the fastest possible equal path for all signals emerging from the X and Y operand latches. It can be verified that the unnecessary glitch problem demonstrated above does not occur in this implementation. The following properties should be noted:
Race-free encoding of Booth algorithm.
The circuit uses only XOR and NXOR gates up to the last stage of the partial product generator. This means that all the paths can be equalized to have exactly the same propagation delay.
The delay from X and Y to PP is only two gates: one XOR/NXOR and the output complex gate.
The encoder is very compact. Four control lines are used for each row.
The load, for each column in each row, on X1 and X2P is one gate, and NEG is loaded with two gates. This is the same as in the compact encoder. The additional control line ZP is loaded with one gate.
Complementary inputs are not needed.
The penalty for this fast and race-free implementation is the larger area used for the partial product generators [11]. The full CMOS implementation of the partial product generator consists of 24 transistors, compared to only 15 that are needed for the compact implementation presented in Fig. 5. As a result, the area that is saved by reducing the number of FA needed to compress the array is used for the larger partial product generators.
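The steady-state values in the 4 x 4-b example of this section can be checked with the Python sketch below. Table III is not reproduced in the text, so the encoder equations used here (X1 = $y_{2i} \oplus y_{2i-1}$, X2 set for the '011' and '100' groups, NEG = $y_{2i+1}$) are an assumption chosen to agree with the encoded bits quoted in the example; the function names are illustrative. The sketch evaluates only settled logic values and does not model gate delays or glitches.

```python
def bits_lsb_first(value, n):
    """Bits of an n-bit value, index 0 being the LSB."""
    return [(value >> k) & 1 for k in range(n)]


def compact_encode(y_high, y_mid, y_low):
    """Assumed compact encoding of the group (y_{2i+1}, y_{2i}, y_{2i-1})."""
    x1 = y_mid ^ y_low                   # select 1 * X
    x2 = (y_high ^ y_mid) & (x1 ^ 1)     # select 2 * X
    neg = y_high                         # negate the selected value
    return x1, x2, neg


def partial_product_bit(x_bits, j, x1, x2, neg):
    """Settled value of partial product bit j for one encoder row."""
    x_j = x_bits[j]
    x_prev = x_bits[j - 1] if j > 0 else 0
    return ((x1 & x_j) | (x2 & x_prev)) ^ neg


def pp13(x, y):
    """Partial product bit (i = 1, j = 3) of a 4 x 4-b multiplier."""
    xb, yb = bits_lsb_first(x, 4), bits_lsb_first(y, 4)
    x1, x2, neg = compact_encode(yb[3], yb[2], yb[1])
    return (x1, x2, neg), partial_product_bit(xb, 3, x1, x2, neg)


before = pp13(0b0101, 0b0101)  # current cycle
after = pp13(0b1111, 0b1001)   # next cycle
print(before, after)
```

Under these assumptions the settled value of $PP_{13}$ is the same before and after the operand change, which is why every intermediate transition described in the example is spurious.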
SUMMARY
A glitch-free Booth encoder and partial product generator were presented, together with two algorithms: MSU and MSG. Using all these features yields the fastest signed/unsigned multiplier array on the one hand, and significantly decreases its energy dissipation on the other.
The speed enhancement is due to the reduction of the compression level in the multiplier array. Without using the MSG and MSU algorithms the compression time is equivalent to 8 XOR gates (3 at the 4-2 first compression level and 5 more at the 6-2 second compression level). When using the MSG and MSU only one compression level 9-2 is used, with a delay equivalent to 7 XOR gates. Energy dissipation is reduced due to the elimination of the glitches associated with the Booth algorithm, and the reduction of the number of FA needed to compress the multiplier array.
