An efficient multiplier design method for predetermined coefficient groups is presented, based on a variation of canonic signed digit (CSD) encoding and partial product sharing. Applications to a radix-2^4 FFT structure and to the pulse-shaping filter used in CDMA show that the proposed method significantly reduces the area, propagation delay and power consumption compared with previous methods.
Introduction
In some DSP applications such as the FFT, multiplications are performed with only a few predetermined coefficients that change over time in a periodic order. In these applications, the multipliers must be programmable. When a few coefficients share a multiplier, modified Booth encoding, which halves the number of partial products, is generally used. If the multiplier coefficient is a constant, it can instead be encoded with the fewest possible nonzero digits using CSD, which reduces the area and power consumption [1], [2].
Subexpression elimination can be used to implement constant multipliers efficiently. The multiple constant multiplication (MCM) problem determines how subexpression elimination can be applied to a set of constant multipliers so that the number of additions required for the implementation is minimized [3]. However, subexpression elimination provides an efficient solution only when the multiplications between the input (as multiplicand) and each multiplier in the coefficient group need to be performed at the same time. In other words, subexpression elimination cannot be used if the multiplications are performed one after another in a periodic manner (e.g., multiplications in a pipelined FFT). A multiplier design method for a group of predetermined coefficients, based on modified Booth encoding, has recently been proposed in [4]. Figure 1 shows the pipelined structure of the N-point radix-2^4 FFT [5]. In the first and the third multiplication blocks, the three coefficients {cos(π/8), cos(π/4), sin(π/8)} are multiplied by an input complex signal in periodic order as
Partial Products Depth Reduction
In general, the multiplications in (1) can be implemented using a programmable multiplier such as a modified Booth multiplier.
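For reference, radix-4 modified Booth recoding can be sketched as follows. This is a minimal illustration of the standard recoding rule, not the circuit of [4]; the function name and the choice of returning digits least significant first are assumptions made here. It shows that a W-bit two's complement coefficient always maps to W/2 digits, and hence to W/2 partial product rows.

```python
def booth_radix4_digits(value, width):
    """Return the radix-4 modified Booth digits of a two's complement value.

    Digits are in {-2, -1, 0, 1, 2} and are listed least significant first,
    so that value == sum(d * 4**k for k, d in enumerate(digits)).
    """
    if width % 2:
        raise ValueError("width must be even")
    if not -(1 << (width - 1)) <= value < (1 << (width - 1)):
        raise ValueError("value does not fit in the given width")
    # Two's complement bits, least significant first.
    bits = [(value >> i) & 1 for i in range(width)]
    digits = []
    prev = 0  # implicit zero bit to the right of the LSB
    for i in range(0, width, 2):
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits


if __name__ == "__main__":
    # A 14-bit coefficient always yields 7 Booth digits.
    print(booth_radix4_digits(7, 14))
```

Every coefficient of a given width produces the same number of digits, whatever its value, which is why the partial product count of a Booth multiplier depends only on the word length.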
If the coefficient word length is W, the number of partial product (PP) rows obtained by the modified Booth algorithm is W/2. To further reduce the number of PP rows, we propose a grouping algorithm for multiplication coefficients:
1. When the number of given 2's complement coefficients is N_c with a word length of N_w bits, the coefficients are arranged as an N_c × N_w table.
2. The coefficients in the table are converted to CSD representations.
3. Starting from the first column, a group is defined such that each row in a group contains at most one nonzero digit. A group should contain as many columns as possible so that the number of groups is minimized.
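The three steps above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the "first column" is assumed to be the most significant digit, and groups are returned as half-open column ranges.

```python
def to_csd(value, width):
    """Return the CSD digits (-1, 0, +1) of a signed integer, MSB first.

    No two adjacent digits are nonzero (non-adjacent form).
    """
    digits = []
    n = value
    for _ in range(width):
        if n % 2 == 0:
            d = 0
        else:
            d = 2 - (n % 4)  # +1 if n = 1 (mod 4), -1 if n = 3 (mod 4)
        digits.append(d)
        n = (n - d) // 2
    if n != 0:
        raise ValueError("value does not fit in the given number of digits")
    return digits[::-1]


def group_columns(table):
    """Greedily split the columns into groups, scanning from column 0.

    Within a group every row holds at most one nonzero digit. Each group is
    extended as far as possible before a new one is opened.
    """
    n_rows, n_cols = len(table), len(table[0])
    groups = []
    start = 0
    used = [False] * n_rows  # rows that already have a nonzero digit
    for col in range(n_cols):
        if any(used[r] and table[r][col] != 0 for r in range(n_rows)):
            groups.append((start, col))  # close the current group
            start = col
            used = [False] * n_rows
        for r in range(n_rows):
            if table[r][col] != 0:
                used[r] = True
    groups.append((start, n_cols))
    return groups


if __name__ == "__main__":
    # Hypothetical 8-bit coefficients, chosen only to keep the table small.
    table = [to_csd(c, 8) for c in (68, 33)]
    print(group_columns(table))
```

For the FFT coefficients, the table would first require a fixed-point quantization of cos(π/8), cos(π/4) and sin(π/8) to N_w bits; the resulting number of groups depends on that quantization, which is not fixed by this sketch.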
By applying the proposed grouping algorithm to the three coefficients in (1) with N_w = 14, the CSD coefficient table with 5 groups is obtained as shown in Table 1. Each group G_i generates a corresponding partial product P_i. Thus, the multiplication result (Y) is obtained as
In Table 1, the number of PP rows required by the proposed algorithm is only 5, while modified Booth encoding requires 7 (= N_w/2) PP rows. Thus the proposed grouping algorithm reduces the number of PP rows by 2, which leads to lower power consumption and higher speed, as can be seen in Table 4. Each group produced by the proposed algorithm includes at least two columns, since CSD coding does not allow consecutive nonzero digits. Thus, the number of PP rows generated by the proposed algorithm is always less than or equal to that of modified Booth encoding.
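The bound can be made explicit with a short counting argument. The derivation below is a sketch: the symbol G for the number of groups is introduced here for illustration, and N_w is assumed to be even. A group is closed only when a row would receive a second nonzero digit; because CSD forbids nonzero digits in adjacent columns of the same row, this cannot happen in the second column of a group, so every group except possibly the last spans at least two columns.

```latex
\begin{align}
  2\,(G - 1) + 1 &\le N_w
    && \text{($G-1$ groups of at least two columns, last group of at least one)} \\
  G &\le \frac{N_w + 1}{2} \\
  G &\le \frac{N_w}{2}
    && \text{(since $G$ is an integer and $N_w$ is even)}
\end{align}
```

For N_w = 14 this gives G ≤ 7, consistent with the 5 groups of Table 1 and the 7 rows of modified Booth encoding.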
Partial Products Sharing
If the nonzero digit locations of two groups are the same, as in G_4 and G_1 in Table 1, the two groups can share PP generation circuits. The sign difference in the first rows of G_4 and G_1 can be taken care of later by additional complementing circuits. For any row in a group that contains only 0's, the corresponding PP is 0. In this case, the zero digits in the row can be changed to nonzero digits to share the partial product generation circuits, since the output value can be easily changed back to 0 using a control signal. The all-zero rows of G_3 and G_2 in Table 1 can be changed as shown in Table 2, where the changed digits are denoted using parentheses. Figure 2 shows the partial product bit generation circuit with control signals, where S_i, N_i and Z_i are the shift, negation, and zero control signals, respectively. Table 3 shows a new representation of each group ... in Fig. 2 can be eliminated. In addition, if S_i = S_j, the input shift patterns are the same and the shifted inputs can be shared between G_i and G_j. Since the shifted inputs are shared, it is sufficient to store either S_i or S_j. Notice that S_4, S_3, S_2 and S_1 are the same in Table 3.
Thus the shifted inputs are shared among groups G_4, G_3, G_2 and G_1, and we only need to store S_4. (For the shift operation by S_0, refer to the lower part of Fig. 3.) In the conventional approach, the coefficient look-up table (LUT) has 14 columns if the coefficient word length is 14 bits. As an example, if a modified Booth multiplier is used, all the coefficients are represented in 14-bit 2's complement format and are stored in an LUT. Then, for each multiplication operation, the corresponding stored coefficient is read from the LUT and encoded using the modified Booth algorithm for the multiplication. In the proposed approach, on the other hand, we store the encoded signals (not the coefficients in 2's complement format). Thus, if an N_i (or Z_i) signal always has a fixed encoded value (e.g., N_4, Z_4, N_3, N_2, Z_1, and Z_0 in Table 3), the N_i (or Z_i) signal can be implemented directly without storing the values in an LUT. In addition, if some groups have the same shift pattern (e.g., S_4, S_3, S_2 and S_1 in Table 3), it is sufficient to store only one shift pattern. Thus, in Table 3, we need to store only 7 columns of control signals: S_4, S_0 (2 columns), Z_3, Z_2, N_1 and N_0. In this case, the LUT size is reduced by 50% compared with the conventional 14-column LUT. Figure 3 shows the complete partial product generation circuit designed using Table 3. The generated PP bits are added using (2).
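The role of the shift, negation and zero signals can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the CSD table and groups below are hypothetical 8-bit examples rather than Table 1, the shift is taken relative to the least significant column of each group, and the actual signal encoding of Table 3 and the circuit of Fig. 2 may differ. All function names are introduced here for illustration.

```python
# Hypothetical CSD table (MSB first) for the coefficients 68, 33 and 7,
# with column groups given as half-open ranges. Illustrative values only.
TABLE = [
    [0, 1, 0, 0, 0, 1, 0, 0],   # 68 = 64 + 4
    [0, 0, 1, 0, 0, 0, 0, 1],   # 33 = 32 + 1
    [0, 0, 0, 0, 1, 0, 0, -1],  # 7 = 8 - 1
]
GROUPS = [(0, 5), (5, 8)]


def control_signals(table, groups):
    """Return, per group, its fixed weight and one (S, N, Z) triple per row.

    S is the shift within the group, N is 1 for a negative digit, and Z is 1
    when the row has no nonzero digit in the group (S and N are then unused
    and set to 0 here).
    """
    n_cols = len(table[0])
    result = []
    for start, stop in groups:
        weight = n_cols - stop  # position of the group's least significant column
        rows = []
        for row in table:
            nonzero = [c for c in range(start, stop) if row[c] != 0]
            if nonzero:
                col = nonzero[0]
                rows.append((stop - 1 - col, int(row[col] < 0), 0))
            else:
                rows.append((0, 0, 1))
        result.append((weight, rows))
    return result


def multiply(x, coeff_index, signals):
    """Add one partial product per group for the selected coefficient."""
    total = 0
    for weight, rows in signals:
        shift, negate, zero = rows[coeff_index]
        pp = 0 if zero else (x << shift)
        if negate:
            pp = -pp
        total += pp << weight
    return total


def is_fixed(rows, field):
    """True if a signal (0 = S, 1 = N, 2 = Z) is the same for every coefficient."""
    return len({row[field] for row in rows}) == 1


if __name__ == "__main__":
    signals = control_signals(TABLE, GROUPS)
    print([multiply(3, k, signals) for k in range(len(TABLE))])
```

A signal for which `is_fixed` holds does not vary with the selected coefficient, which corresponds to the case described above where the signal can be hard-wired instead of stored in the LUT.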
The sharing algorithm of PP generation circuits can be summarized as follows: 
Comparison of Performances
To evaluate the performance of the proposed method, the multipliers in the radix-2^4 FFT and in the pulse-shaping filter used in CDMA [4] have been designed and synthesized using MagnaChip 0.18-μm CMOS technology. We employed a Wallace tree for the addition of partial products and a carry lookahead adder for the vector merging part. Table 4 shows the Synopsys simulation results. The power consumption was estimated using Synopsys Design Analyzer. In the FFT application, compared with the conventional modified Booth multiplier, the proposed method reduces the area, power consumption and propagation delay by 41%, 45% and 12%, respectively. In the pulse-shaping filter application, the proposed method reduces the area, power consumption and propagation delay by 63%, 70% and 24%, respectively. In addition, Table 4 shows that the proposed method performs much better than the modified Booth-based approach for predetermined coefficient groups [4].
Conclusions
Based on the proposed grouping and sharing algorithms, this letter presents an efficient multiplier design method for predetermined coefficient groups. The simulation results show that the proposed method can be successfully applied to the multiplication blocks in the radix-2^4 FFT and the pulse-shaping filter in CDMA.
