Abstract: This paper presents several new array multiplier architectures for reducing switching activity in general digital signal processing (DSP) applications. A cellular structure is described which can be used to obtain any array multiplier suitable for a given application. The switching activity at the output nodes of the cells in this structure is analyzed and compared with a tree multiplier based on 4:2 compressors. It is shown that the relative improvement in power is a function of statistical properties of the signal and most structures out-perform all others for specific signal conditions. It is also shown that selection of an appropriate array architecture can give up to 40% reduction in switching activity compared to a tree multiplier, and more than 3 times less switching activity compared to the widely used least-significant-bit-first array multiplier for commonly occurring situations. We also outline applications of the proposed structures to the areas of low power quantization, reconfigurable computing and high-level synthesis for low power.
I. Introduction
DSP algorithms are dominated by three basic operations: add, shift and multiply. Multiplication operations are considered to be the dominant computation in DSP algorithms [1] and are equally important in dynamic power dissipation. Over the past few years, a number of papers have addressed multiplier topologies for a variety of applications [2], [3], [4]. In particular, array structures proposed in [3] address pipelining of recursive digital filters using most significant bit (MSB) first digit serial arithmetic.
In this paper, we explore array structures from the point of view of dynamic power dissipation. Contrary to the expectation that any ordering of an array multiplier would yield similar dynamic power dissipation, we will show that a more than 3 times reduction in switching activity may be possible compared to the commonly used least significant bit (LSB) first array multipliers (also known as right-left multipliers), depending on the characteristics of the input signals. Computations in DSP algorithms are governed by the statistical properties of the underlying process generating data. In general, data signals are correlated and rapidly changing data is seldom processed. Therefore we explore the effects of signal statistics on the output switching activity in various array structures in order to assess the feasibility of using a given structure under the condition of known or predictable signal statistics. We show that re-ordering of partial product addition can result in a significant reduction in switching activity (hence, dynamic power) if the signal statistics are known a priori. This observation leads to new array multiplier architectures which form hybrids of MSB-first and LSB-first structures. We also discuss the application of such multipliers to low power implementation of DSP algorithms and to the general area of reconfigurable computing. In particular, we (i) propose hybrid-array structures which combine LSB-first and MSB-first types of array multipliers; (ii) compare the switching characteristics of array multipliers with a tree multiplier based on 4:2 compressors to show the region of strength of each architecture; and (iii) provide new insights in the areas of low power design, reconfigurable computing and high-level synthesis.
This work was supported in part by DARPA F33615-95-C-1625, NSF CAREER award 9501869-MIP, Rockwell, AT&T and Lucent foundation.
II. Multiplier Architectures
We will first present a simple framework for obtaining various types of array multipliers. Figure 1 shows a template for a cellular array structure which serves as the basis for generating different types of 8-bit array multipliers. Each location in this matrix can be occupied by a cell which can be an AND gate (AND), a half adder (HA) or a full adder (FA). In the sequel, the cell at row $i$ and column $j$ will be referred to as $c_{i,j}$. As an example, the cells on the four corners are shown labeled in the figure.
[Figure 1: Template of the cellular array structure for 8-bit array multipliers. Rows 0 to 7, columns 0 to 15, product outputs p0 to p15; cell labels: A (AND), H (HA), F (FA).]
The goal of an array multiplier is to add the partial products from cells which occupy the same column. The order in which these partial products are added is not important, as we only need to ensure that only the partial products in the same column are added, in addition to the carries generated from the cells in the adjacent column on the right. Hence, one can exchange rows 3 and 7 as shown in figure 1. Cells in row 3 after moving to row 7 are shown by cells shaded by circles. The cells in row 7 after moving to row 3 are shown by dark filled cells. Now, we only need to ensure that carries generated from these are correctly added, which may require extra cells.
Let $R = \{r_0, r_1, r_2, \ldots, r_{N-1}\}$ be the set of indices which represents an ordering of successive additions of rows of partial products. Then, the ordering given by $r_i = i$ for $i = 0, 1, \ldots, N-1$ expresses the LSB-first multiplier shown in figure 1. The MSB-first multiplier can be expressed similarly by the ordering $r_i = N-1-i$ for $i = 0, 1, \ldots, N-1$. Clearly, there are $N!$ ways to construct array multipliers. Each of these multipliers may be constructed using propagation of carry in either ripple form or carry-save (CS) form or a combination of these.
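The claim that any row ordering yields a valid multiplier can be checked with a small behavioral sketch. The code below is only an arithmetic model, not a gate-level description of the cellular array: it accumulates the partial-product rows in the order given by $R$ and ignores how carries are routed. The function names are hypothetical and chosen for illustration.

```python
from itertools import permutations


def lsb_first_order(n):
    """Ordering r_i = i (LSB-first)."""
    return list(range(n))


def msb_first_order(n):
    """Ordering r_i = N - 1 - i (MSB-first)."""
    return [n - 1 - i for i in range(n)]


def multiply_with_order(a, b, order):
    """Add the partial-product rows a_r * b * 2**r in the given row order."""
    acc = 0
    for r in order:
        a_r = (a >> r) & 1          # bit of operand A that gates row r
        acc += (a_r * b) << r       # row r contributes B shifted by r when a_r = 1
    return acc


def all_orderings_agree(n):
    """Check every one of the N! orderings against the true product."""
    for order in permutations(range(n)):
        for a in range(1 << n):
            for b in range(1 << n):
                if multiply_with_order(a, b, order) != a * b:
                    return False
    return True
```

For $N = 4$ the exhaustive check covers all 24 orderings; the orderings differ only in which rows are evaluated first, which is what affects switching activity in the hardware.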
A. LSB-First Multipliers
The LSB-first multiplier can be constructed either using the CS format shown in figure 1, or by using a ripple carry structure. We will refer to the former as the LSB-first CS multiplier and the latter as the LSB-first RP multiplier. The LSB-first RP multiplier is the most well-known and widely used array structure for multiplication and is obtained from the cellular [...] is shown on the right in figure 3. We will refer to the former as the hybrid LSB-first multiplier and the latter as the hybrid MSB-first multiplier. Both of these can be constructed either by using ripple carry or by using the CS format. Hence, there are four ways to implement a hybrid multiplier which puts the $L$ top-most rows of one type of multiplier above the other, i.e. LSB-first over MSB-first or vice versa. The multiplier on the left in figure 3 puts the $L = 3$ top rows of the LSB-first CS multiplier over the $N - L = 5$ top rows of the MSB-first CS multiplier. We will refer to such a multiplier as the hybrid LSB-first CS-CS multiplier with $L = 3$. Similarly, the multiplier on the right in figure 3 puts the $L = 3$ top-most rows of the MSB-first RP multiplier over the $N - L$ top-most rows of the LSB-first CS multiplier. This multiplier will be referred to as the hybrid MSB-first RP-CS multiplier with $L = 3$. We can obtain three more types of $L = 3$ hybrid multipliers for each of these cases by considering the remaining three combinations of adding carries in the two parts of the multiplier. Each type of hybrid multiplier implementation requires a different overhead and has a different length of critical path. We only consider implementations which place $L$ consecutive rows of one type of multiplier over the other. The reason for focusing on such architectures is that DSP applications process data streams whose properties can only be predicted or controlled over a part of the word-length. For example, if the signal strength reduces, consecutive MSBs of the data-stream become zeros, assuming a sign-magnitude representation.
Similarly, "less important" data values may be further quantized by truncating some LSBs, thereby resulting in the data-stream having zeros at the corresponding locations. It will be shown that the proposed hybrid multipliers yield substantial improvement in switching activity reduction compared to a tree multiplier constructed using 4:2 compressors, as well as the simple LSB-first or MSB-first multipliers, under appropriate signal conditions. The multiplier structure shown on the left in figure 3 is entirely a CS structure, and its speed can be increased by using a carry select structure similar to the one proposed in [3]. The multiplier on the right in figure 3 has the same delay as an LSB-first CS array multiplier despite the fact that the MSB-first part ripples the carry. The reason for considering this structure is that it requires fewer overhead cells to ensure that all partial product sums and carries are added at the appropriate locations.
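The row orderings of the two hybrid families can be written down directly from the description above. The sketch below is an illustration under the assumption that a hybrid is fully characterized, at the level of row ordering, by which type sits on top and by $L$; the carry format (RP or CS) of each part is not modeled. The function name `hybrid_order` and its `top` parameter are hypothetical.

```python
def hybrid_order(n, l, top="lsb"):
    """Row ordering for a hybrid array multiplier.

    top="lsb": L top rows of the LSB-first ordering placed over the
               N - L top rows of the MSB-first ordering.
    top="msb": L top rows of the MSB-first ordering placed over the
               N - L top rows of the LSB-first ordering.
    """
    if not 0 <= l <= n:
        raise ValueError("L must lie between 0 and N")
    lsb = list(range(n))               # r_i = i
    msb = list(range(n - 1, -1, -1))   # r_i = N - 1 - i
    if top == "lsb":
        return lsb[:l] + msb[:n - l]
    if top == "msb":
        return msb[:l] + lsb[:n - l]
    raise ValueError("top must be 'lsb' or 'msb'")
```

For $N = 8$ and $L = 3$, `hybrid_order(8, 3, "lsb")` gives rows 0, 1, 2 followed by rows 7 down to 3, and `hybrid_order(8, 3, "msb")` gives rows 7, 6, 5 followed by rows 0 up to 4. Each result is a permutation of all $N$ rows, so it is one of the $N!$ orderings of the cellular template.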
III. Switching Characteristics of Multipliers
Let us first consider LSB-first multipliers. A close observation of the multiplier in figure 1 shows that if successive inputs are applied such that their LSBs are zeros in operand A, the corresponding top rows of the multiplier will be turned off, as the evaluated partial products would all be zeros. Any input which has $a_0 = 1$ will place the vector B at the output of the first row of partial product outputs. These values will propagate downwards even if the next LSBs in A are all zeros. Hence, switching activity can only be reduced if successive inputs applied at the input A ensure that when a bit $a_j$ is 1, all $a_i$ are zeros for $i < j$. Similarly, we notice that if the successive inputs applied at the B inputs are such that the $L$ MSBs are zeros, then the cells $c_{i,j}$ such that $j \geq 2N-1-i-L$ along the diagonal columns of partial product generators in the cellular array are all turned off. Hence, there are no sum or carry output transitions in these cells. Low overall switching activity can therefore be ensured if the inputs applied to this multiplier are ordered so that they cause smaller switching activity. Similar observations are made for the MSB-first and hybrid multipliers. The "best" input conditions for these multipliers are summarized in table I and can be verified by studying figures 1-3.
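The row-level part of this argument can be illustrated with a short sketch. It counts how many consecutive rows at the top of a given ordering stay idle for an entire input stream, under the simplifying assumption that a row is idle exactly when its gating bit of A is zero for every sample. Carry and sum cells are not modeled, and the function name is hypothetical.

```python
def idle_top_rows(order, a_samples):
    """Count leading rows of `order` whose A bit is 0 for all samples.

    A row whose gating bit a_r stays at 0 produces all-zero partial
    products, so it never switches. Counting stops at the first row
    that is active for at least one sample, because its outputs then
    propagate into the rows below it.
    """
    count = 0
    for r in order:
        if any((a >> r) & 1 for a in a_samples):
            break
        count += 1
    return count


# Example streams for an 8-bit multiplier.
truncated = [0b10101000, 0b01001000, 0b11111000]   # three LSBs of A are zero
weak = [0b00000101, 0b00011111]                    # three MSBs of A are zero
lsb_first = list(range(8))
msb_first = list(range(7, -1, -1))
```

With these streams, the LSB-first ordering keeps three top rows idle for `truncated` and none for `weak`, while the MSB-first ordering behaves the other way round.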
A. Signal Models
We will quantify switching activity reduction by considering two signal models. In the first model we only vary the [...]
The switching activity of each multiplier was evaluated by counting the number of switches at each output of every cell in the multiplier. We assumed that the multipliers were delay balanced as suggested in [7] and used a zero-delay model for computing the switching activity. Let $S_c$ denote the switching count of cell $c$. The possible cells in a multiplier are an AND gate, a HA, a FA and a 4:2 compressor (the 4:2 compressor appears in the tree multiplier). The corresponding switching metrics, which express the switch counts in these cells, will be represented by $S_{AND}$, $S_{HA}$, $S_{FA}$ and $S_{4:2}$, respectively. The total switching metric was obtained using the following weighting: 2 for $S_{AND}$, and 3 for $S_{HA}$, $S_{FA}$ and $S_{4:2}$ (the weight reflects the output load capacitance driven by the gate output). These relative weighting factors were obtained by considering the pin loading of a typical cell in the array configuration. In addition, the switches at the input pins were counted separately for the given simulation and multiplied by $N$ to account for input buffer drivers. The total switch counts at all outputs including input pins, weighted by the corresponding factor, were summed to obtain the switching metric for the multiplier. These weightings yield a metric which expresses the total switched capacitance in the multiplier for the given input conditions.
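The weighting scheme can be sketched in a few lines. The code below is an illustration rather than the simulator used for the reported results: only the outputs of the AND cells are simulated (under a zero-delay model, starting from an assumed all-zero state), while the switch counts of the adder cells are passed in as numbers. All function names are hypothetical.

```python
# Weights from the text: 2 for AND outputs, 3 for HA, FA and 4:2 outputs.
WEIGHTS = {"AND": 2, "HA": 3, "FA": 3, "4:2": 3}


def toggles(prev, curr):
    """Number of bit positions that differ between two output words."""
    return bin(prev ^ curr).count("1")


def and_cell_switches(samples, n):
    """Zero-delay toggle count at the outputs of the n x n AND cells.

    `samples` is a list of (a, b) operand pairs applied in sequence.
    Row i outputs the word b when bit a_i is 1 and all zeros otherwise.
    """
    mask = (1 << n) - 1
    total = 0
    prev_rows = [0] * n  # assumption: the array starts from an all-zero state
    for a, b in samples:
        rows = [(b & mask) if (a >> i) & 1 else 0 for i in range(n)]
        total += sum(toggles(p, c) for p, c in zip(prev_rows, rows))
        prev_rows = rows
    return total


def switching_metric(cell_switches, input_switches, n):
    """Weighted sum of cell output switches plus N times the input pin switches."""
    weighted = sum(WEIGHTS[kind] * count for kind, count in cell_switches.items())
    return weighted + n * input_switches
```

For example, applying $(A, B) = (1, 15)$ and then $(0, 15)$ to a 4-bit array toggles the four AND outputs of row 0 twice, giving 8 switches at the AND cells.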
A similar metric was obtained for the tree multiplier by using the same input signals. We will let $S_{Array}$ and $S_{Tree}$ denote the switching metrics for the array and tree multipliers, respectively, for the given input signal conditions. Then the relative advantage of using the array multiplier will be referred to as percentage switching reduction and defined as
$$\Delta_{Tree} = \frac{S_{Tree} - S_{Array}}{S_{Array}} \times 100$$
This quantity shows the relative performance of an array multiplier with respect to the tree structure. A similar quantity can be obtained for comparing the relative performance of any two multipliers. Figure 4 shows one such metric computed using the LSB-first CS multiplier as the reference for normalization. The relative advantage in comparison to the LSB-first CS multiplier is obtained by using $S_{LSB\text{-}First\,CS}$ in place of $S_{Tree}$ in the above equation. This quantity will be represented by $\Delta_{Array}$. It is noted that a switching reduction of up to 200% (3X smaller) is possible for appropriate signal conditions when using a hybrid multiplier in comparison to the LSB-first CS multiplier. These results were obtained using 1000 randomly generated vectors from the U model.
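A brief numeric check makes the scale of this measure concrete. Because the difference is normalized by the metric of the array multiplier, a reduction of 200% corresponds to a reference metric three times as large as that of the array. The function below is a hypothetical helper, and the metric values in the comments are made-up inputs rather than measured results.

```python
def percent_switching_reduction(s_reference, s_array):
    """Percentage switching reduction: (S_ref - S_array) / S_array * 100."""
    if s_array <= 0:
        raise ValueError("the array switching metric must be positive")
    return 100.0 * (s_reference - s_array) / s_array


# A reference that switches three times as much as the array gives 200%.
# percent_switching_reduction(300, 100)
# A negative value means the reference multiplier switches less than the array.
# percent_switching_reduction(80, 100)
```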
B.1 Switching Activity Trends
Figures 5-7 show the contours of the percentage switching reduction $\Delta_{Tree}$ for various types of 32-bit array multipliers. Figure 5 shows the contours for the LSB-first CS (on the left) and RP (on the right) multipliers, respectively. We observe that $\Delta_{Tree}$ changes quickly with the signal strength of B in both these cases. For small values of B, the RP multiplier shows a steeper contour compared to CS. However, the range of signal values over which the LSB-first multiplier shows improvement over the tree multiplier is larger in the CS case. Maximum reduction in switching activity is clearly at the top-left corner, where B is small and A is large (columns of cells in the multiplier switch off). Hence, gains are positive if B is kept small while A fluctuates over the entire range of its values. Figure 6 shows the contours for the MSB-first multipliers, which are almost like mirror images of the plots shown in figure 5. This follows by construction of the multipliers, since LSB-first multipliers appear as the mirror images of MSB-first multipliers. Hence we note that the multiplier behavior is reversed with respect to the signal properties at the inputs.
We also note that the range of signal values of A over which the MSB-first RP multiplier out-performs a tree is larger than for the MSB-first CS multiplier. Further, it gives better gains as compared to the CS multipliers, both MSB-first and LSB-first. The main reason for this behavior is the smaller vector merge stage compared to the MSB-first CS multiplier. The worst case performance of all four multipliers is very close to each other. Figure 7 shows the contours obtained for hybrid multipliers with $L = 2$. The region which appears blank is the region where no computation is required in our model (small value of A annihilated by truncation). The contours for the hybrid LSB-first multiplier are almost diagonal in the region where the tree out-performs the hybrid, which shows that the surface is almost planar. In the region where the hybrid multiplier out-performs the tree, the $\Delta_{Tree}$ surface is very steep and the proposed multiplier shows switching reduction over the tree multiplier at small signal strength of A. The contours of the hybrid MSB-first multiplier show different trends. It shows large sensitivity to the strength of B when A has a large value spanning more than 29 bits. Since this multiplier places $a_{31}$ and $a_{30}$ on top of an LSB-first multiplier, the switching activity in the multiplier increases dramatically when these bits are high. This explains the contour bending at the top of the figure for the hybrid MSB-first multiplier. This multiplier shows gains where an MSB-first multiplier also shows gains. However, there is no switching activity when the value of A becomes smaller than $L$ bits. It is the region where the hybrid multiplier "automatically" shuts off while the inputs are applied continually. If we know the statistical properties of the inputs, we can choose a structure which causes a maximal portion of the multiplier to be switched off during most computations.
We observe that a hybrid structure is an augmented non-hybrid multiplier (LSB-first or MSB-first) which tries to ensure that less overall switching results during most computations.
B.2 Switching Activity for Correlated Signals
We now consider the performance of the presented multipliers using the G model. For this purpose we applied data samples obtained from a Gaussian distribution for different signal strengths varying from 1 to $N-1$ bits. The correlated Gaussian signals were generated using an auto-regressive AR(1) model [1]. The results shown in each figure in this section were generated using 10,000 data vectors for each signal condition. Results are shown only for 8-bit multipliers, as they are consistent for multipliers of all sizes. The most striking observation of figure 8 (left) is the fact that iso-switching contours (i.e. contours showing constant switching activity) are circular. This shows that the switching activity is not more dependent on one of the inputs. Clearly, the balanced nature of the tree multiplier reveals itself even in the switching activity. Identical behavior was observed in the contours for other combinations of signal correlations (plots not shown). As the signal strength increases on any input, so does the switching activity. When one input signal is uncorrelated, the surface of $S$ is almost identical. The differences in $S_{Tree}$ shown in the figures reveal that in this case, the switching activity is more sensitive to the correlated input. When both input signals are highly correlated, the switching activity is reduced over the entire range. The trends in figure 9 (right) show some interesting properties. The contour lines are almost diagonal when one input has a weak signal. As one of the signals becomes strong, peaks appear in the surface. Hence, $S_{Tree}$ decreases more in these regions of input signal conditions. The maximum normalized switching reduction was found to be 28% for correlation coefficients $\rho_A = 0$, $\rho_B = 0.95$ and 32% for $\rho_A = 0.95$, $\rho_B = 0$ in the case when the uncorrelated signal was weak. The reduction was only 2% when both signals had maximum strength. The corresponding values for the $\rho_A = \rho_B = 0.95$ case were a 62.3% reduction when both signals were weakest and 5% when both had maximum strength.
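A correlated Gaussian test stream of the kind described here can be sketched with a first-order auto-regressive recursion. The code below is one common way to build such a stream and is not taken from the paper: the unit-variance scaling of the driving noise, the scaling of the peak sample to the chosen signal strength, and all function names are assumptions made for illustration.

```python
import math
import random


def ar1_gaussian(n_samples, rho, seed=0):
    """Unit-variance AR(1) sequence x[n] = rho * x[n-1] + sqrt(1 - rho^2) * w[n]."""
    rng = random.Random(seed)
    scale = math.sqrt(1.0 - rho * rho)   # keeps the variance at 1 for |rho| < 1
    x = rng.gauss(0.0, 1.0)
    out = [x]
    for _ in range(n_samples - 1):
        x = rho * x + scale * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def quantize_sign_magnitude(samples, bits):
    """Map samples to (sign, magnitude) pairs with magnitudes below 2**bits.

    Assumption: the largest sample magnitude is scaled to the full range
    of the chosen signal strength.
    """
    peak = max(abs(s) for s in samples) or 1.0
    full = (1 << bits) - 1
    return [(1 if s < 0 else 0, int(round(abs(s) / peak * full))) for s in samples]


def lag1_correlation(samples):
    """Sample estimate of the correlation between successive values."""
    n = len(samples)
    mean = sum(samples) / n
    num = sum((samples[i] - mean) * (samples[i + 1] - mean) for i in range(n - 1))
    den = sum((s - mean) ** 2 for s in samples)
    return num / den
```

Sweeping `bits` from 1 to $N-1$ and `rho` over the values of interest gives one stream per signal condition; `lag1_correlation` can be used to confirm that a generated stream has roughly the intended correlation.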
Maximum gains were observed along the regions where one signal was weak and the other strong. Hence, strong correlation reduces switching activity in a tree automatically. Figure 10 presents the effect of correlated signals for hybrid multipliers. In all these examples, the effect of $\rho_B$ is negligible; however, high $\rho_A$ causes the gains to improve in the region where the hybrid multiplier out-performs the tree multiplier. The effect of correlation on the switching activity was observed to be very small in the case of LSB-first and MSB-first multipliers. In all the simulations, we found that the behavior of the arrays tracks the behavior of the tree multiplier when correlations are introduced in the inputs. It is noted that many options exist when signals have very high time-correlation and the system is linear and time-invariant. One approach in such a case is to difference the data and reduce its dynamic range. Two overhead add operations are required to reconstruct the correct output. This approach can significantly reduce the size of the operands in multiplication if signals continue to have high correlation. The results shown in this section clearly indicate that signal correlations have a small effect on the switching activity for all multipliers. It is the signal strength at the inputs which almost completely determines the switching in the multiplier.
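The differencing approach can be illustrated for the simplest linear time-invariant case, multiplication by a constant coefficient. The sketch below assumes integer data and a zero initial state; it is an arithmetic illustration of the idea, not a description of a specific datapath, and the function names are hypothetical.

```python
def multiply_direct(c, xs):
    """Reference: multiply every sample by the constant coefficient c."""
    return [c * x for x in xs]


def multiply_differenced(c, xs):
    """Multiply the sample-to-sample difference and accumulate the result.

    y[n] = y[n-1] + c * (x[n] - x[n-1]), which equals c * x[n] because the
    multiplication is linear. The multiplier only sees the small difference.
    """
    ys = []
    prev_x = 0   # assumption: zero initial state
    prev_y = 0
    for x in xs:
        d = x - prev_x          # overhead add 1: difference the input
        y = prev_y + c * d      # overhead add 2: reconstruct the output
        ys.append(y)
        prev_x, prev_y = x, y
    return ys


def difference_magnitudes(xs):
    """Magnitudes of the operands seen by the multiplier after the first sample."""
    return [abs(b - a) for a, b in zip(xs, xs[1:])]
```

For a slowly varying stream such as 1000, 1003, 1001, 998, 1002, the differences after the first sample have magnitude at most 4, so the multiplier operand needs about 3 bits instead of 10.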
B.3 Area Comparison
The LSB-first CS and MSB-first CS multipliers were implemented in CMOS using 0.6 µm technology. Both of these structures were implemented after inverter elimination simplifications for the partial product generator rows. Cells were implemented for both non-inverted and inverted outputs, and the bottom-most row constituted a vector merge adder for converting the CS format to regular representation. The layout areas of the two multipliers are shown in table II for the purpose of comparison. MSB-first CS adds a wiring overhead which results in an increased area. This is because the carry signal must be propagated one cell further in a rectangular layout. These values can be used to approximately estimate the area overhead of using hybrid multipliers.
As discussed earlier, non-dedicated DSP systems generally employ multipliers whose size is determined by the performance requirements of the most computationally expensive intended application. An application in a general DSP system with fixed resources may not require the full precision offered by the resource. In such a situation, the power dissipation of the computational unit can be significantly reduced by appropriate use of the resource. Such quantizations have been proposed in [5], [6] without considering supporting multiplier architectures. We further note that these results are also useful in formulating a strategy for employing variable word-length computing, in which different tasks of a DSP algorithm are computed with different precisions without significantly degrading the overall system performance.
As is evident from the results presented in previous sections, the following two conditions must be met: first, an appropriate multiplier architecture should be selected, and second, correct input conditions must be provided such that reduced switching activity is guaranteed. Quite clearly, it is not enough to ensure only one of these conditions. For example, if we truncate the LSBs of the B input in an LSB-first multiplier, it will not help reduce switching activity. Further, it is also important to ensure that favorable signal conditions are maintained at the inputs consistently. Each architecture yields gains only for particular signal conditions.
B. Reconfigurable Computing
The cellular array structure presented in section II is the most general template from which any array multiplier can be formed. In applications where reconfigurability is sought for the application at hand, one may use the underlying structure proposed in this paper to form any of the $N!$ possible multiplier architectures. It is noted that reconfigurability desired specifically for reduction of switching activity may not achieve that goal because of the overheads involved. In general, these overheads reduce the speed of the application as well as increase the overhead power. However, for specific applications where the structure of the data stream is well-known, a reconfigurable multiplier may be employed which eliminates the undesired rows of the multiplier to form an appropriate hybrid multiplier in order to increase the speed of multiplication. In such a case, the interpretation of array multipliers presented in section II and the template described in figure 1 can prove to be extremely useful.
C. High Level Synthesis Based on Signal Characteristics
We have shown that each array multiplier offers advantages for specific signal conditions. Maximum reduction in switching activity can be achieved by scheduling and allocating operations such that favorable input conditions are ensured at the inputs of the multipliers employed in the implementation. Hence, existing high-level synthesis tools can be improved such that they consider the expected signal behavior at various points of the algorithm while arriving at an implementation. Note that the condition of ensuring favorable signal conditions at the multiplier inputs also reduces bus power, since these conditions must be met consistently between successive data samples. This work shows that an appropriate choice of array multiplier assures that a reduction in switching activity in the input bus to the multiplier is reflected as reduced switching activity in the multiplier. Hence, one can reduce the power dissipation in a data-path by careful scheduling and allocation of instructions based on the expected statistical properties of the data being processed.
V. Conclusion
We presented several new array multiplier architectures for reducing switching activity in general DSP applications. A general cellular structure was presented which provides a unified view of all $N!$ possible N-bit array multipliers. The switching activity at the output nodes of the cells in various multiplier structures was analyzed and compared with a tree multiplier and an LSB-first CS array multiplier. It was shown that the relative improvement in power is a function of statistical properties of the input signals. It was also shown that selection of an appropriate array architecture can give up to 40% reduction in switching activity compared to a tree multiplier, and more than 3 times reduction in switching activity compared to the widely used LSB-first array multiplier for commonly occurring situations. We also outlined applications of the proposed multipliers and the presented results to the areas of low power quantization, reconfigurable computing and high-level synthesis for low power.
