Switching Characteristics of Generalized Array Multiplier Architectures and their Applications to Low Power Design by Muhammad, Khurram et al.
Purdue University
Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering
3-1-1999
Switching Characteristics of Generalized Array
Multiplier Architectures and their Applications to
Low Power Design
Khurram Muhammad
Purdue University School of Electrical and Computer Engineering
Dinesh Somasekhar
Purdue University School of Electrical and Computer Engineering
Kaushik Roy
Purdue University School of Electrical and Computer Engineering
Muhammad, Khurram; Somasekhar, Dinesh; and Roy, Kaushik, "Switching Characteristics of Generalized Array Multiplier
Architectures and their Applications to Low Power Design" (1999). ECE Technical Reports. Paper 37.
http://docs.lib.purdue.edu/ecetr/37
Switching Characteristics of Generalized
Array Multiplier Architectures and their
Applications to Low Power Design¹
Khurram Muhammad, Dinesh Somasekhar and Kaushik Roy
Email: khurram@ecn.purdue.edu, somasekh@ecn.purdue.edu and kaushik@ecn.purdue.edu
School of Electrical and Computer Engineering,
Purdue University, West Lafayette, IN 47907
February 22, 1999.
Abstract 
This paper presents several new array multiplier architectures for reducing the switching activity in general
digital signal processing applications. A general cellular structure is described which can be used to obtain
any array multiplier suitable for a given application. The switching activity at the output nodes of the
cells in this structure is analyzed and compared with a tree multiplier based on 4:2 compressors. It
is shown that the relative improvement in power is a function of the statistical properties of the signal. It
is also shown that selecting an appropriate array architecture can give up to 40% reduction in switching
activity compared to a tree multiplier, and more than 3 times less switching activity compared to the
widely used least-significant-bit-first array multiplier for commonly occurring situations. We also outline
applications of the proposed multipliers to the areas of low power quantization, reconfigurable computing
and high-level synthesis for low power.
¹This work was supported in part by DARPA (F33615-95-C-1625), NSF CAREER award (9501869-MIP), Rockwell, AT&T
and Lucent foundation.
With the recent trend of increasing mobility and performance in small hand-held mobile communication
and portable computing equipment, low power has become an important design factor. New features are
continually provided using DSP algorithms which are dominated by three basic operations: add, shift and
multiply. Many DSP algorithms can be implemented such that the data is processed in carry save (CS)
format, as this format yields zero cost of accumulation [1] in the multiply-and-accumulate (MAC) operation.
The conversion of the result to normal binary form can be delayed for as long as possible for the given
algorithm, since this results in a significantly faster implementation. Consider, for example, a digital filter
implementation. In such an application, the intermediate result, which is the accumulation of a given inner
product of data and the coefficients, can be kept stored in CS format, with the CS to binary conversion
taking place only after the final result is computed in CS form. Consequently, multiplier architectures
processing data in CS format are of particular interest.
Multiplication operations are considered to be the dominant computation in DSP algorithms [2], [3].
Since computation directly results in dynamic power consumption [4], multiplication is an equally important factor
when considering the dynamic power dissipation of such algorithms. In general, high-performance DSP
architectures are required in mobile units which process data at high transmission rates, or in a portable
computer providing advanced multimedia features. For this reason, such units are generally constructed
with pipelined array multipliers. If the latency of the pipelined architecture is an important consideration,
a pipelined tree multiplier can be used. Both types of multipliers can be easily pipelined using the
conventional register based approach, or by using wave pipelining. Over the past few years, a number of
papers have addressed multiplier topologies for a variety of applications [1], [6], [7]. In particular, array
structures proposed in [6] address pipelining of recursive digital filters using most significant bit (MSB)
first digit serial arithmetic. However, to the best of our knowledge, no work has been reported in the literature
which addresses dynamic switching activity trade-offs between popular multiplier architectures as a function
of the statistical properties of the inputs.
In this paper, we explore array structures from the point of view of dynamic power dissipation. Contrary
to the expectation that any ordering of an array multiplier would yield similar dynamic power dissipation
performance, we will show that more than 3 times reduction in switching activity may be possible compared
to the commonly used least significant bit (LSB) first array multipliers (also known as right-left multipliers),
depending on the characteristics of the input signals. This is because a salient feature of computation
in DSP algorithms is that the computations are governed by the statistical properties of the underlying
process generating the data. In general, data signals are correlated and consequently, rapidly changing data is
seldom processed. Hence, we will explore the effects of signal statistics on the output switching activity in
various array structures in order to assess the feasibility of using a given structure under the condition of
known or predictable signal statistics. We will show that re-ordering of partial product addition can result
in significant reduction in switching activity (hence, dynamic power) if the signal statistics are known a
priori. This observation leads to new array multiplier architectures which form hybrids of MSB-first and
LSB-first structures. We also discuss the application of such multipliers to low power implementation of
DSP algorithms and to the general area of reconfigurable computing.
The main objective of this work is to identify what type of architectures are best suited for processing
signals with known statistical properties for reduced dynamic power dissipation. There are three major
contributions of this work:
• We propose hybrid-array structures which combine LSB-first and MSB-first types of array multipliers.
For appropriate signal conditions, these structures are shown to significantly reduce dynamic power
dissipation.
• The switching characteristics of array multipliers are compared with a tree multiplier based on 4:2
compressors as well as the most commonly used LSB-first multiplier to show the region of strength
of each architecture. Hence, this work can be used to formulate an appropriate strategy for selecting
the best order of partial product addition for reducing power dissipation in a given DSP task.
Alternatively, when processing signals with known statistical properties, one can formulate a strategy for
applying signals to the multiplier inputs in an order which most effectively reduces dynamic power
dissipation.
• The architectures presented in this paper provide new insights to the general area of low power design
and reconfigurable computing.
This paper is organized into five sections. Section II describes the array multiplier architectures considered
in this work. Section III presents a simulation based study of the switching characteristics of output nodes
in the architectures considered. The signal models used to compute the performance of these multipliers
are also explained in this section. Section IV discusses the applications of these structures to general
signal processing algorithms. Finally, section V concludes this paper.
We will first present a simple approach for obtaining various types of array multipliers. Figure 1 shows
a template for a cellular array structure which serves as the basis for generating different types of 8-bit
array multipliers. Each location in this matrix can be occupied by a cell which can be an AND gate (AND),
a half adder (HA) or a full adder (FA). In the sequel, the cell at location $(i, j)$ will be referred to as $c_{i,j}$.
As an example, the cells on the four corners are shown labeled in the figure. Let $A = a_0, a_1, \ldots, a_{N-1}$ and
$B = b_0, b_1, \ldots, b_{N-1}$ represent the input vectors applied at right and top, respectively. The output is
represented by $P = p_0, p_1, \ldots, p_{2N-1}$. Then each partial product $a_i \cdot b_j$, where $i, j = 0, 1, \ldots, N-1$, must
be added in the appropriate relative position to obtain the correct value of $P$. In figure 1 we have shown
the structure of the LSB-first type array multiplier by the colored cells comprising a parallelogram. In this
figure, the continuous lines show the presence of connections, while the dashed lines show their absence.
Hence, the active connections in a CS type of array multiplier are shown using the continuous lines. The
connections from primary inputs to the appropriate cells are not shown explicitly, and are assumed implicit
to reduce clutter. By counting the number of active inputs, one can determine the type of cell. Hence,
the cells in row #0 are all AND gates, whereas the seven rightmost cells in row #1 are HAs. The cells
accepting three active inputs are FAs. Note that the inputs are counted by considering the implicit input
$a_i \cdot b_j$ which is not shown. The resulting CS array multiplier structure is shown on the right in figure 1
for clarity.
Fig. 1.  Basic template for constructing array multipliers. 
Now, the goal of an array multiplier is to add the partial products from cells which occupy the same
column in the cellular array structure shown in figure 1. The order in which these partial products are
added is not important; we only need to ensure that only the products in the same column are added (in
addition to the carries generated from the cells in the adjacent column on the right). Hence, one can exchange
rows #3 and #7 as shown in figure 1. Cells in row #3 after moving to row #7 are shown by cells shaded
by circles. The cells in row #7 after moving to row #3 are shown by dark colored cells. Now, we only
need to ensure that carries generated from the next rows are correctly added, which may require extra cells.
Let $R = (r_0, r_1, r_2, \ldots, r_{N-1})$ be the set of indices which represents an ordering of successive additions of
rows of partial products. Then, the ordering given by $r_i = i$ for $i = 0, 1, \ldots, N-1$ expresses the LSB-first
multiplier shown in figure 1. The MSB-first multiplier can be expressed similarly by the ordering
$r_i = N - 1 - i$ for $i = 0, 1, \ldots, N-1$. Clearly, there are $N!$ ways to construct carry save array multipliers.
Each of these multipliers may be constructed using propagation of carry in either ripple form or CS form
or a combination of these. This formulation is the basis for generating the various architectures of interest
which are evaluated for their switching activity performance in this paper.
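
The order-independence argument above can be checked numerically. The following is a minimal behavioural sketch, not a gate-level model: it assumes each row contributes the shifted word $a_{r_i} \cdot B \cdot 2^{r_i}$ and accumulates the rows in the order given by $R$. The function name `multiply_with_row_order` and the small word length `N = 4` are illustrative choices, not taken from the paper.

```python
from itertools import permutations
from math import factorial


def multiply_with_row_order(a: int, b: int, order: list[int]) -> int:
    """Accumulate the partial-product rows a_i * B * 2**i in the given row order."""
    acc = 0
    for i in order:
        a_i = (a >> i) & 1          # bit of operand A selecting this row
        acc += (a_i * b) << i       # row i is B weighted by 2**i when a_i = 1
    return acc


N = 4                                        # small word length keeps the check exhaustive
lsb_first = list(range(N))                   # r_i = i
msb_first = [N - 1 - i for i in range(N)]    # r_i = N - 1 - i

# Every permutation of the N rows is a candidate ordering R.
orders = list(permutations(range(N)))
all_match = all(
    multiply_with_row_order(a, b, list(r)) == a * b
    for r in orders
    for a in range(2 ** N)
    for b in range(2 ** N)
)
```

All $N!$ orderings produce the same product; what differs between them in hardware is the cell arrangement and, as studied below, the switching activity.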
A. LSB-First Multipliers 
The LSB-first multiplier can be constructed either using the CS format shown in figure 1, or by using
a ripple carry structure. We will refer to the former as the LSB-first CS multiplier and the latter as the
LSB-first RP multiplier, respectively. The LSB-first RP multiplier is the most well-known and widely used array
structure for multiplication and is obtained from the cellular array of figure 1 by turning off the diagonal
lines (by making them dashed) and turning on the horizontal dashed lines (by making them continuous)
which connect cell $c_{i,j}$ to $c_{i,j+1}$ for all cells $c_{i,j}$, $i = 0, 1, \ldots, N-1$ and $j = N-i, N-i+1, \ldots, 2N-i-2$
(right-most cell excepted) in the LSB-first CS multiplier of figure 1. The vector merge row (row #N + 1)
is no longer required. The advantage of using CS format is the reduction in propagation delay through
the multiplier. The LSB-first RP multiplier has a 30% longer critical path than the LSB-first CS
multiplier. In this work, we consider both since our objective is to highlight the switching characteristics
of various array multipliers.
B. MSB-First Multipliers
An MSB-first multiplier places the MSBs of the A input at the top row positions as shown in figure 2. The main
idea is to flip the cells in the cellular array of figure 1 along a horizontal axis such that row #i is moved
to row #(N - 1 - i), for $i = 0, 1, \ldots, N-1$. This results in an MSB-first multiplier [6]. The multiplier
can be constructed by propagating the carry in CS form, or the carry can be rippled in a fashion identical
to the LSB-first RP multiplier. The multiplier using CS format has been presented in [6] for pipelining
recursive digital filters. A major advantage of the MSB-first CS multiplier is that the delay through the vector
merge stage can be reduced by taking advantage of the fact that the MSB-first array produces the MSBs
before the LSBs. Hence, a carry-select structure can be constructed in the region occupied by cells $c_{i,j}$
for $i > j$ to improve the vector merge delay. Consequently, the MSB-first CS array multiplier can improve
the speed of multiplication [6]. The observation that the MSBs of the product are available before the LSBs is
fundamental to the construction of the MSB-first RP multiplier shown in figure 2(b). In contrast to an
LSB-first RP multiplier, it has the same propagation delay as the LSB-first CS multiplier and offers an
attractive alternative to it.
Fig. 2. Structures for MSB-first multipliers; (left) MSB-first CS multiplier, and, (right) MSB-first RP multiplier.

C. Hybrid Multipliers
A hybrid multiplier is obtained by any ordering of elements of R which is not monotone. Note that
there is only one monotonically increasing ordering of the elements of R and it leads to the LSB-first
structure. Similarly, the only monotonically decreasing ordering leads to the MSB-first structure. Any
ordering other than these two leads to a hybrid array multiplier. In this paper, we consider only two
types of hybrid structures. The first structure places L consecutive LSB bits of operand A as the L top
most rows. This structure is shown on the left in figure 3. The second structure places L consecutive MSB
bits of operand A as the L top most rows and is shown on the right in figure 3. We will refer to the former as
the hybrid LSB-first multiplier and the latter as the hybrid MSB-first multiplier, respectively. Both of these can be
constructed either by using ripple carry or by using CS format. Hence, there are four ways to implement
a hybrid multiplier which puts the L top most rows of one type of multiplier above the other (i.e. LSB-first
over MSB-first or vice versa). The multiplier on the left in figure 3 puts the L = 3 top rows of the LSB-first CS
multiplier over the N - L = 5 top rows of the MSB-first CS multiplier. We will refer to such a multiplier as a
hybrid LSB-first CS/CS multiplier with L = 3. Similarly, the multiplier on the right in figure 3 puts the L = 3 top most
rows of the MSB-first RP multiplier over the N - L top most rows of the LSB-first CS multiplier. This multiplier
will be referred to as a hybrid MSB-first RP/CS multiplier with L = 3. We can obtain three more types
of L = 3 hybrid multipliers for each of these cases by considering the remaining three combinations of
adding carries in the two parts of the multiplier.
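
The two hybrid row orderings can be written down directly from the description above. This is a small sketch under the assumption that the ordering $R$ lists the index of the operand-A bit assigned to each row from top to bottom; the helper names `hybrid_lsb_first`, `hybrid_msb_first` and `is_monotone` are hypothetical.

```python
def hybrid_lsb_first(n: int, l: int) -> list[int]:
    """L top rows of the LSB-first order placed over the N - L top rows of the MSB-first order."""
    return list(range(l)) + list(range(n - 1, l - 1, -1))


def hybrid_msb_first(n: int, l: int) -> list[int]:
    """L top rows of the MSB-first order placed over the N - L top rows of the LSB-first order."""
    return list(range(n - 1, n - 1 - l, -1)) + list(range(n - l))


def is_monotone(order: list[int]) -> bool:
    """True only for the strictly increasing (LSB-first) or strictly decreasing (MSB-first) order."""
    pairs = list(zip(order, order[1:]))
    return all(x < y for x, y in pairs) or all(x > y for x, y in pairs)


# N = 8 and L = 3, matching the word length and L used for figure 3.
left = hybrid_lsb_first(8, 3)     # rows a0, a1, a2, then a7 down to a3
right = hybrid_msb_first(8, 3)    # rows a7, a6, a5, then a0 up to a4
```

Neither ordering is monotone, so both are hybrids in the sense defined above; the choice of CS or ripple carry in each part is a separate design decision that this sketch does not capture.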
Each type of hybrid multiplier implementation requires a different overhead and has a different length
of critical path. Since the goal of this work is to develop an understanding of the switching trade-offs
in various multipliers, we will only consider implementations which place L consecutive rows of one type
of multiplier over the other. The reason for focusing on such architectures is that DSP applications,
in general, process data streams whose properties can only be predicted or controlled over a part of the
word-length. For example, if the signal strength reduces, consecutive MSBs of the data-stream become
zeros (assuming a sign-magnitude representation). Similarly, "less important" data values may be further
quantized by truncating some LSBs, thereby resulting in the data-stream having zeros at the corresponding
locations. It will be shown that the proposed hybrid multipliers yield substantial improvement in switching
activity reduction compared to a tree multiplier (constructed using 4:2 compressors) as well as the simple
LSB-first or MSB-first multipliers under appropriate signal conditions. The multiplier structure shown on the
left in figure 3 is entirely a CS structure, and its speed can be increased by using a carry select structure
similar to the one proposed in [6]. The multiplier on the right in figure 3 has the same delay as an LSB-first CS
array multiplier despite the fact that the MSB-first part ripples the carry. The reason for considering this
structure is that it requires fewer overhead cells to ensure that all partial product sums and
carries are added at the appropriate locations.
Fig. 3. Structures for hybrid multipliers; (left) hybrid LSB-first CS/CS multiplier, and, (right) hybrid MSB-first RP/CS
multiplier.
We first investigate the switching characteristics of the multipliers presented in the previous section
qualitatively. Let us first consider the LSB-first multipliers. A close observation of the multiplier in
figure 1 shows that if successive inputs are applied such that their LSBs are zeros in operand A, the
corresponding top rows of the multiplier will be turned off as the evaluated partial products would all
be zeros. Any input which has $a_0 = 1$ will place the vector B at the output of the first row of partial
product outputs. These values will propagate downwards even if the next LSBs in A are all zeros. Hence,
switching activity can only be reduced if successive inputs applied at the input A ensure that when a bit
$a_j$ is 1, all $a_i$'s are zeros for $i < j$. Similarly, we notice that if the successive inputs applied at the B
inputs are such that M MSB bits are zeros, then the cells $c_{i,j}$ such that $j = i + k$ for $k = 0, 1, \ldots, M-1$
along the diagonal (columns of partial product generators) in the cellular array are all turned off. Hence,
no sum or carry output transitions occur in these cells. Low over-all switching activity can therefore be ensured if
the inputs applied to this multiplier are ordered so that they cause smaller switching activity.
Similar observations are made for the MSB-first and hybrid multipliers. The "best" input conditions
for these multipliers are summarized in table I and can be verified by a careful study of figures 1 - 3.
Next, in order to obtain a quantitative behavior of these multipliers, we will use two signal models which
are described next.
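
The qualitative argument for the partial-product generators can be illustrated with a short toggle-counting sketch. It models only the AND gates producing $a_i \cdot b_j$, not the adder cells or carry propagation, and it assumes rows are indexed by the bits of A and columns by the bits of B. The names `partial_products` and `toggle_counts`, the random stimulus, and the values of `L` and `M` are illustrative assumptions.

```python
import random


def partial_products(a: int, b: int, n: int) -> list[list[int]]:
    """Return the n x n grid of partial-product bits a_i AND b_j."""
    return [[((a >> i) & 1) & ((b >> j) & 1) for j in range(n)] for i in range(n)]


def toggle_counts(pairs: list[tuple[int, int]], n: int) -> list[list[int]]:
    """Count output transitions of every partial-product gate over a sequence of inputs."""
    counts = [[0] * n for _ in range(n)]
    prev = partial_products(0, 0, n)
    for a, b in pairs:
        cur = partial_products(a, b, n)
        for i in range(n):
            for j in range(n):
                counts[i][j] += prev[i][j] != cur[i][j]
        prev = cur
    return counts


N, L, M = 8, 2, 3
rng = random.Random(0)
# Operand A has its L LSBs forced to zero; operand B has its M MSBs equal to zero.
pairs = [
    (rng.randrange(2 ** N) & ~((1 << L) - 1), rng.randrange(2 ** (N - M)))
    for _ in range(200)
]
counts = toggle_counts(pairs, N)

rows_off = all(counts[i][j] == 0 for i in range(L) for j in range(N))
cols_off = all(counts[i][j] == 0 for i in range(N) for j in range(N - M, N))
```

Rows driven by zeroed LSBs of A and columns driven by zeroed MSBs of B never toggle, which is the condition the text describes as those parts of the array being turned off.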
A. Signal Models 
In the first model we only vary the signal strength to determine the switching characteristics. Hence,
successive samples of signals are assumed to be uncorrelated and drawn from a uniform distribution. It
has been shown in [10] that the switching activity in the LSB-first RP multiplier primarily depends on
the input signal strength. Hence, we apply all possible combinations of fixed signal strengths in an N-bit
multiplier by sweeping the space of possible signal strengths at the two inputs. We obtain data for these
points by generating samples comprising i bits from a uniform distribution, where i is varied from 1 to
N. The $N \times N$ possible combinations of signal strengths of the two operands are obtained by applying
a signal of strength i bits as operand A and j bits as operand B, where $i, j = 1, 2, \ldots, N$. This model will
be referred to as the U model and it can be used to assess the merits of using the presented multipliers for
signals which can be represented by N or fewer bits and/or which can be re-quantized by discarding some
LSBs without significantly degrading the system performance.

TABLE I
SIGNAL CONDITIONS CAUSING LOW SWITCHING ACTIVITY AND OVERHEADS IN THE MULTIPLIERS PRESENTED. HYBRID
MULTIPLIERS ASSUME THAT L ROWS ARE MOVED TO TOP.

| Multiplier             | Favorable conditions at input A | Favorable conditions at input B | Overheads: vector merge | Overheads: wiring |
| LSB-first CS           | LSBs zeros                      | MSBs zeros                      |                         |                   |
| LSB-first RP           | LSBs zeros                      | MSBs zeros                      |                         |                   |
| MSB-first CS           | MSBs zeros                      | MSBs zeros                      |                         |                   |
| MSB-first RP           | MSBs zeros                      | MSBs zeros                      |                         |                   |
| Hybrid LSB-first CS/CS | MSBs & LSBs zeros               | MSBs zeros                      |                         |                   |
| Hybrid MSB-first RP/CS | MSBs & LSBs zeros               | MSBs zeros                      | L - 1 cells             | None              |

Overhead entries for the remaining rows, in the order recovered: 3(N - 1) cells, N - 1 cells, None, 3N - 2L - 1.
The second model generates correlated signals from a zero mean Gaussian distribution. These samples
are represented using sign-magnitude (SM) number representation and only the magnitude of the number
is applied at the inputs of the multiplier. The signal correlation in operand A is represented by $\rho_A$ and
the correlation in B is represented by $\rho_B$. Four situations arise by considering all possible combinations
of high and low correlations in the signals at the two inputs. The high correlation value is considered to
be 0.95, and the low correlation equal to 0. This model will be referred to as the Q model.
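
The two signal models can be sketched as simple sample generators. Several details here are assumptions not stated in the text: the U model is taken to draw integers uniformly from $[0, 2^i)$; the Q model is realised as a first-order autoregressive process with coefficient $\rho$, which is one common way to obtain a correlated zero-mean Gaussian sequence; and the scaling of three standard deviations to full scale is an arbitrary illustrative choice. The function names `u_model` and `q_model` are hypothetical.

```python
import math
import random


def u_model(num_bits: int, count: int, rng: random.Random) -> list[int]:
    """Uncorrelated samples of a given signal strength: uniform over [0, 2**num_bits)."""
    return [rng.randrange(1 << num_bits) for _ in range(count)]


def q_model(rho: float, n_bits: int, count: int, rng: random.Random,
            full_scale_sigmas: float = 3.0) -> list[int]:
    """Magnitudes of a correlated zero-mean Gaussian sequence (assumed AR(1) with coefficient rho)."""
    max_mag = (1 << n_bits) - 1
    x = rng.gauss(0.0, 1.0)
    out = []
    for _ in range(count):
        # The AR(1) update keeps unit variance and lag-one correlation rho.
        x = rho * x + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Sign-magnitude representation: only the magnitude drives the multiplier.
        out.append(min(max_mag, int(abs(x) / full_scale_sigmas * max_mag)))
    return out


N = 8
rng = random.Random(1)
# U model sweep: every combination (i, j) of signal strengths at inputs A and B.
grid = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]
a_samples = u_model(3, 1000, rng)
# Q model: the four combinations of high (0.95) and low (0) correlation.
q_cases = [(ra, rb) for ra in (0.95, 0.0) for rb in (0.95, 0.0)]
b_samples = q_model(0.95, N, 1000, rng)
```

With `rho = 0` the Q-model generator reduces to independent Gaussian samples, which corresponds to the low-correlation case.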
B. Numerical Results on Power Dissipation
We now turn our attention to the switching activity performance of the presented multipliers. The
switching activity of each multiplier was evaluated by counting the number of switches at each output of
every module in the multiplier. Let $S_c$ denote the switching count of cell $c$. The possible cells in
a multiplier are an AND gate, an HA, an FA and a 4:2 compressor (the 4:2 compressor appears in the
tree multiplier). The corresponding switching metrics which express the switch counts in these cells will
be represented by $S_{AND}$, $S_{HA}$, $S_{FA}$ and $S_{4:2}$, respectively. The total switching metric was obtained using
the following weighting: 2 for $S_{AND}$, 3 for $S_{HA}$, $S_{FA}$ and $S_{4:2}$ (the weight reflects the output load capacitance
driven by the gate output). These relative weighting factors were obtained by considering the pin loading
of a typical module in the array configuration. In addition, the switches at the input pins were counted
separately for the given simulation and multiplied by N to account for input buffer drivers. The total
switch counts at all outputs (including input pins), weighted by the corresponding factor, were summed to
obtain the switching metric for the multiplier. These weightings yield a metric which expresses the total
switched capacitance in the multiplier for the given input conditions.
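
The weighted sum described above can be sketched as follows. The weights are read from the text as 2 for AND outputs and 3 for HA, FA and 4:2 compressor outputs, with input-pin switches multiplied by N; the dictionary layout and the function name `switching_metric` are assumptions made for illustration, and the switch counts in the example are made-up numbers, not simulation results.

```python
# Relative output-load weights per cell type, as read from the text.
WEIGHTS = {"AND": 2, "HA": 3, "FA": 3, "4:2": 3}


def switching_metric(cell_switches: dict[str, int], input_pin_switches: int, n: int) -> int:
    """Weighted switch count: cell outputs by load weight, input pins by the word length n."""
    weighted_cells = sum(WEIGHTS[kind] * count for kind, count in cell_switches.items())
    return weighted_cells + n * input_pin_switches


# Hypothetical counts for an 8-bit array multiplier run (illustrative values only).
example = switching_metric({"AND": 10, "HA": 4, "FA": 6}, input_pin_switches=5, n=8)
```

Here `example` evaluates to 2·10 + 3·4 + 3·6 + 8·5 = 90, a unitless proxy for switched capacitance.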
A similar metric was obtained for the tree multiplier by using the same input signals. We will
let $S_{Array}$ and $S_{Tree}$ denote the switching metrics for the array and tree multipliers, respectively, for the
given input signal conditions. Then the relative advantage of using the array multiplier is defined as

$$\eta_{Tree} = \frac{S_{Tree} - S_{Array}}{S_{Array}} \times 100 \qquad (1)$$

The above quantity is expressed as a percentage and shows the advantage of using the array multiplier
over a tree for the given signal condition. We will refer to this quantity as percentage switching reduction.
The rationale behind this normalization is to clearly indicate the relative performance of each type of array
multiplier with respect to the tree structure and to quantify the percentage reduction in switching activity
for a given signal condition. A similar quantity can be obtained for comparing the relative performance of
any two multipliers. Figure 4 shows one such metric computed using the LSB-first CS multiplier as
the reference for normalization. The figure shows the relative advantage of using the indicated hybrid
multipliers in comparison to the LSB-first CS multiplier by using $S_{LSB\text{-}First\ CS}$ in place of $S_{Tree}$ in
equation 1. This quantity will be represented by $\eta_{Array}$, as we set the LSB-first CS multiplier as the
base-line for comparison in array multipliers. It is noted that switching reduction of up to 200% (3X
smaller) is possible when using a hybrid multiplier in comparison to the LSB-first CS multiplier, under
appropriate signal conditions.
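
A small numerical sketch of the percentage switching reduction follows. The normalization used here, the difference between the reference and array metrics divided by the array metric, is an assumption inferred from the remark that a 200% reduction corresponds to 3X smaller switching; the function name `switching_reduction_percent` is hypothetical.

```python
def switching_reduction_percent(s_reference: float, s_array: float) -> float:
    """Percentage switching reduction of an array multiplier relative to a reference design."""
    if s_array <= 0:
        raise ValueError("the array switching metric must be positive")
    return 100.0 * (s_reference - s_array) / s_array


# A reference that switches three times as much as the array gives a 200% reduction.
three_times = switching_reduction_percent(300.0, 100.0)
# A negative value means the array multiplier switches more than the reference.
worse = switching_reduction_percent(80.0, 100.0)
```

Passing the tree metric as the reference gives $\eta_{Tree}$, and passing the LSB-first CS metric gives $\eta_{Array}$.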
Fig. 4. $\eta_{Array}$ for 16-bit hybrid array multipliers, plotted against the signal strength of A and the signal strength of B.
Figure above: shows the percentage switching reduction for Hybrid LSB-first CS/CS with L=1, and, figure below: shows
this surface for Hybrid MSB-First RP/CS multiplier with L=1 (normalization is performed with respect to LSB-first CS
multiplier).

The results presented in this section were obtained by using 1000 randomly generated vectors using the
U model. These results give rise to a surface as a function of the number of bits in the applied inputs.
This surface is best shown by slicing it into different regions and showing every slice individually as in
figure 4. An even better representation is to place the slices side by side as a bar chart, as shown in the
remaining figures. The abscissa in these figures shows the number of bits in the samples (drawn from a
uniform distribution) applied at the A input. The data samples were applied at the multiplier inputs by
aligning their LSB with the zeroth indexed row/column. Hence, the successive simulations computed the
switching metrics for inputs with increasing widths until metrics for all the grid points of the switching
metric surface were computed. The metrics were normalized to obtain the relative switching reduction shown
in the figures. The bars in each figure are composed of N groups. The position of a group corresponds to
the number of bits in the samples applied at A. Each group, in turn, is composed of N bars. The position
of a bar inside a group indicates the number of bits in the samples applied at the B input. Hence, as we
scan a figure from left to right, the strength of the input signal at the B input repeatedly increases and
falls, while the strength of the signal applied at A continually increases.
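
The bar-chart layout described above amounts to a simple index mapping. The sketch below assumes bars are numbered from zero, left to right; the function name `bar_to_strengths` is hypothetical.

```python
def bar_to_strengths(bar_index: int, n: int) -> tuple[int, int]:
    """Map a zero-based bar position to (bits applied at A, bits applied at B)."""
    group, position = divmod(bar_index, n)   # group selects A, position in group selects B
    return group + 1, position + 1


# For N = 8 the first group sweeps B from 1 to 8 bits while A stays at 1 bit.
first_group = [bar_to_strengths(k, 8) for k in range(8)]
```

Scanning the bars in order therefore sweeps the strength of B repeatedly while the strength of A increases once per group.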
B . l  Results Using the U Signal Model 
Figures 5 -- 7 show q~,, ,  as a function of signal strength in the LSB-first and MSB--first multipliers 
for N = 8,16 and 32, respectively. We observe a consistent trend of the relative performance for each 
of these mult,ipliers. Each of these multipliers gives gains in switching reduction for difFerent operating 
conditions. A.s pointed out in table I,  LSB-first multipliers would give an improvement when the LSBs of 
A input, or bilSBs of B input are zeros. The first situation does not arise with this signal model, because 
it would require MSBs to  be I s  and LSBs to be 0s. Such a signal can only be generated by quantizing 
(rounding/truncating) the LSBs. However, the second condition is more realistic and we note that up 
to  25% reduction in switching activity is possible over tree multiplier when the signal strength of A is 
high, and B is small as they result in left-most columns of multipliers turning off. Despi.te the overhead 
of vector merge state, the CS multiplier out-performs the R P  multiplier as evident by a close inspection 
of these figures. The MSB-first multiplier shows the gains in switching activity reduction when the signal 
strength a t  the A input is low. Hence, the top most rows do not switch. Larger gains are observed when 
the signal strength a t  the B input is large. The R P  type multiplier clearly outperforms the CS multiplier 
because of smaller overhead cells. Further, the relative gains under favorable signal conditions are higher 
as compared to the LSB-first multipliers. Finally, both favorable situations appear a t  the inputs in the U 
signal model because the MSB-first multipliers reduce switching when the MSBs of both inputs are 0s (a  
situation which frequently arises in DSP applications). It is seen that close to 40% reduction in switching 
-20 
-30 
1 2 3 4 5 6 7 8  
-40 
1 2 3 4 5 6 7 8  
Sgnal Slrerglh ol A m  LSB-F rsI C S  Muit8pler Sign3 Strength 01 A m  MSB-FlrsI CS Mulllp mr 
-30 
1 2 3 4 5 6 7 6  
-40 
1 2 3 4 5 6 7 8  
S g n 3  Sllsnglh of A n LSB-FlrsI RP Mulllpl~ei Sgnal Strength ol A m MSB-Flrrt RP Mulllp ler 
Fig. 5. v~~~~ for (left) 8-bit LSB-first array multiplier, and, (right) 8-bit MSB-first array multiplier as a function of the 
signal strength in the operands. 
0 2 4 8 8 10 12 14 16 18 0 2 4 6 8 10 12 14 18 18 
Sgnal Strsngih ol A ~n LSB-Fra RP Milt!pl~sr S1gn3 Strengh ol A n MSB-Fbra RP Mult#@isr 
Fig. 6. v~~~~ Ior (left) 16-bit LSB-first array multiplier, and, (rightj 16-bit MSB-first array multiplier as a function of the 
signal strength in the operands. 
activity is possible in the MSB-first R P  multiplier when the signal strength of A is very small and B is 
very strong. The savings are consistent across 8, 16 and 32 bit multipliers. 
We can also compare the relative performance of MSB-first and LSB-first multipliers. Figure 4 shown
earlier indicates that the LSB-first multiplier out-performs the MSB-first multiplier by up to 30% when the signal
strengths at both A and B inputs are very small. However, the MSB-first multiplier out-performs the
LSB-first multiplier for most situations, giving a larger relative advantage in switching reduction. Note that
these results favor the MSB-first type multiplier from the switching activity point of view for most common
signal conditions. One may notice that many multipliers used in DSP do not need all 2N product bits
(especially in floating point units) and the MSB-first multiplier is an attractive choice since by construction
it also furnishes the MSB part of the product very quickly.
Fig. 7. $\eta_{Tree}$ for (left) 32-bit LSB-first array multiplier, and, (right) 32-bit MSB-first array multiplier as a function of the
signal strength in the operands.
We next consider hybrid multipliers. The main objective of employing the hybrid multipliers presented
in this paper is to take advantage of a signal whose L LSB bits are zeros. Such signal values may arise in
many ways in typical DSP applications. As an example, computations may be organized as floating point
type operations in which the normalized mantissas of the operands are multiplied using an array multiplier
and values are expressed using only a few MSB bits of the mantissa, depending on the accuracy
required (variable precision arithmetic). An example of such a system is a digital filter implementation
employing scaled coefficients for reducing performance degradation due to coefficient quantization [3].
Another example of truncation of a signal's L LSB bits is a situation where the resulting degradation in
accuracy can be tolerated for the application at hand. An example of such a system is an FIR filter
whose objective is to meet given filter specifications, but whose implementation uses a
multiplier that is bigger than the least number of bits required to meet these specifications [8], [9]. This
situation can easily arise in general DSP implementations where shared multipliers are used for more
than one application and resources are not exclusively dedicated to only one task. For these reasons, the
switching performance of the hybrid multipliers was computed by truncating L LSB bits of the signal and
setting them to zeros. If the L LSB bits are not set to zeros, the hybrid multiplier's switching performance
will lie between that of the LSB-first and MSB-first multipliers.
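The truncation used in these experiments can be illustrated with a minimal Python sketch. The helper name `truncate_lsbs` and the sample values are hypothetical and chosen only for illustration; the sketch assumes integer-valued samples.

```python
def truncate_lsbs(x: int, num_lsbs: int) -> int:
    """Return x with its num_lsbs least significant bits set to zero."""
    if num_lsbs < 0:
        raise ValueError("num_lsbs must be non-negative")
    # Shifting right discards the low bits; shifting back refills them with zeros.
    return (x >> num_lsbs) << num_lsbs


# Hypothetical 8-bit samples, truncated with L = 2.
samples = [0b10110111, 0b00001011, 0b01100001]
truncated = [truncate_lsbs(s, 2) for s in samples]
```

Note that a sample whose magnitude is below 2^L is truncated to zero, so a weak signal combined with a large L can vanish entirely.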
Next, we analyze the results shown in figures 8 - 10. These figures show $\eta_{Tree}$ for 8, 16 and 32-bit
multipliers, respectively. The figures on the left show the results for multipliers with L = 1, and the figures
on the right show the results obtained for multipliers with L = 2. It is seen that the hybrid MSB-first and hybrid
LSB-first multipliers show improvement in performance for different signal conditions. The former shows
the most improvement when the signal strength is small for A and large for B. The latter shows the most gain
when the converse is true. The reduction in switching activity is more pronounced in the hybrid LSB-first
multiplier despite the overhead of cells required to ensure correct operation. This is due to the fact that
the L LSB bit truncated signals obtained through the 24 model are more effective in turning off a larger
part of the multiplier, since the LSB-first part of the multiplier precedes the MSB-first part in the former case.
Significant reduction in switching activity is achieved in both cases. Further, the trends are consistent for
all sizes of multiplier.
[Figure 8 panels: Hybrid-LSB-First Multiplier (L=1), Hybrid-LSB-First Multiplier (L=2), Hybrid-MSB-First Multiplier (L=1), Hybrid-MSB-First Multiplier (L=2). Horizontal axis of each panel: signal strength of A, 1 to 8.]
Fig. 8. $\eta_{Tree}$ for 8-bit hybrid array multipliers as a function of the signal strength in the operands. (left) L=1, and, (right)
L=2.
[Figure 9 panels: Hybrid-LSB-First Multiplier (L=1), Hybrid-LSB-First Multiplier (L=2), Hybrid-MSB-First Multiplier (L=1), Hybrid-MSB-First Multiplier (L=2). Horizontal axis of each panel: signal strength of A.]
Fig. 9. $\eta_{Tree}$ for 16-bit hybrid array multipliers as a function of the signal strength in the operands. (left) L=1, and, (right)
L=2.
Figure 11 shows $\eta_{Tree}$ for 8 and 16-bit multipliers, respectively, with L = 3. Figure 12 shows $\eta_{Tree}$ for
16 and 32-bit multipliers, respectively, for hybrid multipliers with L = 4. The missing bars indicate that
the A operands under the indicated signal conditions were zeros (small power, large truncation). Hence,
no operations are necessary. However, the region of switching reduction moves to the mid-region of
A signal power. The relative switching activity reduction becomes larger as L increases. The trends are
consistent for all hybrid LSB-first CS/CS and hybrid MSB-first RP/CS multipliers for all sizes and values
of L. The reduction in switching activity under favorable signal conditions is as large as 40%.
[Figure 10 panels: Hybrid-LSB-First Multiplier (L=1), Hybrid-LSB-First Multiplier (L=2). Horizontal axis of each panel: signal strength of A, 1 to 31.]
Fig. 10. $\eta_{Tree}$ for 32-bit hybrid array multipliers as a function of the signal strength in the operands. (left) L=1, and,
(right) L=2.
[Figure 11 panels: Hybrid-LSB-First Multiplier (L=3) and Hybrid-MSB-First Multiplier (L=3), for 8-bit and 16-bit multipliers. Horizontal axis of each panel: signal strength of A.]
Fig. 11. $\eta_{Tree}$ for 8 (left) and 16-bit (right) hybrid array multipliers with L = 3 as a function of the signal strength in the
operands.
The relative performance of a large multiplier for small and large values of L is shown in figure 13.
This figure shows $\eta_{Tree}$ for 32-bit hybrid multipliers for L = 1 (figures on the left) and L = 8 (figures on
the right), respectively. Since the indicated truncation for small signal strength completely annihilates its
value, the first seven groups of bars are missing in the figures on the right. No operations are necessary in
this region of operation and no switching activity results in the hybrid multiplier if such operands are
applied. Switching reductions of up to 35% are achievable in the L = 8 case in comparison to about 30%
for the L = 1 case. Although the results shown in figures 8 - 13, in general, suggest superior performance
of the hybrid LSB-first multiplier over the hybrid MSB-first multiplier, one must remember that the
choice of the best multiplication scheme depends on the input signal conditions. The relative switching
activity reduction also depends on the choice of multiplier used in normalization in equation 1. Figure
14 demonstrates this point by showing the relative switching activity for 32-bit hybrid array multipliers
with L = 2 and L = 4, respectively. Notice that the switching activity reduction in the hybrid MSB-first
multiplier, although smaller than that of its counterpart, is more consistent as the signal strength of A varies.
Hence, for a given application, the hybrid MSB-first multiplier may be preferred despite its generally
inferior performance to the hybrid LSB-first multiplier.
[Figure 12 panels: Hybrid-LSB-First Multiplier (L=4) and Hybrid-MSB-First Multiplier (L=4), for 16-bit and 32-bit multipliers. Horizontal axis of each panel: signal strength of A.]
Fig. 12. $\eta_{Tree}$ for 16 (left) and 32-bit (right) hybrid array multipliers with L = 4 as a function of the signal strength in the
operands.
[Figure 13 panels: Hybrid-LSB-First Multiplier (L=1), Hybrid-LSB-First Multiplier (L=8), Hybrid-MSB-First Multiplier (L=1), Hybrid-MSB-First Multiplier (L=8). Horizontal axis of each panel: signal strength of A, 1 to 31.]
Fig. 13. $\eta_{Tree}$ for 32-bit hybrid array multipliers with L = 1 (left) and L = 8 (right) as a function of the signal strength in
the operands.
B.2 Switching Activity for Correlated Signals
We now consider the performance of the presented multipliers using the model. For this purpose we
applied data samples drawn from a Gaussian distribution for different signal strengths varying from 1 to
N - 1 bits. Four situations were chosen to reflect the effect of correlation in the signal by considering

[Figure 16: horizontal axis of each panel is the signal strength of A, 1 to 8.]
Fig. 16. $\eta_{Tree}$ for (left) 8-bit MSB-first CS, and, (right) 8-bit MSB-first RP, array multipliers as a function of the signal
strength in the operands. Values of $\rho_A$ and $\rho_B$ are shown with each plot.
strength of A increases. The effect of increasing $\rho_A$ is an "equalization" of $\eta_{Tree}$ for small signal strengths
at A. However, the differences are very small.
In the case of MSB-first multipliers shown in figure 16, we notice that higher $\rho_A$ causes an "equalized"
$\eta_{Tree}$ for small signal strengths of A. Hence, better gains are obtained as signal strength of A increases,
and these gains drop quickly as A becomes stronger. The effect of $\rho_B$ is not discernible. Similar results
are seen for the hybrid multipliers in figures 17 - 18, which show the effect of correlated signals on
their performance. In all these examples, the effect of $\rho_B$ is negligible; however, high
$\rho_A$ causes the gains to equalize in the region where the hybrid multiplier out-performs the tree multiplier.
It is noted that consideration of extremely high correlations does not make much sense because a better
approach in this case is to difference the data and reduce its dynamic range. Hence, by adding the overhead
of an add operation one can significantly reduce the size of the operands in multiplication. The results shown
in this section clearly indicate that signal correlations have a small effect on the switching activity for
all multipliers. It is actually the signal strength at the inputs which almost completely determines the
switching in the multiplier. This is confirmed by a similar observation made for LSB-first RP multipliers
in [10].
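The remark on differencing highly correlated data can be illustrated with a small Python sketch. It assumes a first-order autoregressive Gaussian process as the source of correlated samples; this process, the function names and the parameter values are illustrative assumptions and are not necessarily the signal model used in the experiments above.

```python
import random
import statistics


def ar1_gaussian(n, rho, sigma=1.0, seed=0):
    """Stationary first-order autoregressive Gaussian sequence (assumed model).

    rho is the lag-one correlation coefficient and sigma the standard deviation.
    """
    rng = random.Random(seed)
    innovation_std = sigma * (1.0 - rho ** 2) ** 0.5
    x = [rng.gauss(0.0, sigma)]
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0.0, innovation_std))
    return x


def first_difference(x):
    """Differences between successive samples, x[k+1] - x[k]."""
    return [b - a for a, b in zip(x, x[1:])]


# Highly correlated data: differencing shrinks the spread of the values.
x = ar1_gaussian(20000, rho=0.99, sigma=1024.0)
d = first_difference(x)
ratio = statistics.pstdev(d) / statistics.pstdev(x)
```

For a stationary process of this kind the ratio of standard deviations tends to $\sqrt{2(1-\rho)}$, roughly 0.14 for $\rho = 0.99$, so the differenced operand needs close to three fewer bits at the cost of one subtraction per sample.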
B.3 Area Comparison 
The LSB-first CS and MSB-first CS multipliers were implemented in CMOS using 0.6$\mu$ technology.
Both of these structures were implemented after inverter elimination simplifications for the partial product
generator rows [4]. Cells were implemented for both non-inverted and inverted outputs [4] and the bottom-most
row constituted a vector merge adder for converting CS format to regular representation. The layout
areas of the two multipliers are shown in table II for purpose of comparison. MSB-first CS adds a wiring
overhead which results in an increased area. This is because the carry signal must be propagated one cell
further in a rectangular layout. These values can be used to approximately estimate the area overhead of
using hybrid multipliers.
[Figure 17: horizontal axis of each panel is the signal strength of A, 1 to 8.]
Fig. 17. $\eta_{Tree}$ for 8-bit hybrid array multipliers as a function of the signal strength in the operands. (left) Hybrid LSB-first
CS/CS with L=1, and, (right) Hybrid MSB-First RP/CS with L=1. Values of $\rho_A$ and $\rho_B$ are shown with each plot.
[Figure 18: horizontal axis of each panel is the signal strength of A, 1 to 8.]
Fig. 18. $\eta_{Tree}$ for 8-bit hybrid array multipliers as a function of the signal strength in the operands. (left) Hybrid LSB-first
CS/CS with L=2, and, (right) Hybrid MSB-First RP/CS with L=2. Values of $\rho_A$ and $\rho_B$ are shown with each plot.
IV. APPLICATION TO LOW POWER DESIGN
In the previous sections, we have provided a qualitative as well as quantitative assessment of the
switching activity reduction which can be obtained by using the proposed multiplier structures for various
signal conditions. These results can assist in the design for low power as they show the relative strengths
and weaknesses of different multiplier architectures. In this section, we will briefly discuss the application
of this work to low power quantization, reconfigurable computing and high-level synthesis for low power.
TABLE II






A. Low Power Quantization 
As discussed earlier, non-dedicated DSP systems generally employ multipliers whose size is determined
by the performance requirements of the most computationally expensive intended application. An
application in a general DSP system with fixed resources may not require the full precision offered by the
resource. In such a situation, the power dissipation of the computational unit can be significantly reduced
by appropriate use of the resource. Such quantizations have been proposed in [8], [9] without
considering supporting multiplier architectures. We further note that these results are also useful in formulating a
strategy for employing variable word-length computing, in which different tasks of a DSP algorithm are
computed with different precisions without significantly degrading the overall system performance.
As evident from the results presented in previous sections, the following two conditions must be met:
first, an appropriate multiplier architecture should be selected, and second, correct input conditions must
be provided such that reduced switching activity is guaranteed. Quite clearly, it is not enough to ensure
only one of these conditions. For example, if we truncate the LSB bits of the B input in an LSB-first
multiplier, it will not help reduce switching activity. Further, it is also important to ensure that favorable
signal conditions are maintained at the inputs consistently. For example, if successive A inputs in the
LSB-first multiplier have a toggling $a_0$ bit, the reduction in switching activity will be entirely lost. Reduction
in switching activity is possible only if the data stream applied at the A input of this multiplier ensures that
successive samples have all L LSB bits turned off.
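The second condition can be checked on a data stream before it is applied to the A input. The following Python sketch counts bit transitions in the L low-order bits between successive samples; the helper names and the sample stream are hypothetical and serve only as an illustration.

```python
def lsb_toggles(samples, num_lsbs):
    """Count bit flips in the num_lsbs low-order bits between successive samples."""
    mask = (1 << num_lsbs) - 1
    return sum(
        bin((a ^ b) & mask).count("1")  # XOR marks the bits that changed
        for a, b in zip(samples, samples[1:])
    )


def truncate_lsbs(x, num_lsbs):
    """Return x with its num_lsbs least significant bits set to zero."""
    return (x >> num_lsbs) << num_lsbs


# Hypothetical 4-bit stream applied to the A input, checked for L = 2.
stream = [0b1011, 0b0110, 0b1101, 0b1000]
raw_toggles = lsb_toggles(stream, 2)
clean_toggles = lsb_toggles([truncate_lsbs(s, 2) for s in stream], 2)
```

In this stream the two low-order bits toggle on every transition, whereas the truncated stream keeps them at zero throughout, which is the condition required for the switching reduction to hold between successive samples.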




B. Reconfigurable Computing 
The cellular array structure presented in section II is the most general template from which any array
multiplier can be formed. In applications where reconfigurability is sought for the application at hand, one
may use the underlying structure proposed in this paper to form any of the $N!$ possible multiplier architectures.
It is noted that reconfigurability desired specifically for reduction of switching activity may not achieve
that goal because of the overheads involved. In general, these overheads reduce the speed of the application as
well as increase the overhead power. However, for specific applications where the structure of the data stream is
well-known, a reconfigurable multiplier may be employed which eliminates the undesired rows of the multiplier
(to form an appropriate hybrid multiplier) in order to increase the speed of multiplication. In such a case,
the interpretation of array multipliers presented in section II and the template described in figure 1 can
prove to be extremely useful.
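The count of $N!$ architectures can be made concrete with a short Python sketch. It assumes that each architecture corresponds to one order in which the N partial-product rows are accumulated, with the LSB-first and MSB-first multipliers as the two extreme orders; this reading, the function name and the operand values are assumptions made for illustration and model only the arithmetic, not the cell-level structure.

```python
from itertools import permutations
from math import factorial


def multiply_with_row_order(a, b, order):
    """Accumulate the partial products a_i * b * 2^i in the given row order."""
    acc = 0
    for i in order:
        if (a >> i) & 1:      # row i contributes only when bit a_i is set
            acc += b << i
    return acc


N = 4
orders = list(permutations(range(N)))   # every possible row ordering
a, b = 0b1011, 0b0110
products = {multiply_with_row_order(a, b, order) for order in orders}
```

Every ordering yields the same product, so the choice among the orderings affects only the internal signal activity and timing, not the result. Rows whose multiplier bit is known to be zero contribute nothing and are the ones a reconfigurable multiplier could eliminate.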
C. High Level Synthesis Based on Knowledge of Signal Characteristics
The results presented in this paper clearly indicate that each array multiplier offers advantages for
specific signal conditions. Further, large improvements are possible in reduction of switching activity by
appropriate choice of multipliers. Hence, maximum reduction in switching activity can be achieved by
scheduling and allocating operations such that favorable input conditions are ensured at the inputs of the
multipliers employed in the implementation. Hence, existing high-level synthesis tools can be improved
such that they consider the expected signal behavior at various points of the algorithm while arriving at an
implementation. Note that the condition of ensuring favorable signal conditions at the multiplier inputs
also reduces bus power, since these conditions must be met consistently between successive data samples.
This work shows that an appropriate choice of array multiplier assures that reduction in switching activity
in the input bus to the multiplier is reflected as reduced switching activity in the multiplier. Hence, one can
reduce the power dissipation in a data-path by careful scheduling and allocation of instructions based on
the expected statistical properties of the data being processed.
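As a rough illustration of such signal-aware allocation, the Python sketch below routes the weaker of two data streams to a chosen multiplier input. The strength measure (bits needed for the largest magnitude in the stream), the function names and the sample streams are hypothetical; an actual synthesis tool would rely on the statistical signal strength discussed in this paper and on its own cost model.

```python
def signal_strength_bits(samples):
    """Hypothetical strength measure: bits needed for the largest magnitude."""
    return max(abs(s) for s in samples).bit_length()


def assign_operands(stream_x, stream_y, small_on_a=True):
    """Return (A_stream, B_stream).

    The weaker stream drives input A when small_on_a is True, and input B
    otherwise, depending on which condition favors the chosen multiplier.
    """
    if signal_strength_bits(stream_x) <= signal_strength_bits(stream_y):
        weaker, stronger = stream_x, stream_y
    else:
        weaker, stronger = stream_y, stream_x
    return (weaker, stronger) if small_on_a else (stronger, weaker)


# Hypothetical filter coefficients (weak) and input data (strong).
coeffs = [3, -2, 1]
data = [900, -750, 1020]
a_stream, b_stream = assign_operands(data, coeffs, small_on_a=True)
```

With `small_on_a=True` the coefficients are applied at A and the data at B, matching the condition under which the hybrid MSB-first multiplier showed the most improvement; `small_on_a=False` gives the converse assignment.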
We presented several new array multiplier architectures for reducing switching activity in general digital
signal processing applications. A general cellular structure was presented which can be used to obtain
any array multiplier suitable for the given application. This structure provides a unified view of all
$N!$ possible N-bit array multipliers. The switching activity at the output nodes of the cells in various
multiplier structures was analyzed and compared with a tree multiplier based on 4:2 compressors as well
as an LSB-first CS array multiplier. It was shown that the relative improvement in power is a function of the
statistical properties of the input signals. It was also shown that selection of an appropriate array architecture
can give up to 40% reduction in switching activity compared to a tree multiplier, and a more than threefold
reduction in switching activity compared to the widely used LSB-first array multiplier for commonly
occurring situations. We also outlined applications of the proposed multipliers and the presented results
to the areas of low power quantization, reconfigurable computing and high-level synthesis for low power.
Hence, the proposed architectures can prove to be extremely useful structures for low power DSP system
design.
[1] E. E. Swartzlander, "Computer Arithmetic," IEEE Computer Society Press, 1990.
[2] S. Haykin, "Adaptive Filter Theory," Prentice Hall, NJ, 1996.
[3] J. G. Proakis and D. G. Manolakis, "Digital Signal Processing: Principles, Algorithms, and Applications," Macmillan
Publishing Company, New York, 1992.
[4] J. M. Rabaey, "Digital Integrated Circuits: A Design Perspective," Prentice Hall, New Jersey, 1996.
[5] N. H. E. Weste and K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective," 2nd Edition, Addison
Wesley, 1994.
[6] S. E. McQuillan and J. V. McCanny, "A Systematic Methodology for the Design of High Performance Recursive Digital
Filters," IEEE Trans. on Computers, Vol. 44, No. 8, pp. 971-982, Aug. 1995.
[7] J. K. Jain, L. Song and K. K. Parhi, "Efficient Semisystolic Architectures for Finite-Field Arithmetic," IEEE Trans.
VLSI Systems, Vol. 6, No. 1, pp. 101-113, Mar. 1998.
[8] K. Muhammad and K. Roy, "On Complexity Reduction of FIR Digital Filters Using Constrained Least Squares Solution,"
In Proc. of 1997 IEEE International Conference on Computer Design (ICCD '97), pp. 196-201, Austin, Texas.
[9] K. Muhammad and K. Roy, "Low Power Digital Filters Based On Constrained Least Squares Solution," In Proc. of the
31st Asilomar Conference on Signals, Systems and Computers, 1997, Monterey, California - Invited Paper.
[10] M. Lundberg, K. Muhammad, K. Roy and S. K. Wilson, "High-level Modeling of Switching Activity With Application
to Low-power DSP System Synthesis," To appear in the 1999 Proc. IEEE International Conference On Acoustics,
Speech, and Signal Processing (ICASSP'99).
[11] T. H. Cormen, C. E. Leiserson and R. L. Rivest, "Introduction to Algorithms," The MIT Press, 1990.
