On Power Reduction of FIR Digital Filters Using Constrained Least Squares Solution by Muhammad, Khurram & Roy, Kaushik
Purdue University
Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering
2-1-1997
On Power Reduction of FIR Digital Filters Using
Constrained Least Squares Solution
Khurram Muhammad
Purdue University School of Electrical and Computer Engineering
Kaushik Roy
Purdue University School of Electrical and Computer Engineering
Follow this and additional works at: http://docs.lib.purdue.edu/ecetr
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries. Please contact epubs@purdue.edu for
additional information.
Muhammad, Khurram and Roy, Kaushik, "On Power Reduction of FIR Digital Filters Using Constrained Least Squares Solution"
(1997). ECE Technical Reports. Paper 76.
http://docs.lib.purdue.edu/ecetr/76
On Power Reduction of FIR Digital Filters Using Constrained 
Least Squares Solution* 
Khurram Muhammad and Kaushik Roy 
khurram@ecn.purdue.edu and kaushik@ecn.purdue.edu 
School of Electrical and Computer Engineering 
Purdue University, West Lafayette 
IN 47907-1285, USA. 
Contact Person: Kaushik Roy 
School of Electrical and Computer Engineering 
Purdue University, West Lafayette 
IN 47907-1285, USA. 
Ph: 317-494-2361, Fax: 317-494-3371 
e-mail: kaushik@ecn.purdue.edu 
*This research was supported in part by ARPA (F33615-95-C-1625), by NSF CAREER award (9501869-MIP),
Rockwell Corp., and IBM Corp.
ABSTRACT 
In this report, we apply the constrained least squares (CLS) solution to the problem of reducing the
number of operations in FIR digital filters, with the motivation of reducing their power consumption. The
constraints are defined by the maximum allowable add/subtract operations in forming the products
which are used in computing the output. We show that truncation and rounding of coefficients can
be viewed as power constrained least squares (PCLS) solutions. Further, we show that in dedicated
DSP processor based architectures it is possible to reduce power by using PCLS coefficients along
with appropriately modified multipliers. It is also shown that a Booth multiplier effectively reduces the
complexity of such filters, thereby increasing power savings. Finally, we show that typically a 30%
and 45% reduction in the number of operations can be obtained for systems employing uncoded and
Booth recoded multipliers, respectively.
I. INTRODUCTION 
Recently, constrained least squares (CLS) design of FIR filters has been proposed [2, 3], which
trades peak stopband gain for reduced total stopband energy. Such filters find application in
frequency division multiplexed (FDM) communication systems employing narrow frequency bands
for multi-access. This report investigates reduction of the power consumed by such filters for application
in personal communication systems. The objective of this report is to explore reduction
in the complexity of the operation of such filters by defining constraints on the number of operations
(and hence, power) and employing the CLS technique to obtain coefficients that satisfy given power
constraints. We will refer to such filters as power constrained least squares (PCLS) filters.
Low power implementations of FIR filters are of interest in wireless receivers and they have
been investigated at various levels of abstraction in the literature [4, 5, 6]. One approach targets
programmable DSP architectures, identifying the factors which contribute to dissipated energy
and finding methods which reduce power hungry operations. Methods proposed for reducing the power
dissipated in the multipliers and busses of a generic Harvard architecture based DSP processor use
a host of techniques such as coefficient scaling, coefficient ordering, selective coefficient negation, and
removing common sub-expressions [4]. These techniques attempt to reduce power by identifying
operations that are "redundant" in the sense that they repeat computational steps that do not yield
new information by executing a power-consuming operation. Bus power reduction is proposed
by coefficient optimization [5], which attempts to reduce the Hamming distance (HD) between
successive coefficients in order to reduce the activity of signals at one of the multiplier inputs.
Parallel processing architectures have also been proposed for reducing the power of these filters [6].
FIR filters have traditionally been designed using the Parks-McClellan algorithm [1] due to
its simplicity and wide availability. This widely used algorithm is based on a minimax optimality
criterion which minimizes the maximum amplitude distortion of the signal in the entire band.
However, these filters have high stopband energies. Consequently, such filters are not preferable
for multi-access communication systems [2, 3]. In contrast, filters based on the unconstrained least
squares (ULS) criterion have relatively small stopband energy, which is advantageous in multi-access
communication systems based on frequency division multiplexing (FDM). However, these filters
have large gains at the edge of their stopbands, which causes distortion due to aliasing of the signal
from the adjacent channel.
A compromise between the two extremes is obtained by employing the CLS solution, which
relaxes the peak distortion in the stopband for a large reduction in stopband energy [2, 3]. These
methods provide a viable alternative in the design of filters suitable for specific applications. In a
portable computing and/or wireless communication scenario, CLS based methods yield filters
that can provide better performance in terms of attenuation characteristics and rejection of energy
in the undesired band [2, 3].
In this report, we approach the issue of power reduction from a different angle. We explore
reduction in the complexity of FIR filters by reducing the switching activity required to compute the
output. This is based on the premise that lower complexity leads to lower power dissipation. Our
approach is to explore a constrained least squares solution similar to [2, 3] by defining constraints on
the number of additions in computing the products that yield the filter output. The philosophy
is to compute a constrained coefficient vector which is nearest to a known optimal vector in an
LS sense, constrained by a maximum number of allowable add/subtract operations in computing
the products. The maximum allowable add/subtract operations can be viewed as a constraint on
power if the appropriate multipliers discussed in Section II-B are employed. Hence, this approach yields
a PCLS solution.
We will show that if the original optimal coefficient vector is computed based on CLS solutions
similar to [2, 3], the PCLS vector is the optimal power constrained coefficient vector for the filter
satisfying the given specifications. For DSP processor based filter implementations, we show that
PCLS filters can significantly reduce the complexity (and hence, power) without violating performance
constraints. Further, based on the definition of the least square error (LSE), a PCLS based
approach yields a coefficient vector that is a rounded or a truncated coefficient vector. Alternatively,
rounding or truncation can be viewed as generation of PCLS coefficients. The degree of
rounding or truncation is dictated by the power constraints. Further, it is shown that reducing
complexity by computing PCLS coefficients achieves a reduction in power at the expense of increased
energy in the stopband.
II. DEFINITIONS
Let $F_p$ and $F_s$ represent the passband and stopband frequencies, respectively, and let $A$ define
the nominal passband gain. $A_{min}$ and $A_{max}$ define the minimum and maximum passband gains,
respectively. Further, let $\delta_s$ define the peak stopband gain. Then
$$\mathrm{SPAR} = 20 \log_{10} \frac{\delta_s}{A_{max}} \ \mathrm{dB} \qquad (1)$$
is the maximum stopband to maximum passband gain as defined in [2]. Further, let $E_p$ and $E_s$
define the passband and stopband energies [2]. Then
$$\mathrm{PSER} = 10 \log_{10} \frac{E_p}{E_s} \ \mathrm{dB} \qquad (2)$$
defines the passband to stopband energy ratio. Figure 1 shows a typical SPAR versus PSER curve
for an FIR filter. As shown in the figure, the two extreme points on this curve correspond to the
Parks-McClellan and ULS solutions. The points between them correspond to CLS solutions that show
a typical trade-off of SPAR and PSER [3]. For any given specifications on the desired impulse
response of the FIR filter, we can construct a unique SPAR-PSER tradeoff curve. It is noted that
any FIR filter can be represented in the space comprising the positive quadrant of SPAR and PSER.
We will show that the effect of constraining power (i.e. a PCLS filter) is to move the filter away from
the SPAR-PSER tradeoff curve. When the constraints on power are removed, the desired filter
moves back onto the curve.
A. FIR Filters - Preliminaries
Consider a linear time-invariant FIR system of length $M$ described by an input-output relationship
of the form
$$y(n) = \sum_{i=0}^{M-1} \bar{b}_i \, x(n-i). \qquad (3)$$
In this context, $\bar{b}_i$ represents the $i$th coefficient whereas $x(n-i)$ denotes the data sample at
time instant $n-i$. Note that the overscore is used to remind the reader that although $\bar{b}_i$ is a
scalar quantity, it can be decomposed into constituent bits. Hence, the coefficient vector is $\mathbf{b} =
[\bar{b}_0, \bar{b}_1, \ldots, \bar{b}_{M-1}]$. Without any loss of generality, we will assume that all coefficients are integers
represented by $N$ bits. In sign-magnitude form, the representation of integers is expressed as
$\bar{b}_i = \sum_{j=0}^{N-2} b_{i,j} 2^j$, and the bit $S$ at the $(N-1)$th position represents the sign of the number. It is 0
or 1 for positive or negative numbers, respectively. In two's complement notation an integer can be
expressed as $\bar{b}_i = \sum_{j=0}^{N-2} 2^j b_{i,j} - S \cdot 2^{N-1}$, where $S$ is the sign bit defined above.
The system above can be implemented in forms varying from the simplest tapped-delay line
or transversal filter to more robust cascade and lattice structures [1]. In this report, we will mainly
concentrate on the transversal FIR implementation, which finds use in a wide range of applications.
In a generic DSP implementation of a transversal FIR filter, the coefficients are held in a
coefficient memory and are sequentially applied to the multiply-and-accumulate (MAC) unit. A
separate memory holds data which is applied to the second input of the adder. A major source of
power dissipation is the multiplier unit that computes $\bar{b}_i \cdot x(n-i)$ for $i = 0, 1, \ldots, M-1$. In a
typical multiplier unit, each 1-bit of the multiplier corresponds to a shift-and-add (SAA) operation
on the multiplicand. If the number of 1-bits, or the Hamming Weight (HW), can be reduced in the
multiplier, we can reduce the number of additions required to compute the product $\bar{b}_i \cdot x(n-i)$,
thereby reducing power.
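The link between Hamming weight and the number of additions can be illustrated with a small sketch. The function names below (`hamming_weight`, `shift_add_multiply`, `fir_output`) are hypothetical and not taken from the report; the code only assumes integer coefficients and counts one addition per 1-bit of the coefficient magnitude, which is a simplified model of a shift-and-add multiplier.

```python
def hamming_weight(value: int) -> int:
    """Number of 1-bits in the magnitude of an integer coefficient."""
    return bin(abs(value)).count("1")


def shift_add_multiply(coeff: int, sample: int):
    """Multiply by scanning the coefficient bits; return (product, additions)."""
    product, additions = 0, 0
    magnitude, bit = abs(coeff), 0
    while magnitude >> bit:
        if (magnitude >> bit) & 1:
            product += sample << bit  # one shift-and-add per 1-bit
            additions += 1
        bit += 1
    return (-product if coeff < 0 else product), additions


def fir_output(coeffs, history):
    """y(n) = sum_i b_i * x(n - i), where history[i] holds x(n - i)."""
    total, adds = 0, 0
    for b, x in zip(coeffs, history):
        p, a = shift_add_multiply(b, x)
        total += p
        adds += a
    return total, adds
```

In this model the addition count of a product equals the Hamming weight of the coefficient, so lowering the total Hamming weight of the coefficient set lowers the operation count of every output sample.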
A commonly used multiplier unit employs Booth's algorithm [7] for high speed and low
power multiplication. The main idea is to recode the multiplier such that consecutive runs of 1-bits
are represented by the difference of two numbers each having only a single 1-bit. As an example,
the sequence 11111111 can be represented as $100000000 - 1 = 10000000\bar{1}$, which needs only one add
and one subtract operation in calculating a product with this number. In contrast, the original
multiplier would have caused eight add operations. Hence, a recoded multiplier offers savings in
power. Many variants of Booth recoding exist, the most efficient of which is a technique called canonical
recoding [7]. Non-canonical Booth recoding is more practical in implementation due to its simplicity
[7, 8]. The difference between the two is illustrated by a simple example. Canonical recoding of
01110110111 yields $1000\bar{1}00\bar{1}00\bar{1}$ whereas non-canonical recoding gives $100\bar{1}10\bar{1}100\bar{1}$. The number
of add/subtract operations is reduced from 8 to 4 in canonical and 6 in non-canonical recoding,
respectively. When using a Booth multiplier [7, 8], a simple reduction in HW would
yield an improvement in power; however, the hardware essentially remains under-utilized, as many
other possible coefficient codewords with a larger HW exist that would consume equivalent or
lower power.
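The two recodings in the example above can be reproduced with a short sketch. It assumes that non-canonical recoding means the classic radix-2 Booth rule $d_i = b_{i-1} - b_i$ and that canonical recoding means the non-adjacent (canonical signed-digit) form; the function names are hypothetical, and digits are returned MSB first with $-1$ standing for $\bar{1}$.

```python
def booth_digits(value: int, width: int):
    """Radix-2 (non-canonical) Booth digits, MSB first: d_i = b_(i-1) - b_i."""
    bits = [(value >> i) & 1 for i in range(width)]
    digits = [(bits[i - 1] if i > 0 else 0) - bits[i] for i in range(width)]
    return digits[::-1]


def canonical_digits(value: int, width: int):
    """Canonical signed-digit (non-adjacent form) digits, MSB first."""
    digits = []
    while value:
        if value & 1:
            d = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    digits += [0] * (width - len(digits))  # pad up to the requested width
    return digits[::-1]


def operations(digits):
    """Add/subtract count: one operation per non-zero digit."""
    return sum(1 for d in digits if d != 0)
```

For the 11-bit word 01110110111 this gives 4 non-zero digits for the canonical form and 6 for the radix-2 Booth form, against 8 one-bits in the unrecoded word.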
Table 1 shows the number of available coefficients as a function of word length, $N$, which
consume equal power when used as the multiplier for a given multiplicand. We will refer to the set
of coefficients that would dissipate equal power as a code class. More formally, two codes, $r$ and
$s$, belong to the same code class if the amount of energy spent in forming the products $p \cdot r$ and
$p \cdot s$ is the same for any arbitrary number $p$. Table 1 shows the number of available code-words in
different possible code classes when using an unrecoded or a canonical Booth recoded multiplier.
The method for reducing power in this report is to constrain the coefficients to be members of one
of the specified code classes.
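A rough way to see how recoding enlarges the code classes is to enumerate all magnitudes of a short word and group them by operation count. The sketch below is an illustration under stated assumptions, not a reproduction of Table 1: it uses unsigned 8-bit magnitudes, takes the code class of a word to be its number of non-zero digits, and uses the non-adjacent form as the canonical recoding. The names `naf_weight` and `class_sizes` are hypothetical.

```python
from collections import Counter


def popcount(value: int) -> int:
    """Operation count of an unrecoded word: its number of 1-bits."""
    return bin(value).count("1")


def naf_weight(value: int) -> int:
    """Operation count after canonical recoding: non-zero digits of the NAF."""
    weight = 0
    while value:
        if value & 1:
            value -= 2 - (value & 3)  # remove a +1 or -1 digit
            weight += 1
        value >>= 1
    return weight


def class_sizes(n_bits: int, weight_fn):
    """Map each code class (operation count) to its number of code-words."""
    return Counter(weight_fn(v) for v in range(1 << n_bits))


unrecoded = class_sizes(8, popcount)
canonical = class_sizes(8, naf_weight)
```

Because the canonical form never needs more operations than the plain binary word, every class bound admits at least as many code-words after recoding as before.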
It is noted that there are two criteria for measuring the filter performance: the main criterion
being its impulse response satisfying given specifications of maximum stopband attenuation and
passband ripple, the other being its power dissipation performance. This work considers an LS
solution constrained by the code class of the coefficients such that the modified coefficient vector is
closest to the original vector $\mathbf{b}$ and all modified coefficients lie within given code classes. Further,
the main criterion of a filter's performance is its adherence to specifications in the frequency domain,
which must not be violated when complexity is reduced.
B. Hardware Considerations for Low Power 
Power dissipation in a multiplier depends on the signal activity at the external and internal
nodes. The signal activity at the external input nodes depends on the Hamming Distance (HD)
between successively applied coefficients (or data), whereas the signal activity in the internal and
external output nodes depends on several factors including the input signal probability. In a generic
array multiplier consisting of rows of basic cells (see figure 2), restricting the code class of filter
coefficients will not significantly reduce the activity of signals at the internal nodes.
In a typical DSP processor based system, an appropriate modification of the generic multiplier is to
"bypass" a cell row such that the contents of the row of cells are held at the previous values if the
corresponding bit in the multiplier is a 0. The contents of the previous row of cells are simply passed
over to the next row of cells without changing any signal value in the internal nodes. This can
be achieved by suitably modifying the control (CTRL) and controlled/add/subtract/shift (CASS)
units in figure 2. As a result, the internal signal activity is reduced while forming the product.
Without such a modification, the row of cells will be loaded with all 0's, leading to a higher signal
activity and thereby rendering power optimization techniques useless.
In a dedicated architecture we can simply reduce the word length of the filter coefficients. This
would reduce the number of operations, thereby saving power. Hence, for the two contrasting
implementations (i.e. DSP processor based vs. dedicated hardware), complexity reduction takes
different but equivalent forms.
Table 1: Distribution of binary codewords in various code classes for unrecoded and Booth recoded
multipliers for N = 20, 16 and 12.
For a general FIR filter, reducing the HD between successive coefficients severely distorts the
transfer function of the filter, and significant power savings using this method are hard to achieve.
Reduction of the required number of add/subtract operations, on the other hand, can result in a
proportional reduction in power. Hence, we will not consider the optimization of coefficients in order
to reduce the HD between successive coefficients (or data). An interested reader is referred to [5]
for more details.
III. CONSTRAINED LEAST SQUARES TECHNIQUE FOR FIR FILTER DESIGN
In this section, we will present the CLS approach used to compute the modified coefficient
vector, $\mathbf{k} = [\bar{k}_0, \bar{k}_1, \ldots, \bar{k}_{M-1}]$. The original coefficient vector is represented by $\mathbf{b}$ and the maximum
code class allowable is represented by $K$. The vector $\mathbf{k}$ obtained using the minimization technique
replaces $\mathbf{b}$ when computing equation (3) in the actual implementation. We will develop the CLS
solution for two different LSE definitions, which will be referred to as error definitions I and II,
respectively.
A. Brute Force CLS Solution for Error Definition I
Let $\mathbf{B} = [b_{0,0}, b_{0,1}, \ldots, b_{0,N-2}, b_{1,0}, b_{1,1}, \ldots, b_{1,N-2}, \ldots, b_{M-1,0}, \ldots, b_{M-1,N-2}]^T$ be the vector
decomposition of bits representing the original coefficient vector. Similarly, $\mathbf{K} = [k_{0,0}, k_{0,1}, \ldots,
k_{0,N-2}, k_{1,0}, k_{1,1}, \ldots, k_{1,N-2}, \ldots, k_{M-1,0}, \ldots, k_{M-1,N-2}]^T$ represents the vector decomposition of bits
of the modified vector which satisfies given power constraints. Further, let $\mathbf{W} = \mathrm{diag}[2^0, 2^1, \ldots,
2^{N-2}, 2^0, 2^1, \ldots, 2^{N-2}, \ldots, 2^0, \ldots, 2^{N-2}]$ define a diagonal weight matrix. We define the LSE, $E$, as
$$E = (\mathbf{B} - \mathbf{K})^T \mathbf{W} \mathbf{W}^T (\mathbf{B} - \mathbf{K}). \qquad (4)$$
In the sequel the LSE defined above will be referred to as error definition I. Note that the error
surface $E$ is convex as $\partial^2 E / \partial k_{i,j}^2 > 0$ and $\partial^2 E / \partial k_{i,j} \, \partial k_{l,m} = 0$ for $i \neq l$, $j \neq m$. The constrained
minimization problem can be specified as
$$\min_{\mathbf{K}} E = \min_{\mathbf{K}} \, (\mathbf{B} - \mathbf{K})^T \mathbf{W} \mathbf{W}^T (\mathbf{B} - \mathbf{K}) \qquad (5)$$
subject to
$$\mathbf{C}^T \mathbf{K} = \mathbf{c}, \qquad (6)$$
where $\mathbf{C}$ is a matrix defining constraints on $\mathbf{K}$ specified in $\mathbf{c}$. The matrix $\mathbf{C}$ is initialized to all 1's
in the first column and the vector $\mathbf{c}$ is initialized to $MK$. This corresponds to the original constraint
that the sum of the elements of $\mathbf{K}$ should be equal to $MK$. The Lagrangian, $J$, is given as [10]
Taking the derivative of the equation above with respect to $\mathbf{K}$, we get
where $\mathbf{K}_{LS} = \mathbf{B}$ and hence, the constrained least squares solution is only a modification of the least
squares solution [10]. The value of $\lambda$ can be found by invoking the constraint
which on solving gives
Equation (12) can be used to obtain the desired solution to the constrained LS problem; however,
it does not guarantee integer values for $\mathbf{K}_{CLS}$. In fact, the solution obtained using this equation
will almost always yield real numbers for the bits composing $\mathbf{K}$. As a consequence we need a
branch-and-bound ILP problem formulation for obtaining integer values that comprise our solution.
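For orientation, the standard Lagrange-multiplier steps for an equality-constrained weighted least squares problem of this shape can be sketched as follows. This is a reconstruction under assumptions, not the report's own Equations (7) to (12): the sign of the multiplier term and the factor of 2 are one common convention and may differ from the original, and the constraint is taken as $\mathbf{C}^T \mathbf{K} = \mathbf{c}$.

```latex
\begin{aligned}
J(\mathbf{K}, \boldsymbol{\lambda}) &= (\mathbf{B}-\mathbf{K})^T \mathbf{W}\mathbf{W}^T (\mathbf{B}-\mathbf{K})
    + \boldsymbol{\lambda}^T \left( \mathbf{C}^T \mathbf{K} - \mathbf{c} \right), \\
\frac{\partial J}{\partial \mathbf{K}} &= -2\,\mathbf{W}\mathbf{W}^T (\mathbf{B}-\mathbf{K}) + \mathbf{C}\boldsymbol{\lambda} = 0
    \;\Rightarrow\;
    \mathbf{K}_{CLS} = \mathbf{K}_{LS} - \tfrac{1}{2} (\mathbf{W}\mathbf{W}^T)^{-1} \mathbf{C} \boldsymbol{\lambda},
    \quad \mathbf{K}_{LS} = \mathbf{B}, \\
\mathbf{C}^T \mathbf{K}_{CLS} = \mathbf{c}
    &\;\Rightarrow\;
    \boldsymbol{\lambda} = 2 \left[ \mathbf{C}^T (\mathbf{W}\mathbf{W}^T)^{-1} \mathbf{C} \right]^{-1}
    \left( \mathbf{C}^T \mathbf{B} - \mathbf{c} \right), \\
\mathbf{K}_{CLS} &= \mathbf{B} - (\mathbf{W}\mathbf{W}^T)^{-1} \mathbf{C}
    \left[ \mathbf{C}^T (\mathbf{W}\mathbf{W}^T)^{-1} \mathbf{C} \right]^{-1}
    \left( \mathbf{C}^T \mathbf{B} - \mathbf{c} \right).
\end{aligned}
```

With only the initial constraint (a single all-ones column in $\mathbf{C}$ and $\mathbf{c} = MK$), the last line subtracts from every bit $b_{i,j}$ a correction proportional to $2^{-2j}$, which is generally fractional. That is consistent with the remark above that the relaxed solution is almost never integer valued.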
B.  The Branch-and-Bound Method for ILP Solution 
We require the desired solution to be composed of integer values (0 or 1 only) for the bit
variables. As $\mathbf{B}$ is assumed to be known, all $b_{i,j}$, $i = 0, 1, \ldots, M-1$, $j = 0, 1, \ldots, N-2$ are
either 0 or 1. Since the CLS framework does not guarantee an integer solution for all $k_{i,j}$, $i =
0, 1, \ldots, M-1$, $j = 0, 1, \ldots, N-2$, we can use an ILP formulation for solving this problem. The
LP relaxation conditions can be specified as
$$0 \le k_{i,j} \le 1$$
such that $\sum_{i=0}^{M-1} \sum_{j=0}^{N-2} k_{i,j} = MK$. Note that this sum follows from Equation (6).
The branch-and-bound method (see figure 3) is a dynamic program which creates a tree by
assigning all possible values to the variable under consideration and recalculates optimal solutions
using Equation (12) for all assumed values [9]. The tree is traversed in a depth-first manner and the
original problem is broken down into less complex subproblems. At the root of the tree an arbitrary
variable is selected and assigned either a 0 or a 1. This assignment adds a column to the constraint
matrix $\mathbf{C}$ and modifies this matrix. For example, the initial solution is computed using equation (12),
where the constraint in equation (6) has only one column. An assignment of 1 to, say, $k_{0,N-2}$ would
introduce an additional column in equation (6) with a 1 at the bit position corresponding to $k_{0,N-2}$
and a 1 in an additional row in $\mathbf{c}$. A different constraint set is associated with each node in the
tree and hence, each node has an associated solution.
Equation (12) yields the optimal solution to the CLS problem, which is checked for all integer
values. If all variable assignments are integers, the current LSE is compared against the lowest
LSE obtained so far (which is initially assumed to be $\infty$). If less, the current solution replaces
the best one obtained so far (Subproblems 5 and 7 in figure 3); otherwise, the tree is truncated
at that node and the search for the optimal solution is resumed at the next node in the depth-first tree.
If not all values of the solution are integers, the current LSE is compared against the LSE of the best
solution. If the current LSE exceeds the best LSE, the tree is truncated at this point, as any further
solution will only increase the LSE and will not yield a better solution than what we already
have (Subproblem 6 in figure 3), and the search resumes at the next node in the depth-first tree.
Figure 3 shows the branch-and-bound algorithm. It is noted that the order of assumptions
regarding the $k_{i,j}$'s can significantly reduce the search required to obtain the optimal solution. The
fastest solution is obtained if variable assignments of coefficients progress from the most significant bit
(MSB) towards the least significant bit (LSB). Further, any heuristically derived solution can serve
as an initial guess and reduce computation in determining the solution.
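The depth-first search with pruning described above can be sketched in a few lines. This is a simplified illustration, not the report's procedure: instead of re-solving the relaxed CLS problem of Equation (12) at every node, the bound used here is just the error already accumulated by the fixed bits, which is valid because every further mismatch can only add to it. The names `flatten_bits` and `branch_and_bound` are hypothetical; bits are visited from the most significant position down, as recommended in the text.

```python
def flatten_bits(magnitudes, n_bits):
    """Bit vector B: for each coefficient, bits j = 0 .. n_bits-1 (LSB first)."""
    return [(m >> j) & 1 for m in magnitudes for j in range(n_bits)]


def branch_and_bound(bits, n_bits, budget):
    """Minimise sum of 4**j over mismatching bits with exactly `budget` ones."""
    weights = [4 ** (p % n_bits) for p in range(len(bits))]      # (2**j)**2
    order = sorted(range(len(bits)), key=lambda p: -weights[p])  # MSB first
    best = {"error": float("inf"), "bits": None}
    assign = [0] * len(bits)

    def visit(depth, ones_left, error):
        if error >= best["error"]:
            return  # bound: deeper nodes can only add error
        if ones_left < 0 or ones_left > len(order) - depth:
            return  # the budget can no longer be met exactly
        if depth == len(order):
            best["error"], best["bits"] = error, assign[:]
            return
        p = order[depth]
        for value in (bits[p], 1 - bits[p]):  # try the matching value first
            assign[p] = value
            cost = weights[p] if value != bits[p] else 0
            visit(depth + 1, ones_left - value, error + cost)

    visit(0, budget, 0)
    return best["error"], best["bits"]
```

For two 7-bit magnitudes 1010111 and 0000011 with a budget of four 1-bits, the search drops the three cheapest 1-bits (weights 1, 1 and 4), which is the behaviour the truncation argument of the next subsection predicts.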
C. An Efficient Algorithm for Computing the LS Solution
The major problem of the branch-and-bound ILP solution is its computational complexity. Note
that the formulated LSE in equation (4) comprises a sum of squares of the differences in the components
of the original and modified vectors. The power constraint requires that the code class of the
components of the coefficient vector stay close to $K$ while the LSE between the modified and original
coefficient vectors must be minimum.
Consider the PCLS solution for numbers expressed in sign-magnitude representation. The ILP
branch-and-bound method terminates exploring a subtree if it finds that the assignments to all variables
in $\mathbf{K}$ are either 0 or 1. The greatest contribution to the LSE is provided by a mismatch between $b_{i,N-2}$
and $k_{i,N-2}$, for $i = 0, 1, \ldots, M-1$, and is equal to $2^{2(N-2)}$. This is followed by a mismatch in
the next lower significant bits. Hence, to minimize the LSE in equation (4), we must select bits in $\mathbf{K}$
to replicate bits in $\mathbf{B}$ from the most significant bits towards the least significant ones. Further, the sign
bits of coefficients in the original and modified vectors must be identical for the differences to be as
small as possible. Note that this operation is equivalent to truncation of the original coefficient
vector such that the bit position where coefficients are truncated is determined by the given power
constraint. This follows directly because the LSE in equation (4) comprises a sum of squares in which
each individual bit mismatch contributes a weighted component to the total error.
In the present framework, it is possible that there is no bit position at which truncation of
coefficients exactly meets the power constraint. In such a case, truncation at one bit location gives
a coefficient vector which consumes lower power than allowable, whereas truncation at the next bit
position results in violation of the constraint. Hence, we have a situation in which we must select
$L$ out of $M$ bits such that $L < M$ and each selected bit contributes exactly the same amount of
error in the LSE. In such a case, we propose selection of the most sensitive bits, where the sensitivity of
a bit is defined as the contribution to error between the transfer functions of the original coefficient
vector and a modified coefficient vector with the bit turned on in addition to the preselected higher
significant bits. The algorithm is presented as follows.
    copy the sign bit of $\bar{b}_j$ to $\bar{k}_j$, for $0 \le j \le M-1$
    set $i$ = MSB and $sum = 0$
    while $sum + \sum_{j=0}^{M-1} b_j^i < MK$ do
        set $sum = sum + \sum_{j=0}^{M-1} b_j^i$
        copy bit $i$ of $\bar{b}_j$ to $\bar{k}_j$, $0 \le j \le M-1$, and set $i = i - 1$
    end while
    compute and store sensitivities and $j$ for $b_j^i$, s.t. $b_j^i \neq 0$, $0 \le j \le M-1$
    sort sensitivities of $b_j^i$, $0 \le j \le M-1$, in decreasing order
    select the $i$th bit of the $\bar{b}_j$'s from the first $MK - sum$ entries in this table
Here $b_j^i$ denotes bit $i$ of coefficient $\bar{b}_j$.
Note that as the above algorithm uses truncation, it can introduce a bias in the output, which
can be easily removed at the cost of an extra overhead of one multiply operation. As the length of the filter
increases, the effect of this extra overhead operation becomes less significant and can be ignored for
practical values of filter lengths.
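The truncation procedure above can be sketched in Python as follows. Several details are assumptions rather than statements from the report: coefficients are plain signed integers whose magnitude occupies `word_length - 1` bits, `code_class` plays the role of $K$, and the sensitivity measure `residual_sensitivity` is a hypothetical stand-in (the report computes sensitivities from the transfer function error in the frequency domain). Any other measure can be passed in through the `sensitivity` parameter.

```python
def residual_sensitivity(magnitude, bit):
    """Hypothetical stand-in: value of the coefficient at and below `bit`."""
    return magnitude & ((1 << (bit + 1)) - 1)


def pcls_truncate(coeffs, word_length, code_class, sensitivity=residual_sensitivity):
    """Keep at most M*K one-bits, copying bit planes from the MSB downwards."""
    budget = len(coeffs) * code_class      # M*K allowed 1-bits in total
    mags = [abs(c) for c in coeffs]
    kept = [0] * len(coeffs)
    used, bit = 0, word_length - 2         # MSB of the magnitude field
    while bit >= 0:
        plane = [(m >> bit) & 1 for m in mags]
        if used + sum(plane) >= budget:
            break                          # this plane would reach or exceed the budget
        for j, b in enumerate(plane):
            kept[j] |= b << bit
        used += sum(plane)
        bit -= 1
    if bit >= 0:
        # Spend the remaining budget on the most sensitive 1-bits of this plane.
        candidates = [j for j, m in enumerate(mags) if (m >> bit) & 1]
        candidates.sort(key=lambda j: sensitivity(mags[j], bit), reverse=True)
        for j in candidates[: budget - used]:
            kept[j] |= 1 << bit
    return [-k if c < 0 else k for c, k in zip(coeffs, kept)]
```

For example, `pcls_truncate([87, -6, 5], 8, 1)` copies bits 6 and 4 of the first coefficient, stops at bit plane 2 where three candidates compete for one remaining 1-bit, and gives it to the coefficient with the largest residual.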
Finally, the algorithm above can be easily modified for a Booth recoded multiplier. In this case,
the original coefficients are replaced by Booth recoded coefficients and the same algorithm is used to
select the low power Booth recoded coefficients. The results obtained are decoded back to the original
form. Note that Booth recoding increases the number of code-words available in the power constrained
solution. Hence, significantly more complexity reduction is expected when using these multipliers.
D. PCLS Solution for Error Definition II
The LSE definition formulated in equation (4) is minimized by using truncation as a means of
reducing power consumption by eliminating add/subtract operations. However, other LSE formulations
are also possible. We define
where $\mathbf{w} = [2^0, 2^1, \ldots, 2^{N-2}]$. Note that this is in contrast with the definition in equation (4). This
sum comprises the squared errors of the vector components, i.e. the differences $e_i = \bar{b}_i - \bar{k}_i$ for
$0 \le i \le M-1$. To minimize this sum, we must minimize the difference between the individual
components of the two vectors given the power constraint. Clearly, the CLS solution obtained
for this framework is different from the one presented earlier. As an example, the closest number to
00111111 given $K = 1$ is 01000000 rather than 00100000, which would be selected by the algorithm
in the previous section for a multiplier using no Booth recoding.
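The contrast between the two error definitions can be checked numerically on the example above. The sketch assumes that error definition I charges $2^{2j}$ for every mismatching bit $j$, as in equation (4), and that error definition II is the squared difference of the coefficient values, as the paragraph describes; the function names are hypothetical and the code handles one coefficient magnitude at a time.

```python
def error_definition_1(original: int, modified: int, n_bits: int) -> int:
    """Bit-wise LSE: add 4**j for every bit position j where the words differ."""
    diff = original ^ modified
    return sum(4 ** j for j in range(n_bits) if (diff >> j) & 1)


def error_definition_2(original: int, modified: int) -> int:
    """Value-wise LSE: squared difference of the two coefficient values."""
    return (original - modified) ** 2


b = 0b00111111
truncated = 0b00100000   # keeps the most significant 1-bit (truncation)
rounded = 0b01000000     # nearest value with a single 1-bit (rounding)
```

Under definition I the truncated word is far cheaper than the rounded one, because rounding up flips every bit of the word; under definition II the ordering reverses, since 64 is only 1 away from 63 while 32 is 31 away.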
The algorithm presented in III-C can be easily modified to find the PCLS solution for the new
error formulation. The difference $(\bar{b}_i - \bar{k}_i)^2$ for $0 \le i \le M-1$, given $K = \infty$ (no power constraint),
is minimized by selecting $\bar{k}_i = \bar{b}_i$. Again, we observe that the minimum contribution to the LSE due
to $(\bar{b}_i - \bar{k}_i)^2$ occurs when $\bar{k}_i$ is closest to $\bar{b}_i$ and does not violate the power constraint. This is
achieved by rounding $\bar{b}_i$ to a bit position such that $K$ bits are 1's. Note, however, that this strategy
may not yield the minimum LSE, as in one coefficient we may discard bits at higher significant
positions and select ones in another coefficient at lower significant positions. For example, given
$M = 2$ and $K = 2$, 01010111 is rounded to 01010000 and 00000011 is rounded to 00000011. The
LSE is 49. In contrast, the LSE due to rounding to 01011000 and 00000100 is only 2. A simple rounding
scheme will not ensure the smallest LSE given that the total budget is $MK = 4$.
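The two-coefficient example can be verified by exhaustive search. The sketch below is only a check on small words, not a practical method: it assumes unsigned 7-bit magnitudes, error definition II (sum of squared value differences), and a budget expressed as the total number of 1-bits across all coefficients. The function name `best_joint_rounding` is hypothetical.

```python
from itertools import product


def popcount(value: int) -> int:
    return bin(value).count("1")


def best_joint_rounding(coeffs, n_bits, budget):
    """Exhaustively find the words with at most `budget` one-bits in total
    that minimise the sum of squared value differences."""
    best_err, best = None, None
    for cand in product(range(1 << n_bits), repeat=len(coeffs)):
        if sum(popcount(c) for c in cand) > budget:
            continue
        err = sum((b - k) ** 2 for b, k in zip(coeffs, cand))
        if best_err is None or err < best_err:
            best_err, best = err, cand
    return best_err, best


# Independent rounding: each coefficient gets K = 2 one-bits of its own.
independent = (best_joint_rounding([0b1010111], 7, 2)[0]
               + best_joint_rounding([0b0000011], 7, 2)[0])
# Joint allocation: the M*K = 4 one-bits are shared between the coefficients.
joint, words = best_joint_rounding([0b1010111, 0b0000011], 7, 4)
```

Sharing the budget lets the larger coefficient take three 1-bits and the smaller one a single 1-bit, which is why the joint error is so much smaller than the per-coefficient result.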
The algorithm for finding PCLS is given below. In the algorithm, round($\bar{b}_j$, $i$) returns $\bar{b}_j$ rounded
at bit position $i$. The sign bits are simply copied. Moving from MSB towards LSB, we keep on
rounding the original coefficients at bit position $i$, $1 \le i \le M-1$, until we reach a position $j$ such
that the power constraint is not met at $j$ and is violated at $j+1$. As in the previously presented
algorithm, we propose using the sensitivities of bits at $j+1$ computed in the frequency domain
for selecting the remaining bits. We note that the PCLS solution based on error definition II is
obtained by mere rounding. Hence, for an original filter obtained using an LS criterion, appropriate
rounding yields the optimal PCLS filter.
    copy the sign bit of $\bar{b}_j$ to $\bar{k}_j$, for $0 \le j \le M-1$
    set $i$ = MSB and $sum = 0$
    set $t_j$ = round($\bar{b}_j$, $i$), $0 \le j \le M-1$
    while $sum + \sum_{j=0}^{M-1} t_j^i < MK$ do
        set $sum = sum + \sum_{j=0}^{M-1} t_j^i$
        set $i = i - 1$
        set $t_j$ = round($\bar{b}_j$, $i$), $0 \le j \le M-1$
    end while
    set $\bar{k}_j$ = round($\bar{b}_j$, $i$), $0 \le j \le M-1$
    compute and store sensitivities and $j$ for $b_j^i$, s.t. $b_j^i \neq 0$, $0 \le j \le M-1$
    sort sensitivities of $b_j^i$, $0 \le j \le M-1$, in decreasing order
    set $\bar{k}_j$ = round($\bar{b}_j$, $i$) from the first $MK - sum$ entries in this table
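The listing relies on a round($\bar{b}_j$, $i$) primitive whose exact definition is not spelled out. One plausible reading, offered here purely as an assumption, is rounding the magnitude to the nearest multiple of $2^i$ with ties rounded up; truncation at the same position is shown alongside for comparison. Both function names are hypothetical.

```python
def round_at(magnitude: int, bit: int) -> int:
    """Round to the nearest multiple of 2**bit (ties round up)."""
    if bit == 0:
        return magnitude
    return ((magnitude + (1 << (bit - 1))) >> bit) << bit


def truncate_at(magnitude: int, bit: int) -> int:
    """Drop all bits below position `bit`."""
    return (magnitude >> bit) << bit
```

With this reading, `round_at(0b00111111, 6)` gives 01000000 and `round_at(0b01010111, 3)` gives 01011000, matching the rounded words quoted in the examples of this section, whereas `truncate_at(0b01010111, 3)` gives 01010000.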
IV. PERFORMANCE OF CLS FIR FILTER COEFFICIENTS 
In this section we present some results using the proposed complexity reduction techniques. 
We will show that complexity reduction using the proposed PCLS approach yields filters with 
acceptable performance and results in power savings as a consequence of reduced complexity. The 
results are presented for filters employing both unrecoded and Booth recoded multipliers. 
A. Array Multipliers
Figure 4 shows the PCLS coefficients obtained using error definition I for a 51 tap linear phase
least squares FIR filter with $F_p = 1/4$, $F_s = 3/10$, $R_p = 3$ dB, $R_s = -57$ dB. Throughout this
section, we assume that the original coefficients are expressed as 16-bit sign-magnitude numbers
and the filter is implemented using a DSP processor. The PCLS coefficients were obtained for a
35% reduction in complexity. Clearly, the PCLS coefficients satisfy the performance constraints
while reducing the complexity of the filter substantially, thereby reducing its power consumption.
Figure 5 shows the PCLS coefficients for the filter with the same specifications obtained using the
Parks-McClellan algorithm. The algorithm yields a filter with 33 taps. The complexity reduction
of this filter is 28%. Clearly, PCLS coefficients do not substantially affect the filter performance by
reducing its complexity.
Figure 6 shows the frequency response of an 85 tap LS filter obtained for $F_p = 11/40$, $F_s = 3/10$,
$R_p = 3$ dB, $R_s = -50$ dB. The PCLS coefficients for this figure have been obtained for error
definition II, reducing the complexity by 31% when the original coefficients are assumed to be
represented by 16 bits. Note that the PCLS coefficients give acceptable filter performance; however,
they increase the PSER. Refer to figure 1 and note that the effect of constraining the complexity is
to move on a horizontal line with SPAR fixed and varying PSER. Further, the PCLS coefficients are
optimal coefficients for the LSE given by $E^*$, as the original optimal coefficients have also been obtained
using the LS solution. The pole-zero plot of this filter is shown in figure 7. Note that the zeros of
the filter are moved from their original locations; however, the filter specifications are not violated.
Relaxing the power constraints moves the zeros back to their original locations. Further, the higher
the order of the filter, the more tightly clustered its zeros are, and hence, we need more bits
to distinguish between them. However, this result reveals that the complexity of DSP processor
based filter implementations can be reduced using PCLS coefficients and appropriate multiplier
structures.
Figure 8 shows the frequency response of a 201 tap LS FIR filter with $F_p = 1/10$, $F_s = 9/80$, $R_p = 3$
dB, $R_s = -55$ dB. Figure 9 shows the pole-zero plot for this filter. The complexity reduction is
about 31%; however, the specifications of the original filter are not violated by the PCLS filter.
Finally, we observe that the complexity of FIR filters can be further reduced by numerically robust
implementations such as the cascaded form. As compared to transversal implementations, the
transfer characteristics of cascaded implementations remain closer to the original characteristics
when numerical roundoff or truncation is applied [1]. Hence, numerically robust implementations
are more amenable to the complexity reduction techniques presented in this report.
B. Booth Multiplier
Earlier we pointed out that the Booth multiplier significantly increases the code-words available for
a given code class. Hence, use of this multiplier can further reduce the power requirements of the
filter, as it allows more code-words to be used without significantly increasing the power.
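The code-word argument can be made concrete with a short Python sketch that counts how many 8-bit magnitudes can be formed with a given budget of nonzero digits, first in plain binary and then after signed-digit recoding. The sketch uses the non-adjacent form (a canonical signed-digit recoding) as a stand-in, whereas the example in this report uses non-canonical Booth recoding, and the report's own definition of a code class is not reproduced here. The function names and the 8-bit word length are assumptions chosen for illustration.

```python
def naf_digits(n):
    """Non-adjacent-form digits of n, least significant first, each in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 when n mod 4 == 1, -1 when n mod 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def binary_weight(n):
    """Number of nonzero digits of |n| in plain binary."""
    return bin(abs(n)).count("1")


def naf_weight(n):
    """Number of nonzero digits of n after signed-digit recoding."""
    return sum(1 for d in naf_digits(n) if d != 0)


def words_within(weight_fn, bits, max_nonzero):
    """Count magnitudes 1 .. 2**bits - 1 needing at most `max_nonzero` nonzero digits."""
    return sum(1 for n in range(1, 1 << bits) if weight_fn(n) <= max_nonzero)


# Compare the number of usable code-words for budgets of 1, 2 and 3 nonzero digits.
for budget in (1, 2, 3):
    plain = words_within(binary_weight, 8, budget)
    recoded = words_within(naf_weight, 8, budget)
    print(f"budget {budget}: plain binary {plain}, signed-digit {recoded}")
```

For example, 7 needs three nonzero digits in plain binary (4 + 2 + 1) but only two after recoding (8 - 1), so it becomes available under a budget of two add/subtract terms.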
Figure 10 shows the frequency transfer characteristics of the filter of figure 4 with PCLS
coefficients obtained for error definition I. The savings in complexity for this filter is 45%, in comparison
to 35% obtained for the unrecoded multiplier. Clearly, the Booth multiplier helps further reduce the power
consumption without violating the filter specifications. Note that the multiplier in this example
uses non-canonical recoding.
Figure 11 shows the frequency response of the 85-tap filter shown in figure 6 using PCLS
coefficients for error definition I and non-canonical Booth recoding. These coefficients yield a 46%
savings in complexity by reducing the number of add/subtract operations, in contrast to the 31% savings
obtained earlier. Further, in contrast to the unrecoded multiplier, the PSER for this filter is better
than the one shown in figure 6.
Finally, in figure 12 we show the frequency response of the PCLS coefficient 201-tap filter of
figure 8. The savings in complexity for this filter is 46.5%. As the original filter is an LS filter,
the constrained solution yields the optimal solution given the power constraints. Again, we note
that, in contrast to the 31% savings with the unrecoded multiplier, the reduction in complexity when using
the Booth multiplier is significantly higher. Table 2 summarizes the complexity reduction using PCLS
coefficients for the example filters considered.
V. CONCLUSION 
In this report, we approach the problem of reducing the power consumption of FIR digital filters
by formulating a strategy for finding the constrained least squares solution and defining the constraints as
the maximum allowable add/subtract operations in forming the products in equation (3). We show
that in dedicated DSP processor based architectures it is possible to reduce power by reducing the
complexity of the filters using PCLS coefficients and appropriately modified multipliers. Further,
the Booth multiplier reduces the complexity of the filter even further, thereby increasing the power
savings. In dedicated implementations, this translates to reducing the word length of the filter
coefficients. Further, we develop a mathematical framework in which the complexity of filters can be
expressed as a CLS problem, and show that rounding and truncation can be viewed as means of
obtaining PCLS coefficients depending on how we choose the LSE.
References 
[1] J. G. Proakis and D. G. Manolakis, "Digital Signal Processing: Principles, Algorithms, and
Applications," Macmillan Publishing Company, New York, 1992.
[2] I. W. Selesnick, M. Lang, and C. S. Burrus, "Constrained Least Square Design of FIR Filters
without Specified Transition Bands," IEEE Trans. Signal Processing, Vol. 44, No. 8, pp. 1879-1892,
Aug. 1996.
[3] J. W. Adams, "FIR Digital Filters with Least-Squares Stopbands Subject to Peak-Gain
Constraints," IEEE Trans. Circuits and Systems, Vol. 39, No. 4, pp. 376-388, Apr. 1991.
[4] M. Mehendale, S. D. Sherlekar, G. Venkatesh, "Techniques for Low Power Realization of FIR
Filters," Proc. 1995 Asia and South Pacific Design, pp. 447-450, Aug. 1995.
[5] M. Mehendale, S. D. Sherlekar, G. Venkatesh, "Coefficient Optimization for Low Power
Realization of FIR," Proc. 1995 IEEE Workshop on VLSI Signal Processing, pp. 352-361, Oct.
1995.
[6] D. N. Pearson, K. K. Parhi, "Low-power FIR Digital Filter Architectures," Proc. 1995 IEEE
International Symposium on Circuits and Systems-ISCAS 95, pp. 231-234, Apr. 1995.
[7] I. Koren, "Computer Arithmetic Algorithms," Prentice-Hall, Englewood Cliffs, NJ, 1993.
[8] N. H. E. Weste, K. Eshraghian, "Principles of CMOS VLSI Design: A Systems Perspective,"
2nd Edition, Addison Wesley, 1993.
[9] W. L. Winston, "Introduction to Mathematical Programming: Applications and Algorithms,"
2nd Edition, ITP, 1995.
[10] L. L. Scharf, "Statistical Signal Processing: Detection, Estimation, and Time Series Analysis,"
Addison Wesley, 1991.
Figure 1: Typical SPAR-PSER curve for FIR filters (horizontal axis: PSER)
Figure 2: A generic Booth's algorithm array multiplier
[Flowchart; caption not recovered. A tree of subproblems numbered up to 8: each branch sets one variable to 1 or to 0, and each subproblem carries the steps "Update C and c" and "Compute K". Terminal labels: "All K_ij integers, get initial solution, get LSE" (subscript illegible); "LSE > LSE" (subscript illegible); "All K_ij integers, get better solution, update LSE" (subscript illegible).]
Figure 4: Response of PCLS vs. original coefficients for a 51 tap linear phase least squares FIR
(two panels; horizontal axis: normalized frequency, Nyquist == 1; legend: original coefficients, PCLS coefficients)
Figure 5: Response of PCLS vs. original coefficients for a 33-tap filter with specifications as in
(two panels; horizontal axis: normalized frequency, Nyquist == 1; legend includes PCLS coefficients)
Table 2: Summary of reduction in number of operations using the PCLS technique.
(Table body not recovered; the only surviving column header is "Complexity Reduction".)
Figure 6: Frequency response of PCLS coefficients obtained using error definition II for an 85-tap
LS filter. F_p = 11/40, F_s = 3/10, R_p = 3 dB, R_s = -50 dB.
(two panels; horizontal axis: normalized frequency, Nyquist == 1; legend: original coefficients, PCLS coefficients)
Figure 7: Pole-zero plot of the filter in figure 6 for original and PCLS coefficients. Zeros for original
coefficients are represented by 'o' whereas zeros for the PCLS coefficient vector are shown by '+'.
Figure 8: Frequency response of PCLS coefficients obtained using error definition II for a 201-tap
LS FIR filter. F_p = 1/10, F_s = 9/80, R_p = 3 dB, R_s = -55 dB.
(horizontal axis: normalized frequency, Nyquist == 1; legend includes original coefficients)
Figure 9: Pole-zero plot of the filter of figure 8 for original and PCLS coefficients. Zeros for original
coefficients are represented by 'o' whereas zeros for the PCLS coefficient vector are shown by '+'.
Figure 10: Frequency response of the filter of figure 4 for Booth multiplier.
(two panels; horizontal axis: normalized frequency, Nyquist == 1)
Figure 11: Frequency response of the filter of figure 6 for Booth multiplier.
Figure 12: Frequency response of the filter of figure 8 for Booth multiplier.
(two panels; horizontal axis: normalized frequency, Nyquist == 1; legend: original coefficients, PCLS coefficients)