Very Low-Complexity Digital Filters Based On Computational Redundancy Reduction by Muhammad, Khurram & Roy, Kaushik
Purdue University
Purdue e-Pubs
ECE Technical Reports Electrical and Computer Engineering
3-1-1999
Very Low-Complexity Digital Filters Based On
Computational Redundancy Reduction
Khurram Muhammad
Purdue University School of Electrical and Computer Engineering
Kaushik Roy
Purdue University School of Electrical and Computer Engineering




SCHOOL OF ELECTRICAL
AND COMPUTER ENGINEERING
PURDUE UNIVERSITY
WEST LAFAYETTE, INDIANA 47907-1285
Very Low-Complexity Digital Filters Based
On Computational Redundancy Reduction
Khurram Muhammad and Kaushik Roy 
Email: khurram@ecn.purdue.edu, kaushik@ecn.purdue.edu 
School of Electrical and Computer Engineering, 
Purdue University, West Lafayette, IN 47907 
February 22, 1999. 
Abstract 
We present computation reduction techniques which can be used to obtain multiplierless
implementations of finite impulse response (FIR) digital filters. The ideas presented in this work
are also applicable to infinite impulse response (IIR) digital filters. The main idea is to remove
computational redundancy by reordering computation. Hence, the frequency response of the desired
filter is unaltered. Various approaches are presented which consider normal, differential and
hybrid coefficients. It is shown that the reordering problem can be formulated using a graph in
which vertices represent the coefficients and edges represent resources required in a computation
involving the coefficient. We present various schemes which reduce filter complexity by
specifically targeting computational redundancy inherent in normal filter implementations. Simple
polynomial run time algorithms are presented and their power and potential is demonstrated by
presenting results for large (up to 600 tap) filters which show significant reduction in the
number of add operations per coefficient. We also consider filter implementations in which shifted
values of computations can be obtained using simple interconnects without incurring extra
computation. We present a methodology by which such computation can be re-used in subsequent
computations and show that such operations further reduce computational redundancy, resulting in
extremely simple filters. It is shown that as few as 0.1 adders per filter coefficient are
required to implement the multipliers in such filters. Hence, such filters can be used in very
high-speed applications. Alternatively, using voltage scaling, one can significantly reduce the
power consumption of such filters for any desired performance.
This work was supported in part by DARPA (F33615-95-C-1625), NSF CAREER award (9501869-MIP),
Rockwell, AT&T and Lucent foundation.
I. INTRODUCTION
Future mobile radio and portable computing systems are expected to provide increased services,
faster data rates and higher processing speeds at reduced power dissipation levels. This provides
us with a motivation to explore new approaches in low-complexity design of high-performance
digital signal processing (DSP) blocks which operate at lower power levels. The classical
approach [1], [2] in complexity reduction is the use of techniques such as recursion (e.g. RLS, FFT
algorithms), multi-rate signal processing and low rank approximation. The last technique is a
widely used approach which compromises accuracy by removing less significant computations
from a given computational algorithm. Low-complexity design not only improves the speed at
which the algorithm can process data, but it also leads to low power design at the highest level
of abstraction by reducing energy consuming operations.
In this paper, we explore complexity reduction from the point of view of reducing the number
of operations by reuse of computation. The new insight provided to the subject of complexity
reduction is the reduction of computational redundancy, which is defined as the excess
computation over the minimum number of bit operations needed for a given sequence of operations.
This approach can be used to complement existing complexity reduction methods as it reduces
the number of energy consuming operations by reusing parts of computation. Although the
idea of computation reuse appears in many forms in typical DSP system implementations, this
paper develops and formalizes this approach for the case of FIR filtering in order to illustrate
its potential. The proposed techniques are also applicable to direct form IIR filter
implementations.
Much previous work has been reported on complexity reduction of FIR filters [3], [4], [5], [6],
[7], [8], [9], which considers simplified parallel implementations of FIR filters for signed
powers-of-two (SPT) implementations. The proposed methods start from a known optimal filter
solution and search for quantizations in the vicinity of the solution which give lower
implementation cost. Search algorithms have been proposed to obtain solutions to the coefficient
quantization problem for canonical signed-digit (CSD) number representation [3], [4], [5], [6],
[7], [8], [9]. The published results exist for filters of small lengths and indicate that more
than 2 adders are required per coefficient using such search. One disadvantage of these methods
is that they compromise the frequency response of the filter during their search for a lower cost
implementation. This deviation in frequency response may not be tolerable in wireless
communications where such deviations can increase multi-user interference. Further, methods which
yield optimal solutions are computationally expensive for large filters.
Low power FIR filter realizations have also been extensively studied in recent years [10], [11],
[12], [13]. The basic techniques used in power reduction constitute architectural transformations,
sub-structure sharing, quantizations, and computation reordering. The idea of computation
reordering was explored in [11] in the differential coefficient method (DCM) which reduces energy
consumption by reducing the dynamic range of computation. The main idea is to compute the
filter output using coefficient differences instead of their original values. In FIR filters, the
dynamic range of differential coefficients is smaller; hence, the width of multipliers can be
reduced.
In this paper, we present various approaches to reduce redundant computation in digital
filtering. In particular, we address methods which exploit computation reuse in the example case
of FIR filtering. The main idea is to find an ordering of coefficients which minimizes the number
of adders required in the filter implementation using graph theoretic approaches. We propose the
differential coefficients multiplierless implementation (DCMI) scheme, which is shown to
significantly reduce filter complexity. We also present an optimal solution to the DCMI problem,
which we refer to as the optimal differential coefficients multiplierless implementation (ODCMI)
scheme. In general, less than 2 add operations per coefficient are required in a 16-bit
implementation of unscaled coefficients (3 for maximally scaled coefficients) using these
approaches. We also present a methodology which further reduces computational redundancy by
re-using shifted values of a computation in the evaluation of a subsequent computation. The shift
operation can be obtained free of computational cost by using interconnect wiring. We refer to
this technique as the minimally redundant parallel filters (MRPF) technique. Results indicate
that, in general, less than 1 adder per coefficient is required for 16-bit maximally scaled
coefficients using the MRPF technique. The main contributions of this work are summarized below.
• We identify the subject of computational redundancy and present methodologies which reduce
  implementation complexity by specifically targeting this area. Consequently, the frequency
  response of the given filter is not altered.
• These techniques are independent of the choice of number representation scheme.
• DCMI/ODCMI approaches are independent of the choice of coefficient word-length.
• The framework presented in this work can account for more general problems which consider
  memory overheads (by modifying edge costs), or fixed given resources (by solving a
  graph partitioning problem).
• The presented problems are mapped to well-known and extensively studied graph/set theoretic
  problems. Hence, efficient heuristic based solutions can be used.
In summary, this paper presents many approaches which specifically target computational
redundancy reduction. One may note that there are two ways to obtain reduction in power dissipation
using these approaches. First, we get a direct reduction in power dissipation due to removal of
redundant computation. This advantage appears in the form of reduced switching activity [14]
because of relatively fewer computational operations. Second, we can obtain multiplierless
implementations, which are of immense interest in high-speed signal processing applications, and
which can also be used to further reduce power levels by employing voltage scaling [14].
This paper is organized as follows. Section II provides a general background on FIR
filtering and the DCM approach in [11]. Section III presents the DCMI approach for removing
the computational redundancy from the filter computations. Section IV presents the optimal
DCMI solution. Section V presents solutions to the hybrid problem which considers both normal
as well as differential coefficients. Numerical results are presented in section VI to quantify the
complexity reduction using the proposed methods. Section VII presents the MRPF technique.
In section VIII, we present further numerical results. Finally, section IX concludes this paper.
Consider a linear time-invariant (LTI) FIR filter of length $M$ described by an input-output
relationship of the form
$$y(n) = \sum_{i=0}^{M-1} c_i\, x(n-i) = \sum_{i=0}^{M-1} P_i^{(n)} \tag{1}$$
In this context, $c_i$ represents the $i$th coefficient and $x(n-i)$ denotes the data sample at
time instant $n-i$. $P_i^{(n)}$ represents the partial product $c_i x(n-i)$ for
$i = 0, 1, \ldots, M-1$ computed at time instant $n$. Figure 1 shows a graph $G = (V, E)$
representation of a 4-tap ($M = 4$) FIR filter in which each vertex represents a coefficient and
the edge $E_{i,j}$, $i, j = 0, 1, 2, 3$, represents the resources required to multiply a data
sample with the preceding vertex (i.e. coefficient $c_i$). If an array multiplier is used to
compute the products, $E_{i,j}$ represents the number of rows of adders required to implement the
multiplier and is given as the number of 1-bits in $c_i$. $M$ parallel multipliers are required to
obtain a parallel implementation of the $M$-tap filter. $E_{i,j}$ depends only on the number
representation scheme and the type of multiplier employed. Note that $G$ is undirected and
$E_{i,j} = E_{j,i}$ for all $i, j = 0, 1, \ldots, M-1$.
Fig. 1. Graph representation of an example filter with M = 4. 
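The partial-product view of equation 1 and the edge cost of an array multiplier can be illustrated with a minimal Python sketch. The coefficient and sample values are hypothetical, samples outside the recorded input are taken as zero, and the cost function assumes that the number of adder rows equals the number of 1-bits in the coefficient magnitude (a sign-magnitude reading, which is only one possible number representation).

```python
def partial_products(c, x, n):
    """Return [P_i^(n)] = [c_i * x(n - i)] for i = 0..M-1, with x(k) = 0 outside the input."""
    return [ci * (x[n - i] if 0 <= n - i < len(x) else 0) for i, ci in enumerate(c)]


def fir_output(c, x, n):
    """y(n) as the sum of the partial products of equation 1."""
    return sum(partial_products(c, x, n))


def one_bits(coeff):
    """Assumed edge cost: number of 1-bits in |coeff| (rows of adders in an array multiplier)."""
    return bin(abs(coeff)).count("1")


c = [3, 17, 33, 5]       # hypothetical 4-tap integer coefficients
x = [1, 2, 3, 4, 5]      # hypothetical input samples
print(fir_output(c, x, 4))
print([one_bits(ci) for ci in c])
```

In this sketch each multiplier cost depends only on its own coefficient, which is the baseline that the differential schemes discussed later try to improve on.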
With the above interpretation of the graph, the output in equation 1 can be calculated by a
tour along the graph at time instant $n$. Figure 2(a) shows one such tour in $G$ which consists
of edges $E_{i,(i+1) \bmod M}$, $i = 0, 1, \ldots, M-1$, for an $M = 8$ tap filter. The
coefficients are applied such that $c_{j+1}$ follows $c_j$, $j = 0, \ldots, M-2$. The appropriate
data sample with the corresponding coefficient are shown next to the edges. The total resources
required to compute the output given by equation 1 at time instant $n$ is given by the sum of
resources required to compute the partial products ($P_i^{(n)}$'s) along each edge in the tour.
At the next time instant, $n+1$, each data sample $x(i)$, $i = n, n-1, \ldots, n-M+1$, in the
graph is replaced by $x(i+1)$. The outputs of the filter at time instants $n-1$ and $n$ are
given as
$$y(n-1) = \sum_{i=0}^{M-1} c_i\, x(n-1-i) = \sum_{i=0}^{M-1} P_i^{(n-1)}$$
$$y(n) = \sum_{i=0}^{M-1} c_i\, x(n-i) = \sum_{i=0}^{M-1} P_i^{(n)}$$
Next, let us view the first order differential coefficient method (DCM) in the context of this
graph. The main idea in the DCM approach is to compute coefficient differences
$\Delta c_{i+1} = c_{i+1} - c_i$, $i = 0, 1, \ldots, M-2$, and to use these to compute the partial
products. Let $E_{i,i+1}$ represent the resources required to compute the product of
$c_{i+1} - c_i$ with the corresponding data sample $x(n-i-1)$ at time instant $n$ for
$i = 0, 1, \ldots, M-2$. Then each vertex can be replaced by the differential coefficient
$c_{i+1} - c_i$ except for $c_0$. The partial product $P_i^{(n)}$ is computed by adding
$(c_i - c_{i-1})\,x(n-i)$ to $P_{i-1}^{(n-1)}$. $P_i^{(n)}$ thus obtained is stored in memory for
computing $P_{i+1}^{(n+1)}$ in future and removed subsequently. Hence, multiplication of $c_i$
with $x(n-i)$ is replaced by addition of $P_{i-1}^{(n-1)}$ with the product
$(c_i - c_{i-1})\,x(n-i)$. Figure 3 shows the implementation of the example filter. The overhead
add operations are performed by using memory which stores the partial products. The cost of
overhead add operations can be accounted for by adding 1 to
each edge in $G$. Since most filters of interest are symmetric, the latter half of the filter
"folds-over" as shown in figure 3. Note that the implementation of figure 3 yields $2P_i^{(n)}$;
therefore, one can obtain $P_i^{(n)}$ from this value by a simple right shift operation and store
these values in a memory for use at the next time sample.
The authors noticed in [11] that in shared-multiplier based implementations, this approach
reduced power due to smaller word-lengths in the multiplication operation. Higher orders of
differences may also be considered; however, to understand our approach for obtaining DCMI
filters, it is enough to consider the first order DCM explained above.
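The first order DCM recurrence, $P_i^{(n)} = (c_i - c_{i-1})\,x(n-i) + P_{i-1}^{(n-1)}$, can be checked against direct filtering with a short Python sketch. The function names and test values are hypothetical, and samples before the start of the input are assumed to be zero; this is an illustration of the recurrence, not the implementation of [11].

```python
def sample(x, k):
    """x(k), taken as 0 outside the recorded samples (an assumption of this sketch)."""
    return x[k] if 0 <= k < len(x) else 0


def fir_direct(c, x):
    """Direct evaluation of equation 1 for n = 0..len(x)-1."""
    return [sum(ci * sample(x, n - i) for i, ci in enumerate(c)) for n in range(len(x))]


def fir_dcm(c, x):
    """Same outputs using first-order coefficient differences and stored partial products."""
    M = len(c)
    prev = [0] * M                      # prev[i] holds P_i^(n-1); zero before the input starts
    ys = []
    for n in range(len(x)):
        cur = [c[0] * sample(x, n)]     # P_0^(n) is computed directly
        for i in range(1, M):
            # P_i^(n) = (c_i - c_{i-1}) x(n-i) + P_{i-1}^(n-1)
            cur.append((c[i] - c[i - 1]) * sample(x, n - i) + prev[i - 1])
        ys.append(sum(cur))
        prev = cur                      # memory of partial products for the next time instant
    return ys


c = [3, 17, 33, 5]       # hypothetical coefficients
x = [1, 2, 3, 4, 5]      # hypothetical input samples
print(fir_dcm(c, x) == fir_direct(c, x))
```

The multiplications now involve only the differences $c_i - c_{i-1}$, at the price of one overhead addition and one stored value per partial product.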
Fig. 2. Graph representation of an example filter with M = 8. 
Fig. 3. Implementation of the 8-tap example filter.
III. THE DCMI APPROACH
Consider the tour in figure 2(b). Suppose that this order yields differential coefficients which
are simpler to implement than the order shown in figure 2(a) (e.g. they may be powers-of-two),
and hence, the implementation so obtained has lower complexity. Note that in this example, the
ordering is given by $c_0, c_4, c_5, c_1, c_2, c_6, c_7, c_3$. The corresponding data sample
$x(n-i)$ migrates from the edge $E_{i,j}$ to $E_{i,k}$, such that if $T'$ is the new tour,
$E_{i,k} \in T'$, $k \neq j$. This is shown in figure 2 where $x(n-i)$ now refers to the new edge
originating at $c_i$. With the new ordering of figure 2(b), we get the partial products at various
time instants in the order shown in table I. For simplicity in notation, let
$K = \{k_0, k_1, \ldots, k_{M-1}\}$ be the set representing the indices of coefficients in the new
ordering. Hence, for the example in figure 2(b), $K = \{0, 4, 5, 1, 2, 6, 7, 3\}$.
TABLE I
PARTIAL PRODUCTS AT DIFFERENT TIME INSTANTS.
Then, the new differential coefficients for the order sequence in $K$ are given by
$\Delta c_{k_{i+1}} = c_{k_{i+1}} - c_{k_i}$, $i = 0, 1, \ldots, M-2$, and we can calculate the
partial products using
$$P_{k_i}^{(n)} = (c_{k_i} - c_{k_{i-1}})\, x(n-k_i) + P_{k_{i-1}}^{(n-k_i+k_{i-1})}$$
for $i = 1, \ldots, M-1$ (the first partial product $P_{k_0}^{(n)}$ is computed directly as
$c_{k_0} x(n-k_0)$), where $k_i \in \{0, 1, \ldots, M-1\}$. As an example, consider the
computation of $P_4^{(n)} = (c_4 - c_0)\,x(n-4) + P_0^{(n-4)}$. Similarly, we can compute
$P_4^{(n-1)} = (c_4 - c_0)\,x(n-5) + P_0^{(n-5)}$. Figure 4 shows the implementation details of an
8-tap filter using the DCMI approach assuming that it is asymmetric (in the symmetric filter case,
half of the filter "folds-over" similar to the example shown in figure 3). Figure 4(a) reveals the
structure of the DCMI filter. Note that the implementation of this example filter requires
reference to future values of partial products ($P_5^{(n+4)}$ and $P_7^{(n+4)}$). The partial products




Fig. 4. Implementation of the 8-tap example filter using DCMI. 
instant n, only M such products are  required. 
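The reordered recurrence can be explored with a small Python sketch that lists, for a given ordering $K$, which stored partial product each $P_{k_i}^{(n)}$ reuses and whether that product lies in the past or the future. The coefficients and the test signal are hypothetical; only the ordering is taken from figure 2(b).

```python
def reference_offsets(K):
    """For each consecutive pair in K, return (k_i, k_prev, offset).

    P_{k_i}^(n) reuses P_{k_prev}^(n + offset) with offset = k_prev - k_i;
    a positive offset means a future partial product is referenced.
    """
    return [(K[i], K[i - 1], K[i - 1] - K[i]) for i in range(1, len(K))]


def recurrence_holds(c, K, x, n):
    """Check P_{k_i}^(n) = (c_{k_i} - c_{k_prev}) x(n - k_i) + P_{k_prev}^(n + offset)."""
    def P(i, m):                       # partial product P_i^(m) = c_i * x(m - i)
        return c[i] * x(m - i)
    return all(
        P(k, n) == (c[k] - c[kp]) * x(n - k) + P(kp, n + off)
        for k, kp, off in reference_offsets(K)
    )


K = [0, 4, 5, 1, 2, 6, 7, 3]                       # ordering of figure 2(b)
c = [7, -3, 12, 5, 9, -8, 4, 1]                    # hypothetical coefficients
x = lambda k: (3 * k * k - 5 * k + 2) % 11 - 5     # hypothetical signal, defined for any integer k
future = [(k, kp, off) for k, kp, off in reference_offsets(K) if off > 0]
print(future)                          # (k_i, k_prev, offset) triples that look into the future
print(recurrence_holds(c, K, x, n=20))
```

A positive offset appears whenever the ordering steps from a larger index to a smaller one, which is why the reordered structure is re-timed before it is implemented.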
We can simplify the implementation shown in figure 4(a) by re-timing the filter. The first step
is to move the delay elements to the multiplier inputs. Consequently, the $i$th branch containing
a multiplier has $i$ delay elements on it after the first step is completed. Next, we can move
the delay elements further down such that the multiplier precedes the delay elements, and
then move them even further down such that the overhead add operations also precede these
delay elements as shown in figure 4(c). The entries in table I show the partial products needed to
compute the output at different time instants. These entries are helpful in determining the correct
partial product terms after moving the delay elements across the overhead partial product add
operations. Finally, the delay elements are moved out of these branches to get the implementation
shown in figure 4(d). Using the entries shown in table I we obtain the adder network for the
re-timed implementation shown in figure 4(d). Hence, the filter output is available with an extra
delay equal to $M$ adders due to the overhead add network. However, the structure of figure
4(d) can be pipelined to eliminate this delay. The overhead memory needed to store the partial
product values is $M$ for any coefficient ordering.
The DCMI approach computes the set $K = \{k_0, k_1, \ldots, k_{M-1}\}$, such that the coefficient
sequence $c_{k_0}, c_{k_1}, \ldots, c_{k_{M-1}}$ yields the least number of resources required in
the implementation. In order to compute $K$, we use the graph $G = (V, E)$ (see figure 1) in which
the set $V$ represents vertices $\{c_0, c_1, \ldots, c_{M-1}\}$ for an $M$-tap filter and $E$
represents the edges, $E_{i,j}$, for $i, j = 0, 1, \ldots, M-1$. The edge $E_{k_j,k_{j+1}}$
connects vertex $c_{k_j}$ to $c_{k_{j+1}}$ and represents the number of adders required to
represent the difference $c_{k_{j+1}} - c_{k_j}$ in a given number representation scheme. Hence,
the values assigned to the edges take into consideration the scheme used for number
representation. As an example, if sign-magnitude (SM) number representation is used,
$c_{k_j} = 17$ and $c_{k_{j+1}} = 33$, then $E_{k_j,k_{j+1}}$ is assigned a value of 1 because
$c_{k_{j+1}} - c_{k_j} = 33 - 17 = 16$ requires only one adder in the implementation of the
multiplier. Note that $G$ is undirected and complete [15]. There are $M$ elements in $V$ and
$M(M-1)/2$ elements in $E$. Hence, $|V| = M$ and $|E| = M(M-1)/2$ independent of the word-length
or the number representation scheme used in the filter implementation.
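A minimal Python sketch of the edge-weight construction follows. It assumes integer coefficients and a sign-magnitude cost equal to the number of 1-bits in the magnitude of the difference; the function names and the three-coefficient example are hypothetical, with 17 and 33 taken from the text.

```python
def adder_cost(value):
    """Assumed cost model: number of 1-bits in |value| (sign-magnitude representation)."""
    return bin(abs(value)).count("1")


def difference_graph(c):
    """Symmetric weight matrix E with E[i][j] = cost of the difference c_j - c_i."""
    M = len(c)
    return [[0 if i == j else adder_cost(c[j] - c[i]) for j in range(M)] for i in range(M)]


c = [17, 33, 5]                      # 17 and 33 as in the text, 5 is a hypothetical third tap
E = difference_graph(c)
num_edges = len(c) * (len(c) - 1) // 2
print(E[0][1])                       # cost of 33 - 17 = 16
print(num_edges)
```

Because the cost depends only on the magnitude of the difference, the matrix is symmetric, which matches the undirected graph described above.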
The implementation which constructs a tour with least number of resources (total number
of adders) can be obtained by computing the Hamiltonian path [15] with smallest weight in $G$.
A Hamiltonian path is defined as a path which visits each vertex exactly once. In our work, we
compute the Hamiltonian cycle instead of the Hamiltonian path. A Hamiltonian cycle is a simple
cycle [15] in which each vertex in $G$ is visited. We can remove any link in a Hamiltonian cycle to
obtain a Hamiltonian path. This offers us added convenience as we can select the first coefficient
such that the first column computation ($P_{k_0}^{(j)}$, for $j = 0, 1, \ldots, n$) requires only
one adder, rather than a full multiplier. This is always possible if one of the coefficients is
always fixed to a known power-of-two value and the remaining coefficients are calibrated with
respect to it. For example, if $c_4$ in the filter in figure 2 is fixed at $2^{15}$ in a 16-bit SM
representation scheme, and the graph in figure 2(b) represents the minimum weight Hamiltonian
cycle for this filter, then the DCMI implementation would use the sequence
$\{c_4, c_5, c_1, c_2, c_6, c_7, c_3, c_0\}$, thereby avoiding the use of a full multiplier for
the first partial product column computation. Hence, Hamiltonian cycle computation is more
advantageous.
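Cutting the cycle at a power-of-two coefficient amounts to rotating the cycle, as the following Python sketch suggests. The helper names and the coefficient values other than $c_4 = 2^{15}$ are hypothetical.

```python
def is_power_of_two(value):
    """True for 1, 2, 4, 8, ... (positive integers with a single 1-bit)."""
    return value > 0 and value & (value - 1) == 0


def cycle_to_path(cycle, c):
    """Rotate a Hamiltonian cycle so that the path starts at a power-of-two coefficient.

    If no coefficient magnitude is a power of two, the cycle is returned unchanged.
    """
    for pos, idx in enumerate(cycle):
        if is_power_of_two(abs(c[idx])):
            return cycle[pos:] + cycle[:pos]
    return list(cycle)


cycle = [0, 4, 5, 1, 2, 6, 7, 3]                 # cycle of figure 2(b)
c = [3, 5, 7, 9, 1 << 15, 11, 13, 15]            # hypothetical values, only c_4 is a power of two
print(cycle_to_path(cycle, c))
```

The rotation removes the link that enters the chosen coefficient, so the path cost is the cycle cost minus the weight of that one edge.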
The minimum weight Hamiltonian cycle can be computed by employing one of the known methods of
solving the traveling salesman problem (TSP) [15], [16]. In our work, we use two well-known
approaches to obtain the Hamiltonian cycle for a given graph. The first approach uses a greedy
strategy which starts at a given node and extends the cycle in a depth-first search (DFS) manner.
Initially, all nodes are colored white and the start node is initialized to a given node. Next, it
looks at the white colored neighboring nodes of the given start node and selects the one which can
be reached using the smallest weight edge (minimum resources). The selected node becomes the start
node in the next step and is colored black. This process is repeated until all the nodes are
colored black. Since the graph is complete, this method produces a tour by visiting each node
exactly once. The complexity of this algorithm is $O(|V| + |E|) = O(M^2)$ [15]. This algorithm is
repeated by initializing the start node to each vertex in $V$. Hence, the complexity of the greedy
approach used in this work is $O(M^3)$.
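The greedy strategy described above can be sketched in Python as a nearest-neighbor tour that is restarted from every vertex. The weight matrix is a hypothetical example (it corresponds to the 1-bit costs of the differences of the coefficients 1, 9, 3, 11); ties are broken in favor of the lowest index, which is a choice of this sketch rather than something stated in the text.

```python
def greedy_tour(E, start):
    """Extend the tour from `start`, always moving to the cheapest unvisited (white) node."""
    M = len(E)
    visited = [False] * M
    visited[start] = True
    tour = [start]
    while len(tour) < M:
        u = tour[-1]
        v = min((j for j in range(M) if not visited[j]), key=lambda j: E[u][j])
        visited[v] = True            # color the selected node black
        tour.append(v)
    return tour


def tour_cost(E, tour):
    """Total weight of the closed tour (the last node links back to the first)."""
    return sum(E[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def best_greedy_tour(E):
    """Repeat the greedy construction from every start node and keep the cheapest tour."""
    return min((greedy_tour(E, s) for s in range(len(E))), key=lambda t: tour_cost(E, t))


# Hypothetical symmetric weights: 1-bit counts of |c_j - c_i| for c = [1, 9, 3, 11].
E = [[0, 1, 1, 2],
     [1, 0, 2, 1],
     [1, 2, 0, 1],
     [2, 1, 1, 0]]
best = best_greedy_tour(E)
print(best, tour_cost(E, best))
print(tour_cost(E, [0, 1, 2, 3]))    # natural coefficient order, for comparison
```

Each construction scans at most $M$ candidates in each of $M$ steps, and the outer loop over start nodes gives the cubic behavior mentioned above.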
The second popular approach used for solving the TSP is the heuristic algorithm due to Lin
and Kernighan [16] (LK algorithm). The basic approach in this method is to complete a tour and
then perform a local search to improve the tour. When an improvement is found, the algorithm
does not necessarily use it immediately, but continues its search hoping to find an even greater
improvement.
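The idea of improving a completed tour by local search can be illustrated with a much simpler stand-in. The Python sketch below is plain 2-opt (segment reversal), not the Lin-Kernighan algorithm: it accepts the first improving move it finds and has no look-ahead. The weight matrix is hypothetical.

```python
def tour_cost(E, tour):
    """Total weight of the closed tour."""
    return sum(E[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))


def two_opt(E, tour):
    """Repeatedly reverse a segment of the tour while doing so lowers the total cost."""
    best = list(tour)
    improved = True
    while improved:
        improved = False
        for i in range(1, len(best) - 1):
            for j in range(i + 1, len(best)):
                candidate = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
                if tour_cost(E, candidate) < tour_cost(E, best):
                    best, improved = candidate, True
    return best


# Hypothetical symmetric edge weights for a 4-tap example.
E = [[0, 1, 1, 2],
     [1, 0, 2, 1],
     [1, 2, 0, 1],
     [2, 1, 1, 0]]
start = [0, 1, 2, 3]
improved_tour = two_opt(E, start)
print(improved_tour, tour_cost(E, improved_tour))
```

The loop terminates because every accepted move strictly lowers the cost and there are finitely many tours.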
A. Second Order DCMI
Similar to the DCM [11], we can use higher order differential coefficients to define higher order
DCMI. Let $\delta^1_{i-1,i}$ represent the first order coefficient difference, $c_i - c_{i-1}$,
and $\delta^2_{i-2,i}$ represent the second order coefficient difference,
$(c_i - c_{i-1}) - (c_{i-1} - c_{i-2}) = c_i - 2c_{i-1} + c_{i-2}$. Then, it can be shown that the
partial product $P_i^{(n)}$ can be calculated as [11]
$$P_i^{(n)} = \delta^2_{i-2,i}\, x(n-i) + 2P_{i-1}^{(n-1)} - P_{i-2}^{(n-2)}$$
where $i = 2, 3, \ldots, M-1$. Hence, using two overhead storage and two addition operations per
partial product, we can implement the second order DCM as explained in detail in [11]. It can
be verified that the second-order DCMI can be obtained by computing
$$P_{k_i}^{(n)} = \delta^2_{k_{i-2},k_i}\, x(n-k_i) + 2P_{k_{i-1}}^{(n-k_i+k_{i-1})} - P_{k_{i-2}}^{(n-k_i+k_{i-2})}$$
for $i = 2, 3, \ldots, M-1$, where $\delta^2_{k_{i-2},k_i} = c_{k_i} - 2c_{k_{i-1}} + c_{k_{i-2}}$
and the $k_i$'s give the ordering sequence for the second-order DCMI. Hence, the second order DCMI
requires twice as much storage and add overheads as the first order DCMI. However, similar to the
first-order DCMI, by choosing $c_{k_0}$ to be a power-of-two, we can eliminate the full
multiplication in the computation of the first column of partial products.
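A quick numerical check of the second order recurrence is sketched below in Python. It uses one algebraically equivalent arrangement, $P_i^{(n)} = (c_i - 2c_{i-1} + c_{i-2})\,x(n-i) + 2P_{i-1}^{(n-1)} - P_{i-2}^{(n-2)}$, which follows directly from the definition of the second order difference; the exact arrangement used in [11] may differ, and the coefficients and signal are hypothetical.

```python
def second_order_holds(c, x, n):
    """Check the second-order recurrence for every i = 2..M-1 at time instant n."""
    def P(i, m):                       # partial product P_i^(m) = c_i * x(m - i)
        return c[i] * x(m - i)
    return all(
        P(i, n) == (c[i] - 2 * c[i - 1] + c[i - 2]) * x(n - i)
        + 2 * P(i - 1, n - 1) - P(i - 2, n - 2)
        for i in range(2, len(c))
    )


c = [7, -3, 12, 5, 9, -8, 4, 1]                    # hypothetical coefficients
x = lambda k: (3 * k * k - 5 * k + 2) % 11 - 5     # hypothetical signal, defined for any integer k
print(all(second_order_holds(c, x, n) for n in range(10)))
```

The factor 2 is a left shift, so each partial product costs two overhead additions and two stored values, in line with the overhead stated above.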
The second-order DCMI problem cannot be solved using the graph representation presented
in section III. This is because in the second order DCMI, a second-order differential coefficient,
$\delta^2_{i-2,i}$, requires reference to three coefficients, $c_i$, $c_{i-1}$ and $c_{i-2}$.
Hence, if we were to use an edge to express the number of adders required to implement a
multiplier with $\delta^2_{i,j}$ ($i \neq j$) at one input, we would require counting the number
of adders required to implement $\delta^2_{i,j}$ in the given number representation scheme. For a
given $M$-tap filter, the second order differential coefficients comprising $c_i$ and $c_j$ as the
end points would be $c_j - 2c_k + c_i$, where $k = 0, 1, \ldots, M-1$, $k \neq i$, $k \neq j$,
and hence, it would require $M-2$ edges between coefficients $c_i$ and $c_j$ in the graph.
Therefore, the graph representation of section III needs to be modified to account for all
possible ($M-2$) intermediate nodes between the given two nodes. Figure 5 shows the modified graph
for a 4-tap filter for the second order DCMI problem. The vertices are represented by continuous
circles. Each pair of coefficients has $M-2$ edges between them. This is shown using dashed
circles which indicate the intermediate vertex corresponding to the edge. Note that the dashed
circles do not represent vertices; rather, these illustrate the vertex considered to be the
intermediate vertex in the particular edge. Hence, the modified graph, $\bar{G} = (V, \bar{E})$,
for the second order DCMI can be obtained from $G$ by inserting $M-2$ edges between each pair of
vertices. In the new graph, $|V| = M$ as in $G$, and $|\bar{E}| = M(M-1)(M-2)/2$.
Fig. 5. Graph ($G$ and $\bar{G}$) representations of an example filter with $M = 4$.
Next, we need to formulate rules for traversing $\bar{G}$. Let the edge between vertices $c_i$
and $c_j$, with intermediate vertex $c_k$, be represented as $E_{i,k,j} \in \bar{E}$,
$i, j, k = 0, 1, \ldots, M-1$, $i \neq j \neq k$. Now, if $E_{i,k,j}$ is traversed, this implies
that we have selected the coefficient order $c_i$ followed by $c_k$ followed by $c_j$. Hence,
$c_i$ and $c_k$ have already been visited and no subsequent edge may be visited which has $c_i$ or
$c_k$ as an intermediate or terminal node. The only exception to this rule is when all vertices
have already been visited and the tour is completed by one more step. In that case, the first node
from which the tour computation was initially started must be visited as the terminal vertex.
Consider the bold path in figure 5, for example. This path shows a valid tour represented
by the coefficient sequence $c_0, c_1, c_2, c_3$ and contains two edges $E_{0,1,2}$ and
$E_{2,3,0}$. Then, after arriving at $c_2$, we cannot visit $c_1$ because it has already been
visited through $E_{0,1,2}$. Further, $c_0$ can only be visited as the terminal node in order to
complete the tour, but it cannot be used as an intermediate node. Hence, $E_{2,3,0}$ is the only
edge which can be visited without violating the $\bar{G}$ traversal rules. Therefore, for any
$k \in \{0, 1, \ldots, M-1\}$, if $c_k$ has been visited, this implies that before the next edge
is traversed, all edges in the graph with $c_k$ as the intermediate vertex must be disallowed.
Similarly, all edges originating from the vertex $c_k$ must also be disallowed.
With the above rules, it is possible to devise a greedy algorithm which starts at a given
initial node and constructs a tour which visits all the vertices in the graph based on the
best selection at the given time. At each step, the algorithm keeps track of three vertices,
start, middle and last. This corresponds to the coefficient order
$c_{\mathrm{start}}, c_{\mathrm{middle}}, c_{\mathrm{last}}$. Initially, all nodes are colored
white and a user selected node, initial, is taken as the start node. Next, a decision is taken at
$c_{\mathrm{start}}$ and the best edge $E_{\mathrm{start},\mathrm{middle},\mathrm{last}}$ is
selected such that $c_{\mathrm{middle}}$ and $c_{\mathrm{last}}$ are white nodes. Next,
$c_{\mathrm{middle}}$ is marked black and it becomes the next start node. Similarly,
$c_{\mathrm{last}}$ becomes the new middle node and the search for the next best
$c_{\mathrm{last}}$ is performed such that the start and middle nodes are already known and
$c_{\mathrm{last}}$ must be a white node other than initial. If no such node can be found, then
the tour is completed by selecting $c_{\mathrm{last}} = c_{\mathrm{initial}}$. In terms of
figure 5, if the tour shown in bold were to be the best tour, the edge sequence visited will be
$E_{0,1,2}$, $E_{1,2,3}$ and $E_{2,3,0}$ which corresponds to the coefficient order
$c_0, c_1, c_2$ and $c_3$. The algorithm is repeated by initializing the start node to each
vertex in $V$.
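The greedy traversal with start, middle and last nodes might look as follows in Python. This is a sketch under assumptions: the edge weight is taken as the number of 1-bits in the magnitude of $c_j - 2c_k + c_i$, ties are broken by lowest index, at least three coefficients are required, and the example coefficients are hypothetical.

```python
def adder_cost(value):
    """Assumed cost model: number of 1-bits in |value|."""
    return bin(abs(value)).count("1")


def edge_cost(c, i, k, j):
    """Weight of E_{i,k,j}: adders for the second-order difference c_j - 2 c_k + c_i."""
    return adder_cost(c[j] - 2 * c[k] + c[i])


def greedy_second_order(c, initial):
    """Build a coefficient order from `initial`, choosing the cheapest allowed edge each step."""
    white = [v for v in range(len(c)) if v != initial]
    # First decision: best edge E_{start,middle,last} with white middle and last nodes.
    middle, last = min(
        ((k, j) for k in white for j in white if k != j),
        key=lambda kj: edge_cost(c, initial, kj[0], kj[1]),
    )
    order = [initial, middle, last]
    white = [v for v in white if v not in (middle, last)]
    while white:
        start, middle = order[-2], order[-1]   # old middle becomes start, old last becomes middle
        last = min(white, key=lambda j: edge_cost(c, start, middle, j))
        order.append(last)
        white.remove(last)
    return order


def tour_cost(c, order):
    """Sum of edge weights, including the closing edge whose last node is the initial node."""
    path = order + [order[0]]
    return sum(edge_cost(c, path[t], path[t + 1], path[t + 2]) for t in range(len(path) - 2))


c = [1, 2, 3, 4]                     # hypothetical coefficients with zero second differences
order = greedy_second_order(c, 0)
print(order, tour_cost(c, order))
```

For the order $c_0, c_1, c_2, c_3$ the cost function visits $E_{0,1,2}$, $E_{1,2,3}$ and $E_{2,3,0}$, the same edge sequence as in the example above.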
We can also use a better tour computation scheme such as the LK-algorithm. In this case,
the basic LK-algorithm must be modified such that it does not violate the graph traversal rules
outlined above. The main idea is to perform a local search around a given tour to find a better
tour. This is done by forming an S-path shown in figure 6. Starting at a node $u$, the algorithm
tries improvement on both its neighboring edges. When a node $w$ is located such that the cost of
the S-path shown in figure 6(b) is smaller than the cost of the initial tour $T$, the S-path is
converted into another tour which updates the best tour found so far. Figure 6(c) shows the
improved tour obtained using the S-path. We note that the modification required in this algorithm
is to move the intermediate nodes $p$ and $r$ so that no rules are violated while improving the
tour. Note that this is not the only possible approach. Better solutions may be obtained by
considering the best edge $E_{u,q,w}$, where $q$ may be any node in $T$ other than $u$ and $w$.
The tour may be completed such that none of the graph traversal rules are violated on $\bar{G}$.
The approach presented in this paper is the simplest way to modify the LK-algorithm such that it
can execute on $\bar{G}$.
Fig. 6. Tour improvement in modified LK-algorithm.
In the previous sections, we considered DCMI solutions based on cycles in the graphs $G$ and
$\bar{G}$. The primary motivation to pursue cycle based solutions was the observation that DCM can
be viewed as a special case of the general framework provided in the previous sections. A major
advantage of a cycle based solution is the regularity of the resulting solution and implementation.
This can be advantageous in some filter implementations. For example, one may consider an
application where the filter exhibits some degree of adaptation so that coefficients are
differenced "on-the-fly" and multiplications are performed by serial additions using a small fixed
number of adders. Then, the coefficients can be stored in the order dictated by $K$ and
sequentially accessed from a coefficient memory of size $M$. Hence, a cycle based solution offers
an advantage of smaller overhead of coefficient storage.
In the case of a parallel implementation of the filter, coefficients are pre-computed and
$P_i^{(n)}$'s are obtained by adding appropriate previously computed partial products. Hence, we
are not restricted to cycle based solutions, since a differential coefficient involving a given
$c_i$ can be obtained using any $c_i - c_j$, $j \neq i$. With this observation, we explore the
optimal solution to the DCMI problem using the minimum spanning tree (MST) [15] on $G$. The MST of
a graph $G = (V, E)$ is defined as an acyclic subset of edges in $E$ which connects all of the
vertices in $V$ such that the sum of weights of these edges is minimized. Hence, the MST on $G$
gives a coefficient order such that all the vertices of the graph are visited once and the total
resources required to implement the differential filter are minimum.
The best known algorithm to compute the MST executes in $O(|E| + |V| \log |V|) = O(M^2)$ run
time by employing Fibonacci heaps [15]. A simple algorithm by Prim runs in $O(|V|^2 \log |V|)$ time
[15]. Hence, the coefficient ordering obtained through an MST requires the least amount of add
operations to compute the output. We refer to this solution as the optimal differential
coefficients for multiplierless implementation (ODCMI) technique. For the problem of using the
least amount of resources in forming the partial products in equation 1, the MST yields the
optimal solution to the problem of minimizing the number of add operations given that all the
coefficients are differential coefficients. The major advantages of this approach are the
simplicity of the algorithm and small run-time. Further, it inherits all the benefits of graph
representation outlined in section I.
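A minimal Python sketch of the MST computation on the complete graph follows. It uses the array-based form of Prim's algorithm, which scans all vertices in each step and therefore runs in $O(M^2)$ on a dense weight matrix; this is a choice of the sketch and not necessarily the variant used by the authors. The weight matrix and the choice of root are hypothetical.

```python
def prim_mst(E, root=0):
    """Return parent[v] for every vertex of the MST of the complete graph E (root has None)."""
    M = len(E)
    in_tree = [False] * M
    best = [float("inf")] * M        # cheapest known edge connecting each vertex to the tree
    parent = [None] * M
    best[root] = 0
    for _ in range(M):
        u = min((v for v in range(M) if not in_tree[v]), key=lambda v: best[v])
        in_tree[u] = True
        for v in range(M):
            if not in_tree[v] and E[u][v] < best[v]:
                best[v] = E[u][v]
                parent[v] = u
    return parent


def tree_weight(E, parent):
    """Total adder cost of the differential coefficients c_child - c_parent."""
    return sum(E[v][p] for v, p in enumerate(parent) if p is not None)


# Hypothetical symmetric weights: 1-bit counts of |c_j - c_i| for c = [1, 9, 3, 11].
E = [[0, 1, 1, 2],
     [1, 0, 2, 1],
     [1, 2, 0, 1],
     [2, 1, 1, 0]]
parent = prim_mst(E)
print(parent, tree_weight(E, parent))
```

For this small example the tree weight equals the cost of the best Hamiltonian path, since every edge of the tree already has the minimum possible weight of 1.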
Figure 7 shows the implementation of an FIR filter using the MST solution (MST shown
in figure 7(b)). The coefficient sequence applied using the MST of G is obtained by applying
the differential coefficients $c_{child} - c_{parent}$, where the $c_{child}$ and $c_{parent}$ pairs consist of all possible
parent-child pairs in the MST (leaf-nodes have no child). The parent of the root node of the
MST is defined as 0. For example, the MST in figure 7(b) yields the coefficients $c_0$, $c_1 - c_0$, $c_2 - c_3$,
$c_3 - c_0$, $c_4 - c_1$, $c_5 - c_2$, $c_6 - c_3$, $c_7 - c_3$, as shown in figure 7(a). Let $P = \{p_0, p_1, \ldots, p_{M-1}\}$
and $Q = \{q_0, q_1, \ldots, q_{M-1}\}$ denote the index sets of parent and child nodes, respectively. In the
above example, $P = \{0, 0, 3, 0, 1, 2, 3, 3\}$ and $Q = \{0, 1, 2, 3, 4, 5, 6, 7\}$. Then, the partial products
can be calculated using

$$P_{q_i}^{(n)} = (c_{q_i} - c_{p_i})\, x(n - q_i) + P_{p_i}^{(n - q_i + p_i)}, \qquad (7)$$

for $i = 0, 1, \ldots, M - 1$, $i \neq$ root. $P_{root}^{(n)}$ is directly calculated. As explained earlier, we ensure
that at least one coefficient is set to a power-of-two, and therefore, this operation does not
require full multiplication. Figure 7(c) shows the relationship of the $P_i^{(n)}$'s, $i = 0, 1, \ldots, M - 1$, with
the previously calculated partial product values in terms of memory storage. Note that the
implementation obtained is non-causal and hence output can be available after delay equal to
$\max(p_i - q_i, 0)$ for $i = 0, 1, \ldots, M - 1$. Figure 7(a) shows the implementation of the ODCMI
filter for the MST solution shown in 7(b). The partial products required to compute $P_i^{(n)}$,
$i = 0, 1, \ldots, M - 1$, are shown in the MST solution shown in 7(c).
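As a sanity check of equation 7, the following minimal Python sketch evaluates the partial products through the parent-child differences of the example tree and compares the filter output with a direct evaluation. The helper names, the test data, and the convention that x(n) is zero outside its stored range are assumptions of this sketch; it is an offline check, so the non-causal look-ahead is simply read from the stored input.

```python
PARENT = [0, 0, 3, 0, 1, 2, 3, 3]   # index set P from the 8-tap example
ROOT = 0                            # the root coefficient is applied directly


def x_at(x, k):
    """Input sample x(k), taken as zero outside the stored range (assumption)."""
    return x[k] if 0 <= k < len(x) else 0


def partial_differential(c, x, parent, root, q, n):
    """P_q^(n) = (c_q - c_p) x(n - q) + P_p^(n - q + p), with p = parent[q]."""
    if q == root:
        return c[root] * x_at(x, n - root)
    p = parent[q]
    correction = partial_differential(c, x, parent, root, p, n - q + p)
    return (c[q] - c[p]) * x_at(x, n - q) + correction


def fir_output_differential(c, x, parent, root, n):
    """Filter output y(n) as the sum of all differentially formed partial products."""
    return sum(partial_differential(c, x, parent, root, q, n)
               for q in range(len(c)))
```

The recursion terminates because every parent chain in a tree ends at the root; a hardware realization would instead read the previously computed partial product from memory, as figure 7(c) indicates.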
As explained in section III, we can considerably simplify the implementation of figure 7(a) by carrying
out the same re-timing steps. The resulting filter implementation is shown in figure 7(d). We note that
the overhead memory needed to store the previous partial products is M as seen in figure 7(d).
Due to the tree structure of the solution, the sequence of overhead add operations is shorter than in
the implementation of figure 4(c), where we always get a sequence of M - 1 overhead add operations.
In general, the sequence is $O(\log M)$ in a tree solution. Again, similar to section III, we can
pipeline the series of overhead add operations which arise due to the use of the differential scheme.
The number of registers required to pipeline the ODCMI filter is smaller than that required for the
DCMI filter.
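The difference in overhead-add chain length between a tour and a tree can be illustrated with a short Python sketch. The function name is hypothetical, and the chain parent list used for comparison is an assumed stand-in for a DCMI tour that visits the coefficients in index order.

```python
def chain_lengths(parent, root):
    """Number of overhead adds on the path from each vertex back to the root."""
    depth = {}

    def walk(v):
        if v == root:
            return 0
        if v not in depth:
            depth[v] = 1 + walk(parent[v])
        return depth[v]

    return [walk(v) for v in range(len(parent))]


# Parent indices of the 8-tap MST example, P = {0, 0, 3, 0, 1, 2, 3, 3}.
mst_depths = chain_lengths([0, 0, 3, 0, 1, 2, 3, 3], root=0)

# A tour in index order behaves like a chain: each vertex hangs off the previous one.
tour_depths = chain_lengths([0, 0, 1, 2, 3, 4, 5, 6], root=0)

print(max(mst_depths), max(tour_depths))  # 3 7
```

For the example tree the longest chain has 3 overhead adds, against M - 1 = 7 for the chain, which is the effect the pipelining argument relies on.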
The methods presented in earlier sections considered implementations which employ only
differential coefficients. However, one may also consider a combination or hybrid solution which
combines various orders of solutions. For example, one may consider the smallest cost tour in G
which considers not only the first differential coefficients, but also the original (zeroth
order) coefficient values. We will refer to this solution as a 01-hybrid solution. This approach
would yield a solution which is better than the single order differential coefficients. The
implementation of the 01-hybrid solution is simpler than the first order filter. The hybrid solution
partitions the best solution into two sets. The first one contains coefficients which are
implemented normally (i.e. directly) as multiplier inputs. The second set contains only differential
coefficients which are implemented in the same manner as in DCMI.

Fig. 7. Implementation of the 8-tap example filter using ODCMI.
The basic approach used in obtaining a 01-hybrid tour is given as follows. Given the original
coefficients, we obtain the first order DCMI coefficients using the greedy or LK-algorithms. The
cost of the tour is stored and for all vertices on the tour we check if the triangular inequality (Q)
described in figure 8 is satisfied. This inequality checks if tour cost can be improved by removing
vertices from the tour. The algorithm is given below.
1. P = Original Problem with M coefficients. Set i = 0.
2. Obtain first order DCMI for P and store the tour in $T^{(i)}$ and cost of tour in best cost.
3. Remove all nodes satisfying condition Q from $T^{(i)}$. If no node in $T^{(i)}$ satisfies Q, then goto
step 6, otherwise, set i = i + 1.
4. Put removed nodes in R[i]. Call the remaining tour P[i].
5. If cost of P[i] + cost of R[i] < best cost, then, best cost = cost of P[i] + cost of R[i].
6. 01-hybrid solution is the union of R[j]'s for j = 0, 1, ..., i and $T^{(i)}$. Cost = sum of cost of R[j]'s +
cost of $T^{(i)}$.
In the above algorithm, cost of R[j] refers to the sum of the cost of each element in R[j], where the
cost of an element in R is the number of adders required if the original coefficient value is used
as one input of the multiplier. Hence, this approach removes all vertices from the tour which
satisfy Q and improve the solution. The ODCMI solution is tried again on the smaller problem
and the cycle is repeated till no more improvement in solution is possible.

Fig. 8. Tour Improvement Using Condition T.
The 01-hybrid solution for the ODCMI filter is simpler to implement. The basic approach is to
obtain a modified graph G' from G by inserting a new vertex $c_M$, which connects to all the other
vertices $c_i$, i = 1, 2, ..., M - 1, such that the weight of edge $E_{M,i}$ is assigned a value equal to the
number of adds required if the original coefficient $c_i$ were used as one input of the multiplier.
Clearly, G' is undirected and complete. The MST on G' will be the optimal 01-hybrid ODCMI
solution which requires the least number of adders to implement the multiplication. The final
solution can be partitioned into two sets: the first set consists of the parent-child pairs in which
$c_M$ is a parent or a child, and the second set contains the remaining pairs. The elements in the
first set are the coefficients which must be applied without differencing. All pairs in the second
set are employed using the first order difference using the approach in section IV. The cost of the
final solution is simply the sum of edge weights on all the parent-child pairs in the MST of G'.
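The construction with the extra vertex $c_M$ can be sketched as follows. The sketch assumes the same illustrative cost model as a nonzero-bit count (`nonzero_bits`); the function names are hypothetical, and Prim's algorithm is simply started from the virtual vertex so that it is always the parent in any pair it takes part in.

```python
def nonzero_bits(value):
    """Assumed cost model: nonzero bits of |value| in sign-magnitude form."""
    return bin(abs(value)).count("1")


def hybrid_odcmi(coeffs, cost=nonzero_bits):
    """MST on the graph extended by a virtual vertex c_M.

    The edge from c_M to c_i weighs cost(c_i), the price of applying c_i
    directly; all other edges weigh cost(c_i - c_j).
    Returns (direct, differential, total_cost).
    """
    m = len(coeffs)
    virtual = m                          # index standing in for c_M
    best = [cost(c) for c in coeffs]     # start from the edges to c_M
    parent = [virtual] * m
    in_tree = [False] * m
    for _ in range(m):
        u = min((i for i in range(m) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        for v in range(m):
            if not in_tree[v]:
                w = cost(coeffs[u] - coeffs[v])
                if w < best[v]:
                    best[v], parent[v] = w, u
    # First set: coefficients attached to c_M, applied without differencing.
    direct = [v for v in range(m) if parent[v] == virtual]
    # Second set: (child, parent) pairs applied as first order differences.
    differential = [(v, parent[v]) for v in range(m) if parent[v] != virtual]
    return direct, differential, sum(best)
```

Attaching every coefficient directly to the virtual vertex is itself a spanning tree, so the returned cost can never exceed that of the normal (non-differential) implementation under the same cost model.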
As a final note, the reader's attention is drawn to the observation that the schemes for
computing DCMI (ODCMI), along with all the variant solutions, are significantly different from the
previous attempts at designing multiplierless filters [3], [4], [5], [6], [7], [8], [9], as they use
computational reordering to reduce computational redundancy. Since reordering does not result in
any further quantization, this approach does not compromise filter performance. Hence, these
schemes yield significantly simpler, more practical and more effective means of obtaining lower
complexity filter implementations.
VI.  NUMERICAL RESULTS 
We now present some numerical results to demonstrate the power and potential of the proposed
approaches. Both SM and SPT number representations for implementing differential coefficients
are considered. Without any loss of generality, we will assume that coefficients are normalized by
the largest $c_i$. Two sets of results are provided to accommodate the coefficient scaling technique which
is widely used in digital filter implementation [17]. The first one uses the original coefficient values
quantized to fit in a preselected word-length, W. The second one expresses each $c_i$ as $c_i' \times 2^{-k}$,
such that $k \ge 0$ and $c_i' \ge 2^{W-2}$. This scheme will be referred to as maximally scaled coefficients.
The main advantage of such scaling is that in parallel filter implementation the distortion due
to quantization is minimal and this scheme yields a floating point type of filter implementation
without actually using a full floating point unit. The powers of two can be trivially obtained by
shifting data by the appropriate amount using interconnecting wires.
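One way to picture maximal scaling is the Python sketch below. It is an illustration under stated assumptions, not the procedure used for the reported results: coefficients are taken as real numbers with magnitude at most 1, the mantissa is a signed integer read with W - 1 fractional bits, and values that would need a negative k are saturated. The function names are hypothetical.

```python
import math


def maximally_scale(c, W):
    """Return (mantissa, k) with c ~ mantissa * 2**-(W-1) * 2**-k.

    The mantissa satisfies 2**(W-2) <= |mantissa| < 2**(W-1) and k >= 0.
    Assumes |c| <= 1; a value that would need k < 0 is saturated.
    """
    if c == 0:
        return 0, 0
    frac, exp = math.frexp(c)               # c = frac * 2**exp, 0.5 <= |frac| < 1
    mantissa = round(frac * (1 << (W - 1)))
    if abs(mantissa) == 1 << (W - 1):       # rounding spilled into the next octave
        mantissa //= 2
        exp += 1
    k = -exp
    if k < 0:                               # |c| too large for this format
        mantissa = int(math.copysign((1 << (W - 1)) - 1, c))
        k = 0
    return mantissa, k


def scaled_value(mantissa, k, W):
    """Real value represented by a maximally scaled coefficient."""
    return mantissa / (1 << (W - 1)) / (1 << k)
```

For c = 0.3 and W = 8 the sketch gives mantissa 77 and k = 1, a quantization error of about 0.0008, whereas plain 8-bit rounding (38/128) is off by about 0.003. This is the reduction in quantization distortion that the paragraph above refers to.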
Table II shows a relative comparison of filter implementations obtained using the proposed
DCMI approach when the coefficients are expressed with N = 16 bit unscaled numbers using SM
and SPT number representations^2, respectively^3. None of the solutions presented in this section
required more than a few minutes of CPU time on a Sun Ultra 30 workstation. The values
enclosed in parentheses show the results for the same filters using SPT number representation. The
overhead adds are excluded in this table with the purpose that the reader can focus on computation
sharing obtained for the first and second order DCMI (DCMI-1 and DCMI-2, respectively)
using the proposed algorithms. In addition, the M - 1 add operations in equation 1 are also
excluded since these operations are identical in both normal as well as ODCMI implementation.
A close observation of the data in the table reveals some interesting results. The difference between
the solutions using the greedy strategy and the LK-algorithm is negligible. Hence, near-optimal
solutions are obtainable using the greedy strategy alone, as the LK-algorithm is known to yield a
tour which is very close to the optimal solution [16]. The data in the table also reveals that
the second-order DCMI does not offer any significant advantage as compared to the first-order
DCMI. In many cases, it provides a slightly worse solution. One may conjecture that higher-order
DCMI could be more useful if one were to investigate a hybrid solution which combines zeroth,
first and second-order solutions. We also note that the SPT representation offers significant
advantage over SM representation in many cases. Finally, we note that the number of adders per
coefficient required in DCMI implementation is less than 2, in general, for SPT representation.
This compares favorably to the published results of multiplierless filters in literature.

^2 Note that by "[Symmetric]" in the specification of filter F in the table, we mean that it is symmetric about
f = 0.5.
^3 $f_p$ and $f_s$ represent normalized passband and stopband frequencies, respectively. $R_p$ and $R_s$ represent the
passband ripple and stopband attenuation, respectively.
Figures 9 and 10 show a relative comparison of the average number of adders per differential
coefficient obtained using the first-order DCMI solutions for SM and SPT number representations,
respectively. We compare the number of adders per differential coefficient for 8, 16 and 24 bit
coefficients. The example filters considered were 28-tap PM, 41-tap LS, 119-tap PM, 172-tap LS,
131-tap PM, 170-tap LS, 151-tap PM, 217-tap LS, respectively, with specifications shown in table
II. These results were obtained using the greedy strategy for first-order DCMI. We note that
SPT implementations require fewer adders than SM implementations for all word-lengths. We also
observe a linear relationship between the average number of adders per differential coefficient
and the word-length. This relationship is exhibited in all the cases considered (note that the
overhead adds are not included to demonstrate this point). Further, the average number of
adders per differential coefficient reduces, in general, as the length of the filter increases. We note
that traditional approaches of finding multiplierless implementations for word-lengths > 16 would
take enormous computational effort and may not yield good solutions. In contrast, our technique
takes polynomial time, is independent of the word-length and the number representation scheme,
and can be used to obtain good DCMI solutions for large filters within a few minutes of CPU
time.
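The gap between SM and SPT adder counts can be illustrated by counting nonzero digits. The sketch below is only an illustration: it assumes that the number of adds needed to multiply by a constant grows with its number of nonzero digits, and it uses the non-adjacent form as one minimal signed power-of-two encoding. The text does not state which SPT encoding was used, so both points are assumptions, as are the function names.

```python
def sm_nonzero_digits(value):
    """Nonzero bits of |value| in plain sign-magnitude binary."""
    return bin(abs(value)).count("1")


def spt_nonzero_digits(value):
    """Nonzero digits of |value| in non-adjacent form (digits -1, 0, +1)."""
    v = abs(value)
    count = 0
    while v:
        if v & 1:
            digit = 2 - (v & 3)   # +1 if v % 4 == 1, -1 if v % 4 == 3
            v -= digit            # clears the low bit, possibly carrying upward
            count += 1
        v >>= 1
    return count


print(sm_nonzero_digits(7), spt_nonzero_digits(7))    # 3 2  (7 = 8 - 1)
print(sm_nonzero_digits(15), spt_nonzero_digits(15))  # 4 2  (15 = 16 - 1)
```

Since the non-adjacent form never uses more nonzero digits than plain binary, the SPT count is at most the SM count for every coefficient value, which agrees with the trend reported for figures 9 and 10.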
In the sequel, all results presented consider the overhead add network but exclude the M - 1 add
operations in equation 1 since these operations are identical in both normal as well as proposed
implementations. Figures 11 and 12 show comparisons of the average number of adders per
coefficient obtained using the 01-hybrid ODCMI scheme for maximally scaled coefficients for
SM and SPT number representations, respectively. The abscissa shows the filter number used to
obtain the results shown. The filters are described in table III. In both of these figures, the plot
on top shows a relative comparison of the average number of adds per coefficient. These numbers
are obtained by normalizing the number of add operations of the 01-hybrid ODCMI solution with
the normal filter implementation. The plot on bottom shows the actual number of average add
operations per coefficient for the ODCMI solution. Results are shown for W = 8, 12, 16 and
23 maximally scaled coefficients. Clearly, SPT requires fewer computational resources. Further,
we observe that the relative savings are better in SM number representation as compared to
SPT. This is intuitive because SPT representation reduces the required resources in the normal
implementation and this is reflected as smaller gains in complexity reduction. Again, we also
observe a linear relationship between the average number of adders per differential coefficient
and the word-length (note that the last two bars span a wider range of W).

TABLE II
2[M/2] FOR DCMI-2) OBTAINED USING DCMI FOR UNSCALED 16-bit SM AND SPT NUMBER
Example M Total DCMI-1
A: Low-pass ($f_p$ = 0.25, $f_s$ = 0.3, $R_p$ = 3 dB, $R_s$ = -50 dB)
B: Low-pass ($f_p$ = 0.27, $f_s$ = 0.2875, $R_p$ = 2 dB, $R_s$ = -50 dB)
C: Low-pass ($f_p$ = 0.27, $f_s$ = 0.29, $R_p$ = 2 dB, $R_s$ = -100 dB)
($f_p$ = 0.25, $f_s$ = 0.2625, $R_p$ = 2 dB, $R_s$ = -73 dB)
PM 165 774 (598) 173 (157) 170 (155) 185 (164) 178 (159)
LS 236 912 (7Fj2) 187 (172) 185 (170) 216 (194)
Notch ($f_{p1}$ = 0.3, $f_{s1}$ = 0.32, $f_{s2}$ = 0.68, $f_{p2}$ = 0.7)
F: Notch [Symmetric] ($f_{p1}$ = 0.2, $f_{s1}$ = 0.22, $f_{s2}$ = 0.38, $f_{p2}$ = 0.4)

Fig. 9. Average Number of Adders per Coefficient for First-Order DCMI With Unscaled SM Number
Representation. Add Overhead of [M/2] Excluded. (Abscissa: Example Filter No.)

Fig. 10. Average Number of Adders per Coefficient for First-Order DCMI With Unscaled SPT Number
Representation. Add Overhead of [M/2] Excluded. (Abscissa: Example Filter No.)
It is also evident that the 01-hybrid ODCMI filters require less than 1 adder/coefficient for
filters with M roughly > 20 in the case of maximally scaled coefficients. In the case of an IIR
filter implementation, where maximal scaling is of paramount importance for stability reasons,
the proposed methods yield practical and viable solutions. In FIR filters, these methods yield
low-complexity filters with very small quantization noise. The numerical results presented in this
section also show that the first step in complexity reduction of digital filters is computational
redundancy reduction rather than the conventional approach of compromising the frequency
response characteristics. Significant gains can be achieved by using simple approaches as presented
in this paper.

Fig. 11. Relative Comparison of the Number of Adders per Coefficient Obtained Using ODCMI for
maximally scaled SM Number Representation. (Abscissa of both plots: Filter No., 1 to 18.)
VII. MINIMALLY REDUNDANT PARALLEL FILTERS
We now present an approach which is specific to implementations where shifts of computed
values are available without any computational cost. In particular, this assumption is
meaningful in parallel filter implementations which can trivially implement a shift operation using
interconnecting wires. Alternatively, if the cost of the shift operation is significantly lower than the
cost of computation, the idea presented here can be easily extended to a serial SAA multiplier
based implementation. To understand this approach, we consider the following example. The
binary number 101100 can be obtained by shifting 001011 twice to the left. Hence, the product
101100 · x(n) can be obtained free of cost from the product 001011 · x(n), if the latter has been
computed earlier. Hence, we can extend the framework developed in the previous sections to
take advantage of this observation. Since these filters are obtained by considering differential as
well as sharing of shifted pre-computed values, the computational redundancy in these filters is
significantly lower than in the filters obtained using the previous approaches. We will refer to these
filters as minimally redundant parallel (MRP) filters.

Fig. 12. Relative Comparison of the Number of Adders per Coefficient Obtained Using ODCMI for
maximally scaled SPT Number Representation. (Abscissa of both plots: Filter No., 1 to 18.)

TABLE III
DESCRIPTION OF FILTERS USED IN ODCMI EXAMPLES. EP, BT, PM AND LS REPRESENT ...
AND LP REPRESENT BAND-PASS AND LOW-PASS FILTERS, RESPECTIVELY.
Filter No. / M / design / type (entries as extracted):
2 / L O / EP / LP
1 / i / EP / LP
2 / 20 / BT / LP
5 / 2s / PM / LP
5 / i / PM / BP
6 / 119 / PM / LP
11 / 70 / LS / BP
9 / 31 / PM / BP
12 / 189 / PM / LP
14 / 30 / LS / BP
15, 16, 17, 18 / 327 m / LS PM / LP BP BP BP
Recall the ODCMI scheme presented in section IV. A close observation of figure 7 reveals
that computational redundancy in the filter shown can be further reduced if computation can
be shared amongst the products $(\Delta c_i)_{MST}\, x(n) = (c_{q_i} - c_{p_i})\, x(n)$, $i = 0, 1, 2, \ldots, M - 1$, where the $p_i$'s
and $q_i$'s refer to the parent and child nodes of the MST. Without loss of generality, we will
assume that the filter coefficients have been maximally scaled such that $c_i \ge 2^{W-2}$ for each
$i = 0, 1, \ldots, M - 1$. We will let W represent the coefficient word-length. Then, the generalized
differential coefficients are given as $(\Delta c_i)_{L,MRPF} = c_{q_i} - 2^{-L} c_{p_i}$, where $0 \le L \le W$ and
$i = 0, 1, \ldots, M - 1$. Note that maximal scaling results in the maximum number of distinct values of
$(\Delta c_i)_{L,MRPF}$ for $0 \le L \le W$. Next, suppose that we were to implement a solution in figure 7
where each $(\Delta c_i)_{MST}$ is replaced by some $(\Delta c_i)_{L_i,MRPF}$ for $i = 0, 1, \ldots, M - 1$, where $0 \le L_i \le W$.
Then, $P_{q_i}^{(n)} = (\Delta c_i)_{L_i,MRPF}\, x(n) + 2^{-L_i} c_{p_i} x(n)$, and hence, a correction term of $2^{-L_i} c_{p_i} x(n)$
needs to be added to obtain the correct value of the partial product. We note that this term
is readily available from $c_{p_i} x(n)$ by a shift operation of $L_i$ bits. Consequently, we can obtain
$P_j^{(n)}$ through any one of $(\Delta c_i)_{L,MRPF}$, $i = 0, 1, \ldots, M - 1$, $L = 0, 1, \ldots, W$, if the cost of the shift
operation is negligible in terms of computation.
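The role of the correction term can be checked numerically with the short Python sketch below. Exact rational arithmetic (`fractions.Fraction`) stands in for the fixed-point wiring, and the function names and sample values are assumptions of the sketch.

```python
from fractions import Fraction


def generalized_diff(c_q, c_p, L):
    """Generalized differential coefficient c_q - 2**-L * c_p."""
    return Fraction(c_q) - Fraction(c_p, 2 ** L)


def partial_product(c_q, c_p, L, x):
    """Recover c_q * x from the shared product c_p * x and a shift by L bits."""
    shared = c_p * x                        # assumed to be computed already
    correction = Fraction(shared, 2 ** L)   # a wire shift in a parallel filter
    return generalized_diff(c_q, c_p, L) * x + correction


# The shift example from the text: 101100 (44) is 001011 (11) shifted left twice.
c_p, c_q, x = 0b101100, 0b001011, 5
print(generalized_diff(c_q, c_p, 2))    # 0: no multiplication is left to perform
print(partial_product(c_q, c_p, 2, x))  # 55, which equals c_q * x
```

Every choice of L reproduces the same partial product; what changes is the value, and therefore the implementation cost, of the generalized differential coefficient that has to be multiplied.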
The above is represented in the modified graph of figure 13 for the 4-tap example filter. This
graph is directed, and there are W + 1 edges directed from $c_i$ to $c_j$ for all i, j, representing the
difference $c_i - 2^{-L} c_j$, $L = 0, 1, \ldots, W$. We will let $D_{ij}^{L}$ denote the edge representing the differential
coefficient $c_i - 2^{-L} c_j$, as shown in figure 13. Consequently, each vertex has (W + 1)(M - 1) incident
edges. In contrast to the graphs obtained for the problems presented in previous sections, the
edges in the graph of figure 13 represent the actual value of $(\Delta c_i)_{L,MRPF}$ and not the number
of resources required to obtain a product with this generalized differential coefficient. We let the
value of $(\Delta c_i)_{L,MRPF}$ represent the color of the corresponding edge. Hence, each vertex has
a maximum of W + 1 distinct colored edges entering it from another vertex and there are a total
of (W + 1)(M - 1) distinct colored edges in the graph.
Let each vertex represent a set of incident colored edges. Then, we can obtain M such sets for
an M-tap filter. In order to implement the filter, we need to visit each vertex once only. Hence,
the solution requiring the least amount of resources to implement this filter comprises a set of
colors whose edges visit each vertex at least once, such that the cost of implementation of these
colors is minimum. Further, if a particular color edge is selected, all edges of the same color
can be obtained free of computational cost for all other vertices visited by these edges. We can
also cast this problem as an alternative equivalent problem in which all possible edge colors for a
particular set of scaled coefficients define a color set. The elements of a given color set comprise
the vertices visited by the edges of the respective color as shown in figure 14. The total number
of vertices which are visited by $color_i$ is stored in the $count_i$ field. Each color requires a particular
cost of implementation, which is the amount of resources required to form a product of a data
value by the value represented by the color. This cost, for a given number representation scheme,
is stored with the color. As explained above, if $color_i$ is selected in the final solution, the product
$color_i \cdot x(n)$ (equal to $(\Delta c_i)_{L,MRPF} \cdot x(n)$ for some i and L) can be shared using interconnecting
wires to obtain any other vertex in the set free of computational cost. Then, our goal is to find the
least cost set of colors such that the edges in these sets cover all of the vertices in the graph.
This is a well known NP-complete problem called weighted minimum set cover. Hence, it can be
solved using a good heuristic approach.

Fig. 13. Modified graph for MRP filter.
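The color sets described above can be built with a short Python sketch. It assumes that the color $c_i - 2^{-L} c_j$ is credited to vertex i, meaning that $c_i x(n)$ becomes available once the colored product and $c_j x(n)$ are known; this reading of the edge direction, the use of exact fractions, and the function name are assumptions of the sketch.

```python
from collections import defaultdict
from fractions import Fraction


def build_color_sets(coeffs, W):
    """Map each color (edge value) to the set of vertices it can visit.

    For every ordered pair (i, j), i != j, and every shift L = 0..W,
    the color c_i - 2**-L * c_j is recorded as visiting vertex i.
    """
    color_sets = defaultdict(set)
    for i, c_i in enumerate(coeffs):
        for j, c_j in enumerate(coeffs):
            if i == j:
                continue
            for L in range(W + 1):
                color = Fraction(c_i) - Fraction(c_j, 2 ** L)
                color_sets[color].add(i)
    return dict(color_sets)
```

With coefficients 11, 44 and 13 and W = 3, the color 0 visits vertex 0, because 11 is 44 shifted right by two bits. No vertex is touched by more than (W + 1)(M - 1) colors, matching the edge count given earlier.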
A. Solution for the MRP Filter
A greedy algorithm employed to solve the above minimum set cover problem is outlined below.
It takes scaled input coefficients and computes a greedy solution for the MRP filter.
1. (Pre-process) Remove all vertices $c_j = 2^{-l} c_i$, for some l such that $j \neq i$.
Repeat for all j such that $0 \le j \le M - 1$.
2. Construct Vertex Sets for all vertices (coefficients) in the modified graph.
Vertex $Set_i$ is defined as the set of all incident edges on $c_i$.
3. Construct Color Sets from Vertex Sets. Color-$Set_i$ is defined as the set of all vertices which
can be visited by the color of Color-$Set_i$.
4. Compute the minimum cost weighted set cover problem.
(a) Initialize MSC-Solution to the empty set.
(b) while (Color Sets are not empty) do
i. Choose the lowest cost color which visits the maximum number of vertices. Add it
to the set MSC-Solution.
ii. Remove all vertices visited by the chosen color from the Color Sets.
(c) end while
(d) The greedy minimum set cover solution is the set of chosen colors in MSC-Solution.
5. If any $c_i = 2^{-k} color_j$, for $i = 0, 1, \ldots, M - 1$ and $color_j \in$ MSC-Solution, remove the
overhead ADD operation for that coefficient. For all others, add an overhead cost of 1.
6. Add the cost of the colors in MSC-Solution to obtain the total implementation cost.

Fig. 14. The data structure for color sets. (Fields shown: Color, Cost, Count, and the list of visited vertices.)
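Step 4 of the procedure above can be sketched in Python as follows. The selection rule is one reading of "lowest cost color which visits the maximum number of vertices": the color covering the most still-uncovered vertices is taken first and ties are broken by lower cost. That reading, the dictionary-based data layout, and the function name are assumptions of this sketch; the overhead accounting of steps 5 and 6 is not included.

```python
def greedy_color_cover(color_sets, color_cost):
    """Greedy weighted set cover over color sets.

    color_sets: dict mapping color -> set of vertices the color visits
    color_cost: dict mapping color -> implementation cost of that color
    Returns (chosen_colors, total_cost_of_chosen_colors).
    """
    uncovered = set().union(*color_sets.values())
    remaining = {color: set(vs) for color, vs in color_sets.items()}
    solution = []
    while uncovered:
        # Most newly visited vertices first; lower cost breaks ties.
        best = max(remaining,
                   key=lambda c: (len(remaining[c] & uncovered), -color_cost[c]))
        solution.append(best)
        uncovered -= remaining.pop(best)
    return solution, sum(color_cost[c] for c in solution)


# Hypothetical color sets for a 4-coefficient problem.
sets_ = {3: {0, 1, 2}, 5: {2, 3}, 7: {3}, 1: {0}}
costs = {3: 1, 5: 1, 7: 0, 1: 0}
print(greedy_color_cover(sets_, costs))  # ([3, 7], 1)
```

Other common heuristics, such as choosing the color with the lowest cost per newly covered vertex, would slot into the same loop by changing only the `key` function.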
In the first step, all vertices which can be obtained by simple shifts of other vertices are removed
from the problem. Only one of these vertices is kept (i.e. $c_i$). There is no computation or
overhead add operation required to implement partial products involving the removed vertices.
Hence, the $c_j$'s are obtained with cost 0. We note that, since the removed vertices are $c_j = 2^{-l} c_i$, the
incident colors on the $c_j$'s would have been identical to the incident colors on $c_i$, had we chosen not
to remove these vertices. Hence, the final solution will not be affected, as the choice of colors is
not increased if the $c_j$'s are not removed, and visiting $c_i$ automatically visits each $c_j$. It follows that
step 1 does not alter the optimality of the solution obtained for the smaller problem.
Steps 2 and 3 compute the Vertex and Color Sets as explained earlier. In step 4, any heuristic
approach may be used to compute the minimum cost weighted set cover solution. In our work,
we use a simple greedy approach which selects the most likely color as the one which visits the most
vertices and has the smallest cost. Finally, we need to account for the overhead add operations for
the differential coefficients, and step 5 removes all the overhead add operations for the coefficients
which can be directly obtained from the colors selected in the solution color sets. Finally, the
total implementation cost is computed for the MRP filter.
VIII. FURTHER NUMERICAL RESULTS 
Figures 15 - 18 show a comparison of MRPF filter complexity with the normal filter
implementation for SM and SPT number representations, respectively. These results were obtained using
the greedy solution presented in the previous section. The example filters used in obtaining these
results are described in table III. The results clearly indicate that MRP filters have lower
complexity as compared to ODCMI filters. We also note that the relative savings in the average number
of adds required to implement the given filter are greater for SM representation. Again, this is
because the average number of adds per coefficient is smaller for the normal implementation
with SPT representation.
Figures 15 and 17 show that MRP filters decrease the filter complexity roughly by a factor
of 2 for most cases. Of particular interest is the observation that the MRPF approach is also
useful in reducing the complexity of small filters. For long filters, complexity is reduced
by a factor of about 3 or more. These savings result from reuse of shifts of computation. Figures
16 and 18 show that for long filters, the average number of adders required per multiplication
is less than 0.2 for 8-bit maximally scaled coefficients. Hence, a fully parallel implementation
of an 8-bit maximally scaled 200 tap filter requires 20-40 adders to implement the full
multiplication network. Consequently, the complexity of the filter is dominated by the M - 1 add
operations in the sum of equation 1 rather than multiplications. We observe that this dramatic
reduction in filter complexity does not compromise the filter transfer function response and is
obtained using a simple polynomial run-time algorithm. Hence, the MRPF technique offers a
powerful methodology to obtain very low-complexity and fast digital filters for applications requiring
very high-performance and/or low power.
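The adder counts quoted above follow from simple arithmetic. In the worked figures below, the upper rate of 0.2 adders per multiplication is the bound stated in the text, while the lower rate of 0.1 is an assumed value chosen to match the quoted range of 20-40 adders; the symbols $N_{mult}$ and $N_{sum}$ are introduced only for this illustration.

```latex
N_{\text{mult}} \approx M \times (\text{adders per multiplication}), \qquad
200 \times 0.1 = 20, \quad 200 \times 0.2 = 40,
\qquad N_{\text{sum}} = M - 1 = 199 .
```

With roughly 20 to 40 adders in the multiplication network against 199 adds in the output sum, the summation of equation 1 accounts for most of the arithmetic in such a filter.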
Fig. 15. Relative Average Number of Adders per Coefficient in MRP filters Using maximally scaled SM
Number Representation. (Abscissa: No. of Filter Taps.)

Fig. 16. Average Number of Adders per Coefficient in MRP filters Using maximally scaled SM Number
Representation. (Abscissa: No. of Filter Taps: 6, 10, 13, 20, 28, 41, 71, 119, 131, 150, 170, 189, 250,
301, 327, 401, 500, 601.)
Fig. 17. Relative Average Number of Adders per Coefficient in MRP filters Using maximally scaled SPT
Number Representation. (Abscissa: No. of Filter Taps.)

Fig. 18. Average Number of Adders per Coefficient in MRP filters Using maximally scaled SPT Number
Representation. (Abscissa: No. of Filter Taps: 6, 10, 13, 20, 28, 41, 71, 119, 131, 150, 170, 189, 250,
301, 327, 401, 500, 601.)

IX. CONCLUSION
We presented computation reduction techniques which can be used to obtain multiplierless
implementations of FIR digital filters. The ideas presented in this paper are also applicable to IIR
digital filters. We addressed the problem of complexity reduction for high-speed and low power
applications by proposing systematic methodologies for reducing computational redundancy using
computation reordering and sharing. Various approaches were presented which consider normal,
differential and hybrid coefficients. The reordering problem was formulated using a graph in
which vertices represent the coefficients and edges represent resources required in a
computation involving the coefficient. Various schemes were presented which reduce filter complexity
by specifically targeting computational redundancy inherent in normal filter implementations.
Simple polynomial run time algorithms were presented and their power and potential were
demonstrated by presenting results for large filters (lengths up to 600) which showed significant gains
in the number of add operations per coefficient. We also considered filter implementations in
which shifted values of computations can be obtained using simple interconnects without
incurring extra computation. We presented a methodology by which such computation can
be re-used in other computations and showed that such operations significantly reduce
computational redundancy further, thereby yielding extremely simple filters. One major advantage of the
proposed schemes is that the frequency response of the desired filter is unaltered. It was
shown that as few as 0.1 adders per filter coefficient are required to implement the multipliers
in such filters. Hence, such filters can be used in very high-speed applications. Alternatively,
using voltage scaling, one can significantly reduce the power consumption of such filters for any
desired performance.
REFERENCES
[1] L. L. Scharf, "Statistical Signal Processing: Detection, Estimation, and Time Series Analysis," Addison Wesley,
1991.
[2] S. Haykin, "Adaptive Filter Theory," Prentice Hall, NJ, 1996.
[3] R. Jain, P. T. Yang and T. Yoshino, "FIRGEN: A computer-aided design system for high performance FIR
filter integrated circuits," IEEE Transactions on Signal Processing, Vol. 39, No. 7, pp. 1655-1668, Jul. 1991.
[4] Y. C. Lim and S. R. Parker, "FIR filter design over a discrete powers-of-two coefficient space," IEEE Trans.
Acoust., Speech & Signal Processing, Vol. ASSP-31, No. 3, pp. 583-591, Jun. 1983.
[5] D. Li and Y. C. Lim, "Multiplierless realization of adaptive filters by nonuniform quantization of input signal,"
1994 IEEE International Symposium on Circuits and Systems, Vol. 2, pp. 457-459, 1994.
[6] B. R. Horng, H. Samueli and A. N. Wilson, "The design of low-complexity in linear-phase FIR filter banks
using powers-of-two coefficients with an application to subband image coding," IEEE Trans. Circuits Syst.
Video Technology, Vol. 1, No. 4, pp. 318-324, Dec. 1991.
[7] B. R. Horng, H. Samueli and A. N. Wilson, Jr., "The design of two-channel lattice structure perfect-
reconstruction filter banks using powers-of-two coefficients," IEEE Trans. Circuits and Systems-I: Fundamental
Theory and Applications, Vol. 40, No. 7, pp. 497-499, July 1993.
[8] H. Samueli, "An improved search algorithm for the design of multiplierless FIR filters with powers-of-two
coefficients," IEEE Trans. Circuits and Systems, Vol. 36, No. 7, pp. 1044-1047, July 1989.
[9] M. Yagyu, A. Nishihara and N. Fujii, "Fast FIR Digital Filter Structures Using Minimal Number of Adders
and its Application to Filter Design," IEICE Trans. Fundamentals, Vol. E79-A, No. 8, pp. 1120-1128, Aug.
1996.
[10] D. A. Parker and K. K. Parhi, "Low-area/power parallel FIR digital filter implementations," Journal of VLSI
Signal Processing, Vol. 17, No. 1, Sept. 1997.
[11] N. Sankarayya, K. Roy, and D. Bhattacharya, "Algorithms for low power and high speed FIR filter realization
using differential coefficients," IEEE Trans. Circuits and Systems, Vol. 44, No. 6, pp. 488-497, Jun. 1997.
[12] K. Muhammad and K. Roy, "Low Power Digital Filters Based On Constrained Least Squares Solution," In
Proc. 31st Asilomar Conference On Signals, Systems, & Computers, Nov. 2-5, 1997.
[13] M. Mehendale, S. B. Roy, S. D. Sherlekar and G. Venkatesh, "Coefficient transformations for area-efficient
implementation of multiplier-less FIR filters," Proceedings Eleventh International Conference on VLSI Design,
pp. 110-115, 1997.
[14] J. M. Rabaey, "Digital Integrated Circuits: A Design Perspective," Prentice Hall, New Jersey, 1996.
[15] T. H. Cormen, C. E. Leiserson and R. L. Rivest, "Introduction to Algorithms," The MIT Press, 1990.
[16] W. J. Cook, W. H. Cunningham, W. R. Pulleyblank and A. Schrijver, "Combinatorial Optimization," John
Wiley & Sons, Inc., 1998.
[17] E. Cooper, "Minimizing Quantization Effects Using the TMS320 Digital Signal Processor Family," Application
Report, http://www.ti.com/sc/docs/psheets/abstract/apps/spra035.htm, Texas Instruments, 1994.