






















Time-area efficient multiplier-free filter architectures for FPGA implementation
Shajaan, Mohammad; Nielsen, Karsten; Sørensen, John Aasted
Published in:
IEEE International Conference on Acoustics, Speech and Signal Processing. Proceedings





Publisher's PDF, also known as Version of record
Citation (APA):
Shajaan, M., Nielsen, K., & Sørensen, J. A. (1995). Time-area efficient multiplier-free filter architectures for
FPGA implementation. IEEE International Conference on Acoustics, Speech and Signal Processing.
Proceedings, 5, 3251-3254. DOI: 10.1109/ICASSP.1995.479578
TIME-AREA EFFICIENT MULTIPLIER-FREE FILTER ARCHITECTURES 
FOR FPGA IMPLEMENTATION 
Mohammad Shajaan, Karsten Nielsen, John Aasted Sørensen
Technical University of Denmark, Electronics Institute
E-mail: ms@ei.dtu.dk and jaas@ei.dtu.dk
ABSTRACT 
Simultaneous design of multiplier-free filters and their hardware
implementation in Xilinx Field Programmable Gate Arrays (XC4000) is
presented. The filter synthesis method is a new approach based on cascade
coupling of low-order sections. The complexity of the design algorithm is
O(filter order). The hardware design methodology leads to high-performance
filters with sampling frequencies in the interval 20-50 MHz. The time-area
efficiency and performance of the architectures are considerably above
those of any known approach.
Introduction 
In recent years the complexity of Field Programmable Gate Arrays (FPGAs)
has reached a level where they can be useful as a fundamentally new DSP
component. Unlike standard gate arrays, the functional structure of the
XC4000 family is very constrained and complex, due to low-level
irregularity. This irregularity may result in dramatic differences in
time-area efficiency between equivalent realizations, making careful
low-level design and manual floorplanning necessary. This paper illustrates
the approaches needed to obtain optimal FPGA designs, using multiplier-free
filters as example DSP algorithms.
Multiplier-free linear phase FIR filters 
by quantization of zeros 
The background of the filter synthesis method presented is that the
transfer function of a linear phase FIR filter is symmetric or
antisymmetric and can be factorized into fourth-order, second-order and
first-order sections with real coefficients. Every section represents a
zero-group [4], denoted by $R_i$. The design problem can be formulated as
follows:
Design problem: Find the set $\{R_1, \ldots, R_N\}$ of zero-groups which
gives the least normalized peak ripple, where $N$ is the number of
sections.
To the best of the authors' knowledge, no efficient algorithm with linear
time complexity exists to solve this design problem. In the following, an
iterative O(filter order) algorithm is presented:
BEGIN ALGORITHM
- Design the infinite precision FIR filter.
- Find the set of optimal zero-groups and the number of sections $N$.
WHILE improvement achieved OR first iteration
    FOR $i = 1, \ldots, N$
        - Quantize $R_i$.
        - Retain the currently optimal quantization of $R_j$,
          $j \neq i$, $j \in \{1, \ldots, N\}$.
        - Compute the approximation error using the quantized values
          of $R_i$ and the values of $R_j$, $j \neq i$.
        - Select the best quantized $R_i$.
An example explains the algorithm in more detail. Fig. 1 shows a small
search tree for a filter with three zero-groups ($R_1$, $R_2$ and $R_3$).
In the first iteration a zero-group ($R_1$) is selected and quantized to
the nearest zeros in the discrete space. Let the number of possible
quantized zeros be three at each node (e.g. A, B and C at the first node).
At node 1 we choose A, which gives the least normalized ripple with $R_2$
and $R_3$ unquantized. Next, $R_2$ is quantized and we choose D, with $R_1$
quantized (i.e. A) and $R_3$ unquantized. At the third node the last
zero-group $R_3$ is quantized and F is chosen. At this stage a result has
been obtained, but it may be improved by repeating the search. The second
iteration differs from the first in that $R_1$ is re-quantized with the
normalized ripple calculated using D and F as the rest of the filter. The
second iteration results in a better solution (B, E, G). Since an
improvement was achieved, a third iteration is started. It yields no
further improvement, so the algorithm stops and outputs the sections with
zero-groups B, E and G.
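The iteration described above behaves like a coordinate descent over the
zero-groups. The following is a minimal sketch under simplifying
assumptions that are not taken from the paper: every zero-group is modelled
as a symmetric second-order section $1 + b z^{-1} + z^{-2}$, the discrete
zero space is a uniform grid of spacing `step` (a stand-in for the SPT
coefficient space), and the error is the peak deviation from the
unquantized response rather than the normalized peak ripple against a
design specification. All function names are hypothetical.

```python
import cmath
import math


def magnitude(bs, w):
    """|H(e^{jw})| of a cascade of sections 1 + b*z^-1 + z^-2."""
    z1 = cmath.exp(-1j * w)
    h = complex(1.0)
    for b in bs:
        h *= 1 + b * z1 + z1 * z1
    return abs(h)


def peak_error(bs, ref, grid):
    """Peak deviation from the reference response, normalized by its peak."""
    return max(abs(magnitude(bs, w) - r) for w, r in zip(grid, ref)) / max(ref)


def candidates(b, step):
    """Three grid points nearest to b (assumed stand-in for quantized zeros)."""
    k = round(b / step)
    return [(k - 1) * step, k * step, (k + 1) * step]


def quantize_cascade(ideal, step):
    """Quantize one section at a time, repeating passes while the error drops."""
    order = 2 * len(ideal)
    points = 20 * order + 1                      # grid denser than 20 * order
    grid = [math.pi * i / (points - 1) for i in range(points)]
    ref = [magnitude(ideal, w) for w in grid]
    current = list(ideal)                        # all R_j start unquantized
    best_err = math.inf
    improved = True                              # forces the first iteration
    while improved:
        for i in range(len(current)):
            def err_with(c, i=i):
                # Error with section i set to c and all other sections as is.
                return peak_error(current[:i] + [c] + current[i + 1:], ref, grid)
            current[i] = min(candidates(ideal[i], step), key=err_with)
        err = peak_error(current, ref, grid)
        improved = err < best_err
        best_err = min(best_err, err)
    return current, best_err


if __name__ == "__main__":
    sections, err = quantize_cascade([1.93, 0.31, -1.27], step=2.0 ** -3)
    print(sections, err)
```

Each pass evaluates a fixed number of candidates per section, so the work
per pass grows linearly with the number of sections; the number of passes
is not bounded by this sketch beyond the fact that the error strictly
decreases between passes.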
The algorithm finds a semi-optimal filter. Since the zeros of the stopband
sections are placed on the unit circle, a good stopband attenuation is
always achieved. However, the passband ripple is normally larger than the
stopband ripple. Using a systematic approach, a large number of filters
have been designed. The results are comparable with those of other
approaches, despite the low algorithm complexity. The algorithm is very
fast, with linear time complexity; e.g. a 100th-order filter can be
designed in less than 90 seconds on an HP700 computer.
The normalized peak ripple was calculated using a frequency grid of more
than 20 · (filter order) points.
0-7803-2431-5/95 $4.00 © 1995 IEEE
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 09:18:27 UTC from IEEE Xplore.  Restrictions apply.
Hardware methodology by filter 
example 
A hardware synthesis method leading to minimal, high-performance hardware
realizations of the multiplier-free filters has been developed. In the
following, this method is illustrated by implementing an example 33-tap
multiplier-free filter with band edges 0.3 and 0.5 for stopband and
passband, respectively. Frequency responses of the original and
multiplier-free filters are shown in Fig. 2. The coefficients are
represented as Signed Power of Two (SPT) numbers with a 9-bit range, and
the normalized peak ripple is -50 dB. A 16-bit data representation is used
to make good noise properties possible.
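To make the SPT representation concrete, the sketch below expands a
coefficient greedily into a small number of signed power-of-two terms. It
is an illustration only, not the paper's synthesis method (which quantizes
zeros rather than coefficients). The function names are hypothetical, and
reading the "9-bit range" as exponents 0 down to -8 is an assumption.

```python
import math


def spt_terms(value, max_terms, min_exp=-8, max_exp=0):
    """Greedy signed power-of-two expansion as a list of (sign, exponent)."""
    terms = []
    residual = value
    for _ in range(max_terms):
        if residual == 0:
            break
        mag = abs(residual)
        lo = math.floor(math.log2(mag))
        # The nearest power of two is one of the two bracketing exponents,
        # clamped to the allowed exponent range (assumed 9 positions).
        options = {min(max(e, min_exp), max_exp) for e in (lo, lo + 1)}
        exp = min(sorted(options), key=lambda e: abs(mag - 2.0 ** e))
        sign = 1 if residual > 0 else -1
        if abs(residual - sign * 2.0 ** exp) >= mag:
            break                      # another term would not reduce the error
        terms.append((sign, exp))
        residual -= sign * 2.0 ** exp
    return terms


def spt_value(terms):
    """Numerical value of an SPT term list."""
    return sum(sign * 2.0 ** exp for sign, exp in terms)


if __name__ == "__main__":
    terms = spt_terms(0.96875, max_terms=2)
    print(terms, spt_value(terms))
```

With two terms allowed, a coefficient such as 0.96875 is represented
exactly as one addition and one subtraction of shifted operands, which is
what makes a multiplier unnecessary.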
The filter example has 10 complex conjugate zero-groups realized by
symmetric 2nd-order sections with a 1-2 SPT term combination, i.e. the
first and second coefficients are sums of 1 and 2 SPT terms, respectively.
Three quadruple zero-groups are realized by 4th-order sections of
different complexity (1-3-3 and 2-3-3). The section ordering and scaling
factors are determined by noise considerations, because the output noise
is extremely sensitive to the section ordering [4]. A theoretical
difference of 80 dB in output noise was found between the chosen section
ordering and the worst cases. The hardware methodology is based on scaling
factors restricted to power-of-two values and no truncation internally in
the sections.
The example is targeted at an FPGA hardware prototyping PC board, at
present configured with a square array of four XC4005 devices. The
hardware realization is synthesized in a four-stage process:
In Stage One the filter architecture is defined as a linear systolic array
in 2 levels, shown in Fig. 3. By applying systolization cut-sets [1]
between sections, the filter architecture seen at level 1 is a linear,
temporally and spatially local systolic array. Each section in the filter
architecture is also realized by a linear systolic array (Fig. 3, level 2),
with fine-grain processing elements (PEs) as the fundamental components.
The set of PEs is devised on the basis of detailed knowledge of the XC4000
architecture constraints and of the multiplier-free section structures to
be implemented. Exactly 3 PEs are necessary to make an efficient
realization of all section structures possible. The three PEs are
generated as different combinations of a basic operation module. Fig. 3
shows high-level representations of both the basic operation module and
the three PEs. A 2-bit bitslice of each PE can be implemented in one
optimally used configurable logic block (CLB).
In Stage Two, the mapping of section structures to PEs and the section
(level 2) floorplanning are carried out, considering the major constraints
imposed by the routing architecture. The physical shape and relative
placement of the PEs are highly restricted by the use of the dedicated
carry logic. The constraints of this resource imply that optimal section
processor arrays have to be realized with a horizontal PE topology.
The section structures are chosen by comparing filter synthesis results
with hardware complexity. All sections are realized by transpose form
structures, shown in Fig. 4. The triangular symbol represents an
adder/subtracter unit realizing 'multiplications' by adding/subtracting
hardwired shifted operands. The original passband (4th-order) section
structures result in inefficient hardware realizations. An efficient
mapping of the original structures to the PEs defined above is not
possible, since the number of delay elements is less than the number of
arithmetic units. Furthermore, three arithmetic units share combinatorial
paths in both 4th-order structures, resulting in relatively poor
performance. By 2nd-level systolization of these section structures, far
more efficient structures are generated. Fig. 4 shows the systolization
cut-sets leading to the dramatic increase in time-area efficiency. The
number of registers matches the number of arithmetic units (complete map
to PEs), and the longest combinatorial path is reduced to two arithmetic
units (better temporal locality).
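The idea of realizing 'multiplications' with hardwired shifts and
adder/subtracter units in a transpose form structure can be sketched in
integer arithmetic as follows. This is a behavioural model only, not the
PE or section structure of Fig. 4. The function names, the coefficient
values and the choice of `frac` extra fractional bits (used so that no
shift discards bits, in line with the no-truncation rule) are assumptions
made for illustration.

```python
def spt_multiply(x, terms, frac):
    """x times sum(sign * 2**exp), scaled by 2**frac, using shifts and adds.

    Requires frac + exp >= 0 for every term so that nothing is truncated.
    """
    acc = 0
    for sign, exp in terms:
        shifted = x << (frac + exp)          # hardwired shift of the operand
        acc += shifted if sign > 0 else -shifted
    return acc


def transposed_section(samples, coeffs, frac):
    """Transpose form FIR section; coeffs holds one SPT term list per tap."""
    regs = [0] * (len(coeffs) - 1)           # one register per delay element
    out = []
    for x in samples:
        prods = [spt_multiply(x, c, frac) for c in coeffs]
        out.append(prods[0] + (regs[0] if regs else 0))
        for k in range(len(regs)):
            # Register k takes its tap product plus the old value of register k+1.
            nxt = regs[k + 1] if k + 1 < len(regs) else 0
            regs[k] = prods[k + 1] + nxt
    return out


if __name__ == "__main__":
    one = [(1, 0)]                           # coefficient 1
    b = [(1, 0), (-1, -5)]                   # hypothetical coefficient 1 - 2**-5
    print(transposed_section([1, 0, 0, 0], [one, b, one], frac=5))
```

In the transpose form every tap product feeds an adder followed by a
register, which is why the structure pairs naturally with processing
elements that contain one arithmetic unit and one register.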
Rules have been specified to automate both the mapping to PEs and the
floorplanning of the section processor array. The mapping of the systolized
section structures to section processor arrays for the filter example is
shown in Fig. 5.
In Stage Three, bitlevel reduction mechanisms are applied using a bitlevel
structure graph (BSG). This bitlevel representation was developed to
reveal the complete, somewhat irregular bitlevel structure of the
sections, making total dedication and further bitlevel (level 3)
systolization possible. All redundant bitlevel operations are eliminated
by specified reduction rules in the upper and lower BSG. In practice this
stage involves two substages: graph construction, leading to BSG1, and
elimination, leading to a fully minimized graph named BSG2. Fig. 6 shows
BSG1 for a stopband section with coefficients ($a = 2^{-1}$,
$b = 2^{?} - 2^{-5}$). The upper structure is reduced by a wordlength
adjustment cut (1) and a scaling cut (2), and the lower structure is
reduced by a cut (3). All bitlevel elements above (1) and (2) and below
(3) are redundant and can be removed (the reductions due to cut (1) have
been carried out in the figure). Due to the simplicity and low coefficient
wordlength of the example section, the reductions are not remarkable. In
general, and especially for more complex structures, the reduction
mechanisms have a considerable effect. From the minimized graph (BSG2),
the final wordlengths of the PEs are determined, and the practical
hardware implementation is thereafter trivial, using BSG2 and a library of
PEs (Xilinx hard macros).
In Stage Four, the final floorplanning (level 1) is carried out,
considering the FPGA topology and the fixed placement of memory
connections and communication channels on the PC board. The linear
systolic array is mapped directly to a linear FPGA array, with multiple
sections in every FPGA. Fig. 7 shows four XC4005 chip plots of the
realization. Timing analysis showed a maximum delay of 49 ns, giving a
sampling frequency of 20 MHz. The systolic array uses 538 of the 784
available CLBs. Further resources are needed for memory control.
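The figures quoted for the realization can be checked with a few lines of
arithmetic. The sketch assumes one sample is processed per clock cycle, so
that the maximum sampling rate is the reciprocal of the maximum delay; the
variable names are illustrative.

```python
max_delay_ns = 49.0                   # reported maximum delay
clb_used = 538                        # CLBs used by the systolic array
clb_total = 784                       # stated total for the four XC4005 devices

max_rate_mhz = 1e3 / max_delay_ns     # 1 / 49 ns, expressed in MHz
utilisation = clb_used / clb_total    # fraction of CLBs occupied
clb_per_device = clb_total / 4        # CLBs per XC4005

print(f"max sampling rate: {max_rate_mhz:.1f} MHz")
print(f"CLB utilisation:   {utilisation:.1%}")
print(f"CLBs per device:   {clb_per_device:.0f}")
```

The reciprocal of 49 ns is slightly above 20 MHz, consistent with the
quoted sampling frequency, and roughly two thirds of the available CLBs
are occupied by the array itself.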
Structure classification 
Different section structures, including transpose form, direct form and
two lattice structures, have been analyzed with respect to time-area
efficiency. Efficiency has two primary aspects: (1) Resource usage: the
map to PEs and the number of registers realized outside PEs. (2)
Performance: temporal locality, i.e. the number of arithmetic units in the
longest combinatorial path.
Two efficient forms have been defined, each representing one of the above
aspects: the Adjusted Form, representing an optimal map to PEs, and the
Maximal Form, representing a fully systolic multiplier-free section
structure with only one arithmetic unit in every combinatorial path. The
maximal form leads to a temporally local section processor array and to
the highest performance that can be achieved by 2nd-level systolization.
A library of systolized multiplier-free section structures in the two
defined forms has been generated. The practical classification is based on
3 complexity parameters representing resource usage, performance and
pipeline delay. The library of systolized section structures and the
attached classification parameters make it possible to determine the
optimal structures for every application.
Bitlevel (level 3) systolization
Higher performance is achieved by bitlevel systolization on the basis of
BSG2. Effective bitlevel cuts (I) and (II) for the stopband section
example are shown in Fig. 6. The result of a bitlevel cut-set is a
performance increase at the cost of increased resource usage. The table
below shows implemented and estimated results for the two worst-case 2nd-
and 4th-order sections in the different performance classes that can be
reached with a 16-bit data representation.
Form          F_s [MHz]   Area (1-2) [CLB]   Area (1-3-3) [CLB]
Adjusted      20-23       30                 75
Maximal       29-33       46                 99
Maximal(I)    ?           ?                  ?
Maximal(II)   ?           ?                  ?
Using a time-area efficiency index,
$\mathrm{Order} / (\mathrm{Area} \cdot T)$, the hardware methodology
yields both efficiency and performance considerably above any known
approach on the XC4000, XC3100 and XC3000 series. The comparison has been
made with the best known approaches, presented in [2] and [3].
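As a worked illustration of the index, the sketch below evaluates it for
the table entries that are legible (areas of 30 and 75 CLB in the 20-23
MHz class, 46 and 99 CLB in the 29-33 MHz class). It rests on assumptions:
$T$ is taken as the sample period $1/F_s$, the lower end of each frequency
range is used, and the 1-2 and 1-3-3 sections are treated as 2nd- and
4th-order sections as described in the text. The function name is
hypothetical.

```python
def time_area_index(order, area_clb, fs_mhz):
    """Order / (Area * T), with T = 1 / F_s taken in microseconds."""
    period_us = 1.0 / fs_mhz
    return order / (area_clb * period_us)


# (label, section order, area in CLB, assumed F_s in MHz)
rows = [
    ("20-23 MHz class, 1-2 section", 2, 30, 20.0),
    ("20-23 MHz class, 1-3-3 section", 4, 75, 20.0),
    ("29-33 MHz class, 1-2 section", 2, 46, 29.0),
    ("29-33 MHz class, 1-3-3 section", 4, 99, 29.0),
]

for label, order, area, fs in rows:
    print(f"{label}: {time_area_index(order, area, fs):.2f} order*MHz/CLB")
```

Under these assumptions the index stays within a narrow band across both
section types and both performance classes.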
Since the FPGA component has the same flexibility as the digital signal
processor (in terms of programmability), it is reasonable to compare with
this traditional DSP technology. This can be done by calculating the
equivalent signal processor multiply-accumulate (MAC) rate,
$N \cdot F_s$. Realizing the 33-tap filter example (corresponding to a
27-tap infinite precision filter) in the 4 performance classes defined
above results in MAC rates of 672 MHz, 928 MHz, 1.22 GHz and 1.41 GHz,
using 538, 763, 922 and 1081 CLBs, respectively. With these relatively
modest amounts of resources, the hardware methodology thus generates
filter architectures with MAC rates between one and two orders of
magnitude larger than those of state-of-the-art signal processors.
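The quoted MAC rates and CLB counts can be related with a short
calculation. The MAC rate per CLB follows directly from the quoted
numbers. The implied sampling rates additionally depend on the value of
$N$, which the text does not state; $N = 32$ is used here purely as an
assumption, chosen because it yields sampling rates inside the performance
classes given earlier.

```python
mac_rates_mhz = [672.0, 928.0, 1220.0, 1410.0]   # quoted MAC rates
clbs = [538, 763, 922, 1081]                     # quoted CLB usage

# MAC rate delivered per CLB in each performance class.
mac_per_clb = [rate / area for rate, area in zip(mac_rates_mhz, clbs)]

ASSUMED_N = 32                                   # hypothetical value of N
implied_fs_mhz = [rate / ASSUMED_N for rate in mac_rates_mhz]

for rate, per_clb, fs in zip(mac_rates_mhz, mac_per_clb, implied_fs_mhz):
    print(f"{rate:7.0f} MHz MAC: {per_clb:.2f} MHz/CLB, F_s = {fs:.1f} MHz")
```

The MAC rate per CLB varies only slightly between the four classes, which
matches the claim that efficiency is retained as performance increases.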
Conclusions 
A new multiplier-free filter synthesis method with O(filter order)
complexity has been presented. Despite the low algorithm complexity, the
results compare well with other known approaches in most situations.
Furthermore, a hardware methodology synthesizing minimized, word-parallel
and bit-parallel multiplier-free filter architectures has been presented.
The total dedication to the Xilinx architecture and the DSP algorithm
leads to both efficiency and performance considerably above any known
approach. Efficiency is retained over a broad performance spectrum of
20-50 MHz (16 bit). In general, FPGA technology is very promising as a
future fundamental DSP component, offering the best of both the signal
processor (programmability, flexibility)




References
[1] S. Y. Kung. VLSI Array Processors. Prentice Hall, 1988.
[2] J. B. Evans. Efficient FIR filter architectures suitable for FPGA
implementation. IEEE Transactions on CAS-II, Vol. 41, No. 7, 1994.
[3] B. Feher. Efficient synthesis of distributed vector multipliers.
Microprocessing and Microprogramming 38, 1993, pp. 345-350.
[4] L. R. Rabiner et al. Theory and Application of Digital Signal
Processing. Prentice-Hall, 1975, pp. 490-493.
Figure 1: An illustrative example. Legend: first iteration; second and last iteration.
Figure 2: Frequency responses
Figure 3: Systolic filter array and PEs. Labels: RAM, PE, level 2, basic operation module.
Figure 6: Bitlevel Structure Graph (BSG) of the stopband section, with input X15-0 and output Y15-0.
Figure 4: Systolized filter sections
Figure 5: Section processor arrays
Figure 7: Chip plots of systolic filter array
