Time-area efficient multiplier-free recursive filter architectures for FPGA implementation by Shajaan, Mohammad & Sørensen, John Aasted
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
General rights 
Copyright and moral rights for the publications made accessible in the public portal are retained by the authors and/or other copyright owners 
and it is a condition of accessing publications that users recognise and abide by the legal requirements associated with these rights. 
 
• Users may download and print one copy of any publication from the public portal for the purpose of private study or research. 
• You may not further distribute the material or use it for any profit-making activity or commercial gain 
• You may freely distribute the URL identifying the publication in the public portal  
 
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately 
and investigate your claim. 
   
 
Downloaded from orbit.dtu.dk on: Dec 17, 2017
Time-area efficient multiplier-free recursive filter architectures for FPGA
implementation
Shajaan, Mohammad; Sørensen, John Aasted
Published in:
Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing
Link to article, DOI:
10.1109/ICASSP.1996.550574
Publication date:
1996
Document Version
Publisher's PDF, also known as Version of record
Link back to DTU Orbit
Citation (APA):
Shajaan, M., & Sørensen, J. A. (1996). Time-area efficient multiplier-free recursive filter architectures for FPGA
implementation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal
Processing (Vol. Volume 6, pp. 3268-3271). IEEE. DOI: 10.1109/ICASSP.1996.550574
TIME-AREA EFFICIENT MULTIPLIER-FREE RECURSIVE FILTER 
ARCHITECTURES FOR FPGA IMPLEMENTATION 
Mohammad Shajaan and John Aasted Sgwensen 
Electronics Institute, Technical University of Denmark 
Building 349, DK-2800 Lyngby, Denmark 
E-mail: ms@ei . d t u  . dk and j aas@ei . d t u  . dk 
ABSTRACT 
Simultaneous design of multiplier-free recursive filters 
(IIR filters) and their hardware implementation in Xilinx 
Field Programmable Gate Array (XC4000) is presen- 
ted. The hardware design methodology leads to high 
performance recursive filters with sampling frequencies 
in the interval 15-21 MHz (17 bits internal data rep- 
resentation). It will be demonstrated that time-area 
eficiency and performance of the architectures are con- 
siderably above any known approach. 
1. Introduction 
In recent years the complexity of the Field Program- 
mable Gate Arrays (FPGA’s) have reached a level where 
they can be useful as a fundamental DSP-compone- 
nt. The functional structure of the XC4000 family is 
very constrained and complex, due to low level irreg- 
ularity. This irregularity may result in dramatic time- 
area efficiency differences between equivalent realiza- 
tions, making careful low level design and manual floor- 
planning necessary. This paper considers the neces- 
sary approaches to obtain optimal FPGA-designs, us- 
ing multiplier-free an IIR filters example. Because of 
the recursive nature of the IIR filters, the realization 
of these filters are more difficult than FIR filters. In 
this paper an efficient method for the design and real- 
ization of the IIR filters based on cascaded biquads is 
presented. 
2. IIR Filter Design 
The background of the filter synthesis method is, that 
the transfer function of an IIR filter can be realized by 
cascaded biquads [I]. The transfer function expressed 
by biquads becomes: 
If the coefficients of the biquads are signed power- 
of-twos (SPT) , the filter can be realized multiplier-free. 
This design problem can be formulated: 
Design problem : Find the best set of N bi- 
quads with SPT coefficients that gave the least normal- 
ized ripple. 
In [2] a method is presented for optimization of a 
cascaded realization of linear phase FIR filters. This 
method can also be used to optimize a cascaded realiz- 
ation of biquads. The method starts with an IIR filter 
with infinite precision coefficients, and then by using an 
univariate search optimizes the cascaded biquads. 
The implementation of a 10th order low-pass But- 
terworth filter is used to explain the hardware synthesis 
method. The frequency response of the multiplier-free 
filter is shown in Fig. 1. The normalized ripple in pass- 
band and stop-band are also shown. Note that the fil- 
ter with quantized coefficients is not flat in the pass- 
band. The coefficients are chosen to be at most two 
SPT terms. Fig. 2 shows the possible zero and poles 
for a biquad with two SPT term coefficients and the 
word-length of the coefficients is 9; i.e., f 2 - P  f 2-q 
where 0 5 p ,  q < 9. The zero- and pole-grids show that 
these biquads can be used to realize the commonly used 
filters; of course, some IIR filters may have biquads that 
require more resources. 
3. Pipelining of a General Biquad 
Performance of the filters realized by cascaded biquads 
depends on the efficiency of the general biquads. In 
this section a pipelining method is presented for gen- 
eral biquads. Fig. 3 shows a direct form realization of a 
biquad of Eq. 1. Note that & is used in the biquad. Be- 
cause the coefficients of the biquads consist of at most 
two SPT terms, each coefficient is realized by two wired 
shifts and one addition or subtraction denoted by a. 
The critical logical paths are shown by dashed arrows. 
The critical paths contain 6 adders. These paths can be 
shortened by pipelining the all pole section and all zeros 
section of the biquads. The result of the pipelining is 
shown in Fig. 4, and the critical paths now contain just 
4 adders. It is not possible to shorten the critical paths 
in this structure furthermore, thus one of them contains 
0-7803-3 192-3/96 $5.0001996 IEEE 3268 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 10:40:53 UTC from IEEE Xplore.  Restrictions apply. 
just one delay element that can only be moved in the 
path without changing its length. By using an alternat- 
ive realization of this structure and cut-set retiming [3], 
it is possible to reduce the length of the critical path 
to 3 adders. The alternative structure is achieved by 
moving the adder denoted by (1) in Fig. 4, and apply- 
ing cut-set (A), as shown on Fig. 5. Big dots denote 
delay elements appeared due to cut-set retiming. The 
critical path contains now just three adders; i.e., a 50 % 
improvement with respect to, the original structure. 
4. Hardware Synthesis 
Zeros of a low-pass Butterworth filter are all located 
at z=-1; i.e., the sections in the numerator of the bi- 
quads are multiplier-free (a0 = a2 = 0.5 and a1 = 1). 
I is chosen to be one SPT term. These simple coef- bo 
ficients simplify implementation of Butterworth filters. 
The coefficients of the denominator (given in Tab. 1) 
are found using the optimization algorithm in [2]. 
The scaling factors (S ,  in Tab. 1) ensure that there 
will be no overflow in the adders and the dynamic range 
of the adders is optimally used. All scaling factors are 
limited to be one SPT term to ease their realization. 
The example is targeted for approx. 1.4 XC4005 
FPGA’s. Each XC4005 has 196 (14x14) configurable 
logic blocks (CLB). The hardware realization is syn- 
thesized in a 4 stage process: 
Stage One : The filter architecture is defined as 
a linear systolic array, shown in Fig. 6 .  By applying 
systolization cut-sets between the biquads, the filter ar- 
chitecture in level 1 is a linear, temporally and spa- 
tially local systolic array. Level 2 systolization involves 
pipelining of a biquad, which is explained in the pre- 
vious section. The elements (adders and registers) are 
marked in Fig. 5 to show the mapping process. R5 is 
the pipeline register due to the level 1 syst,olization. 
Stage Two : The next stage, in the implementation 
of the biquads, is mapping of the adders and registers to 
a processor array, which shows the relative location of 
the adders and registers in FPGA. The mapping can be 
done in many ways depending on the available routing 
resources. The only way to find the optimal location is 
to try all the possible orderings; this requires a lot of 
time. By placing the elements in a way that preserves 
the natural flow of data, it is possible to achieve a very 
good result. Fig. 7 shows an efficient processor array 
for the shown biquad. 
Stage Three : The processor array is then mapped 
to a bit level structure graph to examine possibility of 
reduction and bit level systolization. Then the length 
of the elements in the section processor array are de- 
termined, and the next step is mapping of the section 
processor array to the FPGA. Fig. 10 shows the bit 
level structure graph of the biquad number 1. 
Unlike FIR sections [2], it is not possible to real- 
ize biquads without internal truncation or rounding, 
thus the biquads are recursive and a realization without 
truncation needs adders o€ infinite length. Therefore all 
the adders are limited to be 17 bits adders. It is not 
possible to apply bit level systolization due the recurs- 
ive nature of the biquads. 
Considering the XC4000 family structure constrain- 
ts, three efficient types of the processing elements (PE) 
can be realized. These three PE’s can realize all types 
of operations required by a digital filter with SPT coeffi- 
cients, and are at the same time the most optimal iusage 
of the CLB’s of the FPGA’s. The processor elements 
are shown in Fig. 8. The array of the processor elements 
are shown in Fig. 9. The elements in the dotted squares 
in Fig. 7 are mapped to the same processor element in 
the FPGA. 
Stage Four : Having the section processor asray, 
the next stage in the implementation, is to floor-plan 
and to realize the processor elements with the required 
word-length and route the design. Fig. 11 shows an 
efficient floor-planing for the 10th order Butter worth 
filter in two XC4005 FPGA’s. Fig. 12 shows the routed 
design of the 5 biquads. Note that biquad 3 is realized 
using U-formed elements. 
The efficiency of the routing depends on the quality 
of the CAD-system used. Normally, these systems use 
simulated annealing to rout the design. Therefore, dif- 
ferent routing will result ‘Ln different timing properties. 
This makes it difficult to predict the result, but an es- 
timate can be given for speed grade 5 XC4000 FPGA’s: 
T[ns] N 5.5 + -- 1.5 + N,  6 + N,  10 (2) 
where W is the word-length of the adders and N, 
is the number of the adders in the longest logical path. 
The three first terms in Eq. 2 are due to carry initial- 
ization, carry propagation and carry out (from the top 
full-adder), respectively. ‘The 4th term is the estimated 
routing delay per adder in the longest logical path. In 
this example the maximuin number of the adders in the 
longest logical paths in the biquads is 2, and the .word- 
length of the adders is 17. The delay for signal passing 
through these adders beciomes T = 49.5 ns; i.e., 
1 
fthroughput 2 - = 20.2 MHz 
T 
An implementation of the multiplier-free filter in two 
XC4005 FPGA (using XACT) showed that throughput 
of 21.6 MHz is possible. Each biquad was implemented 
in 54 CLB’s; i.e., the multiplier-free Butterworth filter 
requires 5.54=270 CLB’s. 
A careful floor-planning is also necessary to get a 
good performance and a, high throughput, and some 
experiences have shown that an automatic routing using 
w -- 1 
2 
(3) 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 10:40:53 UTC from IEEE Xplore.  Restrictions apply. 
the design software such as XACT is not sufficient to 
obtain high performance. 
As mentioned, a low-pass Butterworth filter has sim- 
ple coefficients in the numerator of the biquads and for 
some other types of filters, at least two SPT terms must 
be used for each coefficient. In this case the implement- 
ation is more difficult and experiments show that the 
maximum throughput rate is reduced to about 17 MHz 
for an all pole section (An all-zeros section can be real- 
ized with very high throughput rate regardless of the 
number of SPT terms used in coefficients [2]). 
5. Comparison and Conclusion 
The design of an IIR filter is compared with the two 
fastest designs [4, 51 from the literature. The compar- 
ison is based on the Time-Area efficiency index: 
(4) 
Filter order(N) . Word-length(B) 
Area(A) . Sampling Period(T) 
and the result is shown in the following table. 
I =  
Multiply Accumulate rate is defined by 
MAC-rate = (Number of coefficients) fs (5) 
The MAC-rate of the above filter with 22 coefficients 
becomes 475.2 MHz. The realizations in [4] and [5] 
achieve l.6MHz and lOOMHz MAC-rates, respectively. 
This work introduces a new efficient architecture, 
which achieves a performance higher than the known 
approaches. The architecture uses FPGA technology to 
realize programmable dedicated hardware for IIR filters 
with highly quantized coefficients. 
6. References 
L. R. Rabiner et. al., “Theory and Application of 
Digital Signal Processing, ” Prentice-Hall 1975. 
M. Shajaan et. al., “Time-Areu Eficient Multi- 
plier-Free Filter Architectures for FPGA Imple- 
mentation, ” ICASSP 95, pp. 3251-3254. 
S. Y. Kung, “VLSI Array Processors, ” Prentice 
Hall, 1988. 
M. Wahab et. al., ‘%PGA-based DSP Systems, ” 
The Oxford 1993 International Workshop on Field 
Programmable Logic and Application, pp. 291- 
298. 
C. Chou et. al., “FPGA Implementation of Digital 
Filters, ” ICSPAT 93, pp. 80-88. 
Table 1: The coefficients of the 10th order filter. 
StODband 
-1500  0 1  0 2  0.3 0 4  0 5  0 6  0.7 0 8  0 9  1 
Normalized frequency 2*f/f, 
Figure 1: Multiplier-free filter example. 
Figure 2: Zeros- and pole-grid for a direct form realiz- 
ation of biquads with 2 SPT term coefficients. 
I L-gJ-T E I 
-b2 a2 
Figure 3: General biquad. Dashed arrows show the 
critical paths. 
3270 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 10:40:53 UTC from IEEE Xplore.  Restrictions apply. 
a2 
Figure 4: Pipelined biquad. 
Biquad 0 
a2 
Figure 5: Alternative realization of a pipelined biquad. 
Biquad 1 ......____ I BiquadN t-- Level I 
Figure 7: The processor array for the biquad given in 
Fig. 5. 
Figure 8: The three efficient processing elements for 
XC4000 FPGA’s. 
A1,Rl C2 A2,R2 C1,R4 A4,R3 A3,R5 
Figure 9: The mapping of a biquad to the processor 
element in FPGA. 
0 
1 
2 
3 
-4 
5 
4 
7 
8 
9 
-10 
-1 1 
-12 
-13 
*14 
-15 
-16 
A1 R1 C2 A2 R2 C1 R4 A4 R3 A3 R5 
Figure 10: The bit level structure graph for the biquad 
number 1 (s = 2-l, l/h0 = 1, bl = -1 - 2-4, b2 = 
1 - 2-2, a0 = 132 = 2-l, and a1 = 1). 
0 0 0 0 0 0 0 U 0 0 c 1 0  0 0 
U U ti c U U U 0 0 0 c 1 0  0 0 
0 0 0 0 0 0 0 0 0 0 u 0 0 0  
0 0 U 0 0 ullDu c1 0 U 41 
~~ 
o n [I 
n o o n o o o o o n o o n o  
n n u ~ n o o n ~ n ~ ~ n n  
o o CI o 
0 CLB 
Figure 11: The floor platnning of the 10th Butterworth 
filter in two XC4005 FPGA’s. 
Figure 12: The chip plots of FPGA 1 and 2. 
327 1 
Authorized licensed use limited to: Danmarks Tekniske Informationscenter. Downloaded on July 14,2010 at 10:40:53 UTC from IEEE Xplore.  Restrictions apply. 
