In this paper, we propose an architecture synthesis methodology to realize a cascaded Infinite Impulse Response (IIR) filter in
Introduction
Infinite Impulse Response (IIR) and Finite Impulse Response (FIR) digital filters are perhaps the two most fundamental tools in digital signal processing (DSP) applications. High-speed applications like video, radar and image processing require high sample rates, while some applications like speech and communication require slow or moderate sample rates. When speed is the primary goal, dedicated hardware is required for implementation of the filter taps. This forces the ideal coefficients to be represented as a set of finite-precision coefficients. This finite representation is a source of error in the frequency spectrum. Such errors are critical in certain filter applications, because they may cause inexact pole-zero cancellation.
In the direct form representation of IIR filters [1,2], quantization of one pole disturbs the configuration of the other poles, which may lead to an unstable filter. One method of reducing this instability is to increase the wordsize of the internal arithmetic unit [3], or to represent the filter in the cascaded form. Each stage in the cascaded form representation is a second order structure. Since each pair of complex conjugate poles is realized independently of all the other poles (those of the preceding or following stages), the cascade structure is generally less sensitive to coefficient quantization than the direct form. Further, the simulation of such systems can be restricted to individual stages of the cascaded structure. In this paper, we address the issues related to synthesis of VLSI architectures for cascaded IIR filters.
VLSI solutions offer the advantages of encapsulating complex systems in a single chip with enhanced performance of the overall system. Realizing a complex system on silicon is to a large extent determined by the system architecture. A high performance VLSI system architecture must ensure a regular structure with localized communication. These considerations favour implementations which feature arrays of identical or easily programmable Processing Elements (PEs) with localized interconnections for reduced communication cost. The regularity in the cascaded stages of the IIR filter makes it an ideal candidate for VLSI implementation.
IIR filters
We give below a few definitions and introduce terms to be used in the synthesis of cascaded IIR filters. We also assume that the system clock ($T_{clock}$) matches the data rate ($T_{data}$), i.e., $T_{clock} = T_{data}$.
Latency of the filter: $L$ is defined as the delay through the longest combinational path ($\tau_c$) between any two input and output registers.
Throughput of the filter: $F = 1/T_{data}$.
We consider a case where $T_{data}$ is much higher than $\tau_c$ (i.e., $T_{data} \gg \tau_c$). Clearly, the hardware that implements the combinational logic is underutilized, and its utilization can be enhanced by a factor of $\lfloor T_{data}/\tau_c \rfloor$ by suitably multiplexing and pipelining the hardware. Achieving this high performance implementation in recursive systems (e.g., IIR filters) is a challenge, since the recursion, or internal feedback, restricts any improvement in performance. This is because the latency associated with the feedback loop in recursive systems limits the pipelining and/or parallel processing. In non-recursive systems (e.g., FIR filters), latches can be placed across any feed-forward cutset without changing the transfer function, and the desired level of pipelining can be achieved. However, recursive systems cannot be pipelined to an arbitrary level by simply inserting latches, since the pipelining latches would change the number of delay operators in the loop, and hence the transfer function of the implementation. In this paper, we discuss the architectural synthesis of cascaded IIR filters.
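The contrast between feed-forward and feedback pipelining can be illustrated with a small sketch. The filters, coefficients and function names below are assumptions chosen for illustration and are not taken from the paper: latches on a feed-forward cutset only delay the output of an FIR filter, whereas an extra latch inside the loop of a first-order recursion changes its impulse response.

```python
def fir(x, h, extra_latches=0):
    """Direct-form FIR; extra_latches models registers on a feed-forward cutset."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:
                acc += hk * x[n - k]
        y.append(acc)
    # Feed-forward latches shift the output in time but leave its shape unchanged.
    return [0.0] * extra_latches + y[: len(y) - extra_latches]


def first_order_iir(x, a, loop_delay=1):
    """y[n] = a*y[n - loop_delay] + x[n]; loop_delay > 1 models a latch in the loop."""
    y = []
    for n in range(len(x)):
        fb = y[n - loop_delay] if n - loop_delay >= 0 else 0.0
        y.append(a * fb + x[n])
    return y


impulse = [1.0] + [0.0] * 7
fir_plain = fir(impulse, [1.0, 2.0, 3.0])
fir_piped = fir(impulse, [1.0, 2.0, 3.0], extra_latches=2)   # same taps, delayed by 2
iir_plain = first_order_iir(impulse, 0.5)                    # 1, 0.5, 0.25, ...
iir_latched = first_order_iir(impulse, 0.5, loop_delay=2)    # 1, 0, 0.5, 0, 0.25, ...
```

The latched recursion is not a delayed copy of the original response; it is a different filter, which is why the loop cannot be pipelined by latch insertion alone.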
The difference equation that identifies an IIR filter is given by [...]
The difference equation of one stage of equation (4) is given by [...]
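As a minimal sketch of what a cascade of second-order stages computes, the following uses the common textbook biquad form $y(n) = b_0 x(n) + b_1 x(n-1) + b_2 x(n-2) - a_1 y(n-1) - a_2 y(n-2)$. This notation, the sign convention and the function names are assumptions; the paper's own equations may use different symbols and a different structure.

```python
def biquad_step(coeffs, state, x):
    """Push one sample through a second-order stage (direct form I).

    coeffs = (b0, b1, b2, a1, a2)   # assumed notation, five multiplications
    state  = [x1, x2, y1, y2]       # previous two inputs and outputs
    """
    b0, b1, b2, a1, a2 = coeffs
    x1, x2, y1, y2 = state
    y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
    state[:] = [x, x1, y, y1]       # shift the delay line in place
    return y


def cascade(stages, samples):
    """Run samples through the stages in series; each stage keeps its own state."""
    states = [[0.0] * 4 for _ in stages]
    out = []
    for x in samples:
        for coeffs, state in zip(stages, states):
            x = biquad_step(coeffs, state, x)   # output of one stage feeds the next
        out.append(x)
    return out


# Two stages: a one-pole recursion followed by a pass-through stage.
response = cascade([(1.0, 0.0, 0.0, -0.5, 0.0), (1.0, 0.0, 0.0, 0.0, 0.0)],
                   [1.0, 0.0, 0.0, 0.0])
```

Because each stage holds only its own coefficients and state, a stage can be analysed or simulated in isolation, which matches the motivation given for the cascaded form.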
IIR Filter Architecture
The architecture of the cascaded IIR filter is derived from the system equation of a cascade stage of Fig. 2. In the following, we introduce additional terms that are used to describe the architecture synthesis procedure for a cascade section of the IIR filter.
Dependence Graph (DG): A DG is a dataflow [...] computations in an algorithm.
Shift-Invariant [...]
[...] $T_{PE}(\mathrm{VFSA})$ is the delay through each PE in the VFSA. Since we consider synthesis of cascaded IIR filters, we will restrict the architecture synthesis approach to cases when $T_{PE}(\mathrm{VFSA}) < T_{data}$. Thus the data arrive more slowly than a PE in the VFSA can process them, which implies that the PE in the VFSA is underutilized. A further folding of the PEs results in a single PE as shown in Fig. 5. The four stages of the cascaded filter of Fig. 2 now correspond to a linear array of four PEs, which forms a Fixed Full Size PE Array (FFSA) as shown in Fig. 6. Thus a PE in the FFSA represents a stage of the original cascaded filter. The critical path of the FFSA comprises 4 PEs, and the corresponding throughput is $F = 1/(4\,T_{PE}(\mathrm{FFSA}))$, where $T_{PE}(\mathrm{FFSA})$ is the delay through a PE in the FFSA. Further, by introducing four delay registers at the input of the system and retiming [5] the PEs in the FFSA, we obtain a pipelined system as shown in Fig. 7. The throughput of the resulting pipelined FFSA (PFFSA) is $F = 1/T_{PE}(\mathrm{PFFSA})$, where $T_{PE}(\mathrm{PFFSA})$ is the delay through a PE in the PFFSA. Since a PE in PFFSA performs the computation of 5 multipli-
VLSI Implementation
The PEs in the PFFSA are regular and pipelined, and hence VLSI implementation is feasible. In this section, we present an architectural synthesis methodology of cascaded IIR filters in Table Look Up [...] logic block (CLB). Each CLB contains programmable [...] function. Unlike in TLU FPGAs, where the delays are predictable, in antifuse technology the delays are functions of the depth of the combinational logic. Each CLB in Xilinx FPGAs implements a function of four independent input variables with two independent, optionally latched outputs. The function of the CLB can be extended to five variables and two independent outputs. Functionally, a CLB is defined as $F_{CLB} = f(i_1, i_2, \ldots, i_5, o_1, o_2)$, where $i_1, i_2, \ldots, i_5$ are the five inputs and $o_1, o_2$ are the two latched outputs of the CLB. In power system applications, particularly in protective relaying where the data has to be filtered, the data rate is around 3 kHz. Since the data rate is sufficiently slow, it turns out that multiple PEs of the VFSA can be folded onto a single PE of the PFFSA and realized in FPGA technology. A PE in the PFFSA comprises a multiply and add unit.
This necessitates implementing both a multiplier and an adder in FPGA technology, which is difficult to accommodate in a single XC3090 device, a top-of-the-line 3000 series device from Xilinx. Instead, if the coefficients are represented as simple sums and/or differences of powers of 2, then both high speed and low complexity can be achieved at the cost of a slight degradation in frequency response. Therefore, the multiply and add unit in the PFFSA uses a two-digit Canonic-Signed-Digit (CSD) [7] code for the representation of the filter coefficients. This reduces the critical path from five multiplications and five additions to ten additions. Thus each filter coefficient $h(n)$ is expressed as a sum or difference of at most two powers of two, i.e.,
$h(n) = s_1 \cdot 2^{-i} + s_2 \cdot 2^{-j}$,
where $s_1, s_2 \in \{-1, 0, 1\}$ and $i, j \in \{0, 1, \ldots, 9\}$.
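The two-digit representation can be sketched as a brute-force search over the digit and shift ranges stated above. This is only an illustration: the function names are hypothetical, the exponents are assumed to be negative (fractional coefficients), the search does not enforce the canonic (non-adjacent digit) property, and it is not claimed to be the authors' coefficient design procedure.

```python
from itertools import product


def two_digit_csd(h, max_shift=9):
    """Return (s1, i, s2, j) minimising |h - (s1*2**-i + s2*2**-j)|."""
    best = None
    for s1, s2 in product((-1, 0, 1), repeat=2):
        for i, j in product(range(max_shift + 1), repeat=2):
            approx = s1 * 2.0 ** -i + s2 * 2.0 ** -j
            err = abs(h - approx)
            if best is None or err < best[0]:
                best = (err, (s1, i, s2, j))
    return best[1]


def csd_value(code):
    """Value of the coefficient encoded by (s1, i, s2, j)."""
    s1, i, s2, j = code
    return s1 * 2.0 ** -i + s2 * 2.0 ** -j


def csd_multiply(x, code):
    """Multiply an integer sample by the coded coefficient with two shifts and one add.

    Right shifts truncate, so the result is exact only when no set bits are shifted out.
    """
    s1, i, s2, j = code
    return s1 * (x >> i) + s2 * (x >> j)


code = two_digit_csd(0.375)          # 0.375 is exactly representable with two digits
product_1024 = csd_multiply(1024, code)
```

Each multiplication thus becomes two shifted operands and one addition, which is consistent with the count of ten additions for five coefficients given above.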
Since the coefficients are represented as sums and/or differences, the basic block in a PE of the PFFSA is an adder. Thus the delay of the PE is equal to that of a carry-save adder. In Fig. 9, we give the sys- [...] to the five coefficients in two-digit CSD form, and $o_1, o_2$ correspond to the latched outputs in CSA form.
1. I/O Multiplexer: Since a CLB in the Xilinx XC3090 can be suitably used to perform the function of a full adder, the above function FA has to be realized using this full adder. To do this, the inputs to the full adder must be suitably multiplexed and the intermediate results of addition have to be stored. These functions are carried out in the I/O Multiplexer unit.
2. Coefficient Server: Filter coefficients, which are represented in two-digit CSD code, must be shifted and provided as operands for the full adder in a PE of the PFFSA. In order to keep the design general, the shift amounts corresponding to the coefficients are kept programmable. To that end, the coefficient server shifts the coefficient by the desired amount before presenting it as an operand for the full adder in a PE of the PFFSA. Pipelining is achieved at a very fine level of granularity by latching each stage of the coefficient shifter at the level of a CLB. The shifter has the properties of a barrel shifter, and hence shifts by a variable amount can be realized in constant time.
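One common way to obtain a constant-time variable shift is a logarithmic arrangement of multiplexer stages, where stage k either passes the word or shifts it by 2^k. The sketch below assumes this structure, a right shift, and four stages (enough for shift amounts 0 to 9); the paper does not give the CLB-level structure of its shifter, so these details and the function name are illustrative only.

```python
def barrel_shift_stages(word, shift, n_stages=4):
    """Right-shift a non-negative integer through n_stages latched mux layers.

    Stage k shifts by 2**k when bit k of `shift` is set, otherwise passes the
    word through. Returns the result and the value latched after each stage.
    """
    if not 0 <= shift < (1 << n_stages):
        raise ValueError("shift amount out of range")
    latched = [word]                 # latched[0] is the input register
    for k in range(n_stages):
        if (shift >> k) & 1:
            word >>= (1 << k)
        latched.append(word)         # one pipeline latch per stage
    return word, latched


result, trace = barrel_shift_stages(1024, 9)   # 9 = 0b1001: shift by 1, then by 8
```

The word always passes through the same number of latched stages regardless of the shift amount, which is the constant-time property described above.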
3. Computation Server: This forms the heart of the PE in the PFFSA. It comprises a basic full adder that operates on operands served by the Coefficient Server. Intermediate results of addition are stored in the adder as well as in the I/O multiplexer.
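The carry-save principle behind the PE can be sketched with word-level bit operations: three operands are compressed into a sum word and a carry word without propagating carries, and only the final pair needs a carry-propagating addition. The function names and the restriction to non-negative integers are assumptions for illustration; the sketch does not model the bit-serial multiplexing of a single full adder described in the text.

```python
def carry_save_add(a, b, c):
    """3:2 compression: a + b + c == sum_word + carry_word (non-negative ints)."""
    sum_word = a ^ b ^ c                               # per-bit sum without carries
    carry_word = ((a & b) | (a & c) | (b & c)) << 1    # per-bit carries, weighted by 2
    return sum_word, carry_word


def accumulate_csa(operands):
    """Fold a list of operands into one (sum, carry) pair, one compression each."""
    s, c = 0, 0
    for op in operands:
        s, c = carry_save_add(s, c, op)
    return s, c


# Ten operands, as in the ten additions of a two-digit CSD stage.
s, c = accumulate_csa([3, 5, 7, 9, 11, 13, 2, 4, 6, 8])
total = s + c    # a single carry-propagating addition resolves the result
```

Since each compression step has the delay of one full adder, the per-step delay does not grow with the word length, which is why the PE delay is stated to equal that of a carry-save adder.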
The PE has been implemented on an XC3090 device. The longest combinational path between any two registers is observed to be 96 ns. This ensures sustained operation at 10 MHz.
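The reported figures can be checked with a short calculation. The 96 ns path and the 3 kHz data rate are taken from the text; combining them through the factor $\lfloor T_{data}/\tau_c \rfloor$ is an illustration under the assumption that $\tau_c$ equals the measured path, and the resulting number is not a figure reported by the authors.

```python
CRITICAL_PATH_NS = 96      # longest register-to-register path reported for the PE
DATA_RATE_HZ = 3_000       # "around 3 kHz" protective relaying data rate

# Maximum clock frequency allowed by the critical path, in MHz.
max_clock_mhz = 1_000 / CRITICAL_PATH_NS

# floor(T_data / tau_c), computed in integer arithmetic to avoid rounding issues.
folding_factor = 1_000_000_000 // (DATA_RATE_HZ * CRITICAL_PATH_NS)
```

The critical path allows a clock slightly above 10 MHz, consistent with the stated operating rate, and at a 3 kHz data rate several thousand PE computations would fit within one sample period.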
Conclusion
In this paper, we presented a novel method of architecture synthesis of cascaded IIR filters in TLU FPGA technology. The synthesis procedure involves deriving the final architecture of the IIR filter through a series of transformations on the data flow graph (DG) of the filter. These transformations ensure that the hardware utilization is maximized for a given data rate and a given technology. Since each PE in the PFFSA is systolic and corresponds to a single cascade stage, it is possible to build nth order IIR filters by interconnecting n PEs in a linear array.
