An area-efficient systolic architecture for real-time, programmable-coefficient finite impulse response (FIR) filters
Introduction
Real-time filters are characterised by the feature that the rate of arrival of data is fixed and hence the filter has to deliver a certain computational throughput. When both the data rate and filter order are high, this requires enormous hardware resources. In fixed-coefficient filters, both the area and the latency of each filter tap are reduced through the use of canonical signed digit representation, whereby each filter coefficient is hard-wired in the corresponding tap [1]. Such an approach is however not possible in programmable-coefficient finite impulse response (FIR) filters, which are the domain of our interest.
For programmable-coefficient filters, the area can be reduced by a maximal reuse of hardware. In this paper, we introduce a technique called pipelined clustering which is used to derive a systolic FIR filter architecture that uses minimal hardware to sustain the required data rate under the given technology and design constraints. The technology and design constraints are encapsulated as a lower bound on the period of the system clock: $T_{min}$. Thus, given $T_{min}$ and the data period $T_{data}$, filter tap computations are multiplexed on a physical multiply-add unit which is appropriately pipelined. Further, the resulting systolic architecture that is derived using pipelined clustering is 100% efficient and has the same throughput, latency and approximately the same power dissipation as an unclustered full-sized array.
The technique of pipelined clustering that is used to derive the area-efficient systolic array for FIR filtering is based on the principles of retiming, slowdown and holdup (RSH) transformations [2], [3] and clustering [4], [5], [6], [7]. It therefore appears that it is equally applicable to other regular algorithms. It is superior to other existing techniques [4], [6] in that it is a synthesis methodology that results in an architecture that is optimal for the given design and technology constraints. Further, the resulting architecture is precisely specified and includes a complete description of the multiplexers and synchronisation delays that are required.
The paper is organised as follows. In Section 2, we begin with a brief discussion of systolic carry-save FIR filters. In Section 3, the concept of abstract processor arrays for slow and fast data rates is introduced. Pipelined clustering is presented in Section 4, where the features of this new methodology are also discussed. In Section 5, we summarise our work.
Systolic FIR Filters
The input-output relationship for an FIR filter can be represented by the following difference equation [8]:

$$y_n = \sum_{i=0}^{N-1} b_i \, x_{n-i}$$
where $x_n$, $y_n$ are the input and output of the filter at time $n$ respectively, the $b_i$'s are the coefficients of the filter and $N$ is the order of the filter [1]. Our emphasis in this paper will be on systolic architectures for the realisation of coefficient-programmable carry-save FIR filters. Various methodologies have been reported in the literature to arrive at different systolic realisations of FIR filters [2], [9], [10], [11]. A simple observation is that RSH transformations, when applied to the direct-form architecture of the FIR filter, can lead to all the existing systolic realisations [12].
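As a point of reference for the architectures that follow, the difference equation can be evaluated directly in software. The sketch below is only a behavioural model, assuming the usual convolution-sum form with one coefficient per tap and zero initial conditions; the function name `fir_direct` is illustrative and does not come from the paper.

```python
def fir_direct(b, x):
    """Behavioural direct-form FIR: y[n] = sum_i b[i] * x[n - i].

    Assumption: samples before time 0 are zero.
    """
    y = []
    for n in range(len(x)):
        acc = 0
        for i, coeff in enumerate(b):
            if n - i >= 0:          # skip taps that would read before time 0
                acc += coeff * x[n - i]
        y.append(acc)
    return y


# An impulse at n = 0 reproduces the coefficients; the later sample at n = 3
# starts a second copy of the impulse response.
print(fir_direct([1, 2, 3], [1, 0, 0, 1]))  # prints [1, 2, 3, 1]
```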
The carry-save methodology is widely used in FIR filters [1]. The basic idea is to postpone the time con- [...] architectures, the carry-save FIR filter architecture shown in Figure 1 [2] has the best overall performance metric and is particularly well suited for clustering.
The processing element (PE) of the FIR filter architecture is shown in Figure 2. The implementation of this PE uses two carry-save multipliers followed by the required number of 6-2 compressor slices (see footnote 1) to compress six inputs to two outputs which are in carry and save form [14]. Truncation can be incorporated in the design of the 6-2 compressor itself to restrict the output word-length [13], thereby making all the PE's constituting the full-size systolic FIR filter array identical.
The function performed by the PE is s,,t + tout = D21. 
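The six-to-two compression performed in the PE can be illustrated at word level. The sketch below is a hedged model, not the paper's circuit: it assumes the compressor can be viewed as a small tree of 3:2 carry-save adders operating on non-negative integers, and it does not model truncation or bit-slicing. The names `csa` and `compress_6_to_2` are hypothetical.

```python
def csa(a, b, c):
    """3:2 carry-save adder: returns (save, carry) with a + b + c == save + carry."""
    save = a ^ b ^ c                                # bitwise sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1      # majority bits, shifted left
    return save, carry


def compress_6_to_2(operands):
    """Reduce six operands to a (save, carry) pair with four 3:2 stages.

    Assumption: a plain 3:2 tree stands in for the 6-2 compressor slices.
    No carry-propagate addition is performed anywhere.
    """
    if len(operands) != 6:
        raise ValueError("expected exactly six operands")
    s1, c1 = csa(operands[0], operands[1], operands[2])
    s2, c2 = csa(operands[3], operands[4], operands[5])
    s3, c3 = csa(s1, c1, s2)
    return csa(s3, c3, c2)


save, carry = compress_6_to_2([3, 5, 7, 11, 13, 17])
print(save + carry)  # the final carry-propagate add recovers the total
```

The point of the carry-save form is that the slow carry-propagate addition is needed only once, when the two outputs are finally merged.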
Abstract Processor Array
In this section, we introduce the notion of Abstract Processor Array (APA), which is characterised by the fact that the clock of the array is matched to the incoming data rate, i.e., $T_{clock} = T_{data}$, where $T_{data}$ and $T_{clock}$ are the input data and clock period respectively. As per the definition of systolic arrays [4], [15], the fundamental period $\delta$ of the systolic full-sized array is equal to the input data period, i.e., $\delta = T_{data}$.
The processing element of the APA is denoted as Abstract Processing Element (APE). Let $\mathcal{L}$ denote the [...] The throughput rate is equal to $f_\delta$, i.e., $\mathcal{T}(APA)$ [...]

Footnote 1: A 6-2 compressor is the tree implementation of an adder with outputs in carry and save form.
Since each APE does a valid computation in each clock period, the Hardware Utilisation Efficiency (HUE) (see [4], [6], [15] for a formal definition) is 100%. This would appear to imply that a further reduction in the array size which maintains the same throughput and latency is not possible. The above statement is true for the APA since $T_{clock}$ is set as $T_{data}$, as is typical in many current real-time systolic systems. However, as will be made clear in the following sections, by increasing the clock frequency to the [...]
Pipelined Clustering
The technique of clustering [4] is essential to reduce the number of processor elements in the final physical processor array implementation. We propose a new methodology of Pipelined Clustering which is used to arrive at an efficient and reduced-size processor array implementation from the Abstract Processor Array (APA). The method as applied to FIR filtering can be viewed as an extension of the passive and active clustering of [4]. The resulting reduced-size processor array obtained by clustering is referred to as the Physical Processor Array (PPA) and the processor element of this PPA is denoted by Physical Processor Element (PPE). The following are some of the important features of the PPA:
• The PPA is 100% efficient and the number of PPE's is minimum.
• The PPA is systolic and $\mathcal{L}(PPA) = \mathcal{L}(APA)$.
Pipelined Clustering is a 4-step process which consists of slowdown, holdup and retiming, followed by mapping of the PE's of the Abstract Processor Array onto Physical Processor Elements (PPE's) and scheduling of computations on the clustered array. [...] $T_{min}$. The number of stages by which the PPE has to be pipelined is $n$, where $n =$ [...] $\geq 1$. The resulting PPA for the case $p \geq n$ is the clustering of the APA corresponding to the slow data rate. Similarly, the resulting PPA for the case $p < n$ is the clustering of the APA corresponding to the fast data rate. Note that $p < n$ implies $n > 1$. Independent of whether the data rate is slow or fast, the following four steps involved in the method of pipelined clustering are applied to the APA shown in Figure 3.
Step 1: Slow down the entire APA by a factor of $p$.

Step 2: Introduce $(n - 1)$ holdup registers through the input of the APA, where $n =$ [...] and $T_{clock} = T_{data}/p$.

Step 3: Retime these registers to pipeline the APE's into $n$ equal stages.

Step 4: $p$ locally-interconnected APE's are mapped onto one PPE with appropriate control overhead circuitry and synchronisation delays. An APE$_i$ of the APA gets mapped onto PPE$_j$ according to the relation $j =$ [...] The computations of these mapped APE's are started serially on a PPA according to [...]

The schematic circuit diagram of a PPA for a general case of pipelined clustering is shown in Figure 5.
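The parameter choices behind the four steps can be sketched numerically. Several formulas are not legible in the text above, so the following are explicit assumptions rather than the paper's definitions: $p$ is taken as the largest integer with $T_{data}/p \geq T_{min}$, $n$ as the number of clock periods needed to cover a hypothetical APE combinational delay `t_ape`, the mapping as $j = \lfloor i/p \rfloor$ (consecutive APE's share a PPE), and the serial start slot as $i \bmod p$. All function names are illustrative.

```python
def cluster_parameters(t_data, t_min, t_ape):
    """Return (p, t_clock, n) for integer time values in a common unit.

    Assumptions (not taken verbatim from the paper):
      p = floor(t_data / t_min)   -> fastest clock that still respects t_min
      n = ceil(t_ape / t_clock)   -> pipeline stages needed by one APE
    """
    if t_data < t_min:
        raise ValueError("data period is shorter than the minimum clock period")
    p = t_data // t_min
    t_clock = t_data / p
    n = (t_ape * p + t_data - 1) // t_data   # integer ceil of t_ape / t_clock
    return p, t_clock, n


def map_ape_to_ppe(i, p):
    """Assumed mapping: p consecutive APE's are clustered onto one PPE."""
    return i // p


def start_slot(i, p):
    """Assumed schedule: APE i starts in clock cycle (i mod p) of a data period."""
    return i % p


p, t_clock, n = cluster_parameters(t_data=100, t_min=25, t_ape=60)
print(p, t_clock, n)                               # prints 4 25.0 3
print([map_ape_to_ppe(i, p) for i in range(8)])    # prints [0, 0, 0, 0, 1, 1, 1, 1]
```

With these example numbers $p \geq n$, so the configuration falls in the slow data rate case.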
Note that the clustering is locally sequential and globally parallel [16]. The switching functions of the various muxes that are present in the PPE's are discussed in the section on control overhead circuitry. The expressions for the number of delays required in the accumulated and feedback paths will be derived using timing analysis in the section on synchronisation delays.
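The locally sequential, globally parallel character of the clustering can be illustrated with a purely functional model. The sketch below assumes one coefficient per APE and groups $p$ consecutive taps on each PPE; it ignores pipelining, carry-save arithmetic, muxes and synchronisation delays, so it shows only that time-multiplexing the taps leaves the filter output unchanged. The name `fir_clustered` is hypothetical.

```python
def fir_clustered(b, x, p):
    """Functional model of a clustered FIR: p taps share one multiply-add unit.

    Within a data period each PPE processes its p taps one per clock cycle
    (locally sequential), while all PPE's run side by side (globally parallel).
    """
    num_ppe = -(-len(b) // p)                  # ceil(len(b) / p)
    y = []
    for n in range(len(x)):                    # one iteration per data period
        partial = [0] * num_ppe
        for slot in range(p):                  # p clock cycles per data period
            for j in range(num_ppe):           # every PPE is busy in this cycle
                i = j * p + slot               # tap handled by PPE j in this slot
                if i < len(b) and n - i >= 0:
                    partial[j] += b[i] * x[n - i]
        y.append(sum(partial))
    return y


print(fir_clustered([1, 2, 3, 4, 5], [1, 0, 0, 0, 0, 2], p=2))
# prints [1, 2, 3, 4, 5, 2]
```

The result is independent of `p`, which is the behavioural counterpart of the claim that the clustered array preserves the function of the full-sized array.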
Control Overhead Circuitry
Since the clock period of the PPA implementation is $T_{clock} = T_{data}/p$, every data cycle is divided into $p$ clock cycles. For any value of $p$ and $n$, the control overhead circuitry consists of six muxes.
• Mux1, mux2 and mux3, mux4 (refer Figure 5) are used as input and coefficient selectors to the multipliers in the PPE. Mux1, mux2, mux3 and mux4 are of size $p$ to 1 and select a different input combination for each clock cycle as per step 4 of pipelined clustering. Mux5 and mux6 are output and input selectors for the PPE respectively and are of size 2 to 1 for all values of $p$ and $n$.
• The positions of mux5 for the various possible values of $p$ and $n$ are described as follows:
  - Slow data rate case ($p \geq n$): Mux5 is connected to position II for all the clock cycles of a data period except the $n$th clock cycle, in which it is connected to position I.
  - Fast data rate case ($p < n$): [...]
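The mux5 rule for the slow data rate case can be written out as a per-cycle control schedule. This is a minimal sketch that assumes the clock cycles within a data period are numbered from 1 to $p$; the fast data rate case is deliberately left out because its description is incomplete in the text above. The function name is illustrative.

```python
def mux5_positions_slow(p, n):
    """Mux5 position for each clock cycle of one data period (slow case, p >= n).

    Assumption: clock cycles are numbered 1..p within a data period.
    Position I is selected only in the n-th cycle, position II otherwise.
    """
    if not (p >= n >= 1):
        raise ValueError("slow data rate case requires p >= n >= 1")
    return ["I" if cycle == n else "II" for cycle in range(1, p + 1)]


print(mux5_positions_slow(p=4, n=3))  # prints ['II', 'II', 'I', 'II']
```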
Synchronisation Delays
Hence, there is an interval of $p - 2$ clock cycles between the time instant at which the result of APE$_1$ is ready and the time instant at which APE$_0$ requires the result from APE$_1$. This implies that $p - 1$ delays are required in the feedback path.
Summary
From the above analysis, it follows that for both the slow and fast data rate cases, $2p - 1$ delays are required in the accumulated path and $p - 1$ delays are required in the feedback path.
Features of Pipelined-Clustered FIR Filters
Following are the features of the PPA arrived at by using pipelined clustering.
• The Physical Processor Array is systolic and the fundamental period of the systolic PPA is $p$ clock cycles, which is the same as one data period, i.e., $\delta = p \times T_{clock} = T_{data}$. The number of delays that are required in the feedback path of the PPE of the PPA is equal to $p - 1$, i.e., $x = p - 1$ (refer Figure 5). In order to maintain the systolic nature of the PPA, the $2p - 1$ delays that are present in the accumulated path of the PPE have to be split into $y$ and $z$ delays and placed in each PPE as shown in Figure 5. The splitting is accomplished as follows:
  - Slow data rate: the number of delays $y$ required at position I of the PPE is equal to $p - n$, and the number of delays $z$ required at position III of the PPE is equal to $p + n - 1$, so that $y + z = 2p - 1$.
  - Fast data rate: the number of delays $y$ required at position I of the PPE is equal to $p - (n \bmod p)$, and the number of delays $z$ required at position III of the PPE is equal to $p + (n \bmod p) - 1$.
• When $p = 1$ and $n = 1$, the PPA is the same as the APA corresponding to the slow data rate, and when $p = 1$ and $n > 1$, the PPA is the same as the APA corresponding to the fast data rate.
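The delay bookkeeping described above can be checked with a few lines of code. This is a hedged sketch: $y$ follows the stated expressions, with the fast-case expression read as $p - (n \bmod p)$, and $z$ is computed as whatever remains of the $2p - 1$ accumulated-path delays, which is an inference from the stated total rather than a formula quoted from the paper. The function name is hypothetical.

```python
def sync_delays(p, n):
    """Return (y, z, feedback) synchronisation delays for one PPE.

    y        : delays at position I of the accumulated path
    z        : delays at position III, taken as (2p - 1) - y (inferred)
    feedback : delays in the feedback path, p - 1
    """
    if p < 1 or n < 1:
        raise ValueError("p and n must be positive integers")
    if p >= n:                      # slow data rate case
        y = p - n
    else:                           # fast data rate case (assumed reading)
        y = p - (n % p)
    z = (2 * p - 1) - y
    return y, z, p - 1


print(sync_delays(p=4, n=3))  # slow case, prints (1, 6, 3)
print(sync_delays(p=2, n=5))  # fast case, prints (1, 2, 1)
```

In the fast case the inferred $z$ agrees with the stated $p + (n \bmod p) - 1$, which gives some confidence in the reading of $y$.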
Throughput: Due to the systolic nature of the PPA, the terminal processor places a valid output at the end of every data period. Since there is an output in each data period, the throughput of this PPA is $f_{data}$, which is the same as $f_\delta$, so $\mathcal{T}(PPA) = \mathcal{T}(APA)$.
Latency: The latency of the PPA is the same as the latency of the corresponding APA. In the slow data rate case, the computation of APE$_0$, which is mapped onto PPE$_0$, goes through $n$ pipelining stages and the output gets latched at the end of the same data cycle. So in this case the latency is one data cycle, which is the same as $\delta$ of the APA. For the fast data rate case, it takes more than one data period for the output of APE$_0$ to be ready since $n > p$. In this case the latency is [...], which is the same as that of the corresponding APA. Hence $\mathcal{L}(PPA) = \mathcal{L}(APA)$.
I/O Bandwidth: The number of inputs and outputs and the I/O bandwidth of a PPE are the same as those of the corresponding APE.
Pipelinability: Pipelining of the APE can be achieved up to a 6-2 compressor delay. Further pipelining below the 6-2 compressor delay can be achieved by introducing additional holdup registers and then retiming them. However, this will require modification of the APE configuration and will also add to the input-output latency.
Power Dissipation: The dynamic power dissipation of the PPA is the same as that of the corresponding APA. This is because in the PPA $1/p$ as many processors are being clocked at $p$ times the frequency as compared to the APA. The above analysis ignores the power dissipation in the overhead circuitry.
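The power argument can be made explicit with a first-order model. The derivation below is only a sketch under assumptions that are not stated in the paper: dynamic power per processor is taken as $C_{PE} V^2 f$, where $C_{PE}$ (switched capacitance per processing element) and $V$ (supply voltage) are hypothetical symbols, the array has $N$ APE's with $N$ divisible by $p$, and a PPE is assumed to switch the same capacitance per clock cycle as an APE.

```latex
\begin{align*}
P_{APA} &= N \, C_{PE} \, V^2 \, f_{data} \\
P_{PPA} &= \frac{N}{p} \, C_{PE} \, V^2 \, (p \, f_{data})
         = N \, C_{PE} \, V^2 \, f_{data} = P_{APA}
\end{align*}
```

The factor $p$ cancels, which is why the equality holds only when the overhead circuitry (muxes and synchronisation delays) is neglected.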
The control overhead which mainly consists of muxes is constant and the size of these muxes depends upon the clustering factor p . The number of synchronisation delays in each PPE is also dependent upon the clustering factor p .
Mapping to Fixed Size Array
The above technique is directly applicable to the problem of mapping an APA onto a fixed-size PPA for non-real-time applications. The objective is to maximise the throughput subject to the constraint that the number of PPE's, $N_0$, in the PPA is fixed and the lower bound on the clock period of the system is $T_{min}$.
If the abstract processor array has $N$ APE's, then the mapping factor $p$ is given by $p =$ [...] As discussed in the features of pipelined clustering, the throughput of the PPA is $f_\delta = 1/\delta$, where $\delta$ is the fundamental period of the systolic PPA. Since the fundamental period $\delta$ of the PPA is $p$ clock cycles, throughput $= 1/(T_{clock} \times p)$. So one can use $T_{min}$ as the clock period for the PPA to get the maximum throughput of [...]
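The fixed-size mapping can be illustrated with a short calculation. The expression for the mapping factor is not legible in the text above, so the sketch assumes $p = \lceil N / N_0 \rceil$, the smallest factor that fits $N$ APE's onto $N_0$ PPE's; the throughput then follows the stated relation $1/(T_{clock} \times p)$ with $T_{clock}$ set to $T_{min}$. The function name is hypothetical.

```python
def fixed_array_throughput(num_ape, num_ppe, t_min):
    """Return (p, throughput) when N APE's are mapped onto N0 PPE's.

    Assumption: p = ceil(N / N0); the paper's own expression is not legible.
    Throughput is in outputs per unit of t_min (e.g. per ns if t_min is in ns).
    """
    if num_ape < 1 or num_ppe < 1 or t_min <= 0:
        raise ValueError("array sizes and t_min must be positive")
    p = -(-num_ape // num_ppe)          # integer ceiling of N / N0
    return p, 1.0 / (t_min * p)


p, throughput = fixed_array_throughput(num_ape=16, num_ppe=4, t_min=10)
print(p, throughput)  # prints 4 0.025
```

With $T_{min}$ given as 10 ns, the example corresponds to one output every 40 ns.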
Conclusions
In sum, we have presented an area-efficient systolic architecture for real-time, programmable-coefficient FIR filters. The technique of pipelined clustering has been introduced and used to derive the architecture, in which $p = \lfloor T_{data}/T_{min} \rfloor$ APE's are mapped onto one PPE. The precise multiplexing and synchronisation delays that are required have also been derived. It has been shown that the reduced-size PPA has the same latency, throughput and power dissipation (ignoring the power dissipated in the overhead circuitry) as the full-sized array.
While the technique of pipelined clustering has been introduced in this paper in the specific context of synthesising area-efficient systolic FIR filters, it can quite clearly be used for any systolic array. We have, for example, successfully used it to develop a high clock speed, reduced-size array for IIR filtering.
The approach developed in this paper does not explicitly consider the problem of minimising the number of synchronisation delays. In this context it is interesting to note that if one clusters the APE's in the reverse direction, i.e., schedules the computations of the APE's on the PPE in the order which is the reverse of that given in Section 4, then the resulting clustered array is multirate systolic [15] and has fewer synchronisation delays but more input-output latency. For more complex regular algorithms, the identification of the optimal cluster and the minimisation of the number of synchronisation delays will be considerably more difficult than for the case of FIR filters. Techniques available in [16] and [17] may be useful in solving these problems.
