I. INTRODUCTION
SYSTOLIC architectures, first developed by Kung [1], are characterized by a high degree of modularity, regularity, localized data communication, global clocking, and increased speed of computation. Implementation of such architectures through VLSI techniques is facilitated by the repetitive nature of the processing elements (PE's).
Recently, systolic architectures for 2-D filtering have been proposed [2]-[4]. While the architecture in [2] is derived from the filter transfer function, the one in [4] is based on the local state space model. The realization in [2] has certain advantages over the one in [4], including simpler PE's and data input in a raster scan format, but it requires a large number of shift registers. In most image processing applications, this feature may prevent a monolithic implementation of the filter. Therefore, it is of considerable interest to develop architectures which require fewer delay elements and at the same time have equal, if not better, performance. This objective has been achieved to a certain extent by the architectures in [3]. We extend this line of work by presenting another improved systolic architecture for 2-D filtering. This architecture has been derived from the signal flow graph (SFG) representation of the filter by the application of a systolizing procedure given in [5].
Manuscript received June 16, 1990; revised October 18, 1990. The author is with the Department of Electrical Engineering, University of Minnesota, Minneapolis, MN 55455. IEEE Log Number 9042726.
II. IMPROVED SYSTOLIC ARCHITECTURE
It has been rigorously proved in [5] that any computable SFG can be systolized by rescaling the delays. A relevant case of this procedure, which shall henceforth be referred to as the systolic transformation (ST), is shown in Fig. 1. The systolic architecture, which was presented in [7] for a 1-D IIR filter, can also be derived through the ST. If the SFG which has to be systolized is canonical in the number of delays, then this approach would yield a very different realization in terms of the number of registers. We therefore first derive an SFG which is canonical in the number of delays (canonical SFG) and then, employing the ST (Fig. 1), develop the systolic architecture.
A. 2-D IIR Filter
Unlike in [7], where the 1-D transfer function was taken to be strictly proper, we assume a more general transfer function. An $N$th-order 2-D IIR filter transfer function is defined as If $Y(z_1, z_2)$ and $X(z_1, z_2)$ represent the output and input data in the z-domain, respectively, then
From (2.2), the canonical SFG for any 2-D IIR filter of order $N$ can be derived. In Fig. 2(a), we show the canonical SFG for $N = 2$. Next, we systolize the SFG in Fig. 2(a) by employing the ST (Fig. 1) to derive the SFG for a systolic architecture (Fig. 2(b)). It is clear that the ST introduces two delays for every other $z_2^{-1}$ delay in the canonical SFG. In order to map the systolic SFG (Fig. 2(b)) to an architecture, we assume that the sequence of input data is in raster scan format. In other words, the input data sequence is $x(0, 0), x(0, 1), \ldots, x(0, M-1), x(1, 0), x(1, 1), \ldots$, etc. Therefore, the length of a row of input is $M$. The architecture for a second-order IIR filter is presented in Fig. 2(c), where the $z_2^{-1}$ delays are replaced by shift registers of length $M$ and the $z_1^{-1}$ delays are single registers. It can be seen that two types of PE's (PE1 and PE2) are needed. To generate the architecture for higher order filters, all that is needed is to cascade PE1's in each subblock and add more subblocks in parallel. It can be easily confirmed that, if $N'$ represents the number of PE1's in each subblock, then
(2.3)
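The raster-scan mapping described above can be illustrated with a small sketch. In a stream with rows of length M, a delay of one row corresponds to M clock cycles (a shift register of length M), while a delay of one sample within a row is a single register. The helper names (`raster_index`, `stream_delay`, `delayed_stream`) and the toy 3 x 4 image are assumptions made for illustration; they do not appear in the original architecture.

```python
from collections import deque

def raster_index(n, m, M):
    """Position of pixel x(n, m) in a raster-scan stream with rows of length M."""
    return n * M + m

def stream_delay(row_delay, col_delay, M):
    """Clock cycles separating x(n, m) from x(n - row_delay, m - col_delay)."""
    return row_delay * M + col_delay

def delayed_stream(samples, delay):
    """Pass samples through a zero-initialised shift register of the given length."""
    if delay == 0:
        return list(samples)
    reg = deque([0] * delay, maxlen=delay)
    out = []
    for s in samples:
        out.append(reg[0])  # the oldest stored sample leaves the register
        reg.append(s)       # the newest sample enters; maxlen drops the oldest
    return out

# Toy image (hypothetical data): 3 rows of length M = 4, with x(n, m) = 10*n + m.
M = 4
image = [[10 * n + m for m in range(M)] for n in range(3)]
stream = [pixel for row in image for pixel in row]

one_row_back = delayed_stream(stream, stream_delay(1, 0, M))     # length-M shift register
one_sample_back = delayed_stream(stream, stream_delay(0, 1, M))  # single register

k = raster_index(2, 1, M)
print(stream[k], one_row_back[k], one_sample_back[k])  # prints "21 11 20"
```

Note that at the start of a row the single-register delay returns the last pixel of the previous row, since the stream is treated as one long sequence; how the image boundaries are handled is not addressed by this sketch.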
B. 2-D FIR Filter
In a fashion similar to the IIR case, we can repeat the analysis for FIR filters. In fact, all that is required is to set all $b_{i,j}$ equal to zero and repeat the analysis presented in Section II(A). For the sake of brevity, we present only the final systolic architecture for $N = 2$ (Fig. 3). Again, two types of PE's are needed (PE1 and PE2), and the architecture for higher order filters can be generated by cascading the PE1's in each subblock and adding the requisite number of subblocks in parallel.
III. COMPARISON WITH EXISTING ARCHITECTURES
The parameters that are compared are the number of adders, multipliers, and registers, the clock period, the latency, and the speed-up factor (SUF) [5]. The SUF measure, which is used to compare the speed efficiencies of systolic arrays, is defined as
$$\mathrm{SUF} = \frac{\text{Processing Time in a Single Processor}}{\text{Processing Time in the Array Processor}}. \tag{3.1}$$
For the purpose of comparison, we assume that all adders and multipliers are 2-operand and that $T_a$ and $T_m$ are the times required to complete one real addition and one real multiplication, respectively.
A. 2-D IIR Filter Comparison
In [3], three different systolic architectures (Fig. 4) for 2-D IIR filtering are presented. These three architectures, which shall henceforth be referred to as SCH1 (Fig. 4(b)), SCH2 (Fig. 4(c)), and SCH3 (Fig. 4(d)), respectively, are all based on the same PE as [2] (Fig. 4(a)). It must be mentioned that SCH2 is identical to the architecture proposed in [2]. Comparison of our 2-D IIR architecture with SCH1, SCH2, and SCH3 is tabulated in Table I. It is clear that we have achieved a reduction of the order of $MN$ in the number of delay elements as compared to SCH2. This reduction is substantial because $M$ is usually of the order of $10^2$. As compared to SCH3, the reduction in the number of delay elements equals $N$. On the other hand, SCH1 requires $N^2/2 - N/2 - 1$ fewer latches than ours. For most practical applications, the reduction achieved by SCH1 is of the order of 10. In fact, for $N = 2$ our architecture requires the same number of latches as SCH1. The reduction in the delay elements achieved by SCH1 is at the cost of increased latency. Along with SCH2 and SCH3, SCH1 has a latency of one, while our architecture has the minimum achievable latency of zero. This fact can be checked easily by observing that the first output, $y(0, 0)$, in our architecture is available in the same clock cycle in which the first input, $x(0, 0)$, is made available to the circuit.
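The latch-count differences quoted above are easy to tabulate. The sketch below simply evaluates the two closed-form expressions from the text; the function names are hypothetical, and the absolute register counts of Table I are not reproduced here.

```python
def extra_latches_vs_sch1(N):
    """Latches our IIR architecture needs beyond SCH1: N^2/2 - N/2 - 1."""
    # N * (N - 1) is always even, so the integer division is exact.
    return N * (N - 1) // 2 - 1

def latches_saved_vs_sch3(N):
    """Delay elements saved relative to SCH3, which equals the filter order N."""
    return N

for N in range(2, 7):
    print(N, extra_latches_vs_sch1(N), latches_saved_vs_sch3(N))
```

For $N = 2$ the first expression evaluates to zero, matching the statement that both architectures then need the same number of latches, and for $N = 5$ it gives 9, which is consistent with a difference of the order of 10.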
The clock period for our architecture is marginally longer (by $T_a$) than that of SCH1 and SCH3, while it is clearly shorter than that of SCH2. This disadvantage is more than made up for by the improvement in the latency. The rest of the comparison parameters, i.e., the SUF measure and the number of adders and multipliers, are identical for all the architectures under consideration except SCH2. Unlike SCH1, SCH3, and the new architecture, where the SUF measure is equal to 1, the SUF measure for SCH2 deteriorates for increasing filter orders. If we assume that $T_a = T_m$, then the cycle period for SCH2 is an increasing function of $N$ for $N > 7$. Consequently, the SUF measure (Fig. 5) for SCH2 keeps decreasing for filter orders higher than 7.
B. 2-D FIR Filter Comparison
Though 2-D FIR architectures are not presented in [3], they were derived from the corresponding IIR filter architectures by equating the $b_{i,j}$ coefficients to zero. Let SCH1', SCH2', and SCH3' be the FIR architectures derived from SCH1, SCH2, and SCH3, respectively. Again, SCH2' is identical to the 2-D FIR architecture presented in [2].
Our architecture was compared with SCH1', SCH2', and SCH3'; the results are tabulated in Table II. It can be seen that the new architecture requires the least number of delay elements. Specifically, it requires $N + 1$ fewer registers than SCH1' and SCH2'.
Compared with SCH3', our architecture requires $N^2/2 + 3N/2 - 2$ fewer registers. The rest of the factors compare in a fashion similar to the IIR case. In Fig. 6, we show the variation of the SUF measure with the filter order $N$, under the assumption of $T_a = T_m$. This time the SUF measure for SCH2' deteriorates for $N > 3$. Similar to the IIR case, the latency of our architecture is the minimum achievable.
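As a quick check of the FIR register savings quoted above, the two expressions can be evaluated for small filter orders. The values of $N$ chosen here are illustrative and are not taken from Table II.

```latex
\begin{align*}
N = 2:&\quad N + 1 = 3, & \tfrac{N^2}{2} + \tfrac{3N}{2} - 2 &= 2 + 3 - 2 = 3,\\
N = 4:&\quad N + 1 = 5, & \tfrac{N^2}{2} + \tfrac{3N}{2} - 2 &= 8 + 6 - 2 = 12.
\end{align*}
```

The saving relative to SCH1' and SCH2' grows linearly with the order, whereas the saving relative to SCH3' grows quadratically.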
IV. ERROR ANALYSIS
It is well known that finite-precision arithmetic results in quantization errors. Therefore, it is essential to have an estimate of the errors involved. We present, in this section, a detailed error analysis of our architecture for the 2-D IIR case. Final error expressions for SCH1, SCH2, and SCH3 are also presented. Though the error expressions for FIR filters are not calculated, it is clear that these can easily be derived from the corresponding expressions for IIR filters. It must be mentioned that this analysis is a direct extension of the error analysis done in [6] for 1-D IIR filters.
For the purpose of error analysis, it would be convenient to consider the aggregation of subblocks in Fig. 2 
where $s_{i,j}$, representing the storage error, is defined as the error caused by storing the output of an adder in a latch. In [6], it has been discussed that for fixed point data representation with a dynamic range between 1/2 and 1, rounding with $(t + 1)$-bit registers results in $s_{i,j} \le 2^{-t-1}$.
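The rounding bound can be checked numerically with a minimal sketch. It assumes, as an interpretation not spelled out in the text, that a $(t + 1)$-bit register holds one sign bit and $t$ fractional bits, so that values are rounded to a grid of step $2^{-t}$; the function names are hypothetical.

```python
import random

def round_to_register(value, t):
    """Round value to the nearest multiple of 2**-t (assumed register format)."""
    scale = 2 ** t
    return round(value * scale) / scale

def storage_error_bound(t):
    """Bound on the storage error quoted in the text: 2**(-t-1)."""
    return 2.0 ** (-t - 1)

def max_observed_error(t, trials=10000, seed=0):
    """Largest rounding error seen over random values in the range [1/2, 1]."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        v = rng.uniform(0.5, 1.0)
        worst = max(worst, abs(round_to_register(v, t) - v))
    return worst

for t in (4, 8, 12):
    print(t, max_observed_error(t), storage_error_bound(t))
```

Under this assumption the observed error never exceeds half of the grid step, which is the bound $2^{-t-1}$.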
Representing the error at the output of by j , j (n, m) and neglecting the second-order terms of the type a;,.] ex(n (4.6) 1 =o Summing (4.6) over all $j$, we get the combined error at the final output as follows: .Z (QLi,ox(n-1, m) + P;+i,oY(n-1, m)) and the $s_i$'s are the storage errors due to the adders not belonging to any of the PE's.
If it is assumed that the $s_{i,j}$, for all the architectures under consideration, are of a similar nature, then it is apparent that our architecture has the lowest storage error. This is due to the fact that our architecture has fewer PE's and thus fewer adder results are stored. By equating all $b_{i,j}$'s, $\beta_{i,j}$'s, and $C_{i,j}$'s to zero, error expressions for the corresponding FIR architectures can also be derived.
