Abstract-In this paper, we propose two-dimensional (2-D) systolic-array infinite-impulse response (IIR) and finite-impulse response (FIR) digital filter architectures without global broadcast, obtained by combining a modified reordering scheme with a new systolic transformation. The architecture has local broadcast, lower quantization error, and zero latency, without increasing the number of multipliers or delay elements and with a satisfactory critical period. Furthermore, we extend the new architecture to a 2-D systolic cascade-form architecture and provide a comprehensive error analysis of the proposed architectures.
I. INTRODUCTION
RECENTLY, two-dimensional (2-D) digital filters have been widely studied in a variety of digital signal processing (DSP) applications, such as image restoration [1], [2], obtained through a 2-D low-pass intraframe filter, and image enhancement [2], [3], performed by a 2-D high-pass filter. Generally speaking, 2-D digital filters can be divided into infinite-impulse response (IIR) and finite-impulse response (FIR) digital filters. An IIR digital filter has the advantages of high computational efficiency and low hardware cost, whereas an FIR digital filter has the merits of stability and linear phase. Although 2-D digital filters can be simulated on a general-purpose computer, processing input signals in real time is unlikely because of the large amount of computation. Therefore, application-specific integrated circuit (ASIC) design plays an important role in the realization of 2-D digital filters. One of the most attractive structures for an ASIC design is the systolic array architecture [4]-[9], since it is characterized by synchronization, a high degree of modularity and regularity, local broadcast/communication, concurrency, and extendibility. When a large number of processing elements (PEs) work together, data communication becomes more significant. In recent VLSI technologies, global broadcast has a significant impact on circuit-level speed [5], [6] in an ASIC design. Therefore, the local broadcast of systolic arrays is advantageous for reducing the impact of global broadcast [5]-[9].
Several systolic architectures for 2-D digital filters have been presented in [10]-[12]. However, in these structures the input and output signals are globally broadcast. Moreover, [10]-[12] lack a generalized discussion on further reducing storage error. As a result, we are motivated to design an architecture with local broadcast and lower storage error. In this paper, we propose an improved 2-D systolic architecture [13], [14] obtained by differently reordering the delays and summations of the filter-output function and then applying a new systolic transformation. The former scheme eliminates the th-dimensional global broadcast and the latter cancels the th-dimensional global broadcast and reduces storage error, where and are dimensional indices of 2-D digital filters. The hybrid of the two schemes can achieve higher throughput than that in [10], [12]. Thus, the resulting IIR and FIR architectures have local broadcast, lower quantization error, and zero latency under an acceptable throughput rate. The remainder of this paper is organized as follows. The new 2-D systolic digital filter architectures without global broadcast are proposed in Section II. The quantization error analysis of the proposed architectures is discussed in Section III. In Section IV, comparison results are tabulated in terms of local broadcast, storage error, critical period, latency, and the number of multipliers and delay elements. The last section concludes the paper.
II. AN IMPROVED SYSTOLIC ARCHITECTURE DESIGN
The general transfer function of a 2-D IIR digital filter can be represented as (1) where , as well as and are coefficients and the order [15] of the IIR digital filter, respectively. In this paper, a square image is fed to the following structures in raster-scan mode and thus the delay and where and denote a unit-delay element and the width of an image, respectively.
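As an illustration of the raster-scan operation described above, the following is a minimal sketch of a direct evaluation of a 2-D IIR difference equation on a flattened image. It is not one of the architectures discussed in this paper. The names `iir2d_raster`, `a`, and `b`, the sign convention of the feedback term, and the zero boundary condition are assumptions made for illustration; the sketch also assumes that a one-row delay corresponds to as many unit delays as the image width and a one-column delay to a single unit delay.

```python
def iir2d_raster(x, width, a, b):
    """Filter a raster-scanned image x (flat list, rows of `width` samples).

    Assumed difference equation (hypothetical sign convention):
        y[r, c] = sum_ij a[i][j] * x[r-i, c-j]
                  - sum_{(i,j) != (0,0)} b[i][j] * y[r-i, c-j]
    a and b are (N1+1) x (N2+1) nested lists; b[0][0] is ignored.
    """
    n1, n2 = len(a) - 1, len(a[0]) - 1
    y = [0.0] * len(x)
    for n in range(len(x)):
        row, col = divmod(n, width)
        acc = 0.0
        for i in range(n1 + 1):
            for j in range(n2 + 1):
                if row - i < 0 or col - j < 0:
                    continue  # assumption: samples outside the image are zero
                # i row delays (width samples each) plus j unit delays
                k = n - i * width - j
                acc += a[i][j] * x[k]
                if (i, j) != (0, 0):
                    acc -= b[i][j] * y[k]
        y[n] = acc
    return y


# Impulse response of a first-order example on a 3 x 3 image.
impulse = [1.0] + [0.0] * 8
a_coef = [[1.0, 0.0], [0.0, 0.0]]
b_coef = [[0.0, 0.5], [0.5, 0.0]]
response = iir2d_raster(impulse, 3, a_coef, b_coef)
```

With every entry of `b` set to zero, the feedback term vanishes and the same routine evaluates a 2-D FIR filter.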
A. 2-D Systolic Noncascade Form Digital Filter
Without loss of generality, we review and deduce the following structures under an assumption . Equation (1) can be modified as seen in (2), at the bottom of the next page, where and are defined as the input and output of the digital filter, respectively, in the transform domain. Equation (2), referred to as SCH2 of S-G-A [11], can be mapped to Sid-Ahmed's structure [10], [11] as shown in Fig. 1, where . In Fig. 1, the left index ( ) of the shift register (SR) denotes the number of unit-delay elements, and the input and output are the inverse transforms of and , respectively. Other 2-D IIR digital-filter structures can be realized either by reordering delays and summations [11] or by the systolic transformation [12]. The mapping equations for S-G-A's SCH3 [11] and Shanbhag's scheme [12] are described, respectively, in (3) and (4), at the bottom of the next page. The resulting structures corresponding to (3) and (4) are shown in Figs. 2 and 3, respectively.
Fig. 2. An IIR digital filter proposed by [11] for N = 2.
For convenience in identifying the broadcast directions of the input and output signal paths of 2-D digital filters, we divide them into th-dimensional input signal, th-dimensional output signal, th-dimensional input signal, and th-dimensional output signal, as indicated in Fig. 1. The signal notations defined in Fig. 1 can also be applied to Figs. 2 and 3 and to the proposed architectures. Since global broadcast leads to low-speed operation at the circuit level [6], we propose the hybrid of a modified reordering scheme and a new systolic transformation to achieve local broadcast. The procedures are as follows.
First, we eliminate the th-dimensional global broadcast of the input and output signals shown in Fig. 3 by differently reordering the delays and summations of the filter output function. The reordering is as shown in (5), at the bottom of the next page, where the integer variable is restricted in the range of 1 and so as to maintain local broadcast in the th-dimensional path. To simplify the representation of (5), we define two terms as (6a) and (6b).
Substituting (6a) and (6b) into (5), we can rewrite (5) as (7), with the accompanying definitions (8a) and (8b).
Fig. 3. An IIR digital filter proposed by [12] for N = 2.
From (7), several 2-D local broadcast architectures in the th-dimensional paths can be generated in the range of .
Next, we discuss how to realize the summation in the square brackets of (7) as a local broadcast architecture in the th-dimensional paths. Shanbhag [12] uses the systolic transformation shown in Fig. 4(b) instead of Fig. 4(a) to remove the th-dimensional global broadcast path. However, Shanbhag's scheme sacrifices the critical period to obtain the output, because as shown in Fig. 3 is not immediately blocked by the delay element in the th- and th-dimensional output paths. Another systolic transformation, shown in Fig. 4(c), is used to construct a one-dimensional (1-D) IIR digital filter [16]. Nevertheless, the construction of a 2-D digital filter is not discussed in [16]. To reduce the critical period and maintain local broadcast in the th-dimensional paths, we apply a new systolic transformation, shown in Fig. 4(d), to the 2-D IIR digital-filter design. In other words, the square brackets of (7) can be mapped to the structure of this new systolic transformation. We emphasize that, although the same or a similar second-order relationship holds among Fig. 4(a)-(d), the systolic transformation in Fig. 4(d) differs from those in Fig. 4(b) and 4(c) because the delay element in Fig. 4(a) is split differently. Using the modified reordering scheme and the new systolic transformation, we obtain a new 2-D systolic local broadcast digital filter architecture, shown in Fig. 5, with a shorter critical period than Shanbhag's structure [12]. Furthermore, the critical path of the subblock enclosed by the dotted circle in Fig. 5 can be shortened by the tree method, so that the critical period requires only one multiplication and three additions. In addition, by merely setting to zero, a new systolic FIR digital filter architecture is obtained.
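The tree method mentioned above can be illustrated in general terms. The sketch below is not a model of the subblock in Fig. 5; it only shows how summing operands pairwise reduces the number of adder levels on the longest path compared with a sequential chain. The function name `tree_sum` is hypothetical, and a non-empty operand list is assumed.

```python
def tree_sum(terms):
    """Sum a non-empty list pairwise; return (total, adder levels on the longest path)."""
    level = list(terms)
    depth = 0
    while len(level) > 1:
        # Add neighbouring operands in parallel at this level.
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])  # an odd operand passes through unchanged
        level = nxt
        depth += 1
    return level[0], depth


total, levels = tree_sum([1, 2, 3, 4, 5, 6, 7, 8])
chain_levels = 8 - 1  # a sequential chain of adders needs one level per addition
```

For eight operands the pairwise tree needs three adder levels, whereas a sequential chain needs seven, which is why the tree arrangement shortens the critical path.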
B. 2-D Systolic Cascade-Form Digital Filter
The realization of digital filters by cascading second-order IIR digital filters has many desirable features, such as lower sensitivity to coefficient quantization error and better roundoff noise performance than the noncascade-form realization in fixed-point operation [17]. Under the same assumption as stated in Section II-A, the cascade-form transfer function based on a 2 × 2th-order IIR digital filter can be written as
where and the floor operator denotes the largest integer less than or equal to . We insert one unit delay at the output of each th stage in order to achieve a systolic architecture and avoid degrading the critical period. Thus, (9) combined with and can be modified as (11). The number of inserted delays is equal to . These inserted delays do not affect the magnitude response of the 2-D systolic cascade-form architecture; they only introduce a latency of . For example, let and the resulting 2-D systolic cascade-form IIR digital filter architecture is shown in Fig. 6, where and the detailed block diagrams of PE and PE are shown in Figs. 7(a) and 7(b), respectively. Note that the proposed cascade-form architecture also has the local broadcast characteristic.
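The claim that the inserted delays leave the magnitude response unchanged can be checked numerically. The following is a minimal sketch, assuming each stage has the rational form numerator / (1 + feedback terms) and that the inserted unit delay acts along the second frequency variable; the function names, the coefficient layout, and the example values are hypothetical and are not taken from (9) or (11).

```python
import cmath


def section_response(a, b, w1, w2):
    """Frequency response of one 2-D section; b[0][0] is ignored (assumed form)."""
    z1 = cmath.exp(-1j * w1)  # one delay along the first dimension
    z2 = cmath.exp(-1j * w2)  # one delay along the second dimension
    num, den = 0, 1
    for i in range(len(a)):
        for j in range(len(a[0])):
            term = z1 ** i * z2 ** j
            num += a[i][j] * term
            if (i, j) != (0, 0):
                den += b[i][j] * term
    return num / den


def cascade_response(sections, w1, w2, pipelined=False):
    """Product of section responses, optionally with one unit delay per stage."""
    h = 1
    for a, b in sections:
        h *= section_response(a, b, w1, w2)
        if pipelined:
            h *= cmath.exp(-1j * w2)  # inserted unit delay: unit magnitude, pure phase
    return h


# Two hypothetical sections, evaluated at one frequency point.
stages = [
    ([[1.0, 0.5], [0.5, 0.25]], [[0.0, 0.2], [0.2, 0.1]]),
    ([[1.0, -0.3], [0.2, 0.0]], [[0.0, -0.1], [0.1, 0.05]]),
]
plain = cascade_response(stages, 0.7, 1.1)
piped = cascade_response(stages, 0.7, 1.1, pipelined=True)
```

Each inserted delay multiplies the response by a unit-magnitude factor, so `abs(plain)` and `abs(piped)` agree and only the phase accumulates one delay per stage.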
III. ERROR ANALYSIS OF NEW DIGITAL FILTER ARCHITECTURES
In this section, we present an error analysis for finite-wordlength arithmetic in the proposed 2-D systolic IIR and FIR digital filters. Here, we adopt most of the notation defined in [12], [18], [19] to analyze quantization error. For convenience of error analysis, the 2-D local broadcast th order digital filter can be equivalently plotted as in Fig. 8. Let PE signify the position of PE in the th row and the th column of Fig. 8, where and . Observing the block diagram in Fig. 8, the two indices and are related as (12). If represents the true value of a variable, then its quantized value is represented by . Let and denote the coefficient quantization errors in the finite-wordlength representation of and , respectively. Also, the input and output quantization errors are denoted as and , respectively, when the input and the final output operate in finite-wordlength arithmetic. We derive the error expression at the output of PE as shown in Fig. 8 for . In other words, we consider PEs of type PE as shown in Fig. 7(a). Let denote the true output of PE and it can be represented as (13) where for (14a) for (14b) where denotes the modulus operation. Therefore, the quantized value of is given by (15) where is defined as the storage error caused by storing the output of an adder in a register. In [12], [19], it has been discussed that the rounding operation with a ( )-bit register leads to for fixed-point data arithmetic. Representing the error at the output of PE by and neglecting second-order terms involving and for , 1 and 2, we obtain
The ceiling operator denotes the smallest integer that is greater than or equal to and , , , and represent roundoff error, coefficient-quantization error, input-quantization error and storage error, respectively. Note that the integer variable does not affect quantization error; its main purpose is to provide several local broadcast architectures. The other two terms and are the same as (20e) and (20f), respectively. If we assume that and , for all the architectures under consideration, are all of a similar nature, then our architecture has lower storage error than the existing structures [10]-[12]. This is because the proposed architecture in Fig. 8 has fewer PEs, so that a lower sum of storage errors is obtained in (19) and (21). Consequently, the quantization error is reduced.
It is worth noting that the storage error depends on the type and size of the PEs. In this paper, the PEs are restricted to the type that is either PE or extended type for PE . Under this constraint, we define two useful notations 0 and 1, where 0 and 1 indicate that the delay elements as shown in Fig. 7(a) are at top-to-down and bottom-to-up signal paths, respectively. Thus, we can use a binary sequence to represent PEs of different sizes. For example, PE proposed in Section II can be represented as 101. When a larger PE is used to construct the 2-D digital filter, lower storage error is achieved; however, the critical period is sacrificed, as listed in Table I. With minimum critical period and low storage error in mind, we select the second-order PE denoted as 101 (i.e., PE ) for our design.
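The two ingredients of this analysis, rounding to a finite wordlength and neglecting second-order error terms, can be illustrated with a short numerical sketch. It assumes a register with `frac_bits` fractional bits, for which rounding to the nearest level gives an error of at most half a quantization step; the function names and the example values are hypothetical and do not reproduce (15)-(21).

```python
def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits (assumed fixed-point grid)."""
    step = 2.0 ** -frac_bits
    return round(value / step) * step


def first_order_product_error(a, x, da, dx):
    """Error of (a + da) * (x + dx) relative to a * x, dropping the da * dx term."""
    return a * dx + x * da


frac_bits = 8
coef, sample = 0.3, 0.7
d_coef = quantize(coef, frac_bits) - coef        # coefficient quantization error
d_sample = quantize(sample, frac_bits) - sample  # input quantization error

exact_error = (coef + d_coef) * (sample + d_sample) - coef * sample
approx_error = first_order_product_error(coef, sample, d_coef, d_sample)
neglected = exact_error - approx_error  # equals d_coef * d_sample up to float rounding
```

Both quantization errors are bounded by `2.0 ** -(frac_bits + 1)`, so the neglected product term is of the order of the square of that bound, which is the usual justification for dropping second-order terms.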
In similar fashion can be evaluated and then can be obtained utilizing ( ) instead of . Finally, (23) can be solved recursively. Note that , and .
IV. COMPARISON RESULTS OF IIR AND FIR DIGITAL FILTERS
In this section, we compare our architecture with the existing architectures [10]-[12]. Comparison results for 2-D IIR digital filters are tabulated in Table II in terms of storage error, critical period, latency, the number of multipliers and delay elements and, importantly, whether the input and output signals are locally broadcast in these structures. In Table II, the proposed hybrid of two schemes completely eliminates the th and th dimensional global broadcast paths. Besides, if it is assumed that , and in Table II are all of a similar nature, then the latter scheme (i.e., the new systolic transformation) leads to lower storage error than that in [10]-[12]. Let and represent the operation time required for one multiplication and one addition, respectively. To minimize the periods of the critical paths shown in Figs. 1, 2, 3, and 5, we apply the tree method to those structures and then evaluate the periods separately. As a result, we find that this work has higher throughput than Sid-Ahmed's [10] and Shanbhag's [12] structures but requires one more addition than S-G-A's structure applying SCH3 [11]. In general, since is much larger than , the number of delay elements in Table II is dominated by the product-term . Hence, the number of delay elements in this work for small value is almost equal to that of [11] and [12] and less than that of [10]. In the same way, comparison results for 2-D FIR digital filters are listed in Table III. From Tables II and III, it turns out that the proposed architecture has local broadcast and lower storage error and maintains zero latency under an acceptable critical period.
V. CONCLUSION
A new systolic architecture for the implementation of 2-D IIR and FIR digital filters has been proposed, based on a modified reordering scheme and a new systolic transformation. Applying the hybrid of the two schemes achieves better performance, namely local broadcast, lower quantization error, zero latency, and a satisfactory critical period, without sacrificing other hardware characteristics. In addition, we have extended the new architecture to a 2-D systolic cascade-form digital filter and provided a quantization error analysis of the proposed architectures.
