Abstract-In this paper, a new digital filter structure is developed for the implementation of two-dimensional (2-D) recursive filters for real-time image processing. The proposed structure has a short clock cycle time, and hence a high data throughput rate, independent of the order of the filter. Parallelism and pipelining are the two features of the proposed filter structure that contribute to its high-speed performance. The filter can be implemented without multipliers. Using standard integrated circuits and memories, the new filter is capable of filtering images of size up to 512 × 512 pixels with a TV scan rate of 30 frames/s in real time. The effects of finite precision arithmetic have been considered. Scaling and overflow problems are studied to give insight into the choice of a proper scaling factor, so that an adequate signal-to-noise ratio at the filter output can be obtained.
I. INTRODUCTION
IN THE PAST FEW years, real-time image processing using two-dimensional (2-D) digital filters has become a rapidly growing field in the industrial and biomedical environments, as the need for the fast processing of large amounts of data became evident. The term "real-time image processing" can be defined as "the processing of images at a speed such that the data rate of the processed images is the same as that of the input images." If one considers an image of size M × N pixels and a TV scan rate of L frames/s, and if R arithmetic operations are required for each output pixel, the total number of arithmetic operations that have to be performed in one second is M × N × L × R [1].
We now summarize a number of filter structures that are already known to operate at high speed. Peled and Liu [2] described an implementation of digital filters by distributed arithmetic. This method has proved its advantage with respect to speed, cost, and power dissipation by storing all possible binary sums of the filter coefficients in a programmable read only memory (PROM). Distributed arithmetic implementations of 2-D digital filters can be found in [3] and [4]. Where the variability of the filter coefficients is important, the Canonical Signed Digit (CSD) representation of the filter coefficients [5] and the use of stored square ROM [6] have been suggested. Residue Number System Arithmetic results in a highly parallel hardware design with characteristically high computational speed [7]. More recently, a new memory-oriented implementation of 2-D digital filters with each coefficient expressed by an algebraic sum of power-of-2 terms has been presented [8]. The stored product digital filter architecture as formulated in [9] and [10] presents another alternative for the elimination of the multipliers through the use of ROM's. A general configuration for 1-D recursive digital filters [11] has shown that high-speed (10 MHz or higher) word throughput rates for parallel operations in two's complement fixed point arithmetic are feasible with reasonable memory size and standard logic devices. The high operating speed of the filter is due to its parallel-pipelined structure. This approach is different from the other implementation schemes, in the sense that "a minimum number of arithmetic operations are required in one clock cycle." If we consider an input data rate of S samples/s, the quantity 1/S is the duration of one clock cycle. The main advantage of this method is that the data throughput rate is independent of the order of the filter.
Manuscript received July 31, 1985; revised February 28, 1986. This work was supported in part by an NSERC grant.
The authors are with the Department of Electrical Engineering, University of Toronto, Toronto, Ontario, Canada M5S 1A4.
IEEE Log Number 8609851.
The purpose of this paper is to present a 2-D recursive digital filter structure (also valid for nonrecursive digital filters) that can operate at a very high speed. Section II describes the detailed development of the new 2-D digital filter structure that has a short critical path, from both the theoretical and practical points of view. Section III contains the hardware description of a second-order 2-D recursive filter for filtering images of size up to 512 × 512 pixels with a TV scan rate of 30 frames/s in real time. In Section IV, an error analysis of the new 2-D filter structure is presented. Scaling, the overflow problem, and experimental results with real images are covered in Section V. Finally, Section VI presents a summary of the contributions made in this research.
II. HIGH SPEED TWO-DIMENSIONAL RECURSIVE DIGITAL FILTER
A 2-D causal recursive digital filter is described by the linear difference equation

$$y_{m,n} = \sum_{i=0}^{M_A} \sum_{j=0}^{N_A} a_{i,j}\, x_{m-i,n-j} - \sum_{\substack{k=0 \\ k+l \neq 0}}^{M_B} \sum_{l=0}^{N_B} b_{k,l}\, y_{m-k,n-l} \tag{1}$$

where x and y are the input and output image arrays, respectively, and $a_{i,j}$ and $b_{k,l}$ are the filter coefficients of the nonrecursive and recursive blocks, respectively. $M_A$ and $N_A$ describe the size of the input mask, whereas $M_B$ and $N_B$ describe the size of the output mask.
0098-4094/86/1000-0948$01.00 © 1986 IEEE
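As a minimal sketch of what (1) computes, the following Python function evaluates the difference equation directly in raster-scan order. The function name `recursive_filter_2d`, the list-of-lists image layout, the zero boundary conditions, and the example coefficients are assumptions made for illustration; they are not taken from the paper, and the loop nest says nothing about the hardware structure developed in this section.

```python
def recursive_filter_2d(x, a, b):
    """Evaluate the difference equation (1) on a whole image.

    x : input image as a list of rows
    a : nonrecursive coefficients a[i][j], size (MA+1) x (NA+1)
    b : recursive coefficients b[k][l], size (MB+1) x (NB+1); b[0][0] is ignored
    Samples outside the image are taken as zero (an assumption of this sketch).
    """
    rows, cols = len(x), len(x[0])
    y = [[0.0] * cols for _ in range(rows)]
    for m in range(rows):
        for n in range(cols):
            acc = 0.0
            # Nonrecursive block: weighted sum of input samples.
            for i, a_row in enumerate(a):
                for j, a_ij in enumerate(a_row):
                    if m - i >= 0 and n - j >= 0:
                        acc += a_ij * x[m - i][n - j]
            # Recursive block: weighted sum of previously computed outputs.
            for k, b_row in enumerate(b):
                for l, b_kl in enumerate(b_row):
                    if (k, l) != (0, 0) and m - k >= 0 and n - l >= 0:
                        acc -= b_kl * y[m - k][n - l]
            y[m][n] = acc
    return y


# Impulse response of a first-order example with made-up coefficients.
impulse = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
a = [[1.0]]
b = [[1.0, -0.5], [-0.25, 0.0]]
h = recursive_filter_2d(impulse, a, b)
```

Each output sample depends on outputs that precede it in the raster scan, which is why the recursive block sets the length of the feedback path.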
Some of the 2-D recursive digital filters described by (1) which claim the capability of filtering images in real time have a structure similar to that shown in Fig. 1. All signals are represented in fixed point two's complement code. The input samples and the output signals, assumed to be bounded by ±1, have B bits of accuracy including the sign bit. All other intermediate results have t bits of accuracy with t > B. Thus, the first block of the cascaded structure has B input bits and t output bits, while the second block has t input bits and B output bits. There are two ways of cascading these two blocks. The criterion is to design a filter structure with a short critical path, which is defined as the longest among all possible paths from the output of one delay element to the input of the next one. Thus, the maximum operating speed of any filter is determined by the length of its critical path. The critical path contains one multiplication and three additions when the recursive block is followed by the nonrecursive block. This path is shown in Fig. 2b in bold lines. When the two blocks are interchanged, the critical path contains two multiplications and three additions. It is possible to further reduce the number of arithmetic operations in the critical path, as in the 1-D case [11]. The minimum number of arithmetic operations required in the critical path of a 2-D recursive digital filter must be determined first.
For 1-D digital filters, a configuration for which the critical path contains no more than one multiplication and one addition has been derived [11]. For 2-D digital filters, the critical path should contain only one more addition than that of the 1-D filter. The extra addition is required for adding the intermediate results from the $z_1^{-1}$ and $z_2^{-1}$ blocks. In order to obtain the desired critical path, the original 2-D filter transfer function $H(z_1, z_2)$ should be modified so that the new transfer function $\hat{H}(z_1, z_2)$ has the form

$$\hat{H}(z_1, z_2) = H(z_1, z_2)\, z_1^{-p} z_2^{-q}$$

with p and q being nonnegative integers. The use of latency makes it possible to complete all the arithmetic operations required for an output in more than one clock cycle, thus resulting in a shorter clock cycle. It is desirable to have the minimum possible latency, which is defined as the time interval separating the appearance of an input sample at the input port from the appearance of the corresponding output at the output port. This can be achieved by setting p = 1 and q = 0. Thus, the output of the new filter, $\hat{y}_{m,n}$, is delayed by only one pixel compared with the original filter output $y_{m,n}$ (i.e., $\hat{y}_{m,n} = y_{m-1,n}$). With the chosen values for p and q, the new equation describing the input and output relationship in the 2-D z-domain is

$$\hat{Y}(z_1, z_2) = z_1^{-1} X(z_1, z_2) \sum_{i=0}^{M_A} \sum_{j=0}^{N_A} a_{i,j}\, z_1^{-i} z_2^{-j} - \hat{Y}(z_1, z_2) \sum_{\substack{k=0 \\ k+l \neq 0}}^{M_B} \sum_{l=0}^{N_B} b_{k,l}\, z_1^{-k} z_2^{-l}.$$
We now propose the new filter structure for the modified 2-D filter transfer function. Let

$$A = X(z_1, z_2) \sum_{\substack{i=0 \\ i+j \neq 0}}^{M_A} \sum_{j=0}^{N_A} a_{i,j}\, z_1^{-i} z_2^{-j}. \tag{7}$$
The relationships among all these signals of a second-order filter are shown in Fig. 3 . It can easily be proven that Y(Z,, Z,) = I. With the assumption that a multiplication takes at least twice the amount of time required for an addition, the critical path is the one shown in bold lines and contains only one multiplication and two additions. Moreover, this critical path is independent of the order of the filter. The new filter has a very regular structure, with identical building blocks. This regularity property provides a simple hardware structure for the implementation of the filter. The new filter also has a small hardware size since the input to the multiplier block has only B bits.
III. HARDWARE FOR REAL-TIME IMAGE PROCESSING
Consider the processing of an image of size 512 × 512 pixels with a TV scan rate of 30 frames/s in real time. The required data throughput rate S is 512 × 512 × 30 = 7.86 × 10^6 pixels/s, or one pixel every 127 ns. For a reasonable gray level resolution, the input and the final output signals are represented in B = 8 bits. All intermediate results have t = 16 bits of accuracy. The hardware of the new second-order filter with the above specifications will be outlined in this section.
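The throughput figure above is simple arithmetic, and the following short Python sketch reproduces it. The function name `pixel_period_ns` is hypothetical and is used here only for illustration.

```python
def pixel_period_ns(rows, cols, frames_per_second):
    """Return (pixels per second, time available per pixel in ns)."""
    rate = rows * cols * frames_per_second
    return rate, 1e9 / rate


rate, period = pixel_period_ns(512, 512, 30)
print(f"{rate} pixels/s, {period:.1f} ns per pixel")
```

The per-pixel period is the budget that the cycle time of the filter must not exceed.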
The new parallel-pipelined structure of a 2-D recursive digital filter consists mainly of three building blocks: 1) delay units, 2) multipliers, and 3) adders/subtracters. The input to the multipliers of the nonrecursive block is the serial sampled video data resulting from the raster scan of an image of size 512 × 512 pixels, whereas the input to the multipliers of the recursive block is the most recently computed output. High-speed multipliers are very expensive and are not economical for the implementation of fast filters. One method of replacing the multipliers, namely the "stored product" method [11], is considered in the hardware implementation. The $z_1^{-1}$ delay elements can be configured from SN 74S174 devices (hex D-type flip-flops), which have a maximum propagation delay of 17 ns and a minimum setup time of 5 ns [17]. The $z_2^{-1}$ delay elements, which consist of 16 parallel 512-bit shift registers, can be configured from TDC 10065 (1 × 256) devices having a maximum propagation delay of 30 ns and a minimum setup time of 0 ns. Two TDC 10065 chips are required for the implementation of one 512-bit shift register. The numbers of SN 74S174 and TDC 10065 chips required are 32 and 128, respectively. Since the sum of the propagation delay and the setup time of the SN 74S174 is shorter than that of the TDC 10065, it is desirable to drive the two different shift registers with two different clock signals, one being a delayed version of the other, so that the outputs of the two shift registers will be available at almost the same instant. From the specifications of the shift registers, the clock signal driving the single-bit shift register should be the one driving the 512-bit shift register delayed by 5 to 13 ns.
Each 16-bit adder is constructed from four 74LS181 4-bit ALU's and one 74LS182 carry look-ahead generator. Addition of two 16-bit binary numbers takes 19 ns [18]. The 16 adders, required for the 16 additions in a second-order 2-D difference equation, take 16 × 5 = 80 IC chips.
Using the stored product method, the multipliers are replaced by memories. The memories, if constructed from AM 27S20 (256 × 4) PROM's, which have an access time of 45 ns [18], would require 36 packages for the nonrecursive block and 32 packages for the recursive block.
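The stored product idea can be sketched in a few lines of Python: for a fixed coefficient, the products with all 256 possible 8-bit inputs are computed once and stored, so that each multiplication becomes a single memory read addressed by the input bit pattern. The function names, the fractional fixed-point scaling (7 fractional bits in, 15 fractional bits out), and the restriction to coefficients of magnitude less than 1 are assumptions of this sketch; the paper's exact memory organization is not shown in this section.

```python
B, T = 8, 16  # input and intermediate word lengths quoted in the text


def build_product_table(coefficient):
    """Precompute coefficient * x for every B-bit two's complement input x.

    The table is addressed by the raw input bit pattern (0..255), as a PROM
    would be. Entries are integers holding the product with T-1 fractional
    bits; this scaling and |coefficient| < 1 are assumptions of the sketch.
    """
    assert abs(coefficient) < 1
    table = []
    for address in range(1 << B):
        # Interpret the address bits as a signed B-bit integer.
        x_int = address - (1 << B) if address >= (1 << (B - 1)) else address
        # x = x_int / 2**(B-1); the stored product is scaled by 2**(T-1).
        table.append(round(coefficient * x_int * (1 << (T - B))))
    return table


def lookup_product(table, x_int):
    """Replace the multiplication by one memory read."""
    return table[x_int & ((1 << B) - 1)]


table = build_product_table(0.375)
p = lookup_product(table, -64)   # input x = -64/128 = -0.5
print(p / (1 << (T - 1)))        # -0.1875
```

Under these assumptions each coefficient needs one 256-word by 16-bit table, that is, four 256 × 4 devices. For the 9 nonrecursive and 8 recursive coefficients of a second-order filter this gives 36 and 32 packages, which agrees with the counts quoted above.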
The cycle time of the new filter, built with the above components, consists of one memory access time, two addition times, and the sum of the propagation delay and setup time of the 512-bit shift register. The new filter can process images at a data throughput rate of one pixel every 113 ns (i.e., 45 + 19 × 2 + 30 ns), which is less than the maximum allowable time (127 ns) for real-time processing.
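The timing budget can be checked with the component figures quoted in this section. The constant names below are hypothetical; the values are the ones given in the text.

```python
# Component timings quoted in the text, in nanoseconds.
MEMORY_ACCESS_NS = 45        # PROM access time
ADD_NS = 19                  # one 16-bit addition
SHIFT_REGISTER_NS = 30 + 0   # propagation delay + setup time, 512-bit register

cycle_ns = MEMORY_ACCESS_NS + 2 * ADD_NS + SHIFT_REGISTER_NS
budget_ns = 1e9 / (512 * 512 * 30)   # time available per pixel

print(cycle_ns, round(budget_ns, 1), cycle_ns <= budget_ns)
```

None of the terms depends on the filter order, which is the sense in which the throughput rate is order independent.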
In addition to the cycle time, another measure of the filter performance is the latency. The latency of the proposed filter structure is the sum of the following (see Fig. 3):
i) one cycle time (127 ns for the processing of an image of size 512 × 512 pixels with a TV scan rate of 30 frames/s);
ii) the propagation delay of the 512-bit shift registers (30 ns); and
iii) one addition time (19 ns).
This latency of 176 ns is independent of the order of the filter, which is another attractive feature of the new filter. A summary of the hardware and throughput rate of the new filter structure is shown in Tables I and II.
IV. ERROR ANALYSIS
The effects of finite precision are considered in the new 2-D recursive digital filter. Errors are introduced in quantizing the input and in the roundoff accumulation of the intermediate results. In this section, an analysis of the steady-state statistics of such errors is presented. The analysis is based on the recursive implementation, but the results for the nonrecursive implementation can be obtained as a special case. Two's complement fixed point arithmetic is used, and the distinction between rounding and truncation is made. To simplify the analysis, some assumptions about the statistical properties of the quantization errors are made.
i) The sequence of error samples is a sample sequence of a stationary random process.
ii) The quantization process is white. The random variables representing the error process are uncorrelated, independent of the sampling rate.
iii) The error sequence is uncorrelated with the sequence of exact samples.
iv) The quantization error has a uniform density function. This implies that the signal is equally likely to be anywhere within a quantization interval.
v) Overflow does not occur at the output of the filter.
Quantization errors are caused by either truncation or rounding, each mode resulting in a different error effect. The filter output error is independent of the factor $z_1^{-1}$ associated with the modified filter transfer function and, for simplicity, we will use the original difference equation (1) describing the 2-D recursive digital filter in the analysis. Thus, a fixed amount of distortion is always present in the output of the filter no matter what the scaling factor p is.
Theoretical and simulation results of the error at the output of the stored product 2-D recursive digital filters were obtained. Double precision arithmetic was used for the simulation of the ideal (infinite precision) filter. The main advantage of using computer simulation to compute the statistics of the quantization error is the possibility of studying the rounding effects, the truncation effects, and the scaling effects separately, with few changes in the filtering algorithm. Two filters were used for simulation, with p being the scaling factor. The specifications of the two filters are given in Tables III and IV. The main sources of the output error are the input quantization error and the roundoff accumulation error, especially the quantization of the 16-bit results to 8 bits in the final output and the feedback.
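Under assumption iv), a single quantization with step q has the standard error statistics: rounding gives a mean of 0 and a variance of q²/12, whereas two's complement truncation gives a mean of -q/2 and the same variance. The following Python sketch checks these values empirically for B = 8 bits. The function name, the uniformly distributed test signal, and the sample count are assumptions of this illustration; it does not reproduce the paper's expressions for the noise at the filter output.

```python
import math
import random


def quantization_error_stats(bits, mode, samples=200_000, seed=0):
    """Empirical mean and variance of the error e = Q(v) - v.

    Values v are drawn uniformly from [-1, 1] and quantized to a grid with
    step q = 2**-(bits-1), the step of a `bits`-bit two's complement fraction.
    mode is "round" or "truncate" (truncation rounds toward minus infinity).
    """
    q = 2.0 ** -(bits - 1)
    rng = random.Random(seed)
    total = total_sq = 0.0
    for _ in range(samples):
        v = rng.uniform(-1.0, 1.0)
        if mode == "round":
            quantized = math.floor(v / q + 0.5) * q
        else:
            quantized = math.floor(v / q) * q
        e = quantized - v
        total += e
        total_sq += e * e
    mean = total / samples
    variance = total_sq / samples - mean * mean
    return mean, variance, q


for mode in ("round", "truncate"):
    mean, var, q = quantization_error_stats(8, mode)
    # Mean in units of q, variance in units of q**2 / 12.
    print(mode, round(mean / q, 3), round(var / (q * q / 12), 3))
```

The nonzero mean under truncation is the reason the two quantization modes are treated separately in the analysis.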
V. SCALING OF THE TRANSFER FUNCTION
If the amplitude of the output signal of a recursive digital filter in a fixed point implementation is allowed to exceed the dynamic range, overflow will occur and the output signal will be severely distorted. This is because the output error due to overflow is fed back into the recursive filter. On the other hand, if the output signal amplitude is unduly low, the filter is operating inefficiently, and the signal-to-noise ratio will be poor. Therefore, for optimum filter performance, suitable scaling must be employed to adjust the output signal levels.
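A 2-D recursive difference equation can always be written as

$$y_{m,n} = p \sum_{i=0}^{M_A} \sum_{j=0}^{N_A} a_{i,j}\, x_{m-i,n-j} - \sum_{\substack{k=0 \\ k+l \neq 0}}^{M_B} \sum_{l=0}^{N_B} b_{k,l}\, y_{m-k,n-l}$$

where p is a positive scaling factor. Evidently, the magnitude of the output is bounded by

$$|y_{m,n}| \le p \sum_{k=0}^{\infty} \sum_{l=0}^{\infty} |h_{k,l}|$$

for input samples bounded by ±1, where $h_{k,l}$ is the impulse response of the unscaled filter.
To ensure absolutely no overflow in the output, i.e., $|y_{m,n}| \le 1$, the scaling factor p must satisfy the condition

$$p \sum_{k=0}^{\infty} \sum_{l=0}^{\infty} |h_{k,l}| \le 1.$$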
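As a rough illustration of this condition, the following Python sketch estimates the largest overflow-free scaling factor by summing the magnitudes of a truncated impulse response. The function names, the grid size, and the first-order example coefficients are assumptions of this sketch. Because the infinite sum is truncated, the estimate of p is slightly optimistic for a slowly decaying impulse response.

```python
def impulse_response(a, b, size):
    """First size x size samples of the impulse response of the filter in (1)."""
    h = [[0.0] * size for _ in range(size)]
    for m in range(size):
        for n in range(size):
            # For a unit impulse input, the nonrecursive part reduces to a[m][n].
            acc = a[m][n] if m < len(a) and n < len(a[0]) else 0.0
            for k, b_row in enumerate(b):
                for l, b_kl in enumerate(b_row):
                    if (k, l) != (0, 0) and m >= k and n >= l:
                        acc -= b_kl * h[m - k][n - l]
            h[m][n] = acc
    return h


def no_overflow_scaling_factor(a, b, size=64):
    """Approximate p = 1 / sum |h[k][l]| using a truncated impulse response."""
    h = impulse_response(a, b, size)
    l1_norm = sum(abs(value) for row in h for value in row)
    return 1.0 / l1_norm


# Made-up first-order example whose impulse response magnitudes sum to 4.
a = [[1.0]]
b = [[1.0, -0.5], [-0.25, 0.0]]
p = no_overflow_scaling_factor(a, b)
print(round(p, 6))   # 0.25
```

This bound is conservative because it guards against the worst-case input, which real images rarely approach.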
Overflows that occur in the intermediate results need not be a concern. This is because, if $y_{m,n}$ itself has no overflow, it is always evaluated correctly in two's complement arithmetic, even if overflows do occur in the partial sums.
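This property follows from the modular nature of two's complement addition, and it can be demonstrated with a short Python sketch that emulates a 16-bit accumulator. The helper names and the example values are assumptions chosen for illustration.

```python
def wrap(value, bits=16):
    """Reduce an integer to the range of a `bits`-bit two's complement word."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >= (1 << (bits - 1)) else value


def wrapped_sum(terms, bits=16):
    """Accumulate terms, letting every partial sum wrap around on overflow."""
    acc = 0
    for term in terms:
        acc = wrap(acc + term, bits)
    return acc


terms = [30000, 20000, -25000, -15000]   # true sum is 10000, within 16 bits

partials = []
acc = 0
for term in terms:
    acc = wrap(acc + term)
    partials.append(acc)

print(partials)            # [30000, -15536, 25000, 10000]
print(wrapped_sum(terms))  # 10000
```

The second partial sum overflows and wraps to a negative value, yet the final result is correct because the true total fits in the word.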
In this section, we present some experimental results of filtering real images using two different filters scaled by different scaling factors. The main objective of these filtering experiments is to estimate an optimum scaling factor for a given 2-D recursive digital filter used for image processing.
The filtering process is simulated on a VAX 11/780 computer. Only the coefficients of the nonrecursive block are scaled by p. The scaling of an input might cause an overflow at the input port of the filter. The first filter used for image processing is filter #2 (which is one of the filters used in Section IV for simulation), and the second one is filter #3, drawn from [21] (with specifications shown in Table V). Filter #2 is a high-emphasis filter, whereas filter #3 is a high-pass filter. Input images are of size 256 × 256 pixels with a gray level resolution of 256 levels. The pixel values range from 0 to 255. In order that the input image can [...] factors.
Fig. 7. Original "Yogourt" image.
Fig. 8. "Yogourt" processed by filter #2 with scaling factor being 2.5.
Fig. 9. "Yogourt" processed by filter #2 with scaling factor being 3.
Fig. 10. "Yogourt" processed by filter #3 with scaling factor being 1.5.
When the scaling factor is too low, the output dynamic range cannot be fully utilized. However, when the scaling factor is too high, overflows occur, as is evident from Figs. 9 and 11. Although the scaling factor from (26) does not result in any overflows, both the signal-to-noise ratio and the visual effect of the processed image are not very good. In order that the full output dynamic range can be utilized, a suitable scaling factor should be used for the scaling of the filter transfer function. There are no general rules for choosing the new scaling factor, as this factor is image and filter dependent. Nevertheless, experiments with the filtering of these real images have shown that the scaling factor should be about 3 to 4 times the value given by (26) for the two filters mentioned, in order that both the subjective (visual) and objective (signal-to-noise ratio) measurements of performance can be improved. The signal-to-noise ratio here is defined as the ratio of the variance of the ideal filter output signal to the variance of the output error. The objective measurements of the filtering of the image in Fig. 7 using different scaling factors are shown in Tables VI and VII. From the tables, we can conclude that whenever overflows have occurred, the signal-to-noise ratio is no longer a good indication of the quality of the processed image. The visual effect is still very good even though there are some overflows. Most of the overflows occur when there is a uniform background.
VI. CONCLUSIONS
This paper has considered a new 2-D filter structure which results in a very high operating speed for a 2-D recursive digital filter. The throughput rate is independent of the filter order. The modification made is minimal to the extent that the output sequence of the filter is only delayed by one pixel when compared with the original one. By storing the products in programmable read only memories (PROM's), multipliers can be eliminated completely. This filter configuration provides an economical way of implementing a filter that does not require varying filter coefficients. A high data throughput rate (i.e., 8 MHz) has been shown to be feasible. A simple hardware implementation requiring standard TTL and MOS devices and two clock signals, one being the delayed version of the other, to drive the two different shift registers has been outlined. The proposed hardware is characterized by a regular structure, which consists of identical building blocks.
The noise properties of the new filter implemented by the stored product method have been studied. Expressions for estimating the mean and the variance of the noise at the filter output have been derived. Simulation results have been obtained and they agreed very well with the theoretical results.
Finally, the problem of the scaling of the filter transfer function was investigated. Results from the processing of images using a high-pass filter and a high-emphasis filter have shown that a better visual effect of the processed images can be obtained if the scaling factor is three to four times the value given by

$$p = \frac{1}{\sum_{k=0}^{\infty} \sum_{l=0}^{\infty} |h_{k,l}|}.$$
