Abstract-This paper describes the implementation of two-dimensional recursive digital filters for real-time image processing. The filter structure is described by distributed arithmetic and the operations involved are memory fetches and additions. An error analysis of the twodimensional filter structure described by distributed arithmetic is outlined and the expression for the mean-squared error is derived. This approach is compared to an equivalent conventional filter structure using multipliers. It is demonstrated that the distributed arithmetic implementation offers a significant reduction in cost as well as power consumption for real-time image processing. In addition, the hardware design of a second-order two-dimensional distributed filter is presented and its performance is examined. This filter is constructed using standard integrated circuits and memories and is capable of filtering images of sizes up to 256 X 256 pixels in real time.
I. INTRODUCTION
ECENTLY there has been increasing interest in the use of two-dimensional (2-D) filters for processing sampled 2-D data. For example, the geophysics industry used 2-D filters for processing seismic records, gravity, and magnetic data. 2-D filters have found applications in many areas including those of image enhancement, restoration of linearly degraded images and biomedical picture processing.
There are three common methods of 2-D filter implementation: direct convolution, frequency domain techniques through the use of the fast Fourier transform (FFT), and realization through recursive implementation. The recursive technique is the most efficient in terms of memory requirements and speed of implementation for a large convolution area [l] . Recursive filters have also been designed to satisfy both magnitude and phase characteristics. These filters were realized as basic second-order sections or as a cascade of second-order sections [2] .
Peled and Liu [3] outlined an approach to the implementation of one-dimensional recursive digital filter structures described by distributed arithmetic. This realization requires storing the finite number of possible outcomes of an arithmetic operation, as well as using them to obtain the next output sample through repeated additions and Manuscript received March 22, 1982 ; revised May 8, 1985 . The authors are with the Department of Electrical Engineering, University of Toronto, Toronto, Ont., Canada M5S 1A4.
shifting operations. Taylor [4] has shown that high-order high-speed FIR filters can be designed using the distributed arithmetic structure in the multiplicative form.
In this paper, an extension of the distributed arithmetic approach to the realization of 2-D recursive filters is investigated. Specifically, an implementation scheme for a second-order section is presented. Subsequent sections include a detailed description of the different aspects of distributed arithmetic implementation of 2-D filters. They are as follows. Section I1 introduces the theory of distributed arithmetic as applied to 2-D filters. An implementation is presented for a second-order section. Section I11 examines the effects of using finite word-length arithmetic and the mean-squared error at the output of the filter described by distributed arithmetic is derived. In Section IV, hardware considerations for real-time digital image processing are examined and comparisons are made between the new and the conventional multiplier implementation. In addition, the tradeoffs between data throughput rate and cost are investigated. Finally in Section V, a hardware implementation of the distributed arithmetic 2-D IIR filter is presented. The sequence of operations is outlined for filtering a general A4 X N image and the performance of the hardware is examined.
DISTRIBUTED ARITHMETIC IMPLEMENTATION
More recently, FIR filters became very popular primarily because of their ease in design and implementation. This, however, has not been the case for recursive filters. Their .use has been limited [5] due to the difficulties encountered at earlier attempts in designing stable filters. Even more recently, new techniques [ 11, [6] of R, and Rh defined as
indicates that the impulse response of the IIR filter is spread over only the upper quadrant of the right half plane in the spatial domain. This is referred to as a quarter plane impulse response. The filter is said to be causal [7] since its impulse response h m , n is zero for M or N less than zero. For a specific case, the sets R, and Rh, such that K1 = L1 = I1 = J1 = K , the filter is referred to as of order K. In this paper, filters of order K = 2 are considered. Those higher order filters designed as a cascade of second-order sections can be realized in a similar fashion.] Fig. l(a) shows an example of the filtering with a second-order section. In this particular case, the input mask ak,/ is superimposed on the input image and the addition ' In general, a 2-D filter cannot be realized as a cascade; the cascade form is a special case.
of the product between the mask and the underlying samples gives a 2-D convolution. The output mask b,,; is similar to the input except for the missing element in correspondence to the output sample to be computed. A direct realization of (1) is shown in Fig. l(b) . The z I and z2 terms correspond to row and column delays, respectively. For each output sample, 17 multiplications, 15 additions, and one subtraction are required.
The general difference equation of (1) can be rewritten as
for a second-order filter. Assuming all signals to be bounded by + 1 and defining the input signal and output signal in two's complement code, B bits of accuracy including the sign bit, and F2 ( -) as
F: can be expressed in a similar fashion.
The distributed arithmetic realization of (8) consists mainly of four building blocks: 1) mask bit shifters, 2) memories, 3) summers, and 4) subtractor. A schematic block diagram of the implementation is shown in Fig. 2 .
The functions F{ ( e ) and F; ( * ) (8) have a finite number of possible outcomes z9 and Z8, respectively. The arguments of the functions Ff ( e ) and F$ ( e ) are generated with the X and Y mask bit shifters shown in Fig. 3(a) and (b), respectively. The number of each X and Y mask bit shifters is equal to the word length B in bits. For a secondorder section, the amount of storage required is the previous two columns and two rows of both the input and output arrays. Fig. 4 (a) and (b) illustrates the storage allocation in the X mask bit-shifters, and in the Y mask bitshifters, respectively, for the computation of the (m, n)th 
2-i Fi
. . .
The rounded values of F,( e ) and F2(.), due to finite word length of the memories, are defined as
. . . The rounded output can be represented as put of the filter defined as E; is the quantization error when the result is trunLet gm,n be the accumulated truncation error at the out-
The variance of gm,n can be calculated as in [9] m m Since all errors are uncorrelated, (14) becomes
E -
Then the output error has a zero mean and a variance (15) Equation (15) can also be used to determine B and t for a specified signal-to-noise ratio.
Example: For the high emphasis filter described in [ l ] IV. REAL-TIME IMAGE PROCESSING The need for real-time digital processing became evident with the increasing utilization of television imaging to the industrial, medical, and military environments. The term "real-time image processing" can be defined as "the processing of images at a speed such that the data rate of the processed images is the same as the input one. " If we consider an image size of M X N pixels and a TV scan rate of L images per second, the input data rate,
In order to have the same output data rate, all the operations that are required to produce an output pixels have to be done within 1/R s. Hence, the quantity 1/R can be viewed as the period of the clock that controls the image processing system. With a display size of 256 X 256 pixels and a TV scan rate of 30 images per second, the data throughput rate R = 2562 X 30 = 1.966 X lo6 pixelsls or one pixel every 508.6 ns. Real-time digital image processing requires a large amount of number crunching to be performed at high speed. Conventional computer architectures which operate in a sequential manner are I/O bound when performing such tasks. The resulting computational problems are attacked by combination of fast processors and advanced system architectures. Even with these systems, most of the image processing algorithms can only be done near real-time. The real-time processing algorithms are limited to pixel manipulations and include such operations as wrapping, rotation, zooming, and small image convolutions. This section demonstrates that distributed arithmetic implementation of 2-D recursive second-order filters can perform real-time spatial filtering of image 256 x 256 pixels.
Consider the implementation shown in Fig. 2 for an image of 256 X 256 pixels. The coefficients are quantized to 16 bits. All computations are done with t = 16 bits of tion with serial additions. In this case, the data rate is li [(9 x 19) + 50 + 301 = 4 MHz.
The memories in Fig. 2 can be replaced by a number of multipliers operating in parallel. Fig. 6 shows the block schematic of a multiplier implementation of a 2-D recursive filter. A minimum of six multipliers are needed, three each for the input and output masks, respectively. Using LSI (large scale integration) 16 bitimultiplierlaccumulator TRW TDClOlOJ having a speed of 115 ns and the same adders and shift registers as before, the cycle time will be 115 x 3 + (19 x 6) + 19 + 280 + 100 = 858 ns. Hence power consumption is 42 W, and the cost is approximately $2200(U.S.). Images larger than 512 X 512 pixels require the use of esoteric logic types. With such logic families, the penalty for state-of-the-art performance is increased cost. The above discussions are summarized in the table of Fig. 7 for real-time and nonreal-time application^.^
V. HARDWARE DESCRIPTION
A prototype of the distributed arithmetic implementation of a 2-D recursive filter was built using mainly TTL real-time filtering cannot be achieved via the multiplier 'Some of the preliminary work on the of a 2.D implementation. The package Count in this case is 67 IC's, filter through distributed arithmetic can be found in 1141 and [15] . components. In this prototype, serial additions rather than parallel conditions were used in order to reduce the hardware size. The prototype assumes that the input and output samples are 8 bits in length, while all the intermediate computations are done in 16 bit precision. The X and Y mask bit shifters are constructed using TTL 74s 174 hex D-type flip-flops and MOS AM 2856 dual (256 X 1) static shift registers. The memories are configurated from 2716 (2K X 8) EPROM's (erasable programmable read-only memory). The use of EPROM's yields a more flexible filter at the expense of a slower throughput rate. With serial addition, an accumulator type circuit is required. This was constructed using four 74LS1814 bit ALU's, one 74LS182 carry look-ahead generator, and two TTL octal latches. The subtractor has the same configuration. A schematic block diagram of this approach is shown in Fig. 5 . Also in this hardware, the bit shifts 2-$ associated with each function F'( .) were preprogrammed into the memories, which implies that the memories are not identical.
The operational sequence for this implementation is now outlined.
Assume that input image to be filtered is of size M X N and that we are calculating the first output sample for the rnth row. For simplicity assume all initial conditions are zero. This means that registers SR1-SR8 of all the input and output mask bit-shifters are cleared and the subtractor output is zeroed. With the aid of Fig. 5 , the steps involved in the computation of ym,o are as follows.
Step 1: Each input bit x :~,~ is shifted into the corresponding ith mask bit-shifter, and simultaneously, the subtractor's output bits are shifted in a similar fashion into the Y mask bit-shifters.
Step 2: The binary vector output of each of the X mask bit-shifters address one of the B XMEM's. Similarly, the YMEM's are addressed by the Y mask bit-shifters outputs.
Step 3: The fetched memory contents of XMEM's are added in the X accumulator with a sign change for the contents of the XMEM representing the function Fy( e ) .
Similarly, the fetched memory contents of the YMEM's are added in the Y accumulator.
Step 4: The difference in the outputs of the X and Y accumulators is computed in the subtractor. The subtractor now contains Y,,,~.
Step 5: All the registers in the X and Y mask bitshifters are strobed so that all the delay versions of the input and output samples are available at the outputs of the X and Y mask bit-shifters.
To calculate the next output sample, steps 1 to 5 are repeated with x , ,~. Continuing along the rnth row after N samples, the entire row will have been processed. For the next row of computation registers SR1-SR6 of the X and Y mask bit-shifters are cleared so that the proper initial condition is retained. The steps are repeated in a similar fashion until the last element YM,N is computed.
In this serial implementation, the memory is enabled in sequence for serial addition as opposed to the parallel implementation, where all memories are enabled simultaneously. This was accomplished by using a simple counterdecoder circuit. The counter is used to keep track of the number of additions as well as to strobe the latches of the subtractor. The counter also determines when to disable the clock to the accumulator and subtractor latches upon the required number of arithmetic operation is complete. This prototype costs approximately $350(U.S.) and can operate up to 880 kHz for an image size up to 256 X 256 pixels. A photograph of the prototype is shown in Fig. 8 .
The performance of the distributed arithmetic implementation is demonstrated using a second-order IIR filter drawn from [l] . This filter has a magnitude response that is generally high pass and a linear phase characteristic over most of the frequency domain. The filter has been designed to have a gain equal to 1 beyond the radial frequency of 0 . 7~ and 0.1 below the radial frequency of 0 . 3~. The transition band is Gaussian between 0 . 3~ and 0 . 7~. The contour plot of this filter is shown in Fig. 9 . Two different sizes of images are processed with this filter. They are the 128 X 128 pixels crest image and the 128 X 256 boat image shown in Fig. 10(a) and ll(a) , respectively. it is demonstrated that the filter has been successfully implemented using the distributed arithmetic structure.
VI. CONCLUSIONS The distributed arithmetic realization of a second-order recursive digital filter has been presented. The implementation was based on memory fetch, bit-shifts, and additions. This is different from the conventional direct method of filter implementation that uses a number of multipliers operating in parallel for the same throughout. The proposed structure, when compared to the conventional method of filtering, offers significant reductions in cost and power consumption, and can operate in real-time on images as large as 256 X 256 pixels. For higher order filters, a split-address technique can be used to reduce the memory size while trading off with some extra additions.
Finally, the inherent parallelism in the architecture and the fast growing VLSI device technology with its ability to boost parallel processing suggest that the filters can be economically implemented.
