Abstract-This paper introduces the problem of, and presents some state-of-the-art approaches for, high-speed digital image processing. An architecture based on distributed arithmetic, which eliminates the use of multipliers, is described. A minimum-cycle-time filter architecture, which features a high degree of parallelism and pipelining, is shown to have a throughput rate that is independent of the filter order. Furthermore, a new multiprocessing-element architecture is proposed. This leads to a filter structure which can be implemented using identical building blocks. A modular VLSI architecture based on the decomposition of the kernel matrix of a two-dimensional (2-D) transfer function is also presented. In this approach, a general 2-D transfer function is expanded in terms of low-order 2-D polynomials. Each one of these 2-D polynomials is then implemented by a VLSI chip using a bit-sliced technique. In addition, a class of nonlinear 2-D filters based on the extension of one-dimensional (1-D) quadratic digital filters is introduced. It is shown that with the use of matrix decomposition, these 2-D quadratic filters can be implemented using linear filters with some extra operations. Finally, comparisons are made among the different approaches in terms of cycle time, latency, hardware complexity, and modularity.
I. INTRODUCTION
THE NEED FOR high-speed digital image processing became evident with the increasing use of TV imaging in medical, geophysical, and industrial environments. Many of these applications involve acquisition, processing, and display in fractions of a second. This paper describes the problems and some approaches taken to achieve high-speed digital image processing. In particular, different high-speed architectures characterized by parallelism, pipelining, modularity, and reduction of cycle time are presented and compared. Table I summarizes the importance of speed in a variety of image processing applications [1]-[5].
Manuscript received February 18, 1987. The authors are with the Department of Electrical Engineering, University of Toronto, Toronto, Ontario, Canada M5S 1A4. IEEE Log Number 8714911.
In this paper, high-speed image processing is defined as the processing of images in real-time or near real-time, depending on the application. The term "real-time image processing" is defined as "the processing of images at a speed, such that the data rate of the processed images is the same as that of the input images" [6]. If we consider an image of size M × N pixels and a TV scan rate of L images/s, the input data rate R is M × N × L pixels/s. With a display size of 256 × 256 pixels, this implies a serial data stream at the rate of 1.97 Mpixels/s or one pixel every 508 ns. The corresponding values for a 512 × 512 pixel image are 7.86 Mpixels/s and 127 ns, respectively. By "near real-time," we usually mean that the image can be processed at a rate which is 10-100 times faster than those achieved with traditional sequential processors [7].
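The data-rate figures above can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the scan rate of 30 images/s is an assumption (the text gives only the symbol L), chosen because it reproduces the quoted 1.97 Mpixels/s and 7.86 Mpixels/s.

```python
def pixel_rate(rows, cols, frames_per_second):
    """Input data rate R = M x N x L in pixels per second."""
    return rows * cols * frames_per_second


def pixel_period_ns(rows, cols, frames_per_second):
    """Time available per pixel (1/R), expressed in nanoseconds."""
    return 1e9 / pixel_rate(rows, cols, frames_per_second)


# Assumed scan rate: 30 images/s (not stated explicitly in the text).
ASSUMED_FRAMES_PER_SECOND = 30

for size in (256, 512):
    rate = pixel_rate(size, size, ASSUMED_FRAMES_PER_SECOND)
    period = pixel_period_ns(size, size, ASSUMED_FRAMES_PER_SECOND)
    print(f"{size} x {size}: {rate / 1e6:.2f} Mpixels/s, {period:.1f} ns per pixel")
```

The per-pixel period is the cycle-time budget that each architecture in the following sections must meet for real-time operation.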
During recent years, a number of approaches have been considered to achieve high-speed image processing. These include array processors, VLSI architectures, and residue arithmetic. Array processors are well suited to perform typical image-processing functions. The use of array processors increases the total system throughput in two ways. First, the processing elements (memories, adders, etc.) are usually faster than those of a general-purpose processor. In addition, they run in parallel and/or may be pipelined, further increasing the processing speed [8]. By making use of recent VLSI architectures, a set of VLSI building blocks (chips) which would meet the needs of a wide spectrum of algorithms can be designed. This approach has been considered in [9] and other recent contributions [10]. An implementation of a two-dimensional (2-D) 5 × 5 matrix convolution filter using programmable read-only memories (PROMs) and residue arithmetic was reported in [11]. This filter structure permits easy pipeline design, and the inherent modular structure of residue arithmetic minimizes the design overhead. Special-purpose architectures for residue arithmetic 2-D convolutions were also reported in [12].
A number of new architectures capable of achieving real-time or near real-time image processing will be presented in this paper. These architectures will be compared in terms of "cycle time" and "latency." Cycle time is defined as the time between two consecutive input samples, i.e., 1/R. Latency is defined as the time interval separating the appearance of an input sample at the input port from the appearance of the corresponding output sample at the output port. In Section II, approaches based on distributed arithmetic [13] will be discussed. Recursive filters implemented by this method involve only memory fetches and additions. In particular, the hardware implementation of a second-order 2-D distributed filter will be presented. Section III describes a minimum-cycle-time filter architecture, which has a data throughput rate independent of the order of the filter [14]. Such an approach is different from other implementation schemes, in the sense that "a maximum number of arithmetic operations are performed in one clock cycle." In Section IV, a new multiprocessing-element architecture is proposed.
0098-4094/87/0800-0887$01.00 © 1987 IEEE
II. DISTRIBUTED ARITHMETIC ARCHITECTURE
The implementation of 1-D recursive filter structures using distributed arithmetic was first introduced by Peled and Liu [21]. This approach requires storing the finite number of possible outcomes of an arithmetic operation, and using them to obtain the next output sample through repeated addition and shifting operations. In this section, an extension of the distributed arithmetic approach to the implementation of high-speed 2-D recursive filters is described. Specifically, an implementation scheme for a second-order section is presented.
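A 2-D recursive digital filter is described by the linear difference equation

$$y_{m,n} = \sum_{i=0}^{N_i}\sum_{j=0}^{N_j} a_{i,j}\,x_{m-i,n-j} \;-\; \sum_{k=0}^{N_k}\sum_{l=0}^{N_l} b_{k,l}\,y_{m-k,n-l}, \qquad (k,l) \neq (0,0) \qquad (1)$$

where $x_{m,n}$ and $y_{m,n}$ are the input and output image arrays, respectively, and the $a_{i,j}$'s and $b_{k,l}$'s are the filter coefficients. For a second-order filter ($N_i = N_j = N_k = N_l = 2$), the direct-form realization requires 17 multiplications, 15 additions, and 1 subtraction for each output sample. Assuming all signals to be bounded by ±1 and defining the input and output signals in two's complement code with B bits of accuracy including the sign bit, we have

$$x_{m-i,n-j} = -x^{0}_{m-i,n-j} + \sum_{s=1}^{B-1} x^{s}_{m-i,n-j}\,2^{-s}$$

and

$$y_{m-k,n-l} = -y^{0}_{m-k,n-l} + \sum_{s=1}^{B-1} y^{s}_{m-k,n-l}\,2^{-s} \qquad (2)$$

where $x^{s}_{m-i,n-j}$ and $y^{s}_{m-k,n-l}$ are binary variables. Substituting (2) into (1), we rearrange the summations and define two functions

$$F_1^{s}(x^{s}_{m,n}, x^{s}_{m,n-1}, \ldots, x^{s}_{m-2,n-2}) = a_{00}x^{s}_{m,n} + a_{01}x^{s}_{m,n-1} + \cdots + a_{22}x^{s}_{m-2,n-2} \qquad (3a)$$

and

$$F_2^{s}(y^{s}_{m,n-1}, y^{s}_{m,n-2}, \ldots, y^{s}_{m-2,n-2}) = b_{01}y^{s}_{m,n-1} + b_{02}y^{s}_{m,n-2} + \cdots + b_{22}y^{s}_{m-2,n-2}. \qquad (3b)$$

It is possible to write (1), for a second-order filter, in terms of the two functions $F_1^{s}(\cdot)$ and $F_2^{s}(\cdot)$ as follows:

$$y_{m,n} = \sum_{s=1}^{B-1} 2^{-s}\left[F_1^{s}(\cdot) - F_2^{s}(\cdot)\right] - \left[F_1^{0}(\cdot) - F_2^{0}(\cdot)\right] \qquad (4)$$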
where $F_1^{s}(\cdot)$ and $F_2^{s}(\cdot)$ have a finite number of possible outcomes ($2^9$ and $2^8$, respectively). The distributed arithmetic realization of (4) consists mainly of four building blocks: 1) mask bit-shifters, 2) memories, 3) summers, and 4) a subtractor. A schematic block diagram is shown in Fig. 1. All linear combinations of the nine coefficients $a_{i,j}$ in (1) are stored in the 512 × t memories. Similarly, all possible combinations of the eight $b_{k,l}$ are stored in the 256 × t memories.
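To make the table-lookup idea concrete, the following is a minimal software sketch of a distributed arithmetic inner product for the nonrecursive part only. It is not the hardware of Fig. 1: the function names are hypothetical, the coefficients are integers so the check is exact, and each sample is held as a B-bit two's complement integer q (so that the fractional value in the text is q / 2^(B-1)).

```python
def build_lut(coeffs):
    """Table of all partial sums: entry `address` adds the coefficients
    whose corresponding address bit is 1 (2**len(coeffs) entries)."""
    return [sum(c for i, c in enumerate(coeffs) if (address >> i) & 1)
            for address in range(1 << len(coeffs))]


def da_inner_product(lut, samples, word_bits):
    """Compute sum(c_i * q_i) using only table fetches, shifts and adds.

    `samples` are signed integers that fit in `word_bits` bits.
    Bit plane s = 0 is the sign bit and enters with a negative weight.
    """
    mask = (1 << word_bits) - 1
    acc = 0
    for s in range(word_bits):
        shift = word_bits - 1 - s
        address = 0
        for i, q in enumerate(samples):
            bit = ((q & mask) >> shift) & 1      # bit s of sample i
            address |= bit << i                  # one address line per sample
        term = lut[address] << shift             # table fetch, then weighting
        acc += -term if s == 0 else term
    return acc


# Nine illustrative coefficients (a second-order mask) and nine 8-bit samples.
coeffs = [3, -1, 4, 1, -5, 9, 2, -6, 5]
samples = [-128, 127, 0, -1, 64, -37, 5, 100, -100]
lut = build_lut(coeffs)
result = da_inner_product(lut, samples, word_bits=8)
```

With nine coefficients the table has 512 entries, matching the 512 × t memory mentioned above; the recursive part would use a second, 256-entry table in the same way.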
Consider the implementation shown in Fig. ... The inherent parallelism of the architecture reduces the cycle time to a memory fetch, $\log_2 B + 1$ additions, and the register delays (due to the recursive part). Using the components mentioned above, the time for a 16-bit addition is 19 ns and that of a memory access is 50 ns [23], [24]. In addition, the minimum set-up time and maximum propagation delay of the shift registers are 0 ns and 30 ns, respectively [22]. This gives a maximum cycle time of (4 × 19) + 50 + 30 = 156 ns, or a minimum data rate of 6.41 Mpixels/s. This data rate is close to the real-time requirement for a 512 × 512 pixel image. Higher throughput rates can be attained if faster logic families, such as ECL, are used. With such logic families, the penalty for state-of-the-art performance is increased cost and power consumption.
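The cycle-time estimate above is simple enough to check numerically. The sketch below uses the component delays quoted in the text; the function name is hypothetical, and the word length B = 8 is an assumption inferred from the factor of 4 in (4 × 19), since log2(8) + 1 = 4.

```python
def da_cycle_time_ns(word_bits, add_ns=19, fetch_ns=50, shift_ns=30):
    """Cycle time = one memory fetch + (log2(B) + 1) additions + register delay.

    Assumes `word_bits` is a power of two, so log2 is an exact integer.
    """
    log2_bits = word_bits.bit_length() - 1
    additions = log2_bits + 1
    return additions * add_ns + fetch_ns + shift_ns


cycle_ns = da_cycle_time_ns(8)              # assumed B = 8
rate_mpixels = 1e3 / cycle_ns               # 1/ns expressed in Mpixels/s
real_time_budget_ns = 1e9 / (512 * 512 * 30)  # assumes 30 images/s

print(f"cycle time {cycle_ns} ns, {rate_mpixels:.2f} Mpixels/s, "
      f"budget {real_time_budget_ns:.0f} ns")
```

Under these assumptions the 156 ns cycle time exceeds the roughly 127 ns budget of a 512 × 512 image, which is why the text describes the rate as close to, rather than meeting, the real-time requirement.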
When the order of the filter increases, the memory requirement of this structure increases as $2B\,(2^{(K+1)^2}) \times t$, where K is the order of the filter. However, memory address partitioning, which trades off memory for extra additions, can be used to reduce the amount of memory required. With this modification, the effective memory size decreases while the number of address bits remains the same. For example, a $2^{16} \times t$-bit memory (K = 3) can be implemented using two $2^{8} \times t$-bit memories plus one t-bit adder. Hence, with an extra addition, the effective memory size is reduced considerably.
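The address-partition trade-off can be illustrated in software. In this sketch (hypothetical names, illustrative integer coefficients) a 16-input lookup is served by two 8-input tables and one addition, and the result is compared against the sum that a single 2^16-entry table would have stored.

```python
def build_lut(coeffs):
    """Entry `address` holds the sum of the coefficients selected by its 1 bits."""
    return [sum(c for i, c in enumerate(coeffs) if (address >> i) & 1)
            for address in range(1 << len(coeffs))]


def full_table_value(coeffs, address):
    """What a single, unpartitioned table would store at `address`."""
    return sum(c for i, c in enumerate(coeffs) if (address >> i) & 1)


def partitioned_lookup(lut_low, lut_high, address, low_bits):
    """Two small fetches plus one addition replace one large fetch."""
    low = address & ((1 << low_bits) - 1)
    high = address >> low_bits
    return lut_low[low] + lut_high[high]


coeffs = list(range(1, 17))          # 16 illustrative coefficients (K = 3)
lut_low = build_lut(coeffs[:8])      # 2**8 entries
lut_high = build_lut(coeffs[8:])     # 2**8 entries

words_partitioned = len(lut_low) + len(lut_high)
words_unpartitioned = 1 << len(coeffs)
```

Here the word count drops from 65536 to 512, at the price of one extra t-bit addition per lookup.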
A prototype of the distributed arithmetic implementation of a 2-D recursive filter [13], which can process images of size up to 256 × 256, has been built using mainly TTL components. In this prototype, serial additions, rather than parallel additions, were used in order to reduce the hardware size. The prototype assumes the input and output samples are 8 bits in length, while all the intermediate computations are done in 16-bit precision. The X and Y mask bit shifters are constructed using TTL 74S174 hex
III. MINIMUM-CYCLE-TIME (MCT) FILTER ARCHITECTURE
For 1-D filters, a configuration for which the critical path contains no more than one multiplication and one addition has been proposed [25] . The critical path of a digital filter is defined as the longest one among all possible paths from the output of a delay element to the input of the next one. An extension of this idea to 2-D recursive digital filters will be presented.
The digital filter described by (1) has the transfer function

$$H(z_1^{-1}, z_2^{-1}) = \frac{\displaystyle\sum_{i=0}^{N_i}\sum_{j=0}^{N_j} a_{i,j}\,z_1^{-i}z_2^{-j}}{\displaystyle 1 + \sum_{k=0}^{N_k}\sum_{l=0}^{N_l} b_{k,l}\,z_1^{-k}z_2^{-l}}, \qquad (k,l) \neq (0,0).$$

We now propose the new filter structure for the modified 2-D filter transfer function. The relationships among all these signals of a second-order filter are shown in Fig. 3(a) and (b). It can be easily proven that $Y(z_1^{-1}, z_2^{-1}) = \ldots$ With the assumption that a multiplication takes at least twice the amount of time required for an addition, the critical path is the one shown in bold lines and contains only one multiplication and two additions. This critical path is independent of the order of the filter.
The new filter has a very regular structure, with identical building blocks. This regularity property provides a simple hardware structure for the implementation of the filter.
... addition times, and the sum of the propagation delay and set-up time of the 512-bit shift register. The new filter can process images at a data throughput rate of one pixel every 113 ns (i.e., 2 × 19 + 45 + 30), which is less than the maximum allowable time of 127 ns required for real-time processing.
IV. MULTIPROCESSING-ELEMENT (MPE) ARCHITECTURE
The "Divide and Conquer" algorithm [15] is frequently used to solve a complex problem through a recursive method. The motivation of applying this algorithm to image filtering is to provide a general implementation method by the use of simple identical processing elements. 
Consider a FIR filter

$$N(z^{-1}) = a_0 + a_1 z^{-1} + \cdots + a_M z^{-M}. \qquad (9)$$

Equation (9) can be rewritten as (10).
The extension of this method to the 2-D case is as follows. A 2-D FIR filter can always be written as

$$N(z_1^{-1}, z_2^{-1}) = P_0(z_2^{-1}) + z_1^{-1}P_1(z_2^{-1}) + z_1^{-2}P_2(z_2^{-1}) + \cdots + z_1^{-N_i}P_{N_i}(z_2^{-1}) \qquad (12)$$

where $P_m(z_2^{-1})$, $m = 0, 1, \ldots, N_i$, corresponds to the polynomial of $z_2^{-1}$ associated with the factor $z_1^{-m}$. Thus, $N(z_1^{-1}, z_2^{-1})$ can also be written in the form of (10).
As we can see, (10) is a recursive form of [p q r], where p, q, and r are the corresponding filter coefficients.
If a processing element (PE) can perform the above operation, the use of multiprocessors can solve the original problem in (9). An all-pole 1-D filter can be implemented with a FIR filter in the feedback path. 
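The decomposition in (12) can be illustrated in software: a 2-D FIR output is the sum of 1-D FIR filters, one per coefficient row, each applied to a correspondingly delayed image row. The sketch below is a behavioural check only, not the MPE hardware; the function names are hypothetical, samples outside the image are taken as zero, and the first index (rows) is assumed to correspond to $z_1$.

```python
def fir_1d(coeffs, row, n):
    """y[n] = sum_j coeffs[j] * row[n - j], with zeros outside the row."""
    return sum(c * row[n - j] for j, c in enumerate(coeffs)
               if 0 <= n - j < len(row))


def fir_2d_direct(a, image, m, n):
    """Direct double sum over the coefficient mask a[i][j]."""
    total = 0
    for i, a_row in enumerate(a):
        for j, c in enumerate(a_row):
            if 0 <= m - i < len(image) and 0 <= n - j < len(image[0]):
                total += c * image[m - i][n - j]
    return total


def fir_2d_by_rows(a, image, m, n):
    """Form (12): row i of `a` plays the role of P_i, applied to image
    row m - i (the z1^-i delay), and the partial results are added."""
    total = 0
    for i, p_i in enumerate(a):
        if 0 <= m - i < len(image):
            total += fir_1d(p_i, image[m - i], n)
    return total


a = [[1, 2, 3], [0, -1, 4], [2, 2, -3]]          # illustrative 3 x 3 mask
image = [[5, 1, 0, 2, 7],
         [3, 8, 6, 1, 4],
         [9, 2, 2, 5, 0],
         [1, 1, 3, 7, 6]]
```

Each 1-D filter in the sum is the kind of operation a single identical processing element could perform, which is the motivation given for the multiprocessing-element structure.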
Here, the polynomial $Q$ has no constant term. In order to reduce the critical path length, the original transfer function must be modified. Fig. 4(a) and (b) show the MPE architecture for the nonrecursive and recursive blocks, respectively. From Fig. 4(b), the modified transfer function $\hat{H}(z_1^{-1}, z_2^{-1})$ is the original $H(z_1^{-1}, z_2^{-1})$ multiplied by delay terms (16), and the critical path length is one multiplication and four additions. The total latency is a number of cycle times that depends on $N_i$, plus 2 addition times and the shift register propagation delay. As illustrated, besides having a regular structure with identical PE's, the MPE architecture provides a high throughput rate suitable for high-speed digital image processing.
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS, VOL. CAS-34, NO. 8, AUGUST 1987
V. VLSI ARCHITECTURES BY DECOMPOSITION
... by writing the matrix Q as a product of two other matrices R and S. The matrix R might be chosen arbitrarily as a nonsingular $(N_i + 1) \times (N_i + 1)$ matrix. In this case, the matrix S is determined by

$$S = R^{-1}Q \qquad (20)$$

and has the dimension $(N_i + 1) \times (N_j + 1)$. $Z_1^T R$ is a $1 \times (N_i + 1)$ matrix, which is a function of the first variable $z_1^{-1}$, and $S Z_2$ is an $(N_i + 1) \times 1$ matrix, which is a function of the other variable $z_2^{-1}$. Thus, $N(z_1^{-1}, z_2^{-1})$ can be expressed as a sum of products of first-order terms, each one of which is a function of only one of the two variables. The denominator polynomial can also be decomposed in the same manner. In order to eliminate the complex first-order terms that arise in the decomposition from complex conjugate pairs, and to save on hardware, a building block implementing a second-order 2-D FIR filter was chosen [17].
In this section, the VLSI implementation of a second-order first-quadrant 2-D digital filter is considered. The filter, referred to as the "Slice," has the following specifications:
(i) it implements either an all-pole or an all-zero transfer function in 4-bit slices; the wordlength is indefinitely expandable in an expansion chain, the only penalty being the speed of operation, (ii) it operates on 512 × 512 pixel images, (iii) it is easily cascadable in a pipeline to implement more complex functions, with the same throughput as a single Slice, and (iv) it uses 512 × 8 bits of external scratchpad memory and a host controller to program the filter coefficients, present input data, and store the results.
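The kernel-matrix factorization that motivates this building block, S = R^{-1}Q in (20), can be checked numerically. The sketch below is a plain-Python illustration under stated assumptions: the helper names and the example matrices are hypothetical, $Z_1$ and $Z_2$ are taken to be the vectors of powers $[1, z^{-1}, z^{-2}, \ldots]$, and exact fractions are used so the comparison is not affected by rounding.

```python
from fractions import Fraction


def mat_inverse(R):
    """Gauss-Jordan inverse over exact fractions (R must be nonsingular,
    otherwise the pivot search raises StopIteration)."""
    n = len(R)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(R)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def mat_mul(A, B):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*B)] for row in A]


def powers(z_inv, order):
    return [z_inv ** k for k in range(order + 1)]


def evaluate_direct(Q, z1_inv, z2_inv):
    """N = Z1^T Q Z2 evaluated as a plain double sum."""
    Z1, Z2 = powers(z1_inv, len(Q) - 1), powers(z2_inv, len(Q[0]) - 1)
    return sum(Z1[i] * Q[i][j] * Z2[j]
               for i in range(len(Q)) for j in range(len(Q[0])))


def evaluate_decomposed(R, S, z1_inv, z2_inv):
    """N = sum_k (Z1^T R)_k * (S Z2)_k: products of single-variable terms."""
    Z1, Z2 = powers(z1_inv, len(R) - 1), powers(z2_inv, len(S[0]) - 1)
    left = [sum(Z1[i] * R[i][k] for i in range(len(R))) for k in range(len(R[0]))]
    right = [sum(S[k][j] * Z2[j] for j in range(len(Z2))) for k in range(len(S))]
    return sum(l * r for l, r in zip(left, right))


Q = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]      # illustrative coefficient matrix
R = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]       # arbitrary nonsingular choice
S = mat_mul(mat_inverse(R), Q)              # equation (20)
```

Because R is arbitrary, different choices give different but equivalent sums of single-variable products; the example only confirms that the two evaluations agree.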
The filter was implemented in 5-µm NMOS VLSI technology (available at the University of Toronto) on two chips: the "Datapath" and the "Sequencer."
The Datapath contains all the circuitry to carry out the arithmetic functions and the registers to hold the data and filter coefficients. Since a second-order filter requires an array support of eight values, there are nine (one for the input) 4-bit data registers and nine 4-bit coefficient registers. Multiplications and additions are carried out by the Arithmetic Logic Unit (ALU). The ALU uses a 4-bit adder with internal carry-look-ahead and external fast carry for word-length expansion. The output of the adder is fed to an accumulator which is internal to the ALU. In addition to data input/output (I/O) for the Slice, the Datapath has an I/O port for the scratchpad RAM. As mentioned earlier, the Slice is switchable between all-zero and all-pole filtering, and selection is carried out by setting a control signal on the Datapath. Within the Datapath, the regularity of the filter structure enabled this section to be broken into eight basic blocks. Of these, five were further divided into subblocks, the lowest hierarchical level in the Datapath. It was at this point that the actual circuit design was started. Once the subblocks had been designed, they were simply connected together to form the basic blocks, and these were interconnected to form the Datapath. Owing to this design technique, there is very little random logic in the Datapath. This is essential, as the implementation of random logic in VLSI is extremely inefficient and time consuming.
The Sequencer is responsible for interfacing with the host and other Slices in a cascade or expansion chain, Datapath coefficient programming, scratchpad RAM management, data blanking control, global processing control, multiply-bit bus control, and Datapath ALU control. The Sequencer was broken into two blocks: a two-output, 18-bit counter and a large PLA (programmable logic array). The PLA generates all the control signals required by the filter. Each of these two blocks was further broken down into subblocks. The circuit design started at this point.
In the Slice, column by column recursion [17] was chosen for its smaller buffer requirements (2 N + 2 output values must be buffered for an N x N pixel image) and simpler recursion algorithm. Fixed two's complement arithmetic was used because of the simplicity of addition and subtraction, ease of detecting overflows, and correct final output (if the final output is within the output dynamic range) even if intermediate results have overflowed.
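The overflow property of two's complement arithmetic mentioned above can be demonstrated in a few lines. This is a minimal sketch with hypothetical function names and an illustrative 8-bit word: intermediate sums are wrapped to the word length after every addition, yet the final value is correct as long as the true result fits in the word.

```python
def wrap(value, bits):
    """Reduce an integer to a signed two's complement value of `bits` bits."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value


def accumulate_wrapped(terms, bits):
    """Add terms in fixed-width arithmetic; also report whether any
    intermediate sum overflowed."""
    acc = 0
    overflowed = False
    for t in terms:
        exact = acc + t
        acc = wrap(exact, bits)
        overflowed = overflowed or (acc != exact)
    return acc, overflowed


# 100 + 90 overflows an 8-bit word, but the final sum (20) is in range.
final, had_overflow = accumulate_wrapped([100, 90, -120, -50], bits=8)
print(final, had_overflow)
```

The wrapped additions are equivalent to arithmetic modulo 2^8, which is why the in-range final value is recovered despite the intermediate overflows.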
VI. TWO-DIMENSIONAL QUADRATIC DIGITAL FILTER ARCHITECTURE
In recent years, nonlinear filters have been used extensively in image processing. It is well known that linear filtering techniques are simpler to implement but have the disadvantage that they blur the edges. They also do not perform well in the presence of signal-dependent noise [27]. Examples of nonlinear filters include homomorphic filters for restoration of an image which is subject to multiplicative degradation, and median filters for removing impulse noise. In this section, we propose a new type of 2-D nonlinear filter based on a second-order characteristic. More importantly, it is shown that these filters can be implemented using linear 2-D FIR filters. One application of these nonlinear filters is texture discrimination. It has been shown that the coarseness of the texture is proportional to the spread of the autocorrelation function and, hence, the second-order moment [28].
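One way to describe the input-output relationship of a 1-D nonlinear system, and one whose design requires only a limited amount of knowledge of higher order statistics, is to use a discrete Volterra series representation [29]. In particular, the first and second terms of this series describe the linear and quadratic parts of the nonlinear system. A 1-D quadratic digital filter is described by

$$y_n = \sum_{i=0}^{N}\sum_{j=0}^{N} h_{i,j}\,x_{n-i}\,x_{n-j} \qquad (21)$$

and, by decomposing the kernel matrix, can be written as

$$y_n = \sum_{i=1}^{k} \lambda_i \left[\sum_{j=0}^{N} r_{i,j}\,x_{n-j}\right]^2, \qquad k \le q \qquad (22)$$

where q is the rank of the kernel matrix $\{h_{i,j}\}$, and the $\lambda_i$'s are the eigenvalues resulting from the singular value decomposition. Hence, the implementation of second-order nonlinear filters is equivalent to the implementation of 1-D linear filters with some extra operations. Based on the 1-D quadratic filter of (21), we define a 2-D quadratic digital filter as follows:

$$y_{m,n} = \sum_{i=0}^{N_i}\sum_{j=0}^{N_j}\sum_{k=0}^{N_k}\sum_{l=0}^{N_l} h_{ij,kl}\,x_{m-i,n-j}\,x_{m-k,n-l} \qquad (23)$$

where $x_{m,n}$ and $y_{m,n}$ are the input and output image pixels and $h_{ij,kl}$ is the 2-D quadratic kernel. We can assume without loss of generality that $N_i = N_k$ and $N_j = N_l$. The 2-D quadratic kernel can be represented by the following "ordered-pair" matrix:

$$H = \begin{bmatrix} h_{00,00} & h_{00,01} & \cdots & h_{00,N_iN_j} \\ h_{01,00} & h_{01,01} & \cdots & h_{01,N_iN_j} \\ \vdots & \vdots & \ddots & \vdots \\ h_{N_iN_j,00} & h_{N_iN_j,01} & \cdots & h_{N_iN_j,N_iN_j} \end{bmatrix} \qquad (24)$$

It should be obvious that the "ordered-pair" kernel matrix H is a symmetric matrix due to the permutations of the product terms $x_{m-i,n-j}$ and $x_{m-k,n-l}$. Hence, (23) can be rewritten with a case distinction between the terms with $i = k$, $j = l$ and all other terms.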
Equation (25) 
From (27) and (28), we observe that each output requires the generation of a new set of coefficients $w_{i,j,m,n}$, which will then be used to determine the final output by convolving with the input again. This implies that a large number of computations is required, especially when the order of the filter increases. In order to reduce the amount of computation, a matrix decomposition [16] approach on the "ordered-pair" kernel matrix is proposed. In this approach, the "ordered-pair" matrix is viewed as a 2-D matrix. As in the 1-D case, using the singular-value decomposition, the quadratic filter of (23) can be written as

$$y_{m,n} = \sum_{i=1}^{p} \lambda_i \left[\sum_{k=0}^{N_k}\sum_{l=0}^{N_l} r_{i;k,l}\,x_{m-k,n-l}\right]^2, \qquad p \le q$$

where q is the rank of the "ordered-pair" matrix of (24), $r_{i;k,l}$ is the element of the decomposed matrix, and the $\lambda_i$'s are the eigenvalues resulting from the singular-value decomposition. Hence, by choosing p smaller than q, the number of coefficients can be reduced. In other words, one can retain only the most significant stages of the decomposition, depending on the accuracy required. In addition, with the decomposition, a modular structure as shown in Fig. 5 can be obtained. Therefore, 2-D quadratic filters can be implemented using linear 2-D FIR filters. The 2-D FIR filters of Fig. 5 can be implemented using the various architectures described in previous sections. Alternatively, each of the 2-D FIR filters can be decomposed to further reduce the hardware complexity. Finally, it should be noted that different types of matrix decompositions can be applied to the "ordered-pair" matrix of (24) to accomplish different objectives, such as a minimum number of coefficients.
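The equivalence between the quadratic form and a bank of linear filters followed by squarers can be checked with a small example. The sketch below does not compute a singular-value decomposition: as a labeled assumption, it builds a symmetric kernel from given stage weights and stage coefficient vectors, then confirms that the direct quadratic sum and the stage-by-stage structure give the same output. All names and numbers are hypothetical.

```python
def flatten_window(image, m, n, size):
    """Samples x[m-i][n-j], i, j = 0..size-1, listed in ordered-pair order."""
    return [image[m - i][n - j] for i in range(size) for j in range(size)]


def kernel_from_stages(weights, stages):
    """H = sum_i weight_i * r_i r_i^T, symmetric by construction."""
    n = len(stages[0])
    return [[sum(w * r[a] * r[b] for w, r in zip(weights, stages))
             for b in range(n)] for a in range(n)]


def quadratic_direct(H, window):
    """Direct evaluation of the quadratic form, as in (23)."""
    size = len(window)
    return sum(H[a][b] * window[a] * window[b]
               for a in range(size) for b in range(size))


def quadratic_by_stages(weights, stages, window):
    """Each stage: one linear 2-D FIR filter, one squarer, one weighting."""
    total = 0
    for w, r in zip(weights, stages):
        linear = sum(c * x for c, x in zip(r, window))   # linear filter output
        total += w * linear * linear                      # two multiplications
    return total


weights = [3, -1]                                 # illustrative stage weights
stages = [[1, 2, 0, -1], [2, 0, 1, 1]]            # illustrative 2 x 2 masks
H = kernel_from_stages(weights, stages)

image = [[4, 1, 7], [2, 5, 3], [6, 0, 8]]
window = flatten_window(image, 2, 2, size=2)
```

Dropping a stage from `weights` and `stages` corresponds to choosing p smaller than q: the hardware shrinks by one linear filter and two multipliers, and the output becomes an approximation of the full quadratic form.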
VII. CONCLUSIONS
This paper has presented some state-of-the-art architectures for high-speed image processing. High throughput rate is achieved through a high degree of parallelism and pipelining of arithmetic operations. Modular structures are shown to be very suitable for efficient VLSI implementation. Table II provides comparisons among the various architectures in terms of speed, latency, hardware complexity, and modularity. From the comparisons, the following comments are in order.
1) The distributed arithmetic architecture is very suitable for low-order filters. It offers savings in both cost and power consumption when compared to the direct implementation.
2) The MCT filter architecture has the shortest cycle time among the four architectures. The cycle time and latency are also independent of the filter order.
3) The MPE architecture features a very short cycle time and is very modular. It is suitable for high-order filters.
4) The VLSI architecture by decomposition is very modular and flexible. High-order filters can be implemented by cascading second-order 2-D filter chip sets.
The implementation of 2-D quadratic digital filters can be simplified by decomposing the "ordered-pair" kernel matrix. The resulting structure is composed of identical stages where each stage consists of a linear 2-D filter and two multipliers.
High-speed image processing and, in particular, real-time image processing is still in its infancy. More research is needed for this field to reach maturity. The advances in microelectronics technology will remain a major factor in the implementation of high-speed image processors.
