This paper reviews our ongoing efforts to design a real-time vision system. The system consists of a real-time feature extractor and a relaxation network. The filter system is capable of performing multiple 2D non-separable filters, multi-resolution decomposition, and steerable transforms. The relaxation network is capable of performing various relaxation, diffusion and minimization operations. The paper shows several vision tasks which can be implemented effectively on the vision system.
Introduction
There are at least two application areas in which fast vision systems are important. The first is real-time applications. Such systems have to operate at a certain required rate to be useful. Applications include automatic target recognition/tracking, factory inspection, face recognition for security checks, and robotics. Each application has different timing requirements. The other area is algorithm development. Many vision algorithms are computationally intensive and time consuming to run on a sequential computer. As a matter of fact, the amount of computation for new algorithms tends to grow as more powerful microprocessors become available to the research community. Thus, we often find ourselves in the situation where we have to wait for hours to collect the results each time we change the value of one parameter. The research community can benefit greatly from a fast, inexpensive, general purpose vision system. For these reasons, our focus is on designing a general purpose real-time vision system. The system has to be flexible so that various algorithms can be implemented on the platform. The architecture has to be flexible so that the speed of the system can be increased easily with additional hardware to meet various real-time requirements. These concerns have previously been studied in the context of massively parallel and mesh-connected array processor systems (see [13]).
Many vision algorithms and vision tasks can be divided into two stages in terms of the characteristics of the computations involved. They are the feature extraction stage and the localized iterative interaction (relaxation) stage. Some vision tasks which comply with this characterization are region growing, optical flow computation, shape from shading, non-linear diffusion, and Hopfield networks.
The feature extractor operates on either single or multiple images. Here the feature extractor is a combination of linear filters and pixel-wise operations. In this formulation, $F$ is the feature extractor, $I$ is the input image, $h_i$ is a convolution kernel, $p_i$ is a pixel-wise operation, $\star$ represents convolution, and $\prod$ implies concatenation of operators. Pre-processing on input images such as noise reduction, temporal averaging and sensor compensation can be integrated into the feature extraction stage.
Feature extraction is a vital part of the computer vision system. It involves various types of filters. For real-time applications, the complexity of the feature extraction process often governs the speed of the system, and the system often settles for less effective but computationally affordable feature extractors.
In [10, 11, 12] a method for fast feature extraction was proposed. With this method, it became possible to construct a compact vision system which is capable of operating various 2D non-separable spatial filters for feature extraction. The features of the system include:
real-time performance of multiple 2D non-separable spatial filters.
real-time performance of multi-resolution image transform.
real-time performance of steerable transform of an image.
The system has a simple pipeline structure suitable for VLSI implementation, and has very small latency and memory requirements compared to an FFT-based system.
The architecture is general enough that any type of spatial filter can be implemented on the system. They include Gabor filters, Gaussian filters, difference of Gaussian filters, and Canny filters.
The architecture is scalable in the sense that the number of filters to be implemented and the size of the filters can be increased by adding either additional chips or boards, depending on the actual implementation. The amount of hardware increase is linear with respect to the number and the size of filters. The architecture is not dependent on the input image size.
The relaxation stage involves iterative operations on a local neighborhood. Thus,
$$R(I) = J(I_{n-1}), \qquad I_n = J(I_{n-1})$$
where $R$ is the relaxation operator, $J$ is a local neighborhood operator, and $I_n$ is the intermediate result at the $n$th iteration with $I_0 = I$.
Various types of relaxation networks have been proposed in the literature and some have been implemented in hardware for some specific vision tasks [1, 2, 22]. They are either analog, digital or hybrid containing both analog and digital circuits. Silicon implementations which mimic the human retina are often called silicon retinas and have gained much attention recently [9, 16]. They can also be categorized as relaxation networks. However, they are designed as sophisticated sensors rather than general purpose vision engines.
Section 2 reviews the real-time feature extraction system. Section 3 reviews various vision architectures designed for relaxation and proposes an improved new architecture. Section 4 provides several examples of how various vision algorithms/tasks can be performed on the architecture. Section 5 gives a summary and conclusions.
Feature Extraction System
The basic idea of the filter system is to approximate 2D non-separable filters by a set of separable filters. This way, an expensive 2D non-separable filter operation can be done by adding the results of inexpensive separable filters. The approximation is expressed in Equation (3) in terms of separable filter pairs $a_i$ and $b_i$.
Figure 1 shows the implementation of the separable scheme. It has a simple filter bank structure suitable for VLSI implementation. This section briefly reviews our previous work on the separable filter scheme. For more detail see [10, 11, 12].
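As a rough illustration of the separable idea, the sketch below approximates a single 2D kernel by a sum of separable (column times row) terms, extracting one rank-1 term at a time by power iteration and deflation. It is a minimal sketch in pure Python under stated assumptions, not the filter system's implementation: the function names (`dominant_pair`, `separable_terms`, `max_error`) and the test kernel are hypothetical, and the per-kernel procedure stands in for a singular value decomposition of one filter only.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]

def dominant_pair(matrix, n_iter=200):
    """Return (col, row) so that outer(col, row) is the dominant rank-1 term."""
    transposed = [list(c) for c in zip(*matrix)]
    # Start from the largest row so the start vector lies in the row space.
    v = list(max(matrix, key=lambda r: sum(x * x for x in r)))
    for _ in range(n_iter):
        v = matvec(transposed, matvec(matrix, v))
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:  # all-zero matrix: nothing left to extract
            return [0.0] * len(matrix), [0.0] * len(matrix[0])
        v = [x / norm for x in v]
    return matvec(matrix, v), v  # col = matrix * v carries the singular value

def separable_terms(kernel, order):
    """Approximate a 2D kernel by `order` separable (col, row) pairs."""
    residual = [[float(x) for x in row] for row in kernel]
    terms = []
    for _ in range(order):
        col, row = dominant_pair(residual)
        terms.append((col, row))
        # Deflate: remove the extracted rank-1 term from the residual.
        residual = [[residual[i][j] - col[i] * row[j]
                     for j in range(len(row))] for i in range(len(col))]
    return terms

def max_error(kernel, terms):
    """Largest absolute difference between the kernel and its approximation."""
    worst = 0.0
    for i, kernel_row in enumerate(kernel):
        for j, value in enumerate(kernel_row):
            approx = sum(col[i] * row[j] for col, row in terms)
            worst = max(worst, abs(value - approx))
    return worst
```

For example, the hypothetical kernel `[[2, 2, 0], [2, 4, 2], [0, 2, 2]]` has rank 2, so two separable terms reproduce it up to rounding error while a single term does not.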
SV/OSD
The system employs Singular Value Decomposition (SVD) to decompose a set of 2D filters into a sum of separable filters. The new method is called SV/OSD in this paper. The difference from the classical Treitel-Shank decomposition method [21] using SVD is that SV/OSD takes multiple filters and decomposes them into a separable form simultaneously, while the Treitel-Shank method decomposes each filter separately.
Given a set of 2D filters $\{f_k(x, y)\}$ ($0 \le k < F_N$) to be implemented, they are sampled within a discrete lattice to form a set of FIR filters, $\{f_k[m, n]\}$. For simplicity, the size of the filters is $M$ by $M$. In the decomposition of Equation (4), assume the eigenvalues, $w_{ii}$, are sorted in decreasing order of magnitude. The advantage of SV/OSD over the Treitel-Shank method is its computational efficiency. Assume there are $F_N$ filters to be implemented, each with an $M$ by $M$ mask size, and the decomposition requires a $P$th-order approximation to achieve the required accuracy. Then the computational complexity for the Treitel-Shank method is $2 F_N M P$. The computational complexity for SV/OSD is $F_N M P + M P = (F_N + 1) M P$. Thus, SV/OSD requires less computation by the amount of $2 F_N M P - (F_N + 1) M P = (F_N - 1) M P$. When $F_N = 1$, the two methods become identical.
This computational advantage translates directly into an implementation advantage. The Treitel-Shank method requires $2 F_N P$ filters of size $M$, while SV/OSD requires $(F_N + 1) P$ filters of size $M$.
This comparison is depicted in Figure 3.
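The counts above are easy to tabulate. The following is a small sketch of the arithmetic only; the function names and the sample values of $F_N$, $M$ and $P$ are hypothetical and chosen purely for illustration.

```python
def treitel_shank_cost(num_filters, mask_size, order):
    """Complexity 2 * F_N * M * P of decomposing each filter separately."""
    return 2 * num_filters * mask_size * order

def sv_osd_cost(num_filters, mask_size, order):
    """Complexity (F_N + 1) * M * P of the simultaneous decomposition."""
    return (num_filters + 1) * mask_size * order

def filter_counts(num_filters, order):
    """Number of 1D filters of size M needed by each method."""
    return {
        "treitel_shank": 2 * num_filters * order,
        "sv_osd": (num_filters + 1) * order,
    }

if __name__ == "__main__":
    f_n, m, p = 4, 15, 3  # illustrative values only
    saving = treitel_shank_cost(f_n, m, p) - sv_osd_cost(f_n, m, p)
    print(saving)  # equals (F_N - 1) * M * P
    print(filter_counts(f_n, p))
```

With these sample values the saving equals $(F_N - 1) M P$, and it vanishes when only one filter is implemented.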
The approximated filters are guaranteed to converge to the original filters as the approximation order increases. The SVD guarantees at least linear convergence; however, its convergence is close to exponential in most cases [10]. Due to the property of SV/OSD, Equation (4) is the best rank-$P$ approximation in the least-squares sense.
Multi-resolution decomposition
Multi-resolution decomposition (MRD) is a technique to produce a hierarchical image representation suitable for many image processing and analysis algorithms. It produces multiple levels of image representation. Each level represents the content of an input image over a particular frequency region. The decomposition is done by a set of filters where each filter is tuned to one frequency region. Thus the decomposition can be written as,
where $d_k$ is the $k$th level output, $f$ is the input image, and $\psi_k$ is the $k$th filter.
In most cases, the set of filters is the result of dilating a prototype filter at different dilation factors. For computational convenience, the dilation factors are powers of 2. Thus,
$$\psi_k[m, n] = 2^{-k+1}\,\psi(2^{-k+1} m,\; 2^{-k+1} n). \tag{6}$$
It is often the case that the lower frequency portions of the decomposition are decimated to reduce the data size and the amount of computation without information loss. For the dyadic dilation case, the decimation factor is also a power of 2. This decimated MRD can be formulated as
$$d_k[m, n] = \sum_{i_x, i_y} f[i_x, i_y]\; 2^{-k+1}\,\psi(2^{-k+1} i_x - m,\; 2^{-k+1} i_y - n). \tag{7}$$
This is exactly the dyadic discrete wavelet transform. One of the difficulties in computing an MRD is that the filter size grows exponentially with the decomposition level, since the dilation factor doubles at each level. The discrete wavelet transform ameliorates the problem elegantly by imposing the following constraint on the wavelets,
Then the decimated MRD can be computed recursively using the fixed-length discrete filter $h$.
If the prototype filter does not satisfy the requirement (8), an approximation can be made to the filter using basic splines. The spline satisfies the dyadic constraint (8) with
$$h[n] = g[n] = 2^{-k-1/2} \binom{k+1}{n}. \tag{9}$$
This spline approximation together with SV/OSD can compute an MRD efficiently. First, the 2D non-separable prototype filter needs to be decomposed into a set of separable filters. Each 1D filter is approximated using the spline approximation. The MRD is performed recursively using a set of fixed-length discrete filters. Assume $\psi$ is the prototype filter. SV/OSD is applied to obtain the separable approximation of the filter. Then
Each 1D filter is approximated with the basic spline as in Equation (12), where the basic spline satisfies
The decomposition is performed recursively in the following manner. Figure 4 shows this approximation process. The overall process is done in a pipeline fashion with the filters in each level operated in parallel. It has a simple filter bank structure. The above decomposition algorithm decimates the image at every decomposition level. The first level of the decomposition produces an output as big as the input image. At the $k$th level of the decomposition, the width and height of the output image reduce to $1/2^{k-1}$ of those of the input. In some vision tasks, it is more desirable to keep the image size intact, since the decimation process introduces aliasing and causes the decomposition to be shift variant. This undecimated MRD can be performed using the computational structure shown in Figure 4 with a little modification. The idea is to compute 4 sets of decimated MRDs at different spatial points. The same hardware can be shared for the 4 MRDs. For more detail, see [11].
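The recursion with a fixed-length filter can be illustrated in one dimension. The sketch below assumes the spline coefficients take the form $h[n] = 2^{-k-1/2}\binom{k+1}{n}$ and applies repeated low-pass filtering followed by decimation by 2. The function names and the zero-padded border handling are assumptions made for this illustration; only the low-pass branch is shown, not the full 2D filter bank of Figure 4.

```python
import math

def spline_filter(k):
    """Coefficients h[n] = 2^(-k-1/2) * C(k+1, n) for n = 0..k+1."""
    scale = 2.0 ** (-k - 0.5)
    return [scale * math.comb(k + 1, n) for n in range(k + 2)]

def lowpass_decimate(signal, h):
    """Filter with h and keep every second sample (zero padding at the end)."""
    out = []
    for m in range(0, len(signal), 2):
        acc = 0.0
        for n, coeff in enumerate(h):
            if m + n < len(signal):
                acc += coeff * signal[m + n]
        out.append(acc)
    return out

def lowpass_pyramid(signal, h, levels):
    """Low-pass signals per level; level 1 has the size of the input."""
    outputs = [list(signal)]
    for _ in range(levels - 1):
        # The same fixed-length filter h is reused at every level.
        outputs.append(lowpass_decimate(outputs[-1], h))
    return outputs
```

The coefficients sum to $\sqrt{2}$ for any order $k$, and each additional level halves the signal length, in line with the $1/2^{k-1}$ size reduction described above.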
Steerable system
On a steerable system, the directional preference of a filter can be adaptively controlled by interpolating the outputs of a set of basis filters. Thus, the steerability can be described as,
Such a steerable system can be designed by first decomposing the filter of interest into a Fourier series along the orientation dimension. The transform decouples the orientation and the spatial dimensions.
where
By taking the $Q$ terms of the series with the largest energy contents, the $Q$th-order Fourier series approximation (FSA) is obtained. The approximated filter is approximately steerable.
The approximation is given by Equation (22), in which the functions $q_i(\theta)$ are defined piecewise.
The computational structure of this approximation is shown in Figure 5. SV/OSD can be applied to the basis filters to reduce the amount of computation:
Then the spline approximation can be applied to each 1D filter for an efficient steerable MRD transform.
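The notion of steering by interpolation can be illustrated with a well-known exactly steerable case, the first derivative of a Gaussian, whose response at any orientation is a cosine and sine weighted sum of two basis kernels. This example is swapped in for illustration and is not the FSA design described above; the kernel radius, `sigma`, and all function names are assumptions.

```python
import math

def basis_kernels(radius=3, sigma=1.0):
    """Unnormalized x- and y-derivative-of-Gaussian basis kernels."""
    gx, gy = [], []
    for y in range(-radius, radius + 1):
        row_x, row_y = [], []
        for x in range(-radius, radius + 1):
            g = math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
            row_x.append(-x * g)
            row_y.append(-y * g)
        gx.append(row_x)
        gy.append(row_y)
    return gx, gy

def steer(gx, gy, theta):
    """Interpolate the two basis kernels with weights cos(theta), sin(theta)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c * a + s * b for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]

def direct_kernel(theta, radius=3, sigma=1.0):
    """The same kernel computed directly as a directional derivative."""
    c, s = math.cos(theta), math.sin(theta)
    rows = []
    for y in range(-radius, radius + 1):
        rows.append([-(x * c + y * s)
                     * math.exp(-(x * x + y * y) / (2.0 * sigma * sigma))
                     for x in range(-radius, radius + 1)])
    return rows

def max_difference(a, b):
    """Largest absolute element-wise difference between two kernels."""
    return max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb))
```

Because filtering is linear, the same weights can be applied to the outputs of the two basis filters rather than to the kernels, which is the form in which an interpolation unit would operate.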
Architecture
The above feature extraction technique can be implemented with a 1D filter bank structure. There are two ways to implement 1D linear convolution. One is to use a sequence of multiply-accumulate units (pipelined filtering), and the other is to use a set of parallel multipliers followed by a network of adders (parallel filtering). Figure 6 shows the two schemes. Both require $M$ multipliers and $M - 1$ adders for a length-$M$ filter. Pipelined filtering takes one input at a time sequentially. The input is multiplied with all the filter coefficients at the same time, and each multiplication result is accumulated at an accumulator attached to the multiplication unit. The last accumulator holds the final output. Parallel filtering takes $M$ inputs at a time and each input is multiplied with a filter coefficient. The results of the multiplications are added through the binary tree adder. Note that the parallel filtering scheme requires $M$ independent memory banks to provide the parallel inputs to the filter.
Pipelined filtering is suitable for a sequential input stream, and parallel filtering is suitable for a parallel input stream. For SV/OSD, pipelined filtering is suitable for horizontal filters, assuming that the inputs are coming in raster order. Parallel filtering is suitable for vertical filters since the input can be provided in parallel such that the latency of the system reduces to $O(NM)$, where $N$ is the image width.
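The two schemes can be simulated in software to confirm that they compute the same convolution. The following is a behavioural sketch only: `pipelined_fir` models a chain of multiply-accumulate units fed one sample at a time, and `parallel_fir` models $M$ multipliers feeding a binary adder tree. The function names and the zero initial state are assumptions, and no claim is made about the actual chip timing.

```python
def pipelined_fir(samples, coeffs):
    """One input per step; all M multipliers see the same input sample."""
    m = len(coeffs)
    acc = [0.0] * m  # one accumulator per multiply-accumulate unit
    out = []
    for x in samples:
        products = [c * x for c in coeffs]
        new_acc = [0.0] * m
        for i in range(m):
            carry = acc[i - 1] if i > 0 else 0.0  # first unit has no adder
            new_acc[i] = carry + products[m - 1 - i]
        acc = new_acc
        out.append(acc[m - 1])  # the last accumulator holds the output
    return out

def tree_sum(values):
    """Add values pairwise, level by level, like a binary adder tree."""
    values = list(values)
    while len(values) > 1:
        paired = [values[i] + values[i + 1]
                  for i in range(0, len(values) - 1, 2)]
        if len(values) % 2:
            paired.append(values[-1])
        values = paired
    return values[0]

def parallel_fir(samples, coeffs):
    """M inputs per step, one multiplier per coefficient, then an adder tree."""
    m = len(coeffs)
    padded = [0.0] * (m - 1) + list(samples)
    out = []
    for n in range(len(samples)):
        window = padded[n:n + m]  # the M most recent inputs, oldest first
        products = [coeffs[m - 1 - j] * window[j] for j in range(m)]
        out.append(tree_sum(products))
    return out
```

In both functions a length-$M$ filter uses $M$ multiplications per output sample; the difference lies in whether the data arrive one sample at a time or as a window of $M$ samples.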
Assume that horizontal filters are implemented with the pipelined filtering scheme, while vertical filters are implemented with the parallel filtering scheme. A separable filter pair ($a_i$ and $b_i$ in (3)) can be implemented as the horizontal filter first followed by the vertical filter, or vice versa. As noted above, each vertical filter requires a set of parallel input buffers. With the vertical-horizontal filter order, the buffer can be shared among all the vertical filters. Therefore it is more advantageous in terms of the memory requirement to employ the vertical-horizontal filter order.
A VLSI implementation of the filter system consists of 3 custom designed VLSI chips: the horizontal filter chip (HFC), the vertical filter chip (VFC) and the separable filter chip (SFC). They contain multiple 1D filters. A simplified view of each filter is given in Figure 7. Each filter in an HFC is implemented using the pipelined filtering scheme. Each filter in a VFC is implemented using the parallel filtering scheme. There is a pair of horizontal and vertical filters in an SFC. They are implemented using the pipelined and parallel filtering schemes, respectively. With them, various feature extraction systems can be constructed.
Figure 8 shows the system configuration for SV/OSD. It is implemented with HFCs and VFCs. The number of chips required depends on the approximation order, the number of filters, and the size of the filters. An estimate of the number of filters for typical configurations is given in [12].
Figure 9 shows the system configuration for a decimated MRD. It is implemented with HFCs, VFCs and SFCs. Both the basic spline filters and the low-pass banks ($g$) are implemented with SFCs, while the high-pass banks ($g_x$ and $g_y$) are implemented with VFCs.
Figure 10 shows the system configuration for FSA. It is similar to the SV/OSD configuration except for the interpolation units, which are implemented with a VFC.
Relaxation Network
Most relaxation networks including silicon retinas are variations of mesh connected SIMD machines, because the spatial distribution and the local interaction of data are suitable for the mesh topology. The disadvantage of this scheme is that the system performance is heavily dependent on the image size, and the cost of the system is high.
Our approach is to introduce parallelism in the time (or iteration) domain and employ a pipeline structure for the spatial domain. This scheme is modular, consisting of multiple processing units (PUs), and has the following advantages.
1. The number of PUs can be expanded easily.
2. The system is not dependent on the image size.
3. The system cost is dependent on the number of PUs, thus one can build an inexpensive system with fewer PUs.
4. The system is scalable in the sense that the performance improvement is approximately linear as the number of PUs increases.
The last point needs more explanation. Assume some relaxation algorithm takes $N_I$ iterations to converge. Then the system performance increases linearly until the number of PUs reaches $N_I$.
Disadvantages of the scheme are the following:
1. It introduces larger latency.
2. It requires an intermediate buffer.
Assume the image size is $N \times N$, the system has $K$ PUs, and the algorithm converges after $N_I$ iterations. Then the latency with the mesh topology is $O(N^2 + N_I)$ while the one with our scheme is $O(N^2 N_I / K + N_I)$.
Figure 11 shows the functional diagram of a PU. The design takes advantage of the simple and repetitive nature of the local neighborhood interaction. It consists of a local processor, global processor, local memory, input buffer, local memory address generator (LMAG), and an input buffer address generator (IBAG). The local memory holds various parameter values for the relaxation operations and look-up tables for some mathematical functions which are not supported in the local processor. The LMAG provides appropriate addresses to the local memory. The input buffer contains the image data. Since the local processor is restricted to neighborhood operations, the buffer needs to contain only a few rows of data. The IBAG provides appropriate addresses to the input buffer. The addresses computed by the LMAG and IBAG can be either absolute addresses or addresses relative to the current pixel location. At this point, the design of the local processor and the global processor has not been determined fully. However, our philosophy is to keep their design very simple and their functionality to a minimum, since the local neighborhood operations are usually very simple and a simpler design allows higher compaction of PUs in a system.
Figure 12 shows the structure of the local processor. It consists of an input module, output module, 2 multipliers, 2 adders, logical operator, shift register, instruction cache and a register file. The input module directs 8 inputs (2 inputs from the input buffer, 2 inputs from the local memory, 2 inputs from the register file, and 2 feedback paths) to the inputs of the 6 arithmetic units.
The output module directs the outputs of the 6 arithmetic units and 4 external inputs (from the input buffer and the local memory) to the input module through the feedback paths, to the local memory, to the register file, and/or to the global processor. The instruction cache is loaded with instruction sets at the beginning of the relaxation. The data from the local memory and the input buffer are fetched using the addresses from the address generators. The local processor and the address generators are synchronized so that the instruction set and the address sets repeat for every pixel location. This processor organization is similar to the dual pipeline architecture of Huntsberger and Wood [8] and the SLAP PEs developed at Carnegie Mellon [5].
Figure 13 shows the structure of the global processor. The structure is very similar to that of the local processor; however, less parallelism is employed. It consists of an input module, output module, multiplier, adder, logical operator and a shift register. There is only one multiplier and one adder. It allows only one input each from the local memory, the input buffer and the register file. There is only one feedback path.
Figure 14 shows the structure of the address generator. The pixel counter determines the base address from the size of the data associated with each pixel. At the beginning of the relaxation, the address cache is loaded with either the absolute address or the offset relative to the base address.
Figure 15 shows the system architecture of the relaxation network. It consists of an interface module and multiple PUs connected in cascade. The output of the last PU in the chain is fed back to the beginning of the chain, and the input to the first PU can be chosen between the outputs of the interface module and the feedback. For a static image analysis, the features enter into the PU chain, and the multiplexer is locked to select the feedback.
Then the relaxation network proceeds with its computation until it settles to the convergence limit. For an image sequence analysis, the multiplexer selects new features when they arrive at its input ports. The stable state of the relaxation network can be maintained as an initial state for the new set of image features.
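The effect of cascading PUs with feedback can be sketched behaviourally. In the sketch below each PU applies one relaxation step, and data that leave the last PU are fed back to the first until the required number of iterations has been performed. The 3-point averaging operator, the 1D signal, and the function names are assumptions chosen for brevity; the sketch models the iteration count per pass, not the pixel-level pipelining.

```python
def relax_once(signal):
    """One relaxation step J: 3-point local average with replicated ends."""
    last = len(signal) - 1
    return [(signal[max(i - 1, 0)] + signal[i] + signal[min(i + 1, last)]) / 3.0
            for i in range(len(signal))]

def run_pu_chain(signal, n_pus, n_iterations):
    """Run n_iterations steps on a chain of n_pus PUs with feedback.

    Returns the final state and the number of passes through the chain.
    """
    state = list(signal)
    passes = 0
    remaining = n_iterations
    while remaining > 0:
        active = min(n_pus, remaining)
        for _ in range(active):  # each PU in the chain applies J once
            state = relax_once(state)
        remaining -= active
        passes += 1  # the output of the last PU is fed back to the first
    return state, passes
```

With 10 iterations, a chain of 4 PUs needs 3 passes while a chain of 10 PUs needs a single pass; adding PUs beyond the number of iterations gives no further reduction, which matches the scalability remark made earlier.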
Processing unit

Architecture
The interface module performs data conversion if necessary between the feature extractor and the relaxation network. It also converts the parallel outputs from the feature extractor to the sequential input for the relaxation network.
System Examples
This section provides examples of how various vision tasks can be implemented using the vision system. The example tasks are optical flow computation, shape from shading, K-means clustering and Mumford-Shah segmentation.
Optical flow computation
This process requires at least two consecutive images to compute optical flow using the original Horn and Schunck algorithm.
The feature extractor can be used to remove noise by smoothing and temporal dither by a low-pass temporal filter. The main portion of the computation is left for the relaxation network. An estimate of optical flow is obtained at the minimum of the following energy equation.
where $(u, v)$ is an optical flow estimate, $u_x$, $u_y$, $v_x$ and $v_y$ are the spatial differences of the flow field, $f_x$ and $f_y$ are the spatial differences of the input image $f$, $f_t$ is the temporal difference of the images, and $\lambda$ is the Lagrange multiplier. The relaxation for the two-frame optical flow computation becomes
$$u_n = \bar{u}_{n-1} - f_x P / D, \qquad v_n = \bar{v}_{n-1} - f_y P / D \tag{27}$$
where $\bar{u}$ and $\bar{v}$ are the local averages of the estimate, and
The feature extractor and the relaxation network are set up in the following way. The feature extractor simply computes the spatial differences and the temporal difference of the input images. These features as well as $D^{-1}$ are stored in the local memory of each PU as parameters. The system loads the parameters to the memory as the relaxation proceeds. The local processor computes $\bar{u}$, $\bar{v}$, $u_n$, and $v_n$ using (27). The global processor computes $u_n - u_{n-1}$ and $v_n - v_{n-1}$ to check the convergence of the relaxation.
Then the relaxation operation becomes
$$p_n = \bar{p}_{n-1} + \frac{1}{\lambda}(I - R) R_p, \qquad q_n = \bar{q}_{n-1} + \frac{1}{\lambda}(I - R) R_q \tag{32}$$
where $\bar{p}$ and $\bar{q}$ are the local averages of $p$ and $q$, respectively, and $R_p$ and $R_q$ are the derivatives of $R$ with respect to $p$ and $q$, respectively.
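As an illustration of the two-frame optical flow relaxation (27), the sketch below iterates the update on precomputed difference images. The definitions $P = f_x \bar{u} + f_y \bar{v} + f_t$ and $D = \lambda^2 + f_x^2 + f_y^2$ are taken from the standard Horn and Schunck formulation and are an assumption here, as are the 4-neighbour averaging mask, the replicated borders, the symbol `lam`, and the function names.

```python
def local_average(field):
    """4-neighbour mean with replicated borders (one possible averaging mask)."""
    rows, cols = len(field), len(field[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            up = field[max(r - 1, 0)][c]
            down = field[min(r + 1, rows - 1)][c]
            left = field[r][max(c - 1, 0)]
            right = field[r][min(c + 1, cols - 1)]
            out[r][c] = (up + down + left + right) / 4.0
    return out

def relax_flow(fx, fy, ft, lam=1.0, n_iter=50):
    """Iterate the flow update on difference images fx, fy, ft."""
    rows, cols = len(fx), len(fx[0])
    u = [[0.0] * cols for _ in range(rows)]
    v = [[0.0] * cols for _ in range(rows)]
    for _ in range(n_iter):
        u_bar, v_bar = local_average(u), local_average(v)
        for r in range(rows):
            for c in range(cols):
                # Assumed standard definitions of P and D (see lead-in).
                p = fx[r][c] * u_bar[r][c] + fy[r][c] * v_bar[r][c] + ft[r][c]
                d = lam ** 2 + fx[r][c] ** 2 + fy[r][c] ** 2
                u[r][c] = u_bar[r][c] - fx[r][c] * p / d
                v[r][c] = v_bar[r][c] - fy[r][c] * p / d
    return u, v
```

For a horizontal intensity ramp shifted by one pixel between frames ($f_x = 1$, $f_y = 0$, $f_t = -1$ everywhere), the estimate approaches $u = 1$, $v = 0$ as the iteration proceeds. In the architecture described above, the averaging and update would fall to the local processor and the change between iterations to the global processor.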
Shape from shading
The feature extractor passes the pixel intensity $I$ to the PU's local memory. Another set of parameters stored in the memory is the reflectance map $R$. The local processor computes $\bar{p}$ and $\bar{q}$, and updates $p$ and $q$ using (32). The global processor computes $p_n - p_{n-1}$ and $q_n - q_{n-1}$ to check the convergence of the relaxation.
K-means clustering
For every feature vector, the distance between the vector and each cluster centroid is computed in the feature vector space, and the feature is grouped into the cluster whose centroid is the closest one to the feature. At the end of each iteration, the cluster centers are updated by simply taking the average of the features in the cluster. The algorithm converges when no feature changes its cluster group.
First, the feature extractor needs to extract reliable features. Gabor multi-resolution features are easily obtained by the extractor. The set of features becomes the parameter for the relaxation and is stored in the local memory. The cluster centroids at each relaxation iteration are stored in the input buffer. The local processor computes the distance of each feature vector from each cluster centroid and determines the closest centroid to the feature vector. The global processor updates the cluster centroids for the new cluster result. It also compares the difference between the old centroids and the new centroids to check the convergence of the relaxation.
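The division of labour described above can be sketched in software. In the following minimal sketch, `nearest_centroid` plays the role assigned to the local processor and `update_centroids` the role assigned to the global processor. The function names, the squared Euclidean distance, and the handling of empty clusters are assumptions made for this illustration.

```python
def nearest_centroid(feature, centroids):
    """Local step: index of the centroid closest to one feature vector."""
    best_index, best_dist = 0, float("inf")
    for index, centroid in enumerate(centroids):
        dist = sum((f - c) ** 2 for f, c in zip(feature, centroid))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index

def update_centroids(features, labels, centroids):
    """Global step: each centroid becomes the mean of its assigned features."""
    dim = len(features[0])
    updated = []
    for k, old in enumerate(centroids):
        members = [f for f, label in zip(features, labels) if label == k]
        if members:
            updated.append([sum(m[d] for m in members) / len(members)
                            for d in range(dim)])
        else:
            updated.append(list(old))  # keep an empty cluster's centroid
    return updated

def kmeans(features, centroids, max_iter=100):
    """Iterate until no feature changes its cluster group."""
    labels = [None] * len(features)
    for _ in range(max_iter):
        new_labels = [nearest_centroid(f, centroids) for f in features]
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update_centroids(features, labels, centroids)
    return labels, centroids
```

The sketch stops when no label changes, as stated above; comparing old and new centroids, as the global processor is described as doing, would be an equivalent stopping test for this algorithm.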
Mumford-Shah segmentation
where T is the temperature, which starts with a high value and cools down as the iteration proceeds.
The local memory holds the feature set. The input buffer contains the intermediate state of the surface and the line processes. The local processor performs the interaction of (35) and (36).
Conclusion
The main characteristics of our vision system are the following.
Various spatial filters are implemented using approximation techniques for faster processing.
The relaxation network exploits the parallelism in the temporal (iteration) domain rather than the spatial domain.
A major portion of the feature extractor has a filter bank structure and can be integrated into VLSI chips. A major portion of the relaxation network is a chain of processing units. The design of the PU is tailored to the nature of local iterative operations.
Future research work includes performance evaluation of the system on various vision tasks, detailed hardware specification of the PU, design of instruction sets for the PU, and VLSI implementations of the VFC, HFC, SFC, and the PU.
The portion enclosed by dashed boxes can be shared with other orientational filters.
