Abstract-The system design of a locally connected competitive neural network for video motion detection is presented. The motion information from a sequence of image data can be determined through a two-dimensional multiprocessor array in which each processing element consists of an analog neuroprocessor. Massively parallel neurocomputing is done by compact and efficient neuroprocessors. Local data transfer between the neuroprocessors is performed by using an analog point-to-point interconnection scheme. To maintain strong signal strength over the whole system, global data communication between the host computer and neuroprocessors is carried out over a digital common bus. A mixed-signal very large scale integration (VLSI) neural chip that includes multiple neuroprocessors for fast video motion detection has been developed. Measured results of the programmable synapse and winner-take-all circuitry are presented. Based on the measurement data, system-level analysis on a sequence of real-world images was conducted. A 1.5 × 2.8-cm² chip in a 1.2-μm CMOS technology can accommodate 64 velocity-selective neuroprocessors. Each chip can achieve 83.2 giga connections per second. The intrinsic speed-up factor over a Sun-4/75 workstation is around 180.
I. INTRODUCTION
RAPID advances in a very large scale integration (VLSI)
Fig. 1. An integrated information system. Information can be transferred between the system and the real world in multimedia format.
and advanced display units are used to process visual information. The speech recognizer and synthesizer units are used to process audio information [8]. The microsensor [9] and controller units are used to perform physical actions. Such an intelligent system can be used in offices, factories, and autonomous vehicles. VLSI neural chips can be effectively used in the construction of the interface units. Various design approaches are applicable for the construction of neural chips. The analog circuit approach is quite attractive in terms of hardware size, power consumption, and speed [10]. Analog neural networks were used as sensory devices to preprocess the real-world data, as reported by Mead et al. [11], [12] and Abidi et al. [13]. In addition, many other analog VLSI neural chips have been reported [14]-[16]. One characteristic limitation of a purely analog neural network is its limited computational precision for complex problems. To enhance the performance of analog VLSI neurocomputing, extra analog switching circuits can be used to facilitate the reconfigurability and scalability of analog neural networks [17]. The digital circuit approach offers greater flexibility, scalability, and accuracy than the analog circuit approach. By using logic and memory, a large problem can be partitioned and processed by the digital neural networks. Some general-purpose digital VLSI neural chips were reported [18]-[20].
In this paper, a mixed-signal design approach is used to exploit the massively parallel computational power of the neural network architecture for video motion detection. To solve low-level vision processing problems, multiple neurons and synapses can be clustered together to function as one pixel processing element. By using compact analog circuit design for the neuron and synapse cells, highly parallel computation on the pixel level can be achieved.
1045-9227/93$03.00 © 1993 IEEE
II. SYSTEM ARCHITECTURE
Motion information extracted from a sequence of time-varying images plays a key role in the image understanding and automated control processes. The requirement of an enormous amount of computational power for analyzing image sequences is always a major barrier to real-world applications of most vision-processing algorithms. By using multiprocessor-based VLSI design, the parallelism embedded in low-level vision processes can be fully exploited. The single instruction multiple data (SIMD) architecture is a good example [21], [22]. Two specific multiprocessor-based neural engines based on the SIMD architecture have been reported. The CNAPS machine, from Adaptive Solutions, Inc. [MI, consists of an array of processing nodes (PN's). Each PN is an arithmetic processor with its own local memory. The array is sequenced by a system controller. Thus every PN executes the same instruction at a given clock period. The input data and control commands are broadcast to all PN's through the common bus. The output data of the PN's are transmitted through the data bus by the time-multiplexing scheme. A local digital data link exists between adjacent PN's to allow quick data transfer. Due to the simplicity of the broadcast scheme, no complex routing networks are required. The systolic/cellular array processors (SCAP) system, from Hughes Research Lab. in Malibu, CA [23], consists of a 16 × 16 processor array, a dual-port array memory, and a system controller. The mesh-connection architecture is used. The boundary columns are connected via the wrap-around scheme, and the top and bottom rows are connected to the two ports of the system memory. Data communication can be conducted in the parallel format or in the pipelined format.
In our design, a mesh-connected two-dimensional neuroprocessor array is used for high-speed video motion detection. Each processing element can extract the velocity information for one pixel. Interprocessor communication is done by dedicated analog point-to-point interconnections. Data communication between the host computer and array processors is carried out through the digital common bus to preserve signal strength and to achieve simple network scalability. By using this efficient communication among processors, a high computational power per unit silicon area can be achieved.
III. MOTION DETECTION ALGORITHM
Many features from the images such as points, lines, curves, and optical flow, can be used to estimate motion parameters. Optical flow is the apparent motion of the brightness patterns. Generally, the optical flow corresponds to the motion field [24] , and provides important information about the spatial arrangement of the objects, the rate of change of this arrangement in a given scene, and also the perceiver's own movements. Optical flow can thus be used for deriving relative depth of points [25] , [26] , segmenting images into regions [27] , and estimating the object motion in the scene [28] .
According to the nature of the measured primitives, existing approaches to optical flow computing can be divided into two types: the image intensity based approach and the token based approach. The intensity based approach relies on the assumption that changes in intensity are strictly due to the motion of the object and uses the image intensity values and their spatial and temporal derivatives to compute the optical flow. By expanding the intensity function into a first-order Taylor series, Horn and Schunck [29] derived an optical flow equation using the brightness constancy assumption and spatial smoothness constraints. An iterative method for solving the resulting equation was also developed. The token based approach is to consider the motion of tokens such as edges, corners, and linear features in an image. The key advantage of the token based approach is that tokens are less sensitive to variations of the image intensity. The token based approach provides the information of the object motion and shape at edges, corners, and linear features. An interpolation procedure has to be included when dense data are required.
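As a reminder of the step summarized above, the first-order Taylor expansion of the intensity function leads to the well-known optical flow constraint. The block below is a sketch in assumed notation (E for image intensity, u and v for the two flow components), which is not taken from this paper.

```latex
% Brightness constancy: a moving point keeps its intensity.
E(x + u\,\delta t,\; y + v\,\delta t,\; t + \delta t) = E(x, y, t)

% First-order Taylor expansion of the left-hand side:
E(x, y, t)
  + \frac{\partial E}{\partial x}\,u\,\delta t
  + \frac{\partial E}{\partial y}\,v\,\delta t
  + \frac{\partial E}{\partial t}\,\delta t
  \approx E(x, y, t)

% Dividing by \delta t gives the optical flow constraint:
\frac{\partial E}{\partial x}\,u + \frac{\partial E}{\partial y}\,v + \frac{\partial E}{\partial t} = 0
```

This is one equation in two unknowns per pixel, which is why an additional constraint such as spatial smoothness is needed to obtain a unique flow field.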
Recently, several researchers used neural networks to conduct optical flow computing [30], [31]. To prevent the smoothness constraint from taking effect across strong velocity gradients, a line process has been incorporated into the optical flow equation [31]. The resulting equation is nonconvex and includes the cubic and some higher terms. Instead of using an annealing algorithm, which is very time consuming, a deterministic algorithm was used to obtain a near-optimal solution. Convergence of such a network was obtained within a few iteration cycles. Basically, the mixed analog/digital neural network approach is to first use Horn's optical flow equation to find the smoothest solution and then to update the line process by lowering the energy function of the network repeatedly. In the hardware implementation, the resistive network is quite susceptible to device variation effects from the silicon CMOS fabrication processes.
In order to obtain a dense flow field, the intensity based approach is preferable. However, the intensity value may be corrupted by noise present in natural images, and partial derivatives of the intensity value are sensitive to rotation. It is difficult to detect rotating objects in natural images based on such measurement primitives. Under the assumption that changes in intensity are strictly due to the motion of the object, Zhou et al. [32], [33] use the principal curvatures of the intensity function to compute the optical flow because they are rotation-invariant. The intensity values and their principal curvatures are estimated by using a polynomial fitting technique. Under the assumption of local rigid motion and the smoothness constraint, a self-organizing neural network [34]-[36] was developed to compute the optical flow. A deterministic decision rule was used for the updating of neuron states.
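To make the rotation-invariance argument concrete, the following sketch computes principal curvatures of an intensity surface z = f(i, j) from its partial derivatives. It uses the standard differential-geometry expressions for Gaussian and mean curvature. The function name, the argument names, and the symbol f are assumptions made here for illustration, and the code is not claimed to match the exact form used in [37].

```python
import math


def principal_curvatures(f_i, f_j, f_ii, f_jj, f_ij):
    """Return (k1, k2, G, H) for the surface z = f(i, j) at one pixel.

    f_i, f_j are first partial derivatives; f_ii, f_jj, f_ij are second
    partial derivatives (for example, estimated by polynomial fitting).
    """
    w = 1.0 + f_i * f_i + f_j * f_j
    # Gaussian curvature of a graph surface.
    G = (f_ii * f_jj - f_ij * f_ij) / (w * w)
    # Mean curvature of a graph surface.
    H = ((1.0 + f_j * f_j) * f_ii + (1.0 + f_i * f_i) * f_jj
         - 2.0 * f_i * f_j * f_ij) / (2.0 * w ** 1.5)
    # Clamp tiny negative values caused by rounding before the square root.
    root = math.sqrt(max(H * H - G, 0.0))
    return H + root, H - root, G, H


# A paraboloid f = (2 i^2 + j^2) / 2 evaluated at the origin ...
axis_aligned = principal_curvatures(0.0, 0.0, 2.0, 1.0, 0.0)
# ... and the same surface rotated by 45 degrees about the z axis.
rotated = principal_curvatures(0.0, 0.0, 1.5, 1.5, 0.5)
```

The second derivatives differ between the two calls, yet both return the same pair of principal curvatures, which is the property the paragraph above relies on.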
Let the velocity field consist of two components $k$ and $l$. The neuron states are evaluated and updated according to (1) and (2), where $g(x_{i,j,k,l})$ is the winner-take-all function. The network operation will be terminated if the network converges, i.e., the energy function of the network defined in (4) reaches a minimum.
Two important features of the network should be noted: i) The synaptic interconnection strengths between neurons on different modules are zero because only the neurons in the same module are connected. ii) A maximum evolution function is used to ensure that only the one neuron which has the maximum excitation is fired and the other $(2D_k + 1)(2D_l + 1) - 1$ neurons are turned off.
As reported in [32], a smoothness constraint is used for obtaining a smooth optical flow field and a line process is employed for detecting motion discontinuities. The line process consists of vertical and horizontal lines, $L^v$ and $L^h$, respectively. Each line can be in either one of the two states: 1 for being active and 0 for being idle. The error function for computing the optical flow from a pair of image frames can be expressed as in (6). The principal curvatures $k_1(i,j)$ and $k_2(i,j)$ are defined as in [37] in terms of the Gaussian curvature $G$ and the mean curvature, which are given by partial derivatives of the intensity function such as $\partial^2/\partial i\,\partial j$, $\partial^2/\partial i^2$, and $\partial^2/\partial j^2$. A polynomial fitting technique can be used to estimate the derivatives. The $k_{11}$, $k_{21}$, $k_{12}$, and $k_{22}$ values are calculated from the images by the host computer and sent to the neuroprocessor for network evaluation. In (6), the first term is to find velocity values such that all points of two images are matched as closely as possible in a least-squares sense. The second term, which is weighted by $B$, is the smoothness constraint on the solution, and the third term, which is weighted by $C$, is a line process to weaken the smoothness constraint and to detect motion discontinuities.
In addition to an external bias input, each neuron has a self-feedback and receives inputs from similar directionally selective neurons at the neighboring hypercolumns. With the interconnection strengths and bias inputs set as in (12), where $\delta_{a,b}$ is the Dirac delta function, the error function in (6) is mapped into the energy function of the neural network in (4). Notice that the interconnection strengths consist of constants and line process only. The bias inputs contain all the information from images. When the network reaches a stable condition, the optical flow field is determined by the neuron states. The size of a typical smoothing window is 5 × 5.
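The network operation described in this section, in which each neuron sums weighted inputs from same-velocity neurons in its neighborhood plus a bias and a winner-take-all competition picks one velocity per pixel, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact formulation: the uniform weights `w_self` and `w_neighbor` are hypothetical stand-ins for the interconnection strengths, the line process is omitted (as with C = 0), and all function and parameter names are invented here.

```python
def winner_take_all(inputs):
    """Return a one-hot list marking the first maximum of `inputs`."""
    best = max(range(len(inputs)), key=lambda c: inputs[c])
    return [1 if c == best else 0 for c in range(len(inputs))]


def iterate(states, bias, w_self, w_neighbor, radius):
    """One synchronous update of every hypercolumn (pixel).

    states[i][j][c] is the binary state of velocity candidate c at (i, j).
    bias[i][j][c] is the corresponding bias input.
    Only neurons with the same candidate index c are connected.
    """
    rows, cols = len(states), len(states[0])
    new_states = []
    for i in range(rows):
        new_row = []
        for j in range(cols):
            u = list(bias[i][j])  # start from the bias inputs
            for m in range(max(0, i - radius), min(rows, i + radius + 1)):
                for n in range(max(0, j - radius), min(cols, j + radius + 1)):
                    weight = w_self if (m, n) == (i, j) else w_neighbor
                    for c, state in enumerate(states[m][n]):
                        u[c] += weight * state
            new_row.append(winner_take_all(u))
        new_states.append(new_row)
    return new_states


# 3 x 3 pixels, two velocity candidates; the centre initially disagrees.
states = [[[0, 1] for _ in range(3)] for _ in range(3)]
states[1][1] = [1, 0]
bias = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
bias[1][1] = [0.5, 0.0]
updated = iterate(states, bias, w_self=0.0, w_neighbor=0.1, radius=1)
```

In this toy case the eight neighbors outvote the weak bias at the center pixel, so the center switches to the velocity shared by its neighborhood, which is the smoothing behavior the second term of the error function is meant to produce.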
Since the first and second terms in (6) do not contain the line process, the line process is updated before the neuron states. Let $L^{v;\mathrm{new}}_{i,j,k,l}$ and $L^{v;\mathrm{old}}_{i,j,k,l}$ denote the new and old states of the vertical line $L^{v}_{i,j,k,l}$, respectively. Let $Q_{i,j,k,l}$ be the potential of the vertical line $L^{v}_{i,j,k,l}$. When the states of the two neighboring neurons are different, the vertical line $L^{v}_{i,j,k,l}$ will be active provided that the parameter $C$ is greater than zero. If $C = 0$, then all lines are inactive, which means that no line process exists in the network operation. The choice of $C$ is closely related to selecting the smoothness parameter $B$ in (6). A similar updating scheme is also used for the horizontal lines. In the prototype neural chip design, computation for the terms which are weighted by the parameter $C$ is not included. The state of each neuron is synchronously evaluated and updated according to (1) and (2). The initial states of the neurons are set to 1 if $I_{i,j,k,l} = \max(I_{i,j,p,q};\ -D_k \le p \le D_k,\ -D_l \le q \le D_l)$ and to 0 otherwise, where $I_{i,j,k,l}$ is the bias input. The initial conditions are completely determined by the bias inputs. If there are two maximal bias inputs at point $(i,j)$, then only the neuron corresponding to the smaller velocity is initially set to 1 and the other one is set to 0. This is consistent with the minimal mapping theory [38]. In the updating scheme, the minimal mapping theory is also used to handle the case of two neurons having the same largest inputs.
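The initialization rule with its tie-breaking step can be sketched as below. Two points are assumptions made for this illustration only: "smaller velocity" is interpreted as the smaller squared magnitude k² + l², and any remaining tie is broken by tuple order. The function name and the dictionary layout are likewise hypothetical.

```python
def initial_state(bias, d_k, d_l):
    """Pick the initial winner among the velocity candidates of one pixel.

    bias maps a velocity (k, l) to its bias input. The result maps every
    candidate to 0 or 1, with exactly one entry set to 1.
    """
    candidates = [(k, l)
                  for k in range(-d_k, d_k + 1)
                  for l in range(-d_l, d_l + 1)]
    top = max(bias[c] for c in candidates)
    tied = [c for c in candidates if bias[c] == top]
    # Assumed tie-break: smallest squared speed, then tuple order.
    winner = min(tied, key=lambda c: (c[0] ** 2 + c[1] ** 2, c))
    return {c: int(c == winner) for c in candidates}


# Two candidates share the maximal bias; the slower one is selected.
example_bias = {(k, l): 0.0 for k in (-1, 0, 1) for l in (-1, 0, 1)}
example_bias[(1, 1)] = 5.0
example_bias[(0, 1)] = 5.0
example_state = initial_state(example_bias, 1, 1)
```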
IV. THE NEURAL-BASED NEUROPROCESSOR DESIGN

A. VLSI Architecture
To implement the electronic neural network processor, a VLSI architecture has been developed which maps the three-dimensional neural network configuration onto a two-dimensional plane. As shown in Fig. 3, each small frame represents one velocity-selective hypercolumn which contains $(2D_k + 1)(2D_l + 1)$ velocity-sensitive components.
Each hypercolumn is locally interconnected with the $\Gamma \times \Gamma - 1$ neighboring hypercolumns. The hypercolumn is designed as a neuroprocessor within which the velocity selectivity of an image pixel can be conducted. Mixed analog/digital design technologies are utilized for the neuroprocessor design to achieve compact and programmable synapses and neurons for massively parallel neural computation [39].
To simplify the two-dimensional interconnection design for computation of optical flow, the analog point-to-point interconnection for local communication and the digital common bus for global communication are used. Since velocity information of one pixel is affected by its neighbors, each neuroprocessor receives information from the neighboring neuroprocessors during the network operation. Data communication between these locally interconnected neuroprocessors is one key factor in the overall system performance. There are three different
A functional diagram of the velocity-selective neuroprocessor is shown in Fig. 4. It includes a velocity-sensitive component array and a data conversion block. The array has $(2D_k + 1)(2D_l + 1)$ velocity-sensitive components which are laterally connected through the winner-take-all circuit. The velocity of the neuroprocessor is determined by competition which is performed by the winner-take-all circuit. Only one velocity component which has the maximum excitation will be the winner to represent the velocity of that pixel. The data conversion block is used for the analog point-to-point interprocessor interconnection.
As shown in Fig. 5, the velocity-sensitive component is constructed with one synapse array, one summing neuron, and one winner-take-all cell. The synapse array contains $\Gamma \times \Gamma + 1$ programmable synapses. The synapse weights $T_{i,j,k,l;m,n,k,l}$ are stored as charge packets on capacitors and must be refreshed periodically [17], [41]. The binary outputs $v_{m,n,k,l}$ from the neighboring neuroprocessors are routed to the corresponding mask ports of the synapse cells to conduct the network operation. A summing neuron functions as a parallel current-mode adder. Each summing neuron with its associated programmable synapse array performs a complete inner-product computation. The binary outputs of the winner-take-all circuit represent the velocity status.
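At the behavioral level, the masked synapse array and summing neuron amount to an inner product in which a synapse contributes only when its mask bit is 1. The sketch below illustrates this under assumptions: the split of the synapses into window-sized masked inputs plus one always-on bias synapse is an interpretation of the text (neighbors, self-feedback, and bias input), and all names are hypothetical.

```python
def synapses_per_component(window):
    """Synapse count of one velocity-sensitive component: window^2 + 1."""
    return window * window + 1


def summing_neuron(weights, mask_bits, bias_weight):
    """Current-mode sum: a synapse adds its weight only if its mask bit is 1.

    weights and mask_bits cover the masked synapses; bias_weight models the
    extra synapse that carries the bias input and is always enabled.
    """
    if len(weights) != len(mask_bits):
        raise ValueError("one mask bit is required per masked synapse")
    return bias_weight + sum(w for w, m in zip(weights, mask_bits) if m)


# 5 x 5 window: 25 masked synapses, three of them enabled by active inputs.
example_weights = [1.0] * 25
example_mask = [0] * 25
example_mask[0] = example_mask[12] = example_mask[24] = 1
example_sum = summing_neuron(example_weights, example_mask, 0.5)
```

Because the inputs are binary, no analog multiplication of two variable operands is needed; the mask simply switches each stored weight on or off.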
The synapse weights and bias inputs are calculated by the host computer or a digital coprocessor and stored in a digital static-RAM. The 8-bit digital/analog converter transforms the digital representation of the synapse weights into analog values for charging the weight-storage capacitances of the synapse matrix. A two-port static-RAM and differential amplifier-based synapse design allows network retrieving and learning processes to occur concurrently.
B. Detailed Circuit Design
... to provide the amplifier with a specific bias current. When the $V_{\mathrm{mask}}$ is at logic 0, the $V_{\mathrm{bias}}$ is connected to the negative power supply so that no synapse output current is generated. Multiple differential pairs and current-mirror circuits make the wide-range operation possible. If more than 8-bit resolution is required for the synapse function, large-geometry MOS transistors and a shorter refreshing time will be needed. In the EEPROM-style synapse cell [43], [44], at least a 6-bit resolution can be obtained. The summing neuron functions as a current-to-voltage converter and is realized by using a two-stage operational amplifier and a feedback resistor. The circuit schematic diagram of the two-stage operational amplifier is shown in Fig. 6. Transistors M13 and M14 form an improved cascode stage to increase the voltage gain and M24 operates as a resistor for proper frequency compensation. An amplifier voltage gain of 100 dB can be achieved.
The outputs of the winner-take-all circuit are binary values. Only one winner cell with the maximum input voltage will have the logic-1 output value. The other cells will have the logic-0 output value. The winner-take-all circuitry functions as a multiple-input parallel comparator. Fig. 8 shows digital circuits of the data latch with the associated read/write control logic. The final velocity result is read by the host computer from the data latches through the digital common bus. Fig. 9 shows a voltage-scaling digital-to-analog converter [51] which is used to convert the encoded binary code to the analog value and send it to the neighboring neuroprocessors. Only one of these bits is logic-1 and the others are logic-0. To achieve high-speed performance and a compact silicon area, a parallel and distributed analog-to-digital converter has been designed. One voltage scaling resistor-chain is used.
As shown in Fig. 10, the comparators and the associated digital decoding circuitries are distributed into the synapse cells. The comparators included in the same velocity-sensitive component use the same reference voltage provided by the resistor-chain. The distributed decoding circuitries make sure that only one of the $(2D_k + 1)(2D_l + 1)$ binary outputs is logic 1 and the others are inhibited to logic 0.
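The interprocessor link described above sends a one-of-N velocity code as a single analog level and recovers it with comparators referenced to a resistor chain. The following behavioral sketch shows the idea. The voltage range, the even spacing of the taps, and the function names are assumptions for illustration; the actual circuit values are not given in the text.

```python
def encode_level(one_hot, v_low, v_high):
    """Voltage-scaling DAC model: map the single logic-1 bit to a tap voltage."""
    if sum(one_hot) != 1:
        raise ValueError("exactly one bit must be logic 1")
    step = (v_high - v_low) / len(one_hot)
    # Place each level in the middle of its slot for maximum noise margin.
    return v_low + (one_hot.index(1) + 0.5) * step


def decode_level(voltage, n, v_low, v_high):
    """Parallel ADC model: n - 1 comparators against resistor-chain taps."""
    step = (v_high - v_low) / n
    references = [v_low + t * step for t in range(1, n)]
    # Number of references below the input (thermometer code count).
    index = sum(1 for ref in references if voltage > ref)
    # Decoding logic enables exactly one of the n binary outputs.
    return [1 if c == index else 0 for c in range(n)]


# Round trip for 25 velocity candidates over an assumed -2.5 V to 2.5 V range.
code = [0] * 25
code[7] = 1
level = encode_level(code, -2.5, 2.5)
recovered = decode_level(level, 25, -2.5, 2.5)
```

With this scheme each data link needs one analog wire instead of one digital wire per candidate, at the cost of a noise margin of half a step on either side of each level.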
V. EXPERIMENTAL RESULTS
In the prototype neuroprocessor chip design, $D_k = D_l = 2$ and a 5 × 5 smoothing window are used. The physical layout of the velocity-selective neuroprocessor for one image pixel using the scalable CMOS design rules is shown in Fig. 11. It occupies an area of 2,482 × 5,636 λ², contains 25 neurons and 25 × 27 synapse cells, and is able to detect a moving object with 25 different velocities. In the hardware implementation, two rows of synapses are used to increase the resolution of synapse weights coming from the bias inputs and also to enhance the fault tolerance of the network. With an advanced 1.2-μm CMOS technology, 64 neuroprocessors can be accommodated into one VLSI neural chip of 1.5 × 2.8 cm² in size. The chip layout is shown in Fig. 12. It requires a 178-pin PGA package. The analog interprocessor data communication requires 128 pins. The detailed layout of interconnects among four neuroprocessors is shown in Fig. 13. The interconnection routing area occupies 23% of the chip area. A performance comparison against the digital bit-parallel point-to-point interconnection method is listed in Table I. In the digital bit-parallel method, each data link requires 25 lines. Only 12 neuroprocessors can be accommodated in the same chip area and 85% of the chip area will be used for the interconnection routing purpose. With 128 VLSI neural chips and many supporting standard IC parts such as SRAM's and 8-bit DAC's for storing the weight information and dynamically refreshing the synapse cells, computation of optical flow from an image with 64 × 128 pixels and 256 gray levels can be performed at a rate of 30 frames per second. The proposed system set-up for fast motion detection using multiple VLSI neural chips is shown in Fig. 14.
Fig. 14. The system diagram for high-speed motion detection using multiple VLSI neural chips. Each VLSI neural chip can accommodate 64 neuroprocessors with 1.2-μm CMOS technology and 1.5 × 2.8 cm² chip area. Each neuroprocessor in the VLSI neural chip can communicate with its neighbors through the analog point-to-point interconnections. The standard IC parts such as SRAM and 8-bit DAC are used for refreshing of synapse weights.
Fig. 15. The layout of the test module which includes key circuit blocks.
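The chip count quoted above follows from assigning one neuroprocessor to each pixel. The short sketch below reproduces that arithmetic and applies the same calculation to the 12 neuroprocessors per chip quoted for the digital bit-parallel method. The function name is hypothetical, and the second figure is only an illustrative extrapolation that the paper does not state.

```python
import math


def chips_needed(rows, cols, processors_per_chip):
    """Number of chips when every pixel gets its own neuroprocessor."""
    return math.ceil(rows * cols / processors_per_chip)


analog_chips = chips_needed(64, 128, 64)    # analog point-to-point links
digital_chips = chips_needed(64, 128, 12)   # digital bit-parallel links
```

Under these assumptions the analog interconnection scheme needs 128 chips for a 64 × 128 image, whereas the digital alternative would need several hundred chips of the same area.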
To obtain the electrical properties of the basic circuit blocks, a test structure containing key circuit components was fabricated with a 2-μm CMOS process from Orbit Semiconductor, Inc. through the MOSIS Service of USC/Information Sciences Institute at Marina del Rey, CA and tested. The picture of the test structure is shown in Fig. 15. Measured transfer curves of the synapse cell with different bias voltages are shown in Fig. 16. The dynamic range of the synapse cell is controlled by the bias voltage. Experimental data on the winner-take-all circuit are shown in Fig. 17. The circuit consists of nine winner-take-all cells. Two experiments were conducted. In Fig. 17(a), one input sweeps linearly from -1.53 to -1.48 V, the second input is connected to -1.5 V, and the other seven inputs are kept at -1.525 V. In Fig. 17(b), one input sweeps linearly from 1.47 to 1.52 V, the second input is connected to 1.5 V, and the other seven inputs are kept at 1.475 V. The winner-take-all function is successfully implemented with a resolution of 15 mV. The processing time for one network iteration is around 522 ns. Each iteration cycle includes synapse multiplication, neuron summing, winner-take-all operation, data storage on latches, digital/analog and analog/digital conversion, and interprocessor data transfer. SPICE [54] simulation results on various circuit blocks are listed in Table II. The large response time of the synapse multiplication is due to the significant capacitance loading on the current-summation line. For the digital/analog conversion simulations, 5 pF and 50 pF effective capacitance loadings are estimated for intrachip data communication and off-chip data communication, respectively. The major delay will come from the off-chip interprocessor data communication. The total computing power of 8.32 × 10^10 connections per second can be achieved by using one VLSI neural chip containing 1600 neurons and 41 600 synapse cells, and operated at a master clock rate of 2 MHz. Based on the results of Table II, the speed comparison of a system using 128 VLSI neural chips with a Sun-4/75 SPARC workstation is listed in Table III. The speedup factor is very large.
Fig. 19. (a) The first, (b) second, (c) third, and (d) fourth frame images. (e) Result obtained after 36 iterations, with the effects of process variation on the synapse weights included. (f) Result obtained using the same parameters as those in Fig. 19(e), except that the effects of process variation are not included.
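The per-chip figures quoted above can be cross-checked with a few lines of arithmetic. The sketch assumes that each synapse performs one connection per master clock cycle and that each neuron is counted with window² + 1 synapses. Both are interpretations that happen to reproduce the stated totals, not statements taken from the paper, and the function name is hypothetical.

```python
def chip_throughput(processors, d_k, d_l, window, clock_hz):
    """Return (neurons, synapses, connections per second) for one chip."""
    neurons = processors * (2 * d_k + 1) * (2 * d_l + 1)
    synapses = neurons * (window * window + 1)
    # Assumed: every synapse contributes one connection per clock cycle.
    return neurons, synapses, synapses * clock_hz


neurons, synapses, cps = chip_throughput(64, 2, 2, 5, 2_000_000)
system_cps = 128 * cps  # aggregate over the 128-chip system
```

With 64 neuroprocessors, 25 velocity candidates, a 5 × 5 window, and a 2-MHz clock, this gives 1600 neurons, 41 600 synapses, and 8.32 × 10^10 connections per second, matching the numbers in the text.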
System-level analysis has been conducted to illustrate the performance of the motion detection chip. The mismatch effect of analog synapse components has been included. Fig. 18 shows the statistical distribution of measured synapse output conductances. A total of 300 synapses was measured. In Fig. 18(a), the synapse conductances can be described by a Gaussian distribution with a mean value of 14.07 μA/V and a standard deviation of 0.042 μA/V at weight voltage $V_{ij}$ = 2 V. In Fig. 18(b), the synapse conductances can be described by a Gaussian distribution with a mean value of -13.69 μA/V and a standard deviation of 0.036 μA/V at weight voltage $V_{ij}$ = -2 V. During computer analysis, the effects of process variation on synapse weights are included through the use of a Gaussian function.
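One simple way to include such mismatch in a simulation is to draw each synapse conductance from a Gaussian distribution with the measured mean and standard deviation. The sketch below does this with the values quoted for a weight voltage of 2 V. The function name and the fixed seed are illustrative choices, and this is not claimed to be the authors' simulation procedure.

```python
import random


def sample_conductances(mean, std, count, seed=0):
    """Draw `count` synapse conductances (in uA/V) from a Gaussian model."""
    rng = random.Random(seed)  # fixed seed keeps the example repeatable
    return [rng.gauss(mean, std) for _ in range(count)]


# 300 synapses, matching the size of the measured sample.
conductances = sample_conductances(14.07, 0.042, 300)
sample_mean = sum(conductances) / len(conductances)
relative_spread = 0.042 / 14.07  # about 0.3 percent
```

A relative spread of this size perturbs each weight only slightly, which is consistent with the observation that the motion of the object is still detected when process variation is included.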
A set of four successive image frames directly produced by a Sony XC-77 CCD camera was used as the input data. Fig. 19(a)-(d) shows four successive image frames of a mobile missile launcher moving from left to right against a stationary background. The size of each image frame is 130 × 160 pixels. The maximum displacement of the mobile missile launcher between successive image frames is 7 pixels. To estimate the principal curvatures and intensity values, a 5 × 5 window and a third-order polynomial were used for all frames. By setting $A = 4$, $B = 850$, $C = 0$, $D_k = 7$, and $D_l = 1$, the velocity field was obtained after 36 iterations. The parameter $A$ is set to 4 because four successive image frames are used. The parameter $B$ is chosen by trial and error. The parameter $C$ is set to 0 in the prototype design to simplify the neuron-state updating scheme of the network. Fig. 19(e) shows the final result of using synapse weights obtained by including the effects of process variation. Compared with the result in Fig. 19(f), in which the effects of process variation are not included, the motion information of the moving object can still be successfully detected.
VI. CONCLUSION
A mixed-signal two-dimensional mesh-connected architecture for high-speed motion detection has been presented. A compact and efficient VLSI neuroprocessor which includes 25 neurons and 25 × 27 synapse cells is able to estimate the motion of each pixel with 25 different velocities. Multiple neuroprocessors can be connected as a two-dimensional mesh to fully exploit the massively parallel computational power of neural networks. In this architecture, the local computation is processed in the analog neuroprocessors and the local data communication is performed in parallel. Each 1.5 × 2.8-cm² VLSI neural chip from a 1.
From 1988 to 1992, he studied VLSI implementation of image and video compression systems, digital neurocomputing, and systolic array-based image understanding in the VLSI Signal Processing Laboratory at University of Southern California. Since 1985, he has also been with the Jet Propulsion Laboratory at Pasadena, California and worked on the architecture and design of high-performance computing and image processing systems. He used advanced CAD tools to generate application-specific VLSI chips in the full-custom and standard-cell design styles based on the specially modified signal processing algorithms. He is currently involved in satellite image data compression for Cassini Titan Radar Mapper in the Radar Science and Engineering Division. He has published more than fifteen papers in scientific conferences and journals. His research interests include high-speed data compression for multimedia applications, artificial neural networks, image and video processing, signal processing for synthetic aperture radars, and VLSI system design.
Rama Chellappa
