We present an optoelectronic-VLSI system that integrates complementary metal-oxide semiconductor/multiple-quantum-well smart pixels for high-throughput computation and signal processing. The system uses 5 × 10 cellular smart-pixel arrays with intrachip electrical mesh interconnections and interchip optical point-to-point interconnections. Each smart pixel is a fine-grain microprocessor that executes binary image algebra instructions. There is one dual-rail optical modulator output and one dual-rail optical detector input in each pixel. These optical input-output arrays provide chip-to-chip optical interconnects. Cascading these smart-pixel array chips permits direct transfer of two-dimensional data or images in parallel. We present laboratory demonstrations of the system for digital image edge detection and digital video motion estimation. We also analyze the performance of the system compared with that of conventional single-instruction-multiple-data processors.
Introduction
As the digital age evolves, various media such as images, videos, audio, and data are digitized for storage, processing, and transmission. The digitized information creates the need to process huge amounts of data in real time. User applications are moving to sophisticated features such as multimedia, video conferencing, three-dimensional (3D) graphics rendering, and high-resolution images, resulting in systems that require high data bandwidths. 1 These applications require the system to transfer large amounts of data rapidly and to perform signal processing at high speed. Significant advancements in complementary metal-oxide semiconductor (CMOS) technology have made extremely fast microprocessors possible. By the year 2001 the integration density of CMOS logic is expected to be more than 40 × 10^6 transistors per chip, and the projected frequency is expected to be 1.4 GHz. 2 Performance limits in many systems today are due not to processor clock speed but rather to input/output (I/O) bottlenecks and system architectures. The signal I/O's or interconnections exist between processors and input devices, between processors for multiprocessor systems, and between processors and storage devices.
With recent progress in smart-pixel technologies and development in bump-bonding techniques, it has become possible to attach large numbers of optical I/O devices to foundry-grade CMOS VLSI's. 3 With this method small multiple-quantum-well (MQW) modulators and detectors are attached to CMOS chips by flip-chip bonding with subsequent substrate removal. This technology has made possible many optical I/O's normal to the surface of a VLSI chip. 4, 5 It thus creates large two-dimensional (2D) information transfer capabilities between VLSI chips and potentially alleviates the integrated circuit I/O communication bottleneck.
In this paper we present an n-stage smart-pixel array cellular logic (nSPARCL) processor system that attempts to overcome the I/O bottleneck by using free-space digital optical interconnects. The SPARCL chip is a single-instruction-multiple-data (SIMD) processor element (PE) array in which all PE's are identical and execute the same instruction set on multiple data elements in lock step. They efficiently execute so-called data-level parallel applications, which are programs in which the same algorithm or instruction sequence is applied to a large data set. Matrix-vector multiplication as well as image convolution and filtering operations are some examples of data parallel operations. In the SPARCL chip each PE is implemented with one smart pixel. We designed this optoelectronic chip and had it fabricated through the CO-OP program at George Mason University sponsored by the Defense Advanced Research Projects Agency. 6 The 0.8-μm CMOS circuitry was fabricated by a Hewlett-Packard process through the Metal-Oxide Semiconductor Implementation Service (MOSIS). Then AlGaAs/GaAs MQW p-i-n structures were bonded onto it by Bell Laboratories/Lucent Technologies with its optoelectronic VLSI process. 6 The 1.95 mm × 1.95 mm area contains 200 MQW diodes that can operate as either optical detectors or modulators. The SPARCL chip contains a 5 × 10 array of smart pixels, each with an area of 125 μm × 250 μm. Each smart pixel contains 182 transistors to execute binary image algebra (BIA) operations with a 3-bit local memory. Each pixel also detects or transmits one optical data bit on each clock cycle. The smart-pixel array operates as a mesh-connected SIMD processor. Operation of the SPARCL chip was simulated at more than 100 MHz and has been tested at 90 MHz.
We constructed a demonstration system that interconnects as many as three SPARCL chips in a 2D pipeline processing array. Data flow unidirectionally through the SPARCL pipeline on a 5 × 10 array of digital optical free-space channels. The system is packaged on a 10″ × 14″ (25.4 cm × 35.56 cm) slotted base plate that houses polarization-sensitive and diffractive optical components. A host computer sends instructions to the SPARCL chips to perform data processing routines. We have successfully verified operation of the SPARCL prototype system. Specifically, we tested several image and data processing routines, such as parallel numerical processing (10-bit-wide addition, subtraction, and multiplication), image edge detection, noise filtering, and digital video motion estimation.
We describe the nSPARCL system architecture in Section 2 and the SPARCL chip architecture in Section 3. In Section 4 we present the experimental test results of the SPARCL system and demonstrate some applications of the system for digital image processing and digital video motion estimation. Finally, in Section 5 we analyze the system architecture and show the advantage of the SPARCL system compared with conventional SIMD systems.
System Architecture
The system integrates several SPARCL chips in parallel pipelines, using free-space digital optics technology, as shown in Fig. 1. Each chip has a 5 × 10 array of PE's that are electrically mesh connected. The processing elements are optically interconnected point to point between the chip planes, thus creating a 3D massively parallel processing system for data processing and communication. The prototype system uses a host computer as a controller to send instructions as well as data blocks to SPARCL chips. The input datum, e.g., an image, is usually much larger than the 5 × 10 SPARCL array size and therefore is partitioned into 5 × 10 blocks. These blocks are pipelined into the system from the electrical data input pads of the first stage chip. Each chip in this multistage system can be programmed to carry out a different set of instructions. A processing routine usually contains a sequence of instructions written as a BIA sequence. 7 The host computer analyzes the instructions and shares the computation load among the SPARCL stages to optimize the computation efficiency. The processed blocks leave the system from the last stage of the SPARCL system. The host computer then collects the processed blocks and assembles the result. Thus the SPARCL system loads, processes, and unloads the data blocks in a pipelined fashion.
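The host-side partitioning step described above can be sketched as follows. This is a minimal illustration, not the authors' host software: the function name `partition_into_blocks`, the choice of 5 rows by 10 columns for the block orientation, the row-major block order, and the zero padding of images whose size is not a multiple of the block size are all assumptions.

```python
def partition_into_blocks(image, block_rows=5, block_cols=10, pad_value=0):
    """Split a binary image (list of rows of 0/1) into fixed-size blocks.

    Returns a list of ((top, left), block) pairs in row-major order.
    Padding with pad_value at the right and bottom edges is an assumption.
    """
    rows = len(image)
    cols = len(image[0]) if rows else 0
    # Round each dimension up to a whole number of blocks.
    padded_rows = -(-rows // block_rows) * block_rows
    padded_cols = -(-cols // block_cols) * block_cols
    blocks = []
    for top in range(0, padded_rows, block_rows):
        for left in range(0, padded_cols, block_cols):
            block = [
                [
                    image[r][c] if r < rows and c < cols else pad_value
                    for c in range(left, left + block_cols)
                ]
                for r in range(top, top + block_rows)
            ]
            blocks.append(((top, left), block))
    return blocks


# A 256 x 256 image needs ceil(256/5) x ceil(256/10) blocks.
image = [[0] * 256 for _ in range(256)]
blocks = partition_into_blocks(image)
```

The stored `(top, left)` origin of each block is what a host would need to reassemble the processed blocks into the output image.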
In general, free-space digital optics technologies offer a promising solution for I/O bottlenecks in SIMD systems. Figure 2 shows a comparison of the conventional SIMD architecture and two types of SPARCL system, a one-dimensional (1D) parallel data access nSPARCL and a 2D parallel data access nSPARCL, where the prefix n represents the number of cascaded SPARCL stages. The 1D nSPARCL system reads and writes data with the same bus bandwidth as a conventional SIMD machine. The 2D nSPARCL system permits reading from input devices and writing to output devices optically in 2D parallel and hence with much larger I/O bandwidth. All three systems assume the same total number of processing elements. In later sections of this paper we analyze the system performance and make comparisons among these three systems.
Binary Image Algebra and SPARCL Chip Architecture
The SPARCL chip is designed to execute binary image algebra. Each SPARCL pixel is a 1-bit processor for binary image processing. In this section we briefly describe the BIA and show how to implement the BIA into the chip architecture.
A. Binary Image Algebra
BIA, derived from mathematical morphology, is a systematic mathematical tool for general morphological image processing and data manipulation. 7, 8 It defines three fundamental operations:
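• Complement, X̄:
$$\bar{X} = \{(x, y) \mid (x, y) \in W \wedge (x, y) \notin X\},$$
• Union, ∪:
$$X \cup Y = \{(x, y) \mid (x, y) \in X \vee (x, y) \in Y\},$$
• Dilation, ⊕:
$$X \oplus R = \bigcup_{(p, q) \in X} R_{p,q},$$
in which X and Y denote the raw data sets, W denotes the universal set in which all pixels have the value 1, and R_{p,q} denotes the translation of the structuring element R such that its origin is located at (p, q).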
It has been proved that any binary morphological image processing routine can be decomposed into these three fundamental BIA operations. 7 By the combination and repetition of these three operations, any arithmetic or symbolic function of a binary data array can be synthesized. For more information about BIA, please refer to Refs. 7 and 8.
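As an illustration of how complement, union, and dilation compose, the following sketch models a binary image as a set of foreground pixel coordinates. The set-based representation and the function names are assumptions made for this example; they are not the chip's instruction encoding. Intersection is derived from complement and union by De Morgan's law to show that the three operations suffice for it.

```python
def make_universe(rows, cols):
    """The universal set W: every pixel position in a rows x cols image."""
    return {(i, j) for i in range(rows) for j in range(cols)}


def complement(x, universe):
    """Pixels of W that are not in X."""
    return universe - x


def union(x, y):
    """Pixels that are in X or in Y."""
    return x | y


def dilation(x, r, universe):
    """Place a copy of R with its origin at every pixel of X, clipped to W."""
    return {(i + di, j + dj) for (i, j) in x for (di, dj) in r} & universe


def intersection(x, y, universe):
    """Derived operation: X AND Y = NOT(NOT X OR NOT Y)."""
    return complement(union(complement(x, universe), complement(y, universe)), universe)


W = make_universe(3, 3)
X = {(1, 1)}
R = {(0, 0), (0, 1)}  # origin plus its right-hand neighbor
grown = dilation(X, R, W)
```

Here `grown` contains the original pixel and the pixel to its right, which is the behavior expected of a dilation by a two-pixel horizontal structuring element.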
B. SPARCL Chip Architecture
To implement the electronics of the cellular image processor, a VLSI architecture has been developed that maps the three fundamental BIA operations into each smart pixel. Figure 3 is a block diagram of the BIA smart pixel. Each pixel contains a 3-bit local memory (M1-M3), a union section, and a dilation section. At the input port, a multiplexer (MUX) selects the input either from the optical receiver or from the electrical feedback, permitting recursive operations. The input data bit is then routed into one of the three available local memories under control of a 3-bit memory-select command. Each memory is made of a flip-flop register that outputs both the value of the data and the complement value of the data. A 6-bit union command chooses outputs from the memory modules and performs a union operation among selected values. The result of the union operation is sent to the dilation section and then distributed to north, west, south, and east neighbor smart pixels as a local interconnection for dilation. The dilation section takes a reference image pattern from the control unit and performs dilation with values from the local neighbor pixels. Optical signals transmitted from or received by each pixel are encoded as two separate channels (dual-rail encoding). The power ratio of the two spatial channels determines the 0 and 1 logic levels. A schematic of a single smart pixel with one dual-rail optical receiver and one dual-rail optical transmitter is shown in Fig. 4. The receiver contains GaAs MQW self-electro-optical effect device detectors and a CMOS transimpedance receiver. Similarly, the transmitter contains a CMOS modulator driver and MQW modulators. The GaAs and CMOS chips are fabricated separately and flip-chip bonded, and then the MQW GaAs substrate is removed. The receiver and modulator driver circuitry are standard cells designed by Bell Labs/Lucent Technologies. 6 The silicon chip is fabricated by 0.8-μm HP-
Experimental Results and Demonstration
We constructed a testbed for optoelectronic testing and demonstration purposes. The system is packaged upon a 10″ × 14″ slotted base plate housing polarization-sensitive and diffractive optical elements. The demonstrator that we designed is able to house three SPARCL chips and, at present, two chips have been built. A host computer sends instructions to the SPARCL chip to perform data processing routines. A 4-kbyte first-in-first-out buffer is used to interface a slower (100-kbyte/s) data acquisition board on the host computer to the SPARCL chip's input data rate of 20 Mbytes/s. Therefore, with five parallel electrical data input pads and a 5 × 10 array of optical I/O's, each SPARCL chip achieved a 100-Mbyte/s electrical I/O data rate or a 1-Gbyte/s optical I/O data rate. The MQW modulator contrast ratio had a measured average of 1.96 and 2.13 for logic levels 0 and 1, respectively. The optical switching power is the minimum difference in optical power between the dual-rail detectors that switches the logic states. The optical switching power of the detector MQW is ~1.5 μW per diode at 20 MHz. The chip consumed approximately 400 mW of static power dissipation at 5-V operation voltage because of the transimpedance receivers. 5 Dynamic power dissipation was measured at ~100 mW at 20-MHz operation. The total chip power dissipation was measured to be ~500 mW. References 9 and 10 contain many additional details about the chip design and its optoelectronic characterization.
SPARCL is a programmable cellular logic processor suited to the following classes of operations:
• Mathematical morphological processing: basic operations (e.g., dilation, erosion, closing, opening, thinning, skeleton), image feature extraction (e.g., edge detection, shape, size and location verification), image enhancement (e.g., salt and pepper noise removal), parallel pattern recognition (e.g., hit-miss transform, template matching),
• Parallel numerical computation (addition, subtraction, multiplication),
• Combinatorial logic functions, and
• Serial-to-parallel or parallel-to-serial data format conversion and buffering.
Classic linear operators have been powerful in various numerical analysis and signal processing applications. However, when they are applied to image analysis they do not directly address the fundamental issues of how to quantify image shape or geometrical structures. In contrast, mathematical morphology, which is a set-theoretical methodology for image analysis, can rigorously quantify many aspects of the geometrical structure in a way that agrees with human intuition and perception. Morphological image analysis is done by operating on images with some structuring elements. 11 Different structural information is extracted by interaction with selected structuring elements and different combinations of operators. Here we demonstrate two examples of image analysis that use SPARCL instructions for image edge detection and digital video motion estimation.
A. Image Edge Detection
Figure 6 shows an example of the application of the SPARCL system for image edge detection. Although the current version of the SPARCL chip utilizes binary image algebra, it is also possible to process a gray-level image as a set of binary images by use of top-surface and umbra encoding. 10 Here we simply binarize the gray image by 64 × 64 block quantization at the mean value of the block. The host computer then partitions the binarized 256 × 256 image into 5 × 10 blocks and pipelines these blocks into the SPARCL system. Let X be the image input to the SPARCL chip and R be the reference image, which is
where X̄ represents the complement of X, ∪ represents the union operation, and ⊕ represents the dilation operation. The edge detection routine takes only three clock cycles for operation. The same routine repeats for every block in the pipeline.
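For readers who want to experiment, the sketch below builds a boundary detector from complement, union, and dilation only. It is not the authors' three-instruction routine: the formula used here, edge = X ∩ (X̄ ⊕ R) rewritten with De Morgan's law, is one common morphological edge detector, and the 4-neighbor cross chosen for the reference image R, the set-of-coordinates image model, and all function names are assumptions.

```python
def make_universe(rows, cols):
    """All pixel positions of a rows x cols image (the universal set W)."""
    return {(i, j) for i in range(rows) for j in range(cols)}


def complement(x, universe):
    return universe - x


def union(x, y):
    return x | y


def dilation(x, r, universe):
    """Copy R to every pixel of X and clip the result to the image."""
    return {(i + di, j + dj) for (i, j) in x for (di, dj) in r} & universe


# Assumed reference image: the origin plus its four mesh neighbors.
CROSS = frozenset({(0, 0), (-1, 0), (1, 0), (0, -1), (0, 1)})


def edge(x, universe, r=CROSS):
    """Foreground pixels that touch the background.

    Computes X AND (NOT X dilated by R) as NOT(NOT X OR NOT(NOT X dilated by R)),
    so that only complement, union, and dilation are used.
    """
    background = complement(x, universe)
    grown_background = dilation(background, r, universe)
    interior_or_background = union(background, complement(grown_background, universe))
    return complement(interior_or_background, universe)


W = make_universe(5, 5)
square = {(i, j) for i in range(1, 4) for j in range(1, 4)}  # 3 x 3 filled square
outline = edge(square, W)
```

For the 3 × 3 square, only the center pixel has no background neighbor, so `outline` is the square with its center removed.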
B. Digital Video Motion Estimation
Digital video sequences contain highly redundant information, and there is considerable correlation between adjacent frames. Most of the change from frame to frame occurs to the moving objects in the picture, and most of the background information remains unchanged or little changed. Instead of transmitting the whole current frame and wasting precious channel bandwidth, the MPEG encoding algorithm (shown in block diagram form in Fig. 7) transmits the difference between the frames along with the motion vectors by the following procedure: The frames are partitioned into small blocks, and a search is made in the previous frame for the block that best matches a block in the current frame. When the best-matched block is found, the index offset is coded as a motion vector. Collecting these best-matched blocks, we obtain a motion-compensated image frame. Subtracting the motion-compensated image frame from the current frame, we can obtain the difference image. The resultant difference image is compressed with a JPEG encoder and transmitted along with the motion vectors. 12 At the receiver site, the recovery of the current frame is straightforward. The motion-compensated image frame is reconstructed from the motion vectors and the previous frame. The current frame is then recovered from the addition of the motion-compensated image frame and the JPEG decoded difference image.
Fig. 6. Demonstration of the SPARCL system for image edge detection.
Fig. 7. SPARCL for motion estimation. (a) Encoding and transmission, (b) receiving and frame recovery. At transmitter site, two consecutive video frames (at the left) and the difference image and motion vectors (at the right) are computed by the SPARCL system. Player number 18 moves upward in the image frames, leaving his imprint in the difference image. The current frame can be recovered easily at the receiver site.
In the overall digital video transmission system, searching for the best-matched block is highly computation intensive, especially when real-time operation is desired. For example, for an MPEG system with a frame size of 288 × 322 and a 16 × 16 block, running a full search in a 48 × 48 neighborhood area at 30 frames/s requires more than 3 × 10^9 operations per second. A parallel-pipeline smart-pixel array system such as SPARCL offers a method for running digital video motion estimation efficiently. Figure 7 shows a system level simulation in which the SPARCL chip performs the digital video motion estimation functions. A necessary step in motion estimation is computation of the difference between two video frames and image block matching. First the current frame is partitioned into data blocks of 5 × 10 pixels that match the array size of the SPARCL chip. The SPARCL system searches the neighborhood area of 15 × 30 pixels in the previous frame to find a block that is best matched to the data block in the current frame. To perform this search we load the current frame into the SPARCL chip in the second stage of the SPARCL system. Then we scroll the search area data through the first chip, one column at a time. For every new column the search data are updated in the first chip and transmitted optically to the second chip in 2D parallel.
The second chip receives the search data optically and compares them with the data block that already resides in its memory. The second chip then performs the difference operation
to match the data block and the search data, where B_0 is the search block in the previous frame, B_1 is the data block in the current frame, and D is the difference between B_0 and B_1. The search block that is least different from the data block is chosen as the best-matched block. This block is used as an estimate of the current block, and the index offset is coded as a motion vector. The system also subtracts the best-matched block from the current block and obtains the difference block. By collecting these difference blocks and motion offsets we can then create, encode, and transmit the difference image and motion vector for digital video applications.
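The full-search block matching described above can be sketched in software as follows. This is a sequential illustration, not the two-chip optical pipeline: measuring the difference D as the number of pixels at which B_0 and B_1 disagree (a pixelwise exclusive OR followed by a count) is an assumption, as are the function names and the convention that the returned offset is the top-left position of the best block inside the search area.

```python
def difference_count(block_a, block_b):
    """Number of positions at which two equally sized binary blocks differ."""
    return sum(
        a ^ b
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(current_block, search_area):
    """Exhaustively compare current_block with every same-size window.

    Returns (difference, (top, left)) for the least-different window.
    Ties are resolved in favor of the first window found in row-major order.
    """
    block_rows, block_cols = len(current_block), len(current_block[0])
    best = None
    for top in range(len(search_area) - block_rows + 1):
        for left in range(len(search_area[0]) - block_cols + 1):
            window = [
                row[left:left + block_cols]
                for row in search_area[top:top + block_rows]
            ]
            score = difference_count(window, current_block)
            if best is None or score < best[0]:
                best = (score, (top, left))
    return best


# A 5 x 10 block with a single foreground pixel, hidden in a 15 x 30 search area.
current = [[0] * 10 for _ in range(5)]
current[2][3] = 1
area = [[0] * 30 for _ in range(15)]
area[4 + 2][7 + 3] = 1  # the same pattern placed with its corner at (4, 7)
result = best_match(current, area)
```

With a 5 × 10 block and a 15 × 30 search area there are 11 × 21 candidate windows; in this example only the window at (4, 7) matches exactly.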
Architecture Analysis and Performance Scaling
A. SIMD I/O Problem
SIMD systems contain two types of I/O traffic between the PE's and external devices: instructions and data elements. The system delivers identical copies of the instruction to every PE, and each PE executes the same instruction at the same time.
There are several methods for delivering the instructions (e.g., sequential loading and broadcast). The most efficient method is simply to broadcast the instruction to every PE simultaneously.
On the other hand, the data elements delivered to every PE are different. Thus we cannot simply broadcast the data to every PE as we do in instruction delivery. Because of the limited number of I/O pads on a chip, the system has to load the data block from the border of the chip, and the data elements flow through the PE array interconnection network (e.g., mesh) step by step until the data block is registered with the PE array. Moreover, the size of the data set is usually much larger than that of the instruction set, e.g., in image processing. The same instruction set applies to a large number of data blocks repeatedly. Here, the data element's I/O becomes the most critical bottleneck for the SIMD system. Also, after processing, the system has to unload the processed data elements from the PE array by the slow step-by-step method again. Therefore there are usually separate I/O channels for loading and unloading data elements. However, the I/O bottleneck still exists.
To perform an image or data processing routine on a computing system requires three distinct steps: (1) loading the data from the input device (such as memory or digital camera) to the processor(s), (2) executing the instructions for the application routine, and (3) unloading the data from the processor(s) and storing them to an output device (memory or display device). 12 To evaluate the performance of the system we define the processing speed as the number of data elements or pixels that are processed over the total processing time, described by
$$\text{processing speed} = \frac{N^2}{T_{\mathrm{load}} + T_{\mathrm{exe}} + T_{\mathrm{unload}}},$$
where T_load, T_exe, and T_unload are the times required for loading, executing, and unloading the N × N data field or image. 13 The value of T_exe is the number of instructions required by an SIMD algorithm multiplied by the SIMD clock period. Here we assume that each instruction requires one clock cycle. A SIMD machine with P × Q PE's processes an N × N image by sequentially processing image blocks of size P × Q, where N is usually much larger than P and Q. The computing system addresses the external device to load input image blocks through a 1D parallel bus. In all current SIMD architectures the time required for loading and unloading each of the P × Q blocks depends on this I/O bus's bandwidth and can easily dominate the total processing time.
This I/O bottleneck occurs for two reasons: The first is that data enter the P × Q processing array through one of its borders on a 1D column parallel data bus. If the data bus is P bits wide, Q clock cycles are required for loading or unloading the processor array. The processing speed in such a system grows with the bus's width rather than with the number of processing elements. The second reason is that when the SIMD array is fully loaded and operating on a data block, the data I/O lines are idle; this results in an underutilized data bus.
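The claim that speed follows bus width rather than PE count can be checked with a few lines of arithmetic. The sketch below is illustrative only: it measures all times in clock cycles for the whole N × N image, takes the processing speed as N² divided by the sum of load, execute, and unload times as defined in the text, and uses hypothetical function names.

```python
def processing_speed(n, t_load, t_exe, t_unload=0):
    """Pixels processed per clock cycle for an n x n image."""
    return n * n / (t_load + t_exe + t_unload)


def border_load_cycles(n, p, q):
    """Total cycles to load an n x n image through a p-bit-wide border bus.

    Each p x q block needs q cycles, and there are n*n / (p*q) blocks.
    """
    blocks = n * n / (p * q)
    return blocks * q


n = 500
# Doubling the number of PE's by doubling q leaves the load time unchanged.
load_5x10 = border_load_cycles(n, p=5, q=10)
load_5x20 = border_load_cycles(n, p=5, q=20)
# Doubling the bus width p halves it.
load_10x10 = border_load_cycles(n, p=10, q=10)
# Even with zero execution time the speed is capped at p pixels per cycle.
ceiling = processing_speed(n, load_5x10, t_exe=0)
```

The load time reduces to N²/P, so the array depth Q cancels out, which is the first of the two reasons given above.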
Faced with this I/O bottleneck problem, architecture designers have developed a prefetch technique. 14-16 The elapsed time associated with loading data from memory is called memory latency. The system deals with the memory latency by adding an extra register within each PE and adding on-chip circuitry that performs data I/O and registration in the background. These registers are interconnected with a network similar to the PE array, e.g., mesh, and match the size and location of the PE array. Instead of loading data through the PE array, the system prefetches data elements through the register array, while the PE's are dedicated for instruction execution. The technique hides the memory latency through data caching. The designers optimize the SIMD chip by balancing the amount of VLSI real estate used for PE circuitry versus registers and background I/O circuitry to maximize the processing speed. 13 The trend is revealed when we consider current devices for SIMD image processing, such as the video signal processor, 17 the integrated memory array processor, 18 and the GLiTCH. 19 These systems use PE array sizes of only 16 × 16 or fewer per chip on chips of size greater than 1 cm^2. Because the PE's are simple 1-bit processors, the processing array itself uses only a small portion of the chip area. The majority of the chip area is used for memory and data I/O.
The two architectures based on an nSPARCL processor system, 20,21 1D nSPARCL and 2D nSPARCL, that we are examining in this paper are shown in Figs. 2(b) and 2(c) and compared with a conventional SIMD machine. With the smart-pixel optical detectors and transmitters, a SPARCL chip can optically transfer its entire data block to another SPARCL chip in a single clock cycle. The 1D nSPARCL uses this feature in the system shown in Fig. 2(b), which has the first SPARCL stage as a dedicated input device, n − 2 intermediate SPARCL's for processing, and the last SPARCL stage as an output device. The 2D nSPARCL assumes that the I/O devices are implemented with smart-pixel technology that permits 2D parallel optical I/O. The I/O devices can be a photonically accessed page-oriented memory, 22,23 a video camera, a display device, or a network connection. In this case, data enter the SPARCL chip in a 2D parallel format and the I/O bottleneck is eliminated. When a SIMD system has no I/O bottleneck, the processing speed scales linearly with the number of processing elements.
The SPARCL chip itself is a SIMD system. Cascading these SPARCL chips to a multistage nSPARCL system makes a multiple-instruction multiple-data stream system in the sense that different SPARCL stages can execute different instruction sets simultaneously. By scheduling the instruction phases among the nSPARCL stages we can improve the efficiency of the system. Figure 8 shows a timing chart of data block processing for a SIMD system, a four-stage 1D nSPARCL system, and a four-stage 2D nSPARCL system. All these systems contain the same total number of PE's. The SIMD system contains 8 × 8 PE's on a single chip. Both SPARCL systems have four stages of SPARCL chips that have 4 × 4 PE's. In this example, the conventional SIMD system takes 16 clock cycles to load one data block and executes the processing in 8 clock cycles. It loads a data block and executes the processing separately. Also, we assume that the system has separate buses for data load and unload so that data unloading and loading occur simultaneously in a pipeline manner.
The 1D nSPARCL system partitions the data into smaller blocks, four times smaller than the conventional SIMD block in this example. The system loads the block into the first stage in only four clock cycles because the block size is smaller. After the data block is loaded in the first SPARCL stage, it takes another clock cycle to transfer the block from chip 1 to chip 2. The system also shares the execution commands evenly between chip 2 and chip 3, so each chip takes four cycles for execution. The processed block is then transferred to the last chip for unloading, and the unloading again takes four clock cycles. The system overlaps the loading time, the execution time, and the unloading time between blocks. For example, when the fourth chip is loading data block 1, chip 3 is executing commands on block 2, chip 2 is executing commands on block 3, and chip 1 is busy loading block 4 from the input device. The resultant total processing time is reduced because of the pipeline processing.
The 2D nSPARCL loads data blocks in 2D parallel in a single clock cycle. For the four-stage 2D nSPARCL, for example, the system shares the eight execution commands evenly over the four stages. Each stage uses two clock cycles to finish the operation. Again these operations are done in pipeline fashion. While chip 4 is executing commands on block 1, chip 3 is executing block 2, chip 2 is executing block 3, and chip 1 is executing block 4. The system uses all the PE resources for operation and therefore the fewest clock cycles of the three systems.
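The three timing scenarios can be compared with a rough steady-state throughput estimate. The sketch below rests on simplifying assumptions that are not stated in the text: the single-cycle optical transfers between chips and the single-cycle 2D load are ignored, a pipelined system accepts a new block every max(stage time) cycles, and the non-pipelined SIMD array spends the sum of its load and execute times on each block. The function names are hypothetical.

```python
def cycles_per_pixel_serial(phase_cycles, pixels_per_block):
    """One array performs every phase of a block back to back."""
    return sum(phase_cycles) / pixels_per_block


def cycles_per_pixel_pipelined(stage_cycles, pixels_per_block):
    """Stages overlap on different blocks; the slowest stage sets the pace."""
    return max(stage_cycles) / pixels_per_block


# Conventional SIMD: 8 x 8 array, 16 load cycles then 8 execute cycles.
simd = cycles_per_pixel_serial([16, 8], pixels_per_block=64)

# Four-stage 1D nSPARCL: 4 x 4 chips; load 4, execute 4, execute 4, unload 4.
nsparcl_1d = cycles_per_pixel_pipelined([4, 4, 4, 4], pixels_per_block=16)

# Four-stage 2D nSPARCL: eight instructions shared as 2 cycles per stage.
nsparcl_2d = cycles_per_pixel_pipelined([2, 2, 2, 2], pixels_per_block=16)
```

Under these assumptions the ordering matches the text: the 2D nSPARCL needs the fewest cycles per pixel, followed by the 1D nSPARCL and then the conventional SIMD array.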
B. Performance Comparison of SIMD and 1D nSPARCL Systems
We compare the performance of nSPARCL and conventional SIMD systems given that both have the same number of total processors in the system. Here we compare and discuss the 1D nSPARCL and conventional SIMD architectures in terms of processing time, scalability, bus utilization, flexible multiple speeds, and unbalanced bandwidth applications. Because loading and unloading occur simultaneously, we can hide the unloading latency and set T_unload to 0 for our simulations.
Comparison of Processing Times
Assume that each of the n chips in the 1D nSPARCL system has an array size of p × q. The equivalent SIMD system has a total size of npq (we assume that this is equivalent in size to the P × Q blocks discussed above). Both SIMD and 1D nSPARCL systems have the same bus bandwidth of p bits per second. The total processing time needed for the SIMD system is
$$T_{\mathrm{SIMD}} = \frac{N^2}{npq}\,(nq + T_{\mathrm{exe}}),$$
where (N^2/npq) represents the number of blocks to be processed and (nq + T_exe) represents the loading time and the execution time required for each block. Note that the SIMD requires separate time slots for loading and execution.
For the 1D nSPARCL system we normally use the first and last chips as input and output devices for data I/O. The intermediate n − 2 chips execute the data processing instructions. The total processing time needed for the 1D nSPARCL system is then
There are two cases of 1D nSPARCL system operation when tasks with different lengths of instructions are applied. For the first case, or T_exe/(n − 2) ≤ q, the intermediate SPARCL's finish processing before the first chip finishes loading data. Thus the total processing time is dominated by the loading time and is independent of T_exe. For the other case, or T_exe/(n − 2) > q, all n stages are used to run the processing routine after the first stage is finished loading. Thus the workloads are spread properly over n stages.
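A small numerical model helps in exploring the two cases. In the sketch below the SIMD time is the product of the two factors named in the text, (N^2/npq) blocks and (nq + T_exe) cycles per block, and the first nSPARCL case charges q loading cycles per p × q block. The second case is an assumption introduced here, not a formula taken from the paper: the per-block work of loading, executing, and unloading (T_exe + 2q cycles) is taken to be shared evenly by all n stages, which keeps the model continuous at T_exe = q(n − 2). The function names are hypothetical.

```python
def simd_total_time(n_image, n_stages, p, q, t_exe):
    """Cycles for a SIMD array of n_stages*p*q PE's fed by a p-bit bus."""
    blocks = n_image ** 2 / (n_stages * p * q)
    return blocks * (n_stages * q + t_exe)


def nsparcl_1d_total_time(n_image, n_stages, p, q, t_exe):
    """Cycles for an n_stages-deep 1D nSPARCL pipeline of p x q chips."""
    blocks = n_image ** 2 / (p * q)
    if t_exe / (n_stages - 2) <= q:
        # Case 1: loading dominates, independent of t_exe.
        per_block = q
    else:
        # Case 2 (ASSUMED model): load + execute + unload work spread over n stages.
        per_block = (t_exe + 2 * q) / n_stages
    return blocks * per_block


def time_ratio(n_image, n_stages, p, q, t_exe):
    """nSPARCL time divided by SIMD time; below 1 favors nSPARCL."""
    return (nsparcl_1d_total_time(n_image, n_stages, p, q, t_exe)
            / simd_total_time(n_image, n_stages, p, q, t_exe))


# Ratio at the balance point t_exe = q * (n - 2) for a three-stage system.
balanced = time_ratio(500, n_stages=3, p=5, q=10, t_exe=10)
```

Under this model the ratio tends to 1 for both very short and very long instruction sequences and is smallest near the balance point, which agrees qualitatively with the behavior the text attributes to Fig. 9.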
To compare the performance between 1D nSPARCL and conventional SIMD systems we investigate the ratio of total processing time for 1D nSPARCL systems to that of the equivalent SIMD systems for the same data I/O bandwidth at different lengths of instruction sets, as shown in Fig. 9. 1D nSPARCL's with stage numbers n = 3, 4, 9, 25 are simulated. For the simulation, the image size is 512 × 512 pixels and each SPARCL system is a 5 × 10 processing array (p = 5, q = 10). When T_exe is small and T_exe ≪ q(n − 2) the loading time dominates the total processing time, and both systems perform roughly the same. In fact, when the instruction set is very small (T_exe < number of nSPARCL stages), the extra steps of optically transferring data between SPARCL stages reduce the SPARCL system efficiency. As T_exe increases to be approximately equal to loading time, the SIMD data I/O bus must stop frequently when the array is loaded with the data block and is busy executing instructions. On the other hand, the nSPARCL system moves loaded data blocks down the multistage system pipeline for processing instead of halting data I/O. The nSPARCL performance is optimized when the distributed execution time equals the system depth, because the loading time and the execution time are balanced. For T_exe > q(n − 2) the execution time plays an increasingly more critical role in the system performance than do the data. When T_exe becomes much greater than q(n − 2), both systems are dominated by the time required for executing processing instructions, and the loading and unloading times become insignificant.
Scalability of the 1D nSPARCL System
Here we compare the processing time for 1D nSPARCL systems with 3, 4, 8, and 25 stages, as shown in Fig. 9 . The optimum ratio of processing time decreases as the number of stages increases. In general, the performance of the 1D nSPARCL system scales up as the size of the system increases, where the system size is defined as the number of PE's of the system. As the problem size ͑ϭnumber of instructions per SIMD algorithm͒ increases, we can improve the system performance by increasing the size of the nSPARCL tailored to the size of the Fig. 9 . Comparison of ratio of total processing times of n ϭ 3, 4, 9, 25 nSPARCL's and equivalent SIMD systems plotted against the time needed for processing operations, T exe .
problem. However, a larger nSPARCL system is not always better than a smaller nSPARCL system for any size of problem. For a fixed-size problem the matched size of the 1D nSPARCL system would optimize the efficiency. From the example shown in Fig. 9 , for problems that need fewer than 10 instruction cycles, n ϭ 3 nSPARCL is better than any n Ն 4 nSPARCL. For problems that need more than 10 and fewer than 20 instruction cycles, n ϭ 4 nSPARCL is better than n ϭ 3 and n Ն 5 nSPARCL systems. In summary, for a problem that needs an instruction set of T exe cycles, the best number of 1D nSPARCL stages n opt that optimizes the efficiency is
where ⌈ • ⌉ represents the next-larger integer.
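The stage-selection rule can be illustrated with a short numerical sketch. The closed form used here, n_opt = ⌈T_exe / q⌉ + 2, is an assumption inferred from the balance condition T_exe ≈ q(n − 2) and from the Fig. 9 examples (n = 3 below 10 cycles and n = 4 between 10 and 20 cycles for q = 10). The function name `optimal_stages` is hypothetical.

```python
def optimal_stages(t_exe: int, q: int = 10) -> int:
    """Assumed stage-selection rule: ceil(t_exe / q) + 2.

    t_exe: instruction cycles needed by the task (T_exe).
    q:     number of columns in the processing array (q = 10 in the simulation).
    The "+ 2" follows from the balance condition T_exe ~ q(n - 2).
    """
    if t_exe < 1 or q < 1:
        raise ValueError("t_exe and q must be positive integers")
    # Integer ceiling division avoids floating-point rounding.
    return (t_exe + q - 1) // q + 2


# Short tasks map to few stages, long tasks to many.
for t in (5, 15, 70, 230):
    print(t, optimal_stages(t))
```

Under this assumed form, instruction lengths of 70 and 230 cycles map to the 9-stage and 25-stage systems simulated for Fig. 9.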
On the other hand, in a parallel-processing system with multiple users it is also desirable for users to be able to share the processors. 15 Because of the single-instruction nature of SIMD systems, complicated mechanisms are needed to handle the scheduling. In contrast, the multistage 1D nSPARCL is in effect a multiple-instruction multiple-data system in that different SPARCL stages are able to execute different sets of instructions. It is easy to partition the multistage 1D nSPARCL system into two or more subsystems in terms of SPARCL stages. Each subsystem is an independent SIMD system, running application programs from different end users. With predictions of problem size and instruction length, we can also assign the optimum number of SPARCL stages to a subsystem dynamically according to Eq. (8) and optimize the processing efficiency individually.
Comparison of Bus Utilization
We can also compare 1D nSPARCL and conventional SIMD systems in terms of the bus utilization of the two systems for different cases of T_exe and T_load. The bus utilization is defined as the volume of data flowing through the bus interface between the processor array and external devices over a period of time. Because of equilibrium, the utilization of the input bus and the output bus should be equal. Bus utilization represents the data throughput rate of the system and is therefore a good measure of the system performance. The bus utilization ratio of 1D nSPARCL is compared in Fig. 10 with that of the SIMD system at different relative values of T_exe / T_load. The nSPARCL has the greatest advantage over the SIMD architecture when T_exe ≈ T_load. This illustrates the ability of the 1D nSPARCL system to utilize the bus bandwidth better by moving loaded blocks to open SPARCL processor arrays in the pipeline. It also shows the scalability advantage of the nSPARCL system over its equivalent SIMD machine. Given a task with a certain length of instructions, we can scale up the number of stages of the SPARCL system such that T_exe ≈ q(n − 2) and the data throughput rate is optimized.
Hybrid Speeds with the 1D nSPARCL System
At the system level, a multistage nSPARCL also offers opportunities for high-speed data I/O. In a VLSI chip, electrical signals enter and leave through electrical I/O pads at the side of the chip. In practice, because of the off-chip parasitics from the package and the printed circuit board, the off-chip clock suffers from limited signal bandwidth. To overcome this problem it is common practice to have VLSI chips designed with slow (tens of megahertz) off-chip clocks synchronized to the high-speed on-chip clocks with on-chip phase-locked loop circuitry. However, for the SIMD array I/O bottleneck that we have discussed, doing this helps only to shorten the execution time on-chip but not the loading-unloading time. The fundamental problem of data element traffic still exists. Although some special VLSI components fabricated in GaAs can have higher-speed I/O, the design of such VLSI's may be more difficult and less dense than that of CMOS. On the other hand, SPARCL with optical interconnects offers opportunities at the system level to avoid these problems. The multistage nSPARCL system can have multiple off-chip speeds. In the system we can dedicate high-speed chips (e.g., GaAs) to the first and the last stages of the system for data I/O. The loaded data elements are then transferred to the following stages optically down the pipe for processing at a high-speed on-chip clock.
1D nSPARCL for Bandwidth Unbalanced Applications
The 1D nSPARCL has a basic internal chip-to-chip bandwidth of O(N^2) and an external I/O bandwidth of O(N). Because of this bandwidth mismatch, applying the system to general-purpose problems takes a certain amount of effort. However, the system fits nicely the problems that require only modest external bandwidths [O(N)] and internal bandwidths of O(N^2). For example, matrix-vector multiplication of an on-chip N × N matrix and an off-chip N-element vector is an application that requires only O(N) bandwidth externally and O(N^2) bandwidth internally. To do this multiplication we have the matrix residing in the second chip and load the N-element vector from the first chip in column parallel. Every data element of the vector is then broadcast to a row of the matrix in 1-to-N fanout. Another special example meeting these conditions is the video motion estimation described above. In the search for the best-matched block, the desired data block resides on the second chip and the search area scrolls over the first chip in column parallel. Every time that one column of the search area is loaded to the first chip, every column in the chip shifts laterally one column to the side and creates a new array of O(N^2) internally for the matching operation. There are other systems (e.g., neural networks) that have this type of unbalanced external-internal traffic and are suited to the 1D nSPARCL architecture.
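The matrix-vector example above can be sketched in software to make the bandwidth asymmetry concrete. This is a minimal illustration, not the hardware data path: the function name `broadcast_matvec` is hypothetical, and the column-wise summation of the partial products is an assumed reduction step that the text does not spell out.

```python
def broadcast_matvec(matrix, vector):
    """Multiply a resident N x N matrix by an N-element vector loaded from outside.

    Element vector[i] is fanned out to all N PEs of row i (1-to-N fanout),
    and each PE multiplies it by its resident matrix element. Summing the
    products down each column (assumed reduction step) gives
    y[j] = sum_i vector[i] * matrix[i][j].
    """
    n = len(matrix)
    if any(len(row) != n for row in matrix) or len(vector) != n:
        raise ValueError("expected an N x N matrix and an N-element vector")
    # Broadcast step: N external words become N * N internal operands.
    products = [[vector[i] * matrix[i][j] for j in range(n)] for i in range(n)]
    # Reduction step (assumption): accumulate each column of partial products.
    result = [sum(products[i][j] for i in range(n)) for j in range(n)]
    external_words = n         # O(N) traffic across the chip boundary
    internal_operands = n * n  # O(N^2) operands consumed inside the array
    return result, external_words, internal_operands


y, ext, internal = broadcast_matvec([[1, 2], [3, 4]], [10, 1])
print(y, ext, internal)
```

Only the N vector elements cross the chip boundary, while the fanout supplies N^2 operands to the array, which is the unbalanced traffic pattern described above.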
C. Performance Comparison of SIMD, 1D nSPARCL, and 2D nSPARCL Systems
Integrated with input and output devices that support 2D parallel I/O's, the nSPARCL system can load and unload an entire p × q data block in a single clock cycle. The same technology used to create SPARCL can be used to make dense memory chips, data buffers, video relay systems, and network interface devices. 22-24 For this system the total processing time becomes
For each block the loading time is always a constant of 1 because it requires only a single clock cycle for loading and unloading, and the execution time is T_exe / n because the execution instructions are shared evenly over n stages. Figure 11 compares the processing speed S_pr of 1D nSPARCL, 2D nSPARCL, and SIMD systems when a 256 × 256 image is processed over various numbers of PE's up to 256 (= 16 × 16 array). In this simulation the same number of PE's is used for all three systems, and they perform the same task with an instruction length of 20 clock cycles. The four-stage case is assumed for both 1D and 2D nSPARCL's. In the simulation result, both the conventional SIMD and the 1D nSPARCL processing speeds S_pr tend to saturate as the number of PE's increases. This is so because the loading time dominates the system as the array size grows too large. In contrast, the processing speed of a 2D nSPARCL increases linearly with the number of PE's because the I/O bottleneck is eliminated and all the processors are dedicated to performing the application routine.
A commonly cited advantage of SIMD systems is their scaling properties. The larger the SIMD array, the more data elements can be processed simultaneously. Decreasing VLSI feature sizes allows for higher-density PE implementation and thus for larger processing arrays per chip. Ideally the processing speed per chip, defined as the number of data elements processed divided by the processing time, increases linearly with the processing array size. However, because the processing time includes the time required for loading the data into the processing array, processing the data, and then unloading the data, the processing speed is also sensitive to the data I/O bandwidth of the chip. The fundamental problem of data I/O in conventional SIMD systems is the mismatch between the 2D nature of the processing array and the 1D nature of the data I/O ports of electronic buses. Ideally the computation bandwidth increases proportionally to the processor array size. However, 2D data fields enter the processing array in a row-parallel format along the edge of the array and flow into the array on the mesh network. As a result, as the PE array size grows in O(N^2), the I/O bandwidth grows only in O(N). This causes an I/O bottleneck as the PE array size grows. Consequently, it greatly reduces the overall system throughput and limits the SIMD system array size. The 1D nSPARCL deals with the problem by hiding the memory latency through prefetching. However, this helps only when the lengths of the loading-unloading cycles and the execution cycles are comparable. As the PE array size grows, the length of the loading-unloading cycle becomes much larger than that of the execution cycle. The data I/O then overwhelms the system, and the memory latency again dominates the system performance. This occurs because of the fundamental limit on I/O bandwidth. On the other hand, the 2D nSPARCL I/O bandwidth grows in O(N^2) and thus scales with the size of the PE array.
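The scaling argument above can be checked with a small sketch that counts loading cycles and boundary bandwidth for edge (1D) and 2D parallel I/O. The function names are hypothetical, and the model assumes that edge I/O admits exactly one line of p elements per clock cycle, so a p × q block needs q cycles.

```python
def load_cycles(p: int, q: int, two_d_io: bool) -> int:
    """Clock cycles needed to load one p x q data block.

    Edge (1D) I/O: one line of p elements enters per cycle (assumed), so q cycles.
    2D parallel I/O: the entire block enters in a single cycle.
    """
    if p < 1 or q < 1:
        raise ValueError("p and q must be positive integers")
    return 1 if two_d_io else q


def io_bandwidth(p: int, q: int, two_d_io: bool) -> int:
    """Data elements crossing the array boundary per clock cycle."""
    return (p * q) // load_cycles(p, q, two_d_io)


# For an N x N array, edge bandwidth grows as N while 2D bandwidth grows as N^2.
for n in (4, 8, 16):
    print(n * n, io_bandwidth(n, n, False), io_bandwidth(n, n, True))
```

With 256 PE's (a 16 × 16 array), this model gives 16 elements per cycle for edge I/O against 256 for 2D I/O, which matches the O(N) versus O(N^2) growth described in the text.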
So far we have compared the systems under the assumption of the same number of PE's. However, because the yield of a VLSI chip decreases as the die size increases, it would be difficult to build a large SIMD chip. Because the SPARCL system decomposes a large SIMD array into multiple SIMD stages, it presents an opportunity for building a multiprocessor system with a large number of PE's distributed over several stages.
Fig. 11. Comparison of processing speed (in terms of pixels/clock cycle) with the number of processing elements for SIMD, 1D nSPARCL, and 2D nSPARCL systems. The 2D nSPARCL eliminates the data I/O bottleneck by performing 2D parallel data I/O with input and output devices.
Conclusion
We have described an optoelectronic VLSI architecture for a SIMD computing system, the SPARCL. The device uses novel hybrid CMOS-MQW smart-pixel technology. We constructed an experimental system for testing the devices as well as for demonstrating the system. This prototype system utilizes BIA for general-purpose morphological image processing. We have demonstrated applications of the system to image edge detection and estimation of digital video motion. We compared the performance of the conventional SIMD machine and 1D and 2D nSPARCL systems under the assumption that the total number of PE's in the systems was the same. The results illustrate that, given the same task, the nSPARCL system outperforms the SIMD system in terms of processing time, bus utilization, and processing speed. The nSPARCL system also has several system-level advantages: scalability to optimize the computation efficiency, flexibility in hybrid speeds and multiple-instruction operation, and suitability for constructing systems with large numbers of PE's. The optoelectronic VLSI technology has the potential to improve the performance of multiprocessor computing systems significantly. However, major efforts are still needed for the integration of efficient, reliable, and cost-effective systems.
