Abstract
Introduction
Low cost video cameras and advanced telecommunications technology enable many new services, such as electronic video mail and computer-based teleconferencing. Evolving compression standards (e.g., MPEG) and inexpensive disk storage allow these electronic exchanges to be treated much as e-mail is used today. Cellular phone-based wireless technology provides low cost communication in the field. However, acquiring, transmitting, and manipulating this information presents a computational requirement beyond the capabilities of existing systems. Increasing user demand for portable, on-the-move videoputing (video + computing) and teleputing (telecommunications + computing) systems places additional requirements on power, size, and weight.
General purpose microprocessors offer inexpensive and versatile processing elements for such portable imaging systems. However, these new image processing applications demand higher processing rates (10-1000 Gops/sec) than can be provided by commercial microprocessors (0.1-0.5 Gops/sec). Dedicated ASICs (Application Specific Integrated Circuits) can provide the needed performance and efficiency, but they lack the flexibility needed for varied application requirements. Unfortunately, many portable imaging applications (image enhancement, recognition, and compression) have requirements not met by either of these processing alternatives.
Alternatively, techniques for integrating optoelectronic (OE) devices, analog interface circuitry, and digital logic have enabled new approaches for image collection and processing. Monolithic systems incorporating focal plane arrays offer high I/O bandwidth with modest levels of dedicated analog or digital processing capability. Beginning with Mahowald and Mead's silicon retina [13], on-focal-plane processing has increased in complexity from simple logic gates to latches [11] to 2-bit registers and counters [9]. Analog processing alternatives have demonstrated even greater operational complexity using passive and active networks. These systems strive to achieve high fill factor detector arrays combined with the maximum computing capability that can effectively be incorporated nearby. These image processing solutions are compact and efficient, but lack computing power and flexibility.

| | General Purpose Microprocessor | Videoputer Processor | Dedicated ASIC |
|---|---|---|---|
| performance | low | high | high |
| cost | low | moderate | moderate |
| flexibility | high | high | very low |
| efficiency | moderate | high | high |

Table 1: Characteristics of microprocessors, dedicated ASICs, and an ideal videoputer processor.
The ideal architecture (Table 1) must balance the key characteristics for these applications. It must provide high processing performance that scales with Si VLSI technology advances, while achieving high chip efficiency (Mops/sec/mm^2). Low cost must be realized through high efficiency and flexibility, where a single system can address many image processing tasks. System power, size, and weight must support portable operation. Image I/O must exploit OE devices to provide low cost and high performance. This system has not yet been realized, but a successful solution can have an impact comparable to the introduction of the personal computer, video camera, or FAX machine.

This paper summarizes the most promising architectural approaches for videoputer applications. Some example implementations being pursued at Georgia Tech illustrate these architecture classes. Section 2 describes the fundamentals of processing node organizations. Section 3 outlines the approach of systolic architectures. Sections 4 and 5 present SIMD and Message Passing MIMD computing techniques. Finally, Section 6 concludes with directions for future research.
Processing Architecture Organization
Before exploring different approaches to smart pixel architectures, the components used to build them need to be defined. Figure 1 illustrates the key elements of all digital processing nodes. The datapath contains the most familiar elements of computation: adders, subtractors, multipliers, shifters and logical units as well as registers to hold operands as they are being processed. This is where the work required by an application is performed, and all processors, general or special purpose, must have a datapath. Image processing datapaths often include more specialized functional components, such as a multiply-accumulate unit, to better support common image processing operations.
An I/O unit is required to input image data to the datapath, and output results of the computation back to the outside world. This unit is particularly significant given the high I/O data rates demanded in image processing systems. Today's desktop workstation typically operates with less than 10 Mbps I/O; a portable image processor might require 10 to 100 times as much I/O.
Since input and intermediate data cannot all fit in datapath registers, additional data memory is required. This is analogous to memory in a workstation. However, image processing applications tend to take more of their operands from I/O and require significantly less data storage. Since data memory represents a significant resource cost in computers, the reduction of data memory (1000X or more) can translate to a more efficient system implementation.
Instruction control and program memory are required for all programmable systems. While one computational model presented here, systolic arrays, does not include these components, they are part of nearly every digital computer. Image processing systems can employ several organizations for program control. But the typically shorter, more compact application programs can also be exploited for more powerful, efficient system implementations.
Finally, the network provides a medium for many processing nodes to communicate. This is necessary if nodes are to work together on a common task. Inter-node communication must be high bandwidth and low latency or overall performance suffers. Aggregate network bandwidths in Tbps (1000 Gbps) are sometimes required. Integrated OE smart pixel arrays can play a role in the realization of these networks as well as in image I/O. These elements provide the building blocks of many smart pixel-based videoputer architectures. The following sections describe a few of these promising architectures.
Systolic Array Architectures
Systolic architectures first became popular in the late 1970s as an architectural approach to exploit the growing potential of VLSI technology. H. T. Kung [10] and Charles Leiserson [12] were early proponents of this execution model for extremely efficient implementation of systems that solve computationally intensive applications. More transistors per chip support system designs with increased functionality, leading to greater I/O and inter-cell communication requirements. Communication costs are typically high in execution time, power dissipation, and chip area. To reduce these communication penalties, and to reduce the complexity of designing the system, systolic design incorporates regular cell structures that communicate over short distances. The design cost is further minimized by reusing regular cell structures rather than designing new components. The key characteristics of systolic designs include modular cells, short communications, scalability, and concurrency.

Figure 2 illustrates a systolic array that computes the multiplication of banded matrices. Each hexagonal node includes a simple datapath containing a multiplier and adder, plus clocked registers to regulate data flow between nodes (shown as arrows). On every cycle, each node computes the product of the received input matrix elements and adds it to the rising result matrix element. These systolic nodes include no data or program memory, and have an elementary network and I/O.

Systolic processing systems are the most efficient in terms of resource usage, but their lack of programmability restricts their flexibility. Efforts to produce programmable systolic arrays (e.g., the CMU WARP [1] [8]) produced systems more akin to MIMD architectures (see Section 5) than those described here. Systolic architectures are well suited for dedicated high throughput computation such as image compression. However, cost and performance comparisons must be made between systolic systems and more flexible architectural approaches.
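The cell behavior described above (multiply the two arriving operands, add the product to a running result, and pass operands to neighbors through clocked registers) can be sketched in software. The following Python sketch is only an illustration under stated assumptions: it simulates a square mesh with stationary accumulators operating on dense square matrices, which is simpler than the hexagonal banded-matrix array of Figure 2, and the function name systolic_matmul is hypothetical.

```python
def systolic_matmul(a, b):
    """Cycle-by-cycle simulation of an n x n systolic mesh (assumed layout).

    Rows of `a` enter from the west edge and move east; columns of `b`
    enter from the north edge and move south. Inputs are skewed by one
    cycle per row/column so matching operands meet in the right cell.
    """
    n = len(a)
    acc = [[0] * n for _ in range(n)]    # stationary result in each cell
    a_reg = [[0] * n for _ in range(n)]  # operand registers, moving east
    b_reg = [[0] * n for _ in range(n)]  # operand registers, moving south

    for t in range(3 * n - 2):           # cycles until the last operands meet
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if j == 0:               # west edge: skewed input of row i
                    k = t - i
                    new_a[i][j] = a[i][k] if 0 <= k < n else 0
                else:                    # interior: take from west neighbor
                    new_a[i][j] = a_reg[i][j - 1]
                if i == 0:               # north edge: skewed input of column j
                    k = t - j
                    new_b[i][j] = b[k][j] if 0 <= k < n else 0
                else:                    # interior: take from north neighbor
                    new_b[i][j] = b_reg[i - 1][j]
        a_reg, b_reg = new_a, new_b      # all registers clock together
        for i in range(n):
            for j in range(n):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

Every cell performs the same multiply-add on every cycle and talks only to its immediate neighbors, which is the property that keeps systolic cells small and their wiring short.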
The PAMSAC Architecture
Figure 3 shows the layout of a pattern matching systolic architecture being implemented at Georgia Tech. PAMSAC incorporates direct optical input of image data via eight on-chip Si detectors and amplifiers. This chip, which has been implemented through the MOSIS foundry in 2.0 µm CMOS, simulates in IRSIM at 33 MHz. The digital logic of the systolic core has been fully tested; testing of the interface to the OE devices is currently in progress. Figure 4 illustrates the block diagram of the PAMSAC chip. The simplified logic operation of a systolic cell consists of an XNOR gate and an AND gate that together detect perfect pattern matches. This systolic design methodology has simple, modular logic cells with high concurrency and local interconnection.
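The cell logic described above (an XNOR to compare two bits and an AND to combine the comparison with the running result) can be illustrated functionally. This Python sketch assumes a one-dimensional bit pattern and models only the logic, not the actual PAMSAC data flow or timing, which are not detailed here; the function names are hypothetical.

```python
def match_cell(pattern_bit, data_bit, match_in):
    """One cell: XNOR compares the bits, AND folds in the incoming result."""
    xnor = 1 - (pattern_bit ^ data_bit)  # 1 when the two bits are equal
    return match_in & xnor


def perfect_match(pattern, window):
    """Chain one cell per pattern bit; the output is 1 only if all bits agree."""
    match = 1
    for p, d in zip(pattern, window):
        match = match_cell(p, d, match)
    return match


def find_matches(pattern, stream):
    """Slide the pattern over the bit stream and report matching offsets."""
    m = len(pattern)
    return [i for i in range(len(stream) - m + 1)
            if perfect_match(pattern, stream[i:i + m])]


offsets = find_matches([1, 0, 1], [1, 0, 1, 0, 1, 1])
```

In hardware the cells of the chain operate concurrently rather than in a loop; the sequential loop here only reproduces the logical result.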
SIMD Architectures
A more flexible architectural approach, compared with systolic arrays, includes programmable digital processors. Yet commercial microprocessors are ill-suited to videoputer applications because of their limited performance and low resource efficiency. They provide too much generality and functionality that is not required in image processing.
A more promising computational model, SIMD or Single Instruction stream, Multiple Data stream, replicates the datapath, data memory, and I/O to provide high processing performance with low node cost. Figure 5 illustrates this configuration. SIMD systems often employ thousands of processing elements. The cost of the control unit is amortized across all of the processing elements.
Although a single program is being executed, each instruction is executed simultaneously on many nodes. This execution model is especially well suited to early image processing, when a subroutine must be applied to every region of an image. While a commercial microprocessor must iterate sequentially across an image, a SIMD architecture can process the entire image in a single iteration.

The SIMPil Architecture
While SIMD systems have been used for image processing before, the implementations have been large and expensive. The MPP [2], CM-2 [14], MasPar [3], and the GAPP [18] are examples of general purpose SIMD systems capable of performing image processing applications. However, these systems achieve performance and generality at the expense of focal plane I/O coupling and physical size. Other systems, including the Scan Line Array Processor (SLAP) [6], exploit the frame scanning used in video cameras by operating on sequential scan lines. But serial loading and unloading of image data limits frame rates. A more specialized architecture can provide the same high levels of performance in a portable system.
The SIMPil system being developed at Georgia Tech [4] [5] [15] incorporates a specialized SIMD architecture with an integrated array of optoelectronic devices. A 1300 nm optoelectronic link allows through-silicon wafer input of digital image data from a detector plane stacked above the processing plane, as shown in Figure 6. By reducing the image transfer bottleneck found in decoupled detector-processor systems, high frame rates are possible without constraining processing power. Processing area does not impact the detector array fill factor.

The block diagram of a SIMPil node is displayed in Figure 7. The figure also illustrates how a single node interfaces to a subarray of detectors, and how the nodes are connected to one another in a mesh network to operate in SIMD mode. Each node includes a traditional RISC load/store datapath plus an interface to the detector array via an OE data channel.

Initially, an 8-bit datapath SIMPil node was implemented. It includes an 8-word register file, an arithmetic logic unit, a shift unit, a 16-bit multiply-accumulator (MACC), and a 64-word local memory. The instruction set architecture (ISA) provides for arithmetic operations including addition, subtraction, multiplication, and multiply-accumulation. The multiply-accumulate (MACC) instruction is included because of its utility in image processing applications. For example, the MACC operation reduces the partial convolution of a 3 × 3 sub-image from 17 to 9 operations. The 16-bit accumulator in an 8-bit datapath improves precision, especially when using fixed-point operands. The logic unit allows bitwise AND, OR, and exclusive-OR operations. Logical, arithmetic, and rotate shift operations are performed in the shift unit. Register-to-register and immediate addressing modes are supported by the dyadic operations. Local memory is accessed via the load and store instructions.
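The operation count quoted for the 3 × 3 convolution can be checked with a short sketch. It assumes that the 17 operations are 9 multiplies plus 8 additions, and that each MACC replaces one multiply-add pair. The function names and the example kernel are illustrative and are not taken from the SIMPil ISA.

```python
def conv3x3_separate(window, kernel):
    """3 x 3 weighted sum using separate multiply and add operations."""
    ops = 0
    total = None
    for w_row, k_row in zip(window, kernel):
        for w, k in zip(w_row, k_row):
            product = w * k
            ops += 1                 # one multiply
            if total is None:
                total = product      # the first product needs no addition
            else:
                total += product
                ops += 1             # one add
    return total, ops


def conv3x3_macc(window, kernel):
    """Same weighted sum using one multiply-accumulate per kernel tap."""
    acc = 0
    ops = 0
    for w_row, k_row in zip(window, kernel):
        for w, k in zip(w_row, k_row):
            acc += w * k             # one MACC
            ops += 1
    return acc, ops


window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]   # example Laplacian kernel
separate_result, separate_ops = conv3x3_separate(window, kernel)
macc_result, macc_ops = conv3x3_macc(window, kernel)
```

Both routines return the same sum; only the instruction count differs.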
Each SIMPil node interfaces to an array of thin film detectors. The instruction set architecture (ISA) allows for up to 256 addressable detectors. Each node also includes analog to digital circuitry to convert light intensities to digitally equivalent values. The ISA has a SAMPLE instruction that synchronously captures light intensities at each detector. The SIMD execution model allows the entire image to be sampled by the system synchronously. Once the detector array has been digitized, it can be processed by the SIMPil node in data parallel fashion.
Low level image processing applications, such as edge detection, are usually point algorithms needing only pixel values in a small neighborhood around the data point. This pixel access locality is well supported by a nearest neighbor or mesh network. SIMPil nodes communicate through a nearest neighbor NEWS (north, east, west, and south) network using NEWS registers in the datapath.
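A nearest neighbor exchange of this kind can be sketched as follows. The sketch assumes one value per node, zero values beyond the array boundary, and a 4-neighbor Laplacian as the edge operator. None of these details are specified for SIMPil here, and the function names are hypothetical.

```python
def news_shift(grid, direction):
    """Each node receives the value held by its neighbor in `direction`.

    Nodes on the array boundary receive 0 (an assumed boundary condition).
    """
    rows, cols = len(grid), len(grid[0])
    dr, dc = {"N": (-1, 0), "E": (0, 1), "W": (0, -1), "S": (1, 0)}[direction]
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                out[r][c] = grid[nr][nc]
    return out


def laplacian_edges(grid):
    """4-neighbor Laplacian built from four NEWS transfers per node."""
    north, east, west, south = (news_shift(grid, d) for d in "NEWS")
    rows, cols = len(grid), len(grid[0])
    return [[north[r][c] + east[r][c] + west[r][c] + south[r][c]
             - 4 * grid[r][c]
             for c in range(cols)]
            for r in range(rows)]


image = [[0, 0, 0],
         [0, 1, 0],
         [0, 0, 0]]
edges = laplacian_edges(image)
```

In a SIMD machine the four transfers and the arithmetic are issued once by the control unit and carried out by every node in the same cycle, so the cost does not grow with image size.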
The SIMPil system is an embedded, programmable, focal-plane image processing system. The processing power of the SIMPil node will surpass the computational needs of a single pixel. However, desired frame rates may not be achieved if the number of pixels assigned to a node is too large. Simulations of image processing applications suggest a good balance of 36 to 64 pixels per SIMPil node (with 50 MHz node frequencies). Our prototype target is 64 pixels per SIMPil node.
Using current VLSI technology, between 16 and 64 SIMPil nodes can be fabricated on a single Si VLSI chip. By tiling an array of 16 chips each containing 16 nodes, a 128x128 pixel resolution is achieved. The aggregate total for this system is 16,384 pixels and 256 SIMPil nodes. Operating at 50 MHz, SIMPil can perform 781 Kops/sec for each pixel. Eight bits is the minimum datapath width for pixels supporting 256 gray scale levels.
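The system totals quoted above follow from simple arithmetic, reproduced in the sketch below. It assumes 64 pixels per node (the prototype target) and one operation per node per clock cycle, which is what the 781 Kops/sec figure implies. The variable names are illustrative.

```python
from math import isqrt

chips = 16                 # chips tiled in the array
nodes_per_chip = 16        # SIMPil nodes per chip
pixels_per_node = 64       # prototype target
clock_hz = 50_000_000      # 50 MHz node frequency

total_nodes = chips * nodes_per_chip
total_pixels = total_nodes * pixels_per_node
side = isqrt(total_pixels)                        # edge of the square image

# Assumes one operation per node per cycle, shared among that node's pixels.
ops_per_pixel_per_sec = clock_hz / pixels_per_node

print(total_nodes, total_pixels, side, ops_per_pixel_per_sec)
```

Lowering the pixel count per node to 36, the low end of the range suggested by simulation, raises the per-pixel rate at the cost of more nodes for the same resolution.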
This demonstration is currently being developed for use in videoputing systems, such as high speed smart cameras. This prototype addresses issues in multidisciplinary interfacing by incorporating an integrated thin film detector, on-chip analog interface circuitry, and a powerful digital processor on a single Si CMOS chip. To illustrate the effectiveness of the SIMPil processing architecture, several image processing operations are demonstrated, including edge detection, convolution, and image compression. The silicon area efficiency of this type of processing node is compared with general purpose commercial microprocessors. Figure 8 is a photomicrograph of a prototype SIMPil node fabricated through the MOSIS foundry in 0.8 µm CMOS. This prototype has been fully tested and a second generation node is currently being designed. Image processing applications such as vector quantization compression have been implemented for SIMPil [7].
Message Passing MIMD Architectures
MIMD (Multiple Instruction stream, Multiple Data stream) architectures provide the most general computational model. Each processing node is an autonomous computing agent including a datapath, control, and memory. A system consists of a collection of nodes, each executing a different program, connected by a network through which nodes communicate. This organization resembles a room full of connected workstations. But the high throughput, low latency communications, and optimized synchronization mechanisms allow the processing nodes to work more closely on a common task. Figure 9 illustrates the organization of a MIMD architecture. This form of execution offers the greatest generality and the lowest efficiency. Today's commercial supercomputers from Cray (T3D) and IBM (SP2) employ MIMD organizations based on commercial microprocessors. Image processing applications require less generality and storage, and can be effectively executed on MIMD nodes occupying a fraction of a chip.
Optoelectronic technology can enable this type of system in two ways. It can provide the same tightly coupled focal plane image I/O employed in SIMD systems. The same smart pixel arrays can also provide a dense, high throughput communications network for connecting processing nodes. The details of one such system are described in [16].
The Pica Architecture
The Pica execution architecture is designed for handling high message traffic consisting of small, ephemeral tasks. In order to achieve acceptable efficiency in this fine-grain domain, parallel overhead must be reduced to the minimum achievable level. Complex mechanisms to support general purpose applications are replaced by simpler, lower cost mechanisms for high-throughput problems.
The Pica execution architecture is designed specifically for high-throughput, low-memory operation. The design of a Pica node begins with a minimal sequential core architecture. Pica provides low overhead support for communication, synchronization, naming, and task and storage management. A small amount of memory (4096 36-bit words) and a network interface/router complete the node. This node complexity can be implemented using a fraction of the transistors available on a chip in current technology. This allows multi-node chips; the prototype chip will contain four nodes.
The Pica architecture is designed to form a dense, three dimensional computational array for processing high-throughput data streams. While less general than other MIMD architectures, it is more efficient for this application area. The execution model supported by Pica is more flexible than other high-throughput architectures (e.g., systolic arrays, static dataflow).

The basic functional blocks of the Pica microarchitecture are shown in Figure 10. The network router routes messages through the node, forming that node's contribution to the communication network. The router implements a simple adaptive routing strategy based on current local virtual-channel allocation. The network interface buffers incoming messages and signals the context manager that a context is required. When it obtains access to local memory, the network interface writes the message contents directly into the allocated, fixed-length context. The datapath consists of a 32-bit integer ALU and shifter, and special-purpose registers. Operands are accessed from a 32-word context cache, which supports two read accesses and one write access on each cycle. The instruction unit fetches and decodes instructions for execution. In order to keep design complexity and task swapping overhead low, the datapath implementation is not pipelined. The context manager serves three functions: (1) it maintains a queue of suspended and ready tasks for execution, (2) it allocates task storage for incoming messages and deallocates storage as the tasks complete, and (3) it arbitrates requests by both the network interface and cache controller for control of the local memory bus.
Plans for a full-scale prototype are underway, but are still preliminary. In a target prototype system, each processing node will contain 4096 36-bit words of local memory (256 contexts), a 32-bit integer processor, and a network interface. The target node performance is 50 MIPS. A chip contains four Pica nodes and 3.2 Gbits/sec I/O bandwidth. The full-scale prototype will employ a 2.5 V supply voltage in addition to other low power techniques to keep total chip power below 500 mW. In a full-scale system (4096 nodes) employing through-wafer optoelectronic interconnect, a processing plane contains 64 chips (256 nodes, 12,800 MIPS) and measures approximately 10 cm by 10 cm. Sixteen planes contain 1024 chips (4096 nodes, 204,800 MIPS) and fit inside a cube 10 cm on a side. 820 Gbits/sec of system I/O bandwidth is available from chips on the top and bottom surfaces of the cube. Sides of the cube are available for power and cooling mechanical connections.
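The node, chip, and plane counts above can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures quoted in the text. The words-per-context value additionally assumes that the whole local memory is divided evenly among the 256 contexts, which the text does not state explicitly. The variable names are illustrative.

```python
nodes_per_chip = 4
chips_per_plane = 64
planes = 16
mips_per_node = 50
memory_words = 4096        # 36-bit words of local memory per node
contexts_per_node = 256

nodes_per_plane = nodes_per_chip * chips_per_plane
plane_mips = nodes_per_plane * mips_per_node

system_chips = chips_per_plane * planes
system_nodes = system_chips * nodes_per_chip
system_mips = system_nodes * mips_per_node

# Assumption: local memory is split evenly across all contexts.
words_per_context = memory_words // contexts_per_node

print(nodes_per_plane, plane_mips, system_chips, system_nodes, system_mips)
```

The peak figures scale linearly with node count, so they describe an upper bound that applies only when every node has work to execute.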
Future Directions
Architectural research focuses on how available technologies can be combined to solve problems in a more effective way. The most significant advances have come when new enabling technology is harnessed to address a broad consumer need. Smart pixel systems enable a new class of portable image processing systems. The examples presented here demonstrate the potential of these new products.
