The Image Understanding Architecture (IUA) is a tightly coupled heterogeneous parallel processor that has been specifically designed for real-time knowledge-based machine vision applications. In particular, the second generation of the IUA is designed to address the demands of the ARPA Unmanned Ground Vehicle (UGV) program, in which a small military truck is being developed to carry out scouting missions with minimal human intervention. The UGV carries a large suite of sensors for both navigation and surveillance, and the IUA is designed to process the data from these sensors and provide guidance to the actuator control systems of the vehicle. As such, it represents one of the first parallel processors to be designed from the ground up for real-time interpretation of complex natural scenes. The purpose of this paper is to summarize the architectural features of the IUA that resulted from the specific real-time requirements imposed by the UGV.
Introduction and Origin of the IUA
The first generation of the IUA [7] was developed as a proof-of-concept prototype (shown in Figure 1) to demonstrate the feasibility of constructing a heterogeneous parallel processor for machine vision that tightly couples three different parallel architectures. Our preliminary analysis of general real-time image understanding applications [8] [...] parallel processor for high-level vision. However, we concluded that the standard I/O channels on commercially available parallel machines of these types were unable to meet our real-time requirements. In addition, the available arrays were mostly designed for scientific computation and thus were suboptimal in certain respects for vision.

[...] are tightly coupled by a layer of dual-ported memory. The CAAPP is a bit-serial array with both a standard mesh communication network and a reconfigurable mesh. In the first generation prototype there are 64 processors per chip, each with 320 bits of on-chip cache and 32K bits of off-chip main memory. The ICAP is designed around a digital signal processor (DSP) and a custom communication network. The main memory for each CAAPP chip is dual-ported with one of the ICAP processors. Thus there is essentially no latency for transferring data between the CAAPP and ICAP other than the time for the CAAPP processors to write back to the memory from their caches. In this prototype, which is just a small portion of the planned array, the high-level MIMD processor is simply implemented by a uniprocessor that also serves as the host workstation. The main memory for the ICAP is dual-ported and its second port is directly accessible to the high-level processor via a VME bus. This organization is diagrammed in Figure 2.
Real-Time Requirements for the Second Generation IUA
After the first generation IUA prototype was demonstrated, we were tasked with developing a second generation of the system for use in the UGV program. The testbed vehicles for the UGV program are Army HMMWV trucks with actuators to automate steering, brakes, and throttle. The original specification included a large sensory suite divided into navigation and RSTA (Reconnaissance, Surveillance and Target Acquisition).
The navigational sensors included wide-field color stereo cameras, a narrow-field monocular color camera, stereo wide-field FLIR and narrow-field monocular FLIR. The wide-field stereo cameras are intended to identify obstacles in the foreground of a scene, while the narrow-field camera is to look ahead for navigational path planning. The color cameras are for day driving and the FLIRs are for night driving. Initial plans also called for cameras at the rear of the vehicle for operation in reverse gear. The RSTA package was initially specified as having a similar complement of sensors but on a steerable turret. In addition, the narrow-field cameras were specified with zoom lenses. One additional sensor that was specified only for the navigation package is a laser range imager.
With navigation and RSTA operating simultaneously during the day, there could potentially be the equivalent of 21 image streams of 512 by 512 8-bit pixels entering the system at the same time. It was not expected that all of these would need to be processed at once, but that it might be necessary to operate on regions of interest from several of the sensors at the same time, and possibly to switch rapidly between them.
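To give a sense of the scale implied by these figures, the following back-of-the-envelope sketch computes the aggregate input volume. The stream count and pixel dimensions come from the text; the rate of 30 frames per second is an assumption made only for illustration, since no frame rate is stated here.

```python
# Aggregate input volume for the worst-case daytime sensor load.
STREAMS = 21            # equivalent image streams (from the text)
WIDTH = HEIGHT = 512    # pixels per side (from the text)
BYTES_PER_PIXEL = 1     # 8-bit pixels (from the text)
ASSUMED_FPS = 30        # assumption: the text does not give a frame rate


def bytes_per_frame_time(streams=STREAMS):
    """Bytes arriving across all streams during one frame time."""
    return streams * WIDTH * HEIGHT * BYTES_PER_PIXEL


def bytes_per_second(streams=STREAMS, fps=ASSUMED_FPS):
    """Aggregate input rate in bytes per second at the assumed frame rate."""
    return bytes_per_frame_time(streams) * fps


if __name__ == "__main__":
    print(bytes_per_frame_time())       # 5505024 bytes, i.e. 5.25 MiB per frame time
    print(bytes_per_second() / 1e6)     # roughly 165 MB/s at the assumed rate
```

Under that assumed frame rate the full suite would deliver on the order of 165 MB per second, which is why the text anticipates operating on regions of interest rather than on every stream at once.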
Given the stopping distance for the HMMWV at various speeds and the detection range of the navigational sensors for an object large enough to damage the vehicle, it was determined that the tolerable latency of the obstacle detection function (from image input to actuator activation) is approximately one half second at slow speeds and in some cases only 0.1 second. The steering system was expected to require a control rate of 100 commands per second to achieve smooth turns, but the higher-level commands emerging from the navigational system would be needed at a maximum rate of 10 per second. The other navigational functions were determined to have less critical latencies, and although RSTA operation while navigating is a goal, it was decided that obstacle avoidance would take priority. We were not given real-time requirements for RSTA functions when the vehicle is stationary, and so we took the 100 ms requirement as our most critical real-time constraint.
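The latency figures above can be related to distance traveled with simple arithmetic. The sketch below is illustrative only: the vehicle speeds are hypothetical values, not figures from the UGV specification, while the latencies and command rates are those given in the text.

```python
# Distance covered during the obstacle-detection latency window.
STEERING_RATE_HZ = 100      # low-level steering commands per second (from the text)
NAVIGATION_RATE_HZ = 10     # maximum high-level command rate (from the text)


def distance_during_latency(speed_kmh, latency_s):
    """Metres traveled between image input and actuator activation."""
    return speed_kmh * 1000 / 3600 * latency_s


def steering_commands_per_navigation_command():
    """Low-level steering commands issued for each high-level command."""
    return STEERING_RATE_HZ // NAVIGATION_RATE_HZ


if __name__ == "__main__":
    # Hypothetical speeds, paired with the two latencies quoted in the text.
    for speed_kmh, latency_s in [(18, 0.5), (36, 0.1)]:
        metres = distance_during_latency(speed_kmh, latency_s)
        print(f"{speed_kmh} km/h, {latency_s} s latency: {metres:.2f} m")
    print(steering_commands_per_navigation_command())
```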
Because the end-to-end system latency in the UGV depends on the amount of computation to be done, which we cannot predict in advance, our approach in designing the IUA is to minimize the sources of latency that are under our control and to maximize the computational capabilities of the system within certain size and cost constraints. In addition, we have focused on designing the hardware so that execution time can be easily estimated, and sources of unpredictable delay are avoided as much as possible.
Real-Time Image Input
For such a large suite of sensors it would seem that the major problem that confronts the architect is providing the necessary bandwidth. However, the real-time constraints make latency an even more pressing problem. In a processor such as the IUA, input latency is the time between a digitized pixel value leaving the camera and when it can be processed by the corresponding CAAPP processing element. Being a SIMD array, the CAAPP cannot start working on any of the pixels until the entire array is filled. However, the size of the CAAPP array is usually smaller than the size of an image, and so the array can be filled before an entire image has been output by the camera. The remainder of the image is folded onto the CAAPP array and the sections are processed in sequence. This folding is called virtualization because the programmer's model is for a virtual processor array that is the size of the image, and system hardware and software automatically manage the folding.
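The folding described above can be sketched as an index mapping. This is a minimal illustration that assumes a simple tiled folding, in which the image is cut into array-sized tiles and each tile becomes one virtual layer; the text does not specify the actual folding scheme used by the IUA, and the 128 by 128 array size in the usage example is hypothetical.

```python
def fold_pixel(row, col, array_rows, array_cols):
    """Map an image pixel to its physical processing element and tile.

    Returns ((pe_row, pe_col), (tile_row, tile_col)) under the assumed
    tiled folding: the tile index selects the virtual layer, and the
    offset within the tile selects the physical processing element.
    """
    tile_row, pe_row = divmod(row, array_rows)
    tile_col, pe_col = divmod(col, array_cols)
    return (pe_row, pe_col), (tile_row, tile_col)


def tiles_needed(image_rows, image_cols, array_rows, array_cols):
    """Number of array-sized sections processed in sequence per image."""
    rows = -(-image_rows // array_rows)   # ceiling division
    cols = -(-image_cols // array_cols)
    return rows * cols


if __name__ == "__main__":
    # Hypothetical 128 x 128 physical array holding a 512 x 512 image.
    print(tiles_needed(512, 512, 128, 128))   # 16
    print(fold_pixel(300, 5, 128, 128))       # ((44, 5), (2, 0))
```

With tiled folding, the first tile can be processed as soon as its rows have arrived, which is the property the text relies on to begin work before the camera has output the entire image.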
From this brief summary of the I/O environment for the IUA, we can see that the I/O subsystem for the IUA must be able to take digitized data from many sources, extract regions of interest from them, align the regions with the array, and fold the image data to correspond to the virtualization of the physical array, all with minimal latency. This means that the I/O subsystem must have direct access to the memory of the CAAPP array. The second generation CAAPP thus includes a second bank of main memory that is identical to the first bank, except that it is dual ported with the I/O subsystem instead of the ICAP. Data can be input to this memory bank at the same time that the CAAPP is processing other data in the same bank. We achieve this by using video RAM (VRAM) in which the direct access port is connected to the I/O subsystem, and the serial access port is connected to the CAAPP (the same is true for the bank that is shared with the ICAP). Because CAAPP accesses to main memory take the form of a cache line load (or store) for all of the processors at once, the serial port of the VRAM is ideal for handling these transfers.

[...] the processor can subsample the frames that have been input in the meantime in order to catch up to the current frame. Or, if a moving object is identified and determined to be of interest in one frame, it may be possible to look back through previous frames in order to establish a trajectory for the object without having to wait for it to move further. Thus, the I/O memory serves as a temporal buffer for the array, and adds a considerable level of flexibility in scheduling real-time processing deadlines.
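The temporal-buffer behavior described above (subsampling to catch up, and looking back through earlier frames) can be modeled in software as a fixed-capacity ring buffer. This is only a conceptual sketch: the class name `FrameHistory`, its methods, and the capacity are hypothetical and do not describe the actual I/O memory hardware.

```python
from collections import deque


class FrameHistory:
    """Fixed-capacity history of recent frames; the oldest is overwritten."""

    def __init__(self, capacity):
        self._frames = deque(maxlen=capacity)

    def push(self, frame_id, data):
        """Store a newly input frame, discarding the oldest if full."""
        self._frames.append((frame_id, data))

    def latest(self):
        """Return the most recently input frame."""
        return self._frames[-1]

    def look_back(self, count):
        """Return up to `count` most recent frames, oldest first."""
        if count <= 0:
            return []
        return list(self._frames)[-count:]

    def subsample(self, stride):
        """Return every `stride`-th buffered frame, always ending at the
        newest one, so a late processor can catch up to the current frame."""
        newest_first = list(self._frames)[::-1]
        return newest_first[::stride][::-1]


if __name__ == "__main__":
    history = FrameHistory(capacity=4)
    for frame_id in range(6):
        history.push(frame_id, data=None)
    print([f[0] for f in history.look_back(2)])   # [4, 5]
    print([f[0] for f in history.subsample(2)])   # [3, 5]
```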
Low-Level Processing
Once image data is in the CAAPP, its job is to extract information and represent it in a symbolic form that can be used by the ICAP. This extraction involves operations such as detecting edges, finding straight lines, curves, regions, depth maps, surfaces, and overall statistics that describe the image. As a specific example, we might extract straight lines by first detecting edge pixels (edgels) and then iteratively grouping the individual edgels under a set of constraints that result in high contrast, long, straight lines. Those lines could be symbolically represented by a record structure with fields for the end points, slope, length, average contrast across the line, and a list of the adjacent regions.
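The symbolic line record described above might look like the following. This is a minimal sketch: the fields are those listed in the text, but the class name `LineRecord`, the field names, and the choice to derive slope and length from the end points are assumptions for illustration.

```python
import math
from dataclasses import dataclass, field


@dataclass
class LineRecord:
    """Symbolic description of one extracted straight line."""
    start: tuple                    # (x, y) of the first end point
    end: tuple                      # (x, y) of the second end point
    average_contrast: float         # mean contrast across the line
    adjacent_regions: list = field(default_factory=list)  # region identifiers

    @property
    def length(self):
        """Euclidean distance between the end points."""
        return math.hypot(self.end[0] - self.start[0],
                          self.end[1] - self.start[1])

    @property
    def slope(self):
        """Rise over run; infinite for a vertical line."""
        dx = self.end[0] - self.start[0]
        dy = self.end[1] - self.start[1]
        return math.inf if dx == 0 else dy / dx


if __name__ == "__main__":
    line = LineRecord(start=(0, 0), end=(3, 4),
                      average_contrast=12.5, adjacent_regions=[7, 9])
    print(line.length, line.slope)
```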
The second generation CAAPP is similar in design to the prototype in that it consists of bit-serial processors with both a standard and a reconfigurable mesh. The array is designed to [...]

Normally, data from a SIMD bit-serial array would be most easily stored in a manner that collects a set of bits from a group of processors corresponding to the width of the datapath to the main memory, and directly transfers them. Thus, if a 32-bit quantity is to be stored from every element in the array, the first bit from every value would be combined into a set of words and stored, then the second bit, and so on. Given that the data is shared with the ICAP, however, this approach would require the corresponding ICAP processor to rearrange the bits into words that are amenable to scalar processing. Such a rearrangement, called corner turning, is time consuming and would significantly increase the access latency between the two processors. The CAAPP is therefore designed to perform the corner turning automatically, and to handle the storage of CAAPP data words in sizes of 8, 16, and 32 bits (in a bit-serial processor, the data words can be any size, but these were chosen for compatibility with the ICAP). The corner turning hardware can also pack and unpack the smaller sizes within the 32-bit words that the ICAP fetches, to avoid further reformatting. The two models of SIMD memory datapath are contrasted in Figure 3.

[...] processor that operates on image array operands. A typical sequence of macroinstructions would be to load two registers with values from memory and multiply them, with the result being stored in a third register, and then store the register result back to memory. Each of these instructions would actually be expanded into (possibly hundreds of) bit-serial operations with the appropriate memory addresses and register numbers supplied from "parameter registers" that make up the macro-instruction word.
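The corner turning discussed in this section amounts to a bit-matrix transpose between a bit-plane layout and a word layout. The sketch below emulates it in software purely to make the two layouts concrete; in the CAAPP it is done by dedicated hardware, and the function names and the convention that processor i maps to bit i of each plane are assumptions.

```python
def to_bit_planes(values, width):
    """Bit-plane layout: plane b packs bit b of every processor's value,
    with processor i occupying bit i of the plane."""
    planes = []
    for b in range(width):
        plane = 0
        for i, value in enumerate(values):
            plane |= ((value >> b) & 1) << i
        planes.append(plane)
    return planes


def corner_turn(planes, count):
    """Transpose bit planes back into one scalar word per processor."""
    values = []
    for i in range(count):
        value = 0
        for b, plane in enumerate(planes):
            value |= ((plane >> i) & 1) << b
        values.append(value)
    return values


if __name__ == "__main__":
    pixels = [5, 3, 255, 0]              # one 8-bit value per processor
    planes = to_bit_planes(pixels, 8)
    print(planes[:3])                    # [7, 6, 5]
    print(corner_turn(planes, len(pixels)))   # [5, 3, 255, 0]
```

In the bit-plane layout a scalar processor would have to touch every plane to assemble a single value, which is the reformatting cost that the corner turning hardware removes from the ICAP.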
All of this is required for the ACU to be able to issue instructions to the array at the 10 MHz rate at which it can accept them. In addition, because the SPARC sometimes issues macro instructions in bursts, the Microengine incorporates a set of queues that help it to smooth out the rate at which it receives macro instructions and the rate at which it issues the expanded instruction stream to the CAAPP. This queuing requires a complementary set of queues for responses from the array, and a handshake protocol to ensure that branches in the Macroengine that are based on array reductions receive the proper values.
The reconfigurable mesh network in the CAAPP can be partitioned along boundaries of regions in an image [9] as shown in Figure 4. Within those partitions (called coteries), a single processor can broadcast to the rest of the processors in the coterie, or multiple processors can transmit with the result being a wired-OR of the bits that they sent. It is thus possible to carry out operations in the SIMD CAAPP that involve local broadcasts and reductions within all regions in parallel [5]. This mode of operation is termed "multiassociative" [10] because it mimics the operation of an associative processor within each region. The reconfigurable mesh also supports an efficient implementation of wormhole routing [2, 3]. It thus replaces much of the functionality of other networks that have been employed in SIMD processors but at a relatively low cost. In particular, it dramatically accelerates the common vision operation of labeling connected components (to 50 microseconds). The ability to perform multiple region-parallel operations in a frame time is a key part of achieving the real-time processing requirements of the UGV [4].
Figure 4. Isolated Processor Groups Corresponding to Regions in an Image
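The wired-OR reduction within coteries can be emulated sequentially to show what each processor observes. This sketch assumes the partition is given as a grid of coterie identifiers rather than as mesh switch settings, and the function name is hypothetical; the hardware performs the reduction in all coteries in parallel rather than by iterating.

```python
def coterie_wired_or(coterie_ids, bits):
    """Return, for every processor, the OR of all bits transmitted
    within its own coterie.

    coterie_ids and bits are 2D lists of the same shape; processors
    sharing an identifier are electrically connected.
    """
    combined = {}
    for id_row, bit_row in zip(coterie_ids, bits):
        for coterie, bit in zip(id_row, bit_row):
            combined[coterie] = combined.get(coterie, 0) | bit
    return [[combined[coterie] for coterie in id_row]
            for id_row in coterie_ids]


if __name__ == "__main__":
    regions = [[0, 0, 1],
               [0, 1, 1],
               [2, 2, 1]]
    sent = [[0, 1, 0],
            [0, 0, 0],
            [0, 0, 0]]
    # Only one processor in coterie 0 transmits a 1, so every member of
    # coterie 0 reads 1 while coteries 1 and 2 read 0.
    print(coterie_wired_or(regions, sent))
```

A single transmitting processor per coterie gives a local broadcast, and several transmitting at once gives a reduction, which is the pair of operations the text calls multiassociative.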
The second generation CAAPP has been simulated in software to facilitate software development and execution time estimation. The simulator includes a detailed simulation of the ACU, which has been verified against the operating prototype hardware ACU. The CAAPP is modeled from the actual custom chip design plus the memory hierarchy. To evaluate the performance of the CAAPP, we implemented the ARPA IU Benchmark using our C++ class library that provides for parallel execution of operations on image objects.
The C++ code for the benchmark was compiled with a version of the G++ compiler that we modified to recognize this class library and generate code for the ACU/CAAPP combination.
The ARPA IU benchmark is a complex task involving a wide range of image processing operations. Because we did not have access to the actual UGV algorithms (which were being specified and developed in parallel with the IUA development), we used the IU benchmark as a surrogate evaluation task with what we hoped would be a similar level of processing requirements. The benchmark takes two images of a synthetic scene, consisting of rectangles suspended in space, as input. One of the images is an 8-bit gray-scale image of a scene, and the other is a 32-bit floating point depth representation of the same scene.
The benchmark begins with a set of image processing operations designed to extract certain features from these images. This portion of the benchmark is referred to as the "low level" or "bottom up" portion of the benchmark. The following [...]

From these benchmark tests, we conclude that the CAAPP is able to satisfy the 100 ms latency criterion of the UGV for a reasonably complex task when operating on 128 by 128 pixel images. For larger images, it is necessary to trade off the complexity of the processing to achieve the real-time response goals.
Intermediate-Level Processing
For the second generation ICAP we employ the Texas Instruments TMS320C40 DSP ('C40) as the processing element. Although the function of the ICAP is largely to manage a symbolic database [1], the DSP architecture, with its explicitly managed cache and simple pipeline, permits the easy prediction of execution time, which facilitates real-time scheduling.
In addition the 'C40 offers six byte-wide communication channels that simplify the construction of a multicomputer. Because the ICAP is designed to have a maximum of 64 processors, we arrange the 'C40's in groups of four (called quadnodes) and pool their communication channels. This enables us to provide a direct link between every pair of quadnodes (using 15 channels) so that the time to communicate between quadnodes is both minimized and equalized. Eight of the nine remaining channels implement a ring within each quadnode, and the processors in a quadnode also share a bank of memory via a common bus. This organization is diagrammed in Figure 5 .
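The channel counts quoted above follow from simple arithmetic, which the sketch below reproduces. The processor, quadnode, and channel figures are from the text; counting two channel ends per processor for the ring is an interpretation of how eight channels form a four-processor ring, and the total link count is derived rather than stated in the text.

```python
MAX_PROCESSORS = 64            # maximum ICAP size (from the text)
PROCESSORS_PER_QUADNODE = 4    # 'C40s per quadnode (from the text)
CHANNELS_PER_PROCESSOR = 6     # communication channels per 'C40 (from the text)


def quadnode_channel_budget():
    """Account for how a quadnode's pooled channels are spent."""
    quadnodes = MAX_PROCESSORS // PROCESSORS_PER_QUADNODE
    pooled = PROCESSORS_PER_QUADNODE * CHANNELS_PER_PROCESSOR
    inter_quadnode = quadnodes - 1           # one direct link to every other quadnode
    remaining = pooled - inter_quadnode
    ring = 2 * PROCESSORS_PER_QUADNODE       # interpretation: each processor has two ring neighbors
    return {
        "quadnodes": quadnodes,
        "pooled_channels": pooled,
        "inter_quadnode_channels": inter_quadnode,
        "remaining_channels": remaining,
        "ring_channels": ring,
        "spare_channels": remaining - ring,
        "total_quadnode_links": quadnodes * (quadnodes - 1) // 2,
    }


if __name__ == "__main__":
    for name, count in quadnode_channel_budget().items():
        print(f"{name}: {count}")
```

The result matches the text: 16 quadnodes each pool 24 channels, 15 of which reach the other quadnodes, leaving nine, of which eight form the internal ring.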
The 'C40 provides two external memory buses and we use one of these for the quadnode shared memory. The other is reserved for local processing and includes the main memory of the corresponding CAAPP chip in addition to a bank of memory that is strictly local to the individual 'C40. This partitioning of the memory reduces timing variability by allowing local processing to take place independently of shared memory accesses. Only when a processor explicitly accesses shared memory does it risk an unpredictable timing delay due to bus contention. The shared memory is used for storing public versions of data in the database, and for downloading code from the host. We have analyzed the cost of transmitting messages between ICAP processors and found that in almost all cases it is most efficient to use the communication ports rather than the shared memory. Emulating message passing through shared memory requires several costly kernel calls as well as copying of the message to and from the primary memory. In contrast, the message passing protocol of the 'C40 can operate independently of the processor via the programmable DMA controller associated with each port, and the cost of handling the associated interrupt is lower than that of the kernel calls. The shared memory is thus used strictly for shared-data access.
The cost of sending a message in the ICAP is shown in Table 3. [...] Thus, we believe that the ICAP will be able to handle reasonably complex intermediate-level processing within the time constraints of the UGV program.
In addition to the communication channels, the ICAP design includes a 128-bit wide bus that provides for broadcast of database queries to the processors and for remote fetches from other quadnodes' shared memories, including an atomic read-modify-write to support synchronization. The bus interface is designed to provide most of these services without interrupting the ICAP processors themselves. For a database query, the interface provides a hardware queue that can be polled by the ICAP processors when they are ready to accept more work. A typical query can be broadcast with two bus transfers. A remote fetch is implemented by a finite state machine in the bus interface that acts like a fifth processor on the quadnode shared memory bus. It can fetch a single word or a block of up to 4 words with one request. The wide bus thus provides a low-latency, limited-bandwidth channel that complements the high-latency, high-bandwidth network formed by the 'C40's communication channels. The bus structure for a quadnode is shown in Figure 6.
High-Level Processing
High-level processing in the IUA involves the execution of multiple independent recognition agents that communicate via a shared structure called a blackboard. We chose to employ a commercial shared-memory multiprocessor for this level rather than construct a custom array. The ICAP quadnode shared memories are all mapped into segments of a single address space that is accessible to the multiprocessor through a standard VME bus.
In addition, the same memories are mapped to a second segment of the VME address space that is write-only, and implements a facility by which the high-level processor can broadcast code to all of the ICAP processors. This broadcast capability enables new code to be downloaded quickly in response to changes in the UGV's mission or overall situation.
It should be noted that the high-level processor is minimally involved in the control loop that depends on the 100 millisecond latency requirement. Most of the throttle, steering and braking commands can be derived from the results of intermediate-level processing. The high-level processor is responsible for maintaining a model of the environment that can be used in planning actions to be implemented over longer periods. For example, the intermediate-level processors may detect an obstacle in the road that results in application of the brakes, at which point the high-level processor takes the time to identify the obstacle and plan a path around it. Thus, we are not as concerned with fine-grained management of real-time issues in the high-level processor as in the other two levels.
Current Status
The hardware for the second generation IUA is nearly complete. The ACU, backplane, chassis, and one memory board have been assembled and tested. The custom CAAPP chips have been successfully fabricated, and the first processor board has been partially assembled and tested. Figure 7 provides a comparison between the processor boards of the two generations. All of the software for the CAAPP and most of the software for the ICAP have been developed and tested extensively using software simulators, a hardware ICAP emulator, and the ACU. Several applications have been developed using the simulators that partially demonstrate the ability of the system to meet its real-time requirements. However, due to funding cuts in the UGV program, development has stopped short of completing the first system. Because the components required to finish assembly are all available, we hope to be able to finish at least a partial system in the future.
Summary
The Image Understanding Architecture provides a novel combination of many features that are specifically designed to address real-time considerations. These range from its overall use of tightly coupled heterogeneous parallelism, to careful consideration of execution time variability throughout. It represents a total system approach to real-time image understanding in a challenging natural environment. We have estimated the performance of the IUA through a combination of detailed simulations and analytical techniques and believe that it is capable of supplying the necessary computational power to address UGV real-time processing tasks with complexity similar to that of the ARPA IU Benchmark. However, completion of the hardware and execution of actual UGV tasks would be necessary to fully explore its efficacy in that domain. This article has addressed only the major aspects of the IUA that help it to meet its real-time constraints. For more information about the general architectural characteristics of the IUA and its software development and execution environment, the reader should consult the references.
