Real-time image processing requires high computational and I/O throughputs obtained by use of optoelectronic system solutions. A novel architecture that uses focal-plane optoelectronic-area I/O with a fine-grain, low-memory, single-instruction-multiple-data (SIMD) processor array is presented as an efficient computational solution for real-time hyperspectral image processing. The architecture is evaluated by use of realistic workloads to determine data throughputs, processing demands, and storage requirements. We show that traditional store-and-process system performance is inadequate for this application domain, whereas the focal-plane SIMD architecture is capable of supporting real-time performance with sustained operation throughputs of 500-1500 gigaoperations/s. The focal-plane architecture exploits the direct coupling between sensor and parallel-processor arrays to alleviate data-bandwidth requirements, allowing computation to be performed in a stream-parallel computation model while data arrive from the sensors.
Introduction
The use of optoelectronic technologies within computing systems has been studied extensively during the past 15 years. Particular attention has been given to the performance advantages of introducing guided-wave and free-space optical interconnects at various levels of the interconnection packaging hierarchy of massively parallel-processing systems. In general, these studies focused on the technological advantages of optical interconnects in terms of density, bandwidth, and power considerations. Although these studies provided detailed information on the technological advantages of optical interconnects on a link-per-link basis, leading to many proof-of-concept system prototypes, a broader understanding of their impact on particular computational models and application classes is still needed to motivate the incorporation of these technologies into practical products.
Real-time image-processing applications form an application class that currently is in need of improvements in computational flexibility and performance efficiency through the use of optoelectronic system solutions. In particular, real-time image-processing systems must be capable of extracting and disseminating information from sensor data in a timely fashion. Although current coarse-grain systems that use portable digital-signal-processor (DSP) chips offer the generality to handle many signal-processing domains, they are severely restrained by data-throughput bottlenecks and are incapable of harnessing the data parallelism found in image-processing applications. Optoelectronic smart-pixel systems, on the other hand, provide higher performance efficiency per (silicon) chip area, 1-5 but they do so at the expense of computational flexibility.
A hybrid approach is thus needed that can take advantage of the computational generality of conventional DSP chips and the performance efficiency of smart-pixel systems. As is shown in this paper, focal-plane architectures based on a parallel single-instruction-multiple-data (SIMD) computational model have the potential to fill this gap. These architectures provide direct coupling between the detector arrays that capture the images to be processed and the processing elements (PE's) themselves. This setup alleviates data-bandwidth requirements because computation can now be performed over the image pixels as the data arrive in a stream-based fashion rather than having to fetch data from memory, as is done in conventional store-and-process systems.
Furthermore, the parallel fine-grain processor array offers high performance and modest generality in a portable, low-power environment. This parallel system maintains the processing efficiency of smart-pixel systems while supporting adequate computational flexibility for image-processing domains. Focal-plane-area I/O provides the necessary data throughput to keep the processor array busy.
To evaluate the focal-plane SIMD system described above, real-time image-processing application workloads must be examined for a particular architectural implementation of the system. To this end, we present the use of an instruction-level simulator for the focal-plane SIMD architecture in the evaluation of various hyperspectral image-processing applications. This application area is chosen because of the high demand on data throughput and computational workloads that it presents. In fact, current systems are limited to immediate data collection into storage media for later processing. A radical departure from the store-and-process system is necessary to realize real-time hyperspectral image processing. In addition, the nature of hyperspectral applications calls for portable, low-power processing systems (e.g., for remote-sensing applications). The parallel SIMD system provides a new balance among processing, storage, and communications to meet real-time requirements.
In this paper several key hyperspectral image-processing applications such as region autofocus, C-means classification, K-means clustering, vector quantization (VQ), and textural correlation that use the discrete Fourier transform (DFT) are studied to determine the proper system requirements for this new processing domain. Critical information on computational workloads, data throughputs, and memory storage is investigated for each application. This paper shows that data and computational throughputs constitute severe bottlenecks for current machines, whereas simulation results on the parallel SIMD architecture support real-time performance with sustained operation throughputs of 500-1500 gigaoperations/s (Gops/s).
The rest of the paper is organized as follows: Section 2 presents background information on hyperspectral imaging and processing systems. Section 3 describes a parallel SIMD architecture that uses focal-plane-area I/O. Section 4 describes the hyperspectral image-processing applications under consideration. Section 5 presents application analyses, extracting system requirements such as data throughput, processing workload, and memory storage. Conclusions are offered in Section 6.
Hyperspectral Imaging
Hyperspectral imaging provides the power to view the same spatial scene under different electromagnetic spectra. Imaging sensors obtain scene images in spectral bands from the visible and the near-IR to the long-wave-IR (0.2-12 μm) regions. The absorption and the reflectivity of different materials provide a unique spectral signature that is identifiable only at spectral bands beyond the visible range. Processing techniques generally identify and discriminate materials through these signatures.
Hyperspectral image processing, previously applied only in military surveillance and targeting applications, is now finding paths into other domains, such as manufacturing and security. In addition, a variety of remote-sensing applications including agricultural crop yield, mineral detection, and atmospheric cloud identification rely on hyperspectral-sensing capabilities for better substance discrimination. 6-9 Figure 1 illustrates a basic view of the hyperspectral-imaging concept. The spectral image slices form an image cube with a depth equal to the spectral resolution of the detector. A multispectral imager provides only a few (one to ten) broad and noncontiguous spectral channels. Hyperspectral imagers are more powerful, and they provide tens to hundreds of narrow, almost contiguous, spectral channels. Each spectral slice has spatial information in two dimensions (the X and the Y coordinates). Multiple cubes are combined into videolike frames with temporal correlation between each cube. Spatial and spectral analyses are performed on the image cube, creating chromatic, textural, or regional information to detect, identify, and classify materials or objects. In general, hyperspectral applications remove unwanted background clutter (abundant information) in the image cube to reveal specific data (specific information).
A. Sensor Technology
Hyperspectral data are obtained by means of specialized spectrometers, which are shown in Fig. 2 in their basic forms. In these spectrometers a polychromatic signal (source) is processed by a series of optical elements (1-4) to extract the spatial information of a two-dimensional (2-D) scene at various regions of the electromagnetic spectrum. 10 As was mentioned above, hyperspectral imaging encompasses wavelengths from 0.2 to 12 μm. For obtaining data throughout the spectrum a variety of photonic detectors must be involved. In particular, silicon, lead sulfide, indium antimonide, and mercury cadmium telluride detectors are used to scan the entire range of wavelengths of interest. The relative responsivities of these detectors are shown in Fig. 3, from which the cutoff-wavelength characteristic of photonic sensors is clearly evident.
Hyperspectral sensors (spectrometers) are becoming increasingly effective as technology advances. 11 Improvements in the signal-to-noise ratio, the radiometric and the spectral calibration accuracy, the sensor size, and the width and the number of spectral channels continue to demand higher data throughputs to the processor. Subsection 2.B quantifies the data-throughput requirements in today's store-and-process systems and describes the motivation for the focal-plane imaging system for hyperspectral image processing.
B. Store-and-Process Systems
Traditional hyperspectral-imaging systems scan across a ground track, generating pixels in each scan line (X axis). The forward movement of the system generates the Y-axis coordinates. Figure 4 illustrates two methods of capturing spectral information (λ axis) with the spatial image. 11, 12 The first collection scheme [Fig. 4(a)] employs a spinning mirror to sample the ground image across a row that is perpendicular to the ground track. An array of detectors, each sensitive to different wavelengths, is used to capture the spectral information for that spatial pixel. The second collection scheme [Fig. 4(b)] relies on the direct projection of the ground onto a linear array of sensors. Multiple rows of detector arrays are required to capture different spectral bands.
Fig. 2. Spectrometer block diagram: The first stage consists of an optical system that focuses the incoming radiation. The second and the third stages include optical components such as choppers, gratings, and prisms that are used to separate the spectral components of the polychromatic source. The fourth stage includes the detection, amplification, demodulation, and recording phases of the data-gathering process.
Line scanners suffer from geometric distortions caused by how the image is sampled and the motion of the sensing platform. Additional processing steps must be included to remove the distortions, whereas gyroscopes or stabilizers can reduce movement during scanning. Sensors in line scanners are used continually, which limits the time a detector spends on gathering light. Movable parts in the first collection scheme limit the size and the power consumption of the system. Push-broom linear arrays use sensors with a discrete number of spectral bands, and thus large arrays are required for high spectral resolution.
Current processing systems for the above sensor schemes incorporate a frame buffer that captures an image into memory. A commercial DSP architecture then processes the image sequentially. This store-and-process model, shown in Fig. 5, is low cost, flexible, and compatible with current line-scanner systems. However, rapid detector-array advances in resolution, frame rate, and dynamic range will soon exceed throughput limits inherent in store-and-process systems.
Consider Table 1, which lists existing hyperspectral sensor arrays and a futuristic one. Data specifications for the Airborne Visible-Infrared Imaging Spectrometer (AVIRIS), the Compact Airborne Spectrographic Imager (CASI), and the Hyperspectral Digital Imagery Collection Experiment (HYDICE) were obtained from Ref. 12. Today's hyperspectral sensor arrays require a data bandwidth of 6 to 35 Gbits/s (0.75 to 4.4 Gbytes/s). This raw-bandwidth requirement is dependent on the spatial and the spectral resolutions. For instance, the futuristic sensor array listed in Table 1 will require a data bandwidth of 72 Gbits/s (9 Gbytes/s). Conventional memory buffers and system buses are impractical for these arrays, providing neither the needed storage capacities nor the access speeds required for effective processing. For example, an Ultra 2 SCSI (Small Computer Systems Interface) bus provides only a 7.6-Mbyte/s data bandwidth. The direct Rambus dynamic RAM (RDRAM) (Rambus Corporation) delivers only 1.6 Gbytes/s. 13 Current wire and optoelectronic interconnects are nearing 4.0 Gbits/s per channel for low channel counts.
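The raw-bandwidth figures quoted above follow from multiplying spatial resolution, spectral resolution, sample depth, and frame rate. The following is a minimal sketch of that arithmetic, not a reproduction of Table 1: the function names and the example sensor parameters are hypothetical, and only the 72-Gbit/s and 1.6-Gbyte/s values come from the text.

```python
def raw_bandwidth_bits_per_s(width, height, bands, bits_per_sample, frames_per_s):
    """Raw sensor output rate in bits/s (hypothetical helper)."""
    return width * height * bands * bits_per_sample * frames_per_s


def gbits_to_gbytes(gbits_per_s):
    """Convert Gbits/s to Gbytes/s (8 bits per byte)."""
    return gbits_per_s / 8.0


def channels_needed(required, per_channel):
    """Smallest number of parallel channels whose summed rate covers `required`.

    Both arguments are integers in the same unit; uses ceiling division.
    """
    return -(-required // per_channel)


# Hypothetical sensor: 256 x 256 pixels, 200 bands, 12 bits/sample, 30 cubes/s.
example_rate = raw_bandwidth_bits_per_s(256, 256, 200, 12, 30)

# Futuristic array from the text: 72 Gbits/s, i.e. 9 Gbytes/s.
futuristic_gbytes = gbits_to_gbytes(72)

# RDRAM channels (1.6 Gbytes/s each, from the text) needed to sustain 9 Gbytes/s,
# computed in Mbytes/s to stay in integer arithmetic.
rdram_channels = channels_needed(9000, 1600)
```

Under these assumptions the futuristic array would need several RDRAM-class channels operating in parallel just to absorb the raw stream, before any processing traffic is counted.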
For simple image-processing algorithms such as edge detection approximately 100 elementary operations (additions and subtractions) must be performed per pixel. 14 Hyperspectral data streams incorporate an additional wavelength dimension in the data set, which elevates processing demands. Table 1 lists real-time processing demands ranging from 0.5 to 7 teraoperations/s (Tops/s) per bit for the selected sensor arrays. More complex algorithms require more operations per pixel; thus higher processing throughput is necessary. Today's DSP chip performance is in the range of 0.5 to 1 Gops/s. Traditional parallel SIMD machines, such as the MasPar Computer Corporation's MP-2, provide a higher throughput of 68 Gops/s with 16,000 PE's. 15, 16 However, these parallel machines are built for scientific computation and are not suitable for stream-based computation in hyperspectral image processing. In addition, these machines are not portable, and they consume large amounts of power.
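The gap between demand and available throughput can be made concrete with a short calculation. This is an illustrative sketch only: the sample rate below is a hypothetical value chosen to land at the low end of the quoted 0.5-7 Tops/s range, while the 100 operations per pixel, the 1-Gops/s DSP figure, and the 68-Gops/s MP-2 figure are taken from the text.

```python
def required_ops_per_s(samples_per_s, ops_per_sample):
    """Sustained operation rate needed to keep up with the sensor stream."""
    return samples_per_s * ops_per_sample


def units_needed(required_ops, ops_per_unit):
    """Number of identical processors needed, assuming perfect scaling."""
    return -(-required_ops // ops_per_unit)  # integer ceiling division


OPS_PER_PIXEL = 100            # simple edge detection, from the text
SAMPLES_PER_S = 5 * 10**9      # hypothetical sensor sample rate

demand = required_ops_per_s(SAMPLES_PER_S, OPS_PER_PIXEL)   # 0.5 Tops/s
dsp_chips = units_needed(demand, 10**9)                     # 1-Gops/s DSP chips
mp2_machines = units_needed(demand, 68 * 10**9)             # 68-Gops/s MP-2 systems
```

Perfect scaling is an optimistic assumption; the data-throughput limits discussed above would reduce the usable rate of each unit further.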
Portable hyperspectral systems must balance performance and generality while maintaining low power consumption. To alleviate data-bandwidth requirements, it is necessary to have direct I/O coupling between sensor arrays and PE's. Hyperspectral image-processing algorithms must be performed on many parallel PE's to maintain high throughputs. Rather than store the entire image frame, the computation must be performed as the data arrive to minimize storage buffers. Subsection 2.C discusses the stream-parallel approach and presents a candidate architecture for hyperspectral image processing.
C. Focal-Plane Imaging System
An alternative approach to store-and-process systems is to employ a large array of stream processors in which detectors and processors are connected with parallel interconnects. This stream-parallel approach, shown in Fig. 6, can substantially reduce the storage requirements of the system. The image stream flows directly from the focal plane into the processing plane, retaining the stream's spatial correlation. A stream-parallel system can minimize buffer-storage requirements while reducing per-channel communications. Co-locating the detector and the processor eliminates long wires in the system and alleviates data-bandwidth requirements. The high computational throughput required by hyperspectral image-processing algorithms is delivered by many PE's executing concurrently.
The 2-D focal-plane image-collection schemes for the stream-parallel system are shown in Figs. 7-9. The imaging system uses spectrometers to capture in either the spectral or the spatial domain. 11 Sensors can be filtered to produce images for different spectral regions. The first collection scheme (Fig. 7) is an extension of the push-broom linear array. The image is dispersed among the focal-plane sensors, and each sensor detects different spectral bands of the image. A stream-parallel system operates on a frame with spectral (λ) and spatial (X) dimensions. The motion of the sensing platform along the ground track creates the third spatial dimension (Y) to yield the hyperspectral image cube.
The second scheme (Fig. 8) collects the image in a full 2-D frame with two spatial coordinates (X and Y). As the sensing platform moves along the ground track, full 2-D images with different spectral regions are captured, creating a temporal overlap of the spectral bands. The overlap of the spectral bands creates the hyperspectral image cube with a spectral dimension (λ).
As the detector array moves along the y axis, an image frame is captured for a single wavelength (band). After capturing n frames the entire hyperspectral data cube will be captured.
As was described in Subsection 2.A, different types of sensors are needed to capture hyperspectral images. The detector array shown in Fig. 8 consists of interlaced spectrometers to capture images at different radiation wavelengths. This interlaced array is similar to the color-filter arrays (CFA's) in today's digital color cameras. 17, 18 Figures 9(a) and 9(b) illustrate CFA's with the red-green-blue (RGB) and the cyan-magenta-yellow-green (CMYG) color models. These CFA patterns gain spectral resolution by the sacrifice of spatial information. Interpolation among the color pixels must be performed to recover the spatial detail. Hyperspectral spectrometers, shown in Fig. 9(c), can also be interlaced accordingly to capture images in different spectral bands. Electronic bandpass filters are needed for each spectrometer so that continuous spectral bands can be captured in the images. If the required spectral range is within the sensitivity of a single type of spectrometer, the entire detector array consists of the same sensor. This 2-D framing scheme is far superior to the line scanners used in store-and-process systems. Individual frames have high geometric integrity and are unaffected by sensor-platform movement. This collection scheme delivers 2-D focal-plane image formats that promote the stream-parallel approach to process hyperspectral images at real-time rates. Thus this collection scheme is used in the focal-plane SIMD architecture described in Section 3.
Parallel Single-Instruction-Multiple-Data Stream Architectures
Compared with smart-pixel systems that are hardwired for one application, programmable digital processors offer greater system flexibility. However, commercial microprocessors are ill suited to hyperspectral image-processing applications because of their limited performance (typically 1 Gop/s) and low resource efficiency (support for large memories, floating-point operations, operating systems). They provide a generality and a functionality that are not always required in image processing.
A more promising computational model, the SIMD model, replicates the data path, the data memory, and the I/O to provide high processing performance (1-1000 Gops/s) with a low cost per PE. Figure 10 illustrates this configuration. SIMD systems often employ thousands of PE's. The cost of the control unit is amortized across all PE's. Each PE is a simple arithmetic logic unit (ALU) with a local memory or access to shared memory for data storage. The array-control unit (ACU) usually has its own memory for storing program and scalar data. Each program instruction is broadcast to every PE in the system in a lockstep fashion by means of a single instruction stream. Each PE, in turn, executes the received instructions on its local data (multiple-data stream) while exchanging data with other PE's through the interconnection network. The number of neighboring PE's to which each PE is connected depends on the interconnection network topology. Masking techniques are used to control the activity of each PE, which can be enabled or disabled during an execution cycle. Only enabled (active) PE's participate in the current computation.
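The lockstep broadcast and masking behavior described above can be illustrated with a small software model. This is a toy sketch, not the instruction set of any real machine: the function names are hypothetical, each PE's local data is reduced to a single value, and a 1-D ring is assumed for the interconnection network.

```python
import operator


def simd_step(op, local_data, operand, mask):
    """Broadcast one instruction to all PEs.

    Every enabled PE applies `op` to its local value and the broadcast
    scalar `operand`; disabled PEs keep their local value unchanged.
    """
    return [op(x, operand) if enabled else x
            for x, enabled in zip(local_data, mask)]


def shift_from_west(local_data):
    """Nearest-neighbor exchange on an assumed 1-D ring.

    Each PE receives the value held by its west neighbor; PE 0 wraps
    around and receives the value of the last PE.
    """
    return local_data[-1:] + local_data[:-1]


pe_data = [3, 10, 7, 1]               # one local value per PE
mask = [x > 5 for x in pe_data]       # enable only PEs holding values above 5
result = simd_step(operator.sub, pe_data, 5, mask)   # [3, 5, 2, 1]
shifted = shift_from_west(pe_data)                   # [1, 3, 10, 7]
```

The mask plays the role of a data-dependent branch: all PE's see the same subtract instruction, but only the enabled ones commit a result.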
Although a single program is being executed, each instruction is executed simultaneously on many PE's. This data-parallel execution model is especially well suited to early image processing in which a subroutine must be applied to every region of an image. Whereas a commercial microprocessor must iterate sequentially across an image, a SIMD architecture can process the entire image in a single iteration. The data-parallel execution model has proven to be beneficial for a number of image-processing applications such as wavelet decomposition 19 and VQ for both image compression 20, 21 and multispectral image data. 22 All these applications show massive amounts of data parallelism and require a large number of near-neighbor (local) communications.
Fig. 9. CFA's for digital color cameras: (a) RGB and (b) CMYG models. (c) Hyperspectral sensors can be interlaced in the focal plane to capture the entire radiation range. Si, silicon; PbS, lead sulfide; InSb, indium antimonide; HgCdTe, mercury cadmium telluride.
Fig. 10. Organization of a SIMD computer architecture. Program instructions are broadcast to every PE in the system through a single instruction stream, and each PE carries out the received instructions on its local data. P0, P1, Pn, PE's; MEM 0, MEM 1, MEM n, local memory.
A. Low-Memory Single-Instruction-Multiple-Data Processor Array with Focal-Plane-Area I/O
Focal-plane-area I/O plays a critical role in SIMD-based parallel systems because the data throughput can often limit overall processing rates. It was already shown in Subsection 2.B that data throughput constitutes a severe bottleneck for current machines. The large communication-to-computation ratio poses design constraints for the processor's interconnection network. Many image-processing applications in early SIMD architectures were limited by I/O, e.g., the Thinking Machine Corporation's Connection Machine 1. 23 Later machines (Connection Machine 2 from Thinking Machine Corporation 24 and MasPar's 15, 16 MP-2) overcame these limitations through the use of large parallel-disk arrays to buffer images. These systems were packaged in 10-100-ft³ enclosures. Portable real-time systems must process the I/O stream directly without disk buffering. Focal-plane-area I/O provides the necessary data throughput (16 Gbits/s for 1000 × 1000 pixel images with 8 bits/pixel at 2000 frames/s) in a compact, efficient package. 20, 25 This subsection describes the modeling and the implementation efforts for a SIMD focal-plane architecture. The architecture, as shown in Fig. 11, consists of an array of SIMD processors in which each PE can address a 4 × 4 array of image sensors. Each processor incorporates an analog-to-digital converter to convert light intensities that are incident upon the sensors into digital values. The SIMD execution model allows for the entire image projected on many PE's to be acquired in a single operation. Each PE is a simplified reduced-instruction-set computing (RISC) processor that contains the following functional units (FU's):
• An ALU with an adder-subtractor and a barrel shifter.
• A multiply-accumulate (MACC) unit.
• Three-ported general-purpose register file and special register.
• Sixty-four words of local memory (a maximum of 256 words).
• Communication and serial I/O units.
• A masking unit to control PE activity.
The key feature of the SIMD focal-plane system is the integration of the focal-plane I/O with a programmable digital processor. Each processor has a small local memory to limit silicon chip-area consumption, which allows a large number of PE's to be integrated into a single monolithic chip. This integration provides the required high processing capabilities, shifting the reliance for performance from clock frequency to data parallelism. Focal-plane-area I/O delivers the necessary image data to the appropriate processors, and therefore it maintains a high data throughput to keep the processor array busy. In addition, each PE has a MACC unit that is specialized for image-processing applications.
Early prototyping efforts proved the feasibility of the direct coupling of a processing core with sensor devices. 25 FU's are specified in Table 2 for a 16-bit implementation in terms of silicon area and transistor count. The prototypical PE measures 2.4 mm × 2.7 mm and contains a total of 38,590 transistors, including clock drivers, testing circuitry, and I/O pads. Implementation details are presented in Refs. 25-27. A power-consumption model was developed for this architecture. 26 An image-processing workload 27 and technology parameters from the National Technology Roadmap for Semiconductors 28 were integrated with this model to project power consumption in different technology situations. The maximum system size in a single monolithic chip is determined from the maximum die size, the transistor density, and the power-density values for different years. Figure 12 shows the maximum allowable system size and power consumption before they are limited by the technology of the particular year.
A future target system, which is used in this paper for the performance analysis, is described in Table 3. This system will be able to deliver an unparalleled performance with 4096 PE's integrated into a single monolithic device running at 500 MHz. For 0.1-μm VLSI technology a larger system size is allowed by the projections. However, a smaller, more reasonable system size is chosen because the power-consumption model allocates the power budget for only the digital circuitry. Additional power for the interface circuitry and the analog devices must be included to obtain the complete power budget, which will result in a smaller system size. The operating clock frequency is chosen to ensure a low-power regime of operation for the digital circuitry.
In the future target system, a 256 × 256 pixel array of image sensors is integrated with the system, and each PE is mapped directly to a 4 × 4 pixel subarray. This system would deliver a peak throughput of approximately 1.5 Tops/s in a monolithic device, enabling image- and video-processing applications that are unapproachable with today's portable DSP technology. Section 4 presents an analysis of a realistic workload, which comprises several hyperspectral image-processing applications, for this architecture. For this target system real-time processing throughputs are obtained, thus indicating that the SIMD focal-plane architecture is a valid candidate architecture for hyperspectral image processing.
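The relation between the 256 × 256 sensor array, the 4 × 4 subarray per PE, and the 4096-PE system size can be checked with a few lines of index arithmetic. The sketch below assumes a simple row-major tiling of the sensor array; the constant and function names are hypothetical.

```python
SENSOR_ROWS = 256   # sensor array height, from the text
SENSOR_COLS = 256   # sensor array width, from the text
TILE = 4            # each PE addresses a TILE x TILE pixel subarray

PE_ROWS = SENSOR_ROWS // TILE
PE_COLS = SENSOR_COLS // TILE
NUM_PES = PE_ROWS * PE_COLS


def pixel_to_pe(row, col):
    """Map a sensor pixel to (PE row, PE column, local row, local column).

    Assumes row-major tiling: the PE at grid position (i, j) owns sensor
    rows 4i..4i+3 and sensor columns 4j..4j+3.
    """
    return (row // TILE, col // TILE, row % TILE, col % TILE)


corner = pixel_to_pe(255, 255)   # last pixel of the array
sample = pixel_to_pe(5, 10)
```

With this tiling, every spectral slice of a given pixel lands on the same PE, which is the property the applications in Section 4 rely on.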
Hyperspectral Applications
This section describes a suite of applications for hyperspectral data streams. In general, these applications process a hyperspectral cube to reduce its large data set into a smaller, more manageable size. These applications are computationally intensive and require high throughput to handle the massive data flow in real time. However, they offer a high degree of data parallelism that is not usually exploited by sequential image-processing systems. The SIMD focal-plane architecture combines focal-plane image acquisition with a SIMD execution model to exploit the available data parallelism while alleviating the I/O bottleneck. For each application the data throughput, the computational workload, and the memory storage are calculated to determine realistic system requirements. Performance results are offered in Section 5.
A. Region Autofocus
Region autofocus isolates regions of interest and provides a smaller image for further analysis. Irrelevant portions of the image are ignored, and system resources are dedicated to the proper region for threat identification. Consequently, the workload and the execution latency of automated-target-recognition (ATR) algorithms are reduced by the removal of the need to process the entire hyperspectral data stream. The application of region autofocus for ATR is illustrated in Fig. 13. Hyperspectral-imaging sensors generate a data cube that contains a large number of images at different wavelengths. Although threat visibility may be obstructed in the visible spectrum, other characteristics such as heat or material signatures can be detected at a different spectral wavelength, as shown in the sample image on the top right-hand side in Fig. 13.
Fig. 12. Projected system size and power consumption for a processor array. The numbers in square brackets along the horizontal coordinate represent the feature sizes.
Data fusion among the different spectral slices is first performed to detect threat presence. This fusion is achieved by the correlation of the image coordinates in different spectra and the production of a binary image. A pixel in the binary image is set if spectral signatures are detected at the given coordinates. This binarization (threshold) process is simplified for a focal-plane system because image pixels are mapped spatially to processors. Each PE handles the same spatial coordinates for each spectral image. In other words, each PE operates on a small tile of pixels with the same spatial coordinates. A stack of tiles with the same spatial coordinates that represents the different spectral slices is stored in the same PE.
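The fusion-and-binarization step can be sketched as a per-pixel test over the spectral stack. The text does not specify the detection rule, so the version below assumes the simplest one: a pixel is set if any band meets a per-band threshold. The function name, the thresholds, and the tiny example cube are all hypothetical.

```python
def fuse_to_binary(cube, thresholds):
    """Fuse a spectral stack into a binary image.

    cube[b][r][c] is the intensity of band b at spatial position (r, c);
    thresholds[b] is the assumed detection threshold for band b.
    A pixel is set to 1 if any band meets its threshold, else 0.
    Because all bands of one pixel live in the same PE, the inner test
    needs no inter-PE communication.
    """
    bands = len(cube)
    rows, cols = len(cube[0]), len(cube[0][0])
    return [[int(any(cube[b][r][c] >= thresholds[b] for b in range(bands)))
             for c in range(cols)]
            for r in range(rows)]


# Two hypothetical bands of a 2 x 2 scene.
cube = [
    [[1, 9], [2, 3]],   # band 0
    [[0, 0], [8, 1]],   # band 1
]
binary = fuse_to_binary(cube, thresholds=[5, 5])   # [[0, 1], [1, 0]]
```

A real signature test would likely combine several bands per material; the any-band rule is used here only to keep the data flow visible.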
A region-identification processing stage is then utilized to locate a region of the view with the threat. An image-decomposition stage based on the quadrant-tree (quadtree) decomposition algorithm is used to subdivide the hyperspectral image into smaller image quadrants. Although quadtree can be used for image compression, it is used in this application context only to segment the original images into smaller isolated regions. Each region of interest can be identified by the parsing of the generated quadtree structure. The region of interest is subsequently enlarged or zoomed to focus on the threat by redistribution of the pixel data of a single region among the PE's. The enlarged image contains upsampled pixels from a single region of interest. This zoomed image is fed to ATR algorithms to assess further the effective presence of a threat.
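The quadtree segmentation and the subsequent zoom can be sketched sequentially as follows. This is an illustrative model under stated assumptions, not the parallel implementation evaluated in the paper: the binary image is assumed square with a power-of-two side, a region of interest is taken to be a uniform quadrant of set pixels, and the zoom uses nearest-neighbor replication. All function names are hypothetical.

```python
def quadtree(image, row=0, col=0, size=None):
    """Decompose a square binary image into uniform quadrants.

    Returns a list of leaves (row, col, size, value), where (row, col) is
    the top-left corner of the quadrant. Assumes a power-of-two side.
    """
    if size is None:
        size = len(image)
    first = image[row][col]
    uniform = all(image[r][c] == first
                  for r in range(row, row + size)
                  for c in range(col, col + size))
    if uniform or size == 1:
        return [(row, col, size, first)]
    half = size // 2
    leaves = []
    for dr in (0, half):
        for dc in (0, half):
            leaves += quadtree(image, row + dr, col + dc, half)
    return leaves


def regions_of_interest(leaves):
    """Parse the quadtree leaves and keep the quadrants whose pixels are set."""
    return [(r, c, s) for r, c, s, value in leaves if value == 1]


def zoom(image, row, col, size, out_size):
    """Upsample one region to out_size x out_size by pixel replication."""
    return [[image[row + (r * size) // out_size][col + (c * size) // out_size]
             for c in range(out_size)]
            for r in range(out_size)]


binary = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
leaves = quadtree(binary)
rois = regions_of_interest(leaves)          # [(2, 2, 2)]
zoomed = zoom(binary, *rois[0], out_size=4)
```

In the focal-plane system the zoom corresponds to redistributing the region's pixels across all PE's, so that the ATR stage sees the region as if it had been sampled at full array resolution.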
These ATR algorithms take target information from a central database and inform the region-autofocus algorithm of the proper spectral signatures. By redistribution of the pixel data each PE is ready to perform an ATR algorithm as if this smaller region were originally sampled from the sensor array. Therefore no computational bandwidth is lost to regions not containing the threat. In addition, ATR algorithms can use information from the region-autofocus application to refocus the sensor view toward the target.
B. C-Means Classifier
The C-means classifier is an approach to grouping objects according to pixel characteristics. 29 This algorithm follows an unsupervised approach in which the original image pixels are not labeled. Supervised approaches are less computationally intensive, as image pixels have already been labeled according to existing maps or photointerpretations. 30 The C-means algorithm locates concentrations of feature vectors within a heterogeneous sample of pixels. These clusters represent classes in the image and are used to calculate class signatures. The C-means approach is an efficient clustering algorithm that assumes C clusters in a particular data set.
The algorithm begins with an arbitrary mean vector for each of the C clusters. Image pixels are assigned to their proper clusters on the basis of the distances between the pixel coordinates and each cluster's mean vector. A pixel is assigned to the cluster whose mean vector lies at the smallest Euclidean distance, and pixel vectors are thereby grouped into C clusters. A unique identifier or color is assigned to each cluster to visualize the data and to differentiate the clusters. The number of colors equals the number of clusters, C. The optimal number of clusters C depends on the image content. Small values of C may not allow the algorithm to detect the appropriate materials, whereas large C values could divide a cluster into small subclusters such that the same material will be detected as different substances. Figure 14 illustrates the implemented parallel C-means algorithm. In this implementation a 220-spectral-band AVIRIS image is used. The original image is enlarged to a resolution of 580 × 580 pixels to map a single coordinate spatially to each PE. Computation occurs while the 220 spectral bands are brought into the system to minimize memory usage. Interim data such as the pixel vector for the 220-spectral-band intensities are stored in the PE local memory for the entire algorithm iteration.
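The assignment step lends itself to computation while the bands stream in, because the squared Euclidean distance is a sum of per-band terms. The sketch below is a sequential model of that idea under stated assumptions: the function names are hypothetical, one pixel stands in for one PE, and the mean-update step is the standard recomputation of each cluster mean from its members, which the text does not spell out.

```python
def assign_clusters_streaming(band_stream, means):
    """Assign every pixel to its nearest cluster mean, one band at a time.

    band_stream yields 2-D band images in spectral order; means[k][b] is
    the mean of cluster k in band b. Each pixel keeps one running squared
    distance per cluster, so no earlier band is needed for this step.
    """
    dist = None
    for b, band in enumerate(band_stream):
        if dist is None:
            dist = [[[0.0] * len(means) for _ in row] for row in band]
        for r, row in enumerate(band):
            for c, value in enumerate(row):
                acc = dist[r][c]
                for k, mean in enumerate(means):
                    acc[k] += (value - mean[b]) ** 2
    return [[min(range(len(means)), key=acc.__getitem__) for acc in row]
            for row in dist]


def update_means(bands, labels, old_means):
    """Recompute each cluster mean from its member pixel vectors.

    Empty clusters keep their previous mean (an assumption of this sketch).
    """
    num_bands, k_total = len(bands), len(old_means)
    sums = [[0.0] * num_bands for _ in range(k_total)]
    counts = [0] * k_total
    for r, row in enumerate(labels):
        for c, k in enumerate(row):
            counts[k] += 1
            for b in range(num_bands):
                sums[k][b] += bands[b][r][c]
    return [[s / counts[k] for s in sums[k]] if counts[k] else list(old_means[k])
            for k in range(k_total)]


# Two hypothetical bands of a 1 x 4 image and C = 2 initial means.
bands = [[[0, 1, 10, 11]], [[0, 1, 10, 11]]]
means = [[0.0, 0.0], [10.0, 10.0]]
labels = assign_clusters_streaming(iter(bands), means)   # [[0, 0, 1, 1]]
new_means = update_means(bands, labels, means)
```

The update step does need the stored pixel vectors, which is consistent with the text's remark that the 220-band pixel vector is held in PE local memory for the whole iteration.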
C. K-Means Clustering
K-means clustering is a commonly used technique for segmenting large image regions into specific objects or areas of interest. This sort of segmentation can, for example, be useful in search and rescue operations in which large areas must be scanned for specific objects. Hyperspectral sensors can be used to obtain a clear signature of the objects of interest.
The implemented algorithm takes as input the data-fused image to be analyzed and the number of clusters K to be constructed. The process is illustrated in Fig. 15. The algorithm first identifies all possible pixel clusters in an image according to a particular threshold metric. One accomplishes this identification by having each PE compare its assigned pixel values to the specified threshold value. All pixels that satisfy the threshold condition (live pixels) are grouped into N clusters on the basis of a connectivity criterion; the clusters are then labeled accordingly. A toroidal interconnection network is used to allow efficient execution, as all PE's with live pixels can communicate with their neighbors to grow a cluster of pixels. The constructed clusters are labeled by use of a simple convention based on the identification numbers of all PE's with pixels belonging to a particular cluster. The PE's can then collaborate in each centroid computation of the N identified clusters.
After the centroids for the N (N > K) clusters are determined, the algorithm must map these clusters into K new clusters. To do this, the algorithm chooses K clusters out of the N identified clusters and maps the remaining N − K clusters onto them. The centroids of these K clusters are denoted as seeds. The remaining N − K clusters are mapped by minimization of the distance between their centroids and the selected seeds. The implemented algorithm chooses the seeds randomly. After each of the N − K clusters is remapped, the centroid of the newly expanded cluster is calculated. This new centroid replaces the original cluster's centroid as one of the K seeds. The algorithm continues until no centroid changes cluster. At the end there will be K main clusters that, taken together, will hold the N original clusters. Each of the K clusters will have a specific centroid that was calculated by use of all internal clusters. An example of this process is shown in Fig. 16 for the cases of N = 7 with K = 2 and K = 3. Note that the clusters generated will depend on the choice of the seed points.
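One remapping pass of this seed-based merge can be sketched as follows. This is a serial Python illustration under stated assumptions: the names `weighted_centroid` and `merge_clusters`, the use of per-cluster pixel counts to weight the recomputed centroid, and the toy coordinates are all hypothetical. The repeat-until-stable loop is omitted for brevity, and seeds can be passed explicitly so that the example is reproducible; when they are not, they are drawn at random as in the text.

```python
import math
import random

def weighted_centroid(members, centroids, sizes):
    """Centroid of a merged group, weighting each cluster by its pixel count."""
    total = sum(sizes[i] for i in members)
    dims = len(centroids[members[0]])
    return tuple(
        sum(sizes[i] * centroids[i][d] for i in members) / total
        for d in range(dims)
    )

def merge_clusters(centroids, sizes, k, seed_indices=None, rng=None):
    """Map N labeled clusters onto K seed clusters (one remapping pass)."""
    n = len(centroids)
    if seed_indices is None:
        rng = rng or random.Random()
        seed_indices = rng.sample(range(n), k)  # random seed choice
    groups = [[i] for i in seed_indices]
    seeds = [tuple(centroids[i]) for i in seed_indices]
    for i in range(n):
        if i in seed_indices:
            continue
        # Attach cluster i to the nearest seed, then refresh that seed.
        g = min(range(k), key=lambda j: math.dist(centroids[i], seeds[j]))
        groups[g].append(i)
        seeds[g] = weighted_centroid(groups[g], centroids, sizes)
    return groups, seeds

# Hypothetical data: N = 7 cluster centroids with pixel counts, merged into K = 2.
centroids = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11), (5, 6)]
sizes = [4, 2, 2, 4, 2, 2, 1]
groups, seeds = merge_clusters(centroids, sizes, k=2, seed_indices=[0, 3])
```

Calling `merge_clusters` with different `seed_indices` generally yields different groupings, which is the seed dependence noted in the text.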
D. Full-Search Vector-Quantization Image Compression
Sensor technology with increasing spectral range and data resolution will continue to saturate the available communication bandwidth. The demand for handling increasingly larger images at faster rates continually exceeds advances in technology and upgrades in the existing communications infrastructure. Efficient data storage and transmission through digital image compression therefore become important to alleviate memory-storage limitations and transmission-channel bottlenecks.
Among compression algorithms VQ has become an attractive alternative to other transform-based techniques 31 mainly because of its computationally inexpensive decoding process. On the other hand, the greater computational requirements of the training and the encoding processes are characterized by a high level of data parallelism, which makes VQ highly suitable for parallel implementations. 21, 22 In its basic scheme VQ is a mapping function that associates vectors of input data with a representative vector (a codevector) chosen from a previously learned dictionary (a codebook). A scalar index is then used to reference the chosen codevector in the dictionary and to encode the input vector. The compression ratio depends on the cardinality of the codebook, which is usually much lower than that of the input domain. In the 2-D case nonoverlapping vectors are extracted from the input image by the grouping of a number of contiguous pixels to retain the available spatial correlation of the data. Hyperspectral data streams require an extension of vectors into multiband cubes of spectral images. Nonoverlapping, multiband cubes are used to exploit both the spatial and the spectral correlations available in the data.
A full-search VQ encoding algorithm was implemented 20 for 2-D image data. A 256-word codebook is used to achieve a 0.5-bit/pixel encoding of 8-bit/pixel gray-level images. The application moves input blocks through the system so that each input block is compared with the entire codebook. When the process is complete each node contains the index of the best-matching codeword for the original input block and the corresponding distortion value. Through this process the focal-plane image is converted into a matrix of 8-bit values. A key enabling role is played by the toroidal structure of the interconnection network, which makes possible communication among the nodes on opposite sides of the mesh. In this implementation the input blocks are compared with the codebook in a systolic fashion, with a large number of blocks compared in parallel at any given time. This results in a significant speedup of the comparison and a high degree of concurrency.
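The full-search encoding step and the quoted bit rate can be illustrated with a short serial sketch. The function names `encode_block` and `bits_per_pixel`, the squared-error distortion measure, and the four-word toy codebook are assumptions for illustration. The 16-pixel block size is inferred from the figures in the text (an 8-bit index for a 256-word codebook at 0.5 bit/pixel) rather than stated there.

```python
import math

def encode_block(block, codebook):
    """Full search: compare the block with every codeword and keep the best.

    Returns (index, distortion), where distortion is the sum of squared
    differences (an assumed distortion measure).
    """
    best_index, best_distortion = 0, float("inf")
    for index, codeword in enumerate(codebook):
        distortion = sum((b - c) ** 2 for b, c in zip(block, codeword))
        if distortion < best_distortion:
            best_index, best_distortion = index, distortion
    return best_index, best_distortion

def bits_per_pixel(codebook_size, pixels_per_block):
    """Bit rate when one codebook index encodes one block of pixels."""
    return math.log2(codebook_size) / pixels_per_block

# Hypothetical toy codebook of 4-pixel codewords (the text uses 256 words).
codebook = [
    [0, 0, 0, 0],
    [255, 255, 255, 255],
    [0, 255, 0, 255],
    [128, 128, 128, 128],
]
index, distortion = encode_block([10, 240, 5, 250], codebook)
rate = bits_per_pixel(256, 16)  # 8-bit index over an assumed 16-pixel block
```

Here `rate` evaluates to 0.5, matching the bit rate quoted above, and the 8-bit index is what each node holds at the end of the search.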
E. Textural Correlation
The texture of an object is an attribute of its surface that represents the spatial arrangement of the gray levels of the pixels in a particular region. A texel is a small region of pixels that identifies the spatial characteristics of the particular texture pattern. Texture analysis is decisive in all those image-segmentation instances in which objects differ from one another or from the background in texture but not in average brightness or chromatic distribution.
Textural correlation is similar to chromatic matching except that a spatial rather than a chromatic spectrum is examined. For determining the spatial spectrum a DFT is computed on small regions across a spectral slice. The resulting spectrum is then compared with a set of reference texels by use of the same technique described in Subsection 4.D. Figure 17 illustrates the required steps. In Fig. 17(a) an M × M region of interest is located in a particular spectral slice and extracted. In Fig. 17(b) the same region is compared with the texel library, and either a flag is raised or a presence probability is associated with each of the library patterns. All spectral slices in a cube must be accessed to determine whether the desired texture is present in the scene.
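The two steps of Fig. 17 can be sketched as a DFT of the extracted region followed by a search over the texel library. This is a hedged illustration only: the names `dft2_magnitude` and `best_texel` are hypothetical, the direct O(M^4) DFT is chosen for clarity rather than speed, and the squared-difference score on magnitude spectra is an assumed stand-in for the matching technique, which this sketch does not reproduce.

```python
import cmath

def dft2_magnitude(region):
    """Magnitude of the 2-D DFT of a square M x M region (direct form)."""
    m = len(region)
    spectrum = []
    for u in range(m):
        row = []
        for v in range(m):
            acc = 0j
            for x in range(m):
                for y in range(m):
                    phase = -2j * cmath.pi * (u * x + v * y) / m
                    acc += region[x][y] * cmath.exp(phase)
            row.append(abs(acc))
        spectrum.append(row)
    return spectrum

def best_texel(region, library):
    """Index of the library texel whose spatial spectrum is closest to region's."""
    target = dft2_magnitude(region)

    def mismatch(texel):
        ref = dft2_magnitude(texel)
        return sum(
            (a - b) ** 2
            for row_a, row_b in zip(target, ref)
            for a, b in zip(row_a, row_b)
        )

    scores = [mismatch(t) for t in library]
    return scores.index(min(scores)), scores

# Hypothetical 4 x 4 texels: horizontal stripes and vertical stripes.
horizontal = [[1, 1, 1, 1], [0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0]]
vertical = [[1, 0, 1, 0]] * 4
# Region under test: vertical stripes shifted by one pixel.
region = [[0, 1, 0, 1]] * 4
match, scores = best_texel(region, [horizontal, vertical])
```

Comparing magnitude spectra makes the match insensitive to a circular shift of the pattern inside the region, which is why the shifted stripes still select the vertical texel.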
System Requirements
The hyperspectral applications described in Section 4 were studied to determine their computational requirements. This section describes the characterization and the performance of the SIMD focal-plane architecture. An instruction-level simulator running in Windows 95 was developed to explore architectural issues and to extract critical data on computational throughput and storage requirements. 32 An instruction histogram was tabulated to determine the average workload for the hyperspectral applications. The real-time performance of a store-and-process DSP chip is modeled with the average workload and compared with that of the SIMD focal-plane architecture. A critical analysis of the simulation results is provided in Subsection 5.C.
A. Instruction Histogram
In Table 4 each application is profiled in terms of the number of cycles spent in the key FU's: ALU, memory (MEM), communication (COMM), PE ACU (MASK), and image loading (PIXEL). The ALU and the MEM cycles are computational cycles, whereas the COMM and the MASK cycles are necessary for data distribution and synchronization of the SIMD's processor array. Scalar instructions operate in the ACU and are considered to be the serial portion of the algorithm. Vector instructions represent the parallel portion of the algorithm operated in the processor array.
The average workload for the entire application suite is depicted in Fig. 18. As expected, significant shares of the execution time are expended in the ALU and the MEM operations. The latency for image loading is negligible, indicating that all selected applications are computationally intensive. The application suite has a high degree of parallelism because the serial component, indicated by the scalar slice, is limited to only 16.9%. The SIMD programming model harnesses data parallelism by the delegation of work to the processor array. Table 5 summarizes the performances for all the implemented applications. The target system described in Table 3 with 4096 PE's was used for the simulations. This system size supports an image size of 256 × 256 pixels with a 4 × 4 pixel subarray per PE. The only exception is the C-means classifier application, which uses 21,025 PE's to handle the AVIRIS image format. For each application the worst-case situation was used. System utilization is calculated as the average concurrency, and it is relative to the total system size. The execution time and the sustained throughput are computed with respect to the 500-MHz target platform described in Subsection 3.A.
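The array sizes quoted above follow from dividing the image into pixel subarrays, one per PE. The short sketch below reproduces that arithmetic; the function name `pe_count` is hypothetical. Applying the same 4 × 4 subarray to the 580 × 580 C-means image is an assumption that is consistent with the 21,025-PE figure but is not stated explicitly in the text.

```python
def pe_count(width, height, sub=4):
    """Number of PE's when each PE holds a sub x sub pixel subarray."""
    if width % sub or height % sub:
        raise ValueError("image dimensions must be multiples of the subarray size")
    return (width // sub) * (height // sub)

target_system = pe_count(256, 256)   # 64 x 64 array of PE's
c_means_system = pe_count(580, 580)  # 145 x 145 array of PE's
```

The first call gives 4096 and the second gives 21,025, the two system sizes used in the simulations.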
B. Computation Throughput
The simulations show that all applications execute in the lower millisecond range, suggesting that they can be combined in stages to form complex applications that are still able to execute at full-frame rates (30-60 frames/s or 15-30 ms). The high system utilization that is achieved further indicates that the underlying data-bandwidth and parallel-execution schemes in the SIMD focal-plane architecture are suitable for hyperspectral image-processing applications.
Memory usage in the SIMD focal-plane system is kept to the minimum required for stream processing. Because computation is performed as images arrive on the focal plane, a large frame-buffer storage is unnecessary. For most applications memory usage is contained within 600 kbytes for a 256 × 256 pixel image. The C-means classifier operates on more spectral bands, elevating the total system memory usage to approximately 9.5 Mbytes. However, this amount is still considerably less than the traditional store-and-process data-storage requirements of 10 Gbytes for AVIRIS image formats. 12
The performance of the SIMD focal-plane system for the different applications is compared with a 4-Gop/s store-and-process DSP chip running at 500 MHz. Following trends in architecture evolution, the DSP system performance is based on an instruction-per-cycle throughput of 8 (compared with 2 to 3 instructions per cycle for today's microprocessors). A model of the execution that is based on the average workload for hyperspectral applications is applied to the DSP chip. Figure 19 illustrates the relative performances of both approaches with the real-time (30-frame/s) bound as a reference. FUSION and ZOOM are functional blocks in the region-autofocus application. The K-means clustering application is also partitioned into its functional components: clustering (CLUSTER), labeling (LABEL), centroid calculation (CENTR), and classification (CLASSIF).
Fig. 17. Textural-correlation application by use of the DFT and a texel library: (a) M × M region extraction from an N × N spectral slice and (b) matching of the extracted region against the texel library. Y/N, yes-no; L, the number of texels in the library.
The SIMD focal-plane system provides 2-3 orders of magnitude higher performance than the DSP uniprocessor solution. For the hyperspectral workloads, computational flexibility is maintained while performance is delivered by the parallel-processor array. Traditional SIMD systems such as the MasPar Model MP-2 deliver 68 Gops/s of processing throughput with 16,000 PE's running at 12.5 MHz. 15, 16 If a 40-fold increase in clock frequency alone is assumed, an estimated 2.7-Top/s processing capability may be possible. However, data bandwidth in the global router (1.2 Gbits/s) may easily limit the estimated performance. The proposed SIMD focal-plane system incorporates the focal-plane-area I/O to deliver high data throughput to the processor array, maintaining high processor utilization. Without a high-bandwidth communication structure the hyperspectral-application workloads will likely shift more toward the COMM, the MASK, and the scalar instruction types because processors must wait for data to arrive. The percentage of time spent on useful work (ALU and MEM instruction types) will decrease, and the processing latency will increase beyond real-time (30-frame/s) bounds.
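The throughput figures used in this comparison can be checked with simple arithmetic. The sketch below is illustrative only. The helper names are hypothetical, and the clock-scaling estimate assumes, as the text does, that throughput grows linearly with clock frequency and ignores the router bandwidth limit.

```python
def peak_gops(instructions_per_cycle, clock_mhz):
    """Peak throughput in Gops/s for a given issue width and clock rate."""
    return instructions_per_cycle * clock_mhz / 1000.0

def scale_with_clock(gops, clock_factor):
    """Throughput after scaling the clock, assuming purely linear scaling."""
    return gops * clock_factor

dsp_peak_gops = peak_gops(8, 500)                      # modeled DSP chip
maspar_scaled_tops = scale_with_clock(68, 40) / 1000   # MP-2 with a 40-fold clock
maspar_scaled_clock_mhz = 12.5 * 40                    # resulting clock rate
frame_budget_ms = 1000.0 / 30                          # real-time bound at 30 frames/s
```

These evaluate to 4 Gops/s for the DSP model, about 2.7 Tops/s for the scaled MP-2 at 500 MHz, and a budget of roughly 33 ms per frame.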
Although real-time rate execution is achievable for the DSP uniprocessor solution for some application components, this is no longer the case when complete applications need to be executed. Figure 20 shows two examples for region autofocus and K-means clustering. As was described in Section 4, these two applications consist of multiple building blocks or functional components. Aggregating these components for each application illustrates the inability of the store-and-process DSP system to provide real-time processing for hyperspectral applications. On the other hand, SIMD focal-plane systems still have additional headroom for more applications before exceeding the video-rate bound.
Fig. 18. Workload characterization for the hyperspectral-application suite.
Fig. 19. Performance comparison of the store-and-process DSP versus the SIMD focal-plane systems on hyperspectral-application workloads. fps, frames per second.
C. Analysis
The SIMD focal-plane architecture benefits from direct coupling between the sensor array and the computing plane. The focal-plane scheme allows stream-oriented mapping of image data directly with each PE. Image-processing applications and algorithms map well to the SIMD execution model and the focal-plane I/O. 27 With only 4096 PE's the processor array delivers a competitive performance (0.5 to 1.5 Tops/s). One can achieve this level of performance by keeping PE's busy because the focal-plane I/O delivers spatially mapped data for each PE. The focal-plane I/O also alleviates the need to buffer the large amount of hyperspectral image data. Computation is performed as the data arrive rather than first storing the entire data stream. In comparison traditional single-processor architectures require large memory caches to hold stream data. These caches are not used efficiently because each stream element is read exactly once. 33 Instead of placing more caches in the chip, more PE's can be substituted to allow higher image resolution and processing throughput.
Other image-processing applications such as JPEG and wavelet encoding are currently being developed. Standard building blocks for image-processing algorithms, such as spatial filtering, the DFT, morphological filtering, image rotation, and image labeling, have been implemented. 27 More complex ATR algorithms are being pursued by the integration of these components.
Architecture models described in Ref. 26 showed that VLSI technology by the year 2012 will support a system with approximately 850,000 pixels (920 × 920). Power consumption for the PE's is contained to less than 50 W for a 52,900-processor system in 50-nm technology with a projected performance in excess of 70 Tops/s. The above application simulations show the potential of processing hyperspectral data streams in a parallel SIMD environment on a focal plane. The SIMD focal-plane architecture provides a suitable operating platform for current and future semiconductor technology.
Conclusions
This paper has demonstrated that focal-plane-area I/O used with a fine-grain, low-memory SIMD architecture meets the real-time requirements of hyperspectral image processing. Data throughputs, processing demands, and storage requirements from realistic workloads have shown that current store-and-process DSP systems are not capable of handling the tremendous computational workloads and I/O throughputs. The key solution to maintaining real-time performance is the integration of the focal-plane-area I/O to provide high data throughput and keep the processor array busy. The SIMD processor array maintains modest computational flexibility for hyperspectral image-processing applications beyond traditional smart-pixel systems.
