The next generation of Earth-observing spacecraft is likely to generate enormous volumes of data. A major challenge lies in converting these mountains of data into information useful to researchers and other users. Hierarchical segmentation is one way to detect relationships among regions in a hyperspectral image. We implemented this algorithm on a next-generation space-capable hardware platform, and studied its performance before and after adapting it to use the platform's unique computational resources. We found that these adaptations enable an order-of-magnitude increase in performance over our initial implementation, and our detailed analysis points to areas for additional improvement.
INTRODUCTION
Spaceborne sensors like NASA's Hyperion hyperspectral imager generate huge data volumes, and several near-term trends indicate that data volumes will only increase. Next-generation hyperspectral missions, such as NASA's Hyperspectral Infrared Imager (HyspIRI), will operate at higher duty cycles and higher data rates, and their users will expect products to be generated from the data in near real time [1]. Barring a sudden advance in satellite downlink capacity, these trends point to a need to process data and generate products onboard the spacecraft. Rather than downlink an entire hyperspectral image cube, onboard processing enables satellites to downlink partial or completed scientific data products, which are often one to two orders of magnitude smaller than the original image. In addition, a satellite with onboard data processing resources and direct broadcast transmission equipment could send data products directly to first responders, research scientists, or other users on the ground.
Next-generation space-capable data processors will have a combination of reconfigurable gate arrays, digital signal processors and general-purpose CPUs. Correctly programmed and configured, these resources are sufficient to run sophisticated data analysis programs, including hyperspectral image processing algorithms that commonly run on desktop computers [2].
This paper describes how we implemented one such program, the HSEG hierarchical image segmentation algorithm, which is commonly run on desktop and parallel processors, on a hardware platform designed to mimic a next-generation space-capable data processor [3]. We also describe how we ported the algorithm to the new platform and optimized it, and determine the expected performance gains enabled by our design.
Section 2 of this paper describes the components of our design: the hardware platform as well as the HSEG application and its core function. In Section 3, we show how the pieces come together to form a simulated onboard processing system, along with various design alternatives. Section 4 offers an estimate of the real-world performance gains possible using the simulated platform. Concluding remarks are found in the final section.
SYSTEM COMPONENTS

Onboard processing platform
We expect that in the near future, spacecraft will be designed with high-performance hybrid processors with general-purpose processors (CPUs) and large-scale field-programmable gate array (FPGA) units with integrated digital signal processing (DSP) resources. The hardware platform used in these experiments is a Xilinx ML507 development platform, based on the Virtex-5 processor. This specific model includes an integrated PowerPC 440 processor and an FPGA with 16,000 configurable logic blocks (CLBs) and 128 DSP blocks, all operating at 400 MHz. The board has a 100 MHz front-side bus to connect to its onboard memory. In our implementation, the PowerPC runs Linux, which allows the HSEG program to run as it would on a desktop computer. By modifying the code (HSEG is open-source software), we can use the FPGA and DSP resources to implement certain functions in hardware.
Hierarchical image segmentation
HSEG ("Hierarchical SEGmentation") is a hierarchical image segmentation program for hyperspectral images. It partitions images in the spatial domain into regions or clusters according to their similarity in the spectral domain. It operates by forming a distinct region around every pixel, then deciding which regions are similar enough to be merged. As regions merge, they become larger and fewer in number. HSEG records the order in which regions are merged, so that the user can explore a rich representation of the structure within the data. Figure 1 shows how HSEG segments a small hyperspectral image into varying numbers of regions. The eighteen regions shown in panel (d) coalesce first into a single water region (panel (c), center), then into two land and water regions (b).
Dissimilarity Criterion
One key function within HSEG is its dissimilarity criterion, which computes how dissimilar two regions are, and therefore at what point they should be merged [4]. The function can be defined in multiple ways, but in our implementation the function resembles the Euclidean distance between two points (a and b) in a space with as many dimensions (n) as there are bands in the hyperspectral image:
$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$
The larger the distance, the more dissimilar the two points are, and the lower the chance they will be merged. Each region is represented in the calculation by a representative pixel, derived from the representative pixels of the smaller regions that were merged to form the current region. (This recursive definition stops with the one-pixel regions, whose representative pixels are the same as the single pixel they contain.) Larger images contain more pixels, which require more region dissimilarity comparisons. Across a range of image sizes, HSEG spends 85% of its time computing region dissimilarity. This makes the dissimilarity function a ripe target for hardware optimization.
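To make the merging process concrete, the following is a toy Python sketch, not the HSEG code itself. It assumes a one-dimensional strip of pixels in which only neighbouring regions may merge, uses a plain Euclidean distance as the dissimilarity criterion, and assumes that a merged region's representative pixel is the size-weighted mean of its parents' representatives. The names `dissimilarity` and `merge_hierarchy` are hypothetical.

```python
import math


def dissimilarity(a, b):
    """Euclidean distance between two representative pixels."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def merge_hierarchy(pixels):
    """Greedily merge neighbouring regions on a 1-D strip of pixels.

    Returns the merges in the order they happened, each as a pair of
    member-index lists, so the hierarchy can be replayed afterwards.
    """
    # Each region is (representative pixel, pixel count, member indices).
    regions = [(list(p), 1, [i]) for i, p in enumerate(pixels)]
    merges = []
    while len(regions) > 1:
        # Pick the most similar pair of neighbouring regions.
        k = min(
            range(len(regions) - 1),
            key=lambda j: dissimilarity(regions[j][0], regions[j + 1][0]),
        )
        (rep_a, n_a, mem_a), (rep_b, n_b, mem_b) = regions[k], regions[k + 1]
        # ASSUMPTION: new representative = size-weighted mean of the parents.
        rep = [(n_a * x + n_b * y) / (n_a + n_b) for x, y in zip(rep_a, rep_b)]
        merges.append((mem_a, mem_b))
        regions[k:k + 2] = [(rep, n_a + n_b, mem_a + mem_b)]
    return merges
```

Every pass of the loop evaluates the dissimilarity once per neighbouring pair, which illustrates why this function dominates the run time as images grow.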
HARDWARE IMPLEMENTATION
Using profiling tools on a desktop computer running HSEG, we find that each call to the dissimilarity function consumes roughly 6480 clock cycles. The goal of hardware acceleration is to reduce the number of clock cycles (and therefore wall clock time) needed to perform each call. With so much of the program run time devoted to the dissimilarity function, even small efficiencies per function call can translate into large performance improvements.
The dissimilarity function can be decomposed into the set of discrete arithmetic operations a computer must perform to compute it: one two-element subtract per image band (hyperspectral images typically contain about 200 bands), one squaring (self-multiply) operation per computed difference, one large summation of all the squared differences, and a single square root. Figure 4 (a) shows these steps as the CPU performs them in the standard software-only algorithm.
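The four phases of this decomposition can be written out as a short Python sketch. It is an illustration of the arithmetic only, assuming a Euclidean-style criterion; the function names `dissimilarity_staged` and `operation_counts` are hypothetical and do not come from the HSEG source.

```python
import math


def dissimilarity_staged(a, b):
    """Compute the dissimilarity in the four phases described in the text."""
    diffs = [x - y for x, y in zip(a, b)]   # one subtraction per band
    squares = [d * d for d in diffs]        # one self-multiply per difference
    total = sum(squares)                    # one large summation
    return math.sqrt(total)                 # a single square root


def operation_counts(n_bands):
    """Arithmetic operations needed for one call on an n-band image."""
    return {
        "subtract": n_bands,
        "multiply": n_bands,
        "add": n_bands - 1,  # combining n values takes n - 1 additions
        "sqrt": 1,
    }
```

For a 200-band image this amounts to 200 subtractions, 200 multiplications, 199 additions, and one square root per call.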
Each of the hardware platform's 128 DSP blocks can perform one addition/subtraction and/or one multiplication operation in three clock cycles. These blocks are useful for every phase of the dissimilarity computation except the square root. However, since they are few in number, they should be allocated wisely and augmented with other hardware resources for best results. The rest of this section will describe a basic hardware-accelerated implementation and an improved version with a few useful augmentations. 
Basic hardware implementation
The major advantage to hardware acceleration is parallelism. HSEG's dissimilarity function relies on several single-instruction multiple-data (SIMD) operations. While some desktop CPUs have dedicated SIMD units, the hardware platform's general-purpose CPU does not. However, the platform's DSP units can fill the gap, performing a mix of add, subtract, and multiply operations in parallel.
Our basic hardware implementation uses 32 DSP units as subtractors, performing all the pair-wise subtraction operations in 14 clock cycles, in multiple batches. Likewise, 32 more are configured as multipliers to square each of the differences. Since the differences are processed in multiple batches, the multiplier units are in fact multiply-accumulate (MAC) units, each internally computing the sum of all the values it squares.
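The batching scheme can be modelled in software as follows. This is a behavioural sketch only, assuming that band i is assigned to lane i mod 32 (the text does not state how bands are mapped to DSP units) and making no attempt to model cycle timing. The name `batched_partial_sums` is hypothetical.

```python
import math

NUM_LANES = 32  # subtractor/MAC pairs, as stated in the text


def batched_partial_sums(a, b, lanes=NUM_LANES):
    """Model 32 subtract + multiply-accumulate lanes working in batches.

    Returns the per-lane accumulators and the number of batches used.
    """
    acc = [0] * lanes
    n_batches = math.ceil(len(a) / lanes)
    for batch in range(n_batches):
        start = batch * lanes
        # In hardware, all lanes of one batch operate in parallel.
        for lane in range(lanes):
            i = start + lane
            if i < len(a):            # the last batch may be partly empty
                diff = a[i] - b[i]    # subtractor stage
                acc[lane] += diff * diff  # MAC stage: square and accumulate
    return acc, n_batches
```

With 200 bands and 32 lanes, the model needs seven batches, and it leaves 32 partial sums that still have to be combined into one value.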
While the subtraction and squaring steps can be performed with no interaction between DSP units, the summation step involves the merging of products from every multiplier. A common way to accumulate large numbers of values is with an adder tree [6]. The first level of the tree adds a number of input values in pairs, generating half as many sums as output. These sums are, in turn, summed by the next level of adders in the tree. For m multipliers, the tree contains (m-1) adders and has a depth of ceil(log2(m)). In our design, products from 32 multipliers are accumulated by a tree with 31 adders arranged in 5 levels. In total, our design makes use of 95 of the 128 available DSP slices.
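A small Python model of a pairwise adder tree confirms the adder count and depth quoted above. It is a sketch of the general technique rather than of our FPGA netlist, and it assumes that an unpaired value at any level simply passes through to the next level; the name `adder_tree_sum` is hypothetical.

```python
def adder_tree_sum(values):
    """Sum a non-empty sequence with a pairwise adder tree.

    Returns (total, number of adders used, tree depth in levels).
    """
    level = list(values)
    adders = 0
    depth = 0
    while len(level) > 1:
        next_level = []
        # Add neighbouring values in pairs; each pair uses one adder.
        for i in range(0, len(level) - 1, 2):
            next_level.append(level[i] + level[i + 1])
            adders += 1
        if len(level) % 2:
            next_level.append(level[-1])  # odd value passes through unchanged
        level = next_level
        depth += 1
    return level[0], adders, depth
```

Calling `adder_tree_sum` on 32 inputs uses 31 adders across 5 levels, matching the (m-1) and ceil(log2(m)) expressions for m = 32.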
The drawback to involving DSP units in the calculation is that the two representative pixels, commonly comprising more than 200 data values apiece, must be transferred from the CPU's main memory to the DSP units within the FPGA fabric. Despite a high-speed memory bus, the function's performance is limited by its data-intensive nature. Figure 5 shows that the basic implementation completes in 805 clock cycles, 8.04 times faster than the software-only implementation. It also shows the proportion of time dedicated to transferring data to the FPGA.
Improved hardware implementation
Caching is a common technique for reducing expensive memory transfers in computer systems [7]. HSEG does not compare regions in any set pattern; however, after studying the algorithm with profiling tools, we estimate that in 15% of all calls to the dissimilarity function, one of the two operands is identical to one used in the previous call. Therefore, a very simple cache that holds the two pixels compared in the most recent call would eliminate the need for a large number of data transfers. An image with 200 spectral bands would require 3.2 kilobits of memory within the FPGA fabric to cache a single pixel. In our improved hardware design, we implemented a two-pixel cache with a 15% single-pixel hit rate and a 5% double-pixel hit rate.
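The behaviour of such a cache can be sketched in Python. This is an illustrative model, not our FPGA design: it assumes regions are identified by integer ids, that the single-pixel and double-pixel hit rates describe disjoint cases, and that each band value occupies 16 bits (inferred from 3.2 kilobits for 200 bands). The names `TwoPixelCache`, `expected_transfers`, and `BITS_PER_BAND` are hypothetical.

```python
BITS_PER_BAND = 16  # ASSUMPTION: inferred from 3.2 kilobits per 200-band pixel


class TwoPixelCache:
    """Holds the two representative pixels used by the most recent call."""

    def __init__(self):
        self.cached = ()    # region ids currently held in the FPGA fabric
        self.transfers = 0  # pixels moved from main memory so far

    def compare(self, id_a, id_b):
        """Record one dissimilarity call; return pixels that must be sent."""
        needed = [r for r in (id_a, id_b) if r not in self.cached]
        self.transfers += len(needed)
        self.cached = (id_a, id_b)  # the cache always holds the latest pair
        return len(needed)


def expected_transfers(single_hit=0.15, double_hit=0.05):
    """Average pixels transferred per call, treating the hit cases as disjoint."""
    miss = 1.0 - single_hit - double_hit
    return 2 * miss + 1 * single_hit + 0 * double_hit


def pixel_cache_bits(n_bands, bits_per_band=BITS_PER_BAND):
    """FPGA memory needed to hold one representative pixel."""
    return n_bands * bits_per_band
```

Under these assumptions the average call transfers 1.75 pixels instead of 2, and caching one 200-band pixel takes 3,200 bits.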
In the basic design, we opted to leave the role of computing the final square root to the general-purpose CPU, since the DSP slices cannot compute square roots. However, our hardware platform's PowerPC CPU lacks a floating-point unit (FPU); to calculate the square root of a number, it must emulate the floating-point calculation in its integer-based arithmetic unit, consuming dozens of clock cycles. Instead, we can allocate a significant portion of the (so far unused) FPGA logic cells to provide the CPU a hardware-based floating-point coprocessor. This co-processor can calculate square roots about six times faster than the CPU [8]. Figure 4 (c) shows how these two modifications to our basic implementation form an improved design.
The improved design completes one call of the dissimilarity function in 698 cycles, a 15% improvement from the basic design and 9.28 times faster than the software-only implementation. 
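The per-call figures above can be related to whole-program run time with Amdahl's law. The sketch below is a back-of-envelope estimate only: it assumes that the 85% share of run time measured for the dissimilarity function carries over unchanged to the target platform, which may not hold given that platform's emulated floating point. The function names are hypothetical.

```python
SOFTWARE_CYCLES = 6480   # per call, software-only (from the text)
BASIC_CYCLES = 805       # per call, basic hardware design (from the text)
IMPROVED_CYCLES = 698    # per call, improved hardware design (from the text)
DISSIM_FRACTION = 0.85   # ASSUMPTION: profiled share applies on the target too


def per_call_speedup(baseline_cycles, accelerated_cycles):
    """Ratio of cycles per call before and after acceleration."""
    return baseline_cycles / accelerated_cycles


def amdahl_speedup(fraction, speedup):
    """Whole-program speedup when only `fraction` of run time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / speedup)


if __name__ == "__main__":
    for name, cycles in (("basic", BASIC_CYCLES), ("improved", IMPROVED_CYCLES)):
        s = per_call_speedup(SOFTWARE_CYCLES, cycles)
        overall = amdahl_speedup(DISSIM_FRACTION, s)
        print(f"{name}: {s:.2f}x per call, {overall:.2f}x overall (estimate)")
```

The estimate is sensitive to the assumed fraction, so it should be read as a rough guide rather than a measured result.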
CONCLUSION
Next-generation Earth-observing spacecraft like HyspIRI plan to collect a constant stream of imagery at very high data rates. Onboard image processing algorithms can make sense of the data in real time, greatly reducing the amount of time from observation to insight by a data user. We showed that one such algorithm for hierarchical segmentation of hyperspectral imagery can be implemented on the kind of high-performance hybrid processor likely to be available on these spacecraft. Our basic design shows how processing time can be greatly reduced by offloading calculations to parallel processing units, and our improved design shows that further reductions are possible with careful attention to implementation details. Future versions of our system could incorporate advanced caching techniques or pipelining to use each hardware resource more efficiently and hide the effect of data transfer time on overall performance. Such advances in hardware implementation are needed to develop image processing systems able to keep up with the torrent of data flowing from next-generation imagers. 
