The matched filter is an important kernel in the processing of hyperspectral data. The filter enables researchers to sift useful data from instruments that span large frequency bands and can produce Gigabytes of data in seconds.
Introduction
For a decade or more, the goal of every microprocessor vendor was to produce very high clock rates. Those clock rates were what customers thought they wanted, and microprocessor vendors obliged by heavily pipelining architectures and pushing clock rates to precipitous heights. Unfortunately, at about 3 GHz, the power requirements skyrocketed and the performance -largely due to pipelines as deep as thirty stages -had reached a limit.
This stall in increasing performance forced designers to look elsewhere for ideas. Field-Programmable Gate Arrays were one consideration -customized logic that could be reconfigured to perform whatever operations required more efficiently than any general purpose device. Vector processors, an idea from decades past, was resurrected for special purposes and then reclaimed by the general purpose community. We explore some of these architectures as applied to specific problem, matched filtering, in this work.
Field-Programmable Gate Arrays (FPGAs) have recently shown great gains in performance. Even though clock speeds of current microprocessor-based systems are 10-30 times that of a typical FPGA based design, the large amount of spatial parallelism afforded by modern FPGAs offers the potential for speedup. FPGAs now deliver peak floating-point performance equaling or surpassing that offered by microprocessor-based systems [13] , a trend that will likely intensify in the future as FPGA logic area continues to grow.
Other architectures, including the IBM Cell processor and Graphics Processing Units (GPUs), can act as accelerators for general purpose computing. Although the devices were initially designed for use in gaming, they have been applied to general purpose computing as well. [7, 8] The Cell is a part of the Sony Playstation 3, and GPUs, such as the NVIDIA GeForce series that target high-end gaming, are used for accelerating the rendering performance on commodity CPU-based systems.
On all platforms, the devices are meant for a large, consumer audience. Cell and GPUs are targeted at the gaming market, and FPGAs are widely used in telecommunications and even in consumer electronics. This enables the prices to be relatively low compared to the cost of highly customized chips, and make them very interesting and accessible to the research community.
The ability to stream data to/from memory and overlap control and data flow, combine to help these coprocessorbased systems attain high levels of performance. Capitalizing on the potential of these coprocessors, supercomputing vendors such as SRC Computers, IBM, Cray and SGI are now offering high-performance computing systems coupling standard CPUs with FPGA, Cell, and other coprocessors. These system architectures make it feasible to adapt existing compute-intensive codes to use coprocessor based accelerators.
Matched Filter
The matched filter ( Figure 2 ) is a vital kernel for extracting useful data from hyperspectral imagery. The term Hyperspectral implies that the spectra collected covers a very large swath of the frequency spectrum. Generally, this data is broken into many frequency bands that are independently collected. Generally this is done through the use of a prism ( Figure 1 ) for near-visible bands or a set of filter banks for wider bands. In our application, single-precision floating point data is collected for each band over a few hundred bands. The main unit of data in hyperspectral processing is the datacube, layers of 2-D images for each spectral band of interest. The signature is simply a vector of single-precision floating point values for every spectral band, in our case, 240. The matched filter takes the hyperspectral data and matches it against a particular signature. The signature is a vector of coefficients that represent the spectral reflection or transmission of a particular material. Applications of the matched filter include urban planning, ecology, mineral exploration, wetland studies, etc. A common use is in detecting chemical plumes. An interesting plume might signify pollution, illegal arms production, contraband drug manufacturing, etc. It can also be used to detect particular types of vegetation. For instance, the spectral signature of the invasive species athel tamarisk is unique [5] . When this signature is applied against a datacube, the tamarisk is highlighted and is distinct against the background. Because tamarisk is such an aggressive species out of its native habitat, it must be destroyed if it nears fragile ecosystems. The matched filter is useful for detecting it remotely and monitoring its spread.
Lens
The version of the matched filter we have implemented on various platforms is focused on long-term monitoring with a large number of signatures. In terms of kernel design, this means that the signatures are pre-processed -which reduces overhead and setup costs -but that there will be huge volume of data that has to be processed as quickly as possible.
The various implementations we have investigated are customized for each platform. In all systems we have the ability to partition the design between a CPU and the coprocessor. Because the strengths of each platform vary, we have attempted to exploit the strengths of each in our designs. Table 1 illustrates how we have partitioned the designs in each of the platforms. In the following sections, we will attempt to explain and justify out design decisions, while simultaneously exploring the strengths and weaknesses of each platform. 
Kernel Blocks
We implement a reduced set of operations for the match filter. Using pre-processed signature templates, the operations required are simplified. The kernel remains a realistic application as the signature templates can be generated offline with no loss of practicality. The simplification focuses attention on the real-time costs of data transfer and bulk computation. Figure 3 illustrates the basic blocks of the kernel. First, the signatures and the datacube are read in from the input file. The datacube is then transposed, as the data arrives from the sensor in spectral frames, meaning that the entire 2-D frame for a spectral band is contiguous. After one frame is taken, the filter bank switches to the next spectral band, and another frame is saved. Because the matched filter loops over all of the spectral bands for a single pixel during the main filter phase, it is more efficient to have all of the spectral data for each pixel in sequential order. Otherwise, the memory stride would be the size of an entire frame, or 240,000 32 bit words. This is not efficient, and thus the data is transformed early on in the process.
The main pixel loop is performed for every pixel in the image. In this phase, each value in the datacube will be accessed several times. Here, the ability to reuse data and parallelize computation is highly valuable. First, the values of an entire column are averaged. Second, that average is subtracted from the entire column. Third, the dot-product is executed over the entire pixel column and the signature. Various scalar operations are executed on the result of the dot-product, and the results are returned. The process is repeated over all pixels in the frame. Table 2 illustrates the overall time required for each block. The timings were taken using a Xeon 3.2 GHz with 4 GB of RAM and a 10k RPM SCSI hard drive. Processing each signature against the datacube requires roughly 3 sec total. The file only needs to be read once. The matrix transpose of the datacube is handled differently for each implementation, but on the CPU it requires 3.05 seconds per signature to process the 230 MB cube.
Related Work
Zhuo et al. [14] explored a variety of linear algebra operations, including dot-product. As dot-product is the main kernel we implement on the FPGA, this is an important prior work. The authors' strategy for dealing with the floating-point latency is to execute the dot product accumulation as a tree-based reduction circuit. This requires more routing and more adders than the systolic array approach that is enabled by our application. The matched filter application allows us to be less concerned with the latency of any particular dot-product result and more focused on the overall throughput.
Meredith et al. [10] have implemented a variation of the matched filter application on GPU co-processors. The authors report a 26.5× speedup over a Pentium 4 CPU. This performance improvement is higher than what we claim, but their application is better suited for the GPU as it is more computationally intensive. Our approach to the application only processes the pre-prepared signature templates and data cube, rather than processing the signature library.
As processing the signature library requires the calculation of a covariance matrix, and thereby a matrix inversion, it is of high computational complexity. Because we expect to process the signatures once, and then apply them against datacubes over long time spans, it is sensible for our application to only support pre-processed signature databases.
FPGA Implementation
The implementation of the matched filter on the FPGA has been carefully partitioned between the CPU and the FPGA in order to make best use of the hardware. Modern FPGAs allow a designer to configure any arbitrary logic function. Based on the design, these operations can function at high clock rates and behave similarly to a CPUbased version. Because the FPGA does not have to be a "general-purpose" device, a designer can take advantage of the application-specific attributes of the device to achieve higher performance than a CPU. This can be true even with a CPU that operates at perhaps 10× the clock frequency of the FPGA. However, performance is only achievable through hardware parallelism, and careful design.
Because an FPGA -for any given configuration -is responsible for only one specific application, the entire resources of the device can be leveraged for that function. In the case of the matched filter, we have implemented two main kernels on the FPGA, the mean calculation and subtraction, and the first dot product. We believe this is the best of the FPGA hardware, due to the complexity of implementing the other scalar and vector operations. Most importantly, referring to Table 2 , the mean computation and the dot product require 1.9 seconds per signature. The other computational components require an additional 1.45 seconds. The largest timing component is the file load and matrix transposition, which requires 4.32 and 11.75 seconds, respectively. However, the load and transposition only has to be executed once per datacube. As the data has to be transfered to the FPGA, the memory transfer costs will not be reduced, but they will be split across the 18 parallelized signatures.
It is possible to implement dozens of single-precision dot product units using the on-board memory and hardware resources. The resource requirements for the single-precision multiply and accumulate allow for significant parallelism without much control overhead.
While the dot-product is straight-forward to implement in hardware, the other aspects of the matched filter are not appropriate for hardware. For instance, after calculating the dot-product, various scalar multiplications are performed on the dot-product result. While it would be possible to implement this functionality in hardware, these units would be both expensive and under-utilized. One of the scalar operations is a floating-point square root. This kernel would occupies several times more area compared to an adder or multiplier, but would only be in operation at the end of every dot product. This means the duty cycle is roughly one result every 240 cycles. Since this is not the most effective way to use the scarce FPGA resources, the infrequent, yet expensive scalar operations are kept on the CPU. This decision will impact the overall performance due to the scalar and setup operations occupying roughly 50% of the sequential signature processing time.
In this case, we implement the matched filter on the Cray XD1. The Cray XD1 [4] supercomputer combines high performance microprocessors, a high speed, low latency interconnect network, and reconfigurable computing elements. This provides an environment where data transfer latency and bandwidth associated with I/O busses are greatly reduced. The tight integration of processors and reconfigurable computing allows a computation problem can be split between the CPU and FPGA with close synchronization and fast communication between software and hardware.
Architecture
In the first stage of operation, the signature values are loaded onto the Cray XD-1's external SRAM memory banks. These will be moved in and out of the FPGA as each set of templates is completed. The FPGA's internal block memory resources are then loaded from SRAM memory. The SRAM words are 64-bit, allowing two 32-bit floating point values to be loaded into the BlockRAM in each cycle.
The computational stage of the dot-product is a streaming of image data through the FPGA. The dot-product units are arranged in a systolic array (a notion originally developed in [9] ). The system architecture is depicted in Figure  4 . The power of the systolic array lies in its ability to reduce total device interconnect. By allowing only short connections within a small, repeated unit, a design will achieve a higher operating frequency as well as lower area consumption. Each unit is only connected to the previous and following units, and each controls its own data movement and computation. This eliminates the need for large, multiplexed data and address buses. The systolic array prevents the proliferation of global routing resources. The clock is the only element absolutely necessary and thus we desire to limit global routes to it.
We implement the dot-product systolic array using the Sandia floating point cores [6] .
Through the use of the systolic array we allow for higher clock frequencies, decreased interconnect, and simple, easily scalable units. We implement the parallelizable and computation intensive operations within the systolic array and implement the serial and control intensive operations within the host CPU. 
Extracting Parallelism
The mean computation and subtraction, requiring 1.3 seconds on the CPU, is a prime candidate for FPGA implementation. The computation can be done in a data-flow manner, computing the average and subtracting the mean from all of the incoming data as it flows into the linear array. However, this does require a long pipeline. Because of the 10 cycle latency of the floating point adder (for producing the sum), multiple sums of pixels must be interleaved. The sum is over the entire 240 element spectrum, a 2400 cycle delay is required, provided by a BRAM FIFO.
The design of the one-signature-per-unit systolic architecture is meant to provide parallelism over signatures. A common struggle in implementing systems with floating point of FPGA is the latency associated with floating point cores. In order to provide a fast clock frequency, it is necessary to pipeline the units, and this means that any operation that requires feedback, as in a counter or a series of 4 202 210 210 additions, cannot accept new inputs every cycle. This is addressed in [14] through the use of multiple adder elements and elaborate scheduling. Our approach takes advantage of the vast number of dotproduct operations required in the matched filter. For every datacube, there is a dot-product executed for every pixel across every signature template. For our datacube, the dot product is executed on a 240 element spectrum vector for each pixel, over thousands of spectrum signature templates. Because we are interested in the throughput of entire datacubes rather than a fast result on a single operation, we can interleave computations to mitigate the latency. Using just one adder of k cycles latency, and interleaving k separate dot-product calculations, the dot-products for k pixels can be computed simultaneously without incurring any additional resource costs. Thus, the parallelism is both signature-wise across the systolic array and pixel-wise within a single unit.
FPGA Results
Using the Cray XD-1 with a Virtex 2-Pro 50 -7, the design was placed and routed with 18 dot product units to 130 MHz. This rate is due less to the design of the dot product units and more to the SRAM interfaces and arrangement of the pin constraints for the FPGA board. Each unit requires 4 BlockRAM elements and roughly logic cells.
The partitioning choices have a big impact on the performance of the design, and Amdahl's law becomes a limiting factor. Only the dot-products are moved to the FPGA, and the two dot-products account for the preponderance of the computation. As mentioned in Section 2.1, much of the running time per signature on the CPU is actually the data communication time. However, the dot product and mean computations do account for 1.9 of the 3.35 seconds of computation time.
Calculating the per-signature speedup is complicated by the systolic array. Because the system has 18 dot-product units, computing the result for one signature requires essentially the same time as calculating 18. At 130 MHz, a 240x1000 frame with 240 spectral components requires, on average, 3.5 seconds to transfer the previously transposed datacube, perform the computation, and return the results to the CPU. This is equal to 0.17 seconds per signature for only the mean subtraction and dot-product computation. Integrating the FPGA-based dot-product with the rest of the matched filer application, the total per-signature time is roughly 2.36 seconds.
While the architecture could easily be implemented in a custom ASIC -in fact, the simple units that make up the systolic array are designed explicitly for ease of ASIC implementation -the use of FPGA allows the user to utilize parameterized designs which allow for variable numbers of spectral bands as well as optimized memory sizes for a particular problem. As well, FPGAs allow the design to be scaled upward easily as process technology allows for everlarger gate counts.
Matched Filter Implementation on the Cell

Cell Broadband Engine
The Cell Broadband Engine (CBE) is a new microprocessor architecture that extends the 64-bit PowerPC Architecture. The Cell is the result of collaboration between Sony, Toshiba and IBM. Despite the fact that the processor was primarily designed for game consoles and media centered electronic devices, it was designed to overcome fundamental problems in microprocessor design. To this end, the Cell is being used in other application areas, such as supercomputing [12] .
The Cell consists of a single-chip multiprocessor with nine separate processors which share a single coherent memory. The cell follows the current microprocessor trend of added multiple cores to a single chip, but instead of all of the cores being identical, the Cell has one PowerPC Processor Element (PPE) and eight Synergistic Processing elements (SPE). Each of these elements serves a different purpose.
The PPE is a general purpose 64-bit PowerPC architecture processor capable of running 32-bit and 64-bit operating systems and applications. It supports two simultaneous threads of execution, which appear to the software as two independent processing units. The PPE has a typical virtual memory subsystem with a 32 KByte L1 instruction cache, a 32 KByte L1 data cache and a unified 512 KByte L2 cache. The caches and memory subsystem allow the PPE to see all of the shared memory as a flat memory space. The PPE also supports integer, floating point, and a SIMD/vector unit. The SPE is optimized for running compute-intensive kernel functions and is not optimized for larger applications or operating systems. The SPE is an independent processor that runs its own program, but has full access to the shared I/O and memory. The SPE is called synergistic due to is reliance on the PPE to run the operating system and to provide the high-level application control. In turn, the PPE relies on the SPE to provide application performance. In some sense, this is similar to the approach we use in the FPGA and GPU, in which bulk computation is carried out on the SPE and complex control is handled by a host processor.
The SPE achieves its compute intensity through a special SIMD instruction set. Although the SPE can be programmed in high-level languages, it has a special Vector/SIMD Multimedia instruction set extension for increasing the parallel computation on each SPE. Operations that do not take advantage of the vectorization capabilities reduce overall performance. This is due, in part, to the 128 128-bit registers of the SPE. All operations use the wide registers regardless of data type, meaning that scalar operations are not using the available compute cycles. Vectorization allows the most efficient use of the registers.
The SPE does not have any cache, but has a 256 kByte software controlled local store (LS). The local store is for both instructions and data, and has no protection or translation for the access from its own SPE. Since there is no cache or memory translation, the SPE cannot directly access the main shared memory. It must transfer the data via DMA to its own local store using the Memory Flow Controller (MFC).
The MFC contains a DMA controller for transferring instructions and data to the SPU's local store. A DMA can be initiated by the SPU, by another SPU or by the PPE. It is the function of the MFC to interface to the Element Interface Bus (EIB) and synchronize the data transfers with all the other processors in the system. The MFC operates asynchronously with respect to the SPU, so that it is possible to overlap DMA transfers with other concurrent operations.
The EIB provide coherent communication between the PPE processor, the SPE processors, main memory storage and I/O. The EIB is a four ring structure for data and a tree structure for commands. The internal bandwidth of the bus is 96 bytes per cycle and it is possible to have more than 100 outstanding DMA requests between main storage and the SPEs. Besides the SPEs and PPE, the EIB is connected to two other elements: the Memory Interface Controller (MIC) and the Cell Broadband Engine Interface (BEI). The MIC provides the interface between the EIB and main storage. The BEI provides the interface, control, and translation for all external I/O communication.
Matched Filter
Despite the limitations in memory for each of the SPEs, the entire matched filter kernel code can be implemented on each of the SPEs. This results in a partitioning of effort in which the PPE loads the data from an external file, allocates memory, loads the program into each SPE and sends the SPEs a "go" signal. The SPEs pull data from main memory and writes the results in location specified by the PPE. Each SPE holds a single signature and DMAs the pixel data from main memory to calculate the matches.
Three main features of the Cell directly impact implementation of the matched filter. First, the limited memory of the local store requires that the implementation use no more that 256 kBytes for both code and data. Second, since the memory working set is much smaller than the total data (230 MBytes), then DMA transfers must be used to page through the data. Although the MFC provides a DMA controller, there are very specific limitations on the size and alignment of that data. Third, there are at least two levels of parallelism that can be exploited to increase the compute intensity.
The limited memory and DMA controller closely impact each other and are the first features that must be dealt with when partition the design. Since the SPE cannot directly access external memory and must explicitly transfer all data via DMA, the allocation of data in the local store is very important for effective performance. The best DMA performance is achieved when transferring the largest DMA size of 16 kBytes. Thus, the data input to the dot-product will be sixteen rows of 256 floats. Each of the sixteen rows are used in the dot-product along with the signature before the next DMA transfer. The signature data changes only after all the data has been processed. This is transferred only once and transformed once before use in the dot product.
The DMA controller supports naturally aligned transfer sizes of 1, 2, 4, or 8 bytes, and multiples of 16-bytes, with a maximum transfer size of 16 kBytes. Peak performance is obtained when the data is 128 byte aligned and the size of the transfer is an even multiple of 128 bytes. The image data cube of the matched filter is three dimensional with the 6 204 212 212 dimensions 240 by 240 by 1000. In order to DMA a 240 float row to the SPE, the data must be rearranged so that the data necessary is adjacent. This inversion of the data is expensive, but allows the DMA to be padded to sixteen 256 float rows (16 kBytes). Otherwise, a single value must be taken from each column to form the data needed for the dot-product. The PPE must arrange this data so that it is available for the SPE to transfer to their local store.
There are two levels of parallelism that can be exploited to increase the compute intensity. First, multiple SPEs can operate in parallel. Second, vector operations can be used to further compute effectiveness. These combine to allow for a high degree of parallel computation, while allow for individual SPEs to execute their own program paths.
Parallelism is achieved by transferring a single signature to each of the SPEs and have each SPE perform the filter using all of the data. Since the application processes thousands of signatures, this approach easily keeps all of the SPEs busy and can be made to efficiently process all of the data input. The individual SPEs calculate the dot-product using vector add and multiplies, which allow four floats to be processed concurrently. Not all operations can be vectorized, but vectorization of key computations increases the parallelization at the operation level. The SPE also has two execution pipelines and can schedule certain operations to run in parallel on its own execution pipelines.
Cell Results
The initial results on the Cell were disappointing. The initial implementation on the Cell required over 30 seconds to process five signatures of data. Measurement of the execution time determined that the only 4 seconds were required for processing the 5 signatures and pixel image data. Loading and arranging the data to used by the SPEs required the remaining 26 seconds. This was due to the large strided accesses that could not take advantage of caching. Since the original arrangement of the data was arbitrary, preprocessing the data was reduced to just 7 seconds, by preformatting the data.
The results above were obtained on a older version of the cell with 512 MBytes of memory and a 2.1 GHz processor. Using a the updated Cell with 1 GByte of memory and a 3.2 GHz processor, the file load time is only 1 second, and processing time is reduced to 3 seconds.
It is interesting to note that from one to eight signatures take 3 seconds on the Cell. Since each of the SPEs run the same program in parallel, adding more SPEs does not impact the overall run time. To improve on this, the communication time must be reduced by instead of each SPE pulling data from main memory individually, one SPE should pull data and pass it along the EIB ring to the next SPE. This would form a pipeline for processing the data and reduce the need to transfer blocks that had already been fetched from memory. Also, since the filter signatures are relatively small, it may be possible to have each SPE process a small number of these signatures and reuse the pixel data in that case as well.
GPU Implementation of Matched Filter
Overview of GPU Architecture
The Graphics Processing Unit (GPU) is a CommodityOff-The-Shelf (COTS) product that is meant for accelerating the rendering of images to the screen. The intended audience of these products is largely the gaming market, where high frame rates and elaborate, complex 3-D graphics require state-of-the-art technology. Because of the demand from the gaming community, companies such as NVIDIA [2] and AMD/ATI [1] have provided processing capabilities that have outstripped the development of general-purpose microprocessors.
Part of the ability of GPU developers to innovate comes from the restrictions that come with their target applications. Graphics code tends to be image-centric, with data access and result production occurring in a very predictable manner. In particular, codes tend toward easily vectorizable computation with limited data usage. Other common characteristics include [11]:
Single-instruction, multiple data (SIMD) structure -the same code is executed for each pixel of an image. Because all of the processors are performing the same operations, there are fewer synchronization problems and fewer resources dedicated to instruction queues and branch prediction. This architectural structure does not easily support branching, as data movement needs to occur in lockstep through many parallel processors.
Single result outputs -this is a particular restriction of GPU architectures before the NVIDIA GeForce 8800 [2] , released in late 2006. This model assumes that all calls to a subroutine will produce a single return value. This is a reasonable assumption for many graphics-centric applications, as well as image processing applications that do not necessarily result in a displayed image. Convolution, for instance, and template matching produce a single output.
In this model, computations executed on the GPU behave much like single-return value functions. This can be inconvenient, as we will see later, as many computations have side effects and multiple results. In these situations, the single pixel output restriction causes us to repeat computation or build elaborate workarounds.
The single pixel output restriction also means that scattering data is impossible. Because a function can only produce one output, and that output is ultimately the rendering output for a pixel, it is impossible to randomly place data into memory. While gathering data from arbitrary locations is supported, data outputs are restricted in the graphics 7 205 213 213 paradigm. In that paradigm, pixels are rendered to a image frame, and scatter is rarely required.
Data locality -this is less of a constant across all GPU applications and more of determination of performance for a given application. Considering simple kernels like convolution and template matching, the data locality is high, with each pixel output only considering the pixels in its immediate vicinity. While the internal architecture of GPUs is proprietary and closely held by NVIDIA and ATI, the importance of cache locality remains. Neighborhood operators continue to be a strength of GPU devices even as their capabilities become more general.
Vectorizable Code -There are two paradigms for this approach within the GPU. Because of the SIMD nature of the intended applications, there is benefit in performing computation in a vector format. For instance, in a series of computations on successive pixels, the computational pipeline can remain full as there is a large volume of data and independent computation.
The second opportunity for vector acceleration is by packing multiple values into the Red, Green, Blue, and Alpha (RGBA) components of a pixel. In some situations, this allows the computation to be spread across parallel vector units. However, the performance of the packing seems to be a driver and implementation dependent. We have observed kernels where using only the Red channel is faster than packing the data across RGBA.
These restrictions and behavioral patterns have allowed the advancement of GPU architectures and performance at a much higher rate than general purpose microprocessors. Current GPU architectures actually appear much like a collection of classic vector supercomputers on a single device. Because of the restrictions on the form of the output, the problems with cache coherency and other memory issues that complicate other multi-processing architectures are avoided.
Before the details of the implementation are presented in Section 6.2, some GPU terminology should be defined. The texture is the basic storage structure on the GPU. It is used normally for providing storage for a texture, or a series of values that can be mapped to a surface to provide the appearance of a 3-D surface. This is used because it is computationally cheaper to map a tiled texture to a surface than it is to actually store the surface with detailed surface mapping. For general purpose use, matrices can be mapped to texture memory, and recalled randomly. Multiple textures can be used to calculate the output of most arbitrary functions.
Implementation Details
We have implemented the main pixel loop of the matched filter on a NVIDIA GeForce 7900 GTX, hosted by an Intel Pentium D 2.8 GHz with 2 GB RAM. The main pixel loop is not the whole of the computation in the application, but it does represent at least 85% of the execution time. Restricting the GPU's usage to the pixel loop simplified development time as well as allowing us to focus our attention on the most time consuming part of the application.
The computation is divided into four parts in order to work within the confines of the GPU architecture. The first operational block is the calculation of the mean for each row of the hyperspectral data cube, and the subsequent subtraction of the mean from each element of the row. The calculation for a single row is as follows:
for (i = 0; i < n; ++i) sum += b[i]; mean = sum/n; for (i = 0; i < n; ++i)
The mean calculation reveals many of the oddities that make programming for the GPU not as straightforward as programming for a general purpose CPU.
Because only one result can be generated per pixel, a vector subtraction can not be executed in one step. Ideally, the summation of the vector could be computed, the division executed, and then that result subtracted from each of the components of the vector. However, because temporary results cannot be stored between individual pixel, two rendering passes are used.
The array b[0.
.n] is laid out as a row in a texture. A block of sums are calculated in one pass, with each b[0..n] row producing one element in a vector of sums. In this format, a single column of the output frame buffer is rendered. The coordinate of the particular row of each pixel to be rendered is used along with a loop variable to sum the entire row. The mean is calculated from the sum and stored as the output pixel Red channel.
The second rendering pass subtracts the previously computed mean from all elements of the texture. Conveniently, while the rendering kernel has changed, the textures remain in memory, allowing the new computational kernel to compute the new texture values from the data still on the GPU.
The computation of the f images and mf images results are the most complicated kernels. They require both the dotproduct of the newly mean-subtracted b[] with another matrix, a[], and a dot-product of b[] with itself. The difference f images and mf images is a collection of scalar operations.
The interesting and problematic aspect of these final results is that they both share the results of the two dotproducts, which are the most data intensive computations of the entire pixel loop. However, because of the singleresult restriction of the GPU, the products must be calculated twice. This is highly wasteful. Feasibly, the dot products could be calculated first and then stored for use in calculating the final results independently. It is currently unclear if the overhead cost of the two additional rendering 8 206 214 214 passes would outweigh the cost of recomputing the dotproducts.
GPU Results
We have implemented the main pixel loop of the matched filter on a NVIDIA GeForce 7900 GTX, in an Intel Pentium D 2.8 GHz with 2 GB RAM. For our input dataset, which contains a single datacube and 5 signatures, the CPU-only time is 19 seconds. Of this time, 4.32 seconds is required for pre-processing, and then 3.05 seconds per signature. The preprocessing stage includes reading and transposing the input data cube.
The GeForce 7900 GTX implementation for the same dataset requires only 12 seconds, almost a 3× improvement. Because only the pixel loop has been modified, the 6 second pre-processing time remains constant. In the pixel loop, however, each signature now only requires about one second. For large signature databases, this means that the signature throughput will approach a 6x speedup over the microprocessor implementation.
Remaining Work
There are some interesting inconsistencies in the results that we are still exploring. The GPU results largely match the CPU results to several decimal places. However, in some cases the results are up to 0.9 in error. This sort of error is rare (perhaps one in 10,000 results) but they do occur. We have executed the code on various platforms and thus various implementations of floating point, and no errors have been observed that approach the differences between a CPU and GPU implementation.
In the following It should be clear that there is no particular pattern -the errors do not occur on the same columns, or in the same number for each signature. However, the results are consistent across runs of the code. The cause is likely accumulated floating point error. This is an issue that we are investigating.
We are interested in exploring various improvements to the GPU design. The time to implement the improvements is one constraint that has limited our ability to explore the design space. It will also be useful to more carefully evaluate the likely effectiveness of some of the more complex changes before they are actually implemented.
The main modification of interest is block signature execution. Currently, a single column is rendered. During the operation of the matched filter, a hyper-spectral datacube is reduced to a single 2-D frame for each signature. Rendering a single column is an effective way of generating results without resorting to large-strided cache-unfriendly 3-D matrix representations. If a complete result frame for a single signature is produced, there is no alternative.
However, we have noticed that there is some opportunity for data reuse as well as an efficient data representation if signatures are processed in parallel instead of the pixellevel parallelism that we are currently using. In this mode, a row is computed in parallel for a block of signatures. This should allow the GPU instruction scheduler to be extract more parallelism.
Results
We have implemented the matched filter on a variety of platforms. In each system, our objective was to produce a system that is reasonable for production use. This largely means that a CPU, bus interfaces, power supplies, and involved support hardware are included in all of the performance measurements, including cost and power.
We have considered several performance metrics. Speedup compared to the CPU implementation is the primary metric, as throughput per signatures is a primary deciding factor when data volumes are high. Cost and power are secondary factors. For large data volumes the difference in price between a commodity GPU retailing for roughly $500 and an $8,000 IBM Cell processor board from Mercury Computing [3], will not justify the difference in performance. The cost of GPUs, FPGA, and Cell-based systems also must include the price of a host machine that can support the add-on board. The ideal approach for large data centers is to maximize the speedup per cost, captured in the second to last column.
Power is of utmost concern in embedded systems. While we are only concerned with a basic matched filter kernel, a more elaborate system would perform the matched filter on the embedded system and then only transmit the interesting bits to the base station. This is an appropriate approach for satellite or aerial reconnaissance. In these situations, power is measurement of instantaneous power when the system is active. Of course, because the add-on cards provide an increase in time performance, the system could feasibly be shut off for some percentage of the time, so energy -(time * power) -is a useful metric. We capture energy in the form of speedup/watt, the final column of the table.
Our results for the various implementations are listed Table 3. The results are normalized against an Xeon 3.2 GHz with 4 GB of RAM and a 10k RPM SCSI hard drive. In all situations, the timing results are based on the average of several runs of the filter. The highest speedup we post, 8.0× 9 In terms of speedup per dollar, the GeForce 7900 GTX GPU far outstrips its competition. 1 Because the GPU is a commodity add-on card that is sold in large numbers, its price is very reasonable compared to the more esoteric FPGA and Cell co-processors. The value of the GPU in performance per dollar is higher than the Cell with similar implementation effort.
While producing slightly higher performance than the GPU implementation, the FPGA does not fare well in the comparisons of energy and cost. While only the meansubtraction and dot-product was implemented on the FPGA, the ability to reuse the data and parallelize computation over 18 signatures is beneficial. Due to the area requirements of floating point operators, the total parallelism available is less than what could be achieved with a fixed-point matched filter. As well, we are limited by the difficulty of implementing the more control intensive and scalar operations of the filter.
Conclusions
In this work, we evaluated the performance of a matched filter algorithm implementation on accelerated co-processor (Cray XD-1), the IBM Cell microprocessor, and the NVIDIA GeForce 7900 GTX GPU graphics card.
We found the FPGA-based implementation to be highly scalable and very effective at calculating the dot-product portion of the computation in parallel, but ultimately found Amdahl's law to be a limiting factor.
The Cell proved to be the highest performing solution from a time and throughput perspective. The Cell implements vector processors that easily map the matched filter, as well as the miscellaneous scalar operations and setup data transfer.
The GeForce 7900 GTX GPU co-processor was found to be the highest value in terms of performance per dollar. The GPU also maps the vectorizable filter computations easily, but requires more elaborate software design as the devices are meant for graphics-centric applications and not general purpose computation.
