Abstract -This paper presents a design for parallel processing of synthetic aperture radar (SAR) data using one or more Graphics Processing Units (GPUs). Our design supports realtime reconstruction of a two-dimensional image from a matrix of echo pulses and their corresponding response values. Key to our design is a dual partitioning scheme that divides the output image into tiles and divides the input matrix into sets of pulses. Pairs comprised of an image tile and a pulse set are distributed to thread blocks in a GPU, thus facilitating parallel computation. Memory access latency is masked by the GPU's low-latency thread scheduling. Our performance analysis quantifies latency as a function of the input and output parameters. Experimental results were generated with an nVidia Tesla C2050 GPU having maximum throughput of 1030 Gflop/s. Our design achieves peak throughput of 293 Gflop/s, which scales well for output image sizes from 2,048 x 2,048 pixels to 4,096 x 4,096 pixels. Higher throughput can be obtained by distributing the pulse matrix across multiple GPUs and combining the results at a host device.
I. INTRODUCTION
The use of electromagnetic waves to produce images that depict detail beyond the response of human vision is of keen research interest for applications such as computed tomography, meteorology and geology (e.g., remote sensing). In particular, radar-based mapping of ground objects can benefit from high performance parallel computing techniques.
In practice, radar mapping involves the reconstruction of a two-dimensional image of an object from a collection of radar pulses, together with parameters of the radar sensor. Typically, a device at a known location emits a pulse, and reflections (echoes) of this pulse from the target are collected as a function of time over an interval. Echoes associated with objects further from the emitter will arrive at the sensor later than those closer to the emitter. The exact time these responses should arrive can be estimated by dividing the distance between the pulse location and target by the speed of light c.
Unfortunately, while predicting the arrival time for a given target is relatively simple, the inversion of multiple echoes to determine the location is challenging due to ambiguities. Given the response data for a single pulse, one cannot distinguish the echoes from two targets at equal distance from the pulse location.
Synthetic aperture radar provides a technique for resolving these spatial ambiguities, based on the exploitation of multiple pulses taken at multiple locations over time, to increase radar sampling density and thus increase effective aperture. Via solution of an inversion equation, the ambiguity between targets can be approximately resolved, allowing a more accurate approximation of sensed objects in their reconstructed versus truthed locations. Additionally, the effect of sampling error due to interference and environmental factors is reduced by averaging of multiple replicates. Unfortunately, SAR-based image reconstruction becomes more expensive computationally as the size of the collected pulse data and desired output image increases. For example, our experiments showed that a 4096x4096-pixel image having 1000 pulses and 32768 range bins per pulse took over three minutes to reconstruct on an Intel i7 six-core 3.3 GHz Intel processor.
Fortunately, the emergence of less expensive parallel architectures offers support for fast but computationally expensive SAR reconstruction, for example, with Graphics Processing Units (GPUs). In addition to efficiency, a GPU's data partitioning scheme, as well as our optimization strategies for peak throughput, can be generalized to a variety of problems with data access pattern characteristics similar to SAR reconstruction. In particular, our scheme can be effectively applied to generate an effective GPU solution to any problem where spatial locality of output data implies spatial locality of the required input data (for example, in computed tomography). Due to space limitations, we focus the remainder of this paper on SAR reconstruction with one or more GPUs.
II. PRELIMINARIES

The Backprojection Algorithm
Several algorithms are available for SAR reconstruction from pulse response data, and instances of these algorithms have been optimized to perform well on sequential systems with relatively low throughput. In designing these algorithms, computational performance, rather than quality of the reconstructed image, tended to be the primary evaluation metric. An algorithm that was rated poorly by this metric is the backprojection (BP) algorithm [3] . BP, known for its high computational cost and good output quality, examines the pairing of every received (postprocessed) pulse with every reconstructed pixel, to estimate object reflectivity at each point in the spatial representation. The details of this algorithm are beyond the scope of this paper, but the algorithm is summarized as follows.
The pulse response data, or pulse response matrix, is partitioned into range bins. Each range bin corresponds to a measured response during a given interval of time after the pulse was emitted. As mentioned in Section 1, responses that arrive later are further from the emitter location, and are thus placed into a higher-numbered range bin. For each pixel in the output image, every pulse and every range bin in that pulse is considered separately, with the value in a given range bin being added to the value of the associated pixel of the reconstructed SAR image. Thus, areas of higher reflectivity will be associated with brighter output pixels or regions.
More formally, the main computational loop of the backprojection algorithm features the pairing of each pulse in the pulse response matrix with each pixel of the output image. Let P denote the set of pulses, with P [i] .bin [j] denoting the jth range bin of pulse i after phase correction, and (P[i].x, P[i].y, P [i] .z) denoting the source of pulse i with respect to some fixed reference point. If each pulse has a range of R start to R end , then the intensity of the output image pixel at (x,y) can be written as
Surprisingly, this model captures the data movement patterns of the backprojection algorithm, although our implementation includes several enhancements to improve output quality. Such enhancements are not the focus of this paper. Additional detail regarding the backprojection algorithm and its enhancements can be found in [3] .
Related Work
The algorithm and design presented herein are tailored to the fast parallel processing of synthetic aperture radar data, as well as reconstruction of a corresponding twodimensional SAR output image. However, these techniques could also be applied to any domain where pulse response data is projected back to form an image of a surface, object, or area. One such application is computer aided tomography, or the construction of three-dimensional models from a set of cross-sectional views, for example, for the purpose of obtaining a three dimensional view of tissue in vivo. Here, each atomic volumetric unit or voxel demonstrates the same properties derived from a reconstructed SAR pixel. Namely, for a given cross-sectional view, each voxel will be associated with a response value that is spatially near the response of its neighboring voxel. Similarly, a given volume of output data will require a predictable quantity of response data for each cross section. The preservation of these properties ensures that the proposed partitioning scheme would also provide an efficient data access mechanism for generating tomograms [2, 9] .
Other potential applications of our algorithm depart entirely from the domain of image sampling. Network tomography, for example, involves the inference of network characteristics from observations taken at a variety of known locations. In this case, each node in the network core is analogous to a pixel in the output image, where each sample location is analogous to a pulse. The latency or reliability of a channel can be inferred from the repeated transfer of packets between locations. The projection of this data back onto the nodes through which packets travelled can produce a visual representation of the network at any point in time, without requiring explicit assistance from core nodes. In this paradigm, spatial locality follows from the route optimization properties of routing protocols [8] . A similar problem involves estimation of ocean temperatures from acoustic wave propagation latencies, because the speed of sound in water is directly related to water temperature. By measuring the propagation time of the wave between a variety of sources and destinations, a set of cross sectional images can be reconstructed that is analogous to the images obtained in computer aided tomography. Small three dimensional units of water, not unlike the voxels of computer aided tomography, become the unit onto which acoustic sensor response data is projected. Spatial locality of input data follows from the output partitioning scheme [10] .
Graphics Processing Unit (GPU)
A Graphics Processing Unit is a parallel computing device having high throughput and relatively low cost. GPUs can be purchased for less than 100 USD, and are found inside many modern desktop computers. Performance of GPUs, despite their price, has been measured at over 1 TFLOPs. Comparatively, high end consumer-grade CPUs have not achieved more than 150 GFLOPs per chip [4] .
The throughput of GPUs can be attributed to their high degree of parallelism, or more specifically, their ability to operate efficiently as single instruction, multiple data (SIMD) devices. The nVidia Tesla C2050 GPU, which was used in our experiments, has 14 streaming multiprocessors, each consisting of 32 streaming processors or cores, for a total of 448 cores.
With respect to SAR, the backprojection algorithm described in the preceding section is an ideal candidate for GPU implementation, since (a) each output pixel can be viewed as the sum of the contribution of each input pulse, and (b) the set of operations used to calculate this contribution is not dependent on the value of the input.
III. DATA MOVEMENT ISSUES IN SAR
The backprojection algorithm described in Equation 1 requires the computational pairing of each pixel in the output image with each pulse in the pulse response matrix. Since both the output image and pulse response matrix can be large, it is often impossible to fit the whole of either structure in shared memory, constant memory or register memory. Fortunately, the data access pattern of the backprojection algorithm ensures the following two properties that are helpful in reducing the global memory access.
Let the function N R (P, S) be the number of range bins for a given pulse required to completely generate a square subregion S of the output image. For any square subregion S and pair of pulses (P 1 , P 2 ), we have Property 1:
This property can be explained by recalling that the range bin needed for a given pixel is linearly related to its distance from the pulse location. For any two points, the factor relating the difference in their physical locations to the difference D in their corresponding range bins is constant. In particular, D equals the total distance sampled, divided by the number of range bins. Observe that the maximum number of range bins is required when the pulse look angle equals 45
O with respect to the square subregion, such that
where B denotes the side length of the square shown in Figure 1 . For the 45 O case,
The final step is proven using the Pythagorean Theorem.
A related property involves the relationship between the number of bins needed by two square subregions. Consider two such regions S 1 and S 2 with base lengths of B 1 and B 2 . Letting P be a pulse in the pulse response matrix, we obtain the following lemma that states Property 2:
Observe that the distance between any two points on opposite boundaries of S 1 is less than or equal to 2 1 B . This follows directly from the Pythagorean Theorem. The upper bound on the number of range bins required to render this image is then proportional to 2 1 
B
, with a proportionality constant (C, as distinct from the speed of light c) determined by the sampling resolution:
Likewise, observe that the distance between any two points on opposite boundaries of S 2 is greater than or equal to B 2 . The number of range bins required is then proportional to B 2 .
Since the input pulse data P remains constant, so does the sampling resolution and the (constant) factor C. As a result, we have ( )
Observing that B 2 is a positive number, one obtains
and the upper bound of Property 2 follows from substitution of C. The lower bound of Property 2 follows directly from the upper bound. By swapping the roles of S 1 and S 2 , and noting that b 1 and b 2 are positive, we have
and the result follows. 
IV. APPROACHES FOR DESIGNING A PARALLEL IMPLEMENTATION
The Backprojection Algorithm can be implemented on a GPU by defining the kernel as the pairing of one pixel with one pulse. The steps involved in performing this operation are the same for all pixel/pulse pairs, so it is inherently SIMD in nature. However, challenges arise in the mapping of pair/pulse combinations to thread groups.
Naive Approach
First, let us consider the simplest approach, where kernels are grouped arbitrarily. Using this scheme, each pulse/pair combination is assigned to a thread in an unspecified sequence. This will produce a correct result, because the backprojection algorithm does not specify an order in which contributions must be summed.
While this approach achieves the minimum computational requirements of backprojection, it does not take advantage of the GPU memory hierarchy. The unspecified nature of the mapping between threads and pixel / pulse combinations requires that the pulse response matrix and the output image must be accessible to all threads running on the GPU. This yields the requirement that both structures must reside in global memory. No access locality patterns are available that would allow a subset of either structure to be copied to shared memory for lower latency access. Using this scheme, the cost of moving data from global memory to the GPU multiprocessor is one pixel-load operation, two range-bin loads, and one pixel-write-back per pixel/pulse combination. Assuming an image size of N x N pixels, rendered using P pulses, this process requires 4PM 2 global memory operations.
We have found that access locality can be achieved with respect to these data structures to allow only a small subset of the pulse and image data to be in the multiprocessor at any time.
Pulse Response Matrix Partitioning into Pulse Sets
A natural approach for partitioning backprojection is pulse partitioning. Here, the pulse response data is divided into sets along its rows. Each thread block then renders the entire image from one set of pulses, and the output image is the sum of the images produced by each thread block.
978-1-4673-0753-6/11/$26.00 ©2011 IEEE Figure 2 . Partitioning the image into tiles achieves access locality on the image, and some access locality on the pulse data. Some range bins are used by multiple partitions.
This approach achieves access locality on the pulse data, as each thread block operates on an easily predictable subset of the entire structure. This subset can be copied to shared memory at the start of execution using CUDA's coalesced memory transactions, then reused throughout the image reconstruction process.
Unfortunately, this blocking strategy achieves no access locality on the image. Each block must either maintain a copy of the entire image in its own shared memory, or sum each pulse's contribution directly to a data structure that resides in global memory. Either approach places unacceptable restrictions on the implementation, as follows. The former limits the size of images that can be generated to the size of shared memory, then requires a reduction operation to compute the final result. The latter requires an expensive pair of global memory accesses for each pulseand-pixel combination.
Assuming the latter approach, we have found the cost of accessing global memory to be 2PN 2 /K + PN R . It should also be noted that the 2PN 2 /K and PN R terms that appear in these equations represent block transfers of data to and from global memory, so coalescing operations can be employed to improve throughput.
Output Image Partitioning into Blocks
A more efficient approach achieves access locality on both the pulse data and image data by partitioning the image into blocks as shown in Figure 2 . In [5] , we showed this to be efficient for implementing backprojection on an FPGA.
The global memory access cost of this scheme is the cost of transferring a subset of the range bins from each pulse to the processing element rendering each tile, followed by a write-back of the tile to global memory after generation is complete. There is no reduction step because each tile represents the contribution of all pulses in the pulse response matrix, rather than a subset of pulses. As shown above, each tile requires a number of bins that is proportional to K/N, so the global memory access cost scales with PN R N/K + N 2 . A factor of two does not precede the N 2 term, because the write-back operation does not require a load operation. This cost estimate implies that larger tiles incur lower global memory access costs: a range bin is copied from global memory each time it is needed by a tile, rather than by a pixel. As the set of pixels contributed to by a range bin increases, the number of times that bin must be copied from global memory also increases.
Output Image and Pulse Response Matrix Partitioning
While large tile sizes help reduce I/O cost, they also reduce algorithmic parallelism. When reconstructed image size is small relative to tile size, there are fewer partitions to be distributed across the available processing cores, yielding an inefficient partitioning scheme. Thus, memory access delays that would be masked by a larger number of partitions adversely impact device throughput. This challenge can be overcome by splitting the pulses into pulse sets, then distributing the computation of a single image tile across several multiprocessors. Thus, a thread now computes a set of pixels across 1/N P pulses, where N P denotes the number of pulse sets. Figure 3 notionally illustrates this approach. Because this approach partitions the pulse response matrix, each tile of the output image is further distributed across processing devices. This requires a load and write-back reduction step as described in Section 4.2, increasing the global memory access cost to PN R N/K + 2N P N 2 . In this equation, the value of P, N B and N are input parameters to backprojection, while the values of K and S can be varied, to maximize throughput.
An ideal tile size (K) balances unnecessary global memory access (incurred by small tile sizes) against decreased parallelism (observed for large tile sizes). Unnecessary global memory access occurs when the same range bin from a given pulse is transferred from global memory to shared memory more than once. In general, if there are N R range bins, and NxN output pixels, then a range bin is accessed, on average, by R N N / 2 pixels. In a simple worst case analysis, the tile size K = 1 is equivalent to rendering each pixel independently, thereby accessing global memory each time a range bin is accessed, which results in each bin being loaded R N N / 2 times. For example, given a 4096 x 4096 pixel image, and our sample data set with 32,768 range bins per pulse (after oversampling), this results in the pulse response matrix being loaded from memory 512 times. In contrast, a simple best-case analysis uses K = N, where the entire pulse matrix is loaded from global memory once, used for every pixel of the output image, then permanently discarded, such that the pulse response matrix is loaded only once.
An intermediate case employs Lemma 2, and redefines R as the number of bins needed to render the NxN-pixel image for an arbitrarily chose pulse. In practice, this new value of N R may be less than the total number of range bins in that pulse. Consider the case of an image whose x-or y-axis is coincident with the look angle. Here, the number of bins accessed is N K N R / . Lemma 2 claims that for any other image tile, the number of range bins required to render this pulse is less than
. Therefore we have an upper bound on the number of bins required to render any tile using any pulse. Applying this concept to the processing of every tile in an NxN image results in a total transfer of bins that is contained within the following interval, as follows:
The interval is represented in Figure 4 with a smaller data containing set 4096 range bins and 512 x 512 output pixels. From the previous relationship, it would appear that an optimal tile size is K = N, subject to memory constraints. However, we have already shown this is not the case. Large values of K, even in environments where memory is not limited, reduce the inherent parallelism of the tiled approach. On a GPU, where many processing cores are available for parallel operation, this situation is inefficient. Thus, a second partitioning scheme was introduced on the pulse dimension of the pulse response matrix.
The characteristics of this partitioning scheme can be observed graphically. Figure 5 describes the observed latency when tile size is selected to be the size of the output image. As shown in Figure 6 , this graph can be divided into three distinct regions.
In Region 1, the GPU is underutilized, as there are not enough thread blocks to distribute to the multiprocessors. Increasing the number of blocks results in a linear decrease in the latency.
In Region 2, there are enough thread blocks to distribute to every multiprocessor. However, some multiprocessors have only a small number of blocks assigned to them. When the threads in these blocks have to wait for global memory access, the multiprocessor is idle (no other work to do) and must wait until memory access is complete. The result of this waiting is, predictably, suboptimal latencies.
In Region 3, ample work is distributed across the GPU, so multiprocessors rarely go idle. But when the pulse response matrix is distributed to several blocks, the images generated by each block must be combined to form a final output image, using the transfer medium of global memory. This global memory access cost increases with the number of output images, which results in a linear increase in latency as the number of pulse sets is increased, per Figure 6 and Figure 7 .
-The Effect of Input and Output Size
From the preceding analysis, when the reconstructed image is large with respect to the pulse response matrix, it is desirable to use a smaller tile size and recycle the pulse response matrix through global memory. When image sizes are small with respect to the pulse response matrix, large tile sizes are preferred, and pulse partitioning is the preferred means of achieving parallelism.
Modern GPU devices contain a relatively small block of low latency memory. This provides an upper bound on the Latency decreases sharply as parallelism improves, then increases as the latency to reduce output images overtakes improvements from increased parallelism. tile size that does not limit computational parallelism. As a result, we did not use pulse partitioning to obtain our experimental results. However, we anticipate this optimization will be of increased importance in future GPU architectures.
978-1-4673-0753-6/11/$26.00 ©2011 IEEE
V. EXPERIMENTAL GPU IMPLEMENTATION
Current GPU architectures conveniently support tilebased partitioning: most GPUs provide native support for two-dimensional block partitioning by allowing thread groups to be indexed on a two-dimensional grid. This supports logical correspondence between thread partitioning and image partitioning schemes. Also, each thread group evaluates one tile of the output image, as shown in Figure 2 .
A GPU can support thread indexing within each thread group. This provides an equally simple approach for assigning pixels to threads. In particular, the index of each thread is equal to the index of each image pixel traversed in normal scanning order (left to right, top to bottom).
Distribution of the pulse data is facilitated by the fact that the reconstruction of a given output pixel with respect to each pulse is an independent event. As a result, no benefit is gained from loading multiple pulses into shared memory at a single time. A more effective technique allows each thread to render each set of pixels with respect to every pulse. When the computation of each pixel-pulse pair is treated as a tuple in three-dimensional space, this technique can be depicted as shown in Figure 3 , where the number of pulse sets is 1. The approach benefits from its simplicity: each pixel can be stored in a register. At the completion of the thread, there is no additional work to be done. A completed pixel value is available for immediate copying back to global memory.
In order to achieve maximal throughput, other factors affecting GPU performance were considered. In particular, the GPU has also been shown to perform more efficiently when multiple blocks are active on each multiprocessor. This measure of utilization is known as occupancy, and permits global memory access latency to be masked by overlapping execution with read operations. A block can be marked as active on a multiprocessor if there are enough shared memory resources and registers available to service the block. Our Tesla GPU contained a configurable L1 cache and shared memory module. We selected the 16 KB shared memory, and 48 KB L1 cache configurations. The GPU also provided 32,768 registers per multiprocessor. It was necessary, therefore, to ensure that each block used as little of those resources as possible. The allocation of shared memory to meet these requirements was not difficult. Using the analysis described in Section 3, our experimental data, and a tile size fixed at 32 x 32 pixels, we determined that the maximum number of range bins that could be needed to render a pulse fit readily within 8 KB of memory. This would permit up to six active blocks to be scheduled on a multiprocessor without saturating the L1 cache. The process of register allocation, however, is more challenging.
The first approach was to unroll loops when the number of executed loop iterations is known at compile time. This has the effect of freeing up the register that would otherwise be used to maintain the loop index. For example, a simple code fragment that computes the sum of the first 3 elements in an array, could be implemented as: In addition, we observed that the temporal cost of recomputing a value each time it is needed is often outweighed by the spatial cost of allocating a register to hold it. For example, the contribution of a pulse to a pixel is computed as the weighted mean of the range bin immediately preceding the pixel and the bin immediately following it. This contribution is computed twice, once for the real component of the pulse data, and once for the imaginary component. A natural implementation is to determine the bin indices and weighting coefficients and then compute the sum. An alternative implementation which frees two registers at the cost of two operation delays is: A register utilization reduction can also be achieved by reducing the number of threads in a block and increasing the number of pixels rendered by each thread, as shown in Figure 8 . This follows from the fact that not all registers used by a thread store information about a specific pixel: many store information about the state of the thread. Therefore, increasing the number of pixels the thread is 978-1-4673-0753-6/11/$26.00 ©2011 IEEEresponsible for does not increase the demand for these registers, and reducing the number of threads results in a decrease in the number of these registers being used. In practice, we have determined that the number of threads could be reduced from 256 to 128 without adversely effecting performance. In this environment, each thread handled two pixels.
VI. EXPERIMENTAL RESULTS
The algorithm described above was implemented and tested on an nVidia Tesla C2050 GPU, using simulated data having 1000 pulses with 32768 range bins per pulse. Experimental results are shown in Table 1 . The low latencies produced by our algorithm, and the analysis provided in the preceding section, support the reconstruction of images in real time using multiple parallel GPU devices as pulse data is collected. For a 4096 x 4096 pixel output image, reconstruction times of less than one second can be obtained using only three GPUs. Table 1 does not include the time required to transfer input pulse data from the host to the GPU and transmit image data back to the host. These times are included in Table 2 , and are a function of the bandwidth of the host-GPU connection. In the Nvidia GPUs, these times can be masked using the Tesla GPU's asynchronous invocation capabilities. In particular, since GPU kernel calls and data transfers can be flagged to return prior to completion, it is possible to overlap these instructions. It is possible to load only a small number of pulses, launch a kernel using only those pulses which have been loaded, then continue loading the remainder of the pulses. In this way, the computation process can be overlapped with the startup transfer time. This is a particularly natural approach for invoking the kernel when pulse data is being read directly from an incoming data stream and reconstruction occurs in real time. A similar technique can be used to mask the transfer of the output image back from GPU memory. Here, each kernel invocation would process only a subset of the image tiles, and many such invocations would be needed to render the entire image. After the completion of each kernel, it is known that its tiles have been rendered, so it is possible to overlap the transfer of these tiles to the host with the rendering of the remaining tiles.
GPU Implementation Observations
Based on our experimental results, a GPU is an ideal architecture for reconstructing SAR images via backprojection. This is primarily due to two key benefits of the GPU architecture: (1) the ability to mask memory access latency through overlapping I/O and computation, and (2) parallelism obtained from many cores operating simultaneously.
When compared to a consumer-grade CPU, the performance difference is dramatic. Tables 3 and 4 and introducing a tiled partitioning scheme increased throughput to 3.78 Gflop/s. Utilizing a larger number of threads was ineffective, as thread scheduling on a CPU is much slower than on a GPU, and the increased latency due to thread scheduling negatively impacts throughput. The impact of tile-based partitioning on throughput is less clear in the CPU than in the GPU. We have found that this can be attributed to the large L1 cache size available, and the comparably low computational throughput of the CPU. An entire row of the pulse response matrix from our data set had a size of 256 KB. This fit readily into the Intel Smart Cache of the CPU, meaning that pulse response matrix cache misses were rare, even when tile partitioning was not used, the primary limiting factor in performance was the number of floating point operations that could be performed per second. The benefit of tile-based partitioning is more pronounced when execution times are measured on a GPU that is not implementing the tiling scheme. For example, an alternative scheme was tested where each GPU thread operated on a thread row, rather than a thread tile. For the 4096 x 4096 pixel output image, a minimum latency of 6.5 sec was obtained, which corresponds to a throughput of 108.0 GFLOPs, well below the measured throughput of 293.9 GFLOPs obtained using tile based partitioning. Thus, we have found that the GPU implementation of the tile based partitioning scheme is the superior approach for reconstructing SAR images. Compared to the CPU implementation, the speedup observed from this approach was 78.2. In contrast, when compared to the inferior GPU partitioning scheme, the speedup was 2.72.
VII. CONCLUSIONS AND FUTURE WORK
We have presented a design for an efficient mechanism for reconstructing images given synthetic aperture radar data using the Graphics Processing Unit (GPU). Our design, which is optimized for use on an nVidia Tesla C2050 GPU, partitions the output image into tiles. In particular, our approach relies on the fact that the amount of response data required for processing each tile is predictable and proportional to the size of the tile divided by the size of the image. Each thread block in the GPU operates on one tile for a given set of pulses from the pulse response matrix. The optimal tile size is the largest tile, such that a single pulse can be contained in GPU shared memory, while still producing enough threads to permit the GPU to effectively make use of its 448 cores.
Throughput is limited by the relatively high latency of accessing global memory on the GPU, particularly for reading the large pulse response matrix into shared memory. Nonetheless, performance for large output images (4096 x 4096 pixels) was measured at up to 293 GFLOPs, which is more than twenty-five percent of the C2050's peak single precision performance measurement of 1030 Gflop/s. Lower resolution images (e.g., 2048 x 2048 pixels), which require a smaller number of computations using the same set of input pulse data, performed at 251 Gflop/s.
Our continued research will investigate how multiple GPUs can be configured to operate in parallel and obtain additional throughput. In this case, the computational burden could be distributed across devices using the principles established in Sections 4.1 through 4.5. For small images, an efficient partitioning scheme would distribute blocks of pulses to multiple GPUs and combine the results at a single host machine to form a completed image. For large images, it is likely that a tiled partitioning scheme would be used to distribute image tiles, and the corresponding subset of the pulse data, to each device. Experimental results suggest that real-time data processing is possible for 4096 x 4096 pixel images using as few as three Nvidia Tesla C2050 GPUs.
