Abstract
I. INTRODUCTION
The use of radar for area surveillance has been of keen interest to scientists and engineers in fields such as medical tomography, meteorology, and geology [4]. One such application, radar-based terrain mapping, has recently become the subject of increased interest in the field of high-performance parallel computing. Terrain mapping involves the use of radar devices to generate high-resolution images of one or more ground features (e.g., land or forest), as well as manufactured objects.
As shown in Figure 1, a radar transmitter at known location L broadcasts a pulse at time T0. Responses from objects within the sensor's field of view are measured as a function of time. Reflective objects farther from the pulse location produce echoes that arrive later than echoes from proximal objects. In the example shown, the response for P1 will appear at an earlier time than P2, and P2 will appear earlier than P3, and so forth. The time at which these responses arrive can be readily computed by dividing the distance between L and Px by the speed of light c:

$$t_x = \frac{d(L, P_x)}{c}$$
Unfortunately, while prediction of arrival time for a given reflected pulse component is simple, the inverse computation that establishes an object map from multiple pulses is more challenging, as shown below.
For example, in Figure 2, an additional point at P4 will have the same distance, and hence the same response time, as the point at P1.
Figure 1: A radar pulse is transmitted, and response intensity is measured as a function of time. Responses from more distant objects traverse a longer path and return to the sensor at a later time.
Figure 2: Figure 1 shown in top view. An object at P4 is the same distance from L as P1, and will have the same response time.
Given the response data for only one pulse, it is impossible to generate a physically accurate representation of the original image, because reflective objects can be echo-located only to the nearest range band (shown in Figure 2 as concentric circles centered at the pulse emitter).
Given multiple views of a ground object, synthetic aperture radar (SAR) provides a method of reconstructing the object map by measuring and mathematically combining multiple return pulses at multiple locations over time, thereby increasing the effective aperture. In the example of Figures 1 and 2, the sampling process would be repeated with L in several different locations around P1, P2, and P3. This helps resolve the spatial ambiguity between P1 and P4, allowing reflective objects to be mapped more accurately and precisely to their correct locations. In addition, the effect of sampling error (e.g., due to interference and environmental factors) is reduced by the averaging inherent in repetitive sampling, and by superposition during reconstruction. However, the rendering of output images from pulse response data becomes computationally more expensive as the number of pulses increases. For example, a 500 x 500 pixel SAR image took over four hours to compute on a consumer-grade Hewlett-Packard 2 GHz Centrino processor.
Although parallel architectures have become more available and useful, the cost of developing high-performance sequential hardware remains substantial. Moreover, since sequential processing solutions for SAR image reconstruction tend to be primarily IO-bound and secondarily compute-bound, parallel computing is further indicated. We have found that Field Programmable Gate Arrays (FPGAs), which support parallelism through replication of spatial algorithms, are well suited to SAR reconstruction. In simulations with Altera Quartus II software targeting an Altera Stratix III EP3SL150 FPGA, a single instance of our design achieved an execution time of 228 seconds. Increasing the number of instances operating in parallel resulted in a linear decrease in execution time. We next consider the SAR image reconstruction algorithm employed in this study.
II. THE FILTERED BACKPROJECTION ALGORITHM
Several published algorithms are available for reconstructing SAR imagery [1, 3]. The filtered backprojection algorithm is known for its high computational cost and high-quality output images [1]. The algorithm, whose details are beyond the scope of this paper, is summarized as follows.
Firstly, the time dimension of the response data for each pulse is divided into range bins. This typically occurs at collection time, to convert the real-time response data stream to a discrete format amenable to storage in memory. Secondly, for each pixel in the output image and for every pulse in the response data: (i) the spatial distance of that pixel from the pulse location is computed; (ii) the range bin that corresponds to this distance is determined; and (iii) the contribution of this range bin is summed into the pixel's current value.
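To make this access pattern concrete, the following C sketch implements the three steps above in their naive sequential form. The array shapes, coordinate conventions, and bin-width parameter are illustrative assumptions, not details of the reference implementation; only the pulse and bin counts are taken from the simulation data described in Section IX.

```c
#include <math.h>

/* Illustrative dimensions: pulse and bin counts match the Section IX
 * simulation data; the 500 x 500 image size matches Section I. */
enum { N_PIX = 500, N_PULSES = 42120, N_BINS = 424 };

typedef struct { double x, y; } Point;

/* Filtered backprojection, step two: for each pixel and each pulse,
 * (i) compute the pixel-to-pulse distance, (ii) map it to a range bin,
 * (iii) sum that bin's contribution into the pixel.  Pixel indices are
 * treated as physical coordinates for simplicity. */
void backproject(double image[N_PIX][N_PIX],
                 const double echo[N_PULSES][N_BINS],
                 const Point pulse_loc[N_PULSES],
                 double bin_width)   /* range spanned by one bin */
{
    for (int row = 0; row < N_PIX; ++row) {
        for (int col = 0; col < N_PIX; ++col) {
            double sum = 0.0;
            for (int p = 0; p < N_PULSES; ++p) {
                double d   = hypot(col - pulse_loc[p].x,    /* (i)   */
                                   row - pulse_loc[p].y);
                int    bin = (int)(d / bin_width);          /* (ii)  */
                if (bin < N_BINS)
                    sum += echo[p][bin];                    /* (iii) */
            }
            image[row][col] += sum;  /* accumulate into output pixel */
        }
    }
}
```

The tiled, pipelined FPGA design of Section V restructures exactly this triple loop; the inner summation is what each Projection Element performs in hardware.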
We note that this representation of the filtered backprojection algorithm describes only the data access pattern and the algorithmic complexity of its computational and I/O operations.
III. BENEFITS OF FPGA IMPLEMENTATION
As noted previously, the backprojection algorithm is computationally expensive because it requires every pixel to be summed with the contribution of every pulse. In practice, given a large image and a large pulse data matrix, this process can incur high latencies.
Fortunately, the process of rendering pixels is inherently data-parallel. As a result, it is possible to achieve a proportional decrease in overall latency by replicating the available hardware: low reconstruction times can be realized via a high degree of parallelism. Unfortunately, modern microprocessors are complex, largely sequential devices, designed to be applicable to virtually any mathematical problem. These general-purpose capabilities consume large amounts of on-chip space, which tends to be wasted when a microprocessor is used to perform simple calculations. Since most microprocessors are inherently sequential, they tend to contribute little to the performance of fine-grained parallel implementations, as predicted by Amdahl's Law.
Field Programmable Gate Arrays are fundamentally different from microprocessors: FPGAs allow designers to define a logical relationship between hardware inputs and hardware outputs in terms of spatial relationships on-chip, rather than in the temporal dimension as sequential instructions. This supports parallel computation of operations that are not serially dependent. In implementing SAR image reconstruction via the backprojection algorithm, the same set of operations is performed for every (R, P) pair, where P is a pixel in the reconstructed image and R is the range bin in a given pulse that corresponds to P.
IV. EFFECT OF RESPONSE MATRIX SIZE
In addition to computational complexity, the backprojection algorithm presents significant data movement challenges. Both the pulse response matrix and the output image can be very large, and most FPGA chips cannot store these structures in their entirety. However, in order to sum the contribution of a given pulse to a pixel, both the correct range bin and the pixel must be on-chip at the right time. Fortunately, the data access pattern of the backprojection algorithm ensures two properties that help reduce the cost of data input and output.

Property 1. Let the function f(P, X) map pixel X to its corresponding range bin in pulse P. As X is varied from left to right, or right to left, across any row of the output image, one of the following will be true:
1. f(P, X) will be increasing.
2. f(P, X) will be decreasing.
3. f(P, X) will be increasing up to some pixel Y; beyond Y, f(P, X) will be decreasing.
4. f(P, X) will be decreasing up to some pixel Y; beyond Y, f(P, X) will be increasing.
This property also holds as X is varied along any column of the output image, from top to bottom or bottom to top, as shown in Figure 3. Recall that range bin access is directly proportional to the distance of a pixel from its corresponding pulse. In the simple cases (L1 and L3 in Figure 3), as X is varied along a straight line on the image, the distance between X and the pulse will strictly increase or decrease. In the case where the pulse location is not beyond an endpoint of the line (L2 is directly above the line), there is some point Y at which the distance between Y and the pulse location is at a minimum. From this property, it follows that if only a subregion of the output image is considered, only a proportionally smaller subregion of the echo matrix is needed to render that subregion.
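For reference, f(P, X) reduces to a distance-to-bin quantisation under the mapping assumed throughout this paper. The sketch below (names and units are illustrative) is the function whose monotone or once-reversing behaviour Property 1 describes:

```c
#include <math.h>

/* f(P, X): the range bin touched by pixel X for the pulse emitted at P.
 * Along any image row (fixed y, x increasing), the distance term is a
 * unimodal function of x, so the bin index is monotone, or falls and
 * then rises exactly once -- the four cases of Property 1. */
int range_bin(double px, double py,   /* pulse location P */
              double x,  double y,    /* pixel location X */
              double bin_width)       /* range spanned by one bin */
{
    double d = hypot(x - px, y - py);
    return (int)(d / bin_width);
}
```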
Property 2. Let the function N(P, S) be the number of range bins from a given pulse required to completely generate a square subregion S of the output image. For any square subregion S and pair of pulses (P1, P2), we have

$$N(P_1, S) \le \sqrt{2}\, N(P_2, S)$$
This property can be explained by recalling that the range bin needed for a given pixel is linearly related to its distance from the pulse location. For any two points, the factor relating the difference in their physical locations to the difference in their corresponding range bins is constant. In particular, this factor D equals the total distance sampled divided by the number of range bins. Observe that the fewest range bins are required when the look angle is either parallel or perpendicular to the base of the square subregion. Likewise, the maximum number of range bins is required when the look angle equals 45°, such that $N_{max} = \sqrt{2}\, N_{min}$. The final step is proven using the Pythagorean theorem.
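The Pythagorean step can be made explicit. The short derivation below is our reconstruction of the argument, introducing E(θ) for the tile's extent along the look direction (a symbol not used in the original text):

```latex
% Reconstruction of the Property 2 bound.  D is the range spanned per
% bin (total distance sampled / number of bins, as defined above).
\begin{align*}
E(\theta) &= K\left(\lvert\cos\theta\rvert + \lvert\sin\theta\rvert\right),
  \qquad N(\theta) = E(\theta)/D \\
E(0^{\circ}) &= E(90^{\circ}) = K
  \quad\text{(look angle parallel or perpendicular to the tile base)} \\
E(45^{\circ}) &= \sqrt{K^2 + K^2} = K\sqrt{2}
  \quad\text{(Pythagorean theorem: the tile diagonal)} \\
\frac{N_{\max}}{N_{\min}} &= \frac{K\sqrt{2}/D}{K/D} = \sqrt{2}
  \;\Longrightarrow\; N(P_1, S) \le \sqrt{2}\, N(P_2, S).
\end{align*}
```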
V. DESIGN
The proposed design partitions the output image into a set of small tiles. Each tile is square, with width and height K. The process of generating an output image can then be decomposed into generating each of the $(N/K)^2$ tiles independently. The backprojection algorithm states that each pixel of an output image is computed as the sum of the contributions of each pulse in the echo matrix, where the contribution of a pulse to a given pixel is the value of the range bin corresponding to that pixel. To reconstruct the image, the correct range bin of each pulse must be paired with the corresponding pixel or pixels in the output image.
In the proposed design, shown in Figure 4, the pairing of a single pulse range bin and pixel value occurs in each box labeled Projection Element (PE). PEs are arranged in a vertical pipeline having width W and depth D. Pixels are stored in an external memory device labeled Output Image, loaded into the Read Pixel Queue, and processed through the pipeline. At each pipeline stage, a pixel is paired with the pulse that has been loaded into that stage. After leaving the pipe, the processed pixels are stored in a Write Pixel Queue, from which they are transferred back to the external memory device.
Figure 4: The proposed design consists of one or more vertical pipelines. Pixels pass through a pipeline from top to bottom, meeting a new pulse at each level. Processing elements are connected to one bucket of a circular memory bridge. If the needed range bin is not in that bucket, the processing element can assert a signal requesting the memory bridge to rotate until the needed bin is in the correct bucket. The pipe width can be adjusted to increase pipe throughput.
When a pixel reaches a given stage of the pipeline, the corresponding range bin must be loaded into local memory so its contribution can be summed with the existing pixel value. Due to long latencies in external memory accesses, performance would be greatly improved by loading the echo matrix into an on-chip cache. However, the echo matrix is significantly larger than available on-chip memory, so only a small portion of the echo matrix can be stored on the FPGA. Fortunately, range bin access within each pulse is predictable for a K x K tile. Within each pulse, the number of range bins needed to compute each tile is proportional to K/N, where N is the size of the output image. In Figure 4 , for example, only the shaded region of the Echo Matrix is used to generate the pixels shaded in the Output Image.
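As an illustration of this locality, the window of range bins that one pulse contributes to a K x K tile is bounded by the tile's nearest and farthest points from the pulse. The helper below is a hypothetical sketch of that computation (coordinate conventions and names are ours, not the authors'):

```c
#include <math.h>

static double clampd(double v, double lo, double hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Range-bin window one pulse contributes to a K x K tile anchored at
 * (x0, y0): the nearest tile point gives the first bin, the farthest
 * corner gives the last.  Only bins [*lo, *hi] need to be loaded. */
void tile_bin_window(double px, double py,   /* pulse location     */
                     double x0, double y0,   /* tile corner        */
                     double K, double bin_width,
                     int *lo, int *hi)
{
    /* Nearest point: clamp the pulse location into the tile.      */
    double nx = clampd(px, x0, x0 + K), ny = clampd(py, y0, y0 + K);
    double dmin = hypot(nx - px, ny - py);

    /* Farthest point is always one of the four tile corners.      */
    double dmax = 0.0;
    for (int i = 0; i < 4; ++i) {
        double cx = x0 + (i & 1) * K, cy = y0 + (i >> 1) * K;
        double d  = hypot(cx - px, cy - py);
        if (d > dmax) dmax = d;
    }
    *lo = (int)(dmin / bin_width);
    *hi = (int)(dmax / bin_width);
}
```

By Property 2, the width of this window varies by at most a factor of √2 across pulses, which is what makes a fixed-size prefetch cache practical.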
The current design, shown in Figure 4, replaces the shared memory component with a circular memory bridge (similar to a round-robin queue). Prior to computation of an output (reconstructed) tile, each memory bridge is loaded with the range bins that will be needed to compute the entire tile. Each processing element is then connected to a single storage location, or bucket, on the circular structure. If a processing element is not connected to a bucket holding the correct range bin, it can request a clockwise or counterclockwise rotation. By Properties 1 and 2, it can be shown that if adjacent pixels are passed through the pipeline, the rotation distance will be small relative to the size of the memory bridge. In fact, for a single row of the image, rotations will always occur in the same direction, with changes to that direction occurring at most once (see Property 1 above); no more than one full rotation is needed to compute any given row. This design exploits both the locality and the predictable pattern of range bin access, and uses an interconnection network that achieves an efficient trade-off between size and speed.
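A software model may clarify the intended behaviour. The C sketch below is our approximation of the bridge described above, simplified to rotate in a single direction; the bucket count, bin layout, and request protocol are assumptions:

```c
/* Software model of the circular memory bridge: each PE is attached to
 * one bucket; when the bin a PE needs sits elsewhere on the ring, the
 * ring rotates one step per request until it arrives. */
typedef struct {
    const double *bins;   /* range bins loaded for the current tile */
    int  n_buckets;       /* ring size                              */
    int  offset;          /* current rotation of the ring           */
} MemBridge;

/* Bin currently visible to the PE attached at position `pe`. */
static double bridge_read(const MemBridge *b, int pe)
{
    return b->bins[(pe + b->offset) % b->n_buckets];
}

/* Rotate until PE `pe` sees bin `want`; returns steps taken.  By
 * Properties 1 and 2, consecutive pixels need nearby bins, so the
 * step count stays small, and along one image row the direction of
 * rotation reverses at most once. */
static int bridge_rotate_to(MemBridge *b, int pe, int want)
{
    int steps = 0;
    while ((pe + b->offset) % b->n_buckets != want) {
        b->offset = (b->offset + 1) % b->n_buckets;  /* one-direction model */
        ++steps;
    }
    return steps;
}
```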
An additional characteristic of the design shown in Figure 4 is the pre-fetch cache. This feature masks the latency of external memory by permitting the loading of the memory bridge to be executed in parallel with the computation of a tile. Further design considerations are given in [5] .
VI. DESIGN ANALYSIS AND PARAMETERS
The design is analyzed in terms of time, space, and IO complexity.
Latency
In order to derive an expression for the latency of rendering an N x N image, it is necessary to consider the two operations that occur in parallel during steady-state operation of this design: (1) loading the prefetch cache with a new pulse, and (2) rendering a K x K image tile with the current pulse. We write the former operation as a function of K:

$$L_{MB}(K) = L_{memstart} + \frac{CK}{L_{mem}} \qquad (1)$$

where L_memstart and L_mem refer to the startup time and access speed of the host memory device, and CK is the number of range bins loaded for a tile. The constant C is determined by the sampling resolution and the size of each range bin, which are held constant for the purpose of this analysis. We then write the latency of rendering a single K x K tile for one pulse as follows:
$$L_{Render}(K) = L_{fill} + \max\!\left(\frac{K^2}{L_{mem}},\ \frac{K^2}{W}\right) \qquad (2)$$

where L_fill is the initial hardware time required to start the pipe after reloading the memory bridge, and K^2 is the tile size. The maximum function in this equation reflects the fact that the rate at which pixels move through the pipe is limited by both the processing speed and the memory access speed. Since each pixel must be read from and written to memory, the time to move pixels through the pipe is bounded below by K^2 / L_mem. The time to process these pixels is K^2 / W, where W is the width of the pipe; equivalently, W is the number of pixels rendered in parallel. In steady-state operation, the larger of L_MB and L_Render dominates the latency of rendering a single K x K tile for a given pulse. Repeating this process for all tiles in the image, and distributing the load across J implementations each with a pipe depth of D, the total time required to generate the entire N x N image using A pulses is:

$$L_{Total} = \left(\frac{N}{K}\right)^2 \cdot \frac{A}{JD} \cdot \max\!\left(L_{MB}(K),\ L_{Render}(K)\right) \qquad (3)$$
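As a numerical sanity check, the model above can be encoded directly. The C sketch below is a minimal evaluation of Equations 1-3 as restated here; treating L_mem as an access rate, and all parameter names, are our assumptions:

```c
#include <math.h>

/* Steady-state latency model (Equations 1-3 as restated above).
 * L_mem is an access rate, so element counts are divided by it. */
double total_latency(double N, double K, double A,
                     double J, double D, double W, double C,
                     double L_memstart, double L_mem, double L_fill)
{
    double L_mb     = L_memstart + (C * K) / L_mem;           /* Eq. 1 */
    double L_render = L_fill
                    + fmax((K * K) / L_mem, (K * K) / W);     /* Eq. 2 */

    double tiles  = (N / K) * (N / K);   /* tiles in the image        */
    double passes = A / (J * D);         /* pulse passes per tile     */
    return tiles * passes * fmax(L_mb, L_render);             /* Eq. 3 */
}
```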
Space
The design shown in Figure 4 consists of four central components: a processing element, a memory bridge, a prefetch cache, and a pixel queue. The prefetch cache and memory bridge are of equal size, and the pixel queue is negligibly small compared to the other elements. Under this assumption, we express the space consumed by J implementations, each with pipe depth D, pipe width W, and memory bridge size CK, as follows:

$$S = JD\left(W \cdot S_{PE} + 2\,CK \cdot S_{bin}\right) \qquad (4)$$

where S_PE is the space consumed by one projection element and S_bin is the storage required per range bin.
Memory Access (IO)
Firstly, we observe that the memory access required to render a single K x K tile on a single pulse, then write that tile back to memory, is given by:

$$IO_{Tile} = CK + 2K^2 \qquad (5)$$

where CK range bins are read from the echo matrix, and K^2 pixels are each read from and written back to image memory.
Using J parallel implementations of the design shown in Figure 4, the IO required to render one tile on D pulses then becomes:

$$IO_{Pass} = J\left(D \cdot CK + 2K^2\right) \qquad (6)$$
This equation reflects the fact that increasing the pipe depth D results in a linear increase in echo matrix IO requirements for each pass of a tile through the pipe. Increasing J, the number of parallel implementations, produces a linear increase in both echo matrix IO and image IO.
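Equations 5 and 6, as restated here, can be expressed directly; the helpers below are illustrative (names are ours):

```c
/* IO to render one K x K tile on one pulse: CK echo bins in, K^2
 * pixels in and out (Equation 5). */
double io_per_tile(double K, double C)
{
    return C * K + 2.0 * K * K;
}

/* With J pipes of depth D, each pass covers D pulses but still moves
 * every pixel through each pipe exactly once (Equation 6). */
double io_per_pass(double K, double C, double J, double D)
{
    return J * (D * C * K + 2.0 * K * K);
}
```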
VII. DESIGN ANALYSIS - OPTIMIZING PARAMETERS

Minimizing the Value of J
From Equations 3 and 4, we note that pipe depth D and degree of parallelism J are both inversely related to latency, and directly related to space. In fact, these values appear as a product JD in both equations. This is intuitive, as greater parallelism can be obtained by lengthening the pipe, or by operating multiple pipes simultaneously. Regarding only spatial and temporal efficiency, neither option is preferable to the other. However, when Equation 6 is considered, it is clear that increasing D results in lower IO requirements. For that reason, J should always be chosen to be the lowest possible value. The section titled Partitioning the Echo Matrix to Facilitate Parallel IO describes an architecture where J = 1 even when the design is distributed across multiple FPGA boards.
Selecting the Right Value of K
We observe from Equation 3 that when chip real estate is not bounded above, latency is inversely related to tile size K. In an environment where space is unlimited, it is then preferable to let K = N and process the entire image as a single tile. In practice, however, available space on an FPGA is finite. From Equation 4, it can be seen that, for a fixed S, increasing the value of K decreases the value of JD. This reduces parallelism and increases latency. It is necessary, then, to find an optimal value of K such that latency is minimized.
In selecting the optimal value of K, we optimize each of the limiting factors described in Equation 3. This separates the analysis into a distinct case for each of the limiting factors: processing speed, image memory access, and echo matrix memory access.
The first two cases can be treated together, as neither the processing speed nor the access speed of the memory device can be improved by changing K. We define the larger of these latencies to be the latency of rendering one pixel on one pulse, and denote it L_pix. Echo matrix memory access is ignored, as it is the faster of the two parallel operations. Then, treating the available FPGA chip space as a constant and substituting Equation 4 into Equation 3, the expression for latency is:

$$L(K) = \left(\frac{N}{K}\right)^2 \cdot \frac{A\left(W S_{PE} + 2\,CK\,S_{bin}\right)}{S} \cdot \left(L_{fill} + K^2 L_{pix}\right) \qquad (7)$$
When this expression is expanded, its highest and lowest order terms have order 1 and -2 respectively. We observe that the graph of this function decreases quadratically toward some minimum value, and then increases linearly as K approaches N. The optimal value of K occurs at the minimum of this graph.
The third case assumes that echo matrix memory access is the limiting factor in throughput. Echo matrix memory access requirements can be reduced by increasing the value of K, which decreases the value of JD and hence the number of prefetch caches loading data from memory. Treating chip space as a constant and substituting Equation 4 into Equation 3, we obtain:

$$L(K) = \frac{N^2 A\, C}{K \cdot L_{mem}} \qquad (8)$$
This expression decreases on the interval (0, N] , reflecting the fact that echo matrix memory access is minimized when the image is processed as a single tile. This is consistent intuitively, because rendering an image in tiles causes some areas of the echo matrix to be fetched multiple times, while rendering a single tile results in only one pass over the echo matrix.
We note that in the third case, Equation 8 is a decreasing function that is greater than Equation 7 at the optimal tile size dictated by cases 1 and 2. Recalling that Equation 7 increases for values of K larger than its optimum, the optimal value of K for case 3 occurs at the point of intersection of the two curves.
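This intersection-based selection can be sketched numerically. The routine below scans candidate tile sizes against stand-ins for the two latency curves; the coefficients a through e are placeholders for the constants in Equations 7 and 8, not values derived from the paper:

```c
#include <math.h>

/* Pick the tile size K minimising latency: evaluate the case-1/2
 * curve (Equation 7 expands to a/K^2 + b/K + c + dK) and the case-3
 * curve (Equation 8 has the form e/K), and keep the larger of the
 * two at each K -- whichever constraint binds. */
double best_tile_size(double N, double a, double b, double c,
                      double d, double e)
{
    double bestK = 1.0, bestL = INFINITY;
    for (double K = 1.0; K <= N; K += 1.0) {
        if (fmod(N, K) != 0.0) continue;      /* K must tile N evenly */
        double L12 = a / (K * K) + b / K + c + d * K;  /* cases 1-2 */
        double L3  = e / K;                            /* case 3    */
        double L   = fmax(L12, L3);
        if (L < bestL) { bestL = L; bestK = K; }
    }
    return bestK;
}
```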
Selecting the Optimal Value of W
In selecting a pipe width W, we first observe that increasing the pipe width allows multiple pixels to be sent down the pipe in parallel. Increasing W also has a much lower spatial cost than increasing pipe depth, because it does not require the addition of memory bridges. However, an increase in W only reduces the observed latency of the projection elements (PEs), and does not affect memory access. An increase in W is therefore helpful when pipe processing latency (K^2 / W) is the limiting factor in device performance. In the case where image or echo matrix memory access dominates the expression for latency, increasing W does not yield any additional throughput.
In general, an optimal pipe width can be obtained by increasing W until echo matrix or image memory access limits device throughput.
VIII. PARTITIONING THE ECHO MATRIX TO FACILITATE PARALLEL IO
When memory accesses are large, the performance of external memory can be improved by partitioning the echo matrix across multiple independent memory devices. This partitioning is done on the pulses (rows) of the echo matrix. More information, and other considerations for achieving high performance, are given in [5].
IX. SIMULATION RESULTS
To estimate the throughput of our design, the elements shown in Figure 4 were implemented using Altera DSP Builder and Matlab Simulink software. They were then imported into Altera's Quartus II simulation software and analyzed to determine the theoretical optimum throughput. The sample radar data used to generate these results contained 42,120 pulses, 424 range bins per pulse, and sampled a circular area with a simulated radius of 7 km. For all computations, it was assumed that the tile size was one-tenth the size of the output image (i.e., 100 tiles per output image). It is also assumed that a sufficiently fast external memory device is available, so that memory access does not limit design throughput.
The theoretical optimal latencies of five representative FPGA devices are shown in Table 1 for four common output image sizes.
It is important to note that while each implementation in Table 1 fits on a single FPGA, the number of independent FPGAs is unlimited in principle. For a Stratix III EP3SL150, the number of FPGA devices required to generate a 500 x 500 pixel output image in 0.5 seconds is 18. A common metric for evaluating the performance of computational devices is the equivalent number of operations per second. Considering the case of a 5,000 x 5,000 pixel output image, the total number of operations when executed sequentially is 4.42 x 10^13. Given a single Stratix III EP3SL150 operating at 1.60 x 10^11 operations per second, 6,000 such FPGAs operating in parallel could achieve a performance near 1 petaflop.
X. RELATED WORK
The algorithm and design presented herein are tailored to the rapid parallel processing of synthetic aperture radar data, and to the reconstruction of a corresponding two-dimensional output image. However, these techniques could also be applied to any domain where echo response data is projected back to form an image of a surface, object, or area. One such application is tomography, the construction of three-dimensional models from a set of cross-sectional views. Here, each atomic volumetric unit, or voxel, exhibits the same properties observed for pixels. Namely, for a given cross-sectional view, each voxel is associated with a response value that is spatially near the responses of its neighboring voxels. Similarly, a given volume of output data will require a predictable quantity of response data for each cross section. The preservation of these properties means that a circular memory bridge could also serve as an effective storage mechanism for the on-chip caching of sectional data.
Previous work in the field of tomography has shown FPGAs to be reliable platforms for deploying backprojection algorithms in the rendering of three-dimensional images. Gac et al. [2] showed that it is possible to obtain response times comparable to those of a Graphics Processing Unit (GPU) using a single System on Programmable Chip (SoPC). The techniques described in this document could supplement these developments, allowing designs to be mapped to a large number of independent FPGAs as needed.
XI. CONCLUSIONS AND FUTURE WORK
We have presented a design for an efficient mechanism for reconstructing images from synthetic aperture radar data. Our design consists of a pre-fetch cache for masking the latency of external memory, a circular memory bridge for providing rapid access to range bin data, a pixel queue for directing image data through a processing pipe, and a core processing element for pairing pixels with their corresponding range bins. This design partitions the output image into small tiles, relying on the fact that the amount of response data required for processing each tile is predictable and proportional to the size of the tile divided by the size of the image. The selection of an optimal tile size depends on the size and speed of the hardware, as well as the size of the problem to be solved (per Equations 7 and 8).
External memory access time may be a limiting factor in design performance, but can be mitigated via reduced pipeline depth, at the expense of increased spatial overhead per parallel implementation. The optimal depth is the largest value that allows the pre-fetch cache to effectively mask memory latency. Oversampling the observation area can also decrease design performance, but small alterations to the hardware of the memory bridge can minimize this effect. Simulations show that high throughput can be obtained by increasing the amount of hardware used, either by choosing large FPGA devices, or by connecting multiple devices to the same shared memory architecture.
