Summed area table allows filtering of arbitrary-width box regions for every pixel in constant time per pixel. This characteristic makes it beneficial in image processing applications where the sum or average of the surrounding pixel intensity is required. Although calculating the summed area table of an image data is primarily a memory bound job consisting of row or column-wise summation, previous works had to endure excessive access to the high latency global memory in order to exploit data parallelism. In this paper, we propose an efficient algorithm for generating the summed area table in the GPGPU environment where the input is decomposed into square sub-images with intermediate data that are propagated between them. By doing so, the global memory access is almost halved compared to the previous methods making an efficient use of the available memory bandwidth. The results show a substantial increase in performance.
Introduction
Originally devised as an alternative to mip mapping, the usage of summed area table has been extended to various types of blur filtering to increase the realism in 3D graphics.
With the minuscule additional overhead of generating the summed area table, the sum of an arbitrary rectangular region for all the pixels of an image can be simply calculated by a few addition operations per pixel during run-time. This can be used to filter regions of different size for every pixel in constant time per pixel, invariant of the region size.
There are numerous applications for summed area table including anti-aliasing [1] , image filtering [2] environmental mapping [3] , and depth of field [4] . Moreover, feature tracking applications [5, 6] in augmented reality use summed area table as a way to filter the image to make the edges more prominent and suitable for feature candidates.
The rest of the paper is organized as the following. Section 2 briefly describes previous related work. Then in section 3, fundamental idea behind summed area table and issues regarding GPU implementation is given. Section 4 proposes our memory bandwidth efficient algorithm. Section 5 demonstrates the experimental results of the proposed algorithm and makes comparison against previous implementations. Finally, we draw to a conclusion in section 6 and express future related work.
Related Work
Crow [1] invented the concept of summed area table to find a more accurate value for anti-aliasing mapped texture. Because the main concern was on the quality of the output image, he had inconclusive results on the computation time.
Hensley et. al. [3] extended the application of the summed area table to creating glossy and transparent surfaces at interactive rates.
However, they employed a work-inefficient method of recursive doubling to calculate the prefix scan.
Harris et. al. [7] described summed area table as an example of their work-efficient parallel prefix scan. In order to maintain global memory coalescing when fetching data in vertical columns, they transposed the image so that the columns can be processed horizontally.
In this paper, we present a novel method of generating the summed area table. All accesses to the global memory are coalesced and the total amount is reduced by nearly half compared to previous methods. In addition, the performance can be optimized for various graphics hardware by modulating the data processed per thread-block, a tradeoff between communication time and memory latency hiding capability. Table and GPU 3.1 Summed Area Table   The summed area table contains 
Summed Area

GPGPU
The GPU is a commodity hardware which provides performance rivaling that of a supercomputer at a fraction of the cost. It is As there may be thousands of SPs present on a GPU, supplying the appropriate data to the computation core is a critical issue. In the current GPU, memory is organized in a hierarchical structure to alleviate the bandwidth problem [8] .
The global memory which may be used as Lastly, the private memory is exclusive to a particular SP and usually used as a per-thread cache substitute or storing temporary data.
There are other types of memory such as constant or local memory but are seldomly used.
Except for operations which involve usage of the special function unit, the throughput of computation units is one or two cycles per instruction. On the other hand, strict criteria, such as avoidance of memory bank conflict or instruction-level parallelism, must be met to achieve comparable performance from the on-chip memory subsystem such as shared memory or registers [9] . Thus, implementations on the GPU must be memory bound and try to make the memory access pattern and bandwidth as optimal as possible.
Image Transpose Method
The generation of the summed area bytes. Therefore, executing prefix scan on the columns would require long strides through the memory where each transaction would result in most of the fetched data to be discarded from non-coalesced reads.
Harris et. al. [7] solved this problem by transposing the 2D array after the row-wise scan. This way, the second pass can be made on rows of data instead of the columns and still have the same effect.
The image transpose method proposed by
Harris et. al initiates by de-interlacing the input 8-bit per channel RGBA to four 32-bit floating point values. This is due to the fact that the GPU is optimized for processing 32-bit floating point values streaming through the hardware's pixel pipeline [8] . As such, they tried to use the most proficient data type.
Afterwards, prefix scan is performed on the row-wise fashion. In another words, each row is scanned independent of other rows. The method incorporated in the parallel prefix scan algorithm is known as recursive doubling [10] devised by Kogge Therefore, when accessing the generated summed area table, the x and y coordinates must be swapped to obtain the correct value.
Proposed Method
Performance of the applications implemented on the GPU are usually bound by the amount of memory access and the efficiency of the access pattern. This is because the device memory is the slowest component in the GPU in addition to the fact that the memory subsystem is optimized for bandwidth rather than latency unlike the host memory.
Assuming one byte per pixel, the previous method of transposing the image to maintain global memory coalescing requires 2*w*h accesses each for row-wise scan, transpose, and column-wise scan for a total of 6*w*h where w and h are width and height in pixels of the image respectively. This amounts to the image being accessed recursively six times.
[ Fig. 2 ] The proposed divide and conquer method. The input data is evenly divided between multiple thread blocks which perform row and column-wise summation.
In order to minimize the duplicative memory The internal processing size of 32x32 was selected based on the shared memory available on the commodity GPU which ranges from 16KB to 48KB. More importantly, the prefix scan employed is work inefficient although the depth is optimal. Therefore, increasing the input data size when even more memory is available will hinder the execution performance with O(nlogn) work complexity. On the other hand, decreasing the input data size below the SIMD data width of 32 specified as the warp size in the GPU will only cause parts of the data path to be idle and have no effect performance-wise.
The shared memory is divided into banks of equal size to concurrently supply data to the multiple cores with access demands without delay. However, memory bank conflict will cause the access to be serialized causing performance to degrade substantially.
Therefore, we mapped the data into the shared memory so that every thread in any warp will be assigned to a different memory bank. This 
Experimental Results
The benchmark system was composed of AMD FX6100 CPU running at 3.3GHz with 
