Computational photography applications, such as lightfield photography [1], enable capture and synthesis of images that could not be captured with a traditional camera. Non-linear filtering techniques like bilateral filtering [2] form a significant part of computational photography. These techniques have a wide range of applications, including High-Dynamic Range (HDR) imaging [3], Low-Light Enhanced (LLE) imaging [4], tone management and video enhancement. The high computational complexity of such multimedia processing applications necessitates fast hardware implementations [5] to enable real-time processing. This paper describes a hardware implementation of a reconfigurable multi-application processor for computational photography.
The bilateral grid structure used by this chip is constructed as follows. The input image is partitioned into blocks of size σ_s × σ_s and a histogram of pixel intensity values is generated for each block. Each histogram has 256/σ_r bins. This results in a 3D representation of the 2D image, referred to as the bilateral grid, where each grid cell (i, j, r) stores the number of pixels in a block corresponding to that intensity bin (W^r_ij) and their summed intensity (I^r_ij). The grid assignment (GA) engine, shown in Fig. 9.6.2, performs this operation. The convolution (Conv) engine convolves the grid intensities and weights with a 3×3×3 Gaussian kernel, which is equivalent to bilateral filtering in the image domain [6], and returns the normalized intensity. The interpolation engine reconstructs the filtered 2D image from the filtered grid. The filtered intensity value at pixel (x, y) is obtained by trilinear interpolation of the 2×2×2 filtered grid values surrounding the location (x/σ_s, y/σ_s, I_xy/σ_r). To meet throughput requirements, the interpolation engine is implemented as three pipelined stages of linear interpolations.
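The grid assignment and interpolation steps described above can be sketched in software as follows. This is a minimal illustration and not the chip's datapath: the function names (`build_grid`, `trilinear`), the list-of-lists layout, and the convention of treating each cell as a sample at integer grid coordinates are assumptions made for the example. The Gaussian convolution stage is omitted, and out-of-range neighbours are clamped to the nearest cell as a simple stand-in for boundary replication.

```python
import math

def build_grid(image, sigma_s, sigma_r, levels=256):
    """Grid assignment: per-block histograms of pixel count (W) and summed intensity (I)."""
    height, width = len(image), len(image[0])
    rows = math.ceil(height / sigma_s)   # partial blocks at the edges get their own cell
    cols = math.ceil(width / sigma_s)
    bins = levels // sigma_r             # 256 / sigma_r intensity bins per block
    weight = [[[0] * bins for _ in range(cols)] for _ in range(rows)]
    intensity = [[[0] * bins for _ in range(cols)] for _ in range(rows)]
    for y in range(height):
        for x in range(width):
            v = image[y][x]
            i, j, r = y // sigma_s, x // sigma_s, v // sigma_r
            weight[i][j][r] += 1
            intensity[i][j][r] += v
    return weight, intensity

def lerp(a, b, t):
    """Linear interpolation between a and b."""
    return a + (b - a) * t

def trilinear(grid, fi, fj, fr):
    """Trilinear interpolation of a 2x2x2 neighbourhood as three stages of lerps."""
    ni, nj, nr = len(grid), len(grid[0]), len(grid[0][0])
    i0, j0, r0 = min(int(fi), ni - 1), min(int(fj), nj - 1), min(int(fr), nr - 1)
    i1, j1, r1 = min(i0 + 1, ni - 1), min(j0 + 1, nj - 1), min(r0 + 1, nr - 1)
    ti, tj, tr = fi - i0, fj - j0, fr - r0
    # Stage 1: four interpolations along the intensity (range) axis.
    c00 = lerp(grid[i0][j0][r0], grid[i0][j0][r1], tr)
    c01 = lerp(grid[i0][j1][r0], grid[i0][j1][r1], tr)
    c10 = lerp(grid[i1][j0][r0], grid[i1][j0][r1], tr)
    c11 = lerp(grid[i1][j1][r0], grid[i1][j1][r1], tr)
    # Stage 2: two interpolations along the column axis.
    c0 = lerp(c00, c01, tj)
    c1 = lerp(c10, c11, tj)
    # Stage 3: one interpolation along the row axis.
    return lerp(c0, c1, ti)
```

For a pixel at (x, y) with intensity I_xy, the filtered output under these assumptions would be `trilinear(filtered, y / sigma_s, x / sigma_s, I_xy / sigma_r)`, where `filtered` holds the normalized grid intensities after convolution.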
The grid processing tasks are scheduled to minimize local storage requirements and memory traffic. Fig. 9.6.3 shows the architecture of the bilateral filtering engine and the task scheduling. Grid processing is performed cell by cell in a row-wise manner. When cell (i, j) is being assigned, the convolution engine is processing cell (i-2, j-1) and the interpolation engine is processing cell (i-4, j-2). Boundary rows and columns are replicated for processing boundary cells. This scheduling scheme allows processing without storing the entire grid: only two grid rows need to be stored locally at a time. The number of grid cells varies inversely with σ_s and σ_r. Most applications work well with a coarse grid resolution on the order of 32 pixels. Decreasing the number of grid cells directly reduces the number of computations required. The grid size is configurable by adjusting σ_s from 16 to 128 and σ_r from 16 to 64. For a 10Mpixel (4096×2592) image, the number of grid cells scales from 663552 (σ_s = 16, σ_r = 16) to 2592 (σ_s = 128, σ_r = 64). The 21.5kB of on-chip SRAM is used to store two rows of created and filtered grid cells. The SRAM is implemented as 8 banks supporting a maximum of 256 cells in each row of the grid with 16 intensity levels, corresponding to the worst case of σ_s = 16, σ_r = 16. Each bank is clock and input gated to save energy when a lower resolution grid is used. Only 1 bank is used when σ_s = 128 and all 8 banks are used when σ_s = 16.
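The grid-size figures quoted above follow from simple arithmetic, sketched below. The function names are hypothetical. `nominal_grid_cells` reproduces the quoted counts as (pixels per image / pixels per block) × (256/σ_r); a physical grid would round partial blocks up, which this nominal count ignores. `active_banks` assumes that grid rows are spread evenly over the banks at 32 cells per bank; the text states only the two endpoints (1 bank at σ_s = 128, 8 banks at σ_s = 16), so the intermediate values are an assumption.

```python
import math

def nominal_grid_cells(width, height, sigma_s, sigma_r, levels=256):
    """Nominal cell count: (pixels / block area) * (intensity levels / sigma_r)."""
    return (width * height * levels) // (sigma_s * sigma_s * sigma_r)

def cells_per_row(width, sigma_s):
    """Number of grid cells in one grid row for an image of the given width."""
    return math.ceil(width / sigma_s)

def active_banks(width, sigma_s, cells_per_bank=32, total_banks=8):
    """Banks needed per grid row, assuming an even split of cells over banks."""
    return min(total_banks, math.ceil(cells_per_row(width, sigma_s) / cells_per_bank))

if __name__ == "__main__":
    for sigma_s, sigma_r in [(16, 16), (128, 64)]:
        print(sigma_s, sigma_r,
              nominal_grid_cells(4096, 2592, sigma_s, sigma_r),
              active_banks(4096, sigma_s))
```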
The testchip contains two bilateral filter engines, each processing 4 pixels/cycle. Fig. 9.6.4 shows the architecture of the HDR creation module. It takes one low-dynamic-range (LDR) pixel from each of 3 different exposures (I_E1, I_E2, I_E3) and merges them into an HDR pixel (I_HDR) using camera response curves. Displaying HDR images on LDR media requires tone mapping, which compresses the image dynamic range by non-linear filtering. A tone-mapped HDR image (I_TM) is created by bilateral filtering HDR intensity values in the log domain, followed by contrast reduction [3]. In HDR mode, both bilateral grids are configured to perform filtering in an interleaved manner, where each grid processes alternate blocks in parallel. Glare reduction is similar to single-image tone mapping and is integrated with the HDR architecture. LLE imaging is performed by merging two images captured in quick succession, one taken without flash (I_NF) and one with flash (I_F). The bilateral grid is used to decompose both images into base and detail layers. In this mode, one grid is configured to perform bilateral filtering on the non-flash image and the other to perform cross-bilateral filtering [6] on the flash image using the non-flash image. The scene ambience is captured in the base layer of I_NF and details are captured in the detail layer of I_F. The flash image contains shadows that are not present in the non-flash image. A novel shadow correction module, shown in Fig. 9.6.4, merges the details from the flash image with the base layer of the cross-bilateral filtered non-flash image and corrects for the flash shadows to avoid artifacts. A mask representing regions with high detail in the filtered non-flash image is created, and details from the flash image are added in the masked regions only. The processing is done on 4×4 sub-blocks of the σ_s × σ_s blocks to reduce complexity.
This implementation of the shadow correction module handles shadows effectively to produce LLE images without artifacts.
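The masked merge of base and detail layers can be illustrated with a short sketch. This is not the shadow correction module itself: the function name `merge_lle`, the additive definition of the detail layer (flash image minus its base layer), and the binary mask supplied by the caller are all assumptions made for the example. How the chip derives the mask and represents the detail layer is not specified in the text.

```python
def merge_lle(base_nf, flash, base_f, mask):
    """Add flash-image detail to the non-flash base layer where the mask is set.

    base_nf : base layer of the non-flash image (scene ambience)
    flash   : flash image
    base_f  : base layer of the flash image
    mask    : 1 where flash detail should be added, 0 elsewhere (assumed input)
    All arguments are 2D lists of equal size.
    """
    out = []
    for y in range(len(base_nf)):
        row = []
        for x in range(len(base_nf[0])):
            detail = flash[y][x] - base_f[y][x]   # assumed additive detail layer
            row.append(base_nf[y][x] + mask[y][x] * detail)
        out.append(row)
    return out
```

With `mask` set to zero in a flash-shadow region, the output there falls back to the non-flash base layer, so the shadow edge in the flash image contributes no detail.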
The testchip is implemented in 40nm CMOS technology and verified to be operational from 25MHz at 0.5V to 98MHz at 0.9V. Fig. 9.6.5 shows outputs for HDR imaging, LLE imaging and glare reduction. This chip is designed to function as an accelerator core as part of a larger microprocessor system, utilizing the system's existing DRAM resources. For standalone testing of this chip, a 32b-wide 266MHz DDR2 memory controller was implemented using a Xilinx XC5VLX50 FPGA. The energy vs. performance trade-off and the frequency of operation of the testchip are shown in Fig. 9.6.6 for a range of V_DD, along with runtimes for different image sizes at 98MHz with a 0.9V V_DD. The runtime for a 10Mpixel image is compared with GPU/CPU implementations of C++ code that replicates the functionality of the testchip. The processor achieves a 15× reduction in runtime compared to the CPU implementation while consuming 17.8mW of power, a significant energy reduction compared to previous CPU or GPU implementations [6]. The architecture supports a high degree of parallelism, which can be used to further enhance the throughput and reduce the runtime. The energy-scalable implementation proposed in this work enables efficient integration into portable multimedia devices for real-time computational photography.
