Abstract: This paper proposes a novel architecture for color image enhancement using a machine learning algorithm called Ratio Rule. The approach uses log-domain computation to eliminate all multiplications and divisions, relying on approximation techniques for efficient estimation of log2 and inverse-log2. A new quadrant symmetric architecture is also presented to provide a very high throughput rate for the homomorphic filters, which perform the pixel intensity enhancement across the RGB components in the system. The pipelined design of the filter allows a wide range of kernels to be reloaded for different frequency responses. A new approach to the design of uniform filters is also presented, reducing the processing element arrays (PEAs) from W PEAs to 2 PEAs for a W × W window. This concept is applied to assist in training the synaptic weights of the neural network for color balancing, which restores the intensity-enhanced image to the natural color of the original image. The system with parallel pipelined architectures achieves 147.3 million outputs per second (MOPS), or equivalently 57.9 billion operations per second, on Xilinx's Virtex II XC2V2000-4ff896 FPGA at a clock frequency of 147.3 MHz.
I. Introduction
Physical limitations exist in the sensor arrays of imaging devices such as CCD and CMOS cameras. These devices often cannot represent scenes that contain both very bright and very dark regions. The sensor cells commonly compensate for saturation in the bright regions, which fades out the details in the darker regions. Image enhancement algorithms [1], [2] provide good rendering to bring out the details hidden by the dynamic range compression of the physical sensing devices. However, these algorithms fail to preserve the color relationship among the RGB channels, which results in a loss of color information after enhancement. The recently developed fast-converging neural network based learning algorithm called Ratio Rule [3], [4] provides an excellent solution for natural color restoration of the image after gray-level image enhancement. Hardware implementation of such algorithms is essential to parallelize the computation and deliver real-time throughput for color images or videos that involve extensive transformations and large volumes of pixels. Window-related operations such as convolution, summation, and matrix dot products, which are common in enhancement architectures, demand an enormous amount of hardware resources [5], [6]. Often, a large number of multiplications and divisions is needed [7]. Some designs work around this issue by adapting the architecture to very specific forms [5], [6], [8], and therefore cannot operate on different sets of properties related to the operation without reconfiguration. We propose the concept of log-domain computation [9] to solve the problem of multiplication and division in the enhancement system and to significantly reduce the hardware requirement while providing a high throughput rate.
II. Concept of the Design with Log-Domain Computation
The gray-level color image enhancement with Ratio Rule [3] comprises three main components. The illumination-reflectance image enhancement model relies on the assumption that the detail (reflectance component) in the image is logarithmically separable [10]. Hence high-boost homomorphic filters can be applied to bring out the details hidden in the dark regions of the image. In the time domain, the enhancement (1a) can be expressed by (1b) in logarithm base two, where the symbol * denotes the digital filter operation (convolution), Î_{R,G,B} are the normalized RGB components of the input image, h(x, y) are the filter coefficients from a high-boost transfer function, and D is the de-normalizing factor.
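As a hedged sketch of the relationship this paragraph describes, assuming the standard homomorphic form, the enhancement can be written as follows. The symbol $I'_c$ for the enhanced output, the intermediate $L_c$, and the exact placement of $D$ are assumptions made for illustration and are not taken from the source.

```latex
\begin{align}
I'_{c}(x,y) &= D \cdot 2^{\,L_{c}(x,y)}, \\
L_{c}(x,y)  &= h(x,y) * \log_2 \hat{I}_{c}(x,y), \qquad c \in \{R, G, B\}.
\end{align}
```

Under this reading, the convolution is applied to the log2-scaled pixels, and a single inverse-log2 followed by de-normalization returns the result to linear scale.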
High Performance Architecture for Color Image Enhancement
The color characterization of the RGB channels is modeled by Ratio Rule to preserve the relationship among the channels, as shown in (2), where ζ denotes the original RGB component and N(s) is the set of neighboring pixels under a W × W = P window. W_{I,K} is the average value of the ratios among the channels under the window. Note that only 6 of the W_{I,K} channels need to be calculated, because the RR, GG, and BB ratios are always 1. In the log-domain, the computation of each ratio, including the factor P in the denominator, reduces to subtractions. The average is then simply the summation of the P values over the N(s) neighbors.
for I, K in {R, G, B}
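The log-domain computation of the ratio weights described above can be sketched as follows. This is a minimal illustration, assuming that W_{I,K} is the window average of ζ_I/ζ_K; the function name `ratio_weight` is hypothetical, and floating-point `math.log2` stands in for the hardware log2 approximation.

```python
import math


def ratio_weight(window_i, window_k):
    """Average of window_i[s] / window_k[s] over a window of P pixels.

    Each ratio, including the 1/P factor, is formed by subtraction in
    the log2 domain; the average is then a plain sum of P values.
    Pixel values must be positive because log2(0) is undefined.
    """
    p = len(window_i)
    log_p = math.log2(p)
    total = 0.0
    for zeta_i, zeta_k in zip(window_i, window_k):
        # log2(zeta_i / (P * zeta_k)) = log2(zeta_i) - log2(zeta_k) - log2(P)
        log_ratio = math.log2(zeta_i) - math.log2(zeta_k) - log_p
        total += 2.0 ** log_ratio  # inverse-log2 back to linear scale
    return total


# Hypothetical 2 x 2 window (P = 4) in which channel I is twice channel K.
w_ik = ratio_weight([2, 4, 8, 16], [1, 2, 4, 8])
```

When both arguments are the same channel the result is 1, which is why the RR, GG, and BB weights need not be computed.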
The color balancing is defined by activation function as (3) where 
III. Architecture for Image Enhancement with Ratio Rule
A brief design overview is presented in section III.A, along with the flow of the computations within the architecture. The implementation of the homomorphic filters is discussed in section III.B. The designs of the color characterization and color balancing modules are described in sections III.C and III.D, respectively.
A. Design overview
A brief overview of the image enhancement system with color restoration is shown in Figure 1, along with its interfaces. The architecture features RGB streaming input, with the option of specifying the image width on the 'Imsize' bus, the threshold values for color balancing on the 'Thi' bus, and the kernel coefficients for the homomorphic filters on 'KernBus'. The output buses carry the intensity-enhanced and color-balanced images. Within the architecture, the input RGB channels are buffered. Multiple output buses from the buffer feed three K × K homomorphic filters and six W × W weight windows of the neurons concurrently, so the filtering and the synaptic weights are computed in parallel. The filter outputs are synchronized in the color characterization module and sent to the color balancing module along with the synaptic weights. The color relationship is restored in the color balancing module to produce images with the natural color of the original images.
... 1, which is negative. Given that image pixels are positive and that log2 of a negative number is undefined, the absolute value can be logically approximated by taking the inverted output of the registered result from vertical folding. This procedure inherently reuses the V-fold pipeline stage rather than introducing an additional stage and resources to compute the absolute value of the normalized v. To reduce the processing bandwidth by another half, horizontal folding is performed, taking into account the delay in the systolic architecture. The registered results of the H-fold stage are sent to arrays of processing elements (PEs) for successive filtering. The partial results from the PE arrays (PEAs) are combined by a pipelined adder tree (AT). The overall output of the homomorphic filter for each channel is computed by taking the inverse-log2. In addition, the log2-scaled version (HRGB_L) of the output is sent to the color characterization module, where it is synchronized with the synaptic weights for the upcoming multiplication operation. For more details on the concept of the quadrant symmetric architecture, readers are referred to [12], which focuses on a multiplier-based design in linear scale. The design of the PE for the homomorphic filters is discussed in section III.B.1.
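The vertical and horizontal folding steps can be illustrated in linear scale, which is the setting [12] is said to cover. The sketch below assumes an odd-sized kernel that is symmetric about both its center row and center column; the function names are hypothetical and the code models the arithmetic only, not the pipeline timing or the log-domain PEs.

```python
def direct_sum(window, kernel):
    """Reference: one multiply per kernel tap (K * K multiplies)."""
    k = len(kernel)
    return sum(window[r][c] * kernel[r][c] for r in range(k) for c in range(k))


def folded_sum(window, kernel):
    """Quadrant-symmetric evaluation using ((K + 1) / 2) ** 2 multiplies."""
    k = len(kernel)
    m = k // 2  # index of the center row and column

    # Vertical fold: add each pair of rows mirrored about the center row.
    vfold = [list(window[m])]
    for r in range(1, m + 1):
        vfold.append([window[m - r][c] + window[m + r][c] for c in range(k)])

    # Horizontal fold: add each pair of columns mirrored about the center.
    hfold = []
    for row in vfold:
        hfold.append([row[m]] + [row[m - c] + row[m + c] for c in range(1, m + 1)])

    # Only one quadrant of the kernel is needed for the multiplications.
    return sum(
        hfold[r][c] * kernel[m + r][m + c]
        for r in range(m + 1)
        for c in range(m + 1)
    )


# Hypothetical 3 x 3 example with a quadrant-symmetric kernel.
kernel = [[1, 2, 1], [2, 4, 2], [1, 2, 1]]
window = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
```

For a 9 × 9 kernel this folding reduces the number of coefficient multiplications per output from 81 to 25, at the cost of the extra additions in the two fold stages.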
1) Architecture of Pipelined Processing Element in Homomorphic Filters
The design of the PE in the homomorphic filters uses log-domain computation to eliminate the need for hardware multipliers. The data from the H-fold register is pre-normalized without extra logic by shifting the bus. It is then converted to log2 scale, as shown in Figure 3a, and added to the log2-scaled kernel coefficients (LKC) held in the LKC register set. The result from the last stage is converted back to linear scale with a range check (RC). If overflow or underflow occurs, the holding register of this pipeline stage is set or cleared, respectively. Setting and clearing produce the maximum and minimum values representable in an N-bit register. The output of this stage is de-normalized, likewise by bus shifting, before it is successively accumulated along the accumulation line. The log2 architecture shown in Figure 3b is very similar to that of [9], except that full precision is used and registers are introduced to approximately double the performance. The maximum logic delay is reduced to that of a single component, so pipelining beyond this point brings no further benefit. Interested readers are referred to [9] for the detailed implementation.
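The multiplier-free product in the PE can be sketched in software. The source defers the log2 details to [9], so the sketch below assumes a Mitchell-style piecewise-linear approximation, which may differ from the actual circuit. The function names and the 8-bit default width are hypothetical.

```python
import math


def approx_log2(x):
    """Approximate log2 of a positive integer: 2**k * (1 + f) maps to k + f."""
    k = x.bit_length() - 1            # position of the leading one
    f = (x - (1 << k)) / (1 << k)     # remaining bits read as a fraction
    return k + f


def approx_exp2(y):
    """Approximate inverse-log2: k + f maps back to 2**k * (1 + f)."""
    k = math.floor(y)
    f = y - k
    return (2.0 ** k) * (1.0 + f)


def log_domain_multiply(a, b, n_bits=8):
    """Multiply by adding approximate logs, then range-check the result."""
    product = approx_exp2(approx_log2(a) + approx_log2(b))
    max_val = (1 << n_bits) - 1
    # Range check: overflow sets the register, underflow clears it.
    return max(0, min(max_val, int(product)))
```

Products of powers of two are exact under this approximation, while other operands are underestimated (3 × 3 evaluates to 8). This is one source of the hardware-versus-software error that any such approximation introduces.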
C. Color characterization architecture
Computation in the color characterization module is performed in parallel with the homomorphic filters. Performing division in hardware is rather cumbersome with a conventional design approach, not to mention the pipeline stages it needs. In the log-domain, division reduces to subtraction in hardware. Hence the ratios in (2) can be calculated by subtractions in pipeline stage p1 of Figure 4, using the already log2-scaled data from PDB. The averaging factor, P, in (2) can be conveniently applied at stage p2, as shown in the figure, while the ratios are still in the log-domain. The results from p2 are converted back to linear scale in p3. Based on the folding concept of the quadrant symmetric architecture and the fact that the summation is equivalent to a digital filter with a uniform kernel, the vertical folding can be performed repeatedly until it is reduced to a single processing node. These weights are converted to the log-domain for the upcoming multiplication in the color balancing module. The constant log2(n) is subtracted from the HRGB_L outputs of the homomorphic filters to account for the averaging factor n (n = 3) in (4b). The outputs (Sh_L denotes the log2-scaled outputs of the homomorphic filters normalized by n) are synchronized with the synaptic weights for color restoration in the color balancing module.
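Because the window summation is a filter with a uniform kernel, it can be computed with a fixed amount of work per output, independent of W. The sketch below uses the standard two-stage running-sum formulation as an illustration. It is an assumption that this resembles the reduction to 2 PEAs; the source builds its circuit by repeated folding, which may differ. All function names are hypothetical.

```python
def box_sum_direct(img, w):
    """Reference: sum every W x W window with W * W additions each."""
    rows, cols = len(img), len(img[0])
    return [
        [sum(img[r + i][c + j] for i in range(w) for j in range(w))
         for c in range(cols - w + 1)]
        for r in range(rows - w + 1)
    ]


def running_sum_1d(values, w):
    """Sliding sum of width w: one add and one subtract per new output."""
    acc = sum(values[:w])
    out = [acc]
    for i in range(w, len(values)):
        acc += values[i] - values[i - w]
        out.append(acc)
    return out


def box_sum_two_stage(img, w):
    """Vertical running sums per column, then a horizontal running sum."""
    rows, cols = len(img), len(img[0])
    # Stage 1: col_sums[c][r] is the sum of img[r .. r + w - 1][c].
    col_sums = [running_sum_1d([img[r][c] for r in range(rows)], w)
                for c in range(cols)]
    # Stage 2: slide horizontally across the column sums of each row band.
    return [running_sum_1d([col_sums[c][r] for c in range(cols)], w)
            for r in range(rows - w + 1)]


# Hypothetical 4 x 4 test image holding the values 1 to 16.
img = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
```

Both functions return the same window sums; only the amount of work per output differs.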
D. Color balancing module
The color restoration process computes the activation function of (3), which is merged into (4a) under the assumption that the update rate, v, is 1. The calculation simplifies to summing the activation functions, since the outputs Sh_L(RGB) of the homomorphic filters are pre-normalized by n. The complete color balancing module is shown in Figure 5. It is pipelined into 7 stages, p1 to p7. In p1, W_L(0..5) and Sh_L(0..2) are added in the log-domain, which is equivalent to multiplication in linear scale. Six approximated outputs of the activation functions are produced, as shown in the figure. In parallel, the inverse-log2 operation is also performed on Sh_L. In p2, the approximated outputs of the activation functions are converted to linear scale. In p3, Sh is subtracted from this set of outputs to compute the error functions that represent the distortion. The errors are compared against ThiP and ThiN (the upper and lower thresholds) to determine whether they fall within the tolerable bounds. Two bits are set in p4 for each of the 6 channels if it passes the upper and lower boundary tests. The bits are ANDed together to serve as the select line for 2-to-1 multiplexers (MUX). If Sh is within the relationship to the related RGB channels, the MUX selects Sh as its output. Otherwise, W_{I,K} Sh_K passes through the MUX in p5. Within the same stage, Sh(0..2) are also sent to the set of output buses, since the relationship within a channel itself always holds (i.e. the ratio of RR, GG, and BB is always 1 and constitutes Sh(RR,GG,BB) itself). The components belonging to each RGB channel are recombined through a pipelined 3-to-1 AT. For example, Sh(G), W_GR × Sh(R), and W_GB × Sh(B) are sent to the AT of the G channel if the Sh_I in both GR and GB channels fall outside the
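The threshold test and multiplexer selection described in stages p3 to p5 can be sketched for one output channel as follows. This is a linear-scale illustration only: the function name, the dictionary layout, and the sample values are hypothetical, and the activation function (3) itself is not reproduced.

```python
def balance_channel(sh, weights, i, thi_n, thi_p):
    """Recombine one channel from its own value and two selected terms.

    sh      -- filter outputs per channel, already normalized by n = 3
    weights -- synaptic weights keyed by (I, K), e.g. ("G", "R")
    i       -- the channel being restored
    thi_n   -- lower error threshold (ThiN)
    thi_p   -- upper error threshold (ThiP)
    """
    total = sh[i]  # own-channel term: the II ratio is always 1
    for k in sh:
        if k == i:
            continue
        predicted = weights[(i, k)] * sh[k]  # W_{I,K} * Sh_K
        error = predicted - sh[i]            # stage p3
        in_bounds = thi_n <= error <= thi_p  # stage p4: two tests ANDed
        # Stage p5: the MUX passes Sh_I when in bounds, W_{I,K} * Sh_K otherwise.
        total += sh[i] if in_bounds else predicted
    return total  # the 3-to-1 adder tree


# Hypothetical values: the GR relationship holds, the GB relationship does not.
sh = {"R": 10.0, "G": 20.0, "B": 30.0}
weights = {("G", "R"): 2.0, ("G", "B"): 1.0}
g_out = balance_channel(sh, weights, "G", thi_n=-5.0, thi_p=5.0)
```

When every relationship is within bounds the three terms are identical, so the sum undoes the normalization by n and returns the unmodified filter output.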
IV. Simulation and Error Analysis
Images with non-uniform darkness are used in the simulation of the hardware algorithm. The parameter set for the test is as follows: 
A. Simulation
The image is sent to the architecture pixel by pixel in raster-scan fashion. After the initial latency of the homomorphic filters (i.e. Imsize × (hy-1)/2 + (hx+1)/2 + 9 + D_AT cycles, where hy and hx are the dimensions of the filter and D_AT is the latency of the pipelined AT), the output becomes available and is collected for error analysis. Likewise, after the additional pipeline latency of the color balancing module, the overall output of the enhancement architecture is recorded. A typical test image is shown in Figure 6a, where the shadow region exists as a consequence of saturation in the bright region. The output of the homomorphic filters from hardware simulation is illustrated in Figure 6b. The image appears pale, with distorted color information, even though the enhancement brings back the details in the shadow region. The overall outputs of the system are plotted in Figures 6c and 6d for the hardware and software algorithms, respectively.
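The latency expression above can be evaluated directly. In the sketch below the function name is hypothetical, and the adder-tree latency used in the example call is an illustrative value, since the source does not state D_AT.

```python
def homomorphic_latency(imsize, hy, hx, d_at):
    """Initial latency in clock cycles of one homomorphic filter.

    imsize -- image width in pixels (the value on the 'Imsize' bus)
    hy, hx -- odd kernel dimensions
    d_at   -- latency of the pipelined adder tree
    """
    return imsize * ((hy - 1) // 2) + (hx + 1) // 2 + 9 + d_at


# Example: 1024-pixel-wide image, 9 x 9 kernel, assumed D_AT of 5 cycles.
cycles = homomorphic_latency(1024, 9, 9, d_at=5)
```

The first term dominates: the filter must receive (hy-1)/2 full image rows before the first window is centered on a valid pixel.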
B. Error analysis
A typical histogram of the error between the hardware simulation and the software algorithm is shown in Figures 7a and 7b for the outputs of the homomorphic filters and of the overall system, respectively. The majority of the errors in this region are less than 5 to 10 pixel intensities, with an average error of 3.01 to 4.84. This error measure reflects the fact that the hardware simulation is subject to approximation error and to the finite number of bits representable in the architecture, whereas the software algorithm is free from these constraints. While the hardware simulation shows very attractive results, the efficiency of hardware utilization and the performance, which are discussed in section V, are even more impressive.
V. Hardware Utilization and Performance Evaluation
A. Hardware utilization
The hardware resource utilization is characterized based on Xilinx's Virtex II XC2V2000-4ff896 FPGA and the Integrated Software Environment (ISE). The particular FPGA chip we target has 10,752 logic slices, 21,504 flip-flops (FFs), 21,504 lookup tables (4-input LUTs), 56 block RAMs (BRAMs), and 56 embedded 18-bit signed multipliers in hardware; however, we do not use the built-in multipliers. The resource allocation for various sizes of the kernels in the homomorphic filters and of the windows for the synaptic weights is shown in Table 1.
Given a 1024 × 1024 image frame, the system can process over 140.4 frames per second without frame buffering at its peak performance. This gain in performance, achieved while consuming significantly fewer hardware resources, would have been extremely difficult to obtain without the logarithmic modules and log-domain computation. An additional benefit is that the filter coefficients are not hardwired, which gives the flexibility to reload the coefficients for different filter characteristics.
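The quoted frame rate follows from the output rate, and the operation count quoted in the abstract and conclusion can be checked the same way. The sketch below assumes one output pixel per clock cycle and counts each kernel or window tap as one operation; that counting convention is an assumption that happens to reproduce the quoted figure, not something the source states.

```python
OUTPUTS_PER_SECOND = 147.3e6  # 147.3 MOPS at a 147.3 MHz clock


def frames_per_second(width, height, outputs_per_second=OUTPUTS_PER_SECOND):
    """Frame rate when one output pixel is produced per clock cycle."""
    return outputs_per_second / (width * height)


def operations_per_second(n_filters, k, n_windows, w,
                          outputs_per_second=OUTPUTS_PER_SECOND):
    """Operation rate, counting one operation per kernel or window tap."""
    ops_per_output = n_filters * k * k + n_windows * w * w
    return ops_per_output * outputs_per_second


fps = frames_per_second(1024, 1024)
# Three 9 x 9 homomorphic filters and six 5 x 5 weight windows.
ops = operations_per_second(3, 9, 6, 5)
```

With these settings there are 3 × 81 + 6 × 25 = 393 taps per output, which gives roughly 57.9 billion operations per second, and the frame rate comes to just over 140.4 frames per second.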
VI. Conclusion
A novel architecture for color image enhancement with the Ratio Rule learning algorithm has been presented. The approach uses log-domain computation to eliminate all multiplications and divisions. The log2 and inverse-log2 computations are performed using approximation techniques. A new high-performance quadrant symmetric architecture was also presented to provide a very high throughput rate for the homomorphic filters in color image enhancement, where the intensities of the RGB components of the image are boosted by three separate filters. The new concept of uniform filter design reduces the hardware resource and processing bandwidth requirements from W PEAs to 2 PEAs for a W × W window. The recurrent neural network based color restoration technique is found to be very effective in restoring the natural colors using the learned relationship between the RGB channels. The system sustains 147.3 million outputs per second (MOPS), or equivalently 57.9 billion operations per second, with three 9 × 9 homomorphic filters and six 5 × 5 windows for the synaptic weights on Xilinx's Virtex II XC2V2000-4ff896 FPGA at a clock frequency of 147.3 MHz.
