An image signal processor (ISP) for a camera image sensor consists of many complicated functions; in this paper, a full chain of ISP functions for smart devices is presented. Each function in the proposed ISP full chain is designed to handle high-quality images. Every function in the chain is fully converted to fixed-point arithmetic, and no special functions are used, which eases porting to a Samsung Reconfigurable Processor (SRP). Several parallelizing optimization techniques are applied to the proposed ISP full chain for real-time operation on a given 600-MHz reconfigurable processor. To verify the performance of the proposed ISP full chain, a series of tests was performed, and all of the measured values satisfy the quality and performance requirements.
Introduction
Image sensors are used in numerous types of image acquisition devices such as digital cameras, camcorders, and CCTV cameras. Recently, their range of applications has broadened to include smart devices, and the acquired images are not merely stored but are also used for interaction between a human and a computer. To satisfy the many goals of image sensors, the role of image enhancement is more important than ever before.
An image signal processor (ISP) is one of the non-optical devices that enhance the image quality of captured raw images, and it consists of several image processing algorithms including demosaicing, denoising, and white balancing, as well as other image enhancement algorithms. The latest ISP algorithms, which include iterations with adaptive selections according to the image characteristics, produce an excellent image quality. The high image quality costs a vast amount of calculation, however, and also requires complicated adaptive routines that cannot be executed in parallel.
An ISP can be implemented on dedicated hardware, a general-purpose processor, or a parallel-computing processor. A dedicated hardware implementation offers high image quality and processing performance at the expense of scalability and flexibility, whereas the implementation of an ISP on a general-purpose processor can provide not only the high image quality of complicated algorithms but also sound scalability and flexibility; however, the implementation cost of the latter is high because of the large amount of computation, and a high-performance platform such as a desktop PC is necessary. A parallel-computing processor combines high processing performance and low power consumption with the scalability and flexibility of a software implementation. The implementation of an ISP algorithm on a parallel-computing processor, however, requires further optimization to utilize multiple processing elements in parallel. Because of the adaptivity of the ISP algorithm, the conventional parallel ISP optimization methodology first divides the algorithm into data-processing parts and control-processing parts and then operates them in parallel. A Very Long Instruction Word (VLIW) architecture can therefore be an easy choice for ISP implementation, even though a Single Instruction Multiple Data (SIMD) architecture can exploit a greater extent of parallelism.
An ISP full chain that is suitable for parallel processing is proposed in this paper, and the chain is implemented through an optimization process for a SIMD processor architecture to achieve both the image quality and the performance goals. The proposed ISP full chain is shown in Fig. 1.
In Fig. 1, GWA is Gray World Assumption, AHD is Adaptive Homogeneity-Directed Demosaicing, BF is Bilateral Filter, AC is Auto Contrast, and LTI is Luminance Transient Improvement.
All of the algorithms in the proposed ISP chain process high-quality images without iterations, so that the execution time stays within the real-time budget [1]. While the basic idea of each algorithm is maintained, the operations in the algorithm have been simplified for easy parallelization on the SIMD architecture; in addition, heavy memory accesses and excessive computational overheads are reduced by limiting the operational ranges. Each complicated special operation is replaced by a simple operation that performs a similar function, and the result was verified by experiments.
The proposed parallel ISP algorithm is targeted to run on the Samsung Reconfigurable Processor (SRP) [2-7], which can be configured as a SIMD processor. Numerous high-quality image processing algorithms form the basis of each of the functional components of the proposed ISP full chain. By increasing the homogeneity of the parallel operations in the ISP algorithms, the proposed ISP algorithm can take advantage of the parallel performance of a SIMD processor while maintaining an image quality that can pass the commercial image quality test of Skype [31]. The proposed ISP can handle the resolution of full HD video (1920 × 1080, 30 frames per second) on a 600-MHz SRP that is suitable for smart devices. This paper comprises the following: Section 2 describes the existing research; Section 3 describes the implementation of the proposed ISP full chain; Section 4 describes the performance verification process and the results of the proposed ISP full chain; and the conclusion is presented in Section 5.
Background research

Algorithms of the ISP full chain
The functions of the ISP full chain mainly support the recovery of non-existing pixels, noise reduction, and image enhancement. The proposed ISP full chain consists of white balancing, demosaicing, color correction, color space conversion, denoising, detail enhancement, and gamma correction. The color images that enter through an image sensor can show colors that are different from those that are seen by the naked eye; to correct this, the White Balance (WB) process can be used. The WB algorithm GWA [8, 9] assumes that the average of the image is gray; similarly, the white-patch Retinex (WR) algorithm [10] assumes that the maximum-intensity pixel is white. Since these assumptions can be statistically false, Iterative White Balancing (IWB) [11] iteratively refines the white pixels, while illuminant voting [12] checks the lighting conditions. The GWA is chosen for the proposed ISP because its computation is relatively regular compared with the other algorithms, which allows efficient parallelization during implementation, as follows:
where C represents one of R, G, and B, and C_WB represents the color value after white balancing. After the WB process, demosaicing produces full RGB channels by interpolating the color pixels that are missing from the images captured by the image sensor. Many algorithms exist, including heuristic methods, directional interpolations, frequency domain approaches, wavelet-based methods, and reconstruction approaches [13-16]; in this study, Adaptive Homogeneity-Directed Demosaicing (AHD) [15], a type of directional interpolation method that is commonly used for digital still cameras, was modified and used. Other algorithms such as wavelet-based methods [16] achieve a higher image quality, but they are not suitable for real-time implementation on the reconfigurable processor that is used in this study because of their huge amounts of calculations and iterations. The rough flowchart of the AHD algorithm is shown in Fig. 2.
The directional interpolation of the AHD performs interpolation in the direction of the strongest edge that flows either vertically or horizontally. Finding the direction of the edge depends on the homogeneity of the neighboring pixels that will also be generated. The homogeneity map is defined by Eq. (2), as follows:
where B is the set of pixels within the distance δ of (x, y) ∈ X and is defined by Eq. (3); X is the set of 2D pixel positions; L_f and C_f are the neighborhoods established by the luminance distance and the color distance in the CIELab color space and are defined by Eqs. (4) and (5), respectively; E is a set of tolerance values with δ, ε_L, ε_C ∈ E; and d_L and d_C are the distance functions for the luminance and for the ab plane of the CIELab color space, respectively. A detailed implementation of AHD is introduced in Hirakawa and Parks [15].
Inevitably, the acquired images contain a variety of noise due to the characteristics of the sensor and converter circuits that are used, especially in the low light of an indoor environment. To remove this noise effectively, highly adaptive noise reduction methods such as the Bilateral Filter (BF) [18-21] or a 3D noise reduction filter [22, 23] can be used. The BF, proposed by Aurich et al. [20] and improved by Tomasi et al. [21], is a non-linear adaptive low-pass filter whose weighting factors vary according to the distance and the intensity of the neighboring pixels. The BF is given by Eqs. (6) and (7):
In Eqs. (6) and (7), (x_p, y_p) is the location of the center pixel, (x_q, y_q) is the location of the neighboring pixel, I(x_p, y_p) and I(x_q, y_q) represent the intensities of the corresponding pixels, G_S is the Gaussian function for the spatial domain, and G_I is the Gaussian function for the intensity domain. The proposed ISP uses a modified BF that can flatten the noise area while preserving the edge information.
For an improved subjective image quality, it is necessary to enhance the contrast and edge information; to improve the image contrast, auto level, AC, and histogram equalization are examples of the methods that can be used [24]. The proposed ISP full chain includes the AC function, which causes relatively little color distortion and requires less complex operations; in addition, Luminance Transient Improvement (LTI) and Chrominance Transient Improvement (CTI) are applied to enhance the edges of the luminance and chrominance, respectively [25, 26]. For the LTI and CTI implementation, the difference of Gaussian method [27] is used because of its relatively simple operations and excellent edge extraction performance. The difference of Gaussian method is represented by Eq. (8):
where O is the enhanced signal, I is the input luminance signal, g_1 and g_2 are two Gaussian filters with the variances σ_1 and σ_2, the symbol "*" is the 2D convolution operator, x is the row number, and y is the column number. The color correction function changes an entire color according to the desired color temperature. In the proposed ISP full chain, color correction is combined with color conversion to reduce the redundant memory accesses. The applied color correction matrix is shown in Eq. (9), as follows:
where C_rr through C_bb are the correcting values that are multiplied by the RGB channels, and R_cc, G_cc, and B_cc are the color-corrected values of the color channels. While the acquired images are processed in the ISP full chain, several different color spaces are used. Color conversion is a signal-processing technique that transforms the color representation coordinates into another coordinate system in which the color axes are only weakly correlated, so that applying signal-processing functions there reduces the incidence of processing errors [30]. In the proposed ISP, the input signal is initially in the RGB space before it is converted into the YCoCg color space, where it is subjected to the luminance-related processes; subsequently, the signal is converted back to the RGB space, the color-related processes are applied, and the signal is then sent to the output display.
where Y, Co, Cg and R, G, B are the pixel values of the YCoCg color space and the RGB color space, respectively. Gamma correction (GC) modifies the linearity of the camera input to match the non-linearity of the human visual system [28, 29]. If GC is not applied to the acquired images, humans cannot differentiate the immense number of bits that represent the information. The GC can be modeled as Eq. (13); in the proposed ISP, the GC is implemented using a polynomial approximation:
$$I' = A \cdot I^{\gamma} \tag{13}$$
where I' is the output image, I is the input image, and γ is the gamma value. A is a constant that equals 1 in the common case.
Implementation platform
The proposed ISP is implemented on an SRP in accordance with the test of the preliminary version of the proposed ISP [1]. Since the SRP supports both the SIMD and the VLIW parallel processing modes, the proposed ISP is accelerated by implementing numerous key operations so that they run in parallel. The SRP supports the following three operation modes: SIMD, VLIW, and scalar. As the SRP configuration that is used in this study can process 128 bits at a time with 16 function units, it supports SIMD configurations that can process four 32-bit, eight 16-bit, or sixteen 8-bit data at one time. In the VLIW mode, eight of the function units can be operational at the same time, whereby up to eight operations can be executed in parallel. Since the routing channel of the SRP has independent configurations for the SIMD, VLIW, and scalar modes, the three modes cannot be used in combination; however, the SRP can switch among the three operational modes dynamically while the ISP software is processed. While the sequential codes in the complex control sequences of the algorithm run in the scalar mode, the parallel codes of the massive image data processing operation are accelerated in the SIMD mode or the VLIW mode. The memory accesses of the SRP should be aligned to 128-bit words; therefore, if the data size is not a multiple of a 128-bit word, an additional buffering stage is needed to ensure the alignment with the 128-bit words. The SRP also has a single memory port for read and write operations; therefore, memory-intensive jobs like lookup table operations cannot be parallelized, and they significantly slow down the processing speed.
The VLIW mode of the SRP offers greater programming flexibility because data processing operations and control operations can be executed simultaneously in this mode. The control operations often limit the parallelism, however, because of the dependency among the codes and data; in the SIMD mode, on the other hand, the SRP often suffers from a lack of data that can be processed in parallel. Since this lack of parallelism is inherent to the algorithms, the algorithms in the proposed ISP are modified to supply enough parallelism; therefore, the proposed ISP can mostly run in the SIMD mode for a sufficient computational performance. Figure 3 shows an overview of the SRP architecture.
In Fig. 3, FU is Function Unit, RF is Register File, VLIW is Very Long Instruction Word, and CGRA is Coarse-Grained Reconfigurable Array.
Existing research shows that other algorithms have been ported to the SRP platform, such as ray tracing [4], low-power audio processing [5], and 3D graphics [6]. The proposed ISP full chain is designed to work with SIMD-style parallel processing; however, because of its high parallelism, the proposed design can also be used on platforms with other types of microprocessors such as Intel [31], ARM [32], and the TI Digital Signal Processor (DSP) [33].
Intel processors and ARM processors are based on the superscalar architecture that executes multiple instructions at the same time. The Intel processor platforms perform better than the ARM processor platforms because the former contain a variety of hardware accelerators for multimedia processing (MMX, SSE, etc.); on the other hand, ARM processor platforms consume less power than Intel processor platforms, making them suitable for mobile applications. TI DSP platforms have VLIW architectures, whereby multiple signal-processing operations and control operations can be executed in parallel.
Algorithm porting on SRP
A number of optimization techniques were used to improve the computational performance of the ISP full chain on the SRP while the image quality is maintained. The SRP that is used in this study provides several commands for the efficient use of SIMD arithmetic data. The SIMD commands process 128-bit data as eight 16-bit elements. The SIMD commands are ADD, SHF (shift), CLIP, MUL (multiply) and ADD, and MUL and SHF. Since there is no SIMD command that returns the result of a comparison, the SUB and CLIP commands were used in combination to obtain the comparison results. The SIMD commands were heavily used in the proposed ISP functions for a high performance.
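The SUB and CLIP combination mentioned above can be illustrated with a small model. This is a minimal sketch rather than SRP code: it assumes that CLIP saturates every lane to a given range and that the data are integers, it models a 128-bit word as a list of eight values, and the helper names `simd_sub`, `simd_clip`, `greater_than_mask`, and `select_max` are hypothetical.

```python
LANES = 8  # eight 16-bit elements per 128-bit word


def simd_sub(a, b):
    """Lane-wise subtraction of two 8-lane vectors."""
    return [x - y for x, y in zip(a, b)]


def simd_clip(v, lo, hi):
    """Lane-wise saturation of each element to the range [lo, hi]."""
    return [min(max(x, lo), hi) for x in v]


def greater_than_mask(a, b):
    """1 in each lane where a > b, 0 elsewhere, using only SUB and CLIP."""
    return simd_clip(simd_sub(a, b), 0, 1)


def select_max(a, b):
    """Branch-free lane-wise maximum: b + clip(a - b, 0, 0x7FFF)."""
    diff = simd_clip(simd_sub(a, b), 0, 0x7FFF)
    return [y + d for y, d in zip(b, diff)]


a = [10, 20, 30, 40, 50, 60, 70, 80]
b = [15, 20, 25, 45, 50, 55, 75, 70]
mask = greater_than_mask(a, b)   # [0, 0, 1, 0, 0, 1, 0, 1]
maxes = select_max(a, b)         # lane-wise max(a, b)
```

Because the mask is an ordinary 0/1 vector, it can be combined with the MUL and ADD commands to select between two results without any branch.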
Module optimization for the proposed ISP
WB
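In the proposed ISP, the WB uses the GWA algorithm [9]. The GWA algorithm corrects the colors of an image by assuming that the average color of each RGB channel is gray. Using Eqs. (14) to (16), we calculated the GWA gains as follows:
$$R_{gain} = \frac{\bar{R} + \bar{G} + \bar{B}}{3\bar{R}} \tag{14}$$
$$G_{gain} = \frac{\bar{R} + \bar{G} + \bar{B}}{3\bar{G}} \tag{15}$$
$$B_{gain} = \frac{\bar{R} + \bar{G} + \bar{B}}{3\bar{B}} \tag{16}$$
where R_gain, G_gain, and B_gain are the color gain values for each of the channels, and R̄, Ḡ, and B̄ are the averages of the pixel values of the corresponding color channels.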
In the input Bayer pattern, the number of G pixels is twice that of the R or B pixels. As the WB process requires only the average over the entire image, it is possible to use the data of only half of the G pixels without incurring a significant error; therefore, when the GWA was applied, only half of the G pixels were used so that the amount of calculation for G equals that for R and B. Since the sum over all of the pixels is too large to fit into an integer register, a proper significant-figure addition was used to limit the bit width of the sum. If the integer registers that are used for the calculations are too small, the number of effective digits becomes too small and the errors become larger; conversely, if the integer registers are too large, the processor limitation makes parallelization difficult. For this reason, the sum register size was limited to 32 bits, and the temporary variables can be stored in 64-bit registers. Since the SRP does not support division, shift operations are used for the WB result. The division by 3 in Eqs. (14) to (16) is simplified to a multiplication by 3/8, which is performed as a multiplication by 3 followed by a right shift by three bits.
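The WB steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the SRP implementation: the Q8 gain format (`FRAC_BITS`), the 10-bit output range (`max_val`), the choice of which half of the G pixels is used, and the function names are all hypothetical; the significant-figure addition is not modeled because Python integers do not overflow; and the single division per channel is assumed to run once per frame outside the SIMD loop.

```python
FRAC_BITS = 8  # assumed Q8 fixed-point format for the gains


def gwa_gains(bayer):
    """Gray-world gains (R, G, B) for a GBRG Bayer frame given as a list of rows."""
    sum_r = sum_g = sum_b = 0
    count = 0
    for y in range(0, len(bayer) - 1, 2):
        even, odd = bayer[y], bayer[y + 1]
        for x in range(0, len(even) - 1, 2):
            sum_g += even[x]      # G on the G-B row; the G on the R-G row is skipped
            sum_b += even[x + 1]
            sum_r += odd[x]
            count += 1
    avg_r, avg_g, avg_b = sum_r // count, sum_g // count, sum_b // count
    # Division by 3 replaced as in the text: multiply by 3, then shift right by 3.
    gray = ((avg_r + avg_g + avg_b) * 3) >> 3
    # One scalar division per channel per frame (assumption, see lead-in).
    return tuple((gray << FRAC_BITS) // max(avg, 1) for avg in (avg_r, avg_g, avg_b))


def apply_gain(pixel, gain, max_val=1023):
    """Fixed-point multiply, right shift, and saturation to the pixel range."""
    return min((pixel * gain) >> FRAC_BITS, max_val)


# A reddish flat frame: R = 200, G = 100, B = 50 everywhere.
frame = [[100, 50, 100, 50], [200, 100, 200, 100]] * 2
gain_r, gain_g, gain_b = gwa_gains(frame)
balanced = (apply_gain(200, gain_r), apply_gain(100, gain_g), apply_gain(50, gain_b))
```

In this example the three balanced values coincide, so the flat frame becomes gray. The 3/8 factor scales all three gains equally, so it changes the overall brightness but not the balance between the channels.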
Modified AHD
As shown in Fig. 2, after the WB process is performed on the Bayer pattern of the image sensor, the modified version of the original AHD [15] is used as the demosaicing algorithm. The AHD consists of the following three steps: directed interpolation, homogeneity-directed map creation, and iterative noise filtering. The method for finding the direction of the edge depends on the location and color of the pixel that is to be generated. The width of the variables is 16 bits, including three bits for the fractional part. An operational example of the proposed modified AHD is explained below.
In Fig. 4, R, G, and B are the red, green, and blue pixels, respectively, and the number is the pixel position. Figure 4 shows GBRG, the Bayer pattern of the image sensor that was used for the proposed ISP implementation; taking G44 in the middle as the reference, a GBRG unit consists of G44, B45, R54, and G55, and the numbering starts from the top left. Equations (17) and (18) represent the horizontal interpolation of the G and R pixels at the positions where the input pixel is B. In Eqs. (17) to (27), all of the parameters are the corresponding pixel values of the locations in Fig. 4.
G11 R12 G13 R14 G15 R16 G17 R18
B21 G22 B23 G24 B25 G26 B27 G28
G31 R32 G33 R34 G35 R36 G37 R38
B41 G42 B43 G44 B45 G46 B47 G48
G51 R52 G53 R54 G55 R56 G57 R58
B61 G62 B63 G64 B65 G66 B67 G68
G71 R72 G73 R74 G75 R76 G77 R78
B81 G82 B83 G84 B85 G86 B87 G88
Equations (19) and (20) represent the horizontal interpolation of the G and B pixels on the R channel where the input R pixel exists, as follows:
Equations (21) to (24) represent the horizontal interpolation of the R and B pixels on the G channel where the input G pixel exists, as follows:
The original AHD repeats the calculation in the vertical direction, and it also includes an additional direction-selection process after the homogeneity map is generated from the CIELab color conversion and the epsilon parameters. The modified AHD selects the direction immediately after the interpolation of the G channel, and the R and B channels are then interpolated only once; in this way, the process of selecting a direction-based RGB pixel value is removed to reduce the amount of calculation. The G channel interpolation equation is also modified, as shown in Eqs. (25) to (27):
where H is the horizontal direction weight, V is the vertical direction weight, and abs() is the absolute value function. As Eqs. (25) to (27) show, the G channel interpolation depends on the results of the horizontal and vertical calculations of the three-tap filters. The original AHD uses iterative noise filtering. In the modified AHD, the iterative noise filtering is removed to reduce the operational load; it is also redundant because noise filtering is performed by the modified BF in the next stage.
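The idea of choosing the interpolation direction from horizontal and vertical weights can be sketched as follows. This is a generic gradient-based selection, not a reproduction of Eqs. (25) to (27): the weights are assumed here to be plain absolute differences of the neighboring G pixels, and the function name `interpolate_g` is hypothetical.

```python
def interpolate_g(up, down, left, right):
    """Interpolate a missing G value along the direction with the smaller gradient."""
    h = abs(left - right)   # assumed horizontal direction weight
    v = abs(up - down)      # assumed vertical direction weight
    if h < v:
        return (left + right) >> 1      # smoother horizontally
    if v < h:
        return (up + down) >> 1         # smoother vertically
    return (up + down + left + right) >> 2


# G at position B45 of Fig. 4, from G35 (up), G55 (down), G44 (left), G46 (right).
# A vertical edge: left and right differ strongly, up and down agree.
g45 = interpolate_g(up=100, down=104, left=10, right=200)
```

Averages use right shifts only, which matches the constraint that the chain avoids division.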
To minimize the data load for the Bayer pattern images in the modified AHD, the memory area is allocated one column larger than the original image size, and the images are read only once. By manipulating the pointer to the start position, boundary processing is not needed at the time of the RGB channel interpolation, and an ordered data loading technique is applied to the vertical filter that is used for the RGB channel interpolation.
When data are loaded from the memory and fed into a filter, some of the data loads may overlap because of the convolution operation of the filter. Pseudo codes 1 and 2 show the data loading for horizontal loading and vertical loading, respectively, for the 1 × 3 filter. As shown in pseudo code 1, the buffer size is 128 bits; that is, it holds eight 16-bit data. Once the filtering of an image line is complete, the filtering of the next image line loads a new image line (at code line 8) and reloads the two lines that had already been loaded while the previous image line was processed (code lines 6 and 7 repeat the loads of code lines 2 and 3, respectively). To prevent such overlapping data loads, the data should be loaded by row unit, and the required data should then be read only by referring to the existing buffer, as shown in pseudo code 2, where the overlapped data load of pseudo code 1 does not exist. This technique is used for the modified AHD, the modified BF, the modified LTI, and other modules in the proposed ISP. In pseudo codes 1 and 2, C1, C2, and C3 are the filter coefficients.
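The difference between reloading overlapped rows and reusing buffered rows can be sketched as follows. This is an illustration of the loading idea only and is not the original pseudo code 1 or 2: the function names and the `loads` counter are hypothetical, and a whole image row stands in for a sequence of 128-bit buffers.

```python
def vertical_filter_reload(image, c1, c2, c3):
    """Naive version: every output row loads all three input rows again."""
    loads = 0
    out = []
    for y in range(1, len(image) - 1):
        r0, r1, r2 = list(image[y - 1]), list(image[y]), list(image[y + 1])
        loads += 3
        out.append([c1 * a + c2 * b + c3 * c for a, b, c in zip(r0, r1, r2)])
    return out, loads


def vertical_filter_reuse(image, c1, c2, c3):
    """Row-reuse version: two rows stay buffered, only one new row is loaded."""
    r0, r1 = list(image[0]), list(image[1])
    loads = 2
    out = []
    for y in range(2, len(image)):
        r2 = list(image[y])          # the only new load for this output row
        loads += 1
        out.append([c1 * a + c2 * b + c3 * c for a, b, c in zip(r0, r1, r2)])
        r0, r1 = r1, r2              # shift the buffers instead of reloading
    return out, loads


image = [[r * 10 + c for c in range(8)] for r in range(6)]
out_reload, loads_reload = vertical_filter_reload(image, 1, 2, 1)
out_reuse, loads_reuse = vertical_filter_reuse(image, 1, 2, 1)
```

For this six-row example both versions produce the same output, while the number of row loads drops from 12 to 6; for tall images the ratio approaches three to one.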
The PSNR values for the modified AHD algorithm are compared with those of the conventional AHD in Table 1 . Kodak lossless true color images were modified to form the GBRG Bayer pattern images that are used for the PSNR comparison. The PSNR differences vary between −0.22 and 1.76 dB, with an average difference of 0.48 dB, while the computational load is significantly reduced.
Color correction and color space conversion
After demosaicing, the color correction block finds the color features and repairs the color artifacts; the color correction is processed by the color correction matrix, and the matrix that was used is shown in Eq. (9). The color correction equation can be calculated in conjunction with the subsequent color space conversion. In the proposed ISP, the YCoCg color space is used because it has a lower correlation among the color channels compared with other color spaces, and it requires only integer operations without any information loss. Because the color correction equation can be combined with the equation of the YCoCg color space conversion, the intermediate step that stores the color correction result can be removed, thereby reducing the memory access cycles. The equations for performing the combined color correction and the YCoCg color space conversion are shown in Eqs. (28) to (31):
where Y, Co, Cg and R, G, B are the pixel values in the YCoCg color space and the RGB color space, respectively. A color control function is also combined in the YCoCg color space conversion to control the color saturation and color offset. A coefficient integerization technique was used for the color correction.
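The benefit of folding the color correction into the color space conversion can be sketched as follows. This is a minimal illustration under assumptions: the values in `CCM_Q8` are hypothetical Q8 coefficients and not those of Eq. (9), the standard YCoCg matrix (scaled by 4 in `T4`) is assumed for the conversion, and the saturation and offset controls mentioned above are not modeled.

```python
# Hypothetical color correction matrix in Q8 fixed point (each row sums to 256 = 1.0).
CCM_Q8 = [[384, -96, -32],
          [-64, 352, -32],
          [-16, -112, 384]]

# Standard RGB-to-YCoCg matrix scaled by 4 so that all entries are integers.
T4 = [[1, 2, 1],
      [2, 0, -2],
      [-1, 2, -1]]


def matmul3(a, b):
    """3x3 integer matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def apply3(m, v, shift):
    """Matrix-vector product followed by an arithmetic right shift."""
    return [sum(m[i][k] * v[k] for k in range(3)) >> shift for i in range(3)]


# Coefficient integerization: one pre-computed matrix, Q8 * 4 -> shift by 10.
COMBINED = matmul3(T4, CCM_Q8)


def two_pass(rgb):
    """Color correction, intermediate result stored, then YCoCg conversion."""
    return apply3(T4, apply3(CCM_Q8, rgb, 8), 2)


def one_pass(rgb):
    """Combined color correction and YCoCg conversion in a single step."""
    return apply3(COMBINED, rgb, 10)


gray_pixel = one_pass((128, 128, 128))   # [128, 0, 0]
```

The single pass avoids writing and re-reading the intermediate corrected RGB values; because it truncates only once, its result can differ from the two-pass result by at most one code value per channel.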
Auto contrast
A linear stretch method is used in the AC, and the linear scale factors in the AC function were calculated in the YC o C g color space.
Equations (32) and (33) were applied to the AC that is used in this study: 
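A linear stretch of the luminance channel can be sketched as follows. This is a generic illustration and is not taken from Eqs. (32) and (33): the 8-bit output range, the Q8 scale factor, the use of the plain minimum and maximum as stretch limits, and the function name `auto_contrast` are assumptions. The single division that computes the scale factor is assumed to run once per frame.

```python
def auto_contrast(y_plane, out_max=255, frac_bits=8):
    """Stretch the Y values linearly so that [min, max] maps to about [0, out_max]."""
    lo = min(min(row) for row in y_plane)
    hi = max(max(row) for row in y_plane)
    if hi == lo:
        return [list(row) for row in y_plane]     # flat image: nothing to stretch
    scale = (out_max << frac_bits) // (hi - lo)   # fixed-point scale factor, once per frame
    return [[min(((v - lo) * scale) >> frac_bits, out_max) for v in row]
            for row in y_plane]


stretched = auto_contrast([[50, 100], [150, 200]])
```

The per-pixel work is one subtraction, one multiplication, one shift, and one saturation, all of which map onto the SIMD commands listed earlier. Truncation in the scale factor means that the brightest input maps to 254 rather than 255 in this example.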
Modified BF
The BF is used as the noise reduction algorithm [18-21]. The original BF uses the following two Gaussian filters: one for the distance weight between pixel locations and the other for the difference weight between pixel intensities. To simplify these two Gaussian filters, the Gaussian functions are replaced by fixed-point, binary threshold functions in the proposed modified BF. The threshold values are determined by pre-calculating the Gaussian filter coefficients for the pixel locations and pixel intensities.
In the proposed ISP algorithm, the BF is simplified to reduce the amount of calculation. Since the Gaussian function requires special math hardware, G_S and G_I are replaced by the binarization functions B_S and B_I. The size of the spatial domain of G_S in the proposed ISP is 7 × 7. The output of B_S for the same domain size is 1 inside a 3 × 3 area and 0 elsewhere; therefore, the 7 × 7 domain S is replaced by the new 3 × 3 domain S'. B_I, the binarization of G_I, is represented by the following:
where I_Th is the threshold value of the pixel value difference. The resulting modified BF is Eq. (35):
where the new normalization term W'_p is the following:
To further reduce the calculation complexity, the 3 × 3 filter of the domain S' was replaced by a separable filter that is composed of two 1D filters of the sizes 3 × 1 and 1 × 3; by using this separable filter, the computational complexity of the proposed BF becomes O(n) instead of the O(n^2) of the original BF [34]. When the algorithm is implemented with a 2D filter, the amount of memory access and computation on the SRP increases quadratically with the filter size. For this reason, the separable filter is applied in the proposed modified BF instead of the 2D filters of the original BF. By making the 2D filter separable, the computational load of the 2D filtering is reduced to twice that of 1D filtering. Figure 5 compares the filtering operation of a 2D filter with that of a 1D filter on the SRP. Because of the SRP structure, all of the data should be stored in buffers before a filter is applied; so, when a 3 × 3 filter mask is used, fifteen 128-bit registers are needed before the operation can start. When a 1D filter is used, on the other hand, the operation can be performed with five 128-bit registers for a horizontal filter and three 128-bit registers for a vertical filter; therefore, the use of a separable filter also makes it possible to reduce the amount of memory access.
In Fig. 5, a square box represents a single pixel of 16 bits, and a buffer holds the data of eight pixels. In addition, the filter size has been modified from 7 × 7 to 3 × 1 and 1 × 3, and a vertical data loading technique is applied to the 1 × 3 vertical filter.
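The binarized, separable BF described above can be sketched as follows. This is a minimal illustration under assumptions: the intensity test is taken to be "absolute difference not larger than the threshold", border pixels are copied unchanged, the normalization uses an integer division by the number of accepted neighbors (how the SRP avoids this division is not modeled), and the function names are hypothetical.

```python
def bf_1d(line, i_th):
    """Three-tap bilateral pass with binary intensity weights."""
    out = list(line)
    for p in range(1, len(line) - 1):
        total = count = 0
        for q in (p - 1, p, p + 1):
            if abs(line[q] - line[p]) <= i_th:   # binary weight: 1 if similar, else 0
                total += line[q]
                count += 1
        out[p] = total // count                  # count >= 1 because q = p always passes
    return out


def modified_bf(image, i_th):
    """Separable filter: horizontal three-tap pass, then vertical three-tap pass."""
    rows = [bf_1d(row, i_th) for row in image]
    cols = [bf_1d(list(col), i_th) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


# A noisy step edge: small fluctuations on both sides of a jump from about 10 to about 100.
noisy = [[10, 12, 11, 100, 102, 101] for _ in range(4)]
filtered = modified_bf(noisy, i_th=5)
```

In this example the fluctuations on each side of the edge are flattened while the edge itself stays sharp, because pixels across the edge fail the threshold test and receive a weight of 0.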
Detail enhancement
For detail enhancement, an LTI based on the difference of Gaussian [27] is used. The Gaussian masks in the LTI are of the sizes 3 × 3 and 5 × 5 and use pre-calculated coefficients. Since the CTI rarely affects the image quality, a simple 1 × 3 Laplacian sharpening filter is configured for the CTI. Separable filters are implemented for the LTI, and the filter size is adjusted. As the filter that is used here is also a vertical filter, the vertical data loading technique was applied.
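A difference-of-Gaussian edge enhancement along one line can be sketched as follows. This is a generic illustration and not the LTI of the proposed ISP: the binomial taps `G3` and `G5` stand in for the pre-calculated Gaussian coefficients, the form "input plus gain times the difference of the two blurred signals" is an assumption about Eq. (8), and the border replication, the `gain` parameter, and the function names are hypothetical.

```python
G3 = [1, 2, 1]          # narrow kernel, sum 4  -> right shift by 2
G5 = [1, 4, 6, 4, 1]    # wide kernel,   sum 16 -> right shift by 4


def conv1d(line, taps, shift):
    """1D convolution with replicated borders and a shift for normalization."""
    radius = len(taps) // 2
    n = len(line)
    out = []
    for i in range(n):
        acc = 0
        for k, t in enumerate(taps):
            j = min(max(i + k - radius, 0), n - 1)   # clamp the index at the borders
            acc += t * line[j]
        out.append(acc >> shift)
    return out


def lti_1d(line, gain=1, max_val=255):
    """Add the difference of the two blurred signals back to the input."""
    narrow = conv1d(line, G3, 2)
    wide = conv1d(line, G5, 4)
    return [min(max(v + gain * (a - b), 0), max_val)
            for v, a, b in zip(line, narrow, wide)]


edge = lti_1d([0, 0, 0, 0, 100, 100, 100, 100])
flat = lti_1d([50] * 8)
```

Flat regions pass through unchanged, while the bright side of the step receives a small overshoot that makes the transition appear steeper; the dark side is clipped at 0 in this example.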
Gamma correction
For GC, the lookup table method or a piecewise linear interpolation method [29] is commonly used. In the lookup table method, the input data is used as the index of the table, while in the piecewise linear interpolation method, the input range and the linear interpolation parameters are looked up from a table. In this study, instead of the lookup table, which is difficult to parallelize because of its large volume of irregular memory accesses, a quadratic approximation of the GC equation that utilizes the 128-bit data processing of the SRP was used. Equation (37) is used for the GC:
where k_1, k_2, and k_3 are the GC coefficients, x is the pixel value of the RGB channel, and y is the GC result value. The parameters are determined so that the squared error is minimized over most of the central region. Since GC is performed for all three of the RGB channels, the algorithm was rearranged to use the results of the YCoCg-to-RGB color space conversion.
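How the coefficients of a quadratic GC approximation could be obtained by least squares is sketched below. This is an illustration under assumptions and does not reproduce the coefficients of the proposed ISP: the ordering y = k1·x² + k2·x + k3, the exponent 1/2.2, the normalized input range, the choice of the central region (inputs 26 to 229 out of 255), and the function name `fit_quadratic` are all hypothetical.

```python
def fit_quadratic(xs, ys):
    """Least-squares fit of y = k1*x^2 + k2*x + k3 via the 3x3 normal equations."""
    s = [sum(x ** p for x in xs) for p in range(5)]                   # sums of x^0 .. x^4
    t = [sum(y * x ** p for x, y in zip(xs, ys)) for p in range(3)]   # sums of y*x^0 .. y*x^2
    a = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for i in range(3):                       # Gauss-Jordan elimination with pivoting
        piv = max(range(i, 3), key=lambda r: abs(a[r][i]))
        a[i], a[piv] = a[piv], a[i]
        a[i] = [v / a[i][i] for v in a[i]]
        for r in range(3):
            if r != i:
                a[r] = [vr - a[r][i] * vi for vr, vi in zip(a[r], a[i])]
    return a[0][3], a[1][3], a[2][3]


# Fit over an assumed central region of the normalized input range.
xs = [i / 255 for i in range(26, 230)]
ys = [x ** (1 / 2.2) for x in xs]
k1, k2, k3 = fit_quadratic(xs, ys)
max_err = max(abs(k1 * x * x + k2 * x + k3 - y) for x, y in zip(xs, ys))
```

Once the coefficients are converted to fixed point, evaluating the polynomial needs only multiplications, additions, and shifts per pixel, with no table access, which is the property that makes it suitable for SIMD execution.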
Experiment results
To verify the performance of the proposed ISP full chain, the quality of the result images should first pass a commercially available image quality test such as Skype [35]. The experiments were conducted using a CMOS image sensor with the specification that is shown in Table 2. Figures 6, 7, and 8 show parts of the test patterns. Figure 6 is the image quality resolution test pattern, which is used to measure the clearness of luminance images. Figure 7 is for the evaluation of color performance, and Fig. 8 is for the verification of texture acuity. Other patterns exist for the measurement of aspects such as exposure error, gamma, SNR, and dynamic range. Table 3 shows the image quality results for the proposed ISP full chain that was implemented on the SRP; as shown in Table 3, all of the measured values meet the requirements of the test. Since the entire proposed ISP chain has been designed with only fixed-point addition and multiplication, it can be easily ported onto any other microprocessor; furthermore, even when the characteristics of a CMOS image sensor change, it is still possible to meet the image quality evaluation standards by simply adjusting the coefficients that are used for the ISP full chain.
The performance goal of the proposed ISP is the processing of full HD image sequences (1920 × 1080, 30 frames per second) with a 600-MHz SRP. Table 4 shows the number of clock cycles that were taken by the modules in the proposed ISP full chain. The number of cycles for sequential processing is the number of cycles that are taken without the use of any SIMD operations, and the number of cycles for parallel processing is the number of cycles that are taken with the SIMD operations of eight processing elements. The parallelizing speedup of a factor of 4.9 is obtained by dividing the total sequential-cycle number by the total parallel-cycle number. The degree of parallelism can be found by using Amdahl's law of Eq. (38), as follows:
$$\text{Speedup} = \frac{T_{s\_old} + T_{p\_old}}{T_{s\_new} + T_{p\_new}} \tag{38}$$
where T_s_old is the time taken by the sequential operations that are not affected by parallelization; T_p_old is the time taken by the sequential operations that are affected by parallelization; T_s_new is the time taken by the sequential operations that are not affected by parallelization after improvement; and T_p_new is the time taken by the parallel operations after improvement. Since the sequential parts are not affected by parallelization, their processing time does not change after improvement, so T_s_old = T_s_new = T_s. In the proposed ISP, the parallelization is performed by the SIMD with eight processing elements, so T_p_new = T_p_old / 8. If T_p_new is assumed to be 1, Eq. (38) becomes Eq. (39):
$$\text{Speedup} = \frac{T_s + 8}{T_s + 1} \tag{39}$$
Solving Eq. (39) for the sequential time gives T_s = T_s_old = 0.81. Since the total execution time before parallelization is T_s_old + T_p_old = T_s + 8 × T_p_new = 8.81, the time that is not affected by parallelization is only 9 % of the total sequential execution time, whereby 91 % of the total sequential time is parallelized by the eight processing elements.
Since the resolution of the CMOS image sensor that is used in the experiment is larger than that of the target performance, the conversion to obtain the performance for the full HD image sequences is shown in Eq. (40), as follows:
where C_s is the number of cycles per second, CPP is the number of cycles per pixel, Res is the target resolution, C_f is the number of simulation cycles, and TP is the number of test image pixels. Since the input resolution of the test camera is 2624 × 1956, the total number of cycles needed to handle an image of 1920 × 1080 resolution was recalculated; the SRP simulation result satisfies the real-time operation target for the 600-MHz SRP.
Table 5 shows the performances of the proposed ISP on other platforms in cycles per pixel. To compare the parallelization performance of the proposed ISP algorithm, widely used commercial processor platforms were used to run the proposed ISP full chain. For the test platforms, general-purpose desktop processors of the Intel processor family, a general-purpose mobile processor of the ARM Cortex family, and a signal-processing VLIW processor of the TI DSP family were chosen; for the TI platform and the SRP platform, the simulators that were provided by the manufacturers were used in the experiments. Since each platform has a different operating frequency, the cycles per pixel were calculated for the purpose of comparison. The proposed ISP full chain was compiled for a single processor because the communication overhead of multiple threads can reduce the efficiency of the parallel operations.
Fig. 6 The image quality resolution test pattern
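The Amdahl's law estimate and the real-time target discussed above can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, the rounded speedup of 4.9 is used as the input, and the per-pixel cycle budget is a derived quantity that is not quoted in the text.

```python
def sequential_time(speedup, n_pe):
    """Solve speedup = (Ts + n_pe) / (Ts + 1) for Ts, with T_p_new normalized to 1."""
    return (n_pe - speedup) / (speedup - 1)


def parallel_fraction(speedup, n_pe):
    """Share of the original sequential time that runs on the n_pe processing elements."""
    ts = sequential_time(speedup, n_pe)
    return n_pe / (ts + n_pe)


def cycle_budget_per_pixel(clock_hz, width, height, fps):
    """Cycles available per pixel when every frame must finish in real time."""
    return clock_hz / (width * height * fps)


ts = sequential_time(4.9, 8)                               # about 0.79
fraction = parallel_fraction(4.9, 8)                       # about 0.91
budget = cycle_budget_per_pixel(600e6, 1920, 1080, 30)     # about 9.6 cycles per pixel
```

With the rounded speedup of 4.9 the sequential time comes out near 0.79 rather than the 0.81 stated above, which would be consistent with the stated value having been derived from the unrounded cycle counts; the parallelized share is about 91 % in both cases.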
All of the platforms provide multiple-issue pipelines and SIMD-style multimedia instructions [36] [37] [38], and the compiler optimization options were disabled for the purpose of comparison because the SRP compiler does not have optimization options. For faster porting, the cycle-accurate simulators for the TI C64x+ and the SRP were used. The operating frequencies of the commercial TI C64x+ processors are between 500 and 1200 MHz; the SRP, the target platform of the proposed algorithm, was designed to run at 600 MHz.
The results in Table 6 show the cycles-per-pixel values obtained when the compiler optimization options are used with the proposed ISP full chain. The SRP compiler does not provide optimization options. For the Intel and ARM platforms, the GCC compiler was used: option O1 enables branch, register, and tree optimizations; option O2 adds alignment, local, and global optimizations; and option O3 enables all of the O1 and O2 optimizations along with parallelizing optimizations such as loop unrolling and loop vectorization. Although option O3 does not guarantee a speedup over option O2 [39], applying option O3 to the proposed ISP full chain achieves a higher speedup because of the algorithm's inherently high degree of parallelism.
The TI compiler also provides optimization options. Option O1 performs register-usage optimization; option O2 performs global optimization, including parallelizing optimizations such as software pipelining, loop optimization, and loop unrolling; and option O3 performs optimizations related to inlining calls to small functions and reordering function declarations. Again, using option O2 with the proposed ISP full chain achieves a higher speedup than the other options because of the higher degree of parallelism.
Since the SRP can process eight 16-bit operations in parallel with a single SIMD instruction, the SRP outperforms the fastest commercial platform, the i7, by 3.36 times for the fully optimized version in Table 6. Although the Intel platforms have an issue width of 4 and 4 × 16-bit SIMD instructions, the inefficiency of dynamic scheduling and a lower memory bandwidth limit how far the parallelism of the proposed algorithm can be exploited; even so, the parallelism of the proposed algorithm also works on the Intel platforms. With respect to the ARM platform, its issue width is half that of the Intel platforms and its memory bandwidth is even lower, so its cycles-per-pixel value is 4.07 times higher than those of the Intel platforms. The TI platform also has two issue pipelines, with more operation slots available for control operations; however, because the proposed algorithm is designed for data parallelism, the performance gain over the ARM platform is marginal.
Conclusions
In this study, a parallel version of the ISP full chain is proposed and implemented on an SRP architecture with eight-wide SIMD instructions. The proposed ISP full chain is written in C for portability, and its image quality was verified with a commercially available test suite. The proposed algorithm was modified to reduce the computational load and to make its parallelism easy to exploit. A variety of optimization techniques were also applied to make the algorithm suitable for a SIMD-style architecture. The experimental results satisfy both the image quality standard and the real-time operation speed for a 600-MHz SRP with full HD image sequences, and the implementation utilizes approximately five of the eight operation slots in the SIMD instruction of the SRP. The parallelism of the proposed algorithm was also tested in a comparison with other commercial platforms, and the results show that it can be easily exploited.
