Abstract-We develop a novel SAD circuit for power-efficient H.264 encoding, namely a-SAD. In this circuit, several of the highest-order MSB's are approximated by a single MSB. Our theoretical estimations show that the proposed design simultaneously improves the performance and power of the SAD circuit, achieving good power efficiency. We determine that the optimal number of approximated MSB's is four under the 8-bit YUV-420 format, which is the largest number that does not affect video quality and compression rate in our video experiments. In logic simulations, our a-SAD circuit shows at least 9.3% smaller critical-path delay than existing SAD circuits. We compare power dissipation under an iso-throughput scenario, where our a-SAD circuit obtains at least 11.6% power saving compared to other designs. We perform the same simulations under two- and three-stage pipelined architectures. Here, our a-SAD circuit delivers significant performance (by 13%) and power (by 17% and 15.8% for two and three stages, respectively) improvements.
I. INTRODUCTION
MOTION ESTIMATION (ME) is one of the most critical parts of the video encoding process because of its very large computational complexity. To conduct ME, we need to find the best matching macro block (MB) within a given searching window range (SWR), which is called block matching [1]. Here, the sum of absolute differences (SAD) has been the most widely used metric to determine the best matching MB [2]. Note that for a single block matching, the SAD calculation must be repeated for all MB's within the given SWR. Previous studies show that the block matching operations occupy a substantial portion of the total ME computation (mostly 50 to 90%) [3]. Table 1, which lists the computation time of video encoding in the JM reference software, further shows the importance of SAD calculation: the SAD computation time occupies a substantial portion of the total encoding time. These observations imply that the power and performance of SAD calculation strongly affect those of video encoding hardware. In this work, we aim to design an SAD circuit for power-efficient real-time video encoding. We simultaneously consider the power and performance of SAD circuits since a performance gain can be converted to power saving under an iso-throughput condition (by properly scaling down the supply voltage) [4].
Many researchers have developed novel SAD circuits to improve power or performance. For instance, in [5] , V. Gupta et al. invented approximate adders to design a low power SAD calculator. This work obtains significant power saving by reducing circuit complexity at transistor level. They applied their approximate adders for calculating least significant bits (LSB's) of SAD, minimally affecting video output quality. However, under this design we may fail to find the best matching MB, degrading compression rate. Hence, this design may not be suitable for high resolution video formats.
In [6], J. Vanne et al. developed a new SAD circuit based on 3:2 compression units. The 3:2 compression unit is implemented as a carry-save adder (CSA). Their goal is to enhance the performance of the SAD calculator, and hence they do not consider power efficiency. In [7], H. Kaul et al. designed a low-power SAD calculator in which they adaptively scale the supply voltage (VDD) according to workload. To allow more aggressive voltage scaling, they presented performance improvement techniques such as speculative difference computation and 4:2 compression units. Both works use compression units to avoid long carry-propagation delay, providing some performance gain.
In this work, we present a power-efficient SAD circuit by exploiting the concept of approximate computing [8]. However, unlike the design of [5], our design does not degrade video output quality and compression rate despite the approximation. In block matching, the best matching MB shows the minimum SAD within a given SWR. To find the best matching MB, previous works [6, 7, 9-11] fully estimated the SAD's corresponding to all MB's within the given SWR. However, some MB's have much larger SAD values than the best matching MB. For these MB's, accurate SAD estimation may not be necessary. We exploit this fact in our approximate-SAD (a-SAD) circuit design.
The major contributions of this work can be summarized as follows:
· We show that the SAD circuit based on ripple-carry adders (RCA's) provides lower power and almost comparable performance compared to the CSA-based one of [6].
· We propose an approximate-SAD circuit based on RCA's, where we approximate several of the highest-order MSB outputs by a single MSB. The approximation reduces the number of logic gates, thereby achieving power saving. In addition, this scheme mitigates the long carry-propagation delay of the RCA-based SAD circuit, improving performance. Together, these effects significantly reduce power dissipation under iso-throughput conditions.
· Through extensive video experiments, we determine the optimal number of approximated MSB's. The results show that under 4-MSB's approximation, our design hardly affects video output quality and compression rate compared to the original SAD calculator.
The remainder of this paper is organized as follows. Section II theoretically compares the power and performance of existing SAD circuits. In Section III, we present our a-SAD circuit. Simulation results are provided in Section IV. Section V concludes our study.
II. THEORETICAL COMPARISONS OF EXISTING SAD CIRCUITS
In this section, we theoretically analyze and compare the power and performance of several existing SAD circuits. In H.264, the 4×4 MB is employed as the basic MB of the video compression standard [12]. In such a case, to obtain the SAD of a single MB, we have to compute the absolute differences (AD's) of 16 pixel pairs and then accumulate these results. Lastly, in the minimum SAD decision part, the output of the accumulation is compared to the previous minimum SAD of the current SWR. Throughout this work, we consider that all SAD circuits have the fully-unfolded architecture of Fig. 1 for high throughput.
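The three steps above (absolute differences, accumulation, minimum SAD decision) can be summarized in software as follows. This is only a behavioral sketch in Python; the function names `sad_4x4` and `best_match` are illustrative and do not come from the paper or from the JM reference software.

```python
def sad_4x4(cur, ref):
    """Sum of absolute differences over the 16 pixel pairs of two 4x4 blocks."""
    return sum(abs(c - r)
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))


def best_match(cur, candidates):
    """Minimum SAD decision: keep the candidate with the smallest SAD so far."""
    best_idx, best_sad = None, None
    for idx, ref in enumerate(candidates):
        s = sad_4x4(cur, ref)
        if best_sad is None or s < best_sad:
            best_idx, best_sad = idx, s
    return best_idx, best_sad


def flat_block(value):
    """Helper for the example: a 4x4 block filled with one 8-bit pixel value."""
    return [[value] * 4 for _ in range(4)]


current = flat_block(10)
window = [flat_block(12), flat_block(10), flat_block(255)]
print(best_match(current, window))
```

With 8-bit pixels, the largest possible SAD of a 4×4 block is 16 × 255 = 4080, which fits in 12 bits.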
To simplify our theoretical analysis, we make the following assumptions. Firstly, all logic gates are combinations of only three basic gates: 2-input NAND gates, 2-input NOR gates and inverters. Secondly, all basic gates have unit time delay. Thirdly, due to conventional two-to-one sizing, the areas of the NAND gate, the NOR gate and the inverter are 8, 10 and 3 area units, respectively. Lastly, we do not consider the effects of fan-out, fan-in and interconnection for simplicity.
These assumptions may result in discrepancies with real environments. Nonetheless, our theoretical analysis provides insight into the characteristics of existing SAD circuits. For accurate comparison, we perform logic simulations, which are discussed in Section IV.
Power Comparison
We estimate and compare the power of three existing SAD circuits: an RCA-based design and the designs of [6] and [7]. The design of [5] is not considered as a comparison target since it affects video output quality and compression efficiency. Since it is difficult to directly estimate power dissipation, we employ an indirect method. Under the assumption that the switching activity, supply voltage and operating frequency of all logic gates are the same, dynamic power is determined by load capacitance, which is highly correlated with area. Hence, we use the areas of the above three designs as a proxy for their power. Fig. 2 shows the area comparison of the three circuits. Here, the circuit of [7] shows significantly larger area than the other two designs, due to the parallel structure of its speculative difference computation. Hence, it is highly probable that the circuit of [7] dissipates more power than the other ones at nominal voltage. The RCA-based SAD circuit shows the smallest area and hence is likely to deliver the lowest power dissipation.
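As a minimal sketch of this area-based proxy, the following Python snippet adds up area units from gate counts, using the unit areas assumed in this section (NAND 8, NOR 10, inverter 3). The gate-level decompositions in the example (an XOR built from four NAND gates, an AND built from a NAND plus an inverter) are common textbook structures chosen for illustration; they are assumptions and are not necessarily the structures of Fig. 3(b).

```python
# Area units of the three basic gates, as assumed in this section.
AREA_UNITS = {"nand2": 8, "nor2": 10, "inv": 3}


def area(gate_counts):
    """Total area of a block described as {gate_name: count}."""
    return sum(AREA_UNITS[gate] * count for gate, count in gate_counts.items())


# Hypothetical decompositions, used only to illustrate the bookkeeping.
XOR2 = {"nand2": 4}            # four-NAND XOR
AND2 = {"nand2": 1, "inv": 1}  # NAND followed by an inverter

half_adder_area = area(XOR2) + area(AND2)
print(area(XOR2), area(AND2), half_adder_area)
```

Under the equal-activity, equal-voltage and equal-frequency assumption stated above, a block with a larger total is expected to dissipate more dynamic power.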
Performance Analysis
We theoretically analyze the critical-path delays of the RCA-based circuit and the design of [6]. The design of [7] is excluded from this analysis due to its large power dissipation. Fig. 3(a) and Fig. 4 show the RCA-based design and the design of [6], where S1, S2, S3 and S4 of Fig. 3 correspond to the S1, S2, S3 and S4 summations of Fig. 1, respectively. As mentioned in Section I, the design of [6] is based on CSA. Both RCA and CSA consist of full adders (FA's) or half adders (HA's), which are shown in Fig. 3(b) and Fig. 4. Based on the unit-delay assumption for the basic gates, we derive the delay times of the two designs; the results are annotated on each module in Fig. 3(a) and Fig. 4. In our analysis, we consider that the AD parts are implemented as Sklansky adders (SKA's), because the SKA is the fastest carry-look-ahead adder under our assumption of ignoring interconnection effects. Under such a scheme, our estimation shows that the critical-path delay of the AD's is 16 time units. In the HA circuit (described in Fig. 3(b)), the delays from input signals "a" and "b" to "c_out" and "s" are two and three time units, respectively. Hence, in the RCA-based system of Fig. 3(a), the output signals (c_out and s) of the HA of S1 settle after 18 and 19 time units, and these numbers are annotated at the output of the HA element in S1. In the FA circuit (described in Fig. 3(b)), the delays from input signals "a" and "b" to "s" and "c_out" are six and four time units, while the delays from "c_in" to "s" and "c_out" are three and two time units, respectively. Using these delays, we can easily derive the timing of the other elements shown in Fig. 3(a).
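The way these cell delays propagate can be sketched with a small arrival-time model. The Python snippet below is an illustration only: the FA delays are the ones quoted above, and the HA delays are assigned so that c_out and s of the S1 half adder settle at 18 and 19 time units, as stated above. The function names are hypothetical.

```python
# Cell delays in unit gate delays.
HA_S, HA_C = 3, 2          # HA: a/b -> s, a/b -> c_out (gives 19 and 18 after the AD)
FA_AB_S, FA_AB_C = 6, 4    # FA: a/b -> s, a/b -> c_out
FA_CIN_S, FA_CIN_C = 3, 2  # FA: c_in -> s, c_in -> c_out

AD_DELAY = 16              # critical-path delay of the AD part (SKA-based)


def half_adder_times(t_a, t_b):
    """Settling times (s, c_out) of an HA whose inputs settle at t_a and t_b."""
    t_in = max(t_a, t_b)
    return t_in + HA_S, t_in + HA_C


def full_adder_times(t_a, t_b, t_cin):
    """Settling times (s, c_out) of an FA; each output waits for its slowest path."""
    t_ab = max(t_a, t_b)
    t_s = max(t_ab + FA_AB_S, t_cin + FA_CIN_S)
    t_c = max(t_ab + FA_AB_C, t_cin + FA_CIN_C)
    return t_s, t_c


# Bit 0 of S1 is an HA fed directly by two AD outputs.
s0, c0 = half_adder_times(AD_DELAY, AD_DELAY)
# Bit 1 of S1 is an FA fed by two AD outputs and the carry of bit 0.
s1, c1 = full_adder_times(AD_DELAY, AD_DELAY, c0)
print(s0, c0, s1, c1)
```

Repeating `full_adder_times` along the carry chain gives the settling time of every remaining bit of an RCA under the same unit-delay assumptions.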
It should be noted that the authors of [6] further improve the AD part by conducting the correction-bit additions of two's complement concurrently with the accumulation tree. As a result, the AD part of [6] (15 time units) has a smaller critical-path delay than that of the RCA-based one (16 time units), as shown in Fig. 4. Then, we estimate the timing delays of the other elements. As shown in Fig. 3(a), the minimum SAD decision part of the RCA-based design is implemented as an RCA, while an SKA is used for that of [6]. Our extensive studies show that these architectures deliver the highest performance for each design.
The above analysis shows that the critical-path delays of the two designs are 64 (for the RCA-based design) and 65 (for the design of [6]) time units, as shown in Fig. 3(a) and Fig. 4. The RCA-based one delivers performance comparable to the CSA-based one of [6]. This counterintuitive result is due to the following three factors. Firstly, under the fully-unfolded architecture, the accumulation tree of [6] has six levels while that of the RCA-based one consists of four levels. Secondly, in the CSA-based one, the levels of the accumulation tree are computed sequentially. In contrast, as shown in Fig. 3(a), the carry propagations of the S1, S2, S3 and S4 levels are conducted almost concurrently, alleviating the long carry-propagation delay problem of RCA's. Lastly, in the CSA-based one, the minimum SAD decision can be computed only after completing the last compression of the accumulation tree. In the RCA-based one, however, the minimum SAD decision is also computed concurrently with the accumulation, further mitigating the long carry-propagation delay problem of RCA's.
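The second factor, overlapping carry propagation across tree levels, can be illustrated with a small arrival-time model. The Python sketch below chains four RCA levels and reports the worst-case settling time after each one. It assumes the FA delays quoted in this section, HA delays of 3 (s) and 2 (c_out) units, a 16-unit AD part, and that both operands of each adder have identical per-bit arrival times. It is a simplified model and is not read from Fig. 3(a); in particular it excludes the minimum SAD decision part.

```python
HA_S, HA_C = 3, 2          # HA: a/b -> s, a/b -> c_out
FA_AB_S, FA_AB_C = 6, 4    # FA: a/b -> s, a/b -> c_out
FA_CIN_S, FA_CIN_C = 3, 2  # FA: c_in -> s, c_in -> c_out


def rca_times(in_times):
    """Per-bit settling times of an RCA output (LSB first, carry-out last).

    Both operands are assumed to share the per-bit arrival times in in_times.
    Bit 0 uses an HA; every other bit uses an FA on the ripple-carry chain.
    """
    out = [in_times[0] + HA_S]
    carry = in_times[0] + HA_C
    for t_ab in in_times[1:]:
        out.append(max(t_ab + FA_AB_S, carry + FA_CIN_S))
        carry = max(t_ab + FA_AB_C, carry + FA_CIN_C)
    out.append(carry)  # the carry-out becomes the new MSB
    return out


def tree_level_delays(ad_delay=16, width=8, levels=4):
    """Worst-case settling time after each accumulation level (S1, S2, ...)."""
    times = [ad_delay] * width
    worst = []
    for _ in range(levels):
        times = rca_times(times)
        worst.append(max(times))
    return worst


print(tree_level_delays())
```

In this model the first level settles 17 units after the AD outputs, while each further level adds only 8 units, because the low-order bits of a level start rippling before the high-order bits of the previous level have settled.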
Our analysis of the existing designs indicates that the RCA-based SAD has lower power dissipation than the CSA-based one, while their performances are almost the same. In this work, we present an approximation scheme for the RCA-based one that simultaneously improves performance and power. In spite of our approximation, video compression rate and output quality are hardly affected, as discussed in the following section. Ultimately, the approximated SAD circuit enhances power efficiency compared to existing designs, without degrading compression efficiency and output quality.
III. APPROXIMATE-SAD CIRCUIT DESIGN
The discussions of Section II motivate our work. The RCA-based SAD circuit has smaller area than the other existing ones. In addition, under the fully-unfolded architecture of the 4×4 MB, the RCA-based SAD circuit shows delay almost comparable to the circuit of [6]. Our proposed technique significantly reduces the area and carry-propagation delay of the RCA-based one, improving power efficiency compared to existing designs.
Video Experiments to Determine Optimal Number of Approximation Target MSB's
Our a-SAD circuit is based on the RCA-based circuit. As shown in Fig. 5, we implement the proposed approximation circuit by OR-gating the input and carry-in signals. Then, we remove the full adders that compute the original MSB's. In such a scheme, the output of the MSB-approximation circuit is disabled only when all input and carry-in signals are disabled. This scheme allows us to reduce the number of logic gates and the carry-propagation delay at the same time. As we increase the number of approximation target MSB's, the power and performance improvements grow. However, this lowers the possible output range of the approximated SAD circuit. When the minimum SAD of the given SWR is larger than the maximum value that the SAD circuit can represent, we may fail to find the minimum SAD. This may affect video compression rate and video output quality.
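The effect of this approximation on the SAD value can be sketched with a simple behavioral model. The Python function below is an assumption about the circuit's input-output behavior, not a gate-level description of Fig. 5: it keeps the low-order bits of an exact 12-bit SAD unchanged and replaces the top `n_msb` bits by their logical OR. The name `approximate_msbs` is illustrative.

```python
def approximate_msbs(value, n_msb, width=12):
    """Collapse the n_msb highest-order bits of a width-bit value into one bit.

    The collapsed bit is 1 unless all of the original MSB's are 0;
    the remaining low-order bits are kept exactly.
    """
    keep = width - n_msb
    low = value & ((1 << keep) - 1)
    sticky = 1 if (value >> keep) else 0
    return (sticky << keep) | low


# Small SAD's are represented exactly under 4-MSB approximation.
print(approximate_msbs(0x0FF, 4))
# Large SAD's lose their ordering: 767 < 768, but the approximated values invert.
print(approximate_msbs(0x2FF, 4), approximate_msbs(0x300, 4))
```

In this model, a 4-MSB approximation of a 12-bit SAD preserves the ordering of all candidates whose SAD is below 256, so the best match is found whenever the minimum SAD of the SWR lies in that range. Candidates above it can be ranked incorrectly, which is the failure case described above.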
Hence, we consider the optimal number of approximation target MSB's to be the largest one that does not affect video compression rate and video output quality. To find this number, we derive BD-Bitrate (BD-BR) and BD-PSNR [13] by using the JM reference software [14] under the recommended test conditions [15]. We customize the SAD computation function of the JM reference software to survey the influence of MSB-approximation on video compression rate and video output quality. We conducted our experiments on several test video sequences with different characteristics and resolutions. The sample videos are selected to have characteristics frequently found in real-life videos, such as fast random motion, scene changes with fading, and scenes with objects transitioning from blurred to non-blurred [15]. We conducted these experiments under the Full Search (FS), Fast Full Search (FFS) and UMHexagon Search (UHS) algorithms [2]. Under the 4×4 MB, the SAD value has 12 bits of information, as shown in Fig. 1. We performed the video experiments for the approximation of 3-, 4- and 5-MSB's under the 8-bit YUV-420 format. Table 2 shows our experimental results, where our approximation scheme hardly affects BD-PSNR. The worst video sequence is "Tractor", where 5-MSB approximation results in a 4.21 dB BD-PSNR drop. However, in the video images it is difficult to recognize real quality degradation, as shown in Fig. 7. On the other hand, BD-BR is significantly influenced by our approximation, and the effect depends on the searching algorithm used. We observe that the increase of BD-BR tends to become smaller from FS mode to FFS and UHS modes. The lower sensitivity of FFS and UHS to our approximation results from their lower block matching accuracy, which decreases in this order. The FS algorithm always finds the minimum SAD block, but requires large computational overhead. Compared to FS, FFS and UHS obtain lower computational complexity by sacrificing the accuracy of the block matching.
Our approximation scheme also degrades the accuracy of block matching to a certain degree, but the corresponding effect is smaller in FFS and UHS than in FS. In Fig. 8, we plot the video simulation data of the FS algorithm together with the delay improvement rate with respect to the number of approximated MSB's. Here, the computation delay decreases linearly as the number of approximated MSB's increases, while the corresponding BD-BR's tend to increase exponentially. Up to the approximation of 4-MSB's, BD-BR shows a small increment, lower than 8%. However, from the approximation of 5-MSB's, we find considerable degradation of BD-BR, 39.8% in the worst case. This implies that from this point the efficiency of our approximation scheme deteriorates. Hence, we conclude that the optimal number of approximation target MSB's is four.
Performance and Power Analysis of Our Proposed a-SAD Circuit
The discussion in Section III.A shows that the optimal number of approximation target MSB's is four. Hence, we design our a-SAD circuit by applying the 4-MSB's approximation technique. Fig. 5(a) shows the 4-MSB's approximated a-SAD circuit, where we approximate the four highest-order MSB's of the original SAD by a single MSB. The approximate MSB is enabled whenever at least one of the original MSB's is enabled. Such a scheme can be easily implemented by OR-gating carry-in signals as shown in Fig. 5(b). In the proposed a-SAD circuit, we modify the minimum SAD decision part accordingly, as shown in Fig. 5(a).
In the same way as in Section II, we theoretically compare the area and the critical-path delay of our a-SAD circuit with 3-, 4- and 5-MSB's approximation, as shown in Fig. 6. Compared to the original RCA-based design (No Approx. in Fig. 6), our a-SAD circuit with 4-MSB approximation shows 3.2% area and 9.4% delay reductions in the 4×4 MB scenario. Under an iso-throughput condition, the delay reduction can be translated to power improvement, as mentioned in Section I. Considering this, the 4-MSB approximation is expected to significantly improve power efficiency. The 5-MSB approximation provides even larger area (by 6.8%) and delay (by 12.5%) reductions. However, such an aggressive approximation may result in some degradation of compression rate or video quality under the FS algorithm, as discussed in Section III.A.
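To give a rough feel for how a delay reduction can translate into power saving under an iso-throughput condition, the following Python sketch uses the widely used alpha-power-law delay model together with the quadratic dependence of dynamic power on supply voltage. The nominal voltage, threshold voltage and alpha exponent are hypothetical values chosen for illustration only. This sketch is not the method used for the simulation results in Section IV, which come from synthesis under a common clock period.

```python
def gate_delay(vdd, vth=0.4, alpha=1.3):
    """Relative gate delay under the alpha-power law: vdd / (vdd - vth)**alpha."""
    return vdd / (vdd - vth) ** alpha


def iso_throughput_vdd(delay_reduction, vdd_nominal=1.2, vth=0.4, alpha=1.3):
    """Lowest supply voltage at which the faster circuit still meets the
    original delay. All electrical parameters are hypothetical."""
    # The faster circuit may be slowed down by a factor 1 / (1 - delay_reduction).
    allowed = gate_delay(vdd_nominal, vth, alpha) / (1.0 - delay_reduction)
    lo, hi = vth + 1e-3, vdd_nominal
    for _ in range(60):  # bisection; delay decreases monotonically with vdd
        mid = (lo + hi) / 2.0
        if gate_delay(mid, vth, alpha) > allowed:
            lo = mid
        else:
            hi = mid
    return hi


def dynamic_power_ratio(vdd_scaled, vdd_nominal=1.2):
    """Dynamic power ratio at equal frequency, capacitance and activity."""
    return (vdd_scaled / vdd_nominal) ** 2


v_scaled = iso_throughput_vdd(0.094)  # 9.4% delay reduction of the 4-MSB case
print(v_scaled, dynamic_power_ratio(v_scaled))
```

The area reduction lowers the switched capacitance as well, so the two effects would combine in the same direction.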
IV. SIMULATION RESULTS
We verify the efficacy of our a-SAD circuit through logic simulations. Unlike our theoretical analysis, these simulations include the effects of fan-in, fan-out and interconnection, providing accurate power and performance comparisons. We compare three SAD designs: the original RCA-based design of Fig. 3, the CSA-based design of Fig. 4, and our a-SAD circuit of Fig. 5. We synthesize these three SAD circuits by using standard cell libraries of 130 nm CMOS. Then, the delay and power of these designs are simulated with Synopsys PrimeTime and Design Compiler. We synthesize and compare the SAD circuits under both non-pipelined and pipelined architectures.
Non-pipelined Architecture
We simulated the critical-path delay of each design at its minimum clock period; the results are shown in the second column of Table 3. Here, our a-SAD circuit shows the highest performance. Compared to the original RCA-based one, our a-SAD circuit shows 9.3% smaller delay. The original RCA-based one delivers higher performance than the design of [6]. These results agree well with our theoretical estimations.
In the power comparisons, we assume that all designs have the same clock period for fairness. We selected a 5.4 ns clock period, at which no timing violation is observed for any design, and synthesized the above three designs under this timing constraint. Under this iso-throughput condition, our a-SAD circuit shows the lowest power dissipation among the three designs. The third column of Table 3 shows these power simulation data. Here, due to the proposed MSB-approximation technique, our a-SAD circuit obtains 11.6% power reduction compared to the original RCA-based one. Compared to the CSA-based one of [6], our a-SAD circuit shows 14.1% smaller power under this iso-throughput condition.
We also compare the areas of the above three designs. The automatic timing optimization of Design Compiler considerably affects circuit area. This implies that for a fair area comparison, we have to perform logic synthesis under an iso-throughput condition. We employ the same timing constraint as in the power comparison. Table 3 shows these comparison results, where our proposed design has the smallest area. This indicates that our design also improves area efficiency.
Pipelined Architecture
In the same way as in the above section, we simulate and compare the performance and power of the above three designs under two- and three-stage pipelined architectures. We design these pipelined architectures by exploiting the retiming option of Design Compiler, which automatically generates pipelined architectures and performs timing optimizations. The simulation results are shown in Table 4.
In the performance comparison, our a-SAD circuit again delivers the highest performance. Under the pipelined architectures, the CSA-based design of [6] shows higher performance than the RCA-based one. However, compared to [6], our a-SAD circuit provides at least 13% performance improvement for both two and three stages.
For the power comparison under an iso-throughput condition, we employ a 3 ns clock period for both two- and three-stage pipelined architectures. Here, the RCA-based design shows lower power dissipation than the circuit of [6]. Compared to the RCA-based design, our a-SAD circuit delivers 17% (for two-stage pipelining) and 15.8% (for three-stage pipelining) power improvements. Our a-SAD circuit also occupies the smallest area under these pipelined architectures, as shown in Table 4.
V. CONCLUSION
We present a power-efficient SAD circuit named approximate SAD (a-SAD). We apply MSB-approximation to the accumulation part of the SAD circuit, where several of the highest-order MSB's of the SAD are approximated by a single MSB. In spite of this approximation, our a-SAD circuit hardly affects video output quality and compression rate. The approximate MSB is enabled whenever at least one of the original MSB's is enabled. We can easily implement the MSB-approximation circuit by OR-gating some carry-in signals and input signals. From extensive video simulations, we conclude that the optimal number of approximation target MSB's is four under the 8-bit YUV-420 format. The MSB-approximation technique reduces the number of logic gates and the carry-propagation delay, thereby improving power and performance at the same time. This enables the proposed design to achieve good power efficiency under an iso-throughput condition. In Samsung CMOS 130 nm technology, we compare the performance and iso-throughput power of our a-SAD circuit to those of existing SAD designs. Here, our a-SAD circuit shows higher performance and lower iso-throughput power than existing SAD designs.
