In this paper, we propose power-efficient motion estimation (ME) using multiple imprecise sum of absolute differences (SAD) metric computations. We extend the recent work of Varatkar et al. [18] by providing analytical solutions based on modelling of the computation errors due to voltage overscaling (VOS) and sub-sampling (SS). Results show that our solutions provide significantly better performance in terms of rate increase for fixed QP, e.g., less than a 5% increase, whereas in [18] the rate increase could be as high as 20%. This allows us to apply a lower voltage, which leads to additional power savings. Our analysis also allows us to compare different ME algorithms (e.g., full search vs. a fast algorithm) and SAD computation architectures (parallel vs. serial) in terms of their robustness to imprecise metric computations and their power efficiency. Finally, we demonstrate that additional power savings can be achieved by removing redundancy between the various computations.
INTRODUCTION
Multimedia applications represent a major workload on a large number of hand-held devices such as cellular phones and laptops [11, 8], for which power (or energy) is the most important design constraint. Video encoders (e.g., H.264/AVC and MPEG-4) are the most power-consuming part of multimedia applications. Within a typical video encoder, we focus on the power efficiency of motion estimation (ME), as it consumes a large portion of the resources (e.g., 66%-94% in an MPEG-4 encoder [12]).
Algorithmic approaches for power-efficient ME [10, 9, 17] have been studied for a number of years. Recently, a new technique, voltage overscaling (VOS) within ME, has been shown to lead to significant additional power savings [6, 18]: given an existing algorithm, the input voltage ($V_{dd}$) for the SAD computation module is set below the critical voltage ($V_{dd}^{crit}$). A reduction in $V_{dd}$ by a factor $W$ can lead to power dissipation that is lower by nearly a factor of $W^2$. A major difference with respect to existing algorithmic methods is that the lower power consumption comes at the cost of input-dependent soft errors: lower input voltage increases circuit delay, so the number of basic operations that can complete in one clock period decreases, which generates errors. In our previous work [6], we have shown that these errors due to VOS are either i) concealed by the encoding process (e.g., a motion vector selected to minimize SAD can be correct even if the SAD itself is incorrect) or ii) lead to "acceptable" degradation (e.g., a small distortion or rate increase). This acceptable degradation characteristic can be seen as a specific instance of error tolerance (ET), which has been considered in more general settings, including hard hardware faults [1]. Our previous work has demonstrated that image/video compression systems exhibit ET characteristics, even if no explicit error control block is added, in the presence of both VOS [2] and hard errors due to deterministic faults [7, 5]. For example, our simulations demonstrated 37% power savings in the ME process with negligible performance penalty in typical video encoders. As part of our work we developed an analytical model for VOS errors in the ME context.
This paper is based upon work supported in part by the National Science Foundation under Grant No. 0428940. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Recent work [18] has also proposed using VOS to achieve power savings in ME. Error concealment is introduced in this system by computing additional SAD values using a sub-sampled (SS) version of the original macroblock data, and then using the SAD data computed by these two computation modules to estimate the "true" SAD value for each candidate; a simple threshold method is proposed to combine the information provided by the two modules. The basic idea is thus to use two imprecise SAD computations with different characteristics instead of one. If the error concealment module based on SS consumes much less power than the VOS module, then overall power savings can be achieved as compared to a technique without error concealment (e.g., [6]). Note that algorithmic methods to approximate the SAD metric with lower computation cost are well known (e.g., see the analysis of SS in [13] and references therein), but techniques for SAD computation cost reduction based on VOS are very recent.
In this paper, we study a two-module system such as that proposed by Varatkar and Shanbhag [18]. Our main contribution is to use analytical error models for both VOS [6] and SS [13]. These models lead to novel techniques for combining the SAD values obtained by the two modules, which we show outperform the previous threshold method [18]. In particular, we show that the additional error tolerance enabled by error concealment with improved SAD estimation leads to an increase in the range of operating values for $V_{dd}$ and $m$, which directly translates into increased power savings. Furthermore, we use these models to evaluate the error tolerance and power efficiency of various ME algorithms (full search, FS, and enhanced predictive zonal search, EPZS [3]) and SAD computation architectures (parallel and serial).
Note that some of our techniques may also apply to environments where multiple different perfect metric computations are performed in a noisy environment (e.g., deep submicron noise [16]). With appropriate models, the techniques we discuss may lead to increased resilience with lower overhead, as compared to traditional techniques such as triple modular redundancy [14].
We first briefly explain the ME process, and propose a problem formulation where two imprecise metric computations are used within the ME process in Section 2. In Section 3, we introduce analytical error models for each computation module and propose novel techniques to estimate SAD based on the output of these modules.
In Section 4, we provide simulation results to evaluate the solutions. Results show that our new estimators substantially outperform previous work in terms of rate increase for fixed QP; our solutions show increases of less than 5% (whereas previously proposed techniques led to increases of around 20%). Furthermore, we compare ME algorithms (FS/EPZS) and SAD computation architectures (parallel/serial). We also show that additional power savings can be achieved by removing redundancy between computation modules.
MOTION ESTIMATION WITH MULTIPLE IMPRECISE COMPUTATIONS
The ME process comprises a search strategy and a matching metric computation (MMC). The search strategy identifies a set of candidate motion vectors (MVs) and then proceeds to compute the matching metric for the candidates and to select the one that minimizes the matching metric (typically SAD). There are several types of hardware architectures to compute the matching metrics, with different levels of parallelism [15]. We will refer to them as MMC architectures. Among those, we use both serial and parallel architectures, where $M^2$ adders are connected for SAD computation between two $M \times M$ macroblocks.
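As an illustration of the matching metric and of a full-search strategy, the following minimal sketch computes the SAD for each candidate block and selects the minimizer. The function names `sad` and `full_search`, the list-of-rows block representation, and the toy 2 × 2 data are assumptions made for this example; the hardware MMC architectures discussed above are not modelled.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks (lists of rows)."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def full_search(current, candidates):
    """Return (best_index, sads): index of the minimum-SAD candidate and all SADs."""
    sads = [sad(current, cand) for cand in candidates]
    best = min(range(len(sads)), key=sads.__getitem__)
    return best, sads


# Toy data (hypothetical): one current block and three candidate reference blocks.
current = [[10, 12], [14, 16]]
candidates = [
    [[0, 0], [0, 0]],      # poor match
    [[10, 13], [14, 15]],  # close match
    [[20, 20], [20, 20]],  # poor match
]
best, sads = full_search(current, candidates)
print(best, sads)
```

A fast algorithm such as EPZS would differ only in how the candidate list is built, not in how the metric is evaluated.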
For each $M \times M$ macroblock in the current frame, the SAD for the $i$-th candidate MV is denoted $SAD_i$, and we assume there are $N$ candidates. We define $I$ as the candidate index corresponding to the lowest SAD, $I = \arg\min_i SAD_i$, so that the minimum SAD is $SAD_{min} = SAD_I$. (Note that this could be replaced by another selection metric, e.g., one involving a Lagrangian cost.) Here we consider two imprecise SAD computations, with VOS and SS (as in [18]). Note that this formulation can be generalized to multiple SAD computations that differ in the sense of being subject to different types of errors. We denote by $SAD^{ss}_i$ and $SAD^{vos}_i$ the SADs corresponding to the $i$-th candidate computed by the SS and VOS modules, respectively (see Figure 1(a)). These two sets of SAD values are used to estimate the best MV, which we denote with index $\hat{I}$. If $I \neq \hat{I}$, the residual block's distortion as measured by the SAD increases by $E_{SAD} = SAD_{\hat{I}} - SAD_I$. This increase in distortion ($E_{SAD}$) may lead to a rate increase ($\Delta R$), which will depend on the chosen QP. For example, a quadratic model [4] would suggest that $\Delta R$ should be a linear function of $E_{SAD}$, so that the relative rate increase can be approximated as in [6].
Our goal is now to provide a method to find $\hat{I}$ such that $E_{SAD}$ is minimized, making use of models for the errors $E^{ss}_i$ and $E^{vos}_i$ of the two modules. We propose a method that operates in two steps: i) for each candidate we find an SAD estimate $\widehat{SAD}_i$ based on the information provided by each module, i.e., $SAD^{ss}_i$ and $SAD^{vos}_i$, and ii) we then find the $\hat{I}$ that minimizes $\widehat{SAD}_i$.
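The two-step procedure can be sketched as follows. This is a minimal illustration under stated assumptions: `select_candidate` and `sad_increase` are hypothetical names, the numbers are invented, and the averaging rule used in the second call is only a placeholder for "some estimator", not one of the estimators studied in this paper.

```python
def select_candidate(sad_ss, sad_vos, estimator):
    """Step i: estimate each SAD from the two imprecise values.
    Step ii: return the index that minimizes the estimate."""
    estimates = [estimator(s, v) for s, v in zip(sad_ss, sad_vos)]
    return min(range(len(estimates)), key=estimates.__getitem__)


def sad_increase(true_sads, chosen):
    """Distortion penalty: true SAD of the chosen candidate minus the true minimum."""
    return true_sads[chosen] - min(true_sads)


# Hypothetical data for three candidates.
true_sads = [900, 400, 1300]
sad_ss = [880, 430, 1290]   # small two-sided SS errors
sad_vos = [900, 400, 276]   # third candidate hit by a -1024 VOS error

# Using the VOS output alone picks the corrupted third candidate.
chosen_vos_only = select_candidate(sad_ss, sad_vos, lambda s, v: v)
# A placeholder rule that combines both outputs recovers the true minimizer here.
chosen_mean = select_candidate(sad_ss, sad_vos, lambda s, v: (s + v) / 2)

print(chosen_vos_only, sad_increase(true_sads, chosen_vos_only))
print(chosen_mean, sad_increase(true_sads, chosen_mean))
```

The example shows why a single large negative error on a poor candidate is costly: the SAD penalty equals the full gap between that candidate and the true minimum.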
PROPOSED SAD ESTIMATORS
In recent work [18], a threshold-based estimator is proposed.
Note that the threshold is defined in terms of $SAD_i$, which is not known beforehand, so an approximate procedure (or training) may have to be developed to estimate it (details are not provided by Varatkar and Shanbhag [18]). This approach is a heuristic based on two simple observations about VOS and SS errors: i) VOS errors tend to be large in magnitude and always lead to SAD values that fall below the correct ones, and ii) SS errors are usually small relative to VOS errors, especially for low sub-sampling factors $m$, i.e., when a large percentage of the pixel data is used to compute $SAD^{ss}$. This approach works well for $V_{dd}$ and $m$ values where errors are relatively small (e.g., relatively large $V_{dd}$ and $m \le 4$). In what follows, we show that by combining models for both SS and VOS errors, it is possible to achieve good performance over a larger range of $V_{dd}$ and $m$ values, thus further increasing power savings.
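The exact form of the threshold rule is not given in the text here. One form consistent with the two observations above is to keep the VOS output when it agrees with the SS output to within a threshold, and to fall back to the SS output otherwise. The sketch below implements that assumed form; the function name `threshold_estimate` and the threshold value are hypothetical and should not be read as the definition used in [18].

```python
def threshold_estimate(sad_ss, sad_vos, th):
    """Assumed threshold rule: trust the VOS value when it is within `th`
    of the SS value (no large VOS error suspected); otherwise use the SS value."""
    if abs(sad_vos - sad_ss) < th:
        return sad_vos
    return sad_ss


# Hypothetical values: the first pair agrees, the second has a -1024 VOS error.
agree = threshold_estimate(430, 400, th=100)
disagree = threshold_estimate(1290, 276, th=100)
print(agree, disagree)
```

With a single fixed threshold, the rule degrades when SS errors become comparable to the threshold (large $m$), which is the regime the estimators below are meant to address.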
Error Characteristics for VOS and SS
We briefly describe error models for VOS [6] and SS [13] that have been proposed in the literature. The VOS error ($E^{vos}_i$) is a non-positive discrete random variable whose values are multiples of $-2^{R_S}$ for a given $SAD_i$. Here $R_S$ is the number of basic operations (e.g., full-adder operations in a ripple-carry adder) that can complete in one clock cycle, which is a non-decreasing function of $V_{dd}$: higher $V_{dd}$ implies larger $R_S$, so more operations can be completed per cycle, leading to a lower probability of computation error. As a result, the VOS error depends on both $V_{dd}$ and the input characteristics. In summary, $SAD^{vos}_i$ is upper bounded by $SAD_i$; also, from our simulations we observe that $Pr(SAD^{vos}_i \neq SAD_i)$ for small $SAD_i$ is significantly smaller than the average $Pr(SAD^{vos}_i \neq SAD_i)$, e.g., by an order of magnitude in some cases.
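To make the VOS error model concrete, the sketch below draws a corrupted SAD whose error is zero with some probability and otherwise a negative multiple of $2^{R_S}$. Several choices are assumptions made only for illustration: the function name `sample_vos_sad`, the uniform choice among nonzero error levels, the clamping that keeps the corrupted value non-negative, and the parameter values. It is a toy sampler, not a circuit-level model.

```python
import random


def sample_vos_sad(true_sad, r_s, p_no_error, rng):
    """Draw SAD_vos = SAD + E_vos, with E_vos a non-positive multiple of 2**r_s."""
    step = 2 ** r_s
    max_multiple = true_sad // step  # assumption: corrupted SAD stays non-negative
    if max_multiple == 0 or rng.random() < p_no_error:
        return true_sad              # no VOS error
    # Assumption: nonzero error levels are equally likely.
    return true_sad - step * rng.randint(1, max_multiple)


rng = random.Random(0)
samples = [sample_vos_sad(5000, 10, 0.8, rng) for _ in range(1000)]
errors = sorted({s - 5000 for s in samples})
print(errors)  # distinct error values observed, all multiples of -1024
```

In this toy sampler a SAD smaller than $2^{R_S}$ is never corrupted, which is a side effect of the clamping and only loosely mirrors the observation that small SADs are less likely to be in error.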
The SS error ($E^{ss}_i$) can be modelled as a continuous Laplacian random variable with parameter $\lambda$ for given $SAD_i$, where $\lambda$ is a function of the sub-sampling parameter $m$; larger values of $m$ correspond to larger $\lambda$ [13]. We observe that $\lambda$ varies as a function of $SAD_i$; for smaller $SAD_i$, an accurate $\lambda$ may be up to an order of magnitude smaller than the average $\lambda$ that would be selected for all SADs. Note that for given $SAD_i$, SS and VOS errors are practically uncorrelated (in our simulations, correlation < 0.02); in what follows we assume that SS errors are independent of VOS errors for given $SAD_i$.
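A sub-sampled SAD can be sketched as below. The sampling pattern (every $m$-th pixel in raster order) and the rescaling by $m$ are assumptions chosen for simplicity, as is the name `sad_subsampled`; the SS module considered in this paper may use a different pattern.

```python
def sad_subsampled(block_a, block_b, m):
    """Estimate the full SAD from every m-th pixel (raster order), scaled by m."""
    flat_a = [p for row in block_a for p in row]
    flat_b = [p for row in block_b for p in row]
    partial = sum(abs(a - b) for a, b in zip(flat_a[::m], flat_b[::m]))
    return m * partial


# Hypothetical 4 x 4 blocks: pixel values 0..15 against an all-zero block.
block_a = [[4 * i + j for j in range(4)] for i in range(4)]
block_b = [[0] * 4 for _ in range(4)]

full = sad_subsampled(block_a, block_b, 1)
ss_errors = {m: sad_subsampled(block_a, block_b, m) - full for m in (2, 4)}
print(full, ss_errors)
```

On this toy input the magnitude of the SS error grows with $m$, in line with the larger $\lambda$ reported for larger $m$; unlike the VOS error, the SS error can take either sign in general.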
Adaptive Threshold Estimator and MAP Estimator
An adaptive threshold strategy can be defined based on the observation that in the SS model $\lambda$ decreases as $SAD_i$ decreases. We can divide the range of SAD values into intervals $r = 1, 2, \ldots, K$, and associate a threshold $Th_r$ (and corresponding $\lambda_r$) with each interval, based on the observed SAD. [...] $\mid SAD_i = x)$, under our assumption of independence of SS and VOS errors for given $SAD_i$. Second, the above term only needs to be evaluated for a small number of values because the VOS error is a discrete random variable with sparse support.
Third, we can further approximate the distribution of VOS errors as follows. Given $p^r_0$, the probability that no VOS error occurs when $SAD^{ss}$ belongs to the $r$-th interval, we can make the approximation that all $L$ possible nonzero errors (multiples of $-2^{R_S}$) have identical probability $(1 - p^r_0)/L$. This has a negligible effect on the MAP estimator performance, according to our simulations. Note that $p^r_0$ can be obtained empirically by evaluating $Pr(|SAD^{vos} - SAD^{ss}| < Th_r)$.
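A MAP-style estimator built from these ingredients can be sketched as follows. The sketch rests on assumptions that go beyond the text: a flat prior on $SAD_i$, $\lambda$ interpreted as the Laplacian scale parameter, nonzero VOS errors taken as equally likely, and hypothetical names and parameter values (`map_estimate`, `lam`, `p0`, `num_levels`). Because the VOS error has sparse support, only `num_levels + 1` hypotheses need to be scored.

```python
import math


def map_estimate(sad_ss, sad_vos, r_s, lam, p0, num_levels):
    """Pick the hypothesis x = sad_vos + k * 2**r_s (k = 0..num_levels) that
    maximizes log P(E_vos = -k * 2**r_s) + log Laplace(sad_ss - x; scale=lam)."""
    step = 2 ** r_s
    best_x, best_score = sad_vos, -math.inf
    for k in range(num_levels + 1):
        x = sad_vos + k * step  # hypothesis: the VOS error was -k * step
        p_vos = p0 if k == 0 else (1.0 - p0) / num_levels
        if p_vos <= 0.0:
            continue
        # Laplacian log-likelihood of the SS error, up to an additive constant.
        score = math.log(p_vos) - abs(sad_ss - x) / lam
        if score > best_score:
            best_x, best_score = x, score
    return best_x


# Hypothetical values: a clean candidate and one with a -1024 VOS error.
clean = map_estimate(430, 400, r_s=10, lam=50.0, p0=0.9, num_levels=4)
repaired = map_estimate(1290, 276, r_s=10, lam=50.0, p0=0.9, num_levels=4)
print(clean, repaired)
```

In the second call the SS output makes the hypothesis of one missing multiple of 1024 far more likely than an error-free VOS output, so the estimate is moved from 276 to 1300. An interval-dependent version would look up `lam` and `p0` per interval $r$.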
In our experience, this MAP estimator can occasionally be somewhat sensitive to modelling errors. Thus a more robust estimator would combine the MAP technique with the adaptive threshold estimator. Note also that both the adaptive threshold estimator and the MAP estimator are designed to find the best estimate of $SAD_i$, and thus are not optimized in terms of our final objective, i.e., minimizing $E_{SAD}$. We next propose an estimator to address this objective.
MAX Estimator
To motivate our proposed estimator, we divide the candidates into two sets: $B = \{k \mid SAD_k \gg SAD_{min}\}$, which contains "bad" candidates, and $\bar{B} = \{k \mid SAD_k \approx SAD_{min}\}$, which contains those candidates that provide a reasonably good match. Significant motion estimation errors (i.e., large $E_{SAD}$) occur if either (i) all candidates in $\bar{B}$ suffer from large positive errors, leading to the selection of a candidate in $B$, or (ii) at least one candidate in $B$ suffers from a large negative error, so that it is chosen over the candidates in $\bar{B}$. Because $B$ is much larger than $\bar{B}$, and given that both VOS and SS errors tend to be larger for larger SADs (so candidates in $B$ tend to suffer from larger errors than those in $\bar{B}$), case (ii) represents a much more likely risk. Thus, a good estimator would seek to minimize the likelihood of large negative errors. Based on this, we propose to use the MAX estimator, i.e., $\widehat{SAD}_k = \max(SAD^{ss}_k, SAD^{vos}_k)$.
Clearly, this will lead to a positive bias in the estimation, as the larger of the two values is always chosen. Recall that the VOS error is never positive, i.e., $SAD^{vos}_k \le SAD_k$, and the MAX estimator automatically provides estimates that are lower bounded by $SAD^{vos}_k$. Finally, this is a non-parametric estimator, so the overall estimation complexity is modest.
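The MAX rule needs no parameters, so it reduces to a single comparison per candidate. The sketch below applies it to invented data; the names `max_estimate` and `select_with_max` are hypothetical.

```python
def max_estimate(sad_ss, sad_vos):
    """MAX rule: take the larger of the two imprecise SAD values."""
    return max(sad_ss, sad_vos)


def select_with_max(sad_ss_list, sad_vos_list):
    """Estimate every candidate with the MAX rule, then return (index, estimates)."""
    estimates = [max_estimate(s, v) for s, v in zip(sad_ss_list, sad_vos_list)]
    best = min(range(len(estimates)), key=estimates.__getitem__)
    return best, estimates


# Hypothetical data: true SADs are 900, 400, 1300; the third VOS value is corrupted.
sad_ss = [880, 430, 1290]
sad_vos = [900, 400, 276]
best, estimates = select_with_max(sad_ss, sad_vos)
print(best, estimates)
```

The corrupted VOS value 276 is discarded in favour of the SS value, so the poor candidate no longer looks attractive; the price is a small positive bias on the good candidate (430 instead of 400), which does not change the selection here.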
SIMULATION RESULTS AND DISCUSSION
We now evaluate the performance of our proposed estimators: i) adaptive threshold, ii) MAP combined with adaptive threshold, and iii) MAX. For our experiments we use the FOREMAN sequence and an H.264/AVC baseline profile encoder with FS/EPZS ME algorithms and serial/parallel MMC architectures. Only 16 × 16 block partitions and a single reference were considered for ME. A constant QP was used and rate distortion optimization (RDO) was turned on; simulations show that the rate increase is greater when RDO is turned off, but H.264/AVC encoders tend to operate with RDO on. We assign 15 frames to each group of pictures (GOP), and use an IPPP GOP structure. We collect distortion increase ($E_{SAD} = SAD_{\hat{I}} - SAD_I$) data by encoding each GOP with and without errors for different $V_{dd}$ ($R_S = 10, 12, 14$, where $R_S = 16$ corresponds to error-free operation), QP = 10, 20, 30, and $m = 2, 4, 8$. Note that $m = 4$ and $R_S > 10$ are the limits of the operating range that could be supported with the threshold estimator proposed by Varatkar and Shanbhag [18]. We evaluate the relative rate increase from these $E_{SAD}$ data.
Clearly, in selecting parameters (i.e., $\lambda_r$, $Th_r$, $p^r_0$, $\lambda$, $Th$) for the various estimators we do not have access to the original SAD. One possible approach would be to use a few blocks per frame for training, and then use the same estimator parameters for all remaining blocks in the frame. However, we observed that estimator parameters can vary significantly within a frame. Thus, as an approximation, we use $SAD^{vos}_i$ as an estimate of $SAD_i$, and use this to select estimator parameters for each macroblock.
With these settings, we compare the performance of the threshold estimator and the three new estimators. In terms of minimizing rate increase, the MAX estimator shows the best performance, followed by, in order of decreasing performance, MAP combined with adaptive threshold, adaptive threshold, and the threshold estimator; performance differences are clearer when FS is used. Referring to Fig. 2, we can see that for $m = 2, 4$ and $R_S \ge 10$ all new estimators show reasonable performance, leading to less than a 5% rate increase. However, for $m = 8$ and $R_S = 10$, MAX exhibits less than a 5% rate increase, as compared to 20% for the threshold estimator. Note that a 20% rate increase would alternatively correspond to around 0.5 dB PSNR loss if the rate were to be kept fixed, while a 5% rate increase leads to less than 0.1 dB loss [2]. Operating with minimum degradation for $m = 8$ and $R_S = 10$ is attractive because this translates into power savings of 50% and 10% in the SS and VOS modules, respectively, as compared to choosing $m = 4$ and $R_S = 11$.
We also compare the performance under different ME algorithms. EPZS uses a good prediction algorithm to select a small number of MV candidates, with most of these candidates having relatively good matching accuracy. Thus EPZS shows more resilience to errors due to imprecise computations than FS, so a lower $V_{dd}$ can be used for EPZS, resulting in better power efficiency. Varatkar and Shanbhag [18] used a three step search (TSS) algorithm. We have shown that this algorithm is less resilient to soft errors than EPZS [2], so overall rate increases can be worse if TSS is used instead of EPZS. Here, however, we do not consider the inherent differences in complexity, regularity, and memory usage between algorithms, which are not easy to quantify; the FS algorithm has more search points but a more regular structure and memory usage than EPZS.
As for the choice of MMC architecture, a parallel architecture shows better performance than a serial one because its intermediate nodes have a more balanced partial SAD dynamic range; for a given final SAD, a serial architecture has more intermediate nodes with large partial SAD values than a parallel architecture.
Note that there is redundancy between the VOS block and the SS block: the pixel input data to the SS block is a subset of the input to the VOS block. Thus, removing this redundancy leads to lower power consumption, as the number of operations in the VOS module is reduced (see Figure 1(b)). Performance also increases because a smaller number of inputs leads to a lower probability of error in the VOS module.
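One possible reading of this redundancy removal, offered as an assumption rather than as the architecture of Figure 1(b), is that the partial sum already accumulated over the SS pixels is reused, so that the VOS module only accumulates the remaining pixels. The sketch below checks the underlying identity on toy data; the name `split_sad` and the raster-order sampling pattern are hypothetical.

```python
def split_sad(flat_a, flat_b, m):
    """Split the SAD into the part over every m-th pixel (SS subset)
    and the part over the remaining pixels."""
    n = len(flat_a)
    ss_idx = [i for i in range(n) if i % m == 0]
    rest_idx = [i for i in range(n) if i % m != 0]
    ss_part = sum(abs(flat_a[i] - flat_b[i]) for i in ss_idx)
    rest_part = sum(abs(flat_a[i] - flat_b[i]) for i in rest_idx)
    return ss_part, rest_part, len(rest_idx)


# Hypothetical flattened 4 x 4 blocks.
flat_a = list(range(16))
flat_b = [0] * 16
ss_part, rest_part, rest_ops = split_sad(flat_a, flat_b, 4)
full = sum(abs(a - b) for a, b in zip(flat_a, flat_b))
print(ss_part, rest_part, rest_ops, full)
```

Under this reading, with $m = 4$ the second module handles 12 of the 16 pixel differences instead of all 16, and the two partial sums add up to the full SAD.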
