In this paper, we apply dynamic voltage scaling (DVS) to the matching metric computation (MMC) used within motion estimation (ME) in typical video encoders. Our approach is based on "soft DSP" concepts. We analyze the effect of ME errors (due to DVS) on overall coding performance. We propose a model for the resulting rate increase (at a given fixed quantization parameter) as a function of input characteristics and input voltage, for a given ME algorithm and MMC architecture. This model is validated using simulations. We then compare ME algorithms and MMC architectures, and propose a method for power saving in the ME process that depends on input characteristics and desired coding performance. As an illustration of the potential benefits of allowing computation errors, we show that allowing errors that lead to a small rate increase (about 3%) produces 37% power savings in the ME process, as compared to not using DVS. An essentially "error-free" DVS approach (no rate penalty) can achieve around 10% power savings.
INTRODUCTION
Power (or energy) is the most important design constraint in many VLSI design scenarios [11]. Many approaches have been proposed for power-constrained VLSI, ranging from the circuit level to the architectural and algorithmic levels [7, 9]. Dynamic voltage scaling (DVS) is an attractive technique to reduce power consumption, as lowering the input voltage by a factor $J$ reduces energy dissipation by almost a factor $J^2$ [9]. Soft DSP is an efficient approach for DVS [9] that has been applied to low-level systems, such as adders and multiplier-accumulators (MACs) often used in signal processing applications (e.g., linear filters and multi-input-multi-output, MIMO, systems). In soft DSP systems the input voltage is below the critical voltage (i.e., we have voltage overscaling, VOS), which leads to input-dependent soft errors. Soft-error tolerance is then achieved by using explicit error control blocks that provide error concealment so as to operate with negligible loss in algorithm performance.
In our previous work, we have shown that image/video compression systems exhibit error tolerance (ET) characteristics, even if no explicit error control block is added, and this under both hard errors (due to deterministic faults) [6, 5] and soft errors (due to DVS) [3]. Errors due to VOS in these applications are either i) concealed by other parts of the system (e.g., quantization can conceal errors affecting a transform computation) or ii) "acceptable" [2]. Determining what constitutes acceptable errors is an application-specific decision; both performance criteria and acceptability thresholds are highly dependent on the application. In [5], we studied the impact of hard errors on matching metric computation (MMC) within the motion estimation (ME) process in a video compression system. In this case, we showed that both video encoder and decoder remain operational, and thus these errors can be evaluated in terms of the compression performance penalty they produce (i.e., more bits are needed to code data at a given quality level as compared to the bit-rate required by a fault-free system operating at the same quantization parameter, QP). This performance penalty may be acceptable for specific application scenarios. We have also provided a primarily experimental evaluation of the behavior of several ME algorithms under "soft error" conditions, applying soft DSP approaches to the MMC within the ME process [3].
This paper is based upon work supported in part by the National Science Foundation under Grant No. 0428940. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the National Science Foundation.
In this paper, we extend our previous work [3] to a DVS scenario. The main novelty comes from i) a model for degradation in video coding performance due to voltage scaling, as a function of input characteristics and for given ME algorithms and MMC architectures (this model can be used to select input voltage values for target coding performance criteria), and ii) using this model to compare various ME algorithms and MMC architectures in terms of their coding performance under DVS. Our proposed models for DVS performance are designed to be used in hardware-based video encoders, but could also prove useful in the context of general purpose processors for which power control is enabled (see [8] for an example). Since ME is performed at the encoder, our work is primarily applicable to scenarios where power-constrained devices (e.g., cellphones) are used for video capture and encoding.
To introduce our model, we first briefly explain the ME process, introduce the different MMC architectures we will use, and describe the basic setting for analysis (Section 2). Each MMC architecture involves several "soft" adders, such as those used in the soft DSP context. We provide a detailed analysis of errors due to voltage scaling for a single adder (Section 3). We then extend it to model errors in typical MMC architectures and the performance degradation due to soft errors as a function of input voltage and input characteristics. This model is validated using simulations (Section 4). Using this model, we propose a voltage control method which, based on our simulations, can achieve about 37% power savings in the ME process, as compared to not applying any voltage scaling, with a very slight increase in rate (around 3%).
MOTION ESTIMATION WITH SOFT ERROR
The ME process comprises a search strategy for the motion vector (the ME algorithm) and a matching metric computation (MMC). The search strategy selects a set of candidate motion vectors (MVs), computes the matching metric for each candidate, and selects the candidate that minimizes the matching metric (typically the sum of absolute differences, SAD).
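The interplay between the search strategy and the MMC can be sketched as follows. This is a minimal, error-free illustration assuming an exhaustive (full) search over a square window; the function names `sad`, `get_block` and `full_search` and the list-of-lists frame layout are hypothetical choices made for this example, not part of any encoder discussed here.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, m):
    """Extract the m x m block whose top-left corner is (top, left)."""
    return [row[left:left + m] for row in frame[top:top + m]]


def full_search(cur_block, ref_frame, top, left, search_range):
    """Evaluate every candidate MV in a square window around (top, left).

    Returns ((dy, dx), sad_value) for the candidate with minimum SAD.
    """
    m = len(cur_block)
    height, width = len(ref_frame), len(ref_frame[0])
    best_mv, best_sad = None, None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + m > height or x + m > width:
                continue  # candidate block falls outside the reference frame
            cand_sad = sad(cur_block, get_block(ref_frame, y, x, m))
            if best_sad is None or cand_sad < best_sad:
                best_mv, best_sad = (dy, dx), cand_sad
    return best_mv, best_sad
```

A predictive strategy would call `sad` on far fewer candidates; only the candidate set changes, not the metric computation.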
There are several types of hardware architectures [12] to compute the matching metric, with different levels of parallelism. We will refer to them as MMC architectures. Among those, we choose a serial and a parallel architecture for analysis (see Figure 1). The serial architecture has $M^2$ serially connected adders for SAD computation between two $M \times M$ macroblocks. It is simple but requires a longer running time than the parallel one. As shown in Figure 1, the parallel architecture has $M$ parallel groups of "leaf" adders and $M$ "central" adders (in total $M^2 + M$ adders are needed). Each group of leaf adders consists of $M$ adders and computes the sum of $M$ absolute difference (AD) values. Then, the central adders compute the final SAD by adding up the $M$ partial SAD values. In the leaf adders, outputs are small compared to the final SAD value (on average, the partial SAD values in a group of leaf adders will be smaller than the SAD). Thus errors with magnitude larger than a partial SAD (which is small compared to the final SAD) are unlikely to be generated in the leaf adders, because a soft error in an adder cannot be greater than the output of that adder (see Section 2).
For each macroblock of size $M \times M$ in the current frame, the MMC process computes the matching metric for each candidate block in the reference frame's search window; these are denoted $SAD_1, SAD_2, \ldots, SAD_N$ (sorted by magnitude, with $SAD_1$ the largest), where $N$ is the number of candidates. We define $MV_{min}$ as the best MV, i.e., the index $i$ for which $SAD_i$ is minimum (here $MV_{min} = N$), and $SAD_{min}$ as the minimum SAD among all $SAD_i$ (here $SAD_{min} = SAD_N$).
[...] to rate increases for a given QP, which we propose to model using the quadratic (Q2) model [4]. This model has been applied in implementations of existing video encoding standards such as ITU-T H.264/MPEG-4 AVC [1] and tends to be accurate for large data sets, such as one Group of Pictures (GOP) or one whole sequence (its accuracy increases with the number of frames being modelled).
The main Q2 model is as follows:
$$R = S_1 \frac{MAD}{QP} + S_2 \frac{MAD}{QP^2} \qquad (1)$$
where $R$ is the rate, $MAD$ is the energy of the prediction residual measured in terms of mean absolute difference (MAD), $QP$ is the quantization parameter and $S_1$, $S_2$ are parameters to be estimated.
One can see that the Q2 model can be rewritten as a linear function of SAD for a fixed QP. Taking the derivative of both sides, the following relation holds between a SAD increase ($E_{SAD}$) and the resulting rate increase ($\Delta R$):
$$\Delta R = X_1 E_{SAD} \qquad (2)$$
where $X_1$ is a parameter to be estimated for each set of frames (i.e., one GOP or the whole sequence). Therefore, if we know the model for $E_{SAD}$, we know the model for $\Delta R$. We now focus on modelling $E_{SAD}$ as a function of $V_{dd}$ and input characteristics. For this purpose, in the next section we study the characteristics of soft errors in the MMC process ($E_i$).
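The linearization step can be written out explicitly. The following is a sketch that assumes the usual quadratic form of the Q2 model and that MAD is the SAD divided by the number of pixels over which it is accumulated; the symbol $N_{pix}$ is introduced here for illustration only.

```latex
\begin{align*}
R &= \left(\frac{S_1}{QP} + \frac{S_2}{QP^2}\right) MAD,
  \qquad MAD = \frac{SAD}{N_{pix}} \\
\Delta R &= \frac{1}{N_{pix}}\left(\frac{S_1}{QP} + \frac{S_2}{QP^2}\right) E_{SAD}
  \;=\; X_1\, E_{SAD}
\end{align*}
```

Under these assumptions, $X_1$ collects everything that stays constant at a fixed QP, which is why it can be estimated once per set of frames.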
The MMC process with Soft Errors
An MMC system includes several n-bit adders. We assume that ripple carry adders with voltage scaling are used, as they provide useful functionality for DVS [9]. We assume that all soft adders in the MMC process have the same input voltage ($V_{dd}$), as is typically assumed in soft DSP techniques [9]. When we decrease $V_{dd}$ for the adders, the circuit delay of one full adder ($T_{FA}$) increases, but the sampling time ($T_s$) remains the same. Thus $R_s$, the number of full adder (FA) operations possible in one $T_s$, decreases. From now on, we will use $R_s$ instead of $V_{dd}$ as the parameter that controls the operating point of the system; $R_s$ is a function of $V_{dd}$ and depends on several gate parameters (see [9]). Here we use the parameters used in [9]. If the number of FA operations required to complete one addition (i.e., the path delay divided by $T_{FA}$, which is input dependent) is larger than $R_s$, then an error is generated.
Our target is to model the errors ($E_i$) due to applying $V_{dd} < V_{dd,crit}$ to the MMC hardware. As a first step, we need to understand the behavior of an n-bit soft adder (an n-bit ripple carry adder with voltage scaling). Each FA has inputs $a_i$, $b_i$, $c_{i-1}$ and outputs $s_i$, $c_i$, and each input pair $(a_i, b_i)$ is applied to the corresponding FA simultaneously. If a carry $c_i$ is generated in an FA, it is propagated to the next FA (see Figure 2).
If $V_{dd} < V_{dd,crit}$, an error can be generated if $R_s$ is smaller than the path delay required for the computation. The total path delay is determined by the longest run of consecutive carry propagations. Consecutive carry propagations occur when an initial carry generation (by a (1, 1) input pair) is followed by carry-propagating inputs ((1, 0) or (0, 1) input pairs). Thus, if an input has a path delay larger than $R_s$, the carry input to the $(R_s + k)$-th FA is lost ($k$ is the starting position of the carry propagation) and an error with magnitude $2^{R_s+k}$ is generated (see Figure 2).
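The lost-carry mechanism can be reproduced with a small behavioural simulation. This is a sketch under simplifying assumptions that are not taken from the circuit model of [9]: every FA has exactly one unit of delay, all carries start at zero, and the output is sampled after $R_s$ unit delays. The names `soft_add` and `soft_error` are hypothetical, and bit positions are counted from 0, so $k$ is the 0-based position of the (1, 1) pair that starts the carry chain.

```python
def soft_add(a, b, n_bits, r_s):
    """Behavioural model of an n-bit ripple carry adder sampled too early.

    All FAs see their (a_i, b_i) inputs at time 0; a carry needs one unit
    delay per FA it crosses. Only r_s >= 1 unit delays elapse before the
    sum bits are read, so carries that need longer are lost.
    """
    mask = (1 << n_bits) - 1
    generate = a & b            # (1, 1) pairs start a carry
    propagate = a ^ b           # (1, 0) / (0, 1) pairs pass a carry on
    carry_in = 0                # assumption: all carries are initially 0
    result = 0
    for _ in range(r_s):
        # Each FA recomputes its outputs from the carry it saw one step ago.
        result = (propagate ^ carry_in) & mask
        carry_out = generate | (propagate & carry_in)
        carry_in = (carry_out << 1) & mask
    return result


def soft_error(a, b, n_bits, r_s):
    """Difference between the exact sum and the early-sampled sum."""
    return (a + b) - soft_add(a, b, n_bits, r_s)
```

For example, adding 7 and 1 starts a carry at position 0 that must cross three FAs. With `r_s = 2`, `soft_add(7, 1, 8, 2)` returns 4 instead of 8, an error of $2^{R_s+k} = 2^{2+0}$; with `r_s = 4` the result is exact.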
It is interesting to note that for the path delay to be $K + 1$ for one addition, the result, $S$, has to include a 1 followed by $K$ consecutive 0s (from the $(k+1)$-th bit to the $(k+K)$-th bit), i.e., $S = m2^{K+k} + R$, where $m > 0$ and $R < 2^k$. Thus if i) $S = m2^{K+k} + R$ where $K \ge R_s$, and ii) $a_{k+1} = 1$ (automatically $b_{k+1} = 1$), then an error with magnitude $2^{R_s+k}$ is generated. Here we can see that errors can be no greater than $S$ and that an error cannot be generated if $S < 2^{R_s}$. Since $a_{k+1}$ corresponds to a lower significance bit (as compared to $S$), we can assume that $S$ and $a_{k+1}$ are independent. And if we assume i) $P(S = m2^{R_s+k} + R)$ is similar for all $R$ ($S$ has a smooth distribution), and ii) that $p(a_{k+1}$ [...]
To estimate $E_{SAD}$ according to (5), we evaluate $P(i = MV_f)$ first. For $i = MV_f$, $SAD'_i$ needs to be smaller than all other $SAD'_j$, which can be stated as follows:
[...]
This equation is based on the observation that each error ($E_i$) is almost independent of the corresponding SAD value ($SAD_i$) (see Section 2.1.2). For $i$ to be chosen as $MV_f$, $SAD'_i$ should be the minimum among all $SAD'_j$, so that $SAD'_i < SAD_{min}$ and thus the following holds:
[...]
Here we can derive a simple expression for $P(SAD'_i < x)$ which is a linear function of $x$, assuming that $P(PSAD_j = m2^{R_s+k})$ (the probability that the value of one of the intermediate outputs, $PSAD_j$, is $m2^{R_s+k}$) is independent of $k$ and takes a constant value $p_0$:
$$P(SAD'_i < x) \approx \frac{x}{2^{R_s+1}}\, p_0$$
Thus, applying (8) to (5), we get the following expression for $E_{SAD}$:
$$E_{SAD} \approx (SAD_Q - SAD_{min})\left(1 - (1 - \gamma\, SAD_{min})^{N_Q}\right) \qquad (10)$$
where $SAD_Q$ is the mean SAD value over the set $Q$, and $\Delta R = X_1 E_{SAD}$.
Figure 3 shows the variation of $\Delta R$ as a function of $V_{dd}$, which will be useful to design a power control mechanism. This result shows that we can precisely estimate $\Delta R$ with our analytical model in the $R_s$ range of interest.
Using the simulation results and the model, we can compare MMC architectures and ME algorithms. The slope of $\Delta R$ for the serial MMC architecture is larger than that of the parallel architecture, because $p_0$ of the serial MMC architecture is larger (see Section 3). However, the saturated $\Delta R$ value is similar because it depends only on $SAD_{min}$ and $SAD_Q$, which are the same in both cases. Since the EPZS search strategy uses a good prediction algorithm to select a small number of MV candidates, which are already near the minimum SAD point, EPZS has smaller $N_Q$ and $SAD_Q$ than the full search (FS) algorithm. Thus $\Delta R$ for EPZS is always smaller than that for the full search case. In summary, the EPZS search algorithm and the parallel MMC architecture are better than the full search algorithm and the serial MMC architecture, respectively. Note that we do not consider the inherent differences in complexity, regularity, and memory usage between the EPZS and FS algorithms; the FS algorithm has more search points but a more regular structure and memory usage than EPZS.
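The saturation behaviour described above can be checked numerically. The sketch below assumes that (10) has the form $E_{SAD} \approx (SAD_Q - SAD_{min})(1 - (1 - \gamma\, SAD_{min})^{N_Q})$; this reading of the equation, the interpretation of $\gamma\, SAD_{min}$ as a per-candidate probability, and the function name `expected_sad_increase` are all assumptions made for illustration.

```python
def expected_sad_increase(sad_q_mean, sad_min, gamma, n_q):
    """Evaluate (SAD_Q - SAD_min) * (1 - (1 - gamma * SAD_min) ** N_Q).

    gamma * sad_min is treated as the chance that a single candidate in Q
    ends up below SAD_min because of soft errors (an assumed reading).
    """
    p_single = gamma * sad_min
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("gamma * sad_min must lie in [0, 1]")
    # Probability that at least one of the N_Q candidates displaces the best MV.
    p_any = 1.0 - (1.0 - p_single) ** n_q
    return (sad_q_mean - sad_min) * p_any
```

Under this form, the estimate is zero when no errors occur, grows with both the error coefficient and $N_Q$, and saturates at $SAD_Q - SAD_{min}$, which is consistent with a smaller candidate set giving a smaller rate increase.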
If we can estimate $\Delta R$ as a function of $R_s$ for a given sequence before encoding, we can control $V_{dd}$ in an optimal fashion during the encoding process, thus saving power. A normal video encoder optimization scheme only considers rate and distortion. But in encoding scenarios that require low power consumption, e.g., handheld devices, we need to take power consumption into consideration by adding it to the cost function. To estimate $\Delta R$, we need information about $SAD_i$ (the SAD value for each MV candidate in one macroblock). Since information about $SAD_i$ is not available before encoding, it can be estimated, for example, by encoding a single frame within a GOP without DVS. This approach will be effective if $E_{SAD}$ and $X_1$ do not change much within one GOP. Selecting an optimal $V_{dd}$ point can be done using various optimization techniques, such as those based on Lagrange multipliers. A heuristic method would be to choose a threshold for the rate change ($\Delta R_{th}$), and select $V_{dd}$ such that the estimated $\Delta R$ is less than $\Delta R_{th}$. Using these algorithms, we can change $V_{dd}$ dynamically depending on input characteristics with a slight additional complexity in the encoding system, but with potentially large power savings. Figure 4 highlights the potential for savings in the ME process using DVS; setting $\Delta R_{th} = 0.1$ leads to a 37% power reduction when using EPZS and the parallel MMC architecture. Considering that a significant percentage of power consumption is due to the ME process (at least 20% in the case of an MPEG-2 encoder), total power savings within the video encoding system can be significant.
Fig. 3. Rate change due to DVS as a ratio relative to the original rate (dotted: data estimated using our model; solid: real data); FOREMAN sequence, QP = 20.
Fig. 4. Power saving in the ME process using DVS for various ME algorithms and MMC architectures, considering the redundancy due to encoding one frame of every GOP with $V_{dd,crit}$ and $R_s$ for $SAD_i$ information and $X_1$ estimation; FOREMAN
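The threshold heuristic described in this section can be sketched as follows. The function name `select_operating_point`, the dictionary layout, and the numbers in the usage example are hypothetical; they are not measured values from the experiments. The sketch relies only on the fact stated earlier that a lower $V_{dd}$ corresponds to a smaller $R_s$.

```python
def select_operating_point(estimated_delta_r, delta_r_th):
    """Pick the lowest-voltage operating point that meets the rate target.

    estimated_delta_r maps each candidate R_s to the model's estimate of the
    relative rate increase at that operating point. Returns the smallest R_s
    whose estimate is below delta_r_th, or None when no candidate qualifies
    (in which case the encoder would stay at the critical voltage).
    """
    feasible = [r_s for r_s, delta_r in estimated_delta_r.items()
                if delta_r < delta_r_th]
    return min(feasible) if feasible else None


# Illustrative estimates only (assumed values).
example_estimates = {8: 0.50, 10: 0.08, 12: 0.01, 16: 0.0}
chosen_r_s = select_operating_point(example_estimates, 0.1)
```

With these made-up estimates and a threshold of 0.1, the smallest feasible operating point is $R_s = 10$. A Lagrangian formulation would replace the hard threshold with a weighted cost of rate increase and power.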
