Abstract-This paper presents techniques to reduce energy with minimal degradation in system performance for multimedia signal processing algorithms. It first provides a survey of energy-saving techniques such as those based on voltage scaling, reducing the number of computations and reducing the dynamic range. While these techniques reduce energy, they also introduce errors that affect the performance quality. To compensate for these errors, techniques that exploit algorithm characteristics are presented. Next, several hybrid energy-saving techniques that further reduce the energy consumption with low performance degradation are presented. For instance, a combination of voltage scaling and dynamic range reduction is shown to achieve 85% energy saving in a low pass FIR filter for a fairly low noise level. A combination of computation reduction and dynamic range reduction for the Discrete Cosine Transform shows, on average, 33% to 46% reduction in energy consumption while incurring 0.5 dB to 1.5 dB loss in PSNR. Both of these techniques have very little overhead and achieve significant energy reduction with little quality degradation.
I. INTRODUCTION

PORTABLE multimedia devices have proliferated in the last two decades, and the number of applications supported by these devices has increased significantly. Each additional application comes at a cost of higher energy consumption, and since most of these devices are battery powered, it is important that every effort be made to reduce this cost. The challenge is to minimize the energy cost while executing increasingly complex functionalities with minimal degradation in algorithm performance quality. Fortunately, many multimedia applications do not need 100% correctness during computation, and energy saving transformations are favored as long as the output quality is only mildly affected [1], [2].
Three of the most effective techniques for reducing energy consumption are voltage scaling [3]-[14], reduction in the number of computations [2], [15]-[20] and dynamic range adjustment [16], [18], [21]-[24]. While voltage scaling results in significant reduction in energy consumption due to the quadratic dependence between supply voltage and energy consumption, voltage over-scaling (VOS) can lead to failures. Techniques have been developed to mitigate the errors due to critical path violation in the computation unit and memory due to VOS. While circuit-level techniques [25], [26] are quite effective, there are low overhead algorithm-level techniques that use the inherent redundancy and characteristics of the data to detect and correct errors that occurred during the computation.
Unlike general purpose computing, most multimedia applications can provide decent quality even with a reduced number of computations as long as the significant computations are retained. The basic idea is that not all components of the computation are equally significant, and so for systems with limited resources, the more important computations are done first and the less important computations are performed later or even eliminated. Such a methodology has been applied to many image and video processing algorithms such as filters, where multiplications with large coefficients have higher significance [2], Discrete Cosine Transform (DCT), where low frequency coefficients have higher significance [2], [16], or Discrete Wavelet Transform (DWT) [20], where low subband coefficients have higher significance. This is also the basis of incremental processing where computations can be halted when decent quality is achieved [20].
Manuscript received September 12, 2012; revised November 21, 2012; accepted December 30, 2012. Date of publication June 06, 2013; date of current version October 11, 2013. This work was supported in part by NSF CSR 0910699. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Yiannis Andreopoulos.
The authors are with the School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ 85281 USA (e-mail: yemre@asu.edu; chaitali@asu.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TMM.2013.2266094
Another popular energy-saving technique is dynamic range reduction in the datapath computation. Typically, low order bits are less important and so can be truncated to save energy. Such a methodology has been used in many multimedia applications such as filtering [16], DCT [22], [23], FFT [5], etc. However, in some applications such as motion estimation [18], the less significant bit computations are more important, since large values that result from computations with the more significant bits are discarded. While truncation reduces energy consumption, it also introduces errors due to operation with a reduced dynamic range. Simple techniques to compensate for these errors help reduce energy consumption while only mildly affecting the algorithm performance quality.
In this paper, we describe several energy-saving techniques that achieve minimum degradation in quality with low overhead. While some of these techniques are general, others have been geared to exploit the algorithmic features and result in superior performance both in terms of energy consumption and algorithm quality. The key contributions of this paper are as follows:
• We provide a survey of general as well as algorithm-specific techniques that trade off energy with system performance for multimedia signal processing algorithms. Examples include FIR filtering, DCT, DWT, etc. These techniques are based on voltage scaling in datapath and memory, reduction in the number of computations and reduction in the dynamic range.
• We propose hybrid schemes that use combinations of these techniques to achieve even higher energy saving with smaller performance degradation. The performance overhead and energy savings of each scheme are quantified and analyzed.
• We study the combination of voltage scaling and dynamic range reduction in the context of low pass FIR filter applications. The errors due to increase in critical path delay during voltage scaling are reduced by truncating the lower order bits which causes a reduction in the critical path. The noise that is introduced due to truncation is compensated by using an unbiased estimator. For a MAC based FIR filter, such a scheme achieves 85% energy saving for a fairly low noise level.
• We study the combination of computation reduction and dynamic range reduction for DCT. We propose a scheme that chooses which DCT coefficients have to be deactivated and the number of bits to be truncated based on the quality metric, Q. We derive combinations of deactivation and truncation for different acceptable PSNR degradations across the whole range of Q. Simulation results show, on average, 33% to 46% reduction in energy consumption while incurring 0.5 dB to 1.5 dB degradation in PSNR performance of JPEG.
The rest of the paper is organized as follows. A survey of low energy techniques is given in Section II followed by the hybrid schemes that combine different low energy techniques in Section III. Finally, Section IV concludes the paper.
II. ENERGY-SAVING TECHNIQUES
In this section, we describe the three main techniques for reducing energy consumption, namely, voltage scaling (Section II-A), reducing the number of computations (Section II-B) and reducing the dynamic range in computation (Section II-C). We describe the errors introduced by each of the techniques and ways to compensate for them to reduce algorithm-quality degradation.
A. Voltage Scaling
One of the most effective techniques for energy reduction is voltage scaling. This is due to the quadratic dependence between energy consumption and supply voltage [3]. Fig. 1(a) illustrates the normalized energy and delay plots of a 16-bit ripple carry adder (RCA) as a function of supply voltage. These were obtained with our in-house simulator based on ModelSim with 45 nm PTM models [27]; the normalization is with respect to operation at the nominal voltage of 1 V. Voltage overscaling (VOS) refers to scaling the voltage beyond the value imposed by the critical delay of the circuitry. This may result in timing violations in the datapath, resulting in erroneous operation. Fig. 1(b) illustrates the error distribution of the 16-bit RCA under voltage scaling [8]. Note that most of the errors reside in the most significant bits, which can result in significant performance degradation.
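The quadratic dependence can be made concrete with a minimal sketch. It assumes that dynamic switching energy dominates, so that energy scales as the square of the supply voltage relative to the 1 V nominal supply. The function names are illustrative and not from the paper, and the model ignores leakage and the delay increase shown in Fig. 1(a).

```python
def normalized_energy(vdd, v_nominal=1.0):
    """Dynamic energy relative to nominal, assuming E is proportional to Vdd^2."""
    if vdd <= 0 or v_nominal <= 0:
        raise ValueError("supply voltages must be positive")
    return (vdd / v_nominal) ** 2


def energy_saving(vdd, v_nominal=1.0):
    """Fractional energy saving obtained by scaling the supply to vdd."""
    return 1.0 - normalized_energy(vdd, v_nominal)


if __name__ == "__main__":
    for v in (1.0, 0.8, 0.7, 0.6):
        print(f"Vdd = {v:.1f} V: saving ~ {100 * energy_saving(v):.0f}%")
```

Under this idealized model, scaling to 0.6 V saves about 64%, which is close to the 63% reported for voltage scaling alone in Section III-B.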
1) Compensating Datapath Errors:
To mitigate the errors due to critical path violation of the computation unit under voltage scaling, algorithmic noise tolerance (ANT) has been used in [4]-[7]. Fig. 2 illustrates the general block diagram of the ANT scheme, which consists of the main block and the reduced computation block. The main block implements the original computation at full precision; thus its output is prone to critical path violation under voltage scaling. The reduced computation block is designed to generate a statistical replica of the original result with a shorter critical path. These two outputs are compared to detect any errors that may have occurred in the main block. Since VOS results in errors that are large in magnitude, the system chooses the main block output if the difference is smaller than the predetermined threshold (Thr) and the reduced computation block output otherwise. In an ANT based system, the reduced computation block needs to provide a good approximation of the original output, have low complexity circuitry to minimize overall overhead, and have a shorter critical path to ensure error-free operation. Reduction in computation is achieved by using a reduced precision replica [4], [5], subsampling [6], and prediction based error correction [7]. ANT-based systems have been applied to multimedia applications such as FIR low pass filter [4], [5], FFT [5], and motion estimation [6]. In [4], correlation of the FIR low pass filter outputs is used to correct errors, if any. To minimize the overhead, a very simple low pass filter is employed that computes the estimates of the main block. In [5], the reduced computation block is based on a 4-bit MSB implementation of FFT while the main block operates on 8-bit data. In [6], a subsampled version of the original motion estimation block is used in the reduced computation block. All these methods achieve 20% to 40% energy reduction while incurring small performance degradation.
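The ANT decision rule described above can be sketched as follows. This is only an illustration under stated assumptions: the estimator is a hypothetical reduced-precision accumulator, the VOS error is emulated by flipping one high-order bit, and the names `ant_select` and `reduced_precision_sum` are not from the cited works.

```python
def ant_select(y_main, y_est, thr):
    """Keep the main-block output unless it deviates from the estimate by thr or more."""
    return y_main if abs(y_main - y_est) < thr else y_est


def reduced_precision_sum(samples, drop_bits):
    """Hypothetical estimator: accumulate only the high-order bits of each sample."""
    mask = ~((1 << drop_bits) - 1)
    return sum(s & mask for s in samples)


if __name__ == "__main__":
    samples = [200, 131, 77, 250]
    exact = sum(samples)                          # full-precision main-block result
    estimate = reduced_precision_sum(samples, 4)  # coarse replica with a shorter carry chain
    faulty = exact ^ (1 << 9)                     # emulate a VOS error in a high-order bit
    thr = len(samples) * 16                       # exceeds the largest error of the replica
    print(ant_select(exact, estimate, thr))       # error-free case: main output is kept
    print(ant_select(faulty, estimate, thr))      # faulty case: estimate is used instead
```

The threshold is chosen here so that an error-free main output always passes the comparison, since each sample loses at most 15 in the replica.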
Recently, an algorithm-specific technique has been proposed in [8] that mitigates datapath errors during the computation of 2D-DCT and quantization in JPEG. The technique exploits encoded JPEG data features to detect the VOS induced errors. The features are based on facts such as: two adjacent AC coefficients after zig-zag scan have similar values, coefficients corresponding to higher frequencies generally have smaller values, and the number of sign extension bits is determined by the quantization level. The technique achieves high performance with small circuit overhead. Simulation results show that the proposed technique has a PSNR performance degradation of around 1.5 dB compared to the error-free case, and 4 dB improvement compared to the no correction case at compression rate of 0.75 bpp when . The overhead of this technique is quite small; it requires three simple units, namely, a 3-bit majority voter, an 8-bit coefficient comparator and an 8-bit 2-input average calculator.
2) Compensating Memory Errors: SRAM failure analysis under voltage scaling has been investigated by several researchers [9]-[11], [28]-[31]. In [28], statistical models of random dopant fluctuations (RDF) are used to determine read and write failures and access time variations. In [29], read and write noise margins of 6T SRAM cells are used to calculate the memory reliability. Fig. 3(a) illustrates the distribution of read access times at nominal and scaled voltage levels for a 32 nm SRAM cell under 40 mV RDF and 5% channel length variation. From the figure, we see that as the voltage scales down, the tail of the access time distribution becomes heavier, which manifests itself in access time errors at scaled voltages. Fig. 3(b) illustrates the read, write and total failure rates as the voltage scales from 0.8 V to 0.6 V [32]. At nominal voltage of 0.9 V, the BER is estimated to be . At lower voltages, the BERs are very high. For instance, at 0.7 V, the BER is and at 0.6 V, it climbs to . Such high error rates were also reported in [10], [29].
Several circuit, system and architecture-level techniques have been proposed to mitigate and/or compensate for memory failures. At the circuit level, different SRAM structures such as 8T and 10T have been proposed [10], [31]. In 8T and 10T structures, the data paths for read and write operations are separated to increase the robustness of the operations. However, the additional circuitry increases leakage power and circuit area by approximately 20% to 30%. The method in [10] stores the MSBs in a memory bank with 8T SRAM cells and the least significant bits (LSBs) in memory banks with 6T cells. It uses the fact that 8T SRAM cells are more robust than basic 6T SRAM cells at scaled voltages. Such a scheme achieves approximately 40% power reduction with a 15% increase in overall circuit area. The method in [11] operates the memory banks that store MSBs at a different voltage level than the ones that store LSBs. This is shown to achieve 45% power reduction with 10% degradation in image quality for a regular pixel based image-storage system.
Many techniques make use of error control coding (ECC) [8], [12], [32]-[34]. In [12], orthogonal Latin square codes are used to trade off cache size with correction capability. Extended Hamming codes, which provide single error correction, double error detection (SECDED), have been used for several years to combat failures in memory systems [32]-[34]. Their simple structure makes them appealing for applications that require low latency and power consumption. The memory area overhead of the stronger codes is very large; thus unequal error protection (UEP), which combines strong and weak codes, is a better option. The main idea of UEP is to provide superior protection to the more important bits and thus enable the area overhead to be reduced without sacrificing performance [32]. For instance, in JPEG2000, the higher subband DWT outputs are more important and so should be protected better with stronger codes. In [32], it has been shown that compared to single ECC, UEP based on different SECDED codes has 35% lower MSE for the same overhead.
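To illustrate how an extended Hamming code provides SECDED, the following minimal sketch implements the smallest such code, (8, 4). Memory systems such as those cited use much longer codes (for example, the (39, 32) code mentioned later in this section); the short code, the bit layout and the function names are assumptions chosen for readability.

```python
def secded_encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                       # index 0: overall parity, 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) & 1          # make the total number of ones even
    return code


def secded_decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'double_error'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):              # XOR of the positions that hold a one
        if code[pos]:
            syndrome ^= pos
    parity_ok = (sum(code) & 1) == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd number of flips: treat as a single error at index `syndrome`
        # (index 0 is the overall parity bit itself).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even parity with a non-zero syndrome: two bits flipped, not correctable.
        return None, "double_error"
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status
```

In a UEP arrangement, a stronger code of this kind would protect the more important bits, while a weaker code (or none) would cover the rest.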
Algorithm-specific techniques have also been developed for memory intensive multimedia applications in [13], [32], [35]. These techniques mitigate system degradation due to memory failures using additional features that are intrinsic to the algorithm. In [13], binarization and the second derivative of the image are used to detect error locations in different DWT sub-bands in JPEG2000. These are then corrected in an iterative fashion by flipping one bit at a time starting from the MSB. In [35], application-aware methods have been proposed to reduce power consumption of memories in video coding systems under VOS while maintaining the performance using simple filters after processing. Algorithm-specific techniques that exploit the characteristics of the DWT coefficients have been proposed in [32] for JPEG2000. These techniques identify and correct errors by exploiting the fact that DWT outputs at high subbands typically consist of smaller values and thus contain a small number of nonzero bits in MSB planes. Also, there is a similarity between magnitudes of neighboring coefficients. Based on these features, an error is flagged when isolated non-zero MSBs are detected at high frequency bands of the DWT output. These techniques achieve performance results close to error-free curves (only 1 dB degradation in PSNR) at 0.75 bpp when . Even for a very high bit error rate such as , the algorithm-specific scheme can achieve 7.9 dB performance improvement compared to the no correction case, 4 dB to 5 dB improvement compared to the (39, 32) Hamming ECC scheme and only 2.8 dB degradation compared to the no-error case. The overhead of these techniques is very small; the additional circuits include a 9-bit counter, a 35-bit all-zero detector and a 4-bit comparator.
B. Reducing Number of Computations
The number of computations can be reduced by choosing a smarter algorithm with a lower complexity. Examples of this include the Fast Fourier Transform implementation of the Discrete Fourier Transform, differential tree search based vector quantization [3], etc. Alternatively, the number of computations can be reduced at the cost of some performance. For instance, in block matching that is used in motion estimation, heuristic algorithms such as three step search and diamond search have lower complexity but sub-optimal performance. These search algorithms are used when the performance requirements are not that stringent and/or the energy budget is low [19]. For each search algorithm, sub-sampling can be used to further reduce the number of computations [6], [17], [18]. If 1/s is the subsampling ratio, then these schemes reduce the number of absolute difference computations in motion estimation to 1/s of the original count and result in significant energy reduction. However, there is a performance cost; for instance, increases the compression rate by 6.5% and reduces the PSNR by approximately 0.4 dB [18].
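The effect of subsampling on the absolute difference count can be sketched as below. The regular grid pattern (every `step`-th pixel in each dimension) is an assumption made for illustration; the cited schemes may use other patterns. With this pattern the subsampling ratio is 1/s with s equal to `step` squared.

```python
def subsampled_sad(cur, ref, step=1):
    """Return (SAD, number of absolute-difference operations) for a block pair."""
    total, ops = 0, 0
    for r in range(0, len(cur), step):
        for c in range(0, len(cur[r]), step):
            total += abs(cur[r][c] - ref[r][c])
            ops += 1
    return total, ops


if __name__ == "__main__":
    # Synthetic 16x16 blocks that differ by 2 at every pixel.
    cur = [[3 * r + 5 * c for c in range(16)] for r in range(16)]
    ref = [[3 * r + 5 * c + 2 for c in range(16)] for r in range(16)]
    print(subsampled_sad(cur, ref, step=1))  # full search cost per candidate
    print(subsampled_sad(cur, ref, step=2))  # one quarter of the operations
```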
An effective way of reducing the number of computations with minimal degradation in system performance is to exploit the fact that different portions of the computation have different levels of significance for the overall system quality. One of the earliest works in this area was done for least mean square (LMS) type adaptive filters in [36], where the filter length was determined based on the energy consumption and system performance requirements. In DCT implementation, for instance, most of the image energy resides in the low frequency coefficients, and higher frequency coefficients can be sacrificed when good enough quality is achieved [2], [16]. Similarly, in FIR filtering, larger filter taps contribute more to system performance, and so sorting the filter taps in decreasing order of magnitude and computing on larger coefficients first helps achieve energy saving with reduced overall quality degradation [16]. Other examples include multilevel DWT, where the coefficients are computed incrementally one bit plane at a time until the desired quality is achieved [20], and salient point detection, where the detection result is refined as the image precision improves [37].
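The significance-ordered FIR computation described above can be sketched as follows. The function name and the integer example taps are illustrative assumptions; the sketch simply evaluates the largest-magnitude taps first and stops after a chosen number of multiply-accumulate operations.

```python
def partial_fir_output(taps, window, num_taps_used):
    """Approximate one FIR output using only the largest-magnitude taps.

    taps[i] multiplies window[i], where window[i] holds the input x(n - i).
    """
    order = sorted(range(len(taps)), key=lambda i: abs(taps[i]), reverse=True)
    return sum(taps[i] * window[i] for i in order[:num_taps_used])


if __name__ == "__main__":
    taps = [1, 2, 4, 2, 1]       # symmetric low pass taps (unnormalized)
    window = [10, 10, 10, 10, 10]
    for k in (5, 3, 1):
        print(k, partial_fir_output(taps, window, k))
```

With these example values, three of the five multiplications already recover 80% of the full output, which is the kind of trade-off the ordering is meant to expose.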
C. Reducing Dynamic Range in Computation
Reducing the datapath precision to lower power consumption is a popular technique in signal processing systems. Typically, high order bits contain most of the information while low order bits capture the details of the application. Fig. 4 illustrates the savings in energy consumption of a 16-bit RCA for different bit widths in 45 nm technology. Since the RCA has a regular structure, the energy reduction is proportional to the reduction in the bit-width of the adder. For instance, at nominal voltage, we observe 24% reduction in energy consumption of the adder when 12 bits are used instead of 16 bits.
One drawback of reduced precision arithmetic is that it introduces truncation errors. Fig. 4 also plots truncation noise, defined as the magnitude of the difference between the output obtained with full precision data and the output obtained with truncated data, scaled by the full precision output. From Fig. 4, we see that while truncation noise increases logarithmically, energy saving of the adder increases linearly with the number of truncated bits. Low order bit truncation can easily be applied to other multimedia applications such as filtering and DCT; however, one of the main challenges is to compensate for the quality degradation caused by reduced precision.
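The truncation noise metric defined above can be computed for a single addition with a short sketch. Truncation is modeled as an AND mask that zeroes the low order bits of each unsigned operand; the function names are illustrative and not from the paper.

```python
def truncate(value, num_bits):
    """Zero the num_bits low-order bits of a non-negative integer (AND-mask truncation)."""
    return value & ~((1 << num_bits) - 1)


def adder_truncation_noise(a, b, num_bits):
    """|full - truncated| / full for one addition, following the definition in the text."""
    full = a + b
    if full == 0:
        raise ValueError("metric is undefined when the full-precision output is zero")
    approx = truncate(a, num_bits) + truncate(b, num_bits)
    return abs(full - approx) / full


if __name__ == "__main__":
    for bits in range(1, 7):
        print(bits, round(adder_truncation_noise(183, 77, bits), 4))
```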
Additional energy saving can be obtained by approximating the computations in the datapath components [38]-[43]. Adders and multipliers that trade off accuracy for lower power consumption by reducing the carry chain have been proposed in [38]-[40]. For instance, the modified Kogge-Stone adder in [38] operates on a shorter critical path, but the errors are not as significant since the probability of having timing violations with a shorter carry chain is not very high. The RCA based error-tolerant adder proposed in [39] partitions the carry chain into variable width segments in which the MSB side has longer segments than the LSB side to reduce the error. Multiplier architectures in [40] truncate the partial product generation at the LSB end, resulting in small truncation noise while achieving significant energy savings.
In addition, the building blocks of the adders and multipliers can also be approximated by selectively removing some minterms of their Boolean functions [41]-[43]. For instance, [42] describes a 2×2 under-designed, inaccurate multiplier which is used to implement a larger multiplier for image processing applications. Based on the system requirements, a correction term is introduced to reduce the degradation in algorithmic performance. A more general scheme is proposed in [43] to reduce the area of combinational logic for a given error rate threshold. During the synthesis phase, the number of literals used in the logic function is reduced by complementing the minterms of the original function.
1) Low Order Bit Truncation:
Bit truncation methods that remove low order bits have been very effective for motion estimation [17], [18], [24]. In [24], instead of using 8 bits, only 4 or 5 of the higher order bits are kept to reduce the activity in less important regions. In [18], the performance degradation and increase in compressed data rate have been studied for low order bit truncation in motion estimation used in H.264. Fig. 5 illustrates the average degradation over several video sequences for low order bit truncation ranging from 1 bit to 4 bits for diamond search (DS) and three step search (TSS) strategies [18]. Here the performance metrics are the change in compression rate and the change in PSNR; lower values of both correspond to better quality. Since motion estimation is based on subtraction of the pixel values, the expected performance degradation is not very high. This is because subtraction is more tolerant to truncation noise than addition or multiplication operations. In algorithms whose building blocks are multiplications and additions, such as DCT and FIR filter, the truncation error has to be compensated to maintain good image quality. Next, we describe the proposed technique for compensating truncation error.
Truncation Errors: Analysis and Compensation
First, we investigate the effect of bit truncation on simple arithmetic operations such as addition, subtraction and multiplication. We describe the error characteristics for operations on unsigned numbers; however, the procedure can easily be extended to operations on two's complement and signed numbers. Next, we describe a method to reduce the effect of truncation based errors on system quality. The output of a DSP system after LSB truncation at time instant $n$ can be expressed as:
$$y_t(n) = y(n) - e(n)$$
where $y(n)$ is the truncation-free output and $e(n)$ is the truncation induced error (noise), which is a random variable with mean $\mu$ and variance $\sigma^2$. The noise power can be represented by the mean square error (MSE) defined as $E[e^2(n)] = \mu^2 + \sigma^2$. In order to reduce the noise power, we propose a method that estimates the mean value of the truncation error during the pre-computation stage and compensates for it. We refer to this method as $\mu$-compensation. The overhead of this method is very small. Moreover, the noise power after $\mu$-compensation does not depend on $\mu$ anymore and is only a function of the variance of the truncation error.
Let us consider a system whose inputs are originally represented with $B$ bits, $x = \sum_{i=0}^{B-1} b_i 2^i$. When $L$ bit truncation is employed, where $L < B$, the input becomes $x_t = \sum_{i=L}^{B-1} b_i 2^i$. Assuming uniformly distributed input signals, we can express $e_x$, the truncation error for the input signal $x$, as:
$$e_x = x - x_t = \sum_{i=0}^{L-1} b_i 2^i$$
where $b_i$ is an independent, uniform random variable with two discrete values: 0 and 1. The expected value and variance of $e_x$ are given by,
$$\mu_{e_x} = \sum_{i=0}^{L-1} 2^i \mu_b = \frac{2^L - 1}{2} \qquad (1)$$
$$\sigma_{e_x}^2 = \sum_{i=0}^{L-1} 4^i \sigma_b^2 = \frac{4^L - 1}{12} \qquad (2)$$
where $\mu_b = 1/2$ and $\sigma_b^2 = 1/4$ are the mean and variance of $b_i$ and $\sigma_{e_x}^2$ is the variance of $e_x$. Using (1) and (2), we can compute the expected value and variance of the truncation error of an adder with inputs $x_1$ and $x_2$. Both inputs are independent and in both cases the lower $L$ bits (out of $B$ bits) have been truncated, so the adder error $e_{x_1} + e_{x_2}$ has mean $2^L - 1$ and variance $(4^L - 1)/6$. Using a similar analysis, we can compute the expected value and variance of subtraction and multiplication. Details of the calculation for multiplication are given in the Appendix. Fig. 6 illustrates how the noise power (MSE) of 16-bit multiplication of unsigned numbers can be reduced with $\mu$-compensation. We see that the analytical results and simulated results match very closely. Moreover, $\mu$ plays an important role in determining the noise power, and compensating for $\mu$ helps reduce the MSE by $\mu^2$. Since the noise power contains the term $\mu^2$, the proposed method helps in reducing the noise power for computations such as additions and multiplications. It does not help in the case of subtractions since the $\mu$ of subtraction is 0. Furthermore, we see that the noise power of an '$L+1$' bit truncation with $\mu$-compensation and that of an '$L$' bit truncation without compensation are comparable. However, since the overhead of $\mu$-compensation is very small, a system with a larger number of truncation bits has larger energy savings, as will be shown in Section III-C. Next we illustrate the use of the $\mu$-compensation method to compensate for errors in DCT and FIR filter computation.
Example 1: Consider the 8-point DCT used in JPEG, where the inputs are pixels in row or column order. Typically the 8-point DCT is computed along rows and the coefficients are transposed so that data for the 8-point DCT along columns can be obtained efficiently. The properties of the coefficient matrix are used to reduce the number of multiplications. Below is one such method of implementing the odd and even coefficients.
where , , ,
. Fig. 7 describes the architecture to compute 4 DCT coefficients ( and ) of the 8-point DCT used in JPEG. The AND gates at the inputs are used to implement input bit truncation. After pairwise subtraction and addition of the pixels, we obtain to , where , , , and . For and , common sub-expression elimination is used to obtain results with small number of computation units as illustrated in Fig. 7(b) . Implementation of is illustrated in Fig. 7(c) ; a variant of which is used for . Fig. 7(d) shows the computation structure used to find . The odd coefficients ( and ) are computed using units that are similar to the unit for . We calculate the truncation noise (TN) for the DCT outputs for a 14 bit fixed point implementation of DCT, where 12 bits represents the integer part and last 2 bits represent the fractional part of the computation. The expected errors due to truncation in and can be expressed as follows. To simplify our analysis, we assume that all Y values are uncorrelated and so the expected value for L bit truncation is . Since , the expected truncation error for is given by
. Similarly expected value of the truncation error for is given by , and that of is given by . The expected value of truncation noise for and are also zero. The expected truncation noise values are used as unbiased estimators to compensate for the errors. Instead of compensating for errors of all the outputs, we only compensate for errors in the computation of and . The motivation for this is that these coefficients are the most important ones and the corresponding estimation errors are the largest. Also, this keeps the complexity of the overhead circuitry small. Fig. 8 illustrates the compensation mechanism for computation. The overhead of this scheme is a 14-bit adder at the output as well as the AND gates to disable a selective set of input bits. The area and power overhead due to extra processing elements is around 2% of the overall DCT implementation. Fig. 9 illustrates the performance improvement with the use of unbiased estimators for and when low order bits are truncated for DCT computation of the Baboon image. For 1 bpp compression rate, 4-bit truncation causes a degradation of 1.3 dB, which is reduced to 0.6 dB with compensation. For the same 1 bpp compression rate, when 6 bits are truncated, the performance improvement is approximately 1.2 dB compared to the system without compensation. Thus as the truncation level increases, we observe higher performance improvements in systems that use compensation.
Example 2: Consider a FIR low pass filter (LPF) using unsigned inputs and coefficients, which is typical in many multimedia algorithms. The output y(n) of an N-tap filter with -bit precision is given by $y(n) = \sum_{i=0}^{N-1} h_i x(n-i)$, where $h_i$ is the $i$'th coefficient of the filter and $x(n-i)$ represents the input value at time $n-i$. Such a computation can be implemented efficiently using MAC based architectures. We can calculate the unbiased estimator for L-bit truncation assuming that the coefficients are less than one.
(3)
When filter coefficients are known, the estimator given in (3) reduces to:
where represents the sum of filter tap coefficients for LPF given by . As an example, for a 3×3 Gaussian filter when and , the unbiased estimator value is 3; this value increases to 15 when . Fig. 10 illustrates the block diagram of the proposed MAC based architecture for LPF. Filter coefficients and input data are truncated using an array of AND gates before the multiplication; thus only high order bits become active during computation. After N cycles of MAC computation, the correction factor is applied to reduce errors due to truncation. The performance results of this filter are given in Section III.
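The benefit of compensating the mean truncation error can be checked exhaustively for small word lengths. The sketch below is an illustration under stated assumptions, not the paper's simulation setup: it uses 6-bit unsigned operands (rather than 16-bit) so that all operand pairs can be enumerated, defines the error as the full-precision result minus the truncated result, and uses the hypothetical helper names `truncate` and `truncation_stats`.

```python
from itertools import product


def truncate(value, low_bits):
    """Zero the low_bits least significant bits of an unsigned integer."""
    return value & ~((1 << low_bits) - 1)


def truncation_stats(op, total_bits, low_bits):
    """Exhaustively compute (mean error, MSE, MSE after mean compensation).

    op is a two-argument function such as addition or multiplication; every
    pair of total_bits-wide unsigned operands is enumerated with equal weight.
    """
    errors = [
        op(a, b) - op(truncate(a, low_bits), truncate(b, low_bits))
        for a, b in product(range(1 << total_bits), repeat=2)
    ]
    mean = sum(errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    mse_compensated = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, mse, mse_compensated


if __name__ == "__main__":
    print(truncation_stats(lambda a, b: a + b, 6, 2))
    print(truncation_stats(lambda a, b: a + b, 6, 3))
    print(truncation_stats(lambda a, b: a - b, 6, 2))
    print(truncation_stats(lambda a, b: a * b, 6, 2))
```

For addition with 2 truncated bits, the enumeration gives a mean error of 3.0, an MSE of 11.5 and a compensated MSE of 2.5. Truncating one more bit with compensation gives 10.5, which is comparable to the uncompensated 2-bit case. For subtraction the mean error is zero, so compensation brings no benefit there.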
2) High Order Clipped Computation:
It is not always the case that the low order bits are less significant in computation and so can be dropped. In sum of absolute difference (SAD) computation used in motion estimation, for instance, the high order bits can be dropped. The proposed scheme in [18] uses the statistics of absolute difference (AD) and SAD computations to reduce the dynamic range and approximate the computations. Specifically, it exploits the fact that most of the AD values are small due to locality of current and reference blocks, and that most of the large AD values are for blocks that are likely not to be selected, and thus these values can be approximated. Fig. 11 illustrates the distribution of the AD values, SAD values and selected SAD values for the Football video sequence. From the distributions, we see that the dynamic range of the selected SAD values is significantly lower than the dynamic range when all SAD values are taken into consideration. Thus it is not necessary to operate on the MSBs with much care during SAD computation. The scheme in [18] detects large AD values using special logic, and the corresponding SAD values are updated with a correction factor. The resulting architecture has a lower critical path delay compared to the baseline architecture and significantly lower energy consumption. It achieves 37.5% energy reduction at nominal voltage and 68% reduction for iso-throughput while incurring 1.8% increase in compressed data size and approximately 1.3 dB reduction in PSNR.
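The idea that large AD values need not be computed exactly can be illustrated with a simplified stand-in. The sketch below saturates each absolute difference at the largest value a reduced word length can hold; this is an assumption made for illustration and is not the detection logic or correction factor of [18]. The function names and example blocks are hypothetical.

```python
def clipped_sad(cur, ref, ad_bits):
    """SAD in which each absolute difference saturates at 2**ad_bits - 1."""
    limit = (1 << ad_bits) - 1
    return sum(min(abs(c - r), limit) for c, r in zip(cur, ref))


def best_candidate(cur, candidates, ad_bits=8):
    """Index of the candidate block with the smallest (clipped) SAD."""
    scores = [clipped_sad(cur, cand, ad_bits) for cand in candidates]
    return scores.index(min(scores))


if __name__ == "__main__":
    cur = [10, 12, 14, 16]
    candidates = [
        [11, 12, 15, 16],   # close match, small ADs
        [200, 12, 14, 16],  # one very large AD
        [40, 50, 60, 70],   # uniformly poor match
    ]
    print(best_candidate(cur, candidates, ad_bits=8))  # exact ADs for 8-bit pixels
    print(best_candidate(cur, candidates, ad_bits=4))  # ADs clipped to 4 bits
```

In this example the selected block is the same with and without clipping, because the clipped values belong to candidates that would not have been chosen anyway.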
III. HYBRID CONFIGURATIONS
A combination of the energy saving techniques in Section II helps achieve even higher energy savings, as will be demonstrated in this section.
A. Combining Computation Reduction and Voltage Scaling
Several significance driven techniques, where the significant components have shorter delay and the less significant components have longer delay, have been proposed in [44]-[47]. These techniques are very effective in reducing the energy consumption without affecting the quality too much. At nominal voltage, all computations meet the critical path constraint, while at scaled voltage levels those which have a higher critical path delay than that allowed by the operating frequency are disabled. For instance, selective deactivation of DCT coefficients based on the operating voltage has been proposed in [44]. Since low frequency DCT coefficients contain most of the input image energy, they are significant and are implemented with the shortest critical path. It has been shown that 41% to 90% power saving is possible compared to the baseline scheme with up to 10 dB degradation in PSNR. A similar approach is applied in [45] for color interpolation, where only less important computations are affected by voltage scaling and process variation. Such a scheme achieves 40% power savings with 5 dB PSNR degradation.
These significance driven techniques have also been applied to support vector machines that are widely used in data-mining applications [47] . Here the number of support vectors and features per vector are traded-off and voltage over-scaling is used at the circuit level to minimize the energy for a given quality of service. A more general method has been proposed in [48] where the behavior and cost of the computation components are modeled based on their importance on system performance. The supply voltage levels for a specific part of the circuitry are determined based on bit significance (profit) and energy cost (investment).
B. Combining Voltage Scaling and Dynamic Range Reduction
Voltage scaling and dynamic range reduction are two complementary energy reduction techniques. While voltage scaling reduces the energy consumption, it increases the delay of the computation unit and can cause timing errors. However, if reduced precision operation is acceptable, the critical path of the computation is shorter and timing errors due to voltage scaling can be avoided. We illustrate this with the help of a 16-bit adder example and then show the effectiveness of this method in achieving energy reduction with minimal quality degradation for a low pass FIR filter.
Consider a simple 16-bit RCA implemented using modelSim with 45 nm PTM model and simulated for uniformly distributed inputs. Fig. 12(a) illustrates the change in critical path delay under voltage scaling; the target delay of the adder at nominal voltage and full precision is illustrated with a dashed line parallel to x-axis. As expected, the critical path delay of the 16-bit adder increases rapidly with voltage scaling. For instance, at 0.8 V, the increase in critical path delay is approximately 45% of the target delay.
The increase in critical path delay can be offset by using a lower precision arithmetic unit that has a shorter critical path. For instance, voltage scaling induced errors due to critical path violation when operating the 16-bit adder at 0.8 V are prevented by truncating the 5 low order bits and operating only on the 11 MSBs. Similarly, critical path violation at 0.7 V is prevented by operating on the 8 MSBs. Fig. 12(b) shows the difference in energy saving between a scheme where only voltage scaling is used and a scheme where a combination of voltage scaling and reduced precision is used. At 0.6 V, voltage scaling alone reduces the energy consumption by 63%, while the combination reduces it by 89%.
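The 0.8 V operating point quoted above is consistent with a simple first-order model. As a minimal sketch, assume (this is our assumption, not a result from the simulations) that the critical path delay of an RCA grows linearly with the operand width; the helper name `max_bits_meeting_timing` is hypothetical.

```python
import math


def max_bits_meeting_timing(delay_scale: float, full_bits: int = 16) -> int:
    """Largest RCA width that still meets the nominal-voltage target delay.

    Assumption (illustrative only): ripple-carry delay is proportional to
    the operand width, so a w-bit adder slowed down by `delay_scale` has a
    relative delay of (w / full_bits) * delay_scale.
    """
    if delay_scale <= 0:
        raise ValueError("delay_scale must be positive")
    # Solve (w / full_bits) * delay_scale <= 1 for the largest integer w.
    return min(full_bits, math.floor(full_bits / delay_scale))


# A 45% delay increase (the 0.8 V case in the text) gives delay_scale = 1.45.
print(max_bits_meeting_timing(1.45))  # 11
```

Under this linear model, a 45% increase in delay leaves room for 11 of the 16 bits, which matches the 11 MSBs used at 0.8 V. A real design would rely on the simulated delay curves rather than on this approximation.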
Next, we analyze the average error induced by voltage scaling and by the combined technique. At nominal voltage both systems operate at full precision, and so the average error is zero. At scaled voltage levels, both systems have comparable average error per operation, as shown in Fig. 12(c) . For instance, at 0.8 V, the 11-bit adder and the 16-bit adder have the same error per operation, but the 11-bit adder has 20% lower energy (Fig. 12(b) ). The effect of VOS on adders has been formulated in [8] , [49] , which use the internal architecture of the adders to estimate the noise power. Furthermore, the truncation noise can be lowered using the compensation technique described in Section II. Even without compensation, the combined technique achieves much higher energy saving than voltage scaling alone for a comparable error per operation.
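To make the truncation component of the error per operation concrete, the following sketch zeroes the low order bits of both operands before adding and averages the resulting error exhaustively. It models truncation only and ignores VOS-induced timing errors; the function names are hypothetical and unsigned operands are assumed.

```python
def truncated_add(a: int, b: int, drop_bits: int) -> int:
    """Add two unsigned integers after zeroing their `drop_bits` low order bits."""
    mask = ~((1 << drop_bits) - 1)
    return (a & mask) + (b & mask)


def mean_truncation_error(drop_bits: int) -> float:
    """Average of (exact sum - truncated sum) over all low order bit patterns.

    Only the dropped bits contribute to the error, so enumerating them is
    equivalent to averaging over uniformly distributed operands.
    """
    n = 1 << drop_bits
    total = sum(
        (a + b) - truncated_add(a, b, drop_bits)
        for a in range(n)
        for b in range(n)
    )
    return total / (n * n)


print(mean_truncation_error(5))  # 31.0
```

With 5 dropped bits each operand loses 15.5 on average, so the mean error of the sum is 31. This bias is deterministic, which is why a constant compensation term can remove most of it.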
We see similar trends in delay, energy and error performance for more complex adders such as the carry-lookahead adder (CLA). Fig. 13 shows normalized energy as a function of the supply voltage for a 16-bit CLA. We see that the CLA supports more aggressive truncation; for instance, when operated at 0.8 V, it uses only 8 bits (out of 16 bits) and thus achieves higher energy savings. However, the average error per operation for the CLA is typically larger than that of the RCA at the same voltage level, and the RCA tends to have better energy performance for the same error level. This is in agreement with the result presented in [49] .
Next, we present the results of this procedure on real image data. Consider processing the Lena image with a 3 × 3 Gaussian filter using a MAC based architecture with 8-bit precision. The multiplier is implemented using a carry save adder tree and the final stage is implemented using an RCA. Both the inputs and the filter coefficients are truncated by the same number of bits. Fig. 14(a) shows the mean squared error noise power (VOS induced + truncation) vs. normalized energy consumption for various levels of low order bit truncation without compensation. Each point on a curve corresponds to a specific supply voltage level. Noise power is calculated as the mean square error (MSE) between the LPF results obtained with voltage scaling and those obtained with nominal voltage operation. Note that MSE can be converted to PSNR using $\mathrm{PSNR} = 10\log_{10}(255^2/\mathrm{MSE})$ for 8-bit data; however, we choose to use MSE here since it provides greater insight into the error performance. From this figure, we see that the full precision LPF (original) shows a large increase in noise level when the voltage is scaled to 0.9 V, with only about 20% energy saving. On the other hand, 2-bit truncation operating at 0.9 V has lower noise power and 45% energy saving. Thus, dynamic precision adjustment with voltage scaling achieves considerably better performance than voltage scaling alone. Next, we study the effect of truncation noise compensation. Fig. 14(b) illustrates the performance of the LPF when the estimator described in Section II-C1 is applied. For 4-bit truncation operating at 0.9 V, the noise power reduces by 66% when the compensation unit is used. The overhead is very small, since the compensation unit is activated only 1/N of the time, where N is the number of filter taps. At full precision, we have approximately 5% overhead compared to the original MAC unit because of the final adder illustrated in Fig. 10 .
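For readers who prefer PSNR, the noise power reported here can be converted with the standard definition for 8-bit data. The sketch below is illustrative only; the function names and the sample values are hypothetical, and the peak value of 255 is an assumption tied to 8-bit pixels.

```python
import math


def mse(reference, test):
    """Mean squared error between two equally sized sequences of samples."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)


def psnr_from_mse(mse_value: float, peak: float = 255.0) -> float:
    """PSNR in dB for a given MSE; returns infinity for a perfect match."""
    if mse_value <= 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse_value)


# Hypothetical filter outputs at nominal and at scaled voltage.
nominal = [10, 20, 30]
scaled = [10, 22, 27]
print(psnr_from_mse(mse(nominal, scaled)))
```

Because PSNR is logarithmic in MSE, a 66% reduction in noise power corresponds to a gain of roughly 4.7 dB regardless of the starting point.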
Finally, Fig. 15 shows the Pareto-optimal curves for voltage scaling in combination with truncation, with and without compensation. These curves are generated by connecting the best configurations, shown in circles in Figs. 14(a) and 14(b) . We see that the combination scheme achieves better performance than voltage scaling alone at all levels. Furthermore, truncation with compensation achieves higher energy saving for the same noise power. For instance, at MSE (which is approximately PSNR ), FIR using compensation achieves 16% extra energy saving compared to FIR with no compensation.
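Selecting the best configurations amounts to keeping the operating points that are not dominated in both energy and noise power. A minimal sketch of this selection follows; the function name and the sample (energy, noise) pairs are hypothetical and are not taken from Fig. 14.

```python
def pareto_front(points):
    """Return the (energy, noise) points not dominated by any other point.

    A point q dominates p when q is no worse in both objectives and
    strictly better in at least one. Both objectives are minimized.
    """
    front = []
    for p in points:
        dominated = any(
            q[0] <= p[0] and q[1] <= p[1] and (q[0] < p[0] or q[1] < p[1])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical (normalized energy, MSE) pairs for several voltage and
# truncation settings.
configs = [(1.0, 0.0), (0.8, 5.0), (0.55, 3.0), (0.4, 20.0), (0.6, 25.0)]
print(pareto_front(configs))  # [(0.4, 20.0), (0.55, 3.0), (1.0, 0.0)]
```

The quadratic-time scan is adequate for the few dozen configurations that arise from a handful of voltage levels and truncation settings.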
We repeat the analysis for three different 3 × 3 Gaussian filters ( and ) and for two different MAC based architectures, one with an RCA in the final stage and the other with a CLA in the final stage. The MSE improvement is calculated as the difference between the MSE of truncation with and without compensation. We use four sample images (Baboon, Lena, Flight, and Pepper) and list the average MSE improvement in Table I . We see that the MAC with RCA has slightly higher MSE improvement. While the MSE performance of the two MAC based systems is slightly different, both benefit from the use of this technique. We compare the performance of the proposed voltage scaling with dynamic range reduction technique with the ANT technique for FIR filtering [4] . The reduced computation block in the ANT system is a filter that uses 4 MSBs for both filter coefficients and input values. It consumes approximately 23% extra energy at nominal voltage but at 0.8 V, the ANT system achieves 20% energy reduction for MSE noise power. In comparison, the proposed technique achieves 85% energy reduction for the same level of noise power.
C. Combining Computation Reduction and Dynamic Range Reduction
Computation reduction and dynamic range reduction techniques both try to keep significant computations while removing less significant portions of the computation. The combination is highly dependent on quality requirement and characteristics of the application. We illustrate this method using DCT as a case study.
Here the combination is based on DCT coefficient deactivation and low order bit truncation. The DCT architecture under consideration is given in Fig. 7 . In DCT coefficient deactivation, DCT coefficients are deactivated starting from the highest frequency component of 1D DCT . Thus it is not possible to deactivate without deactivating . In low order bit truncation, inputs are truncated for the entire computation unit with a granularity of 2-bit. These two techniques are combined in such a way that the performance degradation is minimized. Fig. 16 illustrates the proposed methods for 14-bit fixed point DCT implementation. The solid red line in Fig. 16 illustrates the scenario in which is deactivated and 4 low order bits are truncated in the rest of the coefficients. The above procedure can be implemented by controlling the AND gates at the inputs of each DCT coefficient computation unit as illustrated in Fig. 7 .
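The two knobs described above can be illustrated on a single 8-point 1D DCT. The sketch below is a floating-point stand-in for the fixed point hardware, not the architecture of Fig. 7: it masks the low order bits of non-negative integer inputs and zeroes coefficients starting from the highest frequency. The function names and the orthonormal DCT-II scaling are our assumptions.

```python
import math


def dct_1d(x):
    """Orthonormal DCT-II of a sequence (floating-point reference)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * acc)
    return out


def reduced_dct_1d(x, trunc_bits=0, deactivated=0):
    """DCT with input bit truncation and high-frequency deactivation.

    trunc_bits:  number of low order input bits forced to zero
                 (inputs are assumed to be non-negative integers).
    deactivated: number of coefficients zeroed, counted from the
                 highest frequency downward.
    """
    if not 0 <= deactivated <= len(x):
        raise ValueError("deactivated must be between 0 and len(x)")
    mask = ~((1 << trunc_bits) - 1)
    coeffs = dct_1d([v & mask for v in x])
    keep = len(coeffs) - deactivated
    return coeffs[:keep] + [0.0] * deactivated


# 2-bit truncation maps 19 to 16; the two highest coefficients are gated off.
print(reduced_dct_1d([19] * 8, trunc_bits=2, deactivated=2)[0])
```

Zeroing a coefficient here stands in for gating the inputs of its computation unit, so in hardware the corresponding multipliers and adders would not toggle at all.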
Next, we describe a scheme to combine coefficient deactivation and low order bit truncation. First we note that there is a crossover point in performance where it becomes better to deactivate a coefficient instead of applying aggressive bit truncation to that coefficient. Fig. 17 illustrates the PSNR performance of Baboon image as a function of low order bit truncation for coefficients. We see that deactivation of and coefficients become more attractive after truncating 7 bits (out of 14). To improve confidence of the crossover point, we investigate the performance of 6 sample images (Baboon, Lena, Flight, Pepper, House and Bridge) and find that it is better to deactivate the DCT coefficient rather than truncating 6 bits. Thus, in our procedure, we limit the low order bit truncation to 4 levels with granularity of 2 bits, namely 0 bit (no truncation), 2 bit, 4 bit, and 6 bit truncation.
Next, we determine the order in which coefficient deactivation and low order bit truncation are applied using a binary decision tree as illustrated in Fig. 18 . We start from full precision, and at Level 1 choose between two competing schemes: 2-bit low order truncation and deactivation based on PSNR. If 2-bit truncation provides better performance, then we pick that branch. In Level 2, we choose between 4-bit truncation and deactivation with 2-bit low order bit truncation. If in Level 1,  was deactivated, then in Level 2, we choose between deactivation and 2 bit truncation of all other coefficients or and deactivation. Table II lists the reduction order of 6 images using the binary decision tree method. We read the table from left to right in increasing level number so Level 4 for Lena image corresponds to deactivation of coefficients , and -bit truncation of all coefficients. Using majority voter scheme for each level (each column of Table II) , we form a general order which is given in the last row of Table II . In this order, 2-bit low order truncation is followed by deactivation. Then, -bit low order bit truncation is applied to all the coefficients. Note that, since we consider the same computation units for the and pair to minimize the circuitry, we do not deactivate one of the members of this pair unless both of them are eligible to be deactivated. Thus in Level 7, we do not deactivate . The reduction order for the first eight levels of majority voter result is also illustrated in Table II to the right.
Fig. 19. Performance of the resulting decision order generated using 6 training samples on test images (a) Lake, and (b) Elaine.
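The per-level majority vote can be sketched as a column-wise mode over the per-image reduction orders. The step labels below ("T2" for 2-bit truncation, "D" for a deactivation step, and so on) and the three sample orders are hypothetical placeholders and are not the entries of Table II.

```python
from collections import Counter


def majority_order(per_image_orders):
    """Pick, for every level, the step chosen by most training images.

    per_image_orders: list of equally long lists, one per image, where
    element i is the reduction step selected at level i + 1.
    """
    levels = zip(*per_image_orders)
    return [Counter(level).most_common(1)[0][0] for level in levels]


# Hypothetical orders for three training images.
orders = [
    ["T2", "D", "T4"],
    ["T2", "D", "T4"],
    ["D", "T2", "T4"],
]
print(majority_order(orders))  # ['T2', 'D', 'T4']
```

This sketch votes on each level independently. It does not check that the voted sequence is itself a valid reduction path, nor does it enforce the constraint on coefficient pairs that share a computation unit; both would need an extra pass.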
Using Table II , we determine suitable configurations for three PSNR degradation schemes: i) Scheme-I (PSNR dB), ii) Scheme-II (PSNR dB), and iii) Scheme-III (PSNR dB). Here, PSNR is defined as the reduction in the PSNR value of the modified scheme compared to the baseline scheme. We use 6 sample images (Lena, Pepper, Bridge, Baboon, Flight and House) in our evaluation. For a given quality metric (Q) which is used in JPEG [50] , we find all configurations that satisfy the PSNR constraint and choose the one that provides the highest saving in computation. We use the majority voter order generated in Table II to determine the priority of a configuration. Table III lists the combination orders for the three schemes.
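The selection rule for a given Q can be sketched as a constrained maximization: keep the configurations whose PSNR loss stays within the budget and take the one with the largest saving. The configuration names, loss values, savings and budgets below are hypothetical and do not reproduce Table III.

```python
def pick_config(configs, max_psnr_loss_db):
    """Choose the feasible configuration with the highest saving.

    configs: list of (name, psnr_loss_db, saving) tuples, listed in
    priority order; ties in saving go to the earlier entry.
    Returns None when no configuration meets the constraint.
    """
    feasible = [c for c in configs if c[1] <= max_psnr_loss_db]
    if not feasible:
        return None
    # max() returns the first maximal element, preserving priority order.
    return max(feasible, key=lambda c: c[2])


# Hypothetical measurements for one Q value.
levels = [
    ("L0", 0.0, 0.00),
    ("L1", 0.3, 0.12),
    ("L2", 0.8, 0.25),
    ("L3", 1.4, 0.40),
]
for budget in (0.5, 1.0, 1.5):
    print(budget, pick_config(levels, budget)[0])
```

Running the selection once per Q value and per degradation budget yields a lookup table that can be stored alongside the quantization tables, so no search is needed at run time.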
We test the effectiveness of the combination schemes given in Table III using five test images (Lake, Tank, Elaine, Feather and Boat). Fig. 19 illustrates the results for Elaine and Lake images. For instance, for Lake image at , Scheme II (corresponding to PSNR dB) results in PSNR of 32.7 dB, which is only 0.6 dB lower than the original PSNR. Thus the proposed method guarantees that the PSNR constraints are satisfied for Q values from 75 down to 5.
Next, we calculate the power consumption of the original and proposed schemes for different configurations. All multiplications are implemented using carry save adder structures. (2.9%) and leakage (3.6%) compared to the original implementation due to extra units that are used to gate inputs and compensate truncation error. Overall, the proposed scheme provides flexible performance with reduced power consumption for different quality requirements. For instance, using configuration order of L8, we save 61% power consumption and have 14% extra timing slack compared to original full precision DCT engine. The timing slack can be absorbed by operating at 0.9 V (instead of 1 V) resulting in 68% saving in power consumption.
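The step from 61% to 68% saving is close to what a quadratic power-voltage model predicts. The sketch below assumes that power is purely dynamic and scales with the square of the supply voltage at a fixed clock frequency; leakage and the gating overhead are ignored, and the function name is hypothetical.

```python
def saving_with_voltage_scaling(saving_at_nominal, v_scaled, v_nominal=1.0):
    """Total power saving after also lowering the supply voltage.

    Assumes power proportional to V^2 at a fixed clock frequency, so the
    remaining power is multiplied by (v_scaled / v_nominal) ** 2.
    """
    remaining = (1.0 - saving_at_nominal) * (v_scaled / v_nominal) ** 2
    return 1.0 - remaining


# 61% saving at 1 V, with the timing slack spent on running at 0.9 V.
print(round(saving_with_voltage_scaling(0.61, 0.9), 2))  # 0.68
```

The remaining power is 0.39 × 0.81 ≈ 0.316 of the original, which is a saving of about 68% and agrees with the figure reported for operation at 0.9 V.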
Finally, we present the power savings of Schemes I, II and III for Q values from 75 to 5 in Fig. 20 . We see that Scheme III always achieves the highest power saving due to higher allowable degradation. As we move from high quality (Q large) regions to low quality regions (Q small), we see an increase in the power savings. This is because the combinations used in the low quality regions are quite aggressive in terms of power savings. On average, we achieve 33% power saving for Scheme I, 39% power saving for Scheme II and 46% power saving for Scheme III.
We compare the proposed scheme with the ANT technique [5] , significance driven technique in [44] and adaptive truncation technique in [23] . ANT based scheme using 4-bit MSB replica of the DCT in the reduced computation block has overhead compared to original DCT. We found that it achieves 23% energy saving with approximately 5 dB degradation in PSNR. Significance driven technique achieves 47% energy reduction with more than 4 dB degradation in PSNR [44] . The truncation based technique in [23] achieves up to 40% energy saving while inducing approximately 0.2 reduction in MSSM which corresponds to approximately 15 dB loss in PSNR. In contrast, the proposed combination scheme can achieve average energy savings of 46% with dB PSNR degradation. Moreover, the proposed scheme provides a mechanism for higher energy saving for low Q settings while keeping the degradation low for high Q settings.
IV. CONCLUSION
In this paper, we presented several general as well as algorithm-specific techniques that trade off energy against system performance for multimedia signal processing algorithms. We provided an overview of energy-saving techniques such as voltage scaling, reducing the number of computations and reducing the dynamic range. All these techniques introduce errors, which can be compensated by algorithm-level optimizations.
Next, we described several hybrid techniques that further reduce energy consumption while causing little reduction in quality. We investigated the combination of voltage scaling and dynamic range reduction and applied it to a low pass FIR filter. The proposed scheme achieved 85% energy saving for fairly low noise level. We also studied the combination of computation reduction and dynamic range reduction for DCT used in JPEG. Simulation results showed, on average, 33% to 46% reduction in energy consumption for a small 0.5 dB to 1.5 dB degradation in the system performance. Thus algorithm-level optimizations can help reduce the energy consumption of many multimedia signal processing algorithms with only a mild degradation in quality.
APPENDIX
We present the expected variance of error due to truncation for unsigned multiplication. The two inputs and are independent and L bit truncation is applied before multiplication to obtain and .
where represents the covariance operation. We use the variance and covariance property for product of random variables to simplify the above expression.
