Abstract-It was proposed recently that quantized representations of the input source (e.g., images, video) can be used for the computation of the two-dimensional discrete wavelet transform (2D DWT) incrementally. The coarsely quantized input source is used for the initial computation of the forward or inverse DWT, and the result is successively refined with each new refinement of the source description via an embedded quantizer. This computation is based on the direct two-dimensional factorization of the DWT using the generalized spatial combinative lifting algorithm. In this correspondence, we investigate the use of prediction for the computation of the results, i.e., exploiting the correlation of neighboring input samples (or transform coefficients) in order to reduce the dynamic range of the required computations, and thereby reduce the circuit activity required for the arithmetic operations of the forward or inverse transform. We focus on binomial factorizations of DWTs that include (amongst others) the popular 9/7 filter pair. Based on an FPGA arithmetic co-processor testbed, we present energy-consumption results for the arithmetic operations of incremental refinement and prediction-based incremental refinement in comparison to the conventional (nonrefinable) computation. Our tests with combinations of intra and error frames of video sequences show that the former can be 70% more energy efficient than the latter for computing to half precision and remains 15% more efficient for full-precision computation.
I. INTRODUCTION
The two-dimensional discrete wavelet transform (2D DWT) has been established as one of the main tools for image compression [1] , image denoising and other popular image processing operations [2] , [18] . In the vast majority of applications, the transform coefficients are produced to the maximum degree of precision and then they are quantized and processed as appropriate [1] . However, it has been recognized that this wastes system resources for the cases where severe quantization would render the majority of the coefficients not being used at all, or used at very low precision [3] . For example, this is commonly the case for low-bitrate image and video coding applications [3] and resource-constrained image and video processing operations [4] . For this reason, previous work proposed schemes for approximate computation of transforms and signal processing operations [3] . A property that has been recognized to be of great importance is incremental refinement of computation [4] - [6] , where the transform representation of a signal (image, video) is produced incrementally with the use of embedded (bitplane-based) quantization. In our recent work [4] , this design has been theoretically analyzed both for the forward and the inverse two-dimensional multilevel DWT using the generalization of the spatial combinative lifting algorithm (SCLA) of Meng and Wang [7] . The overall framework is depicted in Fig. 1 . There, the multilevel DWT decomposition of the input source (video frame) occurs independently for each quantization threshold (bitplane), starting from the most-significant bitplane (MSB) and going down to the least-significant bitplane (LSB). The results are accumulated after each multilevel SCLA computation to form an incrementally-refined output. Similarly, for the DWT reconstruction, the MSBs of the transform-coefficients are inserted first and the multilevel inverse DWT reconstructs the image incrementally. Each additional processing step requires additional energy consumption. If the processing resources are terminated, one receives the decomposed or reconstructed image with the best-possible quality (controlled by the number of bitplanes processed). Although the framework of Fig. 1 receives individual bits (per pixel or per wavelet coefficient), the dynamic range of computations performed is increasing according to i) the lifting coefficients of each lifting step; ii) the input-source statistics; and iii) the number of decomposition levels. In order to adapt the circuit activity according to varying input statistics, it is crucial to have arithmetic units that perform variable dynamic-range computation. Xanthopoulos [8] proposed a suitable framework for this purpose: for all arithmetic units, very low-cost "MSB-rejection" circuits are utilized, which identify the exact number of active bits within each element (adder or multiplier). In this correspondence, we use a "zero-detection" circuit to avoid performing parts of multiplications with zero inputs and demonstrate its effectiveness in conjunction with incremental computation on an FPGA arithmetic co-processor testbed.
The contribution of this correspondence is twofold: firstly, we propose incremental computation of the DWT with the use of prediction within each refinement layer (bitplane) of the input (Sections II and III); in addition, via the utilized FPGA co-processor (introduced Section IV), we demonstrate the energy-distortion scalability offered by incremental computation and the proposed prediction-based incremental computation in comparison to the conventional (nonrefinable) computation (Section V). Our results are relevant to DWT architectures localizing memory accesses to on-chip memory [9] , [10] , or to cases when the entire image can be stored on-chip, since energy consumption stems predominantly from arithmetic operations and not memory accesses in such cases [9] , [10] .
II. OVERVIEW OF SCLA-BASED DWT UNDER INCREMENTAL REFINEMENT OF COMPUTATION
The 2-D DWT of an R 2 C input matrix X consisting of image intensity values is expressed in the spatial domain by 1 S = E1X1E T , where E is the analysis polyphase matrix consisting of alternating rows of low-and high-pass filters shifted by two (in order to apply the DWT downsampling), and S the 2-D matrix of output wavelet coefficients. 1 We are not concerned with the scaling performed after the lifting analysis and before the lifting synthesis [11] because all scaling factors can be incorporated into the subsequent encoding or processing stage [12] . In addition, for notational simplicity, we assume that the image dimensions are integer multiples of 2 , with L the number of wavelet decomposition levels.
1053-587X/$26.00 © 2010 IEEE The Z-domain expression of the analysis matrix can be factored into K lifting steps [11] :
where P k (z) and U k (z) are the kth prediction and update lifting matrices. The lower-or upper-triangular 2 2 2 matrices of (1) contain symmetric Laurent binomials [11] , [12] , with a FXP u(k) the fixed-point representation of the kth update filter coefficient (likewise for the predict step). Although the binomial lifting factorizations represented by (1) do not cover all possible factorizations, they do cover some of the most-popular ones found in practical applications [1] , [7] , [11] , [12] . As an example, for the 9/7 filter pair we have K = 2,
i.e., two prediction and two update filters, each of which has two identical nonzero and nonunity taps, as shown in (1) . Other popular filter pairs also obey this rule, e.g., cubic B-spline filters [11] , the 5/3 filter pair [1] and the 7/5 filter pair [13] . In addition, all symmetric or antisymmetric factorizations can be expressed as binomial lifting factorizations [12] .
A. Overview of Incremental SCLA Computation
A reformulation of the 2-D lifting scheme has been proposed in the spatial domain by Meng and Wang [7] : 
p n
where p n (4)- (7): (4)- (7). Again, (8) is a simple copy operation, while, for (9)- (11), the fixed-point lifting coefficient a FXP u(1) is used to update the input. The remaining steps k = 2; . . . ; K to complete the single-level decomposition using the 2-D lifting formulation of (2) are performed as in (4)- (7) and (8)- (11) 
where HL1 , LH1 and HH1 are the three high-frequency subbands of level one (see Fig. 1 ) and a b assigns b to a. Notice that the lowfrequency coefficients (u n K [2i; 2j ]) are not involved in this process, as they are the input for the subsequent decomposition level, as described in the following.
B. Multilevel Extension
The multilevel extension of the bitwise computation of the DWT for L max levels was performed in our previous work [4] via a "frequencyfirst" incremental refinement of computation: for each input bitplane n, the lifting-scheme computation is continued for all subsequent levels after the first level and the results are accumulated at the end, before moving to the next input bitplane. This is achieved by reformulating the first prediction step of (4)- (7) for all levels l, 2 l L max by (0 i < R=2 l , 0 j < C=2 l ): 
This process is then carried out for the next bitplanes n 0 1; . . . ; 0.
Source code for the entire process is available at http://www.ee. ucl.ac.uk/~iandreop/ORIP.html.
C. Inverse SCLA and Inverse DWT
Concerning the inverse SCLA, the process is exactly antisymmetric, as in the conventional lifting computation: all lifting steps are performed in reverse order by solving the forward bitwise lifting equations for the coefficients being predicted or updated during the forward transform. More explicitly, (8)- (11) Inversion of (4)- (7) occurs as for (8)- (11), with x n
[; ] inside the parentheses of (5) and (7) replaced by p n 
III. PREDICTIVE INCREMENTAL REFINEMENT OF COMPUTATION FOR THE DWT
Since neighboring input pixels or transform coefficients are expected to be correlated, 2 we can attempt to reduce the required computations for the incremental computation algorithm presented previously by introducing a prediction scheme between the pixels or coefficients of the input. For each of the 2 2 2 input 2-D polyphase components of each lifting step, we form the prediction in the direction of the 2-D lifting-based filtering presented in the previous section.
A. Proposed Formulation
The first component of the first step [shown in (4) ] remains unchanged as this is simply an assignment operation. The second component applies the lifting filter along the j-axis. By writing (5) (18), which is performed by the original incremental lifting of (5). Hence, with a multiplier design that is dynamically adjustable to the input sparsity, we expect to have decreased circuit activity for the multiplication operations.
The third polyphase component [shown in (6) ] applies diagonal filtering along both the i and the j axis by reusing the results of the second component computed previously. For this reason, we break the computation in two parts, with the first part performing prediction across the j axis and the second performing prediction across the i axis and incrementing the result of the previous computation. The second lifting step [update step of (8)- (11)] is modified in the same manner by applying the prediction as shown in (18)-(21). The expressions can be derived following the same rules and are omitted for brevity of description. Subsequent pairs of lifting steps follow the same rules. One final aspect relates to the border treatment of the prediction-based incremental refinement of the DWT computation. This is performed by applying the original equations for incremental refinement for the initial row (i = 0) or the initial column (j = 0) of each case.
The modification for the calculation performed for the subsequent levels [ (14)- (16) The calculations at the borders are performed by applying the original equations for incremental refinement [ (13)- (16)] for the initial row (i = 0) or the initial column (j = 0) of each case. Indicative source code for the proposed approach (forward and inverse) is available online.
B. Discussion
The proposed prediction-based incremental refinement of computation is applied per input bitplane. As such, it can be selectively applied to some input bitplanes, e.g., the four most-significant ones, and the original (4)-(11) can be applied for the remaining bitplanes. In addition, it can be applied selectively on some parts of the lifting steps, e.g., we can apply (6) instead of the proposed (19) and (20), or even on selected input frames where the correlation between neighboring input samples is stronger (this is actually used in the experimental results of Section V). This creates a "hybrid" approach that uses prediction only on some parts of the computation. This can be beneficial because, for example, correlation between neighboring inputs is expected to decrease when moving to lower bitplanes, since this corresponds to the low-amplitude parts of the input image, which are expected to contain fine-grain noise. Similarly, decreased correlation between neighboring samples is observed when dealing with error frames instead of video frames, since error frames are the result of temporal prediction.
Even though the description of the paper focused on single-bitplane inputs, a number of bitplanes can be inserted together and the computation can utilize all of them together as one increment layer. For example, for an 8-bit input image, the three MSBs can be processed first, followed by the three intermediate bitplanes and the two LSBs. This creates only three increment layers of computation. This can reduce the expected overhead when increasing the number of increment layers. In our recent work on software designs for incremental image processing [6] , we found that three or four increment layers provide for sufficient quality-complexity scalability without significant overhead in the overall execution time. From the implementation perspective, a crucial element needed in order to take advantage of the reduction of the multiplication bitwidth is a multiplier unit with adjustable circuit activity according to the input bit patterns. This is the topic of the following section. 
IV. FGPA-BASED ARITHMETIC CO-PROCESSOR TESTBED FOR ENERGY MEASUREMENTS
In order to substantiate the potential benefits of the proposed computation in a real environment, we setup an experimental testbed consisting of a main platform (personal computer) realizing the transform decomposition or reconstruction algorithmic flow in software. All arithmetic operations (multiplications, additions) are transferred to a Xilinx Virtex II XC2V1500 FPGA and are computed based on customized designs of multipliers and adders placed on the FPGA. We then measure and report the energy consumption of these operations for the different algorithms (conventional, incremental refinement and prediction-based incremental refinement). We used the ISE 8.2 FPGA development environment with the XPower 8.2 tool 3 for this purpose and all our results relate to the dynamic energy dissipation reported by the tool.
All arithmetic operations are performed in signed fixed-point representation and require (maximally) a 12-bit integer part plus a 12-bit fractional part, which ensures reconstruction peak-signal-to-noise (PSNR) values above 55 dB for up to six levels of decomposition [14] . All our arithmetic units restrict their output to 24 bits, which is comparable or less than related designs [14] , [15] . The multiplier design separates the integer and fractional parts into three 4-bit parts (IA3 IA1 and FA1 FA3) and then performs a cascade realization of the multiplication, whose intermediate operations are depicted in Fig. 2(a) . Since the result will always be contained within the 12 plus 12 bit output, only the intermediate operations contributing to the retained part of the output of Fig. 2(a) are performed. Furthermore, each individual 4-bit multiplication of Fig. 2(a) is enabled based on a zero-detection circuit. This circuit checks all inputs [ : ] and [ : ] of the 4-bit multiplier and raises the (circuit enabling) signal only when the multiplication will have nonzero output, as shown by Fig. 2(b) ; otherwise the default zero result is obtained without activating the multiplier.
The use of the zero-detection circuit ensures that the overall multiplier design is adjusted to i) each lifting coefficient's active bitwidth without customizing the multiplier design to a particular lifting scheme; ii) the input's varying statistics and varying dynamic range; and iii) the output's range (24 bits). This design was found to consume less energy than the standard 24-bit cascade multiplier with the same throughput and latency.
Concerning the adder's design, even though we designed a bitwidthadaptive adder unit, our measurements indicated that the energy consumption of this design is comparable to that of a standard 24-bit adder, and overall more than an order of magnitude smaller than the average energy consumption of the adaptive multiplier design. Hence, we resorted to using a standard 24-bit adder for all additions performed.
Both the multiplier and adder designs operate in one clock cycle. Representative results for the dynamic energy consumption measured for each combination of inputs when clocked at 75 MHz, as well as the layout of the multiplier design are provided online at http://www.ee. ucl.ac.uk/~iandreop/ORIP.html.
V. EXPERIMENTAL RESULTS
We examine the energy consumption of the proposed approach for individual cases of forward and inverse DWT applied to YUV intra frames as well as to YUV error frames produced by motion-compensated prediction. Two schemes are used for comparison purposes: conventional (nonrefinable) computation and incremental refinement of DWT computation as proposed in our previous work [4] . All 300 frames of the CIF sequences Stefan, Coastguard, and Foreman were used for the experiments of this section. Image distortion is measured using the PSNR across luminance (Y) and chrominance (U and V) channels of each video frame:
in order to produce one metric including all channels. The PSNR of each channel C 2 fY; U; Vg is measured as follows:
PSNR(C) = 10 log 10 (255 2 =MSE(C)), where MSE(C) is the mean squared error. For the forward DWT, PSNR is measured by inverting the produced decomposition via a software implementation of the inverse DWT with double-precision floating-point representation.
Energy consumption is measured via the testbed presented in Section IV, which includes one adder and one multiplier clocked at 35 MHz for the conventional approach and at 75 MHz for the incremental approaches. The clock frequencies were chosen so as to allow for processing of 25 frames/s when using all input bitplanes: since the incremental approaches are performing more arithmetic operations (multiple increments but with smaller bitwidth), we increased the clock frequency for these cases in order to ensure they complete the frame processing at the same time with the conventional approach. The 9/7 [11] and 7/5 [13] filter pairs were chosen for our comparisons since they present state-of-the-art compression performance with moderate complexity. The lifting coefficients of the 9/7 and 7/5 filter pairs can be found in [11] and [13] , respectively. The conventional approach is using all input bitplanes simultaneously, while the incremental approaches are breaking the computation into two layers: for the forward DWT, the four MSBs are inserted first, followed by the 4 LSBs; for the inverse DWT, the 8 MSBs are inserted first, followed by the 4 LSBs. This creates two layers of computation labeled as "half-precision" and "full-precision" results. All utilized error frames in our experiments were produced by a spatial-domain motion-compensated prediction scheme [16] using successive (frame-by-frame) prediction and within a group-of-pictures (GOP) of 16 frames. Four wavelet decomposition levels were performed. No wavelet-coefficient coding was performed in order to avoid artificial thresholding of wavelet coefficients before the inverse DWT. The results are reported in Table I . As seen from the top half of the table, when used for intra frames, the proposed approach provides benefits in comparison to the original algorithm for incremental refinement of computation [4] by decreasing its energy consumption and making it comparable to conventional computation. In addition, the proposed scheme brings significant benefits in terms of energy reduction for half-precision computation. However, when used for error frames, the proposed approach does not provide any improvement in comparison to incremental refinement [4] ; instead, energy consumption is increased, sometimes by a significant amount. This is attributed to the failure of the prediction scheme due to reduced correlation between neighboring coefficients of error frames. These observations hold for both filter pairs and both forward and inverse DWT.
There are some minor discrepancies in PSNR between the different approaches caused by the fixed-point precision chosen for the example cases presented, in the order of 0.03 dB. All visual artifacts observed in the reconstructed images and error frames at each terminating quantization precision (bitplane) are typical of low-bitrate wavelet-based quantization and coding approaches studied in the literature [17] .
The bottom half of Table I shows the energy consumption in a video coding scenario where 1 intra frame and 15 error frames are transformed, when the proposed approach is used for the intra frames and incremental refinement [4] is used for the error frames. This "hybrid" approach provides for lower energy consumption for full-precision computation in comparison to the conventional (nonrefinable) computation. Moreover, in such video applications, energy consumption is reduced significantly (by more than 70% on average) by terminating at half-precision (first increment layer), as shown by the bottom half of Table I .
VI. CONCLUSIONS
We propose a prediction-based method for incremental computation of the forward and inverse discrete wavelet transform (DWT) under a bitplane-based formulation of the 2-D lifting decomposition. The proposed approach applies prediction of neighboring pixels or wavelet coefficients in the direction of the lifting-based filtering. This is performed for each of the 2-D polyphase components of the direct 2-D computation of the multilevel DWT decomposition. Based on FPGAbased comparisons with conventional computation, we verified that the proposed approach and incremental refinement has comparative energy consumption in full-precision computation, with the proposed approach providing an improvement for intra frames. At the same time, these approaches can terminate at intermediate energy-distortion points (e.g., half-precision) with significant decrease in energy consumption.
