Abstract-The discrete wavelet transform is a fundamental block in several schemes for image compression. Its implementation relies on filters that usually require multiplications, leading to significant hardware complexity. Distributed arithmetic is a general and effective technique to implement multiplierless filters and has been exploited in the past to implement the discrete wavelet transform as well. This work proposes a general method to implement a discrete wavelet transform architecture based on distributed arithmetic to produce approximate results. The novelty of the proposed method lies in the use of result-biasing techniques (inspired by the ones used in fixed-width multiplier architectures), which cause a very small loss of quality of the compressed image (average loss of 0.11 dB and 0.20 dB in terms of PSNR for the 9/7 and 10/18 wavelet filters, respectively). Compared with previously proposed distributed-arithmetic-based architectures for the computation of the discrete wavelet transform, this technique saves about 20% to 25% of hardware complexity.
I. INTRODUCTION

IN THE LAST FEW years, the discrete wavelet transform (DWT) has become widespread. Thanks to its excellent decorrelation properties, the DWT has been included in JPEG2000 [1], the standard recently adopted for digital cinema [2]. This has stimulated research and led to efficient VLSI architectures to implement the DWT [3], [4]. As shown in [5], the computational kernel of the DWT is a filter bank (FB). Thus, considerable effort has been devoted to obtaining multiplierless architectures of the FB structure. As an example, in [6], [7] the B-spline factorization [8], [9] is exploited to design multiplierless FB architectures. Recently, other approaches have been proposed as well, e.g., algebraic integer quantization [10], [11], coefficient rationalization [12], polymorphic implementation [13], and half-band polynomial factorization [14].
Unfortunately, the aforementioned techniques require knowledge not only of the values of the filter taps but also of the mathematical derivation of the filters, or at least of some specific factorizations. In contrast, distributed arithmetic (DA) is a systematic methodology to design multiplierless architectures for digital filters. Indeed, it has recently been employed to design low-complexity and high-throughput architectures for i) finite-impulse-response (FIR) filters [15], [16], ii) discrete-cosine-transform (DCT) based architectures [17], [18], and iii) multiplierless FB implementations of the DWT [19], [20]. Inspired by the result-biasing techniques proposed in [21]-[24] for fixed-width multipliers, this work aims to show that the complexity of DA-based architectures for DWT computation can be further reduced by applying result-biasing techniques. It is worth remarking that the proposed approach is agnostic, i.e., it can be applied independently of the design criterion adopted for the addressed filters. In particular, in this work we show that i) the complexity of DA-based architectures for wavelet filters can be reduced by about 20% to 25% with a very limited performance degradation (thus result-biasing compensation can be avoided); ii) the implemented DA-based architecture for the 9/7 wavelet filters features almost the same performance and complexity as other multiplierless solutions, which have been optimized by taking advantage of the specific properties of these filters (see [25]). Furthermore, the proposed solution features a large complexity reduction compared to state-of-the-art architectures when applied to the 10/18 wavelet filters.
The paper is structured as follows. Section II summarizes the general computational scheme of DA-based architectures for wavelet filters and Section III introduces concepts and definitions for implementing result-biasing techniques. In Section IV result-biasing is applied to two important cases of study: the 9/7 and 10/18 wavelet filters. In Section V experimental results and comparisons are shown. Finally, conclusions are drawn in Section VI.
II. DA-BASED FBS FOR DWT COMPUTATION
Let us consider the FB shown in Fig. 1 where and are the low pass and high pass analysis filters with length and , respectively, and and the low pass and high pass synthesis ones with length and .
A. Analysis Filters
1549-8328 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
The two analysis outputs ( and ) are obtained as: and , where is the input signal. Let us assume that the taps of the filters are amplitude normalized, i.e., , and represented as 2's complement values using bits. Then, we can rewrite and as:
. . . . . .
and . . . . . .
where , each element , represents bit of and , respectively, the operator stands for transposed, and can be either the length of the low pass or high pass filter ( or ).
For a generic filter, the DA-based architecture is obtained by computing the product between (or ) and (or ) first, then, the result is multiplied by . Let and be the matrices containing and , respectively, and and , we then obtain and . This factorization leads to a 3-stage architecture: 1) A butterfly circuit made of adders to implement the and matrix product; 2) A hard-wired shift network to apply the vector; 3) A tree adder to combine partial results. In Fig. 2 the generic DA-based architecture to implement ( or ) is depicted, where is the filter length ( or ), terms are the results of the matrix product ( or ), represents an -position right-shift and . As detailed in Section IV, the downsampling operation at the output of the analysis filters is exploited to alternatively compute and .
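The three-stage factorization described above can be illustrated with a minimal Python sketch. It assumes integer taps that are already quantized to `nbits`-bit 2's complement, and it uses left shifts on integers in place of the fractional right shifts of Fig. 2, so that the result can be compared exactly with a direct dot product. The names `tap_bits` and `da_dot` and the tap values are hypothetical and are not taken from the paper.

```python
def tap_bits(taps_q, nbits):
    """Row j holds bit j of every tap (row 0 = sign bit, 2's complement)."""
    mask = (1 << nbits) - 1
    return [[((t & mask) >> (nbits - 1 - j)) & 1 for t in taps_q]
            for j in range(nbits)]


def da_dot(taps_q, x, nbits):
    """Multiplierless dot product of quantized taps and input samples."""
    bits = tap_bits(taps_q, nbits)
    # Stage 1 (butterfly): each bit row selects and adds input samples.
    partial = [sum(xi for b, xi in zip(row, x) if b) for row in bits]
    # Stages 2 and 3 (shift network and adder tree): weight and accumulate.
    acc = -(partial[0] << (nbits - 1))  # the sign-bit row has negative weight
    for j in range(1, nbits):
        acc += partial[j] << (nbits - 1 - j)
    return acc


# Illustrative 8-bit taps and samples (assumed values, not the 9/7 taps).
taps_q = [3, -5, 7, -128, 127]
x = [10, -20, 30, 40, -50]
direct = sum(t * xi for t, xi in zip(taps_q, x))
assert da_dot(taps_q, x, 8) == direct
```

Only additions and shifts appear in `da_dot`; the multiplications of the direct form are replaced by bit-controlled selections of the input samples.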
B. Synthesis Filters
The computational scheme used for the analysis filters can be used to implement the synthesis filters as well. Indeed, synthesis filters can be obtained from the analysis ones [5] as (3) with and . Moreover, the right part of Fig. 1 
. . .
. . . . . . . . .
with . Thus, the th row of and contains bit of the interlaced sequences of taps, namely and , respectively. As a consequence, when or the corresponding taps ( or ) are zero, leading to columns of zeros in and , respectively. Unfortunately, the effectiveness of DA-based architectures applied to the synthesis side of the FB strongly depends on the symmetry of the wavelet filters. Indeed, in Section IV-A we show that for the 9/7 filters the architecture for the synthesis filters is nearly the same as the one for the analysis filters. In contrast, the architecture for the synthesis filters of the 10/18 wavelet is very different from the analysis one (see Section IV-C).
III. RESULT-BIASED CIRCUITS FOR DA-BASED ARCHITECTURES
Let us consider the circuit shown in Fig. 3 to compute , where the gray shaded box highlights the circuit of a full adder (FA) and and are represented as 2's complement values using bits. Let be the probability that the th bit of is equal to "1," where is one of the signals involved in the addition, namely , and is the carry-in signal. From Fig. 3 one infers that: (10) Let us introduce a threshold such that, if is sufficiently small, then the following approximation holds true: (11) Since this approximation biases the result of the addition, the value of is used to tune the bias effect. In this work we investigate two strategies to select . These strategies are referred to as shift-based and probability-based thresholding, respectively, and will be described in the following paragraphs.
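As a rough illustration of how bit probabilities propagate through a full adder, the following Python sketch computes the probability that the carry-out is "1". It assumes statistically independent input bits, an assumption made here for simplicity that need not match the model behind (10). The function names are hypothetical.

```python
from itertools import product


def carry_out_probability(pa, pb, pc):
    """P(carry-out = 1) of a full adder with independent input bits.

    The carry-out is the majority of the three inputs.
    """
    return pa * pb + pa * pc + pb * pc - 2 * pa * pb * pc


def carry_out_probability_enum(pa, pb, pc):
    """Same quantity obtained by enumerating the eight input combinations."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        if a + b + c >= 2:
            total += ((pa if a else 1 - pa)
                      * (pb if b else 1 - pb)
                      * (pc if c else 1 - pc))
    return total


# Closed form and enumeration agree for an arbitrary operating point.
assert abs(carry_out_probability(0.3, 0.6, 0.2)
           - carry_out_probability_enum(0.3, 0.6, 0.2)) < 1e-12
```

Under this assumption, when one of the input probabilities is close to 0 the carry-out probability is dominated by the product of the other two, which is the kind of situation a threshold-based approximation can exploit.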
A. Shift-Based Thresholding
As shown in Fig. 2 , DA-based architectures require right shift operations at the output of the butterfly circuit. It is known that, in fixed point implementation, changing the order between additions and right shift operations leads to a precision loss, i.e., . However, if is the maximum value such that , then we obtain that (12) As a consequence, if and are represented using bits, then we obtain an approximate version of by employing instead of FAs.
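The precision loss caused by swapping additions and right shifts can be checked with a short Python sketch. It assumes plain integers with arithmetic (floor) right shifts, which is how Python's `>>` behaves; the function names are hypothetical.

```python
def shift_then_add(a, b, s):
    """Shift each operand right by s positions, then add."""
    return (a >> s) + (b >> s)


def add_then_shift(a, b, s):
    """Add the operands at full precision, then shift right by s positions."""
    return (a + b) >> s


# The two orders differ as soon as the discarded bits generate a carry.
assert shift_then_add(5, 7, 2) == 2
assert add_then_shift(5, 7, 2) == 3

# They agree when the s least significant bits of both operands are zero.
assert shift_then_add(8, 16, 3) == add_then_shift(8, 16, 3) == 3

# With floor shifts, the mismatch never exceeds one unit in the last place.
for a in range(-64, 64):
    for b in range(-64, 64):
        assert add_then_shift(a, b, 3) - shift_then_add(a, b, 3) in (0, 1)
```

The bounded one-unit error is what makes it possible to drop the adders that would only process bits discarded by the subsequent shift.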
B. Probability-Based Thresholding
Another circuit employed in DA-based architectures is the tree adder. As it can be inferred from Fig. 2 , the data combined by the tree adder come from different paths and the magnitude of two terms, out of the available ones, can be very different. Let us assume that all samples have the same order of magnitude, then, the difference between and can be large due to the shift operation. This idea can be exploited to predict the probability of the th carry signal to be "1": (13) For some of the values taken by , if and is "small," then (14) The condition in (14) applied to (13) leads to (15) with
. From (15) we can infer that in many cases the probability of the carry-in signal for the most significant bits tends to 0. Analogously, if and is "small," then (16) which leads to (17) with . From (17) we infer that the carry-in probability tends to 1. In order to address both cases in (15) and (17), we introduce , which is the mean value of the input signal, and we observe that the probability of carry-in signals as a function of creates two regions: i) the most-significantbit-region (MSB-region), where tends to 0 or 1 depending on , ii) the least-significant-bit-region (LSB-region), where depends on the statistic of the input signal and (either or ) is the position at the border between the MSB-region and the LSB-region.
To maximize the occurrence of the conditions in (15) and (17), we add values as follows: (18) with . Then, by setting a threshold , we can find (19) and force for . The same approach can be extended to all the levels in the tree adder. Let be the threshold for adder at level and the position of the last FA such that for . If the input values are represented using bits, then we can obtain an approximate version of each result by employing instead of FAs.
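The idea of measuring carry probabilities and then cutting the carry chain can be sketched in Python as follows. This is a behavioral model under stated assumptions, not the circuit of the paper: operands are `nbits`-bit 2's complement integers, the carry into every bit at or above position `k` is forced to a constant, and those upper sum bits are formed with XORs only. The names `carry_chain`, `carry_probabilities`, and `biased_add` are hypothetical.

```python
def carry_chain(a, b, nbits):
    """Carry-out of every full adder (LSB first) in a ripple-carry addition."""
    mask = (1 << nbits) - 1
    a &= mask
    b &= mask
    carries, c = [], 0
    for i in range(nbits):
        s = ((a >> i) & 1) + ((b >> i) & 1) + c
        c = s >> 1
        carries.append(c)
    return carries


def carry_probabilities(pairs, nbits):
    """Empirical probability that each carry equals 1 over a set of operand pairs."""
    counts = [0] * nbits
    for a, b in pairs:
        for i, c in enumerate(carry_chain(a, b, nbits)):
            counts[i] += c
    return [n / len(pairs) for n in counts]


def biased_add(a, b, nbits, k, forced_carry):
    """Approximate adder: exact on bits 0..k-1, carry into bits >= k forced."""
    mask = (1 << nbits) - 1
    low_mask = (1 << k) - 1
    high_mask = mask ^ low_mask
    a &= mask
    b &= mask
    low = ((a & low_mask) + (b & low_mask)) & low_mask
    fill = high_mask if forced_carry else 0
    high = (a ^ b ^ fill) & high_mask  # XOR only: no carry chain above bit k
    u = high | low
    return u - (1 << nbits) if u >> (nbits - 1) else u


# A large positive value plus a small negative one: upper carries are all 1.
assert carry_chain(100, -3, 8) == [0, 0, 1, 1, 1, 1, 1, 1]
# A large positive value plus a small positive one: upper carries are all 0.
assert carry_chain(100, 3, 8) == [0] * 8
# Forcing the matching carry value reproduces the exact sums with 3 FAs.
assert biased_add(100, -3, 8, 3, 1) == 97
assert biased_add(100, 3, 8, 3, 0) == 103
```

When the forced value does not match the actual carry, the upper bits are wrong (for example, `biased_add(100, 3, 8, 3, 1)` does not return 103), which is why the threshold and the border position have to be chosen from the measured statistics.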
IV. CASES OF STUDY: RESULT-BIASED DA-BASED ARCHITECTURES FOR THE 9/7 AND 10/18 ANALYSIS WAVELET FILTERS
Two important cases of study are shown in the following: the experimental results obtained by implementing result-biased DA-based architectures for the 9/7 and 10/18 wavelet filters [...] as in other works, e.g., [4], [9]. For our simulations five standard images (256 gray levels), namely "Lena" 512 × 512, "Barbara" 512 × 512, "Boat" 512 × 512, "Goldhill" 512 × 512 and "Fingerprint" 512 × 512 [28], have been employed.² The number of DWT decomposition levels has been varied from 1 to 4. This corresponds to , where is the number of DWT resolution levels required by openjpeg. Different compression ratios have been imposed, namely 1:1, 8:1, 16:1, 32:1, and 64:1; precinct and code-block sizes are the encoder default values. Simulations shown in this work have been obtained by modifying the encoder, namely we implemented the forward DWT with the DA-based solution proposed in [20] for the 9/7 DWT. Then, the DA-based solution has been extended to support the 10/18 wavelet filters as well. Finally, we implemented the proposed result-biasing techniques.
A. DA-Based Architecture for the 9/7 Wavelet Filters
As argued in [20], it is more convenient to consider the binary representation of and , instead of and , to find terms that are common to both the low pass and the high pass taps. Given that the 9/7 wavelet filters are symmetric (see Table I), we can further reduce the complexity of the butterfly circuit. These two considerations allow us to write and for the 9/7 filters, as shown in Table II, where repeated common-term-vectors are gray-shaded. Moreover, to exploit filter symmetry, we introduce the column vector , whose elements are (20) Then, we produce the values, as shown in Fig. 4(a), by combining with the 13 possible vectors. As an example, Table II shows that , where , is used to calculate , , , for the low pass branch and , for the high pass branch. In general, every product defines a set made of the proper and elements, e.g., , as shown in Table III. Furthermore, as argued in [20], a Reduced-Adder-Graph-like technique [29], where common subexpressions are extracted and calculated only once, reduces the number of adders required by the butterfly circuit. As an example, sub-expression , which is common to several products, is computed only once and then reused multiple times.
¹ For other profiles related to digital cinema, the reader can refer to [27].
² Other images have been tested as well. Since the results we obtained are similar to the ones presented in this paper, we do not show them for the sake of brevity.
A similar approach can be employed for the synthesis filters, where odd filter lengths and the symmetry of and matrices can be exploited to define (21) (see Section II-B). The corresponding butterfly circuit is very similar to the one shown in Fig. 4(a) and can be derived from the vectors summarized in Table IV. Finally, both analysis and synthesis architectures rely on a shift network and a tree adder to compute the results, as shown in Fig. 2 for a general [...]
To set each , we simulated the proposed DA-based result-biased DWT in the openjpeg model with the test conditions detailed at the beginning of Section IV. In Table V we show the results obtained by choosing such that . As an example, means that the elements in are not biased. Experimental results show that the peak signal-to-noise ratio (PSNR) difference between the original DA-based DWT and the proposed one is negligible when result-biasing is applied to the butterfly circuit (BB, ). Moreover, the standard butterfly circuit [20] requires FAs, where FAs are required to compute . On the other hand, the proposed result-biased butterfly saves FAs, where . As an example, since , the computation of requires only FAs. Since , the standard butterfly circuit requires 240 FAs, whereas the proposed one requires FAs.
2) Result-Biased Tree Adder Implementation:
Stemming from the computational scheme defined in the previous section, the 13 different values are added together. As detailed in Section III-B, we combine values as in (18) . Fig. 4 (b) shows the tree adder and the hard-wired shift network used in the architecture for the 9/7 wavelet filters. As it can be observed, the signal produces or by selecting the proper input to the hard-wired shift network.
As discussed in Section III-B, the delay and the complexity of the tree adder can be reduced by cutting the carry chains, namely by fixing the value of and by finding the corresponding . Simulations were performed on the modified openjpeg model in the test conditions described in the first paragraph of Section IV and including the result-biased butterfly circuit described in Section IV-B1. As an example, Fig. 5 shows the values of obtained with the "Goldhill" image by varying for low pass (lp) and high pass (hp) filters, respectively, at the first level of the tree adder (referred to as in Fig. 4(b), with ). Since the openjpeg model converts image pixels from to , then for the "Goldhill" image. Indeed, the values of in the MSB-region (Fig. 5) tend to 1, whereas in the LSB-region . Simulations show that in both low pass and high pass computation the following approximation holds true: (22) where the values of the coefficients are summarized in Table VI . As shown in Fig. 5 , the curve defined in (22) is a good approximation of in the LSB-region. The approximation in (22) can be used for setting the threshold of each adder in the tree adder. As an example, simulations show that if one sets the threshold to the highest probability in the LSB-region , which corresponds to , then there is a PSNR loss of up to 5 dB. As a consequence, we set with . It is worth pointing out that, since we force for , the corresponding bits of are not correct. However, in JPEG2000 the results of the DWT are quantized, so it is unnecessary to compensate the approximation caused by result-biasing, as discussed in the next paragraphs. Thus, to save complexity we approximate for . Through extensive simulations we found the values for (see Table VI ) that minimize the PSNR loss. These values lead to the results detailed in Table VII as  ,  where is the PSNR obtained by performing result-biasing both in the butterfly circuit and the first level of the adder tree. 
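The PSNR figures used to compare the original and result-biased pipelines follow the standard definition, which can be sketched in Python as below. The sketch assumes images given as flat sequences of pixel values and a peak value of 255, consistent with the 256-gray-level test images; the function name `psnr` and the sample values are hypothetical.

```python
import math


def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized sequences of pixel values."""
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10 * math.log10(peak ** 2 / mse)


# A PSNR loss is the difference between two such values computed against the
# same original image (illustrative pixel values, not measured data).
original = [10, 20, 30, 40]
decoded_exact = [10, 21, 30, 40]
decoded_biased = [11, 21, 30, 40]
loss_db = psnr(original, decoded_exact) - psnr(original, decoded_biased)
assert loss_db > 0
```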
From the complexity point of view, the number of FAs required for the six adders at the first level of the tree adder decreases from 96 to 57 (39 FAs saved).
The approach used for the first level of adders in the tree adder is applied to the other levels as well and the value of each parameter is summarized in Table VI. As can be observed, the performance loss caused by result-biasing at the second and third level of adders in the tree adder ( and , respectively) is nearly the same as , where , and , are the PSNR values obtained by introducing result-biasing in the butterfly circuit and at the first, second and first, second, and third levels in the tree adder. When the result-biasing technique is applied at the second and third level of the tree adder, the number of required FAs decreases from 48 to 36 (12 FAs saved) and from 32 to 26 (6 FAs saved), respectively. Thus, the proposed result-biased tree adder saves 57 FAs on levels 1 to 3 of the tree adder. Further experiments have shown that there is no advantage in implementing result-biasing in the adder at the fourth level.
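The full-adder savings quoted for the three tree-adder levels can be tallied with a few lines of Python. The per-level counts are the ones stated in the text; the variable names are hypothetical, and the relative saving is a derived figure that refers only to these three levels, not to the whole architecture.

```python
# (before, after) full-adder counts per tree-adder level, as quoted in the text.
tree_levels = {1: (96, 57), 2: (48, 36), 3: (32, 26)}

saved = {lvl: before - after for lvl, (before, after) in tree_levels.items()}
total_saved = sum(saved.values())
total_before = sum(before for before, _ in tree_levels.values())
fraction_saved = total_saved / total_before  # relative saving on levels 1 to 3

assert saved == {1: 39, 2: 12, 3: 6}
assert total_saved == 57
```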
C. DA-based Architecture for the 10/18 Wavelet Filters
As shown in Table VIII , the 10/18 wavelet filters are symmetric. This property can be exploited to design a reduced complexity architecture. However, since in this case and are even values, we introduce (23) and derive the common-term-vectors , which are summarized in Table IX . The corresponding butterfly circuit is depicted in Fig. 6 , where , and the signal is used to add or subtract input samples in the low-pass and high-pass filter implementations, respectively. Then, as for the 9/7 case, the architecture relies on a hard-wired shift network and a tree adder.
On the other hand, it is not possible to exploit the symmetry of the filters in the implementation of the architecture for synthesis filters as and are even values. Indeed, since and are non-symmetric matrices, cannot be defined. As a consequence, there is no simplification in (6) and (8) and Table X. Unfortunately, the content of and shows only partial common-term-vectors, thus the DA approach is more effective for the analysis filters than for the synthesis ones.
D. Result-Biased DA-Based Architecture for the 10/18 Analysis Wavelet Filters
1) Implementation of the Result-Biased Butterfly Circuit:
In order to tune the result-biasing for the butterfly circuit, we built , the sets defined by each of the products, as shown in Table XI. Then, we set (as for the 9/7 filters). In this case, the proposed result-biased butterfly circuit saves 113 FAs. Since , the standard butterfly circuit requires 448 FAs, whereas the proposed one requires 448 - 113 = 335 FAs.
2) Result-Biased Tree Adder Implementation:
The architecture of the result-biased tree adder is nearly the same as the one employed for the 9/7 filters and described in Section IV-B2. The only differences with respect to the circuit shown in Fig. 4(b) are: i) the hard-wired shift network, where the inputs to the multiplexers are the ones summarized in Table IX, ii) the 1 multiplication at the output. Since we observed similar carry signal probabilities for both 9/7 and 10/18 filters, the same result-biasing strategy has been employed for the implementation of the tree adder. In particular, the approximation in (22), with the parameters shown in Table VI, has been exploited. As a consequence, the number of saved FAs is the same as the one obtained for the 9/7 filters, namely 57 FAs. Experimental results, achieved by implementing result-biasing in the tree adder, are shown in Table XII. As can be observed, the performance loss in terms of PSNR of the proposed result-biased variant, with respect to the original DA-based architecture, is limited to a few fractions of a dB, where and are the PSNRs of the original and proposed solution, respectively.
The architecture contains also multipliers. Since no reference is available in the literature it has been implemented.
V. HARDWARE IMPLEMENTATION AND COMPARISON
The proposed result-biased architectures for the computation of the 9/7 and 10/18 wavelet filters have been implemented using a 90 nm standard cell technology library for a 200 MHz target clock frequency, leading to areas of 7621 µm² and 12602 µm², which correspond to about 2.7 and 4.5 equivalent kgates for the 9/7 and 10/18 filters, respectively. Moreover, with the same technology the proposed architectures can achieve maximum clock frequencies of 450 MHz and 360 MHz with areas of 8087 µm² (2.85 eq. kgates) and 13845 µm² (4.91 eq. kgates) for the 9/7 and 10/18 filters, respectively. These results are shown in Table XIII, where the proposed architectures are compared with other solutions available in the literature in terms of PSNR, number of FAs, number of flip-flops (FFs), clock frequency and area.
The proposed architecture for the 9/7 filters offers a significant complexity reduction with respect to previously published DA-based implementations for DWT computation [19], [20], with a very small PSNR loss (see Tables V and VII). When compared with multiplierless solutions, which were specifically optimized for the 9/7 wavelet filters, the proposed architecture shows a PSNR loss of a few fractions of a dB, like the variants described in [6], [7], [13], [30]. To enable complexity comparison of the proposed architecture with the other works, in particular with the ones described in [13] and [9] for the 9/7 and 10/18 filters, we introduced the normalized area (last column of Table XIII), where Tech is the technology process used for the implementation, namely 90 nm for the architectures proposed in this work, 45 nm for [13], and 130 nm for [9].
From the data in Table XIII we observe that the proposed architecture for the 9/7 filters features almost the same complexity as the lowest complexity implementations, i.e., [7], [30]. Comparison with [13] in terms of circuit speed is not straightforward, as it relies on a more scaled technology than the one employed in this work. As a consequence, [13] features a higher maximum clock frequency than our implementation. Besides, the architecture in [13] requires more FFs but fewer FAs than the result-biased DA-based variant, leading to a slightly higher normalized area than the proposed solution. It is worth noting that all the considered architectures for the 9/7 filters have a throughput of one sample per clock cycle. Furthermore, Table XIII shows that the area of DA-based architectures for the 10/18 filters is from 39% (result-biased DA-based architecture) to 49% (DA-based architecture) of the area of other optimized variants based on B-spline factorization, such as [8], [9]. Even if the architectures in [8], [9] have a throughput of two samples per clock cycle, the low area required by DA-based implementations makes them superior in terms of throughput-to-area ratio. Finally, the proposed result-biasing technique reduces the complexity of the architecture for the 10/18 DWT computation as well as for the 9/7 one. These figures of merit highlight the effectiveness of the proposed result-biasing technique as a general method to reduce the complexity of DA-based architectures for the approximate computation of the DWT.
VI. CONCLUSIONS
In this work a result-biased DA-based filter architecture for the approximate computation of the DWT has been presented. The proposed idea has been applied to the well-known 9/7 and 10/18 wavelet filters to reduce the complexity of DA-based architectures for the DWT computation, with a very small loss in terms of PSNR. Experimental results show that i) the proposed technique is effective in reducing the complexity of DA-based architectures for the DWT computation; ii) the performance and complexity of the variant derived for the 9/7 filters are comparable with those of previously proposed architectures, which are specifically optimized for the 9/7 wavelet filters; iii) the performance and complexity of the proposed architecture for the 10/18 wavelet filters are better than those of previously published works.
