Abstract-This brief proposes a novel low-complexity, efficient VLSI architecture for 9/7 wavelet filters, aimed at image compression applications. The performance of a hardware implementation of the 9/7 filter bank depends on the accuracy of the coefficient representation. The aim of this work is to show that a large complexity reduction with excellent performance can be achieved by working through the analytical derivation of the 9/7 tap values.
I. INTRODUCTION
THE discrete wavelet transform (DWT) has gained wide popularity due to its excellent decorrelation property [1]: many modern image and video compression systems embody the DWT as the transform stage (e.g., [2]). It is widely recognized that the 9/7 filters [3] are among the best filters for DWT-based image compression [4]. In fact, the JPEG2000 image coding standard [5], [6] employs the 9/7 filters as the default wavelet filters for lossy compression.
The performance of a hardware implementation of the 9/7 filter bank (FB) depends on the accuracy with which the filter coefficients are represented. However, high-precision representation increases hardware resources and processing time. To reduce the complexity of the 9/7 filters, the lifting scheme (LS) [7] can be adopted. Unfortunately, the LS increases hardware timing accumulation due to its serial nature [8], so that it cannot be employed for certain applications. The flipping structure [8] is an attractive alternative to the standard LS DWT, since it reduces timing accumulation; however, it still requires multiplications.
Complexity reduction can be achieved by resorting to an FB implementation; in particular, very good results can be obtained with the cascaded method proposed in [9] and exploited in [10]. The basic idea described in [9] is to minimize the number of bits required to represent the 9/7 coefficients. Since this operation moves the filter zeros from their original positions, the authors modify some terms to compensate for the displacement. Other techniques for reducing the complexity of FB implementations are based on distributed arithmetic (e.g., [11]), where only adders are employed.
To date, the compatibility of low-complexity 9/7 filter implementations with standard image/video coding systems has not been thoroughly addressed. The aim of this brief is to show that a large complexity reduction can be achieved by analyzing the 9/7 filters directly from their analytical derivation [3]. In particular, when the proposed solution is employed in a JPEG2000 encoder and the result is decoded with a standard JPEG2000 decoder, the image quality loss is negligible. Moreover, the complexity and the power consumption are nearly halved with respect to a standard implementation. It is worth noting that the proposed methodology can be extended to other filters belonging to the 9/7 family, as discussed in the following. Compared to the best solution proposed in [9], our solution shows that the total number of nonzero terms used when writing all the coefficients as sums or differences of powers of two (SPT) is the same. Moreover, our implementation shows that this number can be nearly halved by exploiting filter symmetry without any loss in terms of performance.
II. THEORETICAL DERIVATION
Let us consider the FB shown in Fig. 1 , where and are the low-pass and high-pass analysis filters with length and respectively, and and the low-pass and high-pass synthesis ones with length and . It is well known that wavelet FBs ought to satisfy the perfect reconstruction conditions [1] : and . Imposing the biorthogonality condition together with filters symmetry ( and ) we can rewrite the perfect reconstruction conditions as
As shown in [3] , writing the nondistortion condition (1) on and in terms of trigonometric polynomials, it becomes . Moreover, together with divisibility of and , respectively, by and [3] it leads to (3) where is an odd polynomial in and .
1057-7130/$20.00 © 2006 IEEE

The 9/7 filters have been proposed in [3] as a particular case of trigonometric polynomials that satisfy (3) with , and . When , , and , (3) becomes
The term can be split into two equal parts with degree 4. The polynomial in can be considered as a third-order equation and factorized into two polynomials with degree 2 and 4, respectively, in order to obtain (4) where is the real solution of the third-order equation (5) the product and the sum of the two complex conjugate solutions are, respectively, and . Thus, (4) leads to (6) (7) From (6) and (7), we can build filters coefficients [3] substituting:
and . Thus, we obtain the coefficients shown in Table I where , , , , and . Similar expressions can be found for other filters which satisfy (3).
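As a numerical check of the factorization step, the following sketch assumes that the third-order equation (5) has the form usually given for the 9/7 construction in [3], namely 20y^3 + 10y^2 + 4y + 1 = 0. The variable name y and all helper names are assumptions made for illustration. The sketch finds the real solution by bisection and derives the sum and the product of the two complex conjugate solutions from Vieta's formulas.

```python
def cubic(y):
    """Assumed form of the third-order equation: 20y^3 + 10y^2 + 4y + 1 (Horner form)."""
    return ((20.0 * y + 10.0) * y + 4.0) * y + 1.0


def real_root(lo=-1.0, hi=0.0, iterations=60):
    """Bisection on [lo, hi]; cubic(-1) < 0 < cubic(0) and the cubic is strictly increasing."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if cubic(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


r = real_root()

# Vieta's formulas for 20y^3 + 10y^2 + 4y + 1:
#   sum of the three roots     = -10/20
#   product of the three roots = -1/20
pair_sum = -0.5 - r                # sum of the two complex conjugate solutions
pair_product = -1.0 / (20.0 * r)   # product of the two complex conjugate solutions

print(r, pair_sum, pair_product)
```

Under this assumption the real solution is close to -0.3424, and the remaining quadratic factor y^2 - pair_sum * y + pair_product has complex conjugate roots.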
III. PROPOSED ARCHITECTURE
The standard architecture for a fast 9/7 implementation is the so-called direct implementation, where samples that have to be multiplied by the same tap are first added together ( with ), then multiplied by the proper tap, and finally the partial results are combined with a tree adder (as depicted in Fig. 2) to obtain the result ( ). Since our application is image compression, the term in Table I will appear twice in the computation: once during row filtering and once during column filtering. The JPEG2000 image coding standard embeds the factors into the quantizer, so they will not be considered further in the following description. In order to reduce the complexity of the direct implementation architecture (DA), the analysis described in Section II will be employed to derive: 1) a preliminary architecture (PA); 2) a low-complexity architecture (LCA); and 3) a very LCA (VLCA).
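The folding step of the direct implementation can be sketched as follows. The tap values are the commonly quoted 9/7 low-pass analysis taps and are used only for illustration; their normalization may differ from the one in Table I. The function names and the sample window are hypothetical.

```python
# Commonly quoted 9/7 low-pass analysis taps (centre tap first); illustrative only.
HALF_TAPS = [0.602949018236, 0.266864118443, -0.078223266529,
             -0.016864118443, 0.026748757411]
# Full symmetric 9-tap filter: c4 c3 c2 c1 c0 c1 c2 c3 c4.
FULL_TAPS = HALF_TAPS[:0:-1] + HALF_TAPS


def direct_fir(window, taps):
    """Plain inner product: one multiplication per tap."""
    return sum(x * c for x, c in zip(window, taps))


def folded_fir(window, half_taps):
    """Symmetric odd-length filter with half_taps = [c0, c1, ..., cK].

    Samples sharing the same tap are added first, so only K + 1
    multiplications are needed instead of 2K + 1.
    """
    k = len(half_taps) - 1
    centre = k
    acc = half_taps[0] * window[centre]
    for i in range(1, k + 1):
        acc += half_taps[i] * (window[centre - i] + window[centre + i])
    return acc


window = [12, 15, 9, 20, 31, 28, 17, 14, 10]  # hypothetical 9-sample window
print(direct_fir(window, FULL_TAPS), folded_fir(window, HALF_TAPS))
```

Both functions return the same value up to rounding; the folded form uses five multiplications for the nine-tap filter.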
Considering the two filters and as vectors, we can represent them as the product of a matrix and a vector (for ) or (for ). Besides and symmetry suggests, for the sake of simplicity, to concentrate only on taps with index (see Table I ). Thus, and where (8) and is the sub-matrix obtained from removing the fifth row and the third column. Being a similar expression for can be easily derived: where
A. PA
The PA represents the first modification with respect to the DA shown in Fig. 2. The basic idea is to perform the simple operations described by the matrix first (matrix plane), then to multiply the results by the values (vector plane), and finally to add together the intermediate values to obtain , as described by (10) where with are the values shown in Fig. 2. The same approach can be used for (11) In Fig. 3, a block scheme for the PA is shown, where it can be observed that the additions are applied first and the multiplications by or are performed afterwards. Given other filters that satisfy (3), a matrix expression similar to the ones shown in (10) and (11) can be obtained, so that an architecture similar to the one depicted in Fig. 3 can be derived for them.
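The reordering behind the PA can be illustrated with a toy example. If the taps are written as the product of a small-integer matrix and a short vector of real values, the output can be computed by first combining the inputs according to the matrix (matrix plane) and then multiplying by the vector entries (vector plane). The matrix `M`, the vector `v`, and the inputs `s` below are hypothetical placeholders chosen for readability; they are not the matrix of (8) nor the values of Table I.

```python
# Hypothetical 3x2 integer matrix and 2-entry vector (not the matrix of (8)).
M = [[1, 0],
     [1, -1],
     [0, 2]]
v = [0.75, -0.125]


def taps_from_factorization(matrix, vector):
    """Taps as matrix-vector product: one tap per matrix row."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def direct_output(samples, taps):
    """Direct form: one multiplication per tap."""
    return sum(a * b for a, b in zip(samples, taps))


def two_plane_output(samples, matrix, vector):
    """Matrix plane first, vector plane second."""
    # Matrix plane: combine inputs column by column using small integers only.
    partial = [sum(samples[i] * matrix[i][j] for i in range(len(matrix)))
               for j in range(len(vector))]
    # Vector plane: one multiplication per entry of the vector.
    return sum(p * x for p, x in zip(partial, vector))


s = [3, -2, 5]  # hypothetical pre-added input samples
print(direct_output(s, taps_from_factorization(M, v)), two_plane_output(s, M, v))
```

In this toy case the direct form needs three multiplications and the two-plane form needs two, which mirrors on a small scale the reduction obtained when the vector is shorter than the tap list.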
B. LCA
The architecture shown in Fig. 3 reduces the number of multiplications from 9 to 5. Concentrating on the values , , , , and (vector plane) it is possible to further reduce the number of multiplications. Considering the real values obtained solving (5):
, and , we can reduce the number of multiplications approximating and values on a very small number of bits. As suggested in [9], better performance can be achieved by keeping the zeros of the original filters as close as possible to their original positions. Extensive simulations show that and values can be approximated as , , , and . These values can be obtained starting from , , , and binary representation on 16 bits and then approximating them on a small number of bits, while ensuring that the zero positions remain almost the same as those of the original filters. In Fig. 4, the zero positions of the original and the approximated filters are shown. As can be observed, the zeros of the approximated filter are very close to the original ones. This approximation has a positive impact from the architectural point of view. In fact, with the proposed approximation we can modify the architecture shown in Fig. 3 to obtain the LCA. In Fig. 5, the LCA is depicted, where the multiplications have been replaced with additions (vector plane).
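The replacement of a multiplication with additions can be sketched in software as follows. A fixed-point coefficient is first written as a sum or difference of powers of two (SPT), here using the non-adjacent signed-digit form, and the product is then computed with shifts, additions, and subtractions only. The coefficient 7/8 and the function names are hypothetical and do not correspond to the approximated values of the paper.

```python
def to_spt(value):
    """Signed digits (-1, 0, +1) of an integer, least significant first.

    Uses the non-adjacent form, in which no two neighbouring digits are nonzero.
    """
    digits = []
    while value != 0:
        if value & 1:
            d = 2 - (value & 3)  # +1 if value % 4 == 1, -1 if value % 4 == 3
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits


def shift_add_multiply(sample, digits, frac_bits):
    """Multiply an integer sample by sum(d * 2**i) / 2**frac_bits without a multiplier."""
    acc = 0
    for i, d in enumerate(digits):
        if d > 0:
            acc += sample << i
        elif d < 0:
            acc -= sample << i
    return acc >> frac_bits  # arithmetic shift, rounds towards minus infinity


digits = to_spt(7)                      # 7 = 8 - 1, i.e. two nonzero SPT terms
nonzero_terms = sum(1 for d in digits if d != 0)
print(digits, nonzero_terms, shift_add_multiply(16, digits, 3))
```

Each nonzero SPT term costs one adder or subtractor, so the number of nonzero terms is a direct measure of hardware complexity.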
C. VLCA
Starting from Fig. 5, it is possible to further reduce the architecture complexity by collapsing together some of the partial results. This operation can be obtained writing and as functions of with . More precisely , , , and , we can build the equivalent filters and as or (see Table I). The implementation of and leads to the architecture shown in Fig. 6, where multipliers are not employed and the number of adders is reduced with respect to the architecture shown in Fig. 5. It is worth noting that and can be represented on 9 bits as two's complement numbers, with 1 bit for the sign and the integer part and 8 bits for the fractional part. However, and are slightly different from the filters we would obtain by quantizing the original and on 9 bits ( and ). This difference affects filter performance, as detailed in Section IV, since the position of and zeros is near and zeros, whereas the position of and zeros is rather far from and ones. Compared to the best solution proposed in [9], our solution exhibits and . However, exploiting filter symmetry, the complexity of the proposed architecture depends only on the SPT of taps with non-negative index ( , , , and , , , ). Thus, the proposed VLCA needs only and . Moreover, the architecture depicted in Fig. 6, by exploiting some SPT terms that are common to both and , further reduces the amount of hardware required. In fact, as shown in Table I, many terms are common to both and , namely both and need the terms .
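The 9-bit format mentioned above (1 bit for the sign and the integer part, 8 bits for the fractional part) can be sketched as a plain rounding to multiples of 2^-8. The input values below are hypothetical tap-like numbers chosen for illustration, and the function names are assumptions; this shows direct quantization only, not the derivation of the proposed filters.

```python
FRAC_BITS = 8
TOTAL_BITS = 9


def quantize(x):
    """Round x to the nearest multiple of 2**-FRAC_BITS and return the integer code."""
    code = round(x * (1 << FRAC_BITS))
    lo, hi = -(1 << (TOTAL_BITS - 1)), (1 << (TOTAL_BITS - 1)) - 1
    if not lo <= code <= hi:
        raise ValueError("value does not fit in the 9-bit two's complement format")
    return code


def to_bits(code):
    """9-bit two's complement pattern of the integer code."""
    return format(code & ((1 << TOTAL_BITS) - 1), "09b")


def to_real(code):
    """Real value represented by the integer code."""
    return code / (1 << FRAC_BITS)


for x in (0.602949, -0.078223):  # hypothetical values
    c = quantize(x)
    print(x, c, to_bits(c), to_real(c))
```

The representable range of this format is [-1, 1 - 2^-8], and away from rounding ties the quantization error is below 2^-9.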
IV. EXPERIMENTAL RESULTS
The proposed VLCA has been tested inside the JPEG2000 image coding standard framework [5]. A free JPEG2000 codec written in C, openjpeg [12], which is Class-1 Profile-1 compliant with the standard, has been employed for our tests. Five standard images have been used: "Lenna" 256 × 256 (img1), "Barbara" 512 × 512 (img2), "Boat" 512 × 512 (img3), "Goldhill" 512 × 512 (img4), and "Fingerprint" 512 × 512 (img5) [13]. The number of DWT decomposition levels ( ) has been varied from 1 to 3 for 256 × 256 images and from 1 to 4 for 512 × 512 images. This corresponds to , where is the number of DWT resolution levels required by openjpeg. Different compression ratios ( ) have been imposed, namely 1:1, 8:1, 16:1, 32:1, and 64:1; precinct and code-block sizes are the encoder default values.

First, we evaluated the performance of the original openjpeg implementation, in terms of peak signal-to-noise ratio (PSNR), for the different and values on the aforementioned images. Then, we replaced the standard 9/7 LS implementation of the encoder with the proposed very low-complexity FB, leaving the standard 9/7 LS at the decoder [13] (original openjpeg decoder). Finally, we employed the and filters to show the loss of quality with respect to the proposed solution (see Table I). To obtain the results shown in Table II, an FB implementation must be employed. Since the equivalent LS (obtained converting and ) is based on divisions [7], it requires a higher number of fractional bits. This is a critical aspect in fixed-point DWT implementations such as openjpeg.

The results shown in Table II prove that the proposed very low-complexity 9/7 filters are compatible with the JPEG2000 image coding standard. In fact, given a JPEG2000 bit-stream generated via the proposed very low-complexity DWT, a standard JPEG2000 decoder can decode it with high quality in terms of PSNR even at . This advantage stems from the position of our FB zeros. In fact, as shown in Fig. 4, they are extremely close to the original 9/7 zeros.

Moreover, the DA and the VLCA have been implemented in VHDL and synthesized on a 0.13-μm standard-cell technology. Since all the proposed implementations have in common the first additions (the shaded part in Figs. 2, 3, 5, and 6), the with have been considered as the input signals of the architectures. To make the comparison fair, the architectures have been implemented as combinational blocks. Even though this choice cannot achieve high clock frequencies, it guarantees that no further complexity in terms of sequential elements is added to the design. In fact, with registers the logic synthesizer could perform retiming operations, which would make the comparison unfair. Thus, the results obtained with the logic synthesizer design_compiler (by Synopsys) actually represent the complexity of the three architectures.

In Table III, post-synthesis results are shown. The proposed architectures produce both a low-pass and a high-pass coefficient every clock cycle. Therefore, for an image clock cycles are required to perform the 1-D DWT. Considering that openjpeg represents the 9/7 taps on 13 bits, it can be observed that the proposed LCA shows interesting figures both in terms of complexity and power consumption. In fact, compared to a DA, the proposed architecture shows nearly the complexity of a 9-bit DA, with the performance of a 13-bit DA. However, for the sake of completeness, Table III shows results for different data widths. Namely, samples are considered to be represented on and bits, whereas taps for the DA on and bits.

Finally, to compare the proposed VLCA with other architectures, the competitive multiplierless solution proposed in [11] has been implemented: both the former and the latter show the same latency ( ). Since [11] is derived from a 13-bit FB, it shows the same PSNR as the original openjpeg model. Table III shows that the proposed VLCA has a reduced complexity even compared with [11].
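For reference, the PSNR metric used in the quality comparisons of this section can be sketched as follows, assuming the usual definition for 8-bit images with a peak value of 255. The function name and the flat pixel-list interface are assumptions made for illustration; the exact routine used to produce Table II is not specified here.

```python
import math


def psnr(reference, test, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)


original = [0, 255, 128, 64]   # hypothetical pixel values
decoded = [1, 254, 129, 63]    # every pixel off by one, so the MSE is 1
print(psnr(original, decoded))
```

With a mean squared error of 1 the function returns 10 log10(255^2), about 48.13 dB.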
V. CONCLUSION
In this brief, a very low-complexity, efficient implementation of the 9/7 wavelet filters has been derived. Very high quality can be achieved by employing the proposed architecture in a low-complexity JPEG2000 encoder. Moreover, the proposed VLCA shows noteworthy figures in terms of complexity and power consumption.
