This work presents a scalable Discrete Wavelet Transform architecture based on the B-spline factorization. In particular, we show that several wavelet filters of practical interest have a common structure in the distributed part of their B-spline factorization. This common structure is exploited to achieve scalability and to save multipliers compared with a direct polyphase B-spline implementation. Since the proposed solution is more robust to coefficient quantization than the direct polyphase B-spline implementation, it allows a further complexity reduction. Synthesis results are reported for a 130 nm CMOS technology to enable accurate comparison with other implementations. Moreover, the performance of the new wavelet transform architecture, integrated in a complete JPEG2000 model, has been evaluated on several images.
I. INTRODUCTION
Filter bank (FB) [1] and lifting scheme (LS) [2], along with its flipping structure (FS) form [3], are the most common solutions to implement the discrete wavelet transform (DWT). A novel approach to the design of DWT architectures, based on the B-spline (BS) factorization, is proposed in [4] to reduce the number of required multipliers. As detailed in [4], the gate count of the BS architecture for the 9/7, 6/10 and 10/18 filters is significantly reduced compared with the corresponding FB or LS implementations. In this work, we propose a new BS architecture that offers scalability and complexity advantages with respect to the solution given in [4].
The BS approach is based on factorizing each DWT as
where H(z) and G(z) are the Z-domain representations of the analysis low-pass and high-pass filters, respectively, and Q(z) and R(z) are referred to as the filter distributed part. In (1) and (2), H_BS(z) and G_BS(z) account for the γ_H and γ_G zeros of H(z) and G(z) at z=−1 and z=1, respectively. As pointed out in [4], a direct polyphase implementation of H_BS(z) and G_BS(z), obtained by cascading γ_H (γ_G) multiplierless stages (see Fig. 2(a)), is preferred to the Pascal expression for long-tap filters.
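The multiplierless cascade can be illustrated with a minimal Python sketch. It assumes H_BS(z) = (1 + z^-1)^γ_H and G_BS(z) = (1 − z^-1)^γ_G, which is what γ_H zeros at z=−1 and γ_G zeros at z=1 imply, and it uses a plain (non-polyphase) sample-by-sample form for readability. The function names are illustrative and do not come from [4].

```python
from math import comb


def first_order_stage(x, sign=+1):
    """Compute y[n] = x[n] + sign * x[n-1] with one adder and one delay."""
    prev = 0
    y = []
    for sample in x:
        y.append(sample + sign * prev)
        prev = sample
    return y


def bspline_cascade(x, gamma, sign=+1):
    """Cascade gamma first-order stages, i.e. (1 + sign * z^-1)^gamma."""
    for _ in range(gamma):
        x = first_order_stage(x, sign)
    return x


def impulse_response(gamma, sign=+1):
    """Impulse response of the cascade (gamma + 1 taps)."""
    impulse = [1] + [0] * gamma
    return bspline_cascade(impulse, gamma, sign)


def pascal_taps(gamma, sign=+1):
    """Reference taps from the binomial (Pascal) expansion."""
    return [comb(gamma, k) * sign ** k for k in range(gamma + 1)]


if __name__ == "__main__":
    print(impulse_response(5))       # low-pass style cascade, zeros at z = -1
    print(impulse_response(9, -1))   # high-pass style cascade, zeros at z = +1
```

With γ=5 the cascade reproduces the Pascal coefficients 1, 5, 10, 10, 5, 1 using additions only, which is why the BS part needs no multiplier.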
On the other hand, the implementation of the distributed part, (3) and (4), requires multiplications [4]. Several works in the literature address the multiplierless implementation of the DWT. For example, [5], [6], [7] deal with the FB DWT, [5], [8], [9] with the LS/FS DWT, and [10] with the BS DWT. In particular, [10] proposes the use of the Canonic Signed Digit representation to reduce the complexity of the distributed terms in BS-based architectures. However, only [4] and [10] investigate BS architectures, which, as shown in [4], feature a reduced number of multipliers compared with the FB and LS approaches. Moreover, none of the solutions proposed in the literature exploits the algebraic properties of the distributed part to further reduce the complexity of the DWT. As a first step, this work shows, in Section II, that the distributed part has a common processing structure. Consequently, the scientific contribution of this work is to detail how this structure allows for (i) a lower number of multiplications, (ii) scalability, and (iii) robustness to coefficient quantization with respect to the direct polyphase BS implementation. These three aspects are detailed in Sections III and IV. In particular, in Section IV, the robustness to coefficient quantization is demonstrated by experimental results obtained by integrating the proposed solution into the verification model [11] of JPEG2000, the latest international image compression standard.
II. PROPOSED ARCHITECTURE
As proved in [12], several DWT filters of practical interest in image compression are obtained from
where H̃(ξ) is the low-pass synthesis filter (G(z)=H̃(−z) and z=e^{jξ}), 2l=γ_H+γ_G and θ=[sin(ξ/2)]^2. We obtain (1) and (2) from (5) by using the following factorization
Significant examples of the filters derived from (5) are the ones considered in [4], namely the 9/7, the 6/10 and the 10/18. These filters are obtained by proper spectral factorization with 2l=8 for the 9/7 and the 6/10, and 2l=14 for the 10/18.
Since Φ_{l−1}(θ) is a polynomial with real coefficients, its roots are either real (r) or occur in complex conjugate pairs (c, c*).
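The root structure can be checked numerically. The sketch below assumes that Φ_{l−1}(θ) is the polynomial Σ_{k=0}^{l−1} C(l−1+k, k) θ^k commonly used for this filter family. This explicit form is an assumption, since it is not written out in the text, and the helper names are illustrative.

```python
from math import comb

import numpy as np


def phi_coefficients(l):
    """Coefficients of the assumed Phi_{l-1}(theta), lowest power first."""
    return [comb(l - 1 + k, k) for k in range(l)]


def classify_roots(l, tol=1e-9):
    """Split the roots into real ones and one representative per conjugate pair."""
    coeffs = phi_coefficients(l)
    roots = np.roots(coeffs[::-1])  # np.roots expects the highest power first
    real_roots = [z.real for z in roots if abs(z.imag) < tol]
    pair_representatives = [z for z in roots if z.imag > tol]
    return real_roots, pair_representatives


if __name__ == "__main__":
    for l in (4, 7):  # 2l = 8 and 2l = 14
        real_roots, pairs = classify_roots(l)
        print(f"2l={2 * l}: {len(real_roots)} real root(s), {len(pairs)} conjugate pair(s)")
```

Under this assumption, `classify_roots(4)` returns one real root and one conjugate pair (2l=8), while `classify_roots(7)` returns no real root and three conjugate pairs (2l=14).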
We can then write Q(z) and R(z) in the form
where L_r(z) and
with a=c·c*, b=c+c*. Implementing (10) and (12) requires five multipliers, whereas (14) and (15) can be implemented as shown in Fig. 1(a) and Fig. 1(b), with a total of three multipliers. Low-pass and high-pass results are obtained by selectively adding or subtracting the odd-power terms in L_r(z) and W_{a,b}(z) (lp/hp signal in Fig. 1). Furthermore, Fig. 1(c) shows that both L_r(z) and W_{a,b}(z) can be implemented as a single module, LW(z), by resorting to two multiplexers driven by the LW signal.
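As a rough illustration of why three constants suffice, the sketch below works in the variable θ rather than z. It assumes the degree-3 polynomial 1 + 4θ + 10θ² + 20θ³ (the 2l=8 case of the form commonly used for this filter family, which is not written out in the text) and the normalized factors 1 − θ/r and 1 − (b/a)θ + (1/a)θ², whose only non-trivial constants are 1/r, b/a and 1/a. These forms and all function names are assumptions made for illustration and are not the expressions (14) and (15).

```python
import numpy as np

# Assumed polynomial for 2l = 8, lowest power of theta first.
PHI_3 = [1.0, 4.0, 10.0, 20.0]


def three_constants(coeffs_low_first, tol=1e-9):
    """Return (1/r, b/a, 1/a) from the real root r and the conjugate pair c, c*."""
    roots = np.roots(coeffs_low_first[::-1])  # highest power first for np.roots
    r = next(z.real for z in roots if abs(z.imag) < tol)
    c = next(z for z in roots if z.imag > tol)
    a = (c * c.conjugate()).real  # a = c * c^*
    b = (c + c.conjugate()).real  # b = c + c^*
    return 1.0 / r, b / a, 1.0 / a


def rebuild(inv_r, b_over_a, inv_a):
    """Multiply (1 - theta/r) by (1 - (b/a) theta + (1/a) theta^2)."""
    first_order = [1.0, -inv_r]
    second_order = [1.0, -b_over_a, inv_a]
    return np.convolve(first_order, second_order)  # lowest power first


if __name__ == "__main__":
    constants = three_constants(PHI_3)
    print("1/r, b/a, 1/a =", constants)
    print("rebuilt coefficients:", rebuild(*constants))
```

Multiplying the two normalized factors gives back the assumed polynomial, so in this sketch the real root costs one constant and the conjugate pair costs two.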
However, since the BS terms are in polyphase form and the distributed part is in non-polyphase form, the BS output has to be adapted to the distributed part input by means of registers (see Fig. 1(d)). Moreover, registers are required when more L_r(z) or W_{a,b}(z) stages are cascaded to implement Q(z) and R(z), as in the case of the 10/18 filters, where the output of the first stage (x) becomes the input of the second stage (see Fig. 1(e) and Fig. 2(b)).
III. RESULTS
In this work we analyze the filters considered in [4]: the 9/7, 6/10 and 10/18 wavelet filters, whose BS part is
and (γ_H^{10/18}=5, γ_G^{10/18}=9). The 9/7 and 6/10 wavelet filters derive from (5) with 2l=8, and
has only a real root r and a pair of complex conjugate roots c, c* that lead to
we can infer that the 9/7 and 6/10 architectures have the same complexity. On the other hand, the 10/18 wavelet filters are obtained from (5) with 2l=14 and
whose roots are three pairs of complex conjugates. Denoting by c_0, c_0* and c_2, c_2* the pairs with minimum and maximum modulus, respectively, we obtain
where a_i=c_i·c_i* and b_i=c_i+c_i*.
To prove the effectiveness of our methodology, we described in VHDL both the BS architectures detailed in [4] and the proposed ones, and synthesized them on a 0.13 µm standard cell technology with Synopsys Design Compiler. The architecture bit-width is the same as that employed in [4], namely all internal bit-widths are 16 bits. It is worth pointing out that these values are obtained by synthesizing the basic blocks as stand-alone components, whereas the gate count for the whole BS DWT architectures is obtained by fixing the target clock frequency and enabling the optimization options of the logic synthesizer. As detailed in Table I, the proposed methodology, compared with [4], reduces the number of multipliers while slightly increasing the number of adders and keeping the same number of registers for the 9/7 and 6/10 filters and nearly the same for the 10/18 filters. The gate count of the whole BS DWT architectures synthesized with a 200 MHz clock frequency is given in the sixth column of Table I. It is worth pointing out that the complexity figures detailed in Table I include the h_0, g_0 products in (1), (2), whereas these products are not considered in [4] (Tables I, II, III, IV).
In order to better highlight the critical path and timing of the proposed architecture, we also performed logic synthesis constraining the area to be minimal and leaving to the synthesizer the burden of finding the best possible clock period. This new set of results, shown in the third and fourth columns of Table II, confirms the effectiveness of the proposed architecture in reducing not only the complexity but also the critical path.
Finally, to prove the scalability of the proposed approach, we implemented two architectures that support on-line switching among the 9/7, 6/10 and 10/18 filters. Both architectures require multiplexers in the BS part to support the aforementioned filters. As far as the distributed part is concerned, the first architecture is derived from the BS solution in [4]: it supports Q_{10/18}(z) and R_{10/18}(z), and shorter filters are obtained by setting unused taps to zero. The second architecture, depicted in Fig. 2(b), is based on the proposed approach and employs two W_{a,b}(z) modules and the flexible LW(z) module shown in Fig. 1(b) and Fig. 1
IV. QUANTIZATION OF FILTER COEFFICIENTS
Further complexity can be saved by choosing the proper number of bits to represent filter coefficients. To this purpose the proposed solution was integrated into the lossy convolution-based mode of the JPEG2000 verification model [11] . Experimental simulations were performed on five standard images, namely 'Lenna' 256×256 (img1), 'Boat' 512×512 (img2), 'Goldhill' 512×512 (img3), 'Barbara' 512×512 (img4) and 'Fingerprint' 512×512 (img5).
The number of DWT decomposition levels (L) has been varied from 1 to 3 for the 256×256 image and from 1 to
obtained by quantizing Q_i and R_i, whereas the dashed lines detail the values achieved by quantizing 1/r, b/a, 1/a. As can be observed, the curves for the 9/7 and 6/10 filters nearly overlap. Since representing Q_i and R_i with k < 2 causes H(z) and G(z) to degenerate into band-pass filters, the solid-line simulations have been carried out for k ∈ [2, 16]. Conversely, the proposed solution with k=0 (only the integer part of 1/r, b/a, 1/a) introduces a maximum PSNR degradation of about 1 dB for the 9/7 and 6/10 filters and of about 3.5 dB for the 10/18 filters.
As can be inferred from Fig. 3, when k < 10 the quantization of Q_i and R_i leads to a significant performance loss. On the other hand, the quantization of 1/r, b/a, 1/a worsens the PSNR only when k < 6.
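The effect of keeping k fractional bits can be sketched as follows. Truncation toward zero is assumed here because k=0 is described as keeping only the integer part. The rounding mode actually used, the function names and the sample constants are illustrative assumptions, not values taken from the experiments.

```python
import math


def quantize(value, k):
    """Keep k fractional bits of value, truncating toward zero (assumed mode)."""
    scale = 1 << k
    return math.trunc(value * scale) / scale


def quantization_errors(constants, k):
    """Absolute error introduced on each constant when k fractional bits are kept."""
    return [abs(c - quantize(c, k)) for c in constants]


# Hypothetical stand-ins for 1/r, b/a and 1/a (not taken from the paper).
EXAMPLE_CONSTANTS = [-2.92, -1.08, 6.84]

if __name__ == "__main__":
    for k in (0, 4, 6):
        quantized = [quantize(c, k) for c in EXAMPLE_CONSTANTS]
        worst = max(quantization_errors(EXAMPLE_CONSTANTS, k))
        print(f"k={k}: {quantized}, worst error {worst:.4f}")
```

Each additional fractional bit halves the worst-case truncation error bound (2^−k), at the cost of a wider multiplier operand.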
In Table IV we show for the 9/7 filters the PSNR obtained by averaging the mean square error values achieved for the five test images belonging to J img . The simulation parameters have been changed in the following ranges:
The quantization of Q_i and R_i leads to significant PSNR degradation mainly for ρ=1 bpp or higher when k≤8 (ΔPSNR ≥ 1.2 dB). On the contrary, the proposed solution keeps the PSNR degradation below 0.5 dB with k=4. Similarly, Tables V and VI show the results obtained for the 6/10 and 10/18 filters, respectively, using the same setup employed for the 9/7 filters. As can be observed, the proposed approach leads to excellent results also with the 6/10 and 10/18 wavelet filters.
(Fig. 3). On the other hand, we can obtain nearly the same performance with the proposed solution and k=4. To that purpose, we performed a new logic synthesis for a target clock frequency of 200 MHz using 16-by-13 multipliers (k=9) and 16-by-16 multipliers (k=9) to represent Q_i and R_i for the 9/7-6/10 and 10/18 filters, respectively. Similarly, we used 16-by-8 multipliers (k=4) and 16-by-9 multipliers (k=4) for the proposed 9/7-6/10 and 10/18 architectures, respectively. As shown in the seventh column of Table I
V. CONCLUSION
In this work we propose a scalable BS DWT architecture that employs a reduced number of multipliers.
Implementation results on a 0.13 µm standard cell technology prove the complexity reduction offered by the proposed methodology. Finally, simulations within a JPEG2000 model show that the proposed methodology is very robust to filter coefficient quantization, leading to a further complexity reduction.
