Abstract. Based on B-spline factorization, a new category of architectures for the Discrete Wavelet Transform (DWT) is proposed in this paper. The B-spline factorization mainly consists of the B-spline part and the distributed part. The former can be constructed with either the direct implementation or the Pascal implementation. The latter is the part that introduces multipliers, and it can be implemented with the Type-I or Type-II polyphase decomposition. Since the degree of the distributed part is usually designed to be as small as possible, the proposed architectures can use fewer multipliers than previous designs, at the cost of more adders. However, many of these adders can be implemented with smaller area and lower speed because only a few adders are on the critical path. Three case studies, namely the JPEG2000 default (9, 7) filter, the (6, 10) filter, and the (10, 18) filter, are given to demonstrate the efficiency of the proposed architectures.
Introduction
DWT has been developed as an efficient and powerful tool for signal analysis, image compression, and even scalable video coding recently [1] . Because a huge amount of computation would be required, many VLSI architectures have been proposed, which are mainly based on convolution scheme [2] [3] [4] and lifting scheme [5] [6] [7] . The convolution-based architecture is to implement two-channel filter banks directly, and many VLSI DSP techniques, such as polyphase decomposition [8] , pipelining, and retiming [9] , have been adopted to enhance the performance. On the other hand, the lifting scheme is used to express the two-channel filter banks in a new way [10] . In [11] , a systematic method is proposed to factorize the polyphase matrix into many lifting steps based on the perfect reconstruction property. The lifting scheme usually requires fewer multipliers and adders than the convolution scheme.
However, the intrinsic B-spline property of DWT has not been used to construct VLSI architectures in the literature. According to [12], any DWT filter can be factorized into the B-spline part and the distributed part. The B-spline part contributes to all important wavelet properties, and the distributed part is used to design DWT FIR filters. Since only the distributed part requires multipliers, the B-spline factorization could use fewer multipliers than the lifting scheme, but it requires more adders.
In this paper, we propose to implement DWT based on the B-spline factorization. The B-spline part is constructed with the direct implementation or the Pascal implementation. The latter could reduce the number of adders, but it could become too complex when the filter is long. The distributed part could be implemented with the Type-I or Type-II polyphase decomposition, and all conventional filter implementation methods can be applied. Three case studies are given to examine the efficiency. However, the principal objective of this paper is to motivate a new category of DWT architectures.
The organization of this paper is as follows. Section 2 reviews previous arts of DWT architectures. The B-spline factorization theory is described in Section 3, and the proposed architectures are presented in Section 4. The case studies of the JPEG2000 default (9, 7) filter, the (6, 10) filter [5] , and the (10, 18) filter [13] , are given in Section 5. Finally, a summary is given to conclude this paper in Section 6.
Previous DWT Architectures
This section introduces previous DWT architectures and classifies DWT architectures into three categories.
Convolution-Based
The multiresolution DWT analysis can be viewed as a cascade of several two-channel filter banks [14], and the analysis filter bank is shown in Fig. 1, where $H(z)$ and $G(z)$ are the lowpass and highpass filters, respectively. The convolution-based architectures implement DWT with the direct structures of two-channel filter banks. Many VLSI DSP design techniques, such as folding, unfolding, and pipelining [9], can be adopted to implement the pair of lowpass and highpass filters. In particular, the convolution-based architecture can be constructed by use of polyphase decomposition [8] as shown in Fig. 2, where

$$H(z) = H_e(z^2) + z^{-1} H_o(z^2), \qquad G(z) = G_e(z^2) + z^{-1} G_o(z^2)$$

if Type-I decomposition is used, and

$$H(z) = H_e(z^2) + z\,H_o(z^2), \qquad G(z) = G_e(z^2) + z\,G_o(z^2)$$

if Type-II decomposition is used. The Type-I and Type-II decompositions are illustrated in Fig. 3. The four filters in Fig. 2 can then be implemented by serial or parallel filters. In this convolution-based scheme, the lowpass and highpass filters are considered independently.
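The equivalence between filtering followed by downsampling and the polyphase form can be checked numerically. The following is a minimal sketch, assuming the Type-I convention $H(z) = H_e(z^2) + z^{-1} H_o(z^2)$ and zero initial conditions; the function names `analysis_direct` and `analysis_type1` are illustrative and do not come from the paper.

```python
import numpy as np

def analysis_direct(x, h):
    """Reference path: full-rate FIR filtering followed by downsampling by 2."""
    return np.convolve(x, h)[::2]

def analysis_type1(x, h):
    """Polyphase path, assuming H(z) = He(z^2) + z^-1 Ho(z^2) (Type-I).

    Assumes h has at least two taps so that both phases are non-empty.
    """
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    he, ho = h[0::2], h[1::2]                 # even and odd filter phases
    xe = x[0::2]                              # x[2n]
    xo = np.concatenate(([0.0], x[1::2]))     # x[2n-1], zero before the start
    ye = np.convolve(xe, he)                  # half-rate filtering, even branch
    yo = np.convolve(xo, ho)                  # half-rate filtering, odd branch
    y = np.zeros(max(len(ye), len(yo)))
    y[:len(ye)] += ye
    y[:len(yo)] += yo
    return y
```

Both branches run at half the input rate, which is why the polyphase form is the usual starting point for full hardware utilization.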
Lifting-Based
On the other hand, the lifting scheme [10] has been widely used to reduce the required multiplications and additions by exploiting the relation between the lowpass and highpass filters. According to [11], any DWT filter bank with perfect reconstruction can be decomposed into a finite sequence of lifting steps. This decomposition corresponds to a factorization of the polyphase matrix of the target wavelet filter into a sequence of alternating upper and lower triangular matrices and a constant diagonal matrix, which can be expressed as follows:

$$P(z) = \left\{ \prod_{i=1}^{m} \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix} \right\} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$

where $P(z)$ is the polyphase matrix, $s_i(z)$ and $t_i(z)$ are Laurent polynomials, and $K$ is a constant. Most of the lifting-based architectures proposed in the literature implement the above lifting factorization directly [5, 6]. Although the lifting scheme has many advantages, such as fewer arithmetic operations and in-place implementation, the potentially long critical path is a drawback for hardware implementation. In [7], this timing problem is discussed in detail and addressed by use of the flipping structure, instead of pipelining.
Classification
As mentioned above, the general two-channel filter banks can be implemented with the convolution scheme. If the two-channel filter bank possesses the perfect reconstruction (PR) property, it could be implemented with fewer arithmetic operations by use of lifting-based architectures. DWT can be implemented with the above two schemes because it can be viewed as a two-channel filter bank of perfect reconstruction property.
However, the B-spline factorization property, which is an important property of DWT and will be described in the next section, has not been used to construct efficient architectures in the literature. Thus, DWT architectures can be categorized as shown in Fig. 4, where DWT is only a subset of convolution- and lifting-based architectures.
B-Spline Factorization
According to [12], the lowpass filter $H(z)$ and the highpass filter $G(z)$ can be factorized as

$$H(z) = (1 + z^{-1})^{\gamma_H} \cdot Q(z) \cdot h_0, \qquad G(z) = (1 - z^{-1})^{\gamma_G} \cdot R(z) \cdot g_0, \qquad (2)$$

where the first, second, and third terms of the right-hand side can be called the B-spline part, distributed part, and normalization part, respectively. Based on the B-spline factorization, the output of the highpass filter can be viewed as the $\gamma_G$-th order difference of the smoothed input signals. There are two differences between expression (2) and the expression in [12]. The first is that we treat $1 \pm z^{-1}$ as the B-spline part, instead of $(1 + z^{-1})/2$. The second is the normalization part, which is extracted in this paper only for implementation reasons.
The B-spline part is responsible for all important properties of DWT, such as order of approximation, reproduction of polynomials, vanishing moments, and the multiscale differentiation property. The distributed part is used to derive efficient FIR DWT filters [12]. Thus, the order of the distributed part is usually designed to be as small as possible when the order of the B-spline part is given. The normalization part can be implemented independently of the other two parts and can be merged with the subsequent quantization if image compression is needed. It is very similar to the normalization step in the lifting scheme.
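To make the factorization concrete, the B-spline part of a given filter can be found by repeatedly dividing out $(1 \pm z^{-1})$ while the remainder vanishes. The sketch below is only an illustration: the helper name `split_bspline` is hypothetical, and the example filter is the LeGall (5, 3) lowpass filter, which is not one of the filters studied in this paper but has conveniently exact coefficients.

```python
import numpy as np

def split_bspline(h, sign=+1, tol=1e-9):
    """Divide out factors (1 + sign * z^-1) from the tap vector h.

    Returns (gamma, rest): the number of factors removed and the remaining
    taps, i.e. the distributed part multiplied by the normalization constant.
    """
    rest = np.asarray(h, dtype=float)
    factor = np.array([1.0, float(sign)])
    gamma = 0
    while len(rest) > 1:
        quotient, remainder = np.polydiv(rest, factor)
        if np.max(np.abs(remainder)) > tol:
            break                      # no further zero at z = -sign
        rest, gamma = quotient, gamma + 1
    return gamma, rest

# LeGall (5, 3) lowpass taps (assumed example, not from the paper).
legall_lowpass = np.array([-1.0, 2.0, 6.0, 2.0, -1.0]) / 8.0
gamma_h, rest_h = split_bspline(legall_lowpass, sign=+1)
print(gamma_h, rest_h)
```

For this example the routine removes two factors of $(1 + z^{-1})$ and leaves the three taps $(-1, 4, -1)/8$, so only the short remainder would need multipliers.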
Proposed B-Spline Factorized Architecture
We propose to implement DWT by using the B-spline factorization of Eq. (2). For 100% hardware utilization, the polyphase decomposition is adopted first. After the Type-I or Type-II polyphase decomposition, the general B-spline factorized architecture can be expressed as in Fig. 5, where the distributed parts, $Q(z)$ and $R(z)$, are decomposed first, and the remainder is the B-spline part. The distributed part is the only part with multipliers, and its four filters can be implemented by serial or parallel filters. Since the normalization part, $h_0$ and $g_0$, can be implemented independently of the other two parts, it is excluded from the following discussion. Below we introduce two implementation methods for the B-spline part.
Direct Implementation of the B-Spline Part
The direct implementation of the B-spline part is straightforward. The concept is to implement $(1 + z^{-1})$ and $(1 - z^{-1})$ first, and then to construct the B-spline parts by serially connecting $(1 + z^{-1})$ and $(1 - z^{-1})$ stages. However, two-input-two-output structures of $(1 + z^{-1})$ and $(1 - z^{-1})$ cannot be derived from polyphase decomposition. We propose to implement them by considering the physical connection of signals as shown in Fig. 6, where we assume the Type-I decomposition is used so that the even signals are prior to the odd signals. Thus, the direct implementation requires $2\gamma_H + 2\gamma_G$ adders for a pair of lowpass and highpass outputs. When connecting the B-spline part to the distributed part, the priority of signals needs to be handled carefully.
Another problem to be solved is the internal signal wordlength. Since the DC gain of $(1 + z^{-1})$ is 2, the signal magnitude can double after every $(1 + z^{-1})$ stage, and the same holds for every $(1 - z^{-1})$ stage. However, implementing $(1 \pm z^{-1})/2$ instead would lose too much precision. The precision and wordlength issues should be handled carefully according to the given precision criteria. In this paper, a simple method is used: we scale down the signal by 2 after every two $(1 \pm z^{-1})$ stages to preserve precision and prevent signal overflow.
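The cascade and the scaling rule described above can be sketched in software as follows. This is a single-rate behavioural model under stated assumptions (integer samples, zero initial state, truncating right shift for the division by 2), not the two-input-two-output hardware structure of Fig. 6; the function name `bspline_direct` is illustrative.

```python
import numpy as np

def bspline_direct(x, gamma, sign=+1):
    """Apply (1 + sign * z^-1)^gamma as gamma cascaded two-tap stages.

    Each stage costs one addition per sample. After every second stage the
    signal is shifted right by one bit, following the scaling rule in the text.
    """
    y = np.asarray(x, dtype=np.int64)
    for stage in range(1, gamma + 1):
        delayed = np.concatenate(([0], y[:-1]))   # z^-1 with zero initial state
        y = y + sign * delayed
        if stage % 2 == 0:
            y = y >> 1                            # scale down by 2 (truncating)
    return y

# A constant input shows the DC gain: 2^4 from four stages, divided by 2^2.
dc_out = bspline_direct(np.full(32, 64), gamma=4, sign=+1)
print(dc_out[-1])
```

With this rule the net growth over four stages is a factor of 4 instead of 16, which is the trade-off between wordlength and precision that the paragraph above describes.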
Pascal Implementation of the B-Spline Part
Instead of the direct implementation, we also propose the Pascal implementation, which exploits the similarity of the two B-spline parts to reduce adders. The Pascal implementation expresses $(1 + z^{-1})^{\gamma_H}$ and $(1 - z^{-1})^{\gamma_G}$ by their Pascal (binomial) expansions and avoids the repeated computation. For example, $1 + 6z^{-2} + z^{-4}$ and $4z^{-1} + 4z^{-3}$ can be computed first for the implementation of $(1 \pm z^{-1})^4$. Then their sum is $(1 + z^{-1})^4$, and their difference is $(1 - z^{-1})^4$. Furthermore, the integer multiplications of the B-spline part can be implemented with shifters and adders, instead of multipliers. In this example, the Pascal implementation requires only 12 adders, whereas the direct implementation needs 16 adders. However, the Pascal implementation of long-tap filters is too complex to derive, and the complexity reduction is not guaranteed. The precision and wordlength issues are also more complex than those of the direct implementation. In this paper, we preserve as much precision as possible for the given internal wordlength.
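The fourth-order example can be written out as a short behavioural sketch. It is a single-rate model with zero initial state, and the helper names `delay` and `pascal_pair_order4` are illustrative; the shift-and-add decomposition of the constant 6 is one possible choice, not necessarily the one used in Fig. 7.

```python
import numpy as np

def delay(x, k):
    """Return x delayed by k samples (z^-k), keeping the original length."""
    return np.concatenate((np.zeros(k, dtype=np.int64), x))[:len(x)]

def pascal_pair_order4(x):
    """Compute (1 + z^-1)^4 x and (1 - z^-1)^4 x from shared partial sums."""
    x = np.asarray(x, dtype=np.int64)
    d2 = delay(x, 2)
    # Even-power terms: 1 + 6 z^-2 + z^-4, with 6 = 4 + 2 realised by shifts.
    even = (x + delay(x, 4)) + ((d2 << 2) + (d2 << 1))
    # Odd-power terms: 4 z^-1 + 4 z^-3, one addition and one shift.
    odd = (delay(x, 1) + delay(x, 3)) << 2
    return even + odd, even - odd
```

In this sketch each input sample costs six additions for both outputs, compared with eight for two separate four-stage cascades, which is the same ratio as the 12 versus 16 adders quoted above.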
Performance Discussion
The main advantage of the B-spline factorized architectures is that they may require fewer multipliers than the convolution and lifting schemes. This is because the degrees of $Q(z)$ and $R(z)$ ($\gamma_Q$ and $\gamma_R$) are designed to be as small as possible for the given $\gamma_H$ and $\gamma_G$, which determine all wavelet properties.
The following is a general performance discussion. The convolution scheme requires about $\gamma_H + \gamma_G + \gamma_Q + \gamma_R$ multipliers, while the lifting scheme could possibly save half of the multipliers [11]. But the B-spline factorized architecture only requires $\gamma_Q + \gamma_R$ multipliers which are fewer than
Daubechies wavelets are optimal in the sense that they have minimum-size support for a given number of vanishing moments [15]. Thus, we can derive the expression as follows:

$$\gamma_H + \gamma_G \le \frac{(\gamma_H + \gamma_Q + 1) + (\gamma_G + \gamma_R + 1)}{2} \qquad (3)$$

Eq. (3) means that the sum of vanishing moments ($\gamma_H + \gamma_G$) is always less than or equal to half of the sum of the lowpass and highpass filter lengths. Thus, the B-spline factorized architectures can always guarantee a reduction of multipliers by a factor of 2 relative to the convolution-based ones if Daubechies wavelets are used, whereas the lifting-based architectures cannot guarantee this. Now we consider the commonly used linear-phase filters. For linear-phase DWT filters, the convolution-based architectures can reduce the multipliers by a factor of 2 by exploiting the symmetry. Since the B-spline part is always linear-phase, the distributed part is also linear-phase and can reduce the multipliers by a factor of 2 as well. However, the lifting-based architectures cannot always exploit the linear-phase property. Especially for even-length linear-phase DWT filters, the lifting steps are hard to factorize into linear-phase form, so the required multipliers may be even more than those of the convolution-based architectures.
The main disadvantage of the B-spline factorized architectures is that more adders may be required. However, the complexity of an adder is much less than that of a multiplier, and most adders are not on the critical path, so they can be implemented with low speed and small area. As a result, the proposed architectures can provide a larger reduction of hardware resources than the others.
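The trade-off between multipliers and adders can be put into a rough counting model. The sketch below rests on assumptions that are not stated in this section: the distributed parts are symmetric with one unit coefficient each, so that $Q(z)$ and $R(z)$ need about $\gamma_Q/2$ and $\gamma_R/2$ multipliers and $\gamma_Q$ and $\gamma_R$ adders, and the direct B-spline part needs $2\gamma_H + 2\gamma_G$ adders. The function name `resource_estimate` is hypothetical.

```python
def resource_estimate(gamma_h, gamma_g, gamma_q, gamma_r, bspline_adders=None):
    """Rough multiplier and adder counts per lowpass/highpass output pair.

    Assumes symmetric distributed parts with one unit coefficient each.
    If bspline_adders is None, the direct implementation count is used.
    """
    multipliers = gamma_q // 2 + gamma_r // 2
    if bspline_adders is None:
        bspline_adders = 2 * gamma_h + 2 * gamma_g
    adders = bspline_adders + gamma_q + gamma_r
    return multipliers, adders

# (9, 7) filter with a 12-adder Pascal B-spline part, then the (10, 18)
# filter with the direct B-spline part.
print(resource_estimate(4, 4, 4, 2, bspline_adders=12))
print(resource_estimate(5, 9, 4, 8))
```

Under these assumptions the model gives 3 multipliers and 18 adders for the (9, 7) filter and 6 multipliers and 40 adders for the (10, 18) filter, which agrees with the counts reported in the case studies.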
Case Studies
In this section, three Daubechies biorthogonal filters are studied and implemented with the proposed B-spline factorized architectures: the JPEG2000 default (9, 7) filter, the (6, 10) filter [5], and the (10, 18) filter [13].
JPEG2000 Default (9, 7) Filter
The B-spline factorization of the (9, 7) filter can be expressed as:

$$H(z) = (1 + z^{-1})^4 \, (1 + t_1 z^{-1} + t_2 z^{-2} + t_1 z^{-3} + z^{-4}) \, h_0,$$
$$G(z) = (1 - z^{-1})^4 \, (1 + t_3 z^{-1} + z^{-2}) \, g_0,$$

where $t_1 = -4.630464$, $t_2 = 9.597484$, and $t_3 = 3.369536$. Thus the B-spline factorized architecture of the (9, 7) filter needs only three multipliers, excluding the normalization part $h_0$ and $g_0$. Here, we use the Pascal implementation for the B-spline part, and the Pascal expression of the (9, 7) filter is shown in Fig. 7. The proposed B-spline factorized architecture requires 18 adders, of which 12 are for the B-spline part and 6 are for the distributed part. The proposed architectures are shown in Fig. 8, where Fig. 8(a) and (b) represent the Type-I and Type-II polyphase decompositions, respectively. The notation that we use for FIR filters is described in Fig. 9. The original Type-I architecture requires eight registers, and the critical path is $T_m + 5T_a$, where $T_m$ is the time taken for a multiplication and $T_a$ is the time needed for an addition. If pipelining is performed through the upper dotted line, the critical path can be shortened to $T_m + 2T_a$ with 10 registers in total. The critical path of the Type-II architecture, on the other hand, is $T_m + 2T_a$ with only 10 registers.
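The constants $t_1$, $t_2$, and $t_3$ can be sanity-checked by expanding the factorization numerically. The sketch below assumes symmetric distributed parts $1 + t_1 z^{-1} + t_2 z^{-2} + t_1 z^{-3} + z^{-4}$ and $1 + t_3 z^{-1} + z^{-2}$, and compares the expanded lowpass filter, normalized to unit DC gain, with the widely tabulated 9-tap analysis lowpass coefficients of the JPEG2000 (9, 7) filter. The helper name `power` and the reference array are additions for illustration.

```python
import numpy as np

T1, T2, T3 = -4.630464, 9.597484, 3.369536

def power(factor, n):
    """Coefficients of factor(z)^n, obtained by repeated convolution."""
    out = np.array([1.0])
    for _ in range(n):
        out = np.convolve(out, factor)
    return out

# B-spline part times distributed part, normalization left out.
h = np.convolve(power([1.0, 1.0], 4), [1.0, T1, T2, T1, 1.0])   # 9 taps
g = np.convolve(power([1.0, -1.0], 4), [1.0, T3, 1.0])          # 7 taps

h_unit_dc = h / h.sum()   # choose the normalization so that H(1) = 1

# First five taps of the (9, 7) analysis lowpass filter with unit DC gain.
reference = np.array([0.026748757411, -0.016864118443, -0.078223266529,
                      0.266864118443, 0.602949018236])
print(np.max(np.abs(h_unit_dc[:5] - reference)))
```

The highpass expansion `g` sums to zero, as expected for a filter with at least one vanishing moment, and both expanded filters are symmetric.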
Comparison.
By extracting the normalization part $h_0$ and $g_0$ and utilizing the symmetric property, the convolution-based architecture of the (9, 7) filter can be implemented with 7 multipliers, 14 adders, and 7 registers. The critical path is $T_m + 3T_a$ if an adder tree is used to connect the adders.
The lifting scheme of the (9, 7) filter can be factorized as:

$$P(z) = \begin{bmatrix} 1 & a(1 + z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ b(1 + z) & 1 \end{bmatrix} \begin{bmatrix} 1 & c(1 + z^{-1}) \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ d(1 + z) & 1 \end{bmatrix} \begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}$$

where $P(z)$ is the polyphase matrix, and the coefficients are given as $a = -1.586134342$, $b = -0.052980118$, $c = 0.882911076$, $d = 0.443506852$, and $K = 1.149604398$. The corresponding signal flow graph is shown in Fig. 10. Thus, the lifting-based architecture requires 4 multipliers and 8 adders if the normalization steps $K$ and $1/K$ are excluded. The critical path, $4T_m + 8T_a$, is quite long with only 4 registers, and it can be reduced to $T_m + 2T_a$ by pipelining through the dotted lines with 10 registers in total. On the other hand, the flipping structure of the (9, 7) filter flips Fig. 10 to reduce the critical path [7], as shown in Fig. 11, where the critical path is $T_m + 5T_a$ without any hardware overhead relative to Fig. 10. The critical path can be further reduced to $T_m + T_a$ with three additional pipelining registers. The proposed B-spline factorized architectures, as well as the aforementioned convolution-based and lifting-based ones, have been verified in Verilog and synthesized. The results are listed in Table 1, where the internal bit-widths are all 16 bits, the multipliers are all 16-by-16 multipliers, and the adders are also 16-bit for comparison. The gate counts are given separately as combinational and noncombinational gate counts. The former corresponds to the multipliers and adders, while the latter corresponds to the registers. For circuit synthesis, the timing constraints are set as tight as possible.
According to Table 1, the proposed architectures require lower gate counts under the same timing constraints. Furthermore, the saving in gate count will be more significant if the multipliers are required to have higher precision.
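For reference, the lifting steps of the (9, 7) filter compared above can be modelled in a few lines. The sketch below is a behavioural model under assumptions that the paper does not fix: periodic signal extension, a particular pairing of neighbouring samples in each step, and lowpass scaling by $K$ with highpass scaling by $1/K$. The function names are illustrative.

```python
import numpy as np

A, B, C, D, K = -1.586134342, -0.052980118, 0.882911076, 0.443506852, 1.149604398

def lifting97_forward(x):
    """Four lifting steps (one multiplier and two adders each) plus scaling."""
    x = np.asarray(x, dtype=float)
    s, d = x[0::2].copy(), x[1::2].copy()     # even and odd samples
    d += A * (s + np.roll(s, -1))             # step 1
    s += B * (d + np.roll(d, 1))              # step 2
    d += C * (s + np.roll(s, -1))             # step 3
    s += D * (d + np.roll(d, 1))              # step 4
    return s * K, d / K                       # assumed scaling convention

def lifting97_inverse(s, d):
    """Undo the scaling and the lifting steps in reverse order."""
    s, d = s / K, d * K
    s = s - D * (d + np.roll(d, 1))
    d = d - C * (s + np.roll(s, -1))
    s = s - B * (d + np.roll(d, 1))
    d = d - A * (s + np.roll(s, -1))
    x = np.empty(2 * len(s))
    x[0::2], x[1::2] = s, d
    return x
```

Because every multiplication depends on the result of the previous step, the four multipliers lie on one serial path in this form, which is the origin of the $4T_m + 8T_a$ critical path before pipelining or flipping.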
The (6, 10) Filter
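The B-spline factorization of the (6, 10) filter [5] can be expressed as:

$$H(z) = (1 + z^{-1})^3 \, (1 + s_3 z^{-1} + z^{-2}) \, h_0,$$
$$G(z) = (1 - z^{-1})^5 \, (1 + s_1 z^{-1} + s_2 z^{-2} + s_1 z^{-3} + z^{-4}) \, g_0,$$

where $s_1 = -t_1$, $s_2 = t_2$, and $s_3 = -t_3$. However, the Pascal implementation can only cover $(1 \pm z^{-1})^3$, and the remaining factor $(1 - z^{-1})^2$ of $G(z)$ can either be kept as two direct stages (Solution-1) or merged into the distributed part (Solution-2):

$$G(z) = (1 - z^{-1})^3 \, (1 - z^{-1})^2 \, (1 + s_1 z^{-1} + s_2 z^{-2} + s_1 z^{-3} + z^{-4}) \, g_0, \qquad (6)$$
$$G(z) = (1 - z^{-1})^3 \, (1 + r_1 z^{-1} + r_2 z^{-2} + r_3 z^{-3} + r_2 z^{-4} + r_1 z^{-5} + z^{-6}) \, g_0, \qquad (7)$$

where $r_1 = 2.630464$, $r_2 = 1.336557$, and $r_3 = -9.934042$. Eqs. (6) and (7) correspond to Solution-1 and Solution-2, respectively. The proposed architectures are shown in Fig. 12, where the parts marked with '*' and '##' can be shared. Thus, Solution-1 of the B-spline factorized architecture requires 3 multipliers and 20 adders, while Solution-2 needs 4 multipliers and 18 adders.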
The critical path of the Solution-1 architecture can be $T_m + 6T_a$, $T_m + 4T_a$, or $T_m + 2T_a$ by retiming, pipelining, or retiming and pipelining together, respectively. The corresponding numbers of registers are 9, 11, and 13. On the other hand, the Solution-2 architecture can be retimed to obtain a critical path of $T_m + 5T_a$ with 9 registers in total.
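The constants listed for the (6, 10) filter can be cross-checked numerically. The sketch below assumes a reading that is inferred from the numbers rather than stated explicitly: $r_1$, $r_2$, and $r_3$ are the coefficients obtained when $(1 - z^{-1})^2$ is multiplied into the symmetric quartic $1 + s_1 z^{-1} + s_2 z^{-2} + s_1 z^{-3} + z^{-4}$.

```python
import numpy as np

T1, T2 = -4.630464, 9.597484
S1, S2 = -T1, T2                          # s1 = -t1, s2 = t2
R1, R2, R3 = 2.630464, 1.336557, -9.934042

quartic = [1.0, S1, S2, S1, 1.0]
merged = np.convolve([1.0, -2.0, 1.0], quartic)     # (1 - z^-1)^2 * quartic
expected = np.array([1.0, R1, R2, R3, R2, R1, 1.0])

print(np.max(np.abs(merged - expected)))
```

Under this reading the merged polynomial has three distinct non-unit coefficients, one more than the quartic, while two $(1 - z^{-1})$ stages disappear. That is consistent with Solution-2 trading one extra multiplier for two fewer adders relative to Solution-1.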
Comparison.
By extracting the normalization part $h_0$ and $g_0$ and utilizing both symmetric and anti-symmetric properties, the convolution-based architecture of the (6, 10) filter can be implemented with 6 multipliers, 14 adders, and 8 registers. The critical path is $T_m + 4T_a$ if the adder tree is used.
In contrast to the odd symmetric (9, 7) filter, the polyphase matrix of the even-length linear-phase (6, 10) filter can be decomposed as follows:
where the coefficients are given as $a = -0.369536$, $b = -0.42780$, $c = -0.119532$, $d = -0.090075$, $e = 0.872739$, $f = 0.224338$, $g = -0.572909$, $K_1 = 0.874919$, and $K_2 = 1.142963$ [5]. Thus, the lifting-based architecture can be shown as in Fig. 13, where 7 multipliers, 8 adders, and 5 registers are required if $K_1$ and $K_2$ are excluded. The critical path is $4T_m + 5T_a$ without pipelining and can be pipelined to $T_m + 2T_a$ with six pipelining registers. The flipping structure can also reduce the critical path to $T_m + 5T_a$ by flipping, and it can be further pipelined to $T_m + 2T_a$ with four pipelining registers [7]. Similarly, the proposed, convolution-based, and lifting-based architectures have been verified and synthesized. The bit-width is the same as in the case of the (9, 7) filter. The results are listed in Table 2. In this case, the lifting-based architecture requires even more multipliers than the convolution-based one because the lifting scheme of even-tap linear-phase DWT filters is not as efficient as that of odd symmetric filters. However, the proposed B-spline architecture can still reduce the number of multipliers to three. Table 2 shows that the proposed architectures can achieve the same timing constraints with lower gate counts than the other three architectures.
Detailed Gate Count Comparison.
The B-spline factorized architecture uses fewer multipliers but introduces more adders. We compare the gate counts of multipliers and adders in more detail to examine the resulting reduction of hardware resources. The lifting-based architecture with four pipelining stages and the B-spline Solution-1 architecture with pipelining are chosen; both have a critical path of $T_m + 2T_a$. The detailed comparison of the gate counts is listed in Table 3, where the gate counts of the different kinds of multipliers and adders are listed separately. The Synopsys Design Compiler synthesizes all multipliers as non-Booth-recoded Wallace tree multipliers, which allow trade-offs between processing speed and area. Many kinds of adders are used in circuit synthesis, and the carry-lookahead adders are the fastest but also the largest ones.
All multipliers of the lifting-based architecture are on the critical path, so their gate counts are quite large, about 1500 gates on average. However, the multipliers of the B-spline factorized architecture are not all on the critical path, so the average gate count is only about 1000 gates. Furthermore, the lifting-based architecture requires 4 more multipliers than the B-spline factorized one. As a result, the total gate counts of the multipliers are about 10000 and 3000 gates, respectively.
On the other hand, only one carry-lookahead adder is used in the proposed architecture while five are used in the lifting-based one. Although more adders are required, most of them are synthesized to the smaller adders in the proposed architecture. The overhead gate count of adders for the proposed architecture is about 1600 gates. By combining the result of multipliers, the net reduction of gate count is about 7000 − 1600 = 5400. The efficiency of the proposed architecture for reducing multipliers is demonstrated.
The (10, 18) Filter
The coefficients of the (10, 18) analysis filter bank are given in [13]. The analysis lowpass filter is a symmetric 10-tap filter, and the highpass filter is an anti-symmetric 18-tap filter. The coding efficiency can be better than that of the well-known (9, 7) filter [13, 16]. The B-spline factorization of the analysis filter bank is as follows:

$$H(z) = (1 + z^{-1})^5 \, (u_1 + u_2 z^{-1} + z^{-2} + u_2 z^{-3} + u_1 z^{-4}) \, h_0,$$
$$G(z) = (1 - z^{-1})^9 \, (u_3 + u_4 z^{-1} + u_5 z^{-2} + u_6 z^{-3} + z^{-4} + u_6 z^{-5} + u_5 z^{-6} + u_4 z^{-7} + u_3 z^{-8}) \, g_0,$$

where $u_1 = 0.1049758$, $u_2 = -0.524577$, $u_3 = 0.0094393$, $u_4 = 0.08498056$, $u_5 = 0.33152476$, $u_6 = 0.74232477$, $h_0 = 0.27485$, and $g_0 = 0.101111$. For the (10, 18) filter bank, the Pascal implementation is too complex to derive because the degrees of the B-spline parts are 5 and 9. Thus, we use the direct implementation for the B-spline part. The proposed architecture for the (10, 18) filter is shown in Fig. 14, where 6 multipliers and 40 adders are used if the normalization part is excluded. If retiming $z^{+2}$ is performed, the critical path becomes $T_m + 11T_a$ with 23 registers in total. In concept, we can reduce the critical path to
by pipelining with 4 additional registers.
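The structure of the (10, 18) factorization can be checked by expanding it numerically. The sketch below assumes that both distributed parts are symmetric with a unit centre coefficient, that is $u_1 + u_2 z^{-1} + z^{-2} + u_2 z^{-3} + u_1 z^{-4}$ for the lowpass filter and $u_3 + u_4 z^{-1} + u_5 z^{-2} + u_6 z^{-3} + z^{-4} + \dots + u_3 z^{-8}$ for the highpass filter. This arrangement of the constants is an assumption, and the helper name `power` is illustrative.

```python
import numpy as np

U1, U2, U3, U4, U5, U6 = (0.1049758, -0.524577, 0.0094393,
                          0.08498056, 0.33152476, 0.74232477)
H0, G0 = 0.27485, 0.101111

def power(factor, n):
    """Coefficients of factor(z)^n, obtained by repeated convolution."""
    out = np.array([1.0])
    for _ in range(n):
        out = np.convolve(out, factor)
    return out

q = [U1, U2, 1.0, U2, U1]                            # assumed distributed part of H
r = [U3, U4, U5, U6, 1.0, U6, U5, U4, U3]            # assumed distributed part of G

h = H0 * np.convolve(power([1.0, 1.0], 5), q)        # 10-tap lowpass
g = G0 * np.convolve(power([1.0, -1.0], 9), r)       # 18-tap highpass

# Adders: 2*5 + 2*9 for the direct B-spline part, 4 + 8 for the distributed part.
adders = 2 * 5 + 2 * 9 + (len(q) - 1) + (len(r) - 1)
print(len(h), len(g), adders)
```

Under these assumptions the expansion yields a symmetric 10-tap filter and an anti-symmetric 18-tap filter, and the adder count matches the 40 adders quoted above.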
Comparison.
Here we consider the convolution-based architecture of the (10, 18) filter implemented with parallel filters. If the linear-phase property and the adder tree are adopted, 12 multipliers, 26 adders, and 16 registers are required, and the critical path is $T_m + 5T_a$. As in the case of the (6, 10) filter, the lifting scheme of the (10, 18) filter cannot be linear-phase and cannot reduce the hardware complexity. Thus, we do not include the lifting scheme in the comparison. The proposed and convolution-based architectures have been verified and synthesized. The internal bit-width is the same as in the case of the (9, 7) filter, except the multipliers become 16-by-16 multiplications. The results are listed in Table 4. The pipelining of the proposed architecture is cut before the last two $1 + z^{-1}$ stages. The proposed architectures require only about two-thirds of the gate count of the convolution-based one.

Figure 14. B-spline factorized architecture for the (10, 18) filter.
Conclusion
In this paper, a new category of DWT architectures is proposed on the basis of B-spline factorization. The B-spline part can be implemented with the direct or the Pascal implementation. The distributed part can be implemented with the Type-I or Type-II polyphase decomposition and conventional filter design techniques. For Daubechies wavelets, the proposed B-spline factorized architectures can guarantee a reduction of multipliers by a factor of 2, while the lifting scheme cannot. Although more adders are required, many adders can be implemented with small area and low speed because most of them are not on the critical path. In the three case studies, the (9, 7), (6, 10), and (10, 18) filters, the gate counts of the proposed architecture are much smaller than those of the convolution-based and lifting-based ones, which demonstrates its efficiency.
