VLSI Architecture for Forward Discrete Wavelet Transform Based on B-Spline Factorization by Chao-tsung Huang et al.
Journal of VLSI Signal Processing 40, 343–353, 2005
© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
VLSI Architecture for Forward Discrete Wavelet Transform Based
on B-spline Factorization
CHAO-TSUNG HUANG, PO-CHIH TSENG AND LIANG-GEE CHEN
DSP/IC Design Lab, Graduate Institute of Electronics Engineering and Department of Electrical Engineering,
National Taiwan University, 1, Sec. 4, Roosevelt Road, Taipei 106, Taiwan
Received September 18, 2003; Revised October 16, 2003; Accepted November 13, 2003
Abstract. Based on B-spline factorization, a new category of architectures for Discrete Wavelet Transform (DWT)
is proposed in this paper. The B-spline factorization mainly consists of the B-spline part and the distributed part. The
former is proposed to be constructed by use of the direct implementation or Pascal implementation. And the latter is
the part introducing multipliers and can be implemented with the Type-I or Type-II polyphase decomposition. Since
the degree of the distributed part is usually designed as small as possible, the proposed architectures could use fewer
multipliers than previous arts, but more adders would be required. However, many adders can be implemented with
smaller area and lower speed because only few adders are on the critical path. Three case studies, including the
JPEG2000 default (9, 7) filter, the (6, 10) filter, and the (10, 18) filter, are given to demonstrate the efficiency of the
proposed architectures.
Keywords: discrete wavelet transform, VLSI architecture, B-spline factorization
1. Introduction
DWT has been developed as an efﬁcient and power-
ful tool for signal analysis, image compression, and
even scalable video coding recently [1]. Because a huge
amount of computation would be required, many VLSI
architectures have been proposed, which are mainly
based on convolution scheme [2–4] and lifting scheme
[5–7]. The convolution-based architecture is to implement
two-channel filter banks directly, and many VLSI
DSP techniques, such as polyphase decomposition [8],
pipelining, and retiming [9], have been adopted to en-
hance the performance. On the other hand, the lifting
scheme is used to express the two-channel ﬁlter banks
in a new way [10]. In [11], a systematic method is pro-
posed to factorize the polyphase matrix into many lift-
ing steps based on the perfect reconstruction property.
The lifting scheme usually requires fewer multipliers
and adders than the convolution scheme.
However, the intrinsic B-spline property of DWT
was not used to construct VLSI architectures in liter-
ature. According to [12], any DWT ﬁlters can be fac-
torized into the B-spline part and the distributed part.
The B-spline part contributes to all important wavelet
properties. And the distributed part is used to design
DWT FIR ﬁlters. Since only the distributed part re-
quires multipliers, the B-spline factorization could use
fewer multipliers than the lifting scheme but induce
more adders.
In this paper, we propose to implement DWT based
on the B-spline factorization. The B-spline part is proposed
to be constructed with the direct implementation
or Pascal implementation. The latter could reduce the
adders, but could be too complex when the ﬁlter tap
is too long. The distributed part could be implemented
with the Type-I or Type-II polyphase decomposition,
and conventional filter implementation methods all can
be applied. Three case studies are given to examine the
efficiency. However, the principal objective of this paper
is to motivate a new category of DWT architectures.
The organization of this paper is as follows. Section 2
reviews previous arts of DWT architectures. The
B-spline factorization theory is described in Section 3,
and the proposed architectures are presented in
Section 4. The case studies of the JPEG2000 default
(9, 7) ﬁlter, the (6, 10) ﬁlter [5], and the (10, 18) ﬁl-
ter [13], are given in Section 5. Finally, a summary is
given to conclude this paper in Section 6.
2. Previous DWT Architectures
This section introduces previous DWT architectures
and classiﬁes DWT architectures into three categories.
2.1. Convolution-Based
The multiresolution DWT analysis can be viewed as a
cascade of several two-channel filter banks [14], and the
analysis ﬁlter bank is shown in Fig. 1, where H(z) and
G(z) are the lowpass and highpass ﬁlters, respectively.
The convolution-based architectures are to implement
DWT with the direct structures of two-channel ﬁlter
banks. Many VLSI DSP design techniques, such as
folding, unfolding, and pipelining [9], can be adopted
to implement the pair of lowpass and highpass ﬁlters.
Especially, the convolution-based architecture can be
constructed by use of polyphase decomposition [8] as
shown in Fig. 2, where $H(z) = H_e(z^2) + z^{-1}H_o(z^2)$
and $G(z) = G_e(z^2) + z^{-1}G_o(z^2)$ if Type-I decomposition
is used, and $H(z) = H_e(z^2) + zH_o(z^2)$ and
$G(z) = G_e(z^2) + zG_o(z^2)$ if Type-II decomposition
is used. The Type-I and Type-II decompositions can
be illustrated as Fig. 3. Then the four filters in Fig. 2
can be implemented by serial or parallel filters. In this
convolution-based scheme, the lowpass and highpass
filters are considered independently.
Figure 1. Two-channel analysis filter bank.
Figure 2. Polyphase decomposition of the analysis filter bank.
Figure 3. Two polyphase decomposition types.
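The Type-I identity above can be checked numerically. The following is a minimal Python sketch, not part of the original design: the helper names (`fir`, `analysis_direct`, `analysis_polyphase_type1`) are hypothetical, and a causal filter with zero initial state is assumed. It compares filtering at full rate followed by downsampling against the two half-rate branches $H_e$ and $H_o$.

```python
def fir(h, x):
    """Causal FIR filtering: y[n] = sum_k h[k] * x[n-k], with x[n] = 0 for n < 0."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def analysis_direct(h, x):
    """Filter at full rate, then keep the even-indexed outputs (downsample by 2)."""
    return fir(h, x)[0::2]

def analysis_polyphase_type1(h, x):
    """Type-I polyphase: H(z) = He(z^2) + z^-1 Ho(z^2), both branches at half rate."""
    he, ho = h[0::2], h[1::2]          # even- and odd-indexed taps
    xe = x[0::2]                       # xe[m] = x[2m]
    xo = ([0] + x[1::2])[:len(xe)]     # xo[m] = x[2m-1], i.e. odd samples delayed once
    ye = fir(he, xe)
    yo = fir(ho, xo)
    return [a + b for a, b in zip(ye, yo)]

# Example with arbitrary integer taps and samples (illustrative values only).
h_demo = [1, 2, 3, 4, 5]
x_demo = list(range(1, 13))
same = analysis_direct(h_demo, x_demo) == analysis_polyphase_type1(h_demo, x_demo)
```

In this model each half-rate branch sees only half of the samples, which is the property that allows the hardware to run at full utilization.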
2.2. Lifting-Based
On the other hand, lifting scheme [10] has been widely
used to reduce the required multiplications and addi-
tions by exploring the relation of lowpass and high-
pass ﬁlters. According to [11], any DWT ﬁlter bank of
perfect reconstruction can be decomposed into a ﬁnite
sequence of lifting steps. This decomposition corre-
sponds to a factorization for the polyphase matrix of
the target wavelet ﬁlter into a sequence of alternating
upper and lower triangular matrices and a constant di-
agonal matrix, which can be expressed as follows:
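$$
P(z) = \begin{bmatrix} H_e(z) & G_e(z) \\ H_o(z) & G_o(z) \end{bmatrix}
= \prod_{i=1}^{m} \begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix}
\begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \tag{1}
$$
where $P(z)$ is the polyphase matrix.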
Most of the proposed lifting-based architectures in
literature are implemented with the above lifting fac-
torization directly [5,6]. Although the lifting scheme
has many advantages, such as fewer arithmetic operations
and in-place implementation, the potentially long
critical path is a drawback for hardware implementa-
tion. In [7], this timing crisis is discussed in detail and
addressed by use of the ﬂipping structure, instead of
pipelining.
2.3. Classiﬁcation
As mentioned above, the general two-channel filter
banks can be implemented with the convolution
scheme. If the two-channel filter bank possesses the
perfect reconstruction (PR) property, it could be implemented
with fewer arithmetic operations by use of
lifting-based architectures. DWT can be implemented
with the above two schemes because it can be viewed
as a two-channel filter bank of perfect reconstruction
property.
Figure 4. Categories of DWT architectures.
However, the B-spline factorization property of
DWT has not been used to construct efﬁcient archi-
tectures in literature, which is an important property
for DWT and will be described in the next section.
Thus, DWT architectures can be categorized as shown
in Fig. 4, where DWT is only a subset of convolution-
and lifting-based architectures.
3. B-Spline Factorization
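According to [12], the lowpass filter, $H(z) = \sum_{i=0}^{P_H-1} h_i z^{-i}$,
and the highpass filter, $G(z) = \sum_{i=0}^{P_G-1} g_i z^{-i}$, of
any DWT can be factorized as
$$
\begin{aligned}
H(z) &= (1 + z^{-1})^{\gamma_H} \cdot Q(z) \cdot h_0 \\
G(z) &= (1 - z^{-1})^{\gamma_G} \cdot R(z) \cdot g_0
\end{aligned} \tag{2}
$$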
where the first, second, and third terms of the right-hand
side can be called the B-spline part, distributed
part, and normalization part, respectively. Based on the
B-spline factorization, the output of highpass filter can
be viewed as the $\gamma_G$-th order difference of the smoothed
input signals. There are two differences between the
expression (2) and the expression of [12]. The first one
is that we treat $1 \pm z^{-1}$ as the B-spline part, instead
of $(1 + z^{-1})/2$. And the second one is the normalization part
which is extracted in this paper only for implementation
issues.
Figure 5. General B-spline-based architecture.
The B-spline part is responsible for all important
properties of DWT, such as order of approximation,
reproduction of polynomials, vanishing moments, and
multiscale differentiation property. And the distributed
part is used to derive efﬁcient FIR DWT ﬁlters [12].
Thus, the order of the distributed part is usually de-
signed as small as possible when the order of the B-
spline part is given. The normalization part can be im-
plemented independently from the other two parts and
further together with the following quantization if im-
age compression is needed. It is very similar to the
normalization step in the lifting scheme.
4. Proposed B-Spline Factorized Architecture
We propose to implement DWT by using the B-spline
factorization as the Eq. (2). For 100% hardware uti-
lization, the polyphase decomposition is adopted ﬁrst.
After the Type-I or Type-II polyphase decomposition,
the general B-spline factorized architecture can be ex-
pressed as Fig. 5, where the distributed part, Q(z) and
R(z), are decomposed ﬁrst, and the left is the B-spline
part. The distributed part is the only part with multipli-
ers and the four ﬁlters can be implemented by serial or
parallel ﬁlters. Since the normalization part, h0 and g0,
can be implemented independently from the other two
parts, it will be excluded in the following discussion.
Below we will introduce two implementation methods
for the B-spline part.
Figure 6. Direct implementation for the B-spline part.
4.1. Direct Implementation of the B-Spline Part
The direct implementation of the B-spline part is
a straightforward one. The concept is to implement
(1 + z−1) and (1 − z−1) first, and then the B-spline parts
can be constructed by serially connecting (1 + z−1)
and (1 − z−1). But two-input-two-output structures
of (1 + z−1) and (1 − z−1) cannot be derived from
polyphase decomposition. We propose to implement
them by considering the physical connection of sig-
nals as shown in Fig. 6, where we assume the Type-I
decomposition is used so the even signals are prior to
odd signals. Thus, the direct implementation requires
2γH + 2γG adders for a pair of lowpass and highpass
outputs. When connecting the B-spline part to the distributed
part, the priority of signals needs to be handled
carefully.
Another problem that should be solved is the internal
signal wordlength. Since the DC gain of (1 + z−1)
is 2, the signal magnitude may double after
every (1 + z−1) stage, and likewise after every (1 − z−1)
stage. However, implementing (1 ± z−1)/2 instead would
lose too much precision. The precision and wordlength
issues should be handled carefully once the precision criterion
is given. In this paper, a simple method is used
to solve it. We scale down the signal by 2 after every
two (1 ± z−1) stages to preserve precision and
prevent signal overflow.
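The direct cascade with periodic scaling can be modeled in software as follows. This is only a behavioral sketch under stated assumptions: the function names are hypothetical, the signal is a list of integers processed at full rate (the two-input-two-output polyphase structure of Fig. 6 is not modeled), the initial state is zero, and the division by 2 is taken to be an arithmetic right shift, which the paper does not specify.

```python
def stage(x, sign):
    """One (1 + sign * z^-1) stage: y[n] = x[n] + sign * x[n-1], with x[-1] = 0."""
    return [x[n] + (sign * x[n - 1] if n > 0 else 0) for n in range(len(x))]

def bspline_direct(x, gamma, sign, scale_every=2):
    """Cascade `gamma` stages and halve the signal after every `scale_every` stages.

    Returns the scaled output and the number of halvings applied, so the
    caller can fold the factor 2**shifts into the normalization part.
    """
    shifts = 0
    for i in range(1, gamma + 1):
        x = stage(x, sign)
        if i % scale_every == 0:
            x = [v >> 1 for v in x]   # assumed rounding: floor (arithmetic shift)
            shifts += 1
    return x, shifts

# Impulse of height 4 through four stages: the binomial pattern appears,
# scaled by 1/4 because two halvings were applied.
low, n_low = bspline_direct([4, 0, 0, 0, 0, 0], gamma=4, sign=+1)
high, n_high = bspline_direct([4, 0, 0, 0, 0, 0], gamma=4, sign=-1)
```

Scaling after every second stage, rather than after every stage, keeps one extra bit of precision per pair of stages at the cost of one bit of headroom.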
4.2. Pascal Implementation of the B-Spline Part
Instead of the direct implementation, we also propose
the Pascal implementation that can exploit the simi-
larity of the two B-spline parts to reduce adders. The
Pascal implementation expresses the (1 + z−1)γH and
(1 − z−1)γG as the Pascal expansion and saves the re-
peated computation. For example, 1+6z−2 + z−4 and
4z−1 + 4z−3 can be computed ﬁrst for the implemen-
tation of (1 + z−1)4 = 1 + 4z−1 + 6z−2 + 4z−3 + z−4
and (1 − z−1)4 = 1 − 4z−1 + 6z−2 − 4z−3 + z−4.
Then the sum of them is (1+ z−1)4, and the difference
is (1 − z−1)4. Furthermore, the integer multiplications
of the B-spline part can be implemented with shifters
and adders, instead of multipliers. In this example, the
Pascal implementation only requires 12 adders, but the
direct implementation will need 16 adders. However,
the Pascal implementation of long-tap filters will be too
complex to be derived, and the complexity reduction
is not guaranteed. The precision and wordlength issues
are also more complex than those of the direct implementation.
In this paper, we preserve as much precision
as possible when the internal wordlength is given.
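The sharing idea behind the Pascal implementation can be illustrated with a short Python sketch. It assumes both B-spline parts have the same order, as in the fourth-order example above, and uses hypothetical helper names. It models only the arithmetic (even-power terms plus or minus odd-power terms), not the shifter-and-adder structure or the adder count of the hardware.

```python
from math import comb

def fir(h, x):
    """Causal FIR filtering with zero initial state."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
            for n in range(len(x))]

def pascal_split(gamma):
    """Split the binomial coefficients of (1 + z^-1)^gamma into even and odd powers."""
    coeffs = [comb(gamma, k) for k in range(gamma + 1)]
    even = [c if k % 2 == 0 else 0 for k, c in enumerate(coeffs)]
    odd = [c if k % 2 == 1 else 0 for k, c in enumerate(coeffs)]
    return even, odd

def pascal_bspline(x, gamma):
    """Compute both B-spline outputs from two shared partial results."""
    even, odd = pascal_split(gamma)
    e = fir(even, x)                          # shared even-power partial sum
    o = fir(odd, x)                           # shared odd-power partial sum
    low = [a + b for a, b in zip(e, o)]       # (1 + z^-1)^gamma applied to x
    high = [a - b for a, b in zip(e, o)]      # (1 - z^-1)^gamma applied to x
    return low, high

x_demo = [1, 2, 3, 4, 5, 6, 7, 8]
low4, high4 = pascal_bspline(x_demo, 4)
```

The two partial results are computed once and combined with a single add and a single subtract per output pair, which is where the saving over two independent cascades comes from.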
4.3. Performance Discussion
The main advantage of the B-spline factorized architectures
is that possibly fewer multipliers are required
than the convolution and lifting schemes. This is because
the degrees of Q(z) and R(z) (γQ and γR) are designed
as small as possible for given γH and γG, which dominate
all wavelet properties.
The following is a general performance discussion. The
convolution scheme requires about γH + γG + γQ + γR
multipliers, while the lifting scheme could possibly
save half of the multipliers [11]. But the B-spline
factorized architecture only requires γQ + γR multipliers,
which is fewer than (γH + γG + γQ + γR)/2 if γQ + γR <
γH + γG. Daubechies wavelets are optimal in the sense
that they have a minimum size support for a given number
of vanishing moments [15]. Thus, we can derive
the expression as follows:
$$
(\gamma_H + \gamma_Q + 1) + (\gamma_G + \gamma_R + 1) \ge 2(\gamma_H + \gamma_G)
\;\Rightarrow\; \gamma_Q + \gamma_R \ge \gamma_H + \gamma_G - 2 \tag{3}
$$
Eq. (3) means that the sum of vanishing moments
(γH + γG) is always less than or equal to a half of the sum
of the lowpass and highpass filter lengths. Thus, the
B-spline factorized architectures can always guarantee
the complexity reduction of multipliers by 2 relative to
the convolution-based ones if Daubechies wavelets are
used. But the lifting-based architectures cannot guar-
antee the performance.
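The rough multiplier estimates of this subsection can be tabulated with a few lines of Python. This is a sketch with hypothetical function names; the lifting figure is the optimistic "half of the convolution count" estimate quoted above, and no symmetry-based sharing is applied to any scheme.

```python
def multiplier_estimates(gamma_h, gamma_g, gamma_q, gamma_r):
    """Rough multiplier counts for one lowpass/highpass output pair."""
    conv = gamma_h + gamma_g + gamma_q + gamma_r
    return {
        "convolution": conv,
        "lifting": conv / 2,            # optimistic estimate: half of convolution
        "bspline": gamma_q + gamma_r,   # only the distributed part multiplies
    }

def bspline_below_half(gamma_h, gamma_g, gamma_q, gamma_r):
    """Condition under which the B-spline count is below half the convolution count."""
    return gamma_q + gamma_r < gamma_h + gamma_g

def satisfies_eq3(gamma_h, gamma_g, gamma_q, gamma_r):
    """Minimum-support bound of Eq. (3)."""
    return gamma_q + gamma_r >= gamma_h + gamma_g - 2

# Degrees read off the (9, 7) factorization in Section 5.1:
# B-spline orders 4 and 4, distributed-part degrees 4 and 2.
est97 = multiplier_estimates(4, 4, 4, 2)
```

For these degrees the bound of Eq. (3) holds with equality, so the distributed part of the (9, 7) filter is as short as the bound allows.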
Now we consider the commonly used linear filters.
For the linear DWT filters, the convolution-based architectures
can reduce the multipliers by 2 by adopting
the linear properties. Since the B-spline part is always
linear, the distributed part is also linear and can reduce
the multipliers by 2 as well. However, the lifting-based
architectures cannot always adopt the linear proper-
ties. Especially for the even length DWT linear ﬁlters,
the lifting steps are hard to be factorized as linear so
that the required multipliers may be even more than
convolution-based architectures.
The main disadvantage of the B-spline factorized architectures
is that more adders may be required. But the
complexity of adders is much less than that of multipliers.
And most adders are not on the critical path, so
they can be implemented in low speed and small area.
As a result, the proposed architectures can provide
more reduction of hardware resource than others.
5. Case Studies
In this section, three Daubechies biorthogonal ﬁl-
ters are studied and implemented by use of pro-
posed B-spline factorized architectures, including the
JPEG2000 default (9, 7) ﬁlter, the (6, 10) ﬁlter [5], and
the (10, 18) ﬁlter [13].
5.1. JPEG2000 Default (9, 7) Filter
The B-spline factorization of the (9, 7) ﬁlter can be
expressed as:
$$
\begin{aligned}
H(z) &= (1 + z^{-1})^4 (1 + t_1 z^{-1} + t_2 z^{-2} + t_1 z^{-3} + z^{-4})\, h_0 \\
G(z) &= (1 - z^{-1})^4 (1 + t_3 z^{-1} + z^{-2})\, g_0
\end{aligned} \tag{4}
$$
where $t_1 = -4.630464$, $t_2 = 9.597484$, and $t_3 =
3.369536$. Thus the B-spline factorized architecture of
the (9, 7) filter will only need three multipliers, excluding
the normalization part h0 and g0. Here, we use
the Pascal implementation for the B-spline part, and
the Pascal expression of the (9, 7) filter is shown in
Fig. 7. The proposed B-spline factorized architecture
requires 18 adders, of which 12 adders are for the B-spline
part and 6 adders for the distributed part. The proposed
architectures are shown in Fig. 8, where Fig. 8(a) and
(b) represent Type-I and Type-II polyphase decompositions,
respectively. And the notation that we use for
FIR filters can be described in Fig. 9.
Figure 7. Pascal expression of the (9, 7) filter.
Figure 8. Proposed B-spline factorized architectures for (9, 7) filter.
Figure 9. Notation for filters.
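As a quick consistency check on the (9, 7) factorization, the two products can be expanded numerically. The sketch below uses hypothetical helper names and the coefficient values quoted in the text; it leaves out the normalization constants h0 and g0, whose values are not given here, so the results are unnormalized tap vectors.

```python
def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def poly_pow(p, n):
    """Raise a polynomial to a non-negative integer power."""
    out = [1.0]
    for _ in range(n):
        out = poly_mul(out, p)
    return out

T1, T2, T3 = -4.630464, 9.597484, 3.369536   # values quoted for the (9, 7) filter

# B-spline part times distributed part, without h0 and g0.
h_unnorm = poly_mul(poly_pow([1, 1], 4), [1, T1, T2, T1, 1])
g_unnorm = poly_mul(poly_pow([1, -1], 4), [1, T3, 1])

def is_symmetric(taps, tol=1e-9):
    """True if the tap vector reads the same forwards and backwards."""
    return all(abs(a - b) < tol for a, b in zip(taps, reversed(taps)))
```

The expansion yields nine lowpass taps and seven highpass taps, both symmetric, and the highpass taps sum to zero because of the $(1 - z^{-1})^4$ factor.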
The original Type-I architecture requires eight registers,
and the critical path is Tm + 5Ta, where Tm
is the time taken for a multiplication operation, and
Ta is the time needed for an addition operation. On
the other hand, if pipelining is performed through the
upper dotted line, the critical path can be shortened to
Tm + 2Ta with a total of 10 registers. However, the critical
path of the Type-II architecture is Tm + 2Ta with only
10 registers.
5.1.1. Comparison. By extracting the normalization
part h0 and g0 and utilizing the symmetric property, the
convolution-based architecture of the (9, 7) filter can
be implemented by use of 7 multipliers, 14 adders, and
7 registers. And the critical path is Tm + 3Ta if adder
tree is used to connect adders.
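The lifting scheme of the (9, 7) filter can be factorized
as:
$$
P(z) = \begin{bmatrix} 1 & a(1 + z^{-1}) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ b(1 + z) & 1 \end{bmatrix}
\times \begin{bmatrix} 1 & c(1 + z^{-1}) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ d(1 + z) & 1 \end{bmatrix}
\begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix} \tag{5}
$$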
where P(z) is the polyphase matrix, and the coefficients
are given as a = −1.586134342, b = −0.052980118,
c = 0.882911076, d = 0.443506852, and K =
1.149604398. The corresponding signal flow graph is
shown in Fig. 10. Thus, the lifting-based architecture
would require 4 multipliers and 8 adders if the normalization
steps K and 1/K are excluded. The critical path
4Tm + 8Ta is quite long with only 4 registers and can be
reduced to Tm + 2Ta by pipelining through the dotted lines
with a total of 10 registers. On the other hand, the flipping
structure of the (9, 7) filter is proposed to flip Fig. 10 to
reduce the critical path [7] as shown in Fig. 11, where
the critical path is Tm + 5Ta without any more hardware
overhead than Fig. 10. The critical path can be further
reduced to Tm + 1Ta with three additional pipelining
registers.
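For readers who want to see the four lifting steps in action, here is a behavioral Python sketch. Several choices are assumptions rather than facts from the paper: the odd samples are updated first with coefficient a, the signal is extended periodically, the input length is even, and the lowpass branch is scaled by K while the highpass branch is scaled by 1/K. The function names are hypothetical. Each step uses one multiplier and two adders, in line with the count of 4 multipliers and 8 adders above.

```python
A, B, C, D = -1.586134342, -0.052980118, 0.882911076, 0.443506852
K = 1.149604398

def lift97_forward(x):
    """Four lifting steps plus scaling (periodic extension, even-length input)."""
    even, odd = list(x[0::2]), list(x[1::2])
    n = len(even)
    odd = [odd[i] + A * (even[i] + even[(i + 1) % n]) for i in range(n)]
    even = [even[i] + B * (odd[i - 1] + odd[i]) for i in range(n)]   # odd[-1] wraps around
    odd = [odd[i] + C * (even[i] + even[(i + 1) % n]) for i in range(n)]
    even = [even[i] + D * (odd[i - 1] + odd[i]) for i in range(n)]
    return [v * K for v in even], [v / K for v in odd]               # assumed scaling

def lift97_inverse(low, high):
    """Undo the scaling and the lifting steps in reverse order."""
    even, odd = [v / K for v in low], [v * K for v in high]
    n = len(even)
    even = [even[i] - D * (odd[i - 1] + odd[i]) for i in range(n)]
    odd = [odd[i] - C * (even[i] + even[(i + 1) % n]) for i in range(n)]
    even = [even[i] - B * (odd[i - 1] + odd[i]) for i in range(n)]
    odd = [odd[i] - A * (even[i] + even[(i + 1) % n]) for i in range(n)]
    x = [0.0] * (2 * n)
    x[0::2], x[1::2] = even, odd
    return x

x_demo = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
low_demo, high_demo = lift97_forward(x_demo)
rec_demo = lift97_inverse(low_demo, high_demo)
```

Because every step only adds a function of the other branch, the inverse is obtained by subtracting the same quantities in reverse order, which is why perfect reconstruction holds regardless of the coefficient values.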
The proposed B-spline factorized architectures as
well as the aforementioned convolution-based and
lifting-based ones have been verified by use of Verilog-XL
and synthesized into gate-level netlists by Synopsys
Design Compiler with standard cells from Artisan
0.25-µm cell library. The comparison and synthesis results
are shown in Table 1, where the internal bit-widths
are all 16-bit, the multipliers are all 16-by-16 multiplications,
and the adders are also 16-bit for comparison.
The gate counts are given with combinational and non-combinational
gate counts separately. The former contributes
to the multipliers and adders while the latter is
responsible to the registers. For circuit synthesis, the
timing constraints are set as tight as possible.
Figure 10. Lifting scheme for the (9, 7) filter.
Figure 11. Flipping structure for the (9, 7) filter.
According to Table 1, the proposed architectures
could require fewer gate counts under the same tim-
ing constraints. Furthermore, the saving of gate counts
will be more signiﬁcant if the multipliers are required
to have higher precision.
5.2. The (6, 10) Filter
The B-spline factorization of the (6, 10) ﬁlter [5] can
be expressed as:
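$$
\begin{aligned}
H(z) &= (1 + z^{-1})^3 (1 + s_3 z^{-1} + z^{-2})\, h_0 \\
G(z) &= (1 - z^{-1})^3 (1 - z^{-1})^2 (1 + s_1 z^{-1} + s_2 z^{-2} + s_1 z^{-3} + z^{-4})\, g_0 && (6) \\
     &= (1 - z^{-1})^3 (1 + r_1 z^{-1} + r_2 z^{-2} + r_3 z^{-3} + r_2 z^{-4} + r_1 z^{-5} + z^{-6})\, g_0 && (7)
\end{aligned}
$$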
where s1 = −t1, s2 = t2, s3 = −t3, r1 = 2.630464,
r2 = 1.336557, and r3 = −9.934042. However, the
Pascal implementation can only cover (1 ± z−1)3, and
there are two solutions for the left part (1 − z−1)2 of
G(z), Solution-1 and Solution-2, which are corresponding
to Eqs. (6) and (7), respectively. The proposed architectures
are shown in Fig. 12, where the parts marked
with ‘*’ and ‘##’ can be shared. Thus, the Solution-1
of the B-spline factorized architecture would require 3
multipliers and 20 adders while the Solution-2 would
need 4 multipliers and 18 adders.

Table 1. Comparisons for DWT architectures of the (9, 7) filter.
| Architecture | Multiplier | Adder | Critical path | Register | Timing (ns) | Comb. gate count | Non-comb. gate count |
|---|---|---|---|---|---|---|---|
| Lifting + no pipelining | 4 | 8 | 4Tm + 8Ta | 4 | 34 | 15418.4 | 796.0 |
| Lifting + 4 pipelining stages | 4 | 8 | Tm + 2Ta | 10 | 9.8 | 15152.0 | 1495.3 |
| Flipping + no pipelining | 4 | 8 | Tm + 5Ta | 4 | 14.1 | 13326.5 | 763.7 |
| Flipping + 3 pipelining stages | 4 | 8 | Tm + 1Ta | 7 | 7.7 | 12089.4 | 1197.7 |
| Convolution + no pipelining | 7 | 14 | Tm + 3Ta | 7 | 10.8 | 17830.5 | 1266.7 |
| B-spline Type-I | 3 | 18 | Tm + 5Ta | 8 | 13.6 | 9670.4 | 1271.3 |
| B-spline Type-II | 3 | 18 | Tm + 2Ta | 10 | 10.3 | 9419.4 | 1523.3 |

Figure 12. Proposed B-spline factorized architectures for the (6, 10) filter.
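The relation between Solution-1 and Solution-2 can be verified by multiplying out the extra factor $(1 - z^{-1})^2$. The sketch below is illustrative only: the helper name is hypothetical, and the s and r values are those quoted in the text, so agreement is expected only to about the printed precision.

```python
def poly_mul(p, q):
    """Multiply two polynomials in z^-1 given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

S1, S2 = 4.630464, 9.597484                    # s1 = -t1, s2 = t2
R1, R2, R3 = 2.630464, 1.336557, -9.934042     # values quoted for Eq. (7)

# Solution-1 keeps (1 - z^-1)^2 separate; Solution-2 merges it into the distributed part.
merged = poly_mul([1, -2, 1], [1, S1, S2, S1, 1])
quoted = [1, R1, R2, R3, R2, R1, 1]
max_err = max(abs(m - q) for m, q in zip(merged, quoted))
```

Merging the two extra difference stages removes adders but turns two distinct coefficients (s1, s2) into three (r1, r2, r3), which matches the trade-off of 3 multipliers and 20 adders versus 4 multipliers and 18 adders stated above.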
The critical path of the Solution-1 architecture could
be Tm + 6Ta, Tm + 4Ta, or Tm + 2Ta by retiming, pipelining,
or retiming and pipelining together, respectively.
The corresponding numbers of registers are 9, 11, and
13. On the other hand, the Solution-2 architecture can
be retimed to obtain a critical path of Tm + 5Ta with
a total of 9 registers.
5.2.1. Comparison. By extracting the normalization
part h0 and g0 and utilizing both symmetric and
anti-symmetric properties, the convolution-based ar-
chitecture of the (6, 10) ﬁlter can be implemented
by use of 6 multipliers, 14 adders, and 8 registers.
And the critical path is Tm + 4Ta if the adder tree is
used.
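In contrast to the odd symmetric (9, 7) filter, the
polyphase matrix of the even linear (6, 10) filter can be
decomposed as follows:
$$
P(z) = \begin{bmatrix} 1 & a \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ b + c z^{-1} & 1 \end{bmatrix}
\begin{bmatrix} 1 & e + d z \\ 0 & 1 \end{bmatrix}
\times \begin{bmatrix} 1 & 0 \\ -f + g z^{-1} + f z^{-2} & 1 \end{bmatrix}
\begin{bmatrix} K_2 & 0 \\ 0 & K_1 \end{bmatrix} \tag{8}
$$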
where the coefficients are given as a = −0.369536,
b = −0.42780, c = −0.119532, d = −0.090075,
e = 0.872739, g = −0.572909, f = 0.224338, K1 =
0.874919, and K2 = 1.142963 [5]. Thus, the lifting-based
architecture can be shown as Fig. 13, where 7
multipliers, 8 adders, and 5 registers are required if K1
and K2 are excluded. The critical path is 4Tm + 5Ta
without pipelining and can be pipelined to Tm + 2Ta
with six pipelining registers. The flipping structure can
also reduce the critical path to Tm + 5Ta by flipping
and can be further pipelined to Tm + 2Ta with four
pipelining registers [7].

Table 2. Comparisons for DWT architectures of the (6, 10) filter.
| Architecture | Multiplier | Adder | Critical path | Register | Timing (ns) | Comb. gate count | Non-comb. gate count |
|---|---|---|---|---|---|---|---|
| Lifting + no pipelining | 7 | 8 | 4Tm + 5Ta | 5 | 26.6 | 12664.0 | 929.33 |
| Lifting + 4 pipelining stages | 7 | 8 | Tm + 2Ta | 11 | 9.15 | 14181.7 | 1679 |
| Flipping + no pipelining | 7 | 8 | Tm + 5Ta | 5 | 13.6 | 12525.3 | 1007 |
| Flipping + 3 pipelining stages | 7 | 8 | Tm + 2Ta | 9 | 8.7 | 12304.0 | 1423.0 |
| Convolution + no pipelining | 6 | 14 | Tm + 4Ta | 8 | 12 | 13685.4 | 1332 |
| Bspline-1 Type-I + retiming | 3 | 20 | Tm + 6Ta | 9 | 14.05 | 9623.7 | 1322.3 |
| Bspline-1 Type-I + retiming + pipe. | 3 | 20 | Tm + 4Ta | 11 | 11.5 | 8788.7 | 1571.0 |
| Bspline-1 Type-I + pipelining | 3 | 20 | Tm + 2Ta | 13 | 9.5 | 8648.7 | 1782.0 |
| Bspline-2 Type-I + retiming | 4 | 18 | Tm + 5Ta | 9 | 13.1 | 11647.7 | 1366.7 |

Figure 13. Lifting scheme for the (6, 10) filter.
Similarly, the proposed, convolution-based, and
lifting-based architectures have been veriﬁed and syn-
thesized. The bit-width is the same as the case of (9, 7)
ﬁlter. The results are listed in Table 2. In this case,
the lifting-based architecture requires even more mul-
tipliers than the convolution-based one because the
lifting scheme of even-tap linear DWT ﬁlters is not
as efﬁcient as that of odd symmetric ﬁlters. However,
the proposed B-spline architecture can still reduce the
number of multipliers to three. Table 2 shows that the
proposed architectures can achieve the same timing
constraints with fewer gate counts than the other three
architectures.

Table 3. Detailed gate count comparison for the (6, 10) filter.
| Architecture | Cell name | Cell number | Ave. gate count | Total gate count |
|---|---|---|---|---|
| Lifting + 4 pipelining stages | nbw | 7 | 1476.85 | 10337.95 |
| | cla | 5 | 482.73 | 3611.97 (all adders: cla + bk) |
| | bk | 3 | 399.44 | |
| B-spline-1 Type-I + pipelining | nbw | 3 | 1071.11 | 3213.33 |
| | cla | 1 | 487.67 | 5256.67 (all adders: cla + bk + rpl + rpcs) |
| | bk | 9 | 290.78 | |
| | rpl | 9 | 207.59 | |
| | rpcs | 1 | 283.67 | |

nbw: non-Booth-recoded Wallace tree multiplier; cla: carry-lookahead
adder; bk: Brent-Kung adder; rpl: ripple carry adder; rpcs:
ripple carry select adder.
5.2.2. Detailed Gate Count Comparison. The B-spline
factorized architecture can provide fewer multipliers
but introduce more adders. We compare the gate
counts of multipliers and adders in more detail to examine
the resulting hardware resource reduction. The
lifting-based architecture with four pipelining stages
and the B-spline Solution-1 architecture with pipelining
are chosen, both of which have a critical path of Tm + 2Ta.
The detailed comparison of the gate counts is listed
in Table 3, where the gate counts of different kinds
of multipliers and adders are listed separately. The Synopsys
Design Compiler synthesizes all multipliers to non-Booth-recoded
Wallace tree multipliers, which can
have trade-offs between the processing speed and the
area size. Many kinds of adders are used for circuit
synthesis, and the carry-lookahead adders are the
fastest but the largest ones.
All multipliers of the lifting-based architecture are
on the critical path, so their gate counts are
quite large, about 1500 gates on average. However,
the multipliers of the B-spline factorized architecture
are not all on the critical path, so the average
gate count is only about 1000 gates. Furthermore, the
lifting-based architecture requires 4 more multipliers
than the B-spline factorized one. As a result, the total
gate counts of multipliers are about 10000 and 3000
gates, respectively.
On the other hand, only one carry-lookahead adder
is used in the proposed architecture while ﬁve are used
in the lifting-based one. Although more adders are re-
quired, most of them are synthesized to the smaller
adders in the proposed architecture. The overhead gate
count of adders for the proposed architecture is about
1600 gates. By combining the result of multipliers, the
net reduction of gate count is about 7000 − 1600 =
5400. The efﬁciency of the proposed architecture for
reducing multipliers is demonstrated.
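The arithmetic behind this estimate can be reproduced from the per-cell figures of Table 3. The short Python sketch below uses hypothetical variable names and takes the cell counts and average gate counts as listed; with the unrounded figures the net saving comes out slightly above the rounded value of about 5400 quoted in the text.

```python
# (cell count, average gate count) per cell type, as listed in Table 3.
lifting = {"nbw": (7, 1476.85), "cla": (5, 482.73), "bk": (3, 399.44)}
bspline = {"nbw": (3, 1071.11), "cla": (1, 487.67), "bk": (9, 290.78),
           "rpl": (9, 207.59), "rpcs": (1, 283.67)}

def total(cells, names):
    """Sum count * average gate count over the selected cell types."""
    return sum(cells[n][0] * cells[n][1] for n in names)

ADDERS = ("cla", "bk", "rpl", "rpcs")

mult_saving = total(lifting, ["nbw"]) - total(bspline, ["nbw"])
adder_overhead = (total(bspline, [n for n in ADDERS if n in bspline])
                  - total(lifting, [n for n in ADDERS if n in lifting]))
net_saving = mult_saving - adder_overhead
```

The multiplier saving dominates because each multiplier removed is several times larger than the ripple-carry and Brent-Kung adders that replace it off the critical path.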
5.3. The (10, 18) Filter
The coefficients of the (10, 18) analysis filter bank are
given in [13]. The analysis lowpass filter is a symmetric
10-tap filter, and the highpass filter is an anti-symmetric
18-tap filter. The coding efficiency can be better than
the well-known (9, 7) filter [13,16]. The B-spline factorization
of the analysis filter bank is as follows:
$$
\begin{aligned}
H(z) &= (1 + z^{-1})^5 (u_1 z^{-8} + u_2 z^{-7} + z^{-6} + u_2 z^{-5} + u_1 z^{-4}) \cdot h_0 \\
G(z) &= (1 - z^{-1})^9 (u_3 z^{-8} + u_4 z^{-7} + u_5 z^{-6} + u_6 z^{-5} + z^{-4} + u_6 z^{-3} + u_5 z^{-2} + u_4 z^{-1} + u_3) \cdot g_0
\end{aligned} \tag{9}
$$
where $u_1 = 0.1049758$, $u_2 = -0.524577$, $u_3 =
0.0094393$, $u_4 = 0.08498056$, $u_5 = 0.33152476$,
$u_6 = 0.74232477$, $h_0 = 0.27485$, and $g_0 = 0.101111$.
For the (10, 18) filter bank, the Pascal implementation
will be too complex to derive because the degrees
of the B-spline parts are 5 and 9. Thus, we use the
direct implementation for the B-spline part. The proposed
architecture for the (10, 18) filter is as shown in
Fig. 14, where 6 multipliers and 40 adders are used if the
normalization part is excluded. If retiming z+2 is performed,
the critical path will become Tm + 11Ta with a total of
23 registers. In concept, we can reduce the critical
path to (Tm + 11Ta)/2 by pipelining with 4 additional registers.
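The multiplier and adder totals quoted for the (10, 18) filter can be reproduced by a simple tally. The sketch below is an illustration under stated assumptions, with hypothetical function names: the B-spline part uses the direct implementation (2γH + 2γG adders), each distributed filter with N taps is assumed to need N − 1 adders, and one multiplier is assumed per distinct non-unit coefficient magnitude thanks to symmetry.

```python
U1, U2 = 0.1049758, -0.524577
U3, U4, U5, U6 = 0.0094393, 0.08498056, 0.33152476, 0.74232477

q_taps = [U1, U2, 1, U2, U1]                       # distributed part of H(z)
r_taps = [U3, U4, U5, U6, 1, U6, U5, U4, U3]       # distributed part of G(z)

def direct_bspline_adders(gamma_h, gamma_g):
    """Adders of the direct B-spline implementation per output pair."""
    return 2 * gamma_h + 2 * gamma_g

def distributed_adders(taps):
    """Assumed adder count of an N-tap FIR filter: N - 1."""
    return len(taps) - 1

def symmetric_multipliers(taps):
    """Assumed multiplier count: distinct coefficient magnitudes other than 1."""
    return len({abs(c) for c in taps if abs(c) != 1})

multipliers = symmetric_multipliers(q_taps) + symmetric_multipliers(r_taps)
adders = (direct_bspline_adders(5, 9)
          + distributed_adders(q_taps) + distributed_adders(r_taps))
```

Under these assumptions the tally gives 6 multipliers and 40 adders, in agreement with the figures stated above; 28 of the 40 adders belong to the multiplier-free B-spline part.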
5.3.1. Comparison. Here we consider that the
convolution-based architecture of the (10, 18) filter is
implemented into the parallel filters. If the linear property
and the adder tree are adopted, 12 multipliers, 26
adders, and 16 registers are required while the critical
path is Tm + 5Ta. As in the case of the (6, 10) filter,
the lifting scheme of the (10, 18) filter cannot be linear and
cannot reduce the hardware complexity. Thus, we will
not include the lifting scheme in the comparison.
The proposed and convolution-based architectures
have been verified and synthesized. The internal bit-width
is the same as the case of (9, 7) filter, except
the multipliers become 16-by-16 multiplications. The
results are listed in Table 4. The pipelining of the proposed
architecture is cut before the last two 1 + z−1
stages as shown in Fig. 14 for this cell library. The
proposed architectures require only about two-thirds
of the gate count of the convolution-based one.

Table 4. Comparisons for DWT architectures of the (10, 18) filter.
| Architecture | Multiplier | Adder | Critical path | Register | Timing (ns) | Comb. gate count | Non-comb. gate count |
|---|---|---|---|---|---|---|---|
| Convolution | 12 | 26 | Tm + 5Ta | 16 | 13.5 | 27173.4 | 2400.3 |
| Bspline + retiming (z+2) | 6 | 40 | Tm + 11Ta | 23 | 21.8 | 19193.9 | 2617.0 |
| Bspline + retiming (z+1) + pipe. | 6 | 40 | ∼(Tm + 11Ta)/2 | 27 | 12.7 | 20050.2 | 3109.7 |

Figure 14. B-spline factorized architecture for the (10, 18) filter.
6. Conclusion
In this paper, a new category of DWT architectures
is proposed on the basis of B-spline factorization.
The B-spline part can be implemented by use of the
direct or Pascal implementation. And the distributed
part could be implemented with the Type-I or Type-II
polyphase decomposition and conventional ﬁlter de-
sign techniques. For Daubechies wavelets, the pro-
posed B-spline factorized architectures can guarantee
the complexity reduction of multipliers by 2 while the
lifting scheme cannot. Although more adders are re-
quired, many adders can be implemented in small area
and low speed because most of them are not on the
critical path. Based on three case studies, including the
(9, 7), (6, 10), and (10, 18) filters, the required gate
counts of the proposed architecture are much smaller
than those of the convolution-based and lifting-based
ones, which demonstrates the efficiency.
Acknowledgment
This work was supported in part by MOE Program
for Promoting Academic Excellence of Universities
under the grant number 89E-FA06-2-4-8, in part by
National Science Council, Republic of China, under
the grant number 91-2215-E-002-035, and in part by
MediaTek Fellowship.
References
1. D. Taubman, “Successive Refinement of Video: Fundamental
Issues, Past Efforts and New Directions,” in International Symposium
on Visual Communications and Image Processing, 2003.
2. K.K. Parhi and T. Nishitani, “VLSI Architectures for Discrete
Wavelet Transforms,” IEEE Transactions on Very Large Scale
Integration Systems, vol. 1, no. 2, 1993, pp. 191–202.
3. M. Vishwanath, R.M. Owens, and M.J. Irwin, “VLSI Architectures
for the Discrete Wavelet Transform,” IEEE Transactions on
Circuits and Systems-II: Analog and Digital Signal Processing,
vol. 42, no. 5, 1995, pp. 305–316.
4. C. Chakrabarti, M. Vishwanath, and R.M. Owens, “Architectures
for Wavelet Transforms: A Survey,” Journal of VLSI Signal
Processing, vol. 14, 1996, pp. 171–192.
5. K. Andra, C. Chakrabarti, and T. Acharya, “A VLSI Architecture
for Lifting-Based Forward and Inverse Wavelet Transform,”
IEEE Transactions on Signal Processing, vol. 50, no. 4, 2002,
pp. 966–977.
6. W. Jiang and A. Ortega, “Lifting Factorization-Based Discrete
Wavelet Transform Architecture Design,” IEEE Transactions on
Circuits and Systems for Video Technology, vol. 11, no. 5, 2001,
pp. 651–657.
7. C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Flipping Structure:
An Efficient VLSI Architecture for Lifting-Based Discrete
Wavelet Transform,” IEEE Transactions on Signal Processing,
vol. 52, no. 4, 2004, pp. 1080–1089.
8. P.P. Vaidyanathan, Multirate Systems and Filter Banks, Prentice
Hall, 1993.
9. K.K. Parhi, VLSI Digital Signal Processing Systems: Design and
Implementation, John Wiley & Sons, 1999.
10. W. Sweldens, “The Lifting Scheme: A Custom-Design Construction
of Biorthogonal Wavelets,” Applied and Computational
Harmonic Analysis, vol. 3, no. 15, 1996, pp. 186–200.
11. I. Daubechies and W. Sweldens, “Factoring Wavelet Transforms
Into Lifting Steps,” The Journal of Fourier Analysis and Applications,
vol. 4, 1998, pp. 247–269.
12. M. Unser and T. Blu, “Wavelet Theory Demystified,” IEEE
Transactions on Signal Processing, vol. 51, no. 2, 2003, pp. 470–483.
13. M.J. Tsai, J.D. Villasenor, and F. Chen, “Stack-Run Image Coding,”
IEEE Transactions on Circuits and Systems for Video Technology,
vol. 6, no. 5, 1996, pp. 519–521.
14. S.G. Mallat, “A Theory for Multiresolution Signal Decomposition:
The Wavelet Representation,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 11, no. 7, 1989,
pp. 674–693.
15. S. Mallat, A Wavelet Tour of Signal Processing, Academic Press,
1998.
16. N. Polyak and W.A. Pearlman, “A New Flexible Bi-Orthogonal
Filter Design for Multiresolution Filterbanks with Application to
Image Compression,” IEEE Transactions on Signal Processing,
vol. 48, no. 8, 2000, pp. 2279–2288.
Chao-Tsung Huang was born in Kaohsiung, Taiwan, R.O.C., in
1979. He received the B.S. degree from the Department of Electrical
Engineering, National Taiwan University, Taipei, Taiwan, R.O.C., in
2001. He currently is working toward the Ph.D. degree at the Graduate
Institute of Electronics Engineering, National Taiwan University.
His major research interests include VLSI design and implementa-
tion for signal processing systems.
cthuang@video.ee.ntu.edu.tw
Po-Chih Tseng was born in Tao-Yuan, Taiwan in 1977. He received
the B.S. degree in Electrical and Control Engineering from National
Chiao Tung University in 1999 and the M.S. degree in Electrical
Engineering from National Taiwan University in 2001. He currently
is pursuing the Ph.D. degree at the Graduate Institute of Electronics
Engineering, Department of Electrical Engineering, National Taiwan
University. His research interests include VLSI design and imple-
mentation for signal processing systems, energy-efﬁcient reconﬁg-
urable computing for multimedia systems, and power-aware image
and video coding systems.
pctseng@video.ee.ntu.edu.tw
Liang-Gee Chen received the B.S., M.S., and Ph.D. degrees in elec-
trical engineering from National Cheng Kung University, Tainan,
Taiwan, R.O.C., in 1979, 1981, and 1986, respectively.
In 1988, he joined the Department of Electrical Engineering, National
Taiwan University, Taipei, Taiwan, R.O.C. During 1993–1994,
he was a Visiting Consultant in the DSP Research Department, AT&T
Bell Labs, Murray Hill, NJ. In 1997, he was a Visiting Scholar of the
Department of Electrical Engineering, University of Washington,
Seattle. Currently, he is Professor at National Taiwan University,
Taipei, Taiwan, R.O.C. His current research interests are DSP archi-
tecture design, video processor design, and video coding systems.
Dr. Chen has served as an Associate Editor of IEEE TRANSACTIONS
ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY
since 1996, as Associate Editor of the IEEE TRANSACTIONS
ON VLSI SYSTEMS since 1999, and as Associate Editor of IEEE
TRANSACTIONS CIRCUITS AND SYSTEMS II since 2000. He
has been the Associate Editor of the Journal of Circuits, Systems,
and Signal Processing since 1999, and a Guest Editor for the Jour-
nal of Video Signal Processing Systems. He is also the Associate
Editor of the PROCEEDINGS OF THE IEEE. He was the Gen-
eral Chairman of the 7th VLSI Design/CAD Symposium in 1995
and of the 1999 IEEE Workshop on Signal Processing Systems:
Design and Implementation. He is the Past-Chair of Taipei Chap-
ter of IEEE Circuits and Systems (CAS) Society, and is a mem-
ber of the IEEE CAS Technical Committee of VLSI Systems and
Applications, the Technical Committee of Visual Signal Process-
ing and Communications, and the IEEE Signal Processing Technical
Committee of Design and Implementation of SP Systems. He is the
Chair-Elect of the IEEE CAS Technical Committee on Multimedia
Systems and Applications. During 2001-2002, he served as a Distinguished
Lecturer of the IEEE CAS Society. He received the Best
Paper Award from the R.O.C. Computer Society in 1990 and 1994.
Annually from 1991 to 1999, he received Long-Term (Acer) Paper
Awards. In 1992, he received the Best Paper Award of the 1992
Asia-Paciﬁc Conference on circuits and systems in the VLSI design
track. In 1993, he received the Annual Paper Award of the Chinese
Engineer Society. In 1996 and 2000, he received the Outstanding
Research Award from the National Science Council, and in 2000,
the Dragon Excellence Award from Acer. He is a member of Phi Tau
Phi.
lgchen@video.ee.ntu.edu.tw