Result-Biased Distributed-Arithmetic-Based Filter Architectures for Approximately Computing the DWT by Martina, Maurizio et al.
04 August 2020
POLITECNICO DI TORINO
Repository ISTITUZIONALE
Result-Biased Distributed-Arithmetic-Based Filter Architectures for Approximately Computing the DWT / Martina,
Maurizio; Masera, Guido; Ruo Roch, Massimo; Piccinini, Gianluca. - In: IEEE TRANSACTIONS ON CIRCUITS AND
SYSTEMS. I, REGULAR PAPERS. - ISSN 1549-8328. - STAMPA. - 62:8(2015), pp. 2103-2113.
Publisher: IEEE
DOI: 10.1109/TCSI.2015.2437513
Terms of use: openAccess
Availability: This version is available at: 11583/2616330 since: 2015-09-24T08:11:55Z
Result-biased Distributed-Arithmetic-based filter
architectures for approximately computing the DWT
Maurizio Martina, Senior Member IEEE, Guido Masera, Senior Member IEEE,
Massimo Ruo Roch, Gianluca Piccinini
Abstract—The discrete wavelet transform is a fundamental
block in several schemes for image compression. Its implementation
relies on filters that usually require multiplications, leading
to significant hardware complexity. Distributed arithmetic is a
general and effective technique to implement multiplierless filters
and has been exploited in the past to implement the discrete
wavelet transform as well. This work proposes a general method
to implement a discrete wavelet transform architecture based
on distributed arithmetic to produce approximate results. The
novelty of the proposed method lies in the use of result-biasing
techniques (inspired by the ones used in fixed-width
multiplier architectures), which cause a very small loss of quality
of the compressed image (average loss of 0.11 dB and 0.20
dB in terms of PSNR for the 9/7 and 10/18 wavelet filters,
respectively). Compared with previously proposed distributed-arithmetic-based
architectures for the computation of the discrete
wavelet transform, this technique saves about 20% to 25%
of hardware complexity.
Index Terms—low-complexity, FIR filters, Distributed Arith-
metic, JPEG2000, DWT
I. INTRODUCTION
In recent years the Discrete Wavelet Transform (DWT) has
become widely used. Thanks to its excellent decorrelation
properties, the DWT has been included in JPEG2000 [1],
the standard recently adopted for Digital Cinema [2]. This has
stimulated research and led to efficient VLSI architectures to
implement the DWT [3], [4]. As shown in [5], the computational
kernel of the DWT is a Filter Bank (FB). Thus, considerable
effort has been spent to obtain multiplierless architectures
of the FB structure. As an example, in [6], [7] the B-spline
factorization [8], [9] is exploited to design multiplierless FB
architectures. Recently, other approaches have been proposed
as well, e.g. algebraic integer quantization [10], [11], coefficient
rationalization [12], polymorphic implementation [13]
and half-band polynomial factorization [14].
Unfortunately, the aforementioned techniques require knowledge
not only of the values of the filter taps but also of the
mathematical derivation of the filters, or at least of some specific
factorizations. In contrast, Distributed Arithmetic (DA)
is a systematic methodology to design multiplierless architectures
for digital filters. Indeed, it has recently been employed to
design low-complexity, high-throughput architectures for i)
Finite-Impulse-Response (FIR) filters [15], [16], ii) Discrete-Cosine-Transform
(DCT) based architectures [17], [18], iii)
multiplierless FB implementations of the DWT [19], [20].
The authors are with the Electronics and Telecommunications Department
- Politecnico di Torino
Figure 1: Block diagram of the filter bank scheme.
Inspired by result-biased techniques proposed in [21]–[24]
for fixed-width multipliers, this work aims to show that the
complexity of DA-based architectures for DWT computation
can be further reduced by applying result-biasing techniques. It
is relevant to remark that the proposed approach is agnostic,
i.e. it can be applied independently of the design criterion
adopted for the addressed filters. In particular, in this work
we show that i) the complexity of DA-based architectures for
wavelet filters can be reduced by about 20% to 25% with
a very limited performance degradation (thus result-biasing
compensation can be avoided); ii) the implemented DA-based
architecture for the 9/7 wavelet filters features almost the same
performance and complexity as other multiplierless solutions,
which have been optimized by taking advantage of the specific
properties of these filters (see [25]). Furthermore, the proposed
solution features a large complexity reduction compared to
state-of-the-art architectures when applied to the 10/18 wavelet
filters.
The paper is structured as follows. Section II summarizes
the general computational scheme of DA-based architectures
for wavelet filters and Section III introduces concepts and def-
initions for implementing result-biasing techniques. In Section
IV result-biasing is applied to two important cases of study:
the 9/7 and 10/18 wavelet filters. In Section V experimental
results and comparisons are shown. Finally, conclusions are
drawn in Section VI.
II. DA-BASED FBS FOR DWT COMPUTATION
Let us consider the FB shown in Fig. 1, where $H(z) = \sum_{i=0}^{k-1} h[i] z^{-i}$ and $G(z) = \sum_{i=0}^{l-1} g[i] z^{-i}$ are the low pass and
high pass analysis filters with length $k$ and $l$, respectively, and
$\dot{H}(z) = \sum_{i=0}^{\dot{k}-1} \dot{h}[i] z^{-i}$ and $\dot{G}(z) = \sum_{i=0}^{\dot{l}-1} \dot{g}[i] z^{-i}$ are the low pass
and high pass synthesis ones with length $\dot{k}$ and $\dot{l}$.
A. Analysis filters
The two analysis outputs ($y_l$ and $y_h$) are obtained as $y_l[i] = \sum_{j}^{k} h[j]\, x[i-j]$ and $y_h[i] = \sum_{j}^{l} g[j]\, x[i-j]$, where $x[i]$ is
the input signal. Let us assume that the taps of the filters are
amplitude normalized, i.e. $h[j], g[j] \in [-1, 1)$, and represented
as 2's complement values using $n$ bits. Then, we can re-write
$y_l[i]$ and $y_h[i]$ as:

$$
y_l[i] = \left(h[0] \cdots h[k-1]\right) \cdot
\begin{pmatrix} x[i] \\ \vdots \\ x[i-k+1] \end{pmatrix}
= \boldsymbol{\phi}_n \cdot
\begin{pmatrix}
h^{(0)}[0] & \cdots & h^{(0)}[k-1] \\
h^{(-1)}[0] & \cdots & h^{(-1)}[k-1] \\
\vdots & \cdots & \vdots \\
h^{(-n+1)}[0] & \cdots & h^{(-n+1)}[k-1]
\end{pmatrix} \cdot \mathbf{x}_k \tag{1}
$$

and

$$
y_h[i] = \left(g[0] \cdots g[l-1]\right) \cdot
\begin{pmatrix} x[i] \\ \vdots \\ x[i-l+1] \end{pmatrix}
= \boldsymbol{\phi}_n \cdot
\begin{pmatrix}
g^{(0)}[0] & \cdots & g^{(0)}[l-1] \\
g^{(-1)}[0] & \cdots & g^{(-1)}[l-1] \\
\vdots & \cdots & \vdots \\
g^{(-n+1)}[0] & \cdots & g^{(-n+1)}[l-1]
\end{pmatrix} \cdot \mathbf{x}_l, \tag{2}
$$

where $\boldsymbol{\phi}_n = (-2^{0} \;\; 2^{-1} \cdots 2^{-n+1})$, each element $h^{(-r)}[j]$,
$g^{(-r)}[j]$ represents bit $2^{-r}$ of $h[j]$ and $g[j]$, respectively, the
$(\cdot)^t$ operator stands for transposed, $\mathbf{x}_\xi = (x[i] \cdots x[i-\xi+1])^t$
and $\xi$ can be either the length of the low pass or high pass
filter ($k$ or $l$).

Figure 2: Block scheme of the general DA-based architecture.
For a generic filter, the DA-based architecture is obtained
by computing the product between $h^{(-r)}[j]$ (or $g^{(-r)}[j]$) and
$\mathbf{x}_k$ (or $\mathbf{x}_l$) first; then, the result is multiplied by the weight vector $\boldsymbol{\phi}_n = (-2^{0} \;\; 2^{-1} \cdots 2^{-n+1})$. Let $\mathbf{h}_{n,k}$
and $\mathbf{g}_{n,l}$ be the matrices containing $h^{(-r)}[j]$ and $g^{(-r)}[j]$,
respectively, and $\mathbf{u} = \mathbf{h}_{n,k} \cdot \mathbf{x}_k$ and $\mathbf{v} = \mathbf{g}_{n,l} \cdot \mathbf{x}_l$; we then
obtain $y_l[i] = \boldsymbol{\phi}_n \cdot \mathbf{u}$ and $y_h[i] = \boldsymbol{\phi}_n \cdot \mathbf{v}$.
This factorization leads to a 3-stage architecture:
1) a butterfly circuit made of adders to implement the $\mathbf{h}_{n,k}$
and $\mathbf{g}_{n,l}$ matrix product;
2) a hard-wired shift network to apply the $\boldsymbol{\phi}_n$ vector;
3) a tree adder to combine partial results.
In Fig. 2 the generic DA-based architecture to implement $y[i]$
($y_l[i]$ or $y_h[i]$) is depicted, where $\xi$ is the filter length ($k$ or
$l$), the $w^{(-r)}$ terms are the results of the matrix product ($\mathbf{u}$ or
$\mathbf{v}$), $\gg r$ represents an $r$-position right-shift and $f^{(-r)} = w^{(-r)} \gg r$. As detailed in Section IV, the downsampling
operation at the output of the analysis filters is exploited to
alternately compute $y_l[i]$ and $y_h[i]$.
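The three stages listed above can be sketched in a few lines of Python. This is only an illustrative model, not the architecture of the paper: the function names, the integer tap codes and the scaling of the output by $2^{n-1}$ (chosen so that all arithmetic stays exact) are assumptions made for the example.

```python
def to_bits(code, n):
    """Return the n two's-complement bits of integer `code`, sign bit first."""
    return [(code >> (n - 1 - r)) & 1 for r in range(n)]


def da_fir_output(tap_codes, samples, n):
    """Distributed-arithmetic evaluation of sum_j tap[j] * x[j].

    tap_codes: integer codes of the taps (tap = code / 2**(n-1), tap in [-1, 1)).
    The result is scaled by 2**(n-1) so that it remains an exact integer.
    """
    bit_planes = [to_bits(code, n) for code in tap_codes]  # one row per tap
    total = 0
    for r in range(n):
        # Stage 1 (butterfly): add the samples whose tap has bit 2^-r set.
        w_r = sum(x for bits, x in zip(bit_planes, samples) if bits[r])
        # Stage 2 (shift network): weight 2^-r, rescaled by 2^(n-1).
        weight = 1 << (n - 1 - r)
        # Stage 3 (tree adder): accumulate; the sign bit (r = 0) has negative weight.
        total += -weight * w_r if r == 0 else weight * w_r
    return total


def direct_fir_output(tap_codes, samples):
    """Reference result obtained with ordinary multiplications."""
    return sum(code * x for code, x in zip(tap_codes, samples))
```

With exact integer arithmetic the two functions agree for any tap codes that fit in $n$ bits, which is the property the multiplierless factorization relies on.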
B. Synthesis filters
The computational scheme used for the analysis filters can
be used to implement the synthesis filters as well. Indeed,
synthesis filters can be obtained from the analysis ones [5] as
_h[j] = ( 1)j  g[j] _g[j] = ( 1)j  ( h[j]); (3)
with _k = l and _l = k. Moreover, the right part of Fig. 1
(synthesis filters) shows that
_x[i] =

_h[0]    _h[ _k   1]


0@ _yl[i]  
_yl[i  _k + 1]
1A (4)
+

_g[0]    _g[ _l   1]


0@ _yh[i]  
_yh[i  _l + 1]
1A
= n  ( _u+ _v)
where _yl[i] and _yh[i] are obtained by upsampling the yl[i]
and yh[i] signals, _u = _hn; _k  _yl _k and _v = _gn; _l  _yh _l with
_yl _k = ( _yl[i]    _yl[i  _k+1])t and _yh _l = ( _yh[i]    _yh[i  _l+1])t.
Besides, taking into account the zeros added by the upsampling
blocks and the input shift register (shown on the left side of
Fig. 2), we obtain
_u+ _v =
(
_hgn; _  _ylh _ if _yl[i] 6= 0 "
_ghn; _  _yhl _ otherwise
(5)
where 0 " means a zero added by the upsampling and
_hgn; _ =
0BBB@
_h(0)[0] _g(0)[1]   
_h( 1)[0] _g( 1)[1]   
...
...
...
_h( n+1)[0] _g( n+1)[1]   
1CCCA ; (6)
_ylh _ =
0B@ _yl[i]_yh[i+ 1]
...
1CA ; (7)
_ghn; _ =
0BBB@
_g(0)[0] _h(0)[1]   
_g( 1)[0] _h( 1)[1]   
...
...
...
_g( n+1)[0] _h( n+1)[1]   
1CCCA ; (8)
_yhl _ =
0B@ _yh[i]_yl[i+ 1]
...
1CA (9)
with _ = maxf _k; _lg. Thus, the rth row of _hgn; _ and _ghn; _
contains bit 2 r of the interlaced sequences of taps, namely
( _h[0]; _g[1]; _h[2]; _g[3]   ) and ( _g[0]; _h[1]; _g[2]; _h[3]   ), respec-
tively. As a consequence, when _k  j < _ or _l  j < _ the
corresponding taps ( _h[j] or _g[j]) are zero, leading to columns
3FA FAFA FA
a
(m−1)
b
(m−1)
a
(q)
b
(q)
b
(0)
a
(0)
a
(1)
b
(1)
s
(0)
C
(0)
in = ‘0’
C
(q+1)
in C
(q)
in
s
(q)
a
(q)
b
(q)
s
(m−1)
s
(q)
s
(1)
Figure 3: Ripple carry adder.
of zeros in _hgn; _ and _ghn; _, respectively. Unfortunately, the
effectiveness of DA-based architectures applied to synthesis
side of the FB strongly depends on the symmetry of the
wavelet filters. Indeed, in Section IV-A we show that for the
9/7 filters the architecture for the synthesis filters is nearly the
same as the one for the analysis filters. On the contrary, the
architecture for the synthesis filters of the 10/18 wavelet is
very different from the analysis one (see Section IV-C).
III. RESULT-BIASED CIRCUITS FOR DA-BASED
ARCHITECTURES
Let us consider the circuit shown in Fig. 3 to compute $s = a + b$, where the gray shaded box highlights the circuit of a
full adder (FA) and $a$ and $b$ are represented as 2's complement
values using $m$ bits. Let $p_{\nu^{(q)}}$ be the probability that the $q$th bit
of $\nu$ is equal to '1', where $\nu$ is one of the signals involved in
the addition, namely $a, b, s, c_{in}$, and $c_{in}$ is the carry-in signal.
From Fig. 3 one infers that:

$$
p_{s^{(q)}} = p_{c_{in}^{(q)}} \left[ (1 - p_{a^{(q)}})(1 - p_{b^{(q)}}) + p_{a^{(q)}} p_{b^{(q)}} \right]
+ (1 - p_{c_{in}^{(q)}}) \left[ p_{a^{(q)}} (1 - p_{b^{(q)}}) + (1 - p_{a^{(q)}}) p_{b^{(q)}} \right]. \tag{10}
$$

Let us introduce a threshold $T^{(q)}$ such that, if $p_{c_{in}^{(q)}}$ is sufficiently
small, then the following approximation holds true

$$
p_{s^{(q)}} \approx p_{a^{(q)}} (1 - p_{b^{(q)}}) + (1 - p_{a^{(q)}}) p_{b^{(q)}} \quad \text{if } p_{c_{in}^{(q)}} < T^{(q)}. \tag{11}
$$

Since this approximation biases the result of the addition, the
value of $T^{(q)}$ is used to tune the bias effect. In this work we
investigate two strategies to select $T^{(q)}$. These strategies are
referred to as shift-based and probability-based thresholding,
respectively, and will be described in the following paragraphs.
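As a quick numerical check of (10) and (11), the sketch below evaluates the sum-bit probability of a full adder both in closed form and by enumerating its eight input combinations. It assumes that the three input bits are statistically independent; the function names are illustrative and do not come from the paper.

```python
from itertools import product


def sum_bit_probability(p_a, p_b, p_c):
    """Closed form of (10): probability that the sum bit is '1'."""
    same = (1 - p_a) * (1 - p_b) + p_a * p_b      # a == b: sum bit equals carry-in
    differ = p_a * (1 - p_b) + (1 - p_a) * p_b    # a != b: sum bit equals NOT carry-in
    return p_c * same + (1 - p_c) * differ


def sum_bit_probability_no_carry(p_a, p_b):
    """Approximation (11), used when the carry-in probability is below the threshold."""
    return p_a * (1 - p_b) + (1 - p_a) * p_b


def sum_bit_probability_enumerated(p_a, p_b, p_c):
    """Brute-force reference: weight each (a, b, c_in) pattern by its probability."""
    total = 0.0
    for a, b, c in product((0, 1), repeat=3):
        weight = ((p_a if a else 1 - p_a)
                  * (p_b if b else 1 - p_b)
                  * (p_c if c else 1 - p_c))
        total += weight * (a ^ b ^ c)
    return total
```

The gap between (10) and (11) equals the carry-in probability times the difference between the two bracketed terms of (10), so it shrinks linearly as the carry-in probability goes to zero.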
A. Shift-based thresholding
As shown in Fig. 2, DA-based architectures require right
shift operations at the output of the butterfly circuit. It is
known that, in fixed point implementation, changing the
order between additions and right shift operations leads to
a precision loss, i.e. $s_r = [(a + b) \gg r] \neq [(a \gg r) + (b \gg r)]$. However, if $\hat{q}$ is the maximum value such
that $\{(\hat{q} < r) \wedge (p_{c_{in}^{(\hat{q})}} < T^{(\hat{q})})\}$, then we obtain that

$$
(a + b) \gg r \approx [(a \gg \hat{q}) + (b \gg \hat{q})] \gg (r - \hat{q}). \tag{12}
$$

As a consequence, if $a$ and $b$ are represented using $m$ bits,
then we obtain an approximate version of $s_r$ by employing
$(m - \hat{q})$ instead of $m$ FAs.

Table I: Coefficients for the 9/7 analysis filters.
j   h[j]                g[j]
0   0.60294901823636    0.55754352622850
1   0.26686411844287    -0.29563588155713
2   -0.07822326652899   -0.02877176311425
3   -0.01686411844287   0.04563588155713
4   0.02674875741081
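The effect of (12) can be explored with plain integers. In the sketch below, which is only an illustration, Python's `>>` on integers plays the role of the arithmetic right shift of 2's complement hardware, and `q_hat` is treated as the number of least significant positions discarded before the addition (an assumption about how the index maps to dropped full adders).

```python
def exact_shifted_sum(a, b, r):
    """Add first, then shift: the full-precision reference (a + b) >> r."""
    return (a + b) >> r


def approx_shifted_sum(a, b, r, q_hat):
    """Right-hand side of (12): shift both operands by q_hat, add, shift the rest."""
    if not 0 <= q_hat <= r:
        raise ValueError("q_hat must lie between 0 and r")
    return ((a >> q_hat) + (b >> q_hat)) >> (r - q_hat)
```

Because the only information lost is the carry generated by the discarded low bits, the approximate result is never larger than the exact one and differs from it by at most one unit in the last place of the shifted result.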
B. Probability-based thresholding
Another circuit employed in DA-based architectures is the
tree adder. As can be inferred from Fig. 2, the data combined
by the tree adder come from different paths and the magnitude
of two $f^{(-r)}$ terms, out of the $n$ available ones, can be very
different. Let us assume that all $x[i]$ samples have the same
order of magnitude; then, the difference between $|f^{(0)}|$ and
$|f^{(-n+1)}|$ can be large due to the shift operation. This idea
can be exploited to predict the probability of the $q$th carry
signal to be '1':

$$
p_{c_{in}^{(q+1)}} = p_{c_{in}^{(q)}} \left[ (1 - p_{a^{(q)}}) p_{b^{(q)}} + p_{a^{(q)}} (1 - p_{b^{(q)}}) \right] + p_{a^{(q)}} p_{b^{(q)}}. \tag{13}
$$

For some of the values taken by $a$, if $b \ge 0$ and $b$ is "small",
then

$$
\exists\, \tilde{q}_0 \;|\; \{ p_{b^{(q)}} = 0 \;\; \forall q \;|\; \tilde{q}_0 \le q \le m - 1 \}. \tag{14}
$$

The condition in (14) applied to (13) leads to

$$
p_{c_{in}^{(q+1)}} = p_{a^{(q)}}\, p_{c_{in}^{(q)}} \le p_{c_{in}^{(q)}}, \tag{15}
$$

with $\tilde{q}_0 \le q \le m - 1$. From (15) we can infer that in
many cases the probability of the carry-in signal for the most
significant bits tends to 0. Analogously, if $b < 0$ and $|b|$ is
"small", then

$$
\exists\, \tilde{q}_1 \;|\; \{ p_{b^{(q)}} = 1 \;\; \forall q \;|\; \tilde{q}_1 \le q \le m - 1 \}, \tag{16}
$$

which leads to

$$
p_{c_{in}^{(q+1)}} = p_{c_{in}^{(q)}} + p_{a^{(q)}} (1 - p_{c_{in}^{(q)}}) \ge p_{c_{in}^{(q)}}, \tag{17}
$$

with $\tilde{q}_1 \le q \le m - 1$. From (17) we infer that the carry-in
probability tends to 1. In order to address both cases in (15)
and (17), we introduce $\bar{x}$, which is the mean value of the
input signal, and we observe that the probability of carry-in
signals as a function of $q$ creates two regions: i) the Most-Significant-Bit-region
(MSB-region), where $p_{c_{in}^{(q)}}$ tends to 0 or
1 depending on $\bar{x}$, ii) the Least-Significant-Bit-region (LSB-region),
where $p_{c_{in}^{(q)}}$ depends on the statistics of the input signal
and $\tilde{q}$ (either $\tilde{q}_0$ or $\tilde{q}_1$) is the position at the border between
the MSB-region and the LSB-region.
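The two regions can be reproduced by iterating the recursion (13) bit by bit. The sketch below is a toy model under stated assumptions: the bits are treated as independent, the bits of $a$ and the low bits of $b$ are equiprobable, and the border position 6 and word length 16 are arbitrary values picked for illustration.

```python
def carry_probabilities(p_a, p_b):
    """Iterate (13) from the LSB upwards.

    p_a[q], p_b[q]: probability that bit q of a (or b) is '1', LSB first.
    Returns p_c with p_c[q] = probability that the carry-in of bit q is '1';
    p_c[0] = 0 because the first carry-in is tied to '0' (Fig. 3).
    """
    p_c = [0.0]
    for pa, pb in zip(p_a, p_b):
        prev = p_c[-1]
        p_c.append(prev * ((1 - pa) * pb + pa * (1 - pb)) + pa * pb)
    return p_c


M = 16        # word length, as in the experiments of Section IV
BORDER = 6    # hypothetical border between LSB-region and MSB-region

p_a = [0.5] * M
p_b_small_positive = [0.5] * BORDER + [0.0] * (M - BORDER)   # condition (14)
p_b_small_negative = [0.5] * BORDER + [1.0] * (M - BORDER)   # condition (16)

pc_positive = carry_probabilities(p_a, p_b_small_positive)   # decays as in (15)
pc_negative = carry_probabilities(p_a, p_b_small_negative)   # grows as in (17)
```

In this toy setting the carry-in probability halves at every position above the border when $b$ is small and non-negative, and its complement halves when $b$ is small and negative, which matches the qualitative behavior described for the MSB-region.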
Table II: Values of the $\mathbf{h}_{13,9}$ and $-\mathbf{g}_{13,7}$ coefficients and corresponding $d_j$ vectors for the 9/7 wavelet filters.
r h[4] h[3] h[2] h[1] h[0] d_j -g[3] -g[2] -g[1] -g[0] d_j
0 0 1 1 0 0 d0 1 0 0 1 d6
1 0 1 1 0 1 d3 1 0 0 0 d9
2 0 1 1 1 0 d8 1 0 1 1 d11
3 0 1 1 0 0 d0 1 0 0 1 d6
4 0 1 0 0 1 d6 1 0 0 1 d6
5 0 1 1 0 1 d3 0 0 1 0 d12
6 1 0 0 1 0 d4 1 1 0 0 d0
7 1 1 1 0 1 d2 0 1 1 0 d10
8 0 1 1 0 0 d0 0 1 1 1 d5
9 1 1 1 0 0 d1 0 0 1 0 d12
10 1 0 1 1 1 d7 1 1 0 1 d3
11 0 1 1 0 0 d0 0 0 1 0 d12
12 1 0 1 1 1 d7 1 1 0 0 d0
Figure 4: Butterfly circuit (a) and tree adder with hard-wired shift network computational scheme (b) for the 9/7 wavelet filters.
To maximize the occurrence of the conditions in (15) and
(17), we add $f^{(-r)}$ values as follows:

$$
t_1^{(-i)} = f^{(-i)} + f^{(-n/2-i)}, \tag{18}
$$

with $0 \le i < n/2$. Then, by setting a threshold $T$, we can
find

$$
\bar{q} = \max\{q\} \in [0, \tilde{q}] \;|\; p_{c_{in}^{(q)}} < T \tag{19}
$$

and force $c_{in}^{(q)} = 0$ for $0 \le q \le \bar{q}$. The same approach can
be extended to all the levels in the tree adder. Let $T_{\lambda,\vartheta}$ be
the threshold for adder $\vartheta$ at level $\lambda$ and $\bar{q}_{\lambda,\vartheta}$ the position of
the last FA such that $c_{in}^{(q)} = 0$ for $0 \le q \le \bar{q}_{\lambda,\vartheta}$. If the input
values are represented using $m$ bits, then we can obtain an
approximate version of each result by employing $(m - \bar{q}_{\lambda,\vartheta})$
instead of $m$ FAs.
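A behavioral model of an adder with a cut carry chain helps to see the size of the bias. The sketch below is an assumption-laden illustration rather than the circuit of the paper: `cut` is a hypothetical parameter counting the least significant positions handled without full adders, the low bits of the result are simply copied from `a`, and the remaining positions are added with a carry-in of '0'.

```python
def result_biased_add(a, b, cut):
    """Approximate a + b with the carry chain cut below bit position `cut`."""
    low_mask = (1 << cut) - 1
    high = ((a >> cut) + (b >> cut)) << cut   # full adders on the upper bits only
    return high | (a & low_mask)              # low bits bypass the adder


def bias(a, b, cut):
    """Difference between the exact and the result-biased sum."""
    return (a + b) - result_biased_add(a, b, cut)
```

In this model the bias is exactly the value of the `cut` low-order bits of `b`, so it is non-negative and smaller than `2**cut`. This is why the technique pays off when one operand has been shifted right and therefore carries little weight in those positions.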
IV. CASES OF STUDY: RESULT-BIASED DA-BASED
ARCHITECTURES FOR THE 9/7 AND 10/18 ANALYSIS
WAVELET FILTERS
Two important cases of study are shown in the following:
the experimental results obtained by implementing result-biased
DA-based architectures for the 9/7 and 10/18 wavelet
filters. In order to show both the complexity reduction and the
performance achieved by the proposed result-biased DA-based
architectures, we modified openjpeg [26], a Class-1 Profile-1
compliant open source JPEG2000 implementation¹. To be
compatible with the openjpeg model, we represented the $h[j]$ and
$g[j]$ taps with 1 bit for the integer part and 12 bits for the
fractional part ($n = 13$). Internal data are represented as 16-bit
fixed point values ($m = 16$) as in other works, e.g. [4], [9]. For
our simulations five standard images (256 gray levels), namely
'Lena' 512 × 512, 'Barbara' 512 × 512, 'Boat' 512 × 512,
'Goldhill' 512 × 512 and 'Fingerprint' 512 × 512 [28], have
been employed². The number of DWT decomposition levels
($L$) has been varied from 1 to 4. This corresponds to $L + 1$
DWT resolution levels, which is the parameter required by
openjpeg. Different compression ratios have been imposed,
namely 1:1, 8:1, 16:1, 32:1 and 64:1; precinct and code-block
size are the encoder default values. Simulations shown
in this work have been obtained by modifying the encoder:
we first implemented the forward DWT with the DA-based
solution proposed in [20] for the 9/7 DWT. Then, the DA-based
solution was extended to support the 10/18 wavelet
filters as well. Finally, we implemented the proposed result-biasing
techniques.

¹ For other profiles related to Digital Cinema the reader can refer to [27].

Table III: List of $d_j$ vectors and corresponding $I_j$ sets for the 9/7 wavelet filters.
d_j   I_j                                                      min_r{I_j}
d0    I0 = {u^(0), u^(-3), u^(-8), u^(-11), v^(-6), v^(-12)}    0
d1    I1 = {u^(-9)}                                             9
d2    I2 = {u^(-7)}                                             7
d3    I3 = {u^(-1), u^(-5), v^(-10)}                            1
d4    I4 = {u^(-6)}                                             6
d5    I5 = {v^(-8)}                                             8
d6    I6 = {u^(-4), v^(0), v^(-3), v^(-4)}                      0
d7    I7 = {u^(-10), u^(-12)}                                   10
d8    I8 = {u^(-2)}                                             2
d9    I9 = {v^(-1)}                                             1
d10   I10 = {v^(-7)}                                            7
d11   I11 = {v^(-2)}                                            2
d12   I12 = {v^(-5), v^(-9), v^(-11)}                           5

Table IV: Values of the $\dot{\mathbf{hg}}_{13,9}$ and $\dot{\mathbf{gh}}_{13,9}$ coefficients and corresponding $\dot{d}_j$ vectors for the 9/7 wavelet filters.
r    ḣ[4] ġ[3] ḣ[2] ġ[1] ḣ[0]  ḋ_j    ġ[4] ḣ[3] ġ[2] ḣ[1] ġ[0]  ḋ_j
0    0    1    1    0    0     ḋ0     1    1    0    0    1     ḋ10
1    0    1    1    0    1     ḋ3     1    1    0    0    0     ḋ11
2    0    1    1    1    0     ḋ8     1    1    0    1    1     ḋ13
3    0    1    1    0    0     ḋ0     1    1    0    0    1     ḋ10
4    0    1    1    0    0     ḋ0     1    1    1    0    0     ḋ1
5    0    1    1    0    1     ḋ3     1    0    0    1    0     ḋ4
6    0    0    0    1    1     ḋ5     0    1    1    0    1     ḋ3
7    0    1    0    0    1     ḋ6     0    0    0    1    0     ḋ12
8    0    1    0    0    0     ḋ9     1    0    0    1    1     ḋ14
9    0    1    1    0    1     ḋ3     0    0    0    1    1     ḋ15
10   0    0    0    1    1     ḋ5     0    1    0    0    0     ḋ9
11   0    1    1    0    0     ḋ0     1    0    0    1    1     ḋ14
12   0    0    1    1    0     ḋ7     1    1    1    0    1     ḋ2
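The 13-bit tap format described in this section (1 integer bit, 12 fractional bits) can be illustrated by quantizing the low pass taps of Table I and listing their bits. The rounding mode is not stated in the text; the sketch assumes truncation toward minus infinity, a choice that happens to reproduce the h[4] to h[0] columns of Table II. Function and constant names are illustrative.

```python
import math

N_BITS = 13  # 1 integer (sign) bit + 12 fractional bits

# h[0] .. h[4] of the 9/7 low pass analysis filter (Table I)
H_TAPS = [0.60294901823636, 0.26686411844287, -0.07822326652899,
          -0.01686411844287, 0.02674875741081]


def quantize(tap, n=N_BITS):
    """Integer 2's complement code of `tap` with n-1 fractional bits (truncation assumed)."""
    return math.floor(tap * (1 << (n - 1)))


def bit_plane_rows(taps, n=N_BITS):
    """Row r holds bit 2^-r of h[4], h[3], ..., h[0], the column order of Table II."""
    codes = [quantize(t, n) for t in reversed(taps)]
    return [[(code >> (n - 1 - r)) & 1 for code in codes] for r in range(n)]


rows = bit_plane_rows(H_TAPS)
```

For instance, `rows[0]` is the sign-bit row: only the two negative taps, h[3] and h[2], have it set, which corresponds to the vector d0 = [0 1 1 0 0].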
A. DA-based architecture for the 9/7 wavelet filters
As argued in [20], it is more convenient to consider the
binary representation of $h[j]$ and $-g[j]$, instead of $h[j]$ and
$g[j]$, to find terms that are common to both the low pass
and the high pass taps. Given that the 9/7 wavelet filters are
symmetric (see Table I), we can further reduce the complexity
of the butterfly circuit. These two considerations allow us to
write $\mathbf{h}_{n,k}$ and $-\mathbf{g}_{n,l}$ for the 9/7 filters, as shown in Table
II, where repeated common-term-vectors ($d_j$) are gray-shaded.
Moreover, to exploit filter symmetry, we introduce the column
vector $\mathbf{C}$, whose elements are

$$
C_\omega =
\begin{cases}
x[i] & \omega = 0 \\
x[i+\omega] + x[i-\omega] & \omega = 1, \ldots, 4
\end{cases}. \tag{20}
$$

Then, we produce the $w^{(-r)}$ values, as shown in Fig. 4 (a), by
combining $\mathbf{C}$ with the 13 possible $d_j$ vectors. As an example,
Table II shows that $d_0 \cdot \mathbf{C}$, where $d_0 = [0\;1\;1\;0\;0]$, is used to
calculate $u^{(0)}$, $u^{(-3)}$, $u^{(-8)}$, $u^{(-11)}$ for the low pass branch
and $v^{(-6)}$, $v^{(-12)}$ for the high pass branch. In general, every
product $d_j \cdot \mathbf{C}$ defines a set ($I_j$) made of the proper $u$ and $v$
elements, e.g. $I_0 = \{u^{(0)}, u^{(-3)}, u^{(-8)}, u^{(-11)}, v^{(-6)}, v^{(-12)}\}$,
as shown in Table III. Furthermore, as argued in [20], a
Reduced-Adder-Graph-like technique [29], where common
sub-expressions are extracted and calculated only once, reduces
the number of adders required by the butterfly circuit.
As an example, sub-expression $C_3 + C_2$, which is common to
several $d_j \cdot \mathbf{C}$ products, is computed only once and then reused
multiple times.

² Other images have been tested as well. Since the results we obtained are
similar to the ones presented in this paper, we do not show them for the sake
of brevity.

A similar approach can be employed for the synthesis filters,
where odd filter lengths and the symmetry of the $\dot{\mathbf{hg}}_{13,9}$ and
$\dot{\mathbf{gh}}_{13,9}$ matrices can be exploited to define

$$
\dot{C}_\omega =
\begin{cases}
\dot{y}[i] & \omega = 0 \\
\dot{y}[i+\omega] + \dot{y}[i-\omega] & \omega = 1, \ldots, 4
\end{cases}, \tag{21}
$$

where $\dot{y}[i]$ can be either $\dot{y}_l[i]$ or $\dot{y}_h[i]$ (see the notation
introduced in Section II-B). The corresponding butterfly circuit
is very similar to the one shown in Fig. 4 (a) and can be derived
from the $\dot{d}_j$ vectors summarized in Table IV. Finally, both
analysis and synthesis architectures rely on a shift network
and a tree adder to compute the results, as shown in Fig. 2 for
a general case.
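The folding in (20) can be verified with a small floating-point model of the 9/7 low pass filter. The sketch assumes a zero-phase indexing in which tap h[|j|] multiplies x[i - j] for j = -4, ..., 4; the function names are illustrative, and the multiplications are kept only to check the folding, not to model the multiplierless datapath.

```python
# h[0] .. h[4] of the 9/7 low pass analysis filter (Table I)
H = [0.60294901823636, 0.26686411844287, -0.07822326652899,
     -0.01686411844287, 0.02674875741081]


def folded_inputs(x, i):
    """C_0 = x[i]; C_w = x[i + w] + x[i - w] for w = 1..4, as in (20)."""
    return [x[i]] + [x[i + w] + x[i - w] for w in range(1, 5)]


def lowpass_folded(x, i):
    """Five products on the folded inputs instead of nine on the raw samples."""
    return sum(h * c for h, c in zip(H, folded_inputs(x, i)))


def lowpass_direct(x, i):
    """Reference 9-tap symmetric convolution."""
    return sum(H[abs(j)] * x[i - j] for j in range(-4, 5))
```

Because each pair of mirrored samples shares one tap, the folded form needs four pre-additions to build the vector of folded inputs, which is the part of the butterfly circuit common to all common-term-vectors.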
B. Result-biased DA-based architecture for the 9/7 analysis
wavelet filters
1) Implementation of the result-biased butterfly circuit:
As described in Section III-A, we can reduce the number of
FAs required to compute the $w^{(-r)}$ terms as follows. Since
the $d_j \cdot \mathbf{C}$ products define 13 different sets ($I_j$), we have
13 possible $\hat{q}_j$ values. Every $\hat{q}_j$ is the maximum value that
satisfies $\{(q < \min_r\{I_j\}) \wedge (p_{c_{in}^{(q)}} < T^{(q)})\}$, where $\min_r\{I_j\}$
is the minimum among the possible shift amounts ($r$) in $I_j$.
As an example, $\min_r\{I_0\} = 0$ implies that the elements in
$I_0$ are not affected by result-biasing. On the contrary, $d_{10} \cdot \mathbf{C}$,
where $d_{10} = [0\;0\;1\;1\;0]$, is employed to calculate only $v^{(-7)}$.
Thus, $\min_r\{I_{10}\} = 7$, so $v^{(-7)}$ can be approximated by finding
$\hat{q}_{10} = \max_q\{(q < 7) \wedge (p_{c_{in}^{(q)}} < T^{(q)})\}$.
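The sets $I_j$ and their minimum shift amounts can be derived mechanically from the bit rows of Table II by grouping identical rows. The sketch below does this with a dictionary keyed by the bit pattern; the high pass rows are padded with a leading zero for the missing $C_4$ position, in line with $d_{10} = [0\;0\;1\;1\;0]$ quoted in the text. The data layout and function names are assumptions made for illustration.

```python
from collections import defaultdict

# Rows r = 0..12 of Table II: h[4] h[3] h[2] h[1] h[0] and -g[3] -g[2] -g[1] -g[0]
H_ROWS = ["01100", "01101", "01110", "01100", "01001", "01101", "10010",
          "11101", "01100", "11100", "10111", "01100", "10111"]
NEG_G_ROWS = ["1001", "1000", "1011", "1001", "1001", "0010", "1100",
              "0110", "0111", "0010", "1101", "0010", "1100"]


def common_term_sets(h_rows, g_rows):
    """Map each distinct bit pattern to the (branch, r) terms it produces."""
    sets = defaultdict(list)
    for r, bits in enumerate(h_rows):
        sets[bits].append(("u", r))
    for r, bits in enumerate(g_rows):
        sets["0" + bits].append(("v", r))   # pad the unused C_4 position
    return dict(sets)


def min_shift(members):
    """Smallest shift amount r among the terms that share one pattern."""
    return min(r for _, r in members)


sets = common_term_sets(H_ROWS, NEG_G_ROWS)
```

A pattern whose smallest shift is 0, such as `"01100"`, cannot be biased without touching an unshifted term, whereas a pattern used only with a large shift, such as `"00110"` with shift 7, leaves room for dropping low-order full adders.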
To set each $T^{(q)}$, we simulated the proposed DA-based
result-biased DWT in the openjpeg model with the test conditions
detailed at the beginning of Section IV. In Table V
we show the results obtained by choosing $T^{(q)}$ such that
$\hat{q}_j = \min_r\{I_j\} - 1$. As an example, $\hat{q}_0 = \min_r\{I_0\} - 1 = -1$
means that the elements in $I_0$ are not biased. Experimental
results show that the Peak Signal to Noise Ratio (PSNR)
difference ($\Delta\mathrm{PSNR}_{BB}$) between the original DA-based DWT
($\mathrm{PSNR}_{DA}$) and the proposed one is negligible when result-biasing
is applied to the butterfly circuit (BB, $\mathrm{PSNR}_{BB}$).
Moreover, the standard butterfly circuit [20] requires $15 \cdot m$
FAs, where $4 \cdot m$ FAs are required to compute $\mathbf{C}$. On
the other hand, the proposed result-biased butterfly saves
$\sum_{j=0}^{12} (\hat{q}_j + 1) = 55$ FAs, where $\hat{q}_j + 1 = \min_r\{I_j\}$. As
an example, since $I_{10} = \{v^{(-7)}\}$, the computation of $v^{(-7)}$
requires only $m - 7$ FAs. Since $m = 16$, the standard butterfly
circuit requires 240 FAs, whereas the proposed one requires
240 - 55 = 185 FAs.

Table V: PSNR comparison between the DA-based DWT [20] (column $\mathrm{PSNR}_{DA}$) and the proposed DA-based DWT with
result-biased butterfly circuit (column $\Delta\mathrm{PSNR}_{BB} = \mathrm{PSNR}_{DA} - \mathrm{PSNR}_{BB}$) for the 9/7 filters.
                    PSNR_DA [dB]                     ΔPSNR_BB [dB]
Image        L   1:1    8:1    16:1   32:1   64:1   1:1    8:1    16:1   32:1   64:1
Lena         1   49.30  38.94  34.24  30.01  24.13  0.00   0.00   0.00   0.00   0.00
             2   48.79  39.94  36.68  33.20  29.59  0.00   0.00   0.00   0.00   0.00
             3   48.48  40.05  37.19  34.05  30.87  0.00   0.00   0.00   0.00   0.00
             4   48.21  40.05  37.20  34.11  30.98  0.00   0.01   0.00   0.00   0.00
Barbara      1   49.49  35.79  29.42  24.04  20.43  0.00   0.00   0.00   0.00   0.00
             2   48.97  37.59  32.16  27.77  24.17  0.02   0.00   0.00   0.00   0.00
             3   48.62  37.83  32.78  28.75  25.30  -0.01  0.00   0.00   0.00   0.00
             4   48.34  37.85  32.77  28.80  25.78  -0.03  -0.01  0.00   0.00   0.00
Boat         1   49.37  37.63  32.62  28.23  23.39  0.00   0.00   -0.02  0.00   0.00
             2   48.76  38.88  33.99  30.14  26.96  -0.06  0.00   0.00   0.00   0.00
             3   48.37  39.02  34.44  30.91  27.86  0.01   0.00   0.00   0.00   0.00
             4   48.05  39.01  34.52  30.97  28.05  0.01   0.01   0.00   0.00   0.00
Goldhill     1   49.68  35.79  31.69  27.55  23.13  0.00   0.00   0.00   0.00   0.00
             2   49.19  36.34  32.88  30.16  27.37  0.00   0.00   0.00   0.00   0.00
             3   48.78  36.40  33.14  30.52  28.44  0.00   -0.04  0.00   0.00   0.00
             4   48.32  36.41  33.18  30.49  28.46  -0.02  0.00   0.00   0.00   -0.01
Fingerprint  1   49.47  35.83  31.74  27.76  17.76  0.00   0.00   0.00   0.00   0.00
             2   48.85  36.19  32.36  29.14  25.99  -0.01  0.00   0.00   0.00   0.00
             3   48.18  36.22  32.43  29.47  26.78  -0.01  0.00   0.00   0.00   0.00
             4   47.42  36.15  32.45  29.50  26.87  -0.02  0.00   0.00   0.00   0.00

Figure 5: Values of $p_{c_{in}^{(q)}}$ vs $q$ in the low pass (lp) and high pass (hp) computation of the tree adder first level with result-biased
butterfly circuit for the 'Goldhill' image. (a) lp tree adder first level: low pass generation. (b) hp tree adder first level: high pass generation. Each panel plots the curves for $t_1^{(-1)}, \ldots, t_1^{(-6)}$ and the approximation.
2) Result-biased tree adder implementation: Stemming
from the computational scheme defined in the previous section,
the 13 different $f^{(-r)}$ values are added together. As detailed
in Section III-B, we combine $f^{(-r)}$ values as in (18). Fig. 4
(b) shows the tree adder and the hard-wired shift network used
in the architecture for the 9/7 wavelet filters. As can be observed,
the Hi_Lo_n signal produces $y_h[i]$ (Hi_Lo_n = '1')
or $y_l[i]$ (Hi_Lo_n = '0') by selecting the proper $d_j \cdot \mathbf{C}$ input
to the hard-wired shift network.

Table VI: Parameters used to approximate $p_{c_{in}^{(q)}}$ in the LSB-region.
level  adder      α     β     γ   δ    ε
1      all        0.31  0.15  0   3    √2
2      t_2^(-1)   0.31  0.11  2   4    √2
2      t_2^(-2)   0.31  0.09  1   4    √2
2      t_2^(-3)   0.31  0.06  0   4    √2
3      t_3^(0)    0.31  0.11  3   10   √2
3      t_3^(-1)   0.31  0.11  2   3    √2

Table VII: PSNR loss with respect to the original DA-based architecture for the 9/7 filters. Results are obtained by enabling
the result-biased butterfly circuit and result-biasing at the first level of the tree adder, $\Delta\mathrm{PSNR}_{BB+TB1} = \mathrm{PSNR}_{DA} - \mathrm{PSNR}_{BB+TB1}$;
similarly for the second and third level with $\Delta\mathrm{PSNR}_{BB+TB1/2} = \mathrm{PSNR}_{DA} - \mathrm{PSNR}_{BB+TB1/2}$ and $\Delta\mathrm{PSNR}_{BB+TB1/2/3} = \mathrm{PSNR}_{DA} - \mathrm{PSNR}_{BB+TB1/2/3}$, respectively.
                    ΔPSNR_BB+TB1 [dB]                ΔPSNR_BB+TB1/2 [dB]              ΔPSNR_BB+TB1/2/3 [dB]
Image        L   1:1    8:1    16:1   32:1   64:1   1:1    8:1    16:1   32:1   64:1   1:1    8:1    16:1   32:1   64:1
Lena         1   0.10   0.02   -0.06  0.00   0.00   0.11   0.01   -0.08  0.00   0.00   0.11   0.01   -0.09  0.00   0.00
             2   0.38   0.04   -0.04  0.01   0.00   0.32   0.04   -0.04  0.01   0.00   0.32   0.04   -0.04  0.01   0.00
             3   0.43   0.07   0.04   0.02   0.01   0.48   0.08   0.05   0.02   0.01   0.49   0.08   0.05   0.03   0.01
             4   0.87   0.15   0.07   0.05   0.00   0.84   0.16   0.08   0.05   0.00   0.97   0.16   0.08   0.05   0.01
Barbara      1   0.07   0.00   -0.06  -0.01  -0.21  0.08   -0.01  -0.06  0.00   -0.21  0.08   -0.01  -0.06  0.00   -0.21
             2   0.17   0.00   0.00   0.09   0.01   0.20   0.00   0.00   0.09   0.01   0.18   0.00   0.00   0.09   0.01
             3   0.29   0.04   0.01   0.02   0.00   0.33   0.03   0.00   0.01   -0.28  0.34   0.03   0.00   0.01   -0.28
             4   0.46   0.06   -0.01  -0.01  0.00   0.55   0.07   0.00   -0.02  0.00   0.55   0.08   0.00   -0.02  0.00
Boat         1   0.14   -0.02  -0.01  0.02   -0.15  0.21   -0.01  0.00   0.02   -0.15  0.15   -0.01  0.00   0.02   -0.15
             2   0.43   0.03   0.02   0.01   0.01   0.46   0.02   0.01   0.01   0.01   0.46   0.02   0.01   0.01   0.01
             3   0.83   0.08   0.00   0.02   0.01   0.89   0.09   0.03   0.02   0.01   0.91   0.09   0.00   0.02   0.01
             4   1.29   0.18   0.04   0.02   0.06   1.38   0.18   0.05   0.03   0.06   1.40   0.19   0.05   0.03   0.06
Goldhill     1   0.08   0.02   0.00   0.08   0.00   0.08   0.02   0.00   0.08   0.00   0.08   0.02   0.00   0.08   0.00
             2   0.18   0.02   0.05   0.00   0.00   0.20   0.02   0.05   0.01   0.00   0.20   0.02   0.05   0.01   0.00
             3   0.31   -0.01  0.02   0.00   0.00   0.36   -0.01  0.02   0.00   0.00   0.37   -0.01  0.02   0.00   0.00
             4   0.47   0.04   0.02   0.01   0.00   0.54   0.05   0.04   0.00   0.00   0.57   0.05   0.04   0.00   0.00
Fingerprint  1   -0.09  0.02   0.01   -0.06  0.04   -0.08  0.02   0.01   -0.06  0.04   -0.08  0.02   0.01   -0.06  0.04
             2   -0.31  0.01   0.00   0.00   0.00   -0.30  0.01   0.00   0.00   0.00   -0.26  0.01   0.00   0.00   0.01
             3   -0.64  0.02   -0.02  0.00   0.00   -0.63  0.02   -0.02  0.00   0.00   -0.49  -0.02  -0.01  0.00   0.00
             4   -1.01  -0.07  0.03   -0.01  -0.01  -1.01  -0.07  -0.02  -0.01  -0.01  -1.01  -0.06  -0.01  -0.01  -0.01
As discussed in Section III-B, the delay and the complexity
of the tree adder can be reduced by cutting the carry
chains, namely by fixing the value of $T_{\lambda,\vartheta}$ and by finding
the corresponding $\bar{q}_{\lambda,\vartheta}$. Simulations were performed on the
modified openjpeg model in the test conditions described in the
first paragraph of Section IV and including the result-biased
butterfly circuit described in Section IV-B1. As an example,
Fig. 5 shows the values of $p_{c_{in}^{(q)}}$ obtained with the 'Goldhill'
image by varying $q \in [0, 15]$ for low pass (lp) and high pass
(hp) filters, respectively, at the first level of the tree adder
(referred to as $t_1^{(-i)}$ in Fig. 4 (b), with $i = 1, \ldots, 6$). Since
the openjpeg model converts image pixels from $[0, 255]$ to
$[-128, 127]$, the mean value of the input signal is negative ($\bar{x} < 0$) for the 'Goldhill' image. Indeed, the
values of $p_{c_{in}^{(q)}}$ in the MSB-region (Fig. 5) tend to 1, whereas
in the LSB-region $p_{c_{in}^{(q)}} \in [0.15, 0.31]$. Simulations show that
in both low pass and high pass computation the following
approximation holds true:
$$
p_{c_{in}^{(q)}} \approx \alpha - \beta\, e^{-(q-\gamma)/\varepsilon}, \tag{22}
$$
where the values of the coefficients are summarized in Table
VI. As shown in Fig. 5, the curve defined in (22) is a good
approximation of $p_{c_{in}^{(q)}}$ in the LSB-region.
The approximation in (22) can be used for setting the
threshold of each adder in the tree adder. As an example,
simulations show that if one sets the threshold to the highest
probability in the LSB-region ($T = p_{c_{in}^{(\tilde{q})}}$), which corresponds
to $\bar{q} = \tilde{q}$, then there is a PSNR loss of up to 5 dB. As
a consequence, we set $\bar{q} = \tilde{q} - \delta < \tilde{q}$ with $\delta > 0$.
It is worth pointing out that, since we force $c_{in}^{(q)} =$ '0'
for $0 \le q \le \bar{q}$, the corresponding bits of $s = a + b$
are not correct. However, in JPEG2000 the results of the
DWT are quantized, so it is unnecessary to compensate the
approximation caused by result-biasing, as discussed in the
next paragraphs. Thus, to save complexity we approximate
$s^{(q)} \approx a^{(q)}$ for $0 \le q \le \bar{q}$. Through extensive simulations
we found the values for $\delta$ (see Table VI) that minimize the
PSNR loss. These values lead to the results detailed in Table
VII as $\Delta\mathrm{PSNR}_{BB+TB1} = \mathrm{PSNR}_{DA} - \mathrm{PSNR}_{BB+TB1}$, where
$\mathrm{PSNR}_{BB+TB1}$ is the PSNR obtained by performing result-biasing
both in the butterfly circuit and the first level of the
adder tree. From the complexity point of view, the number of
FAs required for the six adders at the first level of the tree
adder decreases from 96 to 57 (39 FAs saved).

Table VIII: Coefficients for the analysis 10/18 filters.
j      h[j]                      σ(j)·g[j]
0,-1   5.366288017916415e-001    -4.407818292932527e-001
1,-2   5.429907539425682e-002    1.155190028604326e-001
2,-3   -1.113880188246157e-001   6.057160715369129e-002
3,-4   5.829726464040216e-005    -9.733420187993370e-003
4,-5   2.040184437407670e-002    -2.180274267317332e-002
5,-6                             -1.787592313637589e-003
6,-7                             6.683900685043967e-003
7,-8                             -1.928418995893536e-006
8,-9                             -6.748739325063276e-004
The approach used for the first level of adders in the
tree adder is applied to the other levels as well and
the value of each parameter is summarized in Table VI.
As it can be observed, the performance loss caused by
result-biasing at the second and third level of adders in
the tree adder (PSNRBB+TB1=2 and PSNRBB+TB1=2=3,
respectively) is nearly the same as PSNRBB+TB1,
where PSNRBB+TB1=2 = PSNRDA   PSNRBB+TB1=2,
PSNRBB+TB1=2=3 = PSNRDA   PSNRBB+TB1=2=3 and
PSNRBB+TB1=2, PSNRBB+TB1=2=3 are the PSNR values
obtained by introducing result-biasing in the butterfly circuit
and at the first, second (BB+ TB1=2) and first, second and
third (BB+ TB1=2=3) levels in the tree adder. When the
8Table IX: Values of the h13;10 and g13;18 coefficients and corresponding dj vectors for the 10/18 wavelet filters.
 r h[4] h[3] h[2] h[1] h[0] dj g[8] g[7] g[6] g[5] g[4] g[3] g[2] g[1] g[0] dj
0 0 0 1 0 0 d0 0 0 1 0 0 0 1 1 0 d7
1 0 0 1 0 1 d1 0 0 1 0 0 0 1 1 0 d7
2 0 0 1 0 0 d0 0 0 1 0 0 0 1 1 1 d8
3 0 0 1 0 0 d0 0 0 1 0 0 0 1 1 1 d8
4 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 d9
5 0 0 0 1 1 d2 0 0 1 0 0 0 0 0 0 d10
6 1 0 0 1 0 d3 0 0 1 0 1 0 0 0 0 d11
7 0 0 1 0 0 d0 0 0 1 0 0 1 0 1 0 d12
8 1 0 1 1 1 d5 0 0 0 0 1 0 0 0 0 d13
9 0 0 1 1 0 d6 0 0 0 0 1 1 1 0 1 d14
10 1 0 0 1 1 d4 0 0 1 1 0 0 0 1 1 d15
11 0 0 0 1 1 d2 1 0 0 1 0 0 0 1 0 d16
12 0 0 0 0 0 0 1 0 1 1 1 0 0 1 1 d17
Table X: Values of the $\dot{h}g_{13,18}$ and $\dot{g}h_{13,18}$ coefficients. Rows are indexed by r (first column), columns by j.
r \ j 8 7 6 5 4 3 2 1 0 -1 -2 -3 -4 -5 -6 -7 -8 -9
$\dot{h}g_{13,18}$
0 1 0 0 0 1 0 0 1 1 1 1 0 0 1 0 0 0 0
1 1 0 0 0 1 0 0 1 1 0 1 0 0 1 0 0 0 0
2 1 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0
3 1 0 0 0 1 0 0 1 0 1 1 0 0 1 0 0 0 0
4 1 0 0 0 1 0 0 1 0 1 0 1 0 1 0 0 0 0
5 1 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0
6 1 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 0 0
7 1 0 0 0 1 0 1 1 1 1 1 0 1 1 0 0 0 0
8 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0
9 1 0 1 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0
10 1 0 0 0 1 0 0 0 0 0 1 0 0 1 1 0 0 0
11 0 0 1 0 1 0 0 1 1 1 1 0 0 0 1 0 0 0
12 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0
$\dot{g}h_{13,18}$
0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 0 0 0 1
1 0 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 1
2 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1
3 0 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 1
4 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1
5 0 0 0 0 0 0 0 0 1 1 1 1 0 1 0 0 0 1
6 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 1
7 0 0 0 0 0 1 1 1 0 1 0 1 0 1 0 0 0 1
8 0 0 0 0 1 0 1 0 1 1 1 1 0 0 0 1 0 1
9 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 1 0 1
10 0 0 0 1 1 0 0 1 1 0 1 0 0 1 0 0 0 1
11 0 0 0 1 0 0 0 1 1 1 1 0 0 1 0 1 0 0
12 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 1 0 1
result-biasing technique is applied at the second and third levels
of the tree adder, the number of required FAs decreases from
48 to 36 (12 FAs saved) and from 32 to 26 (6 FAs saved),
respectively. Thus, the proposed result-biased tree adder saves
39 + 12 + 6 = 57 FAs on levels 1 to 3 of the tree adder. Further experiments
have shown that there is no advantage in implementing result-
biasing in the adder at the fourth level.
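The FA savings quoted for the tree adder can be tallied with a short script. This is only a bookkeeping sketch: the per-level counts are the ones stated in the text, while the variable names are illustrative and not part of the original design flow.

```python
# (standard FAs, result-biased FAs) for levels 1 to 3 of the tree adder,
# as quoted in the text. Level 4 is left unchanged.
fa_per_level = {1: (96, 57), 2: (48, 36), 3: (32, 26)}

# FAs saved at each level and in total
saved_per_level = {lvl: std - rb for lvl, (std, rb) in fa_per_level.items()}
total_saved = sum(saved_per_level.values())

for lvl, saved in saved_per_level.items():
    print(f"level {lvl}: {saved} FAs saved")
print(f"levels 1 to 3: {total_saved} FAs saved")
```

The total agrees with the 57 FAs reported for the result-biased tree adder.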
C. DA-based architecture for the 10/18 wavelet filters
As shown in Table VIII, the 10/18 wavelet filters are
symmetric. This property can be exploited to design a reduced
complexity architecture. However, since in this case k and l
are even values, we introduce
$$(j) = \begin{cases} 1 & j \ge 0 \\ -1 & j < 0 \end{cases} \qquad (23)$$
and derive the common-term vectors $d_j$, which are
summarized in Table IX. The corresponding butterfly circuit is
depicted in Fig. 6, where $C_\omega = x[i + \omega + 1] \pm x[i - \omega]$,
$\omega = 1, \ldots, 8$, and the Hi_Lo_n signal is used to add or
subtract input samples in the low-pass and high-pass filter
implementations, respectively. Then, as for the 9/7 case, the
architecture relies on a hard-wired shift network and a tree
adder.
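The grouping of identical rows of Table IX into common-term vectors can be reproduced with a few lines of Python. This is a minimal sketch under stated assumptions: the bit patterns are transcribed from Table IX as strings (h[4]..h[0] and g[8]..g[0], rows r = 0..12), and the names `H_ROWS`, `G_ROWS` and `common_term_vectors` are hypothetical helpers, not taken from the paper.

```python
from collections import defaultdict

# Rows r = 0..12 of Table IX, one string of bits per row.
H_ROWS = ["00100", "00101", "00100", "00100", "00000", "00011", "10010",
          "00100", "10111", "00110", "10011", "00011", "00000"]
G_ROWS = ["001000110", "001000110", "001000111", "001000111", "001000101",
          "001000000", "001010000", "001001010", "000010000", "000011101",
          "001100011", "100100010", "101110011"]

def common_term_vectors(rows):
    """Map each distinct non-zero bit pattern to the rows r where it occurs."""
    groups = defaultdict(list)
    for r, bits in enumerate(rows):
        if "1" in bits:  # all-zero rows contribute no term
            groups[bits].append(r)
    return dict(groups)

low = common_term_vectors(H_ROWS)    # patterns labelled d0..d6 in Table IX
high = common_term_vectors(G_ROWS)   # patterns labelled d7..d17 in Table IX

print(len(low), "low-pass vectors,", len(high), "high-pass vectors")
for bits, rows in low.items():
    print(bits, "appears at r =", rows)
```

Under this transcription the low-pass filter yields 7 distinct vectors and the high-pass filter 11, which matches the 18 vectors d0 to d17 listed in Table IX.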
On the other hand, it is not possible to exploit the symmetry
of the filters in the implementation of the architecture for
synthesis filters, as $\dot{k}$ and $\dot{l}$ are even values. Indeed, since
$\dot{h}g_{13,10}$ and $\dot{g}h_{13,18}$ are non-symmetric matrices, $\dot{C}$
cannot be defined. As a consequence, there is no simplification
in (6) and (8), and these equations are implemented as
summarized in Table X. Unfortunately, the content of $\dot{h}g_{13,10}$ and
$\dot{g}h_{13,18}$ shows only partial common-term vectors; thus the DA
approach is more effective for the analysis filters than for the
synthesis ones.
D. Result-biased DA-based architecture for the 10/18 analysis
wavelet filters
1) Implementation of the result-biased butterfly circuit: In
order to trim the result-biasing for the butterfly circuit, we
Table XI: Values of $d_j$ vectors and corresponding $I_j$ sets for
the 10/18 wavelet filters.
d_j   I_j                                   min_r{I_j}
d0    I0  = {u(0), u(-2), u(-3), u(-7)}     0
d1    I1  = {u(-1)}                         1
d2    I2  = {u(-5), u(-11)}                 5
d3    I3  = {u(-6)}                         6
d4    I4  = {u(-10)}                        10
d5    I5  = {u(-8)}                         8
d6    I6  = {u(-9)}                         9
d7    I7  = {v(0), v(-1)}                   0
d8    I8  = {v(-2), v(-3)}                  2
d9    I9  = {v(-4)}                         4
d10   I10 = {v(-5)}                         5
d11   I11 = {v(-6)}                         6
d12   I12 = {v(-7)}                         7
d13   I13 = {v(-8)}                         8
d14   I14 = {v(-9)}                         9
d15   I15 = {v(-10)}                        10
d16   I16 = {v(-11)}                        11
d17   I17 = {v(-12)}                        12
built $I_j$, the sets defined by each of the $d_j \cdot C$ products, as
shown in Table XI. Then, we set $\hat{q}_j = \min_r\{I_j\} - 1$ (as for
the 9/7 filters). In this case, the proposed result-biased butterfly
circuit saves 113 FAs. Since m = 16, the standard butterfly
circuit requires 448 FAs, whereas the proposed one requires
448 - 113 = 335 FAs.
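The rule for choosing the truncation point of each product can be illustrated as follows. This is a sketch under assumptions: each set $I_j$ is represented by the list of row indices r it collects (the magnitudes of the arguments of u and v in Table XI), and the dictionary and variable names are illustrative only.

```python
# Row indices r collected in each set I_j (Table XI): j -> list of r
I_SETS = {
    0: [0, 2, 3, 7], 1: [1], 2: [5, 11], 3: [6], 4: [10], 5: [8], 6: [9],
    7: [0, 1], 8: [2, 3], 9: [4], 10: [5], 11: [6], 12: [7],
    13: [8], 14: [9], 15: [10], 16: [11], 17: [12],
}

# q_hat_j = min_r{I_j} - 1, as stated in the text
q_hat = {j: min(rows) - 1 for j, rows in I_SETS.items()}

# FA budget of the butterfly circuit quoted in the text (m = 16)
FA_STANDARD = 448
FA_SAVED = 113
fa_result_biased = FA_STANDARD - FA_SAVED

for j in sorted(q_hat):
    print(f"d{j}: min r = {min(I_SETS[j])}, q_hat = {q_hat[j]}")
print("result-biased butterfly FAs:", fa_result_biased)
```

The script only restates the selection rule and the FA arithmetic; how each $\hat{q}_j$ translates into removed full adders depends on the circuit details given earlier in the paper.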
[Figure 6 labels: input samples x[i-8], ..., x[i+9]; control signal Hi_Lo_n; carry inputs Cin; intermediate signals C0, ..., C8; outputs d0 · C, ..., d17 · C]
Figure 6: Butterfly circuit for the 10/18 filters.
2) Result-biased tree adder implementation: The architecture
of the result-biased tree adder is nearly the same as the one
employed for the 9/7 filters and described in Section IV-B2.
The only differences with respect to the circuit shown in Fig.
4(b) are: i) the hard-wired shift network, where the inputs
to the multiplexers are the ones summarized in Table IX;
ii) the multiplication by -1 at the output. Since we observed
similar carry signal probabilities for both the 9/7 and 10/18
filters, the same result-biasing strategy has been employed
for the implementation of the tree adder. In particular, the
approximation in (22), with the parameters shown in Table VI,
has been exploited. As a consequence, the number of saved
FAs is the same as that obtained for the 9/7 filters, namely 57
FAs. Experimental results, achieved by implementing result-
biasing in the tree adder, are shown in Table XII. As can
be observed, the performance loss in terms of PSNR
($\Delta\mathrm{PSNR}_{BB+TB1/2/3} = \mathrm{PSNR}_{DA} - \mathrm{PSNR}_{BB+TB1/2/3}$) of
the proposed result-biased variant, with respect to the original
DA-based architecture, is limited to a few fractions of a dB,
where $\mathrm{PSNR}_{DA}$ and $\mathrm{PSNR}_{BB+TB1/2/3}$ are the PSNRs of the
original and proposed solutions, respectively.
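The PSNR loss used as the quality metric can be computed as sketched below. The sketch assumes the usual PSNR definition for 8-bit images (peak value 255), which is not restated in this section; the function names are illustrative, and the numeric example takes the Lena, L = 1, 1:1 entry of Table XII (PSNR of 49.35 dB with a loss of 0.15 dB, i.e. 49.20 dB for the result-biased variant).

```python
import math

def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized sequences of pixel values.

    Assumes the standard definition 10*log10(peak^2 / MSE).
    """
    if len(reference) != len(test):
        raise ValueError("images must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

def delta_psnr(psnr_da, psnr_result_biased):
    """PSNR loss of the result-biased variant with respect to the DA-based one."""
    return psnr_da - psnr_result_biased

# Lena, L = 1, compression ratio 1:1 (values taken from Table XII)
print(f"loss = {delta_psnr(49.35, 49.20):.2f} dB")
```

A positive value indicates a quality loss; the small negative entries in Table XII correspond to cases where the result-biased variant happens to score marginally higher.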
V. HARDWARE IMPLEMENTATION AND COMPARISON
The proposed result-biased architectures for the computation
of the 9/7 and 10/18 wavelet filters have been implemented
using a 90 nm standard-cell technology library for
a 200 MHz target clock frequency, leading to areas of 7621
μm² and 12602 μm², which correspond to about 2.7 and 4.5
equivalent kgates for the 9/7 and 10/18 filters, respectively.
Moreover, with the same technology the proposed architectures
can achieve maximum clock frequencies of 450 MHz and
360 MHz with areas of 8087 μm² (2.85 eq. kgates) and 13845
μm² (4.91 eq. kgates) for the 9/7 and 10/18 filters, respectively.
These results are shown in Table XIII, where the proposed
architectures are compared with other solutions available in
the literature in terms of PSNR, number of FAs, number of
Flip-Flops (FFs), clock frequency ($f_{clk}$) and area.
The proposed architecture for the 9/7 filters offers a significant
complexity reduction with respect to previously published
DA-based implementations for DWT computation [19], [20],
with a very small PSNR loss (see Tables V and VII). When
compared with multiplierless solutions specifically
optimized for the 9/7 wavelet filters, the proposed
architecture shows a PSNR loss of a few fractions of a dB, like the
variants described in [6], [7], [13], [30]. To enable a complexity
comparison between the proposed architecture and the other works,
in particular the ones described in [13] and [9] for
the 9/7 and 10/18 filters, we introduce the normalized area
$A_n = \mathrm{Area} \cdot (90/\mathrm{Tech})^2$ (last column of Table XIII), where Tech
is the technology process (in nm) used for the implementation, namely
90 nm for the architectures proposed in this work, 45 nm for
[13] and 130 nm for [9].
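The area normalization can be checked against the entries of Table XIII with a few lines of Python. This is a sketch: the function name and its arguments are illustrative, while the formula and the numeric values are the ones given in the text and in the table.

```python
def normalized_area(area_um2, tech_nm, ref_nm=90.0):
    """Scale an area (um^2) from a tech_nm process to the ref_nm process."""
    return area_um2 * (ref_nm / tech_nm) ** 2

# (area in um^2, technology in nm) as listed in Table XIII
entries = {
    "[13], 9/7": (2135, 45),
    "[9], 10/18": (67612, 130),
    "Prop., 9/7 at 200 MHz": (7621, 90),
}

for name, (area, tech) in entries.items():
    print(f"{name}: A_n = {normalized_area(area, tech):.0f} um^2")
```

The computed values for [13] and [9] round to the 8540 and 32406 reported in the last column of Table XIII, and the 90 nm designs are left unchanged by the scaling.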
From the data in Table XIII we observe that the proposed
architecture for the 9/7 filters features almost the same
complexity as the lowest-complexity implementations, i.e. [7],
[30]. Comparison with [13] in terms of circuit speed is not
straightforward, as it relies on a more scaled technology than
the one employed in this work. As a consequence, [13] features
a higher maximum clock frequency than our implementation.
Besides, the architecture in [13] requires more FFs but fewer FAs
than the result-biased DA-based variant, leading to a slightly
higher normalized area than the proposed solution. It is worth
noting that all the considered architectures for the 9/7 filters
have a throughput of one sample per clock cycle. Furthermore,
Table XIII shows that the area of DA-based architectures
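Table XII: PSNR comparison between the DA-based DWT (column $\mathrm{PSNR}_{DA}$) and the proposed DA-based DWT with result-
biasing applied to the butterfly circuit and to the tree adder (first, second and third level) (column $\Delta\mathrm{PSNR}_{BB+TB1/2/3}$ =
$\mathrm{PSNR}_{DA} - \mathrm{PSNR}_{BB+TB1/2/3}$) for the 10/18 filters. Each row lists L, then five $\mathrm{PSNR}_{DA}$ values, then five $\Delta\mathrm{PSNR}_{BB+TB1/2/3}$ values.
Image L PSNR_DA [dB] at 1:1 8:1 16:1 32:1 64:1, then ΔPSNR_BB+TB1/2/3 [dB] at 1:1 8:1 16:1 32:1 64:1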
Lena
1 49.35 39.46 34.77 29.89 22.97 0.15 0.00 -0.04 0.01 0.03
2 49.00 40.49 37.18 33.64 30.07 0.40 0.07 0.01 0.06 0.00
3 48.92 40.66 37.52 34.34 31.21 0.83 0.14 0.07 0.03 -0.01
4 48.87 40.67 37.54 34.44 31.34 1.34 0.23 0.12 0.07 0.01
Barbara
1 49.43 36.92 30.08 24.39 20.60 0.11 -0.01 -0.02 0.00 0.00
2 49.14 38.50 33.06 28.44 24.47 0.41 0.02 0.02 0.00 0.01
3 49.07 38.74 33.55 29.28 25.70 0.84 0.08 0.03 0.04 0.00
4 49.02 38.72 33.61 29.34 26.19 1.35 0.11 0.04 0.02 0.05
Boat
1 49.37 38.23 33.13 28.01 21.61 0.11 0.03 -0.03 0.00 -0.15
2 49.04 39.40 34.52 30.64 27.40 0.41 0.02 0.01 0.19 0.09
3 48.97 39.57 34.90 31.16 28.14 0.82 0.10 0.00 0.02 0.01
4 48.92 39.59 34.95 31.22 28.17 1.35 0.22 0.04 0.06 0.02
Goldhill
1 49.75 36.06 31.93 27.70 22.90 0.17 0.00 -0.02 0.02 0.00
2 49.51 36.68 33.17 30.32 27.93 0.50 0.02 0.06 -0.03 0.29
3 49.46 36.75 33.34 30.62 28.54 0.97 0.05 0.07 0.01 0.01
4 49.42 36.78 33.35 30.62 28.57 1.53 0.11 0.03 -0.04 0.02
Fingerprint
1 49.57 36.24 31.92 26.91 15.86 0.17 0.00 0.00 0.01 -0.20
2 49.39 36.60 32.74 29.58 26.21 0.47 0.02 0.02 0.01 0.00
3 49.36 36.67 32.80 29.73 27.15 0.94 0.05 0.04 0.01 0.02
4 49.33 36.68 32.82 29.80 27.19 1.51 0.08 0.03 -0.01 0.01
Table XIII: Architecture comparison in terms of performance (PSNR), complexity (FAs, FFs, Eq. kgate, Area, An),
technology (Tech) and speed (fclk).
Filter  DA  Arch.   PSNR                          FAs      FFs  Eq. kgate  Tech [nm]  fclk [MHz]  Area [μm²]   An [μm²]
9/7     N   [6]     Table VII, ΔPSNR_BB+TB1/2/3   512      144  -          130        200         -            -
9/7     N   [7]     Table VII, ΔPSNR_BB+TB1/2/3   336      144  2.81       130        200         -            -
9/7     N   [13]    Table VII, ΔPSNR_BB+TB1/2/3   192      213  -          45         500         2135         8540
9/7     N   [30]    Table VII, ΔPSNR_BB+TB1/2/3   304      144  2.69       130        200         -            -
9/7     Y   [19]    0 dB                          688      144  5.39       130        200         -            -
9/7     Y   [20]    0 dB                          432      144  4.17       130        200         -            -
9/7     Y   Prop.   Table VII, ΔPSNR_BB+TB1/2/3   320      144  2.71/2.85  90         200/450     7621/8087    7621/8087
10/18   N   [8]     0 dB                          640 (a)  432  23.16      250        78          -            -
10/18   N   [9]     0 dB                          832 (a)  464  11.27      130        200         67612        32406
10/18   Y   - (b)   0 dB                          640      144  5.62/6.22  90         200/365     15868/17559  15868/17559
10/18   Y   Prop.   Table XII, ΔPSNR_BB+TB1/2/3   470      144  4.47/4.91  90         200/366     12602/13845  12602/13845
(a) The architecture also contains multipliers.
(b) Since no reference is available in the literature, it has been implemented.
for the 10/18 filters ranges from 39% (result-biased DA-based
architecture) to 49% (DA-based architecture) of the area of other
optimized variants based on B-spline factorization, such as
[8], [9]. Even though the architectures in [8], [9] have a throughput
of two samples per clock cycle, the low area required by
DA-based implementations makes them superior in terms of
throughput-to-area ratio. Finally, the proposed result-biasing
technique reduces the complexity of the architecture for the
10/18 DWT computation as well as for the 9/7 one. These
figures of merit highlight the effectiveness of the proposed
result-biasing technique as a general method to reduce the
complexity of DA-based architectures for the approximate
computation of the DWT.
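The area percentages and the throughput-to-area comparison can be reproduced from the normalized areas of Table XIII. The sketch below assumes that all three designs are compared at the 200 MHz operating point, so that throughput per clock cycle is a fair proxy for throughput; [8] is omitted because no area is reported for it. All names are illustrative.

```python
# (normalized area A_n in um^2 at 200 MHz, samples per clock cycle)
DESIGNS = {
    "B-spline [9]": (32406, 2),
    "DA-based": (15868, 1),
    "result-biased DA-based": (12602, 1),
}

REF_AREA = DESIGNS["B-spline [9]"][0]

def area_percent(name):
    """Area of a design as a percentage of the B-spline reference [9]."""
    return 100.0 * DESIGNS[name][0] / REF_AREA

def throughput_per_area(name):
    """Samples per clock cycle per um^2 of normalized area."""
    area, samples_per_cycle = DESIGNS[name]
    return samples_per_cycle / area

for name in DESIGNS:
    print(f"{name}: {area_percent(name):.0f}% of [9], "
          f"{throughput_per_area(name):.2e} samples/cycle/um^2")
```

Under these assumptions both DA-based designs achieve a higher throughput-to-area ratio than [9], despite processing one sample per cycle instead of two.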
VI. CONCLUSIONS
In this work, a result-biased DA-based filter architecture for
the approximate computation of the DWT has been presented.
The proposed idea has been applied to the well-known 9/7 and
10/18 wavelet filters to reduce the complexity of
DA-based architectures for the DWT computation, with a very
small loss in terms of PSNR. Experimental results show that i)
the proposed technique is effective in reducing the complexity
of DA-based architectures for the DWT computation; ii) the
performance and complexity of the variant derived for the 9/7
filters are comparable with those of previously proposed
architectures, which are specifically optimized for the 9/7
wavelet filters; iii) the performance and complexity of the
proposed architecture for the 10/18 wavelet filters are better
than those of previously published works.
REFERENCES
[1] M. Boliek, “JPEG 2000 Final Committee Draft,” 2000.
[2] A. Bilgin and M. W. Marcellin, “JPEG2000 for digital cinema,” in IEEE
International Conference on Circuits and Systems, 2006.
[3] B. K. Mohanty and P. K. Meher, “Memory-efficient high-speed
convolution-based generic structure for multilevel 2-D DWT,” IEEE
Tran. on Circuits and Systems for Video Technology, vol. 23, no. 2,
pp. 353–363, Feb 2013.
[4] Y. Hu and C. C. Jong, “A memory-efficient high-throughput architecture
for lifting-based multi-level 2-D DWT,” IEEE Tran. on Signal Process-
ing, vol. 61, no. 20, pp. 4975–4987, Oct 2013.
[5] G. Strang and T. Q. Nguyen, Wavelets and Filter Banks. Wellesley-
Cambridge, MA: Wellesley, 1996.
[6] K. A. Kotteri, A. E. Bell, and J. E. Carletta, “Design of multiplierless,
high-performance, wavelet filter banks with image compression applications,”
IEEE Tran. on Circuits and Systems-I, vol. 51, no. 3, pp. 483–494,
Mar. 2004.
[7] M. Martina and G. Masera, “Low-complexity, efficient 9/7 wavelet filters
VLSI implementation,” IEEE Tran. on Circuits and Systems-II, vol. 53,
no. 11, pp. 1289–1293, Nov 2006.
[8] C. T. Huang, P. C. Tseng, and L. G. Chen, “VLSI architecture for
forward discrete wavelet transform based on B-spline factorization,”
Journal of VLSI Signal Processing, vol. 40, no. 3, pp. 343–353, Jul.
2005.
[9] M. Martina, G. Masera, and G. Piccinini, “Scalable low-complexity B-
spline discrete wavelet transform architecture,” IET Circuits, Devices
and Systems, vol. 4, no. 2, pp. 159–167, Feb 2010.
[10] M. A. Islam and K. A. Wahid, “Area- and power-efficient design of
Daubechies wavelet transforms using folded AIQ mapping,” IEEE Tran.
on Circuits and Systems-II, vol. 57, no. 9, pp. 716–720, Sep 2010.
[11] S. K. Madishetty, A. Madanayake, R. J. Cintra, and V. S. Dimitrov,
“Precise VLSI architecture for AI based 1-D/ 2-D Daub-6 wavelet filter
banks with low adder-count,” IEEE Tran. on Circuits and Systems-I, to
appear.
[12] S. Murugesan and D. B. H. Tay, “New techniques for rationalizing
orthogonal and biorthogonal wavelet filter coefficients,” IEEE Tran. on
Circuits and Systems-I, vol. 59, no. 3, pp. 628–637, Mar 2012.
[13] A. Pande and J. Zambreno, “Poly-DWT: Polymorphic wavelet hardware
support for dynamic image compression,” ACM Tran. on Embedded
Computing Systems, vol. 11, no. 1, pp. 1–26, Mar 2012.
[14] A. K. Naik and R. S. Holambe, “Design of low-complexity high-
performance wavelet filters for image analysis,” IEEE Tran. on Image
Processing, vol. 22, no. 5, pp. 1848–1858, May 2013.
[15] S. Y. Park and P. K. Meher, “Low-power, high-throughput, and low-
area adaptive FIR filter based on distributed arithmetic,” IEEE Tran. on
Circuits and Systems-II, vol. 60, no. 6, pp. 346–350, Jun 2013.
[16] M. S. Prakash and R. A. Shaik, “Low-area and high-throughput archi-
tecture for an adaptive filter using distributed arithmetic,” IEEE Tran.
on Circuits and Systems-II, vol. 60, no. 11, pp. 781–785, Nov 2013.
[17] J. Xie, P. K. Meher, and J. He, “Hardware-efficient realization of prime-
length DCT based on distributed arithmetic,” IEEE Tran. on Computers,
vol. 62, no. 6, pp. 1170–1178, Jun 2013.
[18] Y. H. Chen, J. N. Chen, T. Y. Chang, and C. W. Lu, “High-throughput
multistandard transform core supporting MPEG/H.264/VC-1 using com-
mon sharing distributed arithmetic,” IEEE Tran. on VLSI Systems,
vol. 22, no. 3, pp. 463–474, Mar 2014.
[19] M. Alam, C. Rahman, W. Badawy, and G. Jullien, “Efficient distributed
arithmetic based DWT architecture for multimedia applications,” in
IEEE International Workshop on System-on-Chip for Real-Time Applications,
Calgary, 30 June - 2 July, 2003, pp. 333–336.
[20] X. Cao, Q. Xie, C. Peng, Q. Wang, and D. Yu, “An efficient VLSI
implementation of distributed architecture for DWT,” in IEEE Workshop
on Multimedia Signal Processing, 2006, pp. 364–367.
[21] S. S. Kidambi, F. El-Guibaly, and A. Antoniou, “Area-efficient multi-
pliers for digital signal processing applications,” IEEE Tran. on Circuits
and Systems-II, vol. 43, no. 2, pp. 90–95, Feb. 1996.
[22] K. J. Cho, K. C. Lee, J. G. Chung, and K. K. Parhi, “Design of low-error
fixed-width modified Booth multiplier,” IEEE Tran. on VLSI Systems,
vol. 12, no. 5, pp. 522–531, May 2004.
[23] N. Petra, D. De Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo,
“Design of fixed-width multipliers with linear compensation function,”
IEEE Tran. on Circuits and Systems-I, vol. 58, no. 5, pp. 947–960, May
2011.
[24] D. De Caro, N. Petra, A. G. M. Strollo, F. Tessitore, and E. Napoli,
“Fixed-width multipliers and multipliers-accumulators with min-max
approximation error,” IEEE Tran. on Circuits and Systems-I, vol. 60,
no. 9, pp. 2375–2388, Sep 2013.
[25] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, “Image coding
using the wavelet transform,” IEEE Tran. on Image Processing, vol. 1,
no. 2, pp. 205–220, Apr. 1992.
[26] OpenJPEG, http://www.openjpeg.org.
[27] ISO/IEC 15444-1:2004/FDAM 1, “Information technology - JPEG 2000
image coding system: Core coding system, amendment 1: profiles for
digital cinema applications,” ISO/IEC, Tech. Rep., 2004.
[28] M. Martina, “Low Complexity 9/7 Wavelet:
Modified OpenJPEG model,” downloadable at
http://personal.delen.polito.it/maurizio.martina/wavelet.html.
[29] A. G. Dempster and M. D. Macleod, “Use of minimum-adder multiplier
blocks in FIR digital filters,” IEEE Tran. on Circuits and Systems-II,
vol. 42, no. 9, pp. 569–577, Sep. 1995.
[30] M. Martina and G. Masera, “Multiplierless, folded 9/7-5/3 wavelet VLSI
architecture,” IEEE Tran. on Circuits and Systems-II, vol. 54, no. 9, pp.
770–774, Sep 2007.
Maurizio Martina (S’98-M’94-SM’15) was born
in Pinerolo, Italy, in 1975. He received the M.Sc.
and Ph.D. in electrical engineering from Politecnico
di Torino, Italy, in 2000 and 2004, respectively. He
is currently an Associate Professor of the VLSI-
Lab group, Politecnico di Torino. His research ac-
tivities include VLSI design and implementation
of architectures for digital signal processing and
communications.
Guido Masera (SM’07) received the Dr. Ing. De-
gree (summa cum laude) in 1986 and the Ph.D.
degree in electronic engineering from the Politecnico
di Torino, Torino, Italy, in 1992. Since 1992, he
has been an Assistant Professor and then Associate
Professor with the Electronic Department, where he
is member of the VLSI-Lab group. His research in-
terests include several aspects in the design of digital
integrated circuits and systems, with special empha-
sis on high-performance architecture development
and on-chip interconnect modeling and optimization.
He is an associate editor of IEEE Transactions on Circuits and Systems II.
Massimo Ruo Roch joined the Department of Elec-
tronics, Politecnico di Torino, Turin, Italy, in 1998,
where he has been a Full-Time Researcher since
1995. His research interests include digital design
of application specific computing architectures, high
speed telecommunications, and digital television.
Recent activities include design and modeling of
MPSoCs, embedded systems for bioapplications,
and cloud-based systems for e-learning.
Gianluca Piccinini received the Dr. Ing. and the
Ph.D. degrees in electronics engineering, in 1986
and 1990, respectively. He is a Full Professor from
2006 at the Department of Electronics, Politecnico
di Torino, Torino, Italy. His current research interest
includes the use of nanotechnologies in integrated
systems, and he is working on molecular transport
for beyond CMOS structures and on molecules in-
teraction in molecular QCA.
