High Speed VLSI Architecture for 3-D Discrete Wavelet Transform by Srinivasarao, Batta Kota Naga & Chakrabarti, Indrajit
High Speed VLSI Architecture for 3-D Discrete
Wavelet Transform
B.K.N.Srinivasarao and Indrajit Chakrabarti
Department of Electronics and Electrical Communication Engineering
Indian Institute of Technology, Kharagpur, INDIA
E-mail: srinu.bkn@iitkgp.ac.in, indrajit@ece.iitkgp.ernet.in
arXiv:1509.04268v1 [cs.AR] 14 Sep 2015
Abstract
This paper presents a memory-efficient, high-throughput, parallel, lifting-based running three-dimensional
discrete wavelet transform (3-D DWT) architecture. The 3-D DWT is constructed by combining two spatial
processors and four temporal processors. Each spatial processor (SP) applies the two-dimensional DWT to a
frame using the lifting-based 9/7 filter bank, first in the row direction through the row processor (RP) and
then in the column direction through the column processor (CP). To reduce the temporal memory and the
latency, the temporal processor (TP) is designed with a lifting-based 1-D Haar wavelet filter. The proposed
architecture replaces multiplications with pipelined shift-add operations to reduce the critical path delay
(CPD). The two spatial processors work simultaneously on two adjacent frames and provide 2-D DWT
coefficients as inputs to the temporal processors. The TPs apply the one-dimensional DWT in the temporal
direction and provide eight 3-D DWT coefficients per clock cycle (throughput). Higher throughput reduces the
computing cycles per frame and enables lower power consumption. Implementation results show that the
proposed architecture has advantages in memory, power consumption, latency, and throughput over existing
designs. The RTL of the proposed architecture is described in Verilog and synthesized using a 90-nm CMOS
standard cell library; the results show that it consumes 43.42 mW of power and occupies an area of
231.45 K equivalent gates at a frequency of 200 MHz. The proposed architecture has also been synthesized
for the Xilinx Zynq 7020 series field programmable gate array (FPGA).
Index Terms
discrete wavelet transform, 3-D DWT, lifting based DWT, VLSI architecture, flipping structure,
strip-based scanning.
I. INTRODUCTION
Video compression is a major requirement in many recent applications such as medical imaging,
studio applications, and broadcasting. The compression ratio of an encoder depends on the
underlying compression algorithms. The goal of compression techniques is to reduce the immense
amount of visual information to a manageable size so that it can be efficiently stored, transmitted, and
displayed. A 3-D DWT based compression system enables compression in the spatial as well as the temporal
direction, which makes it well suited to video compression. Moreover, wavelet-based compression provides
scalability through the number of decomposition levels. With the continuous increase in video frame size
(HD to UHD), video processing through software coding tools becomes more complex, and only dedicated
hardware can deliver the performance needed for high-resolution video processing. There is therefore a
strong requirement for a VLSI architecture for an efficient 3-D DWT processor that consumes little power,
is area- and memory-efficient, and operates at a frequency high enough for real-time applications.
Over the last two decades, several hardware designs have been reported for 2-D DWT and 3-D DWT in
different applications. The majority of these designs fall into three categories, viz. (i) convolution-based,
(ii) lifting-based and (iii) B-spline-based. Most existing architectures suffer from large memory
requirements, low throughput, and complex control circuits. In general, circuit complexity is determined by
two major components, viz. the arithmetic component and the memory component. The arithmetic component
includes adders and multipliers, whereas the memory component consists of temporal memory and transpose
memory. The complexity of the arithmetic component depends on the DWT filter length. In contrast, the size
of the memory component depends on the dimensions of the image. As image resolutions continue to increase
(HD to UHD), image dimensions are very large compared to the DWT filter length; as a result, the memory
component accounts for the major share of the overall complexity of a DWT architecture.
Convolution-based implementations [1]-[3] provide outputs in less time but require a large amount of
arithmetic resources, are memory intensive, and occupy a larger area. Lifting-based implementations
require less memory, have lower arithmetic complexity, and can be implemented in parallel. However, they
have a long critical path, and many recent contributions aim to reduce the critical path in lifting-based
implementations. A general lifting-based structure [4] has a critical path of $4T_m + 8T_a$, where $T_m$
and $T_a$ denote the multiplier and adder delays; introducing a 4-stage pipeline cuts it down to
$T_m + 2T_a$. In [5], Huang et al. introduced a flipping structure that further reduces the critical path
to $T_m + T_a$. Although this reduces the critical path delay of lifting-based implementations, the memory
efficiency still needs improvement. The majority of designs implement the 2-D DWT by first applying the
1-D DWT row-wise and then column-wise, which requires a large amount of memory to store the intermediate
coefficients. To reduce this memory requirement, several DWT architectures have been proposed using
line-based scanning methods [7]-[11]. Huang et al. [7]-[8] give brief details of a B-spline-based 2-D IDWT
implementation, discuss the memory requirements of different scan techniques, and propose an efficient
overlapped strip-based scanning to reduce the internal memory size. Several parallel architectures have been
proposed for lifting-based 2-D DWT [8]-[17]. The modified strip-based scanning and parallel architecture
proposed by Y. Hu et al. [17] is the most memory-efficient design among existing 2-D DWT architectures; it
requires only $3N + 24P$ of on-chip memory for an $N \times N$ image with $P$ parallel processing units
(PUs). Several lifting-based 3-D DWT architectures have been reported [18]-[24] that reduce the critical
path of the 1-D DWT architecture and decrease the memory requirement of the 3-D architecture. Among
existing 3-D DWT designs, Darji et al. [24] produced the best results by reducing the memory requirement
while giving a throughput of 4 results/cycle. Still, it requires a large on-chip memory ($4N^2 + 10N$).
In this paper, we propose a new parallel and memory-efficient lifting-based 3-D DWT architecture that
requires only $2(3N + 60P) + 48$ words of on-chip memory and produces 8 results/cycle. The proposed 3-D
DWT architecture is built with two spatial 2-D DWT (CDF 9/7) processors and four temporal 1-D DWT
(Haar) processors. The proposed architecture replaces multiplication operations with shift-and-add
operations, which changes the CPD from $T_m + T_a$ to $4T_a$. The CPD is further reduced to $T_a$ by
pipelining the processing elements. To eliminate the temporal memory and reduce the latency, the Haar
wavelet is used in the temporal processor. The resulting architecture reduces the latency and the on-chip
memory and increases the speed of operation compared to existing 3-D DWT designs. The following
sections provide the architectural details of the proposed 3-D DWT through its spatial and temporal processors.
The paper is organized as follows. The theoretical background for DWT is given in Section II. A detailed
description of the proposed architecture for 3-D DWT is provided in Section III. Implementation results
and performance comparison are given in Section IV. Finally, concluding remarks are given in Section V.
II. THEORETICAL BACKGROUND
The lifting-based wavelet transform is constructed from a series of matrix decompositions, as specified by
Daubechies and Sweldens in [4]. By applying flipping [5] to the lifting scheme, the multipliers in the
longest delay path are eliminated, resulting in a shorter critical path. The original data on which the DWT
is applied is denoted by $X[n]$, and the 1-D DWT outputs are the detail coefficients $H[n]$ and the
approximation coefficients $L[n]$. For an image (2-D), this process is performed along both rows and
columns. Eqns. (1)-(6) are the design equations for the flipping-based lifting (9/7) 1-D DWT [6], and the
same equations are used to implement the proposed row processor (1-D DWT) and column processor (1-D DWT).
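$$H_1[n] \leftarrow a' \cdot X[2n-1] + \{X[2n] + X[2n-2]\} \quad \ldots\; P_1 \tag{1}$$
$$L_1[n] \leftarrow b' \cdot X[2n] + \{H_1[n] + H_1[n-1]\} \quad \ldots\; U_1 \tag{2}$$
$$H_2[n] \leftarrow c' \cdot H_1[n] + \{L_1[n] + L_1[n-1]\} \quad \ldots\; P_2 \tag{3}$$
$$L_2[n] \leftarrow d' \cdot L_1[n] + \{H_2[n] + H_2[n-1]\} \quad \ldots\; U_2 \tag{4}$$
$$H[n] \leftarrow K_0 \cdot \{H_2[n]\} \tag{5}$$
$$L[n] \leftarrow K_1 \cdot \{L_2[n]\} \tag{6}$$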
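where $a' = 1/\alpha$, $b' = 1/(\alpha\beta)$, $c' = 1/(\beta\gamma)$, $d' = 1/(\gamma\delta)$,
$K_0 = \alpha\beta\gamma/\zeta$, and $K_1 = \alpha\beta\gamma\delta\zeta$ [4]. The lifting step coefficients
$\alpha$, $\beta$, $\gamma$, $\delta$ and the scaling coefficient $\zeta$ are constants with the values
$\alpha = -1.586134342$, $\beta = -0.052980118$, $\gamma = 0.8829110762$, $\delta = 0.4435068522$, and
$\zeta = 1.149604398$.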
Lifting-based wavelets are memory efficient and easy to implement in hardware. The lifting
scheme consists of three steps to decompose the samples, namely splitting, predicting (eqns. (1) and (3)),
and updating (eqns. (2) and (4)).
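The relation between conventional lifting and the flipped form of eqns. (1)-(6) can be checked numerically. The sketch below is only an illustration under stated assumptions: the function names `lifting_97` and `flipped_97` are hypothetical, the input has even length, the boundary is handled by periodic extension, and each odd sample is paired with its two even neighbours in the usual way, which may differ from the hardware index convention used in eqns. (1)-(4). With $a'$, $b'$, $c'$, $d'$, $K_0$ and $K_1$ defined as in the text, both forms should give the same coefficients up to rounding.

```python
# Lifting constants as listed in the text.
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398


def lifting_97(x):
    """Conventional 9/7 lifting: each step multiplies the neighbour sum."""
    even, odd = x[0::2], x[1::2]
    m = len(even)
    # Assumption: periodic extension at the borders (index wrap-around).
    h1 = [odd[n] + ALPHA * (even[n] + even[(n + 1) % m]) for n in range(m)]
    l1 = [even[n] + BETA * (h1[n] + h1[n - 1]) for n in range(m)]
    h2 = [h1[n] + GAMMA * (l1[n] + l1[(n + 1) % m]) for n in range(m)]
    l2 = [l1[n] + DELTA * (h2[n] + h2[n - 1]) for n in range(m)]
    return [ZETA * v for v in l2], [v / ZETA for v in h2]


def flipped_97(x):
    """Flipped form: the multiplier moves to the centre sample (a', b', c', d')."""
    a = 1.0 / ALPHA
    b = 1.0 / (ALPHA * BETA)
    c = 1.0 / (BETA * GAMMA)
    d = 1.0 / (GAMMA * DELTA)
    k0 = ALPHA * BETA * GAMMA / ZETA
    k1 = ALPHA * BETA * GAMMA * DELTA * ZETA
    even, odd = x[0::2], x[1::2]
    m = len(even)
    h1 = [a * odd[n] + (even[n] + even[(n + 1) % m]) for n in range(m)]
    l1 = [b * even[n] + (h1[n] + h1[n - 1]) for n in range(m)]
    h2 = [c * h1[n] + (l1[n] + l1[(n + 1) % m]) for n in range(m)]
    l2 = [d * l1[n] + (h2[n] + h2[n - 1]) for n in range(m)]
    return [k1 * v for v in l2], [k0 * v for v in h2]
```

In the flipped form the multiplication and the neighbour addition no longer sit in series, which is why the critical path of each lifting step shortens.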
The Haar wavelet transform is orthogonal, simple to construct, and fast to compute. Given these
advantages, the proposed architecture uses the Haar wavelet to perform the 1-D DWT in the temporal
direction (between two adjacent frames). Sweldens [25] developed a lifting-based Haar wavelet. The lifting
scheme for the Haar wavelet transform is given in eqn. (7):

$$\begin{bmatrix} L \\ H \end{bmatrix} =
\begin{bmatrix} \sqrt{2} & 0 \\ 0 & \frac{1}{\sqrt{2}} \end{bmatrix}
\begin{bmatrix} 1 & S(z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ -P(z) & 1 \end{bmatrix}
\begin{bmatrix} X_0(z) \\ X_1(z) \end{bmatrix} \tag{7}$$

$$L = \frac{1}{\sqrt{2}}(X_0 + X_1), \qquad H = \frac{1}{\sqrt{2}}(X_1 - X_0) \tag{8}$$

Eqn. (8) is obtained by substituting the predict step $P(z) = 1$ and the update step $S(z) = 1/2$ in
eqn. (7); it is used to develop the temporal processor, which applies the 1-D DWT in the temporal
direction (3rd dimension).
In eqn. (8), L and H are the low- and high-frequency coefficients, respectively.

Figure 1. Block diagram for 3-D DWT (two 2-D DWT spatial processors, for frames n and n+1, feed the
sub-bands LL, LH, HL and HH to four 1-D Haar wavelet temporal processors, which output the sub-bands
LLL, LLH, LHL, LHH, HLL, HLH, HHL and HHH)
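The reduction of eqn. (7) to eqn. (8) can be verified with a few lines of arithmetic. The sketch below applies the predict, update and scaling steps in order with $P(z) = 1$ and $S(z) = 1/2$ and compares the result with the closed form; the function names are hypothetical and floating-point arithmetic is assumed, unlike the fixed-point hardware.

```python
import math


def haar_lifting(x0, x1):
    """Eqn. (7) with P(z) = 1 and S(z) = 1/2, applied step by step."""
    d = x1 - x0                    # predict: [1 0; -P 1]
    s = x0 + 0.5 * d               # update:  [1 S; 0 1]
    low = math.sqrt(2.0) * s       # scaling by sqrt(2)
    high = d / math.sqrt(2.0)      # scaling by 1/sqrt(2)
    return low, high


def haar_direct(x0, x1):
    """Closed form of eqn. (8)."""
    r = 1.0 / math.sqrt(2.0)
    return r * (x0 + x1), r * (x1 - x0)
```

For a temporal transform, `x0` and `x1` would be the coefficients at the same position in two adjacent frames.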
III. PROPOSED ARCHITECTURE FOR 3-D DWT
The proposed architecture for 3-D DWT, comprising two parallel spatial processors (2-D DWT) and
four temporal processors (1-D DWT), is depicted in Fig. 1. The two spatial processors (SPs) apply the
2-D DWT to two consecutive frames; each SP produces four sub-bands, viz. LL, HL, LH and HH, which are
fed to the inputs of the four temporal processors (TPs) to perform the temporal transform. The output of
the TPs is a low-frequency frame (L-frame) and a high-frequency frame (H-frame). Architectural details of
the spatial and temporal processors are discussed in the following sections.
A. Architecture for Spatial Processor
In this section, we propose a new parallel and memory-efficient lifting-based 2-D DWT architecture,
denoted the spatial processor (SP), which consists of row and column processors. The proposed SP is a
revised version of the architecture developed by Y. Hu et al. [17]. The proposed architecture uses
strip-based scanning [17] to enable a trade-off between external memory and internal memory. To
reduce the critical path in each stage, the flipping model [5]-[6] is used to develop the processing
element (PE). Each PE is built with shift-and-add techniques in place of a multiplier. The lifting-based
(9/7) 1-D DWT is performed by the processing unit (PU) of the proposed architecture. As shown in Fig. 2,
the proposed PU is designed with five PEs, and each PE except the first (shift_PE) is constructed with
two pipeline stages to further reduce the CPD. This modified PU reduces the CPD to $T_a$ (adder delay).
Fig. 1 shows that the number of inputs to the spatial processor is $2P + 1$, which is also the width of
the strip, where $P$ is the number of parallel processing units (PUs) in the row processor as well as in
the column processor. We have designed the proposed architecture with two parallel processing units
($P = 2$). The same structure can be extended to $P$ = 4, 8, 16 or 32, depending on the external
bandwidth. As soon as the row processor produces intermediate results, the column processor starts
processing them. The row processor takes 9 clock cycles to produce the intermediate results, the column
processor then takes 9 more clock cycles to give the 2-D DWT output, and finally the temporal processor
takes 3 more clock cycles after the 2-D DWT results are available to produce the 3-D DWT output. In
summary, the proposed 2-D DWT and 3-D DWT architectures have constant latencies of 18 and 21 clock cycles
respectively, regardless of the image size N and the number of parallel PUs (P). Details of the
row processor and column processor are given in the following sub-sections.

Figure 2. (a) Data flow graph of the processing unit (b) Processing unit with five pipeline stages (c) Processing unit with nine pipeline stages
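The stage counts and latencies quoted above follow from simple arithmetic, sketched below. This is a back-of-the-envelope model, not the authors' design flow: delays are expressed in units of the adder delay $T_a$, the per-PE delays (one $T_a$ for shift_PE, two for the other four PEs) are taken from the description of the PU, each PE is assumed to split evenly into stages, and the helper names are hypothetical.

```python
import math

# Combinational delay of the five PEs of one PU, in units of Ta:
# shift_PE, PE_alpha, PE_beta, PE_gama, PE_delta.
PE_DELAYS = [1, 2, 2, 2, 2]


def pipeline_profile(pe_delays, stage_budget):
    """Return (pipeline stages, critical path delay in Ta) when every PE is
    split evenly so that no stage exceeds stage_budget."""
    stages = sum(math.ceil(d / stage_budget) for d in pe_delays)
    cpd = max(d / math.ceil(d / stage_budget) for d in pe_delays)
    return stages, cpd


def strip_width(p):
    """Number of inputs to the spatial processor for P parallel PUs."""
    return 2 * p + 1


# Clock-cycle latencies as stated in the text.
RP_CYCLES, CP_CYCLES, TP_CYCLES = 9, 9, 3
LATENCY_2D = RP_CYCLES + CP_CYCLES
LATENCY_3D = LATENCY_2D + TP_CYCLES
```

With a budget of $2T_a$ per stage the model gives the five-stage PU, and with a budget of $T_a$ it gives the nine-stage PU, which matches the 9-cycle latency of the row and column processors.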
1) Row Processor (RP): Let X be an image of size $N \times N$; the image is extended by one column using
symmetric extension, so that its size becomes $N \times (N + 1)$. Refer to [17] for the structure of the
strip-based scanning method. The proposed architecture begins the DWT row-wise through the row processor
(RP) and then performs the column DWT through the column processor (CP). Fig. 3(a) shows the generalized
structure of a row processor with P PUs; P = 2 is used in our design. In the first clock cycle, the RP
receives the pixels X(0, 0) to X(0, 2P) simultaneously. In the second clock cycle, the RP receives the
pixels of the next row, i.e. X(1, 0) to X(1, 2P), and the same procedure continues every clock cycle
until it reaches the bottom row, i.e. X(N, 0) to X(N, 2P). It then moves to the next strip, where the RP
receives the pixels X(0, 2P) to X(0, 4P), and this procedure continues over the entire image. Each PU
consists of five pipeline stages, and each pipeline stage is processed by one processing element (PE), as
depicted in Fig. 2(b). The first stage (shift_PE) provides the partial results required by the 2nd stage
(PE_alpha); likewise, the processing elements PE_alpha to PE_delta (2nd to 5th stages) provide partial
results along with their own outputs. For example, PE_alpha of PU-1 provides the output corresponding to
eqn. (1), $H_1[n]$, and along with it the partial output $X'[2n]$ required by PE_beta. The structure of
the PEs is given in Fig. 2(b), which shows that multiplication is replaced by the shift-and-add technique.
The original multiplication factors and the values obtained through the shift-and-add circuits are listed
in Table I, which shows that the difference between the original and adopted values is small (less than
1% for each coefficient). As shown in Fig. 2(b), the delay of shift_PE is one $T_a$, and all remaining
PEs have a delay of $2T_a$. To reduce the CPD of the PU, PE_alpha to PE_delta are each divided into two
pipeline stages with a delay of $T_a$ each; as a result, the CPD of the PU is reduced to $T_a$ and the
number of pipeline stages increases to nine, as shown in Fig. 2(c). The outputs $H_1[n+P-1]$,
$L_1[n+P-1]$ and $H_2[n+P-1]$, corresponding to PE_alpha, PE_beta and PE_gama of the last PU, are saved
in the memories Memory_alpha, Memory_beta and Memory_gama respectively, as shown in Fig. 3(a). These
stored outputs serve as inputs for the subsequent columns of the same row. For an $N \times N$ image the
number of rows is N, so the size of each memory is $N \times 1$ words and the total row memory needed to
store these outputs is 3N words. The output of each PU is scaled before producing the outputs H and L.
These outputs are fed to the transpose unit. The transpose unit has P transpose registers (one for each
PU). Fig. 4(a) shows the structure of the transpose register, which delivers two H and two L values
alternately to the column processor.

Figure 3. (a) Row processor (b) Column processor

Figure 4. (a) Transpose register (Ref: [17]) (b) Re-arrange unit
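The strip-based scan order described for the row processor can be sketched as a generator of pixel coordinates per clock cycle. This is an illustration only: `strip_scan` is a hypothetical name, rows are assumed to be indexed 0 to N-1, N is assumed to be a multiple of 2P, and consecutive strips are assumed to share one boundary column, as suggested by the ranges X(0, 0)..X(0, 2P) and X(0, 2P)..X(0, 4P) in the text.

```python
def strip_scan(n, p):
    """Yield (row, columns) read by the row processor in each clock cycle.

    n: image height and width (the image is extended to n + 1 columns).
    p: number of parallel processing units; each strip is 2p + 1 columns wide.
    """
    step = 2 * p
    for strip in range(n // step):
        first_col = strip * step
        for row in range(n):
            # One row of the current strip is read per clock cycle.
            yield row, list(range(first_col, first_col + step + 1))


# Example: an 8x8 image with P = 2 is covered in two strips of eight rows.
schedule = list(strip_scan(8, 2))
```

Under these assumptions the schedule has N^2/(2P) entries, which agrees with the computation time listed for the proposed 2-D architecture in Table III.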
2) Column Processor (CP): The structure of the column processor (CP) is shown in Fig. 3(b). To
match the throughput of the RP, the CP is also designed with two PUs in our architecture. Each
transpose register produces a pair of H and L values in alternating order, which are fed to the inputs of
one PU of the CP. The partial results produced are consumed by the next PE after two clock cycles. As
such, shift registers of length two are needed within the CP between pipeline stages for caching the
partial results (except between the 1st and 2nd pipeline stages). At the output of the CP, the four
sub-bands are generated in an interleaved pattern, i.e. (HL, HH), (LL, LH), (HL, HH), (LL, LH), and so on.
The outputs of the CP are fed to the re-arrange unit. Fig. 4(b) shows the architecture of the re-arrange
unit, which provides the outputs in sub-band order, i.e. LL, LH, HL and HH simultaneously, using P
registers and 2P multiplexers. For multilevel decomposition, the same DWT core can be used in a folded
architecture with an external frame buffer for the LL sub-band coefficients.

Table I
ORIGINAL AND ADOPTED VALUES FOR MULTIPLICATION

| PE | Original multiplier value | Multiplier value through shift and add |
|---|---|---|
| PE_alpha | a′ = -0.6305 | a′ = -0.6328 |
| PE_beta | b′ = 11.90 | b′ = 12 |
| PE_gama | c′ = -21.378 | c′ = -21.375 |
| PE_delta | d′ = 2.55 | d′ = 2.5625 |
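The adopted values in Table I can be reproduced as signed sums of powers of two. The decomposition below is an assumption inferred from the adopted values and from the shift amounts that survive in the Fig. 2 labels (for example >>1, >>3 and >>7 for PE_alpha); it is not taken from the authors' RTL, and the names `SHIFT_TERMS`, `adopted_value` and `relative_error` are hypothetical.

```python
ALPHA = -1.586134342
BETA = -0.052980118
GAMMA = 0.8829110762
DELTA = 0.4435068522

# Exact flipped multipliers as defined in Section II.
EXACT = {
    "a'": 1.0 / ALPHA,
    "b'": 1.0 / (ALPHA * BETA),
    "c'": 1.0 / (BETA * GAMMA),
    "d'": 1.0 / (GAMMA * DELTA),
}

# (sign, exponents): positive exponent = left shift, negative = right shift,
# zero = the unshifted input. Assumed decomposition, see the lead-in.
SHIFT_TERMS = {
    "a'": (-1, [-1, -3, -7]),
    "b'": (+1, [2, 3]),
    "c'": (-1, [4, 2, 0, -2, -3]),
    "d'": (+1, [1, -1, -4]),
}


def adopted_value(name):
    """Multiplier realised by adding the shifted copies of the input."""
    sign, exponents = SHIFT_TERMS[name]
    return sign * sum(2.0 ** e for e in exponents)


def relative_error(name):
    """Relative deviation of the shift-and-add value from the exact one."""
    exact = EXACT[name]
    return abs(adopted_value(name) - exact) / abs(exact)
```

Under this decomposition each multiplier costs at most four adders on shifted copies of the input, and the relative deviation stays below 1% for all four coefficients.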
B. Architecture for Temporal Processor (TP)
Eqn. (8) shows that the Haar wavelet transform depends on two adjacent pixel values (the same pixel
position in adjacent frames, for temporal processing). As soon as the spatial processors provide the
2-D DWT results, the temporal processors start processing them and produce the 3-D DWT results. Fig. 1
shows that no temporal buffer is required, because the sub-band coefficients of the two spatial
processors are connected directly to the four temporal processors. However, since each TP is designed
with 3 pipeline stages, it requires 6 pipeline registers. The same frequency sub-band from the two
spatial processors is fed to each temporal processor, i.e. the LL, HL, LH and HH sub-bands of spatial
processors 1 and 2 are given as inputs to temporal processors 1, 2, 3 and 4, respectively. Each temporal
processor applies the 1-D Haar wavelet to the sub-band coefficients and provides a low-frequency sub-band
and a high-frequency sub-band as output. Combining the low-frequency and high-frequency sub-bands of all
temporal processors gives the 3-D DWT output in the form of an L-frame and an H-frame (2-D DWT by the
spatial processors and 1-D DWT by the temporal processors).
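The data flow of the four temporal processors can be sketched in software as follows. The sketch assumes that the 2-D sub-bands of the two frames are already available as nested lists, uses floating-point arithmetic instead of the 3-stage shift-and-add pipeline, and uses hypothetical names (`temporal_stage`, `BANDS`); the output keys append the temporal letter to the spatial sub-band name, as in the sub-band labels of Fig. 1.

```python
import math

INV_SQRT2 = 1.0 / math.sqrt(2.0)
BANDS = ("LL", "LH", "HL", "HH")


def temporal_stage(sp0, sp1):
    """Apply eqn. (8) between matching sub-bands of two adjacent frames.

    sp0, sp1: dicts mapping a sub-band name to a 2-D list of coefficients,
    standing in for the outputs of spatial processors 1 and 2.
    Returns the eight 3-D sub-bands (LLL, LLH, ..., HHH).
    """
    out = {}
    for band in BANDS:
        a, b = sp0[band], sp1[band]
        out[band + "L"] = [[INV_SQRT2 * (p + q) for p, q in zip(ra, rb)]
                           for ra, rb in zip(a, b)]
        out[band + "H"] = [[INV_SQRT2 * (q - p) for p, q in zip(ra, rb)]
                           for ra, rb in zip(a, b)]
    return out
```

Because each output depends only on the two coefficients at the same position, no frame needs to be buffered between the spatial and temporal stages.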
IV. IMPLEMENTATION RESULTS AND PERFORMANCE COMPARISON
The proposed 3-D DWT architecture has been described in Verilog HDL. A uniform word length of
14 bits has been maintained throughout the design. Simulation results have been verified using the Xilinx
ISE simulator. We have simulated a Matlab model equivalent to the proposed 3-D DWT hardware
architecture and verified the 3-D DWT coefficients. The RTL simulation results were found to match the
Matlab simulation results exactly. The Verilog RTL code is synthesized using the Xilinx ISE 14.2 tool
and mapped to a Xilinx programmable device (FPGA) 7z020clg484 (Zynq board) with a speed grade of
-3. Table II shows the device utilization summary of the proposed architecture, which operates at a
maximum frequency of 265 MHz.
The proposed architecture has also been synthesized using Synopsys Design Compiler with a 90-nm
CMOS standard cell library. It consumes 43.42 mW of power and occupies an area of 231.45 K equivalent
gates at a frequency of 200 MHz.

Table II
DEVICE UTILISATION SUMMARY OF THE PROPOSED ARCHITECTURE

| Logic utilized | Used | Available | Utilization |
|---|---|---|---|
| Slice Registers | 1958 | 106400 | 1% |
| Number of Slice LUTs | 2852 | 53200 | 5% |
| Number of fully used LUT-FF pairs | 1137 | 3673 | 30% |
| Number of Block RAM | 3 | 140 | 2% |

Table III
COMPARISON OF PROPOSED 2-D DWT ARCHITECTURE WITH EXISTING ARCHITECTURES (FOR 1-LEVEL)

| Parameter | Zhang [12] | Mohanty [13] | Darji [14] | Yusong [17] | Proposed |
|---|---|---|---|---|---|
| Multipliers | 10 | 9P | 10 | 10P | 0 |
| Adders | 16 | 16P | 16 | 16P | 34P |
| Internal Memory | 4N+37 | 15P+5.5N | 4N | 24P+3N | 60P+3N |
| Critical path | Tm | Tm + 2Ta | Tm | Tm + Ta | Ta |
| Computation Time | N^2/2 | N^2/(2P) | N^2/2 | N^2/(2P) | N^2/(2P) |
| Throughput | 2/Tm | 2P/(Tm + 2Ta) | 2/Tm | 2P/(Tm + Ta) | 2P/Ta |

Table IV
COMPARISON OF PROPOSED 3-D DWT ARCHITECTURE WITH EXISTING ARCHITECTURES (FOR 1-LEVEL)

| Parameters | Weeks [19] | Taghavi [20] | A. Das [22] | Darji [24] | Proposed |
|---|---|---|---|---|---|
| Memory requirement | 6N^2 + 6l | 5N^2 | 5N^2 + 5N | 4N^2 + 10N | 2(3N+60P)+48 |
| Throughput/cycle | - | 1 result | 2 results | 4 results | 8 results |
| Computing time for 2 frames | 2N^2 + 3l/2 | 6N^2 | 3N^2 | 3N^2 | N^2/(2P) |
| Latency | 2.5N^2 + 0.5l | 4N^2 cycles | 2N^2 cycles | 3N^2/2 cycles | 21 cycles |
| Area | - | - | 1825 slices | 2490 slices | 2852 slice LUTs |
| Operating Frequency | 200 MHz (ASIC) | - | 321 MHz (FPGA) | 91.87 MHz (FPGA) | 265 MHz (FPGA) |
| Multipliers | - | - | Nil | 30 | Nil |
| Adders | 6l MACs | - | 78 | 48 | 168 |
| Filter bank | l-length | D-9/7 | D-9/7 | D-9/7 | D-9/7 (2-D) + Haar (1-D) |

Table V
SYNTHESIS RESULTS (DESIGN VISION) COMPARISON OF PROPOSED 3-D DWT ARCHITECTURE WITH EXISTING

| Parameters | Darji et al. [24] | Proposed |
|---|---|---|
| Comb. Area | 61351 µm^2 | 526419 µm^2 |
| Non Comb. Area | 807223 µm^2 | 553078 µm^2 |
| Total Cell Area | 868574 µm^2 | 1079498 µm^2 |
| Operating Voltage | 1.98 V | 1.2 V |
| Total Dynamic Power | 179.75 mW | 38.56 mW |
| Cell Leakage Power | 46.87 µW | 4.86 mW |
A. Comparison
The performance of the proposed 2-D and 3-D DWT architectures is compared with that of existing
architectures in Tables III and IV, respectively. The proposed 2-D processor requires zero
multipliers, 34P adders (P is the number of parallel PUs), and 60P + 3N words of internal memory. It has
a critical path delay of $T_a$, a throughput of four outputs per cycle (2P with P = 2), and takes
$N^2/(2P)$ computation cycles to process an image of size $N \times N$. Compared with the recent 2-D DWT
architecture developed by Y. Hu et al. [17], the CPD is reduced from $T_m + T_a$ to $T_a$ at the cost of
a small increase in hardware resources.
Table IV compares the proposed 3-D DWT architecture with existing 3-D DWT architectures.
The proposed design has a lower memory requirement, higher throughput, shorter computation time
and minimal latency compared to [19], [20], [22], and [24]. Although the proposed 3-D DWT architecture
has a small disadvantage in area and frequency compared to [22], it has a clear advantage in all
remaining aspects.
Table V compares the synthesis results of the proposed 3-D DWT architecture with those of
[24]. The proposed design appears to occupy more cell area, but its figure includes the total on-chip
memory, whereas in [24] the on-chip memory is not included. The power consumption of the proposed 3-D
architecture is much lower than that of [24] (38.56 mW versus 179.75 mW of total dynamic power).
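To make the memory comparison of Table IV concrete, the two closed-form expressions for the proposed design and for [24] can be evaluated for a given frame size. The sketch below only evaluates the formulas quoted in the table; the function names are hypothetical and the example size N = 512 is chosen for illustration, not taken from the paper.

```python
def proposed_memory_words(n, p):
    """On-chip memory of the proposed 3-D DWT: 2(3N + 60P) + 48 words."""
    return 2 * (3 * n + 60 * p) + 48


def darji_memory_words(n):
    """On-chip memory quoted for Darji et al. [24]: 4N^2 + 10N."""
    return 4 * n * n + 10 * n


# Illustrative frame size (assumption) with the paper's P = 2.
N, P = 512, 2
ratio = darji_memory_words(N) / proposed_memory_words(N, P)
```

The proposed expression grows linearly with N while the expression for [24] grows quadratically, so the gap widens with frame size.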
V. CONCLUSIONS
In this paper, we have proposed a memory-efficient, high-throughput architecture for the lifting-based
3-D DWT. The proposed architecture is implemented on a 7z020clg484 FPGA target of the Zynq family and
is also synthesized with Synopsys Design Vision for ASIC implementation. An efficient design of the 2-D
spatial processor and the 1-D temporal processor reduces the internal memory, latency, CPD and complexity
of the control unit, and increases the throughput. Compared with existing architectures, the proposed
scheme shows higher performance at the cost of a slight increase in area. The proposed 3-D DWT
architecture is capable of processing 60 UHD (3840×2160) frames per second.
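The UHD frame-rate claim can be sanity-checked against the stated throughput of 8 results per cycle. The estimate below is an idealised upper bound under stated assumptions: it ignores strip overlap, pipeline fill, multilevel decomposition and external memory bandwidth, and the function names are hypothetical.

```python
def cycles_per_frame(width, height, results_per_cycle=8):
    """Clock cycles per frame if one result is produced per input pixel."""
    return (width * height) // results_per_cycle


def frames_per_second(clock_hz, width, height, results_per_cycle=8):
    """Idealised frame rate at the given clock frequency."""
    return clock_hz / cycles_per_frame(width, height, results_per_cycle)


# UHD frame at the reported ASIC (200 MHz) and FPGA (265 MHz) clock rates.
uhd_cycles = cycles_per_frame(3840, 2160)          # 1036800 cycles
fps_asic = frames_per_second(200e6, 3840, 2160)
fps_fpga = frames_per_second(265e6, 3840, 2160)
```

Under these assumptions both clock rates leave headroom above 60 frames per second, which is consistent with the claim in the conclusions.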
REFERENCES
[1] Q. Dai, X. Chen, and C. Lin, “A Novel VLSI Architecture for Multidimensional Discrete Wavelet Transform,” IEEE Transactions on
Circuits and Systems for Video Technology, vol. 14, no. 8, pp. 1105-1110, Aug. 2004.
[2] C. Cheng and K. K. Parhi, “High-speed VLSI implementation of 2-D discrete wavelet transform,” IEEE Trans. Signal Process., vol.
56, no. 1, pp. 393-403, Jan. 2008.
[3] B. K. Mohanty and P. K. Meher, “Memory-Efficient High-Speed Convolution-based Generic Structure for Multilevel 2-D DWT,” IEEE
Transactions on Circuits and Systems for Video Technology, vol. 23, no. 2, pp. 353-363, Feb. 2013.
[4] I. Daubechies and W. Sweldens, “Factoring wavelet transforms into lifting schemes,” J. Fourier Anal. Appl., vol. 4, no. 3, pp. 247-269,
1998.
[5] C.T. Huang, P.C. Tseng, and L.-G. Chen, “Flipping structure: An efficient VLSI architecture for lifting-based discrete wavelet transform,”
IEEE Trans. Signal Process., vol. 52, no. 4, pp. 1080-1089, Apr. 2004.
[6] C.-Y. Xiong, J.-W. Tian, and J. Liu, “A Note on Flipping Structure: An Efficient VLSI Architecture for Lifting-Based Discrete Wavelet
Transform,” IEEE Transactions on Signal Processing, vol. 54, no. 5, pp. 1910-1916, May 2006.
[7] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, “Analysis and VLSI architecture for 1-D and 2-D discrete wavelet transform,” IEEE Trans.
Signal Process., vol. 53, no. 4, pp. 1575-1586, Apr. 2005.
[8] C.-C. Cheng, C.-T. Huang, C.-Y. Ching, C.-J. Chung, and L.-G. Chen, “On-chip memory optimization scheme for VLSI implementation
of line based two-dimensional discrete wavelet transform,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 17,
no. 7, pp. 814-822, Jul. 2007.
[9] H.Y. Liao, M. K. Mandal, and B. F. Cockburn, “Efficient architectures for 1-D and 2-D lifting-based wavelet transforms,” IEEE
Transactions on Signal Processing, vol. 52, no. 5, pp. 1315-1326, May 2004.
[10] B.F. Wu and C.F. Chung, “A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform
of JPEG2000 codec,” IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 12, pp. 1615-1628, Dec. 2005.
[11] C.-Y. Xiong, J. Tian, and J. Liu, “Efficient architectures for two-dimensional discrete wavelet transform using lifting scheme,” IEEE
Transactions on Image Processing, vol. 16, no. 3, pp. 607-614, Mar. 2007.
[12] W. Zhang, Z. Jiang, Z. Gao, and Y. Liu, “An efficient VLSI architecture for lifting-based discrete wavelet transform,” IEEE Transactions
on Circuits and Systems-II: Express Briefs, Vol. 59, No. 3, pp. 158-162, Mar. 2012.
[13] B. K. Mohanty and P. K. Meher, “Memory Efficient Modular VLSI Architecture for High throughput and Low-Latency Implementation
of Multilevel Lifting 2-D DWT,” IEEE Transactions on Signal Processing, Vol. 59, No. 5, pp. 2072-2084, May 2011.
[14] A. Darji, S. Agrawal, A. Oza, V. Sinha, A. Verma, S. N. Merchant, and A. N. Chandorkar, “Dual-Scan Parallel Flipping Architecture
for a Lifting-Based 2-D Discrete Wavelet Transform,” IEEE Transactions on Circuits and Systems-II: Express Briefs, vol. 61, no. 6,
pp. 433-437, Jun. 2014.
[15] B. K. Mohanty, A. Mahajan, and P. K. Meher, “Area and power efficient architecture for high-throughput implementation of lifting
2-D DWT,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 59, no. 7, pp. 434-438, Jul. 2012.
[16] Y. Hu and C. C. Jong, “A Memory-Efficient High-Throughput Architecture for Lifting-Based Multi-Level 2-D DWT,” IEEE Transactions
on Signal Processing, vol. 61, no. 20, pp. 4975-4987, Oct. 15, 2013.
[17] Y. Hu and C. C. Jong, “A Memory-Efficient Scalable Architecture for Lifting-Based Discrete Wavelet Transform,” IEEE Transactions
on Circuits and Systems-II: Express Briefs, vol. 60, no. 8, pp. 502-506, Aug. 2013.
[18] J. Xu, Z. Xiong, S. Li, and Y.-Q. Zhang, “Memory-Constrained 3-D Wavelet Transform for Video Coding Without Boundary Effects,”
IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 9, pp. 812-818, Sep. 2002.
[19] M. Weeks and M. A. Bayoumi, “Three-Dimensional Discrete Wavelet Transform Architectures,” IEEE Transactions on Signal Processing,
vol. 50, no. 8, pp. 2050-2063, Aug. 2002.
[20] Z. Taghavi and S. Kasaei, “A memory efficient algorithm for multidimensional wavelet transform based on lifting,” in Proc. IEEE Int.
Conf. Acoust. Speech Signal Process. (ICASSP), vol. 6, pp. 401-404, 2003.
[21] Q. Dai, X. Chen, and C. Lin, “Novel VLSI architecture for multidimensional discrete wavelet transform,” IEEE Transactions on Circuits
and Systems for Video Technology, vol. 14, no. 8, pp. 1105-1110, Aug. 2004.
[22] A. Das, A. Hazra, and S. Banerjee, “An Efficient Architecture for 3-D Discrete Wavelet Transform,” IEEE Transactions on Circuits and
Systems for Video Technology, vol. 20, no. 2, pp. 286-296, Feb. 2010.
[23] B. K. Mohanty and P. K. Meher, “Memory-Efficient Architecture for 3-D DWT Using Overlapped Grouping of Frames,” IEEE
Transactions on Signal Processing, vol. 59, no. 11, pp. 5605-5616, Nov. 2011.
[24] A. Darji, S. Shukla, S. N. Merchant and A. N. Chandorkar, “Hardware Efficient VLSI Architecture for 3-D Discrete Wavelet Transform,”
Proc. of 27th Int. Conf. on VLSI Design and 13th Int. Conf. on Embedded Systems pp. 348-352, 5-9 Jan. 2014.
[25] W. Sweldens, “The Lifting Scheme: a Construction of Second Generation of Wavelets,” SIAM Journal on Mathematical Analysis, vol. 29,
no. 2, pp. 511-546, 1998.
