A High Performance Architecture for 3D Wavelet Transform by Zahra Zaremojtahedi & Farzad Zargari
 
 
A High Performance Architecture for 3D Wavelet Transform  
Zahra ZareMojtahedi 1 and Farzad Zargari 2
 
 
1 Computer Engineering department of Science and Research branch of  
Iran Islamic Azad University, Tehran, Iran 
 
2 Department of Information Technology of Research Institute for ICT,  
formerly known as Iran Telecom Research Center (ITRC), Tehran, Iran  
 
 
Abstract 
The Discrete Wavelet Transform (DWT) has attracted a great deal of
attention and is used in various image, video and signal processing
applications. Even though the lifting scheme was proposed to reduce
the computational load of the DWT, traditional lifting-based
hardware solutions for the DWT suffer from a long critical path
and a high demand for computational resources. The flipping
structure has been proposed to resolve these problems in
lifting-based hardware implementations of the DWT. In this paper we
propose a pipelined architecture based on the flipping structure
for the 3D DWT. The proposed architecture has a shorter critical
path and needs fewer computational resources than the other
hardware architectures proposed for the 3D DWT in the literature.
Moreover, the proposed architecture can be used to implement the
1D and 2D DWT in addition to the 3D DWT.
Keywords:  3D Wavelet Transform, Computational resources, 
Critical path, Flipping Scheme, Lifting Scheme. 
1. Introduction 
The 3D DWT has been used in various image and video
compression and processing applications. Encoding the
volumetric data sets produced by 3D image acquisition
devices such as computed tomography (CT), positron
emission tomography (PET) and magnetic resonance imaging
(MRI) is one group of 3D DWT applications. In the field of
video coding and processing, scalable video coding and
noise reduction between the frames of a video are further
applications of the 3D DWT. The DWT is one of the most
computationally intensive parts of these image and video
coding applications. Even though the lifting scheme [1],[2]
was proposed to reduce the computational load of the DWT,
it still takes a relatively large portion of the coding time
in the image and video coding process. As a result,
hardware realization of the DWT has been widely considered
as a way to reduce the coding time in real-time
applications [3]. Nevertheless, traditional lifting-based
hardware realizations of the DWT suffer from a long
critical path and hardware overhead. The flipping
structure was introduced to resolve these problems [4].
Although there are proposed solutions for the 1D and 2D
flipping DWT [5],[6] and hardware solutions for the 3D
lifting-based DWT [7]-[9], to the best of our knowledge no
high performance architecture for the 3D DWT based on the
flipping structure has been proposed.
In this paper we propose a high efficiency parallel
architecture for the hardware implementation of the 3D DWT
based on the flipping structure. The proposed architecture
reduces the critical path latency and the number of
processing units compared to traditional lifting-based
3D DWT architectures. Moreover, the proposed architecture
can be used to produce the 1D and 2D DWT in addition to the
3D DWT. The rest of the paper is organized as follows. In
section 2 we introduce the proposed architecture, followed
by the simulation results in section 3. The concluding
remarks are given in section 4.
2. Proposed Architecture 
The lifting scheme for the WT using the Daubechies 9/7 filter
is performed as follows:
$$ s_i^{0} = x_{2i} \tag{1} $$
$$ d_i^{0} = x_{2i+1} \tag{2} $$
$$ d_i^{1} = a\,d_i^{0} + s_i^{0} + s_{i+1}^{0} \tag{3} $$
$$ s_i^{1} = b\,s_i^{0} + d_{i-1}^{1} + d_i^{1} \tag{4} $$
$$ d_i^{2} = c\,d_i^{1} + s_i^{1} + s_{i+1}^{1} \tag{5} $$
$$ s_i^{2} = d\,s_i^{1} + d_{i-1}^{2} + d_i^{2} \tag{6} $$
$$ s_i = k_1\,s_i^{2} \tag{7} $$
$$ d_i = k_2\,d_i^{2} \tag{8} $$
 
where $x_{2i}$ and $x_{2i+1}$ are the even and odd input elements,
respectively; $a = 1/\alpha$, $b = 1/(\alpha\beta)$,
$c = 1/(\beta\gamma)$, $d = 1/(\gamma\delta)$,
$k_1 = \alpha\beta\gamma\delta\zeta$ and
$k_2 = \alpha\beta\gamma/\zeta$; and $\alpha$, $\beta$, $\gamma$,
$\delta$ and $\zeta$ are the coefficients employed in the lifting
scheme and listed in Table 1. $s_i^{0}$ and $d_i^{0}$ represent the
even and odd parts of the input, respectively, $s_i^{n}$ and
$d_i^{n}$ (n = 1, 2) represent the intermediate values obtained in
the lifting process, and $s_i$ and $d_i$ represent the low-pass and
high-pass parts of the output signal, respectively. $k_1$ and $k_2$
are also known as kL and kH, respectively. In fact, the flipping
scheme has a final scaling stage, by the factors kL and kH, which
is applied in the last filtering step.

IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 3, No 2, May 2012
ISSN (Online): 1694-0814
www.IJCSI.org 151
Copyright (c) 2012 International Journal of Computer Science Issues. All Rights Reserved.

 
Table 1: Lifting Scheme Parameters 
 
Parameter   Approximate Value
$\alpha$    -1.586134342059924
$\beta$     -0.052980118572961
$\gamma$     0.882911075530934
$\delta$     0.443506852043971
$\zeta$      1.230174104914001
 
In the flipping scheme, (3) is merged with (4) and (5) is merged
with (6), which gives the outputs of stage 2 as [4]:

$$ s_i^{1} = b\,s_i^{0} + d_{i-1}^{1} + d_i^{1}
           = b\,s_i^{0} + (a\,d_{i-1}^{0} + s_{i-1}^{0} + s_i^{0})
             + (a\,d_i^{0} + s_i^{0} + s_{i+1}^{0}) \tag{9} $$

$$ s_i^{2} = d\,s_i^{1} + d_{i-1}^{2} + d_i^{2}
           = d\,s_i^{1} + (c\,d_{i-1}^{1} + s_{i-1}^{1} + s_i^{1})
             + (c\,d_i^{1} + s_i^{1} + s_{i+1}^{1}) \tag{10} $$
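
The merge in (9) can be checked numerically. The short Python sketch below evaluates the first stage once as (3) followed by (4) and once directly from the stage inputs, for interior samples only, so that no boundary rule has to be assumed. The function names and the test values of a and b are illustrative; the identity is algebraic and holds for any coefficients.

```python
def stage_two_steps(s0, d0, a, b):
    """Eq. (3) followed by Eq. (4), interior samples i = 1 .. n-2 only."""
    n = len(s0)
    d1 = [a * d0[i] + s0[i] + s0[i + 1] for i in range(n - 1)]
    return [b * s0[i] + d1[i - 1] + d1[i] for i in range(1, n - 1)]


def stage_merged(s0, d0, a, b):
    """Eq. (9): s_i^1 written directly in terms of the stage inputs."""
    n = len(s0)
    return [
        b * s0[i]
        + (a * d0[i - 1] + s0[i - 1] + s0[i])
        + (a * d0[i] + s0[i] + s0[i + 1])
        for i in range(1, n - 1)
    ]
```

The same check applies to (10) by passing the stage-1 outputs together with c and d in place of a and b.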
The flipping structure is similar to the basic lifting structure,
except that the scaling is performed in a different order. For
example, in the first stage of the lifting scheme the scaling is
performed as shown in Fig. 1.a, while the scaling in the
flipping structure is performed as shown in Fig. 1.b. Comparing
Fig. 1.a and Fig. 1.b shows that the flipping structure uses
fewer multipliers than the original lifting scheme.
  
  
Fig. 1. Scaling methods in (a) the original lifting scheme and (b) the flipping scheme
The data path in Fig. 2 was introduced by Hao et al. [6] to
implement (9) and (10). It employs one multiplier and two adders
to produce one output per clock cycle. This data path has a
shorter critical path than the traditional lifting scheme for
the WT. The critical path in the flipping scheme is one
multiplier (Tm), in contrast to the critical path of the lifting
scheme, which includes one multiplier (Tm) and two adders (2Ta).
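
To make the difference between Tm and Tm + 2Ta concrete, the following Python sketch evaluates both critical paths for a pair of delay values. The delays are hypothetical and chosen only for the arithmetic; they are not measurements from the paper, and the helper names are illustrative.

```python
def critical_path_ns(t_mult_ns, t_add_ns):
    """Return (lifting, flipping) critical-path delays in nanoseconds."""
    lifting = t_mult_ns + 2 * t_add_ns   # Tm + 2Ta
    flipping = t_mult_ns                 # Tm
    return lifting, flipping


def max_clock_mhz(delay_ns):
    """Upper bound on the clock rate if the critical path alone set the period."""
    return 1000.0 / delay_ns


# Hypothetical delays: 4 ns per multiplier, 1 ns per adder.
lifting_ns, flipping_ns = critical_path_ns(t_mult_ns=4.0, t_add_ns=1.0)
print(lifting_ns, flipping_ns)        # 6.0 4.0
print(max_clock_mhz(flipping_ns))     # 250.0
```

With these assumed delays the flipped data path could be clocked 1.5 times faster than the lifting data path; the actual ratio depends on the target technology.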
 
Fig. 2. Datapath for realization of the first stage of flipping scheme [6] 
Hao et al. [6] introduced the hardware in Fig. 3 to
implement the 1D WT according to the flipping scheme. We
refer to the architecture in Fig. 2 as the "basic unit" hereafter.
This architecture is the basic building block of the 1D and
2D flipping-based DWT hardware.
 
 
Fig. 3. Architecture proposed in [6] to implement the 1D WT
 
[Fig. 4 block diagram. Labels recovered from the figure: inputs
xeee, xeoe, xoee, xooe; transform units RW1-RW4, CW1-CW4 and
FW1-FW4; transpose units T1-T4; intermediate signals Lee, Leo, Loe,
Loo, Hee, Heo, Hoe, Hoo and LLe, LHe, HLe, HHe, LLo, HLo, LHo, HHo;
1-D, 2-D and 3-D WT coefficient paths into the SNU, with M0-M3 and
S0-S3. Output 1-DWT: L, H. Output 2-DWT: LL, LH, HL, HH.
Output 3-DWT: LLL, LLH, HLL, HLH, LHL, LHH, HHL, HHH.]
 
Fig. 4. Proposed parallel and pipelined architecture to implement 3D WT 
 
 
 
Fig. 5. The Transpose Unit 
  
 
We propose a parallel and pipelined flipping structure for the
3D DWT that uses the basic transform unit given in Fig. 3.
Fig. 4 shows the proposed architecture. It consists of 12
one-dimensional basic transform units in the row, column and
time stages, four transpose modules and one scaling unit (SNU).
The first transform stage (row transform) employs four basic
transform units, each of which has four inner registers. As a
result, it requires eight multipliers, 16 adders and 16
registers. The first stage takes four inputs and produces the
1D DWT of the inputs at its output. The outputs are applied to
the T1 and T2 transpose units to put them in the appropriate
order for the next transform stage. All the transpose units in
Fig. 4 have similar architectures, and Fig. 5 shows the
schematic of transpose unit T1. Each transpose unit consists of
six delay elements and two multiplexers.
The second stage performs the next 1D transform on its inputs
and produces the 2D DWT of the initial image coefficients. The
number of basic transform units in the second stage is the
same as in the first stage, but the second stage requires 16×M
registers for an N×M×F image sequence. The third transform
stage receives the outputs of the second stage in the correct
order via the T3 and T4 transpose modules. The number of basic
transform units in this stage is the same as in the previous
stages, but the number of registers increases to 8×M×N for
an N×M×F image sequence.
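
The resource figures quoted above can be tallied with a few lines of Python. In this sketch the per-unit counts (two multipliers and four adders per basic transform unit) are derived from the first-stage totals of eight multipliers and 16 adders over four units. The four multipliers attributed to the SNU are an assumption based on the labels M0 to M3 in Fig. 4, and the function name is illustrative.

```python
def resource_tally(M, N, snu_multipliers=4):
    """Tally processing units and registers for an N x M x F image sequence."""
    units_per_stage = 4          # RW1-RW4, CW1-CW4, FW1-FW4
    stages = 3                   # row, column and time
    mult_per_unit = 2            # 8 multipliers / 4 units in the first stage
    add_per_unit = 4             # 16 adders / 4 units in the first stage
    units = units_per_stage * stages
    return {
        "basic_units": units,
        "multipliers": units * mult_per_unit + snu_multipliers,
        "adders": units * add_per_unit,
        "registers": {"row": 16, "column": 16 * M, "time": 8 * M * N},
    }


tally = resource_tally(M=352, N=288)
print(tally["multipliers"], tally["adders"])   # 28 48
```

Under these assumptions the totals agree with the 28 multipliers and 48 adders listed for the proposed architecture in Table 2. The values M=352 and N=288 are only an example; the text does not state which image dimension corresponds to M and which to N.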
 
Fig. 6. The schematic diagram of SNU 
 
In the final stage, the SNU scales the outputs of the first,
second or third stage to produce the 1D, 2D or 3D DWT,
respectively. The schematic diagram of the SNU is shown in
Fig. 6. In the case of the 1D transform, that is, for inputs
from the first transform stage, the even coefficients are
scaled by kL and the odd coefficients are scaled by kH. For
inputs from the second transform stage, M0 and M2 are derived
from odd rows and are scaled by kHH and kHL, whereas M1 and M3
come from even rows and are scaled by kLL and kLH in order to
generate the 2D DWT coefficients. In the case of the 3D DWT,
the inputs of the SNU come from the third transform stage. The
inputs to M0 (even column and even row), M1 (even row and odd
column), M2 (even column and odd row) and M3 (odd column and
odd row) are scaled by (kLLL, kHLL), (kLLH, kHLH),
(kLHL, kHHL) and (kLHH, kHHH), respectively, to produce the
final 3D DWT coefficients.
kL and kH are the scaling coefficients used in the 1D DWT. kab
(where a and b are each H or L) are the coefficients used in
the 2D DWT and are equal to ka×kb. Similarly, kabc (where a, b
and c are each H or L) are the 3D DWT scaling coefficients and
are equal to ka×kb×kc.
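
The product rule for the scaling coefficients can be written out as a small Python sketch. The function name and the numeric inputs in the usage line are placeholders chosen to make the products easy to read; they are not the kL and kH values of the 9/7 filter.

```python
from itertools import product


def scaling_factors(k_low, k_high, dims):
    """Build {'kLL': ..., 'kLH': ...} style factors as products of kL and kH."""
    base = {"L": k_low, "H": k_high}
    table = {}
    for bands in product("LH", repeat=dims):
        value = 1.0
        for band in bands:
            value *= base[band]
        table["k" + "".join(bands)] = value
    return table


# Placeholder values kL = 2 and kH = 3, used only for illustration.
factors_3d = scaling_factors(2.0, 3.0, dims=3)
print(factors_3d["kLLL"], factors_3d["kLHH"])   # 8.0 18.0
```

Calling the function with dims=1, 2 or 3 yields the two, four or eight factors needed for the 1D, 2D or 3D DWT outputs of the SNU.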
 
 
 
Table 2: Comparing the Architectures for 3D WT 
Type of DWT  Outputs per clock  Critical path  Buffer memory  Adders  Multipliers  Architecture
Lifting      4                  Tm+2Ta         4(N+2)M        72      96           Dai [8]
Lifting      4                  Tm+2Ta         3.5N^2+4N      40      48           Das [9]
Lifting      8                  Tm+2Ta         5.5N^2+8N      96      56           Xiong [7]
Flipping     4                  Tm             8(2N+MN)       48      28           Proposed
 
 
3. Simulation Results  
The proposed architecture was implemented in VHDL and
applied to an image sequence of six 352×288 images. The
hardware simulation results were verified against the results
of a C-language software implementation of the flipping-based
3D DWT. The specifications of the proposed architecture and of
three previously proposed architectures for the 3D DWT are
listed in Table 2.
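
A software reference for this kind of check applies a 1D transform along the rows, then the columns, then the frames. The Python sketch below shows only that ordering; it is not the authors' C program. As a stand-in for the 9/7 flipping filter it uses a simple average and difference step (an unnormalised Haar split), and all names are illustrative.

```python
def split_1d(x):
    """Stand-in 1D analysis step: low half (averages) then high half (differences)."""
    half = len(x) // 2
    low = [(x[2 * i] + x[2 * i + 1]) / 2 for i in range(half)]
    high = [x[2 * i + 1] - x[2 * i] for i in range(half)]
    return low + high


def dwt3d(volume, transform=split_1d):
    """One decomposition level of volume[frame][row][column]."""
    frames, rows, cols = len(volume), len(volume[0]), len(volume[0][0])
    out = [[list(row) for row in frame] for frame in volume]
    # Row stage: transform every row of every frame.
    for t in range(frames):
        for r in range(rows):
            out[t][r] = transform(out[t][r])
    # Column stage: transform every column of every frame.
    for t in range(frames):
        for c in range(cols):
            col = transform([out[t][r][c] for r in range(rows)])
            for r in range(rows):
                out[t][r][c] = col[r]
    # Time stage: transform along the frame axis at every pixel position.
    for r in range(rows):
        for c in range(cols):
            line = transform([out[t][r][c] for t in range(frames)])
            for t in range(frames):
                out[t][r][c] = line[t]
    return out
```

Passing a different `transform` callable, for example a 9/7 routine that also returns the low half followed by the high half, would give a reference closer to the hardware; each dimension must have even length for this sketch.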
The simulation results indicate that the critical path of the
proposed architecture is equal to one multiplier delay (Tm),
while the critical path of the other architectures includes
two adder delays (2Ta) in addition to one multiplier delay
(Tm). Moreover, the proposed architecture requires 28
multipliers and 48 adders, which in total is fewer than the
processing units of the most economical of the other
architectures in Table 2, the one proposed by Das et al. [9].
Besides these advantages, the proposed architecture can also
be used to generate 1D and 2D DWT coefficients.
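
The comparison of processing units can be reproduced directly from the adder and multiplier columns of Table 2. The Python sketch below only restates the table values and adds them up; the dictionary layout and names are illustrative.

```python
# Adder and multiplier counts copied from Table 2.
TABLE_2 = {
    "Dai [8]":   {"adders": 72, "multipliers": 96},
    "Das [9]":   {"adders": 40, "multipliers": 48},
    "Xiong [7]": {"adders": 96, "multipliers": 56},
    "Proposed":  {"adders": 48, "multipliers": 28},
}


def processing_units(row):
    """Total number of adders and multipliers for one architecture."""
    return row["adders"] + row["multipliers"]


totals = {name: processing_units(row) for name, row in TABLE_2.items()}
smallest = min(totals, key=totals.get)
print(totals["Das [9]"], totals["Proposed"], smallest)   # 88 76 Proposed
```

The totals are 168, 88, 152 and 76 units, so the proposed architecture uses 12 fewer processing units than the design of Das et al. [9], although it uses eight more adders.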
4. Conclusions 
In this paper we introduced a parallel and pipelined
architecture based on the flipping structure for the 3D DWT.
The proposed architecture requires fewer processing units
than the other architectures considered. Furthermore, the
proposed architecture reduces the critical path from the
Tm+2Ta of the other architectures to a single Tm.
Moreover, the 1D and 2D DWT coefficients can be produced by
the proposed architecture in addition to the 3D DWT.
Therefore, the proposed architecture can be used in
real-time 1D, 2D and 3D DWT applications that require high
performance and throughput along with low computational
resources and a short critical path.
 
References 
[1] W. Sweldens, "The lifting scheme: a new philosophy in
biorthogonal wavelet constructions", Proc. SPIE 2569 (1995)
68-79.
[2] I. Daubechies, W. Sweldens, "Factoring wavelet transforms
into lifting schemes", J. Fourier Anal. Appl. 4 (3) (1998)
247-269.
[3] B.-F. Wu, C.-F. Lin, "A high-performance and
memory-efficient pipeline architecture for the 5/3 and 9/7
discrete wavelet transform of JPEG2000 codec", IEEE
Transactions on Circuits and Systems for Video Technology,
Vol. 15, No. 12, pp. 1615-1628, Dec. 2005.
[4] C. T. Huang, P. C. Tseng, L. G. Chen, "Flipping
structure: an efficient VLSI architecture for lifting-based
discrete wavelet transform", IEEE Transactions on Signal
Processing, Vol. 52, No. 4, April 2004.
[5] Y. Hao, Y. Liu, R. Wang, "High performance hardware
implementation architecture for DWT of lifting scheme",
International Conference on Intelligent Information Hiding
and Multimedia Signal Processing, 2008.
[6] Y. Hao, Y. Liu, R. Wang, "Efficient parallel hardware
architecture for lifting-based discrete wavelet transform",
Chinese Control and Decision Conference (CCDC 2008), 2008.
[7] C. Xiong, J. Hao, J. Tian, J. Liu, "Efficient array
architecture for multi-dimensional lifting-based discrete
wavelet transform", IEEE Trans. Signal Processing (October
2006) 1089-1099.
[8] Q. Dai, X. Chen, C. Lin, "A novel VLSI architecture for
multidimensional discrete wavelet transform", IEEE Trans.
Circuits Systems Video Technol. (August 2004) 1105-1110.
[9] B. Das, S. Banerjee, "Data-folded architecture for running
3D DWT using 4-tap Daubechies filters", IEE Proc. Circuits
Devices Systems (February 2005) 17-24.
 
Zahra Zaremojtahedi received the B.Sc. degree in computer
engineering from the computer engineering department of the Central
Tehran branch of Iran Islamic Azad University. She is currently an
M.Sc. student in the computer engineering department of the Science
and Research branch of Iran Islamic Azad University, Tehran, Iran.
Her research interests include hardware design for image and
video coding algorithms.
 
Farzad Zargari received his B.Sc. degree in Electrical Engineering 
from Sharif University of Technology and his M.Sc. and Ph.D. 
degrees in Electrical Engineering from University of Tehran, all in 
Tehran, Iran.  
He is currently a research associate at the information technology
department of the Research Institute for ICT, formerly known as Iran
Telecom Research Center (ITRC), Ministry of Telecommunications
and Information Technology of Iran. He is also a member of the
teaching staff in the computer engineering department of the Science
and Research branch of Islamic Azad University. His research
interests include multimedia systems, image and video signal
processing algorithms, and hardware implementation of image and
video coding standards.
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 