Development of Lifting-based VLSI Architectures for

Two-Dimensional Discrete Wavelet Transform by MOHAMED KOKO, IBRAHIM SAEED
STATUS OF THESIS 
Title of thesis Development of Lifting-based VLSI Architectures for 
Two-Dimensional Discrete Wavelet Transform 
I, IBRAHIM SAEED MOHAMED KOKO 
hereby allow my thesis to be placed at the Information Resource Center (IRC) of 
Universiti Teknologi PETRONAS (UTP) with the following conditions: 
1. The thesis becomes the property of UTP. 
2. The IRC of UTP may make copies of the thesis for academic purposes only. 
3. This thesis is classified as 
[ ] Confidential 
[x] Non-confidential 
If this thesis is confidential, please state the reason: 
The contents of the thesis will remain confidential for ______ years. 
Remarks on disclosure: 
Signature of Author 
Permanent address: 
Sudan U. of Science & Technology 
Dept. of Electronics Engineering 
Sudan, Khartoum 
Date: ____________ 
Endorsed by 
Signature of Supervisor 
Name of the Supervisor 
Date: ____________ 
UNIVERSITI TEKNOLOGI PETRONAS 
DEVELOPMENT OF LIFTING-BASED VLSI ARCHITECTURES FOR TWO-
DIMENSIONAL DISCRETE WAVELET TRANSFORM 
By 
IBRAHIM SAEED MOHAMED KOKO 
The undersigned certify that they have read, and recommend to the Postgraduate Studies 
Programme for acceptance this thesis for the fulfillment of the requirements for the 










DEVELOPMENT OF LIFTING-BASED VLSI ARCHITECTURES FOR TWO-
DIMENSIONAL DISCRETE WAVELET TRANSFORM 
By 
IBRAHIM SAEED MOHAMED KOKO 
A thesis 
Submitted to the Postgraduate Studies Programme 
as a Requirement for the degree of 
DOCTOR OF PHILOSOPHY 
IN ELECTRICAL AND ELECTRONIC ENGINEERING 
UNIVERSITI TEKNOLOGI PETRONAS 
BANDAR SERI ISKANDAR, 
PERAK 
August, 2010 
DECLARATION OF THESIS 
Title of thesis Development of Lifting-based VLSI Architectures for 
Two-Dimensional Discrete Wavelet Transform 
I, IBRAHIM SAEED MOHAMED KOKO 
hereby declare that the thesis is based on my original work except for quotations and 
citations which have been duly acknowledged. I also declare that it has not been 
previously or concurrently submitted for any other degree at UTP or other institutions. 
Signature of Author 
Permanent address: 
Sudan U. of Science & Technology 








I would like to express my deepest gratitude to my supervisor, Dr. Nor Hisham Bin 
Hamid, and co-supervisor, Dr. Fawnizu Azmadi, for their guidance, encouragement, 
and valuable advice during the course of this research. I would also like to 
acknowledge my former supervisor, Dr. Herman Agustiawan, for his guidance, 
support, and great help during the two years before he left the university. 
I wish to thank the University for giving me the opportunity to undertake this 
research at Universiti Teknologi PETRONAS. This work would have been impossible 
without the valuable database resources and facilities provided by the university. 
My sincere gratitude also goes to my family for their unlimited support and patience 
throughout the period of this work. Finally, I would like to dedicate this work to the 
memory of my mother, for her unconditional love, support, and motivation 
during her life. 
ABSTRACT 
The two-dimensional discrete wavelet transform (2-D DWT) has evolved into an essential 
part of modern compression systems. It offers superior compression with good image 
quality and overcomes a disadvantage of the discrete cosine transform, which suffers 
from blocking artifacts that reduce the quality of the image. The amount of 
computation involved in the 2-D DWT is enormous and cannot be handled by 
general-purpose processors when real-time processing is required. Therefore, a high-speed, 
low-power VLSI architecture that computes the 2-D DWT effectively is needed. In this 
research, several VLSI architectures that meet the real-time requirements of 
2-D DWT applications have been developed. The research started by 
implementing a software simulation program that decorrelates the original image and 
reconstructs it from the decorrelated image. Then, based on the 
information gained from implementing the simulation program, a new approach for 
designing lifting-based VLSI architectures for the 2-D forward DWT is introduced. As a 
result, two high-performance VLSI architectures that perform the 2-D DWT for 5/3 and 
9/7 filters are developed based on overlapped and nonoverlapped scan methods. Then, 
the intermediate architecture is developed, which aims at reducing the power 
consumption of the overlapped areas without using the expensive line buffer. To 
best meet real-time applications of the 2-D DWT with demanding requirements in 
terms of speed and throughput, parallelism is explored. The single pipelined 
intermediate and overlapped architectures are extended to 2-, 3-, and 4-parallel 
architectures to achieve speedup factors of 2, 3, and 4, respectively. To further 
demonstrate the effectiveness of the approach, single and parallel VLSI architectures 
for the 2-D inverse discrete wavelet transform (2-D IDWT) are developed. Furthermore, 
2-D DWT memory architectures, which have been overlooked in the literature, are 
also developed. Finally, to show that the architectural models developed for the 2-D DWT 
are simple to control, the control algorithms for the 4-parallel architecture based on the 
first scan method are developed. To validate the architectures developed in this work, five 
architectures are implemented and simulated on an Altera FPGA. 
In compliance with the terms of the Copyright Act 1987 and the IP Policy of the 
university, the copyright of this thesis has been reassigned by the author to the legal 
entity of the university, 
Institute of Technology PETRONAS Sdn Bhd. 
Due acknowledgement shall always be made of the use of any material contained 
in, or derived from, this thesis. 
©Name of candidate, Year of Thesis submission 
Institute of Technology PETRONAS Sdn Bhd 
All rights reserved 
TABLE OF CONTENTS 
Status of Thesis .......... i 
Approval Page .......... ii 
Title Page .......... iii 
Declaration .......... iv 
Acknowledgements .......... v 
Abstract .......... vi 
List of Figures .......... xiv 
List of Tables .......... xxi 
CHAPTER ONE: INTRODUCTION .......... 1 
1.1 Background .......... 1 
1.2 JPEG2000 Image Compression .......... 4 
1.3 Realization of 2-D DWT .......... 5 
1.4 Separable and nonseparable transform .......... 6 
1.5 Problem statement .......... 6 
1.6 Research objectives and approach .......... 8 
1.7 Contributions .......... 9 
1.8 Organization of the thesis .......... 9 
CHAPTER TWO: LITERATURE REVIEW .......... 11 
2.1 Introduction .......... 11 
2.2 RAM-based architectures .......... 14 
2.2.1 Direct architecture .......... 15 
2.2.2 Row-column and column-row architecture .......... 16 
2.2.3 1-level line-based architecture .......... 17 
2.2.4 Multi-level line-based architecture .......... 18 
2.3 Discussion .......... 18 
2.4 Review of the 1-level line-based architectures .......... 19 
2.5 Conclusion .......... 25 
CHAPTER THREE: ARCHITECTURE DEVELOPMENT .......... 27 
3.1 Introduction .......... 27 
3.2 Lifting-based 5/3 and 9/7 algorithms and architectures development .......... 28 
3.3 Data dependency graphs (DDGs) for 5/3 and 9/7 algorithms .......... 32 
3.4 External Architecture Development and Refinement .......... 32 
3.5 Overlapped and Nonoverlapped Scan Methods .......... 38 
3.6 Scan Based Architectures .......... 39 
3.7 Intermediate Architectures .......... 44 
3.7.1 Generalized Overlapped Scan method .......... 44 
3.7.2 Proposed External Intermediate Architecture .......... 47 
3.7.3 Second Dataflow .......... 48 
3.8 Processors Datapath Architectures Development .......... 52 
3.8.1 5/3 Processor's Datapath Architecture Development .......... 53 
3.8.2 9/7 Processor's Datapath Architecture Development .......... 54 
3.8.3 Row and Column Processors for 5/3 and 9/7 .......... 56 
3.9 Evaluation of architectures .......... 65 
3.10 Combined 5/3 and 9/7 Architecture .......... 74 
3.11 Conclusions .......... 75 
CHAPTER FOUR: PARALLEL ARCHITECTURES DEVELOPMENT .......... 76 
4.1 Introduction .......... 76 
4.2 Parallel architectures based on first scan method .......... 77 
4.2.1 2-parallel pipelined external architecture .......... 77 
4.2.2 3-parallel pipelined architecture .......... 81 
4.2.3 4-parallel pipelined architecture .......... 89 
4.2.4 Evaluations of architectures .......... 97 
4.3 Parallel form of the intermediate architectures .......... 101 
4.3.1 2-parallel pipelined intermediate architecture .......... 102 
4.3.2 Transition to the last run .......... 108 
4.3.3 3-parallel pipelined intermediate architecture .......... 111 
4.3.4 Scale factor multipliers reduction .......... 119 
4.3.5 Evaluation of performance .......... 120 
4.4 Conclusions .......... 124 
CHAPTER FIVE: DWT MEMORY ARCHITECTURES .......... 125 
5.1 Introduction .......... 125 
5.2 The LL-RAM architecture development .......... 126 
5.2.1 The LL-RAM read operations .......... 129 
5.2.2 The LL-RAM write operations .......... 130 
5.2.3 RAM architecture modifications for higher scan methods .......... 131 
5.2.4 RAM architecture using banks .......... 134 
5.3 Subband memory architecture development .......... 137 
5.3.1 The bank structure used in forming subband memory .......... 138 
5.3.2 Details of the subband memory architecture .......... 143 
5.3.3 Subband memory architecture for higher scan methods .......... 150 
5.4 Control Design for 4-parallel Architecture .......... 154 
5.4.1 Main Control Unit .......... 157 
5.4.2 Processors Control Unit .......... 164 
5.4.3 Read LL-RAM Control Unit .......... 174 
5.4.4 Write RAM/Subband Memory Control Unit .......... 178 
5.5 Conclusions .......... 187 
CHAPTER SIX: 2-DIMENSIONAL INVERSE DISCRETE WAVELET TRANSFORM ARCHITECTURE DEVELOPMENT .......... 188 
6.1 Introduction .......... 188 
6.2 Lifting-based 5/3 and 9/7 synthesis algorithms and data dependency graphs .......... 189 
6.3 Scan methods .......... 191 
6.4 Proposed External Architecture .......... 193 
6.5 Processors' architecture development .......... 198 
6.5.1 Inverse 5/3 processor's architecture development .......... 198 
6.5.2 Inverse 9/7 processor's datapath architecture .......... 200 
6.5.3 Combined inverse 9/7 and 5/3 processors architecture .......... 200 
6.5.4 Modified row and column processors for 5/3 and 9/7 external architecture .......... 201 
6.6 Performance Evaluation .......... 209 
6.7 Parallel Architecture Development .......... 211 
6.7.1 Proposed 2-parallel external architecture .......... 211 
6.7.2 Modified CPs and RPs for 5/3 and 9/7 2-parallel external architecture .......... 214 
6.8 Proposed 4-parallel external architecture .......... 217 
6.8.1 Column and row processors for 5/3 and 9/7 4-parallel external architecture .......... 220 
6.8.2 Modified CPs for 4-parallel architecture .......... 220 
6.8.3 Modified RPs for 4-parallel architecture .......... 222 
6.9 Performance evaluation .......... 224 
6.10 Conclusions .......... 227 
CHAPTER SEVEN: EXPERIMENTAL RESULTS .......... 228 
7.1 Performance analysis .......... 228 
7.2 Performance evaluations and comparisons .......... 229 
7.3 Experimental results and comparisons .......... 231 
7.4 Conclusions .......... 241 
CHAPTER EIGHT: CONCLUSIONS AND RECOMMENDATIONS .......... 242 
8.1 Conclusions .......... 242 
8.2 Recommendations .......... 246 
REFERENCES .......... 248 
APPENDIX A: Software simulation program development .......... 253 
APPENDIX B: Dataflow and control signals tables .......... 270 
APPENDIX C: FPGA compilation and synthesis results .......... 293 
APPENDIX D: Publications .......... 298 
List of Figures 
1.1.1 (a) The original image (b) decorrelated image (c) reconstructed image .......... 2 
1.1.4 A simplified compression system .......... 3 
1.2.1 JPEG 2000 encoding .......... 4 
1.3.1 Lifting-based tree-structured filter bank .......... 6 
2.1.1 One-dimensional tree-structured filter bank and Subband structure .......... 11 
2.1.2 Tree-structured filter bank for 2-D DWT for D levels decomposition .......... 12 
2.1.3 3-level of Wavelet decomposition of an image .......... 13 
2.2.1 Direct 2-D implementation .......... 15 
2.2.2 RCCR 2-D implementation .......... 16 
2.2.3 1-level line-based implementation .......... 17 
2.2.4 Multi-level line-based implementation .......... 18 
3.1.1 Lifting-based tree-structured filter bank with processors .......... 28 
3.2.1 Block diagram representation of Eq (4.2.4) .......... 30 
3.2.2 Three processors based 2-D DWT architecture .......... 31 
3.3.1 5/3 algorithm's DDGs for (a) odd and (b) even length signals .......... 33 
3.3.2 9/7 algorithm's DDG for odd (a) and even (b) length signals .......... 33 
3.4.1 Architecture for 2-D DWT .......... 36 
3.5.1 Overlapped scan method for 5/3 .......... 39 
3.5.2 Non-overlapped scan method for 5/3 .......... 40 
3.5.3 Overlapped scan method for 9/7 .......... 40 
3.6.1 Proposed overlapped scan architecture .......... 41 
3.6.2 Proposed non-overlapped scan architecture .......... 42 
3.7.1 The third overlapped scan method (a) for 5/3 and (b) for 9/7 .......... 45 
3.7.2 Proposed external intermediate architecture .......... 49 
3.8.1 5/3 processor's datapath architecture with symmetric extension .......... 54 
3.8.2 The 9/7 processor's datapath architecture with extension .......... 55 
3.8.3 Modified stage 2 of the 5/3 CP for overlapped and nonoverlapped .......... 58 
3.8.4(a) Modified first 9/7 CP for overlapped and nonoverlapped architectures .......... 58 
3.8.4(b) Modified second 9/7 CP for overlapped and nonoverlapped architectures .......... 59 
3.8.5 Modified stage 2 of 5/3 for intermediate CP .......... 59 
3.8.6 Incorporation of a TLB in stage 2 of the RP .......... 60 
3.8.7 TLB in a separate pipeline stage .......... 60 
3.8.8(a) Modified first 9/7 RP for overlapped and nonoverlapped architectures .......... 62 
3.8.8(b) Modified second 9/7 RP for overlapped and nonoverlapped architectures .......... 63 
3.8.9 Modified RP datapath for 5/3 and 9/7 intermediate architectures .......... 64 
3.8.10 Incorporation of a TLB in stage 3 of Figure 3.9.2 to form the 9/7 RP for Intermediate architecture .......... 64 
3.9.1 Pipelined overlapped parallel scan architecture .......... 68 
3.9.2 Pipelined nonoverlapped parallel scan architecture .......... 68 
3.9.3 Pipelined intermediate parallel scan architecture .......... 69 
3.10.1 Combined 9/7 and 5/3 processors datapath architecture .......... 74 
4.2.1 2-parallel pipelined external architecture .......... 78 
4.2.2 Modified RP .......... 81 
4.2.3 3-parallel pipelined architecture .......... 82 
4.2.4 Waveforms of the 2 clocks used in 3-parallel .......... 83 
4.2.5 Modified CPs datapath architecture .......... 87 
4.2.6 Modified stage 3 of CPs .......... 89 
4.2.7 4-parallel pipelined architecture .......... 90 
4.2.8 Waveforms of the 3 clocks used in 4-parallel .......... 91 
4.2.9 Modified stage 2 of the RPs datapath architecture .......... 93 
4.2.10 Control signals carried by CST and the block diagram .......... 96 
4.2.11 CP1 and CP2 exchange high coefficients .......... 97 
4.3.1 2-parallel pipelined intermediate architecture .......... 102 
4.3.2 Modified stage 2 of 5/3 RPs datapath architecture .......... 105 
4.3.3 Modified 9/7 RPs datapath for 2-parallel intermediate architecture .......... 106 
4.3.4 Control circuit that determines the last run .......... 110 
4.3.5 (a) Modified 5/3 CP for 2-parallel intermediate architecture .......... 111 
4.3.5 (b) Modified 9/7 CP for 2-parallel intermediate architecture .......... 112 
4.3.6 3-parallel pipelined intermediate architecture .......... 113 
4.3.7 Waveforms of the three clocks .......... 113 
4.3.8 (a) Modified 5/3 RPs datapath for 3-parallel intermediate architecture .......... 115 
4.3.8 (a, b) Modified 9/7 RPs datapath for 3-parallel intermediate architecture .......... 116 
5.1.1 General structure of a compression system .......... 126 
5.2.1 Block diagram of the memory module .......... 127 
5.2.2 RAM architecture using modules .......... 128 
5.2.3 Incorporation of register XR .......... 134 
5.2.4 Bank architecture with 8 modules and its block diagram .......... 135 
5.2.5 RAM architecture using bank .......... 136 
5.3.1 Subband memory architecture .......... 139 
5.3.2 Structure of the first bank and its block diagram .......... 140 
5.3.3 Structure of the second bank and its block diagram .......... 141 
5.3.4 Subband memory architecture built using the block diagram of the second bank .......... 142 
5.3.5 Architecture of the subband memory .......... 145 
5.3.6 Architecture of subband memory .......... 146 
5.3.7 Incorporation of register XR .......... 151 
5.3.8 Flowchart for subband memory write control algorithm .......... 153 
5.4.1 Subband memory interconnections to 4-parallel .......... 155 
5.4.2 DWT Control Unit .......... 156 
5.4.3 C-unit .......... 158 
5.4.4 ASM flowchart for A-unit and its block diagram .......... 161 
5.4.5 ASM chart for B-unit and its block diagram .......... 163 
5.4.6 ASM flowchart for TLB control unit and its block diagram .......... 166 
5.4.7 ASM flowchart for Extension Control Unit and its block diagram .......... 169 
5.4.8 ASM flowchart for CPs control unit and its block diagram .......... 172 
5.4.9 ASM chart for Read RAM Control Unit of the RAM architecture using modules and its block diagram .......... 175 
5.4.10 ASM chart for Read RAM Control Unit of the RAM architecture using banks and its block diagram .......... 177 
5.4.11 ASM chart for write subband memory control unit and its block diagram .......... 180 
5.4.12 ASM chart for write RAM control unit of the RAM architecture using modules, the block diagram, and the proposed clock signal .......... 184 
5.4.13 ASM flowchart for write RAM control unit of the RAM architecture using banks and block diagram .......... 186 
6.2.1 5/3 synthesis algorithm's DDGs for (a) odd and (b) even length signals .......... 190 
6.2.2 9/7 synthesis algorithm's DDGs for (a) odd and (b) even length signals .......... 190 
6.3.1 5/3 CP scan method .......... 192 
6.3.2 9/7 CP scan method .......... 192 
6.3.3 5/3 RP scan method .......... 193 
6.3.4 9/7 RP scan method .......... 193 
6.4.1 Proposed external architecture for 5/3 and 9/7 and combined 5/3 and 9/7 for 2-D IDWT and waveforms for clocks f and f/2 .......... 195 
6.5.1 Inverse 5/3 processor datapath architecture with symmetric extension .......... 199 
6.5.2 Inverse 9/7 processor datapath architecture with symmetric extension .......... 202 
6.5.3 Combined Inverse 9/7 and 5/3 processor datapath architecture .......... 203 
6.5.4 Modified inverse 5/3 CP datapath architecture with symmetric extension .......... 203 
6.5.5 Modified CP for 9/7 and combined 5/3 and 9/7 datapath architecture .......... 206 
6.5.6 Modified inverse 5/3 RP datapath architecture with symmetric extension .......... 205 
6.5.7 Modified RP for 9/7 and combined 5/3 and 9/7 datapath architecture .......... 206 
6.7.1 Proposed 2-parallel pipelined external architecture for 5/3 and 9/7 and combined 5/3 and 9/7 for 2-D IDWT and waveforms of the clocks .......... 212 
6.7.2 Modified inverse 5/3 CP for 2-parallel External architecture .......... 215 
6.7.3 Modified RP for 2-parallel architecture (a) 5/3 and (a) & (b) 9/7 .......... 216 
6.8.1 (a) Proposed 2-D IDWT 4-parallel pipelined external architecture for 5/3 and 9/7 and combined 5/3 and 9/7. (b) Waveforms of the clocks .......... 218 
6.8.2 Modified CPs for 5/3 CP 1 & 3 and 2 & 4 for 4-parallel architecture .......... 221 
6.8.3 (a) Modified 5/3 RPs 1 and 3 for 4-parallel External Architecture .......... 222 
6.8.3 (a, b) Modified 9/7 RPs 1 and 3 for 4-parallel External Architecture .......... 223 
7.3.1 Simulation Report - Simulation Waveforms for 5/3 "decorrelate_processor" .......... 236 
7.3.2 Simulation Report - Simulation Waveforms for 5/3 module "reconst_processor" .......... 236 
7.3.3 Simulation Report - Simulation Waveforms for first 9/7 "decrrelation2_processor" .......... 237 
7.3.4 Simulation Report - Simulation Waveforms for second 9/7 "decorelation_processor" .......... 238 
7.3.5 Simulation Report - Simulation Waveforms for 5/3 2-parallel module "decorelation_process .......... 239 
A.3.1 (a) Main program (b) Forward program .......... 260 
A.3.1 (c) Horizontal high pass decomposition flowchart .......... 261 
A.3.1 (d) Horizontal lowpass decomposition flowchart .......... 262 
A.3.2 (a) Inverse program .......... 263 
A.3.2 (b) Vertical lowpass flowchart .......... 264 
A.3.2 (c) Vertical high pass reconstruction flowcharts .......... 265 
A.3.2 (d) Horizontal lowpass flowchart .......... 266 
A.3.2 (e) Horizontal highpass reconstruction flowcharts .......... 267 
A.3.3 (a) The original image (b) decorrelated image (c) reconstructed image .......... 269 
A.3.4 Original image pixels highly correlated .......... 269 
A.3.5 Decorrelated image pixels decorrelated .......... 269 
B.11 Circuit .......... 275 
C.1.1 Compilation Report - Flow Summary for 5/3 module "decorrelate_processor" .......... 293 
C.1.2 Compilation Report - Power Analyzer summary for 5/3 "decorrelate_processor" .......... 293 
C.1.3 Compilation Report - Timing Analyzer Summary for 5/3 "decorrelate_processor" .......... 293 
C.2.1 Compilation Report - Flow Summary for 5/3 module "reconst_processor" .......... 294 
C.2.2 Compilation Report - Power Analyzer summary for 5/3 "reconst_processor" .......... 294 
C.2.3 Compilation Report - Timing Analyzer Summary for 5/3 "reconst_processor" .......... 294 
C.3.1 Compilation Report - Flow Summary for first 9/7 "decrrelation2_processor" .......... 295 
C.3.2 Compilation Report - Power Analyzer summary for 9/7 "decrrelation2_processor" .......... 295 
C.3.3 Compilation Report - Timing Analyzer Summary for 9/7 "decrrelation2_processor" .......... 295 
C.4.1 Compilation Report - Flow Summary for second 9/7 "decorelation_processor" .......... 296 
C.4.2 Compilation Report - Power Analyzer summary for 9/7 "decorelation_processor" .......... 296 
C.4.3 Compilation Report - Timing Analyzer Summary for 9/7 "decrrelation2_processor" .......... 296 
C.5.1 Compilation Report - Flow Summary for 5/3 2-parallel module "two-parallel_DWT" .......... 297 
C.5.2 Compilation Report - Power Analyzer summary for 5/3 2-parallel module "two-parallel_DWT" .......... 297 
C.5.3 Compilation Report - Timing Analyzer Summary for 5/3 2-parallel module "two-parallel_DWT" .......... 297 
List of Tables 
2.1 Summary of the RAM-based 2-D architecture .......... 19 
3.1 Control signal values .......... 43 
3.2 Reduce control signals .......... 44 
3.3 Symmetric extension's control signals for 5/3 .......... 54 
3.4 Symmetric extension's control signals for 9/7 .......... 56 
4.1 Shows scheduling patterns for CPs and registers involved .......... 85 
4.2 Control signal values .......... 85 
4.3 Shows how and when CPs exchange high coefficients .......... 88 
4.4 Control signal values for signal sc .......... 88 
4.5 Control signal values for s3, SL0, and SL1 .......... 104 
4.6 Control signal values for signals in stage 2 of both RP1 and RP2 .......... 107 
4.7 Control signal values for s2, sl0, and sl1 in the last run .......... 109 
4.8 Control signal values .......... 118 
6.1 Control signal values for eth, ell, and sr .......... 198 
6.2 Extension's control signals .......... 200 
6.3 Control signal values for 9/7 RP .......... 208 
6.4 Dataflow of the 5/3 RP1 .......... 217 
7.1 Comparisons of several 1-level (9/7) 2-D DWT architectures .......... 230 
7.2 Experimental results and comparisons .......... 234 
B.1 Dataflow for Figures 4.6.1 and 4.6.2 .......... 270 
B.2 (a) Dataflow of the second 9/7 pipelined overlapped architecture for even N .......... 271 
B.2 (b) Dataflow of the second 9/7 pipelined overlapped architecture for odd N .......... 272 
B.2 (c) Control signal values .......... 272 
B.3 Dataflow of the intermediate architecture .......... 273 
B.4 Second dataflow for the architecture .......... 274 
XVI 
8.5 (a) Control signal values------------------------- ------------------------------------275 
8.5 (b) Control signal values for sre2 ---------------------------------------------------275 
8.6 5/3 Dataflow for overlapped and nonoverlapped parallel scan architecture --275 
B. 7 5/3 Dataflow for intermediate parallel scan architecture ------------------------276 
8.8 Dataflow for 2-parallel architecture ------------------------------------------------276 
8. 9 Dataflow of the architecture ---------------------------------------------------------277 
B.l 0 5/3 4-parallel architecture's dataflow ---------------------------------------------278 
B.ll 4-parallel 's TL8s read and write dataflow ----------------------------------------279 
B.l2 Dataflow for 2-parallel intermediate architecture --------------------------------280 
8.13 Dataflow of the last run for cases 4 and 3 when N is even ----------------------281 
8.14 Dataflow of the last run for cases 4 and 3 when N is odd -----------------------282 
B.l5 Dataflow of the last run for cases 2 and 1 when N is even ----------------------283 
8.16 Dataflow of the last run for cases 2 and 1 when N is odd -----------------------284 
8.17 Dataflow of the 3-parallel intermediate architecture -----------------------------285 
B.l8 Dataflow of the 5/3 architecture-------------------------------------------------- 287 
8.19 (a) dataflow for 9/7 architecture from CP side -----------------------------------288 
8.19 (b) dataflow for 9/7 architecture from RP side ----------------------------------- 289 
8.20 Dataflow for 2-parallel 5/3 architecture ------------------------------------------290 






Image compression plays an important role in real-time applications, especially in
bandwidth-limited applications such as the internet, mobile phones, and telemedicine.
Images are compressed for fast transmission over a network and for efficient storage.
Image compression takes advantage of the redundant information contained in the
original image. The redundancy exists in the form of statistical dependencies among
pixels, especially neighboring pixels. Because neighboring pixels are highly
correlated, applying a compression algorithm directly to the original image pixels
would yield a poor compression ratio. Therefore, transforms such as the Fast Fourier
Transform (FFT), the Discrete Cosine Transform (DCT), and the Discrete Wavelet
Transform (DWT) are used to decorrelate the original image pixels so that they become
amenable to compression. Compared with the DCT, the two-dimensional discrete wavelet
transform (2-D DWT) is very efficient in decorrelating image pixels and thus leads to
superior compression performance. As indicated in Figure 1.1.1 (b), the DWT also
naturally supports progressive transmission, which is difficult to implement in
DCT-based compression. The 2-D DWT has evolved into an effective and powerful tool in
many applications, especially in image processing and compression [1, 2].
To show the correlation property of the original image pixels, the pixels of the
original image shown in Figure 1.1.1 (a) are plotted in Figure A.3.4. The plot shows
that the original image pixels are highly correlated. When the pixels of the original
image are applied to the forward discrete wavelet transform (FDWT) software simulation
program that we have developed, which is listed in Appendix A, the result is the
decorrelated image shown in Figure 1.1.1 (b). The pixels of the decorrelated image
shown in Figure 1.1.1 (b) are then plotted in Figure A.3.5, which displays a flat
image, indicating that the amount of correlation among pixels has been greatly
reduced.

Figure 1.1.1 (a) The original image (b) Decorrelated image (c) Reconstructed image
The 2-D DWT considered in this research is part of a wavelet-based compression
system such as JPEG2000, as shown in Figure 1.1.4. The function of the forward
discrete wavelet transform (FDWT) in a compression system is to decorrelate the
image pixels prior to the compression step [3]. Thus, the DWT is used to effectively
decorrelate the image pixels in order to achieve higher compression rates [4, 5]. The
decorrelation step can be thought of as introducing a distortion to the original
image pixels so that they become amenable to compression.
After transmitting to a remote site, the original image must be reconstructed from 
the decorrelated image. The task for reconstructing and completely recovering the 


















The amount of computation involved in both the decorrelation and reconstruction
steps is enormous and requires very high processing power that cannot be achieved
by general-purpose processors, especially when real-time processing is required.
Therefore, high speed, low power, and low memory VLSI architectures that compute the
2-D DWT effectively are needed. The objective of this research is to develop such
architectures, based on the lifting scheme [4, 5, 6], that meet the real-time
requirements of 2-D DWT applications. Compared with the convolution-based approach,
the lifting-based approach involves less computation and lower memory, facilitates
high speed and efficient implementation of the wavelet transform, and is attractive
for high throughput and low power applications.
1.2 JPEG2000 Image Compression 
JPEG2000 was developed to provide high rates of compression with good image
quality and to overcome the disadvantages of the previous JPEG standard, which uses
DCT-based image compression [7, 8] and suffers from blocking artifacts that reduce
the quality of the image.
The JPEG2000 standard uses 2-dimensional, separable, nonexpansive, symmetric
extension wavelet transforms. In this process the whole image is transformed into
different resolution levels using the DWT. In the case of a large image, the image is
optionally decomposed (divided) into a number of non-overlapping rectangular blocks
called tiles, and the DWT is applied inside each tile independently. The DWT uses
either the reversible 5/3 filter, which provides lossless coding, or the nonreversible
9/7 filter, which provides a higher compression ratio with lossy coding. The DWT
decomposes an image into subbands, and the coefficients of each subband are then
partitioned into rectangular code-blocks, as illustrated in Figure 1.2.1, which are
coded independently using EBCOT (Embedded Block Coding with Optimized Truncation).
EBCOT is the name given to the entropy encoder in JPEG2000, and it differs from
JPEG's encoder in that the division into independent non-overlapping code-blocks is
done after the transform instead of before the transform. EBCOT, which contains tier-1
and tier-2 coding, relies upon independent coding of relatively small blocks of
subband samples (e.g., 64 x 64 or 32 x 32 samples). In tier-1 each code-block is
independently entropy coded, and in tier-2 each encoded bit-stream is optimally
truncated such that an overall desired bit rate is achieved. Tier-2 is implemented in
software whereas tier-1 is implemented in hardware [8].
[Figure: block diagram in which the DWT produces subbands that are passed to the
Tier-1 and Tier-2 stages of EBCOT, producing the compressed bit stream]
Figure 1.2.1 JPEG 2000 encoding
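The partitioning of a subband into code-blocks described above can be sketched in a few lines of Python. This is only an illustration: the function name `partition_into_code_blocks` and the tuple layout are assumptions made here, and the handling of partial blocks at the right and bottom edges is a simplification rather than a statement of the JPEG2000 rules.

```python
def partition_into_code_blocks(height, width, block=64):
    """Cover a height x width subband with non-overlapping code-blocks.

    Returns a list of (row, col, block_height, block_width) tuples.
    Blocks on the bottom and right edges may be smaller than `block`.
    """
    blocks = []
    for row in range(0, height, block):
        for col in range(0, width, block):
            blocks.append((row, col,
                           min(block, height - row),
                           min(block, width - col)))
    return blocks


if __name__ == "__main__":
    # A 256 x 256 subband with 64 x 64 code-blocks.
    for blk in partition_into_code_blocks(256, 256, block=64)[:3]:
        print(blk)
```

Each tuple identifies one block that a tier-1 coder could process independently of the others, which is the property the text attributes to EBCOT.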
1.3 Realization of 2-D DWT 
The realization of the DWT filter bank can be classified into two categories: one is
based on the convolution operation [10], [11], [12], and the other is based on the
lifting scheme [13], [14], [15]. The tree-structured filter bank is the realization of
the 2-D DWT based on the convolution operation. The high-pass and low-pass filters of
the filter bank are usually FIR (finite impulse response) filters, and FIR filtering
involves the convolution operation. This direct realization is termed
convolution-based DWT. Convolution-based DWT is computationally intensive and requires
a large number of registers, features that are not desirable in high-speed and
low-power VLSI implementation.
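On the other hand, the lifting-based scheme proposed by Daubechies [4, 5, 6]
involves less computation and lower memory. The basic principle of the lifting scheme
is to factorize the polyphase matrix of the wavelet filters into a sequence of
alternating upper and lower triangular matrices, called lifting steps, and a diagonal
matrix [4, 5]. The polyphase representation divides the filters into even and odd
parts as follows [16]:

\[
\tilde{h}(z) = \tilde{h}_e(z^2) + z^{-1}\,\tilde{h}_o(z^2), \qquad
\tilde{g}(z) = \tilde{g}_e(z^2) + z^{-1}\,\tilde{g}_o(z^2)
\tag{1.1}
\]

where $\tilde{h}(z)$ and $\tilde{g}(z)$ are the low-pass and high-pass analysis
filters, $\tilde{h}_e(z)$ and $\tilde{h}_o(z)$ are the even and odd parts of
$\tilde{h}(z)$, and $\tilde{g}_e(z)$ and $\tilde{g}_o(z)$ are the even and odd parts
of $\tilde{g}(z)$. Eq. (1.1) can be represented in matrix form, called the polyphase
matrix, $\tilde{P}(z)$:

\[
\tilde{P}(z) =
\begin{bmatrix}
\tilde{h}_e(z) & \tilde{h}_o(z) \\
\tilde{g}_e(z) & \tilde{g}_o(z)
\end{bmatrix}
\tag{1.2}
\]

If the determinant of $\tilde{P}(z)$ is one, then the polyphase matrix can be
factorized into lifting steps [4], as follows:

\[
\tilde{P}(z) = \prod_{i=1}^{m}
\begin{bmatrix} 1 & s_i(z) \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} 1 & 0 \\ t_i(z) & 1 \end{bmatrix}
\begin{bmatrix} K & 0 \\ 0 & 1/K \end{bmatrix}
\tag{1.3}
\]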
It is a well-known result in matrix algebra that any matrix with polynomial entries
and determinant one can be factored into such elementary matrices. Figure 1.3.1 shows
the lifting-based tree-structured filter bank representation of the 2-D DWT. The new
representation leads to a faster implementation of the wavelet transform and is
attractive for both high throughput and low-power applications. In addition, the
computational complexity of the lifting algorithm can be as low as half of that of
the convolution algorithm [4]. Therefore, the lifting-based DWT has become the
preferred scheme for VLSI implementation, and it has been selected as the transform
coder for image compression in the released JPEG2000 standard.
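To make the idea of lifting steps concrete, the following Python sketch applies one level of a 1-D integer 5/3 lifting transform and its inverse. It is a minimal illustration under stated assumptions, not the simulation program of Appendix A or any architecture in this thesis: it uses the commonly documented predict and update steps of the reversible 5/3 filter, handles only even-length inputs, and mirrors samples at the boundaries in the manner of symmetric extension. The function names are hypothetical.

```python
def lift53_forward(x):
    """One level of integer 5/3 lifting on an even-length list of integers.

    Returns (s, d): the low-pass and high-pass coefficients.
    """
    n = len(x)
    assert n >= 2 and n % 2 == 0, "sketch handles even-length input only"
    even, odd = x[0::2], x[1::2]
    half = n // 2
    # Predict step: each odd sample minus the floor-average of its even
    # neighbours. At the right edge the missing neighbour is mirrored.
    d = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        d.append(odd[i] - (even[i] + right) // 2)
    # Update step: each even sample plus a rounded quarter of the two
    # neighbouring high-pass values. At the left edge d[-1] is mirrored to d[0].
    s = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        s.append(even[i] + (left + d[i] + 2) // 4)
    return s, d


def lift53_inverse(s, d):
    """Undo lift53_forward by running the lifting steps in reverse order."""
    half = len(s)
    even = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        even.append(s[i] - (left + d[i] + 2) // 4)
    odd = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[i]
        odd.append(d[i] + (even[i] + right) // 2)
    x = []
    for e, o in zip(even, odd):
        x.extend([e, o])
    return x


if __name__ == "__main__":
    samples = [10, 12, 14, 200, 13, 11, 9, 7]
    low, high = lift53_forward(samples)
    print(low, high)
    print(lift53_inverse(low, high) == samples)
```

Because every step only adds to or subtracts from one polyphase branch a quantity computed from the other branch, the inverse is obtained by reversing the order of the steps and flipping the signs, which is why integer rounding does not prevent exact reconstruction.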
1.4 Separable and nonseparable transforms 
There are two approaches to computing the 2-D DWT: separable and nonseparable [12].
A key practical advantage of separable transforms is that they may be implemented by
applying the one-dimensional transform first to the rows of the image and then to its
columns. The inverse transform is implemented in an analogous manner. A
nonseparable approach to the 2-D DWT directly decomposes an image into four
subimages without performing row and column processing one after another [17].
However, the four dedicated 2-D filters require considerably more hardware resources.
Figure 1.3.1 Lifting-based tree-structured filter bank 
1.5 Problem statement 
A VLSI architecture for the 2-D DWT that meets the real-time requirements of 2-D DWT
applications has not yet been completely and accurately developed. There is a need
for a comprehensive and detailed study of the 2-D DWT algorithms in order to develop
more accurate architectures. A thorough understanding of the DWT algorithms
can be gained by developing a software simulation program for both the
decorrelation and reconstruction processes. Developing a simulation program
gives the hardware architecture designer a valuable opportunity to learn the
behavior of the algorithm in detail and to acquire a firm understanding, which in
turn enables the designer to develop a more accurate architecture.
Furthermore, the internal memory of the 2-D DWT processor, which dominates
the hardware cost and the complexity of the architecture, is still high, while the
external memory consumes the most power. Therefore, this research focuses on
effectively reducing the internal memory, or temporary line buffer (TLB),
requirements of 2-D DWT architectures. In addition, novel and accurate architectures
for the 2-D DWT are developed that meet high speed and low memory requirements. A
specific architecture is also developed that aims at reducing the power consumed by
external memory accesses. The intermediate architecture developed in chapter 3
addresses this issue, and a 22% reduction in power consumption has been achieved.
The DWT decomposes an NxM image into subbands. These subbands must be stored
by the DWT unit in a memory unit in a specific order that preserves the subband
boundaries, so that the subbands can be manipulated effectively by both the DWT and
compression units. This requires developing specific VLSI memory
architectures for the 2-D DWT. DWT memory architectures have usually been
overlooked in the literature. Since 2-D DWT memory architectures are as
important as the DWT processor architectures commonly covered in the literature, two
novel VLSI architectures, for the LL-RAM and the subband memory, are developed in
this work. Furthermore, to show that the architectures developed in this research are
simple to control, one of the architectures is selected and its control algorithms
are developed. Both pipelining and parallelism are explored to further improve
performance in terms of speed and throughput, in order to best meet real-time
applications of the 2-D DWT with demanding requirements.
1.6 Research objectives and approach 
The objective of the research is to develop VLSI architectures for both decorrelation
and reconstruction processors that meet the real-time requirements of 2-D DWT
applications. In developing VLSI architectures for 2-D DWT processors, our goals are
to achieve high speed, low power, low memory, and complete hardware utilization.
In this work, specifically, the lossless 5/3 and lossy 9/7 algorithms, explicitly
defined by the JPEG2000 image compression standard, are used for the development of
the VLSI architectures of the 2-D DWT decorrelation and reconstruction
processors. In addition, the symmetric extension algorithm recommended by JPEG2000
for boundary treatment is incorporated into the 5/3 and 9/7 data dependency graphs
(DDGs) and is implemented by the architectures developed in this research.
To verify that the architectures developed in this research are efficient and
accurately perform their intended functions, some selected architectures, which are
representative of the other architectures, are implemented on FPGA, and a
timing simulation is performed to validate the logical operations of the
designs.
The approach, or strategy, adopted in the development of the 2-D DWT
architectures is based on the observation that the DDGs for the 5/3 and 9/7
algorithms are identical when they are looked at from outside, taking into
consideration only input and output requirements, but differ in their internal
details. Based on this observation, the first level of the architecture, called the
external architecture, which is identical for both 5/3 and 9/7, is developed. Then,
the internal details of the DDGs are considered in order to develop separately, for
each of the 5/3 and 9/7 filters, the processors' datapath architectures that can be
incorporated into the external architecture, since the DDGs internally define and
specify the structure of the processors.
This new approach can be used effectively not only in the development of 5/3 and 9/7
based architectures, but also in architecture development for any
2-D DWT algorithm, and it is expected to yield very efficient architectures in terms
of hardware complexity, speed, and power consumption with manageable control
complexity.
1.7 Contributions
This research has contributed several novel VLSI architectural models
developed specifically for the 2-D DWT, as follows. First, a software simulation
program is developed that performs both decorrelation and reconstruction of an MxN
image. Then, two single pipelined architectures based on the overlapped and
nonoverlapped scan methods are developed for both 5/3 and 9/7, followed by the single
pipelined intermediate architecture. These three single pipelined architectures are
then extended to 2-, 3-, and 4-parallel architectures. In addition, modified datapath
processor architectures that can be incorporated into the single and parallel
architectures are also developed.
The research has also addressed one of the critical issues overlooked in the
literature, the 2-D DWT memory architectures, and has developed two novel VLSI
architectures for the LL-RAM and the subband memory. Furthermore, to show that the
architectures developed in this research are simple to control, the control model and
its algorithms for the 4-parallel architecture based on the first scan method are
developed.
Finally, to show the effectiveness of the approach, the inverse DWT architectures
for single and parallel 5/3, 9/7, and combined 5/3 and 9/7 are developed.
Significant parts of this research have been published in international conferences
and journals, which are listed in Appendix D.
1.8 Organization of the thesis 
Chapter 2 introduces the tree-structured filter bank for the 1-D and 2-D DWT and the
classification of 2-D DWT architectures. Then 1-level line-based architectures, which
adopt the level-by-level approach to achieve multi-level decompositions, are reviewed.
In chapter 3, the data dependency graphs (DDGs) of the algorithms are derived.
Based on the DDGs, the overlapped and nonoverlapped single pipelined architectures
are developed. The intermediate architecture, which is an alternative form that
reduces the power consumption of the overlapped areas, is also developed.
In chapter 4, in order to best meet real-time applications of the DWT with demanding
requirements, the parallel architectures based on the first scan method and the
parallel form of the intermediate architectures are developed.
In chapter 5, DWT memory architectures for the LL-RAM and the subband memory,
which have been overlooked in the literature, are developed. To show that the
architectures developed in this work are easy to control, the control algorithms of
the 4-parallel architecture are developed.
In chapter 6, to show the effectiveness of the approach and techniques adopted in
the forward architectures, the single and parallel architectures for the inverse 5/3
and 9/7 are developed.
Chapter 7 presents performance evaluations and experimental results; five of the
architectures developed in this research are implemented on an Altera FPGA and then
simulated for validation.






The basic operation of a discrete wavelet transform is as follows. A pair of filters
derived from the wavelet, a low-pass filter (h0) and a high-pass filter (h1), is
applied to a discrete signal containing N samples to decompose it into a low
frequency band (L) and a high frequency band (H). Each band is subsampled (decimated)
by a factor of two, so that the two frequency bands each contain N/2 samples. A
tree-structured transform is obtained by applying the L band again to a pair of low-
and high-pass filters [15]. The one-dimensional case is illustrated in Figure 2.1.1.
The recursive subdivision is continued for J levels, yielding a total of (J + 1)
subbands. The low frequency subband L_J contains N/2^J samples, while the remaining
subbands H_j contain N/2^j samples for 0 < j ≤ J.
[Figure: (a) input x[k] applied to the tree-structured filter bank; (b) subbands
L4, H4, H3, H2, H1 along the frequency axis]
Figure 2.1.1 (a) One-dimensional tree-structured filter bank; (b) Subband structure
for J = 4 levels decomposition.
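The subband counts and sizes of the one-dimensional tree-structured transform can be checked with a short Python sketch. The averaging and differencing (Haar-like) filter pair used here is an assumption chosen only to keep the example small; it stands in for whatever low-pass and high-pass pair is derived from the wavelet, and the function names are hypothetical.

```python
def analyze_once(signal):
    """Split a signal into a low band and a high band, each decimated by two."""
    half = len(signal) // 2
    low = [(signal[2 * i] + signal[2 * i + 1]) / 2 for i in range(half)]
    high = [(signal[2 * i] - signal[2 * i + 1]) / 2 for i in range(half)]
    return low, high


def tree_decompose(signal, levels):
    """Apply analyze_once recursively to the low band for `levels` levels.

    Returns a dict with keys H1..HJ and LJ, where J == levels.
    """
    subbands = {}
    low = list(signal)
    for j in range(1, levels + 1):
        low, high = analyze_once(low)
        subbands[f"H{j}"] = high
    subbands[f"L{levels}"] = low
    return subbands


if __name__ == "__main__":
    bands = tree_decompose(list(range(64)), levels=4)
    for name, coeffs in bands.items():
        print(name, len(coeffs))
```

For N = 64 and J = 4 the sketch produces J + 1 = 5 subbands whose lengths follow N/2^j for the high bands and N/2^J for the final low band.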
A two-dimensional transform is constructed by "separable extension" of the
one-dimensional transform. In this approach each row of a 2-D image is filtered with
low-pass (h0) and high-pass (h1) filters, and the output of each filter is
down-sampled (decimated) by a factor of two to produce the intermediate images L and
H, as shown in Figure 2.1.2. L is the original image low-pass filtered and
down-sampled in the horizontal direction, and H is the original image high-pass
filtered and down-sampled in the horizontal direction. Next, each column of these new
images is filtered with low- and high-pass filters in the vertical direction and
down-sampled by a factor of two to produce four sub-images (LL, LH, HL, and HH).
These four subband images can be combined to create an output image with the same
number of samples as the original. The four subband images contain all of the
information present in the original image, but the sparse nature of the LH, HL, and
HH subbands (many samples in these subbands are zero or close to zero) makes them
amenable to compression.
In an image compression application, the two-dimensional wavelet decomposition
described above is applied again to the LL image, forming four new subband
images. The resulting low-pass image is iteratively filtered to create a tree of
subband images, as shown in the filter bank of Figure 2.1.2. The subband structure is
shown in Figure 2.1.3 for a 3-level decomposition of an image. In Figure 2.1.2 the
notations *h and *v denote horizontal and vertical convolution along the rows and
columns of the image, respectively, and the accompanying decimation symbols denote
horizontal and vertical decimation by 2 (down-sampling by 2). Note that only one of
the four subbands, the LL band, is recursively decomposed into further subbands. If
the recursive subdivision is continued for J levels, it yields a total of (3J + 1)
subbands, with non-uniformly spaced pass bands.

[Figure 2.1.2 panel labels: First level decomposition; J-th level decomposition]

Figure 2.1.3 3-level wavelet decomposition of an image
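The row-then-column procedure and the (3J + 1) subband count can be illustrated with the following Python sketch. It assumes a simple averaging and differencing filter pair in place of the 5/3 or 9/7 filters, assumes image dimensions divisible by 2^J, and takes the first letter of each subband name to refer to the horizontal (row) filtering pass; these are illustrative choices made here, and the function names are hypothetical.

```python
def split(seq):
    """Low/high split of one row or column, decimated by two."""
    half = len(seq) // 2
    low = [(seq[2 * i] + seq[2 * i + 1]) / 2 for i in range(half)]
    high = [(seq[2 * i] - seq[2 * i + 1]) / 2 for i in range(half)]
    return low, high


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def filter_rows(image):
    """Filter every row; return the low-pass and high-pass half-width images."""
    lows, highs = [], []
    for row in image:
        low, high = split(row)
        lows.append(low)
        highs.append(high)
    return lows, highs


def dwt2_one_level(image):
    """Horizontal pass on rows, then vertical pass on columns."""
    l_img, h_img = filter_rows(image)
    # Columns are processed by transposing, filtering rows, transposing back.
    ll_t, lh_t = filter_rows(transpose(l_img))
    hl_t, hh_t = filter_rows(transpose(h_img))
    return {"LL": transpose(ll_t), "LH": transpose(lh_t),
            "HL": transpose(hl_t), "HH": transpose(hh_t)}


def dwt2(image, levels):
    """Recursively decompose only the LL band for `levels` levels."""
    subbands = {}
    ll = image
    for j in range(1, levels + 1):
        bands = dwt2_one_level(ll)
        for name in ("LH", "HL", "HH"):
            subbands[f"{name}{j}"] = bands[name]
        ll = bands["LL"]
    subbands[f"LL{levels}"] = ll
    return subbands


if __name__ == "__main__":
    image = [[8 * r + c for c in range(8)] for r in range(8)]
    result = dwt2(image, levels=3)
    for name, band in result.items():
        print(name, f"{len(band)}x{len(band[0])}")
```

For an 8 x 8 input and J = 3 the sketch yields 3J + 1 = 10 subbands whose sample counts add up to the 64 samples of the original image, matching the statement that the combined subbands have the same number of samples as the input.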
The LL subband of a nonuniform subband decomposition is a low resolution
version of the original image. It follows that the lowpass subbands,
identified as LL_d in Figure 2.1.2, represent a family of successively lower
resolution versions of the original image. The sampling density for LL_d is 2^(-d)
times that of the original image in each direction, where d = 1, 2, ..., J. However,
all these low resolution images are intermediate results; only LL_J is actually one
of the subbands of the final tree-structured transform. Each of the images in this
multiresolution family may be recovered by partial application of the synthesis
system. LL_(J-1), for example, may be synthesized from subbands LL_J, LH_J, HL_J, and
HH_J, while LL_(J-2) may be synthesized from these subbands together with LH_(J-1),
HL_(J-1), and HH_(J-1).
This multiresolution property is particularly interesting for image compression
applications. It provides a mechanism whereby a compressed bit-stream may be
partially decompressed to obtain successively higher resolution versions of the
original image. To be more specific, let R_j be the set consisting of subbands
LH_(J+1-j), HL_(J+1-j), and HH_(J+1-j) for 0 < j ≤ J, and let R_0 be the set
consisting of only subband LL_J. These groupings are also identified in Figure 2.1.2.
We refer to the R_j as resolution levels, since R_0 contains the lowest resolution
image and each successive resolution level, R_j, contains the additional information
required to reconstruct the next member of the multiresolution family. Suppose now
that the elements of each set R_j, 0 ≤ j ≤ J, are compressed independently and that
their compressed representations are separately identifiable within the compressed
image representation. Then the compressed representation has a property known as
"resolution scalability," whereby a compressed representation of any member of the
multiresolution family may be obtained simply by discarding those pieces
corresponding to the irrelevant resolution levels, R_j. For image compression
applications, the interest in dyadic decompositions, and hence in two-channel subband
transforms, is driven primarily by the significance of resolution scalability.
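The grouping of subbands into resolution levels can be written down directly. The Python sketch below builds the sets R_0, ..., R_J as defined above and lists which subbands must be kept to reconstruct a given member of the multiresolution family; the function names and the string labels for subbands are hypothetical and chosen only for illustration.

```python
def resolution_levels(J):
    """Return {j: [subband names]} for the resolution levels R_0 .. R_J."""
    levels = {0: [f"LL{J}"]}
    for j in range(1, J + 1):
        d = J + 1 - j  # decomposition level whose detail subbands form R_j
        levels[j] = [f"LH{d}", f"HL{d}", f"HH{d}"]
    return levels


def subbands_to_keep(J, r):
    """Subbands needed to reconstruct LL_(J-r), i.e. levels R_0 .. R_r.

    r = 0 keeps only the lowest resolution image; r = J gives the full image.
    """
    levels = resolution_levels(J)
    return [name for j in range(r + 1) for name in levels[j]]


if __name__ == "__main__":
    for r in range(4):
        print(r, subbands_to_keep(3, r))
```

With J = 3, keeping R_0 and R_1 retains LL3, LH3, HL3, and HH3, which is exactly the set from which the text says LL_(J-1) may be synthesized; everything in R_2 and R_3 can be discarded for that resolution.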
In the literature, 2-dimensional discrete wavelet transform (2-D DWT)
architectures are classified into two categories [1, 13]: convolution-based and
lifting-based. Convolution-based architectures implement the two-channel filter bank
directly. Such an implementation demands intensive computation and a large amount of
storage, features that are not desirable for either high speed or low power
applications [13]. On the other hand, lifting-based architectures involve less
computation and lower memory, facilitate high speed and efficient implementation of
the wavelet transform, and are attractive for both high throughput and low power
applications.
2.2 RAM-based architectures 
Many VLSI architectures have been proposed for the 2-D DWT in the literature [13, 27,
29, 34]. Nevertheless, only RAM-based architectures are practical for real-life
designs because of their greater regularity, density of storage, and simple control
circuits [1]. However, according to [51], the memory issue dominates the hardware
cost and complexity of the architecture and is the most critical part of a 2-D DWT
architecture, whereas the number of multipliers decides the performance of
one-dimensional (1-D) DWT architectures. Thus, for 2-D DWT architectures, the memory
issues, including internal memory size and external frame memory access, are the
most critical problems. The internal memory generally dominates the hardware cost,
whereas the external frame memory access consumes the most power [51]. In [1],
RAM-based architectures for the 2-D DWT are categorized as follows.
2.2.1 Direct Architecture 
The most straightforward implementation is to perform the 1-D DWT in one direction
and store the intermediate coefficients in the same frame memory, and then to
perform the 1-D DWT on these intermediate coefficients in the other direction to
complete the 1-level 2-D DWT, as illustrated in Figure 2.2.1. For the other
decomposition levels, the lowpass-lowpass (LL) subband of the current level is
treated as the input














Figure 2.2.1 Direct 2-D implementation. (a) System architecture. (b) Data flow of 
external memory access (J = 3; white and grey parts represent 
external frame memory reads and writes, respectively). 
2.2.2 Row-column and column-row (RCCR) architecture 
The direct architecture always processes row coefficients first in every
decomposition level, whereas the RCCR architecture processes row-column for odd-level
decompositions and column-row for even-level decompositions [50], so that the
two successive row-wise or column-wise 1-D DWT decompositions can be performed
simultaneously, as illustrated in Figure 2.2.2. The DWT module of the RCCR
architecture can be implemented by folding two successive decompositions into one 1-D
DWT module, which stores the coefficients in a line buffer of size N/2 and then
performs the latter level decomposition with the stored coefficients. The merging of
two successive decompositions in the same direction can decrease the external memory
access bandwidth by one half for every level except the first level decomposition.




Figure 2.2.2 RCCR 2-D implementation. (a) System architecture.
(b) Data flow of external memory access (J = 3).
2.2.3 1-level Line-Based Architecture
Unlike the direction-by-direction approach of the direct and RCCR architectures, each
level of the DWT decomposition can be performed at a time, and the multi-level
decompositions can be achieved by using the level-by-level approach, as illustrated in
Figure 2.2.3. However, this approach may require some internal memory, whose size
is proportional to the image width, to store the intermediate DWT coefficients of one
direction and to supply the input signals for the DWT decomposition in the other
direction [11].
The external memory bandwidth of the 1-level line-based architecture is exactly
one half of that of the direct architecture. This is due to the utilization of internal
buffers. Furthermore, unlike the direct architecture that uses the whole frame buffer of
size N^2 as the intermediate coefficient buffer, the 1-level line-based architecture only




Figure 2.2.3 1-level line-based implementation. (a) System architecture.
(b) Data flow of external memory access (J = 3).
2.2.4 Multi-Level Line-Based Architecture 
Instead of the level-by-level approach, the multi-level line-based architecture
performs all of the decomposition levels simultaneously, as illustrated in Figure
2.2.4. However, a direct implementation that cascades J 1-level line-based
architectures results in very low hardware utilization. In addition, the multi-level
2-D architecture requires more internal buffer and a suitable task assignment for the
1-D DWT modules, but it reduces the external memory access bandwidth to the minimum,
2N^2.




Figure 2.2.4 Multi-level line-based implementation. (a) System architecture.
(b) Data flow of external memory access (J = 3).
2.3 Discussion 
Based on Table 2.1 [1], the multi-level line-based architecture has the highest
hardware cost, including the internal line buffer, multiple 1-D DWT modules, and
complex control circuits. In addition, simultaneously interleaving the computations
of the first decomposition level with the computations of all subsequent levels is
a very complex mechanism to control, which makes this approach
impractical for real-time implementation. However, it requires the least external
memory bandwidth and does not use the external frame buffer to store intermediate data.
The simplest architecture, the direct architecture, has the least hardware cost but
requires the most external memory bandwidth. The RCCR architecture can decrease the
external memory bandwidth of the direct architecture by using one small line buffer.

Table 2.1 Summary of the RAM-based 2-D architectures [1]

Architecture   External memory access   Line buffer   Intermediate frame   Control      System
               (words/image)            (words)       buffer (words)       complexity   integration
Direct         5.33N^2                  -             N^2                  Simple       Difficult
RCCR (RPA)     4.67N^2                  -             N^2                  Medium       Difficult
RCCR (N/2)     4.67N^2                  0.5N          N^2                  Simple       Difficult
1-level        2.67N^2                  kN            N^2/4                Medium       Medium
Multi-level    2N^2                     2kN           -                    Complex      Simple
The 1-level line-based architecture, which adopts the level-by-level approach to
achieve multi-level decompositions, is simple to control. In addition, the
1-level line-based architecture is the most practical for real-time implementation
because of its greater regularity, which suits VLSI implementation well. Therefore,
this research focuses on 1-level line-based architectures, and the related work in
the literature is reviewed in the next section.
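The external memory access figures in Table 2.1 are consistent with a simple geometric-series accounting, sketched below in Python. The accounting itself is an assumption made here for illustration and is not taken from [1]: level k operates on an image of N^2/4^(k-1) words; the direct architecture reads and writes that image once for the row pass and once for the column pass; the RCCR architecture halves this traffic for every level after the first; the 1-level line-based architecture reads and writes each level once; and the multi-level architecture reads the input once and writes the output once.

```python
def external_access_per_image(levels):
    """External memory traffic per image, in units of N*N words.

    The per-architecture cost model is an illustrative assumption chosen
    to reproduce the asymptotic values quoted in Table 2.1.
    """
    # Relative size of the input image at each level: 1, 1/4, 1/16, ...
    sizes = [1.0 / 4 ** k for k in range(levels)]
    all_levels = sum(sizes)
    later_levels = sum(sizes[1:])
    return {
        # Row pass (read + write) and column pass (read + write) per level.
        "direct": 4 * all_levels,
        # Full cost at level 1, half of the direct cost at later levels.
        "rccr": 4 * sizes[0] + 2 * later_levels,
        # One read and one write per level, thanks to internal line buffers.
        "one_level": 2 * all_levels,
        # Read the input once and write all coefficients once.
        "multi_level": 2.0,
    }


if __name__ == "__main__":
    for J in (1, 3, 12):
        traffic = external_access_per_image(J)
        print(J, {name: round(value, 2) for name, value in traffic.items()})
```

As the number of levels grows, the sums approach 16/3, 14/3, and 8/3, which round to the 5.33N^2, 4.67N^2, and 2.67N^2 entries of the table, and the 1-level figure is exactly half of the direct figure for every J, in line with the statement in Section 2.2.3.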
2.4 Review of 1-level line-based architectures
In the following, line-based architectures recently proposed in the literature are
reviewed. Bing-Fie et al. [43] proposed a pipelined architecture for the 2-D
lifting-based DWT of the 5/3 and the 9/7 filters by merging the predict and update
stages into one stage (step). The overall architecture includes three main
components: the column processor, the transposing buffer, and the row processor. The
modified algorithm was derived to shorten the data path, but it decreases the
throughput of the pipelined architecture. The architecture based on this modified
algorithm is more complex and may require complex control circuits. The transposing
buffer is a drawback because it is a very expensive memory component and increases
the complexity and the cost of the hardware without any performance advantage. In
addition, the architecture requires a total memory of size 3.5N and 5.5N for 5/3 and
9/7, respectively.
Cheng-Yi et al. [40] proposed an architecture that is a combination of a 1-level
architecture block and a multilevel architecture block. The 1-level architecture block
consists of 4 processors, while the recursive architecture block consists of 2
processors. The 1-level architecture performs the first level of decomposition of the
original image and generates the coefficients of the four subbands LL, LH, HL, and HH
every clock cycle. The LL coefficients are further pipelined to the recursive
architecture block, which performs the next levels of decomposition. However, this
architecture requires considerable hardware resources with limited utilization, 6
processors and a total line buffer of size 5.5, and it is comparatively slow.
Hongyu et al. [59] proposed an architecture called the two-dimensional dual scan
architecture, in which two consecutive rows are scanned simultaneously, which allows
two pixels to be read per clock cycle from memory and applied to the row processor.
In this architecture the FIFO memories were eliminated, and the interleaving
mechanism was replaced by adding an intermediate memory of size N^2/2 to store the
LL coefficients for the next levels of decomposition. However, the scan method
adopted requires a total line buffer of size 2N and 6N for the 5/3 and 9/7
architectures, respectively.
Several lifting-based architectures resembling the architecture in [59] were also
proposed in [3], [28], [29], and [35], in which the datapath (the row and the column
processors) was pipelined to increase the throughput of the computations. In [30] and
[16], very efficient methods were developed that implement the multipliers in the DWT
datapath using arithmetic shift operations, which provide a better trade-off among area,
power, and operating frequency.
In [25] and [26], line-based VLSI architectures for the 9/7 and 5/3 filters based on the
lifting scheme were proposed, respectively. The proposed architecture mainly includes a
row transform module and a column transform module, working in parallel and in a
pipelined fashion. The embedded decimation technique, based on folding and time
multiplexing, is exploited to optimize the design of the architecture. The so-called embedded
decimation technique is defined as follows: the samples are input in sequence, and the
prediction (dual) lifting and update (primal) lifting operations are then performed in the
same processing element (PE) by folding and time multiplexing, so that the decimation
operation is completed in an embedded fashion.
The authors of [25] and [26] claim that by adopting the decimation technique they
have significantly reduced the required number of multipliers, adders, and registers, as
well as the size of the buffer memory and the amount of RAM access. However,
since the two architectures use the raster scan order (RSO) for scanning the external
frame memory, there would be no significant reduction in the line buffer size. In
addition, using the same processing element (PE) to perform both the predict lifting and
update lifting operations increases the hardware complexity by requiring the introduction
of several multiplexers, which in turn slow the computations.
In the efficient pipelined architecture presented in [61], a critical path delay of Tm
+ Ta and a reduction in the number of multipliers are achieved through an optimized data
flow graph. However, this architecture requires a total line buffer of size 10N, which
is a very expensive memory component.
The architecture presented in [24] is an attempt to exploit the parallel nature of the 
5/3 algorithm through parallel operation of independent units. The design is further 
optimized by introducing pipeline stages. Input samples are accessed through a 
window of four samples, allowing two concurrent predict operations and two 
concurrent update operations. Four coefficients can be calculated in one clock cycle 
once the pipeline is populated. The major drawback is that the pipeline requires four
clock cycles to read new values from external memory, and how the architecture is
pipelined is not evident. In addition, the predict and update modules, and indeed the
whole architecture, are poorly structured.
In [62], an architecture called the deeply parallel architecture is proposed. The
architecture requires a buffer memory (BM) of size 5N, several FIFO buffers, and a
main memory (MM) of size 4N, which are very expensive memory components. In
addition, writing the results into MM and then switching them out to external memory
(EM) is a real drawback, since external memory usually consumes the most power
[47].
Chengyi et al. [64] proposed a line-based architecture for the 2-D DWT where an
embedded decimation technique is exploited to optimize the architecture. The
architecture mainly consists of an input data buffer unit (IDBU), implemented as
FIFO RAMs, and a wavelet transform (WT) module. The WT module includes two
horizontal filters, HF1 and HF2, for the row transform and one vertical filter module, VF,
for the column transform. The image is scanned into HF1 and HF2 in a raster format.
Two lines of samples must be input simultaneously to the transform module;
therefore, the two FIFOs are used first to store the required input data before they are
sent to the row-transform module. The architecture requires excessive hardware
resources: two FIFOs and two row processors. In addition, scanning using a raster
format is a drawback. The architecture also suffers from a long latency of N/2 and 2N
for 5/3 and 9/7, respectively. The architecture requires a total memory of size 3.5N
and 5.5N for 5/3 and 9/7, respectively.
Chih et al. [66] proposed architectures for 5/3 and 9/7, based on new algorithms,
which aim at improving the critical issues of the 2-D DWT. The architecture consists
of four parts: two sets of the first-stage 1-D DWT, two sets of the second-stage 1-D DWT,
a control unit, and a Mac unit. The new algorithm, however, increases the hardware
complexity of the architecture and does not decrease the transpose memory
requirement. In fact, the architecture requires a transpose memory of size 2N and 4N
for 5/3 and 9/7, respectively, in addition to internal memories. The architecture also
suffers from a long latency of 3/2N + 3 cycles.
Wei et al. [68] proposed an architecture for the 2-D DWT which reduces the internal
memory required for 5/3 and 9/7 to 2N and 4N, respectively. However, the row and
the column processors are not pipelined and require considerable hardware resources,
which leads to a longer critical path delay. In addition, the scheduling of the coefficients
generated by the row processor to the column processor, and the registers used, are not shown in the
architecture. The architecture has a latency of 3/2N + 3 clock cycles, which
implies that it needs an additional transpose memory of size at least 1.5N, and
that increases the total memory required for 5/3 and 9/7 to 3.5N and 5.5N,
respectively.
Jie et al. [67] proposed a modified integer-to-integer wavelet transform
architecture based on fixed-point manipulation. The architecture consists of horizontal
and vertical transform processors, an intermediate buffer, a control module, and an output
control module. The image is input line by line to the horizontal processor to perform
horizontal filtering. The vertical processor employs row-wise coefficients and
simultaneously fetches data via the intermediate buffers to execute the column-wise
transform. The latency of the architecture is too long: 5N clock cycles. An intermediate
memory buffer of size 5N, in addition to several memories internal to the
vertical processor, is required for the architecture to perform its task.
Furthermore, the fixed-point manipulation actually increases the computational
complexity of the architecture, which leads to a longer critical path delay.
The 5/3 architecture proposed in [69] consists of five key modules: the data choose
module, the row DWT module, the column DWT module, the DWT control unit, and
external RAM. The architecture requires a transpose memory of size 2N and an internal
memory of size 2N, a total of 4N, which is considered a large memory for a 5/3
architecture. The data choose module is a drawback since it constitutes an extra
module; in addition, its structure is not drawn and its operation is not described.
In [70], a VLSI architecture for the 2-D 9/7 float discrete wavelet transform (DWT)
for the Consultative Committee for Space Data Systems image data compression is
proposed. The proposed architecture mainly consists of five parts: row processor,
column processor, intermediate buffer, controller, and external memory. The row
processor calculates the horizontal DWT of each row of the external memory image
data. Then, the resulting decomposed high-pass and low-pass coefficients are stored
in the intermediate buffers. The column processor calculates the vertical DWT as
soon as five rows have been processed. That means the architecture would have a
latency of 5N clock cycles, which is very long. In addition, the row and the
column processors require large hardware resources and the internal memory
requirement is too large, 22N, which makes this architecture very expensive.
One of the serious limitations of the lifting-based architecture is its potentially
long critical path [2]. This problem was addressed in [2] and [21], which
proposed architectures that aim at shortening the critical path of lifting-based 1-D
architectures. Huang et al. [2] proposed an efficient VLSI architecture, called the flipping
structure, in which the problem of serious timing accumulation in lifting-based
architectures is addressed by flipping some computing units with the inverse of
multiplier coefficients such that the critical path can be greatly reduced. However, this
architecture requires a total line buffer of size 11N, which is a very expensive
memory component. A modified view of the flipping structure is presented in [21].
Compared with Huang's method, the method proposed in [21] is more efficient in
reducing the critical path, and the memory requirement for one processor is 4N. However,
2-D DWT architectures usually consist of 2 processors, which would require more line
buffers. Furthermore, reducing the critical path delay to one multiplier is no longer a
critical issue, since the coefficients and scaling factors of the 9/7 can be implemented in
hardware with only 2 adders using the arithmetic shift method [23].
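The idea of replacing a constant multiplier with shifts and adders can be illustrated with a minimal sketch. It is not the method of [23]: the function name `times_beta_shift_add` is hypothetical, the value of the 9/7 lifting constant β is the commonly quoted one (it is not listed in this chapter), and the decomposition 1/16 - 1/128 - 1/512 is an illustrative approximation chosen only for this example.

```python
BETA = -0.05298011854  # commonly quoted 9/7 lifting constant (assumed value)


def times_beta_shift_add(x):
    """Approximate BETA * x for an integer x using three shifts and two adders.

    1/16 - 1/128 - 1/512 = 0.052734..., an illustrative approximation of |BETA|.
    """
    return -((x >> 4) - (x >> 7) - (x >> 9))


x = 4096
print(times_beta_shift_add(x), BETA * x)  # shift-add result next to the exact product
```

The accuracy of such an approximation depends on how many shifted terms are kept, so a real design would choose the terms to match its word length.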
In [60], by reordering the lifting-based DWT of the 9/7, the critical path delay of
the pipelined architecture has been reduced to one multiplier delay. But the
architecture requires a total line buffer of size 5.5N, which is a very expensive
memory component. In addition, it requires real multipliers with long delay that cannot
be implemented using the arithmetic shift method. Moreover, the folded architecture,
which uses one module to perform both the predict and update steps, in fact increases
the hardware complexity (e.g., the use of several multiplexers) and the control
complexity. Using one module to perform both the predict and update steps implies that
both steps have to be sequenced, which will definitely slow down the computation
process.
In [63], a line-based pipelined architecture for the 5/3 and the 9/7 2-D
DWT is proposed. The architecture consists of three key modules: the row DWT
module, the data buffer, and the column DWT module. The row module performs the
row-wise DWT and the output data is stored in the data buffer. When enough rows are
processed, the column module starts to perform the column-wise transform as soon as
possible and stores the intermediate results in the temporal buffer memory. The
folding technique is employed to reduce the hardware cost, and it achieves a critical
path of one multiplier delay. Although the folding technique reduces the
arithmetic resources, it requires, besides an increased number of multiplexers, the
use of real multipliers, which leads to a longer critical path delay and more hardware
resources. In addition, the temporal buffers, which hold the intermediate results
generated by the column DWT module, are not incorporated into the column
module's architecture; thus, the architecture is not complete. Furthermore, the
architecture requires a total memory of size 3.5N and 5.5N for 5/3 and 9/7,
respectively.
Chung-Fu et al. [71] proposed a pipelined architecture for the 9/7 2-D DWT. The
proposed architecture is composed of column and row processors to perform the
separable 2-D DWT. Based on a rescheduling algorithm, which merges the
computation of each lifting step, a critical path of one multiplier and two full-adder
delays is achieved. The architecture is generally complex and requires more hardware
resources such as Wallace tree multipliers. In addition, the architecture requires a total
memory of size 5.5N.
JPEG2000 optionally allows an image to be divided into a number of smaller
non-overlapping rectangular blocks known as "tiles", and the 2-D DWT is applied inside
each tile independently. Tiling provides a simple mechanism for controlling the
amount of working memory used to compute the 2-D DWT of a large image [8]. The papers
reviewed so far have proposed non-tile-based architectures, i.e., they process the
whole image as one tile. Srikar et al. [27] and Dimitroutakos et al. [36] proposed
tile-based architectures for computing the 2-D DWT. These architectures are somewhat too
complex and their memory requirement is high, which makes them impractical.
Nevertheless, tiling is a useful mechanism for computing the 2-D DWT of a large
image independent of its size, using a smaller intermediate memory to
store the "LL" values for the next level of decomposition.
2.5 Conclusion 
I conclude that the most critical part of 2-D DWT architectures is the memory
issue, especially the internal memory of the processors, which dominates the hardware
cost and complexity of the architecture, while external memory access consumes the
most power. Most of the architectures proposed in the literature managed to reduce the
internal memory (line buffer) requirements of the processors to between 5.5N and 11N,
which is still a large memory. In addition, no architectures were developed
that directly address the problem of reducing the power consumption of the 2-D
DWT. Other architectures, on the other hand, have focused on reducing the critical
path delay of the processor to one multiplier delay. However, this issue becomes less
critical given that the scale factors and coefficients of the 9/7 filters can be
implemented in hardware using only two adders. In addition, these architectures are 
largely inaccurate and incomplete. Furthermore, two very important issues have been 
overlooked in the literature, which will be addressed in this research, the DWT 





This research started off by developing a software simulation program for both the
decorrelation and reconstruction processes. The objective of developing the software
program is to learn in depth the behavior of the algorithm and, in the process, to
acquire a firm understanding, which would enable us to develop more accurate
architectures. The software program is listed in Appendix A.
Then, equipped with the information gained from developing the software program, in
this chapter, novel VLSI architectures based on the lifting scheme that compute the 2-D
DWT in an image compression system and meet the high-speed requirement for
real-time applications of the 2-D DWT will be developed.
As a starting point, consider the general lifting-based tree-structured filter bank for
the first level of decomposition shown in Figure 3.1.1. The figure suggests that the 2-D
DWT can be implemented by three processors, as indicated by the dotted lines in the
figure. The processors are the row-processor, column-processor-H, and
column-processor-L. The row-processor (RP) computes the DWT row-wise, i.e., the RP applies the
one-dimensional DWT algorithm to each row of an image to produce the YH and YL
decompositions. The two column processors each compute the DWT column-wise by
applying the one-dimensional DWT algorithm to every column of YH and YL. The
column-processor-H takes YH as input and produces subbands HL and HH, while
the column-processor-L takes YL as input and produces the LH and LL subbands.
Since the tree structure shown in Figure 3.1.1 is a general representation of the 2-D
DWT, it is now necessary to determine the wavelet algorithm that will be
used by the three processors to compute the DWT. As a matter of fact, any wavelet
algorithm could be chosen and the processors' hardware architecture could be
designed based on it. At this point it is also clear that each processor should be
designed to execute a one-dimensional DWT algorithm applied either to all rows or all
columns of an image. Therefore, to be specific in the architecture development, the
one-dimensional lifting-based 5/3 and 9/7 wavelet transform algorithms are selected




Figure 3.1.1 Lifting-based tree-structured filter bank
3.2 Lifting-based 5/3 and 9/7 algorithms and architectures development
The lossless 5/3 and lossy 9/7 discrete wavelet transform algorithms are defined by
the JPEG2000 image compression standard for a 1-D signal X containing N samples, as
follows [27, 29]:
5/3 analysis algorithm
step1: $Y(2j+1) = X(2j+1) - \left\lfloor \frac{X(2j) + X(2j+2)}{2} \right\rfloor$
step2: $Y(2j) = X(2j) + \left\lfloor \frac{Y(2j-1) + Y(2j+1) + 2}{4} \right\rfloor$
9/7 analysis algorithm
step1: $Y''(2j+1) = X(2j+1) + \alpha\,(X(2j) + X(2j+2))$
step2: $Y''(2j) = X(2j) + \beta\,(Y''(2j-1) + Y''(2j+1))$
step3: $Y'(2j+1) = Y''(2j+1) + \gamma\,(Y''(2j) + Y''(2j+2))$
step4: $Y'(2j) = Y''(2j) + \delta\,(Y'(2j-1) + Y'(2j+1))$
step5: $Y(2j+1) = \frac{1}{k}\,Y'(2j+1)$
step6: $Y(2j) = k\,Y'(2j)$
where j = 0, 1, 2, ..., N-1.
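The two analysis algorithms above can be sketched in software as follows. This is a minimal illustration under stated assumptions rather than the program of Appendix A: the function names are hypothetical, out-of-range samples are obtained by whole-sample symmetric extension, and the 9/7 constants are the commonly quoted lifting values, which are not listed in this chapter (in particular, the value of k depends on the normalization convention). High coefficients occupy the odd positions of the result and low coefficients the even positions.

```python
ALPHA, BETA = -1.586134342, -0.05298011854   # assumed 9/7 constants
GAMMA, DELTA = 0.8829110762, 0.4435068522    # assumed 9/7 constants
K_SCALE = 1.149604398                        # assumed value of k


def mirror(idx, n):
    """Reflect idx into [0, n-1] (whole-sample symmetric extension, n >= 2)."""
    period = 2 * (n - 1)
    idx %= period
    return idx if idx < n else period - idx


def dwt53_1d(x):
    """Integer 5/3 lifting: step 1 on odd samples, step 2 on even samples."""
    n = len(x)
    y = list(x)
    for i in range(1, n, 2):      # step 1: high (odd-indexed) coefficients
        y[i] = x[i] - ((x[mirror(i - 1, n)] + x[mirror(i + 1, n)]) >> 1)
    for i in range(0, n, 2):      # step 2: low (even-indexed) coefficients
        y[i] = x[i] + ((y[mirror(i - 1, n)] + y[mirror(i + 1, n)] + 2) >> 2)
    return y


def dwt97_1d(x, k=K_SCALE):
    """Floating-point 9/7 lifting: four lifting steps, then scaling."""
    n = len(x)
    y = [float(v) for v in x]

    def lift(parity, coeff):
        # Update the samples of one parity from their two neighbours.
        for i in range(parity, n, 2):
            y[i] += coeff * (y[mirror(i - 1, n)] + y[mirror(i + 1, n)])

    lift(1, ALPHA)   # step 1
    lift(0, BETA)    # step 2
    lift(1, GAMMA)   # step 3
    lift(0, DELTA)   # step 4
    return [v / k if i % 2 else v * k for i, v in enumerate(y)]  # steps 5, 6


print(dwt53_1d([1, 2, 3, 4, 5, 6, 7, 8]))
print(dwt97_1d([1.0] * 8))
```

In both functions the right shift on Python integers rounds toward minus infinity, which matches the floor operation in the 5/3 equations.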
For the RP to compute the 2-D FDWT for an N x M image, the 5/3 algorithm can be
written as follows.
for i = 0 to N - 1 do
for j = 0 to M - 1 do
$Y(i, 2j+1) = X(i, 2j+1) - \left\lfloor \frac{X(i, 2j) + X(i, 2j+2)}{2} \right\rfloor$
$Y(i, 2j) = X(i, 2j) + \left\lfloor \frac{Y(i, 2j-1) + Y(i, 2j+1) + 2}{4} \right\rfloor$
end
end
where Y(i, 2j+1) and Y(i, 2j) are the high and low decompositions that result
when the image X(i, j) is applied to the algorithm above. This algorithm implies that
the high and the low output coefficients are stored in the same memory Y, with the
high coefficients occupying the odd-indexed locations and the low coefficients
occupying the even-indexed locations. However, I prefer to store the high and low
coefficients each in a separate memory, so the algorithm above is rewritten as
for i = 0 to N - 1 do
for j = 0 to M - 1 do
$YH(i, j) = X(i, 2j+1) - \left\lfloor \frac{X(i, 2j) + X(i, 2j+2)}{2} \right\rfloor$




In this representation X (i, j) is interpreted as a two-dimensional array in a software 
implementation and a physical memory in a hardware implementation containing the 
original image pixels. The algorithm takes as an input X (i, j) and decomposes it into 
high (H) and low (L) decompositions, which are stored in the memories denoted by 
YH (i, j) and YL (i, j), respectively. This algorithm can be represented in a block 
diagram as shown in Figure 3.2.1. The block diagram consists of a row-processor 
(RP) and an external memory X (N, M) that contains the original image. The processor 
reads the contents of the memory labeled X (N, M) line by line and computes the high 
and low coefficients of the image and stores the results in the memories labeled YH 
and YL, respectively. 
Figure 3.2.1 Block diagram representation of the algorithm
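The behavior of the row-processor in this block diagram can be sketched in software. The following is a minimal illustration, assuming symmetric extension at the row ends and assuming that the low coefficient takes the form YL(i, j) = X(i, 2j) + floor((YH(i, j-1) + YH(i, j) + 2)/4), by analogy with step 2 of the 5/3 algorithm; the function name `row_processor_53` is hypothetical.

```python
def row_processor_53(X):
    """Apply the 5/3 lifting steps along every row of X.

    X is a list of rows of integers (each row has at least 2 samples).
    Returns (YH, YL): the high and low decompositions, kept in separate
    arrays that are each about half as wide as X.
    """
    YH, YL = [], []
    for row in X:
        m = len(row)
        high = []
        for j in range(m // 2):
            # Mirror X(i, 2j+2) back to X(i, 2j) at the end of the row.
            right = 2 * j + 2 if 2 * j + 2 < m else 2 * j
            high.append(row[2 * j + 1] - ((row[2 * j] + row[right]) >> 1))
        low = []
        for j in range((m + 1) // 2):
            prev_h = high[max(j - 1, 0)]              # mirrored at the left edge
            next_h = high[min(j, len(high) - 1)]      # mirrored at the right edge
            low.append(row[2 * j] + ((prev_h + next_h + 2) >> 2))
        YH.append(high)
        YL.append(low)
    return YH, YL


YH, YL = row_processor_53([[1, 2, 3, 4, 5, 6, 7, 8], [10] * 8])
print(YH)
print(YL)
```

Keeping YH and YL in separate arrays mirrors the choice made in the text of storing the high and low coefficients in separate memories.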
By slightly modifying the indexes of the last algorithm, the algorithms for the
column-processor-H and the column-processor-L are obtained, respectively. The
column-processor-H reads the contents of the memory labeled YH as input and yields
subbands HH and HL, whereas the column-processor-L reads the contents of the
memory labeled YL and yields subbands LH and LL.
Column-processor-H
for j = 0 to M - 1 do
for i = 0 to N - 1 do
$YHH(i, j) = YH(2i+1, j) - \left\lfloor \frac{YH(2i, j) + YH(2i+2, j)}{2} \right\rfloor$





Column-processor-L
for j = 0 to M - 1 do
for i = 0 to N - 1 do
$YLH(i, j) = YL(2i+1, j) - \left\lfloor \frac{YL(2i, j) + YL(2i+2, j)}{2} \right\rfloor$
$YLL(i, j) = YL(2i, j) + \left\lfloor \frac{YLH(i-1, j) + YLH(i, j) + 2}{4} \right\rfloor$
end
end
When the two column-processors are combined with the architecture shown in
Figure 3.2.1, the architecture shown in Figure 3.2.2 is obtained, which computes the
first-level DWT decomposition for an N x M image. To obtain a J-level decomposition,
the LL subband coefficients of each successive level are stored in the memory labeled
LL-RAM for further decompositions, as shown in Figure 3.2.2. This implies that the
architecture decomposes 2-D images into the desired number of decomposition levels,
level by level.
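The combination of one row pass and two column passes, repeated on the LL subband, can be sketched as follows. This is a minimal software illustration, not the hardware architecture: the function names are hypothetical, symmetric extension is assumed at the boundaries, and the image dimensions are assumed to stay at least 2 at every level.

```python
def lift53(s):
    """1-D 5/3 lifting of a sequence; returns (high, low) coefficient lists."""
    n = len(s)
    high = []
    for j in range(n // 2):
        right = 2 * j + 2 if 2 * j + 2 < n else 2 * j   # mirror at the end
        high.append(s[2 * j + 1] - ((s[2 * j] + s[right]) >> 1))
    low = []
    for j in range((n + 1) // 2):
        prev_h = high[max(j - 1, 0)]
        next_h = high[min(j, len(high) - 1)]
        low.append(s[2 * j] + ((prev_h + next_h + 2) >> 2))
    return high, low


def transpose(a):
    return [list(col) for col in zip(*a)]


def column_pass(Y):
    """Apply lift53 to every column of Y; returns (high part, low part)."""
    cols = [lift53(c) for c in transpose(Y)]
    return transpose([h for h, _ in cols]), transpose([l for _, l in cols])


def dwt53_2d_level(X):
    """One decomposition level: row pass, then the two column passes."""
    rows = [lift53(r) for r in X]
    YH = [h for h, _ in rows]
    YL = [l for _, l in rows]
    HH, HL = column_pass(YH)   # role of column-processor-H
    LH, LL = column_pass(YL)   # role of column-processor-L
    return LL, LH, HL, HH


def dwt53_2d(X, levels):
    """Level-by-level decomposition: each level re-processes the LL subband."""
    details = []
    LL = X
    for _ in range(levels):
        LL, LH, HL, HH = dwt53_2d_level(LL)
        details.append((LH, HL, HH))
    return LL, details


LL, details = dwt53_2d([[8] * 4 for _ in range(4)], levels=2)
print(LL, details[0][2])
```

The variable `LL` plays the role of the LL-RAM in the text: it holds the subband that is fed back as the input of the next level.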
A similar procedure can be applied to transform the 9/7 algorithm. A careful
examination of the last 3 algorithms shows that they are basically identical
algorithms, which implies that their processor architectures would also be identical. In
addition, the architecture is modular, since it consists of three modules one row-













3.3 Data dependency graphs (DDGs) for 5/3 and 9/7 algorithms
The data dependency graphs (DDGs) for the 5/3 and the 9/7 algorithms, derived from
their respective algorithms, are shown in Figures 3.3.1 and 3.3.2, respectively. In the
DDGs, a node circled with a number represents a computation. All step 1
computations in the 5/3 algorithm are performed by the nodes circled with odd numbers
(first level) in the DDGs of Figure 3.3.1. On the other hand, step 2 computations are
performed by the nodes circled with even numbers in the second level, labeled Y(2j), in
the DDGs. The symmetric extension algorithm is incorporated in the DDGs to handle
the boundary problems. The symmetric extension is represented in the DDGs by
dotted lines. The boundary treatment is necessary to keep the number of wavelet
coefficients the same as that of the original input. The boundary treatment is only applied
at the beginning and end of the process [3]. That means that in 2-D images, it will be
applied at the beginning and the end of each row or column. The nodes circled
with the same numbers in the DDGs are considered redundant computations, which
will be computed once and used thereafter. In addition, note that the symmetric
extension algorithm behaves differently for even and odd length signals when it is
applied to the data dependency graph. Therefore, two DDGs are provided for each
algorithm, one for even and another for odd length signals. The data dependency
graph is a useful tool in architecture development and enhancement.
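The boundary treatment can be illustrated with a short sketch. It assumes whole-sample symmetric extension (the boundary sample itself is not repeated), which is the form commonly used with these filters; the DDGs in the figures remain the reference for the exact treatment used here. The function names are hypothetical.

```python
def symmetric_extend(x, pad):
    """Extend x by `pad` mirrored samples on each side (len(x) >= 2)."""
    n = len(x)
    period = 2 * (n - 1)

    def mirror(k):
        k %= period
        return k if k < n else period - k

    return [x[mirror(k)] for k in range(-pad, n + pad)]


def coefficient_counts(n):
    """Number of (high, low) coefficients produced for a length-n signal."""
    return n // 2, (n + 1) // 2


print(symmetric_extend([0, 1, 2, 3], 2))
print(coefficient_counts(8), coefficient_counts(9))
```

The counts show why even and odd lengths differ: an even-length signal yields equally many high and low coefficients, while an odd-length signal yields one extra low coefficient, and in both cases the total equals the input length.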
3.4 External Architecture Development and refinement 
In the architecture shown in Figure 3.2.2, the row-processor scans (reads) the external
memory, which contains the original image pixels, row by row and decomposes the
image into high (H) and low (L) coefficients, which are stored in the memories labeled
YH and YL, respectively. Then, the two column processors simultaneously each read
their respective memory, YH and YL, and compute the subband coefficients HH, HL, LH,
and LL in parallel.
In order to reduce the size of the internal memories YH and YL and to allow the
two column processors to work in parallel with the row-processor, the DDGs are
considered. To ease the development of the architectures, the
strategy is to divide the details of the development into two steps, each having
less information to handle. In the first step, the DDGs are looked at from the outside,
Figure 3.3.2 9/7 algorithm's DDG for odd (a) and even (b) length signals
which is specified by the dotted boxes in the DDGs, in terms of the input and output
requirements. We have observed that the DDGs for 5/3 and 9/7 are identical when
they are looked at from the outside, taking into consideration only the input and output
requirements, but differ in the internal details. Based on this observation, the first level
of the architecture, the external architecture, is developed. In the second step, the






datapath architectures, since the DDGs internally define and specify the internal structure
of the processors.
This new approach, along with the scan methods developed in the
next section, can be used not only in the development of the forward 2-D DWT
architecture but also in the inverse transform and any DWT algorithm, and it is certain to yield very
efficient architectures in terms of hardware complexity, speedup, and power
consumption with manageable control complexity.
The DDGs of Figures 3.3.1 and 3.3.2 show that to compute one high and one low
coefficient at any time, the processor needs three pixels as input. Thus, for the two
column processors to work in parallel with the row-processor, the row-processor must
first compute the DWT for the first two rows. Then, the two column processors can start
computing as soon as the result of the first operation in the third row is available.
After that the three processors proceed computing in parallel until the row-processor
(RP) performs the last operation in the third row. The two column processors then go
into idle states, while the RP works on the fourth row. When the RP reaches the fifth
row and as soon as the result of the first operation in the row is available, the two
column processors again resume computing in parallel with the RP using the results
of the third, fourth, and fifth rows, until the last operation in the fifth row is
performed. Then, the two column processors again go into idle states, while the RP
operates on the sixth row to repeat the process. Clearly, the two column
processors would be idle or underutilized half of the time. The
advantage, however, is that the sizes of the two column processors' memories labeled YH and YL
can each be reduced to M instead of N x M/2, which is a considerable reduction in a
very expensive memory component. In addition, since the two column-processors
(CPs) are underutilized half of the time, it is possible to remove one of the CPs and
keep only one to compute the four subbands HH, HL, LH, and LL. When these
changes are made to the architecture shown in Figure 3.2.2, the architecture shown in
Figure 3.4.1 is obtained and the hardware utilization is 100%. In this architecture, the
internal memories YH and YL can each be considered as consisting of two memory
banks of size M/2.
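The schedule described above can be made concrete with a small sketch. It only illustrates which rows the column processing combines and how the memory requirement changes; it is not the control logic of the architecture. The function names are hypothetical, rows are numbered from 1 as in the text, and the boundary handling for the last rows is omitted.

```python
def cp_schedule(n_rows):
    """Return the (1-based) row triples consumed by the column processing.

    The column processing is active while the row-processor works on rows
    3, 5, 7, ... and each time combines the current row with the two rows
    before it; while the row-processor works on even rows it is idle.
    """
    return [(r - 2, r - 1, r) for r in range(3, n_rows + 1, 2)]


def internal_memory_words(n, m):
    """Words needed by EACH of YH and YL before and after the refinement."""
    full_storage = n * m // 2   # whole half-width decomposition is stored
    two_rows = m                # two banks of M/2 coefficients each
    return full_storage, two_rows


print(cp_schedule(7))
print(internal_memory_words(512, 512))
```

For a 512 x 512 image, for example, each memory shrinks from 131072 words to 512 words under these assumptions.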
To evaluate the performance of the two architectures shown in Figures 3.2.2 and
3.4.1 in terms of speedup, consider the following. Assume the RP of the architecture in
Figure 3.2.2 takes T clock cycles to perform one level of decomposition. Then the two
column processors, working in parallel, would each need T/2 clock cycles, for a total
of T + T/2 = 3/2 T cycles to perform one level of decomposition by the three
processors. On the other hand, the architecture shown in Figure 3.4.1 only requires a total
of T cycles to compute one level of decomposition, which is a speedup factor
of 3/2 as compared with the architecture shown in Figure 3.2.2.
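Written out as a single expression, with $T_{3.2.2}$ and $T_{3.4.1}$ used here as shorthand (not the thesis's notation) for the cycle counts of the two architectures, the comparison reads:

```latex
\[
S \;=\; \frac{T_{3.2.2}}{T_{3.4.1}} \;=\; \frac{T + \tfrac{T}{2}}{T} \;=\; \frac{3}{2}
\]
```

The gain comes entirely from overlapping the column computations with the row computations, since the row-processor needs T cycles in both cases.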
Let us now explain the dataflow of the architecture shown in Figure 3.4.1,
specifically, how data flows from the outputs of the RP, through the internal
memories YH and YL, to the inputs of the CP. The RP scans the external memory
row by row, reading 3 pixels every cycle and placing them into the registers
labeled Rt0, Rt1, and Rt2 to initiate an operation, and produces as output the coefficients
of the high (H) and low (L) decompositions, according to the DDGs. The results of
the first row computations, which are placed on the output lines labeled H and L, are
stored in the memory banks B0 of YH and B0 of YL, respectively. The results of the
second row computations are stored in the memory banks B1 of YH and B1 of YL. The
CP starts its computations as soon as the results of the first operation in the third
row are computed and placed into registers Rt3 and Rt4. The CP performs its
computations by reading two coefficients from the memory banks of YH and the
third from register Rt3. Data in register Rt3 follows the path that leads to Mux2, to
register Rt6, and finally to the column-processor input labeled Ic2, while data from
banks B0 and B1 of YH follow the paths that lead to Mux0 and Mux1 to be loaded
into Rt7 and Rt5, respectively. The CP repeats this process every clock cycle until it
consumes the data in the two banks of the YH memory, including the immediate data
coming through Rt3. According to the DDGs, the low and high coefficients produced
as a result of processing the third row by the RP are needed not only in the current but
also in the next calculations involving the 4th and the 5th rows of the YL and YH
decompositions. Therefore, these high and low coefficients are stored in the memory
banks B0 and B1 of YH, respectively, while the CP retrieves data from the memory YH
banks. Of course, that would require reading and writing the same memory location






Figure 3.4.1 Architecture for 2-D DWT
implementing the memory banks of YH and YL as FIFO queues. That sounds
logically correct, but in practice it would require a large number of registers for 2-D
images, which would be a very expensive solution that we prefer to avoid.
Therefore, we prefer that the memory banks of YH and YL be implemented as RAM.
Then, the read and write conflict can be resolved with careful timing by allowing the read to
be performed in the first half cycle and the write in the second half.
As soon as the CP is done with the data stored in memory YH, it turns to memory
YL and starts its second batch by operating on the data stored there. Each clock cycle,
two data are read: one from bank B1, which takes the path that leads to Mux1, and the other
from bank B0, which takes the path leading to Mux0. The third data is read at the same time
from bank B1 of YH to complete the three-input requirement for an operation. While
the CP is retrieving and operating on the data stored in the memory banks of YL and
B1 of YH, the high and low coefficients, generated by the RP as a result of applying
the DWT to the pixels of the fourth row in the external memory, are stored in banks B0
and B1 of YL, respectively. The third batch of computations takes place by reading the
high coefficients stored in bank B0 of YH and in bank B0 of YL, while the high
coefficients, generated by the RP using data of the fifth row, are passed from register
Rt3 through the path leading to Mux2 to the CP as a third input. At the same time, the
high and low coefficients computed using the fifth row's data are stored in bank B0 of
YH and in bank B0 of YL, respectively, since they are needed in the computations of
the next two batches. The fourth batch, which processes low coefficients, begins by
reading the data stored in banks B0 and B1 of YL and B1 of YH, which follow the path
leading to Mux0, to register Rt7, and finally enter the CP through the input labeled
Ic0. Meanwhile, the high and low coefficients computed by the RP using the data of
the sixth row are routed to B1 of YH and B1 of YL, respectively. Data read from bank
B0 of YL enter the CP through the input labeled Ic2.
A careful examination shows that after the fourth batch is processed, the dataflow
or scheduling of batches repeats the same patterns described above for the four
batches. That means the next 4 batches would also exhibit the same scheduling
patterns as the first four batches, and so on. Furthermore, with the pipeline registers
Rt1, Rt2, Rt0, Rt3, Rt4, Rt5, Rt6, Rt7, Rt8, and Rt9 in place, not only does the RP work
in parallel with the CP, but the whole architecture is now fully pipelined. The
pipeline consists of three stages: the RP stage, the YH and YL memory stage, and the
CP stage. Pipelining improves the performance of the architecture in terms of speedup
and throughput as compared with a non-pipelined architecture. It is possible to attain
maximum speedup and throughput in this architecture because 2-D DWT
computations involve a large number of operations. The larger the number of pipeline
stages, the higher the speedup.
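The link between the number of operations and the attainable speedup can be made explicit with the standard idealized pipeline model. This is a textbook approximation, not a measurement of this architecture: it assumes $k$ stages of equal delay and $n$ independent operations, and both symbols are introduced here only for illustration.

```latex
\[
S(n) \;=\; \frac{n\,k}{k + n - 1},
\qquad
\lim_{n \to \infty} S(n) \;=\; k
\]
```

With the three stages named above ($k = 3$), $S(n)$ approaches 3 as the number of operations grows, which is why a transform involving a very large number of operations can come close to the maximum speedup.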
Even though we have managed to reduce the hardware complexity to a great
extent, from 3 processors and a total internal memory of size N x M consisting of YH
and YL in the architecture shown in Figure 3.2.2 to two processors and a total
memory of size 2M for YL and YH in the architecture shown in Figure 3.4.1, and in
the process have gained a speedup factor of 3/2 as compared with the architecture in
Figure 3.2.2, the disadvantage of the architecture shown in Figure 3.4.1 is that it
requires very complex control circuitry to govern the dataflow across the memory
banks of YH and YL. In addition, the internal memory requirement is still high.
However, it is possible to eliminate the internal memories labeled YH and YL entirely,
use a few registers instead, and greatly reduce the control complexity by
adopting a different scan strategy for scanning the external memory, as is
illustrated in the following section.
37 
3.5 Overlapped and Nonoverlapped Scan Methods 
I believe that minimization of the internal memory, and hence of the hardware
complexity in general for 2-D DWT architectures, depends on the scan method adopted
for scanning the external frame memory. Therefore, in this section two scan methods
are illustrated and adopted in place of the row-by-row scan method used so far, to
further refine the architecture and obtain novel architectures that best meet the
requirements of real-time 2-D DWT applications.
The two scan methods, overlapped and nonoverlapped, are illustrated in Figures
3.5.1 and 3.5.2, respectively. The pixels in the overlapped areas, indicated by the
dark lines in Figure 3.5.1, are scanned twice. For an N x M image, the overlapped
scan method requires $NM + N\lfloor (M-1)/2 \rfloor$ clock cycles to scan the external
memory for the first level of decomposition, whereas in the nonoverlapped method the
overlapped areas are eliminated, which reduces the external memory access to NM clock
cycles and hence reduces the power consumption. External memory access usually
consumes the most power [33, 51].
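The two cycle counts above can be compared directly for a given image size. The
following is a minimal sketch of the stated first-level formulas; the function names
are illustrative and do not appear in this work.

```python
def overlapped_scan_cycles(n_rows: int, n_cols: int) -> int:
    """First-level external memory cycles for the overlapped scan.

    Implements NM + N * floor((M - 1) / 2): every pixel is read once and
    each overlapped column (N pixels) is read a second time.
    """
    return n_rows * n_cols + n_rows * ((n_cols - 1) // 2)


def nonoverlapped_scan_cycles(n_rows: int, n_cols: int) -> int:
    """First-level external memory cycles for the nonoverlapped scan (NM)."""
    return n_rows * n_cols
```

For a 512 x 512 image the two functions return 392704 and 262144 cycles, so the
overlapped scan costs roughly one and a half times as many external accesses.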
The scan methods shown in Figures 3.5.1 and 3.5.2 are appropriate for both the 5/3
and 9/7 algorithms. However, when they are used for 9/7, no output coefficients are
produced in the first run, according to the 9/7 DDGs. Thus, to allow the 9/7 to
generate output coefficients starting from the first run, we propose the overlapped
scan method shown in Figure 3.5.3. This scan method differs from that of the 5/3 in
the first run only, which requires scanning 5 pixels from each row. These two scan
methods are developed mainly with two objectives: to make the external architecture
for both algorithms identical and to reduce the internal memory between the RP and
the CP to a few registers.
The following two observations regarding the two scan methods are necessary in order
to develop precise architectures for computing the 2-D DWT. First, when the row
length of an image is odd, the pixels of the last column (M-1) are considered
overlapped and are scanned twice. In the first scan, according to the DDGs for odd
length signals shown in Figures 3.3.1 and 3.3.2, they are used in the calculation of
the last high coefficient in each row, whereas in the second scan they are used in
the calculation of the last low coefficient in each row. On the other hand, when the
row length of an image is even, only the last two pixels in each row (columns M-2 and
M-1) are scanned and are used by the RP in the calculations of the last low and high
coefficients, as required by the DDG for even length signals.
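The two observations can be summarised by listing, for one row, which column indices
are supplied to the RP in each operation of the overlapped scan. The sketch below is
an illustration under stated assumptions: the function name is hypothetical,
representing the odd-length final step as a one-element tuple is a modelling choice,
and the even-length final triple reuses column M-2 as first and third input, as the
symmetric extension requires.

```python
def row_operations(row_length: int) -> list:
    """Column indices fed to the RP for each operation on one row."""
    if row_length < 3:
        raise ValueError("sketch assumes a row length of at least 3")
    # Regular operations: (0, 1, 2), (2, 3, 4), ... share one column each.
    ops = [(c, c + 1, c + 2) for c in range(0, row_length - 2, 2)]
    if row_length % 2 == 1:
        # Odd length: column M-1 is scanned a second time, on its own,
        # for the last low coefficient.
        ops.append((row_length - 1,))
    else:
        # Even length: only columns M-2 and M-1 are scanned; column M-2
        # serves as both the first and the third input.
        ops.append((row_length - 2, row_length - 1, row_length - 2))
    return ops
```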
3.6 Scan Based Architectures 
Based on the scan methods and the DDGs for 5/3 and 9/7 shown in Figures 3.3.1 and
3.3.2, viewed from the outside, the architectures shown in Figures 3.6.1 and 3.6.2
are proposed for the overlapped and nonoverlapped scan methods, respectively. The
architectures operate in a pipelined fashion, consisting of two stages, the RP stage
and the CP stage. The two architectures are basically identical. The main difference
is that the nonoverlapped architecture contains a line buffer (LB) of size N. This
line buffer is added to hold the N pixels that lie in each overlapped area in Figure
3.5.1 in order to reduce the external memory access and hence the power consumption.
Pixels in an overlapped area such as column 2 are also required in the next N
operations. According to the DDGs, each operation performed by either the RP or the
CP requires three inputs. For example, the inputs labeled 0, 1, and 2 in the DDG of
Figure 3.5.2 initiate the first operation to yield the coefficients labeled Y0 and
Y1, whereas inputs 2, 3, and 4 initiate the second operation, which yields Y2 and Y3,
and so on. Figure 3.6.2 shows the nonoverlapped architecture from the RP side only,
since its remaining parts




Figure 3.5.1 Overlapped scan method for 5/3 (a) Odd length signals (b) Even length signals
Figure 3.5.2 Non-overlapped scan method for 5/3 (a) Odd length signals (b) Even length signals
Figure 3.5.3 Overlapped scan method for 9/7
If the external memory is scanned with frequency f, both architectures shown in
Figures 3.6.1 and 3.6.2 should operate with frequency f/3. The dataflow for both
architectures is given in Table B.1 (Appendix B). Note that this dataflow is derived
based on the 5/3 scan methods shown in Figures 3.5.1 and 3.5.2, and it is identical to
the 9/7 architecture's dataflow, based on the same scan methods, in all runs except
the first run, where 9/7 does not yield any output coefficients. The dataflow of the
9/7 architecture based on the scan method of Figure 3.5.3 is shown in Table B.2
(Appendix B).
Looking at the DDGs shown in Figures 3.3.1 and 3.3.2 from the outside, it can be
observed that in the last high and low coefficient calculations, where the row length
of an image is even, only the last two pixels in a row, r, at locations X(r, M-2) and
































Figure 3.6.2 Proposed non-overlapped scan architecture (RP-side only). 
implementing the extension part, requires the pixel located at X(r, M-2) to be
considered as the first and the third inputs. This pixel must be passed to the RP
together with the second input pixel from location X(r, M-1) to compute the last high
and low coefficients in row r. Thus, the function of the multiplexer labeled Muxre0 is
to pass the pixel read from location X(r, M-2), after it has been transferred to
register Rd0, to the row-processor's latch Rt2 as the third input. Register Rd1 holds
the second input, the pixel from location X(r, M-1). Similarly, the multiplexer
labeled Muxce0 performs the same function when the CP applies the DWT to columns. In
other words, Muxre0 and Muxce0, which are extension multiplexers, are used only in the
calculation of the last coefficients of images with even row or column length.
On the other hand, when the row length of an image is odd, according to the
DDGs for odd length signals shown in Figures 3.3.1 and 3.3.2, to calculate the last low






row-processor. This pixel is loaded into Rd0 and then passed to the row-processor
where it is used in the computation of the last low coefficient.
In the architecture based on the nonoverlapped scan method, starting from the
second run, the dataflow or scheduling of pixels to the RP and LB should be as
follows. Assume that the last three pixels scanned from the last row in the first run
are loaded into the RP's latches by the pulse ending cycle n. Cycle n also transfers
the pixel from location X(N-1, 2) into Rd. In cycle n+1, the second run begins: the
first pixel for the first operation is read from location X(0, 3) and is loaded into
Rd1 by the pulse ending the cycle. In addition, during cycle n+1, the contents of
register Rd are written into the last location of the LB. In cycle n+2, the first
location of the LB is loaded into Rd0 by the pulse ending the cycle; this is the only
event that takes place during the cycle. Cycle n+3 transfers the second pixel from
location X(0, 4) to both Rd and Rt2, and the contents of Rd0 and Rd1 to Rt0 and Rt1,
respectively, by the pulse ending the cycle. In cycle n+4, Rd's contents are written
into the first location of the LB. In addition, the first pixel of the second
operation, which is in location X(1, 3), is loaded into Rd1 by the pulse ending the
cycle. This pattern of scheduling is repeated until the whole image is scanned.
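The role of the line buffer in this schedule can be checked with a small behavioural
model. The sketch below abstracts away the cycle-level register transfers (Rd, Rd0,
Rd1, Rt0 to Rt2) and only tracks where each operand comes from. It assumes an odd
row length and covers the three-input operations only; the function name and the
returned tuple layout are illustrative.

```python
def scan_with_line_buffer(image):
    """Nonoverlapped scan of a 2-D list, using a line buffer of size N.

    Returns (operations, external_reads); each operation is
    (row, first_column, (first, second, third)).
    """
    n_rows, n_cols = len(image), len(image[0])
    if n_cols < 3 or n_cols % 2 == 0:
        raise ValueError("sketch assumes an odd row length of at least 3")
    line_buffer = [None] * n_rows      # one entry per row (size N)
    external_reads = 0
    operations = []
    for first_col in range(0, n_cols - 2, 2):       # one run per column pair
        for r in range(n_rows):
            if first_col == 0:
                first = image[r][0]                 # first run: read externally
                external_reads += 1
            else:
                first = line_buffer[r]              # later runs: reuse from LB
            second = image[r][first_col + 1]
            third = image[r][first_col + 2]
            external_reads += 2
            line_buffer[r] = third                  # needed again in the next run
            operations.append((r, first_col, (first, second, third)))
    return operations, external_reads
```

In this model every pixel is fetched from external memory exactly once, which is
the NM access count stated for the nonoverlapped method.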
The control signal values that must be issued by the control unit for the signals
labeled Ed2, Ed3, S0, Ed4, Ed5, Ed6, and S1 in the architecture shown in Figure 3.6.1
can be derived, with reference to clock f, from Table B.1, starting from clock cycle
6, as shown in Table 3.1. Note that the pattern included in the dotted box repeats
after cycle 9. In addition, the number of control signals in Table 3.1 can be reduced
further, as shown in Table 3.2, by observing that Ed2 = S1 = Ed6 = S0 and that
Ed3 = Ed5.
Table 3.1 Control signal values
Cycle  Ed2  Ed3  S0   Ed4  Ed5  Ed6  S1
6      1    X    1    1    X    X    X
9      0    1    X    0    1    X    X
12     1    X    X    0    0    1    1
15     0              1    1
18     1    X    1    0    0    1    1
21     0    1    0    1    1    X    0
Table 3.2 Reduced control signals
Cycle  Ed2  Ed4  Ed5
6      1    1    X
9      0    0    1
12     1    0    0
15     0    1    1
18     1    0    0
21     0    1    1
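The reduction from seven signals to three can be checked mechanically. The sketch
below expands each row of the reduced table using Ed2 = S1 = Ed6 = S0 and Ed3 = Ed5
and compares it with the legible rows of Table 3.1, treating X as a don't-care. The
table values are transcribed as read from the two tables (the cycle 15 row of Table
3.1 is omitted from the check), and all names are illustrative.

```python
X = None  # don't-care entry

# Table 3.2: cycle -> (Ed2, Ed4, Ed5)
REDUCED = {6: (1, 1, X), 9: (0, 0, 1), 12: (1, 0, 0),
           15: (0, 1, 1), 18: (1, 0, 0), 21: (0, 1, 1)}

NAMES = ("Ed2", "Ed3", "S0", "Ed4", "Ed5", "Ed6", "S1")

# Table 3.1, legible rows only: cycle -> values in NAMES order
FULL = {6: (1, X, 1, 1, X, X, X), 9: (0, 1, X, 0, 1, X, X),
        12: (1, X, X, 0, 0, 1, 1), 18: (1, X, 1, 0, 0, 1, 1),
        21: (0, 1, 0, 1, 1, X, 0)}


def expand(ed2, ed4, ed5):
    """Rebuild all seven signals from the three independent ones."""
    return {"Ed2": ed2, "Ed3": ed5, "S0": ed2, "Ed4": ed4,
            "Ed5": ed5, "Ed6": ed2, "S1": ed2}


def compatible(cycle):
    """True if the expanded row agrees with every specified Table 3.1 entry."""
    expanded = expand(*REDUCED[cycle])
    return all(want is X or expanded[name] == want
               for name, want in zip(NAMES, FULL[cycle]))
```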
3.7 Intermediate Architectures
Two lifting-based VLSI architectures for 2-D DWT for the 5/3 and the 9/7 algorithms
were proposed in the previous section based on two scan methods, overlapped and
nonoverlapped. In the architecture based on the overlapped scan method, the maximum
power consumption occurs because of the overlapped external frame memory access. On
the other hand, in the nonoverlapped architecture, the power consumption is reduced
to a minimum by eliminating the overlapped areas, which requires the addition of a
line buffer of size N. In this section, we develop a new architecture, called the
intermediate architecture, for the 5/3 and 9/7 algorithms. It aims at reducing the
power consumption of the overlapped areas, without using the expensive line buffer,
to a level between those of the two extreme architectures proposed in the previous
section; hence the name intermediate. The intermediate architectures are based on a
generalization of the overlapped scan method, which is introduced next.
3.7.1 Generalized Overlapped Scan Method
Suppose the overlapped scan method shown in Figure 3.5.1 is termed the first scan
method, since three pixels are scanned from each row. The second method scans 5
pixels from each row, the third scans 7 pixels, the fourth scans 9 pixels, and so on.
In general, the i-th scan method scans 2i+1 pixels from each row, and the number of
overlapped areas in the i-th scan method can be written as
$\lfloor (M-1)/(2i) \rfloor$. Similarly, consider the overlapped scan method shown in
Figure 3.5.3 for 9/7 as the first scan method. Then successive scan methods for 9/7
differ from those of the 5/3 only in the first run, which requires scanning 3+2i
pixels from each row, while scanning in the remaining runs remains the same. These
scan methods reduce the excess memory access, and hence the power consumption, to a
fraction 1/i of that of the first scan method. In addition, the internal memory
between the row and column processors increases by 5i registers, where i = 1, 2, 3,
... denotes the first, the second, the third scan method, and so on. The excess
memory access is due to scanning pixels in the overlapped areas twice. Figures 3.7.1
(a) and (b) show the third overlapped scan method for 5/3 and 9/7, respectively,
where the external memory access due to overlapped areas scanning is reduced by a
factor of 1/3. Thus, by adopting a higher scan method it is possible to obtain an
intermediate architecture, since the external memory access due to scanning of the
overlapped areas lies between those of the two extreme architectures based on the
overlapped and nonoverlapped scan methods.
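The effect of the scan method index i on the excess access can be tabulated with a
few lines of code. This is a minimal sketch of the expressions above for the 5/3
case at the first decomposition level; the function names are illustrative, and the
count assumes that each overlapped area is one column of N pixels read twice.

```python
def pixels_per_row(i: int) -> int:
    """Pixels scanned from each row per run in the i-th scan method (5/3)."""
    return 2 * i + 1


def overlapped_areas(n_cols: int, i: int) -> int:
    """Number of overlapped areas, floor((M - 1) / (2i))."""
    return (n_cols - 1) // (2 * i)


def excess_reads(n_rows: int, n_cols: int, i: int) -> int:
    """Extra external reads caused by rescanning the overlapped columns."""
    return n_rows * overlapped_areas(n_cols, i)
```

For a 512 x 513 image, the first method rescans 131072 pixels and the third method
43520, close to the one-third ratio stated for Figure 3.7.1.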
To gain more insight into the excess memory access, which is due to scanning the
overlapped areas twice, consider the following. For the architecture based on the
first overlapped scan method, the total external memory access time Tmo, in clock
cycles, for J levels of decomposition can be estimated as follows.
Figure 3.7.1 The third overlapped scan method (a) for 5/3 and (b) for 9/7
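$$T_{mo} = NM + N\frac{M}{2} + \frac{NM}{4} + \frac{N}{2}\cdot\frac{M}{4} + \cdots + \frac{NM}{4^{J-1}} + \frac{N}{2^{J-1}}\cdot\frac{M/2^{J-1}}{2}$$

$$T_{mo} = NM\left[1 + \frac{1}{4} + \frac{1}{16} + \cdots + \left(\frac{1}{4}\right)^{J-1}\right] + \frac{NM}{2} + \frac{NM}{8} + \frac{NM}{32} + \cdots$$

Then, using the geometric series summation formula
$\sum_{k=n_1}^{n_2} a^k = \frac{a^{n_1} - a^{n_2+1}}{1-a}$, we obtain

$$T_{mo} = \frac{3}{2}\,NM\left[\frac{4 - \left(\frac{1}{4}\right)^{J-1}}{3}\right]$$

$$T_{mo} = \frac{1}{2}\,NM\left[4 - \left(\frac{1}{4}\right)^{J-1}\right]$$

Since the term $\left(\frac{1}{4}\right)^{J-1}$ will be very small, the above equation can be reduced to

$$T_{mo} = 2NM$$

This equation can also be used to estimate the computation time of 2-D DWT
architectures.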
On the other hand, for the architecture based on the nonoverlapped scan method
shown in Figure 3.5.2, the total external memory access time, Tmn, in clock cycles for
J levels of decomposition can be estimated as

$$T_{mn} = NM\left[1 + \frac{1}{4} + \frac{1}{16} + \cdots + \left(\frac{1}{4}\right)^{J-1}\right] = NM\sum_{j=1}^{J}\left(\frac{1}{4}\right)^{j-1} \qquad (3.9)$$

$$T_{mn} \approx \frac{4}{3}\,NM \qquad (3.10)$$

Thus, the excess memory access time, Tme, due to overlapped areas scanning for J
levels of decomposition is given by

$$T_{me} = T_{mo} - T_{mn} = 2NM - \frac{4}{3}NM = \frac{2}{3}NM \qquad (3.11)$$

which is significant. In the architecture shown in Figure 3.6.2, Tme is eliminated,
and the minimum access time Tmn, and hence the minimum power, is obtained by the
nonoverlapped scan method. However, the method requires the addition of a very
expensive memory component, a line buffer, to the architecture. The intermediate
architectures are an alternative means of reducing the power consumption of the
overlapped areas, expressed in Eq (3.11), without a line buffer.
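The closed-form estimates can be compared with the exact finite sums for a given
number of levels. The sketch below follows the same approximation used in the
derivation (each level contributes its pixel count, plus half of it again when the
scan is overlapped); the function name and signature are illustrative.

```python
from fractions import Fraction


def access_time(n_rows: int, n_cols: int, levels: int, overlapped: bool) -> Fraction:
    """External memory access time in clock cycles over J = levels."""
    total = Fraction(0)
    for j in range(levels):
        # Each level works on an image with a quarter of the previous pixels.
        level_pixels = Fraction(n_rows * n_cols, 4 ** j)
        total += level_pixels
        if overlapped:
            # Overlapped columns add about half of the level's pixels again.
            total += level_pixels / 2
    return total
```

For N = M = 512 and J = 3 this gives 516096 cycles for the overlapped scan and
344064 for the nonoverlapped scan, against the limiting estimates 2NM = 524288 and
(4/3)NM, about 349525.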
3.7.2 Proposed External Intermediate Architecture
Based on the scan method shown in Figure 3.7.1 and the DDGs for 5/3 and 9/7 shown in
Figures 3.3.1 and 3.3.2, the architecture shown in Figure 3.7.2 is developed. The
architecture is valid for both the 5/3 and 9/7 algorithms, since it is developed
based on the observation that the DDGs for 5/3 and 9/7 are identical when viewed from
the outside, taking into consideration only the input and output requirements. The
architecture operates in a pipelined fashion consisting of two stages, the
row-processor (RP) and the column-processor (CP). If the external memory is scanned
with frequency f, then registers Rd0 and Rd1 should operate with frequency f and the
rest of the architecture should operate with frequency f/3, as indicated in Figure
3.7.2. The dataflow of the architecture, derived based on the 5/3 scan method shown
in Figure 3.7.1 (a), is shown in Table B.3 (Appendix B). The dataflow is identical to
the 9/7 dataflow in all runs except the first run, where 9/7 scans 9 pixels, whereas
5/3 scans 7 pixels from each row.
The clock period τ, and hence the frequency f, of the proposed overlapped,
nonoverlapped, and intermediate architectures can be determined by the following
statement, where f_m is the external memory frequency of operation, f_p is the
processor frequency, and I is the number of input pixels that are required for an
operation. I = 3 for the 5/3 and 9/7 algorithms.
Statement I 
Case If fm ~ t p then 
r = fm 




else r = tm 
At this point the processor critical path delay (t_p = 1/f_p) is expected to be much
larger than the external frame memory scan delay, t_m = 1/f_m. Therefore, the
processor delay t_p is the determining factor of the frequency f. In other words,
case 2 will always be true. The situation will change when the processors are
pipelined later.
3.7.3 Second Dataflow
The dataflow given in Table B.3 (Appendix B) is justified by the fact that each
operation performed by the RP and the CP requires three input data. In addition, since
the processor delay t_p determines the scanning frequency f_1, then

$$f_1 = 1/\tau = 3/t_p = 3f_p \qquad (3.12)$$

That is, the scanning frequency f_1 should be at least three times faster than the














































time specified by t_p. Nevertheless, it is possible to obtain a different dataflow
with a different frequency by realizing that after the first operation in each row,
the second and the third operations in the same row need only 2 pixels to be scanned.
This is because the third input pixel of the previous operation, which is also the
first input of the next operation, is already scanned and is available in register
Rd0. This implies that a new scanning frequency, f_2, can be used, which is given by

$$f_2 = 2/t_p = 2f_p \qquad (3.13)$$

The scanning frequency f_2 is two times faster than the processor's frequency of
operation f_p. Thus, with the second scanning frequency, f_2, it is possible to
achieve a great reduction in the external memory power consumption, but with a drop
in speed. The second dataflow is illustrated in Table B.4 (Appendix B).
To compare the performance of the two dataflows in terms of power consumption and
speed, consider the following. In the first dataflow, shown in Table B.3, p_1 = 27
clock cycles are needed to yield the first pair of outputs. The remaining (n-1)
outputs require 3(n-1) cycles. Thus, the total time, T1, required to yield n paired
outputs is given by

$$T_1 = \left[p_1 + 3(n-1)\right]\tau_1 \qquad (3.14)$$

Similarly, the second dataflow, shown in Table B.4, requires p_2 = 21 cycles to yield
the first pair of outputs. According to Table B.4, the remaining (n-1) outputs
require 7/3(n-1) clock cycles. Thus, the total time, T2, required to produce n paired
outputs is given by

$$T_2 = \left[p_2 + \tfrac{7}{3}(n-1)\right]\tau_2 \qquad (3.15)$$

The speedup factor is then given by

$$S = \frac{T_2}{T_1} = \frac{\left[p_2 + \tfrac{7}{3}(n-1)\right]\tau_2}{\left[p_1 + 3(n-1)\right]\tau_1} \qquad (3.16)$$

With the clock periods τ_1 = t_p/3 and τ_2 = t_p/2, which correspond to the scanning
frequencies 3f_p and 2f_p, this gives

$$S = \frac{7}{6} \qquad (3.17)$$

That means the first dataflow is 7/6 times faster than the second. In other words,
the total execution time of the second dataflow is increased by 16.7% as compared
with the total execution time of the first dataflow.
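The 7/6 figure can be verified numerically from the two total-time expressions. The
sketch below assumes clock periods of t_p/3 and t_p/2 for the first and second
dataflow (scanning frequencies three and two times the processor frequency) and the
latencies of 27 and 21 cycles quoted above; function names are illustrative.

```python
from fractions import Fraction


def total_time_first(n: int, t_p: Fraction = Fraction(1)) -> Fraction:
    """T1 = [27 + 3(n - 1)] * tau_1 with tau_1 = t_p / 3."""
    return (27 + 3 * (n - 1)) * t_p / 3


def total_time_second(n: int, t_p: Fraction = Fraction(1)) -> Fraction:
    """T2 = [21 + (7/3)(n - 1)] * tau_2 with tau_2 = t_p / 2."""
    return (21 + Fraction(7, 3) * (n - 1)) * t_p / 2


def speedup(n: int) -> Fraction:
    """Ratio T2 / T1: how much faster the first dataflow is."""
    return total_time_second(n) / total_time_first(n)
```

With these particular latencies the ratio equals 7/6 for every n, not only in the
limit of large n.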
The power consumption of VLSI architectures can be estimated [17] as

$$P = C_{total} \cdot V_0^2 \cdot f \qquad (3.18)$$

where C_total denotes the total capacitance of the architecture, V_0 is the supply
voltage, and f is the clock frequency.
To determine the amount of power reduction in the external memory that can be
achieved when the second dataflow with frequency f_2 is used, consider the following.
First, determine the power consumption due to scanning the external memory
when the nonoverlapped scan method is used with frequencies f_1 and f_2. Thus, if P_1
and P_2 denote the power consumed by the external memory for f_1 and f_2,
respectively, then P_1 and P_2 can be written as

$$P_1 = \beta \cdot C_{total} \cdot V_0^2 \cdot f_1$$

$$P_2 = \beta \cdot C_{total} \cdot V_0^2 \cdot f_2$$

where $C_{total} \cdot V_0^2 \cdot f_1$ and $C_{total} \cdot V_0^2 \cdot f_2$ are the external memory power
consumption due to the first overlapped scan method for f_1 and f_2, respectively, and
$\beta = T_{mn}/T_{mo} = 2/3$.
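Second, taking into account the fact that the scan method shown in Figure 3.7.1
reduces the power consumption of the overlapped areas by a factor of 1/3, the power
consumption due to scanning the overlapped areas using the first and the second
dataflow, P_o1 and P_o2, respectively, is given by

$$P_{o1} = \tfrac{1}{3}\,\beta_o \cdot C_{total} \cdot V_0^2 \cdot f_1$$

$$P_{o2} = \tfrac{1}{3}\,\beta_o \cdot C_{total} \cdot V_0^2 \cdot f_2$$

where $\beta_o = T_{me}/T_{mo} = 1/3$. Thus, with f_1 = 3f_p and f_2 = 2f_p, the total power
consumption due to external memory access for the first and the second dataflow,
P1_total and P2_total, are

$$P1_{total} = P_1 + P_{o1} = C_{total} \cdot V_0^2 \cdot f_p \cdot (3\beta + \beta_o)$$

and

$$P2_{total} = P_2 + P_{o2} = C_{total} \cdot V_0^2 \cdot f_p \cdot \left(2\beta + \tfrac{2}{3}\beta_o\right)$$

so that

$$\frac{P2_{total}}{P1_{total}} = \frac{2\beta + \tfrac{2}{3}\beta_o}{3\beta + \beta_o} = \frac{2}{3} \qquad (3.29)$$

Eq (3.29) implies that the power consumption due to external memory scanning in the
second dataflow is 2/3 of that of the first dataflow. In other words, the second
dataflow reduces the power consumption by 33.3% over the first dataflow.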
On the other hand, the percentage of power reduction achieved in the intermediate
architecture shown in Figure 3.7.2 for the first and the second dataflow, as compared
with the architecture based on the first overlapped scan method, follows from

$$\frac{P1_{total}}{P_{total}} = \frac{C_{total} \cdot V_0^2 \cdot f_p \cdot (3\beta + \beta_o)}{3\,C_{total} \cdot V_0^2 \cdot f_p} = \frac{7}{9} \qquad (3.30)$$

where P_total is the total power consumption of scanning the external memory for the
architecture based on the overlapped scan method. Eq (3.30) implies that the power
consumed due to scanning the external memory in the intermediate architecture based
on the first dataflow is reduced by 22.22% as compared with the architecture based on
the first scan method. Whereas,

$$\frac{P2_{total}}{P_{total}} = \frac{P2_{total}}{P1_{total}} \cdot \frac{P1_{total}}{P_{total}} = \frac{14}{27} \qquad (3.31)$$

implies that the power consumption of the external memory in the intermediate
architecture based on the second dataflow is 14/27 of that of the architecture based
on the first scanning method. In other words, the external memory power consumption
in the intermediate architecture is decreased by 48% as compared with the
architecture based on the first scan method.
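The percentages quoted in this comparison can be reproduced with exact fractions.
The sketch below is one reading of the power model that is consistent with the
stated results: power is expressed in units of C_total * V0^2 * f_p, the nonoverlapped
share is 2/3, the overlapped share is 1/3, and the third scan method keeps one third
of the overlapped share. All names are illustrative.

```python
from fractions import Fraction

BETA = Fraction(2, 3)     # share of accesses without the overlapped areas
BETA_O = Fraction(1, 3)   # share of accesses due to the overlapped areas


def external_power(scan_ratio: int, overlap_kept: Fraction) -> Fraction:
    """External memory power in units of C_total * V0^2 * f_p.

    scan_ratio   : scanning frequency divided by the processor frequency
    overlap_kept : fraction of the overlapped-area accesses still performed
    """
    return scan_ratio * (BETA + overlap_kept * BETA_O)


p_overlapped = external_power(3, Fraction(1))       # first overlapped scan method
p1_total = external_power(3, Fraction(1, 3))        # intermediate, first dataflow
p2_total = external_power(2, Fraction(1, 3))        # intermediate, second dataflow
```

The ratios p2_total / p1_total, p1_total / p_overlapped, and
p2_total / p_overlapped evaluate to 2/3, 7/9, and 14/27, matching the reductions of
33.3%, 22.22%, and 48%.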
3.8 Processors Datapath Architectures Development
To complete the architectures for 2-D DWT, the last phase is to design the row and
column processor datapath architectures for the 5/3 and 9/7 algorithms separately, so
that they fit into the three architectures shown in Figures 3.6.1, 3.6.2, and 3.7.2.
The three architectures are valid for both the 5/3 and 9/7 algorithms, since they
were developed based on the observation that the DDGs for 5/3 and 9/7 are identical
when viewed from the outside, taking into consideration only the input and output
requirements.
3.8.1 5/3 Processor's Datapath Architecture Development
Based on the 5/3 algorithm and its DDGs shown in Figure 3.3.1, the 5/3 processor
datapath architecture is shown in Figure 3.8.1. The multiplexers labeled muxe0,
muxe1, and muxe2 implement the symmetric extension. This 3-stage pipelined
processor is formed by mapping the two lifting steps of the 5/3 algorithm into two
pipeline stages. Stage 3 is added to reduce the critical path delay of stage 2,
specifically the path that connects the adders in stage 2 to the RP's output L, runs
to muxce0 through mux1, and ends at Rt4. Suppose t_a and t_x denote the adder and
multiplexer delays, respectively. Then the critical path of stage 2 becomes large,
3t_a + 3t_x, when the processor datapath is incorporated into the architecture. The
addition of stage 3, which is obtained by splitting stage 2, reduces the critical
path of stage 2 to 2t_a + t_x and that of stage 3 to t_a + 2t_x.
Stage 1 computes the high coefficients (step 1) and sends results to the output
labeled H, whereas stages 2 and 3 compute the low coefficients (step 2) and send
results to the output labeled L. According to the DDGs in Figure 3.3.1, each high
coefficient calculated in stage 1 enters not only into the calculation of the current
low coefficient in stage 2 but also into the next low coefficient calculation in
stage 2. Therefore, the Rt1 output of stage 3, which holds the high coefficient, is
fed back into Muxe1 and Muxe2 to be considered in the next low coefficient
calculation. Stage 2 of the pipeline is somewhat complicated because it implements
part of the extension, so its dataflow is explained in the following. First,
according to the DDGs for 5/3, in the calculation of the first low coefficient Y0,
the high coefficient value Y1, calculated in stage 1, must be allowed to pass through
the multiplexers labeled Muxe1 and Muxe2 to the adder in stage 2. Second, in the
calculation of the last coefficient, for example Y8 in the DDG of odd length signals
in Figure 3.3.1(a), the high coefficient (Y7) in Rt1 of stage 3 must be allowed to
pass through both Muxe1 and Muxe2 to the adder. During normal computations, which
occur between the first and last coefficient calculations, the current high
coefficient calculated in stage 1 and the previous high coefficient in Rt1 of stage 3
are allowed to pass through Muxe1 and Muxe2 to the adder, respectively. Note that in
even length signals, the last high and low coefficient calculations occur normally.
Table 3.3 shows the values of the control signals that have to be issued by the
control unit so that the extension multiplexers perform the required functions. Note
also that the shift operations indicated on the figure by the symbol >> are
hardwired.
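The two lifting steps and the boundary cases handled by the extension multiplexers
can be illustrated in software. The sketch below assumes the reversible 5/3 lifting
equations in their JPEG 2000 form (with a +2 rounding offset in step 2), which may
differ in rounding detail from the equations used elsewhere in this work, and
whole-sample symmetric extension; the function name is illustrative.

```python
def lifting_53_forward(x):
    """One level of 1-D forward 5/3 lifting on a list of integers.

    Returns (low, high). Step 1 corresponds to stage 1 (high coefficients),
    step 2 to stages 2 and 3 (low coefficients).
    """
    n = len(x)
    if n < 2:
        raise ValueError("signal must contain at least two samples")

    def reflect(i):
        # Whole-sample symmetric extension of an index outside 0..n-1.
        if i < 0:
            return -i
        if i > n - 1:
            return 2 * (n - 1) - i
        return i

    y = list(x)
    # Step 1: high coefficients at odd positions. For even n the last one
    # uses x[n-2] as both neighbours (first and third inputs).
    for k in range(1, n, 2):
        y[k] = x[k] - ((x[reflect(k - 1)] + x[reflect(k + 1)]) >> 1)
    # Step 2: low coefficients at even positions. The first one uses Y1
    # twice; for odd n the last one uses the final high coefficient twice.
    for k in range(0, n, 2):
        y[k] = x[k] + ((y[reflect(k - 1)] + y[reflect(k + 1)] + 2) >> 2)
    return y[0::2], y[1::2]
```

A constant or linearly increasing odd-length input produces all-zero high
coefficients, which is a quick sanity check on the boundary handling.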
Figure 3.8.1 5/3 processor's datapath architecture with symmetric extension
Table 3.3 Symmetric extension's control signals for 5/3
se0 se1 se2 se0 se1 se2
First 0 0 0 First 0 0 0
Normal 0 0 1 Normal 0 0
Last 0 1 Last 0
a) Even length signals b) Odd length signals
3.8.2 9/7 Processor's Datapath Architecture Development
A 6-stage pipelined datapath architecture for the 9/7 processor is shown in Figure
3.8.2. It is formed using both the 9/7 algorithm and its DDGs shown in Figure 3.3.2.
In this architecture, the pipeline stages 1, 2, 4, and 5 represent the first 4 steps
in the 9/7 algorithm. The implementations of step 5 and step 6 are incorporated in
stage 6 to allow the two steps to operate in parallel. Stage 3, which connects stage
2 with stage 4, is added because stage 4 requires two successive low coefficients
that must be produced by stage 2 in order to perform an operation. When the first
coefficient produced by stage 2 is in Rt of stage 4, the second coefficient will be
in Rt of stage 3 and will be applied to stage 4 through the path labeled forward. The
9/7 processor shown in Figure 3.8.2 can be thought of as formed by connecting two 5/3
processors through stage 3, assuming the 5/3 is a 2-stage pipelined processor.
The multiplexers in stages 2, 4 and 5, including the one labeled Muxe0, implement
the symmetric extension algorithm that is part of the DDGs in Figure 3.3.2. Table 3.4
shows the appropriate values of the control signals that must be issued by the
control unit to the extension multiplexers so that they perform the required
functions. The extension multiplexers in stages 2 and 5 function in exactly the same
way as those of the 5/3, described earlier. The normal function of the extension
multiplexer labeled muxe0 is to pass the input signal X(2n + 2) to the latch, whereas
the function of the extension multiplexer labeled muxe3 in stage 4 is to pass the
forward signal Y'(2n + 2) to the adder. Only in even length signals, and in the
calculation of the last coefficient, does muxe0 pass the input signal X(2n) to the
latch and Muxe3 pass the delay signal Y'(2n) to the adder instead of the forward
signal Y'(2n + 2). Note that multiplication operations in Figure 3.8.2 can be
implemented by only two adders as illustrated in [23].
Figure 3.8.2 The 9/7 processor's datapath architecture with extension
Table 3.4 Symmetric extension's control signals for 9/7
step1 step2 step3 step4
se0 se1 se2 se3 se4 se5
First 0 0 0 0 0 0
Normal 0 0 1 0 0 1
Last 1 0 1 1 0 1
a) Odd length signals
step1 step2 step3 step4
se0 se1 se2 se3 se4 se5
First 0 0 0 0 0 0
Normal 0 0 1 0 0 1
Last 0 1 0 1 1
b) Even length signals
3.8.3 Row and Column Processors for 5/3 and 9/7
The 5/3 and 9/7 processor datapath architectures shown in Figures 3.8.1 and 3.8.2
were developed assuming the external memory is scanned either row-by-row or
column-by-column. The CPs in the two architectures shown in Figures 3.6.1 and
3.6.2 for overlapped and nonoverlapped scan methods, respectively, scan the high and
the low coefficients generated by the RP column-by-column. However, since the CPs
alternate, in an interleaved fashion, between the high and the low coefficient
calculations, as indicated in Table B.1, the 5/3 CP's datapath and both 9/7 CPs'
datapaths, based on the scan methods shown in Figures 3.5.1 and 3.5.3, must be
modified to allow interleaving in execution. The modified 5/3 and 9/7 CPs' datapaths
are shown in Figures 3.8.3 and 3.8.4, respectively.
In the 5/3 CP shown in Figure 3.8.3, registers Rd0 and Rd1 are added to allow
interleaving in execution. The first 9/7 CP shown in Figure 3.8.4(a), which is based
on the scan method of Figure 3.5.1, is obtained by splitting stage 3 of the 9/7
processor's datapath shown in Figure 3.8.2 into two stages, which also allows
interleaving of the coefficients of two columns in execution. On the other hand, the
second 9/7 CP shown in Figure 3.8.4(b), which is based on the scan method shown in
Figure 3.5.3, is obtained by splitting stage 3 of the 9/7 processor's datapath of
Figure 3.8.2 into four stages and adding 4 registers labeled R0, R1, R2, and R3 in
stage 5. The multiplexers labeled mux control the interleaving operations. In the
first run, the control signals, sc, of the multiplexers are set to 0 to allow the
interleaving pattern of run 1 in execution, as illustrated in the dataflow Table B.2
(a). In all subsequent runs, the multiplexers' control signals are set to 1 to allow
normal interleaving of two columns.
As for the 5/3 CP in the intermediate architecture shown in Figure 3.7.2, it should
be modified as shown in Figure 3.8.5. This is necessary, since the intermediate CP
scans three columns in each H and L decomposition in a run, as illustrated in the
dataflow shown in Table B.3, and alternates between executing 3 high and 3 low
operations in the H and L decompositions.
On the other hand, the row-processors in the proposed overlapped and
nonoverlapped architectures for 5/3 and 9/7 scan the external memory according to
one of the scan methods illustrated in Figures 3.5.1, 3.5.2 and 3.5.3. A careful
examination of the scan methods and the DDGs shows that the N high coefficients of
step 1 in the 5/3 and of steps 1, 2, and 3 in the 9/7 that were calculated during a
run must be kept, in order to be used in the N operations of the next run. This
requires the addition of a temporary line buffer (TLB) of size N in stage 2 of the
5/3 and in each of stages 2, 3, and 5 of the 9/7. Thus, the RP's datapath that fits
into the two proposed architectures is obtained when a TLB is incorporated into stage
2 of the 5/3 and into each of stages 2, 3, and 5 of the 9/7, as shown in Figure
3.8.6. The inclusion of the TLB may decrease the speed of the architectures. To
maintain the speed, the TLB can be placed in a separate pipeline stage as shown in
Figure 3.8.7. However, inclusion of a TLB causes a problem because the same TLB
location must be read and written in the same clock cycle. To solve this problem, the
signal labeled R/W is connected to the clock f/3 so that the TLB can be read in the
first half cycle and written in the second half. The register labeled TLBAR (TLB
address register) generates addresses for the TLB. Initially, TLBAR is cleared to
zero by asserting signal incar (increment address register) low to point at the first
location. Then, to address the next location after each read and write, register
TLBAR is incremented by one by asserting incar high.
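The read-then-write behaviour of the TLB within one clock cycle can be captured by a
small behavioural model. The sketch below is an illustration only: the class and
method names are hypothetical, the buffer is assumed to start at zero, and the
address register is assumed to wrap to the first location after the last one so that
the value stored in one run is read back in the next.

```python
class TemporaryLineBuffer:
    """Behavioural model of a TLB of size N with its address register."""

    def __init__(self, size: int):
        self.memory = [0] * size   # assumed initial contents
        self.address = 0           # TLBAR cleared (incar low)

    def cycle(self, new_value: int) -> int:
        """One clock cycle at the addressed location.

        First half cycle: read the coefficient stored during the previous run.
        Second half cycle: overwrite it with the coefficient of this run.
        Then increment the address register (incar high).
        """
        previous = self.memory[self.address]      # read phase
        self.memory[self.address] = new_value     # write phase
        self.address = (self.address + 1) % len(self.memory)
        return previous
```

Feeding N values per run returns, in each run, the N values written during the
preceding run, which is the reuse pattern required by the RP.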
Figure 3.8.7 is appropriate for the 5/3 RP in the overlapped and nonoverlapped
architectures. To obtain the first and the second 9/7 RPs' datapath based on the scan






Figure 3.8.3 Modified 5/3 CP for overlapped and nonoverlapped architectures
Figure 3.8.4 (a) Modified first 9/7 CP based on the scan method of Figure 3.5.1 for
overlapped and nonoverlapped architectures
Figure 3.8.4 (b) Modified second 9/7 CP based on the scan method of Figure 3.5.3 for
overlapped and nonoverlapped architectures
Figure 3.8.5 Modified stage 2 of 5/3 CP for intermediate architecture
Figure 3.8.6 Incorporation of a TLB in stage 2 of the RP
Figure 3.8.7 TLB in a separate pipeline stage
3.8.2 should be modified as shown in Figures 3.8.8 (a) and (b), respectively. The 
operations of the multiplexers labeled mux in Figure 3.8.8 (b) can be controlled by 
setting the select signals, sr, of the multiplexers 0 during the first run and I in all 
subsequent runs. 
A careful examination of the 9/7 DDGs shows that the computations of the last
run do not yield all the required output coefficients. Thus, to obtain the remaining
output coefficients, the control unit should be instructed to execute one more run,
called the extra run. In addition, examination of the last run's portion of the 9/7 DDG
for odd length signals shows that the extension signal labeled sre1 must be set to 1 in
order to compute the operation in the level labeled Y'(2n) in the DDG. However,
when the computation reaches level Y'(2n), the operation in that level requires signal
sre1 to be set to 0. Furthermore, in the extra run, the operation at level Y'(2n)
requires signal sre1 to be set to 1. Therefore, a circuit consisting of an AND gate and
an inverter is inserted into stages 6 and 7 of Figures 3.8.8 (a) and (b), respectively.
The circuit operates according to Table B.5 (a). However, in the case of even length
signals, according to the DDG of the 9/7, both sre1 and Q1 are set to 0 in all runs.
Similarly, examination of signal sre0 in the last and extra runs, for both even and
odd length signals, reveals that this signal should also be set according to Table
B.5 (a), and the circuit consisting of the AND gate and the inverter should be inserted
into stages 4 and 5 of Figures 3.8.8 (a) and (b), respectively. For the architecture
developed based on the scan method of Figure 3.5.1, signal sre2 should be set
according to Table B.5 (b), and the circuit consisting of the AND gate and the inverter
should be inserted into stage 6 of Figure 3.8.8 (a).
Furthermore, to allow TLB3 of Figure 3.8.8 (b) to store the coefficients generated
by stage 6 in the first run, a circuit consisting of a multiplexer and an inverter is
inserted into stage 6 of Figure 3.8.8 (b). In addition, to allow register TLBAR3 to
address the first location of TLB3 when a transition is made from run 1 to run 2, a
circuit consisting of a multiplexer, two inverters, and an AND gate is inserted into
stage 5 of Figure 3.8.8 (b).
On the other hand, to obtain the RP datapath for the 5/3 and 9/7 intermediate
architectures, stage 2 of the 5/3 and stages 2 and 5 of the 9/7 datapath architectures
shown in Figures 3.8.1 and 3.8.2 should be modified as shown in Figure 3.8.9. The
advantage of this arrangement is that the TLB is not required to be read and written in
the same clock cycle.
Furthermore, examination of step 2 (Y''(2n)) in the 9/7 DDGs shows that the fourth
low coefficients, labeled Y'(6), calculated for each row in a run using the third
intermediate scan method should be stored in a buffer of size N, since they are
required in the N operations of the next run. This requires the addition of another TLB
in stage 3 of the 9/7 datapath architecture shown in Figure 3.8.2. Figure 3.8.10 shows
how this TLB can be incorporated into stage 3 of Figure 3.8.2 to form the required





Figure 3.8.8 (a) Modified first 9/7 RP based on scan method 3.5.1 for overlapped and
nonoverlapped architectures
be read and written in the same clock cycle. Figures 3.8.9 and 3.8.10 form the first 5
stages of the modified 7-stage 9/7 RP for the intermediate architecture and the remaining










Figure 3.8.8 (b) Modified second 9/7 RP based on scan method 3.5.3 for overlapped
and nonoverlapped architectures
Figure 3.8.9 Modified RP datapath for 5/3 and 9/7 intermediate architectures
Figure 3.8.10 Incorporation of a TLB in stage 3 of Figure 3.8.2 to form the 9/7 RP for
intermediate architecture
3.9 Evaluation of architectures
In section 3.7.2, it is mentioned that statement 1 can be used to determine the
frequency f of the architectures. Pipelining the processors into k stages changes the
frequency f, which can then be determined by the following statement, a slight
modification of statement 1.
Statement 2
case 1: If $t_m \ge t_p / k$ then
    $\tau = t_m$
case 2: If $t_p / (l \cdot k) \ge t_m$ then
    $\tau = t_p / (l \cdot k)$
else $\tau = t_m$
where $\tau = 1/f$ is the clock period, and $t_m = 1/f_m$ and $t_p = 1/f_p$ are the critical path
delays of the external frame memory and the processors, respectively.
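The selection rule can be tried out numerically. The following is a minimal sketch, assuming the statement reads as: case 1 applies when $t_m \ge t_p/k$ (clock period $t_m$), case 2 applies when $t_p/(l \cdot k) \ge t_m$ (clock period $t_p/(l \cdot k)$), and otherwise the clock period is $t_m$. The function name `clock_period` and the returned case labels are hypothetical and not part of the thesis.

```python
def clock_period(t_p: float, t_m: float, k: int, l: int = 3):
    """Pick the clock period of a k-stage pipelined architecture.

    t_p: critical path delay of the (nonpipelined) processor
    t_m: critical path delay of the external frame memory
    l:   number of pixels scanned per operation (3 for 5/3 and 9/7)
    Returns (period, label) under the assumed reading of the statement.
    """
    if t_m >= t_p / k:
        # Slow memory: the memory access time sets the clock.
        return t_m, "case 1"
    if t_p / (l * k) >= t_m:
        # Fast memory: l pixels can be scanned within one stage delay.
        return t_p / (l * k), "case 2"
    # In between: the memory still limits the clock.
    return t_m, "else"


if __name__ == "__main__":
    print(clock_period(12.0, 5.0, 6))
    print(clock_period(18.0, 0.5, 6))
    print(clock_period(18.0, 2.0, 6))
```

The delays passed in the usage lines are illustrative values only; they are not measurements from the thesis.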
In the statement above, either case 1 or case 2 can hold. Case 2 implies
the availability of a very high speed scan that can scan the three pixels required for an
operation within the time limit $t_p/k$. In that case, with the processors of the
architectures shown in Figures 3.6.1, 3.6.2, and 3.7.2 pipelined,
the hardware utilization is 100% and the architectures are complete. Now, suppose
$\tau_1$ and $\tau_2$ denote the clock periods of the architectures before and after pipelining,




And from statement 2, case 2,
$$\tau_2 = \frac{t_p}{l \cdot k}$$
The speedup factor S is given by












Thus, the architectures with pipelined processors are k times faster than the
architectures with nonpipelined processors, with efficiency 1.
On the other hand, case 1 implies a low scanning frequency. That means scanning
the three pixels required for an operation takes at least $3t_p/k$ seconds, or
three clock cycles, where $t_p/k$ is the stage critical path delay of the pipelined
processor. In that case, the architectures with pipelined processors are
underutilized 2/3 of the time, since every three clock cycles yield one output. In addition,
the speedup due to pipelining is proportional to k. To determine that, consider the




The speedup factor S is then given by
$$S = \frac{k}{l} \qquad (3.37)$$
The efficiency
$$E = \frac{S}{k} = \frac{1}{l} \qquad (3.38)$$
Thus, in 9/7 architectures, a speedup factor of 2 can be achieved by pipelining the
processors, since k = 6 and l = 3, but no speedup can be achieved in the case of 5/3
architectures, since k = 3, and the efficiency is very low, 1/3.
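The figures quoted for the two cases can be reproduced with a few lines of arithmetic. The sketch below assumes that the speedup is S = k/l when the scan is slow (case 1) and S = k when the scan keeps pace with the pipeline (case 2), with efficiency E = S/k in both cases; these forms are inferred from the numbers in the text, and the function name `pipelining_gain` is hypothetical.

```python
from fractions import Fraction


def pipelining_gain(k: int, l: int = 3, fast_scan: bool = False):
    """Return (speedup, efficiency) of a k-stage pipelined processor.

    fast_scan=False models case 1 (one output every l clock cycles);
    fast_scan=True models case 2 (one output every clock cycle).
    Exact fractions are used so that 1/3 is not rounded.
    """
    speedup = Fraction(k) if fast_scan else Fraction(k, l)
    efficiency = speedup / k
    return speedup, efficiency


if __name__ == "__main__":
    print("9/7, slow scan:", pipelining_gain(6))
    print("5/3, slow scan:", pipelining_gain(3))
    print("9/7, fast scan:", pipelining_gain(6, fast_scan=True))
```

With k = 6 and l = 3 the slow-scan speedup is 2, and with k = 3 it is 1, which matches the values stated for the 9/7 and 5/3 architectures.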
The underutilization and speedup problems can be alleviated, and the entire
architecture can be made to operate with frequency $f = k/t_p$ and be fully utilized,
producing outputs every cycle, if the architecture is allowed to read the three pixels
required for an operation from the external memory in parallel every clock cycle instead
of one pixel at a time. Of course, that requires three buses instead of one to scan the
external frame memory. The parallel scan architectures can be obtained by slight
modifications of the architectures shown in Figures 3.6.1, 3.6.2, and 3.7.2 on the RP
side only, as shown in Figures 3.9.1, 3.9.2, and 3.9.3, respectively, since the
modifications affect only this part of the architecture and the other parts remain the
same. The 5/3 dataflow of the pipelined overlapped and nonoverlapped parallel scan
architectures in Figures 3.9.1 and 3.9.2, respectively, is shown in Table B.6,
whereas the dataflow of the pipelined intermediate parallel scan architecture, Figure
3.9.3, is shown in Table B.7. Tables B.6 and B.7 are derived assuming the RP and the
CP are 4-stage and 3-stage pipelined processors, respectively.
A problem occurs in the line buffer (LB) of Figure 3.9.2 because the same
memory location in the line buffer must be read and written in the same clock cycle.
To solve this problem, the LB is read in the first half cycle and written in the
second half. To perform this operation, the clock line is connected to the control
signal labeled R/W of the LB. When the clock is low, a read takes place and the result
is loaded into Rd by the positive transition of the clock; when the clock is high, a write
takes place, as illustrated in Figure 3.9.2. When the signal labeled Elb (enable
LB) is asserted high, read and write take place; otherwise, neither takes place.
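The read-then-write behaviour of the LB within one clock cycle can be illustrated with a small behavioural model. This is only a sketch under simplifying assumptions: one method call stands for one full clock cycle, the read happens in the first (low) half and the write in the second (high) half, and the class and method names (`LineBuffer`, `clock_cycle`) are hypothetical.

```python
class LineBuffer:
    """Behavioural model of a line buffer read and written in one cycle."""

    def __init__(self, size: int):
        self.mem = [0] * size   # storage locations of the LB
        self.rd = None          # register Rd that captures the value read

    def clock_cycle(self, addr: int, data_in: int, elb: bool = True):
        """Simulate one clock cycle at address addr.

        With elb low, neither the read nor the write takes place.
        """
        if not elb:
            return self.rd
        # First half cycle (clock low): read the old content; the positive
        # clock transition then loads it into Rd.
        self.rd = self.mem[addr]
        # Second half cycle (clock high): write the new value to the same
        # location.
        self.mem[addr] = data_in
        return self.rd


if __name__ == "__main__":
    lb = LineBuffer(4)
    print(lb.clock_cycle(2, 7))   # old content of location 2
    print(lb.clock_cycle(2, 9))   # value written in the previous cycle
```

The point of the ordering is that the value read in a cycle is always the one stored before that cycle's write, so a single-port memory location can serve both accesses.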
To compare the performance of the pipelined parallel scan architectures
with that of the nonpipelined sequential scan architectures shown in Figures 3.6.1 and
3.6.2, consider the following. In the architectures shown in Figures 3.6.1 and 3.6.2,
$p_1 = 15$ clock cycles (Table B.1) are needed to complete the execution of the first
operation, whereas $p_1 = 27$ clock cycles are needed in the intermediate architecture
shown in Figure 3.7.2 (Table B.3). The remaining (n - 1) operations require l(n - 1)
cycles, where l = 3 for 5/3 and 9/7. Thus, the total time required to perform n
operations or tasks is
$$T(\text{non})_{seq} = [p_1 + l \cdot (n-1)]\,\tau_1 \qquad (3.39)$$
where $\tau_1 = 1/f_1$ is the clock period. On the other hand, the pipelined overlapped and
nonoverlapped parallel scan architectures shown in Figures 3.9.1 and 3.9.2 require



















Figure 3.9.1 Pipelined overlapped parallel scan architecture
Figure 3.9.2 Pipelined nonoverlapped parallel scan architecture
Figure 3.9.3 Pipelined intermediate parallel scan architecture
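$p_2 = 14$ clock cycles for 5/3 (Table B.7) are needed in the pipelined intermediate
parallel scan architecture shown in Figure 3.9.3. The remaining (n - 1) tasks require
(n - 1) cycles. The total time required to execute n tasks is given by
$$T(\text{pipe})_{par} = [p_2 + (n-1)]\,\tau_2$$
The speedup factor is then given by
$$S = \frac{T(\text{non})_{seq}}{T(\text{pipe})_{par}} = \frac{[p_1 + l \cdot (n-1)]\,\tau_1}{[p_2 + (n-1)]\,\tau_2}$$
For large n, the above equation reduces to
$$S = \frac{l \cdot \tau_1}{\tau_2} = k$$
The efficiency
$$E = \frac{S}{k} = 1$$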






That is, the pipelined parallel scan architectures are k times faster than the
nonpipelined sequential scan architectures, with efficiency 1.
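How quickly the speedup approaches k as the number of tasks grows can be checked numerically. The sketch below assumes the total times are $[p_1 + l(n-1)]\tau_1$ for the nonpipelined sequential scan and $[p_2 + (n-1)]\tau_2$ for the pipelined parallel scan, and that $\tau_1 = t_p/l$ and $\tau_2 = t_p/k$; the function names are hypothetical, and the values $p_1 = 27$ and $p_2 = 14$ are the intermediate-architecture figures quoted in the text.

```python
def total_time_nonpipelined(n: int, p1: int, tau1: float, l: int = 3) -> float:
    """Time for n tasks on a nonpipelined sequential scan architecture."""
    return (p1 + l * (n - 1)) * tau1


def total_time_pipelined_parallel(n: int, p2: int, tau2: float) -> float:
    """Time for n tasks on a pipelined parallel scan architecture."""
    return (p2 + (n - 1)) * tau2


def speedup(n: int, p1: int, p2: int, tau1: float, tau2: float, l: int = 3) -> float:
    """Ratio of the nonpipelined time to the pipelined parallel time."""
    return (total_time_nonpipelined(n, p1, tau1, l)
            / total_time_pipelined_parallel(n, p2, tau2))


if __name__ == "__main__":
    # Assumed example: t_p = 3 time units, l = 3, k = 3, so tau1 = tau2 = 1.
    for n in (1, 10, 1000, 10**6):
        print(n, speedup(n, p1=27, p2=14, tau1=1.0, tau2=1.0))
```

For small n the fill latencies $p_1$ and $p_2$ dominate; only for large n does the ratio settle near k.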
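The throughput, H, which is defined as the number of tasks (operations)
performed per unit time, can be written as
$$H(\text{non})_{seq} = \frac{n}{[p_1 + l \cdot (n-1)]\,\tau_1} \qquad (3.44)$$
$$H(\text{pipe})_{seq} = \frac{n k}{[p_2 + l \cdot (n-1)]\,\tau_1} \qquad (3.45)$$
$$= \frac{n k f_1}{p_2 + l \cdot (n-1)} \qquad (3.46)$$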
(3.47) 
(3.48) 
The maximum throughput, $H^{max}$, occurs when n is very large ($n \to \infty$), and in these




$$H(\text{pipe})^{max}_{par} = H(\text{pipe})^{max}_{seq} = k f_1 / l \qquad (3.50)$$
The throughputs of the pipelined parallel and sequential scan architectures have
increased by a factor of k as compared with the nonpipelined architectures.
Based on the above evaluations, we can conclude that both the pipelined sequential
and parallel scan architectures achieve the same performance in terms of speedup,
efficiency, and throughput.
To evaluate the power consumption of the pipelined parallel scan architectures
shown in Figures 3.9.1, 3.9.2, and 3.9.3 and that of the pipelined sequential scan
architectures shown in Figures 3.6.1, 3.6.2, and 3.7.2, consider the following. First,
consider the power consumption of the pipelined parallel and sequential scan
architectures without external memory. From Eq (3.36) the frequency of the pipelined
parallel scan architectures is
(3.51)
whereas from Eq (3.33) the frequency of the pipelined sequential scan
architectures is
(3.52)
If the total capacitances, $C_{total}$, of the parallel and sequential scan architectures are
equal, then they also consume the same power.
On the other hand, the external memory power consumption of the pipelined
sequential and parallel scan architectures can be obtained as follows. The total power
consumption of the external memory for the pipelined overlapped sequential scan
architecture, $P_m(\text{over})_{seq}$, is written as
$$P_m(\text{over})_{seq} = l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p \qquad (3.53)$$
where $C^m_{total}$ is the total capacitance of the external memory. The total external
memory power consumption for the pipelined nonoverlapped sequential scan
architecture, $P_m(\text{nonover})_{seq}$, is written as
$$P_m(\text{nonover})_{seq} = \beta \cdot l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p \qquad (3.54)$$
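whereas the total external memory power consumption of the pipelined intermediate
sequential scan architecture, $P_m(\text{int})_{seq}$, can be obtained as follows. The
power consumption due to scanning the overlapped areas of the external memory
sequentially, $P_o(\text{int})_{seq}$, is given by
$$P_o(\text{int})_{seq} = \beta_o \cdot l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p / 3 \qquad (3.55)$$
where $l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p$ is the external memory power consumption of the
pipelined overlapped sequential scan architecture. Then
$$P_m(\text{int})_{seq} = P_m(\text{nonover})_{seq} + P_o(\text{int})_{seq}$$
$$= l \cdot k \cdot f_p \cdot C^m_{total} \cdot V_0^2\,(\beta + \beta_o/3) \qquad (3.58)$$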
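On the other hand, the total external memory power consumption of the pipelined
overlapped and nonoverlapped parallel scan architectures, $P_m(\text{over})_{par}$ and
$P_m(\text{nonover})_{par}$, respectively, are written as
(3.59)
$$P_m(\text{over})_{par} = l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p \qquad (3.60)$$
$$P_m(\text{nonover})_{par} = \beta \cdot l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p \qquad (3.61)$$
whereas the total external memory power consumption of the pipelined intermediate
parallel scan architecture, $P_m(\text{int})_{par}$, can be obtained as follows. The
power consumption due to scanning the overlapped areas of the external memory by
the parallel scan architecture, $P_o(\text{int})_{par}$, is given by
$$P_o(\text{int})_{par} = \beta_o \cdot l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p / 3$$
where $l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p = P_m(\text{over})_{par}$. Then
$$P_m(\text{int})_{par} = P_m(\text{nonover})_{par} + P_o(\text{int})_{par}$$
$$= l \cdot k \cdot \beta \cdot C^m_{total} \cdot V_0^2 \cdot f_p + \beta_o \cdot l \cdot k \cdot C^m_{total} \cdot V_0^2 \cdot f_p / 3$$
$$= l \cdot k \cdot f_p \cdot C^m_{total} \cdot V_0^2\,(\beta + \beta_o/3) \qquad (3.65)$$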





The above evaluations show that the external memory power consumptions of the
sequential and parallel overlapped architectures are equal (Eqs 3.53 and 3.60), as are
those of the sequential and parallel nonoverlapped architectures (Eqs 3.54 and 3.61)
and those of the sequential and parallel intermediate architectures (Eqs 3.58 and 3.65).
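The three external memory power expressions differ only by a scale factor, which a short calculation makes visible. The sketch below assumes the forms $l k C^m_{total} V_0^2 f_p$ for the overlapped case, that value scaled by $\beta$ for the nonoverlapped case, and scaled by $(\beta + \beta_o/3)$ for the intermediate case. The function name and all numeric inputs in the usage lines are hypothetical illustrations, not values taken from the thesis.

```python
def external_memory_power(l, k, c_total, v0, f_p, beta, beta_o):
    """Return external memory power for the three scan methods.

    beta and beta_o are the scaling factors that appear in the
    nonoverlapped and overlapped-area terms, respectively.
    """
    p_over = l * k * c_total * v0 ** 2 * f_p
    return {
        "overlapped": p_over,
        "nonoverlapped": beta * p_over,
        "intermediate": (beta + beta_o / 3) * p_over,
    }


if __name__ == "__main__":
    # Illustrative inputs only (normalized capacitance, voltage, frequency).
    powers = external_memory_power(l=3, k=3, c_total=1.0, v0=1.0,
                                   f_p=1.0, beta=0.5, beta_o=0.3)
    for name, value in powers.items():
        print(f"{name}: {value:.3f}")
```

Because the sequential and parallel expressions are identical term by term, the same function covers both scan styles.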
In the following, an estimate of the total number of operations performed by the
row-processor for j levels of decomposition is derived. The number of operations
performed by the row-processor in each level of decomposition can be written as
nl = N (I M; r lJ (3.66) 
n2 = l N/2 J(ll M/~ J+ llJ (3.67) 
n3 = lN/4J(IlMi~J+ 1l) (3.68) 
n4 = lN/8 J(llM/~ J+ ll) (3.69) 
(3.70) 
Then the total number of operations (n) performed by the RP for j levels of 
decomposition can be estimated as 
n = N[rM2+ 11J+ ~I ~2+ 11]< I ,~2+ l + 2~' [12~~ +Ill 
n=+NM[I+~+ I~+ 6~ + ··· +(~r}v[ ++~+i+ +(±r] 
. l"{l!) +(ll}" :u •(!J']HlJ' 
+M['l!i'} H· (it: 








Eq (3.75) also estimates the total number of operations performed by the CP and the 
total number of paired outputs for j levels of decomposition. 
3.10 Combined 5/3 and 9/7 Architecture
The 9/7 processor datapath architecture of Figure 3.8.2 can be viewed as formed by
connecting two 5/3 processors through stage 3, assuming the 5/3 is a 2-stage pipelined
processor. That suggests the possibility of modifying the 9/7 processor datapath
architecture shown in Figure 3.8.2 such that it performs both the 9/7 and 5/3 algorithms.
To obtain such a processor architecture, the 5/3 algorithm is incorporated in stages 1, 2,
and 3 of Figure 3.8.2, as shown in Figure 3.10.1. The value of the control signal
labeled lossless/lossy determines which function the architecture performs. If
lossless/lossy is 0, the architecture performs the lossy 9/7; otherwise, it performs the
lossless 5/3. The combined architecture is useful and very efficient in situations where
the encoder at one site is required to perform either lossless or lossy image
compression. The advantage of the combined architecture is that a substantial saving




Figure 3.10.1 Combined 9/7 and 5/3 processors datapath architecture
3.11 Conclusions 
In this chapter, three novel high-speed pipelined VLSI architectures, namely the
overlapped, nonoverlapped, and intermediate architectures, were developed for 5/3 and
9/7. The pipelining technique is utilized to achieve high-speed performance. The
advantage of the overlapped and intermediate architectures is that they only require a
total temporary line buffer (TLB) of size N and 3N for 5/3 and 9/7, respectively. The
intermediate architecture, which is an alternative form for reducing the power
consumption of the overlapped areas of the external memory expressed in Eq (3.9),
reduces the external memory power consumption by 22.22% as compared with
that of the architecture based on the first overlapped
scan method. However, the intermediate architecture with the second dataflow, Table
B.4, reduces the power consumption of the external memory by 48%. Therefore, the
intermediate architecture could be a very good candidate in applications where power




PARALLEL ARCHITECTURES DEVELOPMENT 
4.1 Introduction 
In chapter 3, three pipelined architectures were developed. In the first architecture,
which is based on the first overlapped scan method, the maximum power
consumption occurs due to overlapped external memory access. In the second
architecture, which is based on the nonoverlapped scan method, the power
consumption of the external memory is reduced to a minimum by eliminating the
overlapped areas, but a line buffer (LB) must be added to the architecture.
The intermediate architecture, which is based on the generalized overlapped scan
method, is introduced to reduce the power consumption of the external memory
access, without using the expensive line buffer, to a level between that of
the first scan method and that of the nonoverlapped scan method.
In this chapter, to further increase the performance in order to closely meet real-
time applications of DWT with demanding requirements, the parallel architectures 
based on the first scan method and the parallel form of the intermediate architectures 
will be designed. First, the parallel architectures based on the first overlapped scan 
method will be developed followed by the intermediate parallel architectures. 
In general, the scan frequency $f_l$, and hence the period $\tau_l = 1/f_l$, of the parallel
pipelined architectures can be determined by the following statement when the
pixels required for an operation are scanned simultaneously in parallel. Suppose $t_p$
and $t_m$ are the processor and the external memory critical path delays, respectively.
Statement 3
If $t_p/(l \cdot k) \ge t_m$ then
    $\tau_l = t_p/(l \cdot k)$
else $\tau_l = t_m$
where l = 2, 3, 4, ... denotes 2-, 3-, and 4-parallel, and $t_p/k$ is the stage critical path
delay of a k-stage pipelined processor.
4.2 Parallel architectures based on first scan method
In this section, three parallel architectures based on the first overlapped scan method
will be developed for the 5/3 and 9/7 2-D DWT algorithms. These three parallel
architectures will be referred to as
• 2-parallel pipelined architecture.
• 3-parallel pipelined architecture.
• 4-parallel pipelined architecture.
The 2-parallel, 3-parallel, and 4-parallel architectures increase the
speedup by factors of 2, 3, and 4, respectively, as compared with the single pipelined
architecture based on the first scan method developed in chapter 3.
4.2.1 2-parallel pipelined external architecture
Based on the first overlapped scan methods shown in Figures 3.5.1 and 3.5.3 and the
DDGs for 5/3 and 9/7, respectively, the 2-parallel architecture shown in Figure 4.2.1 is
developed for 5/3 and 9/7. The architecture is valid for both the 5/3 and 9/7 algorithms,
since it is developed based on the observation that the DDGs for 5/3 and 9/7 are
identical when looked at from outside, taking into consideration only the input
and output requirements.
The architecture consists of two k-stage pipelined row-processors labeled RP1 and
RP2 and two k-stage pipelined column-processors labeled CP1 and CP2. The
architecture scans the external memory with frequency $f_2$ and operates with
frequency $f_2/2$. The buses labeled bus0, bus1, and bus2 are used for transferring, in
every clock cycle, 3 pixels from the external memory to the RP's latches Rt0, Rt1, and Rt2.
RP1's latches load data every time clock $f_2/2$ makes a positive transition,
whereas RP2's latches load data every time a negative transition occurs, as indicated





Figure 4.2.1 2-parallel pipelined external architecture
On the other hand, the column-processors CP1 and CP2 and their associated latches
load new data every time clock $f_2/2$ makes a positive transition.
The DDGs for even length signals show that in the calculations of the last high and
low coefficients, only the last two pixels in a row r, at locations X(r, M-2) and
X(r, M-1), are read from the external memory. In addition, the extension part of the
DDGs for even length signals requires the pixel located at X(r, M-2) to be considered
as both the first and the third input. This pixel must be passed to RP2 together with
the second input pixel from location X(r, M-1) to compute the last high and low
coefficients in row r. Thus, the multiplexer labeled muxre0, which is an extension
multiplexer, passes the data coming through bus2 in all cases, except when the row
length (M) of an image is even and the last high and low coefficients in a row r are
being calculated. In that case, the pixel at location X(r, M-2), which is read into bus0,
must be allowed to pass through muxre0 and then loaded into Rt2 as well as Rt0. The
two multiplexers labeled muxce0, attached to the CPs, are also extension multiplexers
and operate similarly to muxre0 when the DWT is applied column-wise by the CPs.
The three multiplexers labeled muxc allow either the external memory or the LL-RAM
data to be passed to the RP's latches Rt0, Rt1, and Rt2.
On the other hand, when the row length of an image is odd, according to the
DDGs for odd length signals, only one pixel, the last one at location X(r, M-1),
should be passed to RP1 to calculate the last low coefficient.
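The boundary handling described for even and odd row lengths can be summarized as a list of pixel indices scanned per operation. The sketch below is an illustration only: it assumes that each regular operation reads the three pixels at column indices 2n, 2n+1, and 2n+2 of a row, that for even M the last operation reuses X(r, M-2) as its third input, and that for odd M the last low coefficient needs only X(r, M-1). The function name `row_scan_schedule` is hypothetical.

```python
def row_scan_schedule(m: int):
    """List the column indices read for each operation on a row of length m.

    Regular operations read (2n, 2n+1, 2n+2). The final entry models the
    symmetric extension at the right edge of the row.
    """
    ops = [(start, start + 1, start + 2) for start in range(0, m - 2, 2)]
    if m % 2 == 0:
        # Even length: X(M-2) serves as both the first and the third input,
        # which is what the extension multiplexer selects.
        ops.append((m - 2, m - 1, m - 2))
    else:
        # Odd length: only the last pixel is needed for the last low
        # coefficient.
        ops.append((m - 1,))
    return ops


if __name__ == "__main__":
    print(row_scan_schedule(6))
    print(row_scan_schedule(5))
```

In the even case the third index of the last tuple equals the first, which corresponds to the pixel on bus0 being routed into Rt2 as well as Rt0.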
The dataflow of the architecture is shown in Table B.8. This dataflow table is
derived based on the 5/3 scan method shown in Figure 3.5.1, and it is identical to the
9/7 dataflow except in the first run, where the 9/7 scan method shown in Figure 3.5.3
requires scanning 5 pixels from each row. The 5/3 scan method shown in Figure 3.5.1
is also a valid scan method for 9/7, and the dataflow for 5/3 shown in Table B.8 would
be identical to the 9/7 dataflow derived using the 5/3 scan method, except in the first
run, where the 9/7, according to its DDGs, would not be able to yield any output
coefficients. In the first run, the 9/7 RPs are able to compute only the two coefficients
labeled Y'(1) and Y'(0) in the DDGs for each row, and these coefficients can be
stored in TLBs so that they can be used in the computations of the next run. The
inclusion of TLBs will be discussed later, when the modified RP datapath architecture
is developed.
The utilization of the 5/3 scan method as a unified scan method for both 5/3 and
9/7 gives the following advantages:
• Similar, if not identical, control algorithms can be used for both 5/3 and
9/7.
• Ease of integration of the 5/3 into the 9/7 processor datapath architecture
for the combined 5/3 and 9/7 architecture.
For these two reasons, the 5/3 scan method is preferred as a unified scan method for
both 5/3 and 9/7 and, therefore, will be used in all parallel architectures developed in
this chapter.
Note that according to the first overlapped scan method shown in Figure 3.5.1, at
any particular time 3 columns are considered for scanning, and in every clock cycle 3
pixels are scanned, one from each column, until the end of the columns is reached,
which completes a run. Then a transition is made to the beginning of the next 3
columns to initiate another run. In the clock cycle where a transition occurs, especially
when the column length of an image is odd, the external memory should not be
scanned, since during that cycle each of the two CPs computes the last low coefficient,
as required by the DDGs for odd length signals. That is, during that cycle no pixel is
loaded into RP2's latches, while the control is allowed to return to RP1 by the pulse
ending the cycle. This also implies that each run begins at RP1 and that the high
coefficients generated during a run, which are required in the computations of the next
run, are stored in the TLB of the RP that generated them.
Figure 4.2.2 shows how stage 2 of the pipelined 5/3 RP and stages 2, 3, and 5 of
the pipelined 9/7 RP should be modified when they are incorporated into the
2-parallel architecture processors. The modifications require the addition of a TLB of
size N/2 in each stage mentioned. The TLB is necessary, according to the DDGs, to
keep the N coefficients calculated during a run in each of stages 1, 2, and 4 of Figure
3.8.2 that are also needed in the N operations of the next run. Signal R/W (read/write)
is connected to the clock $f_2/2$ in Figure 4.2.2 so that the TLB can be read in the first
half cycle and written in the second half, as required. The data read in the first half
cycle, for example from TLB1, is stored in register Rd1 by the negative edge of the
clock. Then the positive edge of the clock loads it into the latch of the next stage. Note
that each of the 2-parallel 9/7 RPs is identical to the RP shown in Figure 3.8.8 (a).
The register labeled TLBAR (TLB address register) generates addresses for the 
TLB. Initially, register TLBAR is cleared to zero by asserting signal incar low to point 
at the first location in the TLB. Then to address the next location after each read and 












Figure 4.2.2 Modified 2-parallel RPs
4.2.2 3-parallel pipelined architecture
The 3-parallel pipelined architecture is shown in Figure 4.2.3, and its dataflow, based
on the 5/3 scan method shown in Figure 3.5.1, is given in Table B.9. The architecture
has two more processors, labeled RP3 and CP3, than the 2-parallel architecture shown
in Figure 4.2.1. The architecture operates with frequency $f_3/3$ and scans the external
memory with frequency $f_3$.
Figure 4.2.4 shows two waveforms for the frequency $f_3/3$, labeled $f_{3a}$ and $f_{3b}$.
RP1 and its associated latches use the clock $f_{3a}$, whereas RP2 and RP3
and their associated latches use the clock $f_{3b}$, as indicated in Figure 4.2.3.
In every clock cycle, 3 pixels are scanned from the external memory and loaded
into the latches of one of the RPs. First, RP1's latches are loaded, then RP2's latches,
followed by RP3's latches, and then the process repeats. The latches of the 3
row-processors should be loaded with the required data within the time limit specified
by $t_p/k$ before
Figure 4.2.4 Waveforms of the 2 clocks used in 3-parallel
it repeats. RP1's and RP2's latches are loaded every time clocks $f_{3a}$ and $f_{3b}$ make a
positive transition, respectively, whereas RP3's latches are loaded each time clock
$f_{3b}$ makes a negative transition.
The extension multiplexers labeled muxre0 and muxce0 in Figure 4.2.3 function
in the same way as the 2-parallel extension multiplexers described in section 4.2.1. In
addition, note that RP3 has two Rtl output latches, labeled Rtl3a and Rtl3b, instead
of one, because the dataflow in Table B.9 requires the presence of such latches. These
latches are sometimes required to hold their contents for more than one clock cycle
with respect to clock $f_{3b}$. Therefore, the control signals e3a and e3b are added to
control the loading of these two latches.
The strategy adopted in this architecture is that each run must begin at RP1. The
advantage of this strategy is that it does not require any modifications to the RPs'
datapath architecture shown in Figure 4.2.2, except that each of the 3 RPs in the
3-parallel architecture has a TLB of size $\lceil N/3 \rceil$, while any other strategy would
greatly complicate the RPs' datapath and the control circuitry. Application of this
strategy requires that if a run ends at RP1, then the next run should begin after 2 clock
cycles, during which the external memory is not scanned, whether the column length
(N) is even or odd. But if a run ends at RP2, then the next run must begin after one
clock cycle. The external memory is also not scanned during this cycle, whether N is
even or odd.
On the other hand, if a run ends at RP3 and N is even, then the next run can begin
immediately; otherwise, if N is odd, 3 clock cycles must elapse before the next
run can begin. These guidelines are necessary in order to avoid any conflict in the
dataflow. To identify the RP at which a run ends, a 2-bit register can be used. The
register is initially set to 0 and is then incremented by one every clock cycle to count
from 1 to 3 repeatedly. When a run ends, the 2-bit register contains the RP
number.
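The run-transition guidelines amount to a small lookup keyed on the RP at which a run ends and on the parity of N. The following sketch illustrates them; the function names `ending_rp` and `idle_cycles`, and the parameter `loads_in_run` (the number of clock cycles in a run during which RP latches are loaded), are hypothetical names introduced here.

```python
def ending_rp(loads_in_run: int) -> int:
    """Model the 2-bit register that counts 1, 2, 3, 1, 2, 3, ...

    After loads_in_run load cycles, the register holds the number of the
    RP that received the last data of the run.
    """
    if loads_in_run < 1:
        raise ValueError("a run needs at least one load cycle")
    return (loads_in_run - 1) % 3 + 1


def idle_cycles(end_rp: int, n: int) -> int:
    """Clock cycles without external memory scanning before the next run.

    end_rp: RP at which the current run ends (1, 2, or 3)
    n:      column length N of the image
    """
    if end_rp == 1:
        return 2
    if end_rp == 2:
        return 1
    if end_rp == 3:
        return 0 if n % 2 == 0 else 3
    raise ValueError("end_rp must be 1, 2, or 3")


if __name__ == "__main__":
    for loads in (7, 8, 9):
        rp = ending_rp(loads)
        print(loads, rp, idle_cycles(rp, n=8), idle_cycles(rp, n=9))
```

In every case the idle cycles bring the rotation back to RP1, which is what allows each run to start at the same processor.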
Now, let us move to the CPs' side to see how this part of the architecture works.
According to the dataflow shown in Table B.9, CP1 and CP3 work in parallel starting
from cycle 13. However, CP1 executes the high coefficients stored in Rth1, Rth2, and
Rth3, while CP3 executes the low coefficients stored in Rtl1, Rtl2, and Rtl3. Starting
from clock cycle 14, CP2 alternates between executing high and low
coefficients. Moreover, both CP1 and CP3 are run by the clock labeled $f_{3a}$, and every
time it makes a positive transition, new data are loaded simultaneously into the latches
Rt0, Rt1, and Rt2 of both CP1 and CP3. CP2 is run by the clock $f_{3b}$ and loads new
data into its latches Rt0, Rt1, and Rt2 every time the clock makes a positive transition.
In order to understand why the 3 sets of multiplexers labeled mux1, mux2, and
mux3 are included, why they are interconnected in that way, and how they operate,
consider Table 4.1. Table 4.1 is obtained from Table B.9; it lists groups of RPs' output
latches, identified in the table as patterns, and shows how they are scheduled for the
CPs. As shown in Table B.9, in cycle 13, pattern 1 latches are scheduled for CP1 and
CP3. In cycle 14, pattern 2 latches are scheduled for CP2. In cycle 16, pattern 3 latches
are scheduled for CP1 and CP3, whereas in cycle 17, pattern 4 latches are scheduled
for CP2. These scheduling patterns then repeat starting from pattern 1, and so on.
Thus, looking at pattern 1 and pattern 3 latches, the presence and interconnections of
the three CP1 multiplexers labeled mux1 and the three CP3 multiplexers labeled mux3
can be justified. In Figure 4.2.3, pattern 1 latches are connected to the multiplexer
inputs labeled 0, whereas pattern 3 latches are connected to the inputs labeled 1. The
operation of the two sets of multiplexers can be controlled by one signal labeled sp1.
First, sp1 is set to 0 to schedule pattern 1, then it is set to 1 to schedule pattern 3, and
so on.

Table 4.1 Scheduling patterns for CPs and registers involved
Pattern   RP's output latches   CP
1         Rth1    Rtl1          1 & 3
          Rth2    Rtl2
          Rth3    Rtl3a
2         Rth3                  2
          Rth1
          Rth2
3         Rth2    Rtl3a         1 & 3
          Rth3    Rtl1
          Rth1    Rtl2
4         Rtl2                  2
          Rtl3b
          Rtl1

Similarly, looking at pattern 2 and pattern 4 latches, which are used by CP2, the
inclusion of the three multiplexers labeled mux2 and their interconnections can be
verified. In the architecture, pattern 2 latches are connected to the multiplexer inputs
labeled 0, whereas pattern 4 latches are connected to the inputs labeled 1.
The operations of these multiplexers are controlled by one signal labeled sp2. First,
sp2 is asserted low to schedule pattern 2, then high to schedule pattern 4, and so
on.
On the other hand, examination of Tables B.9 and 4.1 from cycle 12 to cycle
17 shows that the control signal values for signals e3a, e3b, sp1, and sp2 can be
derived as shown in Table 4.2. These signal values repeat every 6 clock cycles. In
addition, as indicated in the table, signals sp1 and sp2 can be combined into one signal,
sp.
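Because the control values repeat every 6 clock cycles, they can be generated from a 6-entry lookup indexed by the cycle number modulo 6. The sketch below encodes the rows of Table 4.2 (cycles 12 to 17), with don't-care entries represented as `None`; the function name `control_signals` and the dictionary layout are hypothetical choices made for illustration.

```python
# One row per cycle of the repeating 6-cycle pattern, starting at cycle 12.
# Columns: e3a, e3b, sp1, sp2, sp. None marks a don't-care (x) entry.
_PATTERN = [
    (1, 0, None, None, 0),  # cycle 12
    (0, 0, 0, None, 0),     # cycle 13
    (0, 0, None, 0, 0),     # cycle 14
    (0, 1, None, None, 0),  # cycle 15
    (0, 0, 1, None, 1),     # cycle 16
    (0, 0, None, 1, 1),     # cycle 17
]


def control_signals(cycle: int, first_cycle: int = 12) -> dict:
    """Return the control signal values for a clock cycle >= first_cycle."""
    if cycle < first_cycle:
        raise ValueError("pattern is defined from the first listed cycle on")
    e3a, e3b, sp1, sp2, sp = _PATTERN[(cycle - first_cycle) % 6]
    return {"e3a": e3a, "e3b": e3b, "sp1": sp1, "sp2": sp2, "sp": sp}


if __name__ == "__main__":
    for c in range(12, 24):
        print(c, control_signals(c))
```

Wherever sp1 or sp2 is defined in the pattern, the combined signal sp carries the same value, which is why a single select line is enough.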
Table 4.2 Control signal values
Cycle number   e3a   e3b   sp1   sp2   sp
12             1     0     x     x     0
13             0     0     0     x     0
14             0     0     x     0     0
15             0     1     x     x     0
16             0     0     1     x     1
17             0     0     x     1     1

According to the DDGs for 5/3 and 9/7, a high coefficient calculated in a previous
operation is also required in the calculation of the next operation. This implies that,
since CP2 interleaves the execution of coefficients of both the H and L decompositions
generated by the RPs, it should be able to pass the high coefficients it generates to CP1
and CP3 and to receive the high coefficients generated by CP1 and CP3. Therefore,
the paths labeled h1, h2, l1, and l2 are added in Figure 4.2.3 to serve this purpose.
In order for the CPs to exchange these high coefficients properly, the CPs' datapath
architecture, specifically stage 2 of the 5/3 and stages 2 and 5 of the 9/7, should be
modified as shown in Figure 4.2.5. Table 4.3 provides the information necessary for
passing high coefficients between CPs. This table is used as a means of implementing
the modifications shown in Figure 4.2.5. Therefore, understanding Table 4.3 is
essential to appreciating the changes that have been incorporated into Figure 4.2.5.
Table 4.3 shows that in cycle 16, CP1 and CP3 generate the high coefficients
HH0,0 and LH0,0, which are placed in Rt1 and Rt3, respectively, by the pulse ending
the cycle. The pulse ending cycle 17 loads HH0,0 into Rd2 of CP2, as indicated by
the arrow labeled 1. Similarly, the pulse ending cycle 20 transfers the high coefficient
LH1,0 stored in Rt3 of CP3 to Rd2 of CP2, as indicated by the arrow labeled 2.
Note that this pattern of scheduling high coefficients to Rd2 repeats in cycles 23
and 26. Thus, since Rd2 accepts data either from Rt1 of CP1 or from Rt3 of CP3, the
multiplexer labeled muxc2 is added in Figure 4.2.5 to allow Rd2 to select between
these two inputs. Similarly, the inclusion of the multiplexers labeled muxc1 and
muxc3, attached to Rd1 of CP1 and Rd3 of CP3, respectively, can be verified.
Another point that needs to be addressed is that Figure 4.2.5 shows that the
operations of muxc1 and muxc3 can be controlled by only one signal labeled sc1. This
can also be verified with the aid of Table 4.3. For instance, the two arrows labeled 3
and 5 in Table 4.3 indicate that two data transfers take place at the same time; one is
going to Rd1 of CP1 and the other to Rd3 of CP3. This implies that the two
data transfers can be accomplished if the data pointed to by arrow 3 and that pointed to
by arrow 5 are connected to input 0 of muxc1 and muxc3, respectively. On the other
hand, the second data transfer indicated by the two arrows labeled 4 and 6 can be
accomplished by connecting the data pointed to by arrow 4 and that pointed to by arrow 6








[Figure: modified CPs datapath; in the figure, sc1 = sc3 = sc2 = sc]





Table 4.3 How and when the CPs exchange high coefficients
ck   CP     CP1               CP2               CP3
            Rt1      Rd1      Rt2      Rd2      Rt3      Rd3
16   1,3    HH0,0    ----     ----     ----     LH0,0    ----
17   2      HH0,0    ----     HH1,0    HH0,0    LH0,0    ----
18   ----   HH0,0    ----     HH1,0    HH0,0    LH0,0    ----
19   1,3    HH2,0    HH1,0    HH1,0    HH0,0    LH1,0    LH0,0
20   2      HH2,0    HH1,0    LH2,0    LH1,0    LH1,0    LH0,0
21   ----   HH2,0    HH1,0    LH2,0    LH1,0    LH1,0    LH0,0
22   1,3    HH3,0    HH2,0    LH2,0    LH1,0    LH3,0    LH2,0
23   2      HH3,0    HH2,0    HH4,0    HH3,0    LH3,0    LH2,0
24   ----   HH3,0    HH2,0    HH4,0    HH3,0    LH3,0    LH2,0
25   1,3    HH5,0    HH4,0    HH4,0    HH3,0    LH4,0    LH3,0
26   2      HH5,0    HH4,0    LH5,0    LH4,0    LH4,0    LH3,0
27   ----   HH5,0    HH4,0    LH5,0    LH4,0    LH4,0    LH3,0
28   1,3    HH6,0    HH5,0    LH5,0    LH4,0    LH6,0    LH5,0
can be used for deriving control signal values for signals sc1 = sc3 and sc2 as shown in
Table 4.4. These signal values repeat every 6 clock cycles. As indicated in the table,
these two signals can be further combined into one signal sc.
A careful examination of the 9/7 DOGs shows that stage 3 of the 9/7 CPs in the
3-parallel architecture should also be modified as shown in Figure 4.2.6. This figure can
be verified using the 9/7 DOGs. The operations of the 3 multiplexers labeled mux in
Figure 4.2.6 can be controlled simply by setting the control signal s low for 3
consecutive cycles and then high for 3 cycles, repeatedly, as soon as the stage 3 latches
of CP2 are loaded, as shown in Table 4.4 for signal sc. Figures 4.2.5 and 4.2.6 form the
first 3 stages of the 6-stage 9/7 CP; the following two stages are identical to stages 1
and 2, and the last stage is the scale factor.
Table 4.4 Control signal values for signal sc
Cycle number   sc1=sc3   sc2   sc
16             X         0     0
17             X         X     0
18             0         X     0
19             X         1     1
20             X         X     1
21             1         X     1
Figure 4.2.6 Modified stage 3 of the 9/7 CPs
4.2.3 4-parallel pipelined architecture
The 4-parallel pipelined architecture is shown in Figure 4.2.7 and its dataflow is given
in Table B.10. This architecture closely resembles the 2-parallel architecture shown in
Figure 4.2.1. The main difference is that the 2-parallel architecture consists of two
pipelined processors, whereas the 4-parallel consists of 4 pipelined processors. Each
pipelined processor contains one RP and one CP.
The architecture scans the external memory with frequency f" and itself operates 
with frequency. The clock frequency f, can be obtained from statement] as 
(4.1) 
Note that when the degree of parallelism increases, e.g., from 2 to 3, the scanning
frequency f_l also increases, while the architecture's frequency of operation, which is
the reciprocal of the stage critical path delay of the pipelined processors, remains
unchanged.

Figure 4.2.8 Waveforms of the 3 clocks used in 4-parallel
Two waveforms of the frequency, labeled f4a and f4b, that can be generated from f4
are shown in Figure 4.2.8. In the architecture, RP1 and RP3 and their associated
latches employ the clock labeled f4a, whereas RP2 and RP4 and their associated latches
employ the clock labeled f4b, as shown in Figure 4.2.7.
As shown in Table B.10, in every clock cycle, three pixels are scanned from the
external frame memory and are loaded into the latches of one of the RPs. First, RP1
latches are loaded, followed by RP2 latches, then RP3 latches, followed by RP4 latches,
and then the process repeats. When the scanning process returns to RP1 to initiate
another operation, RP1 should have completed its current operation in the time
specified by t_p/k, and should be ready to accept the pixels of the next operation. As
indicated, in the architecture, RP1 latches will be loaded with new data every time
clock f4a makes a negative transition, while RP3 latches will be loaded at the positive
transition. RP2 and RP4 latches, in turn, will be loaded at the negative and the
positive transitions of clock f4b, respectively.
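As a rough illustration of the round-robin loading order just described, the sketch below maps a scan cycle to the RP whose latches are loaded and to the clock edge that loads them. The function name and the assumption that scan cycles are numbered from 1, with cycle 1 loading RP1, are illustrative and are not taken from Table B.10.

```python
# Hypothetical sketch of the RP latch-loading order in the 4-parallel architecture.
# Assumption: scan cycles are numbered from 1 and cycle 1 loads RP1.
LOAD_SCHEDULE = [
    ("RP1", "f4a", "negative"),
    ("RP2", "f4b", "negative"),
    ("RP3", "f4a", "positive"),
    ("RP4", "f4b", "positive"),
]


def rp_for_scan_cycle(cycle):
    """Return (RP, clock, edge) for the given 1-based scan cycle."""
    if cycle < 1:
        raise ValueError("scan cycles are numbered from 1")
    return LOAD_SCHEDULE[(cycle - 1) % len(LOAD_SCHEDULE)]
```

Each RP is revisited every fourth scan cycle, which is why the scan frequency can rise with the degree of parallelism while each RP still has a full stage time to finish its current operation.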
In the 3-parallel architecture, the strategy adopted was to allow each run to begin
at RP1. This strategy was preferred over the one that allows each new run to start its
computations in the RP that immediately comes after the RP where the previous run
ended, mainly because with the latter it is very difficult to come up with a simple scheme
that allows us to decide in which TLB a high coefficient needed in the next run should
be stored and when it can be retrieved. However, the situation is quite different in the
4-parallel architecture, because a simple and very efficient scheme can be found
that encourages the adoption of the latter strategy.
The scheme, which can be reasoned from Table B.10, is summarized as follows.
The decision of where to store the high coefficients calculated in the previous run that
are needed in the calculations of low coefficients in the next run can be made by
examining the two least significant bits of N. Case one: if the two least significant
bits of N are 00 or 11, then the high coefficients should be stored in the TLBs of the
RPs that generate them. Case two: if the two least significant bits of N are either 01 or
10, then the high coefficients of RP1 should be stored in the TLB of RP3 and vice
versa, and the high coefficients of RP2 should be stored in the TLB of RP4 and vice
versa. Symbolically, case two can be written as
$\mathrm{RP1} \leftrightarrow \mathrm{RP3}, \qquad \mathrm{RP2} \leftrightarrow \mathrm{RP4}$    (4.2)
Therefore, the paths labeled Pa and Pb are added in Figure 4.2.7.
Note that the following fact is also used to arrive at the above result. In the clock
cycle where a transition from one run to the next occurs, especially when the column
length (N) of an image is odd, the external memory is not scanned and no pixels are
loaded into the RP latches, since during this cycle two CPs, either (CP1 and CP3) or
(CP2 and CP4), will each compute the last low coefficient using the last high and the
last low coefficients in the H and L columns, respectively, as required by the DOGs for
odd length signals. In Table B.10, the columns labeled Rth and Rtl represent the H and L
columns, respectively.
The above scheme only affects stage 2 of the four 5/3 RPs and stages 2, 3, and 5
of the four 9/7 RPs, and it can be implemented as shown in Figure 4.2.9. Signal zs,
which controls the operations of the four multiplexers labeled mux1, mux2, mux3, and
mux4, can be generated by a simple 2-input XNOR gate with its two inputs
connected to the two least significant bits of N. Thus, if the inputs to the XNOR are
either 00 or 11 (case one), zs is asserted high to pass the high coefficient generated in
stage 1 of the same RP. Otherwise (case two), it is asserted low to pass the high
coefficient stored in each register BIR (buffer input register), which has been generated
by one of the other RPs. Note that signal zs will have only one value during each level of
decomposition. For example, during the whole period of the first level decomposition,
zs may be equal to 1 or 0, but not both.

Figure 4.2.9 Modified stage 2 of the RPs datapath architecture
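The selection rule above can be sketched as follows. This is a behavioural illustration only; the function names `zs` and `tlb_destination` and the integer RP numbering are assumptions made for the example.

```python
# Behavioural sketch of the zs signal and of the TLB routing it implies.
# Function names and the integer RP numbering are illustrative assumptions.

def zs(n):
    """XNOR of the two least significant bits of the column length N."""
    bit0 = n & 1
    bit1 = (n >> 1) & 1
    return 1 - (bit0 ^ bit1)


def tlb_destination(rp, n):
    """Return the RP whose TLB stores the high coefficients produced by `rp`."""
    if zs(n) == 1:                           # case one: LSBs are 00 or 11
        return rp
    return {1: 3, 3: 1, 2: 4, 4: 2}[rp]      # case two: RP1<->RP3, RP2<->RP4
```

For example, N = 512 ends in 00, so every RP keeps its own high coefficients, whereas N = 5 ends in 01, so the coefficients of RP1 go to the TLB of RP3.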
Even though this scheme optimizes the performance in terms of the number of
clock cycles needed for j-level decomposition, it greatly complicates
the operations of the 4 RPs, which would require very complex control circuitry. In
addition, it needs more hardware and long buses. The alternative scheme would be to
allow each run to begin at RP1, as in case 1. The advantage of this scheme is that it
would reduce the hardware and the control complexities to the level of case 1, which is
less complex and more manageable. In addition, it will eliminate the long buses, the four
BIR registers, and the four multiplexers labeled mux1, mux2, mux3, and mux4. The
disadvantage of the alternative scheme is that it will increase the execution time by
Mli' 1 cycles for each decomposition level, when case2 occurs. However, since, the 
hardware complexity is lower, the alternative scheme can operate at a higher
frequency, which would compensate for the lost performance.
Read and write operations in the 4 TLBs for case 2 are somewhat complex.
Therefore, Table B.11 is provided to illustrate how read and write operations take
place in the TLBs during each run of case 2. Table B.11 shows the read and write
operations for RP1 and RP3, which are identical to those that take place in the TLBs of
RP2 and RP4, respectively. Table B.11 shows that in the first run, RP1 and RP3 each uses
its TLBARa for addressing its TLB, and in each cycle, with reference to clock f4a, the
same location is read in the first half cycle and written in the second half cycle,
starting from the first location. In the second run, as in the first run, RP1 uses only
TLBAR1a to address its TLB, while RP3 uses both TLBARa and TLBARb to address its TLB,
as follows. In each cycle two successive locations are accessed. The
first location is accessed by TLBAR3a, while the second is accessed by TLBAR3b. In
the first half cycle, with reference to clock f4a, TLBAR3b reads its location and loads
the result into register BOR3 by the negative transition of f4a, whereas during the second
half cycle, TLBAR3a writes the contents of register BIR3 into the location it is addressing.
This writing completes by the positive transition of clock f4a. For example, Table
B.11 shows that in cycles 35 and 37, TLBAR3a is addressing location 2, while
TLBAR3b is addressing location 3.
In addition, note that in cycle 23, where run 2 begins, and in cycle 25, Table B.11
shows that TLBAR3a is addressing location 4 to write the last coefficient of run 1,
while TLBAR3b is addressing location 0 to read the first location, which contains the
first coefficient needed in the first operation of run 2.
Finally, note that when the control signals sa12 or sa34 of the multiplexers labeled
muxa are set to 0 in a run, each OR gate passes the clock signal to the control signal of
multiplexer muxa. The clock signals f4a or f4b allow both TLBARa and TLBARb to be
used for addressing the TLBs, as shown in run 2 of RP3 in Table B.11. On the other hand,
when sa12 and sa34 are set to 1 in a run, only TLBARa is used for addressing the TLBs, as
shown in run 1 of Table B.11. In case 1, signals sa12 and sa34 are set to 1 in all runs and
only TLBARa of each RP is used for addressing its TLB.
The control signals such as zs, incar, and sre2, which are generated by the
control unit, can be arranged as shown in Figure 4.2.10 (a), and the block diagram is
shown in Figure 4.2.10 (b). The control signal values issued in each clock cycle by the
control unit are transferred to the first stage of the pipeline and are loaded into the
control signal latches (CSTs), which carry these signal values from stage to stage. When a
stage where a signal is used is reached, the signal value carried by its CST is
applied, while the remaining signals are carried on to the next stage.
Now, let us move to the CPs side to see how this part of the architecture works.
The 4 CPs are run by the clock labeled f4a. According to the dataflow shown in Table
B.10, both CP1 and CP3 execute in parallel starting from cycle 15 and load new data
every time clock f4a makes a positive transition. Similarly, both CP2 and CP4 execute
in parallel starting from cycle 17 and load new data every time clock f4a makes a
negative transition. Thereafter, all RPs and CPs in the architecture work in parallel.
However, both CP1 and CP2 execute high coefficients stored in Rth1, Rth2, Rth3, and
Rth4, whereas CP3 and CP4 execute low coefficients stored in Rtl1, Rtl2, Rtl3, and
Rtl4.
The two paths labeled h1 and h2 between CP1 and CP2, and those labeled
l3 and l4 between CP3 and CP4, are used for passing high coefficients among the CPs,
since each high coefficient generated by a CP is also required in the next operation









Figure 4.2.10 (a) Control signals carried by CST and (b) the block diagram
of both CP1 and CP2 or CP3 and CP4, in the case of 5/3, and between stages 2 and
between stages 5 of both CP1 and CP2 or CP3 and CP4, in the case of 9/7, as illustrated
in Fig. 4.2.11 for CP1 and CP2. The first 2 stages of Figure 4.2.11 represent the modified
5/3 CP1 and CP3, while stages 1 to 3 represent the first 3 stages of the modified
6-stage 9/7 CP1 and CP3; the following two stages are identical to stages 1 and 2.
In a control design it would be necessary to determine the clock cycle (C1) where
the first input data are loaded into the CPs latches and the clock cycle (C2) where the
first output coefficients are loaded into the CPs output latches. The following two
equations can be used to determine C1 and C2.
$C1 = l \cdot k_r + 2i + 1$    (4.3)
$C2 = C1 + l \cdot k_c$    (4.4)
where l = 2, 3, 4 ... denotes the 2-, 3-, 4-parallel architecture (degree of parallelism)
and i = 1, 2, 3 ... denotes the first, the second, the third scan method and so on. k_r and
k_c are the number of pipeline stages in a RP and a CP, respectively. Note that Eqs (4.4) and














Figure 4.2.11 CP1 and CP2 modified to exchange high coefficients
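Eqs (4.3) and (4.4) are easy to evaluate for a given configuration. The sketch below assumes that the stage count in Eq (4.3) is that of the RP and the one in Eq (4.4) is that of the CP; the extracted equations do not make the subscripts legible, so the names `k_r` and `k_c` are an assumption of this example, as is the function name.

```python
# Sketch of Eqs (4.3) and (4.4). The assignment of the RP stage count (k_r) to
# C1 and of the CP stage count (k_c) to C2 is an assumption of this example.

def first_cp_cycles(l, i, k_r, k_c):
    """Return (C1, C2) for degree of parallelism l and scan method i."""
    c1 = l * k_r + 2 * i + 1   # Eq (4.3): first input data loaded into CP latches
    c2 = c1 + l * k_c          # Eq (4.4): first outputs loaded into CP output latches
    return c1, c2
```

As a plausibility check under these assumptions, `first_cp_cycles(4, 1, 3, 3)` gives C1 = 15, the cycle at which the text says CP1 and CP3 start executing in the 4-parallel dataflow; the stage counts used here are illustrative values only.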
4.2.4 Evaluations of architectures 
To evaluate the performances of the three parallel architectures developed in this
section, in terms of speedup, efficiency, hardware utilization, and power consumption,
consider the following. In the single pipelined processor architecture based on the first
overlapped scan method developed in chapter 3, the total time T1 required to execute
n operations for j-level decomposition of an NxM image is given by Eq (3.14) as
$T_1 = [p_1 + 3(n-1)]\,\tau_1$    (4.5)
From statement 2, case 2,
$\tau_1 = t_p/(l \cdot k)$    (4.6)
where l = 3 for 5/3 and 9/7. Thus,
$T_1 = [p_1 + 3(n-1)]\,t_p/3k$    (4.7)
On the other hand, the total time, T2, required for executing n operations for j-level
decomposition of an NxM image on the 2-parallel pipelined architecture shown
in Figure 4.2.1, can be estimated using Table B.8 as
$T_2 = \dfrac{[p_2 + 2(n-1)]\,\tau_2}{2}$    (4.8)
From statement 3,
$\tau_2 = t_p/2k$    (4.9)
Therefore,
$T_2 = \dfrac{[p_2 + 2(n-1)]\,t_p/2k}{2}$    (4.10)
The speedup factor (S2) is then given by
$S_2 = \dfrac{T_1}{T_2} = \dfrac{[p_1 + 3(n-1)]\,t_p/3k}{[p_2 + 2(n-1)]\,t_p/4k}$    (4.11)
For large n, the above equation reduces to
$S_2 = \dfrac{3(n-1)\,t_p/3k}{2(n-1)\,t_p/4k} = 2$    (4.12)
Eq (4.12) indicates that the 2-parallel architecture is 2 times faster than the single
pipelined architecture.
The efficiency (E_l) of an l-parallel processors system is defined by [58] as
$E_l = S_l / l$    (4.13)
The efficiency measures the useful portion of the total work performed by l
processors. The lowest efficiency corresponds to the case of an entire NxM image
being decomposed on a single pipelined processor (consisting of a RP and a CP). The
maximum efficiency is achieved when all l pipelined processors are fully utilized
throughout the execution period. Thus, the efficiency of the 2-parallel pipelined
architecture can be written as
$E_2 = S_2/2 = 1$    (4.14)
Hardware utilization indicates the extent to which resources (e.g. processors) are
utilized during a parallel computation [58]. Since in parallel architectures hardware
utilization can be measured by efficiency [40], it can be concluded that
hardware utilization in the 2-parallel architecture is 100%.
The total time (T3) required to perform n operations in j-level decomposition of
an NxM image on the 3-parallel pipelined architecture can be written as
$T_3 = \dfrac{[p_3 + 3(n-1)]\,\tau_3}{3}$    (4.15)
From statement 3,
$\tau_3 = t_p/3k$    (4.16)
Thus,
$T_3 = \dfrac{[p_3 + 3(n-1)]\,t_p/3k}{3}$    (4.17)
The speedup factor (S3) is given by
$S_3 = \dfrac{T_1}{T_3} = \dfrac{[p_1 + 3(n-1)]\,t_p/3k}{[p_3 + 3(n-1)]\,t_p/9k}$    (4.18)
For large n, S3 reduces to
$S_3 = \dfrac{27(n-1)}{9(n-1)} = 3$    (4.19)
Eq (4.19) indicates that the 3-parallel architecture is 3 times faster than the single
pipelined architecture, with efficiency 1.
Similarly, from Table B.10, the total time (T4) required to execute n operations for j
levels of decomposition of an NxM image on the 4-parallel pipelined architecture can
be written as
$T_4 = \dfrac{[p_4 + 2(n-1)]\,\tau_4}{2}$    (4.21)
From statement 3,
$\tau_4 = t_p/4k$    (4.22)
Thus,
$T_4 = \dfrac{[p_4 + 2(n-1)]\,t_p/4k}{2}$    (4.23)
The speedup factor (S4) is then given by
$S_4 = \dfrac{T_1}{T_4} = \dfrac{[p_1 + 3(n-1)]\,t_p/3k}{[p_4 + 2(n-1)]\,t_p/8k}$    (4.24)
For large n, the above equation reduces to
$S_4 = \dfrac{3(n-1)\,t_p/3k}{2(n-1)\,t_p/8k} = 4$    (4.25)
and the efficiency is
$E_4 = S_4/4 = 1$    (4.26)
Equations (4.25) and (4.26) imply that the 4-parallel architecture is 4 times faster than
the single pipelined architecture and that the efficiency is 1, respectively.
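The speedup and efficiency figures above can be checked numerically. The following sketch evaluates the total-time expressions of this section for the single, 2-, 3-, and 4-parallel architectures; the function names are hypothetical, and the values chosen for the pipeline fill terms p_l, for t_p, and for k are arbitrary illustrations, since they cancel out for large n.

```python
# Numerical check of the speedup and efficiency expressions in this section.
# Function names and the default parameter values are illustrative assumptions.

# l -> (operations multiplier m, tau divisor d, overall divisor q), so that
# T_l = (p_l + m * (n - 1)) * t_p / (d * k) / q, following the expressions above.
TIME_MODEL = {
    1: (3, 3, 1),   # single pipelined architecture, Eq (4.7)
    2: (2, 2, 2),   # 2-parallel, Eq (4.10)
    3: (3, 3, 3),   # 3-parallel
    4: (2, 4, 2),   # 4-parallel
}


def total_time(l, n, t_p=1.0, k=4, p=20):
    """Total time to execute n operations on the l-parallel architecture."""
    m, d, q = TIME_MODEL[l]
    return (p + m * (n - 1)) * t_p / (d * k) / q


def speedup(l, n, **kwargs):
    """Speedup of the l-parallel architecture over the single pipelined one."""
    return total_time(1, n, **kwargs) / total_time(l, n, **kwargs)


def efficiency(l, n, **kwargs):
    """Efficiency E_l = S_l / l, Eq (4.13)."""
    return speedup(l, n, **kwargs) / l
```

For a large number of operations, e.g. `speedup(4, 10**6)`, the result approaches 4 and the efficiency approaches 1, in agreement with the conclusions drawn in the text.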
On the other hand, the power consumption of the l-parallel pipelined architecture as
compared with the single pipelined architecture can be obtained as follows. Let P1 and
Pl denote the power consumption of the single and l-parallel architectures without the
external memory, and Pm1 and Pml denote the power consumption of the external
memory for the single and l-parallel architectures, respectively. Then,
$P_1 = C_{total} \cdot V_0^2 \cdot f_1/3, \qquad P_l = l \cdot C_{total} \cdot V_0^2 \cdot f_l/l$    (4.27)
and (4.28) 
where C_total is the total capacitance of the single pipelined architecture.
On the other hand, P m 1 and P mt can be estimated as 
(4.29) 
and ( 4.30) 
where em 1 is the total capacitance of the external memory· and !=3 is number of lola 
buses. 
From the above evaluations, it can be concluded that as the degree of parallelism
increases, the speedup, the power consumption of the architecture without external
memory, and the power consumption of the external memory all increase by a factor of l,
as compared with the single pipelined architecture.
4.3 Parallel form of the intermediate architectures 
As mentioned before, the rationale behind developing the intermediate architecture is to
reduce the excess power consumption of the external memory, due to scanning
overlapped areas, to somewhere between that of the architecture based on the first
overlapped scan method and that based on the nonoverlapped scan method developed in
chapter 3. In this section, the single pipelined intermediate architecture shown in
Figure 3.7.2 will be extended to 2- and 3-parallel pipelined architectures to achieve
speedup factors of 2 and 3, respectively. The two proposed parallel architectures are
intended for use in real-time applications of 2-D DWT, where very high speed and
throughput are required.
4.3.1 2-parallel pipelined intermediate architecture
Based on the DOGs for 5/3 and 9/7 filters shown in Figures 3.3.1 and 3.3.2,
respectively, and the scan method shown in Fig. 3.7.1 (a), the 2-parallel pipelined
intermediate architecture shown in Figure 4.3.1 is developed. The dataflow of the
architecture is given in Table B.12. The architecture consists of 2 k-stage pipelined
row-processors labeled RP1 and RP2 and 2 k-stage pipelined column-processors
labeled CP1 and CP2. In the previous chapter, the RP and the CP for the 5/3 were
pipelined into 4 and 3 stages, respectively, whereas the RP and the CP for the 9/7
were pipelined into 8 and 6 stages, respectively.
The architecture scans the external memory with frequency f2 and
operates with frequency f2/2. The buses labeled bus0, bus1, and bus2 are used for
transferring, every clock cycle, pixels from the external memory to one of the RPs latches
labeled Rt0, Rt1, and Rt2, according to the scan method in Figure 3.7.1 (a). This scan
method requires that in the first clock cycle, the 3 buses should be used for scanning
the first 3 pixels from the first row of the external memory, whereas in the second and
third cycles each scans two pixels through bus1 and bus2. Then the scan moves to the
second row to repeat the process. The RP1 latches load new data (pixels) every time
clock f2/2 makes a positive transition, whereas RP2 latches load new data when a
negative transition occurs. Assume the first half cycles of the clocks f2 and f2/2 are low.
On the other hand, both CP1 and CP2 and their associated latches load new data every
time clock f2/2 makes a positive transition.

Figure 4.3.1 2-parallel pipelined intermediate architecture (in the figure, edh = edl,
s3 = s2, sl0 = sh0, sl1 = sh1)
Furthermore, since in every clock cycle 3 pixels are required to initiate an
operation, and the third pixel, according to the DOGs, is always needed in the
next operation, register Rd0 is added to hold the third pixel for the next
operation. The multiplexer labeled mux1 passes Rd0 to either Rt0 of RP1 or Rt0 of
RP2. Register Rd0 loads a new pixel from bus2 every time clock f2 makes a negative
transition.
The control signal s1 of the multiplexer labeled mux1 is set to 0 in the first clock
cycle of f2 to pass data in bus0 and is set to 1 in the second and third clock cycles to
pass Rd0 contents. The above steps are repeated in cycles 4, 5, and 6 and so on, when the
scan moves to the second row.
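The way 7 scanned pixels of a row initiate 3 operations, with Rd0 supplying the overlapping pixel, can be sketched as below. This is a simplified software model of the scan for one row; the function name and the list-based representation of a row are assumptions made for the example.

```python
# Simplified model of the row scan: 7 pixels of a row initiate 3 operations,
# with register Rd0 re-supplying the third pixel of the previous operation.
# The function name and list-based row representation are illustrative.

def row_operations(row):
    """Return the three (Rt0, Rt1, Rt2) input triples for one 7-pixel row scan."""
    if len(row) != 7:
        raise ValueError("one scan of a row covers 7 pixels")
    operations = []
    rd0 = None
    pos = 0
    for cycle in range(3):
        s1 = 0 if cycle == 0 else 1               # mux1 control signal
        if s1 == 0:
            rt0 = row[pos]                        # bus0 passed through mux1
            rt1, rt2 = row[pos + 1], row[pos + 2] # bus1 and bus2
            pos += 3
        else:
            rt0 = rd0                             # Rd0 contents passed through mux1
            rt1, rt2 = row[pos], row[pos + 1]     # only bus1 and bus2 scan pixels
            pos += 2
        rd0 = rt2                                 # Rd0 holds the pixel from bus2
        operations.append((rt0, rt1, rt2))
    return operations
```

Calling `row_operations(list(range(7)))` returns `[(0, 1, 2), (2, 3, 4), (4, 5, 6)]`, showing that pixels 2 and 4 are each used by two consecutive operations without being scanned twice.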
The multiplexer labeled muxre0 is an extension multiplexer that passes data coming
through bus2 in all cases except one: when the row length (M) of an image is even, and
only in the calculations of the last high and low coefficients in a row r, according to
the DOGs, the pixel at location X(r,M-2), which will be placed in bus0, must be
allowed to pass through muxre0 and then be loaded into Rt2 as well as Rt0. The two
multiplexers labeled muxce0, located at the CPs side, are also extension multiplexers
and perform the same function as muxre0 when DWT is applied column-wise
by the CPs.
The registers labeled SRH1, SRH0, SRL1, and SRL0 are FIFO shift registers, each
holding 3 coefficients at any time. Registers SRH1, SRH0, and RdH are used for storing
high coefficients generated by RP1 and RP2, whereas SRL1, SRL0, and RdL are used
for storing low coefficients. These registers all operate with frequency f2. In addition,
the control signals sl0 = sh0 and sl1 = sh1 control the operation of the FIFO registers.
When they are high, the FIFOs shift in new data; otherwise, no shift takes place. The
high coefficients stored in SRH0 and SRH1 are executed by CP1, while CP2 executes
low coefficients stored in SRL0 and SRL1.
The operations of the two multiplexers labeled muxh and muxl can be controlled
by one control signal labeled slh. This control signal is connected to the clock f2/2.
When f2/2 is low, both multiplexers pass coefficients generated by RP1; otherwise, they
pass those generated by RP2.
Observe that the dataflow pattern between cycles 13 and 18 in Table B.12,
especially in the 4 FIFO registers including RdH and RdL, repeats every 6 clock cycles.
A careful investigation of Table B.12 from cycles 13 to 18 shows that the control
signals of the two multiplexers labeled mux2 and the two multiplexers labeled mux3,
including the control signals (edh and edl) of the registers labeled RdH and RdL, can
all be combined into one signal, s2. Moreover, examination of Table B.12 shows that
the control signal values for signals s2, sl0 = sh0, and sl1 = sh1 from cycles 13
to 18 can be as shown in Table 4.5. These control signal values repeat every 6 clock
cycles.

Table 4.5 Control signal values for s2, sl0, and sl1
Cycle number   s2   sl0   sl1
13             0    1     1
14             1    0     0
15             1    1     0
16             0    0     1
17             1    1     1
18             0    0     1
According to the 5/3 DOGs shown in Figure 3.3.1, each coefficient calculated in
the first level (step 1) is also required in the calculations of two coefficients in the
second level (step 2). That implies that a high coefficient calculated by RP1 in stage 1
should be passed to stage 2 of RP2 and vice versa. The 9/7 DOGs shown in Figure
3.3.2 also show similar dependencies among coefficients of two levels or
steps. Therefore, the paths labeled P1 and P2 have been added in Fig. 4.3.1 so that the
two RPs can pass high coefficients to each other. However, this would require the two
RPs datapath architectures for 5/3 and 9/7 to be modified as shown in Figures 4.3.2
and 4.3.3, respectively.
In addition, if the third high coefficient of the first row, labeled Y(5) in the 5/3
DOGs, is stored in the first location in TLB1 of RP1, then the third high coefficient of
the second row should be stored in the first location in TLB1 of RP2 and so on.
Similarly, the 9/7 coefficients labeled Y"(5), Y"(4), and Y'(3) in the DOGs, generated
by processing the first row of the first run, should be stored in the first locations of
TLB1, TLB2, and TLB3 of RP1, respectively, whereas the same coefficients
generated by processing the second row of the first run should be stored in the first
locations of TLB1, TLB2, and TLB3 of RP2, respectively, and so on. The same
process also applies in all other runs.
Figure 4.3.3 Modified 9/7 RPs datapath for 2-parallel intermediate architecture
The control signal sf of the 8 multiplexers labeled muxf in Figure 4.3.3 can be set to
0 in the first run and 1 in all other runs. It is very important to note that, especially in
the first run, the scan method in Figure 3.7.1 (a) allows the 5/3 RPs to yield 6 coefficients,
where half belong to the first 3 columns of the H decomposition and the other half to the L
decomposition, each time they process 7 pixels of a row, while the 9/7 RPs yield only 4
coefficients, 2 high and 2 low coefficients, by processing the same number of pixels in
a row. This implies that in the first run each 5/3 CP would process 3 columns in an
interleaved fashion as shown in Table B.12, whereas each 9/7 CP would process in the
first run only two columns in an interleaved fashion. However, in all other runs, except
the last, both 9/7 and 5/3 CPs would process 3 columns at a time. This interleaving
process, however, would require the 9/7 and 5/3 CPs to be modified in order to allow
interleaving in execution to take place.
The advantage of this organization is that the TLBs in Figures 4.3.2 and 4.3.3 are
not required to be read and written in the same clock cycle, since, according to the
scan method shown in Figure 3.7.1 (a), 7 pixels are scanned from each row to initiate
3 successive operations and the TLB is read in the first operation and written in the
third operation, starting from the second run. Furthermore, the same fact can be used
to derive, for all runs except the last one, the control signal values for the signals
labeled R/W and incar in both TLBs, including s4, as shown in Table 4.6. These signal
values repeat every 3 cycles starting from the first cycle. However, since in the first run
the TLBs are only written, signal s4 can be set to 0 in the first run, whereas in all
subsequent runs it is set according to Table 4.6. The signals in Table 4.6, including the
control signals of the extension multiplexers, which will be generated by a separate
control unit, can be carried by latches similar to pipeline latches from the control unit
to the first stage of the pipeline, then to the next stage and so on. When a stage where a
signal will be used is reached, that signal can be dropped and the rest are carried on to
the next stage and so on until they are all used.

Table 4.6 Control signal values for signals in stage 2 of both RP1 and RP2
Cycle number   RP number   R/W   incar   s4
1              1           0     0       1
2              2           0     0       0
3              1           1     1       0
4.3.2 Transition to the last run 
The description given so far, including the control signal values in Tables 4.5 and 4.6,
applies to all runs except the last run, which requires special handling. The last run in
any decomposition level can be determined and detected by subtracting 6 from the
width (M) of an image after each run. The last run is reached when M becomes less than
or equal to 6 (M ≤ 6), and M can have one of the six different values 6, 5, 4, 3, 2, or 1,
which imply 6 different cases. These values give the number of external memory columns
that will be considered for scanning in the last run.
According to the scan method, in each run 7 columns in the external memory are
considered for scanning and each 7 pixels scanned, one from each column, initiate 3
consecutive operations. Thus, since cases 6 and 5 initiate 3 operations, they can be
handled as normal runs.
On the other hand, cases 4 and 3 initiate 2 operations and the dataflow in the last
run will differ from the normal dataflow given in Table B.12. Therefore, two dataflows
are provided in Tables B.13 and B.14 for even and odd N, respectively, so that they
can be applied when either of the two cases occurs. The dataflow shown in Table
B.13 is derived for case 4 but it can also be used for case 3. Similarly, Table B.14 is
derived for case 3 but it can also be used for case 4. Moreover, examination of Tables
B.13 and B.14, especially signals s2, sl0, and sl1, shows that after 2k+2 cycles from the
last empty cycle, where k is the number of pipeline stages of the RPs, the control
signal values of signals s2, sl0, and sl1, which repeat every 4 clock cycles, should be
as shown in Table 4.7 for the rest of the decomposition level. However, during the
2k+2 and the empty cycles, the control signal values for s2, sl0, and sl1 follow Table
4.5. Therefore, cases 4 and 3 can be considered as one case. Only at the beginning of
the transition to the last run, if N is even, one empty cycle is inserted; otherwise,
4 cycles are inserted, according to Tables B.13 and B.14, respectively. During an
empty cycle the external memory is not scanned.

Table 4.7 Control signal values for s2, sl0, and sl1 in the last run
Cycle number   s2   sl0   sl1
34             0    0     1
35             1    1     1
36             1    1     1
37             1    1     0

On the other hand, cases 2 and 1 each initiate one operation. Case 2 initiates an
operation each time 2 pixels, one from each column, are scanned, whereas case 1
initiates an operation each time a pixel is scanned from the last column. Therefore, the
dataflow of the last run in the two cases will differ from the normal dataflow given in
Table B.12. For this reason, two dataflows are given in Tables B.15 and B.16 for even
and odd N, respectively, to be used when either of the two cases occurs. The
dataflow in Table B.15 is derived for case 2, even N, but it can also be applied in case
1 for even N. Similarly, Table B.16 is derived for case 1, odd N, but it can be
applied in case 2 for odd N. Furthermore, study of Tables B.15 and B.16 shows that
in the last run the control signal values for s2, sl0, and sl1 follow Table 4.5 until the
clock cycle that is 2k+1 cycles away from the last empty cycle is reached. In that
clock cycle, the control signal value of signal sl0 is changed to zero instead of one. Then,
for all subsequent cycles and to the end of the decomposition level, the control
signal values for signals sl0, sl1, and s2 = s3 should remain at one and edl = edh should
alternate between 0 and 1. Therefore, cases 2 and 1 can be treated as one case. Only at
the beginning of the transition to the last run, even N requires insertion of two empty
cycles and odd N requires insertion of five cycles, according to Tables B.15 and B.16,
respectively.
Figure 4.3.4 shows the block diagram of the control unit that generates signals s2,
sl0, and sl1, along with the circuits that detect the occurrence of the last run and the 6
cases. First, M is loaded into register RM. Then 6 is subtracted from RM by adding the
contents of register R6, which contains the 2's complement of 6, through the
2's-complement adder circuit, and the result of the subtraction is loaded back into RM.
If Lr is 1, the last run has been reached and the result of the subtraction is not
transferred to RM. The 3 least significant bits of register RM are then examined by the
control unit to determine which of the 6 cases has occurred. First, z1 is examined. If z1
is 1, either case 6 or case 5 has occurred and the control unit proceeds as usual. But if
z1 is 0, then z2 is examined. If z2 is 1, then cases 4 and 3 apply; otherwise, cases 2
and 1 apply.
The above description can be generalized for determining the last run in any scan
method (first, second, or third scan method and so on) used in designing single or
l-parallel architectures. In general, the last run in any scan method can be
determined and detected by subtracting 2i from the width (M) of an image after each
run. The last run is reached when M becomes less than or equal to 2i (M ≤ 2i), where
i = 1, 2, 3, ... denotes the first, the second, the third scan method, and so on. When the
last run is reached, M can have one of 2i different values: 2i, 2i-1, 2i-2, ..., 2, 1,
which implies 2i cases.
These values give the number of external memory columns that are considered
for scanning in the last run. In addition, cases 2i and 2i-1 can always be handled as
normal runs.
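
The last-run detection described above can be sketched in a few lines of Python. This is only an illustrative software model of the repeated subtraction, not the control-unit hardware; the function names `last_run_case` and `case_pair` are hypothetical, and counting the last run as one additional run is an assumption made here.

```python
def last_run_case(width, i=3):
    """Return (runs, case) for an image of `width` columns under scan method i.

    After each run, 2*i is subtracted from the remaining width. The last run
    is reached when the remaining width is at most 2*i; that remaining width
    is the case number (columns left to scan in the last run).
    """
    if width < 1 or i < 1:
        raise ValueError("width and i must be positive")
    step = 2 * i
    remaining = width
    runs = 1  # assumption: the last run itself is counted as a run
    while remaining > step:
        remaining -= step
        runs += 1
    return runs, remaining


def case_pair(case):
    """Group cases in pairs: cases (1, 2) -> 1, (3, 4) -> 2, (5, 6) -> 3."""
    return (case + 1) // 2


if __name__ == "__main__":
    for m in (18, 20, 21):
        runs, case = last_run_case(m, i=3)
        print(m, runs, case, case_pair(case))
```

For the third scan method (i = 3), a width of 20 leaves 2 columns for the last run (case 2), while a width of 18 ends in case 6, which is handled as a normal run.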
According to the 5/3 DDGs, each 5/3 CP should also interleave the execution of 3
columns if case 5 or case 6 is the last run. But if case 3 or case 4 is the last run,
according to Tables B.13 and B.14, each CP should process 2 columns in an interleaved
fashion, whereas if case 1 or case 2 is the last run, according to Tables B.15 and
B.16, each CP should process one column. On the other hand, each 9/7 CP, according
to the DDGs, should also interleave the execution of 3 columns if either (cases 3 and 4)
or (cases 5 and 6) is the last run. However, if case 1 or case 2 is the last run, then each
CP should interleave the execution of 2 columns, as shown in Tables B.13 and B.14.








6, shows that the 2-parallel RPs would not be able to yield all required output
coefficients. Thus, to get the remaining coefficients, the 4 RPs should be instructed to
execute one extra run. In the extra run, each CP would process only one column, as
shown in Tables B.15 and B.16. Signal s5 of the multiplexers labeled mux5 in Figs
4.3.2 and 4.3.3 should be set to 1 only in the computations involving cases 3 and 4 of
the 5/3 and cases 1 and 2 of the 9/7; otherwise, it remains at 0.
To enable each CP to process a single column or to interleave the execution of 3 or 2
columns, the datapaths of the 5/3 and 9/7 processors should be modified as shown in
Figures 4.3.5 (a) and (b), respectively. Through the multiplexers labeled mux, the CP
controls whether a single column is executed or 2 or 3 columns are interleaved.
0 0  interleave 3 columns (run 1 to the run before last, plus last run of cases 5 and 6)
0 1  interleave 2 columns (if last run is case 3 or 4)
1 x  single column (if last run is case 1 or 2)
Figure 4.3.5 (a) Modified 5/3 CP for 2-parallel intermediate architecture
0 0  interleave 3 columns (run 1 to the run before last, plus last run of cases 3 and 4 and cases 5 and 6)
0 1  interleave 2 columns (if last run is case 1 or 2)
1 x  single column (extra run for cases 6 and 5)
Figure 4.3.5 (b) Modified 9/7 CP for 2-parallel intermediate architecture
4.3.3 3-parallel pipelined intermediate architecture
The 2-parallel pipelined intermediate architecture developed in section 4.3.1 can be
extended to a 3-parallel pipelined intermediate architecture as shown in Figure 4.3.6.
This architecture increases the speed by a factor of 3 as compared with the single
pipelined architecture. The architecture performs its computations according to the
dataflow given in Table B.17. It operates with frequency f_3/3 and scans the external
memory with frequency f_3. The clock frequency f_3 can be obtained from statement 3 as
$$f_3 = 3k/t_p \tag{4.31}$$
(Signal labels in Figure 4.3.6: edh = sh3 = sh2, edl = sl3 = sl2.)
Figure 4.3.6 3-parallel pipelined intermediate architecture
[Figure 4.3.7: waveforms of clock f_3 and of clocks f_3a and f_3b, each of frequency f_3/3]
The waveform of clock f_3 and two waveforms of frequency f_3/3, labeled f_3a and
f_3b, that can be generated from f_3 are shown in Figure 4.3.7.
RP2 loads new data into its latches every time clock f_3b makes a positive
transition, whereas RP1 and RP3 load when clock f_3a makes a positive and a negative
transition, respectively. On the other hand, CP1 and CP3 simultaneously load new
data every time clock f_3a makes a positive transition, and CP2 loads every time clock
f_3b makes a positive transition. Furthermore, for the architecture to operate properly,
it is essential that the three clocks labeled f_3, f_3a, and f_3b be synchronized as shown
in Figure 4.3.7. Clocks f_3a and f_3b can be generated from f_3 using a 2-bit register
clocked by f_3 and with a synchronous control signal clear. To obtain the divide-by-3
frequency, the register should be designed to count from 0 to 2 and then repeat. The
synchronization can then be achieved by the control unit simply by asserting the clear
signal high just before the first cycle in which the external memory scanning begins.
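
As a rough illustration of the divide-by-3 register, the following Python sketch steps a modulo-3 counter once per cycle of f_3 and decodes two slower clocks from it. The decoding used here (f_3a high for counts 1 and 2, f_3b high for count 2 only) is an assumption chosen so that f_3a rises first, f_3b rises one cycle later, and both fall together; the actual phase relationship is the one drawn in Figure 4.3.7, which is not reproduced here. The function name `divide_by_three` is hypothetical.

```python
def divide_by_three(num_cycles):
    """Model a 2-bit register that counts 0, 1, 2 and repeats.

    Returns a list of (count, f3a, f3b) tuples, one per cycle of f_3,
    starting right after the synchronous clear.
    """
    count = 0  # state after the control unit asserts clear
    samples = []
    for _ in range(num_cycles):
        f3a = 1 if count >= 1 else 0  # assumed decoding, see lead-in
        f3b = 1 if count == 2 else 0  # assumed decoding, see lead-in
        samples.append((count, f3a, f3b))
        count = 0 if count == 2 else count + 1  # wrap after 2
    return samples


if __name__ == "__main__":
    for sample in divide_by_three(6):
        print(sample)
```

Both decoded outputs repeat every three cycles of f_3, so each has frequency f_3/3.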
The buses labeled bus0, bus1, and bus2 are used for transferring, in every clock
cycle, up to 3 pixels from external memory to one of the RP latches labeled Rt0, Rt1,
and Rt2. In the first clock cycle, 3 pixels are scanned from external memory locations
X(0,0), X(0,1), and X(0,2) and are loaded into the RP1 latches to initiate the first
operation. The third pixel, X(0,2), on bus2, which is required in the next
operation, is also loaded into Rd0. The second clock cycle scans 2 pixels from
external memory locations X(0,3) and X(0,4) through bus1 and bus2, respectively,
and loads them into the RP2 latches, along with the pixel in register Rd0, by the pulse
ending the cycle. This cycle also stores the pixel carried by bus2 in register Rd0.
Similarly, the third clock cycle transfers 2 pixels from external memory locations
X(0,5) and X(0,6), together with the pixel in register Rd0, to the RP3 latches to initiate
the third operation. The scan then moves to the second row.
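
The three-cycle scan of one row described above can be modeled as follows. This is a minimal sketch covering only the first run (columns 0 to 6); the function name `scan_row_first_run` and the tuple layout are hypothetical and serve only to show how register Rd0 carries the shared pixel from one RP to the next.

```python
def scan_row_first_run(row):
    """Return [(cycle, rp, pixels)] for one row of the first run.

    Each pixel is a (row, column) pair. Seven pixels are read from external
    memory; the pixel held in Rd0 is reused by the next RP.
    """
    ops = []
    # Cycle 1: bus0, bus1, bus2 carry three pixels to RP1;
    # the bus2 pixel is also saved in Rd0.
    ops.append((1, "RP1", [(row, 0), (row, 1), (row, 2)]))
    rd0 = (row, 2)
    # Cycle 2: bus1 and bus2 carry two new pixels; RP2 also receives Rd0.
    ops.append((2, "RP2", [rd0, (row, 3), (row, 4)]))
    rd0 = (row, 4)
    # Cycle 3: two more new pixels; RP3 also receives Rd0.
    ops.append((3, "RP3", [rd0, (row, 5), (row, 6)]))
    return ops


if __name__ == "__main__":
    for op in scan_row_first_run(0):
        print(op)
```

Although each RP receives 3 pixels, only 7 distinct memory locations are read per row, because X(0,2) and X(0,4) are supplied from Rd0 rather than scanned again.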
The paths labeled P1, P2, and P3 in Figure 4.3.6 are used for passing coefficients
between the three RPs, since a coefficient calculated in one stage of an RP is always
required in the next stage of another RP. This requires the combined datapath
architectures of the three RPs for 5/3 and 9/7 to be modified as shown in Figure 4.3.8
(a) and (a, b), respectively, so that they can fit into the RPs of the 3-parallel
architecture shown in Figure 4.3.6. Note that Figures 4.3.8 (a) and (b) together form the
9/7 RPs datapath architecture. This architecture can be verified using the 9/7 DDGs.
The control signal sf of the 9 multiplexers labeled muxf in Figure 4.3.8 is set to 1 in
the first run and to 0 in all other runs.
In the 5/3 datapath architecture shown in Figure 4.3.8 (a), all high coefficients, 


















Figure 4.3.8 (a) Modified 5/3 RPs datapath architecture
Figure 4.3.8 (a,b) Modified 9/7 RPs datapath for 3-parallel intermediate architecture
be used by RP1 in the calculations of low coefficients in the next run. On the other
hand, the 9/7 datapath stores, the coefficients labeled Y.(5), y··(4), and Y.(3) in the 
DDGs that are generated as a result of processing the first 7 pixels of every row in
the first run, in TLB1, TLB2, and TLB3, respectively. All other runs are handled
similarly.
For the same reason mentioned for the 2-parallel architecture, the 5/3 RPs generate 6
coefficients each time they process 7 pixels of a row, while the 9/7 RPs generate 4
coefficients by processing the same number of pixels in the first run. These 4
coefficients are generated by RP1 and RP2, while RP3 generates invalid
coefficients during the first run. As shown in Table B.17, each CP in the 3-parallel
architecture processes, in a run, the coefficients of 2 columns in an interleaved fashion.
This interleaved processing also requires each CP to be modified as shown in Figures
3.8.3 and 3.8.4 (a) for 5/3 and 9/7, respectively.
In the first run the TLB is only written. However, starting from the second run
until the run before last, the TLB is read and written in the same clock cycle, with
respect to clock f_3a.
The negative transition of clock f_3a always brings a new high coefficient from
stage 1 into stage 2 of RP3. During the low pulse of clock f_3a the TLB is read and
the result, which is placed on the path labeled P3, is loaded by the positive transition
into latch Rt2 in stage 3 of RP1, where it is used in the calculation of the low
coefficient. On the other hand, during the high pulse, as indicated in Figure 4.3.7, the
high coefficient in Rt1, which is needed in the next run, is stored in the TLB.
The register labeled TLBAR (TLB address register) generates addresses for the
TLB. Initially, register TLBAR is cleared to zero by asserting signal incar low to point
at the first location in the TLB. Then, to address the next location after each read and
write, register TLBAR is incremented by asserting incar high. Each time a run is
complete, register TLBAR is cleared to zero to start a new run and the process is
repeated.
The two multiplexers labeled muxh and muxl are used for passing, in every clock
cycle with reference to clock f_3, the high and low coefficients, respectively, generated
by the three RPs. The two control signals of the two multiplexers are shown in Figure
4.3.6 connected to clocks f_3a and f_3b. When both clocks f_3a and f_3b are
low, the two multiplexers pass the output coefficients generated by RP1,
whereas when clock f_3a is high and clock f_3b is low, the
two multiplexers pass the output coefficients generated by RP2, as indicated in
Figure 4.3.7. Finally, when both clocks are high, the two multiplexers pass
the output coefficients of RP3. In addition, note that the path extending from the inputs
of the multiplexer muxh, passing through muxh2 and muxce0, and ending at Rt2 may
form a critical path, since signals through this path should reach Rt2 during one cycle
of clock f_3.
The registers labeled SRH1, SRH0, SRL1, and SRL0, including RdH and RdL,
operate with frequency f_3. Registers SRH1, SRH0, and RdH store high coefficients,
while registers SRL1, SRL0, and RdL store low coefficients. New coefficients are
loaded simultaneously into both the CP1 and CP3 latches every time clock f_3a makes a
positive transition, whereas the CP2 latches are loaded when clock f_3b makes a positive
transition. Furthermore, each time a transition from one run to the next is made and
the column length (N) of an image is odd, the external memory should not be scanned
for 3 clock cycles, since during this period the CPs process the last high and low
coefficients in each of the 3 columns of the H and L decompositions, as required by the
DDGs for odd signals. This is also true for the 2-parallel intermediate architecture. No
such situation occurs when the column length of an image is even.
It can be reasoned from Table B.17 that the control signals of the two multiplexers
labeled muxh2 and muxh3 and of register RdH can all be combined into one signal, sh2.
Similarly, the control signals of the two multiplexers muxl2 and muxl3 and of register
RdL can be combined into one signal, sl2. Furthermore, a careful examination of Table
B.17 shows that the control signal values that must be issued by the control unit for
signals sh1, sh0, sl1, sl0, sh2, and sl2, starting from cycles 16 to 21 and repeating every
6 cycles, should be as shown in Table 4.8.
Table 4.8 Control signal values
Cycle   sh1   sh0   sl1   sl0   sh2   sl2
16      1     1     1     1     0     0
17      1     1     0     0     0     1
18      0     0     0     0     1     1
19      1     1     1     1     1     1
20      1     0     1     1     0     1
21      1     0     1     0     0     0
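
Because the pattern in Table 4.8 repeats every 6 cycles from cycle 16 onward, the control unit's output for any later cycle can be looked up with a modulo operation. The sketch below is an illustrative software model only; the names `TABLE_4_8` and `control_signals` are hypothetical, and the table values are transcribed from Table 4.8 as printed above.

```python
SIGNALS = ("sh1", "sh0", "sl1", "sl0", "sh2", "sl2")

# Rows for cycles 16..21, in the column order given by SIGNALS.
TABLE_4_8 = (
    (1, 1, 1, 1, 0, 0),  # cycle 16
    (1, 1, 0, 0, 0, 1),  # cycle 17
    (0, 0, 0, 0, 1, 1),  # cycle 18
    (1, 1, 1, 1, 1, 1),  # cycle 19
    (1, 0, 1, 1, 0, 1),  # cycle 20
    (1, 0, 1, 0, 0, 0),  # cycle 21
)


def control_signals(cycle):
    """Return the six control signal values for a cycle >= 16."""
    if cycle < 16:
        raise ValueError("the repeating pattern starts at cycle 16")
    row = TABLE_4_8[(cycle - 16) % 6]
    return dict(zip(SIGNALS, row))


if __name__ == "__main__":
    print(control_signals(16))
    print(control_signals(23))  # same values as cycle 17
```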
Moreover, if it is necessary to extend the 2-parallel architecture to a 4-parallel
architecture, the experience gained in designing the 2- and 3-parallel architectures
suggests that the best 4-parallel architecture is obtained with the fourth overlapped
scan method, the best 5-parallel architecture with the fifth scan method, and so on. The
architecture design for a higher degree of parallelism then becomes similar to that
of the 3-parallel intermediate architecture. In contrast, an attempt to design, e.g., a
4-parallel intermediate architecture using the third scan method would require
very complex modifications in the datapath architecture of the combined 4 RPs and
complex control logic. Moreover, the objective of choosing a higher scan method in
the first place is to reduce the power consumption due to scanning the overlapped areas
of external memory. Therefore, it makes sense to design the 4-parallel architecture with
the fourth scan method, the 5-parallel with the fifth scan method, and so on.
4.3.4 Scale factor multipliers reduction 
In the lifting-based tree-structured filter bank for 2-D DWT shown in Figure 3.1.1, it
can be observed that the high output coefficients, which form the H decomposition, are
each multiplied by the scale factor k in the first pass. In the second pass, the high output
coefficients, which form the HH subband, are each multiplied by k again. This implies
that the first multiplication can be eliminated and the output coefficients of the HH
subband can be multiplied by k^2 using one multiplier after the second pass. The high
output coefficients that form the HL subband, in contrast, are each multiplied by 1/k in
the second pass. This implies that no multiplications are required and the scale
multipliers along this path can be eliminated, since the HL subband coefficients are
formed by multiplying each coefficient in the first pass by k and then in the second pass
by 1/k.
On the other hand, the low output coefficients of the first pass, which form the L
decomposition, are each multiplied by 1/k. Then in the second pass, the output
coefficients that form the LH subband are each multiplied by k, which implies that no
multiplications are required along this path. The output coefficients of the
second pass that form the LL subband are each multiplied by 1/k. Thus, instead of
performing two multiplications, one multiplication by 1/k^2 can be performed after the
second pass [22, 23, 59]. Note, however, that the simple computations involved in each
lifting step of the 5/3 and 9/7 algorithms are what make these results possible.
This process reduces the number of multipliers used for scale factor multiplications in
the tree-structured filter bank from 6 to 2. When applied to the single
pipelined architectures, it reduces the number of scale multipliers from 4 to 2,
whereas in the 2- and 3-parallel pipelined architectures, it reduces the number of scale
multipliers to 2 and 4 instead of 8 and 12, respectively.
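
The cancellation argument above can be checked numerically. The sketch below multiplies the per-pass scale factors for each subband and counts how many subbands still need a multiplier. It is only an arithmetic illustration; the function name `net_scale_factors` is hypothetical, and k = 2.0 is an arbitrary illustrative value, not the 9/7 scale factor.

```python
def net_scale_factors(k):
    """Return the combined two-pass scale factor of each subband.

    The first letter of a subband name is the first-pass output (H scaled
    by k, L scaled by 1/k); the second letter is the second-pass output.
    """
    per_pass = {"H": k, "L": 1.0 / k}
    return {a + b: per_pass[a] * per_pass[b] for a in "HL" for b in "HL"}


def multipliers_needed(k):
    """Count subbands whose combined scale factor is not 1."""
    return sum(1 for f in net_scale_factors(k).values() if f != 1.0)


if __name__ == "__main__":
    factors = net_scale_factors(2.0)
    print(factors)                  # HH -> k^2, LL -> 1/k^2, HL and LH -> 1
    print(multipliers_needed(2.0))  # only HH and LL need a multiplier
```

Only HH (factor k^2) and LL (factor 1/k^2) retain a scale multiplication, which agrees with the reduction to 2 multipliers stated above.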
In [23], it has been illustrated that the multipliers used for scale factor k and
coefficients α, β, γ, and δ of the 9/7 filter can be implemented in hardware using
only two adders.
4.3.5 Evaluation of performance 
To evaluate the performance of the two proposed parallel architectures in terms of
speedup, throughput, and power consumption as compared with the single pipelined
intermediate architecture, consider the following. In the single pipelined intermediate
architecture, the total time, $T_1$, required to yield n paired outputs for j-level
decomposition of an NxM image is given by
$$T_1 = [p_1 + 3(n-1)]\,t_p/3k \tag{4.32}$$
The dataflow of the 2-parallel architecture in Table B.12 shows that $p_2 = 19$
clock cycles are needed to yield the first 2-pair of outputs. The remaining $(n-2)/2$
2-paired outputs require $2(n-2)/2$ cycles. Thus, the total time, $T_2$, required to yield
n paired outputs is given by
$$T_2 = [p_2 + (n-2)]\,\tau_2 \tag{4.33}$$
From statement 3, $\tau_2 = t_p/2k$; then
$$T_2 = [p_2 + (n-2)]\,t_p/2k \tag{4.34}$$
The speedup factor S is then given by
$$S_2 = \frac{T_1}{T_2} = \frac{[p_1 + 3(n-1)]\,t_p/3k}{[p_2 + (n-2)]\,t_p/2k} \tag{4.35}$$
For large n, the above equation reduces to
$$S_2 = \frac{2(n-1)}{(n-2)} \approx 2 \tag{4.36}$$
Eq (4.36) implies that the proposed 2-parallel intermediate architecture is 2 times
faster than the single pipelined intermediate architecture.
On the other hand, to estimate the total time, $T_3$, required for j-level
decomposition of an NxM image on the 3-parallel pipelined intermediate architecture,
assume the outputs generated by CP2 in Table B.17 are shifted up one clock cycle so
that they are aligned with those of CP1 and CP3. Then, $p_3 = 25$ clock cycles are needed
to yield the first 3-pair of outputs. The remaining $(n-3)/3$ 3-paired outputs require
$3(n-3)/3$ clock cycles. Thus, the total time, $T_3$, required to yield n paired outputs is
given by
$$T_3 = [p_3 + (n-3)]\,t_p/3k \tag{4.37}$$
The speedup factor S is then given by
$$S_3 = \frac{T_1}{T_3} = \frac{[p_1 + 3(n-1)]\,t_p/3k}{[p_3 + (n-3)]\,t_p/3k} \tag{4.38}$$
For large n, the above equation reduces to
$$S_3 = \frac{3(n-1)}{(n-3)} \approx 3 \tag{4.39}$$
Eq (4.39) implies that the proposed 3-parallel pipelined intermediate architecture is 3
times faster than the single pipelined intermediate architecture.
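
The limiting speedups can be checked numerically with a short Python sketch. It evaluates the total-time expressions for the single, 2-parallel, and 3-parallel architectures and prints their ratios for growing n. The value used for p_1 is a placeholder assumption (it is not given in this section); p_2 = 19 and p_3 = 25 are the values quoted above. The function names are hypothetical.

```python
P1_ASSUMED = 20  # placeholder: p_1 is not stated in this section


def t_single(n, p1, tp=1.0, k=1):
    """Total time of the single pipelined architecture, [p1 + 3(n-1)] tp / 3k."""
    return (p1 + 3 * (n - 1)) * tp / (3 * k)


def t_parallel(n, p, l, tp=1.0, k=1):
    """Total time of the l-parallel architecture, [p + (n-l)] tp / (l k)."""
    return (p + (n - l)) * tp / (l * k)


def speedup(n, p, l, p1=P1_ASSUMED):
    """Ratio of single-architecture time to l-parallel time (tp and k cancel)."""
    return t_single(n, p1) / t_parallel(n, p, l)


if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(n, round(speedup(n, 19, 2), 4), round(speedup(n, 25, 3), 4))
```

For small n the startup latencies p_1, p_2, and p_3 matter, but as n grows the ratios approach 2 and 3 regardless of the placeholder chosen for p_1.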
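The throughput, H, defined as the number of output coefficients
generated per unit time, can be written for each architecture as follows, where
$f_p = 1/t_p$:
$$H(\text{single}) = \frac{n}{[p_1 + 3(n-1)]\,t_p/3k} \tag{4.40}$$
The maximum throughput, $H^{max}$, occurs when n is very large ($n \to \infty$). Thus,
$$H^{max}(\text{single}) = H(\text{single})_{n\to\infty} = 3 \cdot n \cdot k \cdot f_p / 3 \cdot n = k \cdot f_p \tag{4.41}$$
$$H(\text{2-parallel}) = \frac{n}{[p_2 + (n-2)]\,t_p/2k} \tag{4.42}$$
$$H^{max}(\text{2-parallel}) = H(\text{2-parallel})_{n\to\infty} = 2 \cdot n \cdot k \cdot f_p / n = 2 \cdot k \cdot f_p \tag{4.43}$$
$$H(\text{3-parallel}) = \frac{n}{[p_3 + (n-3)]\,t_p/3k} \tag{4.44}$$
$$H^{max}(\text{3-parallel}) = H(\text{3-parallel})_{n\to\infty} = 3 \cdot n \cdot k \cdot f_p / n = 3 \cdot k \cdot f_p \tag{4.45}$$
Hence, the throughputs of the 2-parallel and 3-parallel pipelined architectures
increase by factors of 2 and 3, respectively, as compared with the single pipelined
architecture.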
To determine the amount of power reduction achieved in the external memory of
the intermediate parallel architecture as compared with the first scan method based
parallel architecture, consider the following. If the power consumption of VLSI
architectures is estimated as
$$P = C_{total} \cdot V_0^2 \cdot f \tag{4.46}$$
where $C_{total}$ denotes the total capacitance of the architecture, $V_0$ is the supply
voltage, and $f$ is the clock frequency, then the power consumption due to scanning
external memory of the single pipelined architecture based on the nonoverlapped scan
method can be written as
$$P_s(\text{non}) = \beta \cdot C_{total} \cdot V_0^2 \cdot f_1 \tag{4.47}$$
where $C_{total} \cdot V_0^2 \cdot f_1$ is the external memory power consumption due to
the first overlapped scan method, $f_1$ is the external memory scan frequency,
and $\beta = T_{mn}/T_{mo} = 2/3$. $T_{mo}$ and $T_{mn}$ denote the total external memory
access time in clock cycles for J levels of decomposition for architectures based on the
first overlapped and nonoverlapped scan methods, respectively.
Using the fact that the scan method shown in Figure 3.7.1 (a) reduces the power
consumption of the overlapped areas by a factor of 1/3, the power consumption due to
scanning the overlapped areas of Figure 3.7.1 (a) can be written as
$$P_s(\text{areas}) = \beta_0 \cdot C_{total} \cdot V_0^2 \cdot f_1 / 3 \tag{4.48}$$
where $\beta_0 = T_{ma}/T_{mo} = 1/3$ and $T_{ma}$ is the excess memory access time due to
overlapped areas scanning for J levels of decomposition. Thus, the external memory
power consumption of the single pipelined intermediate architecture, $P_s(\text{int})$, is
$$P_s(\text{int}) = P_s(\text{non}) + P_s(\text{areas}) \tag{4.49}$$
$$= \beta \cdot C_{total} \cdot V_0^2 \cdot f_1 + \beta_0 \cdot C_{total} \cdot V_0^2 \cdot f_1 / 3 \tag{4.50}$$
$$= C_{total} \cdot V_0^2 \cdot f_1 \cdot (\beta_0/3 + \beta) \tag{4.51}$$
$$= 3 \cdot k \cdot C_{total} \cdot V_0^2 \cdot f_p \cdot (\beta_0/3 + \beta) \tag{4.52}$$
The external memory power consumption of the l-parallel pipelined intermediate
architecture, $P_l(\text{int})$, can be written as
$$P_l(\text{int}) = I \cdot C_{total} \cdot V_0^2 \cdot f_l \cdot (\beta_0/3 + \beta) \tag{4.53}$$
From statement 3, $1/\tau_l = f_l = l \cdot k/t_p$; then
$$P_l(\text{int}) = l \cdot k \cdot I \cdot C_{total} \cdot V_0^2 \cdot f_p \cdot (\beta_0/3 + \beta) \tag{4.54}$$
where $I$ is the number of input buses, which is 3 in the parallel architecture.
Similarly, the external memory power consumption of the l-parallel pipelined
architecture based on the first scan method, $P_l(\text{first})$, can be written as
$$P_l(\text{first}) = I \cdot C_{total} \cdot V_0^2 \cdot f_l = l \cdot k \cdot I \cdot C_{total} \cdot V_0^2 \cdot f_p \tag{4.55}$$
Thus,
$$\frac{P_l(\text{int})}{P_l(\text{first})} = \frac{l \cdot k \cdot I \cdot C_{total} \cdot V_0^2 \cdot f_p \cdot (\beta_0/3 + \beta)}{l \cdot k \cdot I \cdot C_{total} \cdot V_0^2 \cdot f_p} \tag{4.56}$$
$$= \beta_0/3 + \beta = 7/9 \tag{4.57}$$
Eq (4.57) implies that the intermediate parallel architecture based on the scan method
shown in Figure 3.7.1 (a) reduces the power consumption of the external memory by a
factor of 7/9 as compared with the parallel architecture based on the first scan method.
On the other hand,
$$\frac{P_l(\text{int})}{P_s(\text{int})} = \frac{l \cdot k \cdot I \cdot C_{total} \cdot V_0^2 \cdot f_p \cdot (\beta_0/3 + \beta)}{3 \cdot k \cdot C_{total} \cdot V_0^2 \cdot f_p \cdot (\beta_0/3 + \beta)} = l \tag{4.58}$$
implies that as the degree of parallelism increases, the external memory power
consumption of the intermediate parallel architecture based on the scan method in
Figure 3.7.1 (a) also increases by a factor of l as compared with the external memory
power consumption of the single pipelined intermediate architecture.
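
The two power ratios above reduce to simple rational arithmetic, which can be verified with exact fractions. This sketch only checks the arithmetic of Eqs (4.57) and (4.58) using the stated values of 2/3 and 1/3 for the two access-time ratios and 3 input buses; the function names are hypothetical.

```python
from fractions import Fraction

BETA = Fraction(2, 3)    # nonoverlapped access time relative to first overlapped
BETA_0 = Fraction(1, 3)  # excess (overlapped areas) access time ratio


def ratio_intermediate_to_first(beta=BETA, beta0=BETA_0):
    """External memory power, intermediate vs. first scan method (Eq 4.57)."""
    return beta0 / 3 + beta


def ratio_parallel_to_single(l, buses=3):
    """External memory power, l-parallel vs. single intermediate (Eq 4.58).

    The common factors cancel, leaving (l * buses) / 3.
    """
    return Fraction(l * buses, 3)


if __name__ == "__main__":
    print(ratio_intermediate_to_first())  # 7/9
    print(ratio_parallel_to_single(2))    # 2
    print(ratio_parallel_to_single(3))    # 3
```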
4.4 Conclusions 
In this chapter, the single pipelined overlapped architecture is extended to 2-parallel, 
3-parallel, and 4-parallel architectures to achieve speedup factors of 2, 3, and 4, 
respectively, according to the evaluation given in section 4.2.4. Similarly, the single 
pipeline intermediate architecture is extended to 2-parallel and 3-parallel 
architectures. According to the evaluation given in section 4.3.5, the 2-parallel and 3-
parallel intermediate architectures achieve speedup factors of 2 and 3, respectively. 
The intermediate parallel architecture reduces the power consumption of the external 
memory by a factor of 7/9 as compared with the overlapped parallel architecture,
Eq (4.57). The advantage of the parallel architectures developed in this chapter is that
the total temporary line buffer (TLB) requirement does not increase from the single 




DWT MEMORY ARCHITECTURES 
5.1 Introduction
DWT memory architectures have usually been overlooked in the literature. However,
since 2-D DWT memory architectures are as important as the DWT processor
architectures commonly covered in the literature, in this chapter two novel VLSI
architectures for the LL-RAM and the subband memory are developed.
The general structure of a compression system is shown in Figure 5.1.1. The DWT
unit generally consists of a row-processor (RP) and a column-processor (CP). The RP
reads the LL-RAM, while the CP writes into the LL-RAM and the subband memory.
DWT decomposes an NxM image into subbands, as shown in Figure 2.1.3 for 3
decomposition levels. These subbands must be stored by the DWT unit in a memory such
that they can be manipulated effectively by the compression unit for compression
purposes. Therefore, a memory architecture is necessary that allows the DWT unit to
perform both reads and writes efficiently and the compression unit to perform reads.
Figure 2.1.3 shows that the first decomposition generates 4 subbands labeled HL1,
HH1, LH1, and LL1. The coefficients of the first 3 subbands are stored in a
memory, called the subband memory, which contains memory blocks HL1, HH1,
and LH1. The compression unit can then read the 3 subbands and compress each
independently, while subband LL1 is stored in another memory, called the LL-RAM
or just RAM, for further decompositions.
The second decomposition generates 4 subbands, labeled HL2, HH2, LH2, and LL2,
by reading the subband LL1 coefficients stored in the LL-RAM. The coefficients of the 3
subbands HL2, HH2, and LH2 are also stored in the subband memory, in blocks
labeled HL2, HH2, and LH2, while subband LL2 is stored in the RAM for
further decomposition.
Figure 5.1.1 General structure of a compression system. 
In the discussion above, two memory components have been identified, the LL-RAM
and the subband memory. They need to be designed such that the DWT unit can
effectively perform both read and write operations in the LL-RAM and
write-only operations in the subband memory, while the compression unit can read the
subband memory. Thus, in this chapter, the architectures of the LL-RAM and the
subband memory are developed. First, the LL-RAM architecture is developed, followed
by the subband memory architecture.
5.2 The LL-RAM architecture development 
The LL-RAM is used by the DWT unit to store the coefficients of the LL subband
that it generates in each decomposition level, for further decompositions. In the DWT
unit, the RP scans (reads) the LL-RAM, and the CP writes the LL subband
coefficients into the LL-RAM. The generalized scan method requires the RAM to be
read in every clock cycle with frequency f_l, where l = 1, 2, 3 denotes the single,
2-parallel, or 3-parallel architecture, and to be written according to the order in which
each scan method generates its output coefficients. This implies that read operations
will coincide with write operations. Therefore, the RAM architecture should be
designed such that both a read and a write can take place in the same clock cycle. Thus,
the first half cycle of clock f_l is reserved for read and the second half cycle for write.
The RAM, which can be viewed as a 2-dimensional memory of size N/2xM/2,
where N = 2^n and M = 2^m, can be readily constructed from M/2 modules, with each
module having N/2 locations.
The block diagram of the memory module used in forming the
RAM architecture is shown in Figure 5.2.1. The E signal, which is active high,
enables the module for reads and writes. The module is read and the result is placed on
the output bus when the signal labeled R/W is low; otherwise, it is written. The
address bus is used for addressing each location in the module for read or write. The
control signal labeled MS (module select) is useful when several modules are used in
forming a memory. It allows one module to be selected, through a decoder, for read or
write. The module can be read or written only when both signals E and MS are asserted
high.
The complete architecture of the RAM that facilitates both reads and writes is shown
in Figure 5.2.2 (a) and (b). This architecture is based on the first scan method.
However, the RAM architecture can be easily modified to handle other (or higher)
scan methods, as will be explained later. The decoder labeled dcodms is responsible
for selecting modules for reads or writes. When the architecture performs read
operations, the register labeled RMSR (read module select register) determines
through muxs which modules are enabled. When it performs write operations,
register WMSR (write module select register) is used for selecting modules. Both
registers are (m-2)-bit counters with control signals clr (clear) and inc (increment) and
operate with frequency f_l.
The control signal of the multiplexer labeled muxs is shown connected to clock f_l.
When f_l is low, a read operation takes place and RMSR controls the decoder. On the
other hand, when f_l is high, a write operation takes place and WMSR controls the
decoder.
Figure 5.2.1 Block diagram of the memory module.
The circuit in the upper left corner of Figure 5.2.2 (a), consisting of 3 multiplexers
labeled muxb, a NOR gate, signal RE, and the register labeled WER (write enable
register), is in charge of generating signal values for the read and write signal
labeled R/W. The control signals of the 3 multiplexers are also shown connected to
clock f_l. When f_l is low, modules are enabled for read; otherwise, they are enabled for
write. The signals generated by this circuit will be described later in detail.
The address bus is managed by two registers labeled RMAR (read module
address register) and WMAR (write module address register) through muxa. When the
multiplexer control signal is driven low by clock f_l, a read operation takes place and
RMAR provides addresses to the modules; otherwise, a write operation takes place and
WMAR provides addresses to the modules.
In the following, read and write operations are described in detail. First, read
operations are described, followed by write operations.
5.2.1 The LL-RAM read operations 
The LL-RAM is read according to the scan method shown in Figure 3.5.1, the first
overlapped scan method. This scan method requires reading 3 pixels simultaneously in
every clock cycle, one from each module, as follows. When f_l is low, the 3
multiplexers labeled muxb pass signal RE, which is active low, to the output signals
Y0, Y2, and Y1. The three output signals enable all memory modules for read.
However, the scan method requires that in every run only 3 modules be enabled for
read: first modules 1, 2, and 3, then modules 3, 4, and 5,
followed by modules 5, 6, and 7, and so on. The role of the decoder labeled
dcodms is to guarantee that modules are enabled in this order. First,
the decoder output labeled 0 is activated to enable modules 1, 2, and 3.
Then, using the address bus, the first location of each enabled module is read onto the
output buses. When f_l makes a positive transition, the 3 pixels on the output buses are
loaded into a temporary register labeled DL (data latch). Then the negative transition
of clock f_l loads the 3 pixels into the RP latches. To address the second location
in each module, the negative transition of f_l also increments register RMAR. This
process is repeated until the 3 enabled modules are read.
To enable the next 3 modules, register RMSR is incremented by one, which
asserts the second output of the decoder high. The decoder output labeled 1 enables
modules 3, 4, and 5 for read. When all 3 modules are read, the decoder output line
labeled 2 is activated by incrementing RMSR again by one to enable the set that
contains modules 5, 6, and 7. This process is repeated until the whole RAM is
scanned.
In Figure 5.2.2 (b), the outputs of modules 3 and 5 are shown connected to mux0
and mux1, respectively. These multiplexers are necessary because all modules with
odd numbers, except the first, are scanned twice. For example, in the first run, when
locations of module 3 are scanned they are placed on the bus labeled bus2, whereas in
the second run they are placed on another bus labeled bus0. Thus, to allow these
multiplexers to switch between bus2 and bus0, their control signals are connected to
the decoder dcodms output lines labeled 0 and 1 and so on.
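
The module grouping and bus switching for reads can be summarized with a small Python model. It is a sketch only: the function names are hypothetical, RMSR is treated as a plain integer, and assigning the middle module of each group to bus1 is an assumption (the text states the bus only for the shared odd-numbered modules).

```python
def read_group(rmsr):
    """Modules enabled for read when decoder output `rmsr` is active.

    Output 0 -> modules 1, 2, 3; output 1 -> 3, 4, 5; output 2 -> 5, 6, 7.
    """
    first = 2 * rmsr + 1
    return [first, first + 1, first + 2]


def bus_of(module, rmsr):
    """Bus that carries `module` during the run selected by `rmsr`."""
    group = read_group(rmsr)
    if module not in group:
        raise ValueError("module is not enabled in this run")
    return "bus%d" % group.index(module)


if __name__ == "__main__":
    for run in range(3):
        print(run, read_group(run))
    print(bus_of(3, 0), bus_of(3, 1))  # module 3 moves from bus2 to bus0
```

The last module of one group is the first module of the next, which is why the odd-numbered modules after the first need a multiplexer to switch between bus2 and bus0.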
5.2.2 The LL-RAM write operations 
How the LL-RAM should be written can be determined by examining the scan
method or the dataflow table of the DWT architecture under consideration. For
example, examination of the first scan method shows that the CP generates
output coefficients column-by-column, which implies that the RAM should be written
module-by-module.
In general, the RAM can be written as follows. Whenft is high, the outputs b1 and 
b2 of the register labeled WER (write enable register), which is initially cleared to 
zero, and the output of the NOR gate are passed through multiplexers to the outputs 
labeled Y1, Y3, and YO, respectively. Since WER is initially zero, only YO will be 
asserted high, which enable for write all modules 1+ 3i, where 1 ~ 0, 1, 2 ... , m-1. For 
example, if m~3, then modules I, 4, and 7 will be enabled. However, the RAM is 
required to be written module-by-module and in order, i.e., f.rst module I, then 2 
followed by 3 and so on and the function of the decoder labeled dcodms is to provide 
this module-by-module control. Thus, the decoder output Ia.)eled 0 will be first 
asserted high through WMSR to enable only module number I for write. Note that a 
module is enabled for write when its both signals MS and R/W are asserted high and 
all modules are disabled when signal E of dcodms is low. 
When all locations of module 1 are written, WER is incremented by one to assert 
only Y1 high. Y1 enables all modules labeled 2 + 3i, but since the first output of the 
decoder is still high, only module 2 will be selected for write. When all locations of 
module 2 are written, WER is incremented again by one, this time to assert Y2 high. Y2 
enables all modules labeled 3 + 3i, but since the first output of the decoder is still high, 
only module 3 will be selected for write. When all locations of module 3 are written, 
WER is cleared to zero to set Y0 high, and WMSR is incremented by one to assert the 
second output of the decoder, labeled 1, high. Assertion of both Y0 and the second output of 
the decoder enables only module 4 for write. This process is repeated until all modules 
are written. 
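The module-by-module write control described above can be sketched in a few lines of Python. This is only an illustrative model, not the thesis hardware: the function name `first_scan_write_order` is hypothetical, and the mapping module = 3*WMSR + WER + 1 is an assumption read off the text (Y0, Y1, Y2 enable modules 1 + 3i, 2 + 3i, 3 + 3i, and each decoder output covers three consecutive modules).

```python
def first_scan_write_order(num_modules):
    """Return (module, WMSR, WER) tuples in the order modules are written.

    Assumed model: WER is a 2-bit counter (0 -> Y0, 1 -> Y1, 2 -> Y2) and
    WMSR selects the decoder output covering modules 3*WMSR + 1 .. 3*WMSR + 3.
    """
    order = []
    wmsr = 0  # write module select register (drives decoder dcodms)
    wer = 0   # write enable register, counts 0, 1, 2 and repeats
    for _ in range(num_modules):
        module = 3 * wmsr + wer + 1
        order.append((module, wmsr, wer))
        if wer == 2:
            # Third module of the group is done: clear WER, advance WMSR.
            wer = 0
            wmsr += 1
        else:
            wer += 1
    return order


if __name__ == "__main__":
    for module, wmsr, wer in first_scan_write_order(7):
        print(f"module {module}: WMSR={wmsr}, WER={wer} (Y{wer} high)")
```

Under these assumptions the modules are enabled in the order 1, 2, 3, 4, ..., with WMSR advancing only after every third module.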
Note that WER is a 2-bit register that counts from 0 to 2 and repeats. Furthermore, 
the amount of data to be written in each decomposition level, including the number of 
modules and the number of locations to be written in each module, can be determined in 
advance from the height and width of the image that will be 
processed. 
5.2.3 RAM architecture modifications for higher scan methods 
The RAM architecture shown in Figure 5.2.2 can be easily modified to handle other 
scan methods. The circuits in the upper corner of the RAM architecture, consisting of 
register WER and the multiplexers labeled muxb, remain unchanged. In general, 
modifications for a specific scan method can be obtained by eliminating 
some of the OR gates whose outputs are connected to signal MS, as follows. For 
example, the second scan method, which requires 5 modules to be considered for read 
and two modules for write at a time, would require eliminating the first OR gate and 
connecting the first output of the decoder labeled dcodms to signal MS of each of 
modules m1, m2, and m3, while the connections to modules m4 and m5 remain 
unchanged. Then, the connection pattern of the first 5 modules m1, m2, m3, m4, and 
m5 is repeated in the next 5 modules m5, m6, m7, m8, and m9 and so on. 
Similarly, the third scan method, which requires 7 modules to be considered for 
read and 3 modules for write at a time, would require eliminating the first and the 
second OR gates and then connecting the first output of the decoder to signal MS of 
modules m1, m2, and m3 and the second output to signal MS of modules m4 
and m5, while the connections to modules m6 and m7 remain the same. The connection 
pattern of the first 7 modules is repeated in the next 7 modules m7, m8, m9, m10, 
m11, m12, and m13 and so on. 
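The overlapping module groups used for read can be enumerated with a short Python sketch. The helper name `read_groups` is hypothetical, and the group size 2s + 1 for scan method s is an assumption that generalizes the three cases stated in the text (3, 5, and 7 modules), with the last module of each group reused as the first module of the next.

```python
def read_groups(scan_method, num_modules):
    """List the groups of modules considered for read at a time.

    Assumption: scan method s reads 2*s + 1 modules at a time, and
    consecutive groups share one module (the last of one group is the
    first of the next), as in m1..m5 followed by m5..m9.
    """
    size = 2 * scan_method + 1
    groups = []
    first = 1
    while first + size - 1 <= num_modules:
        groups.append(list(range(first, first + size)))
        first += size - 1  # shared boundary module
    return groups


if __name__ == "__main__":
    print(read_groups(2, 9))   # second scan method, 9 modules
    print(read_groups(3, 13))  # third scan method, 13 modules
```

For the second scan method with 9 modules this yields the groups m1 to m5 and m5 to m9, matching the connection pattern described above.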
Now, let us see how read operations are performed on the RAM architecture based 
on the second scan method. Since the second scan method requires 5 modules to be 
considered for read at a time, the modules labeled m1, m2, m3, m4, and m5 will be 
considered first. Thus, to read these modules location-by-location, registers RMAR 
and RMSR are reset to 0. This allows register RMAR to address the first location in 
each module and register RMSR to enable modules m1, m2, and m3 through the 
decoder dcodms. Then, in the first clock cycle, when fi is low, the R/W signals of 
modules m1, m2, and m3 are activated for read. This allows the first location of 
each of modules m1, m2, and m3 to be read into the buses labeled bus0, bus1, and bus2, 
respectively. Then register RMSR is incremented by 1 to enable modules m4 and m5 
for read. When fi is low again in the second clock cycle, the first location in each of 
modules m4 and m5 is read into bus1 and bus2, respectively. When this is done, 
register RMAR is incremented by 1 to point at the second location in each module. 
Register RMSR is reset to 0 to enable modules m1, m2, and m3 again. When fi is low in 
the third cycle, the second location in each of modules m1, m2, and m3 is read into the 
buses. Then, register RMSR is incremented by 1 to enable modules m4 and m5 and 
disable m1, m2, and m3 through the decoder labeled dcodms. Again, when fi is low in 
the fourth clock cycle, the second location of each of modules m4 and m5 is read into 
bus1 and bus2, respectively. Then, register RMAR is incremented by one to address 
the third location of each module and register RMSR is reset to 0 to enable 
modules m1, m2, and m3 again. This process is repeated until the first 5 modules are read. 
Then the same process is applied to the next 5 modules m5, m6, m7, m8, and m9 and 
so on. 
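The two-cycle read pattern above can be written out as a small schedule generator. This is a behavioral sketch only: the function name `second_scan_reads` is hypothetical, RMSR is shown relative to the start of the current group of 5 modules, and the bus assignment follows the text (three modules on bus0, bus1, bus2 in odd cycles, two modules on bus1, bus2 in even cycles).

```python
def second_scan_reads(first_module, num_locations):
    """Return the read schedule for one group of 5 modules.

    Each entry is (cycle, RMAR, RMSR, {bus: module}). RMSR is given
    relative to the group (0 -> first three modules, 1 -> last two).
    """
    m = first_module
    schedule = []
    cycle = 0
    for rmar in range(num_locations):
        # RMSR = 0: modules m, m+1, m+2 are read in parallel.
        cycle += 1
        schedule.append(
            (cycle, rmar, 0, {"bus0": m, "bus1": m + 1, "bus2": m + 2})
        )
        # RMSR = 1: modules m+3 and m+4 are read at the same address.
        cycle += 1
        schedule.append((cycle, rmar, 1, {"bus1": m + 3, "bus2": m + 4}))
    return schedule


if __name__ == "__main__":
    for entry in second_scan_reads(first_module=1, num_locations=2):
        print(entry)
```

With two locations per module, the sketch produces four cycles, with RMAR advancing after every second cycle as described.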
Similarly, the RAM architectures for the third and fourth scan methods, etc., can be read 
in the same manner. Note that, in the read operations described above 
for the second scan method, after each read operation performed on modules m4 and 
m5, the control should return to module m1 and repeat the process. The same situation 
also occurs when the next 5 modules m5, m6, m7, m8, and m9 are considered for read 
and so on. That is, the control must remember to return to module m5 from module m9. 
Register XR is added to serve this purpose and it can be 
connected to register RMSR as shown in Figure 5.2.3. A similar problem occurs with 
write operations using registers WMSR and WER, and the solution shown in Figure 
5.2.3 can be used, which is described in detail in section 5.3.4. 
This RAM architecture would work well in DWT architectures where pixels are 
scanned in parallel, such as the parallel architectures developed in chapter 5. However, if 
a DWT architecture is required to scan the RAM pixel-by-pixel, then all OR 
gates in Fig. 5.2.2 (a) are eliminated, each output of decoder dcodms is connected 
to signal MS of only one module, and the output buses are reduced to one bus. 
On the other hand, how the RAM should be written depends on the scan 
method adopted. The first scan method, as described earlier, requires the RAM to be 
written module-by-module, whereas the second scan method requires considering 2 
modules for write at a time, as follows. Initially, registers WER, WMSR, and WMAR 
are set to 0. Setting WER and WMSR to 0 while fi is high enables module 1 for write, and 
WMAR addresses the first location of module 1. This lets the first output 
coefficient, LL0,0, be stored in the first location of module 1. The negative 
transition of clock fi ending the cycle increments WER by one to enable 
module 2 for write. During the high pulse of the second cycle of clock fi, the second 
coefficient, labeled LL0,1, is stored in the first location of module 2, while the negative 
transition of clock fi ending the cycle clears WER to enable module 1 for write again 
and increments WMAR to address the second location of module 1. In this location, 
the third output coefficient, LL1,0, is stored during the high pulse of the third cycle of 
clock fi. The negative transition of clock fi ending the third cycle increments WER by 
one to enable module 2 again for write. During the high pulse of the fourth cycle, the 
fourth output coefficient, LL1,1, is stored in the second location of module 2. This 
process is repeated until all required locations in the two modules are written. Then 
the same process is applied to the next 2 modules m3 and m4 and so on. Note that 
writing into the RAM does not take place every clock cycle, as reading does, but when it 
occurs it coincides with reading, and the coefficients are written in the order 
described above. 
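The alternating write pattern of the second scan method can be modelled as follows. The function name `second_scan_writes` is hypothetical, and the labelling rule (coefficient LLr,c goes to location r of module c + 1) is an assumption that reproduces the four coefficients named in the text; it is a sketch of the ordering only, not of the clocking.

```python
def second_scan_writes(first_module, num_locations):
    """Return (coefficient, module, WMAR) in write order for one module pair.

    WER toggles 0, 1 to alternate between the two modules, and WMAR is
    incremented each time WER wraps back to 0.
    """
    writes = []
    for wmar in range(num_locations):
        for wer in (0, 1):
            module = first_module + wer
            # Assumed labelling: row index = location, column = module - 1.
            coefficient = f"LL{wmar},{module - 1}"
            writes.append((coefficient, module, wmar))
    return writes


if __name__ == "__main__":
    for coefficient, module, wmar in second_scan_writes(1, 2):
        print(f"{coefficient} -> module {module}, location {wmar}")
```

For the first pair of modules this gives LL0,0, LL0,1, LL1,0, LL1,1 in that order, as in the description above.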
Similarly, the third scan method requires writing into 3 modules at a time. In 







Figure 5.2.3 Incorporation of register XR 
5.2.4 RAM architecture using banks 
The decoder labeled dcodms in the RAM architecture shown in Figure 5.2.2 is a very 
large decoder. Such a large decoder can slow down the LL-RAM's operations and 
degrade its performance in terms of speed and power. Therefore, it is necessary to 
reduce the size of the decoder to a practical level. Furthermore, each of the signals labeled Y0, 
Y1, and Y2 is shown in Figure 5.2.2 connected to drive the read/write signal labeled 
R/W of several modules. Driving this large capacitive load can also 
negatively affect the performance of the RAM. For these reasons, the bank method is 
introduced in Fig. 5.2.4 (a) to alleviate these problems. 
Figure 5.2.4 (a) shows a bank structure with 8 modules. The bank can contain any 
number of $2^b$ modules, where b = 1, 2, ..., m-2. Read and write operations in the bank 
can be performed in the same way as described for Figure 5.2.2. Figure 5.2.4 (b) 
shows the block diagram of the bank. This block diagram is used in building the RAM 
architecture shown in Figure 5.2.5. This architecture can be thought of as formed by 
dividing the architecture in Figure 5.2.2, which can be considered as one big bank 
holding $2^{m-1}$ modules, into several smaller independent banks each holding $2^b$ 
modules. Inside the smaller banks, reads and writes are performed as in the big bank 
but faster and more efficiently. 
The architecture performs read or write operations bank-by-bank and in order, b1 
first, b2 second, followed by b3 and so on. In the architecture shown in Figure 5.2.2, 
the decoder labeled dcodms is used for selecting modules, whereas the decoder 
labeled dcodbs in Figure 5.2.5 is used for selecting banks. When a read operation takes 
place, the register labeled RBSR (read bank select register) controls the decoder, but 
when a write operation takes place, the register labeled WBSR (write bank select 
register) controls the decoder. Both registers are (m-4)-bit counters with control 
signals clr (clear) and inc (increment). 
zbw: all modules in a bank are written 
zbr: all modules in a bank are read 
Figure 5.2.5 RAM architecture using bank 
Each of the decoders attached to the banks labeled b1, b2, etc. in the RAM 
architecture shown in Figure 5.2.5 is responsible for selecting modules when its 
bank is enabled by decoder dcodbs. When the architecture performs read operations in 
bank b1, for example, the register labeled RMSR (read module select register) controls 
the decoder output through mux. When it performs write operations, the register 
labeled WMSR (write module select register) controls the output of the enabled decoder. 
Registers RMSR and WMSR are both 2-bit counters that count from 0 to 3 and 
repeat. When all modules in a bank are read or written, the signal labeled zbr or zbw, 
respectively, will be asserted high, indicating that the next bank can now be enabled 
by dcodbs. To see how effective the bank method is in reducing the decoder size, 
consider the following. Suppose $M = 2^m$ is the largest image width that can be 
processed by the DWT unit. Then, the maximum number of modules in the RAM will 
be $2^{m-1}$, with decoder size $(m-2) \times 2^{m-2}$. Now, if each bank is structured to 
contain $2^b$ modules, then 
$2^{m-1} / 2^b = 2^{m-b-1}$ (5.1) 
represents the number of banks and the number of outputs of decoder dcodbs, whereas 
$2^{m-2} / 2^{m-b-1} = 2^{b-1}$ (5.2) 
gives the reduction in the decoder size. Thus, if b = 3, the decoder size decreases by a 
factor of 4. 
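The effect of banking on the decoder can be checked numerically. The sketch below is an assumption-laden illustration: it takes the number of banks to be the number of modules, 2^(m-1), divided by the modules per bank, 2^b, and it takes the original module-select decoder to have 2^(m-2) outputs. Both are inferred from the surrounding text, and the function name `bank_decoder_sizes` is hypothetical.

```python
def bank_decoder_sizes(m, b):
    """Return (number_of_banks, reduction_factor) for image width 2**m.

    Assumptions: the RAM holds 2**(m-1) modules, the unbanked decoder
    dcodms has 2**(m-2) outputs, and each bank holds 2**b modules.
    """
    modules = 2 ** (m - 1)
    dcodms_outputs = 2 ** (m - 2)
    banks = modules // 2 ** b           # also the number of dcodbs outputs
    reduction = dcodms_outputs // banks
    return banks, reduction


if __name__ == "__main__":
    banks, reduction = bank_decoder_sizes(m=10, b=3)
    print(f"banks = {banks}, decoder outputs reduced by a factor of {reduction}")
```

For b = 3 the reduction factor evaluates to 4 regardless of m, which agrees with the figure quoted in the text, and the number of banks, 2^(m-4), is consistent with the (m-4)-bit bank select counters.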
5.3 Subband memory architecture development 
The basic architecture of the subband memory is shown in Figure 5.3.1. The 
architecture is developed with two objectives in mind: write 
operations by the DWT unit and read operations by the compression unit, which are 
somewhat complex operations, should both be performed effectively. 
The strategy adopted for managing the subband memory architecture for an NxM 
image is as follows. The subbands of the first decomposition, HL1, HH1, 
and LH1, are stored in the memory blocks labeled HL1, HH1, and LH1, respectively. 
Then, the compression unit is informed to read these memory blocks. The 
compression unit can read each subband memory block code-block by code-block for 
EBCOT (Embedded Block Coding with Optimized Truncation) coding as required by 
the JPEG2000 standard [7]. The compression unit applies the compression algorithm to 
each code-block independently. The compression unit first reads the contents of HL1, 
then HH1, and last LH1, while the LL1 subband coefficients, which are stored in the 
RAM, are scanned by the RPs for further decomposition. 
Subbands of the second decomposition, HL2, HH2, and LH2, are stored in the 
subband memory blocks labeled HL2, HH2, and LH2, whereas subbands of the 
third decomposition are stored in the subband memory blocks labeled HL3, HH3, and 
LH3, and so on. However, subbands of the last decomposition are stored in the 
subband memory blocks labeled HLjmax, HHjmax, LHjmax, and LLjmax. 
When the LL1 subband has been decomposed into the required number of decomposition 
levels, the compression unit is informed again. Thus, the compression unit is informed 
twice during the whole decomposition process: first, when the subbands of the first 
decomposition are available in subband memory blocks HL1, HH1, and LH1, and 
second, when all subsequent decompositions of the LL1 subband are completed and 
stored in their respective subband memory blocks. 
5.3.1 The bank structure used in forming subband memory 
In Figure 5.3.1, each block of the subband memory labeled HL1, HH1, etc. is a 
2-dimensional memory block of size $2^{n-j} \times 2^{m-j}$, where j = 1, 2, 3, ..., jmax and jmax is the 
maximum number of decomposition levels allowed. Two methods of forming a bank 
containing modules are shown in Figures 5.3.2 and 5.3.3. The first bank, shown in Figure 
5.3.2, contains $2^b$ modules. When signal EM is asserted high, it enables the bank for 
both read and write operations. Which module to read or write is 
determined by the decoder, and the address lines are used to address each location in 
the selected module, from location zero to location $2^{n-j}-1$. The block diagram 
of the bank is shown in Figure 5.3.2 (b). 
[Blocks HL1, HH1, LH1: $2^{n-1} \times 2^{m-1}$; HL2, HH2, LH2: $2^{n-2} \times 2^{m-2}$; ...; blocks of level jmax: $2^{n-j_{max}} \times 2^{m-j_{max}}$] 
Figure 5.3.1 Subband memory architecture 
The second bank and its block diagram are shown in Figure 5.3.3 (a) and (b), 
respectively. It consists of two small banks, the upper and the lower banks, which 
together form a larger bank. The second bank method reduces the decoder size by half 
compared with the first bank method, and allows more modules to be packed into a 
bank. The number of modules in the larger bank is $2^b$, while the lower and upper 
banks each contain $2^{b-1}$ modules, as indicated in Figure 5.3.3 (a). Reads or writes 
into the bank take place module-by-module. Modules in the upper bank are read (or 
written) first, followed by the lower bank modules. When signal E is enabled, the 
upper bank is selected by asserting the signal EUB (enable upper bank), whereas the 
lower bank is selected by asserting the signal ELB (enable lower bank). Modules in 
the upper or lower banks are selected by the decoder. Modules are selected in the 
Figure 5.3.4 (a) Subband memory block architecture formed using the block 
diagram of the second bank (b) its block diagram 
Using the block diagram of the second bank, the subband memory block 
architecture shown in Figure 5.3.4 (a) is formed. The architecture consists of $2^{m-b-j}$ 
banks; each bank contains $2^b$ memory modules and each module contains $2^{n-j}$ 
locations. The decoder labeled dcodbs in Figure 5.3.4 (a) selects one bank at a time for 
reads or writes. Banks are selected in order, first b1, then b2, and so on. The 
modules inside a selected bank are enabled one at a time through the lines labeled MS 
(module select). The line labeled UB/LB enables the upper bank when asserted low 
and the lower bank when asserted high. Reads or writes occur when signal E of 
decoder dcodbs is asserted high. 
The block diagram of the architecture is shown in Figure 5.3.4 (b). This block is 
then used to form the subband memory architecture shown in Figure 5.3.1. That 
is, each block in Figure 5.3.1 is replaced by the block diagram shown in Figure 
5.3.4 (b). 
Suppose, for instance, that the largest image size that can be processed is $N = M = 2^{10}$, b is 
3, and the maximum number of decomposition levels, jmax, is 7. This implies 
that the subband memory blocks labeled HL1, HH1, and LH1, for j = 1, should each be 
designed to contain 64 banks, and each memory module in a bank should contain $2^9$ 
locations. The blocks of the second level, labeled HL2, HH2, and LH2, for j = 2, should each 
contain 32 banks, and each module in a bank should contain $2^8$ memory 
locations. Similarly, the sizes of the subband memory blocks for the third, fourth, and so 
on up to the jmax-th level can be determined. Note that the blocks of the last level, labeled 
HLjmax, HHjmax, LHjmax, and LLjmax, for j = jmax = 7, must each be designed with one bank, 
with each module in the bank having $2^3$ memory locations. That is, each block should 
be 8x8. 
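The block dimensions in this example can be reproduced with a short calculation. The sketch assumes a block at level j has 2^(m-b-j) banks of 2^b modules with 2^(n-j) locations each, which is the relationship the worked numbers above satisfy; the function name `subband_block_shape` is hypothetical.

```python
def subband_block_shape(n, m, b, j):
    """Return (banks, modules_per_bank, locations_per_module) at level j.

    Assumes the largest image is 2**n rows by 2**m columns and each
    bank packs 2**b modules.
    """
    banks = 2 ** (m - b - j)
    modules_per_bank = 2 ** b
    locations = 2 ** (n - j)
    return banks, modules_per_bank, locations


if __name__ == "__main__":
    for level in (1, 2, 7):
        print(level, subband_block_shape(n=10, m=10, b=3, j=level))
```

At j = 7 the sketch returns one bank of 8 modules with 8 locations each, that is, an 8x8 block.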
5.3.2 Details of the subband memory architecture 
The details of the subband memory architecture and its interconnections are shown in 
Figures 5.3.5 and 5.3.6. These two figures together give the complete architecture of 
the subband memory. The architecture is designed to allow the DWT unit to write into 
the subband memory and the compression unit to read it. 
The two registers labeled MAR1 and MAR2 in Figure 5.3.5 supply addresses 
to the modules that are selected for reads or writes. MAR1, which is an (n-1)-bit counter, 
provides addresses to modules of the first-level memory blocks labeled HL1, HH1, 
and LH1, whereas MAR2, an (n-2)-bit counter, provides addresses to all memory 
blocks that lie below the first level. Note that in Figure 5.3.6, the 3 signals labeled 
BS, UB/LB, and MS are grouped together and are connected to the output of the 
register labeled SMSR (subband module select register), where BS and MS occupy the 
most and the least significant bit positions, respectively. Grouping these 3 signals 
in this way allows banks and modules within a bank to be accessed successively. 
These signals can be generated by register SMSR, which is a simple counter. This 
register drives these signals and determines their values by simply counting 
from 0 to $2^{m-j}$, where $2^{m-j}$ represents the number of modules to be written (or read) in each 
subband memory block. When a block of the subband memory is enabled for reads 
(or writes), the value in SMSR gives the bank number and the module number 
selected in the upper or lower bank. SMSR1 is an (m-1)-bit register and is used along 
with MAR1 to address only subband memory blocks of the first level, whereas 
SMSR2, which is an (m-2)-bit register, is used along with MAR2 to address all 
subband memory blocks that lie after the first level. 
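How a single counter value selects a bank, a half-bank, and a module can be illustrated by splitting the SMSR word into fields. The field widths used here (b-1 bits for MS, one bit for UB/LB, the remaining upper bits for BS) are an assumption based on each half-bank holding 2^(b-1) modules; the function name `decode_smsr` is hypothetical.

```python
def decode_smsr(value, b):
    """Split an SMSR count into (BS, UB/LB, MS).

    Assumed layout, most to least significant: BS | UB/LB | MS, where
    MS is b-1 bits wide and UB/LB is one bit (0 = upper, 1 = lower).
    """
    ms_bits = b - 1
    ms = value & ((1 << ms_bits) - 1)      # module within the half-bank
    ub_lb = (value >> ms_bits) & 1         # 0 selects upper, 1 selects lower
    bs = value >> b                        # bank number
    return bs, ub_lb, ms


if __name__ == "__main__":
    for count in range(10):
        bs, ub_lb, ms = decode_smsr(count, b=3)
        half = "lower" if ub_lb else "upper"
        print(f"SMSR={count}: bank {bs}, {half} bank, module {ms}")
```

Counting upward then visits the upper-bank modules first, then the lower-bank modules, and then moves to the next bank, which is the access order described for the second bank structure.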
Figures 5.3.5 and 5.3.6 also show two groups of registers labeled A and B. These 
registers make it possible to control the storing of output coefficients in the subband 
memory by either single or parallel pipelined 2-D DWT architectures. Single 
pipelined architectures generate two output coefficients each clock cycle, with reference to 
the processor's clock. The two output coefficients might belong to either subbands 
LH and LL or subbands HL and HH. In the first case, one coefficient (the high 
coefficient) is stored in the subband memory block LH using group B registers, while 
the other coefficient (the low coefficient) is passed to the LL-RAM, where it is stored. In the 
second case, the low and high coefficients are stored simultaneously in the subband 
memory blocks HL and HH, respectively, using group A registers. On the other hand, 
the parallel architectures generate 4 output coefficients every clock cycle that belong 
to subbands HL, HH, LH, and LL. The 3 coefficients of subbands HL, HH, and LH 
are stored in the subband memory blocks HL, HH, and LH using both groups A and B 
registers, while the coefficient of subband LL is passed to the LL-RAM. 
Figure 5.3.5 Architecture of the subband memory 
Figure 5.3.6 Architecture of the subband memory 
Suppose now that the DWT unit is requested to process, for example, a 256x200 
image and to decompose it into 5 levels of decomposition. The first decomposition 
will generate 4 subbands, each of size 128x100. The 3 subbands HL1, HH1, and LH1 
will be written into the subband memory blocks HL1, HH1, and LH1. That is, in each of 
subband memory blocks HL1, HH1, and LH1, 100 modules will be written and each 
module's addresses range from 0 to 127. SMSR1 selects a bank and a module in the 
bank to be written, while MAR1 generates addresses for accessing locations in the 
selected module. 
The second decomposition also generates 4 subband images, each of size 64x50. 
The 3 subbands HL2, HH2, and LH2 will be written into the subband memory blocks 
HL2, HH2, and LH2. In each subband memory block, SMSR2 is used for selecting a 
bank and a module in the bank, and MAR2 is used for generating addresses for 
accessing each location in the module. 
The third decomposition generates 4 subbands HL3, HH3, LH3, and LL3 each of 
size 32x25. The first 3 subbands are stored in the subband memory blocks HL3, HH3, 
and LH3, respectively. 
The fourth decomposition generates 4 subbands HL4, HH4, LH4, and LL4. 
Subbands HL4 and HH4 are each of size 16x12, while subbands LH4 and LL4 are each 
of size 16x13. The first 3 subbands are stored in the subband memory blocks HL4, 
HH4, and LH4, respectively. 
The fifth decomposition, which is the last decomposition, generates 4 subbands 
HL5, HH5, LH5, and LL5. Subbands HL5 and HH5 are each of size 8x6, while 
subbands LH5 and LL5 are each of size 8x7. These 4 subbands are stored in the 
subband memory blocks HL5, HH5, LH5, and LLjmax, respectively. Note that the LL 
subband of the last decomposition should always be stored in the subband memory 
block labeled LLjmax. 
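The subband sizes quoted in this example follow from repeatedly halving the LL subband, with the low-pass half taking the extra sample when a dimension is odd. The sketch below reproduces them; the function name `subband_sizes` is hypothetical, sizes are given as (N, M) pairs as in the text, and the rule applied to an odd N is an assumption, since N stays even throughout this example.

```python
def subband_sizes(height, width, levels):
    """Return a list of dicts with the (N, M) size of each subband per level.

    Assumption: along a dimension of odd length the low-pass output has
    ceil(len/2) samples and the high-pass output floor(len/2). The first
    letter of a subband name refers to filtering along M (the width).
    """
    sizes = []
    for level in range(1, levels + 1):
        h_low, h_high = (height + 1) // 2, height // 2
        w_low, w_high = (width + 1) // 2, width // 2
        sizes.append({
            "level": level,
            "HL": (h_low, w_high),
            "HH": (h_high, w_high),
            "LH": (h_high, w_low),
            "LL": (h_low, w_low),
        })
        height, width = h_low, w_low  # the next level decomposes LL
    return sizes


if __name__ == "__main__":
    for entry in subband_sizes(256, 200, 5):
        print(entry)
```

The number of modules to write in a block is then the second entry of each pair and the number of locations per module is the first, for example 100 modules of 128 locations at the first level.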
The decoder labeled dcodw, along with the register labeled WDER and the FFs 
labeled Fbwe, Fw1, Fw2, and FLLwe, is used for enabling the subband memory for 
writes, whereas the decoder labeled dcodr, along with the two registers labeled RDER 
and RBER and the FFs labeled FR1, FR2, and FLLre, is used by the compression unit 
for enabling the subband memory for reads. 
The two registers labeled WDER (write decomposition register) and RDER (read 
decomposition register) are both counters that count from 0 to j-1. These registers are 
initially designed to count from 0 to jmax-1, where jmax is the maximum number of 
decompositions allowed. In a decomposition process, the required number of 
decompositions, j, should be provided by loading j into a register. Moreover, 
the order of writing into the subband memory blocks is controlled by WDER, 
whereas the order in which the compression unit reads them is controlled by the two 
registers labeled RDER and RBER. 
To write the subband coefficients of the first-level decomposition into the subband 
memory, the DWT unit initially clears registers SMSR1, MAR1, and WDER and the 
flip-flop (FF) labeled FLLre to zero and sets the FFs labeled Fw1 and Fbwe to 1. Fbwe 
enables the decoder dcodw and, since WDER is 0, the first output of the decoder, 
labeled 0, is activated. Activation of this output signal enables subband memory 
blocks HL1, HH1, and LH1 for write. The value in register SMSR1 determines the 
bank number and the module number to be written in each enabled subband memory 
block, while register MAR1 is used for addressing each location in the 3 selected 
modules. When all locations of the 3 modules are written, register SMSR1 is 
incremented by one to select the next 3 modules, one from each enabled block. This 
process is repeated until all modules in the 3 enabled subband memory blocks are 
written. The DWT unit then resets FF Fw1 to 0 and informs the compression unit, say, 
by asserting a FF high. The compression unit responds by reading the contents of the 
subband memory blocks HL1, HH1, and LH1, and compresses them independently. 
Meanwhile, the DWT unit moves to the second level in the subband memory by 
incrementing register WDER and setting Fw2 to 1. This allows the DWT unit to write 
the subband coefficients of the second decomposition into the subband memory. 
Incrementing register WDER by one activates the second output of the decoder 
labeled dcodw. This output enables subband memory blocks labeled HL2, HH2, and 
LH2 for write. In addition, registers SMSR2 and MAR2 are reset to zero. Resetting 
SMSR2 to zero selects the first bank in each of the 3 enabled blocks and enables the 
first module in each selected bank for write. Register MAR2 is used for addressing 
each location in the 3 enabled modules. The process of writing into these modules 
proceeds as in the first level. When all modules in the 3 enabled subband memory 
blocks are written, the third level in the subband memory is enabled by incrementing 
WDER by one. This activates the third output of the decoder, which enables blocks 
HL3, HH3, and LH3 for write. This process continues until the last decomposition 
level is reached. When all subband coefficients of the last decomposition are written, 
the DWT unit informs the compression unit again. It also resets Fbwe and 
Fw2 to zero to disable the subband memory for writes until it is read by the compression unit. 
On the other hand, reading of the subband memory by the compression unit proceeds as 
follows. As soon as the compression unit receives the first signal from the DWT unit, 
confirming that the first-level decomposition is completed and its subband 
coefficients are available in the subband memory blocks HL1, HH1, and LH1, the 
compression unit clears registers RDER, RBER, SMSR1, and MAR1 to zero and sets FF 
FR1 to 1. Resetting RDER and RBER to zero enables the subband memory block labeled 
HL1 for read, while resetting SMSR1 selects the first bank in block HL1 and enables 
the first module in the bank. Then MAR1 is used for addressing each location in the 
module for read. The next module is enabled by incrementing SMSR1 by one. The 
compression unit continues in this fashion until all HL1 modules are read. Then RBER 
is incremented by one to enable HH1 for read, and SMSR1 and MAR1 are reset to zero to 
select the first bank and enable the first module in the bank. Then, reading of block 
HH1 proceeds as for HL1. 
To enable block LH1, the compression unit increments RBER by one again and 
resets SMSR1 and MAR1 to zero. When all modules in LH1 are read and the second 
signal from the DWT unit is received, confirming that all subband coefficients, starting 
from the second-level decomposition, are available in their respective subband 
memory blocks, register RDER is incremented by one to enable the second decoder 
(dcod2) and RBER is reset to zero to activate the first output of the decoder. In addition, 
FR1 is reset to zero and FR2 is set to 1. Activation of the first output of the second decoder 
enables block HL2 for read. Then the compression unit uses registers SMSR2 and MAR2 
to read block HL2 module-by-module as described for the first level. After HL2 is 
read, HH2 is enabled for read, then LH2. The compression unit reads the subband memory 
level-by-level; each level is read block-by-block, each block is read bank-by-bank, 
and each bank is read module-by-module until it reaches the last subband 
memory block, labeled LLjmax. To read block LLjmax, the compression unit sets FLLre to 1 
to enable this block for read and then uses registers SMSR2 and MAR2 to read its 
contents. 
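The order in which the compression unit visits the subband memory blocks can be summarized in a short sketch. It assumes RDER equals the level index minus one and that RBER takes the values 0, 1, 2 for HL, HH, LH, as the description suggests; the function name `compression_read_order` is hypothetical, and the register values used while the final LL block is read are not stated in the text, so they are left as None.

```python
def compression_read_order(levels):
    """Return (block, RDER, RBER, FLLre) tuples in read order.

    Assumed mapping: RDER = level - 1 selects the level, RBER = 0, 1, 2
    selects HL, HH, LH within it. The last LL block is enabled by
    setting FLLre; RDER/RBER are not specified for it, hence None.
    """
    order = []
    for level in range(1, levels + 1):
        for rber, name in enumerate(("HL", "HH", "LH")):
            order.append((f"{name}{level}", level - 1, rber, 0))
    order.append((f"LL{levels}", None, None, 1))
    return order


if __name__ == "__main__":
    for entry in compression_read_order(3):
        print(entry)
```

Within each block the banks and modules would then be stepped through with SMSR and the locations with MAR, as in the write case.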
5.3.3 Subband memory architecture for higher scan methods 
With the first scan method, writing into each subband memory block takes place 
module-by-module. That is, only one module in each block will be enabled for write at a 
time. The second and the third scan methods require writing into 2 and 3 modules at a 
time in each block, respectively. In general, the ith scan method requires writing into i 
modules in each subband memory block. 
To see how this can take place, consider, for example, the dataflow for the 2-parallel 
intermediate architecture shown in Table B.12. The dataflow table shows that 
the architecture yields 4 output coefficients every clock cycle, with reference to 
clock fi/2. The 3 output coefficients labeled HH0,0, HL0,0, and LH0,0 in Table B.12 
should be stored in the first location of the first module in each of subband memory 
blocks HH1, HL1, and LH1, respectively. The second output coefficients HH0,1, 
HL0,1, and LH0,1 should be stored in the first location of the second module in each of 
subband memory blocks HH1, HL1, and LH1, respectively. The third output 
coefficients HH0,2, HL0,2, and LH0,2 should be stored in the first location of the 
third module in each of subband memory blocks HH1, HL1, and LH1, respectively. The 
fourth output coefficients HH1,0, HL1,0, and LH1,0 should be stored in the second 
location of the first module in each of subband memory blocks HH1, HL1, and LH1, 
respectively. 
It is obvious that, after the third output coefficients are stored, the process of storing 
coefficients returns to the first module in each block and repeats until the 
first 3 modules in each of subband memory blocks HH1, HL1, and LH1 are written. 
Similarly, the next 3 modules in each of subband memory blocks HH1, HL1, and LH1 
are written and so on. When all modules in the subband memory blocks HH1, HL1, 
and LH1 are written, the process moves to the second level of the subband memory, 
blocks HH2, HL2, and LH2, to store the subband coefficients of the second 
decomposition level. However, in order for the control to move effectively between the 3 
modules, the first module number ought to be remembered by the control. For this 
reason, register XR is added and is connected to register SMSR as shown in Figure 
5.3.7. 
Initially, registers SMSR, MAR, and XR are reset to 0. When SMSR is reset, BS 
enables the first bank in each of subband memory blocks HH1, HL1, and LH1, while 
UB and MS enable the upper bank and the first module in each bank, respectively. 
This allows the first 3 output coefficients HH0,0, HL0,0, and LH0,0 to be 
stored in the first location of 
each module in blocks HH1, HL1, and LH1, respectively, addressed by MAR. Then 
register SMSR is incremented by one to enable the second module in each of subband 
memory blocks HH1, HL1, and LH1. This allows the second output coefficients 
HH0,1, HL0,1, and LH0,1 to be stored in the first location of the second module in 
each of subband memory blocks HH1, HL1, and LH1, respectively. To store the third 
output coefficients HH0,2, HL0,2, and LH0,2 in the first location of the third module 
in each block, register SMSR is again incremented by one. 
Figure 5.3.7 Incorporation of register XR 
Since the fourth output coefficients HH1,0, HL1,0, and LH1,0 should be stored in
the second location of the first module in each of the subband memory blocks HH1, HL1,
and LH1, respectively, register XR, which is 0, is loaded into SMSR while MAR is
incremented by one to address the second location in each module. This process is
repeated until the first 3 modules in each block are written. At that point, where run 2
begins, SMSR will be 2, indicating that the third module is the last module written in
each block. To enable the fourth module in each block, register SMSR is incremented
by one and the result is loaded into XR so that this module number can be
remembered, while MAR is reset to 0 to address the first location in each module. This
allows the first 3 output coefficients of run 2 to be stored in the first location of
each module enabled in the subband memory blocks HH1, HL1, and LH1. Then,
register SMSR is incremented by one to enable the fifth module in each block. When
the first location of that module is written, register SMSR is incremented again by
one to enable the sixth module in each block. When the first location of that module
is written, register MAR is incremented by 1 and register XR is loaded into SMSR to
enable the fourth module again in each block, and the process repeats. When all
modules in the first level are written, the subband memory blocks HH2, HL2, and
LH2, in the second level, are enabled and writing into these blocks proceeds as in the
first level.
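
The interplay of registers SMSR, MAR, and XR described above can be illustrated with a short behavioural sketch in Python. It is not the hardware of Figure 5.3.7, only a model of the write order for one subband memory block. The function name write_sequence, its parameters, and the counter remaining are hypothetical, and the handling of a last run with fewer than 3 modules is an assumption.

```python
def write_sequence(num_modules, num_locations, group=3):
    """Yield (module, location) pairs in the order one subband block is written."""
    smsr = mar = xr = 0              # SMSR, MAR and XR are reset to 0
    remaining = num_modules          # hypothetical counter of unwritten modules
    while remaining > 0:
        active = min(group, remaining)   # assumption: a shorter last run is allowed
        for loc in range(num_locations):
            for k in range(active):
                yield (smsr, mar)        # store one coefficient
                if k < active - 1:
                    smsr += 1            # enable the next module of the group
            if loc < num_locations - 1:
                smsr = xr                # XR -> SMSR: return to the first module
                mar += 1                 # address the next location
        smsr += 1                        # enable the first module of the next run
        xr = smsr                        # remember that module number in XR
        mar = 0                          # address the first location again
        remaining -= active
```

For example, list(write_sequence(6, 2)) visits modules 0, 1, 2 at location 0, then modules 0, 1, 2 at location 1, and only then moves to modules 3, 4, 5, which matches the order of run 1 and run 2 in the text.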
A flowchart, which describes the control algorithm that can be used to control
subband memory write operations, is shown in Figure 5.3.8. In the flowchart, the
following 3 registers are used. Register RN3 holds the number of locations to be written
in a module. Register RM3 holds the number of modules to be written in a subband
memory block, while RS holds the scan method number. Thus, if the DWT architecture is
based on the third scan method, for example, 3 is loaded into RS to indicate the number of
modules that will be considered for write in each subband memory block at a time.
Flast is a FF which, when set to 1, indicates the last run.
The flowchart remains in state S0 as long as the status input signal wsub is low.
When wsub is asserted high, the process of storing the subbands of the first
decomposition level begins. As the flowchart moves from state S0 to S1, it resets
registers SMSR, MAR, WDER, XR, and FF Flast to 0, sets FFs FW1 and Fbwe to 1, and loads i
into RS, while the number of modules and the number of locations are loaded into RM3 and
RN3, respectively. In state S1, register RS is examined. As long as it is not 1, the loop
consisting of states S1 and S2 is executed, during which write operations take place in
the modules enabled in each of the subband memory blocks HH1, HL1, and LH1. When RS
becomes 1, register RN3 is examined. If RN3 is not equal to 1, the control moves to state
S3. As the control moves from state S3 to S1, register MAR is incremented and
register RN3 is decremented, while register XR and i are loaded into SMSR and RS,
respectively. If RN3 is 1, the last location has been reached and the flowchart
moves to state S4. As it moves from state S1 to S4, it loads SMSR into XR and
subtracts i from RM3 to reflect the number of modules that remain to be written in the
subband memory blocks that are under consideration.

Figure 5.3.8 Flowchart for subband memory write control algorithm

In state S4, a signal is issued to reset MAR to 0, to increment SMSR and XR,
and to load RN3 with the number of locations, while register RM3 is examined. If RM3 >
i, the flowchart moves to state S1 to consider the next i modules in each subband
memory block for write. But if RM3 ≤ i, then the last run is reached and RM3
contains the number of modules that remain in each subband memory block, which
will be considered for write in the last run. The number of modules considered
in the last run will be i, i-1, i-2, ..., or 1, depending on the image width M. For example,
if the architecture is based on the third scan method, then the number of modules
considered in the last run will be either 3, 2, or 1. In addition, if RM3 ≤ i, the
status of the next input, Flast, is examined. If Flast is 0, the control moves to state S1 to
begin storing the output coefficients that will be generated in the last run and, as it
moves to state S1, it sets Flast to 1 and loads RM3 into RS. When Flast is 1, the flowchart
returns to its initial state S0 and remains in that state until activated for the second
level decomposition. The algorithm given in Figure 5.3.8 is general and is intended to
illustrate in a broad sense how subband memory is written. However, the algorithm
can be modified to fit any specific architecture requirements.
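
How RM3 determines the size of each run, including the last one, can be sketched as follows. This is a simplified model of the RM3 > i test only, not of the complete flowchart. The function name run_sizes is hypothetical, and the sketch assumes that at least one module has to be written.

```python
def run_sizes(num_modules, i=3):
    """Return the number of modules written in each run of one decomposition level."""
    if num_modules < 1 or i < 1:
        raise ValueError("num_modules and i must be positive")
    rm3 = num_modules            # RM3: modules still to be written
    sizes = []
    while rm3 > i:               # RM3 > i: another full run of i modules follows
        sizes.append(i)
        rm3 -= i                 # i is subtracted from RM3 after each full run
    sizes.append(rm3)            # RM3 <= i: last run writes the remaining modules
    return sizes
```

With i = 3, run_sizes(8) gives runs of 3, 3, and 2 modules, while run_sizes(9) gives three full runs, so the last run writes 3, 2, or 1 modules as stated in the text.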
5.4 Control Design for 4-parallel Architecture
In this section, to demonstrate that the controls for the architectures developed in
chapters 4 and 5 are simple to design, the control algorithms for the 4-parallel
architecture shown in Figure 4.2.7, including the LL-RAM and subband memory
architectures, will be developed. The control unit is responsible for issuing proper control
signals, in response to a clock pulse, to the components of the architecture where data
processing takes place.
Figure 5.4.1 (a) shows the interconnection between the subband memory of Figure
5.3.1 and the 4-parallel pipelined architecture shown in Figure 4.2.7. The
interconnection between the two entities is accomplished through four multiplexers,
labeled mux. Furthermore, since CP1 and CP3, and CP2 and CP4, load four new
coefficients into their output latches each time clock f4a makes a positive and a
negative transition, respectively, the clock f4a is connected to the control
input of the four multiplexers. When f4a is high, the four multiplexers
pass the four output coefficients generated by CP1 and CP3 to the subband memory and
LL-RAM for storage. Otherwise, the four multiplexers pass the output
coefficients of CP2 and CP4 to the subband memory and LL-RAM for storage, as
illustrated in Figure 5.4.1 (b).

Figure 5.4.1 (a) Subband memory interconnections to 4-parallel (b) Control input

Figure 5.4.2 DWT Control Unit

In Figure 5.4.2, which represents the overall DWT control unit, four control units
have been identified and labeled main control unit, processors control unit, read RAM
control unit, and write RAM/subband memory control unit. The main control unit
consists of 3 units: A-unit, B-unit, and C-unit.
In the following, a description of each control unit function will be given along 
with its algorithmic state machine (ASM). The ASM is a special flowchart, which 
precisely specifies the control algorithm that can be used for deriving the hardware of 
the control. 
5.4.1 Main Control Unit 
a) C-unit 
This unit basically consists of various registers, as shown in Figure 5.4.3. The
function of these registers is to generate control signals, which are used by all other
control units as input control signals. At the start of a decomposition process, the
height (N) and the width (M) of an image, along with the desired number of
decomposition levels (J), must be loaded into registers RN0, RM0, and RD,
respectively. The loading of these registers should be handled by an entity other than
the DWT unit, for example, a microprocessor. The DWT unit is then activated by asserting
the start signal of A-unit.
Figure 5.4.3 C-unit
Legend of Figure 5.4.3:
RN1: holds number of locations to be read from each module in a run
RN2: holds number of operations in a column
RN3: holds number of locations to be written in a module
RM1: holds number of runs (each run activates 3 modules)
RM2: holds number of columns to be scheduled for CPs
RM3: holds number of modules to be written in the RAM and in subband memory
RD: holds number of decomposition levels desired
lossy: is a FF; if zero, performs 5/3, otherwise 9/7
Tr: transition
z1: end of a run
zwc: all locations in enabled modules are written
zm: last module is written
zlc: the last operation in the last column is reached
Lr: last run in a decomposition is reached
EP1: end of decomposition process
EP2: last decomposition level

The signals labeled EN and EM in Figure 5.4.3 are examined by the control units
to determine whether N and M are even or odd. In section 4.2.3, two cases were
identified regarding the storage of high coefficients. In the first case, if the two least
significant bits of N are either 00 or 11, then the high coefficients should be stored in
the TLBs of the RPs that generate them. In the second case, if the two least
significant bits of N are either 01 or 10, then the high coefficients of RP1 should be
stored in the TLB of RP3 and vice versa, while the high coefficients of RP2 should
be stored in the TLB of RP4 and vice versa. Thus, the signal labeled zs is formed to
detect occurrence of these two cases. If zs is 1, it signifies occurrence of the first case;
otherwise, the second case.
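
The two cases can be expressed compactly as an XNOR of the two least significant bits of N. The following Python sketch is only a behavioural illustration of that test and of the resulting TLB assignment. The function names zs_signal and tlb_target are hypothetical and do not appear in the architecture.

```python
def zs_signal(n):
    """Return 1 when the two least significant bits of n are 00 or 11, else 0."""
    b0 = n & 1
    b1 = (n >> 1) & 1
    return 1 if b0 == b1 else 0      # XNOR of the two bits


def tlb_target(rp, n):
    """Return the RP (1..4) whose TLB stores the high coefficients of RP number rp."""
    if rp not in (1, 2, 3, 4):
        raise ValueError("rp must be 1, 2, 3 or 4")
    if zs_signal(n) == 1:
        return rp                    # case 1: each RP uses its own TLB
    return {1: 3, 3: 1, 2: 4, 4: 2}[rp]   # case 2: RP1<->RP3 and RP2<->RP4
```

For instance, an image height N = 8 (binary 1000) gives zs = 1, so every RP writes to its own TLB, whereas N = 6 (binary 110) gives zs = 0 and the TLBs are exchanged pairwise.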
Figure 5.4.3 shows that the contents of RN0 should be transferred to both registers
RN1 and RNC. However, if RNC is odd, which can be determined by examining
EN, it is first shifted to the right (divided by 2) and then incremented by one; otherwise,
it is shifted to the right only. These operations are controlled by A-unit. The result is then
loaded into the two registers labeled RN2 and RN3. Register RN2 holds the number of
operations in a column when DWT is applied column-wise by the CPs, where each operation
requires 3 pixels or coefficients except the last operation, while register RN3 holds
the number of locations to be written in a module.
On the other hand, the contents of the register labeled RM0 are examined by the B-unit
to determine whether they are even or odd. If signal EM is 1, then RM0 is odd and it is
shifted to the right and then incremented by one; otherwise, it is only shifted to the right.
The result is then transferred to the three registers labeled RM1, RM2, and RM3.
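
The shift-and-increment rule applied to RNC and RM0 amounts to dividing by two and rounding up. A minimal Python sketch is given below. The function name halve_round_up is hypothetical, and the parity is sampled before the shift, as the EN and EM signals are assumed to be.

```python
def halve_round_up(value):
    """Shift right by one bit and add one if the original value was odd."""
    if value < 0:
        raise ValueError("value must be non-negative")
    odd = value & 1          # EN or EM: least significant bit before the shift
    result = value >> 1      # shift to the right (divide by 2)
    if odd:
        result += 1          # increment by one for an odd value
    return result
```

For an image height N = 9 the result is 5, and for N = 8 it is 4; these are the values that would be loaded into RN2 and RN3.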
Registers RN1 and RM1 are used by the read RAM control unit. Register RM1
holds the number of runs required in a level decomposition, where each run activates 3
modules for read except the last run. When the signal labeled Lr (last run) is asserted
high, it indicates that the run before the last has completed. On the other hand, register
RN1 holds the number of locations to be read from each module in a run. The signal
labeled z2, which is generated by an XNOR gate attached to RN1, is shown connected
to RM1's signal labeled dec (decrement). When register RN1 is counted down to 2,
signal z2 is asserted high, which in the next clock cycle decrements register RM1
by one to reflect the number of runs remaining. Signal z1 is similar to z2, but it is asserted
high when RN1 is counted down to 1, and it indicates that a run has completed. Then the
next run can be initiated by reloading register RN1 from RN0. Signal z5 is asserted
high when RN1 is counted down to 5. The purpose of this signal will become clear when the TLB
control unit is introduced later.
The registers labeled RM3 and RN3 are used by the write control units of both the
LL-RAM and the subband memory to control write operations in the two memories.
The function of register RM3 is to hold the number of modules to be written in the RAM and in
each subband memory block enabled for write in a level decomposition. When RM3
is counted down to zero, signal zm is asserted high to indicate that all modules for this
decomposition have been written and the next decomposition level can be initiated.
On the other hand, the function of register RN3 is to hold the number of locations to be written
in a module. When all locations in a module are written, the signal labeled zwc is
asserted high and RN3 can then be reloaded from RNC for the next module to be
written. This process is repeated until all modules in a decomposition level are
written. The occurrence of this event is signified by the assertion of signal zm.
The registers labeled RM2 and RN2 are used by the CPs control unit, which is
part of the processors control unit. Register RM2 holds the number of columns, in the L and
H decompositions, to be scheduled for the CPs. When all columns in the L and H
decompositions are scheduled, the signal labeled zlc is asserted high to indicate that
this is the last cycle, in which the coefficients of the last operation in the last column will
be transferred to the CPs' input latches. On the other hand, register RN2 holds the number of
operations in a column, where each operation requires 3 coefficients except the last
one. Each time an operation is scheduled, RN2 is decremented by one. When RN2 is
counted down to 2, signal Tr (transition) is asserted high. Assertion of signal Tr
indicates that in the clock cycle after next, the last operation in a column of L and a
column of H, before a transition is made to the next column, will be scheduled.
The final register in C-unit is the register labeled RD. Register RD holds the number
of decomposition levels (J) desired for an (N×M)-image decomposition. Each time a
decomposition level is completed, RD is decremented by one. When all J levels of
decomposition are completed, that is, when RD is counted down to zero, the signal
labeled EP1 is asserted high, signifying the end of the process. The second signal, labeled
EP2, is asserted high when RD is counted down to 1 to indicate that this is the last
decomposition.
b) A-unit
The ASM flowchart and the block diagram for A-unit are shown in Figures 5.4.4 (a)
and (b), respectively; the block diagram displays the input and output control signals. As soon as
registers RN0, RM0, and RD are loaded with N, M, and J, respectively, A-unit is
activated by asserting the start signal. As long as the start signal is low, the A-unit
remains in the initial state S0. The activation of A-unit starts the decomposition
process.

Figure 5.4.4 (a) ASM flowchart for A-unit (b) Block diagram
Legend of Figure 5.4.4: EDL: end of a decomposition level; stBU: activate B-unit; EP1: end of the decomposition process.

When the start signal is asserted high, the A-unit first initializes several registers and
flip-flops (FFs) by asserting its output signal labeled Y0 and then moves to state S1
at the clock event. In state S1, it examines signal EN to determine whether register
RNC is even or odd. If EN is 1, then RNC is odd and the ASM asserts the conditional
output signal labeled shnc. At the clock event, RNC is shifted to the right. In state S2,
RNC is incremented by one. If EN is 0, register RNC is shifted to the right only.
Register RNC now holds the number that will be loaded into registers RN2 and RN3.
In state S3, the B-unit is activated by asserting signal stBU high. In state S4, signal
EDL (end of a decomposition level) is examined. If EDL is 0, the ASM remains in
state S4 until EDL is 1. When EDL becomes 1, register RD is decremented by one and
the ASM moves to state S5. In state S5, the status input signal labeled EP1 is
examined. If EP1 is 1, the decomposition process has completed
and the control returns to its initial state S0 at the clock event. Otherwise, the control
executes the loop consisting of states S6, S7, S8, and S1. Inside the loop a new value
for RN0 is computed. This value gives the height of the LL-image to be decomposed
next.
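
The level-by-level behaviour of the A-unit can be traced with the following Python sketch. It is a behavioural model under stated assumptions, not a transcription of Figure 5.4.4: the function name level_heights is hypothetical, and the new value of RN0 computed in states S6 to S8 is assumed to equal the rounded-up half of the previous height (the value held in RNC).

```python
def level_heights(n, levels):
    """Return a list of (RN0, RNC) pairs, one pair per decomposition level."""
    if n < 1 or levels < 1:
        raise ValueError("n and levels must be positive")
    rn0, rd = n, levels              # RN0 <- N, RD <- J
    trace = []
    while True:
        rnc = (rn0 >> 1) + (rn0 & 1)  # states S1/S2: shift, then increment if odd
        trace.append((rn0, rnc))      # this level is decomposed (states S3/S4)
        rd -= 1                       # EDL = 1: RD is decremented
        if rd == 0:                   # EP1 = 1: end of the decomposition process
            return trace
        rn0 = rnc                     # assumption: height of the next LL-image
```

Under these assumptions, level_heights(17, 3) returns [(17, 9), (9, 5), (5, 3)], that is, a 17-row image is followed by LL-images of 9 and 5 rows.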
c) B-unit 
The B-unit is represented by the ASM flowchart and the block diagram shown in
Figures 5.4.5 (a) and (b), respectively. When B-unit is activated, by asserting its input
signal labeled stBU high, it immediately initializes all FFs labeled Qr, in the
processors control unit, to zero by asserting the output signal labeled initQrs and then
moves to state S1. In state S1, registers RN2 and RN1 are loaded from RNC and RN0,
respectively, while register RM0 is shifted to the right by one position.
In state S2, a decision is made based on signal EM, the least significant bit of
RM0. If EM is 1, RM0 is incremented by one; otherwise, RM0 is left unchanged.

Figure 5.4.5 (a) ASM chart for B-unit (b) Block diagram

In state S4, the FF FE is set to 1 to enable the LL-RAM for read and write. The
RAM is enabled when signal E of dcodms or dcodbs is high. In addition, the TLB
control unit and the CPs control unit are activated by asserting the input signals stTLB
and stCPC, respectively. Furthermore, while the ASM is in state S4, signal
fs is examined. If fs is 0, the scanner control unit is activated to scan the original
image pixels; otherwise, the read RAM control unit is activated to scan the LL-RAM.
In a decomposition process, the original image pixels are scanned first through an
image scanner. Thus, in the first level decomposition the scanner control unit is
activated to scan the original image pixels. Then, in all subsequent decompositions, the
read LL-RAM control unit is activated. This process is controlled by signal fs of FF
Fs. First, Fs is cleared to zero by A-unit and then examined by B-unit in state S4. The
scanner control unit sets Fs to 1 at the end of the scan so that the LL-RAM is scanned
in all subsequent decompositions. Signal fs can also be used to control the
operations of the multiplexers that would be needed in Figure 4.2.7 to select between
passing the scanner data or the LL-RAM data. If signal fs is 0, the multiplexers should pass
to the RPs the pixels scanned by the scanner; otherwise, they should pass the data
read from the LL-RAM.
5.4.2 Processors Control Unit 
The processors control unit consists of two control units, the RPs control unit and the
CPs control unit, which are in charge of issuing control signals to the RPs and CPs,
respectively. The RPs control unit generates the signals labeled zs, sre0,
sre3, sre1, sre2, and incAR for the RPs. These signals are generated by the RPs
control unit by setting or resetting each of the FFs labeled Qr0, Qr1, Qr2, and Qra
shown in Figure 5.4.2. These signals are then transferred to the first stages of the RPs
and loaded into the latches labeled CST (control signal latches). These latches then
carry the signals from stage to stage. Each time a stage is reached, the signals that are
used in that stage can be dropped from the CST and the rest are carried on until the
last stage is reached. These signals are used in both the 5/3 and 9/7 processors. For
example, signal incAR, which is used in stage 2 of the 5/3, is also used in stages 2 and 5
of the 9/7. This is also true for the other control signals. Thus, the control developed here
can be used in both the 5/3 and 9/7 architectures. Similarly, the CPs control unit generates
four extension signals labeled sce0, sce3, sce2, and sce1 by setting or resetting each of
the FFs labeled Qc5, Qc6, and Qc7 shown in Figure 5.4.2.
a) The RPs Control Unit 
The RPs control unit is further divided into two units, the TLB control unit and the 
extension control unit. 
i) The TLB Control Unit 
The TLB control unit is in charge of the read and write operations that take place in
the TLBs of the 4 RPs. The control unit generates the control signal incar (increment
address register) for both the TLBARa and TLBARb registers shown in Figure 4.2.9. Both
TLBARs are (n-2)-bit counters.
The ASM chart, which represents the control algorithm of the TLB control unit, is
shown in Figure 5.4.6. The control unit is activated when its status input signal stTLB
is asserted high by B-unit. Then, at the clock event, the ASM moves to state S1. In
state S1, FF FEXR is set to 0 and signals ETLB, sa12, and sa34 are set to 1, while a
decision is made based on the input signal labeled zs. If zs is 1, the control takes the
path labeled case1, and in every clock cycle each location of a TLB is read in the first
half cycle and written in the second half, using only TLBARa as the address register. But if
zs is 0, the control takes the path labeled case2, and read and write operations take
place according to Table B.11.
As explained in chapter 4, signal zs will be 1 when the two least significant bits of
N are either 00 or 11, which implies that the high coefficients of stage 1 will be stored
in the TLB of the RP that generates them, starting from the TLB of RP1. This
requires FF Qra, which drives signal incar of each TLBARa in the 4 RPs shown in
Figure 4.2.9, to be set to 1 a clock cycle before external memory scanning begins, as
shown in Figure 5.4.6 (a). In state S2, where scanning of the external memory begins,
the extension control unit is activated by asserting signal stEX high. When the ASM
moves to state S3, the first three pixels and the content of Qra are loaded into the three
RP1 latches and CSTa, respectively.
Figure 5.4.6 (a) ASM flowchart for TLB control unit (b) The block diagram
Legend of Figure 5.4.6: stTLB: start TLB control unit; stEX: start (activate) extension control unit; z1: end of a run; Lr: last run.

In state S3, the control examines signal z5 and continues executing the loop
consisting only of S3 as long as z5 is 0. Each time this loop is executed, three pixels
and Qra are loaded into one of the RP latches, until z5 is asserted high. Assertion of z5
allows Qra to insert zero in each of the last 4 operations that will be scheduled for the
4 RPs. The insertion of zeros occurs while the control is in state S4. These zero
values of signal incar are necessary to reset register TLBAR of each TLB to zero so
that it addresses the first location at the start of the next run. The control remains in
state S4 until z1 becomes 1. When z1 is 1, the control examines signal EN. If EN is
1, then N is odd and the external memory will not be scanned in the next cycle.
Therefore, the control sets signal ETLB to 0 to disable the TLB, so that read and write
cannot take place during the next cycle, and then moves to state S5. But if EN is 0, the
control sets FF Qra to 1, which asserts signal incar high, and then moves to state S5.
In state S5, the control sets Qra to 1 and examines FF FEXR and signal Lr (last run).
If both are 0, then the next run is initiated by executing the loop consisting of states
S3, S4, and S5. This loop is usually executed several times, and each time it is
executed, a new run is initiated, until signal Lr becomes 1. Signal Lr will be 1
only when the last run is initiated. When signal Lr becomes 1, signal lossy is examined.
If lossy is 0, the operation is the last run of the 5/3 and the control returns to its initial state S0
and remains in that state until activated. Otherwise, the operation is 9/7, which
requires an extra run, and the control sets both FFs Q0 and FEXR to 1 and moves to state S3
to initiate the last run. When the control reaches state S5 again, it examines FEXR. At
this time FEXR should be 1 and the control sets both FFs Q1 and Q0 to 0, as required by
Table B.5 (a), to initiate the extra run. Then the control moves to its initial state S0.
On the other hand, when the two least significant bits of N are either 01 or 10,
signal zs becomes 0 and the control takes the path labeled case2 to state S6. When this
path is taken, the high coefficients generated by stage 1 of each RP are stored
according to Eq. (4.3), starting from the TLB of RP3. Therefore, the setting of Qra is delayed
until state S7.
In state S6, where scanning of the external memory begins, the extension control
unit is activated by asserting signal stEX high. When the control moves to state S7,
the first 3 pixels scanned from the external memory are loaded into the RP1 latches. In
state S7, Qra is set to 1 and signal EN is examined to determine whether N is even or
odd. If EN is 1, then N is odd and the control moves to state S8, where it examines
signal z2. As long as z2 is 0, the control executes the loop consisting only of S8.
Signal z2 will be 1 when register RN1 is counted down to 2 by the read RAM control unit,
and it indicates that in the next cycle the last operation of the current run will be
scheduled for computation. When z2 becomes 1, the control examines signal sa34.
According to Table B.11, signal sa34 alternates between the values 1 and 0. Therefore,
it is used here to indicate whether the current run sequence is even or odd.
Signal sa34 will be 1 when a run sequence is odd and 0 when the sequence
is even. Thus, at the end of the first run, sa34 will be 1 and the conditional output
signal Qrab1 will be asserted high, and at the end of the second run, it will be 0 to
assert signal Qrab0 high, and so on. In both cases, Qrab0 and Qrab1 set FFs Qra,
Qrb12, and Qrb34 according to Table B.11 so that TLBARa and TLBARb of each RP
address the first location in the TLB each time a transition to a new run is made. FF
Qrb12 drives signal incar of both TLBAR1b and TLBAR2b of RP1 and RP2, whereas
FF Qrb34 drives signal incar of both TLBAR3b and TLBAR4b in RP3 and RP4 shown
in Figure 4.2.9. FF Qra drives signal incar of all TLBARa of the 4 RPs.
In states S10 and S11, signals sa12 and sa34 are also set according to Table B.11.
State S14 is parallel to state S5, which is reached when the control takes the path labeled case1. Thus,
everything said there is also true here.
On the other hand, if EN is 0, then N is even and the control moves to state S9,
where it examines signal z3. As long as z3 is 0, the control executes the loop
consisting only of S9, until z3 is 1. Signal z3 will be 1 when register RN1 is counted
down to 3, and it indicates that in the next two clock cycles the last two operations of
the current run will be scheduled and a new run can then be initiated. From this point
on, everything that has been said for the path EN = 1 is also true
for EN = 0.
ii) The Extension Control Unit 
The extension control unit controls the operation of the two extension
multiplexers found in stage 3 of the four 5/3 RPs and in stages 3 and 7 of the four
9/7 RPs, through the two signals labeled sre1 and sre2. The extension control unit
generates these two signals by setting or resetting each of the two FFs labeled Qr1 and
Qr2 in Figure 5.4.2.
The ASM chart for the extension control unit is shown in Figure 5.4.7 (a) and the
control block diagram is shown in Figure 5.4.7 (b). The TLB control unit
activates the extension control unit, by asserting its output signal stEX (start extension),
in the clock cycle where the external memory scan begins. At the clock event, the
ASM moves from state S0 to S1. In state S1, the ASM examines signal z1 and
remains in that state as long as z1 is 0. During this period, in which the first run takes
place, Qr2 and Qr1 are left unchanged (retain zero values). The reason for this is that
the first run requires the two multiplexers to pass in each clock cycle the current
high coefficient required in the calculation of the current low coefficient, and inserting
zeros by Qr2 and Qr1 during this period guarantees the proper operation of the
multiplexers. When z1 becomes 1, the control asserts its conditional output signal sre2
to set Qr2 and Q2 to 1, as required by Table B.5 (b) for run 2 of the 9/7, and examines
signal lossy. If lossy is 0, the control moves to state S3 to initiate run 2 of the 5/3;
otherwise, it moves to state S2 to initiate run 2 of the 9/7. In state S2, the control
examines signal z1 again and remains in that state until z1 becomes 1, which indicates
the end of run 2. As the control moves from state S2 to S3, it sets FF Q2 to 0, as required by
Table B.5 (b) for run 3 and all subsequent runs of the 9/7.

Figure 5.4.7 (a) ASM flowchart for Extension Control Unit (b) The block diagram.

In state S3, the first run of the 5/3 or the second run of the 9/7 ends and the
intermediate runs begin. Intermediate refers to the runs that lie between the first and
last run. During intermediate runs the two multiplexers are required to pass both the
current high coefficient and the previous high coefficient read from the TLB. Thus, for
the multiplexers to be able to accomplish this task, Qr2 is set to 1 while Qr1 is left
unchanged (zero) during the whole intermediate period. In addition, in state S3, a
decision is made based on signal EM, the least significant bit of register RM0, to
determine whether the width M of the image is even or odd. If EM is 0, then M is
even and the control returns to its initial state S0, since, as in the intermediate runs,
even M requires Qr2 and Qr1 to be set to 1 and 0, respectively, in the last run.
On the other hand, if EM is 1, then M is odd and the last run requires both
Qr2 and Qr1 to be 1. Therefore, in state S4, the ASM waits in a loop controlled by Lr
until the last run is reached. The last run is reached when Lr equals 1. Then, the ASM
sets Qr1 and Q1 to 1 and returns to the initial state S0.
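
The values taken by Qr2 and Qr1 over the runs of the 5/3 case can be summarized in a small Python sketch. It models only the 5/3 settings stated above (the 9/7 settings depend on Table B.5 and are not modelled). The function name extension_flags_53 and its arguments are hypothetical, and the sketch assumes at least two runs.

```python
def extension_flags_53(run_index, num_runs, width):
    """Return (Qr2, Qr1) for run number run_index (0-based) of the 5/3 case."""
    if num_runs < 2 or not 0 <= run_index < num_runs:
        raise ValueError("need num_runs >= 2 and 0 <= run_index < num_runs")
    if run_index == 0:
        return (0, 0)        # first run: both FFs retain zero
    last_run = run_index == num_runs - 1
    if last_run and width % 2 == 1:
        return (1, 1)        # last run with odd M: both FFs are 1
    return (1, 0)            # intermediate runs, and last run with even M
```

For an image of odd width processed in 4 runs, the sequence of (Qr2, Qr1) values is (0, 0), (1, 0), (1, 0), (1, 1); for an even width the last pair is (1, 0).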
Finally, note that the output of the XNOR gate attached to register RN0
generates the control signal zs, whereas signals sre0 and sre3 are obtained by
directly connecting the set signal of Qr0 to signal Lr, as indicated in Figure 5.4.2.
b) The CPs Control Unit
The CPs control unit is in charge of issuing the four extension signals labeled sce0,
sce3, sce2, and sce1 that control the operations of the extension multiplexers in the
four pipelined CPs. The CPs control unit generates these signals by setting each of the
FFs labeled Qc5, Qc6, and Qc7 in Figure 5.4.2 to either 1 or 0. According to Tables 3.3
and 3.4, since the CPs compute the DWT column-by-column, Qc5, which drives both signals
sce0 and sce3, should be set to 1 every time the last operation in a column is scheduled
for execution; otherwise, it remains at zero. On the other hand, the two signals sce1 and
sce2, which control the two multiplexers in stage 3 of the 5/3 and stages 3 and 7 of the
9/7 processors, should, according to Tables 3.3 and 3.4, be set as follows. Every time
the first operation in a column is scheduled, both Qc6 and Qc7 should be set to zero. All
operations between the first and last operations in a column require Qc6 and Qc7 to
be set to 1 and 0, respectively. The last operation in each column requires both Qc6 and Qc7
to be set to 1 if the column length is odd; otherwise, Qc6 and Qc7 are set to 1 and 0,
respectively.
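
The setting rules for Qc5, Qc6, and Qc7 can be collected in a short Python sketch that lists the FF values for every operation of one column. The function name column_schedule is hypothetical. The number of operations per column is taken as the column length divided by two and rounded up, as described for register RN2, and a column of at least 3 coefficients is assumed so that the first and last operations are distinct.

```python
def column_schedule(column_length):
    """Return a list of (Qc5, Qc6, Qc7) tuples, one per operation in a column."""
    if column_length < 3:
        raise ValueError("column_length must be at least 3")
    num_ops = (column_length >> 1) + (column_length & 1)   # value held in RN2
    flags = []
    for op in range(num_ops):
        last = op == num_ops - 1
        qc5 = 1 if last else 0               # Qc5 is 1 only for the last operation
        if op == 0:
            qc6, qc7 = 0, 0                  # first operation in the column
        elif last and column_length % 2 == 1:
            qc6, qc7 = 1, 1                  # last operation, odd column length
        else:
            qc6, qc7 = 1, 0                  # in-between, or last with even length
        flags.append((qc5, qc6, qc7))
    return flags
```

A column of 9 coefficients yields 5 operations ending with (1, 1, 1), while a column of 8 coefficients yields 4 operations ending with (1, 1, 0).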
The cycle number (CJ) at which the first input data are loaded into both the CP1 and
CP3 latches for both 5/3 and 9/7 is given by Eq. (4.4). For 5/3, CJ is 19, since its RPs
are pipelined into 4 stages, whereas CJ is 35 for 9/7, since its RPs are pipelined into 8
stages. In order to detect the occurrence of this event, register RC is added to the CPs
control unit, as shown in Figure 5.4.8 (b). Register RC is a down counter with control
signals set and dec (decrement). Initially, RC is set to 18 or 34 by asserting signal set
high. Register RC is then decremented by one every clock cycle, starting from the
cycle where scanning of the external memory begins. When RC becomes 0, it sets signal
zc high to indicate that the pulse ending this cycle will load the CP1 and CP3 latches with
data for the first time.
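
The relation between the initial value of RC and the cycle number CJ can be checked with a small counting sketch in Python. The function name first_load_cycle is hypothetical. Cycle 1 is assumed to be the cycle in which scanning of the external memory begins, and each decrement is assumed to take effect at the clock pulse that ends the cycle.

```python
def first_load_cycle(rc_initial):
    """Return the cycle whose ending pulse first loads the CP1 and CP3 latches."""
    if rc_initial < 0:
        raise ValueError("rc_initial must be non-negative")
    rc = rc_initial          # RC is set to its initial value
    cycle = 1                # cycle in which external memory scanning begins
    while rc != 0:           # zc is low while RC is non-zero
        rc -= 1              # RC is decremented at the end of the current cycle
        cycle += 1
    return cycle             # zc is high during this cycle
```

Under these assumptions, an initial value of 18 gives cycle 19 for the 5/3, and an initial value of 34 gives cycle 35 for the 9/7, in agreement with the values quoted for CJ.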
The ASM chart for the CPs control unit and its block diagram are shown in Figures
5.4.8 (a) and (b), respectively. The CPs control unit is activated when its input signal stCPC is
asserted high by the TLB control unit. As the ASM moves from state S0 to S1,
register RC is set to its initial value. In state S1, FFs Qc5, Qc6, and Qc7 are set to 0. In
state S2, where scanning of the external memory begins, register RC is decremented by
one.
In state S3, the ASM executes the loop consisting only of state S3 and controlled 
by signal zc. Each time this loop is executed, RC is decremented by one. When zc is 
I, the control exits the loop and moves to state S4. As the control moves from states 
S3 to state S4, it activates the write subband memory control unit by asserting the 






Figure 5.4.8 (a) ASM flowchart for CPs Control Unit (b) The block diagram. 
(Legend: Tr indicates that the last operation in a column will be scheduled in the clock cycle after next; zlc indicates that the last operation in the last column is reached.) 
ASM asserts its conditional output signal labeled Y1 to activate the write RAM 
control unit and decrement register RN2 by one. The control will execute this path 
and activate the write LL-RAM control unit in all decomposition levels except the 
last. The reason is that the LL-subband of the last 
decomposition level should be stored in the subband memory block labeled LL1max, 
not in the LL-RAM. When EP2 becomes 1, it indicates that the last decomposition 
level is in progress. 
In addition, note that when the ASM makes a transition from state S3 to S4, the CP1 
and CP3 latches will be loaded for the first time with the high and low coefficients of the 
first operations, respectively. In state S4, Qc6 is set to 1, since all operations between the 
first and last operations in a column, as explained before, require Qc6 and Qc7 to be 
set to 1 and 0, respectively. 
In state S5, a decision is made based on signal Tr, which is the output of the 
XNOR gate attached to register RN2. As long as Tr is 0, the loop consisting of states 
S5 and S6 is executed and register RN2, which holds the number of operations in a 
column, is decremented by one to reflect the number of operations left. Register RN2 is 
decremented each time a high and a low operation are scheduled from the H and L 
decompositions, respectively. Note that the actual scheduling of operations is done 
internally by clock f4a, as indicated in the architecture shown in Figure 4.2.7, 
during execution of the above loop. All operations scheduled for the CPs during 
this loop are those between the first and last operations in a column. 
Signal Tr becomes 1 when RN2 is decremented to 2. When Tr is 1, the decision 
box with input signal EN is examined to determine whether N is even or odd. If EN is 
1, then N is odd and Qc7 is set to 1 in order to satisfy the requirement that both Qc7 and 
Qc6 must be 1 in the last operation. Otherwise, Qc7 is left unchanged. Then the 
control moves to state S7. 
In state S7, the ASM asserts the output signal labeled Y5. This output signal 
decrements register RM2, which holds the number of columns to be scheduled for the CPs, by 
one and sets Qc5 to 1. Setting Qc5 to 1 for the last operation in a column, which will be 
scheduled in the next state (S8), allows the extension multiplexers controlled by 
signals sce0 and sce3 to pass the data on the bus connected to the multiplexer input 
labeled 1, instead of the input labeled 0, to Rt2 as a third input, as required when N is even. 
In state S8, where the last operation in a column is scheduled for execution, a 
decision is made based on signal zlc, which is the output of the XNOR gate attached 
to register RM2. If zlc is 1, it indicates that all columns in the L and H decompositions 
have been scheduled and the control returns to its initial state S0. On the other hand, if 
zlc is 0, the control moves to state S9 to initiate processing of the next column. As the 
control moves from state S8 to S9, it reloads register RN2 and clears FFs Qc5, 
Qc6, and Qc7 to 0 by asserting its conditional output signal labeled Y6. When the 
control moves from S10 to S4, it loads the coefficients of the first operation of the next 
column in each of the H and L decompositions into the CP1 and CP3 or the CP2 and CP4 latches. 
5.4.3 Read LL-RAM Control Unit 
The read LL-RAM control unit is responsible for reading the LL-RAM memory according to 
the scan method shown in Figure 3.5.1. Two control algorithms (or ASM charts) will 
be developed, one for the RAM architecture designed using modules shown in Figure 
5.2.2 and the other for the RAM architecture designed using banks shown in Figure 
5.2.5. Recall that the LL-RAM architecture is designed to allow both read and write 
to take place in the same clock cycle. Read takes place in the first half cycle and write 
in the second half cycle. 
The ASM chart for the read RAM control unit that controls the 
read operations of the RAM architecture shown in Figure 5.2.2, together with its block diagram, is given in Figure 
5.4.9. The control unit is activated when its input signal rram is 
asserted high by the B-unit. As a result, the control moves from state S0 to S1. In state 
S1, both registers RMAR (read module address register) and RMSR (read module select 
register) are set to 0. Register RMSR enables the first 3 modules for read, while 
register RMAR points to the first location in each module. Then, the control moves 
unconditionally to state S2, where the process of scanning the RAM begins. When the 
control moves from state S2 to S3, three pixels are scanned, one from each module, 
and are then loaded into the RP1's latches. In addition, register RMAR is incremented 
by one so that it addresses the second location in each module, while register RN1 is 
decremented by one to reflect that one read operation has been performed. Register 




Figure 5.4.9 (a) ASM chart for Read RAM Control Unit of the RAM 
architecture using modules (b) The block diagram. 
In state S3, the control executes the loop consisting only of state S3 and controlled 
by signal z1. This loop allows the control to continue reading the enabled RAM 
modules. Each time the loop is executed, register RMAR is incremented so that it 
points to the next location, while register RN1 is decremented by one. When RN1 is 
decremented to 1, it asserts signal z1 high to indicate that the three modules enabled in the 
current run have all been read and the next 3 modules for the next run can be initiated. As 
the ASM moves from state S3 to S4, to get ready for the next run, register RN1 is 
again loaded with the same value, register RMAR is set to 0, and register RMSR is 
incremented by one to select the next three modules that will be read in the next 
run. 
In state S4, where a run ends and another begins, signal EN, the least 
significant bit of RN0, is examined. If EN is 1, then N is odd and no read takes place when the 
control moves to S5. This satisfies the condition required by the 4-parallel 
architecture that, when a transition is made from one run to the next and N is odd, no data 
is read from external memory. Otherwise, N is even and the first read operation in the 
new run is performed immediately. In both cases, the next state is S5. 
In state S5, signal Lr (last run) is examined to determine whether the last run is 
reached. As long as Lr is 0, the last run is not reached and the ASM executes the loop 
consisting of states S3, S4, and S5 until Lr becomes 1. When Lr becomes 1, it 
indicates that the run before the last one is now completed and the last run is in 
progress. Then, the ASM moves to state S6 to continue with the last run. Signal Lr, 
which is the output of the XNOR gate attached to register RM1, becomes 1 when 
RM1 is decremented to 1. Note that register RM1 is decremented internally by the 
signal labeled z2 in the C-unit. 
In state S6, the ASM executes the loop consisting only of state S6 and 
controlled by signal z1. As long as signal z1 is 0, this loop is executed and the read 
operations required in the last run are performed. When z1 becomes 1, it indicates 
that all required reads in the last run have been performed. Then, at the clock event, 
the control returns to its initial state S0. 
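The read sequence produced by this control unit can be sketched in Python as a generator of (module-select, address) pairs. This is a simplified behavioural model, not the ASM itself: the counters that track reads per run and the number of runs are abstracted into loop bounds, the parameter names are hypothetical, and the idle cycle inserted at a run transition when N is odd is modelled by yielding None.

```python
def read_schedule(num_runs, reads_per_run, n_is_odd):
    """Yield (RMSR, RMAR) for each read cycle, or None for an idle cycle.

    num_runs      -- number of runs (hypothetical parameter)
    reads_per_run -- locations read from each enabled module per run
    n_is_odd      -- True when N is odd (no read at a run transition)
    """
    rmsr = 0                                  # selects the group of enabled modules
    for run in range(num_runs):
        rmar = 0                              # address pointer reset at each run
        if run > 0 and n_is_odd:
            yield None                        # transition cycle without a read
        for _ in range(reads_per_run):
            yield (rmsr, rmar)                # one location read from each enabled module
            rmar += 1
        rmsr += 1                             # select the next group of modules


if __name__ == "__main__":
    print(list(read_schedule(num_runs=2, reads_per_run=3, n_is_odd=True)))
```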
The ASM chart for the second read RAM control unit, which controls the read 
operations of the RAM architecture designed using banks (Figure 5.2.5), and its block 
diagram are given in Figure 5.4.10. The ASM chart shown in Figure 5.4.10 (a) is 
basically identical to the one shown in Figure 5.4.9 (a), except that it has 
one extra decision box between states S3 and S4 with the control input signal labeled 
zbr (see Figure 5.2.5). When all modules in a bank are read, signal zbr becomes 1. 
When zbr becomes 1, register RN1 is loaded again with the same value, register 
RMAR is set to 0, and registers RBSR (read bank select register) and RMSR are 
incremented to select the first three modules in the new bank. Otherwise, the control 
continues reading the same bank. In both cases, the next state is S4. 
Figure 5.4.10 (a) ASM chart for Read RAM Control Unit of the RAM 
architecture using banks (b) The block diagram 
(Legend: zbr: all modules in a bank are read; z1: end of a run, i.e. all modules enabled in the current run are read; Lr: last run in a decomposition is reached.) 
5.4.4 Write RAM/Subband Memory Control Unit 
The write RAM/subband memory control unit consists of two control units, the write RAM 
control unit and the write subband memory control unit. They are 
responsible for performing write operations in the LL-RAM 
and the subband memory, respectively. Both control units are activated at the same time, 
when signals wsub and wram are asserted high by the CPs control unit, and are 
terminated at the same time. However, in the last decomposition level, only the write 
subband memory control unit is activated, since the LL-subband of the last 
decomposition is required to be stored in the subband memory block labeled LL1max, 
not in the LL-RAM. 
On the other hand, the number of clock cycles that elapse between the cycle 
in which the first inputs are loaded into the CP1 and CP3 latches and the cycle in which the 
first output coefficients generated by CP1 and CP3 are loaded into the output latches can 
be obtained from Eqs. (4.4) and (4.5) as follows. 
$$C_2 - C_1 = 4k_c \qquad (5.3)$$ 
To detect the occurrence of this event, register RFO is added to the write subband 
memory control unit shown in Figure 5.4.11 (b). Register RFO is a down counter with 
control signals set and dec (decrement). Initially, RFO is set equal to $4k_c$ by asserting 
signal set high. This register is then decremented by one every clock cycle. When 
RFO is decremented to 1, it asserts signal zfo high to indicate that the first output 
coefficients will be available in the CP1 and CP3 output latches at the end of the cycle. 
According to the dataflow table of the 4-parallel architecture, once the first four 
output coefficients are produced, four new output 
coefficients are produced in every other clock cycle until the process of decomposing a level into subbands 
is completed. 
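The output timing implied by Eq. (5.3) and the dataflow table can be illustrated with a small Python sketch. It assumes that the first outputs appear 4·k_c cycles after cycle C1 and that a new group of four coefficients follows every second cycle; the function name is hypothetical, k_c stands for the factor multiplying 4 in Eq. (5.3), and the value k_c = 4 used in the example is chosen only for illustration.

```python
def output_cycles(c1, k_c, num_groups):
    """Return the cycle numbers at which groups of four outputs are latched.

    c1         -- cycle in which the first inputs are loaded into CP1 and CP3
    k_c        -- factor multiplying 4 in Eq. (5.3) (illustrative value in the demo)
    num_groups -- number of output groups to list
    """
    c2 = c1 + 4 * k_c                          # Eq. (5.3): C2 - C1 = 4 * k_c
    return [c2 + 2 * i for i in range(num_groups)]   # then every other cycle


if __name__ == "__main__":
    # C1 = 19 is the 5/3 value quoted earlier; k_c = 4 is an assumption.
    print(output_cycles(c1=19, k_c=4, num_groups=3))
```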
a) Write Subband Memory Control Unit 
The ASM chart that describes the write subband memory control unit is shown in Figure 
5.4.11 (a) and its block diagram is shown in Figure 5.4.11 (b). The ASM chart is 
derived such that the control unit can write into the subband memory according to the 
strategy explained in section 5.3.2, which can be summarized as follows. The strategy 
begins by storing the first three subbands of the first decomposition level in the 
subband memory blocks labeled HL1, HH1, and LH1. As soon as the three subbands 
are written, the compression unit is informed by setting the FF labeled Fcomp high. 
The compression unit can then read each subband block and compress it 
independently, while the DWT unit continues to further decompose the LL-subband 
of the first decomposition level. The compression unit first resets Fcomp to 0 
and then proceeds with the compression process. When all levels after the first are 
decomposed and their subbands are stored in their respective subband memory blocks, 
the compression unit is again informed by asserting FF Fcomp high. 
The write subband memory control unit, represented by the ASM chart shown in 
Figure 5.4.11 (a), is activated when the input signal wsub of the ASM is asserted high 
by the CPs control unit. Then the ASM moves from its initial state S0 to state S1. As 
the control moves from state S0 to S1, register RN3 is loaded with the number of 
locations to be written in a module and register RFO is set equal to $4k_c$, while the input 
latches of CP1 and CP3 are loaded internally with the data of the first operation. 
In state S1, the ASM executes the loop controlled by signal zfo, which consists of 
state S1 and the conditional output labeled Y1. As long as zfo is 0, this loop is 
executed and register RFO is decremented by one, while the control remains in the 
same state, S1. When register RFO is decremented to 1, it asserts its output signal zfo 
high, which indicates that the first output coefficients generated by CP1 and CP3 will 
be loaded into the output latches by the pulse ending the cycle (when the control 
moves from state S1 to S2). In addition, when signal zfo is 1, the two status input signals 
EP2 and fs are examined. If both signals are 0, which will be true only if this is the 




Figure 5.4.11 (a) ASM chart for write subband memory control unit 
(b) The block diagram 
(Legend: EDL: end of a decomposition level; zwc: all locations in the enabled modules are written; EP2: last decomposition level; 4kc: number of clock cycles that must elapse before the first outputs are loaded into the CP1 and CP3 output latches.) 
EP2 is 1, then it implies that the final decomposition is in progress and the conditional 
output labeled Y2 is executed as the control moves from state S1 to S5. Execution of 
Y2 sets FF Fllwe to 1, which enables the subband memory block labeled LL1max to store 
the last LL-subband image. However, in all decomposition levels between the 
first and the last, signals EP2 and fs will be 0 and 1, respectively, and 
the path leading to state S5 through the conditional output labeled Y3 will be 
executed. 
In state S2, the ASM executes the loop consisting of states S2 and S3. Each time 
this loop is executed, three coefficients from the CPs output latches are simultaneously 
transferred to the subband memory, where each coefficient is stored in the first 
module of one of the subband blocks labeled HL1, HH1, and LH1, starting from the first 
location. In addition, register RN3, which holds the number of locations to be written in a 
module, is decremented by one and register MAR1 is incremented by one so that it 
points to the next location to be written in the three enabled modules. 
When register RN3 is decremented to 1, it asserts signal zwc high to indicate that 
all locations in the three enabled modules are written and the next three modules can 
be enabled for write. Then the ASM moves from state S2 to S4. As the ASM moves 
from state S2 to S4, register RM3, which holds the number of modules to be written in 
each subband memory block, is decremented by one. In state S4, register RN3 is 
loaded again with the same value and register MAR1 is reset to 0, while register SMSR 
is incremented by one to select the next 3 modules, one from each of the subband memory 
blocks labeled HL1, HH1, and LH1, that will be written next. 
In state S4, a decision is made based on signal zm. If zm is 0, the loop consisting 
of states S2, S3, and S4 is executed. This loop executes several times before zm 
becomes 1. Signal zm becomes 1 when register RM3 is decremented to 0, which 
confirms that all modules in the first level are written. Then the control moves from 
state S4 to S8, during which register WDER is incremented by one to enable the next 
3 subband memory blocks labeled HL2, HH2, and LH2 for writing the second-level 
decomposition. In addition, Fw1 is reset to 0 and Fw2 is set to 1 to prevent further writing 
in the first level of the subband memory and to enable the second level for write, 
respectively. Furthermore, FF Fcomp is set to 1 to inform the compression unit that the 
first-level decomposition is completed and its subbands are now available in the 
subband memory blocks HL1, HH1, and LH1 for compression. 
In state S8, the output signal labeled EDL (end of a decomposition level) is 
asserted high to inform the A-unit that the first-level decomposition has completed and 
the next-level decomposition can be initiated. Then, at the clock event, the control 
returns to its initial state S0 and remains in that state until it is activated for 
the next-level decomposition. 
In all decomposition levels except the first, the second path leading to state S5 is 
executed. The second path executes a loop identical to the one in the first path, so 
everything that has been said for the loop in the first path is also true for the loop in 
the second path. 
At the end of the second loop, when signal zm is 1, the status input signal EP2 is 
examined again, this time to determine whether the last decomposition is completed. Signal 
EP2 becomes 1 only when register RD is decremented to 1. Thus, the path labeled 0 
leading to state S8 through the conditional output signal labeled Y9 is always executed 
until the last decomposition is completed. When the last decomposition completes, 
signal EP2 will still be 1. Then, at the clock event, as the ASM moves to state S8, FFs 
Fllwe, Fbwe, and Fw2 are reset to 0 to disable the subband memory, so that no further writes 
take place until it is read by the compression unit, and the compression unit is 
informed by setting Fcomp to 1. 
In state S8, the output signal EDL is asserted high and, at the clock event, the 
control returns to its initial state S0 and remains in that state until activated 
for decomposition of another image. 
b) Write LL-RAM Control Unit 
In the following, two ASM charts for the write LL-RAM control unit will be derived, one for 
the RAM architecture designed using modules shown in Figure 5.2.2 and the 
other for the RAM architecture designed using banks shown in Figure 5.2.5. 
The first ASM chart, which describes the write RAM control unit for the RAM architecture 
shown in Figure 5.2.2, is given in Figure 5.4.12 (a) and its block diagram is shown in 
Figure 5.4.12 (b). This control unit is activated when its input signal wram is asserted 
high by the CPs control unit. As the control moves from state S0 to S1, the FF labeled 
FM is set to 0. FM is a FF with two signals, clr (clear) and T (toggle). This FF is initially 
cleared to 0 and toggles each time signal T is high. Since the decoder labeled 
dcodms enables 3 modules at a time and writing is required to take place module-by-module, 
FM is used to determine the time at which register WMSR should be 
incremented, such that the next 3 modules are enabled by the decoder at the appropriate 
time while writing into only one module at a time is still possible. Looking at the 
architecture in Figure 5.2.2, it can be determined that as soon as module number 
2m is written, where m = 1, 2, 3, ..., register WMSR can be incremented so that the 
decoder can safely select the next 3 modules. In other words, register WMSR will be 
incremented first after module number 2 is written, then after module number 4 is 
written, and so on. FM serves this purpose. 
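The role of the toggle flip-flop can be illustrated with a short Python sketch. It assumes that FM toggles once each time a module has been completely written, so that every second toggle marks the completion of an even-numbered module; this reading of the T signal and the function name are assumptions made for illustration.

```python
def wmsr_trace(num_modules):
    """Return (module_number, WMSR) after each module has been written."""
    fm = 0                                    # toggle FF, cleared initially
    wmsr = 0                                  # write module select register
    trace = []
    for module in range(1, num_modules + 1):
        # ... all locations of this module are written here ...
        fm ^= 1                               # toggle once per completed module
        if fm == 0:                           # back to 0: an even-numbered module finished
            wmsr += 1                         # decoder may now select the next 3 modules
        trace.append((module, wmsr))
    return trace


if __name__ == "__main__":
    print(wmsr_trace(5))
```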
In state S1, the ASM executes a loop identical to the one in state S1 of the 
write subband memory control unit. This suggests that the loop could be eliminated 
and the control activated from the write subband memory control unit 
instead. As the control moves from state S1 to S2, registers WMSR, 
WMAR, and WER are reset to 0. Registers WMSR and WER together determine which 
module is enabled for write, whereas register WMAR is used to address each 
location in the enabled module. 
In state S2, two loops are executed: the inner loop, which is controlled by signal zwc, 
and the outer loop, which is controlled by signal zm. These two loops are similar to the 
two loops in states S2 and S4 of the ASM chart for the write subband memory 
control unit. The inner loop writes into the enabled module through register WMAR, 
which serves as the address pointer, starting from the first location. The 
outer loop selects the next module to be written through registers WER and WMSR. 
When all modules are written, signal zm becomes 1. Then, at the clock event, the 
control moves to state S5. As the control moves to state S5, FF FE, whose output 
should be connected to the enable signal of the decoder labeled dcodms in Figure 
5.2.2 (a), is set to 0 to disable the LL-RAM so that its contents are safeguarded until next 






Figure 5.4.12 (a) ASM chart for write RAM control unit of the RAM architecture 
using modules (b) The block diagram (c) Proposed clock signal 
(Legend: zwc: all locations in the enabled module are written; zm: last module is written.) 
In state S5, registers WMAR, WMSR, and WER are reset to 0. This step is necessary 
to prevent modification of stored data by illegal writes during the period where the 
RAM is enabled and only read operations are taking place. This always occurs at 
the beginning of each decomposition level, since the LL-RAM is designed to allow 
both read and write to take place in the same clock cycle. This step forces the first 
module to be enabled and register WMAR to point at the first location. Thus, during 
this period all illegal writes occur in the first location of the first module, which 
will be read before the first illegal write takes place. Then, at the clock event, the ASM 
moves from state S5 to S0 and remains in that state until it is activated again. 
The second ASM chart, shown in Figure 5.4.13 (a), describes the write RAM 
control unit for the RAM architecture designed using banks shown in Figure 5.2.5. 
The block diagram of the control unit is shown in Figure 5.4.13 (b). This ASM is 
basically identical to the one shown in Figure 5.4.12 (a), except that 
it has one extra decision box with a status input signal zbw (see Figure 5.2.5) and one 
conditional output box labeled Y4 inserted immediately after the conditional output 
labeled Y2. When all modules in a bank are written, signal zbw is asserted high. Thus, 
every time signal zbw is 1, register WBSR (write bank select register) is incremented 
by one to enable the next bank for write and the control moves to state S2. 
Figure 5.4.13 (a) ASM flowchart for write RAM control unit of the 
RAM architecture using banks (b) The block diagram 
(Legend: zwc: all locations in the enabled modules are written; zm: last module is written; zbw: all modules in a bank are written.) 
Finally, before closing this section, an important issue regarding clock f4 
should be addressed. As mentioned before, the LL-RAM architecture is designed to 
support both read and write operations in the same clock cycle. 
Read occurs in the first half cycle and write in the second half cycle. This might 
suggest that the low and high pulses of clock f4 should be equal. However, from the dataflow 
given in Table B.10 it can be seen that the CPs yield four output coefficients every 
other clock cycle, with reference to clock f4. That means these output coefficients remain in 
the output latches for two clock cycles before the next output coefficients are loaded. 
Thus, using a clock with equal pulses would be inefficient. For example, if 
read is performed during the first (low) pulse of each cycle and write 
during the second (high) pulse, then in every 
two clock cycles the second pulse of the first cycle is used for writing, but the 
second pulse of the second cycle is unused. Thus, in order to use the whole 
period effectively, a clock signal of the form shown in Figure 5.4.12 (c) is proposed. 
In this clock, the low pulse width is longer than the high pulse width, and a write 
operation that starts at a high pulse is allowed to complete in the next high pulse of 
the clock, as indicated in Figure 5.4.12 (c). In addition, the fact that a memory read 
operation takes more time than a write operation makes this solution more attractive. 
5.5 Conclusions 
In this chapter, two novel VLSI memory architectures for 2-D DWT architectures for 
5/3 and 9/7 are developed. A banking technique is utilized to form DWT 
memory architectures that are more efficient in terms of speed. The advantage of the two proposed 
architectures is that they can be easily incorporated into single or parallel DWT 
architectures. Furthermore, to show that the architectures developed in this research 
are simple to control, the control algorithms for 4-parallel architecture including the 
LL-RAM and the subband memory were developed. To ease the control development, 
the overall system control is divided into several smaller units. Then, the algorithmic 
state machine (ASM) for each unit is developed. The control algorithms developed 




2-DIMENSIONAL INVERSE DISCRETE WAVELETS TRANSFORM 
ARCHITECTURE DEVELOPMENT 
6.1 Introduction 
In chapter 3, architectures for 2-dimensional forward discrete wavelet transform (2-D 
FDWT) for 5/3 and 9/7 algorithms were developed. In this chapter, architectures for 
2-dimensional inverse discrete wavelet transform (2-D IDWT) for 5/3 and 9/7 
algorithms will be developed. 
The function of the 2-D FDWT in a compression system is to decorrelate the image 
pixels prior to the compression step, whereas the function of the 2-D IDWT is to 
reconstruct and completely recover the original image from the decorrelated image. 
The 2-D FDWT decomposes an NxM image into subbands as shown in Figure 
6.1.1 for a 3-level decomposition. The decorrelated image shown in Figure 6.1.1 can be 
reconstructed by using the 2-D IDWT as follows. First, subbands LL3 and LH3 are 
reconstructed in the column direction, column-by-column, to recover the L3 decomposition. 
Similarly, subbands HL3 and HH3 are reconstructed to obtain the H3 decomposition. 
Then the L3 and H3 decompositions are combined row-wise to reconstruct subband LL2. 
This process is repeated in each level until the whole image is reconstructed. 
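The order of operations in this reconstruction can be sketched in Python as follows. The sketch only models the data flow: the function names are hypothetical, images are plain lists of rows, and a lazy-wavelet interleave (low samples to even positions, high samples to odd positions) stands in for the actual 1-D 5/3 or 9/7 synthesis step.

```python
def interleave(low, high):
    """Placeholder 1-D synthesis: low -> even positions, high -> odd positions."""
    out = [None] * (len(low) + len(high))
    out[0::2] = low
    out[1::2] = high
    return out


def merge_columns(top, bottom, synth):
    """Combine two subbands column-by-column (e.g. LL and LH into L)."""
    cols = [synth(list(ct), list(cb)) for ct, cb in zip(zip(*top), zip(*bottom))]
    return [list(row) for row in zip(*cols)]


def merge_rows(left, right, synth):
    """Combine L and H row-by-row into the next-level LL subband."""
    return [synth(rl, rr) for rl, rr in zip(left, right)]


def reconstruct(ll, details, synth=interleave):
    """details: list of (LH, HL, HH) tuples ordered from the last level to the first."""
    for lh, hl, hh in details:
        l_dec = merge_columns(ll, lh, synth)   # column-wise: LL + LH -> L
        h_dec = merge_columns(hl, hh, synth)   # column-wise: HL + HH -> H
        ll = merge_rows(l_dec, h_dec, synth)   # row-wise:    L + H  -> next LL
    return ll


if __name__ == "__main__":
    img = [[4 * r + c for c in range(4)] for r in range(4)]
    pick = lambda rows, cols: [[img[r][c] for c in cols] for r in rows]
    ll, lh = pick((0, 2), (0, 2)), pick((1, 3), (0, 2))
    hl, hh = pick((0, 2), (1, 3)), pick((1, 3), (1, 3))
    print(reconstruct(ll, [(lh, hl, hh)]) == img)
```

Replacing interleave with a real synthesis function of the same signature would leave the level-by-level control flow unchanged.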
The reconstruction process described above implies that the task of 
reconstruction can be achieved by using 2 processors. The first processor (the column-processor) 
computes column-wise to combine subbands LL and LH into L and 
subbands HL and HH into H, while the second processor (the row-processor) 
computes row-wise to combine L and H into the next-level subband. The decorrelated 
image represented in Figure 6.1.1 is assumed to be residing with the same format in 









Figure 6.1.1 Subband decomposition of an NxM image into 3 levels. 
6.2 Lifting-based 5/3 and 9/7 synthesis algorithms and data dependency graphs 
The 5/3 and 9/7 inverse discrete wavelet transform algorithms are defined by the 
JPEG2000 image compression standard for a 1-D signal Y(n) containing N samples as 
follows: 
5/3 synthesis algorithm 
Step 1: $X(2n) \leftarrow Y(2n) - \left\lfloor \dfrac{Y(2n-1) + Y(2n+1) + 2}{4} \right\rfloor$ 
Step 2: $X(2n+1) \leftarrow Y(2n+1) + \left\lfloor \dfrac{X(2n) + X(2n+2)}{2} \right\rfloor$ 
where $n = 0, 1, 2, \ldots, N-1$ 
9/7 synthesis algorithm 
Step 1: $Y'(2n) \leftarrow \dfrac{1}{k} \cdot Y(2n)$ 
Step 2: $Y'(2n+1) \leftarrow k \cdot Y(2n+1)$ 
Step 3: $Y'(2n) \leftarrow Y'(2n) - \delta\,(Y'(2n-1) + Y'(2n+1))$ 
Step 4: $Y'(2n+1) \leftarrow Y'(2n+1) - \gamma\,(Y'(2n) + Y'(2n+2))$ 
Step 5: $X(2n) \leftarrow Y'(2n) - \beta\,(Y'(2n-1) + Y'(2n+1))$ 
Step 6: $X(2n+1) \leftarrow Y'(2n+1) - \alpha\,(X(2n) + X(2n+2))$ 
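As a software cross-check of the 5/3 synthesis steps, the following Python sketch applies them to an interleaved coefficient list (even indices low, odd indices high) and verifies perfect reconstruction against the corresponding forward lifting steps. It is a minimal reference model rather than a description of the hardware: the function names are hypothetical, whole-sample symmetric extension is assumed at both boundaries, and the signal is assumed to have at least two samples.

```python
def _mirror(i, n):
    """Whole-sample symmetric extension of index i into the range 0..n-1."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def synthesis_53(y):
    """Inverse 5/3 lifting on an interleaved list (even: low, odd: high)."""
    n = len(y)
    x = list(y)
    for i in range(0, n, 2):      # Step 1: recover the even samples
        x[i] = y[i] - (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) // 4
    for i in range(1, n, 2):      # Step 2: recover the odd samples
        x[i] = y[i] + (x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)]) // 2
    return x


def analysis_53(x):
    """Forward 5/3 lifting (the synthesis steps reversed), used for checking."""
    n = len(x)
    y = list(x)
    for i in range(1, n, 2):      # high-pass (predict) step
        y[i] = x[i] - (x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)]) // 2
    for i in range(0, n, 2):      # low-pass (update) step
        y[i] = x[i] + (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) // 4
    return y


if __name__ == "__main__":
    for signal in ([10, 12, 7, 3, 8, 20, 15], [4, -9, 0, 31, 2, 2]):
        assert synthesis_53(analysis_53(signal)) == signal
    print(analysis_53([5, 5, 5, 5]))
```

Python's floor division matches the floor brackets of the algorithm for negative operands as well, which is why the round trip is exact for integer inputs.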
The data dependency graphs (DDGs) for 5/3 and 9/7 derived from the synthesis 
algorithms are shown in Figures 6.2.1 and 6.2.2, respectively. The DDGs are very 
useful tools in architecture development and provide the information necessary for the 
designer to develop more accurate architectures. The symmetric extension algorithm 
recommended by JPEG2000 is incorporated into the DDGs to handle the boundaries 




Figure 6.2.1 5/3 synthesis algorithm's DDGs for (a) odd and (b) even length signals 
Figure 6.2.2 9/7 synthesis algorithm's DDGs for (a) odd and (b) even length signals 
the same as that of the original input. The boundary treatment is only applied at the 
beginning and end of the process. The nodes circled with the same numbers are 
considered redundant computations, which are computed once and reused thereafter. 
Note that the input coefficients with even numbers in the DDGs are low coefficients 
and those with odd numbers are high coefficients. 
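The 9/7 synthesis steps, together with symmetric extension at the two boundaries, can be sketched in Python in the same interleaved layout (even indices low, odd indices high). This is an illustrative floating-point model only. The lifting constants are the values commonly quoted for the 9/7 filter, and the scaling constant K is an assumption about the convention behind k in the synthesis algorithm; the function names are hypothetical. Because the forward routine simply reverses the synthesis steps, the round trip holds for any choice of constants.

```python
# Commonly quoted 9/7 lifting constants; K is an assumed scaling convention.
ALPHA, BETA = -1.586134342, -0.052980118
GAMMA, DELTA = 0.882911075, 0.443506852
K = 1.149604398


def _mirror(i, n):
    """Whole-sample symmetric extension of index i into the range 0..n-1."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def _lift(v, start, coeff):
    """In place: v[i] += coeff * (v[i-1] + v[i+1]) for i = start, start+2, ..."""
    n = len(v)
    for i in range(start, n, 2):
        v[i] += coeff * (v[_mirror(i - 1, n)] + v[_mirror(i + 1, n)])


def synthesis_97(y):
    """Inverse 9/7 lifting following Steps 1 to 6 of the synthesis algorithm."""
    v = [c / K if i % 2 == 0 else c * K for i, c in enumerate(y)]  # Steps 1-2
    _lift(v, 0, -DELTA)   # Step 3
    _lift(v, 1, -GAMMA)   # Step 4
    _lift(v, 0, -BETA)    # Step 5
    _lift(v, 1, -ALPHA)   # Step 6
    return v


def analysis_97(x):
    """Forward 9/7 lifting (synthesis steps reversed), used only for checking."""
    v = [float(c) for c in x]
    _lift(v, 1, ALPHA)
    _lift(v, 0, BETA)
    _lift(v, 1, GAMMA)
    _lift(v, 0, DELTA)
    return [c * K if i % 2 == 0 else c / K for i, c in enumerate(v)]


if __name__ == "__main__":
    signal = [10, 12, 7, 3, 8, 20, 15]
    restored = synthesis_97(analysis_97(signal))
    print(max(abs(a - b) for a, b in zip(restored, signal)) < 1e-9)
```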
The strategy used in chapter 4 for developing the 2-D FDWT 
architectures can also be used for developing the 2-D IDWT architectures. To ease the 
architecture development, the strategy divides the details of the development into two 
steps, each having less information to handle. In the first step, the DDGs are 
looked at from the outside, which is specified by the dotted boxes in the DDGs, in 
terms of the input and output requirements. It can be observed that the DDGs for 5/3 
and 9/7 are identical when looked at from the outside, taking into consideration 
only the input and output requirements, which can be specified for each algorithm by 
adopting an appropriate scan method, but differ in the internal details. Based on this 
observation, the first level of the architecture, called the external architecture, is 
developed. In the second step, the internal details of the DDGs are considered for the 
development of the processors' datapath architectures, since the DDGs 
define the internal structure of the processors. 
6.3 Scan methods 
The first step in developing the external architecture for 5/3 and 9/7, which consists 
of a column-processor (CP) and a row-processor (RP), is to specify an appropriate 
scan method for each processor. Therefore, in Figures 6.3.1 and 6.3.2, two scan 
methods for the 5/3 and 9/7 CP are illustrated, respectively. Similarly, two scan methods 
are illustrated in Figures 6.3.3 and 6.3.4 for the 5/3 and 9/7 RP, respectively. These scan 
methods are developed mainly with one objective in mind, namely to make 
the external architecture for both the 5/3 and 9/7 algorithms identical. Note that the boxes 
labeled (a) in Figures 6.3.1 and 6.3.2 are formed for illustration purposes by merging 
together subbands LL and LH, where LL-subband coefficients occupy even rows and 
LH-subband coefficients occupy odd rows. Similarly, the boxes labeled (b) in Figures 
6.3.1 and 6.3.2 are formed by merging HL and HH together. 
The 5/3 CP scans the external memory column-by-column according to the scan 
method shown in Figure 6.3.1. The scan method illustrated in Figure 6.3.1 (a) scans 
the sections of the external memory labeled LL and LH as follows. First, the low 
coefficient LL0,0 is scanned, followed by the high coefficient LH0,0, to initiate the 
first operation. The second operation is initiated by scanning coefficient LL1,0 
followed by LH1,0, and so on. Note that coefficient LH0,0 is also required in the 
second operation. This process is repeated until the first column in both LL and LH 
is scanned. Then the scan moves to the second column in both LL and LH to repeat 
the process, and so on. Sections HL and HH of the external memory are 
scanned similarly. 
Figure 6.3.1 5/3 CP scan method (a) merging of LL and LH
(b) merging of HL and HH
Figure 6.3.2 9/7 CP scan method (a) merging of LL and LH
(b) merging of HL and HH
However, in order to allow the RP, which operates on data generated by the CP, to
work in parallel with the CP as soon as possible, the first-column coefficients of (a)
(LL+LH) are interleaved in execution with the first-column coefficients of (b)
(HL+HH). Then the second-column coefficients in both (a) and (b) are interleaved,
and so on. This column-coefficient interleaving process takes place as follows. First,
two coefficients LL0,0 and LH0,0 are scanned from the first column of (a), followed
by another two coefficients HL0,0 and HH0,0 from the first column of (b). Then the
scan returns to (a)'s first column and scans LL1,0 and LH1,0, followed by HL1,0 and
HH1,0 from the first column of (b). This is repeated until the two columns are
processed, which completes a run. The second run similarly processes the second
column in both (a) and (b), and so on. The interleaving process not only
speeds up the computations by allowing the two processors to work in parallel
earlier during the computations, but also reduces the internal memory requirement
between CP and RP to a few registers.
Figure 6.3.3 5/3 RP scan method (a) Even length row (b) Odd length row
Figure 6.3.4 9/7 RP scan method
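The interleaved 5/3 CP scan order described above can be sketched in a few lines of Python. This is only an illustrative model, not part of the proposed hardware: the function name `cp_scan_order_53` and the parameters `rows` and `cols` are hypothetical, and all four subbands are assumed to have the same dimensions.

```python
def cp_scan_order_53(rows, cols):
    """Yield (subband, row, col) tuples in the interleaved 5/3 CP scan order.

    rows, cols: assumed common dimensions of the LL, LH, HL and HH subbands.
    """
    for col in range(cols):          # one run per column of (a) and (b)
        for row in range(rows):
            # two coefficients from column `col` of (a) = LL merged with LH
            yield ("LL", row, col)
            yield ("LH", row, col)
            # followed by two coefficients from column `col` of (b) = HL merged with HH
            yield ("HL", row, col)
            yield ("HH", row, col)


order = list(cp_scan_order_53(rows=2, cols=2))
for entry in order[:8]:
    print(entry)
```

Printing the first eight entries reproduces the sequence LL0,0, LH0,0, HL0,0, HH0,0, LL1,0, LH1,0, HL1,0, HH1,0 given in the text.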
The scan method for the 5/3 CP and the DDGs suggest that the 5/3 RP should scan its
coefficients, which are generated by the CP, according to the scan method illustrated in
Figure 6.3.3. This figure is formed, for illustration purposes, by merging the L and H
decompositions, even though they are actually separate. In Figure 6.3.3, L's
coefficients occupy even columns, while H's coefficients occupy odd columns. In the
first run, coefficients of columns 0 and 1 are scanned by the RP as shown in Figure 6.3.3.
In the second run, coefficients of columns 2 and 3 are scanned, and so on.
The scan method shown in Figure 6.3.2 for the 9/7 CP is basically identical in all
runs to that of the 5/3 CP except in the first run, which requires, according to the 9/7
DDGs, interleaving of 4 columns, two from each of (a) and (b) of Figure 6.3.2, as
follows. First, coefficients LL0,0 and LH0,0 from the first column of (a) are scanned.
Second, coefficients HL0,0 and HH0,0 from the first column of (b) are scanned, then
LL0,1 and LH0,1 from the second column of (a), followed by HL0,1 and HH0,1
from the second column of (b). The scanning process then returns to the
first column of (a) to repeat the process, and so on.
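As a hedged illustration of how the 9/7 CP scan differs from the 5/3 one only in its first run, the following Python sketch generates the order described above. The function name `cp_scan_order_97` and its parameters are hypothetical, and equal subband dimensions are assumed.

```python
def cp_scan_order_97(rows, cols):
    """Yield (subband, row, col) tuples in the 9/7 CP scan order.

    First run: the first two columns of (a) and (b) are interleaved row by row.
    Later runs: one column of (a) and (b) per run, as in the 5/3 scan.
    rows, cols: assumed common dimensions of the four subbands.
    """
    subbands = ("LL", "LH", "HL", "HH")
    first_run_cols = min(2, cols)
    # First run: for each row, visit column 0 of (a) and (b), then column 1.
    for row in range(rows):
        for col in range(first_run_cols):
            for name in subbands:
                yield (name, row, col)
    # Remaining runs: one column per run, scanned top to bottom.
    for col in range(first_run_cols, cols):
        for row in range(rows):
            for name in subbands:
                yield (name, row, col)


order97 = list(cp_scan_order_97(rows=2, cols=3))
```

With `rows=2` and `cols=3`, the first sixteen entries cover columns 0 and 1 (the first run) and the seventeenth entry, `("LL", 0, 2)`, starts the second run on the third column.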
The scan method for the 9/7 RP is illustrated in Figure 6.3.4, which is basically also
identical to the 5/3 RP scan method except in the first run. In the first run, the 9/7
RP's scan method requires considering the first four columns for scanning as follows.
First, coefficients L0,0 and H0,0 from row 0, followed by L1,0 and H1,0 from row 1,
are scanned. Then the scan returns to row 0 and scans coefficients L0,1 and H0,1,
followed by L1,1 and H1,1. This process is repeated as shown in Figure 6.3.4 until
the first run completes.
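The two RP scan methods can be modeled by a single generator in which only the width of the first run differs. The sketch below is an assumption-laden illustration: the name `rp_scan_order`, the parameter `first_run_cols` (1 for 5/3, 2 for 9/7), and the reading that the first run proceeds over successive pairs of rows are all inferred from the description above rather than taken from the figures.

```python
def rp_scan_order(rows, cols, first_run_cols=1):
    """Yield ((L, row, col), (H, row, col)) input pairs in RP scan order.

    rows, cols: assumed common dimensions of the L and H decompositions.
    first_run_cols: 1 models the 5/3 RP, 2 models the 9/7 RP (hypothetical knob).
    """
    width = min(first_run_cols, cols)
    # First run: take rows two at a time and interleave `width` column pairs.
    for r0 in range(0, rows, 2):
        for col in range(width):
            for row in range(r0, min(r0 + 2, rows)):
                yield (("L", row, col), ("H", row, col))
    # Remaining runs: one L column and one H column per run.
    for col in range(width, cols):
        for row in range(rows):
            yield (("L", row, col), ("H", row, col))


rp53 = [(l[1], l[2]) for l, _ in rp_scan_order(4, 2, first_run_cols=1)]
rp97 = [(l[1], l[2]) for l, _ in rp_scan_order(4, 3, first_run_cols=2)]
```

Here `rp53` and `rp97` list only the (row, column) index of each scanned L/H pair, which makes the difference in the first run easy to see.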
6.4 Proposed External Architecture 
Based on the scan methods and the DDGs for 5/3 and 9/7, the architecture shown in
Figure 6.4.1 (a) is proposed for 2-D IDWT. This architecture is also valid for the
combined 5/3 and 9/7 architecture. The architecture consists of two fully pipelined
processors labeled CP and RP, which will be developed later. The proposed
architecture scans the external memory with frequency f, while the architecture
operates with frequency f/2, as indicated in Figure 6.4.1 (a). The waveforms of the two
clocks are shown in Figure 6.4.1 (b). The CP and the RP latches load new data every
time clock f/2 makes a positive transition.
The CP in the proposed architecture scans the external memory according to the
scan methods shown in Figures 6.3.1 and 6.3.2 for 5/3 and 9/7, respectively, whereas
the RP scans the output latches of the CP labeled Rtl0, Rtl1, and Rth according to the scan
methods illustrated in Figures 6.3.3 and 6.3.4 for 5/3 and 9/7, respectively. The
architecture reconstructs a decorrelated image stored in the external memory, such as
the one shown in Figure 6.1.1, as follows. The CP begins the reconstruction process by
scanning column-by-column the external memory's sections labeled LL3 and
LH3 and those labeled HL3 and HH3 in an interleaved manner to yield the L3 and H3
decompositions, which are passed to the RP through the latches labeled Rtl0, Rtl1, and
Rth. L3's coefficients are stored in Rtl0 and Rtl1, whereas H3's coefficients are stored
in Rth.
Figure 6.4.1 (a) Proposed external architecture for 5/3 and 9/7 and combined
5/3 and 9/7 2-D IDWT (b) Waveforms for clocks f and f/2.
To be specific, consider the dataflow of the architecture when it executes the 5/3
algorithm. In the first clock cycle, coefficient LL0,0 from the first column of LL3 in
the external memory is scanned and loaded into Rd0 by the positive transition of
clock f. The second clock cycle scans coefficient LH0,0 from the first column of LH3
and places it in the path labeled Y(i,j). Then the positive transition of clock f/2 loads
Rd0 and LH0,0 into the CP latches Rt0 and Rt1, respectively.
In the third clock cycle, coefficient HL0,0 from the first column of HL3 is
scanned and loaded into Rd0 by the positive transition of clock f. The fourth
clock cycle scans coefficient HH0,0 from the first column of HH3 in the external
memory and places it in the path labeled Y(i,j). Then the positive transition of clock
f/2 loads the contents of Rd0 and HH0,0 into the CP's latches labeled Rt0 and Rt1,
respectively. The scanning process then returns to subband LL3 in the external
memory to repeat this interleaving process.
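The two-clock input latching just described can be mimicked with a small behavioral model. The sketch below is not the actual hardware description; the function name `cp_input_latching` is hypothetical, and the model simply assumes that odd cycles of clock f load Rd0 while even cycles coincide with a positive transition of clock f/2.

```python
def cp_input_latching(scan_stream):
    """Model how coefficients scanned at frequency f reach the CP latches.

    scan_stream: coefficients in scan order, one per cycle of clock f.
    Returns a list of (cycle, Rt0, Rt1) tuples, one per positive edge of f/2.
    """
    rd0 = None
    loaded = []
    for cycle, coeff in enumerate(scan_stream, start=1):
        if cycle % 2 == 1:
            # Odd cycle: the scanned coefficient is loaded into Rd0 by clock f.
            rd0 = coeff
        else:
            # Even cycle: the coefficient sits on path Y(i,j); the positive
            # edge of f/2 loads Rt0 <- Rd0 and Rt1 <- Y(i,j).
            loaded.append((cycle, rd0, coeff))
    return loaded


pairs = cp_input_latching(["LL0,0", "LH0,0", "HL0,0", "HH0,0"])
print(pairs)
```

For the four coefficients of the first two operations, the model loads (LL0,0, LH0,0) at cycle 2 and (HL0,0, HH0,0) at cycle 4, matching the description.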
The CP generates two output coefficients every clock cycle. The first two output
coefficients, L0,0 and L1,0, which belong to the L3 decomposition, are loaded into Rtl0
and Rtl1, respectively, by the positive transition of clock f/2. During the next clock
cycle, say cycle n, coefficients H0,0 and H1,0, which belong to the H3 decomposition,
are placed in the output paths labeled L and H, respectively. Then the positive
transition of the clock ending the cycle transfers the content of Rtl0 and H0,0 in the output path L
to the RP's latches labeled Rt0 and Rt1, respectively, through the two multiplexers
labeled muxr, while H1,0 in the output path labeled H is loaded into Rth. The second
two output coefficients of L3, L2,0 and L3,0, are loaded into Rtl0 and Rtl1,
respectively, by the positive transition of the clock ending cycle n+1, while the contents
of Rtl1 and Rth are transferred to the RP latches Rt0 and Rt1, respectively. This process is
repeated according to the scan method illustrated in Figure 6.3.3.
On the other hand, the dataflow of the 9/7 architecture differs from that of the 5/3
mainly in the first run, which requires interleaving of 4 columns instead of two.
Since the dataflow of the 9/7 CP is the same as that of the 5/3 up to
the fourth clock cycle, the description continues from the fifth cycle.
In the fifth clock cycle, the scanning process returns to LL3, scans coefficient
LL0,1 from the second column, and loads it into Rd0 by the positive transition of the
clock ending the cycle. The sixth clock cycle scans coefficient LH0,1 from the
second column of LH3 and places it in the path labeled Y(i,j). Then the positive
transition of clock f/2 loads Rd0 and LH0,1 into the CP's latches Rt0 and Rt1,
respectively. In the seventh clock cycle, the scan moves to HL3 in the external
memory, scans coefficient HL0,1 from the second column, and loads it into Rd0 by
the pulse ending the cycle. The eighth clock cycle scans coefficient HH0,1 from the
second column of HH3 and places it in the path labeled Y(i,j). Then the positive
transition of clock f/2 loads Rd0 and HH0,1 into the CP's latches Rt0 and Rt1,
respectively. The scanning process then returns to subband LL3 in the external
memory to repeat the process until the first run completes. In the second run, the third
column in both (a) and (b) of Figure 6.3.2 is considered for processing, and the run
proceeds as that of the 5/3 described earlier. Remember that in Figure 6.3.2 (a), coefficients of
subband LL occupy even rows, while subband LH coefficients occupy odd rows.
Similarly, in Figure 6.3.2 (b), coefficients of subband HL occupy even rows, while
subband HH coefficients occupy odd rows.
Now consider the dataflow of the 9/7 from the RP side. The CP yields two output
coefficients every clock cycle. The first two output coefficients, L0,0 and L1,0 from
the L3 decomposition, are loaded into Rtl0 and Rtl1, respectively, by the positive
transition of clock f/2. During the next clock cycle, say cycle n, coefficients H0,0
and H1,0 from the H3 decomposition are placed in the output paths labeled L and H,
respectively. Then the positive transition of the clock ending the cycle transfers the content of Rtl0
and coefficient H0,0 in the output path labeled L to the RP's latches Rt0 and Rt1,
respectively, while H1,0 in path H is loaded into Rth. In cycle n+1, the coefficients in
Rtl1 and Rth are transferred to the RP's latches Rt0 and Rt1, respectively, while the two
output coefficients L0,1 and L1,1 from the L3 decomposition are loaded into Rtl0 and
Rtl1, respectively, by the positive transition of the clock ending the cycle. During
cycle n+2, two output coefficients H0,1 and H1,1 from the H3 decomposition are
placed in the output paths labeled L and H, respectively. Then the positive transition of
the clock ending the cycle transfers the content of Rtl0 and H0,1 in path L to the RP latches Rt0 and
Rt1, respectively, while H1,1 in path H is loaded into Rth. Cycle n+3 transfers the
contents of Rtl1 and Rth to the RP latches Rt0 and Rt1, respectively, while the two new
output coefficients, L2,0 and L3,0 from the L3 decomposition generated by the CP, are loaded
into Rtl0 and Rtl1, respectively. This process is repeated according to the scan method
shown in Figure 6.3.4. The dataflow table of the architecture will be given later, after
the two processors labeled CP and RP in Figure 6.4.1 are developed.
One important point: if the numbers of columns in (a) and (b) of Figures 6.3.1 and
6.3.2 are not equal, then the last run will consist of only one column of (a). In that
case, the last column of (a) is scanned every other clock cycle, with reference to clock f/2, so that
the CP yields a valid pair of output coefficients every other clock cycle. An
attempt to scan the last column every cycle of clock f/2 would result in the CP generating
more coefficients than can be handled by the RP. The dataflow from the RP side is as
follows. Suppose that at clock cycle n the first two output coefficients of the CP, L0,m and
L1,m of the last column m, are loaded into Rtl0 and Rtl1, respectively. In the next
clock cycle, cycle n+1, the content of Rtl0 is transferred to Rt0 of the RP, while the data in paths L and H
generated by the CP during the cycle are not loaded into Rtl0 and Rtl1, since they are
invalid coefficients. In cycle n+2, coefficients L2,m and L3,m generated by the CP are
loaded into Rtl0 and Rtl1, respectively, while the content of Rtl1 is transferred to the RP
latch Rt0 through muxr. This process is repeated until the run completes.
The control signal values for signals Eth, Etl, and sr that could be issued by a
control unit are derived in Table 6.1, starting from clock cycle n, where the first two
output coefficients generated by the CP are loaded into Rtl0 and Rtl1. Note, however, that
signal Eth can be eliminated, since it alternates between don't-care and 1. In addition,
since the first value of signal sr is a don't-care and the rest of its values are the
same as those of signal Etl, signals sr and Etl can be combined into one signal sr.
Table 6.1 Control signal values for Eth, Etl, and sr

CK f/2   Eth   Etl   sr
n        X     1     X
n+1      1     0     0
n+2      X     1     1
n+3      1     0     0
n+4      X     1     1
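The hand-off between the CP output latches and the RP input latches, together with the control values of Table 6.1, can be checked with a small behavioral model. The following Python sketch is an illustration only; the function name `cp_to_rp_handoff` is hypothetical, and the model assumes that the CP alternates between a pair of L coefficients and a pair of H coefficients on its two output paths, one pair per cycle of clock f/2.

```python
def cp_to_rp_handoff(cp_outputs):
    """Model latches Rtl0, Rtl1, Rth and the muxr selection driven by sr.

    cp_outputs: list of (path_L, path_H) pairs, one per cycle of clock f/2,
                alternating between two L coefficients and two H coefficients.
    Returns the (Rt0, Rt1) pairs loaded into the RP, in order.
    """
    rtl0 = rtl1 = rth = None
    rp_inputs = []
    for i, (path_l, path_h) in enumerate(cp_outputs):
        etl = 1 if i % 2 == 0 else 0   # Etl = 1 when a pair of L coefficients arrives
        sr = etl                       # merged signal; a don't-care in the first cycle
        if i > 0:
            if sr == 0:
                rp_inputs.append((rtl0, path_l))  # Rt0 <- Rtl0, Rt1 <- path L
            else:
                rp_inputs.append((rtl1, rth))     # Rt0 <- Rtl1, Rt1 <- Rth
        # Latch updates take effect on the same clock edge, after the reads above.
        if etl:
            rtl0, rtl1 = path_l, path_h
        else:
            rth = path_h
    return rp_inputs


rp_pairs = cp_to_rp_handoff(
    [("L0", "L1"), ("H0", "H1"), ("L2", "L3"), ("H2", "H3")]
)
print(rp_pairs)
```

Under these assumptions the RP receives one matched (L, H) pair per cycle, starting one cycle after the first pair of L coefficients is latched.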
6.5 Processors' architecture development 
6.5.1 Inverse 5/3 processor's architecture development
To complete the architecture for 2-D IDWT, the last phase is to design the row and
column processors' datapath architectures for the 5/3 and 9/7 algorithms separately, so that they
can be incorporated into the CP and RP of the external architecture shown in Figure 6.4.1
(a). First, the datapath architecture for 5/3 will be developed, followed by that for 9/7 in the next
section.
Based on the algorithm (6.1) and the DDGs shown in Figure 6.2.1, the inverse 5/3
processor datapath architecture shown in Figure 6.5.1 is obtained. The multiplexers
labeled muxe0, muxe1, and muxe2 implement the symmetric extension algorithm
incorporated into the DDGs. This 3-stage pipelined processor is formed by mapping
the two lifting steps of the inverse 5/3 algorithm into two pipeline stages. Steps 1 and
2 are mapped into stages 1 and 3 in Figure 6.5.1, respectively. Then stages 1 and 3
are connected through stage 2 to form a 3-stage pipelined processor. Stage 2 is
necessary because stage 3, which implements step 2, requires two successive low
coefficients from stage 1 to perform an operation. When the first coefficient generated
by stage 1 is in Rt0 of stage 3, the second coefficient will be in Rt0 of stage 2 and will
be applied to stage 3 through the path labeled X(2n+2), the Forward path. The nodes
circled with even numbers in the DDGs, which represent step 1 of the algorithm, are
all computed in stage 1 in the order indicated in the DDGs. Similarly, nodes circled
with odd numbers, which represent step 2, are computed in stage 3 in the order
specified in the DDGs.
In the following, the operations of the extension multiplexers are explained. First,
according to the DDGs for 5/3, in the calculation of the first low coefficient X0, the
second input Y1 must be allowed in stage 1 to pass through the two multiplexers
labeled muxe0 and muxe1 to the adder. Second, in the calculation of the last
coefficient, for example X8 in the DDG for odd length signals, the input coefficient
Y7, which will be in Rt1 of stage 2, must be allowed to pass through both muxe0 and
muxe1 to the adder. On the other hand, during the normal computations, which take
place between the first and last calculations, the current input coefficient in Rt1 of
stage 1 and the previous coefficient in Rt1 of stage 2 are allowed to pass through
muxe0 and muxe1, respectively, to the adder. Note, however, that in even length
signals, according to the DDG in Figure 6.2.1 (b), the last high and low coefficient
calculations take place as normal calculations. As for the extension multiplexer
labeled muxe2 in stage 3, its normal function is to pass the forward signal,
X(2n+2), to the adder in stage 3. The exception is the calculation of the last
coefficient in even length signals, where muxe2 passes the coefficient stored in
Rt0 of stage 3 to the adder instead of the one in the Forward path. Table 6.2 shows the
control signal values that must be issued by the control unit in order for the
extension multiplexers to perform the required functions.
Figure 6.5.1 Inverse 5/3 processor datapath architecture with symmetric extension
Table 6.2 Extension's control signals

a) Odd length signals            b) Even length signals
         se0   se1   se2                  se0   se1   se2
First    0     0     0           First    0     0     0
Normal   0     1     0           Normal   0     1     0
Last     1     1     0           Last     0     1     1
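To make the role of the symmetric extension concrete, the following Python sketch runs the two inverse lifting steps in software. Algorithm (6.1) is not reproduced in this section, so the sketch assumes the standard reversible 5/3 lifting equations with low coefficients at even positions and high coefficients at odd positions of the interleaved input; the function names are hypothetical, and the forward transform is included only to generate test data.

```python
def forward_53(x):
    """Forward reversible 5/3 lifting (assumed form; used only to make test data)."""
    n = len(x)                                   # requires n >= 2
    y = list(x)
    for i in range(1, n, 2):                     # high coefficients, odd positions
        right = x[i + 1] if i + 1 < n else x[i - 1]
        y[i] = x[i] - (x[i - 1] + right) // 2
    for i in range(0, n, 2):                     # low coefficients, even positions
        left = y[i - 1] if i >= 1 else y[i + 1]
        right = y[i + 1] if i + 1 < n else y[i - 1]
        y[i] = x[i] + (left + right + 2) // 4
    return y


def inverse_53(y):
    """Inverse 5/3 lifting with symmetric extension at both signal ends."""
    n = len(y)                                   # requires n >= 2
    x = list(y)
    # Step 1 (stage 1): even samples. The first sample uses Y1 twice and, for
    # odd-length signals, the last sample uses Y(n-2) twice (muxe0/muxe1).
    for i in range(0, n, 2):
        left = y[i - 1] if i >= 1 else y[i + 1]
        right = y[i + 1] if i + 1 < n else y[i - 1]
        x[i] = y[i] - (left + right + 2) // 4
    # Step 2 (stage 3): odd samples. For even-length signals the last sample
    # reuses X(n-2) in place of the missing forward value X(2n+2) (muxe2).
    for i in range(1, n, 2):
        right = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] = y[i] + (x[i - 1] + right) // 2
    return x


odd_signal = [10, 13, 7, 4, 9, 12, 15, 3, 8]
even_signal = odd_signal[:8]
print(inverse_53(forward_53(odd_signal)) == odd_signal)
```

The three boundary cases in the comments correspond to the First and Last rows of Table 6.2; all other samples follow the Normal row.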
6.5.2 Inverse 9/7 processor's datapath architecture 
Based on the 9/7 algorithm (6.2) and its DDGs shown in Figure 6.2.2, the inverse 9/7
processor datapath architecture shown in Figure 6.5.2 is obtained. This processor architecture is
formed by mapping steps 3, 4, 5, and 6 of the algorithm into stages 2, 4, 5, and 7,
respectively, while steps 1 and 2 are mapped into stage 1 to allow the two steps to
be performed in parallel. This architecture can also be thought of as formed by connecting two
5/3 processors at stage 4.
The multiplexers in stages 2, 4, 5, and 7 implement the symmetric extension
algorithm that is part of the DDGs shown in Figure 6.2.2. Table 6.2 also provides the
appropriate control signal values that must be issued by the control unit to the 9/7
extension multiplexers so that they can perform their required functions. These
extension multiplexers function in exactly the same way as those of the 5/3 described
earlier.
6.5.3 Combined inverse 9/7 and 5/3 processors architecture 
The 9/7 processor architecture shown in Figure 6.5.2 can be modified as shown in
Figure 6.5.3 to give the combined processor architecture for both 9/7 and 5/3. The 5/3
processor is incorporated into the 9/7 processor by modifying stages 1, 2, and 4, while
the remaining stages remain the same. The control signal labeled lossy/lossless
selects whether the architecture performs the 9/7 or the 5/3 algorithm. Thus, if
signal lossy/lossless is 1, the architecture reconstructs the image using the 9/7 algorithm;
otherwise, it reconstructs the image using the 5/3 algorithm. The combined architecture
could be very useful and efficient in situations where the decoder at one site is
required to perform either lossless or lossy image reconstruction. In addition, the
combined architecture offers a substantial saving in silicon area compared with two
separate processors.
6.5.4 Modified row and column processors for 5/3 and 9/7 external architecture
The 5/3 and 9/7 processor datapath architectures shown in Figures 6.5.1 and 6.5.2
were developed assuming the processors scan coefficients from external memory row-by-row
or column-by-column. The CPs for the 5/3 and 9/7 external architecture do,
according to the scan methods shown in Figures 6.3.1 and 6.3.2, scan the external
memory column-by-column. However, since the CPs for both 5/3 and 9/7 are required
to alternate between executing coefficients of subbands LL and LH and those of HL and
HH in an interleaved fashion, the processor datapath architectures for 5/3 and 9/7
shown in Figures 6.5.1 and 6.5.2 should be modified as shown in Figures 6.5.4 and
6.5.5, respectively, in order to allow interleaving in execution. The 5/3 processor
shown in Figure 6.5.1 is modified by adding one stage between stages 2 and 3, since it
interleaves two columns in execution, to obtain the 4-stage CP shown in Figure 6.5.4
that fits into the 5/3 external architecture.
On the other hand, the 7-stage 9/7 processor datapath architecture shown in Figure
6.5.2 is modified by adding 3 stages between stages 3 and 4 and 3 stages between stages 6 and 7,
since it is required to interleave 4 columns in the first run, to obtain the 13-stage CP
shown in Figure 6.5.5 for the 9/7 external architecture. Figure 6.5.5 shows only the first
seven stages, since the remaining 6 stages are identical to stages 2 to 7. Tables B.18
and B.19 (a) show the dataflow of the 5/3 and the 9/7 architectures, respectively,
which illustrate how interleaved execution takes place.
In Figure 6.5.5, the control signal s of the two multiplexers labeled mux is set to 1 in
the first run to allow interleaving of 4 columns, whereas in all other runs it is set to 0 to
allow interleaving of 2 columns, as required by the scan method shown in Figure 6.3.2,
which is identical to the 5/3 scan method shown in Figure 6.3.1 in all runs except the first
run. This also implies that, with reference to Figure 6.5.3, Figure 6.5.5 can be easily
modified as a CP for the combined 5/3 and 9/7 external architecture shown in Figure
6.4.1. Thus, when signals lossy/lossless of Figure 6.5.3 and s are both zero, the
architecture performs 5/3; otherwise, it performs 9/7.
Figure 6.5.2 Inverse 9/7 processor datapath architecture with symmetric extension
Figure 6.5.3 Combined inverse 9/7 and 5/3 processor datapath architecture
Figure 6.5.4 Modified inverse 5/3 CP datapath architecture with symmetric extension
Figure 6.5.5 Modified CP for 9/7 and combined 5/3 and 9/7 datapath architecture
On the other hand, the RP in the proposed external architecture scans coefficients
of the high (H) and low (L) decompositions generated by the CP according to the scan
methods shown in Figures 6.3.3 and 6.3.4 for 5/3 and 9/7, respectively. This
requires modifying the 5/3 and the 9/7 processor datapath architectures shown
in Figures 6.5.1 and 6.5.2, respectively, as follows. Looking at the input conditions of
the 5/3 and the 9/7 in the DDGs and the scan methods shown in Figures 6.3.3 and
6.3.4, one can immediately recognize that all input coefficients occupying odd
columns in Figures 6.3.3 and 6.3.4 in each run need to be stored in a temporary line
buffer (TLB) of size N, since they are required in the next run's computations. Therefore,
a TLB should be added in both Figures 6.5.1 and 6.5.2.
Furthermore, according to the 5/3 DDGs, applying the scan method shown in
Figure 6.3.3 would require the addition of another TLB of size N in order to store the low
coefficients of a run calculated in stage 1 of Figure 6.5.1, since they are required for the
high coefficients that would be calculated in stage 3 in the next run. When these
changes are incorporated into Figure 6.5.1, the 4-stage RP shown in Figure 6.5.6 is
obtained for the 5/3 external architecture. Table B.18 shows the dataflow of the 5/3
architecture. In this dataflow table, the first location of TLB1, for example, contains
coefficient Y0(1) and the second location contains Y1(1), followed by Y2(1) in the
third location, and so on. In the first run, the TLBs are only written, whereas starting from
the second run, the TLBs are read and written in the same clock cycle. For instance, in
the second run at cycle 30, Table B.18 shows that the first location of TLB1 is read
Figure 6.5.6 Modified inverse 5/3 RP datapath architecture with symmetric extension
On the other hand, according to the 9/7 DDGs, applying the scan method shown in
Figure 6.3.4 would require the addition of three TLBs, each of size N, in the datapath
architecture shown in Figure 6.5.2. The first TLB is needed because all coefficients
calculated in stage 2 of Figure 6.5.2 in a run are required in stage 4 in the next run. The
second TLB is needed for storing N coefficients calculated in stage 4 in a run, which
are required in the calculations that take place in stage 5 in the next run. The third
TLB is necessary to keep N coefficients calculated in stage 5 in a run, which are
required in the stage 7 calculations in the next run. When these changes are
incorporated into Figure 6.5.2, the 9-stage RP shown in Figure 6.5.7 is obtained for the
9/7 external architecture.
Figure 6.5.7 Modified RP for 9/7 and combined 5/3 and 9/7 datapath architecture
(ETLB: enable TLB; incar: increment AR; clar: clear AR)
The registers labeled R0 and R1 in stage 3 of Figure 6.5.7 are added because the
scan method for 9/7 illustrated in Figure 6.3.4 requires, in the first run, for example,
storing the second input coefficient of both rows 0 and 1 in Figure 6.3.4, labeled H0,0
and H1,0, since these two coefficients are required in the second operation of rows 0
and 1, respectively. Registers R0 and R1 in stage 4, in turn, are added to store, in the
first run, the first two coefficients computed in stage 3 for each two rows using the
first two input coefficients of each row, since they are required in the two successive
computations that take place in stage 5. Note that the control signal s of the two
multiplexers labeled mux in stages 3 and 4 of Figure 6.5.7 is set to 1 in the first run to
pass the coefficients stored in R0 and R1, and to 0 in all other runs to pass the coefficients stored
in TLB1 and TLB2.
The details of the 9/7 architecture dataflow from the RP side are given in Table B.19
(b). This table shows that in the first run each two outputs are followed by two empty
cycles. The reason can be seen by looking at column 5 (stage 5) in
the dataflow Table B.19 (b), which shows that in clock cycles 22 and 23 no data are
passed to stage 5 from stage 4. The same holds in clock cycles 26 and 27, and so on. This
is mainly a consequence of the scan method adopted in the first run, which forces
stage 5 to wait each time for two successive coefficients calculated in stage 3 before it
can proceed. However, in all subsequent runs, the 9/7 architecture yields a pair
of outputs every clock cycle.
It is very important to note that when the RP executes its last set of input
coefficients, according to the 9/7 DDGs for odd and even signals shown in Figure 6.2.2, it
will not yield all the output coefficients expected from the last run. For example,
in the DDG for odd length signals shown in Figure 6.2.2 (a), when the last input
coefficient labeled Y8 is applied to the RP, it will yield output coefficients X5 and X6. To
get the remaining two coefficients X7 and X8, the RP must execute another run,
which will be the last run.
Similarly, when the last two input coefficients labeled Y6 and Y7 in the DDG for
even length signals shown in Figure 6.2.2 (b) are applied to the 9/7 RP, it will yield output
coefficients X3 and X4. To obtain the remaining output coefficients X5, X6, and X7,
two more runs should be executed by the RP according to the DDG. The first run will
yield X5 and X6, whereas the last run will yield X7. The details of the computations
that take place during each of these runs can be determined by examining the specific
area of the DDGs.
Control signals of a pipelined processor, such as the signals of the pipelined 9/7 RP
shown in Figure 6.5.7, can be issued every clock cycle by a control unit. The control
signal values issued in each clock cycle are transferred to the first stage of the pipeline
and are loaded into control signal latches (CSTs), which are similar to the pipeline
latches, to carry these signal values from stage to stage. When a stage where a signal
(or signals) is used is reached, the signal value carried by its CST is applied, while the
remaining signals are carried to the next stage. For example, in Table 6.3, starting
from cycle 14, the control signal values for signals incar, clar, ETLB, se0, etc. for 4
cycles are derived. In cycle 14, the control signal values listed at cycle 14 in Table 6.3
would be loaded by the control unit into the CSTs of the first pipeline stage. Similarly, in
cycle 15, the control signal values listed at cycle 15 in the table would be transferred
to the CSTs of the first stage, while the control signal values issued in cycle 14 would be
transferred to the CSTs of the next stage, and so on.

Table 6.3 Control signal values for 9/7 RP

CK   incar   clar   ETLB   s   se0   se1   se2
14   0       0      0      1   0     0     0
15   0       1      0      1   0     0     0
16   1       0      1      1   0     1     0
17   1       0      1      1   0     1     0
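The way control words travel through the CSTs alongside the data can be illustrated with a shift-register model. The sketch below is hypothetical (the function name `propagate_controls` and the dictionary representation of a control word are assumptions); it uses the values of Table 6.3 as input and shows which control word each stage holds in every cycle.

```python
from collections import deque

# Control words of Table 6.3, one per clock cycle (14 to 17).
TABLE_6_3 = [
    {"ck": 14, "incar": 0, "clar": 0, "ETLB": 0, "s": 1, "se0": 0, "se1": 0, "se2": 0},
    {"ck": 15, "incar": 0, "clar": 1, "ETLB": 0, "s": 1, "se0": 0, "se1": 0, "se2": 0},
    {"ck": 16, "incar": 1, "clar": 0, "ETLB": 1, "s": 1, "se0": 0, "se1": 1, "se2": 0},
    {"ck": 17, "incar": 1, "clar": 0, "ETLB": 1, "s": 1, "se0": 0, "se1": 1, "se2": 0},
]


def propagate_controls(control_words, num_stages):
    """Return, for every clock cycle, the control word held by each stage's CST."""
    csts = deque([None] * num_stages, maxlen=num_stages)
    history = []
    for word in control_words:
        # The new word enters stage 1; older words move one stage down and the
        # word leaving the last stage is discarded.
        csts.appendleft(word)
        history.append(list(csts))
    return history


history = propagate_controls(TABLE_6_3, num_stages=3)
for cycle_state in history:
    print([w["ck"] if w else None for w in cycle_state])
```

Each printed row lists, per stage, the cycle in which the control word currently held by that stage was issued; a stage applies only the signals it uses and forwards the rest.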
In addition, observe that if registers R0 and R1 in stages 3 and 4 are
eliminated, the RP for 9/7 from stages 2 to 5 and from stages 6 to 9 is similar in structure to
the 4-stage 5/3 RP shown in Figure 6.5.6. This implies that the RP for 9/7 can be
easily modified to work as an RP for the combined 5/3 and 9/7 external architecture.
In the combined architecture, signal s of the two multiplexers labeled mux in
stages 3 and 4 of Figure 6.5.7 is set to 0 if the architecture is to perform 5/3; otherwise, it is
set to 1 in the first run and 0 in all other runs if the architecture is to perform 9/7.
Moreover, the multiplexer labeled muxco in stage 5 is only needed in the combined
5/3 and 9/7 architecture; otherwise, it can be eliminated and the Rt2 output can be
connected directly to the input of Rt0 of the next stage. Thus, in the combined
architecture, signal sco of muxco is set to 0 if the architecture has to perform 5/3;
otherwise, it is set to 1 if the architecture has to perform 9/7.
Note that the TLBs in Figures 6.5.6 and 6.5.7 are required to be read and written
in the same clock cycle. Therefore, signal R/W is connected to clock f/2 so that the TLB
can be read in the first half cycle and written in the second half cycle. The register
labeled TLBAR (TLB address register) generates addresses for the TLB. Initially,
TLBAR is cleared to zero to point at the first location. Then, to address the next
location, register TLBAR is incremented by one after each read and write.
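The read-then-write behavior of a TLB and its address register can be captured in a short behavioral model. This is a sketch under stated assumptions rather than the actual design: the class name `TLB` and its method names are hypothetical, and clearing TLBAR at the start of every run is assumed from the presence of the clar signal.

```python
class TLB:
    """Behavioral model of a temporary line buffer with address register TLBAR."""

    def __init__(self, size):
        self.mem = [None] * size   # N storage locations
        self.ar = 0                # TLBAR, initially pointing at the first location

    def clear_ar(self):
        """Model of signal clar: point TLBAR back at the first location."""
        self.ar = 0

    def cycle(self, write_value):
        """One clock cycle: read in the first half, write in the second half."""
        read_value = self.mem[self.ar]     # first half cycle: read old content
        self.mem[self.ar] = write_value    # second half cycle: write new content
        self.ar += 1                       # model of signal incar
        return read_value


tlb = TLB(size=3)
run1_reads = [tlb.cycle(v) for v in ("a0", "a1", "a2")]   # first run: only writes matter
tlb.clear_ar()
run2_reads = [tlb.cycle(v) for v in ("b0", "b1", "b2")]   # second run: read and write
print(run2_reads)
```

In the second run each cycle returns the value stored at the same location during the first run while overwriting it with the value needed by the third run.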
6.6 Performance Evaluation 
Suppose $t_m$ and $t_p$ are the critical path delays of the external memory and the
non-pipelined processor architecture, respectively, and $l$ is the number of input coefficients
scanned from external memory for each operation; $l = 2$ for both inverse 5/3 and 9/7.
Then the scan clock period $\tau$, and hence the scan frequency $f$ of the proposed
architecture, can be determined by the following algorithm.
Statement 4
case 1: If $t_m \ge t_p / k$ then
            $\tau = t_m$
case 2: Else if $t_p / (l \cdot k) \ge t_m$ then
            $\tau = t_p / (l \cdot k)$
        else $\tau = t_m$
In the algorithm above either case 1 or case 2 can be true. Case 2 implies the
availability of a very high speed scan that can scan the two pixels required for an
operation within the time limit given by $t_p/k$. If that is the case, the hardware
utilization of the architecture shown in Figure 6.4.1 with its processors pipelined
is 100% and the architecture is complete. Now, suppose $\tau_1$ and $\tau_2$ denote the scan
clock periods of the architecture before and after pipelining, respectively. Then
$\tau_1 = t_p / l$.
And from Statement 4, case 2,
$\tau_2 = t_p / (l \cdot k)$.
The speedup factor S is then given by
$S = \tau_1 / \tau_2 = \tau_1 / (\tau_1 / k) = k$
The efficiency E of a k-stage pipeline is defined as
$E = S / k = k / k = 1$
Thus, the architecture with pipelined processors is k times faster than the architecture
with non-pipelined processors, with efficiency 1.
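The scan-period selection and the resulting speedup can be checked numerically. The sketch below reads Statement 4 as selecting $\tau = t_m$ when $t_m \ge t_p/k$, $\tau = t_p/(l \cdot k)$ when $t_p/(l \cdot k) \ge t_m$, and $\tau = t_m$ otherwise; the function names and the sample delay values are hypothetical.

```python
def scan_period(t_m, t_p, k, l=2):
    """Select the scan clock period from memory delay t_m, processor delay t_p,
    number of pipeline stages k and coefficients per operation l (Statement 4)."""
    if t_m >= t_p / k:              # case 1: slow external memory
        return t_m
    elif t_p / (l * k) >= t_m:      # case 2: memory fast enough for sequential scan
        return t_p / (l * k)
    else:
        return t_m


def speedup_and_efficiency(t_p, k, l=2):
    """Speedup S = tau_1 / tau_2 and efficiency E = S / k for case 2."""
    tau_1 = t_p / l                 # scan period before pipelining
    tau_2 = t_p / (l * k)           # scan period after pipelining
    s = tau_1 / tau_2
    return s, s / k


# Hypothetical delays in nanoseconds, chosen only for illustration.
print(scan_period(t_m=1.0, t_p=40.0, k=4))    # case 2 applies
print(scan_period(t_m=12.0, t_p=40.0, k=4))   # case 1 applies
print(speedup_and_efficiency(t_p=40.0, k=4))
```

With these sample values the pipelined architecture in case 2 scans with a period of 5 ns and reaches a speedup equal to the number of stages.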
On the other hand, case 1 implies a low scanning frequency. That means the time
required to scan the two pixels for an operation will be at least $2t_p/k$ seconds, or two
clock cycles, where $t_p/k$ is the stage critical path delay of the pipelined processor. In
that case, the proposed architecture would not only be slow but would also be
underutilized half of the time, since every 2 clock cycles would yield one output. To
remedy this problem, the proposed architecture can be allowed to read from external
memory the 2 coefficients required for an operation in parallel every clock cycle,
instead of one coefficient at a time, if the frequency of the pipelined architecture and
the external memory scan frequency are made equal. This would require two buses
instead of one to scan the external memory in the parallel scan architecture.
If the clock period $\tau_3$ for both the external memory and the pipelined architecture is
made equal to $t_p/k$, then the speedup factor S of the pipelined parallel scan architecture
as compared with the non-pipelined architecture is given by
$S = t_p / \tau_3 = t_p / (t_p / k) = k$    (6.5)
The efficiency is $E = S/k = 1$.
That is, the parallel scan architecture is k times faster than the non-pipelined architecture,
with efficiency 1.
On the other hand, to compare the power consumption of the pipelined parallel
and sequential scan architectures, consider the following. First, since both the pipelined
parallel and sequential scan architectures operate with frequency $k/t_p$ and are equal in
capacitance, they consume the same power. Second, the external memory
power consumption in the pipelined parallel scan architecture, $P_m(\text{pipe})_{par}$, and that in
the pipelined sequential scan architecture, $P_m(\text{pipe})_{seq}$, can be determined as follows. If
the power consumption of a VLSI architecture can be estimated as
$$P = C_{total} \cdot V_0^2 \cdot f \tag{6.6}$$
where $C_{total}$ denotes the total capacitance of the architecture, $V_0$ is the supply voltage,
and $f$ is the clock frequency, then
$$P_m(\text{pipe})_{seq} = C^m_{total} \cdot V_0^2 \cdot f_2 = C^m_{total} \cdot V_0^2 \cdot 1/\tau_2 = C^m_{total} \cdot V_0^2 \cdot k/t_p \tag{6.8}$$
where $C^m_{total}$ is the total capacitance of the external memory.
Based on the above evaluations, it can be concluded that both the pipelined parallel
and sequential scan architectures achieve the same performance in terms of speedup
and efficiency, and they consume the same power.
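The power estimate P = C_total · V0² · f used in this comparison can be sketched in a few lines of Python. The helper names and the numeric values below are hypothetical; the only relations taken from the text are the form of Eq. (6.6) and the fact that both scan schemes run the external memory at frequency k/t_p.

```python
def dynamic_power(c_total: float, v0: float, f: float) -> float:
    """Estimate power as P = C_total * V0^2 * f, following Eq. (6.6)."""
    return c_total * v0 ** 2 * f


def memory_power(c_mem: float, v0: float, t_p: float, k: int) -> float:
    """External memory power when the memory is scanned at frequency k / t_p."""
    return dynamic_power(c_mem, v0, k / t_p)


if __name__ == "__main__":
    # Hypothetical values: same memory capacitance, voltage, t_p and k for both schemes.
    p_par = memory_power(c_mem=1.0, v0=2.0, t_p=4.0, k=8)
    p_seq = memory_power(c_mem=1.0, v0=2.0, t_p=4.0, k=8)
    print(p_par == p_seq)
```

Because both schemes share the same capacitance, supply voltage, and scan frequency in this model, the two estimates coincide, which is the conclusion drawn in the text.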
6.7 Parallel Architecture Development
In order to best meet the demanding requirements of real-time 2-D DWT applications,
parallelism is explored in this section. The single pipelined architecture
developed in the previous sections will be extended to 2- and 4-parallel pipelined
architectures to achieve speedup factors of 2 and 4, respectively. First, the 2-parallel
pipelined architecture for 5/3 and 9/7 will be developed, followed by the 4-parallel
pipelined architecture.
6.7.1 Proposed 2-parallel external architecture
Based on the scan methods and the DDGs for 5/3 and 9/7, the 2-parallel external
architecture shown in Fig. 6.7.1 (a) is proposed for 5/3 and 9/7 and combined 5/3 and
9/7 for 2-D IDWT. The architecture consists of two k-stage pipelined column-
processors labeled CP1 and CP2 and two k-stage pipelined row-processors labeled
RP1 and RP2. The waveforms of the two clocks $f_2$ and $f_2/2$ that are used in the
architecture are shown in Fig. 6.7.1 (b). The clock frequency $f_2$ is determined from
statement 3 as
$$f_2 = 2k/t_p \tag{6.9}$$
The architecture scans the external memory with frequency $f_2$ and operates with
frequency $f_2/2$. Each clock cycle, two new coefficients are scanned from external
memory through the two buses labeled bus0 and bus1. The two new coefficients are
loaded into the latches Rt0 and Rt1 of CP1 or CP2 every time clock $f_2/2$ makes a negative
or a positive transition, respectively. On the other hand, the latches Rt0 and Rt1 of both
RP1 and RP2 simultaneously load new data from the output latches of CP1 and CP2 each time
clock $f_2/2$ makes a negative transition.
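The latch-loading rule just described can be illustrated with a small Python sketch. It assumes that the half-rate clock toggles once per cycle of the scan clock and that its first transition is a negative one; the function name and this phase convention are assumptions made only for illustration.

```python
def loads_on_f2_cycle(cycle: int) -> list:
    """Processors that load their input latches in a given scan-clock cycle.

    Assumption: the half-rate clock toggles once per scan-clock cycle and
    its first transition (cycle 0) is a negative one.
    """
    negative_transition = (cycle % 2 == 0)
    if negative_transition:
        # CP1 latches a new pair from bus0/bus1; RP1 and RP2 latch CP outputs.
        return ["CP1", "RP1", "RP2"]
    # Positive transition: only CP2 latches a new pair from the buses.
    return ["CP2"]


if __name__ == "__main__":
    for c in range(4):
        print(c, loads_on_f2_cycle(c))
```

In this model the two column processors alternate on the shared buses, so each one receives a new pair of coefficients every other scan-clock cycle, while the row processors load together once per period of the half-rate clock.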
Figure 6.7.1 (a) Proposed 2-parallel pipelined external architecture for 5/3 and 9/7 and
combined 5/3 and 9/7 for 2-D IDWT (b) Waveforms of the clocks

The dataflow for the 5/3 2-parallel architecture is shown in Table B.20, where the CPs
and RPs are assumed to be 4-stage pipelined processors. This 5/3 dataflow table is
derived based on the 9/7 scan methods shown in Figs. 6.3.2 and 6.3.4 instead of the 5/3
scan method shown in Fig. 6.3.1. The reason is to show that the 9/7 scan methods can be
used for 5/3 as well. In addition, a unified scan method for both 9/7 and 5/3 makes
their control algorithms identical, which is advantageous, especially in the combined 5/3
and 9/7 architecture. The dataflow for the 9/7 2-parallel architecture is similar, in all runs,
to the 5/3 dataflow except in the first run, where RP1 and RP2 of the 9/7 architecture
would each generate one output coefficient every other clock cycle, with reference to
clock $f_2/2$. The reason is that the first 4 coefficients of each row processed in the first
run by either RP1 or RP2 of the 9/7 would require, according to the DDGs, two
successive low coefficients from the first level of the DDGs, labeled Y"(2n), in order to
carry out the node 1 computations in the second level, labeled Y'(2n+1). In Table B.20, the
output coefficients in Rt0 of both RP1 and RP2 at cycles 19, 23, 27, and so on
represent the output coefficients of the 9/7 in the first run.
The strategy adopted for scheduling memory columns for CP1 and CP2 of the 5/3
and 9/7 2-parallel architectures, which are scanned according to the scan method
shown in Figure 6.3.2, is as follows. In the first run, both the 5/3 and 9/7 2-parallel
architectures are scheduled to execute 4 columns of memory, two from each of (A)
and (B) of Figure 6.3.2. The first two columns of Fig. 6.3.2 (A) are executed in an
interleaved fashion by CP1, while the first two columns of Fig. 6.3.2 (B) are executed
by CP2, also in an interleaved fashion, as shown in the dataflow Table B.20. In all
subsequent runs, 2 columns are scheduled for execution at a time. Each time, one
column from (A) of Fig. 6.3.2 will be scheduled for execution by CP1, while another
from (B) will be scheduled for CP2. However, if the numbers of columns in (A) and (B) of
Fig. 6.3.2 are not equal, then the last run will consist of only one column of (A). In that
case, the last column is scheduled on CP1 only, but its output coefficients will be executed
by both RP1 and RP2. The reason is that if the last column were scheduled for execution
by both CP1 and CP2, they would yield more coefficients than can be handled by
both RP1 and RP2.
On the other hand, scheduling of RP1 and RP2 of the 5/3 and 9/7 2-parallel architectures
occurs according to the scan method shown in Fig. 6.3.4. In this scheduling strategy, all
even-numbered and odd-numbered rows in Fig. 6.3.4 will be scheduled for execution by RP1
and RP2, respectively. In the first run, 4 coefficients from each of 2 consecutive rows
will be scheduled for RP1 and RP2, whereas in all subsequent runs, two coefficients
of each of 2 consecutive rows will be scheduled for RP1 and RP2, as shown in Figure
6.3.4. However, if the number of columns in Figure 6.3.4 is odd, which occurs when
the numbers of columns in (A) and (B) of Fig. 6.3.2 are not equal, then the last run would
require scheduling one coefficient of each of 2 successive rows to RP1 and RP2.
In general, all coefficients that belong to even-numbered columns in Fig. 6.3.4 will be
generated by CP1 and all coefficients that belong to odd-numbered columns will be
generated by CP2. For example, in run 1, first, CP1 will generate two coefficients
labeled L0,0 and L1,0 that belong to locations 0,0 and 1,0 in Fig. 6.3.4, while CP2
will generate coefficients H0,0 and H1,0 that belong to locations 0,1 and 1,1. Then
the coefficients in locations 0,0 and 0,1 are executed by RP1, while the coefficients in
locations 1,0 and 1,1 are executed by RP2. Second, CP1 will generate two coefficients
for locations 0,2 and 1,2, while CP2 will generate two coefficients for locations 0,3
and 1,3. Then the coefficients in locations 0,2 and 0,3 are executed by RP1, while
the coefficients in locations 1,2 and 1,3 are executed by RP2. The same process is
repeated for the next two rows, and so on.
In the second run, first, CP1 generates coefficients for locations 0,4 and 1,4,
whereas CP2 generates coefficients for locations 0,5 and 1,5 in Fig. 6.3.4. Then
the coefficients in locations 0,4 and 0,5 are executed by RP1, while the coefficients in
locations 1,4 and 1,5 are executed by RP2. This process is repeated until the run
completes. However, in the event that the last run processes only one column of (A),
CP1 would first generate the coefficients of locations 0,m and 1,m, where m refers to the
last column. Then the coefficient of location 0,m is passed to RP1, while the coefficient of
location 1,m is passed to RP2. Next, CP1 would generate the coefficients of
locations 2,m and 3,m. Then 2,m is passed to RP1 and 3,m to RP2, and so on.
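The scheduling rules above (even columns generated by CP1, odd columns by CP2, even rows executed by RP1, odd rows by RP2, four columns in the first run and two in each later run) can be summarized in a short Python sketch. The function names and the zero-based column indexing are assumptions chosen to match the location labels used in the examples.

```python
def assign_processors(row: int, col: int) -> tuple:
    """Return (generating CP, executing RP) for location (row, col)."""
    cp = "CP1" if col % 2 == 0 else "CP2"   # even columns -> CP1, odd -> CP2
    rp = "RP1" if row % 2 == 0 else "RP2"   # even rows -> RP1, odd -> RP2
    return cp, rp


def columns_in_run(run: int, num_cols: int) -> list:
    """Columns covered by a run (runs are numbered from 1)."""
    if run == 1:
        cols = range(0, 4)                  # first run: 4 columns
    else:
        start = 2 * run                     # run 2 -> columns 4 and 5, etc.
        cols = range(start, start + 2)
    # With an odd column count, the last run keeps only one column.
    return [c for c in cols if c < num_cols]


if __name__ == "__main__":
    print(assign_processors(0, 1))          # location 0,1
    print(columns_in_run(3, 7))             # last run of a 7-column layout
```

With an odd number of columns the last column has an even index, so the sketch assigns it to CP1 alone, which agrees with the special case described for the last run.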
6.7.2 Modified CPs and RPs for 5/3 and 9/7 2-parallel external architecture
Each CP of the 2-parallel external architecture is required to execute two columns in
an interleaved fashion in the first run and one column in all other runs. Therefore, Fig.
6.5.1 should be modified as shown in Fig. 6.7.2 by adding one more stage between
stages 2 and 3 for the 5/3 2-parallel external architecture to allow interleaving of two
columns, as described in the dataflow Table B.20. Through the two multiplexers
labeled mux, the processor switches between executing 2 columns and one column.
Thus, in the first run, the control signal of the two multiplexers, labeled s, is set to 1 to allow
interleaved execution, and to 0 in all other runs. The modified 9-stage CP for the 9/7 2-
parallel external architecture can be obtained by cascading two copies of Figure 6.7.2.
Figure 6.7.2 Modified inverse 5/3 CP for 2-parallel external architecture

On the other hand, RP1 and RP2 of the proposed 2-parallel architecture for 5/3
and 9/7 are required to scan the coefficients of the H and L decompositions generated
by CP1 and CP2 according to the scan method shown in Fig. 6.3.4. In this scan
method, all even-numbered rows are executed by RP1 and all odd-numbered rows
are executed by RP2. That is, while RP1 is executing row 0 coefficients, RP2 will be
executing row 1 coefficients, and so on. In addition, looking at the DDGs for 5/3 and
9/7, one can immediately observe that applying the scan method shown in Fig.
6.3.4 would require inclusion of temporary line buffers (TLBs) in RP1 and RP2 of the
proposed 2-parallel external architecture, as follows. In the first run, the fourth input
coefficient of each row in the DDGs, and the output coefficients labeled X(2) in the
5/3 DDGs and those labeled Y"(2), Y"(1), and X(0) in the 9/7 DDGs, generated by
considering 4 input coefficients in each row, should be stored in TLBs, since they
are required in the next run's computations. Similarly, in the second run, the sixth
input coefficient of each row, and the output coefficients labeled X(4) in the 5/3 DDGs
and those labeled Y"(4), Y"(3), and X(2) in the 9/7 DDGs, generated by considering 2
input coefficients in each row, should be stored in TLBs. Accordingly, 5/3 would
require the addition of 2 TLBs, each of size N, whereas 9/7 would require the addition of 4
TLBs, each of size N. However, since the 2-parallel architecture consists of two RPs, each
5/3 RP will have 2 TLBs, each of size N/2, and each 9/7 RP will have 4 TLBs, each of
size N/2, as shown in Fig. 6.7.3. Figure 6.7.3 (a) represents the 5/3 modified RP,
while (a) and (b) together represent the 9/7 modified RP for the 2-parallel architecture.
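The buffer budget described above can be tabulated with a small Python sketch. The TLB counts (2 for 5/3, 4 for 9/7) and the per-RP size N divided by the number of RPs follow the text; the function name, the dictionary layout, and the assumption that N is divisible by the number of RPs are illustrative only.

```python
TLBS_PER_FILTER = {"5/3": 2, "9/7": 4}   # TLB counts stated in the text


def tlb_plan(filter_name: str, n: int, num_rps: int) -> dict:
    """Per-RP TLB count and size, plus the total buffer size.

    Assumes n (the row width N) is divisible by num_rps.
    """
    count = TLBS_PER_FILTER[filter_name]
    size_each = n // num_rps             # each RP holds 1/num_rps of every line
    return {
        "tlbs_per_rp": count,
        "size_each": size_each,
        "total": count * size_each * num_rps,
    }


if __name__ == "__main__":
    print(tlb_plan("5/3", 512, 2))
    print(tlb_plan("9/7", 512, 2))
```

The total stays at 2N for 5/3 and 4N for 9/7 regardless of the number of RPs in this model, since the line buffers are split among the row processors rather than duplicated.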
Figure 6.7.3 Modified RP for 2-parallel architecture (a) 5/3 (a, b) 9/7
(ETLB: enable TLB; incAR: increment AR; clAR: clear AR)

To give more insight into the operation of the two RPs, the dataflow for the 5/3 RP1 is
given in Table 6.4 for the first and second runs. Note that the stage 1 input coefficients in
Table 6.4 are exactly the same as the input coefficients of RP1 in Table B.20. In the first
run, TLBs are only written, but in the second run and in all subsequent runs, TLBs are
read in the first half cycle and written in the second half cycle. In cycle 15, Table
6.4 shows that coefficient H0,1 is stored in the first location of TLB1, while
coefficient H2,1 is stored in the second location in cycle 19, and so on. Run 2 starts at
cycle 27. In cycle 28, the first location of TLB1, which contains coefficient H0,1, is
read during the first half cycle and is loaded into Rd1 by the positive transition of the
cycle, whereas coefficient H0,2 is written into the same location in the second half
cycle. Then, the negative transition of clock cycle I 0 transfers contents of Rdl to Rt2 
in stage 2. 
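The read-then-write access pattern of a TLB location within one clock cycle can be modeled with the following Python sketch. The class and method names are hypothetical; the only behavior taken from the text is that a location is read in the first half cycle and overwritten in the second half cycle, and that in the first run the buffers are only written.

```python
class TemporaryLineBuffer:
    """Toy model of a TLB whose location is read, then written, in one cycle."""

    def __init__(self, size: int):
        self.cells = [None] * size

    def access(self, addr: int, new_value):
        """Return the old content (first half cycle), then store new_value
        in the same location (second half cycle)."""
        old_value = self.cells[addr]    # first half cycle: read
        self.cells[addr] = new_value    # second half cycle: write
        return old_value


if __name__ == "__main__":
    tlb1 = TemporaryLineBuffer(size=4)
    tlb1.access(0, "H0,1")              # run 1: write only, old value ignored
    read_back = tlb1.access(0, "H0,2")  # run 2: read H0,1, then write H0,2
    print(read_back, tlb1.cells[0])
```

The value returned in run 2 plays the role of the coefficient loaded into Rd1, while the newly stored coefficient is kept for the following run.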
In Figure 6.7.3, the control signal, s, of the two multiplexers labeled mux is set to 1
during run 1 to pass R0 of both stages 2 and 3, whereas in all other runs, it is set to 0 to
Table 6.4 Dataflow of the 5/3 RP1
CK RP1 input latches RP1 output
f2 STAGE 1 STAGE 2 STAGE 3 STAGE 4 latches
Rt0 Rt1 TLB1 Rt0 Rt2 Rt1 R0 Rt0 Rt1 R0 TLB2 Rt0 Rt1 Rt2 Rt0 Rt1
II LO,O HO,O 
----- -----
13 LO,I HO,I LO,O ---- HO,O 
-----
-----
15 L2,0 H2,0 HO,I LO,I ---- HO,I HO,O XO,O --- ---- ----- -----
- 17 L2,1 H2,1 L2,0 ---- H2,0 ----- X0,2 HO,O XO,O XO,O ---- ---- ----- -----5 19 L4,0 H4,0 H2,1 L2,1 ---- H2,1 H2,0 X2,0 ----- ----- X0,2 X0,2 HO,O XO,O XO,O -----
"' 21 L4,1 H4,1 L4,0 ---- H4,0 ----- X2,2 H2,0 X2,0 X2,0 ----- ----- X0,2 XO,I 
23 L6,0 H6,0 H4,1 L4,1 
----
H4,1 H4,0 X4,0 ----- ----- X2,2 X2,2 H2,0 X0,2 X2,0 
-----




X4,2 H4,0 X4,0 X4,0 ----- ------ X2,2 X2,1 
27 L0,2 H0,2 H6,1 L6, I ---- H6,1 H6,0 X6,0 -----
-----
X4,2 X4,2 H4,0 X4,0 X4,0 -----
29 L2,2 H2,2 H0,2 L0,2 HO, I H0,2 ----- X6,2 H6,0 X6,0 X6,0 
----- ------
X4,2 X4,1 
"' 31 L4,2 H4,2 H2,2 L2,2 H2, I H2,2 ----- X0,4 HO, I ----- X6,2 X6,2 H6,0 X6,0 X6,0 -----z 
:::J 33 L6,2 H6,2 H4,2 L4,2 H4,1 H4,2 
----




H6,2 L6,2 H6, I H6,2 ----- X4,4 H4, I ----- X2,4 X2,4 H2,1 X2,2 X0,4 X0,3 
37 ----
----- ----- ----- ------ -----
X6,4 H6, I ----- X4,4 X4,4 H4,1 X4,2 X2,4 X2,3 
39 
---- ----- ----- ----- ------ ---- ------ ----- -----
X6,4 X6,4 H6,1 X6,2 X4,4 X4,3 
41 ---- -----
----- ----- ------ ---- ------ ----- -----
---- ----- ------ X6,4 X6,3 
pass coefficients read from TLB1 and TLB2.
6.8 Proposed 4-parallel external architecture
To double the computation speed of the 2-parallel architecture,
the 2-parallel architecture is extended to a 4-parallel architecture as shown in Fig. 6.8.1
(a). This architecture is valid for 5/3, 9/7, and combined 5/3 and 9/7. It consists of 4 k-
stage pipelined CPs and 4 k-stage pipelined RPs. The waveforms of the 3 clocks $f_4$,
$f_{4a}$, and $f_{4b}$ used in the architecture are shown in Fig. 6.8.1 (b). The frequency
of clock $f_4$ is determined from statement 1 as
$$f_4 = 4k/t_p \tag{6.10}$$
The architecture scans the external memory with frequency $f_4$ and operates with
frequencies $f_{4a}$ and $f_{4b}$. Every time clock $f_{4a}$ makes a negative transition, CP1 loads into
its input latches Rt0 and Rt1 two new coefficients scanned from external memory
through the buses labeled bus0 and bus1, whereas CP3 loads every time clock $f_{4a}$
makes a positive transition. CP2 and CP4 load every time clock $f_{4b}$ makes a negative
and a positive transition, respectively. On the other hand, both RP1 and RP2
simultaneously load new data into their input latches Rt0 and Rt1 each time clock $f_{4a}$
makes a negative transition, whereas RP3 and RP4 load each time clock $f_{4b}$ makes a
negative transition.
Figure 6.8.1 (a) Proposed 2-D IDWT 4-parallel pipelined external architecture for 5/3
and 9/7 and combined 5/3 and 9/7 (b) Waveforms of the clocks
The dataflow for the 4-parallel 5/3 external architecture is given in Table B.21, where
the CPs and RPs are assumed to be 3- and 4-stage pipelined processors, respectively. The
dataflow table for the 4-parallel 9/7 external architecture is similar in all runs to the 5/3
dataflow except in the first run, where the RPs of the 9/7 architecture, specifically RP3
and RP4, generate a pattern of output coefficients different from that of the 5/3. RP3
and RP4 of the 9/7 architecture generate two output coefficients every clock cycle, with
reference to clock $f_{4b}$, as follows. Suppose that, at cycle number n, the first two coefficients
X(0,0) and X(1,0) generated by RP3 and RP4, respectively, are loaded into output
latch Rt0 of both processors. Then, in cycle n+1, RP3 and RP4 generate coefficients
X(2,0) and X(3,0), followed by coefficients X(4,0) and X(5,0) in cycle n+2, and so on.
Note that these output coefficients are the coefficients generated by both RP1 and
RP2 in Table B.21.
The strategy used for scheduling memory columns for the CPs of the 5/3 and 9/7 4-
parallel architectures, which resembles the one adopted for the 2-parallel architecture, is as
follows. In the first run, both the 5/3 and 9/7 4-parallel architectures will be scheduled to
execute 4 columns of memory, two from (A) and the other two from (B) of Fig.
6.3.2. Each CP will be assigned to execute one column of memory coefficients, as
illustrated in the first run of the dataflow shown in Table B.21, whereas in all
subsequent runs, 2 columns at a time will be scheduled for execution by the 4 CPs.
One column from Fig. 6.3.2 (A) will be assigned to both CP1 and CP3, while the
other from Fig. 6.3.2 (B) will be assigned to both CP2 and CP4, as shown in the
second run of Table B.21. However, if the numbers of columns in (A) and (B) of Fig. 6.3.2
are not equal, then the last run will consist of only one column of (A). In that case,
the last column's coefficients are scheduled on both CP1 and CP3, as shown in the third run
of Table B.21, since an attempt to execute the last column using 4 CPs would result
in more output coefficients being generated than can be handled by the 4 RPs.
On the other hand, the scheduling of row coefficients for the RPs, which takes place
according to the scan method shown in Fig. 6.3.4, can be understood by examining the
dataflow shown in Table B.21. In cycles 17 and 18, the coefficients of the first two rows are
scheduled for the RPs as shown in Table B.21, while the CPs generate the coefficients of the next
two rows, row 2 and row 3. Table B.21 shows that the first 4 coefficients of row 0 are
scheduled for execution by RP1 and RP3, while the first 4 coefficients of row 1 are
scheduled for RP2 and RP4. In addition, note that all coefficients generated by CP4,
which belong to column 3 in Fig. 6.3.4, are required in the second run's computations,
according to the DDGs. Therefore, this would require inclusion of a TLB of size N/4
in each of the 4 RPs to store these coefficients. The second run, however, requires
these coefficients to be stored in the 4 TLBs as follows. Coefficients H0,1 and H1,1
generated by CP4 in cycle 16 should be stored in the first location of the TLB of RP1 and
RP2, respectively. These two coefficients would be passed to their respective TLBs
through the input latches of RP1 and RP2 labeled Rt2, as shown in cycle 17 of Table
B.21. Coefficients H2,1 and H3,1, generated by CP4 at cycle 20, should be
stored in the first location of the TLB of RP3 and RP4, respectively. These two
coefficients are passed to their respective TLBs through the input latches of RP3 and
RP4 labeled Rt1, as shown in cycle 22 of Table B.21. Similarly, coefficients H4,1 and
H5,1 generated by CP4 at cycle 24 should be stored in the second location of the TLB of
RP1 and RP2, respectively, and so on. These TLBs are labeled TLB1 in Fig. 6.8.1 (a).
6.8.1 Column and row processors for 5/3 and 9/7 4-parallel external architecture 
The 5/3 and the 9/7 processors datapath architectures shown in Figs. 6.5.1 and 6.5.2 
were developed assuming the processors scan external memory either row by row 
or column by column. However, CPs and RPs of the 4-parallel architecture are 
required to scan external memory according to scan methods shown in Figs. 6.3.2 and 
6.3.4, respectively. The 4-parallel architecture, in addition, introduces the requirement 
for communications among the processors in order to accomplish their task. 
Therefore, the processors datapath architectures shown in Figs. 6.5.1 and 6.5.2 should 
be modified according to the scan methods and the communications requirements so 
that they fit into the 4-parallel's processors. Thus, in the following, the modified 4 
CPs will be developed first followed by the 4 RPs. 
6.8.2 Modified CPs for 4-parallel architecture 
Each of the 4 CPs of the 4-parallel architecture is required in the first run to execute
one column at a time. That means the first run requires no modifications of the 5/3
and 9/7 datapath architectures shown in Figs. 6.5.1 and 6.5.2. However, in all
subsequent runs, each pair of processors (CP1 and CP3, or CP2 and CP4) is assigned to
execute one column together, which requires interactions between the two processors
to accomplish the required task. Therefore, CPs 1 and 3, and similarly CPs 2 and 4,
should be modified as shown in Fig. 6.8.2 to allow communications. The two
processors communicate or interact through the paths (buses) labeled P1, P2, P3, and
P4. Fig. 6.8.2 shows the modified 5/3 CPs 1 and 3, which are identical to CPs 2 and 4. Fig.
6.8.2 also represents the first 3 stages of the 9/7 CPs 1 and 3 (and 9/7 CPs 2 and 4);
the remaining stages are identical to stages 1 to 3. Note that since the first 3 stages of
5/3 and 9/7 are similar in structure, the 5/3 processor can be easily incorporated into
the 9/7 processor to obtain the combined 5/3 and 9/7 processor for the 4-parallel architecture.









Figure 6.8.2 Modified 5/3 CPs 1 & 3 for 4-parallel architecture 
allow each processor to execute one column and 1 in all other runs to allow execution
of one column by two processors.
6.8.3 Modified RPs for 4-parallel architecture
In section 6.7.2, the reasons for including TLBs in the two RPs
of the 2-parallel architecture were pointed out. For the same reasons, it is also necessary to include
TLBs in the 4 RPs of the 4-parallel architecture, as shown in Figures 6.8.3 (a) and
(a,b) for 5/3 and 9/7, respectively. The processor datapaths of RP1 and RP3,
which are also identical to the processor datapaths of RP2 and RP4, are drawn









Figure 6.8.3 (a, b) Modified 9/7 RPs 1 and 3 for 4-parallel external architecture
both processors are required to execute together the first 4 coefficients of each row.
This implies interactions between the two processors during the computations,
which take place through the paths (buses) labeled P1, P2, P3, and P4. However, in all
subsequent runs, according to the scan method shown in Fig. 6.3.4, each RP will be
scheduled to execute two coefficients of a row at a time, as shown in cycles 37 and 38
of Table B.21. The advantage of this organization is that the total size of the TLBs
does not increase from that of the single pipelined architecture when it is extended to
2- and 4-parallel architectures.
In the first run, all TLBs in Fig. 6.8.3 will be written only, whereas in all other
runs, the same location of a TLB will be read in the first half cycle and written
in the second half cycle with respect to clock $f_{4a}$ or $f_{4b}$.
The control signal, s, of the six multiplexers labeled mux in Fig. 6.8.3 is set to 0 in
the first run to allow, in RP1, the coefficient coming through path 0 of each
multiplexer to be stored in its respective TLB, whereas in RP3, it allows the contents
of Rt2 and Rd1 in stages 1 and 3, respectively, to be passed to the next stage. In all
subsequent runs, s is set to 1 to pass coefficients read from the TLBs to the next stage.
Note that during run 2 all RPs execute independently with no interactions
among them. In addition, in the first run, if the first coefficient generated by stage 2 of
RP3 is stored in TLB2 of RP1, then the second coefficient should be stored in TLB2
of RP3, and so on. TLB1, TLB3, and TLB4 of both RP1 and RP3 are
handled similarly. Furthermore, during the whole period of run 1, the control signals of the
three extension multiplexers labeled muxe0, muxe1, and muxe2 in RP1 should be set to
0, according to Table 6.2, whereas those in RP3 should be set to normal, as shown in the
second line of Table 6.2, since RP3 will execute normal computations during the
period. However, in the second run and in all subsequent runs except the last run, the
extension multiplexers' control signals in all RPs are set to normal. Moreover, the
multiplexer labeled muxco in stage 4 is only needed in the combined 5/3 and 9/7
architecture; otherwise, it can be eliminated and the Rt2 output can be connected directly
to the Rt0 input of the next stage in the case of 9/7, whereas in 5/3, Rt0 is connected directly
to the output latch Rt0. In the combined architecture, signal sco of muxco is set to 0 if the
architecture is to perform 5/3 and to 1 if the architecture is to perform
9/7.
6.9 Performance evaluation
In order to evaluate the performance of the two proposed parallel pipelined architectures
in terms of speedup and throughput as compared with the single pipelined architecture,
consider the following. Assume subbands HH, HL, LH, and LL of each level are
equal in size. The dataflow for the single pipelined architecture shown in Table B.18
shows that $p_1 = 20$ clock cycles are needed to yield the first output. Then, the total
number of output coefficients in the first run of the $J$th level reconstruction can be
estimated with the help of Table B.18 as
$$N/2^{J-1} \tag{6.11}$$
and the total number of cycles in run 1 is given by
(6.12)
The total time, $T_1$, required to yield $n$ pairs of output coefficients for the $J$th level
reconstruction by the single pipelined architecture can be estimated as
$$T_1 = \left(p_1 + N/2^{J-1} + 2n\right) t_p/2k \tag{6.13}$$
On the other hand, the dataflow Table B.20 for the 2-parallel pipelined
architecture shows that $p_2 = 19$ clock cycles are needed to yield the first 2 output
coefficients. Then, the total number of paired output coefficients in the first run of
the $J$th level reconstruction can be estimated as
$$\tfrac{3}{2}\, N/2^{J-1} \tag{6.14}$$
The total number of 2-paired output coefficients is given by
$$\tfrac{3}{4}\, N/2^{J-1} \tag{6.15}$$
and the total number of cycles in run 1 is
$$2N/2^{J-1} \tag{6.16}$$
Note that the total number of paired output coefficients of the first run in each level of
reconstruction, starting from the first level, can be written as
$$\tfrac{3}{2}N,\ \tfrac{3}{2}N/2,\ \tfrac{3}{2}N/4,\ \ldots,\ \tfrac{3}{2}N/2^{J-1} \tag{6.17}$$
where the last term is Eq. (6.14).
The total time, $T_2$, required to yield $n$ pairs of output coefficients for the $J$th level
reconstruction of an NxM image on the 2-parallel architecture can be estimated as
$$T_2 = \left(p_2 + 2N/2^{J-1} + 2\left(n/2 - \tfrac{3}{4}N/2^{J-1}\right)\right)\tau_2 \tag{6.18}$$
$$T_2 = \left(p_2 + \tfrac{1}{2}N/2^{J-1} + n\right) t_p/2k \tag{6.19}$$
The term $2\left(n/2 - \tfrac{3}{4}N/2^{J-1}\right)$ in (6.18) represents the total number of cycles of run 2
and all subsequent runs.
The speedup factor, $S_2$, is then given by
$$S_2 = \frac{T_1}{T_2} = \frac{\left(p_1 + N/2^{J-1} + 2n\right) t_p/2k}{\left(p_2 + \tfrac{1}{2}N/2^{J-1} + n\right) t_p/2k} \tag{6.20}$$
For large $n$, the above equation reduces to
$$S_2 = \frac{2\left(\tfrac{1}{2}N/2^{J-1} + n\right)}{\tfrac{1}{2}N/2^{J-1} + n} = 2 \tag{6.21}$$
That means the 2-parallel architecture is 2 times faster than the single pipelined
architecture.
Similarly, the dataflow Table B.21 for the 4-parallel pipelined architecture shows
that $p_4 = 33$ clock cycles are needed to yield the first two output coefficients. In
addition, with the help of the dataflow table of the 4-parallel architecture, it can be
estimated that both RP1 and RP2, in the first run of each level of reconstruction, yield
$(N/2^{J-1})/2$ pairs of output coefficients, while both RP3 and RP4 yield $N/2^{J-1}$ pairs
of output coefficients, for a total of $\tfrac{3}{2}N/2^{J-1}$ pairs of output coefficients. The total
number of cycles in run 1 is then given by
$$4\left(N/2^{J-1}\right)/2 \tag{6.22}$$
Thus, the total time, $T_4$, required to yield $n$ pairs of output coefficients for the $J$th level
reconstruction of an NxM image on the 4-parallel architecture can be estimated as
$$T_4 = \left(p_4 + 2N/2^{J-1} + 2\left(n - \tfrac{3}{2}N/2^{J-1}\right)/2\right)\tau_4$$
$$T_4 = \left(p_4 + 2N/2^{J-1} + \left(n - \tfrac{3}{2}N/2^{J-1}\right)\right) t_p/4k$$
The term $\left(n - \tfrac{3}{2}N/2^{J-1}\right)$ represents the total cycles of run 2 and all subsequent runs.
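The speedup factor, $S_4$, is then given by
$$S_4 = \frac{T_1}{T_4} = \frac{\left(p_1 + N/2^{J-1} + 2n\right) t_p/2k}{\left(p_4 + \tfrac{1}{2}N/2^{J-1} + n\right) t_p/4k} \tag{6.26}$$
For large $n$, the above equation reduces to
$$S_4 = \frac{4\left(\tfrac{1}{2}N/2^{J-1} + n\right)}{\tfrac{1}{2}N/2^{J-1} + n} = 4 \tag{6.27}$$
Thus, the 4-parallel architecture is 4 times faster than the single pipelined architecture.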
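The throughput, $H$, which can be defined as the number of output coefficients
generated per unit time, can be written for each architecture as
$$H(\text{single}) = n/\left(\left(p_1 + N/2^{J-1} + 2n\right) t_p/2k\right)$$
The maximum throughput, $H^{max}$, occurs when $n$ is very large ($n \to \infty$); thus, with $f_p = 1/t_p$,
$$H^{max}(\text{single}) = H(\text{single})_{n\to\infty} = n k f_p/\left(\tfrac{1}{2}N/2^{J-1} + n\right)$$
$$H(\text{2-parallel}) = n/\left(\left(p_2 + \tfrac{1}{2}N/2^{J-1} + n\right) t_p/2k\right)$$
$$H^{max}(\text{2-parallel}) = H(\text{2-parallel})_{n\to\infty} = 2 k n f_p/\left(\tfrac{1}{2}N/2^{J-1} + n\right)$$
$$H(\text{4-parallel}) = n/\left(\left(p_4 + \tfrac{1}{2}N/2^{J-1} + n\right) t_p/4k\right)$$
$$H^{max}(\text{4-parallel}) = H(\text{4-parallel})_{n\to\infty} = 4 k n f_p/\left(\tfrac{1}{2}N/2^{J-1} + n\right)$$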
Thus, the throughputs of the 2-parallel and the 4-parallel pipelined architectures have
increased by factors of 2 and 4, respectively, as compared with the single pipelined
architecture.
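The timing and throughput estimates of this section can be evaluated numerically with a short Python sketch. The startup latencies (20, 19, and 33 cycles) and the cycle counts follow the simplified expressions for the three architectures; the function names, the argument layout, and the sample values of N, J, t_p, and k are assumptions made for illustration.

```python
def total_time(arch: str, n: int, n_cols: int, level: int, t_p: float, k: int) -> float:
    """Time to produce n pairs of outputs at reconstruction level `level`."""
    w = n_cols / 2 ** (level - 1)                 # N / 2^(J-1)
    if arch == "single":
        return (20 + w + 2 * n) * t_p / (2 * k)   # p1 = 20
    if arch == "2-parallel":
        return (19 + w / 2 + n) * t_p / (2 * k)   # p2 = 19
    if arch == "4-parallel":
        return (33 + w / 2 + n) * t_p / (4 * k)   # p4 = 33
    raise ValueError(f"unknown architecture: {arch}")


def speedup(arch: str, n: int, n_cols: int, level: int, t_p: float, k: int) -> float:
    """Speedup of `arch` relative to the single pipelined architecture."""
    return total_time("single", n, n_cols, level, t_p, k) / total_time(
        arch, n, n_cols, level, t_p, k
    )


def throughput(arch: str, n: int, n_cols: int, level: int, t_p: float, k: int) -> float:
    """Output pairs generated per unit time."""
    return n / total_time(arch, n, n_cols, level, t_p, k)


if __name__ == "__main__":
    args = dict(n=10**7, n_cols=512, level=1, t_p=8.0, k=4)   # hypothetical values
    print(speedup("2-parallel", **args), speedup("4-parallel", **args))
```

For small n the startup latency and first-run terms keep the speedups slightly below 2 and 4; the limits are approached only as n grows, which is why the text states the result for large n.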
6.10 Conclusions 
In this chapter, to show the effectiveness of the approach adopted for developing
the forward architectures in chapters 3 and 4, architectures for the 2-dimensional inverse
discrete wavelet transform (2-D IDWT) for 5/3 and 9/7 were developed. First, a high-
speed single pipelined inverse architecture, including its column-processor (CP) and
row-processor (RP), was developed. Then, the single pipelined architecture was
extended to 2-parallel and 4-parallel architectures to achieve speedup factors of 2 and 4,
respectively, according to the evaluation given in section 6.9. The advantage of the
single pipelined architecture developed here is that it only requires temporary
line buffers (TLBs) of total size 2N and 4N for 5/3 and 9/7, respectively, and the TLB
requirement does not increase when it is extended to a parallel architecture. The
interleaving technique is utilized to speed up the computations by allowing the two
processors to work in parallel earlier during the computations and to reduce TLB




7.1 Performance analysis 
In chapter 3, two scan methods were developed for the 9/7 algorithm. The first scan
method, shown in Figure 3.5.1, can be used for both the 9/7 and 5/3 algorithms.
An architecture developed based on this scan method will not yield any output
coefficients in the first run. However, starting from the second run, its dataflow is
the same as the 5/3 dataflow shown in Table B.6. On the other hand, the 9/7
architecture developed based on the second scan method, shown in Figure 3.5.3, will
yield output coefficients starting from the first run, as illustrated in the dataflow
shown in Table B.2(a). This might give the impression that the second scan method
performs better than the first scan method. To show that both scan methods achieve
the same performance in terms of the total number of cycles and throughput, consider
the following. From the RP and the CP of the 9/7 shown in Figures 3.8.8(a) and
3.8.4(a), respectively, which are based on the scan method shown in Figure 3.5.1, it
can be shown that $(p_1 + N)$ cycles are needed to yield the first pair of output
coefficients. The remaining $(n-1)$ pairs of output coefficients, which will be produced
according to Table B.6, would require $(n-1)$ cycles. Thus, the total time $T_1$ required to
yield $n$ pairs of output coefficients for j-level decomposition of an NxM image is
given by
$$T_1 = \left(p_1 + N + (n-1)\right)\tau_1$$
where $\tau_1 = t_p/k$ is the clock period. The throughput $H$ is given by
$$H = n/\left(\left(p_1 + N + (n-1)\right)\tau_1\right)$$
On the other hand, Table B.2 of the architecture based on the scan method shown
in Figure 3.5.3 indicates that $p_2 = 23$ cycles are needed to yield the first pair of output
coefficients. In addition, the total number of paired output coefficients and the total
number of cycles in the first run are $N$ and $2N-2$, respectively. Thus, the total time $T_2$
required to yield $n$ pairs of output coefficients for j-level decomposition is estimated
as
$$T_2 = \left(p_2 + (2N-2) + (n-N)\right)\tau_2$$
where $\tau_2 = t_p/k$. The throughput $H$ is given by
$$H = n/\left(\left(p_2 + 2N + (n-N)\right)\tau_2\right)$$
A similar analysis can also be carried out for the intermediate architectures based on the
scan methods shown in Figures 3.7.1 (a) and (b). Equations 7.1, 7.3, 7.4, and 7.6
show that the architectures developed based on both scan methods give the same
performance in terms of the number of clock cycles and throughput, if $\tau_1 = \tau_2$.
However, the hardware and the control of the architecture based on the
second scan method, as indicated in Figures 3.8.4(b) and 3.8.8(b), are more complex
than those of the one based on the first scan method, shown in Figures 3.8.4(a) and 3.8.8(a).
The situation becomes even worse when the architecture is
extended to a parallel one. Furthermore, the implementation results in Figures C.3.3 and
C.4.3 show the speed advantage of the first scan method. Figure C.3.3 shows that the
first scan method architecture operates with frequency 147.95 MHz, while Figure
C.4.3 shows that the second scan method architecture operates with frequency 136.04
MHz. For these reasons, the first scan method is adopted for all parallel
architectures developed in chapter 4.
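The cycle counts of the two scan methods can be compared directly with a small Python sketch. The expressions follow the two total-time estimates of this section; the function names and the value chosen for the startup latency of the first scan method are hypothetical, since only the second method's latency of 23 cycles is stated.

```python
def cycles_first_scan(n: int, n_cols: int, p1: int) -> int:
    """Cycles to produce n output pairs with the first scan method."""
    return p1 + n_cols + (n - 1)


def cycles_second_scan(n: int, n_cols: int, p2: int = 23) -> int:
    """Cycles to produce n output pairs with the second scan method."""
    return p2 + (2 * n_cols - 2) + (n - n_cols)


if __name__ == "__main__":
    # p1 = 22 is a hypothetical latency chosen only for illustration.
    for n in (1_000, 100_000):
        a = cycles_first_scan(n, n_cols=256, p1=22)
        b = cycles_second_scan(n, n_cols=256)
        print(n, a, b, a - b)
```

The difference between the two counts is p1 - p2 + 1, independent of n and N, so for large n both methods need essentially the same number of cycles and the comparison is decided by clock frequency and hardware complexity.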
7.2 Performance evaluations and comparisons 
This section evaluates and compares the architectures developed in this research with
the most recent architectures in the literature. The architectures are evaluated in terms of
hardware complexity, hardware utilization, computing complexity, and control
complexity. Hardware complexity is measured by the number of multipliers, the
number of adders, the total size of the line buffer, and the complexity of the control
circuits [40]. Computing complexity for 2-D DWT is estimated by the number of
clock cycles required to scan an NxM image for j levels of decomposition.
Table 7.1 shows the performance comparison results. The line-based architecture
presented in [1] requires a line buffer of size 5.5N implemented in two-port RAM.
Besides, its critical path delay is large, 4Tm + 8Ta. In contrast, the proposed
architectures use single-port RAMs of sizes 3N and 4N for the overlapped and
nonoverlapped architectures, respectively.
The flipping structure [2] introduces a new method to shorten the critical path of the
lifting-based architecture to one multiplier delay but requires a line buffer of size 11N
[43]. In [21], a modified view of the flipping structure is presented, which shortens the
critical path delay to one multiplier and reduces the size of the line buffer required to
4N. In fact, [2, 21] have only introduced a method, not an architecture,
which aims at shortening the critical path delay of the lifting-based DWT to one multiplier delay.
However, this issue becomes less important given that the scale factors and
coefficients of the 9/7 filter can be implemented in hardware using only two adders, as
illustrated in [23]. The proposed overlapped and nonoverlapped architectures require
a total line buffer of size 3N and 4N, respectively. Note, however, that by adding a line
buffer of size N in the nonoverlapped architecture, the power consumption has been
reduced to a minimum. Thus, the nonoverlapped architecture could be a very efficient
alternative in applications where power consumption is a serious concern.

Table 7.1 Comparisons of several 1-level (9/7) 2-D DWT architectures
Architecture                        Multipliers  Adders  Line buffer  Computing time    Critical path
Generic RAM-based [1]               10           16      5.5N         2(1-4^-j)NM       4Tm+8Ta
Flipping [2]                        10           16      11N          2(1-4^-j)NM       Tm
Chao [60]                           6            8       5.5N         2(1-4^-j)NM       Tm
PLSA [21]                           12           16      4N           N/A               Tm
Bing [43]                           6            8       5.5N         2(1-4^-j)NM       Tm
Lan [29]                            12           12      6N           2(1-4^-j)NM       Tm
Jain [61]                           9            16      10N          2(1-4^-j)NM       Tm+Ta
Cheng [22] (2-parallel)             18           32      5.5N         (1-4^-j)NM        N/A
FIDF [62] (2-parallel)              24           32      5N           (1-4^-j)NM        Tm+2Ta
Proposed (overlapped)               10           16      3N           2(1-4^-j)NM       Tm+2Ta
Proposed (nonoverlapped)            10           16      4N           2(1-4^-j)NM       Tm+2Ta
Proposed (2-parallel)               18           32      3N           (1-4^-j)NM        Tm+2Ta
Proposed (4-parallel)               36           64      3N           1/2(1-4^-j)NM     Tm+2Ta
Prop. (2-parallel intermediate)     18           32      3N           (1-4^-j)NM        Tm+2Ta
Prop. (3-parallel intermediate)     28           48      3N           2/3(1-4^-j)NM     Tm+2Ta
Prop. (single pipelined inverse)    10           16      4N           2(1-4^-j)NM       Tm+2Ta
Proposed (2-parallel inverse)       18           32      4N           (1-4^-j)NM        Tm+2Ta
Proposed (4-parallel inverse)       36           64      4N           1/2(1-4^-j)NM     Tm+2Ta
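The computing-time entries of Table 7.1 all have the form c(1-4^-j)NM, with a leading coefficient c that depends on the degree of parallelism. The sketch below simply evaluates that expression; the dictionary keys and function names are illustrative choices, and the coefficients are read from the table rather than derived here.

```python
from fractions import Fraction

# Leading coefficient c of c * (1 - 4**-j) * N * M, as listed in Table 7.1.
COEFF = {
    "single": Fraction(2),
    "2-parallel": Fraction(1),
    "3-parallel intermediate": Fraction(2, 3),
    "4-parallel": Fraction(1, 2),
}


def computing_time(kind, N, M, j):
    """Clock cycles for j levels of decomposition of an N x M image."""
    return COEFF[kind] * (1 - Fraction(1, 4 ** j)) * N * M


def speedup(kind, N, M, j):
    """Cycle-count speedup of a parallel architecture over the single pipelined one."""
    return computing_time("single", N, M, j) / computing_time(kind, N, M, j)


for kind in COEFF:
    print(kind, computing_time(kind, 512, 512, 3), speedup(kind, 512, 512, 3))
```

Because only the coefficient changes, the cycle-count speedup is 2/c regardless of image size or number of levels, which is where the speedup factors of 2, 3, and 4 come from.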
In [43, 60], by reordering the lifting-based DWT of the 9/7 filter, the critical path
of the pipelined architectures has been reduced to one multiplier delay, but a
total line buffer of size 5.5N is required. However, [43] requires two row processors and [60]
requires 4 processing elements (PEs), two in each of the horizontal and vertical processors,
to perform prediction lifting and update lifting. In addition, both [43, 60] require the
use of real multipliers with long delay that cannot be implemented by using the arithmetic
shift method [23]. The architecture proposed in [29] achieves a critical path of one
multiplier delay using a very large number of pipeline registers. In addition, it requires
a total line buffer of size 6N. In the efficient pipelined architecture [61], a critical path
delay of Tm+Ta is achieved through an optimized data flow graph, but a total
line buffer of size 10N is required.
On the other hand, the architectures proposed in [22, 62], like the proposed 2-parallel
architectures, achieve a speedup factor of 2. However, the deeply
parallel architecture [62] requires a total line buffer of size 5N, whereas [22] requires a
total line buffer of size 5.5N. The advantage of the parallel architectures developed in
this research is that the total line buffer does not increase from that of the proposed
single pipelined architectures when the degree of parallelism is increased. In addition,
the architectures proposed in this research are real architectures that, compared with the
architectures listed in Table 7.1, are accurate and complete.
7.3 Experimental results and comparisons 
To further verify that the architectures developed here are accurate, efficient, and
practical to implement, we have chosen five architectures for FPGA implementation,
which are representative of the other architectures: the 5/3 forward
overlapped scan architecture shown in Figure 3.6.1, the inverse 5/3 architecture
shown in Figure 6.4.1, two 9/7 forward overlapped architectures, one based on the
scan method shown in Figure 3.5.1 and the other based on the scan method shown
in Figure 3.5.3, and the 5/3 2-parallel architecture shown in Figure 4.2.1. First, the
Verilog HDL descriptions for the five architectures are developed and then
implemented on an Altera FPGA with a 16-bit word length for the internal datapath. The
Verilog HDL program codes for the five architectures are named as module
"decorrelate_processor" for the forward 5/3 architecture, module "reconst_processor"
for the inverse 5/3 architecture, module "decrrelation2_processor9_7" for the first 9/7
architecture based on the scan method of Figure 3.5.1, module
"decorelation_processor9_7" for the second 9/7 architecture based on the scan method
of Figure 3.5.3, and module "two_parallel_DWT" for the 5/3 2-parallel architecture.
The Verilog descriptions of the five architectures are compiled and synthesized on the
Altera FPGA Stratix II device EP2S15F484C3 using the Quartus II CAD software. This
software provides automatic mapping of designs written in Verilog into Field
Programmable Gate Arrays (FPGAs).
The compilation and the synthesis reports for module "decorrelate_processor" are
shown in Figures C.1.1, C.1.2, and C.1.3, whereas the compilation reports for module
"reconst_processor" are shown in Figures C.2.1, C.2.2, and C.2.3. The forward 9/7
compilation reports for module "decrrelation2_processor9_7" are shown in Figures
C.3.1, C.3.2, and C.3.3, while those of module "decorelation_processor9_7" are shown
in Figures C.4.1, C.4.2, and C.4.3. The 2-parallel architecture compilation reports for
module "two_parallel_DWT" are shown in Figures C.5.1, C.5.2, and C.5.3.
The compilation report in Figure C.1.1 shows that the design uses 93 pins, a total
of 438 logic cells, and a total of 434 registers, whereas the compilation report shown
in Figure C.1.2 indicates that the total power dissipation of the design is 500.46 mW.
On the other hand, the Compilation Report-Timing Analyzer Summary shown in
Figure C.1.3 lists four parameters. The first parameter, tsu, indicates that the worst-case
setup time required is 3.195 ns, on the path from Ed3 to REd3. This parameter means
that signal Ed3 must have a stable value at least 3.195 ns before each active edge of
the clock. The second parameter, tco, indicates that the worst-case clock-to-output delay is
6.301 ns from register L_data_out[8] to pin L_data_out[8]. In other words, it indicates
the time elapsed from an active edge of the clock at the clock pin until an output
signal is produced at an output pin [65]. The third parameter in the Timing Analyzer
Summary is th, which gives the worst-case hold time; it is 1.831 ns for the path
from pin data_in0[0] to register RIO_1[10]. Hence, the signal at pin data_in0[0] must
maintain a stable value for at least 1.831 ns after each active edge of the clock. The
last parameter in the list gives the maximum frequency, often called Fmax, at
which the synthesized circuit can operate; it is 185.74 MHz. This is a useful indicator of
performance. The maximum frequency is determined by the path with the longest
propagation delay, often called the critical path, between any two registers (flip-flops)
in the circuit.
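The relation between Fmax and the critical path can be made concrete with a small conversion. This is a simplified sketch that treats the minimum clock period as the reciprocal of Fmax and ignores effects such as clock skew; the function names are illustrative and not taken from the Quartus II reports.

```python
def min_clock_period_ns(fmax_mhz):
    """Shortest usable clock period, in ns, for a given Fmax in MHz."""
    return 1000.0 / fmax_mhz


def fmax_mhz_from_path(critical_path_ns):
    """Fmax in MHz implied by a register-to-register critical path in ns."""
    return 1000.0 / critical_path_ns


# Fmax reported for the forward 5/3 module in Figure C.1.3.
period = min_clock_period_ns(185.74)
print(round(period, 3), "ns")
```

Under this simplification, an Fmax of 185.74 MHz corresponds to a critical path of a little under 5.4 ns.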
Figure C.1.3 shows that the maximum operating frequency Fmax of the module is
determined by the TLB operations, where the path with the longest delay occurs. This is
expected since the overlapped architecture requires both read and write operations in
the TLB to take place in the same clock cycle. However, since the intermediate
architecture for 5/3 shown in Figure 3.7.2 does not impose such a constraint on its TLB,
the intermediate architecture would operate at a higher frequency.
Furthermore, the synthesis results shown in Figures C.1.3, C.2.3, C.3.3, and C.4.3,
which show the maximum frequencies of the four implemented architectures, imply
that the parallel forms of these architectures will also operate at the same
frequencies. In fact, the 5/3 2-parallel architecture operates at a frequency of 186.01
MHz, practically the same as the 185.74 MHz of the single 5/3 pipelined architecture
of which it is the parallel form. Since the 2-parallel architecture needs half as many
clock cycles at practically the same clock frequency, this verifies that it is 2 times faster
than the single pipelined architecture. This result is also in agreement with the
theoretical evaluation given in section 4.2.4.
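The speedup argument combines a cycle-count ratio with the two measured clock frequencies. A minimal sketch of that calculation follows; the function name is illustrative, and the cycle ratio of 2 is the theoretical value from section 4.2.4 rather than a measured quantity.

```python
def measured_speedup(cycle_ratio, f_single_mhz, f_parallel_mhz):
    """Execution-time speedup: time = cycles / frequency for each design.

    cycle_ratio is (cycles of the single design) / (cycles of the parallel design).
    """
    return cycle_ratio * (f_parallel_mhz / f_single_mhz)


# Frequencies reported for the single and 2-parallel 5/3 architectures.
print(measured_speedup(2, 185.74, 186.01))
```

With the reported frequencies, the result is marginally above 2 because the 2-parallel module synthesized to a slightly higher Fmax.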
To compare the implementation results of our architectures with other
implementations in the literature, Table 7.2 summarizes the
experimental results of several implemented architectures. This table shows that the
5/3 implementations in [3, 24] with 8-bit word length operate at frequencies of 110
MHz and 129.93 MHz, respectively, whereas the proposed 5/3 forward and inverse
architectures with 16-bit word length operate at maximum frequencies of 185.74 MHz and
188.32 MHz, respectively. In addition, the implementation in [3] requires a large
number of FPGA logic cells and registers. On the other hand, the 5/3 2-parallel
architecture in [62], which is implemented on the same FPGA device, operates at a
frequency of 145.54 MHz, whereas the proposed 5/3 2-parallel architecture operates
at a frequency of 186.01 MHz.
Table 7.2 Experimental results and comparisons
Architecture                   Type            Logic cells  Regs  Max frequency  Power dissipation  Word length
Zewail [24]                    5/3             473          149   112.93 MHz     N/A                8-bit
FIDF [62]                      5/3 2-parallel  1316         466   145.54 MHz     N/A                16-bit
Gregory [3]                    5/3             1741         2542  110 MHz        N/A                8-bit
PLSA [21]                      9/7             416          192   152.39 MHz     N/A                16-bit
Xiong [40]                     9/7             2992         N/A   50 MHz         393.62 mW          16-bit
Sandro [30]                    9/7             1002         N/A   105 MHz        N/A                8-bit
Proposed forward               5/3             438          434   185.74 MHz     500.46 mW          16-bit
Proposed inverse               5/3             446          457   188.32 MHz     465.39 mW          16-bit
Proposed 2-parallel forward    5/3             872          697   186.01 MHz     580.98 mW          16-bit
Proposed first                 9/7             2036         858   147.95 MHz     673.37 mW          16-bit
Proposed second                9/7             2529         1049  136.04 MHz     739.36 mW          16-bit

The last 3 implementations in Table 7.2 are 9/7 architectures. Comparing the two
proposed 9/7 architectures, in terms of speed, with the architectures proposed in [30, 40] shows
that the 9/7 architectures implemented in this work operate at higher frequencies. In
addition, the implementation in [40] requires more logic cells and its
operating frequency is very low, 50 MHz. The implementation in [21] operates at
a frequency of 152.39 MHz, which is slightly higher than that of the first proposed 9/7
implementation, which operates at a maximum frequency of 147.95 MHz. However,
[21] introduced only a method, not an architecture, for reducing the critical path delay to
one multiplier and implemented only one processor for 1-D DWT, while 2-D
DWT architectures usually consist of two processors.
The final stage of the implementation is the timing simulation. To verify that both
the forward and inverse 5/3 architectures, the 5/3 2-parallel architecture, and both 9/7
architectures perform their intended logical functions accurately under the worst-case
timing of the target device, we have applied test input patterns and simulated the
hardware modules of the implemented architectures. Figures 7.3.1, 7.3.2, 7.3.3, 7.3.4, and
7.3.5 show the simulation waveform results for the five implemented architectures.
The forward 5/3 module "decorrelate_processor" is simulated by applying a
2-dimensional array of size 6x5 containing random numbers. This 6x5 image is
scheduled according to the scan method shown in Figure 3.5.1, which requires 3
pixels to be fed into the circuit every clock cycle. The 3 pixels are indicated as
data_in0, data_in1, and data_in2 in Figure 7.3.1. In cycle number 2 of Figure
7.3.1, the first 3 pixels 22, 143, and 65 of the first row are applied to the hardware
module. In cycle 3, the first 3 pixels 62, 5, and 222 of the second row are applied to
the hardware module. In cycle 7, the first 3 pixels 64, 121, and 34 of the last row are
applied to complete the first run. The second run begins at cycle 8, where pixels 65,
192, and 115 are applied, and ends at cycle 13 with pixels 34, 143, and 32. The last
run begins at cycle 14 and ends at cycle 19. Note that pixels of the last column are
applied to the circuit one pixel at a time as shown in Figure 7.3.1, which is in
accordance with the scan method.
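The input schedule described for the 6x5 test image can be reproduced with a short generator. This is a sketch inferred only from the cycle numbers quoted above, not from Figure 3.5.1 itself: it assumes an odd number of columns, runs that are three columns wide and share one column with the previous run, and a last run that feeds the final column alone. The function name and the `first_cycle` parameter are illustrative.

```python
def overlapped_scan_schedule(rows, cols, first_cycle=2):
    """Return (cycle, run, row, column_indices) tuples for the overlapped scan.

    Assumption: cols is odd and at least 3, as in the 6x5 example.
    """
    if cols < 3 or cols % 2 == 0:
        raise ValueError("this sketch handles odd column counts >= 3 only")
    schedule = []
    cycle = first_cycle
    run = 1
    # Full runs: three columns per cycle, overlapping the previous run by one column.
    for left in range(0, cols - 2, 2):
        for row in range(rows):
            schedule.append((cycle, run, row, (left, left + 1, left + 2)))
            cycle += 1
        run += 1
    # Last run: only the final column, one pixel per cycle.
    for row in range(rows):
        schedule.append((cycle, run, row, (cols - 1,)))
        cycle += 1
    return schedule


for entry in overlapped_scan_schedule(6, 5):
    print(entry)
```

For a 6x5 image this gives run 1 on cycles 2 to 7, run 2 on cycles 8 to 13, and the last run on cycles 14 to 19, matching the cycle numbers quoted for Figure 7.3.1.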
The first two outputs of the run 1 simulation, which are shown under the labels
L_data_out and H_data_out in Figure 7.3.1, appear at cycle 12 with output
coefficients 21 and -103. These two coefficients belong to the first locations in
subbands LL and LH, respectively. The second two output coefficients, -3 and -207,
belong to the first locations in subbands HL and HH, respectively. The hardware
module alternates between generating output coefficients for subbands LL and LH
and for subbands HL and HH until the run ends. The first run ends at the positive
transition of clock cycle 18 with output coefficients 2 and 33. The positive transition
of cycle 18 marks the ending of run 1 and the beginning of run 2 with coefficients 131
and 29. These two output coefficients belong to the first location of the second
column in subbands LL and LH, respectively. The positive transition of clock
cycle 24 marks the ending of run 2 and the beginning of the last run.
The last run generates only output coefficients for subbands LL and LH. The
simulation results in Figure 7.3.1 show that the hardware module
"decorrelate_processor" performs its function precisely and in accordance with Table B.6.
The signals between data_in2 and L_data_out in Figure 7.3.1 are control signals for
the RP and CP of Figures 3.8.7 and 3.8.3, respectively. The signals sre0, sre1, and
sre2 are control signals for the RP's extension multiplexers, whereas signals sce0, sce1,
and sce2 are the control signals for the CP's extension multiplexers. Signals incar and
rst_TLBAR control the operation of the TLBAR (TLB address register), while signal
ETLB is used for enabling the TLB for read and write operations. The control signals
Ed2, Ed3, and Ed4 control the operations of the registers and multiplexers that exist
between the RP and the CP in Figure 3.6.1 and are set in Figure 7.3.1 according to
Table 3.2.
In order to validate the inverse architecture, the output coefficients generated by
module "decorrelate_processor" are fed into the inverse hardware module
"reconst_processor" as shown in Figure 7.3.2. The coefficients are scheduled
according to the scan method shown in Figure 6.3.1. The output of the simulation in
Figure 7.3.2 indicates that the hardware module "reconst_processor" accurately
reconstructs the original image pixels. In Figure 7.3.2, the first six outputs of run 1
under L_data_out are valid output pixels, while the first output of H_data_out is not,
according to Table B.18. The first six outputs of L_data_out represent pixels of
the first column in the 6x5 image. The second and the last runs each yield two
columns to complete the 5 columns of the 6x5 image. The CP and the RP of the
inverse external architecture implement the datapath architectures shown in Figures
6.5.4 and 6.5.6, respectively. The control signal sr of the external architecture is set in
Figure 7.3.2 according to Table 6.1.

Fig. 7.3.1 Simulation Waveforms for forward 5/3 module "decorrelate_processor"
Fig. 7.3.2 Simulation Waveforms for inverse 5/3 module "reconst_processor"
Fig. 7.3.3 Simulation Waveforms for first 9/7 module "decrrelation2_processor9_7"
Fig. 7.3.4 Simulation Report - Simulation Waveforms for second 9/7 module "decorelation_processor"
Fig. 7.3.5 Simulation Report - Simulation Waveforms for the 5/3 2-parallel's module "decorelation_processor"
The hardware modules for both 9/7 forward pipelined overlapped architectures are
tested by applying an image of size 6x8. This image is scanned into the first 9/7
hardware module "decrrelation2_processor9_7" according to the scan method shown in
Figure 3.5.1 and the results of the simulation are shown in Figure 7.3.3. This module
does not yield any output coefficients in the first run, but starting from the second run
it generates output patterns that are similar to those of the 5/3 forward overlapped
architecture. It yields its first pair of output coefficients, 26 and 44, at clock cycle 25,
as shown in Figure 7.3.3. The positive transition of clock cycle 31 marks the ending
of the second run, with output coefficients 104 and -50, and the beginning of the third
run with output coefficients 262 and 8. This hardware module implements the RP and
the CP datapaths shown in Figures 3.8.8 (a) and 3.8.4 (a), respectively. The control
signals sre0, Q0, sre1, Q1, sre2, and Q2, which are issued according to Table B.5, are
control signals for the RP's extension multiplexers. The control signals Ed2, Ed3, and Ed4
in Figure 7.3.3 are set according to Table B.2 (c).
On the other hand, in the second 9/7 hardware module "decorelation_processor9_7",
the image is scanned into the module according to the scan method shown in Figure
3.5.3. The simulation results are shown in Figure 7.3.4. The difference between this
module and the first 9/7 module is that this module generates output coefficients
starting from the first run, according to Table B.2. In Figure 7.3.4, its first pair of
output coefficients, 26 and 44, appears at cycle 25. The positive transition of clock
cycle 35 marks the ending of the first run, with output coefficients 104 and -50, and
the beginning of the second run with output coefficients 262 and 8. This hardware
module implements the RP and CP datapath architectures
shown in Figures 3.8.8 (b) and 3.8.4 (b), respectively. A table similar to Table B.2 (c),
which contains control signal values, was derived from Table B.2 (b) for signals Ed2,
Ed3, and Ed4 and then was used in Figure 7.3.4 for setting these signals. The simulation
results shown in Figures 7.3.3 and 7.3.4 for both 9/7 modules verify that both hardware
modules perform their logical functions accurately in the worst-case timing
simulation.
The 5/3 2-parallel hardware module "two_parallel_DWT" is simulated by
applying an image of size 6x5, which is identical to the one applied to the single
pipelined architecture's module "decorrelate_processor". The image pixels are
scanned into the hardware module according to the scan method shown in Figure
3.5.1. The simulation results are shown in Figure 7.3.5. In this figure, the first 4
output coefficients 21, -103, -3, and -207 appear at cycle 11. The positive transition of
clock cycle 14 marks the ending of the first run with output coefficients 66, 39, 2, and
33 and the beginning of the second run with output coefficients 131, 29, 103, and 1.
Cycle 17 marks the ending of the second run with output coefficients 188, 150, 85,
and 55 and the beginning of the last run with output coefficients 169 and 5. In the last
run, only CP2 generates output coefficients for subbands LL and LH. The 5/3
2-parallel simulation results shown in Figure 7.3.5 are identical to the 5/3 single
pipelined architecture's simulation results shown in Figure 7.3.1, which verifies that
the 2-parallel architecture performs its intended computations correctly.
The 2-parallel hardware module implements the RP and the CP datapath architectures
shown in Figures 4.2.2 and 3.8.1, respectively. In Figure 7.3.5, the RP1 input latches are
loaded with 3 pixels every time the clock makes a negative transition, whereas the RP2
input latches are loaded on the positive transition of the clock.
The six papers listed in Table 7.2, which implemented their architectures on
FPGA, provided only synthesis results such as those shown in Table 7.2, without any
simulation waveform results. Simulation results such as those shown in Figures 7.3.1, 7.3.2,
7.3.3, 7.3.4, and 7.3.5 serve as proof that the implemented architectures perform their
functions correctly under the worst-case timing of the target FPGA device.
7.4 Conclusions 
In this chapter, five selected architectures, which are representative of the other
architectures developed in this work, are implemented and synthesized on an Altera
FPGA. The compilation results of the implementation and comparisons are
summarized in Table 7.2. The comparison results given in Tables 7.1 and 7.2, together with the
simulation results shown in Figures 7.3.1, 7.3.2, 7.3.3, 7.3.4, and 7.3.5, verify that the
architectures implemented in this work are not only accurate and fast but also efficient
in terms of power dissipation and hardware complexity. In addition, the synthesis
results of the 2-parallel architecture shown in Figure C.5.3 confirm that the 2-parallel
pipelined architecture is 2 times faster than the single pipelined architecture.
Furthermore, the compilation results given in Figures C.3.1, C.3.2, and C.3.3 for the first
9/7 architecture and the compilation results shown in Figures C.4.1, C.4.2, and C.4.3 for the
second 9/7 architecture show that the first 9/7 architecture performs better than the
second.




CONCLUSIONS AND RECOMMENDATIONS 
8.1 Conclusions 
In this research, two highly efficient and novel architectures for 2-D DWT are
proposed that meet the high-speed, low-power, and memory requirements of real-time
applications. The most noticeable accomplishment is the elimination of the
internal memories between the row and column processors, which dominate the
hardware cost. In the proposed pipelined architecture based on the nonoverlapped
scan method, the power consumption due to the external frame memory access is
reduced to a minimum, and it could be a very efficient alternative in applications where
power consumption is a serious issue.
In the development of the architectures, two cases were identified based on the
scanning frequencies: case 1, low scan frequency, and case 2, high scan frequency. In
case 1, the optimal performance of the pipelined architectures in terms of speed,
efficiency, and hardware utilization is achieved by scanning 3 pixels in parallel each
cycle. This requires slight modifications of the architectures developed in the first part,
which scan the external memory pixel-by-pixel. In case 2, the optimal performance of
the architectures is immediately obtained by pipelining the processors with no
further modifications of the architectures developed in the first part.
Furthermore, the critical path delay of the proposed pipelined architectures can be
reduced to four adder delays when the multiplication operations in the 9/7 processors
are implemented by adders only. The advantage of the approach adopted in the
development of the two proposed architectures is that it can be used in developing
architectures for any 2-D DWT algorithm and it is certain to yield very efficient
architectures in terms of hardware complexity, speedup, and power consumption with
manageable control complexity.
Based on the generalization of the overlapped scan method, the intermediate
architecture is developed, which aims at reducing the power consumption due to the
overlapped areas, without using the expensive line buffer, to a level somewhere between
those of the two extreme architectures, the overlapped and the nonoverlapped. Compared with the
power consumption of scanning the external memory for the architecture based on the
first scan method, the intermediate architecture decreases the power by 22% with no
loss in speed, while the intermediate architecture with the second dataflow decreases
the power consumption of scanning the external memory by 48%. However, the
second dataflow increases the total execution time by 16.7% over the architecture
based on the first scan method and the intermediate architecture using the first
dataflow. In addition, since the reduction in the power consumption is achieved
without using a line buffer, the intermediate architecture occupies less silicon area.
Therefore, the intermediate architecture could be a very efficient alternative for
high-speed, low-cost, and low-power applications such as mobile video phones.
To further improve performance in terms of speed and throughput to best meet
real-time applications of 2-D DWT with demanding requirements, parallel
architectures were developed. The single pipelined overlapped architecture is
extended to 2-parallel, 3-parallel, and 4-parallel architectures to achieve speedup
factors of 2, 3, and 4, respectively, according to the evaluation given in section 4.2.4.
The scheme adopted in the development of the 4-parallel architecture optimizes the
performance, in terms of the number of clock cycles required for j levels of decomposition,
as compared with the alternative scheme, which increases the execution time by
M/2^(j-1) cycles for each level of decomposition when case 2 occurs. Similarly, the
single pipelined intermediate architecture is extended to 2-parallel and 3-parallel
architectures. According to the evaluation given in section 4.3.5, the 2-parallel and
3-parallel intermediate architectures achieve speedup factors of 2 and 3, respectively.
The intermediate parallel architecture reduces the power consumption of the external
memory by a factor of 7/9 as compared with the overlapped parallel architecture,
Eq(4.57).
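As a quick arithmetic check, and assuming the factor of 7/9 means that the intermediate parallel architecture consumes 7/9 of the external-memory scanning power of the overlapped parallel architecture, the corresponding saving works out as follows:

```latex
1 - \frac{7}{9} = \frac{2}{9} \approx 0.222
```

Under that reading, the saving is about 22%, which appears consistent with the 22% reduction quoted above for the single intermediate architecture.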
The advantage of the proposed parallel architectures developed in this research is
that the total temporary line buffer (TLB) does not increase from that of the proposed
single pipelined architectures when the degree of parallelism is increased. In addition, the
comparison results show that the single and parallel architectures developed in this
research, compared with the most recent architectures in the literature, require only a total
TLB of size N in the 5/3 processor datapath and 3N in the 9/7, while the other
architectures listed in Table 10 require more TLBs, which are very expensive memory
components. In addition, the control architecture that detects the occurrence of the last
run and the 6 cases of the intermediate architectures is also designed. Furthermore, to
reduce the control design effort, several tables giving the values of
several control signals are provided.
This research has also addressed in detail one of the important issues that
have been overlooked so far, that is, the 2-D DWT memory architectures and their
management, and has proposed two novel VLSI memory architectures, the LL-RAM
and the subband memory, which are based on the first scan method. The LL-RAM and
subband memory were designed such that the DWT unit effectively performs both read
and write operations in the LL-RAM and writes only into the subband memory, while the
compression unit reads the subband memory. How the two memory architectures can be
modified for higher scan methods is also illustrated. The banking technique is used to
further improve the memory architectures in terms of speed and
power consumption. The bank-based architecture can be thought of as formed by dividing
the module-based RAM architecture, which can be considered as one big bank, into
several smaller independent banks. Inside the smaller banks, reads and writes are
performed as in the big bank but faster and more efficiently. The advantage of the two
proposed memory architectures is that they can be easily incorporated into single or
parallel 2-D DWT processor architectures.
To show that the architectures developed in this research are simple to control, the 
control algorithms for 4-parallel architecture including the LL-RAM and the subband 
memory were developed. To ease the control development, the overall system control 
is divided into several smaller units. Then, the algorithmic state machine (ASM) for 
each unit is developed. The control algorithms developed here can be used to derive 
the hardware of the control. 
Furthermore, based on data dependency graphs (DDGs) and scan methods
specifically developed for the inverse 5/3 and 9/7, the external architectures for single and
parallel 5/3, 9/7, and combined 5/3 and 9/7 were developed. First, a high-speed single
pipelined inverse architecture including its column-processor (CP) and row-processor
(RP) was developed. Then, the single pipelined architecture is extended to 2-parallel
and 4-parallel to achieve speedup factors of 2 and 4, respectively. The advantage of
the single pipelined architecture developed here is that it only requires total
temporary line buffers (TLBs) of sizes 2N and 4N for 5/3 and 9/7, respectively, and the
TLB requirement does not increase when it is extended to a parallel architecture. The
combined architecture is very useful and efficient in situations where a decoder in one
site is required to perform either lossless 5/3 or lossy 9/7 image reconstruction. In
addition, the advantage of the combined architecture is that a considerable saving in
silicon area can be achieved. Besides precisely implementing the two algorithms, the
proposed architectures have simple control. Specifically, the
external architecture's control signals of the single pipelined inverse architecture
shown in Figure 6.4.1 were reduced to only one control signal. The interleave
technique used by the CP for combining subbands not only speeds up the computations by
allowing the RP to work in parallel with the CP as early as possible, but also reduces the internal
memory requirement between the CP and the RP to a few registers.
The processor datapath architectures were first developed assuming the external
memory is scanned either row-by-row or column-by-column. However, since the
external architectures developed in this work scan the external memory differently,
the processor datapaths for the single and parallel architectures were modified in order
to fit into the external architectures' processors.
The symmetric extension algorithm is incorporated in the data dependency graphs
(DDGs) to handle the boundary problem and is then implemented by all architectures
developed in this work. Symmetric extension is a necessary treatment to prevent
distortion from appearing at the image boundaries.
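As a rough illustration of the boundary treatment summarized above, the following Python sketch mirrors out-of-range sample indices back into the signal (whole-sample symmetric extension, in which the boundary sample itself is not repeated). The helper names `mirror_index` and `extend` are illustrative only; in the architectures of this work the extension is folded into the DDGs rather than computed as a separate step.

```python
def mirror_index(i, n):
    """Reflect any integer index i into the range 0..n-1.

    Whole-sample symmetric extension: x[-1] maps to x[1] and
    x[n] maps to x[n-2], so the boundary sample is not duplicated.
    """
    if n == 1:
        return 0
    period = 2 * (n - 1)
    i %= period              # Python's % is non-negative for a positive modulus
    return i if i < n else period - i


def extend(signal, left, right):
    """Return the signal with `left` and `right` mirrored samples attached."""
    n = len(signal)
    return [signal[mirror_index(i, n)] for i in range(-left, n + right)]


if __name__ == "__main__":
    print(extend([10, 20, 30, 40], 2, 2))
```

For the four-sample row in the example, two mirrored samples are attached on each side, giving `[30, 20, 10, 20, 30, 40, 30, 20]`.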
The scan method adopted for the development of the architectures not only reduces
the internal memory between RPs and CPs to a few registers, but also reduces the
internal memory, or number of TLBs, in the RP to a minimum. In addition, it allows
CPs to work in parallel with RPs earlier during the computation, which reduces the
latency to a few cycles.
The approach adopted in the development of the proposed single and parallel
architectures can be used in architecture development for any 2-D DWT algorithm and
is expected to yield very efficient architectures in terms of hardware complexity,
speedup, and power consumption with manageable control complexity.
The simulation results of the five architectures implemented and synthesized on
Altera FPGA verify that the architectures developed in this work are not only accurate
and fast but also efficient in terms of power dissipation and hardware complexity. In
addition, the synthesis results of the 2-parallel architecture shown in Figure C.5.3
confirm that the 2-parallel pipelined architecture is 2 times faster than the single
pipelined architecture. Furthermore, the compilation results given in Figures C.3.1,
C.3.2, and C.3.3 for the first 9/7 architecture and the compilation results shown in
Figures C.4.1, C.4.2, and C.4.3 for the second 9/7 architecture show that the first 9/7
architecture performs better than the second 9/7 architecture in terms of speed, power
consumption, and hardware complexity.
The Verilog version used in Altera FPGA Quartus II does not have the capability
of supporting simulation using real images. This limitation forced me to use
images of sizes 6x5 and 6x8 containing random numbers in the final simulation.
Another limitation is that the architectures developed in this work are designed to
process the whole image as one tile. JPEG2000 allows (optionally) an image to be
divided into a number of smaller non-overlapping rectangular blocks known as "tiles",
each of which is then processed independently by the DWT unit. This mechanism is
useful for computing the 2-D DWT of a large image independently of its size, using
a smaller intermediate memory (LL-RAM) to store the "LL" coefficients for the
next level of decomposition. Thus a control algorithm is needed to divide a large
image into tiles and then pass each tile to the DWT unit for processing.
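The tiling step described above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions, not a proposed control algorithm: tiles are anchored at the image origin (the reference-grid and tile offsets that JPEG2000 permits are ignored), and the function name `tile_bounds` is hypothetical.

```python
def tile_bounds(height, width, tile_h, tile_w):
    """Yield half-open bounds (row0, row1, col0, col1) of non-overlapping tiles.

    Tiles on the right and bottom edges are clipped to the image size,
    so every pixel belongs to exactly one tile.
    """
    for r0 in range(0, height, tile_h):
        for c0 in range(0, width, tile_w):
            yield r0, min(r0 + tile_h, height), c0, min(c0 + tile_w, width)


if __name__ == "__main__":
    # A 5x7 image cut into tiles of at most 4x4 pixels.
    for bounds in tile_bounds(5, 7, 4, 4):
        print(bounds)
```

Each tuple would then be handed to the DWT unit as an independent image, so the LL-RAM only needs to hold the LL coefficients of one tile at a time.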
8.2 Recommendations 
Possible future work would be to extend the approach and the techniques
acquired from this research to develop architectures for any 2-D DWT algorithm,
including the development of VLSI architectures for signal and image processing
algorithms. This work could also be extended to develop architectures for
3-dimensional images, where computational requirements are very intensive with
complex and large memory requirements. Furthermore, the concepts and techniques
developed in this work can also aid in the development of VLSI architectures for a
Turbo decoder. Turbo code is one of the most attractive error correction codes and it
is an essential component in digital communication and data storage systems.
Another possibility would be to extend this work to develop an architecture for the
compression part of the system, which uses EBCOT (Embedded Block Coding with
Optimized Truncation) to independently code the coefficients of each subband.
EBCOT contains Tier 1 and Tier 2. Tier 1 is implemented in hardware, whereas Tier 2
is implemented in software. The insight gained from this work would aid the designer
in developing a compression architecture that can be integrated into the 2-D DWT
architecture.
Moreover, this research includes many in-depth and optimized designs and
can therefore be a valuable reference for graduate students and researchers pursuing
in-depth study in this field.
REFERENCES 
[1] Chao-Tsung Huang, Po-Chih Tseng, and Liang-Gee Chen, "Generic RAM-based architecture for
2-D discrete wavelet transform with line-based method," IEEE Trans. on Circuits and Systems for
Video Technology, vol. 15, No. 7, July 2005, PP. 910-920.
[2] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, "Flipping structure: an efficient VLSI architecture
for lifting-based discrete wavelet transform," IEEE Trans. Signal Processing, vol. 52, No.
4, April 2004, PP. 1080-1089.
[3] Gregory Dillen, Benoit Georis, Jean-Didier Legat, and Olivier Cantineau, "Combined Line-based
Architecture for the 5-3 and 9-7 Wavelet Transform of JPEG2000," IEEE Trans. on
Circuits and Systems, vol. 13, No. 9, Sep. 2003, PP. 944-950.
[4] Daubechies and Sweldens, "Factoring wavelet transform into lifting schemes," J. Fourier
Analysis and Application, vol. 4, No. 3, 1998, PP. 245-267.
[5] Sweldens, "The lifting scheme: A new philosophy in biorthogonal wavelet constructions," in
Proc. SPIE, vol. 2569, 1995, PP. 68-79.
[6] Calderbank, I. Daubechies, Sweldens, and Yeo, "Wavelet transforms that map integers," J. Applied
and Computational Harmonic Analysis, vol. 5, No. 3, Sept. 1998, PP. 332-369.
[7] ISO/IEC, ISO/IEC 15444-1, Information technology - JPEG2000 image coding system, 2000.
Website: http://www.jpeg.org/CDs15444.html.
[8] David S. and Michael W., "JPEG 2000 image compression fundamentals, standards and
practice," Kluwer Academic Publishers, 2002.
[9] Mu-Yu Chiu, Kun-Bin Lee, and Chein-Wei Jen, "Optimal data transfer and Buffering
Schemes for JPEG2000 encoder," in proceeding IEEE workshop Signal Proc. Syst., 2003, PP.
177-182.
[10] K. K. Parhi and T. Nishitani, "VLSI architectures for discrete wavelet transforms," IEEE
Trans. Very Large Scale Integration (VLSI) System, June 1993, PP. 191-202.
[11] C. Chrysafis and A. Ortega, "Line-based, reduced memory, wavelet image compression," IEEE
Trans. Image Processing, vol. 9, No. 3, March 2000, PP. 378-389.
[12] M. Weeks and M. Bayoumi, "Discrete wavelet transform: Architectures, design and performance
issues," J. VLSI Signal Processing & Systems, vol. 35, No. 2, 2003, PP. 155-178.
[13] Kishore A., Chaitali Ch., and Tinku A., "A VLSI architecture for lifting-based forward and
inverse wavelet transform," IEEE Trans. on Signal Processing, vol. 50, No. 4, April 2002.
[14] W. Jiang and A. Ortega, "Lifting Factorization-based Discrete Wavelet Transform Architecture
Design," IEEE Trans. on Circuits & Sys. for Video Technology, vol. 11, No. 5, May 2001, PP. 651-657.
[15] Anil K. Jain, "Fundamentals of digital image processing," Prentice Hall, 1989.
[16] K. K. Parhi, VLSI Digital Signal Processing System: Design and Implementation. New York:
Wiley, 1999.
[17] F. Marino, "Efficient high-speed/low-power pipelined architecture for the direct 2-D
discrete wavelet transform," IEEE Trans. Circuits Sys. II, Analog Digital Signal Processing, vol.
47, No. 12, Dec 2000, PP. 1476-1491.
[18] Sanjit K. Mitra, "Digital signal processing, a computer-based approach," McGraw-Hill, 2001.
[19] Jaideva C. Goswami and Andrew K. Chan, "Fundamentals of wavelets, theory algorithm, and
applications," New York, Wiley, 1999.
[20] Agostino Abbate, Casimer M., and Pankaj K. Das, "Wavelets and subbands, Fundamentals and
applications," Birkhauser, 2001.
[21] Cheng-Yi Xiong, Jin-Wen Tian, and Jian Liu, "A note on "Flipping structure: an efficient VLSI
architecture for lifting-based discrete wavelet transform"," IEEE Trans. on Signal Processing,
vol. 54, No. 5, May 2006, PP. 1910-1916.
[22] Cheng-Yi Xiong, Jin-Wen Tian, and Jian Liu, "Efficient parallel architecture for lifting-based
two-dimensional discrete wavelet transform," IEEE Int. Workshop VLSI Design & Video Tech.,
China, May 2005, PP. 75-78.
[23] Qing-ming Yi and Sheng-Li Xie, "Arithmetic shift method suitable for VLSI implementation to
CDF 9/7 discrete wavelet transform based on lifting scheme," Proceedings of the Fourth Int.
Conf. on Machine Learning and Cybernetics, Guangzhou, August 2005, PP. 5241-5244.
[24] R. Zewail, P. Marshall, S. Kozicki, N. Ying, D. Elliott, and N. Durdle, "A reconfigurable fully
Scalable integer wavelet transform unit for JPEG2000," CCECE/CCGEI Saskatoon, IEEE, May
2005, PP. 798-801.
[25] Zhi-Rong Gao and Cheng-Yi Xiong, "An efficient Line-based architecture for 2-D discrete
wavelet transform," proceeding of IEEE international conference on communications, circuits
and systems, 2005, PP. 1322-1325.
[26] Chengyi Xiong, Jinwen Tian, and Jian Liu, "A fast VLSI architecture for two-dimensional discrete
wavelet transform based on lifting scheme," proceeding of IEEE 7th international conference on
solid-state and integrated circuits technology, 2004, PP. 1661-1664.
[27] Srikar Movva and Srinivasan S., "A novel architecture for lifting-based discrete wavelet transforms
for JPEG2000 standard suitable for VLSI implementation," Proceedings of the 6th International
Conf. on VLSI Design, 2003 IEEE, PP. 202-207.
[28] K-C. B. Tan and T. Arslan, "Shift-accumulator ALU centric JPEG2000 5/3 lifting based discrete
wavelet transform architecture," proceeding of 2003 IEEE, PP. v161-v164.
[29] Xuguang Lan, Nanning Zheng, & Yuehu Liu, "Low-Power and High-Speed VLSI Architecture
For lifting-Based Forward and Inverse Wavelet Transform," IEEE transaction on consumer
electronics, Vol. 51, issue 2, 2005, PP. 379-385.
[30] Sandro V. Silva & Sergio Bampi, "Area and Throughput Trade-offs in the Design of Pipelined
Discrete Wavelet Transform architectures," proceedings of the Design Automation and Test in
Europe Conference and Exhibition, 2005 IEEE, PP. 32-37.
[31] W. Sweldens, "The lifting scheme: A custom-design construction of biorthogonal wavelets,"
Applied and Computational Harmonic Analysis, vol. 3, No. 15, 1996, PP. 186-200.
[32] Zhong Guangjum, Cheng Lizhi & Chen Huowang, "A simple 9/7-tap wavelet filter based on
lifting scheme," proceeding of IEEE international conference on image processing, vol. 2, 2001,
PP. 249-252.
[33] Chao-Tsung Huang, Po-Chih Tseng, and Liang-Gee Chen, "Analysis and VLSI Architecture for
1-D and 2-D Discrete Wavelet Transform," IEEE Trans. on Signal Processing, vol. 53, No. 4,
April 2005, PP. 1575-1586.
[34] Jen-Shiun Chiang, Chih-Hsien Hsia, Hsin-Jung Chen, and Te-Jung Lo, "VLSI Architecture of
Low Memory and High Speed 2-D lifting-Based Discrete Wavelet Transform for JPEG 2000
Applications," IEEE international symposium on circuits and systems, ISCAS 2005, Vol. 5, PP.
4554-4557.
[35] S. Barua, J. E. Carletta, K. A. Kotteri, A. E. Bell, "An efficient architecture for lifting-based
two-dimensional discrete wavelet transforms," Integration, the VLSI journal, 2005 Elsevier,
PP. 341-352.
[36] G. Dimitroulakos, M. D. Galanis, A. Milidonis, and C. E. Goutis, "A high-throughput and
Memory efficient 2-D discrete wavelet transform hardware architecture for JPEG2000
Standard," IEEE international symposium on circuits and systems, ISCAS 2005, Vol. 1,
PP. 472-475.
[37] Chengjun Zhang, Chunyan Wang, M. Omair Ahmad, "A VLSI architecture for a High-speed
computation of the 1-D discrete wavelet transform," IEEE international symposium on circuits
and systems, ISCAS 2005, Vol. 2, PP. 1461-1464.
[38] K. A. Kotteri, S. Barua, A. E. Bell, and J. E. Carletta, "A comparison of hardware implementations
of the Biorthogonal 9/7 DWT: convolution versus lifting," IEEE Trans. on Circuits &
System, vol. 52, No. 5, May 2005, PP. 256-260.
[39] Yuan-Long Jean, Kai-Jearg, Kai-Jyun Liang, Jiun-Hau Tu, Jain-Zhou Huang, and Pingshou
Cheng, "An embedded wavelet image coding algorithm and its hardware implementation Based on
Zero-block and Array (EZBA)," 48th Midwest symposium on circuits and systems, IEEE 2005,
Vol. 2, PP. 1414-1417.
[40] C-Y. Xiong, J-W. Tian, J. Liu, "Efficient high-speed/low-power line-based architecture for
2-dimensional discrete wavelet transforms using lifting scheme," IEEE Trans. on Circuits & Sys.
for Video Tech., Vol. 16, No. 2, February 2006, PP. 309-316.
[41] Michael Unser & Thierry Blu, "Wavelet Theory Demystified," IEEE Trans. on Signal
Processing, vol. 51, No. 2, February 2003, PP. 470-483.
[42] Chao-Tsung Huang, Po-Chih Tseng & Liang-Gee Chen, "VLSI architecture for forward discrete
wavelet transform based on B-spline factorization," Journal of VLSI signal processing,
2005 Springer Science, PP. 343-353.
[43] B-F. Wu, C-F. Lin, "A high-Performance and Memory-Efficient Pipeline Architecture for the 5/3
and 9/7 Discrete Wavelet Transform of JPEG2000 Codec," IEEE Trans. on Circuits & Sys. for
Video Technology, Vol. 15, No. 12, December 2005, PP. 1615-1628.
[44] Iain E. G. Richardson, "H.264 and MPEG-4 video compression, video coding for
next-generation multimedia," Wiley, 2003.
[45] Maurizio Martina and Guido Masera, "Low-complexity, Efficient 9/7 Wavelet Filters
Implementation," proceeding of 2005 IEEE.
[46] Gab Cheon Jung, Seong Mo Park, and Jung Hyoun Kim, "An Efficient VLSI Architecture for
JPEG2000 Encoder," proceeding of 2005 IEEE, PP. 1203-1206.
[47] David B. H. Tay, "A class of lifting based integer wavelet transform," proceeding of IEEE
international conference on image processing, vol. 1, 2001, PP. 602-605.
[48] Zhi-Rong Gao & Cheng-Yi Xiong, "Combining Parallel Lifting and Retiming Architecture for
Discrete Wavelet Transform," IEEE Int. workshop VLSI design and video Technology,
Suzhou, China, May 2005, PP. 175-178.
[49] Michael D. Adams & Faouzi Kossentini, "Reversible Integer-to-Integer Wavelet Transforms for
Image Compression: Performance Evaluation and Analysis," IEEE Transactions on Image
Processing, vol. 9, No. 6, June 2000, PP. 1010-1024.
[50] P.-C. Tseng, C.-T. Huang, and L.-G. Chen, "VLSI implementation of shape adaptive discrete
wavelet transform," in Proc. SPIE Int. Conf. Visual Communications and Image Processing,
2002, PP. 655-666.
[51] N. D. Zervas, G. P. Anagnostopoulos, V. Spiliotopoulos, Y. Andreopoulos, and C. E. Goutis,
"Evaluation of design alternatives for the 2-D discrete wavelet transform," IEEE Trans.
Circuits Syst. Video Technology, vol. 11, No. 12, December 2001, PP. 1246-1262.
[52] H. Yamauchi et al., "Image processor capable of block-noise-free JPEG2000 compression with
30 frames/s for digital camera applications," in proceeding IEEE Int. Solid-State Circuits
Conf., vol. 1, PP. 46-477, 2003.
[53] Jorg Ritter and Paul Molitor, "A pipelined architecture for partitioned DWT based lossy
image compression using FPGA's," Monterey, CA, USA, ACM 2001, PP. 201-206.
[54] L. Liu, X. Wang, H. Meng, L. Zhang, Z. Wang, and H. Chen, "A VLSI architecture of spatial
combinative lifting algorithm based 2-D DWT/IDWT," in Proc. 2002 Asia-Pacific Conf. Circuits
and Systems, vol. 2, 2002, PP. 299-304.
[55] B. F. Wu and C. F. Lin, "A rescheduling and fast pipeline VLSI architecture for lifting-based
discrete wavelet transforms," in proceeding IEEE ISCAS, May 2003, PP. 732-735.
[56] H. Meng and Z. Wang, "Fast spatial combinative lifting algorithm of wavelet Transform using
the 9/7 filter for image block compression," Electron. Lett., Vol. 36, No. 21, Oct. 2002, PP.
1766-1767.
[57] MPEG-4, ISO/IEC JTC1/SC29/WG11, FCD 14496, "Coding of moving pictures and audio," May
1998.
[58] Kai Hwang, "Advanced Computer Architecture: Parallelism, Scalability, Programmability,"
McGraw-Hill, 1993.
[59] Hongyu Liao, Mrinal Kr., and Bruce F., "Efficient architectures for 1-D and 2-D lifting-based
wavelet transform," IEEE Trans. on Signal Processing, vol. 52, No. 5, May 2004.
[60] W. Chao, W. Zhilin, C. Peng, and L. Jie, "An efficient VLSI Architecture for lifting-based
discrete wavelet transform," Multimedia and Expo, 2007 IEEE International Conference, PP.
1575-1578.
[61] R. Jain and P. R. Panda, "An efficient pipelined VLSI architecture for Lifting-based 2D-
discrete wavelet transform," ISCAS, 2007 IEEE, PP. 1377-1380.
[62] B-F. Li and Y. Dou, "FIDP: A novel architecture for lifting-based 2D DWT in JPEG2000,"
MMM (2), Lecture Notes in Computer Science, vol. 4352, Springer, 2007, PP. 373-382.
[63] Peng Cao, Xin Guo, Chao Wang, and Jie Li, "Efficient architecture for two-dimensional discrete
Wavelet transform based lifting scheme," 7th International Conference on ASIC, ASICON'07,
2007 IEEE, PP. 225-228.
[64] Chengyi Xiong, Jinwen Tian, and Jian Liu, "Efficient architecture for two-dimensional discrete
Wavelet transform using lifting scheme," IEEE Transactions on Image Processing, Vol. 16, No. 3,
March 2007, PP. 607-614.
[65] Stephen Brown and Zvonko Vranesic, "Fundamentals of digital logic with Verilog design,"
Second edition, McGraw-Hill Higher Education, 2008.
[66] Chih-Hsien Hsia and Jen-Shiun Chiang, "New memory-efficient hardware architecture of 2-D
dual-mode lifting-based discrete wavelet transform for JPEG2000," 11th IEEE Singapore
International Conference on Communication Systems, 2008 IEEE, PP. 766-772.
[67] Jie Guo, Ke-yan Wang, Cheng-ke Wu, and Yun-song Li, "Efficient FPGA implementation of
Modified DWT for JPEG2000," 9th International Conference on Solid and Integrated-circuit
Technology, 2008 IEEE, PP. 2200-2303.
[68] Wei-Ming Li, Chih-Hsien Hsia, and Jen-Shiun Chiang, "Memory-efficient architecture of 2-D
dual-Mode discrete wavelet transform using lifting scheme for motion-JPEG2000," IEEE
International Symposium on circuits and systems, 2009, PP. 750-753.
[69] Pingping Yu, Suying Yao, and Jiangtao Xu, "An efficient architecture for 2-D lifting-based
discrete wavelet transform," 4th IEEE Conference on Industrial Electronics and Applications,
2009, PP. 3667-3670.
[70] Xiaodong Xu and Yiqi Zhou, "Efficient FPGA implementation of 2-D DWT for 9/7 float wavelet
filter," International Conference on Information Engineering and Computer Science, 2009 IEEE,
PP. 1-4.
[71] Chung-Fu Lin, Pei-Kung, and Bing-Fei Wu, "An efficient pipeline architecture and memory
bit-width analysis for discrete wavelet transform of the 9/7 filter for JPEG2000," J Sign Process




SOFTWARE SIMULATION PROGRAM DEVELOPMENT 
A.1 Introduction
It is of great benefit to start this research by developing a software simulation program that
computes both forward and inverse 2-D DWT using the lifting-based 5/3 algorithms. The forward
operations decorrelate the original image to make it amenable to compression, whereas the inverse
operations reconstruct the original image from the decorrelated image. Developing a simulation program
gives the hardware architecture designer a valuable opportunity to learn in detail the behavior of the
algorithm and acquire a firm understanding, which in turn enables the development of a more accurate
architecture.
A.2 Forward and inverse lifting-based 5/3 algorithms and software development
Lifting-based forward and inverse 5/3 wavelet transform algorithms are defined by the JPEG2000 image
compression standard as follows [7, 27, 29].
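5/3 forward algorithm
step 1: $Y(2j+1) = X(2j+1) - \left\lfloor \frac{X(2j) + X(2j+2)}{2} \right\rfloor$
step 2: $Y(2j) = X(2j) + \left\lfloor \frac{Y(2j-1) + Y(2j+1) + 2}{4} \right\rfloor$
5/3 inverse algorithm
step 1: $X(2j) = Y(2j) - \left\lfloor \frac{Y(2j-1) + Y(2j+1) + 2}{4} \right\rfloor$
step 2: $X(2j+1) = Y(2j+1) + \left\lfloor \frac{X(2j) + X(2j+2)}{2} \right\rfloor$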
Based on the above two algorithms, the data dependency graphs (DDGs) for forward and inverse, 
shown in Figures 3.3.1 and 6.2.1, respectively, are derived. The symmetric extension algorithm 
recommended by JPEG2000 is also incorporated into the DDGs to handle boundary problems. Based on 





This software is developed by Ibrahim Saeed Koko at 





% program fdwt
X1 = imread('cameraman.tif');   % read image and store it in array X
X1 = rgb2gray(X1);              % Separates image from colors
X = double(X1);                 % convert pixels from grayscale
                                % numbers to signed numbers
[m,n] = size(X);
% first level decomposition
YH = horizontalf(X);
YL = horizontalfl(X,YH);
YHH1 = verticalf(YH);
YHL1 = verticalFL(YH,YHH1);
YLH1 = verticalf(YL);
YLL1 = verticalFL(YL,YLH1);
YH = []; YL = [];               % free YH and YL
% second level decomposition
YH = horizontalf(YLL1);
YL = horizontalfl(YLL1,YH);
YHH2 = verticalf(YH);
YHL2 = verticalFL(YH,YHH2);
YLH2 = verticalf(YL);
YLL2 = verticalFL(YL,YLH2);
YH = []; YL = []; YLL1 = [];
% third level decomposition
YH = horizontalf(YLL2);
YL = horizontalfl(YLL2,YH);
YHH3 = verticalf(YH);
YHL3 = verticalFL(YH,YHH3);
YLH3 = verticalf(YL);
YLL3 = verticalFL(YL,YLH3);
YLL2 = []; YH = []; YL = [];
function YH = horizontalf(z0)
[m,n] = size(z0);
k = fix(n/2);
for i = 1:m
    for j = 1:k
        if (j < k) | (k ~= n/2)
            % horizontal highpass decomposition
            YH(i,j) = z0(i,2*j) - fix((z0(i,2*j-1)+z0(i,2*j+1))/2);
        else
            % right boundary, as in Figure A.3.1 (c)
            YH(i,j) = z0(i,2*j) - z0(i,2*j-1);
        end
    end
end
function YL = horizontalfl(z0,YH)
[m,n] = size(z0);
% horizontal lowpass decomposition
k = fix(n/2);
if k ~= n/2
    k = k + 1;
end
for i = 1:m
    for j = 1:k
        if j == 1
            YL(i,j) = z0(i,2*j-1) + fix(YH(i,j)/2);
        elseif (fix(n/2) == n/2) | (j < k)
            YL(i,j) = z0(i,2*j-1) + fix((YH(i,j-1)+YH(i,j)+2)/4);
        else
            YL(i,j) = z0(i,2*j-1) + fix(YH(i,j-1)/2);
        end
    end
end
function ZL = verticalFL(z1,ZH)
[m,n] = size(z1);
k = fix(m/2);
if k ~= m/2
    k = k + 1;
end
for i = 1:n
    for j = 1:k
        if j == 1
            % vertical lowpass decomposition
            ZL(j,i) = z1(2*j-1,i) + fix(ZH(j,i)/2);
        elseif (fix(m/2) == m/2) | (j < k)
            ZL(j,i) = z1(2*j-1,i) + fix((ZH(j-1,i)+ZH(j,i)+2)/4);
        else
            ZL(j,i) = z1(2*j-1,i) + fix(ZH(j-1,i)/2);
        end
    end
end
function ZH = verticalf(z1)
[m,n] = size(z1);
k = fix(m/2);
for i = 1:n
    for j = 1:k
        % vertical highpass decomposition
        if (j < k) | (k ~= m/2)
            ZH(j,i) = z1(2*j,i) - fix((z1(2*j-1,i)+z1(2*j+1,i))/2);
        else
            % bottom boundary, mirroring the horizontal case
            ZH(j,i) = z1(2*j,i) - z1(2*j-1,i);
        end
    end
end
% function f2dwt
fdwt;                            % call main program.
Y3 = [YHH3 YLH3; YHL3 YLL3];     % combine subbands to obtain
Y2 = [YHH2 YLH2; YHL2 Y3];       % decorrelated image
Y1 = [YHH1 YLH1; YHL1 Y2];       % decomposed image
Y = mat2gray(Y1);                % convert data to a grayscale image
figure, imshow(Y);               % Display decorrelated image
title('Decorrelated image')
[m,n] = size(Y1);
y = 1:1:m;
x = 1:1:n;
[x,y] = meshgrid(x,y);
figure, mesh(x,y,Y1);
title('This figure shows the decomposed image pixels are decorrelated')
figure, mesh(x,y,X);
title('This figure shows the original image pixels highly are correlated')





INVERSE PROGRAM
fdwt;                        % activate fdwt to compute the fdwt.
% first level reconstruction
YL = verticalR(YLH3,YLL3);
YH = verticalR(YHH3,YHL3);
YLL2 = horizontalR(YH,YL);
YH = []; YL = [];
% second level reconstruction
YL = verticalR(YLH2,YLL2);
YH = verticalR(YHH2,YHL2);
YLL1 = horizontalR(YH,YL);
YL = []; YH = [];
% third level reconstruction
YL = verticalR(YLH1,YLL1);
YH = verticalR(YHH1,YHL1);
xr1 = horizontalR(YH,YL);    % reconstructed image
xr = mat2gray(xr1);          % convert matlab image to a grayscale image.
figure, imshow(xr)           % display the reconstructed image.
title('Reconstructed image')
figure, imshow(X1)           % display the original image.
title('Original image')
DIFF = difference(xr1,X)     % call function difference
function Xrec = horizontalR(YH,YL)   % horizontal reconstruction
[m,n1] = size(YL);
[m,n] = size(YH); Xrec = zeros(m,n+n1);
for i = 1:m
    % horizontal lowpass reconstruction
    for j = 1:n1
        if j == 1
            Xrec(i,2*j-1) = YL(i,j) - fix(YH(i,j)/2);
        elseif (n1 == n) | (j < n1)
            Xrec(i,2*j-1) = YL(i,j) - fix((YH(i,j-1)+YH(i,j)+2)/4);
        else
            Xrec(i,2*j-1) = YL(i,j) - fix(YH(i,j-1)/2);
        end
    end
    % horizontal highpass reconstruction
    for j = 1:n
        if (j < n) | (n1 ~= n)
            Xrec(i,2*j) = YH(i,j) + fix((Xrec(i,2*j-1) + Xrec(i,2*j+1))/2);
        else
            Xrec(i,2*j) = YH(i,j) + Xrec(i,2*j-1);
        end
    end
end
function YL = verticalR(YLH,YLL)
[m,n] = size(YLH);
[m1,n] = size(YLL);
if m1 ~= m
    YL = zeros(2*m+1,n);
else
    YL = zeros(2*m,n);
end
for i = 1:n
    % vertical lowpass reconstruction
    for j = 1:m1
        if j == 1
            YL(2*j-1,i) = YLL(j,i) - fix(YLH(j,i)/2);
        elseif (m1 == m) | (j < m1)
            YL(2*j-1,i) = YLL(j,i) - fix((YLH(j-1,i)+YLH(j,i)+2)/4);
        else
            YL(2*j-1,i) = YLL(j,i) - fix(YLH(j-1,i)/2);
        end
    end
    % vertical highpass reconstruction
    for j = 1:m
        if (j < m) | (m1 ~= m)
            YL(2*j,i) = YLH(j,i) + fix((YL(2*j-1,i) + YL(2*j+1,i))/2);
        else
            YL(2*j,i) = YLH(j,i) + YL(2*j-1,i);
        end
    end
end
function diff = difference(x1,x2)
% This function computes the difference
% between the original image and the reconstructed image.
[m,n] = size(x2);
z = 0;
for i = 1:m
    for j = 1:n
        z = z + (x1(i,j) - x2(i,j));   % compute differences.
    end
end
diff = z;                              % return the accumulated difference
if z == 0
    disp('the original and the reconstructed images are identical')
    disp('We have a perfect reconstruction')
else
    disp('the original and the reconstructed images are not identical')
end
The flowcharts for both forward and inverse 2-D DWT programs are shown in Figures A.3.1 and
A.3.2, respectively. Note that in the forward program, the flowcharts for functions verticalFL and verticalf
are similar to the flowcharts for functions horizontalfl and horizontalf, respectively. The only difference is
that the vertical functions compute column-wise, whereas the horizontal functions compute row-wise.
[Figure A.3.1: flowcharts of the forward program. Panel (a), main program: call fdwt, combine the
subbands to obtain the decorrelated image, convert the data of the decorrelated image to a grayscale
image, and display it. Panel (b), program fdwt: read an mxn image and store it in X1, convert the pixels
from grayscale to signed numbers, get the image size (m,n), call horizontalf to compute YH, call
horizontalfl to compute YL, call verticalf and verticalFL to compute the subbands, and repeat the last 6
calls for each level of decomposition. Panels (c) and (d): row and column loops of functions horizontalf
and horizontalfl.]
Figure A.3.1 (a) Main program (b) Forward program.
Figure A.3.1 (c) Horizontal highpass decomposition flowchart.
Figure A.3.1 (d) Horizontal lowpass decomposition flowchart.
[Figure A.3.2: flowcharts of the inverse program. Panel (a): call fdwt to decorrelate an mxn image, call
verticalR to reconstruct YL, call verticalR to reconstruct YH, call horizontalR to reconstruct YLL, repeat
steps 2, 3, and 4 until the whole image is reconstructed, convert the reconstructed image to a grayscale
image and display it along with the original image, then call function psnr. Panels (b) and (c): lowpass
and highpass loops of function verticalR. Panels (d) and (e): lowpass and highpass loops of function
horizontalR.]
Figure A.3.2 (b) Vertical lowpass flowchart.
Figure A.3.2 (e) Horizontal highpass reconstruction flowchart.
The forward program consists of 6 parts. Program fdwt reads in the original image to be decomposed
and then calls the appropriate functions to decompose (decorrelate) it. The horizontalf and horizontalfl
functions compute the DWT in the horizontal direction to yield the highpass (H) and the lowpass (L)
decompositions, respectively. The verticalf function computes the highpass DWT in the vertical direction
to yield subband HH from H and subband LH from L, and the verticalFL function computes the lowpass
DWT in the vertical direction to yield subband HL from H and subband LL from L. The last part of the
forward program is program f2dwt (the main program). This program combines the subbands of the
decomposed image to form the decorrelated image. Then, it displays the decorrelated image and plots the
pixels of the original and decorrelated images to show the correlation and decorrelation properties,
respectively.
The forward program is activated by typing "f2dwt" at the prompt, which activates the main program.
The main program in turn calls fdwt. The four functions named horizontalf, horizontalfl, verticalf, and
verticalFL are called from program fdwt, for example, in the first level decomposition as follows. First, the
horizontalf function is called, followed by the horizontalfl function, to yield the H and L decompositions,
which are stored in YH and YL, respectively. Then, function verticalf is called with YH as a parameter to
yield HH, which is stored in YHH1. Next, function verticalFL is called with YH and YHH1 as parameters to
yield subband HL, which is stored in YHL1. Again, function verticalf is called, with YL as a parameter, to
yield subband LH, which is stored in YLH1. Then, function verticalFL is called with YL and YLH1 as
parameters to yield subband LL, which is stored in YLL1. This process is repeated in each decomposition
level until the entire image is decomposed into the desired number of levels.
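The call sequence for one decomposition level can be sketched in Python as follows. This is a simplified stand-in for the MATLAB functions, not a transcription of them: `analyze_53` follows the floor form of the lifting equations with symmetric extension, images are plain lists of rows with at least two rows and two columns, and all function names are hypothetical. Subband names follow the convention of this appendix, where the first letter is the horizontal band and the second the vertical band.

```python
def _mirror(i, n):
    """Reflect index i into 0..n-1 (whole-sample symmetric extension, n >= 2)."""
    period = 2 * (n - 1)
    i %= period
    return i if i < n else period - i


def analyze_53(x):
    """Return (lowpass, highpass) 5/3 coefficients of a 1-D integer list."""
    n = len(x)
    y = list(x)
    for i in range(1, n, 2):
        y[i] = x[i] - (x[_mirror(i - 1, n)] + x[_mirror(i + 1, n)]) // 2
    for i in range(0, n, 2):
        y[i] = x[i] + (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) // 4
    return y[0::2], y[1::2]


def transpose(a):
    return [list(col) for col in zip(*a)]


def split_rows(image):
    """Filter every row; return the lowpass and highpass halves as two images."""
    pairs = [analyze_53(row) for row in image]
    return [p[0] for p in pairs], [p[1] for p in pairs]


def decompose_level(image):
    """One 2-D level: horizontal pass first, then a vertical pass on L and on H."""
    L, H = split_rows(image)
    LL, LH = (transpose(band) for band in split_rows(transpose(L)))
    HL, HH = (transpose(band) for band in split_rows(transpose(H)))
    return LL, LH, HL, HH


if __name__ == "__main__":
    flat = [[5] * 4 for _ in range(4)]
    for name, band in zip(("LL", "LH", "HL", "HH"), decompose_level(flat)):
        print(name, band)
```

For a constant image all detail subbands come out zero and LL keeps the constant, which is the decorrelation effect the forward program is meant to demonstrate. Further levels would call `decompose_level` again on the returned LL.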
On the other hand, the inverse program consists of 4 parts (functions): idwt, verticalR, horizontalR, and
psnr. The verticalR function reconstructs the original by combining, in each level, subbands LH and LL into
L and subbands HH and HL into H, whereas the function of horizontalR, in each level, is to combine the H
and L decompositions to form the next LL subband.
The difference function computes the difference between the original image (X1) and the reconstructed
image (X2) using the following formula [8, 15].
$$ z = \sum_{i=1}^{M} \sum_{j=1}^{N} \bigl( X_1(i,j) - X_2(i,j) \bigr) \tag{A.1} $$
If the difference (z) is zero, the two images' pixels are identical; otherwise, the two images' pixels are not
identical.
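Equation (A.1) can be sketched in Python as below. One point worth keeping in mind when using it: positive and negative differences can cancel in a signed sum, so z = 0 is a necessary condition for identical images rather than a sufficient one. The sketch therefore also counts mismatched pixels; that second function is an addition for illustration and is not part of the program in this appendix, and both function names are hypothetical.

```python
def signed_difference(x1, x2):
    """Equation (A.1): sum of signed pixel differences over the whole image."""
    return sum(a - b for row1, row2 in zip(x1, x2) for a, b in zip(row1, row2))


def mismatches(x1, x2):
    """Number of pixel positions at which the two images differ."""
    return sum(a != b for row1, row2 in zip(x1, x2) for a, b in zip(row1, row2))


if __name__ == "__main__":
    original = [[1, 2], [3, 4]]
    swapped = [[2, 1], [3, 4]]      # two pixels exchanged
    print(signed_difference(original, original), mismatches(original, original))
    print(signed_difference(original, swapped), mismatches(original, swapped))
```

In the second comparison the signed sum is zero although two pixels differ, which is why a mismatch count (or a sum of absolute differences) is the safer check for perfect reconstruction.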
The function of idwt is to reconstruct the original image by calling verticalR and horizontalR.
The inverse program is activated by typing "idwt" at the prompt, which activates program idwt. Then, idwt
calls fdwt to decorrelate the image. The reconstruction process for the first level begins by calling verticalR
with YLH3 and YLL3 as parameters to yield the L3 decomposition, which is stored in YL. Again, function
verticalR is called with YHH3 and YHL3 as parameters to yield the H3 decomposition, which is stored in YH.
Then, function horizontalR is called with YH and YL as parameters to yield subband LL2, which is stored
in YLL2. This completes the first-level reconstruction. For each subsequent level of reconstruction, the above
steps are repeated until the whole image is reconstructed. When this is done, idwt displays both the original
and reconstructed images and then calls psnr to compute the signal-to-noise ratio (SNR) between the original
image and the reconstructed image.
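The formula used by psnr is not given in this appendix. The sketch below therefore assumes the common peak signal-to-noise ratio definition, 10·log10(peak² / MSE), with a peak value of 255 for 8-bit pixels; both the definition and the peak parameter are assumptions made for illustration.

```python
import math


def psnr(x1, x2, peak=255):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    count = sum(len(row) for row in x1)
    sq_err = sum((a - b) ** 2 for r1, r2 in zip(x1, x2) for a, b in zip(r1, r2))
    mse = sq_err / count
    if mse == 0:
        return math.inf  # identical images: perfect reconstruction
    return 10 * math.log10(peak ** 2 / mse)
```

Under this definition a perfect reconstruction yields an infinite value, which matches the zero-difference case described above.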
The following figures show simulation results of applying an original image to the software simulation 
program. When the image shown in Figure A.3.3 (a) was processed by the forward simulation program, the 
result was the image shown in Figure A.3.3 (b), which is the wavelet representation of the decorrelated 
image. Then the decorrelated image is applied to the inverse software program to yield the image shown in 
Figure A.3.3 (c) which is a perfect reconstruction of the original image without distortion in the image 
boundaries. 
On the other hand, the simulation results in Figures A.3.4 and A.3.5 clearly show the correlation
and decorrelation properties, respectively. Figure A.3.4 shows that the original image pixels are highly correlated,
while Figure A.3.5 shows that the image pixels that result from applying the FDWT to the original image
pixels are decorrelated.
[Figure A.3.3: (a) The original image, (b) Decorrelated image, (c) Reconstructed image]
Figure A.3.4 Original image pixels highly correlated
[Figure A.3.5: the decomposed image pixels are decorrelated]
DATAFLOW AND CONTROL SIGNALS TABLES 
B.1 Dataflow tables of chapter 3
Table B.1 Dataflow for 5/3 overlapped and overlapped scan architectures
Ck f RP's input latches RP's output latches CP's input latches CP's output latches
Rd0 Rd1 Rd LB Rt0 Rt2 Rt1 Rd0 Rd3 Rd4 Rd6 Rd5 Rt3 Rt4 Rt5 Rt6 Rt7
I xO,O 
2 xO,O xO,I 
3 
---- ---
x0,2 xO,O x0,2 xO, I 
4 xi.O --- x0,2 
5 xi,O xl,l 
6 




8 x2,0 x2,1 
9 
---- ---
x2,2 x2,0 x2,2 x2, I LO,O LI,O 110,0 ---- HI,O 
10 x3,0 --- x2,2 
11 x3,0 x3, I 
-0 12 
---- --- x3.2 x3,0 x3,2 x3,1 L2,0 ---- HO,O 112,0 HI.O LO,O L2,0 LI,O 0 
"' 
13 x4,0 --- x3,2 
14 x4,0 x4,1 
15 
--- ---
x4,2 x4,0 x4,2 x4, I L2,0 L3,0 H2,0 ----- H3,0 HO.O H2,0 HI ,0 LHO,O LLO,O 
16 x5.0 --- x4.2 
17 x5,0 x5, I 
18 
--- ---
x5,2 x5,0 x5,2 x5,1 L4,0 --- 112,0 H4,0 113,0 L2.0 L4.0 L3,0 IIHO.O IILO.O 
19 x6,0 --- x5,2 
20 x6,0 x6, I 
21 
--- ---








--- --- ---- ---- ---- L6,0 ---- H4,0 H6,0 H5,0 L4,0 L6.0 L5.0 HHI,O HLI.O 
25 x0.2 ---
26 x0.2 x0,3 
27 
---- --- x0,4 x0,2 x0,4 x0,3 L6.0 ----- H6,0 ----- ----- H4,0 H6,0 H5,0 LH2,0 LL2.0 




x1,4 x1,2 x1,4 x!J LO, I ----- H6,0 HO,I ----- L6.0 ----- ----- HH2,0 HL2,0 
3 1 x2,2 
---- x1.4 
32 x2,2 x2,3 
33 
---- ----
x2,4 x2,2 x2,4 x2,3 LO,I Ll,l HO,I ----- HI,! H6.0 ----- ----- ------ LL3,0 N 
§ 34 x3,2 ---- x2.4 
"' 
35 x3,2 x3,3 
36 
---- ----
x3,4 x3.2 x3,4 x3,3 L2,1 
----
HO,I H2,1 HI,! LO, I L2. I Ll,l 
------ HL3.0 
37 x4,2 ---- x3,4 
38 x4.2 x4,3 
39 
----




41 x5,2 x5,3 
42 
---- ----




44 x6,2 x6,3 
45 
---- ---- x6,4 x6,2 x6,4 x6,3 lA, l L5,1 H4,l ----- H5,l H2,l H4,l H3,1 LHI,I LLI.l 
Note that in Table B.1, at cycles 22, 23, and 24 the external memory is not scanned, and no pixels are
loaded into RP latches Rt0, Rt1, and Rt2 at cycle 24, where a transition from run 1 to run 2 is made. This is
required at every transition from one run to the next only when the column length N of an image
is odd.
Dataflow tables for the second 9/7 pipelined overlapped architecture, developed based on the scan
method shown in Figure 3.5.3, are shown in Tables B.2 (a) and (b) for even and odd N, respectively. Note that when the
column length N of an image is odd, after the second run, an empty cycle should be inserted whenever a
transition is made from one run to the next, as shown, for example, at cycle 22 in Table B.2 (b).
Control signal values for signals Ed2, Ed3, Ed4, Ed5, Ed6, S0, and S1, derived from Table B.2 (a),
are shown in Table B.2 (c). Note that the number of control signals in Table B.2 (c) can be reduced to 3
by observing that signals Ed2, Ed6, S0, and S1 are equal, and so are signals Ed3 and Ed5.
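The reduction to three signals can be checked mechanically: two or more signals can share one control line if, in every clock cycle, their defined (non-X) values agree. The sketch below is a hedged illustration of that check on a subset of rows transcribed from Table B.2 (c); the helper name can_share and the choice of rows are assumptions made here, and the scanned entry under Ed3 in row 12 is read as 1.

```python
SIGNALS = ["Ed2", "Ed3", "Ed4", "Ed5", "Ed6", "S0", "S1"]

# Rows transcribed from Table B.2 (c); "X" is a don't-care.
# Column order follows SIGNALS. Row 12's Ed3 entry is read as 1.
ROWS = {
    10: "1X1XX1X",
    11: "0X0XXXX",
    12: "0101XXX",
    13: "0000XXX",
    16: "0111X0X",
    17: "0000XXX",
    26: "1X001X1",
    27: "0111X00",
    28: "1X001X1",
}


def can_share(names):
    """True if the named signals never take conflicting defined values."""
    idx = [SIGNALS.index(n) for n in names]
    for row in ROWS.values():
        defined = {row[i] for i in idx if row[i] != "X"}
        if len(defined) > 1:  # both a 0 and a 1 required in the same cycle
            return False
    return True
```

On these rows, can_share(["Ed2", "Ed6", "S0", "S1"]) and can_share(["Ed3", "Ed5"]) both hold, while Ed4 conflicts with either group, which leaves three distinct control signals.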
Table B.2 (a) Dataflow of the second 9/7 pipelined overlapped architecture for even N
Ck RP's input latches RP's output latches CP's input latches CP's output latches
Rt0 Rt2 Rt1 Rd2 Rd3 Rd4 Rd6 Rd5 Rt3 Rt4 Rt5 Rt6 Rt7
I xO,O x0,2 xO, I 
2 x0,2 x0,4 x0,3 
3 xl,Ox1,2 xl,l 
4 xl,2 x\,4 xl,3 
" 
5 x2,0 x2,2 x2,1 
" 
6 x2,2 x2,4 x2,3 
"' 7 x3,0 x3,2 x3,1 
8 x3,2 x3,4 x3,3 
9 x4,0 x4,2 x4,1 
10 x4,2 x4,4 x4,3 LO,O --- 110,0 
II x5,0 x5,2 xS,l LO,O --- 110,0 
12 x5,2 x5,4 x5,3 LO,O Ll,O 110,0 ---- 111,0 
13 x0,4 x0,6 x0,5 LO,O Ll,O HO,O ---- Hl,O 
14 xl,4xl,6xl,5 1.2,0 --- HO,O H2,0 HI ,0 LO,O L2,0 Ll ,0 
';! 15 x2,4 x2,6 x2,5 L2,0 --- 110,0 112,0 111,0 110.0 H2,0 111,0 
" 16 x3,4 x3,6 x3,5 L2,0 Ll.O H2,0 ---- H3,0 ----- ----- -----
"' 17 x4,4 x4,6 x4,5 L2,0 Ll.O H2,0 ---- H3,0 ----- ----- -----
18 x5,4 x5,6 x5,5 L4,0 --- H2,0 H4,0 H3.0 1.2,0 1.4,0 LJ,O 
19 x0,6 x0,8 x0,7 !A,O --- H2,0 H4,0 H3,0 112,0 114,0 113,0 
M 
20 xl,6 xl,8 xl,7 L4,0 L5,0 114.0 ---- H5,0 
---- -···· 
..... 
c 21 x2,6 x2,8 x2,7 L4,0 LS,O 0 114,0 ---- HS,O ·---- --·-- -----
"' 
22 x3,6 x3,8 x3,7 LO,l ---- H4,0 110,1 H5,0 L4,0 L4,0 L5,0 
23 x:4,6 x4,8 x4,7 LO,l Ll.l HO,l ----HI,! H4,0 H4,0 H5,0 LHO,O LLO.O 
24 x5,6 x5,8 x5,7 L2,1 ---- flO, I H2,1 Hl,l LO,l L2,1 Ll,l HHO,O HLO,O 
25 L2,1 LJ,l H2,1 ---- H3,1 HO.l H2,1 HI,! ------- -------
26 L4,1 ---- 112, I 114, I H3, I 1.2, I L4,\ 1.3, I ···---- --··---
27 L4,1 L5,1 H4,1 ---- H5,1 H2,1 H4,1 H3,1 LHI,O LLI,O 
2H L0,2 ---- H4,1 H0,2 H5,1 L4,1 L4,1 L5,1 11111,0 111.1,0 
29 L0,2 Ll,2 H0,2 ---- H3, I H4,1 114,1 H5.1 ------- ···----
30 L2,2 ---- 110.2 112.2111,2 1.0,2 1.2,2 1.1 ,2 ···---- --···--
31 1.2,2 1.3,2 H2,2 ---- H3,2 H0,2 H2,2 HI ,2 1.112,0 LL2,0 
32 L4,2 
----
H2,2 H4,2 H3,2 1.2,2 1.4,2 1.3,2 HH2,0 HL2,0 
33 L4,2 L5,2 114,2 ---- H5,2 H2.2 H4,2 H3,2 LHO,l LLO,l 
34 L4,2 1.4,2 L5,2 HIIO,l HLO,l 
35 H4,2 H4,2 H5.2 LHI,l LLI,l 
36 HHI,l HLI,l 
37 LH2,1 LL2,1 
38 HH2,1 HL2,1 
Table B.2 (b) Dataflow of the second 9/7 pipelined overlapped architecture for odd N 
Ck RP's input latches RP's output latches CP's input latches CP's output latches
Rt0 Rt2 Rt1 Rd2 Rd3 Rd4 Rd6 Rd5 Rt3 Rt4 Rt5 Rt6 Rt7
9 x4,0 x4,2 x4, I 
-
10 x4,2 x4,4 x4,3 LO,O --- 110,0 
" 
11 x5,0 x5,2 xS, I 
" 
LO,O --- HO,O 
"' 
12 x5,2 x5,4 x5,3 LO,O Ll.O HO,O ---- Ill ,0 
13 x6,0 x6,2 x6, 1 LO.O Ll,O HO,O ---- H 1,0 
14 x6,2 x6,4 x6,3 L2,0 
---
HO,O H2,0 H 1,0 LO.O L2,0 Ll,O 
15 x0,4 x0,6 x0,5 L2,0 
---
HO,O H2,0 H1.0 HO,O H2,0 HI,O 
16 x1,4 x1,6 xl,S L2.0 LJ,O H2.0 ---- HJ,O ----- ----- ------




" 18 x3,4 x3,6 x3,5 L4,0 H2,0 H4,0 Hl,O L2.0 L4,0 u.o 
"' 
---
19 x4,4 x4,6 x4,5 L4,0 --- H2.0 H4,0 HJ,O H2,0 H4,0 H,O 
20 x5,4 x5,6 x5,5 L4,0 L5,0 H4,0 ---- H5,0 ----- ----- -----
21 x6,4 x6,6 x6,5 L4,0 L5,0 H4,0 ---- 115,0 ----- ----- -----
22 -----
---- ----- L6,0 ---- H4,0 H6,0 H5,0 L4,0 L6,0 L5.0 
23 x0,6 x0,8 x0,7 L6,0 ---- H6,0 ---- ----- H4.0 H6,0 H5.0 LHO,O LLO,O 
M 
24 xl,6 xl,8 x1,7 L0,1 ---- H6,0 H0,1 ---- L6,0 ----- ----- 111!0.0 HLO,O 
" 
" 
25 x2,6 x2,8 x2, 7 L0.1 L1,1 110.1 ---- H1.1 H6,0 ----- ----- ------- -------
"' 
26 x3,6 x3,8 x3,7 L2,1 ---- H0,1 H2,1 H1,1 LO,I L2,1 L1,1 
-------
-------
27 x4,6 x4,8 x4,7 L2,1 L3,1 H2,1 ---- HJ.I H0,1 H2,1 H1,1 LH1.0 LLl,O 
28 x5,6 x5,8 x5,7 L4,1 ---- H2,1 H4,1 HJ,1 L2,1 L4,1 l.l, 1 HIII,O HL1,0 
29 x6,6 x6,8 x6,7 L4,1 L5.1 H4,1 ---- H5,1 H2.1 H4,1 HJ,1 ------- -------
30 L6,1 ---- H4, 1 H6, 1 115,1 1.4,1 L6,1 1.5.1 
-------
-------
31 L6,1 ---- H6,1 ---- ----- H4,1 H6, I 115,1 LH2,0 LL2.0 
32 L0,2 ---- H6, 1 H0,2 ---- L6,1 ----- ----- H112.0 HL2.0 
33 L0,2 L1,2 H0,2 ---- H 1,2 H6,l ----- ·---- ----- LLJ,O 
34 L2,2 ---- 110,2 H2.2 H 1,2 L0,2 L2,2 1.1 ,2 ----- HLJ,O 
35 L2,2 LJ,2 H2,2 ---- 113,2 H0.2 H2,2 111.2 LliO.I LL0,1 
36 L4,2 ---- H2,2 114,2 HJ,2 L2,2 L4.2 Ll.2 H110.1 HL0.1 
37 L4,2 L5,2 114,2 ---- H5,2 H2,2 H4,2 113,2 Llll.l LL1,1 
38 L6,2 ---- H4,2 H6,2 H5.2 L4,2 L6,2 L5,2 11111.1 IILI.1 
39 L6,2 ---- H6,2 ----
-----
H4,2 H6,2 H5,2 LH2,1 LL2.1 
40 H6,2 ----
-----







Table B.2 (c) Control signal values
clock Ed2 Ed3 Ed4 Ed5 Ed6 S0 S1
10 1 X 1 X X 1 X
11 0 X 0 X X X X
12 0 1 0 1 X X X
13 0 0 0 0 X X X
14 1 X 0 0 1 X 1
15 0 X 0 0 0 X 0
16 0 1 1 1 X 0 X
17 0 0 0 0 X X X
18 1 X 0 0 1 X 1
19 0 X 0 0 0 X 0
20 0 1 1 1 X 0 X
21 0 0 0 0 X X X
22 1 X 0 0 1 X 1
23 0 1 1 1 X 0 0
24 1 X 0 0 1 X 1
25 0 1 1 1 X 0 0
26 1 X 0 0 1 X 1
27 0 1 1 1 X 0 0
28 1 X 0 0 1 X 1
29 0 1 1 1 X 0 0
30 1 X 0 0 1 X 1
Table B.3 Dataflow of the intermediate architecture 
YH YL
SR0 SR2 SR1 SR0 SR1
Clk Rd0 Rd1 Rt0 Rt2 Rt1 R2 R1 R0 R2 R1 R0 R2 R1 R0 R2 R1 R0 R2 R1 R0 Rt3 Rt4 Rt5
I xO,O -
2 xO,O xO,l 
3 x0,2 - xO,O x0,2 xO, I 
4 x0,2 x0,3 
5 x0,2 x0,3 
6 x0,4 - x0,2 x0,4 x0,3 hO,O LO,O -
7 x0,4 x0,5 
8 x0,4 x0,5 
9 x0,4 x0,6 x0,5 hO,l hO,O LO,l 1.0.0 -
10 xi,O -
II xl,O xl,l 
12 x1,2 - xl,O x1,2 xl,l h0,2 hO,l hO,O L0,2 LO,l LO,O 
13 x1,2 xl,3 
14 xl,2 xU 
15 xL4 - x\,2 x\,4 xl,3 h0,2 hO,l hO,O hl,O L0,2 LO,l LO,O 1.1,0 
16 xl,4 xl,S 
17 x\,4 xl,S 
18 - - x1,4xl,6x1,5 h0,2h0,\h0,0 hl,lhl,O L0,2LO,!LO,O Ll,!Ll,O 
19 x2,0 -
20 x2,0 x2, l 
21 x2,2 - x2,0 x2,2 x2,1 h0,2 hO,l hO,O h1,2 hl,lhl,O L0,2 LO,l LO,O L1,2 Ll,ILI,O 
22 x2,2 x2,3 
23 x2,2 x2,3 
24 x2.4 - x2,2 x2,4 x2,3 h0,2 hO, I hO,O h2,0 hI ,2 hI, I hi,O 1.2.0 1.0,2 LO, I L I ,2 L\,1 1.0.0 1.2,0 \.I ,0 
25 x2,4 x2,5 
26 x2,4 x2,5 
Rt6 Rt7 
27 - - x2.4x2,6x2,5 h0,2h0,1 hO,O h2,1 h2,0 h1,2hl,l hi,O 1.2,1 L2,0L0,2 L\,2 LO,I 1.2,1 L\,1 Lh0,0\.1.0,0 
28 x3,0 -
29 x3,0 x3, I 
30 x3,2 - x3,0 x3,2 x3, I h0,2 hO, I hO,O h2,2 h2, I h2,0 hI ,2 hI, I hI ,0 L2,2 L2, 1 L2,0 L0,2 L2,2 L I ,2 LhO, I LLO, l 
31 x3,2 x3,3 
32 x3,2 x3,3 
33 x3,4 - x3,2 x3,4 x3,3 h2,0 h0,2 hO, I h2,2 h2, 1 h3,0 h 1 ,2 h 1 ,I L2,2 L2, I L2,0 LJ,O - hO,O h2,0 hI ,0 Lh0,2 LL0,2 
34 x3,4 x3,5 
35 x3,4 x3,5 
36 - - x3,4x3,6xl,5 h2,1 h2,0h0.2 - h2,2 hl,l hl,Ohl,2 1.2,21.2,1 1.2.0 1.3,1 LJ,O - hO,I h2,1 hl,l hhO,OhLO,O 
37 x4,0 -
38 x4,0 x4,1 
39 x4,2 - x4,0 x4,2 x4, I h2,2 h2,1 h2,0 h3,2 h3,1 h3,0 L2,2 L2, I L2.0 L,3,2 LJ, I L3,0 h0,2 h2,2 hi ,2 hhO,l hL0,1 
40 x42 x4,3 
41 x4,2 x4,3 
42 x4,4 - x4,2 x4,4 x4,3 h2,2 h2,1 h2,0 h4,0 h3,2 h3,1 h3,0 L4,1l L2,2 L2,1 L3.2 LJ,I L2,0 L4,0 LJ,O hh0,2 hL0,2 
43 x4,4 x4,5 
44 x4,4 x4,5 
45 - - x4,4x4,6x4,5h2,2h2,1 h2,0 h4,1h4,0 h3,2h3,1 h3,0 lA, I L4,0L2,2 - L3,2 L2,1 L4,1 L3,\ Lhl,OLLI,O 
It is important to keep in mind that each time a transition from one run to the next is made, when the
column length of an image is odd, the external memory is not scanned for i consecutive clock cycles
with reference to the processor clock, where i = 1, 2, 3, ... denotes the first, second, third scan methods, and so on.
The reason is that during this period the CP would be processing the last coefficient in i columns of each H
and L decomposition that were under consideration in the previous run, as required by the DDG for odd
length signals. No such situation arises when the column length of an image is even.
Table B.4 Second dataflow for intermediate architecture
YH YL
SR0 SR2 SR1 SR0 SR1
Clk Rd0 Rd1 Rt0 Rt2 Rt1 R2 R1 R0 R2 R1 R0 R2 R1 R0 R2 R1 R0 R2 R1 R0 Rt3 Rt4 Rt5
I xO,O -
2 xO,O xO, I 
3 x0,2 - xO,O x0,2 xO, 1 
4 x0,2 x0,3 
5 x0,4 - x0,2 x0,4 x0,3 hO,O 
6 x0,4 x0,5 
7 - x0,4 x0,6 x0,5 hO,I hO,O 
8 xl,O -
9 xl,O xl,l 
10 x\,2 - xl,Ox1,2xl,l h0,2h0,1 hO,O 
II x\,2 xl,3 
12 x\,4 - xl,2 xl,4 xl,3 h0,2 hO,I hO,O 
13 x1.4 xl,5 
14 - xl,4 xl,6 xl,S h0,2 hO,l hO,O 
15 x2,0 -
16 x2,0 x2,1 
LO,O -
LO,l LO.O 
L0,2 LO, I LO,O 
hi,O L0,2 LO, I LO,O Ll ,0 
hl,lhi,O L0,2 LO, I Lll,O L1, I l.l ,0 
17 x2,2 - x2,0x2,2x2,1 h0,2h0,1 hO,O hl,2hl,lhi,O L0,2LO,I LO,O LI,2LI,ILI,O 
18 x2,2 x2,3 
19 x2,4 - x2,2 x2,4 x2,3 h0,2 hO,J hO,O h2,0 hI ,2 hI, I hI ,0 L2,0 L0,2 LO, I "I ,2 Ll, I LO,O L2,0 L I ,0 
20 x2,4 x2,5 
Rt6 Rt7 
21 - x2,4x2,6x2,5 h0,2h0,1 hO,O h2,1 h2,0 h1,2hl,l hi,O L2,1 L2,0L0,2 Ll.2 LO,I L2,1 Ll,l LhO,OLLO,O 
22 x3,0 -
23 x3,0 x3, I 
24 x3,2 - x3,0 x3,2 x3, I h0,2 hO, I hO,O h2,2 h2, I h2,0 hI ,2 hI, I hI ,0 L2,2 L2, I L2,0 L0,2 L2,2 L1 ,2 LhO, I LLO, I 
25 x3,2 x3,3 
26 x3,4 - x3,2 x3,4 x3,3 h2,0 h0,2 hO, I h2,2 h2,1 h3,0 hI ,2 hI ,I L2,2 L2, I L2,0 LJ,O hO,O h2,0 h 1,0 Lh0,2 LL0,2 
27 x3,4 x3,5 
28 - - x3,4 x3,6 x3,5 h2,1 h2,0 h0,2 - h2,2 h3,1 hl,O hl,2 L2,2 L2,1 L2.0 LJ, LJ,O hO,I h2,1 hl,l hhO,O hLO,O 
29 x4,0 -
30 x4,0 x4,1 
3 I x4,2 - x4,0 x4,2 x4, I h2,2 h2, I h2,0 h3,2 hl, I h3,0 L2,2 L2, I L2,0 L) 2 LJ. I LJ,O h0,2 h2,2 hI ,2 hhO, I hLO, I 
32 x4,2 x4,3 
33 x4,4 - x4,2 x4,4 x4,3 h2,2 h2,1 h2,0 h4,0 hl,2 hl,l hl,O L4,0 L2,2L2,1 Ll,2Ll,l L2,0 L4,0 LJ,O hh0,2 hL0,2 
34 x4,4 x4,5 
35 - - x4,4x4,6x4,5h2,2h2,1 h2,0 h4,1h4,0 hl,2h3,1 h3,0 L4,1 L4,0L2,2 - LJ,2 L2,1 L4,1 LJ,I Lhi,OLLI,O 
Control signals such as sre0, sre1, sre2, and incar are issued by the control unit and are loaded, in
every clock cycle, into the first stage of the RP. Then, these signals are carried from stage to stage. When a
stage where a signal is used is reached, that signal is applied and the rest are carried on to the next stage
until the last stage is reached. However, in the 9/7, applying scan methods such as those shown in Figures
3.5.1, 3.5.3, and 3.8.1 would require these signal values of the RP to change as they move from stage to
stage, especially in the last and extra runs. Tables B.5 (a) and (b) and the circuit shown in Figure B.1.1,
which operates according to Table B.5, are provided in order to be applied as described in Section 3.8.3:
• Signal sre1 takes on the signal values of Table B.5 (a) when the row length of an image is odd. In
the case of even length, both sre1 and Q1 are set to 0 in all runs.
• Table B.5 (a) is used for signal sre0 for both odd and even lengths.
• Table B.5 (b) is applied only in the architecture developed based on the scan method of Figure
3.5.1. For the architecture based on the scan method of Figure 3.5.3, signal sre2 is set to alternate
between 0 and 1, while Q2 is set to 0, in the first run. In all subsequent runs, sre2 and Q2 are set to 1
and 0, respectively, as shown in the third row of Table B.5 (b).
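The rules for sre2 and Q2 stated in the last bullet can be written as a small lookup. This is only a sketch: the function name, the string keys for the two scan methods, and the cycle_in_run parameter are hypothetical, and the assumption that the alternation in the first run of the Figure 3.5.3 scan starts at 0 is not stated in the text.

```python
def sre2_q2(scan_figure, run, cycle_in_run=0):
    """Return (sre2, Q2) for a given run; 'X' marks a don't-care.

    scan_figure: "3.5.1" or "3.5.3" (the scan method the architecture is based on).
    run: 1-based run number; run 3 onward includes the extra run.
    cycle_in_run: 0-based cycle index, used only for the alternating case.
    """
    if run < 1:
        raise ValueError("run numbers start at 1")
    if scan_figure == "3.5.1":      # values of Table B.5 (b)
        if run == 1:
            return (0, "X")
        if run == 2:
            return (1, 1)
        return (1, 0)
    if scan_figure == "3.5.3":
        if run == 1:
            # Assumption: the alternation between 0 and 1 starts at 0.
            return (cycle_in_run % 2, 0)
        return (1, 0)               # third row of Table B.5 (b)
    raise ValueError("unknown scan method")
```

For instance, sre2_q2("3.5.1", 2) gives (1, 1), while any run after the first under the Figure 3.5.3 scan gives (1, 0).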
Table B.5 (a) Control signal values
sre0 Q0
sre1 Q1
0 0 0 Run 1 to
0 0 0 the run
0 0 0 before
0 0 0 last.
1 1 0 Last run
1 0 1 Extra run
Table B.5 (b) Control signal values for sre2
sre2 Q2
0 X Run 1
1 1 Run 2
1 0 Run 3 to extra run.
Figure B.1.1 circuit
Table B.6 5/3 Dataflow for overlapped and nonoverlapped
parallel scan architecture
Clk Rt0 Rt2 Rt1 Rd2 Rd3 Rd4 Rd6 Rd5 Rt3 Rt4 Rt5 Rt6 Rt7
I xO,O x0,2 xO, 1 - - -
-
2 xl,Ox\,2xl,l - -
3 x2,0 x2,2 x2, I 
- -
-
4 xJ,O x3,2 x3,1 - - -
5 x4,0 x4,2 x4,1 LO.O 110.0 - - - -
6 x5,0 x5,2 x5, I LO,O Lt ,0 HO,O Hl,O - -
-
7 x6,0 x6,2 x6,1 L2,0 HO,O H2,0 Ill ,0 LO,O 1.2,0 Ll,O 
8 x7,0x7,2x7,l L2,0 LJ,O 112,0 HJ,O 110,0 H2,0 Hl,O -
9 x8,0 x8,2 x8, I L4,0 H2,0 114,0 HJ,O L2,0 L4,0 LJ,O -
10 x9,0 x9,2 x9,1 L4,0 LS,O H4,0 HS,O H2,0 H4,0 Hl,O LHO,O LLO,O 
II xlO,O xl0,2 xlO,l L6,0 H4,0 H6,0 HS,O L4,0 L6,0 LS,O HHO,O llLO,O 
12 xll,Oxll,2xll,l L6,0 L7,0 H6,0 117,0 H4,0 H6,0 HS,O LH1,0 LLl,O 
1 3 x\2,0 x12,2 xl2,1 L8,0 - H6,0 H8,0 H7,0 L6,0 L8,0 L7,0 HH1,0 HLt,O 
14 x\3,0 xl3,2 xl3,1 L8,0 L9,0 H8,0 
-
H9,0 H6,0 H8,0 H7,0 LH2,0 LL2,0 
15 x14,0 xl4,2 x\4,1 L!O,O H8,0 H 10,0 H9,0 L8,0 Lt 0,0 L9,0 HH2,0 HL2,0 
16 xl5,0 x15,2 xiS,! LlO,OL11,0 1110,0 Hl1,0 H8,0 H10,0 H9,0 Llll,O LLJ,O 
Table B.7 5/3 Dataflow for intermediate parallel scan architecture
YL YH
SR0 SR1 SR0 SR2 SR1
Clk Rd Rt0 Rt2 Rt1 R2 R1 R0 R2 R1 R0 R2 R1 R0 R2 R1 R0 R2 R1 R0 Rt3 Rt4 Rt5 Rt6 Rt7
1 x0,2 xO,O x0,2 xO, 1 
2 x0,4 x0,2 x0,4 x0,3 
3 x0,6 x0,4 x0,6 x0,5 
4 xl,2 xl,O x1,2 x\,1 
5 xl,4 xl,2 xl,4 x\,3 LO,O 110,0 
6 x1,6 xl,4xl,6x1,5 LO,l LO,O HO,l HO,O-
7 x2,2 x2.0 x2,2 x2,1 L0,2 LO, I LO,O H0,2 HO,I HO,O 
8 x2,4 x2,2 x2,4 x2.3 L0,2 LO,l LO,O LI,O H0,2 HO,I HO,O HI,O 
9 x2,6 x2,4 x2,6 x2,5 L0,2 LO, I LO,O Ll, I L1 ,0 - H0,2 HO, I HO,O HI, I HI ,0 
10 x3,2 xl,O x3,2 xl,l L0,2 LO,I LO,O l.l,21.1,1 LI,O H0,2 HO,I HO,O Hl,2 HI,! HI,O 
II xl,4 x3,2xl,4xl,J L2,0L0,2LO,I - LI,2LI,I H0,2HO,I HO,OH2,0 HI,2HI,I HI,O !.O,OL2,0LI.O 
12 x3,6 x3,4x3,6x3,5 L2,1 L2,0L0,2 - - L\,2 H0,2HO,l HO,OH2,1 H2,0 - H1,2Ht,l Hl,O LO,l L2,1 Ll,l 
I 3 x4,2 x4,0 x4,2 x4, I L2,2 L2.1 L2,0 - - H0,2 HO, I HO,O H2,2 H2, I H2,0 HI ,2 HI ,I HI ,0 L0,2 L2,2 Ll.2 
14 x4,4 x4,2 x4,4 x4,3 L2,2 L2,1 L2,0 Ll,O H2,0 H0,2 HO,I - H2,2 H2,1 Hl,O Hl,2 HI.! HO.O H2,0 HI,O LHO,O LLO,O 
15 x4,6 x4,4x4,6x4,5 L2,2L2,1 L2,0 LJ,I LJ,O H2,1 H2,0H0,2 - - H2,2 H3,1 H3,0HI,2HO,I H2,1 HI.! LIIO,I Ll.O.I 
16 x5,2 x5,0 x5,2 x5,1 L2,2 L2,1 L2,0 LJ,2 LJ,I LJ,O H2,2 H2,1 H2,0 - - HJ,2 HJ,I H3,0 H0,2H2,2 Hl,2 LII0,2 LL0,2 
17 x5,4 x5,2 x5,4 x5,3 !A,O L2,2 L2,1 - LJ,2 LJ,I H2,2 H2,1 H2,0 H4,0 Hl,2 Hl,l H3,0 L2,0 L4,0 LJ,O HHO,O HLO,O 
18 x5,6 x5,4x5,6x5,5 !A,I L4,0L2,2 - - L3,2 H2,2H2,1 H2,0H4,1 H4,0 - Hl,2Hl,l H3,0 L2,1 L4,1 Ll,l HHO,I HLO,I 
19 x6,2 x6,0 x6,2 x6, I !A,2 L4,1 L4,0 - H2,2 112, I 112,0 H4,2 H4, I H4,0 Hl,2 HJ, I Hl,O L2,2 !A,2 L3,2 IIHO,I HLO,I 
20 x6,4 x6,2 x6,4 x6,3 !A,2 L4,1 L4,0 LS,O H4,0 H2,2 H2,1 H4,2 H4,1 HS,O 113,2 H3,1112,0 H4,0 Hl,O LHI,O LLI,O 
21 x6,6 x6,4 x6,6 x6,5 !A,2 L4, I L4,0 L5,1 L5,0 114, I H4,0 H2,2 - - H4,2 HS, I H5,0 Hl,2 H2, I H4, I HJ, I LH 1.1 LL I, I 
B.2 Dataflow tables of chapter 4 
Table B.8 Dataflow for 2-parallel architecture 
CK RP RP1 & RP2 Rth Rtl CP1 input latches CP2 input latches CP1 & CP2 outputs
Rt0 Rt2 Rt1 Rt0 Rt2 Rt1 Rt0 Rt2 Rt1 Rth Rtl Rth Rtl
I I xO,O x0,2 xO,I 
2 2 x!,O xl,2 xl ,I 
3 I x2,0 x2,2 x2,1 
4 2 x3,0 x3,2 x3,l 
5 I x4,0 x4,2 x4,1 
6 2 xS,O x5,2 x5,1 
7 I x6,0 x6,2 x6,1 HO,O LO,O 
8 2 x7,0 x7,2 x7,1 HI,O LI,O 
9 I x8,0 x8,2 x8,1 112,0 L2,0 HO,O H2,0 HI,O LO,O L2,0 LI,O 
10 2 x9,0 x9,2 x9,1 HJ,O Ll,O ---------------------- ---------- ------- -------
II I xlO,O x10,2 x!OJ H4,0 !A,O H2,0 H4,0 Hl,O L2,0 L1,0 Ll,O 
12 2 xll,O xll,2 xll,l H5,0 L5,0 ----- ---------------- ----------------- -------
13 I xl2,0 xl2,2 xl2,1 H6,0 L6,0 H4,0 H6,0 H5,0 !A,O 1.6,0 L5,0 
14 2 xl3,0 x13,2 x13,1 H7,0 L7,0 ----- ---------------- ----------------- -------
15 I xl4,0 xl4,2 xl4,1 H8,0 L8,0 H6,0 H8,0 H7,0 L6,0 L8,0 L7,0 HHO,O HLO,O LHO,O LLO,O 
16 2 x!S,O xl5,2 x\5,1 H9,0 L9,0 ---------------------- ------ ---- ------- ----- -----------------------------------
17 I xl6,0 x\6,2 xl6,1 HIO,O LIO,O H8,0 HIO,O H9,0 L8,0 1.10,0 L9,0 HHI.O HLI,O LHI,O LLI,O 
18 2 xl7,0 xl7,2 x\7.1 HII,O LII,O ---------------------- ------ ---- ------- ----- ------------------------------------
19 I xl8,0 xl8,2 xl8,1 Hl2,0 L12,0 HIO,O Hl2,0 HII,O LIO,O 1.12,0 LII,O HH2.0 HL2,0 l.H2,0 I.L2,0 
Table B.9 Dataflow of the 3-parallel architecture
Ck RP RP's input latches RP's output latches (CP1 & CP3) / CP2 input latches
f, Rt0 Rt2 Rt1 Rth Rtl Rtl3b Rt0 Rt2 Rt1 Rt0 Rt2 Rt1
I I X 0,0 X 0,2 xO,I 
2 2 X 1,0 xl,2 xl,l 
3 3 X 2,0 X 2,2 X 2,1 
4 I X 3,0 X 3,2 X 3,1 
5 2 X 4,0 X 4,2 X 4,1 
6 3 X 5,0 X 5,2 X 5,1 
7 I X 6,0 X 6,2 x6,1 
8 2 X 7,0 X 7,2 X 7,1 
9 3 X 8,0 X 8,2 X 8,1 
10 I X 9,0 X 9,2 X 9,1 110,0 LO.O 
II 2 X 10,0 X 10,2 X 10,1 HI,O L 1.0 
12 3 X 11,0 X 11,2 X 11,1 H2,0 L2.0 
13 I X 12,0 X 12,2 X 12,1 H3,0 L3.0 110,0 112,0 HI,O LO,O L2,0 LI,O 
14 2 X 13,0 X 13,2 X 13,1 H4,0 L4,0 ··------------------- -- 112,0 114,0 H3,0 
15 3 X 14,0 X 14,2 X 14,1 115,0 L2,0 L5,0 -----------------------------------------------------
16 I X 15,0 X 15,2 X 15,1 H6,0 L6,0 -------- H4,0 1160 H5,0 L2,0 L4,0 L3,0 
17 2 X 16,0 X 16,2 X 16,1 H7,0 L7,0 -------- ----------------------- L4,0 L6.0 L5,0 
18 3 X 17,0 X 17,2 X 17,1 H8,0 L8,0 -------- -----------------------------------------------------
19 I X 18,0 X 18,2 X 18,1 H9,0 L9,0 -------- H6,0 H8,0 H7,0 L6,0 L8,0 L7,0 
20 2 X 19,0 X 19,2 X 19,1 1110,0 LIO,O ------- ------------------------ H8,0 1110,0 119,0 
21 3 X 20,0 X 20,2 X 20,1 Hll.O L8,0 Lll,O -----------------------------------------------------
22 I X 21,0 X 21,2 X 21,1 Hl2,0 Ll2.0 ------- HIO.O 1112,0 1111,0 L8,0 LIO,O L9,0 
23 2 X 22,0 X 22,2 x22,1 Hl3,0 Ll3.0 ------- ------------------------- LIO,O Ll2,0 Lll,O 
24 3 X 23,0 X 23,2 X 23, I Hl4,0 Ll4,0 ------- -----------------------------------------------------
25 I X 24,0 X 24,2 X 24, I Hl5,0 L15,0 ------- Hl2.0 1114,0 Hl3,0 Ll2,0 L14,0 L13,0 
26 2 X 25,0 X 25,2 X 25,1 Hl6,0 Ll6,0 ------- ------- --- ------- ---- -- --- Hl4,0 H16,0 H15,0 
27 3 X 26,0 X 26,2 X 26,1 Hl7,0 Ll4,0 Ll7,0 -----------------------------------------------------
28 I X 27,0 X 27,2 X 27,1 Hl8,0 Ll8,0 ------- Hl6.0 Hl8,0 Hl7,0 Ll4,0 Ll6,0 Ll5,0 
29 2 X 28,0 X 28,2 X 28,1 1119,0 Ll9.0 ------- ------- --- ------ ---------- L16,0 LI8,0 Ll7,0 
CK CP1 & CP3 output latches CP2 output latches
Rth Rtl Rth Rtl Rth Rtl 
22 HHO,O HLO,O LHO,O LLO.O 
23 ----------------------------------- IIIII ,0 HLI, 0 
24 
25 HH2,0 HL2,0 LHI,O LLI,O 
26 Lll2,0 LL2,0 
27 
28 HH3,0 HL3,0 LH3.0 LL3,0 
29 HH4,0 HL4,0 
Table B.10 5/3 4-parallel architecture's dataflow
Ck RP RP's input latches RP's output latches CP1 & CP3 input latches
Rt0 Rt2 Rt1 Rth Rtl Rt0 Rt2 Rt1 Rt0 Rt2 Rt1
I I X 0,0 X 0,2 xO,I 
2 2 X 1,0 xl,2 X 1,1 
3 3 X 2,0 X 2,2 X 2,1 
4 4 X 3,0 X 3,2 X 3,1 
5 I X 4,0 X 4,2 X 4,1 
6 2 X 5,0 X 5,2 X 5,1 
7 3 X 6,0 X 6,2 x6,1 
8 4 X 7,0 X 7,2 X 7,1 
9 I X 8,0 x8,2 X 8.1 
10 2 X 9,0 X 9,2 X 9,1 
II 3 X 10,0 xl0.2 X 10,1 
12 4 X 11,0 X 11.2 X 11,1 
13 I X 12,0 X 12,2 X 12,1 HO,O LO,O 
14 2 X 13,0 X 13,2 X 13,1 HI,O LI,O 
15 3 X 14,0 X 14,2 X 14,1 H2,0 L2,0 HO.O H2,0 HI.O LO,O L2,0 LI,O 
16 4 X 15,0 X 15,2 X 15,1 H3,0 L3,0 ~----------------------------------------------------
17 I X 16,0 X 16,2 X 16.1 H4,0 L4,0 -----------------------------------------------------
18 2 X 17,0 X 17,2 X 17,1 H5,0 L5,0 -----------------------------------------------------
19 3 X 18,0 X 18,2 X 18,1 !16,0 L6,0 H4,0 H6l· H5,0 L4.0 L6.0 L5,0 
20 4 X 19,0 X 19,2 X 19.1 117,0 L7,0 -----------------------------------------------------
21 I X 20,0 X 20,2 x20,1 H8,0 L8,0 -----------------------------------------------------
22 2 X 21,0 X 21,2 X 21,1 H9.0 L9,0 -----------------------------------------------------
23 3 x22.0 X 22.2 X 22,1 HIO,O LIO,O H8,0 Hi'J,O H9,0 L8,0 LIO.O L9,0 
24 4 X 23.0 X 23,2 x23.1 HII,O LII,O -----------------------------------------------------
25 I X 24,0 X 24,2 X 24,1 Hl2,0 Ll2,0 ------------·----------------------------------------
26 2 X 25,0 X 25.2 X 25,1 HIJ.O L13,0 
-----------------------······························ 
27 3 X 26,0 X 26,2 X 26,1 H14,0 Ll4.0 Hl2.0 Hl4,0 Hl3.0 Ll2,0 1.14.0 Ll3,0 
28 4 X 27,0 X 27,2 X 27.1 Hl5,0 Ll5,0 .......................... ····-····-· ····--····-···- w 
29 I X 28.0 X 28,2 X 28,1 1116,0 Ll6,0 
---·- -------· --- ---------- ----- ------ ----------· -··· w 
CP2 & CP4 input latches CP1 & CP3 output latches CP2 & CP4 output latches
CK Rt0 Rt2 Rt1 Rt0 Rt2 Rt1 Rth1 Rtl1 Rth3 Rtl3 Rth2 Rtl2 Rth4 Rtl4








25 1110,0 Hl2,0 HII.O LIO,O Ll2,0 LII,O 
26 
27 -------------···-------·-----·----··-----··· HHO,O HLO,O LHO,O LLC, 0 
28 
29 H 14,0 H 16,0 HI 5.0 L 14,0 L 16,0 Ll 5,0 ---------------------·----------·-- HHLO HLI,O LHI.O LLI.O 
Table B.11 4-parallel's TLBs read and write dataflow for case 2
RP1
Stage 2 Stage 3
Ck Rt0 Rt1 Sa12 1a 1b BIR1 TLB1 BOR1 Rt2 Rt0 Rt1
(, 
5 xO,O HO,O I 0 0 ------- ---- ----- -----
7 xO.O HO.O I 0 0 ------- ------ ----- ------
9 x4,0 H4,0 I 0 0 H2,0 ------ xO,O HO,O 
II x4,0 H4,0 I 0 0 H2.0 ------ xO,O HO,O 
" 
13 x8,0 H8,0 I I 0 H6,0 H2.0 
------
x4,0 H4.0 
" 15 x8.0 H8,0 I I 0 H6,0 H2,0 x4,0 H4,0 <>: ------
17 xl2,0 Hl2,0 I 2 0 HIO,O H2,0 H6,0 
------
x8,0 H8.0 
19 X 12,0 1112,0 I 2 0 H10,0 H2,0 H6,0 
------
x8,0 H8,0 
21 x16,0 Hl6,0 I 3 0 Hl4,0 H2,0 H6.0 H 10,0 ------ xl2,0 Hl2.0 
23 xl6,0 Hl6,0 I 3 0 H14,0 H2,0 H6,0 HI 0,0 
------
xl2,0 Hl2,0 
25 x2.2 H2,1 I 0 0 HO,I H2,0 H6,0 H 10,0 1114,0 ------ xl6,0 Hl6,0 
27 x2,2 H2.1 I 0 0 HO,I H2,0 H6,0 H 10,0 H 14,0 H2,0 ------ xl6,0 Hl6,0 
29 x6,2 H6,1 I I 0 114,1 HO.I H6,0 H 10.0 1114,0 H2,0 H2,2 x2.2 H2,1 
N 31 x6,2 116,1 I I 0 H4,1 HO,I H6,0 HIO,O Hl4,0 H6,0 H2,2 x2.2 H2,1 
" 
" 33 x10,2 HIO,I I 2 0 118, I 110, I 114, I H 10,0 1114,0 116,0 H6,0 x6,2 H6,1 
"' 35 xl0,2 HIO,I I 2 0 H8,1 HO,I H4,1 HIO,O Hl4,0 HIO,O H6,0 x6,2 H6,1 
37 xl4.2 Hl4,1 I 3 0 1112,1 HO, I H4, I 118.0 1114,0 1110,0 HIO,O xl0,2 HIO,I 
39 xl4,2 Hl4,1 I 3 0 1112,1 110, I H4, I H8, I 1114.0 1114,0 HIO,O x10.2 HIO,I 
41 x0,4 110,2 0 4 0 1116,1 110, I 114.1 118, I 1112.1 1114,0 Hl4,0 xl4,2 Hl4,1 
"' 
43 x0,4 H0.2 0 4 0 Hl6,1 HO.I H4,1 H8,1 Hl2,1 HO.I Hl4,0 xl4,2 Hl4,1 
" 45 x4.4 H4,2 0 0 I H2,2 ----- H4, I H8, I Hl2,1 H16,1 HO,I HO,I x0,4 H0,2 
" 
"' 47 x4,4 H4,2 0 0 I 112,2 ----- H4.1 118, I 1112,11116,1 114, I HO,I x0.4 H0,2 49 x8,4 H8.2 0 I 2 H6,2 H2,2 ----- H8, I 1112,1 Hl6.1 H4,1 H4,1 x4,4 H4,2 
RP3
Stage 2 Stage 3
Ck Rt0 Rt1 Sa34 3a 3b BIR3 TLB3 BOR3 Rt2 Rt0 Rt1
(, 
7 x2,0 H2,0 I 0 0 HO.O ------ ------ ------
9 x2,0 H2,0 I 0 0 HO,O ------ ------ ------
II x6.0 H6.0 I I 0 H4.0 HO,O ----- x2.0 H2,0 
- 13 x6,0 H6.0 I I 0 H4.0 110,0 ----- x2,0 H2.0 
" 
" 15 x10.0 HIO,O I 2 0 H8,0 HO.O H4,0 x6,0 H6,0 <>: -----
17 x!O.O HIO,O I 2 0 H8,0 HO.O H4,0 ----- x6.0 H6,0 
19 xl4,0 Hl4.0 I 3 0 Hl2,0 HO,O H4,0 H8,0 ----- x!O,O 1110,0 
21 x14,0 Hl4.0 I 3 0 H12,0 HO.O H4,0 H8,0 ----- xiO.O HIO,O 
23 x0,2 HO.I 0 4 0 1116,0 110,0 114,0 H8,0 H 12,0 ----- xl4,0 Hl4,0 
25 x0.2 HO,I 0 4 0 Hl6,0 110,0 H4,0 H8,0 H 12,0 HO.O ----- xl4.0 Hl4,0 
27 x4.2 H4.1 0 0 I H2,1 ----- H4,0 H8,0 Hl2,0 Hl6,0 HO,O HO,O x0.2 HO,I 
29 x4,2 114,1 0 0 I 112, I ----- H4.0 H8,0 1112.0 Hl6,0 H4,0 HO,O x0,2 HO,I 
';j 31 x8.2 H8.1 0 I 2 116, I 112, I ----- H8,0 Hl2.0 Hl6,0 H4,0 H4.0 x4.2 H4,1 
" 33 x8.2 H8.1 0 I 2 H6,1 H2,1 ----- H8.0 Hl2.0 Hl6.0 H8.0 H4.0 x4.2 H4,1 <>: 
35 xl2,2 H12,1 0 2 3 1110.1 112, I H6.1 ------ H 12,0 H 16,0 H8,0 H8.0 x8,2 H8,1 
37 x12,2 Hl2.1 0 2 3 1110.1 112,1 H6.1 ------ 11 12,0 H 16,0 Hl2,0 H8.0 x8,2 H8.1 
39 xl6,2 Hl6,1 0 3 4 Hl4.1 H2,1 H6,1 1110,1------ H16,0 Hl2,0 Hl2.0 xl2.2 H12,1 
41 xl6,2 Hl6,1 0 3 4 1114, I 112, I 116.1 HIO,I------ Hl6,0 Hl6,0 Hl2,0 xl2,2 Hl2,1 
43 x2,4 H2.2 I 0 0 H0.2 H2.1 H6,1 HIO,I1114,1------ 1116,0 1116,0 xl6,2 Hl6.1 
"' 45 x2,4 H2.2 I 0 0 110,2 112,1 H6.1 HIO,I Hl4.1------ H2.1 H16,0 xl6,2 Hl6.1 
" 
" 47 x6,4 H6.2 I I 0 H4,2 H0,2 H6,1 HI 0, I H 14, I ------ H2,1 H2,1 x2,4 112,2 <>: 
49 x6,4 H6.2 I I 0 114.2 H0,2 H6, I HI 0, I 1114. I ------ H2.1 H2,1 x2,4 H2.2 
1a: TLBAR1a, 1b: TLBAR1b, 3a: TLBAR3a, 3b: TLBAR3b
Table B.12 Dataflow for 2-parallel intermediate architecture (k=3)
Ck RP Rd0 RP's input latches RdH SRH0 SRH1 RdL SRL0 SRL1
Rt0 Rt2 Rt1 R2 R1 R0 R2 R1 R0 R2 R1 R0 R2 R1 R0
I I x0,2 X 0,0 X 0,2 X 0, I 
2 2 X 0,4 X 0,2 X 0,4 X 0,3 
3 I X 0,6 X 0,4 X 0,6 X 0,5 
4 2 X 1,2 X I ,0 X L2 X I' 1 
5 I X ],4 x1,2xl,4xl,3 
6 2 X ],6 X ] ,4 X ] ,6 X J ,5 
7 I X 2,2 X 2,0 X 2,2 X 2,] HO,O ----- ----- LO,O ----- -----
8 2 X 2,4 X 2,2 X 2,4 X 2,3 HO,I HO,O ----- LO.l LO.O -----
9 I X 2,6 X 2,4 X 2,6 X 2,5 H0,2 1!0, I HO,O L0,2 LO,I LO,O 
10 2 X 3,2 x3,0 x3.2 x3,1 HI,O ----- ----- l.l,O ----- -----
II I X 3,4 X 3,2 X 3,4 X ),3 Hl,l HI,O ----- Ll,l Ll.O -----
12 2 X 3,6 X 3,4 X 3,6 X ),5 H0,2 HO,l HO.O H1,2 Hl,l HI,O L0,2 LO, I LO,O 1.1.2LI.I LI,O 
13 I X 4,2 x4,0 x4,2 x4,1 H2,0 H0,2 HO,I 
-----
Hl,2 Hl,l L2,0 L0,2 LO, I ------ L1,2 Ll,l 
14 2 X 4,4 X 4,2 X 4,4 X 4,3 H2,1 H2,0 H0,2 110, I ----- 111,2 Hl,l L2,1 L2,0 L0,2 LO, I 
------ L1 ,2 L1, I 
15 I X 4,6 X 4,4 X 4,6 X 4,5 H2,2 H2, I H2,0 H 0,2 
-----
Hl,2 Hl,l L2,2 L2,1 L2,0 L0,2 ------ L1,2 Ll,l 
16 2 X 5,2 X 5,0 X 5,2 X 5,] H2,2 H2, I H2,0 H 0,2 H3,0 ----- Hl,2 L2,2 L2,1 L2,0 L0,2 LJ,O ------ Ll,2 
17 I X 5,4 X 5,2 X 5,4 X 5,3 ------ H2,2 H2, I H 2,0 H3,1 HJ,O ------ ------ L2,2 L2.1 L2,0 LJ.I LJ,O -----
18 2 X 5,6 X 5,4 X 5,6 X 5,5 
------
H2,2 112, I H 2,0 H3,2 HJ,I HJ,O ------ L2,2 L2,1 L2,0 LJ,2 LJ,l LJ.O 
19 I X 6,2 X 6,0 X 6,2 X 6, I ------ H4,0 H2,2 H2, I ------ H3,2 H3,1 ------ L4,0 L2.2 L2.1 ------ L3,2 LJ,I 
20 2 X 6,4 X 6,2 X 6,4 X 6,3 1!4, I H4,0 H2,2 H2. I ------ H3,2 1!3,1 U,l L4,0 1.2,2 L2, I ------ LJ.2 LJ, I 
21 I X 6,6 x6,4 x6,6 x6,5 H4,2 H4,1 H4,0 H2,2 ------ 1!3,2 H3, I U.2 L4,1 L4,0 L2,2 ------ Ll,2 Ll,l 
22 2 X 7,2 x7,0 x7,2 x7,1 H4,2 H4,1 H4,0 H2,2 H5,0 ------ 113,2 U-,2 L4,1 L4,0 L2,2 LS,O ----- Ll.2 
23 I X 7,4 X 7,2 X 7,4 X 7,3 ·----- H4,2 H4,1 1!4,0 H5,1 HS,O ------ ------ L4.2 lA,! L4,0 L5,1 LS,O -----
24 2 X 7,6 X 7,4 X 7,6 X 7,5 ----- H4,2 H4,1 H4,0 H5,2 115, I H5,0 
--·--- L4,2 L4,1 L4,0 L5,2 L5, I LS,O 
25 I X 8,2 X 8,0 X 8,2 X 8,] 
------
H6,0 H4,2 H4, I ------ H5,2 H5, I --··--- L6,0 L4,2 1.4, I ----- L5,2 L5,1 
Ck RP CP1 & CP2 input latches CP1 & CP2 output latches
RtO Rt2 Rtl RtO Rt2 Rtl Rtl RtJ Rtl RtO 
13 HO,O H2,0 Ill ,0 LO,O L2,0 L I ,0 
14 2 
15 I HO,l H2,1 111,1 LO,l L2,1 Ll,l 
16 2 
17 I H0,2 H2,2 111,2 L0,2 L2,2 Ll,2 
18 2 
19 I H2,0 H4,0 113,0 L2,0 L4,0 LJ,O HHO,O HLJ,O LHO,O LLO,O 
20 2 
21 I H2,1 H4,1 H3,1 L2,1 L4,1 L3,1 11110, I HLJ, I LHO, I LLO, I 
22 2 
23 H2,2 H4,2 H3,2 L2,2 L4,2 L3,2 HH0,2 HL0,2 LH0,2 LL0,2 
24 2 
25 114,0 116.0 H5,0 L4,0 L6,0 1.5,0 HHI,O HLI,O LIII,O LLI,O 
Table B.13 Dataflow of the last run for cases 4 and 3 when N is even



























Rt0 Rt2 Rt1 R2 R1 R0 R2 R1 R0 R2 R1 R0
H6,0 H4,2 H4, I ------ H5,2 HS, I 1.6,0 L4,2 1.4,1 
X 0,8 X 0,6 X 0,8 X 0,7 116,1 116,0 114,2 H4, I ------ H5,2 HS, I 1.6,1 1.6,0 L4,2 1.4,1 
X 0,8 X 0,8 X 0, 9 H6,2 H6, I H6,0 H4,2 ------ H5,2 H5, I 1.6,2 1.6, I L6,0 1.4,2 
x\,8 xl,6xl,8xi,7 H6,2 H6,1 H6,0 H4,2 H7 ,0 ------ H5,2 1.6,2 1.6, I 1.6,0 1.4,2 
X \,8 X \,8 X \,9 116,2 116, I H6,0 H7,1 H7,0 ------ 1.6,2 1.6, I 1.6,0 
X 2,8 X2,6 X 2,8 X 2,7 H6,2 H6, I H6,0 H7,2 H7,1 H7,0 1.6,2 L6, I 1.6,0 
x2,8 x2,8 x2,9 ----- H6,2 H6, I ------ H7,2 117, I ------ 1.6,2 1.6, I 
X ),8 X ),6 X ),8 X 3,7 H0,3 ----- 116,2 H6, I ------ H7,2 117, I 1.0,3 ------ 1.6,2 1.6, I 
X 3,8 X 3,8 X 3,9 H0,4 HO,J ----- H6,2 ------ H7,2 H7,1 1.0,4 1.0,3 ------ 1.6,2 
x4,8 x4,6 x4,8 x4,7 110,4 H0,3 ----- H6,2 H\,3 ------ H7,2 1.0,4 LO.J ----- 1.6,2 
X 4,8 X 4,8 X 4,9 H0,4 H0,3 ----- HI ,4 HI ,3 ------ 1.0,4 1.0,3 -----
X 5,8 X 5,6 X 5,8 X 5,7 H2,3 ----- H0,4 HO,l ------ HI ,4 Ill.) l.2.3 ----- 1.0,4 1.0,3 
X 5,8 X 5,8 X 5,9 H2,4 H2,3 ----- H0,4 ------ H1,4 Hl,l 1.2,4 1.2,3 ----- 1.0,4 
X 6,8 X 6,6 X 6,8 X 6,7 H2,4 112,3 ----- H0,4 HJ,3 ------ 111,4 1.2,4 1.2,3 ----- 1.0,4 
x6,8 x6,8 x6,9 H2,4 H2,3 ----- H3,4 H3,3 ----- L2,4 1.2,3 -----
x7,8 x7,6 x7,8 x7,7 H4,3 ------ H2,4 H2,3 ----- H3,4 H3,.1 1.4,3 ------ 1.2,4 1.2,3 
X 7,8 X 7,8 X 7,9 H4,4 114,3 ------ H2,4 ----- H3,4 113,3 L4,4 L4,3 ------ L2,4 
---------------------- H4,4 H4,3 ------ H2,4 115,3 ---- H3,4 1.4,4 1.4,3 ------ 1.2,4 
---------------------- H4,4 H4,3 ------ H5,4 H5,3 ----- L4,4 L4,3 ------
---------------------- H6,3 ----- 114,4 H4,3 ------ H5,4 115,3 1.6.3 ----- 1.4,4 1.4,3 
---------------------- H6,4 H6,3 ----- H4,4 ------ H5,4 H5,3 1.6,4 1.6.3 ----- 1.4,4 
---------------------- H6,4 H6,3 ··--- H4,4 H7,3 ------ 115,4 1.6,4 1.6,3 ----- 1.4,4 






H6,4 H6,3 H7,4 117,3 1.6,4 1.6,3 
------ H6,4 H7,4 H7,3 ------ L6,4 
------ H6,4 ------ 117.4 ------ 1.6,4 
CP1 & CP2 input latches
Rt0 Rt2 Rt1 Rt0 Rt2 Rt1
H4,2 H6,2 H5,2 1.4,2 L6,2 1.5,2 
H6,0 H6,0 H7,0 L6,0 1.6,0 1.7,0 
CP1 & CP2 output latches
Rt Rt Rt Rt 
Hlll ,2 IlL 1,2 LH I ,2 LLI ,2 
Hll2,0 HL2,0 LH2,0 1.1.2,0 
33 H6,1 H6,1 H7,1 L6,1 L6,1 1.7,1 H112,1 HL2,1 LH2,1 1.1.2,1 
34 2 
35 I 1-16,2 1-16,2 H7,2 L6,2 L6,2 1.7,2 HH2,2 HL2,2 LH2,2 LL2,2 
36 2 
37 I HO,J H2,3 111,3 L0,3 1.2,3 Ll,J Hll3,0 Hl.J,O LH3,0 LLJ,O 
38 2 
39 110,4 112,4 Hl,4 1.0,4 1.2.4 Ll,4 HH3,1 HLJ.I Llll,l LLJ,I 
40 2 
41 ll2,3 114,3 H3,3 L2,3 L4,3 L3,3 HH3,2 HL3,2 Lll3,2 LL3,2 
42 2 
f--74 3~-t-c;l:--t--'11..,2,_., 4_._11,.4,., 4_._Hoc3,_,, 4_-"L"-2 ,_.4 --"L"4'-',4~L~l,_., 4-j-_.H=II 0) H LO ,3 LIIO ,3 L 1.0 ,3 
44 2 
45 I H4,3 H6,3 H5,3 L4.3 L6,3 1.5,3 11110.4 HL0,4 LH0,4 1.1.0,4 
46 2 
47 114,4 116,4 H5,4 1.4,4 1.6,4 1.5,4 HH1,3 HL1,3 Ll11,3 LLI,3 
48 2 
49 116,3 116,3 H7,3 1.6,3 1.6,3 1.7,3 HHI,4 HL1,4 Ll11,4 LLI,4 
50 2 
51 H6,4 H6,4 H7,4 1.6,4 1.6,4 1.7,4 HH2,l HL2,3 LH2,3 1.1.2,3 
SRL1
R2 R1 R0
----- L5,2 L5,1
----- L5,2 L5,1
----- L5,2 L5,1
L7,0 ------ L5,2
L7,1 L7,0 ------




L1,3 ------ L7,2
L1,4 L1,3 -----
------ L1,4 L1,3
------ L1,4 L1,3
L3,3 ------ L1,4
L3,4 L3,3 -----
----- L3,4 L3,3
----- L3,4 L3,3
L5,3 ----- L3,4
L5,4 L5,3 -----
------ L5,4 L5,3
------ L5,4 L5,3




------ L7,4
Table B.14 Dataflow of the last run for cases 4 and 3 when N is odd






















RtO Rt2 Rtl R2 Rl RO R2 Rl RO R2 Rl RO R2 Rl RO 
H4,2 H4,1 H4,0 H2,2 HS,O ------ H3,2 1.4,2 L4,1 L4,0 L2,2 L5,0 ---- L3,2 
H4,2 H4,1 114,0 H5, I H5,0 ----- 1.4,2 1.4,1 1.4,0 1.5, I 1.5,0 ----
H4,2 H4,1 H4,0 H5,2 H5, I H5,0 L4,2 L4, I L4,0 L5,2 LS, I L5,0 
H6,0 H4,2 H4, I H5,2 H5,1 L6,0 L4,2 L4, I ----- L5,2 L5, I 
x0,8 x 0,6 x 0,8 x0,7 H6, I H6,0 H4,2 H4, I H5,2 HS, I 1.6, I L6,0 L4,2 L4, I ----- L5,2 1.5, I 
x 0,8 ------ ----- H6,2 H6, I H6,0 H4,2 H5,2 HS, I L6,; L6, I L6,0 L4,2 ----- L5,2 L5, I 
x1,8 x 1,6 x 1,8 x\,7 H6,2 H6,1 H6,0 H4,2 ------ H5,2 L6,; L6,1 L6,0 L4,2 ----- ----- L5,2 
X I ,8 ------ ----- H6,2 H6, I 116,0 L6,2 1.6, I L6,0 
x2,8 X2,6 X 2,8 x2, 7 116,2 116, I H6,0 L6,2 L6, I L6,0 
X 2,8 ---·-- ····· ------ H6,2 H6, I ------1.6,2 1.6, I 
x3,8 x 3,6 x 3,8 x3,7 110,3 ------ H6,2 H6,1 -------------------- LO,'· ------1.6,2 L6.1 
X 3,8 ···-·- ----- 110,3 ------ H6,2 -------------------- LD.'· 1.0,3 ------1.6,2 
x4,8 x4,6 x4,8 x4,7 H0,3 ------H6) Hl,3 ------------ LO,<· L0.3------L6,2 Ll,3 ----------
x4,8 ------ ----- ------ H0,3 ----- Hl,J ---- L0,4 LO,J ---- Ll.4 Ll,J -----
x5,8 x 5,6 x 5,8 x5,7 H2,3 ------ ----- H0,3 ----- Hl,3 L2) ----- L0,4 L0,3 ----- Ll,4 LIJ 
X 5,8 ------ -----
x6,8 x 6,6 x 6,8 x6,7 






















112,3 ------ ----- ---- Ill ,3 1.2,<1 L2,3 ---- L0.4 ----- L 1.4 L I ,3 
H2,3 ------ ----- H3,3 L2,'1 L2,3 ---- L0,4 LJ,J ----- 1.1.4 
H2,3 ----- H3,3 ----- L2,4 L2,3 ---- L3,4 L3,3 -----
H4,3 ------ H2,3 ----- H3,3 1..4,:1 ------1.2,4 L2,3 L3,4 
LJ,J 
H4,3 ----- 113,3 1.4,4 L4,3 ---- L2,4 L3,4 LJ,J 
H4,3 H5,3 ------ ----- L4,'1 L4,3 ---- 1.2,4 L5,3 ---- LJ,4 
CP1 & CP2 input latches
Rt0 Rt2 Rt1 Rt0 Rt2 Rt1
H2,2 H4,2 H3,2 L2,2 L4,2 L3,2
H4,0 H6,0 H5,0 L4,0 L6,0 L5,0
H4,1 H6,1 H5,1 L4,1 L6,1 L5,1
H4,2 H6,2 H5,2 L4,2 L6,2 L5,2
H6,0 ------ ------ L6,0 ------ ------
H6,1 ------ ------ L6,1 ------ ------
H6,2 ------ ------ L6,2 ------ ------
H0,3 H2,3 H1,3 L0,3 L2,3 L1,3
----------------------- L0,4 L2,4 L1,4
H2,3 H4,3 H3,3 L2,3 L4,3 L3,3
CP1 & CP2 output latches
Rt Rt Rt Rt
HH0,2 HL0,2 LH0,2 LL0,2
HH1,0 HL1,0 LH1,0 LL1,0
HH1,1 HL1,1 LH1,1 LL1,1
HH1,2 HL1,2 LH1,2 LL1,2
HH2,0 HL2,0 LH2,0 LL2,0
HH2,1 HL2,1 LH2,1 LL2,1
HH2,2 HL2,2 LH2,2 LL2,2
------- HL3,0 ------- LL3,0
------- HL3,1 ------- LL3,1





























Table B.15 Dataflow of the last run for cases 2 and 1 when N is even













RtO Rt2 Rtl R2 Rl RO R2 Rl RO R2 Rl RO 
---------------------- 116,0 H4,2 H4, I ------ H5,2 HS,l L6,0 L4,2 L4,1 
---------------------- H6, l H6,0 H4,2 H4, I ------ 115,2 HS,I L6,1 L6,0 L4,2 L4,1 
X 0,6 X 0,6 X 0,7 H6,2 H6,1 H6,0 H4,2 
------ ll5,2 H5,1 L6,2 L6,1 L6,0 L4,2 
X 1,6 X 1,6 X 1,7 H6,2 H6,1 H6,0 H4,2 H7,0 ------ H5,2 L6,2 L6,1 L6,0 L4,2 
X2,6 X 2,6 X 2, 7 116,2 H6,1 116,0 117,1 117,0 ------ L6,2 L6,1 L6,0 
X 3,6 X 3,6 X 3,7 116,2 H6,1 H6,0 H7,2 H7,1 H7,0 L6,2 L6,1 L6,0 
x4,6 x4,6 x4,7 H6,2 116,1 ------ H7,2 H7,1 ------ L6,2 L6,1 
X 5,6 X 5,6 X 5,7 H6.2 H6.1 ------ H7.2 H7,1 ------ L6.2 L6,1 
X 6,6 X 6,6 X 6, 7 H0,3 H6,2 H6, I 
------ 117,2 fl7,1 LO,J ------ L6,2 L6,1 
X 7,6 X 7,6 X 7,7 HOJ ----- H6,2 HI ,3 ------ H7,2 L0,3 ----- L6,2 
---------------------- 112,3 ------ HO,J ----- ----- H 1.3 ------ L2,3 ----- L0,3 -----
---------------------- H2,3 H2,3 ------ H0,3 llJ ,3 ------ H 1,3 L2,3 L2,3 ----- LO,J 
---------------------- 114,3 ----- H2,3 ------ ------ H3,3 ------ L4,3 ------ L2,3 -----
---------------------- H4,3 H4,3 ----- H2,3 HS,J ------ H3,3 IA,J !.4,3 ----- L2,3 
---------------------- H6,3 ------ 114,3 ----- ------ H5,3 ------ L6,3 ------ L4,3 -----
---------------------- 116,3 H6,3 ------ H4,3 H7,3 ------ H5,3 L6,3 L6,3 ------ L4,3 













------ ------ H6,3 ------ ------ H7,3 ------ ------ L6,3 
CP1 & CP2 input latches
Rt0 Rt2 Rt1 Rt0 Rt2 Rt1
CP1 & CP2 output latches
Rt Rt Rt Rt
H4,2 H6,2 H5,2 L4,2 L6,2 L5,2 HH1,2 HL1,2 LH1,2 LL1,2
H6,0 H6,0 H7,0 L6,0 L6,0 L7,0 HH2,0 HL2,0 LH2,0 LL2,0
H6,1 H6,1 H7,1 L6,1 L6,1 L7,1 HH2,1 HL2,1 LH2,1 LL2,1
H6,2 H6,2 H7,2 L6,2 L6,2 L7,2 HH2,2 HL2,2 LH2,2 LL2,2
H0,3 H2,3 H1,3 L0,3 L2,3 L1,3 HH3,0 HL3,0 LH3,0 LL3,0
39 1 H2,3 H4,3 H3,3 L2,3 L4,3 L3,3 HH3,1 HL3,1 LH3,1 LL3,1
40 2
41 1 H4,3 H6,3 H5,3 L4,3 L6,3 L5,3 HH3,2 HL3,2 LH3,2 LL3,2
42 2
43 1 H6,3 H6,3 H7,3 L6,3 L6,3 L7,3 HH0,3 HL0,3 LH0,3 LL0,3
44 2
45 1 HH1,3 HL1,3 LH1,3 LL1,3
46 2
47 1 HH2,3 HL2,3 LH2,3 LL2,3
48 2
49 1 ------------------------------------------------- HH3,3 HL3,3 LH3,3 LL3,3
SRL1
R2 R1 R0
----- L5,2 L5,1
----- L5,2 L5,1
----- L5,2 L5,1
L7,0 ------ L5,2
L7,1 L7,0 ------
L7,2 L7,1 L7,0
------ L7,2 L7,1
------ L7,2 L7,1
------ L7,2 L7,1
L1,3 ------ L7,2
------ L1,3 -----
L3,3 ----- L1,3
------ L3,3 -----
L5,3 ------ L3,3
------ L5,3 -----
L7,3 ------ L5,3
----- L7,3 ------
























Table B.16 Dataflow of the last run for cases 2 and 1 when N is odd












RtO Rt2 Rtl R2 Rl RO R2 Rl RO R2 Rl RO R2 Rl RO 
·-···----------------- H4,2 ! 14,1 H4,0 H2,2 H5,0 ------ H3,2 L4.2 1.4,1 1.4,0 1.2,2 1.5,0 ----- 1.3,2 
---------------------- H4,2 H4, I H4,0 H5,1 H5,0 ------- 1.4,2 IA,I 1.4,0 1.5,1 1.5,0 ----
---------------------- H4,2 H4, I 114,0 H5,2 H5,1 H5,0 1.4,2 L4,1 1.4,0 1.5,2 1.5,1 1.5,0 
---------------------- H6,0 H4,2 H4, I H5,2 H5,1 1.6,0 1.4,2 1.4.1 ----- 1.5,21.5,1 
---------------------- H6,1 H6,0 H4,2 H4,l H5,2 H5,1 1.6,1 1.6,0 L4,2 1.4, I ----- 1.5,2 1.5,1 
x 0,6 ------ ------- H6,2 H6, I H6,0 H4,2 H5,2 HS,I 1.6,2 1.6,1 1.6,0 1.4,2 ----- 1.5,2 1.5,1 
x I ,6 ------ ------- H6,2 H6,1 H6,0 H4,2 ------ H5,2 1.6,2 1.6, I 1.6,0 1.4,2 ----- ----- L5.2 
X 2,6 ------ ------- H6,2 116, I H6,0 ---------------------- L6,2 L6, I L6,0 
X 3,6 ······ ·------ 116,2 116, I H6,0 1.6,2 L6, I 1.6,0 
X 4,6 -·-··- ·•••··· ------ H6,2 H6,1 ------ 1.6,2 1.6, I 
X 5,6 ---··· ------- ------ 116,2 116,1 ------ 1.6,2 L6, I 
X 6,6 ------ ·•••··· ----- ------ H6,2 LC,3 ------ 1.6,2 1.6, I 
------ ------ H6,2 U,3 1.0,3 ------ 1.6,2 L1 ,3 ----- -----
---------------------- ---------------------- L2,3 ----- 1.0,3 ------ ------ L I ,3 -----
L2.3 1.2,3 ------ 1.0,3 1.3,3 ----- L1,3 
---------------------- ---------------------- lA,3 ------ 1.2,3 ------ ------ Ll,J -----
lA ,3 1.4,3 ------ 1.2,3 1.5,3 ----- 1.3,3 
Lt,3 ------ 1.4,3 ------ ------ 1.5,3 -----
1.6,3 1.6,3 ------ 1.4,3 ----- ------ 1.5,3 



















CP1 & CP2 input latches
Rt0 Rt2 Rt1 Rt0 Rt2 Rt1
------ ------ L6,3
CP1 & CP2 output latches
Rt Rt Rt Rt
H2,2 H4,2 H3,2 L2,2 L4,2 L3,2 HH0,2 HL0,2 LH0,2 LL0,2
H4,0 H6,0 H5,0 L4,0 L6,0 L5,0 HH1,0 HL1,0 LH1,0 LL1,0
H4,1 H6,1 H5,1 L4,1 L6,1 L5,1 HH1,1 HL1,1 LH1,1 LL1,1
H4,2 H6,2 H5,2 L4,2 L6,2 L5,2 HH1,2 HL1,2 LH1,2 LL1,2
H6,0 ------ ------ L6,0 ------ ------ HH2,0 HL2,0 LH2,0 LL2,0
H6,1 ------ ------ L6,1 ------ ------ HH2,1 HL2,1 LH2,1 LL2,1
H6,2 ------ ------ L6,2 ------ ------ HH2,2 HL2,2 LH2,2 LL2,2
----------------------- L0,3 L2,3 L1,3 ------- HL3,0 ------- LL3,0
39 1 ----------------------- L2,3 L4,3 L3,3 ------- HL3,1 ------- LL3,1
40 2
41 1 ----------------------- L4,3 L6,3 L5,3 ------- HL3,2 ------- LL3,2
42 2
43 1 ----------------------- L6,3 ------ ------ ------- ------- LH0,3 LL0,3
44 2
45 1 ------------------------------------------------- ------- ------- LH1,3 LL1,3
46 2
47 1 ------------------------------------------------- ------- ------- LH2,3 LL2,3
48 2
49 1 ------------------------------------------------- ------- ------- ------- LL3,3
Table B.17 Dataflow of the 3-parallel intermediate architecture
Ck RP Rd0 RP's input RdH SRH0 SRH1 RdL SRL0 SRL1
latches R2 R1 R0 R2 R1 R0 R1 R0 R2 R1 R0 R2 R1 R0
Rt0 Rt2 Rt1
1 1 x0,2 x0,0 x0,2 x0,1
2 2 x0,4 x0,2 x0,4 x0,3
3 3 x0,6 x0,4 x0,6 x0,5
4 1 x1,2 x1,0 x1,2 x1,1
5 2 x1,4 x1,2 x1,4 x1,3
6 3 x1,6 x1,4 x1,6 x1,5
7 1 x2,2 x2,0 x2,2 x2,1
8 2 x2,4 x2,2 x2,4 x2,3
9 3 x2,6 x2,4 x2,6 x2,5
10 1 x3,2 x3,0 x3,2 x3,1 H0,0 ----- ----- L0,0 ----- -----
11 2 x3,4 x3,2 x3,4 x3,3 H0,1 H0,0 ----- L0,1 L0,0 -----
12 3 x3,6 x3,4 x3,6 x3,5 H0,2 H0,1 H0,0 L0,2 L0,1 L0,0
13 1 x4,2 x4,0 x4,2 x4,1 H0,2 H0,1 H0,0 H1,0 ----- ----- L0,2 L0,1 L0,0 L1,0 ----- -----
14 2 x4,4 x4,2 x4,4 x4,3 H0,2 H0,1 H0,0 H1,1 H1,0 ----- L0,2 L0,1 L0,0 L1,1 L1,0 -----
15 3 x4,6 x4,4 x4,6 x4,5 H0,2 H0,1 H0,0 H1,2 H1,1 H1,0 L0,2 L0,1 L0,0 L1,2 L1,1 L1,0
16 1 x5,2 x5,0 x5,2 x5,1 H2,0 H0,2 H0,1 ----- H1,2 H1,1 L2,0 L0,2 L0,1 ------ L1,2 L1,1
17 2 x5,4 x5,2 x5,4 x5,3 ----- H2,1 H2,0 H0,2 ----- ------ H1,2 L2,1 ---- L2,0 L0,2 L0,1 ------ L1,2 L1,1
18 3 x5,6 x5,4 x5,6 x5,5 H2,2 H2,1 H2,0 H0,2 ----- ------ H1,2 L2,2 L2,1 L2,0 L0,2 L0,1 ------ L1,2 L1,1
19 1 x6,2 x6,0 x6,2 x6,1 ----- H2,2 H2,1 H2,0 H3,0 ----- ------ ----- L2,2 L2,1 L2,0 L0,2 L3,0 ------ L1,2
20 2 x6,4 x6,2 x6,4 x6,3 ----- H2,2 H2,1 H2,0 H3,1 H3,0 ------ ------ ------ L2,2 L2,1 L2,0 L3,1 L3,0 -----
21 3 x6,6 x6,4 x6,6 x6,5 ----- H2,2 H2,1 H2,0 H3,2 H3,1 H3,0 ------ ------ L2,2 L2,1 L2,0 L3,2 L3,1 L3,0
22 1 x7,2 x7,0 x7,2 x7,1 ----- H4,0 H2,2 H2,1 ------ H3,2 H3,1 ------ ------ L4,0 L2,2 L2,1 ------ L3,2 L3,1
23 2 x7,4 x7,2 x7,4 x7,3 ----- H4,1 H4,0 H2,2 ------ ----- H3,2 L4,1 ------ L4,0 L2,2 L2,1 ------ L3,2 L3,1
24 3 x7,6 x7,4 x7,6 x7,5 H4,2 H4,1 H4,0 H2,2 ------ ------ H3,2 L4,2 L4,1 L4,0 L2,2 L2,1 ------ L3,2 L3,1
25 1 x8,2 x8,0 x8,2 x8,1 ----- H4,2 H4,1 H4,0 H5,0 ------ ------ ----- L4,2 L4,1 L4,0 L2,2 L5,0 ----- L3,2
26 2 x8,4 x8,2 x8,4 x8,3 ----- H4,2 H4,1 H4,0 H5,1 H5,0 ------ ------ ------ L4,2 L4,1 L4,0 L5,1 L5,0 -----
27 3 x8,6 x8,4 x8,6 x8,5 ----- H4,2 H4,1 H4,0 H5,2 H5,1 H5,0 ------ ------ L4,2 L4,1 L4,0 L5,2 L5,1 L5,0
28 1 x9,2 x9,0 x9,2 x9,1 ----- H6,0 H4,2 H4,1 ------ H5,2 H5,1 ------ ------ L6,0 L4,2 L4,1 ----- L5,2 L5,1
29 2 x9,4 x9,2 x9,4 x9,3 ----- H6,1 H6,0 H4,2 ------ ----- H5,2 L6,1 ------ L6,0 L4,2 L4,1 ----- L5,2 L5,1
30 3 x9,6 x9,4 x9,6 x9,5 H6,2 H6,1 H6,0 H4,2 ------ ------ H5,2 L6,2 L6,1 L6,0 L4,2 L4,1 ----- L5,2 L5,1
Ck RP CP1 & CP3 input latches CP2 input latches CP1 & CP3 output latches CP2 output latches
Rt0 Rt2 Rt1 Rt0 Rt2 Rt1 Rt0 Rt2 Rt1 Rth Rtl Rth Rtl Rth Rtl
16 1 H0,0 H2,0 H1,0 L0,0 L2,0 L1,0
17 2 ------------------------------------------------- H0,1 H2,1 H1,1
18 3
19 1 H0,2 H2,2 H1,2 L0,1 L2,1 L1,1 ---------------------
20 2 ------------------------------------------------- L0,2 L2,2 L1,2
21 3
22 1 H2,0 H4,0 H3,0 L2,0 L4,0 L3,0 ---------------------
23 2
24 3
25 1 H2,2 H4,2 H3,2 L2,1 L4,1 L3,1 --------------------- HH0,0 HL0,0 LH0,0 LL0,0
26 2 ------------------------------------------------- L2,2 L4,2 L3,2 ------------------------------------ HH0,1 HL0,1
27 3
28 1 H4,0 H6,0 H5,0 L4,0 L6,0 L5,0 --------------------- HH0,2 HL0,2 LH0,1 LL0,1 ---------------------
29 2 ------------------------------------ LH0,2 LL0,2
30 3
B.3 Dataflow tables of chapter 5
In Table B.19 (a), pipeline stages 4, 7, and 10 of Figure 6.5.5 are not included. In the first run, which
ends at cycle 20, these stages only pass coefficients from the previous stage to the next, whereas in the
second run, which begins at cycle 25, and in all subsequent runs, stages 4 and 10 are bypassed, as shown in
Table B.19 (a). For instance, Rt0 and Rt1 of stage 2 are shown holding coefficients YL'2,0 and YL'2,1 in
cycle 26, during which coefficient YL"2,0 is computed. Then, in cycle 27, YL"2,0 is loaded into Rt0 of stage
3, while YL'2,1 is loaded into Rt1 of stage 5 through the multiplexer labeled mux, bypassing stages 3 and 4.
In cycle 28, YL'2,1 in Rt1 of stage 5 is loaded into Rt1 of stage 6, while YL"2,0 in Rt0 of stage 3 is
transferred to Rt0 of stage 6, bypassing stages 4 and 5; from there the two coefficients proceed together
until stage 8.
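The register transfers described above can be traced with a short sketch. The cycle numbers, stage numbers, and latch names are taken from the paragraph; the `Hop` container and the helpers `skipped_stages` and `stage_at` are hypothetical names introduced only for this illustration and are not part of the architecture description.

```python
from typing import List, NamedTuple


class Hop(NamedTuple):
    """One latch-to-latch transfer stated in the text (hypothetical container)."""
    cycle: int       # cycle in which the destination latch is loaded
    coeff: str       # coefficient name as written in Table B.19 (a)
    src_stage: int   # stage holding (or computing) the coefficient beforehand
    dst_stage: int   # stage whose latch receives it
    reg: str         # destination latch name


# Transfers given in the paragraph for cycles 27 and 28, in cycle order.
HOPS = [
    Hop(27, 'YL"2,0', 2, 3, "Rt0"),
    Hop(27, "YL'2,1", 2, 5, "Rt1"),
    Hop(28, 'YL"2,0', 3, 6, "Rt0"),
    Hop(28, "YL'2,1", 5, 6, "Rt1"),
]


def skipped_stages(hop: Hop) -> List[int]:
    """Stages strictly between source and destination, i.e. the bypassed ones."""
    return list(range(hop.src_stage + 1, hop.dst_stage))


def stage_at(coeff: str, cycle: int) -> int:
    """Stage holding `coeff` after the transfers of `cycle` have taken place."""
    stage = 2  # both coefficients belong to stage 2 in cycle 26
    for hop in HOPS:
        if hop.coeff == coeff and hop.cycle <= cycle:
            stage = hop.dst_stage
    return stage


if __name__ == "__main__":
    for hop in HOPS:
        print(hop.cycle, hop.coeff, hop.reg, hop.src_stage, "->", hop.dst_stage,
              "bypassing", skipped_stages(hop) or "nothing")
```

Under these assumptions the helper reproduces the bypasses named in the text (stages 3 and 4 in cycle 27, stages 4 and 5 in cycle 28) and shows both coefficients meeting in stage 6 in cycle 28.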
Note that the first indexes of YL, YH, XL, and XH in Tables B.18 and B.19 (a) refer to column
numbers in Figures 6.3.2 (A) and (B), while the second indexes refer to input numbers in each column, in
accordance with the convention followed in the DDGs. On the other hand, the first indexes of Y and X in
Tables B.18 and B.19 (b) refer to input numbers in each row, in accordance with the convention followed in
the DDGs, which is also indicated in the processors' datapath architectures.
Table B.18 Dataflow of the 5/3 architecture
Ck I 2 3 4 CP output latches I 2 3 4 RP output 
f CP input latches RP input latches latches 
RdO RIO Rll RIO Rll RIO Rll RIO Rll RtlO Rtll Rlh RtO Rtl TLBI RIO Rll Rt2 RIO Rll TLB2 RIO Rtl Rt2 RIO Rll 
I LLO.O ····--- --------
2 ------- LLO.O LHO.O 
3 HLO,O LLO,O LHO,O 
4 HLO,O HHO,O XLO(O) YLO( I) 
5 LL 1 ,0 IILO,O HHO,O XLO(O) YLO( I) 
6 ------- LLI,O LHI,O XHO(O) YHO( I XLO(O YLO(l 
7 HLLO LL\,0 UII,O XHO(O) YHO(l) XLO(O) YLO(l) 
8 ------- HLI.O HHI.O XL0(2) YLO(l) XHO(O) YHO( I) XLO(O) YLO( I) 
9 LL2,0 HLI,O HHI,O XL0(2) YL0(3) XHO(O) YHO(l) XLO 0 YLO(l 
10 ------- LL2.0 L!-12,0 XH0(2) YH0(3) XL0(2) YL0(3) XHO(O) YHO( I) LO,O Ll,O -----
II HL2.0 LL2.0 LH2.0 XH0(2) YHO(J) XL0(2) YL0(3) XHO(O YHO(l) LO.O Ll.O -
12 - -- HL2,0 HH2,0 XL0(4) YL0(5) XH0(2) YH0(3) XL0(2) YLO(l) ----- LI,O Hl,O LO.O HO.O ------
13 LLJ.O HL2.0 H\12.0 XL0(4) YLO(S) XHO 2) YH0(3 XL0(2) YL0(3) ----- L1 ,0 HI ,0 LO,O HO,O ------
14 ------- LLJ.O ------- X\10(4) Y\10(5) XL0(4) YLO(S) XH0(2) YH0(3 L2.0 L3.0 ------ Ll.O H\.0 YO( I) YO(O YO I ----
15 HL3,0 LL3,0 ------- XH0(4) YH0(5) XL0(4) YLO 5) XH0(2) YH0(3) L2,0 LJ,O ------ Ll,O Hl,O YO(O) YO( I) ----
16 ------- 11!.3.0 ----- XLO( 6) -------- XH0(4) YH0(5) XLO 4) YL0(5 -- U,O HJ,O L2.0 H2.0 Y\(1) Yl 0 Y\(1 ---- xo 0) -----
17 ! LU,J HL3,0 XLO( 6) -·---·-- XH0(4) YH0(5l_ XL0(4) YL0(5) ----- U,O 113,0 L2,0 H2,0 Y\(0) Y\(1) ---- XO(O) -----
18 ---- LLU_l LHU_l XH0(6) ------- XL0(6) - - XHO 4) YHO 5) L4,0 LS,O ------ Ll.O H3.0 Y2(1) Y2(0) Y2(1) ···· X\(0) ----- XO(O) XO(O) ---- ----
19 llLOI 1_1_11] 1_1!0_1 XH0(6) ------· XL0(6) -------- XH0(4) YH0(5) L4,0 LS,O - ---- U,O H3,0 Y2(0) Y2( I ) ---- X\(0) ----- XO(O) - --
20 ------- I 11 0.1 1\IICI_l XL\(0) YLl( II XH0(6) ------- XLO( 6) -------- ----- LS,O HS,O L4,0 H4,0 YJ( 1) Y3(0) Y3(1 ---- X2(0) ----- X\(0) X\(0) ---- ---- X(O.O) ----
21 Ll IJ lll_li_]J!IIn_] Xl.l((Jl 'il li!_1 XII0(6) XLO 6) ·---·--- ----- L5.0 H5.0 L4,0 H4,0 Y3 0) Y3 I ---- X2 0 ----- X\(0) ---- ---- X(O.O) -----
22 Ll.l I I 111 I \.llltl!iY!llil_l XI it'l} YLlt ll XIIO( 6) ···---- L6,0 ------ ------ LS.O H5.0 Y4(1) Y4(0)Y4(1J --·· X3(0 ----- X2(0) X2(0) ---- ---- X(l.O) -----
23 JILl I l.Ll 1 L!ll,l \.lll(UiYlllil) \1 ](11)'11 It I) XH0(6) - -- L6.0 ----- ------ L5.0 H5.0 Y4(0)Y4(1) ---- X3(0) --- X2(0) ·-- ---- X( 1.0) ····· 
24 ------- 11LLI 1\lll I \.1 IC?.1 VLii~J :\Ill (iJ 1 Ylllll) X!.ltll}YI 1(11 ------ ------ ------ L6.0 116.0 Y5(1) Y5(0) YS(l) ---- X4(0) ----- X3(0) X3(0) ---- ---- X(2.0) -----
25 11.2.1 111 l.lll11U XI_](2J 'l Lli3l \.IIIHIJ Ylll( II Xl]{f,)YII(II ------ - - ------ L6,0 H6,0 ¥5(0 Y5(1 ---- X4 0) ----- X3(0) ·-- X(2.0) --··-
26 ------- l.L2.1 I.IP.l X!li(::'_IYIII<J} ALI(2) 'l'l.l(3j XHl(O) 'l'IJ](I l I_! I_] L1, I ------ ----- Y 6( I ) Y6(0) Y6(1) ---- X5(0) ----- X4(0) X4(0) ---- ---- X(3.0) -----
27 11!.2,1 LL2_l Lll2_1 Xll\(2_1 'r'lll(') X\.1(2} Yll(') Xlli(O) YH](I l Ul_l II I ----- ------ ----- Y6(0) Y6(1 ---- X5(0) ----- X4(0) ---- ---- X(3.0) -----
28 I II 2.1 ll/!2J XI 1<--1) Yl It") _'\1!1(21 YI!IOJ :XLlf2J YLIU) Ll I Ill I IJJ.l H(J.J ------ ----- ---- X6 0 ---- X5(0) X5(0) ··-- ---- X(4.0) -----
29 I U_l l!l 2.1 11112.1 XI It~ 1 Yl It~~ '\111!21 YIIJ(.i_l AL1(2J Yl It _I 1 I ~ I Ill I I II_ 1 I Hi.! ------ ----- ---- X6(0) ----- X5(0) ---- X(4.0) ----
30 Ll 3. l ------ Xllli--1.) \HI(5J XLJ(--Il YUt>J .XIII('~ 1 Y111 (J) L2,1 U_l ------ Ll I fiLl YUt3J YI)(..,JYO{JJ'ri;(!) --····· -·--- X6(0 X6(0) ---- ---- X(5.0) -----
31 l!L3_1 LU_I --- Xllli-1)YI!](~l XLII~ I YLii_'> I X!ll('l Yl/1(3) 1.2. I L.LI I I I Ill I Ylli.-,1 YC,(_'l) Yilt I 1 ------- ----- X6(0 ---- ---- X(5.0) -----
32 !lLU ------- XL I ( 6) X!ll(.J.)Yilli:'l XLII4J YLl(:') I:U fLU 12_1 IP_I Yi()J YI('JYI(~)YiilJ '(0(2) YIJIII ------ ---- ---- X(6.0) -----
33 !IU.l XL\(6) ------- X!ll(4) 'llllt."J '([_](41 'd_](_'\) [__)_I IL~ _ I 1 2.1 112. 1 Y1t2JYI('JY1\1J X0(2) Y11111 ------ ---- ---- X(6.0) -·--· 
34 ------- ------- ------- Xlll(bl XLif(J) ------- X11114)YIII(::-) 1AJ LS_l ------ 1.3.1 113.1 Y 1 (J) Y2{2J Y2() l Y2\ 1) X!(2JYIIIJX0(2) XOt2l Yo( 1 J XO~Ol ------- -----
35 ------ :\1!!(1>! ------- XLl(6J -- -- X\!1(4) Yl!lt5J L4.1 1_:' I U_l !U_I Y2(2) Y2(l J Y2( 1) Xlt2) Ylrl) Xli(2) YO( I l XII(O) ------- -----
36 ------- ------- .. - --- XH!\61 ------- XL1(6) ------- ---- 1.5.1 115,1 1A.I H4_1 YJI)J YJ(2)Y3t3JYJ(1) X2(2 l Y2fl) :X!( 21 X!~2) Yl(l 1 Xl(O) :'<(1!.2) XW.l) 
37 ------- ------- --------- -------- XlliU>I XLl<6) --- --- ----- L::-.1 115.1 1.4.1 114,1 Y3(2)Y3i3iY3(1) X2\2) Y2(1) Xl(2) Yl(l l XI(O) xm2) X((), I l 
38 ------- --- ------ -------- --------- -------- Xlll(6) -- 1.6.1 L::-.1 11::-J Y-'l-C'l Y4(2) Y4U) Y4{1) X312) y_-, I) X2t2) X2(2) Y2 ll X2(0J XtUJ X(l,l J 
39 --------- -------- ------- XII 1(6) Lr>_l ----- LS_I 115,1 Y4{2)Y4t3JY4(1) X3t 1 J Y3(1) X2(2) Y2i I l X2(11) Xt 1.2) X( L1) 
40 ------ ------- - -- --------- -------- --------- -------- ------ ------ L6,1 11(),1 Y:-(3) Y5(2) Y5(3 J Y5( 1) X--112)Y4(1)X312J X3(2J Y3(ll XJ({J) xc:~_2) X(2.1 l 
41 - ----- --------- -------- - . ------ u-._1 H6_1 Y5(?_.IY)(3)Y5tll X4(2) Y4( I l XJ( 2 1 \'3( I) X ~(0 1 X(2.2J Xt2J l 
42 ---- --------- -------- --- ------- ------ ----- ----- Y6UJ Yht2J Y6(3) '{fl( I ·1 X"(2) Y4t I J X4C2) X4(2JY4t1)X4(0J X(3.2J XUJ l 
43 -- -- ------- _._ ____ - ------- -------- --------- -------- ------ ------ ------ YN21 Y6(_~) Yl'l I J X5(2) Y<-11 I l X4(2JY--I(I)X4dll X(3,2J Xt3.1J 
Table B.19 (a) Dataflow for 9/7 architecture from CP side
Ck I 2 3 5 6 8 9 II 12 13 CP output 
P2 CP input latches latches 
RtO Rtl RtO Rtl RtO Rtl RtO Rtl RtO Rtl RtO Rtl RtO Rtl RtO Rtl RtO Rtl RtO Rtl RtlO Rtll Rth 
I LLO,O LHO,O 
2 HLO,O HHO,O YL'O,O YL'0,1 
3 LLO,I LHO,I YH'O,O YH'0,1 YL"O,O YL'0,1 
4 HLO,l HHO,l YL'1,0 YL'1,1 YH"O,O YH'0,1 
5 LLI.OLH\,0 YH'1,0YH'1,1 YL"1 ,0 YL'1, 1 YL"O,O YL'O, 1 
6 HLI,O HHI,O YL'0,2 YL'0,3 YH"1,0YH'1,1 YH"O,O YH'0,1 YL"O,O YL'O, 1 
7 LLI,I LHI,I YH'0,2 YH'0,3 YL"0,2 YL'0,3 YL"1,0YL'1,1 YH"O,O YH'O, 1 
8 HLI,l HHI,I YL'1 ,2 YL'1 ,3 YH"0,2 YH'0,3 YH"1 ,0 YH'1, 1 YL"1 ,0 YL'1, 1 YL"O,O YL"O, 1 
9 LL2,0 LH2,0 YH'1 ,2 YH'1 ,3 YL"1 ,2 YL'1 ,3 YL"0,2 YL'0,3 YH"1,0YH'1,1 YH"O,O YH"O, 1 XLO,O YL"O, 1 
10 HL2,0 HH2,0 YL'0,4 YL'0,5 YH"1,2 YH'1,3 YH"0,2 YH'0,3 YL"0,2 YL'0,3 YL"1 ,0 YL"1, 1 XHO,O YH"0,1 
II LL2,1 LH2,1 YH'0,4 YH'0,5 YL"0,4 YL'0,5 YL"1 ,2 YL'1 ,3 YH"0,2 YH'0,3 YH"1,0YH"1,1 XL1 ,0 YL"1, 1 XLO,O YL"0,1 
12 HL2,1 HH2,1 YL'1,4 YL'1,5 YH"0,4 YH'0,5 YH"1 ,2 YH'1 ,3 YL"1,2 YL'1,3 YL"0,2 YL"0,3 XH1,0YH"1,1 XHO,O YH"O, 1 XLO,O YL"O, 1 
13 LLJ,O LHJ,O YH'1 ,4 YH'1 ,5 YL"1 ,4 YL'1 ,5 YL"0,4 YL'0,5 YH"1 ,2 YH'1 ,3 YH"0,2 YH"0,3 XL0,2 YL"0,3 XL1 ,0 YL"1, 1 XHO,O YH"O, 1 XLO,O YL"0,1 
14 HL3,0 HH3,0 YL'0,6 YL'O, 7 YH"1,4 YH'1 ,5 YH"0,4 YH'0,5 YL"0,4 YL'0,5 YL"1 ,2 YL"1 ,3 XH0,2 YH"0,3 XH1 ,0 YH"1, 1 XL1 ,0 YL"1, 1 XHO,O YH"O, 1 LO,O Ll ,0 ----
15 LL3,1 LH3,1 YH'0,6 YH'0,7 YL"0,6 YL'0,7 YL"1 ,4 YL'1 ,5 YH"0,4 YH'0,5 YH"1 ,2 YH"1,3 XL 1 ,2 YL"1,3 XL0,2 YL"0,3 XH1 ,0 YH"1, 1 XL1,0 YL"1, 1 ----- Ll,O HI,O 
16 HU,l HHJ,l YL'1,6 YL'1,7 YH"0,6 YH'0,7 YH"1 ,4 YH'1,5 YL"1,4 YL'1,5 YL"0,4 YL"0,5 XH1,2 YH"1 ,3 XH0,2 YH"0,3 XL0,2 YL"0,3 XH1,0 YH"1, 1 LO,I Ll,l ----
17 LL4,0 ------ YH'1,6 YH'1,7 YL"1,6 YL'1,7 YL"0,6 YL'0,7 YH"1 ,4 YH'1 ,5 YH"0,4 YH"0,5 XL0,4 YL"0,5 XL 1 ,2 YL"1 ,3 XH0,2 YH"0,3 XL0,2 YL"0,3 ----- Ll,l Hl,l 
18 HIA,O ------ YL'0,8 - -- YH"1,6 YH'1,7 YH"0,6 YH'0,7 YL"0,6 YL'0,7 YL"1 ,4 YL"1 ,5 XH0,4 YH"0,5 XH1,2 YH"1,3 XL1 ,2 YL"1,3 XH0,2 YH"0,3 L2,0 L3,0 ----
19 LL4, I ------ YH'0,8 --- YL"0,8 ---- YL"1 ,6 YL'1 ,7 YH"0,6 YH'0,7 YH"1 ,4 YH"1 ,5 XL 1 ,4 YL"1 ,5 XL0,4 YL"0,5 XH1 ,2 YH"1 ,3 XL1,2 YL"1 ,3 ----- L3,0 H3,0 
20 HIA, I ------ YL'1 ,8 - - YH"0,8 -- YH"1 ,6 YH'1, 7 YL"1,6 YL'1,7 YL"0,6 YL"0,7 XH1 ,4 YH"1 ,5 XH0,4 YH"0,5 XL0,4 YL"0,5 XH1 ,2 YH"1 ,3 L2.1 L3, I ----
21 ----- - ------ YH'1,8 YL"1,8 - YL"0,8 - YH"1 ,6 YH'1 ,7 YH"0,6 YH"O, 7 XL0,6 YL"O, 7 XL 1 ,4 YL"1 ,5 XH0,4 YH"0,5 XL0,4 YL"0,5 ----- L3,1 H3,1 
22 ------- ------ ---------- ------- YH"1,8 --- YH"0,8 ----- YL"0,8 --- YL"1,6 YL"1,7 XH0,6 YH"0,7 XH1 ,4 YH"1 ,5 XL1 ,4 YL"1 ,5 XH0,4 YH"0,5 IA,O LS,O ----
23 ------- ------ ---------- ------- ---------- ------- YL"1,8 ---- - YH"0,8 YH"1 ,6 YH"1, 7 XL1,6YL"1,7 XL0,6 YL"0,7 XH1.4 YH"1,5 XL1 ,4 YL"1,5 ----- L5,0 H5,0 
24 ------- ------ ---------- ------- ---------- ------- YH"1,8 --- YL"1,8 ---- YL"0,8) ---- XH1 ,6 YH"1 ,7 XH0,6 YH"0,7 XL0,6 YL"0,7 XH1 ,4 YH"1,5 IA,I L5,1 ----
25 LUL2 U HL1 ------- YH"1,8 YH"0,8 - XL0,8 XL1.6YL"1.7 XH0,6 YH"O, 7 XL0,6 YL"0,7 ----- L5,1 H5,1 
26 11!_11.2 I !l\1 1.:::: YL'2.0 YL'2. 1 ---------- ------- --------- ------- ---------- ------- YL"1,8 -- XH0,8 -- XH1 ,6 YH"1 ,7 XL1 ,6 YL"1 ,7 XH0,6 YH"0,7 L6,0 L7,0 ----
27 I I I,' I 111.2 YH'2.0 YH'2, 1 YL"2.0 -------- YL'2,1 ---------- ------- YH"1,8 ---- XL1,8 ---- XL0,8) ------ XH1,6 YH"1,7 XL1,6 YL"1,7 ----- L7,0 H7,0 
28 fiLU 11!11.2 YL'2.2 YL'2,3 YH"2,0 -------- YH'2.1 YL"2.0 YL'2.1 ---------- ------- XH1,8 XH0,8 - - XL0,8 - XH1,6 YH"1,7 L6,1 L7,1 ----
29 LL2_2 Ll\2_2 YH'2.2 YH'2.3 YL"2.2 ------- -------- YL'2.3 YH"2.0 YH'2.1 ---------- ------- -------- ~-~-~-- .X.L 1,8 ---- XH0,8 ---- XL0,8 ---- ----- LI,J 01,1 
30 IIL2.2lll/2.2 YL'2.4 YL'2,5 YH"2.2 ---- -------- YH'2.3 YL"2.2 YL'2,3 YL"2.0 YL"2, 1 -------- ------- XH1,8 ---- XL1,8 ---- XH0,8 --- L8,0 ---- -----
31 LU_2 Ll LL2 YH'2.4 YH'2,5 YL"2.4 .. YL'2.5 YH"2 .. 2 YH'2,3 YH"2.0 YH"2 .. 1 XL2J.\ - YL"2.1 XH1,8 XL1,8 .. ------ ----- -----
32 1 IU.~ 1!11.~.2 YL'2.6 YL'2.7 YH"2..4 -------- YH'2,5 YL"2 4 YL'2.5 YL"2,2 YL"2,3 XI!2Ji ------- ------- YH"2.1 XL2_1J YL"2.1 XH1,8 ---- L8, I ---- -----
33 !.lA_2 ------- YH'2.6 YH'2,7 YL"2.6 ·------ YL'2.7 YH"2..4 YH'2.5 YH"2 .. 2 YH"2.3 XL2,2 ----- YL"2.3 Xll2_11 YH"2.1 XI '"~_II YL"2.1 ------ ----- -----
34 Ill A_~ ------- YL'2.8 ------- YH"2.6 -------- -------- YH'2.7 YL"2 6 YL'2 7 YL"2.4 YL"2 .. 5 Xll2.2 -------- YH"2,3 XU,J YL"2,3 Xli1_0 YH"2.1 L0,2 Ll ,2 -----
35 ------- ------· YH'2 .. 8 ------- YL"2.8 ........... -------- YH"2.6 YH'2.7 YH"2.,4 YH"2.5 XL2,4 YL"2.5 XI 12_2 YH"2,3 XI .2_2 YL"2.3 ----- Ll,2 H1,2 
36 ------- ------- ------- YH"2.8 --- . ------- YL"2.8 YL"2.6 YL"2 .. 7 :'(1!2_1 YH"2,5 '<! 2_--1- YL"2.5 XI 12_J YH"2.3 L2,2 L3,2 -----
37 -- ------- ------- ------- -~----- ------ YH"2,8 ---··-- YH"2.6 YH"2.7 Xl.~h -------· YL"2.7 XI 1.2_--1 YH"2.5 XU,--1 YL"2,5 ......... L3,2 H3,2 
38 ------- ------- YL"2.8 Xll2_6 ------- YH"2 .. 7 XL2_r• Yl"2.7 XII2A YH"2,5 L4,2 L5,2 -
39 ------- ------- ------ ------ YH"2 .. 8 -------- XI.2.S ------- - ---- XI !..2.1• YH"2.7 XL2J• YL"2,7 ......... L5.2 H5,2 
40 ------- ------- ------- ------- ------- --~--~- All2_X ------- ------- AL2_H ------- XIC_6 YH"2.7 L6,2 L 7,2 -----
41 ------- ------- ------- ------- ------- ------- ------- ------- ------- All2 X XL2J\ ----- L7,2 H7,2 
42 - ....... ------- ------- ------- XI!2_.S ------- L8_2 ----- -----
43 ------- ------- ------- ------- -- ---- ------- ------- ------- ----- ----- -----
Table B.19 (b) Dataflow for 9/7 architecture from RP side
Ck I 2 3 4 5 6 7 8 9 RPout 
f!2 RP mput latches 
RID Rtl RtO Rtl TLBI RtO Rtl Rt2 Rl RO RtO Rtl TLB2 Rl RO RtO Rtl Rt2 RtO Rtl TLB2 RIO Rtl Rt2 RtO Rtl TLB4 RtO Rtl Rt2 RtO Rtl 
14 ---- ----
IS LO.O HO.O 
16 Ll,O 111,0 YO.O Y0.1 
17 LO.I HO.I Y1.0 Y1.1 YO.O Y0.1 - - -- -
18 LI.IHI.I Y0.2 Y0.3 Y1.0Y1.1 --Y0.1 -- YO.O --- --- --- ---
19 L2,0 112.0 Y1.2 Y1.3 Y0.3 Y0.2 Y0.3 -- Y1.1 Y0.1 Y1.0 -- - YO.O --
20 L3,0 H3,0 Y2.0 Y2.1 Y1.3 Y1.2 Y1.3 -- --- Y1.1 Y0.2 Y0.1 --- Y1.0 YO.O 
21 L2,1 H2.1 Y3.0 Y3.1 Y2.0 Y2.1 -- ---- ---- Y1.2 Y1.1 Y0.2 - Y1.0 Y0.2 Y0.1 YO.O 
22 Ll.l H3.1 Y2.2 Y2.3 Y3.0 Y3.1 Y2.1 - Y2.0 ---- Y1.2 - - Y1.2 Y1.1 Y1.0 YO.O Y0.1 -
23 L4.0 H4.0 Y3.2 Y3.3 Y2.3 Y2.2 Y2.3 - Y3.1 Y2.1 Y1.0 -- Y2.0-- ----- ----- ----- Y1.0Y1.1 Y0.1 YO.O Y0.1 --
24 LS,O HS,O Y4.0 Y4.1 Y3.3 Y3.2 Y3.3 Y3.1 Y3.2 Y2.1 Y3.0 Y2.0 ----- ---- --- - -------- Y1.1 Y1.0 Y1.1 ---- xO,O -- --
25 L4,1 H4,1 YS.O Y5.1 Y4.0 Y4.1 -- --- --- Y3.2 Y3.1 Y2.2 --- Y3.0 Y2.2 Y2.1 Y2.0 ----- ----- ----- ----- ---- x 1,0 ---- xO,O xO,O ---- -----
26 LS,l H5J Y4.2 Y4.3 Y5.0Y5.1 -Y4.1 -- Y4.0 -- Y3.2 - Y3.2 Y3.1 Y3.0 Y2.0 Y2.1 ----- ----- ---- ----- ---- xl,O xl,O xO,O ----
27 L6,0 H6,0 Y5.2 Y5.3 Y4.3 Y4.2 Y4.3 Y5.1 Y4.1 YS.O Y4.0 ----- ----- ----- Y3.0 Y3.1 Y2.1 Y2.0 Y2.1 -- - -- ----- ---- ----- xl,O ----
28 L7.0 H7.0 Y6.0 Y6.1 Y5.3 Y5.2 Y5.3 - --- Y5.1 Y4.2Y4.1 YS.O Y4.0 ----- ----- ----- ----- ----- Y3.1 Y3.0Y3.1 -- x2.0 ---- ----- ---- ----- ---- ----
29 L6,1 H6J Y7.0 Y7.1 Y6.0 Y6.1 - - Y5.2 Y5.1 Y4.2 YS.O Y4.2 Y4.1 Y4.0 -- -- --- - -- x3,0 x2,0 x2,0 --- ---- ----
30 L7,1 117,1 Y6.2 Y6.3 Y7.0 Y7.1 -- Y6.1 -- Y6.0 -- Y5.2 -- -- Y5.2 Y5.1 YS.O Y4.0 Y4.1 ----- ----- ---- ----- ---- x3,0 x3,0 ---- ----- x2,0 ----
31 L8.0 H8.0 Y7.2 Y7.3 Y6.3 Y6.2 Y6.3 Y7.1Y6.1 Y7.0 Y6.0 - - ----- ----- ----- YS.O Y5.1 Y4.1 Y4.0 Y4.1 -- ----- ---- ----- ---- ----- x3,0 ----
32 ----- ----- Y8.0 Y8.3 Y7.3 Y7.2 Y7.3 - - Y7.1 Y6.2 Y6.1 Y7.0 Y6.0 ----- ----- ----- ----- ----- Y5.1 YS.O Y5.1 --- x4,0 ---- ----- ---- ----- ---- ----
33 L8,1 H8,1 ------ Y8.0 Y8.1 --- ---- ---- Y7.2 Y7.1 Y6.2 --- Y7.0 Y6.2 Y6.1 Y6.0 ----- ----- ----- ----- ---- x5,0 ---- x4,0 x4,0 ---- ----- ---- ----
34 --- - ----- Y8.2 Y8.3 -- --- Y8.1 - Y8.0 ---- Y7.2 - - ---- Y7.2 Y7.1 Y7.0 Y6.0 Y6.1 ----- ----- ---- ----- ---- x5,0 x5D ---- ----- x4,0 ----
~~ 111.2 1111_2 ------ Y8.3 Y8.2 Y8.3 --- -- Y8.1 -- -- YB.O -- ----- ----- ----- Y7.0 Y7.1 Y6.1 Y6.0 Y6.1 -- ----- ---- ----- ---- ----- x5,0 ----
_-;I, 1.1_2 II L2 Y0.4 Y0.5 ---- ---- ----- Y8.2 Y8.1 -- Y8.0 ----- ----- ----- ----- ----- Y7.1 Y7.0 Y7.1 x6,0 ---- ----- ----- ---- ----- ---- ----
~7 !_22112_2 Y1 4 Y1 5 Y0,5 Y0.4 YO 5 YO 3 Y8.2 -- - - Y8.2 Y8.1 Y8.0 - - - - - ----- ---- x7,0 -- x6,0 x6,0 - - ----
:;s U_2 ILl 2 Y2.4 Y2.5 Y1.5 Y1.4 Y1 5 Y1 3 ·--- ----- YOA Y0.3 ----- ------ ----- ----- Y8.0 Y8.1 ----- ----- ---- ----- ---- x7,0 x7,0 ---- ----- x6,0 ----
-;c) 14.211-U Y3.4 Y3.5 Y2.5 Y2.4 Y2.5 Y2.3 ·--- ----- Y1.4Y1,3Y0,4 - YOA Y0.3 YO.Z ----- ----- Y8.1 YB.O Y8.1 ----- ---- ----- ---- ----- x7,0 ----
40 !__,.2 liS.::' Y4.4 Y4.5 Y3.5 Y3.4 Y3.5 Y3.3 ---- ---- Y2.4 Y2.3 Y1.4 --- Y1.4 Y1.3 Y1.2 Y0.2 Y0.3 ----- ---- ---- x8.0) ---- ----- ---- ----- ---- ----
41 [h_2 llh _ __., Y5.4 Y5.5 Y4.5 Y4.4 Y4.5 Y4.3 --------- Y3.4 Y3.3 Y2.4 ---- - Y2.4 Y2.3 Y2 2 Y1.2 Y1.3 Y0.3 Y0.2 Y0.3 Y0.1 ----- ---- x8,0 x8,0 ---- ----- ---- ----
42 17_2!172 Y6_4 Y6 5 Y5 5 Y5 4 Y5.5 Y5.3 -- Y4.4 Y4.3 Y3.4 ---- ---- Y3 4 Y3.3 Y3 2 Y2.2 Y2 3 Y1 3 Y1.2Y13Y11 -..<1_2 Y0.1 ----- ---- --- - x8.0 
--~-~ I .<,;, ~ J II'_~ Y7.4 Y7.5 Y6.5 Y6.4 Y6.5 Y6.3 ·--- ----- Y5 4 Y5.3 Y4.4 Y4.4 Y4.3 Y4.2 Y3.2 Y3.3 Y2.3 Y2.2 Y2.3 Y2.1 '\l.2Y1.1 ,o_2 >h2 Y0.1 '.JJ u ---- ----
44 ----- ----- Y8.4 Y8.5 Y7.5 Y7.4 Y7.5 Y7.3 Y6.4 Y6.3 Y5.4 - Y5.4 Y5.3 Y5.2 Y4.2 Y4.3 Y3.3 Y3.2 Y3.3 Y3.1 x2,2Y2.1 xU xL2Y11 xiJJ x0.2 xO.I 
4) ----- ----- Y8.5 Y8.4 Y8.5 Y8.3 ---- ----- Y7.4 Y? 3 Y6.4 ---- ---- Y6 4 Y6 3 Y6.2 Y5.2 Y5.3 Y4.3 Y4 2 Y4.3 Y4.1 _,3,2 Y3. 1 :-.2,2 '\2_2 Y2.1 x2_il -..... 1_2 "I I 
Table B.20 Dataflow for 2-parallel inverse 5/3 architecture
Ck    CP | CP1 & CP2      | CP1 output     | CP2 output     | RP1 input      | RP2 input      | RP1 output     | RP2 output
(f2)     | input latches  | latches        | latches        | latches        | latches        | latches        | latches
         | Rt0    Rt1     | Rtl0   Rtl1    | Rth0   Rth1    | Rt0    Rt1     | Rt0    Rt1     | Rt0    Rt1     | Rt0    Rt1
RUN 1
1     1  | LL0,0  LH0,0
2     2  | HL0,0  HH0,0
3     1  | LL0,1  LH0,1
4     2  | HL0,1  HH0,1
5     1  | LL1,0  LH1,0
6     2  | HL1,0  HH1,0
7     1  | LL1,1  LH1,1
8     2  | HL1,1  HH1,1
9     1  | LL2,0  LH2,0   | L0,0   L1,0
10    2  | HL2,0  HH2,0   |                | H0,0   H1,0
11    1  | LL2,1  LH2,1   | L0,1   L1,1    |                | L0,0   H0,0    | L1,0   H1,0
12    2  | HL2,1  HH2,1   |                | H0,1   H1,1
13    1  | LL3,0  LH3,0   | L2,0   L3,0    |                | L0,1   H0,1    | L1,1   H1,1
14    2  | HL3,0  HH3,0   |                | H2,0   H3,0
15    1  | LL3,1  LH3,1   | L2,1   L3,1    |                | L2,0   H2,0    | L3,0   H3,0
16    2  | HL3,1  HH3,1   |                | H2,1   H3,1
RUN 2
17    1  | LL0,2  LH0,2   | L4,0   L5,0    |                | L2,1   H2,1    | L3,1   H3,1
18    2  | HL0,2  HH0,2   |                | H4,0   H5,0
19    1  | LL1,2  LH1,2   | L4,1   L5,1    |                | L4,0   H4,0    | L5,0   H5,0    | X0,0   -----   | X1,0   -----
20    2  | HL1,2  HH1,2   |                | H4,1   H5,1
21    1  | LL2,2  LH2,2   | L6,0   L7,0    |                | L4,1   H4,1    | L5,1   H5,1    | X0,2   X0,1    | X1,2   X1,1
22    2  | HL2,2  HH2,2   |                | H6,0   H7,0
23    1  | LL3,2  LH3,2   | L6,1   L7,1    |                | L6,0   H6,0    | L7,0   H7,0    | X2,0   -----   | X3,0   -----
24    2  | HL3,2  HH3,2   |                | H6,1   H7,1
25    1  |                | L0,2   L1,2    |                | L6,1   H6,1    | L7,1   H7,1    | X2,2   X2,1    | X3,2   X3,1
26    2  |                |                | H0,2   H1,2
27    1  |                | L2,2   L3,2    |                | L0,2   H0,2    | L1,2   H1,2    | X4,0   -----   | X5,0   -----
28    2  |                |                | H2,2   H3,2
29    1  |                | L4,2   L5,2    |                | L2,2   H2,2    | L3,2   H3,2    | X4,2   X4,1    | X5,2   X5,1
30    2  |                |                | H4,2   H5,2
31    1  |                | L6,2   L7,2    |                | L4,2   H4,2    | L5,2   H5,2    | X6,0   -----   | X7,0   -----
32    2  |                |                | H6,2   H7,2
33    1  |                |                |                | L6,2   H6,2    | L7,2   H7,2    | X6,2   X6,1    | X7,2   X7,1
34    2
35    1  |                |                |                |                |                | X0,4   X0,3    | X1,4   X1,3
36    2
37    1  |                |                |                |                |                | X2,4   X2,3    | X3,4   X3,3
38    2
39    1  |                |                |                |                |                | X4,4   X4,3    | X5,4   X5,3
Table B.21 Dataflow for 4-parallel inverse 5/3 architecture
CK CP CPs input CPs I &3 CPs 2 & 4 RPs l & 3 RPs 2 & 4 input RPs I & 3 RPs 2 & 4 
[, Latches Out latches Out latches input latches latches Out latches Out latches 
RtO Rtl Rt!O Rtll RthO Rthl RP RtO Rtl Rt2 RP RtO Rtl Rt2 RtO Rtl RtO Rtl 
I I LLO.O LHO.O 
2 2 HLO.O HHO.O 
3 3 LLO,l LHO,i 
4 4 IILO, I HHO,i 
5 I LLI,O LHI,O 
6 2 HLI,O HHI,O 
7 3 LLI,l LHI,l 
8 4 HLI,l HHI,l 
- 9 1 LL2,0 LH2,0 
z 10 2 HL2,0 HH2,0 
::J II 3 LL2,1 LH2,1 
"' 12 4 HL2,1 HH2_! 
13 I LLJ,O LIIJ,O LO,O LI,O 
14 2 HLJ,O HHJ,O HO,O Hl.O 
15 3 LLJ,l LHJ,i LO,l Ll,l 
16 4 HLJ,l HHJ,l HO,l HI,! 
17 I LL4,0 ------ L2,0 Ll,O I LO,O HO,O HO, I 2 LI,O Hl,O HI_! 
18 2 HL4,0 ------ H2,0 H3,0 3 L0,1 H0,1 H0,0 4 L1,1 H1,1 H1,0
19 3 LL4,1 ------ L2,1 L3,1
20 4 HL4,1 ------ H2,1 H3,1
21 1 LL0,2 LH0,2 L4,0 L5,0 1 L2,0 H2,0 ----- 2 L3,0 H3,0 -----
22 2 HL0,2 HH0,2 H4,0 H5,0 3 L2,1 H2,1 H2,0 4 L3,1 H3,1 H3,0
23 3 LL1,2 LH1,2 L4,1 L5,1
24 4 HL1,2 HH1,2 H4,1 H5,1
25 1 LL2,2 LH2,2 L6,0 L7,0 1 L4,0 H4,0 H4,1 2 L5,0 H5,0 H5,1
26 2 HL2,2 HH2,2 H6,0 H7,0 3 L4,1 H4,1 H4,0 4 L5,1 H5,1 H5,0
27 3 LL3,2 LH3,2 L6,1 L7,1
28 4 HL3,2 HH3,2 H6,1 H7,1
29 1 LL4,2 ------ L8,0 ----- 1 L6,0 H6,0 ----- 2 L7,0 H7,0 -----
30 2 HL4,2 ------ H8,0 ----- 3 L6,1 H6,1 H6,0 4 L7,1 H7,1 H7,0
31 3 ------- ------- L8,1 -----
32 4 ------- ------- H8,1 -----
33 1 LL0,3 LH0,3 L0,2 L1,2 1 L8,0 H8,0 H8,1 2 ----- ---- ---- X0,0 ---- X1,0 -----
34 2 H0,2 H1,2 3 L8,1 H8,1 H8,0 4 ----- ---- ---- X0,2 X0,1 X1,2 X1,1
35 3 LL1,3 LH1,3 L2,2 L3,2
36 4 H2,2 H3,2
37 1 LL2,3 LH2,3 L4,2 L5,2 1 L0,2 H0,2 ----- 2 L1,2 H1,2 ----- X2,0 ---- X3,0 -----
38 2 H4,2 H5,2 3 L2,2 H2,2 ----- 4 L3,2 H3,2 ----- X2,2 X2,1 X3,2 X3,1
39 3 LL3,3 LH3,3 L6,2 L7,2
40 4 H6,2 H7,2
41 1 LL4,3 ------ L8,2 ----- 1 L4,2 H4,2 ----- 2 L5,2 H5,2 ----- X4,0 ---- X5,0 -----
42 2 H8,2 ------ 3 L6,2 H6,2 ----- 4 L7,2 H7,2 ----- X4,2 X4,1 X5,2 X5,1
43 3 ------- ------- ----- -----
44 4 ------- ------- ------ ------
45 1 L0,3 L1,3 1 L8,2 H8,2 ----- 2 ----- ---- ---- X6,0 ---- X7,0 -----
46 2 ------ ------ 3 ---- ---- ----- 4 ----- ---- ---- X6,2 X6,1 X7,2 X7,1
47 3 L2,3 L3,3
48 4 ------ ------
49 1 L4,3 L5,3 1 L0,3 ------ ---- 2 L1,3 ----- ----- X8,0 ---- ----- -----
50 2 ------ ------ 3 L2,3 ------ ----- 4 L3,3 ----- ----- X8,2 X8,1 ----- -----
51 3 L6,3 L7,3
52 4 ------ ------
53 1 L8,3 ----- 1 L4,3 ------ ----- 2 L5,3 ------ ----- X0,4 X0,3 X1,4 X1,3
54 2 ------ ------ 3 L6,3 ------ ----- 4 L7,3 ----- ----- X2,4 X2,3 X3,4 X3,3
55 3 ------ ------
56 4 ----- ------
57 1 1 L8,3 ------ ----- 2 ----- ---- ---- X4,4 X4,3 X5,4 X5,3
58 2 3 ----- ----- ----- 4 ----- ---- ---- X6,4 X6,3 X7,4 X7,3
59 3
60 4
61 1 X8,4 X8,3 ----- ------




FPGA COMPILATION AND SYNTHESIS RESULTS 
C.1 Compilation reports for forward 5/3 module "decorrelate_processor"
Flow Status: Successful - Tue Apr 20 13:11:30 2010
Legible resource entry: 438 / 12,480 (4 %)
[Quartus II flow summary screenshot. The values of the remaining fields (Quartus II Version, Revision Name, Top-level Entity Name, Family, Met timing requirements, Logic utilization, Combinational ALUTs, Dedicated logic registers, Total registers, Total pins, Total virtual pins, Total block memory bits) did not survive extraction.]
Figure C.1.1 Compilation Report - Flow Summary for forward 5/3 module "decorrelate_processor".
PowerPlay Power Analyzer Status: Successful - Tue Apr 20 13:11:30 2010
Power Estimation Confidence: Medium: user provided moderately complete toggle rate data
[PowerPlay Power Analyzer screenshot. The values for Quartus II Version, Revision Name, and the total, core dynamic, core static, and I/O thermal power dissipation did not survive extraction.]
Figure C.1.2 Compilation Report - Power Analyzer Summary for forward 5/3 module "decorrelate_processor".
Worst-case th: N/A, None, -1.831 ns
Clock Setup: 'clock': N/A, None, 185.74 MHz (period = 5.384 ns), from an altsyncram address register (name garbled in extraction) to Rd[9]
[Quartus II Timing Analyzer Summary screenshot. The value for the total number of failed paths did not survive extraction.]
Figure C.1.3 Compilation Report - Timing Analyzer Summary for forward 5/3 module "decorrelate_processor"
C.2 Compilation reports for inverse 5/3 module "reconst_processor"
Flow Status: Successful - Tue Apr 20 13:28:12 2010
Legible resource entry: 446 / 12,480 (4 %)
[Quartus II flow summary screenshot. The values of the remaining fields (Quartus II Version, Revision Name, Top-level Entity Name, Family, Met timing requirements, Logic utilization, Combinational ALUTs, Dedicated logic registers, Total registers, Total pins, Total virtual pins, Total block memory bits) did not survive extraction.]
Figure C.2.1 Compilation Report - Flow Summary for inverse 5/3 module "reconst_processor"
PowerPlay Power Analyzer Status: Successful - Tue Apr 20 13:28:12 2010
Power Estimation Confidence: Medium: user provided moderately complete toggle rate data
[PowerPlay Power Analyzer screenshot. The values for Quartus II Version, Revision Name, and the total, core dynamic, core static, and I/O thermal power dissipation did not survive extraction.]
Figure C.2.2 Compilation Report - Power Analyzer Summary for inverse 5/3 module "reconst_processor".
Figure C.2.3 Compilation Report - Timing Analyzer Summary for inverse 5/3 module "reconst_processor"




Flow Status: Successful - Tue Apr 20 13:47:13 2010
[Quartus II flow summary screenshot. The values of the remaining fields (Top-level Entity Name, Family, Met timing requirements, Logic utilization, Combinational ALUTs, Dedicated logic registers, Total registers, Total pins, Total virtual pins, Total block memory bits) did not survive extraction.]
Figure C.3.1 Compilation Report - Flow Summary for first 9/7 module "decrrelation2_processor"
PowerPlay Power Analyzer Status: Successful - Tue Apr 20 13:47:13 2010
Power Estimation Confidence: Medium: user provided moderately complete toggle rate data
[PowerPlay Power Analyzer screenshot. The values for Quartus II Version, Revision Name, and the total, core dynamic, core static, and I/O thermal power dissipation did not survive extraction.]
Figure C.3.2 Compilation Report - Power Analyzer Summary for first 9/7 module "decrrelation2_processor".
[Quartus II Timing Analyzer Summary screenshot for entity "decrrelation2_processor9_7". No values survived extraction.]
Figure C.3.3 Compilation Report - Timing Analyzer Summary for first 9/7 module "decrrelation2_processor"
C.4 Compilation reports for second forward 9/7 module "decorrelation_processor9_7"
Flow Status: Successful - Tue Apr 20 14:01:55 2010
Legible resource entry: 2,529 / 12,480 (20 %)
[Quartus II flow summary screenshot. The values of the remaining fields (Quartus II Version, Revision Name, Top-level Entity Name, Family, Met timing requirements, Logic utilization, Combinational ALUTs, Dedicated logic registers, Total registers, Total pins, Total virtual pins, Total block memory bits) did not survive extraction.]
Figure C.4.1 Compilation Report - Flow Summary for second 9/7 module "decorelation_processor"
PowerPlay Power Analyzer Status: Successful - Tue Apr 20 14:01:55 2010
Power Estimation Confidence: Medium: user provided moderately complete toggle rate data
[PowerPlay Power Analyzer screenshot. The values for Quartus II Version, Revision Name, and the total, core dynamic, core static, and I/O thermal power dissipation did not survive extraction.]
Figure C.4.2 Compilation Report - Power Analyzer Summary for second 9/7 module "decorelation_processor".
[Quartus II Timing Analyzer Summary screenshot for entity "decorelation_processor9_7". No values survived extraction.]
Figure C.4.3 Compilation Report - Timing Analyzer Summary for second 9/7 module "decrrelation_processor"
C.5 Compilation reports for 5/3 2-parallel module "two_parallel_DWT"
Flow Status: Successful - Tue Apr 20 12:21:07 2010
[Quartus II flow summary screenshot. The values of the remaining fields (Quartus II Version, Revision Name, Met timing requirements, Logic utilization, Combinational ALUTs, Dedicated logic registers, Total registers, Total pins, Total virtual pins, Total block memory bits, DSP block 9-bit elements, Total PLLs, Total DLLs) did not survive extraction.]
Figure C.5.1 Compilation Report - Flow Summary for 5/3 2-parallel module "two_parallel_DWT"
PowerPlay Power Analyzer Status: Successful - Tue Apr 20 12:21:07 2010
Power Estimation Confidence: Medium: user provided moderately complete toggle rate data
[PowerPlay Power Analyzer screenshot for "two_parallel_DWT". The values for Quartus II Version, Revision Name, and the total, core dynamic, core static, and I/O thermal power dissipation did not survive extraction.]





Clock Setup: 'clock': N/A, None, 186.01 MHz (period = 5.376 ns), from an altsyncram address register (name garbled in extraction) to Rd2[9]
[Quartus II Timing Analyzer Summary screenshot for "two_parallel_DWT". The remaining rows, including Worst-case th and the total number of failed paths, are only partly legible: one row reads 6.493 ns with HH_out[9]~reg0, and the other row values cannot be recovered reliably.]






[1] Ibrahim Saeed Koko and Herman Agustiawan, "Lifting-based VLSI architectures for 2-Dimensional
discrete wavelet transform for Effective image compression," in: Proceedings of the International
MultiConference of Engineers and Computer Scientists 2008 Vol. I, Hong Kong, March 2008, pp. 339-347.
[2] Ibrahim Saeed Koko and Herman Agustiawan, "High-speed and power efficient lifting-based VLSI
architectures for two-Dimensional discrete wavelet transform," proceedings of the IEEE Second Asia
International Conference on Modeling and Simulation, AMS, May 2008, pp. 998-1005.
[3] Ibrahim Saeed Koko and Herman Agustiawan, "Pipelined lifting-based VLSI architecture for
two-dimensional inverse discrete wavelet transform," proceedings of the IEEE International Conference on
Computer and Electrical Engineering, ICCEE, December 2008, Phuket Island, Thailand, pp. 692-700.
[4] Ibrahim Saeed Koko and Herman Agustiawan, "Parallel Pipelined VLSI Architectures for
Lifting-based Two-dimensional Forward Discrete Wavelet Transform," proceedings of the IEEE International
Conference on signal acquisitions and processing, ICSAP, April 2009, Kuala Lumpur, Malaysia, pp. 18-25.
Journal papers
[5] Ibrahim Saeed and Herman Agustiawan, "Two-dimensional Discrete Wavelet Transform Memory
Architectures," International Journal of Computer and Electrical Engineering, Vol. 1, No. 1, April 2009,
pp. 84-97.
[6] Ibrahim Saeed and Herman Agustiawan, "Parallel form of the Pipelined Lifting-based VLSI
Architectures for Two-dimensional Discrete Wavelet Transform," International Journal of Computer
Theory and Engineering, Vol. 1, No. 1, April 2009, pp. 85-96.
[7] Ibrahim Saeed and Herman Agustiawan, "Parallel Form of the Pipelined Intermediate Architecture for
2-dimensional Discrete Wavelet Transform," IAENG International Journal of Computer Science, Vol. 36,
issue 2, June 2009.
Book Chapter
[1] Ibrahim Saeed and Herman Agustiawan, "High Performance Parallel Pipelined Lifting-based VLSI
Architectures for Two-Dimensional Inverse Discrete Wavelet Transform," book title: "VLSI",
ISBN 978-3-902613-50-9, IN-TECH, Feb. 2010.
