VLSI implementation of discrete cosine transform using a new asynchronous pipelined architecture. by Lee, Chi-wai. & Chinese University of Hong Kong Graduate School. Division of Electronic Engineering.
VLSI Implementation of Discrete Cosine Transform 
Using a New Asynchronous Pipelined Architecture 
LEE Chi-wai 
A Thesis Submitted in Partial Fulfillment 
of the Requirements for the Degree of 
Master of Philosophy 
in 
Electronic Engineering 
© The Chinese University of Hong Kong 
February 2002 
The Chinese University of Hong Kong holds the copyright of this thesis. Any person(s) 
intending to use a part or whole of the materials in the thesis in a proposed publication must 
seek copyright release from the Dean of the Graduate School. 
Abstract of this thesis entitled: 
VLSI Implementation of Discrete Cosine Transform 
Using a New Asynchronous Pipelined Architecture 
Submitted by LEE Chi-wai 
for the degree of Master of Philosophy in Electronic Engineering 
at The Chinese University of Hong Kong in June 2001 
This thesis presents two different asynchronous VLSI implementations of the Discrete 
Cosine Transform (DCT). Although asynchronous design has potential advantages 
over synchronous design, the handshaking overhead and the design difficulties 
limit the speed of asynchronous circuits. To break through this 
barrier, a new asynchronous pipelined architecture is described in this thesis. It 
relaxes the handshaking protocol and has a simpler architecture, so the performance of 
asynchronous design is improved. Since the new architecture employs dynamic 
logic, a new technique called the Refresh Control Circuit is also introduced to reduce the 
performance degradation associated with the traditional technique. 
The first DCT implementation is realized in a programmable DSP processor. This 
programmable processor makes use of an asynchronous pipeline, a dataflow architecture 
and parallelism. A reasonable and encouraging result of 22Mpixel/sec in DCT 
operation is obtained with a limited number of arithmetic units. 
The second DCT implementation is performed on a dedicated 2D DCT/IDCT 
processor. It is a fully pipelined design and can operate at 76Mpixel/sec for 2D 
DCT/IDCT operation. It is capable of processing high quality MPEG-2 and 
baseband HDTV signals in real time, and is competitive with synchronous designs 
even though fewer arithmetic units are included in this processor. 
The results of the two implementations demonstrate the high performance of the new 
asynchronous pipelined architecture and the advantages of the asynchronous 




This thesis presents two asynchronous very-large-scale integrated circuits applicable to 
the Discrete Cosine Transform. Although, compared with synchronous design, 





















I would like to express my deepest gratitude to various individuals who provided me 
with sincere assistance in this research. 
First of all, I would like to thank my supervisor, Prof. Oliver, C.S. Choy, for his 
invaluable guidance, advice, and support during the course of this research work. 
Notwithstanding his busy schedule, he has worked with me throughout the lengthy 
and demanding project, providing me continuous comments, patience, supervision, 
and encouragement. Without his endless help and assistance, this thesis would never 
have been possible. I would also like to express my gratitude to Prof. C.F. Chan who 
has given me insightful suggestions during my research work. In addition, a special 
expression of thanks goes to the research assistant, Mr. Jan Butas, for his kind 
assistance during my study. 
Thanks are also due to my colleagues Mr. Cheng Wan Chi, Mr. Hon Kwok Wai, Mr. 
Leung Lai Kan, Miss Mak Wing Sum, Mr. Siu Pui Lam and Mr. Tang Tin Yau, and the 
laboratory technician Mr. Yeung Wing Yee, who have always been my sources 
of fun and encouragement. 
I would also like to thank my close friend, Miss Vivian Tsoi, for her kind assistance 
and concern throughout the process of this study. Her constant encouragement and 
everlasting patience and support were my strength, motivation, and inspiration all 
along. With her cordial support, I could exert my best effort on this study. 
Finally, I would also like to express my true gratitude to my parents and my sisters 
for their understanding and devoted love throughout my whole course of studies. 
Without their concern and support, I would not have finished my study successfully. I am 
once again indebted to all of these people. 
Lee Chi Wai 
June 2001 
Table of Contents 
Abstract of this thesis entitled: i 
_ iii 
Acknowledgements v 
Table of Contents vii 
List of Tables x 
List of Figures xi 
Chapter 1 
Introduction 1 
1.1 Synchronous Design 1 
1.2 Asynchronous Design 2 
1.3 Discrete Cosine Transform 4 
1.4 Motivation 5 
1.5 Organization of the Thesis 6 
Chapter 2 
Asynchronous Design Methodology 7 
2.1 Overview 7 
2.2 Background 8 
2.3 Past Designs 10 
2.4 Micropipeline 12 
2.5 New Asynchronous Architecture 15 
Chapter 3 
DCT/IDCT Processor Design Methodology 24 
3.1 Overview 24 
3.2 Hardware Architecture 25 
3.3 DCT Algorithm 26 
3.4 Used Architecture and DCT Algorithm 30 
3.4.1 Implementation on Programmable DSP Processor 31 
3.4.2 Implementation on Dedicated Processor 33 
Chapter 4 
New Techniques for Operating Dynamic Logic in Low Frequency 36 
4.1 Overview 36 
4.2 Background 37 
4.3 Traditional Technique 39 
4.4 New Technique - Refresh Control Circuit 40 
4.4.1 Principle 41 
4.4.2 Voltage Sensor 42 
4.4.3 Ring Oscillator 43 
4.4.4 Counter, Latch and Comparator 46 
4.4.5 Recalibrate Circuit 47 
4.4.6 Operation Monitoring Circuit 48 
4.4.7 Overall Circuit 48 
Chapter 5 
DCT Implementation on Programmable DSP Processor 51 
5.1 Overview 51 
5.2 Processor Architecture 52 
5.2.1 Arithmetic Unit 53 
5.2.2 Switching Network 56 
5.2.3 FIFO Memory 59 
5.2.4 Instruction Memory 60 
5.3 Programming 62 
5.4 DCT Implementation 63 
Chapter 6 
DCT Implementation on Dedicated DCT Processor 66 
6.1 Overview 66 
6.2 DCT Chip Architecture 67 
6.2.1 1D DCT Core 68 
6.2.1.1 Core Architecture 74 
6.2.1.2 Flow of Operation 76 
6.2.1.3 Data Replicator 79 
6.2.1.4 DCT Coefficients Memory 80 
6.2.2 Combination of IDCT to 1D DCT core 82 
6.2.3 Accuracy 85 
6.3 Transpose Memory 87 
6.3.1 Architecture 89 
6.3.2 Address Generator 91 
6.3.3 RAM Block 94 
Chapter 7 
Results and Discussions 97 
7.1 Overview 97 
7.2 Refresh Control Circuit 97 
7.2.1 Implementation Results and Performance 97 
7.2.2 Discussion 
7.3 Programmable DSP Processor 102 
7.3.1 Implementation Results and Performance 102 
7.3.2 Discussion 104 
7.4 1D DCT/IDCT Core 107 
7.4.1 Simulation Results 107 
7.4.2 Measurement Results 109 
7.4.3 Discussion 113 
7.5 Transpose Memory 122 
7.5.1 Simulated Results 122 
7.5.2 Measurement Results 123 




Operations of switches in DCT implementation of programmable DSP processor 
133 
C Program for evaluating the error in DCT/IDCT core 135 
Pin Assignments of the Programmable DSP Processor Chip 142 
Pin Assignments of the 1D DCT/IDCT Core Chip 144 
Pin Assignments of the Transpose Memory Chip 147 
Chip Microphotograph of the 1D DCT/IDCT Core 150 
Chip Microphotograph of the Transpose Memory 151 
Measured Waveforms of 1D DCT/IDCT Chip 152 
Measured Waveforms of Transpose Memory Chip 156 
Schematics of Refresh Control Circuit 158 
Schematics of Programmable DSP Processor 164 
Schematics of 1D DCT/IDCT Core 180 
Schematics of Transpose Memory 187 
References 191 
Design Libraries - CD-ROM 197 
List of Tables 
Table 5.1 - Instructions of switch 58 
Table 6.1 - Data rate at different stages of the 1D DCT core 79 
Table 6.2 - Bit length in different parts of the 2D DCT/IDCT processor 85 
Table 6.3 — Accuracy of the 2D DCT/IDCT processor 86 
Table 6.4 - Four different operation modes of the unified 1D DCT/IDCT core 87 
Table 7.1 - Transistor count on different units of Refresh Control Circuit 98 
Table 7.2 - Current drawn by the each parts of the Refresh Control Circuit 99 
Table 7.3 - Performance of multipliers by different techniques 100 
Table 7.4 - Bit length information of the 9-bit programmable DSP processor 102 
Table 7.5 - Performance comparison of different 1D DCT implementations 103 
Table 7.6 - Performance comparison of 2D DCT implementation on different 
programmable processors 106 
Table 7.7 - Performance of different processing units on the 1D DCT/IDCT core 107 
Table 7.8 - Input data, measured result and calculated result of the DCT row 
operation 111 
Table 7.9 - Performance comparison of different 2D DCT implementations 113 
Table 7.10 — Simulation results of power consumption of different operation units in 
the 1D DCT/IDCT core 118 
Table 7.11 - Comparison of power consumption on DCVSL and single-rail adder 120 
Table 7.12 - Performance of different units in the transpose memory 122 
List of Figures 
Figure 2.1 - Communications between sender and receiver in an asynchronous 
circuit 7 
Figure 2.2 - Timing diagram of (a) two-phase, (b) four-phase handshaking protocol 8 
Figure 2.3 - (a) connections in asynchronous circuit, (b) operation in asynchronous 
circuit 9 
Figure 2.4 — (a) symbol of C-element, (b) dynamic C-element and (c) static C-
element 11 
Figure 2.5 - Basic control FIFO sequence in Micropipeline structure 13 
Figure 2.6 - Micropipeline with computation 14 
Figure 2.7 - Domino Logic 16 
Figure 2.8 - Asynchronous architecture by using dynamic logic 17 
Figure 2.9 - (a) new handshake cell, (b) timing diagram of a pipeline stage 18 
Figure 2.10 - (a) new asynchronous pipeline connection, (b) flow of operations in the 
new asynchronous pipeline 20 
Figure 2.11 - Differential Cascode Voltage Switch Logic (DCVSL) 21 
Figure 2.12 - Completion signal generated from the DCVSL 22 
Figure 2.13 - (a) modified handshake cell, (b) modified handshake cell and basic 
FIFO cell in DCVSL structure, (c) connection of the asynchronous pipeline 23 
Figure 3.1 - 8x8 image block 27 
Figure 3.2 - 2D DCT of 8x8 image block 27 
Figure 3.3 - 2D DCT by row-and-column decomposition method 29 
Figure 3.4 - 2D DCT by direct method 30 
Figure 3.5 - Signal flow diagram of Jeong's 1D DCT fast algorithm 33 
Figure 4.1 - (a) 3-input NAND dynamic logic, (b) voltage in the floating node of the 
dynamic logic 57 
Figure 4.2 - Addition of the pull-up path in (a) dynamic logic, (b) domino logic 39 
Figure 4.3 - Techniques of overcome the charge redistribution problem 40 
Figure 4.4 - Modified structure of the (a)dynamic logic, (b)domino logic 41 
Figure 4.5 - Proposed refresh structure for (a)dynamic logic, (b)domino logic 42 
Figure 4.6 — Voltage sensor, (a)differential amplifier as the first stage with reference 
voltage generator, (b)two-stage sense amplifiers as the second stage 43 
Figure 4.7 - Ring oscillator 43 
Figure 4.8 - Charging and discharge current in the inverter chain 44 
Figure 4.9 — Ring oscillator with delay elements 45 
Figure 4.10 - Delay element, transmission gate with Schmitt trigger 45 
Figure 4.11 - (a) a voltage controlled inverter, (b) part of the voltage controlled ring 
oscillator 45 
Figure 4.12 - Block diagram of the Refresh Control Circuit 48 
Figure 4.13 — Timing diagram of the Refresh Control Circuit 49 
Figure 5.1 - Dataflow architecture of the programmable DSP processor 52 
Figure 5.2 — Product Full Adder (PFA) of the multiplier 53 
Figure 5.3 - The 8x8 multiplier core 54 
Figure 5.4 - Input buffer of multiplier 55 
Figure 5.5 - (a)2-to-2 programmable switch and its six modes of connection, 
(b)block diagram of the internal structure of switch, (c)CMOS structure of basic 
multiplier cell of the MUX 1 57 
Figure 5.6 - 8-to-8 switching network 58 
Figure 5.7 - Structure of FIFO memory 59 
Figure 5.8 - Instruction memory, (a) block diagram of the instruction memory, (b) 
the structure of cyclic FIFO, (c) structure of the instruction decoding network 61 
Figure 5.9 - Flow diagram of the first stage of DCT implementation 63 
Figure 5.10 - Flow diagram of the second stage of DCT implementation 63 
Figure 5.11 - Flow diagram of the third stage of DCT implementation 64 
Figure 5.12 - Flow diagram of the fourth stage of DCT implementation 64 
Figure 6.1 - Dataflow diagram in 2D DCT by row-and-column decomposition 
method 67 
Figure 6.2 - Block diagram of 2D DCT processor 68 
Figure 6.3 - Block diagram of the 1D DCT core 69 
Figure 6.4 — Structure of the 8-bit BLC adder 70 
Figure 6.5 - Modified input buffer for 2's complement input 72 
Figure 6.6 - Multiplier core of 2's complement multiplier 73 
Figure 6.7 - Architecture of 1D DCT core 75 
Figure 6.8 — (a) block diagram parallel-to-serial shift register in synchronous design, 
(b) block diagram of data replicator 79 
Figure 6.9 - (a) normal basic FIFO cell, (b) modified basic FIFO cell, (c) basic 
DCVSL structure of pre-storing data 81 
Figure 6.10 - DCT coefficients memory in upper path 82 
Figure 6.11 - Block diagram of the IDCT 83 
Figure 6.12 - Overall architecture of the 1D DCT/IDCT processor 84 
Figure 6.13 - Modification of memory cell of pre-storing different data in DCT and 
IDCT 84 
Figure 6.14 — Bit length in different parts of the 2D DCT/IDCT processor 86 
Figure 6.15 - Unified structure of 1D DCT/IDCT core 87 
Figure 6.16 - Write order of the transpose memory 88 
Figure 6.17 - Read order of (a) DCT operation, (b) IDCT operation 89 
Figure 6.18 - Block diagram of transpose memory 90 
Figure 6.19 - New structure of the transpose memory 90 
Figure 6.20 - Write address generator 92 
Figure 6.21 — Operation of the transpose memory 93 
Figure 6.22 - Block diagram of the RAM block 94 
Figure 6.23 — (a) SRAM basic cell, (b) monitor cell, (c) monitor cell in a bit column 
of SRAM 95 
Figure 7.1 - Simulation result of ring oscillator 98 
Figure 7.2 — Simulation result of the Refresh Control Circuit 98 
Figure 7.3 — Output signals of different multipliers 100 
Figure 7.4 — Simulation result of the programmable DSP processor 102 
Figure 7.5 — Timing diagram of the DCT operation 103 
Figure 7.6 — Layout of the 9-bit programmable DSP processor 104 
Figure 7.7 - Simulation result of the DCT coefficients memory 108 
Figure 7.8 - Layout of the 1D DCT/IDCT core processor 109 
Figure 7.9 — (a) input waveform of the DCT/IDCT core, (b) measured output 
waveform of the DCT/IDCT core in DCT row operation 110 
Figure 7.10 - (a) construction of the Done signal, (b) timing diagram of the Input 
Request, Acknowledgement and Done signal 111 
Figure 7.11 - (a) measured waveforms of the Output Request (lower) and 
Acknowledgement (upper) signal, (b) zoomed waveforms which shows the 
average throughput is 76MHz 112 
Figure 7.12 - (a) carry generation in domino logic, (b) carry generation in static logic 
117 
Figure 7.13 - Simulation result of the write and read operation 122 
Figure 7.14 - Layout of the transpose memory 123 
Figure 7.15 - (a) input waveform of the transpose memory, (b) measured output 
waveform of transpose memory in DCT operation mode 124 
Figure 7.16 - (a) measured waveforms of the Output Request (upper) and 
Acknowledgement (lower) signal, (b) zoomed waveforms which shows the 
average throughput is 100MHz 126 
Figure 7.17 — Done signals generated from the 32x15bit RAM block 127 
Chapter 1 - Introduction  
Chapter 1 
Introduction 
1.1 Synchronous Design 
Synchronous design is the most popular digital circuit design technique in the 
VLSI world today. In a synchronous circuit, a global clock is used to synchronize and 
trigger all operations. As VLSI technology moves towards higher speed, 
smaller feature size and larger chip size, the performance of synchronous circuits is 
limited by this global clock approach. 
The main reason for this limitation is the clock skew problem [1][2]. Clock skew is 
the difference in the arrival time of the clock signal at different parts of the circuit. As 
the chip size gets larger, it becomes difficult to make the global clock signal arrive at 
different parts of the design at the same time. Also, as the clock speed becomes 
higher, the global clock period becomes shorter and thus the transition time needs to 
be shorter relative to the clock period. However, the transition time can only 
be reduced to a limited extent. As a result, the operating speed has to be lowered 
to accommodate the problem. 
In addition, the frequency of the global clock is restricted by the slowest part of the 
whole design. The period between two consecutive active clock edges must be long 
enough for all computations to be completed before the result is latched. As a result, 
the clock period is determined by the slowest stage so that every stage is given 
enough time to fully process its data, thus yielding worst-case performance. 
It is believed that by eliminating these restrictions, a design can reach a higher level of 
performance, and this is the motivation for the development of asynchronous circuits. 
1.2 Asynchronous Design 
The main difference between synchronous and asynchronous design is the use of 
a global clock in the former and local handshake signals in the latter. In asynchronous 
design, operations on a functional unit are controlled by the communications between 
neighbouring units. When an event occurs on a communication wire, it triggers the 
start or stop of an operation. 
Since the global clock signal is removed, no clock skew problem exists in 
asynchronous design. Also, without the restriction of the global clock signal, 
different parts of an asynchronous circuit can operate at their own intrinsic speeds, 
so average-case performance can be achieved rather than the worst-case 
performance of a synchronous circuit. Therefore, the problems of synchronous 
design can be eliminated and higher speed can be achieved in asynchronous design. 
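The difference between worst-case and average-case timing can be illustrated with a small numerical sketch. The delay values below are hypothetical and are not taken from this thesis; the model simply assumes one functional unit processing a sequence of operations, each with its own data-dependent delay, and an optional per-operation handshake cost.

```python
# Hypothetical per-operation delays (ns) of one functional unit.
# These numbers are illustrative only; they are not measurements from the thesis.
delays_ns = [4.0, 6.0, 5.0, 9.0, 4.0, 5.0]


def synchronous_time(delays):
    """Every operation is given one clock period, set by the slowest case."""
    return len(delays) * max(delays)


def asynchronous_time(delays, handshake_overhead=0.0):
    """Each operation takes its own delay plus an assumed handshake cost."""
    return sum(d + handshake_overhead for d in delays)


print("synchronous (worst case):", synchronous_time(delays_ns), "ns")
print("asynchronous (average case):", asynchronous_time(delays_ns), "ns")
print("asynchronous with 3.5 ns handshake:", asynchronous_time(delays_ns, 3.5), "ns")
```

In this simplified model the asynchronous advantage disappears once the assumed handshake cost per operation reaches the gap between the worst-case and the average delay, which is why the handshake overhead matters so much for speed.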
In addition, asynchronous design offers other potential advantages such as low power 
consumption, automatic adaptation to physical properties, high modularity and less 
electromagnetic emission. These make asynchronous design attractive. 
Despite all the potential advantages motivating the development of asynchronous 
circuits, they have yet to achieve widespread use. This is because asynchronous circuits 
suffer from several problems as well. 
The major problem of asynchronous circuits is hazards [1]. In synchronous 
design, hazards can easily be removed by adding more registers or lowering the 
clock rate. Designers of asynchronous circuits, however, must remove all hazards 
to prevent incorrect operation. At the same time, there is little support from CAD 
tools, and design automation and optimization of asynchronous design have still not 
been fully achieved. As a result, extra attention and extensive simulations are 
required and thus the development cost is increased. 
Furthermore, additional handshake circuitry is required in asynchronous design 
to handle the communication signals. This circuitry is usually complex and 
leads to a larger area. Also, an asynchronous circuit generally 
requires extra time for the handshaking protocol, so an operation takes longer 
to complete because of the communication overhead. As a result, the expected 
average-case performance is not fully realized. These two factors can make an 
asynchronous circuit run even slower than its synchronous counterpart. 
Due to the maturity in synchronous design methodologies and the difficulties of 
asynchronous design as mentioned above, designers still prefer synchronous design 
in most of their system development today. 
1.3 Discrete Cosine Transform 
The Discrete Cosine Transform (DCT), proposed by Ahmed et al. in 1974 [3], and 
its inverse (IDCT) have become important tools for image and video signal 
processing applications due to their adoption in standards such as CCITT H.261 [4] 
for video telephony and teleconferencing, JPEG (Joint Photographic Experts Group) 
[5] for colour still image transmission and MPEG (Moving Picture Experts Group) 
[6] for moving pictures on storage media. The advantages of the DCT are that its 
performance is close to that of the optimal Karhunen-Loeve transform (KLT) for highly 
correlated signals, and that fast algorithms exist [7][8][9] which reduce the 
number of operations. 
The role of the DCT is to provide data compression of the picture while a reasonable 
quality is still maintained. It helps to reduce memory size and transmission 
bandwidth in image and video applications. The DCT basically involves additions and 
multiplications. The operation of the 1D N-point DCT and IDCT can be described by 
the following equations: 
DCT: $$Y_n = \frac{1}{2}\, c(n) \sum_{i=0}^{N-1} X_i \cos\frac{(2i+1)n\pi}{2N}$$ - Equation 1.1 
IDCT: $$X_i = \frac{1}{2} \sum_{n=0}^{N-1} c(n)\, Y_n \cos\frac{(2i+1)n\pi}{2N}$$ - Equation 1.2 
where $i, n = 0, 1, \ldots, N-1$, $c(0) = 1/\sqrt{2}$ and $c(n) = 1$ for $n \neq 0$. 
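As a minimal sketch, Equations 1.1 and 1.2 can be evaluated directly in software. The function names and the sample row below are illustrative assumptions and are not part of the thesis. Note that with the factor 1/2 the two equations form an exact inverse pair only for N = 8, the block size used in the image and video standards mentioned above, so the sketch uses an 8-point input.

```python
import math


def c(n):
    # Scaling factor from the text: c(0) = 1/sqrt(2), c(n) = 1 for n != 0.
    return 1.0 / math.sqrt(2.0) if n == 0 else 1.0


def dct_1d(x):
    """Direct evaluation of Equation 1.1 (N multiplications per output)."""
    N = len(x)
    return [
        0.5 * c(n) * sum(x[i] * math.cos((2 * i + 1) * n * math.pi / (2 * N))
                         for i in range(N))
        for n in range(N)
    ]


def idct_1d(y):
    """Direct evaluation of Equation 1.2."""
    N = len(y)
    return [
        0.5 * sum(c(n) * y[n] * math.cos((2 * i + 1) * n * math.pi / (2 * N))
                  for n in range(N))
        for i in range(N)
    ]


# Hypothetical row of eight pixel values (illustrative only).
row = [52, 55, 61, 66, 70, 61, 64, 73]
coeffs = dct_1d(row)
restored = idct_1d(coeffs)
print([round(v, 3) for v in coeffs])
print([round(v, 3) for v in restored])
```

This direct form needs N multiplications for every output sample, which is the operation count that the fast algorithms cited above aim to reduce.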
In recent years, the increasing demand for high-quality image and video signals, such as 
MPEG-2 and High Definition Television (HDTV), has required more and more 
computation in signal processing. To meet the real-time computation 
requirement, a processor which rapidly computes DCT has become a key component 
in image compression VLSI. 
1.4 Motivation 
Up to now, most asynchronous circuits have not performed well in 
terms of speed. Together with the difficulties discussed in Section 1.2, this 
discourages the development of asynchronous design. However, methods exist 
by which the full performance potential of asynchronous design can be realized. 
The poor speed performance of asynchronous circuits is mainly due to 
complicated handshaking circuitry and slow communication protocols. It is believed 
that by developing a new asynchronous architecture with simpler handshaking 
circuitry and a more aggressive handshaking protocol, together with careful circuit 
arrangement, hazards can be removed and a competitive asynchronous design can 
be obtained. This is the motivation of this project. 
The DCT is chosen for the realization of the new asynchronous architecture. 
Digital signal processing (DSP) algorithms are suitable for implementation with 
asynchronous techniques because their processing is data-dependent, which fits the style of 
asynchronous design. Among the various DSP algorithms, the DCT is widely used 
in image and video applications and requires high throughput. It therefore 
helps to demonstrate both the practicality of the new asynchronous architecture and its 
ability to meet the requirements of today's image and video applications. 
1.5 Organization of the Thesis 
This thesis is organized into eight chapters. The first chapter describes the 
background of the asynchronous design, Discrete Cosine Transform, and the 
motivation of this project. The second chapter introduces the basic operation and 
past methodologies in the asynchronous circuit design, and the new asynchronous 
pipelined architecture is presented at the end of this chapter. In chapter 3, various 
methods and algorithms of DCT implementation and two different approaches of the 
asynchronous implementation of DCT processor are described. Since dynamic logic 
is employed in the new asynchronous pipelined architecture, a new technique of 
operating dynamic logic in low frequency is presented in chapter 4. Chapter 5 
describes the detailed architecture of the programmable DSP asynchronous 
processor, and the DCT implementation is given as well. Chapter 6 presents another 
implementation of DCT on a dedicated DCT processor. The architecture and flow of 
operations on the processor, and the design of the transpose memory are all provided. 
In chapter 7, all the implementation results and performance of the designs proposed 
in this thesis are given. Based on the results, the performance comparisons, 
discussions and suggestions are also provided in the same chapter. Finally, 
the conclusion of the thesis is given in the last chapter. 
Chapter 2 - Asynchronous Design Methodology  
Chapter 2 
Asynchronous Design Methodology 
2.1 Overview 
The operation of an asynchronous circuit is based not on a global clock signal, 
as in a synchronous circuit, but on local handshake signals. The 
handshake signals control the communication between the 
sender and the receiver. Most asynchronous circuits use a 
similar handshaking protocol involving requests and acknowledgements. 
[Figure: Sender and Receiver blocks; recoverable labels: acknowledgement, data] 
Figure 2.1 - Communications between sender and receiver in an asynchronous circuit 
Figure 2.1 shows a basic communication interface in an asynchronous circuit. This 
communication style is called the bundled data approach [1][10]. In this 
approach, the interface between sender and receiver consists of a bundle of data wires 
which carry the information (one wire for each bit) and two control wires. When 
the data from the sender is ready, a transition occurs on the request wire to 
inform the receiver, and the acknowledgement wire from the receiver to the sender 
carries a transition when the data has been processed. The data is also 
held constant during the receiver's active phase to prevent incorrect operation. 
There are many types of handshaking protocol and different kinds of circuit for 
implementing this asynchronous communication interface. This chapter gives a brief 
introduction to the different handshaking protocols. In addition, some 
past designs and the micropipeline structure are introduced. Finally, the new 
asynchronous architecture is presented. 
2.2 Background 
[Figure: timing diagrams for panels (a) and (b); recoverable signal labels: data (valid data), request, acknowledgement] 
Figure 2.2 - Timing diagram of (a) two-phase, (b) four-phase handshaking protocol 
There are two classes of handshaking protocol, two-phase and 
four-phase [1][10][11]; their timing diagrams are shown in Figure 2.2. In the two-phase 
handshaking protocol, any transition of a handshake signal represents an 
event. In contrast, the four-phase handshaking protocol 
is a level-triggered protocol. The occurrence of an event is represented by an active 
level, and a return to the non-active level is required after the event has finished. 
In general, the two-phase handshaking protocol performs better than the 
four-phase one: because it uses every transition of the signal to represent an event, 
it leads to a faster communication rate. 
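The difference between the two protocols can be made concrete by listing the control-wire events needed for each data transfer. The sketch below is a behavioural illustration only; the function names are hypothetical, and the event ordering used for the four-phase case (request high, acknowledgement high, request low, acknowledgement low) is the conventional one, assumed here rather than taken from Figure 2.2.

```python
def two_phase(n_transfers):
    """Transition signalling: every edge on request or acknowledgement is an event."""
    req = ack = 0
    trace = []
    for _ in range(n_transfers):
        req ^= 1                      # sender toggles request: data is valid
        trace.append(("req", req))
        ack ^= 1                      # receiver toggles acknowledgement: data consumed
        trace.append(("ack", ack))
    return trace


def four_phase(n_transfers):
    """Level signalling: go to the active level, then return to the non-active level."""
    trace = []
    for _ in range(n_transfers):
        trace += [("req", 1), ("ack", 1),   # active phase
                  ("req", 0), ("ack", 0)]   # return-to-zero phase
    return trace


print(len(two_phase(3)), "control transitions for 3 two-phase transfers")
print(len(four_phase(3)), "control transitions for 3 four-phase transfers")
```

In this model the same sequence of wire transitions carries two transfers under the two-phase protocol but only one under the four-phase protocol, which is the source of the faster communication rate.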
Compared to the synchronous circuit, the request and acknowledgement signals are 
additional signals. As a result an extra control circuit is required in asynchronous 
design so as to handle these two signals, and usually this circuit is called handshake 
control circuit or handshake cell. 
[Figure: (a) handshake cells exchanging request and acknowledgement control signals above the functional blocks of stages N-2, N-1, N and N+1; (b) handshake cells and functional blocks of stages N-1 and N at Time = 0, Time = 1 and Time = 2] 
Figure 2.3 - (a) connections in asynchronous circuit, (b) operation in asynchronous circuit 
Figure 2.3(a) shows the basic connection in asynchronous design using the 
handshake cell. In this connection, the operations depend totally on the handshake 
signals, and that can be explained with the help of Figure 2.3(b). Initially when the 
operation of functional block in stage N-1 is completed, the output data will be 
passed to the functional block in stage N. At the same time, the handshake cell in 
stage N-1 will detect the completion of computation and generate a request signal for 
stage N. This request signal is used to indicate that the operation of stage N-1 is 
completed, and the output data of stage N-1 is held and ready for the stage N to 
process. Starting from this moment, stage N-1 needs to hold the output data until 
stage N finishes the computation. 
The handshake cell in stage N detects the request signal from the previous stage, and 
then allows the functional block in stage N to process the data. After the 
computation is completed, the handshake cell in stage N generates two signals. 
The first is the acknowledgement signal, which informs stage N-1 that 
the data has been processed. As a result, stage N-1 becomes idle and waits for 
data from stage N-2 for the next operation. The second is the request signal to 
stage N+1 for further processing of the data. 
This communication interface and protocol exist between every stage and its 
neighbouring stages in the asynchronous circuit. Since all operations are 
controlled by the handshake signals, the performance of the handshake cell becomes 
the main factor determining the speed of the asynchronous circuit. 
2.3 Past Designs 
The design of the handshake cell and the choice of handshaking protocol are 
important as they determine the throughput and latency of the whole asynchronous 
system. For the handshake cell, accurate detection of the completion of the 
operation and quick generation of the request signal are the most important issues, 
as they ensure that the circuit operates correctly and quickly. If the 
request signal is generated before the functional block finishes its computation 
or before the data is valid, a hazard occurs: incorrect data is latched by 
the next stage and an incorrect result is obtained. If the request signal is generated 
long after the end of the computation, the output is guaranteed to be correct but the 
whole circuit is slowed down. However, generating the request signal just in 
time while maintaining a simple structure is a difficult task. With a suitable 
handshake cell, the complexity of the handshaking protocol can be reduced and thus 
the communication time can be reduced too. As a result, the speed and performance 
of the whole circuit can be enhanced. 
In the past decades, many kinds of handshake cell have been developed 
[12][13][14][15]. The most famous and commonly used one is the C-element. 
The C-element was first introduced by D.E. Muller in 1956 [16]. It is a rendezvous 
element, or event-driven element. Figure 2.4 shows the symbol and two different 
CMOS structures of the C-element. 
[Figure: three circuit panels (a), (b) and (c), each with inputs A and B and output C] 
Figure 2.4 - (a) symbol of C-element, (b) dynamic C-element and (c) static C-element 
The C-element operates as follows: when both inputs are the same, their value 
is copied to the output; otherwise the previous output is maintained. Therefore, 
the output will only toggle when events have occurred at both inputs of 
the C-element. 
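The behaviour described above can be captured in a few lines of software. This is a purely behavioural sketch with hypothetical names (`CElement`, `update`); it models only the logic function, not the dynamic or static CMOS structures of Figure 2.4.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial            # the stored output value

    def update(self, a, b):
        if a == b:                    # both inputs agree: copy their value
            self.out = a
        return self.out               # inputs differ: hold the previous output


c_elem = CElement()
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    print(f"A={a} B={b} -> C={c_elem.update(a, b)}")
```

The output rises only after both inputs have risen and falls only after both have fallen, which is what makes the element usable as a rendezvous point between a request and an acknowledgement.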
The C-element is usually used in the two-phase handshaking protocol with the 
bundled data approach. When the C-element is applied in an asynchronous circuit, 
inputs A and B receive the request or completion signal from the previous 
stage and the acknowledgement signal from the next stage. The output C has three functions. 
The first is to control the operation of the functional block. The second is 
to act as the acknowledgement signal sent back to the previous stage, and 
the last is to act as the request signal sent to the next stage. A more detailed 
description of the operation of the C-element in asynchronous circuits is given in the next section. 
2.4 Micropipeline 
In both synchronous and asynchronous design, pipelining is an important technique
for improving the performance of a circuit or system. The principle of pipelining is
to divide a single operation into several sub-operations and allow them to operate
simultaneously [10]. In an asynchronous circuit, pipelining is done by breaking a
complex functional block into several simpler functional blocks, each governed by a
dedicated handshake cell. The most widely known pipeline methodology for asynchronous
circuits is the micropipeline.
The micropipeline was introduced in Ivan Sutherland's Turing Award lecture [10],
primarily as an asynchronous alternative to synchronous elastic pipelines. By
Sutherland's definition, a micropipeline is a simple form of event-driven elastic
pipeline with or without internal processing.
The basic operation of the micropipeline can be explained using the control
first-in-first-out (FIFO) structure shown in Figure 2.5. The control FIFO operates
with the two-phase handshaking protocol. Assume that all the wires are initially at
zero. When a transition occurs on the request input, the output of the first
C-element changes from zero to one. This transition is sent out of the control
sequence as an acknowledgement signal and is also propagated to the input of the
second C-element. Since its input has toggled, the same happens in the second
C-element, and then in the third. As a result, the request signal passes through all
the C-elements in series and emerges on request out.
Figure 2.5 - Basic control FIFO sequence in Micropipeline structure
However, when another request arrives at the request input, it may not emerge on
request out immediately. The control FIFO may not yet have received the
acknowledgement from the output side, so no transition has occurred on the
acknowledgement input and the output of the third C-element cannot toggle. This
behaviour is intended: no transition on the acknowledgement input means that the
output (recipient) side has not yet processed the previous request, so the new
request must not be passed on until the previous event is completed.
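The behaviour of the three-stage control FIFO can be reproduced with a small logic-level simulation. This is a sketch under one stated assumption: as in Sutherland's control circuit, each C-element is taken to receive the inverted output of the following C-element, and the last one the inverted acknowledgement input. The function names `c_element` and `settle` are illustrative only.

```python
def c_element(a, b, prev):
    """Copy the inputs when they agree, otherwise hold the previous output."""
    return a if a == b else prev


def settle(state, req_in, ack_in):
    """Propagate signals through the control FIFO until nothing changes.

    state[i] is the output of C-element i. Each element sees the output of the
    element before it and the inverted output of the element after it.
    """
    state = list(state)
    changed = True
    while changed:
        changed = False
        for i in range(len(state)):
            left = req_in if i == 0 else state[i - 1]
            right = ack_in if i == len(state) - 1 else state[i + 1]
            new = c_element(left, 1 - right, state[i])
            if new != state[i]:
                state[i] = new
                changed = True
    return state


s0 = [0, 0, 0]                       # all wires initially zero
s1 = settle(s0, req_in=1, ack_in=0)  # first request: ripples through to req out
s2 = settle(s1, req_in=0, ack_in=0)  # second request: blocked at the last stage
s3 = settle(s2, req_in=0, ack_in=1)  # acknowledgement arrives: second request emerges
print(s1, s2, s3)
```

Here `state[0]` plays the role of acknowledge out and `state[2]` the role of request out. The second request stops in front of the third C-element until the acknowledgement input toggles.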
Figure 2.6 shows the block diagram of Sutherland's micropipeline system. The
connections are similar to those of the control FIFO above, but a storage element
and a logic block are included in each stage. The storage element is the Capture and
Pass latch (CP latch), an event-controlled storage element. The inputs C and P
control the capture and pass functions, and the outputs Cd and Pd are simply delayed
versions of the inputs C and P respectively. In this micropipeline structure, when a
transition occurs on the request input, data is captured and stored in the CP latch.
The stored data is not passed to the output of the CP latch until a transition
occurs at input P. Once the CP latch in the next stage has captured the previous
data, the acknowledgement signal changes phase and is passed back to the first CP
latch, which then passes its stored data to the logic block to perform the logic
operation. This operation is repeated when the next request signal arrives. The
delay element delays the arrival of the request (capture) signal at the next stage
to ensure that the logic operation has completed, so it must match the worst-case
delay of the corresponding logic block.
Figure 2.6 - Micropipeline with computation
The micropipeline structure has several benefits. First, the architecture is simple
and effective: it is easy to implement and good throughput is readily achieved.
Second, the latches moderate the flow of data through the pipeline and can be used
to filter out hazards, so any logic structure can be used in the logic blocks,
including the straightforward structures used in synchronous designs. Finally, the
micropipeline is automatically elastic [10]: data can be sent to and received from a
micropipeline at arbitrary times.
Although the micropipeline is a powerful strategy that elegantly implements elastic
pipelines, it delivers worst-case performance in each stage, because delay elements
are added to the control path to match the worst-case computation time of the
corresponding function block. In addition, the circuit of its CP latch is rather
complicated, and delays are added to the capture and pass signals to make sure the
data has been latched. Both factors degrade performance.
2.5 New Asynchronous Architecture 
As discussed above, although the micropipeline is a powerful and widely used
methodology in asynchronous circuit design, it still has some areas for
improvement.
The first improvement over the micropipeline is the use of dynamic logic; in our
design, domino logic [18] is used. Domino logic is one member of the dynamic logic
family, and its basic structure is shown in Figure 2.7. The logic function of a
domino gate is determined by its nMOS logic block. Domino logic operates in two
phases, the Precharge phase and the Evaluation phase. When the clock signal is low,
the gate is in the Precharge phase. The output is then low, because a pull-up path
charges the floating node high and the output is the inverse of that node. When the
clock signal is high, the gate is in the Evaluation phase and the output depends on
the input data. If the input data creates a pull-down path in the nMOS tree, the
floating node is discharged and the output goes high. Otherwise the output stays
low.
Figure 2.7 - Domino Logic
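The two-phase operation of a domino gate, including the temporary storage that the next paragraphs rely on, can be sketched as a behavioural model. This is an idealised sketch: charge leakage and charge redistribution are ignored, and the class `DominoGate` and its `step` method are illustrative names, not part of the thesis design.

```python
class DominoGate:
    """Idealised domino gate: precharged dynamic node followed by an inverter."""

    def __init__(self, pull_down):
        self.pull_down = pull_down  # models the nMOS logic block
        self.node = 1               # floating node, high after precharge

    def step(self, clock, inputs):
        if clock == 0:
            self.node = 1           # Precharge phase: node pulled up
        elif self.pull_down(*inputs):
            self.node = 0           # Evaluation phase: node discharged
        # Once discharged, the node stays low until the next precharge.
        return 1 - self.node        # output inverter


and_gate = DominoGate(lambda a, b: a and b)
sequence = [
    (0, (1, 1)),  # precharge
    (1, (1, 1)),  # evaluate: output goes high
    (1, (0, 1)),  # inputs change while the clock is still high: output is held
    (0, (0, 1)),  # precharge
    (1, (0, 1)),  # evaluate: no pull-down path, output stays low
]
outputs = [and_gate.step(clk, ins) for clk, ins in sequence]
print(outputs)
```

The third step shows the storage property: the output remains high even though the inputs no longer form a pull-down path.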
The advantages of dynamic logic are lower processing delay and smaller size compared
with conventional CMOS data-paths. For these reasons, many asynchronous circuits
[11] [19] [20] [21] [23] [25] have adopted dynamic logic in their micropipeline
designs. However, most of them do not exploit all the properties of dynamic logic.
One interesting property is its ability to store data temporarily [17] [19]. Dynamic
logic is in effect a combination of logic and storage elements: under some
conditions, the output data is held even after the input data has changed. As a
result, the complex CP latch of the micropipeline can be omitted if dynamic logic
(domino logic in our case) is used. This use of dynamic logic in asynchronous
circuits has been demonstrated by Renaudin et al. [17], and the pipeline structure is
shown in Figure 2.8. In this architecture, completion detection no longer relies on
the worst-case delay; it is done by a dedicated circuit that monitors the output of
the logic block and responds quickly and accurately when the output is ready.
Although dynamic logic benefits asynchronous circuits, it introduces the problems of
charge leakage and charge redistribution. These problems impose a minimum operating
frequency on dynamic logic in order to prevent logic errors, so extra care must be
taken when using it. This problem and some possible solutions are discussed further
in chapter 4.
Figure 2.8 - Asynchronous architecture by using dynamic logic
Besides dynamic logic, another improvement concerns the handshaking protocol and the
handshake cell. The implementation shown in Figure 2.8 uses a very restrictive
handshaking protocol to guarantee secure operation of the asynchronous pipeline. A
given stage in this pipeline can start a new operation, either precharge or
evaluate, only when both the previous and the next stage have finished their current
operations. This strict protocol limits the performance of the handshaking.
In the new asynchronous architecture, the protocol is relaxed in two ways. First, a
stage is allowed to enter the Evaluation phase as soon as the next stage enters the
Precharge phase, i.e. it does not need to wait for confirmation that the precharge
has completed. Second, a stage is allowed to finish the Precharge phase even if the
previous stage is still in the Evaluation phase. This introduces a flexible
"Enable" period between the Precharge and Evaluation periods. To carry out this new
handshaking protocol, a new handshake cell is used, shown in Figure 2.9(a).
Figure 2.9 - (a) new handshake cell, (b) timing diagram of a pipeline stage
The new handshake cell is also in domino style. Compared with the classical
architecture, this handshake cell is faster because of its simplicity, the low input
capacitance seen by the request, and the use of a simple transistor in the pull-up.
In this new structure, the handshake cell and the domino logic cell enter each
phase, Precharge or Evaluation, at the same time. As a result, the handshake cell
can be seen as a logic element of the pipeline stage and the throughput of the
system can be maximized [30]. The handshake cell can easily be modified to accept
more than one request signal by connecting more nMOS transistors in series in the
nMOS tree, similar to a dynamic AND structure. The speed advantage is more
significant when handshake signals must be joined logically, because the classical
C-element with many inputs is very slow.
One disadvantage of this handshake cell is that it requires the four-phase
handshaking protocol, which needs a longer communication time. However, the
four-phase protocol suits the operation of dynamic logic, because the non-active
phase can be used to precharge the dynamic logic.
With the new handshake cell, the operation of the new asynchronous pipeline
architecture can be divided into four phases: Evaluation, Hold, Precharge and Enable.
The timing diagram is shown in Figure 2.9(b). In the Evaluation phase, the current
stage processes the data that is valid at its input. When the stage has finished
processing, it enters the Hold phase, in which the input data may become invalid but
the output is held for processing by the next stage. The stage then enters the
Precharge phase, followed by the Enable phase, in which it waits for valid data to
appear at its input. The Enable phase is skipped when valid input data has already
appeared during the Precharge phase. Since all the handshake cells and logic cells
must be precharged at power-up, a NOR gate is used in the handshake cell, as shown in
Figure 2.9(a). One of the NOR gate inputs is connected to the Reset signal so that
all the cells in the previous stage can be precharged initially.
Figure 2.10 shows the connections and the flow of pipeline operations in the new
asynchronous architecture. When data arrives, the current stage enters the
Evaluation phase to process it. Afterwards, the stage enters the Hold phase to hold
the data for the next pipeline stage to process. At this moment, it sends a request
signal to the following stage and an acknowledgement signal to the previous stage.
After the following stage has processed the data, the current stage enters the
Precharge phase. Finally, it enters the Enable phase to wait for new data from the
previous stage.
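The cycle of phases followed by a single pipeline stage can be summarised as a small state machine. This is a simplified sketch, not the handshake circuit itself: the four boolean conditions (`computation_done`, `next_stage_done`, `precharge_done`, `input_valid`) are hypothetical names standing in for the handshake signals of the thesis.

```python
from enum import Enum


class Phase(Enum):
    EVALUATION = "Evaluation"
    HOLD = "Hold"
    PRECHARGE = "Precharge"
    ENABLE = "Enable"


def next_phase(phase, *, computation_done=False, next_stage_done=False,
               precharge_done=False, input_valid=False):
    """Return the phase a stage moves to, or the same phase if it must wait."""
    if phase is Phase.EVALUATION and computation_done:
        return Phase.HOLD          # result is held for the following stage
    if phase is Phase.HOLD and next_stage_done:
        return Phase.PRECHARGE     # following stage has processed the data
    if phase is Phase.PRECHARGE and precharge_done:
        # Enable is skipped when valid input data is already waiting.
        return Phase.EVALUATION if input_valid else Phase.ENABLE
    if phase is Phase.ENABLE and input_valid:
        return Phase.EVALUATION
    return phase


p = Phase.EVALUATION
p = next_phase(p, computation_done=True)   # -> Hold
p = next_phase(p, next_stage_done=True)    # -> Precharge
p = next_phase(p, precharge_done=True)     # -> Enable (no input waiting)
p = next_phase(p, input_valid=True)        # -> Evaluation
print(p)
```

Passing `input_valid=True` together with `precharge_done=True` takes the stage straight from Precharge back to Evaluation, which corresponds to the case where the Enable phase is omitted.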
Figure 2.10 - (a) new asynchronous pipeline connection, (b) flow of operations in the new
asynchronous pipeline
The use of Differential Cascode Voltage Switch Logic (DCVSL) [24], a type of domino
logic, can also improve the speed of the circuit. Figure 2.11 shows the basic
structure of a DCVSL cell. Its operation is similar to that of domino logic. In the
Precharge phase, both the true and the complementary outputs are kept low. In the
Evaluation phase, the computation is enabled. Because of the complementary structure
of the nMOS logic blocks in DCVSL, one and only one of the outputs goes high.
Figure 2.11 - Differential Cascode Voltage Switch Logic (DCVSL)
DCVSL has two benefits in asynchronous logic. First, it is based on the structure of
domino logic and so inherits its benefits, namely fast computation and the storage
property. Second, DCVSL provides dual-rail coded data, from which a very reliable
completion signal is obtained simply by OR-ing the two outputs, as shown in Figure
2.12. For these reasons, DCVSL is an attractive way to implement asynchronous
functions [26] [28] [29] and has been used in many asynchronous designs
[17] [22] [26] [27].
Figure 2.12 - Completion signal generated from the DCVSL
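Completion detection by OR-ing the dual-rail outputs can be illustrated with a logic-level sketch. The example gate is a two-input AND chosen purely for illustration, and the function names `dcvsl_and` and `completion` are hypothetical; timing, the storage property and the transistor structure are not modelled.

```python
def dcvsl_and(a, b, clock):
    """Dual-rail AND gate: returns (true output, complementary output)."""
    if clock == 0:
        return (0, 0)          # Precharge phase: both outputs are low
    true_out = a & b           # Evaluation phase: exactly one rail goes high
    return (true_out, 1 - true_out)


def completion(true_out, comp_out):
    """Completion signal obtained by OR-ing the two rails."""
    return true_out | comp_out


precharged = dcvsl_and(1, 1, clock=0)
result_high = dcvsl_and(1, 1, clock=1)
result_low = dcvsl_and(0, 1, clock=1)
print(completion(*precharged), completion(*result_high), completion(*result_low))
```

The completion signal is low while the gate is precharged and goes high once either rail has been asserted, regardless of the computed value.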
Although this way of generating the completion signal is very simple, it still adds
one gate delay after the computation completes. In fact, completion can be detected
directly, without the OR gate, by modifying the handshake cell. Figure 2.13(a) shows
the modified CMOS structure of the new handshake cell. In this structure, the true
and complementary outputs of the DCVSL block are connected directly to the handshake
cell for completion detection. As a result, the OR gate and the request signal can
be eliminated, completion detection closely tracks the actual computation time of
the DCVSL block, and average-case performance is achieved. Figure 2.13(b) shows an
example of a single-bit basic FIFO cell with the modified handshake cell, and Figure
2.13(c) shows the new connection of the asynchronous pipeline using the new
handshake cell.
Figure 2.13 - (a) modified handshake cell, (b) modified handshake cell and basic FIFO cell in
DCVSL structure, (c) connection of the asynchronous pipeline
DCVSL improves speed because the communication protocol is simpler, but the
trade-off is the size penalty it incurs. Moreover, dual-rail data requires a large
routing area in the physical layout because the bus width is doubled. Therefore
DCVSL is used within the processing units in order to maximize performance, while on
each connection between processing units, which may be far apart in the physical
layout, a dual-to-single or single-to-dual rail conversion interface is inserted so
that single-rail data reduces the routing area.
Chapter 3 
DCT/IDCT Processor Design Methodology 
3.1 Overview 
Most digital signal processing (DSP) algorithms involve many mathematical operations
and require substantial computational resources. The Discrete Cosine Transform (DCT)
[3] is no exception. Although general-purpose microprocessors and microcontrollers
contain arithmetic units, these are not designed specifically for purely
mathematical operations. As a result, a DSP algorithm implemented on them may be
inefficient and perform poorly. This motivates the development of DSP chips, and of
the DCT chip in this thesis.
There are many hardware architectures for implementing the DCT algorithm, such as a
programmable DSP processor or a dedicated ASIC. There are also many kinds of DCT
algorithm: some focus on reducing the number of operations, while others allow a
more regular architecture for VLSI implementation. Careful analysis is required to
find the most suitable combination for implementing the DCT in an asynchronous
circuit.
The advantage of using asynchronous techniques to implement the DCT or other DSP
processors is average-case performance. There are many functional blocks in the
design, and their computation times may differ considerably. The global clock
frequency of a synchronous circuit is governed by the worst-case delay in the whole
system, whereas each functional block in an asynchronous circuit operates at its own
speed. As a result, the computation time of an asynchronous DSP chip may be shorter
than that of a synchronous one.
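A small numerical sketch makes the comparison concrete for one data item passing through a chain of functional blocks. The block delays and the per-stage handshake cost below are hypothetical figures chosen for illustration, not measurements from this work, and the function names are illustrative.

```python
def synchronous_latency(delays_ns):
    # Every block is given one clock period, and the period must cover the slowest block.
    return len(delays_ns) * max(delays_ns)


def asynchronous_latency(delays_ns, handshake_ns=0):
    # Each block takes its own delay plus an assumed fixed handshake cost.
    return sum(d + handshake_ns for d in delays_ns)


delays = [4, 9, 3, 6]  # hypothetical block delays in ns
sync_ns = synchronous_latency(delays)
async_ns = asynchronous_latency(delays, handshake_ns=1)
print(sync_ns, async_ns)
```

With these assumed figures the asynchronous chain is faster, but the advantage shrinks as the handshake cost grows, which is why the handshaking overhead receives so much attention in Chapter 2.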
In this chapter, different hardware architectures and DCT algorithms will be 
considered and compared. 
3.2 Hardware Architecture 
Unlike general-purpose microprocessors and microcontrollers, DSP processors have
traditionally been optimized for arithmetic operations such as convolution,
recursive filtering and fast transforms, which characterize most signal processing
algorithms. They are used in many application areas such as communications, speech
and video/image processing. As mentioned in the previous section, a DSP processor can
be either programmable or dedicated.
A programmable DSP processor offers flexibility and shorter design time, since it
allows a variety of DSP algorithms to be implemented. Besides arithmetic units,
extra memory and control units are required to store the application programme and
control the operations on data. The performance of a DSP algorithm depends not only
on the hardware but also on the application programme, so the programme should be
optimized to make full use of the hardware in the processor.
On the other hand, a dedicated ASIC is hardwired to perform a specific algorithm,
and usually no extra control or programme is required. Once it is designed, the
performance of the dedicated ASIC is fixed. Although a dedicated ASIC has essentially
no flexibility, it is expected to perform better than the programmable approach
because the DSP algorithm is optimized at the hardware level, and it is also more
efficient in terms of area and speed.
3.3 DCT Algorithm 
The main application of the DCT is video and image compression. In most image and
video applications, the DCT is not applied to the whole image directly, because this
would require a very large number of computations. Instead, the image is divided
into regular blocks for processing. The block size is usually eight or sixteen
pixels in both the x and y directions, as shown in Figure 3.1. Block sizes of 8x8
and 16x16 are used because they have been found to provide sufficient detail and
localized activity of the picture to enable reasonable adaptive processing of the
image [31]. For most current DCT applications, such as H.261 [4], JPEG [4] and MPEG
[6], a block size of 8x8 is recommended. Hardware development has therefore
concentrated on the 8x8 two-dimensional (2D) DCT.
Figure 3.1 - 8 x 8 image block
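In general, the NxN-point 2D DCT is given by Equation 3.1,
\[ Y_{kl} = \frac{2}{N}\, c(k)\, c(l) \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} x_{mn} \cos\left[\frac{(2m+1)k\pi}{2N}\right] \cos\left[\frac{(2n+1)l\pi}{2N}\right] \qquad \text{(Equation 3.1)} \]
where m, n, k, l = 0, 1, ..., N-1
\[ c(0) = \frac{1}{\sqrt{2}}, \qquad c(i) = 1 \ \text{for}\ i \neq 0 \]
Since the block size for video and image applications is 8x8, i.e. N = 8,
Equation 3.1 becomes
\[ Y_{kl} = \frac{1}{4}\, c(k)\, c(l) \sum_{n=0}^{7} \sum_{m=0}^{7} x_{mn} \cos\left[\frac{(2m+1)k\pi}{16}\right] \cos\left[\frac{(2n+1)l\pi}{16}\right] \qquad \text{(Equation 3.2)} \]
where m, n, k, l = 0, 1, ..., 7
\[ c(0) = \frac{1}{\sqrt{2}}, \qquad c(i) = 1 \ \text{for}\ i \neq 0 \]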
Figure 3.2 - 2D DCT of 8x8 image block
Figure 3.2 shows the 8x8 2D DCT of an image block. If the 8x8 2D DCT is implemented
directly from Equation 3.2, a total of 4096 (8^4) multiplications and 4032 (64 x 63)
additions are required to calculate all 64 DCT outputs. This number of arithmetic
operations is extremely high, especially the number of multiplications, which
require more computational resources. It is not practical to perform the 2D
transform this way in real-time applications, even on a dedicated DSP processor.
Fortunately, there are many fast 2D DCT algorithms that reduce the number of
operations and thus make real-time 2D DCT implementation possible.
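The operation count of the direct implementation can be checked with a short Python sketch of the 8x8 2D DCT (prefactor 1/4, c(0) = 1/sqrt(2)). It assumes that the product of the scale factor and the two cosines is a precomputed coefficient, so that each input sample costs one multiplication per output; the function names are illustrative.

```python
import math

N = 8


def c(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct2d_direct(x):
    """Direct 8x8 2D DCT that also counts multiplications and additions."""
    mults = adds = 0
    y = [[0.0] * N for _ in range(N)]
    for k in range(N):
        for l in range(N):
            acc = None
            for m in range(N):
                for n in range(N):
                    # Treated as one precomputed coefficient per (k, l, m, n).
                    coeff = (0.25 * c(k) * c(l)
                             * math.cos((2 * m + 1) * k * math.pi / 16)
                             * math.cos((2 * n + 1) * l * math.pi / 16))
                    term = coeff * x[m][n]
                    mults += 1
                    if acc is None:
                        acc = term
                    else:
                        acc += term
                        adds += 1
            y[k][l] = acc
    return y, mults, adds


block = [[1.0] * N for _ in range(N)]  # constant test block
y, mults, adds = dct2d_direct(block)
print(mults, adds, round(y[0][0], 6))
```

For a constant block all the energy ends up in the DC coefficient, and the counters give 64 multiplications and 63 additions for each of the 64 outputs.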
There are two main types of fast algorithm for VLSI implementation of the 2D DCT.
The first is the row-and-column decomposition method, shown in Figure 3.3. This
method separates the 2D DCT into two sets of one-dimensional (1D) DCT operations,
based on the symmetry and regularity of the 2D DCT structure. The first 1D
transforms are applied to the data row-wise, which is called the row operation. The
next 1D transforms are then applied column-wise to the intermediate results of the
row operation, which is called the column operation. The results of the row
operation can be reordered into column order by a transpose memory. In this way, a
complex NxN 2D DCT is decomposed into 2N 1D DCT operations and the number of
multiplications is reduced from N^4 to 2N x N^2, i.e. from 4096 to 1024 for N = 8.
The computational requirement is therefore greatly reduced. A better result can be
achieved by also applying a fast 1D DCT algorithm [7][8][9] in the row and column
operations. Since the row-and-column decomposition method requires two 1D DCT
processors and its implementation is straightforward, it has been chosen by many
other developers [32][33][34][35].
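The row-and-column decomposition can be sketched in a few lines of Python and checked against the direct 2D formula. This is an illustrative software model only: the list transposition stands in for the transpose memory, the 1D transform uses the scaling (1/2) c(k), and all function names are hypothetical.

```python
import math

N = 8


def c(k):
    return 1 / math.sqrt(2) if k == 0 else 1.0


def dct1d(v):
    """Straightforward 8-point 1D DCT (N^2 = 64 multiplications)."""
    return [0.5 * c(k) * sum(v[i] * math.cos((2 * i + 1) * k * math.pi / 16)
                             for i in range(N))
            for k in range(N)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2d_row_column(x):
    rows = [dct1d(r) for r in x]                      # row operation: 8 transforms
    cols = [dct1d(col) for col in transpose(rows)]    # column operation: 8 transforms
    return transpose(cols)


def dct2d_reference(x, k, l):
    """One output of the direct 2D formula, used only as a check."""
    return 0.25 * c(k) * c(l) * sum(
        x[m][n]
        * math.cos((2 * m + 1) * k * math.pi / 16)
        * math.cos((2 * n + 1) * l * math.pi / 16)
        for m in range(N) for n in range(N))


x = [[(3 * m + 5 * n) % 7 for n in range(N)] for m in range(N)]
y = dct2d_row_column(x)
max_err = max(abs(y[k][l] - dct2d_reference(x, k, l))
              for k in range(N) for l in range(N))
print(max_err < 1e-9)
```

Sixteen 1D transforms of 64 multiplications each give the 2N x N^2 = 1024 multiplications quoted above.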
Figure 3.3 - 2D DCT by row-and-column decomposition method
The second type of fast 2D DCT algorithm is the direct method, which computes the 2D
DCT directly with a 2D algorithm. Many fast 2D algorithms have been proposed
[36] [37] [38]. They exploit trigonometric identities of the 2D DCT so that the NxN
2D DCT can be decomposed into N 1D DCTs plus some extra additions, as shown in Figure
3.4, and the number of multiplications is reduced to N x N^2. As with the
row-and-column decomposition method, the number of operations can be reduced further
by applying a fast 1D DCT algorithm in the 1D DCT processor design.
Figure 3.4 - 2D DCT by direct method
Comparing the two approaches, the 2D direct method is superior to the row-and-column
decomposition method in arithmetic cost: it involves far fewer multiplications,
which leads directly to better performance, and it does not require a transpose
memory. However, most of these fast 2D direct algorithms require a very complex data
path in the adder/subtractor networks of the pre- and post-processors, which makes
VLSI implementation difficult [37] [39]. Besides the routing overhead, this
complexity also introduces a large handshaking overhead in an asynchronous
implementation. On the other hand, although the row-and-column decomposition
requires more arithmetic operations, it requires only two 1D DCT processors, which
saves a lot of hardware. The data path of a 1D DCT is also simpler and more regular,
which makes hardware implementation easier and favours asynchronous implementation.
For these reasons, the row-and-column decomposition is chosen for the implementation
of the 2D DCT in asynchronous circuits.
3.4 Used Architecture and DCT Algorithm 
In this thesis, two different implementations of the DCT are presented. As discussed
above, the row-and-column decomposition is more suitable for implementing the 2D DCT
with asynchronous technology. The following sections and chapters therefore focus on
the design and implementation of the 1D DCT algorithm. Of the two 1D DCT
implementations, one is built on a programmable DSP processor and the other is
implemented as a dedicated processor. The implementation of the transpose memory is
discussed in chapter 6.
3.4.1 Implementation on Programmable DSP Processor 
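Recalling Equation 1.1, the 8-point DCT is given by the following equation
\[ Y_k = \frac{1}{2}\, c(k) \sum_{i=0}^{7} x_i \cos\left[\frac{(2i+1)k\pi}{16}\right] \qquad \text{(Equation 3.3)} \]
and its matrix representation is shown as follows
\[
\begin{bmatrix} Y_0 \\ Y_1 \\ Y_2 \\ Y_3 \\ Y_4 \\ Y_5 \\ Y_6 \\ Y_7 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
A & A & A & A & A & A & A & A \\
D & E & F & G & -D & -E & -F & -G \\
B & C & -C & -B & B & C & -C & -B \\
E & -G & -D & -F & -E & G & D & F \\
A & -A & -A & A & A & -A & -A & A \\
F & -D & G & E & -F & D & -G & -E \\
C & -B & B & -C & C & -B & B & -C \\
G & -F & E & -D & -G & F & -E & D
\end{bmatrix}
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \\ x_7 \\ x_6 \\ x_5 \\ x_4 \end{bmatrix}
\qquad \text{(Equation 3.4)}
\]
where A = cos(π/4), B = cos(π/8), C = sin(π/8), D = cos(π/16),
E = cos(3π/16), F = sin(3π/16), G = sin(π/16)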
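The matrix form can be checked numerically against Equation 3.3. The sketch below assumes that the input vector of Equation 3.4 is ordered x0, x1, x2, x3, x7, x6, x5, x4, which is the ordering under which the printed coefficient rows agree with Equation 3.3; the function names and the test vector are illustrative.

```python
import math

A = math.cos(math.pi / 4)
B = math.cos(math.pi / 8)
C = math.sin(math.pi / 8)
D = math.cos(math.pi / 16)
E = math.cos(3 * math.pi / 16)
F = math.sin(3 * math.pi / 16)
G = math.sin(math.pi / 16)

# Coefficient rows of Equation 3.4.
M = [
    [A, A, A, A, A, A, A, A],
    [D, E, F, G, -D, -E, -F, -G],
    [B, C, -C, -B, B, C, -C, -B],
    [E, -G, -D, -F, -E, G, D, F],
    [A, -A, -A, A, A, -A, -A, A],
    [F, -D, G, E, -F, D, -G, -E],
    [C, -B, B, -C, C, -B, B, -C],
    [G, -F, E, -D, -G, F, -E, D],
]
ORDER = [0, 1, 2, 3, 7, 6, 5, 4]  # assumed ordering of the input vector


def dct8_matrix(x):
    v = [x[i] for i in ORDER]
    return [0.5 * sum(M[k][j] * v[j] for j in range(8)) for k in range(8)]


def dct8_formula(x):
    out = []
    for k in range(8):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        out.append(0.5 * ck * sum(x[i] * math.cos((2 * i + 1) * k * math.pi / 16)
                                  for i in range(8)))
    return out


x = [12, -3, 7, 0, 5, 9, -8, 2]
max_err = max(abs(a - b) for a, b in zip(dct8_matrix(x), dct8_formula(x)))
print(max_err < 1e-9)
```

The folded input ordering is what makes the left and right halves of each row equal up to sign.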
Since a programmable DSP processor has a fixed number of arithmetic units, the fewer
operations an algorithm needs, the shorter the computation time and the higher the
performance of the DCT implementation. A fast algorithm with a small number of
operations should therefore be chosen for implementation on the programmable DSP
processor.
Many fast algorithms aim to reduce the total number of operations. The best known
are Lee's [7] and Hou's [8] algorithms. Both reduce the 8-point DCT to 12
multiplications and 29 additions, a large reduction from the original 64 (8^2)
multiplications and 56 (8 x 7) additions. However, these two algorithms were not
chosen for the DCT implementation in this processor, because the accuracy of the DCT
algorithm must also be considered.
The main error in the DCT comes from truncation after multiplication, because the bit
length of the data grows with each multiplication and the result must be truncated
to match the width of the data bus. Truncation makes a value differ from its exact
value, so if a value is multiplied several times in succession, the result can
differ considerably from the exact value. A fast algorithm with fewer multiplication
stages on each data path should therefore be chosen.
After comparing several fast algorithms, the one proposed by Jeong et al. [40] was
chosen; its signal flow diagram is shown in Figure 3.5. This fast algorithm requires
14 multiplications and 29 additions, with at most 2 multiplications on any data
path. It can therefore provide better accuracy than Lee's or Hou's algorithm in a
fixed-width system.
C0=cos(6π/16)/cos(2π/16), C1=1/cos(2π/16), C2=cos(4π/16)/cos(2π/16),
C3=L/^Y, C4=cos(4π/16), C5=cos(2π/16),
C6=cos(2π/16)/2cos(5π/16), C7=cos(2π/16)/2cos(3π/16), C8=cos(2π/16)/2cos(π/16),
C9=cos(2π/16)/2cos(7π/16)
Figure 3.5 - Signal flow diagram of the Jeong's 1D DCT fast algorithm
The detailed architecture of this programmable DSP processor and the
implementation of the 1D DCT will be discussed in chapter 5.
3.4.2 Implementation on Dedicated Processor 
For the dedicated implementation of the 1D DCT, the fast algorithms mentioned in
the previous part are not suitable, because most fast algorithms have a data flow
similar to the one shown in Figure 3.5. The data flow of such fast algorithms is
usually quite complex in the last stage. This is a disadvantage for an asynchronous
implementation, as it introduces a large handshake overhead and degrades the
performance of the processor. One solution is to use a dedicated multiplier or adder
for each multiplication and addition. However, this costs a lot of silicon area and
is therefore not practical.
For the asynchronous circuit, the dataflow should be as simple as possible. This 
helps to reduce the handshaking overhead and hence the performance can be 
enhanced. Therefore a semi-direct method is used in this dedicated DCT processor. 
This semi-direct method is obtained by decomposing the 8x8 matrix multiplication 
into two 4x4 matrix multiplications. As a result, Equation 3.3 can be decomposed 
into 2 equations as shown in Equation 3.5 and Equation 3.6. 
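$$
\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
A & A & A & A \\
B & C & -C & -B \\
A & -A & -A & A \\
C & -B & B & -C
\end{bmatrix}
\begin{bmatrix} X_0 + X_7 \\ X_1 + X_6 \\ X_2 + X_5 \\ X_3 + X_4 \end{bmatrix}
$$
- Equation 3.5

$$
\begin{bmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
D & E & F & G \\
E & -G & -D & -F \\
F & -D & G & E \\
G & -F & E & -D
\end{bmatrix}
\begin{bmatrix} X_0 - X_7 \\ X_1 - X_6 \\ X_2 - X_5 \\ X_3 - X_4 \end{bmatrix}
$$
- Equation 3.6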
and similarly the IDCT can be decomposed into Equation 3.7 and Equation 3.8. 
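$$
\begin{bmatrix} X_0 \\ X_1 \\ X_2 \\ X_3 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
A & B & A & C \\
A & C & -A & -B \\
A & -C & -A & B \\
A & -B & A & -C
\end{bmatrix}
\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix}
D & E & F & G \\
E & -G & -D & -F \\
F & -D & G & E \\
G & -F & E & -D
\end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{bmatrix}
$$
- Equation 3.7

$$
\begin{bmatrix} X_7 \\ X_6 \\ X_5 \\ X_4 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix}
A & B & A & C \\
A & C & -A & -B \\
A & -C & -A & B \\
A & -B & A & -C
\end{bmatrix}
\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix}
- \frac{1}{2}
\begin{bmatrix}
D & E & F & G \\
E & -G & -D & -F \\
F & -D & G & E \\
G & -F & E & -D
\end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{bmatrix}
$$
- Equation 3.8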
This method has been used in many other DCT implementations [32] [33]. The
semi-direct method has several advantages. First, the number of multiplications
and additions is reduced to half of the original number. Second, it involves only
one multiplication in each data path and thus requires fewer bits to represent the
data. Furthermore, the dataflow is a simple multiply-and-add, which favours the
asynchronous implementation. Finally, the structures of the DCT and IDCT are
similar, so it is easier to implement both on the same hardware with this method.
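The decomposition can be checked numerically against the direct 8-point DCT. The sketch below is illustrative only. It assumes the usual orthonormal DCT scaling and the coefficient values A = cos(π/4), B = cos(π/8), C = cos(3π/8), D = cos(π/16), E = cos(3π/16), F = cos(5π/16), G = cos(7π/16); the thesis defines A to G with Equation 3.3, which is outside this passage, so these values and all function names are assumptions.

```python
import math

# Assumed coefficient values (see the note above).
A = math.cos(math.pi / 4)
B, C = math.cos(math.pi / 8), math.cos(3 * math.pi / 8)
D, E, F, G = (math.cos(k * math.pi / 16) for k in (1, 3, 5, 7))

EVEN = [[A, A, A, A], [B, C, -C, -B], [A, -A, -A, A], [C, -B, B, -C]]
ODD = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]


def matvec(matrix, vector):
    """4x4 matrix times 4-element vector: 16 multiply-and-add operations."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dct8_semi_direct(x):
    """Two 4x4 products on sums and differences of mirrored inputs."""
    sums = [x[n] + x[7 - n] for n in range(4)]
    diffs = [x[n] - x[7 - n] for n in range(4)]
    y = [0.0] * 8
    y[0::2] = [0.5 * v for v in matvec(EVEN, sums)]    # Y0, Y2, Y4, Y6
    y[1::2] = [0.5 * v for v in matvec(ODD, diffs)]    # Y1, Y3, Y5, Y7
    return y


def idct8_semi_direct(y):
    """Inverse transform: transposed even matrix, same (symmetric) odd matrix."""
    even_t = [list(col) for col in zip(*EVEN)]
    e = [0.5 * v for v in matvec(even_t, y[0::2])]
    o = [0.5 * v for v in matvec(ODD, y[1::2])]
    x = [0.0] * 8
    for n in range(4):
        x[n] = e[n] + o[n]
        x[7 - n] = e[n] - o[n]
    return x


def dct8_direct(x):
    """Reference: direct 8x8 matrix form, 64 multiplications."""
    out = []
    for k in range(8):
        ck = math.sqrt(0.5) if k == 0 else 1.0
        acc = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))
        out.append(0.5 * ck * acc)
    return out


samples = [10, 20, 30, 40, 50, 60, 70, 80]
print(dct8_semi_direct(samples))
print(dct8_direct(samples))
```

Each 4x4 product costs 16 multiplications, so the two products together need 32, half of the 64 used by the direct form, and every output passes through a single multiplication.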
For these reasons, a 1D DCT core processor is constructed using this semi-direct
method and is used in the dedicated 2D DCT processor. This core can perform both
the DCT and the IDCT, and can be cascaded to perform the 2D DCT. The detailed
architecture and the implementation of the DCT algorithm will be discussed in
chapter 6.
Chapter 4 
New Techniques for Operating Dynamic Logic in Low 
Frequency 
4.1 Overview 
Dynamic logic has some advantages over static logic, including higher speed and a
more compact size. Moreover, it is suitable for use in asynchronous circuit design,
as mentioned in chapter 2. However, dynamic logic is not widely used, as it suffers
from two main problems: the racing problem [41], and the charge redistribution and
leakage problem [41] [42] [43]. The racing problem can be avoided by a proper
arrangement of the logic cells. The charge redistribution and leakage problem,
however, cannot be overcome so simply, as it is caused by the internal structure of
the logic. It gives dynamic logic a poor noise margin and a lower bound on its
operating frequency.
In this chapter, the charge redistribution and leakage problem and its traditional
solution are discussed. Afterwards, a new technique to overcome this problem is
introduced.
4.2 Background 
[Diagram labels: (a) transistors M1 to M3, CLK, OUT, floating node, leakage current, C_out; (b) floating-node voltage against time over the precharge and evaluation phases]
Figure 4.1 - (a) 3-input NAND dynamic logic, (b) voltage in the floating node of the dynamic
logic
The output value of a dynamic logic gate depends on the charge stored at the floating
node. Consider the 3-input NAND dynamic logic shown in Figure 4.1(a). During
the Precharge phase, the output is kept high, as the pMOS transistor is turned on and
current flows from VDD to the floating node. During the Evaluation phase, unless all
the nMOS transistors are turned on so that a pull-down path is created, the charge
kept in the parasitic capacitor C_out at the floating node holds the output high.
Otherwise, the output becomes low as the stored charge flows out of the floating
node through the pull-down path.
Dynamic logic has several advantages over static logic. First, dynamic logic is more
compact, as the complementary pMOS transistor tree is replaced by a single pMOS
transistor. Second, it can operate faster, as the output only needs to be selectively
discharged during the Evaluation phase, and charging is faster as there is only one
pMOS transistor. An additional advantage for asynchronous circuits is the temporary
storage of data provided by the charge stored on the parasitic capacitor.
However, dynamic logic suffers from the charge redistribution and leakage problem.
As mentioned previously, the output of the dynamic logic depends on the charge
stored at the floating node. In theory, if no pull-down path exists, the output
should stay at logic high during the Evaluation phase. In practice, the output
voltage drops continuously with time, as shown in Figure 4.1(b). This drop is caused
by charge redistribution and charge leakage.
The charge redistribution problem can be explained using Figure 4.1(a). Suppose that
during the Evaluation phase the nMOS transistors M1 and M2 are turned on while
M3 is turned off. There is no pull-down path to ground, so the output should stay
high. However, since M1 and M2 are turned on, two more capacitors, C1 and C2, are
connected to the floating node and share the charge stored there. This is called
charge redistribution. As a result, the output voltage drops and the noise margin of
the dynamic logic is degraded. If C1 and C2 are large and a large amount of charge
flows from the floating node into them, the voltage at the floating node may drop
below the switching threshold of the next stage and cause a logic error.
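The size of this voltage drop follows from charge conservation. The sketch below is a first-order estimate only: it assumes the internal capacitors start fully discharged and ignores the threshold drop across the nMOS transistors, and the supply voltage and capacitance values are hypothetical, not taken from the thesis.

```python
def shared_voltage(vdd, c_out, internal_caps):
    """Floating-node voltage after C_out shares its charge with discharged capacitors.

    The charge C_out * VDD is conserved and spreads over C_out plus the
    internal capacitors that become connected to the floating node.
    """
    return vdd * c_out / (c_out + sum(internal_caps))


VDD = 3.3          # volts, hypothetical supply
C_OUT = 20e-15     # farads, hypothetical floating-node capacitance
C1 = C2 = 5e-15    # farads, hypothetical internal node capacitances

v_after = shared_voltage(VDD, C_OUT, [C1, C2])
print(f"floating node after charge sharing: {v_after:.2f} V")
```

With these example values the node falls to two thirds of VDD. Whether that causes a logic error depends on the switching threshold of the next stage.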
Furthermore, charge also leaks out of the parasitic capacitor through the leakage
current [45] [47]. If the Evaluation phase is sufficiently long, the diminishing
charge will eventually induce a logic error at the output. The duration of the
Evaluation phase must therefore be kept short to prevent logic errors at the output.
This prevents dynamic logic from operating at low frequency.
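The lower frequency bound can be estimated with a constant-current model of the leakage. This is a rough sketch, not a result from the thesis: the capacitance, voltages and leakage current are hypothetical, the leakage is assumed constant over the voltage range, and a 50% duty cycle is assumed so that the Evaluation phase lasts half a clock period.

```python
def hold_time(c_out, v_precharge, v_min, i_leak):
    """Time for a constant leakage current to pull the node from v_precharge to v_min."""
    return c_out * (v_precharge - v_min) / i_leak


def min_frequency(t_hold, duty=0.5):
    """Lowest clock frequency whose Evaluation phase (duty * period) fits in t_hold."""
    return duty / t_hold


t_hold = hold_time(c_out=10e-15, v_precharge=3.3, v_min=1.8, i_leak=5e-12)
print(f"hold time: {t_hold * 1e3:.1f} ms")
print(f"minimum clock frequency: {min_frequency(t_hold):.0f} Hz")
```

The example values were picked so that the hold time lands in the millisecond range, which is the order of magnitude quoted later in this chapter.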
4.3 Traditional Technique 
[Diagram labels: nMOS logic block, CLK, OUT, charging current through the added pull-up pMOS]
Figure 4.2 - Addition of the pull-up path in (a) dynamic logic, (b) domino logic
The traditional method [42] [43] of overcoming the charge redistribution and
charge leakage problem is to add a small pull-up pMOS at the floating node.
Figure 4.2 shows this method applied to basic dynamic logic and domino logic. The
additional pull-up pMOS directly solves both problems, as it allows current to flow
to the floating node during the Evaluation phase, so the charge stored in the
parasitic capacitor is maintained or refilled. Due to its simplicity, this method is
used in most dynamic logic.
However, this method has the drawback of speed degradation. During the Evaluation
phase, if a pull-down path is created by the nMOS logic block, the discharging
current has to fight against the charging current from the additional pMOS. The
overall discharging current is therefore decreased and the evaluation time is
increased. A smaller charging current can be obtained with a smaller pMOS
transistor, but the transistor size is limited by the technology used and can only
be reduced to a certain extent. In a poor design, the discharging current may even
be weaker than the charging current, in which case the logic block will not operate
correctly and an error will occur. These problems arise because there is limited or
no control over the charging current from the additional pMOS.
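The contention between the two currents can be put into numbers with a constant-current approximation. The sketch below is illustrative only: the currents, capacitance and voltage swing are hypothetical, and real transistor currents vary with the node voltage.

```python
def evaluation_time(c_out, delta_v, i_pulldown, i_pullup=0.0):
    """Time to discharge the floating node by delta_v against an opposing pull-up current."""
    net_current = i_pulldown - i_pullup
    if net_current <= 0:
        return float("inf")    # the pull-up wins: the node never discharges
    return c_out * delta_v / net_current


C_OUT = 20e-15      # farads, hypothetical
SWING = 1.65        # volts to reach the next stage's switching threshold, hypothetical
I_DOWN = 100e-6     # amperes through the nMOS logic block, hypothetical

t_plain = evaluation_time(C_OUT, SWING, I_DOWN)
t_pullup = evaluation_time(C_OUT, SWING, I_DOWN, i_pullup=20e-6)
print(f"without pull-up: {t_plain * 1e12:.0f} ps")
print(f"with pull-up:    {t_pullup * 1e12:.0f} ps")
```

In this example a pull-up current equal to 20% of the pull-down current lengthens the evaluation by 25%. Once the pull-up current reaches the pull-down current, the node never discharges, which corresponds to the failure case described above.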
4.4 New Technique - Refresh Control Circuit
For charge redistribution, several techniques [42] exist to overcome the problem,
and some of them are shown in Figure 4.3. Also, in practical designs, dynamic logic
with a large nMOS logic block is avoided, as its weak discharge current gives poor
performance. Such a logic cell is usually broken into two or more simpler logic
cells with fewer transistors in the nMOS logic block. With these techniques, the
charge redistribution problem can be minimized. The new technique therefore deals
only with the charge leakage problem.
Figure 4.3 - Techniques to overcome the charge redistribution problem
To solve the charge leakage problem, a pull-up path at the floating node seems to be
necessary. However, the continuous flow of current from VDD to the floating node
through the additional pull-up path causes the speed degradation. If the amount of
current through the pull-up path is controlled, the speed degradation can be
minimized. This is the aim of the new technique.
4.4.1 Principle 
The idea of the new technique comes from the refresh technique used in Dynamic
Random Access Memory (DRAM) [44] [45]. The core circuit of the new technique
is called the Refresh Control Circuit (RCC), and it monitors the voltage of the
floating node in the dynamic logic. When the floating node voltage falls to the
pre-determined minimum voltage, Vref, a pull-up path is created at the floating node
of each dynamic logic cell to refill its charge. This process is called Refresh.
Since the pull-up path is not present all the time, this technique causes less speed
degradation than the traditional method. Furthermore, it is self-timed and
self-operating, and needs no extra control from the user. Figure 4.4 shows the
modified structure of the dynamic and domino logic.
[Diagram labels: nMOS logic block, CLK, OUT, charging current, Refresh input of the pull-up pMOS (controlled by RCC)]
Figure 4.4 - Modified structure of the (a) dynamic logic, (b) domino logic
4.4.2 Voltage Sensor 
First, a voltage sensor is used to detect the voltage of the floating node and
compare it with Vref. Not every logic cell in the system is connected to the voltage
sensor. Instead, a single dummy circuit, modelled on the dynamic logic structure
with the worst leakage, represents all the logic cells in the circuit, and only this
dummy cell is connected to the voltage sensor, as shown in Figure 4.5.
[Diagram labels: nMOS logic block, CLK, OUT, Refresh Signal, Dummy Dynamic Cell, Voltage Sensor, Vref]
Figure 4.5 - Proposed refresh structure for (a) dynamic logic, (b) domino logic
The voltage sensor consists of two stages: the first stage is a differential
amplifier and the second stage is a two-stage sense amplifier. Their structures are
shown in Figure 4.6. The first stage compares the voltage of the floating node with
Vref and amplifies the difference. The inputs of the second stage are connected to
the outputs of the first stage to provide a more accurate comparison. If the voltage
at the floating node becomes smaller than Vref, the second stage generates the
refresh request signal to indicate to the dynamic logic that a refresh is needed.
[Diagram labels: Vin1, Vin2, enable, Refresh Request Signal]
Figure 4.6 - Voltage sensor, (a) differential amplifier as the first stage with reference voltage
generator, (b) two-stage sense amplifiers as the second stage
The two-stage voltage sensor provides good detection. However, it consumes a lot of
power, as a current always flows from VDD to GND, so it should not be operating all
the time. As the sense amplifiers are only used to determine the time for refresh,
a timer can record the time taken until the first refresh. Afterwards, the sense
amplifier can be turned off and the signal from the timer used to indicate when to
refresh. In practice, a ring oscillator, a counter and a latch together can form a
timer.
4.4.3 Ring Oscillator 
A ring oscillator is constructed by connecting an odd number of inverters in a loop
with feedback, as shown in Figure 4.7.
Figure 4.7 - Ring oscillator
The period (or frequency) of a ring oscillator is controlled by the size and number
of its inverters. The size of an inverter means the width-to-length (W/L) ratio of
its pMOS and nMOS transistors. In general, the smaller the W/L ratio, the longer
the period. This is because the charging and discharging currents, shown in
Figure 4.8, are smaller in small pMOS and nMOS transistors, so more time is needed
to charge or discharge the input capacitance of the next inverter. Furthermore, the
more inverters are used, the longer the period, as a longer delay is created in the
feedback path.
Figure 4.8 - Charging and discharging currents in the inverter chain
Normally, the time for a logic error to occur at the floating node due to charge
leakage is in the order of milliseconds (10^-3 second) [47]. If a ten-bit counter
is used to count the refresh time, the ring oscillator needs a period in the order
of microseconds (10^-6 second). However, the period of an ordinary ring oscillator
is very short (several nanoseconds, 10^-9 second) even when the smallest inverters
are used. Increasing the number of inverters increases the period, but this is not
practical: the difference between the delay of an inverter and the required period
is so large that thousands of inverters may be needed to achieve the required
oscillation period, which makes the oscillator very large. The large number of
inverters also consumes a large amount of power. Instead of extra inverters, delay
elements are used. A delay element is added between each pair of inverters to create
a larger delay in the feedback path. Figure 4.9 shows the ring oscillator with the
delay elements.
Figure 4.9 - Ring oscillator with delay elements
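The estimate that a plain inverter ring would need thousands of stages can be reproduced with the textbook first-order relation T = 2 x N x t_d for a ring of N identical inverters with delay t_d each. The sketch below is illustrative: the 0.5 ns stage delay is a hypothetical figure, not a measurement from the thesis, and times are kept as integer picoseconds to avoid rounding issues.

```python
def ring_period(n_stages, stage_delay):
    """First-order period of a ring of identical inverters: T = 2 * N * t_d."""
    if n_stages % 2 == 0:
        raise ValueError("a ring oscillator needs an odd number of inverters")
    return 2 * n_stages * stage_delay


def stages_for_period(target_period, stage_delay):
    """Smallest odd number of inverters whose period reaches target_period."""
    n = -(-target_period // (2 * stage_delay))   # ceiling division on integers
    return n if n % 2 == 1 else n + 1


STAGE_DELAY_PS = 500          # hypothetical inverter delay, 0.5 ns
ONE_MICROSECOND_PS = 1_000_000

n = stages_for_period(ONE_MICROSECOND_PS, STAGE_DELAY_PS)
print(n, "inverters give a period of", ring_period(n, STAGE_DELAY_PS), "ps")
```

Under this assumed stage delay, even a 1 microsecond period needs about a thousand inverters, and longer periods scale the count proportionally, which is why slower delay elements are preferred.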
There are many types of delay element. A common one is the transmission gate, but
it cannot achieve a long delay. Following the comparison by Mahapatra et al. [46],
the transmission gate with Schmitt trigger [46] is chosen as the delay element in
the ring oscillator, as it produces a longer delay. Its CMOS structure is shown in
Figure 4.10.
Figure 4.10 - Delay element, transmission gate with Schmitt trigger
[Diagram labels: Vcontrol, charging current, TG with Schmitt Trigger]
Figure 4.11 - (a) a voltage controlled inverter, (b) part of the voltage controlled ring oscillator
To increase the delay further, the charging and discharging currents (Figure 4.8)
must be minimized. As mentioned earlier in this section, the smaller the current,
the longer the charging/discharging time and thus the longer the period. The current
can be reduced by adding small transistors in the VDD and GND paths, as shown in
Figure 4.11(a). By applying a control voltage close to the threshold voltage to the
added transistors, the charging and discharging currents can be set to a very small
value, as both currents are limited by the added transistors. The method of
providing the control voltage is shown in Figure 4.11(b). As a result, a frequency
of 38.5 kHz (a period of 26 µs) is achieved in this ring oscillator.
4.4.4 Counter, Latch and Comparator 
A counter is connected to the ring oscillator to count its periods. As mentioned
before, the time for a logic error to occur at the floating node due to charge
leakage is in the order of milliseconds. The dynamic logic should therefore be
refreshed every few milliseconds or tens of milliseconds. This constraint means that
the timer must be able to record times in the order of milliseconds.
As the ring oscillator runs at 38.5 kHz, a ten-bit counter is sufficient for
recording the time, as
2^10 × 26 µs ≈ 26.6 ms
The latch records the number of clock periods that elapse before the first refresh
is required. The input of the latch is connected to the output of the counter. When
the first refresh is required, the refresh request signal from the voltage sensor
triggers the latch, which records the value of the counter. This value indicates the
number of clock periods between refreshes. Afterwards, the voltage sensor can be
turned off, and the refresh process is controlled by the comparator. The comparator
continuously compares the output values of the counter and the latch. When the two
values are equal, the dynamic logic has reached the time for a refresh, and the
comparator sends out a signal to request one.
4.4.5 Recalibrate Circuit 
The leakage current depends strongly on temperature [44]. The higher the
temperature, the larger the leakage current flowing out of the floating node. As a
result, the time between refreshes varies with temperature. In practice the
temperature of the chip may vary with time, so the circuit should have a recalibrate
function that re-measures the refresh time after a certain period.
The recalibrate circuit is a five-bit counter, the refresh counter. It counts the
number of refresh processes that have taken place and switches the voltage sensor
on and off. Initially the refresh counter starts from zero, and the voltage sensor
is enabled. After the first refresh, the refresh counter is incremented to one and
the voltage sensor is disabled. After 2^5 - 1 more refresh processes, the refresh
counter wraps back to zero and the voltage sensor is enabled again. The latch then
records a new counter value when triggered by the refresh request signal from the
voltage sensor, which completes the recalibration.
4.4.6 Operation Monitoring Circuit 
When the actual system is operating, i.e. there is a transition of the clock signal
in a synchronous circuit or a request signal in an asynchronous circuit, the voltage
sensor does not need to monitor the floating node voltage, as the charge in the
floating node is restored during the normal Precharge phase. In this situation the
voltage sensor need not be turned on, and power can be saved. The operation
monitoring circuit therefore detects when the system is operating, and disables the
voltage sensor and resets the counter if necessary.
4.4.7 Overall Circuit 
By combining all the necessary units, the Refresh Control Circuit is formed as shown 
in Figure 4.12. 
[Diagram labels: Timer (oscillator, 10-bit counter, latch, "Matched" comparison), Dummy Dynamic Cell, Internal Node Voltage, Voltage Sensor (2-stage Sense Amplifier), Operation Monitoring Circuit, Refresh Counter (5 bit), Enable / Disable, Refresh Request Signal, Refresh Signal to Actual Circuit]
Figure 4.12 - Block diagram of the Refresh Control Circuit
[Timing diagram traces: Ring Oscillator; Voltage at floating node; Refresh request signal by Voltage Sensor; Voltage Sensor ON/OFF; Latch (Period 1, Period 2); Refresh request signal by Comparator; Recalibrate signal]
Figure 4.13 - Timing diagram of the Refresh Control Circuit
The timing diagram of the Refresh Control Circuit is shown in Figure 4.13. Initially
the voltage sensor is enabled, and the voltage of the floating node of the dummy
dynamic logic cell decreases with time. When the voltage reaches the pre-defined
minimum level, the voltage sensor immediately generates the refresh request. This
signal first triggers the latch to record the value of the counter. The refresh
request signal is then passed on to reset the counter, increment the refresh
counter, and refresh both the dummy dynamic cell and the actual circuit.
Once the refresh counter has been incremented, the voltage sensor is disabled. The
timer is now enabled and generates the refresh signal by comparing the values of
the counter and the latch. This continues until the refresh counter returns to zero,
at which point a recalibration is required and the voltage sensor is enabled again.
The whole process then repeats.
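The sequence described above can be summarised by a small behavioural model. This is a software sketch only, not the circuit: one call to `tick` stands for one ring-oscillator period, the dummy cell is reduced to a fixed number of periods before it reaches Vref, the counter is assumed to saturate rather than wrap, and the operation monitoring circuit is left out. The class and function names are hypothetical.

```python
class RefreshControl:
    """Behavioural model of the counter, latch, comparator and refresh counter."""

    def __init__(self, counter_bits=10, refresh_counter_bits=5):
        self.counter_max = (1 << counter_bits) - 1
        self.refresh_modulus = 1 << refresh_counter_bits
        self.counter = 0           # counts ring-oscillator periods
        self.latch = None          # interval captured during calibration
        self.refresh_count = 0     # five-bit refresh counter

    @property
    def sensor_enabled(self):
        # The voltage sensor is on only while the refresh counter is zero.
        return self.refresh_count == 0

    def tick(self, sensor_below_vref=False):
        """Advance one oscillator period; return True if a refresh is issued."""
        self.counter = min(self.counter + 1, self.counter_max)
        if self.sensor_enabled:
            fire = sensor_below_vref
            if fire:
                self.latch = self.counter          # record the measured interval
        else:
            fire = self.counter == self.latch      # comparator match
        if fire:
            self.counter = 0
            self.refresh_count = (self.refresh_count + 1) % self.refresh_modulus
        return fire


def simulate(rcc, leak_ticks, n_ticks):
    """Run n_ticks periods; the dummy cell reaches Vref leak_ticks after a refresh."""
    since_refresh = 0
    events = []
    for t in range(1, n_ticks + 1):
        since_refresh += 1
        if rcc.tick(sensor_below_vref=since_refresh >= leak_ticks):
            events.append(t)
            since_refresh = 0
    return events


rcc = RefreshControl()
print(simulate(rcc, leak_ticks=300, n_ticks=1000))
```

In this model the first refresh is triggered by the sensor and fixes the interval, the next 31 refreshes come from the comparator, and the 32nd refresh wraps the refresh counter to zero so that the sensor measures the interval again.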
The performance of the Refresh Control Circuit will be shown in chapter 7.
Multipliers are also constructed to test and compare the performance of this new
technique with the traditional technique; those results will be shown in chapter 7
as well.
As the Refresh Control Circuit is still at the schematic design level, this
technique is not applied in the implementation of the programmable DSP processor
or the dedicated DCT/IDCT processor.
Chapter 5 
DCT Implementation on Programmable DSP 
Processor 
5.1 Overview 
As the number of transistors on a chip increases, it becomes attractive to build
systems in the asynchronous style, which offers the benefits of no clock skew,
lower power consumption and low electromagnetic noise. Several asynchronous
processors have been built [11][48][49][50], and the AMULET3 [50] has been used
commercially. This indicates that asynchronous designs are a plausible alternative
to synchronous designs.
In this chapter, a pipelined dataflow [10] [51] micro-coded asynchronous DSP
processor is discussed. The architecture of this DSP processor was developed by our
research group; I have made some modifications to it and am responsible for the DCT
implementation and the layout generation of the whole processor. The programming
technique and the implementation of the DCT are given at the end of this chapter.
5.2 Processor Architecture 
The design of this DSP processor follows the dataflow architecture; in other words,
it is a data-driven system. The dataflow architecture fits asynchronous design
naturally. The combined architecture allows data to be sent into the system
continuously without external control or a clock, and the presence of data
automatically triggers the operation of the asynchronous system.
To realize the pipelined dataflow architecture in an asynchronous system, a
pipelined processor is developed. The target of this processor is to implement
simple DSP operations such as the Infinite Impulse Response (IIR) filter, the Fast
Fourier Transform (FFT) and the DCT, where addition, subtraction and multiplication
by a constant are required.
[Diagram labels: Memory 1, Memory 2, Instruction Input, Instruction Memory, Data Input, Data Output, Switching Network, arithmetic units]
Figure 5.1 - Dataflow architecture of the programmable DSP processor
The architecture of the processor with the necessary functional blocks is shown in
Figure 5.1. The processor includes an adder, a subtracter, a multiplier, two FIFO
memories, a switching network, and an instruction memory. In the following
sections, each part of the processor will be discussed.
5.2.1 Arithmetic Unit 
In this processor, the adder, subtracter and multiplier are all pipelined and
designed in the DCVSL structure to maximize performance. The multiplier is based on
the bit-parallel architecture, in which the multiplier core is built from an array
of Product Full Adders (PFA), shown in Figure 5.2. Each PFA carries out four
functions, which are given by
$A_{out} = A_{in}$ - Equation 5.1
$B_{out} = B_{in}$ - Equation 5.2
$P_{out} = (A_{in} \cdot B_{in}) \oplus (P_{in} \oplus C_{in})$ - Equation 5.3
$C_{out} = A_{in} \cdot B_{in} \cdot C_{in} + P_{in} \cdot (A_{in} \cdot B_{in} + C_{in})$ - Equation 5.4
[Diagram labels: inputs Ain, Bin, Pin, Cin; outputs Aout, Bout, Pout, Cout]
Figure 5.2 - Product Full Adder (PFA) of the multiplier
All the signals in the PFA have their own handshake signals, except that B shares
the handshake signal of P, as they propagate in the same direction. The overall
structure of the multiplier core is shown in Figure 5.3.
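Equations 5.1 to 5.4 can be checked with a bit-level software model. The sketch below is a plain sequential model written for illustration: it evaluates the cells row by row with a rippling carry, and does not reproduce the handshaking or the evaluation order of the asynchronous array. It handles unsigned magnitudes only, and the function names are hypothetical.

```python
def pfa(a_in, b_in, p_in, c_in):
    """Product Full Adder cell following Equations 5.1 to 5.4 (all values are single bits)."""
    x = a_in & b_in                             # one-bit partial product
    p_out = x ^ (p_in ^ c_in)                   # Equation 5.3
    c_out = (x & c_in) | (p_in & (x | c_in))    # Equation 5.4
    return a_in, b_in, p_out, c_out             # A and B pass through unchanged


def array_multiply(a, b, n_bits=8):
    """Unsigned n x n multiplication built only from PFA cells."""
    p = [0] * (2 * n_bits)                      # running partial product, LSB first
    for i in range(n_bits):                     # row i uses bit i of a
        a_bit = (a >> i) & 1
        carry = 0
        for j in range(n_bits):                 # column j uses bit j of b
            b_bit = (b >> j) & 1
            _, _, p[i + j], carry = pfa(a_bit, b_bit, p[i + j], carry)
        p[i + n_bits] = carry                   # carry out of the row
    return sum(bit << k for k, bit in enumerate(p))


print(array_multiply(200, 123), 200 * 123)
```

Each cell behaves as a full adder whose first operand is the AND of its A and B inputs, so an 8x8 array of 64 cells accumulates all partial products.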
[Diagram labels: inputs B7 to B0, Buffer, Adder Network, sign out, outputs P15 to P8]
Figure 5.3 - The 8x8 multiplier core
In Figure 5.3, A and B are the inputs while P is the product of A and B. The number
after each input and output name gives the bit position, where bit 0 is the least
significant bit (LSB). Since the data format of this processor is a 1-bit sign with
an 8-bit magnitude, the sign bit of the final product is simply the XOR of the sign
bits of the two inputs. Buffers are therefore added in the multiplier core so that
the sign bits of both inputs are shifted to the bottom-right block, where the XOR
operation is performed and the sign bit of the final product is obtained.
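The sign handling can be sketched in a few lines. The word layout used here (sign in the bit above the magnitude) and the function names are assumptions made for illustration, and the sketch keeps the full 16-bit magnitude of the product, whereas the hardware may keep only part of it.

```python
MAG_BITS = 8                       # 8-bit magnitude, as stated in the text
MAG_MASK = (1 << MAG_BITS) - 1


def encode(value):
    """Pack an integer into a sign-magnitude word (assumed layout: sign bit on top)."""
    magnitude = abs(value)
    if magnitude > MAG_MASK:
        raise ValueError("magnitude does not fit in 8 bits")
    return ((1 if value < 0 else 0) << MAG_BITS) | magnitude


def decode(word, mag_bits=MAG_BITS):
    """Unpack a sign-magnitude word into an integer."""
    magnitude = word & ((1 << mag_bits) - 1)
    return -magnitude if (word >> mag_bits) & 1 else magnitude


def multiply(word_a, word_b):
    """Sign is the XOR of the input signs; magnitudes multiply as unsigned numbers."""
    sign = ((word_a >> MAG_BITS) ^ (word_b >> MAG_BITS)) & 1
    magnitude = (word_a & MAG_MASK) * (word_b & MAG_MASK)
    return (sign << (2 * MAG_BITS)) | magnitude


product = multiply(encode(-25), encode(12))
print(decode(product, mag_bits=2 * MAG_BITS))
```

Because the sign is separate from the magnitude, the multiplier core itself only ever handles unsigned 8-bit operands.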
Unlike the synchronous version, the asynchronous bit-parallel multiplier requires
the different bits of the inputs to arrive at different times. This is because,
within the multiplier core, a PFA can only start its operation when the results, C
and P, of the preceding PFAs are ready. In the current architecture, (A0, B7) is
calculated first, the next operations start at (A0, B6) and (A1, B7), and so on.
Because of this requirement, ladder-shaped input buffers are used at the two inputs
to schedule the arrival times of the different bits. The structure of the input
buffer is shown in Figure 5.4.
[Diagram labels: acknowledge signals ack_a0 to ack_a7; rows of Buffer Cells for inputs a0 to a7 and a8 (sign bit), arranged in a ladder shape]
Figure 5.4 - Input buffer of multiplier
Similarly, the different bits of the output P come out at different times, with
bit 0 coming out first in this configuration. A ladder-shaped output buffer is
therefore also required at the output side.
The adder is based on the Carry Look-ahead (CLA) architecture [52] [53]. This
architecture speeds up the computation by expressing the Sum and Carry of the
addition in terms of two new values, Propagate P and Generate G. The new formulae
are as follows:
$G_i = A_i \cdot B_i$ - Equation 5.5
$P_i = A_i \oplus B_i$ - Equation 5.6
$C_i = G_i + P_i \cdot C_{i-1}$ - Equation 5.7
$S_i = C_{i-1} \oplus P_i$ - Equation 5.8
By computing several Ps and Gs in parallel, the Sum and Carry of several bit 
locations can be obtained simultaneously. As a result, the addition can be carried out 
in a faster way. 
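Equations 5.5 to 5.8 can be exercised with a bit-level model. In the sketch below every carry is written directly in terms of the G and P values and the carry-in, which is the expanded form of Equation 5.7 that a look-ahead unit evaluates in parallel. The code itself runs sequentially and only checks the logic; the 8-bit width, the unsigned operands and the function name are assumptions.

```python
def cla_add(a, b, n_bits=8, carry_in=0):
    """Add two unsigned n-bit numbers using generate/propagate look-ahead carries."""
    a_bits = [(a >> i) & 1 for i in range(n_bits)]
    b_bits = [(b >> i) & 1 for i in range(n_bits)]
    g = [x & y for x, y in zip(a_bits, b_bits)]     # Equation 5.5
    p = [x ^ y for x, y in zip(a_bits, b_bits)]     # Equation 5.6

    carries = [carry_in]          # carries[i] is the carry into bit i, i.e. C_(i-1)
    for i in range(n_bits):
        # Expanded Equation 5.7:
        # C_i = G_i + P_i G_(i-1) + P_i P_(i-1) G_(i-2) + ... + P_i ... P_0 C_in
        carry = 0
        prefix = 1
        for j in range(i, -1, -1):
            carry |= prefix & g[j]
            prefix &= p[j]
        carry |= prefix & carry_in
        carries.append(carry)

    s = [carries[i] ^ p[i] for i in range(n_bits)]  # Equation 5.8
    total = sum(bit << i for i, bit in enumerate(s))
    return total, carries[n_bits]


print(cla_add(200, 100))
```

Since each expanded carry depends only on the inputs and not on a previously computed carry, hardware can evaluate the carries of a group of bits at the same time.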
In this processor, the subtracter is another Carry Look-ahead adder with one input
inverted, as A-B=A+(-B).
5.2.2 Switching Network 
In some designs, data transfer is done via a common data bus. However, a common bus
is difficult to implement in an asynchronous dataflow system, as it introduces a
large handshaking overhead and a long delay. For example, suppose a common data bus
is shared by one receiver and three transmitters. When data appears on the bus, the
handshake cell in the receiver must communicate with all three transmitters to find
out which one is the source. The time required is longer than a normal handshaking
time in a pipeline stage. If the number of receivers and transmitters is increased,
the time required for handshaking increases exponentially and the delay becomes
longer.
Instead of a common data bus, a multi-stage switching network is used to
connect the different units. A multi-stage switching network has several
advantages in an asynchronous system. Firstly, the network allows parallelism:
data from different inputs can be sent to different outputs simultaneously.
Secondly, the network is pipelined, resulting in a higher data transfer rate
through the network. Lastly, the switching network distributes the handshake
signals to the corresponding destination only, and thus the large handshake
overhead is avoided.
The basic component of the multi-stage switching network is a two-to-two
programmable switch cell. It supports six modes of connection, selected by a
3-bit instruction. The structure and the modes of connection are shown in Figure
5.5(a).
Figure 5.5 - (a) 2-to-2 programmable switch and its six modes of connection, (b) block diagram of
the internal structure of switch, (c) CMOS structure of basic multiplier cell of the MUX 1
The design of the switch cell follows the dataflow architecture. It uses the same
communication protocol and handshake cell as the one presented in the previous
chapter. The switch is basically built from an instruction decoder and two
two-to-one multiplexers. During operation, the instruction decoder receives the
instructions from the instruction memory and decodes them. It translates the 3-bit
instruction into a six-bit decoded word, which is shown in Table 5.1, and then
passes it to the two multiplexers. Each bit of the decoded word corresponds to one
mode of connection. This one-hot encoding allows a simpler multiplexer cell and
thus faster operation. The multiplexer is built as a sum-of-products structure in
domino style, as shown in Figure 5.5(c). It receives the decoded instruction and
detects the presence of the input data. Once the corresponding input data is
ready, the data is copied to the output of the multiplexer, which completes the
transfer.
Instruction   Function / Connection Mode    Decoded Word (COD)
000           in0->out0 / mode 0            000001
001           in1->out0 / mode 1            000010
010           in0->out1 / mode 2            000100
011           in1->out1 / mode 3            001000
110           in0->out0 & out1 / mode 4     010000
111           in1->out0 & out1 / mode 5     100000
Table 5.1 - Instructions of switch 
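The decoding in Table 5.1 can be summarised by a small behavioural model. The Python sketch below is only an illustration: the names `INSTRUCTION_TO_MODE`, `MODE_TO_ROUTES`, `decode` and `switch` are hypothetical, `None` stands for an output that is not driven, and handshaking is not modelled.

```python
# Table 5.1: 3-bit instruction -> connection mode (100 and 101 are not listed).
INSTRUCTION_TO_MODE = {0b000: 0, 0b001: 1, 0b010: 2, 0b011: 3, 0b110: 4, 0b111: 5}

# Connection mode -> (input index routed to out0, input index routed to out1).
MODE_TO_ROUTES = {
    0: (0, None), 1: (1, None),   # in0->out0, in1->out0
    2: (None, 0), 3: (None, 1),   # in0->out1, in1->out1
    4: (0, 0), 5: (1, 1),         # in0 or in1 copied to both outputs
}


def decode(instruction: int) -> int:
    """Return the 6-bit one-hot decoded word (COD) for a 3-bit instruction."""
    return 1 << INSTRUCTION_TO_MODE[instruction]


def switch(instruction: int, in0, in1):
    """Return (out0, out1) for the given instruction and input data."""
    mode = decode(instruction).bit_length() - 1   # position of the single set bit
    inputs = (in0, in1)
    return tuple(None if src is None else inputs[src] for src in MODE_TO_ROUTES[mode])
```

For instance, `switch(0b110, "a", "b")` copies input 0 to both outputs, which corresponds to mode 4.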
[Figure 5.6: network inputs, top to bottom: FIFO1 out1, ADD out, FIFO1 out2, INPUT, MUL out,
FIFO2 out2, SUB out, FIFO2 out1; network outputs, top to bottom: ADD in1, SUB in1, FIFO1,
MUL in1, OUTPUT, FIFO2, ADD in2, SUB in2]
Figure 5.6 - 8-to-8 switching network
In this programmable DSP processor, the switching network is a 3x4 matrix of
switch cells allowing eight-to-eight connections, as shown in Figure 5.6. The
positions of the inputs and outputs are tuned to allow maximal concurrency and
efficient resource assignment.
5.2.3 FIFO Memory 
There are two FIFO memories within the processor, which are responsible for
temporary data storage during operation. The structure of the FIFO memory is
shown in Figure 5.7. It is organized as four short FIFO sets (FIFO A to D, each
storing 4 data words) and one long FIFO set (FIFO E, storing 16 data words).
Demultiplexers, which have a structure similar to that of the switch, are placed
at the input and output of the FIFO memory to control the flow of data into and
out of the corresponding FIFO set.
Figure 5.7 - Structure of FIFO memory
The basic building unit of a FIFO set is the basic FIFO cell, which is shown in
Figure 2.13(b) in chapter 2. The basic FIFO cell captures the input data in the
Evaluation phase and retains it in the Hold phase. Connecting n FIFO memory cells
in parallel to one handshake cell forms a single n-bit FIFO memory stage, and
cascading several such stages forms a FIFO set. The input data queues inside the
FIFO set and is held there until the switching network is ready to accept data
from the FIFO block.
The input of the FIFO memory is connected to a three-stage demultiplexing network.
Inside the FIFO memory, the instruction and data are first merged into a single
word of the form [instruction] + [data]. When this word arrives at the input of a
demultiplexer, the most significant bit (MSB) of the instruction is extracted and
acts as the control signal for switching, and the remaining bits pass through
the demultiplexer. This combination of instruction and data reduces the
handshaking overhead in the demultiplexer, and thus a faster transfer speed can be
achieved inside the demultiplexing network. The three-stage demultiplexing network
also avoids the fan-out and large handshaking overhead problems that would occur
in a single one-to-five switch. In addition, data can be transferred to the long
FIFO set with a shorter latency, so that the data in the long FIFO set can be
reused sooner. A four-to-one multiplexer is used at the output of the FIFO block.
Its structure is similar to that of the switch cell described in the previous part.
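The way each demultiplexer stage consumes the MSB of the merged word can be sketched as follows. This Python fragment is a minimal illustration under stated assumptions: the tree layout in `DEMUX_TREE` (FIFO E reached after one stage, FIFO A to D after three) and the name `route` are hypothetical and are not taken from Figure 5.7. Only the rule "use the MSB as the control bit and pass on the remaining bits" comes from the text.

```python
# Hypothetical demultiplexer tree: a tuple is a one-to-two stage whose index
# (0 or 1) is selected by the control bit; a string is a FIFO set.
DEMUX_TREE = ("E", (("A", "B"), ("C", "D")))


def route(word: str):
    """Route an '[instruction][data]' bit string through the demultiplexer tree.

    Returns (FIFO set name, bits that arrive at that FIFO set).
    """
    node = DEMUX_TREE
    while not isinstance(node, str):
        msb, word = word[0], word[1:]   # the stage strips the MSB as its control bit
        node = node[int(msb)]
    return node, word
```

With this assumed layout, `route("0" + "10110011")` reaches FIFO E after a single stage, while `route("110" + "0001")` passes through all three stages to FIFO C.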
5.2.4 Instruction Memory 
The instructions make the processor programmable and able to perform different
operations. In this processor, the instructions control the connections of the
switches and multiplexers in the switching network and the FIFO memories, and
also supply the multiplicand for the multiplier. All instructions are stored in
the instruction memory.
Figure 5.8 - Instruction memory, (a) block diagram of the instruction memory, (b) the structure
of cyclic FIFO, (c) structure of the instruction decoding network
There are two main parts in the instruction memory, the instruction decoding
network and the cyclic FIFOs, as shown in Figure 5.8(a). An instruction has the
format [address] + [data]. After receiving the instruction, the instruction
decoding network, shown in Figure 5.8(c), decodes the address by demultiplexing,
in a way similar to the demultiplexing network in the FIFO memory, and sends the
data to the corresponding FIFO or to the multiplier.
The FIFO here has a cyclic feature, which is shown in Figure 5.8(b). Besides being
sent to its destination, each instruction leaving the FIFO is also fed back to the
FIFO input. This feature allows the instructions to be recycled, and thus the
application can be run repeatedly without further programming. During programming,
the switch at the input is connected to the instruction decoding network to
collect the instructions; afterwards it is switched to the other end to recycle
the instructions.
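The cyclic behaviour can be pictured with a small software model. The Python sketch below is illustrative only: the class `CyclicFIFO` and its method names are hypothetical, and the asynchronous handshaking of the real FIFO cells is not modelled.

```python
from collections import deque


class CyclicFIFO:
    """Behavioural model of one cyclic instruction FIFO."""

    def __init__(self):
        self._cells = deque()
        self.programming = True   # input switch connected to the decoding network

    def load(self, instruction):
        """Collect one instruction while the FIFO is being programmed."""
        if not self.programming:
            raise RuntimeError("input switch is in the recycle position")
        self._cells.append(instruction)

    def start(self):
        """Move the input switch to the feedback path."""
        self.programming = False

    def next_instruction(self):
        """Send the oldest instruction to its destination and feed it back."""
        instruction = self._cells.popleft()
        self._cells.append(instruction)
        return instruction
```

After three instructions are loaded and `start()` is called, repeated calls to `next_instruction()` return them in the same order again and again, which is why the application can run repeatedly without reprogramming.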
5.3 Programming 
In this dataflow processor, programming is simply the organization of the flow of
data. In other words, the switches are programmed to connect one unit to another.
For example, suppose there are two inputs A and B. To add A and B in this
processor, two cycles are required to send the inputs to the adder, and a third
cycle is needed to send the adder's output to the processor's output.
Step 1 : A (from input) -> Adder Input 1
Step 2 : B (from input) -> Adder Input 2
Step 3 : Adder Output -> Output
In the actual programming, the switches are programmed as follows,
For step 1 : sw2 -> mode 1, sw6 -> mode 0, sw9 -> mode 1
For step 2 : sw2 -> mode 3, sw8 -> mode 2, sw12 -> mode 0
For step 3 : sw1 -> mode 3, sw7 -> mode 0, sw11 -> mode 0
Therefore, an addition of two external inputs requires 3 cycles. However, if the
two inputs come from the internal FIFO memories 1 and 2, only 2 cycles are
required, because the two input paths do not share any switch (refer to the
switching network in Figure 5.6). The data from FIFO memories 1 and 2 can
therefore be sent to adder input 1 and input 2 respectively within the same cycle.
Similarly, data can be sent to different arithmetic units or FIFO memories in the
same cycle provided that their paths do not share a switch. A programme that fully
utilizes the parallelism of the switching network maximizes the concurrency of the
arithmetic operations, and thus achieves the greatest performance of the
processor.
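The rule that two transfers may share a cycle only if their paths use no common switch can be checked mechanically. The Python sketch below is a minimal illustration: the function name `can_share_cycle` is hypothetical, and the three paths use the switch settings of the addition example above as read from that listing. Only switch conflicts are checked. Data dependencies, such as step 3 needing the adder result, have to be respected separately by the programmer.

```python
def can_share_cycle(path_a: dict, path_b: dict) -> bool:
    """Return True if the two paths use no common switch."""
    return not (path_a.keys() & path_b.keys())


# Switch settings of the three-step addition example (switch name -> mode).
STEP_1 = {"sw2": 1, "sw6": 0, "sw9": 1}
STEP_2 = {"sw2": 3, "sw8": 2, "sw12": 0}
STEP_3 = {"sw1": 3, "sw7": 0, "sw11": 0}

# Steps 1 and 2 both need sw2 (in different modes), so they need separate cycles.
steps_1_and_2_parallel = can_share_cycle(STEP_1, STEP_2)
```

Here `steps_1_and_2_parallel` is `False`, which is why sending both external inputs to the adder takes two cycles.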
5.4 DCT Implementation 
As mentioned in chapter 3, the implementation of DCT in the processor is based on
the algorithm proposed by Jeong et al. [40]. The DCT programme can be divided
into four stages, which are shown in Figure 5.9 to Figure 5.12.
Figure 5.9 - Flow diagram of the first stage of DCT implementation
Figure 5.10 - Flow diagram of the second stage of DCT implementation
Figure 5.11 - Flow diagram of the third stage of DCT implementation
Figure 5.12 - Flow diagram of the fourth stage of DCT implementation
The flow diagrams show the data flow in the DCT algorithm. In the flow diagrams,
In denotes the input, and A, B, C, D and E denote the FIFO sets in the two FIFO
memories. add 1, add 2 and add O refer to input 1, input 2 and the output of the
adder respectively. The subtracter and the multiplier are labelled in the same
way. Owing to the parallelism and concurrency of the switching network, two or
more data are transferred simultaneously wherever possible in order to increase
the throughput of the switching network, so that more operations can be carried
out by the arithmetic units and the performance is increased. In addition, to
avoid data queuing, it is sometimes necessary to send data to the FIFO memories
for temporary storage.
The detailed steps of this programme (including the instructions for each switch)
are shown in the appendices, and the performance of the DCT implementation is
given in chapter 7.
Chapter 6 
DCT Implementation on Dedicated DCT Processor 
6.1 Overview 
With the growing demand for high quality signals, the computation requirements of
video and image applications are becoming higher and higher. For applications of
the discrete cosine transform such as MPEG2 (640x480, 30 fps, 4:2:0, 13.82
Mpixel/sec) or High Definition Television (HDTV) (74.23MHz in luminance signal
for baseband HDTV), a very high processing rate is required of a 2D DCT/IDCT
design. Although the processing power of a general purpose processor is high, it
is still difficult for it to process these signals in real time. A dedicated
processor for the specific application, on the other hand, can provide a cost
effective and higher performance solution. By further applying the asynchronous
pipelined architecture to such designs, a higher performance may be achieved.
In this chapter, a dedicated 8x8 2D DCT/IDCT asynchronous processor is
introduced. The processor has a fully pipelined architecture and provides a very
high transform rate, which makes it capable of real-time processing of high
quality signals. The architecture of the 2D DCT/IDCT processor is introduced
first. Since the architecture is based on the row-and-column decomposition
method, the design of the 1D DCT core and the transpose memory is given
afterwards.
6.2 DCT Chip Architecture 
As discussed in chapter 3, the 2D DCT design is based on the row-and-column
decomposition method, which provides a simpler implementation and is more
suitable for the asynchronous architecture. Figure 6.1 shows the dataflow in the
2D DCT using the row-and-column decomposition method. In the row operation, a 1D
DCT is applied to each row of data, and the results are immediately stored
row-wise in the transpose memory. In the column operation, the 1D DCTs are applied
column-wise to the data stored in the transpose memory, and the resultant values
of the column operation are the 2D DCT result.
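The row-and-column decomposition can be illustrated in software. The Python sketch below is a minimal floating-point model, not a description of the chip: the function names are hypothetical, and the orthonormal DCT-II scaling is an assumption, since Equation 3.5 and Equation 3.6 are not reproduced in this chapter. The list transposition plays the role of the transpose memory.

```python
import math


def dct_1d(x):
    """Orthonormal 1D DCT-II of a sequence (scaling is an assumption)."""
    n = len(x)
    result = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        result.append(scale * sum(
            x[i] * math.cos((2 * i + 1) * k * math.pi / (2 * n)) for i in range(n)))
    return result


def transpose(matrix):
    """Swap rows and columns, as the transpose memory does."""
    return [list(column) for column in zip(*matrix)]


def dct_2d(block):
    """2D DCT by row-and-column decomposition."""
    row_result = [dct_1d(row) for row in block]                 # row operation
    col_result = [dct_1d(col) for col in transpose(row_result)]  # column operation
    return transpose(col_result)                                 # back to [v][u] order
```

For an 8x8 block this performs eight 1D DCTs in the row operation and eight in the column operation.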
Figure 6.1 - Dataflow diagram in 2D DCT by row-and-column decomposition method
The dataflow diagram shows that eight 1D DCT operations are required in each of
the row and column operations, which means that there are 16 1D DCT operations in
total in the whole 2D DCT operation. However, in the physical realization of the
row-and-column decomposition method, it is not necessary to use 16 independent 1D
DCT cores to perform the row and column operations, because the data enters the
processor serially. A single 1D DCT core can be shared by the eight 1D DCTs of
each operation. As a result, only one 1D DCT core is required for the row
operation and one for the column operation, and the block diagram of the 2D DCT
architecture is shown in Figure 6.2. Since both 1D DCT cores can have the same
architecture, design time is saved.
[Figure 6.2: data (pixel) input -> 1D DCT Core -> Transpose Memory -> 1D DCT Core -> 2D DCT output]
Figure 6.2 - Block diagram of 2D DCT processor
The detailed architecture of the 1D DCT core is discussed in the next section of
this chapter. The transpose memory is built from an ordinary Static Random Access
Memory (SRAM) with an address generator that controls the write and read
processes. The detailed architecture of the transpose memory is discussed in
section 6.3.
6.2.1 1D DCT Core 
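The implementation of the 1D DCT core is based on Equation 3.5 and Equation 3.6
shown in chapter 3. By dividing the equations into two parts, Equation 6.1 to
Equation 6.3 can be obtained.
$$\begin{bmatrix} Z_0 \\ Z_1 \\ Z_2 \\ Z_3 \end{bmatrix} = \begin{bmatrix} X_0 + X_7 \\ X_1 + X_6 \\ X_2 + X_5 \\ X_3 + X_4 \end{bmatrix} \quad \text{and} \quad \begin{bmatrix} Z_4 \\ Z_5 \\ Z_6 \\ Z_7 \end{bmatrix} = \begin{bmatrix} X_0 - X_7 \\ X_1 - X_6 \\ X_2 - X_5 \\ X_3 - X_4 \end{bmatrix}$$ - Equation 6.1
$$\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} A & A & A & A \\ B & C & -C & -B \\ A & -A & -A & A \\ C & -B & B & -C \end{bmatrix} \cdot \begin{bmatrix} Z_0 \\ Z_1 \\ Z_2 \\ Z_3 \end{bmatrix}$$ - Equation 6.2
$$\begin{bmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix} \cdot \begin{bmatrix} Z_4 \\ Z_5 \\ Z_6 \\ Z_7 \end{bmatrix}$$ - Equation 6.3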
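The split into a sum part and a difference part can be checked numerically. The Python sketch below is an illustration under stated assumptions: the coefficient values A = cos(pi/4), B = cos(pi/8), C = cos(3pi/8), D = cos(pi/16), E = cos(3pi/16), F = cos(5pi/16) and G = cos(7pi/16) and the overall factor of 1/2 are the usual 8-point DCT values and are assumed here, because Equation 3.5 and Equation 3.6 are not reproduced in this section. The function names are hypothetical.

```python
import math

# Assumed coefficient values (standard 8-point DCT).
A, B, C = (math.cos(k * math.pi / 16) for k in (4, 2, 6))
D, E, F, G = (math.cos(k * math.pi / 16) for k in (1, 3, 5, 7))

EVEN = [[A, A, A, A], [B, C, -C, -B], [A, -A, -A, A], [C, -B, B, -C]]   # Equation 6.2
ODD = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]    # Equation 6.3


def dct8_split(x):
    """8-point DCT via Equations 6.1 to 6.3; returns [Y0, ..., Y7]."""
    z_sum = [x[i] + x[7 - i] for i in range(4)]     # Z0..Z3, Equation 6.1
    z_diff = [x[i] - x[7 - i] for i in range(4)]    # Z4..Z7, Equation 6.1
    y = [0.0] * 8
    y[0::2] = [0.5 * sum(c * z for c, z in zip(row, z_sum)) for row in EVEN]
    y[1::2] = [0.5 * sum(c * z for c, z in zip(row, z_diff)) for row in ODD]
    return y


def dct8_direct(x):
    """Reference 8-point DCT computed directly from the cosine sum."""
    return [0.5 * (math.sqrt(0.5) if k == 0 else 1.0)
            * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))
            for k in range(8)]
```

The split form needs 32 multiplications per transform (16 for each 4x4 matrix) instead of the 64 of the direct sum.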
Figure 6.3 - Block diagram of the 1D DCT core
Figure 6.3 shows the basic architecture of the 1D DCT core [32][33][34], which
consists of a pre-processor and a multiplier-accumulator. The pre-processor is
responsible for the operation described by Equation 6.1: it collects the input
data and performs the additions and subtractions. Since only simple addition and
subtraction are required, the pre-processor consists of an adder and a
subtracter.
In this processor, the Binary Look-ahead Carry (BLC) adder [30] [54] is used.
Compared to the Carry Look-ahead (CLA) adder, each processing block of the BLC
adder handles only two sets of Propagate and Generate signals. This simplifies the
operation within the basic block and thus the speed can be increased. The
drawbacks, however, are a larger silicon area and a longer latency. An 8-bit
version of the BLC adder is shown in Figure 6.4.
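The idea of a block that combines exactly two (Generate, Propagate) pairs can be sketched with the standard parallel-prefix carry operator. The Python model below is only an illustration: the function names are hypothetical, and the doubling-distance arrangement of the cells is an assumption made for this example, since the exact wiring of Figure 6.4 is not reproduced here.

```python
def blc(upper, lower):
    """Combine two (generate, propagate) pairs into one pair for the joined range."""
    g_hi, p_hi = upper
    g_lo, p_lo = lower
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)


def blc_add(a: int, b: int, width: int = 8) -> int:
    """Add two unsigned `width`-bit integers; the result has width + 1 bits."""
    # PG cells: per-bit generate and propagate.
    pg = [(((a >> i) & 1) & ((b >> i) & 1), ((a >> i) & 1) ^ ((b >> i) & 1))
          for i in range(width)]
    prefix = list(pg)
    distance = 1
    while distance < width:
        # One level of BLC cells; each cell merges exactly two pairs.
        prefix = [blc(prefix[i], prefix[i - distance]) if i >= distance else prefix[i]
                  for i in range(width)]
        distance *= 2
    total = 0
    for i in range(width):
        carry_in = prefix[i - 1][0] if i > 0 else 0
        total |= (pg[i][1] ^ carry_in) << i     # final XOR (half adder) per bit
    total |= prefix[width - 1][0] << width      # carry out becomes the top sum bit
    return total
```

For an 8-bit input the loop runs three levels of cells (distances 1, 2 and 4), and every level can be a separate pipeline stage.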
Figure 6.4 - Structure of the 8-bit BLC adder
The second part of the 1D DCT core is the multiplier-accumulator. It is
responsible for the matrix multiplications described by Equation 6.2 and Equation
6.3, which are carried out by multiply-and-add. For each equation, it receives the
output of the pre-processor, performs 16 multiplications with the DCT
coefficients, and then adds the products in the required order. The
multiplier-accumulator is therefore constructed from multipliers and adders.
Besides using multipliers and adders directly, some designs [32] [33] implement
the multiplier-accumulator with the distributed arithmetic (DA) [55] method. The
principle of DA is to replace the multiplier with a look-up table (LUT) based on
Read Only Memory (ROM). Since the DCT coefficients are fixed, the results of the
multiplications can be pre-calculated and stored in the ROM. The input data then
acts as an address for reading the stored result from the ROM. Since a ROM based
LUT can be built very compactly, the advantage of DA is that it saves silicon
area by avoiding a general dedicated multiplier. However, DA does not fit the
style of the asynchronous architecture, and the read operation on the ROM cannot
be pipelined. As a result, a general pipelined multiplier based on the
bit-parallel algorithm is used in this 1D DCT core, as it can be pipelined and run
very fast; the trade-off is its size.
Basically the architecture of this bit-parallel multiplier is the same as the one
used in the programmable processor, which has been described in Chapter 5.
However, it cannot be used directly in this DCT core, because the bit-parallel
architecture is primarily designed for the multiplication of two unsigned values,
whereas the data in the DCT core is in two's complement format. As a result, each
two's complement value has to be converted into an unsigned value with a sign bit.
This conversion is done in the input buffer, and the converted output can then be
used in the multiplier core.
The mechanism of the conversion can be illustrated by the following example. For a
9-bit value with the binary representation 111110100, the conversion proceeds as
follows:
Original two's complement binary value   1 1 1 1 1 0 1 0 0
Step 1 : Inversion                       0 0 0 0 0 1 0 1 1
Step 2 : Add 1 to the result           +                 1
                                         0 0 0 0 0 1 1 0 0
The resultant binary number has a decimal value of 12, so the original value
represents -12.
In this implementation, the conversion is divided into two stages. The first stage
is step 1, which inverts the input bits if the input value is negative. This is
done by XOR gates, which XOR each data bit with the sign bit. The second stage is
step 2, which is performed by an adder. As the conversion is handled in the input
buffer, a ripple adder is used because it fits the ladder structure of the input
buffer. Figure 6.5 shows the modified input buffer used in the two's complement
multiplier.
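The two conversion stages can be expressed compactly in software. The Python sketch below is a behavioural illustration only: the function names and the 9-bit default width (chosen to match the worked example) are assumptions, and the ripple adder is replaced by an ordinary integer addition. Feeding the sign bit in as the value to be added means that 1 is added only for negative inputs.

```python
def to_sign_magnitude(value: int, width: int = 9):
    """Convert a `width`-bit two's complement pattern into (sign bit, magnitude)."""
    data_mask = (1 << (width - 1)) - 1
    sign = (value >> (width - 1)) & 1
    data = value & data_mask
    inverted = data ^ (data_mask if sign else 0)   # stage 1: XOR each data bit with sign
    magnitude = inverted + sign                    # stage 2: add 1 when negative
    return sign, magnitude


def from_sign_magnitude(sign: int, magnitude: int, width: int = 9) -> int:
    """Convert (sign bit, magnitude) back into a two's complement pattern."""
    mask = (1 << width) - 1
    return ((magnitude ^ (mask if sign else 0)) + sign) & mask
```

Applied to the example value, `to_sign_magnitude(0b111110100)` gives a sign bit of 1 and a magnitude of 12, and `from_sign_magnitude(1, 12)` restores the original pattern, which mirrors the reverse conversion needed at the multiplier output.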
Figure 6.5 - Modified input buffer for 2's complement input
A similar conversion is required at the output, as the unsigned product has to be
converted back to two's complement format. The conversion is merged into the
output buffer, whose structure is similar to that of the input buffer shown in
Figure 6.5. For the conversion at the output, the sign bit of the result must be
ready at the same time as bit 0 so that the conversion can start at once.
Therefore the sign bit path in the multiplier core is modified to calculate the
output sign bit first. The resultant structure of the multiplier core is shown in
Figure 6.6. Together with the modified input and output buffers, it forms a two's
complement bit-parallel multiplier.
Figure 6.6 - Multiplier core of 2's complement multiplier
For the accumulator part, some designs use an adder with output feedback to
accumulate the multiplier's outputs. However, this structure is very slow, as the
next addition cannot be performed until the previous addition has finished and its
result has been fed back to the input. It also wastes the pipeline architecture,
as only one addition can be in progress at any time. Therefore several BLC adders
are used in this design in order to achieve a better performance.
6.2.1.1 Core Architecture 
For the asynchronous architecture, simpler and direct dataflow allows easier 
implementation and better performance as it reduces the handshaking overhead and 
fits the asynchronous pipeline architecture. In order to develop a simple dataflow, 
Equation 6.2 is further decomposed to Equation 6.4, 
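$$\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix} = \frac{1}{2} \begin{bmatrix} A \\ B \\ A \\ C \end{bmatrix} \cdot [Z_0] + \frac{1}{2} \begin{bmatrix} A \\ C \\ -A \\ -B \end{bmatrix} \cdot [Z_1] + \frac{1}{2} \begin{bmatrix} A \\ -C \\ -A \\ B \end{bmatrix} \cdot [Z_2] + \frac{1}{2} \begin{bmatrix} A \\ -B \\ A \\ -C \end{bmatrix} \cdot [Z_3]$$ - Equation 6.4
Let
$$U_0 = \frac{1}{2} \begin{bmatrix} A \\ B \\ A \\ C \end{bmatrix} \cdot [Z_0],\ U_1 = \frac{1}{2} \begin{bmatrix} A \\ C \\ -A \\ -B \end{bmatrix} \cdot [Z_1],\ U_2 = \frac{1}{2} \begin{bmatrix} A \\ -C \\ -A \\ B \end{bmatrix} \cdot [Z_2],\ \text{and}\ U_3 = \frac{1}{2} \begin{bmatrix} A \\ -B \\ A \\ -C \end{bmatrix} \cdot [Z_3]$$ - Equation 6.5
then,
$$\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix} = U_{01} + U_{23}$$ - Equation 6.6
where $U_{01} = U_0 + U_1$ and $U_{23} = U_2 + U_3$ - Equation 6.7
Similarly for Equation 6.3, let
$$L_0 = \frac{1}{2} \begin{bmatrix} D \\ E \\ F \\ G \end{bmatrix} \cdot [Z_4],\ L_1 = \frac{1}{2} \begin{bmatrix} E \\ -G \\ -D \\ -F \end{bmatrix} \cdot [Z_5],\ L_2 = \frac{1}{2} \begin{bmatrix} F \\ -D \\ G \\ E \end{bmatrix} \cdot [Z_6],\ \text{and}\ L_3 = \frac{1}{2} \begin{bmatrix} G \\ -F \\ E \\ -D \end{bmatrix} \cdot [Z_7]$$ - Equation 6.8
then,
$$\begin{bmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{bmatrix} = L_{01} + L_{23}$$ - Equation 6.9
where $L_{01} = L_0 + L_1$ and $L_{23} = L_2 + L_3$ - Equation 6.10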
Based on Equation 6.1 and Equation 6.4 to Equation 6.10, the architecture of the
1D DCT core is formed as shown in Figure 6.7. It is a fully pipelined design, and
the datapath is simple and unidirectional, without any feedback. In this
architecture, the pre-processor is constructed from one adder and one subtracter,
and the multiplier-accumulator consists of two general purpose multipliers and
three adders.
Figure 6.7 - Architecture of 1D DCT core
It should be noted that only two multipliers are used in this design. In order to
achieve a high performance, some other designs require parallel input of data, or
four or more multipliers or LUTs [32] [34], so that several multiplications can be
processed in parallel and a higher throughput achieved. Another reason is that
they are synchronous designs, which need to maintain a constant data rate
throughout the datapath; otherwise, some clock cycles may be wasted waiting for
input data. This is not necessary in this design, as it is based on an
asynchronous architecture. Different units in an asynchronous design can operate
at different rates, because their operation depends only on local handshake
signals rather than on a global clock signal. In addition, the asynchronous
pipelined architecture is applied to the design of the multipliers, so that the
multipliers can run very fast. As a result, the multipliers can be adjusted to run
faster than the other units, and a similar or better performance can still be
achieved by this design even though fewer multipliers are used. Furthermore,
parallel input of data is not required: an operation starts only when all of its
inputs are ready, and no operation occurs (and no power is consumed) while waiting
for input data.
6.2.1.2 Flow of Operation 
The dataflow of the 1D DCT core can be explained by Equation 6.1 and Equation 6.4
to Equation 6.10. Figure 6.7 can be divided into four stages.
Stage 1:
Stage 1 is the operation of the pre-processor, represented by Equation 6.1. It
first receives the input data in the order [x0, x7, x1, x6, x2, x5, x3, x4]. The
one-to-two demultiplexer then sends the data to input 1 and input 2 of the adder
and subtracter alternately: the odd-numbered input data (first, third, and so on)
are sent to input 1 of the adder and subtracter, and the even-numbered input data
are sent along the lower path to input 2 of the adder and subtracter. As a result,
the output sequence of adder 1 is [x0+x7, x1+x6, x2+x5, x3+x4] or [Z0, Z1, Z2, Z3]
(refer to Equation 6.1), and the output sequence of the subtracter is
[x0-x7, x1-x6, x2-x5, x3-x4] or [Z4, Z5, Z6, Z7] (refer to Equation 6.1).
Since an addition or subtraction can only be carried out when both inputs are
ready, the output rate of the adder and subtracter is half of the input data rate.
Stage 2: 
The multiplications with DCT coefficients are performed at stage 2. At this stage, 
data is split into two paths, which are the upper and lower path. Both of the paths are 
totally identical, and the upper path is responsible for Equation 6.5 while the lower 
path is responsible for Equation 6.8. 
By considering Equation 6.5, there are totally sixteen multiplications, in which each 
input data needs to multiply with four different DCT coefficients. Therefore, a data 
replicator is used to duplicate the input data four times and then send to the 
multiplier. Therefore, the output sequence of the data replicator at the upper path is 
[Zo, Zo，Zo, Zo, Zi, Zi, Zi, Zi, Z2, Z2, Z2，Z2, Z3, Z3, Z3, Z3]. Similarly, the output 
sequence of the data replicator at the lower path is [Z4, Z4, Z4，Z4, Z5，Z5, Z5, Z5, Ze, 
Z6，Z6, Zs, Z7, Z7, Z7，Z7]. 
The DCT coefficients are stored in the DCT coefficients memory, and they are 
arranged and sent out to the multipliers in the sequence of [A, B, A, C, A, C, -A, -B, 
A, -C, -A, C, A, -B, A, -C] in the upper path and [D, E, F, G, E, -G, -D, -F, F, -D, G, 
E, G, -F，E, -D] in the lower path. As a result, Equation 6.5 and Equation 6.8 can be 
performed and the output sequence of multiplier 1 is [U0, U1, U2, U3] while the
output sequence of multiplier 2 is [L0, L1, L2, L3]. The output data rate of each
multiplier is two times the input data rate, as the data replicator increases the
output data rate of the first stage by four times.
Stage 3: 
The output of each multiplier goes through a one-to-two demultiplexer at stage 3.
Its operation is similar to that of stage 1, but the outputs of the demultiplexer
alternate after every four data in order to perform the additions shown in Equation 6.7
and Equation 6.10. For example, in the upper path, U0 and U2 are connected to the first
input of adder 2, while U1 and U3 are connected to the second input of adder 2.
Therefore the output sequence of adder 2 is [U01, U23] and that of adder 3 is
[L01, L23]. The output data rate is reduced to the same as the input data rate.
Because an addition can only be performed when both inputs are ready, the output rate
is half of the output data rate of the multipliers.
Stage 4: 
Stage 4 is responsible for performing Equation 6.6 and Equation 6.9. Originally two
adders are required, one for each equation. However, since the data rate after stage 3
is halved, a single adder can be shared by both equations. Therefore a two-to-two
switch is inserted at the beginning of this stage. It is used to collect data from the
upper and lower paths and distribute them to adder 3. Finally the output sequence of
stage 4 is [Y0, Y1, Y2, Y3, Y4, Y5, Y6, Y7].
In this stage, the output data rate can be maintained at the input data rate as the
stage combines the data from the upper and lower paths. As a result, the final output
of the 1D DCT core has the same data rate as the input.
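The four stages described above can be checked with a short behavioural model. The following Python sketch is only an illustration of the data flow, not the circuit itself. It assumes that A to G are the standard 8-point DCT cosines, A = cos(pi/4), B = cos(pi/8), C = cos(3pi/8), D = cos(pi/16), E = cos(3pi/16), F = cos(5pi/16) and G = cos(7pi/16), with the common factor 1/2 applied at the end. The function names are hypothetical. The upper-path list is written with +B as the last entry of its third group, which is what the direct DCT formula requires.

```python
import math

# Assumed coefficient values: the standard 8-point DCT cosines. The common
# factor 1/2 is applied once at the end instead of inside each coefficient.
A = math.cos(math.pi / 4)
B = math.cos(math.pi / 8)
C = math.cos(3 * math.pi / 8)
D, E, F, G = (math.cos(k * math.pi / 16) for k in (1, 3, 5, 7))

# Coefficient streams, four entries per replicated operand.
UPPER = [A, B, A, C,   A, C, -A, -B,   A, -C, -A, B,   A, -B, A, -C]
LOWER = [D, E, F, G,   E, -G, -D, -F,  F, -D, G, E,    G, -F, E, -D]


def replicate(seq, times=4):
    """Data replicator: emit every operand `times` times in a row."""
    return [v for v in seq for _ in range(times)]


def pair_sums(products):
    """Stage 3: the demultiplexer switches every four items, so group 0 is
    added to group 1 and group 2 is added to group 3."""
    g = [products[i:i + 4] for i in range(0, 16, 4)]
    s01 = [a + b for a, b in zip(g[0], g[1])]
    s23 = [a + b for a, b in zip(g[2], g[3])]
    return s01, s23


def dct_core(x):
    """Behavioural model of the four stages for one block of 8 samples."""
    # Stage 1: inputs arrive as x0, x7, x1, x6, ... and alternate between
    # input 1 and input 2 of the adder and subtracter.
    ordered = [x[0], x[7], x[1], x[6], x[2], x[5], x[3], x[4]]
    in1, in2 = ordered[0::2], ordered[1::2]
    z_add = [a + b for a, b in zip(in1, in2)]   # Z0..Z3
    z_sub = [a - b for a, b in zip(in1, in2)]   # Z4..Z7
    # Stage 2: replicate, then multiply by the coefficient streams.
    up = [z * c for z, c in zip(replicate(z_add), UPPER)]
    lo = [z * c for z, c in zip(replicate(z_sub), LOWER)]
    # Stage 3 and stage 4: two levels of addition per path.
    u01, u23 = pair_sums(up)
    l01, l23 = pair_sums(lo)
    even = [a + b for a, b in zip(u01, u23)]    # Y0, Y2, Y4, Y6
    odd = [a + b for a, b in zip(l01, l23)]     # Y1, Y3, Y5, Y7
    y = [0.0] * 8
    y[0::2], y[1::2] = even, odd                # interleave to Y0..Y7
    return [v / 2 for v in y]


def dct_reference(x):
    """Direct 8-point DCT, used only to check the model."""
    out = []
    for k in range(8):
        ck = math.sqrt(0.5) if k == 0 else 1.0
        s = sum(x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8))
        out.append(ck * s / 2)
    return out
```

Calling dct_core and dct_reference on the same block of eight samples should give the same eight values up to floating-point rounding.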
Table 6.1 summarizes the data rate at the different stages of the 1D DCT core. It
shows that the critical part of the design is stage 2, where the multiplications are
performed. As a result, the throughput of the whole design is limited to half of
the speed of the multiplications in stage 2.
         Stage 1                Stage 2                Stage 3              Stage 4
Input    Input data rate        1/2 x Input data rate  2 x Input data rate  Input data rate
Output   1/2 x Input data rate  2 x Input data rate    Input data rate      Input data rate
Table 6.1 - Data rate at different stages of the 1D DCT core
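The entries of Table 6.1 follow from three simple rules: a two-input adder halves the data rate, the data replicator multiplies it by four, and the switch at stage 4 merges two paths. The sketch below reproduces the table with exact fractions under these assumptions. The function names are hypothetical and the rates are expressed relative to the input data rate.

```python
from fractions import Fraction


def stage_rates(input_rate=Fraction(1)):
    """Return (input rate, output rate) of each stage of the 1D DCT core."""
    s1_in = input_rate
    s1_out = s1_in / 2        # one sum and one difference per two inputs
    s2_in = s1_out
    s2_out = s2_in * 4        # replicator: four products per operand
    s3_in = s2_out
    s3_out = s3_in / 2        # one sum per two products
    s4_in = s3_out
    s4_out = s4_in / 2 * 2    # halved per path, then two paths are merged
    return {
        "stage 1": (s1_in, s1_out),
        "stage 2": (s2_in, s2_out),
        "stage 3": (s3_in, s3_out),
        "stage 4": (s4_in, s4_out),
    }


def max_input_rate(multiplier_rate):
    """Stage 2 runs at twice the input rate, so it bounds the throughput."""
    return Fraction(multiplier_rate) / 2
```

For instance, under this model a multiplier that sustains 100 million multiplications per second would allow at most 50 million input data per second.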
6.2.1.3 Data Replicator 
The purpose of the data replicator is to hold a single operand so that the multiplier
can perform four multiplications on it. In synchronous design, a latch can be used to
hold a data for four clock cycles. However, this is not possible in asynchronous design
as the data is lost after it has been used, due to the Precharge phase of the domino
logic. A simple solution is a buffer with a feedback output, which performs a cyclic
function so that the data can be reused. However, the resultant speed is slow and the
pipeline architecture is destroyed by the feedback. As a result, a dedicated circuit is
constructed in order to duplicate a single data four times, and thus four
multiplications on a single data can be done.
Figure 6.8 - (a) block diagram of parallel-to-serial shift register in synchronous design,
(b) block diagram of data replicator
The idea of the data replicator is similar to that of the parallel-to-serial shift
register in synchronous design, which is shown in Figure 6.8(a). However, the
parallel-to-serial shift register is not suitable for asynchronous design. The first
reason is that the flip-flop does not fit the style of the new asynchronous
architecture. In addition, the single-stage parallel-to-serial structure requires more
difficult control.
In the data replicator, which is shown in Figure 6.8(b), multiplexers are used
instead of flip-flops and the structure is divided into two stages. Since a
multiplexer can only handle one of its two inputs at a time, buffers are also
included for temporary storage. In this structure, the data in the four input
paths are quickly transferred to the next stage by the multiplexers or stored in the
buffers, and then the next data can be input. In the single-stage parallel-to-serial
shift register, by contrast, it is necessary to ensure that all data have been sent
out before the next data comes in. Therefore, the control of the data replicator is
simpler and has less overhead, and thus allows faster duplication of data.
6.2.1.4 DCT Coefficients Memory 
As mentioned in the previous section about the distributed arithmetic, ROM is not 
suitable to be used in asynchronous design. In order to pre-store the DCT 
coefficients for the multiplication, a new logic cell is used, which is shown in Figure 
6.9(b). 
Figure 6.9 - (a) normal basic FIFO cell, (b) modified basic FIFO cell, (c) basic DCVSL structure
of pre-storing data
The main difference between the new cell and the normal basic FIFO cell, which is
shown in Figure 6.9(a), is the addition of transistors M1 and M2. Initially, when the
system is being reset, the reset signal is high and the acknowledgement signal
becomes low. The normal basic FIFO cell enters the Precharge phase and its
output becomes logic low. After the reset has finished, it goes into the Enable
phase and waits for the input data. Since charge is still kept at the floating node, the
output of the normal basic FIFO cell remains low.
For the modified basic FIFO cell, however, although the acknowledgement input is
low, the cell is not in the Precharge phase, as transistor M1 is turned off and M2 is
turned on by the delayed reset signal. As a consequence, a pull-down path is created
and the output is kept high. When the reset is finished, the next stage goes to the
Enable phase but the output of the modified basic FIFO cell is still kept high by the
delayed reset signal. This high output, which represents a data of logic one, requests
the next stage to receive the data, and thus a data of logic one is sent out. Owing to
this feature, the modified basic FIFO cell can be treated as a memory cell pre-storing
either logic high or logic low, as shown in Figure 6.9(c). By connecting N memory
cells in parallel with a handshake cell, a single FIFO stage is formed in which an
N-bit data is pre-stored. The DCT coefficients memory is constructed from these FIFO
stages with pre-stored DCT coefficients according to the required sequence listed in
section 6.2.1.2. An example of the DCT coefficients memory in the upper path is shown
in Figure 6.10.
Figure 6.10 - DCT coefficients memory in upper path
When the delayed reset signal becomes zero, transistor M1 is always turned on
while M2 is always turned off. As a result, the modified basic FIFO cell can be treated
as a normal basic FIFO cell. Therefore, by applying the cyclic feature in this DCT
coefficients memory as shown in Figure 6.10, the DCT coefficients can be recycled
and used repeatedly.
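The cyclic behaviour of the coefficients memory can be pictured with a small software model. The sketch below is a behavioural analogy only, assuming that each read passes the head token to the multiplier and also feeds it back to the tail of the FIFO. The class name CyclicFifo is hypothetical, and symbolic names stand in for the pre-stored N-bit words.

```python
from collections import deque


class CyclicFifo:
    """Ring of pre-stored tokens: each read returns the head token and
    feeds it back to the tail, so the contents are never consumed."""

    def __init__(self, prestored):
        self._ring = deque(prestored)

    def read(self):
        token = self._ring[0]
        self._ring.rotate(-1)   # the head token moves to the tail
        return token


# Symbolic stand-ins for the first four upper-path coefficients.
upper = CyclicFifo(["A", "B", "A", "C"])
first_pass = [upper.read() for _ in range(4)]
second_pass = [upper.read() for _ in range(4)]
```

Both passes return the same sequence, which is the property that lets one set of pre-stored coefficients serve every block of input data.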
6.2.2 Combination of IDCT to 1D DCT core 
Similar to the 1D DCT, the 1D IDCT can also be implemented by a similar
architecture. Referring to Equation 3.7 and Equation 3.8, they can be divided into
two stages as the following equations,
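\[
\begin{bmatrix} S_0 \\ S_1 \\ S_2 \\ S_3 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} A & B & A & C \\ A & C & -A & -B \\ A & -C & -A & B \\ A & -B & A & -C \end{bmatrix}
\begin{bmatrix} Y_0 \\ Y_2 \\ Y_4 \\ Y_6 \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} S_4 \\ S_5 \\ S_6 \\ S_7 \end{bmatrix}
= \frac{1}{2}
\begin{bmatrix} D & E & F & G \\ E & -G & -D & -F \\ F & -D & G & E \\ G & -F & E & -D \end{bmatrix}
\begin{bmatrix} Y_1 \\ Y_3 \\ Y_5 \\ Y_7 \end{bmatrix}
\]
Equation 6.11
\[
\begin{bmatrix} x_0 \\ x_1 \\ x_2 \\ x_3 \end{bmatrix}
=
\begin{bmatrix} S_0 + S_4 \\ S_1 + S_5 \\ S_2 + S_6 \\ S_3 + S_7 \end{bmatrix}
\quad \text{and} \quad
\begin{bmatrix} x_7 \\ x_6 \\ x_5 \\ x_4 \end{bmatrix}
=
\begin{bmatrix} S_0 - S_4 \\ S_1 - S_5 \\ S_2 - S_6 \\ S_3 - S_7 \end{bmatrix}
\]
Equation 6.12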
Referring to Equation 6.11 and Equation 6.12, the operations of the 1D IDCT are
similar to those of the 1D DCT, but in a different order. The 1D IDCT first requires a
matrix multiplication, which is then followed by addition and subtraction. Therefore,
for the IDCT, the pre-processor is eliminated while a post-processor is added after the
multiplier-accumulator, as shown in Figure 6.11. This post-processor is
responsible for the operation according to Equation 6.12, and it consists of an adder
and a subtracter, which are similar to those of the pre-processor.
Figure 6.11 - Block diagram of the IDCT
Comparing Equation 6.11 with Equation 6.2 and Equation 6.3, their structures are
the same and thus both multiplier-accumulators can share the same hardware.
Therefore, the 1D IDCT can also be performed on the 1D DCT core by adding a
post-processor at the end of the original 1D DCT core. Switches are also inserted
inside the core so as to select the path for performing the DCT or the IDCT. The
resulting overall architecture of the 1D DCT/IDCT processor is shown in Figure 6.12.
Figure 6.12 - Overall architecture of the 1D DCT/IDCT processor
One more modification is made on the DCT coefficients memory. The DCT 
coefficients of DCT and IDCT are different in the upper path as the matrix 
multiplications are different, which is shown in Equation 6.2 and Equation 6.11. 
Therefore the content of the DCT coefficient memory needs to be changed when 
performing IDCT. As the pre-storage of the modified basic FIFO cell only depends 
on the delayed reset signal, it is not necessary to use an additional DCT coefficients 
memory to store the additional IDCT coefficient. The change of the DCT 
coefficients can be done by adding some logic gates to control the presence of 
delayed reset signal in the memory cell, which is shown in Figure 6.13. As a result, 
the pre-storing data can be changed for DCT and IDCT. 
Figure 6.13 - Modification of memory cell of pre-storing different data in DCT and IDCT
(left: '1' is pre-stored in DCT and '0' is pre-stored in IDCT; right: '0' is pre-stored in
DCT and '1' is pre-stored in IDCT)
In performing the IDCT, the order of the input and output sequences is changed too.
The input sequence of the IDCT is [Y0, Y1, Y2, Y3, Y4, Y5, Y6, Y7] and the output
sequence is [x0, x7, x1, x6, x2, x5, x3, x4].
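The IDCT data flow, a matrix multiplication followed by the post-processor, can be sketched in the same way. The Python fragment below is an illustrative model only. It assumes the standard cosine values for A to G, takes the even-indexed inputs for the upper path and the odd-indexed inputs for the lower path, and produces the output in the order x0, x7, x1, x6, x2, x5, x3, x4. The function names are hypothetical.

```python
import math

# Assumed coefficient values: the standard 8-point DCT cosines.
A = math.cos(math.pi / 4)
B = math.cos(math.pi / 8)
C = math.cos(3 * math.pi / 8)
D, E, F, G = (math.cos(k * math.pi / 16) for k in (1, 3, 5, 7))

# Upper-path and lower-path coefficient matrices of the IDCT.
EVEN = [[A, B, A, C], [A, C, -A, -B], [A, -C, -A, B], [A, -B, A, -C]]
ODD = [[D, E, F, G], [E, -G, -D, -F], [F, -D, G, E], [G, -F, E, -D]]


def matvec_half(matrix, vector):
    """Multiply-accumulate stage: (1/2) * matrix * vector."""
    return [sum(c * v for c, v in zip(row, vector)) / 2 for row in matrix]


def idct_core(y):
    """Return the 8 outputs in the order x0, x7, x1, x6, x2, x5, x3, x4."""
    s_even = matvec_half(EVEN, y[0::2])   # S0..S3 from Y0, Y2, Y4, Y6
    s_odd = matvec_half(ODD, y[1::2])     # S4..S7 from Y1, Y3, Y5, Y7
    out = []
    for se, so in zip(s_even, s_odd):
        out += [se + so, se - so]         # post-processor: adder, subtracter
    return out


def to_natural_order(seq):
    """Reorder x0, x7, x1, x6, ... into x0, x1, ..., x7 for inspection."""
    x = [0.0] * 8
    for value, index in zip(seq, [0, 7, 1, 6, 2, 5, 3, 4]):
        x[index] = value
    return x
```

Feeding the model with the DCT of a block and reordering the result should return the original block up to floating-point rounding.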
6.2.3 Accuracy 
According to the IEEE specification [56], the 2D IDCT should achieve a certain
accuracy in order to prevent quality degradation in the reconstructed signal after
the inverse transform. Therefore, in this design, the bit lengths of the different parts
must be chosen so that the specified accuracy is achieved.
Different combinations of the bit lengths of the DCT coefficient, the transpose
memory and the multiplier's output were considered and verified with a C program.
Table 6.2 shows the resulting bit lengths of the different parts of the DCT/IDCT
processor. The architecture is shown in Figure 6.14. According to this result,
truncations are needed at the outputs of the multipliers and of the 1D DCT/IDCT core.
Truncation of the multiplier's output can be merged into the multiplier: as the last
stage of the bit-parallel multiplier is an adder, only a little modification is
required. For the output of the DCT/IDCT core, however, a dedicated truncation circuit
is added in order to truncate and round up the result.
                                              Bit length
Input                                         9/12 (DCT/IDCT)
Output                                        12/9 (DCT/IDCT)
Multiplier's output in the row operation      19
Transpose Memory                              15
Multiplier's output in the column operation   20
Table 6.2 - Bit length in different parts of the 2D DCT/IDCT processor
Figure 6.14 - Bit length in different parts of the 2D DCT/IDCT processor
Table 6.3 shows the IDCT error produced by using the architecture shown in Figure
6.12 with the bit lengths provided in Table 6.2. It shows that the precision meets the
IEEE specification.
                                   Spec.    Error         Error      Error
                                            [-256, 255]   [-5, 5]    [-300, 300]
Maximum Pixel Error                1        1             1          1
Overall Mean Error                 0.0015   0.000777      0.000856   0.000675
Overall Mean Square Error          0.02     0.009842      0.009237   0.008331
Maximum Pixel Mean Error           0.015    0.004300      0.004400   0.004700
Maximum Pixel Mean Square Error    0.06     0.012300      0.012600   0.010600
Table 6.3 - Accuracy of the 2D DCT/IDCT processor
In the VLSI implementation of the 2D DCT/IDCT processor, in order to reduce the
cost, the 1D DCT/IDCT core and the transpose memory are separated into two chips.
The structure of the 1D DCT/IDCT core for the row and column operations is unified
such that a single 1D DCT/IDCT core can be used for both operations. The 2D
DCT can then be done by cascading the 1D DCT/IDCT core with the transpose memory
and connecting the result to another 1D DCT/IDCT core. As a result, the bit length is
further modified in the 1D DCT/IDCT core as shown in Figure 6.15. In this new
configuration, the unified DCT/IDCT core can perform four different modes of
operation, which are listed in Table 6.4. The unused bits at the input must be filled
with zeros, while the unused output bits can be ignored.
DCT/    Row/Column   Number of    Used Range of     Number of     Used Range of
IDCT    Operation    Input Bit    Input Data Bus    Output Bit    Output Data Bus
DCT     Row          9            Datain[14:6]      15            Dataout[14:0]
DCT     Column       15           Datain[14:0]      12            Dataout[14:3]
IDCT    Row          12           Datain[14:3]      15            Dataout[14:0]
IDCT    Column       15           Datain[14:0]      9             Dataout[14:6]
Table 6.4 - Four different operation modes of the unified 1D DCT/IDCT core
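The bus usage in Table 6.4 amounts to aligning every word to the most significant bit of the 15-bit bus. The following sketch illustrates this alignment, assuming that words are handled as raw bit patterns (any two's complement encoding is left to the caller). The helper names to_bus and from_bus are hypothetical.

```python
BUS_WIDTH = 15


def to_bus(value, nbits):
    """Place an nbits-wide word on the upper bits of the 15-bit input bus.
    The unused lower bits are filled with zeros."""
    if not 0 <= value < (1 << nbits):
        raise ValueError("value does not fit in nbits")
    return value << (BUS_WIDTH - nbits)


def from_bus(word, nbits):
    """Take the upper nbits of the 15-bit output bus and ignore the rest."""
    return word >> (BUS_WIDTH - nbits)
```

For a 9-bit input in the DCT row operation the shift is 6 positions, which corresponds to Datain[14:6], and for a 12-bit word the shift is 3 positions, which corresponds to bits [14:3].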
Figure 6.15 - Unified structure of 1D DCT/IDCT core
The result and performance of this 1D DCT/IDCT core will be given in chapter 7.
6.3 Transpose Memory 
The purpose of the Transpose Memory is to store the result of the row operation, 
then re-order the data and send them out for the column operation. The name 
"transpose" means that the re-ordering is similar to the transpose of matrix, in which 
the data in the rows and columns are exchanged. In order to be used in the 2D 8x8 
DCT/IDCT operation, the transpose memory should be capable of storing 64 15-bit 
data, and re-ordering the data for the column operation. 
In order to fit the architecture of the 1D DCT/IDCT core, the transpose memory is
required to have two different modes of operation. This is because the input and
output sequences are different in the DCT and IDCT operations of the proposed 1D
DCT/IDCT core, as mentioned in sections 6.2.1.2 and 6.2.2. The
transpose memory should be able to rearrange the data in two different orders such
that the rearranged data sequence fits the corresponding operation.
To avoid changing both the write and read orders at the same time, the write order of
the transpose memory is set to be the same in both operations. The output data of
the 1st stage 1D DCT/IDCT core (row operation) are written into the
transpose memory in row-wise order, as shown in Figure 6.16.
         column 1   column 2   ...   column 8
row 1    in1        in2        ...   in8
row 2    in9        in10       ...   in16
...
row 8    in57       in58       ...   in64
Figure 6.16 - Write order of the transpose memory
As the write order is fixed, the read order of the data from the transpose memory must
be altered according to the DCT or IDCT operation. In both modes of operation, the
data are output from the transpose memory in column-wise order.
However, the orders within a column are not the same, as shown in Figure 6.17.
Row of the memory      1     2     3     4     5     6     7     8
(a) DCT read order     1st   3rd   5th   7th   8th   6th   4th   2nd
(b) IDCT read order    1st   2nd   3rd   4th   5th   6th   7th   8th
Figure 6.17 - Read order of (a) DCT operation, (b) IDCT operation
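The write and read orders can be expressed as address sequences. The sketch below is a minimal illustration, assuming that the 64 locations are addressed as 8 x row + column with rows and columns counted from zero, and that the columns are visited from the first to the last. Only the order of the rows within a column is taken from Figure 6.17. The function names are hypothetical.

```python
# Order in which the rows are visited inside one column (0-based rows).
DCT_ROW_ORDER = [0, 7, 1, 6, 2, 5, 3, 4]   # matches x0, x7, x1, x6, ...
IDCT_ROW_ORDER = [0, 1, 2, 3, 4, 5, 6, 7]  # natural order Y0..Y7


def write_addresses():
    """Row-wise write order, identical for DCT and IDCT."""
    return list(range(64))


def read_addresses(mode):
    """Column-wise read order for mode 'dct' or 'idct'."""
    if mode == "dct":
        rows = DCT_ROW_ORDER
    elif mode == "idct":
        rows = IDCT_ROW_ORDER
    else:
        raise ValueError("mode must be 'dct' or 'idct'")
    return [row * 8 + col for col in range(8) for row in rows]
```

In the DCT mode the first column is read from rows 1, 8, 2, 7, 3, 6, 4 and 5, which is the input order expected by the pre-processor of the column operation.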
6.3.1 Architecture 
Figure 6.18 shows the block diagram of the transpose memory. It consists of a
write/read address generator, two RAM blocks and two multiplexing networks.
Although only 64 data need to be stored, two 64x15-bit RAM blocks are used in
this design. If a single 64x15-bit RAM were used, the second row
operation could not be started immediately after the first row operation: as the data
of the first row operation are still stored inside the RAM for the column operation,
new data cannot be written into the RAM until the column operation is completed. As a
result, the row and column operations could not be carried out simultaneously and the
performance would be poor. If two RAM blocks are used, the result of the second row
operation can be written into RAM block 1 while the data stored in RAM block 0 are
used for the column operation. Since the computation times of the row and column
operations are the same, the roles of the RAM blocks can be exchanged after the current
row and column operations are completed. As a result, both operations can run
simultaneously and the whole 2D DCT/IDCT operation can run without stopping.
Figure 6.18 - Block diagram of transpose memory
The multiplexing networks are built from multiplexers and demultiplexers. The first
multiplexing network is responsible for scheduling the flow of data and addresses to
the two RAM blocks, and thus controls the write and read operations of the two RAM
blocks. The second multiplexing network is responsible for detecting and
collecting the output data from the RAM.
In order to further improve the performance of the Transpose Memory, the
architecture of the RAM block is further modified as shown in Figure 6.19.
Figure 6.19 - New structure of the transpose memory
In this modification, an interleaving technique is used. Each single 64x15-bit RAM
block is replaced by two 32x15-bit RAM blocks with a multiplexer and a
demultiplexer. In the write operation, the demultiplexer delivers the write address and
data to the two 32x15-bit RAM blocks alternately. As a result, the time allowed for the
write operation is doubled by the interleaving policy, and thus the performance
requirement of the 32x15-bit RAM block is relaxed. However, the trade-off of this
modification is a larger area.
6.3.2 Address Generator 
The address generator is composed of two units, the write address
generator and the read address generator. Besides generating the addresses for the
write and read operations in the RAM blocks, they also control the switching of the
multiplexing networks.
Since 64 data can be stored in a RAM block, a 6-bit RAM address is required as
2^6 = 64. The structure of the address generator is similar to that of the DCT
coefficients memory: it uses memory cells to pre-store the RAM addresses. Instead of
directly storing the 64 6-bit addresses, the address is split into two parts which are
pre-stored by two cyclic FIFO memories. An example of the write address generator is
shown in Figure 6.20.
address   MSB   LSB
0         000   000
1         000   001
2         000   010
...
62        111   110
63        111   111
Figure 6.20 - Write address generator
In this configuration, the number of data needed to be stored in the address generator 
can be reduced and thus the area can be reduced too. Read address generator also 
has the same architecture as the write address generator. However, the addresses 
stored in the read address generator are different for the DCT and IDCT operation as 
their input and output sequences are different, as described in section 6.2.2. The 
change of the pre-stored addresses uses the same method as the one used in DCT 
coefficients memory. 
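The saving obtained by splitting the address can be illustrated with a short model. The sketch below assumes that each of the two cyclic FIFO memories pre-stores eight 3-bit values and that every MSB value is repeated eight times, as a data replicator would do, while the LSB sequence cycles. The function name address_stream is hypothetical.

```python
from itertools import cycle, islice


def address_stream(msb_values, lsb_values):
    """Yield 6-bit addresses built from two pre-stored 3-bit sequences."""
    lsb_values = list(lsb_values)
    repeat = len(lsb_values)
    # Each MSB value is held while the LSB sequence runs through once.
    msb = cycle([m for m in msb_values for _ in range(repeat)])
    lsb = cycle(lsb_values)
    for hi, lo in zip(msb, lsb):
        yield (hi << 3) | lo


# Row-wise write order: both sequences simply count from 0 to 7.
write_addrs = list(islice(address_stream(range(8), range(8)), 64))

# Storage needed under these assumptions, in bits.
bits_direct = 64 * 6          # 64 full addresses of 6 bits
bits_split = 2 * 8 * 3        # two FIFO memories of eight 3-bit words
```

Under these assumptions the pre-stored data shrink from 384 bits to 48 bits, and a different read order only requires different contents in the two sequences.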
Figure 6.21 shows the operation of the transpose memory. Initially the row operation
is required to be carried out first, i.e. a write operation on the RAM is required. As a
result, the write address generator controls the multiplexer to send the write address
to RAM block 0. At the same time, it blocks the read address from entering RAM
block 0 and switches the input data to RAM block 0, which is shown in Figure
6.21(a).
Figure 6.21 - Operation of the transpose memory: (a) writing on block 0, (b) reading on
block 0 and writing on block 1, (c) writing on block 0 and reading on block 1
After the first row operation is completed, the second row (write) operation and the
first column (read) operation are started at the same time. By changing the
control signals in the multiplexing network, the read and write addresses are
transmitted to RAM block 0 and block 1 respectively, and the output data can be
collected at the output side. This is shown in Figure 6.21(b). As a result, the row
operation and the column operation can run concurrently. After these two operations
are completed, the control signals are altered such that the flow of the addresses
is changed and the roles of the RAM blocks are exchanged, as shown in Figure
6.21(c). The control signals are altered after every pair of row and column operations,
and thus the roles of the RAM blocks alternate repeatedly. This allows the 2D
DCT/IDCT to run continuously and simultaneously without any control by the user.
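The alternation of the two RAM blocks can be summarized by a small behavioural model. The sketch below is an illustration only: it treats one row operation as a single write of 64 words, reads the other block column-wise in natural row order, and swaps the roles after every operation. The class name TransposeMemoryModel and its method are hypothetical.

```python
class TransposeMemoryModel:
    """Two RAM blocks whose write and read roles alternate."""

    def __init__(self):
        self.blocks = [[0] * 64, [0] * 64]
        self.filled = [False, False]
        self.write_block = 0          # the first row operation writes block 0

    def process(self, row_result):
        """Write one 64-word row-operation result into the current write
        block while the other block is read column-wise. Returns the data
        that were read, or None while the other block is still empty."""
        read_block = 1 - self.write_block
        column_data = None
        if self.filled[read_block]:
            stored = self.blocks[read_block]
            column_data = [stored[r * 8 + c] for c in range(8) for r in range(8)]
        self.blocks[self.write_block] = list(row_result)
        self.filled[self.write_block] = True
        self.write_block = read_block   # exchange the roles of the blocks
        return column_data
```

The first call only fills block 0. Every later call returns the transposed data of the previous call while accepting new data, which mirrors the continuous operation described above.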
6.3.3 RAM Block 
Figure 6.22 - Block diagram of the RAM block
The RAM Block is basically an SRAM. Its structure follows the traditional design,
which consists of a column address decoder, a row address decoder and 2 SRAM
banks. Each SRAM bank is capable of storing 16x15-bit data. The structure of the
RAM block is shown in Figure 6.22.
There is one difficulty in using SRAM in asynchronous design, which is
completion detection. The read operation poses no problem, as the presence
of the data at the output indicates the completion of the read operation. However,
there is no corresponding signal that indicates the completion of the write operation.
As a result, additional circuits are required to detect the completion.
There are several methods to detect the completion in SRAM. One of the methods is
the use of a delay element [57]. In this approach, it is assumed that the write operation
must be finished within a certain period of time. Therefore a delay element can be
used to delay the write request signal, and the delayed write request signal can
act as the completion signal. Although this method is simple, it only provides
worst-case performance. Another method is the use of the current sensing technique
[57][57][59]. Since the current drawn decreases after the operation, the
completion can be detected by sensing the current drawn from the power supply by the
RAM block. However, this technique is difficult to implement and the result may
not be accurate. In our design, a monitor cell is used to detect the completion of the
write operation.
The structure of the monitor cell is shown in Figure 6.23(b). It is treated as an 
additional SRAM basic cell and placed inside the bit column of the SRAM. By 
comparing with the normal SRAM cell structure, which is shown in Figure 6.23(a), 
the monitor cell is actually composed of two SRAM basic cells with two additional 
pMOSs. The purpose of the additional pMOSs is forcing the two SRAM basic cells 
to store complementary values when the monitor cell is not yet enabled. 
Figure 6.23 - (a) SRAM basic cell, (b) monitor cell, (c) monitor cell in a bit column of SRAM
When there is a write operation, the monitor cell is enabled and performs the
write operation as well. Since the write operation causes both of the SRAM basic cells
inside the monitor cell to store the same value, the value in one of the two SRAM
basic cells changes, and this change is detected by the NOR gate. The resulting
signal is sent out to indicate the completion of the write operation. For
geometrical reasons, the monitor cell is placed at the top of the bit column of the
SRAM, as shown in Figure 6.23(c), in order to prevent the monitor cell from being
written before the normal SRAM basic cells. For the same reason, the
detection of completion is only required on bit 0 and bit 14 of the SRAM banks, as
their write operation is the slowest among all the bits.
The advantage of this method is that the monitor cell can be treated as a normal SRAM
cell which is simply placed in the bit column, so it does not cause a large modification
to the traditional architecture of the RAM block. In addition, it directly monitors the
write operation, and the completion signal is generated immediately after the write
operation is done. As a result, average-case performance can be achieved.
The result and performance of the transpose memory will be given in chapter 7.
Chapter 7 
Results and Discussions 
7.1 Overview 
In this chapter, the implementation results and performance of the designs, which
have been described in chapters 4 to 6, will be provided. Based on the results
obtained, discussion will be given on each design. The results and discussions will be
presented in the order of the Refresh Control Circuit, programmable DSP processor,
1D DCT/IDCT core and transpose memory.
7.2 Refresh Control Circuit 
7.2.1 Implementation Results and Performance
The Refresh Control Circuit is designed on the AMS [illegible] CMOS
technology. Table 7.1 shows the transistor count on different units of the Refresh
Control Circuit.
By the HSPICE simulation under [illegible] V supply voltage, it is shown that the
frequency of the ring oscillator is about [illegible], which is shown in Figure 7.1.
Also the function of the Refresh Control Circuit is verified in the simulation, which
is shown in Figure 7.2.
                                Transistor Count
Ring Oscillator                 [illegible]
Timer - Counter                 [illegible]
Timer - Latch                   [illegible]
Timer - Comparator              142
Recalibrate Circuit             124
Operation Monitoring Circuit    90
Voltage Sensor                  [illegible]
Total                           917
Table 7.1 - Transistor count on different units of Refresh Control Circuit
Figure 7.1 - Simulation result of ring oscillator
Figure 7.2 - Simulation result of the Refresh Control Circuit
Furthermore, the power consumption of the different parts is estimated from the
simulation result and is shown in Table 7.2. The average current drawn by the
whole Refresh Control Circuit is about 15 uA when the voltage sensor is not
activated, and about 3.6 mA when the voltage sensor is enabled. Since the voltage
sensor is not operating all the time, the percentage shown in Table 7.2 is not directly
proportional to the average current, but is proportional to the current drawn through
the whole process.
                                Average current   Percentage
Ring Oscillator                 12.5053 uA        10.06%
Timer - Counter                 0.4901 uA         0.39%
Timer - Latch                   0.6007 uA         0.48%
Timer - Comparator              0.2541 uA         0.14%
Recalibrate Circuit             0.1685 uA         0.20%
Operation Monitoring Circuit    0.2662 uA         0.21%
Voltage Sensor*                 3.5204 mA         88.5078%
*Average current of voltage sensor when it is enabled
Table 7.2 - Current drawn by each part of the Refresh Control Circuit
Figure 7.1 shows the simulation result of the ring oscillator. It shows that the ring
oscillator oscillates with a period of around 26 us. Figure 7.2 shows the
simulation result of the Refresh Control Circuit, which verifies its function.
The purpose of developing the Refresh Control Circuit is to reduce the performance
degradation due to the pull-up path. To quantify the performance improvement over
the traditional technique, three multipliers were built for comparison. All the
multipliers were built in the asynchronous pipeline architecture and are based on
the bit-parallel algorithm. The first multiplier uses normal domino logic without
any pull-up path. The second one uses domino logic with a pull-up path, as shown in
Figure 4.2(b). The last one uses the Refresh Control Circuit technique, and its
logic structure is shown in Figure 4.4(b).
The three multipliers were simulated by HSPICE under 5V supply voltage. Since they
are all asynchronous circuits, the simulations were done by sending inputs to the
multipliers continuously, and their intrinsic throughputs and latencies were then
measured.
                                  Throughput             Latency
Without pull-up path              2.8721 ns              18.7925 ns
With pull-up path (traditional)   3.0306 ns (+5.819%)    20.3573 ns (+8.326%)
Refresh Control Circuit           2.9090 ns (+1.105%)    19.0668 ns (+1.459%)
Table 7.3 - Performance of multipliers by different techniques
[Plot of output signals against time. Traces: without pull-up path, with pull-up path
(traditional technique), refresh control circuit technique.]
Figure 7.3 - Output signals of different multipliers
Table 7.3 shows the throughput and latency of the three multipliers and Figure 7.3 
shows the signal outputs from the different multipliers. 
7.2.2 Discussion 
Table 7.3 shows that the newly proposed technique performs better than the
traditional technique. It introduces less than 2% performance degradation compared
with the multiplier using ordinary domino logic, while the traditional technique
degrades the throughput by 5.8% and the latency by 8.3%. This indicates that the
goal of the Refresh Control Circuit is achieved: it provides a self-timed and
reliable method for solving the problem of charge leakage.
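As a quick check of the latency figures quoted above, the relative degradation can be
recomputed from the latency column of Table 7.3. The Python sketch below is
illustrative only: the function name overhead_percent is an assumption introduced
here and is not part of the original work.

```python
def overhead_percent(value_ns, baseline_ns):
    """Relative increase of value_ns over baseline_ns, in percent."""
    return (value_ns - baseline_ns) / baseline_ns * 100.0


# Latency column of Table 7.3, in nanoseconds.
BASELINE_LATENCY_NS = 18.7925      # domino logic without pull-up path
TRADITIONAL_LATENCY_NS = 20.3573   # domino logic with pull-up path
REFRESH_LATENCY_NS = 19.0668       # Refresh Control Circuit technique

traditional = overhead_percent(TRADITIONAL_LATENCY_NS, BASELINE_LATENCY_NS)
refresh = overhead_percent(REFRESH_LATENCY_NS, BASELINE_LATENCY_NS)

print(f"traditional pull-up path: +{traditional:.2f}% latency")
print(f"Refresh Control Circuit:  +{refresh:.2f}% latency")
```

Under these inputs the traditional technique adds roughly 8.3% latency and the Refresh
Control Circuit technique roughly 1.5%, in line with the figures discussed above.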
Regarding the power consumption of the circuit, the voltage sensor consumes nearly
90% of the total power. This is because the differential amplifier and sense
amplifiers always allow current to flow through. Future work can focus on minimizing
the power consumption of the voltage sensor, either by using other architectures
which have lower power consumption or by using another technique to detect the
charge leakage on the floating node of the dynamic logic.
Although the whole discussion of the refresh control system is based on dynamic or
domino logic, and the comparison is done on asynchronous circuits, the technique is
not restricted to this area. Other logic types or circuits which also encounter the
charge leakage problem can employ the Refresh Control Circuit technique. However,
the method of sensing may need to be modified to suit the application.
The disadvantage of this technique is that the Refresh Control Circuit must be
included in the design, and one or two more transistors are added to each logic
cell. As a result, the area of the whole system is increased and the compactness of
dynamic logic is somewhat degraded. The Refresh Control Circuit technique is
therefore not suitable for a compact system in which area is the main concern.
However, it can be employed in a large system in which the speed of the operations
is the main concern.
7.3 Programmable DSP Processor 
7.3.1 Implementation Results and Performance 
Chapter 5 has shown the dataflow of the 1D DCT program on the programmable
asynchronous DSP processor. The processor requires 50 steps to perform the whole
1D DCT operation. Since the processor is very large, it cannot be simulated by
HSPICE. Instead, all the basic cells were simulated in HSPICE under different
loading conditions, and the parameters were extracted to construct a Verilog HDL
model for each logic cell. The performance of the DCT implementation is estimated
by using the Verilog models to simulate a 9-bit version of the proposed processor.
Table 7.4 lists the bit length information of the 9-bit processor.
External
  Primary I/O of the programmable DSP processor    9 bits
  Instruction Input                                10 bits
Internal Functional Units
  Adder                                            9 bits
  Subtractor                                       9 bits
  Multiplier                                       9 bits (output is truncated to 9 bits)
  FIFO Memory                                      9 bits
Table 7.4 - Bit length information of the 9-bit programmable DSP processor
Figure 7.4 - Simulation result of the programmable DSP processor 
From the simulation result, the latency between the 8 pixel inputs and the 8 outputs is
around 400ns. Since the processor is pipelined, the next DCT operation can be
started while the current one is still being processed. As a result, the 8-point DCT
throughput can reach one operation every 352ns, as shown in Figure 7.4. Also, the
input frequency is around 130MHz (7ns). Figure 7.5 shows the timing diagram of the
DCT operation and Table 7.5 compares the 1D DCT core performance with other
VLSI implementations.
[Timing diagram along a time axis: each DCT operation consists of the phases "in",
"Processing" and "out"; the 2nd DCT operation starts an interval b after the 1st
DCT operation.]
a : latency
b : 8-point DCT throughput
in : 8 pixel inputs
out : 8 DCT outputs
Figure 7.5 - Timing diagram of the DCT operation
Design               Year  Tech.  Processing unit             Operating     Pixel throughput
                                                              frequency     (Mpixel/sec)
                                                              (MHz)
Cheng et al.         2000  0.6u   9 MUL, 21 ADD               [illegible]   [illegible]
Hsiao et al.         1999  0.6u   3 MUL, 5 ADD                40            40
Jang et al. [33]     1994  0.8u   4 MUL, 1 accumulator,       100           100
                                  1 pre- and post processor
This DSP processor         0.6u   1 MUL, 2 ADD/SUB            /             22.7*
Note : [1] and [3] are 2D DCT chips which use 1D DCT cores and transpose RAM to handle the 2D
transform by using the row-and-column decomposition method [8][9]. Normally, the critical path
exists in the 1D DCT core as it consists of many arithmetic and control units. Therefore, the speeds
of the 1D DCT cores are assumed to be the same as their 2D transform.
*Average result (352ns / 8 = 44 ns/pixel, 1 / 44 ns = 22.7 Mpixel/sec)
Table 7.5 - Performance comparison of different 1D DCT implementations
The layout of the 9-bit version of the processor is shown in Figure 7.6. It has 153k 
transistors and is designed by using standard cells based on AMS 3M IP 0.6u CMOS 
technology. The core dimension is 4.7mm x 4.2mm. 
1. Instruction Memory 2. FIFO Memory 1 3. Adder 4. FIFO Memory 2 
5. Multiplier 6. Subtractor 7. Switching Network 
Figure 7.6 - Layout of the 9-bit programmable DSP processor 
7.3.2 Discussion 
Compared with the dedicated designs shown in Table 7.5, the worse throughput and
latency obtained by the general purpose processor with only 3 arithmetic units are
understandable, as this is the price of flexibility. As all the internal arithmetic
units are occupied by the current DCT operation, the next 8 pixels can only be sent
to the processor near the end of the current operation. This explains the slow
8-point DCT throughput. Moreover, there are two main reasons for the large latency.
First, the limited number of arithmetic units causes more data queuing. Second, the
results from the arithmetic units must be fed back to the switching network for the
next operation, while in other VLSI implementations the arithmetic units are
directly connected to the arithmetic units of the following stage. Unfortunately,
this cannot be avoided in a general purpose processor. A better latency and
throughput of the DCT operation can be achieved if two or more processors are
cascaded serially, or if more arithmetic units are connected to the switching
network. Both changes allow more operations to proceed concurrently and reduce the
data queuing problem. In addition, the switch cell can be further improved in order
to reduce its latency.
On the other hand, the high input rate shows that this processor is capable of
operating at over 100MHz, which is competitive with other VLSI designs. This
frequency also indicates the high throughput rate of the switching network.
Furthermore, an asynchronous-to-synchronous I/O conversion interface is included in
this design for the purposes of testing and measuring. If this interface is removed,
the performance of the processor can be further improved.
Due to the size of the FIFO memories, the 2D DCT cannot be implemented in this
processor. However, if the size of the FIFO memories is increased or an
additional memory unit is added, the 2D DCT can be implemented. Based on this
assumption, the performance of the 2D DCT implementation was estimated. By
using the row-and-column decomposition method, the 2D DCT can be decomposed
into sixteen 1D DCT operations. Therefore, the computation time for the 2D DCT
operation on this programmable DSP processor can be obtained by Equation 7.1.
\[ \text{2D DCT computation time} = \text{8-point DCT throughput} \times 16
   = 352\,\text{ns} \times 16 = 5632\,\text{ns} \]                      - Equation 7.1
Therefore, the average pixel throughput is
\[ \text{Average pixel throughput} = \frac{\text{number of pixels}}{\text{2D DCT computation time}}
   = \frac{64}{5632\,\text{ns}} \approx 11.3\ \text{Mpixel/sec} \]      - Equation 7.2
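The two estimates above can be reproduced with a few lines of Python. This is only a
sketch of the arithmetic: the function names are assumptions introduced for
illustration, and the model simply counts one 8-point DCT per row and per column of
an 8 x 8 block, as in the row-and-column decomposition.

```python
def dct_2d_time_ns(period_1d_ns, rows=8, cols=8):
    """Time for a 2D DCT built from one 1D DCT per row and per column."""
    return period_1d_ns * (rows + cols)


def pixel_throughput_mpixel_per_s(pixels, time_ns):
    """Average pixel rate in Mpixel/sec for `pixels` processed in `time_ns`."""
    return pixels / time_ns * 1e3


PERIOD_1D_NS = 352  # 8-point DCT throughput period from the simulation

time_2d = dct_2d_time_ns(PERIOD_1D_NS)
print(time_2d)                                               # 2D DCT time in ns
print(pixel_throughput_mpixel_per_s(8, PERIOD_1D_NS))        # 1D pixel rate
print(pixel_throughput_mpixel_per_s(64, time_2d))            # 2D pixel rate
```

With a 352ns period this gives 5632ns per 8 x 8 block, about 22.7 Mpixel/sec for the
1D case and about 11.4 Mpixel/sec (quoted as 11.3 in the text) for the 2D case.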
Design               Year  Tech.  Processing unit     Clock         Pixel throughput
                                                      (MHz)         (Mpixel/sec)
TI C6201 [38]        /     /      2 MUL, 6 ALU        200           56.63
uPD77016 [61]        1993  0.8u   1 MAC, 1 ALU        [illegible]   2.6
V830 [62]            1995  /      RISC with 1 MAC     [illegible]   [illegible]
Chang et al. [38]    2000  0.6u   1 ALU               33            1.7
This DSP processor         0.6u   1 MUL, 2 ADD/SUB    /             11.3
Table 7.6 - Performance comparison of 2D DCT implementation on different programmable
processors
Under the same condition of limited computing resources, the results shown in
Table 7.6 indicate that this programmable asynchronous DSP processor performs well
when compared with other general purpose processor designs. This result shows that
the dataflow architecture and the use of a switching network favour the asynchronous
processor design, and that a competitive performance can be achieved. Future
development of this processor can focus on the 2D DCT operation or other complex
DSP algorithms. Although the estimated 2D DCT performance is good, it still cannot
meet the requirement of processing MPEG-2 or HDTV signals in real time.
Obviously, this 9-bit processor introduces a large error and cannot achieve a
reasonable accuracy in the DCT operation. A better accuracy can be obtained easily
by increasing the word length of the processor. Increasing the word length of the
switching network and FIFO memory will not cause a performance degradation, as
they just pass the data without processing. For the arithmetic units, which are the
adder, subtracter and multiplier, an increase in the word length causes extra
pipeline stages, as they are implemented by the carry-look-ahead and bit-parallel
algorithms and constructed in the asynchronous pipelined architecture. Thus the
throughput will not be affected but the latency will increase. As a result, an
increase in word length will not directly affect the throughput of the processor,
but the trade-off is the latency and chip size.
7.4 1D DCT/IDCT Core
7.4.1 Simulation Results 
As in the case of the programmable DSP processor, the whole 1D DCT/IDCT core could
not be simulated by HSPICE because of its size. As a result, the whole core was
simulated by using the Verilog models, and the correctness of the operation of the
core was verified.
In order to have a more accurate performance analysis of the processing units in the
1D DCT/IDCT core, all the processing units were simulated separately by HSPICE
under 5V supply voltage. Due to the limited processing power of the workstation
and HSPICE, only the parasitics within the standard cells are considered, while the
parasitic information of the routing is not included in the simulation. The simulated
performance of the different processing units is listed in Table 7.7.
                             Tested frequency   Required frequency
                             (MHz)              (MHz)
15-bit adder                 250                98
15-bit subtracter            250                98
16-bit data replicator       400*               196
DCT coefficient memory       196.07**           196
Multiplier                   220***             196
20-bit adder                 250                96
21-bit adder                 250                96
22-bit adder                 250                96
22-bit subtracter            250                96
Truncation unit              250                [illegible]
* The output rate of the data replicator
** Self-generated frequency
*** Transistor-level simulation only
Table 7.7 - Performance of different processing units on the 1D DCT/IDCT core
Figure 7.7 - Simulation result of the DCT coefficients memory 
All the processing units, except the multiplier, data replicator and the DCT
coefficients memory, were simulated at a data input rate of 250MHz. This is
because, according to the architecture of the 1D DCT/IDCT core shown in Figure 6.9
in chapter 6 and discussed in section 6.2.1.2, the throughput of the 1D DCT/IDCT
core depends greatly on the speed of the multiplication, as it is the bottleneck of
the whole operation. From the simulation results, the DCT coefficients memory
generates the DCT coefficients at a maximum frequency of only 196MHz, as shown in
Figure 7.7. As a result, the maximum rate of multiplication can only be 196MHz,
and thus all other units are only required to work at 196MHz or less in actual
processing. To ensure that the processing units can meet the requirement, a higher
frequency of 250MHz was chosen to verify their operation. The multiplier is found
to work at 220MHz. However, this is only the result of a transistor-level
simulation in which the parasitic information is not taken into account. This is
because the circuit is very large: the simulator HSPICE and the workstation were
unable to handle the simulation when the parasitic information was included. The
rest of the simulation results show that the other units work properly at or below
250MHz without error. Based on these simulation results, the whole 1D DCT/IDCT
core should be able to reach a throughput of 98 Mpixel/sec, which is half of the
multiplication rate.
1. Input buffer and DCT/IDCT switch2 2. 15bit adder 3. Data Replicator 1
4 & 15. 15x16bit Multiplier1 5, 6 & 7. DCT Coefficients Memory1
8. 1-to-2 DEMUX1 9. 15bit subtractor
10, 11 & 13. DCT Coefficients Memory1 12. Data Replicator 2 14 & 16. 15x16bit Multiplier2
17. 1-to-2 DEMUX4 18. 20bit adder2 19. 1-to-2 DEMUX3
20. DCT/IDCT switch4 & 5 21 & 28. 2-to-to switch 22. 22bit adder1
23. 22bit adder 24. 22bit subtractor
25. 1-to-2 DEMUX5 & 2-to-1 MUX 26. Truncation unit and output buffer
27. 21bit adder
Figure 7.8 - Layout of the 1D DCT/IDCT core processor
The layout of the unified 1D DCT/IDCT processor is shown in Figure 7.8. It has
334k transistors and is designed using standard cells based on AMS 3M IP 0.6u
CMOS technology. The core dimension is 6.8mm x 7.5mm.
7.4.2 Measurement Results
The test equipment for the DCT/IDCT core includes the IMS XL-60 IC Tester, an HP
Infinium Oscilloscope and an HP E3631A Triple Output DC Power Supply. The
functionality of the chip was tested with the IMS tester; all the functions (row and
column operation, DCT and IDCT) were verified and the chip works properly.
Figure 7.9 shows part of the captured input and output waveforms of the DCT row
operation. Table 7.8 shows part of the input, measured and calculated data sets.
Figure 7.9 - (a) input waveform of the DCT/IDCT core, (b) measured output waveform of the 
DCT/IDCT core in DCT row operation 
Input     Input data                  Measured result        Calculated result
data set  (re-ordered)
Set 1     255,0,0,0,0,0,0,0           90.125, 125.0625,      90.1561, 125.0501,
                                      117.8125, 106,         117.7946, 106.0124,
                                      90.125, 70.8125,       90.1561, 70.8352,
                                      48.8125, 24.875        48.7921, 24.8740
Set 2     0,0,0,0,0,0,0,0             0, 0, 0, 0, 0, 0, 0, 0 0, 0, 0, 0, 0, 0, 0, 0
Set 3     [row illegible]
Set 4     180,0,0,0,0,0,0,0           63.75, 88.4375,        63.6396, 88.2707,
                                      83.3125, 74.9375,      83.1492, 74.8323,
                                      63.75, 50.0625,        63.6396, 50.0013,
                                      24.5, 17.5625          34.4415, 17.5581
Set 5     255,255,255,255,            721.25, 0,             721.2489, 0,
          255,255,255,255             0, 0,                  0, 0,
                                      0, 0,                  0, 0,
                                      [illegible]            0, 0
Table 7.8 - Input data, measured result and calculated result of the DCT row operation
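The calculated column of Table 7.8 is consistent with the orthonormal 8-point DCT-II.
The Python sketch below assumes that definition (the exact formula used for the
calculated results is not restated in this section) and uses a hypothetical helper
named dct_1d; it is a floating-point reference, not a model of the fixed-point
hardware.

```python
import math


def dct_1d(samples):
    """Orthonormal DCT-II of a sequence (8 samples for the row operation)."""
    n = len(samples)
    result = []
    for k in range(n):
        # The DC term uses sqrt(1/N); all other terms use sqrt(2/N).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x * math.cos((2 * i + 1) * k * math.pi / (2 * n))
            for i, x in enumerate(samples)
        )
        result.append(scale * acc)
    return result


# Input data sets 1 and 5 from Table 7.8.
for data in ([255, 0, 0, 0, 0, 0, 0, 0], [255] * 8):
    print([round(value, 4) for value in dct_1d(data)])
```

Comparing the printed values with the measured column gives a feel for the rounding
error introduced by the limited word length of the hardware.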
During the measurement, the actual Input Acknowledgement was not measured, as
its duration is short, which makes it difficult to measure. Instead of the Input
Acknowledgement, a signal Done was measured. The signal Done is created by a
toggle flip-flop whose input is the Input Acknowledgement. As a result, the signal
Done toggles on every Input Acknowledgement, which makes the measurement
easier. Figure 7.10(a) shows the creation of the signal Done from the Input
Acknowledgement, while Figure 7.10(b) shows the timing relationship of the Input
Acknowledgement and the signal Done.
[Diagram: (a) the Input ACK signal drives a toggle flip-flop whose output Q is the
signal Done; (b) waveforms of the Input, ack and Done signals.]
Figure 7.10 - (a) construction of the Done signal, (b) timing diagram of the Input Request,
Acknowledgement and Done signal
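The behaviour of the Done signal can be sketched in software as follows. This is a
simplified model, not the circuit itself: it assumes the toggle flip-flop reacts to
the rising edge of a sampled acknowledgement signal, and the function name
done_from_ack is introduced here for illustration only.

```python
def done_from_ack(ack_samples, initial=0):
    """Toggle Done on every rising edge of the sampled acknowledgement."""
    done = initial
    previous = 0
    trace = []
    for ack in ack_samples:
        if ack and not previous:   # rising edge: one acknowledgement pulse
            done ^= 1
        previous = ack
        trace.append(done)
    return trace


# Three short acknowledgement pulses produce three Done transitions.
print(done_from_ack([0, 1, 0, 1, 0, 1, 0]))
```

Each acknowledgement pulse becomes one edge of Done, so Done stays at a level for a
whole handshake cycle and is much easier to capture than the short pulse.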
The measurement shows that the maximum frequency (throughput) of the DCT/IDCT is
around 76MHz. Figure 7.11 shows the Output Request of the DCT/IDCT chip measured
by the HP Infinium Oscilloscope. The average current of the DCT/IDCT core
chip is about 1.43A under 5V power supply, so the average power consumption of
the chip is about 7.15W. Table 7.9 compares the performance of the 2D
DCT/IDCT processor with other VLSI implementations.
Figure 7.11 - (a) measured waveforms of the Output Request (lower) and Acknowledgement 
(upper) signal, (b) zoomed waveforms which shows the average throughput is 76MHz 
Design                 Year  Tech.  Processing unit               Clock         Pixel throughput
                                                                  (MHz)         (Mpixel/sec)
Cheng et al. [35]      2000  0.6u   9 MUL, 21 ADD                 100           100*
Kim et al. [64]        1999  /      [illegible]                   80            26.6
Johnson et al. [63]**  1998  1.2u   /                             /             [illegible]
Jang et al. [33]       1994  0.8u   8 MUL (DA), 2 accumulators,   [illegible]   [illegible]
                                    2 pre- and post processors
Uramoto et al. [32]    1992  0.8u   8 MUL and accumulators,       [illegible]   [illegible]
                                    2 pre- and post processors
This processor**             0.6u   14 MUL, 14 ADD/SUB            /             76
*Transistor-level simulation result
**Asynchronous design
Table 7.9 - Performance comparison of different 2D DCT implementations
7.4.3 Discussion 
The measurement result and Table 7.9 indicate that the performance of this 2D
DCT/IDCT processor is competitive with other designs. Since only a few dedicated
DCT/IDCT processors have previously been developed in an asynchronous way, one
recent asynchronous design, developed by Johnson [63], is chosen for the comparison,
as it has a similar architecture and the best performance among them. Compared with
his asynchronous design, our design runs about 36.9% faster. However, it should be
noted that a more advanced technology has been used in our design, so a certain
portion of the superior performance may come from the benefits of the advanced
technology. Although a direct comparison cannot be carried out, this comparison can
be treated as a reference showing that this processor, the latest asynchronous
DCT/IDCT processor, has improved performance and is superior to previous
asynchronous designs.
In comparison with other similar synchronous DCT/IDCT designs, this
DCT/IDCT processor performs better than [64], but worse than [35], [33]
and [32]. Although the performance is not as good as that of these synchronous
designs, this processor has fewer operation units while still achieving a similar
performance. This is because the operation units in an asynchronous system are not
required to work at the same frequency. If a certain operation unit performs better
than other units, it can be scheduled to perform more operations by sending more
input data to it. This cannot be done in a synchronous design, as all units must
work at the same global clock frequency. This result demonstrates the benefit of
using the asynchronous architecture in system design: it can fully utilize every
operation unit in the system, and thus the number of operation units can be reduced.
However, the measurement result deviates from the simulation result by about 22%.
This deviation is probably caused by two factors, which are the temperature and the
multiplier. In the HSPICE simulation, the temperature was set to room temperature.
In the actual measurement, however, it was found that the 1D DCT/IDCT core chip was
very hot during operation, and thus the temperature was much higher than room
temperature. The increase in temperature is due to the high power consumption of
this DCT/IDCT processor chip. As the package of the current chip is not designed
for good heat dissipation, the chip could not be cooled down effectively even when a
heat sink was added on top of it. A large amount of heat is generated by the chip
but cannot be dissipated, and performance degradation is the result. By setting the
temperature to 90 degrees and re-simulating the DCT coefficients memory, the new
result shows that the maximum operating frequency is lowered to around 168MHz (so
that the multiplication rate is 84MHz). In this case, the difference between the
simulation and measurement results is more reasonable. The rest of the difference
may be due to the extra delay caused by the parasitics of the routing, which are not
included in the HSPICE simulation, and to the difference between the parameters of
the HSPICE models and the actual fabrication process.
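The size of the gap between simulation and measurement can be checked numerically.
The sketch below assumes, following the simulation results of Section 7.4.1, that the
core throughput is half of the DCT coefficients memory rate, and that the deviation
is taken relative to the simulated value; both function names are hypothetical.

```python
def core_throughput(memory_rate_mhz):
    """Estimated core throughput, taken as half of the coefficient memory rate."""
    return memory_rate_mhz / 2.0


def deviation_percent(simulated, measured):
    """Shortfall of the measured value relative to the simulated value."""
    return (simulated - measured) / simulated * 100.0


MEASURED = 76.0  # measured throughput of the chip

for label, memory_rate in (("room temperature", 196.0), ("90 degrees", 168.0)):
    simulated = core_throughput(memory_rate)
    gap = deviation_percent(simulated, MEASURED)
    print(f"{label}: simulated {simulated:.0f}, deviation {gap:.1f}%")
```

Under these assumptions the deviation is about 22% against the room temperature
simulation and drops to under 10% against the simulation at 90 degrees.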
Another possible reason for the large performance deviation is the multiplier.
Since the parasitic information was not taken into account in the HSPICE
simulation, the actual performance of the multiplier may be much lower than
220MHz once the parasitics are considered. However, this cannot be verified by
either the simulation or the measurement.
In order to improve the performance of the current design, modification of the
DCT coefficients memory and the multiplier must be considered. The limiting
factor on the speed of the DCT coefficients memory is the output feedback path. The
DCT coefficients memory is required not only to transmit the output to the
multipliers, but also to send the output to its own input simultaneously. This
split path introduces a large handshaking overhead and thus limits its performance.
Reducing the handshaking overhead can be investigated in future, so that the
performance of the DCT coefficients memory, as well as of the whole DCT processor,
can be improved. For the multiplier, the performance can be improved by building a
PFA as a single standard cell. The current implementation uses several basic logic
standard cells to build the PFA. This causes large parasitics in the auto-placement
and auto-routing process. Since many identical PFAs are used in the multiplier core,
building the PFA as a single standard cell can minimize the parasitics due to the
routing, and silicon area can also be saved since the PFA can be built more
compactly in this way.
To further analyze the practicality of the asynchronous design, a comparison can
be made between asynchronous and synchronous implementations of this 1D
DCT/IDCT core. In the synchronous implementation, the bottleneck should no
longer be the DCT coefficients memory but the multiplier. This is because the DCT
coefficients memory can be implemented easily in a synchronous design by using
ROM or counters, both of which can run very fast as not many computations are
required. In the multiplier, the critical path is inside the carry generation, which
is given by Equation 7.3 (same as Equation 5.4).
Cout = A* B • Cin + {Cin + A. B) - Equation 7.3 
The implementation of Equation 7.3 in domino logic is shown in Figure 7.12(a).
For the synchronous implementation, Equation 7.3 is modified to Equation 7.4, as
inverting static CMOS logic provides a faster response than non-inverting logic.
Cout = A*B*Cm + P. {Cin + 
= Cin •尸 • (Cin + A*B) - Equation 7.4 
According to Equation 7.4, the synchronous implementation of the carry generation
is shown in Figure 7.12(b). From the information provided in the 0.6u standard cell
databook [65], the delay of each logic cell under 25°C and 5V supply voltage
is extracted for the performance estimation.
[Schematics: (a) domino-logic carry generation with inputs A, B, Cin and clock CLK;
(b) static-logic carry generation, including an AN21 cell, annotated with cell
delays of 0.30ns, 0.30ns, 0.29ns and 2.48ns.]
Figure 7.12 - (a) carry generation in domino logic, (b) carry generation in static logic
Operating Frequency = 1 / (longest delay) - Equation 7.5
= 1 / (0.3 + 0.3 + 2.48)ns
= 1 / 3.08ns
= 325MHz
The result of Equation 7.5 shows that the synchronous multiplier could run at about
325MHz. However, this estimate does not account for worst-case temperature, supply
voltage and clock skew. In practice, because a synchronous design must assume
worst-case performance, the global clock frequency needs a margin of 50% or more
relative to the typical-condition figure. The synchronous multiplier should
therefore run at around 216MHz (for a 50% margin), which is similar to the
asynchronous performance in typical conditions. Overall, with the synchronous
multiplier limited to 216MHz, a synchronous implementation of the whole 1D DCT/IDCT
core may run faster than the asynchronous implementation described in this thesis,
but the difference may not be great. This indicates that asynchronous design is
practical and that its performance can be similar to that of synchronous design.
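The arithmetic behind the 325MHz and 216MHz figures can be reproduced with a short
sketch. It assumes that the 50% margin is applied by dividing the typical-condition
frequency by 1.5, an interpretation that reproduces the 216MHz quoted above. The
function names are illustrative and do not come from the thesis.

```c
#include <assert.h>
#include <stddef.h>

/* Maximum clock frequency in MHz for a critical path given as cell delays in ns. */
static double max_frequency_mhz(const double *delays_ns, size_t n)
{
    double total_ns = 0.0;
    for (size_t i = 0; i < n; i++)
        total_ns += delays_ns[i];
    assert(total_ns > 0.0);
    return 1000.0 / total_ns;   /* 1/ns is GHz, so scale to MHz */
}

/* Derates a typical-condition frequency by a fractional margin (0.5 means 50%).
   Assumption: the margin divides the frequency by (1 + margin). */
static double derated_frequency_mhz(double typical_mhz, double margin)
{
    assert(margin >= 0.0);
    return typical_mhz / (1.0 + margin);
}
```

With the delays 0.3ns, 0.3ns and 2.48ns from Figure 7.12(b), max_frequency_mhz gives
roughly 325MHz, and derated_frequency_mhz with a margin of 0.5 gives a value between
216MHz and 217MHz.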
Although the performance of this processor is good, it comes at the cost of area and
power. To maximize performance, the DCVSL structure is used inside all the
processing units, which nearly doubles the size needed to implement a given logic
function compared with other designs. Also, all operation units within the 1D
DCT/IDCT core are deeply pipelined, especially the bit-parallel multiplier. The deep
pipeline decomposes every complex function into several stages of simple logic,
trading additional area for speed. Furthermore, the handshake cells in the
asynchronous circuit add an area overhead that a synchronous circuit does not have.
As for power, the measurement shows that the average power consumption of the
dedicated DCT/IDCT processor is about 7.15W under 5V supply voltage, which is
extremely high compared with other designs. To verify this figure, each functional
unit was simulated separately in HSPICE under 5V supply voltage. The simulation
results are listed in Table 7.10.
                          Used in            Operating  Current  Current drawn  Current drawn
                          DCT/IDCT/          frequency  drawn    in DCT         in IDCT
Unit                      Both       Number  (MHz)      (mA)     operation (mA) operation (mA)
15-bit adder              DCT        1       76/2       9.32     9.32           0
15-bit subtracter         DCT        1       76/2       9.85     9.85           0
16-bit data replicator    Both       2       76x2       28.33    56.66          56.66
DCT coefficient memory    Both       2       76x2       125.83   251.66         251.66
Multiplier                Both       2       76x2       478.00   956.00         956.00
20-bit adder              Both       2       76         32.45    64.90          64.90
21-bit adder              Both       1       76         33.82    33.82          33.82
22-bit adder              IDCT       1       76/2       18.04    0              18.04
22-bit subtracter         IDCT       1       76/2       18.05    0              18.05
Truncation unit           Both       1       76         15.00    15.00          15.00
Total                                                            1397.21        1414.13
Table 7.10 - Simulation results of power consumption of different operation units in the 1D
DCT/IDCT core
From the simulation results, the total power consumption of the 1D DCT/IDCT core
is about 6.986W (1397.21mA x 5V) in the DCT operation and 7.071W (1414.13mA x 5V)
in the IDCT operation. These results show that the simulated power consumption is
consistent with the measured value of about 7.15W, which confirms that the
measurement is correct.
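The totals quoted from Table 7.10 can be cross-checked by summing the per-unit currents
and multiplying by the supply voltage. The sketch below transcribes the DCT and IDCT
current columns of Table 7.10. The structure and function names are hypothetical and
serve only this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* One row of Table 7.10: simulated current drawn in each operating mode, in mA. */
struct unit_current {
    const char *name;
    double dct_ma;
    double idct_ma;
};

static const struct unit_current table_7_10[] = {
    {"15-bit adder",            9.32,   0.00},
    {"15-bit subtracter",       9.85,   0.00},
    {"16-bit data replicator", 56.66,  56.66},
    {"DCT coefficient memory", 251.66, 251.66},
    {"Multiplier",             956.00, 956.00},
    {"20-bit adder",            64.90,  64.90},
    {"21-bit adder",            33.82,  33.82},
    {"22-bit adder",             0.00,  18.04},
    {"22-bit subtracter",        0.00,  18.05},
    {"Truncation unit",         15.00,  15.00},
};

/* Sums one column of the table; idct_mode selects the IDCT column when nonzero. */
static double total_current_ma(int idct_mode)
{
    double total = 0.0;
    size_t rows = sizeof table_7_10 / sizeof table_7_10[0];
    for (size_t i = 0; i < rows; i++)
        total += idct_mode ? table_7_10[i].idct_ma : table_7_10[i].dct_ma;
    return total;
}

/* Power in watts from a current in mA and a supply voltage in volts. */
static double power_w(double current_ma, double supply_v)
{
    assert(supply_v > 0.0);
    return current_ma / 1000.0 * supply_v;
}
```

Evaluating power_w(total_current_ma(0), 5.0) and power_w(total_current_ma(1), 5.0)
yields the DCT and IDCT figures of roughly 6.99W and 7.07W.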
The main reason for the large power consumption is the use of the DCVSL structure.
Since both the true and complement values are present in the DCVSL structure,
either the true or the complement logic block must discharge in each Evaluation
phase. Every logic block therefore consumes power in each Evaluation cycle, which
gives a constant and high discharge current. In a single-rail design, which uses
the true logic block only, there is no discharge current if the pull-down path does
not conduct during the Evaluation phase (the output stays at logic zero in domino
logic). Since the pull-down path conducts only occasionally, a single-rail design
consumes less power on average than the DCVSL design when performing the same logic
function. To verify this, a single-rail 15-bit adder was constructed for
comparison. Both circuits were simulated in HSPICE under 5V supply voltage. Three
input patterns (random numbers, all zeros and all ones) were fed into the adder
inputs at 76MHz to investigate the current drawn under different conditions. The
simulation results are shown in Table 7.11.
                     Average current    Average current     Average current
                     drawn at random    drawn at all zeros  drawn at all ones
                     input (mA)         input (mA)          input (mA)
DCVSL adder          20.69              20.34               20.59
Single-rail adder    19.58              4.33                18.90
Table 7.11 - Comparison of power consumption on DCVSL and single-rail adder
The results in Table 7.11 indicate that the power consumption of the DCVSL adder is
nearly the same for all three input patterns, while that of the single-rail adder
depends on the input pattern. Also, the single-rail adder consumes no more than half
the power of the DCVSL adder for any pattern. This shows that the DCVSL structure
consumes more than double the power of the usual structure, which gives our design
its extremely high power consumption.
This result also shows the disadvantage of domino logic (or dynamic logic), namely
its high dynamic power consumption. Although the single-rail adder consumes less
power than the DCVSL design, it still consumes more than static logic. In a design
that uses static logic with latches or flip-flops, the power consumption should be
relatively low and nearly the same for constant input patterns (all zeros and all
ones). This is because, for a constant input pattern, the output of a logic gate
does not change and thus no switching occurs during operation, so little or no
dynamic power is consumed. In domino logic, however, the output is precharged to
logic zero in every Precharge phase. As a result, if a logic block has an output of
logic one in the Evaluation phase, it consumes power in the following Precharge
phase. This explains why domino logic still switches during operation even with a
constant input pattern, and why its dynamic power consumption is relatively high
compared with static logic and latches or flip-flops in a conventional architecture.
Although the power consumption of this dedicated DCT/IDCT processor is high, when
there is no data input it consumes less power than other designs, since no
transition occurs in an asynchronous design when no operation is requested.
Future work can focus on reducing the power and size of the processor. As
mentioned before, the DCVSL structure is the main reason for the high power
consumption of this design. To reduce the power consumption without affecting the
current performance, single-rail design or a conventional asynchronous structure
should be used in the non-critical parts, such as the adders and subtracters. This
modification reduces not only the power consumption but also the area required to
implement the design. As for the area, since around 30% of it is consumed by the
multipliers, as shown in Figure 7.8, minimizing the size of the multiplier can
greatly reduce the size of the whole design. This can be done by using Booth coding
[66] or common sub-expression elimination [67] to reduce the complexity of the
multiplier. Grouping the PFA into a single standard cell, as mentioned before, can
also reduce the overall area. Moreover, since the adders and subtracters are not the
limiting factor of the processor's performance, area-saving or power-saving
algorithms, for example the ripple adder, can be used to implement them instead of
the area-consuming fast BLC algorithm.
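The switching-activity argument made in this section for static logic, single-rail
domino logic and DCVSL can be illustrated with a simple transition count. The sketch
below is a deliberately simplified model and not a power simulation: it counts
output-node transitions for a sequence of per-cycle logic values, and it assumes that
each domino evaluation to logic one costs one rise plus one fall, and that exactly one
DCVSL rail fires per cycle. All names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Static gate: the output node switches only when the logic value changes. */
static unsigned static_transitions(const int *out, size_t n)
{
    unsigned count = 0;
    assert(out != NULL || n == 0);
    for (size_t i = 1; i < n; i++)
        if (out[i] != out[i - 1])
            count++;
    return count;
}

/* Single-rail domino: the output returns to 0 in every Precharge phase, so each
   cycle that evaluates to 1 costs a rise and a fall. */
static unsigned domino_transitions(const int *out, size_t n)
{
    unsigned count = 0;
    assert(out != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (out[i])
            count += 2;
    return count;
}

/* DCVSL: either the true or the complement rail fires in every cycle,
   regardless of the data. */
static unsigned dcvsl_transitions(const int *out, size_t n)
{
    (void)out;   /* the count does not depend on the data values */
    return 2u * (unsigned)n;
}
```

For a constant all-ones sequence the static count is zero while the domino count is
two per cycle, and the DCVSL count is two per cycle for every pattern, which mirrors
the pattern independence of the DCVSL adder noted for Table 7.11.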
7.5 Transpose Memory 
7.5.1 Simulated Results 
Due to the size of the whole design, the units of the transpose memory were
simulated separately in HSPICE under 5V supply voltage to obtain the estimated
throughput.
[Figure 7.13: HSPICE waveform plots not reproduced. Panels: Read Operation and Write
Operation, each plotted against time.]
Figure 7.13 - Simulation result of the write and read operation
Unit                       Tested frequency (MHz)
Write address generator    276*
Read address generator     276*
Multiplexing network       276
32x16bit SRAM block        182.22/230.31
* Self-generated frequency
Table 7.12 - Performance of different units in the transpose memory
Since the interleaving technique is applied to the write operation, the performance
of the transpose memory is now limited by the read operation of the SRAM block. As a
result, the minimum operating speed of the whole transpose memory should be
230MHz.
The layout of the transpose memory is shown in Figure 7.14. It has 111k transistors.
The SRAM blocks are full-custom designs and the other parts are built from
standard cells based on the AMS 3M1P 0.6µm CMOS technology. The core dimension is
3.9mm x 4.2mm.
[Figure 7.14: layout image not reproduced. Labelled regions:]
1 & 2. Column address generator    3. Input buffer
4 & 5. Read address generator      6, 8 and 9. Multiplexing network
7. Output buffer                   10, 11, 12, 13. 32x16bit SRAM block
Figure 7.14 - Layout of the transpose memory
7.5.2 Measurement Results 
The test equipment for the transpose memory comprises the IMS XL-60 IC Tester, an
HP Infinium Oscilloscope and an HP E3631A Triple Output DC Power Supply, the same
as that used for the DCT/IDCT chip. The functionality of the transpose memory chip
was tested with the IMS XL-60 IC Tester; all functions (DCT and IDCT) were verified
and the chip works properly. Figure 7.15 shows part of the captured input and
output waveforms of the TRAM.
[Figure 7.15: IC tester timing-diagram screenshots not reproduced. Traces shown include
idct, the input data bits (d00 to d14), the output data bits (do00 to do14), outrq and
ckin.]
Figure 7.15 - (a) input waveform of the transpose memory, (b) measured output waveform of
transpose memory in DCT operation mode
Although the simulation showed that the transpose memory could operate at 230MHz,
this could not be verified by measurement, because the IMS XL-60 IC Tester can
generate test vectors at a maximum rate of only 100MHz. The transpose memory could
therefore only be verified as working properly at a 100MHz input rate. However, the
transpose memory is intended to work with the DCT/IDCT core chip, which operates at
only 76MHz, so this result at least proves that the transpose memory can work with
the 1D DCT/IDCT chips. Figure 7.16 shows the Output Request of the transpose
memory at 100MHz. The average current drawn by the transpose memory at 100MHz
under a 5V supply is around 350mA, which corresponds to an average power
consumption of about 1.75W.
[Figure 7.16: oscilloscope captures not reproduced.]
Figure 7.16 - (a) measured waveforms of the Output Request (upper) and Acknowledgement
(lower) signals, (b) zoomed waveforms showing that the average throughput is 100MHz
7.5.3 Discussion 
From the results in Table 7.12, and with the help of the interleaving technique,
the transpose memory can operate continuously at 230MHz. The measurement indicates
that the chip has no problem operating at 100MHz. The 76MHz read/write rate
required of the transpose memory is therefore met, and the whole 2D DCT/IDCT
processor can provide a throughput of 76 Mpixel/sec.
Considering the RAM block alone, its performance is restricted by the write
operation. The write operation is slow because detecting its completion is slow.
Although monitor cells are used to provide fast detection of write completion,
collecting the done signals from all the monitor cells adds overhead.
[Figure 7.17: block diagram not reproduced. Annotations: 2 done signals are generated
from each 15bit word column; Column Decoder & Data Buffer.]
Figure 7.17 - Done signals generated from the 32x15bit RAM block
Figure 7.17 shows the done signals generated from the monitor cells in a 32x15bit
RAM block. A total of 8 done signals must be handled, which introduces the
overhead. If the detection method can be improved, better write performance can be
achieved.
Although the goal of the transpose memory is achieved in this implementation, it has
limitations and the design can still be improved. In the current architecture,
an interleaving technique is used to achieve a higher operating speed in the
transpose memory. However, the actual speed of the 2D DCT/IDCT is limited by the
1D DCT/IDCT core to 76MHz, so the transpose memory does not need to run at such a
high frequency. The interleaving technique is therefore unnecessary and can be
removed from the transpose memory. In this case, the maximum operating speed of the
transpose memory drops to about 182MHz, which is still sufficient for the 2D
DCT/IDCT operation, and the area of the additional multiplexers and demultiplexers
is saved.
To realize the whole 2D DCT/IDCT processor in the asynchronous pipelined
architecture, the address generators and the multiplexing networks are all built
according to the methodology introduced in Chapter 2. The simulation shows that
they can operate at a very high frequency. However, they are not required to work
at such a high frequency because the operation is limited to 76MHz, so the benefit
of the asynchronous pipelined architecture cannot be gained in this implementation.
Moreover, the address generators and the multiplexing networks occupy over 50% of
the whole design, which makes the transpose memory not cost effective. Other
approaches should therefore be applied to the design of the transpose memory.
First, for the address generator and the multiplexing network, the conventional
micropipeline structure can be used. As latches are used in the micropipeline, a
counter can easily be implemented to replace the area-consuming address generator.
Also, as the multiplexing network is not critical to the performance of the
transpose memory, the DCVSL structure is not necessary there. As a result, the area
of the transpose memory can be greatly reduced. Removing the DCVSL structure also
helps to reduce the power consumption, as discussed in Section 7.4.3, the
discussion of the 1D DCT/IDCT core.
Another possible approach is to replace the RAM block with an array of shift
registers or storage elements. Since the sequences of the DCT and IDCT are fixed,
the flow of data in the transpose memory can be pre-determined. Hardwired
connections with some multiplexers on the array of shift registers may therefore be
able to perform the same function as the RAM blocks. This also eliminates the
address generator and the multiplexing network, reducing the area.
Future improvement or development of the transpose memory can be based on the above
suggestions to achieve a more cost-effective implementation.
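The fixed data flow that makes a counter or a hardwired shift-register array feasible
can be seen in a software model of the transposition. The sketch below assumes the
usual row-column arrangement, in which an 8x8 block is written row by row and read
back column by column. It models only the address sequence, not the interleaving, the
SRAM blocks or the handshaking, and all names are hypothetical.

```c
#include <assert.h>

enum { BLOCK = 8 };   /* 8x8 block, as used by the DCT/IDCT */

/* Address of the k-th word when the block is written row by row. */
static int write_address(int k)
{
    assert(k >= 0 && k < BLOCK * BLOCK);
    return k;
}

/* Address of the k-th word when the same block is read column by column. */
static int read_address(int k)
{
    assert(k >= 0 && k < BLOCK * BLOCK);
    return (k % BLOCK) * BLOCK + (k / BLOCK);
}

/* Writes a block in row order and reads it back in column order. */
static void transpose_block(const int in[BLOCK * BLOCK], int out[BLOCK * BLOCK])
{
    int mem[BLOCK * BLOCK];
    for (int k = 0; k < BLOCK * BLOCK; k++)
        mem[write_address(k)] = in[k];
    for (int k = 0; k < BLOCK * BLOCK; k++)
        out[k] = mem[read_address(k)];
}
```

Both address sequences depend only on the word index k, so in this model they could be
produced by a simple counter, or replaced altogether by fixed wiring between storage
elements.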
Chapter 8 
Conclusions 
In this thesis, several asynchronous methodologies have been discussed and a new
asynchronous pipelined architecture has been presented. The new architecture uses a
novel, simple but fast handshake cell which adopts a more relaxed handshaking
protocol than the traditional architecture. Furthermore, it employs the DCVSL
structure in logic design, so the complex latch can be removed and completion can
be detected directly by the handshake cell. Circuits developed with the new
asynchronous pipelined architecture are simpler and have higher performance than
those built with the traditional methodologies. The performance of the new
architecture has been demonstrated by the programmable DSP processor and the
dedicated DCT processor.
Since dynamic logic is used in the new asynchronous pipelined architecture, a
new technique called the Refresh Control circuit is introduced in this thesis to
solve the charge leakage problem. The new technique is a self-timed, self-calibrated
and self-operating circuit. It monitors the charge leakage in the dynamic logic and
controls the refresh process effectively in order to reduce the pull-up current.
The results show that this technique causes less performance degradation than the
traditional technique, so it is suitable for large systems that require high
performance. Moreover, the Refresh Control circuit uses a general-purpose
monitoring scheme and can thus be applied to other circuits that encounter the
charge leakage problem.
Based on the new asynchronous pipelined architecture, a programmable DSP
processor and a dedicated 2D DCT/IDCT processor have been constructed.
Although the DCT implemented on the programmable processor does not perform as well
as dedicated designs, it still achieves a reasonable 22.7 Mpixel/sec with a small
number of operation units. The impressive result in the estimation of the 2D DCT
operation demonstrates the advantages of combining asynchronous pipelining with
dataflow architecture in circuit design, and of using a switching network and
parallelism in the processor architecture. This result encourages further
development of the processor and further use of the asynchronous pipelined
architecture.
Finally, the development of the dedicated 2D DCT/IDCT processor is presented in the
thesis. This processor is fully pipelined and its throughput is 76 Mpixel/sec, which
is competitive with other high-performance synchronous designs when the number of
operation units used is considered. The dedicated 2D DCT/IDCT processor also fully
satisfies the IEEE specification and is capable of real-time processing of MPEG-2
signals, or even the more computationally demanding HDTV signals. The result
indicates that an asynchronous design with the new pipelined architecture can
perform as well as synchronous designs, and that the proposed DCT/IDCT architecture
is suitable for asynchronous implementation. Furthermore, the benefit of the
asynchronous approach in system design has been demonstrated: operation units can be
fully utilized and the number of operation units in the whole system can be
reduced.
The results of both the programmable DSP processor and the dedicated DCT/IDCT
processor demonstrate the high performance of the new asynchronous pipelined
architecture. In other words, the new architecture favours the asynchronous
approach in system implementation, especially for DSP applications.
However, there are area and power penalties in these designs. The use of the DCVSL
structure causes not only a large increase in silicon area but also high dynamic
power consumption. Also, the inappropriate architectures used in some operation
units cause unnecessary area overhead and make the whole design inefficient in
terms of area. Designers should consider different approaches for implementing the
different operation units of a system in order to minimize the penalties, and
further development of the new asynchronous




Operations of switches in DCT implementation of programmable DSP 
processor 
Note: 
Switches 1 to 12 are located in the switching network; their instructions are
listed in Table 5.1. Switches 13 and 14 represent the input demultiplexing networks
of FIFO memories 1 and 2 respectively, while Switches 15 and 16 represent the
four-to-one multiplexers of FIFO memories 1 and 2 respectively. For Switches 13 to
16, the register name indicates the FIFO set to which the demultiplexing network or
four-to-one multiplexer is connected.
Switches ~|l II I 
# In 15 I 16 I 1 I 2 I 3 I 4 I 5 I 6 I 7 I 8 I 9 j 10 I 11 I 12 I 13 I 14 Coef Outpll Operation I  
f ut 
1 XO I 1 I I 0 I I 7 I input X0-> Ail, Sil 
2 X7 3 2 7 input X 7 - > A i 2 , Si2 
3 XI 1 0 7 input XI -> A i l , Sil 
4 X6 3 2 7 “ — input X 6 - > A i 2 , Si2 
5 X3 1 g 7 inputX3->Ail，Sil 
6 X 4 3 2 7 input X 4 - > A i 2 , Si2 
7 X 2 1 0 7 ‘ input X 2 - > A i l , Sil 
8 X5 3 2 7 “ ^ ^ input X 5 - > A i 2 , Si2 ~ 
9 1 6 0 3 1 ~ 6 3 3 A ^ T I A o ( K O ) - > A i l , S i l ,  
So(K4) ->regA2, Muli 
10 1 2 0 1 6 3 A o ( K l ) - > A i l , S i l，  
So(K5) -> regB2 
11 3 2 2 1 3 ~~6 ^ Ao(K2) -> Ai2, Si2, • 
So(K6) -> regC2 
12 3 2 2 1 3 6 ^ Ao(K3) -> Ai2, Si2, 
So(K7) -> regC2 
_13 ^ 1 1 1 RegA2(K4) -> Ai l 
14 B2 3 7 3 1 B2 RegB2(L5) -> RegB2，Ai2 
g 3 7 B l CO So(L2) -> RegBl , Muli 
1 6 6 3 1 3 3 A2 CO So(L3) -> RegA2, Muli 
17 C2 3 1 1 0 1 0 D2 RegC2(K6) -> A i l ,  
Ao(LO) -> RegD2 
18 C2 3 7 1 0 0 1 0 1 D2 RegAl(K7) -> A i l , Ai2,  
Ao(Ll ) -> RegD2 
1 9 B2 3 3 1 RegB2(K5) -> Ai2 
2 0 D 2 1 1 7 R e g D 2 ( L 0 ) - > A i l , S i l 
D2 3 3 7 RegD2(Ll) -> Ai2, Si2 
2 2 1 2 6 ^ CO Ao(L5) -> RegAl , Muli 
2 3 7 2 0 2 2 B2 CO ^ ^ Ao(L6) -> RegB2, Muli 
2 4 1 2 2 ^ ^ C2 ^ ^ Ao(L7) -> Muli 
2 5 1 5 2 ^ ^ Ao(MO) -> Muli 
2 6 g 3 0 C I Mo(cl*Z4)->RegCl 
27 A2 0 3 1 3 2 3 RegA2(L3) -> Si2, •“ 
Mo(cO*LO) -> Sil 
28 B l 0 2 0 3 0 ~ 0 RegBl (L2) -> A i l ,  
Mo(cO*L3) -> Ai2 
29 B2 0 3 1 3 2 3 RegB2(L6) -> Si2,  
Mo(cO*L5) -> Sil 
30 A l 0 2 0 3 " O 0 R e g A l ( L 5 ) - > A i l ,  
Mo(cO*L6) -> Ai2 
3 1 0 3 3 “ C4 S o ( M l ) - > M u l i 
3 2 g 3 3 C5 So(M2) -> Muli 
3 3 0 3 3 ^ ^ H c T Ao(M3) -> Muli 
34 CI 0 2 0 3 6 ~ 6 RegCl(M4)-> Ai l , Si l ,  
Mo(M7) -> Ai2, Si2 
3 5 2 3 1 一 So(M5) -> Ai2, Si2 
3 6 3 2 ~ 6 ~ — Ao(M6) -> Ai2, Si2 
3 7 0 1 7 . ^ ^ So(M4-M7)-> Ai l , Sil 
3 8 1 g 6 — Ao(M4+M7) -> Ai l , Sil 
39 I 0 I I 3 I I 3 I I r ^ l So[(M4-M7)-M5] -> Mi ！ 
1 2 I 2 I C7 II A o _ - M 7 ) + M 5 1 - > Mi 
_ i l I 2 2 C8 Ao[(M4+M7)+K6] -> Mi — 
0 3 ” ~ 3 ~ C9 So[(M4+M7)-K6] -> Mi 
_ 4 3 2 1 “ " ~ 0 ~ ^ ^ Z I I H Z DO Mo(c3*M0) - > Out 
_ 4 4 2 1 “ “ ~ 0 ~ D1 M o ( c 4 * M l ) - > O u t 
2 1 0— “ D2 Mo(c5*M2) - > Out 
2 1 0 D3 - Mo(c5*M3) - > Out 
47 2 1 0 D4~ Mo(c6*[(M4-M7)-M5])  
- > O u t  
48 2 i 0 D5 Mo(c7»[(M4-M7)+M5])“  
- > O u t  
49 2 ~ 0 D ^ Mo(c8*[(M4+M7)+M6])  
- > O u t  
50 2 0 D7 Mo(c9*[(M4+M7)-M6]) 
II II I il I I I I I I I I I I I II I II II II - > Out 
Appendix 
C Program for evaluating the error in DCT/IDCT core 
Generation of data set 
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
double pi;
double onedctresult[8];
double twodctresult[64];
union hexcontent {
    long half[2];
    double full;
};
// Define Function 2-D DCT
// Computes the direct 8x8 2-D DCT of twoinput (row-major) into twodctresult.
void twodct(long twoinput[])
{
    int i, j, u, v;
    double input8x8[8][8];
    double temp;
    for (i = 0; i <= 7; i++)
    {   for (j = 0; j <= 7; j++)
        {
            input8x8[i][j] = twoinput[8*i+j];
        }
    }
    // Direct 2D
    for (u = 0; u <= 7; u++)
    {   for (v = 0; v <= 7; v++)
        {
            temp = 0;
            for (i = 0; i <= 7; i++)
            {   for (j = 0; j <= 7; j++)
                {
                    temp += input8x8[i][j]*cos((2*i+1)*u*pi/16)*cos((2*j+1)*v*pi/16);
                }
            }
            // Scale by 1/4, and by 1/sqrt(2) for each zero frequency index
            temp = 0.25*temp*((u==0)/pow(2, 0.5)+(u!=0))*((v==0)/pow(2, 0.5)+(v!=0));
            // Clamp to the 12-bit signed range
            if (temp > 2047)
                temp = 2047;
            if (temp < -2048)
                temp = -2048;
            // Store in row-major order (statement reconstructed from the IDCT routine)
            twodctresult[u*8+v] = temp;
        }
    }
}
// Define Function Inverse 2-D DCT
// Computes the direct 8x8 2-D IDCT of twoinput (row-major) into twodctresult.
void twoidct(long twoinput[])
{
    int i, j, u, v;
    double input8x8[8][8];
    double temp;
    for (i = 0; i <= 7; i++)
    {   for (j = 0; j <= 7; j++)
        {
            // Copy the input block (statement reconstructed to match twodct)
            input8x8[i][j] = twoinput[8*i+j];
        }
    }
    // Direct 2D
    for (i = 0; i <= 7; i++)
    {   for (j = 0; j <= 7; j++)
        {
            temp = 0;
            for (u = 0; u <= 7; u++)
            {   for (v = 0; v <= 7; v++)
                {
                    temp += input8x8[u][v]*cos((2*i+1)*u*pi/16)*cos((2*j+1)*v*pi/16)
                            *((u==0)/pow(2, 0.5)+(u!=0))*((v==0)/pow(2, 0.5)+(v!=0));
                }
            }
            temp = 0.25*temp;
            // Clamp to the 9-bit signed range
            if (temp > 255)
                temp = 255;
            if (temp < -256)
                temp = -256;
            //twodctresult[i*8+j] = 0.25*temp;
            // Store in row-major order (statement reconstructed from the line above)
            twodctresult[i*8+j] = temp;
        }
    }
}
// Returns a pseudo-random integer in the range [-L, H]
long randnum(long L, long H)
{
    /* Only the low 31 bits are used; unsigned avoids signed overflow. */
    static unsigned long randx = 1;
    static double z = (double)0x7fffffff;
    long i, j;
    double X;   /* double is 64 bits */
    randx = (randx*1103515245) + 12345;
    i = randx & 0x7ffffffe;
    X = ((double)i)/z;      /* X is in [0, 1) */
    X = X*(L+H+1);
    j = X;                  /* truncate to an integer in [0, L+H] */
    return (j-L);
}
// Rounds to the nearest integer, with halves rounded away from zero
long roundup(double testnumber)
{
    double reminder;
    long result;
    result = testnumber;              // truncate toward zero
    reminder = testnumber - result;   // fractional part, same sign as testnumber
    if (reminder >= 0.5)
        result += 1;
    else if (reminder <= -0.5)
        result -= 1;
    return result;
}
main()
{
    long L, H;
    int long result[64];
    int kk, ll, m, n;
    long idctcoeff[64];
    char filename[] = "data00.dat";
    //char odct_file[] = "odct00.dat";
    char idct_file[] = "idct00.dat";
    char fdct_file[] = "fdct00.dat";
    //double temp;
    union hexcontent upperbound;
    FILE *result_id, *dct_id2, *dct_id3; //*dct_id,
    pi = atan(1)*4;
    upperbound.half[0] = 0x00000001;
    upperbound.half[1] = 0x40aff000;
    printf("Please enter the Lower Bound....\n");
    scanf("%d", &L);
    printf("Please enter the Upper Bound....\n");
    scanf("%d", &H);
    for (m=0; m<=9; m++)
    {   filename[4] = filename[4] + m;
        //odct_file[4] = odct_file[4] + m;
        idct_file[4] = idct_file[4] + m;
        fdct_file[4] = fdct_file[4] + m;
        for (n=0; n<=9; n++)
        {
            filename[5] = filename[5] + n;
            //odct_file[5] = odct_file[5] + n;
            idct_file[5] = idct_file[5] + n;
            fdct_file[5] = fdct_file[5] + n;
            result_id = fopen(filename, "w");
            //dct_id = fopen(odct_file, "w");
            dct_id2 = fopen(idct_file, "w");
            dct_id3 = fopen(fdct_file, "w");
            for (ll=0; ll<=99; ll++)
            {
                for (kk=0; kk<=63; kk++)
                {   result[kk] = randnum(L, H);
                    fprintf(result_id, "%ld\n", result[kk]);
                }
                twodct(result);
                for (kk=0; kk<=63; kk++)
                {
                    //fprintf(dct_id, "%20.15lf\n", twodctresult[kk]);
                    idctcoeff[kk] = roundup(twodctresult[kk]);
                    fprintf(dct_id2, "%d\n", idctcoeff[kk]);
                }
                twoidct(idctcoeff);
                for (kk=0; kk<=63; kk++)
                {
                    //fprintf(dct_id3, "%20.15lf\n", twodctresult[kk]);
                    idctcoeff[kk] = roundup(twodctresult[kk]);
                    fprintf(dct_id3, "%d\n", idctcoeff[kk]);
                }
            }
            fclose(result_id);
            //fclose(dct_id);
            fclose(dct_id2);
            fclose(dct_id3);
            filename[5] = '0';
            //odct_file[5] = '0';
            idct_file[5] = '0';
            fdct_file[5] = '0';
        }
        filename[4] = '0';
        //odct_file[4] = '0';
        idct_file[4] = '0';
        fdct_file[4] = '0';
    }




Testing of DCT/IDCT architecture 
//
// This program is used to generate an Inverse DCT result
// from Forward DCT coefficients.
//
// The input files "idctXX.dat" contain 12-bit DCT coefficients.
// The output files "nfdctXX.dat" contain 9-bit reconstructed






long mul_product[4][8];
long dctresult_i[8];
double dctresult_f[8];
void mul_coeff(int bit_length)
{
    double value[7];











        trunvalue[i] = value[i]*pow(2,28);
        //Round up
        roundup = (trunvalue[i] >> (28 - bit_length)) & 0x00000001;
        //create the coeff. at given bit length
        trunvalue[i] = (trunvalue[i] >> (28 - bit_length + 1)) + roundup;
    }
}
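The surviving lines of mul_coeff scale each coefficient by 2^28 and then shift it down again, adding a rounding bit taken from the first discarded position. A minimal sketch of that step for a single non-negative coefficient follows. The helper name quantize_coeff is hypothetical; the constant 28 and the shift amounts are taken from the listing. The result carries bit_length - 1 fractional bits, and bit_length is assumed to lie between 1 and 28.

```c
/* Quantize one non-negative coefficient (0 <= value < 1) as in mul_coeff. */
long quantize_coeff(double value, int bit_length)
{
    /* Fixed-point image with 28 fractional bits. */
    long fixed = (long)(value * (double)(1L << 28));

    /* Rounding bit: the most significant bit that will be dropped. */
    long roundup = (fixed >> (28 - bit_length)) & 0x00000001;

    /* Keep bit_length - 1 fractional bits and add the rounding bit. */
    return (fixed >> (28 - bit_length + 1)) + roundup;
}
```

With bit_length = 15, a coefficient of 0.5 becomes 8192, which is 0.5 scaled by 2^14.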
void mul_matrix(long input1[4], long input2[4], int coeff_length, int trun_length)
{
    long matrix1[4][4], matrix2[4][4], roundup;
    int i, j;
    //Form the coeff. Matrix
    matrix1[0][0] = trunvalue[0];
    matrix1[0][1] = trunvalue[1];
    matrix1[0][2] = trunvalue[0];
    matrix1[0][3] = trunvalue[2];
    matrix1[1][0] = trunvalue[0];
    matrix1[1][1] = trunvalue[2];
    matrix1[1][2] = -1*trunvalue[0];  matrix1[1][3] = -1*trunvalue[1];
    matrix1[2][0] = trunvalue[0];     matrix1[2][1] = -1*trunvalue[2];
    matrix1[2][2] = -1*trunvalue[0];
    matrix1[2][3] = trunvalue[1];
    matrix1[3][0] = trunvalue[0];     matrix1[3][1] = -1*trunvalue[1];
    matrix1[3][2] = trunvalue[0];     matrix1[3][3] = -1*trunvalue[2];
    matrix2[0][0] = trunvalue[3];




    matrix2[1][0] = trunvalue[4];     matrix2[1][1] = -1*trunvalue[6];
    matrix2[1][2] = -1*trunvalue[3];  matrix2[1][3] = -1*trunvalue[5];




    matrix2[3][0] = trunvalue[6];     matrix2[3][1] = -1*trunvalue[5];
    matrix2[3][2] = trunvalue[4];     matrix2[3][3] = -1*trunvalue[3];
    //Matrix Multiplication




mul_product[j] [ i ] = m a t r i x l [ j ] [幻 • 






    for (i=0;i<=3;i++) {
        for (j=0;j<=7;j++)
            if (trun_length == 0)
                mul_product[i][j] = (mul_product[i][j] >> trun_length);
            else
            {
                roundup = (mul_product[i][j] >> (trun_length-1)) & 0x00000001;




void onedct(long input[8], int input_length, int coeff_length, int mul_trun_length,
            int final_length, int second)
{
    long half1[4], half2[4];
    int i;
    long stage31[4], stage32[4], stage33[4], stage34[4], stage41[4], stage42[4];









    mul_matrix(half1, half2, coeff_length, mul_trun_length);
    for (i=0; i<=3; i++)
    {
        stage31[i] = mul_product[i][0] + mul_product[i][1];
        stage32[i] = mul_product[i][2] + mul_product[i][3];
        stage33[i] = mul_product[i][4] + mul_product[i][5];
        stage34[i] = mul_product[i][6] + mul_product[i][7];
        stage41[i] = stage31[i] + stage32[i];
        stage42[i] = stage33[i] + stage34[i];
    }
    //Third Stage




        result[i] = stage41[i]+stage42[i];
        result[7-i] = stage41[i]-stage42[i];
    }
    //Round up and Truncation
    reduce_length = input_length+coeff_length-mul_trun_length+2-final_length-second-3;
    for (i=0;i<=7;i++)
    {
        roundup = (result[i] >> (reduce_length-1)) & 0x00000001;
        temp = (result[i] >> reduce_length) + roundup;
        if (second == 1)
        {   if (((temp & 0x80000000) == 0x00000000)
                && ((temp & 0x00000100) == 0x00000100))
                temp = 255;
            if (((temp & 0x80000000) == 0x80000000)
                && ((temp & 0x00000100) == 0x00000000))
                temp = -256;
        }
        dctresult_i[i] = temp;
d c t r e s u l t : f [ i ] = 
dctresult_i [i] /pow (2, final一length-input一length-~^ + l+second)； 
    }
}
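The final step of onedct drops the low-order bits with rounding and, in the second pass, limits the result to the 9-bit range -256 to 255. A minimal sketch of these two operations is given below. The helper names round_shift and saturate9 are hypothetical. The listing detects overflow by testing the sign bit and bit 8; the sketch substitutes plain comparisons, which is a simpler technique and not the author's bit test. Right-shifting a negative value is implementation-defined in C, and the sketch assumes an arithmetic shift, as the listing does.

```c
/* Drop `shift` low-order bits (shift >= 1), rounding on the first dropped bit. */
long round_shift(long v, int shift)
{
    long roundup = (v >> (shift - 1)) & 0x00000001;
    return (v >> shift) + roundup;
}

/* Clamp a value to the 9-bit two's complement range. */
long saturate9(long v)
{
    if (v > 255)
        return 255;
    if (v < -256)
        return -256;
    return v;
}
```

For example, round_shift(7, 2) gives 2, since 7/4 = 1.75 rounds up, and saturate9(300) gives 255.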
main()
{
    int input[8][8];
    int inputfile[64];
    long temp[8];
    long result[8][8];
    double result_f[8][8];
    int input_length, coeff_length, mul_trun_length, final_length;
    int i, j, kk, ll, m, n, test;
    char idct_file[]="idct00.dat";
    char fdct_file[]="nfdct00.dat";
    FILE *idct_id, *fdct_id;
    coeff_length = 15;
    mul_coeff(coeff_length);
    for (m=0; m<=9; m++)
    {   idct_file[4] = idct_file[4] + m;
        fdct_file[5] = fdct_file[5] + m;
        for (n=0; n<=9; n++)
        {   idct_file[5] = idct_file[5] + n;
            fdct_file[6] = fdct_file[6] + n;
            idct_id = fopen(idct_file, "r");
            fdct_id = fopen(fdct_file, "w");
            //Read 100 times, 64 elements each time
            for (kk=0;kk<=99;kk++)
            {   for (ll=0;ll<=63;ll++)
                    fscanf(idct_id, "%d\n", &inputfile[ll]);




                    for (j=0;j<=7;j++)
                        input[i][j] = inputfile[i*8+j];
                }
                input_length = 12;
                mul_trun_length = 7;
                final_length = 15;
                for (i=0;i<=7;i++)
                {
                    for (j=0; j<=7; j++)
                        temp[j] =




                        input_length, coeff_length, mul_trun_length, final_length, 0);
                    for (j=0;j<=7;j++)
                    {
                        result[j][i] = dctresult_i[j];
                        result_f[j][i] = dctresult_f[j];
                        if (abs(dctresult_i[j]) >= pow(2, final_length-1)-1)
                        {
                            printf("Excess Limit, x=%d, %08x\n", dctresult_i[j], dctresult_i[j]);




                input_length = 15;
                mul_trun_length = 9;




                        temp[j] = result[i][j];
                    onedct(temp, input_length, coeff_length, mul_trun_length, final_length, 1);
                    for (j=0;j<=7;j++)
                    {













            idct_file[5] = '0';
            fdct_file[6] = '0';
        }
        idct_file[4] = '0';





Pin Assignments of the Programmable DSP Processor Chip 
Pin Number  Pin Name         IN/OUT  Description
1           request          IN      Input data request signal
2           VDD              IN
3           GND              IN
4           start            IN      Input data start signal
5           in_reset         IN      Input data buffer reset signal
6           empty            OUT     Input data buffer empty signal
7           done             OUT     Input data acknowledgement signal
8           VDD              IN
9           GND              IN
10          instr_done       IN      Instruction acknowledgement signal
11          instr_rq         OUT     Instruction request signal
12          instr<0>         IN      Instruction bit0
13          instr<1>         IN      Instruction bit1
14          VDD              IN
15          GND              IN
16          instr<2>         IN      Instruction bit2
17          instr<3>         IN      Instruction bit3
18          instr<4>         IN      Instruction bit4
19          VDD              IN
20          GND              IN
21          instr<5>         IN      Instruction bit5
22          instr<6>         IN      Instruction bit6
23          instr<7>         IN      Instruction bit7
24          instr<8>         IN      Instruction bit8
25          instr<9>         IN      Instruction bit9
26          VDD              IN
27          GND              IN
28          cmplt_clr_instr  IN      Clear the instruction input buffer
29          mode             IN      Switch of the cyclic FIFO in instruction memory. 0 = cyclic, 1 = receive instruction from user
30          get_output       IN      Output mode 1 : Get the output handshaking signal
31          VDD              IN
32          GND              IN
33          open_cmplp       IN      Output mode 1 : Get the output handshaking signal
34          out_full         OUT     Output mode 0 : Output buffer full signal
35          out_ready        OUT     Output mode 0 : Output request signal
36          out_buf_in<0>    OUT     Output mode 0 : Output data bit0
37          out_buf_in<1>    OUT     Output mode 0 : Output data bit1
38          VDD              IN
39          GND              IN
40          out_buf_in<2>    OUT     Output mode 0 : Output data bit2
41          out_buf_in<3>    OUT     Output mode 0 : Output data bit3
42          out_buf_in<4>    OUT     Output mode 0 : Output data bit4
43          out_buf_in<5>    OUT     Output mode 0 : Output data bit5
44          out_buf_in<6>    OUT     Output mode 0 : Output data bit6
45          out_buf_in<7>    OUT     Output mode 0 : Output data bit7
46          VDD              IN
47          GND              IN
48          out_buf_in<8>    OUT     Output mode 0 : Output data bit8
49          cmplt_out        OUT     Output mode 1 : Output request signal
50          cmplt_out_d      OUT     Output mode 1 : cmplt_out + 4
51          out_sel          IN      Output mode selection : mode 0 for data verification, mode 1 for speed measurement
52          VDD              IN
53          GND              IN
54          out_aki          IN      Output mode 0 : Output acknowledgement signal
55          in1<0>           IN      Input data bit0
56          in1<1>           IN      Input data bit1
57          in1<2>           IN      Input data bit2
58          in1<3>           IN      Input data bit3
59          VDD              IN
60          GND              IN
61          reset            IN      Global reset signal
62          in1<4>           IN      Input data bit4
63          in1<5>           IN      Input data bit5
64          in1<6>           IN      Input data bit6
65          in1<7>           IN      Input data bit7
66          VDD              IN
67          GND              IN
68          in<8>            IN      Input data bit8
Pin Assignments of the 1D DCT/IDCT Core Chip
[Figure: bottom view of the package pin grid, columns 1 to 13 and rows A to N]
Pin Number  Pin Name     In/Out  Description
1           VDD          IN
2           GND          IN
3           In<5>        IN      Input data bit5
4           In<4>        IN      Input data bit4
5           In<3>        IN      Input data bit3
6           In<2>        IN      Input data bit2
7           VDD          IN
8           GND          IN
9           In<1>        IN      Input data bit1
10          In<0>        IN      Input data bit0
11          VDD          IN
12          GND          IN
13          test6        OUT     Testing signal from DCT coefficients memory 2
14          test8        OUT     Testing signal from data replicator 2
15          test10       OUT     Testing signal from multiplier 2
16          VDD          IN
17          GND          IN
18          test14       OUT     Testing signal from 22bit subtractor
19          test15       OUT     Testing signal from 22bit adder
20          VDD          IN
21          GND          IN
22          output_rq    OUT     Output mode 0 : Output request signal
23          Ckin         IN      Output mode 0 : Output acknowledgement signal
24          VDD          IN
25          GND          IN
26          Out<14>      OUT     Output mode 0 : Output data bit14
27          Out<13>      OUT     Output mode 0 : Output data bit13
28          Out<12>      OUT     Output mode 0 : Output data bit12
29          VDD          IN
30          GND          IN
31          Out<11>      OUT     Output mode 0 : Output data bit11
32          Out<10>      OUT     Output mode 0 : Output data bit10
33          Out<9>       OUT     Output mode 0 : Output data bit9
34          VDD          IN
35          GND          IN
36          Out<8>       OUT     Output mode 0 : Output data bit8
37          Out<7>       OUT     Output mode 0 : Output data bit7
38          Out<6>       OUT     Output mode 0 : Output data bit6
39          VDD          IN
40          GND          IN
41          Out<5>       OUT     Output mode 0 : Output data bit5
42          Out<4>       OUT     Output mode 0 : Output data bit4
43          Out<3>       OUT     Output mode 0 : Output data bit3
44          Out<2>       OUT     Output mode 0 : Output data bit2
45          VDD          IN
46          GND          IN
47          Out<1>       OUT     Output mode 0 : Output data bit1
48          Out<0>       OUT     Output mode 0 : Output data bit0
49          open_cmplp   IN      Output mode 1 : Get the output handshaking signal
50          out_sel      IN      Output mode selection : mode 0 for data verification, mode 1 for speed measurement
51          VDD          IN
52          GND          IN
53          test17       OUT     Testing signal from truncation unit
54          test16       OUT     Testing signal from DCT/IDCT switch 5
55          test13       OUT     Testing signal from 21bit adder
56          VDD          IN
57          GND          IN
58          complt_out   OUT     Output mode 1 : Output request signal
59          cmplt_out_d  OUT     Output mode 1 : cmplt_out + 4
60          VDD          IN
61          GND          IN
62          get_out      IN      Output mode 1 : Get the output handshaking signal
63          test12       OUT     Testing signal from 20bit adder 2
64          test11       OUT     Testing signal from 20bit adder 1
65          VDD          IN
66          GND          IN
67          test9        OUT     Testing signal from multiplier 1
68          test7        OUT     Testing signal from data replicator 1
69          test5        OUT     Testing signal from DCT coefficients memory 1
70          VDD          IN
71          GND          IN
72          block2       IN      Set for the column operation
73          Idct         IN      Set for IDCT operation
74          VDD          IN
75          GND          IN
76          Reset        IN      Reset
77          Start        IN      Input data start signal
78          test4        OUT     Testing signal from 15bit subtractor
79          test3        OUT     Testing signal from 15bit adder
80          VDD          IN
81          GND          IN
82          test2        OUT     Testing signal from 1-to-2 MUX 1
83          test1        OUT     Testing signal from input buffer
84          Done         OUT     Input data acknowledgement signal
85          input_rq     IN      Input data request signal
86          VDD          IN
87          GND          IN
88          In<14>       IN      Input data bit14
89          In<13>       IN      Input data bit13
90          In<12>       IN      Input data bit12
91          VDD          IN
92          GND          IN
93          In<11>       IN      Input data bit11
94          In<10>       IN      Input data bit10
95          In<9>        IN      Input data bit9
96          VDD          IN
97          GND          IN
98          In<8>        IN      Input data bit8
99          In<7>        IN      Input data bit7
100         In<6>        IN      Input data bit6
Pin Assignments of the Transpose Memory Chip 
[Figure: bottom view of the package pin grid, columns 1 to 11 and rows A to L]
Pin Number  Pin Name      In/Out  Description
1           test1         OUT     Testing signal for LSB generator in write address generator
2           test2         OUT     Testing signal for MSB generator in write address generator
3           test3         OUT     Testing signal for in write address generator
4           VDD           IN
5           GND           IN
6           test4         OUT     Testing signal for LSB generator in read address generator
7           test5         OUT     Testing signal for MSB generator in read address generator
8           test6         OUT     Testing signal for in read address generator
9           idct          IN      Set for the IDCT operation
10          data_rq       IN      Input data request signal
11          VDD           IN
12          GND           IN
13          I<0>          IN      Input data bit0
14          I<1>          IN      Input data bit1
15          I<2>          IN      Input data bit2
16          I<3>          IN      Input data bit3
17          I<4>          IN      Input data bit4
18          VDD           IN
19          GND           IN
20          I<5>          IN      Input data bit5
21          I<6>          IN      Input data bit6
22          I<7>          IN      Input data bit7
23          I<8>          IN      Input data bit8
24          I<9>          IN      Input data bit9
25          VDD           IN
26          GND           IN
27          I<10>         IN      Input data bit10
28          I<11>         IN      Input data bit11
29          I<12>         IN      Input data bit12
30          I<13>         IN      Input data bit13
31          I<14>         IN      Input data bit14
32          VDD           IN
33          GND           IN
34          Start         IN      Input data start signal
35          Done          OUT     Input data acknowledgement signal
36          test8         OUT     Testing signal from input multiplexing network
37          test9         OUT     Testing signal from input multiplexing network
38          test10        OUT     Testing signal from input multiplexing network
39          VDD           IN
40          GND           IN
41          test11        OUT     Testing signal from input multiplexing network
42          test7         OUT     Testing signal from input data buffer
43          test15        OUT     Testing signal from input multiplexing network
44          test16        OUT     Testing signal from input multiplexing network
45          test17        OUT     Testing signal from RAM block0
46          VDD           IN
47          GND           IN
48          test12        OUT     Testing signal from input multiplexing network
49          test13        OUT     Testing signal from input multiplexing network
50          test14        OUT     Testing signal from RAM block1
51          test18        OUT
52          Dataout<14>   OUT     Output mode 0 : Output data bit14
53          VDD           IN
54          GND           IN
55          Dataout<13>   OUT     Output mode 0 : Output data bit13
56          Dataout<12>   OUT     Output mode 0 : Output data bit12
57          Dataout<11>   OUT     Output mode 0 : Output data bit11
58          Dataout<10>   OUT     Output mode 0 : Output data bit10
59          Dataout<9>    OUT     Output mode 0 : Output data bit9
60          VDD           IN
61          GND           IN
62          Dataout<8>    OUT     Output mode 0 : Output data bit8
63          Dataout<7>    OUT     Output mode 0 : Output data bit7
64          Dataout<6>    OUT     Output mode 0 : Output data bit6
65          Dataout<5>    OUT     Output mode 0 : Output data bit5
66          Dataout<4>    OUT     Output mode 0 : Output data bit4
67          VDD           IN
68          GND           IN
69          Dataout<3>    OUT     Output mode 0 : Output data bit3
70          Dataout<2>    OUT     Output mode 0 : Output data bit2
71          Dataout<1>    OUT     Output mode 0 : Output data bit1
72          Dataout<0>    OUT     Output mode 0 : Output data bit0
73          dataout_rq    OUT     Output mode 0 : Output data request signal
74          VDD           IN
75          GND           IN
76          Ckin          IN      Output mode 0 : Output acknowledgement signal
77          Reset         IN      Reset
78          output_sel    IN      Output mode selection : mode 0 for data verification, mode 1 for speed measurement
79          cmplt_out_d   OUT     Output mode 1 : cmplt_out + 4
80          cmplt_out     OUT     Output mode 1 : Output request signal
81          VDD           IN
82          GND           IN
83          open_cmplp    IN      Output mode 1 : Get the output handshaking signal
84          get_output    IN      Output mode 1 : Get the output handshaking signal
Measured Waveforms of 1D DCT/IDCT Chip
Waveforms of DCT row operation 
[Figure: two timing-diagram screenshots of the DCT row operation, 20.00 ns time scale. Signals shown: Error, idct, col, din00 to din14, in_rq, done, do00 to do14, o_rq, ckin]
Waveforms of DCT column operation 
[Figure: two timing-diagram screenshots of the DCT column operation, 20.00 ns time scale. Signals shown: Error, idct, col, din00 to din14, in_rq, done, do00 to do14, o_rq, ckin]
Waveforms of IDCT row operation 
[Figure: two timing-diagram screenshots of the IDCT row operation, 20.00 ns time scale. Signals shown: Error, idct, col, din00 to din14, in_rq, done, do00 to do14, o_rq, ckin]
Waveforms of IDCT column operation
[Figure: two timing-diagram screenshots, 20.00 ns time scale. Signals shown: Error, idct, col, din00 to din14, in_rq, done, do00 to do14, o_rq, ckin]
Measured Waveforms of Transpose Memory Chip 
Waveforms of DCT operation mode 
[Logic analyzer screenshots: timing diagrams at a 20.00 ns time scale. Signal labels legible in the extraction: Error, idct, input data bits (apparently di00 to di13), inrq, done, output data bits (apparently do00 to do14), outrq and ckin.]
Schematic of asynchronous bit-parallel multiplier 
Appendix  
Schematics of Programmable DSP Processor 
Schematic of programmable DSP processor 
[Schematic figure. Block labels legible in the extraction: Instr Input Interface & programArray; In/Out & Switch Array; multiplier, adder and subtracter units. The arithmetic units have 9-bit ports a<8:0>, b<8:0> and p<8:0>, with per-port clock (ck_...) and completion (cmplt_...) signals. Legible net names include in_multiA<8:0>, in_multiB<8:0>, out_multiP<8:0>, in_adderA<8:0>, in_adderB<8:0>, in_subA<8:0> and out_subS<8:0>.]
Schematic of switch network 
[Schematic figure. Cell names legible in the extraction: DrToSr9_9, switch and SrToDr9_9_3.]
Schematic of multiplexer cell used in switch cell 
Schematic of FIFO memory
Schematic of Instruction Memory (instruction decoding network and cyclic FIFOs) 
[Schematic figure. Legible labels: a bank of FIFO blocks named fifoW1, fifoW2, fifoW3 and so on; legible block types include fifo24_3b and fifo24_4b.]
Schematic of Product Full Adder (PFA) in multiplier core 
[Schematic figure. Cell instances and ports legible in the extraction:]
hk (handshake): ckin, ckout; inputs a_p, a_n, p_p, p_n, fab_p, fab_n, fpc_p, fpc_n
SHIFT_A: ck; inputs in_p, in_n; outputs out_p (ao_p), out_n (ao_n)
SHIFT_B: ck; inputs in_p, in_n; outputs out_p (bo_p), out_n (bo_n)
PP: ck; inputs fab_p, fab_n, fpc_p, fpc_n; output p_p (po_p)
PN: ck; inputs fab_p, fab_n, fpc_p, fpc_n; output p_n (po_n)
CARRY_P: ck; inputs a_p, b_p, c_p, p_p; output carry (co_p)
CARRY_N: ck; inputs a_n, b_n, c_n, p_n; output carry (co_n)
FAB_P (A.B): ck; input a_p (other input labels illegible); output fab_p (fabo_p)
FAB_N (A+B): ck; inputs a_p, a_n, b_p, b_n as extracted; output fab_n (fabo_n)
FPC_P: ck; inputs c_p, c_n, p_p, p_n; output fpc_p (fpco_p)
FPC_N: ck; inputs c_p, c_n, p_p, p_n; output fpc_n (fpco_n)
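The logic of one PFA cell can be sketched behaviourally from the cell names and captions in this appendix (FAB = A AND B, FPC = P XOR C, with true and complement rails). The Python model below is only an illustration, not the circuit itself. It assumes that the second stage forms the product bit as FAB XOR FPC and that the carry is the majority of (A AND B, P, C); both assumptions are the standard full adder equations rather than something stated in the schematic. The names dual_rail and pfa_cell are hypothetical.

```python
from itertools import product


def dual_rail(bit):
    """Encode a logic value as a (true rail, complement rail) pair."""
    return (bit, 1 - bit)


def pfa_cell(a, b, p, c):
    """Behavioural model of one product full adder cell.

    a, b: operand bits; p: incoming partial product bit; c: incoming carry.
    Every argument and result is a (true, complement) pair.
    """
    a_p, a_n = a
    b_p, b_n = b
    p_p, p_n = p
    c_p, c_n = c

    # First stage: FAB = A AND B (complement rail by De Morgan),
    # FPC = P XOR C (complement rail is P XNOR C).
    fab_p = a_p & b_p
    fab_n = a_n | b_n
    fpc_p = (p_p & c_n) | (p_n & c_p)
    fpc_n = (p_p & c_p) | (p_n & c_n)

    # Second stage (assumption): product bit = FAB XOR FPC.
    po_p = (fab_p & fpc_n) | (fab_n & fpc_p)
    po_n = (fab_p & fpc_p) | (fab_n & fpc_n)

    # Carry (assumption): majority of (A AND B, P, C).
    # Majority is self-dual, so the complement rail uses the complement inputs.
    co_p = (a_p & b_p & c_p) | (a_p & b_p & p_p) | (c_p & p_p)
    co_n = ((a_n | b_n) & c_n) | ((a_n | b_n) & p_n) | (c_n & p_n)

    return (po_p, po_n), (co_p, co_n)


if __name__ == "__main__":
    # Exhaustive check against ordinary integer arithmetic.
    for a, b, p, c in product((0, 1), repeat=4):
        po, co = pfa_cell(dual_rail(a), dual_rail(b), dual_rail(p), dual_rail(c))
        assert (a & b) + p + c == po[0] + 2 * co[0]
    print("all 16 input combinations agree")
```

In this model exactly one rail of each output is high after evaluation, which is the property a dual-rail completion detector relies on.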
Schematic of handshake cell hM (calling h4in) used in PFA
[Schematic figure. Legible connections: ckin and ckout; the PFA signals a_p, a_n, p_p, p_n, fab_p, fab_n, fpc_p and fpc_n drive the h4in inputs a_p, a_n, b_p, b_n, c_p, c_n, p_p and p_n, in that order. The transistor-level schematic that follows in the original is not recoverable from the extraction.]
Schematic of the shift cell (calling cll) 
[Schematic figure: two cll instances sharing ck; one connects in_p to out_p, the other connects in_n to out_n.]
Schematic of the buffer cell cll (single rail)
Schematic of stage Product (true value) generation cell PP 
Schematic of stage Product (complement value) generation cell PN 
Schematic of Carry (true value) generation cell CARRY_P
Schematic of Carry (complement value) generation cell CARRY_N 
Schematic of 1st stage Product (true value) generation cell FAB_P (A AND B)
[Block symbol: FAB cell (A.B) clocked by ck, with a_p on input A, b_p on input B and fab_p on output Y.]
Schematic of 2-input AND gate 
Schematic of 1st stage Product generation (true value) cell FPC_P (P XOR C)
Schematic of 1st stage Product generation (complement value) cell FPC_N (P XNOR C)
Appendix  
Schematics of 1D DCT/IDCT Core
Schematic of 1D DCT/IDCT core
Schematic of memory pipeline stages in DCT coefficient memory 
Schematic of modified basic FIFO cell
Layout of modified basic FIFO cell 
Appendix  
Schematics of Transpose Memory 
Schematic of transpose memory 
Schematic of 32x15bit RAM block 
Schematic of monitor cell 
[Schematic figure. Legible signal labels: enable, left_done and right_done.]
Layout of monitor cell 





1] S. Hauck, "Asynchronous Design Methodologies: An Overview", Proceedings 
of the IEEE, Vol. 83, No. 1, page 69 - 93, January 1995 
[2] K.D. Emerson, "Asynchronous Design - an Interesting Alternative", 10出 
International Conference on VLSI Design, page 315-321, January 1997 
；3] N. Ahmed, T. Natatajan and K.R. Rao, "Discrete Cosine Transform", IEEE 
Transaction on Communications, Vol. 23, page 90 - 33，January 1974 
[4] CCITT Recommendation H.261, 1990 
[5] ISO/IEC JTCI/SC29/WG10. JPEG Committee Draft CD10918, 1991 
[6] ISO/IEC JTCI/SC29AVG11. M P E G Committee Draft CDll 172，1991 
[7] B.G. Lee, “A New Algorithm to Compute the Discrete Cosine Transform", 
IEEE Transaction on Acoustics, Speech, and Signal Processing, Vol. 32，No. 6, 
page 1243 - 1245，December 1984 
[8] H.S. Hou, "A Fast Recursive Algorithm for Computing the Discrete Cosine 
Transform", IEEE Transaction on Acoustics, Speech, and Signal Processing, 
Vol. 35, No. 10, page 1445 - 1461, October 1987 
[9] C. Loeffler, A. Ligtenberg and G.S. Moschytz, "Practical Fast 1-D D C T 
Algorithms with 11 Multiplications", International Conference on Acoustics, 
Speech, and Signal Processing, Vol. 2, page 988 - 991, 1989 
[10] I.E. Sutherland, “Micropipelines”，Communications of the A C M , Vol. 32, No. 
6, page 720 - 738, June 1989 
[11] J.V. Woods, P. Day，S.B. Furber, J.D. Garside, N.C. Paver and S. Temple, 
"AMULETl: An Asynchronous A R M Microprocessor", IEEE Transactions on 
Computers, Vol. 46, No. 4，page 385 - 398，April 1997 
[12] I.E. Williams, “Analyzing and Improving the Latency and Throughput 
Performance of Self-Timed Pipelines and Rings，，，Proceedings of IEEE 
International Symposium on Circuits and Symstems, page 665 - 668, 1992 
Page 191 
References  
[13] I.E. Williams and M.A. Horowitz, "A Zero-Overhead Self-Timed 160ns 54b 
C M O S Divider”，IEEE Journal of Solid-State Circuits, Vol. 26，No. 11，page 
1651 - 1661, November 1991 
[14] M . Singh and S.M. Nowick, "High-throughput Asynchronous Pipelines for 
Fine-Gain Dynamic Datapaths”，Proceedings of International Symposium on 
Advanced Research in Asynchronous Circuits and Systems, page 198 - 209, 
2000 
[15] G. Matsubara and N. Ide, “A Low Power Zero-Overhead Self-Timed Division 
and Square Root Unit Combining a Single-Rail Static Circuit with a Dual-Rail 
Dynamic circuit", Proceedings of International Symposium on Advanced 
Research in Asynchronous Circuits and Systems, page 198 - 209, 1997 
[16] D.E. Muller and W.C. Bartkey, “A Theory of Asynchronous Circuits", Report 
75, Univerity of Illionis, USA, 1956 
[17] M . Renaudin, B.E. Hassan and A. Guyot, “A New Asynchronous Pipeline 
Scheme: Application to the Design of a Self-Timed Ring Divider", IEEE 
Journal of Solid-State Circuits, Vol. 31, No. 7，page 1001 — 1013, July 1996 
[18] R.H. Krambeck, CM. Lee and H.S. Law, "High-speed Compact Circuits with 
CMOS”，IEEE Journal of Solid-State Circuits, Vol. 17, page 614-619, June 
1982 
[19] A.J. McAuley, "Dynamic Asynchronous Logic for High-Speed C M O S 
Systems", IEEE Journal of Solid-State Circuits, Vol. 27, No. 3, page 382 - 388, 
March 1992 
[20] C. Famsworth, D.A. Edwards and S.S. Sikand，"Utilising Dynamic Logic for 
Low Power Consumption in Asynchronous Circuits”，Proceedings of 
International Symposium on Advanced Research in Asynchronous Circuits and 
Systems, page 186 — 194, 1994 
[21] H. Yoshizawa, K. Taniguchi and K. Nakashi, "An Implementation Technique of 
Dynamic C M O S Circuit Applicable to Asynchronous/Synchronous Logic", 
Proceedings of IEEE International Symposium on Circuits and Systems, Vol. 2, 
page 145- 148, 1998 
[22] J. Ahmed and S.G. Zaky, "Asynchronous Design in Dynamic CMOS", IEEE 
Canadian Conference on Electrical and Computer Engineering, Vol. 2, page 528 
-531, 1997 
[23] G.N. Hoyer and C. Sechen，“A Locally-Clocked Dynamic Logic Serial/Parallel 
Multiplier", IEEE Custom Integrated Circuits Conference, page 481 - 484，2000 
Page 192 
References   
[24] C.H. Erdekyi, W.R. Griffin and R.D. Kilmoyer, "Cascode Voltage Switch Logic 
Design", VLSI Design, page 78 - 86, October 1984 
[25] S.B. Furber and J. Liu, "Dynamic Logic in Four-Phase Micropipelines", 
Proceedings of International Symposium on Advanced Research in 
Asynchronous Circuits and Systems, page 11-16，1996 
[26] M . Renaudin and B.E. Hassan, "The Design of Fast Asynchronous Adder 
Structures and Their Implementation Using D C V S Logic”，Proceedings of IEEE 
International Symposium on Circuits and Systems, Vol. 4, page 291 - 294, 1994 
[27] G.A. Ruiz and M.A. Manzano, "Compact 32-bit C M O S Adder in Multiple-
Output D C V S Logic for Self-Timed Circuits", lEE Proceedings of Circuits and 
Devices Systems, Vol. 147, No. 3，page 183 - 188, June 2000 
28] K.M. Chu and D.L. Pulfrey, "A Comparison of C M O S Techniques; Differential 
Cascode Voltage Switch Logic versus Conventional Logic’，，IEEE Journal 
Solid-State circuits, Vol. 22. No. 4，page 528 - 532, August 1987 
[29] C D . Nielsen, "Evaluation of Function Blocks for Asynchronous Design", 
E U R O D A C 1994，page 454 — 459, September 1994 
[30] T.Y. Tang, C.S. Choy, J. Butas and C.F. Chan, ‘A A L U Design using a Novel 
Asynchronous Pipeline Architecture", Proceedings of IEEE International 
Symposium on Circuits and Systems, Sec. V, page 361 — 364, 2000 
[31] K.R. Rao and P. Yip, “ Discrete cosine Transform: Algorithms, Advantages, 
Applications", Academic Press, Inc, 1990 
[32] S.L Uramoto, Y. Inoue, A. Takabatake, J. Takeda, Y. Yamashita, H. Terance 
and M . Yoshimoto, "A 100-MHz 2-D Discrete Cosine Transform Core 
Processor", IEEE Journal of Solid-State Circuits, Vol. 27, No. 4, page 492 — 
499, April 1992 
[33] Y.F. Jang, J.N. Kao，J.S. Yang and P.C. Huang, "A 0.8u 100-MHz 2-D D C T 
Core Processor", IEEE Transactions on Consumer Electronics, Vol. 40, No. 3, 
page 703-710, August 1994 
[34] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano， 
M . Norishima, M . Murota, M . Kako, M . Kinugawa, M . Kakumu and T. Sakurai, 
"A 0.9-V, 150-MHz, 10-mW, 4mm^, 2-D Discrete Cosine Transform Core 
Processor with Variable Threshold-Voltage (VT) Scheme", IEEE Journal of 
Solid-State Circuits, Vol. 31, No. 11，page 1770 - 1779, November 1996 
[35] K.H. Cheng, C.S. Huang and C.P. Lin, "The Desi严 and Implementation of 
DCT/IDCT Chip with Novel Architecture", Proceedings of IEEE International 
Symposium on Circuits and systems, Sec. IV, page 741 - 744, 2000 
[36] N.I. Cho and S.U. Lee, "Fast Algorithm and Implementation of 2-D Discrete
Cosine Transform", IEEE Transactions on Circuits and Systems, Vol. 38, No. 3,
page 297 - 305, March 1991
[37] Y.P. Lee, T.H. Chen, L.G. Chen, M.J. Chen and C.W. Ku, "A Cost-Effective
Architecture for 8x8 Two-Dimensional DCT/IDCT Using Direct Method",
IEEE Transactions on Circuits and Systems for Video Technology, Vol. 7, No.
3, page 459 - 467, June 1997
[38] T.S. Chang, C.S. Kung and C.W. Jen, "A Simple Processor Core Design for
DCT/IDCT", IEEE Transactions on Circuits and Systems for Video
Technology, Vol. 10, No. 3, page 439 - 447, April 2000
[39] C.L. Wang and Y.T. Chang, "Highly Parallel VLSI Architectures for the 2-D
DCT and IDCT Computations", Proceedings of TENCON, Vol. 1, page 295 -
299, 1994
[40] Y. Jeong, I. Lee, T. Yun, G. Park and K.T. Park, "A Fast Algorithm Suitable for
DCT Implementation with Integer Multiplications", IEEE TENCON - Digital
Signal Processing Applications, Vol. 2, page 784 - 787, 1996
[41] J.B. Kuo and C.S. Chiang, "Charge Sharing Problems in the Dynamic Logic
Circuits: BiCMOS versus CMOS and a 1.5 V BiCMOS Dynamic Logic Circuit
Free from Charge Sharing Problems", IEEE Transactions on Circuits and
Systems - I: Fundamental Theory and Applications, Vol. 42, No. 11, page 974 -
977, November 1995
[42] J.A. Pretorius, A.S. Shubat and C.A. Salama, "Charge Redistribution and Noise
Margins in Domino CMOS Logic", IEEE Transactions on Circuits and Systems,
Vol. 33, No. 8, page 786 - 793, August 1986
[43] L.A. Knauth, "Dynamic CMOS", Honors Project,
http://www.stanford.edu/~lknauth/academic/DynCMOS.html
[44] H.J. Song, "A Self-Off-Time Detector for Reducing Standby Current of
DRAM", IEEE Journal of Solid-State Circuits, Vol. 32, No. 10, page 1535 -
1542, October 1997
[45] J. Nyathi and J.G. Delgado-Frias, "Self-timed Refreshing Approach for
Dynamic Memories", Proceedings of ASIC Conference, page 169 - 173, 1998
[46] N.R. Mahapatra, S.V. Garimella and A. Tareen, "An Empirical and Analytical 
Comparison of Delay Elements and a New Delay Element Design", Proceedings 
of IEEE Computer Society Workshop on VLSI, page 81 - 86, 2000
[47] R.J. Baker, H.W. Li and D.E. Boyce, "CMOS: Circuit Design, Layout, and 
Simulation", IEEE Press, 1997 
[48] G.M. Jacobs and R.W. Brodersen, "A Fully Asynchronous Digital Signal
Processor Using Self-Timed Circuits", IEEE Journal of Solid-State Circuits,
Vol. 25, No. 6, page 1526 - 1537, December 1990
[49] T.M.Y. Meng, R.W. Brodersen and D.G. Messerschmitt, "Asynchronous
Design for Programmable Digital Signal Processors", IEEE Transactions on
Signal Processing, Vol. 39, No. 4, page 939 - 952, April 1991
[50] S.B. Furber, D.A. Edwards and J.D. Garside, "AMULET3: a 100 MIPS
Asynchronous Embedded Processor", Proceedings of International Conference
on Computer Design, page 329 - 334, 2000
[51] S.L. Lu and C.M. Chang, "Modelling of a Self Timed Dataflow Processor in
VHDL", Proceedings of IEEE International ASIC Conference and Exhibit, page
228 - 231, September 1993
[52] I.S. Hwang and A.L. Fisher, "Ultra Fast Compact 32-bit CMOS Adder in
Multiple-Output Domino Logic", IEEE Journal of Solid-State Circuits, Vol. 24,
page 358 - 369, 1989
[53] Z. Wang, G.A. Jullien, W.C. Miller, J. Wang and S.S. Bizzan, "Fast Adders
Using Enhanced Multiple-Output Domino Logic", IEEE Journal of Solid-State 
Circuits, Vol. 32, No. 2, page 206 - 214, February 1997 
[54] S.M. Nowick, "Design of a Low-latency Asynchronous Adder Using
Speculative Completion", Proceedings of IEE Computers and Digital
Techniques, Part E, Vol. 143, No. 3, page 301 - 307, September 1996
[55] A. Peled and B. Liu, "A New Hardware Realization of Digital Filters", IEEE
Transactions on Acoustics, Speech, and Signal Processing, Vol. 22, page 456 - 462,
December 1974
[56] "IEEE Standard Specifications for the Implementations of 8x8 Inverse Discrete 
Cosine Transform", IEEE Std 1180-1990, December 1990 
[57] M.E. Dean, D.L. Dill and M. Horowitz, "Self-timed Logic Using
Current-Sensing Completion Detection", Proceedings of IEEE International Conference
on Computer Design: VLSI in Computers and Processors, page 187 - 191, 1991
[58] T.C. Pang, "An ICT Image Processing Chip Based on Fast Computation 
Algorithm and Self-timed Circuit Technique", MPhil Thesis, Department of 
Electronic Engineering, The Chinese University of Hong Kong, 1997 
[59] W.Y. Sit, "Asynchronous Memory Design", MPhil Thesis, Department of
Electronic Engineering, The Chinese University of Hong Kong, 1998 
[60] S.F. Hsiao, W.R. Shiue and J.M. Tseng, "A Cost-Efficient and Fully Pipelinable
Architecture for DCT/IDCT", IEEE Transactions on Consumer Electronics,
Vol. 43, No. 3, page 515 - 525, August 1999
[61] M. Yoshida, H. Ohtomo and I. Kuroda, "A New Generation 16-bit General
Purpose Programmable DSP and Its Video Rate Application", IEEE Workshop
on VLSI Signal Processing, page 93 - 101, 1993
[62] I. Kuroda, "Processor Architecture Driven Algorithm Optimization for Fast 2-D
DCT", IEEE Workshop on VLSI Signal Processing, Vol. VIII, page 481 - 490,
1995
[63] D. Johnson, V. Akella and B. Stott, "Micropipelined Asynchronous Discrete
Cosine Transform (DCT/IDCT) Processor", IEEE Transactions on Very Large
Scale Integration Systems, Vol. 6, No. 4, page 731 - 740, December 1998
[64] K. Kim and J.S. Koh, "An Area Efficient DCT Architecture for MPEG-2 Video
Encoder", IEEE Transactions on Consumer Electronics, Vol. 45, No. 1, page 62
- 67, February 1999
[65] "0.6-Micron Standard Cell Databook", Austria Mikro Systeme International,
1997
[66] C.N. Lyu and D.W. Matula, "Redundant Binary Booth Recoding", Symposium
on Computer Arithmetic, page 50 - 57, July 1995
[67] M. Potkonjak, M.B. Srivastava and A.P. Chandrakasan, "Multiple Constant
Multiplications: Efficient and Versatile Framework and Algorithms for
Exploring Common Subexpression Elimination", IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, Vol. 15, No. 2,
February 1996
Design Libraries - CD-ROM
The CD contains the design libraries of the Refresh Control Circuit, the programmable
DSP processor, the dedicated DCT/IDCT processor and other necessary libraries. All the
libraries are designed in the AMS CMOS CUP 0.6u 3M1P technology using
Cadence 4.4.1.