An ICT image processing chip based on fast computation algorithm and self-timed circuit technique. by Pang, Johnson Tin-Chak. & Chinese University of Hong Kong Graduate School. Division of Electronic Engineering.
An ICT Image Processing Chip Based on Fast Computation 
Algorithm and Self-timed Circuit Technique 
Thesis 
Submitted to 
The Department of Electronic Engineering 
of 
The Chinese University of Hong Kong
in 





Johnson, Tin-Chak PANG 
January, 1997 




List of figures 
List of tables 
1. Introduction
1.1 Introduction
1.2 Introduction to asynchronous system
1.2.1 Motivation
1.2.2 Hazards
1.2.3 Classes of Asynchronous circuits
1.3 Introduction to Transform Coding
1.4 Organization of the Thesis
2. Asynchronous Design Methodologies
2.1 Introduction
2.2 Self-timed system
2.3 DCVSL Methodology
2.3.1 DCVSL gate
2.3.2 Handshake Control
2.4 Micropipeline Methodology
2.4.1 Summary of previous design
2.4.2 New Micropipeline structure and improvements
2.4.2.1 Asymmetrical delay
2.4.2.2 Variable Delay and Delay Value Selection
2.5 Comparison between DCVSL and Micropipeline
3. Self-timed Multipliers
3.1 Introduction
3.2 Design Example 1 : Bit-serial matrix multiplier
3.2.1 DCVSL design
3.2.2 Micropipeline design
3.2.3 The first test chip
3.2.4 Second test chip
3.3 Design Example 2 - Modified Booth's Multiplier
3.3.1 Circuit Design
3.3.2 Simulation result
3.3.3 The third test chip
4. Current-Sensing Completion Detection
4.1 Introduction
4.2 Current-sensor
4.2.1 Constant current source
4.2.2 Current mirror
4.2.3 Current comparator
4.3 Self-timed logic using CSCD
4.4 CSCD test chips and testing results
4.4.1 Test result
5. Self-timed ICT processor architecture
5.1 Introduction
5.2 Comparison of different architecture
5.2.1 General purpose Digital Signal Processor
5.2.1.1 Hardware and speed estimation
5.2.2 Micropipeline without fast algorithm
5.2.2.1 Hardware and speed estimation
5.2.3 Micropipeline with fast algorithm (I)
5.2.3.1 Hardware and speed estimation
5.2.4 Micropipeline with fast algorithm (II)
5.2.4.1 Hardware and speed estimation
6. Implementation of self-timed ICT processor
6.1 Introduction
6.2 Implementation of Self-timed 2-D ICT processor (First version)
6.2.1 1-D ICT module
6.2.2 Self-timed Transpose memory
6.2.3 Layout Design
6.3 Implementation of Self-timed 1-D ICT processor with fast algorithm (final version)
6.3.1 I/O buffers and control units
6.3.1.1 Input control
6.3.1.2 Output control
6.3.1.2.1 Self-timed Computational Block
6.3.1.3 Handshake Control Unit
6.3.1.4 Integer Execution Unit (IEU)
6.3.1.5 Program memory and Instruction decoder
6.3.2 Layout Design
6.4 Specifications of the final version self-timed ICT chip
7. Testing of Self-timed ICT processor
7.1 Introduction
7.2 Pin assignment of Self-timed 1-D ICT chip
7.3 Simulation
7.4 Testing of Self-timed 1-D ICT processor
7.4.1 Functional test
7.4.1.1 Testing environment and results
7.4.2 Transient Characteristics
7.4.3 Comments on speed and power
7.4.4 Determination of optimum delay control voltage
7.5 Testing of delay element and other logic cells




Fig. 1-1 ICT kernel [T]
Fig. 1-2 Forward ICT fast algorithm
Fig. 1-1 The forward ICT(10,9,6,2,9,3) fast algorithm
Fig. 1-2 The reverse ICT(10,9,6,2,9,3) fast algorithm
Fig. 2-1 Basic Self-timed system
Fig. 2-2 Precharge DCVSL circuit structure and DCVSL latch
Fig. 2-3 A DCVSL latch with ready signal output
Fig. 2-4 Self-timed system with DCVSL data-path
Fig. 2-5 STG and the circuit of handshake control circuit (parallelism degree = 2)
Fig. 2-6 Timing diagram of HC protocol in Fig. 2-5
Fig. 2-7 STG and circuit of handshake control circuit (parallelism degree = 3)
Fig. 2-8 Timing diagram of HC protocol in Fig. 2-7
Fig. 2-9 (a) Run-away and (b) Correct operation
Fig. 2-10 (a) Continual Feeding and (b) Correct operation
Fig. 2-11 (a) Dynamic C-element (b) Static C-element
Fig. 2-12 (a) Four-phase (b) Two-phase Bundled data convention
Fig. 2-13 Sutherland's two-phase handshaking micropipeline
Fig. 2-14 Sutherland's CP Latch (For two-phase handshaking)
Fig. 2-15 Micropipeline with CP-Latch
Fig. 2-16 Eva's CP Latch
Fig. 2-17 Four-phase Handshaking Control protocol and circuit
Fig. 2-18 Voltage controlled delay element
Fig. 2-19 Micropipeline with D Flip-Flop
Fig. 2-20 (a) A new Handshaking Control circuit and (b) its timing diagram
Fig. 2-21 Circuit diagram of the new Symmetrical Delay element
Fig. 2-21 Circuit diagram of Asymmetrical Delay element
Fig. 2-22 Characteristic of Asymmetrical Delay element
Fig. 2-23 Delay selection in Micropipeline for 4-phase handshaking system
Fig. 2-24 Delay selection in Micropipeline for 2-phase handshaking system (a) first design (b) simplified
Fig. 3-1 Block diagram of the Bit-serial matrix multiplier
Fig. 3-2 STG of Feedback path handshaking control protocol
Fig. 3-3 Layout of Eva Pang's first test chip
Fig. 3-4 Layout of the second test chip
Fig. 3-5 Block diagram of the 8x8 vector multiplier with Modified Booth's Algorithm
Fig. 3-6 Simulation result of Self-timed Booth's multiplier
Fig. 3-7 Layout of the third test chip
Fig. 4-1 Circuit diagram of a basic current comparator
Fig. 4-2 Simulation result of the switching current and operation of a basic current comparator
Fig. 4-3 Circuit diagram of improved current comparator
Fig. 4-4 Simulation result of improved current comparator
Fig. 4-5 Block diagram of Self-timed logic using CSCD
Fig. 4-6 Layout of the second CSCD test chip
Fig. 4-7 Output voltage of current comparator
Fig. 4-8 Time delay from output data to completion signal
Fig. 5-9 Block diagram of a general purpose programmable DSP
Fig. 5-10 Block diagram of first version of self-timed 1-D ICT processor
Fig. 5-11 Block diagram of Micropipeline with fast algorithm architecture (I)
Fig. 5-12 Block diagram of final version of self-timed 1-D ICT processor
Fig. 6-1 2-D ICT system
Fig. 6-2 Block diagram of first version of self-timed 1-D ICT processor
Fig. 6-3 Block diagram of the 128x16 self-timed static Transpose memory
Fig. 6-4 Layout view of the first version 2-D ICT processor
Fig. 6-5 Block diagram of the final version Self-timed 1-D ICT core processor
Fig. 6-6 Timing diagram of input buffer
Fig. 6-7 Self-timed computational block
Fig. 6-8 Block diagram of Handshake Control Unit
Fig. 6-9 2-phase signal generator
Fig. 6-10 Timing diagram of internal handshaking signals
Fig. 6-11 Block diagram of the Integer Execution Unit
Fig. 6-12 Block diagram of 16 x 3 bit Integer Multiplier
Fig. 6-13 Layout view of the final version self-timed 1-D ICT processor
Fig. 7-1 Simulated timing diagram of self-timed ICT (inverse transform)
Fig. 7-2 Simulated timing diagram of self-timed ICT (inverse transform)
Fig. 7-3 Measured timing diagram of self-timed ICT (forward ICT)
Fig. 7-4 Measured timing diagram of self-timed ICT (inverse transform)
Fig. 7-5 Waveform of the acknowledge and output request signal
Fig. 7-6 Fluctuation of the acknowledge signal
Fig. 7-7 Waveform of internal request signal
Fig. 7-8 Suggested improvement of IEU design
Fig. 7-9 Block diagram of logic cells test circuit
Fig. 7-10 Transient response of the 10ns delay element
Fig. 7-11 Fluctuation of delay value
Fig. 7-12 Transient response of the 5ns asymmetric delay element
List of Tables
Table 3-1 Comparison of DCVSL and Micropipeline multiplier in the first test chip
Table 3-2 Comparison of DCVSL and Micropipeline multipliers in the second test chip
Table 3-3 Decoding table of 3 multiplier bits
Table 3-4 Simulation results of multiplication time in different operation mode and handshake control protocol
Table 4-1 Comparison between simulation and measured result
Table 7-1 Comparison of simulated and measured total computational time
Table 7-2 Simulation and measured result of a 10ns Delay element
Table 7-3 Simulation and measured result of a 5ns Asymmetric Delay element
Acknowledgments 
I would very much like to thank my supervisor, Dr. Oliver C.S. Choy, for his
help and guidance throughout my time here at CUHK, and for introducing me to the
asynchronous world. I also want to thank Dr. C.F. Chan for his stimulating
comments and challenges. Moreover, I would like to thank Dr. W.K. Cham for his
advice on Integer Cosine Transform theory. I especially want to thank Miss Eva Pang
for her invaluable advice on asynchronous logic design and Mr. Jean-Francois
Paillotin at Tima-CMP for IC fabrication support.
I also want to thank my peers and colleagues in the ASIC laboratory of CUHK.
This includes, but is not limited to, Michael, Timothy, Ku, Mark, Wallace, Stanley,
Winnie(s), Kelvin, Vincent(s), Frankie, Thomas, Or, To, Mr. Long, Mr. F. Li, .... and
our laboratory technicians Mr. Jason Chan and Mr. W.Y. Yeung. Finally, I would
like to thank all of my family members for their full support and love.
Abstract 
The aim of this research project is to design a high-performance self-timed
Integer Cosine Transform (ICT) core processor. The ICT is compatible with the
widely accepted Discrete Cosine Transform (DCT) and is superior to it because
the ICT is inherently simpler. Self-timed systems are very attractive for
tomorrow's VLSI because they have many potential advantages over traditional
synchronous systems.
This thesis describes my research findings in self-timed logic design and
the VLSI implementation of a self-timed ICT core processor. It starts with a
discussion of previous results for 2-phase and 4-phase handshake DCVSL and
micropipeline systems. Then the improved 2-phase and 4-phase handshake
micropipelines with a variable delay element, delay selection and an asymmetric
delay element, together with the current-sensing completion detection technique,
are introduced. Some application circuits, such as improved bit-serial matrix
multipliers and a Booth's multiplier, are then described. The 2-phase
micropipeline is found to be the most cost-effective structure for implementing
data-path systems with a wide word length.
The second part is the VLSI implementation of the self-timed ICT core processor
chip based on the best self-timed structure found. It starts by analyzing the
self-timed ICT processor architecture. The implementation of two different
versions of the ICT processor is then described. The final version has a high
degree of parallelism and modularity. The final design was fabricated in 0.7 µm
CMOS technology and proven to work correctly at a maximum operating data rate of
up to 50 MHz.
Chapter 1 Introduction
1.1 Introduction
We live in an information age: people want up-to-date news delivered not only
as plain text but also as informative pictures and video. High-speed computers,
data transmission networks, innovative software and multimedia products meet
this need. For example, MPEG video, video conferencing, and the transmission of
pictures or video over the Internet are used extensively for communication and
entertainment. However, these multimedia and telecommunication applications are
usually bandwidth- and memory-intensive and also require very high computational
power. Image compression is one way to ease the problem.
High-speed DSP microprocessors or dedicated DSP chips are required to
handle image processing applications, but they are usually very expensive because
they need complicated circuits to handle the large amount of calculation. The
development of high-speed, low-cost image compression ASICs is therefore
necessary, and this is the prime objective of my research project.
Transform coding is an image compression technique that can yield a better
compression ratio, at the cost of much more calculation. The Discrete
Cosine Transform (DCT) is the most widely used transform for image compression:
the baseline system in the ISO/CCITT JPEG proposal for image coding is a
transform coding system based on the order-8 DCT. However, the kernel
components of the DCT are real numbers and require 7 bits or more to represent
their magnitude if the effect on the transformed and reconstructed signal is to
be negligible. Transforms that approximate the DCT and contain only integer
kernel components were proposed to simplify the implementation of the transform.
They are called Integer Cosine Transforms (ICTs). ICTs are desirable alternatives
to the DCT for image transform coding systems because their computation time is
much shorter than that of the DCT and their hardware is simpler.
The ICT basically involves the multiplication of an image vector by the
kernel matrix. Three generations of ICT chips have been developed based on the
traditional synchronous digital system approach. In those chips the multiply-add
operation was implemented in a straightforward manner with a multiplier and
an accumulator, just as in a general-purpose DSP microprocessor [1]. This has the
advantage of using the same design for both forward and inverse transforms. In
this project, a new approach is used: the fast ICT computation algorithm and a
self-timed circuit technique are adopted to speed up the matrix multiplication.
The number of multiplications is reduced from 64 to 16 when the fast algorithm
is used.
The verification of the ICT, including the optimal set of ICT kernel components,
their bit length and the corresponding transform efficiency and accuracy, has
been fully analyzed by W.K. Cham [2,3,4]. My major task is therefore to design
and implement a high-performance ICT image processing chip based on these
previous findings.
Besides improving the ICT processor with a fast computation algorithm, an
asynchronous approach has been adopted in this new design, and I spent more
than two-thirds of my time investigating, improving and developing self-timed
systems. Self-timed logic has many potential advantages: it is not restricted
by a global clock, each operation can complete in a shorter time, and the
maximum operating rate is not determined by the worst-case signal path in the
design. It is especially suitable for implementing systems with a high level of
concurrency. However, hazards make it difficult to design, and special control
circuits or circuit architectures are required to eliminate this problem. The
main research area in this project is therefore the design of new handshake
control protocols and circuits that enhance the speed and minimize the overhead
of self-timed systems. Different self-timed approaches, namely DCVSL,
Micropipeline and the current-sensing completion detection technique, have been
studied. Some improvements were found and implemented in real ASICs for
practical verification.
The research on self-timed logic started from the DCVSL and Micropipeline
approaches, based on the research findings of another research student, Eva Pang.
For the DCVSL method, an improved handshake control protocol was designed. This
protocol has a higher degree of parallelism and is able to handle feedback
systems. However, after comparing the DCVSL and Micropipeline methods, we found
that the micropipeline is more suitable for implementing our target application,
the ICT, in terms of circuit complexity and speed. The research then focused on
micropipeline circuits. A new micropipeline structure and some handshake control
circuits were developed.
At the same time, another self-timed method, the current-sensing
completion detection technique, was studied. Some test circuits were
implemented in CMOS chips. However, simulation and testing results showed that
my current-sensing circuit is not very reliable.
All of these new self-timed circuit techniques were applied to circuits
such as multipliers and implemented in CMOS VLSI chips. In total, three
self-timed logic test chips were fabricated in order to evaluate the
performance of the newly developed self-timed techniques.
Finally, an innovative asynchronous ICT processor was designed based on
the optimal ICT coefficients and algorithm established previously and on the
improved self-timed circuit technique (2-phase micropipeline). This self-timed
technique was found to be the most efficient for implementing the ICT
computation structure. The chips were fabricated in a 0.7 µm CMOS SLP MLP
process and operate at speeds of up to 50 MHz. The following sections in this
chapter introduce the background of asynchronous systems and Integer Cosine
Transform theory.
1.2 Introduction to asynchronous system 
1.2.1 Motivation 
Asynchronous digital system design has been discussed extensively over the
past few decades because of its advantages over conventional synchronous systems.
The potential advantages of asynchronous systems [5] are as follows:
1. No clock skew - clock skew is the difference in arrival times of the clock signal at 
different parts of the circuit. Since asynchronous circuits by definition have no 
globally distributed clock, there is no need to worry about clock skew. In 
contrast, synchronous systems often slow down their circuits to accommodate the 
skew. As feature sizes decrease, clock skew becomes a much greater concern. 
2. Lower power - Standard synchronous circuits have to toggle clock lines, and 
possibly pre-charge and discharge signals, in portions of a circuit unused in the 
current computation. For example, even though a multiplier unit on a processor 
might not be used in a given instruction stream, the unit still must be operated by 
the clock. Although asynchronous circuits often require more transitions on the 
computation path than synchronous circuits, they generally have transitions only 
in areas involved in the current computation. 
3. Average-case instead of worst-case performance - Synchronous circuits must
wait until all possible computations have completed before latching the results,
yielding worst-case performance. Many asynchronous systems sense when a
computation has completed, allowing them to exhibit average-case performance.
For circuits such as ripple-carry adders, where the worst-case delay is
significantly worse than the average-case delay, this can result in substantial
savings.
4. Easing of global timing issues - In systems such as a synchronous
microprocessor, the system clock, and thus system performance, is dictated by
the slowest (critical) path. Thus, most portions of a circuit must be carefully
optimized to achieve the highest clock rate, including rarely used portions of
the system. Since many asynchronous systems operate at the speed of the circuit
path currently in operation, rarely used portions of the circuit can be left
unoptimized without adversely affecting system performance.
5. Better technology migration potential - In many asynchronous systems,
migration of only the more critical system components can improve system
performance on average, since performance is dependent on only the currently
active path. However, better performance for synchronous systems can often only
be achieved by migrating all system components to new technology, since again
the overall system performance is based on the longest path. Also, since many
asynchronous systems sense computation completion, components with different
delays may often be substituted into a system without altering other elements or
structures.
6. Automatic adaptation to physical properties - The delay through a circuit can
change with variations in fabrication, temperature, and power-supply voltage. 
Synchronous circuits must assume that the worst possible combination of factors 
is present and clock the system accordingly. Many asynchronous circuits sense 
computation completion, and will run as quickly as the current physical 
properties allow. 
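
The average-case argument in item 3 can be made concrete with a small experiment. The sketch below is not taken from the thesis. It assumes a simplified model in which the settling time of a ripple-carry adder is proportional to the longest chain of bit positions that a single carry ripples through, and the function names `longest_carry_chain` and `average_chain` are illustrative only.

```python
import random


def longest_carry_chain(a: int, b: int, width: int) -> int:
    """Longest run of adjacent bit positions that one carry travels through.

    Assumed delay model (not from the thesis): the adder settles after a time
    proportional to this run length.
    """
    longest = run = 0
    for i in range(width):
        x = (a >> i) & 1
        y = (b >> i) & 1
        if x & y:              # carry generated here: a new chain starts
            run = 1
        elif (x ^ y) and run:  # incoming carry propagates one more position
            run += 1
        else:                  # carry is absorbed, or there is none to pass on
            run = 0
        longest = max(longest, run)
    return longest


def average_chain(width: int, trials: int = 2000, seed: int = 1) -> float:
    """Mean longest carry chain over uniformly random operand pairs."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        a = rng.getrandbits(width)
        b = rng.getrandbits(width)
        total += longest_carry_chain(a, b, width)
    return total / trials


if __name__ == "__main__":
    width = 16
    # All-ones plus one forces the carry through every bit position.
    print("worst case:", longest_carry_chain((1 << width) - 1, 1, width))
    print("average   :", average_chain(width))
```

Under this model a synchronous adder must be clocked for the full 16-position chain, whereas a circuit with completion detection would typically wait only for the much shorter average chain.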
1.2.2 Hazards 
All these characteristics make asynchronous systems very attractive for
tomorrow's VLSI systems. However, hazard and race problems make asynchronous
circuits more difficult to design in an ad hoc fashion than synchronous circuits.
Asynchronous communication protocols increase the computation time and involve
additional circuitry. The existing computer-aided design tools and implementation
alternatives available for synchronous circuits either cannot be used at all in
asynchronous design or require extensive modification.
In a synchronous system, glitches on wires are not usually a problem.
Computation occurs between clock ticks, and transitions on wires must stabilize
before the next clock tick. However, since an asynchronous system has no global
clock, computation is not bound to discrete time intervals and any glitch may
cause the system to fail. The potential for a glitch in an asynchronous design
is called a hazard [6]. Sequential hazards are also possible in asynchronous
state machines; these are called critical races and essential hazards [28].
Hazards are temporal phenomena: they manifest during the dynamic
operation of a circuit. There are several approaches to eliminating combinational
hazards. First, inertial delays may be used to filter out undesired spikes. Second,
synthesis techniques can be used to avoid hazards. Third, if bounded delays are
assumed, hazards may be eliminated by adding delays to slow down certain paths in
a circuit. A final approach is to tolerate hazards where they will do no harm.
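
A combinational hazard and the second remedy (synthesis) can be illustrated with a textbook example that is not taken from the thesis. The sketch below assumes the function f = a.b + a'.c built from an inverter, two AND gates and an OR gate, each with one unit of delay, and it holds b = c = 1 while a falls from 1 to 0. The function name `simulate` and the unit-delay model are assumptions made for illustration. Adding the redundant consensus term b.c is a standard synthesis fix for this static-1 hazard.

```python
def simulate(with_consensus: bool, steps: int = 12) -> list:
    """Unit-delay simulation of f = a*b + (not a)*c while a falls from 1 to 0."""
    b = c = 1
    a = [1 if t < 5 else 0 for t in range(steps)]  # input a falls at t = 5

    def delayed(signal, t):
        # Value one time unit earlier; before t = 0 the circuit is in steady state.
        return signal[max(t - 1, 0)]

    not_a = [1 - delayed(a, t) for t in range(steps)]        # inverter
    and_ab = [delayed(a, t) & b for t in range(steps)]       # a AND b
    and_nac = [delayed(not_a, t) & c for t in range(steps)]  # (not a) AND c
    and_bc = b & c if with_consensus else 0                  # consensus term, constant here
    # OR gate, again one unit of delay behind its inputs.
    return [delayed(and_ab, t) | delayed(and_nac, t) | and_bc for t in range(steps)]


if __name__ == "__main__":
    print("without consensus term:", simulate(False))
    print("with consensus term   :", simulate(True))
```

Without the consensus term the output, which should stay at 1, drops to 0 for one time unit because the a.b path switches off before the a'.c path switches on. With the extra term the output never leaves 1.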
1.2.3 Classes of Asynchronous circuits
Asynchronous circuits can be classified into : 
1. Delay-Insensitive - A DI circuit is one which operates correctly regardless of
the delays on its gates and wires. That is, an unbounded gate and wire delay
model is assumed.
2. Quasi-Delay-Insensitive - A QDI circuit is delay-insensitive except that
isochronic forks are assumed. An isochronic fork is a forked wire on which all
branches are assumed to have the same delay. The motivation for QDI circuits is
that they are the weakest compromise to pure delay-insensitivity needed to build
practical circuits using simple gates and operators.
3. Speed-independent - An SI circuit is one which operates correctly regardless of
gate delays; wires are assumed to have zero or negligible delay.
4. Self-timed - A self-timed circuit, as described by Seitz, contains a group of
self-timed elements. Each element is contained in an equipotential region, where
wires have negligible or well-bounded delay. An element itself may be an SI
circuit, or a circuit whose correct operation relies on localized timing
assumptions. However, no timing assumptions are made on the communication
between elements [28].
1.3 Introduction to Transform Coding 
Transform coding is one of the enabling technologies for advanced video
transmission and multimedia products. It can reduce the memory and bandwidth
requirements for an image to 1/25 with reasonable quality. Most transform coding
systems use the order-8 Discrete Cosine Transform (DCT), which was introduced by
Ahmed et al. in 1974. The DCT is a real transform close to the Karhunen-Loeve
Transform (KLT) of a first-order stationary Markov sequence and is superior in
decorrelating the coefficients.
Because the DCT kernel contains real numbers, various JPEG and DCT chips have
been implemented using about 14 bits to represent a DCT kernel component to
ensure an accurate representation. W.K. Cham has developed integer versions of
the DCT, called Integer Cosine Transforms (ICTs). The ICT has been shown to be
functionally compatible with the DCT and to perform nearly as well as the
DCT [2,3]. The main advantage of the ICT is that its kernel components require
only a few bits for exact representation, so it is simpler to implement in
hardware and thus costs less. This is the reason why the ICT was adopted in the
Galileo spacecraft for coding images before sending them back to Earth.
In transform coding, the original image is divided into sub-pictures of a
particular block size, and the highly correlated spatial data in each block are
transformed into weakly correlated coefficients. This yields significant and
insignificant coefficients, of which the insignificant ones can be discarded.
The overall amount of memory is therefore reduced. The main difficulty of
transform coding lies in its implementation, but this has recently been eased by
high-speed digital hardware.
Transform coding techniques are based on the inter-element correlation of
the image. The higher the correlation of the image data, the more the power
spectrum is concentrated in the low-frequency components, and thus the less
channel capacity is required for transmission. The extent to which images may be
compressed whilst still keeping satisfactory reproduction of the image is
crucially dependent upon their correlation properties. Fortunately, most images
have high correlation coefficients.
An image transform coder and decoder can be built as a two-pass system. In the
two-pass scheme, data compression by transform coding consists of three
processes, all performed in the transmitter:
i) Highly correlated image elements are transformed into a set of weakly
correlated coefficients.
ii) Bits are allocated to these coefficients so that more bits are allocated to
those coefficients with higher variances.
iii) The coefficients are quantized before transmission.
In this system, the original image is divided into sub-pictures of size n x n, 
where n is usually 8 or 16. The sub-pictures are transformed into an array of 
independent coefficients of which maximum information is packed into a minimum 
number of coefficients. More bits are assigned to the coefficients having larger 
variances, and fewer bits for coefficients having smaller variances. The final process 
is to quantize and code the coefficients for transmission. 
At the receiver, the received data with overhead information are decoded to 
the quantized transform coefficients, and an inverse transform is applied to the 
coefficients to recover the picture. 
In transform coding technique, the most important part is to find the best 
transform function. Optimum transforms can convert the statistically dependent 
picture elements into an array of uncorrelated coefficients. The total energy of the 
spatial image data is preserved in the transform domain. The criterion of optimum 
transform is based on whether it can completely decorrelate the image data. 
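
The two properties just described, energy preservation and the packing of information into a few coefficients, can be checked numerically. The following is a minimal sketch rather than anything from the thesis: it uses the standard orthonormal order-8 DCT as the example transform, and the pixel values in `block` as well as the helper names are made up for illustration.

```python
import math

N = 8  # block length (order-8 transform, as in the text)


def dct_kernel(n: int = N) -> list:
    """Orthonormal DCT-II kernel, used here as the example transform."""
    return [
        [
            math.sqrt((1.0 if i == 0 else 2.0) / n)
            * math.cos(i * (j + 0.5) * math.pi / n)
            for j in range(n)
        ]
        for i in range(n)
    ]


def transform(kernel: list, x: list) -> list:
    """Matrix-vector product: one coefficient per basis vector."""
    return [sum(t * v for t, v in zip(row, x)) for row in kernel]


def energy(values: list) -> float:
    """Sum of squares of a sequence."""
    return sum(v * v for v in values)


# Hypothetical row of eight neighbouring pixels with slowly varying intensity.
block = [100, 102, 104, 107, 109, 111, 113, 116]
coeffs = transform(dct_kernel(), block)

# Share of the total energy carried by the two lowest-frequency coefficients.
low_share = energy(coeffs[:2]) / energy(coeffs)

if __name__ == "__main__":
    print("energy in :", energy(block))
    print("energy out:", round(energy(coeffs), 6))
    print("share in first two coefficients:", round(low_share, 4))
```

For highly correlated data such as this row, nearly all of the energy lands in the first two coefficients, which is why the remaining ones can be given few bits or discarded.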
Many DCT chips have been implemented, which use about 14 bits to
represent a DCT kernel component. Since the basis vector components of the DCT
are mainly real numbers, implementing the DCT in finite-length arithmetic was
found to be more complicated than implementing transforms whose basis vector
components are integers only, such as the Walsh transform, Slant transform and
HCT. However, none of them has the same data compression ability as the DCT.
Cham proposed another transform, the Integer Cosine Transform (ICT), as an
alternative to the DCT [2,3,4]. The ICT was shown to be functionally compatible
with the DCT and to have performance close to that of the DCT [2,3,4]. The
ICT(10,9,6,2,3,1) requires only 4 bits for exact representation of its kernel
components.
The Integer Cosine Transform was derived from the DCT using the concept of
dyadic symmetry [2,3]. The (i,j)th kernel component of the order-8 DCT is:
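\[
t_c(i,j) =
\begin{cases}
\sqrt{\dfrac{2}{N}}\,\cos\left\{\dfrac{i\,(j+\tfrac{1}{2})\,\pi}{N}\right\} & \text{for } i \neq 0,\ j \in [0, N-1] \\[2ex]
\sqrt{\dfrac{1}{N}} & \text{for } i = 0,\ j \in [0, N-1]
\end{cases}
\]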
By representing kernel components of the same magnitude using the same
variable, the DCT kernel can be expressed as [T], with its (i,j)th component being
tc(i,j).
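\[
[T] = \left[
\begin{array}{lrrrrrrrr}
k_0\,( & g & g & g & g & g & g & g & g\,) \\
k_1\,( & a & b & c & d & -d & -c & -b & -a\,) \\
k_2\,( & e & f & -f & -e & -e & -f & f & e\,) \\
k_3\,( & b & -d & -a & -c & c & a & d & -b\,) \\
k_4\,( & g & -g & -g & g & g & -g & -g & g\,) \\
k_5\,( & c & -a & d & b & -b & -d & a & -c\,) \\
k_6\,( & f & -e & e & -f & -f & e & -e & f\,) \\
k_7\,( & d & -c & b & -a & a & -b & c & -d\,)
\end{array}
\right]
\]
Fig. 1-1 ICT kernel [T]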
where ki is the scaling factor such that the ith basis vector is of unity magnitude. Let 
[T] = [K][J] 
where [K] is a diagonal matrix whose (i,i)th element equals ki and [J] contains
the components a, b, c, d, e, f and g. It can be shown that [T] is orthogonal if:
a.b = a.c + b.d + c.d ... (1)
Transforms [T] are called Integer Cosine Transforms, or ICT(a,b,c,d,e,f), if they
also satisfy the following conditions:
a > b > c > d and e > f ... (2)
and a, b, c, d, e, and f are integers ... (3)
Condition (2) ensures that the basis vectors of ICT(a,b,c,d,e,f) resemble those
of the DCT, and (3) ensures that the transform components of ICT(a,b,c,d,e,f) can
be represented by a finite number of bits. There are many possible ICTs, each
denoted ICT(a,b,c,d,e,f). ICT(10,9,6,2,9,3) has been shown to be a promising
alternative to the DCT, as its kernel components require only 4 bits for
representation and its performance is very close to that of the DCT in both
transform efficiency and mean-square error [8]. Better performance can be
achieved by increasing the bit length of the kernel components.
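
Conditions (1) to (3) are easy to check mechanically. The sketch below builds the integer matrix [J] from the row pattern of Fig. 1-1 and tests whether its rows are mutually orthogonal. The helper names are illustrative, and the default g = 1 is an assumption: the text does not give a value for g, and row orthogonality does not depend on it.

```python
def ict_matrix(a, b, c, d, e, f, g=1):
    """Integer matrix [J] following the row pattern of the kernel [T].

    g = 1 is an assumed default; the orthogonality test below does not
    depend on its value.
    """
    return [
        [g, g, g, g, g, g, g, g],
        [a, b, c, d, -d, -c, -b, -a],
        [e, f, -f, -e, -e, -f, f, e],
        [b, -d, -a, -c, c, a, d, -b],
        [g, -g, -g, g, g, -g, -g, g],
        [c, -a, d, b, -b, -d, a, -c],
        [f, -e, e, -f, -f, e, -e, f],
        [d, -c, b, -a, a, -b, c, -d],
    ]


def rows_orthogonal(m):
    """True if every pair of distinct rows has a zero dot product."""
    n = len(m)
    return all(
        sum(x * y for x, y in zip(m[i], m[j])) == 0
        for i in range(n)
        for j in range(i + 1, n)
    )


def is_ict(a, b, c, d, e, f):
    """Conditions (1), (2) and (3) from the text."""
    integers = all(isinstance(v, int) for v in (a, b, c, d, e, f))
    return integers and a * b == a * c + b * d + c * d and a > b > c > d and e > f


def magnitude_bits(*components):
    """Bits needed for the largest component magnitude (sign not counted)."""
    return max(abs(v) for v in components).bit_length()


if __name__ == "__main__":
    for params in [(10, 9, 6, 2, 3, 1), (10, 9, 6, 2, 9, 3), (5, 4, 3, 2, 3, 1)]:
        print(params, is_ict(*params), rows_orthogonal(ict_matrix(*params)))
```

Both parameter sets quoted in this chapter, (10,9,6,2,3,1) and (10,9,6,2,9,3), pass the checks, since 10.9 = 10.6 + 9.2 + 6.2 = 90. The third set is a made-up counterexample that violates condition (1).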
In matrix form, let x be an input vector (8x1) and C the coefficient vector
(8x1). The 1-D forward transform is given by:
C = [T] x
= [K][J] x
and the inverse transform is expressed as
x = [T]t C
= [J]t [K] C
When [X] is an input matrix (8x8) and [C] is the coefficient matrix (8x8), the 
2D forward transform is given by 
[C] = [T] [X] [T]t
= [K][J][X][J]t[K]
and the inverse transform is expressed as
[X] = [T]t [C] [T]
= [J]t[K][C][K][J]
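The 2D forward and inverse transforms can be checked numerically with a short sketch. It uses plain Python lists rather than a matrix library, assumes g = 1 in [J], and uses an arbitrary test block; the helper names are hypothetical.

```python
from math import sqrt

# Integer matrix [J] of ICT(10,9,6,2,9,3); g = 1 is an assumption of this sketch.
a, b, c, d, e, f, g = 10, 9, 6, 2, 9, 3, 1
J = [
    [g,  g,  g,  g,  g,  g,  g,  g],
    [a,  b,  c,  d, -d, -c, -b, -a],
    [e,  f, -f, -e, -e, -f,  f,  e],
    [b, -d, -a, -c,  c,  a,  d, -b],
    [g, -g, -g,  g,  g, -g, -g,  g],
    [c, -a,  d,  b, -b, -d,  a, -c],
    [f, -e,  e, -f, -f,  e, -e,  f],
    [d, -c,  b, -a,  a, -b,  c, -d],
]


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(col) for col in zip(*A)]


# [T] = [K][J], where [K] is diagonal with k_i = 1 / |row i of J|.
K = [1 / sqrt(sum(v * v for v in row)) for row in J]
T = [[K[i] * J[i][j] for j in range(8)] for i in range(8)]


def forward_2d(X):
    """[C] = [T][X][T]t"""
    return matmul(matmul(T, X), transpose(T))


def inverse_2d(C):
    """[X] = [T]t[C][T]"""
    return matmul(matmul(transpose(T), C), T)


X = [[(3 * r + 5 * col) % 17 for col in range(8)] for r in range(8)]  # arbitrary test block
X_rec = inverse_2d(forward_2d(X))
max_err = max(abs(X[r][col] - X_rec[r][col]) for r in range(8) for col in range(8))
print(max_err)  # round-trip error, limited only by floating-point rounding
```

Because [T] is orthogonal, the inverse needs no matrix inversion, only the transpose; in hardware the scaling by [K] can be kept separate from the integer part [J].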
The order-8 ICT can be computed using a fast computational algorithm [2].
This fast algorithm is composed of butterflies, each of which converts two numbers
(inputs or intermediate results) into another two (outputs or intermediate results).
Its structure is similar to that of the DCT fast algorithm and has four stages.
Unlike the DCT, which requires multiplication of these numbers by 13-bit
multiplicands, the ICT requires only integer multiplication and addition.
Fig. 1-2 Forward ICT fast algorithm
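The idea behind the butterflies can be sketched from the symmetry of the kernel alone. The even rows of [J] are symmetric and the odd rows antisymmetric, so one stage of sum/difference butterflies splits the 8-point product into two 4-point products. This is only an illustration of that first split, not the complete four-stage flow graph of [2]; g = 1 and the function names are assumptions.

```python
# Integer matrix [J] of ICT(10,9,6,2,9,3); g = 1 is an assumption of this sketch.
a, b, c, d, e, f, g = 10, 9, 6, 2, 9, 3, 1
J = [
    [g,  g,  g,  g,  g,  g,  g,  g],
    [a,  b,  c,  d, -d, -c, -b, -a],
    [e,  f, -f, -e, -e, -f,  f,  e],
    [b, -d, -a, -c,  c,  a,  d, -b],
    [g, -g, -g,  g,  g, -g, -g,  g],
    [c, -a,  d,  b, -b, -d,  a, -c],
    [f, -e,  e, -f, -f,  e, -e,  f],
    [d, -c,  b, -a,  a, -b,  c, -d],
]


def direct_transform(x):
    """Reference: full 8x8 integer matrix-vector product [J]x (64 multiplications)."""
    return [sum(J[i][j] * x[j] for j in range(8)) for i in range(8)]


def first_stage_butterflies(x):
    """Each butterfly turns the pair (x[i], x[7-i]) into a sum and a difference."""
    sums = [x[i] + x[7 - i] for i in range(4)]
    diffs = [x[i] - x[7 - i] for i in range(4)]
    return sums, diffs


def butterfly_transform(x):
    """Even rows use only the sums, odd rows only the differences (32 multiplications)."""
    sums, diffs = first_stage_butterflies(x)
    out = []
    for i in range(8):
        half = sums if i % 2 == 0 else diffs
        out.append(sum(J[i][j] * half[j] for j in range(4)))
    return out


x = [1, 2, 3, 4, 5, 6, 7, 8]
print(direct_transform(x) == butterfly_transform(x))
```

Applying the same idea recursively to the 4-point halves is what reduces the operation count further in a full fast algorithm.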
where p = (b + c) / (2a) and q = (a + d) / (2c).
From the previous research done by W.K. Cham, ICT(10,9,6,2,9,3) was found to be
the best version, and its fast algorithms for the forward and inverse transforms are
shown in Fig. 1-1 and Fig. 1-2 respectively.
Fig. 1-1 The forward ICT(10 9 6 2 9 3) fast algorithm
Fig. 1-2 The reverse ICT(10 9 6 2 9 3) fast algorithm
1.4 Organization of the Thesis
This thesis is organized as follows. Firstly, the objective and a summary of this
project are described in this chapter, together with background information on
asynchronous design and the Integer Cosine Transform, including their pros and cons.
The basic concepts of self-timed system design are reviewed in the next chapter.
Two established self-timed design methodologies (speed independent and
micropipeline), together with their handshake control circuits and data path
circuits, are discussed and compared in terms of speed, complexity and efficiency.
Some improvement techniques developed for the micropipeline, such as the asymmetric
delay element, the variable bounded delay circuit and the 2-phase handshake
protocol, are also described in Chapter 2.
The implementation and testing results of two types of bit-serial matrix
multipliers designed with these two methods are described and compared in
Chapter 3. Another test example, a micropipeline Modified Booth's Multiplier, has
been designed for evaluation and is also discussed in that chapter. An alternative
completion detection method based on the unbounded delay approach, the
current-sensing completion detection technique, has been investigated. The theory,
simulation and testing results of this technique are discussed in Chapter 4.
After all the self-timed circuit design styles have been discussed and summarized,
chapter 5 will discuss which kind of processor architecture and self-timed technique
is the most suitable for implementing our target application, the Integer Cosine
Transform processor. In Chapter 6, the first generation of the self-timed ICT
processor is introduced, followed by the implementation and testing results of the
final version of the self-timed ICT processor. Finally, Chapter 8 gives the
conclusions of this thesis.
Chapter Two 
2. Asynchronous Design Methodologies 
2.1 Introduction 
Asynchronous design has been an active area of research since at least the
mid 1950's, but has yet to see widespread use. In a synchronous system, time is
assumed to be discrete, so hazards and feedback can largely be ignored. In an
asynchronous system, time is not assumed to be discrete. This has several possible
advantages, as described in Chapter 1.
However, hazards and race problems cannot be ignored, which makes asynchronous
systems difficult to design. Self-timed logic provides a method for designing
hazard-free asynchronous systems which operate correctly independent of gate or
wire delay. The control and data path circuits guarantee correct operation of the
asynchronous circuit and prevent hazards. These additional circuits obviously
increase the complexity of the system. So, one of the objectives of this research
project is to find ways to improve the efficiency of both the control and data path
in self-timed systems, i.e. to reduce circuit overhead and increase the operating
speed.
The basic idea and potential advantages of asynchronous design were reviewed in
the previous chapter. In this chapter, the details of self-timed design
methodologies will
be discussed. Different self-timed system design approaches will be compared and
analyzed. The weaknesses of the conventional control and data path structures in
self-timed systems will be pointed out and some improvement techniques will be
introduced.
My research in asynchronous design started from the research findings of
Eva Pang. She had built up a self-timed cell library and devised some special
handshake control protocols and circuits for both DCVSL and Micropipeline systems.
After studying her work, I made some improvements to her circuits, developed a
new micropipeline architecture and designed a Booth's Multiplier based on this new
structure. After comparing these two methods, the 2-phase Micropipeline was found
to be the most suitable method for implementing the ICT processor.
2.2 Self-timed system 
Self-timed logic provides a method for designing hazard-free asynchronous
systems which operate correctly independent of gate or wire delay [9]. A self-timed
system consists of a Handshake Control Circuit (HCC) and a Self-timed block (or
Data Computation block). The handshake control circuit manages the sequence of
events in the self-timed block, and it must ensure that data are correctly
transferred regardless of the relative delays among the handshake signals. The
handshake control circuit can be described by a handshake control (HC) protocol.
Many HC protocols and data path structures have been developed. However, they are
usually very complicated and difficult to implement.
My research in asynchronous design concentrates on 2-phase and 4-phase
handshake signaling in dual-rail speed independent circuits using the Differential
Cascode Voltage Switching Logic (DCVSL) structure, and on the single-rail
micropipeline structure. After comparing these two methods, the 2-phase
micropipeline was found to be the most suitable method for implementing the ICT
processor. I then tried to improve the 2-phase micropipeline structure in order to
make it more effective when used in the self-timed ICT processor.
Fig. 2-1 Basic Self-timed system
In a self-timed system, the global clock is replaced by control signals generated
locally by a handshake control circuit. Fig. 2-1 shows a typical self-timed system
which contains two stages of self-timed blocks and a handshake control circuit.
The completion detection circuitry in each self-timed block generates a Complete
signal when the block finishes its computation, so that each self-timed block can
operate at its maximum speed instead of being limited by the worst-case path in
another sub-system. This completion signal initiates the Request signal of the
next self-timed stage. After the next stage has finished its operation, an
Acknowledge signal is fed back to the previous stage through the handshake control
circuit. The sequence of events depends on the handshake control protocol being
used.
Generation of the complete signal is critical in a self-timed system, as it is
used to guarantee correct circuit operation by preventing hazards and "Run-away" or
"Continual-feeding" conditions. If the complete signal is generated before the
self-timed block finishes its computation, or while the data is not yet valid,
incorrect results will be latched. If the complete signal is generated a long time
after the end of computation, the system will be slowed down. However, generating
the complete signal just in time while maintaining a simple structure is a
difficult task.
The following sections describe two established self-timed circuit
methodologies, DCVSL and Micropipeline. Improvements to these methods are then
discussed.
2.3 DCVSL Methodology 
According to the self-timed design principles described previously, the complete
signal should be generated correctly and efficiently in each self-timed block for
correct and optimal operation. One popular design style for generating a correct
complete signal is the 4-phase dual-rail encoding system. It is a kind of Speed
Independent circuit which uses the unbounded delay model, assuming that delays in
both elements and wires are unbounded.
In a dual-rail encoding system, each data bit is represented by two bits
(i.e. 1=01, 0=10). Each pair of data lines is set to 00 before each computation
cycle. When the computation completes, each pair of data lines is either 01
or 10. Thus, the completion signal can be extracted from the dual-rail data lines
as soon as the process finishes. A popular method of implementing such a system is
to use precharged DCVSL [14,15,27] circuits in the data path. To study the
operation of the whole system, we first have to understand the structure of DCVSL.
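The way a completion signal falls out of the dual-rail code can be shown with a small behavioural sketch. The pair ordering follows the text (1=01, 0=10, idle=00); the function names and the tuple representation are assumptions made for illustration only.

```python
SPACER = (0, 0)  # both rails low: the state every pair is set to before a computation


def encode(bit):
    """Dual-rail code from the text: logic 1 -> 01, logic 0 -> 10."""
    return (0, 1) if bit else (1, 0)


def is_valid(pair):
    """A pair carries data once exactly one of its two rails is high."""
    return pair in ((0, 1), (1, 0))


def completion(word):
    """Completion signal: true only when every pair of the word has left 00."""
    return all(is_valid(pair) for pair in word)


idle = [SPACER] * 3                        # before the computation cycle
partial = [encode(1), SPACER, encode(0)]   # one pair is still resolving
done = [encode(b) for b in (1, 0, 1)]      # all pairs valid

print(completion(idle), completion(partial), completion(done))
```

No timing assumption is needed here: the data itself announces when it is valid, which is why the approach tolerates unbounded delays.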
2.3.1 DCVSL gate 
Fig. 2-2 shows a typical DCVSL circuit, which works like dynamic logic. It is
composed of a P-MOS precharge circuit and an N-MOS evaluation circuit. It has a
precharge/evaluation control signal "I", data inputs and a pair of data outputs "Q"
and "QB", where "QB" is the complement of "Q". Since it is a dual-rail system, each
input and output is represented by 2 bits as described previously. Before
evaluation, nodes k1 and k2 are charged to logic "high" by switching on transistors
#1 and #2, so that both Q and QB are forced to "0". Evaluation starts when
transistor #3 turns on. When the computation completes, either node k1 or k2 is
discharged to "low", so the Completion signal can be extracted. This complete
signal informs the handshake control circuit to carry out the next event
(i.e. precharge, evaluate). Because it is dynamic logic and nodes k1 and k2 cannot
be recharged during the evaluation period, the inputs are not allowed to change
during evaluation. The N-MOS logic tree differs from the N-MOS logic circuit in
conventional dynamic logic: it is more complex as it has to evaluate two results.
Fig. 2-3 shows the circuit of a DCVSL latch. The complete signal "rdy" is the
output of a NAND gate. That means once either k1 or k2 is discharged to 0, the
complete signal is activated. A DCVSL inverter can be formed simply by exchanging
"Q" and "QB".
Fig. 2-2 precharge DCVSL circuit structure and DCVSL latch
Fig. 2-3 A DCVSL latch with ready signal output
The operation of a DCVSL circuit can be summarized as follows. Signal "I" is
initialized to "low" to precharge k1 and k2. After all input data are valid, "I" is
forced to "high" to perform evaluation. When the computation completes, either k1
or k2 becomes "low" and the "rdy" signal is activated. According to the working
principles of a DCVSL circuit, the handshake control circuit in a DCVSL system
should be able to
generate appropriate precharge and evaluate signals for the DCVSL data path
circuits, and handle the request and acknowledge signals from other self-timed
blocks.
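The precharge/evaluate cycle summarized above can be captured in a small behavioural model. This is a logic-level sketch only, not a transistor-level description: the class name is hypothetical, and the mapping of Q to node k1 and QB to node k2 is an assumption made for illustration.

```python
class DcvslGate:
    """Behavioural model of one precharged DCVSL gate.

    func is the Boolean function realized by the N-MOS logic tree.
    Assumption of this sketch: Q is the inverse of node k1, QB of node k2.
    """

    def __init__(self, func):
        self.func = func
        self.k1 = 1
        self.k2 = 1          # power-up in the precharged state

    def precharge(self):
        """Control signal low: both internal nodes are charged high."""
        self.k1 = 1
        self.k2 = 1

    def evaluate(self, *inputs):
        """Control signal high: exactly one internal node is discharged."""
        if self.rdy:
            raise RuntimeError("gate must be precharged before it is evaluated again")
        if self.func(*inputs):
            self.k1 = 0
        else:
            self.k2 = 0

    @property
    def q(self):
        return 1 - self.k1

    @property
    def qb(self):
        return 1 - self.k2

    @property
    def rdy(self):
        """Completion signal: NAND of the two internal nodes."""
        return 1 - (self.k1 & self.k2)


gate = DcvslGate(lambda x, y: x & y)   # a two-input AND as the example function
states = [(gate.q, gate.qb, gate.rdy)]
gate.evaluate(1, 1)
states.append((gate.q, gate.qb, gate.rdy))
gate.precharge()
gate.evaluate(1, 0)
states.append((gate.q, gate.qb, gate.rdy))
print(states)
```

The guard in evaluate reflects the rule that the internal nodes cannot be recharged during evaluation, so a new evaluation is only meaningful after a precharge.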
2.3.2 Handshake Control 
Fig. 2-4 Self-timed system with DCVSL data-path
Fig. 2-4 shows a simple pipelined DCVSL self-timed system. It is similar to
the block diagram of the basic self-timed system shown in Fig. 2-1, except that it
is constructed as a pipeline. The system consists of two parts, handshake control
and the DCVSL data path. Based on the operating principle described above, the
handshake control circuit can be designed by first planning the sequence of
handshake signals. Since the DCVSL circuit must be precharged before each
operation, the handshake control protocol must be in four-phase format, i.e. a
Return-to-Zero system.
Fig. 2-5 shows the Signal Transition Graph (STG) of a handshake control
protocol which was originally proposed by Tan et al. [16] and modified by Y.W.
Pang et al. [17]. It is an HC protocol with a parallelism degree of 2. Fig. 2-6
shows the timing diagram of a four-stage pipelined DCVSL system using this
protocol. Each stage cycles through four states ("Evaluation", "Hold",
"Precharge", "Enable", "Evaluation"), so there are three states between two
evaluation states. The state following "Evaluation" must be "Hold", and the state
preceding it must be "Enable".
Fig. 2-5 STG and the circuit of handshake control circuit
(parallelism degree = 2)
Stage 1: Hold, Precharging, Enabled, Evaluating, Hold, Precharging
Stage 2: Evaluating, Hold, Precharging, Enabled, Evaluating, Hold
Stage 3: Enabled, Evaluating, Hold, Precharging, Enabled, Evaluating
Stage 4: Precharging, Enabled, Evaluating, Hold, Precharging, Enabled
Fig. 2-6 Timing diagram of HC protocol in Fig. 2-5
Fig. 2-7 shows the STG of an HC protocol with a parallelism degree of 3, and
Fig. 2-8 is its timing diagram. The main difference between HC protocols with
parallelism degrees of 3 and 2 is as follows. In the first case, the state before
Evaluation is allowed to change (i.e. from "Precharge" to "Enable") even while the
state of the next stage remains unchanged. In the latter case there is no such
freedom, i.e. the previous stage cannot become "Enabled" while the present stage is
in its "Hold" period. Therefore, the system throughput may be higher with an HC
protocol of parallelism degree 3.
Fig. 2-7 STG and circuit of handshake control circuit
(parallelism degree = 3)
Stage 1: Hold, Precharging, Enabled, Evaluating, Hold, Precharging
Stage 2: Evaluating, Hold, Precharging, Enabled, Evaluating, Hold
Stage 3: Enabled, Evaluating, Hold, Precharging, Enabled, Evaluating
Stage 4: Precharging, Enabled, Evaluating, Hold, Precharging
Fig. 2-8 Timing diagram of HC protocol in Fig. 2-7
Besides correctly handling the sequence of request, acknowledge, precharge
and evaluate signals, a hazard-free handshake control circuit must be semi-modular.
A circuit is semi-modular if every excited signal becomes stable only by changing
its value (e.g. for x+ -> y+, semi-modularity requires that x can go low (x-) only
after y+ has actually gone high). Moreover, the handshake control circuit must also
prevent the run-away and continual-feeding conditions shown in Fig. 2-9 and
Fig. 2-10.
(a) Run-away
Stage 1: Evaluating, Hold, Precharging, Enabled, Evaluating, Hold, Precharging
Stage 2: Enabled, Evaluating, Hold, Precharge
(b) No Run-away
Stage 1: Evaluating, Hold, Precharging, Enabled, Evaluating, Hold
Stage 2: Enabled, Evaluating, Hold, Precharge
Fig. 2-9 (a) Run-away and (b) Correct operation
(a) Continual Feeding
Stage 1: Evaluating, Hold, Precharging
Stage 2: Enabled, Evaluating, Hold, Precharging, Enabled, Evaluating, Hold
(b) No Continual Feeding
Stage 1: Evaluating, Hold, Precharging
Stage 2: Enabled, Evaluating, Hold, Precharging
Fig. 2-10 (a) Continual Feeding and (b) Correct operation
In order to implement the handshake control circuit from the STG effectively, a
special logic gate called the C-element has to be used. It was first introduced by
D.E. Muller in 1956 [8]. Fig. 2-11 shows the circuit diagrams of two different
kinds of C-element, (a) a dynamic C-element and (b) a static C-element. The
C-element is an event-AND element, i.e. its output does not change until all the
events at its inputs have happened. It is often used to synchronize concurrent
processes.
Fig. 2-11 (a) Dynamic C-element (b) Static C-element
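The behaviour of a two-input C-element can be stated in a few lines: the output follows the inputs when they agree and holds its previous value when they differ. The sketch below is a logic-level model with a hypothetical class name; it does not distinguish between the dynamic and static circuit implementations.

```python
class CElement:
    """Two-input Muller C-element, modelled at the logic level."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        """Output changes only when both inputs agree; otherwise it is held."""
        if a == b:
            self.out = a
        return self.out


c = CElement()
trace = [
    c.update(1, 0),  # only one input has risen: output holds 0
    c.update(1, 1),  # both inputs have risen: output rises
    c.update(0, 1),  # only one input has fallen: output holds 1
    c.update(0, 0),  # both inputs have fallen: output falls
]
print(trace)
```

This hold behaviour is what lets the element wait for events on both inputs before signalling, which is how it synchronizes two concurrent processes.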
A pipelined speed-independent system can be assembled easily from the
handshake control circuit, DCVSL latches and computational blocks, with legal
interconnections of the control signals. Users can choose the appropriate HC
protocol and implement the function of the data path with DCVSL elements. The
system will operate correctly regardless of the gate delay in each self-timed block
of the data path, provided that the HC protocol is semi-modular.
2.4 Micropipeline Methodology 
Micropipeline was first introduced in Ivan Sutherland's Turing Award
Lecture [18]. He assigned the name "Micropipeline" to a particularly simple form of
event-driven elastic pipeline with or without internal processing. It is called a
micropipeline because it is useful in very short lengths and is suitable for layout
in microelectronic form. The micropipeline adopts a scheme based on the
bundled-data convention, using ad hoc bundling delays for each stage of the
pipe [19].
Fig. 2-12 (a) Four-phase (b) Two-phase Bundled data convention
Fig. 2-12 shows the timing diagrams of the four-phase and two-phase bundled
data conventions. The interface between sender and receiver consists of a bundle of
data wires which carries the information (using one wire for each bit) and two
control wires: request, from the sender to the receiver, carries a transition when
the data is valid;
acknowledge, from the receiver to the sender, carries a transition when the data
has been used.
Similar to the DCVSL self-timed system, a micropipeline also consists of
handshake control, complete signal generation and data-path circuits. However, its
method of generating the complete signal is totally different from the DCVSL
approach. It uses the bounded-delay approach, which is similar to traditional
synchronous digital logic systems. An arbitrary delay element is used to generate
the complete signal. The delay value can be estimated by simulation or by
calculating the worst-case delay of the computation block, and wire delay can also
be included. Once the data bundling constraints are met, the micropipeline approach
can be considered delay-insensitive. Since the complete signal is not extracted
from the encoded data, a computation block can be implemented in single-rail logic
and the data do not have to return to zero in each cycle. So, the computation block
can be exactly the same as the one in a synchronous digital system.
2.4.1 Summary of previous design 
Many micropipeline circuits with various handshake control protocols have
been developed. This section introduces Sutherland's original 2-phase micropipeline
and the 4-phase micropipeline of a former research student, Eva Pang. Some
improvements to these circuits arising from my research work are then discussed.
Fig. 2-13 Sutherland's two-phase handshaking micropipeline
Fig. 2-13 shows the block diagram of Sutherland's micropipeline system. It
operates with a 2-phase handshaking protocol. The handshake control circuit is much
simpler than that of the DCVSL system, having only one C-element per stage. Its
storage element is called the Capture and Pass Latch (CP-Latch), which is an
event-controlled storage element. When there is a change of phase in the 'Ri'
signal, data is captured and stored in the latch. Once the next stage's latch has
captured the data, the phase of the acknowledge signal 'Ai' changes, and the
CP-Latch returns to transparent mode, ready to capture the next data.
Fig. 2-14 Sutherland's CP Latch (For two-phase handshaking)
This micropipeline is very simple and effective. However, the circuit of its
CP-Latch is rather complicated and requires two control signals, 'Capture' and
'Pass'. Also, before these signals are fed to the next and previous stages, they
have to be delayed for a moment to make sure the latch has captured the data. So
there are four control terminals in total. Moreover, the transparent mode of this
latch makes feedback difficult to implement.
Fig. 2-15 Micropipeline with CP-Latch
Fig. 2-15 shows a micropipeline with 4-phase handshake signaling which was
developed by Eva Pang. The hardware structure and operation of this system differ
from Sutherland's circuit because a different CP-Latch is used. Sutherland's
CP-Latch is sensitive to both positive and negative phases, while Eva's CP-Latch
operates like a master-slave D-type flip-flop, which means that it is sensitive to
only one phase transition. Its circuit diagram is shown in Fig. 2-16. This
micropipeline can therefore only operate in 4-phase handshake signaling mode.
Fig. 2-16 Eva's CP Latch
Fig. 2-17 Four-phase Handshaking Control protocol and circuit
In order to minimize the additional delay caused by 4-phase handshaking, an
additional C-element is added, which is equivalent to inserting one more self-timed
stage. The handshake control protocol used is described in Fig. 2-17. Its advantage
is that once the second stage's latch has captured and passed the data, the first
stage's latch can capture and pass the next data. If there were only one C-element,
the first stage's latch could not capture the next data until the third stage's
latch had captured the data.
Implementing a micropipeline is thus easier than implementing a speed
independent self-timed system with the DCVSL structure (i.e. one only needs to
replace the registers in a traditional synchronous digital logic system by
CP-Latches and to add appropriate
handshake control circuits and delay elements). The system throughput depends on
the values of the delay elements used plus the gate delays of the handshake control
circuits.
Fig. 2-18 Voltage controlled delay element
The delay element is very important in a micropipeline system. Its delay must
be long enough to guarantee that the computational block completes its operation.
However, if its delay is much longer than the actual delay of the computational
block, the whole system will be slowed down. In designing a micropipeline, we can
only estimate the delay value by simulation, and the estimate is usually not very
accurate since there are many uncertainties in the fabrication process. So Eva used
an adjustable delay element which can be controlled by a reference voltage, as
shown in Fig. 2-18. It was modified from K.M. Yue's adjustable delay element [20].
Its minimum delay is set by the width-to-length ratio of the MOS transistors. An
analog input VC can further adjust the resistance of a MOS transistor, and hence
the delay value. So the delay value can be adjusted even after the chip is
fabricated.
2.4.2 New Micropipeline structure and improvements 
Having reviewed previous micropipeline designs, we now consider how to
improve them. The micropipeline circuits described above are composed of a
handshake control circuit, delay elements, CP-latches and standard CMOS logic
circuits. The CMOS logic computational block is the same as its synchronous system
counterpart. Attention was therefore focused on the handshake protocol, delay
element and storage element.
Firstly, let us compare 2-phase and 4-phase handshaking. In a 4-phase
handshaking system, all control signals such as the request and acknowledge signals
have to return to their initial values after every complete cycle. So the total
cycle time for a 4-phase handshaking system is equal to twice the delay element's
delay time plus the gate delay of the C-elements, while the total cycle time for a
2-phase handshaking system is equal to one delay element's delay plus the gate
delay of the C-element. To do the same thing, 4-phase handshaking therefore needs
about twice as long as 2-phase handshaking. However, Sutherland's CP-latch for the
2-phase micropipeline is complicated, and additional control circuits must be
added in order to handle a feedback system.
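The cycle-time comparison can be written out as a small calculation. This sketch follows the wording above literally (one C-element delay per cycle in both cases); the function names and the example figures of 10 ns and 1 ns are hypothetical and are not measurements from this work.

```python
def cycle_time_4phase(t_delay, t_c):
    """4-phase: request passes through the delay element twice per cycle."""
    return 2 * t_delay + t_c


def cycle_time_2phase(t_delay, t_c):
    """2-phase: request passes through the delay element once per cycle."""
    return t_delay + t_c


# Hypothetical figures: a 10 ns matched delay and a 1 ns C-element delay.
t4 = cycle_time_4phase(10, 1)
t2 = cycle_time_2phase(10, 1)
print(t4, t2, t4 / t2)
```

When the matched delay dominates the C-element delay, the ratio approaches 2, which is the sense in which 4-phase handshaking takes about twice as long.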
To solve these problems, a new micropipeline structure has been developed, and
a series of solutions have been found to enhance the performance of the
micropipeline. Fig. 2-19 shows the new micropipeline structure, which is a 2-phase
handshaking system with D flip-flops as registers.
Fig. 2-19 Micropipeline with D Flip-Flop
In this design, a D flip-flop is used instead of the CP-latch. Since the
flip-flop responds to only one phase change of its clock input, the control becomes
easy, especially for feedback control. In addition, the whole data path structure
is exactly the same as its synchronous system counterpart. The advantage is that a
synchronous system can be converted to a self-timed system just by adding a
handshake control circuit. A question arises: since a D flip-flop is a
single-edge-sensitive storage element, how can it be used in a 2-phase handshaking
system? The answer is that some modification of the handshake control circuit must
be made.
Fig. 2-20 (a) A new Handshaking Control circuit and (b) its timing diagram
Fig. 2-20 (a) shows the handshake control circuit for this new micropipeline.
It is similar to the previous handshake circuit except that an XOR gate is added;
the signal transition graph is the same. In previous designs, both the "request"
and "acknowledge" signals are used to control the CP-latch. Now, an XOR gate is
used to derive a signal that controls the D flip-flop from the "request" and
"acknowledge" signals, as shown in Fig. 2-20 (b). So two pulses are produced in
each cycle.
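How an XOR gate turns 2-phase transitions into clock pulses can be sketched with sampled signal levels. The waveforms below are idealized and hypothetical (one sample per handshake event), and the function names are assumptions; the sketch only illustrates the principle, not the exact timing of Fig. 2-20 (b).

```python
def xor_clock(request, acknowledge):
    """Clock for the D flip-flop: high while request and acknowledge differ."""
    return [r ^ a for r, a in zip(request, acknowledge)]


def rising_edges(signal):
    """Count the 0 -> 1 transitions, i.e. the edges that clock the flip-flop."""
    return sum(1 for prev, cur in zip(signal, signal[1:]) if prev == 0 and cur == 1)


# One full period of a 2-phase request: it toggles up, is acknowledged,
# toggles down, and is acknowledged again (two handshake events).
request     = [0, 1, 1, 0, 0]
acknowledge = [0, 0, 1, 1, 0]

clock = xor_clock(request, acknowledge)
print(clock, rising_edges(clock))
```

Each request transition, whether rising or falling, produces a rising clock edge, so a single-edge-triggered flip-flop captures data on every 2-phase event.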
There are some advantages to this approach. In the previous CP-latch system, an
extra delay element (usually small) is added to the "Capture" and "Pass" signals to
produce the "Cd" and "Pd" signals. The purpose of these additional delays is to
ensure that the previous stage's latches pass new data only after the next stage
has captured the data. In the new design, since the clock signal "lro" for the
D flip-flop is derived from the request and acknowledge signals, a delay equivalent
to one XOR gate delay is introduced.
In addition to the 2-phase handshaking system, a 4-phase handshaking system can
be built on this D flip-flop structure. Using the micropipeline designed by Eva, we
just need to replace all the CP-latches by D flip-flops. The handshake control
circuit remains unchanged, i.e. it is the one shown in Fig. 2-17. The voltage
controlled delay (symmetrical delay) was also modified for this structure, i.e. by
using P-MOS devices for both voltage control transistors, as shown in Fig. 2-21.
Fig. 2-21 Circuit diagram of the new Symmetrical Delay element
Based on these 2-phase and 4-phase micropipeline structures and the delay
element, further improvement techniques have been developed; they are
discussed in the following sections.
2.4.2.1 Asymmetrical delay 
As mentioned before, the total cycle time of a 4-phase handshaking system is
at least twice that of the 2-phase system. This is because all signals have to
return to their initial values, so "request" has to pass through the delay
element twice in each cycle, i.e. from "low" to "high" and then from "high" to "low".
The use of an asymmetrical delay element [27] in a 4-phase handshaking system is
a way to minimize the time wasted in the return period, since the time from the
negative transition of ro (present stage) to that of ri (next stage) can be
reduced by passing the signal through a delay element which has a very short
delay for this transition. The element is called asymmetrical because its
propagation delays for the positive and negative transitions are not equal.
Fig. 2-22 shows the circuit diagram of an asymmetrical delay element, which is
also modified from K. M. Yue's adjustable delay element. Delay values
are achieved by appropriately sizing the W/L ratios of the transistors. For example,
the W/L of the N-MOS transistor of the first stage and the P-MOS transistors of the second




Fig. 2-22 Circuit diagram of Asymmetrical Delay element 
The control voltage VC varies the resistance of the transistor and, in turn, the
RC charging time constant. The output buffer consists of two normal inverters,
which reduce the rise and fall times of the output signal. Unlike the previous
design, the VC-controlled transistor is a P-MOS instead of an N-MOS, as its
resistance is larger than that of an N-MOS with the same W/L ratio.
Fig. 2-23 shows the SPICE simulation result with ES2 CMOS 0.7 µm process
parameters. The low-to-high propagation delay varies from 10 ns to 30 ns
depending on the control voltage (VC), while the high-to-low propagation delay
is much smaller and is independent of VC. The simulation result also shows that
the propagation delay does not increase linearly with VC.
[SPICE plot: 10 ns asymmetric delay element, ES2 CMOS 0.7 µm process; output (Out) waveforms against time for Vc = 0 V, 1.5 V, 2 V and 3 V]
Fig. 2-23 Characteristic of Asymmetrical Delay element 
The charging/discharging equation of the capacitor is:
$$I = C \frac{dV_{out}}{dt}$$
where, in the saturation region,
$$I_{ds(sat)} = \frac{\mu \varepsilon W}{2 t_{ox} L} (V_{gs} - V_{th})^2$$
So the charging and discharging time (delay) is inversely proportional to the
square of (Vgs - Vth), and the delay increases towards infinity when Vgs < Vth,
because the control transistor is then completely cut off.
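The inverse-square dependence can be checked numerically with a small sketch. It assumes a constant saturation current while the load is charged, and all component values below (load capacitance, voltage swing, transconductance factor `k` standing for µεW/(2·tox·L), threshold voltage) are hypothetical, not ES2 process parameters.

```python
import math


def saturation_current(k, v_gs, v_th):
    """I_ds(sat) = k * (V_gs - V_th)^2, and zero when the device is cut off."""
    overdrive = v_gs - v_th
    return k * overdrive ** 2 if overdrive > 0 else 0.0


def charge_time(c_load, delta_v, k, v_gs, v_th):
    """Time to move the load by delta_v at constant current: t = C * dV / I."""
    current = saturation_current(k, v_gs, v_th)
    return math.inf if current == 0.0 else c_load * delta_v / current


# Hypothetical values: 1 pF load, 2.5 V swing, k = 100 uA/V^2, V_th = 0.8 V.
C_LOAD, DELTA_V, K, V_TH = 1e-12, 2.5, 1e-4, 0.8

t_strong = charge_time(C_LOAD, DELTA_V, K, v_gs=2.8, v_th=V_TH)  # 2.0 V overdrive
t_weak = charge_time(C_LOAD, DELTA_V, K, v_gs=1.8, v_th=V_TH)    # 1.0 V overdrive
t_off = charge_time(C_LOAD, DELTA_V, K, v_gs=0.5, v_th=V_TH)     # below threshold

print(t_weak / t_strong, t_off)
```

Halving the overdrive voltage quadruples the delay in this model, and a gate voltage below the threshold gives an unbounded delay, in line with the statement above and with the non-linear dependence on VC seen in the simulation.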
2.4.2.2 Variable Delay and Delay Value Selection 
Although the micropipeline has many advantages over DCVSL systems, it
achieves only worst-case delay rather than average-case delay. Every self-timed
block may have a different delay element, but within a given block this
worst-case delay is fixed for all situations. In some cases, the whole system
could be sped up if we could select or vary the value of the delay element according to
different situations or input data patterns. Muscato et al. proposed a locally
clocked microprocessor that uses different delay values for different operations
in an ALU [21]. This approach was investigated further, and circuits were
developed for both 2-phase and 4-phase micropipeline systems.
There are several methods to vary the propagation delay of a delay element.
One obvious way is to vary the control voltage of the adjustable delay element
described previously. A reference voltage bus can be built just like the power
and ground busses. This analog voltage bus consists of several analog lines,
each of which carries a different voltage. The control voltage input can then
be connected to this reference voltage bus through analog switches and
multiplexers, which are controlled by digital logic circuits. Alternatively, a
local digitally controlled reference voltage generator can be used. However,
both methods are too complicated and require a large amount of silicon area.
[Block diagram: In, two Asym Delay elements, 2-to-1 Mux controlled by Select]
Fig. 2-24 Delay selection in Micropipeline for 4-phase handshaking system 
A much simpler method is to use two or more fixed-value delay elements with
multiplexers, as shown in Fig. 2-24, which is the delay selection circuitry for
the 4-phase system. When a short delay is required, the multiplexer selects the
shorter delay path. Asymmetrical delay elements can also be used in the 4-phase
handshaking system, so that they can be discharged back to zero in a very short time.
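The benefit of combining delay selection with asymmetrical delay elements can be sketched with a simple timing model. The model makes several assumptions that are not taken from the thesis: the two delay elements are in series, the multiplexer taps the request after the first element (short path) or after the second (long path), the multiplexer delay is ignored, and the rise and fall delays are hypothetical round numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DelayElement:
    rise_ns: float  # delay of the low-to-high transition of "request"
    fall_ns: float  # delay of the high-to-low (return-to-zero) transition


def request_path_time(stages, select_long):
    """Time one 4-phase cycle spends in the request delay path.

    The request passes the selected path twice per cycle: once rising
    (the useful computation window) and once falling (return to zero).
    """
    used = stages if select_long else stages[:1]
    rise = sum(stage.rise_ns for stage in used)
    fall = sum(stage.fall_ns for stage in used)
    return rise + fall


# Hypothetical elements: same 10 ns rising delay, different falling delays.
asymmetrical = [DelayElement(10.0, 1.0), DelayElement(10.0, 1.0)]
symmetrical = [DelayElement(10.0, 10.0), DelayElement(10.0, 10.0)]

for name, stages in (("asymmetrical", asymmetrical), ("symmetrical", symmetrical)):
    print(name,
          request_path_time(stages, select_long=False),
          request_path_time(stages, select_long=True))
```

Under these assumptions the computation window (the rising delay) is identical for both kinds of element, while the time lost in the return-to-zero phase shrinks with the asymmetrical elements on both the short and the long path.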
Fig. 2-25 (a) is the first design for the 2-phase handshaking system. It contains
one more multiplexer and uses symmetrical delay elements, because both the
"high" to "low" and the "low" to "high" transitions are used. Suppose the first
transition changes from "low" to "high" with the short delay path selected, and
the long delay path is selected in the next period; the output of the
second-stage delay should then be "high" initially and change to "low". So one
more multiplexer is added before the second-stage delay, which then operates at
the same time as the first-stage delay while the short delay path is selected.
Moreover, the delay times of these two delay elements have to be the same.
Fig. 2-25 (b) shows a much simpler circuit which has the same function as the
first design. It can also be used in a 4-phase handshaking system.
[Block diagrams (a) and (b): In, 2-to-1 Mux, Delay elements, Select, Out]
Fig. 2-25 Delay selection in Micropipeline for 2-phase handshaking system 
(a) first design (b) simplified 
With the asymmetrical delay element and the delay-selection technique, the
micropipeline can be further improved and sped up with only a limited hardware
overhead. In fact, this additional circuit (only one MUX) does not introduce
any timing penalty, because its delay can be counted as part of the total delay
and the value of the delay element reduced accordingly. This variable-cycle-time
self-timed system can speed up the operation of the whole system, especially in
feedback systems such as recursive operations that require many iterations.
This self-timed method was therefore adopted in my ICT processor design.
2.5 Comparison between DCVSL and Micropipeline 
This chapter discussed two established design methods for self-timed
systems: the DCVSL structure and the micropipeline. Considering only their
operation and hardware complexity, the micropipeline is more appropriate for
implementing large-scale digital systems because of its higher operating speed
and smaller area overhead in both the control path and the data path.
Since DCVSL requires a precharge in each cycle, only the 4-phase handshake
protocol can be adopted. The precharge period is useless for computation. Even
assuming that its evaluation speed is the same as that of a standard CMOS logic
circuit, the precharge operation makes its total cycle much longer, as the
precharge time is normally close to the evaluation time. As a result, the
overall cycle time is nearly double that of CMOS logic. Moreover, the additional
NAND gate in each DCVSL circuit (which is used to extract the complete signal)
further degrades its speed.
In general, the micropipeline is more suitable for long word-length computation
and highly regular pipeline structures, since the complexity of a DCVSL system
is proportional to its word-length. Firstly, DCVSL is a dual-rail system, so the
number of data lines is doubled. Secondly, as the word-length increases, the
number of internal complete signals increases, which makes the whole system more
complicated. The case of the micropipeline is totally different: no matter how
long the word-length is, the handshake control hardware per self-timed stage is
fixed, and no additional circuit is required in its data path. So the percentage
overhead of the micropipeline falls as the word-length of the system increases.
In addition, comparing just the circuit structures of DCVSL and standard CMOS
logic, DCVSL is more complicated. The difference is clearest when they are used
to implement very simple logic functions such as NOT, or NAND and NOR functions
with few inputs.
The DCVSL system can be simplified by using dual-rail logic (DCVSL) only in
the maximum-delay path and single-rail logic (standard CMOS) in the rest, to
minimize the overall area of the data path. However, the operating speed of the
whole system remains unchanged because time is still wasted in the precharge
periods, so there is no delay reduction in the worst-delay path.
On the other hand, the micropipeline also has disadvantages. A micropipeline
system is actually a locally clocked system, so more design effort is needed to
determine the worst-case delay of each self-timed block. By contrast, the DCVSL
structure is truly asynchronous and saves a lot of simulation work. It is very
convenient for small pipeline systems with non-regular structure. Some ways of
improving these basic self-timed structures have been introduced. Finally, the
use of the 2-phase handshake protocol, variable delay, asymmetrical delay and
the current-sensing completion detection technique in the micropipeline designs
shows that self-timed logic is attractive for implementing iterative and
recursive systems. Testing results from the self-timed matrix multiplier designs
in the next chapter provide further evidence that the micropipeline is better
than DCVSL in terms of speed and silicon area.
Chapter Three
3. Self-timed Multipliers 
3.1 Introduction 
Now that two self-timed design methodologies have been studied, some
application circuits and test chips are demonstrated. Besides investigating
self-timed handshake control protocols and system architecture, I designed many
self-timed application circuits and test chips for verification and comparison
during the past two years of my research, for example DCVSL and micropipeline
parallel adders and multipliers. In this chapter, only two representative
circuits, the "bit-serial matrix multiplier" and the "Booth's multiplier", are
discussed, because only these two application circuits and the final version of
the ICT processor were implemented and fabricated as CMOS VLSI chips; the other
circuits were verified only at the simulation level.
The objective of designing these application circuits is to compare the
performance of different DCVSL and micropipeline structures with different
handshake control protocols. I first started with a very simple design, the
"bit-serial matrix multiplier", which was originally developed by another
research student, Eva Pang. She had designed a self-timed bit-serial matrix
multiplier and implemented it in a
CMOS test chip. I followed her work and modified the circuits with the new
handshaking control protocol and the new circuit structure. The improved
circuits were implemented in the second test chip, which was jointly developed
by Eva and me.
After summarizing the experience and testing results of these test chips, I
concentrated my research on the micropipeline system and the current-sensing
technique. The third test chip therefore contains only "Booth's multipliers"
based on the micropipeline structure with three different handshaking control
protocols, and two current-sensing self-timed accumulators. Unfortunately,
because of a layout mistake, this test chip does not work properly. However,
some of the conclusions about the micropipeline handshaking system made in the
previous chapter can still be supported by simulation results.
In this chapter, the example circuits and the three test chips mentioned above
are discussed. Firstly, Eva's matrix multiplier, her test chip and its testing
results are reviewed. This is followed by a summary of the design and testing
results of our jointly designed second test chip. Finally, the design and
implementation of the micropipeline Booth's multipliers with three different
handshake control techniques are discussed.
3.2 Design Example 1 : Bit-serial matrix multiplier 
The first design example is a bit-serial matrix multiplier designed by Eva
Pang [22]. It performs a one-bit [1x8] x [8x1] matrix multiplication. This
circuit is quite simple in hardware and operates in recursive mode. A test chip
was fabricated containing two multipliers, one based on DCVSL and one on the
micropipeline system. Both multipliers use 4-phase handshake control signaling.
Fig. 3-1 shows the block diagram of the matrix multiplier. The first stage of
the multiplier is a logic block with an AND function, which performs the 1-bit
multiplication. The second stage is a 4-bit accumulator which sums up the eight
partial products. A 4-bit counter operates concurrently with the AND block and
the accumulator block and counts the number of partial multiplications or
accumulations.
[Block diagram: Inputs, AND, Accumulator, Counter, Outputs]
Fig. 3-1 Block diagram of the Bit-serial matrix multiplier 
One complete multiplication requires 8 cycles (8 AND and feedback accumulation
operations). Eight 1-bit multiplicands and multipliers are shifted into the
circuit serially under the control of the handshake signals. For example, when
the AND stage finishes its operation, it acknowledges the input stage to request
the next data. After 8 cycles of AND and accumulate operations, a complete
signal is generated.
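The data flow of this multiplier can be summarised by a short behavioural sketch. It models only the arithmetic (AND, 4-bit accumulation, 4-bit cycle counter) and ignores the handshake timing entirely; the function name `bit_serial_dot` and the way the complete signal is derived from the counter are assumptions made for illustration.

```python
def bit_serial_dot(a_bits, b_bits):
    """One-bit [1x8] x [8x1] product: AND each bit pair and accumulate.

    Returns (accumulator value, complete flag). The accumulator and the
    cycle counter are both modelled as 4-bit registers.
    """
    if len(a_bits) != 8 or len(b_bits) != 8:
        raise ValueError("expected eight 1-bit values per operand")
    accumulator = 0
    counter = 0
    for a, b in zip(a_bits, b_bits):
        if a not in (0, 1) or b not in (0, 1):
            raise ValueError("operands must be single bits")
        partial = a & b                              # first stage: 1-bit multiply
        accumulator = (accumulator + partial) & 0xF  # second stage: 4-bit accumulate
        counter = (counter + 1) & 0xF                # counts the cycles performed
    complete = counter == 8                          # asserted after the eighth cycle
    return accumulator, complete


print(bit_serial_dot([1, 0, 1, 1, 0, 0, 1, 0], [1, 1, 0, 1, 0, 1, 1, 0]))
```

The largest possible result is 8 (all sixteen input bits set), which is why a 4-bit accumulator is sufficient.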
3.2.1 DCVSL design 
In the multiplier with the DCVSL structure, all logic gates and latches in the
data path are DCVSL gates. Two different handshaking control protocols are used
in the control path. The one used in the AND block is the basic 4-phase HC
protocol (parallelism degree of 2) described in chapter 2, while the other is
specially designed for handling feedback-path control signals [2,3] in the
accumulator and counter circuits. The corresponding signal transition graph is
shown in Fig. 3-2; its parallelism degree is also 2. The circuit diagram of
Eva's DCVSL matrix multiplier is shown in Appendix A-2.
Fig. 3-2 STG of Feedback path handshaking control protocol 
3.2.2 Micropipeline design 
In the micropipeline multiplier, all logic gates in the data path are standard
CMOS logic gates (ES2 standard cells), while the C-element, CP-latch and delay
element are custom designs. The circuit uses the 4-phase handshake control
protocol with CP-latches and delay elements, the same as the one described in
section 2.4 (Eva's micropipeline). The minimum delay element values used are
7 ns and 10 ns for the AND block and the accumulator/counter block respectively, and
these values can be increased further by the user (global adjustment only).
3.2.3 The first test chip 
The test chip was fabricated in the European Silicon Structures (ES2) 1.2 µm
CMOS process. It contains the two bit-serial matrix multipliers designed with
the methods described in the previous sections. Fig. 3-3 is the layout view of
the first test chip. Besides the two multipliers, it includes some DCVSL logic
gates for performance measurement.
Fig. 3-3 Layout of Eva Pang's first test chip
                 Area        Latency    Throughput
DCVSL            0.36 mm²    50 ns      2.3 MHz
Micropipeline    0.12 mm²    40 ns      3.2 MHz
Table 3-1 Comparison of DCVSL and Micropipeline multipliers in the first test chip
These two self-timed multipliers were tested, and Table 3-1 summarizes the
measured results, averaged over measurements of five chips. The micropipeline
structure is found to be much faster than the DCVSL circuit. This agrees
closely with the SPICE and Verilog simulation results and with our earlier
prediction. The latency
time (the delay from the first request signal to the first interim partial
product output) of the DCVSL circuit is 20% longer. This indicates that the
actual data processing time required is longer. On the other hand, the lower
throughput rate (averaged output data rate) of the DCVSL multiplier reflects the
poor efficiency of its handshake signaling, since the handshake control protocol
has a parallelism degree of only 2. In addition, time may be wasted in precharge
periods.
Although the results show that the micropipeline is faster than the DCVSL
circuit, it is still too slow compared with its synchronous counterpart. In the
4-phase micropipeline, the computation block operates in the time between the
positive transition of ro (present stage) and that of ri (next stage), which is
defined by the value of the delay element (the worst-case delay of the block).
Unfortunately, in 4-phase signaling both ro and ri have to return to zero before
the next cycle, so time is also wasted from the negative transition of ro
(present stage) to that of ri (next stage). Moreover, as mentioned before, each
stage of the handshake control circuit should have two C-elements in order to
achieve a higher degree of parallelism, but this may also add extra delay in the
control and data paths. Besides, the delay imposed on each block is much longer
than the actual delay, so the system throughput is not optimized; that is, the
minimum delay of each delay element is longer than the actual delay of the
computation circuit.
Since the interconnects in the DCVSL structure are twice the number required in
an equivalent single-rail circuit, and the area overhead of the complete signal
extraction circuitry is also very large, the area of the DCVSL multiplier is 100% larger than
that of the micropipeline. As the number of I/Os increases, more hardware is
required to determine the complete signal in a DCVSL system, so the
micropipeline is more suitable for long word-length computation and regular
pipeline structures.
3.2.4 Second test chip 
The second test chip, shown in Fig. 3-4, was jointly designed by Eva and me
and was fabricated in the same technology. This chip contains my own improved
designs of the matrix multipliers for comparison with Eva's designs. They are
functionally the same as the previous circuits. The chip also has a large number
of DCVSL gates designed by Eva for testing and characterization.
Fig. 3-4 Layout of the second test chip 
Based on the findings from the first test chip, I made some improvements to the
matrix multipliers. For the new DCVSL multiplier, I found that the handshake
control protocol with a parallelism degree of 3 can handle feedback systems very
well without additional circuitry. The circuit is very simple: all the HCCs in
the previous design (i.e. the HCC with a parallelism degree of 2 and the special
feedback HCC) are replaced by the HCC with a parallelism degree of 3. This not
only greatly
simplifies the design process for DCVSL with a feedback path, but also reduces
the hardware overhead. In addition, as its parallelism degree is 3, the
operating speed should be much higher. The circuit diagram of this improved
DCVSL multiplier is shown in Appendix A-3.
For the new micropipeline, the 2-phase handshaking system with the D flip-flop
approach described in section 2.4.2 is used instead of the 4-phase system, so
the "return to zero" period of the "request" and "acknowledge" signals can also
be used for active operation. This 2-phase structure can also handle a feedback
path naturally with its basic handshake signals, without additional hardware or
modification. Moreover, the minimum delay values of the delay elements were
reduced to 5 ns and 7 ns for the AND block and the accumulator/counter block
respectively. The circuit diagram of this improved micropipeline multiplier is
shown in Appendix A-4.
Table 3-2 shows the testing results of the two improved circuits in the second
test chip. As expected, the throughput and latency of both the DCVSL and the
micropipeline circuits are much better than those of the previous designs. The
micropipeline circuit with the 2-phase handshaking system is almost four times
faster than the DCVSL one.
                 Area        Latency    Throughput
DCVSL            0.31 mm²    40 ns      3.35 MHz
Micropipeline    0.16 mm²    20 ns      12.3 MHz
Table 3-2 Comparison of DCVSL and Micropipeline multipliers in the second test chip
Although the micropipeline circuit is much faster than the one using the DCVSL
structure, the micropipeline circuits in the above example operate with the
worst-case delay in each self-timed block, which limits the performance of this
kind of self-timed
system. The techniques discussed in chapter 2 for overcoming this weakness of
the micropipeline have been applied to another circuit, the "Booth's
multiplier", which is described in the next section.
3.3 Design Example 2 - Modified Booth's Multiplier 
The improvements to the micropipeline system proposed in section 2.4.2 were
applied to the design of a parallel multiplier using the Modified Booth's
algorithm. The theory of the Modified Booth's algorithm is described first.
This multiplier is used for design verification because it demonstrates the
advantages of the newly developed micropipeline structure very well.
In the Modified Booth's algorithm, an encoding technique is used to halve the
number of partial products, i.e. an n-bit multiplier generates n/2 partial
products. The multiplier is divided into sub-strings of 3 bits, with adjacent
groups sharing a common bit. Table 3-3 is the decoding table for the eight
permutations of the 3 multiplier bits [24]. Each partial product equals (Y)
multiplied by a scaling factor (F), and the final product equals the sum of the
four partial products. The scaling factors of the first, second, third and
fourth operations are 1, 4, 16 and 64 respectively.
Bit pattern (Bn)    Operation                             (Yn)
000                 add zero                              + 0
001                 add multiplicand                      + A
010                 add multiplicand                      + A
011                 add twice the multiplicand            + 2A
100                 subtract twice the multiplicand       - 2A
101                 subtract the multiplicand             - A
110                 subtract the multiplicand             - A
111                 subtract zero                         - 0
Table 3-3 Decoding table of 3 multiplier bits
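$$P = A \times B = \sum_{n=1}^{4} F_n \times Y_n \quad \text{where } F_1 = 1,\ F_2 = 4,\ F_3 = 16,\ F_4 = 64$$
If $B = b_7, b_6, b_5, b_4, b_3, b_2, b_1, b_0$, then
$$B_1 = b_1, b_0, 0 \qquad B_2 = b_3, b_2, b_1 \qquad B_3 = b_5, b_4, b_3 \qquad B_4 = b_7, b_6, b_5$$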
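The decoding table and the summation above can be exercised with a short software sketch. It is a behavioural model only, not the chip's data path, and it assumes that both operands are 8-bit two's complement numbers; the names `BOOTH_DIGIT`, `booth_groups` and `booth_multiply` are introduced here for illustration.

```python
# Decoding table (Table 3-3): 3-bit group -> multiple of the multiplicand A.
BOOTH_DIGIT = {
    0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
    0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0,
}


def booth_groups(b):
    """Return the overlapping 3-bit groups [B1, B2, B3, B4] of an 8-bit value."""
    extended = (b & 0xFF) << 1  # append the implicit 0 below b0
    return [(extended >> (2 * n)) & 0b111 for n in range(4)]


def booth_multiply(a, b):
    """Multiply two signed 8-bit integers as P = sum of F_n * Y_n."""
    product = 0
    for n, group in enumerate(booth_groups(b)):
        scaling = 4 ** n              # F_n = 1, 4, 16, 64
        y = BOOTH_DIGIT[group] * a    # Y_n = 0, +/-A or +/-2A
        product += scaling * y
    return product


print(booth_groups(0x71), booth_multiply(0x56, 0x71))
```

For example, the multiplier 0x71 (binary 01110001) decodes to the digits +1, 0, -1 and +2, and 1 - 16 + 128 = 113 = 0x71, so only three of the four cycles need an addition or subtraction.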
3.3.1 Circuit Design 
For an 8-bit multiplier, the input data is divided into 4 overlapping 3-bit
groups, and the final result is obtained after 4 cycles of operation. Each 3-bit
group is decoded to generate the signals that control the shift-register and
full-adder circuits to perform operations such as shift, addition or
subtraction. Multiplication by the scaling factor is achieved simply by shifting
the multiplicand by two bits after each cycle. Since no addition or subtraction
is required if the bit pattern is either 000 or 111, the computation time of the
accumulation is then zero and only the shift operation is required. Fig. 3-5
shows the block diagram of the 8x8 Modified Booth's multiplier, and the circuit
diagram is in Appendix B-2.
[Block diagram: 8-bit Multiplier and 8-bit Multiplicand inputs, two Load/Shift registers, Booth's decoder, Shift-register, 16-bit Carry Look-ahead Adder, Handshake Controller (ao, ri), 16-bit Data out]
Fig. 3-5 Block diagram of the 8x8 vector multiplier with Modified Booth's Algorithm
When two 8-bit input data words and an external request signal are received,
the central handshake control circuit generates a series of pulses (internal
request signals) to start the calculation. The Booth decoder determines the type
of operation to be carried out and then selects the appropriate delay value for
the third pipeline stage. If an "add/subtract zero" operation is required, a
delay element with a shorter delay is selected; otherwise a longer delay path is
selected. Thus the throughput of the system depends on the data pattern, and it
is possible to obtain a higher speed than with the fixed-delay approach or the
synchronous method. The delay selection method for micropipeline self-timed
systems described earlier thus enables the circuit to increase its average
calculation speed.
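How the data pattern determines the number of long-delay cycles can be sketched as follows. The grouping follows the Modified Booth's decoding described above; the two delay values and the simple sum over four cycles are illustrative assumptions and do not model the handshake overhead of the real circuit.

```python
def long_delay_cycles(multiplier):
    """Count the Booth groups of an 8-bit multiplier that need an add/subtract.

    Groups 000 and 111 mean "add/subtract zero", so the short delay path
    can be selected for those cycles.
    """
    extended = (multiplier & 0xFF) << 1  # implicit 0 below bit 0
    groups = [(extended >> (2 * n)) & 0b111 for n in range(4)]
    return sum(1 for group in groups if group not in (0b000, 0b111))


def accumulate_stage_time(multiplier, short_ns, long_ns):
    """Total delay selected for the accumulate stage over the four cycles."""
    n_long = long_delay_cycles(multiplier)
    return n_long * long_ns + (4 - n_long) * short_ns


# Hypothetical delay values (ns) for the short and long paths.
SHORT_NS, LONG_NS = 5.0, 10.0
for value in (0x00, 0xFF, 0x71, 0x55):
    print(hex(value), long_delay_cycles(value),
          accumulate_stage_time(value, SHORT_NS, LONG_NS))
```

A multiplier of all zeros needs no long cycle, all ones needs one, and an alternating pattern such as 0x55 needs all four, so the average speed depends on the statistics of the input data.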
3.3.2 Simulation result 
[Verilog waveform: signals Sign, Result<14:0>, Req_out, Rin, Multiplier<7:0>, Multiplicand<7:0>; time axis in units of 10 ps]
Fig. 3-6 Simulation result of Self-timed Booth's multiplier
Fig. 3-6 shows the timing simulation result, from the Verilog simulator, of a
0.7 µm CMOS self-timed 8x8 Booth's multiplier with 2-phase handshake. The timing
diagram shows the handshake signals and the input and output data for two
multiplications. "T2" and "T1" are the internal request signals for the second
stage and the third stage respectively. Since one of the operations in the
second multiplication is "add zero", for which a smaller delay value in the
handshake control circuit is selected, the total time of the second
multiplication is about 10 ns less than that of the first. Also, no request
signal "T1" is sent to the final pipeline stage if no addition or subtraction is
required. As a result, the 16-bit output register and the 16-bit full-adder are
not toggled, and the average power consumption may be reduced.
Three Booth's multipliers were designed, all of which use the delay selection
technique. The first circuit is a conventional 4-phase system, the second is a
4-phase system with asymmetrical delay elements, and the third is a 2-phase
system. Table 3-4 summarizes the simulation results of these multipliers obtained
from Verilog-XL® simulation. They were simulated using "typical" case 
parameters provided by ES2. 
Mode of          Handshake Control Protocol
operation *      4-phase      4-phase (with A.D.)    2-phase
4                152.8 ns     126.2 ns               86.1 ns
3                130.5 ns     116.2 ns               75.5 ns
2                112.6 ns     106.1 ns               65.2 ns
1                99.2 ns      96.1 ns                55.0 ns
0                88.8 ns      86.0 ns                44.4 ns
* mode of operation = number of cycles in which the long delay is selected
Table 3-4 Simulation results of multiplication time in different operation modes and handshake control protocols
The simulation results show that the 2-phase micropipeline system operates
about 45% faster than the 4-phase system (with or without asymmetrical delay
elements) in all modes of operation, whereas the 4-phase system with
asymmetrical delay improves on the symmetrical-delay 4-phase system by only 17%
in mode 4 operation.
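The percentages quoted above can be recomputed from Table 3-4 with a few lines of arithmetic. The sketch reads "faster" as the relative reduction in multiplication time, which is an interpretation rather than a definition given in the text; the data are the table values.

```python
# Multiplication times from Table 3-4 in ns:
# mode -> (4-phase, 4-phase with asymmetrical delay, 2-phase)
TIMES_NS = {
    4: (152.8, 126.2, 86.1),
    3: (130.5, 116.2, 75.5),
    2: (112.6, 106.1, 65.2),
    1: (99.2, 96.1, 55.0),
    0: (88.8, 86.0, 44.4),
}


def reduction_percent(slow_ns, fast_ns):
    """Relative reduction in multiplication time, in percent."""
    return 100.0 * (slow_ns - fast_ns) / slow_ns


for mode, (four_phase, four_phase_ad, two_phase) in TIMES_NS.items():
    print(mode,
          round(reduction_percent(four_phase, two_phase), 1),
          round(reduction_percent(four_phase, four_phase_ad), 1))
```

On this reading, the reduction from the plain 4-phase system to the 2-phase system lies between roughly 42% and 50% over the five modes, and the gain from the asymmetrical delay elements in mode 4 is about 17%, which matches the figures quoted in the text.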
3.3.3 The third test chip 
The third test chip, shown in Fig. 3-7, was designed and fabricated in the ES2
CMOS 0.7 µm process. This test chip contains five micropipeline circuits with
some of the improvements described in the previous sections. Three of them are
8-bit parallel multipliers with the Modified Booth's algorithm as described
previously, while the other two are 16-bit accumulators using the
current-sensing completion detection technique, which is discussed in the next
chapter.
Fig. 3-7 Layout of the third test chip
Unfortunately, this chip failed to work after fabrication. The chip draws a very
large current, as if power and ground were short-circuited. After careful tracing of
the layout, one routing error was found: the power and ground signals in the
current-sensing test circuit block had been routed wrongly during the Auto-Place
and Route process. This chip is a semi-custom design, a mixture of standard cells
and full-custom cells. Because of time limitations, the full-custom cells did not yet
support LVS testing at that time, so LVS verification could not be performed. After
this lesson, all the full-custom cells were modified to support LVS testing.
Chapter Four 
4. Current-Sensing Completion Detection 
4.1 Introduction 
Chapters 2 and 3 discussed the design methodologies and application circuits
of dual-rail (DCVSL) and single-rail (Micropipeline) self-timed structures, both of
which are in common use. Micropipeline is simple, but it uses a bounded-delay
approach and therefore requires delay estimation; dual-rail logic uses an
unbounded-delay approach, but it is slow and its hardware is complicated. Another
self-timed logic structure, proposed by M.E. Dean et al. [25], uses a current-sensing
technique to detect completion of operation and forms a single-rail unbounded-delay
system.
This Current-Sensing Completion Detection (CSCD) technique is based on the
unbounded-delay approach. In the DCVSL method, completion of operation in a logic
block is determined by decoding the dual-rail encoded signals. In the CSCD
technique, completion detection is done by monitoring the switching current of the
CMOS logic block. The self-timed structure is therefore similar to that of a
conventional micropipeline, except that the complete signal is generated by a current
comparator instead of a delay element. The hardware overhead is slightly more than
that of Micropipeline, but CSCD shares the advantage of DCVSL that it can achieve
average-case
performance. 
However, there are many difficulties in designing CSCD circuits. The
objectives of investigating the CSCD technique in my research are to gain design
experience in this technique and to verify the performance of CSCD through the
design and testing of a CSCD test chip. In this chapter, the basic structure and
theory of the CMOS current sensor are described first, followed by the
implementation of a self-timed micropipeline with the CSCD technique. Finally, the
design and testing results of the second CSCD test chip are discussed.
4.2 Current-sensor 
The static current of a standard CMOS circuit is near zero. So we can 
determine whether a CMOS circuit is operating or not by monitoring its supply 
current. To detect the switching current of the CMOS logic circuit, we can use a 
current comparator as a current sensor. A simple current comparator which consists 
of a current mirror and a constant current source will be described in this section. 
4.2.1 Constant current source 
A constant current source can be as simple as a single transistor. Its output
current, i.e. its drain-to-source current, is expressed by the following equations:
$$I_{ds} = \frac{\mu \varepsilon W}{t_{ox} L} (V_{gs} - V_{th}) V_{ds} \quad \text{in linear region}$$

$$I_{ds(sat)} = \frac{\mu \varepsilon W}{t_{ox} L} (V_{gs} - V_{th})^{2} \quad \text{in saturation region}$$
From the above equations, the drain-to-source current can be varied by
controlling the gate-to-source voltage and the W/L ratio of the transistor. A constant
current source is thus formed when a fixed reference voltage is applied between
gate and source. The accuracy of the constant current source depends not only on
the actual dimensions of the transistor but also on the reference voltage.
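The dependence of the drain current on the gate-to-source and threshold voltages can be made concrete with a short numerical sketch. It assumes the two expressions above in the simplified form Ids = k(Vgs - Vth)Vds (linear) and Ids(sat) = k(Vgs - Vth)^2 (saturation), where k is a single hypothetical gain factor standing for the process and W/L terms. The cut-off case, the usual region boundary Vds = Vgs - Vth and all numerical values are our own assumptions, chosen only for illustration.

```python
def drain_current(k, vgs, vth, vds):
    """Drain-to-source current in amperes.

    k lumps the process and W/L terms of the equations into one gain
    factor (A/V^2); it is a hypothetical parameter for this sketch.
    """
    overdrive = vgs - vth
    if overdrive <= 0.0:
        return 0.0  # cut-off: assumption, not covered by the equations in the text
    if vds < overdrive:
        return k * overdrive * vds  # linear region
    return k * overdrive ** 2  # saturation region


# Hypothetical values, chosen only for illustration.
K = 1e-4    # gain factor in A/V^2
VDS = 5.0   # drain-to-source voltage in V, keeps the device in saturation

nominal = drain_current(K, vgs=1.5, vth=0.8, vds=VDS)
shifted = drain_current(K, vgs=1.5, vth=0.9, vds=VDS)
print(f"nominal {nominal * 1e6:.1f} uA, with Vth + 0.1 V {shifted * 1e6:.1f} uA")
```

With these illustrative numbers, a 0.1 V shift in threshold voltage changes the saturation current by roughly a quarter, which indicates how strongly a single-transistor current source reacts to parameter variation.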
However, in practical IC fabrication the shape of a transistor is not exactly
what was drawn; this is called the edge effect of the transistor. Other effects, such
as channel length modulation and the variations in threshold voltage and current
caused by random oxide thickness variation and random surface-charge effects [26],
also affect the accuracy of the drain-source current. None of these effects is
significant in a binary synchronous digital system, since they only affect the speed
of the system and not its functionality. For a current source, however, these
variations result in a very large change in current.
The reference voltage of the constant current source can be generated by
P-MOS and N-MOS transistors connected in series, as shown in Fig. 4-1, which is a
part of the current comparator. The two transistors act as two resistors in series,
and the reference voltage can be adjusted through their aspect ratios. The pair is
equivalent to a potential divider and produces the reference voltage for the
P-MOSFET that forms the constant current source.
(Diagram labels: current in, constant current source, current mirror, comparator out)
Fig. 4-1 Circuit diagram of a basic current comparator
This constant current source structure is very simple, but it is very sensitive
to variations in device parameters. A difference of over 80% was found when
simulating this circuit with the skew parameters (fast and slow cases) provided by
the fabrication company, ES2. This is one of the difficulties in designing CSCD
circuits.
4.2.2 Current mirror 
A current mirror is used to obtain multiple current outputs, each equal or
proportional to the input current. A simple current mirror circuit is shown in
Fig. 4-1, where it forms part of the current comparator. If the W/L ratios of the two
N-MOSFETs are identical, the current transfer ratio equals 1. Other ratios can be
realized by adjusting the transistors' W/L ratios. The mirror can thus provide a
fan-out larger than 1 and realize gain or attenuation. It is a very effective way to
produce an attenuated current output for comparison.
4.2.3 Current comparator 
A current comparator is a threshold circuit, or current-to-voltage converter:
its input is an analog current and its output is a binary voltage signal. The output
voltage depends on whether the input current is larger or smaller than the threshold
current. Fig. 4-1 shows the circuit diagram of a simple current comparator, which
consists of an N-MOS current mirror and a P-MOS constant current source. The
output voltage is determined by the difference between the current of the constant
current source and that of the current mirror.
The threshold value is defined by the output current of the constant current
source. If the currents from the current mirror and the constant current source are
the same, the output voltage equals half of the supply voltage. If the constant
current is larger than the current mirror's output, the output node is pulled up and
goes "High"; otherwise, the output node is pulled down and goes "Low".
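The threshold behaviour described above can be summarized in a small behavioural model. This is only an idealized sketch: the function names, the ideal mirror relation (output current scaled by the ratio of the two W/L ratios) and all numerical values are our own assumptions, and no transient or noise effects are modelled.

```python
def mirror_output(i_in, wl_in, wl_out):
    """Ideal current mirror: output scales with the ratio of the W/L ratios."""
    return i_in * (wl_out / wl_in)


def comparator_level(i_mirror, i_ref):
    """Logic level at the output node of an idealized current comparator."""
    if i_ref > i_mirror:
        return "High"   # constant current source dominates, node pulled up
    if i_ref < i_mirror:
        return "Low"    # current mirror dominates, node pulled down
    return "VDD/2"      # equal currents, node sits at half the supply


# Hypothetical numbers for illustration only.
I_REF = 20e-6  # threshold current set by the constant current source (A)
idle = mirror_output(0.0, wl_in=10.0, wl_out=1.0)    # logic block at rest
busy = mirror_output(1e-3, wl_in=10.0, wl_out=1.0)   # logic block switching
print(comparator_level(idle, I_REF), comparator_level(busy, I_REF))
```

In this sketch a mirror with a transfer ratio of 0.1 attenuates a 1 mA switching current to 100 uA, which still exceeds the assumed 20 uA threshold, so the output goes "Low" while the logic is switching and returns "High" when the current falls to zero.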
Designing the current comparator is a very difficult task, especially in
determining the strength of the constant current, i.e. the threshold current.
Theoretically, the threshold value need only be slightly more than zero. However, if
the threshold value is too small, the noise margin is also very small and the
comparator will be sensitive to noise. Moreover, if the threshold is small, the
constant current source must be weak, which slows down the charge-up of the
comparator's output node. As a result, when the CMOS logic finishes its processing,
the comparator's output returns to the "High" state very slowly. The threshold value
should therefore be increased slightly to obtain a reasonable charging time. The
final threshold value is
found by first estimating it from the CMOS logic block size (or maximum switching
current) and then optimizing it by trial and error based on simulations. Sometimes
this value may be larger than the minimum switching current (i.e. the switching
current of a single inverter).
(Simulation plot: voltage traces "In", "Out" and "Cout", and the switching current trace, versus time)
Fig. 4-2 Simulation result of the switching current and operation of a basic
current comparator
Fig. 4-2 shows the SPICE simulation result of such a circuit with ES2 CMOS
1.2 µm process parameters. It simulates the switching of an inverter chain. "In",
"Out" and "Cout" are the data input, the data output and the inverted current
comparator output respectively. "Current" is the total switching current of the
inverter chain. The plot shows that the current increases when the input signal
changes, and the current comparator output switches to low. After the output data
pass through the buffer, they are converted back to full-swing voltages, i.e. 5 V and
0 V. In this case, the computation time of the CMOS inverter chain is quite long and
the maximum switching current is reasonable, so the delay of the "ready" signal is
very small, about 1.3 ns.
However, most logic circuits are not simply a serial gate chain. Let us
consider a more complicated circuit with many parallel branches. If the threshold
value is made small enough to detect the minimum switching current, the charge-up
time of the output node will be too long, and the ready signal delay will be very
large.
If the threshold value is increased to a value larger than the minimum
switching current, the complete signal delay will be shorter. However, the
discharging time will increase. If this discharging time is longer than the processing
time of the CMOS logic block, or if only a few gates switch in a large logic block, no
complete signal will be generated. As a result, the system will be stuck or go




(Diagram label: comparator out)
Fig. 4-3 Circuit diagram of the improved current comparator
Fig. 4-3 is the circuit diagram of an improved current comparator for a
4-phase handshake micropipeline system, and Fig. 4-4 shows its simulation result.
In this circuit, the request signal from the previous stage passes through a minimum
delay element and produces a signal "ri" for the comparator. Initially, "ri" is "Low"
and node X is discharged to "Low". Once the request signal is accepted, the CMOS
logic block starts processing. After the minimum delay time (5 ns in this case), the
comparator is enabled.
Node X stays "Low" until processing is completed. This ensures that the complete
signal changes from "Low" to "High" no matter how short the processing time or
how small the current is. When the next-stage self-timed block accepts the data and
returns an acknowledge signal, "ri" goes "Low". The parallel N-MOS transistor
speeds up the discharge of node X, so that the comparator can start the next
operation immediately.
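The effect of the minimum delay element on the completion timing can be pictured with a simple behavioural model. This is only a sketch under our own simplifying assumption that the complete signal rises a fixed comparator delay after whichever comes later: the end of the minimum delay or the end of the switching activity. The 5 ns minimum delay is the value quoted above; the comparator delay and the function name are hypothetical.

```python
def completion_time_ns(processing_ns, min_delay_ns=5.0, comparator_ns=2.0):
    """Time from request to complete signal in the simplified model.

    The comparator is only enabled after min_delay_ns, so a complete
    signal can never be issued earlier than that, however short the
    processing is. comparator_ns is a hypothetical charge-up delay.
    """
    enabled_and_done = max(processing_ns, min_delay_ns)
    return enabled_and_done + comparator_ns


# A very short operation is bounded below by the minimum delay,
# while a long operation is tracked by its own processing time.
print(completion_time_ns(1.0), completion_time_ns(18.0))
```

The model shows the intended property: short operations still produce a clean "Low" to "High" transition after the minimum delay, while longer operations keep their data-dependent completion time.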
(HSPICE simulation plot: voltage traces "In" and "ready" versus time)
Fig. 4-4 Simulation result of improved current comparator
Since there must be a voltage drop across the current mirror, the noise
immunity of the CMOS logic block is reduced, and an output buffer should be added
to the data output to restore the voltage swing. In most cases, however, the data
output is connected to an output register, so no additional buffer is required.
Besides, the W/L ratio of the input transistor in the current mirror should be large
enough to carry the total switching current produced by the CMOS logic block. If the
total switching current is too large, the transfer ratio of the current mirror should be
much smaller than 1 to attenuate the current for comparison. Similarly, if only a few
gates switch, the output current will be very small and may even be smaller than the
threshold value, and the addition of a minimum delay element will also solve the 
problem. 
In addition, the reduced voltage swing inside the CMOS logic block slows
down the operating speed of the CMOS circuits. The solution to this problem is to
increase the supply voltage of the CMOS logic blocks by providing separate supply
buses for all CMOS logic blocks that are monitored by current comparators.
4.3 Self-timed logic using CSCD 
(Block diagram labels: ri, input reg., CMOS logic circuit, data output, ready)
Fig. 4-5 Block diagram of self-timed logic using CSCD
A self-timed system with CSCD is shown in Fig. 4-5. When a request signal
"ri" is accepted, the input register stores the input data. The CMOS logic circuit then
starts processing, and the current comparator monitors the switching current of that
CMOS logic block and indicates whether the operation has finished. When the
CMOS logic block is in steady state, there is no switching current and the comparator
gives a complete signal. The CSCD circuit can be used in a micropipeline system
with either the 4-phase or the 2-phase handshake control protocol. This is a relatively
simple method of implementing unbounded delay with single-rail logic circuits. An
asymmetric delay element is used because the request signal has to return to zero in
each cycle; this minimizes the time wasted during the return-to-zero period.
4.4 CSCD test chips and testing results 
To further investigate the performance of CSCD circuits, a self-timed test chip
was designed and fabricated. This chip consists of three independent self-timed
CSCD circuits. Two of them are chains of CMOS logic gates (with 40 OR gates in
each circuit). They are used to test the characteristics of the current comparator and
its performance at different switching current levels. The third circuit is a self-timed
16-bit accumulator using the CSCD technique, built from a 16-bit carry-look-ahead
adder with a feedback path. Fig. 4-6 shows the layout view of the test chip. The
circuit diagrams of the current-sensing test chip can be found in Appendix C.
This test chip was initially designed for fabrication in the ES2 1.2 µm CMOS
process. However, because ES2 could not find enough projects for a 1.2 µm run, the
chip was finally fabricated in the 1.0 µm process with the layout unchanged except
for the sizes of vias and contacts. As the parameters of the 1.2 µm and 1.0 µm
processes are different and the current sensor is very sensitive to parameter
variations, a relatively large difference between simulation and testing results is
expected.
Fig. 4-6 Layout of the second CSCD test chip
4.4.1 Test result 
The three test blocks of the chip each have a different current comparator style
(different reference voltage and W/L ratio) and substrate biasing method. Only one
of them works: a logic gate chain test block, which uses the comparator described
in section 4.2.3. The failure of the others may be due to an improper reference
voltage for the current comparator, as the circuits were originally optimized for the
1.2 µm technology.
Fig. 4-7 shows the analog output voltage of the current comparator (upper) and
its buffered digital output (lower). The analog output is node X as described in
section 4.2.3. As the data input starts to change, the switching current increases and
the voltage of node X starts to change from low to high. It stays steady (high)
when the switching current decreases to zero. The test result shows that the current
comparator circuit basically works. When the test result is compared with the
simulation result shown in Fig. 4-4, the transient response is found to be correct
but the time delay of the current comparator is much larger in practice. Table 4-1
summarizes the measured results and compares them with the simulation results.
(Oscilloscope capture)
Fig. 4-7 Output voltage of current comparator
(Oscilloscope capture)
Fig. 4-8 Time delay from output data to completion signal
                                     Simulation result   Measured result
Delay from data input to output      18.3 ns             19.8 ns (averaged)
Delay from data output (steady)      11 ns               29.3 ns (averaged)
to completion signal
Table 4-1 Comparison between simulation and measured results
The delay of the CMOS functional block is very close to the simulation result,
as it is only a chain of 40 standard CMOS OR gates. However, the delay of the
comparator is almost three times longer than expected. This may be caused by the
following reasons:
1. Improper reference voltage - the resultant reference current may be too large, so
that the rise time of the comparator's output voltage is very long.
2. The output (node X) was connected to an analog pad for measurement, so the
capacitance of the node is very large and the rise and fall times are extremely long.
3. Noise may introduce undesired current in addition to the normal switching current.
4. The stability of the supply voltage may also affect the accuracy of the reference
voltage, since this comparator circuit is very simple and very sensitive to
fluctuations of the supply voltage. This may be improved by using a cascode
current mirror configuration and a bandgap reference voltage generation technique.
Chapter Five 
5. Self-timed ICT processor architecture 
5.1 Introduction 
Calculation of the Integer Cosine Transform requires a processor with very
high computation power. There are many approaches to implementing the ICT
algorithm, such as using a general purpose DSP microprocessor, a dedicated ASIC or
SISCs. In this chapter, these design approaches are considered and compared to find
out which one is the most suitable for asynchronous implementation, and the most
cost-effective structure is pointed out.
The advantage of using an asynchronous technique to implement the ICT
processor is that its calculation time may be shorter than that of a synchronous
design, because the computation time of each cycle in each self-timed block can
differ according to the operation performed. We can select or vary the values of the
delay elements according to different conditions or input data patterns. In our ICT
circuit, some instructions require only one addition and some require multiplication
with accumulation (MAC), so we can apply the delay value selection method to
reduce the overall computation time. In a synchronous system, by contrast, the time
required for each operation is fixed. Also, this chip is used to demonstrate the idea of
the
delay selection mechanism and the performance of the micropipeline system with 2-
phase handshake control. 
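The potential benefit of delay value selection over a fixed clock period can be illustrated with a short sketch. The per-operation delays, the instruction mix and the function names below are hypothetical values chosen only for illustration; they are not measurements from the ICT processor.

```python
# Hypothetical per-operation delays in ns (illustrative values only).
DELAY_NS = {"ADD": 10.0, "MAC": 20.0}


def self_timed_time(ops):
    """Total time when each operation selects its own matched delay."""
    return sum(DELAY_NS[op] for op in ops)


def synchronous_time(ops):
    """Total time when every operation takes the worst-case clock period."""
    return len(ops) * max(DELAY_NS.values())


# An assumed instruction sequence with mostly additions.
program = ["ADD", "MAC", "ADD", "ADD"]
print(self_timed_time(program), synchronous_time(program))
```

In this sketch the self-timed sequence only pays the long delay for the single MAC instruction, whereas the synchronous version pays it for every instruction; the actual saving depends on the real delay values and on the instruction mix.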
In previous chapters, the theory and advantages of the Integer Cosine
Transform (ICT) and self-timed design methodologies have been discussed. We
concluded that the micropipeline structure with the 2-phase handshake control
protocol and the delay selection technique is the most efficient structure for
implementing self-timed systems with feedback paths and long data word-lengths.
Our ICT processor is therefore designed based on this structure.
The objective of this chapter is to discuss various kinds of processor
architecture, to find out which one is the most suitable for implementing the ICT
fast algorithm and to demonstrate the advantages of self-timed systems. The
selection criteria include cost, complexity, performance, difficulty of asynchronous
implementation, controllability and upgradability.
In this chapter, various kinds of processor architecture are discussed and
analyzed based on their hardware complexity, speed and suitability for self-timed
techniques, starting from the basic general purpose DSP and moving to the more
complicated dedicated ASIC structure. The implementation and testing results of the
1-D ICT processor will be described in the next chapter.
5.2 Comparison of different architectures 
Many DCT processors have been designed in the past, and most of them are
conventional synchronous processors. Some were based on general purpose DSP
architecture and some were dedicated ASICs. Some synchronous ICT chip sets based
on general purpose DSP architecture were developed by a research student a few
years ago. I have designed three versions of the self-timed ICT processor. The first
version is a micropipeline circuit that can calculate both forward and inverse 2-D
transforms, but it does not use the fast algorithm. The purpose of this design was
simply to gain practical experience in straightforward micropipeline design and
area estimation. The second version is a completely different design. It is also a
micropipeline design and has feedback data paths, but it can only perform the
forward transform. The final version is similar to the second version except that it
can perform both forward and inverse transforms. The following sections discuss
the differences between them and point out the reasons for using the final design.
Digital Signal Processors (DSPs) have traditionally been optimized to compute
the convolution (sum of products), recursive filtering and fast transform (butterfly)
operations that characterize most signal processing algorithms. DSPs can be either
programmable or dedicated. Programmable DSPs allow flexible implementation of a
variety of algorithms that use the same computational kernel, while dedicated DSPs
are hardwired to a specific algorithm. Dedicated
processors are often faster than, or dissipate less power than, general purpose
programmable processors [1].
Most synchronous and asynchronous processors have a register-register based
data path structure: operands are fetched from the registers by the functional units,
and the results are written back to the register files. Operands fetched from memory
are fed directly into the register files before being used by the functional units.
Two common register-register based data paths are:
1. Multiplexer-oriented data path, where multiplexers route the results between 
functional units and the register or storage elements, and in between the registers 
and the inputs to the functional units. 
2. Bus-oriented data path, where results are written into buses and operands fetched 
via buses in between the registers and the functional units. 
The bus-oriented data path structure is simple in hardware, as the registers
can share common buses. However, it is rather difficult to adapt to self-timed
techniques, because controlling the buses asynchronously is very complicated and
requires more control hardware. The multiplexer-oriented data path structure, on the
other hand, is easier to control asynchronously and has less control overhead in
terms of time and hardware, but it is not as flexible as the former structure and
consumes more area for data interconnections.
In the following sections the implementation and design issues of different 
processor styles will be discussed and compared. Silicon area is estimated based on 
the amount of hardware such as multipliers, adders and registers, while the data
rate is estimated by counting the number of clock cycles needed to complete the
whole 1-D transform operation. The estimates of area and speed are based on the
following assumptions:
• 16-bit full adder = 1 unit area
• Integer multiplier (16-bit x 3-bit) = 2 unit areas
• Each pipeline delay (multiplier) = 20 ns
• Each pipeline delay (rest of the functional units) = 10 ns
• All delays caused by handshake control signals are neglected.
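The unit-area assumptions above can be applied mechanically to the multiplier and adder counts listed in the hardware estimations of sections 5.2.2.1 and 5.2.3.1. The sketch below is only illustrative: it covers multipliers and adders, since no unit area is stated for registers, and it assumes that every multiplier is of the 16-bit x 3-bit type. The function name is our own.

```python
ADDER_AREA = 1       # 16-bit full adder, in unit areas
MULTIPLIER_AREA = 2  # 16-bit x 3-bit integer multiplier, in unit areas


def datapath_area(n_multipliers, n_adders):
    """Arithmetic data path area in unit areas (registers not included)."""
    return n_multipliers * MULTIPLIER_AREA + n_adders * ADDER_AREA


# (multipliers, full-adders) as listed in sections 5.2.2.1 and 5.2.3.1.
designs = {
    "micropipeline without fast algorithm": (16, 14),
    "micropipeline with fast algorithm (I)": (14, 26),
}
for name, (n_mul, n_add) in designs.items():
    print(f"{name}: {datapath_area(n_mul, n_add)} unit areas")
```

Such a count only ranks the arithmetic hardware of the candidates; memories, registers and control circuits have to be added before comparing with the estimated areas in mm².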
5.2.1 General purpose Digital Signal Processor 
Fig. 5-1 is the block diagram of a typical general purpose programmable
DSP. It has a bus-oriented data path structure in which the registers, data memory
and I/O share the same data bus. Instructions are fetched from the program memory
and decoded to control the flow of data and the operation of the computational
units. Since this pipeline structure has many branches and stages, generating the
handshake control signals needed for correct operation between the program
memory, data memory, computational units and I/O units is very complicated.
Although this architecture is widely used in today's synchronous processors, it is
not efficient for asynchronous implementation.
(Block diagram labels: data bus, data in, data out, Reg A, Reg B, Reg C, ALU, instruction decoder/controller, program memory, PC/address generator, data memory)
Fig. 5-1 Block diagram of a general purpose programmable DSP
5.2.1.1 Hardware and speed estimation:
Total no. of multipliers = 1
Total no. of ALUs = 1
Total no. of registers = 5
Estimated area ≈ 20 mm²
Estimated data rate ≈ 10 MHz
Although its computational unit is very simple, consisting of only one
multiplier and one ALU, a large portion of the area is used by the program and data
memories. Moreover, almost half of the processor operation time is consumed by
data transfer between memory and registers. A former student implemented the ICT
(not the fast algorithm) based on this structure, but using conventional synchronous
techniques. That processor requires over 100 cycles to calculate a 1-D ICT
coefficient matrix, so he implemented the system using 8 chips connected in
parallel to speed up the operation.
5.2.2 Micropipeline without fast algorithm 
My first self-timed ICT processor design is a micropipeline circuit dedicated
to calculating 8x8 matrix multiplication rather than the fast ICT algorithm. The block
diagram of the circuit is shown in Fig. 5-2. It is a four-stage micropipeline system.
As its pipeline structure is very straightforward and has no feedback path, its
handshake control circuit is extremely simple, and its hardware is negligible when
compared with the multiplier and adder arrays.
Fig. 5-2 Block diagram of first version of self-timed 1-D ICT processor
In order to shorten the processing time, two multiplier/adder trees were 
constructed in parallel, each responsible for calculating four coefficients. In 
addition, a chip layout of a 2-D ICT processor with a 128-word transpose memory 
was designed based on this structure; its implementation is discussed in 
the next chapter. The area and speed estimates of a 1-D system are as follows. 
5.2.2.1 Hardware and speed estimation : 
Total no. of multiplier = 16 
Total no. of Full-adder (16-bit) = 14 
Total no. of Register (16-bit) = 36 
Estimated area ≈ 35 mm² 
Estimated data rate ≈ 100 MHz 
To complete a 1-D transform, a total of eight cycles are required (including four 
cycles of latency). The main advantage of this design is that it is capable of 
calculating all kinds of (8x1) forward and inverse matrix transformations with 
minimum overhead. However, it cannot handle the fast algorithm. 
5.2.3 Micropipeline with fast algorithm (I) 
[Figure: block diagram of the pipeline from Data in to Data out]
Fig. 5-3 Block diagram of Micropipeline with fast algorithm architecture (I) 
Another micropipeline design, shown in Fig. 5-3, is also a four-stage 
system, and its hardware structure directly mirrors the flow of the ICT fast algorithm. 
5.2.3.1 Hardware and speed estimation : 
Total no. of multiplier = 14 
Total no. of Full-adder = 26 
Total no. of registers (16-bit) = 32 
Estimated area ≈ 60 mm² 
Estimated data rate ≈ 266 MHz 
The most attractive feature of this design is its very high processing power, 
because the architecture allows eight data to be processed in parallel. As a result, its 
data rate is up to 266 MHz. If the multiply-add units in the third and fourth stages are 
each broken down into two stages, giving six stages in total, the data rate can be 
increased to 400 MHz. As in the previous design, the handshake control circuit 
is also very simple, as there is no feedback path and the pipeline structure is very 
straightforward. 
However, its silicon area is too large, as it requires 14 multipliers, 32 
intermediate registers and a large amount of interconnection. It is also unable to 
calculate the inverse ICT and is not easy to upgrade or modify if the size of the matrix is 
changed. 
5.2.4 Micropipeline with fast algorithm (II) 
My final circuit is based on a bounded-delay micropipeline with a 
feedback path and is able to execute the fast algorithm. This architecture combines 
the concepts of the previously described architectures (i.e. micropipeline, general purpose 
microprocessor and dedicated DSP). It is also similar to the superscalar architecture, 
which is widely used in today's high speed DSP chips. The block diagram of the 
circuit is shown in Fig. 5-4. 
This circuit combines eight general purpose DSPs with 
simple dedicated program sequencing and handshake control circuits. It therefore has a very 
high degree of parallelism and very fast program flow control. It is also 
able to implement the inverse ICT with the same computational blocks. In order to 
reduce the silicon area, the function of each computational block and the handshake 
control signals between those blocks are optimized for the ICT fast forward and inverse 
algorithms. 
[Figure: Data in and Data out connected to eight parallel blocks, Self-timed Comp. block 1 to Self-timed Comp. block 8]
Fig. 5-4 Block diagram of final version of self-timed 1-D ICT processor 
Most superscalar architectures have a register-register data path 
structure, where data is loaded into the registers from the memory, operated upon by 
multiple functional and arithmetic units, and the results from the register units are 
written back into the memory[1]. In this design, however, eight self-timed 
computational blocks (functional units) are connected in parallel and the program 
memory is broken down into eight parts. Thus it can be viewed as eight small 
computers operating concurrently. 
5.2.4.1 Hardware and speed estimation : 
Total no. of multiplier = 6 
Total no. of Full-adder = 8 
Total no. of registers (16-bit) = 24 
Estimated area ≈ 25 mm² 
Estimated data rate ≈ 80 MHz [1/(20 ns x 2 + 30 ns x 2) x 8] 
The operating speed of its synchronous counterpart is 1/(30 ns x 4) x 8 ≈ 
66 MHz, because the time period of all four cycles is the same and is set by the slowest cycle. 
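
The two estimates above can be reproduced with a few lines of arithmetic. The following is only a sketch: the function name and the assumption that eight words are produced per four-cycle pass are illustrative, while the 20 ns and 30 ns cycle times are the figures quoted in the text.

```python
def data_rate_mhz(cycle_times_ns, words_per_pass=8):
    """Data rate in MHz when one pass takes sum(cycle_times_ns) nanoseconds.

    words_per_pass=8 is an assumption matching the eight parallel blocks.
    """
    total_ns = sum(cycle_times_ns)
    return words_per_pass / total_ns * 1000.0  # words per ns -> MHz


# Self-timed: two short (addition only) cycles and two long cycles.
self_timed = data_rate_mhz([20, 20, 30, 30])
# Synchronous: every cycle must be as long as the slowest one.
synchronous = data_rate_mhz([30, 30, 30, 30])

print(round(self_timed, 1), round(synchronous, 1))
```

The synchronous figure comes out slightly above 66 MHz, which the text rounds down.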
From the viewpoint of self-timed system design, its handshake control 
overhead is higher than that of a micropipeline without a feedback path but lower than 
that of a general purpose microprocessor style. The feedback design also improves performance 
when the variable delay technique is used, because in the previous two straightforward 
micropipeline structures the speed of the whole system is limited by the 
slowest stage, as all stages are connected in series. 
Chapter Six 
6. Implementation of self-timed ICT processor 
6.1 Introduction 
In the last chapter, different kinds of self-timed ICT processor architectures 
were reviewed. The final version was found to be the most cost-effective 
design, striking a compromise between speed and silicon area. In this chapter, the 
detailed circuit implementation, specification and testing results are described. 
First, my first version design, which is a four-stage micropipeline system, is 
briefly introduced for reference. Then we concentrate on the final 
version. 
Chip design has to consider constraints such as die size, pin count and 
wiring complexity. To reduce the complexity of the ICT chip, a modular architecture is 
employed to allow data pipelining and parallel processing. The first ICT processor was 
designed only for practice. After many improvements in the ICT architecture and the 
self-timed control technique, the final version was developed. It is not only able to 
perform forward and inverse fast ICT calculation but also adopts some improved 
self-timed techniques to enhance its performance. 
A 2-D transformation can be implemented by applying a 1-D transformation, 
transposing the result and applying the 1-D transformation again; this holds for both 
the forward and the inverse transformation. In order to optimize the 
speed of the calculation and reduce hardware complexity, the scaling matrix [K] is 
moved out of this ICT core processor and merged with the quantization module. Thus 
our final chip has only integer multiplication and addition instructions. 
The forward and inverse ICT fast algorithms described in Chapter 1 show that 
at least four stages are required in both the forward and the inverse transform. If the primary 
input data is 8 bits long, the output coefficients will be 12-bit and 16-bit after the 1-D and 
2-D transformations respectively. The complete system (2-D system) should consist 
of two 1-D ICT modules and one transpose memory module which is able to store at 
least 64 x 16 bits of data. Moreover, the 1-D ICT module must be able to perform 
both forward and inverse transformations. 
According to the above functional requirements, our final design should be 
able to handle a maximum data word-length of 16 bits and to calculate the 
following functions: addition, subtraction, x3, x5, x1/2 and x1/4. The following 
sections describe and discuss the first and final versions of the self-timed ICT 
processor chips. 
6.2 Implementation of Self-timed 2-D ICT processor 
(First version) 
My first version design is a 2-D ICT system consisting of two 1-D ICT 
modules, which are simple four-stage micropipeline circuits, and one 256-byte 
Transpose RAM module. It was designed at the very beginning of the project and the layout was 
not submitted for fabrication, because it was used only for logic and layout design 
practice and for area estimation. The main advantage of this design is that it can 
calculate any 8x8 matrix multiplication (i.e. forward and inverse ICT without the fast 
algorithm). 
[Figure: 1-D ICT module, Transpose memory and 1-D ICT module connected in series]
Fig. 6-1 2-D ICT system 
Fig. 6-1 shows the block diagram of the 2-D ICT system. A complete 2-D 
transformation requires two 8x8 matrix multiplications. The input of the system is an 8x8 
matrix of 8-bit raw data. After the first 1-D transform, 64 intermediate coefficients are 
obtained and stored in the Transpose memory. Then eight 12-bit data are read from the 
Transpose memory into the next 1-D ICT module in "transposed" order. Finally, the 8x8 
16-bit 2-D transformed coefficients are obtained. All the circuit diagrams of this 
first version 2-D ICT processor can be found in Appendix D. 
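
The data flow of Fig. 6-1 can be illustrated with a small behavioural sketch. It is not the circuit itself: the helper names are hypothetical, the transform matrix `t` is an arbitrary placeholder rather than the ICT kernel, and a final transposition is added only so that the result can be compared directly with the matrix product T·X·Tᵀ.

```python
def matmul(a, b):
    """Plain integer matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Return the transpose of a nested-list matrix."""
    return [list(row) for row in zip(*m)]


def transform_2d(t, block):
    """Two 1-D passes with a transposition in between."""
    first = matmul(t, block)      # first 1-D module
    stored = transpose(first)     # read-out of the transpose memory
    second = matmul(t, stored)    # second 1-D module
    return transpose(second)      # restore orientation for comparison


# Placeholder 2x2 example (not the ICT matrix).
t = [[1, 2], [3, 4]]
x = [[1, 0], [2, 1]]
direct = matmul(matmul(t, x), transpose(t))
print(transform_2d(t, x) == direct)
```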
6.2.1 1-D ICT module 
Fig. 6-2 is the block diagram of the first self-timed ICT core. The whole 1-D 
ICT module consists of two multiplier and adder trees to speed up the calculation, 
so two coefficients are produced in each asynchronous clock cycle. Eight input 
data have to be shifted into the input stage register before the calculation starts. An 
acknowledge signal is then generated by the input control logic, and an internal 
request signal is generated for the next stage. 
[Figure: Data input feeding multiplier and adder-tree stages, duplicated (x2)]
Fig. 6-2 Block diagram of first version of self-timed 1-D ICT processor 
The self-timed handshake used is just the simple 2-phase handshake control 
circuit for micropipelines described previously. Since this system consists of four 
pipeline stages in series, the delay selection technique has not been applied: it would 
reduce only the latency of the first coefficient and would give no 
improvement in throughput, as the throughput of this system 
is determined by the pipeline stage with the worst delay. 
All multiplications are performed in the first stage. All kernel component 
values of the ICT matrix are stored in hardwired logic and read out by controlling 
the multiplexers. All kernel values can be represented with four bits or fewer, 
and these values are in fact the control signals for the specially designed 
integer multipliers. As the ICT requires only multiplication by 10, 9, 5, 3 and 2, all 
multipliers are simplified: each contains only two 16-bit full adders and one 
shifter circuit. The shifter is made up of sixteen 4-to-1 multiplexers, which can perform 
a fast shift of 0 to 4 bits within only one gate delay. As a result, the total 
multiplier delay is equal to only two 16-bit full adder delays plus one gate delay. All 
the full adders used in this ICT processor are 16-bit signed high speed carry-lookahead 
parallel full adders, where the most significant bit is the sign bit. 
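
To illustrate why a shifter plus adders is sufficient for these constants, here is a minimal sketch of constant multiplication by shift-and-add. The decomposition table is an assumption made for illustration (for example 10 = 8 + 2); the thesis does not state how the actual circuit splits each constant, and the sketch does not model the 16-bit word length or overflow.

```python
# Assumed decomposition of each kernel constant into left-shift amounts.
# For example 10 = (x << 3) + (x << 1). Hypothetical, for illustration only.
SHIFT_TERMS = {
    2: (1,),
    3: (1, 0),
    5: (2, 0),
    9: (3, 0),
    10: (3, 1),
}


def shift_add_multiply(x, k):
    """Multiply integer x by kernel constant k using shifts and additions only."""
    return sum(x << shift for shift in SHIFT_TERMS[k])


for k in sorted(SHIFT_TERMS):
    print(k, shift_add_multiply(7, k))
```

Every constant in the table needs at most two shifted terms, so one addition per product is enough in this simplified model.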
After all input data have been multiplied by the first-row matrix coefficients, they 
proceed to the second to fourth stages, which form the adder tree that sums 
the eight intermediate data. As mentioned before, the whole 1-D ICT module 
consists of two multiply-adder trees. One of them evaluates the 
first to fourth coefficients and the other the fifth to eighth coefficients, which 
maximizes the efficiency of transpose memory reads and writes. Each output coefficient 
is accompanied by an output request signal to indicate that the output data is ready for the next 
stage to read. 
6.2.2 Self-timed Transpose memory 
In the 2-D transform system, the 8x8 intermediate 1-D transformed matrix 
coefficients must be transposed before the next 1-D transformation is performed. 
The Transpose memory is made up of 4 conventional CMOS static RAM blocks, and the 
transposed data is obtained by reading out the data in transposed order. In practical 
operation, data is read from and written to the memory simultaneously, so the 
memory module consists of 4 blocks, each 32 x 16 bits in size. One pair is 
responsible for the write operation while the other is used for the read operation. 
There are two memory blocks in each pair because two 16-bit data can be read or written 
in parallel. Fig. 6-3 shows the block diagram of the self-timed transpose memory. 
[Figure: RAM blocks (RAM 1, RAM 3, ..., each 32 x 16-bit) with 2 x 16-bit data input and output, a Handshake Control Unit, control signals and handshake control signals]
Fig. 6-3 Block diagram of the 128x16 self-timed static Transpose memory 
The self-timed handshake control unit for this memory module is quite 
complicated. It not only handles the input/output handshake control signals, as 
normal self-timed systems do, but also has to generate the internal read/write control 
signals for memory access. It consists of counters, clock (i.e. read/write enable 
signal) and address generator circuits, and the basic 4-phase handshake control circuits 
described previously. Each input is associated with a request, as in normal self-timed 
signaling. 
The input request counter counts the number of requests and controls the 
writing sequence by generating the appropriate address and write enable signals. After 64 
words have been written to one pair of RAM blocks, a read enable signal enables 
the handshake control circuit for the read operation, and the pair changes to read mode. The 
input handshake control circuit is exactly the same as that of a normal 4-phase 
micropipeline system, as it is only responsible for generating the acknowledge signal for 
correct data writing. 
In the read operation, 64 request signals have to be generated by this control unit. 
They are generated by a simple delay and inverter loop circuit which consists of a 
C-element, a voltage-controlled delay element, CMOS logic gates and a counter. It is 
similar to a conventional inverter chain oscillator except that the long inverter chain is 
replaced by a single delay element. It is also controlled by the global reset 
signal, the acknowledge signal from the next stage and the counter's signal, so that the 
self-generated request signals can trigger and synchronize with the circuits inside and outside 
the module. After 64 request signals have been sent, the counter resets to zero and the pair 
changes back to write mode. To ensure correct operation, the counter must be 
reset only after the acknowledge signal has been received. 
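
The write-then-read alternation described above can be sketched behaviourally as a double-buffered memory. This is a simplified model under stated assumptions: the class and method names are hypothetical, one word is handled per access instead of two 16-bit words in parallel, and the handshake signals are reduced to plain method calls.

```python
class TransposeMemory:
    """Behavioural sketch of a double-buffered n x n transpose memory."""

    def __init__(self, n=8):
        self.n = n
        self.banks = [[0] * (n * n), [0] * (n * n)]
        self.write_bank = 0       # bank currently being filled
        self.write_count = 0      # models the input request counter
        self.read_ready = False   # set once a complete block is stored

    def write(self, word):
        """Store one word in row order; swap banks after n*n writes."""
        self.banks[self.write_bank][self.write_count] = word
        self.write_count += 1
        if self.write_count == self.n * self.n:
            self.write_count = 0
            self.write_bank ^= 1  # the filled bank becomes the read bank
            self.read_ready = True

    def read_block(self):
        """Return the last complete block in transposed (column) order."""
        if not self.read_ready:
            raise RuntimeError("no complete block written yet")
        bank = self.banks[self.write_bank ^ 1]
        n = self.n
        return [bank[row * n + col] for col in range(n) for row in range(n)]


mem = TransposeMemory(n=2)
for word in (1, 2, 3, 4):
    mem.write(word)
print(mem.read_block())
```

Because the banks swap roles after every full block, new data can be written while the previous block is still being read out.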
The minimum delay is selected based on the RAM macro specifications in 
ES2 library data book and simulation results. Please refer to Appendix D-12 for 
detailed circuit diagram of the control unit. 
6.2.3 Layout Design 
The layout of the memory cell was generated by the macro block generation 
program provided by ES2. However, only an abstract view can be generated, so only 
functional and digital timing simulation results can be obtained, based on the Verilog 
behavioral model generated by the program; detailed SPICE analog simulation 
based on the real layout cannot be performed. The whole layout of this 2-D 
self-timed ICT system was designed using Cadence's Auto-Place-and-Route program and 
the ES2 CMOS 0.7 µm Standard-Cell library. 
[Figure: chip layout plot]
Fig. 6-4 Layout view of the first version 2-D ICT processor 
Fig. 6-4 shows the layout view of the first version self-timed 2-D ICT design. 
The layout is partitioned into three main regions: two 1-D ICT modules and one 
Transpose RAM module. Each 1-D ICT module is further divided into two blocks, 
one for the multipliers and the other for the adder-tree circuit. The total area is 75 mm² 
and the number of pins is 78. Since this layout is too large and the circuit needs 
further improvement, it was not submitted for fabrication. 
6.3 Implementation of Self-timed 1-D ICT processor 
with fast algorithm (final version) 
[Figure: Input buffer (Data in) and Output buffer (Data out) connected to eight parallel blocks, Self-timed Comp. block 1 to Self-timed Comp. block 8]
Fig. 6-5 Block diagram of the final version Self-timed 1-D ICT core processor 
The block diagram of the final 1-D ICT processor is shown in Fig. 6-5, and 
all circuit diagrams are given in Appendix E. In the final design, the improved 
2-phase micropipeline with the variable delay technique is used, so the clock period of 
each self-timed block varies with the instructions being executed. 
The system architecture is a mixture of the general-purpose microprocessor and 
dedicated DSP styles. Because it is a micro-coded programmed processor, it can 
calculate both the forward and the inverse 1-D ICT with the fast computational 
algorithm on the same hardware. The chip has eight parallel self-timed computation blocks; 
each self-timed block acts as a simplified general-purpose processor which is 
responsible for calculating a particular element of the transformed vector X (i.e. X1 or 
X2 ...). 
Each block has its own internal clock and handshake signals, and also 
generates and receives external handshake signals to ensure correct data flow and 
operation. For example, once the intermediate data from another self-timed block is 
ready, a request signal is received. If the receiving self-timed block is also 
ready and accepts the data, an acknowledge signal is returned to the requester. 
All eight self-timed blocks therefore operate concurrently and 
asynchronously, and the data flow and operation sequence depend on the 
connections of the handshake control signals. According to the fast algorithm diagram, each 
self-timed block has to go through four processing cycles to calculate a coefficient. At least 
two of those cycles involve addition only and thus need a shorter delay period. 
So the maximum total processing time is about two short delay units plus 
two long delay units, which is roughly equivalent to three long delay units. It is theoretically 
faster than its synchronous counterpart, which needs four long delay units, if the 
handshaking delay overhead is neglected. 
6.3.1 IO buffers and control units 
Since most digital systems are synchronous, we have to consider how 
to communicate with the external synchronous world. So, in addition to the normal self-timed 
system interconnection structure, this self-timed ICT processor chip must be able to 
operate with other synchronous chips without additional hardware modification. 
Conventional synchronous chips provide no handshake signals; 
they usually synchronize input and output data with a global clock signal. We 
can use this clock signal of the synchronous circuit as the request and acknowledge signal 
for the self-timed circuit. This chip can therefore be viewed as an externally 
synchronous but internally asynchronous system. Accordingly, the handshake control protocol 
used to communicate with the outside world must be 4-phase handshaking, while the 
internal signaling is 2-phase handshaking. 
6.3.1.1 Input control 
The IO control unit thus not only controls and rearranges the data sequence 
but also interfaces with external synchronous systems. Let us consider data input 
from a synchronous/asynchronous host first. The positive transition of the clock signal 
is taken as a request signal, which is similar to 4-phase handshake signaling. 
In addition, an "enable" signal (synchronous or asynchronous) must be given by the host to 
indicate whether the data are valid. Once the unit has received eight request or clock 
pulses, a "frame acknowledge" signal is generated and fed back to the host 
machine for frame synchronization. This acknowledge signal remains high 
until the processor completes the whole operation. During this period the input 
buffer can accept the next eight new data and hold them until the acknowledge signal 
is reset to low. The processor's cycle time (i.e. the time required for 
calculating all coefficients) is therefore equal to the time period of this frame acknowledge 
signal. 
The input buffer has eight 16-bit registers which load eight 16-bit input data 
serially. After each register has accepted a datum, the internal request signals Ro1 to 
Ro8 are sent to the appropriate self-timed computational blocks, so that the 
self-timed computational blocks can start processing as soon as both of their input 
operands are ready. When all of the self-timed computational blocks have accepted 
the data, an acknowledge is received, which resets the counter for the next cycle 
of operation. Fig. 6-6 shows the timing diagram of the input buffer. 
[Figure: timing diagram showing Req. in, F. Ack. and the internal request signals up to Ro8]
Fig. 6-6 Timing diagram of input buffer 
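
The input protocol described in this subsection can be summarised by a small behavioural sketch. It is an illustration under stated assumptions, not the thesis circuit: the class and method names are hypothetical, each rising clock edge is modelled as one method call, and the release of the frame acknowledge is triggered explicitly.

```python
class InputControl:
    """Behavioural sketch of the frame-oriented input control."""

    FRAME = 8  # eight 16-bit input registers

    def __init__(self):
        self.registers = []      # models the input buffer contents
        self.frame_ack = False   # "frame acknowledge" fed back to the host

    def clock_rising_edge(self, data, enable):
        """Treat a rising clock edge as a request; load data only if enabled.

        Returns the index (1..8) of the internal request Ro1..Ro8 that fires,
        or None when nothing was loaded.
        """
        if not enable or len(self.registers) == self.FRAME:
            return None
        self.registers.append(data & 0xFFFF)  # keep a 16-bit word
        if len(self.registers) == self.FRAME:
            self.frame_ack = True
        return len(self.registers)

    def blocks_acknowledged(self):
        """All computational blocks took their data: reset the counter."""
        frame, self.registers = self.registers, []
        return frame

    def processing_done(self):
        """The processor finished the frame: release frame acknowledge."""
        self.frame_ack = False


ctrl = InputControl()
ignored = ctrl.clock_rising_edge(123, enable=False)
fired = [ctrl.clock_rising_edge(value, enable=True) for value in range(8)]
print(fired, ctrl.frame_ack)
```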
6.3.1.2 Output control 
The data output unit can also interface with both asynchronous and 
synchronous systems. In asynchronous mode, a request signal generator 
(similar to the one described for the Transpose RAM module) generates eight 
pulses as request signals, which handshake with the acknowledge signal of 
the target host machine. For testing purposes, the acknowledge input can simply be 
hardwired to the request output to push out the data continuously. In synchronous 
mode, the generation of the request, reset and counter signals is synchronized 
with the clock signal of the target host machine. Although the IO control unit of the 
processor can operate with synchronous systems, it actually works like a slave 
when connected to synchronous machines: the data clock speeds of the source 
and target machines must be the same to prevent data loss. 
The operation of the output buffer is different from that of the input buffer. Once all of the 
eight resultant coefficients in the self-timed computational blocks are ready, they are 
stored in the output buffers and then moved out serially through a 16-bit parallel 
output port. In addition, the output port has some multiplexer arrays used to monitor 
some internal data and handshake signals for testing purposes. 
6.3.1.2.1 Self-timed Computational Block 
[Figure: Handshake control circuit with handshake signals on both sides, Program memory / Instruction decoder, and Integer Execution Unit between Data in and Data out]
Fig. 6-7 Self-timed computational block 
Fig. 6-7 is the block diagram of the self-timed computation block. Like a 
normal micropipeline system, it is composed of handshake control circuits, a 
functional logic block and an instruction decoder unit for micro-coded hardwired 
programming. In both synchronous and asynchronous modes, the internal self-timed 
computational blocks operate asynchronously and have their own handshake 
signals. In order to save silicon area, the functional block and instruction decoder 
of each block are dedicated to specific functions based on the ICT fast 
forward and inverse algorithms. 
6.3.1.3 Handshake Control Unit 
The handshake control unit is the heart of this self-timed processor. It not 
only manages the request and acknowledge signals between the different self-timed 
blocks and the outside world, but also generates the appropriate internal clock for the pipeline 
in the integer execution unit. This unit contains a pair of 2-phase handshake control 
circuits, interface circuitry for communicating with external handshake signals, and 
counters that generate signals for the instruction decoder to control the operation of the 
integer execution unit. The handshake control unit is therefore a critical part of the 
self-timed system, and designing it presents several difficulties. 
[Figure: two-stage micropipeline with Req. in / Ack. out and Ack. in / Req. out, 2-phase signal generators, variable delay, VDin and VDout, request generator, counters, Ro1 (to 1st stage of IEU), Ro2 (to 2nd stage of IEU) and outputs to the Instruction decoder]
Fig. 6-8 Block diagram of Handshake Control Unit 
Firstly, we previously concluded that the 2-phase handshake control protocol is 
the most cost-effective method for handling wide word-length bundled data systems. 
However, the system to be implemented has eight parallel self-timed blocks 
involving 16 request and acknowledge signals. The handshake control circuit 
would therefore be extremely complicated if a traditional asynchronous logic synthesis 
method were used to design a hazard-free handshake control circuit. To simplify the 
design, it is broken down into two parts. Fig. 6-8 is the block diagram of the 
handshake control unit. The first part is the heart of the unit: a well-studied 
2-stage 2-phase handshake control micropipeline with variable delay circuitry. It is a 
completely hazard-free asynchronous circuit. 
Since the request and acknowledge signals come from different blocks during 
different processing periods (i.e. they may be in either the positive or the negative phase), the 
handshake signals of other blocks cannot be connected directly to or from the 
micropipeline's handshake controller (first part). The request and acknowledge 
signals must all use the same phase to indicate activation. Thus some standard 
CMOS logic gates and flip-flops are used to construct request/acknowledge signal 
generators. Once the previous stage has fired a request, the signal goes from low 
to high and stays unchanged until the whole process is complete. A 2-phase 
signal generator then converts the rising edge into a 2-phase signal. 
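
The idea of converting a level-type request into a 2-phase (transition) signal can be sketched as follows. This is a deliberately reduced model: the class name is hypothetical, a single request input is sampled in discrete steps, and the multiplexer, counter and gating logic of the real generator are left out.

```python
class TwoPhaseGenerator:
    """Toggle the output once for every rising edge seen on the input."""

    def __init__(self):
        self.prev = 0  # previous sampled input level
        self.out = 0   # 2-phase output: every transition is one event

    def sample(self, level):
        """Sample the input level and return the current output level."""
        if level and not self.prev:
            self.out ^= 1  # rising edge detected: emit one transition
        self.prev = level
        return self.out


gen = TwoPhaseGenerator()
trace = [gen.sample(level) for level in (0, 1, 1, 0, 1)]
print(trace)
```

In 2-phase signaling both the rising and the falling transition of the output carry an event, so the falling edge of the input is ignored here.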
Fig. 6-10 shows the timing diagram of the HC unit. The 2-phase signal 
generator converts a logic high on the request signals (Req. in[1] ~ Req. in[4]) into a 
2-phase handshake signal (VDin) and 4-phase pulse signals (Ro1 and Ro2) for the 
registers in the Integer Execution Unit. The request and acknowledge signal 
generators are all simple shift registers, so their contents are only changed from 
low to high by the pulse signal generated by the hazard-free 2-phase micropipeline 
handshake controller. They are reset to low when they receive a process 
complete signal, issued once all coefficients have been calculated and stored in the output buffer. 
The whole system is therefore also hazard-free, as confirmed by simulation. 
[Figure: circuit with Req inputs, MUX, D flip-flop, 2-phase HC (Rin, Ack) and Rout 2]
Fig. 6-9 2-phase signal generator 
[Figure: timing diagram showing Req. in[1] to Req. in[4], VD in, Ro1, Ack. out[1] to Ack. out[4], VD out, Ro2, Req. out[1] to Req. out[4] and Ack. in[1] to Ack. in[4]]
Fig. 6-10 Timing diagram of internal handshaking signals 
The second part can be viewed as a local synchronous state machine 
which contains a 2-phase signal generator (shown in Fig. 6-9), request and 
acknowledge signal generators and 2-bit counters, all clocked by the 
asynchronous clock pulses (Ro1 and Ro2). Since all the outputs of this part are 
registered, this local synchronous system, and thus the whole system, can be 
guaranteed to be hazard-free provided that the worst-case delay of this 
synchronous subsystem is less than the minimum period of those clock pulses, 
and that the delay of an inverter is less than the delay of the 2-bit counter plus the MUX. As 
the inputs (Req.) only change from low to high, the only way 
an unwanted glitch could accidentally trigger the D-type flip-flop is while the 
MUX's select signals are changing. To eliminate this risk, an AND gate with one 
inverted input is introduced. The state of the MUX changes when "Rout 1" 
clocks the counter, so the AND gate is used to disable the D-type flip-flop during 
this phase. The flip-flop is enabled again after the counter, and thus the MUX, 
has become stable, given the inverter delay condition stated above. 
The advantage of this structure is that it can easily handle multiple request 
signals using only a few standard logic gates (6 additional gates, three for input 
and three for output). However, the resulting total delay overhead in the handshake 
control path is rather large. The total delay from a complete output signal 
(second stage) in one self-timed block to the Rout 1 signal (first stage) of another 
page 3-17 
An ICT Image Processing Chip Based on Fast Computation Algorithm and Self-timed Circuit Technique 
Chapter 6 Implementation of self-timed ICT processor 
self-timed block is equal to: Delay of (MUX +AND +D-FF +C-element +XOR) « 
3.5ns. 
The connections of the handshake signal lines depend on the data flow sequence
defined in the flow-graph of the fast forward and inverse ICT algorithms.
Since the algorithm can be partitioned into four stages, or four feedback
loops, there are four request and acknowledge pairs in total. The input lines
must be connected in the correct sequence (i.e. Req.in[1] is connected to the
first request signal).
6.3.1.4 Integer Execution Unit (IEU) 
Fig. 6-11 is the block diagram of the Integer Execution Unit, a two-stage
pipeline circuit built entirely from standard CMOS logic gates. The trigger
signal of the register or latch is generated by the handshake control unit.
[Block diagram; blocks shown: Router, Latch, Multiplier, Add / Subtract, Latch]
Fig. 6-11 Block diagram of the Integer Execution Unit
The input router selects the appropriate data from the input or output data
bus. Every IEU in all self-timed blocks has a high-speed 16-bit carry-lookahead
full-adder that can perform both addition and subtraction. Its worst-case
propagation delay is less than 10 unit gate delays. However, referring to the
fast ICT algorithm, only six self-timed blocks require multipliers; the rest
have only a single full-adder.
[Block diagram; blocks shown: 4-to-1 MUXes producing partial product C[0..15],
a 2-to-1 MUX producing partial product D[0..15], a 16-bit signed carry-lookahead
full-adder adding C and D, and a 16-bit MUX-based shifter producing the 16-bit
result. The multiplicand B[0..15] feeds the MUXes; the select signals come from
the instruction decoder.]
Fig. 6-12 Block diagram of 16 x 3 bit Integer Multiplier
Since the kernel components in the transform matrix [T] are all integers with
bit length less than four, the multiplier is very simple and only needs to
calculate ×1/2, ×2, ×3, ×4 and ×5. The common characteristic of these integers
is that their binary representations contain at most two "1" bits (i.e. 2=10,
3=11, 4=100 and 5=101). So the multiplier simply consists of an adder and
multiplexers for shifting. The block diagram of the 16×3 bit integer multiplier
is shown in Fig. 6-12. It is able to perform the following functions: ×1/2, ×2,
×3, ×3/2, ×3/4, ×4, ×5, ×5/2 and ×5/4. For example, if X is multiplied by 5, we
only need to add X to 4X, and 4X is obtained simply by shifting the operand
left by 2 bits. That means the worst-case delay (i.e. multiplication +
addition) is equivalent to just two full-adder delays, if the multiplexer delay
is neglected. If the instruction is addition only, a smaller delay value, only
one full-adder delay, is selected inside the handshake control unit, so the
average computation time is reduced.
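The shift-and-add scheme described above can be illustrated with a short
Python sketch. It is not the circuit itself: the operation table `OPS` and the
function name `multiply` are hypothetical, the 16-bit word length is ignored,
and the scaling by 1/2 or 1/4 is modeled as an arithmetic right shift (i.e.
truncation).

```python
# Hypothetical operation table, for illustration only.
# Each entry: (left shift applied to B to form partial product C,
#              whether B itself is added as partial product D,
#              right shift applied to the sum by the output shifter)
OPS = {
    "x1/2": (0, False, 1),
    "x2":   (1, False, 0),
    "x3":   (1, True,  0),
    "x3/2": (1, True,  1),
    "x3/4": (1, True,  2),
    "x4":   (2, False, 0),
    "x5":   (2, True,  0),
    "x5/2": (2, True,  1),
    "x5/4": (2, True,  2),
}


def multiply(b, op):
    """Multiply b by a small constant using one addition and shifts only."""
    c_shift, add_b, out_shift = OPS[op]
    c = b << c_shift          # partial product C: B, 2B or 4B
    d = b if add_b else 0     # partial product D: B or 0
    return (c + d) >> out_shift  # truncating scale by 1, 1/2 or 1/4
```

For instance, `multiply(7, "x5")` forms 4·7 + 7 with a single addition, and
`multiply(7, "x5/4")` additionally drops the two least significant bits of the
sum.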
The main source of computation error in the processor is the truncation error
of the multiplier, since division by 2 and 4 is done by truncation. Referring
to the fast ICT algorithm shown in Fig. 1-1 and Fig. 1-2, the maximum
truncation error for the 1-D ICT is the sum of two 3-bit truncations, which is
equal to 2 × (0.5 + 0.25 + 0.125) = 1.75. Since this error is not significant
compared with the quantization error in the image compression process, it is
not worthwhile to build the multiplier with a round-off function: one extra
16-bit full adder would have to be added to each multiplier, and the maximum
computation time would also increase.
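The bound of 1.75 can be checked numerically. The sketch below is only an
illustration: the function names are hypothetical, and it assumes that
truncation behaves like an arithmetic right shift that discards the fractional
bits.

```python
def max_truncation_error(dropped_bits):
    """Largest value the discarded fractional bits can represent."""
    return sum(2.0 ** -k for k in range(1, dropped_bits + 1))


def observed_max_error(dropped_bits, span=64):
    """Brute-force check: compare exact division with a truncating shift."""
    scale = 1 << dropped_bits
    return max(x / scale - (x >> dropped_bits) for x in range(-span, span))


# Two 3-bit truncations along the worst-case path of the 1-D transform.
worst_case = 2 * max_truncation_error(3)
```

Under these assumptions `worst_case` evaluates to 1.75, the figure quoted in
the text, and the brute-force check gives the same per-truncation maximum of
0.875.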
Another computation error of the processor is overflow error. A "clipping"
function could be introduced to minimize the effect of overflow. However, this
error can be prevented by limiting the length of the input data, so no
additional hardware is used to implement the "clipping" function in this
processor.
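To make the trade-off concrete, the following Python sketch contrasts what a
"clipping" (saturating) stage would do with the wrap-around that occurs in a
plain 16-bit two's complement datapath. The function names are hypothetical,
and the chip itself contains no such stage.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def clip16(value):
    """Saturate a result to the 16-bit signed range (the 'clipping' option)."""
    return max(INT16_MIN, min(INT16_MAX, value))


def wrap16(value):
    """Wrap-around behavior of a 16-bit two's complement result."""
    return (value + 32768) % 65536 - 32768
```

With clipping, an out-of-range result stays at the nearest representable
value; without it, the result wraps and changes sign, which is why the input
word length has to be limited instead.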
6.3.1.5 Program memory and Instruction decoder 
The program memory and instruction decoder control the operational sequence
and the input/output routing circuitry of the self-timed blocks. Together they
can be seen as the simple hardwired program memory of a processor. They convert
the counter's output from the handshake control unit into the control signals
for the input routers, multiplier and full-adder in the IEU. Each memory and
instruction decoder unit was designed specifically for its self-timed
computational block. It basically consists of multiplexers and simple logic
gates. The inputs of the multiplexers are hardwired to either high or low
depending on the sequence of operations required by the fast ICT algorithm,
and their input selections are controlled by the counter's value from the
handshake control circuit (i.e. the equivalent of the program memory address
of a traditional processor).
This kind of state-machine-type, MUX-based memory and decoder is used instead
of a ROM because each of the eight self-timed blocks has its own read/write
timing. With a ROM, the module would have to be divided into eight individual
ROM (8-bit × 8) modules, which is not an effective implementation in terms of
silicon area. The hardwired MUX memory modules plus decoder, in contrast, can
be optimized to minimize the gate count.
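The idea of a MUX-based program memory can be sketched in Python as follows.
The control signal names and the bit patterns in `HARDWIRED` are invented for
illustration; only the principle (one multiplexer per control signal, data
inputs tied high or low, select driven by the 2-bit counter) is taken from the
description above.

```python
# Hypothetical control signals for one self-timed block. Each tuple lists
# the constant (tied high = 1 or low = 0) on the four inputs of one MUX.
HARDWIRED = {
    "router_sel": (0, 1, 1, 0),
    "subtract":   (0, 1, 0, 0),
    "mult_en":    (0, 0, 1, 0),
}


def mux(inputs, select):
    """4-to-1 multiplexer: pass the selected hardwired input."""
    return inputs[select]


def decode(counter):
    """Map the 2-bit counter value to the control word for this step."""
    select = counter & 0b11  # only two counter bits are used
    return {name: mux(bits, select) for name, bits in HARDWIRED.items()}
```

Stepping `counter` through 0 to 3 yields the control word for each step of the
block's operation sequence, in the same way that an address selects a word in
a conventional program memory.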
6.3.2 Layout Design 
The chip, shown in Fig. 6-13, was laid out using the Cadence Cell-Ensemble
tool. All of the CMOS logic gates are from the ES2 library, except the
C-element and all delay elements, which are custom designs. The layout is
partitioned into 6 groups (input buffer, output buffer and 4×2 self-timed
computation blocks). The chip has 68 I/O pins, including some pins for
monitoring internal signals for testing purposes. Since there are many
handshake control signals in the design, the wiring and interconnections are
relatively complex.
Fig. 6-13 Layout view of the final version self-timed 1-D ICT processor 
6.4 Specifications of the final version self-timed ICT chip
CMOS process / Foundry : 0.7 µm SLP DLM / European Silicon Structure (ES2)
Expected I/O data rate : 50 MHz (forward ICT), 60 MHz (inverse ICT)
Data format : 16-bit signed binary format
Truncation error : 1.75 max.
Die size / Transistor count : 5.7 x 4.1 mm / 76k
Package : 68 pin CC
Chapter Seven 
7. Testing of Self-timed ICT processor 
7.1 Introduction 
After a very long period of investigation and design work on self-timed
systems and on the ICT processor architecture and its implementation, the
self-timed 1-D ICT processor chips have finally been fabricated. After many
failed attempts at self-timed chip design, I am very happy to say that this
final self-timed ICT processor chip has been built successfully. Although it
is not a perfect design, and it still cannot prove that asynchronous design is
better than synchronous design, the measured performance is very close to what
we expected (the simulation result), and it does show that our idea and design
methodology for implementing such a large-scale self-timed system are correct.
This chapter describes the test results and findings for the self-timed 1-D
ICT processor proposed in the last chapter. First, the functional test section
describes the testing procedures and shows that the chip calculates the 1-D
ICT correctly. Then the transient characteristics and the performance of the
chip in terms of speed and power are discussed. Finally, the instability
problem of the chip is pointed out, and solutions and further improvements are
suggested.
7.2 Pin assignment of Self-timed 1-D ICT chip 
Pin no.  Pin name       I/O type    Descriptions and remarks
1        Data_out[4]    output      Transformed data output
2        Data_out[3]    output      Transformed data output
3        [illegible]
4        Data_out[2]    output      Transformed data output
5        Data_out[1]    output      Transformed data output
6        Data_out[0]    output      Transformed data output
7        Reset          input       Global reset signal; Active low
8        N.C.           -           -
9        N.C.           -           -
10       VCC            VCC         [illegible] (5V)
11       Sync_enable    input       Enable signal for synchronous data output
                                    with an external clock; Active low
12       Sync_clock_in  input       External clock for synchronous data output
                                    (for sync. mode); Acknowledge input (for
                                    asynchronous mode)
13       Forward_       input       0: forward transform; 1: inverse transform
14       Clock_enable   input       Enable signal for request / clock signal
15       Req_clock_in   input       Request signal / clock input
16       Sel3a          input       Output selection
17       Sel3b          input       Output selection
18       Sel4           input       Output selection
19       Sel5           input       Output selection
20       Sel6           input       Output selection
21       Din[0]         input       Data input
22       Volt_con       analog i/p  Analog voltage input for Variable Delay
                                    elements (for core processor)
23       Din[1]         input       Data input
24       Volt_con2      analog i/p  Analog voltage input for Variable Delay
                                    elements (for IO buffer and test circuit)
25       GND            GND         Core ground (0V)
26       N.C.           -           -
27       N.C.           -           -
28       N.C.           -           -
29       Din[2]         input       Data input
30       Din[3]         input       Data input
31       Din[4]         input       Data input
32       Din[5]         input       Data input
33       Din[6]         input       Data input
34       Din[7]         input       Data input
35       Din[8]         input       Data input
36       Din[9]         input       Data input
37       Din[10]        input       Data input
38       VCC            VCC         [illegible] (5V)
39       Din[11]        input       Data input
40       Din[12]        input       Data input
41       Din[13]        input       Data input
42       Din[14]        input       Data input
43       N.C.           -           -
44       N.C.           -           -
45       [illegible]
46       Din[15]        input       Data input (Sign bit)
47       Ack_out        output      Frame Acknowledge output
48       test_out2      output      Output of test circuit
49       test_out       output      Output of test circuit
50       tp_out[4]      output      Internal test point signals (for processor)
51       tp_out[3]      output      Internal test point signals (for processor)
52       tp_out[2]      output      Internal test point signals (for processor)
53       tp_out[1]      output      Internal test point signals (for processor)
54       Req_out (?)    output      Output Request signal
55       Data_out[15]   output      Transformed data output (Sign bit)
56       Data_out[14]   output      Transformed data output
57       Data_out[13]   output      Transformed data output
58       Data_out[12]   output      Transformed data output
59       VCC            VCC         Core power (5V)
60       N.C.           -           -
61       N.C.           -           -
62       Data_out[11]   output      Transformed data output
63       Data_out[10]   output      Transformed data output
64       Data_out[9]    output      Transformed data output
65       Data_out[8]    output      Transformed data output
66       Data_out[7]    output      Transformed data output
67       Data_out[6]    output      Transformed data output
68       Data_out[5]    output      Transformed data output
7.3 Simulation 
As the ICT processor is a very large system, it cannot be simulated with
SPICE. It also contains some non-standard CMOS logic gates, such as C-elements
and delay elements. So, in order to simulate the whole system in the digital
simulator Verilog-XL®, the custom-designed cells first had to be characterized
with the analog simulator SPICE. After all of their timing and loading
parameters had been found, these cells were modeled in the Verilog Hardware
Description Language (Verilog HDL). Finally, the whole self-timed ICT chip,
built from standard CMOS logic gates and custom-designed cells, could be
simulated.
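As an illustration of what a behavioral model of a custom cell captures, the
sketch below models the logic function of a two-input Muller C-element in
Python. It is not the Verilog HDL model used in this work and carries no timing
or loading information; the class name `CElement` is hypothetical.

```python
class CElement:
    """Behavioral model of a two-input Muller C-element (logic only).

    The output follows the inputs when they agree and holds its previous
    value when they differ.
    """

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        if a == b:
            self.out = a   # both inputs agree: output follows them
        return self.out    # inputs differ: output keeps its last state
```

A cell model for the digital simulator would add the propagation delays
obtained from the analog characterization on top of this logic function.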
Fig. 7-1 Simulated timing diagram of self-timed ICT (forward transform)
Fig. 7-2 Simulated timing diagram of self-timed ICT (inverse transform) 
Fig. 7-1 and Fig. 7-2 are the timing and functional simulation results of the
self-timed ICT chip from the Verilog-XL® simulator for the forward and inverse
ICT respectively. The timing diagrams show the operation of the chip with its
handshake signals and its input and output data. Do0 to Do7 are the internal
data buses; they show that the intermediate and output results are calculated
asynchronously. Data_out[14..0] and Data_out_si form the output buffer's
16-bit output, which serially shifts out the eight 16-bit transformed
coefficients.
The total calculation time is equal to the duration of the "frame acknowledge"
signal, since the "high" level of this signal indicates that the input buffer
cannot request the next operation while the self-timed blocks are busy. During
this period (Tc ns), the input IO buffer can serially accept and store eight
16-bit input data for the next operation. So the IO data rate should not exceed
8 × 1/Tc (i.e. 8000/Tc MHz with Tc in ns). Based on the simulation results,
the maximum IO data rate was estimated to be 61.5 MHz and 48.4 MHz for forward
and inverse ICT respectively.
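The bound on the IO data rate can be written as a one-line calculation. The
function name below is hypothetical, and the formula is the idealized one
stated above (eight words per frame, nothing else limiting the rate).

```python
def max_io_rate_mhz(tc_ns, words_per_frame=8):
    """Upper bound on the IO data rate in MHz for a frame time given in ns."""
    return words_per_frame / tc_ns * 1000.0
```

Applied to the simulated computation times listed in Table 7-1 (129.4 ns and
162.0 ns), this idealized formula gives roughly 61.8 MHz and 49.4 MHz. These
are close to, but not identical with, the estimates quoted above; how the
quoted values were rounded or derated is not stated here.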
7.4 Testing of Self-timed 1-D ICT processor 
7.4.1 Functional test 
7.4.1.1 Testing environment and results 
The testing equipment is listed as follows:
1. The logic function of the chip was tested with an HP 16500A Logic Analyzer.
It generates the 16-bit input data and other necessary control signals, such
as the request and acknowledge signals.
2. Power supply : 5V for system main power; a variable voltage source for the
two voltage-controlled delay element inputs.
3. Transient response was measured with a Tektronix TDS 320 100MHz CRO and a
Philips PM 5786B 125MHz pulse generator.
Fig. 7-3 Measured timing diagram of self-timed ICT (forward ICT)
Fig. 7-4 Measured timing diagram of self-timed ICT (inverse transform) 
All the data inputs, request signals, etc. for both the forward and inverse
transforms follow the same timing as in the simulation environment shown in
Fig. 7-1 and Fig. 7-2. The request and acknowledge signals are emulated by the
synchronous clock. After the power is switched on, the control voltage for the
delay element is increased until the output data is correct (i.e. the same as
the simulation result). Fig. 7-3 and Fig. 7-4 are the measured timing diagrams
of the self-timed ICT chip for the forward and inverse transforms
respectively. Dout[15..0] is the 16-bit output from the output buffer, and it
was found to be the same as the result shown in Fig. 7-1 and Fig. 7-2
(Data_out[14..0] and Data_out_si). Fifty sets of randomly generated input
vectors were used for the functional test, and all of the corresponding
measured output data matched the simulation results.
The minimum control voltage for the delay element to achieve correct operation
is 2.4V. This value is also confirmed in section 7.4.4 by a delay estimation
method. The total calculation time, averaged over 5 chips, is 142.1ns and
173.2ns for the forward and inverse transforms respectively. That means the
maximum IO data rate is 55.9 MHz and 45.9 MHz for the forward and inverse
transforms respectively. However, the chips become unstable as the control
voltage is further increased (errors occur in the output data). This will be
discussed in the next section.
7.4.2 Transient Characteristics 
After the logic function of the chip was found to be correct, the next step
was to investigate its transient characteristics and to find its maximum
operating speed. Since the time resolution of the logic analyzer is 10ns, a
100MHz CRO with a 500MHz sampling rate was used to obtain more accurate
readings and to monitor the transient characteristics. Fig. 7-5 is the
transient diagram of the frame acknowledge signal and the eight output request
signals. After the input buffer is full, the acknowledge signal goes "high".
It then returns to zero once all coefficients are calculated. Finally, the
coefficients are shifted out in association with the output request signal.
This figure shows that the total computational time is 142ns and that the
output data rate is 50 MHz, synchronous with an external clock.
Fig. 7-5 Waveform of the acknowledge and output request signal 
Even though the chips function correctly, they are sometimes found to be
rather unstable: the output data are not correct and the duration of the frame
acknowledge signal fluctuates seriously. Theoretically, the incorrect data can
be corrected by increasing the delay time of the delay element. However, if
the control voltage is increased further, another problem is introduced: the
system becomes more unstable and the fluctuation becomes more serious.
Fig. 7-6 Fluctuation of the acknowledge signal 
Fig. 7-6 shows how the acknowledge signal fluctuates. The diagram was captured
in the accumulate mode of the CRO after 10s of accumulation. The amplitude of
the fluctuation is almost 27ns! Table 7-1 summarizes the simulated and
measured total computational time. The measured values were averaged over 5
test chips under the following conditions : Vcc=5V; Vc=2.5V; IO data rate =
50MHz; supply current = 62mA.
               Simulated          Measured (ns)
                 (ns)        Average    Max.     Min.
Forward ICT      129.4        142.1    159.5    131.7
Inverse ICT      162.0        173.2    186.2    160.1
Table 7-1 Comparison of simulated and measured total computational time 
The above findings show that the external handshake signals and data are
correct. Fig. 7-7 shows an internal request signal for one of the latches in
self-timed block no. 2; its timing matches the simulated result too. It is an
asynchronous signal with a shorter period between the first two pulses and a
longer period (about twice the previous one) between the second and third
pulses, because a longer computation time is required, and thus a longer delay
path is selected, in the third cycle.
Fig. 7-7 Waveform of internal request signal 
7.4.3 Comments on speed and power 
Besides stability, another weakness of this self-timed ICT processor is that
its operating speed is not satisfactory. Both the simulated and measured total
computational time of a cycle are near 140ns for the forward transform and
170ns for the inverse transform. These readings indicate that a rather large
portion of the time is wasted in handshaking, mainly from the request output
of one block to the internal request output of the next block.
This delay path contains the combinational logic used to handle multiple
request and acknowledge signals, as discussed in the previous chapter. If this
overhead delay could be eliminated, the best readings would be reduced to less
than 100ns and 115ns for the forward and inverse transforms respectively. As it
stands, the performance of the present design is even worse than that of the
synchronous design, which requires about 130ns to complete the whole cycle. As
it is not possible to reduce the overhead delay to zero, another way to boost
the speed is to reorganize the partition and structure of the data path. In
the present design, only the router (equivalent to about 3 gate delays) is
allocated to this delay time slot. So some more circuitry, such as the
MUX-based shifter for the scaling function, can be moved into that otherwise
wasted time slot, reducing the delay required for the main data path, as shown
in Fig. 7-8.
[Block diagram; blocks shown: Router, Latch, Multiplier, Add / Subtract, Latch, Shifter (MUX)]
Fig. 7-8 Suggested improvement of IEU design 
The measured power is 310mW when operating at an IO rate of 50MHz with a 5V
power supply. The power consumption cannot be shown to be lower than that of
its synchronous counterpart. However, the power consumption is near zero in
idle mode (no input request), as none of the internal registers are clocked.
In addition, the registers in different self-timed blocks are not all clocked
at the same moment, and they are clocked only when necessary. For example,
four self-timed blocks are clocked three times in each cycle while the rest
are clocked four times. So, theoretically, the overall average power
consumption should be lower.
7.4.4 Determination of optimum delay control voltage 
Besides the traditional functional testing method for finding the maximum
operating speed, a delay time estimation technique was also employed. The
minimum delay value should be larger than the maximum delay between two
pipeline stages in the computational block. A standard CMOS logic gate chain
is built inside the ICT processor chip for determining this actual delay time.
The chain is made up of 25 different standard CMOS logic gates to imitate the
maximum delay path of the full adder and router. The worst-case delay is
approximately equal to 2.2 times the delay of this logic gate chain. This
delay is also equivalent to the delay of 2 delay elements connected in series
plus the other logic gates in the handshake control path (C-element + XOR +
MUX). So some of the overhead delay in the handshake control path can be
absorbed and will not degrade the speed of the micropipeline.
The maximum delay of the logic gate chain (25 logic gates in total) over the 5
test chips is 15.3ns. So the minimum control voltage for the delay element
should be at least 2.4V. The detailed measurement procedure and the results
for this delay model are discussed in the next section. However, the delay
element is unstable when its input is biased to this voltage.
7.5 Testing of delay element and other logic cells 
In the self-timed micropipeline system, a delay element is used to introduce
an artificial delay that matches the delay of the data path, ensuring correct
operation and preventing hazards. Assigning a more conservative value (i.e.
much larger than the actual worst-case delay in the data path) obviously
ensures that the system is hazard-free, but it may also degrade the system
speed. So accurate modeling and measurement of the delay element are very
important. A delay/logic test module is included in the ICT processor chip to
investigate the performance of the real circuits.
[Block diagram; blocks shown: Delay / Logic cells, two MUXes, output pads B and C]
Fig. 7-9 Block diagram of logic cells test circuit 
Fig. 7-9 is the block diagram of the test module for testing the delay of the
delay elements and other logic cells. Output pads B and C are used to measure
the input signal and the corresponding output signal, from which the
propagation delay of the delay element chain is found. The input signal is
measured through pad B so that the error introduced by the delay of the IO pad
and MUX is eliminated.
Fig. 7-10 is the measured timing diagram of a 10ns delay-element chain (with
10 cells) with the voltage control input biased to 0V, which gives the minimum
default delay value. It shows that the rise and fall times are the same. But
the total delay is 58ns, which means each delay cell has a delay of only 5.8ns.
This is far from the simulated value.
Fig. 7-10 Transient response of the 10ns delay element 
As described in the previous section, the total computational time of the
self-timed ICT processor has a relatively large fluctuation. It is caused by
the delay element becoming unstable when its control voltage is increased to
2.5V or above, which is the value necessary to ensure correct operation. Fig.
7-11 shows the transient characteristics of the delay-element chain and the
fluctuation of the delay value within a 10s time period. It was measured with
the control voltage input biased to 2.5V, using the 10s accumulate mode. The
maximum variation under a fixed environment is 88ns in total, which means each
cell has a variation of 8.8ns!
Fig. 7-11 Fluctuation of delay value 
Table 7-2 summarizes the measured delay values against the simulated results. It lists
the average, maximum and minimum readings measured from five test chips. The
fluctuation of the delay value of this 10ns delay element increases as the control
voltage increases.
Control voltage   Simulated   Measured (ns)
(V)               (ns)        Average   Max.   Min.
0.0               10.0         6.1       6.2    6.0
0.5               10.5         6.4       6.5    6.3
1.0               11.4         7.0       7.1    6.8
1.5               12.8         7.9       8.3    7.8
2.0               16.6        10.3      12.1   10.0
2.5               27.8        17.4      22.0   16.7
2.75              36.9        23.2      27.8   22.4
3.0               52.4        33.1      38.1   32.4
3.25              95.3        59.3      65.6   58.0
Table 7-2 Simulated and measured results of the 10ns delay element
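The growth of the fluctuation with control voltage can be read directly from Table 7-2 by taking the difference between the maximum and minimum measured delays. The sketch below does this for each row; the numbers are copied from the table, while the variable and function names are illustrative.

```python
# Rows of Table 7-2: (control voltage in V, simulated, measured average,
# measured max, measured min); all delays in ns.
TABLE_7_2 = [
    (0.00, 10.0, 6.1, 6.2, 6.0),
    (0.50, 10.5, 6.4, 6.5, 6.3),
    (1.00, 11.4, 7.0, 7.1, 6.8),
    (1.50, 12.8, 7.9, 8.3, 7.8),
    (2.00, 16.6, 10.3, 12.1, 10.0),
    (2.50, 27.8, 17.4, 22.0, 16.7),
    (2.75, 36.9, 23.2, 27.8, 22.4),
    (3.00, 52.4, 33.1, 38.1, 32.4),
    (3.25, 95.3, 59.3, 65.6, 58.0),
]


def spread_ns(row):
    """Peak-to-peak spread (max minus min) of the measured delay for one row."""
    _vc, _sim, _avg, d_max, d_min = row
    return round(d_max - d_min, 1)


for row in TABLE_7_2:
    vc, _sim, avg, _d_max, _d_min = row
    s = spread_ns(row)
    print(f"Vc = {vc:4.2f} V   spread = {s:3.1f} ns   ({100 * s / avg:4.1f}% of average)")
```

With these figures the spread stays at or below 0.5ns for control voltages up to 1.5V, and rises to 2.1ns at 2.0V and 7.6ns at 3.25V.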
The increasing fluctuation is in fact expected, because the transistor is biased to
operate in a region very close to cutoff. In addition, as discussed in section 2.4.2.1,
the charging/discharging time is inversely proportional to the drain-to-source
current, and the current does not decrease linearly
with Vgs. Consequently, when the supply for the control voltage is unstable or noise is
injected, a large fluctuation results. To improve stability, the layout of the delay
element should be redesigned to achieve a larger minimum delay value by further
reducing the W/L ratio of the transistors, so that the delay element can be biased
within the range from 0 to 1.5V. Besides the W/L ratio, the dimensions of the
transistors should be increased to reduce the error caused by variation of parameters
during the fabrication process.
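The argument above (delay inversely proportional to drain current, and current that is not linear in Vgs) can be illustrated numerically. The sketch below assumes a simple long-channel square-law model, I = K (Vov)^2 with Vov the gate overdrive, and made-up values for K, the load capacitance and the voltage swing. None of these numbers come from the ES2 process or from the measurements in this chapter; the sketch only shows the trend that the same small disturbance on the gate changes the delay far more when the transistor is close to cutoff.

```python
def delay_ns(v_ov, k_ma_per_v2=0.1, c_pf=0.1, swing_v=2.5):
    """Charging time t = C * dV / I, with I = K * Vov^2 (square-law model).

    All parameter values are hypothetical. Units: pF * V / mA = ns.
    """
    if v_ov <= 0:
        raise ValueError("the model only applies above threshold (Vov > 0)")
    i_ma = k_ma_per_v2 * v_ov ** 2
    return c_pf * swing_v / i_ma


def delay_shift_ns(v_ov, noise_v=0.01):
    """Increase in delay when noise reduces the overdrive by noise_v volts."""
    return delay_ns(v_ov - noise_v) - delay_ns(v_ov)


for v_ov in (2.0, 1.0, 0.5, 0.3):
    print(f"Vov = {v_ov:3.1f} V   delay = {delay_ns(v_ov):6.2f} ns   "
          f"shift for 10 mV noise = {delay_shift_ns(v_ov):6.3f} ns")
```

In this model the relative sensitivity is approximately 2 * noise / Vov, so it grows without bound as the overdrive approaches zero, which is the qualitative behaviour described in the text.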
[Oscilloscope capture: 500MS/s, 1070 acquisitions]
Fig. 7-12 Transient response of the 5ns asymmetric delay element
Control voltage   Simulated   Measured (ns)
(V)               (ns)        Average   Max.   Min.
0.0                5.0         1.6       1.7    1.5
0.5                5.6         1.8       2.0    1.7
1.0                7.5         2.5       2.8    2.4
1.5               10.5         3.5       4.2    3.3
2.0               15.3         5.2       6.1    4.6
2.5               26.4         9.3      11.9    8.5
2.75              37.3        13.2      16.1   12.1
3.0               58.2        20.6      25.5   18.7
Table 7-3 Simulated and measured results of the 5ns asymmetric delay element
A similar phenomenon also occurs in the asymmetrical delay element. Fig. 7-12 shows the
transient response of the 5ns asymmetrical delay element chain, and Table 7-3
summarizes the simulated and measured results.
The large difference between the simulated and measured results arises because the ES2
parameters we were given for SPICE simulation are outdated. The values date from
1994, and the fabrication house has continually improved its process since then.
Fortunately, the functionality of the processor is not affected.
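One way to examine the mismatch discussed above is to compare the measured and simulated delays row by row. The sketch below computes the ratio of measured average to simulated delay using the values from Tables 7-2 and 7-3; the names are illustrative.

```python
# (simulated ns, measured average ns) pairs copied from Tables 7-2 and 7-3.
SYMMETRIC_10NS = [(10.0, 6.1), (10.5, 6.4), (11.4, 7.0), (12.8, 7.9), (16.6, 10.3),
                  (27.8, 17.4), (36.9, 23.2), (52.4, 33.1), (95.3, 59.3)]
ASYMMETRIC_5NS = [(5.0, 1.6), (5.6, 1.8), (7.5, 2.5), (10.5, 3.5), (15.3, 5.2),
                  (26.4, 9.3), (37.3, 13.2), (58.2, 20.6)]


def ratios(pairs):
    """Measured-to-simulated delay ratio for every row of a table."""
    return [measured / simulated for simulated, measured in pairs]


for name, pairs in (("10ns element", SYMMETRIC_10NS),
                    ("5ns asymmetric element", ASYMMETRIC_5NS)):
    r = ratios(pairs)
    print(f"{name}: measured/simulated ranges from {min(r):.2f} to {max(r):.2f}")
```

For each element the ratio stays within a narrow band over the whole control voltage range (roughly 0.61 to 0.63 for the 10ns element and 0.32 to 0.35 for the 5ns element). A nearly constant scale factor of this kind is consistent with a systematic difference between the simulation models and the fabricated devices, although it does not by itself establish the cause.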
Chapter Eight 
8. Conclusions 
In this thesis, many self-timed design methodologies have been discussed, from the
basic 4-phase self-timed handshake control protocols developed by Eva Pang to a new
2-phase micropipeline with a delay selection approach. Other techniques have also
been investigated, such as a handshake control protocol (with parallelism degree of 3)
for feedback systems in DCVSL, and the use of asymmetrical delay elements in a
micropipeline with 4-phase handshaking. All of them are shown to perform better than
their previous designs.
Two self-timed application circuits, a bit-serial matrix multiplier and an 8x8
parallel Booth's multiplier, have been designed as examples to verify the different
self-timed techniques. They were also implemented in CMOS VLSI chips for practical
verification. Both simulation and test results show that the micropipeline is better
than the DCVSL structure, and that 2-phase handshake control is the most efficient
self-timed approach in terms of operating speed, silicon area and design effort. In
addition, the delay value selection technique can further speed up the system by
allowing it to operate with average-case delay.
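Part of the speed advantage of 2-phase control comes from the number of signal transitions needed per data transfer. The sketch below counts request/acknowledge transitions for the generic textbook forms of the two protocols; it is not a model of the specific control circuits designed in this thesis, and the function names are illustrative.

```python
def four_phase_transitions(n_transfers: int) -> int:
    """4-phase (return-to-zero) handshake: req+, ack+, req-, ack- per transfer."""
    req = ack = 0
    transitions = 0
    for _ in range(n_transfers):
        for wire, level in (("req", 1), ("ack", 1), ("req", 0), ("ack", 0)):
            if wire == "req":
                req = level
            else:
                ack = level
            transitions += 1
    # Both wires are back at zero after every complete transfer.
    assert (req, ack) == (0, 0)
    return transitions


def two_phase_transitions(n_transfers: int) -> int:
    """2-phase (transition signalling) handshake: one req toggle, one ack toggle."""
    req = ack = 0
    transitions = 0
    for _ in range(n_transfers):
        req ^= 1          # any edge on req signals new data
        transitions += 1
        ack ^= 1          # any edge on ack signals acceptance
        transitions += 1
    return transitions


print(four_phase_transitions(8), two_phase_transitions(8))
```

The 4-phase protocol spends half of its transitions returning the wires to zero, whereas in the 2-phase protocol every transition carries an event.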
As an alternative completion detection method, the current-sensing completion
detection technique is proved to work, but the performance of this design is not
acceptable and it is not practical in a real digital system. The main weakness is that
the delay of our comparator design is too large. In addition, there are many problems
in designing CSCD circuits; for example, it is very difficult to determine an
appropriate reference current. The current comparator is also extremely sensitive to
variation in the fabrication process, supply voltage fluctuation and noise. The design
of the current comparator should be further characterized and improved, as this
technique is not yet feasible for use in self-timed systems.
After finding the most suitable self-timed structure, our focus changed to the
ICT processor design. Several ICT processor architectures have been analyzed. The
straightforward micropipeline style (i.e. the second and third designs) has negligible
handshake control overhead. However, the second and third designs require much more
hardware than the final one and are less flexible. In addition, the straightforward
micropipeline style cannot take advantage of the variable delay technique. Besides,
even though the third design has the highest data rate, it is not practical to
transfer IO data at such a high speed. To conclude, the final design is a compromise
that balances the advantages of the different approaches in terms of silicon area,
speed and flexibility.
A self-timed ICT processor based on a fast computational algorithm and a 2-phase
handshake micropipeline structure is introduced. The circuit supports 16-bit data and
operates in either forward or inverse transform mode. This self-timed modular design
enables the designer to trade off speed against complexity simply by adding or
removing self-timed blocks. The newly developed handshake control circuitry
makes the interconnection of handshake signals between multiple self-timed blocks
simple and systematic, as the additional circuit involves standard CMOS logic gates
only. Moreover, by using the variable delay-value method and its parallel structure,
the self-timed ICT has the potential to operate at a higher average speed.
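The benefit of selecting the delay value per operation, rather than always waiting for the worst case, can be sketched as a weighted average. The delay classes and probabilities below are hypothetical and chosen only to illustrate the calculation; they are not measurements from the ICT processor.

```python
def average_time_ns(delay_classes):
    """Expected delay when each operation uses the delay value matched to it.

    delay_classes is a list of (probability, delay_ns) pairs.
    """
    total_p = sum(p for p, _ in delay_classes)
    if abs(total_p - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * d for p, d in delay_classes)


def worst_case_ns(delay_classes):
    """Delay needed if a single fixed delay must cover every operation."""
    return max(d for _, d in delay_classes)


# Hypothetical mix: 70% of operations finish within a 10 ns delay,
# the remaining 30% need a 25 ns delay.
classes = [(0.7, 10.0), (0.3, 25.0)]
print(f"fixed worst-case delay: {worst_case_ns(classes):.1f} ns")
print(f"selected delay, average: {average_time_ns(classes):.1f} ns")
```

The gain depends entirely on how often the short delay can be selected; if most operations need the long delay, the average approaches the worst case.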
The self-timed ICT processor has been built and fabricated in ES2 0.7µm
CMOS SLP DLM technology. Both the simulated and the measured timing diagrams of the
internal nodes show no glitches caused by hazards or races, and all the calculated
1-D ICT coefficients are correct. Our improved self-timed design method is therefore
shown to work. However, the performance of the self-timed ICT processor is still not
satisfactory, because of the relatively large overhead in delay time and silicon area
used for handshaking. The stability problem of the delay element also makes the chip
less attractive. Here are some suggestions to improve the self-timed ICT processor:
1. Modify the layout of the delay element to increase its minimum delay value so that
the control voltage can be limited to 1.5V.
2. Rearrange the pipeline structure of the data path to make use of the delay caused by
handshake control overhead.
3. Group the self-timed blocks with common timing characteristics so that two or
more computation blocks can share one handshake control circuit to reduce area
overhead.
Bibliography 
:1] V.K. Madisetti, Digital Signal Processors, An Introduction to Rapid Prototyping 
and Design Synthesis, IEEE Press and Butterworth Heinmann. 
:2] W.K.Cham, "Development of Integer Cosine Transforms by the Principle of 
Dyadic Symmetry” IEEproceedings, Vol. 136, Pt. I, No.4, August 1989. 
[3] W.K.Cham and F.S.Wu, “On Compatibility of order-8 Integer Cosine 
Transforms and the Discrete Cosine Transform", IEEE Region 10 Conference on 
Computer and Communication Systems, Sept 1990. 
[4] W.K Lam, “A PC/AT-based ICT Image Archiving System" M.Phil Thesis, The 
Chinese University of Hong Kong, 1991. 
[5] S. Hauck, "Synchronous Design Methodologies: An Overview", Proceedings of 
IEEE, vol. 83, no.l, pp.69-93, Jan 1995. 
[6] Al Davis, "Synchronous Circuit Design : Motivation, Background, & Methods” 
Asynchronous Circuit Design, Springer, pp.1-49. 
[7] D.A. Huffman, "The Synthesis of sequential switching circuits", J. Fraklin 
Institute, 257:161-190, pp.275-303, March 1954. 
[8] D.E. Muller and W.C. Bartkey, “A theory of asynchronous circuits", Report 75, ‘ 
University ofIllionis, USA, 1956. 
[9] L. Lavagno and A.S. Vincentelli, “Algorithms for Synthesis and Testing of 
Asynchronous Circuits", Proceedings ofKAP, 1993. 
[10] C.L. Seitz, “Self-timed VLSI systems", Introduction to VLSI Systems, Mead and 
Conway, Addison Wesley, 1981. 
[11] S. Burns and A. Martin, “A Synthesis method for self-timed VLSI circuits", 
Proceedings of International Conference on Computer Design, 1987. 
[12] A. Martin, "Programming in VLSI: From Communicating process of delay-
insensitive circuits", Developments in Concurrency and Communications, 
Addison-Wesley, 1990. 
An ICT Image Processing Chip Based on Fast Computation Algorithm and Self-timed Circuit Technique 
Bibliography 
:13]A. Martin, "Formal Program Transformations for VLSI synthesis", Formal 
Development of Programs and Proofs, Addison-Wesley, 1990. 
[14]L.G. Heller and W.R. Griffin, “Cascode Voltage Switch Logic : a differential 
CMOS logic family", Proceedings ofIEEEInt Solid-State Circuits Conference, 
pp.16-17, 1984. 
[15]Y.K. Tan and Y.C. Lim, "Self-timed Precharge Latch", Proceedings ofIEEE 
International Symposium on Circuits and System, May 1990. 
16] Y.K. Tan and Y.C. Lim, "Self-timed system design technique", Electronic 
Letters, vol.26, no.5, pp.284-286, March 1990. 
[17] Y.W. Pang, C.S. Choy and C.F. Chan, “A New Handshaking Control Circuit for 
Self-timed Systems", Electronic Letter, Nov 1994. 
[18] I.E. Sutherland, "Micropipelines", Communications of the ACM, vol.32, no.6, 
pp.720-738, June 1989. 
[19] A.D. Gloria and M. Oliver, "Effective Semicustom Micropipeline Design” 
IEEE Transactions on VLSISystems, vol.3, no.3, pp. 464-469, Sept 1995. 
[20]K.M. Yue, "Use of Variable delay in Asynchronous Circuit", private 
communication, University of Hong Kong. 
[21] Stephen J. Muscato and Alexander Albicki, "Locally Clocked Microprocessor", 
IEEE proceedings, 1993. 
[22] Y.W. Pang, C.S. Choy and C.F. Chan, “An Asynchronous Matrix Multiplier" 
IEEE TENCON Region 10 Int. Conference on Microelectronics & VLSI 
proceeding, Nov, 1995. 
I I 
[23] Y.W. Pang, C.S. Choy and C.F. Chan, "Effective Implementation of Self-timed « 
systems with Feedback Paths", Proceedings of ICS, 1994. 
[24]T.C. Pang, C.S. Choy, C.F. Chan and W.K. Cham, "Self-timed Booth's 
Multiplier", Proceedings of2^d M. conference on ASIC, pp.280-283, 1996. 
[25] Mark.E. Dean, David L. Dill, and Mark Horowiz, "Self-timed Logic Using 
Current-Sensing Completion Detection" IEEE proceedings, 1991 
[26]J.B. Shyu, G.C. Temes and F. Krummenacher, “Random Error Effects in 
Matched MOS Capacitors and Current Sources", IEEE Journal on solid state 
circuit, vol.l9, no.6, pp.948-955, Dec 1994. 
[27]K.M. Chu and D.I. Pullfrey, "Design Procedures for Differemtial Cascode 
Voltage Switch Circuits", IEEE Journal of Solid-State Circuits, vol.21, no.6, 
pp.1082-1087, Dec 1986. 
An ICT Image Processing Chip Based on Fast Computation Algorithm and Self-timed Circuit Technique 
Bibliography 





Appendix - index 
Appendix A - 1.2µm self-timed matrix multiplier test chip
A-1. Top level schematic diagram of the self-timed matrix multiplier test chip
(1.2µm)
A-2. Schematic diagram of Eva's DCVSL matrix multiplier (parallelism degree of 2)
A-3. Schematic diagram of Johnson's simplified DCVSL matrix multiplier
(parallelism degree of 3)
A-4. Schematic diagram of Johnson's micropipeline matrix multiplier (with 2-phase
handshaking control)
Appendix B - 0.7µm self-timed Booth's multiplier and current sensing test chip
B-1. Top level schematic diagram of the self-timed Booth's multiplier & current
sensing test chip (0.7µm)
B-2. Schematic diagram of 8x8-bit self-timed Booth's multiplier
B-3. Schematic diagram of the first and second stage of the Booth's multiplier
(handshake control circuit, input buffer and shift register)
B-4. Schematic diagram of central control circuit (4-phase handshake control)
B-5. Schematic diagram of the first and second stage handshake control circuit
(4-phase handshake control with asymmetric delay element)
B-6. Schematic diagram of the first and second stage handshake control circuit
(4-phase handshake control)
B-7. Schematic diagram of central control circuit of the self-timed Booth's multiplier
circuit (2-phase handshake control)
B-8. Schematic diagram of the first and second stage handshake control circuit
(2-phase handshake control)
B-9. Top level schematic diagram of the first current-sensing test circuit
Appendix C - 1.0µm 2nd current sensing test chip
C-1. Top level schematic diagram of current-sensing test chip
C-2. Top level schematic diagram of 16-bit accumulator using CSCD technique
C-3. Schematic diagram of current-sensor
C-4. Schematic diagram of handshake control circuit of the accumulator
C-5. Schematic diagram of 16-bit signed carry-lookahead full adder
C-6. Schematic diagram of 16-bit input buffer of the accumulator
C-7. Top level schematic diagram of current sensor test circuit
Appendix D - 0.7µm 1st self-timed 2-D ICT core processor
D-1. Top level schematic diagram of self-timed 2-D ICT core processor
D-2. Schematic diagram of input control circuit
D-3. Schematic diagram of 16 integer multiplier units
D-4. Schematic diagram of coefficient storage units for both forward and inverse
transform
D-5. Schematic diagram of an integer multiplier
D-6. Schematic diagram of input shifter for integer multiplier
D-7. Schematic diagram of three stages adder tree
D-8. Schematic diagram of scaling factor storage element
D-9. Schematic diagram of 15x10 bit carry save multiplier with CLA adder
D-10. Schematic diagram of 128x16 bit self-timed transpose RAM module
D-11. Schematic diagram of 16-bit serial to parallel output shifter register for TRAM
D-12. Schematic diagram of IO controller for self-timed TRAM
Appendix E - 0.7µm final version self-timed 1-D ICT core processor
E-1. Top level schematic diagram of self-timed 1-D ICT core processor (final 
version) 
E-2. Schematic diagram of self-timed block 1 
E-3. Schematic diagram of instruction decoder for self-timed block 1 
E-4. Schematic diagram of self-timed block 2 
E-5. Schematic diagram of instruction decoder for self-timed block 2 
E-6. Schematic diagram of self-timed block 3 
E-7. Schematic diagram of instruction decoder for self-timed block 3 
E-8. Schematic diagram of self-timed block 4 
E-9. Schematic diagram of instruction decoder for self-timed block 4 
E-10. Schematic diagram of self-timed block 5 
E-11. Schematic diagram of instruction decoder for self-timed block 5 
E-12. Schematic diagram of self-timed block 6 
E-13. Schematic diagram of instruction decoder for self-timed block 6 
E-14. Schematic diagram of self-timed block 7 
E-15. Schematic diagram of instruction decoder for self-timed block 7 
E-16. Schematic diagram of self-timed block 8 
E-17. Schematic diagram of instruction decoder for self-timed block 8 
E-18. Schematic diagram of 2-stage handshake control circuit (for 3-cycle block) 
E-19. Schematic diagram of 2-stage handshake control circuit (for 4-cycle block) 
E-20. Schematic diagram of ALU1 
E-21. Schematic diagram of ALU2 
E-22. Schematic diagram of ALU3 
E-23. Schematic diagram of ALU4 
E-24. Schematic diagram of signed x2 multiplier
E-25. Schematic diagram of signed x1/2 multiplier
E-26. Schematic diagram of signed x3, x5, x3/2, x3/4, x5/2 multiplier
E-27. Schematic diagram of signed x2, x4 multiplier
E-28. Schematic diagram of signed x1/2, x1/4 multiplier
E-29. Schematic diagram of input buffer 
E-30. Schematic diagram of output data sequencer and multiplexer 
E-31. Schematic diagram of output buffer 
E-32. Schematic diagram of output controller 
E-33. Schematic diagram of test block for delay and C-element 
Appendix F - Chip microphotographs

Appendix A - 1.2µm self-timed matrix multiplier test chip
A-2. Schematic diagram of Eva's DCVSL matrix multiplier (parallelism degree of 2)
A-3. Schematic diagram of Johnson's simplified DCVSL matrix multiplier
(parallelism degree of 3)
A-4. Schematic diagram of Johnson's micropipeline matrix multiplier (with 2-phase
handshaking control)
Appendix B - 0.7µm self-timed Booth's multiplier and current sensing test chip
B-1. Top level schematic diagram of the self-timed Booth's multiplier & current
sensing test chip (0.7µm)
B-2. Schematic diagram of 8x8-bit self-timed Booth's multiplier
B-3. Schematic diagram of the first and second stage of the Booth's multiplier
(handshake control circuit, input buffer and shift register)
B-4. Schematic diagram of central control circuit (4-phase handshake control)
B-5. Schematic diagram of the first and second stage handshake control circuit
(4-phase handshake control with asymmetric delay element)
B-6. Schematic diagram of the first and second stage handshake control circuit
(4-phase handshake control)
B-7. Schematic diagram of central control circuit of the self-timed Booth's multiplier
circuit (2-phase handshake control)
B-8. Schematic diagram of the first and second stage handshake control circuit
(2-phase handshake control)
Appendix C - 1.0µm 2nd current sensing test chip
C-1. Top level schematic diagram of current-sensing test chip
C-2. Top level schematic diagram of 16-bit accumulator using CSCD technique
C-5. Schematic diagram of 16-bit signed carry-lookahead full adder
An ICT Image Processing Chip Based on Fast Computation Algorithm and Self-timed Circuit Technique 
Appendix C-1. O^gn 2nd current sensing test chip 
C-6. Schematic diagram of 16-bit input buffer ofthe accumulator 
Appendix D - 0.7µm 1st self-timed 2-D ICT core processor
D-1. Top level schematic diagram of self-timed 2-D ICT core processor
D-2. Schematic diagram of input control circuit
D-3. Schematic diagram of 16 integer multiplier units
D-4. Schematic diagram of coefficient storage units for both forward and inverse transform
D-6. Schematic diagram of input shifter for integer multiplier
[Drawing title block: Input Shifter for Integer Multiplier, CUHK EE ASIC Lab., Johnson Pang, updated Dec 25 17:48:07 1996]
D-7. Schematic diagram of three-stage adder tree
D-8. Schematic diagram of scaling factor storage element
D-9. Schematic diagram of 15x10-bit carry-save multiplier with CLA adder
D-10. Schematic diagram of 128x16-bit self-timed transpose RAM module
D-11. Schematic diagram of 16-bit serial-to-parallel output shift register for TRAM
D-12. Schematic diagram of IO controller for self-timed TRAM
Appendix E - 0.7µm final version self-timed 1-D ICT core processor
E-1. Top level schematic diagram of self-timed 1-D ICT core processor (final version)
E-2. Schematic diagram of self-timed block 1
E-6. Schematic diagram of self-timed block 3
E-11. Schematic diagram of instruction decoder for self-timed block 5
[Drawing title block: Self-timed 1-D ICT decoder for STB5, CUHK EE ASIC Lab., updated Dec 25 22:23:50 1996]
E-13. Schematic diagram of instruction decoder for self-timed block 6
[Drawing title block: Self-timed 1-D ICT decoder for STB6, CUHK EE ASIC Lab., updated Dec 25 22:50:46 1996]
E-14. Schematic diagram of self-timed block 7
E-15. Schematic diagram of instruction decoder for self-timed block 7
[Drawing title block: Self-timed 1-D ICT decoder for STB7, CUHK EE ASIC Lab., updated Dec 25 23:20:38 1996]
E-17. Schematic diagram of instruction decoder for self-timed block 8
[Drawing title block: Self-timed 1-D ICT decoder for STB8, CUHK EE ASIC Lab., updated Dec 25 23:30:42 1996]
E-18. Schematic diagram of 2-stage handshake control circuit (for 3-cycle block)
E-19. Schematic diagram of 2-stage handshake control circuit (for 4-cycle block)
E-23. Schematic diagram of ALU4
E-27. Schematic diagram of signed x2, x4 multiplier
E-28. Schematic diagram of signed x1/2, x1/4 multiplier
[Drawing title block: Self-timed 1-D ICT, CUHK EE ASIC Lab., 1996; date and drawing name only partly legible]
E-29. Schematic diagram of input buffer
E-30. Schematic diagram of output data sequencer and multiplexer 
E-31. Schematic diagram of output buffer
E-32. Schematic diagram of output controller 

Appendix F - Chip microphotograph
Chip microphotograph of DCVSL and micropipeline bit-serial matrix multiplier
(1.2µm CMOS SLP DLM technology)
Chip microphotograph of Booth's Multiplier and current-sensing test chip
(0.7µm CMOS SLP DLM technology)
Chip microphotograph of the second current-sensing test chip 
(1.2µm CMOS SLP DLM technology)
Chip microphotograph of micropipeline 1-D ICT processor 
(0.7µm CMOS SLP DLM technology)