Computational structures for application specific VLSI processors by Lau, C. H.
Computational Structures for 
Application Specific VLSI Processors 
by 
C.H. Lau 
A thesis submitted to the Faculty of Science, 
University of Edinburgh for the degree of 
Doctor of Philosophy 
Department of Electrical Engineering 
1989 
Abstract 
This thesis centres on the investigation of architectural forms that are able to exploit 
the inherent structure and regularity in algorithms to realise highly concurrent VLSI 
systems. High concurrency in computation may be achieved by directly mapping an 
algorithm into an architectural form embracing multiple processing elements with 
distributed memory, linked by efficient communication channels. Three distinct 
architectural forms will be presented, based on the granularity of the modular 
processing elements. These are based respectively on analogue current 
computational systems, bit-serial flow-graph networks and wavefront arrays. 
Fine grain analogue systems are typically composed of a few thousand nodes per 
chip. Such fine grain analogue processing may be required in the realisation of 
signal conditioning functions for an array of sensors. In these "smart" sensors, an 
analogue functional element is integrated onto the same site as the sensing element 
to form an individual node. A novel "receptor" cell will be presented to illustrate 
the application of fine grain analogue computation in "smart" vision sensors. The 
receptor cell may be tessellated to form an imager with built-in nonlinear automatic 
gain control (AGC) correction, thereby maintaining the operating range of the 
imager in register with ambient light conditions. 
Medium grain, bit-serial machines are typically characterised by nodes where 
memory and logic are extensively intermingled in tightly pipelined structures. 
Computational efficiency may be achieved through functional parallelism in which 
arrays of hard-wired processors are used to boost system throughput. A filter 
section based on wave digital filter theory will be presented as a case study typical 
of this class of medium grain, flow-graph networks. This filter section, called an 
adaptor, permits the realisation of digital filters with the desirable characteristics of 
passive RLC ladder filters, namely their stability and insensitivity to component 
value variations. 
For highly concurrent numerical computations, bit-parallel computational arrays 
may be employed. The coarser granularity of these nodes tends to increase the 
physical extent of the communication paths in such arrays, making it difficult to 
operate all the nodes in synchrony. This difficulty may be overcome in arrays 
operating with self-timed communication. Generic logic structures that enable self-
timed computational nodes to be efficiently realised will be presented and verified. 
Finally, a wavefront array multiplier will be described to illustrate this mode of 
computation. 
Declaration of Originality 
This thesis, composed entirely by myself, reports work carried out solely by myself 
in the Department of Electrical Engineering, University of Edinburgh between 
January 1985 and January 1989. The only exception to this statement is the case 
study of the digital wave filter adaptor described in Chapter 5. This work is part of 
a three-man team project in which I was a member, with responsibility for the 
adaptor implementation. Aspects of this work which overlap with those published 
by my colleagues are indicated by a footnote at the relevant pages. 
Acknowledgements 
I am grateful to my two supervisors, Dr. David Renshaw and Prof. John Mavor for 
the unstinting support and guidance that they have given me during the course of 
this work. The work described in this thesis was carried out in the course of my 
employment here as a research associate. This opportunity has been kindly 
extended to me successively by Prof. John Mavor, Prof. Peter Denyer and Dr. 
David Renshaw. They have been instrumental in securing me financial support 
without which this thesis would not have been completed. 
Various colleagues in the department have generously given me their assistance in 
evaluating the different chip designs used as case studies in this thesis. In 
particular, I would like to thank Dr. Andrew Firth and Mr. Stephen Lam. I am 
also indebted to Caroline Burns who has been a great source of help and support 
during the course of my employment in this department. Finally, I wish to thank 




Index of Abbreviations 
For convenience, some acronyms and abbreviations commonly used throughout this 
thesis are listed below.
Ack   Acknowledge                   FIFO  First-in, first-out memory
AGC   Automatic gain control        LSB   Least-significant bit
ALU   Arithmetic logic unit         M     Mega or 10⁶ scaling factor
CCD   Charge-coupled device         MSB   Most-significant bit
CPU   Central processing unit       PP    Partial product
CSA   Carry-save adder              PPS   Partial product sum
CVSL  Cascode voltage switch logic  Req   Request
DSP   Digital signal processing     SD    Sign-digit
FFT   Fast Fourier transform        VLSI  Very large scale integration








Chapter 1  Computational Structures in VLSI
An Overview
Introduction
Computational Costs in VLSI
Scaling Effects of VLSI
Thesis Core
Fine Grain Analogue Systems
Medium Grain Bit-Serial Machines
Coarse Grain Computational Arrays
Original Contribution
Summary
Chapter 2  Analogue VLSI Computation
Introduction
Elementary Arithmetic Functions
Identity and Scaling
Addition and Subtraction
Ensemble Averaging
Complex Arithmetic Functions
Translinear Principle
Translinear Current Loops
Translinear Synthesis
Product-Quotient Circuit
High-Ratio Current Mirror
Sources of Error
Threshold Mismatch
Body Effect
Transistor Characteristics in Weak Inversion
Modelling of Weak Inversion Conduction with SPICE2
Summary
Chapter 3  An Electronic Eye
A Case Study on Analogue Computation
Introduction




Logarithmic Response Sensor
Implementation
Blooming Suppression
Automatic Gain Control
Implementation
Design of Receptor Cell
Transient Response of Photoreceptor
Scanning Frame
Results




Experimental Setup
Results
Spectral Response
Results
Blooming or Spatial Crosstalk
Result
Transient Response
Result
Summary of Results
Summary
Chapter 4  Bit-Serial Computation
Introduction
Functional Parallelism
Rudiments of Bit-Serial Computation
Bit-Parallel versus Bit-Serial Signal Representation
Control of Bit-Serial Networks
Arithmetic Shifting
Bit-Serial System Composition
Clocking Strategies for Bit-Serial Operators
Dynamic Precharge Logic
Dynamic Pipeline Logic
Race-free Dynamic Pipeline Logic
Two-Phase NORA
Results
Performance Evaluation
Race-Free PHIMOS
Results
Performance Evaluation
Summary
Chapter 5  Digital Wave Filter Adaptor
A Case Study on Bit-Serial Computation
Introduction
Wave Digital Filters
Choice of Frequency and Signal Parameters
Circuit Elements
Interconnection of Elements with Adaptors
Functional Design of Universal Adaptor Structure
Adaptor Flow-graph
Serial Pipeline Multiplier
Adaptor Latency
Universal Adaptor Control
Results
Seventh Order Wave Digital Filter (WDF) System
Frequency Response Measurement
Summary
Chapter 6  Self-Timed Computation
Introduction
Scaling Aspects of Self-Timing
Delay Insensitive Specification
Isochronic Regions
Wire-Delay Characteristics
Reset Signaling Protocol
Data Encoding
Double-Rail Coding
Data Flow Model of Self-Timed Computation
Implications of Data Flow Operation
Delay Assumptions of Data Flow Model
Data Flow Logic Realisation
Interconnection of Operators
Summary
Chapter 7  A Wavefront Array Multiplier
A Case Study on Self-Timed Computation
Introduction
FIFO Elements
Mode of Operation
Delay Insensitive Communication
Case I
Case II
Case III
Case IV
Binary Multiplication
Redundant Binary Number Addition
Self-Timed Carry Propagate Addition
Multiplier Structure
Multiplier Algorithm
Multiplier Composition
Summary





Pin-Out of Prototype Chips VB076 and IMAGO04 











Chapter 1 
Computational Structures in VLSI 
An Overview 
1.1. Introduction 
Commercial microprocessors and digital signal processor (DSP) chips are becoming 
increasingly powerful, offering the user an ever improving cost-performance ratio. 
These devices are tailored to specific applications through software development and 
probably represent the most cost effective solution if they are capable of 
accommodating the application at hand. However, in applications where the input 
data is to be processed in real time or a high throughput rate is required, the 
hardware used must be capable of performing high speed arithmetic. This 
requirement is not met by commercial programmable components that are based on 
traditional von Neumann processor-memory organisation. Such machines consist 
of a single processor or central processing unit (CPU) connected to a single, central 
memory. The computational bottlenecks of such devices stem from two 
fundamental limitations: 
• 	There is only a single processor unit working sequentially in 
time. 
• 	The processor-memory bandwidth is limited by the use of a 
single bus structure that forces data transfer to proceed 
sequentially. 
In an attempt to address the above shortcomings, the latest generation of digital 
signal processors incorporates additional hardware for address generation and 
dual-ported memories to increase data transfer rates. The increase in computational 
throughput provided by this approach is marginal, however, and a more radical 
approach is needed for orders-of-magnitude improvement. 
Most signal processing algorithms are computationally intensive, requiring multiple 
arithmetic operations to be performed on each signal sample flowing from a 
continuous stream of data. However, the sequence of computations is usually 
repetitive in nature, setting up a regular flow of data through successive series of 
computations. This inherent structure within the underlying algorithms makes them 
highly amenable to a Very Large Scale Integration (VLSI) solution. VLSI offers 
the designer the opportunity to define computational structures with local memory 
at multiple sites for the concurrent processing of data. For high throughput rates or 
operation in real-time, the computational demands of signal processing algorithms 
can only be met by application-specific computational architectures incorporating 
the use of multiple processing elements with distributed memory. In this respect, 
we are aided by the inherent structure and regularity of most signal processing 
algorithms, enabling a direct mapping of the algorithm into regular computational 
arrays for efficient implementation in VLSI. 
This fixed-function approach to signal processor realisation exploits the benefits of 
VLSI technology to the full. Firstly, control overheads are kept to a minimum 
since there is no longer a need to specify the function required, as in programmable 
devices. In addition, the flow of data between computational elements is known 
beforehand and remains constant. As a result, memory local to the computational 
elements can be provided for the storage of intermediate results. For a dedicated 
VLSI device receiving data and outputting results through an attached host, the 
provision of local memory can significantly reduce the I/O bandwidth requirements. 
A typical computing environment for such devices is shown in Fig. 1.1. The host 
in this context can be a computer, a memory or a real-time data stream. 
1.2. Computational Costs in VLSI 
The cost of implementing a computation can be viewed as consisting of two 
aspects: 
•   The cost of realising the processing elements, as measured by the 
use of silicon area, pin count and computation time. 
•   The cost of providing communication between processing elements, 
an interconnection cost that is measured in terms of 
silicon area and signal propagation time. 
The relative contribution of processing and communication towards the total costs 
of computation is not a constant and is strongly dependent on the architecture and 










Fig. 1.1 	Typical computing environment for custom, dedicated 
VLSI devices. 
1.2.1. Scaling Effects of VLSI 
With VLSI, there is a relentless drive to define smaller feature sizes over an 
increasing chip area in order to exploit the economic benefits of the technology to 
the full. With the scaling down of device feature size, the communication cost 
will increasingly dominate the power, time and area required to implement a 
computation [79]. 
Consider the scaling down of all feature dimensions by a factor 1/a (a > 1). If 
device structures in the vertical dimension are assumed to scale correspondingly, 
then to hold the electric fields constant, the operating voltage must scale by 1/a as 
well. The first effect of scaling is that the number of transistors that can be placed 
on a chip per unit area is increased by a factor of a². The switching delay, or 
transit time T, of a transistor is given by the time it takes charge carriers to traverse 
the electric field in the channel. Since the electric field remains constant with 
scaling, the shorter channel length of a scaled transistor results in a transit time 
scaled by a factor 1/a. The transistor becomes faster and the cost of realising 
the active devices within processing elements becomes increasingly affordable with 
scaling. 
The interconnection delay in a wire is governed by the diffusion equation that 
determines the rate at which a voltage driven onto a point on a wire equalises 
along the length of the wire. This diffusion delay is proportional to RCl², where R 
and C are the resistance and capacitance per unit length of wire and l is the length of 
the wire. The cross-sectional area of a wire is scaled by 1/a², scaling R by a². 
C remains unchanged since the scaling down in conductor area is accompanied by a 
corresponding decrease in dielectric separation. The wire length l is reduced by 1/a, 
so the wire delay RCl² remains constant with scaling. 
Since transistor switching delay decreases while interconnect delay remains 
unchanged with scaling, the speed at which a circuit can operate will eventually be 
dominated by wire delay rather than device switching delay. The consequences of 
scaling down the physical dimensions in MOS technology are summarised in 
Table 1.1. 
Parameter            Scaling Factor
Device density       a²
R                    a²
C                    constant
Switching delay T    1/a
Wire delay RCl²      constant
Table 1.1 Scaling effects in MOS technology. 
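The entries of Table 1.1 follow mechanically from the constant-field scaling argument given above. As an illustrative sketch only, the factors may be tabulated for any scaling parameter a. The function name mos_scaling_factors and the dictionary keys are introduced here for illustration and are not part of the original work; ideal first-order scaling is assumed throughout.

```python
def mos_scaling_factors(a: float) -> dict:
    """First-order constant-field scaling factors for a linear shrink of 1/a.

    Hypothetical helper for illustration; it reproduces the rows of Table 1.1.
    """
    if a <= 1:
        raise ValueError("scaling parameter a must exceed 1")
    r_per_length = a ** 2   # wire cross-section shrinks by 1/a^2
    c_per_length = 1.0      # conductor area and dielectric separation shrink together
    length = 1.0 / a        # wire length shrinks by 1/a
    return {
        "device_density": a ** 2,
        "R": r_per_length,
        "C": c_per_length,
        "switching_delay": 1.0 / a,
        # diffusion delay is proportional to R * C * l^2
        "wire_delay": r_per_length * c_per_length * length ** 2,
    }


factors = mos_scaling_factors(2.0)
for name, value in factors.items():
    print(f"{name:16s} x{value:g}")
```

For a = 2 the wire delay factor is 1 while the switching delay factor is 0.5, so the ratio of wire delay to switching delay doubles with each such shrink.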
1.3. Thesis Core 
With VLSI technology, communication will become increasingly expensive in area 
and time with scaling relative to switching. Communication is therefore the key 
issue in VLSI design and only computational structures that support efficient 
communication will exploit the full benefits of the technology. The physical 
communication requirement arising from the mapping of an algorithm into VLSI is, 
to an extent, dependent on the complexity of the chosen processing elements or 
nodes. The complexity or "grain-size" of the nodes determines the number of nodes 
that can be accommodated on a given die size. Fig. 1.2 illustrates the range of 
granularity of hard-wired processors that this thesis will address, a subset of the 
range of ensemble machines as classified by Seitz [72]. 

Fig. 1.2  Typical range of granularity for application-specific 
processors. 
With current fabrication technology, an integrated circuit today would typically be 
about 5 to 10 mm square, with a 2 µm minimum feature size. In units of λ, the 
scalable unit of length measurement representative of the fundamental resolution of 
the process [45], this translates to a die size of approximately 10⁴ λ per side or 100 
Mλ² in total area. For fine grain systems composed of a few thousand nodes per 
chip, computation on signal quantities is carried out in the analogue domain so that 
the individual node size can be restricted to an area of about 10⁴ λ². Next on the 
scale of granularity are bit-serial machines with node size of the order of 1 Mλ². 
Such systems are typically characterised by nodes where memory and logic are 
extensively intermingled in tightly pipelined structures. For highly concurrent 
numerical computations, computational arrays composed from nodes with node size 
of the order of 10 Mλ² are used. These arrays are connected in such a way that the 
communication topology exactly matches the data flow of the computation. The 
coarser granularity of the nodes tends to increase the physical extent of the 
communication paths in such arrays. For this reason, it may prove difficult to 
operate all the nodes in a computational array in synchrony. This difficulty may be 
overcome if communication between nodes is carried out in a self-timed fashion. 
Such an approach will be presented as part of this thesis. 
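The relationship between node area and the number of nodes per chip can be checked with simple arithmetic. The sketch below assumes the 100 Mλ² die quoted above, where λ denotes the scalable unit of length. The medium and coarse grain node areas are taken from the text; the fine grain figure of 3 × 10⁴ λ² is an assumption chosen here so that the result matches the "few thousand nodes" quoted. The counts are upper bounds, since no area is set aside for pads or inter-node wiring.

```python
DIE_AREA = 100_000_000  # die area in lambda^2 (100 M-lambda^2, from the text)

# Node areas in lambda^2. The "fine" entry is an assumed value for illustration.
NODE_AREA = {
    "fine": 30_000,        # analogue computational cell (assumption)
    "medium": 1_000_000,   # bit-serial node, about 1 M-lambda^2
    "coarse": 10_000_000,  # bit-parallel array node, about 10 M-lambda^2
}


def nodes_per_chip(node_area: int, die_area: int = DIE_AREA) -> int:
    """Upper bound on node count, ignoring pads and interconnect area."""
    return die_area // node_area


for grain, area in NODE_AREA.items():
    print(f"{grain:6s} grain: at most {nodes_per_chip(area)} nodes per chip")
```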
1.3.1. Fine Grain Analogue Systems 
For a system composed of a few thousand nodes per chip, the physical design is 
tractable only if communication between nodes is constrained to nearest neighbours 
or if global broadcasting is employed. At this extremely fine grain size, each node is 
constrained to fit into an area in the region of 10⁴ λ² to 10⁵ λ². With this 
area constraint, a digital approach is clearly impossible and an analogue approach is 
required to provide a viable solution. Typical application areas for such fine grain 
processing are in signal conditioning functions for sensors. In these "smart" sensors, 
an analogue functional element is integrated onto the same site as the sensing 
element to form an individual node. Pioneering work in this area has been carried 
out by Mead and colleagues at Caltech [83, 76, 82, 30], where the main objective is 
to compute motion from moving visual images in real time. The workhorse used 
in the above body of work is the transconductance amplifier [54], which is used as 
a building block for functional elements such as integrators, differentiators, 
multipliers and resistive networks. 
In this thesis, a pure current-mode approach to analogue computation will be 
presented. Current is preferred over voltage and charge as a means of manipulating 
analogue signal quantities for three reasons: 
• 	For computations involving voltages, the bandwidth may be 
limited by the time required to charge and discharge circuit 
capacitances. 
• 	For computations involving charge quantities, the injection of 
stray noise charge from switching transistors [92] and parasitic 
capacitances onto floating capacitors can severely degrade 
accuracy. 
• 	Compact realisation of complex computational functions is 
possible by applying the translinear principle [20] which applies 
exclusively in the current domain. 
This approach has been applied to the design of an optical imager with the unique 
ability to automatically compensate for the brightness level of ambient lighting, in 
order to prevent the sensing elements from saturating. In this respect, this self-
compensating mechanism imitates the behaviour of a receptor cell in the retina, 
which is capable of adjusting its sensitivity to keep its response in register with 
ambient lighting conditions [90]. The imager may be viewed as a two-dimensional 
array of processing nodes, with each node consisting of a photodiode sensor with 
built-in variable gain control. Due to the large number of nodes characteristic of 
such fine grain systems and the limitation in pin-count imposed by current 
packaging technology, the outputs of the nodes are usually scanned out sequentially 
in time. 
1.3.2. Medium Grain Bit-Serial Machines 
For medium grain size nodes each occupying an area of about 1 Mλ², a digital 
approach may be used which would enable tens of such nodes to be accommodated 
per chip. Systems can then be constructed from the modular use of multiple chips 
so that the number of nodes in a system can be matched to the application 
bandwidth. For lower bandwidth applications, the hardware resources of a system 
may be reduced by time-multiplexing an appropriate number of nodes to spread the 
computations out in time. For this level of granularity, the communication paths 
typically extend over two levels of interconnection, namely between nodes residing 
on the same chip and nodes on different chips. To meet the twin criteria of 
efficient communication and bandwidth matching, bit-serial arithmetic is often used 
to realise the computational nodes. Since signals are transmitted via single wires, 
there is rarely a pin-out problem with bit-serial parts. Systems constructed from 
bit-serial nodes can be conveniently partitioned into two distinct classes: logic 
enhanced memories [73] and flow-graph networks [14]. 
Logic enhanced memories are nodes that contain a mixture of storage and logic, 
with the logic operating on the contents of the local store. For low-level image 
computations that can be decomposed into local pixel operations, logic enhanced 
memories provide a concurrent mapping of the algorithm directly into hardware. 
Typical systems would consist of hundreds of such identical nodes. The number of 
nodes can be arbitrarily increased to track the problem size without any difficulty 
since communication between nodes is local and the distributed nature of memory 
provides such systems with unlimited memory bandwidth. Examples of such 
systems are CLIP [18] and Pixel-Planes [64] developed respectively at University 
College, London and the University of North Carolina. 
For flow-graph networks, the nodes may be of different functional complexity but 
they all conform to a common generic type. The nodes of the system correspond to 
the functional operators in the flow-graph of the algorithm to be computed. One of 
the earliest examples of this approach was the second-order filter section reported by 
Jackson [31]. With external state memory, the nodes within the section can be 
time-multiplexed to realise a virtual cascade of second-order sections for speech 
bandwidth applications. Higher throughputs can be achieved through functional 
parallelism by increasing the number of nodes in a system. This has been 
demonstrated by Powell [65] for the implementation of the fast Fourier transform 
(FFT) algorithm and other matrix-oriented computations. 
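The idea of time-multiplexing one second-order section over an external state memory can be illustrated at the word level. The sketch below is a software analogy only and makes no claim about the bit-serial hardware reported in [31]. The names biquad and virtual_cascade, the transposed direct form II structure and the coefficient ordering (b0, b1, b2, a1, a2) are all assumptions introduced here.

```python
def biquad(x, coeffs, state):
    """One second-order section (transposed direct form II).

    Returns the output sample and the updated state pair.
    """
    b0, b1, b2, a1, a2 = coeffs
    s1, s2 = state
    y = b0 * x + s1
    return y, (b1 * x - a1 * y + s2, b2 * x - a2 * y)


def virtual_cascade(samples, sections):
    """Reuse the single biquad routine for every section of the cascade.

    `states` plays the role of the external state memory: one entry per
    virtual section, read and written back each time the section is visited.
    """
    states = [(0.0, 0.0)] * len(sections)
    out = []
    for x in samples:
        for k, coeffs in enumerate(sections):
            x, states[k] = biquad(x, coeffs, states[k])
        out.append(x)
    return out


# Two unit-delay sections in cascade delay the input by two samples.
unit_delay = (0.0, 1.0, 0.0, 0.0, 0.0)
print(virtual_cascade([1, 2, 3, 4], [unit_delay, unit_delay]))
```

The inner loop visits the sections in sequence for each input sample, which is why this arrangement suits low bandwidth signals such as speech: the sample period must cover one pass through every virtual section.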
A filter section based on wave digital filter theory [17] will be presented as a case 
study typical of this class of medium grain, flow-graph networks. The filter section, 
called an adaptor, permits the realisation of digital filters with the desirable 
characteristics of passive RLC ladder filters, namely their stability and insensitivity 
to component value variation. 
1.3.3. Coarse Grain Computational Arrays 
The numerical solution of a system of linear equations often involves a more 
advanced class of arithmetic operations such as division and square-root extraction. 
Both of these arithmetic operations are data-dependent, requiring a comparison of 
two operands that cannot be efficiently mapped into bit-serial form [77]. For this 
reason, such functions are usually realised using bit-parallel arithmetic, resulting in 
an increase in computational node size to an area of about 10 Mλ². Due to the 
coarse granularity, the communication paths of a system composed of an array of 
such nodes typically span three levels of interconnect. The interconnection network 
provides communication within chips, between chips residing on the same printed-
circuit board and finally, between boards of chips. 
In order to limit the extent of the communication paths in computational arrays, 
H.T. Kung and his collaborators at Carnegie-Mellon University [35] proposed the 
systolic array. A systolic array is a regular connection of nodes distinguished by its 
nearest-neighbours-only communication strategy. The nodes constituting the main 
body of the system are usually identical, performing the same basic operation whilst 
those nodes at the array boundary may compute a different function. The 
computations are distributed in a pipeline fashion across the entire array with all 
nodes operating synchronously. With systolic arrays, the communication topology 
of the array exactly matches the data flow of the algorithm so that the nodes are 
kept busy all the time. The only control required is a globally distributed clock. 
They are ideal for high bandwidth applications where the aim is to extract the 
maximum throughput available from the implementation technology. 
Systolic arrays synchronise data communication between nodes through a globally 
applied clock signal. For large two-dimensional arrays, the clock may need to be 
derated to account for possible clock skews. To overcome this shortcoming, a 
variant of systolic arrays with self-timed communication, called wavefront arrays, has 
been proposed by S.Y. Kung [36]. In wavefront arrays, the synchronisation of data 
at each node is accomplished locally using a data-driven approach. A node in the 
array remains inactive until all input data are available and its previous output has 
been delivered to all neighbouring nodes. An additional advantage of the 
wavefront approach is that the time taken to complete a computation is not fixed 
and can vary from cycle to cycle. With computations requiring different 
completion times that vary with actual data values, a wavefront array may achieve a 
higher throughput than the corresponding systolic array, which must be clocked at a 
rate set by the worst-case completion time [7]. Examples of such data-dependent 
computations include ripple-carry addition, division and the comparison of the 
magnitude of two operands. 
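The throughput argument can be made concrete with a toy timing model of ripple-carry addition. The sketch below considers a single node processing a stream of operand pairs, and assumes that the completion time of an addition is one unit plus the length of the longest carry ripple. The delay model, the 4-bit word width and all function names are assumptions introduced here for illustration.

```python
WIDTH = 4  # assumed word length in bits


def carry_chain_length(x: int, y: int, width: int = WIDTH) -> int:
    """Longest run of bit positions a carry travels (generate, then propagates)."""
    carry, run, longest = 0, 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        if a & b:                 # generate: a new carry starts here
            carry, run = 1, 1
        elif (a ^ b) and carry:   # propagate: the incoming carry ripples on
            run += 1
        else:                     # no carry leaves this position
            carry, run = 0, 0
        longest = max(longest, run)
    return longest


def completion_time(x: int, y: int, width: int = WIDTH) -> int:
    """Assumed data-dependent delay of one addition, in arbitrary units."""
    return 1 + carry_chain_length(x, y, width)


def systolic_time(operands, width: int = WIDTH) -> int:
    """Clocked node: every cycle is as long as the worst-case addition."""
    return len(operands) * (1 + width)


def wavefront_time(operands, width: int = WIDTH) -> int:
    """Self-timed node: each addition takes only as long as it needs."""
    return sum(completion_time(x, y, width) for x, y in operands)


operands = [(1, 2), (7, 1), (15, 1), (5, 5)]
print("systolic :", systolic_time(operands))
print("wavefront:", wavefront_time(operands))
```

For these four operand pairs the clocked node needs 20 time units against 12 for the self-timed node; only the pair (15, 1) exercises the worst-case carry chain.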
Although the concept of self-timed communication is attractive, there are very few 
practical examples of such asynchronous, wavefront systems due mainly to the 
difficulty in implementing the self-timed communication protocols. To partly 
address this difficulty, logic structures that enable self-timed computational nodes to 
be efficiently realised will be presented as part of this thesis. A first-in, first-out 
(FIFO) memory will be used to verify the self-timed communication protocol by 
demonstrating proper propagation of wavefronts between nodes. As the nodes in a 
FIFO are pure storage nodes without any computational facility, a wavefront 
multiplier is subsequently described to demonstrate the viability of realising 
wavefront arrays in VLSI. 
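The data-driven firing rule can be illustrated with an abstract model of a self-timed FIFO, in which a stage passes its datum forward only when it holds one and the next stage is empty. The sketch below is a behavioural abstraction under assumed names (step, settle, push); it does not model the request and acknowledge signalling or the logic structures presented later in this thesis.

```python
from typing import List, Optional

Stage = Optional[str]  # None marks an empty stage


def step(stages: List[Stage]) -> List[Stage]:
    """One round of local handshakes; index 0 is the input end."""
    nxt = list(stages)
    # Visit stages from the output end so each token moves at most once.
    for i in range(len(nxt) - 2, -1, -1):
        if nxt[i] is not None and nxt[i + 1] is None:
            nxt[i + 1], nxt[i] = nxt[i], None
    return nxt


def settle(stages: List[Stage]) -> List[Stage]:
    """Repeat local handshakes until no stage can fire."""
    while True:
        nxt = step(stages)
        if nxt == stages:
            return stages
        stages = nxt


def push(stages: List[Stage], token: str) -> List[Stage]:
    """Offer a token to the input stage and let it ripple forward."""
    if stages[0] is not None:
        raise RuntimeError("input stage is full; the sender must wait")
    return settle([token] + stages[1:])


fifo: List[Stage] = [None] * 4
for token in "abc":
    fifo = push(fifo, token)
print(fifo)
```

After the three pushes the tokens are queued at the output end in arrival order, with 'a' in the last stage, and no global clock has been assumed at any point in the model.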
1.4. Original Contribution 
Three distinct approaches to concurrent computation are presented in this thesis for 
application-specific VLSI processors. The following are original contributions of 
this thesis: 
• 	In § 2.3, the translinear principle is applied to analogue VLSI 
current computation, exploiting the weak inversion characteristic 
of MOS transistors. This approach allows the compact 
realisation of complex computational functions, illustrated with 
the design of a "receptor cell" described in § 3.4. This 
compactness of realisation is essential for "smart" vision sensing 
systems requiring the features in an image to be captured with 
acceptable resolution. 
• 	A race-free clocking scheme for high-performance CMOS 
pipelines, ideal for realising bit-serial operators, is presented and 
tested in § 4.4.2. This technique, called PHIMOS, requires just 
a single global clock line to be distributed. The clock may be 
driven by a sinusoid to avoid the transmission of the higher 
frequency components associated with fast clock edges. 
• 	A data flow approach to self-timed logic is introduced in § 6.5. 
Generic hardware structures that allow an efficient realisation of 
data flow logic are presented in § 6.6. This is illustrated with 
the design of a wavefront array multiplier described in § 7.4. 
A new design for a first-in, first-out (FIFO) memory based on 
the data flow approach is presented and tested in § 7.2. 
1.5. Summary 
This thesis describes three distinctly different approaches to concurrent computation 
for application-specific VLSI processors. The distinction is based on the granularity 
of the individual processing elements from which a system is constructed. The 
thesis, therefore, decomposes into three parts with each part consisting of two 
chapters. The three parts deal respectively with analogue computational systems 
(Chapters 2 & 3), bit-serial flow-graph networks (Chapters 4 & 5) and wavefront 
arrays (Chapters 6 & 7). The first of each pair of chapters introduces the 
fundamental concepts relevant to that particular computational form, together with 
the appropriate literature survey on the subject. The second chapter of each pair 
then describes a case study illustrative of the computational form under discussion, 
together with the presentation of practical results that serve to verify the concepts 
introduced in the preceding chapter. Chapter 8 provides some concluding remarks 
and discusses the problems encountered with each of the three computational forms. The 
three case studies described have been fabricated on two sets of chips, VB076 and 
IMAGO04. The pin-outs of chips VB076 and IMAGO04 are given in Appendix A. 
The derivation of the equations describing the operation of a prototype logarithmic 
imager, used as a test vehicle in Chapter 3, is given in Appendix B. Finally, 
Appendix C lists the author's publications. 
Chapter 2 
Analogue VLSI Computation 
2.1. Introduction 
Analogue circuits operate on signals in the amplitude domain. They have 
traditionally been applied to perform the linear functions of addition, subtraction 
and scaling (or amplification) and are often referred to as linear circuits. 
Improvements in fabrication process and the development of new circuit 
techniques [54, 70] in recent years have dramatically extended the application area 
of analogue circuits. Nonlinear functions such as multiplication and division can 
now be realised in a simple and elegant fashion by applying the translinear 
technique [20]. As a result, a variety of analogue computational functions can now 
be realised that are often more efficient both in terms of silicon area and 
computation time than their digital counterparts. In addition, signals at the 
interface of a system to the external world are often analogue in nature. If 
analogue computation can be applied, then external signals at the interface may be 
manipulated directly without the need for conversion into and out of the digital 
domain. This would reduce overall system complexity and cost. 
More importantly, there may be problem areas in which a digital solution may not 
be at all possible because a digital realisation of system components would be too 
complex and slow. Such a situation arises when the output of a sensor requires 
nonlinear correction to achieve a desired output response. This 
feature is central to the design of a novel imager that will be presented as a case 
study in Chapter 3. The computational requirement of this "electronic eye" is 
extremely demanding, requiring some form of nonlinear computation to be carried 
out on each individual pixel value, in real-time, at the focal plane. The only 
tractable solution is to use analogue VLSI techniques that permit the realisation of 
compact structures that operate directly on sensed data at the sensor site. This 
integration of sensor and signal conditioning function onto a single element forms 
the basis of so called "smart" sensors. The integrated element or pixel may be 
viewed as a fine grained computational node that can be tessellated to form an 
A linear function f satisfies the principle of superposition so that 
$$f(a x_1 + b x_2) = a f(x_1) + b f(x_2)$$




This chapter will introduce how computation can be done with analogue circuits in 
VLSI. In the analogue domain, we can represent a signal as either a current or 
voltage level. Certain computations are more conveniently realised using currents 
as signals and others by using voltages. In general however, the realisation of basic 
arithmetic functions such as scaling, addition, subtraction, multiplication and 
division tends to require fewer transistors in a current-mode approach. This area 
efficiency in the realisation of complex computational functions is of prime 
importance in the design of the electronic eye. The area occupied by a pixel must 
be kept sufficiently small so that an image may be captured with acceptable 
resolution. For a description of analogue computation involving voltages, the 
interested reader is referred to Ref. [54]. 
2.2. Elementary Arithmetic Functions 
A current mirror is probably the most common circuit configuration encountered in 
analogue design. It can be used to produce a replica of some input current, to 
invert current direction or to scale a current by some constant factor. Current 
mirrors are often used to channel together copies of current signals and form the 
basis of several elementary functional circuits. By directly exploiting the physical 
law of charge conservation, the realisation of functional circuits through the use of 
current mirrors is particularly elegant and efficient. 
2.2.1. Identity and Scaling 
When the result of a computation from a functional block is fed to several different 
destinations, the number of destinations is referred to as the fan-out of the source. 
In MOS technology, the interconnection requirement for multiple fan-out is strongly 
influenced by whether the source is current or voltage based [3]. The input 
impedance of a MOS transistor is nearly infinite and from a d.c. standpoint, 
voltage-mode sources can have a near unlimited fan-out. For multiple fan-out, a 
signal voltage carried on a single wire can be applied directly to all destinations. In 
contrast, current-mode devices require as many interconnection lines as there are 
destinations, with one line dedicated to each destination. A current signal is 
therefore replicated to generate as many copies of the signal as is required by the 
fan-out of the source. 
The identity function allows replication of a signal and in the current domain, this 
function is performed by a current mirror. In saturation, the drain current $I_{DS}$ of a 
MOS transistor is essentially independent of the drain-source voltage $V_{DS}$. $I_{DS}$ is 
therefore determined by the gate-source voltage $V_{GS}$ when a transistor is in 
saturation. To preserve a current-only approach, the controlling gate voltage can be 
set by a diode-connected transistor biased by a reference current $I_{ref}$, as shown in 
Fig. 2.1a. The controlling gate voltage can then be distributed to multiple current 
mirror transistors to produce multiple copies of the reference current (Fig. 2.1b). 
The drain current $I_{DS}$ flowing in a transistor is also linearly related to its aspect ratio 
W/L. By varying the size of transistors in a mirror, the output current $I_{out}$ can be 
scaled by a factor given by the geometric ratio of the output transistor to the 
diode-connected transistor. However, scaling by geometric means is only used in practice 
when the scaling factor required is small. For scaling factors greater than ten, the 
mirror occupies a substantial area and is seldom used. 
Fig. 2.1a Basic current mirror. 
Fig. 2.1b Multiple fan-out for current signals. 
2.2.2. Addition and Subtraction 
The addition and subtraction of currents follow directly from the principle of charge 
conservation as embodied by Kirchhoff's current law: the sum of currents flowing 
into a node is equal to the sum of currents flowing out of a node. We can add 
currents by mirroring currents into the same node. This is shown in Fig. 2.2a for 
the addition of two currents in which Kirchhoff's law gives $I_{out} = I_1 + I_2$. 
Similarly, subtraction of currents is effected by mirroring positive currents into a 
node from VDD and reflecting negative currents out of the node into ground. This 
is illustrated by the current mirror given in Fig. 2.2b. 
Fig 2.2 	Addition and subtraction of currents by current mirrors. 
(a) Current mirrors are used to reflect currents into a node to perform addition, 
$I_{out} = I_1 + I_2$. 
(b) Subtraction of currents using current mirrors follows directly from Kirchhoff's 
current law, $I_{out} = I_1 - I_2$. 
2.2.3. Ensemble Averaging 
In many instances, the average of a group of signals is required as a reference to 
other functional blocks. The ensemble current-averaging circuit shown in Fig. 2.3 
computes the average for current-type signals in a simple and elegant fashion. 
Suppose the drain current $I_{DS}$ of a transistor is related to the gate-source voltage 
$V_{GS}$ by the expression 
$$I_{DS} = I_0 \, A \, f(V_{GS})$$
where $I_0$ is some constant, $A$ the aspect ratio W/L of the device and $f(V_{GS})$ some 
function of $V_{GS}$. Then summing the currents, 
$$I_{total} = \sum_{i=1}^{n} I_i = I_0 \, f(V_{GS}) \sum_{i=1}^{n} A_i$$
assuming that all the transistors are operating in the same region in their output 
characteristics. Since all the transistors share a common gate voltage $V_{GS}$, $I_{out}$ is 
given by 
$$I_{out} = I_0 \, A_{out} \, f(V_{GS}) = \frac{A_{out}}{\sum_{i=1}^{n} A_i} \sum_{j=1}^{n} I_j$$
If all the transistors have identical aspect ratios, then $I_{out}$ is the average of all the 
input currents. 
Fig 2.3 	Ensemble averaging circuit where $I_{out} = \frac{A_{out}}{\sum_{i=1}^{n} A_i} \sum_{i=1}^{n} I_i$. 
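The averaging relation can be checked numerically with a short sketch. The function name and the example currents below are illustrative assumptions; the code only evaluates the ratio of aspect ratios times the summed input current, as implied by a shared gate voltage.

```python
def ensemble_output(input_currents, input_ratios, output_ratio):
    """Output current of the averaging circuit for a shared gate voltage.

    Evaluates I_out = A_out * sum(I_i) / sum(A_i), which follows from
    I = I_0 * A * f(V_GS) with the same V_GS on every device.
    """
    if len(input_currents) != len(input_ratios):
        raise ValueError("one aspect ratio is needed per input current")
    return output_ratio * sum(input_currents) / sum(input_ratios)


# Hypothetical example: three input currents (amperes), identical devices.
currents = [10e-9, 20e-9, 60e-9]
i_out = ensemble_output(currents, [1.0, 1.0, 1.0], 1.0)
print(f"average current = {i_out * 1e9:.1f} nA")
```

With identical aspect ratios the result is the arithmetic mean of the inputs; a larger output aspect ratio simply scales that mean.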
2.3. Complex Arithmetic Functions 
In order to form a more complex class of arithmetic functions than those realised so 
far with current mirrors, a branch of analogue circuits based on the translinear 
principle [20] is introduced. Translinear circuits can perform a variety of linear 
and nonlinear functions of high functional complexity with a surprisingly small 
number of devices. As an example, the four-transistor realisation of a variable gain 
current amplifier [21] given in Fig. 2.6 can effect the multiplication and division of 
currents. Translinear circuits are operated exclusively in the current domain. Their 
principle of operation is based on the proportionality of transconductance to output 
current in certain electronic devices. 
2.3.1. Translinear Principle 
Translinear circuits exploit the exponential relationship between the current flowing in 
the output node of a device and the input voltage applied to its control node. Thus 
$$I_{out} = I_0 \, e^{V_{in}/(m U_T)} \qquad (2.1)$$
where $I_0$ is some constant, $U_T$ the thermal voltage kT/q and $m$ a process dependent 
factor. The transconductance of such a device is given by 
$$\frac{\partial I_{out}}{\partial V_{in}} = \frac{I_{out}}{m U_T}$$
The transconductance is therefore a linear function of current and the term 
translinear is used to describe this key property of the device. The prime example 
of a device with a current-voltage relationship satisfying Eqn. 2.1 is the bipolar 
transistor where the collector current is exponentially related to the base-emitter 
voltage over many decades of current [19]. In MOS technology, a transistor biased 
to operate in weak inversion will also exhibit translinear behaviour [86]. In weak 
inversion, the MOS transistor drain characteristic can be approximated by 
$$I_{DS} = I_{DO} \, e^{V_{GS}/(m U_T)} \qquad (2.2)$$
where $I_{DS}$ is the drain current, $V_{GS}$ the gate-source voltage and $I_{DO}$ some process 
dependent constant. From Eqn. 2.2, 
$$V_{GS} = m U_T \ln\left(\frac{I_{DS}}{I_{DO}}\right)$$
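A small numerical sketch can illustrate why the exponential law is called translinear: the slope of current against gate voltage, estimated by a finite difference, tracks the current itself divided by the product of m and the thermal voltage. The values of m, the thermal voltage and the pre-exponential constant below are assumed purely for illustration.

```python
import math

U_T = 0.025    # thermal voltage kT/q in volts (assumed room-temperature value)
M = 1.6        # process dependent factor m (assumed illustrative value)
I_DO = 1e-15   # process dependent constant in amperes (assumed)


def drain_current(v_gs):
    """Weak inversion law of Eqn. 2.2: I_DS = I_DO * exp(V_GS / (m U_T))."""
    return I_DO * math.exp(v_gs / (M * U_T))


def transconductance(v_gs, dv=1e-6):
    """Central-difference estimate of dI_DS/dV_GS."""
    return (drain_current(v_gs + dv) - drain_current(v_gs - dv)) / (2 * dv)


def gate_voltage(i_ds):
    """Inverse of Eqn. 2.2: V_GS = m U_T ln(I_DS / I_DO)."""
    return M * U_T * math.log(i_ds / I_DO)


for v in (0.3, 0.4, 0.5):
    ratio = transconductance(v) / drain_current(v)
    print(f"V_GS = {v:.1f} V: g_m / I_DS = {ratio:.3f} per volt")
```

The printed ratio stays constant across bias points, which is the linear dependence of transconductance on current described above.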
2.3.2. Translinear Current Loops 
In this section, the translinear principle is derived [20] using MOS transistors 
operated in weak inversion as the translinear elements. Consider a simplified loop 
consisting of diode-connected transistors as shown in Fig. 2.4. In any electrical 
network consisting of one or more loops, the sum of the voltages in each loop is 
equal to zero by Kirchhoff's law. In practical circuits, the currents in the loop may 
be set up by three terminal connections but the principle derived will still be valid. 
Summing voltages round the loop, 
= 0 
VGsI + Vs - VGS2 - V0 + .....= 0 
= Y, Vj 
anli —clockwise 	clockwise 
DSi 	 iPLi (2.3) mU- In I I = 	mUT In 
anti—clockwise 	 ( '1)01 J clockwise 	 ( 'DOj ) 
where anti-clockwise direction of current flow is taken to be positive. If the 
translinear elements are realised in a monolithic form, then the following 
assumptions can be made 
• 	Process-dependent device parameters are uniform for all 
elements. 
• 	The elements all operate at the same temperature. 
Eqn. 2.3 can therefore be simplified to 
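$$\sum_{\text{anti-clockwise}} \ln\left(\frac{I_{DSi}}{I_{DOi}}\right) = \sum_{\text{clockwise}} \ln\left(\frac{I_{DSj}}{I_{DOj}}\right)$$
Since a sum of logarithms equals the logarithm of the product of their arguments, 
$$\prod_{\text{anti-clockwise}} \frac{I_{DSi}}{I_{DOi}} = \prod_{\text{clockwise}} \frac{I_{DSj}}{I_{DOj}}$$
$$\frac{\prod_{\text{anti-clockwise}} I_{DSi}}{\prod_{\text{clockwise}} I_{DSj}} = \frac{\prod_{\text{anti-clockwise}} I_{DOi}}{\prod_{\text{clockwise}} I_{DOj}} \qquad (2.4)$$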
$I_{DO}$ is proportional to the aspect ratio A = W/L of a transistor, but it also contains a 
temperature and process-dependent term $J_{DO}$ [85]. $J_{DO}$ is poorly controlled and its 
absolute value will vary with chips from different batch runs. However, its value 
may be considered a constant independent of current level for transistors on the 
same chip. Eqn. 2.4 then becomes 
$$\frac{\prod_{\text{anti-clockwise}} I_{DSi}}{\prod_{\text{clockwise}} I_{DSj}} = \frac{\prod_{\text{anti-clockwise}} A_i J_{DO}}{\prod_{\text{clockwise}} A_j J_{DO}} \qquad (2.5)$$
The influence of temperature and processing can be removed if the $J_{DO}$ terms in the 
numerator and denominator on the right hand side of Eqn. 2.5 mutually cancel. 
This would be the case if elements in the loop occur in matched pairs connected in 
opposing polarity. For a loop of n elements, Eqn. 2.5 reduces to 
$$\prod_{r=1}^{n/2} \frac{I_{DS(2r-1)}}{A_{2r-1}} = \prod_{r=1}^{n/2} \frac{I_{DS(2r)}}{A_{2r}} \qquad (2.6)$$
If $I_{DS}/A$ is regarded as the current density, then the translinear principle states that 
in any closed loop with pairs of devices connected in opposing polarity, the product 
of the current densities in one direction is equal to that in the opposing direction. 
Translinear circuits therefore operate uniquely in the current domain and are 
insensitive to uniform variations of temperature and processing. 
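The loop argument can be exercised numerically. The sketch below assumes a four-device loop with devices 1 and 3 in one direction and devices 2 and 4 in the other, forces three currents, solves the voltage sum for the fourth gate-source voltage and converts it back to a current. The aspect ratios, currents and the two values of the process term are hypothetical.

```python
import math

M_UT = 0.040  # m * U_T in volts (assumed; 40 mV is the typical value quoted)


def gate_source_voltage(i_ds, aspect, j_do):
    """Invert I_DS = A * J_DO * exp(V_GS / (m U_T)) for V_GS."""
    return M_UT * math.log(i_ds / (aspect * j_do))


def drain_current(v_gs, aspect, j_do):
    """Weak inversion drain current for a device of the given aspect ratio."""
    return aspect * j_do * math.exp(v_gs / M_UT)


def fourth_current(i1, i2, i3, aspects, j_do):
    """Solve V_GS1 + V_GS3 - V_GS2 - V_GS4 = 0 for the current in device 4."""
    a1, a2, a3, a4 = aspects
    v4 = (gate_source_voltage(i1, a1, j_do)
          + gate_source_voltage(i3, a3, j_do)
          - gate_source_voltage(i2, a2, j_do))
    return drain_current(v4, a4, j_do)


aspects = (1.0, 2.0, 4.0, 1.0)      # hypothetical aspect ratios A_1..A_4
i1, i2, i3 = 2e-9, 5e-9, 8e-9       # hypothetical forced currents in amperes

i4_a = fourth_current(i1, i2, i3, aspects, j_do=1e-15)
i4_b = fourth_current(i1, i2, i3, aspects, j_do=5e-15)

# Products of current densities in the two directions of the loop.
lhs = (i1 / aspects[0]) * (i3 / aspects[2])
rhs = (i2 / aspects[1]) * (i4_a / aspects[3])
print(lhs, rhs, i4_a, i4_b)
```

The two products agree, and the solved current is the same for both values of the process term, which illustrates the insensitivity to uniform process and temperature variation.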
2.3.3. Translinear Synthesis 
As an example of translinear synthesis, consider the classic configuration consisting 
of a quadruple of transistors as shown in Fig. 2.5. Depending on how the devices 
are driven, this circuit can perform a variety of functions on the currents $I_1$, $I_2$, $I_3$ 
and $I_4$. Two functional circuits, a product-quotient circuit and a high-ratio current 
mirror will now be derived [21]. The product-quotient circuit provides a variable 
current gain with the gain factor set by a ratio of two currents. The current mirror 
has a scaling factor determined by the geometric ratio of transistors connected in a 
translinear loop. 
Fig 2.4 Simplified translinear current loop. 
Product-Quotient Circuit 
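If the currents $I_1$, $I_2$ and $I_3$ shown in Fig. 2.5 are forced, then applying the 
translinear principle, 
$$\frac{I_1}{A_1} \, \frac{I_3}{A_3} = \frac{I_2}{A_2} \, \frac{I_4}{A_4}$$
$$I_4 = \frac{A_2 A_4}{A_1 A_3} \, \frac{I_1 I_3}{I_2} = \frac{I_1 I_3}{I_2}$$
assuming identical geometries for all four transistors. The product-quotient circuit 
is given in Fig. 2.6 and can be used as a variable gain current amplifier with a gain 
factor set by a ratio of two currents. 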
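As a minimal sketch of the product-quotient relation, the function below evaluates the fourth current from three forced currents and four aspect ratios. The function name, default geometry and example values are assumptions for illustration only.

```python
def product_quotient(i1, i2, i3, aspects=(1.0, 1.0, 1.0, 1.0)):
    """Fourth current of the translinear quad.

    From (I1/A1)(I3/A3) = (I2/A2)(I4/A4):
    I4 = (A2 * A4) / (A1 * A3) * I1 * I3 / I2.
    """
    a1, a2, a3, a4 = aspects
    return (a2 * a4) / (a1 * a3) * i1 * i3 / i2


# Hypothetical example: I1 is the signal, I3 / I2 sets a current gain of 2.
i4 = product_quotient(i1=10e-9, i2=20e-9, i3=40e-9)
print(f"I4 = {i4 * 1e9:.1f} nA")
```

Holding two of the currents fixed turns the circuit into a multiplier or divider for the remaining one.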













High-Ratio Current Mirror 
If $I_2$ can be made proportional to either $I_1$ or $I_3$, then the quadruple core of the 
product-quotient circuit can be used to synthesise a current-mirror with scaling 
factor determined by a geometric ratio of transistors. Such a current-mirror is 
shown in Fig. 2.7. Applying the translinear principle around the minor loop 
consisting of $T_3$ and $T_5$, 
$$\frac{I_3}{A_3} = \frac{I_5}{A_5}$$
so that, with $I_5 = I_2$, 
$$\frac{I_3}{A_3} = \frac{I_2}{A_5}$$
$$I_4 = \frac{A_2 A_4}{A_1 A_3} \, \frac{I_1 I_3}{I_2} = \frac{A_2 A_4}{A_1 A_3} \, \frac{A_3}{A_5} \, I_1 = \frac{A_2 A_4}{A_1 A_5} \, I_1$$
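The scaling factor of the high-ratio mirror can be sketched by combining the quad relation with the minor loop. The code assumes that the minor loop enforces $I_3 / A_3 = I_2 / A_5$, an assumption chosen to be consistent with the scaling factor quoted for Fig. 2.7; the function name and the geometry values are hypothetical.

```python
def high_ratio_mirror(i1, i2, aspects):
    """Output current I4 of the high-ratio mirror.

    aspects holds (A1, A2, A3, A4, A5). The minor loop is assumed to give
    I3 = (A3 / A5) * I2; the quad then gives
    I4 = (A2 * A4) / (A1 * A3) * I1 * I3 / I2.
    """
    a1, a2, a3, a4, a5 = aspects
    i3 = (a3 / a5) * i2
    return (a2 * a4) / (a1 * a3) * i1 * i3 / i2


# Hypothetical geometry: A2 = A4 = 10 and A1 = A5 = 1 give a ratio of 100.
aspects = (1.0, 10.0, 2.0, 10.0, 1.0)
i4 = high_ratio_mirror(i1=1e-9, i2=3e-9, aspects=aspects)
print(f"I4 / I1 = {i4 / 1e-9:.1f}")
```

The result does not depend on the bias current in the second device or on the third aspect ratio, so a ratio of 100 is obtained from devices with ratios of only 10.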
2.4. Sources of Error 
In a translinear loop, non-ideal effects such as ohmic resistances giving rise to stray 
voltage drops around the loop will introduce errors into Eqn. 2.6. Random 
variations in the geometry of transistors are also unavoidable and will contribute an 
error to the ideal current relationship. However, the main source of error in using 
MOS transistors as translinear elements is caused by variations in the threshold 
voltage of the transistors. This threshold voltage variation can be due to two 
effects: 
• 	In any fabrication process, a mismatch $\Delta V_{TO}$ in zero-bias 
threshold voltage among transistors inevitably occurs. This 
threshold voltage mismatch will impose a fundamental limit on 
the accuracy of translinear circuits realised from MOS devices. 
• 	The threshold voltage $V_T$ of a MOS transistor is not a constant, 
varying slightly as a function of the voltage between its source 
and bulk. 
Fig 2.7 	High ratio current mirror where $I_4 = \frac{A_2 A_4}{A_1 A_5} I_1$. 
2.4.1. Threshold Mismatch 
In present MOS processing technology, the zero-bias threshold voltage $V_{TO}$ cannot 
be controlled to anything better than ±10 mV, even for physically close 
transistors [87]. The threshold mismatch $\Delta V_{TO}$ can be expressed as an equivalent 
geometric mismatch $\lambda_m$ given by 
$$\lambda_m = e^{\Delta V_{TO}/(m U_T)} \qquad (2.8)$$
Taking the threshold mismatch into account, Eqn. 2.6 is modified to 
$$\prod_{r=1}^{n/2} \frac{I_{DS(2r-1)}}{A_{2r-1}} = e^{\Delta V_{TO}/(m U_T)} \prod_{r=1}^{n/2} \frac{I_{DS(2r)}}{A_{2r}}$$
$$\prod_{r=1}^{n/2} \frac{I_{DS(2r-1)}}{A_{2r-1}} = \lambda_m \prod_{r=1}^{n/2} \frac{I_{DS(2r)}}{A_{2r}} \qquad (2.9)$$
The equivalent effect of threshold mismatch is to introduce a temperature-dependent 
geometric factor $\lambda_m$. Translinear analysis of current densities can then 
proceed as before, with $\lambda_m$ regarded as a constant scaling factor. For typical values 
of $\Delta V_{TO}$ = 5 mV and $m U_T$ = 40 mV, $\lambda_m$ has an approximate value of 1.13, 
representing a possible deviation in expected current value of about 15 %. As a 
result, translinear circuits realised from MOS devices are limited in their output 
signal current accuracy. Emerging fields of application where such circuits may be 
tolerated are in the areas of collective computation [81, 75] and "smart" vision 
sensors in which features of interest, such as the edges of objects, show up as sharp, 
abrupt intensity variations that are not easily swamped by such inaccuracies. 
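The size of the mismatch factor is easy to evaluate. The sketch below computes the exponential of the threshold mismatch divided by the product of m and the thermal voltage, using the 40 mV value quoted in the text as a default; the function name is an assumption.

```python
import math


def mismatch_factor(delta_v_to, m_ut=0.040):
    """Equivalent geometric mismatch exp(dV_TO / (m U_T)); voltages in volts."""
    return math.exp(delta_v_to / m_ut)


for dv in (0.005, 0.010):
    factor = mismatch_factor(dv)
    print(f"dV_TO = {dv * 1e3:.0f} mV -> factor {factor:.3f}")
```

A 5 mV mismatch gives a factor close to 1.13 and a 10 mV mismatch gives a factor close to 1.28, which is the origin of the accuracy limits discussed in this chapter.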
2.4.2. Body Effect 
In the preceding discussion on the operation of a MOS transistor in weak inversion, 
the source and bulk are assumed to be at the same potential. The source may 
therefore be used as the reference for the gate potential. The drain current can 
then be expressed as 
$$I_{DS} = I_{DO} \, e^{V_{GS}/(m U_T)}$$
where $I_{DO}$ is some constant, $V_{GS}$ the gate-source voltage, $U_T$ the thermal voltage 
kT/q and $m$ a process dependent factor. If the potential at the source terminal $V_S$ is 
different from that of the bulk $V_B$, then $V_{GS}$ must be replaced by $(V_G - m V_S)$, with 
all potentials defined with respect to the substrate. The voltage $V_S$ reverse-biases 
the source-bulk junction and causes the depletion layer under the channel to 
become wider with increasing values of $V_S$. There is now more charge resident in 
the depletion layer due to the ionised impurities in the substrate. Since the 
threshold voltage $V_T$ can be regarded as the gate voltage necessary to maintain the 
depletion region without creating a channel, $V_T$ will increase with $V_S$ as a result. 
This effect can be accounted for by modifying the drain current expression to 
give [87]: 
$$I_{DS} = I_{DO} \, e^{(V_G - m V_S)/(m U_T)} \qquad (2.10)$$
The factor m is determined primarily by local substrate doping with a value that 
increases with substrate doping. For a typical process, m has a value in the range 
$1.3 < m < 2$ and can be considered a constant. Body effect in a transistor can be 
eliminated by tying its source terminal to the local substrate. This is easily done in 
practice though at some expense of layout density due to the need for separate 
wells. 
2.5. Transistor Characteristics in Weak Inversion 
When MOS transistors are used as translinear elements, the transistors must be 
operated in the weak inversion part of their output characteristic. In weak 
inversion, the drain characteristic of a transistor can be expressed as [85]: 
$$\begin{aligned}
I_{DS} &= I_{DO} \, e^{V_G/(m U_T)} \left( e^{-V_S/U_T} - e^{-V_D/U_T} \right) \\
       &= I_{DO} \, e^{(V_G - m V_S)/(m U_T)} \left( 1 - e^{-V_{DS}/U_T} \right) \\
       &= I_{sat} \left( 1 - e^{-V_{DS}/U_T} \right) \qquad (2.11)
\end{aligned}$$
The absolute value of $I_{DO}$ is poorly controlled from batch to batch and is 
approximated by [87]: 
$$I_{DO} = K \beta U_T^2 \, e^{-V_{TO}/(m U_T)}$$
where $\beta$ is the transistor transfer factor $\mu C_{ox} W/L$ and K some constant factor 
greater than 1. $I_{DO}$ is strongly temperature and process-dependent but may be 
considered a constant for transistors on the same chip, provided that the threshold 
mismatch $\Delta V_{TO}$ approaches zero. However, the threshold mismatch in present 
MOS processes can be controlled to no better than ± 10 mV. This can give rise to 
a possible 28 % error in $I_{DO}$. 
The drain current $I_{DS}$ of a transistor at a given gate voltage increases rapidly from 
zero with $V_{DS}$. When $V_{DS}$ has reached a value of about 3 $U_T$, $I_{DS}$ saturates to an 
almost constant value $I_{sat}$ determined by the gate voltage (refer to Eqn. 2.11). 
Above this saturated value of drain-source voltage $V_{DSAT}$, $I_{DS}$ increases gently with 
$V_{DS}$. This small increase in $I_{DS}$ in the saturation region is caused by the drain 
depletion layer extending back into the channel, effectively reducing the channel 
length of the transistor. The gradient of the output drain characteristics in 
saturation is proportional to the drain current level $I_{DS}$. Therefore, if the drain 
characteristics are extrapolated backwards, they converge onto a single point $V_A$ on 
the $V_{DS}$ axis. The intercept $V_A$ is called the Early voltage of the transistor. The 
drain characteristics of a typical transistor in weak inversion for different levels of 
$I_{sat}$ are given in Fig. 2.8. 
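In the saturation region, 
$$\frac{\partial I_{DS}}{\partial V_{DS}} = \frac{I_{sat}}{V_A}$$
Eqn. 2.11 can therefore be approximated by 
$$I_{DS} = I_{sat} \left( 1 + \frac{V_{DS}}{V_A} \right)$$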




The saturation current $I_{sat}$ of a transistor in weak inversion has an upper limit given 
approximately by $\beta U_T^2$ [85]. Above this value, the transistor starts to enter strong 
inversion and $I_{DS}$ is no longer exponentially related to $V_G$. Therefore in circuit 
designs which exploit the weak inversion characteristic of transistors [84], the drain 
current levels of those transistors operating in weak inversion must be set below 
$\beta U_T^2$ by proper biasing. For moderately sized transistors with $\beta$ of about 100 
μA/V², $I_{sat}$ in weak inversion is limited to below 0.1 μA. In weak inversion 
therefore, 















Fig 2.8 	Output drain characteristics of a transistor in weak 
inversion (drain current against $V_{DS}$). 
Fig 2.9 	Diffusion and drift components of $I_{DS}$ as a function of $V_{GS}$, 
spanning the weak-inversion, transition and strong-inversion regions 
(after Ref [1]). 
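Two of the figures quoted in this section can be checked with a few lines of arithmetic: the fraction of the saturated current reached at a drain-source voltage of three thermal voltages, and the upper current limit for weak inversion. The sketch assumes a thermal voltage of 25 mV and ignores the Early effect; the function names are illustrative.

```python
import math

U_T = 0.025  # thermal voltage in volts (assumed room-temperature value)


def saturation_fraction(v_ds):
    """Fraction I_DS / I_sat from the factor (1 - exp(-V_DS / U_T))."""
    return 1.0 - math.exp(-v_ds / U_T)


def weak_inversion_limit(beta):
    """Approximate upper limit beta * U_T^2 on the weak inversion current."""
    return beta * U_T ** 2


print(f"I_DS / I_sat at 3 U_T: {saturation_fraction(3 * U_T):.3f}")
print(f"limit for beta = 100 uA/V^2: {weak_inversion_limit(100e-6) * 1e9:.1f} nA")
```

The drain current is within about 5 % of its saturated value at three thermal voltages, and the limit for a transfer factor of 100 μA/V² is 62.5 nA, consistent with the bound of 0.1 μA given above.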
2.5.1. Modelling of Weak Inversion Conduction with SPICE2 
The drain current $I_{DS}$ of a MOS transistor can be modelled by two components, $I_{diff}$ 
and $I_{drift}$, each arising from a different conduction mechanism [1]. $I_{DS}$ can 
therefore be expressed as 
$$I_{DS} = I_{diff} + I_{drift} \qquad (2.14)$$
$I_{diff}$ arises from the diffusion of carriers caused by a carrier concentration gradient 
along the channel of a transistor. For $V_{GS} < V_T$, $I_{diff}$ is the dominant term in 
Eqn. 2.14. $I_{drift}$, on the other hand, is set up by the drift experienced by charge 
carriers in an electric field. $I_{drift}$ is therefore non-zero only if $V_{DS} > 0$ and $V_{GS} > V_T$, 
so that an inversion layer can be formed under the gate. 
The two components of $I_{DS}$ as a function of gate voltage $V_{GS}$ are shown in Fig. 2.9. 
In an attempt to join the two curves in the transition region, SPICE2 defines a new 
threshold voltage $V_{ON}$ [88]. $V_{ON}$ is slightly above $V_T$ and marks the transition from 
weak to strong inversion. At this point $V_{ON}$, an exponential characteristic is 
attached to the $I_{drift}$ curve to model the weak inversion region where $V_{GS} < V_T$. 
Weak inversion operation of a transistor as modelled in SPICE2 is a function of the 
parameter $N_{FS}$ [88]. $N_{FS}$ is not related to the physical nature of weak inversion 
conduction; it is a curve-fitting parameter obtained from measurements of actual 
devices. Fig. 2.10 shows the transfer curves for two different values of $N_{FS}$ as 
simulated by SPICE2. Notice that the attempt to join the two curves results in a 
kink in the current characteristic at the transition point from weak to strong 
inversion $V_{ON}$. Improved models based on the physical modelling of weak inversion 
conduction have been proposed [1, 47] to remove this discontinuity in the drain 
current characteristic. 
In SPICE2, weak inversion conduction modelling is included only in the 
LEVEL=2 (MOS2) and LEVEL=3 (MOS3) models. For computational 
efficiency, the MOS3 model can be used where only first-order effects are modelled. 
To invoke weak inversion conduction modelling, NFS must be given some non-zero 
value. Otherwise, NFS defaults to zero and weak inversion conduction is not 
modelled. Table 2.1 gives a list of the relevant parameters to include for weak 
inversion modelling. 












Fig 2.10 	Drain characteristic of a transistor as modelled by SPICE2 
for two different values of $N_{FS}$, plotted against $V_{GS}$ (V) ( after Ref [1] ). 

Table 2.1 
Parameter   Description or value 
VTO         zero-bias threshold 
KP          intrinsic transconductance 
GAMMA       bulk threshold parameter 
PHI         0.6 V 
KAPPA       1 
NFS         parameter that controls the amount of weak inversion current flowing 
TOX         required for calculating $V_{ON}$ in weak inversion 





The rudiments of analogue VLSI computation have been presented, using currents 
for representing and manipulating signals. Elementary arithmetic functions such as 
identity, addition and averaging may be realised from a combination of basic 
current mirrors. In addition, more complex arithmetic functions such as 
multiplication, division and square root may be synthesised by applying the 
translinear principle. The translinear principle is based on the exponential 
dependence of the drain current of a MOS transistor in weak inversion on its gate 
voltage. As a result, translinear circuits realised from MOS devices are limited by 
the mismatch in transistor threshold voltages to an accuracy of no better than 
± 20 % in their output signal currents. Emerging fields of application where such 
circuits may be tolerated are collective computation and "smart" vision sensors. 





An Electronic Eye 
A Case Study on Analogue Computation 
3.1. Introduction 
Most commercial solid-state imagers are based on the principle of charge sensing 
employing charge-coupled devices (CCD) or MOS photodiodes. These devices 
respond to absolute intensity levels and have an upper limit to their input intensity 
range. When this upper limit is reached, any further increase in intensity values 
will not result in any corresponding response at the output of the sensors. The 
sensors are said to be saturated and are incapable of responding to light levels above 
the saturation intensity value. Such sensors are therefore sensitive to overlighting 
and may not work well under unfavourable ambient lighting conditions. When 
these imagers are used at the front-end of computer vision systems to capture image 
data, saturation of the sensors would inevitably result in a loss of information in the 
acquired data. This failure to acquire an image without loss of information is a 
major drawback of current computer vision systems that are applied in a visually 
uncontrolled environment, such as the outside world. As a practical illustration of 
this difficulty, the present generation of vision systems used for autonomous vehicle 
guidance [16] are unable to track a road under unfavourable conditions, where 
there may be an expanse of water or snow in the scene. 
Another shortcoming of charge-based sensors is that their dynamic range is 
insufficiently wide for use in the outside world [6]. Charge-based sensors have a 
dynamic range of about 1000 or 30 dB. This is significantly less than the dynamic 
range in a typical outdoor scene, where the intensity variation from bright sunlight 
to deep shadow spans about six orders of magnitude or 60 dB. For these reasons, 
the use of currently available vision systems is usually confined to an environment 
where the ambient lighting can be carefully controlled, such as in a factory 
inspection line. Mechanical auto-irises have been developed for commercial 
imaging products to address the susceptibility to saturation of charge-based sensors. 
These mechanical irises operate along the principle of a camera shutter, controlling 
the amount of light incident on the sensor cells. In contrast to this approach, novel 
analogue VLSI computation techniques will be applied in this chapter to effect a 
solution inspired by biological vision systems. Instead of controlling the light 
intensity incident on a sensor cell with a fixed photo-response, the response of the 
sensor is varied to a range sensitive to that level of incident intensity. In this 
respect, this self-compensating mechanism imitates the behaviour of a receptor cell 
in the retina, which is capable of adjusting its sensitivity to keep its response in 
register with ambient lighting conditions [90]. 
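The dynamic range figures quoted earlier in this introduction follow a ten-times-log convention for intensity ratios. A minimal sketch, with a hypothetical function name, reproduces the two values and the shortfall between them.

```python
import math


def ratio_to_db(ratio):
    """Express an intensity ratio in decibels using 10 * log10(ratio)."""
    return 10.0 * math.log10(ratio)


sensor_db = ratio_to_db(1e3)   # charge-based sensor, ratio of about 1000
scene_db = ratio_to_db(1e6)    # outdoor scene, six orders of magnitude
print(f"sensor {sensor_db:.0f} dB, scene {scene_db:.0f} dB, "
      f"shortfall {scene_db - sensor_db:.0f} dB")
```

The 30 dB shortfall corresponds to a factor of 1000 in intensity that a sensor with a fixed response cannot cover.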
3.2. Solid-State Image Sensing 
The most common mechanism for the detection of light in semiconductors relies on 
the generation of electron-hole pairs by incident radiation. This process, called 
photon absorption, occurs when an incident photon causes an electron to make a 
transition to a higher energy state. Since energy and momentum must be 
conserved, the energy gained by the electron equals the photon energy $h\nu$, where $h$ 
is the Planck constant and $\nu$ the frequency of the incident photon. If the number 
of photons per second, or intensity, is reduced from $I$ by an amount $\delta I$ over a 
distance $\delta x$ in a semiconductor, then a parameter $\alpha$ called the absorption coefficient 
can be defined by $\delta I / I = -\alpha \, \delta x$. In the limit, $dI/dx = -\alpha I$, with solution 
$I = I_0 e^{-\alpha x}$. $I_0$ is the intensity at $x = 0$, the surface of the solid. 
Not all photons will create an electron-hole pair and the fraction that do is called 
the quantum efficiency. In order to maximise the quantum efficiency at a 
given wavelength, the dimensions of a photodetector can be tailored to the 
absorption coefficient $\alpha$ at that wavelength. If the attenuation $\alpha$ is too high at the 
incident wavelength, most of the electron-hole pairs will be generated near to the 
surface of the material. Carrier lifetimes are much shorter near the surface of a 
material than in the bulk due to the presence of defects and impurities in the 
surface region. Therefore fewer carriers will be available for photo-conduction. 
However, if $\alpha$ is too low, the solid appears transparent to radiation at that 
wavelength and the overall quantum efficiency will also be low. A good 
compromise for the thickness of a photodetector is $1/\alpha$ for the wavelength of 
interest [60], in which case the light intensity is reduced by a factor of e (≈ 2.72) 
over this distance. 
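The exponential attenuation can be sketched numerically to show what the thickness compromise implies. The absorption coefficient used below is a hypothetical value chosen only to exercise the formula; the function names are likewise assumptions.

```python
import math


def intensity(depth, alpha, i0=1.0):
    """Intensity at a given depth from I = I_0 * exp(-alpha * depth)."""
    return i0 * math.exp(-alpha * depth)


def absorbed_fraction(thickness, alpha):
    """Fraction of incident photons absorbed within the given thickness."""
    return 1.0 - math.exp(-alpha * thickness)


alpha = 2.0e5              # hypothetical absorption coefficient, per metre
thickness = 1.0 / alpha    # the compromise thickness of one absorption length

print(f"I / I_0 at 1/alpha: {intensity(thickness, alpha):.3f}")
print(f"absorbed within 1/alpha: {absorbed_fraction(thickness, alpha):.3f}")
```

At a thickness of one absorption length the intensity has fallen to about 37 % of its surface value, so roughly 63 % of the photons are absorbed inside the detector.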






Fig. 3.1 	Physical structure of photodiode in an N-well CMOS 
process. Labelled regions: P+ diffusion, N-well, P-type substrate. 
3.2.1. Photodiode 
The most common optical transducer used in solid-state imaging is a reverse-biased 
diode. The physical structure of a photodiode fabricated in an N-well CMOS 
process is given in Fig. 3.1. Photon absorption gives rise to an optically generated 
component of diode current called the photocurrent $I_{ph}$. To take this into account, 
the diode characteristic equation is modified to 
$$I_D = I_S \left[ e^{q V_D / kT} - 1 \right] - I_{ph}$$
The diode characteristic curves under different illumination levels are given in 
Fig. 3.2. With increasing light intensity, the diode characteristic curve is translated 
down the current axis. Under reverse bias of a few hundred millivolts, the diode 
current is essentially independent of the bias voltage $V_D$; its level is set by the 
photocurrent $I_{ph}$. 
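The claim that the reverse-biased diode current is set by the photocurrent can be illustrated with the standard diode equation plus a photocurrent term. The saturation current, photocurrent and thermal voltage below are assumed values, and the names `i_s` and `i_ph` are labels chosen for this sketch.

```python
import math


def diode_current(v_d, i_s, i_ph, u_t=0.025):
    """Diode current I_S * (exp(V_D / U_T) - 1) - I_ph, with U_T = kT/q."""
    return i_s * (math.exp(v_d / u_t) - 1.0) - i_ph


i_s = 1e-15    # hypothetical saturation current in amperes
i_ph = 1e-9    # hypothetical photocurrent in amperes

for v_d in (-0.3, -0.5, -1.0):
    print(f"V_D = {v_d:+.1f} V: I_D = {diode_current(v_d, i_s, i_ph):.6e} A")
```

The current barely changes between the three bias points and stays at the photocurrent level, offset only by the very small saturation current.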
Fig. 3.2 	Diode characteristic curves under different illumination 
levels. 
Blooming 
When an array of diodes is used as photosensors, the diodes are usually operated in a 
reverse-bias mode whereby they are first precharged to some initial voltage 
and then isolated. Light incident on the diodes will generate photocurrents that will 
cause the isolated charge packets to "leak" away at a rate proportional to incident 
intensity. If the photocurrents are integrated over some fixed time interval, then 
the change $\Delta V_i$ in the node voltage of a particular diode from its initial value will 
be proportional to the incident intensity at that location. 
If the diodes are subjected to overlighting, they may be fully discharged within the 
integration period with the result that an electric field no longer exists across the 
diode to sweep up the photo-generated carriers. The excess carriers will accumulate 
and diffuse outwards, giving rise to a phenomenon known as "blooming". 
Blooming occurs when a bright spot of light saturates not only the sensors on which 
it is incident, but the surrounding sensor sites as well. It is caused by lateral 
diffusion of excess carriers through the bulk into the electric field of neighbouring 
sensor sites [74]. These excess carriers will recombine with the isolated charge on 
neighbouring sites until they have all been absorbed. Thus, blooming will cause 
circular smearing of a bright spot of light on the diode array. 
In addition to circular blooming, a more objectionable form of blooming may occur 
when isolated photodiodes are used in the integration mode. This phenomenon, 
known as column blooming, takes place when a photodiode is driven into a 
photovoltaic mode of operation (region A of Fig. 3.2) by an intense spot of light. 
The photo-voltage generated may be sufficient to partially turn on the transistor 
switch connecting that photodiode to the bit-line common to that column of 
photodiodes. An excess of charge is therefore spilt onto that bit-line, overwhelming 
the output of other photodiodes in that column. As a result, column blooming 
produces white vertical streaks in the captured image. 
Blooming can essentially be eliminated if the photo-generated carriers are prevented 
from accumulating. From a design standpoint, this can be accomplished by maintaining 
an electric field across a reverse-biased photodiode to produce a continuously 
flowing photocurrent. The magnitude of the photocurrent will be proportional to 
the incident intensity. 
3.3. Enhancements 
To investigate possible ways in which the limitations of charge-based sensors may be 
overcome, some lessons may be drawn from the way biological vision systems work 
in nature. The human eye, for example, responds to intensity ratios rather than 
absolute values of intensity [29]. When the ambient illumination level is raised, 
photochemical mechanisms in the retina somehow reduce the sensitivity of the 
receptor cells to keep their response in range with prevailing light conditions [90]. 
To cope with the wide dynamic range in intensity ratios in a typical scene, the 
retina also performs a logarithmic compression on the incident intensity variation. 
In this section, a "receptor" cell with the unique ability to tune its response to 
prevailing light conditions is developed. Such a cell has a logarithmic response and 
can be tessellated to form an image plane for an "electronic eye". 
3.3.1. Logarithmic Response Sensor 
Over a typical outdoor visual scene, the illumination level varies over at least six 
orders of magnitude. This intensity pattern can be converted by a two-dimensional 
array of photodiodes into corresponding values of photocurrents spanning a similar 
range. In order to compress this contrast range onto a reduced scale, a transducer 
with a logarithmic transfer characteristic is employed. 
A MOS transistor operated in weak inversion exhibits a gate voltage that is 
logarithmically related to the value of its drain current. This behaviour has been 
exploited in the design of logarithmic photosensors which use a MOS transistor as a 
weak inversion load [8, 53] to detect the photocurrent generated by a photodiode. 
The diode-connected transistor load performs a logarithmic compression on the 
photocurrent by converting the photocurrent into a corresponding gate voltage. A 
drawback of logarithmic sensors is that details in an image corresponding to small 
variations in intensity may no longer be visible at the output of the sensors. The 
compression further reduces these intensity differences so that such details are no 
longer discernible. However, the features of interest in a scene, such as the edges 
of objects, usually show up as sharp, abrupt variations in intensity that can easily be 
discerned by logarithmic sensors. 
Implementation 
The translinear behaviour of a MOS transistor in weak inversion can be exploited in 
the design of a logarithmic photosensor. The photocurrent generated by a 
photodiode is passed through a MOS transistor used as a weak inversion load. Such 
a scheme is shown in Fig. 3.3a. The photocurrent flowing through the diode-
connected transistor is logarithmically converted into a corresponding value of gate 
voltage. Ignoring body effect for the moment, the drain current $I_{DS}$ of a transistor 
in weak inversion is given by Eqn. 2.11 as 
$$I_{DS} = I_{D0}\, e^{V_{GS}/(m U_T)} \left(1 - e^{-V_{DS}/U_T}\right)$$







Fig. 3.3a Diode-connected MOS transistor used as a weak inversion load. 
Fig. 3.3b Series-connected load to increase the gradient of $V_0$ with photocurrent. 
$I_{DS}$ increases rapidly with small values of $V_{DS}$ and saturates within a few $U_T$ 
of $V_{DS}$. At room temperature, $U_T$ is approximately 25 mV and a diode-connected 
transistor is guaranteed to be in saturation since its drain-source voltage is at least a 
few hundred millivolts. For every decade increase in photocurrent, $V_0 = V_{GS}$ 
increases by about 100 mV for typical values of m. Corresponding to a zero-bias 
threshold voltage $V_{T0}$ of about 0.9 V, the voltage swing $V_0$ at the gate of a 
transistor in weak inversion varies approximately from 0.3 V to 0.8 V. 
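The figure of about 100 mV per decade can be checked with a short calculation. The sketch below assumes the saturated weak inversion relation takes the usual form $V_{GS} = m U_T \ln(I/I_{D0})$; the values of m and $I_{D0}$ are hypothetical and chosen only for illustration, not taken from the process used here.

```python
import math

U_T = 0.025  # thermal voltage at room temperature in volts (value quoted in the text)

def gate_voltage(i_photo, i_d0, m, u_t=U_T):
    """Gate voltage of a saturated diode-connected transistor in weak inversion.

    Assumes V_GS = m * U_T * ln(I / I_D0), i.e. the (1 - exp(-V_DS/U_T)) term is ~1.
    """
    return m * u_t * math.log(i_photo / i_d0)

def volts_per_decade(m, u_t=U_T):
    """Increase in gate voltage for a tenfold increase in photocurrent."""
    return m * u_t * math.log(10.0)

# Hypothetical parameters, for illustration only.
M_ASSUMED = 1.7
I_D0_ASSUMED = 1e-15  # amperes

slope = volts_per_decade(M_ASSUMED)
step = gate_voltage(1e-10, I_D0_ASSUMED, M_ASSUMED) - gate_voltage(1e-11, I_D0_ASSUMED, M_ASSUMED)
print(f"{slope * 1e3:.1f} mV per decade")
```

With m around 1.7 the slope comes out close to the 100 mV per decade quoted above; a different m scales the slope proportionally.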
In order to provide a larger voltage swing at the output $V_0$, a second transistor is 
added in series to the load as shown in Fig. 3.3b. Each transistor now contributes 
about 100 mV in gate-source voltage for every decade increase in photocurrent. 
The gradient of $V_0$ can now be expected to double to approximately 200 mV per 
decade of photocurrent. However, due to body effect in the upper transistor as its 
source voltage increases, the gradient of $V_0$ is nearer to 300 mV per decade. By 
using two MOS transistors in series, the voltage swing $V_0$ can be translated upwards 
into the 1.0 V to 2.4 V range, with both transistors still operating in weak 
inversion. 
Note that the transistors will remain in weak inversion only if the photocurrent is 
kept below the value $\beta U_T^2$ (refer to Eqn. 2.13). This limits the photocurrent to 
values below 0.1 µA for small devices in a 2 to 3 micron process. With any 
increase in photocurrent above 0.1 µA, $V_0$ will increase rapidly from 2.4 V to close 
to the $V_{DD}$ rail as the transistor load departs from weak inversion operation. This 
corresponds to a further order of magnitude increase in incident intensity over and 
above the weak inversion range. 
Fig. 3.4 	Isolation of photodiode from load to counteract blooming. 
3.3.2. Blooming Suppression 
To counteract the phenomenon of blooming, we need to prevent the accumulation 
of excess photo-generated carriers by maintaining an electric field across a 
reverse-biased diode. Consider once again the pixel cell shown in Fig. 3.3b. If the pixel is 
close to saturation, then $V_0$ rises towards $V_{DD}$ and the bias across the diode 
approaches 0 V. Blooming will then start to take place. To prevent blooming from 
occurring, the diode can be isolated from the weak inversion load by a current 
mirror. This is shown in Fig. 3.4. Note that the rate of increase of $V_0$ with 
photocurrent is more than twice the rate at which the bias voltage $V_i$ across the 
diode is decreasing. Therefore, even if $V_0$ approaches $V_{DD}$, $V_i$ would still have a 








Fig. 3.5 	Ideal transfer characteristic of a logarithmic sensor. 
3.3.3. Automatic Gain Control 
For an imager to respond to intensity ratios rather than absolute intensity values, 
some form of automatic gain control (AGC) over the generated photocurrents must 
be provided at each and every sensor. Suppose that a sensor exhibits an ideal 
logarithmic response as given in Fig. 3.5. If the ambient lighting of the scene being 
captured is above the saturation intensity value, then no useful output will be 
produced. This corresponds to the Part B region of the characteristic shown in 
Fig. 3.5. To get some meaningful output, the sensitivity of the sensor needs to be 
lowered so that the same intensity pattern produces a range of output centred in the 
Part A region of the characteristic. This can be done by scaling the photocurrent $I_{in}$ 
by some appropriate factor to give a new output current $I_{out}$. The amount of scaling 
required would be determined by the prevailing light level and the value of current 
$I_{ref}$ corresponding to the mid-point of the Part A region of the sensor characteristic. 
Some measure of the ambient light level is therefore required and suppose this is 
obtained in the form of a current measurement $I_{av}$. The compensated output $I_{out}$ is 
$$I_{out} = \frac{I_{ref}}{I_{av}}\, I_{in}$$




In general, any form of AGC mechanism must feature some form of closed-loop 
control. To implement the AGC mechanism, a measure of the prevailing light level 
must be obtained to give an indication of the background brightness of the image to 
be captured. This measured value of background brightness is fed back to the 
sensors so that their output can be compensated accordingly. In order to 
manipulate signals exclusively using currents, an estimate of the background 
brightness is obtained in the form of a current measurement $I_{av}$. 
For $I_{av}$ to be a representative measure of the image as a whole, the photocurrents 
generated by all the sensors are aggregated and an average derived. This global 
average $I_{av}$ is obtained using the ensemble-averaging circuit given in Fig. 2.3. $I_{av}$ is 
then used as an input to the product-quotient circuit given in Fig. 2.6 to provide 
the variable gain of the AGC mechanism. The magnitude of the current gain is 
given by a ratio of two currents, $I_{ref}/I_{av}$. $I_{ref}$ is a constant that is set externally and 
its value is determined by the characteristic of the weak inversion load. 
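The effect of the global AGC can be illustrated numerically. The following is a minimal behavioural sketch, not a model of the circuits in Fig. 2.3 and Fig. 2.6: it forms the ensemble average of a small frame of photocurrents and scales every photocurrent by the ratio of the reference current to that average. The frame values and the reference current are hypothetical.

```python
def global_average(photocurrents):
    """Ensemble average of the photocurrents over every sensor in the array."""
    flat = [i for row in photocurrents for i in row]
    return sum(flat) / len(flat)

def agc_scale(photocurrents, i_ref):
    """Scale each photocurrent by the gain i_ref / i_av."""
    gain = i_ref / global_average(photocurrents)
    return [[gain * i for i in row] for row in photocurrents]

# Hypothetical 2 x 2 frame of photocurrents in amperes, all above the assumed
# weak inversion limit, and a hypothetical reference current below it.
frame = [[2e-7, 4e-7], [6e-7, 8e-7]]
scaled = agc_scale(frame, i_ref=1e-8)
print(scaled)
```

After scaling, the average of the frame equals the reference current while the ratios between pixels are unchanged, which is the property required for the imager to respond to intensity ratios.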
Note that when a global reference is used to establish intensity ratios, then features 
in very bright or dark patches may be lost. Ideally, a more localised reference is 
required so that local intensity ratios can be obtained. This local reference or 
average may be derived by considering the immediate neighbourhood surrounding a 
particular point in the image. The local reference is then the weighted sum of the 
intensity values in the surrounding neighbouring points, with points further away 
accorded less weight than those in close proximity to the point where the average 
will be applied. However, from an implementational point of view, obtaining a 
series of local references is made impractical not by active device considerations but 
by interconnect requirements. 
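Although a local reference is not pursued in hardware, the computation it implies is simple to state. The sketch below is one possible reading of the description above: the reference at a point is a normalised weighted sum of its neighbours, with weights falling off as the inverse square of distance. Both the normalisation and the choice of weighting function are assumptions made for illustration.

```python
def local_reference(image, row, col, radius=1):
    """Weighted average of the neighbours of (row, col), excluding the point itself.

    Weights fall off as 1 / distance^2 (an assumed weighting). The point must
    have at least one neighbour inside the image.
    """
    total = 0.0
    weight_sum = 0.0
    for r in range(max(0, row - radius), min(len(image), row + radius + 1)):
        for c in range(max(0, col - radius), min(len(image[0]), col + radius + 1)):
            if (r, c) == (row, col):
                continue  # only the surrounding points contribute
            weight = 1.0 / ((r - row) ** 2 + (c - col) ** 2)
            total += weight * image[r][c]
            weight_sum += weight
    return total / weight_sum

uniform = [[5.0] * 3 for _ in range(3)]
edge_lit = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
corner_lit = [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
print(local_reference(uniform, 1, 1))
```

Every pixel needs connections to all neighbours within the chosen radius, which makes the interconnect cost referred to above explicit.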










Fig. 3.6 Functional diagram of a receptor cell. 
3.4. Design of Receptor Cell 
Analogue VLSI computation techniques will be used in this section to design a novel 
"receptor" cell with features to overcome the shortcomings of commercial 
charge-based imagers, namely insufficient dynamic range, blooming and susceptibility to 
saturation. Such a receptor cell can then be tessellated to form the image plane of a 
"smart" imager. A functional diagram of the receptor cell is given in Fig. 3.6, 
comprising the following elements: 





• 	The average or mean value of light intensity in an image frame 
is obtained as a current measurement $I_{av}$ and updated from 
frame to frame using a current-averaging function. 
• 	The average value of light intensity, as represented by $I_{av}$, is 
used as an input to an automatic gain control function that 
scales the photocurrent $I_{in}$ by some factor so as not to saturate 
the output load. 
• 	A load with a logarithmic transfer characteristic is used to 
convert the scaled photocurrent $I_{out}$ into a corresponding voltage 
$V_0$. 
The circuit design for the receptor cell is shown in Fig. 3.7 [38]. The receptor 
cell can be tessellated to form the image plane of an "electronic eye". The 
photocurrents generated by the array of photodiodes are aggregated by a current 
ensemble-averaging circuit (refer to Fig. 2.3) producing an average value $I_{av}$ that is 
fed into an automatic gain control (AGC) circuit (refer to Fig. 2.6) by a current 
mirror. The constant $I_{ref}$ is brought into the AGC circuit by another current 
mirror. $I_{ref}$ is set externally and its value is determined by the characteristic of the 
output load. The AGC circuit scales the photocurrent $I_{in}$ by a factor $I_{ref}/I_{av}$ to 
provide a compensated current output $I_{out}$. The compensated output $I_{out}$ is then passed 
through a weak inversion load to effect the logarithmic compression of $I_{out}$ into an 
output voltage $V_0$. $V_0$ is the output signal that is sensed by the I/O circuitry in the 
imager. 
3.4.1. Transient Response of Photoreceptor 
The transient response of a pixel to sudden changes in intensity is dominated by the 
charging and discharging of capacitance C associated with the node $V_0$. Consider 
the case when a spot of light incident on a photoreceptor cell is suddenly removed. 
Since the value of photocurrent generated is now insufficiently large to maintain the 
initial value of $V_0 = V_0(0)$, capacitance C is discharged by the current $I_0$ through 
the weak inversion load. 

















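From Eqn. 2.11, $I_0$ can also be expressed as 
$$I_0 = I_{D0}\, e^{V_0/(\alpha U_T)} \qquad (3.2)^{\dagger}$$
Differentiating with respect to time, and noting that the discharge of C through the 
load requires $C\,\frac{dV_0}{dt} = -I_0$, 
$$\frac{dI_0}{dt} = \frac{I_0}{\alpha U_T}\,\frac{dV_0}{dt} = -\frac{I_0^2}{C \alpha U_T}$$
$$-\frac{1}{I_0^2}\,\frac{dI_0}{dt} = \frac{1}{C \alpha U_T}$$
$$\frac{d}{dt}\left(\frac{1}{I_0}\right) = \frac{1}{C \alpha U_T}$$
$$\int \frac{d}{dt}\left(\frac{1}{I_0}\right) dt = \frac{1}{C \alpha U_T} \int dt$$
Solving, 
$$\frac{1}{I_0(t)} = \frac{1}{I_0(0)} + \frac{t}{C \alpha U_T}$$
$$I_0(t) = \frac{I_0(0)}{1 + \frac{I_0(0)\, t}{C \alpha U_T}} = \frac{I_0(0)}{1 + \frac{t}{T_0}}$$
where $I_0(0)$ is the initial value of $I_0$ at time t = 0 and 
$$T_0 = \frac{C \alpha U_T}{I_0(0)} \qquad (3.3)$$
From Eqn. 3.2, 
$$V_0(t) = \alpha U_T \left(\ln I_0(t) - \ln I_{D0}\right)$$
$$= \alpha U_T \left[\ln I_0(0) - \ln\left(1 + \frac{t}{T_0}\right) - \ln I_{D0}\right]$$
$$= V_0(0) - \alpha U_T \ln\left(1 + \frac{t}{T_0}\right) \qquad (3.4)$$
† Note that α in this equation accounts for the cumulative effect of the process parameter m 
and the body effect of the upper transistor in the series-connected load (refer to § 3.3.1). 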
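The decay described by this derivation can be evaluated directly. The sketch below assumes the result takes the form $V_0(t) = V_0(0) - \alpha U_T \ln(1 + t/T_0)$ with $T_0 = C \alpha U_T / I_0(0)$, and that the load current falls as $I_0(0)/(1 + t/T_0)$. The component values are hypothetical and are not those of the fabricated cell.

```python
import math

def time_constant(c_node, alpha, u_t, i_initial):
    """T_0 = C * alpha * U_T / I_0(0)."""
    return c_node * alpha * u_t / i_initial

def current_decay(t, i_initial, t0):
    """Load current a time t after the light is removed."""
    return i_initial / (1.0 + t / t0)

def voltage_decay(t, v_initial, alpha, u_t, t0):
    """Output voltage a time t after the light is removed."""
    return v_initial - alpha * u_t * math.log(1.0 + t / t0)

# Hypothetical values: 0.1 pF node, alpha = 5, U_T = 25 mV, 10 nA initial current.
C_NODE, ALPHA, U_T, I_INIT, V_INIT = 1e-13, 5.0, 0.025, 1e-8, 0.8

t0 = time_constant(C_NODE, ALPHA, U_T, I_INIT)
for t in (0.0, t0, 10 * t0, 100 * t0):
    print(f"t = {t:.2e} s  I = {current_decay(t, I_INIT, t0):.2e} A  "
          f"V = {voltage_decay(t, V_INIT, ALPHA, U_T, t0):.3f} V")
```

The decay is hyperbolic in current and logarithmic in voltage rather than exponential, so a brighter initial condition (larger initial current) gives a shorter $T_0$ and a faster initial fall.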




3.5. Scanning Frame 
To reconstruct an image, the output $V_0$ from each and every receptor cell must be 
brought out into the outside world. With such a large number of cells to be 
accessed, the readout of the cells is accomplished in a time-multiplexed fashion like 
a serial memory [50]. The output of each cell is serially scanned out onto a single 
output line by means of two orthogonal shift registers. The rows are controlled by a 
vertical shift register and are enabled a row at a time onto "bit lines". Locations 
within the selected row are then enabled by a horizontal shift register, sequentially 
switching the bit lines onto the output line. 
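The order in which cells reach the output line can be sketched as two nested loops, one per shift register. This is a behavioural illustration only; the function names are hypothetical and no timing or circuit detail is modelled.

```python
def scan_order(n_rows, n_cols):
    """Yield (row, col) in the order in which cells are switched onto the output line."""
    for row in range(n_rows):        # vertical shift register enables one row onto the bit lines
        for col in range(n_cols):    # horizontal shift register steps along the bit lines
            yield row, col

def read_frame(cells):
    """Serialise a two-dimensional array of cell outputs onto a single output stream."""
    return [cells[r][c] for r, c in scan_order(len(cells), len(cells[0]))]

print(read_frame([[1, 2], [3, 4]]))
```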
The bit lines and output line run the length of the receptor array and are therefore 
highly capacitive. In order to reduce the RC time delays associated with voltage 
propagation along highly capacitive lines, current-steering readout as proposed by 
Sivilotti [76] is used. In current-steering readout, the output voltage $V_0$ of each cell 
is first converted into a current $I_{out}$. Ideally, this should be carried out using a 
device with a constant transconductance so that the conversion process is linear. A 
MOS transistor operated in its linear region with a small constant value of $V_{DS}$ [52] 
can be used for this purpose. $I_{out}$ is then related to $V_0$ by 
$$I_{out} = \beta\,(V_0 - V_T)\,V_{DS} \qquad (3.5)$$
The cells are switched a row at a time onto the bit lines by a pass transistor $M_2$ in 
series with the transduction transistor $M_1$ (refer to Fig. 3.8). The current $I_{out}$ 
pertaining to each cell within the row is then selectively switched onto an output 
line where it is sensed by an external amplifier. The output line is biased at some 
appropriate voltage $V_{bias}$ externally to ensure the linear operation of the transduction 
transistor $M_1$. A suitable value of $V_{bias}$ is about 1.0 V. In order to reduce the 
dependence of $V_{DS}$ of the transduction transistor $M_1$ on $I_{out}$, $\beta_2$ of the switching 
transistor $M_2$ should be made significantly bigger than $\beta_1$ of the transduction 
device. 
The voltage on the bit lines can be defined at all times by including a dummy line 
also biased at $V_{bias}$. All the bit lines not connected to the output line are connected 
to the dummy line. Off chip, the current flowing in the output line may be 
converted back into a voltage by a transimpedance amplifier that implicitly biases 
the output line at $V_{bias}$. The functional diagram of the scanning frame is given in 
Fig. 3.9. 






Fig. 3.9 Functional diagram of current-steering scanning frame. 
3.6. Results 
As a prototype for evaluating the performance of the electronic eye and scanning 
frame described in the preceding sections, a logarithmic imager chip with a 50 by 
50 pixel resolution has been fabricated in a 2 µm, dual layer metal n-well CMOS 
process. The photomicrograph of this prototype chip is shown in Fig. 3.10. 
Each pixel contains a photodiode with two series-connected MOS load transistors 
operated in weak inversion (Fig. 3.3b), together with the circuitry for current-
steering read-out (Fig. 3.8). A photomicrograph of an array of pixels is given in 




Fig. 3.11. This imaging array is designed to exhibit a logarithmic response, 
producing a current output that is logarithmic with incident light intensity in the 
visible and near infra-red parts of the spectrum. 
Fig. 3.11 Photomicrograph of pixel array. 
The following characteristics of the imaging array have been evaluated: 
• 	Sensitivity response to different light levels. By focusing a light 
spot at different pixel sites in the imager array, the uniformity of 
output response in the x and y direction of the imaging plane 
may also be gauged. 
• 	Spectral response to wavelengths in the visible spectrum. 
• 	Transient response measured with the sudden removal of 
illumination. 




• 	Blooming or spatial crosstalk effects that may be produced in 
neighbouring sites when a pixel is illuminated above the 
saturation intensity value. 
The relevant equations describing the operation of this prototype logarithmic imager 
are included in Appendix B. 
3.6.1. Experimental Evaluation 
The current output from the imager is fed to a transimpedance amplifier to convert 
the current to a voltage to drive an oscilloscope. In principle, this could be done by 
a simple operational amplifier circuit (op-amp) such as that shown in Fig. 3.12a. 
Op-amps however, tend to suffer from limited bandwidth, slew rate and slow 
settling time when the signal at an op-amp input is switched. For this reason, the 
bipolar transistor circuit given in Fig. 3.12b is used instead. Fig. 3.13 shows the 
board constructed to test the prototype imager chip. 
Fig. 3.12b Bipolar transistor transimpedance amplifier. 
3.6.2. Sensitivity Response 
The sensitivity measurement is carried out at a particular wavelength set by the laser 
source used. The experimental setup is outlined below. 
Experimental Setup 
A photograph of the experimental setup is given in Fig. 3.14. The light source 
used is a helium-neon laser with a wavelength λ of 632.8 nm. The laser beam is 
passed through a polarising filter that can be rotated to vary the intensity of the 
transmitted beam. In the path of the transmitted beam is placed a variable speed, 
motor-driven "windmill" chopper that can be rotated to produce a pulsed light 
beam. This facility is required to measure the transient response of a pixel. 
The intensity of the transmitted beam is measured using a calibrated radiometer. 
The radiometer readings can then be normalised to give absolute values of intensity 




Fig. 3.13 	Prototype test board. 
Results 
The sensitivity response is measured at 5 different pixel sites located approximately 
at the four corners and centre of the imaging area. Fig. 3.15 gives a superimposed 
plot of the response of all 5 pixel sites, from which we can discern a maximum 
variation in output current of about 7 µA over the 5 sites. From this value of 
output current variation, an estimate of the mismatch in zero-bias threshold voltage 
$\Delta V_{T0}$ over the 5 sites may be made. This value of $\Delta V_{T0}$ is calculated to be 17.5 
mV from experimental data in Appendix B. 
The sensitivity response of the pixel in the region near the centre of the imaging 
array is given in Fig. 3.16. The reference intensity in Fig. 3.16 corresponds to 
7.12 W/cm², incident on a photodiode with an active area of 1045 µm². The pixel 
has a measured dynamic range in incident light intensity of about 6 orders of 
magnitude. Over the logarithmic part of the plot, the output current changes at a 




Fig. 3.14 	Experimental setup for sensitivity & transient response 
measurements. 
Mounted on the optical bench are (left to right): laser, calibrated rotatable 
polarising filter, fixed polarising filter, chopper wheel & motor, aperture, 
beam-splitter & calibrated radiometer, lens, test board with imager. 
3.6.3. Spectral Response 
To measure the spectral response, a prism monochromator is used to select different 
wavelengths in the visible spectrum. With a calibrated radiometer, the intensity at 
all the wavelengths is equalised to a convenient value by adjusting the size of a slit 
placed in front of a white light source. With this level of intensity fixed, the input 











2 W/cm 2 . 
Fig. 3.15 Graph of sensitivity response (over 5 sites). 














A plot of the spectral response of a typical pixel is given in Fig. 3.17. The output 
of the pixel at different wavelengths is normalised relative to the peak response at 
λ = 560 nm to produce a relative spectral response plot. From its sensitivity and 
spectral plot, the responsivity of the pixel at the peak wavelength is found to be 
0.484 A/W. 
3.6.4. Blooming or Spatial Crosstalk 
Investigating the effects of blooming or spatial crosstalk would ideally require a 
fine spot of light illuminating an area no larger than that occupied by a single 
pixel. This spot of light, at near saturation intensity, is focused on a particular pixel 
in the imaging array and its effect on neighbouring cells is examined. 
Due to the nature of the readout circuitry, the effects of blooming can be observed 
externally only in the y dimension between rows of pixels. It is not possible to 
observe directly the crosstalk between columns of pixels in the x dimension. The 
symmetrical tessellation of the pixel cells, however, ensures that any spatial crosstalk 
present would be of comparable magnitude in both the x and y dimension. 
Result 
To perform this test, the lens system of a microscope is used to focus a spot of white 
light onto the imaging array. The smallest spot size that can be achieved by this 
arrangement illuminates an area covering a 2 by 2 array of pixels. When the 
focused beam of light hits the silicon surface however, some scattering of light 
occurs over a small area. This scattering effect is visible to the naked eye and will 
result in pixels in the vicinity of the illuminated area giving an output response 
above the dark level. The output response of the imaging array can be seen in the 
oscilloscope trace given in Fig. 3.18. 
The two sharp spikes on the trace correspond to the two illuminated pixels on 
adjacent rows. Blooming effects can therefore be seen to be negligible in this form 
of current-based imaging array. 





















Fig. 3.18 Oscilloscope trace of the output response of the imaging array. 
Each repeated notch represents 1 scan line of output. 2 pixels are directly 
illuminated (notches 5 & 6) with output response at 5.5 V below the dark level. 4 
neighbouring lines above (trace notches to the left) and 5 below (trace notches to 
the right) exhibit some response to scattered light at about 2.0 V below the dark 
level. This represents a reduction in intensity by a factor of 135 relative to the 
illuminated pixels. The remaining lines exhibit no measurable response. 
3.6.5. Transient Response 
The simplest experiment to evaluate the transient response of a pixel is to observe its 
response to a rectangular light pulse. The experimental setup used is identical to 
that presented in § 3.6.2, with the helium-neon laser beam threaded through a 
windmill-like wheel with spokes and the beam "chopped" by rotating the wheel. 
From Eqn. B.1 in Appendix B, the output voltage of a pixel is given by 
$$V_{out} = \frac{\beta\,(V_{ph} - V_T)\,V_{bias}}{\gamma}\, R_L$$
where $\beta$ and $V_T$ are the transfer parameter and threshold voltage of the 
transduction transistor, $V_{bias}$ the external bias drain voltage, $R_L$ the load resistor and 
$\gamma$ a constant with value $1 < \gamma < 2$. $V_{ph}$ is the photo-voltage generated by the 
logarithmic sensor. Its response in time with the sudden removal of illumination is 
given by Eqn. 3.4 as 
$$V_{ph}(t) = V_{ph}(0) - \alpha U_T \ln\left(1 + \frac{t}{T_0}\right) \qquad (3.6)$$
$$T_0 = \frac{C \alpha U_T}{I_{ph}(0)}$$
where $T_0$ is a time constant dependent on the capacitance of the photo-sensitive 
node C, the value of photocurrent flowing at the instant of light removal $I_{ph}(0)$ and 
α, a constant parameter of the logarithmic sensor. 
Result 
Fig. 3.19 gives a trace of the response of the imaging array to a train of rectangular 
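light pulses. From this trace, the decrease in output voltage $\Delta V_{out}$ is measured at 
4.5 V over a frame time of 1.04 ms. The change in photo-voltage over this time is 
$$\Delta V_{ph} = \frac{\gamma}{\beta V_{bias}} \cdot \frac{\Delta V_{out}}{R_L}$$
Substituting in the appropriate values, 
$$\Delta V_{ph} = \frac{1.8}{40\ \mu\mathrm{A} \times 5} \times \frac{4.5}{55.7\ \mathrm{k}\Omega} = 0.73\ \mathrm{V}$$
From Eqn. 3.6, 
$$\Delta V_{ph} = \alpha U_T \ln\left(1 + \frac{t}{T_0}\right)$$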
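The substitution above can be reproduced numerically, and the logarithmic decay law can be inverted to give the time constant. The sketch below assumes that the change in photo-voltage is $\gamma \Delta V_{out} / (\beta V_{bias} R_L)$ and that the decay follows $\Delta V_{ph} = \alpha U_T \ln(1 + t/T_0)$. The product $\beta V_{bias}$ is entered as the printed figures 40 µA × 5 without interpreting the individual factors. The value used for $\alpha U_T$ is a hypothetical one inferred from the gradient of roughly 300 mV per decade quoted in § 3.3.1, so the resulting $T_0$ is illustrative only.

```python
import math

def delta_v_ph(gamma, beta_v_bias, delta_v_out, r_load):
    """Change in photo-voltage implied by a change in output voltage."""
    return (gamma / beta_v_bias) * (delta_v_out / r_load)

def time_constant_from_decay(delta_v, t, alpha_u_t):
    """Invert delta_v = alpha_u_t * ln(1 + t / T_0) for T_0."""
    return t / math.expm1(delta_v / alpha_u_t)

# Figures as printed in the substitution: gamma = 1.8, 40 uA x 5, 4.5 V, 55.7 kOhm.
dv = delta_v_ph(1.8, 40e-6 * 5, 4.5, 55.7e3)
print(f"delta V_ph = {dv:.2f} V")

# Hypothetical alpha * U_T, taken from an assumed 300 mV per decade gradient.
ALPHA_U_T_ASSUMED = 0.3 / math.log(10.0)
t0_est = time_constant_from_decay(dv, 1.04e-3, ALPHA_U_T_ASSUMED)
```

Because $T_0$ depends exponentially on $\Delta V_{ph} / (\alpha U_T)$, the estimate is very sensitive to the value taken for $\alpha U_T$.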




Fig. 3.19 Oscilloscope trace of the response of imaging array to a 
rectangular light pulse. 
The top trace displays the output from 10 consecutive frames. The lower trace 
displays the duration of chopped beam illumination. Frames 3, 4, & 5 followed by 
8 & 9 display a response at 5.5 V below the dark level to the chopped beam 
illumination. Pixels illuminated at background level display a response at around 1 
volt below the dark level. Note the complete recovery between frames. 
3.6.6. Summary of Results 
The results of this evaluation, summarised in Table 3.1, demonstrate the viability 
of developing smart vision sensors based on the design principles of analogue 
VLSI current computation. 
Parameter                      Typical Value    Units 
Dynamic range                  60               dB 
Spectral response              400 - 1000       nm 
Variation in $V_{T0}$              17.5             mV 
Blooming suppression factor    135              - 
Pixel time constant            1.8 
Maximum frame rate             1                ms 
Table 3.1 Summary of performance characteristics. 
3.7. Summary 
A novel "electronic eye" has been presented to illustrate the application of analogue 
current computational circuits in "smart" vision sensors. The "eye" is an imager 
with built-in nonlinear correction to keep the operating range of its sensors in 
register with ambient light conditions. The computational requirement of this "eye" 
is extremely demanding, requiring some form of nonlinear computation to be 
carried out on each individual pixel value, in real-time, at the focal plane. As an 
example, for a design with a resolution of 100 by 100 receptor cells, $10^4$ additions, 
multiplications and divisions are required to be computed per frame time. For 
typical frame rates of 200 frames/sec, this is equivalent to a computational 
throughput of 2 million additions, multiplications and divisions per second. The 
only tractable solution to such demanding throughput is to use analogue VLSI 
techniques that operate directly on sensed data at each sensor site. This integration 
of sensor and signal conditioning function onto a single element forms the basis of 
"smart" sensors. 




In bit-serial computation, signals are transmitted via single wires and pins within 
and between the boundaries of computational cores. As a result, computational 
functions realised bit-serially are characterised by low device pin-counts and may be 
packaged using standard, low-cost packaging technology. This feature endows bit-
serial computation with one of its key strengths, namely that of efficient 
communication. 
Data words are manipulated a bit at a time in bit-serial computation. As a 
consequence, bit-serial elements or operators are proportionately smaller in area and 
slower in operation than their bit-parallel counterparts; measured in terms of 
operations per second per unit area, the two are roughly equivalent. The smaller 
grain size of bit-serial operators however, gives the bit-serial approach an edge in 
terms of overall system computational efficiency. This is achieved through the use 
of functional parallelism [65], whereby arrays of serialised computational cores are 
used concurrently to boost overall system throughput. The finer grain size of the 
bit-serial operators allows a rich mixture of hardware operators that directly reflects 
the flow-graph of a computation. This functional approach can be viewed as 
casting a portion of the algorithm directly into hardware, with the attendant saving 
in control overhead [15]. 
At the system level of design, this hard-wired computational core may be regarded 
as a functional processor and treated as a black box, with all its low level 
implementational details hidden from the system designer. The control overhead at 
this functional processor level is minimal, concerned mainly with the selection of 
data sources at the I/O ports of the functional black boxes. The coarser granularity 
of bit-parallel operators, in contrast, makes the tailoring of hardware resources to 
an algorithm less flexible. As a result, the hardware resources employed may not 
directly reflect computational requirements of the algorithm. A common solution is 
to time-multiplex the operation of a fast bit-parallel ALU to compute an algorithm, 




Although the individual functional processors accept data words in a bit-serial 
fashion, multiple data words may be presented in parallel to an array of processors 
to give a proportionate increase in system throughput. The use of functional 
parallelism to form bit-serial but word-parallel architectures offers distinct 
advantages when fixed-function processors are realised in VLSI: 
• 	By selecting an appropriate number of processors in the array, 
hardware resources can be matched to the application bandwidth 
at hand. Bandwidth matching is an important practical 
consideration if the cost, size and power dissipation of the 
system hardware are to be optimised. The number of processors 
employed in the array reflects the degree of parallelism in the 
implementation. This flexibility in the degree of parallelism 
allows a range of sample rates to be easily accommodated. 
• 	There is a direct trade-off between the degree of parallelism 
employed and the performance requirements of the individual 
functional processors. The individual processors may be of 
modest performance but system throughput can be boosted by 
the overall performance of a number of such processors. The 
modest bit rates of the processors obviate the need for expensive 
interconnection requirements and help reduce overall system 
complexity and cost. 
• 	The functional processors within the array are identical and 
accept the same set of control signals. This set of control signals 
is generated once and distributed globally to the array. The 
control overhead incurred therefore remains constant 
irrespective of the degree of parallelism in the array. Since data 
words are processed bit-serially, signal words of arbitrary 
precision can be accommodated on the same hardware platform 
by changing the word-length control marker. This flexibility 
allows system throughput to be traded against precision of signal 
representation within a system. 
The general architectural form exploiting bit-serial functional parallelism is given in 
Fig. 4.1. At the I/O interface, "corner turning" memory is used to effect data 
conversion between bit-parallel, word-serial and word-parallel, bit-serial format. 
With a sufficiently high degree of parallelism, the sample rate can be made to 
exceed even the bit-rate of the functional processors [66]. If this is the case, then 
only the I/O part of the system hardware is required to be operated at the higher 
sample rate. 
Fig. 4.1 	General architectural form exploiting functional parallelism. The input 
x(n) enters "corner turning" memory for data format conversion, operated at the 
input sampling rate, which feeds an array of functional processors driven by 
common control counters. 
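The format conversion performed by corner turning memory amounts to transposing a matrix of bits. The following sketch illustrates the idea in software under the assumption of unsigned words transmitted least significant bit first; it is not a description of the memory organisation itself.

```python
def corner_turn(words, n_bits):
    """Convert word-serial, bit-parallel samples to word-parallel, bit-serial form.

    Returns n_bits time slots; slot k holds bit k (LSB first, an assumption)
    of every word, one bit per functional processor.
    """
    return [[(w >> bit) & 1 for w in words] for bit in range(n_bits)]

def corner_turn_back(slots):
    """Reassemble words from word-parallel, bit-serial time slots."""
    n_words = len(slots[0])
    return [sum(slots[bit][k] << bit for bit in range(len(slots))) for k in range(n_words)]

slots = corner_turn([5, 3], n_bits=3)
print(slots)
print(corner_turn_back(slots))
```

Each inner list is what the array of processors sees in one bit-time, so the bit-rate of each processor is independent of the number of words presented in parallel.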




4.2. Rudiments of Bit-Serial Computation 
A signal is a measurement of some physical effect and is often represented by an 
electrical analogue such as a time-varying voltage or current value. To obtain a 
digital representation, the signal is sampled at equally spaced intervals of time and 
the sampled values digitised to a resolution of n bits. The signal is now represented 
as a sequence of n-bit numbers, with a sample-index associated with each number to 
correspond to the time instant at which the sample is obtained. Within each 
number, a different weight is assigned to each bit. The weights are in powers of 2 
and the weight of each bit is denoted by a bit-index. A binary-coded signal may 
therefore be viewed as a two-dimensional array of bits, with one dimension indexed 
by the weights of the bits $w_i$ and the other, by a time-index giving the sample 
instant. 
$$\begin{bmatrix} a_{0,n-1} & a_{0,n-2} & a_{0,n-3} & \cdots & a_{0,0} \\ a_{1,n-1} & a_{1,n-2} & \cdots & \cdots & a_{1,0} \\ \vdots & & & & \vdots \\ a_{m,n-1} & a_{m,n-2} & \cdots & \cdots & a_{m,0} \end{bmatrix}$$
Binary representation of a signal with n-bit precision. 
There are two main conventions for manipulating this two-dimensional array of bits 
representing the signal values: the n-bit sample values may be handled either in a 
bit-parallel or bit-serial fashion. 
4.2.1. Bit-Parallel versus Bit-Serial Signal Representation 
An n-bit binary-coded signal may be conveniently viewed as a matrix of bits. In a 
bit-parallel representation, this matrix of bits is arranged as a linear array of n-bit 
words indexed by sample instants. The different weights within an n-bit word are 
spatially distributed across the n-bits and are implicitly associated with the position 
of the bits within the word. 
a0 a1 a2 a3 	a,, 1 
Bit-parallel representation of a binary signal. 
In contrast, in a bit-serial representation, the bits in the matrix are strung out as a 
linear array of bits so that both the weights and sample instants are distributed in 
time. The weights may be distributed in time in ascending order, least significant bit 
(LSB) first, or in descending order, most significant bit (MSB) first. 
a0 , 0 _2 	a0 , 0 a1 ,_3 	a1 ,0 
Bit-serial representation of a binary signal. 
From an implementational viewpoint, signal manipulation in bit-parallel form 
requires the replication of computation and communication hardware n times to 
accommodate the different weights in an n-bit word. In a bit-serial realisation, a 
single hardware unit is time-multiplexed to handle the signal a bit at a time. This 
contrast in computational form between the two approaches is summarised in 
Fig. 4.2. 
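As an illustration of a single hardware unit being time-multiplexed over the bits of a word, the sketch below models a bit-serial adder: one full-adder cell is reused every bit-time, with the carry held between bit-times. It assumes unsigned operands presented least significant bit first and discards the final carry, so the sum is taken modulo 2^n. The adder is chosen here only as a familiar example.

```python
def to_serial(value, n_bits):
    """Bits of an unsigned value, least significant bit first."""
    return [(value >> i) & 1 for i in range(n_bits)]

def from_serial(bits):
    """Unsigned value of an LSB-first bit list."""
    return sum(b << i for i, b in enumerate(bits))

def serial_add(a_bits, b_bits):
    """Add two LSB-first bit streams with one full adder and a stored carry."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):   # one iteration per bit-time
        total = a + b + carry
        out.append(total & 1)          # sum bit leaves the operator
        carry = total >> 1             # carry is held for the next bit-time
    return out

print(from_serial(serial_add(to_serial(5, 4), to_serial(6, 4))))
```

A bit-parallel adder would instead replicate the full-adder cell n times and produce all sum bits at once, which is the contrast drawn above.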
4.2.2. Control of Bit-Serial Networks 
In bit-serial computation, the signal appears as a continuous stream of bits and 
markers are required to delineate the boundary between sample words. The control 
of bit-serial systems therefore consists essentially of a hierarchy of such word 
markers. At a particular level of the hierarchy, a control signal is asserted true 
during the first time-slot of the time-frame associated with that level and false at all 
other times [46]. A typical example of such a hierarchy of control markers is given 
in Fig. 4.3, where $C_{bit}$ marks the first bit of a word, $C_{word}$ the first word of a frame 
and so on. These control markers are usually derived from a hierarchy of 
counters [14]. 
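A hierarchy of this kind can be sketched as a pair of nested counters. The function `control_markers` below is a hypothetical illustration, not the circuit used in this work. It assumes that the word-level marker is held true for the whole of the first word of a frame, which is one reading of "the first time-slot of the time-frame associated with that level".

```python
def control_markers(n_bits, words_per_frame, n_ticks):
    """Return (c_bit, c_word) for each bit-time, derived from two counters."""
    markers = []
    bit_count = 0    # position within the current word
    word_count = 0   # position within the current frame
    for _ in range(n_ticks):
        c_bit = bit_count == 0     # first bit of a word
        c_word = word_count == 0   # first word of a frame (assumed word-wide)
        markers.append((c_bit, c_word))
        bit_count += 1
        if bit_count == n_bits:    # the bit counter wraps and clocks the word counter
            bit_count = 0
            word_count = (word_count + 1) % words_per_frame
    return markers


# 4-bit words, 2 words per frame, observed for 8 bit-times.
marks = control_markers(4, 2, 8)
```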
Fig. 4.2 Computational contrast between bit-serial and bit-parallel
arithmetic operators.
Fig. 4.3 Typical hierarchy of control markers.
With bit-parallel data, the bits of different words in the same bit-position have the 
same weight and may be combined together to perform some arithmetic operation. 
In a bit-serial data representation, however, the weights assigned to the bits are
distributed in time instead of in space. As a result, the bits from different bit-streams
must be time-aligned so that only bits with the same weight are combined
together. As an illustration, suppose that three streams of data are to be combined
together to perform some arbitrary arithmetic function f(a,b,c) = a . b . c
(Fig. 4.4). Assume that each operator requires one bit-time to produce the result.
This delay, in terms of bit-times, is known as the latency of the operator. The
operation (a . b) takes one bit-time to complete and the bits in input c must be
delayed by a single bit-time in order to align the weights of the bits in c to those in
the intermediate result (a . b). This necessity to time-align input streams to bit-serial
operators may be viewed as part of the overall control requirement. The
generation of word markers and the time alignment of data streams constitute the
flow of control in a network of bit-serial operators.
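The need for time alignment can be shown with a small behavioural sketch. A bitwise exclusive-OR is used as a stand-in for the arbitrary operator; both `delay` and `xor_operator` are hypothetical names, and each operator is given a latency of one bit-time as in the example in the text.

```python
def delay(stream, n=1):
    """Delay line of n bit-times; zeros are shifted in at the start."""
    return [0] * n + list(stream)


def xor_operator(x, y):
    """Stand-in two-input operator with a latency of one bit-time."""
    return delay([p ^ q for p, q in zip(x, y)], 1)


a = [1, 0, 1, 1]
b = [0, 1, 1, 0]
c = [1, 1, 0, 0]

ab = xor_operator(a, b)                   # (a . b), one bit-time late
aligned = xor_operator(ab, delay(c, 1))   # c delayed to match (a . b)
misaligned = xor_operator(ab, c)          # c combined with the wrong bits

expected = [p ^ q ^ r for p, q, r in zip(a, b, c)]
```

With the extra delay on c, the result emerges intact after the total latency of two bit-times; without it, bits of unequal weight are combined and the result is wrong.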




4.2.3. Arithmetic Shifting 
In bit-serial computation, the bits from different operands must have equal weights 
imparted to them prior to any arithmetic operation. If the weights associated with 
the bits of an operand are to be altered, arithmetic shifting corresponding to 
multiplication or division by powers of 2 can be carried out on the bit stream. 
Consider a stream of 4-bit words {a_i b_i c_i d_i} where d_i is the least significant bit
(LSB) of the ith sample word. For two's complement arithmetic, an arithmetic
right-shift of 1 bit corresponding to division by 2 is effected by replacing the LSB of 
each word by the most significant bit (MSB) of the previous word, a sign extension 
operation [77]. The LSB of each input word is discarded and the weight of each 
input bit is halved at the output. Arithmetic right-shifting will therefore result in a 
gradual loss of accuracy in signal representation. 
input   d3 a2 b2 c2 d2 a1 b1 c1 d1
output  a2 a2 b2 c2 a1 a1 b1 c1 a0
Bit-serial arithmetic right shift.
An arithmetic left-shift of 1 bit corresponding to multiplication by 2 is effected by 
replacing the MSB of each word by a zero [77]. The MSB of each input word is 
discarded and the weight of each input bit is doubled at the output. Note that left-
shifting beyond the sign extension bits of the input word causes numerical overflow. 
input   d3 a2 b2 c2 d2 a1 b1 c1 d1
output  d3 0  b2 c2 d2 0  b1 c1 d1
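Both shifts can be checked with a short behavioural sketch operating on an LSB-first stream held as a Python list in time order. All function names are hypothetical. The sketch assumes that the shifted word is read from a frame displaced by one bit-time (one bit-time later for the right shift, one earlier for the left shift), and that the first word of a stream has an implied zero predecessor.

```python
def encode(words, n):
    """LSB-first two's complement bit stream for a list of n-bit words."""
    return [(w >> k) & 1 for w in words for k in range(n)]


def read_word(stream, start, n):
    """Decode the n-bit two's complement word whose LSB sits at index start."""
    bits = stream[start:start + n]
    value = sum(b << k for k, b in enumerate(bits))
    return value - (1 << n) if bits[-1] else value


def serial_right_shift(stream, n):
    """Replace the LSB slot of each word by the MSB of the previous word."""
    out = list(stream)
    for t in range(0, len(stream), n):
        out[t] = stream[t - 1] if t > 0 else 0  # assumed zero before the first word
    return out


def serial_left_shift(stream, n):
    """Replace the MSB slot of each word by a zero."""
    out = list(stream)
    for t in range(n - 1, len(stream), n):
        out[t] = 0
    return out


right = serial_right_shift(encode([-6, 5, 0], 4), 4)   # trailing padding word
left = serial_left_shift(encode([0, -3, 1], 4), 4)     # leading padding word
```

Reading `right` from indices 1 and 5 gives the halved words, and reading `left` from indices 3 and 7 gives the doubled words.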




4.2.4. Bit-Serial System Composition 
At the input of serial operators, bits with different weights are processed at different 
time-slots. In an operation such as the addition of two bit-streams, two outputs are 
produced in the same time-slot, a sum output S_o with the same weight as the input
bits and a carry output C_o with weight twice that of the input bits. To impart the
correct weight to the carry output, C_o must be explicitly left-shifted by one bit-position
as shown in Fig. 4.5. Carry propagation therefore takes place in time
rather than in space, enabling bit-serial operators to be tightly pipelined for high 





Fig. 4.5 Bit-serial full adder. 
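A behavioural sketch of such an adder is given below. It assumes LSB-first streams and a carry register that is cleared by the word marker at the first bit of each word; the one bit-time output latency of a real pipeline stage is not modelled, and the function names are hypothetical.

```python
def encode(words, n):
    """LSB-first two's complement bit stream for a list of n-bit words."""
    return [(w >> k) & 1 for w in words for k in range(n)]


def decode(stream, n):
    """Inverse of encode for n-bit two's complement words."""
    words = []
    for start in range(0, len(stream), n):
        bits = stream[start:start + n]
        value = sum(b << k for k, b in enumerate(bits))
        words.append(value - (1 << n) if bits[-1] else value)
    return words


def serial_add(a_bits, b_bits, n):
    """One full adder reused every bit-time; the carry is held over in time."""
    carry = 0
    out = []
    for t, (a, b) in enumerate(zip(a_bits, b_bits)):
        if t % n == 0:        # word marker: a new word starts, clear the carry
            carry = 0
        total = a + b + carry
        out.append(total & 1)  # sum bit, same weight as the input bits
        carry = total >> 1     # carry bit, used one bit-time later
    return out


sums = decode(serial_add(encode([3, -2], 4), encode([4, 1], 4), 4), 4)
```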
A pipeline stage may be viewed as a block of combinatorial logic buffered by input 
and output latches. The logic block evaluates when the input data is latched and 
the output from the block is stored before the start of the next bit-time. To 
maintain pipeline operation, no output bit may be combinatorially related to any of 
the input bits, thereby requiring a pipeline stage to have an operational delay or 
latency of one bit-time. Bit-serial hardware consists of cascades of pipeline 
operators. The bit-time is determined by the propagation delay of signals through 
the slowest combinatorial logic block plus the settling time of the buffer latches. 
This makes physical partitioning an important aspect of system design if high bit-
rates are to be achieved. To increase the bit-rate, a logic block can be decomposed 
arbitrarily into smaller blocks, interfaced by latches, so as to reduce the 
combinatorial logic evaluation time. The throughput rate is improved but at the 
expense of increased system latency. 
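The trade-off can be made concrete with a small calculation. The delay figures below are hypothetical and chosen only for illustration; `bit_time_ns` and `latency_ns` are likewise illustrative names.

```python
def bit_time_ns(logic_delays_ns, latch_settle_ns):
    """Bit-time: slowest combinatorial block plus latch settling time."""
    return max(logic_delays_ns) + latch_settle_ns


def latency_ns(logic_delays_ns, latch_settle_ns):
    """Each pipeline stage costs one full bit-time, whatever its own delay."""
    return len(logic_delays_ns) * bit_time_ns(logic_delays_ns, latch_settle_ns)


settle = 2.0         # hypothetical latch settling time
single = [12.0]      # one logic block of 12 ns
split = [6.0, 6.0]   # the same logic cut into two 6 ns blocks

bit_rate_gain = bit_time_ns(single, settle) / bit_time_ns(split, settle)
```

With these figures the bit-time falls from 14 ns to 8 ns, while the latency rises from 14 ns (one bit-time) to 16 ns (two bit-times).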
Since bit-serial operators are built up from cascades of small pipeline stages, the
throughput of an operator is independent of the number of stages from which it is 
composed. Typically, a full adder function would be representative of the largest 
combinatorial logic block included in a serial pipeline. Logic functions more 
complex than the full adder function are usually decomposed into smaller logic 
blocks. If this is the case, then all operators are operated at a fixed throughput set 
by the time taken to perform a full addition. As a result, the throughput rate of an 
adder is the same as that of a delay element, a multiplier or some higher level 
operator such as a filter section, a complex multiplier or a butterfly unit [14]. This 
facility to hierarchically compose higher level operators from a collection of existing 




Fig. 4.6 	Highly efficient but risky single phase clocking. 
4.3. Clocking Strategies for Bit-Serial Operators 
Bit-serial operators are essentially cascades of pipelined combinatorial logic elements 
that are clocked at high rates to optimise throughput. As a result, the 
computational efficiency of the operators is a function of the clocking methodology 
adopted to synchronise communication between operators. With VLSI as the 
implementation medium, the primary objective of a clocking methodology is to 
minimise interconnection cost in terms of the time and area required for 
communication. A single phase clocking scheme such as that given in Fig. 4.6 can 
be used for MOS technologies. This scheme, although highly efficient, is inherently 
risky since it constrains the combinatorial logic delay T_CL to satisfy a two-sided
relationship [23]. If t_period is the period of the clock and t_high the time the clock is
high, then
$$t_{high} < T_{CL} < t_{period} \qquad (4.1)$$
In the presence of internal delay hazards, variations in device parameters and
system operating conditions, a two-sided timing constraint on T_CL is difficult to
satisfy system-wide. As a result, a two-phase non-overlapping clocking scheme as
illustrated in Fig. 4.7 is commonly used to make the time constraint on T_CL into a
one-sided relationship. Referring to Fig. 4.7, the time constraint on T_CL is now
reduced to
$$T_{CL} < \phi_1 + \phi_2 + t_{12}$$
Fig. 4.7 	Classic two-phase non-overlapping clocking methodology. 
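The practical difference between the two constraints can be sketched as follows. The sketch reads Eqn. 4.1 as t_high < T_CL < t_period and the two-phase bound as T_CL < φ1 + φ2 + t12, with φ1, φ2 and t12 taken as the durations marked in Fig. 4.7; the function names and all timing figures are hypothetical.

```python
def single_phase_ok(t_cl, t_high, t_period):
    """Two-sided constraint: the logic must be neither too fast nor too slow."""
    return t_high < t_cl < t_period


def two_phase_ok(t_cl, phi1, phi2, t12):
    """One-sided constraint: the logic only has to be fast enough."""
    return t_cl < phi1 + phi2 + t12


# Hypothetical figures in nanoseconds.
fast_path = 2.0    # a short path, for example a plain shift stage
slow_path = 15.0   # a path through a full adder

fast_single = single_phase_ok(fast_path, t_high=10.0, t_period=20.0)
fast_two = two_phase_ok(fast_path, phi1=8.0, phi2=8.0, t12=2.0)
```

A fast path violates the lower bound of the single phase scheme, whereas under two-phase clocking it is always acceptable.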
In NMOS technology, the use of a two-phase non-overlapping clock permits the 
realisation of efficient pipeline logic [52]. An NMOS bit-pipeline stage is shown in 
Fig. 4.8a. The logic function f(a,b) = c is realised from a combinatorial block 
buffered by an input and an output bit latch. The input bit latch is controlled on 
one phase of the clock φ1 and the output of the function is stored on the other




and is ideally suited for constructing bit-serial operators in NMOS technology. 
Fig. 4.8a A bit-pipeline stage in NMOS technology. 
With CMOS technology, the φ1 and φ2 time epochs in a two-phase, non-overlapping
scheme can be approximated by the high and low phases of a single clock by
exploiting the complementary behaviour of n- and p-type transistors. This is shown
in Fig. 4.8b, where a single clock φ and its inverse φ̄ are used to control the
latches. However, this arrangement is sensitive to two practical difficulties
encountered in distributing clock signals globally, namely degraded clock transition
times and skew. Slow edge transition times or overlap in the clock phases φ and φ̄
due to clock skew can lead to a two-sided timing requirement, similar to that of
Eqn. 4.1. A true two-phase, non-overlapping scheme with φ1 and φ2 and their
inverses φ̄1 and φ̄2 will restore the time constraint on T_CL to a one-sided
relationship. A CMOS bit-pipeline stage can be area-inefficient if it is realised from
static logic trees requiring multi-phase clocks for synchronisation. To improve the
speed and area requirements of a bit-pipeline stage in CMOS, dynamic precharge






Fig. 4.8b Direct equivalent of an NMOS bit-pipeline stage in 
CMOS technology. 
4.3.1. Dynamic Precharge Logic 
Fig. 4.9 illustrates the general circuit configuration for the two possible "flavours" of
dynamic gates. The operation of the two types of gate is complementary in
nature and, for brevity, will be described in terms of the n-type gate. When the
clock φ is low, the output node c is precharged high whilst the input logic values a
and b are in a state of transition. When the input values have stabilised, φ is
pulsed high so that output node c may be conditionally discharged through the n-type
logic tree.
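The precharge and evaluate behaviour of an n-type gate can be captured in a purely logical model. The class `DynamicNGate` is a hypothetical sketch that ignores charge sharing, leakage and all analogue effects; the n-type tree is represented by a function that returns true when the tree conducts.

```python
class DynamicNGate:
    """Behavioural model of an n-type precharged gate."""

    def __init__(self, tree_conducts):
        self.tree_conducts = tree_conducts  # models the n-type logic tree
        self.node = 1                       # output node c

    def step(self, phi, *inputs):
        if phi == 0:
            self.node = 1                   # precharge: node pulled high
        elif self.tree_conducts(*inputs):
            self.node = 0                   # evaluate: conditional discharge
        # once discharged, the node cannot recover until the next precharge
        return self.node


# Two n-type transistors in series conduct when a AND b, so c = NAND(a, b).
gate = DynamicNGate(lambda a, b: bool(a and b))
trace = [
    gate.step(0, 1, 1),   # precharge
    gate.step(1, 1, 1),   # evaluate with a = b = 1
    gate.step(1, 0, 0),   # inputs change during evaluation
    gate.step(0, 0, 0),   # precharge again
    gate.step(1, 0, 1),   # evaluate with a = 0, b = 1
]
```

The third step shows why the inputs must be stable before φ is pulsed high: a node that has been discharged stays low for the rest of the evaluation phase.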
For the realisation of complex logic functions, it would be convenient to be able to 
directly cascade dynamic gates together. However, the cascading of dynamic gates
directly can lead to an internal signal race condition [34]. Consider the cascading
of two gates directly as shown in Fig. 4.10a. Nodes X and c are initially precharged
high so that transistor A of the following gate is turned on. When the evaluation 
phase starts, the finite time it takes node X to conditionally turn off transistor A can 
result in the partial discharge of node c. A simple way to eliminate this race 
condition is to insert an inverter at the output of every dynamic gate. This scheme 





















Fig. 4.11 Cascading of NORA gates. 
X can conditionally discharge node c without causing any signal races. 
The effect of inserting an inverter at the output node of a dynamic gate is 
reproduced if the output node is cascaded directly onto a gate of the opposite 
flavour. This is shown in Fig. 4.11 and the resultant logic structure is known as 
NORA (No Race) logic [24]. Clocked latches can be added to the output of 
dynamic gates to store the result of the evaluation. These latches have a sample 
and hold phase that works in tandem with the evaluation and precharge phase of 
the dynamic gates to form bit-pipeline stages. 
4.3.2. Dynamic Pipeline Logic 
Consider a NORA pipeline stage as shown in Fig. 4.12. When φ is high, the gates
evaluate and the latch is open to sample the result of the evaluation. When φ goes
low, the gates are precharged whilst at the same time, the latch closes to hold the
result. This constitutes a φ-type half-bit stage, so called as the result is evaluated
when φ is high. By interchanging the φ and φ̄ clocking of a φ-block, a
complementary φ̄-block is created which precharges when φ is high and evaluates







Fig. 4.12 NORA φ-type half-bit pipeline stage.
Note that in principle, a NORA pipeline stage effectively works on two non-
overlapping phases, delineated by the high and low phases of a single clock. In 
common with all clocking methodologies that use a single clock to delineate two 
separate time epochs, it is susceptible to race conditions associated with clock edge 
transition times and clock skew. When the clock makes a transition in a NORA 
stage to start the precharge phase, the output latch must close simultaneously to 
hold the result of the previous evaluation. Since the precharging and the latch are
controlled by the same clock phase, there is an inherent race between the
precharging process and the closing of the latch: the latch must close fast enough to
prevent the precharging from corrupting the dynamically held logic value. As a result, clocks
with sharp transition edges are required for reliable operation. 
This race condition can be removed by using different clock phases to separate the 
precharging of the gates from the sampling and hold operation of the latches. For 
this reason, dynamic pipeline logic usually requires multi-phase clocks for race-free 
operation [58]. NORA pipeline stages may be clocked by two non-overlapping
phases φ1 and φ2 to render them insensitive to signal race conditions. This, however,
requires the use of four clock lines, φ1, φ2 and their inverses.
4.4. Race-free Dynamic Pipeline Logic 
For the implementation of a complete system on a chip, a race-free clocking 
methodology is essential for reliable operation. This section outlines two such 
clocking schemes, a two phase version of the NORA bit-pipeline stage [57, 49†]
and a novel single clock scheme incorporating self-timed logic principles.
4.4.1. Two Phase NORA 
The two phase version of a NORA bit-pipeline stage has been used as the basis for 
the implementation of a bit-serial wave digital filter. The computational core of 
this system will be presented as a case study in the next chapter. In this section, the 
clocking scheme common to all operators from which the core is composed is 
described using a full adder/subtractor as a design example. The circuit diagram of 
the adder/subtractor operator and its temporal behaviour are given in Fig. 4.13a 
and Fig. 4.13b respectively. Referring to Fig. 4.13a, I_a and I_b are the input bit
streams to be added or subtracted to produce the sum output S_o; the state of the
input I_sign selects the addition or subtraction function.
When φ1 is high, nodes n1, n3, n5 are precharged low whilst nodes n2, n4 are
precharged high. During this time epoch, the input latches are open so that the
inputs to the adder, I_a, I_b and I_sign, can be sampled. When φ1 goes low, I_a, I_b and
I_sign are held stable whilst the logic trees evaluate, thereby conditionally discharging
nodes n1 to n5. Between a pair of latches, the n and p logic trees may be cascaded
to any arbitrary depth, provided that the total evaluation time does not exceed half
a clock period.
† Published work by the author.
Fig. 4.13b Temporal behaviour of NORA full adder/subtractor.
Results 
To evaluate the reliability and performance of two-phase NORA logic, several bit-
serial operators including an adder/subtractor and an 8-bit multiplier have been 
fabricated as test structures in a 2.5 µm, dual layer metal p-well CMOS process.
The microphotograph of these test structures is given in Fig. 4.14, with I/O nodes
brought out onto probe pads. These pads may be probed by a test card and the 














Fig. 4.14 Microphotograph of test structures for evaluating two-
phase NORA clocking. 
Fig. 4.15 Photograph of test card manufactured to probe the I/O




For flexibility in testing, the clock phases φ1, φ2 and their inverses are generated
on-chip, routed off-chip and re-applied via input pads. The full adder/subtractor
function described in this section has been probe-tested at just over 15 MHz. The
result of this test is displayed in the oscilloscope trace given in Fig. 4.16.
Fig. 4.16 Oscilloscope trace for two-phase NORA addition at over
15 MHz.
The top two traces are the inputs I_a and I_b. The bottom trace displays the




Although a two-phase NORA pipeline stage is race-free, it suffers from one major
weakness likely to prevent its widespread use: NORA logic structures
and their variants are highly susceptible to internally generated noise. This poor
immunity to noise results from a combination of factors:
• 	The logic threshold of a NORA gate is determined by the device
threshold voltage V_T. In a typical commercial process, the low
values of V_T will result in a noise margin of less than 1 V.
• 	Unlike domino logic, the internal nodes are not buffered by
restoring logic, giving rise to chains of floating nodes. For this
reason, NORA logic is highly sensitive to capacitive coupling
noise and charge redistribution.
• 	The use of fully dynamic latches. As a result, the isolated
storage node in a latch may be de-stabilised by clock
breakthrough and parasitic capacitive coupling [57].
The two major limitations of two-phase NORA as a means of realising high 
performance bit-serial operators are poor noise immunity and multi-wire clock 
distribution. In order to overcome these drawbacks whilst maintaining race-free 
operation, a robust pipeline logic structure called PHIMOS will be described in the 
following section. 
4.4.2. Race-free PHIMOS 
Bit-serial operators are tightly pipelined to be exercised continuously through the 
application of globally distributed clocks. For clock schemes such as the two-phase 
NORA, the generation and distribution of the clock phases not only require 
additional area but may also degrade operator throughput. The temporal 
relationship between the phases must be maintained at all times, even in the 
presence of clock skew. For this reason, the different clock phases are constrained 
to make sequential transitions, with finite rise and fall times that can account for a 
significant proportion of the clock period. This can be viewed as "dead time" 
during which the pipeline is idle and reduces the computational efficiency of 
pipeline structures. 
To address the problems of multi-phase clocking, clocking schemes requiring only a 
single clock line have been proposed for CMOS pipeline stages [51, 94]. Although 
only a single clock is used, such schemes are inherently two phase in nature, using 
the high and low phase of the clock to approximate two non-overlapping time 
epochs. As discussed in § 4.3, the finite transition time of a clock edge inevitably
requires the propagation delay T_CL of the combinatorial logic in a pipeline stage to
satisfy a two-sided constraint. With dynamic logic, the minimum value of T_CL can
be of the order of one or two nanoseconds, as the precharging process is inherently
fast. As a result, clocks with fast transition times are required.
To overcome this inherent race condition brought about by the use of a single clock 
to delineate two non-overlapping time epochs, a novel circuit technique called 
PHIMOS [41†] is introduced. PHIMOS is based on precharged, differential
cascode voltage switch logic (CVSL) [26] and requires just a single global clock
when realised in CMOS technology. The operation of a PHIMOS pipeline stage is
race-free since the clock is used to delineate only a single time epoch, analogous to
say φ1, which controls the precharging process. The other epoch, corresponding to
φ2, which sets the sampling time of the latches, is defined not by the clock but by
the precharging mechanism. As a result, the sample and hold operation of the
latches is self-timed in synchrony with the evaluate and precharge mechanism of
the logic trees, giving rise to an "elastic" period functionally equivalent to φ2.
The combinatorial logic in PHIMOS is built from a pair of logic trees which
form the true and complement of a logic function. A logic variable a is then
represented as a pair of signals {a, ā}. The logic configuration for a bit-pipeline
stage in PHIMOS is illustrated in Fig. 4.17. Two flavours of precharged trees are
used, with n-type trees precharged high and p-type trees precharged low. The pair
of signals resulting from an n-tree evaluation is stored in a nand set-reset (SR) latch
while that from a p-tree is held in a nor set-reset (SR) latch. When the clock PHI
is high, the n-tree evaluates forming a complementary signal pair {a, ā} on nodes n1
and n2. The complementary nature of the signal pair {a, ā} opens the nand SR latch
to allow the result of the evaluation to be written into the latch. When the clock
PHI makes a high-to-low transition, precharging of the nodes n1 and n2 occurs.
When both nodes are fully precharged high, this condition closes the nand SR latch
to hold the result of the previous evaluation {a, ā}. In this way, the hold and sample





Fig. 4.17 Logic configuration for a PHIMOS bit-pipeline stage.
phase of the nand latch is synchronised to the precharge and evaluate phase of the 
logic tree in a self-timed manner. As a result, a bit-pipeline stage in PHIMOS is 
race-free, with the requirement that the combinatorial delay TCL satisfies a one-sided 
constraint TCL < 
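The self-timed behaviour of the nand SR latch can be illustrated with a logic-level sketch. The names `NandSRLatch` and `phimos_n_stage` are hypothetical, and the wiring of node n2 to the active-low set input and node n1 to the active-low reset input is an assumption made so that the latch output follows a; the actual connections are those of Fig. 4.17.

```python
def nand(x, y):
    return 1 - (x & y)


class NandSRLatch:
    """Cross-coupled nand latch with active-low set and reset inputs."""

    def __init__(self):
        self.q, self.q_bar = 0, 1

    def update(self, set_n, reset_n):
        for _ in range(2):   # iterate the cross-coupled gates until settled
            self.q = nand(set_n, self.q_bar)
            self.q_bar = nand(reset_n, self.q)
        return self.q


def phimos_n_stage(latch, phi, a):
    """n-tree nodes: both high in precharge, complementary after evaluation."""
    if phi:
        n1, n2 = a, 1 - a    # evaluation: exactly one node is discharged
    else:
        n1, n2 = 1, 1        # precharge: both nodes high, latch holds
    return latch.update(set_n=n2, reset_n=n1)   # assumed wiring


latch = NandSRLatch()
trace = [
    phimos_n_stage(latch, 1, 1),   # evaluate a = 1: latch written
    phimos_n_stage(latch, 0, 0),   # precharge: latch holds, input ignored
    phimos_n_stage(latch, 1, 0),   # evaluate a = 0: latch written
    phimos_n_stage(latch, 0, 1),   # precharge: latch holds
]
```

In this model the latch never needs a clock of its own: the precharged state (1, 1) of the node pair is itself the hold condition.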
Results 
As part of the scan-out circuitry used in the scanning frame described in § 3.5, a
PHIMOS shift register has been fabricated in a 2 µm, dual layer metal n-well
CMOS process. The microphotograph of the 50-stage register is shown in
Fig. 4.18. The maximum clock rate that this shift register is capable of is displayed
in the oscilloscope trace given in Fig. 4.19. The maximum clock frequency is
measured at close to 40 MHz and is constrained not by the inherent speed limitation
of the PHIMOS technique, but by the output pad driver delay. The output pad 







Fig. 4.18 Microphotograph of a 50-stage PHIMOS shift register 
Fig. 4.19 Oscilloscope trace of PHIMOS shift register clocked at
close to 40 MHz.
The top trace is the clock waveform PHI. The bottom two traces are the waveforms




To verify the race-free operation of PHIMOS pipelines, a slow sine-like clock with 
2 µs edge times has been successfully applied to the shift register. The oscilloscope
trace for this test is given in Fig. 4.20. 
Fig. 4.20 Oscilloscope trace of PHIMOS shift register clocked by a 
slow sine-like signal. 
The top trace is the clock waveform PHI. The bottom two traces are the waveforms
for signal pair {a, ā} with the sequence 1000 being shifted.
Performance Evaluation 
The circuit diagram of a bit-serial partial product sum (PPS) adder in PHIMOS is
given in Fig. 4.21. Note that the use of CVSL in the logic trees allows complex
boolean functions such as the sum and carry function of a full adder to be evaluated
within a single tree delay†. From measurements carried out on similar CVSL adder
structures presented in § 7.3.2, the delay taken to form the carry out function, T_carry,
is around 5 ns.
† Compare this with the two-phase NORA full adder given in Fig. 4.13 where evaluation of
the sum function requires the input logic signals to ripple through three successive logic trees.
CVSL offers improved switching delay over other forms of logic [11]. In addition,
the ability of the differential trees in CVSL to form both the true and the
complement of a logic function {q, q̄} may reduce device redundancy by enabling
transistors common to both the q and q̄ trees to be shared. This feature of CVSL is
exploited in realising the sum and carry function of the PHIMOS adder given in 
Fig. 4.21. The resulting realisation can be seen to be more device efficient than 
conventional static CMOS. 
4.5. Summary 
The two key strengths of bit-serial computation are communication and 
computational efficiency. Computational efficiency is achieved through functional 
parallelism, in which arrays of hard-wired processors are used to boost system 
throughput. Data words are presented in a bit-serial but word-parallel fashion to an 
array of functional processors to give a proportionate increase in system throughput. 
In bit-serial arithmetic, carry propagation takes place in the time dimension rather 
than in space. As a result, bit-serial operators may be tightly pipelined for clocking 
at high rates. A race-free clocking methodology for high performance pipelines, 
termed PHIMOS, has been presented and verified with test measurements. As only
a single clock line is used in PHIMOS, clock skew between various sub-systems is
relatively easy to accommodate, in contrast to multi-phase clock schemes. Such
clock skew adjustments may indeed be necessary for high levels of functional
parallelism, with an interconnection network extending over the logic operators 

































Digital Wave Filter Adaptor 
A Case Study on Bit-Serial Computation 
5.1. Introduction 
The bit-serial approach to the implementation of fixed function digital signal 
processors will be illustrated by a system case study of a recursive digital filter 
section. This filter section is called a universal wave filter adaptor [49†, 67†]. It
can be used as a computational core to construct a variety of filters belonging to a 
class known as wave digital filters [17]. The design study will cover the functional 
design of the adaptor structure, the design of a serial pipeline multiplier operator 
central to the realisation of the adaptor and finally, system composition integrating 
computational and control hardware. Practical results will be presented to vindicate 
the particular strengths of the bit-serial approach to dedicated processor 
implementation. This will take the form of the measured frequency response of a 
seventh-order, low-pass Chebyshev filter constructed from a cascade of two 
universal adaptors. 
5.2. Wave Digital Filters 
A digital filter can be described by a general linear difference equation of the form 
$$v_o[nT] = \sum_{j=0}^{M} a_j\, v_i[(n-j)T] - \sum_{k=1}^{N} b_k\, v_o[(n-k)T] \qquad (5.1)$$
Any digital filter can therefore be realised using only the arithmetic functions of 
addition and multiplication. These two arithmetic operations are data-independent 
in nature and are ideally suited to the bit-serial approach [31]. 
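A direct evaluation of the difference equation can be sketched in a few lines. The sketch reads Eqn. 5.1 with feed-forward coefficients a_0 to a_M acting on the input samples and feedback coefficients b_1 to b_N acting on past outputs, and takes all samples before n = 0 to be zero; the function name is hypothetical.

```python
def difference_equation(v_in, a, b):
    """Evaluate v_o[n] = sum_j a[j] v_i[n-j] - sum_k b[k] v_o[n-k].

    a holds a_0 .. a_M; b holds b_1 .. b_N (so b[0] is b_1).
    """
    v_out = []
    for n in range(len(v_in)):
        acc = sum(a[j] * v_in[n - j] for j in range(len(a)) if n - j >= 0)
        acc -= sum(b[k - 1] * v_out[n - k]
                   for k in range(1, len(b) + 1) if n - k >= 0)
        v_out.append(acc)
    return v_out


# First-order recursive section: v_o[n] = v_i[n] + 0.5 v_o[n-1].
impulse_response = difference_equation([1.0, 0.0, 0.0, 0.0], a=[1.0], b=[-0.5])
```

Only additions, multiplications and stored past samples are involved, and the sequence of operations does not depend on the data values.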
As can be seen from Eqn. 5.1, digital filters are completely characterised by the
values of the two sets of coefficients {a_0, a_1, ..., a_M} and {b_1, b_2, ..., b_N}. The
behaviour of a filter is, however, more conveniently described by a set of
specifications in the frequency domain, such as passband and stopband frequencies
and characteristics. In order to obtain a digital realisation of a filter with the
specified frequency characteristics, the coefficients a_j and b_k of the general linear
difference equation must be derived in some way. For reasons which stem from
technological heritage, the most common approach to designing a digital filter with 
the desired frequency characteristics is to start with a classical RLC filter network of 
the same specification. These classical analogue filters, widely known and 
extensively tabulated, play the role of reference filters from which the coefficient 
values of the equivalent digital filter may be derived. This task is accomplished by 
selecting an appropriate frequency transformation function having the general form
s = f(z) so that the frequency characteristics in the reference domain are preserved in
the digital domain. The most well known and useful of these transformation
functions is the bilinear transform [32] given by
$$s_a = \frac{z - 1}{z + 1} \qquad (5.2)$$
where s_a is the complex frequency in the reference domain and z = e^{s_d T}, s_d and T
being the complex frequency and sample period in the digital domain respectively.
A digital filter produces an output response sample v_o[nT] by evaluating its linear
difference equation within one sample period T. Since the computation as specified
by the difference equation is carried out digitally, the accuracy of v_o[nT] is
inherently limited by the use of a finite number of bits for signal representation.
These inaccuracies arise from the quantisation of the sampled input signal, the use
of a finite number of bits for representing filter coefficients a_j and b_k, and the
accumulation of arithmetic roundoff errors. Certainly, if the wordlength is made
sufficiently wide, these finite wordlength errors can be reduced as much as is 
required. However, this simplistic approach brings about an increase in the 
hardware resources required to evaluate the linear difference equation of the filter. 
In addition to the wordlength, the accuracy of a filter is also influenced by its 
physical structure. A new class of filter structures called wave digital filters
(WDF) [17] has been proposed to reduce the sensitivity of digital filters to finite
wordlength effects. Wave digital filters are exact transformations of RLC
reference filters which are resistively terminated. They therefore retain many of the 
desirable properties of this class of reference filters, such as insensitivity to 
component value variations and excellent stability [17]. The insensitivity of the 
response of a filter to component value variations will lead to a digital filter with 
low coefficient wordlength requirements. This would result in a considerable saving 
in the multiplier hardware required to implement the filter structure. 
5.2.1. Choice of Frequency and Signal Parameters 
The frequency transformation function used in developing the wave filter theory is
the bilinear transform given by Eqn. 5.2,
$$s_a = \frac{z - 1}{z + 1} = \tanh\left(\frac{s_d T}{2}\right), \qquad z = e^{s_d T}$$
where the subscripts a and d refer to the analogue and digital domain respectively.
In addition to the frequency variable s_a, an appropriate set of signal variables in
the reference domain must be chosen to become the signal parameters in the digital
domain. If voltage and current in the reference filter are selected as the signal
variables, the choice of s_a as given in Eqn. 5.2 would result in unrealisable digital
structures containing delay-free loops [17]. For this reason, linear combinations of
voltage and current, A and B, called voltage "waves" are used instead. A and B are
given by
$$A = V + RI, \qquad B = V - RI \qquad (5.3)$$
A can be regarded as the incident voltage wave on a two-terminal port, B the
reflected wave and R the port reference resistance.
5.2.2. Circuit Elements 
With the choice of voltage waves as signal parameters and s_a as the complex
frequency variable in the reference domain, circuit elements such as resistors, 
inductors and capacitors in the reference domain can be transformed into their 
equivalents in the wave digital domain. Table 5.1 lists the wave-flow equivalents of 
some of the more useful elements encountered in the reference domain. Briefly, a 
capacitor is transformed into a unit delay wave element which delays the incident 
wave by one sample time T, an inductor into a unit delay wave element with signal 




delay elements of length T/2. 
Element        Wave-flow equivalent
Capacitor      unit delay of length T
Inductor       unit delay of length T with sign inversion (-1)
Resistor       B = 0
Unit Element   two-port (A1, B1, A2, B2) with delay elements of length T/2
Table 5.1 Wave-flow equivalents for some common analogue circuit
elements.
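The one-port entries of Table 5.1 can be checked directly from Eqn. 5.2 and Eqn. 5.3. The sketch below assumes the port-resistance choices that are usual in wave digital filter theory (R = 1/C for a capacitor, R = L for an inductor, and R equal to the resistance itself for a resistor); these choices are not stated in the text above and are introduced here as assumptions.

```latex
\begin{align*}
V &= \tfrac{1}{2}(A + B), \qquad I = \frac{A - B}{2R}
  && \text{(inverse of Eqn. 5.3)} \\
\text{Capacitor: } V &= \frac{I}{s_a C},\ R = \frac{1}{C}
  &&\Rightarrow\ \frac{B}{A} = \frac{1 - s_a}{1 + s_a} = z^{-1} \\
\text{Inductor: } V &= s_a L I,\ R = L
  &&\Rightarrow\ \frac{B}{A} = \frac{s_a - 1}{s_a + 1} = -z^{-1} \\
\text{Resistor: } V &= R I
  &&\Rightarrow\ B = V - RI = 0
\end{align*}
```

The last step in the capacitor and inductor lines follows from substituting s_a = (z - 1)/(z + 1), which gives 1 - s_a = 2/(z + 1) and 1 + s_a = 2z/(z + 1).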
5.2.3. Interconnection of Elements with Adaptors 
A filter network is constructed from an interconnection of circuit elements. In 
order to ensure full equivalence between the reference network and the wave digital 
flow-graph, the interconnection or topological rules as embodied by Kirchhoff's 
current and voltage laws in the reference domain must also hold in the wave 
domain [17]. Structures called adaptors in the wave digital domain ensure that the 
topological rules are satisfied. The adaptors may be of two types, series or parallel, 
and they emulate a series or parallel connection in the reference filter. 





Consider an n-port network with ports k = 1, 2, ..., n having port resistances $R_k$.
$R_k$ can assume arbitrary values and is determined by the element that is connected
across that port. With a parallel connection, Kirchhoff's laws state that
$V_1 = V_2 = \cdots = V_n$    (5.4)
$I_1 + I_2 + \cdots + I_n = 0$
The voltage waves $A_k$ and $B_k$ at port k are related to $V_k$ and $I_k$ by
$A_k = V_k + R_k I_k$
$B_k = V_k - R_k I_k$
Eliminating $V_k$ and $I_k$,
$B_k = A_0 - A_k$    (5.5)
where
$A_0 = \sum_{k=1}^{n} \alpha_k A_k$
$\alpha_k = \frac{2 G_k}{G_1 + G_2 + \cdots + G_n}$
$G_k = \frac{1}{R_k}$
From a similar derivation, the equations for an n-port series adaptor are given by 
$B_k = A_k - \beta_k A_0$    (5.6)
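The adaptor equations can be checked numerically with a short sketch. The parallel case follows Eqn. 5.5 directly. For the series case the coefficient definition is not legible in this copy, so the sketch assumes the usual form, $A_0$ being the sum of the incident waves and the series coefficient being $2R_k / (R_1 + \cdots + R_n)$; the function names are illustrative only.

```python
from typing import List


def parallel_adaptor(a: List[float], r: List[float]) -> List[float]:
    """Reflected waves B_k of an n-port parallel adaptor (Eqn. 5.5)."""
    g = [1.0 / rk for rk in r]                  # G_k = 1 / R_k
    g_sum = sum(g)
    alpha = [2.0 * gk / g_sum for gk in g]      # alpha_k = 2 G_k / (G_1 + ... + G_n)
    a0 = sum(al * ak for al, ak in zip(alpha, a))
    return [a0 - ak for ak in a]                # B_k = A_0 - A_k


def series_adaptor(a: List[float], r: List[float]) -> List[float]:
    """Reflected waves B_k of an n-port series adaptor (form of Eqn. 5.6).

    Assumption: A_0 = A_1 + ... + A_n and the coefficient of port k is
    2 R_k / (R_1 + ... + R_n).
    """
    r_sum = sum(r)
    coeff = [2.0 * rk / r_sum for rk in r]
    a0 = sum(a)
    return [ak - ck * a0 for ak, ck in zip(a, coeff)]


def port_voltages(a: List[float], b: List[float]) -> List[float]:
    """V_k = (A_k + B_k) / 2, from the wave definitions."""
    return [(ak + bk) / 2 for ak, bk in zip(a, b)]


def port_currents(a: List[float], b: List[float], r: List[float]) -> List[float]:
    """I_k = (A_k - B_k) / (2 R_k), from the wave definitions."""
    return [(ak - bk) / (2 * rk) for ak, bk, rk in zip(a, b, r)]
```

Converting the waves back to voltages and currents shows whether Kirchhoff's laws hold: equal port voltages and zero current sum for the parallel adaptor, equal port currents and zero voltage sum for the series adaptor.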




5.3. Functional Design of Universal Adaptor Structure 
The key to the realisation of a WDF is the adaptors, which emulate the
parallel and series connections of the reference filter. The adaptors form the basic
building blocks from which arbitrary WDFs can be constructed. In this section, the
functional structure of a universal adaptor that permits the realisation of the WDF 
network equivalents of reference RLC ladder filters is considered. 
5.3.1. Adaptor Flow-Graph 
To fulfill its role as a general purpose building block, the universal adaptor has 
three ports with a unit delay element built into one port. The adaptor can then be 
configured externally to behave as a parallel or series adaptor with an attached 
inductive or capacitive filtering element. In addition, the adaptors are directly cascadable
to permit the construction of higher order filters [63].
The adaptor functional structure† given in Fig. 5.1 implements the parallel and
series adaptor equations given respectively by Eqn. 5.5 and Eqn. 5.6 for a 3-port 
network. Here, extensive use is made of multiplexers so that a common data path 
can be configured dynamically on-the-fly into two different flow-graphs, describing 
either the parallel or series adaptor equations for a 3-port network. One of the key 
advantages of bit-serial design, that of single wire communication, is evident from 
the design of the universal adaptor structure given in Fig. 5.1. For single wire 
communication, the cost of reconfiguring signal communication paths through the 
use of multiplexers is minimal. For bit-parallel systems, providing a similar facility
would require the routing and switching of unwieldy word-wide buses.
The flow-graph of a universal adaptor is composed from two major arithmetic 
operators, a full adder/subtractor and a multiplier. The design and realisation of a 
full adder/subtractor cell has already been covered in § 4.4.1. Design 
considerations for the multiplier will be discussed next. 
† The functional structure of the universal adaptor is adapted from Ref. [63].
Fig. 5.1 Functional structure of a 3-port universal adaptor. 
5.3.2. Serial Pipeline Multiplier 
A serial multiplier is basically constructed from a linear array of full adder cells. 
Each cell produces an accumulating partial product sum (PPS) from the addition of 
its two input operands, a partial product (PP) formed locally within the cell and the 
previous value of PPS passed from the neighbouring cell. The PPS output of each 
cell is then passed through an arithmetic shifter (refer to § 4.2.3) so that the 
correct weights can be imparted to the accumulating PPS bits before delivery to the 
next cell. 
Subscribing to the general philosophy of tight pipelining for increased throughput in 
bit-serial operators, a fully pipelined serial multiplier is employed. Due to the 
recursive nature of the computation carried out by the adaptor, the maximum wave 
filter sample rate is determined by the total computational latency of the realised 
adaptor structure. For a conventional add-and-shift multiplication algorithm, the
computational latency of a fully serial multiplier is 2m bits, m being the coefficient
wordlength. For high sample rates, a low computational latency is required. For
this reason, the pipeline multiplier described here is based on the modified Booth's
algorithm [68], enabling the latency of the serial multiplication to be shortened to
(3m/2 + 2) bits.
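The reason modified Booth recoding shortens the multiplication can be sketched in software. The coefficient is recoded two bits at a time into signed digits in the range −2 to +2, so an m-bit coefficient needs only m/2 partial products. The sketch below illustrates the recoding arithmetic only; it does not model the bit-serial cell structure or reproduce the latency figures quoted above, and the function names are illustrative.

```python
from typing import List


def booth_radix4_digits(coeff: int, m: int) -> List[int]:
    """Recode an m-bit two's-complement coefficient (m even) into m/2
    signed digits, each in {-2, -1, 0, 1, 2}, least significant first."""
    assert m % 2 == 0
    assert -(1 << (m - 1)) <= coeff < (1 << (m - 1))
    # Python's >> is arithmetic, so this yields two's-complement bits.
    bits = [(coeff >> i) & 1 for i in range(m)]
    digits = []
    prev = 0  # implicit zero bit to the right of the LSB
    for i in range(0, m, 2):
        digits.append(bits[i] + prev - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits


def booth_multiply(x: int, coeff: int, m: int) -> int:
    """Form the product from m/2 partial products, one per Booth digit."""
    return sum(d * x * (4 ** i) for i, d in enumerate(booth_radix4_digits(coeff, m)))
```

For example, `booth_radix4_digits(-3, 4)` yields the digits `[1, -1]`, since 1 + (−1)·4 = −3, so a 4-bit coefficient is handled with two partial products instead of four.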
The multiplier structure is based on a design proposed by Lyon [44] which permits 
multiplication on a continuous bit-stream of data. If an n-bit word is multiplied by
an m-bit coefficient, bit growth occurs to produce a product that is (m + n − 1) bits
long. If this full precision product is allowed to propagate on a single pipelined
path to the output, then the throughput of the multiplier limits the rate at which
new data can be input. The pipeline operation of such a multiplier requires every
n-bit input word to be padded by at least m blanks. If the m least significant bits
(LSBs) of the product are diverted onto an alternative propagation path so that
each adder cell is used no more than n times per multiplication, then the
multiplication cycle can be made equal to the input wordlength of n bits [69]. In
this case, the full precision product appears from the output of the multiplier on two 
wires. The m LSBs of one product appear on one wire concurrently with the n 
most significant bits (MSBs) of the preceding product on the other wire. This 
follows the convention for multi-precision word formatting in bit-serial design, 
where multi-precision words are arranged in time-staggered form on multiple 
wires [14]. 
To avoid word growth, the full precision product can be truncated back into a 
single precision format by discarding the (m − 1) LSBs of the product. The circuit
diagram of such a truncating multiplier module, implemented in two-phase NORA
logic, is shown in Fig. 5.2. The coefficient values are restricted to the range −1 to +1
so that the range of possible products is less than the range of the input data.
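As a rough illustration of the truncation step, the sketch below treats the m-bit coefficient as a two's-complement fraction with (m − 1) fractional bits, which is an assumed interpretation of the restricted coefficient range, and discards the (m − 1) LSBs of the full precision product with an arithmetic shift. The function name is hypothetical.

```python
def truncating_multiply(x: int, coeff: int, m: int) -> int:
    """Multiply data word x by an m-bit fractional coefficient and
    truncate the product back to single precision.

    Assumption: coeff is a two's-complement integer representing the
    fraction coeff / 2**(m - 1).
    """
    assert -(1 << (m - 1)) <= coeff < (1 << (m - 1))
    full = x * coeff             # full precision product
    return full >> (m - 1)       # drop the (m - 1) LSBs (rounds towards minus infinity)
```

With m = 8, a coefficient of 64 represents 0.5, so `truncating_multiply(100, 64, 8)` returns 50 and the result stays within the wordlength of the input data.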
5.3.3. Adaptor Latency 
For a digital filter to be realisable, the signal flow-graph describing its linear
difference equation must not contain any delay-free loops. Unfortunately,
connecting a port from one adaptor directly onto a port on another adaptor will, in
general, create a delay-free loop [17]. To get round this difficulty, timing delays
can be deliberately introduced between adaptor port connections by inserting unit
elements into the reference RLC filter [67]†. A unit element in the reference
domain is transformed into two half-unit delay elements of length T/2 in the wave
digital domain. The evaluation of the adaptor equations must therefore be
completed within a time period of T/2. In a bit-serial environment, this evaluation
period of T/2 corresponds to $L_{ad}$, the computational latency of the adaptor structure
given in Fig. 5.1. The adaptor latency in bit-times is given by [67]†
$L_{ad}$ = 	m + 10
where m is the filter coefficient wordlength. With a bit-rate of $f_0$, the maximum
sampling rate $f_s$ of a WDF constructed from universal adaptors is given by
$f_s = \frac{f_0}{2 \times L_{ad}}$ Hz.
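The sampling rate relation is simple enough to evaluate directly. In the sketch below the adaptor latency of 22 bit-times is an assumption, chosen because it reproduces the 227 kHz figure quoted for the 10 MHz prototype in § 5.4.2; the function name is illustrative.

```python
def max_sample_rate(bit_rate_hz: float, adaptor_latency_bits: int) -> float:
    """Maximum WDF sampling rate: one sample period T spans two adaptor
    evaluations, each taking adaptor_latency_bits bit-times."""
    return bit_rate_hz / (2 * adaptor_latency_bits)


# Assumed latency of 22 bit-times at a 10 MHz bit-rate.
fs = max_sample_rate(10e6, 22)
cutoff = 0.2 * fs   # cut-off set at 20 % of the sampling frequency
print(f"fs = {fs / 1e3:.1f} kHz, cut-off = {cutoff / 1e3:.2f} kHz")
```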
The maximum system wordlength (SWL) that can be used in a WDF system is
determined by the sampling rate, $f_s$. Word growth may occur during the process of
carrying out arithmetic operations such as addition. To guard against word growth
that may lead to numerical overflow, the signal word is padded with sign extensions
or guard bits. The number of guard bits actually required is determined by the
longest computational path from an input port to an output port in a flow-graph.
For the adaptor flow-graph shown in Fig. 5.1, six guard bits are required. As a
result, the signal dynamic range in a WDF system constructed from adaptors is
limited to a maximum of (SWL − 6) bits.
† Published work by the author.
5.3.4. Universal Adaptor Control 
As discussed previously in § 4.2.2, the control function in a bit-serial environment 
is concerned mainly with two tasks:
• 	providing a hierarchy of control markers to delineate sample 
words in a bit-stream. 
• 	inserting synchronising delays on merging bit-streams so that the 
bits from the different streams are time-aligned at the input of 
operators to possess equal weights. 
A bit-serial operator is configured as a pipeline for high throughput. As a result, a 
bit-stream presented at the input of an operator will incur a latency at the output of 
that operator, measured in terms of a number of bit-times. Unless the latencies of 
all operators are constrained to integer multiples of the system wordlength, the 
control markers must be distributed to the operators with the correct timing in 
relation to the signal words. For this reason, a separate control network that tracks 
the signal network purely in terms of latency is required to distribute the control 
markers. This control network is essentially made up of one or more chains of shift 
registers. Note that the control markers are cyclic in nature and if one of them is
displaced beyond the cycle length, an identical version can be obtained by taking the
previous version modulo the cycle length [14]. This places an upper limit on the
length of the shift registers required in the control network. 
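The modulo property of cyclic markers can be illustrated with a small sketch. It models a marker as a steady-state periodic bit pattern and shows that delaying it by a latency longer than the cycle gives the same stream as delaying it by that latency modulo the cycle length. The function names and the steady-state model are assumptions made for illustration.

```python
from typing import List


def marker_delay(operator_latency_bits: int, cycle_length_bits: int) -> int:
    """Shift-register stages needed to align a cyclic marker with a signal
    that has incurred operator_latency_bits of latency."""
    return operator_latency_bits % cycle_length_bits


def lsb_marker(cycle_length_bits: int, n_bits: int, delay: int = 0) -> List[int]:
    """Steady-state LSB marker stream: 1 on the first bit of every word,
    displaced by `delay` bit-times."""
    return [1 if (t - delay) % cycle_length_bits == 0 else 0 for t in range(n_bits)]
```

With an 8-bit cycle, a marker displaced by 19 bit-times is indistinguishable from one displaced by 3, so no marker chain needs more stages than the cycle length.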
The operation of the universal adaptor is controlled by four markers: LSB, P/S,
L/C and S/R. LSB is at the lowest level of the hierarchy and marks the least
significant bit of a word. At one level above, the other three markers are used for
multiplexing purposes:
• 	P/S selects a parallel or series adaptor connection. 





• 	S/R selects the source of data into the adaptor. This can be a
new sample from an external source or the recycled output from 
the adaptor produced in the preceding T/2 cycle. 
The control markers and their timing in relation to the input signals of the adaptor 
are given in Fig. 5.3. Fig. 5.4 is the signal flow-graph of the adaptor with 
synchronising delays inserted in the signal paths. Although the control network is
not shown explicitly, the required latency of the control markers at the operators
is indicated with the notation 0(t), where 0 refers to the name of the marker and
t the incurred latency in bit times.
5.4. Results 
The universal adaptor structure described in the preceding section has been 
fabricated in a 2.5 µm, dual layer metal p-well CMOS process. The
microphotograph of the chip, VBC076, is given in Fig. 5.5. Although the
granularity of bit-serial computational nodes will allow four independent universal
adaptors to be accommodated on a standard die size of 25 mm², this has not been done
on this prototype chip. VBC076 contains two adaptors that may be operated
independently, with each adaptor provided with its own internal clock generator. 
Half the area on the die is taken up by test structures that can be accessed via probe 
pads. This allows the test structures on unbonded chips to be evaluated using 
probes mounted on a custom-made probe card. The pin-out of the prototype chip 
VBC076 is included in Appendix A. 
5.4.1. Seventh-Order Wave Digital Filter (WDF) System 
As a practical illustration of a WDF network implemented using universal adaptors, 
a stand-alone seventh-order low-pass WDF system has been constructed. The 
reference ladder RLC filter† on which this WDF is based is given in Fig. 5.6a.
Fig. 5.6b shows the WDF network equivalent of the reference ladder filter, which 
may be realised with a cascade of two universal adaptors. The WDF coefficients 
and the component values from which they are derived are tabulated in Table 5.2. 
The component values are chosen to give a Chebyshev response with a passband 
ripple of 0.644 dB, with a cut-off frequency set at 20 % of the sampling frequency. 
† The reference filter topology and component values are obtained from Ref. [63].
Fig. 5.3 Control markers for universal adaptor and their 















Fig. 5.4 Flow-graph of universal adaptor with 




Fig. 5.6a Seventh-order, low-pass reference ladder filter. 











Component   Value      Coefficient   α           α − 1
Rs          1.000 Ω    α11           0.4879371   −0.5120629
C1          2.570 F    α12           0.2579995   −0.7420005
C2          3.777 F    α21           0.2198710   −0.7801290
C3          3.777 F    α22           0.2093644   −0.7906356
C4          2.570 F    α31           0.2093644   −0.7906356
UE1         1.891 Ω    α32           0.2198710   −0.7801290
UE2         1.986 Ω    α41           0.2579995   −0.7420005
UE3         1.891 Ω    α42           0.4879371   −0.5120629
RL          1.000 Ω    -             -           -
Table 5.2 Component and coefficient values for seventh order 




5.4.2. Frequency Response Measurement 
In order to measure the frequency response of the prototype WDF system, analogue
interfaces to the external world are provided at the I/O ports of the WDF network.
The conversion between the analogue and digital domain in this prototype system is
carried out with a resolution of 8 bits. The functional diagram of the prototype
filter system is shown in Fig. 5.7. The control generator is not included in the 
diagram but is essentially a central counter that is used for demarcating the 
boundary of sample words. A photograph of the board containing the prototype 
WDF system, complete with control generator is given in Fig. 5.8. 
The prototype filter system is clocked at a bit-rate of 10 MHz, giving a maximum
sampling frequency of 227 kHz and a cut-off frequency of 45.45 kHz. The
frequency response is measured using a spectrum analyser and the overall frequency 
response plot is given in Fig. 5.9. 
Fig. 5.9 	Spectrum analyser trace of the frequency response of the 
prototype seventh order, low-pass Chebyshev WDF. 
The input frequency is swept from dc to 10 kHz. The cut-off frequency is















[Figure: analogue input, wave filter network, analogue output]
Fig. 5.7 Functional diagram of prototype filter system.
Fig. 5.8 Photograph of prototype wave digital filter board. 
6. Summary 
The design methodology for bit-serial computational systems spanning operator, 
functional processor and system level integration has been applied to the design of a 
universal adaptor, conceived as a basic building brick for arbitrary ladder wave 
digital filters. A seventh-order, low-pass wave digital filter system has been 
constructed to demonstrate the virtues of realising computational systems from bit-
serial cores. The finer grain size of bit-serial operators allows the computational 
cores to have a rich mixture of hardware operators that directly reflects the flow of 
computation. The cores may also be readily re-configured dynamically on-the-fly 
by virtue of their single wire communication paths. As a result, the control 
overhead at the system level is minimal, consisting of a hierarchy of counters for 






In the design of a dedicated processor system, the design task can be broken down 
into two separate components:
• 	the functional design of a data path which defines the physical
and topological structure of the computational core required to 
compute an algorithm. The data path specification defines the 
connectivity and the different basic operations that can be 
carried out by the core. 
• 	the design of a system controller to orchestrate the flow of 
information among the constituent parts of the data path in 
time. 
At the highest level, the behaviour of a system can be completely specified by a 
sequence of events, much like an algorithm. Only when the system behaviour is 
realised in some physical form does the notion of time arise, as a consequence of
the physical laws governing the properties of the implementation medium. In 
general, there are two ways of organising the behaviour of a system in the time 
dimension. As a signal propagates through a system, it will be delayed by some 
value. The precise value of this delay is unknown, and the two forms of timing 
organisation differ in the way they deal with this delay uncertainty. The more 
common method is to use a global clock to define timing relationships. This form 
of timing discipline is known as synchronous design. The clock is used to equalise 
delays throughout the system by holding up all signals, thereby removing the delay 
uncertainty. In other words, the delays are all prolonged to a single worst-case 
value defined by the period of the clock. The period of this clock is thus 
determined by the propagation delay through the slowest possible path in the entire 
system. By considering worst-case delay values, a piece of data is guaranteed to be 
available at any node within the system by the end of a clock period. 
An alternative way to organise timing behaviour is to use self-timing. In this case, 
signals converging at a particular section of the data path are held up until the 
arrival of the slowest signal has been detected. This form of timing discipline is also 
sometimes referred to as asynchronous design. The elements making up the data 
path each indicate completion of operation so that neighbouring elements can make 
use of this information to decide when to initiate an operation. As a result, the 
delay that an element has to wait tends to reflect the average rather than the worst-
case value. 
Conceptually, the behaviour of a computational system may be completely specified 
by an ordering of events without any reference to timing. When a system is 
realised in some physical form, events must necessarily take a finite amount of time 
to happen. Realising a system with self-timing enables the ordering of events to be 
abstracted from their occurrence in time. System behaviour is then determined 
entirely by the way the self-timed parts or elements are interconnected together. 
Only within the elements themselves are sequence and time related, linked together 
by the physical laws governing the implementation. This separation of abstract 
sequence specification from the timing behaviour of the physical realisation
simplifies the behavioural design of a system by removing all timing constraints 
from the design process. This distinctive feature makes self-timing highly attractive 
for the design and construction of large concurrent computing structures. 
In a real-time, sampled-data system, the sampling process imposes a precise timing 
requirement on the signal samples. Such a sampled-data system may be 
accommodated in a self-timed environment by buffering the I/O data streams of the
system by first-in, first-out (FIFO) memories. Provided that the average throughput
of the self-timed processor is higher than the signal sampling frequency, signal
reconstruction may be performed in real-time at the output.
6.2. Scaling Aspects of Self-Timing 
Advances in VLSI technology will lead to a scaling down of device feature size over 
an increasingly larger chip area. From § 1.2.1, the wiring delay over a distance
measured in λ units is increased relative to the transistor switching delay with
scaling, λ being the scalable unit of length measurement representative of the
minimum feature size of a process. For a synchronous system controlled by a global
clock, clock skew would require the clock to be derated accordingly. Therefore 
with scaling, synchronous systems become increasingly inefficient, as the switching 
of transistors constitutes a smaller and smaller fraction of the clock duty cycle. 
With self-timing, there is no global clock and temporal control of the activity of the 
system is distributed over the elements that compose the system. An element 
responds to a local start signal to initiate an operation and generates a completion 
signal to indicate the presence of valid data at the output. As a result, the speed of 
operation of a self-timed element is not constrained by worst-case delay values. For
computations with completion times that vary with data, self-timing can lead to an 
improved average rate of computation. As an example, consider the ripple-carry 
addition of two numbers. The completion time of the addition is usually 
determined by the length of the actual longest carry propagation, particular to any 
two numbers. The worst-case carry propagation occurs when the carry has to 
propagate through the length of the entire adder. For a synchronous design, this 
worst-case carry propagation delay is assumed for every addition. However, worst-
case carry propagation rarely occurs with two randomly chosen numbers. In 
contrast, in a self-timed approach with carry-complete detection, an addition can be 
started as soon as the previous one is completed. As a result, the average rate of
addition is improved.
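The gap between worst-case and typical carry propagation can be estimated with a short simulation. The sketch below measures, for randomly chosen operands, the longest chain of adder stages through which a single carry ripples; a synchronous design must budget for a chain of the full adder width. The chain definition and the function names are assumptions made for illustration.

```python
import random


def longest_carry_chain(a: int, b: int, n: int) -> int:
    """Longest run of stages a carry travels through when adding two
    n-bit numbers: the generating stage plus the stages that propagate it."""
    longest = run = 0
    for i in range(n):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:                 # generate: a new carry starts here
            run = 1
        elif (x ^ y) and run:     # propagate: the live carry ripples on
            run += 1
        else:                     # kill, or nothing to propagate
            run = 0
        longest = max(longest, run)
    return longest


def average_chain(n: int, trials: int, seed: int = 0) -> float:
    """Mean longest carry chain over random n-bit operand pairs."""
    rng = random.Random(seed)
    total = sum(longest_carry_chain(rng.getrandbits(n), rng.getrandbits(n), n)
                for _ in range(trials))
    return total / trials
```

Calling `average_chain(32, 10000)` gives a value far below the worst case of 32 stages, which is the margin that carry-complete detection is able to recover.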
In a self-timed system, the switching activity of the transistors is distributed in
time and is not concentrated at a particular instant. As a result, there is no
sudden surge in the current drawn from the power rails. Instead the current drawn 
is averaged out in time, thereby reducing the power supply noise generated by 
sudden surges of current through the stray resistances and inductances in a system. 
These issues will assume increasing significance as devices are progressively scaled 
down. 
6.3. Delay Insensitive Specification 
Self-timed operation allows the specification of system behaviour purely in terms of 
an abstract sequence or order of events. Correct system behaviour is therefore 
independent of any constraints introduced in the time domain by the 
implementation, such as the delay incurred by signals propagating along lengths of 
wires or the switching delay of gates. The elements that comprise a system that can 
behave in such a manner are said to be speed-independent or delay-insensitive in 
nature. Consider an element forming part of a larger system with some interface 
separating the element from the rest of the system, the element's environment. 
Input signals will propagate through the interface from the environment to the 
element and output signals from the element to the environment. The nature of the 
interface can be assumed to be such that signal events crossing its boundary have 
their order preserved, so that the element's behaviour is specified by the allowed 
ordering of events on signal paths at the point of entry to the interface. This is 
shown in Fig. 6.1a and elements that satisfy this interface requirement are said to 
be gate-delay insensitive. Most of the early work on asynchronous logic
design [55,2] belongs to this category, since in earlier technologies prior to VLSI,
the propagation delay of signals along wires could be assumed to be negligible
relative to the switching delay of gates. 
Consider now a more elastic form of interface surrounding the element, with 
properties similar to that of a "foam rubber wrapper" [56]. Such an interface is 
shown in Fig. 6.1b. It has two surfaces, an outer surface that defines an interface 
with the environment and an inner surface that defines an interface with the 
element. Signals propagating on paths that traverse these two surfaces will incur 
propagation delays that are assumed to be unknown. As a result, the ordering of 
signal transitions on the different paths may be altered by the passage of the signals 
through the interface. If the behaviour of an element is not dependent on the 
arbitrary delays incurred by signals propagating through the interface, then its 
operation is said to be wire-delay insensitive. In this case, the behaviour of the 
element is independent of the ordering of signal transitions at its I/O terminals.
6.3.1. Isochronic Regions 
In the foregoing discussion on self-timed elements, the wire-delay insensitive 
specification is applied to a set of signals crossing an element-environment interface.
Note that it may not be possible to extend the property of wire-delay insensitivity
into the innards of an element itself, as there is no known way of constructing
wire-delay insensitive circuits from wires and transistors alone. This can be seen by
considering certain basic components such as flip-flops and C-elements, in which
the propagation delay of signals on the feedback paths is always taken to approach
zero. To circumvent this difficulty, an approximation is used so that over a 
sufficiently small area, the delay incurred on any length of wire within this region is 
negligible relative to the switching delay of a transistor. Such a region is called an
equipotential or isochronic region [71]. Within such a region, signal events are
assumed to be sufficiently well separated in time, relative to the
wire delays, that their ordering will be preserved at any point within the region.
An element must therefore be contained entirely within at least one isochronic
region, though it may reside in more than one region.
Fig. 6.1a Gate-delay insensitive specification in which A ≺ B ⇒ a ≺ b, c ≺ d ⇒ C ≺ D.
Fig. 6.1b Wire-delay insensitive specification in which A ≺ B ⇏ a ≺ b, c ≺ d ⇏ C ≺ D.
The notation x ≺ y is used to denote the ordering of signal events x and y, i.e. whether x pre-
6.3.2. Wire-Delay Characteristics 
The timing behaviour at the I/O terminals of a self-timed element is dependent on:
• 	the evaluation time taken to compute a given function, arising
from the switching delay of transistors.
• 	the delay contributed by output capacitance to the evaluation
time.
• 	the delay associated with equalising potential across
interconnecting wires





R and C are the resistance and capacitance per unit length of wire, and V the
voltage along the wire as a function of time t and distance x. The distance x is
measured from the point where the voltage V is applied. Solution of the diffusion
equation gives the time required for a step to propagate a distance x as proportional
to x²RC. In the process, the voltage step is also progressively degraded or "smeared
out" as a monotonically† increasing function of time.
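The diffusion equation itself is not legible in this copy. Assuming the standard model of a distributed RC line, with R and C the per-unit-length quantities defined above, it would take the form sketched below; this is a reconstruction under that assumption rather than the author's exact expression.

```latex
RC \, \frac{\partial V}{\partial t} = \frac{\partial^{2} V}{\partial x^{2}}
```

Replacing x by kx and t by k²t leaves this equation unchanged, which is consistent with a propagation time proportional to x²RC: doubling the wire length quadruples the time needed for the step to arrive.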
† A function f(x) is said to be an increasing monotonic function in x if





6.4. Reset Signaling Protocol 
In self-timed communication, the most basic form of signaling that can be used to 
mark the occurrence of an event is a transition. A pulse of some fixed interval is 
unsatisfactory, since the interval would need to be arbitrarily long to accommodate 
the unknown delay between the occurrence of events. A signal transition can occur 
in one of two directions, a low-to-high (LH) or a high-to-low (HL) transition. At
least two transitions are required for every cycle of self-timed communication: a
transition on a request (Req) wire to initiate an operation and a transition on an 
acknowledge (Ack) wire to indicate the completion of the operation. In principle, 
transitions in both the LH and HL directions can be used for signaling. Such a 
scheme is referred to as nonreturn-to-zero signaling [71]. 
Logic devices however, tend to be sensitive to signal levels rather than edges. For 
this reason, detecting the occurrence of uni-directional transitions
(either LH or HL) may be carried out more readily than the detection of 
transitions that may occur in either direction. A signaling scheme referred to as 
reset or return-to-zero signaling uses a uni-directional transition to mark an event. 
This is accomplished by dividing the communication cycle into two phases, an 
active phase in which the Req and Ack signals are activated and a reset phase in 
which these two signals are returned to a known initial state, in preparation for the
next active transition. The sequence of events for reset signaling is given in 
Fig. 6.2. 
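The two-phase structure of reset signaling can be written out as an event sequence. The sketch below assumes the usual ordering for this kind of protocol, in which Req and Ack rise in the active phase and then fall in the same order during the reset phase; the exact ordering shown in Fig. 6.2 may differ, and the function name is illustrative.

```python
from typing import Iterator, Tuple


def reset_signaling_cycle() -> Iterator[Tuple[str, int, int]]:
    """Yield (event, req, ack) for one communication cycle.

    Assumed ordering: Req LH, Ack LH (active phase), then Req HL,
    Ack HL (reset phase)."""
    req = ack = 0
    req = 1
    yield ("active: Req LH, operation initiated", req, ack)
    ack = 1
    yield ("active: Ack LH, operation completed", req, ack)
    req = 0
    yield ("reset: Req HL", req, ack)
    ack = 0
    yield ("reset: Ack HL, ready for next cycle", req, ack)


for event, req, ack in reset_signaling_cycle():
    print(f"Req={req} Ack={ack}  {event}")
```

Each step changes exactly one wire, and only the LH transitions carry information; the HL transitions merely restore the known initial state.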
6.4.1. Data Encoding 
Within an isochronic region, the ordering of signaling events propagating on wires 
within that region remains unaltered. As a result, the signal ordering established at 
the output of an element sending data will apply at the inputs of other elements 
residing within the same isochronic region. As an example, the sender may 
indicate the presence of new data items on its data lines by activating the Req wire 
after a data value has been defined. For communication between isochronic regions 
however, the above form of signaling cannot be used since the ordering of events 
established by the sender may not be preserved at the receiver. For wire-delay 
insensitive communication, it is necessary to encode the data validity information as 









[Timing diagram: active phase, reset phase]
Fig. 6.2 Reset or return-to-zero signaling protocol.
There are various ways of encoding the data to embed the data validity information 
in a reset signaling scheme [2]. The general requirement is to encode the data 
values to form two disjoint sets of codes. The members of one set, denoted by
$D_i \in D$, correspond to valid data states whilst members of the other set, with
notation $S_i \in S$, are all assigned to be reset or spacer states. In general, a self-timed
element performs a logical mapping f of input states {$S_i$, $D_i$} into output states
{$S_o$, $D_o$} such that
$f(S_i) = S_o, \quad S_i, S_o \in S$
$f(D_i) = D_o, \quad D_i, D_o \in D$
The state information is encoded onto several variables. In changing between
states, transitional states must inevitably be traversed if the two states differ in more
than one variable. The requirement for any valid coding scheme is to ensure that
in making a state transition from some $S_i \to D_i$, the intermediate states $S_k$ passed
through must all belong to the set S, that is $S_k \in S$. Data validity can therefore be




6.4.2. Double-Rail Coding 
The reset state may be encoded directly into the data if each variable is assigned 
three states: 0, 1 and - (reset). Transitions between 0 and - or 1 and - are
allowed but not transitions directly between 0 and 1. If n ternary variables are
used, there are $2^n$ data states and ($3^n - 2^n$) spacer states. Alternatively, each
ternary variable may be realised with two binary variables using a "double-rail"
coding scheme [71], with {00} representing a -, {10} a logic zero and {01} a logic
one. If n binary variables are used, the total number of data states possible is $2^{n/2}$.
Note that in double-rail coding, the {11} state for a pair of binary variables is
redundant. As a result, double-rail coding does not provide the most efficient form
of coding in terms of the maximum number of data states that can be encoded onto
n binary variables. For any valid code set, a transition from $S_i \to D_i$ must not
traverse any intermediate states that are members of the data set D. This condition
can still be met if all the data states are reset towards the all zero state as before in
double-rail coding, but each $D_i$ is now encoded using p out of the n available binary
variables. Combinations with fewer than p ones are assigned to the set S
and a code is a valid data item $D_i \in D$ if there are p ones in the code
combination. Note that codes with more than p ones do not occur
since in any $D_i \to S_i$ transition, the code resets towards the all zero state.
In an n-bit code, the total number of combinations with p number of ones is given 
by 
$\binom{n}{p} = \frac{n (n-1) \cdots (n-p+1)}{p!} = \frac{n (n-1) \cdots (n-p+1) (n-p)!}{p! \, (n-p)!} = \frac{n!}{p! \, (n-p)!}$    (6.2)
The number of possible states that can be encoded using p out of n binary variables 
is given by Eqn. 6.2 and is maximum when p = n/2 [89]. 
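The difference in coding efficiency can be tabulated with a few lines of code. The sketch compares the number of data states available from n binary variables under double-rail coding, where each pair of wires carries one bit, with the number available from a p-out-of-n code as given by Eqn. 6.2. The function names are illustrative.

```python
from math import comb


def ternary_spacer_states(n: int) -> int:
    """Spacer states for n ternary variables: all states minus data states."""
    return 3 ** n - 2 ** n


def double_rail_data_states(n: int) -> int:
    """Data states on n binary variables (n even) used as n/2 double-rail pairs."""
    return 2 ** (n // 2)


def p_of_n_data_states(n: int, p: int) -> int:
    """Data states of a p-out-of-n code (Eqn. 6.2)."""
    return comb(n, p)


def best_p(n: int) -> int:
    """Value of p that maximises the number of p-out-of-n data states."""
    return max(range(n + 1), key=lambda p: comb(n, p))
```

On eight wires, double-rail coding provides 16 data states whereas a 4-out-of-8 code provides 70, at the cost of a less regular mapping between codes and data values.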
6.5. Data Flow Model of Self-Timed Computation 
A data flow system can be viewed as a directed graph in which each node 
represents an operator and each directed arc a defined path over which data items 
or tokens flow [13]. The operators are enabled or "fired" according to a simple 
data availability rule. When tokens are present on each input arc of an operator
and all its output arcs are empty, the operator fires and applies its associated 
function to the values carried by the input tokens. Tokens carrying the values of 
the result are then placed on the output arcs and the consumed tokens are removed 
from the input arcs. An example of a data flow operator is shown in Fig. 6.3. 
Fig. 6.3 Firing rule of a data flow operator. 
In a data flow model of computation, there is no explicit flow of control among the 
operators. The order in which the operations are performed is governed solely by 
data dependencies among the operators. This is in contrast to other approaches to 
self-timing where self-timed control modules [12,28] are used to synchronise the 
operations of the data path modules in a system. With a data flow model, the 
scheduling and synchronisation of operations are built into the operators and self-
timed operation is a natural consequence of the data flow concept. 
If self-timed elements are to be modelled as data flow operators, then a restriction 
must be placed on the number of tokens that may reside on the arcs in a data flow 
network. Since an arc would correspond to a wire in a data flow model of a self-
timed element, no more than one token may be placed on any arc at a given time. 
This is to ensure that a data token on an arc is consumed before the arrival of a 
new one. 
6.5.1. Implications of Data Flow Operation 
From the preceding description of data flow operation, an operator must be able to 
determine whether there are tokens on any of its arcs. In any implementation of a 
data flow operator, this implies that the following two requirements are met [2]: 
• 	The operator is able to distinguish between valid data states
D_i ∈ D and transient states S_i ∈ S passed through between state
transitions. The valid data states correspond to the presence of
tokens while the transitional states correspond to the removal of
tokens.
• 	The operator is free of delay hazards so that signals at its output
arcs are glitch-free.
A realisation of a logic function is said to contain a delay hazard if the state of one
or more of its outputs is dependent on the relative delay of signals propagating
through its logic. If such delay hazards are present, an operator f may produce an
output sequence f(S_i → D_i) = S_o → D_k → S_k → D_o for some input state transition
S_i → D_i, where S_i, S_o, S_k ∈ S and D_i, D_o, D_k ∈ D. The glitches produced in passing
through D_k and S_k will cause critical races in the network of operators.
The first requirement for detecting tokens may be met by encoding data values so 
that valid data or tokens are encoded onto one code set D whilst all transitional 
states are encoded into a separate, disjoint spacer set S. A coding scheme 
convenient for implementation is the double-rail code described in § 6.4.2. With 
double-rail coding of data values, a data flow operator is able to map input tokens 
into output tokens and input spacers into output spacers. 
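A minimal sketch of this mapping is given below for a two-input AND operator. It assumes, for illustration only, that a spacer is the pair (0, 0), that logic 1 is coded as (1, 0) and logic 0 as (0, 1). The helper names are hypothetical.

```python
from typing import Tuple

Pair = Tuple[int, int]

SPACER: Pair = (0, 0)          # assumed spacer code: both rails low


def encode(bit: int) -> Pair:
    """Assumed double-rail code: (1, 0) carries logic 1, (0, 1) carries logic 0."""
    return (1, 0) if bit else (0, 1)


def is_token(pair: Pair) -> bool:
    return pair in ((1, 0), (0, 1))


def decode(pair: Pair) -> int:
    if not is_token(pair):
        raise ValueError(f"{pair} is not a valid data code")
    return pair[0]


def dual_rail_and(a: Pair, b: Pair) -> Pair:
    """Map input tokens to an output token and input spacers to an output spacer."""
    if not (is_token(a) and is_token(b)):
        return SPACER          # wait until every input carries a token
    return encode(decode(a) & decode(b))
```

The output therefore indicates on its own whether a result is present, since a token and a spacer can never be confused.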
Data flow operators must also be free of delay hazards, which may be ensured if the
operation of the data flow operators is based on precharging [4]. During
precharge, the output of a gate is forced low (or high). When the gate is
evaluating, the output can only make a single LH (or HL) transition. As a result,
there are no delay hazards at the output nodes of precharged gates.
6.5.2. Delay Model of Data Flow Operators 
In a data flow model of self-timing, there is no explicit notion of time at which 
operators fire and perform logic functions on input token values. The scheduling 
and synchronisation of logic activities are built into the hardware of the operators 
themselves. Note that the internal activity of an operator need not be self-timed. 
The only requirement is for an operator to obey the firing rule at its input and 
output interfaces and to process data internally at a rate no slower than that set at 
the interfaces. This feature may sometimes be used to reduce the hardware
overhead associated with self-timed operation. 
A delay model of a data flow operator is given in Fig. 6.4. The evaluation time at 
the output of an operator is assumed to be bounded. For the dynamic operation of 
MOS operators based on precharging, charge leakage will determine the upper 
bound on the evaluation time. The input delay, on the other hand, is assumed to 
be unbounded. This input delay is associated with equalising potential along wires 
that constitute the arcs of an operator. 
6.6. Data Flow Logic Realisation 
A novel method of realising data flow operators in CMOS technology [37]† will
now be presented. This technique is based on differential cascode voltage switch 
logic [26], employing a pair of complementary trees to compute complex boolean 
functions. The complementary trees are driven by input signals and their inverses. 
The general form of the logic configuration for this technique is shown in Fig. 6.5. 
The circuit works in two distinct phases, a precharge phase and an evaluation phase,
defined by the logic level of a local "clock" signal φ_i. When φ_i goes high, the
evaluation phase starts and one tree evaluates to its true logical value q_i, while the
other tree evaluates to its logical complement q_i'. During the evaluation phase, the
signal pair {q q'} at the output represents a token for variable Q in double-rail
coding. When φ_i goes low, nodes q_i and q_i' are precharged high with signal pair {q
q'} forced low. Therefore in precharging, the signal pair {q q'} reverts to a spacer
(-) for Q in double-rail coding.
Fig. 6.4 Delay model of a data flow operator.
The operation of the circuit, like that of domino logic [34], is hazard-free. During
the evaluation phase, either q_i or q_i' will make a single HL transition and will stay in
that state until the next precharge phase. Note that since the circuit operates on a
precharge-evaluate cycle, spacers and tokens occur alternately. Effectively, this
corresponds to the reset and active phases in a reset signaling protocol. However,
unlike clocked domino logic, the precharge and evaluation phases do not occur at
fixed time intervals. The phases respond instead to the timing information provided
by the locally derived clock signal φ_i.
The result of the evaluation is held in a form of static latch. Static storage is 
necessary since the wire delays on the arcs of operators are assumed to be
unbounded. However, the latch is only conditionally static. During the precharge 
phase, the cross-coupled paths in the latch are broken to allow the latch to admit a 





Fig. 6.5 	General logic configuration for a data flow operator. 
token {q q'}. The load signal ld now activates to close the cross-coupled paths in 
the latch. Static storage can thus be achieved without logic conflicts. Note that the 
logic configuration given in Fig. 6.5 functions correctly irrespective of signal edge 
times, provided that signal transitions are monotonic in nature. This is necessary to 
accommodate the wire-delay characteristics assumed for data flow operators, as 
described in § 6.3.2. 
Interconnection of Operators 
Given functional operators, some means must be provided for interconnecting the
operators to form a data flow network. For simplicity, consider a linearly
connected array of operators as shown in Fig. 6.6. The presence of a token on the
output arc of operator e_i causes e_{i+1} to fire. While e_{i+1} is evaluating, the token on
e_i is held steady until it has been absorbed by e_{i+1}. There are now tokens on the
output arcs of e_{i+1} and e_i. The consumed token on e_i may now be removed by
resetting e_i to a spacer state. Only when the resetting of e_i is completed can the
token on e_{i+1} be allowed to propagate by the enabling of operator e_{i+2}.
Fig. 6.6 Linear array of data flow operators. 
The above sequence of steps is essential to ensure the race-free operation of the
network. If e_{i+2} is allowed to fire before the token on e_i is removed, a race
condition potentially exists whereby e_{i+1} is reset before e_i. The same token on e_i
will then be mistaken for the next token and re-used. Tokens are therefore not
allowed to reside in more than two adjacent operators at any one time. To ensure
race-free operation, operator e_i has to satisfy the following constraints:
• 	e_i is reset to a spacer whenever e_{i+1} contains a token.
• 	e_i cannot fire unless
e_{i+1} contains a spacer and
a token-spacer pair resides in e_{i-1} and e_{i-2}.
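The two constraints may be expressed as predicates over the state of a linear array, as in the sketch below. It is an event-level model only: a spacer is represented by `None`, positions outside the array are assumed to hold spacers, and the function names are hypothetical.

```python
from typing import List, Optional

Cell = Optional[int]   # None is a spacer, an int is a token value


def holds_spacer(cells: List[Cell], i: int) -> bool:
    """Positions outside the array are treated as spacers (an assumption)."""
    return not (0 <= i < len(cells)) or cells[i] is None


def can_fire(cells: List[Cell], i: int) -> bool:
    """e_i may fire: e_{i+1} holds a spacer, e_{i-1} a token and e_{i-2} a spacer."""
    return (i >= 1 and cells[i] is None and cells[i - 1] is not None
            and holds_spacer(cells, i + 1) and holds_spacer(cells, i - 2))


def must_reset(cells: List[Cell], i: int) -> bool:
    """e_i is reset to a spacer whenever e_{i+1} contains a token."""
    return i + 1 < len(cells) and cells[i] is not None and cells[i + 1] is not None


def step(cells: List[Cell]) -> bool:
    """Apply one enabled event (resets first); return False when none is enabled."""
    for i in range(len(cells)):
        if must_reset(cells, i):
            cells[i] = None
            return True
    for i in range(1, len(cells)):
        if can_fire(cells, i):
            cells[i] = cells[i - 1]
            return True
    return False


def run(cells: List[Cell]) -> int:
    """Step until no event is enabled; return the number of events applied."""
    events = 0
    while step(cells):
        events += 1
    return events
```

Starting from a single token in the first operator of a four-operator array, the token is handed from stage to stage by alternating fire and reset events, and at no point does it occupy more than two adjacent operators.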
The above constraints can be imposed by interconnection logic that accepts request
(Req) and acknowledge (Ack) signals from communicating operators. It may be
viewed as the logic required to generate the local clock signal φ_i for each operator.
A possible configuration for the local clock generator is given in Fig. 6.7. When φ_i
goes high, the operator fires and places a token on each of its output arcs. When




operator is reset, thereby removing the tokens. The local clock generator ensures 










Fig. 6.7 	Local clock generator providing locally derived timing
information. 
The structure of a data flow operator may therefore be viewed as consisting of two 
modules, a combinatorial logic module and a local clock generator module. The 
combinatorial logic module performs the logical mapping of input token values and 
takes the form of precharged, differential cascode voltage switch logic. Design 
procedures that have been developed for synchronous cascode voltage switch 
logic [10] may therefore be applied to the design of data flow operators. As a 
result, the data flow operators as proposed here [39]† are no more difficult to
design than conventional synchronous logic. The local clock generator module 
effectively implements the reset signaling protocol required for asynchronous 
communication between operators. As such, it is common to all operators 
regardless of their logical function. 
† Published work by the author.
For an operator with multiple input arcs, the local clock generator module must 
detect the arrival of the slowest input token. Similarly, if an operator has a fanout 
of more than one, the operator is reset only after the slowest of the driven 
destinations has responded. To detect the slowest of a set of changing signals, C-elements
[71] may be incorporated into the local clock generator module. A C-element
is a storage device that responds to the last of a set of signals changing in
the same direction. A 2-input C-element is shown in Fig. 6.8. Its output is low
when both inputs are low and becomes high only when both inputs are high. 
Otherwise, its output remains in its previous state. A tree of these devices can be 
used to build a multiple-input C-element. 
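The behaviour of the C-element can be captured in a few lines. The sketch below is a behavioural model only, with hypothetical class names. For brevity, the multiple-input version is built as a linear chain of 2-input elements, which is a degenerate tree.

```python
from typing import List, Sequence


class CElement:
    """2-input C-element: output follows the inputs when they agree, else holds."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, a: int, b: int) -> int:
        if a == b:
            self.out = a
        return self.out


class CChain:
    """Multiple-input C-element built from cascaded 2-input C-elements."""

    def __init__(self, n_inputs: int) -> None:
        if n_inputs < 2:
            raise ValueError("at least two inputs are required")
        self.n_inputs = n_inputs
        self.stages: List[CElement] = [CElement() for _ in range(n_inputs - 1)]

    def update(self, inputs: Sequence[int]) -> int:
        if len(inputs) != self.n_inputs:
            raise ValueError("wrong number of inputs")
        acc = inputs[0]
        for stage, x in zip(self.stages, inputs[1:]):
            acc = stage.update(acc, x)
        return acc
```

With three inputs rising one after another, the output goes high only when the last input has risen, and returns low only when the last input has fallen.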
Fig. 6.8 A 2-input C-element. 
6.7. Summary 
Conceptually, the behaviour of a system may be completely specified by an ordering 
of events without any reference to the time domain. The self-timed approach to 
computation allows the sequence specification of a system to be abstracted from the 
timing behaviour of the physical realisation, thereby ensuring correct system 
behaviour independent of all timing constraints. In addition, for computations with 
data-dependent completion times, self-timing can also improve the average
throughput by not being constrained by worst-case delay values. A data flow
approach to self-timed computation has been described, together with generic hardware
structures that allow data flow logic to be realised efficiently. The
practicality of this approach as an alternative to synchronous logic in VLSI will be 





A Wavefront Array Multiplier 
A Case Study on Self-Timed Computation 
7.1. Introduction 
In the preceding chapter, a data flow approach to logic is proposed as a means of 
realising coarse grain computational operators that communicate asynchronously. 
The structure of such a data flow operator may be viewed as consisting of two 
modules, a combinatorial logic module and a local clock generator module. The 
combinatorial logic module performs the logical mapping of input token values. 
The local clock generator module accepts request (Req) and acknowledge (Ack)
signals from communicating operators and implements the reset signaling protocol 
required for self-timed communication. As such, it is common to all operators 
regardless of their logical function. 
With the data flow approach, self-timed elements are no more difficult to design 
than their synchronous counterparts. This is an important advantage that 
considerably reduces the design complexity of communicating, self-timed logic 
elements. To demonstrate the viability of this approach for realising complex self-timed
elements, the design of a two's complement, wavefront array multiplier [42]†
will be presented as a case study in this chapter. This multiplier array is composed
entirely from the replication of only three different cells with local
interconnections: (i) a full adder cell, (ii) a register cell and (iii) a local clock
generator cell. As such, it is eminently suited to a VLSI implementation.
Before describing the multiplier architecture in detail, the nature of the self-timed 
communication imposed by the local clock generator module given in Fig. 6.7 is 
examined by measurements on the operational performance of a First-In, First-Out 
(FIFO) memory. 




7.2. FIFO Elements 
A FIFO allows the concurrent reading and writing of data and automatically tracks 
the order in which the data items are entered. It is often used as a buffer between 
processors operated at different rates to even out the rate of data transfer. To 
demonstrate the self-timed communication imposed by the local clock generator 
module given in Fig. 6.7, an 8-stage bit-wide First-In, First-Out (FIFO) memory 
has been designed and fabricated in a 2 μm, dual layer metal n-well CMOS
process. The microphotograph of this FIFO is shown in Fig. 7.1. 
Fig. 7.1 Microphotograph of 8-stage bit-wide FIFO. 
The FIFO is implemented as a network of single-bit data flow register cells as
shown in Fig. 7.2a. The logic for the individual register cells is given in Fig. 7.2b. 
In order to be able to insert arbitrary delay values on the set of signals 
communicating between neighbouring register cells, a set of these signals are 







Fig. 7.2a FIFO realised as a network of data flow register cells.









7.2.1. Mode of Operation 
The FIFO is first initialised so that all the ackout signals are reset to 0, indicating 
that no token is stored. When a data token is available at the input to the FIFO, 
the write control is enabled and the token is copied into the first cell. ackout1 then
goes high to indicate the completion of this operation. When the input token has
been reset or removed in response to ackout1, the second cell fires, copying the
token from the first cell. When the token has been successfully copied, ackout2 is
activated and resets ackout1. The token in the first cell has now been discarded.
This process is repeated until the data token appears at the output stage of the 
FIFO. The oscilloscope trace for the above sequence of operations is given in 
Fig. 7.3. From this trace, the propagation delay through the length of the entire 
FIFO is measured at 180 ns, giving a delay of 22.5 ns per stage. Note that in this 
prototype FIFO, token throughput is unnecessarily degraded by having 
communicating signals going off-chip for test purposes (refer to Fig. 7.2a). Input 
and output pad driver delay can make a significant contribution to the 
communication delay between neighbouring cells; the output pads that were used in 
this design have a measured propagation delay of about 20 ns. 
Note that a transfer can be initiated as soon as a cell becomes available, without 
having to wait for a token to propagate all the way to the output stage of the FIFO. 
Occasionally the FIFO fills up and no further tokens may be queued. This state of 
the FIFO is shown in Fig. 7.4a with tokens and spacers residing in alternate stages. 
When a token is de-queued from the FIFO, the read line is activated and resets 
ackout7. The output stage now contains a spacer (Fig. 7.4b). This triggers a chain 
of events whereby the two consecutive spacers are seen to "bubble" towards the first
stage of the FIFO, as tokens are moved one stage to the right. This process can
continue until the FIFO is emptied of its contents. In general, the feeding and
removal of tokens to and from the FIFO are concurrent activities and the data
throughput of the FIFO is the average of these queuing and de-queuing rates. This
is displayed in the oscilloscope trace given in Fig. 7.5.
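The queuing behaviour described above may be reproduced with a small event-level model. The sketch below is not a timing model of the fabricated circuit: the class `SelfTimedFifo` and its methods are hypothetical, a spacer is represented by `None`, and the environment is assumed to remove the input token as soon as it has been copied into the first cell.

```python
from typing import List, Optional


class SelfTimedFifo:
    """Event-level model of a linear FIFO of data flow register cells."""

    def __init__(self, stages: int = 8) -> None:
        self.cells: List[Optional[int]] = [None] * stages

    def settle(self) -> None:
        """Apply reset and fire events until the FIFO reaches equilibrium."""
        n = len(self.cells)
        changed = True
        while changed:
            changed = False
            # A cell is reset once its successor holds a copy of the token.
            for i in range(n - 1):
                if self.cells[i] is not None and self.cells[i + 1] is not None:
                    self.cells[i] = None
                    changed = True
            # A cell fires when its predecessor holds a token and the cells
            # on either side of that pair hold spacers.
            for i in range(1, n):
                if (self.cells[i] is None and self.cells[i - 1] is not None
                        and (i + 1 == n or self.cells[i + 1] is None)
                        and (i < 2 or self.cells[i - 2] is None)):
                    self.cells[i] = self.cells[i - 1]
                    changed = True

    def write(self, value: int) -> bool:
        """Queue a token; returns False when the FIFO is full."""
        if self.cells[0] is not None or self.cells[1] is not None:
            return False
        self.cells[0] = value
        self.settle()
        return True

    def read(self) -> Optional[int]:
        """De-queue the token at the output stage, or None if there is none."""
        value = self.cells[-1]
        if value is not None:
            self.cells[-1] = None
            self.settle()
        return value
```

In this model an 8-stage FIFO accepts four tokens before it is full, with tokens and spacers in alternate stages. Each read then lets the remaining tokens advance towards the output while the spacers move towards the input.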
Fig. 7.3 Oscilloscope trace for a token propagating the length of
the entire FIFO.
The top trace is the write control cycled at a period of 400 ns. The middle and
bottom traces show Q7 and reqout respectively from the FIFO output.
spacer - data - spacer - data - spacer - data
Fig. 7.4a FIFO filled to maximum capacity when no further tokens 




spacer - data - spacer - data - spacer - spacer
Fig. 7.4b "Bubbling" effect of two consecutive spacers within the 
FIFO. 
Fig. 7.5 	Oscilloscope trace displaying the concurrent reading and 
writing of token values. 
The top trace is the write control cycled at a period of 110 ns. The middle and 
bottom traces show Q7 and reqout respectively from the FIFO output. 
7.2.2. Delay Insensitive Communication
To demonstrate the delay insensitive operation of the FIFO, external delay lines are
inserted into the communication paths of the last three stages as shown in Fig. 7.6.
The delay lines each have a delay value of 250 ns and are divided into two groups,
D_A and D_B. The delay incurred by a token in propagating from the first to the last
stage of the FIFO is considered for four different cases, with D_A and D_B set to
different values for each case.























Case I
Fig. 7.7a shows the oscilloscope trace taken when the FIFO is operating with D_A
and D_B both set to 0. The propagation delay through the length of the FIFO is
measured at 180 ns.
Fig. 7.7a Oscilloscope trace for a token propagating the length of
the entire FIFO with D_A = 0, D_B = 0.
The top trace is the write control. The middle and bottom traces show Q7 and
reqout respectively from the FIFO output. The propagation delay is measured at
180 ns.
Case II 
Fig. 7.7b shows the oscilloscope trace taken when the FIFO is operating with D_A
and D_B set to 0 and 250 ns respectively. The propagation delay has increased by a
margin of 250 ns over the previous value to 430 ns, in response to the delay
introduced by D_B.
Fig. 7.7b Oscilloscope trace for a token propagating the length of
the entire FIFO with D_A = 0, D_B = 250 ns.
The top trace is the write control. The middle and bottom traces show Q7 and
reqout respectively from the FIFO output. The propagation delay is measured at
430 ns.
Case III 
Fig. 7.7c shows the oscilloscope trace taken when the FIFO is operating with D_A
and D_B set to 250 ns and 0 respectively. The propagation delay is measured at 450
ns, in response to the delay introduced by D_A. Note that communication on the
signal paths skewed by the D_A delay lines is gate-delay but not wire-delay
insensitive†. For this reason, the ordering of signal events on these paths is
preserved by tracking the absolute delay values incurred by signals propagating on
these paths.
† This subtle difference between the two forms of delay insensitivity is discussed in greater de-




Fig. 7.7c Oscilloscope trace for a token propagating the length of
the entire FIFO with D_A = 250 ns, D_B = 0.
The top trace is the write control. The middle and bottom traces show Q7 and
reqout respectively from the FIFO output. The propagation delay is measured at
450 ns.
Case IV 
Fig. 7.7d shows the oscilloscope trace taken when the FIFO is operating with D_A
and D_B both set to 250 ns. The propagation delay has increased to 700 ns. The
overall delay margin is equivalent to the sum of delay margins measured in the




Fig. 7.7d Oscilloscope trace for a token propagating the length of
the entire FIFO with D_A = 250 ns, D_B = 250 ns.
The top trace is the write control. The middle and bottom traces show Q7 and
reqout respectively from the FIFO output. The propagation delay is measured at
700 ns.
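The additive behaviour of the delay margins can be checked directly from the four measured values quoted above. The short sketch below only restates that arithmetic. The variable names are hypothetical.

```python
# Measured propagation delays (ns) for Cases I to IV, as quoted in the text.
measured_ns = {"I": 180, "II": 430, "III": 450, "IV": 700}

# Margin attributable to each group of delay lines, relative to Case I.
margin_b = measured_ns["II"] - measured_ns["I"]    # D_B only
margin_a = measured_ns["III"] - measured_ns["I"]   # D_A only

# If the margins simply add, Case IV should equal Case I plus both margins.
predicted_iv = measured_ns["I"] + margin_a + margin_b

if __name__ == "__main__":
    print(f"margin due to D_A: {margin_a} ns, margin due to D_B: {margin_b} ns")
    print(f"predicted Case IV: {predicted_iv} ns, measured: {measured_ns['IV']} ns")
```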
7.3. Binary Multiplication 
Multiplication is one of the main arithmetic operations that underlie most signal 
and image processing algorithms. A large number of algorithms in this field are
characterised by the predominance of multiply/add or inner product 
operations [59]. In general, multiplication requires the formation of a number of 
partial products (PP) and the accumulation of these partial product terms. The 
number of PP are progressively reduced by carry-save adder stages until two partial 
product sum (PPS) terms are left, from which the final product may be formed. 
The final summation requires the carry to be propagated in a carry-propagate adder 
such as a ripple adder or a carry look-ahead adder. 
In implementing a computation in hardware, the need to allow for carry 
propagation forms a severe bottleneck to throughput. This delay may be reduced in 
one of two ways:
• 	The length of carry propagation may be limited by performing
the addition in a redundant number representation, such as the
sign-digit [80] or residue number representation [9]. Addition
can then be performed in a constant time independent of n, the 
word-length of the operands. 
• 	Use of carry-completion detection logic [22] that allows for the 
actual carry propagation length in a particular n-bit addition. 
From statistical analysis of the random addition of two n-bit 
numbers, the actual longest carry length is, on average, found to 
be significantly less than the worst case value of n bits. 
7.3.1. Redundant Binary Number Addition 
As an example of redundant number addition, consider a radix-2 sign-digit (SD)
representation of an integer with digit set {1̄, 0, 1}, where 1̄ denotes −1. An n-digit
redundant integer Y = [y_{n-1} ... y_0]_SD2 has value
\[
Y = \sum_{i=0}^{n-1} y_i \times 2^i \qquad (y_i \in \{\bar{1}, 0, 1\})
\]
Due to possible redundancy in the number representation, the addition of two SD
integers may be constrained so that carry propagation is limited to just one higher
digit position [80]. When two SD integers X = [x_{n-1} ... x_0]_SD2 and Y = [y_{n-1} ... y_0]_SD2 are
added together to produce Z = [z_n ... z_0]_SD2, the intermediate sum s_i and carry c_i at
each digit position are determined from the relation x_i + y_i = 2c_i + s_i. Note that
redundancy in the SD representation allows "1" to be represented as either [0 1]_SD2
or [1 1̄]_SD2. "−1" may correspondingly be represented as either [0 1̄]_SD2 or [1̄ 1]_SD2.
This feature is exploited to enable an intermediate sum digit s_i and the carry digit
c_{i-1} from the next lower order digit to be added together to form the result digit z_i,




The key to breaking the carry chain is to prevent the formation of s_i and c_{i-1} both
with value 1 or both with value 1̄. If there is a possibility of a 1-carry from c_{i-1},
[c_i s_i] is chosen to be [1 1̄]. Otherwise, [c_i s_i] is taken to be [0 1]. A corresponding
set of rules applies when there is a possibility of a 1̄-carry from c_{i-1}. The digit
selection rules for carry propagation free addition are given in Table 7.1.
Operand digits     Next lower order digit position    Intermediate carry    Intermediate sum
x_i  y_i           x_{i-1}  y_{i-1}                   c_i                   s_i
1 1                xxx                                1                     0
1 0 or 0 1         both are non-negative              1                     1̄
1 0 or 0 1         other combinations                 0                     1
0 0, 1 1̄ or 1̄ 1    xxx                                0                     0
0 1̄ or 1̄ 0         both are non-negative              0                     1̄
0 1̄ or 1̄ 0         other combinations                 1̄                     1
1̄ 1̄                xxx                                1̄                     0
Table 7.1 Digit selection rules for carry propagation free addition
(after Ref [80]).
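The digit selection rules of Table 7.1 can be exercised with a short sketch. Digits are held as Python integers from {−1, 0, 1}, least significant digit first. The digit position below the least significant one is assumed to hold non-negative digits and to produce no carry. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def intermediate(xi: int, yi: int, x_prev: int, y_prev: int) -> Tuple[int, int]:
    """Return (c_i, s_i) for one digit position following Table 7.1."""
    total = xi + yi
    lower_non_negative = x_prev >= 0 and y_prev >= 0
    if total == 2:
        return (1, 0)
    if total == 1:
        return (1, -1) if lower_non_negative else (0, 1)
    if total == 0:
        return (0, 0)
    if total == -1:
        return (0, -1) if lower_non_negative else (-1, 1)
    return (-1, 0)


def sd_add(x: Sequence[int], y: Sequence[int]) -> List[int]:
    """Add two equal-length SD numbers; the result has one extra digit."""
    n = len(x)
    carries, sums = [], []
    for i in range(n):
        x_prev = x[i - 1] if i else 0
        y_prev = y[i - 1] if i else 0
        c, s = intermediate(x[i], y[i], x_prev, y_prev)
        carries.append(c)
        sums.append(s)
    z = []
    for i in range(n):
        carry_in = carries[i - 1] if i else 0
        zi = sums[i] + carry_in       # no further carry is ever produced
        if zi not in (-1, 0, 1):
            raise AssertionError("carry chain was not broken")
        z.append(zi)
    z.append(carries[-1])
    return z


def sd_value(digits: Sequence[int]) -> int:
    """Integer value of an SD number given least significant digit first."""
    return sum(d * 2 ** i for i, d in enumerate(digits))
```

Each result digit depends only on the operand digits at its own position and at the next lower position, so every digit can be formed in parallel, in a time independent of the word length.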
Carry propagation chains may be avoided in arithmetic operations so long as the 
numbers are represented in SD form. However, if the SD numbers are converted to 
an equivalent non-redundant binary representation such as 2's complement, then 
carry propagation cannot be avoided in the conversion process. 



















7.3.2. Self-Timed Carry Propagate Addition 
Suppose that the delay of a 1-bit full adder is denoted by Δ. The delay xΔ in
performing an n-bit carry-propagate addition is dependent upon the two numbers
being added and is determined by the length of the actual longest carry propagation
particular to that addition. Therefore x can take any integer value in the range
1 ≤ x ≤ n. For a synchronous design, we have to allow for the worst-case condition
when the carry propagates through the length of the entire adder. In this case, the
addition delay is fixed by nΔ. However, worst-case carry propagation rarely occurs.
From statistical analysis of the addition of two randomly chosen binary integers, X
and Y, the average longest carry length is found to be bounded by log2 n [27].
With the use of carry-complete detection logic, the average speed of carry-propagate
addition may therefore be increased by a factor of n/(log2 n).
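The data dependence of the carry propagation length can be explored with the sketch below. The definition of carry length used here is an assumption made for illustration, namely the longest run of consecutive propagate stages through which a 1-carry actually ripples. It may differ in detail from the measure used in [27], so the printed averages should be read as indicative only.

```python
from math import log2


def longest_carry_chain(x: int, y: int, n: int, carry_in: int = 0) -> int:
    """Longest run of consecutive stages through which a 1-carry ripples."""
    longest = run = 0
    carry = carry_in
    for i in range(n):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        if xi ^ yi:
            # Propagate stage: carry-out equals carry-in.
            run = run + 1 if carry else 0
        else:
            # Kill or generate stage: carry-out is fixed by the operands.
            run = 0
            carry = xi
        longest = max(longest, run)
    return longest


def average_longest_chain(n: int) -> float:
    """Average over all pairs of n-bit operands with zero carry-in."""
    total = sum(longest_carry_chain(x, y, n)
                for x in range(2 ** n) for y in range(2 ** n))
    return total / 4 ** n


if __name__ == "__main__":
    for n in (4, 6):
        print(f"n={n}: average={average_longest_chain(n):.3f}, log2 n={log2(n):.3f}")
```

As a usage example, adding 1010 to 1010 with a carry-in of 1 involves no propagation at all, whereas adding 1010 to 0101 with a carry-in of 1 ripples the carry through all four stages.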
Fig. 7.8 	Combinatorial logic for a data flow, partial product sum 
adder. 
The combinatorial logic for a full adder function in data flow logic is given in
Fig. 7.8. The carry-out function C_out is defined independent of the carry-in input C_in
when the input operand bits x_i and y_i are in the carry-kill (x_i = 0, y_i = 0) or
carry-generate (x_i = 1, y_i = 1) condition. A self-timed, carry-propagate adder may be
formed by directly cascading such full adder cells. Completion of an addition
operation is indicated by an acknowledge signal Ack_out that is formed by the
ANDing of acknowledge signals from the individual sum bits. The
microphotograph of a 4-stage carry-propagate adder, implemented in a 2 μm, dual
metal n-well CMOS process, is given in Fig. 7.9.
Fig. 7.9 	Microphotograph of a 4-stage, self-timed carry-propagate 
adder. 
To verify the self-timed operation of such an adder, two sets of operands are chosen 
to demonstrate addition involving the two extreme carry propagation conditions. 
With the carry-in C_in to the least-significant stage set to 1, one addition is set up so
that no carry propagation takes place. In the other addition, the carry propagates 
the entire length of the adder. Referring to the oscilloscope trace shown in 
Fig. 7.10, a 20 ns difference in computation time between these two additions is 




Fig. 7.10 	Oscilloscope trace displaying the self-timed addition of two 
sets of operands. 
The top trace is the local start signal φ_i. The middle trace is the carry-out signal
C_out from the most significant bit of the adder. The bottom trace is the
acknowledge signal Ack_out. These traces were produced with operands x_1 = 1010,
y_1 = 1010, C_in = 1 followed by x_2 = 1010, y_2 = 0101, C_in = 1.
7.4. Multiplier Structure 
Fig. 7.11 shows the composition of a data flow, pipelined multiplier for two's 
complement, fixed-point numbers. It is derived from the well-known structure for 
carry-save array (CSA) multiplication. The weighted summation of partial products 
is carried out using a carry-save array, with a last stage carry-propagate adder to 
form the final product. Except for the last carry-propagate stage, the interstage 
delay between two adjacent rows is determined by the delay Δ of a 1-bit full adder.
With the provision for carry-completion detection, the carry-propagate addition may
be completed in an average time that varies as log2 n for an n-bit adder.
Legend: R: register, CK: local clock generator, RA: register array,
C: C-element, ○: full adder





7.4.1. Multiplier Algorithm 
The Baugh-Wooley algorithm [5] for two's complement multiplication allows the 
product of a multiplication to be formed by the weighted summation of positive 
partial products only. Each partial product bit is formed by the ANDing of a 
multiplicand and multiplier bit. The product P of the multiplication of two n-bit, 
two's complement numbers X and Y, can be expressed as 
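\[
\begin{aligned}
XY &= \left(-y_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} y_i 2^i\right)
      \left(-x_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} x_i 2^i\right) \\
   &= \left(x_{n-1} y_{n-1} 2^{2n-2} + \sum_{i=0}^{n-2} \sum_{j=0}^{n-2} x_i y_j 2^{i+j}\right) \\
   &\quad - \left(\sum_{i=0}^{n-2} x_{n-1} y_i 2^{n-1+i} + \sum_{i=0}^{n-2} y_{n-1} x_i 2^{n-1+i}\right)
\end{aligned}
\tag{7.1}
\]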
In accumulating the partial product sum (PPS), instead of subtracting partial
products (PP) with negative signs, the negation of these PP may be added to the
PPS. Therefore if
\[
Z = -z_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} z_i 2^i
\]
then
\[
-Z = -\bar{z}_{n-1} 2^{n-1} + \sum_{i=0}^{n-2} \bar{z}_i 2^i + 2^0
\]
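The PP in Eqn. 7.1 with negative signs may therefore be equivalently expressed
as
\[
\begin{aligned}
& -\left(-0 \cdot 2^{2n-1} + 0 \cdot 2^{2n-2} + \sum_{i=0}^{n-2} x_{n-1} y_i 2^{n-1+i}\right) \\
&= -1 \cdot 2^{2n-1} + 1 \cdot 2^{2n-2} + \sum_{i=0}^{n-2} \overline{x_{n-1} y_i}\, 2^{n-1+i} + 2^{n-1}
\end{aligned}
\]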
n-2 	- 





as this term reduces to zero for x_{n-1} = 0 and
\[
-2^{2n-1} + 2^{2n-2} + \sum_{i=0}^{n-2} \bar{y}_i 2^{n-1+i} + 2^{n-1}
\]
for x_{n-1} = 1.
For completeness, the individual partial product terms produced by the algorithm 
for the product of two 8-bit integers are shown in Fig. 7.12. Note that the true and 
complement of the multiplicand and multiplier bits are needed to form the partial 
product terms. This presents no difficulty since the true and complement of a signal 
are readily available at the output of data flow logic operators. Note also that the 
five extra partial product bits produced by the algorithm, x_7, y_7, x̄_7, ȳ_7 and 1, may
be accommodated by the unused inputs of the full adders on the periphery of the 
array, as shown in Fig. 7.11. This is accomplished without disrupting the regularity 
of the structure or incurring an area penalty. 
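The following sketch checks numerically that a two's complement product can be formed from positive partial product bits only. The bit positions assigned to the five extra bits (x_{n-1} and y_{n-1} at weight 2^{n-1}, their complements at weight 2^{2n-2}, and the constant 1 at weight 2^{2n-1}) follow the standard form of the Baugh-Wooley algorithm and are an assumption here, since Fig. 7.12 is not reproduced. The function names are hypothetical.

```python
from typing import List


def to_bits(value: int, n: int) -> List[int]:
    """Two's complement bits of value, least significant bit first."""
    return [(value >> i) & 1 for i in range(n)]


def baugh_wooley(x: int, y: int, n: int) -> int:
    """Multiply two n-bit two's complement integers by summing positive bits only."""
    xb, yb = to_bits(x, n), to_bits(y, n)
    acc = (xb[n - 1] & yb[n - 1]) << (2 * n - 2)
    # Partial products of the magnitude bits.
    for i in range(n - 1):
        for j in range(n - 1):
            acc += (xb[i] & yb[j]) << (i + j)
    # Negative partial products replaced by terms using complemented bits.
    for i in range(n - 1):
        acc += (xb[n - 1] & (1 - yb[i])) << (n - 1 + i)
        acc += (yb[n - 1] & (1 - xb[i])) << (n - 1 + i)
    # The five extra bits (positions assumed, see the lead-in).
    acc += (xb[n - 1] + yb[n - 1]) << (n - 1)
    acc += ((1 - xb[n - 1]) + (1 - yb[n - 1])) << (2 * n - 2)
    acc += 1 << (2 * n - 1)
    # Keep 2n bits and interpret the result as a two's complement number.
    acc &= (1 << (2 * n)) - 1
    if acc >> (2 * n - 1):
        acc -= 1 << (2 * n)
    return acc
```

Every term added to the accumulator is non-negative, so the array needs only adders and no subtractors. The sign of the result emerges from discarding the carry out of the 2n-bit sum.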
[Fig. 7.12: array of partial product terms x_i y_j (i, j = 0 ... 7) formed from the operand bits y_7 ... y_0 and x_7 ... x_0 according to Eqn. 7.1, together with the extra bits x_7, y_7, their complements and 1, summed column by column to form the product bits p_15 ... p_0.]
Fig. 7.12 Partial product terms derived from the Baugh-Wooley 




7.4.2. Multiplier Composition 
In any pipeline system, the throughput is limited by the propagation delay through
the slowest stage. This delay value will then determine the partitioning of a system 
into logical blocks, with each block forming a single stage in the pipe. In an n-bit 
self-timed multiplier, the slowest stage within its structure is the final carry-propagate
adder with an average delay of Δ log2 n, Δ being the propagation delay
of a 1-bit full adder. In order to reduce the number of registers required for the 
intermediate storage of PPS terms, registers are not placed after every row of full 
adders. Instead, log2 n rows may be combined together to form a single pipeline 
stage, thereby improving the performance-area ratio of the pipeline. 
Due to the regularity of the multiplier array structure, self-timing is not extended 
down to the bit level. The local clock generator modules derive their timing from 
the most significant bit of the intermediate PPS terms. This temporal information is 
then assumed to apply across the bits in a PPS word. For the multiplier described, 
such an assumption is valid for two reasons:
• 	The propagation delay of a full adder is always greater than that 
of a register cell. 
• 	The regularity of the interconnections between full adder cells in 
the array ensures the close matching of interconnection delay 
across a PPS word. As a result, the difference margin will 
always be less than the delay through a local clock generator 
cell. 
With the above assumption, the multiplier is described as being self-timed at the 
word level. The computational wavefronts are then assumed to travel in straight 
lines between pipeline stages (Fig. 7.13a). In the equilibrium condition, data 
words and spacers reside in alternate stages of the pipeline and the computational 
wavefronts are stationary. The wavefronts propagate only when two adjacent 
pipeline stages both contain spacer words. For irregular structures, self-timing may 
be extended down to the bit-level, allowing the computational wavefronts to 
propagate without restriction. However, the wavefronts must be non-intersecting 
and separated by at least a spacer word (Fig. 7.13b). Implementing self-timing at 
the bit-level will incur a heavy area penalty, arising from the proliferation of local 






Fig. 7.13a Computational wavefronts constrained to straight-line 




Fig. 7.13b Irregular computational wavefronts for bit-level, self-timed 
operation. 
Note that at the interface of the multiplier to the external environment, a 2-input 
C-element detects the condition when both the multiplicand and multiplier words 
are valid. This is necessary since no assumptions are made about the timing 
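The behaviour of the 2-input C-element mentioned above can be sketched as a small state-holding model. This is a behavioural illustration only, assuming the standard Muller C-element definition (the output follows the inputs when they agree and holds otherwise); the class name is hypothetical.

```python
class CElement:
    """Behavioural model of a 2-input Muller C-element."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, a: int, b: int) -> int:
        # The output changes only when both inputs agree; otherwise it holds.
        if a == b:
            self.out = a
        return self.out


if __name__ == "__main__":
    c = CElement()
    # Multiplicand valid first, multiplier valid later, then both return to spacer.
    for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(a, b, "->", c.update(a, b))
```

In this model the output rises only once both operand-valid signals are asserted, regardless of which arrives first, and falls only once both have been withdrawn.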





The self-timed communication imposed by the generic hardware structures 
presented in the last chapter for data flow logic has been verified using a first-in, 
first-out (FIFO) memory as a test vehicle. The data flow approach may be readily 
applied to the realisation of complex, self-timed logical functions such as addition 
and multiplication. As a basis for a wavefront array multiplier, a self-timed carry-
propagate adder has been fabricated, from which measurements on addition 
completion times were obtained. In addition to improving the average throughput 
by not being constrained by worst-case delay values, self-timing also alleviates the 
problems associated with clock and power distribution. By deriving clocks locally, 
the difficulties associated with distributing global clock signals at high speed with 
integrity are avoided. By activating the local clocks at different times in response to 
data availability, the current demand of a self-timed processor is averaged out in 
time, rather than concentrated at particular instants. These issues will assume 






The core of this thesis centres on the investigation of architectural forms that are 
able to exploit the inherent structure and regularity in algorithms to realise highly 
concurrent VLSI systems. High concurrency in computation may be achieved by 
directly mapping an algorithm into an architectural form embracing multiple 
processing elements with distributed memory, linked by efficient communication 
channels. Three distinct architectural forms have been presented, based on the 
granularity of the modular processing elements. These are based respectively on 
analogue current computational systems, bit-serial flow-graph networks and 
wavefront arrays. The following remarks are a summary of the strengths and 
weaknesses of the three architectural forms. 
Analogue current computational circuits exploiting the weak inversion behaviour of 
MOS transistors are characterised by their high functional efficiency in terms of the 
number of computations performed per second per unit area but at low precision 
with an accuracy limited to no better than ± 20 % in their output signal currents. 
Emerging fields of application where such circuits may be tolerated are in the areas 
of collective computation [81, 75] and "smart" vision sensors in which features of 
interest, such as the edges of objects, show up as sharp, abrupt intensity variations 
that are not swamped by such inaccuracies. 
A novel "receptor" cell has been presented to illustrate the application of analogue 
current computational circuits in "smart" vision sensors. The receptor cell may be 
tessellated to form an imager with built-in nonlinear automatic gain control (AGC) 
correction to maintain the operating range of the imager in register with ambient 
light conditions. The AGC function requires a feedback mechanism and as such, 
the imager may be susceptible to the problem of instability. Although the issue of 
stability in such tightly-coupled feedback networks has not been addressed in this 
thesis, relevant design techniques to make such networks unconditionally stable 
have recently been presented [93]. 
Bit-serial, flow-graph networks are characterised by their communication and 
computational efficiency for executing data-independent algorithms. Computational 
efficiency is achieved through functional parallelism, in which arrays of hard-wired 
processors are used to boost system throughput. An array of functional processors 
accepts data words in a bit-serial but word-parallel fashion to give a proportionate 
increase in system throughput. 
A bit-serial flow-graph network is essentially a pipeline of bit-serial operators, with 
a throughput rate that is independent of the composition of the network. However, 
pipelining introduces an operational delay or latency on a network that is 
determined by the number of bit-times separating an input sample word from the 
corresponding output sample computed by the network. For non-recursive 
computations, the maximum sample rate is independent of the latency and is 
determined solely by the throughput of a network. Examples of algorithms 
belonging to this class are the fast Fourier transform (FFT), finite-impulse response 
(FIR) filtering and low level image processing operations. For recursive algorithms 
such as infinite-impulse response (IIR) filtering, the latency of a network effectively 
determines the maximum sample rate. Such recursive algorithms require the i-th 
output sample y_i to be available before computation of the next output sample y_{i+1} 
may proceed. As a result, the maximum sampling frequency is limited by the 
latency inherent in a bit-serial network and cannot be increased further through 
functional parallelism. This can be seen in the case study of the digital wave filter 
presented, which achieves a maximum sampling frequency of 227 kHz with a bit-rate 
of 10 MHz. 
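The relationship between bit-rate, latency and sample rate can be made concrete with a short calculation. This is a sketch under stated assumptions: the loop latency of 44 bit-times is inferred here from the quoted figures (10 MHz / 227 kHz) and is not stated in the text, and the 16-bit word length is purely hypothetical.

```python
def max_sample_rate_recursive(bit_rate_hz: float, loop_latency_bits: int) -> float:
    """Recursive case: the next output cannot start until the loop latency has elapsed."""
    return bit_rate_hz / loop_latency_bits


def max_sample_rate_nonrecursive(bit_rate_hz: float, word_length_bits: int) -> float:
    """Non-recursive case: one sample per word-time, independent of pipeline latency."""
    return bit_rate_hz / word_length_bits


if __name__ == "__main__":
    BIT_RATE = 10e6          # 10 MHz, as quoted for the wave filter case study
    LOOP_LATENCY = 44        # assumption: inferred from 10 MHz / 227 kHz
    WORD_LENGTH = 16         # assumption: hypothetical word length
    print(max_sample_rate_recursive(BIT_RATE, LOOP_LATENCY) / 1e3, "kHz (recursive)")
    print(max_sample_rate_nonrecursive(BIT_RATE, WORD_LENGTH) / 1e3, "kHz (non-recursive)")
```

Adding more processors in parallel changes neither figure for a single recursive loop, since the limit is set by the latency around the loop and not by the available hardware.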
For a limited class of recursive computations, the sample rate may be increased by a 
factor of L by manipulating a block of L input samples to form a block of L output 
samples concurrently [43, 62, 61]. This is achieved by computing all the 
intermediate terms required in forming the L output samples dynamically, 
effectively decoupling the computations of the output samples in the block. 
However, the price to be paid is a massive increase in hardware resources to 
compute ahead of time those intermediate terms that would have formed later in a 
recursive computation. 
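The idea of computing intermediate terms ahead of time can be illustrated on a first-order recursion y[n] = a·y[n-1] + x[n]. The sketch below uses the look-ahead form y[n] = a^L·y[n-L] + Σ a^k·x[n-k] for k = 0..L-1, a simpler relative of the block techniques cited above and not necessarily the formulation used in those references. The function names are illustrative.

```python
import math


def iir_direct(x, a):
    """Direct first-order recursion: each output depends on the previous output."""
    y, prev = [], 0.0
    for sample in x:
        prev = a * prev + sample
        y.append(prev)
    return y


def iir_lookahead(x, a, L):
    """Look-ahead form: each output depends on the output L samples earlier."""
    a_pow_L = a ** L
    y = []
    for n in range(len(x)):
        # Extra feed-forward terms: the hardware cost of decoupling the outputs.
        feedforward = sum((a ** k) * x[n - k] for k in range(L) if n - k >= 0)
        feedback = a_pow_L * y[n - L] if n >= L else 0.0
        y.append(feedback + feedforward)
    return y


if __name__ == "__main__":
    x = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.0, 1.0]
    ref = iir_direct(x, 0.5)
    blk = iir_lookahead(x, 0.5, 4)
    print(all(math.isclose(r, b) for r, b in zip(ref, blk)))
```

Because the feedback path now spans L samples, L outputs can be in flight at once, at the cost of the L-term feed-forward sum per output.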
Conceptually, the behaviour of a system may be completely specified by an ordering 
of events without any reference to the time domain. The self-timed approach to 
computation allows the sequence specification of a system to be abstracted from the 
timing behaviour of the physical realisation, thereby ensuring correct system 
behaviour independent of all timing constraints. The discipline of self-timed design 
has two principal facets, the design of elements and the design of systems consisting 
of an interconnection of elements. At the element level, the primary design task is 
in the physical design of the internal structure of an element in which time and 
sequence are related together. In addition to defining the logical function of an 
element, signal events at its I/O terminals must be constrained to a certain ordering 
so as to satisfy some physical requirement for reliable communication. Elements 
that fulfil these conditions are said to be "safe". 
At the system level, an element may be viewed as a functional black box with well-
defined terminal behaviour. Systems may be built by interconnecting various black 
boxes together. Since there are no time constraints at the system level, system 
function is a result of the overall interaction of an interconnection of elements. 
Different topologies of interconnection will give rise to different functional 
behaviour. However, with certain topologies, part of a system may eventually reach 
a "dead" state from which it is unable to recover. This condition is known as 
deadlock and will cause the entire system to come to a halt. Systems that are free 
from deadlocks are said to be "live". 
To illustrate this property of liveness or deadlock of an interconnection of self-timed 
elements, a parallel may be drawn using a circular connection of inverters. If there 
is an odd number of inverters in the chain, then the topology of interconnection is 
live and the system functions as a ring oscillator. However, if the number of 
inverters in the chain is even, then the system will eventually settle into a stable 
condition in which no further activity is possible. This topology results in deadlock. 
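The inverter-ring analogy can be reproduced with a small simulation. This is a minimal sketch, assuming ideal inverters evaluated one at a time in a fixed order from an all-zero initial state; the function name and the sweep limit are illustrative choices.

```python
def ring_settles(n: int, max_sweeps: int = 100) -> bool:
    """Return True if a ring of n inverters reaches a stable state (deadlock)."""
    state = [0] * n  # state[i] is the output of inverter i; its input is state[i - 1]
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            new = 1 - state[i - 1]
            if new != state[i]:
                state[i] = new
                changed = True
        if not changed:
            return True   # no inverter wants to switch: activity has stopped
    return False          # still switching after max_sweeps: the ring oscillates


if __name__ == "__main__":
    for n in (3, 4, 5, 6):
        print(n, "inverters:", "deadlock" if ring_settles(n) else "live (oscillates)")
```

An odd ring has no assignment in which every inverter output is the complement of its input, so it can never settle; an even ring has such an assignment and falls into it.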
Given a set of safe elements, a system designer must ensure that an interconnection 
of elements performs the specified function and is free from deadlock. Several 
formal methods [12, 48, 25] have been proposed to analyse the system behavioural 
aspects of the self-timed discipline. The most promising of these approaches is the 
theory of traces [78]. However, even with these formal methods, the analysis 
rapidly becomes intractable with increasing system complexity. For this reason, the 
majority of self-timed systems implemented in VLSI [91, 33] have adopted a 
very simple interconnection topology along the lines of a ring oscillator. The 
burden of ensuring the liveness of a system is therefore removed and the design 
effort is concentrated on constructing safe elements. 
The computations required to execute an algorithm may be implemented by a 
variety of architectural forms. The real challenge is to define or identify an 
architectural form that will exploit the benefits of VLSI in order to meet the 
throughput requirement of the application at hand. It is hoped that some of the 
work presented in this thesis will have made inroads into this challenging field. 
Bibliography 
P. Antognetti, D.D. Caviglia, and E. Profumo, "CAD Model for Threshold 
and Subthreshold Conduction in MOSFET's," IEEE Journal of Solid-State 
Circuits, Vol. SC-17, (3), pp. 454 - 458, (Jun. 1982). 
D.B. Armstrong, A.D. Friedman, and P.R. Menon, "Design of 
Asynchronous Circuits Assuming Unbounded Gate Delays," IEEE 
Transactions on Computers, Vol. C-18, (12), pp. 1110 - 1120, (Dec. 1969). 
B. Arnold, "Current/Voltage Hybrid Multiplier," Electronics Letters, Vol. 24, 
(14), pp.  860 - 862, (Jul. 1988). 
E.E. Barton, "A Non-Metric Design Methodology for VLSI," in VLSI 81, ed. 
J.P. Gray, pp. 25 - 34, Academic Press, (1981). 
C.R. Baugh and B.A. Wooley, "A Two's Complement Parallel Array 
Multiplication Algorithm," IEEE Trans. on Computers, Vol. C-22, (12), 
pp. 1045 - 1047, (Dec. 1973). 
M. Brady, "Sensors for PROMETHEUS," Internal Report for PROMETHEUS 
Project, (1987). 
D.S. Broomhead et al., "A Practical Comparison of the Systolic and 
Wavefront Array Processing Architectures," in VLSI Signal Processing, 
pp. 375 - 386, IEEE Press, (1984). 
S.G. Chamberlain and J.P.Y. Lee, "A Novel Wide Dynamic Range Silicon 
Photodetector and Linear Imaging Array," IEEE Journal of Solid-State 
Circuits, Vol. SC-19, (1), pp.  41 - 48, (Feb. 1984). 
C.L. Chiang and L. Johnsson, "Residue Arithmetic and VLSI," Proc. IEEE 
International Conference on Computer Design: VLSI in Computers, pp. 80 - 83, 
(Oct. 1983). 
K.M. Chu and D.I. Pulfrey, "Design Procedures for Differential Cascode 
Voltage Switch Circuits," IEEE Journal of Solid-State Circuits, Vol. SC-21, 
(6), pp. 1082 - 1087, (Dec. 1986). 
K.M. Chu and D.I. Pulfrey, "A Comparison of CMOS Circuit Techniques: 
Differential Cascode Voltage Switch Logic versus Conventional Logic," IEEE 
Journal of Solid-State Circuits, Vol. SC-22, (4), pp.  528 - 532, (Aug. 1987). 
T.A. Chu, K.C. Leung, and T.S. Wanuga, "A Design Methodology for 
Concurrent VLSI Systems," Proc. IEEE ICCD: VLSI in Computers, pp. 407 - 
410, (Oct. 1985). 
J.B. Dennis, "Data Flow Supercomputers," Computer, Vol. 13, (11), pp. 48 - 
56, (Nov. 1980). 
P. Denyer and D. Renshaw, VLSI Signal Processing: A Bit-Serial Approach, 
Addison-Wesley, (1985). 
P.B. Denyer and S. G. Smith, "Bit-Serial Architectures for Parallel Arrays," 
Proc. SPIE, Vol. 614, Highly Parallel Signal Processing Architectures, pp.  66 
- 73, (1986). 
E.D. Dickmanns, "Integrated Approaches to Autonomous Road Vehicles: 
State of the Art Review," Internal Report for PROMETHEUS Project, (Feb. 
1987). 
A. Fettweis, "Wave Digital Filters: Theory and Practice," IEEE Proceedings, 
Vol. 74, pp.  270 - 327, (Feb. 1986). 
T.J. Fountain, "A Survey of Bit-Serial Array Processor Circuits," in 
Computing Structures for Image Processing, ed. M.J.B. Duff, Academic Press, 
(1983). 
B. Gilbert, "A New Wide-Band Amplifier Technique," IEEE Journal of 
Solid-State Circuits, Vol. SC-3, (4), pp.  353 - 365, (Dec. 1968). 
B. Gilbert, "Translinear Circuits: A Proposed Classification," Electronics 
Letters, Vol. 11, (1), pp.  14 - 16, (Jan. 1975). 
B. Gilbert, "Nonlinear Analog Circuits," Course Notes, (1983). 
B. Gilchrist, J.H. Pomerene, and S.Y. Wong, "Fast Carry Logic for Digital 
Computers," IRE Transactions on Electronic Computers, Vol. EC-4, pp. 133 - 
136, (Dec. 1955). 
L.A. Glasser and D.W. Dobberpuhl, The Design and Analysis of VLSI 
Circuits, Addison-Wesley, (1985). 
N. Goncalves and H.J. De Man, "NORA: A Racefree Dynamic CMOS 
Technique for Pipelined Logic Structures," IEEE Journal of Solid-State 
Circuits, Vol. SC-18, (3), pp.  261 - 266, (Jun. 1983). 
M.R. Greenstreet, T.E. Williams, and J. Staunstrup, "Self-Timed Iteration," 
Proc. VLSI 87, (Aug. 1987). 
L.G. Heller and W.R. Griffin, "Cascode Voltage Switch Logic: A Differential 
CMOS Logic Family," Proc. IEEE International Solid-State Circuits 
Conference, pp.  16 - 17, (1984). 
H.C. Hendrickson, "Fast High-Accuracy Binary Parallel Addition," IRE 
Transactions on Electronic Computers, Vol. EC-9, (4), pp. 465 - 469, (Dec. 
1960). 
L.A. Hollaar, "Direct Implementation of Asynchronous Control Units," IEEE 
Transactions on Computers, Vol. C-31, (12), pp.  1133 - 1141, (Dec. 1982). 
D.H. Hubel, Eye, Brain and Vision, Scientific American Library, (1988). 
J. Hutchinson et al., "Computing Motion Using Analog and Binary Resistive 
Networks," Computer, Vol. 21, (3), pp.  52 - 63, (Mar. 1988). 
L.B. Jackson, J.F. Kaiser, and H.S. McDonald, "An Approach to the 
Implementation of Digital Filters," IEEE Trans. on Audio and Electroacoustics, 
Vol. AU-16, pp.  413 - 421, (Sep. 1968). 
L.B. Jackson, Digital Filters and Signal Processing, Kluwer Academic 
Publishers, (1986). 
S. Komori et al., "An Elastic Pipeline Mechanism by Self-Timed Circuits," 
IEEE Journal of Solid-State Circuits, Vol. 23, (1), (Feb. 1988). 
R.H Krambeck, C.M. Lee, and H.F.S. Law, "High-Speed Compact Circuits 
with CMOS," IEEE Journal of Solid-State Circuits, Vol. SC-17, (3), pp.  614 - 
619, (Jun. 1982). 
H.T. Kung, "Why Systolic Architectures?," IEEE Computer, Vol. 15, (1), 
pp. 37 - 46, (Jan. 1982). 
S.Y. Kung, "On Supercomputing with Systolic/Wavefront Array Processors," 
Proceedings of the IEEE, Vol. 72, (7), pp. 867 - 884, (Jul. 1984). 
C.H. Lau, "Self: A Self-Timed Systems Design Technique," Electronics 
Letters, Vol. 23, (6), (Mar. 1987). 
C.H. Lau and D. Renshaw, "An Electronic Eye: A Smart Imager in 
Analogue VLSI," U.K. Patent Application No. 8829739, (Dec. 1988). 
C.H. Lau, D. Renshaw, and J. Mayor, "Data Flow Approach to Self-Timed 
Logic in VLSI," Proc. IEEE International Symposium on Circuits and Systems, 
(Jun. 1988). 
C.H. Lau and D. Renshaw, "Weak Inversion Imager Evaluation," Technical 
Report ISG Imagers-004-A, University of Edinburgh, (May 1989). 
C.H. Lau and D. Renshaw, "Race-Free Clocking of CMOS Pipelines through 
a Single Global Clock," Proc. European Solid-State Circuits Conference, (Sep. 
1989). 
C.H. Lau, D. Renshaw, and J. Mayor, "A Self-Timed Wavefront Array 
Multiplier," Proc. IEEE International Symposium on Circuits and Systems, 
(May 1989). 
H.H. Lu, E.A. Lee, and D.G. Messerschmitt, "Fast Recursive Filtering with 
Multiple Slow Processing Elements," IEEE Trans. on Circuits and Systems, 
Vol. CAS-32, (11), pp.  1119 - 1129, (Nov. 1985). 
R.F. Lyon, "Two's Complement Pipeline Multipliers," IEEE Trans. on 
Communications, Vol. COM-24, pp. 418 - 425, (Apr. 1976). 
R.F. Lyon, "Simplified Design Rules for VLSI Layouts," Lambda, Vol. II, 
(1), pp.  54 - 59, (First Quarter 1981). 
R.F. Lyon, "A Bit-Serial VLSI Architectural Methodology for Signal 
Processing," in VLSI 81, ed. J.P. Gray, pp. 131 - 140, Academic Press, 
(1981). 
M.A. Maher and C.A. Mead, "A Physical Charge-Controlled Model for 
MOS Transistors," Proc. Conference on Advanced Research in VLSI, MIT 
Press, (1987). 
Y. Malachi and S.S. Owicki, "Temporal Specifications of Self-Timed 
Systems," in VLSI Systems and Computations, ed. Kung, Sproull and Steele, 
pp. 203 - 212, Springer-Verlag, (1981). 
J. Mayor, N. Petrie, and C.H. Lau, "The Design of a Universal Wave Filter 
Adaptor Using Dynamic CMOS Techniques," Proc. European Solid-State 
Circuits Conference, (1984). 
D.S. McGraph and D.J. Myers, "Novel MOS Memory for Serial Signal 
Processing Applications," Electronics Letters, Vol. 21, (24), pp. 1170 - 1171, 
(Nov. 1985). 
M.S. McGregor, P.B. Denyer, and A.F. Murray, "A Single-Phase Clocking 
Scheme for CMOS VLSI," Proc. Conference on Advanced Research in VLSI, 
MIT Press, (1987). 
C. Mead and L. Conway, Introduction to VLSI Systems, Addison-Wesley, 
(1980). 
C. Mead, "A Sensitive Electronic Photoreceptor," in Chapel Hill Conference 
on VLSI, ed. H. Fuchs, pp. 463 - 471, Computer Science Press, (1985). 
C. Mead, Analog VLSI and Neural Systems, Addison-Wesley, (1989). 
R.E. Miller, Switching Theory, Vol. 2, Wiley, (1965). 
C.E. Molnar, T.P. Fang, and F.U. Rosenberger, "Synthesis of Delay-
Insensitive Modules," in Chapel Hill Conference on VLSI, ed. H. Fuchs, 
pp. 67 - 86, Computer Science Press, (1985). 
A.F. Murray and P.B. Denyer, "A CMOS Design Strategy for Bit-Serial 
Signal Processing," IEEE Journal of Solid-State Circuits, Vol. SC-20, (3), 
pp. 746 - 753, (Jun. 1985). 
D.J. Myers and P.A. Ivey, "A Design Style for VLSI CMOS," IEEE Journal 
of Solid-State Circuits, Vol. SC-20, (3), pp.  741 - 745, (Jun. 1985). 
J.G. Nash, "Review of Arithmetic Algorithms and Circuits for High Speed 
Signal Processing," Proc. SPIE, Vol. 614, Highly Parallel Signal Processing 
Architectures, pp.  98 - 115, (1986). 
Open University, Optoelectronics, Electronic Materials and Devices, The 
Open University Press, (1985). 
K.K. Parhi and D.G. Messerschmitt, "A Bit Parallel Bit Level Recursive 
Filter Architecture," Proc. IEEE ICCD: VLSI in Computers, pp. 284 - 289, 
(Oct. 1986). 
K.K. Parhi and D.G. Messerschmitt, "Look-Ahead Computation: Improving 
Iteration Bound in Linear Recursions," Proc. IEEE ICASSP 87, pp. 1855 - 
1858, (Apr. 1987). 
N. Petrie, The Design and Implementation of Digital Wave Filter Adaptors, 
Ph.D. Thesis, Dept. of Electrical Engineering, University of Edinburgh, 
(Aug. 1985). 
J. Poulton et al., "Pixel-Planes: Building a VLSI-Based Graphic System," in 
Chapel Hill Conference on VLSI, ed. H. Fuchs, pp. 35 - 60, Computer Science 
Press, (1985). 
N.R. Powell and J.M. Irwin, "Signal Processing with Bit-Serial Word-Parallel 
Architectures," Proc. SPIE, Vol. 154, Real Time Signal Processing, pp.  98 - 
104, (1978). 
N.R. Powell, "Functional Parallelism in VLSI Systems and Computations," in 
VLSI Systems and Computations, ed. Kung, Sproull and Steele, pp.  41 - 49, 
Springer-Verlag, (1981). 
H.M. Reekie et al., "Design and Implementation of Digital Wave Filters 
using Universal Adaptor Structures," IEE Proceedings, Vol. 131, Pt. F, (6), 
pp. 615 - 622, (Oct. 1985). 
L.P. Rubinfield, "A Proof of the Modified Booth's Algorithm for 
Multiplication," IEEE Trans. on Computers, Vol. C-24, (10), pp.  1014 - 1015, 
(Oct. 1975). 
J.T. Scanlon and W.K. Fuchs, "High Performance Bit-Serial Multiplication," 
Proc. IEEE ICCD: VLSI in Computers, pp. 114 - 117, (Oct. 1986). 
E. Seevinck, Analysis and Synthesis of Translinear Integrated Circuits, Elsevier 
Science Publishers, (1988). 
C.L. Seitz, "System Timing," in Introduction to VLSI Systems, Mead and 
Conway, Addison-Wesley, (1980). 
C.L. Seitz, "Ensemble Architectures for VLSI - A Survey and Taxonomy," 
Proc. 1982 Conference on Advanced Research in VLSI, MIT, pp. 130 - 135, 
(Jan. 1982). 
C.L. Seitz, "Concurrent VLSI Architectures," IEEE Transactions on 
Computers, Vol. C-33, (12), pp. 1247 - 1265, (Dec. 1984). 
B.M. Singer and J. Kostelec, "Theory, Design and Performance of Low-
Blooming Silicon Diode Array Imaging Targets," IEEE Transactions on 
Electron Devices, Vol. ED-21, (1), pp.  84 - 89, (Jan. 1974). 
M. Sivilotti, M. Emerling, and C. Mead, "A Novel Associative Memory 
Implemented Using Collective Computation," in Chapel Hill Conference on 
VLSI, ed. H. Fuchs, pp.  329 - 342, Computer Science Press, (1985). 
M.A. Sivilotti, M.A. Mahowald, and C.A. Mead, "Real-Time Visual 
Computations Using Analog CMOS Processing Arrays," Proc. 1987 
Conference on Advanced Research in VLSI, MIT Press, (1987). 
S.G. Smith and P.B. Denyer, Serial-Data Computation in VLSI, Kluwer 
Academic Publishers, (1987). 
J.L.A. van de Snepscheut, Trace Theory and VLSI Design, Springer-Verlag, 
(1985). 
I.E. Sutherland and C. Mead, "Microelectronics and Computer Science," 
Scientific American, Vol. 237, (3), pp.  210 - 228, (Sep. 1977). 
N. Takagi, H. Yasuura, and S. Yajima, "High-Speed VLSI Multiplication 
Algorithm with a Redundant Binary Addition Tree," IEEE Trans. on 
Computers, Vol. C-34, (9), pp.  789 - 796, (Sep. 1985). 
D.W. Tank and J.J. Hopfield, "Collective Computation in Neuronlike 
Circuits," Scientific American, Vol. 257, (6), pp.  62 - 70, (Dec. 1987). 
J. Tanner and C. Mead, "An Integrated Analog Optical Motion Sensor," in 
VLSI Signal Processing, II, IEEE Press. 
J.E. Tanner and C. Mead, "A Correlating Optical Motion Detector," Proc. 
1984 Conference on Advanced Research in VLSI, MIT, pp. 57 - 64, (Jan. 1984). 
Y.P. Tsividis and R.W. Ulmer, "A CMOS Voltage Reference," IEEE Journal 
of Solid-State Circuits, Vol. SC-13, (6), pp. 774 - 778, (Dec. 1978). 
E.A. Vittoz and J. Fellrath, "CMOS Analog Integrated Circuits Based on 
Weak Inversion Operation," IEEE Journal of Solid-State Circuits, Vol. SC-12, 
(3), pp.  224 - 231, (Jun. 1977). 
E.A. Vittoz, "Micropower Techniques," Design of MOS VLSI Circuits for 
Telecommunications, pp.  104 - 144, Prentice Hall, (1985). 
E.A. Vittoz, "The Design of High-Performance Analog Circuits on Digital 
CMOS Chips," IEEE Journal of Solid-State Circuits, Vol. SC-20, (3), pp. 657 
- 665, (Jun. 1985). 
A. Vladimirescu and S. Liu, "The Simulation of MOS Integrated Circuits 
Using SPICE2," University of California, Berkeley, ERL Memo M8017, (Feb. 
1980). 
W.M. Waite, "The Production of Completion Signals by Asynchronous, 
Iterative Networks," IEEE Transactions on Electronic Computers, Vol. EC-13, 
(2), pp.  83 - 86, (Apr. 1964). 
F.S. Werblin, "The Control of Sensitivity in the Retina," Scientific American, 
Vol. 228, (1), pp.  70 - 79, (Jan. 1973). 
T.E. Williams et al., "A Self-Timed Chip for Division," Proc. 1987 
Conference on Advanced Research in VLSI, pp. 75 - 95, MIT Press, (1987). 
W.B. Wilson et al., "Measurement and Modelling of Charge Feedthrough in 
N-Channel MOS Analog Switches," IEEE Journal of Solid-State Circuits, 
Vol. SC-20, (6), pp.  1206 - 1213, (Dec. 1985). 
J.L. Wyatt, Jr. and D.L. Standley, "A Method for the Design of Stable 
Lateral Inhibition Networks that is Robust in the Presence of Circuit 
Parasitics," in Neural Information Processing Systems, Denver 1987, ed. D.Z. 
Anderson, American Institute of Physics, (1988). 
J.R. Yuan, I. Karlsson, and C. Svensson, "A True Single-Phase Clock 
Dynamic CMOS Circuit Technique," IEEE Journal of Solid-State Circuits, 
Vol. SC-22, (5), pp. 899 - 901, (Oct. 1987). 
Appendix A 
Pin-Out of Prototype Chips VB076 and IMAGO04 
Prototype Logarithmic Imager 
Chip IMAGO04 














































Appendix B 
Prototype Logarithmic Imager Operating Equations 
The transduction device in the readout circuitry of a sensor cell is operated in the 
linear region of its drain-gate transfer characteristic. 
$$I_{out} = \beta_1 (V_{ph} - V_T)\, V_{DS} = \beta_2 (\phi_{clk} - V_T)(V_{bit} - V_s)$$
where the gain and threshold voltage of the devices are denoted by $\beta$ and $V_T$. $V_{ph}$ 
is the photo-voltage generated by the sensor, $V_{bit}$ the external bias drain voltage and 
$\phi_{clk}$ the applied clock voltage. Re-arranging, 
$$V_s = \frac{\beta_2 (\phi_{clk} - V_T)\, V_{bit}}{\beta_1 (V_{ph} - V_T) + \beta_2 (\phi_{clk} - V_T)}$$
$$V_{out} = I_{out} R_L = \frac{\beta_1 \beta_2 (V_{ph} - V_T)(\phi_{clk} - V_T)}{\beta_1 (V_{ph} - V_T) + \beta_2 (\phi_{clk} - V_T)}\, V_{bit} R_L$$
If $\beta_1 < \beta_2$, $V_{out}$ can be approximated by 
$$V_{out} \approx \frac{\beta_1 (V_{ph} - V_T)\, V_{bit} R_L}{\gamma} \qquad \text{(B.1)}$$
where $\gamma$ has a value in the range $1 < \gamma < 2$ depending on the relative values of $\beta_1$ and 
$\beta_2$. 
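The approximation in Eqn. B.1 can be checked numerically. Dividing the exact expression through by the β2 term gives γ = 1 + β1(Vph − VT) / (β2(φclk − VT)), which is the form used in the sketch below. All device values here are hypothetical and are chosen only so that the β1 term is the smaller of the two.

```python
def exact_output_current(beta1, beta2, v_ph, v_clk, v_t, v_bit):
    """Exact I_out for two linear-region devices in series between V_bit and ground."""
    g1 = beta1 * (v_ph - v_t)    # conductance of the transduction device
    g2 = beta2 * (v_clk - v_t)   # conductance of the clocked access device
    return g1 * g2 / (g1 + g2) * v_bit


def gamma_factor(beta1, beta2, v_ph, v_clk, v_t):
    """gamma = 1 + g1/g2, obtained by rearranging the exact expression."""
    return 1.0 + beta1 * (v_ph - v_t) / (beta2 * (v_clk - v_t))


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    beta1, beta2 = 40e-6, 80e-6          # A/V^2
    v_ph, v_clk, v_t, v_bit = 2.0, 5.0, 1.0, 5.0
    g = gamma_factor(beta1, beta2, v_ph, v_clk, v_t)
    approx = beta1 * (v_ph - v_t) * v_bit / g
    print("gamma =", g)
    print("exact =", exact_output_current(beta1, beta2, v_ph, v_clk, v_t, v_bit))
    print("B.1   =", approx)
```

With γ evaluated this way, Eqn. B.1 reproduces the exact current; treating γ as a constant is an approximation, since γ varies with Vph.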
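For two different levels of illumination corresponding to output currents $I_{out2}$ and 
$I_{out1}$, 
$$V_{ph2} - V_{ph1} = \frac{\gamma}{\beta_1 V_{bit}} (I_{out2} - I_{out1}) \qquad \text{(B.2)}$$
For a MOS transistor in weak inversion, 
$$V_{ph} = \alpha U_T \ln I_{ph} + \text{constant}$$
where $\alpha$ is a measured constant of the sensor and $U_T$ the thermal voltage. 
Therefore 
$$V_{ph2} - V_{ph1} = \alpha U_T \ln\left(\frac{I_{ph2}}{I_{ph1}}\right)$$
Equating, 
$$\alpha U_T \ln\left(\frac{I_{ph2}}{I_{ph1}}\right) = \frac{\gamma}{\beta_1 V_{bit}} (I_{out2} - I_{out1})$$
$$\alpha = \frac{\gamma}{\beta_1 V_{bit}} (I_{out2} - I_{out1}) \, \frac{1}{U_T \ln\left(\dfrac{I_{ph2}}{I_{ph1}}\right)}$$
For an output current gradient of 29.5 µA/decade of photocurrent and typical 
process parameters, 
$$\alpha = \frac{1.8}{40 \times 5} \times 29.5 \times \frac{1}{U_T \ln 10} = 4.6$$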
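The numerical evaluation of α above can be reproduced as follows. The interpretation of the figures is an assumption made here: 1.8 is taken as γ, 40 as β1 in µA/V², 5 as the bit-line bias in volts, and the thermal voltage is assumed to be 25 mV, which the text does not state.

```python
import math


def alpha_from_slope(gamma, beta1, v_bit, slope_per_decade, u_t):
    """alpha = gamma / (beta1 * V_bit) * (delta I_out per decade) / (U_T * ln 10)."""
    return gamma / (beta1 * v_bit) * slope_per_decade / (u_t * math.log(10))


if __name__ == "__main__":
    alpha = alpha_from_slope(
        gamma=1.8,                 # assumption: read from the "1.8" in the text
        beta1=40e-6,               # assumption: 40 uA/V^2
        v_bit=5.0,                 # assumption: 5 V bias
        slope_per_decade=29.5e-6,  # 29.5 uA per decade of photocurrent
        u_t=25e-3,                 # assumption: thermal voltage of 25 mV
    )
    print(round(alpha, 2))
```

Under these assumptions the result agrees with the quoted value of 4.6 to two significant figures.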
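From Eqn. B.1, 
$$I_{out} = \frac{\beta_1 V_{bit}}{\gamma} (V_{ph} - V_T)$$
Differentiating, 
$$\frac{dI_{out}}{dV_T} = \frac{\beta_1 V_{bit}}{\gamma} \left(\frac{dV_{ph}}{dV_T} - 1\right) \qquad \text{(B.4)}$$
From Eqn. 2.11 and Eqn. 3.2, the photocurrent $I_{ph}$ required to generate the 
photo-voltage $V_{ph}$ is given by 
$$I_{ph} = K e^{(V_{ph} - \alpha V_T)/(\alpha U_T)}$$
$$V_{ph} = \alpha \left[ U_T \ln\left(\frac{I_{ph}}{K}\right) + V_T \right]$$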





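Substituting into Eqn. B.4, 
$$\frac{dI_{out}}{dV_T} = \frac{\beta_1 V_{bit}}{\gamma} (\alpha - 1)$$
$\alpha$ has a derived value of 4.6. Therefore 
$$\frac{dI_{out}}{dV_T} = 3.6 \times \frac{\beta_1 V_{bit}}{\gamma}$$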
$$\Delta V_T = \frac{\Delta I_{out}}{3.6 \times \dfrac{\beta_1 V_{bit}}{\gamma}} = 17.5 \text{ mV}$$
Appendix C 
Author's Publications 
The following is a list of the author's relevant publications in chronological order. 
1 	J. Mayor, N. Petrie and C.H. Lau, "The Design of a Universal Wave Filter 
Adaptor using Dynamic CMOS Techniques", Proc. European Solid-State 
Circuits Conference, Sep. 1984. 
2 	H.M. Reekie, N. Petrie, J. Mayor, P.B. Denyer and C.H. Lau, "Design and 
Implementation of Digital Wave Filters Using Universal Adaptor Structures", 
IEE Proceedings, Vol. 131, Pt. F, No. 6, Oct. 1984. 
3 	C.H. Lau, "Self: A Self-Timed Systems Design Technique", Electronics Letters, 
Vol. 23, No 6, 1987, pp.  269-270. 
4 	C.H. Lau, D. Renshaw, J. Mayor, "Data Flow Approach to Self-Timed Logic 
in VLSI", Proc. IEEE International Symposium on Circuits and Systems, Jun. 
1988. 
5 	C.H. Lau, D. Renshaw, "An Electronic Eye: A Smart Imager in Analogue 
VLSI", U.K. Patent Application No. 8829739, 1988. 
6 	C.H. Lau, D. Renshaw, J. Mayor, "A Self-Timed Wavefront Array 
Multiplier", Proc. IEEE International Symposium on Circuits and Systems, 
May 1989. 
7 	C.H. Lau, D. Renshaw, "Race-Free Clocking of CMOS Pipelines through a 
Single Global Clock", Proc. European Solid-State Circuits Conference, Sep. 
1989 
