Asynchronous VLSI design. by Nedelchev, Ivailo Marinov.
ProQuest 10130240
Published by ProQuest LLC (2017). Copyright of the Dissertation is held by the Author.
All rights reserved.
This work is protected against unauthorized copying under Title 17, United States Code.
Microform Edition © ProQuest LLC.
ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346



Asynchronous VLSI Design
A dissertation 
submitted to the Department 
of Electrical and Electronic Engineering 
of Surrey University, 
in partial fulfillment of the requirements 
for the degree of 
Doctor of Philosophy
Ivailo Marinov Nedelchev
June 1995
Acknowledgements
I would like to thank my supervisor, Chris Jesshope, for his professional guidance,
his optimism and support, which brought light in the darkest times.
I would also like to gratefully acknowledge the help and encouragement
from my colleagues, Chaogang Huang, Mark Josephs, and Jay Yantchev.
There is also a body of friends, current and former PhD students, without whose
friendship this work would never have progressed with the same joy. Special
thanks go to Paul and Wilf.
For their support throughout my PhD research, I am also very grateful to EPSRC
and ACiD/WG.

Publications
1. C.G. Huang, C.R. Jesshope, I.M. Nedelchev, "A Systematic Method for
Synthesizing Purely Delay-Insensitive Circuits", IEE Proceedings-E, Vol. 140,
No. 5, September 1993
2. C.R. Jesshope, I.M. Nedelchev, C.G. Huang, "Compilation of Process Algebra
Expressions into Delay-Insensitive Circuits", IEE Proceedings-E, Vol. 140,
No. 5, September 1993
3. J. Yantchev, I. Nedelchev, "Implementation of a packet switching device as a
delay-insensitive circuit", Research on Integrated Systems, the MIT Press,
1993
4. I. Nedelchev, C. Jesshope, "Basic building blocks for asynchronous packet
routers", 1994 IEEE Great Lakes Symposium, Notre Dame, IEEE Computer Society
Press.
5. C. Jesshope, I. Nedelchev, "Asynchronous Packet Routers", to be published in
the DIMACS Series in Discrete Mathematics and Theoretical Computer Science,
AMS.
6. J. Yantchev, C. Huang, M. Josephs, I. Nedelchev, "Low-Latency Asynchronous
FIFO Buffers", Proc. of 2nd Working Conference on Asynchronous Design
Methodologies, London, The IEEE Computer Society Press, May 1995
The work in the thesis has also been presented at the ACiD-WG/EXACT Workshops
in Leuven, Belgium (1992), Veldhoven, Netherlands (1993) and Lyngby, Denmark
(1994), the UKDA Workshops in Newcastle (1992) and Heathrow (1995), and the
Great Lakes Symposium at Notre Dame, USA (1994).
Abstract
This thesis describes the background and implementation of a novel silicon
compiler from a high-level programming language, OCCAM(async), to asynchronous
CMOS circuits. The compilation scheme is based on a process algebra description
of a concurrent system. This algebra is called Delay-Insensitive Algebra and is
based on CSP, but allows the user more freedom in communication protocols.
The thesis reviews and compares various existing design styles and discusses
their practical aspects for asynchronous design. The syntax and the
operational semantics of OCCAM(async) are defined and, on this basis, the new
compilation technique is described together with its underlying CMOS circuitry.
The implementations of various novel library cells are also discussed.
The compilation technique is illustrated throughout the thesis with practical
examples. It is also compared to an existing synthesis tool, Tangram, which has
been developed at Philips Research Laboratories. The thesis concludes with the
place and role of OCCAM(async) in contemporary CMOS design, and the prospects
for continuing this research towards full automation of the design process.
Contents
Acknowledgements
Publications
Abstract
1 Asynchronous systems
1.1 Architecture of Asynchronous Device
1.2 Asynchronous Compilation Techniques
1.3 Delay-insensitive design
1.3.1 Top-Down Design Steps
1.4 An Algebra for Delay-Insensitive Circuits
1.4.1 Syntax of DIA
1.4.2 Examples
1.5 Library of basic asynchronous cells
1.6 Translation of events into physical events
1.7 Data Encoding
2 Delay-Insensitive Arithmetic
2.1 Delay-Insensitive Adder
2.2 Delay-Insensitive Multiplier
2.3 Pipeline of multipliers
3 OCCAM(async)
3.1 Introduction
3.2 Syntax of OCCAM(async)
3.3 Examples
3.4 The Mutual Exclusion element - ME
3.5 The Decision Wait element
3.5.1 Some implementation notes on DW element
3.5.2 Sparse DW element
3.5.3 Generalised DW element
3.6 The compilation technique
3.6.1 The sequential operator SEQ
3.6.2 The composition operator PAR
3.6.3 The deterministic choice operator DETALT
3.6.4 The nondeterministic choice operator ALT
3.7 Compiling SEQ with a replicator
3.8 Data types
3.8.1 Examples
3.9 The full syntax
3.10 Synthesis for data operations
3.10.1 Double-rail logic
3.10.2 Asynchronous bundle-data registers
3.10.3 Compiling an expression
3.10.4 Assignment
3.10.5 Output statement
3.10.6 Input statement
3.10.7 WHILE loop statement
3.10.8 Conditional branching
3.10.9 Case construct
3.11 Channels to/from the environment
3.12 Who made who
3.13 Channel alignment
3.14 Peephole optimisations
3.14.1 Binary trees of Merge and Muller C elements
3.14.2 Inputs in sequential threads
3.14.3 Multiple reads from a register
3.15 Multiple writes to a register
4 Asynchronous Packet Routers
4.1 Deadlock
4.2 Mesh Topology Packet Routers
4.2.1 Oblivious Routing Packet Routers
4.2.2 Restricted Routing Packet Switching Node
4.2.3 Basic Building Blocks
4.3 Routing Block - RT
4.4 Implementation of the Routing Switches
4.4.1 Oblivious Routing Packet Switch - Implementation
4.4.2 Restricted Routing Packet Switching Node - Implementation
4.5 Conclusions
5 Tangram vs OCCAM(async)
5.1 Choice operator: Nondeterministic and Deterministic
5.2 Receiving an input
5.3 SEQ and PAR compilation
5.4 The wagging buffer
5.5 Conclusions
6 Conclusions and future work
6.1 Contemporary design process and OCCAM(async)
6.2 Future work
A Asynchronous buffering
B Bit-serial packet router
B.1 Delay-Insensitive Specification of a Mad Postman Switch
B.2 High Level Design
B.3 Implementation of MUX
B.4 Implementation of R
B.4.1 Implementation of Ho,f
B.4.2 Implementation of CP1 and CP2
B.4.3 Assembling the circuit for the process R
B.5 Implementation of DEC
B.6 Conclusions
C Asynchronous library cells
D The laws of DIA
Bibliography
List of Figures
1.1 Synchrony vs Asynchrony
1.2 Architecture of digital device
1.3 Compilation from production rules
1.4 F(a?,b!,c!) || C(c?,b?,d!)
1.5 Muller C element
1.6 Merge element
1.7 Decision Wait element
1.8 Two-way arbiter element
1.9 Counter modulo N
1.10 DI variable element
1.11 Latch element
1.12 2 phase signaling: request/acknowledgement
1.13 4 phase signaling: request/acknowledgement
1.14 Client/Server loop
1.15 Two to Four phase signaling interface
1.16 Four to Two phase signaling interface
2.1 Binary Adder
2.2 1-bit Adder
2.3 Average Delay
2.4 Concurrent computation of the result
2.5 The multiplier
2.6 The X cell block
2.7 X cell implementation
2.8 Shift register
2.9 Shift register implementation
2.10 Pipeline of multipliers
3.1 Half-stream inverter
3.2 Mutual Exclusion element
3.3 CMOS implementation of the ME element
3.4 Analog subcircuit of the RS NAND gate
3.5 o\ — F^ipz)
3.6 Fa (a) and FFl{ x )
3.7 Nonsymmetrical metastability point
3.8 ME element with 3 inputs
3.9 CMOS ME element with 3 inputs
3.10 Arbiter (sequencer)
3.11 Decision Wait element
3.12 DW element - 2 phases signaling
3.13 DW - 242 phased implementation
3.14 DWC basic cell of the 242 phased implementation
3.15 SPARSE DW element
3.16 SDW 242 phased implementation
3.17 GDW symbol
3.18 GDW implementation
3.19 GDW basic cell
3.20 OCCAM(async) compilation scheme
3.21 Compiling SEQ
3.22 Compiling PAR
3.23 Compiling operator DETALT using SDW element
3.24 Compiling operator ALT
3.25 Compiling SEQ with a replicator
3.26 Compiling SEQ FOREVER
3.27 Inverter
3.28 Canonical representation
3.29 AND, OR and XOR
3.30 DI variable
3.31 Variable: using RS flip-flop
3.32 Variable: using DW element
3.33 Asynchronous D flip-flop
3.34 Asynchronous Register
3.35 Compiling expression
3.36 Bundle Data Adder
3.37 Bundled-data AND
3.38 Assignment
3.39 Bundle data multiplexor
3.40 Multiplexor with more inputs
3.41 Pre-establishing the path for subsequent inputs
3.42 Fast multiplexor
3.43 Output on channel
3.44 Bundle element
3.45 Compiling input on channel
3.46 Compiling WHILE
3.47 Compiling IF
3.48 Compiling CASE
3.49 no ACK pragma
3.50 ALIGN element
3.51 Balanced Merge
3.52 Inputs in SEQ thread
3.53 Multiple reads
3.54 Multiple reads: optimised
3.55 Multiple writes: optimised
4.1 Direct network, mesh topology
4.2 Deadlock free routing using virtual nets
4.3 Multiplexing/demultiplexing acke, ackr
4.4 DEC block
4.5 Implementation of DEC process
4.6 AMUX block
4.7 The implementation of AMUX
4.8 Composition of AMUX blocks
4.9 RT block
4.10 Implementation of RT COPY process
4.11 Implementation of RT body process
4.12 Implementation of RT
4.13 Oblivious routing packet switching device
4.14 Restricted routing packet switching device
5.1 Tangram extension: compiling choice operator
5.2 Receiving an input: Tangram
5.3 Receiving an input: OCCAM(async)
5.4 S element
5.5 Tangram SEQ: implementation
5.6 Tangram vs OCCAM(async) SEQ: implementation
5.7 Tangram PAR
5.8 The wagging buffer
5.9 The wagging buffer: Tangram implementation
5.10 The wagging buffer: OCCAM(async) implementation
5.11 Input/output on channel: 4 phase
5.12 Output on channel
6.1 The design process
A.1 req ack loop
A.2 Implementation of SFIFO process
A.3 Implementation of Ro,o - control part
A.4 Implementation of Ro,o - data path
A.5 Sutherland's FIFO
B.1 MP
B.2 High level decomposition of MP
B.3 Compilation circuit for MUX
B.4 Compilation circuit for Ho,^
B.5 Compilation circuit for CP1
B.6 Compilation circuit for CP2
B.7 Compilation circuit for R
B.8 Implementation of DEC
B.9 The forward data path
C.1 Muller C
C.2 Muller C
C.3 Muller C
C.4 Toggle element, implementation
C.5 Counter modulo N
C.6 Inverting latch
C.7 Noninverting latch
Chapter 1
Asynchronous systems
Over the years of digital technology development, the design of synchronous
systems has prevailed over other methodologies. A reasonable question arising
from that fact is whether synchronous systems really offer a better methodology
than the alternatives, or whether they merely appear to because the other
design techniques are less well studied.
Synchronous circuits are devices which perform their computation in a lockstep
manner with the pulses of a clock. The existence of the clock in such systems
introduces discreteness in time, and each state of the system can be described
as a function of the state at the previous instant of discrete time:
$\mathit{SystemState}_{n+1} = F_{\mathrm{FSM}}(\mathit{SystemState}_{n})$
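The next-state relation above can be sketched in a few lines of Python. This is only an illustration: the function names and the modulo-4 counter used as the next-state function are assumptions made for the example and do not come from the thesis.

```python
from typing import Callable, List


def run_synchronous(step: Callable[[int], int], initial_state: int, ticks: int) -> List[int]:
    """Apply the next-state function once per clock tick and record every state."""
    states = [initial_state]
    for _ in range(ticks):
        # The state at tick n+1 depends only on the state at tick n.
        states.append(step(states[-1]))
    return states


def counter_mod4(state: int) -> int:
    """Hypothetical next-state function: a counter that wraps around at 4."""
    return (state + 1) % 4


trace = run_synchronous(counter_mod4, initial_state=0, ticks=5)
```

Each loop iteration stands for one clock event, so every state change takes the same amount of discrete time whatever the work involved.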
The clock creates a train of events and the system changes its state between
two consecutive clock events. The correctness of synchronous systems relies on
bounded delays in the building elements and wires. As systems grow larger, a
single clock event arrives at different parts of the circuit at different
times; this phenomenon is called clock skew. It can cause serious problems in
the construction of large networks of circuits, and designers usually solve it
by including a special on-chip component to deskew the clock events.
Another argument concerns speed efficiency: synchronous systems tend to work at
a rate that reflects worst-case delay behaviour. When a system changes into a
new state, it must wait for the worst possible delay instead of continuing to
perform computations. Thus the clock period is, in most cases, the worst
possible delay in the synchronous circuit. One efficient solution to this
problem is a controllable clock generator. The clock frequency is high and is
tuned to the minimal discrete time needed for the simplest operation. Each
computational step takes as many clock cycles as necessary to complete the
current operation, thus minimising the average amount of idle time. Solving the
problem with controllable clocks, however, increases power consumption, as
power consumption is a function of the frequency, $P = P_0 \times f$. It is
difficult to see how these two problems, worst-case speed and increased power,
can be balanced while still using synchronous design techniques.
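The trade-off described above can be made concrete with a small calculation. The sketch below is illustrative only: the operation delays, the tick length and the function names are assumptions chosen for the example, and the power model is just the relation P = P0 x f quoted in the text.

```python
def fixed_clock_time(op_delays):
    """Total time when every step waits one worst-case clock period."""
    period = max(op_delays)
    return period * len(op_delays)


def controllable_clock_time(op_delays, tick):
    """Total time when each step takes just enough fast clock ticks to finish."""
    # (delay + tick - 1) // tick is an integer ceiling division.
    return sum(((delay + tick - 1) // tick) * tick for delay in op_delays)


def power(p0, frequency):
    """Power as a linear function of clock frequency: P = P0 * f."""
    return p0 * frequency


delays_ns = [2, 3, 10, 2]                           # assumed operation delays
slow_total = fixed_clock_time(delays_ns)            # 4 steps of 10 ns each
fast_total = controllable_clock_time(delays_ns, 2)  # 2 + 4 + 10 + 2 ns
power_ratio = power(1.0, 1 / 2) / power(1.0, 1 / 10)
```

With these assumed numbers the fine-grained clock finishes in 18 ns instead of 40 ns, but it runs at five times the frequency and therefore, under the linear model, draws about five times the power.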
Another serious problem emerges with the improvements in digital technology.
Because of these improvements, circuits are implemented in a smaller area. As
these circuits are scaled down, the clock is scaled too. However, the delays in
long wires and elements do not scale by the same factor. As a consequence, some
designs may no longer work correctly once scaled down, which unfortunately
would usually require a system redesign.
In order to avoid these problems, interest has arisen in circuits without a
clock. Such circuits are called asynchronous.
In an asynchronous circuit there is no common generator of events (clock). The
ordering of events is ensured by interactions between the subsystems. The next
event occurs only if the subsystem that produces it receives a signal that the
previous event has completed. For example, if two subsystems communicate in a
sender/receiver manner, the sender sends new data only after the receiver has
acknowledged that it has accepted the previous data and completed its operation
on it. In that sense, these are active subsystems that control one another by
exchanging control events.
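The sender/receiver discipline described above can be sketched as a small simulation. This is an illustration under stated assumptions: it uses transition (two-phase) signalling on hypothetical `req` and `ack` wires, and a random scheduler stands in for arbitrary delays in either party; none of these names come from the thesis.

```python
import random


def simulate_handshake(items, seed=0, max_steps=10_000):
    """Transfer items one at a time; the sender waits for each acknowledgement."""
    rng = random.Random(seed)
    req = ack = False      # assumed two-phase wires: a toggle is an event
    data = None            # value currently on the data wires
    pending = list(items)
    received = []
    for _ in range(max_steps):
        if not pending and req == ack:
            break          # everything sent and acknowledged
        if rng.random() < 0.5:
            # Sender's turn: it may issue a request only when the previous
            # transfer has been acknowledged (req == ack).
            if pending and req == ack:
                data = pending.pop(0)
                req = not req
        else:
            # Receiver's turn: it may accept only when a request is
            # outstanding (req != ack), then it acknowledges.
            if req != ack:
                received.append(data)
                ack = not ack
    return received
```

Whichever party the scheduler happens to favour, the data arrives complete and in order, because no event is produced before the previous one has been acknowledged. For example, `simulate_handshake([1, 2, 3], seed=7)` returns `[1, 2, 3]`.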
Figure 1.1 illustrates the difference between synchronous and asynchronous
systems. The plane with letters represents the subsystems, which work in a
lockstep manner along the time dimension in the synchronous type of design. As
we can see, the time axis looks rather different for asynchronous systems. Each
component occupies as much time on the time axis as is necessary to complete
the current operation. At the completion of some events, two or more
subcircuits may synchronise before commencing the next task. Such a point of
synchronisation may well be a starting point for several further parallel
operations. Therefore the control circuitry has to ensure that all operations
commence and complete at the appropriate times.
Figure 1.1: Synchrony vs Asynchrony
The most attractive methods for asynchronous design ensure that the whole 
system functions correctly regardless of delays in connecting wires and subcom­
ponents. The latter class of asynchronous systems is known as delay-insensitive 
systems. There are no timing constraints upon the delays in wires and constituent
circuits. This type of system tends to run at a speed that reflects the average-case
delay. After the current action is completed, the circuit continues with the next
task without waiting for a clock event as in synchronous systems. An advantage 
of asynchronous systems over synchronous ones is that, regardless of how small 
the basic components are, there is no longer a problem with the scaling of the 
whole system because basic properties of the asynchronous library elements do 
not change significantly when scaled down.
Figure 1.2: Architecture of digital device (blocks: data processing unit and control unit, linked by control signals; data and control each pass to/from the environment)
To add more to the list of potential advantages of using asynchronous techniques,
we have to mention that this design style offers low power consumption. It
is a serious question whether each processor in a network of thousands will
consume 10 A or only a quarter of that. In CMOS technology, it is well known
that most power is consumed during voltage transitions. In a traditional
synchronous system the clock is distributed to each subcomponent, causing
such a voltage transition regardless of whether the subcircuit is performing some 
useful work or not. This causes constant power consumption in all building com­
ponents. In contrast, asynchronous systems dissipate power only in parts which 
are involved in the current operation. Recent surveys in asynchronous design 
show that this feature of these systems might be the overwhelming advantage over 
synchronous ones. The major aim of the research undertaken at the Philips
Research Laboratories in Eindhoven on an asynchronous compact disc player
is lower power consumption, so that a home compact disc player can
work longer on one set of batteries [45, 44, 56, 7, 34].
1.1 Architecture of Asynchronous Device
The typical architecture of a digital device comprises two units (figure 1.2):
• Control Unit (CU)
• Data Processing Unit (DPU)
The DPU consists of all data processing devices like adders, multiplexors, shift 
registers, etc. The CU provides all control signals to ensure the correct behaviour 
of the DPU. The CU of synchronous systems does that by referring to each clock 
event. It ensures the correct behaviour relying on the bounded delays in the DPU
subcomponents. Each operation takes as many clock cycles as necessary so the 
control signal duration is greater than the data delays in the DPU.
The architecture for an asynchronous device is similar. The DPU for both types 
of system is built in a similar manner too, but both types of CU are designed using 
totally different methodologies. The CU for a synchronous system provides a
clocking scheme following the external clock events, while the CU for asynchronous
systems provides control taking into account the delays in the different DPU
subcomponents. When some operation in the DPU is to be performed, the corresponding
device receives a start control signal and proceeds with the operation. After that 
it sends back an acknowledgement (end) for the completion of the operation. The 
latter signal could be the original start signal delayed for as long as is necessary 
for the operation to be completed correctly. This end signal (also called finish) is 
used to pass the control to the successor of the current operation according to the 
current state in the CU. By passing and receiving such start and finish signals, the
CU is able to provide a correct "clocking" scheme for the DPU [32].
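The start/finish protocol can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the names `Operation`, `run_sequence` and `simulate`, and the delay values, are hypothetical and do not come from the design flow described here. Each DPU device waits for its start event, takes as long as it needs, and then raises its finish event, which serves as the start of its successor.

```python
import asyncio


class Operation:
    """Hypothetical DPU device whose delay is unknown to the control unit."""

    def __init__(self, name, delay):
        self.name = name
        self.delay = delay

    async def run(self, start, finish, log):
        await start.wait()                # wait for the start control signal
        await asyncio.sleep(self.delay)   # the operation takes as long as it needs
        log.append(self.name)
        finish.set()                      # acknowledge completion (finish)


async def run_sequence(ops):
    """Control unit sketch: the finish of one operation is the start of the next."""
    log = []
    events = [asyncio.Event() for _ in range(len(ops) + 1)]
    tasks = [
        asyncio.create_task(op.run(events[i], events[i + 1], log))
        for i, op in enumerate(ops)
    ]
    events[0].set()                       # initial start signal
    await events[-1].wait()               # final finish signal
    await asyncio.gather(*tasks)
    return log


def simulate(ops):
    return asyncio.run(run_sequence(ops))
```

The order of completion depends only on the chain of start and finish events, not on the individual delays, which is the property the control unit relies on.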
Most research efforts in the design of efficient asynchronous systems are fo­
cused on the architecture of asynchronous CU and its implications on the DPU 
structure and set of basic library cells. In the next section we will discuss differ­
ent design techniques with which one can produce asynchronous CU from some 
specification language.
1.2 Asynchronous Compilation Techniques
The simple assumption that all logical values are set correctly for the next
clock cycle gives a powerful abstraction throughout the whole design. On this
basis, a wide spectrum of automated design methodologies has been developed for
constructing varied and efficient synchronous devices. Because engineers are
familiar with synchronous design techniques, and these are well-studied, the syn­
chronous type of system design currently outperforms the rest in terms of speed 
and efficiency regardless of the potential advantages of some other design tech­
niques.
In the process of designing asynchronous circuits, regardless of the advantages 
we have already mentioned, one has to deal with phenomena such as a signal being
sent out but not yet received. This makes the design process quite difficult and 
puts the possibility of total automation of the synthesis under serious question.
Most efforts are in the area of compilation from a specification language to a
netlist of basic library elements. In this section, we give a short overview of
the known asynchronous design techniques and compare their specifics.
There are several synthesis procedures, which can be classified according to the
input "language". These techniques fall into one of the following classes:
• graph theory based
• algebra based
The specification for the first group is a graph of the signal transitions. A graph
representation of process behaviour is quite natural for most engineers [60,
27, 61, 6, 30, 31]. Initially the asynchronous device is specified with a graph
of possible transitions for inputs and outputs from/to the environment. These
graphs are usually called Signal Transition Graphs (STG). In general, an STG is
a finite directed graph in which nodes are signal transitions and arcs define the
precedence constraints. An arc $x \to y$ defines the constraint that signal y
cannot occur before the occurrence of signal x. Usually a description using an STG
distinguishes inputs from outputs.
The next step is to translate this STG into a named transition graph,
i.e. an STG in which the different occurrences of a signal transition are already
marked as rising and falling ones, $x^+$ and $x^-$. Some of the methods actually start
from such a graph and deal with problems such as:
• persistency - in each closed loop $x^+$ should precede $x^-$ and vice versa.
• liveness - every transition can be enabled through some sequence of signal
transitions
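As an illustration of a named transition graph, the sketch below encodes a single-cycle STG as a successor map and checks that rising and falling transitions of every signal alternate around the loop. The representation, the function names `follow_cycle` and `alternates`, and the four-transition handshake example are assumptions made for this example only; practical STG tools work on general graphs with concurrency and choice.

```python
def follow_cycle(arcs, start):
    """Walk a single-cycle STG given as {transition: successor transition}."""
    order, node = [], start
    while True:
        order.append(node)
        node = arcs[node]
        if node == start:
            return order


def alternates(cycle):
    """True if, for every signal, rising and falling transitions alternate."""
    by_signal = {}
    for transition in cycle:
        # "req+" -> signal "req", sign "+"
        by_signal.setdefault(transition[:-1], []).append(transition[-1])
    for signs in by_signal.values():
        # compare each occurrence with the next one, wrapping around the loop
        for cur, nxt in zip(signs, signs[1:] + signs[:1]):
            if cur == nxt:
                return False
    return True


# Assumed example: a request/acknowledge handshake loop.
HANDSHAKE = {"req+": "ack+", "ack+": "req-", "req-": "ack-", "ack-": "req+"}
```

A loop in which some signal rises twice without falling in between, or rises and never falls, is rejected by `alternates`.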
Usually at this stage of the design different methods define the necessary and 
sufficient requirements for hazard free logic and speed-independence of the cir­
cuit. The latter property implies that the circuit functions correctly independent
of delays in its components, but assuming that delays in wires are small enough 
and can be neglected.
The next step of the design is to assign system states, thus translating the
STG into a state graph. Once that is done, there are plenty of tools that
can translate a state graph specification to a physical netlist of library components
[16, 30, 31].
The second group of compilation techniques differs from graph theory based 
compilation because of the fact that an entirely new set of basic library elements 
needs to be constructed and used as an underlying set of primitives.
There are just a few concepts which comprise the core of all algebra based 
compilation techniques. No matter what the input representation is, the compilation
is based on the concept of flow control. All known algebraic methods are influenced
by CSP [28]. In that sense, the different algebra operators are similar to the basic
CSP denotational constructs, such as:
• Sequential operator $S_1; S_2$
• General choice $[G_1 \to S_1 \,\square\, \dots \,\square\, G_n \to S_n]$
• Repetition $*S_{any}$
• Parallel construct $S_1 \parallel S_2$
• Communication on a channel: input $a?$ and output $a!$ on a channel
The sequential operator denotes the transfer of control to process $S_2$ after
$S_1$ finishes its execution. General choice denotes a process which behaves like $S_i$
if $G_i$ becomes true. Repetition denotes a process which behaves like $S_{any}; *S_{any}$.
The parallel operator denotes the overall behaviour of the concurrent execution of
processes $S_1$ and $S_2$.
References [1, 2, 3, 4] propose a method which is suitable for low level design 
such as building library components. These components are constructed from 
different subcomponents implemented as combinatorial circuits.
Initially the asynchronous device is specified in a CSP-like language. The
specification language is enhanced with operators for procedures, data structures
and communication over channels. Another interesting and useful construct
is the probe. If (X, Y) is a channel, then probe(Y) is true whenever some process is
willing to communicate on the other side X; otherwise it is false.
Figure 1.3: Compilation from production rules
The compilation method includes three steps. The first step of the compilation 
consists of replacing one process with a parallel composition of several processes 
by applying a decomposition rule. It is a step of reducing the complexity of the 
initial specification.
The next step of the compilation method consists of replacing each communica­
tion action and each channel with a pair of wire-operators. This stage is naturally 
named handshaking expansion because of the strict nature of the synchronisation.
The final and the most crucial step is the transformation of all specification 
expressions, produced so far, into production rules.
A production rule is of the form
$a \to x\uparrow$
$b \to y\downarrow$
where b is a boolean function. Both production rules denote that signal x/y goes
into the high/low logical state after a/b becomes true/false. After the specification
has been transformed into production rules, the final circuit can be built from
transistors.
All production rules for one signal are gathered into two sets, one for raising the
signal and one for lowering it. The final set for one signal can be expressed
in the following form:
$b_1 \to x\uparrow$
$b_2 \to x\downarrow$
$b_1$ is implemented as a pull-up circuit for signal x and $b_2$ is implemented as a
pull-down circuit (see figure 1.3). Obviously, if $b_1 \wedge b_2$ is true for some set of
input signals then the circuit is not well designed: there is a possible conflict between
Vdd and Gnd. Another issue arises if $b_1 \vee b_2$ is false for some set of inputs. This
condition is acceptable but it requires a memory element (shown with a dashed line) which is
used to hold the last value on x when $b_1 \vee b_2$ is false. Both devices $b_1$ and $b_2$ are
implemented as a direct network of transistors.
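The two conditions on the pull-up and pull-down guards can be checked by exhaustive enumeration of the inputs. The sketch below is an illustration only: the function `classify` and the example guards are assumptions, with the guards chosen to be those of a two-input Muller C element (raise the output when both inputs are high, lower it when both are low).

```python
from itertools import product


def classify(pull_up, pull_down, n_inputs):
    """Return (conflict, floating) input vectors for one output signal.

    conflict: both guards are true, so both networks conduct at once.
    floating: neither guard is true, so a memory element must hold the value.
    """
    conflict, floating = [], []
    for bits in product([False, True], repeat=n_inputs):
        up, down = pull_up(*bits), pull_down(*bits)
        if up and down:
            conflict.append(bits)
        elif not up and not down:
            floating.append(bits)
    return conflict, floating


# Assumed example guards for a two-input Muller C element.
def c_up(a, b):
    return a and b


def c_down(a, b):
    return (not a) and (not b)
```

For these guards no input vector causes a conflict, but the two vectors on which the inputs disagree leave the output undriven, so a state-holding element is needed. Guards of an inverter, by contrast, would drive the output for every input.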
Another interesting and promising compilation method is presented by Geoffrey 
Brown [24]. Again the input language is CSP based and it resembles a restricted
OCCAM language because it does not contain any data types. It is strictly targeted 
for the design of the CU.
For each OCCAM construct, there is a corresponding compilation structure. 
Each structure has start input and finish output. When an event occurs on the 
input start, the actions required and specified by the structure are performed. Af­
terwards there is an output event finish indicating end of the operation. Again one 
may consider input/output on these two channels (start and finish) as passing the 
control between different compilation constructs, thus performing the required ac­
tions strictly following the compilation control flow graph where the latter reflects 
the specification. Similar methods, more or less resembling Brown's technique,
are described in [26, 55, 19, 18, 17].
The method of specification with traces [35] is similar to specification with
Delay-Insensitive Algebra (DIA) [41], discussed later in this work. Again this work
has been inspired by CSP [28] and the basic constructs are the same as in DIA
or CSP. The difference between this and the previous methods of compilation is
that this approach is formal and its core rests on two fundamental theorems:
• Theorem for DI decomposition
• Separation theorem
The first theorem defines the conditions when a subcomponent from some pro­
cess can be substituted by a composition of simpler processes. Although the 
method gives a formal way to verify the correctness (i.e. whether after substitution 
the device implements correctly the specification), it does not practically answer
the question how. The theorem vaguely defines steps for practical refinement of a 
process into the simplification necessary for the final compilation.
The second theorem defines when decomposition can be applied correctly to 
two parallel processes so the final circuit implements the specification. Again the 
theorem does not answer how the substitution can be performed formally.
The first fully automated and professional approach to the design of asynchronous
systems is the work at the Philips Research Labs [45, 44, 56,
7, 34]. The main achievement is the specification language, Tangram,
again similar to CSP but incorporating many features specific to VLSI design. The
compilation method is syntax driven and the number of elements used strongly 
depends on the program structure. Moreover, the efficiency is totally dependent 
on the programming style. The language is procedural and in this fashion it re­
sembles more or less languages like C and Pascal. As the compilation is syntax 
driven, the initial specification is translated to a netlist of basic library elements 
whose operational semantics correspond to the language construct they realise.
What is really specific to all of those algebra based design methodologies?
• All of them are based on the trace theory of processes [41, 28, 37, 63, 64]. The 
relationship with the real physics of the circuit is that the methods stipulate
that an event of the process corresponds to a voltage transition in the real
circuit. How this is translated will be discussed later in this chapter.
• The methods are syntax driven, where each syntax construct is replaced with 
a fixed structure of library elements.
• The methods involve the concept of flow control. There are two types of sig­
nals: control signals and signals from the environment. Control signals are 
generated from the synthesis procedure and travel through the circuit. The 
signals from the environment are those specified by the designer as commu­
nications on channels.
• Because the techniques are syntax directed, the size of the resulting circuit is
proportional to the length of the specification. What you get is what
you have specified, so the designer remains in good command of the final
circuit's structure; moreover, the design involves hardware programming
in a high level description language.
• These techniques define correct rules for high level decomposition, thus min­
imising the global states and simplifying the final circuit.
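The syntax-directed character of these methods can be sketched as a recursive translation in which every construct is replaced by a fixed structure with a start wire and a finish wire. The code below is a minimal illustration under our own assumptions: the classes `Act`, `Seq` and `Par`, the cell names `ACTION` and `C_ELEMENT`, and the wire naming are hypothetical and are not the actual cell libraries of Brown's method or of Tangram.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Act:
    """A primitive action, e.g. a communication on a channel."""
    name: str


@dataclass
class Seq:
    """Sequential composition S1; S2."""
    first: object
    second: object


@dataclass
class Par:
    """Parallel composition S1 || S2."""
    left: object
    right: object


def compile_node(node, start, netlist, fresh):
    """Emit cells for `node` and return the name of its finish wire."""
    if isinstance(node, Act):
        finish = f"w{next(fresh)}"
        netlist.append(("ACTION", node.name, start, finish))
        return finish
    if isinstance(node, Seq):
        # the finish of the first part is the start of the second part
        middle = compile_node(node.first, start, netlist, fresh)
        return compile_node(node.second, middle, netlist, fresh)
    if isinstance(node, Par):
        # the start wire is forked to both branches (the fork is implicit here)
        f1 = compile_node(node.left, start, netlist, fresh)
        f2 = compile_node(node.right, start, netlist, fresh)
        finish = f"w{next(fresh)}"
        netlist.append(("C_ELEMENT", f1, f2, finish))  # join both finishes
        return finish
    raise TypeError(f"unknown construct: {node!r}")


def compile_program(program):
    netlist = []
    finish = compile_node(program, "start", netlist, count())
    return netlist, finish
```

In this sketch each `Act` and each `Par` contributes exactly one cell, so the netlist grows linearly with the size of the specification, which mirrors the proportionality noted in the list above.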
1.3 Delay-insensitive design
The design of delay-insensitive systems covers all design aspects of asynchronous 
systems. The top-down design steps hierarchically follow through the different 
types of these systems. Therefore, delay-insensitive design encapsulates all fea­
tures of asynchronous techniques.
In the following subsections, the design of delay-insensitive systems is discussed.
First, the top-down design steps are described; then a specification language
called Delay-Insensitive Algebra (DIA) [41] is presented, which is the basis of the
novel compilation technique presented in this work.
1.3.1 Top-Down Design Steps
Delay-insensitive systems are attractive because they function correctly regardless
of the delays in their wires and building components. Another attractive feature is
the modular way they are constructed. This comes naturally as a consequence 
from the fact that the delays in connecting wires are disregarded throughout the 
design process, i.e. there are no architectural placement constraints.
At the top, the high-level specification is usually one sequential process. There 
are several compilation techniques which can be applied at this level but it will only 
result in complex and slow sequential finite-state machines. A more promising 
approach is to apply a high level decomposition to the initial specification resulting 
in networks of smaller, simpler processes. Usually it is performed in several stages 
following the steps of top-down design. The design strategy comprises a hierarchy 
of techniques for different types of asynchronous circuits which are listed below.
1. Handshaking circuits
2. Delay-insensitive circuits
3. Speed-independent circuits
4. Delay-sensitive library elements
The different top-level subcomponents are tightly interconnected with chan­
nels and corresponding acknowledgements as handshaking circuits. If there is 
a channel from block X  sending data to block Y, then there is an acknowledge­
ment coming back from Y to X. Usually at this level of decomposition, the different 
tasks are encapsulated into different subprocesses. Most of the compilation
techniques produce a circuit whose complexity is proportional to the number of
different states of its initial specification. Because of the decomposition, each
subprocess is simpler than the initial process. If this process
of decomposition can easily be performed and the resulting subprocesses are not 
strictly handshaking, then sometimes this design stage can be omitted.
Once the main process is decomposed into a network of communicating pro­
cesses, one can decide what building components are necessary for constructing 
the different subprocesses. Usually, at this stage (the delay-insensitive circuits 
level) different compilation techniques can cope with the complexity of the sub­
process. The final circuit is a network of standard library cells wired together. The 
latter cells are built according to the last two design steps. Following these two 
last design steps one can construct new library cells, should the need for it ever 
arise. On this level of delay-insensitive circuits, the closed loops $req \rightleftharpoons ack$ might
involve more than two processes. For example, process X requests an action from
process Y, process Z performs some part of that action on request from Y
and acknowledges upon completion to the first process X. On this level
the designer can freely build the structure of the communicating subcomponents,
assuming that delays in wires and building blocks are arbitrary.
After several decompositions, the length of connecting wires can be neglected 
and one can assume that wire delays are, in practice, zero. This is the main 
assumption for speed-independent design.
This stage of the design and the next one are where different blocks can be 
placed on a very small area (often called equipotential [48]) and only assumptions
concerning the component’s delays should be involved.
The last stage is the one where one can consider in practice different delays 
in various components. For example, one can admit that the delay of 5 XOR is 
greater than the delay through 2 OR elements or one can prove correctness using 
the simulation results of a ready-made layout cell.
1.4 An Algebra for Delay-Insensitive Circuits
Delay-Insensitive Algebra (DIA) has drawn recent interest as a suitable candidate 
for specification, design and verification of large delay-insensitive systems.
References [39, 40, 37] propose a CSP-based algebraic notation, called Delay- 
Insensitive Algebra, in which delay-insensitive circuit specifications can be
expressed concisely, including obligations to be met by the environment. The
phenomenon of a signal being on its way is dealt with entirely within the algebra. The
possibility of transmission interference, characterised by Udding [63, 64], is
faithfully modeled in the algebra as well; the designer is able to reason about these
errors, and so avoid them.
A process in DIA is a mathematical model at a certain level of abstraction of the 
way in which a delay-insensitive circuit interacts with its environment. A circuit 
receives signals from its environment on its input wires and sends signals on its 
output wires. The two sets, respectively of input and output wires, constitute the 
input and the output alphabet of the process. These are finite and disjoint and 
typical names are I and O.
1.4.1 Syntax of DIA
The syntax of the algebra is very similar to the CSP syntax:
1. $\bot$, chaos, denotes the possibility of error.
2. $a?; P$, input prefix, denotes a process that must wait for a signal to arrive on
$a \in I$ before it can behave like P.
3. $c!; P$, output prefix, denotes a process that outputs on c and then behaves like
P.
4. $[g_0 \to P_0 \,\square\, \dots \,\square\, g_n \to P_n]$, guarded choice, where a guard can be either
$a_i?$, input guard
skip, skip guard
An alternative of the form $a_i? \to P_i$ is selected only if a signal has been received
on $a_i$. The guarded choice then behaves like $P_i$. A skip guarded alternative must
be selected eventually if no input is supplied.
5. $P \sqcap Q$, non-deterministic choice, behaves either as P or as Q, the choice between
them made internally in a non-deterministic fashion.
6. $P/a?$, after-input, behaves like P after its environment has sent it a signal on
$a \in I$.
7. $P \parallel Q$, concurrent composition, has an overall behaviour derived from the
individual behaviours of its components. If the output wire of one component
has the same name as the input wire of another, then these wires are joined
together; any signals transmitted along such a connection are hidden from
the environment. The input alphabet of P must be disjoint from that of Q;
likewise, the output alphabet of P must be disjoint from that of Q. The input
alphabet of $P \parallel Q$ then consists of those input wires of each process P and Q
which are not output wires of the other. Similarly, the output alphabet of
$P \parallel Q$ consists of those output wires of each process which are not input wires
of the other.
8. $\mu p.P$, recursion, where p is a process identifier and binds all free occurrences of
p in P. $\mu p.P$ denotes the solution to the recursive equation $p = P$ and either of
the two forms will be used, whichever is more convenient.
A process P is considered to be ‘just as good’ as the process Q ($Q \sqsubseteq P$) if no
environment, which is simply another process, can when interacting with P deter­
mine that it is not interacting with Q. Two processes are considered to be equal 
when they are just as good as each other.
There are, of course, some constraints upon the specification: a wire cannot
accommodate two signals at the same time; they might interfere with one another
in an undesirable way. This and any other error are modeled by the process $\bot$
(Bottom or Chaos). The process $\bot$ is considered to be so undesirable that any
other process must be an improvement on it: $\bot \sqsubseteq P$. This can be expressed by the
following rule:
$a?; a?; P = \bot$
Similarly
$c!; c!; P = \bot$
A list of algebraic laws within the DIA can be found in [37].
The environment must ensure that a process never gets into a $\bot$ state. A common
use is in the alternative $a? \to \bot$ of a guarded choice, in which case the
environment is obliged not to signal on a until it first receives an output from the
circuit. Thus we can model not only erroneous states in a process but also the
obligations to be met by its environment.
Mutually exclusive guarded choice. The environment of the process P
$P = [a? \to [b? \to \bot \,\square\, S_0] \,\square\, b? \to [a? \to \bot \,\square\, S_1]]$
must not send a signal on a followed by another on b, or vice versa, without
receiving an acknowledgement from P of the first signal. Such inputs are called
mutually-exclusive and we will use the following syntax as a shorthand for the
above expression
$[a? \to [S_0] \mid b? \to [S_1]]$
It is advantageous to indicate when a choice is mutually exclusive as it is easier 
to implement than general choice. Such a choice will be named deterministic 
choice throughout this thesis.
1.4.2 Examples
DIA is not only a powerful abstraction tool which helps the designer throughout 
the specification process, but using its algebraic laws one can reason about and 
verify certain properties. In this section we will give a few specifications of several 
basic circuits, and then we will formally prove that a physical wire-fork in parallel 
with a Muller C element behaves exactly like a physical wire.
Consider a circuit W with one input $a \in I(W)$ and one output $b \in O(W)$
modeled with the following process:
$W = a?; b!; W$
Such a circuit repeatedly accepts an input on a and then outputs on b, and fairly
models a physical delay-insensitive wire. A process F modeling a physical fork
with one input $a \in I(F)$ and two outputs $b \in O(F)$ and $c \in O(F)$ can be
described as follows:
$F = a?; b!; c!; F$
Let us introduce one more element: a process C models the well-known Muller
C element, which waits for signals on both of its inputs $a \in I(C)$ and $b \in I(C)$
and then outputs on $c \in O(C)$:
$C = a?; b?; c!; C$
Given the specifications above, we will now formally prove that
$W(a?, d!) = F(a?, b!, c!) \parallel C(c?, b?, d!)$.
We base the entire reasoning on the laws of DIA; those we use can be found
in appendix D.
$$
\begin{aligned}
& F(a?, b!, c!) \parallel C(c?, b?, d!) \\
=\ & \{\text{by expanding the definitions}\} \\
& (a?; b!; c!; F(a?, b!, c!)) \parallel (c?; b?; d!; C(c?, b?, d!)) \\
=\ & \{\text{by using the law: } a?; b?; P = b?; a?; P\} \\
& (a?; b!; c!; F(a?, b!, c!)) \parallel (b?; c?; d!; C(c?, b?, d!)) \\
=\ & \{\text{one choice is no choice, law 30 in [37]}\} \\
& a?; ((b!; c!; F(a?, b!, c!)) \parallel (b?; c?; d!; C(c?, b?, d!))) \\
=\ & \{\text{internal communication on } b \text{, law 29 in [37]}\} \\
& a?; ((c!; F(a?, b!, c!)) \parallel (b?; c?; d!; C(c?, b?, d!))/b?) \\
=\ & \{\text{by using law 19 in [37]}\} \\
& a?; ((c!; F(a?, b!, c!)) \parallel ([b? \to \bot \,\square\, skip \to c?; d!; C(c?, b?, d!)])) \\
=\ & \{\text{no one can supply } b? \text{, skip is chosen}\} \\
& a?; ((c!; F(a?, b!, c!)) \parallel (c?; d!; C(c?, b?, d!))) \\
=\ & \{\text{similarly to the previous two substitutions}\} \\
& a?; (F(a?, b!, c!) \parallel (d!; C(c?, b?, d!))) \\
=\ & \{d \text{ is not in } I(F(a?, b!, c!))\} \\
& a?; d!; (F(a?, b!, c!) \parallel C(c?, b?, d!))
\end{aligned}
$$
The process $F(a?, b!, c!) \parallel C(c?, b?, d!)$ has exactly the same behaviour as $W(a?, d!)$
and both processes have equal input and output alphabets ({a} and {d}
respectively), therefore we can conclude
$W(a?, d!) = F(a?, b!, c!) \parallel C(c?, b?, d!)$
The example above is only a small illustration of how powerful DIA can be in a 
verification process. Reference [38] gives another nice example of verifying certain 
properties for an asynchronous circuit.
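The equality can also be cross-checked by a bounded simulation. The sketch below is not a DIA proof and is not taken from the cited work: it is an operational model built on our own assumptions, in which each process is a cyclic list of input and output actions, a wire holds at most one signal in transit, and the environment behaves like that of a wire (it sends on the input and waits for the output before sending again). The names `visible_traces`, `put` and `Interference` are hypothetical.

```python
class Interference(Exception):
    """Raised when a signal is put on a wire that already carries one."""


def put(wires, wire):
    if wire in wires:
        raise Interference(wire)
    return wires | {wire}


F = [("?", "a"), ("!", "b"), ("!", "c")]   # F = a?; b!; c!; F
C = [("?", "c"), ("?", "b"), ("!", "d")]   # C(c?, b?, d!) = c?; b?; d!; C
W = [("?", "a"), ("!", "d")]               # W(a?, d!) = a?; d!; W


def visible_traces(procs, env_in, env_out, depth):
    """Explore all interleavings; return the externally visible traces."""
    start = (tuple(0 for _ in procs), frozenset(), True, ())
    seen, stack, traces = {start}, [start], set()
    while stack:
        pcs, wires, may_send, trace = stack.pop()
        traces.add(trace)
        if len(trace) >= depth:
            continue
        moves = []
        if may_send:                       # environment signals on the input
            moves.append((pcs, put(wires, env_in), False,
                          trace + (env_in + "?",)))
        if env_out in wires:               # environment receives the output
            moves.append((pcs, wires - {env_out}, True,
                          trace + (env_out + "!",)))
        for i, proc in enumerate(procs):
            kind, wire = proc[pcs[i]]
            nxt = pcs[:i] + ((pcs[i] + 1) % len(proc),) + pcs[i + 1:]
            if kind == "!":                # output: signal enters the wire
                moves.append((nxt, put(wires, wire), may_send, trace))
            elif wire in wires:            # input: signal leaves the wire
                moves.append((nxt, wires - {wire}, may_send, trace))
        if not moves:
            raise RuntimeError("deadlock")
        for move in moves:
            if move not in seen:
                seen.add(move)
                stack.append(move)
    return traces
```

Up to the chosen depth, the fork composed with the C element produces the same visible traces as the wire, with no interference on the internal wires b and c. Replacing C by a component that ignores b leaves a signal stranded on b, and the second output on b raises `Interference`.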
1.5 Library of basic asynchronous cells
The reader probably noticed that there was no comparison between algebra and 
graph theory based asynchronous compilation techniques. Both types give similar 
structures as a result and what is more important, the latter structures are based 
on similar basic building subcomponents. But when STG compilation is used then 
all subcomponents are built by some standard set of library cells - usually NANDs 
and NORs. Instead of doing that, one may find it more appropriate to use full- 
custom designed cells for all common structures in the specification. The latter
process involves the design of an entirely new set of library elements which we 
discuss below.
In this section, we give a short overview of the different library elements
used within the compilation techniques. The elements discussed below constitute
the basic set for the design of asynchronous systems. As one can notice, this
set is quite different from the well-known libraries used in the design of
synchronous systems.
Figure 1.5: Muller C element
• Muller C element (figure 1.5). Its output becomes 0 when all of its inputs are
0 and becomes 1 when all of its inputs are 1; otherwise the output remains in
whatever state it was. The Muller C element is one of the most commonly
used library cells for asynchronous design. Its main use is for synchronisation
of two events. In other words, the Muller C element outputs an event
after it receives events on all its inputs. That is why the Muller C
element is often called an event driven AND gate.
$C = a?; b?; c!; C$
• Merge element (figure 1.6). The Merge element is usually implemented as an
XOR logical gate, i.e. any change on its inputs causes a change on its output.
The Merge element simply copies each input event to the output, thus merging
the inputs into one output. In the literature the Merge element is called an event
driven OR gate [32]. The DIA specification of the Merge element is as follows:
$Merge = [in_1? \to out!; Merge \mid in_2? \to out!; Merge]$
One can see from the specification that it is unsafe for the environment
to supply two input events without an intervening output event.
Figure 1.6: Merge element
• Decision Wait element (DW) is a generalisation of the Muller C element. For
example, $DW_{1 \times 1}$ is the Muller C element [39]. It waits for one input from each
of two sets of inputs, one set in each of two dimensions, and then sends an event
upon the output matching both indices. Events in each dimension are mutually
exclusive. If there is more than one transition on the row/column
inputs, the next state is undefined ($\bot$). For example, the DW element (figure
1.7) waits for two events, one on each dimension, and then outputs an
event on the output corresponding to the intersection of the wires on which
the input events were detected.
Figure 1.7: Decision Wait element (2 × 2: row inputs 0 and 1, column inputs 0 and 1, outputs 00, 01, 10 and 11)
• Two-way Arbiter. A two-way arbiter is a device that gives two processes 
mutually-exclusive access to a single shared resource. Its circuit symbol is 
given below
Figure 1.8: Two-way arbiter element
A pair of request and acknowledgement wires $r_i$, $a_i$ connects each of the two processes to
the arbiter. The r and a wires are used for resource request and acknowledgement.
The behaviour of the arbiter is defined by the following algebraic
expression:
$A = [\,\square_{i=0,1}\; r_i? \to r!; a?; a_i!; A_i\,]$
$A_i = r_i?; r!; a?; a_i!; A$
The request signals $r_0$ and $r_1$ can be concurrent. In this case one of them is
blocked until the resource has been released by the other process.
• Counter Modulo N. In many designs one can envisage the use of counters
modulo 2, the so-called Toggles. An asynchronous counter modulo N
is a generalisation of the Toggle element. It counts N-1 transitions on the out*
output and the N-th transition on out.
Figure 1.9: Counter modulo N
The specification in DIA is as follows:
$Counter_n = COUNT_0$
$COUNT_i = in?; out^{*}!; COUNT_{i+1}, \quad 0 \le i \le n-2$
$COUNT_{n-1} = in?; out!; COUNT_0$
• Variable element. A variable is a device in which one can store a value from a
fixed set $\{value_1, \dots, value_n\}$ and later read this value. Its circuit symbol is given
below (figure 1.10).
Figure 1.10: DI variable element
The specification of this device is as follows:
$$VAR = [\,\square_{1 \le i \le n}\; w_{value_i}? \rightarrow wa_{value_i}!;\ V_{value_i}\ \ \square\ \ r? \rightarrow \bot\,]$$
$$V_{value_j} = [\,\square_{1 \le i \le n}\; w_{value_i}? \rightarrow wa_{value_i}!;\ V_{value_i}\ \ \square\ \ r? \rightarrow v_{value_j}!;\ V_{value_j}\,]$$
The write inputs are as many as the number of the values in the set and 
there is a corresponding acknowledgement for each. There is one read input 
and it is acknowledged on one of the v outputs depending on the last written
value. Because we will use this circuit in different parts of the schematic
representation, it is useful to split its graphical representation, as shown
by the device outlined with a polygon.
• Latch. The latch is quite a common element in synchronous design too. It
comprises a data input, a data output and one control input. When the control
input is high, the latch is transparent; when it is low, the latch locks its output
and holds the value latched on the falling edge of the control input (see
figure 1.11).
Figure 1.11: Latch element
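The behaviour of three of the cells above (Decision Wait, Counter modulo N and Variable) can be sketched as small event-driven models. This is only an illustrative sketch in Python, not part of the DIA specifications: the class and method names are hypothetical, events are modelled as method calls, and the undefined state ⊥ is modelled by raising an exception.

```python
class DecisionWait:
    """Waits for one row event and one column event, then fires output (row, col)."""

    def __init__(self, rows, cols):
        self.rows, self.cols = rows, cols
        self.row = None
        self.col = None

    def event(self, dim, index):
        limit = self.rows if dim == "row" else self.cols
        if not 0 <= index < limit:
            raise ValueError("no such input wire")
        if dim == "row":
            if self.row is not None:
                # Events in one dimension are mutually exclusive.
                raise RuntimeError("undefined state: two row events")
            self.row = index
        else:
            if self.col is not None:
                raise RuntimeError("undefined state: two column events")
            self.col = index
        if self.row is not None and self.col is not None:
            fired = (self.row, self.col)
            self.row = self.col = None
            return fired
        return None


class CounterModuloN:
    """Signals out' for the first N-1 input events and out for the N-th."""

    def __init__(self, n):
        self.n = n
        self.i = 0  # index of the current COUNT_i state

    def event(self):
        if self.i < self.n - 1:
            self.i += 1
            return "out'"
        self.i = 0
        return "out"


class Variable:
    """Stores one value from a fixed set; a read before any write is undefined."""

    def __init__(self, values):
        self.values = set(values)
        self.current = None

    def write(self, value):
        if value not in self.values:
            raise ValueError("value outside the fixed set")
        self.current = value
        return ("wa", value)  # acknowledgement matching the write input

    def read(self):
        if self.current is None:
            raise RuntimeError("undefined state: read before write")
        return ("v", self.current)  # read is acknowledged on the matching v output
```

With n = 2 the counter model behaves as a Toggle, alternating between its two outputs.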
1.6 Translation of events into physical events
In this section we will discuss the relationship between the real circuit and the
specification. The specification of the circuit deals with abstract events, inputs and
outputs on channels, while the corresponding physical events in the real circuit are
voltage transitions. There are two ways of encoding an event with voltage transitions ([32],
chapter 7 in [48]). One straightforward translation of an algebraic event (graph
transition) is the physical implementation as a voltage transition, usually 0V → 5V
or 5V → 0V. In this case both voltage transitions are considered equal and
carry the same information, an event. Consider the well known
request ⇌ acknowledgement scheme between client and server (sender/receiver) (see
figures 1.12, 1.14). The DIA specifications are shown below; let us confine
ourselves to how the two events req and ack are physically interpreted.
$$RC = req!data;\ ack?;\ RC$$
$$AC = req?data;\ ack!;\ AC$$
On each voltage transition request from the client, the server responds with a 
single voltage transition as an acknowledgement. The client sets its data valid and
Figure 1.12: 2 phase signaling: request/acknowledgement
Figure 1.13: 4  phase signaling: request/acknowledgement
requests an action on it with an event on wire req. When the server accepts the 
data, it acknowledges by sending an event on wire ack. This type of encoding is 
the so called 2 phase signaling (sometimes called transition signaling [32]). Both 
req and ack events are single voltage transitions.
Let us consider the other possible translation of an algebraic event (graph
transition), the so-called 4 phase signaling. Consider again the
request ⇌ acknowledgement scheme between sender and receiver (see figure 1.13).
The first, rising edge of the request causes the server to acknowledge it by a
voltage transition on the ack wire (in the example - the falling edge of ack). Then the
request returns to its initial state and the latter voltage transition is acknowledged
by the second ack transition. That is why 4 phase signaling is sometimes called
return-to-zero signaling and, as we can see, both events req and ack are encoded with
two adjacent physical voltage transitions.
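The difference between the two translations can be made concrete by listing the wire levels produced by a number of handshakes. The following Python sketch is illustrative only: the function names are hypothetical, and both wires are assumed to start low and to be active high, which need not match the polarity drawn in the figures.

```python
def two_phase(n_handshakes):
    """Transition signaling: every edge on req or ack is one event."""
    req = ack = 0
    trace = []
    for _ in range(n_handshakes):
        req ^= 1                      # request event: a single transition
        trace.append(("req", req))
        ack ^= 1                      # acknowledgement event: a single transition
        trace.append(("ack", ack))
    return trace


def four_phase(n_handshakes):
    """Return-to-zero signaling: each event costs two transitions per wire."""
    trace = []
    for _ in range(n_handshakes):
        trace += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return trace
```

For the same number of handshakes the 4 phase trace contains twice as many voltage transitions as the 2 phase trace.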
Figure 1.14: Client/Server loop
Figure 1.15: Two to Four phase signaling interface
Historically, 4 phase signaling has been used more often than 2 phase in the
design of asynchronous systems because both circuits rc and ac are simpler using
4 phase signaling and thus the delay of the loop is less. It is also difficult for most
designers to grasp how asynchronous logic can be designed to reflect 2 phase
signaling when today's common building elements are single-edge triggered.
By single-edge triggered we mean that a memory element accepts its new state
on either the rising or the falling edge of the clock. Indeed, when designing
transition logic devices, the basic set of asynchronous library elements is totally
different from single-edge triggered cells. But considering the delay of 4 phase
signaling synchronisation, one can see that for the full completion of the
req ⇌ ack cycle, one needs to propagate two "waves" of voltage transitions over the
loop rc → req wire → ac → ack wire. The latter (roughly speaking) means double
the delay for the whole cycle. Another peculiarity is the doubled power consumption
on each synchronisation in comparison with a circuit utilising 2 phase signaling.
There is no doubt that designing 2 phase signaling devices is in some ways natural
and advantageous, offering twice the potential speed of clocked circuitry. But
as usual nothing is absolute and, as we already mentioned,
using 4 phase signaling sometimes drastically simplifies the design of the control
circuits rc and ac. For the purposes of combining the two schemes and, most
often, to interface 2 phase signaling devices with clocked logic, conversion devices
should be used. Figures 1.15 and 1.16 illustrate the conversion from 2 to 4 and from 4 to 2
phase signaling.
In the first scheme, the sender's request line drives the corresponding
4req wire of the recipient to logical high. Consequently the acknowledgement from it
Figure 1.16: Four to Two phase signaling interface
goes into high as well. The Toggle element steers the first event on its input to the 
Merge. The output from the latter one turns off the 4req wire to the recipient side 
and the falling edge of the acknowledgement is toggled out as an acknowledgement 
for the sender.
The second scheme works in a similar way. The request from the 4 phase
side causes the corresponding Toggle element to send an event on the recipient's
request wire. The consequent acknowledgement drives the acknowledgement
output to the sender high. The falling edge of the 4req wire is now steered to the
Merge input, thus causing the falling edge of the acknowledgement output.
1.7 Data Encoding
In the previous sections we discussed issues concerning the design of the CU from 
the architecture of asynchronous circuit. In this section the aim is to present some 
general discussion about the data paths and how the data information is encoded 
physically when designing asynchronous DPU.
In order to understand the types of data encoding and the options we have for 
physical data representation, let us first consider some asynchronous means of 
signaling. The types of such encoding schemes determine the structure and the 
set of different library elements for data processing.
There are two prevailing types of data encoding. The first and most often
used is the so-called bundled-data encoding [32, 19, 10]. Consider again the sender/receiver
model of communication. As discussed in the previous sections,
there are two control signals, request and acknowledgement, which control
the exchange of data between the sender and the receiver. The data wires carry the
boolean values. The sender sets the valid data on these data wires and issues a
request. Until the receiver confirms that the data has actually been received, the data
wires must hold the boolean values stable. After an acknowledgement from the
receiver, the sender can send the next portion of data.
Why is this type of encoding called bundled-data?
The only requirement for this type of encoding is that the delays in the data
transmission lines must be less than the delay of the request line. Thus, when
the request event is received, the data is valid for the receiver. Usually the
data and request wires are bundled so that the delays in all data lines and the request
line are similar. The required difference between the request and data delays is provided
by simply placing a delay element on the request line, or this delay can be naturally
implemented by the control device that produces the request event. The acknowledgement
line need not be bundled.
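The bundling requirement is a one-sided timing constraint and can be written down directly. The sketch below is illustrative only: the function names and the idea of passing delays as plain numbers (in any consistent time unit) are assumptions made for the example.

```python
def bundling_margin(data_delays, request_delay, matched_delay=0.0):
    """Slack between the arrival of the request and the slowest data wire.

    matched_delay models the delay element placed on the request line.
    """
    return (request_delay + matched_delay) - max(data_delays)


def is_bundled(data_delays, request_delay, matched_delay=0.0):
    """True when every data wire settles strictly before the request arrives."""
    return bundling_margin(data_delays, request_delay, matched_delay) > 0
```

A negative margin means the request can overtake the data, so a larger matched delay is needed on the request line.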
This type of encoding prevails where the number of data states is large and
the application of the other encoding schemes discussed in this section is expensive
in terms of number of wires. Another advantage of using bundled data is
that the data processing elements are the same as in synchronous systems,
and these data processing cells are generally faster and more efficient than cells
implemented entirely asynchronously using the approach described later in this
section.
The problem with the bundled data convention is that when constructing
large networks, one can only guarantee with difficulty a certain bounded imbalance
between $delay_{request}$ and $delay_{data}$, and therefore the correct functioning of the whole
system is under serious question. The problem arises from the timing assumption
that delays in data transmission must be less than delays in the request line.
It is a delay-sensitive encoding and, to avoid the problem, one should use
delay-insensitive codes [62].
Let us consider transmission of a single bit of data information between the 
sender and the receiver asynchronously. The delay-insensitive encoding is as fol­
lows: one control wire for the acknowledgement and two data wires named 0 and 
1. An event on each wire carries the corresponding bit of information, i.e. an event 
on wire 0 delivers a logical 0 to the receiver. Often this type of encoding is called
double-rail encoding. Note that we use the term event rather than transition be­
cause it might be implemented in a four phase transition scheme. What is really 
specific about this type of encoding is that control wire request is missing and it 
is logically implemented with the transmission of the data event. The transmis­
sion of data is totally delay-insensitive and as one may guess there are no timing 
assumptions for the wire delays at all.
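A minimal model of double-rail transmission of single bits, assuming 2 phase (transition) signaling on the two data wires, is sketched below. The class names are hypothetical and the acknowledgement wire is left implicit: the sender is simply assumed to issue one event at a time.

```python
class DualRailSender:
    """Sends a bit as a transition on wire 0 or on wire 1."""

    def __init__(self):
        self.level = [0, 0]  # current voltage levels of wires 0 and 1

    def send(self, bit):
        self.level[bit] ^= 1  # the transition itself is the event
        return tuple(self.level)


class DualRailReceiver:
    """Recovers the bit from whichever wire changed; no request wire is needed."""

    def __init__(self):
        self.seen = (0, 0)

    def receive(self, levels):
        changed = [i for i in (0, 1) if levels[i] != self.seen[i]]
        if len(changed) != 1:
            raise RuntimeError("exactly one wire must change per event")
        self.seen = levels
        return changed[0]
```

The receiver never compares arrival times of different wires, which is why no assumption about wire delays is needed.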
If one needs more data states to be transmitted, one straightforward implemen­
tation of encoding is two wires (0 and 1) for each bit in the binary representation of 
the information. The advantage is again concealed in the fact that the transmis­
sion is totally delay-insensitive, but the tradeoff regarding number of wires favours 
the use of bundled data scheme. One advantage of using double-rail encoding is 
when boolean functions are constructed on this basis, the outputs from the cir­
cuit are as many as the canonical forms of the inputs. The latter can be used for 
different fast propagations of the result. Chapter 2 illustrates a good example of 
the use of double-rail logic.
In general, double-rail encoding is suitable for bit-serial devices and on chip 
implementation where the number of wires is not a great issue. One attractive 
implementation using delay-insensitive encoding is shown in appendix B.
Of course, this is not the optimal delay-insensitive encoding when the number
of transmission lines is greater than 2. Tom Verhoeff showed in [62] that $\binom{2N}{N}$ states
can be encoded using 2×N wires. A valid transmission of a value is realised by
signaling on N wires. Therefore each possible combination of N signals out of 2×N
can encode a possible value. The latter scheme does not seem to favour the design
of data operation circuits. Such blocks for data arithmetic would be complex and
very slow.
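The counting argument can be checked by enumeration. The sketch below (function name hypothetical) lists every way of signaling on exactly N of 2×N wires and compares the number of codewords with the $2^N$ states that double-rail encoding carries on the same number of wires.

```python
from itertools import combinations
from math import comb


def codewords(n):
    """All ways of signaling on exactly n wires out of 2*n."""
    return list(combinations(range(2 * n), n))


# States carried by 2*n wires: N-of-2N code versus double-rail code.
for n in range(1, 6):
    print(n, comb(2 * n, n), 2 ** n)
```

No codeword is a subset of another, so the receiver can recognise a complete value simply by counting N events, regardless of the order in which they arrive.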
So far we have considered various aspects of asynchronous design. In the fol­
lowing chapter we will illustrate the basic principles with some interesting design 
examples of delay-insensitive arithmetic blocks.
Chapter 2
Delay-Insensitive Arithmetic
The chapter presents a Delay-Insensitive implementation of a binary adder and 
also a multiplier based on the latter. A side effect of the implementation technique 
is that the carry chain can be "broken" stochastically and thus the circuit's performance
increased. Because of the nature of the Delay-Insensitive implementation,
the circuit is naturally pipelined. New single bit values can be supplied as soon as 
the result of the previous single bit add operation is consumed.
2.1  Delay-Insensitive Adder
This section describes a Delay-Insensitive implementation of the well known binary 
adder. Figure 2.1 illustrates the structure for the addition of 3-bit numbers. Both 
numbers A and B as well as the Carry signal are represented with double-rail code 
using 2 phase signaling.
The basic element for this adder is a full 1-bit adder, which implements the
following function:
Figure 2.1: Binary Adder
$$Sum = A \oplus B \oplus Carry \qquad (1)$$
$$Carry^{+} = (A \wedge B) \vee ((A \vee B) \wedge Carry) \qquad (2)$$
Let us consider the latter equation in more detail. This is a recurrent equation
and it reveals that the Carry+ signal depends on Carry from the previous 1-bit full
adder. This relation creates the well known Carry chain, and certainly contributes
to the total circuit delay. In fact the delay of this chain is the delay of the add
operation. To avoid this, several techniques can be applied, such as Carry Look Ahead,
Manchester Carry chain, etc. [53], using different approaches of parallel evaluation,
precharging of domino logic, etc. However, not having a termination signal,
one can only be sure that the result is valid after waiting for the worst case delay,
i.e. from $C_0$ to $C_3$ in figure 2.1.
It is obvious from equation (2) that Carry+ is 1 if A = B = 1. It is also easy to
see that Carry+ is 0 when A = B = 0. Therefore the Carry signal does not depend
on previous stages when either of these two conditions is fulfilled. These two conditions
can be used to "break" the Carry chain and speed up the circuit's performance.
As we noted, when the operands (A, B) are (0, 0) or (1, 1), Carry+ can be
propagated immediately once this condition is fulfilled, as there is no need to wait
for the Carry input from the previous stage in the adder. In this case the latter
Carry signal is only necessary for evaluating the current sum. On average, this
type of adder is characterised by a delay proportional to $\log_2 n$, where n is
the number of bits. Such a delay is normally only achieved in carry-lookahead
adders, where generate and propagate signals are computed in a tree for geometric
sequences of groups of bits. There are further advantages for self terminating DI
adders, for it is quite often the case that the full precision of the adder is not used, for
example in indexing operations for address arithmetic. In such cases even faster
termination of the addition may occur. Figure 2.3 illustrates the average delay as
a function of the number of bits for the whole adder. The quantity on the Y axis is
the average maximal Carry chain length which cannot be "broken" and it defines
the actual adder delay. These results were obtained by exhaustive simulations,
averaging the maximum Carry chain length over all possible combinations of pairs
Figure 2.2: 1-bit Adder
Figure 2.3: Average Delay
of operands.
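The kind of exhaustive averaging described above can be reproduced for small word lengths. The following Python sketch is an illustration under stated assumptions, not the simulation used for figure 2.3: it takes the unbreakable chain length to be the longest run of adjacent bit positions where the operand bits differ, and the function names are hypothetical. The full adder implements equations (1) and (2).

```python
from itertools import product


def full_adder(a, b, carry):
    """Equations (1) and (2) for single bits."""
    s = a ^ b ^ carry
    carry_out = (a & b) | ((a | b) & carry)
    return s, carry_out


def longest_propagate_run(a, b, n):
    """Longest run of adjacent positions with operand bits (0, 1) or (1, 0)."""
    best = run = 0
    for i in range(n):
        if ((a >> i) ^ (b >> i)) & 1:   # carry must be awaited at this position
            run += 1
            best = max(best, run)
        else:                           # (0, 0) or (1, 1) breaks the chain
            run = 0
    return best


def average_longest_run(n):
    """Average over all pairs of n-bit operands (4**n pairs, so keep n small)."""
    pairs = product(range(1 << n), repeat=2)
    total = sum(longest_propagate_run(a, b, n) for a, b in pairs)
    return total / (1 << (2 * n))
```

Calling `average_longest_run(n)` for increasing n gives the slowly growing average chain length that the text attributes to the self-terminating adder.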
Figure 2.4 illustrates the execution of the addition of two 16-bit integers.
The termination of the calculation of each bit of the result is shown on the time
axis t. As can be seen, the biggest delay is proportional to the length of
the longest chain of adjacent combinations (0, 1) or (1, 0) of corresponding operand
bits.
Because of the nature of delay-insensitive circuitry we may also exploit the 
distributed nature of the termination of the operation. It is neither necessary nor
desirable to wait for some global termination signal from the parallel addition before
providing new operands; indeed the calculation of such a signal would add further
delay to the circuitry. Instead, new single bit values can be supplied as soon as
the results from any single add operation (Sum and Carry+) are consumed by the 
following stages of the evaluation.
2.2 Delay-Insensitive Multiplier
The two operands are unsigned integers. The first operand is supplied in parallel
and it is M bits long. As we already mentioned, we will use the double-rail code, 2 phase
signaling convention for encoding the information. Therefore for the representation
of the first operand we will use 2×M wires, where each single bit is encoded
by two wires. The second operand is fed to the circuit bit-serially. It is N bits long
and for encoding we will use 3 wires, representing logical 0, logical 1 and the end of
the operand stream, e. Thus the stream sequence for the second operand can be
formally described by the following expression
$$OP2 = (0 \mid 1)^{N};\ e$$
where the least significant bits travel first.
The algorithm for the multiply operation is the well known shift-and-add one.
We use an accumulator, which performs the addition of the first operand to the
accumulated result if the current bit of the second operand is 1. The accumulator
then shifts the result from most significant bit to least significant bit. This
continues until the event e has been received on the input channel. Because the first
operand is of length M, the add operation should be performed only for the most
significant M bits of the current result stored in the accumulator. The final result
will be M+N bits long and therefore the least significant N bits are stored in a shift
register. The structure of the multiplier is shown in figure 2.5. The basic building
element is the X cell (figure 2.6). Its implementation is shown in figure 2.7.
Initially the first operand is supplied on the $w_i$ channels and acknowledged on the $wa_i$
channels. The state of the different bit-wires is thus stored in the VAR circuit. After
this, the second operand can be fed into the circuit. The acceptance of each bit
is acknowledged on channel ack.r. If the current bit is 0, the result in all X cells
is simply shifted to the right and into the shift register. If the current bit is 1, all X
cells add the corresponding bit of the first operand, feed the result from the first
half-add to the left neighbour cell on channel res1.l, preserving the computation of
the Carry and the final result after receiving the value from the right neighbour cell
on channel res1.r. These two operations are performed sequentially: add, then
shift.
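At the arithmetic level, the add-then-shift behaviour described above can be sketched as follows. This is only a functional model in Python with hypothetical names: it works on plain integers rather than on X cells and double-rail channels, and the list `shifted` stands in for the shift register.

```python
def shift_and_add(op1, m, op2_stream):
    """Multiply op1 (m bits) by a bit-serial operand.

    op2_stream yields 0 or 1, least significant bit first, followed by "e".
    """
    if not 0 <= op1 < (1 << m):
        raise ValueError("op1 does not fit in m bits")
    acc = 0        # accumulator holding the most significant part of the result
    shifted = []   # bits already shifted out, least significant first
    for symbol in op2_stream:
        if symbol == "e":           # end of the operand stream
            break
        if symbol == 1:
            acc += op1              # add ...
        shifted.append(acc & 1)     # ... then shift one bit into the register
        acc >>= 1
    low = sum(bit << i for i, bit in enumerate(shifted))
    return (acc << len(shifted)) | low
```

For example, `shift_and_add(5, 3, [1, 1, "e"])` multiplies 5 by the 2-bit operand 3.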
Figure 2.5: The multiplier
[X cell block: OP2 (0, 1, e) input and output; res0.l, res1.l, carry.l and ack.l towards the left neighbour; res0.r, res1.r, carry.r and ack.r towards the right neighbour; wa, mr and ack]
Figure 2.6: The X  cell block
Data is shifted on two channels, res0 and res1 (sent on res0.l and res1.l to the left
neighbour, received on res0.r and res1.r from the right neighbour). This separation
simplifies the add operation, which is not performed if the second operand bit is
0. One of the benefits of delay-insensitive systems is the ability to exploit any
non-uniform completion times of an operation.
Obviously the accumulator contains an adder itself, thus we exploit the same
technique as with the delay-insensitive adder. The subcircuit in dashed line is the
1-bit adder for the local XCell. The only difference is that the first half-add logical
function is produced by the VAR element rather than another DW element.
All X cells are initially in the zero state. The DW2x3 element is the "memory"
which keeps the current bit-result of the accumulator. The column with the in­
vertor (bubble signed) corresponds to logical value 0 and this invertor resets the 
XCell to its zeroth state initially. The other column corresponds to state 1 for the 
XCell.
Note that in the last XCell, which corresponds to the least significant bit of OP1,
if res1.l is 1 then carry.l is 0 (figure 2.5). That is why the wire for logical 1 of res1.l
is connected to the carry.l wire 0. Another peculiarity is that the XCell which
corresponds to the most significant bit has no right neighbour. That is why each
time an OP2 bit is supplied, this XCell receives an event on the zero wire of either
res0.r or res1.r.
After receiving the e event on the input channel of OP2, each XCell sends the status-bit
on channel mr and waits for the consequent acknowledgement (ack) from the
recipient of the result. Having received the latter, it continues in the same manner
described above.
The shift register functions in a similar way as in the conventional logic (fig­
ure 2.8). It simply accepts every value on its input channel in and shifts one bit 
afterwards. Then it simply acknowledges the acceptance of the operand on chan­
nel ack.r. Such a register has some depth N. It is not safe to store more than the 
length allowed in the shift register.
The circuit performing the function described above is shown in figure 2.9, where in
this case N is 3. After an e event is received, the shift register sends its value
on channels sr and then it waits for the consequent acknowledgement from the
Figure 2.7 : X  cell implementation
Figure 2 .8 : Shift register
Figure 2.9: Shift register implementation
Figure 2.10: Pipeline of multipliers 
recipient of the result. Having received the latter, it continues in the same manner.
2.3 Pipeline of multipliers
The multiplier presented performs the multiplication of two operands of length M
and N. The result is of length M+N. If required, a chain of several multiply operations
may be constructed, where more often it is necessary for the result to be of the same
length as one of the operands. For example:
$$RESULT = k_1 \times k_2 \times k_3 \times OP$$
where $k_i$ is of length k, and OP and RESULT are N bits long. The structure of a circuit
which performs these 3 multiply operations is a natural pipeline of 3 XBlocks and
one shift register (see figure 2.10). The length of each XBlock is the corresponding
k and the result is stored in the shift register with length N.
We presented a circuit for a delay-insensitive implementation of a 1-bit full 
adder which can be cascaded to make parallel adders of arbitrary size. The imple­
mentation exploits the self-terminating nature of event signaling using double-rail
code and is naturally pipelined. Using this adder as a basis, we have built a 
delay-insensitive multiplier. It is again naturally pipelined. One can see that the 
acknowledgement is not delayed by those from the previous stages in the pipeline.
Chapter 3
OCCAM(async)
3.1  Introduction
In this chapter the reader will be introduced to a novel design technique for asyn­
chronous circuits. First we will define the OCCAM(async) language: syntax and 
operational semantics; then we will display a formal compilation technique for 
synthesizing asynchronous circuitry from a specification in OCCAM(async). In 
the later sections we will introduce data types and of course we will provide the 
extended formal synthesis method reflecting data operations. OCCAM(async), as a 
syntax, is very close to the well known programming language OCCAM, but it dif­
fers in its operational semantics from the latter as it facilitates hardware design. 
The interpretation of OCCAM(async) is based on DIA. Therefore, the first subtle
difference between OCCAM(async) and OCCAM is in the fact that communication
on a channel is not strictly synchronous: communications in OCCAM(async) represent
signals on wires, thus a transmission and its acknowledgement are treated
as separate events. Our aim is to free the designer from the constraints of fully
synchronised communication as used in OCCAM and other languages. Within
OCCAM(async) two communications on the same channel without an intermediate
acknowledgement result in chaos or the undefined state (⊥). In essence, it becomes
the responsibility of the designer to provide the handshake from the environment,
which allows more freedom in the implementation of asynchronous systems. A
circuit in which eager computation is exploited in order to reduce latency in the 
communication of signals can benefit from specification in OCCAM(async) as we 
will demonstrate in the comparison chapter later in this work.
3.2 Syntax of OCCAM(async)
The syntax of a subset of OCCAM(async) is introduced in this section. The subset 
language contains no data types but it embraces all backbone constructs. The 
Backus-Naur Form of OCCAM(async) is as follows: 
    circuit     = process
    process     = STOP
                | input
                | output
                | sequence
                | parallel
                | choice
    input       = channel ?
    output      = channel !
    sequence    = SEQ [count]
                    {1 process}
    parallel    = PAR
                    {1 process}
    choice      = ALT | DETALT
                    {1 alternative}
    alternative = input
                    process
    count       = integer
                | FOREVER
    channel     = string
An asynchronous circuit is represented by an OCCAM(async) process. A pro­
cess can comprise a single action like input or output on a channel, STOP or it can 
be a more complex structure of basic actions constructed with SEQ, PAR, ALT or 
DETALT.
STOP represents a halt of the whole circuit and is signaled as an error to the envi­
ronment.
input on a channel is an action which waits for the arrival of a signal on the wire
assigned to the channel.
output on a channel in reality is signaling on the output wire. Note that the output
action is nonblocking.
SEQ denotes sequential execution of all its subprocesses. SEQ completes successfully
with the end of the last process in the list.
SEQ integer represents an integer number of sequential executions of all its subprocesses.
SEQ FOREVER represents eternal sequential execution of all its subprocesses. SEQ FOREVER
never completes.
PAR executes its subprocesses in parallel and finishes with the completion of all
of them.
DETALT specifies that signaling on its guard channels is mutually exclusive and therefore
the choice between them is deterministic. It awaits the arrival of a signal on
one of its guards and finishes after successful execution of the corresponding
subprocess.
ALT specifies that the choice between the guards is nondeterministic. It awaits
the arrival of at least one guard's signal, accepts nondeterministically one of
them and ends after successful execution of the corresponding subprocess.
Subprocesses of SEQ and PAR must be textually indented with two spaces. The 
guards of ALT and DETALT must be also indented with two spaces while for the 
corresponding subprocesses the indentation is four spaces.
3.3 Examples
Let us specify a circuit with two inputs in0, in1 and two outputs out0, out1 as
shown in figure 3.1. When the circuit receives a signal on in0, we say that it
receives logical 0 from its environment. The same applies when it receives a signal
on in1: it receives logical 1 from its environment. Signaling out logical 0 or 1 is
performed on the wires out0 and out1 in a similar manner.
Figure 3.1: Half-stream inverter
The circuit will receive a stream of 8 data bits on its inputs where, for the first
4 bits of input, if the circuit receives logical X, it must output the negation of X on
out0 and out1. For the second 4 bits, the circuit simply has to copy them out on
out0 and out1. The specification of such a circuit in OCCAM(async) follows:
SEQ FOREVER
  SEQ 4
    DETALT
      in0 ?
        out1 !
      in1 ?
        out0 !
  SEQ 4
    DETALT
      in0 ?
        out0 !
      in1 ?
        out1 !
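The intended input/output behaviour of this specification can be checked with a small behavioural model. The Python sketch below is illustrative only (the function name is hypothetical): it treats a signal on in0 or in1 as the bit 0 or 1, and a signal on out0 or out1 likewise.

```python
def half_stream_inverter(bits):
    """Yield the output bit for each input bit, in cycles of 8 bits."""
    for position, bit in enumerate(bits):
        if position % 8 < 4:
            yield 1 - bit   # first SEQ 4: in0 ? gives out1 !, in1 ? gives out0 !
        else:
            yield bit       # second SEQ 4: in0 ? gives out0 !, in1 ? gives out1 !
```

The modulo 8 position plays the role of the two nested SEQ 4 counts inside SEQ FOREVER.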
The subset language of OCCAM(async) is a full OCCAM reflection of the syntax 
of DIA introduced in the first chapter. Although as a subset language it is very 
simple, we must also note that it comprises all backbone primitives for specifi­
cation of asynchronous devices. Appendix B illustrates a design of packet router 
using the subset language.
Before we proceed to the compilation method, let us first consider two interest­
ing elements which will be necessary for the compilation: the Mutual Exclusion 
and the Decision Wait elements.
3.4 The Mutual Exclusion element - ME
It is a well known problem that whenever several concurrent processes share a 
single resource, there is a need for serialising the concurrent requests to this re­
source, e.g. access to a bus, memory, communication channel etc. There are 
several solutions to this problem:
Figure 3.2: Mutual Exclusion element
1. By assigning different, nonoverlapping time slots to each of the concurrent 
processes.
2. Another approach is by implementing an additional process which polls all 
requests in a round-robin fashion. Here various token based, test-and-set 
algorithms are used.
3. A more interesting and often more efficient approach is by using an arbiter 
which dynamically resolves the problem of mutual-exclusive access.
4. Perhaps there are other solutions, which are not yet implemented.
The first approach is plausible as an implementation for VLSI but it also implies
large response delays on average. Using a polling strategy is usually cheap in
terms of chip area and easier to implement but it might also be somewhat expensive
whenever power consumption is a concern. The polling signal inherently
implies voltage changes as a realisation of the polling token and, as we assume
our circuit is implemented in CMOS technology, this means increased power
dissipation. There are also concerns about the fairness of such an arbitration.
The third approach differs from the previous two in that each request is dynam­
ically served, i.e. waiting for a request does not involve any kind of functioning. In 
the heart of every arbiter, there is a mutual exclusion element which plays a major 
role in resolving the serial access to the shared resource. This element is central 
for the automatic synthesis of the general choice operator.
The rising edge on $r_i$ is regarded as a request event, while the falling one is
treated as a release of the shared resource (see figure 3.2). These two events are
respectively acknowledged with the rising and the falling edge on the grant lines
$g_i$. In the case when there are two simultaneous requests, the ME element will
eventually grant the resource to only one of them which, in other words, means
Figure 3.3: CMOS implementation of the ME element
that $g_1$ and $g_2$ can never be simultaneously high. The DIA specification of the ME
element is as follows:
$$ME = [\,r_1? \rightarrow g_1!;\ r_1?;\ g_1!\ \ \square\ \ r_2? \rightarrow g_2!;\ r_2?;\ g_2!\,]$$
First Seitz [57] implemented ME in nMOS, then Martin [5] redesigned it in
CMOS. In both designs, an RS flip-flop is used for breaking the symmetry of granting
only one out of two simultaneous requests. Let us consider in more detail the
CMOS implementation shown in figure 3.3. If there is a single request on one of the
request lines (without loss of generality we will consider it is $r_1$) we will have the
following equations fulfilled: $o_1 = \neg o_2 = 0$. In the latter case the analog difference
circuit $AD_1$ will act as an invertor and its output $g_1$ will go high, thus granting the
request. The nmos transistor of $AD_2$ will be conductive and that will provide low $g_2$.
Having raised $g_1$, granting any subsequent request on $r_2$ will be suspended until
the first request is withdrawn by lowering $r_1$. Therefore, the access to the resource
is granted to only the first request in this case. The more interesting situation is
when there are two concurrent requests on $r_1$ and $r_2$. This is the typical case in
which the outputs of the RS flip-flop are undefined. In order to better understand
this phenomenon, one has to plunge into the analog world of this circuit. Seitz
observed that when both NAND gates of the RS flip-flop are placed close, there is
a certain possibility of both outputs reaching an intermediate, metastable level:
not oscillating, but still not a stable logical high or low.
Figure 3.4: Analog subcircuit of the RS NAND gate
Figure 3.5: o1 = FA(o2)
Figure 3.6: FA(x) and FA^-1(x)
Considering the static analog properties of a NAND gate with one of its inputs 
stable high, we can confine ourselves to the behaviour of the circuit shown in 
figure 3.4. The graphical representation of o1 as a function FA of o2 is shown in 
figure 3.5. If we assume that both NAND gates of the RS flip-flop are absolutely 
identical, we can conclude that o2 is also FA(o1). Therefore, the possible states for 
both o1 and o2 are the 3 intersection points of the curves FA and FA^-1, as shown in 
figure 3.6. The first two points represent stable logical states: o1 = stable high and 
o2 = stable low, and vice versa. The middle point corresponds to the metastable state, 
in which, because of the symmetry, o1 = o2. (Exhaustive HSPICE simulations place 
this point at o1 = 1.6 V to 3 V when the power supply is 5 V, far above the 
threshold values for an nmos transistor.) What is interesting to note is that 
both analog difference circuits will keep their outputs low: the pmos transistor is 
nonconductive (Vgs = 0), while the nmos transistor is conductive and far in the 
saturation region. As < 0, any change of one of the RS flip-flop outputs will 
bring them far apart into the stable logical states, which on the other hand resolves 
both the metastability and the arbitration.
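The three crossing points of FA and FA^-1 can be illustrated numerically. The following is only a sketch under stated assumptions: the tanh-shaped curve `fa` and its steepness `GAIN` are hypothetical stand-ins for the real transfer function (which the text obtains from HSPICE), and the two NAND gates are taken to be identical. It locates the symmetric crossing by bisection and shows that a small disturbance drives the pair into one of the stable states.

```python
import math

VDD = 5.0    # supply voltage taken from the text
GAIN = 4.0   # hypothetical steepness of the idealised transfer curve

def fa(x):
    """Idealised inverting transfer function standing in for FA."""
    return 0.5 * VDD * (1.0 - math.tanh(GAIN * (x - 0.5 * VDD)))

def settle(o1, o2, steps=200):
    """Let the cross-coupled pair o1 = FA(o2), o2 = FA(o1) relax by iteration."""
    for _ in range(steps):
        o1, o2 = fa(o2), fa(o1)
    return o1, o2

def metastable_point(lo=0.0, hi=VDD, tol=1e-9):
    """Bisection on x - FA(x): the symmetric crossing where o1 = o2."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid - fa(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

m = metastable_point()                    # symmetric (metastable) crossing
resolved = settle(m + 0.01, m - 0.01)     # a small imbalance resolves the state
```

With this idealised curve the symmetric point sits at half the supply voltage; the asymmetric case discussed later in the text would shift it.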
Theoretically the circuit can remain in the metastable state for an indefinitely 
long time. What is important is that the ME element will not propagate 
the analog values of o1 and o2 at the metastability point, and keeps its outputs 
logically low until the flip-flop resolves into a stable logical state. The latter is a result of the 
symmetry o1 = o2, in which the pmos transistors of the analog difference circuits 
are nonconductive. Having said that, the condition o1 = o2 will no longer be 
fulfilled once the symmetry is broken. One can also see that o1 ≠ o2 if the two 
NAND gates are different, for example if the zero-bias threshold values are
NAND1 nmos = 0.7 V 
NAND1 pmos = -1.5 V 
NAND2 nmos = 1.5 V 
NAND2 pmos = -0.7 V
in which case we have the situation shown in figure 3.7. In this example o1 ≈ 3 V 
and o2 ≈ 2 V, which opens the pmos transistor of the second analog difference 
circuit. It is therefore imperative that both NAND gates be as similar as possible 
and closely placed on the final layout. For practical purposes, we may then 
assume that both analog difference circuits will keep their outputs low in the state 
of metastability.
Figure 3.7: Nonsymmetrical metastability point
Figure 3.8: ME element with 3 inputs
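The request, grant and release protocol of the binary ME element can be summarised at the level of logic values. The class below is a behavioural sketch only: the name `MutexElement` and the use of a seeded random choice to stand in for the resolution of metastability are assumptions made for illustration, not a model of the CMOS circuit.

```python
import random

class MutexElement:
    """Behavioural model of a two-input ME element (levels, not transistors)."""

    def __init__(self, rng=None):
        self.r = [False, False]   # request lines r1, r2
        self.g = [False, False]   # grant lines g1, g2
        self.rng = rng or random.Random(0)

    def set_requests(self, r1, r2):
        self.r = [r1, r2]
        # A withdrawn request releases its grant.
        for i in (0, 1):
            if not self.r[i]:
                self.g[i] = False
        # A new grant is issued only while no grant is outstanding.
        if not any(self.g):
            pending = [i for i in (0, 1) if self.r[i]]
            if pending:
                # Simultaneous requests: the winner is arbitrary.
                self.g[self.rng.choice(pending)] = True
        return tuple(self.g)
```

For example, after `set_requests(True, False)` a later `set_requests(True, True)` leaves the second request suspended until the first is withdrawn.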
A Mutual Exclusion element with more than 2 inputs can be constructed from 
binary ME elements, as shown in figure 3.8. The number of binary ME 
elements necessary for an N-input ME element is N. Therefore, it might be a better idea to 
construct multiple-input ME elements as shown in figure 3.9. Here metastability 
can exist for a set of outputs oi, which is why the ith subcircuit with the analog difference 
elements will hold its output low until the metastability is resolved between oi and 
the other oj (j ≠ i).
On this basis, we can now design even more complex arbitration elements. 
Figure 3.10 shows an implementation of an arbiter with the following DIA specifi­
cation:
$Arb = enable?;\ [\, in_1? \to out_1!;\ Arb \;\square\; in_2? \to out_2!;\ Arb \,]$
The third input of the ME element is an enable input of both NANDs forming the 
ME. We will use this important type of an arbiter in the process of generalisation 
of the DW element in the following section.
Figure 3.9: CMOS ME element with 3 inputs
Figure 3.10: Arbiter (sequencer)
3.5 The Decision Wait element
We have introduced the Decision Wait element in the previous sections. Let us 
now cover a few implementation issues and modifications, which will form the 
basis of the compilation technique.
3.5.1 Some implementation notes on DW element
The DIA specification of the DW element (figure 3.11) with n row inputs (r), m 
column inputs (c) and corresponding n × m array of outputs (o) is
$DW_{n \times m} = [\, 1 \le i \le n \mid r_i? \to [\, 1 \le j \le m \mid c_j? \to o_{i,j}!;\ DW_{n \times m} \,]\,]$
[Figure 3.11: DW element symbol, with column inputs c1 to c3 and outputs o11 to o33]
The Decision Wait element is a generalisation of the well-known Muller C 
element; DW1×1 is the Muller C element. The DW element waits for an input from 
both dimensions and then sends an event on the output matching both indexes. 
If there is more than one transition on the row inputs or on the column inputs, the next state is 
undefined (⊥). The DW elements are used for the synchronisation of two sets of 
events. (Obviously the DW element can be defined with more dimensions, but we 
will confine ourselves to 2 dimensions.) Events in each set are mutually exclusive.
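The behaviour just described can be captured in a small event-level model. This is a sketch under assumptions: the class name `DecisionWait` is hypothetical, indices are 0-based, row and column events are accepted in either order, and the undefined state is represented by raising an exception.

```python
class DecisionWait:
    """Event-level model of an n x m DW element."""

    def __init__(self, n, m):
        self.n, self.m = n, m
        self.row = None   # pending row index, if any
        self.col = None   # pending column index, if any

    def _fire(self):
        # An output event o[i][j] is produced once both dimensions have arrived.
        if self.row is not None and self.col is not None:
            out = (self.row, self.col)
            self.row = self.col = None
            return out
        return None

    def row_event(self, i):
        if self.row is not None:
            raise RuntimeError("second row transition: behaviour undefined")
        self.row = i
        return self._fire()

    def col_event(self, j):
        if self.col is not None:
            raise RuntimeError("second column transition: behaviour undefined")
        self.col = j
        return self._fire()
```

With `n = m = 1` the model reduces to a two-input Muller C element operating on events.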
The rest of this section will be concerned with the definition of the GDW element, 
which, as we shall see, is necessary and useful for the asynchronous compilation 
of our CSP based language into silicon circuits.
Let us first consider some implementations of the basic DW element.
One implementation of the DW2×2 element¹ is shown in figure 3.12.
Figure 3.13: DW - 242 phased implementation
When both transitions, upon rows and columns, are received, the correspond­
ing Muller C element switches, producing a transition upon the output matching 
both indexes. The same transition is used to clear off the rest of the requests to all 
Muller C elements in the same row and column. Both inputs of Muller C element 
for row i and column j  are defined as follows:
$MERGE(r_i,\ MERGE_{k \ne j}(o_{i,k}))$
$MERGE(c_j,\ MERGE_{k \ne i}(o_{k,j}))$
A possible problem with this implementation is that, when a Muller C 
element switches, it outputs on channel oi,j immediately. The environment may 
supply a new input while the clear-off process for the remaining C elements in the 
current row and column is not yet finished. This can happen, for example, if 
there is a direct feedback wire from some output to a row/column input. In 
that case the use of a safe DW element is necessary. One implementation [36] of 
a safe DW element is shown in figure 3.13, and the implementation of its basic 
cell is shown in figure 3.14.
¹ Mark Josephs' implementation: private communication, Oxford University
Figure 3.14: DWC basic cell of the 242 phased implementation
The implementation is called 242 because internally the DW element functions 
with 4 phase signaling, while its interface uses 2 phase signaling.
The execution steps are:
1. Wait for both inputs from row and column.
2. Switch on the cell matching both indexes
3. Clear off the rest of the requests to the other cells
4. Output on the wire matching both indexes
When both inputs from row and column are supplied, both inputs of the 
corresponding Muller C element are logically high, which causes a transition to 
high on its output as well. After that the Toggle element (the triangle element) 
sends an event upon the same row and column, which clears off all other requests. 
This also switches the same Muller C element again. This time the output 
event from the C element is toggled out of the DW element. Thus the DW element 
produces an output only when the internal clear-off process has finished. A safe 
implementation of DW implies greater input → output delays, but such a 
DW is necessary when there is a direct feedback from one of its outputs to an 
input.
Figure 3.15: SPARSE DW element
3.5.2 Sparse DW element
Sometimes when an input transition occurs upon row wires, we may know that the 
environment will not send an event on some set of column inputs. Alternatively 
we may want to be more selective with the set of column inputs involved in the 
synchronisation. In such cases the implementation of the DW may be simplified. 
Let us define a new Sparse DW (SDW) element.
There are n row and m column inputs. Let s be an array of n x m boolean 
values. Then the definition of the SDW is as follows:
$SDW_{n \times m} = [\, 1 \le i \le n \mid r_i? \to [\, 1 \le j \le m \mid s_{i,j}\ \&\ c_j? \to o_{i,j}!;\ SDW_{n \times m} \,]\,]$
The implementation of SDW is similar to that of DW, the only difference 
being that the basic cell is omitted for every oi,j where si,j is false. For 
example, the implementation of SDW3×3 (figure 3.15) is shown in figure 3.16. In 
this case the array s is as follows:
1 1 0  
1 1 0  
1 1 1
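The effect of the array s can be sketched in a few lines. The helper names `sdw_output` and `cell_count` are hypothetical and indices are 0-based; the mask is the one given above.

```python
# Mask from the text: S[i][j] is True where a basic cell exists.
S = [
    [True, True, False],
    [True, True, False],
    [True, True, True],
]

def sdw_output(s, i, j):
    """Return the output index (i, j) for row event i and column event j,
    or None when the cell is absent and no synchronisation can take place."""
    return (i, j) if s[i][j] else None

def cell_count(s):
    """Number of basic cells that have to be laid out."""
    return sum(cell for row in s for cell in row)
```

For this mask seven of the nine basic cells remain, which is where the saving over a full DW3×3 comes from.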
Figure 3.16: SDW 242 phased implementation 
3.5.3 Generalised DW element
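A further generalisation of the SDW element is the Generalised DW (GDW). For 
every row i there is a boolean value gi which, when true, specifies general choice on 
the column inputs; otherwise it specifies deterministic choice.
The specification in DIA is as follows:
$GDW_{n \times m} = [\, 1 \le i \le n \mid r_i? \to$
$\quad \mathbf{if}\ (g_i)\ \mathbf{then}\ [\, 1 \le j \le m \;\square\; s_{i,j}\ \&\ c_j? \to o_{i,j}!;\ GDW_{n \times m} \,]$
$\quad \mathbf{else}\ [\, 1 \le j \le m \mid s_{i,j}\ \&\ c_j? \to o_{i,j}!;\ GDW_{n \times m} \,]\,]$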
The use of an arbitration element is necessary for the implementation of the 
latter GDW. All column inputs which are in nondeterministic choice are circum­
scribed with rectangles in the graphic symbol of GDW (figure 3.17).
An implementation of GDW2×2 based on the "unsafe" DW is shown in figure 3.18.
The SW element holds the input value on its output when the control signal goes low. 
Otherwise, if the control signal is true, the output 
follows the input (logically negated, but this does not matter since we deal 
with signal transitions).
Figure 3.17: GDW symbol
Figure 3.18: GDW implementation
Figure 3.19: GDW basic cell
Another implementation is possible using the second type of DW, the 242 one (figure 3.19). 
In this case only the basic cell is changed. A mutual exclusion 
(MUTEX or ME) element is assigned to every row whose flag gi is true. The MutexIn and 
MutexOut wires are connected to the mutual exclusion element assigned to 
this row. Notice that in both implementations the arbitration process (the arbiter 
or MUTEX) can be placed externally without affecting the regularity of the initial 
DW element and, more importantly, without affecting the correct functioning of the 
circuit.
Another feature of the 242 implementation is that, because of the internal 
4-phase signaling, the C2 element acts as an AND gate, so it can be replaced with 
a logical AND gate.
3.6 The compilation technique
The interpretation of each construct is based on the concept of flow control syn­
chronisation. Each operator (SEQ, PAR, DETALT, ALT) accepts the control from 
its predecessor, executes some actions and passes it to its successor. The 
actions can be an input/output on wire (channel), an arbitration/choice between 
several inputs or the branching of the control flow. There are two types of signals: 
those from the environment and control ones. The control signals travel through 
the circuit and synchronise with the inputs from the environment. After such  
synchronisation, the outputs to the environment are eventually produced. Each  
input from the environment is accepted exactly when the control signal reaches 
the point of synchronisation. Each output is sent after the control is passed to the 
operator which creates it. As synchronisation circuits we use DW/SDW/GDW el­
ements. For each input from the environment there is an assigned GDW element, 
where this input is connected to a column of the GDW element. Different control 
signals, which request this input, are connected to the rows of the GDW element. 
Which kind of row should be used (sparse or with nondeterministic choice) depends on the 
syntax construct to be compiled.
In general we may consider the process of compilation as defining 
connections between the inputs and outputs of two blocks, D (demultiplexor) and M 
(multiplexor) (see figure 3.20). Block D is built from a set of different DW/SDW/GDW 
elements. It is the synchronising element, but can also be regarded as 
a demultiplexor which steers the control signals according to the current state of 
the inputs from the environment. After the synchronisation with the inputs from 
the environment, the outputs from block D are regarded as control signals again, 
which creates the feedback, but they are also connected to block M. The latter block 
is built out of Merge elements and multiplexes the different control signals which 
create the outputs to the environment. Block D is a demultiplexor in the sense that 
one input occurs at different points of the DIA expression and these points create 
different control signals. Similarly, block M is regarded as a multiplexor 
because different control signals create one output to the environment. The number 
of columns of block D is the number of inputs from the environment. The number 
of rows and outputs of block D is the number of control signals encountered by 
the compiler.
The compilation technique is syntax directed. It starts at the beginning of the 
program and parses it sequentially, translating each construct it meets into the 
corresponding structure of library elements. Further in the chapter we show how each 
construct of the subset is compiled using our set of DW elements (DW/SDW/GDW) 
and Merges. The proposed technique has been implemented as a compiler written 
in C. It takes a piece of code as input and produces a netlist 
of the physical structure of the circuit as output.
Figure 3.20: OCCAM(async) compilation scheme
3.6.1 The sequential operator SEQ
The sequential operator is the most straightforward construct to compile. It ac­
cepts the control from its predecessor and passes it sequentially to each construct 
in the scope of SEQ. For example the following piece of code compiles to the circuit 
shown in figure 3.21.
SEQ
  process.1
  process.2
  process.3
As we can see from figure 3.21, all one needs is to connect the results of 
the synthesis of the different subprocesses within the scope of the SEQ construct using 
only wires.
3.6.2 The composition operator PAR
When the control reaches this operator, the concurrent parts accept the control 
from it in parallel. This operator is more of a constructive primitive than an 
operational one. In this sense it is fairly easy to compile. Its operational semantics 
requires two strictly sequential actions:
1. branching the control to all parallel subprocesses within the scope of the PAR 
construct
2. synchronising the finish of all branches and delivering the control to its 
successor.
Figure 3.21: Compiling SEQ
In terms of parallel languages, these two actions represent fork and join. 
The first operation is implemented by simply replicating the initial control signal 
to all subcomponents of the PAR construct; in reality it is a fork of wires. The 
second one awaits all end signals from the parallel branches and delivers 
the control to the PAR construct's successor. It is implemented by a multiple-input 
Muller C element.
For example the following piece of code results in the circuit shown in fig­
ure 3.22.
PAR
  process.1
  process.2
  process.3
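The join half of PAR relies on the multiple-input Muller C element. A minimal level-based sketch of such an element is given below; the class name `MullerC` is hypothetical, and the model works on signal levels (the output changes only when all inputs agree), which covers both edges used by two-phase signaling.

```python
class MullerC:
    """n-input Muller C element on signal levels."""

    def __init__(self, n, initial=False):
        self.n = n
        self.out = initial

    def update(self, inputs):
        assert len(inputs) == self.n
        if all(inputs):
            self.out = True       # every branch has signalled: rise
        elif not any(inputs):
            self.out = False      # every input has returned low: fall
        # otherwise the previous output is held
        return self.out
```

For the three-process example above, the control is delivered to the successor only once the end signals of all three branches have arrived.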
3.6.3 The deterministic choice operator DETALT
The operational semantics of this construct requires the acceptance of one and 
only one ready guard input. As we have already seen in the previous sections, 
there is a GDW element assigned to every input from the environment. If some 
action requires the participation of a certain input, a request signal is sent to 
the corresponding row of the GDW element. After the required transition on that 
input, the corresponding output from the GDW element passes the control to the 
successor of this action.
Figure 3.22: Compiling PAR
The problem which arises is that only one transition on an input guard channel 
will occur and this transition should clear off all other requests to the remaining 
input guards. In this case all these inputs are assigned to the columns of one GDW 
where the control signal passed to the deterministic choice construct is connected 
to a row of this GDW.
Consider the following example:
DETALT
  in1 ?
    SEQ
      out1 !
      in1 ?
      out2 !
  in2 ?
    SEQ
      out2 !
      in2 ?
      out1 !
When the control is passed to the deterministic choice operator DETALT, two 
requests are sent, one for input on in1 and one for input on in2, on the columns of 
one SDW element. The control signal is connected to the corresponding row. The 
corresponding array s is filled according to the rule:
• si,j = true if the control row input i requires input from the environment on 
column j
• si,j = false otherwise
Figure 3.23: Compiling operator DETALT using SDW element
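The fill rule for s can be written down directly. The sketch below applies it to the DETALT example; the row labels and the function name `build_mask` are hypothetical, and the assignment of one row per control signal follows the description in the text rather than the output of the actual compiler.

```python
COLUMNS = ["in1", "in2"]   # one column per environment input

# Hypothetical row labels; each row lists the inputs its control signal requests.
ROW_REQUESTS = [
    ("detalt", {"in1", "in2"}),   # the choice itself waits on both guards
    ("after.in1", {"in1"}),       # SEQ branch taken after in1: needs in1 again
    ("after.in2", {"in2"}),       # SEQ branch taken after in2: needs in2 again
]

def build_mask(rows, columns):
    """s[i][j] is True iff control row i requires environment input j."""
    return [[col in wanted for col in columns] for _, wanted in rows]

s = build_mask(ROW_REQUESTS, COLUMNS)
```

Only the first row has both entries set, reflecting that the choice is made once and the later inputs inside each SEQ are requested individually.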
The resulting circuit using an SDW is shown in figure 3.23. The initial signal start 
is copied out through the merge element and requests both inputs 
in1 and in2 on an SDW row. The SDW synchronises the arrival of one of these inputs with the 
control signal and signals on the corresponding SDW output, which acts as a control signal 
again. The latter either produces out1 or out2 (as an input of one of the output merge 
elements) and also requests the subsequent input on another row of the SDW 
element.
3.6.4 The nondeterministic choice operator ALT
The nondeterministic choice operator is essential in designing various 
systems. The compilation technique is the same as in the previous subsection, 
the only difference being that a GDW is used and the row is nondeterministic, 
since the column signals may be concurrent.
Consider the following similar example (figure 3.24):
ALT
  in1 ?
    SEQ
      out1 !
      in1 ?
      out2 !
  in2 ?
    SEQ
      out2 !
      in2 ?
      out1 !
Figure 3.24: Compiling operator ALT
The first occurrences of in1 and in2 are required by the general choice operator, 
so the corresponding row in the GDW is nondeterministic. The next occurrences 
are in the sequential operator and are requested deterministically.
3.7 Compiling SEQ with a replicator
Very often, to express an object's behaviour, we use an iterative description, i.e. 
recurrent equations, recursion, explicit cycles etc. When the depth of the iteration 
is fixed, the use of counters becomes apparent. The counter is a circuit which 
creates unique instances of each loop in the iteration.
Compiling a SEQ construct with a replicator is based on the use of an asynchronous 
counter. The following SEQ statement compiles to the circuit shown in figure 3.25.
SEQ N
  process
The FOREVER cycle is in a sense an "optimisation" of the circuit above, assuming 
that the modulus is infinite (figure 3.26).
Figure 3.25: Compiling SEQ with a replicator
Figure 3.26: Compiling SEQ FOREVER
3.8 Data types
The basic OCCAM(async) constructs have now been presented, as well as how they 
can be automatically compiled. We can now introduce data types within the 
language and show how this reflects upon the synthesis procedure.
OCCAM(async) processes act upon variables and channels. A variable has a 
value, and may be assigned a value in an assignment statement or by an input 
on a channel. Channels communicate values. Values are classified by their data 
type. A data type determines the set of values that may be taken by objects of that 
type.
This section first describes the data types of values, and then discusses the two 
possibilities for the physical representation of these values and of communications 
upon channels.
The current data types used are:
[N]INT Signed integer represented in N bits.
INT Default, with N = 8.
CHAN Declares a channel of void type. It carries no value when communicating 
and is mainly used for synchronisation.
CHAN OF [N]INT Declares a channel which carries an N-bit signed integer value.
CHAN OF INT Same as above, with N = 8.
3.8.1 Examples
INT register:
declares a variable, 8 bits long, which the designer can refer to as 
register.
[16]INT two.registers:
declares a variable, 16 bits long, whose identifier is two.registers.
CHAN OF [15]INT bus:
declares a channel which carries a 15-bit value.
CHAN interrupt:
declares a void channel, which does not carry any value and is only used
for synchronisation.
3.9 The full syntax
circuit     = process
process     = SKIP
            | STOP
            | assignment
            | input
            | output
            | sequence
            | parallel
            | choice
            | conditional
            | case
            | selection
            | loop
            | {1 declaration }
              process
input       = channel ? [variable]
output      = channel ! [expression]
sequence    = SEQ [replicator]
                {1 process }
replicator  = FOR integer
parallel    = PAR
                {1 process }
choice      = ALT | DETALT
                {1 alternative }
alternative = input
                process
conditional = IF
                {1 branch }
case        = CASE expression
                {1 variant }
                [ caselse ]
caselse     = ELSE
                process
branch      = expression
                process
loop        = WHILE expression
                process
declaration = [ [integer] ] INT variable { , variable } : |
              CHAN [ OF [ [integer] ] INT ] channel { , channel } : |
              ENV CHAN [ OF [ [integer] ] INT ] channel {?|!} { , channel {?|!} } : |
              ACK ? channel { , channel } ! channel { , channel } : |
              ALIGN channel {1 , channel } :
variant     = integer {1 , integer }
                process
assignment  = variable := expression
integer     = FALSE
            | TRUE
            | {1 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 }
variable    = string
channel     = string
3.10 Synthesis for data operations
As we previously discussed in the Data Encoding section of Chapter One, there are 
two possible alternatives for the physical representation of an OCCAM(async) value. 
The first one, used in our synthesis procedure, is the so-called bundled-data [32], 
[19], [10]. Although we focus on this type, we will briefly describe in the subsequent 
sections how the compilation technique with data operations can use double-rail 
logic. It is important to note that the synthesis for the basic OCCAM(async) 
constructs remains the same regardless of the chosen encoding scheme.
3.10.1 Double-rail logic
In this section we show how one can construct some basic boolean functions in 
order to design data processing circuits using double-rail logic. The DW element is 
at the heart of these boolean functions. Each data bit is represented with two wires: 
one for logical 0 and one for logical 1.
Logical negation (inverter)
The double-rail inverter is straightforward to implement: wires 0 and 1 are 
swapped, which implements the logical negation.
Figure 3.27: Inverter
AND, OR, XOR
Consider the inputs of two data bits connected to a DW2×2 element. As a 
result of the synchronisation, the outputs define the canonical representation of the 
product of A and B (figure 3.28).
Figure 3.28: Canonical representation
Therefore AND, OR and XOR can be easily implemented by merging the right 
canonical forms, as shown in figure 3.29.
Figure 3.29: AND, OR and XOR
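The construction can be illustrated at the logical level. In the sketch below the function names `dw2x2`, `merge` and `gate` are hypothetical; a double-rail bit is modelled by its value 0 or 1, the DW2×2 outputs are modelled as a one-hot array of the four minterms, and the logical-0 rail is assumed to be the merge of the minterms not used for the logical-1 rail.

```python
def dw2x2(a, b):
    """Four outputs of a DW2x2 fed by two double-rail bits: exactly the
    output o[a][b] carries the event (one-hot minterm)."""
    return [[(a == i and b == j) for j in (0, 1)] for i in (0, 1)]

def merge(*events):
    """A Merge passes on an event arriving on any of its inputs."""
    return any(events)

def gate(a, b, ones):
    """Double-rail result (rail0, rail1); `ones` lists the minterms that
    are merged onto the logical-1 rail, the rest go to the logical-0 rail."""
    o = dw2x2(a, b)
    cells = [(i, j) for i in (0, 1) for j in (0, 1)]
    rail1 = merge(*[o[i][j] for i, j in cells if (i, j) in ones])
    rail0 = merge(*[o[i][j] for i, j in cells if (i, j) not in ones])
    return rail0, rail1

AND = {(1, 1)}
OR = {(0, 1), (1, 0), (1, 1)}
XOR = {(0, 1), (1, 0)}
```

Exactly one of the two result rails carries an event for each input combination, which is what makes the result itself a valid double-rail bit.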
Double-rail register
The double-rail register, also called a DI variable, is an essential element for 
storing a state and referring to its value later in the program code. Figure 3.30 
displays the symbol for the DI variable. The register is a device with 3 control inputs
and 4 outputs. The inputs w0 and w1 store the corresponding value in the circuit and are
acknowledged on the aw0 and aw1 control outputs, while the read input r is 
acknowledged on ar0 or ar1 depending on the last value stored in the register. Its 
DIA specification follows:
$V = P_0 \sqcap P_1$
$P_i = [\, w_0? \to aw_0!;\ P_0 \mid w_1? \to aw_1!;\ P_1 \mid r? \to ar_i!;\ P_i \,]$
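Read as a state machine, the specification has two states, one per stored bit. The following sketch models it at the event level; the class name `DIVariable` and the choice of 0 as the initial state are assumptions of the example only.

```python
class DIVariable:
    """Event-level model of the DI variable: state P0 or P1."""

    def __init__(self, initial=0):
        self.value = initial   # initial state is an assumption of this sketch

    def write(self, bit):
        """Event on w0 or w1; returns the acknowledging output."""
        self.value = bit
        return f"aw{bit}"

    def read(self):
        """Event on r; acknowledged on ar0 or ar1 according to the state."""
        return f"ar{self.value}"
```

A write of 1 followed by a read therefore produces the output events aw1 and then ar1.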
Figure 3.30: DI variable
Figure 3.31: Variable: using RS flip-flop
Figure 3.32: Variable: using DW element
Figures 3.31 and 3.32 show two possible implementations of the DI variable device. 
Double-rail encoding is suitable for bit-serial devices and for on-chip 
implementation, where the number of wires is not limited by chip pins. It 
becomes expensive, however, as the number of stages in a boolean expression grows. Although
scientifically attractive, the double-rail encoding scheme has been shown to be 
impractical for a large variety of circuits. Therefore we will confine our synthesis method 
to bundled-data, and we will consider only bundled-data in the remaining 
chapters.
Figure 3.33: Asynchronous D flip-flop
3.10.2 Asynchronous bundled-data registers
Let us introduce several basic components widely used in the synthesis of asynchronous 
circuits: the asynchronous bundled-data register and the multiplexor.
As we are using two-phase signaling, a data value has to be stored in a register 
on each edge, falling or rising, so there is a need for a new, asynchronous D 
flip-flop. Figure 3.33 shows an implementation of such a flip-flop, which comprises 
two ordinary latches with their outputs carefully multiplexed². When the clock 
c is high or low, one of the latches is transparent, while the other one is opaque 
and its output drives the output of the flip-flop. With the incoming edge on c their 
roles change: the last transparent latch now stores the value on d and drives the 
output q, while the previously opaque one is now transparent.
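The exchange of roles between the two latches can be followed in a short behavioural model. This is a sketch only: the class name `DoubleEdgeFlipFlop` and the discrete `step` interface are assumptions, and no claim is made about the transistor-level design shown in the figure.

```python
class DoubleEdgeFlipFlop:
    """Two level-sensitive latches whose outputs are multiplexed by the clock."""

    def __init__(self):
        self.latch_low = False    # transparent while c is low
        self.latch_high = False   # transparent while c is high

    def step(self, c, d):
        """Apply clock level c and data d; return the output q."""
        if c:
            self.latch_high = d   # transparent latch follows d
        else:
            self.latch_low = d
        # The opaque latch (the one that was transparent before the last
        # edge) drives the output.
        return self.latch_low if c else self.latch_high
```

Calling `step` with a changed clock level models an edge: the value seen on d just before that edge appears on q and is held until the next edge, whether rising or falling.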
Such a D flip-flop is at the core of the design of an asynchronous register, used
as the major means of storage in our compilation technique. Figure 3.34 
represents its symbol. Its store input is a bundled-data input and it is 
acknowledged on a single control output after completion of the write operation. 
Its read input requests the current value stored in the register, which is signaled on 
the bundled-data acknowledge output. An asynchronous register may be built 
with as many read/write inputs and corresponding acknowledges as necessary.
² Chaogang Huang's implementation, Oxford University
Figure 3.34: Asynchronous Register
3.10.3 Compiling an expression
expression = monadic expression
           | expression dyadic expression
           | ( expression )
           | integer
           | variable
monadic    = NOT | ~ | -
dyadic     = << | >> | + | - | * |
             = | < | > | <= | >= | <> |
             /\ | \/ | >< | AND | OR | XOR
Compiling an expression is a process of constructing a tree reflecting the data 
flow through the computation of the result. This process is very similar to 
that of constructing a boolean function for a given target expression; it differs 
in the set of functional components used and, moreover, in their interface. Let us 
illustrate this with an example. Consider the expression:
(a + b) >< (c + b)
Figure 3.35: Compiling expression (r := (a + b) >< (c + b))
Figure 3.36: Bundle Data Adder
The resulting circuit is shown in figure 3.35. All the data paths are physically 
represented by bundled-data, which implies that there is a control 
signal on each data bus indicating the validity of the data value. This, in 
turn, requires different implementations of the various functional blocks used. 
Figures 3.36 and 3.37 illustrate the implementation of binary and unary 
functions as asynchronous bundled-data components. As we can see from the figures, 
both types of component preserve the correct timing on the output by introducing 
a delay line on the control signal. In the case of a binary component, the Muller C 
element is used as a synchronisation element, releasing the result only when both 
inputs have arrived.
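The tree construction and the role of the delay line can be sketched for the example expression. Everything numeric here is hypothetical: the `DELAY` table stands in for the matched delay lines, values are treated as 8-bit patterns (the default INT width), and `evaluate` is an illustrative helper, not the compiler's algorithm. The operator >< is taken to be bitwise exclusive or, as in occam.

```python
from dataclasses import dataclass

WIDTH = 8                        # default INT width from section 3.8
MASK = (1 << WIDTH) - 1

@dataclass
class Var:
    name: str

@dataclass
class Op:
    symbol: str                  # "+" or "><" (bitwise exclusive or)
    left: object
    right: object

OPS = {"+": lambda x, y: (x + y) & MASK, "><": lambda x, y: x ^ y}
DELAY = {"+": 3, "><": 1}        # hypothetical delay-line lengths per block

def evaluate(node, env):
    """Return (value, time at which the control signal declares it valid)."""
    if isinstance(node, Var):
        return env[node.name], 0
    lv, lt = evaluate(node.left, env)
    rv, rt = evaluate(node.right, env)
    # The C element waits for both operands; the delay line covers the block.
    return OPS[node.symbol](lv, rv), max(lt, rt) + DELAY[node.symbol]

tree = Op("><", Op("+", Var("a"), Var("b")), Op("+", Var("c"), Var("b")))
```

Evaluating `tree` with a = 1, b = 2 and c = 4 combines the two sums 3 and 6, and the validity time of the result is the adder delay plus the delay of the final block, since both adders work concurrently.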
Figure 3.37: Bundled-data AND
Figure 3.38: Assignment (var.id := expression)
3.10.4 Assignment
assignment = variable := expression
The compilation of the assignment is straightforward once the 
synthesis of an expression has been introduced. Figure 3.38 displays the resulting circuit. The control 
signal which initiates the assignment action is used as a start signal for 
calculating the expression, and the result is stored in the register assigned to the left-side 
variable. A unique write port is used and its acknowledge signals the end of the 
operation.
3.10.5 Output statement 
Bundled-data multiplexors
The binary bundled-data multiplexor is a necessary component for the compilation 
of an output statement. It accepts a request from one of its inputs and "routes" 
the data to its output. It can be specified with the following code:
[N]INT x:
CHAN OF [N]INT in1, in2, out:
DETALT
  in1 ? x
    out ! x
  in2 ? x
    out ! x
It is interesting to note that there is no memory element on the data path, but 
there is a need for an RS flip-flop in order to keep the data valid on the multiplexor's 
output until a subsequent write-request arrives. One possible implementation is 
shown in figure 3.39.
Figure 3.39: Bundle data multiplexor
When the RS flip-flop switches to the desired state, it opens the corresponding 
latch and that causes the final output request.
It is important to note that this implementation does not have a 
constant response time. Let us first consider a multiplexor with more inputs; 
we will come back to this implementation later. A multiplexor with more inputs can be 
realised by composition of basic multiplexors, as shown in figure 3.40.
Figure 3.40: Multiplexor with more inputs
The delay from a single request to the output in the worst case is the delay tw 
of two Merges, an RS flip-flop and a Latch. In this case, the value stored in the 
RS flip-flop is different from the one the flip-flop will go into. If it happens to be 
the same, then the delay is the delay tb of only one Merge and one Latch. Now 
let us consider the binary tree of multiplexors where the inputs of the very first 
stage of multiplexors occur sequentially (see figure 3.41). What is interesting to 
note is that the first input will partly pre-establish the data path for the second 
one, i.e. the state of all RS flip-flops except those in the very first stage will already be 
the desired value for the second input. Therefore the delay for the second input 
will be tw + (n - 1) × tb rather than n × tw, where n is the number of stages in 
the composition. For the third input it will be 2 × tw + (n - 2) × tb. This fact is 
another illustration of how asynchronous circuits take only as much time (delay) as is 
necessary for the completion of the operation.
Figure 3.41: Pre-establishing the path for subsequent inputs
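The delay figures above can be reproduced with a small calculation. The function below is an extrapolation, not a formula from the text: it assumes that the inputs of the tree are numbered from 0 in binary order, and that every stage up to and including the highest one at which the new path diverges from the previous path costs the worst-case delay tw, while the remaining stages cost tb. The name `mux_tree_delay` and the numeric values in the usage note are hypothetical.

```python
def mux_tree_delay(prev_leaf, leaf, stages, tw, tb):
    """Delay through a binary multiplexor tree for a request on `leaf`,
    given that the previous request came from `prev_leaf`."""
    # Number of stages whose RS flip-flop is not pre-established.
    switching = min((prev_leaf ^ leaf).bit_length(), stages)
    return switching * tw + (stages - switching) * tb

def worst_case_delay(stages, tw):
    """Every stage has to switch its RS flip-flop."""
    return stages * tw
```

With n = 4 stages, tw = 5 and tb = 2 (arbitrary units), the second input (leaf 1 after leaf 0) takes 5 + 3 × 2 = 11 and the third input (leaf 2 after leaf 1) takes 2 × 5 + 2 × 2 = 14, against a worst case of 20.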
Still, one might not be pleased with the performance of the multiple-input 
multiplexor. Indeed, one can introduce more concurrency for a better implementation of 
the multiplexing.
There are two strictly parallel operations:
• Pre-establishing further the path for the data
• Multiplexing of the two bundled-data inputs for a single multiplexor
In order to split these two operations, there are two req lines within a single 
bundle: Jreq, which is used for pre-establishing, and req, which carries the valid 
data (figure 3.42).
The Jreq signal is used to store the new state in the RS flip flop and also to prop­
agate it further to the remaining stages. As we can see from the implementation, 
Jreq is immediately sent out via Merge element without waiting for the whole oper­
ation to complete in this current stage. The req is sent out only when the data is 
successfully multiplexed and valid. The initial Jreq is a copy of the req signal from 
the bundle. It is not difficult to see that the overall delay for the multiple-inputs 
multiplexor in the latter case is approximately n xU.
Figure 3.42: Fast multiplexor
Figure 3.43: Output on channel
Compilation of output on a channel
output = channel ! [expression]
If the channel is of void type, then the compilation is the same as in the synthesis 
from the basic subset language. Figure 3.43 illustrates the compilation. The 
expression is typecast to the type of the channel and calculated, and the 
output from the expression block is then MUXed with the other appearances of 
that particular output channel. A unique input of the bundled-data 
multiplexor assigned to the output channel is used.
3.10.6 Input statement
input = channel? [variable]
Figure 3.44: Bundle element
Figure 3.45: Compiling input on channel
If the input channel is of void type then the compilation is the same as with the 
subset language; otherwise the variable and the channel should be of the same 
size. Figure 3.45 displays the compilation of input on a channel, which is based 
upon the bundle element shown in figure 3.44. The bundle element is a very important 
element, as it preserves the correct timing on the bundled-data wires. Although 
implemented only by wires as a physical layout, it ensures that the delay on the 
control wire is greater than on the data wires.
3.10.7 WHILE loop statement
loop = WHILE expression 
process
Figure 3.46: Compiling WHILE
The operational semantics of WHILE requires first an evaluation of the expression, 
and then a comparison of the result with 0. As the result (so we allow) can 
be data of any size, we need to reduce it to one-bit bundled data. This process 
is nothing else but boolean typecasting of the result to [1]INT. The output from the 
compilation of the expression is connected to an asynchronous OR component in 
order to produce the logical OR of all data bits and conform the output to the type 
[1]INT.
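The boolean typecast is simply an OR-reduction over the data bits. A minimal Python sketch of that reduction follows; the function name and the explicit width parameter are illustrative and do not appear in the compiler.

```python
def bool_cast(value, width):
    """Reduce a width-bit value to a single bit by OR-ing all of its data bits."""
    result = 0
    for k in range(width):
        result |= (value >> k) & 1    # bit k of the bundled data
    return result


print(bool_cast(0b00000000, 8), bool_cast(0b01000000, 8))
```

Any non-zero pattern within the given width yields 1, and only the all-zero word yields 0, which is the comparison with 0 that WHILE requires.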
Figure 3.46 shows the circuit resulting from the compilation of the WHILE statement. 
The signal which passes the control to this construct initiates, via a Merge, 
the evaluation of the control expression. The result is connected to the control 
input of a SELECT element. If the expression equals TRUE, the latter element passes 
the control to the WHILE body process, otherwise it delivers the control to the next 
statement in the program.
3.10.8 Conditional branching
conditional = IF
{i branch}
branch = expression
process
The IF construct sequentially evaluates the expressions in the listed branches 
and executes the first process whose guard is TRUE. Similar to the compilation 
process of the WHILE construct, all expressions are boolean typecast to [1]INT.
One possible design implementation for the IF construct is shown in figure 3.47. As 
we can see from the circuit, the initial signal start "polls" sequentially each branch's 
expression and the first one equal to TRUE initiates the corresponding process. It 
is interesting to note that if all expressions are evaluated and none of them is 
TRUE, then the circuit signals a fault. The latter can be considered as an obscure 
run-time error.
Figure 3.47: Compiling IF
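The polling behaviour can be summarised in a few lines of Python. This is only a software analogy of the circuit, with hypothetical names (run_if, IfFault); guards are modelled as callables so that the order of evaluation is visible.

```python
class IfFault(Exception):
    """Raised when no guard of an IF construct evaluates to TRUE."""


def run_if(branches):
    """Poll the guards in listed order and run the first process whose guard is TRUE.

    branches is a list of (guard, process) pairs of zero-argument callables.
    Guards after the first TRUE one are never evaluated.
    """
    for guard, process in branches:
        if guard() != 0:              # boolean typecast of the guard value
            return process()
    raise IfFault("all guards evaluated and none is TRUE")


print(run_if([(lambda: 0, lambda: "first"), (lambda: 7, lambda: "second")]))
```

The exception plays the role of the fault signal: it is raised only after every guard has been evaluated.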
3.10.9 Case construct
In this section, we present one efficiently implemented construct. The CASE operator 
is useful for direct branching of the control flow, although it can be realised by 
IF constructs. The case element used accepts a bundled-data input and delivers 
the control to the output indexed by the input data.
case = CASE expression
{i  variant}
[  caselse ]
variant = integer { i , integer}
process
caselse = ELSE
process
Figure 3.48 shows the compilation for the case construct.
Figure 3.48: Compiling CASE
3.11 Channels to/from the environment
So far we considered channels as means of communications between two processes 
and there was no discrimination between channels within the circuit and channels 
to/from the environment. It is of importance to the synthesis procedure to split 
these two types of communications.
declaration = ENV CHAN [ OF [integer]INT ] channel {?|!} { , channel {?|!} } :
The type of declaration above specifies the list of channels as an interface between 
the circuit and its environment. The list of channels from/to the environment 
can be easily inferred from the program code, but introducing such a type 
of declaration increases not only the compiler efficiency but also the program's 
readability. We believe the latter to be of greater importance in this case, and 
not a great sacrifice for the programmer.
3.12 Who made who
As we saw from the sections introducing DIA, there cannot be two inputs/outputs 
on a channel without an intermediate acknowledgement. We base our language on 
DIA and from that it consequently follows that for a certain input any subsequent 
output can be an acknowledgement and vice versa. But that is not the reality. A 
certain input is acknowledged on a certain output and such a type of communication 
protocol is called handshaking [48]. More often, an input is acknowledged 
with an output from a set of channels or vice versa. In order to better describe the 
circuit's behaviour, we introduce another type of declaration.
declaration = ACK ? channel { , channel } ! channel { , channel } :
The construct above specifies that any input communication from the first list 
in the ACK pragma will be acknowledged on an output from the second list. The 
communications on both lists of channels are mutually exclusive, i.e. there cannot 
be two simultaneous outputs or inputs as an acknowledgement.
In introducing such a construct, our concern is both the expressiveness of the 
language and, as a consequence, efficient synthesis.
ENV CHAN in ?, in.ack ! :
ENV CHAN out !, out.ack ? :
-- this is the interface of our circuit
ACK ? in ! in.ack :
ACK ! out ? out.ack :
-- channel 'in' is acknowledged on 'in.ack'
-- and similarly - 'out' on 'out.ack'
WHILE TRUE
  SEQ
    in ?
    out !
    out.ack ?
    in.ack !
In the example above, if one omits the ACK section, out! could be an acknowledgement 
for in? and out.ack? for out!, etc. The resulting circuit from the OCCAM(async) 
specification is shown in figure 3.49.
Figure 3.49: no ACK pragma
If the user chooses to specify the protocol of the real sequence of acknowledgements, 
he/she can provide this very important information for the optimisation 
stage. In the example above, the ACK lines specify that there will be no further 
input on in until output on in.ack. As the control signal from the output on in.ack 
is the same as in.ack and it requests in again, we can conclude that the DW1x1 
(Muller C element) for the input in is not necessary. Similarly, we can infer that the 
DW1x1 for out.ack is redundant too. Therefore the specified program code above 
can be implemented by a single pair of wires in? → out! and out.ack? → in.ack!. 
Such a beneficial modification is only possible due to the additional information 
specified with the ACK lines.
3.13 Channel alignment
A channel is physically represented by a control wire and optionally data wires. 
Often it is convenient, if a communication channel comprises several control wires 
plus data, that some control state can be encoded while transmitting a  request of 
data validity. It is also intended in this work to allow several channels to share the 
data wires and each be represented with its unique control wire.
declaration = ALIGN channel { i , channel }  :
The ALIGN pragma above instructs the compiler that the list of channels will be 
physically implemented as an entity of several control wires and a  single data bus. 
Signal events on control wires must be mutually exclusive. The emerging problem 
is that in our compilation technique we use only a  single control wire channel, 
therefore when signaling an output on such a  channel the control wires should be 
realigned.
Figure 3.50 shows the align element used throughout the synthesis procedure. 
It takes as many single control inputs as necessary and outputs on an aligned 
output where each input is represented with its unique control wire. The align 
element is similar to the asynchronous multiplexor with the only difference that 
the output comprises multiple control wires. In that sense, the align element does 
not multiplex the control wires but only the corresponding data inputs.
Figure 3.50: ALIGN element
3.14 Peephole optimisations
So far we considered the synthesis procedure and we introduced various useful 
circuits. The result from the compilation is a netlist of library elements and, as 
the procedure is syntax directed, there is always room for further improvements 
in the circuit at this topology level.
An obvious and useful method of optimisation is the substitution of subcircuits 
with more efficient subcircuits that preserve the overall behaviour [47]. There are 
plenty of circuit patterns that can be replaced with cheaper ones, but we will discuss 
only those which are of some interest to this work, refraining from meticulous 
surveys but still illustrating useful improvements within the final circuit.
3.14.1 Binary trees of Merge and Muller C elements
As a  result of the synthesis procedure, the final circuit often contains tree struc­
tures of binary Merge and Muller C elements. By this we mean a tree structure of 
similar subelements, with one output and N  inputs. For each subelement we have
• its output is either the output of the circuit or it is connected to an input of 
some other subcomponent.
• its inputs are either inputs of the circuit or they are connected to the outputs 
of some other subcomponent.
These binary trees are an obvious implementation of Merge/Muller C elements 
with multiple inputs. As these two types of elements, binary Merge and Muller 
C, are symmetrical regarding their inputs, one can readily suggest that a better 
implementation of such a tree is a balanced tree of such library cells. As it is rather 
obvious that such a substitution optimises the performance on average, we will 
consider a slightly more complex problem.
For each input of a delay-insensitive circuit, there is at least one output which 
acknowledges it. This input is accepted via a GDW element and the outputs from 
the latter are control signals passed further and eventually copied out onto this 
output via a tree of Merges. Therefore each instance i of accepting the input is 
delayed with latency $t_i$ because of the different paths the control signals take before 
reaching an input of the tree of Merges. Our aim here is to design an algorithm 
which defines an N-input Merge and also gives a fair chance to those signals which 
are delayed with a greater latency $t_i$ with respect to the further delay through the 
N-input Merge. Such an N-input Merge we will call a balanced element. The 
idea is to "equalise" delays from an input to its acknowledgement regardless of the 
state.
Let $T = \{t_i : 1 \le i \le N\}$ be the set of latencies of all inputs of an N-input Merge. 
Let $t_{min1}$ and $t_{min2}$ ($t_{min1} \le t_{min2}$) be the two minimal delays from this set. Let $t_d$ 
be the delay of a binary Merge element.
Definition:
• A balanced 2-input Merge is the binary Merge cell.
• A balanced N-input Merge with a set of latencies T is the composition of a 
binary Merge element and a balanced (N-1)-input Merge as shown in figure 3.51. 
The inputs of the binary Merge are the inputs with latencies $t_{min1}$ 
and $t_{min2}$ and its output is connected to an input of the (N-1)-input Merge. 
The rest of its inputs are the original inputs except those with latencies $t_{min1}$ 
and $t_{min2}$. The new set of latencies for the (N-1)-input Merge is defined as
$T' = (T \setminus \{t_{min1}, t_{min2}\}) \cup \{t_{min2} + t_d\}$.
Similar algorithms can be constructed for a tree of Muller C elements and asynchronous 
MUX elements.
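The recursive definition can be unrolled into a loop that repeatedly merges the two inputs with the smallest latencies. The Python sketch below is one possible reading of the definition, not the tool's actual implementation; the tuple encoding of the tree and the function name are assumptions made for illustration, and a heap is used so that equal latencies are handled as a multiset.

```python
import heapq
from itertools import count


def balanced_merge(latencies, t_d):
    """Build a balanced N-input Merge from binary Merge cells.

    latencies[i] is the latency of input i and t_d is the delay of a binary
    Merge cell. Returns (tree, latency), where tree is a nested tuple of
    ("in", i) leaves and ("merge", left, right) cells, and latency is the
    worst-case arrival time at the output of the whole structure.
    """
    if len(latencies) < 2:
        raise ValueError("a Merge needs at least two inputs")
    tie = count()                         # tie-breaker so trees are never compared
    heap = [(t, next(tie), ("in", i)) for i, t in enumerate(latencies)]
    heapq.heapify(heap)
    while len(heap) > 1:
        t_min1, _, a = heapq.heappop(heap)
        t_min2, _, b = heapq.heappop(heap)
        # The output of the binary Merge becomes an input with latency t_min2 + t_d.
        heapq.heappush(heap, (t_min2 + t_d, next(tie), ("merge", a, b)))
    latency, _, tree = heap[0]
    return tree, latency


tree, latency = balanced_merge([1, 2, 3, 10], t_d=1)
print(latency)
print(tree)
```

For the latencies 1, 2, 3 and 10 with $t_d$ = 1, the slowest input passes through a single Merge cell and the output latency is 11, whereas a symmetric tree pairing (1, 2) and (3, 10) would give 12.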
Figure 3.51: Balanced Merge
3.14.2 Inputs in sequential threads
Often it happens that an output must be immediately acknowledged by the environment, 
as it is the only progress the circuit could perform. This results in 
consecutive req! and ack? lines in the OCCAM(async) specification. If these two lines appear N 
times and the sequence of appearance is known a priori, then we can substitute 
the GDW for the acknowledgement with a vertical counter modulo N. The latter 
element has one input and N outputs and the i-th input event is acknowledged on 
the (i mod N)-th output.
The following piece of code results in the circuit shown in figure 3.52. The circuit 
circumscribed with a dashed line is the optimised one, where the vertical counter 
modulo N (the triangle element with bullet-signed output 1) steers the next incoming 
acknowledgement to the corresponding control output.
SEQ
  req !
  ack ?
  req !
  ack ?
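The steering behaviour of the counter can be modelled as follows. This is a behavioural sketch only; the class name is hypothetical and events are assumed to be numbered from 0.

```python
class ModuloCounter:
    """Behavioural model of a counter modulo N with one input and N outputs."""

    def __init__(self, n):
        self.n = n
        self.count = 0

    def event(self):
        """Accept one input event and return the index of the output that fires."""
        output = self.count % self.n
        self.count += 1
        return output


counter = ModuloCounter(2)                 # two ack? lines in the SEQ thread
print([counter.event() for _ in range(5)])
```

For the SEQ thread above (N = 2) successive acknowledgements are steered alternately to the two control outputs.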
Figure 3.52: Inputs in SEQ thread
3.14.3 Multiple reads from a register
Another possible optimisation arises when there is an asynchronous register from which 
some identical read-control signal requests the stored value several times. A typical 
example is when a variable appears several times in a single expression:
result := a + b > a
Figure 3.53: Multiple reads
The implementation is shown in figure 3.53 and can be nicely substituted with 
the circuit in figure 3.54. In practice, this optimisation can drastically relax the 
implementation of the asynchronous register used.
3.15 Multiple writes to a register
If we have two appearances of an input channel a and both of them store data in 
the same variable x, one can optimise the number of write inputs for the x register. 
Figure 3.55 displays the substitution, where the circuit circumscribed with a dashed 
line presents the optimisation. There are many other possible optimisations 
and this is a subject for further research.
Figure 3.54: Multiple reads: optimised
Figure 3.55: Multiple writes: optimised
So far we considered the compilation procedure for synthesis of 2-phase signalling, 
bundled-data, asynchronous circuits. We will demonstrate its application 
in the following chapter with the implementation of two types of packet router.
Chapter 4
Asynchronous Packet Routers
Packet switches are now a familiar component of all concurrent architectures and 
a good example to illustrate asynchronous design. This chapter describes the 
asynchronous implementation of three basic building blocks for asynchronous 
packet routers used for constructing direct packet-switched networks. It shows 
how, on this basis, two different types of routing devices for mesh topologies can 
be constructed. Another major objective of this chapter is to demonstrate the 
asynchronous technique for VLSI design described in the previous chapter.
4.1 Deadlock
A major problem in the design of communication networks is preventing deadlock. 
Deadlock occurs in a concurrent system when no further action can take place. 
There are many approaches to solving this problem and, as a survey of these 
is beyond the scope of this work, we will only consider a few that have been used 
in the forthcoming design of asynchronous packet routers. References [11, 
33] present a good overview of several deadlock avoidance strategies.
First let us introduce some definitions. A network V is an indexed set $\{P_i : 1 \le i \le N\}$ 
where each $P_i$ is a process. The process defining the network activity as a 
global one is $\|_{1 \le i \le N} P_i$. We will also require that all communications between the 
processes are point-to-point. Such a network is called triple-disjoint. A request 
for communication between two processes $P_i \to P_j$ is ungranted if $P_i$ is willing to 
communicate with $P_j$ but $P_j$ never responds. A cycle of ungranted requests is an 
indexed set $\{i_0, \ldots, i_{k-1}\}$ where each request $P_{i_j} \to P_{i_{(j+1) \bmod k}}$ is ungranted.
Theorem: If V is a triple-disjoint network and all its subprocesses $P_i$ are internally 
deadlock free, then any deadlock state contains a cycle of ungranted requests.
A cycle of ungranted requests is a consequence of deadlock and is also a 
symptom of deadlock. Therefore a good strategy for deadlock prevention is to 
avoid such cycles. In both types of deadlock avoidance strategy we chose, the 
whole network of communicating processes is a priori acyclic, thus guaranteeing 
deadlock freedom.
Another, more powerful theorem can be found in reference [11]. The solution 
for deadlock freedom is tailored for packet switching networks. Each process $P_i$ 
from the network V serves a buffer pool $B_i$, where each communication delivers 
a message with its destination $D_m$ as a preamble. All communications between 
the processes are point-to-point and directed, $P_i \to P_j$. A process $P_i$ forwards the 
message from channel $C_{in}$ with destination $D_m \in V$ to channel $C_{out} = RF(C_{in}, D_m)$. 
The function RF is a global routing function. There are two additional channels, 
for message injection and consumption, and these are assigned to each process. 
A channel dependencies graph, CDG, is a tuple of two sets, nodes CN and arcs RA. 
The nodes are the channels in V. There is an arc $C_i \to C_j$ between nodes $C_i$ and 
$C_j$ if there is a process $P_k$ from V for which $C_j = RF(C_i, P_k)$.
Theorem: If for network V the corresponding CDG is acyclic, and if each message 
arriving at its destination is eventually consumed, then V is deadlock free.
The theorem is more powerful because the original network V might contain 
cycles but, as long as there are no cyclic channel dependencies in the CDG, 
deadlock can occur only if a destination process refuses to accommodate its message.
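The premise of the theorem can be checked mechanically for a given routing function. The following Python sketch builds the channel dependencies graph and tests it for cycles with a depth-first search. The representation of the routing function as a Python callable that returns None when a message leaves the network, and all names used, are assumptions made for illustration.

```python
def build_cdg(channels, destinations, rf):
    """Arcs of the CDG: c -> rf(c, d) for every channel c and destination d."""
    arcs = {c: set() for c in channels}
    for c in channels:
        for d in destinations:
            nxt = rf(c, d)
            if nxt is not None:           # None: the message is consumed
                arcs[c].add(nxt)
    return arcs


def is_acyclic(arcs):
    """Depth-first search with three colours; a grey successor closes a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {c: WHITE for c in arcs}

    def visit(c):
        colour[c] = GREY
        for nxt in arcs[c]:
            if colour[nxt] == GREY:
                return False
            if colour[nxt] == WHITE and not visit(nxt):
                return False
        colour[c] = BLACK
        return True

    return all(colour[c] != WHITE or visit(c) for c in arcs)


# A three-channel ring has a cyclic CDG; a three-channel line does not.
ring = {"c0": "c1", "c1": "c2", "c2": "c0"}
line = {"c0": "c1", "c1": "c2", "c2": None}
print(is_acyclic(build_cdg(list(ring), ["p"], lambda c, d: ring[c])))
print(is_acyclic(build_cdg(list(line), ["p"], lambda c, d: line[c])))
```

An acyclic result establishes only the first premise of the theorem; eventual consumption of every message at its destination still has to be guaranteed separately.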
4.2 Mesh Topology Packet Routers
This chapter presents the implementation of three asynchronous blocks with which 
one can build packet routing nodes for direct networks, i.e. we consider networks 
where there is a routing node (RN) attached to each processing element (PE) (see 
figure 4.1). The PE injects and consumes messages on channels with which the 
PE is connected to the RN. We also consider only mesh topologies, so each RN is 
connected with 4 such pairs of channels to its neighbouring RNs, thus comprising 
a grid topology.
Figure 4.1: Direct network, mesh topology
4.2.1 Oblivious Routing Packet Routers
Oblivious routing for mesh topology is a well known strategy for deadlock avoidance 
[11, 12, 13, 14]. Basically the routing is initially performed on dimension 
X and then on dimension Y (first X then Y). Each routing node is connected to 
its neighbours with 4 pairs of input and output channels. There are another two 
channels for injection and consumption of the message. For each input channel 
there is a routing process with an assigned buffer and, because the routing is performed 
in first X then Y way, there is no possibility for channel cycles [11]. The 
latter is a sufficient condition for deadlock freedom.
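The first X then Y rule can be stated as a routing decision on the remaining relative offsets. The Python sketch below is an abstract illustration only; the direction labels and the signed-offset representation are assumptions and do not reflect the packet format used by the routers.

```python
def xy_route(dx, dy):
    """Next direction for a message with remaining relative offsets (dx, dy)."""
    if dx > 0:
        return "+X"
    if dx < 0:
        return "-X"
    if dy > 0:                            # dimension X is satisfied, move on Y
        return "+Y"
    if dy < 0:
        return "-Y"
    return "consume"


def path(dx, dy):
    """List of routing decisions until the message is consumed."""
    hops = []
    while True:
        step = xy_route(dx, dy)
        hops.append(step)
        if step == "consume":
            return hops
        sign = 1 if step[0] == "+" else -1
        if step[1] == "X":
            dx -= sign
        else:
            dy -= sign


print(path(2, -1))
```

Because a message never returns to dimension X once it has started moving on Y, no channel on Y can depend on a channel on X, which is why channel cycles cannot form.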
4.2.2 Restricted Routing Packet Switching Node
Another strategy for deadlock avoidance is using virtual networks [33, 50, 51, 52]. 
For mesh topologies there are 4 virtual planes where the routing is performed in 
(+X, +Y), (+X, -Y), (-X, +Y), (-X, -Y) as shown in figure 4.2. Obviously there can be no 
channel cycles in any of these 4 planes so any type of routing is deadlock free. 
Normally, depending on the address of the destination, the message is injected in 
one of these 4 planes and it is routed only in this plane, where it is eventually also 
consumed.
4.2.3 Basic Building Blocks
The section presents the implementation of three asynchronous blocks with which 
one can build packet routing nodes for direct networks. The processing element 
injects and consumes messages on channels with which the PE is connected to 
the routing node. As we discussed before, deadlock can arise in such highly concurrent 
networks. Depending on the chosen strategy for deadlock avoidance, the 
node architecture of these routing nodes is different, but the set of building blocks 
is the same. These comprise: asynchronous packet multiplexor (AMUX), address 
decrementor (DEC) and routing switch (RT).
Figure 4.2: Deadlock free routing using virtual nets
Each of these blocks (AMUX, DEC and RT) has been specified in OCCAM(async). 
The compilation technique, presented in the previous chapter, has been applied 
to the specification of the different building blocks and thus the final circuit has 
been produced.
Communication channels and packet format
All building blocks communicate on channels of one and the same type. Some of 
the channel's wires carry the logical values of the data; the rest of the wires are 
control ones which provide the correct clocking scheme for node-to-node communication. 
In this section we describe the meaning of each wire of the communication 
channel.
Each channel is physically represented by a pair of bundled-data buses, req + 
data and e + data, when on-chip, and both buses are aligned for off-chip communications.
• $data_i$, $1 \le i \le 8$, are the logical values of the current data word.
• req is an event which latches the valid data on its bundle.
• e is an event which indicates end of packet and it also latches the initial 
control word. In the latter case e might be regarded as a start of packet 
event.
• $ack_e$ and $ack_r$ are events with which the recipient acknowledges the data 
for events e and req respectively. As the existence of two acknowledges is expensive 
in terms of wires for off-chip communications, we will multiplex these 
two onto a single acknowledgement line when outside the chip. Figure 4.3 
displays the two circuits necessary for the multiplexing.
Figure 4.3: Multiplexing/demultiplexing $ack_e$, $ack_r$
The routing nodes receive and send out packets on the communication channels 
described above. The format of these packets can be formally expressed by 
the following expression: e.req*.e. The first event e carries the control word, which 
consists of a control field (M bits) and the relative address of the destination (N bits). 
We use relative addressing and therefore each address field contains the number 
of hops to be travelled on the current dimension. On each hop this relative address 
is decremented by 1. When some relative address becomes zero, the control 
field in the control word indicates where the rest of the message should be copied 
to: reinjected on another dimension or consumed by the processing element.
After the initial control word the packet continues with data words latched by 
req. In fact these data words can also be control words but they will be interpreted 
as such by the routing nodes further on the message path. The second incoming 
event e indicates the end of the packet. Each event e or req is acknowledged on the 
corresponding channel $ack_e$/$ack_r$ or, if it is an off-chip interface, on channel ack.
The three blocks we present later in this section accept and send out data on 
packet channels of the format described above.
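The packet format e.req*.e is a regular expression over the two request events, so a trace of events can be checked directly. The Python sketch below assumes that a trace is given as a list of the strings "e" and "req"; this encoding and the function name are illustrative.

```python
import re

# One opening e, any number of req events, one closing e.
PACKET = re.compile(r"e( req)* e")


def is_well_formed(events):
    """True if the list of event names is exactly one packet, e.req*.e."""
    return PACKET.fullmatch(" ".join(events)) is not None


print(is_well_formed(["e", "req", "req", "e"]))
print(is_well_formed(["e", "req"]))
```

A packet with no data words, i.e. the trace e followed immediately by e, is accepted, since req* allows zero repetitions.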
Address Decrementor - DEC
Figure 4.4: DEC block
In this section we present the address decrementor (figure 4.4). As the relative 
address and the control field are in the initial control word, DEC has to decrement 
only the address field from the current incoming word latched with event ein. The 
specification of DEC follows:
ENV CHAN OF INT ein?, rin?, eout!, rout!:
ENV CHAN aein!, aeout?, arin!, arout?:
ACK ? ein, rin ! aein, arin:
ACK ! eout, rout ? aeout, arout:
-- ALIGN ein, rin:
-- ALIGN eout, rout:
-- if the existence of two types of buses is expensive
-- we can align them for even on-chip communications
INT buf:
WHILE TRUE
  SEQ
    ein ? buf
    aein !
    eout ! buf[7:6] : (buf[5:0] -[6] 1)
    aeout ?
    [1]INT EOP:
    SEQ
      EOP := 1
      WHILE EOP
        DETALT
          rin ? buf
            SEQ
              arin !
              rout ! buf
              arout ?
          ein ? buf
            SEQ
              PAR
                EOP := 0
                SEQ
                  aein !
                  eout ! buf
                  aeout ?
With the first event ein, DEC accepts the control field and the relative address. 
It sends out the control field unchanged but the relative address is decremented by 
1. With each subsequent request on rin and ein, DEC transmits the data unchanged 
on the output channel. DEC acknowledges incoming events on rin and ein on the 
corresponding arin and aein. The second incoming e event reinitialises the process 
DEC.
The result from the compilation is shown in figure 4.5.
Figure 4.5: Implementation of DEC process
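The data transformation performed by DEC can be illustrated in software. The sketch below follows the bit layout used in the specification above (control field in buf[7:6], relative address in buf[5:0]) and assumes that the 6-bit subtraction wraps modulo 64; the handshaking on the acknowledge channels is not modelled and the function name is hypothetical.

```python
def dec(packet):
    """Model of DEC on one packet given as a list of (event, word) pairs.

    Only the first word (the control word latched by e) is modified: its
    6-bit address field is decremented, its 2-bit control field is kept.
    All following words are passed through unchanged.
    """
    (event, first), rest = packet[0], packet[1:]
    control = (first >> 6) & 0b11
    address = first & 0b111111
    out_first = (control << 6) | ((address - 1) & 0b111111)   # assumed 6-bit wrap
    return [(event, out_first)] + list(rest)


packet = [("e", 0b10000011), ("req", 0x55), ("e", 0x00)]
print(dec(packet))
```

In this example the control field 10 is preserved and the relative address goes from 3 to 2, while the data word and the closing word are untouched.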
Asynchronous Packet Multiplexor - AMUX
The second building block for asynchronous packet routers is the asynchronous 
multiplexor AMUX (figure 4.6). Block AMUX serialises the concurrent requests from 
its two input channels in1 and in2 on the output channel out. The specification of 
the AMUX process is as follows:
ENV CHAN OF INT e1?, e2?, r1?, r2?, e!, r!:
ENV CHAN ae1!, ae2!, ar1!, ar2!, ae?, ar?:
ACK ! e, r ? ae, ar:
ACK ? e1, r1 ! ae1, ar1:
ACK ? e2, r2 ! ae2, ar2:
-- ALIGN e, r:
-- ALIGN e1, r1:
-- ALIGN e2, r2:
-- aligning might be useful
-- for on-chip communications too
INT buf:
PAR
  WHILE TRUE
    -- this is the first process involving the arbitration
    SEQ
      ALT
        e1 ? buf
          SEQ
            e ! buf
            ae ?
            ae1 !
            e1 ? buf
            e ! buf
            ae ?
            ae1 !
        e2 ? buf
          SEQ
            e ! buf
            ae ?
            ae2 !
            e2 ? buf
            e ! buf
            ae ?
            ae2 !
  WHILE TRUE
    -- this is the second process for the data on 'r'
    -- well, i see nothing else than one AMUX plus DW for the acknowledgement
    SEQ
      DETALT
        r1 ? buf
          r ! buf
          ar ?
          ar1 !
        r2 ? buf
          r ! buf
          ar ?
          ar2 !
Actually only the initial start of packet events e are concurrent, therefore 
arbitration needs to be involved for resolving possible conflicts for only these events. 
All subsequent requests on req lines are mutually exclusive and they can be simply 
merged. After the arbitration process has been resolved, it is determined which input's 
data is copied onto the output channel out. Therefore the result from 
the arbitration process controls the multiplexing for the data inputs. There is no 
data processing in the AMUX element. Its task is to transmit one incoming packet 
from an input channel to its output out in a fashion that is non-interleaving, at the word 
level, with respect to the other input.
Figure 4.6: AMUX block
One possible implementation of the AMUX process is shown in figure 4.7.
Figure 4.7: The implementation of AMUX
If an AMUX block with more inputs is required, it can be built as a composition 
of several AMUX blocks with two inputs, as shown in figure 4.8.
Figure 4.8: Composition of AMUX blocks
4.3 Routing Block - RT
The routing block RT (figure 4.9) is the most complex of the three building 
blocks. It has one input channel and as many output channels as necessary for 
the specific implementation. The OCCAM(async) specification of the RT routing block 
follows:
ENV CHAN OF INT e?, r?:
ENV CHAN OF INT e1!, e2!, e3!, e4!:
ENV CHAN OF INT r1!, r2!, r3!, r4!:
ENV CHAN ae!, ar!:
ENV CHAN ae1?, ae2?, ae3?, ae4?:
ENV CHAN ar1?, ar2?, ar3?, ar4?:
ACK ? e, r ! ae, ar:
ACK ! e1, r1 ? ae1, ar1:
ACK ! e2, r2 ? ae2, ar2:
ACK ! e3, r3 ? ae3, ar3:
ACK ! e4, r4 ? ae4, ar4:
-- ALIGN e, r:
-- ALIGN e1, r1:
-- ALIGN e2, r2:
-- ALIGN e3, r3:
-- ALIGN e4, r4:
INT buf:
PROC COPY([2]INT index)
  [1]INT EOP:
  SEQ
    EOP := 1
    WHILE EOP
      DETALT
        e ? buf
          SEQ
            EOP := 0
            ae !
            CASE index
              0
                SEQ
                  e1 ! buf
                  ae1 ?
              1
                SEQ
                  e2 ! buf
                  ae2 ?
              2
                SEQ
                  e3 ! buf
                  ae3 ?
              3
                SEQ
                  e4 ! buf
                  ae4 ?
        r ? buf
          SEQ
            ar !
            CASE index
              0
                SEQ
                  r1 ! buf
                  ar1 ?
              1
                SEQ
                  r2 ! buf
                  ar2 ?
              2
                SEQ
                  r3 ! buf
                  ar3 ?
              3
                SEQ
                  r4 ! buf
                  ar4 ?
[2]INT ind:
WHILE TRUE
  SEQ
    e ? buf
    ind := buf[7:6]
    ae !
    IF
      buf[5:0]
        SEQ
          e1 ! buf
          ae1 ?
      TRUE
        SEQ
          r ? buf
          ae !
          CASE ind
            0
              SEQ
                e1 ! buf
                ae1 ?
            1
              SEQ
                e2 ! buf
                ae2 ?
            2
              SEQ
                e3 ! buf
                ae3 ?
            3
              SEQ
                e4 ! buf
                ae4 ?
    COPY(ind)
It accepts the control word of the packet with the first incoming event e and, 
if the relative address is not zero, RT copies the control word and the rest 
of the message on the direct channel {e1:r1}. (We stipulate that channel {e1:r1} 
is the direct one and its binary representation in the control field cf from the first 
control word is zero.) If the relative address is zero then the routing block RT opens 
channel $\{e_{cf} : r_{cf}\}$ with event e, where cf is the control field from the control word as 
a binary value.
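The routing decision itself reduces to a test on the address field. The Python sketch below mirrors the bit layout of the specification (control field in buf[7:6], relative address in buf[5:0]) and returns the index used by the CASE constructs, where index 0 stands for the direct channel {e1:r1}. The function name is hypothetical and the handshaking is not modelled.

```python
def rt_select(control_word):
    """Index of the output channel pair chosen for a packet's control word."""
    cf = (control_word >> 6) & 0b11           # control field, buf[7:6]
    address = control_word & 0b111111         # relative address, buf[5:0]
    if address != 0:
        return 0                              # keep going on the direct channel
    return cf                                 # address satisfied: open channel cf


print(rt_select(0b11000101), rt_select(0b10000000))
```

A non-zero address always selects index 0 regardless of the control field; only when the address has reached zero does the control field choose among the four outputs.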
To illustrate the compilation technique when using a PROC definition we first 
compile the process COPY and the result is shown in figure 4.10. The circuit for 
the body of RT is shown in figure 4.11.
Figure 4.10: Implementation of RT COPY process
Figure 4.11: Implementation of RT body process
We can now assemble the overall process from the separately compiled parts 
(see figure 4.12). In each circuit, one GDW element is assigned for each input from 
the environment. When the circuits are assembled into one, all DW elements for 
the same input are combined into one. The same process is applied for the bundle 
elements as well.
4.4 Implementation of the Routing Switches
4.4.1 Oblivious Routing Packet Switch - Implementation
One possible implementation based on the basic blocks is shown in figure 4.13.
On injection, the address field of the first control word is zero. The control
field contains the 'address' of the direction of injection (+X, -X, +Y, -Y), which
is the address of an output channel of the injecting RT node. (The channels
drawn with dashed lines are the channels for injection/consumption.) Each subsequent
control word of the message carries the relative address of the destination for the
current dimension. The control field shows on which channel the message should
be delivered after it satisfies the current dimension: whether it is reinjected in
(+Y, -Y) or consumed by the recipient. Thus the first control word indicates on
which dimension the message is to be injected. The next control word (the second
word of the packet) shows how many hops it has to travel on this dimension and,
when the address is satisfied, where the message should be reinjected. When
the packet arrives at the destination RN, the last control word must address the
channel consume.
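The control-word scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RT circuit itself: the tuple layout `(address, control)`, the function name `route`, the coordinate convention and the idea that the relative address is simply a hop count along the current direction are all assumptions made for the example.

```python
# Unit steps for the four mesh directions (assumed coordinate convention).
STEP = {"+X": (1, 0), "-X": (-1, 0), "+Y": (0, 1), "-Y": (0, -1)}

def route(control_words, start=(0, 0)):
    """Trace the nodes visited by a packet given its control words.

    control_words is a list of (address, control) pairs (hypothetical layout):
    the first word has address 0 and names the injection direction; every
    later word gives the number of hops on the current direction and the
    channel to use once that dimension is satisfied.
    """
    words = list(control_words)
    address, direction = words.pop(0)
    if address != 0:
        raise ValueError("address field of the first control word must be zero")
    x, y = start
    path = [(x, y)]
    while direction != "consume":
        hops, next_channel = words.pop(0)   # control word for this dimension
        dx, dy = STEP[direction]
        for _ in range(hops):               # travel until the address is satisfied
            x, y = x + dx, y + dy
            path.append((x, y))
        direction = next_channel            # reinject, or consume at the recipient
    return path

# Inject on +X, travel 2 hops, reinject on +Y, travel 1 hop, then consume.
print(route([(0, "+X"), (2, "+Y"), (1, "consume")]))
```

The last control word names the channel consume, which is what terminates the loop in this sketch.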
4.4.2 Restricted Routing Packet Switching Node - Implementation
One possible implementation for a single routing node of a single plane is shown 
in figure 4.14.
On injection, the address field of the first control word is zero. The control field
contains the 'address' of the direction of injection (X or Y). Each subsequent
control word of the message carries the relative address of the destination (in its
address field) for each dimension. The corresponding control field shows on which
channel the message should be delivered after it satisfies the current dimension:
whether it is reinjected in X or Y or it is consumed by the recipient on channel
consume.
Figure 4.14: Restricted routing packet switching device

4.5 Conclusions
The chapter presents an asynchronous implementation of some basic blocks for 
packet routing switches. It also demonstrates an asynchronous design for creating 
a correct clocking scheme for the different functional blocks. The design presented 
in the chapter is an enhancement of a previous asynchronous implementation of a 
fully delay-insensitive mad postman packet switch described in appendix B. Our 
aim was to preserve the delay-insensitivity of the control unit but also to implement 
bundled data channels with data processing based on conventional logic design. 
In the previous implementation the data was double-rail encoded which requires 
two wires for each single bit of data. The latter becomes expensive in terms of 
wires and bandwidth for wider channels.
Although we confine ourselves to mesh topologies in this chapter, packet
routers for other architectures can be constructed using the same design
techniques described in the previous chapter. In fact, the basic blocks can easily be
enhanced so that other facilities, such as a broadcast option, can also be
implemented.
Chapter 5
Tangram vs OCCAM(async)
In this chapter we compare our approach with an existing, similar asynchronous
logic synthesis tool, Tangram. Tangram is a language for asynchronous hardware
specification developed by Philips Research Laboratories [44].
Both design methods
1. exploit similar synthesis ideas
2. use similar CSP based input languages for hardware description
Both languages utilise several new constructs in addition to those used in
contemporary sequential languages: the parallel construct, communication
on a channel between concurrent processes, and guarded choice. Such
a language extension derives logically from the concurrent nature of a hardware
device and, moreover, from the nature of an asynchronous circuit:
• The start → finish exchange of control signals within an asynchronous device
can be nicely modelled with communications on channels
• Concurrent hardware subdevices can be formally described by the execution
of parallel processes
• Shared resources and their underlying hardware management can be formally
specified using guarded choice
Besides these similarities there are some apparent differences:
• Philips' design method produces 4 phase signaling circuitry, while OCCAM(async)
compilation results in 2 phase signaling devices.
• The basic sets of synthesis library elements are entirely different.

Figure 5.1: Tangram extension: compiling choice operator
In the following sections the various basic language constructs are discussed
and the performance of the underlying circuitry is compared in order to better
appreciate the elegance of 2 phase signaling.
5.1 Choice operator: Nondeterministic and Deterministic
Reference [25] claims to extend the work of Philips Laboratories with a solution for
synthesis of the choice operator. In all available references, there was no clear
evidence that Tangram contained any similar construct and corresponding
underlying asynchronous circuit. The proposed solution is based on a polling mechanism
(see figure 5.1). At the heart of the compilation, the T element accepts a polling
signal and, depending on whether an input event has been encountered, either
delivers the polling signal further down the chain or passes a control signal for
accepting the signal as a possible choice. The major drawback of this solution is
the constant power consumption in the circuit until a signal arrives on one of
the input channels: while it polls, the circuit shown in figure 5.1 behaves
like an ordinary oscillator.
As we saw in chapter 3, our corresponding synthesised circuit waits statically
for a signal on one of the choice inputs and, after the preliminary initialisation
of the GDW, there is no power consumption unless an input arrives.
Figure 5.3: Receiving an input: OCCAM(async)

5.2 Receiving an input
Let us consider an input of some circuit which textually appears N times in its
Tangram or OCCAM(async) specification and which is not a guard of any
kind of choice. The circuits that result from compiling the reception of this input
with the two methods are shown in figures 5.2 and 5.3 respectively.
The latency in accepting the input with the first method (Tangram) is
$$t_{accept\ input} = 2 \times \log_2(N) \times (t_{NOR} + t_{MULLER\ C})$$
Using the OCCAM(async) method, in the average case where the circuit allows
us to use the fast GDW, the latency is
$$t_{accept\ input} = t_{MERGE} + t_{MULLER\ C}$$
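To get a feel for how the two expressions scale with N, they can be evaluated numerically. The sketch below only restates the two formulas; the function names and the unit gate delays used in the loop are assumptions chosen for illustration, not measured values.

```python
import math

def tangram_accept_latency(n, t_nor, t_muller_c):
    """Tangram: 2 * log2(N) * (t_NOR + t_MULLER_C)."""
    return 2 * math.log2(n) * (t_nor + t_muller_c)

def occam_accept_latency(t_merge, t_muller_c):
    """OCCAM(async) with the fast GDW: t_MERGE + t_MULLER_C, independent of N."""
    return t_merge + t_muller_c

# Assumed unit delays for every gate type (illustrative only).
for n in (2, 4, 8, 16):
    print(n, tangram_accept_latency(n, 1.0, 1.0), occam_accept_latency(1.0, 1.0))
```

With equal unit delays the first expression grows logarithmically in N while the second stays constant.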
Figure 5.4: S element
By $t_{MERGE}$, $t_{MULLER\ C}$ and $t_{NOR}$ we denote the one-gate delay in the Merge, Muller C
and NOR cells. As the latter formula shows, the delay does not depend on N.
However, that is not a fair picture: we have not included the latency of
the output acknowledgement in the second formula. Communication in Tangram
is strictly a "handshake", which means that whenever an input is received,
it is acknowledged before commencing any further action. Therefore
$$t_{accept\ input\ \&\ ack} = t_{MERGE} + t_{MULLER\ C} + \log_2(N) \times t_{MERGE}$$
where the latency $\log_2(N) \times t_{MERGE}$ represents the delay in the tree of merges for
the output acknowledgement in the OCCAM(async) method. However, the latter
is required only if the input must be acknowledged immediately. Overlapping
communication with concurrent work is a known strategy for decreasing
communication overhead, i.e. while waiting for an acknowledgement the sender can
perform some useful work instead of being idle. Therefore, the second part of the
latency, $\log_2(N) \times t_{MERGE}$, can be naturally "covered" with some progress on the
sender side, and this is only possible within OCCAM(async), where the designer has
the freedom to specify it so. In contrast, because of the strict handshaking manner
of communication in the Tangram design technique, each single action on a channel
involves a delay dependent on N.

5.3 SEQ and PAR compilation
Because of the nature of four phase signaling, the SEQ construct is compiled
rather differently in Tangram as compared to OCCAM(async). The circuit for the
realisation of complete four phase handshaking is the S element shown in
figure 5.4; its implementation is shown in figure 5.5.
Initially, a0 = a1 = b0 = b1 = FALSE and the sequence of events can be formally
described as
$$a_0\uparrow;\ b_0\uparrow;\ b_1\uparrow;\ b_0\downarrow;\ b_1\downarrow;\ a_1\uparrow;\ a_0\downarrow;\ a_1\downarrow$$

Figure 5.6: Tangram vs OCCAM(async) SEQ: implementation
What is important to notice is that the signaling on a1 is initiated only after the full
completion of the handshaking on b0 and b1.
The SEQ construct in OCCAM(async) is naturally implemented as only a wire
passing the control from the previous process (see figure 5.6).
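The ordering constraint of the S element can be checked mechanically. The sketch below assumes the event sequence reads a0 up, b0 up, b1 up, b0 down, b1 down, a1 up, a0 down, a1 down; the names `S_ELEMENT_CYCLE` and `replay` are hypothetical and exist only for this illustration.

```python
# One activation of the S element, as (wire, new level) pairs (assumed reading).
S_ELEMENT_CYCLE = [
    ("a0", 1), ("b0", 1), ("b1", 1), ("b0", 0),
    ("b1", 0), ("a1", 1), ("a0", 0), ("a1", 0),
]

def replay(cycle):
    """Apply the events to four wires that all start at FALSE (0).

    Raises ValueError if an event does not actually change its wire,
    i.e. if it is not a genuine voltage transition.
    """
    wires = {"a0": 0, "a1": 0, "b0": 0, "b1": 0}
    for wire, level in cycle:
        if wires[wire] == level:
            raise ValueError(f"{wire} is already {level}")
        wires[wire] = level
    return wires

final = replay(S_ELEMENT_CYCLE)
print(final)                  # all four wires are back at 0 after the cycle
print(len(S_ELEMENT_CYCLE))   # eight transitions per activation
```

In this reading a1 rises only after both b0 and b1 have returned to zero, whereas the OCCAM(async) SEQ adds no transitions of its own because it is a plain wire.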
The PAR construct realisation in Tangram is also based upon an S element. One
implementation is shown in figure 5.7.

Figure 5.7: Tangram PAR

Notice again that the completion of the concurrent branches finishes with the
full four phase handshaking on all parallel branches. Roughly speaking, one has
to propagate one signal to the bottom of the branch, wait for the completion of
all full handshakes and then commence the execution of all processes following
further. In contrast, the PAR construct in OCCAM(async) branches the
execution of its subcomponents using just a fork, and the completion of all of them
is synchronised with a Muller C element. Both SEQ and PAR constructs are more
efficiently compiled using 2 phase signaling.
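The fork/join behaviour of the 2 phase PAR can be illustrated with a behavioural model of a Muller C element, whose output changes only when all inputs agree. This is a minimal sketch, not the library cell: the class name `MullerC` and its method are assumptions, and gate delays are ignored.

```python
class MullerC:
    """Behavioural Muller C element: the output follows the inputs only when
    all of them hold the same level; otherwise it keeps its previous value."""

    def __init__(self, n_inputs, initial=0):
        self.inputs = [initial] * n_inputs
        self.output = initial

    def set_input(self, index, level):
        self.inputs[index] = level
        if all(value == level for value in self.inputs):
            self.output = level
        return self.output

# PAR with two branches: the fork starts both, the C element joins them.
join = MullerC(2)
print(join.set_input(0, 1))  # branch 0 finished: output still 0
print(join.set_input(1, 1))  # branch 1 finished: output toggles to 1
# Next activation in 2 phase signaling: the completion events are falling edges.
print(join.set_input(1, 0))  # output holds 1
print(join.set_input(0, 0))  # output toggles to 0
```

Each toggle of the output is one completion event of the whole PAR, so no return-to-zero phase is needed between activations.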
5.4 The wagging buffer
So far we have compared the basic constructs of both languages and their
implementation. We conclude with a performance estimation, in terms of speed, of the
well known wagging buffer realised using both compilation techniques (see figure 5.8).
This type of buffer contains two registers (X and Y) and is called wagging
because it reads and writes into these in a toggling manner, performing two strictly
sequential operations:
• reading from Y while storing in X
• reading from X while storing in Y
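The toggling behaviour described above can be sketched at a purely functional level. The model below ignores all handshaking and timing; the function name `wagging_buffer` and the use of `None` for the initial, undefined register contents are assumptions made for the illustration.

```python
def wagging_buffer(inputs, x=None, y=None):
    """Alternate between the two operations of the wagging buffer.

    Odd steps read from Y while storing in X, even steps read from X while
    storing in Y. The value read is the old register content, because the
    read and the store of one step happen in parallel on different registers.
    """
    outputs = []
    store_in_x = True
    for value in inputs:
        if store_in_x:
            outputs.append(y)   # reading from Y ...
            x = value           # ... while storing in X
        else:
            outputs.append(x)   # reading from X ...
            y = value           # ... while storing in Y
        store_in_x = not store_in_x
    return outputs

print(wagging_buffer([1, 2, 3, 4]))
```

In this model every input value reappears at the output one step later, after the initial register content has been emitted.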
Figure 5.8: The wagging buffer
The specification of the wagging buffer can be formally described in Tangram
with the following expression
$$WAG = \#[(in?x \parallel out!y);\ (in?y \parallel out!x)]$$
and figure 5.9 illustrates the implementation in handshaking circuits (taken from
page 25 of [44]).
The OCCAM(async) specification of the wagging buffer is similar:
WHILE TRUE
  SEQ
    PAR
      SEQ
        in ? y
        ain !
      SEQ
        aout ?
        out ! x
    PAR
      SEQ
        in ? x
        ain !
      SEQ
        aout ?
        out ! y
Its two-phase, bundled-data implementation is displayed in figure 5.10. It is
interesting to note that the delay line used is necessary in order to use the fast DW
implementation, and it can be realised by two inverters. The wagging buffer is a very
good example which illustrates how, after assembling the final circuit, one can
reason about critical feedbacks such as from a DW output to some of its rows.
Having constructed the circuit, the dimensions of all GDW are known, and one is
aware of the constraints imposed on these feedbacks if it is intended that the fast
DW element be used. Of course, one can optimise the implementation by replacing
the $DW_{2\times1}$ elements with Toggles. However, although not theoretically nice, the
use of the fast DW with delay lines may drastically improve the total performance.
In the fast DW implementation, the constraint on the feedback, output($DW_{out}$) →
synchronising Muller C → input($DW_{out}$), is that it should involve a delay
of at least one Merge element and, as we will see, the delay line does not play any
crucial role in the final performance.
A good strategy for performance estimation of asynchronous circuits consists
of simply connecting the corresponding request ⇄ acknowledgement lines with
wires, thus transforming the circuit into an oscillator. The oscillation frequency
can be regarded as the highest performance the circuit can reach, because of the
zero delays in those request ⇄ acknowledgement lines.
Figure 5.9: The wagging buffer: Tangram implementation

Figure 5.10: The wagging buffer: OCCAM(async) implementation

In both implementations, we will consider a rough, gate-level delay estimation
of the critical paths. These longest paths, in terms of delay, are shown in dashed
lines in figures 5.9 and 5.10. The delay of the Tangram circuit involves the
gates in the order shown below. We use the names of the library cells, with the
constructs in which signals are delayed on the critical path as subscripts. Also, as
we are not aware of the delay for storing a value in a Tangram var circuit, we
ignore it, in favour of the Tangram design in the performance evaluation.
START → NOR_FOREVER → AND_SEQ → AND_PAR1 → NOR_MIXER → C_MIXER →
NOTC_PAR1 → AND_PAR1 → NOR_MIXER → C_MIXER → NOR_PAR1 → C_PAR1.SYNC1 →
NOTC_SEQ → AND_SEQ → NOTC_PAR1 → NOR_PAR1 → C_PAR1.SYNC1 → NOR_SEQ →
AND_PAR2 → NOR_MIXER → C_MIXER → NOTC_PAR2 → AND_PAR2 → NOR_MIXER →
C_MIXER → NOR_PAR2 → C_PAR2.SYNC1 → NOR_FOREVER → NOTC_SEQ → NOR_SEQ →
NOTC_PAR2 → NOR_PAR2 → C_PAR2.SYNC1 → END
Total delay: 18 × delay_nand/nor + 14 × delay_muller c
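The stated total can be cross-checked by classifying the cells on the critical path. The list below is a transcription of that path with the subscripts read as FOREVER, SEQ, PAR1, PAR2, MIXER and SYNC1, which is an assumption where the print is unclear; counting NOTC cells as Muller C elements and AND/NOR cells as NAND/NOR gates is likewise an assumption of this sketch.

```python
from collections import Counter

# Critical path of the Tangram wagging buffer, cell by cell (assumed reading).
TANGRAM_PATH = """
NOR_FOREVER AND_SEQ AND_PAR1 NOR_MIXER C_MIXER
NOTC_PAR1 AND_PAR1 NOR_MIXER C_MIXER NOR_PAR1 C_PAR1.SYNC1
NOTC_SEQ AND_SEQ NOTC_PAR1 NOR_PAR1 C_PAR1.SYNC1 NOR_SEQ
AND_PAR2 NOR_MIXER C_MIXER NOTC_PAR2 AND_PAR2 NOR_MIXER
C_MIXER NOR_PAR2 C_PAR2.SYNC1 NOR_FOREVER NOTC_SEQ NOR_SEQ
NOTC_PAR2 NOR_PAR2 C_PAR2.SYNC1
""".split()

def gate_class(cell):
    """Map a cell name to the delay class used in the total."""
    kind = cell.split("_")[0]
    return "muller_c" if kind in ("C", "NOTC") else "nand_nor"

def is_control(cell):
    """Cells that belong to FOREVER, SEQ or PAR rather than to a mixer."""
    return not cell.endswith("_MIXER")

totals = Counter(gate_class(cell) for cell in TANGRAM_PATH)
control = Counter(gate_class(cell) for cell in TANGRAM_PATH if is_control(cell))
print(totals)   # whole path
print(control)  # control structures only
```

Under these assumptions the counts agree with the total of 18 NAND/NOR and 14 Muller C delays, and the non-mixer cells account for 14 NAND/NOR and 10 Muller C delays.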
The delay of our two-phase, bundled-data implementation involves the gates
shown below. Again, we trace the traveling signal along the longest path.
Although the two DW elements could be replaced with Toggles in the circuit in
figure 5.10, we would like to measure the circuit's performance in the average case
where DW elements appear as such.
START → MERGE_FOREVER → MERGE_1st DW → C_1st DW → MERGE_1st MUX →
NOR_1st MUX → LATCH_1st MUX → MERGE_1st MUX → MERGE_2nd DW → C_2nd DW
→ MERGE_2nd MUX → NOR_2nd MUX → LATCH_2nd MUX → MERGE_2nd MUX → END
Total delay: 6 × delay_merge + 6 × delay_nand/nor + 2 × delay_muller c
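The two totals can be compared numerically once per-gate delays are chosen. The sketch below simply evaluates both expressions; the function names and the unit delays in the example call are assumptions for illustration and do not come from any cell library.

```python
def tangram_cycle_delay(t_nand_nor, t_muller_c):
    """Tangram wagging buffer: 18 NAND/NOR delays plus 14 Muller C delays."""
    return 18 * t_nand_nor + 14 * t_muller_c

def occam_cycle_delay(t_merge, t_nand_nor, t_muller_c):
    """OCCAM(async) wagging buffer: 6 Merge, 6 NAND/NOR and 2 Muller C delays."""
    return 6 * t_merge + 6 * t_nand_nor + 2 * t_muller_c

# Assumed unit delay for every gate type.
tangram = tangram_cycle_delay(1.0, 1.0)
occam = occam_cycle_delay(1.0, 1.0, 1.0)
print(tangram, occam, tangram / occam)
```

With equal unit delays the Tangram path is a little more than twice as long; real ratios depend on the actual Merge, NAND/NOR and Muller C delays of the target library.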
It is not difficult to see that the two-phase implementation is characterised by
a lower latency than the handshake-circuit one, but this is not the point of our
consideration.
Of the total delay of 18 NAND/NOR and 14 Muller C elements, the delay
in the control structures, FOREVER, SEQ, and PAR, is 14 NAND/NOR and 10
Muller C, where SEQ and PAR contribute most. The wagging buffer is a relatively
simple example, and we have only 2 levels of PAR and SEQ nesting. A big part
of the delay in these structures is due to the existence of an S element (refer to
figure 5.4). Both structures, PAR and SEQ, contain such an S element in
order to fully complete all 4 phases before releasing their finish signal. When PAR
and SEQ are nested, which is quite natural and logical for any concurrent
specification, these S elements spend most of the time clearing the 4 phases between
each other, instead of performing any useful work in regard to communicating
or data processing. Nesting of such constructs will result in a bigger control unit
and consequently greater delays. The graph of SEQ and PAR physical elements
reflects the circuit's specification and therefore one can expect greater delays for
irregular circuit behaviour. We would also like to mention again that signaling
on all four ports of the S element is strictly sequential. It is important that the
4 phase handshaking completes in PAR or SEQ branches before commencing the
next task because of possible transmission interference in channel mixers or data
processing units. However, it is difficult to envisage how one can explore more
concurrency in order to reduce the overhead from using S elements.
In the process of accepting an input from the environment, there is a certain
delay before the acknowledgement (see figure 5.11), and this corresponds to the
delay in the input circuit, IC. If the handshaking is 4 phase, then
the total delay will involve twice the delay of IC in the req/ack loop 1. If one chooses
to further propagate the acknowledgement as a control signal instead of closing
the 4 phase loop, the control circuit will also contribute twice its delay to the total
latency of accepting an input. Therefore, the final acknowledgement is produced
and signaled out as soon as possible whenever 4 phase handshaking is utilised.
This is a good reason why request and acknowledgement lines should be
tightly coupled, and it confirms the implementation of Tangram's channels.
On the other hand, performing in full all 4 phases of the handshaking on a
channel before commencing any other action introduces the request ⇄ acknowledgement
delay of the environment sequentially. Obviously, in such a scheme, one cannot
explore more concurrency than that specified in the initial program code, and so
cannot decrease the overhead from communication by making further progress in
the 4 phase handshaking circuit. Although we consider an input
from the environment, the case with an output is similar.
Figure 5.11: Input/output on channel: 4 phase
In contrast to this, OCCAM(async) distinguishes request and acknowledgement
as two separate events. Our aim is to free the designer from the constraints of fully
synchronised communication as in Tangram. The natural reason for doing this is
to give the designer the possibility of exploring further concurrency within the final
circuit, thus increasing its performance. As we noticed from the fast implementation
of the DW element, we can utilise one very important strategy: propagating
the output as soon as possible and simultaneously completing the clear-off phase
within the DW cell. It is also important to note that one can explore similar
parallelism with an output to the environment. Taking into account the fact that the
environment will not supply an acknowledgement for an output before the actual
signaling out, one can propagate the control signal which executes the output
statement also as the control event to the successor, as shown in figure 5.12. This
is another example of how one can explore more and more eager parallelism and,
as we can see from the 2 phase wagging buffer implementation, both delays, that of
the PAR constructs and that of outputting/receiving the acknowledgement,
are concurrent.
5.5 Conclusions
We will summarise the comparison between both approaches in a concise
fact/consequence form for both types of signaling.
4 phase handshaking: Both the rising and the falling edge of a request line are
treated as a unit and together represent one logical event. Acknowledgement and
request edges interleave, therefore one has to fully complete the handshaking
before commencing any further progress in the circuit.
Consequence:
- Delays from communications from/to the environment are sequentially
imposed on the circuit.
- PAR, SEQ and other structural components must fully complete the 4 phase
handshaking down to the bottom of their branches prior to any further progress.
Nesting of such constructs introduces large control units, which contribute
most to the total latency as a factor of the total performance.
+ The multiplexing/demultiplexing part for output/input signals is simpler;
request and acknowledgement lines are logically high in the active phase.
2 phase signaling: Any voltage transition represents a logical event. Requests
and acknowledgements are treated as two separate lines.
Consequence:
+ One can explore more concurrency in order to reduce the overhead from
communications.
+ PAR, SEQ and other structural components are implemented more simply in
comparison with their 4 phase signaling counterparts. Nesting does not introduce
linear delay as SEQ is naturally implemented with only wires.
- The multiplexing/demultiplexing part is more complex because of the nature
of 2 phase signaling.
+ One can also explore eager parallelism in order to reduce overhead either from
inputs/outputs or control circuitry, as the latter two can be concurrent.
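The difference between the two event encodings can be made concrete by listing the wire transitions needed for a number of transfers. This is a minimal sketch that models only the req and ack wires of one channel; the function names are hypothetical and no delays are modelled.

```python
def four_phase_trace(n_transfers):
    """4 phase: every transfer is req up, ack up, req down, ack down."""
    trace = []
    for _ in range(n_transfers):
        trace += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return trace

def two_phase_trace(n_transfers):
    """2 phase: every transfer is one transition on req and one on ack."""
    trace = []
    req = ack = 0
    for _ in range(n_transfers):
        req ^= 1                      # any transition is an event
        trace.append(("req", req))
        ack ^= 1
        trace.append(("ack", ack))
    return trace

print(len(four_phase_trace(3)))  # 12 transitions for 3 transfers
print(len(two_phase_trace(3)))   # 6 transitions for 3 transfers
print(two_phase_trace(2))
```

The 4 phase trace returns both wires to zero after every transfer, which is the return-to-zero work that 2 phase signaling avoids.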
We have still not answered the question of which technique is more appropriate to
the design of asynchronous systems. Part of the difficulty in resolving it emerges
from the fact that the design community lacks not only good examples of
asynchronous implementation but also long historical experience with such systems.
The packet routers described in this thesis are good examples illustrating the
advantages of 2 phase signaling. The structures are relatively complex and the
level of nesting of PAR and SEQ constructs encourages it. We believe that the latter is
the threshold factor in the final decision on which technique is better to utilise. On the
other hand, if the control structure, as the clocking backbone, is relatively simple,
uniform and regular, one may consider 4 phase handshaking, as confirmed by the work
at Manchester University on the design of an asynchronous ARM processor [21, 22,
23]. The original 2 phase Sutherland micropipeline is the skeleton for clocking
the functional blocks. Requests and acknowledgements are propagated twice in
order to transform the 2 phase implementation into a 4 phase one, thus drastically
relaxing the implementation of various data blocks. The control micropipeline is
a regular and uniformly constructed structure and the performance degradation
from using 4 phase handshaking is insignificant in comparison with the delay
improvements in the functional blocks.
Chapter 6
Conclusions and future work
The novel compilation procedure presented in this thesis has been implemented
as a compiler written in C. It takes an input specification in OCCAM(async)
and produces a Verilog netlist [65] of library cells (GDW, Merges, Muller C elements,
etc). Verilog, as a hardware specification language and run-time environment,
provides enough primitives for the simulation and performance evaluation of
asynchronous circuits. One can use a Verilog description of the library cells which
allows relatively efficient and accurate debugging, and also speed and power
consumption estimation. This Verilog netlist can be further imported into the Cadence
design framework as a schematic representation and used for automatic layout
generation.
The novelty of this OCCAM(async) compilation technique lies in the way it
exploits 2 phase signaling and, as a consequence, in the novel set of library
elements. Encoding an event as a single voltage transition allows us to separate
request and acknowledgement lines and, therefore, to exploit eager parallelism in
order to implement more efficient circuits. The freedom to treat events on these
two lines as separate allows the designer to use more efficient communication
protocols and, therefore, to decrease the overhead of synchronisation. In
fact, the latter might also be considered as bringing more asynchrony to the
asynchronous circuit.
As we demonstrated in chapters 4 and 5, there is a wide spectrum of circuits
which can be implemented more efficiently using a 2 phase encoding scheme than
a 4 phase one. 2 phase signaling simplifies the implementation of the control
part but also involves slightly more complex data processing units. 4 phase
handshaking eases the design of the latter but, on the other hand, the control unit (CU)
is slower in comparison with the 2 phase encoding scheme. Obviously, there is a
trade-off between speed improvements in the data processing units (DU) and the CU
when using 2 and 4 phase signaling respectively. We believe that circuits from the
large variety of asynchronous systems with complex, irregular behaviour can be more
efficiently implemented using 2 phase signaling as it implies more efficient control
units.
Applying high level programming techniques to the hardware design process is
not a novel idea. It gives the designer a powerful abstraction when building
various devices: it is indeed a difficult job to design any circuit when one must
consider how to connect various library elements instead of using some behavioural
description language. Despite the fact that there are many behavioural synthesis
tools, designers are still inclined to use architecture description languages rather
than behavioural ones. The problem emerges from the gap between the high level
description and the final result. One of our initial aims, when designing the compilation
technique, was to mitigate this problem by targeting the programming language
as an assembler for asynchronous design. The specification language is
tightly connected with the final circuit's netlist; thus the designer has better
control over the synthesis while still using a powerful design abstraction,
which is also important.
6.1 Contemporary design process and OCCAM(async)
Figure 6.1 illustrates the stages of the contemporary design process. The starting
point is a high-level behavioural specification of the circuit. It can take various
forms, from an idea in the designer's head to high-level abstract languages (such as
the Z specification [58]). In order to reduce the initial complexity and, therefore, the
difficulty of understanding the circuit's behaviour, engineers decompose
the initial description into a network of smaller, concurrent subcircuits. This step
is very intuitive and unfortunately manual. Having performed a decomposition,
one can continue to tediously implement these subcircuits out of basic library
elements but, at this stage, there is already a wealth of tools which automatically
synthesise the final netlist of library cells. The compilation technique described
in this thesis covers this step of the design process. The result of the compilation
can be further used for automatic layout synthesis at the last, third stage, using a
design framework such as Cadence and Mentor Graphics.

Figure 6.1: The design process

6.2 Future work
The future work is logically defined by the steps of the design process shown in
figure 6.1 that are not yet covered. The ultimate goal is a fully automated VLSI design
procedure, from the top down to the actual chip manufacturing.
The process of high-level decomposition appears to be a rather difficult step but
not an insurmountable one. One possible and promising approach is the design
of a subtraction procedure. Assuming that the high-level specification is algebraic
(CSP, DIA, etc), such a procedure involves the algorithmic definition of the following
operation:
Process subtract: Given a process P, as an initial specification, and a process
R, as a possible subcomponent of P, the subtraction P − R is a new process Q
such that
$$P \sqsubseteq R \parallel Q, \quad \text{or} \quad P \sqsubseteq R \parallel (P - R)$$
The idea behind it is simple. Given the initial specification, one can predict
what subcomponents need to be extracted/subtracted from it and, by performing the
subtraction, one can reduce the initial complexity while simultaneously introducing
desirable concurrency to the implementation. Such a process is another form
of decomposing the initial high-level specification. It can possibly discover the
major, backbone subcomponents necessary for the final implementation and the
interconnections between them. But this is not the only use of the subtraction
procedure. The processes P and R can also be regarded as a target circuit and its
previous version. The subtraction then defines the additional
circuitry which evolves process R into the current specification of P. It is a common
situation that the VLSI market demands incremental changes within the current
implementation of a circuit. Such an approach of process subtraction will
ease the engineering decision of how much of the design history can be reused
in order to speed up the actual manufacturing.
The second, but not the final, parallel direction towards the ultimate goal is the
implementation of an optimiser tool. The output from a synthesis procedure is
usually a fixed topology netlist of library cells. Although producing the final layout
of silicon polygons seems to be a rather straightforward step, the designer
faces a large set of possible choices for the geometry of these polygons. Sizing the
physical layout plays a major role in the performance estimation. At present, the
designer performs circuit optimisation by using intuitive, empirical methods or by
many exhaustive, device-level simulations. More often, having designed the circuit
at the level of a netlist of library cells, the optimisation process comprises numerous
iterations where a physical cell is replaced with a stronger or weaker one until
the performance criteria are met. This process is so time consuming that the
designers are hard pressed to produce any circuit, and it would hardly allow them
to iteratively redesign all crucial library cells. This task is complicated and can be
accomplished relatively efficiently by an optimiser tool. It can also be performed at
two technology-dependent levels: macromodular and transistor level. The trade-off
is efficiency vs accuracy. At the macromodular level all transistors in a module are
sized proportionally, which simplifies the optimisation procedure and makes it
less computationally expensive, but it finds only a
suboptimal solution. Therefore, the macromodular approach is less accurate than
dealing directly with the geometry of each transistor.
These are only two obvious research directions into which the work in this
thesis can evolve. There are many design choices yet to be made: for example,
optimisation can be performed under the same criteria at higher levels, such as with the
well known Quine-McCluskey method [8], also taking power consumption and
chip area into account, and there are many other research directions towards the
ultimate goal. The further we progress within the same technology domain of design,
the more difficult the problems become. Historically, the level of difficulty is a good cue
for the need to replace an obsolescent design process with a new design framework.
Perhaps the latter is the major driving force of technological progress. Hopefully,
the work in this thesis has contributed to this evolutionary process.
Appendix A
Asynchronous buffering
In this appendix a novel asynchronous implementation of a FIFO buffer¹ is
described and compared with an alternative asynchronous implementation of
Sutherland's micropipeline. The design has the potential for significant reductions in
latency and it is an attractive application for problems where both throughput and
latency matter.
The design itself is delay-sensitive but it deserves some attention, as such a
FIFO buffer is extremely useful when off-chip interfacing is involved. For the
behavioural specification we will use an algebra based on DIA but violating the
delay-insensitivity law, i.e. one can signal on a wire without an intermediate
acknowledgement.
Figure A.1: req/ack loop
As one can envisage from the designs presented in Chapter 4, all blocks
communicate in a strict req ⇄ ack loop, i.e. there are two phases in each
single communication cycle.
¹ The work on asynchronous buffering is a result of a joint project with Oxford and South Bank
University
During the req phase, the sender is active and signals the valid data to the receiver. During the second phase, the sender is inactive and waits for an acknowledgement before continuing. While such a scheme is feasible for on-chip communications, applying it off-chip is unacceptable, because the delays in off-chip wires are significant. This has led us to investigate a novel scheme for off-chip interfacing, which we describe in this section. To avoid the problem described above, we would like such an implementation to minimise the number of inactive ack phases for off-chip communications.
Instead of waiting for the corresponding ack, the sender can continue signalling data with a certain delay between successive words, provided that the receiver can accommodate each incoming word within the duration of this delay. There is a buffer of size sz words on the receiver side, so the sender can send sz words continuously (figure A.1). When the receiver empties the buffer, it acknowledges the whole window with only one ack.
The specification of the sender is as follows:
$$
\begin{aligned}
SFIFO &= S_0 / ack_{out}? \\
S_i &= \mathbf{if}\ (i = 0)\ \mathbf{then}\ ack_{out}?\ \mathbf{fi};\ req_{in}?x;\ req_{out}!x;\ ack_{in}!;\ S_{(i+1)\%sz}
\end{aligned}
$$
The specification of the receiver is as follows:
$$
\begin{aligned}
RFIFO &= R_{0,0} / ack_{out}? \\
R_{r,w} &= [\, req_{in}?mem[w] \rightarrow R_{r,(w+1)\%sz} \\
&\quad \Box\ ack_{out}? \rightarrow \mathbf{if}\ (r = 0)\ \mathbf{then}\ ack_{in}!\ \mathbf{fi}; \\
&\qquad \mathbf{if}\ (r = w)\ \mathbf{then}\ req_{in}?mem[w];\ req_{out}!mem[r];\ R_{(r+1)\%sz,(w+1)\%sz} \\
&\qquad \mathbf{else}\ req_{out}!mem[r];\ R_{(r+1)\%sz,w}\,]
\end{aligned}
$$
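The windowing idea behind these two processes can be illustrated with a small simulation. The following is a minimal sketch only: it assumes a simplified, sequential model in which the sender bursts up to sz words and the receiver returns a single acknowledgement once it has drained its buffer. It does not reproduce the exact event ordering of the DIA specification, and the function name `transfer` is hypothetical.

```python
from collections import deque


def transfer(words, sz):
    """Simulate a windowed transfer: one acknowledgement per window of sz words.

    Returns the delivered words and the number of acknowledgements sent.
    """
    pending = deque(words)   # words still waiting at the sender
    buffer = deque()         # receiver-side buffer, capacity sz
    delivered = []
    acks = 0
    credit = sz              # words the sender may send before it must wait

    while pending or buffer:
        # Sender: burst words without waiting for individual acknowledgements.
        while pending and credit > 0:
            buffer.append(pending.popleft())
            credit -= 1
            assert len(buffer) <= sz  # the receiver buffer never overflows
        # Receiver: drain the buffer, then acknowledge the whole window once.
        while buffer:
            delivered.append(buffer.popleft())
        acks += 1
        credit = sz

    return delivered, acks
```

With ten words and sz = 4 the sketch sends three acknowledgements, where a strict req/ack loop would need ten.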
One implementation of the process $S_0$ is shown in figure A.2.
One possible implementation of the process $R_{0,0}$ is shown in figure A.3 and figure A.4, where the signals are defined as $e_i = mr_i \oplus mr_{i+1}$ and the acknowledgement $ack_{in}$ is identical to signal $a_{00}$. The counters acknowledge each incoming event on a different output. The task of the sender is to signal the valid data with a certain delay between two consecutive words. After sz words (jl = sz) it awaits an acknowledgement before continuing. The control part of the receiver ensures that the reader side does not overtake the writer side. The two counters synchronise with each other using a set of Muller C elements, which enforces this constraint. The data path of the receiver stores each incoming word in the current asynchronous latch.
Figure A.3: Implementation of $R_{0,0}$ - control part
Introducing two additional blocks in the routing node implementation, a sender and a receiver, improves the latency of the off-chip communications by approximately $T_{improve}$, where
$$T_{improve} = sz \times (DELAY_{ack,off\text{-}chip} - DELAY_{ack,on\text{-}chip})$$
Figure A.4: Implementation of $R_{0,0}$ - data path
Figure A.5: Sutherland's FIFO
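As a worked illustration of this estimate, the sketch below evaluates the improvement for one set of numbers. It assumes the formula is read as sz times the difference between the off-chip and on-chip acknowledgement delays. The function name and the delay values are hypothetical and chosen only to make the arithmetic visible.

```python
def latency_improvement(sz, delay_ack_off_chip, delay_ack_on_chip):
    """Approximate latency saved per window of sz words (same unit as the delays)."""
    return sz * (delay_ack_off_chip - delay_ack_on_chip)


# Hypothetical figures: a window of 8 words, 20 ns off-chip and 2 ns on-chip ack delay.
saving_ns = latency_improvement(8, 20, 2)
```

Under these assumed delays the estimate is 8 × (20 − 2) = 144 ns per window.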
Another advantage of such an interfacing scheme is that, coupled together, RFIFO(0)||SFIFO constitute a new process FIFO(sz), where FIFO(sz) (see figure A.5) is the elastic micropipeline described in Sutherland's well-known article [32]. This additional buffering improves the network bandwidth when the network is heavily loaded.
$$
\begin{aligned}
FIFO(sz) &= MPL_{\langle\rangle} / ack_{out}? \\
MPL_x &= [\, (\#x \neq 0)\ \&\ ack_{out}? \rightarrow req_{out}!head(x);\ MPL_{tail(x)} \\
&\quad \Box\ (\#x \neq sz)\ \&\ req_{in}?y \rightarrow ack_{in}!;\ MPL_{cat(x,y)}\,]
\end{aligned}
$$
where $\#x$ denotes the length of the list of values $x$, and head, tail and cat are the well-known list functions returning the head, the tail and the concatenation of their list parameters, respectively. One can argue that an implementation of such an elastic micropipeline suffices for asynchronous buffering, and that the design in Sutherland's article [32] is simpler. However, since we send the data in burst mode and the receiver must accommodate all sz incoming words, the receiving buffer needs to be initially empty. If such a buffer is implemented as an elastic micropipeline, the data has to ripple through the whole pipeline until it reaches the output. This is unacceptable for us; it is imperative that we reduce the latency between the input of the sender and the output of the receiver.
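The list-based FIFO(sz) process can be mirrored by a small behavioural model. This is a sketch under stated assumptions: the class name `ElasticFifo` and its method names are hypothetical, the two guards correspond to the conditions on the length of x, and handshake timing is not modelled.

```python
class ElasticFifo:
    """Behavioural model of a bounded FIFO holding at most sz values."""

    def __init__(self, sz):
        self.sz = sz
        self.x = []  # the list of buffered values, initially empty

    def can_output(self):
        # Guard of the output alternative: the list is not empty.
        return len(self.x) != 0

    def can_input(self):
        # Guard of the input alternative: the list is not full.
        return len(self.x) != self.sz

    def req_in(self, y):
        """Accept a value: the new state is cat(x, y)."""
        if not self.can_input():
            raise RuntimeError("buffer full")
        self.x = self.x + [y]

    def ack_out(self):
        """Emit head(x): the new state is tail(x)."""
        if not self.can_output():
            raise RuntimeError("buffer empty")
        head, self.x = self.x[0], self.x[1:]
        return head
```

The model captures only the externally visible order of values. The rippling of data through the stages, which is the latency concern raised above, belongs to the circuit implementation and is not represented here.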
One can see from the specification that the sender is blocked once it fills the buffer on the receiver side, waiting for an acknowledgement. To avoid this, we would like to introduce an additional buffer on the receiver side (a receiver with 2 buffers), where the sender fills these buffers in a wagging manner. In such a case, while the receiver empties the first buffer, the sender can fill the second one. Such an implementation allows contiguous writing if the data is available on the sender side and if the processes further along the pipeline stages are ready to proceed.
The behaviour of the sender is unchanged, and it implements $SFIFO/ack_{out}?$. The implementation of the receiver is the same, with jl = 2 × sz, and $ack_{in}$ is generated by simply merging $mr_0$ and $mr_{sz}$. The trick here is that the two RFIFO halves (control and data path from 0 to sz − 1, and from sz to 2 × sz − 1) should send an acknowledgement $ack_{in}$ only after the previous one has been consumed, i.e. after there is an acknowledgement for the acknowledgement $ack_{in}$. That is why $ack_{in}$ is the result of merging the two signals $mr_0$ and $mr_{sz}$. The signal $mr_0$ appears on $ack_{in}$ only if the receiver has consumed the second half of RFIFO and the sender has received the previous signal $mr_{sz}$. The same holds for the signal $mr_{sz}$.
In this appendix we have considered, although briefly, the use of burst transmission of data between chips for increased throughput. Interleaved (wagging) transmission and acknowledgement have been shown to improve the performance further. The main contrast with Sutherland's micropipeline is that the data do not ripple through the stages before reaching the output, and therefore this new design significantly reduces the latency.
Appendix B
Bit-serial packet router
This appendix describes the delay-insensitive implementation of a packet switch¹. It illustrates the top-down design of delay-insensitive circuits. The switch uses the low-latency mad postman switching technique, and the main design objective is to achieve the lowest possible propagation latency through the switch. The design makes use of a low-latency arbiter designed by Mark Josephs and Jay Yantchev [43], a circuit that makes packet propagation latency independent of arbitration latency.
B.1 Delay-Insensitive Specification of a Mad Postman Switch
In this section we give the DIA specification of a delay-insensitive 2 × 2 mad postman switching process MP with relative addressing.
A pictorial representation of MP is given in figure B.1 below.
MP is connected to its environment by four pairs of data/acknowledgement wires; the data wires encode the signals 0, 1, and e. Two-phase signalling is used with double-rail encoding using wires 0 and 1, and a separate start/end-of-packet wire e. Signal transmission on these wires is assumed to be mutually exclusive. A packet commences with an e signal, followed by an hi-bit (hi > 1) address flit, or header, consisting of 0s and 1s, followed by an arbitrary number of 0s and 1s, and completes with an e signal. This can be formally defined in regular expression syntax as $e(0|1)^{hi}(0|1)^{*}e$.
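The packet format can be checked mechanically with an ordinary regular expression. The following is a minimal sketch, assuming a packet is written as a string over the characters `e`, `0` and `1`; the helper names `packet_pattern` and `is_packet` are hypothetical.

```python
import re


def packet_pattern(hi):
    """Build the pattern e (0|1)^hi (0|1)* e for a header of hi bits."""
    return re.compile(r"e[01]{%d}[01]*e" % hi)


def is_packet(signals, hi):
    """Return True if the whole signal string is one well-formed packet."""
    return packet_pattern(hi).fullmatch(signals) is not None
```

For hi = 2, the string `e01e` is a packet with an empty body, while `e0e` is rejected because its header is too short.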
¹ The work presented in this appendix is a result of a joint project with Oxford University
Figure B.1: MP
The output r of MP, also marked with the bullet sign • in figure B.1, is the direct output and b is the complementary output. Each incoming packet is by default first sent out on r. If the hi bits in the header of the packet are all 0, then the rest of the packet is switched out on b; otherwise it is switched out on r. The header of each packet is decremented by one before it is transmitted on r.
A set of wires $\{a.v_0, \ldots, a.v_n\}$ is said to constitute a channel if signal transitions on the wires are mutually exclusive. In this case the choice $[a.v_0? \rightarrow P_{v_0} \mid \ldots \mid a.v_n? \rightarrow P_{v_n}\ \Box\ S]$ is abbreviated to $[a?x \rightarrow P_x\ \Box\ S]$.
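The specification of the MP process is given by the mutually recursive expressions:
$$
\begin{aligned}
MP &= Arb / oa.r? \\
Arb &= oa.r?;\ [\,\Box_{in \in \{l,u\}}\ in?x \rightarrow \mathbf{if}\ (x = e)\ \mathbf{then}\ ia.in!;\ r!e;\ H_{in,0}\ \mathbf{else}\ \bot\,] \\
0 \le i < hi :: H_{in,i} &= oa.r?;\ in?x;\ ia.in!; \\
&\quad \mathbf{if}\ (x = 0)\ \mathbf{then}\ r!1;\ H_{in,i+1} \\
&\quad \mathbf{elseif}\ (x = 1)\ \mathbf{then}\ r!0;\ P_{in,i+1} \\
&\quad \mathbf{else}\ \bot \\
H_{in,hi} &= oa.r?;\ r!e;\ b!e;\ S_{in,b} \\
0 < i < hi :: P_{in,i} &= oa.r?;\ in?x; \\
&\quad \mathbf{if}\ (x = e)\ \mathbf{then}\ \bot \\
&\quad \mathbf{else}\ ia.in!;\ r!x;\ P_{in,i+1} \\
P_{in,hi} &= S_{in,r} \\
S_{in,r} &= oa.r?;\ in?x;\ ia.in!;\ r!x; \\
&\quad \mathbf{if}\ (x = e)\ \mathbf{then}\ Arb\ \mathbf{else}\ S_{in,r} \\
S_{in,b} &= oa.b?;\ in?x;\ ia.in!;\ b!x; \\
&\quad \mathbf{if}\ (x = e)\ \mathbf{then}\ oa.b?;\ Arb\ \mathbf{else}\ S_{in,b}
\end{aligned}
$$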
The named states of MP can be explained as:
$Arb$: the initial state, in which the process receives the acknowledgement $oa.r?$ for the last sent $r!e$ and then waits for a packet on either of the input channels l or u. A packet should commence with an e signal; otherwise there is an error.
$H_{in,i}$, $0 \le i < hi$: a packet header is being switched out (decremented by 1) on the direct output r, and only 0s have been received so far.
$H_{in,hi}$: a header of hi 0s has been received, the rest of the packet must be switched to the complementary output b, and a dead flit has been released on the direct output r. The process then
1. completes the transmission on the direct output r by a transition on r.e, and
2. initiates transmission on the complementary output b by a transition on b.e.
$P_{in,i}$: a packet header is being switched out on the direct output r and is not all 0s.
$P_{in,hi}$: a non-zero header has been received and the rest of the packet must be transmitted on the direct channel as well.
$S_{in,out}$: the packet is switched out on channel out (either b or r) until an e is received on in.
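The state descriptions above can be traced with a small behavioural simulation of MP for a single packet on one input channel. This is a sketch under stated assumptions: acknowledgements and arbitration between l and u are omitted, signals are modelled as the characters `e`, `0` and `1`, and the function name `mad_postman` is hypothetical. Replacing each leading 0 by 1 and the first 1 by 0 amounts to a bit-serial decrement if the header is read least-significant bit first, which is an interpretation rather than something the text states.

```python
def mad_postman(packet, hi):
    """Return the signal sequences (r_out, b_out) produced for one packet."""
    tokens = iter(packet)
    r_out, b_out = [], []

    # Arb: a packet must open with e, which is forwarded on r.
    if next(tokens) != "e":
        raise ValueError("packet must start with e")
    r_out.append("e")

    all_zero = True
    for _ in range(hi):
        x = next(tokens)
        if x == "e":
            raise ValueError("e inside the header")  # the error case
        if all_zero:                 # states H: only 0s seen so far
            if x == "0":
                r_out.append("1")
            else:
                r_out.append("0")
                all_zero = False     # continue in the P states
        else:                        # states P: header bits pass unchanged
            r_out.append(x)

    if all_zero:                     # zero header: close r, open b
        r_out.append("e")
        b_out.append("e")
        out = b_out
    else:
        out = r_out

    for x in tokens:                 # states S: copy until the closing e
        out.append(x)
        if x == "e":
            return r_out, b_out
    raise ValueError("packet must end with e")
```

For hi = 2, the packet `e1011e` leaves entirely on r as `e0011e`, whereas `e0010e` produces the dead flit `e11e` on r and `e10e` on b.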
B.2 High Level Design
In principle, the top level specification can be compiled using the compiling tech­
niques described in this work. As the specification is of a sequential process with 
no explicit concurrency it will be compiled to a single sequential state machine. 
Although this approach is straightforward, it is also unacceptable as it produces 
a complex and slow state machine.
A more promising approach and the one that is favoured here is to decompose 
MP into a network of simpler concurrent and interacting processes that can be 
implemented as smaller and faster state machines. This section describes in detail 
the top-down design decomposition of MP.
A decomposition of a process is an implementation of that process (in the sense of the $\sqsubseteq$ ordering) as a network of other processes. A delay-insensitive decomposition is a decomposition such that the correctness of the network does not depend on the delays in the component processes and the interconnecting wires.
Careful consideration of the specification of MP reveals three independent actions performed by MP:
• arbitrating between concurrent requests for input on channels l and u;
• routing each packet to either r or b, depending on the comparison of the address flit with zero;
• decrementing the address flit of each packet by one before it is output on r.
Each task can be implemented by a separate process and these can then be con­
nected together in a pipeline, as figure B.2 illustrates.
Figure B.2: High level decomposition of MP
The process MUX arbitrates between competing requests for input on l and u. It also merges, at the packet level, the two input streams on l and u into one stream on the internal channel w. The process R routes each incoming packet by first sending it out on v. If the address flit of the packet is all 0s, then the rest of the packet is switched out on b; otherwise it is transmitted out on v. The process DEC decrements the address flit of each packet as it is output on the direct channel and transmits the rest of the packet unchanged.
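The R and DEC stages of this pipeline can be sketched as two stream transformers. This is a minimal model only: MUX is left out, so a single already merged stream w is assumed; acknowledgements are not modelled; and the details (an extra e sent on v after an all-zero header, and the bit-by-bit decrement rule) follow the specifications of R and DEC given later in this appendix. The function names `route` and `decrement` are hypothetical.

```python
def route(w, hi):
    """Process R (sketch): split the merged stream w into streams v and b."""
    v, b = [], []
    it = iter(w)
    for first in it:                 # opening e of the next packet
        v.append(first)
        saw_one = False              # plays the role of the VAR flag
        for _ in range(hi):          # the header always goes out on v
            x = next(it)
            v.append(x)
            saw_one = saw_one or x == "1"
        if saw_one:
            out = v                  # non-zero header: stay on v
        else:
            v.append("e")            # zero header: close v, open b
            b.append("e")
            out = b
        for x in it:                 # copy the rest up to the closing e
            out.append(x)
            if x == "e":
                break
    return v, b


def decrement(v):
    """Process DEC (sketch): decrement each header arriving on v."""
    r = []
    passing = False                  # True once the first 1 has been seen
    for x in v:
        if x == "e":
            r.append("e")
            passing = False          # back to the initial state
        elif passing:
            r.append(x)              # pass bits on unchanged
        elif x == "0":
            r.append("1")            # still decrementing
        else:
            r.append("0")
            passing = True
    return r
```

Feeding the two packets `e1011e` and `e0010e` through `route` and then `decrement` (hi = 2) gives `e0011e` followed by the dead flit `e11e` on r, and `e10e` on b.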
It is convenient to use two wires, wa.d and wa.e, to acknowledge forward 0s or 1s and forward w.e transitions, respectively. Separating the e-acknowledgements from the data acknowledgements simplifies the implementation of the MUX process, as will become apparent later.
Each of the three processes MUX, R, and DEC is individually simpler than MP itself, while collectively they are a correct implementation of its behaviour:
$$MP \sqsubseteq (MUX \parallel R \parallel DEC)$$
In the next sections we present the circuits corresponding to each of these three processes (MUX, R, DEC), based on the library of delay-insensitive cells.
B.3 Implementation of MUX
The MUX process is defined as:
$$
\begin{aligned}
MUX &= [\, u?e \rightarrow w!e;\ wa.e?;\ ia.u!;\ D_u \\
&\quad \Box\ l?e \rightarrow w!e;\ wa.e?;\ ia.l!;\ D_l\,] \\
D_{in} &= in?x;\ w!x;\ wa?y;\ ia.in!; \\
&\quad \mathbf{if}\ (x = e)\ \mathbf{then}\ MUX\ \mathbf{else}\ D_{in}
\end{aligned}
$$
That is, in its initial state it arbitrates between (possibly) concurrent inputs on the l.e and u.e wires, and then copies out the 0 and 1 signals from the channel that has been selected.
The implementation of MUX as a network of primitive circuits is straightforward if we consider the following restrictions on the behaviour of its environment:
• every packet starts and ends with an e transition
• 0, 1, and e transitions on each of the l and u channels are mutually exclusive
Signals sent on the l and u channels are not mutually exclusive and we need to employ an arbitration circuit to resolve potential conflicts. If two packets arrive concurrently, the circuit must block one of them while granting the other the right of way. As each packet commences and completes with an e transition, we can use the opening/closing l.e and u.e signals as request/release inputs to an Arbiter element. As the 0 and 1 transitions are mutually exclusive, the two 0 input wires can be merged together to form the w.0 signal and, similarly, the two 1 input wires can be merged together to form the w.1 signal. Figure B.3 shows the final circuit for the MUX process.
B.4 Implementation of R
The routing process R is the most complex component of MP as it implements all 
routing functionality. The DIA specification of R is as follows:
$$
\begin{aligned}
R &= H_{0,F} / oa.r? \\
H_{i,b} &= oa.r?;\ w?x;\ v!x; \\
&\quad \mathbf{if}\ (x = 1)\ \mathbf{then}\ wa.d!;\ H_{i+1,T} \\
&\quad \mathbf{elsif}\ (x = eop)\ \mathbf{then}\ wa.e!;\ H_{i+1,b} \\
&\quad \mathbf{else}\ wa.d!;\ H_{i+1,b} \\
&\quad \text{where } b \in \{F, T\},\ i \in 0..hi \\
H_{hi+1,T} &= CP_1 \\
H_{hi+1,F} &= oa.r?;\ v!eop;\ b!eop;\ CP_2 \\
CP_1 &= oa.r?;\ w?x;\ v!x; \\
&\quad \mathbf{if}\ (x = eop)\ \mathbf{then}\ wa.e!;\ H_{0,F} \\
&\quad \mathbf{else}\ wa.d!;\ CP_1 \\
CP_2 &= oa.b?;\ w?x;\ b!x; \\
&\quad \mathbf{if}\ (x = eop)\ \mathbf{then}\ wa.e!;\ oa.b?;\ H_{0,F} \\
&\quad \mathbf{else}\ wa.d!;\ CP_2
\end{aligned}
$$
Figure B.3: Compilation circuit for MUX
Because of the complexity of the process R, and for clarity of presentation, we will compile the $H_{0,F}$, $CP_1$, and $CP_2$ expressions separately and assemble the complete circuit from the compiled parts. The synthesis procedure used was described in chapter 3, the only extension being simple double-rail data operations.
Figure B.4: Compilation circuit for $H_{0,F}$
B.4.1 Implementation of $H_{0,F}$
Figure B.4 shows the compiled circuit for the process $H_{0,F}$. The first subscript i of the process H is used as a loop index variable; the number of iterations is equal to the length hi of the packet header plus 1 for the starting e. The loop is implemented using a modulo hi + 1 counter.
The VAR (DI variable element) circuit records the occurrence of a 1 in the packet header. Control is passed either to the process $CP_1$ or to $CP_2$, depending on whether a 1 was encountered on the channel w. The control signal START that activates the circuit also initialises VAR to the value F. The occurrence of a 1 in the header assigns the value T to VAR.
B.4.2 Implementation of $CP_1$ and $CP_2$
The compilation of these two processes is straightforward. The state machine is based only on DW elements. Figure B.5 and figure B.6 illustrate the compiled circuits for the processes $CP_1$ and $CP_2$ respectively.
B.4.3 Assembling the circuit for the process R
We can now assemble the process R from the separately compiled parts. In each circuit there is just one DW element for each input from the environment. When the circuits are assembled into one, all DW elements for the same input are combined into one 3 × 3 DW element (for the channel w) and one 3 × 1 DW element (for the channel oa.r), and the control signals corresponding to the different states of R are connected to their columns.
Figure B.7 shows the circuit for the process R.
Figure B.5: Compilation circuit for $CP_1$
Figure B.6: Compilation circuit for $CP_2$
Figure B.7: Compilation circuit for R
B.5 Implementation of DEC
The process DEC is intended to decrement the header of each packet by one and leave the packet body unchanged. One would therefore expect DEC to require some form of counter to count the header bits. Careful consideration of the behaviour of the environment of DEC, and of the process R in particular, reveals that a counter is not required. This simplifies the implementation of DEC.
$$
\begin{aligned}
DEC &= v?x; \\
&\quad \mathbf{if}\ (x = 0)\ \mathbf{then}\ r!1;\ DEC \\
&\quad \mathbf{elseif}\ (x = 1)\ \mathbf{then}\ r!0;\ P \\
&\quad \mathbf{else}\ r!e;\ DEC \\
P &= v?x; \\
&\quad \mathbf{if}\ (x = e)\ \mathbf{then}\ r!e;\ DEC \\
&\quad \mathbf{else}\ r!x;\ P
\end{aligned}
$$
The process can be in one of two possible states:
• still decrementing, and the next input is an e signal which will take it back to the initial state, or
• passing the bits on to r unchanged (not decrementing), which it will continue to do until another e is received.
Although the DEC process as specified above has the potential to corrupt the body of a non-empty packet with a zero address header by decrementing the body as well, the specification of R excludes this possibility, as all such packets are switched out on the complementary channel b.
The implementation of DEC derived from the above specification is shown in figure B.8.
Figure B.8: Implementation of DEC
B.6 Conclusions
This appendix describes the delay-insensitive implementation of a packet switch. It also shows how the final circuit is compiled from a high-level behavioural specification via refinement. The compilation techniques are based upon a library of basic delay-insensitive cells. The switch uses the low-latency mad postman switching technique and, as already mentioned, the main design objective is to achieve the lowest possible propagation latency through the switch. To evaluate the propagation latency of the implementation we need only consider the forward data path between the inputs and the direct output. Figure B.9 illustrates the forward propagation path extracted from figure B.3, figure B.7 and figure B.8.
Figure B.9: The forward data path (MUX: MERGE → R: DW3x3 → R: MERGE → DEC: DW2x3 → DEC: MERGE)
The delay on the e signal path is similar and comprises
$$t_{pmp} = t_{arb} + 2 \times t_{dw} + 2 \times t_{merge}$$
where $t_{arb} = t_{and} + t_{or}$ due to the low-latency arbitration scheme. Otherwise, if a conventional arbiter is used,
$$t_{arb} = t_{mutex} + t_{and} + t_{or}$$
where $t_{mutex}$ might be nondeterministically long.
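The effect of the arbitration scheme on the e-path delay can be made concrete with a short calculation. The sketch below simply evaluates the two expressions for the arbitration delay; the function name `e_path_latency` and the gate delays used in the example are hypothetical, not measured values.

```python
def e_path_latency(t_dw, t_merge, t_and, t_or, t_mutex=None):
    """Delay on the e signal path through the switch.

    With t_mutex=None the low-latency arbiter is assumed (t_arb = t_and + t_or);
    otherwise a conventional arbiter adds the mutual exclusion delay t_mutex.
    """
    t_arb = t_and + t_or
    if t_mutex is not None:
        t_arb += t_mutex
    return t_arb + 2 * t_dw + 2 * t_merge


# Hypothetical unit delays: DW = 3, merge = 2, AND = 1, OR = 1, mutex = 5.
low_latency = e_path_latency(3, 2, 1, 1)
conventional = e_path_latency(3, 2, 1, 1, t_mutex=5)
```

With these assumed values the low-latency scheme gives 12 units and the conventional arbiter 17. In practice the mutual exclusion term has no fixed upper bound, so the second figure is only one possible outcome.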
The simplicity of the final design is due to the virtues of mad postman switching as a simple but efficient routing strategy. This routing technique can achieve a node through-delay that is lower than the channel rate, which ideally suits an asynchronous implementation.
Appendix C
Asynchronous library cells
In this appendix we will display some implementations of various elements used 
throughout this work.
Muller C We will illustrate several different implementations of the Muller C element. All of them are useful for different purposes.
Figures C.1 ([32]) and C.2 ([45]) have similar performance. Although the layout of the second is nearly twice as large, it is interesting to note that the output c switches the whole network of transistors from an AND to an OR element, thus implementing the Muller C element. When both inputs are low, the actual function implemented is AND. When both inputs go high, this AND switches its output to high too and turns the whole circuit into an OR gate. In order to lower its output, both inputs should go low, and then the function becomes an AND gate again.
The third implementation, shown in figure C.3, is good for semicustom design as it is implemented with only standard NAND gates.
Toggle One possible implementation of the toggle element is shown in figure C.4.
Figure C.4: Toggle element, implementation
Counter The implementation of a counter modulo N can be based on the implementation of
• a counter modulo N-1 if N is even.
• a counter modulo INT(N/2) if N is odd.
using the following circuits:
SEQ FOREVER
  SEQ <N-1>
    in ?
    out1 !
  in ?
  out2 !
SEQ FOREVER
  SEQ 2
    SEQ <N/2>
      in ?
      out1 !
  in ?
  out2 !
Figure C.5: Counter modulo N
The implementations above are based on the Toggle element. A faster counter can be implemented if asynchronous D flip-flops are used. There are two separate circuits, one for each output. The schematic for both of them is the same (figure C.5); only the initial values of the D flip-flops differ. For example, if N is 5, the initial state of the bullet-marked flip-flops is (from left to right) 01010 and there is no inverter. The second circuit's initial state is 00000 and there is an inverter in the feedback. A good feature of such an implementation is that it has a constant response time.
Latch The traditional latch is shown in figure C.6. Recently DEC published [15] some implementations of elements used throughout the design of the Alpha chip. Figure C.7 illustrates their implementation of the latch element.
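The behaviour of two of the cells described above, the Muller C element and the modulo-N counter, can be summarised in a few lines of code. This is a behavioural sketch only: it models logic levels and event counts, not transistor networks or timing, and the names `MullerC` and `counter_mod_n` are hypothetical. The counter follows the reading that the first N-1 input events are acknowledged on out1 and the N-th on out2.

```python
class MullerC:
    """Muller C element: the output follows the inputs only when they agree."""

    def __init__(self, initial=0):
        self.c = initial

    def step(self, a, b):
        if a == b:        # both inputs high or both low: output follows them
            self.c = a
        return self.c     # otherwise the previous output is held


def counter_mod_n(n, events):
    """Return the output used to acknowledge each of the given number of input events."""
    return ["out2" if k % n == n - 1 else "out1" for k in range(events)]
```

Starting from a low output, the inputs (1, 0) leave the C element low, (1, 1) raise it, (0, 1) hold it high and (0, 0) lower it again.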
Appendix D
The laws of DIA
The DIA laws [41] used in some of the formal proofs in this work are listed in this appendix.
A process that waits for input on a and then for input on b before being able to do anything is actually waiting for both inputs; therefore their order is immaterial:
Law 1: $a?;\ b?;\ P = b?;\ a?;\ P$
Law 19: $[S]/a? = [a? \rightarrow \bot\ \Box\ S']$, where $S'$ is formed by substituting for each alternative $A \in S$ the new alternative $A/a?$, defined by
• $(skip \rightarrow P)/a? = skip \rightarrow (P/a?)$
• $(a? \rightarrow P)/a? = skip \rightarrow P$
• $(b? \rightarrow P)/a? = b? \rightarrow (P/a?)$, $b \neq a$
When an output-prefixed process c!; P is composed with another process Q, the output is transmitted along c. Depending on whether or not c is in the input alphabet of Q, the signal on c is sent to Q or to the environment:
Law 29: $(c!;P) \parallel Q = P \parallel (Q/c?)$ if c is in the input alphabet of Q, or $(c!;P) \parallel Q = c!;(P \parallel Q)$ otherwise.
Law 30: $[S_0] \parallel [S_1] = [S]$, where S is formed from the alternatives in $S_0$ and $S_1$ in the following way. For each alternative in $S_0$ of the form $skip \rightarrow P$, we have $skip \rightarrow (P \parallel [S_1])$ in S. For each alternative in $S_0$ of the form $a? \rightarrow P$ with a not in the output alphabet of $[S_1]$, we have $a? \rightarrow (P \parallel [S_1])$ in S. The alternatives in $S_1$ contribute to the alternatives of S in a similar way.
Bibliography
[1] Alain Martin, "Compiling communicating processes into delay-insensitive VLSI circuits", Distributed Computing, 1(4), 1986
[2] Alain Martin, Chapter 6, "Formal Methods for VLSI Design", Elsevier Science Publishers, IFIP 1990
[3] Alain Martin, "The design of a Self-Timed Circuit for Distributed Mutual Exclusion", In Henry Fuchs, editor, Chapel Hill Conference on VLSI, 1985
[4] Alain Martin, "Programming in VLSI: from Communicating Processes to Delay-Insensitive Circuits", in C.A.R. Hoare, editor, UT Year of Programming: Institute of Concurrent Programming, Addison Wesley, 1989
[5] Alain Martin, "On Seitz's arbiter", Technical Report 5212:TR:86, Computer Science Department, CALTECH, 1986
[6] Alex Yakovlev, Luciano Lavagno, Alberto Sangiovanni-Vincentelli, "A unified signal transition graph model for asynchronous control circuit synthesis", Technical Report, Electronic Research Laboratory, College of Engineering, University of California
[7] Cees Niessen, K.H. van Berkel, Martin Rem, Ronald W.J.J. Saeijs, "VLSI Programming and Silicon Compilation; a Novel Approach from Philips Research", Proc. of the 1988 IEEE Int. Conf. on Computer Design, VLSI in Computers and Processors
[8] Charles Roth, Jr., "Fundamentals of Logic Design", fourth edition, West Publishing Co., 1995
[9] Dally WJ and Song P, "Design of a Self-Timed VLSI Multicomputer Communications Controller", Proc. International Conference on Computer Design, Oct 1987
[10] Dally WJ and Seitz CL, "The Torus Routing Chip", Journal of Distributed Systems, Vol. 1, No. 3, 1986
[11] Dally WJ, Seitz CL, "Deadlock-free message routing in multiprocessor interconnection networks", IEEE Trans Computers, vol C-36, No 5, May 1987
[12] Dally WJ, "A VLSI Architecture for Concurrent Data Structures", Kluwer Academic Press, 1987
[13] Dally WJ, "Performance Analysis of k-ary n-cube Interconnection Networks", IEEE Trans. Comp., 1990, 39, 6
[14] Dally WJ, "Express Cubes: Improving the Performance of k-ary n-cube Interconnection Networks", IEEE Trans. Comp. 1990
[15] Daniel W. Dobberpuhl et al, "A 200-MHz 64-bit Dual-issue CMOS Microprocessor", Digital Technical Journal, vol 4, No. 4, 1992
[16] Ellen M. Sentovich, Kanwar Jit Singh, Luciano Lavagno, Cho Moon, Rajeev Murgai, Alexander Saldanha, Hamid Savoj, Paul R. Stephan, Robert K. Brayton, Alberto Sangiovanni-Vincentelli, "SIS: A System for Sequential Circuit Synthesis", Electronics Research Laboratory, Dept of Elec. Eng. and Comp. Science, Univ. of California, Berkeley, 1992, Memorandum No. UCB/ERL M92/41
[17] Erik Brunvand, "Parts Are US", technical report, Comp Science Dept, University of Utah, 1992
[18] Erik Brunvand, Robert Sproull, "Translating Concurrent Programs into Delay-Insensitive Circuits", Proc of IEEE on CAD 1989, vol 1
[19] Erik Brunvand, "Using FPGAs to Implement Self-Timed Systems", Technical Report, Computer Science Department, University of Utah, January 1992
[20] Fisher AL and Kung HT, "Synchronizing Large VLSI Processor Array", IEEE Transactions on Computers, Vol C-34, No. 8, August 1985
[21] Furber SB, "Micropipelines - A Case Study", ACiD-WG/EXACT Workshop on Asynchronous Data Processing, Veldhoven, The Netherlands, 1992
[22] Furber SB, "AMULET1 - An Asynchronous ARM Processor", Symposium Record of Hot Chips V, Stanford University, USA, 1993
[23] Furber SB, Day P, Garside JD, Paver NC, Woods JV, "AMULET1: A Micropipelined ARM", IEEE CompCon '94, San Francisco, 1994
[24] Geoffrey Brown, "Towards Truly Delay-Insensitive Circuit Realizations of Process Algebras", Proceedings of Workshop on Designing Correct Circuits, Springer-Verlag, 1990
[25] Geoffrey Brown, "Translating OCCAM to handshaking circuits", CSRG Seminar, University of Surrey, 1994
[26] Henrik Hulgaard, Per H. Christensen and Jorgen Staunstrup, "Synthesising Delay-Insensitive Circuits from Verified Programs", technical report, Dept of Comp Science, University of Denmark
[27] H.F. Li, S.C. Leung, P.N. Lam, "Synthesis of Delay-Insensitive Circuits by Refinement into Atomic Threads", Proc of IEEE on CAD 1991, vol 3
[28] Hoare C.A.R., "Communicating Sequential Processes", CACM, 1978, 21, 8
[29] Ilana David, Ran Ginosar, Michael Yoeli, "An Efficient Implementation of Boolean Functions and Finite State Machines as Self-Timed Circuits", IEEE 1990
[30] IMEC, "Optimised synthesis of asynchronous control circuits", Proc of International Conference on CAD, IEEE, Santa Clara 1990
[31] IMEC tutorial session on IMEC's interface compiler, 1st ACiD-WG/EXACT Workshop, Leuven, Belgium, 1992
[32] Ivan Sutherland, "Micropipelines", Comm. of ACM, volume 6, 1989
[33] Jay Yantchev, Chris Jesshope, "Adaptive, low latency, deadlock-free packet routing for networks of processors", IEE Proc, Part-E, vol 136, No. 3, May 1989
[34] J.L.W. Kessels and F.D. Schalij, "VLSI Programming for the Compact Disc Player", Science of Computer Programming, 1990
[35] Jo Ebergen, "A formal approach to designing delay-insensitive circuits", Distributed Computing 1991, volume 5
[36] J.C. Ebergen, "Translating circuits into delay-insensitive circuits", CWI Tract, Vol. 56, Centre for Mathematics and Computer Science, Amsterdam, 1989
[37] Josephs, M., "Receptive Process theory", Eindhoven Uni. Tech. Rep. 90/8, 1990
[38] Josephs MB, Mak RH, Verhoeff T, "Asynchronous Design of a Router", Proceedings of the IEEE/ProRISC Symposium on Circuits, Systems and Signal Processing, Stichting voor de Technische Wetenschappen, Utrecht, Netherlands 1990
[39] Josephs MB, Udding JT, "Delay-Insensitive Circuits: An Algebraic Approach to their Design", Lect. Notes in Comp. Sci., Vol. 458, Springer-Verlag, 1990
[40] Josephs MB, Udding JT, "The Design of a Delay-Insensitive Stack", Jones, G., Sheeran, M., editors, "Designing Correct Circuits", Springer-Verlag, 1990
[41] Josephs MB, Udding JT, "An Algebra for Delay-Insensitive Circuits", Proc. DIMACS/IFIP Workshop on Computer-Aided Verification, Rutgers University, New Jersey 1990, DIMACS Series in Discrete Mathematics and Theoretical Computer Science, Vol. 3, AMS-ACM 1991
[42] Josephs MB, "Algebraic Verification of Speed-Independent Circuits", Workshop on Designing Correct Circuits, Lyngby, Denmark, 1992
[43] Josephs MB, Yantchev JT, "Low latency arbiters", Tech Report, South Bank Univ, London, 1994
[44] Kees van Berkel, "Handshake circuits: an intermediary between communication processes and VLSI", PhD thesis, University of Eindhoven
[45] K.H. van Berkel, Ronald W.J.J. Saeijs, "Compilation of Communicating Processes into Delay-Insensitive Circuits", Proc. of International Conference on Computer Design, 1988
[46] K.H. van Berkel, Martin Rem, Ronald W.J.J. Saeijs, "VLSI Programming", Proc. of International Conference on Computer Design, 1988
[47] McKeeman WM, "Peephole Optimisation", CACM vol. 8, July 1963
[48] Mead C., Conway L., Chapter 7, "Introduction to VLSI Systems", Addison Wesley, 1980
[49] Miller, R.E., "Sequential Circuits", Chapter 10, "Switching Theory", Vol 2, Wiley, NY 1965
[50] Miller PR, Yantchev JT, "Developing Powerful Communication Mechanisms for Distributed Memory Computers from Simple and Efficient Message Routing", Proc. 5th Distributed Memory Computing Conference, Charleston, S. Carolina, USA, 1990
[51] Miller PR, Jesshope CR, Yantchev JT, "The Mad Postman Network Chip", Proc. Transputing '91, Sunnyvale, CA, USA, 1991
[52] Miller PR, Yantchev JT, Jesshope CR, "High Performance Packet Routing Based on Systolic Arrays", Proc 3rd International Conference on Systolic Arrays, Killarney, Ireland, 1989
[53] Neil Weste, Kamran Eshraghian, "Principles of CMOS VLSI Design", Addison Wesley Publ. Co, 1985
[54] Noakes M and Dally WJ, "System Design of the J-Machine", 6th MIT Conference on Advanced Research in VLSI, MIT Press 1990
[55] Peter Barrie, Paul Cockshott, George J Milne and Paul Shaw, "Design and verification of a highly concurrent machine", Microprocessors and Microsystems, Vol 16 No 3 1992
[56] Ronald W.J.J. Saeijs, C.H. van Berkel, "The Design of VLSI Image-Generator ZaP", Proc. of International Conference on Computer Design, 1988
[57] Seitz CL, "Ideas about arbiters", Lambda 1:10-14, 1980
[58] Spivey JM, "The Z Notation", Reference manual, Second edition, Prentice Hall International
[59] Steven Nowick, David Dill, "Synthesis of Asynchronous State Machines Using a Local Clock", IEEE 1991
[60] Tam-Anh Chu, "Synthesis of Self-timed Control Circuits From Graphs", IEEE Trans. on Comp, vol 1 1986
[61] Teresa Meng, Robert Brodersen, David Messerschmitt, "Automatic Synthesis of Asynchronous Circuit from High-Level Specifications", Proc of IEEE on CAD, vol 8, 1989
[62] Tom Verhoeff, "Delay-Insensitive codes - an overview", Distributed Computing, 3, 1988
[63] Udding JT, "Classification and Composition of Delay-Insensitive Circuits", Doctoral thesis, Dept. of Mathematics and Computing Science, Eindhoven University of Technology, Eindhoven, The Netherlands, 1984
[64] Udding JT, "A Formal Model for Defining and Classifying Delay-Insensitive Circuits and Systems", Distributed Computing 1(4), 1986
[65] Verilog reference manual, "VERILOG-XL", Cadence, 1991


