On the Distribution of Control in Asynchronous Processor Architectures by Rebello, Vinod
On the Distribution of Control in
Asynchronous Processor Architectures
Vinod Eugene Francis Rebello
Doctor of Philosophy







The effective performance of computer systems is to a large measure determined
by the synergy between the processor architecture, the instruction set
and the compiler. In the past, the sequencing of information within processor
architectures has normally been synchronous: controlled centrally by a clock.
However, this global signal could limit the future gains in performance
that can potentially be achieved through improvements in implementation
technology.
This thesis investigates the effects of relaxing this strict synchrony by dis-
tributing control within processor architectures through the use of a novel asyn-
chronous design model known as a micronet. The impact of asynchronous
control on the performance of a RISC-style processor is explored at different
levels. Firstly, improvements in the performance of individual instructions by
exploiting actual run-time behaviours are demonstrated. Secondly, it is shown
that micronets are able to exploit further (both spatial and temporal) instruction-
level parallelism (ILP) efficiently through the distribution of control to datapath
resources. Finally, exposing fine-grain concurrency within a datapath can only
be of benefit to a computer system if it can easily be exploited by the compiler.
Although compilers for micronet-based asynchronous processors may be con-
sidered to be more complex than their synchronous counterparts, it is shown
that the variable execution time of an instruction does not adversely affect the
compiler’s ability to schedule code efficiently. In conclusion, the modelling
of a processor’s datapath as a micronet permits the exploitation of both fine-
grain ILP and actual run-time delays, thus leading to the efficient utilisation




I am indebted to my supervisor, D. K. Arvind, for his continuous support,
encouragement and advice throughout my research.
Thanks to the MAP Group for our fruitful discussions; to the Edinburgh
Parallel Computing Centre (EPCC) for access to the MEiKO Computing Surface
and their technical support; and to the Department of Computer Science for
providing all the “essentials” for this work.
Most of all, a big special thank you to my parents and all my friends who
shared in my trials.
Finally, this work was funded by a research studentship from the UK Science
and Engineering Research Council.
Muito obrigado para todos!
Declaration
This thesis was composed by myself and the work reported herein is my
own except where indicated. Some of the material in this thesis has already
been published in:

D. K. Arvind and V. E. F. Rebello. Instruction-level parallelism in asynchronous
processor architectures. In M. Moonen and F. Catthoor, editors, The Proceedings of
the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, pages
203–215, Leuven, Belgium, August 1994. Elsevier Science Publishers.

D. K. Arvind and V. E. F. Rebello. On the performance evaluation of asynchronous
processor architectures. In P. Dowd and E. Gelenbe, editors, The Proceedings of
the 3rd International Workshop on Modeling, Analysis and Simulation of Computer and
Telecommunication Systems (MASCOTS’95), pages 100–105, Durham, NC, USA,
January 1995. IEEE Computer Society Press.

D. K. Arvind, R. D. Mullins and V. E. F. Rebello. Micronets: A model for
decentralising control in asynchronous processor architectures. In M. B. Josephs, editor,
The Proceedings of the 2nd Working Conference on Asynchronous Design Methodologies,
pages 190–199, London, UK, May 1995. IEEE Computer Society Press.

D. K. Arvind and V. E. F. Rebello. Static scheduling of instructions on
micronet-based asynchronous processors. In The Proceedings of the 2nd International
Symposium on Advanced Research on Asynchronous Circuits and Systems (ASYNC’96),
pages 80–91, Aizu Wakamatsu City, Japan, March 1996. IEEE Computer Society
Press.
Vinod E. F. Rebello
Table of Contents
1. Introduction
1.1 In this Thesis
1.2 Thesis Outline
2. Towards an Asynchronous Control Paradigm
2.1 Introduction
2.2 System Design
2.3 Implementation Technology and a Synchronous Control Paradigm
2.3.1 Clock Skew
2.3.2 Other Limits on the Clock Frequency
2.3.3 Power Consumption
2.3.4 Shrinking Geometries
2.3.5 Design Difficulties
2.4 Asynchronous Design – A Solution?
2.4.1 Disadvantages of Asynchronous Design
2.4.2 Equipotential Regions (revisited)
2.4.3 Handshake Protocols
2.4.4 Data Transmission
2.4.5 Ease of Design
2.5 Exploiting Performance
2.5.1 Synchronous versus Asynchronous Control
2.6 Pipelines
2.6.1 The Conversion of Synchronous Pipelines to Equivalent Asynchronous Ones
2.6.2 Micropipelines
2.7 Related Work
2.8 This Thesis
2.8.1 Towards Asynchronous Datapaths
2.9 Micronets
2.9.1 Micronets, Microagents and their Micro-operations
2.9.2 Micronet-based Datapaths
2.10 Summary
3. A Parallel Event-Driven Simulator
3.1 Introduction
3.2 Parallel Discrete Event-driven Simulation
3.3 An Overview of PEPSÉ
3.3.1 The Simulation Platform
3.3.2 The Basic Simulation Platform Algorithm
3.3.3 The Class Models
3.4 Development Notes
3.4.1 Occam Buffers
3.4.2 Guarded Outputs
3.4.3 Modelling Signals
3.5 Component Delays
3.6 Conclusions
4. The Control Paradigm and the Instruction Set
4.1 Introduction
4.2 Comparing Synchronous and Asynchronous Processor Control
4.2.1 The Two Processor Models
4.2.2 The Instruction Set
4.2.3 The Architectural Components
4.3 The Synchronous Processor
4.3.1 Synchronous Control
4.4 Asynchronous Control and MAP
4.4.1 The Distribution of Control
4.4.2 The Rôle of the Control Unit
4.4.3 Data Transfer
4.5 The Performance Results
4.6 Discussion
4.7 Summary
5. The Control Paradigm and the Architecture
5.1 Introduction
5.2 Exploiting Instruction-level Parallelism
5.3 Design Goals
5.4 An Asynchronous ILP Processor
5.5 A Micronet Architecture
5.5.1 Modifications to the Fetch Stage
5.6 The Control Refinements
5.7 Measuring Improvements in Performance
5.7.1 The Test Programs
5.8 Refinement Step 1 – The Base Case
5.9 Refinement Step 2 – Exploiting Multiple Write-back Buses
5.10 Refinement Step 3 – Using a Single Write-back Bus
5.11 Refinement Step 4 – Asynchronous Micro-operation Issue
5.12 Refinement Step 5 – Out-of-Order Write-Backs
5.13 Refinement Step 6 – Faster Instruction Issue
5.14 Refinement Step 7 – Data Forwarding
5.15 Refinement Step 8 – The Last Control Modification
5.16 Conclusions
5.17 Refinement Step 9 – Transistor Resizing
5.18 Discussion
5.18.1 Minimising the Self-Timed Overheads
5.18.2 Implications for the Compiler
5.19 Summary
6. The Control Paradigm and the Compiler
6.1 Introduction
6.2 Compilers
6.3 Scheduling Challenges in MAP Architectures
6.3.1 MAP Behaviour
6.3.2 A Parameterised Computational Model
6.4 The Scheduling Problem
6.4.1 Similar Scheduling Problems
6.5 A Scheduling Methodology for MAP
6.5.1 The Scheduler
6.6 Results
6.6.1 Post-pass Optimisation for Instruction Interference
6.6.2 Are These Schedules Really Optimal?
6.7 Open Problems
6.7.1 Instruction Execution Costs
6.7.2 Interaction Between Executing Instructions
6.8 Conclusions
7. Conclusions and Future Work
7.1 A Summary
7.2 Effects on System Design
7.3 On-Going and Future Work
7.3.1 Easing System Design
7.3.2 Extending the Micronet Architecture
7.3.3 Parallelising Compilers for a Superscalar MAP
7.4 Discussion
7.5 Conclusions
A. Glossary
B. The PEPSÉ Simulator
B.1 The Simulation Algorithm in OCCAM2
C. The MAP Test Programs
D. Published Papers
D.1 Instruction-level Parallelism in Asynchronous Processor Architectures
D.2 On the Performance Evaluation of Asynchronous Processor Architectures
D.3 A Model for Decentralising Control in Asynchronous Processor Architectures
D.4 Static Scheduling of Instructions on Micronet-based Asynchronous Processors
Bibliography
List of Figures
2–1 Two- and four-phase signalling
2–2 Encoded data transmission
2–3 Bundled data transfer
2–4 From a synchronous to an asynchronous pipeline
2–5 A basic micropipeline FIFO
2–6 Synchronous and asynchronous pipelines
2–7 Contrasting a micropipeline with a micronet
3–1 Overview of the simulator
3–2 The simulation platform
3–3 A microprocessor model
4–1 The processor pipeline
4–2 The synchronous and self-timed processor models
4–3 Synchronous instruction cycles
5–1 A typical micronet-based processor architecture model
5–2 Issuing an LDA instruction in Refinement Step 1
5–3 Issuing an LDA instruction in Refinement Step 2
5–4 Issuing an LDA instruction in Refinement Step 4
5–5 Issuing an LDA instruction in Refinement Step 6
5–6 Issuing an LDA instruction in Refinement Step 8
5–7 The FM utilisations
5–8 The test program execution times
5–9 Resource activity
5–10 Overlapping micro-operation handshake cycles
5–11 The micronet model for Refinement Step 1
5–12 The micronet model for Refinement Step 2
5–13 The micronet model for Refinement Step 3
5–14 The micronet model for Refinement Step 4
5–15 The micronet model for Refinement Step 5
5–16 The micronet model for Refinement Step 6
5–17 The micronet model for Refinement Step 7
5–18 The micronet model for Refinement Step 8
6–1 The makespans of schedules based on worst- and average-case run-time costs
7–1 Influences within processor system architectures
7–2 Previously implicit influences within system architectures
List of Tables
4–1 The instruction set
4–2 Synchronous versus asynchronous performances
5–1 The micro-operations required for instruction execution
5–2 Instruction execution for Refinement Step 1
5–3 Execution of the test programs on Refinement Step 1
5–4 Instruction execution for Refinement Step 2
5–5 Execution of the test programs on Refinement Step 2
5–6 Instruction execution on Refinement Step 3
5–7 Execution of the test programs on Refinement Step 3
5–8 Instruction execution on Refinement Step 4
5–9 Execution of the test programs on Refinement Step 4
5–10 Instruction execution for Refinement Step 5
5–11 Execution of the test programs on Refinement Step 5
5–12 Instruction execution on Refinement Step 6
5–13 Execution of the test programs on Refinement Step 6
5–14 Instruction execution on Refinement Step 7
5–15 Execution of the test programs on Refinement Step 7
5–16 Instruction execution for Refinement Step 8
5–17 Execution of the test programs on Refinement Step 8
5–18 Instruction execution for Refinement Step 9
5–19 Execution of the test programs on Refinement Step 9
6–1 Measuring the optimality of the scheduling heuristics
6–2 The effects of Post-pass optimisations on Instruction Lookahead schedules
6–3 The effects of Post-pass optimisation on MAP instruction schedules
Chapter 1
Introduction
“In analysing the functions of the contemplated device, ..... the logical
control of the device, that is the proper sequencing of its operations, can be
most efficiently carried out by a central organ.”
John von Neumann, First Draft of a report on the EDVAC (1945)
It has long been realised that the implementation technology has influenced
developments in processor architectures. As a case in point, the advent of VLSI
technology in the early 1980s (together with mature optimising compilers) led to
the reassessment of complex instruction sets, and resulted in the development
of RISC architectures [71] [86]. The designers of these processors also paid
close attention to the interactions between the compiler, the instruction set, and
the processor architecture. Reducing the number and formats of instructions
made the architecture considerably simpler compared to existing designs, with
streamlined datapaths effectively shifting complexity from the hardware to the
compiler.
Improvements in transistor speed have brought improvement in system
performance [121], and it has been assumed that such progress would continue
virtually unhindered. However, designers have now been forced to consider
a domain previously taken for granted – the influence of the control paradigm
on the rest of the system. From around 1945, conventional wisdom has advoc-
ated the use of a centralised clock to sequence information correctly within a
processor architecture. Unfortunately, the ability to sustain this design style
as systems become larger, faster and more complex, is under pressure from a
number of directions, related to the global clock as well as the speed and scale
of the new systems [115,140,147,150,175].
Given the developments in technology, and contrary to the view of John von
Neumann quoted above, centralised control can lead to inefficient behaviour.
Events in synchronous processors are recognised at regular, pre-determined
intervals. In typical
designs, there are idle periods between events and the next clock tick. Of course,
this wastage could be reduced by increasing the clock frequency, but the benefit
of such a policy is diminished by problems of increased control complexity,
clock skew and noise. Furthermore, the clock’s very presence is likely to
limit future gains in performance which may potentially be achieved by im-
provements in VLSI technology. The maximum speed of this clock signal is a
conservative estimate for reliable operations, which considers worst-case delays
in the critical path. In practice, even this estimate may not be met due to vari-
ations in fabrication and environmental parameters. The propagation delays
along clock distribution lines may become a significant proportion of the clock
period, and mitigating their effect at higher frequencies would incur significant
design costs [42]. These inefficiencies are further exacerbated by the scaling of
transistor sizes [115] [140] [147]. Another issue is the difficulty in separating the
logical and temporal aspects of synchronous circuits. Accurately estimating the
timings of synchronous processors and abstracting them from the logical design
is difficult. This has been one of the limiting factors in the automatic synthesis of
synchronous processors. All of these drawbacks have led to a renewed interest
in an alternative control strategy which relaxes the strict synchrony imposed by
the centralised clock by removing it altogether.
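The idle period between an event and the next clock tick can be illustrated with a small sketch. The function name and the delay figures are hypothetical, chosen only for illustration, and times are held as integer picoseconds to avoid rounding artefacts.

```python
def idle_time(event_delay_ps: int, clock_period_ps: int) -> int:
    """Time between an operation finishing and the clock tick that recognises it."""
    if event_delay_ps <= 0 or clock_period_ps <= 0:
        raise ValueError("delays must be positive")
    # Whole clock periods needed to cover the operation (ceiling division).
    ticks = -(-event_delay_ps // clock_period_ps)
    return ticks * clock_period_ps - event_delay_ps


# Hypothetical figures: a 10 ns clock and operations with different actual delays.
for delay in (4_000, 10_000, 12_500):
    print(delay, idle_time(delay, 10_000))
```

Shortening the clock period reduces this idle time, which is the trade-off against control complexity, skew and noise described above.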
Asynchronous design is not new; in fact, early computers did incorporate
asynchronous methods, which were later abandoned in favour of the easier
synchronous style. Lately, a restricted form of asynchrony known as self-timing is
being considered which avoids timing-related problems by enforcing a simple
communication protocol [150]. This protocol acts like a local clock which syn-
chronises components within a circuit, but neither relies on specific time inter-
vals nor extends homogeneously to the entire circuit as a synchronous clock
does. The correct operation of self-timed systems is independent of delays,
enabling systems to cope with changes due to data dependencies or environ-
mental variations. This robustness is achieved at the price of local handshaking
protocols. Therefore, in order to exploit the performance benefits of asynchrony
over synchronous control, the average delay of the components together with
overheads of self-timed control should be less than the sum of the worst-case
delay and overheads of synchronous control. However, it was not the poten-
tial performance advantage of self-timed circuits which first attracted processor
designers.
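The performance condition stated in this paragraph can be written compactly. The symbols are introduced here purely for illustration and are not notation from the thesis: $\bar{d}$ is the average component delay, $d_{\max}$ the worst-case component delay, and $o_{\mathrm{st}}$ and $o_{\mathrm{sync}}$ the overheads of self-timed and synchronous control respectively.

```latex
\bar{d} + o_{\mathrm{st}} \;<\; d_{\max} + o_{\mathrm{sync}}
```

Read this way, the handshaking overhead $o_{\mathrm{st}}$ must be smaller than the margin $d_{\max} - \bar{d}$ plus whatever overhead the clocked design already carries.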
Self-timed circuits offer a number of other advantages over their synchron-
ous equivalents (as discussed in [66] [106]) and, for example, have proved
attractive for low power circuit design and automated synthesis. Asynchron-
ous microprocessor designs (which have been built) have either concentrated
on their formal synthesis [37] [110] or just their feasibility [143], with limited
emphasis on their performance or efficient operation. One exception is the AMULET
project [56]: an asynchronous implementation of a previous synchronous
design, although the emphasis has primarily been on low power consumption.
The performance evaluation of asynchronous processors is still in its infancy.
Only recently have designs begun to take architectural considerations into ac-
count, e.g. Counterflow [157] and Fred [142], and investigate issues such as
Instruction-Level Parallelism (ILP).
Synchronous architectures exploit ILP at a considerable cost in terms of con-
trol overheads. Also, this centralised control regime forces complex designs to
operate below their technological best by always assuming worst-case beha-
viour. The benefit, however, is that the computational model uses fixed delays
thus leading to a deterministic behaviour of the architecture. This benefits the
compilers in predicting the state of the machine for efficient code generation and
scheduling. Forcing operations to complete within a fixed period of time
therefore simplifies the sequencing of operations. In contrast, under asynchronous
control, operations take only as long as is necessary; even the execution
times of identical instructions may vary. This, in turn, may have an adverse
effect on efficient code generation and scheduling. However, note that exploit-
ing concurrent behaviour is more efficient under distributed control, whereas
synchronising operations or making them take place sequentially increases the
control complexity in an asynchronous environment.
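The effect of variable execution times on a fixed instruction order can be sketched as follows. This is a simplified illustration, not the scheduler developed in the thesis: it assumes each instruction starts as soon as the instructions it depends on have finished, and it models no contention for resources. The instruction names, dependencies and delays are hypothetical. Under these assumptions a finish time can only shrink when an individual delay shrinks, so the makespan computed from worst-case delays bounds the actual one from above.

```python
def makespan(order, deps, delay):
    """Finish time of a fixed instruction order when every instruction starts
    as soon as its predecessors have finished (no resource limits modelled)."""
    finish = {}
    for instr in order:  # 'order' must list predecessors before their users
        start = max((finish[d] for d in deps.get(instr, [])), default=0)
        finish[instr] = start + delay[instr]
    return max(finish.values(), default=0)


# Hypothetical basic block: i3 needs i1 and i2, i4 needs i3.
order = ["i1", "i2", "i3", "i4"]
deps = {"i3": ["i1", "i2"], "i4": ["i3"]}
worst_case = {"i1": 4, "i2": 6, "i3": 5, "i4": 3}
actual = {"i1": 3, "i2": 4, "i3": 5, "i4": 2}

print(makespan(order, deps, worst_case), makespan(order, deps, actual))
```

Once instructions compete for the same resources, as the text notes can happen in practice, this simple monotonic argument no longer applies directly.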
1.1 In this Thesis
The RISC approach exploited the synergy in the interactions between the three
domains – the compiler, the processor architecture and the implementation tech-
nology. The work described in this thesis builds on this theme and investigates
the design of effective computer systems in the light of progress in each of these
domains; in particular, the efficient exploitation of ILP in fully asynchronous
general-purpose processor architectures.
There has also been an important trend in identifying and exploiting con-
currency in programs which are written in languages without explicit parallel
constructs. The concurrency is exposed in different stages of descending levels
of granularity: between basic blocks, between instructions within the same
block, and even within the instructions themselves. In general, concurrency
between basic blocks can be teased out by the compiler without an intimate
knowledge of the underlying processor. However, for effective exploitation of
concurrency at a finer detail of granularity, it is profitable to consider the in-
teractions between the compiler and the processor, and the processor and the
implementation technology, respectively. Increased performance through the
exploitation of ILP is a key feature of modern synchronous RISC processor archi-
tectures. However this approach is limited not only by the available parallelism
within programs, but also by the cost effectiveness of designing processors with
centralised control to exploit ILP.
This thesis studies the influence of a fully asynchronous control paradigm
on the design and performance of RISC-like processor architectures. The jus-
tification for doing this is the following observation. The clock period of a
synchronous processor is determined a priori by the speed of the slowest com-
ponent, and takes into account the worst-case execution and propagation times
and the worst-case operating conditions. In contrast, the performance of an
asynchronous processor is determined by the actual operational timing char-
acteristics of the components (effectively average delays) plus the overheads
due to self-timed control. Furthermore, a more significant and important con-
sequence of an asynchronous control paradigm is the ability to exploit fine-grain
concurrency efficiently at the instruction level.
Processors can be divided into two parts – the datapath and the control.
In synchronous designs, the centralised control performs the dual functions of
timekeeping and sequencing of operations within the datapath. Timekeeping is
now redundant in an asynchronous processor, thereby reducing the rôle of the
centralised control to just sequencing instructions. An asynchronous datapath
can be modelled and implemented as a micronet. Defined as a network of elastic
micropipelines [158], it allows for a greater degree of fine-grained concurrency
to be exploited, both between and within instructions, which would otherwise
be quite expensive to achieve in an equivalent synchronous design. In a tradi-
tional synchronous datapath, the centralised control forces each instruction to
go through all of the stages regardless of the need to do so (in effect a single
pipeline), with the time spent in each stage being determined by the clock
period. In a micronet, each program instruction spends time only in the relevant
stages and for just as long as is necessary. Furthermore, different program
instructions may execute concurrently within the same stage. A synchronous
pipelined processor for exploiting ILP has to incur additional control overheads,
e.g. [40] [42] [118]. In contrast, it will be demonstrated that as a consequence
of asynchronous control, implemented using a micronet, ILP can be achieved
implicitly without extra costs. This is because the control is now decentralised
and distributed amongst the communicating functional units which operate
concurrently. Micronets are easy to implement in CMOS VLSI technology [126],
and at the same time, as will be shown, they offer a good target for an optim-
ising compiler which can exploit the available concurrency between and within
instructions.
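As a rough illustration of the contrast drawn in this paragraph, the sketch below compares the unpipelined latency of one instruction in a clocked single pipeline with the latency it would have if it visited only the relevant stages for their actual delays. The stage names, delays, handshake overhead and instruction classes are all hypothetical and are not taken from the MAP design.

```python
# Hypothetical stage delays in nanoseconds; none of these figures come from the thesis.
WORST_CASE = {"fetch": 10, "decode": 6, "alu": 12, "memory": 20, "writeback": 8}
ACTUAL = {"fetch": 8, "decode": 5, "alu": 7, "memory": 14, "writeback": 6}
HANDSHAKE = 1  # assumed per-stage cost of self-timed control

# Stages each (hypothetical) instruction class actually needs.
STAGES_NEEDED = {
    "add": ["fetch", "decode", "alu", "writeback"],
    "load": ["fetch", "decode", "alu", "memory", "writeback"],
    "branch": ["fetch", "decode", "alu"],
}


def synchronous_latency(instr: str) -> int:
    """Clocked single pipeline: every stage is visited for one full clock period."""
    if instr not in STAGES_NEEDED:
        raise KeyError(instr)
    clock_period = max(WORST_CASE.values())  # fixed a priori by the slowest stage
    return clock_period * len(WORST_CASE)


def micronet_latency(instr: str) -> int:
    """Only the relevant stages are visited, each for its actual delay plus a handshake."""
    return sum(ACTUAL[stage] + HANDSHAKE for stage in STAGES_NEEDED[instr])


for instr in STAGES_NEEDED:
    print(instr, synchronous_latency(instr), micronet_latency(instr))
```

The sketch deliberately ignores overlap between instructions, so it says nothing about throughput or about the ILP that the micronet is intended to expose.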
1.2 Thesis Outline
The contents of each of the remaining chapters are summarised as follows:
Chapter 2 highlights the inefficiencies in current synchronous designs and in-
troduces a particular field of asynchronous design known as self-timed
circuits as a methodology to overcome these problems. How self-timed
circuits communicate while being insensitive to varying delays and the
advantages of these types of circuits are also discussed. This chapter
then sets out the objectives and goals of this thesis in the context of cur-
rent related work and opinion, and introduces an efficient structure for
distributed asynchronous control called a micronet.
Chapter 3 – The performance of an asynchronous system is ultimately determ-
ined by the dynamic interaction amongst components within the system.
Furthermore, the temporal behaviour of current VLSI systems is being
increasingly influenced by propagation delays which themselves can only
really be determined after layout has taken place. Therefore, evaluat-
ing the performance of these systems via analytical methods is difficult.
Estimating program performance via logic simulation is impractical due
to the amount of CPU time required. However, an application such as
an asynchronous processor is particularly well suited to parallel discrete
event simulation (PDES) due to the inherent parallelism afforded by the
distribution of control.
This chapter describes PEPSÉ, a simulation platform on a network of
transputers [79] for evaluating the performance of asynchronous processor
architectures. The architectures can be modelled at various levels of ab-
straction in the programming language Occam2 [78]. Occam2 is based on
the process model of computing in which a system can be described as
a collection of concurrent processes which communicate with each other
asynchronously through channels. The semantics of Occam2 capture the
behaviour of asynchronous circuits naturally. The underlying timekeep-
ing mechanism in PEPSÉ is based on a parallel asynchronous simulation
algorithm described in [8]. Because it is asynchronous, this algorithm simulates
the class of architectures under investigation more efficiently than time-driven
simulation.
Chapter 4 investigates, through simulation, the improvements in instruction
execution times of an asynchronously-controlled processor when com-
pared to an equivalent synchronously-controlled one. This study only ex-
ploits the average delays of the functional units in the self-timed case to re-
duce the execution times of the individual instructions. Results show that
shorter execution times can be achieved under micronet control. Taking
datapath pipelining into account at this stage is considered inappropriate
since pipelining increases both the control complexity and the instruction
latency.
Chapter 5 concentrates on the use of micronets to exploit ILP, which also requires
a number of control issues resulting from data and structural dependencies
between instructions to be resolved efficiently. Suitable metrics are introduced
for measuring both the exploitation of ILP and the performance of asynchronous
processors. The exploitation of ILP is analysed through a number of
refinements made to the Micronet-based Asynchronous Processor (MAP)
design of the previous chapter. Centralised control is progressively dis-
tributed to the functional units and the effects on the overall performance
of simple test programs are recorded. Results show that a micronet-based
datapath allows a greater degree of fine-grained concurrency to be ex-
ploited.
Chapter 6 discusses the influences of the asynchronous control paradigm on
the compiler of a micronet-based architecture. It is important to demon-
strate that the asynchronous processor is still a good target for a parallel-
ising compiler. The back-end of a compiler has two machine-dependent
tasks, namely to generate code and schedule the instructions. It will be
demonstrated that the local scheduling of a basic block can be performed
efficiently.
A micronet compiler is unable to predict the exact behaviour of the archi-
tecture for the execution of a given set of instructions. This is because the
execution times may vary due to data dependent operations and to inter-
actions between executing instructions competing for the same resources.
However, an instruction schedule based on worst-case operational beha-
viour can provide an upper bound on the program’s execution time. This
is useful since, generally, compilation is carried out once and programs are
run many times. Further performance improvement may be obtained at
run-time, to exploit the actual and data dependent delays, by fine-tuning
the instruction schedule dynamically.
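The bounding argument can be illustrated with a minimal sketch. It assumes a hypothetical five-instruction basic block, invented delay figures, and a machine in which each instruction starts as soon as its operands are ready. Contention for resources is deliberately ignored, so this is not the scheduling algorithm discussed in Chapter 6. Under these assumptions, the completion time computed from worst-case delays is never exceeded by a run with shorter actual delays.

```python
# Hypothetical basic block: each instruction lists the instructions it depends on.
DEPS = {
    "load_a": [],
    "load_b": [],
    "mul": ["load_a", "load_b"],
    "add": ["mul", "load_a"],
    "store": ["add"],
}

# Invented delays in arbitrary time units (assumptions for illustration only).
WORST_CASE = {"load_a": 4, "load_b": 4, "mul": 10, "add": 3, "store": 4}
ACTUAL = {"load_a": 2, "load_b": 4, "mul": 6, "add": 1, "store": 3}


def completion_time(deps, delay):
    """Finish time of the last instruction when every instruction starts as
    soon as all of its operands are ready (data dependencies only)."""
    finish = {}

    def visit(name):
        if name not in finish:
            start = max((visit(d) for d in deps[name]), default=0)
            finish[name] = start + delay[name]
        return finish[name]

    return max(visit(name) for name in deps)


bound = completion_time(DEPS, WORST_CASE)   # compile-time upper bound
observed = completion_time(DEPS, ACTUAL)    # one possible run-time outcome
print(bound, observed)
```

The gap between the two figures is the slack that a dynamic fine-tuning of the schedule could, in principle, recover.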
Chapter 7 draws conclusions and includes discussions on the implications for
processor design and future work. A glossary of terms appears in Ap-
pendix A.
Chapter 2
Towards an Asynchronous Control
Paradigm
2.1 Introduction
This chapter focuses on a previously implicit factor in computer system design
called the control paradigm, and examines the motivation behind investigat-
ing the use of an asynchronous control paradigm in RISC processor architectures.
Synchronous controls have been the norm in processor designs. But lately, there
has been a resurgence in the use of asynchronous design styles where instead
of using a global clock to regulate operations and communicate information at
fixed intervals, operations take place autonomously and communication takes
place at arbitrary times whenever information transfer is necessary. Some of the
motivation behind this interest has been due to the difficulties envisaged in syn-
chronous VLSI design. This chapter outlines these concerns, the inefficiencies
in synchronous control and the advantages of asynchrony. More importantly,
the effect of the control paradigm on the exploitation of instruction-level paral-
lelism in the traditional view of processor datapaths is discussed. It is believed
that the asynchronous approach can provide a more efficient design style for
processor architectures.
2.2 System Design
The design of a well integrated RISC microprocessor system should consider the
relationships between the different aspects of the system. The RISC experience
highlighted the need to consider the interactions between the implementation
technology, the processor architecture (which efficiently implements a given in-
struction set) and the compiler. The shift from CISC to RISC architectures took
advantage of maturing optimising compilers and improved VLSI technology.
The implementation technology has continued to play a significant part in im-
proving system performance of these architectures. However, current advances
are adversely affecting the synchronous control paradigm’s ability to exploit
the potential performance gains efficiently. In synchronous processors, while di-
minishing feature sizes and increasing clock speeds bring better performance,
they are achieved at a significant cost and design effort. Even the underlying
efficiency of this improvement is falling due, for example, to increases in power
consumption and the greater proportion of the clock period which needs to be
set aside to account for the side effects of technological advances.
2.3 Implementation Technology and a Synchronous
Control Paradigm
The improvements in integrated circuit technology pose new constraints on
the design of synchronous processors. Control management is characterised
by a global synchronising signal or clock to make all of the components in
the design communicate correctly, i.e. the clock controls both the sequencing
and the timing within circuits. Though not always appreciated, this global
clock can significantly limit the performance in a large system. This is due,
in part, to a number of factors. Firstly, the clock period needs to account for
some underlying physical characteristics of VLSI circuits related to the cost
of distributing the clock and the loading on clock buffers. Thus, part of the
clock period must be set aside to allow for clock skew. Secondly, the clock
speed must be a conservative worst case, not only in terms of the component’s
critical-path delay, but also of fabrication and environmental parameters (if the
chip is to operate reliably). Finally, transistors switch virtually simultaneously,
causing the power supply inductance to become a more significant limitation
on switching speed.
2.3.1 Clock Skew
Some components in a synchronous design may see the global clock signal
change before others because of variations in propagation delays (due to dif-
ferences in track length and loading) along the clock distribution lines. This
discrepancy, known as clock skew, means that the effective computation time
available is less than the clock period. To ensure correct operation, the
clock period must therefore be lengthened, which limits the maximum clock
frequency. Reducing the clock skew requires detailed analysis of the load on
the clock signal and careful design of the clock drivers, which incurs significant
cost and design effort [42].
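The effect of skew on the attainable clock frequency can be sketched as follows. The function name and the delay figures are assumptions chosen for illustration. The model simply adds the time set aside for skew to the worst-case logic delay and ignores other overheads such as latch setup times.

```python
def max_clock_frequency_mhz(logic_delay_ns, skew_ns):
    """Highest usable clock frequency when the period must cover the
    worst-case logic delay plus the time set aside for clock skew."""
    period_ns = logic_delay_ns + skew_ns
    return 1000.0 / period_ns


# Hypothetical figures: 8 ns of worst-case logic, with and without 2 ns of skew.
ideal_mhz = max_clock_frequency_mhz(8.0, 0.0)
skewed_mhz = max_clock_frequency_mhz(8.0, 2.0)
print(ideal_mhz, skewed_mhz)
```

In this example a fifth of the clock period does no useful computation.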
2.3.2 Other Limits on the Clock Frequency
Synchronous designs are optimised for worst-case conditions. The clock period
(and hence maximum frequency) is limited by the operation that takes the
longest time to complete, which is determined by the slowest component, its
slowest operation, its worst-case data inputs and the worst-case operating
conditions (i.e. supply voltage, temperature and fabrication process). Designers try
to reduce this delay by speeding up the component’s logic for degenerate data
input and by balancing component delays. However, in synchronous designs,
effort must be invested in analysing logic which might be rarely used, in order
to find and speed up the critical path.
Furthermore, the slowest operation may not even be required in a particular
clock period. There has been some work on varying the period of the clock
dynamically depending on the operation [39]. An alternative approach is the
incorporation of multiple frequency clocks into designs (generally derived from
a single clock), which requires analogue circuitry, i.e. phase-locked loops. Both
these approaches are difficult and expensive at the high clock frequencies at
which modern processors operate.
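A minimal sketch of why a single worst-case clock period can be wasteful is given below. The operation mix and the delays are invented for illustration, and the comparison ignores any overhead that a self-timed implementation would add (for example, completion detection and handshaking).

```python
# Hypothetical operations: name -> (delay in ns, fraction of executed operations).
OPERATIONS = {
    "shift": (3.0, 0.20),
    "add": (5.0, 0.50),
    "load": (8.0, 0.25),
    "multiply": (20.0, 0.05),
}


def synchronous_period_ns(operations):
    """A single global clock must accommodate the slowest operation."""
    return max(delay for delay, _ in operations.values())


def mean_operation_delay_ns(operations):
    """Frequency-weighted mean delay if each operation took only its own time."""
    return sum(delay * share for delay, share in operations.values())


clock_ns = synchronous_period_ns(OPERATIONS)
typical_ns = mean_operation_delay_ns(OPERATIONS)
print(clock_ns, typical_ns)
```

With these figures the rarely used multiply dictates a period more than three times the mean operation delay.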
2.3.3 Power Consumption
Power consumption is becoming an increasingly important factor in processor
design. In CMOS circuits, the majority of power is consumed during the switching
of gates. In synchronous designs, most of this switching takes place at clock
transitions, causing peaks of power consumption and leading to voltage drops
due to the inductance of the power supply. (Extreme variations can cause the
system to malfunction.) Also, periodic high currents on a chip can cause
electromigration: the force of the moving electrons hitting metal atoms causes
deformations and breaks in the metal [159]. Designers resort to using decoupling capacitors,
many power pins and wide power rails to reduce these effects at the expense of
packaging costs (e.g. gold is now being used in some designs for bond wires,
pads and power distribution rails [65] [83]). For example, the DEC 21064 Alpha
chip requires 138 power and ground pins to supply its 30W power requirement
and the 43A peak switching current drawn by the clock [42] [114].
Synchronous systems distribute the clock to all of the components which
means that they consume power whether they are doing useful work or not.
Selective disabling of the clock signal adds complexity to the clock buffers and
exacerbates the clock distribution problem, especially at high clock frequen-
cies. Power consumption can also be reduced by decreasing the power supply
voltage. However, since transistor threshold voltages must scale down with
supply voltage, it may become increasingly difficult to make transistors with
small enough thresholds.
If the supply voltage is not reduced in proportion to the decrease in feature
size, then the power consumption per unit area will increase. Together with
the fact that in CMOS the power dissipated is proportional to the frequency of
the clock [175], it seems likely that the upward trend in power consumption
(especially of microprocessors) will continue. Eventually, one might envisage
performance being limited by heat dissipation unless cost effective techniques
can be found. Removing heat from chips will become increasingly difficult
and therefore expensive. Solid (passive) heat sinks to cope with even moderate
power levels (50W to 100W) are large and require significant air-flow. For higher
ranges, more active devices become necessary, e.g. a thermosiphon [65] [83].
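The dependence of power on supply voltage and clock frequency can be made explicit with the usual first-order expression for dynamic power in CMOS. The derivation below is a sketch under stated assumptions: $\alpha$ denotes the factor by which feature sizes are scaled down, $a$ is a switching-activity factor, $C_L$ is the switched capacitance, and the clock frequency is assumed to rise by the same factor $\alpha$ in both cases.

```latex
% First-order dynamic power of a CMOS gate
P_{dyn} = a \, C_L \, V_{dd}^{2} \, f

% Scaling by \alpha with the supply voltage also divided by \alpha,
% assuming C_L \to C_L/\alpha and f \to \alpha f:
P'_{dyn} = a \, \frac{C_L}{\alpha} \, \frac{V_{dd}^{2}}{\alpha^{2}} \, \alpha f
         = \frac{P_{dyn}}{\alpha^{2}}
% \alpha^{2} times as many gates fit in the same area,
% so the power per unit area is unchanged.

% The same scaling with V_{dd} held fixed:
P''_{dyn} = a \, \frac{C_L}{\alpha} \, V_{dd}^{2} \, \alpha f = P_{dyn}
% so the power per unit area grows by a factor of \alpha^{2}.
```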
2.3.4 Shrinking Geometries
As the physical size of transistors and connections, known as the feature size, is
scaled down, thereby allowing a larger number of more complex and faster circuits
to be fabricated on a single chip, the problems associated with synchronous
design (clock skew and power consumption) will become increasingly significant
[115] [140].
The ability of synchronous designs to take advantage of these smaller, faster
devices is being hindered by timing delays in the interconnection layers [147]. In
VLSI circuits, wiring delays are approaching a significant proportion of switching
delays and can no longer be ignored. Scaling exacerbates these problems:
since systems contain more circuits, global signals have to travel longer distances
relative to transistor sizes. This may mean proportionally reducing the
clock frequency, which would result in inefficient operation of the system.
The Effects of Scaling
It is informative to observe how a circuit’s operation is affected when its spatial
dimensions are scaled down by a factor $\alpha$ [175]. (Assume that the circuit’s
operating voltage is divided by $\alpha$ too. This keeps both the electric fields on the
chip and the power dissipation per unit area constant.)
The propagation of electrical signals through a circuit is slowed by two
delays: in the channels of transistors and in the wires. The former, often called
the transit time $\tau$, is the time taken by charge carriers to “cross” the electric
field in the channel. Since this field is unaffected by scaling, the transit time is
divided by $\alpha$ (the channel becomes shorter), resulting in faster transistors. The
delay that signals encounter in wires is determined by the rate at which a voltage
presented at one end of a wire equalises across the whole wire. For a wire of
length $l$, this is proportional to $R \cdot C \cdot l^2$, where $R$ and $C$ are the resistance and
capacitance of the wire per unit length, respectively. When scaled down, $R$ is
increased by a factor of $\alpha^2$, $C$ is unaffected, and $l$ is divided by $\alpha$. Consequently,
the wire delay does not change under scaling. But since the transit time is
shortened, the wire delay increases relative to the transit time. If the correct
functioning of a circuit depends on the relation between these delays, then the
shrunk version may not function correctly any longer.
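The scaling rules can be checked numerically with a short sketch. The function names and the numerical values (in arbitrary units) are assumptions for illustration. The rules applied are those stated in the text: the transit time is divided by the scaling factor, the wire resistance per unit length is multiplied by its square, the capacitance per unit length is unchanged, and the wire length is divided by the factor.

```python
import math


def wire_delay(r_per_len, c_per_len, length):
    """Wire delay, taken as proportional to R * C * l^2."""
    return r_per_len * c_per_len * length ** 2


def scale_delays(alpha, transit_time, r_per_len, c_per_len, length):
    """Return (transit time, wire delay) after scaling down by alpha."""
    scaled_transit = transit_time / alpha
    scaled_wire = wire_delay(r_per_len * alpha ** 2, c_per_len, length / alpha)
    return scaled_transit, scaled_wire


# Arbitrary illustrative values.
transit, r, c, l = 1.0, 2.0, 3.0, 4.0
before_wire = wire_delay(r, c, l)
after_transit, after_wire = scale_delays(2.0, transit, r, c, l)

ratio_before = before_wire / transit
ratio_after = after_wire / after_transit
print(before_wire, after_wire, ratio_before, ratio_after)
```

The absolute wire delay is unchanged, while its size relative to the transit time grows by the scaling factor.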
Delays in short wires are much shorter than delays in transistors. For small
chip areas the wire delay may, therefore, be ignored. Such an area is known as
an isochronic or equipotential region [106] [150]. By dividing a circuit into suf-
ficiently small subcircuits and realising each subcircuit in an isochronic region,
only the wire delays of the connections between different subcircuits need to be
taken into account.
Locality
It is clear that since gate delays decrease with scaling, whereas interconnection
delays remain constant, eventually the speed at which a circuit can operate will
be dominated by interconnect delays rather than device delays. However, the
situation is actually somewhat worse than the above consideration implies, due
to a factor known as stuffing. This means that the lengths of the interconnections
do not scale down with the inverse of the scaling factor, as was assumed. In
practice, as the complexity of the circuit increases, the distance over which
interconnections must be maintained on a chip of fixed area may stay roughly
constant. It has been argued from statistical considerations [89] that a good
approximation to the maximum length $L_{max}$ of interconnection required is given
by
$$L_{max} = \frac{A^{1/2}}{2}$$
where A represents the area of the chip. Therefore, the average interconnection
delay may actually increase. If scaling occurs and the chip size is also increased,
then the interconnect problem is further exacerbated. When the delay time of
the circuit depends largely on the interconnection delay (instead of the logic
gate delay), minimal and local interconnections will become an essential factor
for an effective realisation of the VLSI circuit [96].
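The consequence for the longest wires can be sketched numerically. The sketch assumes the approximation that the maximum interconnection length is half the square root of the chip area, a wire delay proportional to the product of resistance per unit length, capacitance per unit length and the square of the length, and a resistance per unit length that grows with the square of the scaling factor. The function names and figures are hypothetical.

```python
import math


def max_interconnect_length(chip_area):
    """Approximate longest interconnection: half the square root of the area."""
    return math.sqrt(chip_area) / 2.0


def longest_wire_delay(alpha, chip_area, r_per_len, c_per_len):
    """Delay of the longest wire after scaling by alpha on a chip of fixed area.
    The length does not shrink with alpha, but the resistance per unit length
    is assumed to grow by alpha squared."""
    length = max_interconnect_length(chip_area)
    return (r_per_len * alpha ** 2) * c_per_len * length ** 2


length_mm = max_interconnect_length(100.0)          # a 10 mm x 10 mm chip
unscaled = longest_wire_delay(1.0, 100.0, 1.0, 1.0)
scaled = longest_wire_delay(2.0, 100.0, 1.0, 1.0)
print(length_mm, unscaled, scaled)
```

Under these assumptions, halving the feature size on a chip of the same area quadruples the delay of the longest wire, while the gates themselves become faster.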
2.3.5 Design Difficulties
The clock in a synchronous circuit can be a source of both transient and per-
manent errors [150]. Even when modules communicate correctly under ideal
or typical conditions, timing problems can still arise. A change in clock speed,
caused by processing or the environment, can make the system fail even if a
conservative one is chosen. For example, it could exaggerate clock skew and
require increased setup and hold times. For systems running at their maximum
clock frequency, this means reduced reliability. Overcoming these timing prob-
lems in synchronous designs is far from trivial and is one of the causes of devices
being either slow, unreliable, or not working at all.
Thus, improvements in IC technology pose new constraints on the design
of synchronous processors, and since the clock frequency has to be proportionally
reduced, the system operates inefficiently. The use of global clock
signals also affects other areas of the design process. In synchronous designs the
timing of a circuit, being fundamental to its correct operation, is one of the most
difficult parameters to abstract from the logical design. Designers must always
be aware of the performance of the hardware implementation in order to verify
its operational correctness. Also, as a consequence of the automated layout of
circuits, the designer has less control over the exact placement of global signal
lines. Therefore, the true performance of these designs is difficult to estimate
accurately. For example, in the design of the DEC Alpha 21064, designers
had to use post-layout simulations and three-dimensional representations of
the results to evaluate the clock skew across the chip [42]. This violates the
hierarchical approach to design by making it more difficult to abstract away
from the electrical characteristics of the VLSI implementation [147].
2.4 Asynchronous Design – A Solution?
Asynchronous design attempts to solve some or all of the problems described
previously. Asynchronous circuits have no global clock, and are therefore free
from the operational and design problems of global synchronisation. Asynchronous
circuits can be based on different timing models. A circuit is delay-insensitive
(DI) if its correct operation is independent of the delays in the logic gates and the
interconnections [20] [119]. However, the class of DI circuits has been found to
be extremely limited [21] [107]. A restricted form of this class, known as speed-
independent, allows arbitrary delays in logic elements, but assumes zero delays
in the interconnect (i.e. all interconnect wires are equipotential) [41] [124] [125].
Another class of circuits, quite similar to the first two, is known as quasi delay-
insensitive: i.e. delay insensitivity with isochronic forks (the delays in the arms of
a fork are assumed to be the same) which in practice is very close to speed inde-
pendence [106]. Finally, if the circuit only functions when the delays are below
some predefined limit, then the circuit is known as bounded-delay. Rather than
relying on a bounded delay model of the worst-case path through the circuit,
there are a variety of methods for generating a completion signal [150]. Self-
timed logic will signal when its output has been composed rather than simply
producing a result at some time in the future. These methods use a multiple
wire protocol for the communication of data to and from components in a delay-
insensitive way. Thus, the circuit’s logical behaviour is independent of delays
within components and wires. In addition to being freed from the problems
of clock distribution, systems designed with these asynchronous circuits are
claimed to offer a number of advantages over synchronous designs [66] [106]:
Speed – Asynchronous circuits are optimised for the typical case; worst-case
operations simply take longer. There is no fixed clock period during which
the operation must complete and therefore delays need only be as long
as necessary. This may sometimes be slower than the synchronous clock
period, but since the circuits operate at a speed determined by the current
operation and therefore are effectively limited by their average (or typical)
delay, they are potentially faster. The time variation between worst-case
and typical operation can be significant, so optimising a circuit for typical
rather than worst-case operations has an advantage not available to the
synchronous designer. Generally, these circuits can be smaller and sim-
pler than their synchronous equivalents. Note that the delays themselves
are affected by environmental parameters and conditions. Again, syn-
chronous design needs to allow for the worst-case operating conditions to
guarantee correct operation.
Power Consumption – Asynchronous circuits generally have a much lower
power consumption than their synchronous equivalent. Clocked circuits
fire most of their transistors simultaneously at rising or falling clock edges.
In asynchronous circuits, since there is no global clock signal, power con-
sumption will be more evenly distributed over time so that the voltage
variance should not be as large (transistors only fire when they contribute
to the computation). Provided the supply voltage does not fall below the
transistor’s threshold voltage, an asynchronous circuit would simply slow
down but continue to operate correctly [109]. Note that in a synchronous
circuit any slowing down could mean the clock transition occurring before
data becomes ready, thus causing the circuit to fail.
Also, an asynchronous system activates only those parts of the circuit
which are required and so does not dissipate power in the rest of the
circuit that is not being used.
Modularity – The complexity caused by the current high level of integration
and parallelism makes demands upon our ability to design reliable sys-
tems. A key lesson VLSI designers learned from software designers is to
divide a problem into modules that can be designed separately.
To reduce complexity, it is necessary for the boundary between modules
to be well defined and simple. An important boundary condition is to
know when the data communicated are valid. Provided each block in
an asynchronous system is internally correct and meets the simple timing
constraints of its external interface, the design will be correct in terms of
timing. A designer can therefore simply replace one block by another with
different characteristics and evaluate any change in performance with little
further effort. Again, a synchronous designer does not have this flexibility.
Layout and Robustness – Chip layout is much simplified since the lengths
(delays) of the wires do not affect the correctness of the circuit. Similarly,
delay-insensitive circuits are tolerant of implementation parameters such
as fabrication process and transistor scaling.
Metastability – An arbitration device, i.e. a device that grants one of a
number of requests exclusively, is an example of a circuit exhibiting metastable
behaviour. The closer its initial state is to a metastable state, the longer it
takes to settle down into a stable state. This problem, first discovered by
Chaney and Molnar [28], means that any clocked system containing such
a device has a finite probability of malfunctioning.
Automated Synthesis – Accurately estimating the timings of synchronous pro-
cessors and abstracting them from the logical design is difficult. This has
been one of the limiting factors in automatic performance-led synthesis of
synchronous processors. Since the correct operation of an asynchronous
circuit is independent of the delays, these circuits have proved attract-
ive for automated synthesis. Many “correct by construction” synthesis
methods and compilation tools [19,35,36,66,101,106,116,171] based on the
decomposition of formally-proven specifications (e.g. [43]) have been pro-
posed. Due to the complexity of designing asynchronous systems, many
recent large designs [37] [110] [170] have been synthesised via compilation
tools derived from high-level specifications.
2.4.1 Disadvantages of Asynchronous Design
Asynchronous designs have complexities of their own. First, the logic to detect
when data are valid requires extra circuitry. Second, races and hazards need
more careful consideration [180]. Output hazards of combinational circuits have
little effect on the operation of synchronous systems, as they are allowed to settle
before being latched into registers. On the other hand, hazards are intolerable
in asynchronous systems because any transition of an output or state variable
triggers other transitions immediately; the circuit operates autonomously, and
does not depend on any clock timing. For this reason, it is necessary to analyse
the circuits used and define the constraints under which no hazard will ever
occur [179]. These constraints must then be followed strictly or failure due to
hazards may result [119].
Despite the significant work on the specification and design of asynchron-
ous circuits, testing them has received relatively little attention [67] [76]. Tradi-
tionally, testing asynchronous circuits has been considered a difficult problem,
especially when compared to the synchronous case. Unfortunately, methods
used to test synchronous circuits are not directly applicable. This is due, in
part, to the absence of the global clock signal in the asynchronous design style
which reduces controllability, and makes both the generation of test vectors
and the detection of hazards and race conditions harder [22]. However, some
techniques have been adapted for use in asynchronous circuits, e.g. partial scan
path [90] [144]. Other developments have been the inclusion of hazard-free
circuit synthesis strategies [179] and fault modelling and fault test evaluation
into synthesis systems [145] [173].
2.4.2 Equipotential Regions (revisited)
An equipotential region is one in which a signal can be treated as identical
everywhere, that is, the signal requires a negligible amount of time to equalise
all potential differences within the designated region. This notion is funda-
mental in any self-timed methodology [150]. A basic assumption in the syn-
thesis of self-timed modules is that within a module, wire delays are negligible,
whereas delays between logic gates are arbitrary but finite. This is equivalent
to stipulating that self-timed modules have to reside completely within equi-
potential regions. In any integrated circuit technology, limits of such regions
can be defined, based on the electrical characteristics of interconnects and circuits.
Particularly, in MOS technology, equipotential regions are defined within
which signals settle in less than $\tau$, the transit time of a transistor [115]. As
stated in [150], normally, these limits are much larger than the size of self-timed
modules, and hence, no special care is required.
Scaling affects the number of transistors per isochronic region. Suppose that
in an isochronic region we allow wires of length at most $l$, with $l$ satisfying
$$R \cdot C \cdot l^2 = \epsilon \cdot \tau$$
for some small constant $\epsilon$, where $\tau$ is the transit time. The maximum area of
an isochronic region is then $(\epsilon \cdot \tau)/(R \cdot C)$ and is proportional to
$\tau/(R \cdot C)$. Consequently, when scaling down the circuit by a factor $\alpha$,
the maximum area of an isochronic region is divided by $\alpha^3$. Since scaling
multiplies the number of transistors per area by $\alpha^2$, the maximum number
of transistors per isochronic region is divided by $\alpha$. This implies that subcircuits
need to be realised in isochronic regions that are as small as possible and that
the minimum number of isochronic regions per chip scales as $\alpha^3$.
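The steps of this argument can be written out as a short derivation. The symbols are assumptions where the text does not fix them: $\alpha$ is the scaling factor, $\tau$ the transit time, $\epsilon$ the small constant, $A_{iso}$ the maximum area of an isochronic region and $A$ the (fixed) chip area. The scaling rules used are those of the earlier scaling discussion: the transit time is divided by $\alpha$, the resistance per unit length is multiplied by $\alpha^2$ and the capacitance per unit length is unchanged.

```latex
% Maximum area of an isochronic region
A_{iso} \propto l^{2} = \frac{\epsilon \cdot \tau}{R \cdot C}

% Under scaling: \tau \to \tau/\alpha, \quad R \to \alpha^{2} R, \quad C \to C
A'_{iso} \propto \frac{\epsilon \cdot (\tau/\alpha)}{(\alpha^{2} R) \cdot C}
         = \frac{1}{\alpha^{3}} \cdot \frac{\epsilon \cdot \tau}{R \cdot C}

% Transistors per unit area grow by \alpha^{2}, so transistors per region scale as
\alpha^{2} \cdot \frac{1}{\alpha^{3}} = \frac{1}{\alpha}

% For a chip of fixed area A, the number of isochronic regions scales as
\frac{A}{A'_{iso}} \propto \alpha^{3}
```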
The notion of equipotential regions also brings up another interesting and
important point: self-timed modules can be considered to be contained in equi-
potential regions, communicating with each other reliably through the use of
a handshake protocol [150]. Therefore, this protocol must be implemented
whenever signals are to be transmitted between regions.
2.4.3 Handshake Protocols
Figure 2–1: Two- and four-phase signalling
A single voltage transition or change of voltage on a wire is the simplest form
of signalling that an event has occurred. Since there are time and energy costs
associated with changing the voltage on a wire, it pays to use as few voltage
transitions as possible in asynchronous signalling conventions, commonly re-
ferred to as handshaking.
The most efficient signalling convention is two-phase handshaking. Consec-
utive signals or events are indicated by alternating low-to-high and high-to-low
voltage transitions. The major advantages of two-phase handshaking, also
known as transition signalling or nonreturn-to-zero (NRZ) signalling, are that
it is as fast and as energy efficient as possible [150]. However, in practice, addi-
tional logic and state information may be required in each element, since logic
devices tend to be sensitive to voltage levels or only transitions in a particular
direction.
Much of the work on self-timed circuit design has centred around an al-
ternative to two-phase, known as four-phase handshaking, which was first used
by Muller in many of his examples of speed-independent circuits [117]. In the
four-phase handshaking protocol, also referred to as Muller or return-to-zero
(RZ) signalling, both wires are initially low, by convention. After each event is
sent or presented onto the wire and acknowledged, both wires return to their
initial (low) state. The protocol is termed “four phase handshaking” since both
transitions (the assertion and the return to zero) are accompanied by additional
acknowledgements from the receiver. This results in four phases for a complete
message transfer. The principal advantage of this approach is that the nature
of four-phase handshaking tends to result in very simple and natural circuit
implementations in conventional logic gates. However, it uses twice as many
transitions as are necessary, and whenever wire delay is a substantial fraction
of the operation time, the extra trip required by a single communication can be
a serious performance penalty. Figure 2–1 shows both signalling conventions.
The terms request driven and data driven indicate whether it is the receiver or
the sender that initiates the handshake (the terms pull and push are also sometimes
used).
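The difference between the two conventions can be sketched by listing the transitions on a request wire and an acknowledge wire. The wire names and the trace representation are assumptions for illustration. The sketch models only the order of events, not circuit timing.

```python
def two_phase(n_transfers):
    """Transition signalling: every transition on a wire is an event."""
    req = ack = 0
    trace = []
    for _ in range(n_transfers):
        req ^= 1                      # sender toggles request
        trace.append(("req", req))
        ack ^= 1                      # receiver toggles acknowledge
        trace.append(("ack", ack))
    return trace


def four_phase(n_transfers):
    """Return-to-zero signalling: both wires go back low after each transfer."""
    trace = []
    for _ in range(n_transfers):
        trace += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return trace


two = two_phase(3)
four = four_phase(3)
print(len(two), len(four))
```

For the same number of transfers, the four-phase trace contains twice as many transitions, and each transfer ends with both wires low. In the two-phase trace the wire levels after a transfer depend on how many transfers have occurred, which is why additional state may be needed in each element.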
2.4.4 Data Transmission
The “two-wire” handshake, shown in Figure 2–1, is sufficient to communicate
one bit of information to another component. In order to communicate a larger
number of bits as a single event, a modification is required to allow the receiver
to recognise when all the constituent bits are valid. Data transmission can take
one of two forms.
Figure 2–2: Encoded data transmission
Firstly, the data and a data valid signal are encoded together to form a
codeword. The transmitted codeword is recognised by the receiver which
then extracts the original data (see Figure 2–2). Various codes have been proposed
[16] [74] which are dependent on the handshaking convention. The precise
conditions for the feasibility of delay-insensitive data communication and a
comparison of DI codes have been given by Verhoeff [174]. The most popular code
is Dual-Rail Coding (DRC) (which is equivalent to Hot codes [74] of length two),
because of its simple encoder-decoder pair. In general, the disadvantage of
encoded data transfer is the extra circuitry (and therefore, area and perform-
ance costs) required to support this mechanism. An encoder and a decoder are
required on every output and input data port, respectively. Their area depends
on the data width and the coding scheme. Furthermore, the data highway
width also depends on the coding scheme, e.g. DRC requires a highway width
twice the data width. In practice, for small data widths, dual rail encoding may
be quite efficient. But for larger data widths, it becomes expensive in silicon
area, in terms of routing the wide data highways across and off-chip, and in
terms of the latch sizes associated with holding large code lengths. In the
future, however, this may become less of a problem since, with scaling, the effective
area increases by the square of the scaling factor, and improving technology is
increasing the physical area of chips too. Of the other codes suggested [174],
Berger Codes seem promising since the data value is a subset of the encoding
(i.e. separable), they have a low redundancy, and are easy to code.
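A minimal sketch of dual-rail coding is given below. The rail ordering (true rail first), the use of both rails low to mean "no data", and the function names are assumptions for illustration, and the sketch models only the code itself, not the handshake around it.

```python
SPACER = (0, 0)  # both rails low: no data present on this bit position


def encode_dual_rail(bits):
    """Map each bit to a (true_rail, false_rail) pair with exactly one rail high."""
    return [(1, 0) if bit else (0, 1) for bit in bits]


def is_valid(word):
    """Completion detection: every bit position carries exactly one high rail."""
    return all(t + f == 1 for t, f in word)


def decode_dual_rail(word):
    """Return the data bits, or None while any position is still a spacer."""
    if any(t and f for t, f in word):
        raise ValueError("both rails high is not a legal dual-rail code")
    if not is_valid(word):
        return None
    return [t for t, _ in word]


data = [1, 0, 1, 1]
word = encode_dual_rail(data)
wires = sum(len(pair) for pair in word)          # twice the data width
partial = [word[0], SPACER, word[2], word[3]]    # one bit has not yet arrived
print(decode_dual_rail(word), decode_dual_rail(partial), wires)
```

The receiver needs no separate data valid wire, since validity is implied by the codeword, at the cost of two wires per data bit.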
An alternative scheme for self-timed datapaths would be to use datapath
components which operate directly on the DI codeword instead of the data
alone. This would remove the need for encoding schemes (a detection mechanism
still being required, of course). At first sight, this may seem expensive due
to the complexity of the data path components involved, however it has been
shown that some designs based on dual-rail encoded data can be comparable
Figure 2–3: Bundled data transfer
The second form is “bundled data transfer” and is based on the bounded
delay model. The data wires and the data valid signal are treated as a bundle, i.e.
the data valid (DV) signal reaches the receiver after the data wires become valid.
This implies that the propagation delay for the data must be less than the delay
to propagate the DV signal. In general, this condition is met by inserting an
extra delay on the DV wire to account for the worst-case delay on the data wires.
This form allows the use of standard datapath components such as multipliers
and ALUs without the coding circuitry (as shown in Figure 2–3), thus reducing
the communication and area overheads.
The main advantage of this method of building logic functions is that stand-
ard techniques or existing cell sets can be used to transform the data and be still
used in the framework of a self-timed system. A major disadvantage is that a
careful examination of the worst-case delay through the logic block and the data
delivery to the receiver is required to guarantee that the bundling constraint is
met under all conditions (similar to the task carried out in synchronous designs).
Guaranteeing worst-case delay will often require the bundling delay to be large
compared to the average case performance of the logic. This not only slows
down the module, but also the entire system that uses this module.
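The bundling constraint and its cost can be sketched as follows. The function names, the safety margin and the delay figures are assumptions for illustration.

```python
def matched_delay_ns(worst_case_data_delays_ns, margin_ns):
    """Delay to place on the data valid wire: slowest data wire plus a margin."""
    return max(worst_case_data_delays_ns) + margin_ns


def bundling_constraint_met(data_delays_ns, dv_delay_ns):
    """True if every data wire settles before the data valid signal arrives."""
    return all(delay < dv_delay_ns for delay in data_delays_ns)


# Hypothetical delays (ns) for three data wires.
worst_case = [6.0, 7.5, 9.0]
typical = [3.0, 4.0, 3.5]

dv_delay = matched_delay_ns(worst_case, margin_ns=1.0)
print(dv_delay,
      bundling_constraint_met(worst_case, dv_delay),
      bundling_constraint_met(typical, dv_delay))
```

Every transfer takes the full matched delay, even when the data settle much earlier, which is the performance cost described above.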
Conversion between dual-rail and bundled protocols is simple [150], so
that self-timed logic techniques can be used even in a system that is largely
bundled. If dual-rail signalling is used internally on-chip then, since dual-rail
demands more resources in terms of wires and pins, it makes sense to convert
to a bundled protocol when sending data off-chip.
2.4.5 Ease of Design
In addition to the advantages of asynchronous design outlined earlier, further
benefits of reduced design time and costs are also possible. Asynchronous
design could be considered easier than synchronous design since the prob-
lems with clock distribution, skew and excessive voltage surges may not exist,
so a designer need not spend time resolving them. Furthermore, the delays
of infrequently used blocks do not significantly affect overall performance, so
costly sophisticated design techniques may be avoided. Simpler designs may
be used for blocks with data dependent delays (e.g. the ripple-carry adder).
The use of high-level design languages derived from CSP [73], such as
Tangram [149] and CHP [106], eases the difficulties of designing asynchronous
circuits by allowing programs to be automatically compiled to circuits by a
silicon compiler [25] [171].
The major drawbacks of self-timed circuits are in the circuit and signalling
overheads involved in local communication, and any timing constraints that are
required to be met by particular choices of signalling protocols. For example,
data may be passed in a delay-insensitive fashion at the expense of using mul-
tiple wires per data bit to encode this form of signalling [174]. If bundled data
signalling is used instead, the complexity is reduced at the cost of meeting the
bundling constraint. Any such timing constraints must be analysed thoroughly
and carefully if the circuit is to operate correctly.
2.5 Exploiting Performance
This thesis seeks to exploit the potential performance benefits of asynchrony
in processor systems. Care must be taken when comparing synchronous and
asynchronous implementations since in practice their design goals are differ-
ent [2]. One must also be aware of the trade-offs between performance, area
and power consumption.
2.5.1 Synchronous versus Asynchronous Control
Events in a synchronous processor are recognised at regular, pre-determined
intervals which are ultimately fixed by the clock. If the duration of all actions
were constant and known precisely, then the sequencing of actions could be
implemented efficiently with a global clock. Unfortunately, the actual delay can
vary and is likely to be a lot less than the predetermined worst-case delay, which
could result in significant idle periods between events and the next clock tick.
In contrast, an asynchronous architecture which is realised by using self-timed
components with appropriate handshaking protocols is able to adjust to varying
delays in the components which could be due to data dependencies or changes
in the environment. This robustness comes at a price, due to the overheads of
local handshaking protocols. For this approach to be viable, the average delay of the
components together with overheads of self-timed design should be less than
the worst-case delays plus overheads of a synchronous design. Synchronisation
overheads are difficult to estimate as they are intimately influenced by the clock
frequency, technology, fabrication process, routing and chip size.
Most importantly, the self-timed (ST) overhead should not exceed the
synchronous overhead by more than the difference between the average and the
worst-case delay of the component. As discussed earlier, while improvements
in technology may cause the synchronous overheads to increase, this may not be
the case for the overheads due to asynchrony since these can be accounted for
by gate delays and local communications. Improvements in performance can be
achieved by either reducing the ST overhead directly by speeding up the
specific circuits, or indirectly by hiding the overhead by doing some “useful
work” concurrently. Alternatively, a designer could optimise the design for
typical operation. A synchronous designer’s primary goal has been to reduce
the worst-case delay of components (possibly at the cost of increasing the
average delay). Since this leaves little scope for a sufficient margin between
the two, incorporating synchronously designed components into ST systems may
not prove advantageous. Furthermore, when components are connected in pipelines
or arrays, the overall performance will tend towards the worst-case value
since throughput is limited by the slowest individual component stage [87]
[97]. Consequently, in comparison to an equivalent synchronous design, the
performance may even be worse due to the ST overheads. Previous attempts to
harness this proposed advantage of self-timed circuits have not proved very
successful [146] [156].
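The viability condition stated in this section (average delay plus ST overhead less than worst-case delay plus synchronous overhead) can be checked with simple arithmetic. The sketch below is illustrative only: the function names and all delay values are assumptions chosen for the example, not measurements.

```python
def self_timed_is_viable(avg_delay, st_overhead, worst_delay, sync_overhead):
    """Average delay plus ST overhead must beat worst-case delay plus sync overhead."""
    return avg_delay + st_overhead < worst_delay + sync_overhead


def overhead_budget(avg_delay, worst_delay):
    """Amount by which the ST overhead may exceed the synchronous overhead."""
    return worst_delay - avg_delay


# Hypothetical component: 6 ns on average, 10 ns in the worst case.
budget = overhead_budget(6.0, 10.0)                  # 4 ns of extra overhead allowed
viable = self_timed_is_viable(6.0, 3.0, 10.0, 1.0)   # 9 < 11
too_costly = self_timed_is_viable(6.0, 6.0, 10.0, 1.0)  # 12 < 11 fails
```

The two functions express the same inequality rearranged: a component designed to narrow the gap between its average and worst-case delays leaves a smaller budget for handshaking overheads.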
2.6 Pipelines
Pipelining is an implementation technique whereby a cascade of processing
stages is connected (generally in a linear fashion) to perform functions over a
stream of data flowing through the stages. This technique, which is by far the
most popular method for enhancing performance in CPUs, provides a way to
start a new task before an old one has been completed.
The throughput of a pipeline is determined by how often a result exits the
pipeline. In a synchronous pipeline all of the stages must be ready to proceed at
the same time. The time required to move data down one stage of the pipeline,
the machine cycle, is determined by the time required by the slowest pipe
stage. As long as there are no dependencies between the data, the throughput
is fixed at one result per machine cycle. Data flow between adjacent stages in
an asynchronous pipeline is controlled by a handshaking protocol. Results only
move forward when the succeeding stage is empty. An asynchronous pipeline
may have a variable throughput rate since different stages may experience
different delays. For complex (data-dependent) computations, asynchronous
design has the advantage of exploiting the actual delays, whereas synchronous
solutions are adjusted to the worst-case.
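The difference between the two forms of pipeline can be illustrated with a small timing model. The sketch below is a simplification under stated assumptions: each stage holds one item, a result moves forward only when the succeeding stage is empty, input data is always available, and handshaking and clocking overheads are ignored. The function names and delay values are hypothetical.

```python
def synchronous_finish_time(worst_case_delays, n_items):
    """All stages advance together; the cycle is set by the slowest stage."""
    cycle = max(worst_case_delays)
    return (n_items + len(worst_case_delays) - 1) * cycle


def asynchronous_finish_time(delays):
    """delays[i][s] is the actual delay of item i in stage s."""
    n_stages = len(delays[0])
    prev_start = None        # start times of the previous item in each stage
    prev_finish_last = 0     # time the previous item left the last stage
    finish = 0
    for item in delays:
        start = [0] * n_stages
        ready = 0            # time this item's data from the preceding stage is ready
        for s in range(n_stages):
            if prev_start is None:
                stage_free = 0
            elif s + 1 < n_stages:
                # Stage s is empty once the previous item has entered stage s + 1.
                stage_free = prev_start[s + 1]
            else:
                stage_free = prev_finish_last
            start[s] = max(ready, stage_free)
            ready = start[s] + item[s]
        prev_start = start
        prev_finish_last = ready
        finish = ready
    return finish
```

With constant delays equal to the worst case, both models finish at the same time. When some items complete early, the asynchronous model finishes sooner, although a slow item still blocks those behind it.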
2.6.1 The Conversion of Synchronous Pipelines to Equivalent
Asynchronous Ones
This section describes the transformation of a synchronous pipeline to an
equivalent asynchronous one, as illustrated in Figure 2–4. Part (a) illustrates
a conventional synchronous pipeline in which a clock signal is used to control
the transfer of data between functional units (FUs), and is used by the control
unit (CU) to generate the correct sequence of control signals to define the
pipeline’s operation. In the CU, the relationship between control signals C1
and C2 is strictly bound since they must be generated at the correct time and
in the correct order. In other words, the CU needs to incorporate a
(pessimistic) “timing” model of the pipeline. A simpler pipeline, as in the
case of RISC architectures, results in simpler control and therefore smaller
control costs.
Figure 2–4: From a synchronous to an asynchronous pipeline
Part (b) illustrates an intermediate stage, where the transfer of data is
controlled locally. The “network” is responsible for communicating data and
control signals between FUs. This process can be as simple as bundled data
transfer [158], or more complex such as encoding the data prior to transfer and
decoding it at the receiver [174]. The clock (to the CU) is now only used as a
time reference for the generation of FU control signals. The CU still needs to
model the timing characteristics of the pipeline, which results in minimal
performance gains, if any. However, the global clock signal has been removed by
the decentralisation of communication controls.
Part (c) illustrates a truly self-timed pipeline. The interfaces receive
control signals from the CU and encode transfer data for detection at the
interface of the destination. When valid data has been detected and latched,
the destination interface sends an acknowledgement signal back to the sender.
The sender is then able to remove the data and release the bus (if shared). The
interface is responsible for meeting the operational requirements of the FU,
such as guaranteeing that the input data is valid before control signals are
asserted. This, together with the communication protocol, decouples the logical
behaviour from the timing characteristics of the pipeline. This enables
functionally-equivalent FUs to be interchanged without affecting the operation
of the rest of the pipeline. Since the CU no longer requires the timing
characteristics, the pipeline control becomes less complex and therefore
faster. The control signals C1 and C2 no longer have to be generated at the
right time or in the correct order with respect to each other, since an FU
cannot begin its operation until it has received both the data and the control
signals (due to the FU interface). The only constraint on the control or data
signals is that the previous value must have been received by the
corresponding FU interface before the next one can be issued. This means that
neither the CU nor the interfaces can change the value of a signal until an
acknowledgement has been received from the receiver. A typical handshake cycle
might be as follows: wait until the FU is not busy; assert the control signals;
wait for an acknowledgement; clear the control signals; repeat. This naturally
maps to a four-phase protocol [150] with the acknowledgement signal also
doubling as a busy flag. This would allow the control unit of a processor to
use the acknowledgement signals from FUs as part of a scoreboarding mechanism.
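The handshake cycle listed above can be sketched as a sequence of four signal events. The Python below is a sequential illustration only: the class and function names are hypothetical, the FU's own operation time is not modelled, and the "wait" steps of a real circuit are reduced to simple checks.

```python
class FunctionalUnitInterface:
    """Receiver side: raises ack once the control value has been latched."""

    def __init__(self):
        self.ack = False      # also read by the sender as a "busy" flag
        self.latched = None

    def observe(self, req, control):
        if req and not self.ack:
            self.latched = control   # latch the control value
            self.ack = True          # acknowledge receipt
        elif not req and self.ack:
            self.ack = False         # return to zero: ready for the next cycle


def four_phase_cycle(fu, control):
    """One handshake cycle as seen by the control unit; returns the event order."""
    events = []
    if fu.ack:
        raise RuntimeError("FU busy: the sender must wait until ack is low")
    fu.observe(True, control)        # assert the control signals
    events.append("req+")
    if fu.ack:                       # wait for an acknowledgement
        events.append("ack+")
    fu.observe(False, None)          # clear the control signals
    events.append("req-")
    if not fu.ack:
        events.append("ack-")
    return events
```

Since the acknowledgement stays high until the cycle has completed, a control unit could consult it before issuing the next operation to the same FU, which is the sense in which it doubles as a busy flag.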
The CU cannot predict exactly when the FU with the largest delay in the
pipeline will finish. By letting the FU indicate that it has finished, and not
necessarily to the control unit but to its successor, the pipeline is driven by local
Figure 2–5: A basic micropipeline FIFO (signal labels: Req Out, Data Out, Ack Out, Ack In)
In the 1988 Turing Award lecture, Ivan Sutherland outlined a methodology for
the design of asynchronous pipelined systems using the two-phase bounded-delay
(bundled data transfer) protocol [158]. The interface has an arbitrary
number of data bits accompanied by two signalling wires (req and ack). A
micropipeline is a simple event-driven elastic pipeline which maintains the
order of data. A block diagram of a generic micropipeline is shown in
Figure 2–5. It consists of three parts: a control network consisting of a
single C-element per micropipeline stage, a latch in each stage, and possibly
some combinational logic between stages. The logic can signal its own
completion (Stage A), or its completion can be simulated with a known delay
(Stage B). If no processing is present between stages, the pipeline becomes a
first-in first-out (FIFO) queue (Stage C).
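The control network of the FIFO case can be sketched as a chain of Muller C-elements using two-phase (transition) signalling. The following Python model is a minimal illustration: it covers the control chain only (latches, data and delays are omitted), and the class and method names are hypothetical. Each C-element takes the request from its predecessor and the inverted output of its successor.

```python
def c_element(a, b, prev):
    """Muller C-element: the output follows the inputs when they agree, else holds."""
    return a if a == b else prev


class MicropipelineControl:
    def __init__(self, n_stages):
        self.c = [0] * n_stages   # C-element outputs, one per stage
        self.req_in = 0           # driven by the producer
        self.ack_in = 0           # driven by the consumer

    def settle(self):
        """Re-evaluate the C-elements until no output changes."""
        changed = True
        while changed:
            changed = False
            last = len(self.c) - 1
            for i in range(len(self.c)):
                left = self.req_in if i == 0 else self.c[i - 1]
                right = self.ack_in if i == last else self.c[i + 1]
                new = c_element(left, 1 - right, self.c[i])
                if new != self.c[i]:
                    self.c[i] = new
                    changed = True

    def send(self):
        """Producer event: a transition on the request input."""
        self.req_in ^= 1
        self.settle()

    def acknowledge(self):
        """Consumer event: a transition on the acknowledge input."""
        self.ack_in ^= 1
        self.settle()

    def occupancy(self):
        """A stage holds data when its request has not yet been acknowledged."""
        outs = self.c + [self.ack_in]
        return sum(outs[i] != outs[i + 1] for i in range(len(self.c)))
```

In this model a three-stage FIFO accepts three requests without any acknowledgement from the consumer; a fourth request stays pending at the input until the consumer removes an item, which illustrates the elastic behaviour described above.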
2.7 Related Work
Udding [165] [166] first proposed a formal definition and classification of delay-
insensitive circuits. Since then much theoretical work has evolved from process
algebra [82], trace theory [41] [140] [141] [172] and Petri nets [31] [101] [116] [178].
Due to the complexity of designing asynchronous systems, many large designs
have been synthesised via compilation tools derived from high-level methods.
These circuits have been shown to be efficient and robust in the design of con-
trol circuitry [36] [74] [106] [177]. At the board level, communication interfaces
such as the VME protocol [68] already make use of asynchrony [99]. However,
these circuits have been considered inadequate for designing datapaths because
the overheads of encoding data, generating completion signals and arbitrating
on buses make them slow and wasteful in area [2] [135].
Nevertheless, a few fully asynchronous microprocessors have been proposed.
Many of these designs have concentrated on specific aspects of self-timing such
as their formal synthesis [37], low power consumption [48], or just the feas-
ibility of implementing conventional microprocessor architectures (with little
emphasis on their performance or efficient operation).
The first asynchronous VLSI processor was built by Martin [110] at the
California Institute of Technology. The goal was to demonstrate that complex
circuits could be generated from specifications using a library of self-timed
elements. The Amulet project [56] [137] at Manchester University investigated
the application of asynchronous micropipeline techniques to the commercial
low-power ARM microprocessor. The NSR processor [18] built at the University of
Utah is a general-purpose processor built from Actel FPGAs. In addition to
being internally self-timed, its units are decoupled through self-timed FIFO
queues, which allows a high degree of overlap in instruction execution.
Other processors which are still in their design stages (or have yet to be built)
include: SCALP [49] and Hades [47], which are superscalar designs; TITAC,
which is a simple 8-bit processor built using CMOS gate array technology [129];
the ECSTAC [123] processor which uses an 8-bit architecture and a two-phase
communication strategy; and STRiP which, although it is called “self-timed”, is
in fact a synchronous processor which can dynamically alter its clock period [39].
Although these designs are based on a single micropipeline-style datapath [93]
[158], viewing the datapath as a linear sequence of stages may not be very effi-
cient for reasons elaborated in the following section. A couple of designs have
begun to investigate the influence self-timing has on processor architectures.
Sproull et al. at Sun Microsystems have recently proposed a novel architecture
called the Counterflow Pipeline Processor Architecture [157], which
derives its name from the fact that instructions and results flow in opposite
directions in a pipeline and interact as they pass (similar to a 1-D systolic ar-
ray). It supports a form of register renaming, data forwarding, and speculative
execution across control flow changes. The performance of such an architec-
ture is still unknown [152]. Fred [142] is a decoupled, pipelined architecture
which supports dynamic instruction re-ordering and out-of-order instruction
completion.
2.8 This Thesis
One feature common to all of these processor designs is their view of the
datapath. As with synchronous designs, the datapath is still viewed as a single
linear pipeline. The work described in this thesis differs from them by viewing
a datapath as a network of asynchronously communicating resources through
the generalisation of the micropipeline concept to a network of communicating
pipelines.
2.8.1 Towards Asynchronous Datapaths
The clock period of a synchronous pipeline is determined by the delay of the
slowest stage which takes into account worst-case timings for execution and
propagation. Furthermore, optimal performance for a pipeline is achieved
when all the stages are balanced. This is quite difficult to achieve in practice,
since the stages of a typical pipeline perform different operations, and often
their delays are data-dependent. Figure 2–6(a) illustrates the operation of such
a datapath in which synchronisation overheads have been omitted for the sake
of clarity. This imbalance between stage delays results in idle periods leading
to poor utilisation of the physical resources. Of course, further pipelining of the
slower stages could reduce this at the cost of increased design complexity and
synchronisation overheads.
In contrast, the performance of an asynchronous pipeline is determined by
the actual delays of individual stages (usually the average delays), and over-
heads due to self-timing protocols (which have been omitted in Figure 2–6(b),
again for the sake of clarity). ([54] compares synchronous and asynchronous
pipelines, taking into account their overheads.) This pipeline only exploits tem-
poral parallelism as before, but does so more efficiently. An instruction proceeds
to the next stage once it has completed the current one and the next stage is free.
Although each stage may consist of a number of (different) resources, generally,
only one (or a subset) of them will be active at any time for a given instruction.
The average throughput of any asynchronous pipeline cannot be greater
than the average throughput of the stage with the slowest isolated average
performance [128]. This is only an upper bound and thus may not always be
achieved, especially since once a stage is idle it is no longer able to
maintain its isolated average performance. Idle times, caused by blocking and
starvation, can be reduced by introducing additional buffers between stages
(the number required being closely correlated to the coefficient of variation
of data-dependent delays between the stages [87]). However, this increases
pipeline latency and area costs, possibly resulting in reduced area-time
performance and therefore comparing unfavourably with a synchronous
equivalent. Exploiting spatial parallelism, through the improved utilisation
of resources, not only reduces idle times but may also reduce the number of
buffers required to maintain isolated average performances. Thus, an
implementation technique which is more flexible than a linear pipeline is
required to model datapath behaviour efficiently.
Figure 2–6: Synchronous and asynchronous pipelines. Panel (c): An Asynchronous
"Networked" Datapath, which exploits spatial parallelism as well
Figure 2–6(c) illustrates an asynchronous datapath which exploits spatial
parallelism within some of the stages. (The datapath is no longer modelled as a
true pipeline). Successive instructions which utilise different resources within
a stage are now able to execute concurrently. In the simple example under
consideration in Figure 2–6(c), the execute stage has two concurrent resources.
It is possible for the instructions to share resources in any of the stages. For
example, while an instruction is stalled waiting for an operand on one bus,
another instruction could use the other buses to fetch its operands. The amount
of spatial parallelism which can be exposed in practice is determined by the
relative delays of the functional units in the datapath.
2.9 Micronets
Micropipelines [158] have been used to model linear asynchronous pipelines
such as datapaths [56] [143], and two-dimensional pipeline structures [64].
However, as described earlier, viewing a datapath as a single linear pipeline
does have limitations. A new paradigm called micronets is proposed for the dis-
tribution of control in asynchronous processor architectures. Micronets model
datapaths as a network of communicating functional units which allows the
efficient exploitation of both fine-grained instruction-level parallelism and the
actual execution costs of instructions.
2.9.1 Micronets, Microagents and their Micro-operations
In a synchronous datapath the centralised control forces each instruction to go
through all the stages regardless of its need to do so (in effect a single pipeline).
The cost of execution is determined by the worst-case estimate of the slowest
stage. The same is true of a micropipeline-based datapath [56], except that the
cost is now determined by the actual delay of the slowest stage.
Micronets are effectively a generalisation of Sutherland’s micropipelines.
The components within each of the micropipeline datapath stages are exposed
in the form of fine-grain microagents. The microagents in any “stage” can operate
concurrently, and microagents in the different “stages” communicate with each
other asynchronously. Program instructions only utilise the relevant microa-
gents and for just as long as is necessary. More than one instruction may utilise
the different microagents within a “stage”. Figure 2–7 compares the resource
utilisation in micropipelined and micronet datapaths. In the former, the num-
ber of active instructions is never greater than the number of pipeline stages,
and at any time only a subset of the resources in each of the stages is normally
utilised. In micronets, the number of instructions which may be active at any
time is bounded by the number of microagents. An instruction which does not
require any of the resources within a “stage” can skip it. Furthermore, the time
spent by instructions in microagents may vary. For these reasons, computations
may overtake one another. In this way, micronets differ from 2-D micropipelines [64]
which represent asynchronous regular arrays. This feature will be exploited to
implement out-of-order instruction completion. (Note also that a microagent
itself can consist of a number of (micro)pipeline stages).
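The way in which computations may overtake can be illustrated with a toy timing model. The sketch below is not the micronet model itself: it assumes that each instruction visits only the microagents it needs, that a microagent serves one instruction at a time, and that data dependencies, handshaking overheads and blocking are ignored. The microagent names and delays are hypothetical.

```python
def earliest_slot(busy, ready, delay):
    """Earliest start >= ready such that [start, start + delay) avoids busy intervals."""
    start = ready
    for b_start, b_end in sorted(busy):
        if start + delay <= b_start:
            break                # the operation fits before this busy interval
        if start < b_end:
            start = b_end        # otherwise wait until the microagent is released
    return start


def run_micronet(program):
    """program: list of (name, [(microagent, delay), ...]) in issue order."""
    busy = {}                    # microagent -> list of (start, end) busy intervals
    completion = {}
    for name, path in program:
        t = 0
        for agent, delay in path:
            intervals = busy.setdefault(agent, [])
            start = earliest_slot(intervals, t, delay)
            intervals.append((start, start + delay))
            t = start + delay
        completion[name] = t
    return completion


program = [
    ("I1", [("fetch", 1), ("multiplier", 6), ("writeback", 1)]),
    ("I2", [("fetch", 1), ("alu", 2), ("writeback", 1)]),
]
completion = run_micronet(program)
```

In this example I2 is issued after I1 but uses a different microagent in the "execute stage", so it completes at time 5 while I1 completes at time 8. In a linear pipeline I2 would have had to wait behind I1.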
Figure 2–7(b) shows an instruction (I1) executing concurrently with a
succeeding instruction (I2) in what would have been the same stage in a
synchronous pipeline. Because there are effectively a number of paths,
different instructions need not necessarily complete in the order they were
initiated. Also, the micronet is controlled at two levels: the data transfer
between the microagents is controlled locally, whereas the choice of
micro-operations within the microagent and the destinations of the results are
controlled by the control unit or by other microagents (see I4 and I5 in the
figure). Communication between microagents may occur either across dedicated
lines or via shared buses. The micro-operation control signals can also be used
to prevent contention on shared buses. There are no specific restrictions on
the choice of handshake protocol employed at the different control levels.
However, in practice, such a choice is influenced by performance and area
considerations.
Figure 2–7: Contrasting a micropipeline with a micronet. (a) Typical resource
utilisation in a pipeline. (b) Snapshot of typical resource utilisation in a
micronet.
2.9.2 Micronet-based Datapaths
The micronet control paradigm is investigated in the context of a Reduced
Instruction Set (RISC) architecture. Self-timed circuits are used to distribute
processor control away from a centralised Control Unit (CU) (found in conven-
tional synchronous processors) to autonomous functional units. This distribu-
tion of control locally to functional modules affords greater scope for exploiting
concurrency between instructions.
Data dependencies within synchronous datapaths are resolved by using
either a hardware or a software interlock [70], which adds to their control com-
plexity. A micronet datapath uses existing handshaking mechanisms together
with simple locking of registers to achieve the same effect with trivial hard-
ware overheads. In synchronous designs the structural hazards are normally
avoided in hardware by using a scoreboarding mechanism. In micronets this
is provided by existing handshaking protocols. The choice of a four-phase
communication protocol [150] between the functional units allows the effi-
cient utilisation of these resources, by avoiding the additional control costs
(scoreboarding and hazard avoidance mechanisms) normally associated with
processors which exploit ILP. (This choice and its justification are discussed
in greater detail in Chapter 5.) Out-of-order instruction completion can be
supported in synchronous designs, but at a non-trivial cost. Micronets are able to relax
the strict ordering of instruction completions and thereby exploit further ILP.
A Micronet-based Asynchronous Processor (MAP) design has the advantage of
exploiting the best-case delay (behaviour), whereas synchronous solutions are
adjusted a priori to the worst-case. The result is an increase in the
utilisation of the functional units by reducing their stalls. By exploiting
both ILP and actual run-times of instructions, better program performances may
be achievable by asynchronous processors.
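The idea of resolving data dependencies by locking registers can be sketched as follows. This is one possible reading of the mechanism, written under stated assumptions: the class and method names are hypothetical, and a read that returns None stands in for a read handshake that is simply not acknowledged until the register is unlocked.

```python
class RegisterFile:
    def __init__(self, n_registers):
        self.values = [0] * n_registers
        self.locked = [False] * n_registers

    def lock(self, r):
        """Reserve the destination register when an instruction is issued."""
        self.locked[r] = True

    def try_read(self, r):
        """Return the value, or None while a pending write holds the lock."""
        return None if self.locked[r] else self.values[r]

    def write_back(self, r, value):
        """Store the result and release the lock, letting stalled readers proceed."""
        self.values[r] = value
        self.locked[r] = False


rf = RegisterFile(8)
rf.lock(3)                    # an instruction writing r3 has been issued
stalled = rf.try_read(3)      # a dependent instruction cannot read r3 yet
rf.write_back(3, 42)          # the producer completes
value = rf.try_read(3)        # the dependent read now succeeds
```

The point of the sketch is that no separate interlock logic compares instruction fields: the dependent instruction is held up only because its operand request remains unanswered.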
2.10 Summary
There has been renewed interest in asynchronous circuits, especially in a restric-
ted form known as self-timed circuits [150]. These circuits have a number of
advantages [106], including their automatic synthesis from specifications [66].
While this has resulted in provably-correct circuit designs, the performance of
the resulting processor architectures has been largely overlooked.
A few processors have been proposed [56] [123] [143] which utilise
asynchrony at the circuit level and exploit average-case behaviour for better
performance. However, in the only comparison of an asynchronous processor
with its synchronous equivalent, results showed the synchronous version to be
faster and smaller, and at the same time to consume less power [56]. One reason could
be that the chosen architecture itself is better suited to a synchronous control
paradigm. This is emphasised by the fact that the next design will include archi-
tectural modifications [55] (rendering a comparison to the original synchronous
version unfair). This underlines the fact that the design of a processor must
consider the relationship between different aspects of the system.
A new model called the micronet has been proposed for modelling asynchronous
datapaths, which efficiently exploits actual instruction execution times and
instruction-level parallelism. Micronets model processor architectures as a
network of communicating resources, in contrast to the traditional one of a
linear pipeline. Micronets distribute the control to the functional units,
which enables the exploitation of fine-grain concurrency between instructions.
It will be shown that the overheads due to asynchrony can be hidden with the
four-phase protocol being used to implement scoreboarding and hazard avoidance
mechanisms, without incurring additional control costs. Although improvements
may be obtained in one area of the system design, it is imperative that this is not
at the expense of performance in another, thus having an overall negative effect
on the system. Therefore, the following chapters examine the influence of this
novel asynchronous control paradigm on the design of processor architectures.
In particular, the instruction latencies and resource utilisation in a micronet ar-
chitecture will be investigated together with the compiler’s ability to schedule
code for this target.
Chapter 3
A Parallel Event-Driven Simulator
“Both users and designers of computer systems are interested in perform-
ance evaluation since their goal is to obtain or provide the highest perform-
ance at the lowest cost.” [80]
3.1 Introduction
The dynamic behaviour of asynchronous systems is difficult to model analyt-
ically for making accurate performance predictions. The approach adopted in
this work has been to simulate register-transfer-level (RTL) models augmented
with timing information obtained from SPICE simulations of their circuit implementations.
This chapter describes the development and implementation of an asynchron-
ous parallel event-driven simulation platform for the performance evaluation
of both synchronous and asynchronous processor architectures and systems.
One objective was to develop a simulator for obtaining performance figures for
the execution of algorithms under different scheduling or placement strategies,
on different (multi)processor architectures and interconnection topologies. In
particular, this would include the measurement of the performance over time
of an ensemble of heterogeneous functional units which operate concurrently
and communicate with each other asynchronously. This tool is the workbench
for the work described in this thesis.
3.2 Parallel Discrete Event-driven Simulation
Logic simulation is a common and effective technique for investigating the beha-
viour of computer designs. However, accurate simulations of large designs can
be extremely time-consuming. By executing them on parallel architectures, Par-
allel Discrete Event-driven Simulation (PDES) attempts to address this problem
by exploiting the structural concurrency inherent in the applications.
A Parallel Event-driven Processor Simulation Environment (PEPSÉ – pro-
nounced in the same way as the well known fizzy drink) has been developed
based on the ELSA algorithm [8]. PEPSÉ provides a framework for efficiently
evaluating the performances of both sequential and parallel architectures. The
architectural components may be modelled either uniformly at one of the dif-
ferent levels of abstraction, or the components can be modelled individually
at different levels. One could, for example, examine the performance of cache
coherence protocols in shared memory MIMD machines, communication protocols
for local area networks, the effects of topology in distributed memory MIMD
machines, or resource hot spots within processor designs, to name just a few. For our
purposes, architectures are modelled at the register-transfer level with accurate
timing delays of the functional units being provided by SPICE simulations of
their VLSI implementation.
The current implementation of PEPSÉ runs on a network of transputers
called the MEiKO Computing Surface [79]. The architectures are modelled
in the programming language Occam2 [78]. (Occam has long been used to
specify the behaviour of circuits [103] [105].) A system can be described as a
collection of concurrent processes which communicate with each other asyn-
chronously through channels. The semantics of Occam2 captures the behaviour
of asynchronous circuits naturally [161]. The asynchronous nature of the
underlying simulation algorithm makes it efficient for simulating the class of
architectures under investigation (compared to time-driven simulations). For
typical sizes of system-under-simulation (s-u-s), simulation runs could take of
the order of a few hours on a uniprocessor. PEPSÉ exploits the structural
concurrency in the s-u-s to reduce these run times considerably.
3.3 An Overview of PEPSÉ
Figure 3–1: Overview of the simulator
Algorithm in HLL – This is the application program/software which is to be
executed on the simulated architecture. The application program is usu-
ally in the format of a high level language, and will need to be “compiled”
into a format which is suitable for the particular architecture upon which
it is to be executed. This “compiled” format is called the Architecture
Specific Code.
Algorithm Model – This is the Architecture Specific Code (ASC) of the applic-
ation program. The ASC contains instructions specific to the processor
or architecture upon which the application is to be simulated. Whether
the ASC is equivalent to assembler, machine code or some other inter-
mediate code depends on the level at which the processor is modelled.
For example, for register-transfer level models, the ASC would normally
be in the form of assembler instructions from the processor’s instruction
set. Since we are interested in the performance of an algorithm on a
given architecture, it would also be necessary to take account of compiler
characteristics.
Placement and Scheduling – This is the strategy for distributing the ASC over
the processor architecture, and determining how it is to be scheduled.
(Currently this task is achieved manually.)
Architectural Description – This consists of two groups:
1. The Architectural Components which include: processors (which
consist of two objects, an instruction fetch object and an instruc-
tion execute object, for modelling SIMD architectures or instruction
prefetch mechanisms), synchronous processors with clock speed as
a parameter; memories or caches whose parameters include size,
access time and initial contents; and application specific hardware
which includes components from logic gates to application-specific
integrated circuits (ASICs).
2. The Interconnection Network which describes the communication
between the architectural components. Direct or point-to-point con-
nections between two objects to model simplex communication can
be achieved using the Occam2 communication channels. Shared
connections, such as a bus, need to be modelled by a simulation ob-
ject. These objects have both a propagation delay and the number of
components which share the bus as parameters. Half duplex com-
munication can be modelled as a bus with two ports, and full duplex
communication as two simplex ones.
3.3.1 The Simulation Platform
The simulation platform is based on the ELSA algorithm. In ELSA, logical processes
have their own local simulation clock and communicate with other processes via
time-stamped (duration bounded) messages. Each logical process or simulation
object consists of two components, firstly a behavioural model of the object
which evaluates the physical process’ operation based on the value of its inputs
at the current simulation time and secondly, a mechanism to control the local
simulation clock and time-stamping of output messages. This mechanism uses
the delay associated with the particular operation to generate the correct time-
stamped output. The simulation proceeds asynchronously, with each logical
process passing state information in the form of tuples via their simulation





Figure 3–2: The simulation platform. (Object A and Object B are connected by a
channel, which carries tuples of information (start time, end time, state values)
between objects.)
Each tuple of information contains:
1. a set of state values, and
2. a start time and an end time which defines the interval for which these state
values are valid.
Note that a tuple containing a start time equal to the end time conveys no useful
information and that all tuples on each channel must represent contiguous
periods of time.
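The tuple discipline described above can be illustrated by a minimal sketch, written in Python rather than the Occam2 of the actual implementation. The names `SimTuple` and `is_contiguous` are hypothetical and introduced only for illustration, and the treatment of each interval as half-open is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class SimTuple:
    """One message on a channel: state values valid from start to end."""
    start: int
    end: int
    states: Tuple[int, ...]


def is_contiguous(stream: List[SimTuple]) -> bool:
    """Check the two constraints placed on the tuples of one channel.

    Every tuple must cover a non-empty interval (start < end), and each
    tuple must begin exactly where its predecessor ended.
    """
    for prev, cur in zip(stream, stream[1:]):
        if cur.start != prev.end:
            return False
    return all(t.start < t.end for t in stream)
```

A stream such as `[SimTuple(0, 5, (1,)), SimTuple(5, 9, (0,))]` satisfies both constraints, whereas a gap between consecutive tuples or a tuple whose start time equals its end time does not.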
3.3.2 The Basic Simulation Platform Algorithm
The following steps outline the basic simulation platform algorithm:
Algorithm 1: Basic Simulation Platform
1 Initialisation of variables and flags.
2 Clear input buffers, set input start and end times = 0,
  and place initial output values in output buffers.
3 Send the initial output tuples out on their respective output channels.
  Set the object’s current simulated time = 0.
4 If necessary, get required tuples from each input channel.
  while (current simulation time ≥ tuple’s end time) get the next tuple.
5 Evaluate the function (output state values)
  based on current inputs using the behavioural description.
6 Calculate the start time of all of the output tuples.
  start time = current simulation time + object delay time δ
7 Calculate the end time of all of the output tuples.
  end time of each output tuple = MIN(end time of all input tuples) + δ
8 Send the output tuples which are still within the simulation window,
  i.e. tuples which have start time < “Stop Simulating” time.
9 Update simulation time.
  current simulation time = MIN(end time) of all the input tuples
10 if not finished simulating, i.e. still within the simulation window,
   then goto step 4.
11 Sink outstanding tuples, i.e. those tuples which have start times that
   are outside the simulation window.
Steps 1 and 2 are initialisation stages, with Step 3 sending the initial tuples
defining the output states for the period (0, δ), where δ is the object delay
time, at the start of the simulation.
The only inputs which can affect the state values during the period (0, δ) are
those with start times < 0, which obviously do not exist. Steps 4 to 10 constitute
the main loop where each time a new tuple(s) is required to advance the object
simulation time, a re-evaluation of the output states takes place. Step 6 evaluates
the output start time, which is how far into the simulation the object has
progressed plus δ, a delay for the generation of the output state values. Step
7 determines the output end times which are set to the time at which the next
“event” occurs, which will be at the earliest end time of all of the input tuples,
plus δ, the same delay for the generation of output state values. At Step 9, the
current simulation time of the object is advanced to the time at which the next
“event” occurs. This means a new tuple(s) will be required and therefore a re-
evaluation of the output values. Step 11 is more an implementation requirement
to guarantee that all objects will complete executing and terminate.
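The main loop of the basic platform can be sketched as follows for a single object with one output channel. This is an illustrative Python rendering and not the Occam2 implementation: the function name `run_basic_object`, the use of pre-recorded lists in place of channels, and the reading of the comparison in Step 4 as "greater than or equal to" are all assumptions. Each input stream is assumed to extend at least as far as the stop time, so Step 11 (sinking outstanding tuples) is not needed here.

```python
from typing import Callable, List, Sequence, Tuple

# (start time, end time, state values)
SimTuple = Tuple[int, int, Tuple[int, ...]]


def run_basic_object(
    inputs: Sequence[List[SimTuple]],
    behaviour: Callable[[List[Tuple[int, ...]]], Tuple[int, ...]],
    delay: int,
    stop_time: int,
    initial_output: Tuple[int, ...],
) -> List[SimTuple]:
    """Return the tuples sent on the output channel of one object."""
    # Steps 1-3: the initial tuple defines the output for (0, delay).
    sent: List[SimTuple] = [(0, delay, initial_output)]
    now = 0
    position = [-1] * len(inputs)
    current: List[SimTuple] = [(0, 0, ())] * len(inputs)
    while now < stop_time:                      # Step 10
        # Step 4: advance each input until its tuple covers 'now'.
        for i, stream in enumerate(inputs):
            while now >= current[i][1]:
                position[i] += 1
                current[i] = stream[position[i]]
        # Step 5: evaluate the behavioural description.
        states = behaviour([t[2] for t in current])
        # Steps 6 and 7: time-stamp the output tuple.
        earliest_end = min(t[1] for t in current)
        start = now + delay
        end = earliest_end + delay
        # Step 8: only send tuples inside the simulation window.
        if start < stop_time:
            sent.append((start, end, states))
        # Step 9: advance to the next "event".
        now = earliest_end
    return sent


def and_gate(states: List[Tuple[int, ...]]) -> Tuple[int, ...]:
    """Behavioural description of a two-input AND gate."""
    return (states[0][0] & states[1][0],)


a = [(0, 4, (1,)), (4, 10, (0,))]
b = [(0, 6, (1,)), (6, 10, (1,))]
out = run_basic_object([a, b], and_gate, delay=2, stop_time=10,
                       initial_output=(0,))
```

In this example the object iterates three times (at simulation times 0, 4 and 6), so together with the initial tuple it emits four contiguous output tuples.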
The propagation delays over dedicated wires (one-to-one connections) are
modelled by incorporating them directly into the source object, and delays on
shared wires are modelled as separate resources. If necessary, the simulation
platform for these resources can easily be made to detect instances of contention.
3.3.3 The Class Models
Using the basic simulation platform together with its behavioural description is
sufficient to allow the simulation of an object, if the output state(s) of that object
are a function of only the current input(s), as in the case of simple logic gates:
output states(time + δ) = f(input states(time))
where δ is the object delay time.
Since the simulation occurs at the instruction/register transfer level, most ob-
jects have more complex behaviours such as state machines. This means that
the output states are a function of both the input states and some internal state
of the object:
output states(time + δ) = f(input states(time) + internal state(time))
This means it is necessary to modify the basic simulation platform. Another
reason for modifying the simulation platform of some objects is related to performance.
In order to achieve good performance on parallel systems it is necessary
to keep inter-processor communication to a minimum.
Clocked Objects
State machines, registers, synchronous processors etc., all require some sort of
“clock” or latch signal. These objects are generally sensitive to the values of
their input signals only at a transition of (or when a certain value occurs on) one of the
inputs, i.e. the clock. If an object has a clock input then the simulation platform
need only evaluate the outputs once every clock period, instead of each time
the object needs a new tuple. In practice, these clock/latch signals can be either:
– regular/periodic or irregular/aperiodic, and
– edge- or level-triggered.
For periodic clock signals, the simulation platform will know when the clock
transitions will occur. For example, if the clock input signal is regular, e.g. from
an oscillator, the clock input signal can be modelled internally within the object.
However for aperiodic clock signals, the simulation platform will have to test
only the state value of the clock input to determine its timing information. An
alternative would be to wait for a transition on the clock input and then allow
the behavioural description to test the clock input along with the other inputs
when evaluating the outputs. Remember, even if outputs do not change it will
still be necessary to send new output tuples to allow the simulation to proceed.
The effect of the clocked inputs on the basic simulation platform is discussed in
the following sections.
Objects with Irregular Clock Signals
This simulation platform need only evaluate the outputs when there is a new
tuple on the clock input, therefore the platform only considers the tuples on the
clock input as new events. This implies that the simulation only uses the clock
input for the generation of timing information. Each iteration of the simulation
loop will require a new clock input tuple with the corresponding tuples of the
other inputs being required to evaluate the output tuples.
The Simulation Algorithm
The steps of the basic algorithm requiring modification are:
4. Get the required tuples.
   On the clock input:
   – if (current simulation time == end time) then get the next tuple.
   For each of the others:
   – while (current simulation time ≥ end time) get the next tuple.
7. Calculate the end time of all of the output tuples:
   end time = clock input’s end time + object delay time δ.
9. Update simulation time:
   current simulation time = clock input’s end time.
Objects with Regular Clock Signals
The simulation platform for an object of this type is a special case of the one
with an irregular clock signal. The simulation advances a fixed (and known)
amount of time, i.e. the clock period, each iteration and therefore there is no
need for a separate clock input. Even if no new input tuples are required or the
input states do not change over a number of iterations, it is still necessary to
re-evaluate the outputs since the timing information will need to be updated,
and being a clocked object, the outputs are likely to be functions of both the
inputs and the internal state of the component.
The Simulation Algorithm
The variable object latency can be used as an offset or time delay before the
periodic clock starts.
3. Send initial tuples:
   current simulation time = object latency.
   Each input tuple’s start time = current simulation time.
4. For each input, make sure the tuple is valid:
   while (current simulation time ≥ end time) get the next tuple.
7. Calculate the end times of all output tuples:
   end time = current simulation time + object delay time δ + clock period.
9. Update simulation time:
   current simulation time = current simulation time + clock period.
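The regular-clock variant can be sketched in the same illustrative style. The Python below is not the Occam2 implementation. The name `run_regular_clocked`, the modelling of the internal state as a single integer, and the choice to let the initial output tuple span the interval up to the first clocked output are assumptions. The start time of each output is taken from the unmodified Step 6 of the basic algorithm.

```python
from typing import Callable, List, Sequence, Tuple

# (start time, end time, state values)
SimTuple = Tuple[int, int, Tuple[int, ...]]


def run_regular_clocked(
    inputs: Sequence[List[SimTuple]],
    step: Callable[[List[Tuple[int, ...]], int], Tuple[Tuple[int, ...], int]],
    delay: int,
    period: int,
    latency: int,
    stop_time: int,
    state: int,
) -> List[SimTuple]:
    """Return the output tuples of an object with an internal periodic clock."""
    now = latency                                # modified Step 3
    # Assumption: the initial tuple covers the time before the first output.
    sent: List[SimTuple] = [(0, now + delay, (state,))]
    position = [0] * len(inputs)
    while now < stop_time:
        # Modified Step 4: make sure each input tuple is valid at 'now'.
        for i, stream in enumerate(inputs):
            while now >= stream[position[i]][1]:
                position[i] += 1
        sampled = [stream[position[i]][2] for i, stream in enumerate(inputs)]
        # Outputs depend on both the inputs and the internal state.
        outputs, state = step(sampled, state)
        start = now + delay                      # Step 6 (unchanged)
        end = now + delay + period               # modified Step 7
        if start < stop_time:
            sent.append((start, end, outputs))
        now += period                            # modified Step 9
    return sent


def counter(sampled: List[Tuple[int, ...]], state: int):
    """A counter that increments on each clock while its enable input is 1."""
    new_state = state + sampled[0][0]
    return (new_state,), new_state


enable = [(0, 25, (1,)), (25, 100, (0,))]
clocked_out = run_regular_clocked([enable], counter, delay=1, period=10,
                                  latency=0, stop_time=40, state=0)
```

With a simulation window of 40 time units and a clock period of 10, the loop produces four clocked output tuples regardless of how many tuples arrive on the enable input.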
Level-triggered Clock Signals
A level-triggered clock input is treated just as another input since this input will
only have a boolean effect on the other inputs. Therefore, the basic simulation
platform would suffice. However, employing the simulation algorithm used
for irregular clocked objects may generate fewer output tuples.
3.4 Development Notes
The PEPSÉ simulation platform was implemented in Occam2 [78]. Occam2 sup-
ports concurrent threads of execution (processes) and uses unbuffered channels
to provide synchronisation and communication between processes. However,
since synchronisation is not required, the channels are buffered to avoid dead-
lock, decouple logical processes (thus increasing concurrency), and reduce mes-
sage traffic by merging tuples.
3.4.1 Occam Buffers
Avoiding Deadlock
Deadlock will occur if a cyclic relationship exists between a group of objects.
Initially, all of the objects will attempt to place their initial tuple on their output
channels, including the recipient objects, and will therefore be unable to receive
the incoming tuple. Buffers have been inserted to overcome this communication
constraint within Occam2. These buffers simply receive tuples, releasing the
sending object, and pass them on to the receiving object when it is ready, thus
having the effect of unblocking the objects not only at their initial transmission
but also at any time two objects attempt to send tuples simultaneously to each
other.
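The effect of the buffers on a cyclic pair of objects can be illustrated with a small Python sketch, in which the standard library's unbounded `queue.Queue` stands in for the Occam2 buffer process and threads stand in for the simulation objects. The names `sim_object` and `run_cycle` are hypothetical. With a synchronous (rendezvous) channel both objects would block on their initial send; with a buffer in each direction both sends return immediately.

```python
import queue
import threading
from typing import Dict, Tuple

Message = Tuple[int, int, Tuple[str, ...]]


def sim_object(name: str, out_ch: "queue.Queue[Message]",
               in_ch: "queue.Queue[Message]",
               received: Dict[str, Message]) -> None:
    """Send an initial tuple first, and only then receive from the peer."""
    out_ch.put((0, 1, (name,)))               # does not block: buffered
    received[name] = in_ch.get(timeout=1.0)   # peer's initial tuple


def run_cycle() -> Dict[str, Message]:
    """Connect objects A and B in a cycle through two buffers."""
    a_to_b: "queue.Queue[Message]" = queue.Queue()
    b_to_a: "queue.Queue[Message]" = queue.Queue()
    received: Dict[str, Message] = {}
    threads = [
        threading.Thread(target=sim_object,
                         args=("A", a_to_b, b_to_a, received)),
        threading.Thread(target=sim_object,
                         args=("B", b_to_a, a_to_b, received)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=2.0)
    return received
```

Calling `run_cycle()` completes with each object holding the other's initial tuple, which is the behaviour the inserted buffers are intended to guarantee.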
Maintaining Asynchrony
The use of buffers also allows objects which can proceed “quicker” into the
simulation not to be held up by “slower” ones. Buffers can queue tuples, thus
allowing the sender object to proceed, by removing the synchronisation between
sending and receiving objects. However, one factor which has been observed
while simulating at the register/instruction level is that, in general, there is
a tight cyclic relationship between some pairs of objects, especially self-timed
components. If object A has an output channel to object B, object B is quite
likely to have an output channel to object A. In this case, there seems to be
only one outstanding tuple in the queue, and this occurs since both objects are
progressing at about the same rate.
Aliasing Variable Names
In order to generalise the simulation platform, since the number of inputs and
outputs to a particular object varies, the simulation platform takes an array of
inputs and an array of outputs. The buffers allow the aliasing of these array
variable names from the output of one object to the input of another object.
Numbers of Tuples on a Channel
With the basic simulation platform, the total number of tuples on each out-
put channel (one tuple per channel per iteration of the simulation loop) can
be bounded below by the largest number of tuples on any of the inputs and
bounded above by the sum of all the tuples on each of the inputs.
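These bounds can be checked with a short Python sketch. It assumes that each iteration of the basic loop advances to the next distinct end time found on any input, so that the number of iterations equals the number of distinct input end times. The initial tuple and the effect of the simulation window are ignored. The function name `output_tuple_count` is hypothetical.

```python
from typing import List, Sequence


def output_tuple_count(input_end_times: Sequence[List[int]]) -> int:
    """Number of loop iterations, i.e. distinct end times over all inputs."""
    distinct = set()
    for ends in input_end_times:
        distinct.update(ends)
    return len(distinct)


# End times of the tuples on two input channels of one object.
ends_a = [3, 10]
ends_b = [5, 10]
count = output_tuple_count([ends_a, ends_b])
lower = max(len(ends_a), len(ends_b))
upper = len(ends_a) + len(ends_b)
```

Here the two inputs carry two tuples each and share only their final end time, so the object emits three tuples per output channel, which lies between the lower bound of two and the upper bound of four.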
With regular clocked objects the number of tuples on each output will be the
simulation duration divided by the clock period. Also, with irregular clocked
objects, the number of tuples on each output will be equal to the number of
tuples on the clock input. Thus, clocked objects prevent the avalanching effect
on tuple numbers. Furthermore, the buffers can be used to merge consecutive







Figure 3–3: A microprocessor model
In some cases, it is not necessary to send tuples on all of the outputs at every iter-
ation. For example, when the fetch unit passes a load instruction to the execute
unit, it will take one iteration of the execute unit to interpret the instruction and
initiate a read from memory, and one further iteration to execute the instruction.
This implies that, after the first iteration, the execute unit object sends a tuple
to:
– memory, a read access request,
– the instruction processor, a null tuple or unexecuted instruction message.
After the second iteration, the object sends a tuple to:
– memory, a null or no access request message,
– the instruction processor, the updated register values.
In order to execute one instruction, the execute unit object had to send two tuples
on each of its output channels, of which only one conveyed useful information.
Guarded outputs are boolean flags which inhibit or allow the transmission of a
tuple on a particular output channel. By applying guarded outputs to the above
example, the data processor object would not send a tuple to the instruction
processor after the first iteration, and to the memory after the second.
Thus, the use of guarded outputs can achieve a significant reduction in the
number of tuples used, without losing the modularity between timing inform-
ation generated in the simulation platform and the state information generated
by the behavioural description.
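The load example can be sketched in Python as follows. The function `send_guarded` and the channel names are hypothetical, messages are represented by descriptive strings, and the time-stamping of tuples on a guarded channel is deliberately left out since it is handled by the simulation platform rather than the behavioural description.

```python
from typing import Dict, List


def send_guarded(outputs: Dict[str, str], guards: Dict[str, bool],
                 channels: Dict[str, List[str]]) -> int:
    """Send each output only if its guard is set; return the number sent."""
    sent = 0
    for name, message in outputs.items():
        if guards[name]:
            channels[name].append(message)
            sent += 1
    return sent


channels: Dict[str, List[str]] = {"memory": [], "instruction": []}

# First iteration of a load: only the memory read request is useful.
total = send_guarded(
    {"memory": "read access request", "instruction": "null"},
    {"memory": True, "instruction": False},
    channels,
)
# Second iteration: only the updated register values are useful.
total += send_guarded(
    {"memory": "null", "instruction": "updated register values"},
    {"memory": False, "instruction": True},
    channels,
)
```

Two tuples are sent in total rather than the four required without guards, one on each output channel.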
3.4.3 Modelling Signals
The transfer of state information takes place via tuples which are represented as
a variable length array of integers. Each tuple has a number of flags associated
with it, these being index values within the array:
elsa.tup.len – is the pointer to the tuple length (index value “0”),
elsa.start.time – is the pointer to the time from when the states are valid (index
value “1”),
elsa.end.time – is the pointer to the time until when the states are valid (index
value “2”),
elsa.state – is the pointer to the first state value (index value “3”). The number
of states within a tuple can be determined by
number of states = tuple[elsa.tup.len] - 3
The use of variable length arrays allows the ability to incorporate a number of
states into one tuple and thus reduce the number of communication channels
between two objects in any one direction to 1. This will always be true unless
the start- and end-times of particular states need to be different, in which case
another channel and separate tuples would be required.
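The array layout described above can be mirrored in a brief Python sketch. The constants correspond to the four index values given in the text, while the helper names `make_tuple`, `number_of_states` and `states_of` are hypothetical.

```python
from typing import List, Sequence

ELSA_TUP_LEN = 0      # index of the tuple length
ELSA_START_TIME = 1   # index of the time from when the states are valid
ELSA_END_TIME = 2     # index of the time until when the states are valid
ELSA_STATE = 3        # index of the first state value


def make_tuple(start: int, end: int, states: Sequence[int]) -> List[int]:
    """Pack a start time, an end time and any number of states into one array."""
    return [ELSA_STATE + len(states), start, end, *states]


def number_of_states(tup: Sequence[int]) -> int:
    """number of states = tuple[elsa.tup.len] - 3"""
    return tup[ELSA_TUP_LEN] - 3


def states_of(tup: Sequence[int]) -> List[int]:
    """Extract the state values carried by a tuple."""
    return list(tup[ELSA_STATE:ELSA_STATE + number_of_states(tup)])


example = make_tuple(5, 9, [1, 0, 7])
```

The tuple `example` is the array `[6, 5, 9, 1, 0, 7]`, carrying three states which share the same validity interval and can therefore travel on a single channel.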
3.5 Component Delays
The fidelity of the simulation results is determined by the accuracy of the simu-
lation model. The models used in this thesis have been validated via a combin-
ation of HSPICE simulations and analytical analysis. Each of the architectural
components used in the designs of Chapter 4 has been modelled as an individual
simulation object based on a 1.2 µm CMOS process implementation
of off-the-shelf/standard library components. Individual component delays
have been extracted from a simulation tool within ES2’s commercially avail-
able silicon compilation integrated circuit design suite SOLO 1400 [50]. (ES2
claim to guarantee circuits designed using these tools will be fully functional
on first silicon). In the synchronous design, component delays were based on
worst-case timings, (including component operation e.g. propagating a carry
the entire length of the adder’s carry chain) and nominal/typical timing delays
(average component operation delays) in the self-timed case. Unfortunately,
these designs were not laid out completely since this tool was not suited to
custom datapath design, and thus full account of propagation delays was not
taken. The designs described in Chapter 5 were based on the EUROCHIP
0.7 µm CMOS process obtained from the CADENCE design suite. These tools
are better suited to datapath design (giving the designer more control over lay-
out) thus the HSPICE simulations give a more accurate account of both relative
component and propagation delays.
3.6 Conclusions
PEPSÉ provides an efficient framework for obtaining accurate performance
figures for the execution of small programs on the simulated architectures. By
allowing mixed-level simulations the run-time costs can be further reduced
without sacrificing accuracy.
The approach adopted here is well suited to the simulation of asynchronous
circuits due to the asynchronous nature of the underlying algorithm itself. This
algorithm is inherently deadlock free and, in its conservative form, never viol-
ates the causality principle which means that an expensive roll-back mechanism
is not required [8].
Chapter 4
The Control Paradigm and the
Instruction Set
4.1 Introduction
In general, improvements in the performance of processor architectures can be
achieved in two ways: reducing the time taken to complete a unit of work
(i.e., reduce the latency of the operation) or by increasing the amount of work
achieved per unit time (i.e., increase the concurrency between operations). This
chapter focuses on the former by comparing an asynchronous control paradigm,
where the datapath control is distributed and functional blocks communicate
using handshaking protocols, with the traditional synchronous style. Specific-
ally, this work attempts to investigate if any performance improvements in the
execution times of individual instructions can be obtained within a typical RISC
datapath implemented as a micronet.
Although asynchronous control of datapaths had previously been considered
too expensive [2] [135], other work has suggested that the opportunity for
improved performance does exist [38] [63]. This chapter investigates the ap-
plication of the asynchronous control paradigm to a variable length pipelined
datapath and compares the effect of the two design styles, synchronous and
self-timed, on the performance of a RISC microprocessor architecture. It will
be shown that a micronet-based datapath can enhance the performance of a
microprocessor architecture.
4.2 Comparing Synchronous and Asynchronous Pro-
cessor Control
The basis for comparison of the two design styles is a simple two-stage pipelined
RISC architecture with a simplified instruction set. The justification for the sim-
plicity of the pipeline is the following: isolating the effect of the control paradigm
on the datapath is best realised by keeping the latter simple (although the ex-
ploitation of pipelining in a micronet processor is discussed in the following
chapter); in fact further pipelining interferes with the comparison of datapath
latencies in the two designs.
The RISC philosophy of simple control, regular and predictable behaviour
and efficient silicon utilisation has been considered ideal for a synchronous control
paradigm. The current trend of commercial processors with high frequency
clocks is very much in this mould. However, it is difficult to define or find an
ideal synchronous processor design since the design itself is inextricably linked
with actual implementation delays. An asynchronous control paradigm would
be equally applicable to CISC or RISC, however a RISC-style architecture with
a simplified instruction set was chosen because of a shorter design time, sim-
pler data paths, and with the corresponding decode/control being hardwired,
avoiding any extra level of macro-to-microinstruction translation. This makes
it easier to investigate the interactions between the control paradigm and the
architecture. (It should be noted that an asynchronously-controlled architecture
loses some RISC features e.g. fixed instruction execution times).














Figure 4–1: The processor pipeline
The two stages, fetch and execute as shown in Figure 4–1, carry out the
usual processor operations. The fetch cycle involves fetching an instruction
or an offset value, and incrementing the program counter; the execute cycle
involves sequencing data movement and controlling the functional units within
the datapath. Thus, the architecture retains the basic RISC features and is a good
starting point from which to develop and investigate the suitability of the self-
timed paradigm to more complex pipelined processors.
4.2.1 The Two Processor Models
The two processor designs, as illustrated in Figure 4–2, share almost the same
functional units and only differ in the design style used to implement their
control sequencing. In the synchronous microprocessor, the control sequencing
is centralised in the Control Unit (CU). This unit generates signals for each of the
datapath resources (i.e. Fetch Unit, ALU, the Registers, PC Unit and Memory
Unit), to control the complete execution of an instruction. In contrast, the control
sequencing is decentralised in the asynchronous microprocessor. The CU initiates
a sequence of actions, and in most cases will no longer take any further part. The
respective functional units and their interfaces communicate with each other to
complete the task. This reduced complexity of the CU is achieved through
the distribution of control by the micronet and the asynchronous mechanisms
outlined in Chapter 2. This work naturally extends the theme of early RISC
architectures where performance improvements are gained by reducing the



























Figure 4–2: The synchronous and self-timed processor models. (Legend: direction
of data transfer, with the Req./Ack. signal in the other direction; control
signals and acknowledge.)
complexity of the pipeline and simplifying the control. Here the control is
simplified even further due to decentralisation.
4.2.2 The Instruction Set
The two designs also share a common instruction set (shown in Table 4–1),
which is based on the design in [110]. In the synchronous design, the execution
time of each type of instruction is fixed, whereas under asynchrony the execution
time of a particular instruction may vary. The different instructions can be
Group  Instruction  Explanation
1      ALU          Rz := Rx ALUop Ry
1      LD           Rz := Mem[Rx+Ry]
1      ST           Mem[Rx+Ry] := Rz
2      LDX          Rz := Mem[Ry+Offset]
2      STX          Mem[Ry+Offset] := Rz
2      LDA          Rz := Ry + Offset
3      STPC         Rz := PC
3      JMP          PC := Ry
4      BRCH         If Cond then PC := PC + Offset
Table 4–1: The instruction set
divided into four categories which highlights the irregular nature of even a
simple processor pipeline:
Group 1 – These instructions do not affect the Program Counter (PC) and are
therefore independent of the fetch stage. The ALU and store (ST) instruc-
tions represent the classic single-cycle RISC instructions, with load (LD)
instructions taking slightly longer.
Group 2 – These instructions use an offset value which requires an additional
fetch from the instruction memory. The current instruction cannot begin
execution until the offset has been fetched and placed in the offset register.
Group 3 – These instructions require or modify the current PC, and the next
fetch cycle is stalled until the current execute stage is completed.
Group 4 – The branch instruction is a combination of groups 2 and 3. A PC
offset is required and the next fetch cycle cannot begin until the current
execution cycle completes, i.e. until the branch condition has been resolved
and the PC contains the correct value.
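The consequences of this grouping for the fetch stage can be summarised in a small Python sketch. The mapping follows Table 4–1, while the predicate names `needs_offset` and `stalls_next_fetch` are hypothetical and serve only to restate the group descriptions above.

```python
# Instruction group for each instruction of Table 4-1.
GROUP = {
    "ALU": 1, "LD": 1, "ST": 1,
    "LDX": 2, "STX": 2, "LDA": 2,
    "STPC": 3, "JMP": 3,
    "BRCH": 4,
}


def needs_offset(instruction: str) -> bool:
    """Groups 2 and 4 require an additional fetch of an offset value."""
    return GROUP[instruction] in (2, 4)


def stalls_next_fetch(instruction: str) -> bool:
    """Groups 3 and 4 hold up the next fetch until execution completes."""
    return GROUP[instruction] in (3, 4)
```

Only the branch instruction satisfies both predicates, which is why it is described as a combination of groups 2 and 3.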
4.2.3 The Architectural Components
Figure 4–2 shows the architectural components implemented in both models.
The common components are:
1. The Instruction and Data Memory/Cache (IM and DM) which store the
program instructions and data, respectively.
2. The Fetch Unit (FU) which fetches instructions from the IM and transfers
offset values to the offset register in the PC Unit.
3. The PC Unit (PCU) which contains an adder to increment the PC, the PC
register and an offset register.
4. The Control Unit (CU) which initiates the necessary micro-operations in
the respective microagents for the given instruction being issued.
5. The Memory Unit (MU) which services the load and store instructions,
generates addresses and accesses the DM. This unit has an adder for
address calculations. (The input operands must be latched in the unit
prior to the unit’s operation).
6. The ALU executes arithmetic and logical instructions. It does not have
registers on its inputs (or outputs) and operates continuously with the
values on its inputs. This allows worst-case operation to complete within
the required time.
7. The Register bank consists of 32 registers, three operand read ports to the
functional units, and one write port for each of the functional units.
8. X and Y are operand fetch buses, ZA and ZM are write-back buses. The
ZM bus is also used as a third operand fetch bus for store instructions.
4.3 The Synchronous Processor
The synchronous model assumes that the control signals are generated exclusively
by the control unit (i.e. the delaying of individual control signals outside
the CU to meet any timing constraint is not permitted), using an input clock
signal as a timing reference. In synchronous design, the clock period is gener-
ally determined by the largest delay in the pipeline. In this example however,
the execute stage delay varies from instruction to instruction, while the delay
of the fetch stage is generally independent of the instruction. Since the latter
is always on the critical path, the clock speed was chosen to exactly match the
worst-case delay of the fetch stage. However, instead of just viewing each stage
as a single cycle, the clock cycle is divided into a number of clock phases (four
in this example) which mimics a higher frequency clock and reduces idle time
by achieving a better approximation to delays. (This allows the modelling of
multi-phase clocking as used in modern synchronous designs to improve the
temporal granularity).
For the purpose of this study, the synchronisation overheads (as discussed
in Chapter 2) are ignored. In practice, they are difficult to estimate as they are
ultimately influenced by the clock frequency, technology, fabrication process,
routing, chip size and environmental variation.
4.3.1 Synchronous Control
On the first clock edge, the CU initiates a fetch instruction request. The FU then
fetches the next instruction from the IM at the location pointed to by the program
counter (PC) which is kept in the PCU, and at the same time, the current PC
value is incremented. The FU forwards the instruction to the CU just in time
for the next clock edge. Now, the CU has the instruction and decoding begins
































Figure 4–3: Synchronous instruction cycles. (Notes: 1 Clock or Clk = 1 clock
phase; 4 Clocks = 1 clock cycle (period). The execute clock phase is required
when the BRANCH is taken, i.e. when the PC is assigned the offset value.)
while the PC is assigned its incremented value. The CU behaves according to
the type of instruction, as shown in Figure 4–3. If an offset is required, then
the execution of the instruction is stalled until the offset has been loaded into
the offset register. If the instruction is a branch instruction, then it is evaluated
while the offset is being fetched. If the branch evaluates to TRUE, then an extra
clock phase is required to assign the new PC value. The execute stage latencies
vary, taking anywhere between four and seven phases in this example, (the
total instruction latencies vary between 8 and 15 clock phases (2 and 4 clock
periods)).
4.4 Asynchronous Control and MAP
A micronet-based asynchronous processor (MAP) architecture does not have a
global clock signal nor centralised control for the transfer of data between archi-
tectural components. Although the processing components (the main functional
units) are considered to be identical in the two designs, additional components
(the communicating microagents (CMs)) effectively allow the functional units
(the functional microagents (FMs)) to locally control data transfer between them-
selves and their neighbours. In order to exploit data dependent or variable
delays, it is assumed that the functional units in the self-timed design can be
modified so that they generate completion signals [38] [169].
4.4.1 The Distribution of Control
There are a number of additional components required in the micronet design,
as shown in Figure 4–2:
– The PFE interface models a combined interface between the CU, Fetch
Unit and PC Unit which aids local control of the fetch pipe. Local control
signals (previously routed via the control unit in the synchronous design)
between the FU and PCU coordinate fetching of an instruction while
concurrently incrementing the PC or transferring an offset to the offset
register (held in the PC Unit).
– Register, ALU, MU and PCU interfaces are found between their functional
units and the buses. These bus interfaces contain the CMs which
are responsible for receiving their FM’s micro-operation control signals
from the CU, returning the corresponding acknowledgement signal, ob-
taining the operand data for that operation and presenting these to the FU,
and if necessary, returning the result of a micro-operation to the correct
destination.
A number of protocols have been proposed for both control and data trans-
fers [111] [150] [174] between microagents. In the absence of a clock, the data
transmissions have to be encoded to enable the receiver to recognise valid
information. Bundled data transfers have been adopted to minimise coding
costs [158]. A four-phase handshaking protocol was adopted for both control
and bundled data transfer. This allows for a simpler design through the use of
various types of Muller C-elements [117] and conventional logic gates. In the
case of control signals, although a four-phase protocol would be considered twice
as expensive compared to a two-phase one, the same efficiency is obtained as
two back-to-back, two-phase handshakes by representing two events in each
cycle. This is also an efficient option for data transfers since they take place
over shared buses, and in any case the second half of the four-phase handshake
occurs concurrently with computations. (These issues will be discussed further
in Chapter 5). Another advantage of using the four-phase protocol is that it
allows components to synchronise phases of an operation, e.g. calculating a
next Program Counter (PC) value while using the current PC register value to
address memory.
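The order of events in one four-phase handshake can be sketched as follows. This is a minimal illustration only: the function name and the trace format are assumptions made here, and the model is purely sequential, whereas the real wires are driven by separate circuits.

```python
def four_phase_handshake():
    """Return the ordered wire transitions of one four-phase handshake."""
    req, ack = 0, 0          # both wires idle low
    trace = []

    req = 1                  # phase 1: sender raises request (bundled data valid)
    trace.append(("req", req))
    ack = 1                  # phase 2: receiver raises acknowledge
    trace.append(("ack", ack))
    req = 0                  # phase 3: sender withdraws request
    trace.append(("req", req))
    ack = 0                  # phase 4: receiver withdraws acknowledge
    trace.append(("ack", ack))

    assert (req, ack) == (0, 0)   # wires are back at zero for the next cycle
    return trace
```

Each cycle contains two acknowledge transitions, which is what allows one handshake to represent two distinct events.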
4.4.2 The Rôle of the Control Unit
The CU is still required to sequence tasks for correct datapath operation. Since
this control sequencing is decentralised in the micronet, the CU just needs to
initiate the sequence of actions, and leaves the respective FUs to communicate
to complete the task.
The CU initially requests the next instruction from the Fetch Unit (FU). The
FU will then fetch an instruction from the IM based on the current value of the
PC, while at the same time signalling the PCU to calculate (increment) the next
PC value. When the FU receives the instruction from the IM, it signals the PCU
to assign the calculated value to the PC register, while at the same time checking
to see if an offset is required for this instruction. If so, the FU will fetch the offset
first, then send the instruction to the CU and pass the new offset value to the
PCU. When the CU receives an instruction from the fetch unit it can initiate the
next instruction fetch if the current instruction does not use or modify the PC.
Instruction decoding identifies which components or FMs are required to
execute the current instruction. The CU communicates with them via the
chosen four-phase asynchronous communication protocol. Each acknowledgement
control signal signifies two events: the first acknowledges the micro-operation
request and the second signals the completion of that micro-operation.
Should any of the required resources not have completed their previous
micro-operation, then the CU must wait until it receives the ‘finished’ signal, i.e. the
previous handshakes have completed. Then the CU can initiate the instruction’s
execution by informing the relevant microagents (by beginning a handshake on
each of the appropriate microagent control signals). Once the CU has received
all of the acknowledgements, then the instruction is considered to have been
issued. The CU resets the control signals (completes its phase of the handshake
protocol) and the instruction issue cycle can begin again. The execution of an
instruction is complete when the corresponding control signals have completed
their handshakes. Although the current instruction execution is overlapped
with the fetching of the next instruction, if the PC unit is involved in the instruc-
tion execution it may cause the current instruction fetch to stall.
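The issue cycle described above can be sketched in software as follows. The class and function names (`Microagent`, `can_issue`, `issue`) are hypothetical and the model is sequential; it is only meant to show how the acknowledgement wire doubles as a busy flag, not how the CU is actually implemented.

```python
class Microagent:
    def __init__(self, name):
        self.name = name
        self.req = 0   # control signal from the CU
        self.ack = 0   # acknowledgement back to the CU; high means busy

    def accept(self):
        # First event: acknowledge the micro-operation request.
        if self.req == 1 and self.ack == 0:
            self.ack = 1

    def finish(self):
        # Second event: signal completion once the CU has reset its request.
        if self.req == 0 and self.ack == 1:
            self.ack = 0


def can_issue(required):
    """The CU may only start a handshake with resources whose ack is low."""
    return all(agent.ack == 0 for agent in required)


def issue(required):
    """Issue one instruction; return False if the CU has to wait."""
    if not can_issue(required):
        return False
    for agent in required:
        agent.req = 1        # begin a handshake on each control signal
    for agent in required:
        agent.accept()       # in hardware these arrive independently
    assert all(agent.ack == 1 for agent in required)   # instruction issued
    for agent in required:
        agent.req = 0        # CU completes its phase of the handshake
    return True
```

For instance, after `issue([alu])` a second call that also needs `alu` returns `False` until `alu.finish()` has lowered its acknowledgement.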
The registers involved in the instruction execution are informed by the CU
as to which buses they have been assigned (derived from the instruction), with
respective microagents using the local communication protocol to request their
operands. For example, the ALU will assert request signals on both the X and
Y buses. This signal (being on a bus) will go to both the register bank and the
PC unit. However, only one of them will respond on each of the buses, since
the CU will have already notified which components were to be enabled during
the current instruction execution cycle.
The Control Signals
The control signals used by the CU effectively consist of a pair of wires: one
is the request to the FU interface (from now on referred to as the control signal)
and the other is the acknowledgement from it (referred to as the acknowledgement
signal). By using a four-phase handshake protocol the CU can use each of
the acknowledgement signals as a status flag (e.g. high to mean busy and low
to mean free) for their respective resources. The precise meanings of the control
signals and their acknowledgement signals are described below.
As well as the request signal, the control signals to the register bank also
consist of the address of the register to which the signal applies. The control
signals to the register bank are:
Rx – Identifies the register which should output its contents to the X bus port of
the register bank. The corresponding acknowledgement signal is asserted
once the register has been accessed (if a register is blocked then it cannot be
accessed), and cleared when both the control signal has been de-asserted
(following the handshake convention) and the register interface has re-
ceived the data (i.e. when the interface is ready to transfer the data over
the X bus).
Ry – Identifies the register which should output its contents to the Y bus port.
The acknowledgement signal is set and cleared as for Rx.
Rz – Identifies the register which should output its contents to the ZM bus port.
The acknowledgement signal is set and cleared as for Rx.
ZMs – Locks the destination register preventing any read access to it. The
acknowledgement signal is set when the register is locked and cleared
when the register has been written to with data from the ZM bus (i.e.
data which has been received from the MU). Note that the control
signals Rz and ZMs cannot both be asserted simultaneously since this could lead
to bus contention.
ZAs – Locks the destination register and prevents read access to it. The ac-
knowledgement signal is asserted when the register is locked and cleared
when the result from the ALU has been written back to its destination
register via the ZA bus.
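The interaction between the read signals (Rx, Ry, Rz) and the locking signals (ZMs, ZAs) can be sketched as below. The `Register` class and its method names are assumptions made for illustration. In hardware a blocked read simply leaves the acknowledgement unasserted until the write-back arrives; the sketch returns `None` in that case.

```python
class Register:
    def __init__(self, value=0):
        self.value = value
        self.locked = False      # set by ZAs/ZMs until the write-back arrives

    def lock(self):
        """ZAs or ZMs: reserve the register as a destination."""
        self.locked = True

    def try_read(self):
        """Rx/Ry/Rz: return the value, or None while the register is blocked."""
        return None if self.locked else self.value

    def write_back(self, value):
        """Result arrives over the ZA or ZM bus and the lock is released."""
        self.value = value
        self.locked = False
```

A reader that requests a locked register therefore observes the new value and never the stale one, without any comparison of register addresses by the CU.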
Other control signals to registers in the PC Unit include:
Rpcx – Outputs the value of the PC on to the X bus. Note that Rx and Rpcx can-
not both be active simultaneously since this could lead to bus contention
on the X bus.
Rpcy – The (next) data value on the Y bus is to replace the current PC value.
Rof – Outputs the value stored in the offset register on to the X bus. The PFE
interface makes sure that this register holds the correct value. As before,
Rx and Rof cannot both be active simultaneously.
The control signals to the functional units (microagents) can take one of two
forms. Firstly, the control signals can contain the instruction opcode (or some
part of it) which is decoded locally by the functional unit itself (as in the case
of the ALU’s control signal AU). Here the local decoding is overlapped with
the instruction’s operand fetch. Secondly, if the decoding costs are small and
do not increase the CU delay, it may be possible to decode the opcode and use
dedicated control signals for particular (micro)operations within a functional
unit. Control signals to the MU are MU1 for a load (LD) instruction, MU2 for
the store and MU3 for the address calculation instruction. In this case the cost is
generally hidden by the instruction issue handshake of the previous instruction.
4.4.3 Data Transfer
Data transfer is request-driven, e.g. a functional unit which requires an operand
will assert a request to the register. The register will in turn send the data on the
bus, the reception of which is acknowledged by the functional unit’s interface
by de-asserting the original request, thus allowing the register to release the
bus. Generally, this ensures that resources (registers and buses) are utilised
for no longer than is necessary. The register control signals together with the
handshaking protocol prevent bus contention from occurring.
4.5 The Performance Results
All the functional units in Figure 4–2 were based on a 1.2 µm CMOS implementation
process. Their timing characteristics were extracted from a post-layout
simulation tool within a commercial VLSI design package called SOLO 1400 [50]
and used in the PEPSÉ simulation models of the processors. Neither layouts nor
transistor size optimisations for improved performance [26] were considered.
The performance of the instruction set outlined in Table 4–1 is summarised
in Table 4–2. In the simulations, every effort was made to make the comparisons
between the two design styles as fair as possible. While the chosen implementation
process is not state-of-the-art, neither commercial design tools nor sufficient
commercial processor layout information were available upon which to base
an accurate comparison. Also, commercially available synchronous architectures
generally contain a number of engineering and design “tricks” specific to
particular implementations of a design.
                    Synchronous Design       Asynchronous Design
Group  Instruction  Inst. Exec.  Clock       Inst. Exec.  Datapath Exec.  Speed Up
                    Time (nS)    Phases      Time (nS)    Time (nS)
1      ADD          36           4           26           17              1.38
1      LD           54           6           34/26        34              1.58/2.07
1      ST           36           4           26           14              1.38
2      LDX          99           11          60/55        60              1.65/1.8
2      STX          81           9           55           40              1.47
2      LDA          81           9           55           43              1.47
3      STPC         45           5           32           20              1.40
3      JMP          45           5           32           9               1.40
4      BRCH F       72           8           59           32              1.22
4      BRCH T       81           9           63           42              1.28
Table 4–2: Synchronous versus asynchronous performances
The results of the comparison of instruction execution times under the two
control philosophies are shown in Table 4–2. The Instruction Execution Time (IET)
represents the time between issuing the current instruction and the next, i.e. the
effective cost for fetching and evaluating each instruction, taking into account
the two-stage pipelined nature of the processors. In the synchronous case,
the minimum IET is 36nS (the clock period) which is equivalent to the delay
of the fetch stage. The fetch stage delay is 26nS in the micronet design, which
considers both the average timings and the self-timed overheads.
The Datapath Execution Time (DET) is the average duration between the CU
initiating an instruction and its completion, i.e. the instruction latency within the
(execute stage of the) micronet datapath. The IET is the maximum of the fetch
stage delay and the execute stage delay (DET). The DET is of particular interest
when it is larger than the fetch stage delay (26nS) since this means that the CU
might be able to exploit some concurrency by being able to overlap the execution
of more than one instruction within the datapath. If the following instruction
is independent, then the effective IET of the previous instruction will be the
smaller value (IET_unrelated). Otherwise, in the presence of structural or data
dependencies, the larger value applies (IET_related). When comparing execution
times between the two design styles for the load (LD) and load with offset (LDX)
instructions, the IET_related value should be used, because in the synchronous
case wait states have been inserted in these instructions as the CU must assume
the worst-case situation. Although, in general, this suggests that MAP can
exploit some data-dependent concurrency, the synchronous processor’s CU
could test successive instructions for structural and data dependencies at the
expense of increasing the complexity and delay of the unit. The asynchronous
design can take advantage of any independence between instructions without
testing, since the handshaking mechanism will prevent erroneous behaviour
should such a dependency exist.
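As a worked check of these definitions, the sketch below recomputes the asynchronous IET values for the Group 1 rows of Table 4–2 from the 26 nS fetch stage delay and the DET column. The function names are assumptions introduced here, and only Group 1 is used because the text quotes a single fetch stage delay and does not break down the fetch costs of the other groups.

```python
FETCH_DELAY_NS = 26   # micronet fetch stage delay quoted in the text


def iet_related(det_ns, fetch_ns=FETCH_DELAY_NS):
    # A dependent successor has to wait for the slower of the two stages.
    return max(fetch_ns, det_ns)


def iet_unrelated(fetch_ns=FETCH_DELAY_NS):
    # An independent successor can be issued as soon as it has been fetched.
    return fetch_ns


def speed_up(sync_iet_ns, async_iet_ns):
    return sync_iet_ns / async_iet_ns


# Group 1 rows of Table 4-2: (synchronous IET, DET) in nS.
GROUP1 = {"ADD": (36, 17), "LD": (54, 34), "ST": (36, 14)}

for name, (sync_iet, det) in GROUP1.items():
    related = iet_related(det)
    print(name, related, iet_unrelated(), round(speed_up(sync_iet, related), 2))
```

For LD this gives a related value of 34 nS and an unrelated value of 26 nS, matching the 34/26 entry in the table. The table's speed-up figures appear to be truncated rather than rounded, so the last digit printed here may differ.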
In the self-timed design, the IETs of the instructions are limited by the fetch
stage delay. In fact the speed-up in these cases virtually represents the ratio of
latency between the two instruction fetch pipes. Even though the synchronous
fetch pipe has a perfectly matched clock it is still limited by worst-case delays
and an inability to generate control signals at precise times due to its centralised
control.
These speed-ups show that it is indeed possible to achieve performance
improvements under an asynchronous control paradigm. Since all of the in-
structions show improvement, a program consisting of these instructions will
therefore be expected to execute faster. Furthermore, it is in the nature of the self-
timed CU to initiate instructions as soon as possible. This can only be achieved
at run-time. However, the timing characteristics used for the synchronous CU
are fixed at design time.
The preliminary conclusion from these results is that one can observe an
improvement in the performance of the asynchronous control mechanism over its
synchronous equivalent when the individual instruction execution times are
compared. The MAP architecture uses circuits that generate completion sig-
nals [169] and therefore benefits from exploiting actual component delays. The
magnitude of any improvement is limited by a number of factors. The two
important ones are: the architectural design, where some sort of decoupling is
required between the two stages since each of the stages can stall waiting for
the other; and the difference between typical and worst-case delays which is
influenced by component design.
4.6 Discussion
MAP implementations are robust to variation in physical parameters and can
adjust to variations due to data-dependent operations. For instance, the time to
add two integers using a ripple-carry adder varies with the length of the carry
chain. The clock period of a synchronous implementation has to be adjusted
for the worst case, and therefore a synchronous version takes time proportional
to the number of bits of the operands. On the other hand, an asynchronous
ripple-carry adder computes in a time which is, on average, proportional to
the logarithm of the number of bits [60] [109]. This is at a cost of detecting the
completion of the operation locally. However, the overheads of the handshake
mechanism can be hidden in micronets, as will be shown in the following
chapter.
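The data dependence of the ripple-carry delay can be illustrated with a small experiment that measures the longest carry chain for random operands. This is a sketch under stated assumptions: uniformly random operands, a chain counted as one generate position plus the propagate positions that follow it, and helper names invented here. It does not model gate delays or the cost of completion detection.

```python
import random


def longest_carry_chain(a, b, bits):
    """Length of the longest carry propagation when adding a and b."""
    longest = current = 0
    for i in range(bits):
        x, y = (a >> i) & 1, (b >> i) & 1
        if x & y:            # generate: a new carry starts here
            current = 1
        elif x ^ y:          # propagate: an incoming carry ripples onwards
            current = current + 1 if current else 0
        else:                # kill: any incoming carry is absorbed
            current = 0
        longest = max(longest, current)
    return longest


def average_chain(bits, samples=2000, seed=0):
    """Average longest carry chain over random operand pairs."""
    rng = random.Random(seed)
    total = sum(
        longest_carry_chain(rng.getrandbits(bits), rng.getrandbits(bits), bits)
        for _ in range(samples)
    )
    return total / samples
```

The worst case, such as adding 1 to an all-ones operand, ripples through every bit, while `average_chain(32)` stays far below 32. This gap is what a self-timed adder with completion detection can exploit.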
If the duration of all of the operations were constant and known precisely,
then the sequencing could be implemented efficiently with a global clock and
centralised control, since this is sufficient to signify the end of a computation
and start of the next one. Timing relies on the physical and environmental
parameters of the design. Designers, being aware that their knowledge of both
the physical properties of the devices and the runtime behaviour of the circuits
is imperfect, have to lengthen the clock period to account for an error margin
in the evaluation of the duration of a computation step. This error margin is
becoming a significant proportion of the operating clock period and actually
leads to inefficiency. Furthermore, delays have to be matched by a discrete
number of clock cycles which gives rise to idle times which can become quite
significant. Incorporating a variable period clock [39] or using a faster clock
leads to diminishing returns: it increases the design complexity without
necessarily improving performance significantly. In fact, increasing clock frequency
has been the popular solution although such signals induce noise, and their
distribution is difficult and subject to skewing, as discussed in Chapter 2.
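The idle time caused by matching delays to a whole number of clock periods is simple arithmetic, sketched below. The function names are assumptions. In the usage note, the 9 nS figure is inferred from Table 4–2 (36 nS over 4 clock phases) and the 20 nS delay is purely illustrative.

```python
import math


def synchronous_time(delay_ns, period_ns):
    """Time charged to an operation that must occupy whole clock periods."""
    return math.ceil(delay_ns / period_ns) * period_ns


def idle_time(delay_ns, period_ns):
    """Portion of the charged time during which the hardware is idle."""
    return synchronous_time(delay_ns, period_ns) - delay_ns
```

For example, an operation that actually needs 20 nS is charged `synchronous_time(20, 9)` = 27 nS, so 7 nS are idle, whereas a self-timed unit would signal completion as soon as the result is ready.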
For complex computations with data dependencies, asynchronous design
has the advantage of exploiting the best-case delay, whereas synchronous solu-
tions have to adjust to the worst-case. Furthermore, data flowing in a network
of stages rather than a linear pipeline may not encounter the component with
the largest delay (slowest stage), e.g. not all instructions need to use a shifter,
and therefore will not even be hindered by the slowest operation (which itself
may not be executing at the time).
4.7 Summary
This chapter has described two similar microprocessor designs which differ
only in the control strategy. The architecture incorporates the basic features
of RISC without complicating issues such as pipeline hazards and provides a
good foundation from which to develop and investigate the suitability of the
self-timed paradigm for more complex pipelined processors. The synchronous
design incorporates conventional centralised control mechanisms. The sequen-
cing of instructions is controlled centrally in the control unit which generates the
control signals for each of the other components in the datapath with timing
provided by a clock signal. The clock period is fixed by the largest possible
delay within a stage in the pipeline. In an asynchronously controlled micropro-
cessor, control sequencing is decentralised amongst the datapath’s functional
units. The execute unit just initiates a sequence of actions, and in most cases
will take no further part. The corresponding components will then communic-
ate between themselves via request and acknowledge handshakes in order to
complete the task. This allows an operation to proceed at a rate determined by
local, variable delays and not by a delay which is fixed pessimistically.
This alternative control paradigm is realised through a micronet and the
main concern in this chapter has been with the exploitation of actual datapath
delays in micronet-based processors. Results obtained via simulation have
been presented for the performance of an instruction set for two design styles of
microprocessor. These indicate an improvement in performance (on average)
for the self-timed design over the synchronous equivalent. These results only
represent the performance gain per instruction. Since all the instructions have
shown improved execution times, the execution time of a program containing
an average instruction mix will also be better. The magnitude of these results
really depends on the type of operation being carried out and the design style of
the functional units (e.g. ALU design). The speed up reported here does agree
with other related work by Dean [39] and predictions by Ginosar [63].
Further improvements in performance are possible by taking advantage
of instruction-level parallelism (as in most commercially available RISC pro-
cessors). The MAP’s control unit can exploit some execution concurrency if it
can issue the following instruction before the previous one has finished. This
incurs no extra cost in this design unlike a synchronous processor’s control unit.
Allowing concurrent instruction execution introduces pipeline hazards [72] into
the design. The following chapter examines the modifications to the design of
the MAP architecture which exploit the underlying self-timed control paradigm
more fully in order to exploit ILP.
Chapter 5
The Control Paradigm and the Architecture
5.1 Introduction
The previous chapter compared a synchronous RISC processor architecture
with its asynchronous equivalent. Centralised control and synchronous data
communication were replaced by distributed control and asynchronous com-
munication without the higher levels within the computer system perceiving
any changes. It was shown that an asynchronous control paradigm could indeed
improve the performance of the instruction set for a given processor architec-
ture. That design experiment only attempted to improve the execution times
of individual instructions, made possible by the micronet’s ability to exploit
actual component delays as well as hiding some of the handshaking overheads.
However, in order to realise the full potential of this asynchronous design style,
this chapter attempts to highlight the ease with which a MAP architecture
can be modified to exploit Instruction-Level Parallelism (ILP). Refinements are
made to a modified version of the micronet processor architecture described
earlier, to efficiently improve performance through the increased utilisation of
the datapath resources and to exploit ILP without significantly increasing con-
trol costs. In fact, ILP is used to effectively hide the remaining overheads due
to asynchronous control.
5.2 Exploiting Instruction-level Parallelism
Speeding up the execution latencies of instructions is one approach to improving
performance. An alternative is to execute more than one instruction at the same
time. Exploiting ILP [84] can be achieved either by issuing several independent
instructions per cycle as in superscalar or VLIW architectures, or by issuing an
instruction every cycle, where the cycle time is now shorter than the times for
any of the operations, as in (super)pipelined architectures. Furthermore, these
two approaches may also be combined.
The superscalar principle relies primarily on exploiting spatial parallelism,
which is achieved by running multiple operations concurrently on duplicated
hardware. In contrast, pipelining relies on exploiting temporal parallelism by
overlapping multiple operations on common hardware and operating with a
faster clock. Note that ILP is limited by data dependencies between instructions,
structural dependencies and also control transfers in pipelined architectures.
Most, if not all, processor architectures are pipelined (to some degree) since it
is considered the most cost effective of the two alternatives. However, the limits
on this type of concurrency have meant that modern processor designs need to
consider the more expensive form as well [40] [42]. This chapter concentrates
on implementing asynchronous “pipelines” for exploiting ILP (both temporal
and spatial) as a number of control issues resulting from data and structural de-
pendencies between instructions have to be addressed efficiently. Since a good
instruction schedule (generated statically) to avoid such dependencies is not
always possible, techniques are required to resolve them at run-time. Within
synchronous datapaths, structural hazards are normally avoided in hardware
by using a scoreboarding mechanism and data dependencies are resolved by
using either hardware or software interlocks [70], which adds to the control com-
plexity and cost. Data Forwarding is a technique commonly used in pipelined
architectures to minimise the cost of functional unit (FU) stalls due to data
dependencies, by redirecting data being written to registers to the waiting func-
tional unit [163]. In synchronous ILP designs, the cost of maintaining correct
operation increases the complexity of control which in turn adversely affects
the clock period and therefore the performance. However, an asynchronous
datapath which is designed using micronets can use the existing handshaking
mechanisms, together with the simple locking of registers, to achieve the same
effect with trivial hardware overheads. Exploiting concurrency in a micronet
architecture is aided by the distributed nature of the control strategy and by
the fact that data movement is controlled locally. Previously, it had been con-
sidered expensive to pipeline decoding, but here this is no longer the case since
control and decoding are distributed amongst architectural components. As
a consequence, implementing asynchronous superscalar or superpipelined ar-
chitectures is relatively straight-forward, and this will be discussed briefly in
Chapter 7.
In practice, not all instructions have identical execution times
and thus the results of instructions may be ready out of program order. En-
forcing in-order write-back to registers is inefficient for performance, since
this can effectively stall functional units and thereby increase the evaluation
time of instructions. Out-of-order instruction completion can be supported in
synchronous designs, but at a non-trivial cost [40]. In contrast, asynchronous
designs, as proposed in this work, can relax the strict ordering of instruction
completions and thereby further exploit ILP. The effect is to increase the utilisa-
tion of the functional units by reducing their stalls. By exploiting both ILP and
actual run-times of instructions, better program performances can be achieved
on asynchronous processors, and this will be demonstrated in greater detail in
the rest of this chapter.
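The cost of enforcing in-order write-back can be illustrated with a few lines of arithmetic. The function names and the result-ready times in the usage note are hypothetical; the sketch ignores bus and port contention and only compares the two write-back policies.

```python
def write_back_times(ready, in_order):
    """Return the write-back time of each result, given when it is ready."""
    times = []
    for t in ready:
        if in_order and times:
            t = max(t, times[-1])   # must wait for every earlier instruction
        times.append(t)
    return times


def total_stall(ready, in_order):
    """Total time results spend waiting in their functional units."""
    return sum(w - r for w, r in zip(write_back_times(ready, in_order), ready))
```

With illustrative ready times of `[60, 22, 27]` (a slow instruction followed by two fast ones), in-order write-back holds the two later results until time 60, a total stall of 71 time units, whereas out-of-order completion incurs none.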
5.3 Design Goals
A goal of early synchronous RISC architectures was to achieve an execution rate
of one instruction per machine cycle. In simple architectures, like the design
in Chapter 4 which followed the sequential mode of program execution and
avoided hazards, this meant an instruction would complete its execution before
the next one started. Such processors did not have a pipelined execute stage, and
either the choice of instructions within the instruction set had to be restricted
by the requirement that the execution time of each instruction be equal to a
single clock period (or, in later RISC architectures, a fixed multiple of it) in
order to achieve a certain performance or MIPS rate, or the clock period
was determined by the execution time of the slowest instruction. Remember
that the clock period itself is determined on the basis of conservative estimates
of component delays. Therefore all instructions are viewed as executing in the
same time irrespective of their actual delays even though most instructions will
actually complete in some fraction of the clock period. Also, in practice, different
instructions generally require different resources and even the same instructions
can have different execution times. All of this leads to poor utilisation of
expensive resources. Although pipelining has gone some way to redressing
this, the technique itself introduces inefficiencies such as stage balancing problems;
for example, the von Neumann bottleneck makes it difficult to match the cost of
fetching an instruction with its execution. Whereas the RISC philosophy was
concerned with the efficient usage of silicon real estate, the goal of the micronet
control paradigm is more efficient utilisation of the functional units over time.
5.4 An Asynchronous ILP Processor
The structure of a processor architecture is determined by the number and type
of components or functional units and their connections. Pipelining is a control
technique for exploiting temporal ILP. The first MAP architecture under invest-
igation is a modified version of the one described in Chapter 4. The functional
units are identical to those used in the previous design, with the exception
of those in the fetch stage. The modifications in the execute stage focus on
optimising the control and data handshake protocols to improve the control
sequencing and supporting ILP. These modifications have been implemented
in a series of refinements and at each refinement, their effect on program per-
formance is measured. An adequate set of instructions has been implemented
in each refinement step to highlight the effects of the modification.
The results in Chapter 4 have clearly shown how the asynchronous pro-
cessor’s performance is affected by the fetch stage. It is therefore necessary to
reduce the fetch stage delay to less than the smallest execution cost in order
to ensure that the execution pipe is kept busy. (Note that the fetch cost, being
independent of the instruction set, is more a function of the memory technology
which allows the overall processor performance to be traded off with the fin-
ancial cost of the instruction memory/cache). Also, the amount of concurrency
that can be exploited in such an architecture is severely restricted by the fact
that the PC has to be available to both the fetch and execute stages. The work
in this chapter focuses on the datapath within the execute stage. In order
to improve resource utilisation and expose maximum concurrency, a number
of minor architectural modifications are made to the design described in the
previous chapter, to create the base architecture upon which further (control)
improvements will be made.
5.5 A Micronet Architecture
Figure 5–1: A typical micronet-based processor architecture model
Figure 5–1 illustrates the functional units which might constitute a typical MAP
architecture. The intention is not to focus on the functional units themselves,
but rather on their asynchronous control using micronets and the resulting
performance improvements. The number of units and their functionality can
be changed without any side-effects. The base architecture under study is
comprised of the following units:
1. As previously, the Instruction and Data Memory (IM and DM) or Cache
store the program instructions and data, respectively.
2. The Fetch and Branch Unit (FBU) fetches instructions from the IM, executes
control transfer ones and places the others in the instruction buffer.
3. The instruction buffer is an asynchronous queue which effectively de-
couples the fetch stage from the execute stage.
4. The Control Unit (CU) initiates the necessary micro-operations in the re-
spective microagents for a given instruction.
5. The Memory Unit (MU) services the load and store instructions, generates
addresses (using its own adder) and accesses the DM.
6. The ALU executes arithmetic and logical instructions.
7. The Register bank consists of a number of registers (32), three operand read
ports to the functional units, and a write port for the ALU and the MU.
8. The Boolean Register Bank holds flags which are used to resolve branch
conditions.
9. X and Y are operand fetch buses and V is the boolean flag write-back bus.
The Z bus is initially used as both an operand fetch bus (labelled W in
Figure 5–1) and a register write-back bus.
5.5.1 Modifications to the Fetch Stage
The combination of an unbalanced two-stage pipeline and the implementation of
certain instructions (particularly control transfer ones) could cause the execute
stage to often become starved of instructions. This will have a detrimental effect
on the exploitation of concurrency and efficient utilisation within the execute
stage of the datapath, and therefore this behaviour has to be improved. Firstly,
all PC-related instructions are either executed in a new unit called the Fetch and
Branch Unit (FBU) or removed from the instruction set altogether. The FBU
is responsible for fetching instructions from the instruction memory or cache
and processing control transfer instructions. This unit filters out unconditional
branches and updates the PC directly. The branch target address is copied to
a register after which branch prediction schemes similar to those employed in
synchronous designs can be applied. Although the removal of the execution of
PC-related instructions from the execute stage may be seen as the influence of the
control paradigm on the processor architecture, this feature has already been
incorporated in high performance synchronous designs (e.g. [40] [44] [155]).
The problem is related to the fact that it is difficult to exploit parallelism when
a resource is being used in separate stages within the datapath.
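The filtering role of the FBU can be sketched as follows. Everything in this sketch is an assumption made for illustration: the tuple encoding of instructions, the absolute `JMP` target, the `buffer_size` parameter and the step limit that guards against endless jump loops. Conditional branches, on which the FBU would have to wait, are not modelled.

```python
def fetch_and_branch(program, buffer_size, start_pc=0):
    """Fill the instruction buffer, executing unconditional jumps in the FBU."""
    pc, buffer = start_pc, []
    steps = 0
    while pc < len(program) and len(buffer) < buffer_size and steps < 1000:
        steps += 1
        op, *args = program[pc]
        if op == "JMP":          # unconditional: update the PC directly
            pc = args[0]
        else:                    # everything else goes to the execute stage
            buffer.append((op, *args))
            pc += 1
    return buffer, pc
```

Given the program `[("ADD",), ("JMP", 3), ("ST",), ("LD",)]`, only ADD and LD reach the buffer, so the execute stage never sees the jump and never needs the PC.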
As described in Chapter 2, pipeline stages have a producer-consumer be-
haviour. If two stages have varying delays such that their worst-case delays
alternate between them, then the pipeline latency will be the sum of the two
worst-case delays. If the stages are decoupled from each other by an asyn-
chronous queue which stores the predecessor’s results, then the stall time of the
stage is reduced and throughput improved. An instruction buffer/window has
been implemented to hold instructions pending execution. Situated between
the two stages, the buffer relaxes the synchrony between the FBU and CU, al-
lowing each stage to proceed at its own rate without hindering the other until
the buffer becomes either full or empty. Thus, the decoupling of the fetch stage
from the execute stage can reduce the amount of time the control unit is starved
of instructions. The FBU continuously fetches instructions and places them in
the buffer until either the buffer is full or the unit stalls waiting to resolve a
conditional branch (control transfer). Unconditional branches will be executed
by the unit, updating the PC immediately. The problem of control transfer resol-
ution is, however, made more difficult. Although this is similar to the problem
faced by deeply pipelined synchronous processors, the effect of the buffer is to
introduce a variable number of pipeline stages between the instruction being
fetched and the instruction being issued (executed). Ignoring control transfers,
implies that the current PC value will no longer be just one (or a constant num-
ber) ahead of the PC value of the instruction being executed, which makes it
difficult to use the PC value in the execute stage. The use of branch prediction
schemes to prevent stalling the pipeline and conditional instruction execution
as a solution to mispredicted branches can be employed without affecting or
being influenced by the control paradigm (see Chapter 7). The instruction buffer
has an additional use in more advanced designs which will also be elaborated
in the same chapter.
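The effect of the instruction buffer on stall time can be illustrated with a small timing model. The following is only a sketch under stated assumptions: the function name `makespan`, the delay values and the buffer sizes are hypothetical and are not taken from the MAP design. It models two stages in which the first may hand an item over only when there is room downstream, and compares a direct handover with a three-entry queue.

```python
def makespan(fetch_delays, exec_delays, buffer_slots):
    """Completion time of the last item through a two-stage pipeline.

    fetch_delays[i] and exec_delays[i] are the per-item delays of the first
    and second stage. buffer_slots is the number of items that can wait
    between the stages (0 models a direct handover with no queue).
    """
    left_fetch = []  # time at which each item leaves the first stage
    done_exec = []   # time at which each item finishes the second stage
    for i, (f, e) in enumerate(zip(fetch_delays, exec_delays)):
        # The first stage starts item i once item i-1 has left it.
        fetched = (left_fetch[-1] if left_fetch else 0) + f
        # There is room downstream once item i - buffer_slots - 1 has finished.
        j = i - buffer_slots - 1
        room_at = done_exec[j] if j >= 0 else 0
        leave = max(fetched, room_at)
        # The second stage starts item i once it is free and the item is there.
        start = max(leave, (done_exec[-1] if done_exec else 0))
        left_fetch.append(leave)
        done_exec.append(start + e)
    return done_exec[-1]


# Hypothetical delays: bursts of fast fetches coincide with slow executions
# and vice versa, so the slow phase moves back and forth between the stages.
fetch = [1, 1, 1, 4, 4, 4] * 2
execute = [4, 4, 4, 1, 1, 1] * 2

print("no buffer:     ", makespan(fetch, execute, 0))
print("3-entry buffer:", makespan(fetch, execute, 3))
```

With these assumed delays the buffered pipeline finishes earlier, because the first stage can run ahead during its fast phase instead of waiting for the second stage to accept each item.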
5.6 The Control Refinements
The following sections discuss the refinements made in a number of steps to
the execute stage of the base MAP design shown in Figure 5–11 (Figures 5–11 to
5–18 can be found at the end of this chapter, from page 133 onwards). These
refinements highlight the ease with which the micronet model can both efficiently
exploit ILP and obtain good functional unit utilisation without the difficulties
normally encountered in synchronous datapath design (e.g. implementing hazard
avoidance, data-forwarding or balanced pipeline stage design). Control is
distributed at each refinement step to the functional units, and improvements,
if any, in the execution of sample programs are recorded. An architecture, as
illustrated in Figure 5–1, is composed of a network of microagents (denoted by
solid boxes) which are connected via ports. The Functional Microagents (FMs)
perform micro-operations which are typical of a datapath. On each port of a FM
is a Communicating Microagent (CM) which is responsible for communication
among the FMs, and with the Control Unit (CU). The FMs are effectively isol-
ated and only communicate through their CMs, and can therefore be modified
without affecting the rest of the micronet. The modifications to the datapaths
are modelled using micronets as shown in Figures 5–11 to 5–18. These versions
aim to exploit the fact that the microagents operate concurrently, each executing
one micro-operation at a time. In Figure 5–11, for example, four microagents
can operate in parallel in the operand access stage; followed by three pairs in
the operand fetch handshake stage; two in the instruction execution stage; two
pairs in the write-back handshake stage; and two in the write-back stage.
5.7 Measuring Improvements in Performance
The two parameters which affect the performance of programs in asynchronous
pipelines are the latency of the microagents, which is defined as the time between
initiating the micro-operation and the result being signalled as available; and
their cycle time, which is the minimum time between successive initiations of the
same micro-operation, i.e. the reciprocal of the throughput. The two parameters have the same value
in a synchronous pipeline, with the cycle time being determined by the latency of
the slowest stage. The difference between the two values may be viewed as the
overhead due to asynchronous protocols and a good design should endeavour
to minimise it. This is achieved in micronets by overlapping the phases of the
communication protocol in the CMs with operations in the FMs, thus hiding
the overhead through concurrent operations. The effectiveness of this method
is gauged by measuring the utilisation of FMs when exercised by test programs
composed of the appropriate, identical instructions. Metrics are now introduced
for characterising the performance of micronet datapaths.
Minimum Micronet Latency (MML) is the time between asserting the control
signals (i.e. initiating an instruction issue) and receiving the final acknow-
ledgement of the instruction’s completion. From the CU’s point of view,
this is the shortest execution time (latency) through the micronet (ignor-
ing any stall time due to busy resources) for a particular instruction. This
value influences when successive data dependent instructions can begin
their execution. Note also, that this metric is not the same as the Datapath
Execution Time (DET), as used in the previous chapter, which is just the
time taken for the instruction’s result to reach its destination (i.e. it does
not include the time to signal the instruction’s completion).
Instruction Cycle Time (ICT) – In asynchronous pipelines, which usually have
non-uniform stage delays, the time between successive instruction issues
is influenced by the slowest stage currently active in the pipeline. The ICT
is the time between two identical instruction issues once that instruction’s
pipeline is full. This metric is the sustainable rate at which a particular type
of instruction can be issued. The upper bound on this value is determined
by the cycle time of the slowest microagent on the instruction’s path.
(Instructions are executed by following the particular paths through the
micronet). Note that this is not a strict upper bound since the time between
these instruction issues could increase because of contention for a shared
resource (caused by the concurrent execution of a different instruction).
For example, a different functional unit starts using the write-back bus
causing another instruction in the current instruction’s micropipeline to
stall. In practice, if this only happened occasionally, it may not affect the
ICT since the elasticity of the micronet may absorb the effect.
Program Execution Time (PET) is the actual execution time of a program. As
this time is reduced, component utilisation will increase (assuming the
amount of work stays the same). For a micronet executing a stream of
identical instructions, the PET can be approximated to:

$$\mathrm{PET} \approx (n - 1) \times \mathrm{ICT} + \mathrm{MML} + \text{overheads} \qquad (5.1)$$

where n is the number of instructions and the overheads are the costs asso-
ciated with the initial instruction fetch startup. Equation 5.1 is obviously
related to the synchronous equivalent where the ICT would be equivalent
to the clock period and MML to the pipeline latency, i.e. the clock period
multiplied by the number of stages in the pipeline. Note that average
values have been used for modelling purposes but in practice it is likely
that both the ICT and MML of an instruction would vary.
ALU Utilisation – The percentage of the program execution time (excluding
the initial instruction fetch time) for which the ALU performs useful com-
putation. Utilisation measurements are important for two reasons: firstly,
they are a measure of efficient functional unit usage, greater efficiency
leads to improved performance; secondly, high utilisation can also in-
dicate potential bottlenecks within the design. Although adding another
resource may improve program performance and reduce the utilisation
(an architectural design trade-off), this work advocates that given a set of
architectural resources, an asynchronous control paradigm is better able
to utilise them.
MU Utilisation – Same as above, but for the Memory Unit (MU).
Register Utilisation – Same as above, but for the Register Bank. This figure is
useful since in the nature of RISC architectures all data must be moved via
the register bank which could pose a potential bottleneck.
ALU Interface Utilisation – The percentage of the execution time (excluding
the startup latency) during which the ALU’s CMs are busy.
MU Interface Utilisation – Same as above, but for the Memory Unit Interface.
Register Interface Utilisation – Same as above, but for the Register Interface.
Program Minimum Instruction Issue Cycle Time (MIICT) is the minimum time
between successive instruction issues, which gives a measure of the max-
imum possible issue rate for a given program. The ratio of the largest
MML and smallest MIICT is an upper bound on the number of instruc-
tions which can potentially execute concurrently in the datapath.
Maximum FM Utilisation – The upper bound on the FM utilisation for a par-
ticular instruction is the ratio of the FM micro-operation latency and the
ICT for that instruction. Therefore, architecture designs should aim to
reduce the ICTs of instructions to that of their FM micro-operation delays.
Given that the ICT is determined by the slowest delay on the instruction’s
path, optimal utilisation can only be achieved when the FM is the slow-
est microagent. (In terms of program execution it is assumed that only
FMs do useful work and the other operations are effectively the overheads
associated with the architectural design).
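The arithmetic behind three of these metrics can be sketched as follows. This is a minimal illustration rather than part of the PEPSÉ environment: the function names are hypothetical, the ICT and MML inputs are the ALU and LD values reported in the tables of this chapter, and the instruction count, start-up overhead and FM latency are assumptions chosen only to exercise the formulas.

```python
def pet_estimate(n, ict, mml, overheads):
    """Equation 5.1: (n - 1) * ICT + MML + overheads, all times in ns."""
    return (n - 1) * ict + mml + overheads


def max_fm_utilisation(fm_latency, ict):
    """Upper bound on FM utilisation: FM micro-operation latency / ICT."""
    return fm_latency / ict


def max_concurrent_instructions(largest_mml, smallest_miict):
    """Upper bound on instructions in flight: largest MML / smallest MIICT."""
    return largest_mml / smallest_miict


# Assumed inputs: 7 identical instructions, 7 ns start-up overhead and a
# 4 ns FM latency are hypothetical; 24 ns, 43 ns and 21 ns are tabulated values.
print(pet_estimate(n=7, ict=24, mml=24, overheads=7), "ns")
print(f"{100 * max_fm_utilisation(fm_latency=4, ict=24):.2f}%")
print(f"{max_concurrent_instructions(largest_mml=43, smallest_miict=21):.2f}")
```

The utilisation bound makes the design goal explicit: the ratio only approaches 100% when the ICT is reduced to the FM's own micro-operation delay.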
5.7.1 The Test Programs
The feasibility of taking advantage of actual delays rather than assuming the
worst-case values depends on the difference between the actual and worst-case
delay being larger, on average, than the overheads due to asynchrony. If the
asynchronous overheads were to be hidden then asynchrony would always
have a performance edge. The successive refinements aim to show that the
exploitation of fine-grain ILP can be used to hide these overheads.
The actual performance of the architecture is determined by the delays of the
components. It is demonstrated that the maximum attainable performance
approaches the maximum possible performance of the architecture. The FU
latencies are chosen to be constant (set to the average execution time) in order
to capture the essential behaviour of micronets.
The micronets in Figures 5–11 to 5–18 were exercised by programs with a
mixture of LD, STR, and ALU instructions (see Appendix C for more details).
The Alu, Load and Store test programs (ATP, LTP, STP) measure the maximum
attainable utilisation of their respective FMs. They contain repetitions of either
ALU, LD or STR instructions, so that only structural dependencies exist between
instructions (in effect setting up a static pipeline or a fixed path through a
network of components). The number of instructions in each test program is
sufficient to fill the pipeline, i.e. enough instructions exist to allow the CU to
achieve a steady issue rate. The Hennessy Test (HT1) consists of a mix of the
three instructions, but without any data dependencies, which exercises the spa-
tial concurrency and out-of-order completion, for a particular schedule devised
by the compiler. HT2 is a variant of HT1 but with data dependencies, which
exercises the data forwarding mechanism as well. This program represents a
“typical” basic block of compiled code (actually a line of code in C from [70]).
In the following sections, the refinements which were made to MAP in
order to exploit ILP through the distribution and decentralisation of micronet
control are described, together with the performance results measured in the
PEPSÉ environment.
5.8 Refinement Step 1 – The Base Case
Figure 5–11 illustrates a naïve implementation of an asynchronous datapath
which does not as yet fully exploit the properties of micronets. Refinement
Step 1 only exploits the actual execution timings of micro-operations. The ex-
ecution of each instruction requires a predetermined set of micro-operations,
each initiated by signals from the CU. These are four-phased controls whose
acknowledgement signals are used as status flags for mimicking a scoreboarding
mechanism. The micro-operations for an instruction are initiated as soon
as possible by asserting the necessary control signals. The receipt of an ac-
knowledgement confirms that the associated micro-operation has begun and
the initiating control signal is de-asserted. The instruction is said to be issued
once all the asserted control signals have been acknowledged, and the next
instruction issue can begin.
These micronet control signals are described in greater detail below, with
the micro-operations required by each instruction outlined in Table 5–1:
Rx – This signal identifies the source register for the X Bus and the correspond-
ing acknowledgement is asserted once the register has been accessed, and
cleared once the data has been transferred to the operand fetch handshake
phase.
Ry – This is the same as above but for the Y Bus.
Rz – This is the same as above but for the Z Bus when used for fetching
operands.
Rof – This is similar to Rx except that it is used to access the offset register,
the contents of which are output on to the X Bus. Rof and Rx cannot be
asserted simultaneously since they both require the X bus.
AUs – This signal identifies the next operation of the ALU and the corres-
ponding acknowledgement is asserted when the interface is ready to fetch
the ALU’s operands from the register and is cleared when it initiates the
write-back handshake.
MU1 – This signal identifies a load instruction to the MU and is asserted and
cleared in the same manner as AUs. Other signals exist for both the store
(STR/STX) and the address calculation (LDA) instructions (MU2 and MU3
respectively in Table 5–1) but these have been omitted for the sake of brevity.
ZAs – This signal identifies the destination register for writing back the result
of an ALU operation via the ZA bus and the corresponding acknowledge-
ment signal is asserted when the register is ready to receive data and
cleared once the data has been written back.
ZMs – This is the same as above, but for data written back from the MU via the
ZM bus.
Instruction   Required Micro-operations
ALU           Rx    Ry    AUs   ZAs
LD            Rx    Ry    MU1   ZMs
ST            Rx    Ry    Rz    MU2
LDX           Rof   Ry    MU1   ZMs
STX           Rof   Ry    Rz    MU2
LDA           Rof   Ry    MU3   ZMs
Table 5–1: The micro-operations required for instruction execution
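Table 5–1 can be read as a mapping from each instruction to the set of control signals the CU must assert. The sketch below encodes the table as a dictionary and checks two properties that follow from the signal descriptions. The helper names `x_bus_conflict` and `shared_signals` are hypothetical and used only for illustration.

```python
# Control signals required by each instruction, transcribed from Table 5-1.
MICRO_OPS = {
    "ALU": ("Rx", "Ry", "AUs", "ZAs"),
    "LD":  ("Rx", "Ry", "MU1", "ZMs"),
    "ST":  ("Rx", "Ry", "Rz", "MU2"),
    "LDX": ("Rof", "Ry", "MU1", "ZMs"),
    "STX": ("Rof", "Ry", "Rz", "MU2"),
    "LDA": ("Rof", "Ry", "MU3", "ZMs"),
}


def x_bus_conflict(signals):
    """Rof and Rx both drive the X bus, so no instruction may need both."""
    return "Rx" in signals and "Rof" in signals


def shared_signals(a, b):
    """Control signals that two instructions would both have to assert."""
    return sorted(set(MICRO_OPS[a]) & set(MICRO_OPS[b]))


# No row of the table asks for Rx and Rof together.
assert not any(x_bus_conflict(s) for s in MICRO_OPS.values())

# A load followed by an ALU instruction competes only for the operand buses.
print(shared_signals("LD", "ALU"))
```

The signals two instructions share indicate where they contend for a resource; signals they do not share are the ones that could, in principle, proceed concurrently.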
Figures 5–11 to 5–18 illustrate the micronet model through the series of
refinements. For each refinement step, they identify the stage during instruction
execution when each of the acknowledgement signals is generated. The timing
diagrams correspond to the execution of a load followed by an add instruction
which highlights the relationship between the control signal transitions.
In Refinement Step 1, all the micro-operations for an instruction are initiated
at the same time and the next set can only be initiated after the completion of
the micro-operations of the current instruction. This effectively serialises the
instruction execution, as illustrated in the timing diagram in Figure 5–11. As
an example, the behavioural description of the CU issuing a LDA instruction
is given in Figure 5–2. In successive refinements the rôle of the CU will be
diminished by distributing the control of the micronet to local interfaces, with
micro-operations being initiated individually as early as possible.
    ...
    LDA : SEQ
        -- Wait until the handshake cycles of all control signals have been
        -- completed, by testing the input acknowledgement signals.
        wait until (¬RxA . ¬RofA . ¬RyA . ¬RzA . ¬AUA . ¬MU1A . ¬MU2A . ¬MU3A . ¬ZAsA . ¬ZMsA);
        -- Initiate instruction execution.
        assert (Rof, Ry, MU3, ZMs);
        -- Wait until the Control Signals have been acknowledged.
        wait until (RofA . RyA . MU3A . ZMsA);
        -- Instruction issued.
        deassert (Rof, Ry, MU3, ZMs);

    Rof - Outgoing Register Offset Control Signal
    RofA (active phase), ¬RofA (reset phase) - Incoming Offset Register Acknowledgement Signal
Figure 5–2: Issuing an LDA instruction in Refinement Step 1
Performance Results
Instruction   ICT     MML     Max. FM Utilisation
ALU           24 ns   24 ns   16.67%
LD            43 ns   43 ns   53.49%
ST            23 ns   21 ns   42.85%
Table 5–2: Instruction execution for Refinement Step 1
The ICT value for an instruction is determined by its slowest microagent
control signal handshake, since the instruction issue is serialised. The results in
Table 5–2 show that the Instruction Cycle Time (ICT) is equal to the Minimum
Micronet Latency (MML)² (except for the ST instruction), which is not surprising
as instructions execute sequentially but only take as long as is necessary. The
higher value for the ST instruction is due to a handshake delay, which in the
case of the LD instruction is hidden by the write-back stage (discussed later in
this section). Although there is no explicit pipelining of the datapath, different
phases of the handshaking may occur at the same time, e.g. a CM may initiate a
handshake with another CM while completing one with its FM. This is reflected
in the interface utilisations shown in Table 5–3.

² The values given here differ from those in the previous chapter for the following
reasons: DET and MML are slightly different measures (see pages 74 and 89); changes
to the CU caused by the architectural modifications described earlier in this chapter;
and a different choice of design process and cell library has been used to implement
the datapath components (see page 58).

Test Programs               Alu Test   Load Test   Store Test   HT1 & HT2
Program Execution Time      175 ns     308 ns      164 ns       143 ns
MIICT                       24 ns      43 ns       21 ns        21 ns
ALU Utilisation             16.57%     0%          0%           8.39%
MU Utilisation              0%         53.31%      39.87%       22.38%
ALU Interface Utilisation   78.7%      0%          0%           39.86%
MU Interface Utilisation    0%         88.08%      82.91%       38.46%
Table 5–3: Execution of the test programs on Refinement Step 1

Also shown in Table 5–2 are the figures for the maximum FM utilisation
which represents the proportion of the MML taken by the FM to complete its
operation. As predicted, the execution times of the test programs in Table 5–3
are the sum of their individual instruction execution times together with startup
overheads. It is observed that the utilisations achieved for the FMs (in Table 5–3)
are very close to their upper bounds (in Table 5–2) which demonstrates that
asynchronous control using a micronet can be efficient.
The Store Instruction’s Cycle Time
The MU only receives the next control signal, i.e. its next operation, once it has
completed the current instruction. Only then can the MU make a request to its
interface for the necessary operands. The increase in cycle time is due to the
operands waiting at the interface for this request because of the shared use of
the Z port (as both an input and output). This delay is effectively hidden by the
write-back operation in a load instruction.
5.9 Refinement Step 2 – Exploiting Multiple Write-back Buses
An instruction’s micro-operations are still asserted and de-asserted collectively,
but as soon as all the relevant signals become ready, i.e. without having to wait for
earlier unrelated micro-operation handshakes to finish. This introduces overlap
between successive instructions which require different micro-operations. This
feature of the micronet helps to exploit even finer-grained spatial concurrency
between instructions than previously possible. In Figure 5–12, while instruc-
tions share the operand fetch resources, the two FMs and their write-backs
(WBs) can operate concurrently. This implies that there is scope for out-of-
order completion of instructions, which introduces pipeline hazards, such as
Read-after-Write (RAW), Write-after-Write (WAW) and Write-after-Read (WAR).
These problems are addressed in the following manner:
RAW & WAW – A register locking mechanism is implemented in the register
bank without the CU having to keep track of the locked registers. The
acknowledgement signals, ZMs and ZAs, are asserted after the locking
operation, and are de-asserted once the result is written back signalling
the unlocking of the register. This implies that the destination register of
the previous instruction will have been locked before the next one attempts
to use that register. The timing diagram in Figure 5–12 assumes that the LD
and ALU instructions write to different registers. Should the destinations
be the same, then the ZAs acknowledgement signal would only be asserted
after the ZMs acknowledgement signal has been de-asserted.
WAR – This hazard is avoided without additional hardware overheads. By
definition, an instruction is issued when all of the acknowledgements from
the relevant micro-operations have been received. This implies that the
source registers of previous instructions will have already been accessed.
Also, as long as the control signals to lock registers are not asserted before
the operand fetch ones, then the register bank will ensure correct operation.
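The register locking behaviour described above can be sketched as a small model in which the register bank itself, rather than the CU, records which destinations are locked. This is an illustrative assumption about the mechanism's observable behaviour, not the MAP implementation: the class and method names are hypothetical, and a returned `False` stands in for an acknowledgement that is withheld.

```python
class RegisterBank:
    """Toy register bank that tracks locked destination registers itself."""

    def __init__(self, size):
        self.values = [0] * size
        self.locked = set()

    def lock(self, reg):
        """Try to lock a destination; False models a withheld acknowledgement."""
        if reg in self.locked:
            return False
        self.locked.add(reg)
        return True

    def can_read(self, reg):
        """A source read must wait while the register is locked (RAW)."""
        return reg not in self.locked

    def write_back(self, reg, value):
        """Writing the result back unlocks the register."""
        if reg not in self.locked:
            raise ValueError("write-back to a register that was never locked")
        self.values[reg] = value
        self.locked.discard(reg)


bank = RegisterBank(8)
assert bank.lock(3)            # LD locks r3: ZMs acknowledgement asserted
assert not bank.can_read(3)    # a dependent read has to wait (RAW)
assert not bank.lock(3)        # a second writer to r3 has to wait (WAW)
bank.write_back(3, 42)         # acknowledgement de-asserted, r3 unlocked
assert bank.can_read(3) and bank.lock(3)
```

Because the lock state lives in the register bank, the issuing side only observes whether an acknowledgement arrives, which is the property the text relies on to avoid bookkeeping in the CU.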
A behavioural description of the CU issuing a LDA instruction in this refinement
step is given in Figure 5–3.
    CASE instruction
        ...
        LDA : SEQ
            -- STATE 1. Wait until the previous handshakes on these control
            -- signals have been completed, by testing the input
            -- acknowledgement signals.
            wait until (¬RofA . ¬RyA . ¬MU3A . ¬ZMsA);
            assert (Rof, Ry, MU3, ZMs);
            -- STATE 2.
            wait until (RofA . RyA . MU3A . ZMsA);
            deassert (Rof, Ry, MU3, ZMs);

    Rof - Outgoing Register Offset Control Signal
    RofA (active phase), ¬RofA (reset phase) - Incoming Offset Register Acknowledgement Signal
Figure 5–3: Issuing an LDA instruction in Refinement Step 2
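The difference between the issue conditions of Refinement Steps 1 and 2 can be illustrated with a simple timing sketch. It rests on several assumptions that are not part of the source: the function `issue_times` is hypothetical, each control signal is given a made-up busy time (from assertion until its acknowledgement returns to the reset phase), and all of an instruction's signals are asserted together at a single instant.

```python
MICRO_OPS = {
    "ALU": ("Rx", "Ry", "AUs", "ZAs"),
    "LD":  ("Rx", "Ry", "MU1", "ZMs"),
}
ALL_SIGNALS = ("Rx", "Rof", "Ry", "Rz", "AUs", "MU1", "MU2", "MU3", "ZAs", "ZMs")


def issue_times(program, busy_ns, wait_for_all):
    """Return the time at which each instruction's control signals are asserted.

    wait_for_all=True  mimics Step 1: every handshake must have completed.
    wait_for_all=False mimics Step 2: only the handshakes on the signals the
    instruction itself needs must have completed.
    """
    free_at = {s: 0 for s in ALL_SIGNALS}  # when each handshake is complete
    times = []
    for instr in program:
        needed = MICRO_OPS[instr]
        watched = ALL_SIGNALS if wait_for_all else needed
        # Issue is in order, so never earlier than the previous instruction.
        start = max([free_at[s] for s in watched] + times[-1:])
        for s in needed:
            free_at[s] = start + busy_ns[s]
        times.append(start)
    return times


# Hypothetical busy times in ns, chosen only to make the overlap visible.
busy = {"Rx": 10, "Ry": 10, "AUs": 20, "ZAs": 24, "MU1": 35, "ZMs": 43}
program = ["LD", "ALU", "LD"]

print("Step 1 style:", issue_times(program, busy, wait_for_all=True))
print("Step 2 style:", issue_times(program, busy, wait_for_all=False))
```

Under these assumed delays the ALU instruction is asserted as soon as the operand fetch signals are free, while the load's memory unit and write-back handshakes are still in progress. A second load gains nothing, since it needs the same signals as the first.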
Performance Results
Instruction   ICT     MML     Max. FM Utilisation
ALU           24 ns   24 ns   16.67%
LD            43 ns   43 ns   53.49%
ST            23 ns   21 ns   42.85%
Table 5–4: Instruction execution for Refinement Step 2
Test Programs               Alu Test   Load Test   Store Test   Hennessy Tests
Program Execution Time      175 ns     308 ns      164 ns       106 ns
MIICT                       24 ns      43 ns       21 ns        17 ns
ALU Utilisation             16.57%     0%          0%           12%
MU Utilisation              0%         53.31%      39.87%       32%
Register Bank Utilisation   41.42%     23.18%      22.15%       39%
ALU Interface Utilisation   78.7%      0%          0%           57%
MU Interface Utilisation    0%         88.08%      82.91%       55%
Register Interface Util.    70.41%     92.72%      48.73%       71%
Table 5–5: Execution of the test programs on Refinement Step 2
This refinement step exploits limited spatial concurrency between instruc-
tions with different micro-operations, i.e. instructions which require different
microagents. Therefore, improvements are only observed in the Hennessy Tests
where instructions using different micro-operations (ALU and memory instruc-
tions) may execute concurrently, and this is reflected in the greater utilisation
figures for the respective units as shown in Table 5–5.
5.10 Refinement Step 3 – Using a Single Write-back Bus
In the previous versions of the architecture, each functional unit had its own
write-back bus which allowed result operands to be written back to the registers
as soon as they became available. However, supporting n functional units would
require n write-back buses (incurring area costs) and n write-ports on the register
bank (incurring performance costs). The micronet datapath (Figure 5–13) in this
refinement step has only one write-back bus, i.e. the functional units share the
ZM bus to write data back to the registers. The control signal ZAs is no longer
used so there is only one write-back microagent control signal ZMs. This
has a significant effect on performance since previously concurrent write-backs
must now take place sequentially. Also, the instruction issue conditions force
instructions which need to write data back to execute completely sequentially
again.
Performance Results
Instruction   ICT     MML     Max. FM Utilisation
ALU           24 ns   24 ns   16.67%
LD            43 ns   43 ns   53.49%
ST            23 ns   21 ns   42.85%
Table 5–6: Instruction execution on Refinement Step 3
Table 5–7 shows increases in the execution time for both Hennessy Test
programs, which reinforces the advantages of multiple write-back buses (see
Table 5–5). Another interesting point to note is that the execution time of this
test program is independent of data dependencies. Each instruction issue is
stalled until the previous one has written its result back to the registers. This is
a return to almost complete sequential execution (as in Refinement Step 1). (The
difference in PETs for the Hennessy Tests in Step 1 and here is due to concurrency
between the ST and ALU operations.) Although the write-back bus does not seem
to be a bottleneck, there are times when a result is delayed waiting for another
write-back operation to complete. This can affect performance especially if the
stalled data item is required by an instruction on the program’s critical path.

Test Programs               Alu Test   Load Test   Store Test   Hennessy Tests
Program Execution Time      175 ns     308 ns      164 ns       139 ns
MIICT                       24 ns      43 ns       21 ns        17 ns
ALU Utilisation             16.57%     0%          0%           9.02%
MU Utilisation              0%         53.31%      39.87%       24.06%
Register Bank Utilisation   41.42%     23.18%      22.15%       33.83%
ALU Interface Utilisation   78.7%      0%          0%           42.86%
MU Interface Utilisation    0%         88.08%      82.91%       41.35%
Register Interface Util.    78.7%      44.04%      55.06%       66.17%
WB Bus Utilisation          37.28%     20.86%      39.87%       33.83%
Table 5–7: Execution of the test programs on Refinement Step 3
5.11 Refinement Step 4 – Asynchronous Micro-operation Issue
In previous refinement steps, the control unit would not assert any of the in-
dividual control signals for issuing an instruction until all of them could be
asserted together. This constraint is now relaxed so that once an instruction
has been chosen to be issued, the individual control signals required by that
instruction can be asserted asynchronously as soon as possible. This allows
micro-operations belonging to different instructions to overlap (see the timing diagram
of Figure 5–14). Note that an instruction’s control signals can only be de-asserted
once all the relevant control signals have been acknowledged, this being the time
at which the instruction is considered to have been issued (also shown in the
timing diagram). This refinement aims to improve the instruction execution by
exploiting a finer grain of ILP than previously possible in synchronous designs,
i.e. concurrency between individual components within stages of a datapath.
This also speeds up the instruction issue of blocked or stalled instructions. Only
the control signals to the common resources (which have not finished) will be
stalled thus allowing the ready resources to execute their micro-operations for
the current instruction earlier than before. However, relaxing this constraint
re-introduces possible hazards and efficient mechanisms have to be devised to
avoid them.
Instruction Issue
The micro-operations for an instruction are initiated individually as soon as
possible by asserting the necessary control signals. The receipt of an acknow-
ledgement confirms that the associated micro-operation has begun and the
instruction is said to be issued once all of the asserted control signals have been
acknowledged. The initiating control signals can then be de-asserted and the
next instruction issue can begin. As in Refinement Step 2, micro-operations
relating to different instructions may overlap. However, while Step 2 benefited
from spatial concurrency (made possible through the availability of resources),
this refinement step exploits mainly temporal concurrency through a limited
amount of pipelining. Fortunately, thanks to the properties of the micronet the
hazard avoidance mechanisms are implicit in the orderings of the assertions of
the control signals, known as pre-issue conditions, and these are discussed below.
Since some micro-operations share the same resources they obviously cannot
execute simultaneously. These restrictions are also applied by the pre-issue
conditions.
RAW – An instruction is considered issued once all of its resource control signals
have been acknowledged by the relevant microagents (i.e. the microagents
are active). This allows the control signals to be cleared and the next
instruction issue to begin. Recall that the control acknowledgement signal,
ZMs, is asserted once the register is locked and cleared once data has been
written to it. Thus, the destination register will be locked before the
following instruction attempts to read from it, since the next instruction
issue cannot be initiated until the previous set of control signals have been
acknowledged.
WAR – When a register is used both as a source and a destination within the
same instruction, then it is necessary to ensure that the source data is
obtained before the register is locked, otherwise deadlock will occur. In
the previous refinement steps no action was required to avoid this hazard
since this criterion was met by the issuing conditions (the set of microagent
control signals being asserted together) and the register bank. However, it
is now possible for ZMs to be asserted before the source operand control
signals Rx and Ry and therefore the CU stalls the assertion of ZMs until
Rx and Ry have been asserted.
Operand fetch – It is also necessary to ensure that a functional unit gets the
correct operands since it is possible for two units to require operands at
the same time. Functional units fetch each of their operands separately
over the operand fetch buses (X and Y) while acknowledging the control
signal (i.e. operation request) from the CU in the following manner:
1. If the bus is free and no other request is in progress then the request
signal (to register port for this bus) is asserted.
2. When valid data is detected, the data is latched and the request
signal is cleared. Data is, of course, only latched by the functional
unit interface which made the original request.
Simultaneous operand requests by FMs to the same register bank CM
micro-operation can lead to one of them acquiring the wrong operand.
This can be avoided by the CU delaying the assertion of the control signal
to one of the functional units. The CU need only delay the assertion
of the control signal to a FM until the FM of the previous instruction
has made its operand request(s) to the registers. This event will have
occurred before the acknowledgement signals of the previous instruction’s
“operand fetch” micro-operations (Rx, Ry or Rz) have been de-asserted.
WAW – A situation may arise where the current instruction is stalled because
a previous instruction has not written its result back to the destination re-
gister. This stall is necessary because the current instruction might either
attempt to write its result to an unlocked register (which may eventu-
ally cause a deadlock) or write data to a location out of program order.
In this refinement, write-backs are still forced to occur in-order. The
solution adopted here is very simple since the above conditions can be
avoided by preventing each functional unit from writing data back until
its control signal from the CU has been de-asserted (an implicit go-write
signal). This is sufficient since an instruction’s control signals cannot be
de-asserted before ZMs is asserted (see timing diagram in Figure 5–14).
(In the CU, the control signals will be de-asserted once all the required ac-
knowledgements have been received, which includes ZMs, implying that
the destination register has been locked.) Note that if the CU attempts
to lock a register which is already locked, then the acknowledgement signal
will not be asserted and the current register lock request will stall. This
mechanism guarantees that write-backs to the same register occur in the
correct order without stalling the instruction issue, thereby allowing
the instructions to execute concurrently with only write-backs taking place
sequentially. Historically, the CDC6600 [162] used a Go-Write signal which
sequentialised the execution of the offending instructions.
Bus Contention – Only the functional units and the register bank can write on
to the Z Bus. The mechanism to avoid WAW hazards prevents contention
between functional units and therefore the only possibility for contention
is when the Register Bank and one of the functional units attempt to write
on the bus simultaneously. However, access to this bus is arbitrated by
the CU, through the mutually-exclusive assertions of the operand fetch
control signal Rz, and the write-back control signal ZMs.
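The pre-issue conditions amount to ordering rules on when the CU may assert each control signal of the instruction currently being issued. The sketch below encodes the WAR and operand fetch rules as a predicate. It is an assumed simplification: the function `may_assert` and its parameters are hypothetical, the ALU is assumed to use ZMs for write-back as on the single write-back bus of Refinement Step 3, and the arbitration between Rz and ZMs on the Z bus is not modelled.

```python
OPERAND_FETCH = {"Rx", "Ry", "Rz", "Rof"}
FM_CONTROL = {"AUs", "MU1", "MU2", "MU3"}


def may_assert(signal, needed, asserted, ack_reset, prev_fm_requested):
    """Check whether the CU may assert one control signal now.

    needed            -- signals the current instruction requires
    asserted          -- signals already asserted for the current instruction
    ack_reset         -- signals whose previous handshake has completed
    prev_fm_requested -- True once the previous instruction's FM has made
                         its operand request(s) to the registers
    """
    if signal not in needed or signal in asserted:
        return False
    if signal not in ack_reset:
        return False  # the resource is still busy with an earlier micro-operation
    if signal == "ZMs":
        # WAR: lock the destination only after the operand fetches are asserted.
        return (needed & OPERAND_FETCH) <= asserted
    if signal in FM_CONTROL:
        # Operand fetch: hold this FM back until the previous FM has placed
        # its operand requests, so that neither latches the wrong operand.
        return prev_fm_requested
    return True


needed = {"Rx", "Ry", "AUs", "ZMs"}   # assumed ALU instruction on a single WB bus
ready = set(needed)                   # all previous handshakes have completed

print(may_assert("ZMs", needed, set(), ready, True))          # operands not yet requested
print(may_assert("ZMs", needed, {"Rx", "Ry"}, ready, True))   # operands requested
print(may_assert("AUs", needed, set(), ready, False))         # previous FM still fetching
```

Expressing the conditions as a per-signal predicate reflects the point made in the text: each signal is asserted individually as soon as its own condition holds, rather than waiting for the whole set.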
The refinements to the behavioural description of the CU issuing an LDA instruction
are shown in Figure 5–4.
Figure 5–4: Issuing an LDA instruction in Refinement Step 4
[Behavioural description diagram. Legible annotations: control signals are asserted in accordance with the pre-issue conditions; wait condition (RofA . RxA . RyA . MU3A). Legend: Rof; RofA (active phase), RofA (reset phase): incoming offset register acknowledgement signal.]
Performance Results
Instruction   ICT     MML     Max. FM Utilisation
ALU           21 ns   24 ns   19.05%
LD            42 ns   43 ns   54.76%
ST            23 ns   21 ns   42.85%
Table 5–8: Instruction execution on Refinement Step 4

Test Programs               Alu Test   Load Test   Store Test   Hennessy Tests
Program Execution Time      157 ns     302 ns      165 ns       119 ns
MIICT                       21 ns      42 ns       22 ns        16 ns
ALU Utilisation             18.54%     0%          0%           10.62%
MU Utilisation              0%         54.39%      39.62%       28.32%
Register Bank Utilisation   37.09%     23.65%      35.85%       43.36%
ALU Interface Utilisation   80.13%     0%          0%           68.14%
MU Interface Utilisation    0%         89.86%      78.62%       48.67%
Register Interface Util.    93.38%     89.19%      88.68%       77.88%
Table 5–9: Execution of the test programs on Refinement Step 4
Table 5–9 shows some improvement in the execution times over Refinement
Step 3. In fact the PETs for the instruction test programs are better than the
corresponding values in Refinement Step 2 (see Table 5–5). These performance
gains are due to the small improvements in the instruction cycle times as shown
in Table 5–8. The magnitude is determined by the overlap between the operand
access of the current instruction and the write-back of the previous one. In
the example under consideration there can only be two program instructions
active in the datapath simultaneously. The likelihood of operand fetches and
write-backs occurring concurrently depends on the FM delay.
Although the Hennessy Test PETs also show improvements over the previous
refinement step, they are still worse than the figures in Step 2. In Refinement
Step 2, the programs exploited spatial parallelism, whereas now they only
exploit temporal parallelism. The latter is limited, due to the control unit being
unable to complete the issuing of the current instruction (specifically, locking
the destination register) until the previous instruction has written its result back
to the register. This is necessary to enforce in-order instruction completion and
to prevent contention on the write-back bus. Also, the MIICT for the Hennessy
Test (in Table 5–9) is less than the corresponding values for the other test
programs. This is due to the overlapping of independent instruction issues.
While analytical estimates of program execution times (PETs) for the Alu,
Load and Store Tests (see Equation 5.1) match those obtained from the simula-
tion, it is less easy to obtain the same for programs with a mix of instructions, as
in the case of the Hennessy Test. The execution times for such programs depend
on a number of factors, such as the relative values of the instruction issue and
cycle times and resource availability, which affect the amount of concurrency
available.
5.12 Refinement Step 5 – Out-of-Order Write-Backs
Enforcing in-order write-backs restricts the amount of concurrency which can
be exploited especially when functional unit execution times vary significantly.
However, supporting out-of-order completion of instructions in an asynchron-
ous environment is more difficult than under synchronous control. Determining
the precise order in which results will become available is virtually impossible
since micro-operation delays vary (subject to data and environmental parameters).
Therefore a decentralised bus arbitration scheme is required, such as a
token ring distributed amongst the CMs that write to the bus. Out-of-order
instruction completion can now be supported by tagging the write-back
data with the address of its destination register. However, the micro-operation
to write data back to the register bank can no longer be controlled by the CU
since the order of the write-backs cannot be predicted. Therefore, write-backs
are initiated directly by the CMs of the FMs which require the service, i.e. the
write-back micro-operation is initiated by the micro-operations in the previous
stage.
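The effect of tagging can be illustrated with a short Python sketch. It assumes a deliberately simplified register bank in which each write-back is a (destination register, value) pair and a lock bit marks registers awaiting results; the function name and data layout are hypothetical and are not taken from the thesis.

```python
def apply_write_backs(registers, locked, arrivals):
    """Apply (dest_register, value) pairs in whatever order they arrive."""
    for dest, value in arrivals:
        if not locked[dest]:
            raise RuntimeError(f"write-back to unlocked register r{dest}")
        registers[dest] = value   # the tag selects the register
        locked[dest] = False      # the register becomes readable again
    return registers


regs = [0] * 4
locked = [False, True, True, False]     # r1 and r2 await results
# The result for r2 (say, a fast ALU operation) arrives before that for
# r1 (say, a slower load), i.e. out of program order.
apply_write_backs(regs, locked, [(2, 7), (1, 99)])
```

Because each result carries its own destination address, the register bank needs no knowledge of the order in which the functional units finish.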
Since the Z bus is shared by the functional units which generate results
and access to the bus is no longer controlled by the CU, the potential for bus
contention exists. Two (or more) CMs may attempt to write on to the bus at
the same time (or within the bus propagation delay), and determining the precise
times at which data becomes available is very difficult. A centralised
request-grant arbitration scheme is possible, but it becomes more complex as the
number of functional units increases. A priority scheme could be incorporated
to give certain functional units, especially those with longer delays like the
memory unit, access to the bus before other waiting units. An alternative, more
distributed scheme can be achieved by using a token ring. The token need only
be held for the duration of the data transfer and not the whole handshake. The
ring is distributed amongst the FU interfaces and is very simple to implement.
However, as the number of functional units increases, so does the token’s cycle
time, and for architectures with a large number of FUs this may not prove to be
a satisfactory solution.
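The trade-off of the token ring can be made concrete with a small simulation. The sketch below is a minimal model under stated assumptions: a single token visits the bus writers in a fixed order, and the hop delay and transfer time are hypothetical units rather than figures from the thesis.

```python
def token_ring_grants(n_units, requests, hop_delay=1, transfer_time=2):
    """Simulate one token circulating among n_units bus writers.

    requests maps a unit index to the time at which its result is ready.
    Returns {unit: (grant_time, release_time)}.
    """
    pending = dict(requests)
    grants = {}
    time, holder = 0, 0
    while pending:
        if holder in pending and pending[holder] <= time:
            # The holder keeps the token only for the data transfer.
            grants[holder] = (time, time + transfer_time)
            time += transfer_time
            del pending[holder]
        holder = (holder + 1) % n_units   # pass the token on
        time += hop_delay
    return grants


two_writers = token_ring_grants(3, {0: 0, 2: 0})
wait_small_ring = token_ring_grants(3, {2: 0})[2][0]
wait_large_ring = token_ring_grants(6, {5: 0})[5][0]
```

Grants never overlap, since only the token holder may drive the bus, while the time a unit may wait for the token grows with the number of units on the ring.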
The register control signal ZMs has to be modified in order to decouple the
CU from the process of writing data back into the register:
ZMs – Now just locks the destination register and prevents read access to it.
The corresponding control signal acknowledgement is now set on receiv-
ing the request (the asserted ZMs control signal) from the CU, and cleared
when the register is locked. Again, ZMs and Rz cannot be asserted simul-
taneously, since it is now necessary to guarantee that either the register has
been locked prior to the next instruction being issued (in case of a RAW
dependency), or that the register has been read before it is locked (in the
case of a WAR dependency). Note that in the case of a WAW dependency,
it is still necessary for the functional unit control signal to be de-asserted
after the destination register has been locked, i.e. de-asserted only after
the de-assertion of ZMs has been acknowledged.
Performance Results
Instruction   Instruction Cycle Time   Micronet Latency   Maximum FU Utilisation
ALU           21 ns                    24 ns              19.05%
LD            42 ns                    43 ns              54.76%
ST            23 ns                    21 ns              42.85%
Table 5–10: Instruction execution for Refinement Step 5

Test Programs               Alu Test   Load Test   Store Test   Hennessy Tests
Program Execution Time      159 ns     302 ns      165 ns       114 ns
MIICT                       21 ns      42 ns       23 ns        17 ns
ALU Utilisation             18.3%      0%          0%           11.11%
MU Utilisation              0%         54.39%      39.62%       29.63%
Register Bank Utilisation   29.41%     23.65%      32.70%       56.48%
ALU Interface Utilisation   80.39%     0%          0%           72.22%
MU Interface Utilisation    0%         89.86%      79.25%       72.22%
Register Interface Util.    86.27%     85.81%      91.19%       85.19%
Table 5–11: Execution of the test programs on Refinement Step 5
The results in Table 5–11 show that in this refinement, out-of-order instruction
completions (i.e. write-backs) have little effect on performance. This is to
be expected in the instruction test programs, where there is no scope at all for
benefit, but even the Hennessy Test shows only a slight improvement. The
explanation is as follows: In order to benefit from out-of-order write-backs, the
architecture needs to be able to exploit spatial parallelism. In the micronet,
this means that the instruction issue rate needs to be faster than the instruction
execution rates. It can be observed in Table 5–11, that the Minimum Instruction
Issue Cycle Time (MIICT) is nearly as long as the smallest Instruction Cycle
Time (ICT). This suggests that the issue of instructions is a limiting factor on
the degree of spatial concurrency that can be exploited. In order to achieve
higher concurrency it is necessary for the IICT to be as small a proportion of
the smallest ICT as possible. Another reason is the limited amount of spatial
parallelism available in the test programs themselves and the general (conser-
vative) dependency rules applied when issuing instructions. These issues will
be addressed in following refinements.
5.13 Refinement Step 6 – Faster Instruction Issue
The issue cycle time determines the rate at which instructions can be issued
to the micronet datapath and should this be a limiting factor on performance
then the handshake cycle times of the microagent control signals have to be
minimised. This can be achieved by either improving the hardware design of
the control circuits, or alternatively, by redefining the handshake cycle itself (the
option considered in this refinement step).
Here, in Refinement Step 6, the rôle of the CU is diminished further by
distributing the control of the micronet to individual CMs. While the CU ini-
tiates the micro-operations individually for the current instruction as early as
possible via the corresponding CMs as before, the rôle of the CMs has been en-
hanced to more than just controlling local communications between FMs. They
effectively buffer the initiation of the micro-operations from the CU until the
respective FMs are ready to perform. This increases the number of operations
which actually take place concurrently. This is also due in part to the changes
in the significance of the control signal handshake. The acknowledgements to
the control signals are revised as shown below:
Rx – This signal still identifies the source register whose contents are to be
transferred across the X Bus. However, the corresponding acknowledge-
ment is asserted by the CM of the register bank when the X bus operand
fetch micro-operation is ready to access the register, and de-asserted once
the operand fetch handshake is in progress over the X bus.
Ry – Same as above, but for the Y Bus.
Rz – Same as above, but for the Z Bus.
Rof – Same as above, but also with the restriction that the control signals Rx
and Rof cannot both be active simultaneously.
Rz – The acknowledgement signal is cleared when the register interface has re-
ceived the data transfer acknowledgement from the destination functional
unit. (Z bus is data driven).
AUs – This still identifies the next operation to be carried out by the ALU. The
acknowledgement, however, is now asserted when the corresponding
CMs are ready to fetch the operands from the registers and is cleared once
the FM micro-operation has completed.
MC1 – This signal still identifies a load instruction for the MU. The acknow-
ledgement is asserted and cleared as for AUs.
ZMs – This signal still identifies the destination register which has to be locked.
However, the corresponding acknowledgement signal is asserted when
the CM is ready and de-asserted once the operation has been completed,
as described in the previous refinement step.
As in previous refinement steps, hazards are dealt with by properly sequen-
cing the control signals (the pre-issue conditions):
WAR – A functional unit cannot generate a result without first receiving its in-
put operands. These are fetched in instruction order due to the handshake
mechanism. The ZMs signal is only asserted after all the previous operand
fetch control signal handshakes have been completed. This also prevents
the destination register being locked before operands are accessed.
WAW – The mechanism is similar to before, except now the de-assertion of
the functional unit control signals is no longer delayed until the ZMs
acknowledgement signal is de-asserted. Instead, the go-write signal now
originates explicitly from the register interface once the register has been
locked and not implicitly from the CU.
RAW – The CU delays the assertion of the operand fetch control signals Rx,
Ry and Rz until the previous ZMs control acknowledgement signal has
been de-asserted, which indicates the locking of the previous destination
register.
Operand Fetch – The pre-issue conditions are the same as before. For each instruction,
the control signal to the functional unit interface is only asserted after
the required operand fetch control signals. This prevents bus contention
on the operand fetch buses and guarantees that operands will be fetched
in-order.
Write-back Contention – This is prevented by the use of a token ring to arbitrate
accesses to the write-back (Z) bus. Of course, this problem could be
obviated by using dedicated buses for a small number of FMs, but this may be
impractical for larger designs.
The behavioural description of the CU issuing an LDA instruction in this refinement
step is given in Figure 5–5.
Figure 5–5: Issuing an LDA instruction in Refinement Step 6
[Behavioural description diagram. Legible annotations: control signals are asserted in accordance with the pre-issue conditions; wait conditions (RofA . RxA . RyA . MU3A), (RofA . RxA . RyA . ZMsA), (RyA . ZMsA) and (RofA . RxA . ZMsA); actions assert (MU3) and assert (ZMs). Legend: Rof; RofA (active phase), RofA (reset phase): incoming offset register acknowledgement signal.]
Performance Results
Instruction   ICT     MML     Max. FM Utilisation
ALU           15 ns   24 ns   26.67%
LD            39 ns   43 ns   58.97%
ST            23 ns   21 ns   42.85%
Table 5–12: Instruction execution on Refinement Step 6

Test Programs               Alu Test   Load Test   Store Test   Hennessy Tests
Program Execution Time      123 ns     287 ns      165 ns       102 ns
MIICT                       15 ns      39 ns       23 ns        10 ns
ALU Utilisation             23.93%     0%          0%           12.5%
MU Utilisation              0%         57.3%       39.62%       33.33%
Register Bank Utilisation   59.83%     24.91%      35.22%       66.67%
ALU Interface Utilisation   94.87%     0%          0%           83.33%
MU Interface Utilisation    0%         97.86%      79.25%       84.38%
Register Interface Util.    93.16%     89.32%      93.08%       90.63%
Table 5–13: Execution of the test programs on Refinement Step 6
The instruction issue times have been reduced by minimising the delay
between the assertion of the control signals and the arrival of their acknow-
ledgements. This improvement was achieved by pipelining the control signal
handshakes. Previously, the control acknowledgement signals were asserted
once the particular action had taken place. Now, the interfaces (if not already
busy with a previous handshake) will acknowledge a request immediately, sig-
nifying that the operation will take place, and de-assert the acknowledgement
signal once the task has been completed. Although the particular task may
have been completed, the interface may still continue to be busy completing
successive tasks and may not be ready to acknowledge another request from
the CU immediately, thus hiding communication delays.
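The benefit of acknowledging a request before the action completes can be sketched as follows. The model is a simplification made for illustration: handshake delays are ignored, each interface serves one request at a time, and the function name, interface names and durations are hypothetical.

```python
def completion_time(requests, early_ack):
    """Return the time at which the last of the requests completes.

    requests is a list of (interface, duration) pairs issued in order.
    With early_ack=False the CU waits for each action to finish before
    issuing the next request. With early_ack=True an idle interface
    acknowledges at once, so the CU waits only while that interface is
    still busy with an earlier request.
    """
    busy_until = {}   # interface -> time at which it becomes free
    cu_time = 0       # time at which the CU can issue the next request
    finish = 0
    for interface, duration in requests:
        start = max(cu_time, busy_until.get(interface, 0))
        end = start + duration
        busy_until[interface] = end
        finish = max(finish, end)
        cu_time = start if early_ack else end
    return finish


program = [("alu", 4), ("mu", 23), ("alu", 4)]   # hypothetical durations in ns
late = completion_time(program, early_ack=False)
early = completion_time(program, early_ack=True)
```

With the late acknowledgement the three operations are serialised (31 ns in this example), whereas with the early acknowledgement the second ALU operation overlaps the memory operation and the program finishes when the slowest interface does (23 ns).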
An improvement in the ICTs of instructions which need to write data back
to the registers, i.e. the LD and ALU instructions, can be observed in Table 5–12.
This is due to the decentralisation of the write-back control to the relevant CMs.
These improvements are reflected in the shorter PETs for the Alu, Load, HT1
and HT2 test programs, as shown in Table 5–13. Thus, the faster issue cycle time
allows these test programs to benefit from the write-back modifications made
in the earlier refinement steps.
5.14 Refinement Step 7 – Data Forwarding
This refinement step implements two features to increase the amount of fine-
grained concurrency which is available: firstly, the well-known technique of
data forwarding to reduce the effect of stalls due to data dependencies between
instructions and secondly, the application of the pre-issue conditions only when
strictly necessary in order to reduce CU stalls. Implementing the specific
dependency rules requires checking the actual register addresses within instructions.
This requires extra hardware and increases the control unit’s complexity;
however, all of this is required for out-of-order instruction issue. Therefore, it
may only be worthwhile if the expected or exploitable performance (determ-
ined by the target application) outweighs the cost of implementing the specific
dependency rules (which depends on implementation technology). Alternat-
ively, it may be possible to generate the required information at compile time
and encode it into the instruction word.
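Checking the actual register addresses amounts to comparing the destination and source fields of two instructions. A minimal Python sketch of such a check is given below; representing an instruction as a (dest, sources) pair is an assumption made for illustration and is not the instruction format of the thesis.

```python
def hazards(earlier, later):
    """Return the set of hazards between two instructions.

    Each instruction is a (dest, sources) pair of register numbers;
    dest is None for instructions that write no register.
    """
    e_dest, e_srcs = earlier
    l_dest, l_srcs = later
    found = set()
    if e_dest is not None and e_dest in l_srcs:
        found.add("RAW")   # later reads what earlier writes
    if l_dest is not None and l_dest in e_srcs:
        found.add("WAR")   # later writes what earlier reads
    if e_dest is not None and e_dest == l_dest:
        found.add("WAW")   # both write the same register
    return found


raw = hazards((1, (2, 3)), (4, (1, 5)))
war = hazards((1, (2, 3)), (2, (4, 5)))
waw = hazards((1, (2, 3)), (1, (4, 5)))
independent = hazards((1, (2, 3)), (4, (5, 6)))
```

Only when the returned set is non-empty would the corresponding pre-issue condition need to be applied; independent instructions can be issued without it.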
Previously, the micronet imposed a feed-forward discipline in the pipelines.
This is now relaxed by introducing feedback paths, which allow
required operands to move against the flow. In the micronet datapath, the
operands are only available for a short period of time (i.e. while they are be-
ing transferred to the register interface) after which they are obtained from the
register bank (in effect the architecture implements only one stage of the coun-
terflow pipeline [157], and relies on fast operand fetch access from the register
bank).
With data transfer on the Z bus being tagged, the CMs can identify and in-
tercept operands for which they may be waiting. This mechanism, reminiscent
of the IBM 360/91 common bus architecture [163], has been implemented by
exploiting the feedback loops within the micronet. In the event of data for-
warding, where data is routed directly to the CM of the waiting FM, the CM’s
previous request for that operand is, in effect, cancelled by initiating a separate
handshake. This frees the corresponding operand fetch CM to service its next
request. An alternative approach would be to implement operand bypassing,
where the operand is fed back to the operand fetch micro-operation. This fea-
ture avoids both duplicated tag matching in each of the data forwarding CMs
and the need for the cancel handshake, at the expense of being slower than data
forwarding. However, the functionality is viewed as internal to the register
bank, since, from outside the data is obtained from the same place – just slightly
quicker than expected. This means that this method would fit perfectly into the
micronet model since no further modification would be required to any other
part of the datapath.
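The tag matching performed by the data forwarding CMs can be sketched in Python as follows. The dictionary of waiting CMs, the CM names and the function itself are hypothetical, and the cancel handshake is reduced to removing the pending entry.

```python
def broadcast_on_z_bus(tag, value, waiting, registers):
    """Deliver a tagged result to the register bank and to waiting CMs.

    waiting maps a (hypothetical) CM name to the register tag it awaits.
    Returns the names of the CMs that captured the forwarded operand;
    their pending operand fetch requests would then be cancelled.
    """
    registers[tag] = value                      # normal write-back
    captured = [cm for cm, wanted in waiting.items() if wanted == tag]
    for cm in captured:
        del waiting[cm]                         # operand obtained directly
    return captured


regs = {}
waiting = {"alu_x": 3, "mu_y": 5}
captured = broadcast_on_z_bus(3, 42, waiting, regs)
```

Here the CM waiting on register 3 intercepts the value as it passes on the Z bus, while the CM waiting on register 5 remains pending.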
The dual rôle of the Z Bus can now no longer be supported due to the
data-forwarding mechanism. A separate operand fetch bus (W Bus) is used,
making the Z Bus purely a write-back bus (see Figure 5–17). (In the previous
refinements, the Z Bus was used as both an operand bus (for STR instructions)
and as a write-back bus.) By separating the functionality, the register interface
for the Z bus is simplified. The traffic on the Z Bus is reduced and since the
register interface no longer needs to send data on the Z Bus, it can be removed
from the token ring (speeding up the ring’s cycle time). Also, this allows the
third operand for a STR instruction to be forwarded when necessary. As one
might expect, in terms of exploiting concurrency, it is better if fewer resources are
shared between operations of different pipeline stages.
Performance Results
Columns “HT2” and “HT2DF” refer to the cases without and with data-forwarding, respectively.

Instruction   ICT     MML     Max. FM Utilisation
ALU           15 ns   24 ns   26.66%
LD            38 ns   43 ns   60.52%
ST            23 ns   21 ns   42.85%
Table 5–14: Instruction execution on Refinement Step 7

Test Programs     Alu Test   Load Test   Store Test   HT1      HT2      HT2DF
PET               121 ns     280 ns      165 ns       83 ns    97 ns    91 ns
MIICT             15 ns      39 ns       23 ns        8 ns     10 ns    10 ns
ALU Util.         24.35%     0%          0%           15.58%   13.18%   14.11%
MU Util.          0%         58.76%      39.62%       41.55%   35.16%   37.65%
Reg. Bank Util.   60.87%     25.55%      22.01%       58.44%   67.03%   65.88%
ALU If. Util.     94.78%     0%          0%           79.22%   80.21%   80%
MU If. Util.      0%         97.81%      79.25%       72.72%   74.72%   72.94%
Reg. If. Util.    94.78%     89.42%      93.08%       90.90%   92.30%   92.94%
Table 5–15: Execution of the test programs on Refinement Step 7

As is expected, the results show improvements in
performance in programs with data dependent instructions, and this is recorded
in the figures for the Hennessy Test in Table 5–15. As a side-effect of the data-
forwarding, the PET improvements in the Load and Alu Test are due to the
introduction of the W Bus which removed the Register Interface from the token
ring, thereby reducing the ring’s cycle time. The division of the Z bus into
separate buses also improves the scope for greater concurrency. For the first
time the PETs for HT1 and HT2 differ since the pre-issue conditions have been
applied only when necessary. Since HT1 has no data dependencies between
instructions it executes faster.
5.15 Refinement Step 8 – The Last Control Modification
In this final refinement step, both the assertion and de-assertion of the control
signals occur independently of each other. This increases further the concur-
rency between micro-operations and maximises the exploitation of fine-grained
concurrency between instructions for a given architecture. A behavioural
description of the CU issuing an LDA instruction is given in Figure 5–6. Previously,
the FM control signal acknowledgements indicated whether their respective
functional units were busy. This is no longer the case, since these signals are
de-asserted on the receipt of the required operands. This effectively reduces the







Figure 5–6: Issuing an LDA instruction in Refinement Step 8
[Behavioural description diagram. Legible fragments: a PAR block containing "assert (Rof); wait until (RofA); deassert (Rof);", "assert (Ry); wait until (RyA); deassert (Ry);", "assert (ZMs); wait until (ZMsA); deassert (ZMs);" and "assert (MU3); wait until (MU3A); deassert (MU3);", guarded by wait conditions over RofA, RxA, RyA, ZMsA and MU3A. Legend: Rof, outgoing register offset control signal; RofA (active phase), RofA (reset phase), incoming offset register acknowledgement signal; marked conditions are only applied when a dependency exists.]
Performance Results
Instruction   ICT     MML     Max. FM Utilisation
ALU           12 ns   24 ns   33.33%
LD            23 ns   43 ns   100%
ST            12 ns   21 ns   75%
Table 5–16: Instruction execution for Refinement Step 8

Test Programs              Alu Test   Load Test   Store Test   HT1      HT2DF
Program Exec Time          103 ns     188 ns      98 ns        79 ns    91 ns
Effective Speed Up         1.75       1.66        1.71         1.89     1.62
MIICT                      10 ns      10 ns       10 ns        10 ns    10 ns
ALU Utilisation            28.87%     0%          0%           16.44%   14.11%
MU Utilisation             0%         88.46%      67.74%       43.84%   37.65%
Register Bank Util.        72.16%     32.45%      37.63%       64.38%   64.71%
ALU Interface Util.        93.81%     0%          0%           78.08%   78.82%
MU Interface Util.         0%         96.7%       95.7%        72.6%    70.59%
Register Interface Util.   93.81%     79.79%      90.32%       91.78%   91.76%
Table 5–17: Execution of the test programs on Refinement Step 8

The ICT figure for the LD instruction in Refinement Step 8 is the best attainable
as it represents the MU’s FM delay for the operation. The corresponding
utilisation figure in Table 5–17 supports this claim (Note: these utilisation
measurements do take into account both the initial operand fetch and the final
write-back delays, and will therefore never attain the theoretical upper bound shown
in Table 5–16). These figures show that the micronet can exploit the actual
operational costs and effectively hide the overheads of self-timed design. The
ICTs for the ALU and ST instructions are limited by their operand fetch cycle
times, and the utilisations of the FMs in these cases also approach their bounds.
These cycle times are due to the communication protocol between the FUs and
the register bank. These delays can be reduced by using a less conservative
bundling delay [158] and through layout and transistor size optimisation [26]
(Refinement Step 9). The improvements in FM utilisation over the 9 refinement
steps are shown in Figure 5–7.
Figure 5–7: The FM utilisations
[Plot; horizontal axis: Refinement Steps 1 to 9.]
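The “Max. FM Utilisation” column of Table 5–16 appears consistent with the ratio of the FM delay to the instruction cycle time. The short Python check below illustrates this reading. The LD delay of 23 ns follows from the statement that the LD ICT equals the MU’s FM delay; the ALU and ST delays of 4 ns and 9 ns are inferred from the tabulated percentages and should be treated as assumptions.

```python
def max_fm_utilisation(fm_delay_ns, ict_ns):
    """Upper bound on FM utilisation: FM busy time per instruction cycle."""
    return 100.0 * fm_delay_ns / ict_ns


# FM delays in ns. LD = 23 follows from the text; ALU = 4 and ST = 9
# are inferred from Table 5-16 and are assumptions.
FM_DELAY = {"ALU": 4, "LD": 23, "ST": 9}
ICT_STEP8 = {"ALU": 12, "LD": 23, "ST": 12}   # ICTs from Table 5-16

bounds = {op: round(max_fm_utilisation(FM_DELAY[op], ICT_STEP8[op]), 2)
          for op in FM_DELAY}
```

Under these assumptions the bound can only reach 100% when the ICT has been reduced to the FM delay itself, as is the case for the LD instruction.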
Figure 5–8: The test program execution times
[Plot; horizontal axis: Refinement Steps 1 to 9.]
The overall improvements in the program execution times in Refinement
Step 8 over Step 1 (shown in Table 5–17 and Figure 5–8) for the three instruction
test programs result from the greater temporal concurrency obtained by
asynchronous pipelining of the datapath. The actual speedups achieved are less
than the maximum attainable improvement, which is the ratio of the ICTs (in
Tables 5–2 and 5–16), because of the MML and the startup overheads (see
Equation 5.1); for longer test programs the speed-up will approach the maximum
value. The speed-up for HT1 is in part due to the pipelining of the instructions
as observed in previous test programs, and also due to additional spatial con-
currency achieved through the overlapping of different instructions in the same
stage of the micronet. This further improvement is still significant (approxim-
ately 17-20% in this example) given that both successive instruction operand
fetches and write-backs are effectively forced to take place sequentially due to
resource constraints. (In fact, the delays of these operations are larger than the
FM delays for store and ALU operations and the MIICT which implies that the
scope for spatial concurrency in this particular example is quite small). As the
number of microagents within each stage is increased, the spatial concurrency
effect will be more pronounced. The speed-up for HT2, as expected, reflects the
reduced concurrency which can be exploited, because of data dependencies in
the program. These dependencies affect spatial concurrency more since they
sequentialise operations irrespective of resource requirements. This emphasises
the need for a good instruction schedule to exploit micronet-based architectures.
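The observation that the speed-up approaches the ratio of the ICTs for longer programs can be illustrated numerically. The sketch below assumes a simple linear model of program execution time (a fixed startup cost plus one ICT per further instruction); this model is an assumption and is not necessarily Equation 5.1. The figures for the unpipelined design are hypothetical, while the pipelined figures are the ALU values of Table 5–16 with the MML used as the startup cost.

```python
def pet(n, ict, startup):
    """Assumed linear model: startup cost plus one ICT per further instruction."""
    return startup + (n - 1) * ict


def speed_up(n, old, new):
    """Ratio of execution times for an n-instruction program."""
    return pet(n, *old) / pet(n, *new)


OLD = (24, 24)   # hypothetical unpipelined design: (ICT, startup) in ns
NEW = (12, 24)   # ALU figures of Table 5-16: (ICT, MML as startup) in ns

short_run = speed_up(5, OLD, NEW)
long_run = speed_up(10_000, OLD, NEW)
```

For a five-instruction program this model gives a speed-up of about 1.67, whereas for ten thousand instructions it is within 0.01 of the ICT ratio of 2.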
5.16 Conclusions
The interaction between concurrently executing instructions is quite difficult to
predict. For example, two instructions which compete for the same resources
might acquire them in a different order depending on the actual delays which
are themselves data-dependent. This is not in itself a drawback, since one of
the instructions is stalled for just as long as is necessary, unlike the synchronous
case.
These refinements have investigated the influence of an asynchronous con-
trol paradigm on the performance of processor architectures for exploiting fine-
grained ILP. The rôle of the CU in an asynchronous processor can be con-
siderably simplified, just to initiating individual micro-operations as early as
possible. The control of the datapath is distributed to local interfaces courtesy
of the micronet. The results show that given a set of resources, an asynchronous
control paradigm implemented as a micronet is able to achieve good
utilisation of datapath resources efficiently through the exploitation of both actual
execution latencies and fine-grain spatial and temporal ILP. It has to be noted that
the datapath latency is unaffected by the exploitation of temporal parallelism
which is generally not the case in a synchronous pipeline.
5.17 Refinement Step 9 – Transistor Resizing
This additional refinement further illustrates how easily modifications can be made
to the behaviour of a particular part of the micronet without affecting the rest of
the design. In the previous refinement step, some of the ICTs were limited by
the handshake cycle time of data transfers across the buses. In this refinement
step, the bus drivers have been resized to reduce the bus propagation times.
No other design modifications were required, either to ensure the
correct operation of the micronet or to exploit the benefits.
Performance Results
Instruction   ICT     MML     Max. FM Utilisation
ALU           10 ns   22 ns   40%
LD            23 ns   41 ns   100%
ST            10 ns   20 ns   90%
Table 5–18: Instruction execution for Refinement Step 9
In this refinement step, the ICTs of both the ALU and ST instructions, as
shown in Table 5–18, are now limited by the instruction issue cycle. This is
clearly highlighted by the fact that their corresponding test programs have
similar PETs (see Table 5–19).
In a synchronous pipeline, performance benefits can only be attained when
improvements are made to the worst-case delay of the slowest stage. However,
a micronet can exploit the benefits of improvements made to any stage. In MAP,
the delay associated with the current issue cycle is determined by the current
slowest stage and this is likely to vary from cycle to cycle. Excluding hazards,
the instruction issue rate is limited by the issue cycle time or the operand fetch
cycle time (depending on the delays incurred by their actual implementation).
The time to write data back to registers will vary depending on the time taken
to obtain access to the bus.

Test Programs     Alu Test   Load Test   Store Test   HT1      HT2      HT2DF
PET               89 ns      186 ns      85 ns        68 ns    83 ns    78 ns
MIICT             10 ns      12 ns       10 ns        10 ns    10 ns    10 ns
ALU Util.         33.75%     0%          0%           19.35%   15.58%   16.67%
MU Util.          0%         89.44%      79.75%       51.61%   41.56%   44.44%
Reg. Bank Util.   77.11%     36.67%      44.3%        62.9%    70.13%   68.06%
ALU If. Util.     92.77%     0%          0%           74.19%   75.32%   69.44%
MU If. Util.      0%         96.67%      96.2%        74.19%   74.03%   72.22%
Reg If. Util.     91.57%     81.11%      88.61%       90.32%   90.91%   91.67%
Table 5–19: Execution of the test programs on Refinement Step 9
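The contrast between a clocked pipeline and a micronet described in this section can be illustrated with a deliberately simplified model, in which each issue cycle lasts as long as its slowest stage. The per-cycle stage delays below are hypothetical, and decoupling queues, hazards and handshake overheads are ignored.

```python
def synchronous_time(stage_delays):
    """Every cycle lasts as long as the worst-case delay of any stage."""
    worst = max(max(cycle) for cycle in stage_delays)
    return worst * len(stage_delays)


def asynchronous_time(stage_delays):
    """Each cycle lasts only as long as its own slowest stage."""
    return sum(max(cycle) for cycle in stage_delays)


# Hypothetical per-cycle delays (ns) of three stages over four cycles.
delays = [(10, 4, 6), (10, 23, 6), (10, 9, 6), (10, 4, 12)]
sync_total = synchronous_time(delays)
async_total = asynchronous_time(delays)

# Speeding up a stage that is not the global worst case helps only
# the asynchronous model.
faster = [(8, b, c) for (_, b, c) in delays]
```

In this example the synchronous total is fixed by the single 23 ns delay, while the asynchronous total also falls when the first stage is improved from 10 ns to 8 ns.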
5.18 Discussion
The work in this chapter has focused on how asynchronous controls within the
micronet can be used efficiently to utilise the components of a typical datapath.
The architecture has been influenced to the extent that resources (microagents)
should operate as independently and concurrently as possible. This is achieved
through the distribution and decentralisation of control and the use of decoup-
ling queues between successive resources. The micronet effectively provides
a framework to control a processor architecture and does not constrain a de-
signer towards any particular architecture. Instead the designer can easily add
or remove resources (thus modifying the architecture) to meet different design
criteria. The processor’s performance and efficiency are determined by the
proportion of the actual delays (latencies) of the micro-operations to the cycle times
of their functional units (FMs). It is difficult to quantify the effect
a particular modification will have on performance because of the complex
interdependency (sequencing and actual delays of components) of events. A high
functional unit utilisation is not necessarily a good thing since this may imply a
bottleneck in that particular unit. Of course, the designer could replace that unit
with a faster one or even add another unit to improve performance which may
have the effect of reducing the utilisation. Improving the performance of the
part of a design which is causing a bottleneck will simply move the bottleneck
to the next slowest part of the design (which may or may not be within the
design constraints). This work argues that, given a set of architectural
resources, the micronet control paradigm is better able to utilise them (almost
achieving their maximum theoretical bounds).
5.18.1 Minimising the Self-Timed Overheads
As shown in this chapter, the key to efficient exploitation of ILP in MAP has
been the ability to hide the overhead due to asynchronous handshake protocols.
While the two-phase handshaking protocol is conceptually easier to understand,
a four-phase one leads to simpler and therefore possibly faster circuits. In
order to exploit ILP, the control unit might want to issue instructions before
the previous one has completed, therefore these circuits need to be as fast as
possible.
Both transitions in a generic four-phase protocol (the assertion and the
return-to-zero) are accompanied by additional acknowledgements from the
receiver. Although the principal advantage of this approach is a simpler circuit
implementation, it uses twice as many transitions as are necessary and,
whenever the wire delay is a substantial fraction of the operation time, the extra
trip required by a single communication can be a serious performance penalty.
In fact, the reset phase of the handshake does not signal any event, thus leading
some designers to modify the protocol to simultaneously reset the two signals
after the active phase to reduce the handshake cycle time [55]. The micronet is
only concerned with the external communications between microagents, each
of which might use a different protocol internally. Micronets employ the tradi-
tional four-phase handshaking protocol for both control and local bundled data
transfer. Other reasons, more specific to micronets, have influenced this choice,
and these are discussed next.
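The transition counts discussed above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function names and the (req, ack) encoding are assumptions made purely for illustration. It enumerates the wire states visited by one transfer under each protocol, showing that the four-phase protocol uses twice as many transitions but always leaves the wires in the inactive state.

```python
def four_phase_transfer():
    """(req, ack) states visited by one return-to-zero (four-phase) transfer."""
    req, ack = 0, 0                 # both wires start in the inactive state
    trace = [(req, ack)]
    req = 1                         # 1: sender asserts the request
    trace.append((req, ack))
    ack = 1                         # 2: receiver asserts the acknowledgement
    trace.append((req, ack))
    req = 0                         # 3: sender returns the request to zero
    trace.append((req, ack))
    ack = 0                         # 4: receiver returns the acknowledgement to zero
    trace.append((req, ack))
    return trace


def two_phase_transfer(level):
    """(req, ack) states visited by one transition-signalled (two-phase) transfer.

    `level` is the common value of both wires before the transfer (0 or 1).
    """
    req = ack = level
    trace = [(req, ack)]
    req ^= 1                        # 1: any request transition signals an event
    trace.append((req, ack))
    ack ^= 1                        # 2: any acknowledgement transition answers it
    trace.append((req, ack))
    return trace


def transitions(trace):
    """Number of wire transitions in a trace of successive states."""
    return len(trace) - 1


if __name__ == "__main__":
    print(transitions(four_phase_transfer()))   # 4
    print(transitions(two_phase_transfer(0)))   # 2
    print(four_phase_transfer()[-1])            # (0, 0): back in the inactive state
    print(two_phase_transfer(0)[-1])            # (1, 1): in phase, but not reset
```

The final states hint at why the return-to-zero form suits shared wires: after a four-phase transfer any other sender finds the wires in a known inactive state, whereas after a two-phase transfer their level depends on how many transfers have gone before.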
Fast Instruction Issue
One of the significant features of micronets is their ability to exploit spatial
concurrency within the datapath. This requires a fast instruction issue rate to
keep the microagents busy. The CU initiates the micro-operations for each of
the instructions individually and as early as possible. The acknowledgements
from the CMs (after a delay of one C-element) confirm that the corresponding
micro-operations will be initiated. The instruction is considered to have been
issued once the CU has received all the acknowledgements. This corresponds
to the first half of the four-phase protocol. The CU is free to issue the next
instruction, while the reset phase of the protocol completes. This is done when
the corresponding acknowledgement signal is de-asserted which signifies that
the particular resource is ready for the next request. The instruction releases
the resources individually as soon as the respective micro-operations have com-
pleted, thus freeing the resources for another instruction. Figure 5–9 shows
the activity of two resources in micronets in comparison to a micropipeline
and a synchronous pipeline. Assume each of the three instructions requires two
resources concurrently for varying periods of time. In the case of the synchronous
pipeline, successive instructions must wait for the next clock tick to begin
execution. In a micropipeline, the next instruction can begin execution when
the previous one has finished with both resources.
Figure 5–9: Resource activity (panels: a) Resource activity in a synchronous pipeline; b) Resource activity in a micropipeline)
In both cases, significant idle times may exist. A micronet can reduce these idle times by not forcing the
instructions to obtain both resources at the same time.
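The comparison above can be made concrete with a small Python sketch. It is an illustration only, under several assumptions that are not stated in the thesis: the busy times are invented, the synchronous clock period is taken to be the longest busy time, and in the micronet case the two micro-operations of an instruction are treated as independent, with each resource serving the instructions in program order.

```python
def synchronous_finish(durations):
    """Every instruction is given one clock period, sized for the worst case."""
    clock = max(max(a, b) for a, b in durations)
    return len(durations) * clock


def micropipeline_finish(durations):
    """The next instruction starts once the previous one has released both resources."""
    return sum(max(a, b) for a, b in durations)


def micronet_finish(durations):
    """Each resource is handed on as soon as its own micro-operation completes."""
    total_a = sum(a for a, _ in durations)
    total_b = sum(b for _, b in durations)
    return max(total_a, total_b)


# Hypothetical busy times (resource A, resource B) for three instructions.
DURATIONS = [(3, 1), (1, 4), (2, 2)]

if __name__ == "__main__":
    print(synchronous_finish(DURATIONS))    # 12
    print(micropipeline_finish(DURATIONS))  # 9
    print(micronet_finish(DURATIONS))       # 7
```

With these figures the two resources are busy for 6 and 7 time units in total, so the micronet model leaves at most one unit of idle time on either resource, while the other two models leave several.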
Allowing the micro-operations of different instructions to overlap could lead
to potential hazards. Since the acknowledgement signals effectively denote the
readiness or busyness of resources, they can be collectively used as a scoreboard.
Hazard avoidance due to data dependences is implicit in the orderings of the
assertions of the control signals. These pre-issue conditions stall the assertion
of the respective control signal until the completion of one of the halves of
the handshake protocol of the dependent micro-operation control signal(s).
Although a four-phase protocol would be considered twice as expensive as a
two-phase one, the same efficiency is obtained as two back-to-back, two-phase
handshakes by representing two events in each cycle. The recovery transitions
are used by the control unit for scoreboarding and hazard avoidance. This is
necessary for efficient exploitation of ILP, since the control unit has to issue
each instruction before the previous one completes its execution. Furthermore,
a four-phase protocol exposes more concurrency by effectively decoupling the
sender’s and receiver’s operations from their communication.
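The use of acknowledgement signals as a scoreboard can be sketched as follows. This is a simplified software analogy, not the circuit described in the thesis: the class, the function and the resource names are hypothetical, and a single boolean stands in for each acknowledgement wire (asserted meaning busy, de-asserted meaning ready).

```python
class AckScoreboard:
    """Tracks one acknowledgement wire per resource; asserted means busy."""

    def __init__(self, resources):
        self.ack = {name: False for name in resources}

    def request(self, name):
        """Initiate a micro-operation; return False if the resource is still busy."""
        if self.ack[name]:
            return False
        self.ack[name] = True       # first half of the handshake: resource claimed
        return True

    def release(self, name):
        """Reset phase completes: the resource is ready for the next request."""
        self.ack[name] = False


def issue(board, needed):
    """Initiate every micro-operation that can start; return those left stalled."""
    return {name for name in sorted(needed) if not board.request(name)}


if __name__ == "__main__":
    board = AckScoreboard(["ALU", "Rx", "Ry"])
    print(issue(board, {"ALU", "Rx"}))   # set(): first instruction fully issued
    stalled = issue(board, {"ALU", "Ry"})
    print(stalled)                       # {'ALU'}: only this micro-operation waits
    board.release("ALU")                 # first instruction has finished with the ALU
    print(issue(board, stalled))         # set(): the stalled micro-operation starts
```

Note that the second instruction claims Ry even though the ALU is busy; only the micro-operation whose resource is unavailable waits, which is the behaviour the pre-issue conditions are meant to provide.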
Routing Data in Micronets
Although the actual data transfer between microagents is controlled locally via
handshake protocols, the access to shared resources, such as data highways,
may be controlled either globally by the CU or locally by an arbitration scheme.
Global control is used in cases where the order of granting resources is known
in advance or has to be enforced. This is again achieved through the use of
pre-issue conditions. Otherwise, a local mutual exclusion scheme such as in
token rings or arbiters will grant requests. For example, the writing back to the
register bank is controlled directly by the CMs of the FMs which require this
service. As a consequence of this and also due to the differences in the execution
times of micro-operations, instructions may complete out of order. Therefore
data has to be tagged with its destination which also enables data-forwarding
to be supported.
The decision to use a two-phase or four-phase protocol also depends on
whether the local communication between two microagents takes place over a
shared bus or not. When wires are shared between two or more components,
the wires must return to an inactive or predetermined state to allow successive
handshakes to commence. When the highway is not shared, two-phase can be
used because there is only one source and one destination and so after each
completed handshake the wires will be in phase. Generally in processor archi-
tectures, data transfers take place over shared highways and the four phases of
the protocol map to the four states of bus activity: an inactive state; (either a
request is made for data or) data placed on the bus; (data is placed on the bus
or) the receiver signals the receipt of data; (the acknowledgement cleared or)
Figure 5–10: Overlapping micro-operation handshake cycles (diagram label: L1 + L2 + L3 + L4)
As suggested earlier, only one of the four phases actually contributes to the
progress of the operation. In practice, the overhead of the other phases can be
reduced by overlapping them with the micro-operation (computation) associ-
ated with the sender’s and/or receiver’s stage (see Figure 5–10). In a pipeline,
overheads can be further minimised by overlapping handshake phases of suc-
cessive stages. For example, fetching a single operand requires a handshake
between the register interface and register bank to access the operand, and the
register interface and the functional unit interface to transfer the operand to the
appropriate functional unit:
Phase 1 (REGISTER ACCESS) – While the register interface makes a request to
access a register, the FU interface can initiate the operand fetch handshake
by making a request to the register interface.
Phase 2 (DATA TRANSFER) – When the operand is received by the register
interface it carries on and completes the handshake with the register bank.
Concurrently, if it has received the operand request from a FU interface,
then the data is transferred over the bus.
Phase 3 (FM BUSY) – When the functional unit interface receives the data, it
transfers it to its FM and completes the handshake with the register interface.
Meanwhile, the first handshake (register bank-register interface) may have
completed, which implies that the next register access could begin.
Phase 4 (FM BUSY and/or REGISTER ACCESS) – When the operand request
signal from the FU interface has been cleared, then the register interface
removes the data from the bus. Meanwhile, the register interface could
also be accessing another register or, although the FU may still be busy, its
interface could make another request for operands.
In terms of fetching the operand for the FU, if there were no delays, then this
is the shortest time possible (i.e. the sum of the critical latencies). However, the
time between operand fetches may be increased by the unnecessary additional
time the bus is being driven, this being the transit time of the corresponding
de-assertion of the acknowledgement signal and time to remove data from the
bus. Notice that in this case, data transfer is request-driven, i.e. for each operand
that is required, the FU asserts a request signal over the appropriate bus to the
register read port. This ensures that resources (registers and buses) are utilised
for no longer than is necessary. The register control signals together with the
handshaking protocol prevent bus contention from occurring.
5.18.2 Implications for the Compiler
Code scheduling is important in architectures which are able to exploit instruction-
level parallelism. In synchronous RISC systems, the order and the time at which
each instruction is to be issued are determined by the compiler. Generally,
instruction execution is started at the next machine instruction cycle which is
determined by the clock. In MAP, instructions are issued in-order as soon as
possible – allowing instructions to execute when ready. In the control unit, the
effect of this scheduling approach is to initiate the instruction issue immediately
after the previous one, thereby reducing the idle time between instructions. For
example, given two events: A followed by B, in a synchronous design B will
be captured only at the first clock after the worst-case delay between A and B.
On average, this can still be a significant time after the actual occurrence of B.
Thus, in a synchronous design, each instruction will spend a fixed period of
time (determined by the largest worst-case stage delay) in each stage regardless
of requirement, while in a MAP design, instructions spend varying amounts
of time in only the relevant stages. Therefore, the asynchronous design is more
efficient – each instruction spends only as long as necessary in each stage,
and is better able to exploit any concurrency. A side-effect of this is that in
MAP, the execution time of instructions cannot be predicted exactly at compile-
time. However, a MAP compiler need only generate an appropriate instruction
ordering to maximise the exploitation of ILP and need not be concerned with the
time at which instructions are actually issued.
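The difference in the time an instruction spends in the stages can be illustrated with a short Python calculation. The stage names and delays below are hypothetical, chosen only to make the arithmetic visible, and the sketch counts the time one instruction spends in the stages rather than overall throughput.

```python
# Hypothetical worst-case delay of each stage (arbitrary time units).
WORST_CASE = {"operand_fetch": 3, "execute": 6, "memory": 8, "write_back": 3}

# Hypothetical actual delays of one ALU instruction, which skips the memory stage.
ALU_ACTUAL = {"operand_fetch": 2, "execute": 4, "write_back": 2}


def synchronous_time(worst_case):
    """Clocked pipeline: every stage is visited for one clock period."""
    clock = max(worst_case.values())   # period fixed by the slowest stage
    return len(worst_case) * clock


def map_time(actual):
    """MAP: only the relevant stages are visited, each for its actual delay."""
    return sum(actual.values())


if __name__ == "__main__":
    print(synchronous_time(WORST_CASE))  # 32
    print(map_time(ALU_ACTUAL))          # 8
```

The compiler cannot know the second figure exactly in advance, since the actual delays vary at run-time, which is why a MAP schedule specifies an order rather than issue times.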
5.19 Summary
The utilisation of parallelism between instructions in high performance pro-
cessors is very important. This chapter has investigated the influences of an
asynchronous control paradigm on the design and performance of processor
architectures for exploiting fine-grained ILP. An ILP MAP design has been outlined,
and it has been shown how the control for such an architecture can be implemented
efficiently. The rôle of the CU in an asynchronous processor has been considerably
simplified to just initiating individual micro-operations as early
as possible. The control of the datapath is distributed to local interfaces cour-
tesy of the micronet. The advantages of this approach accrue from being able to
exploit both the actual run-time delays of the microagents and their concurrent
operation. The results show that given some set of architectural resources, an
asynchronous control paradigm implemented as a micronet is able to achieve
near optimal utilisation efficiently. Furthermore, as one might expect, when
a FU operation is the slowest stage in the pipeline, then maximum utilisation
can be achieved. The improvement in performance can be attributed to the
style of design, which, unlike current synchronous designs, can better exploit
advances in technology. The next chapter investigates the suitability of MAP
architectures as good targets for optimising compilers.

Figure 5–11: The micronet model for Refinement Step 1
(Timing diagram. Labels: ALU Ack., MU Ack., ZA Bus, Operand; ALU Instruction Cycle Time; Load Instruction Cycle Time. Legend: Rx, Ry, Rz Acks. asserted upon access to source register, deasserted when Operand Access phase complete; deasserted when data has been written to its destination register; deasserted when Write Back Handshake complete (LD, ALU) or Instruction Execution finished (ST).)

Figure 5–12: The micronet model for Refinement Step 2
(Timing diagram. Labels: ALU Ack., MU Ack., ZA Bus, Operand; ALU Instruction Cycle Time. Legend signals: Rx, Ry, Rz Acks.; ALU, MU Acks.; ZMs Ack. Legend: asserted upon access to source register; deasserted when Operand Access phase complete; deasserted when data has been written to its destination register; deasserted when Write Back Handshake complete (LD, ALU) or Instruction Execution finished (ST).)

Figure 5–13: The micronet model for Refinement Step 3
(Timing diagram. Labels: ALU Ack., MU Ack., Operand; ALU Instruction Cycle Time; Load Instruction Cycle Time; Load Instruction Issued; ALU Instruction Issued. Legend: Rx, Ry, Rz Acks. asserted upon access to source register, deasserted when Operand Access phase complete; deasserted when data has been written to its destination register; deasserted when Write Back Handshake complete (LD, ALU) or Instruction Execution finished (ST).)

Figure 5–14: The micronet model for Refinement Step 4
(Timing diagram. Labels: ALU Ack., MU Ack., Operand; Load Instruction Issued; ALU Inst Issued. Legend signals: ALU, MU Acks.; Rx, Ry, Rz Acks; ZMs Ack. Legend: asserted upon access to the source register; deasserted when Operand Access phase complete; ZMs Ack deasserted when data has been written to its destination register; deasserted when Write Back Handshake complete (LD, ALU) or Instruction Execution finished (ST).)

Figure 5–15: The micronet model for Refinement Step 5
(Timing diagram. Label: Load Instruction Cycle Time. Legend signals: ALU, MU Acks.; Rx, Ry, Rz Acks; ZMs Ack. Legend: asserted upon access to the source register; deasserted when Operand Access phase complete; ZMs Ack deasserted when the destination register has been locked; deasserted when the Functional Unit’s execution stage has finished.)

Figure 5–16: The micronet model for Refinement Step 6
(Timing diagram. Legend signals: ALU, MU Acks.; ZMs Ack. Legend: deasserted when Operand Fetch Handshake is in progress; deasserted when the Functional Unit’s execution stage has finished; deasserted when the destination register has been locked. The dotted line represents the point at which that control signal would be asserted should it be required for the next instruction.)

Figure 5–17: The micronet model for Refinement Step 7
(Timing diagram. Labels: ALU Instruction Cycle Time; ALU Inst Issued. Legend signals: ALU, MU Acks.; ZMs Ack. Legend: deasserted when Operand Fetch Handshake is in progress; deasserted when the Functional Unit’s execution stage has finished; deasserted when the destination register has been locked. A dotted line represents the point at which that control signal would be asserted should it be required by the next instruction.)

Figure 5–18: The micronet model for Refinement Step 8
(Timing diagram. Labels: ALU Inst Cycle Time; Inst Issued; Load Inst Cycle Time. Legend signals: ALU, MU Acks.; ZMs Ack; Rx, Ry, Rw Acks. Legend: deasserted when Operand Fetch Handshake is in progress; deasserted when the destination register has been locked. The dotted line marks the point at which that control signal would be asserted should it be required for the next instruction.)

Chapter 6
The Control Paradigm and the
Compiler
6.1 Introduction
It is important that any processor design be a good target for a compiler, in
order that the architectural and technological benefits afforded by the design
be efficiently realised. The execution times of programs are strongly influenced
by the relationship between the compiler and the rest of the system [176]. In
the case of MAP architectures, it is important to understand the influences
of asynchronous control on parallelising compilers. The compiler’s rôle is to
identify parallelism within the program, generate the appropriate code and
efficiently schedule the instructions for the given processor architecture. This
chapter investigates how these functions are influenced by an asynchronous
control paradigm, and examines the design of a static instruction scheduler for
MAP architectures.
As described earlier in Chapters 4 and 5, the micronet improves the per-
formance of the instruction set by exploiting average delays and by exposing
both spatial and temporal instruction-level parallelism (ILP) within an architec-
ture. This chapter discusses how these advantages can be exploited; a generic
computational model for MAP architectures is developed and techniques (heur-
istics) are introduced which together with the architecture’s distributed control
strategy allow the compiler to efficiently exploit the available ILP. The inten-
tion of this preliminary study is not to propose the best heuristic or schedul-
ing strategy for MAP architectures, but rather to show that a micronet-based
datapath can indeed be a suitable target for compilers.
6.2 Compilers
A compiler has three machine-dependent tasks which affect the performance
of a program. Code Generation determines which instructions implement the
given program most efficiently, which can be made simpler by good instruction
set design [34]; Instruction Scheduling attempts to find an optimal ordering for
the chosen instructions; and Register Allocation assigns variables to physical
registers.
Instruction scheduling is an important feature for processor architectures
which exploit instruction-level parallelism, since fast program execution relies
on a good code schedule to both reduce the effects of hazards and to maximise
functional unit utilisation. In code generation, a particular implementation of a
higher-level function is usually determined by the combination of instructions
which leads to the minimum execution cost. However in ILP processors, this
cost is also affected by the order in which instructions are scheduled.
Instruction scheduling is classified as being local if it only considers instruc-
tions within a basic block [14] [70], and as being global if it considers instructions
spanning multiple basic blocks [15] [52]. While local scheduling can extract
parallelism within a basic block, global scheduling can exploit further program
parallelism by allowing inter-block movement of instructions [3] [46]. Further-
more, in architectures which exploit ILP, in order to generate good code for a
particular function, the compiler can no longer just take into account the cost of
individual instructions but rather their collective costs as determined by their
schedule.
Although register allocation can also introduce hazards due to register de-
pendencies (e.g. anti-dependencies), techniques such as register renaming can
be employed to reduce this effect. Note that since global scheduling is generally
achieved independently of the architecture, this chapter concentrates on local
scheduling which is more machine-dependent. Furthermore, local scheduling
is generally used to fine-tune the code produced after global scheduling [70].
6.3 Scheduling Challenges in MAP Architectures
Most modern synchronous processors enhance their performance by exploiting
ILP. This is achieved in two parts: firstly, parallelism within the program has
to be exposed [9] (e.g. through loop unrolling) and, secondly, a semantically
correct instruction ordering has to be achieved which utilises as much as possible of the
available parallelism amongst the resources of the target architecture. Although
this ordering can be imposed either at compile-time or at run-time, ILP might
be best exploited statically rather than dynamically, more so since the dynamic
approach cannot exploit a greater degree of parallelism beyond the scope limited
by the fetched instructions.
The trend towards static instruction scheduling, i.e. the reliance on the com-
piler to generate the optimal schedule, has been aided by the predictability
of execution costs on synchronous processors. The optimising compilers for
synchronous pipelines assume a deterministic behavioural model of the tar-
get with each stage delay being approximated to being the same, having been
fixed a priori by the clock. In contrast, a linear, asynchronous pipeline, e.g.
micropipeline [158], has stages whose delays can vary. A compiler in this case
has a less accurate timing model of the target, and any optimisations based on a
synchronous model, such as scheduling instructions in execution gaps as found
in the MIPS re-organiser [13] [71], are less effective.
A micronet enables the exploitation of both spatial and temporal concur-
rency between instructions (in contrast, a micropipeline only exploits temporal
parallelism). Therefore, it is less easy for a compiler to predict the behaviour of
the micronet for the following reasons: firstly, as in a micropipeline, the delay
of each pipeline stage might vary; secondly and more uniquely, each instruction
only visits the relevant stages and the multiple paths enable more than one
instruction to operate concurrently within a stage, which enables instructions
to race each other, with possible out-of-order completion of instructions. Fur-
thermore, instructions may interfere with each other when competing for the
same resource in a particular stage.
The effective performance which a MAP system can deliver depends intim-
ately on the compiler’s ability to match the parallelism in programs with the
temporal and spatial concurrency exposed by the MAP architecture. The result-
ing instruction schedule should aim to keep the functional units busy thereby
increasing their utilisation and improving the overall performance. Unlike synchronous
schedules, which imply both an order of execution for the instructions
and the times (in multiples of the basic instruction cycle) at which they are
to execute, asynchronous ones imply only an order and are efficiently issued
“dynamically” by the control unit (CU). This removes the need for the inclusion
of NO-OP instructions in asynchronous schedules. Note that in synchronous
designs, the selection of which instruction to issue in a given cycle is gener-
ally performed at compile time in superpipelined (and VLIW) machines and at
run-time in superscalar ones.
6.3.1 MAP Behaviour
A MAP architecture has several communicating pipelines all of whose stages
can potentially be busy simultaneously. The task of the scheduler is to order
the instructions in such a manner so as to maximise the resource utilisation,
minimise the resource contention and allow the processor’s control unit to
maintain an optimal instruction issue rate. The control unit issues successive
instructions as early as possible in order to initiate the instruction’s execution
immediately after the previous issue, thereby reducing the idle time between
instructions.
A micronet can be stalled due to contention for resources. In particular, the
CU (also referred to as the issue unit) will be stalled when the resources required
by the current instruction are all busy. The scheduler attempts to minimise
this by suitably ordering the instructions at compile-time. If it is impossible
to schedule successive unrelated instructions, then the micronet minimises the
stall at run-time. In the case of data-dependent instructions, both instructions
are issued, with the second instruction awaiting the result to be forwarded. In
the case of resource contention, the second instruction performs all the micro-operations
up to the microagent which is busy. In effect, only the offending
micro-operation is stalled, rather than the entire instruction. These fine-grained
hazard avoidances are enforced at run-time by the pre-issue conditions of the
micronet as previously described in Chapter 5.
6.3.2 A Parameterised Computational Model
A computational model describes the scheduler’s view of the target architec-
ture. The model is the basis upon which the scheduler aims to maximise the
amount of parallelism that can be exploited. One of the advantages afforded by
asynchronous and distributed control in the design of processors is the modularity
and composability which allows designers to easily modify and explore
the architectural design space e.g. determining the optimal number of resources
for a class of problems. It would be advantageous, therefore, not to have to
redesign the scheduler each time as well. This requires two conditions to be
met: the scheduling strategy should not be specific to any particular architecture;
and the computational model should capture the salient characteristics
of any target architecture (the holy grail in the field of scheduling [134]).
For synchronous architectures the computational model is simple: instruc-
tions do not interact and their execution times are considered fixed. In contrast,
the model for a micronet-based processor is necessarily less accurate for the
following reasons: execution times for even the same instruction may vary due
to data-dependent operations, environmental parameters, and the interactions
between different instructions which are executing simultaneously. However,
the modularity and composability of the micronet make it easy to parameterise
the computational model which would allow the same scheduler to be applic-
able to a variety of architectures. Unfortunately, this concept cannot be easily
adopted for synchronous architectures since each one is almost unique because
of its centralised controls and any changes made can affect the behaviour of the
whole design.
The MAP computational model views an architecture as a collection of re-
sources or microagents (a number of issue units, a number of various functional
units, and a number of bus highways) which are connected in some fashion (i.e.
some functional units may share buses, while others have dedicated point-to-
point connections). Each microagent associates a latency and a cycle time with
each of its micro-operations, the former determines when the result becomes
available and the latter is the rate at which those micro-operations can be pro-
cessed. The model currently assumes a “five-stage” network (register access,
operand fetch handshake, execution, write-back handshake, and write-back)
(see Chapter 5) to which resources are allocated depending on their type – is-
sue unit; register bank; operand fetch bus; functional unit; or write-back bus.
The register file can be modelled as one large file or a number of smaller ones
depending on the number of operand fetch ports. It is also possible to model
VLIW or superscalar architectures. The parameterised model effectively forms
a resource graph of the target architecture. In general, this graph is irregular and
does not have the same conventional connectivity patterns that are normally
associated with multiprocessor scheduling graphs, e.g. full connection, grid,
hypercube or a ring topology.
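One possible encoding of such a parameterised model is sketched below in Python. It is an illustration rather than the data structure used by the MAP scheduler: the class names, field names, agent names and timing values are all assumptions. Each microagent carries a latency and a cycle time per micro-operation, and the connections between microagents form an undirected resource graph.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Timing:
    latency: float      # time until the result becomes available
    cycle_time: float   # minimum spacing between successive micro-operations


@dataclass
class Microagent:
    name: str
    kind: str                                     # e.g. "functional_unit", "write_back_bus"
    timings: dict = field(default_factory=dict)   # micro-operation name -> Timing


@dataclass
class ResourceGraph:
    agents: dict = field(default_factory=dict)    # name -> Microagent
    links: set = field(default_factory=set)       # undirected connections

    def add(self, agent):
        self.agents[agent.name] = agent

    def connect(self, a, b):
        if a == b or a not in self.agents or b not in self.agents:
            raise ValueError("connect two distinct, known microagents")
        self.links.add(frozenset((a, b)))

    def neighbours(self, name):
        """Names of the microagents directly connected to `name`."""
        return sorted(other for link in self.links if name in link
                      for other in link if other != name)


if __name__ == "__main__":
    graph = ResourceGraph()
    graph.add(Microagent("ALU", "functional_unit", {"add": Timing(5, 4)}))
    graph.add(Microagent("MU", "functional_unit", {"load": Timing(9, 6)}))
    graph.add(Microagent("WB", "write_back_bus", {"transfer": Timing(2, 2)}))
    graph.connect("ALU", "WB")          # both functional units share one bus
    graph.connect("MU", "WB")
    print(graph.neighbours("WB"))       # ['ALU', 'MU']
```

Changing the number or kind of resources then only changes the graph that is handed to the scheduler, not the scheduler itself.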
6.4 The Scheduling Problem
The MAP scheduling problem can be stated as follows: Given a set of heterogen-
eous resources with variable execution times, devise a minimal-length, non-preemptive
schedule which respects dependencies within programs; each program being described
as an arbitrary partial ordering of instructions.
This type of problem, usually referred to as the precedence- and resource-constrained
instruction scheduling problem, has been well studied, and it is
known that even with restrictions imposed, the problem is still NP-hard [32] [85] [168].
For example, when the execution times of tasks are not uniform and their partial
order is arbitrary, then for two or more identical processing units, the problem of
determining a minimal-length, non-preemptive schedule is NP-complete [59].
This result is true even if all of the tasks are independent. Therefore, in order
to achieve near-optimal execution times for given applications on MAP archi-
tectures, an efficient (polynomial-time) scheduling algorithm based on one or a
number of heuristics must be devised.
6.4.1 Similar Scheduling Problems
The scheduling of instructions for MAP finds echoes in other scheduling problems:
1. Multiprocessor Scheduling: There is a wealth of strategies and solutions
to various classes of scheduling problems. For example, multiprocessor
scheduling considers tasks as the basic unit of work; whether one considers
processes, code segments, or even machine-code instructions, they can all
be viewed as tasks at different levels of granularity. These problems
usually assume that processors are homogeneous (i.e. identical), whereas
a MAP architecture has different functional units each of which can only
execute a unique set of instructions. Furthermore, since multiprocessor
scheduling only considers acyclic dependencies between tasks and that
each task is only executed once, this technique can only be used to schedule
instructions within basic blocks. Level Scheduling was an early approach used in operational research
and assembly line problems [75]. This scheme is only optimal when
considering unit execution time (UET) systems and tasks graphs
which are either in- or out-forests. Priorities are assigned to all
tasks: the tasks within the same level of a directed acyclic graph
(DAG) being assigned the same values and the higher levels within
the DAG (those farthest from the terminal level or sink tasks) being
given higher priorities. The highest, unexecuted, ready task, i.e. a
task which has no predecessors or all of its predecessors have already
been executed, is assigned to the first processor which becomes avail-
able. More recently, optimal solutions for arbitrary-shaped DAGs for
up to 2 processors have been found [33] [57] [151].
2. Graph Colouring is a technique used in register allocation [27]. A large
number of symbolic registers are mapped onto a limited number of phys-
ical registers in a CPU. At any time t there are a number of “live” symbolic
registers which need to be optimally allocated. Similarly in MAP, at any
time t there is a list of instructions that are eligible to be issued for execu-
tion. The choice of instruction for scheduling depends on availability of
resources and the cost, of say, not scheduling the instruction immediately.
3. In dataflow machines, instructions are issued as soon as their operands are
available. This is achieved completely dynamically in hardware but incurs
significant run-time (book-keeping) costs. Scheduling in traditional syn-
chronous RISC architectures is achieved completely statically. An effort to
reduce the book-keeping costs has led to an interest in dataflow-RISC hybrids
[58]. MAP architectures can also be viewed as a hybrid of these two
classical styles. As in the RISC architectures, code scheduling is done stat-
ically but, additionally, instruction issue (and even possibly the instruction
schedule) is fine-tuned dynamically to take advantage of run-time charac-
teristics as in the data-flow model. Notice that in some sense, MAP is more
interested in dataflow at the microagent-level than at the instruction-level.
This now raises the following question: How much scheduling should be
done statically in the MAP scheduler and what should be left for the MAP
hardware? Before this can be answered, the rest of this chapter attempts
to determine how much scheduling can be done at compile-time.
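The level scheduling scheme described in item 1 above can be sketched in Python as follows. This is a minimal illustration for unit execution time tasks on identical processors, not the MAP scheduler: the function names and the example graph are hypothetical, and ties between tasks of the same level are broken by name simply to make the result deterministic.

```python
def levels(succ):
    """Level of a task: number of tasks on the longest path from it to a sink."""
    memo = {}

    def level(task):
        if task not in memo:
            memo[task] = 1 + max((level(s) for s in succ.get(task, ())), default=0)
        return memo[task]

    tasks = set(succ) | {s for ss in succ.values() for s in ss}
    return {task: level(task) for task in tasks}


def level_schedule(succ, processors):
    """List-schedule unit-time tasks, always picking the highest-level ready tasks."""
    lvl = levels(succ)
    preds = {task: set() for task in lvl}
    for task, successors in succ.items():
        for s in successors:
            preds[s].add(task)

    done, slots = set(), []
    while len(done) < len(lvl):
        # A task is ready once all of its predecessors have been executed.
        ready = [t for t in lvl if t not in done and preds[t] <= done]
        ready.sort(key=lambda t: (-lvl[t], t))
        step = ready[:processors]
        slots.append(step)
        done.update(step)
    return slots


# Hypothetical in-forest: a and b feed c; c and d feed e.
SUCC = {"a": ["c"], "b": ["c"], "c": ["e"], "d": ["e"]}

if __name__ == "__main__":
    print(level_schedule(SUCC, 2))   # [['a', 'b'], ['c', 'd'], ['e']]
```

The sketch assumes an acyclic graph and identical processors; a MAP architecture violates the second assumption, since each functional unit executes only its own set of instructions, which is one reason the scheme cannot be applied to MAP unchanged.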
6.5 A Scheduling Methodology for MAP
A directed acyclic graph (DAG) is used to represent the instructions within the
basic blocks of a program. (Techniques such as trace scheduling [46] [52] or
global compaction [130] could be used to increase the size of these blocks.) Each
node within the DAG corresponds to an instruction, and each edge to a data
dependence between instructions. Typically, an instruction cannot begin execu-
tion until all of its predecessors have completed and their results have become
available. In practice, it is not necessary to stall the instruction completely in all
the cases where such dependencies exist. Since the micronet already minimises
the length of any stall, i.e. an instruction stalls only until its dependencies have
been resolved, the implicit (and possibly unnecessary) stalls incurred by a
conventional computational model, which may adversely affect the optimality of
the schedules generated by heuristics, can be avoided. The implications for the
MAP scheduler, i.e. the degree to which an instruction needs to be stalled, depend
on the type of dependency implied by the edge within the DAG, as follows:
Read-after-Write – Although the dependent instruction will be issued, its exe-
cution will be delayed (by the micronet) until the completion of its prede-
cessor. In practice, it is preferable not to issue such an instruction, since
some of the resources earmarked for the dependent instruction will be-
come unavailable for use by other, now “ready-to-execute” instructions,
which might introduce further structural hazards in the bargain.
Write-after-Write – Only the write-back order has to be maintained and this is
also achieved in hardware by the micronet. Two instructions are permitted
to execute concurrently. Although all of the second instruction’s micro-
operations will have been initiated, the write-back micro-operation will
stall for as long as the first instruction holds on to the destination register.
The current MAP architecture supports only one outstanding register lock
request; therefore, a subsequent third instruction which requires a locked
register cannot be issued until the first write-back has been completed.
The scheduler should avoid arranging instructions which write to the
register file immediately after two instructions with write-after-write de-
pendencies if independent instructions cannot be found for issue between
the two dependent instructions.
Write-after-Read – In the case of an architecture with a single set of operand
fetch buses, the hardware ensures that a dependent instruction will be
unable to lock its destination register before its predecessor has fetched
its operand. Should there be a number of operand fetch buses (as in a
superscalar MAP), and the possibility of a dependent instruction obtain-
ing its operands before its predecessor, then this instruction may have to
be stalled. This would only be necessary when the time to execute the
dependent instruction is less than the operand fetch time for the prede-
cessor. This hazard is also known as an anti-dependency, and along with
write-after-write hazards can be avoided by register renaming.
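As an illustration of the three edge types above, the following Python sketch classifies the register hazards between an earlier and a later instruction. The `Instr` record and the `hazards` helper are hypothetical names introduced here for illustration; they are not part of the MAP tools, and the sketch only compares register names, ignoring memory dependencies.

```python
from dataclasses import dataclass
from typing import Optional, Set, Tuple


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record: one destination, several sources."""
    name: str
    dest: Optional[str]        # destination register, None if nothing is written
    srcs: Tuple[str, ...]      # source registers


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the data hazards that `second` has on the earlier `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")       # second reads what first writes
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")       # both write the same register
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")       # second overwrites an operand of first
    return found


add = Instr("add", "r1", ("r2", "r3"))
sub = Instr("sub", "r4", ("r1", "r5"))
mul = Instr("mul", "r2", ("r6", "r7"))
print(hazards(add, sub))   # RAW on r1
print(hazards(add, mul))   # WAR on r2
```

A scheduler could use such a classification to label the DAG edges, so that only read-after-write edges delay issue while the other two are left to the hardware or removed by register renaming.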
Hazard resolution is a good example of the interaction between the compiler
and the architecture. Since there is no concept of time in the schedule, it is
impossible to avoid all hazards at compile time (cf. the MIPS organiser). The
scheduler can only hope to produce an ordering of instructions which reduces
the number of hazards, and relies on the MAP architecture to minimise their
effects by efficiently resolving them in hardware.
In MAP architectures, it is better to schedule independent instructions suc-
cessively since this may allow the optimal instruction issue rate to be achieved.
In practice, finding independent instructions is not always possible. With the
MAP scheduling problem being NP-complete [59], heuristics are required to
map tasks from a program DAG on to a resource graph. The method
investigated here combines some elements of the approaches described
earlier, but is based primarily on the well-known List Scheduling method.
6.5.1 The Scheduler
List scheduling (LS) is a general method for scheduling tasks in resource-
constrained problems [32]. LS builds a ready set that contains all of the tasks
which are not waiting on the results of other tasks. When a processor becomes
available, a task with the highest priority is chosen from the set and assigned
to it. The ready set is obtained from a topological sort of the data dependence
graph. LS relies on other heuristics to prioritise the ready tasks and guide it
towards an optimal solution. This has led to a profusion of LS-based
heuristics [12,45,77,104,134].
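The general list scheduling loop described above can be sketched in Python as follows. This is a minimal illustration under simplifying assumptions that are not those of the MAP scheduler: identical processors, unit execution times and a priority supplied by the caller. The function name `list_schedule` and its parameters are introduced here for illustration only.

```python
def list_schedule(preds, priority, n_procs):
    """Unit-time list scheduling sketch.

    preds:    task -> set of predecessor tasks (the dependence DAG)
    priority: task -> number, larger means more urgent
    n_procs:  number of identical processors
    Returns a list of time steps, each a list of the tasks started in it.
    """
    remaining = {t: set(p) for t, p in preds.items()}
    done = set()
    steps = []
    while remaining:
        # Ready set: tasks that are not waiting on the result of another task.
        ready = [t for t, p in remaining.items() if p <= done]
        if not ready:
            raise ValueError("dependence graph contains a cycle")
        # Highest priority first; the task name makes the order deterministic.
        ready.sort(key=lambda t: (-priority[t], t))
        chosen = ready[:n_procs]
        steps.append(chosen)
        for t in chosen:
            del remaining[t]
        done.update(chosen)
    return steps


preds = {"a": set(), "b": set(), "c": {"a"}, "d": {"a", "b"}, "e": {"c", "d"}}
level = {"a": 3, "b": 3, "c": 2, "d": 2, "e": 1}   # level from the sink task
print(list_schedule(preds, level, 2))
```

The priority function is the only part that changes between the many LS-based heuristics; here the level of a task above the sink is used, as in the unit-cost case.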
The MAP solution adopted here is based on a greedy scheduling algorithm
for list scheduling which was proposed by Coffman and Graham [33]. This is an
optimal, O(n^2) algorithm for arbitrary precedence constraints on two processors
with unit execution costs. A MAP scheduler has to deal with heterogeneous
resources and can no longer just choose the ready instruction with the highest
priority, but must also consider whether the correct resources are also available
i.e. the instruction must be executable. Once an executable instruction is issued,
its execution cannot be suspended and resumed at the point of suspension at
a later time, i.e. schedules must be non-preemptive. The quality of these
schedules is highly dependent on the parameter(s) used to prioritise
instructions within the ready list [1] [112]; the MAP-specific heuristics are
discussed in the following sections.
Compared to multiprocessor environments, the scheduler for MAP does not
have to consider interprocessor communication explicitly. It does, however,
effectively assume that data is not local, since operands have to be fetched
from, and sometimes returned to, the register bank (i.e. incurring some cost).
Note also that even though data forwarding might be considered to be equi-
valent to local data access, it is not modelled in the computational model since
this is an architecture-specific feature (i.e. not permitted in the parameterised
model) which is impossible to predict a priori.
Minimising Idle Times
The scheduler’s first assumption is that minimising the stall time will lead to bet-
ter (or at least near-optimal) program execution time (the first priority heuristic).
This implies that the MAP compiler should not schedule instructions until their
dependencies have been resolved (as discussed in Chapter 5 and Section 6.5)
and the necessary microagents (resources) are available. This requirement is
met by basing the heuristic’s cost function on worst-case instruction execution
times (see Section 6.7.1 for further details). This implies that the computational
model has to maintain a scoreboard of resource activities.
Primary Instruction Priority
In Coffman and Graham’s algorithm, interprocessor communication is assumed
to be zero and tasks have unit execution times, which means that time can be
conveniently treated as being discrete rather than continuous. This allows pri-
orities to be assigned based on the task’s level within the DAG from the sink
tasks. Since instructions have different worst-case execution times in MAP, the
problem is similar to multiprocessor scheduling with interprocessor communic-
ation delays (where communication costs are only incurred if dependent tasks
are scheduled on different processors). The solutions adopted in this field have
been based on critical path analysis and heuristics [62] [91] [148]. (The critical
path cost of a task is the largest sum of costs along a path from itself to a sink
task.) In the MAP computational model, although actual instruction execution
costs may vary, these critical path costs can be determined a priori by basing
them on fixed, worst-case instruction costs.
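The critical path cost defined above can be computed with a single memoised traversal of the DAG. The sketch below is illustrative only: `critical_path_costs`, the successor map and the cost values are assumptions made here, with the costs standing in for fixed worst-case instruction costs.

```python
from functools import lru_cache


def critical_path_costs(succs, cost):
    """Largest sum of costs along any path from each task to a sink task.

    succs: task -> list of successor tasks
    cost:  task -> fixed (worst-case) execution cost of the task
    """
    @lru_cache(maxsize=None)
    def cp(task):
        # A sink has no successors, so its critical path cost is its own cost.
        longest_tail = max((cp(s) for s in succs.get(task, ())), default=0)
        return cost[task] + longest_tail

    return {task: cp(task) for task in cost}


succs = {"ld": ["add"], "add": ["st"], "mul": ["st"], "st": []}
cost = {"ld": 3, "add": 1, "mul": 2, "st": 2}      # hypothetical costs
print(critical_path_costs(succs, cost))
```

Because the costs are fixed, these values can be computed once before scheduling starts and used directly as the primary priority.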
Secondary Instruction Priorities
The heuristics applied so far may still not prioritise the executable tasks suf-
ficiently. Therefore, additional heuristics are required to further prioritise the
candidate tasks. One feature which does seem to influence the best choice of
candidate significantly is the set of dependents of the chosen task. The two heuristics
used to “break ties” amongst candidates of the same priority act as follows: the
first one gives a higher priority to the task with the larger number of successors
which are solely dependent on it. A feature of this heuristic is that the priority of
a task increases with time. If a tie is still unbroken, then a higher priority is given
to the task with the most successors. Additionally, these heuristics highlight
the need to consider not only which tasks need to execute in the future, but also
their resources.
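The two tie-breaking heuristics can be expressed as a two-part sort key, as in the Python sketch below. One interpretation is assumed here and is not stated in this form in the text: a successor is counted as "solely dependent" on a task when that task is its only predecessor which has not yet been scheduled. This reading is consistent with the remark that the priority of a task increases with time. The names `secondary_key` and `break_ties` are hypothetical.

```python
def secondary_key(task, succs, preds, scheduled):
    """(successors waiting only on `task`, total number of successors)."""
    mine = succs.get(task, ())
    # Assumption: "solely dependent" means no other unscheduled predecessor.
    sole = sum(1 for s in mine if preds[s] - scheduled == {task})
    return (sole, len(mine))


def break_ties(candidates, succs, preds, scheduled):
    """Pick the candidate with the largest secondary key."""
    return max(candidates, key=lambda t: secondary_key(t, succs, preds, scheduled))


succs = {"a": ["c", "d"], "b": ["d", "e", "f"]}
preds = {"c": {"a"}, "d": {"a", "b"}, "e": {"b", "x"}, "f": {"b", "x"}}

# Before x is scheduled, only c waits solely on a, so a wins the tie.
print(break_ties(["a", "b"], succs, preds, scheduled=set()))
# Once x has been scheduled, e and f wait solely on b, so b now wins.
print(break_ties(["a", "b"], succs, preds, scheduled={"x"}))
```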
The Importance of the Instruction Issue Cycle Time
Unlike synchronous pipelines, micronet resources have two parameters which
affect instruction execution costs: the micro-operation’s latency and its cycle
time. Together with program parallelism and the number of resources, a limiting
factor on the amount of exploitable ILP is the cycle time of the issue unit in
relation to the execution time of instructions (or, more accurately, their cycle
times).
In order to minimise the issue unit’s stall time, the compiler has to devise
a schedule that allows instructions to be issued continuously at the highest
possible rate, which is equivalent to one every minimum Instruction Issue
Cycle Time (IICT). Synchronous datapaths are pipelined or where necessary
super-pipelined (i.e. the functional units are themselves pipelined) sufficiently
to achieve this goal. Due to the spatial ILP in MAP, instructions are issued
at a rate (determined by the IICT and dependencies) which is faster than their
Instruction Cycle Times (ICTs). The ICT is the effective issue time (due to pipelin-
ing) for a particular instruction, which is determined by the rate at which that
specific instruction type can be processed. As the IICT, which is less than the
largest ICT, gets smaller, the MAP architecture behaves more in a superscalar
fashion and therefore the value of the IICT itself can have a significant influence
on the optimality of a schedule. This is less significant when the IICT is compar-
able to the largest ICT, in which case the order of the independent instructions is
less critical, since the micronet behaves like a linear pipeline without any spatial
concurrency.
IICT, ICT and Lookahead
When choosing an instruction to schedule, it may be beneficial to consider not
only those instructions which are ready, but also the ones which will become
ready in the near future, e.g. within the next minimum IICT. This is called
instruction lookahead. Note that this may mean deliberately selecting an
instruction that causes the processor’s issue unit to stall.
The two steps of choosing an instruction and checking to see if sufficient
resources are available for it should not take place independently. Since the
scheduling of an instruction is subject to current resource availability, the sched-
uler should also consider future resource requirements (Resource Lookahead).
Example 1 and Example 2 contrast the influence of IICT and resource lookahead
on determining an optimal schedule. A1 and B are ready candidate instructions,
with a third instruction, A2, which has a structural dependency on A1.
Example 1 : Resource Lookahead
switch IICT
    case 0:
        Choose schedule {A1,B,A2} or {B,A1,A2};   /* Either schedule is optimal */
    case (0 ≤ IICT < ½ ICT_A):
        if (ICT_B > 2 ICT_A) Choose schedule {B,A1,A2};   /* Instruction B takes longer than both A1 and A2 */
        else Choose schedule {A1,B,A2};
        /* In other words, combine the resource requirements of
           dependent instructions and schedule the instruction
           according to the resource with the most work. */
    case (½ ICT_A ≤ IICT < ICT_A):
        if (ICT_B > 2 ICT_A)   /* then schedule B first (as before) */
            Choose schedule {B,A1,A2};
        else   /* schedule A1 first */
            if (ICT_B < ICT_A) Choose schedule {A1,A2,B};
            else Choose schedule {A1,B,A2};
    case (ICT_A ≤ IICT):   /* Schedule the instruction with the largest ICT first */
        if ICT_A < ICT_B Choose schedule {B,A1,A2};
        else Choose schedule {A1,A2,B};
end switch;
In the case of scheduling heuristics which do not consider resource looka-
head, the schedules they generate might be as follows:
Example 2 : Without Resource Lookahead
if (IICT = 0) Choose schedule {A1,B,A2} or {B,A1,A2};   /* Again, either schedule is optimal */
else   /* Simply schedule the instruction with the largest ICT first. */
    if (ICT_A < ICT_B) Choose schedule {B,A1,A2};
    else if (IICT < ICT_A) Choose schedule {A1,B,A2};
    else Choose schedule {A1,A2,B};
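To make the contrast between Example 1 and Example 2 concrete, the decision rules can be transcribed into Python as below. This is a direct transcription of the two examples rather than a model of the micronet; the function names are introduced here, and where an example allows two equally good schedules the sketch simply returns the first. The numeric values in the usage lines are hypothetical.

```python
def with_resource_lookahead(iict, ict_a, ict_b):
    """Issue order chosen by the rules of Example 1."""
    if iict == 0:
        return ("A1", "B", "A2")           # ("B", "A1", "A2") is equally good
    if iict < ict_a / 2:
        return ("B", "A1", "A2") if ict_b > 2 * ict_a else ("A1", "B", "A2")
    if iict < ict_a:
        if ict_b > 2 * ict_a:
            return ("B", "A1", "A2")       # B outweighs A1 and A2 together
        return ("A1", "A2", "B") if ict_b < ict_a else ("A1", "B", "A2")
    # ict_a <= iict: the instruction with the largest ICT goes first
    return ("B", "A1", "A2") if ict_a < ict_b else ("A1", "A2", "B")


def without_resource_lookahead(iict, ict_a, ict_b):
    """Issue order chosen by the rules of Example 2."""
    if iict == 0:
        return ("A1", "B", "A2")           # ("B", "A1", "A2") is equally good
    if ict_a < ict_b:
        return ("B", "A1", "A2")
    if iict < ict_a:
        return ("A1", "B", "A2")
    return ("A1", "A2", "B")


# B is slower than A, but not slower than A1 and A2 combined.
print(with_resource_lookahead(10, 100, 150))
print(without_resource_lookahead(10, 100, 150))
```

In the usage case the two rule sets disagree: with resource lookahead the combined work of A1 and A2 on their shared resource outweighs B, so A1 is issued first, whereas the simpler rule issues B first because its individual ICT is larger.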
The lookahead heuristics attempt to match the available program and archi-
tectural parallelism over a short window of time. The strategy of repeating the
process over the entire program allows the instruction-level parallelism to be
exploited more evenly. This has two effects: firstly, a better program makespan
is usually achieved; and secondly, a schedule is generated which is more robust
to deviations from the predicted instruction costs because only the appropriate
amount of program parallelism is exposed which can be exploited by the target
at any one time. Since costs are based on worst-case values rather than typical
ones, the traditional list scheduling heuristics tend to migrate too many
independent instructions to the top of the schedule, leaving insufficient parallelism for
later. Kerns and Eggers [88] proposed a code scheduling algorithm called bal-
anced scheduling for synchronous architectures which is similar in concept. Their
algorithm is specifically designed to tolerate a wide range of variance in load
latency, e.g. cache misses/hits, global and local memory. In these architectures,
instruction costs are well defined and considered fixed. Usually the latencies
reflect the most optimistic execution, e.g., the time of a cache hit rather than a
cache miss. Traditional schedulers improve performance through reordering
instructions to avoid pipeline stalls, e.g., by inserting independent instructions
after loads to keep the CPU busy. The number of instructions inserted (in the best
case) depends on this latency value. If the load instruction is delayed beyond
the scheduler’s estimate, then the processor will stall. However, if the latency is
shorter, the destination register of the load instruction will be tied up for longer
and this may increase register pressure enough to cause unnecessary code spills.
Unfortunately both balanced scheduling and resource lookahead are computa-
tionally more expensive than the traditional list scheduling approach and will
not be considered further in this initial study.
The Approximation Algorithm
The algorithm takes as its input a directed graph of instruction dependencies
and a resource graph with architectural parameters, and generates an instruction
schedule for the given MAP architecture. Two lists are defined as follows: the
WI list – the list of instructions still awaiting their operands, and the EI list
– an ordered list of instructions which are ready, or will be ready in the near
future (for lookahead instructions), but still awaiting issue. The order of the
latter list is determined by the critical path costs of instructions, i.e. the primary
priority. Next, a prioritised list of executable instructions is derived from the
EI list based on the availability of their resources at the current time. If there
are ties, an instruction (or instructions in the case of superscalar MAP) is chosen
for issue based on secondary priority values.
The scheduler mimics the behaviour of the architecture’s issue unit. The
function generate_schedule(), as shown in Algorithm 2, schedules instructions
based on their readiness, their priority and the availability of resources. Unlike
schedulers for synchronous machines, the scheduling of instructions does
not proceed in uniform time steps, but rather in an asynchronous event-driven
manner until all the instructions are scheduled. Each iteration of the main loop
(the while-do loop in line 5) corresponds to an instant in time when the issue
unit is ready to issue an instruction. However, a situation may arise when at
some given time there are no instructions ready for issue (line 8), in which case
the clock must be advanced, but only as far as necessary to remedy this. The
incrementing of the clock simulates the issue unit being stalled. The routine
advance_clock() finds the earliest occurrence of three types of events: the ready
time of an instruction in the WI list and of a lookahead instruction in the EI list;
the time when the result of an operation becomes available in the register file;
and the time a busy resource becomes free. Only the first two events can change
the status of the EI list. There is a choice of heuristics which can be applied,
either the instruction lookahead or the traditional priority-based approach.
Instruction lookahead (lines 9 – 17) chooses the best instruction to issue from
the EI list based on the lookahead heuristic. The function get_ready_instr(),
described in Algorithm 3, returns from the given list of instructions the one with
the highest estimated-time-to-completion (ETC) priority for which there will be
sufficient resources in the datapath when issued at its earliest issue time. In
the current implementation of the lookahead heuristic, only one instruction is
chosen per issue cycle iteration. The routine apply_lookahead(), as described in
Algorithm 4, implements the instruction lookahead heuristic. The alternative
heuristic (lines 18 – 29) chooses the instruction with the highest priority which
can be issued immediately. This may involve choosing one or more from a
number of instructions with the same primary priority value (ETC). Line 19
creates a list of ready instructions with the same, highest ETC values and line
22 removes those instructions with insufficient resources for issue at the current
time. Line 23 supports architectures which incorporate lockstep superscalar
instruction issue. The routine issue_all() issues as many of the instructions as
possible from the given list. If there are not enough issue slots for the complete
list (rdyI), then the routine choosing_insts() returns the best instruction for issue
based on the secondary priorities. The two loops (lines 26 and 27) repeat until
either the issue slots are filled or their respective lists become empty. The clock
is advanced appropriately depending on whether or not the scheduler was able
to issue one or more instructions at the current time (lines 28 and 29). The
routine update_writeback() models the behaviour of the portion of the micronet
not directly controlled by the issue unit, e.g. the write-back bus. Line 32 updates
the instruction lists and the next instruction issue cycle iteration begins at a new
time.
Algorithm 2 : The MAP scheduler (generate_schedule())
 1  curr_time := 0;
 2  calc_completion_times();   /* Critical path analysis for each instruction */
 3  update_WI(WI_list);   /* Determine instruction start times */
 4  update_EI(WI_list);   /* Move ready instructions to EI list */
 5  while (WI_list ≠ {}) or (EI_list ≠ {}) do
 6      no_issued := 0;   /* Number of inst issued simultaneously at this time */
 7      candidates := EI_list;
 8      if (EI_list = {}) curr_time := advance_clock(YES, YES, NO, curr_time);
 9      else if (lookahead = YES)   /* Use Instruction Lookahead Heuristics */
10          BestChoice := get_ready_instr(candidates);
            /* Inst with the highest priority in the candidates list
               for which there are sufficient resources */
11          if (BestChoice ≠ NULL)
12              while candidates ≠ {} do
13                  NextInst := get_ready_instr(candidates);
14                  if (NextInst ≠ NULL) BestChoice = apply_lookahead(BestChoice, NextInst);
                end while
15              if (BestChoice.rdy_time ≤ curr_time + issue_cost)
16                  issue_instruction(BestChoice); no_issued++;
17                  EI_list := EI_list - {BestChoice};
        else
18          do   /* Alternative strategy without Instruction Lookahead */
                 /* same_ETC_list is the list of the highest ETC cost, ready insts */
19              ∃ same_ETC_list ⊆ candidates, s.t. ∀ i ∈ candidates, ∃ v ∈ same_ETC_list, s.t. (v.ETC ≥ i.ETC);
20              candidates := candidates - same_ETC_list;
21              do   /* Remove instructions without sufficient resources */
22                  ∃ rdyI ⊆ same_ETC_list, s.t. ∀ i ∈ rdyI,
                        find_avail_FU_resources(i, datapath, curr_time);
23                  if (|rdyI| ≤ spsclr_deg - no_issued) issue_all(rdyI, no_issued);
                    else   /* choose between insts in rdyI list */
24                      inst_chosen := choosing_insts(rdyI, no_issued);
25                      EI_list := EI_list - {inst_chosen};
26              while ((no_issued < spsclr_deg) and (same_ETC_list ≠ {}));
27          while ((no_issued < spsclr_deg) and (candidates ≠ {}));
28          if (no_issued > 0) curr_time += inst_issue_cycle;
29          else curr_time := advance_clock(YES, YES, YES, curr_time);
        end if
30      update_writeback(datapath);
31      if (WI_list ≠ {})
32          update_WI(WI_list); update_EI(WI_list);
    end while
33  update_writeback(datapath);
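The event-driven stepping of the clock can be illustrated with a small Python sketch of an advance_clock()-style routine. The source does not give the body of advance_clock(), so everything below is an assumption: in particular, the three boolean flags are taken to select the three event types in the order in which the text lists them, and the three lists of event times are hypothetical inputs standing in for the scheduler's internal state.

```python
def advance_clock(use_ready, use_results, use_resources, curr_time,
                  ready_times, result_times, resource_free_times):
    """Return the time of the earliest selected event after curr_time.

    ready_times:         when waiting or lookahead instructions become ready
    result_times:        when results become available in the register file
    resource_free_times: when busy resources become free
    (all three lists are hypothetical stand-ins for scheduler state)
    """
    events = []
    if use_ready:
        events.extend(ready_times)
    if use_results:
        events.extend(result_times)
    if use_resources:
        events.extend(resource_free_times)
    # Only move forward, and only as far as the next event.
    future = [t for t in events if t > curr_time]
    return min(future, default=curr_time)


# With an empty EI list only the first two event types are of interest.
print(advance_clock(True, True, False, 100, [150, 90], [130], [110]))
# When instructions are ready but blocked, a freed resource also counts.
print(advance_clock(True, True, True, 100, [150, 90], [130], [110]))
```

Advancing only to the next event, rather than by a fixed step, is what lets the scheduler model a stalled issue unit without assuming a global clock period.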
The function described in Algorithm 3 returns, from the given list, the
instruction with the highest estimated-time-to-completion (ETC) priority for
which there will be sufficient resources in the datapath if it is issued at its earli-
est issue time. If this time is not the same as the current issue time (i.e. the
next earliest scheduling time for any unscheduled instruction), then issuing this
instruction will effectively cause the issue unit to stall. However, in practice it
is not possible to predict what will actually transpire unless the actual delays
can be determined a priori. Notice that the scheduler must take into account the
fact that some instructions will begin to be issued before they are ready or all
of their resources are available which effectively allows the cost of issuing the
instruction to be hidden.
Algorithm 3 : The MAP scheduler (get_ready_instr(inst_list))
∃ inst ∈ inst_list s.t. inst.ETC is maximum;   /* i.e. inst is the first instruction in the ordered list inst_list */
while ((inst_list ≠ {}) and (not cand_found))
    if (inst.rdy_time > curr_time + issue_cost)   /* This instruction can be issued early to hide the issuing cost */
        inst.issue_time := inst.rdy_time - issue_cost;
    else inst.issue_time := curr_time;
    if (find_avail_resources(inst, inst.issue_time) ≠ {}) cand_found = TRUE;
    else
        inst_list := inst_list - {inst};
        ∃ inst ∈ inst_list s.t. inst.ETC is maximum;
    end if
end while
return(inst);
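A Python rendering of Algorithm 3 is sketched below. It is illustrative rather than definitive: the `Inst` record and the `resources_available` callback (standing in for the resource check on the datapath) are assumptions, and, unlike the pseudocode, the sketch leaves the input list unmodified and returns None explicitly when no candidate has sufficient resources.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Inst:
    """Hypothetical instruction record used by the scheduler sketch."""
    name: str
    etc: float            # estimated-time-to-completion priority
    rdy_time: float       # time at which the operands are ready
    issue_time: float = 0.0


def get_ready_instr(inst_list: Iterable[Inst], curr_time: float,
                    issue_cost: float,
                    resources_available: Callable[[Inst, float], bool]
                    ) -> Optional[Inst]:
    """Highest-ETC instruction that has resources at its earliest issue time."""
    for inst in sorted(inst_list, key=lambda i: i.etc, reverse=True):
        if inst.rdy_time > curr_time + issue_cost:
            # Issue early so that the issuing cost is hidden.
            inst.issue_time = inst.rdy_time - issue_cost
        else:
            inst.issue_time = curr_time
        if resources_available(inst, inst.issue_time):
            return inst
    return None


insts = [Inst("a", 50, 0), Inst("b", 80, 30), Inst("c", 90, 0)]
# Pretend that the resources needed by "c" are busy.
chosen = get_ready_instr(insts, 0, 10, lambda inst, t: inst.name != "c")
print(chosen.name, chosen.issue_time)
```

In the usage lines, instruction b is selected even though it is not ready until time 30; its issue is set for time 20 so that the issue cost of 10 overlaps with the wait.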
In certain cases it is more prudent to stall the issue unit until a higher
priority instruction becomes ready, rather than immediately issuing another
ready instruction. The routine apply_lookahead(), as described in Algorithm 4,
implements the instruction lookahead heuristic, which uses the ETC priority
and the earliest issue time of two instructions to determine which of them
should be issued first. The estimated execution times of the two possible
instruction orders are compared and the order with the smaller time is chosen.
Should the two orders have the same time, then the order where an instruction completes
earlier is chosen, since this would at least allow its dependents to become ready
sooner. However, if a tie still exists then the secondary instruction priorities are
applied to choose a candidate.
Algorithm 4 : The MAP scheduler (apply_lookahead(instA, instB))
opt1 := instA.issue_time + instA.ETC;
opt2 := instA.issue_time + instB.ETC + instA.issue_cycle;
opt3 := instB.issue_time + instB.ETC;
opt4 := instB.issue_time + instA.ETC + instB.issue_cycle;
etc_ABl := max(opt1, opt2);
etc_ABs := min(opt1, opt2);
etc_BAl := max(opt3, opt4);
etc_BAs := min(opt3, opt4);
if (etc_ABl < etc_BAl) return(instA);
else if (etc_ABl > etc_BAl) return(instB);
else if (etc_ABs < etc_BAs) return(instA);
else if (etc_ABs > etc_BAs) return(instB);
else return(break_ties(instA, instB));
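Algorithm 4 translates almost line for line into Python, as sketched below. The `Inst` record and the `break_ties` callback are assumptions introduced for the example. The pair of comparisons on the longer and then the shorter completion time is written as a tuple comparison, which orders lexicographically and therefore has the same effect as the chain of tests in the pseudocode.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    """Hypothetical instruction record for the lookahead comparison."""
    name: str
    etc: float           # estimated time to completion
    issue_time: float    # earliest issue time
    issue_cycle: float   # time before the next instruction can be issued


def apply_lookahead(a, b, break_ties):
    """Return whichever of a and b should be issued first."""
    opt1 = a.issue_time + a.etc                    # a issued first: a completes
    opt2 = a.issue_time + b.etc + a.issue_cycle    # a issued first: b completes
    opt3 = b.issue_time + b.etc                    # b issued first: b completes
    opt4 = b.issue_time + a.etc + b.issue_cycle    # b issued first: a completes
    ab = (max(opt1, opt2), min(opt1, opt2))        # (longest, shortest) for order A,B
    ba = (max(opt3, opt4), min(opt3, opt4))        # (longest, shortest) for order B,A
    if ab < ba:
        return a
    if ab > ba:
        return b
    return break_ties(a, b)                        # fall back to secondary priorities


ready_now = Inst("A", etc=50, issue_time=0, issue_cycle=10)
ready_soon = Inst("B", etc=80, issue_time=20, issue_cycle=10)
first = apply_lookahead(ready_now, ready_soon, lambda x, y: x)
print(first.name)
```

In the usage lines, issuing A first gives completion times of 50 and 90, whereas waiting for B gives 100 and 80, so A is issued immediately despite its lower ETC priority.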
6.6 Results
In this section, the makespans of MAP schedules for a number of typical in-
struction DAGs (briefly described below) are compared with their optimum.
The optimal makespan of each DAG is derived from an exhaustive search of all
possible valid schedules. The DAGs represent a selection of graph shapes
typical of program applications:
BT3 – A Binary Tree with three levels.
BT3.5 – A Binary Tree with three and a half levels.
BT4 – A Binary Tree with four levels.
DD – Diamond DAGs which are commonly found in the evaluation
of partial differential equations.
DM – Dense matrix multiplication.
SM – Sparse matrix multiplication.
CC – Mix of Load, Store and ALU instructions with data dependen-
cies. (The Hennessy Test used in Chapter 5.)
CCL – A loop unrolled version of CC (i.e. two iterations of the
Hennessy Test).
The schedules are generated for the following target MAP architectures:
Min1 – This architecture contains the minimum resources – one
ALU and one Memory Unit (MU) which both share a single
write-back bus. The cycle times and latencies of the ALU, the
MU and the write-back micro-operations are assumed to be the
same.
3bus1 – This architecture has an additional ALU and each of the
three functional units has a dedicated write-back bus. (The
micro-operation cycle times and latencies are the same as Min1).
Min2 – Same as Min1, except that the micro-operation costs of all of
the microagents reflect realistic costs obtained from SPICE-level
simulations.
3bus2 – Same as 3bus1, but with the micro-operation cycle times
and latencies of Min2.
Prgm   MAP    No. of    No. of   Optimal  MAP Heuristic               MAP with Lookahead          New
DAG    Arch   Valid     Optimal  Mkspn    Makespan Closeness The      Makespan Closeness The      Schd?
              Schds     Schds    (ns)     (ns)               Range    (ns)               Range
BT3    Min1   640       24       1105     1185     92.76%    75%      1185     92.76%    75%      No
BT3.5  Min1   230400    512      1505     1585     94.68%    85.71%   1585     94.68%    85.71%   No
BT4    Min1   21964800  529920   1785     1885     94.4%     85.71%   1885     94.4%     85.71%   No
DD     Min1   42        2        1325     1325     100%      100%     1325     100%      100%     No
DM     Min1   310160    200      1905     1925     98.95%    98.11%   1905     100%      100%     Yes
SM     Min1   46574     24       2085     2245     92.33%    81.81%   2265     91.37%    79.55%   Yes
CC     Min1   4         2        735      735      100%      100%     735      100%      100%     No
CCL    Min1   4032      4        945      1015     92.59%    88.89%   1015     92.59%    88.89%   No
BT3    3bus1  640       72       1105     1105     100%      100%     1105     100%      100%     No
BT3.5  3bus1  230400    128      1355     1355     100%      100%     1355     100%      100%     No
BT4    3bus1  21964800  456960   1605     1605     100%      100%     1605     100%      100%     No
DD     3bus1  42        2        1225     1225     100%      100%     1225     100%      100%     No
DM     3bus1  310160    156      1645     1645     100%      100%     1645     100%      100%     No
SM     3bus1  46574     46       2005     2035     99%       97.67%   2005     100%      100%     Yes
CC     3bus1  4         2        735      735      100%      100%     735      100%      100%     No
CCL    3bus1  4032      18       835      835      100%      100%     835      100%      100%     No
BT3    Min2   640       32       930      930      100%      100%     930      100%      100%     No
BT3.5  Min2   230400    704      1230     1230     100%      100%     1230     100%      100%     No
BT4    Min2   21964800  768768   1500     1500     100%      100%     1500     100%      100%     No
DD     Min2   42        2        570      570      100%      100%     570      100%      100%     No
DM     Min2   310160    120      1250     1280     97.6%     92.5%    1250     100%      100%     Yes
SM     Min2   46574     2        1180     1200     98.3%     95.9%    1190     99.15%    97.96%   Yes
CC     Min2   4         2        400      400      100%      100%     400      100%      100%     No
CCL    Min2   4032      2        550      550      100%      100%     550      100%      100%     No
BT3    3bus2  640       32       920      920      100%      100%     920      100%      100%     No
BT3.5  3bus2  230400    704      1220     1220     100%      100%     1220     100%      100%     No
BT4    3bus2  21964800  2377728  1500     1500     100%      100%     1500     100%      100%     No
DD     3bus2  42        2        490      490      100%      100%     490      100%      100%     No
DM     3bus2  310160    1620     1230     1230     100%      100%     1230     100%      100%     Yes
SM     3bus2  46574     8        1160     1180     98.28%    96.01%   1160     100%      100%     Yes
CC     3bus2  4         2        400      400      100%      100%     400      100%      100%     No
CCL    3bus2  4032      12       550      550      100%      100%     550      100%      100%     No
Table 6–1: Measuring the optimality of the scheduling heuristics
The results for the MAP scheduling heuristic, both without and with instruc-
tion lookahead, are shown in Table 6–1. For each DAG, the number of valid
schedules is recorded together with the optimal makespan for the given target
architecture. The makespan generated by the heuristics together with its close-
ness to the optimum (recorded both as a percentage of the optimal (Closeness)
and as a percentage of the difference between the best and worst makespans
(within The Range – best being 100%, worst 0%)) are also included. It is assumed
that there are a sufficient number of registers available to avoid code spilling.
This would normally be determined at the register allocation phase of the com-
pilation and is not considered here (see Chapter 7). If the lookahead heuristic
generates a different schedule, this is indicated in the column “New Schd?”.
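The two measures reported in Table 6–1 can be reproduced with the short Python sketch below. The closeness formula is inferred here from the tabulated values (for example, 1185 ns against an optimum of 1105 ns is listed as 92.76%) and is not stated explicitly in the text; the range formula follows the description of best being 100% and worst 0%. The worst-case makespan of 1425 ns in the usage lines is a hypothetical value, since the table does not list the worst makespans.

```python
def closeness(makespan, optimal):
    """Percentage closeness to the optimum: 100% minus the relative excess.

    Inferred from the tabulated values, not defined as a formula in the text.
    """
    return 100.0 * (1.0 - (makespan - optimal) / optimal)


def range_position(makespan, best, worst):
    """Position within the range of valid makespans (best = 100%, worst = 0%)."""
    if worst == best:
        return 100.0          # every valid schedule is optimal
    return 100.0 * (worst - makespan) / (worst - best)


# BT3 on Min1: heuristic makespan 1185 ns, optimal makespan 1105 ns.
print(round(closeness(1185, 1105), 2))
# 1425 ns is a hypothetical worst makespan used only for illustration.
print(range_position(1185, 1105, 1425))
```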
The results look quite promising. In a majority of the cases for the 3bus1
and 3bus2 architectures, the MAP heuristic can find an optimal solution (only
in the case of SM is instruction lookahead required, for both architectures, to
reduce the makespan to optimum). However, the MAP scheduler does not
do as well on the Min1 architecture (for BT3, BT3.5, BT4, CCL, DM and SM).
The poorer makespans are due to a bottleneck on the write-back
bus. So significant is the effect of the bottleneck that even applying instruction
lookahead, i.e. waiting until a higher priority instruction becomes ready rather
than issuing the current one, has little effect. It turns out to be better in some
cases to stall the issue unit for a much longer period of time than that assumed
by the lookahead heuristic (of just the IICT), because this additional stall time
would be hidden by the write-back bottleneck. The bottleneck can actually cause
the lookahead heuristic to generate a schedule (e.g. for SM) whose makespan
is worse than the one generated by the original MAP heuristic. The makespan
would have been significantly better if it were not for the bottleneck (cf. SM on
Min2). Where the makespan is only slightly worse than the optimum, i.e. DM,
the heuristic together with instruction lookahead is sufficient to find an optimal
solution. In the case of the Min2 architecture, BT3, BT3.5, BT4, and CCL are now
optimal. This is because the relative delays of the microagents have reduced
the bottleneck for the write-back bus. In the case of DM and SM, there is still
interference between the instructions which results in sub-optimal executions.
This instruction interference can be reduced by applying a post-pass re-ordering
of the generated schedules.
6.6.1 Post-pass Optimisation for Instruction Interference
Instructions are said to interfere when a higher priority instruction’s flow (i.e.
execution) through the micronet is delayed by an instruction of a lower priority.
This occurs, for example, when an instruction is stalled waiting for a common
resource, such as the write-back bus.
The MAP scheduler, which is mainly concerned with minimising the stall
time of the issue unit, will generally choose to issue a lower priority instruction
rather than wait for a higher one to become ready. However, the instruction
lookahead heuristic tries to counterbalance this effect, albeit in a limited fashion.
As described in Chapter 5, the issue unit can only control the order in which
operands are fetched (via the pre-issue conditions), i.e. the execution order of the
micro-operations of microagents up to the execution stage. After this stage, the
order in which instructions acquire successive microagents, especially common
ones, may not necessarily be the same as the order in which the instructions
were scheduled. This is due to multiple paths which allow instructions to race
each other; the ability to skip stages; and varying stage delays, all of which are
afforded by the micronet. In fact, it is difficult to predict how the microagents
will be utilised as the schedule is being generated (i.e. on-the-fly), since a yet-to-
be scheduled instruction could still determine whether an already scheduled one
is serviced by a given microagent at a particular time. Therefore, any instruction
interference optimisations can only be made after the initial schedule has been
generated.
The only optimisation that can be made by the scheduler is to reorder the
instructions. The post-pass heuristic, described in Algorithm 5, tries to ensure
that instructions on critical paths are never delayed by those which are not. The
heuristic uses the earlier instruction scheduling priorities and the schedule’s
“trace” information from the computational model to determine whether the
issue order of two successive instructions should be swapped. Line 4 locates an
instruction which is scheduled after one with a lower critical path priority. If
the two instructions use a common microagent, in this case a write-back bus
(line 5), the trace information is used to determine if the second, higher priority
instruction is delayed by the first. This delay can easily be identified if the
second instruction is stalled at its previous microagent (line 8). However, the
heuristic (line 10) also assumes that if the second instruction requires the com-
mon microagent just as the first one finishes with it, then the control unit must
have delayed (i.e. stalled) the issuing of the second instruction. The algorithm
is applied to successive pairs of instructions in the schedule in a manner similar
to the well-known bubble sort algorithm. Although this heuristic may increase
the stall time of the issue unit, it has the overall effect of performing a restricted
form of resource lookahead.
Algorithm 5: Post-Pass Optimisation (reduce_interference())
 1  do
 2    SWAP = NO;    /* Assign InstA and InstB to the first two instructions in the schedule. */
 3    do
 4      if ((SWAP == NO) and (InstA.ETC < InstB.ETC))    /* Possible swap between InstA and InstB. */
 5        if (use_same_wbbus(InstA, InstB) == TRUE)    /* Both instructions use the same write-back bus. */
 6          InstA.et = time at which InstA relinquishes the write-back bus;
 7          InstB.rt = time at which InstB requires the write-back bus;
 8          if (InstA.et >= InstB.rt) SWAP = YES;    /* InstA delays InstB by the difference in these values. */
 9        else    /* No swap required since instructions use different resources. */
                  /* Get the next pair of instructions in the schedule. */
10      else    /* Get the next pair of instructions in the schedule. */
11    while not the end of the schedule;
12    if (SWAP == YES) simulate_schedule();    /* Obtain the new schedule's trace info for next iteration. */
13  while (SWAP == YES);
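The post-pass loop of Algorithm 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work. The toy simulate_schedule() model is hypothetical: instructions are issued in order every ISSUE_GAP time units, and a single write-back bus serves requests in order of arrival and is held for WB_TIME. The Inst fields, and the extra check that InstA actually holds the bus when InstB requires it, are also assumptions introduced here.

```python
from dataclasses import dataclass

ISSUE_GAP = 1  # assumed time between successive instruction issues
WB_TIME = 2    # assumed time an instruction holds the write-back bus


@dataclass(frozen=True)
class Inst:
    name: str
    etc: int        # critical-path priority (higher means more critical)
    latency: int    # time from issue until the write-back bus is required
    wbbus: int = 0  # identifier of the write-back bus used


def simulate_schedule(schedule):
    """Toy trace: name -> (rt, gt, et), the times at which the bus is
    required, granted and relinquished."""
    issue = {inst.name: i * ISSUE_GAP for i, inst in enumerate(schedule)}
    # Requests are served in order of arrival; ties go to the earlier issue.
    arrival = sorted(schedule,
                     key=lambda s: (issue[s.name] + s.latency, issue[s.name]))
    bus_free, trace = {}, {}
    for inst in arrival:
        rt = issue[inst.name] + inst.latency
        gt = max(rt, bus_free.get(inst.wbbus, 0))
        et = gt + WB_TIME
        bus_free[inst.wbbus] = et
        trace[inst.name] = (rt, gt, et)
    return trace


def makespan(schedule):
    return max(et for _, _, et in simulate_schedule(schedule).values())


def reduce_interference(schedule):
    """Swap adjacent instructions when a lower priority one delays a higher
    priority one on a shared bus; re-simulate after every swap."""
    schedule = list(schedule)
    swapped = True
    while swapped:
        swapped = False
        trace = simulate_schedule(schedule)
        for i in range(len(schedule) - 1):
            a, b = schedule[i], schedule[i + 1]
            if a.etc < b.etc and a.wbbus == b.wbbus:
                _, a_gt, a_et = trace[a.name]
                b_rt = trace[b.name][0]
                if a_gt <= b_rt <= a_et:  # b needs the bus while a holds it
                    schedule[i], schedule[i + 1] = b, a
                    swapped = True
                    break  # at most one swap per pass, as in Algorithm 5
    return schedule


before = [Inst("A", etc=1, latency=2), Inst("B", etc=5, latency=1)]
after = reduce_interference(before)
print([s.name for s in after], makespan(before), makespan(after))
```

Each swap moves a higher priority instruction ahead of a lower priority neighbour, so the number of priority inversions falls by one per swap and the loop terminates, in the same way as a bubble sort.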
                         MAP with Lookahead (LA)       MAP with LA and Post-pass
Prgm   MAP    Optimal    Makespan  Closeness  Range    Makespan  Closeness  Range
DAG    Arch   Makespan
BT3    Min1   1105ns     1185ns    92.76%     75%      1105ns    100%       100%
BT3.5  Min1   1505ns     1585ns    94.68%     85.71%   1505ns    100%       100%
BT4    Min1   1785ns     1885ns    94.4%      85.71%   1785ns    100%       100%
DD     Min1   1325ns     1325ns    100%       100%     1325ns    100%       100%
DM     Min1   1905ns     1905ns    100%       100%     1905ns    100%       100%
SM     Min1   2085ns     2265ns    91.37%     79.55%   2085ns    100%       100%
CC     Min1   735ns      735ns     100%       100%     735ns     100%       100%
CCL    Min1   945ns      1015ns    92.59%     88.89%   965ns     97.88%     96.82%
BT3    3bus1  1105ns     1105ns    100%       100%     1105ns    100%       100%
BT3.5  3bus1  1355ns     1355ns    100%       100%     1355ns    100%       100%
BT4    3bus1  1605ns     1605ns    100%       100%     1605ns    100%       100%
DD     3bus1  1225ns     1225ns    100%       100%     1225ns    100%       100%
DM     3bus1  1645ns     1645ns    100%       100%     1645ns    100%       100%
SM     3bus1  2005ns     2005ns    100%       100%     2005ns    100%       100%
CC     3bus1  735ns      735ns     100%       100%     735ns     100%       100%
CCL    3bus1  835ns      835ns     100%       100%     835ns     100%       100%
BT3    Min2   930ns      930ns     100%       100%     930ns     100%       100%
BT3.5  Min2   1230ns     1230ns    100%       100%     1230ns    100%       100%
BT4    Min2   1500ns     1500ns    100%       100%     1500ns    100%       100%
DD     Min2   570ns      570ns     100%       100%     570ns     100%       100%
DM     Min2   1250ns     1250ns    100%       100%     1250ns    100%       100%
SM     Min2   1180ns     1190ns    99.15%     97.96%   1180ns    100%       100%
CC     Min2   400ns      400ns     100%       100%     400ns     100%       100%
CCL    Min2   550ns      550ns     100%       100%     550ns     100%       100%
BT3    3bus2  920ns      920ns     100%       100%     920ns     100%       100%
BT3.5  3bus2  1220ns     1220ns    100%       100%     1220ns    100%       100%
BT4    3bus2  1500ns     1500ns    100%       100%     1500ns    100%       100%
DD     3bus2  490ns      490ns     100%       100%     490ns     100%       100%
DM     3bus2  1230ns     1230ns    100%       100%     1230ns    100%       100%
SM     3bus2  1160ns     1160ns    100%       100%     1160ns    100%       100%
CC     3bus2  400ns      400ns     100%       100%     400ns     100%       100%
CCL    3bus2  550ns      550ns     100%       100%     550ns     100%       100%
Table 6–2: The effects of Post-pass optimisations on Instruction Lookahead
schedules
The results of this optimisation, shown in Table 6–2, on the schedules gener-
ated by the lookahead heuristic are quite dramatic. All of the schedules except
one (CCL on Min1, which has been significantly improved nevertheless) are
now optimal, including the makespan for SM on Min1 which was made worse
by instruction lookahead (see Table 6–1). The results also show that the post-
pass heuristic does not adversely affect any of the schedules (even those which
are already optimal).
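The closeness figures in Table 6–2 can be reproduced with a short calculation. The formula below is inferred from the tabulated values rather than taken from a definition in this section, so it should be read as an assumption: closeness is taken to be 100% minus the makespan's excess over the optimal makespan, expressed as a percentage of the optimal. The function name closeness is hypothetical.

```python
def closeness(makespan, optimal):
    """Closeness (in percent) of a makespan to the optimal makespan.

    Assumed formula, inferred from Table 6-2:
        100 * (1 - (makespan - optimal) / optimal)
    """
    return 100.0 * (1.0 - (makespan - optimal) / optimal)


# BT3 on Min1 with lookahead only, and CCL on Min1 after the post-pass.
print(round(closeness(1185, 1105), 2))
print(round(closeness(965, 945), 2))
```

Under this reading, a schedule whose makespan equals the optimal makespan has a closeness of exactly 100%.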
                         The MAP Heuristic             MAP with Post-pass
Prgm   MAP    Optimal    Makespan  Closeness  Range    Makespan  Closeness  Range
DAG    Arch   Makespan
DM     Min1   1905ns     1925ns    98.95%     98.11%   1925ns    98.95%     98.11%
DM     Min2   1250ns     1280ns    97.6%      92.5%    1280ns    97.6%      92.5%
DM     3bus2  1230ns     1230ns    100%       100%     1230ns    100%       100%
SM     Min1   2085ns     2245ns    92.33%     81.81%   2085ns    100%       100%
SM     Min2   1180ns     1200ns    98.3%      95.9%    1260ns    93.22%     83.67%
SM     3bus1  2005ns     2035ns    99%        97.67%   2035ns    99%        97.67%
SM     3bus2  1160ns     1180ns    98.28%     96.01%   1180ns    98.28%     96.01%
Table 6–3: The effects of Post-pass optimisation on MAP instruction schedules
The results of applying the post-pass heuristic directly to the schedules
produced without using instruction lookahead are shown in Table 6–3. In the
cases of DM on Min1, Min2 and 3bus2, and SM on 3bus1 and 3bus2, there is no
improvement. The makespan for SM on Min2 is actually worse, while for SM on
Min1 it is now optimal. (Note that all of these schedules are optimal when both
lookahead and post-pass are applied.) This does not mean that the post-pass
heuristic will only work for schedules which can be improved by lookahead
(as emphasised by BT3, BT3.5 and BT4 on Min1), but rather that the heuristic
seems to give better results on those which can be.
This post-pass heuristic can be applied starting from either the beginning
(forward post-pass) or the end (reverse post-pass) of the instruction schedule
generated by the first-pass scheduler. The final schedules of the two approaches
are identical; however, the reverse post-pass tends to attempt (i.e. test for)
more swaps.
6.6.2 Are These Schedules Really Optimal?
Remember that these schedules are only optimal with respect to the instruction
costs which have been assumed. In practice, these schedules may not be optimal
for a particular execution of the program for the reasons discussed earlier, i.e. the
behaviour of the micronet is difficult to predict a priori and therefore instruction
schedules are based on worst-case costs. One could even expect that each run of
the program would have a different optimal schedule. Therefore it is impossible
to determine how far from true optimality the schedules are, without in effect
executing the actual instructions on the target architecture. This technique of
scheduling through self-simulation has already been proposed when schedul-
ing without a precise computational model [10]. The practicalities of such an
approach are still open to question. Although the stability of the schedules in
light of variance in the resource delays needs further study, this does not mean
that good (at least comparable with synchronous systems) program executions
cannot be achieved.
6.7 Open Problems
Figure 6–1: The makespans of schedules based on worst- and average-case
run-time costs (figure not reproduced; surviving panel title: "The Schedule
Based on Average-case Costs")
In a micronet-based processor, the actual execution times of instructions cannot
be accurately predicted at compile-time. Although the execution times of the
same instruction might vary due to data-dependent delays, worst-case, average-
case or even best-case figures for the execution cost can be found on which the
schedules could be based. When producing static schedules, the compiler has to
use the delays of the FMs and the question arises as to which of the sets of figures
to use. Figure 6–1 illustrates the simplified schedules for the Hennessy Test
(HT1) based on worst-case and average-case costs and figures for the execution
times of the instructions based on actual worst-case and average-case delays at
run-time for these schedules. The ratios of the delays for the two cases for the
instructions realistically reflect actual behaviour for the asynchronous processor
under study. The figures reveal that given these ratios, using a schedule based on
worst-case costs is better in practice. Using this approach a heuristic will always
try to schedule an instruction, if possible, only when its operands are guaranteed
to be available, thereby minimising any stalls. Note also that the schedule’s
correctness is not affected by the changes in instruction costs. Furthermore,
given that a program’s critical path may change with different executions (due
to different data sets) and that the schedule is generated once, the compiler’s
choice of which costs to use is important (e.g. for real-time programmers [133]).
By basing the schedule on worst-case delays a lower bound on performance can
be achieved.
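The effect of the chosen cost figures on a fixed static schedule can be illustrated with a small sketch. It assumes a deliberately simple execution model that is not the micronet model of this work: a single issue unit issues instructions in schedule order, taking issue_time per instruction, and stalls until the operands of the next instruction are available. The function run(), the four-instruction program and both cost tables are hypothetical.

```python
def run(order, deps, cost, issue_time=1):
    """Makespan of a static issue order under a given set of execution costs.

    order: instruction names in issue order (a topological order of deps)
    deps:  name -> list of producer instructions
    cost:  name -> execution time actually experienced
    """
    finish = {}
    t = 0  # time at which the issue unit is next free
    for name in order:
        ready = max((finish[d] for d in deps[name]), default=0)
        start = max(t, ready)  # the issue unit stalls until operands arrive
        finish[name] = start + cost[name]
        t = start + issue_time
    return max(finish.values())


deps = {"a": [], "b": ["a"], "c": [], "d": ["c"]}
worst = {"a": 4, "b": 1, "c": 2, "d": 1}    # assumed worst-case costs
average = {"a": 2, "b": 1, "c": 2, "d": 1}  # assumed average-case costs

interleaved = ["a", "c", "d", "b"]
sequential = ["a", "b", "c", "d"]
for order in (interleaved, sequential):
    print(order, run(order, deps, worst), run(order, deps, average))
```

In this model the same issue order remains correct whichever costs are experienced, since the stall preserves every dependence, and shortening any execution time can never lengthen the makespan. The makespan computed from worst-case costs therefore bounds the run-time makespan of that schedule from above.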
6.7.2 Interaction Between Executing Instructions
While optimising the instruction schedule is more difficult than in synchronous
processors for the reasons stated previously, other reasons contribute as well,
such as the difficulty in predicting the global state of the micronet. In synchron-
ous processors, the compiler can assume when scheduling a basic block that the
datapath is idle and all of the resources are available. This is a consequence of
the fact that in synchronous pipelines, an instruction never affects the execution
of other instructions. This is not necessarily the case in a micronet, since the
execution times of instructions might vary for the following reasons: only a
partial ordering is employed between instructions (i.e. it is not necessary for
the previous instructions to have completed their execution before successive
ones); instructions compete for shared resources, e.g. the write-back bus; during
execution instructions might interfere with each other. Therefore, the state of
the resources at any particular time cannot be predicted accurately at compile-
time. But this information is indeed available at run-time in the issue unit of
the micronet. The control unit could use it to dynamically tune the static
schedule (i.e. to allow out-of-order instruction issue). This requires
identifying an instruction which can be executed immediately (easily achieved using
the control acknowledgement signal scoreboarding mechanism), and checking
that the instruction is independent of earlier ones in the instruction buffer. Al-
though the latter may be expensive to perform, the task can be made simpler
with assistance from the compiler by using a concurrency bit.
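The two checks described above can be sketched as follows. This is an illustrative model only: the Instr record, the set of free functional units (standing in for the scoreboard-like acknowledgement signals) and the register-based independence test are assumptions made for the example. The test spelled out here is the kind of information a compiler-supplied concurrency bit could precompute, but the encoding of that bit is not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    op: str
    dest: str
    srcs: tuple
    unit: str  # functional unit required


def independent(later, earlier):
    """True if 'later' may be issued ahead of 'earlier' in the buffer."""
    return (earlier.dest not in later.srcs      # no read-after-write
            and later.dest not in earlier.srcs  # no write-after-read
            and later.dest != earlier.dest)     # no write-after-write


def pick_issue(buffer, free_units):
    """Index of the first buffered instruction that can execute immediately
    and is independent of every earlier instruction in the buffer."""
    for i, ins in enumerate(buffer):
        if ins.unit in free_units and all(
                independent(ins, earlier) for earlier in buffer[:i]):
            return i
    return None  # nothing can be issued out of order at present


buffer = [
    Instr("mul", "r3", ("r1", "r2"), "MUL"),
    Instr("add", "r4", ("r3", "r5"), "ALU"),  # needs the result of the mul
    Instr("sub", "r6", ("r7", "r8"), "ALU"),  # independent of both
]
print(pick_issue(buffer, free_units={"ALU"}))
```

With only the ALU free, the multiply cannot start and the add depends on it, so the subtract is selected. The pairwise comparison against every earlier buffer entry is the part that grows expensive in hardware, which is where compiler assistance would help.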
6.8 Conclusions
The micronet model exposes temporal and spatial concurrency in the datapath,
with fine-grained resources now being visible to the compiler. This model
subsumes the micropipeline model, which only exploits temporal concurrency
in the datapath, and the scheduling methods described here can equally be
applied to micropipeline-based processors.
Code scheduling (on ILP architectures) and machine-dependent optimisa-
tions have a significant impact on program performance. It is the task of the
compiler to schedule instructions such that these resources are efficiently util-
ised. The instruction schedule is devised based on a (parameterised) computa-
tional model of the target architecture. For synchronous architectures the model
is simple; in contrast, an asynchronous model is necessarily less accurate for
the reasons discussed earlier. However, initial studies have shown that these
factors do not significantly hinder a MAP compiler’s ability to schedule code
efficiently. Worst-case instruction execution times have been considered for the
reasons described earlier and the resulting schedule is treated as a first pass one.
The interference between the instructions can be reduced by applying post-pass
optimisations. The instructions could then be dynamically reordered at run-
time to fine-tune this schedule by taking advantage of actual run-time costs.
Due to the asynchronous behaviour these instructions are issued as soon as
possible, without the need for delays using NO-OP instructions. In conclusion,
preliminary studies have shown that a micronet-based asynchronous processor
architecture does present a suitable target for an ILP compiler.
Chapter 7
Conclusions and Future Work
7.1 A Summary
Traditionally, the sequencing of information within processor architectures has
been synchronous – centrally controlled by a clock. This global clock places
limits on future gains in performance which can potentially be achieved by
improvements in implementation technology. This thesis has investigated the
effects of relaxing the strict synchrony by distributing control within the pro-
cessor architecture and also its impact on the overall system design. Micronets
have been proposed as an efficient implementation of an asynchronous control
paradigm for processor architectures and their effect on system performance
has been explored on three fronts. Firstly, with respect to an instruction set,
the execution times of individual instructions were compared under the two
control alternatives. A synchronous RISC architecture was transformed into a
comparable self-timed one and simulation studies demonstrated improvements
in the performance of the instruction set over the corresponding synchronous
processor. Secondly, although improved performance through increased silicon
utilisation within architectures which exploit instruction-level parallelism (ILP)
in the form of pipelining is a key feature in processor designs, synchronous
designs actually incur an increase in control complexity which adversely affects
their efficiency. Based on an initial MAP design, a series of refinements has
been made to the control framework, which shows that the micronet approach
is better able to exploit the available ILP amongst the functional units within
processor architectures, without significantly increasing control complexity.
In micronets, not only can the handshake protocols be used to avoid hazards
and minimise stalls, but the overheads due to asynchrony can also be hid-
den. Finally, although additional processor performance within the datapath
has been exposed, whether or not the system benefits depends on a compiler’s
ability to exploit this improvement. An architecture needs to expose the avail-
able resource concurrency, while the compiler extracts the program parallelism
(architecturally-independent) and maps it onto the former in such a manner as
to maximise performance. Machine dependent optimisations and code schedul-
ing (on ILP architectures) have a significant impact on the overall system per-
formance. Performance gains obtained by RISC compilers have been due to
the availability of accurate models of instruction behaviour on their target ar-
chitectures. However, under asynchronous control, the resulting variable and
non-predeterministic execution time of instructions due to data-dependent
operations does not seem to adversely affect the generation of good schedules. In
conclusion, the adoption of micronets as an asynchronous distributed control
paradigm can lead to a more efficient utilisation of functional units and thus
improved system performance.
7.2 Effects on System Design
Figure 7–1: Influences within processor system architectures
It is well known that the effective performance of a well integrated computer
system is to a large measure determined by the synergy between the design
of the processor architecture, the instruction set and the compiler. Therefore,
the design of any such system should consider each of these areas and their
relationship to each other. Furthermore, as this work has highlighted, another
area (namely the control paradigm) also requires consideration (Figure 7–1):
The Instruction Set design is determined by the type of applications or
programming languages for which the system is targeted, with RISC designers
choosing to include only those instructions which are likely to be used frequently
and whose exclusion would seriously degrade the system’s performance [11].
Although the trend from sets containing a large number of complex instructions
towards sets containing fewer, less complex ones has led to simpler datapath
and control, an improvement in overall system performance is still dependent
on the compiler.
The Compiler is responsible for the construction of a (near) optimal (minimal
code size or execution length) sequence of instructions which implements the
target application program (written in a specific programming language). The
advent of reduced instruction set architectures saw the accelerated development
of optimising compiler techniques. These techniques have had a significant
impact on system performance, thus making compilers, at least their back-
ends, an integral part of a processor system. Code generation/scheduling and
machine dependent optimisations require a detailed knowledge about each
instruction’s execution behaviour on the target architecture.
The Datapath Architecture, which is a collection of components (register
bank, functional units, etc.) and their control signals and datapath
interconnections (dedicated or shared), aims to implement the execution of each
instruction as efficiently as possible, which may also involve considering
trade-offs between power consumption, silicon area and performance. By streamlining
the datapath of an architecture, its complexity can effectively be migrated to the
optimising compiler. Details of the architecture become easier to make visible
to the compiler, and its computational model, which reflects the behaviour of
the architecture, becomes more tractable. The regular and determinate behaviour
allows optimising techniques to be more effective.
The Implementation Technology (IT) has played a significant part in
improving the performance of processor architectures: transistorisation, various
process technologies, scaling and fabrication techniques have all played their part.
However, current advances in IC technology affect a synchronous control paradigm’s
ability to exploit the performance gains available (as discussed earlier in Chapter 2).
The Control Paradigm (CP), as the name suggests, is the mechanism by
which the operation of the components within the datapath architecture is
coordinated. Throughout the history of computer architecture, with a few
exceptions, this has been synchronous: a global clock signal sequences
actions and its period is used to account for delays. With the majority of
processor designs being based on a centralised synchronous control, the notion










Figure 7–2: Previously implicit influences within system architectures
This thesis has explored an asynchronous control paradigm where the se-
quencing is decentralised and architectural components communicate using
handshaking protocols. Self-timed control has been investigated together with
its influence on the above areas and thus the overall effect on the performance
of an integrated system. This work has not set out to find the best instruction
set for asynchronous processors since it is felt that the CP does not significantly
restrict the choice of instructions that could be included. In fact, an
asynchronous CP is less restrictive since instruction execution times and design
delays do not affect its correctness. However, some implementations of instructions, such
as those which rely on the timing of operations or other instructions, may not
be efficient or even possible.
On the other hand, the CP’s influence on the datapath architecture is defin-
itely more marked. While it is possible to implement a traditional synchronous
architecture precisely in an asynchronous manner [137], one may find that the
new design operates more slowly, mainly due to the additional control required to
force the design to operate or support certain features peculiar to the synchron-
ous version. As described earlier in Chapter 2, the design goals under the two
CPs are different, leading to possibly different implementations of the architec-
tural components. The micronet approach with decentralised and distributed
control, which does not preclude any particular architecture, does however lead
to architectures composed of autonomous concurrently operating units.
From a purely performance point of view, for the modern RISC optimising
compiler, the influence of an asynchronous control paradigm appears at first
sight to be detrimental. The reason is the rôle played by the CP: synchronous
control implies both an event ordering and timing which leads to predictable
behaviour; asynchronous control implies only an event ordering. Without any
timing information one might surmise that it becomes more difficult to optimise
code schedules. In practice, preliminary results seem to imply that the compiler
is not adversely affected by an asynchronous CP. Even though the schedules
produced should be at least as good as those produced for a synchronous
processor, the system should benefit from dynamic reordering to exploit further
(run-time) performance.
The use of particular implementation technologies may also influence the
choice of CP. Some types of MOS families may be considered well suited to
self-timed circuit design, e.g. those which could use the precharge phase as
a “spacer” between data values as an alternative method for hiding
handshake overheads. Techniques such as differential cascode voltage switch logic
(DCVSL) [30] [69], Precharged CVSL [160] or domino CMOS logic [175] have
already been used [98].
The behaviour of asynchronous processors is complex and their performance
is difficult to predict. Discrete event simulations as described in this work offer
a method for accurately measuring their performance. The model in Occam2
naturally captures the concurrency and asynchronous communication. This
also allows the simulation to be parallelised to obtain reasonable run-times for
large circuits and test programs. This is aided by the asynchronous nature of the
underlying simulation algorithm itself [8]. Although there are numerous tools
and techniques for the synthesis, verification and silicon compilation of self-
timed circuits, tools for the development, evaluation and testing of self-timed
(processor) systems [29] [51] are still lacking.
Given that micronets provide an efficient control framework for MAP
systems, many of these aspects are being addressed [92] [127] and could be
investigated in more detail in future work.
7.3 On-Going and Future Work
7.3.1 Easing System Design
The distribution of control to the functional units improves performance by
exploiting fine-grain concurrency and actual delays. The majority of control in
MAP architectures is delegated to the interfaces of the functional units. The
work in [126] has addressed the design of these control interfaces (the CMs)
by introducing the idea of control constructs. These enable the efficient im-
plementation of control interfaces which is crucial to the performance of the
asynchronous processors. High-level descriptions of control constructs have
been described in VHDL and a library of cells has been implemented in the
Cadence Design Framework for automated synthesis [164]. Results from SPICE
simulations for an add ALU operation have been presented which demonstrate
the feasibility of distributing controls [4]. This work is an important step for the
rapid prototyping of micronet-based asynchronous processors in a top-down
fashion. The separation of timing and functionality enables truly modular
designs, i.e. functional units can be modified without redesigning the rest of the
system. Thanks to the micronet, the number and type of functional units can
be changed by simply specifying the behaviour of the control interface with
respect to the rest of the system in terms of the control constructs. This enables
the designer/computer architect to explore the architectural design space with
ease, for example, determining the optimal number of functional units for a
class of problems in the design of micronet-based superscalar architectures.
7.3.2 Extending the Micronet Architecture
Conditional Branching
The Fetch and Branch Unit (FBU) itself can be viewed as an instruction pre-
processor, handling all PC related instructions. Its task is simply to supply the
CU (and the execute stage) with, if possible, the correct stream of instructions.
However, the implementation of (conditional) branch instructions is one of the
hardest and most important problems to be dealt with in high performance
pipelined processors. Branch instructions tend to interrupt the smooth flow
of instructions through the datapath making the average instruction through-
put rate much lower than the peak rate. For example, early studies for the
pipelined MU5 computer showed that if branches occurred in only one out of
ten instructions then performance would be reduced by a factor of three, unless
precautions were taken [122]. The importance of dealing with the performance
degradation has long been recognised [23]. Branch delay can be reduced by
implementing branch instructions so that a branch transfer does not take effect
until a fixed number of instructions following the branch have also been executed.
This technique is commonly referred to as “delayed branching” and was used
as early as 1952 in the Los Alamos MANIAC and more recently in early RISC
processors such as IBM 801 [138], the Berkeley RISC I [136] and the Stanford
MIPS [71]. Delayed branching is one of the simplest ways to optimise branches
in synchronous architectures. However, a major limitation is the difficulty of
filling the required number of delay slots determined by the time taken to re-
solve the branch condition [113]. While this number is fixed for a synchronous
architecture, the number of instructions required to be fetched to hide the branch
latency in an asynchronous datapath may be variable, depending not only on
the execution cost but also the relative instruction fetch cost. Although this
approach could be used for a specific MAP architectural design, as a general
approach it is not viable. Thus, for micronet-based architectures, the preferred
techniques might be ones which do not rely on fixed timing for their correct op-
eration, such as branch prediction schemes [100] [153] or advanced branching
mechanisms [132].
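A rough calculation illustrates why a fixed number of delay slots does not carry over to an asynchronous datapath. The model is an assumption made for this example: the number of instructions that must be fetched to cover a branch is taken to be the branch resolution delay divided by the instruction fetch delay, rounded up. The function delay_slots() and all delay values are hypothetical.

```python
import math


def delay_slots(resolve_delay, fetch_delay):
    """Instructions that must be fetched to hide the branch latency,
    under the simple ratio model assumed above."""
    return math.ceil(resolve_delay / fetch_delay)


# Synchronous case: both delays are whole numbers of clock cycles,
# so the slot count is a fixed property of the pipeline.
print(delay_slots(resolve_delay=2, fetch_delay=1))

# Asynchronous case: the resolution delay depends on the data, so the
# number of slots needed varies between executions of the same branch.
for resolve_ns in (30, 50, 70):
    print(resolve_ns, delay_slots(resolve_ns, fetch_delay=20))
```

Since the compiler has to commit to a single slot count when the code is generated, any fixed choice is either too small for the slow cases or wasteful for the fast ones.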
Out-of-Order Instruction Issue
Since the compiler may not be able to generate the best schedule, the CU may
need to issue instructions out-of-order from the instruction buffer. This requires
the identification of an instruction which can be executed immediately (easy and
cheap since the handshake mechanism with the functional units acts like a
scoreboard), and checking that it is independent of the previous instructions in
the buffer which might be expensive (dynamic register renaming) without the
compiler’s help [120].
Out-of-order instruction issue would allow the control unit to fine-tune the
static instruction schedule to take advantage of variable instruction execution
times. In the presence of out-of-order instruction issue (or out-of-order operand
fetch), the issuing (and execution) of instructions is only limited by the availab-
ility of resources and operands. Micronets can therefore be viewed as a hybrid
dataflow style of architecture, limited to the window of instructions available in
the instruction buffer, without the bookkeeping costs associated with traditional
dataflow architectures [61].
Exception Handling and Speculative Execution
Many synchronous processor architectures have been developed to exploit high
degrees of ILP. Some of these processors dispatch multiple instructions from a
conventional linear instruction stream to multiple functional units
simultaneously and use mechanisms for out-of-order instruction issue and completion,
branch prediction and speculative execution to remove the constraint of
sequential instruction execution. The added complexity brought about by these
mechanisms makes it more difficult for the processor to maintain a precise
system state after an exception occurs [81]. An exception is said to be precise if
the saved process state corresponds with a sequential mode of program execu-
tion where one instruction completes before the next one begins. Many of the
methods adopted by these synchronous processors for implementing precise
interrupts [154] can be applied to MAP. For example, a history buffer (which is
a first-in-first-out (FIFO) queue of all the instructions that are executing) can be
used in the same way as it is in the MC88110 [40]. Alternatively, by introducing
some processing (decision making) capabilities into the register bank, tech-
niques equivalent to shadow registering [95] or the use of reorder buffers [154]
could also be employed [155].
Although instructions may be fetched speculatively by the FBU in MAP,
whether they should be executed speculatively is an architectural trade-off. Just
as in synchronous designs, the techniques and hardware support for exception
handling can be exploited to support speculative execution [40] [155] [167].
Extending Micronets to Implement Superscalar Architectures
The evolution of a synchronous scalar architecture into a superscalar one gen-
erally requires the duplication of the entire datapath. In MAP, this may not
be necessary for a number of reasons: since the fetch and execute stages are
decoupled, the effective instruction fetch rate may be fast enough that
duplication of the fetch stage is not required; superscalar architectures
exploit spatial parallelism, and this is already achieved to a degree by a
native/scalar micronet datapath; and the natural extensibility of the micronet
means that additional resources can be incorporated and exploited easily and
efficiently, given a sufficiently fast instruction issue rate. Should this not
be the case, the duplication of the instruction issue unit is possible (making the
architecture superscalar) with more microagents to support concurrent operand
fetching and the pre-issue conditions being modified to avoid new hazards and
support out-of-order instruction issue. Due to the asynchronous behaviour, it
would be inefficient to operate the instruction issue units in lock-step. The pro-
cessor would now have to support complete dynamic instruction scheduling
(out-of-order issue and out-of-order operand fetch). Johnson [81] provides a
careful assessment of the complexity of the control logic involved in synchron-
ous superscalar processors. The design and implementation of a superscalar
micronet-based processor is currently being investigated [127].
Some Additional FU modifications
The designs of the functional units themselves may need to be modified to ex-
ploit the benefits of an asynchronous control paradigm or MAP architecture,
e.g. average case delays. The Memory Unit (MU) services the load and store
requests. The simplest design option is to service the requests in the order
in which they arrive. However, to reduce the time that other functional units
are stalled waiting for data, load and store requests could be separated.
Giving priority to load requests may reduce the data wait latency, although
each load must then be checked against any pending write request.
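The load-priority option can be illustrated with a minimal sketch. It is not the MU design used in MAP: the class and method names are hypothetical, and forwarding the value of a matching pending store (rather than stalling the load) is an assumption made purely for illustration.

```python
from collections import deque


class MemoryUnit:
    """Hypothetical MU model: loads overtake queued stores unless addresses clash."""

    def __init__(self):
        self.memory = {}                # address -> value
        self.pending_stores = deque()   # (address, value), oldest first

    def store(self, address, value):
        # Stores are only queued here; they are written to memory later, in order.
        self.pending_stores.append((address, value))

    def load(self, address):
        # Check the load against pending writes, youngest first, so that the
        # load observes the most recent value queued for its address.
        for pending_address, value in reversed(self.pending_stores):
            if pending_address == address:
                return value            # assumption: forward instead of stalling
        return self.memory.get(address, 0)

    def drain_one(self):
        # Service the oldest pending store, e.g. when no load is waiting.
        if self.pending_stores:
            address, value = self.pending_stores.popleft()
            self.memory[address] = value
```

A load to an address with no pending store proceeds immediately, which is where the reduction in data wait latency would come from; the cost is the address comparison on every load.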
If the fetch stage has a sufficiently small delay (i.e. there are no significant
periods where the execute stage is starved of instructions), the FBU could
be modified to decode encoded instructions stored in memory, i.e. the unit
could effectively be used as a pre-instruction-decode stage. This would speed
up the instruction issue stage and permit smaller code size and perhaps lower
power consumption, at the expense of a wider instruction buffer and increased
fetch latency [24] [48].
7.3.3 Parallelising Compilers for a Superscalar MAP
Although Instruction-Level Parallelism (ILP) has been exploited by
high-performance uniprocessors for the past 30 years, the 1980s saw it play a
much more significant rôle in computer design [94] [139]. ILP consists of a number
of processor and compiler design techniques which are generally transparent
to the user. Certain functions must be performed if a sequential program is to
be executed in an ILP fashion: the program must be analysed to determine the
type of dependencies between instructions and when these will be resolved;
scheduling and register allocation must be performed; often operations must be
executed speculatively, which in turn requires branch prediction. A number of
design choices exist as to whether these functions are supported in the compiler
or run-time hardware. Future MAP research should attempt to answer these
questions.
Since a formidable amount of work has been done in the traditional ILP
field [139], future work regarding the use of micronets may only need to consider
the effects of an asynchronous control paradigm on ILP techniques (e.g. [17]).
Although some work has been done with List Scheduling heuristics, this
approach may not produce the best results. Other interesting questions also arise,
such as: with out-of-order instruction issue, how much work needs to be (and
can be) done by the compiler? In practice, how much variance in instruction
execution times should be expected in typical programs [53]? Also, how feasible
is it to develop one efficient compiler for a family of MAP architectures?
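For readers unfamiliar with List Scheduling, the following is a minimal sketch of a critical-path list-scheduling order for a single issue unit. It is not the scheduler used in this work: the function name, the use of estimated (average) delays as priorities, and the alphabetical tie-break are all assumptions for illustration, and the dependency graph is assumed to be acyclic.

```python
def list_schedule(latency, deps):
    """Return an issue order chosen by critical-path priority.

    latency: dict mapping instruction name -> estimated delay (an average,
             since actual delays vary at run time in an asynchronous datapath)
    deps:    dict mapping instruction name -> set of names it depends on
    """
    successors = {name: [] for name in latency}
    for node, preds in deps.items():
        for pred in preds:
            successors[pred].append(node)

    priority = {}

    def critical_path(node):
        # Longest estimated delay from this node to the end of the program.
        if node not in priority:
            priority[node] = latency[node] + max(
                (critical_path(s) for s in successors[node]), default=0)
        return priority[node]

    for name in latency:
        critical_path(name)

    scheduled, order = set(), []
    while len(order) < len(latency):
        # An instruction is ready once everything it depends on is scheduled.
        ready = [n for n in latency
                 if n not in scheduled and deps.get(n, set()) <= scheduled]
        best = min(ready, key=lambda n: (-priority[n], n))
        order.append(best)
        scheduled.add(best)
    return order
```

With illustrative delays `{"ld": 43, "add1": 24, "add2": 24, "st": 23}` and the chain `ld -> add2 -> st`, the independent `add1` is placed between `add2` and `st`. Because the priorities are only estimates, a static order of this kind may be suboptimal when actual delays differ, which is one reason the approach may not produce the best results.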
7.4 Discussion
The emergence of VLSI technology, together with the maturing of optimising
compiler techniques, aided the development of early RISC architectures [71] [86] [136].
Their primary concern was the efficient usage of expensive silicon real estate,
and careful consideration was given to the design of the instruction set architec-
ture [102]. There have been two orthogonal trends in the evolution of synchron-
ous processor architectures [84]: the deeply-pipelined architectures [118], i.e.
ones which exploit temporal parallelism, and superscalar architectures which
exploit spatial parallelism [40] ([44] is an example which exploits both). Both
these classes have benefited from improvements in technology and the result-
ing faster clock frequencies. But these improvements have been sustained at
a high price in terms of clock distribution, power consumption, and design
complexity [42]. Furthermore, significant additional control costs are incurred
in exploiting ILP in both cases.
Micronets offer an alternative model for the design of future processor archi-
tectures. Whereas the original RISC ideal was the efficient usage of the silicon
space by identifying the critical resources, a micronet is essentially concerned
with their efficient utilisation over time. This is achieved in two ways: by re-
moving the clock, and distributing control to the resources; and viewing the
datapath not as a linear pipeline, but as a network of communicating resources.
Micronets are able to exploit fine-grain ILP efficiently (the overheads due to
asynchrony being hidden) and without the additional control costs (the protocol
also implements the scoreboarding and hazard avoidance mechanisms).
The asynchronous and distributed nature of the control in micronets allows
the processor to be easily extended with little effect on the rest of the design. For
a given class of problems, the designer is able to easily explore the architectural
design space more accurately by adding critical resources. This can be naturally
extended to superscalar architectures by increasing the number of issue units.
(Synchronous superscalar architectures replicate entire datapaths.) The same
scoreboarding mechanism is shared between the issue units for determining the
global state of the datapath.
7.5 Conclusions
This thesis has highlighted the increasing inefficiencies due to the clock and
centralised control in synchronous designs. Many of these problems can be
avoided by using self-timed circuits, and a method for converting synchronous
pipelines to their self-timed equivalents has been outlined. This has been
generalised to a novel asynchronous control technique, known as Micronets, for
decentralising control in asynchronous processor architectures. Micronets are
viewed as a network of communicating functional units, which expose fine-
grain concurrency between instructions.
This work has investigated the effect of removing synchrony in processor
design and the consequent influence of an asynchronous control paradigm
on the design and performance of RISC processor architectures for exploiting
fine-grained ILP. It has been demonstrated that, for a RISC architecture, the
instruction execution of a self-timed design is able to exploit actual run-times.
The advantages of asynchronous control go even further, in being able to
support instruction-level concurrency. A Micronet-based Asynchronous Processor
(MAP) architecture (which is effectively a variable length multiple-pipelined
datapath) has been designed to efficiently exploit instruction-level parallelism
and the nature of control for such an architecture has also been outlined. It
has been demonstrated that four-phase handshaking protocols enable the im-
plementation of highly concurrent structures and in most cases the overheads
can be hidden. Just as importantly, these protocols are used to efficiently avoid
datapath hazards. By applying the self-timed design paradigm to decentralised
control, the control mechanisms in MAP are distributed amongst its functional
units, which allows the exploitation of a finer grain of ILP than previously
possible. Improved architectural performance comes from being able to exploit
both the actual run-time delays of the microagents and their concurrent op-
eration. Some of the issues relating to micronets as targets for parallelising
compilers have been discussed. Initial work has also confirmed the suitability
of the asynchronous processor as a good target for these compilers. The modular
nature of micronets eases modification and empowers the computer architect
with finer control in the design, for example, of superscalar architectures. Fi-
nally, the micronet model considers the interactions between the underlying
implementation technology, the architecture and the compiler, and underlines
the integrated approach to system design.
Appendix A
Glossary
Actual (Program) Execution Time – The time between the issuing of an instruc-
tion (or start of a program) and the completion of all actions associated
with that instruction (or program).
Asynchronous – An asynchronous circuit is an ‘unclocked’ circuit, i.e. a circuit
which does not rely on global synchronisation by an external clock signal.
Asynchrony implies the absence of any timing bounds on the operation of
a circuit (whose duration may be subject to many uncontrolled factors).
Delay Insensitive – A circuit is delay-insensitive if its correct operation is
independent of any assumptions about the delays of the individual
components or wires in the circuit, except that those delays be finite (cf.
speed-independent).
Equipotential Region – An equipotential region is a portion of a circuit within
which propagation delays in wires are considered to be negligible. The
smaller the area of the region, the more validity this assumption has in
practice.
Fetch Cycle Time – The time between the Control (Execute) Unit requesting
the next instruction from the instruction cache or memory and receiving
it.
Instruction Cycle Time (ICT) – The execution time of a particular instruction
as seen by the Control Unit. It is measured as the time between instruction
issues of the same type.
Instruction Issue Time (IIT) – The time taken to issue an instruction. This
constitutes just half of the four-phase protocol and represents the time
between decoding and issuing the instruction. (In synchronous designs,
this would be the decode cycle, with operand fetch occurring either
concurrently or afterwards.) In MAP, the fetching of operands is considered
to be part of the instruction’s execution. This is because the register bank is
also treated as a functional unit or resource from which required operands
may be unavailable.
Instruction Issue Cycle Time – The time between the issue of any two success-
ive instructions. This is the time to complete the four-phase handshaking
protocol and is therefore limited by the handshake cycle time of the slowest
common control signal or IIT.
Isochronic Fork – A fork or branch of a wire in a circuit is considered to be
isochronic if the difference in the propagation delays between branches
is negligible. This is obviously the case if all branches of the fork are
contained in an equipotential region.
Micronet – A micronet is a network of pipelines (micropaths), with (selected)
stages of different pipelines being able to communicate with each other.
This enables the exploitation of both spatial and temporal concurrency
between instructions [4] (in contrast, a micropipeline only exploits tem-
poral parallelism [6]).
Micropath – A micropath is a pipeline or sequence of microagents, and in turn, a
microagent performs either a communicating or a functional micro-operation.
A functional microagent (FM) communicates with other FMs through their
respective communicating microagents (CM).
Micropipeline – A micropipeline is a self-timed, event-driven, elastic pipeline
whose stages operate asynchronously and communicate using the two-
phase bundled data protocol [158].
Self-Clocked – Self-clocked circuits are self-timed designs that are implemented
using a hidden internal clock within an equipotential region. Although in-
ternally they are composed of clocked synchronous elements, self-clocked
circuits retain an external asynchronous interface.
Self-Timed – Self-timed circuits use asynchronous initiation and completion (or
request/acknowledge) signals. The class of self-timed circuits includes all
delay-insensitive, speed-independent and self-clocked circuits.
Speed Independent – A circuit is said to be speed-independent if its correct
operation is independent of the delays in the individual components of
the circuit. It is assumed that there is no propagation delay associated
with the wires of the circuit (cf. delay-insensitive).
Appendix B
The PEPSÉ Simulator
B.1 The Simulation Algorithm in OCCAM2
{{{ PROC elsa.platform
PROC elsa.platform(CHAN OF ANY tty,
                   []CHAN OF INT::[]INT in,out,
                   VAL INT function.delay)
  -- Basic structure for the simulation platform.
  -- Folders marked with ** require modifications when customising.
  {{{ process runtime parameters **
  VAL INT max.input.width IS elsa.tuple.len.default+2:
  VAL INT max.output.width IS elsa.tuple.len.default+2:
  -- elsa.tuple.len.default is a constant currently set to 4. This
  -- is length of tuple with only one state value. Here, the input
  -- and output buffers will be defined to hold tuples with up to
  -- 3 state values.
  }}}
  {{{ variables
  [no.inputs][max.input.width] INT ipdata:
  [no.outputs][max.output.width] INT opdata:
  -- Buffers for inputs and outputs.
  }}}
  {{{ PROC function **
  PROC function([][]INT istates,ostates)
  -- This is the procedure which evaluates the output states given
  -- the current inputs.
  }}}
  SEQ
    {{{ initialisation
    -- Set default values for flags
    }}}
    {{{ initialise input and output buffers
    PAR
      PAR i=0 FOR no.inputs
        PAR j=0 FOR max.input.width
          ipdata[i][j]:=0
      SEQ -- Each output set to initial values
        PAR i=0 FOR no.outputs
          SEQ
            opdata[i][elsa.tup.len]:= elsa.tuple.len.default
            PAR j=1 FOR max.output.width-1
              opdata[i][j]:= tristate -- initial state values.
            opdata[i][elsa.start.time]:= 0
            opdata[i][elsa.end.time]:= function.delay
    }}}
    {{{ send initial output tuples
    PAR i=0 FOR no.outputs
      out[i] ! opdata[i][elsa.tup.len]::opdata[i]
    }}}
    WHILE NOT finished.sim
      SEQ
        {{{ fetch necessary inputs
        PAR i=0 FOR no.inputs
          IF
            (ipdata[i][elsa.start.time]=ipdata[i][elsa.end.time])
              in[i] ? tuple.length::[ipdata[i] FROM 0 FOR tuple.length]
            TRUE
              SKIP
        }}}
        {{{ execute function **
        function(ipdata,opdata) -- Behavioural model of Object.
        }}}
        {{{ determine OUTPUT start time
        PAR i=0 FOR no.outputs
          opdata[i][elsa.start.time] := ipdata[0][elsa.start.time]+function.delay
        }}}
        {{{ determine OUTPUT end time
        minimum.end.time := max.sim.time
        SEQ i=0 FOR no.inputs
          IF
            (minimum.end.time>ipdata[i][elsa.end.time])
              minimum.end.time := ipdata[i][elsa.end.time]
            TRUE
              SKIP
        PAR i=0 FOR no.outputs
          opdata[i][elsa.end.time] := minimum.end.time + function.delay
        }}}
        {{{ send outputs
        PAR i=0 FOR no.outputs
          IF
            (max.sim.time > ipdata[i][elsa.end.time])
              out[i] ! opdata[i][elsa.tup.len]::[opdata[i] FROM 0 FOR opdata[i][elsa.tup.len]]
            TRUE
              SKIP
        }}}
        {{{ update simulation time
        PAR i=0 FOR no.inputs
          ipdata[i][elsa.start.time] := minimum.end.time
        }}}
        {{{ Simulation Complete ?
        IF
          (ipdata[0][elsa.start.time] >= max.sim.time)
            finished.sim := TRUE
          TRUE
            SKIP
        }}}
    {{{ Sink irrelevant inputs
    SEQ i=0 FOR no.inputs
      WHILE (max.sim.time > ipdata[i][elsa.end.time])
        in[i] ? tuple.length::[ipdata[i] FROM 0 FOR tuple.length]
    }}}
:
}}}
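The timing rules in the listing above can be restated as a small Python sketch. This paraphrases only the "fetch necessary inputs", "determine OUTPUT start time" and "determine OUTPUT end time" folds. The function names and the representation of each tuple's time window as a `(start, end)` pair are assumptions, not part of the simulator.

```python
def inputs_to_fetch(input_windows):
    """Indices of inputs whose time window is exhausted (start == end)."""
    return [i for i, (start, end) in enumerate(input_windows) if start == end]


def output_window(input_windows, function_delay, max_sim_time):
    """Compute the (start, end) window of the output tuples.

    The output starts one function delay after the start of input 0, and
    remains valid until one function delay after the earliest input end time
    (capped at max_sim_time before the delay is added).
    """
    start = input_windows[0][0] + function_delay
    minimum_end = min([max_sim_time] + [end for _, end in input_windows])
    return start, minimum_end + function_delay
```

For example, two inputs valid over `(0, 10)` and `(0, 7)` with a function delay of 3 give an output window of `(3, 10)`: the output can only be trusted until the earliest input changes, shifted by the delay of the object.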
Appendix C
The MAP Test Programs
{{{ Instruction Test code
{{{ Program - Load Test
-- instruction format <opcode,Rx,Ry,Rz,condflg,timestampflg>
-- remember to initialise reg[i] = i
instr[0] :=[ld,0,0,1,false,true]
instr[1] :=[ld,0,0,2,false,true]
instr[2] :=[ld,0,0,3,false,true]
instr[3] :=[ld,0,0,4,false,true]
instr[4] :=[ld,0,0,5,false,true]
instr[5] :=[ld,0,0,6,false,true]
instr[6] :=[ld,0,0,7,false,true]
instr[7] :=[time,1,2,2,false,false]
instr[8] :=[jmp,8,0,0,false,true]
}}}
{{{ Program - Store Test
-- instruction format <opcode,Rx,Ry,Rz,condflg,timestampflg>
-- remember to initialise reg[i] = i
instr[0] :=[st,0,1,1,false,true]
instr[1] :=[st,2,0,2,false,true]
instr[2] :=[st,0,3,3,false,true]
instr[3] :=[st,4,0,4,false,true]
instr[4] :=[st,0,5,5,false,true]
instr[5] :=[st,6,0,6,false,true]
instr[6] :=[st,0,0,7,false,true]
instr[7] :=[time,1,2,2,false,false]
instr[8] :=[jmp,8,0,0,false,true]
}}}
{{{ Program - Alu Test
-- instruction format <opcode,Rx,Ry,Rz,condflg,timestampflg>
-- remember to initialise reg[i] = i
instr[0] :=[add,0,0,1,false,true]
instr[1] :=[add,0,0,2,false,true]
instr[2] :=[add,0,0,3,false,true]
instr[3] :=[add,0,0,4,false,true]
instr[4] :=[add,0,0,5,false,true]
instr[5] :=[add,0,0,6,false,true]
instr[6] :=[add,0,0,7,false,true]
instr[7] :=[time,1,2,2,false,false]
instr[8] :=[jmp,8,0,0,false,true]
}}}
{{{ Program - Hennessy Test
-- instruction format <opcode,Rx,Ry,Rz,condflg,timestampflg>
-- x[i] := k + x[j]; x addr in R0, (1,R1),(i,R2),(j,R3),(k,R4),(Xj,R5),(Xi,R7)
instr[0] :=[ld, 0,3,5,false,true]
instr[1] :=[add, 1,3,3,false,true]
instr[2] :=[add, 5,4,7,false,true]
instr[3] :=[st, 0,2,7,false,true]
instr[4] :=[add, 1,2,2,false,true]
instr[5] :=[time,0,0,0,false,true]
instr[6] :=[jmp, 0,6,0,false,true]
}}}
}}}
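As an illustration of the data dependencies present in the Hennessy Test, the sketch below lists its read-after-write pairs. It rests on an assumption inferred from the register comment in the listing, not stated in the source: that Rx and Ry are source registers and Rz is the destination, except for `st`, which reads all three. The `time` and `jmp` instructions are left out because their operand semantics are not given. All function names are hypothetical.

```python
# First five instructions of the Hennessy Test as (opcode, Rx, Ry, Rz).
HENNESSY = [
    ("ld", 0, 3, 5),
    ("add", 1, 3, 3),
    ("add", 5, 4, 7),
    ("st", 0, 2, 7),
    ("add", 1, 2, 2),
]


def reads_writes(instr):
    """Return (registers read, registers written) under the stated assumption."""
    opcode, rx, ry, rz = instr
    if opcode == "st":
        return {rx, ry, rz}, set()   # a store reads its data register Rz
    return {rx, ry}, {rz}


def raw_dependencies(program):
    """List (producer index, consumer index, register) read-after-write pairs."""
    deps = []
    last_writer = {}
    for index, instr in enumerate(program):
        reads, writes = reads_writes(instr)
        for reg in sorted(reads):
            if reg in last_writer:
                deps.append((last_writer[reg], index, reg))
        for reg in writes:
            last_writer[reg] = index
    return deps
```

Under these assumptions `raw_dependencies(HENNESSY)` reports that `instr[2]` waits on the load into R5 and `instr[3]` waits on the sum in R7, while the two index updates (`instr[1]` and `instr[4]`) are free to overlap with them.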
Appendix D
Published Papers
The copyright on each of the following papers has been transferred to the El-
sevier Science Publishers and the IEEE Computer Society Press (as indicated),
which have granted to the authors the right to republish without specific per-
mission.
D.1 Instruction-level Parallelism in Asynchronous Pro-
cessor Architectures
Title: Instruction-level parallelism in asynchronous processor
architectures.
Authors: D. K. Arvind and V. E. F. Rebello.
Presented at: The 3rd International Workshop on Algorithms and Parallel
VLSI Architectures.
Place: Leuven, Belgium.
Date: 29th – 31st August 1994.
Publisher: Elsevier Science Publishers.
INSTRUCTION-LEVEL PARALLELISM IN ASYNCHRONOUS
PROCESSOR ARCHITECTURES
D. K. ARVIND and V. E. F. REBELLO
Department of Computer Science, The University of Edinburgh
Mayfield Road, Edinburgh EH9 3JZ, Scotland, U. K.
{dka,vefr}@dcs.ed.ac.uk

Published in the Proceedings of the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, pp 203-215, Leuven, Belgium, August 1994. © Elsevier Science Publishers.

ABSTRACT. The Micronet-based Asynchronous Processor (MAP) is a family of processor architectures based on the micronet model of asynchronous control. Micronets distribute the control amongst the functional units which enables the exploitation of fine-grained concurrency, both between and within program instructions. This paper introduces the micronet model and evaluates the performance of micronet-based datapaths using behavioural simulations.

KEYWORDS. Instruction-level parallelism (ILP), asynchronous processor architecture, self-timed design.

1 INTRODUCTION
Centralised controls have been traditionally used to correctly sequence information within processor architectures. However, the ability to sustain this design style is under pressure from a number of directions [6]. This paper examines the effect of relaxing this strict synchrony on the design and performance of processor architectures. The reasons are the following. The clock frequency of a synchronous processor is determined a priori by the speed of its slowest component (which takes into account worst-case timings for execution and propagation for pessimistic operating conditions). In contrast, the performance of an asynchronous processor is determined by actual operational timing characteristics of individual components (effectively the average delays), and overheads due to asynchronous controls. Secondly, an important consequence of asynchronous controls is the ability to exploit fine-grained Instruction-level Parallelism (ILP), and this is explored in greater detail in the rest of this paper.

ILP can be achieved either by issuing several independent instructions per cycle as in superscalar or VLIW architectures, or by issuing an instruction every cycle as in a pipelined architecture where the cycle time is shorter than the critical path of the individual operations [5]. This work concentrates on the design and evaluation of asynchronous pipelines for exploiting ILP, as a number of control issues resulting from data and structural dependencies between instructions have to be resolved efficiently.

A few asynchronous processors have recently been proposed [3, 8, 9]. These designs are based on a single micropipeline datapath [10]. One disadvantage of viewing a datapath as a linear sequence of stages is that, in general, only one of the functional units will be active in any cycle. Pipelining the functional units themselves is expensive both in terms of additional hardware and the resulting increase in latency.

We introduce an alternative model for an asynchronous datapath called a micronet. This is a network of elastic pipelines in which individual stages of the pipelines have concurrent operations, and stages of different pipelines can communicate with each other asynchronously. This allows for a greater degree of fine-grained concurrency to be exploited, which would otherwise be quite expensive to achieve in an equivalent synchronous datapath.

2 MICRONETS AND ASYNCHRONOUS ARCHITECTURES
Micronets are a generalisation of Sutherland's micropipeline [10], which dynamically control which stages communicate with each other. Thus micronets can be viewed not just as a pipeline but rather as a network of communicating stages. The operations of each of the stages are further exposed in the form of microagents which operate concurrently and communicate asynchronously with microagents in other stages. Each program instruction spends time only in the relevant stages and for just as long as is necessary. This is in contrast with synchronous datapaths in which the centralised control forces each instruction to go through all the stages, regardless of the need to do so (in effect a single pipeline).
Furthermore, the microagents within a stage might operate on different program instructions concurrently.

Micronets are controlled at two levels: the data transfer between microagents is controlled locally, whereas the type of operation carried out by a microagent (called a microoperation) and the destination of its result is controlled by the sequencer or by other microagents. Microagents can communicate either across dedicated lines or via shared buses where arbitration is provided either by the sequencer or some other decentralised mechanism such as a token ring.

Data dependencies in synchronous pipelines are resolved by using either hardware or software interlocks [4], which increases the complexity of the controls. Micronets use their handshaking mechanisms together with simple register locking to achieve the same effect, but with trivial hardware overheads. In synchronous designs the structural hazards are normally avoided in hardware by using a scoreboarding mechanism. In micronets this is provided by existing handshaking protocols. Out-of-order instruction completion can be supported in synchronous designs, but at a non-trivial cost. Micronets are able to relax the strict ordering of instruction completions and thereby further exploit ILP. The result is to effectively increase the utilisation of the functional units by reducing their idle times or stalls. Better program performances can be achieved by exploiting both ILP and actual instruction execution times.

2.1 Asynchronous Architectures
Figures 1-3 illustrate micronet models of a generic asynchronous RISC datapath. The intention is not to focus on the functional units themselves but rather on their asynchronous control and investigate their effect on the performance. The number of units and their functionality may be changed without side-effects.

The architecture can be described as a network of microagents (denoted by solid boxes) which are connected via ports. The microagents which are labelled in the figures, called Functional Microagents (FMs), perform microoperations which are typical of a datapath. On each of their ports are Communicating Microagents (CMs) which are responsible for asynchronous communications between FMs and the rest of the micronet. The FMs are effectively isolated and only communicate through their CMs, and can therefore be modified without affecting the rest of the micronet.

2.2 Measuring Performance
We next introduce a few metrics for measuring improvements due to the distribution of control. There are two principal characteristics which affect performance - the microoperation latency (the time between initiating the operation and the result being available), and the microoperation cycle time (the minimum time between successive initiations of the same operation, i.e. throughput). The metrics defined for MAP are as follows:
Minimum Datapath Latency (MDL) - The time between asserting the control signals (i.e. initiating instruction issue) and receiving the final acknowledgement of the instruction's completion.
Instruction Cycle Time (ICT) - The time between two identical instruction issues once that instruction's pipeline is full.
In asynchronous pipelines which usually have non-uniform stage delays, the time between successive instruction issues is influenced by the slowest stage currently active in the pipe.
Program Execution Time (PET) - The actual execution time of the program.
A more detailed exposition of performance-related issues is presented in [1].

To study the effectiveness of the micronets, it is sufficient to focus on the LD, ST, and ALU instructions. Five simple test programs were devised to exercise the design. The Alu, Load and Store test programs measure the maximum attainable utilisation of their respective FMs. Each of these programs contains a number of identical instructions, such that only structural dependencies exist between instructions (in effect setting up a static pipeline or a fixed path through a network of components). The number of instructions in the test programs is sufficient to fill the pipeline, i.e. enough instructions exist for the Control Unit (CU) to achieve a steady issue rate. The Hennessy Test (HT1) consists of a mix of the previously-mentioned instructions but without any data dependencies, which
[Figure 1: The micronet model of Stages 1 & 2. Panels: Micronet Datapath for Stage 1 & Stage 2; Timing Diagram for Stage 1; Timing Diagram for Stage 2 (each showing the Load and ALU instruction issues and their Instruction Cycle Times). Legend: Communicating Microagents (CMs); data flow between microagents; control acknowledgement signal; Functional Microagents (FMs), which include the Reg. Bank and the functional units, ALU & MU. Signal labels: ZMs Ack; Rx, Ry & Rz Acks; MUs Ack; Rof Ack; AUs Ack.]

AUs - This signal identifies the next operation of the ALU. The corresponding acknowledgement is asserted when the interface is ready to fetch the ALU's operands from the registers and is cleared when it initiates the write-back handshake.
MUs - This signal identifies a load instruction to the MU and is asserted and cleared in the same manner as above. (Control signals for the other MU microoperations have been omitted for the sake of clarity.)
ZMs - This signal identifies the destination register for data write-backs from the ALU or MU via the Z bus. The corresponding acknowledgement signal is asserted when the register is ready to receive data and cleared once the data has been written back.

In Stage 1, all the microoperations for a particular instruction are initiated together, and the next set cannot be initiated until the completion of the set of microoperations of the previous one. This effectively serialises the instruction execution, as illustrated in the timing diagram in Figure 1. In successive refinements the rôle of the CU is diminished by distributing the control of the micronet to local interfaces and microoperations are individually initiated as early as possible.
Instruction   Inst. Cycle Time (ICT)   Datapath Latency (MDL)
ALU           24 ns                    24 ns
LD            43 ns                    43 ns
ST            23 ns                    21 ns
Table 1: Instruction Execution on Stage 1

In the base stage, the ICT is determined by the slowest control signal handshake since the next instruction issue cannot begin until all the previous handshakes have been completed. The results in Table 1 show that the ICT is equal to the MDL (except for the ST instruction), which is not surprising as instructions execute sequentially but only take as long as is necessary. The higher value for the ST instruction is due to a handshake delay, which in the LD instruction is hidden by the write-back stage. Although there is no explicit pipelining of the datapath, different phases of the handshaking may occur at the same time, e.g. a CM may initiate a handshake with another CM while completing one with its FM. As was expected the execution times of the test programs (Table 5) are the sum of their individual instruction execution times together with startup overheads.

3.2 Stage 2
The strict condition which was employed in Stage 1 for initiating a set of microoperations after the completion of the previous set is now relaxed. Furthermore, the CU can now assert any of the individual microoperations for an instruction asynchronously, where previously the set of microoperations for an instruction were initiated in unison. This allows microoperations relating to different instructions to overlap (Stage 2 in Figure 1). Note that a control signal which is related to an instruction can only be de-asserted once all of the relevant control signals have been acknowledged. The effect of relaxing this constraint is to introduce possible hazards and efficient mechanisms have been devised to avoid them. Fortunately, these hazard avoidance mechanisms are implicit in the orderings of the assertions of the control signals, known as the pre-issue conditions and these are discussed below:
Read-after-Write (RAW) - A register locking mechanism is implemented in the register bank without the CU having to keep track of the "locked" registers. The acknowledgement signal ZMs is asserted after the locking operation, and is de-asserted once the data is written back (signaling the unlocking of the register). By definition an instruction is issued once all the acknowledgements of the relevant microoperations have been received. This implies that the destination register of the previous instruction will have been locked before the CU initiates any of the current instruction's microoperations.
Write-after-Read (WAR) - This hazard is avoided without additional hardware overheads. When a register is used as both source and destination within the same instruction, then it is necessary to ensure that the source data is obtained before the register is locked, otherwise deadlock will occur. The CU stalls the assertion of ZMs until the source operand control signals Rx and Ry have been asserted.
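The two pre-issue conditions can be sketched in software as follows. This is only a behavioural illustration, not the handshake circuitry: in the micronet a read of a locked register is held up by a withheld acknowledgement, whereas here it is modelled by returning `None`. The class and function names are hypothetical.

```python
class RegisterBank:
    """Hypothetical register bank with one lock bit per register."""

    def __init__(self, size=8):
        self.values = list(range(size))   # reg[i] = i, as in the test programs
        self.locked = [False] * size

    def try_read(self, reg):
        # RAW: a locked register has a write-back outstanding, so the read waits.
        return None if self.locked[reg] else self.values[reg]

    def lock(self, reg):
        self.locked[reg] = True

    def write_back(self, reg, value):
        # Writing the result back also unlocks the destination register.
        self.values[reg] = value
        self.locked[reg] = False


def issue(bank, rx, ry, rz):
    """Fetch both sources before locking the destination (WAR condition)."""
    x, y = bank.try_read(rx), bank.try_read(ry)
    if x is None or y is None:
        return None        # stall: a source is still locked by an earlier instruction
    bank.lock(rz)          # lock only after the sources have been obtained
    return x, y
```

Because the sources are read before the lock is taken, an instruction such as `issue(bank, 1, 2, 1)`, which uses R1 as both source and destination, does not deadlock on its own lock. A following instruction that reads R1 stalls until `write_back(1, ...)` is called.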
[Figure 2: The micronet model of Stage 3. Includes a timing diagram showing the Load Instruction Cycle Time and the ALU ICT. Signal labels: Rof Ack; Rx, Ry, Rw & ZMs Acks.]

Enforcing write-backs in order restricts the degree of concurrency which can be exploited, especially when the FU execution times vary significantly. However, supporting out-of-order completion of instructions in an asynchronous environment is more difficult than under synchronous control. Determining the precise order in which results will be available is virtually impossible since microoperation delays vary.

Out-of-order instruction completion is supported by tagging the write-back data with the address of the destination register. The CU cannot predict the write-back order, therefore a decentralised bus arbitration scheme as in a token ring is employed. The ring is distributed amongst the CMs and is very simple to implement in VLSI. However, the ring's cycle time will increase with the number of FMs, and might be infeasible for larger numbers.

With data transfer on the Z bus being tagged, CMs can identify and intercept operands for which they may be waiting. This mechanism is reminiscent of the IBM 360/91 common bus architecture [12]. Data-forwarding has been implemented by exploiting the feedback loops of the micronet. In the event of data forwarding, where data is routed directly to the CM of a waiting FM, the CM's previous request for that operand is in effect cancelled by initiating a separate handshake. This frees the corresponding "operand fetch" CM to service its next request. An alternative approach would be to implement operand bypassing, where the operand is fed back to the "operand fetch" microoperation. This avoids the need for data forwarding CMs and the cancel handshake. The dual rôle of the Z bus can no longer be supported due to the data-forwarding mechanism. A separate operand fetch bus (W bus) is used, thereby making the Z bus purely a write-back one (see Figure 2).

As a result of these modifications, the acknowledgements to the control signals and the pre-issue conditions have to be revised as shown below:
Rx, (Ry, Rw) - The acknowledgement is asserted by the CM of the register bank when the X (Y, W) bus operand fetch microoperation is ready, and de-asserted once the operand fetch handshake is in progress.
Rof - Same as above. Note that both the control signals Rx and Rof cannot be active

























































[Figure 3 diagram not reproduced; surviving signal labels: AUs Ack, Rof Ack, MUs Ack.]
Figure 3: The micronet model of Stage 4

Instruction   Inst. Cycle Time (ICT)   Datapath Latency (MDPL)
ALU           12 ns                    24 ns
LD            23 ns                    43 ns
ST            12 ns                    21 ns
Table 4: Instruction Execution on Stage 4

actual operational cost and effectively hide the overheads of self-timed design. The ICTs for the ALU and ST instructions are limited by their operand fetch cycle times. The overall improvements in the program execution times in Stage 4 over Stage 1 for the first three test programs (shown in Table 5 and Figure 4) are due to improvements in temporal concurrency due to the pipelining of the datapath. The actual speedup which is achieved is less than the maximum attainable improvement (the ratio of the ICTs in Tables 1 and 4), due to the MDL and the startup overheads (for longer test programs the speed-up will approach this maximum value). The speed-up for HT1 is due in part to pipelining of the instructions as observed in the other test programs, but also due to additional spatial concurrency due to the overlapping of different instructions in the same stage of the micronet. This further improvement is still significant (approximately 17% in this example) given that both successive instruction operand fetches and write-backs are effectively forced to take place sequentially due to resource constraints. (In fact, since these delays are larger than the FM delays for the Store and ALU operations, the scope for spatial concurrency in this particular example is quite small.) As the number of microagents in each stage is increased, the spatial concurrency effect will be more pronounced. The speed-up for HT2 as expected reflects the reduced concurrency which can be exploited, due to the data dependencies in the program.

PET                  Alu Test   Load Test   Store Test   HT1      HT2      HT2(DF)
Stage 1              175 ns     308 ns      164 ns       143 ns   143 ns   -
Stage 2              157 ns     302 ns      165 ns       119 ns   119 ns   -
Stage 3              121 ns     280 ns      165 ns       83 ns    97 ns    91 ns
Stage 4              103 ns     188 ns      98 ns        79 ns    -        91 ns
Effective Speed Up   1.75       1.66        1.71         1.89     -        1.62
Table 5: Execution Times of the Test Programs

[Figure 4 chart not reproduced.]
Figure 4: Comparison of Execution Times of the Test Programs

In summary, the rôle of the CU in an asynchronous processor has been considerably simplified to just initiating individual microoperations as early as possible. The control of the datapath is distributed to local interfaces, courtesy of the micronet.

4 CONCLUSIONS

This work has investigated the influence of an asynchronous control paradigm on the design and performance of processor architectures. By viewing the datapath as a network of microagents which communicate asynchronously, one can extract fine-grain concurrency between and within instructions. The micronet can be easily implemented using simple self-timed elements such as Muller C-elements [7] and conventional gates. Future work will investigate the suitability of asynchronous processors as targets for optimising compilers.
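
To make the remark about Muller C-elements concrete, the following is a minimal behavioural sketch of a two-input C-element: the output follows the inputs when they agree and holds its previous value when they differ. It is an illustration only, not a description of the circuits used in this work; the class and method names are hypothetical.

```python
class MullerCElement:
    """Behavioural model of a two-input Muller C-element (illustrative only)."""

    def __init__(self, initial=0):
        # The element is state-holding, so it needs an initial output value.
        self.out = initial

    def update(self, a, b):
        """Apply new input levels and return the resulting output level."""
        if a == b:
            # Both inputs agree: the output copies them.
            self.out = a
        # Inputs disagree: the output keeps its previous value.
        return self.out


c = MullerCElement()
trace = [c.update(a, b) for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]]
```

In this trace the output rises only once both inputs are high and falls only once both are low, which is the property that makes the element useful for joining handshake signals.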
Acknowledgements

V. Rebello was supported by the U. K. Engineering and Physical Sciences Research Council (EPSRC) through a postgraduate studentship. This work was partially supported by a grant from EPSRC entitled Formal Infusion of Communication and Concurrency into Programs and Systems (Grant Number GR/G55457).

References

[1] D. K. Arvind and V. E. F. Rebello. On the performance evaluation of asynchronous processor architectures. In P. Dowd and E. Gelenbe, editors, Proceedings of the 3rd International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'95), pages 100–105, Durham, NC, USA, January 1995. IEEE Computer Society Press.
[2] European Silicon Structures Limited. Solo 1400 Reference Manual. ES2 Publications Unit, Bracknell, U.K., 1990.
[3] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, and J. V. Woods. A micropipelined ARM. In T. Yanagawa and P. A. Ivey, editors, The Proceedings of the IFIP International Conference on Very Large Scale Integration (VLSI'93), pages 5.4.1–5.4.10, Grenoble, France, September 1993.
[4] J. Hennessy and T. Gross. Postpass code optimisation of pipeline constraints. ACM Transactions on Programming Languages and Systems, 5(3):422–448, July 1983.
[5] N. P. Jouppi and D. W. Wall. Available instruction-level parallelism for superscalar and superpipelined machines. In The Proceedings of ASPLOS III, pages 272–282. ACM Press, April 1989.
[6] C. Mead and L. Conway. Introduction to VLSI Systems. Addison-Wesley, Reading, Mass., 1980.
[7] R. E. Miller. Switching Theory. Volume II: Sequential Circuits and Machines. John Wiley and Sons, 1965.
[8] W. F. Richardson and E. L. Brunvand. The NSR processor prototype. Technical Report UUCS-92-029, Department of Computer Science, University of Utah, USA, 1992.
[9] R. F. Sproull, I. E. Sutherland, and C. E. Molnar. Counterflow pipeline processor architecture. Technical Report SMLI TR-94-25, Sun Microsystems Laboratories Inc., April 1994.
[10] I. E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, June 1989.
[11] J. E. Thornton. Design of a Computer: The Control Data 6600. Scott Foresman and Company, 1970.
[12] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. IBM Journal of Research and Development, 11(1):25–33, January 1967.
Appendix D. Published Papers 211
D.2 On the Performance Evaluation of Asynchronous
Processor Architectures
Title: On the performance evaluation of asynchronous processor
architectures.
Authors: D. K. Arvind and V. E. F. Rebello.
Presented at: The 3rd International Workshop on Modeling, Analysis
and Simulation of Computer and Telecommunication
Systems (MASCOTS’95).
Place: Durham, NC, USA.
Date: 18th – 20th January 1995.
Publisher: IEEE Computer Society Press.








































































































(c) An Asynchronous Pipeline - exploiting spatial parallelism as well
Figure 1: Synchronous and Asynchronous Pipelines

Published in the Proceedings of the 3rd International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'95), pp 100-105, Durham, NC, USA, January 1995. © IEEE Computer Society Press.

The clock period of a synchronous pipeline is determined by the delay of the slowest stage, which takes into account worst-case timings for execution and propagation. Furthermore, optimal performance for a pipeline is achieved when all the stages are balanced. This is quite difficult to achieve in practice, since the stages of a typical pipeline perform different operations, and often their delays are data-dependent. Figure 1(a) illustrates the operation of such a datapath in which synchronisation overheads have been omitted for the sake of brevity. This imbalance between stage delays results in idle periods leading to poor utilisation of the physical resources. Of course, further pipelining of the slower stages could reduce this at the cost of increased design complexity and synchronisation overheads.

In contrast, the performance of an asynchronous pipeline is determined by the actual delays of individual stages (usually the average delays), and overheads due to self-timing protocols (which have been omitted in Figure 1(b), but have been included in the models). This pipeline only exploits temporal parallelism as before, but does so more efficiently. We make some further observations about the stages in a synchronous datapath. Firstly, not all instructions require the services of all the stages. Secondly, although each stage may consist of different resources, only one of them will be active at any time for a given instruction. Figure 1(c) illustrates an asynchronous pipeline which exploits spatial parallelism within some of the stages. Successive instructions which utilise different resources within a stage are now able to execute concurrently. In the simple example under consideration in Figure 1(c), the execute stage has two concurrently-operating resources. It is possible for the instructions to share resources in any of the stages. For example, while an instruction is stalled waiting for an operand on one bus, another instruction could use the other buses to fetch its operands. The amount of spatial parallelism which can be exploited in practice is determined by the relative delays of the functional units in the datapath (see Section 4.2 for more details). The next section briefly describes micronets which can be used to model asynchronous datapaths.

3 Micronets

Micronets can be viewed as a generalisation of Sutherland's micropipelines [15]. A micronet is described as a network of elastic pipelines in which individual stages of the pipelines have concurrent operations, and stages of different pipelines can communicate with each other asynchronously. The operations of a micronet stage can be exposed as fine-grained microagents. This should not be confused with further pipelining of each of the stages. In fact microagents within each stage operate concurrently and can communicate asynchronously with microagents of any of the other stages.
A microagent fires when the set of inputs determined by the control signals is valid, and generates a set of outputs. Each program instruction spends time only in the relevant stages and for just as long as is necessary. Furthermore, the different microagents within a stage which belong to different program instructions operate concurrently.

Synchronous datapaths require either software or hardware interlock mechanisms to resolve data dependencies [8], and scoreboards to avoid structural hazards. However, a micronet-based datapath uses existing handshaking mechanisms and register locking to attain the same effect. Out-of-order instruction completions can be easily achieved, thereby further exploiting ILP in the programs. In the following section the performance evaluation of a micronet-based asynchronous processor is presented.
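
The contrast drawn in Section 2 between a clocked pipeline and a self-timed one can be illustrated with a small timing sketch. It is a simplification under stated assumptions, not the simulation model used in this paper: handshake overheads are ignored, instructions pass through every stage in order, a stage is assumed to become free as soon as its own operation finishes, and the clock period is taken as the largest delay in the example rather than a true worst case. The function names and the delay values are hypothetical.

```python
def synchronous_time(delays):
    """Total time when every stage is clocked at the slowest delay observed.

    delays[i][s] is the actual delay (ns) of instruction i in stage s.
    """
    n_instr, n_stages = len(delays), len(delays[0])
    clock = max(max(row) for row in delays)
    # Classic pipeline fill-and-drain count of clock periods.
    return clock * (n_instr + n_stages - 1)


def asynchronous_time(delays):
    """Total time when each stage takes only its actual delay (no overheads)."""
    n_stages = len(delays[0])
    free_at = [0] * n_stages          # time at which each stage is next free
    finish = 0
    for row in delays:
        t = 0                         # time at which this instruction is ready
        for s, d in enumerate(row):
            start = max(t, free_at[s])  # wait for the data and for the stage
            t = start + d
            free_at[s] = t
        finish = max(finish, t)
    return finish


# Hypothetical three-stage delays (ns) for three instructions.
example = [[10, 4, 6], [10, 23, 6], [10, 9, 6]]
sync_ns = synchronous_time(example)    # 23 ns clock, 5 periods
async_ns = asynchronous_time(example)
```

For these made-up delays the clocked pipeline needs 115 ns while the self-timed one needs 58 ns, because only the one slow operation pays for its own delay.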
4 Performance Evaluation of MAP

A MAP architecture can be viewed as an ensemble of heterogeneous functional units which operate concurrently and communicate with each other asynchronously. We wish to accurately measure the performance of programs on such an architecture, and to observe the effects of architectural changes. For our purposes the architecture is modelled at the register-transfer level in the Occam2 language [9], with accurate timing delays of the functional units being provided by SPICE-level simulations of their VLSI implementations. Occam2 is based on the process model view of computing in which a system can be described as a collection of concurrent processes which communicate with each other asynchronously through channels. The simulation platform is a transputer-based MEiKO Computing Surface [10]. The underlying timekeeping mechanism is based on a parallel asynchronous simulation algorithm [2], which efficiently simulates the class of architectures under investigation.

4.1 The MAP Datapath

The datapath can be described as a network of microagents (denoted by solid boxes) which are connected via ports as illustrated in Figures 2 and 3. The Functional Microagents (FMs) perform microoperations which are typical of a datapath. On each port of a FM is a Communicating Microagent (CM) which is responsible for communications among the FMs, and with the Control Unit (CU). The FMs are effectively isolated and only communicate through their CMs, and can therefore be modified without affecting the rest of the micronet.

The processor design as illustrated in Figure 2 only exploits the actual execution times of microoperations (MAP 1), whereas the design as shown in Figure 3 exploits both this property and concurrency between the microoperations of different instructions (MAP 2).
In both cases, each microoperation is initiated by four-phased control signals from the CU, whose acknowledgements are used as status flags for mimicking a scoreboard.

4.1.1 Instruction Issue and Data Transfer

In MAP 1, all the microoperations for an instruction are initiated in unison, with the next set waiting until the completion of the previous one. The start of a microoperation is acknowledged, which results in the de-assertion of the initiating control signal. The subsequent instruction can only be issued once the previous set of control signals have all been acknowledged, which effectively serialises the instruction execution. In MAP 2, the CU initiates the microoperations individually for the current instruction as early as possible via the corresponding CMs. The receipt of the acknowledgement only confirms that the CMs will initiate the corresponding microoperation. This allows microoperations relating to different instructions to overlap. Hazard avoidance is implicit in the orderings of the assertions of the control signals [1]. The rôle of the CMs has been enhanced to effectively buffer the initiations of the microoperations from the CU until the respective FMs are ready to perform. The writing back to the register bank is no longer controlled by the CU, but directly by the CMs of the FMs which require the service. These features help to exploit finer-grained concurrency between instructions than previously possible. In MAP 2, out-of-order instruction completion (due to different execution delays in the FMs) and data-forwarding are also supported [1].

[Figure 2 diagram not reproduced. Legend: Functional Microagents (FMs) - includes the Reg. Bank and the functional units, ALU & MU; Communicating Microagents (CMs); Control Acknowledgement Signal (control signals flow in the opposite direction). Surviving signal labels: ZMs Ack; Rx, Ry & Rz Acks; MUs Ack; Rof Ack; AUs Ack.]

[Figure 3 diagram not reproduced; surviving labels: AUs Ack, Rof Ack, MUs Ack & ZMs Acks, Token Ring.]
Figure 3: The micronet model of MAP 2

In the next section the effect of these features on the performance of simple programs is investigated by simulating the micronet model in a parallel discrete event simulation environment which was briefly described earlier.

4.2 Performance Results

The performance evaluation of asynchronous pipelines is non-trivial since the stage delays are non-uniform, and variable due to data dependencies. The interaction between successive instructions which leads to spatial and temporal concurrency is difficult to evaluate accurately through analytical methods. The two principal attributes which affect the performance of programs in asynchronous pipelines are the latency of the relevant microagents, which is defined as the time between initiating the microoperation and the result being available, and their cycle time, which is the minimum time between successive initiations of the same microoperation (i.e. the reciprocal of the throughput). (They are the same in a synchronous pipeline, with the cycle time being determined by the slowest latency.) The difference between the two values can be viewed as the overhead due to asynchronous protocols and a good design should endeavour to minimise it. This is achieved in micronets by overlapping the phases of the communication protocol in CMs with useful operations in the FMs, thus hiding the overhead. The effectiveness of this method can be determined by measuring the utilisation of FMs by exercising them with test programs composed of appropriate, identical instructions. A few metrics are now introduced for gauging the performance of micronet datapaths.

Minimum Datapath Latency (MDL) - The time between asserting the control signals (i.e. initiating an instruction issue) and receiving the final acknowledgement of the instruction's completion.

Instruction Cycle Time (ICT) - The time between two identical instruction issues once that instruction's pipeline is full. In asynchronous pipelines, which usually have non-uniform stage delays, the time between successive instruction issues is influenced by the slowest stage currently active in the pipe.

Program Execution Time (PET) - The actual execution time of the program.

ALU Utilisation - The percentage of the program execution time for which the ALU performs useful computation.

MU Utilisation - Same as above, but for the Memory Unit (MU).

Maximum FM Utilisation (MFU) - The upper bound on the FM utilisation is the ratio of the FM's microoperation latency to the ICT.
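
The MFU definition above amounts to a one-line calculation, sketched here for the LD instruction. The ICT values are those reported in Table 1; the 23 ns MU latency is an assumption read off the later remark that the MAP 2 ICT for LD represents the MU delay. The function name is hypothetical.

```python
def max_fm_utilisation(fm_latency_ns, ict_ns):
    """Upper bound on FM utilisation (%): microoperation latency over ICT."""
    return 100.0 * fm_latency_ns / ict_ns


MU_LATENCY_NS = 23  # assumed MU delay for a load, inferred from the text

mfu_ld_map1 = round(max_fm_utilisation(MU_LATENCY_NS, 43), 1)  # ICT 43 ns
mfu_ld_map2 = round(max_fm_utilisation(MU_LATENCY_NS, 23), 1)  # ICT 23 ns
```

Under that assumption the two results agree with the LD entries for MFU in Table 1 (53.5% and 100%).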
        MAP 1                     MAP 2
Inst    ICT     MDL     MFU       ICT     MDL     MFU
ALU     24 ns   24 ns   16.7%     12 ns   24 ns   33.3%
LD      43 ns   43 ns   53.5%     23 ns   43 ns   100%
ST      23 ns   21 ns   42.9%     12 ns   21 ns   75%
Table 1: Instruction Execution

Test Pgs   ATP      LTP      STP      HT1 & 2
PET        168 ns   301 ns   159 ns   136 ns
ALU Util   16.6%    0%       0%       8.4%
MU Util    0%       53.3%    39.9%    22.4%
Table 2: Execution of Test Programs on MAP 1

The Alu, Load and Store test programs (ATP, LTP, STP) measure the maximum attainable utilisation of their respective FMs. Each contains repetitions of either ALU, LOAD or STORE instructions, such that only structural dependencies exist between instructions (in effect setting up a static pipeline or a fixed path through a network of components). The number of instructions in the test programs is sufficient to fill the pipeline, i.e. enough instructions exist to allow the CU to achieve a steady issue rate. The Hennessy Test (HT1) consists of a mix of the previously-mentioned instructions without any data dependencies, which exercises the spatial concurrency and out-of-order completion, for a particular schedule devised by the compiler. HT2 is a variant of HT1 with data dependencies, which exercises the data forwarding mechanism as well.

The functional units were implemented in a 1.5 µm CMOS process. The timing characteristics were extracted from a post-layout simulation tool within a commercial VLSI design package called SOLO 1400 [5] and incorporated into the Occam2 model.

In MAP 1, the ICT value for each instruction is determined by the slowest microagent control signal handshake required by that instruction, since the next instruction issue cannot begin until all the previous handshakes have been completed. The results in Table 1 show that the ICT is equal to the MDL (except for the ST instruction), which is not surprising as instructions execute sequentially but only take as long as is necessary. The higher value for the ST instruction is due to a handshake delay, which in the case of the LD instruction is hidden by the write-back stage. Although there is no explicit pipelining of the datapath, different phases of the handshaking may occur at the same time, e.g. a CM may initiate a handshake with another CM while completing one with its FM.

Test Pgs   ATP     LTP      STP     HT1     HT2
PET        96 ns   181 ns   93 ns   72 ns   84 ns
Spd Up     1.75    1.66     1.71    1.89    1.62
A Util     28.9%   0%       0%      16.4%   14.1%
M Util     0%      88.5%    67.7%   43.8%   37.7%
Table 3: Execution of Test Programs on MAP 2

Also in Table 1, the maximum FM utilisations represent the proportion of the MDL taken by the FM to complete its operation. As expected, the execution times of the test programs in Table 2 are the sum of their individual instruction execution times. We observe that the utilisations achieved for the FMs (in Table 2) are very close to their upper bounds (in Table 1), which shows that asynchronous control using a micronet can be efficient.

The ICT figure for the LD instruction in MAP 2 is the best attainable as it represents the MU delay for the operation. The corresponding utilisation figure in Table 3 supports this claim. (Note: these utilisation measurements do not take into account both the initial operand fetch and the final write-back delays, and will therefore never attain the theoretical upper bound.) These figures show that the micronet can exploit the actual operational costs and effectively hide the overheads of self-timed design. The ICTs for the ALU and ST instructions are limited by their operand fetch cycle times, and the utilisations of the FMs in these cases also approach their bounds. This cycle time is due to the communication protocol between the FUs and the register bank. These delays can be reduced by using a less conservative bundling delay [15] and through layout and transistor size optimisation [3].

The improvements in the program execution times (PET) for MAP 2 (shown in Table 3) for the three instruction test programs are due to improvements in temporal concurrency due to asynchronous pipelining of the datapath.
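
The speed-up row of Table 3 can be reproduced directly from the PET rows of Tables 2 and 3, as the short sketch below shows; the ICT ratios from Table 1 are computed alongside for comparison. The variable and function names are illustrative, and the HT1 and HT2 entries for MAP 1 both use the single "HT1 & 2" figure from Table 2.

```python
# Program execution times (ns) as reported in Tables 2 and 3.
pet_map1 = {"ATP": 168, "LTP": 301, "STP": 159, "HT1": 136, "HT2": 136}
pet_map2 = {"ATP": 96, "LTP": 181, "STP": 93, "HT1": 72, "HT2": 84}

# Instruction cycle times (ns) as reported in Table 1.
ict_map1 = {"ALU": 24, "LD": 43, "ST": 23}
ict_map2 = {"ALU": 12, "LD": 23, "ST": 12}


def ratios(before, after):
    """Element-wise before/after ratio, rounded to two decimal places."""
    return {name: round(before[name] / after[name], 2) for name in before}


speed_up = ratios(pet_map1, pet_map2)
ict_ratio = ratios(ict_map1, ict_map2)
```

The computed speed-ups match the "Spd Up" row of Table 3, and each single-instruction speed-up is below the corresponding ICT ratio (1.75 against 2.0 for ALU, 1.66 against 1.87 for LD, 1.71 against 1.92 for ST).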
The actual speedups achieved are less than the ratios of the ICTs for MAP 1 and MAP 2 (shown in Table 1), which are the maximum attainable improvement. The speed-up for HT1 is in part due to the pipelining of the instructions as observed previously in the other test programs, and also due to additional spatial concurrency through overlapping of different instructions in the same stage of the micronet. This further improvement is still significant (approximately 17% in this example) given that successive instruction operand fetches and write-backs are effectively forced to take place sequentially due to resource constraints. (In fact, since these delays are larger than the FM delays for the Store and ALU operations, the scope for spatial concurrency in this particular example is quite small.) As the number of microagents in each stage is increased, the spatial concurrency effect will be more pronounced, subject to relative delays of the microagents. The speed-up for HT2 as expected reflects the reduced concurrency which can be exploited, because of data dependencies in the program.

It has to be noted that the datapath latency is unaffected by the exploitation of temporal parallelism, which is generally not the case in a synchronous pipeline.

The interaction between concurrently executing instructions is quite difficult to predict. For example, two instructions which compete for the same resources might acquire them in different order depending on the actual delays, which are themselves data-dependent. This is not in itself a drawback, since one of the instructions is stalled for just as long as is necessary, which would not be true in a synchronous case.

5 Conclusions

The behaviour of asynchronous processors is complex and their performance is difficult to predict. Discrete event simulations as described in this work offer a method for accurately measuring their performance. The model in Occam2 naturally captures the concurrency and asynchronous communication. This also allows the simulation to be parallelised to obtain reasonable run-times for large circuits and test programs. This is aided by the asynchronous nature of the underlying simulation algorithm itself.

To the best of our knowledge this is the first work which has investigated the influence of an asynchronous control paradigm on the performance of processor architectures for exploiting fine-grained ILP. The micronet model allows the exploitation of both temporal and spatial concurrency, which results in efficient utilisation of resources within the datapath.

Acknowledgements

V. Rebello was supported by a postgraduate studentship from the U. K. Engineering and Physical Sciences Research Council (EPSRC). This work was partially supported by a grant from EPSRC entitled Formal Infusion of Communication and Concurrency into Programs and Systems (Grant Number GR/G55457).

References

[1] D. K. Arvind and V. E. F. Rebello. Instruction-level parallelism in asynchronous processor architectures. In M. Moonen and F. Catthoor, editors, Proceedings of the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, pages 203–215, Leuven, Belgium, August 1994. Elsevier Science Publishers.
[2] D. K. Arvind and C. R. Smart. Hierarchical parallel discrete event simulation in composite ELSA. In Proceedings of the Sixth Workshop on Parallel and Distributed Simulation (PADS'92), pages 147–156, January 1992.
[3] S. M. Burns. Performance Analysis and Optimisation of Asynchronous Circuits. PhD thesis, Computer Science Department, California Institute of Technology, Pasadena, California, USA, 1991.
[4] I. David, R. Ginosar, and M. Yoeli. Self-timed architecture of a reduced instruction set computer. In S. Furber and M. Edwards, editors, The Proceedings of the IFIP Working Conference on Asynchronous Design Methodologies, Manchester, UK, March 1993. Elsevier Science Publishers.
[5] European Silicon Structures Limited. Solo 1400 Reference Manual. ES2 Publications Unit, Bracknell, U.K., 1990.
[6] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, and J. V. Woods. A micropipelined ARM. In T. Yanagawa and P. A. Ivey, editors, The Proceedings of the IFIP International Conference on Very Large Scale Integration (VLSI'93), pages 5.4.1–5.4.10, Grenoble, France, September 1993.
[7] S. Hauck. Asynchronous design methodologies: An overview. Technical Report TR 93-05-07, Department of Computer Science and Engineering, University of Washington, Seattle, USA, 1993.
[8] J. Hennessy and T. Gross. Postpass code optimisation of pipeline constraints. ACM Transactions on Programming Languages and Systems, 5(3):422–448, July 1983.
[9] INMOS Limited. Occam2 Reference Manual. Prentice Hall International, 1988.
[10] INMOS Limited. Transputer Reference Manual. Prentice Hall International, 1988.
[11] A. J. Martin. Programming in VLSI: From communicating processes to delay-insensitive circuits. Technical Report Caltech-CR-TR-89-1, Department of Computer Science, California Institute of Technology, Pasadena, California, 1989.
[12] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus. The design of an asynchronous microprocessor. In C. L. Seitz, editor, Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference on VLSI, pages 351–373, Cambridge, Mass., 1989. MIT Press.
[13] W. F. Richardson and E. L. Brunvand. The NSR processor prototype. Technical Report UUCS-92-029, Department of Computer Science, University of Utah, USA, 1992.
[14] C. L. Seitz. System Timing. In C. Mead and L. Conway, editors, Introduction to VLSI Systems, chapter 7, pages 218–262. Addison-Wesley, 1980.
[15] I. E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, June 1989.
D.3 A Model for Decentralising Control in
Asynchronous Processor Architectures
Title: Micronets: A model for decentralising control in
asynchronous processor architectures.
Authors: D. K. Arvind, R. D. Mullins and V. E. F. Rebello.
Presented at: The 2nd Working Conference on Asynchronous Design
Methodologies.
Place: London, UK.
Date: 30th – 31st May 1995.
Publisher: IEEE Computer Society Press.
Micronets: A Model for Decentralising Control in Asynchronous Processor Architectures

D. K. Arvind, R. D. Mullins and V. E. F. Rebello
Department of Computer Science, The University of Edinburgh
Edinburgh, EH9 3JZ, United Kingdom
E-mail: dka@dcs.ed.ac.uk

Abstract

Micronets model processor architectures as a network of communicating resources, in contrast to the traditional one of a linear pipeline. Micronets distribute the control to the functional units, which enables the exploitation of fine-grain concurrency between instructions. The overhead due to asynchrony is hidden with the four-phase protocol being used to implement scoreboarding and hazard avoidance mechanisms, without incurring additional control costs. This paper demonstrates the feasibility of micronet-based processors. Results are presented for SPICE-level simulations of a 0.7 µm CMOS implementation of a datapath. The relationships between micronets and both the compiler and the computer architecture are also explored.

1 Introduction

Micropipelines [22] have been used to model linear asynchronous pipelines such as datapaths [6] [18], and two-dimensional pipeline structures [8]. However, viewing a datapath as a single linear pipeline has limitations [2]. A new paradigm called micronets has recently been proposed for the distribution of control in asynchronous processor architectures [1]. Micronets model datapaths as a network of communicating functional units which allows the efficient exploitation of both fine-grained instruction-level parallelism and the actual execution costs of instructions.

The choice of a four-phase communication protocol [19] between the functional units allows the efficient utilisation of these resources, by avoiding the additional control costs (scoreboarding and hazard avoidance mechanisms) normally associated with processors which exploit ILP.

The design of an effective micronet-based system should also consider the interplay between the compiler and the processor architecture, i.e. does a















[Figure 1 diagram not reproduced. Panels: a) Typical resource utilisation in a pipeline; b) Snapshot of typical resource utilisation in a micronet. Surviving labels: I4, I5.]
Figure 1: Contrasting a micropipeline with a micronet

tion of condition branching [9] [16] [17]. However, this technique is unsuitable for asynchronous datapaths because of the difficulty in estimating the time to resolve the branch condition (which is fixed in synchronous architectures). Therefore, the number of instructions which have to be fetched cannot be determined. For micronet-based architectures, the preferred techniques are ones which do not rely on fixed timing for their correct operation, such as branch prediction schemes [12] [20] or advanced branching mechanisms [15]. Precise exception handling and speculative execution are supported through the use of history and write-back buffers [21] [23].

A micronet-based datapath, as illustrated in Figure 2, is composed of a network of microagents (denoted by solid boxes) which are connected via ports. The Functional Microagents (FMs) perform microoperations which are typical of a datapath. On each port of a FM is a Communicating Microagent (CM) which is responsible for communications among the FMs, and with the Control Unit (CU). The FMs are effectively isolated and only communicate through their CMs, and can therefore be modified without affecting the rest of the micronet. The protocol used in the design of a micronet-based datapath is discussed in the following section.
[Figure 2 diagram not reproduced; surviving labels: Adder, MU/, V Bus.]
Figure 2: A micronet-based processor architecture

2.1 Choice of protocol

Both transitions in a generic four-phase protocol (the assertion and the return-to-zero) are accompanied by additional acknowledgements from the receiver. The principal advantage of this approach is a simpler circuit implementation. However, it uses twice as many transitions as is necessary and whenever the wire delay is a substantial fraction of the operation time, the extra trip required by a single communication can be a serious performance penalty. In











[Figure 3 diagram not reproduced. Panels: a) Resource activity in a synchronous pipeline; b) Resource activity in a micropipeline. Surviving label: FM 2.]
Figure 3: Resource activity

previous one completes its execution. Furthermore, a four-phase protocol exposes more concurrency by effectively decoupling the sender's and receiver's operations from their communication [1].

2.1.2 Routing data in micronets

Although the actual data transfer between microagents is controlled locally via handshake protocols, the access to shared resources, such as data highways, may be controlled either globally by the CU or locally by an arbitration scheme. Global control is used in cases where the order of granting resources is known in advance and has to be enforced. This is again achieved through the use of pre-issue conditions [1]. Otherwise, a local mutual exclusion scheme such as in token rings or arbiters will grant requests. For example, the writing back to the register bank is controlled directly by the CMs of the FMs which require this service. As a consequence of this and also due to the differences in the execution times of microoperations, instructions may complete out of order. Therefore data has to be tagged with its destination, which also enables data-forwarding to be supported.

The reader is referred to [1] for further information on micronets, and to [2] for the performance evaluation of micronet-based datapaths.
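
The local arbitration and tagging described above can be sketched as follows. This is a simplified illustration rather than the implemented scheme: all results are assumed to be pending from the start, the token visits the units in a fixed circular order, and each holder writes back at most one result per visit. The function name, unit names, register names and values are hypothetical.

```python
from collections import deque


def token_ring_write_back(pending):
    """Grant the shared write-back bus by passing a token around the units.

    pending maps a unit name to a list of (destination_register, value)
    results, in the order that unit produced them. Returns the final
    register contents and the order in which write-backs were granted.
    """
    units = list(pending)
    queues = {unit: deque(pending[unit]) for unit in units}
    registers, order = {}, []
    token = 0
    while any(queues.values()):
        unit = units[token]
        if queues[unit]:
            # The tag travels with the data and selects the destination.
            dest, value = queues[unit].popleft()
            registers[dest] = value
            order.append((unit, dest))
        token = (token + 1) % len(units)   # pass the token on
    return registers, order


regs, grants = token_ring_write_back(
    {"ALU": [("r1", 7), ("r3", 9)], "MU": [("r2", 42)]}
)
```

Because every result carries its destination tag, the register bank does not need to know in advance which unit will be granted the bus next; the grant order here simply alternates between the two units while both have results waiting.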























































































































































[Figure 5 diagram not reproduced; surviving label: ZTAG.]
Figure 5: Interfaced ALU

Phase 1 - Request is made to the register bank for an operand. This phase is usually hidden since it takes place concurrently with the operand's register access.

Phase 2 - Acknowledge or data valid signal is received; the receipt of operands is now detected by the control unit and the ALU operation may begin.

Phases 3 and 4 - Handshake completes concurrently with the ALU operation, and these extra phases are effectively hidden.

The destination register for each ALU operation is stored in two tag latches, as shown in Figure 5. The tag and data are sent together to the register block, allowing the correct destination register to be selected.

The functions of the register interfaces as shown in Figure 6 are listed below:

Operand Interfaces - These interfaces communicate with the control unit, the register bank and the operand fetch interfaces of other functional units, to control the supply of operands. An operand may only be sent to a functional unit when the following operations have been successfully completed:
  • A request has been made to the operand interface by the CU.










































Figure 6: Interfaced Register Bank

Since the write-back bus is shared by a number of functional units, some form of arbitration mechanism, such as a token ring, must be used to avoid contention. Although a token ring is easy to implement, its performance would degrade with an increase in the number of resources sharing the bus.
4 Simulation Results

A prototype datapath was implemented in ES2's 0.7 µm CMOS process using the Cadence design tools, which were used to create a library of self-timed components and datapath elements. The Cadence Design Framework provided interfaces to both VHDL and HSPICE. A VHDL model of the datapath was created from a high-level specification and synthesised. The HSPICE simulations of the entire datapath took approximately 17 hours on a SUN Sparc-10.

Figure 7 shows the execution of an ALU instruction, with the traces of the relevant control signals numbered.
Figure 7: ALU Instruction

Panel 1 - An asserted request signal (73) from the control unit to the ALU initiates an add operation. The acknowledge signal (54) represents the period of ALU activity.
Panel 2 - The operand request signal (155) is sent by the control unit to initiate a register access microoperation. The acknowledge signal (93) is asserted by the register operand interface to prevent further operand requests until a functional unit has claimed the current operands. Only the signals for one of the operands are shown.
Panel 3 - A request to lock the destination register (118) and the register bank acknowledge signal (127) are shown. The acknowledge signal is lowered after the register has been locked and a go-write request has been received from a functional unit.
Panel 4 - Shows an ALU operand request (165) to the register bank, together with the corresponding data valid signal (148) from the register bank.
Panel 5 - After the add operation has completed, a request (151) is made to write the result to the destination register. The receipt of data at the register bank is signalled by the assertion of the acknowledge flag (67).
Panel 6 - Shows the instruction decode start signal (880). The duration of this signal indicates the instruction issue time. Also shown is the register write signal (3847), where data is written back on the final edge of this signal. The complete instruction execution time is represented by the delay between the first and last edges as shown in this panel. (Note that both signals are active low.)

The following subsections describe a number of measurements which influence the performance of micronet-based datapaths.

4.1 Handshaking
Figure 8: The handshake cycle

The handshake cycle is implemented by two back-to-back C-elements. This circuit, which is commonly used for communication between microagents, forms the basis for distributed control in micronets and is therefore a crucial factor influencing performance within the micronet. Ignoring the computation within a microagent, the throughput would be limited by the control handshake cycle. Figure 8 shows a cycle time of 0.8 ns, corresponding to a maximum throughput rate of 1.25 GHz. This suggests that micronet control circuitry is unlikely to limit throughput in processing pipelines.

4.2 Maximum instruction issue rate
Figure 9: Maximum instruction issue rate

Figure 9 represents an instruction issue time of 1.85 ns. The maximum instruction issue rate is determined by the earliest possible reassertion of the issue signal. Given sufficient instruction fetch bandwidth, the minimum cycle time for this signal is 2.05 ns, which equates to a maximum instruction issue rate of 488 MHz. This represents a theoretical upper limit on processor performance while ignoring datapath delays.

4.3 ALU throughput

Figure 10: ALU Activity

Figure 10 shows the signal from the control unit (73) being asserted to initiate an ALU operation. The period when both the ALU and its interface are busy is represented by the duration of signal 54 (4.31 ns). During this period the ALU interface requests both operands, initiates the operation, detects the result, obtains write-back (go-write) permission and writes the result to the Z-bus. The actual instruction execution time of the ALU is determined by the period between the operands arriving and the ALU's acknowledge being deasserted (3.11 ns). This is the delay required to add without any carry propagation and thus represents the minimum time through the functional unit. The minimum ALU instruction cycle time is determined by the earliest possible reassertion of signal 73. This cycle time was estimated at 4.51 ns, implying a peak processor performance of 222 MIPS for add instructions.

Only the FM latency should be considered as time spent in useful work, with the other delays being overheads of the control paradigm and the architecture. In this implementation, the micronet overhead for this operation is 1.4 ns (the difference between the operation's cycle time and its latency). This overhead can be effectively removed by modifying the ALU interface to deassert the microoperation acknowledge signal to the CU once the operands have been fetched [2].

4.4 Operand fetch

The operand fetch delay, as shown in Figure 11, was calculated as the period between the assertion of the operand request signal (155) and the assertion of the data valid signal (148) from the register bank (1.45 ns). The actual time to access one of the registers is determined by the duration between the assertion of the operand request acknowledge (93) and the data valid signal (1.24 ns).

4.5 Write-back

Figure 12 shows that the time between the result becoming available at the output of the ALU and being written back to the destination register is 2.48 ns. The actual time taken to write data to a register is 0.5 ns (the duration of signal 3847). The slow rising and falling edges of the write-back request signal (151) limit the write-back rate to 474 MHz.
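The rates quoted in Sections 4.1 to 4.3 follow directly from the measured cycle times. The short Python sketch below simply redoes that arithmetic; the constant and function names are illustrative and do not come from the paper, and only figures stated in the text are used.

```python
# Cycle times and latency quoted in Section 4, in nanoseconds.
HANDSHAKE_CYCLE_NS = 0.8   # control handshake cycle (Section 4.1)
ISSUE_CYCLE_NS = 2.05      # minimum issue-signal cycle (Section 4.2)
ALU_CYCLE_NS = 4.51        # minimum ALU instruction cycle (Section 4.3)
ALU_LATENCY_NS = 3.11      # add latency without carry propagation


def rate_mhz(cycle_ns: float) -> float:
    """Convert a cycle time in ns into a rate in MHz (events per microsecond)."""
    return 1000.0 / cycle_ns


handshake_rate = rate_mhz(HANDSHAKE_CYCLE_NS)       # 1250 MHz, i.e. 1.25 GHz
issue_rate = rate_mhz(ISSUE_CYCLE_NS)               # rounds to 488 MHz
alu_rate = rate_mhz(ALU_CYCLE_NS)                   # rounds to 222 (MIPS for adds)
micronet_overhead = ALU_CYCLE_NS - ALU_LATENCY_NS   # 1.40 ns

print(f"handshake: {handshake_rate:.0f} MHz")
print(f"issue:     {issue_rate:.0f} MHz")
print(f"ALU adds:  {alu_rate:.0f} MIPS")
print(f"overhead:  {micronet_overhead:.2f} ns")
```

The overhead line makes explicit that roughly a third of the add cycle time is control overhead rather than FM latency in this unoptimised prototype.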
Figure 11: Operand Fetch

Note that no circuit optimisations of transistor sizes have yet been made, either to improve performance or to sharpen edges. Micronet datapaths can be synthesised from high-level specifications using a custom-built library of four-phase self-timed components and interconnection cells.

5 Discussion

The emergence of VLSI technology, together with the maturing of optimising compiler techniques, aided the development of early RISC architectures [9] [11] [16]. Their primary concern was the efficient usage of expensive silicon real estate, and careful consideration was given to the design of the instruction set architecture [13]. There have been two orthogonal trends in the evolution of synchronous processor architectures [10]: deeply-pipelined architectures [14], i.e. ones which exploit temporal parallelism, and superscalar architectures, which exploit spatial parallelism [3] ([4] is an example which exploits both). Both these classes have benefited from improvements in technology and the resulting faster clock frequencies. But these improvements have been sustained at a high price in terms of clock distribution, power consumption, and design complexity [4]. Furthermore, significant additional control costs are incurred in exploiting ILP in both cases.

Figure 12: Register Write Back

Micronets offer an alternative model for the design of future processor architectures. Whereas the original RISC ideal was the efficient usage of the silicon space by identifying the critical resources, we are essentially concerned with their efficient utilisation over time. We achieve this in two ways: by removing the clock and distributing control to the resources; and by viewing the datapath not as a linear pipeline, but as a network of communicating resources. We are able to exploit fine-grain ILP efficiently (the overheads due to asynchrony are hidden [1]) without the additional control costs (the protocol also implements scoreboarding and hazard avoidance mechanisms).

The asynchronous and distributed nature of the control in micronets allows the processor to be easily extended with little effect on the rest of the design. For a given class of problems, the designer is able to easily explore the architectural design space more accurately by adding critical resources. This can be naturally extended to superscalar architectures by increasing the number of issue units. (Synchronous superscalar architectures replicate entire datapaths.) The same scoreboarding mechanism is shared between the issue units for determining the global state of the datapath. A micronet-based superscalar architecture has been designed and its performance is currently being evaluated.

5.1 The Micronet and the compiler

The micronet model exposes structural concurrency in the datapath, with fine-grained resources now being visible to the compiler. It is the task of the compiler to schedule instructions such that these resources are efficiently utilised. The instruction schedule is devised based on a model of the architecture; for synchronous architectures the model is simple: instructions do not interact and their execution times are fixed. In contrast, an asynchronous model is necessarily less accurate for the following reason: execution times for the same instruction may vary due to environmental parameters, data-dependent operations, and interactions between different instructions which are simultaneously executing in the micronet. We have considered models based on worst-case instruction execution times, where the resulting schedule is treated as a first-pass one. The instructions are dynamically reordered at run-time to tune this schedule and, due to the asynchronous behaviour, these instructions are issued as soon as possible, without the need for delays using NO-OP instructions.

A micronet-based datapath has several communicating "pipelines" which can all potentially be busy simultaneously. The control unit aims to issue successively those instructions which minimise resource contention. It will only stall if no instructions are available for issue, or all the instruction's resources are busy. The micronet's asynchronous behaviour minimises the duration of this stall. In the case of instructions with data or structural hazards, both instructions are issued without stalling, with the second instruction executing until it reaches the busy microagent. These fine-grain hazard avoidances are enforced at run-time by the pre-issue conditions of the micronet.

Other reasons contribute towards the complexity of compile-time scheduling on micronets. An initial state of activity is assumed for scheduling a basic block within the micronet, which might well be different at run-time. The actual state can indeed be determined at run-time thanks to the implicit scoreboarding mechanism in the CU.
This information is used to dynamically alter the static schedule by identifying an instruction which can be executed immediately (easily achieved using the control acknowledgement signal), after checking for independence from the previous instructions in the buffer, which is determined at compile-time and marked by a concurrency bit. In the presence of out-of-order instruction issue, the instruction issue is limited only by the availability of resources and operands. Micronets can therefore be viewed as a hybrid dataflow style of architecture which is limited to the window of instructions available in the instruction buffer, without the bookkeeping costs of traditional dataflow architectures [7].
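The run-time selection described in Section 5.1 can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the CU's actual logic: the concurrency bit is taken to mean "independent of every earlier instruction still in the buffer", an instruction is considered issuable only when none of its resources is busy (the real micronet can issue an instruction and stall only the offending micro-operation), and all names (`select_for_issue`, `uses`, `concurrent`) are hypothetical.

```python
def select_for_issue(buffer, busy):
    """Return the index of the first instruction that can issue now, or None.

    buffer: list of dicts in static-schedule order; each has a set of
            resource names under "uses" and a compile-time "concurrent" bit.
    busy:   set of resource names currently reported busy by the scoreboard.
    """
    for index, ins in enumerate(buffer):
        # The head of the buffer never has to be checked against predecessors;
        # later entries rely on the (assumed) meaning of the concurrency bit.
        independent = index == 0 or ins["concurrent"]
        resources_free = not (ins["uses"] & busy)
        if independent and resources_free:
            return index
    return None


# Hypothetical instruction window; the multiplier is still busy.
buffer = [
    {"op": "mul", "uses": {"MUL", "REG"}, "concurrent": False},
    {"op": "sub", "uses": {"ALU", "REG"}, "concurrent": False},
    {"op": "add", "uses": {"ALU", "REG"}, "concurrent": True},
]
chosen = select_for_issue(buffer, busy={"MUL"})
print(buffer[chosen]["op"])
```

Here the `add` overtakes the stalled `mul` because its bit marks it as independent, whereas the `sub` has to wait for its turn in the static order.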
6 Conclusions

We have presented a new model, called micronets, for decentralising control in asynchronous processor architectures. They are viewed as a network of communicating functional units, which expose fine-grain concurrency between instructions. We have demonstrated that four-phase handshaking protocols enable the implementation of highly concurrent structures and in most cases the overheads can be hidden. Just as importantly, these protocols are used to efficiently avoid datapath hazards.

The modular nature of micronets eases modification and empowers the computer architect with finer control in the design, for example, of superscalar architectures. Some of the issues relating to micronets as targets for parallelising compilers have been discussed.

The control interfaces for the micronet-based datapaths are specified using a library of interconnection cells, and automatically synthesised in terms of simple C-elements. Results from SPICE simulations for an add ALU operation have been presented which demonstrate the feasibility of distributing control.

In conclusion, the micronet model considers the interactions between the underlying implementation technology, the architecture and the compiler, and underlines our integrated approach to system design.

Acknowledgements

V. Rebello and R. Mullins were supported by postgraduate studentships from the U. K. Engineering and Physical Sciences Research Council (EPSRC). This work was partially supported by a grant from EPSRC entitled Formal Infusion of Communication and Concurrency into Programs and Systems (Grant Number GR/G55457).

References

[1] D. K. Arvind and V. E. F. Rebello. Instruction-level parallelism in asynchronous processor architectures. In M. Moonen and F. Catthoor, editors, Proceedings of the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, pages 203–215, Leuven, Belgium, August 1994. Elsevier Science Publishers.
[2] D. K. Arvind and V. E. F. Rebello. On the performance evaluation of asynchronous processor architectures. In P. Dowd and E. Gelenbe, editors, Proceedings of the 3rd International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'95), pages 100–105, Durham, NC, USA, January 1995. IEEE Computer Society Press.
[3] K. Diefendorff and M. Allen. Organisation of the Motorola 88110 superscalar RISC microprocessor. IEEE Micro, 12(2):40–63, April 1992.
[4] D. W. Dobberpuhl et al. A 200-MHz 64-bit dual issue CMOS processor. IEEE Journal of Solid-State Circuits, 27(11):1555–1567, November 1992.
[5] S. B. Furber. Lessons from AMULET1: Towards AMULET2. In Computing Without Clocks: Asynchronous Microprocessor Design. The Sun Annual Lecture in Computer Science at the University of Manchester, September 1994.
[6] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, and J. V. Woods. A micropipelined ARM. In T. Yanagawa and P. A. Ivey, editors, The Proceedings of the IFIP International Conference on Very Large Scale Integration (VLSI'93), pages 5.4.1–5.4.10, Grenoble, France, September 1993.
[7] J.-L. Gaudiot and L. Bic. Advanced Topics in Dataflow Computing. Prentice-Hall, Englewood Cliffs, NJ, USA, 1991.
[8] G. Gopalakrishnan. Some unusual micropipeline circuits. Technical Report UUCS-93-015, Department of Computer Science, University of Utah, Salt Lake City, UT, USA, December 1993.
[9] J. Hennessy, N. Jouppi, F. Baskett, and J. Gill. MIPS: A VLSI processor architecture. In The Proceedings of the CMU Conference on VLSI Systems and Computations, Rockville, MD, USA, October 1981. Computer Science Press.
[10] N. P. Jouppi and D. W. Wall. Available instruction-level parallelism for superscalar and superpipelined machines. In The Proceedings of ASPLOS III, pages 272–282. ACM Press, April 1989.
[11] M. G. Katevenis, R. W. Sherbourne, D. A. Patterson, and C. H. Sequin. The RISC II micro-architecture. In F. Anceau and E. J. Aas, editors, The Proceedings of VLSI'83: VLSI Design of Digital Systems, pages 349–359. North-Holland, 1983.
[12] J. K. F. Lee and A. J. Smith. Branch prediction strategies and branch target buffer design. IEEE Computer, 17(1):6–22, January 1984.
[13] A. Lunde. Empirical evaluation of some features of instruction set processor architectures. Communications of the ACM, 20(3):143–153, March 1977.
[14] S. Mirapuri, M. Woodacre, and N. Vasseghi. The MIPS R4000 processor. IEEE Micro, pages 10–22, April 1992.
[15] Y.-J. Oyang, C.-H. Wen, Y.-F. Chen, and S.-M. Lin. The effects of employing advanced branching mechanisms in superscalar architectures. ACM Computer Architecture News, 18(4):35–51, December 1990.
[16] D. A. Patterson and C. H. Sequin. RISC I: A reduced instruction set VLSI computer. In The Proceedings of the 8th International Symposium on Computer Architecture, pages 443–457, May 1981.
[17] G. Radin. The 801 minicomputer. In The Proceedings of the Symposium on Architectural Support for Programming Languages and Operating Systems, pages 39–47, March 1982.
[18] W. F. Richardson and E. L. Brunvand. The NSR processor prototype. Technical Report UUCS-92-029, Department of Computer Science, University of Utah, USA, 1992.
[19] C. L. Seitz. System Timing. In C. Mead and L. Conway, editors, Introduction to VLSI Systems, chapter 7, pages 218–262. Addison-Wesley, 1980.
[20] J. E. Smith. A study of branch prediction strategies. In The Proceedings of the 8th International Symposium on Computer Architecture, pages 135–148, May 1981.
[21] J. E. Smith and A. R. Pleszkun. Implementing precise interrupts in pipelined processors. IEEE Transactions on Computers, 37(5):562–573, May 1988.
[22] I. E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, June 1989.
[23] N. Ullah and M. Holle. The MC88110 implementation of precise exceptions in a superscalar architecture. ACM Computer Architecture News, 21(1):15–25, March 1993.
Appendix D. Published Papers 228
D.4 Static Scheduling of Instructions on Micronet-
based Asynchronous Processors
Title: Static scheduling of instructions on micronet-based
asynchronous processors.
Authors: D. K. Arvind and V. E. F. Rebello.
Presented at: The 2nd International Symposium on Advanced Research on
Asynchronous Circuits and Systems (ASYNC’96).
Place: Aizu Wakamatsu City, Japan.
Date: 18th – 21st March 1996.
Publisher: IEEE Computer Society Press.
Static Scheduling of Instructions on Micronet-based Asynchronous Processors

D. K. Arvind and V. E. F. Rebello
Department of Computer Science, The University of Edinburgh
Edinburgh, EH9 3JZ, United Kingdom
E-mail: {dka, vefr}@dcs.ed.ac.uk

Abstract

This paper investigates issues which impinge on the design of static instruction schedulers for micronet-based asynchronous processor (MAP) architectures. The micronet model exposes both temporal and spatial concurrency within a processor. A list scheduling algorithm is described which has been optimised with MAP-specific heuristics. The performance of these heuristics on some program graphs is presented, and conclusions are drawn on the suitability of MAP as targets for ILP compilers.

Keywords: Asynchronous Processor Architecture, Instruction-level Parallelism (ILP), Micronets, Static scheduling.

1 Introduction

A number of novel asynchronous processor architectures have been proposed recently [6, 9, 10, 12, 21, 23, 24, 27], but scant attention has been paid to any understanding of the interactions between the processor and compiler designs. Instead, existing synchronous RISC compiler technology has been reused (largely unmodified), while exploiting any improvements in the performance of the hardware which asynchrony provides.

One of the outcomes of the RISC design approach was a deeper understanding of the interactions between the processor design and the implementation and compiler technologies, respectively. The processors were streamlined for efficient implementation in the emerging VLSI technology, and the system complexity was migrated upwards to their compilers. For instance, MIPS did away with hardware interlocks and relied instead on the compiler to reorder instructions and introduce null ones where appropriate [15]. The optimisers for synchronous pipelines have assumed a deterministic model of the target, with each stage delay being approximated as the same, having been fixed a priori by the clock. They produce both an order of execution for the instructions and the times, in terms of multiples of the basic RISC instruction cycle, when they are to execute. In contrast, a linear, asynchronous pipeline, e.g. a micropipeline [28], has stages whose delays can vary because of data dependencies. The compiler now has a less accurate timing model of the target, and any optimisations based on a synchronous model, such as scheduling instructions in execution gaps, are less effective.

A micronet is a network of pipelines, with (selected) stages of different pipelines being able to communicate with each other. This enables the exploitation of both spatial and temporal concurrency between instructions [2] (in contrast, a micropipeline only exploits temporal parallelism [4]). It is more difficult for a compiler to predict the behaviour of the micronet for the following reasons: firstly, as in a micropipeline, the delay of each pipeline stage might vary; secondly, and more uniquely, each instruction only visits the relevant stages, and the multiple paths enable more than one instruction to operate concurrently within a stage, which enables instructions to race each other, with possible out-of-order completion of instructions. Furthermore, instructions may interfere with each other when competing for the same resource in a particular stage.

The effective performance which a MAP system can deliver depends intimately on the compiler's ability to match the parallelism in programs with the temporal and spatial concurrency exposed by the MAP architecture. This paper is a preliminary attempt to understand the interface between the back-end of a parallelising compiler and MAP architectures.
In the rest of this paper, Section 2 briefly describes MAP architectures; Section 3 introduces the MAP schedul-

Published in the Proceedings of the 2nd International Symposium on Advanced Research on Asynchronous Circuits and Systems (ASYNC'96), pp 80-91, Aizu Wakamatsu City, Japan, March 1996. © IEEE Computer Society Press.




















Figure 1: A micronet model of a MAP architecture

A micronet is an ensemble of micropaths, where a micropath is a pipeline or sequence of microagents, and in turn, a microagent performs either a communicating or a functional micro-operation. A functional microagent (FM) communicates with other FMs through their respective communicating microagents (CM). Even with a single issue unit, where the issue rate is faster than the slowest instruction execution rate, the microagents can all operate concurrently in space, in addition to the temporal concurrency associated with pipelines. Another feature of the micronet is that the issue unit initiates the micro-operations for an instruction independently, as soon as their particular microagents become available, and delegates all control to them, thus freeing the issue unit for the next instruction. Therefore, the idle time between instructions is kept to a minimum [2]. The executing instructions also release their microagents individually, as soon as the respective micro-operations have completed, thus freeing the resources immediately for another instruction. Finally, through a novel application of the communication protocol, datapath hazards are resolved efficiently while hiding the overheads of asynchrony [3].

A micronet can be stalled due to contention for resources. In particular, the issue unit will be stalled when the resources required by the current instruction are all busy. The scheduler attempts to minimise this by suitably ordering the instructions at compile-time. If it is impossible to schedule successive unrelated instructions, then the micronet minimises the stall at run-time. In the case of data-dependent instructions, both instructions are issued, with the second instruction awaiting the result to be forwarded. In the case of resource contention, the second instruction performs all the micro-operations up to the microagent which is busy. In effect, only the offending micro-operation is stalled, rather than the entire instruction. A detailed explanation of hazard avoidance is given in [3], with implications for the scheduler being detailed below:

Read-after-Write - Although the dependent instruction will be issued, its execution will be delayed until the completion of its predecessor. In practice, it is preferable not to issue such an instruction, since the resources earmarked for the dependent instruction are unavailable for use by other, now "ready-to-execute", instructions, which might introduce further structural hazards in the bargain.

Write-after-Write - The write-back order has to be maintained and this is achieved in hardware by the micronet. The two instructions are permitted to execute concurrently.
Although all of the second instruction's microoperations will have been initiated, the write-back microoperation will stall for as long as the first instruction holds on to the destination register. The current MAP architecture supports only one outstanding register lock request; therefore a subsequent third instruction which requires a locked register cannot be issued until the first write-back has been completed. The scheduler should avoid arranging instructions which write to the register file immediately after two instructions with write-after-write dependencies.

Write-after-Read - In the case of an architecture with a single set of operand fetch buses, the hardware ensures that a dependent instruction will be unable to lock its destination register before its predecessor has fetched its operand. Should there be a number of operand fetch buses (as in a superscalar MAP), and the possibility of a dependent instruction obtaining its operands before its predecessor, then this instruction may have to be stalled. This would only be necessary when the time to execute the dependent instruction is less than the operand fetch time for the predecessor. This hazard is also known as an anti-dependency and, along with write-after-write hazards, can be avoided by register renaming.

Hazard resolution is a good example of the interaction between the compiler and the architecture. Since there is no concept of time in the schedule, it is impossible to avoid all hazards at compile time (cf. the MIPS organiser). The scheduler can only hope to produce an ordering of instructions which reduces the number of hazards, and relies on the MAP architecture to minimise their effects by efficiently resolving them in hardware.

The computational model for synchronous RISC architectures is simple, in the sense that the execution times of instructions are considered fixed and instructions do not contend for resources. Neither of these holds for MAP architectures. The MAP model describes the architecture as a collection of microagents, where each one has a micro-operation latency, which determines when the result of that micro-operation becomes available, and a micro-operation cycle time, which signifies the rate at which the micro-operations can be executed.

3 The MAP Scheduler

The MAP scheduling problem can be stated as follows: Given a set of heterogeneous resources with variable execution times, devise a minimal-length, non-preemptive schedule which respects dependencies within programs.
Each program is described as an arbitrary partial ordering of instructions.

The precedence- and resource-constrained instruction scheduling problem has been well studied, and it is known that even with restrictions imposed, the problem is still NP-hard [7] [17] [29]. For example, when the execution times of tasks are not uniform and their partial order is arbitrary, then for two or more identical processing units, the problem of determining a minimal-length, non-preemptive schedule is NP-complete [13]. This result is true even if all of the tasks are independent. Therefore, in order to achieve near-optimal execution times for given applications on MAP architectures, an efficient (polynomial-time) scheduling algorithm based on one or a number of heuristics must be devised.

3.1 The MAP Scheduling Problem

List scheduling (LS) is a general method for scheduling tasks in resource-constrained problems [7]. LS builds a ready set that contains all of the tasks which are not waiting on the results of other tasks. When a processor becomes available, a task with the highest priority is chosen from the set and assigned to it. The ready set is obtained from a topological sort of the data dependence graph. LS relies on other heuristics to prioritise the ready tasks and guide it towards an optimal solution. This has led to a profusion of LS-based heuristics [5, 11, 16, 20, 25].

The MAP solution adopted here is based on the optimal, greedy scheduling algorithm for list scheduling which was proposed by Coffman and Graham [8]. This is an O(n^2) algorithm for arbitrary precedence constraints for two processors with unit execution costs. A MAP scheduler has to deal with heterogeneous resources and can no longer just choose the ready instruction with the highest priority, but must also consider whether the correct resources are available. Once an instruction is issued, its execution cannot be suspended and resumed at the point of suspension at a later time, i.e. schedules must be non-preemptive. The goodness of these schedules is highly dependent on the parameter(s) that are used to prioritise instructions within the ready list [1] [22], and these are discussed next.

3.1.1 Minimising Idle Times

The scheduler's first assumption is that minimising the stall time will lead to an optimal (or at least near-optimal) program execution time (the first priority heuristic).
This implies that the MAP compiler should not schedule instructions until the required microagents (resources) are available. Also, the hazards due to data dependencies outlined in the previous section should be avoided. All of this implies that the computational model has to maintain a scoreboard of resource activities.
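
A scoreboard of resource activities of the kind mentioned in Section 3.1.1 could be modelled in the scheduler roughly as follows. This is a minimal Python sketch under simplifying assumptions: each microagent is represented only by the time at which it next becomes free, worst-case cycle times are used, and the class, method and resource names as well as the numeric values are hypothetical rather than taken from the paper.

```python
class Scoreboard:
    """Records, for each microagent, the time at which it next becomes free."""

    def __init__(self, microagents):
        self.free_at = {name: 0.0 for name in microagents}

    def stall_time(self, now, needed):
        """How long the issue unit would wait until all `needed` agents are free."""
        latest = max(self.free_at[name] for name in needed)
        return max(0.0, latest - now)

    def reserve(self, start, needed, cycle_times):
        """Mark each needed agent busy for its (assumed worst-case) cycle time."""
        for name in needed:
            self.free_at[name] = start + cycle_times[name]


def pick_min_stall(now, ready, board):
    """First-priority heuristic: choose the ready instruction that stalls least."""
    return min(ready, key=lambda ins: board.stall_time(now, ins["uses"]))


# Hypothetical cycle times in ns; an add issued at t = 0 occupies ALU and REG.
cycle = {"ALU": 4.5, "MEM": 8.0, "REG": 1.5}
board = Scoreboard(cycle)
board.reserve(0.0, ["ALU", "REG"], cycle)

ready = [
    {"name": "add", "uses": ["ALU", "REG"]},
    {"name": "load", "uses": ["MEM", "REG"]},
]
choice = pick_min_stall(2.0, ready, board)
print(choice["name"], board.stall_time(2.0, ["ALU", "REG"]))
```

At t = 2.0 the ALU is still busy, so the load, whose resources are already free, is preferred over a second add.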
3.1.2 Primary Instruction PriorityIn Coman and Graham's algorithm, interprocessorcommunication is assumed to be zero and tasks haveunit execution times, which means that time can beconveniently treated as being discrete rather than con-tinuous. This allows priorities to be assigned based onthe task's level within the DAG from the sink tasks.Since instructions have dierent worst-case executiontimes in MAP, the problem is similar to multiprocessorscheduling with interprocessor communication delays(where communication costs are only incurred if de-pendent tasks are scheduled on dierent processors).The solutions adopted in this eld have been basedon critical path analysis and heuristics [14] [19] [26].(The critical path cost of a task is the largest sum ofcosts along a path from itself to a sink task.) In theMAP computational model, although actual instruc-tion execution costs may vary, these critical path costscan be determined a priori by basing them on xed,worst-case instruction costs.3.1.3 Secondary Instruction PriorityThe heuristics applied so far may still not prioritisethe executable tasks (i.e. those tasks whose operandsand resources are available and are therefore ready forexecution) suciently. One feature which does seemto signicantly inuence the best choice of candidate isthe dependents of the chosen task. The two heuristicsused to \break ties" amongst candidates of the samepriority act as follows: the rst one gives a higherpriority to the task with the larger number of suc-cessors which are solely dependent on it. If a tie isstill unbroken, then a higher priority is given to thetask with the most number of successors. A feature ofthese heuristics is that the priority of a task increaseswith time. 
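
The primary and secondary priorities of Sections 3.1.2 and 3.1.3 can be sketched in Python as a single sort key. The sketch assumes that "successors" means immediate successors in the dependence graph and that a successor is "solely dependent" on a task when that task is its only predecessor; the function names, the example graph and its costs are illustrative, not taken from the paper.

```python
from functools import lru_cache


def predecessors(succs):
    """Invert a successor map into a predecessor map."""
    preds = {task: set() for task in succs}
    for task, children in succs.items():
        for child in children:
            preds[child].add(task)
    return preds


def critical_path_costs(succs, cost):
    """Largest sum of worst-case costs on any path from each task to a sink."""

    @lru_cache(maxsize=None)
    def cp(task):
        return cost[task] + max((cp(s) for s in succs[task]), default=0)

    return {task: cp(task) for task in succs}


def priority_key(task, succs, preds, cp):
    """Primary priority, then the two tie-breakers, as a comparable tuple."""
    sole = sum(1 for s in succs[task] if preds[s] == {task})
    return (cp[task], sole, len(succs[task]))


# Hypothetical four-instruction graph: a -> c, b -> c, b -> d.
succs = {"a": ["c"], "b": ["c", "d"], "c": [], "d": []}
cost = {"a": 3, "b": 2, "c": 1, "d": 2}  # assumed worst-case costs
preds = predecessors(succs)
cp = critical_path_costs(succs, cost)
best = max(["a", "b"], key=lambda t: priority_key(t, succs, preds, cp))
print(cp, best)
```

In this example `a` and `b` have equal critical path costs, and `b` wins the tie because `d` depends on it alone.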
Additionally, these heuristics highlight the need to consider not only which tasks need to execute in the future, but also their resources.

3.1.4 Importance of the Instruction Issue Cycle Time

Unlike synchronous pipelines, micronet resources have two parameters which affect instruction execution costs: the microoperation's latency and its cycle time [4]. Latency determines when data becomes available for subsequent micro-operations and other instructions. The cycle time influences when a resource (microagent) becomes available again for use by a subsequent instruction (or microoperation). Together with program parallelism and the number of resources, a limiting factor on the amount of exploitable ILP is the cycle time of the issue unit in relation to the execution time of instructions (or more accurately their cycle times).

In order to minimise the issue unit's stall time, the compiler has to devise a schedule that allows instructions to be issued continuously at the highest possible rate, which is equivalent to one every minimum Instruction Issue Cycle Time (IICT) [3]. Traditional synchronous datapaths are pipelined or, where necessary, super-pipelined (i.e. the functional units are themselves pipelined) sufficiently to achieve this goal. Due to the spatial ILP in MAP, instructions are issued at a rate (determined by the IICT and dependencies) which is faster than their Instruction Cycle Times (ICTs). The ICT is the effective issue time (due to pipelining) for a particular instruction, which is determined by the rate at which that specific instruction type can be processed. As the IICT, which is less than the largest ICT, gets smaller, the MAP architecture behaves more in a superscalar fashion and therefore the value of the IICT itself can have a significant influence on the optimality of a schedule. This is less significant when the IICT is comparable to the largest ICT, in which case the order of the independent instructions is less critical, since the micronet behaves like a linear pipeline without any spatial concurrency.

3.1.5 IICT, ICT and Lookahead

When choosing an instruction to schedule, it may be beneficial to consider not only those instructions which are ready, but also ones which will become ready in the near future, e.g. within the next minimum IICT; this is called instruction lookahead. Note that this may mean deliberately selecting an instruction that causes the processor's issue unit to stall.

Another form of lookahead is to consider the future resource requirements when scheduling instructions, called resource lookahead.
The two steps in choosing an instruction and checking for availability of resources should take place in conjunction (see Algorithm 1 for more details).

3.2 The MAP scheduling algorithm

The algorithm takes as its input a directed graph of instruction dependencies and a resource graph with architectural parameters, and generates an instruction schedule for the given MAP architecture. Two lists are defined as follows: the WI list - the list of instructions still awaiting their operands, and the EI list - an ordered list of instructions which are ready, or will be ready in the near future (for lookahead instructions),
but still awaiting issue. The order of the latter list is determined by the critical path costs of instructions, i.e. the primary priority. Next, a prioritised list of executable instructions is derived from the EI list based on the availability of their resources at the current time. If there are ties, an instruction (or instructions in the case of superscalar MAP) is chosen for issue based on secondary priority values.

The scheduler mimics the behaviour of the architecture's issue unit. The function generate_schedule(), as shown in Algorithm 1, schedules instructions based on their readiness, their priority and the availability of resources. Unlike schedulers for synchronous machines, the scheduling of instructions does not proceed in uniform time steps, but rather in an asynchronous event-driven manner until all the instructions are scheduled. Each iteration of the main loop (the while do loop in line 5) corresponds to an instant in time when the issue unit is ready to issue an instruction. However, a situation may arise when at some given time there are no instructions ready for issue (line 8), in which case the clock must be advanced, but only as far as necessary to remedy this. The incrementing of the clock simulates the issue unit being stalled. The routine, advance_clock(), finds the earliest occurrence of three types of events: the ready time of an instruction in the WI list and of a lookahead instruction in the EI list; the time when the result of an operation becomes available in the register file; and the time a busy resource becomes free. Only the first two events can change the status of the EI list. There is a choice of heuristics which can be applied, either the instruction lookahead or the traditional priority-based approach. Instruction lookahead (lines 9 - 17) chooses the best instruction to issue from the EI list based on the lookahead heuristic.
The function, get_ready_instr(), returns from the given list of instructions the one with the highest estimated-time-to-completion (ETC) priority for which there will be sufficient resources in the datapath if it is issued at its earliest issue time. This time may be the current issue time or some time in the future. In the case of the latter, issuing this instruction will cause the issue unit to stall. In the current implementation of the lookahead heuristic, only one instruction is chosen per issue cycle iteration. The routine, apply.lookahead(), implements the instruction lookahead heuristic which uses the ETC priority and the earliest issue time of two instructions to determine which of them should be issued first. By comparing the estimated execution time of the two instruction schedules, the order with the smallest time is chosen. Should the two schedules
have the same time, then the order where an instruction completes the earliest is chosen, since this allows dependents to become ready sooner. The alternative heuristic (lines 18 - 29) chooses the instruction with the highest priority which can be issued immediately. This may involve choosing one or more from a number of instructions with the same primary priority value (ETC). Line 19 creates a list of ready instructions with the same, highest ETC values and line 22 removes those instructions with insufficient resources for issue at the current time. Line 23 supports architectures which incorporate lockstep superscalar instruction issue. The routine issue_all() issues as many of the instructions as possible from the given list. If there are not enough issue slots for the complete list (readyI), then the routine choosing_insts() returns the best instruction for issue based on the secondary priorities. The two loops (lines 26 and 27) repeat until either the issue slots are filled or their respective lists become empty. The clock is advanced appropriately depending on whether or not the scheduler was able to issue one or more instructions at the current time (lines 28 and 29). The routine, update_writeback(), models the behaviour of the portion of the micronet not directly controlled by the issue unit, e.g. the write-back bus. Line 32 updates the instruction lists and the next instruction issue cycle iteration begins at a new time.

Example 1 and Example 2 contrast the influence of IICT and resource lookahead on determining an optimal schedule. A1 and B are ready candidate instructions, with a third instruction, A2, which has a structural dependency on A1.

The lookahead heuristics attempt to match the available program and architectural parallelism over a short window of time. The strategy of repeating the process over the entire program allows the instruction-level parallelism to be exploited more evenly.
This has two effects: firstly, a better program makespan is usually achieved; secondly, a schedule is generated which is more robust to deviations from the predicted instruction costs because only the appropriate amount of program parallelism is exposed which can be exploited by the target at any one time. Since costs are based on worst-case values rather than typical ones, the traditional list scheduling heuristics tend to overly migrate independent instructions to the top of the schedule, leaving insufficient parallelism for later. Kerns and Eggers [18] proposed a code scheduling algorithm called balanced scheduling for synchronous architectures which is similar in concept. Their algorithm is specifically designed to tolerate a wide range of variance in load latency, e.g. cache misses/hits, global and
Algorithm 1: The MAP scheduler (generate_schedule())
 1  curr_time := 0;
 2  calc_completion_times();            /* Critical path analysis for each instruction */
 3  update_WI(WI_list);                 /* Determine instruction start times */
 4  update_EI(WI_list);                 /* Move ready instructions to EI list */
 5  while (WI_list ≠ {}) or (EI_list ≠ {}) do
 6    no_issued := 0;                   /* Number of inst issued simultaneously at this time */
 7    candidates := EI_list;
 8    if (EI_list = {}) curr_time := advance_clock(YES, YES, NO, curr_time);
 9    else if (lookahead = YES)         /* Use Instruction Lookahead Heuristics */
10      BestChoice := get_ready_instr(candidates);
                                        /* The inst with the highest priority in the candidates list for which there are sufficient resources */
11      if (BestChoice ≠ NULL)
12        while candidates ≠ {} do
13          NextInst := get_ready_instr(candidates);
14          if (NextInst ≠ NULL) apply.lookahead(BestChoice, NextInst);
          end while
15        if (BestChoice.rdy_time ≤ curr_time + issue_cost)
16          issue_instruction(BestChoice); no_issued++;
17          EI_list := EI_list - BestChoice;
    else
18      do                              /* Alternative strategy without Instruction Lookahead */
                                        /* Let same_ETC_list be the list of the highest ETC cost ready insts */
19        ∃ same_ETC_list ⊆ candidates, s.t. ∀ i ∈ candidates, ∃ v ∈ same_ETC_list, s.t. (v.ETC ≥ i.ETC);
20        candidates := candidates - same_ETC_list;
21        do                            /* Remove instructions without sufficient resources */
22          ∃ readyI ⊆ same_ETC_list, s.t. ∀ i ∈ readyI, find_avail_FU_resources(i, datapath, curr_time);
23          if (|readyI| ≤ spsclr_deg - no_issued) issue_all(readyI, no_issued);
            else                        /* choose between insts in readyI list */
24            inst_chosen := choosing_insts(readyI, no_issued);
25            EI_list := EI_list - {inst_chosen};
26        while ((no_issued < spsclr_deg) and (same_ETC_list ≠ {}));
27      while ((no_issued < spsclr_deg) and (candidates ≠ {}));
28      if (no_issued > 0) curr_time += inst_issue_cycle;
29      else curr_time := advance_clock(YES, YES, YES, curr_time);
    end if
30    update_writeback(datapath);
31    if (WI_list ≠ {})
32      update_WI(WI_list); update_EI(WI_list);
  end while
33  update_writeback(datapath);
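The event-driven behaviour of Algorithm 1 can be illustrated with a much-reduced Python sketch. It covers only the priority-based branch without instruction lookahead, assumes a single-issue unit, and ignores the write-back bus and secondary priorities. The data layout (`preds`, `latency`, `unit_of`, `cycle_time`), the name-based tie-break and the example costs are assumptions made for illustration and are not taken from the MAP implementation.

```python
def critical_path_costs(succs, latency):
    """ETC priority: longest latency-weighted path from an instruction to a sink."""
    etc = {}

    def visit(i):
        if i not in etc:
            etc[i] = latency[i] + max((visit(s) for s in succs.get(i, ())), default=0)
        return etc[i]

    for i in latency:
        visit(i)
    return etc


def generate_schedule(preds, latency, unit_of, cycle_time, iict):
    """Return ([(issue_time, instruction)], makespan) for an acyclic dependency graph."""
    succs = {}
    for i, ps in preds.items():
        for p in ps:
            succs.setdefault(p, []).append(i)
    etc = critical_path_costs(succs, latency)

    result_at = {}   # issued instruction -> time its result becomes available
    unit_free = {}   # functional unit -> time it can accept another instruction
    waiting = set(latency)
    curr_time, schedule = 0, []

    def earliest_issue(i):
        """Earliest issue time of i, or None while a predecessor is unissued."""
        ps = preds.get(i, ())
        if any(p not in result_at for p in ps):
            return None
        operands = max((result_at[p] for p in ps), default=0)
        return max(operands, unit_free.get(unit_of[i], 0))

    while waiting:
        times = {i: earliest_issue(i) for i in waiting}
        issuable = [i for i, t in times.items() if t is not None and t <= curr_time]
        if issuable:
            # Highest ETC first; ties are broken by name (an arbitrary choice here).
            best = min(issuable, key=lambda i: (-etc[i], i))
            schedule.append((curr_time, best))
            result_at[best] = curr_time + latency[best]
            unit_free[unit_of[best]] = curr_time + cycle_time[unit_of[best]]
            waiting.remove(best)
            curr_time += iict                 # the issue unit recovers
        else:
            # Stall: advance the clock only as far as the next relevant event.
            curr_time = min(t for t in times.values() if t is not None)
    return schedule, max(result_at.values())


# Hypothetical four-instruction block: two loads feed an add, which feeds a store.
preds = {"ld1": [], "ld2": [], "add": ["ld1", "ld2"], "st": ["add"]}
latency = {"ld1": 3, "ld2": 3, "add": 1, "st": 2}
unit_of = {"ld1": "mu", "ld2": "mu", "add": "alu", "st": "mu"}
cycle_time = {"mu": 2, "alu": 1}
print(generate_schedule(preds, latency, unit_of, cycle_time, iict=1))
```

The clock is not stepped uniformly: when nothing can be issued it jumps directly to the earliest time at which an operand arrives or a unit becomes free, which corresponds to the stall handling described for advance_clock().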
Example 1: Resource Lookahead
 1  switch IICT
 2    case 0: Choose schedule {A1,B,A2} or {B,A1,A2};      /* Either schedule is optimal */
 3    case (0 ≤ IICT < 1/2 ICT_A):
 4      if (ICT_B > 2 ICT_A) Choose schedule {B,A1,A2};    /* Instruction B takes longer than both A1 and A2 */
 5      else Choose schedule {A1,B,A2};
                    /* In other words, combine the resource requirements of dependent instructions and schedule the instruction according to the resource with the most work. */
 6    case (1/2 ICT_A ≤ IICT < ICT_A):
 7      if (ICT_B > 2 ICT_A)                               /* then schedule B first (as before) */
 8        Choose schedule {B,A1,A2};
 9      else                                               /* schedule A1 first */
10        if (ICT_B < ICT_A) Choose schedule {A1,A2,B};
11        else Choose schedule {A1,B,A2};
12    case (ICT_A ≤ IICT):                                 /* Schedule the instruction with the largest ICT first */
13      if ICT_A < ICT_B Choose schedule {B,A1,A2};
14      else Choose schedule {A1,A2,B};
15  end switch;

Example 2: Without Resource Lookahead
 1  if (IICT = 0) Choose schedule {A1,B,A2} or {B,A1,A2};  /* Again, either schedule is optimal */
 2  else                                                   /* Simply schedule the instruction with the largest ICT first. */
 3    if (ICT_A < ICT_B) Choose schedule {B,A1,A2};
 4    else if (IICT < ICT_A) Choose schedule {A1,B,A2};
 5    else Choose schedule {A1,A2,B};

local memory. In these architectures, instruction costs are well defined and considered fixed. Usually the latencies reflect the most optimistic execution, e.g., the time of a cache hit rather than a cache miss. Traditional schedulers improve performance through reordering instructions to avoid pipeline stalls, e.g., by inserting independent instructions after loads to keep the CPU busy. The number of instructions inserted
(in the best case) depends on this latency value. If the load instruction is delayed beyond the scheduler's estimate, then the processor will stall. However, if the latency is shorter, then the destination register of the load instruction will be tied up for longer and this may increase register pressure enough to cause unnecessary code spills. Both balanced scheduling and resource lookahead are computationally more expensive than the traditional list scheduling approach, and will not be considered further in this paper.

4 Results

In this section, the makespans of MAP schedules for a number of typical instruction DAGs (briefly described below) are compared with their optimal. (The optimal makespan is derived from an exhaustive search.)

BT3 - A Binary Tree with three levels.
BT3.5 - A Binary Tree with three and a half levels.
BT4 - A Binary Tree with four levels.
DD - Diamond DAGs which are commonly found in the evaluation of partial differential equations.
DM - Dense matrix multiplication.
SM - Sparse matrix multiplication.
CC - Mix of Load, Store and ALU instructions with data dependencies.
CCL - A loop unrolled version of CC.

Min1 - This architecture contains the minimum resources - one ALU and one Memory Unit (MU) which both share a single write-back bus. The cycle times and latencies of the ALU and MU micro-operations are assumed to be the same.
3bus - This architecture has an additional ALU and each of the three functional units has a dedicated write-back bus. (The micro-operation cycle times and latencies are the same as Min1.)
Min2 - Same as Min1, except that the micro-operation costs of the ALU and MU are different and reflect realistic costs obtained from SPICE-level simulations.
The results for the MAP scheduling heuristic, both without and with instruction lookahead, are shown in Table 1. For each DAG, the number of valid schedules is recorded together with the optimal makespan for the given target architecture. The makespan generated by the heuristics together with its closeness to the optimal (recorded both as a percentage of the optimal (% Diff) and as a percentage of the difference between the best and worst makespans (% of the Range)) are also included. It is assumed that sufficient registers are available and so code spilling could be avoided. This would normally be determined at the register allocation phase of the compilation and is not considered here.

The results look quite promising. In a majority of the cases for the 3bus architecture, the heuristic can find an optimal solution (only in the case of SM is instruction lookahead required to reduce the makespan to optimal). However, the MAP scheduler does not seem to do as well on the Min1 architecture (for BT3, BT3.5, BT4, CCL, DM and SM). The poorer makespans are due to a bottleneck on the write-back bus. It turns out to be better in some cases to stall the issue unit for a longer period of time than that assumed by instruction lookahead (the IICT), i.e. wait until a higher priority instruction becomes ready, because this stall time is hidden by the write-back bottleneck. Where the makespan is only slightly worse than the optimal, i.e. DM, the heuristic together with instruction lookahead is sufficient to find an optimal solution. In the case of the Min2 architecture, BT3, BT3.5, BT4, and CCL are now optimal. This is because the relative delays of the microagents have reduced the bottleneck for the write-back bus. In the case of DM and SM, there is still interference between the instructions which results in sub-optimal executions.
Instruction interference can be reduced by applying a post-pass re-ordering of the generated schedules, and this is the subject of a future paper.

Remember that these schedules are only optimal with respect to the instruction costs which have been assumed. In practice, these schedules may not be optimal for a particular execution of the program for the reasons discussed earlier. One could even expect that each run of the program would have a different optimal schedule. The stability of the schedules in light of variance in the resource delays needs further study.
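The two closeness measures described for Table 1 can be read as in the short sketch below. The exact formulae are an interpretation of the wording in the text (the excess over the optimal makespan expressed relative to the optimal, and relative to the spread between the best and worst makespans); the function name `closeness` and the numbers in the example are hypothetical and are not values from Table 1.

```python
def closeness(heuristic, optimal, worst):
    """Return (% Diff, % of the Range) for a heuristic makespan.

    Assumed definitions:
      % Diff         = excess over the optimal, as a percentage of the optimal
      % of the Range = excess over the optimal, as a percentage of (worst - optimal)
    """
    excess = heuristic - optimal
    pct_diff = 100.0 * excess / optimal
    spread = worst - optimal
    # If every valid schedule has the same makespan, the range is empty.
    pct_of_range = 100.0 * excess / spread if spread else 0.0
    return pct_diff, pct_of_range


# Made-up makespans: heuristic 22, best (optimal) 20, worst 30.
print(closeness(22, 20, 30))
```

Reporting both figures is useful because a schedule that is 10% above the optimal may still sit near the good end of the range of valid schedules, or near the bad end if that range is narrow.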
[Figure 2: The makespans of schedules based on worst- and average-case run-time costs. Surviving panel title: The Schedule Based on Average-case Costs.]
5 Discussion

5.1 Choice of instruction execution costs

Although the execution times of the same instruction might vary due to data-dependent delays, worst-, average- or even best-case figures for the execution times can be found on which the schedules could be based. When producing static schedules, the compiler has to use the delays of the FMs and the question arises as to which of the sets of figures to use. Figure 2 illustrates the simplified schedules for the CC test (obtained from [15]) based on worst-case and average-case costs and figures for the execution times of the instructions based on actual worst-case and average-case delays at run-time for these schedules. (The ratios of the delays for the two cases for the instructions realistically reflect actual behaviour on the asynchronous processor under study.) The figures reveal that given these ratios, using a schedule based on worst-case costs is better in practice. Using this approach a heuristic will always try to schedule an instruction, if possible, only when its operands are guaranteed to be available, thereby minimising any stalls. Note also that the schedule's correctness is not affected by the changes in instruction costs. Furthermore, given that a program's critical path may change with different executions (due to different data sets) and that the schedule is generated once, the compiler's choice of which costs to use is important. By basing the schedule on worst-case delays a lower bound on performance can be achieved.

5.2 Interaction between executing instructions

Reasons other than the ones just stated also contribute to the difficulty in predicting the global state of the micronet. In synchronous processors, the compiler can assume when scheduling a basic block that the datapath is idle and that all of the resources are available. This is a consequence of the fact that in synchronous pipelines, an instruction never affects the execution of other instructions.
This is not necessarily the case in a micronet, since the execution times of instructions might vary for the following reasons: only a partial ordering is employed between instructions (i.e. it is not necessary for the previous instructions to have completed their execution before successive ones); instructions compete for shared resources, e.g. the write-back bus; during execution instructions might interfere with each other. Therefore, the state
of the resources at any particular time cannot be predicted accurately at compile-time. But this information is indeed available at run-time in the issue unit of the micronet. This could be used to dynamically tune (i.e. allow out-of-order instruction issues) the static schedule by the control unit. This requires identifying an instruction which can be executed immediately (easily achieved using the control acknowledgement signal scoreboarding mechanism), and checking that the instruction is independent of earlier ones in the instruction buffer. Although the latter may be expensive to perform, the task can be made simpler with assistance from the compiler by using a concurrency bit.

6 Conclusions

The MAP approach efficiently combines aspects of well-known architectural styles. In dataflow architectures, the instructions are issued as soon as their operands are available. This is achieved dynamically in hardware which incurs not insignificant run-time (book-keeping) costs. As in RISC architectures, code scheduling is done statically, but additionally instruction issue (and even possibly the instruction schedule) is fine-tuned dynamically to take advantage of run-time characteristics as in the dataflow model. In a sense, at the instruction-level MAP follows the classical von Neumann style, whereas at the level of microagents it is more in the character of dataflow architectures.

The micronet model exposes temporal and spatial concurrency in the datapath, with fine-grained resources now being visible to the compiler. This model subsumes the micropipeline model which only exploits temporal concurrency in the datapath and the scheduling methods described here can be equally applied to micropipeline-based processors.

Code scheduling (on ILP architectures) and machine-dependent optimisations have a significant impact on program performance. It is the task of the compiler to schedule instructions such that these resources are efficiently utilised.
The instruction schedule is devised based on a computational model of the target architecture. For synchronous architectures the model is simple; in contrast, an asynchronous model is necessarily less accurate for the reasons discussed earlier. However, initial studies have shown that these factors do not significantly hinder a compiler's ability to schedule code efficiently. Worst-case instruction execution times have been considered, and the resulting schedule is treated as a first-pass one. The
interference between the instructions can be reduced by applying post-pass optimisations, and this is being currently investigated. The instructions could then be dynamically reordered at run-time to fine-tune this schedule by taking advantage of actual run-time costs. In conclusion, preliminary studies have shown that a micronet-based asynchronous processor architecture does present a suitable target for an ILP compiler.

Acknowledgements

V. Rebello was supported by a postgraduate studentship from the U. K. Engineering and Physical Sciences Research Council (EPSRC). This work was partially supported by a grant from EPSRC entitled Formal Infusion of Communication and Concurrency into Programs and Systems (Grant Number GR/G55457).

References

[1] T. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules for parallel processing systems. Communications of the ACM, 17(12):685–690, December 1978.
[2] D. K. Arvind, R. D. Mullins, and V. E. F. Rebello. Micronets: A model for decentralising control in asynchronous processor architectures. In M. B. Josephs, editor, The Proceedings of the 2nd Working Conference on Asynchronous Design Methodologies, pages 190–199, London, UK, May 1995. IEEE Computer Society Press.
[3] D. K. Arvind and V. E. F. Rebello. Instruction-level parallelism in asynchronous processor architectures. In M. Moonen and F. Catthoor, editors, Proceedings of the 3rd International Workshop on Algorithms and Parallel VLSI Architectures, pages 203–215, Leuven, Belgium, August 1994. Elsevier Science Publishers.
[4] D. K. Arvind and V. E. F. Rebello. On the performance evaluation of asynchronous processor architectures. In P. Dowd and E. Gelenbe, editors, Proceedings of the 3rd International Workshop on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS'95), pages 100–105, Durham, NC, USA, January 1995. IEEE Computer Society Press.
[5] J. Baxter and J. H. Patel. The LAST Algorithm: A heuristic-based static task allocation algorithm. In The Proceedings of the 1989 International Conference on Parallel Processing, pages 217–222, 1989.
[6] E. Brunvand. The NSR processor. In The Proceedings of the Hawaii International Conference on System Sciences. IEEE Computer Society Press, January 1993.
[7] E. G. Coffman. Computer and Job-Shop Scheduling Theory. John Wiley and Sons, New York, 1976.
[8] E. G. Coffman and R. L. Graham. Optimal scheduling for two-processor systems. Acta. Informatica, 1:200–213, 1972.
[9] I. David, R. Ginosar, and M. Yoeli. Self-timed architecture of a reduced instruction set computer. In S. Furber and M. Edwards, editors, The Proceedings of the IFIP Working Conference on Asynchronous Design Methodologies, Manchester, UK, March 1993. Elsevier Science Publishers.
[10] Mark E. Dean. STRiP: A Self-timed RISC Processor. PhD thesis, Stanford University, July 1992.
[11] H. El-Rewini and T. G. Lewis. Scheduling parallel program tasks onto arbitrary target machines. Journal of Parallel and Distributed Computing, 9:138–153, 1990.
[12] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, and J. V. Woods. A micropipelined ARM. In T. Yanagawa and P. A. Ivey, editors, The Proceedings of the IFIP International Conference on Very Large Scale Integration (VLSI'93), pages 5.4.1–5.4.10, Grenoble, France, September 1993.
[13] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[14] A. Gerasoulis and T. Yang. A comparison of clustering heuristics for scheduling directed acyclic graphs on multiprocessors. Journal of Parallel and Distributed Computing, 16:276–291, December 1992.
[15] J. Hennessy and T. Gross. Postpass code optimisation of pipeline constraints. ACM Transactions on Programming Languages and Systems, 5(3):422–448, July 1983.
[16] J-J. Hwang, Y-C. Chow, F. D. Anger, and C-Y. Lee. Scheduling precedence graphs in systems with interprocessor communication times. SIAM Journal of Computing, 18(2):244–257, April 1989.
[17] H. Kasahara and S. Narita. Practical multiprocessor scheduling algorithms for efficient parallel processing. IEEE Transactions on Computers, C-33(11):1023–1029, November 1984.
[18] D. R. Kerns and S. J. Eggers. Balanced scheduling: Instruction scheduling when memory latency is uncertain. SIGPLAN Notices, 28(6):278–289, June 1993. Proceedings of the ACM Conference on Programming Language Design and Implementation.
[19] S. J. Kim and J. C. Brown. A general approach to mapping of parallel computation upon multiprocessor architecture. In The Proceedings of the International Conference on Parallel Processing, Vol. III, pages 1–8, 1988.
[20] S. Manoharan and P. Thanisch. Assigning dependency graphs onto processor networks. Parallel Computing, 17(1):63–73, April 1991.
[21] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus. The design of an asynchronous microprocessor. In C. L. Seitz, editor, Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference on VLSI, pages 351–373, Cambridge, Mass., 1989. MIT Press.
[22] C. McCreary, A. A. Khan, J. Thompson, and M. E. McArdle. A comparison of heuristics for scheduling DAGs on multiprocessors. Technical Report CSE-93-07, Auburn University, Auburn, AL, 36849, USA, 1994.
[23] S. V. Morton, S. S. Appleton, and M. J. Liebelt. ECSTAC: A fast asynchronous microprocessor. In M. B. Josephs, editor, The Proceedings of the 2nd Working Conference on Asynchronous Design Methodologies, pages 180–189, London, UK, May 1995. IEEE Computer Society Press.
[24] T. Nanya, Y. Ueno, H. Kagotani, M. Kuwako, and A. Takamura. TITAC: Design of a quasi-delay-insensitive microprocessor. IEEE Design and Test of Computers, pages 50–63, Summer 1994.
[25] C. H. Papadimitrou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. SIAM Journal of Computing, 19(2):322–328, April 1990.
[26] V. Sarkar. Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors. The MIT Press, 1989.
[27] R. F. Sproull, I. E. Sutherland, and C. E. Molnar. Counterflow pipeline processor architecture. Technical Report SMLI TR-94-25, Sun Microsystems Laboratories Inc., April 1994.
[28] I. E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738, June 1989.
[29] J. Ullman. NP-complete scheduling problems. Journal of Computer and System Sciences, 10:384–393, 1975.

Bibliography
[1] T. Adam, K. M. Chandy, and J. R. Dickson. A comparison of list schedules
for parallel processing systems. Communications of the ACM, 17(12):685–
690, December 1978.
[2] M. Afghahi and C. Svennson. Performance of synchronous and asyn-
chronous schemes for VLSI systems. IEEE Transactions on Computers,
41(7):858–872, July 1992.
[3] A. Aiken and A. Nicolau. A development environment for horizontal
microcode. IEEE Transactions on Software Engineering, 14(5):584–594, May
1988.
[4] D. K. Arvind, R. D. Mullins, and V. E. F. Rebello. Micronets: A model for
decentralising control in asynchronous processor architectures. In M. B.
Josephs, editor, The Proceedings of the 2nd Working Conference on Asynchron-
ous Design Methodologies, pages 190–199, London, UK, May 1995. IEEE
Computer Society Press.
[5] D. K. Arvind and V. E. F. Rebello. Instruction-level parallelism in asyn-
chronous processor architectures. In M. Moonen and F. Catthoor, editors,
Proceedings of the 3rd International Workshop on Algorithms and Parallel VLSI
Architectures, pages 203–215, Leuven, Belgium, August 1994. Elsevier Sci-
ence Publishers.
[6] D. K. Arvind and V. E. F. Rebello. On the performance evaluation of asyn-
chronous processor architectures. In P. Dowd and E. Gelenbe, editors,
Proceedings of the 3rd International Workshop on Modeling, Analysis and Sim-
ulation of Computer and Telecommunication Systems (MASCOTS’95), pages
100–105, Durham, NC, USA, January 1995. IEEE Computer Society Press.
[7] D. K. Arvind and V. E. F. Rebello. Static scheduling of instruction on
micronet-based asynchronous processors. In The Proceedings of the 2nd
International Symposium on Advanced Research on Asynchronous Circuits and
Systems (ASYNC’96), pages 80–91, Aizu Wakamatsu City, Japan, March
1996. IEEE Computer Society Press.
[8] D. K. Arvind and C. R. Smart. A unified framework for parallel event-
driven logic simulation. In Proceedings of the 1991 Computer Simulation
Conference, Baltimore, Maryland, USA, July 1991.
[9] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for
high performance computing. ACM Computing Surveys, 26(4):345–420,
December 1994.
[10] H. G. Baker. Precise instruction scheduling without a precise machine
model. ACM Computer Architecture News, 19(6):4–8, December 1991.
[11] M. R. Barbacci and D. P. Siewiorek. The Design and Analysis of Instruction
Set Processors. McGraw Hill, 1982.
[12] J. Baxter and J. H. Patel. The LAST Algorithm: A heuristic-based static
task allocation algorithm. In The Proceedings of the 1989 International Con-
ference on Parallel Processing, pages 217–222, 1989.
[13] D. Bernstein, D. Cohen, Y. Lavon, and V. Rainish. Performance evalu-
ation of instruction scheduling on the IBM RISC System/6000. In The
Proceedings of the 25th Annual International Symposium on Microarchitecture
(MICRO’25), pages 226–235, 1992.
[14] D. Bernstein and I. Gertner. Scheduling expressions on a pipelined pro-
cessor with a maximal delay of one cycle. ACM Transactions on Program-
ming Languages and Systems, 11(1):57–66, 1989.
[15] D. Bernstein and M. Rodeh. Global instruction scheduling for superscalar
machines. In The Proceedings of the Conference on Programming Language
Design and Implementation, pages 241–255, June 1991.
[16] B. Bose and T. R. N. Rao. Theory of unidirectional error correct-
ing/detecting codes. IEEE Transactions on Computers, C-31(6):521–530,
June 1982.
[17] D. G. Bradlee, S. J. Eggers, and R. R. Henry. Integrating register allocation
and instruction scheduling for RISCs. ACM Computer Architecture News,
19(2):122–131, April 1991.
[18] E. Brunvand. The NSR processor. In The Proceedings of the Hawaii Interna-
tional Conference on System Sciences. IEEE Computer Society Press, January
1993.
[19] E. Brunvand and R. F. Sproull. Translating concurrent programs into
delay-insensitive circuits. In The Proceedings of the International Conference
on Computer Aided Design (ICCAD-89), pages 262–265, November 1989.
[20] J. A. Brzozowski and J. C. Ebergen. Recent developments in the design of
asynchronous circuits. In Fundamentals of Computation Theory, pages 78–
94. Lecture Notes in Computer Science, Vol. 380, Springer-Verlag, 1989.
[21] J. A. Brzozowski and J. C. Ebergen. On the delay-sensitivity of gate
networks. IEEE Transactions on Computers, 41(11):1349–1359, November
1992.
[22] J. A. Brzozowski and K. Raahemifar. Testing C-elements is not elementary.
In M. B. Josephs, editor, The Proceedings of the 2nd Working Conference
on Asynchronous Design Methodologies, pages 150–159, London, UK, May
1995. IEEE Computer Society Press.
[23] W. Buchholz. Planning a Computer System: Project Stretch. McGraw-Hill,
1962.
[24] J. Bunda, W. C. Athas, and D. Fussel. Evaluating power implications
of CMOS microprocessor design decisions. In The Proceedings of the 1994
International Workshop on Low Power Design, pages 147–152, Napa, CA,
USA, 1994.
[25] S. M. Burns. Automated compilation of concurrent programs into self-
timed circuits. Technical Report Caltech-CS-TR-88-2, Computer Science
Department, California Institute of Technology, 1988.
[26] S. M. Burns. Performance Analysis and Optimisation of Asynchronous Cir-
cuits. PhD thesis, Computer Science Department, California Institute of
Technology, Pasadena, California, USA, 1991.
[27] G. J. Chaitin, M. A. Auslander, A. K. Chandra, J. Cocke, M. E. Hopkins,
and P. W. Markstein. Register allocation and spilling via graph coloring.
Computer Languages, 6:47–57, 1981.
[28] T. J. Chaney and C. E. Molnar. Anomalous behaviour of synchronizer and
arbiter circuits. IEEE Transactions on Computers, 22(4):421–422, April 1973.
[29] C.-H. Chien, M. A. Franklin, T. Pan, and P. Prabhu. ARAS: Asynchronous
RISC architecture simulator. In M. B. Josephs, editor, The Proceedings of
the 2nd Working Conference on Asynchronous Design Methodologies, pages
210–219, London, UK, May 1995. IEEE Computer Society Press.
[30] K. M. Chu and D. I. Pulfrey. Design procedures for differential cascade
voltage switch circuits. IEEE Journal of Solid-State Circuits, 21(6):1082–1087,
1986.
[31] Tam-Anh Chu. Synthesis of Self-Timed VLSI Circuits from Graph-Theoretic
Specifications. PhD thesis, MIT Laboratory for Computer Science, June
1987.
[32] E. G. Coffman. Computer and Job-Shop Scheduling Theory. John Wiley and
Sons, New York, 1976.
[33] E. G. Coffman and R. L. Graham. Optimal scheduling for two-processor
systems. Acta. Informatica, 1:200–213, 1972.
[34] R. P. Colwell, C. Y. Hitchcock, E. D. Jensen, H. M. B. Sprunt, and C. P.
Kollar. Computers, complexity and controversy. IEEE Computer, 18:8–19,
September 1985.
[35] I. David, R. Ginosar, and M. Yoeli. An efficient implementation of boolean
functions as self-timed circuits. IEEE Transactions on Computers, 41(1):2–11,
January 1992.
[36] I. David, R. Ginosar, and M. Yoeli. Implementing sequential-machines as
self-timed circuits. IEEE Transactions on Computers, 41(1):12–17, January
1992.
[37] I. David, R. Ginosar, and M. Yoeli. Self-timed architecture of a reduced in-
struction set computer. In S. Furber and M. Edwards, editors, The Proceed-
ings of the IFIP Working Conference on Asynchronous Design Methodologies,
Manchester, UK, March 1993. Elsevier Science Publishers.
[38] M. E. Dean, D. L. Dill, and M. Horowitz. Self-timed logic using current-
sensing completion detection (CSCD). In The Proceedings of the Inter-
national Conference on Computer Design (ICCD’91), pages 187–191. IEEE
Computer Society Press, October 1991.
[39] Mark E. Dean. STRiP: A Self-timed RISC Processor. PhD thesis, Stanford
University, July 1992.
[40] K. Diefendorff and M. Allen. Organisation of the Motorola 88110 super-
scalar RISC microprocessor. IEEE Micro, 12(2):40–63, April 1992.
[41] D. L. Dill. Trace Theory for Automatic Hierarchical Verification of Speed-
Independent Circuits. ACM Distinguished Dissertations. MIT Press, 1989.
[42] D. W. Dobberpuhl et al. A 200-MHz 64-bit dual issue CMOS processor.
IEEE Journal of Solid-State Circuits, 27(11):1555–1567, November 1992.
[43] J. C. Ebergen. A formal approach to designing delay-insensitive circuits.
Distributed Computing, 5(3):107–119, 1991.
[44] J. H. Edmondson, P. Rubinfeld, R. Preston, and V. Rajagopalan. Super-
scalar instruction execution in the 21164 Alpha microprocessor. IEEE
Micro, 15(2):33–43, April 1995.
[45] H. El-Rewini and T. G. Lewis. Scheduling parallel program tasks onto
arbitrary target machines. Journal of Parallel and Distributed Computing,
9:138–153, 1990.
[46] J. R. Ellis. Bulldog: A Compiler for VLSI Architectures. MIT Press, 1986. PhD
Thesis, Yale, 1985.
[47] C. J. Elston, D. B. Christianson, P. A. Findlay, and G. B. Steven. Hades -
Towards the design of an asynchronous superscalar processor. In M. B.
Josephs, editor, The Proceedings of the 2nd Working Conference on Asynchron-
ous Design Methodologies, pages 200–209, London, UK, May 1995. IEEE
Computer Society Press.
[48] P. Endecott. Processor architectures for power efficiency and asynchron-
ous implementation. Master’s thesis, Department of Computer Science,
University of Manchester, UK., 1993.
[49] P. Endecott. SCALP: A Superscalar Asynchronous Low-Power Processor. PhD
thesis, Department of Computer Science, University of Manchester, UK.,
December 1995. CST-41-86.
[50] European Silicon Structures Limited. Solo 1400 Reference Manual. ES2
Publications Unit, Bracknell, U.K., 1990.
[51] C. Farnsworth, D. A. Edwards, J. Lie, and S. S. Sikand. A hybrid asyn-
chronous system design environment. In M. B. Josephs, editor, The Pro-
ceedings of the 2nd Working Conference on Asynchronous Design Methodologies,
pages 91–98, London, UK, May 1995. IEEE Computer Society Press.
[52] J. Fisher. Trace scheduling: A technique for global microcode compaction.
IEEE Transactions on Computers, 30(7):478–490, July 1981.
[53] M. J. Flynn, C. L. Mitchell, and J. M. Mulder. And now a case for more
complex instruction sets. IEEE Computer, 20(9):71–83, September 1987.
[54] M. A. Franklin and T. Pan. Clocked and asynchronous instruction
pipelines. In The Proceedings of the 26th Annual International Symposium on
Microarchitecture (MICRO’26), pages 177–184, Austin, Texas, USA, Decem-
ber 1993. IEEE Computer Society Press.
[55] S. B. Furber. Lessons from AMULET1: Towards AMULET2. In Comput-
ing Without Clocks: Asynchronous Microprocessor Design. The Sun Annual
Lecture in Computer Science at the University of Manchester, September
1994.
[56] S. B. Furber, P. Day, J. D. Garside, N. C. Paver, and J. V. Woods. A mi-
cropipelined ARM. In T. Yanagawa and P. A. Ivey, editors, The Proceedings
of the IFIP International Conference on Very Large Scale Integration (VLSI’93),
pages 5.4.1–5.4.10, Grenoble, France, September 1993.
[57] H. Gabow. An almost linear algorithm for two-processor scheduling.
Journal of the ACM, 29(3):766–780, 1982.
[58] G. R. Gao. An efficient hybrid dataflow architecture. Journal of Parallel
and Distributed Computing, 19:293–307, 1993.
[59] M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the
Theory of NP-Completeness. W. H. Freeman and Company, 1979.
[60] J. D. Garside. A CMOS VLSI implementation of an asynchronous ALU.
In S. Furber and M. Edwards, editors, Asynchronous Design Methodolo-
gies, volume A-28 of IFIP Transactions, pages 181–207. Elsevier Science
Publishers, 1993.
[61] J.-L. Gaudiot and L. Bic. Advanced Topics in Dataflow Computing. Prentice-
Hall, Englewood Cliffs, NJ, USA, 1991.
[62] A. Gerasoulis and T. Yang. A comparison of clustering heuristics for
scheduling directed acyclic graphs on multiprocessors. Journal of Parallel
and Distributed Computing, 16:276–291, December 1992.
[63] R. Ginosar and N. Michell. On the potential of asynchronous pipelined
processors. ACM Computer Architecture News, 18(4):27–34, December 1990.
[64] G. Gopalakrishnan. Some unusual micropipeline circuits. Technical Re-
port UUCS-93-015, Department of Computer Science, University of Utah,
Salt Lake City, UT, USA, December 1993.
[65] W. R. Hamburgen and J. S. Fitch. Packaging a 150W bipolar ECL mi-
croprocessor. Research report 92/1, DEC Western Research Laboratory,
March 1992.
[66] S. Hauck. Asynchronous design methodologies: An overview. Technical
Report TR 93-05-07, Department of Computer Science and Engineering,
University of Washington, Seattle, USA, 1993.
[67] P. Hazewindus. Testing Delay-Insensitive Circuits. PhD thesis, California
Institute of Technology, Pasadena, CA, USA., 1992. CS-TR-92-14.
[68] S. Heath. VMEbus User’s Handbook. CRC Press, 1988.
[69] L. G. Heller and W. R. Griffin. Cascade Voltage Switch Logic: A dif-
ferential CMOS logic family. In The Proceedings of the IEEE International
Conference on Solid-state Circuits, pages 16–17, 1984.
[70] J. Hennessy and T. Gross. Postpass code optimisation of pipeline
constraints. ACM Transactions on Programming Languages and Systems,
5(3):422–448, July 1983.
[71] J. Hennessy, N. Jouppi, F. Baskett, and J. Gill. MIPS: A VLSI processor
architecture. In The Proceedings of the CMU Conference on VLSI Systems
and Computations, Rockville, Md. USA., October 1981. Computer Science
Press.
[72] J. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative
Approach. Morgan Kaufmann, 1990.
[73] C. A. R. Hoare. Communicating sequential processes. Communications of
the ACM, 21(8):666–677, August 1978.
[74] L. A. Hollaar. Direct implementation of asynchronous control units. IEEE
Transactions on Computers, C-31(12):1133–1141, December 1982.
[75] T. C. Hu. Parallel sequencing and assembly line problems. Operations
Research, 9(6):841–848, 1961.
[76] H. Hulgaard, S. M. Burns, and G. Borriello. Testing asynchronous circuits:
A survey. Technical Report UW-CSE-94-03-06, Department of Computer
Science and Engineering, University of Washington,
1994.
[77] J-J. Hwang, Y-C. Chow, F. D. Anger, and C-Y. Lee. Scheduling preced-
ence graphs in systems with interprocessor communication times. SIAM
Journal on Computing, 18(2):244–257, April 1989.
[78] INMOS Limited. Occam2 Reference Manual. Prentice Hall International,
1988.
[79] INMOS Limited. Transputer Reference Manual. Prentice Hall International,
1988.
[80] R. Jain. The Art of Computer System Performance Analysis. John Wiley &
Sons, 1991.
[81] M. Johnson. Superscalar Processor Design. Prentice-Hall, Englewood Cliffs,
NJ, USA., 1991.
[82] M. B. Josephs and J. T. Udding. Delay-insensitive circuits: An algebraic
approach to their design. In J. C. M. Baeten and J. W. Klop, editors,
Theories of Concurrency: Unification and Extension (CONCUR’90), pages
342–366. Springer-Verlag, August 1990.
[83] N. P. Jouppi, P. Boyle, and J. S. Fitch. Designing, packaging and testing a
300-MHz, 115W ECL microprocessor. IEEE Micro, 14(2):50–58, April 1994.
[84] N. P. Jouppi and D. W. Wall. Available instruction-level parallelism for
superscalar and superpipelined machines. In The Proceedings of ASPLOS
III, pages 272–282. ACM Press, April 1989.
[85] H. Kasahara and S. Narita. Practical multiprocessor scheduling al-
gorithms for efficient parallel processing. IEEE Transactions on Computers,
C-33(11):1023–1029, November 1984.
[86] M. G. Katevenis, R. W. Sherbourne, D. A. Patterson, and C. H. Séquin.
The RISC II micro-architecture. In F. Anceau and E. J. Aas, editors, The
Proceedings of VLSI’83: VLSI Design of Digital Systems, pages 349–359.
North-Holland, 1983.
[87] D. Kearney and N. W. Bergmann. Performance evaluation of asynchron-
ous logic pipelines with data dependent processing delays. In M. B.
Josephs, editor, The Proceedings of the 2nd Working Conference on Asyn-
chronous Design Methodologies, pages 4–13, London, UK, May 1995. IEEE
Computer Society Press.
[88] D. R. Kerns and S. J. Eggers. Balanced scheduling: Instruction scheduling
when memory latency is uncertain. SIGPLAN Notices, 28(6):278–289, June
1993. Proceedings of the ACM Conference on Programming Language Design
and Implementation.
[89] R. W. Keyes. The evolution of digital electronics towards VLSI. IEEE
Transactions on Electron Devices, ED-26(4):271–278, 1979.
[90] A. Khoche and E. Brunvand. Testing self-timed circuits using partial
scan. In M. B. Josephs, editor, The Proceedings of the 2nd Working Conference
on Asynchronous Design Methodologies, pages 160–169, London, UK, May
1995. IEEE Computer Society Press.
[91] S. J. Kim and J. C. Browne. A general approach to mapping of parallel
computation upon multiprocessor architecture. In The Proceedings of the
International Conference on Parallel Processing, Vol. III, pages 1–8, 1988.
[92] M. Ko. Instruction scheduling for micronet-based asynchronous pro-
cessors. Master’s thesis, Department of Computer Science, University of
Edinburgh, Edinburgh, Scotland, UK., September 1995.
[93] S. Komori, H. Takata, T. Tamura, F. Asai, T. Ohno, O. Tomisawa, T. Yama-
saki, K. Shima, K. Asada, and H. Terada. An elastic pipeline mechanism
by self-timed circuits. IEEE Journal of Solid-State Circuits, 23(1):111–117,
February 1988.
[94] R. F. Krick and A. Dollas. The evolution of instruction sequencing. IEEE
Computer, 24(4):5–15, April 1991.
[95] M. Kuga, K. Murakami, and S. Tomita. DSNS (Dynamically-hazard-
resolved, Statically-code-scheduled, Nonuniform Superscalar): Yet an-
other superscalar processor architecture. ACM Computer Architecture
News, 19(4):14–29, June 1991.
[96] H. T. Kung. Why systolic architectures? IEEE Computer, 15:37–46, January
1982.
[97] S-Y. Kung, S. C. Lo, and P. S. Lewis. Timing analysis and design optimisa-
tion of VLSI data flow arrays. In The Proceedings of the IEEE International
Conference on Parallel Processing, pages 600–607, 1986.
[98] C. H. Lau. SELF: A self-timed systems design technique. Electronics Letters,
23(6):269–270, March 1987.
[99] L. Lavagno and A. Sangiovanni-Vincentelli. Automated synthesis of asyn-
chronous interface circuits. In S. Furber and M. Edwards, editors, The
Proceedings of the IFIP Working Conference on Asynchronous Design Method-
ologies, Manchester, UK, March 1993. Elsevier Science Publishers.
[100] J. K. F. Lee and A. J. Smith. Branch prediction strategies and branch target
buffer design. IEEE Computer, 17(1):6–22, January 1984.
[101] P. F. Lister and A. M. Alhelwani. Design methodology for self-timed VLSI
systems. IEE Proceedings-E Computer and Digital Techniques, 132(1):25–32,
January 1985.
[102] Å. Lunde. Empirical evaluation of some features of instruction set pro-
cessor architectures. Communications of the ACM, 20(3):143–153, March
1977.
[103] T. Mano, F. Maruyama, K. Hayashi, T. Kakuda, N. Kawato, and T. Uehara.
OCCAM to CMOS: An experimental logic design support system. In
C. J. Koomen and T. Moto-oka, editors, Computer Hardware Description
Languages and their Applications: The Proceedings of the Decennial Caltech
Conference on VLSI, pages 381–390. North Holland, 1985.
[104] S. Manoharan and P. Thanisch. Assigning dependency graphs onto pro-
cessor networks. Parallel Computing, 17(1):63–73, April 1991.
[105] R. M. Marshall. Synthesis of Hardware Systems from Very High Level Be-
havioural Specifications. PhD thesis, Department of Computer Science,
University of Edinburgh, UK., December 1986. CST-41-86.
[106] A. J. Martin. Programming in VLSI: From communicating processes to
delay-insensitive circuits. Technical Report Caltech-CR-TR-89-1, Depart-
ment of Computer Science, California Institute of Technology, Pasadena,
California, 1989.
[107] A. J. Martin. The limitations to delay-insensitivity in asynchronous cir-
cuits. In W. J. Dally, editor, The Proceedings of the 6th MIT Conference on
Advanced Research in VLSI, Cambridge, Mass., 1990. MIT Press.
[108] A. J. Martin. Asynchronous datapaths and the design of an asynchronous
adder. Technical Report Caltech-CR-TR-91-08, Computer Science Depart-
ment, California Institute of Technology, 1991.
[109] A. J. Martin. Tomorrow’s digital hardware will be asynchronous and
verified. In J. van Leeuwen, editor, Algorithms, Software, Architecture:
Proceedings of the IFIP Information Processing Conference, pages 684–695.
North-Holland, September 1992.
[110] A. J. Martin, S. M. Burns, T. K. Lee, D. Borkovic, and P. J. Hazewindus.
The design of an asynchronous microprocessor. In C. L. Seitz, editor,
Advanced Research in VLSI: Proceedings of the Decennial Caltech Conference
on VLSI, pages 351–373, Cambridge, Mass., 1989. MIT Press.
[111] A. McAuley. Four state asynchronous architectures. IEEE Transactions on
Computers, C-41(2):129–142, February 1992.
[112] C. McCreary, A. A. Khan, J. Thompson, and M. E. McArdle. A comparison
of heuristics for scheduling DAGs on multiprocessors. Technical Report
CSE-93-07, Auburn University, Auburn, AL, 36849. USA., 1994.
[113] S. McFarling and J. Hennessy. Reducing the cost of branches. In The
Proceedings of the 13th Annual International Symposium on Computer Archi-
tecture, pages 396–403, June 1986.
[114] E. McLellan. The Alpha AXP architecture and 21064 processor. IEEE
Micro, pages 36–47, June 1993.
[115] C. Mead and L. Conway. Introduction to VLSI Systems. Addison-Wesley,
Reading, Mass., 1980.
[116] T. H.-Y. Meng, R. W. Brodersen, and D. G. Messerschmitt. Automatic
synthesis of asynchronous circuits from high-level specifications. IEEE
Transactions on Computer Aided Design, 8(11):1185–1205, November 1989.
[117] R. E. Miller. Switching Theory. Volume II: Sequential Circuits and Machines.
John Wiley and Sons, 1965.
[118] S. Mirapuri, M. Woodacre, and N. Vasseghi. The MIPS R4000 processor.
IEEE Micro, pages 10–22, April 1992.
[119] C. E. Molnar, T-P. Fang, and F. U. Rosenberger. Synthesis of delay-
insensitive modules. In Henry Fuchs, editor, 1985 Chapel Hill Conference
on VLSI, pages 67–86. Computer Science Press, 1985.
[120] S.-M. Moon and K. Ebcioglu. An efficient resource-constrained global
scheduling technique for superscalar and VLIW processors. In The Pro-
ceedings of the 25th Annual International Symposium on Microarchitecture
(MICRO’25), pages 55–71, 1992.
[121] G. E. Moore. Cramming more components onto integrated circuits. Elec-
tronics, pages 114–117, April 1965.
[122] D. Morris and R. N. Ibbett. The MU5 Computer System. The Macmillan
Press, 1979.
[123] S. V. Morton, S. S. Appleton, and M. J. Liebelt. ECSTAC: A fast asyn-
chronous microprocessor. In M. B. Josephs, editor, The Proceedings of the
2nd Working Conference on Asynchronous Design Methodologies, pages 180–
189, London, UK, May 1995. IEEE Computer Society Press.
[124] D. E. Muller and W. S. Bartky. A theory of asynchronous circuits. In Vol.
XXIX of The Annals of the Computation Laboratory of Harvard University.
Harvard University Press, 1959.
[125] D. E. Muller and W. S. Bartky. A theory of asynchronous circuits. In The
Proceedings of an International Symposium on the Theory of Switching, pages
204–243. Harvard University Press, April 1959.
[126] R. D. Mullins. A VLSI design methodology for asynchronous processor
architectures. Technical report, Department of Computer Science, Uni-
versity of Edinburgh, Edinburgh, Scotland, UK., May 1994.
[127] R. D. Mullins. An asynchronous superscalar RISC architecture. Mas-
ter’s thesis, Department of Computer Science, University of Edinburgh,
Edinburgh, Scotland, UK., September 1995.
[128] E. J. Muth. The production rate of a series of workstations with variable
service times. International Journal of Production Research, 11(9):155–169,
1973.
[129] T. Nanya, Y. Ueno, H. Kagotani, M. Kuwako, and A. Takamura. TITAC:
Design of a quasi-delay-insensitive microprocessor. IEEE Design and Test
of Computers, pages 50–63, Summer 1994.
[130] A. Nicolau. Loop quantization or unwinding done right. In The Proceed-
ings of the 1st International Conference on Supercomputing, pages 294–308,
June 1987.
[131] C. D. Nielsen and A. J. Martin. A delay-insensitive multiply-accumulate
unit. Technical Report CS-TR-92-03, Computer Science Department, Cali-
fornia Institute of Technology, 1992.
[132] Y.-J. Oyang, C.-H. Wen, Y.-F. Chen, and S.-M. Lin. The effects of employ-
ing advanced branching mechanisms in superscalar architectures. ACM
Computer Architecture News, 18(4):35–51, December 1990.
[133] K. V. Palem and B. B. Simons. Scheduling time-critical instructions on
RISC machines. ACM Transactions on Programming Languages and Systems,
15(4):632–658, September 1993.
[134] C. H. Papadimitriou and M. Yannakakis. Towards an architecture-
independent analysis of parallel algorithms. SIAM Journal on Computing,
19(2):322–328, April 1990.
[135] V. Patel and K. Steptoe. Evaluation of self-timed systems for VLSI. Elec-
tronics Letters, 25(3):215–217, February 1989.
[136] D. A. Patterson and C. H. Séquin. RISC I: A reduced instruction set VLSI
computer. In The Proceedings of the 8th International Symposium on Computer
Architecture, pages 443–457, May 1981.
[137] N. C. Paver. The Design and Implementation of an Asynchronous Micro-
processor. PhD thesis, Department of Computer Science, University of
Manchester, UK., 1994.
[138] G. Radin. The 801 minicomputer. In The Proceedings of the Symposium
on Architectural Support for Programming Languages and Operating Systems,
pages 39–47, March 1982.
[139] B. R. Rau and J. A. Fisher. Instruction-Level Parallel processing: History,
overview and perspective. The Journal of Supercomputing, 7(1/2):9–50,
May 1993.
[140] M. Rem. Concurrent computations and VLSI circuits. In M. Broy, editor,
Control Flow and Data Flow: Concepts of Distributed Programming, pages
399–437. Springer-Verlag, 1986.
[141] M. Rem. Trace theory and systolic computations. In J. W. deBakker, A. J.
Nijman, and P. C. Treleaven, editors, PARLE: Parallel Architectures and
Languages Europe, volume 1, pages 14–34. Springer-Verlag, 1987.
[142] W. F. Richardson. Architectural Considerations in a Self-Timed Processor
Design. PhD thesis, Department of Computer Science, University of Utah,
UT, USA., February 1996. CSTD-96-001.
[143] W. F. Richardson and E. L. Brunvand. The NSR processor prototype. Tech-
nical Report UUCS-92-029, Department of Computer Science, University
of Utah, USA., 1992.
[144] M. Roncken. Partial scan test for asynchronous circuits illustrated on
DCC error corrector. In The Proceedings of the International Symposium on
Advanced Research on Asynchronous Circuits and Systems (ASYNC’94), Salt
Lake City, Utah, USA, March 1994. IEEE Computer Society Press.
[145] M. Roncken and R. W. Saeijs. Linear test times for delay-insensitive cir-
cuits: A compilation strategy. In S. Furber and M. Edwards, editors, The
Proceedings of the IFIP Working Conference on Asynchronous Design Method-
ologies, Manchester, UK, March 1993. Elsevier Science Publishers.
[146] O. Salomon and H. Klar. Self-timed fully pipelined multipliers. In
S. Furber and M. Edwards, editors, The Proceedings of the IFIP Working
Conference on Asynchronous Design Methodologies, Manchester, UK, March
1993. Elsevier Science Publishers.
[147] K. C. Saraswat and F. Mohammadi. Effect of scaling of interconnections
on the time delay of VLSI circuits. IEEE Journal of Solid-State Circuits,
SC-17(2):275–280, April 1982.
[148] V. Sarkar. Partitioning and Scheduling Parallel Programs for Execution on
Multiprocessors. The MIT Press, 1989.
[149] F. Schalij. The Tangram manual. Technical Report UR 008/93, Philips
Research Labs Eindhoven, 1993.
[150] C. L. Seitz. System Timing. In C. Mead and L. Conway, editors, Introduc-
tion to VLSI Systems, chapter 7, pages 218–262. Addison-Wesley, 1980.
[151] R. Sethi. Scheduling graphs on two processors. SIAM Journal on Comput-
ing, 5(1):73–82, 1976.
[152] A. Severson and B. Nelson. Throughput in a Counterflow pipeline pro-
cessor. ACM Computer Architecture News, 23(1):5–12, March 1995.
[153] J. E. Smith. A study of branch prediction strategies. In The Proceedings of
the 8th International Symposium on Computer Architecture, pages 135–148,
May 1981.
[154] J. E. Smith and A. R. Pleszkun. Implementing precise interrupts in
pipelined processors. IEEE Transactions on Computers, 37(5):562–573, May
1988.
[155] S. P. Song, M. Denman, and J. Chang. The PowerPC 604 RISC micropro-
cessor. IEEE Micro, 14(5):8–17, October 1994.
[156] J. Sparsø, C. D. Nielsen, L. S. Nielsen, and J. Staunstrup. Design of self-
timed multipliers: A comparison. In S. Furber and M. Edwards, editors,
The Proceedings of the IFIP Working Conference on Asynchronous Design Meth-
odologies, Manchester, UK, March 1993. Elsevier Science Publishers.
[157] R. F. Sproull, I. E. Sutherland, and C. E. Molnar. Counterflow pipeline pro-
cessor architecture. Technical Report SMLI TR-94-25, Sun Microsystems
Laboratories Inc., April 1994.
[158] I. E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–
738, June 1989.
[159] S. M. Sze. VLSI Technology. McGraw Hill, 1983.
[160] Y. K. Tan and Y. C. Yim. Self-timed system design technique. Electronics
Letters, 26(5):284–286, 1990.
[161] G. Theodoropoulos. Strategies for the Modelling and Simulation of Asyn-
chronous Computer Architectures. PhD thesis, Department of Computer
Science, University of Manchester, UK., September 1995.
[162] J. E. Thornton. Design of a Computer: The Control Data 6600. Scott Foresman
and Company, 1970.
[163] R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic
units. IBM Journal of Research and Development, 11(1):25–33, January 1967.
[164] I. P. Tzonos. A VLSI library of asynchronous cells. Master’s thesis, De-
partment of Computer Science, University of Edinburgh, Edinburgh, Scot-
land, UK., September 1995.
[165] J. T. Udding. Classification and Composition of Delay-Insensitive Circuits.
PhD thesis, Eindhoven University of Technology, September 1984.
[166] J. T. Udding. A formal model for defining and classifying delay-
insensitive circuits and systems. Distributed Computing, 1:197–204, 1986.
[167] N. Ullah and M. Holle. The MC88110 implementation of precise ex-
ceptions in a superscalar architecture. ACM Computer Architecture News,
21(1):15–25, March 1993.
[168] J. Ullman. NP-complete scheduling problems. Journal of Computer and
System Sciences, 10:384–393, 1975.
[169] S. H. Unger. Asynchronous Sequential Switching Circuits. Wiley-
Interscience, John Wiley & Sons, Inc., New York, 1969.
[170] K. van Berkel, R. Burgess, J. Kessels, A. Peeters, M. Roncken, and F. Schalij.
A fully asynchronous low-power error corrector for the DCC player. IEEE
Journal of Solid-State Circuits, 29(6):1429–1439, 1994.
[171] K. van Berkel, J. Kessels, M. Roncken, R. W. Saeijs, and F. Schalij. The
VLSI-programming language Tangram and its translation into handshake
circuits. In The Proceedings of the European Design Automation Conference,
pages 384–389, 1991.
[172] J. L. A. van de Snepscheut. Trace Theory and VLSI design, volume 200 of
Lecture Notes in Computer Science. Springer-Verlag, 1985.
[173] R. van de Wiel. High-level test evaluation of asynchronous circuits. In
M. B. Josephs, editor, The Proceedings of the 2nd Working Conference on
Asynchronous Design Methodologies, pages 63–71, London, UK, May 1995.
IEEE Computer Society Press.
[174] T. Verhoeff. Delay-insensitive codes - An overview. Distributed Computing,
3:1–8, 1988.
[175] N. Weste and K. Eshraghian. Principles of CMOS VLSI Design. Addison-
Wesley, Reading, Mass., 1985.
[176] W. A. Wulf. Compilers and computer architecture. IEEE Computer, 14:41–
47, July 1981.
[177] M. Yoeli. Structured design of the control parts of self-timed VLSI systems.
In O. N. Garcia and X. Zhang, editors, The Proceedings of 2nd International
Conference on Computer and Applications, pages 839–841. IEEE Computer
Society Press, 1987.
[178] M. Yoeli. Net based synthesis of delay-insensitive circuits. Technical
Report 609, Department of Computer Science, Technion - Israel Institute
of Technology, Haifa, Israel, February 1990.
[179] M.-L. Yu and P. A. Subrahmanyam. Hazard-free asynchronous circuit
synthesis. In S. Furber and M. Edwards, editors, The Proceedings of the IFIP
Working Conference on Asynchronous Design Methodologies, Manchester, UK,
March 1993. Elsevier Science Publishers.
[180] J. Yuan and C. Svensson. High-speed CMOS circuit techniques. IEEE
Journal of Solid-State Circuits, SC-24(1):62–70, February 1989.
