Interconnection of transputer links using a multiple bus configuration. by Adda, M.
ProQuest 10130240
Published by ProQuest LLC (2017). Copyright of the Dissertation is held by the Author.
INTERCONNECTION OF TRANSPUTER LINKS 
USING A MULTIPLE BUS CONFIGURATION
By 
M. Adda
A thesis
submitted to the Department of Electrical and Electronic Engineering
University of Surrey 
for the degree of Doctor of Philosophy
November 1992.
ABSTRACT
The design of an efficient distributed memory transputer network is a difficult issue.
In order to construct highly concurrent systems with a large number of processors
successfully, their interconnection networks have to be as universal as possible and provide
adequate connectivity for most applications.
To satisfy these requirements, these communication networks should possess ease of
expansion, high bandwidth, low latency, deadlock freedom and an acceptable degree
of reliability.
This thesis presents a new type of interconnected network based on a multiple bus 
organisation and routing resources (gateways) that offers significant improvements in 
bandwidth over previously accepted bus-oriented topologies (i.e. multi-bus and spanning 
bus) and in latency over most directly-connected transputer networks (e.g. ring and mesh 
configurations). It is also easier to expand than hypercube-like structures.
Relatively high bandwidth, low latency, good processor scalability, semi-adaptive
routing and deadlock freedom are the fundamental features by which our proposal
contributes to the design of an efficient interconnection network for transputers. They have been
achieved by separating the routing resources (gateways) from the computational resources (processors).
Although this topology can be exploited by general purpose parallel processors based
on shared or distributed memory techniques, transputers and an OCCAM-like programming
methodology have been considered as a case study in this project, since this is the primary
objective of the thesis.
Simulation models and analytical results, mainly based on gap equations we have
developed, exhibit conclusively the superior performance of our system compared to most
transputer topologies. The detail of this architecture is presented in a design form that
embodies many of the concepts discussed and proposed throughout the course of this research.
As it is important to address uniquely each processor within the network, a dynamic address
assignment algorithm that preserves the features of the proposed architecture is also
suggested.
Acknowledgements
I would like to express my gratitude to my supervisors Dr. Roger M.A. Peel and Dr. 
David F. Gray for their continual guidance and support throughout the course of my research 
and many helpful comments during the writing of this thesis.
I also gratefully acknowledge Dr. C. Smythe and Dr. P. Sweeney for their stimulating
and fruitful discussions, especially on network simulation and queuing theory.
I should also like to thank Dr. J.H.B. Deane and Dr. A.V. Shafarenko for reading my
thesis and making some useful remarks.
My thanks go to many of my friends, particularly to Dr. Y. Negadi for his constructive
advice, and to Dr. A. Giti, Miss D. Tabtab, Mrs P. Mukerjee and Mrs M. Harris for their
moral support and help.
Finally, I would like to express my deepest gratitude to my family for their love and 
patience, especially to my mother Mina, my father Allel, my wife Hassiba and my daughter 
Sarah.
I wish to thank the Institut National d'Electricité et d'Electronique (Alger) for their
financial support and the University of Surrey for paying my expenses for the conferences I 
attended.
CONTENTS
Chapter 1: Introduction
1.1 Background
1.2 Aims of the Project
1.3 The Approach to the Problem
1.4 Main Results
1.5 Structure of the Thesis
Chapter 2: A Proposal for a Multiple Bus Structure
2.1 Introduction
2.2 Construction of our n-dimensional Structure
2.3 Related Architectures
2.4 The Achieved Bandwidth and Processor Scalability of our Structure
2.5 A Proposal for a Semi-adaptive Routing Algorithm
2.6 A Simple and Effective Method for Deadlock Prevention in the 2D Structure
2.7 Conclusion
Chapter 3: On Adopting the CSMA/CD Arbitration Protocol.
3.1 Introduction
3.2 Selecting the CSMA/CD Communication Protocol
3.2.1 Token Ring
3.2.2 Token Bus
3.2.3 Slotted Ring
3.2.4 CSMA/CD Protocol
3.2.5 Comparison
3.2.6 Performance of CSMA/CD Protocol
3.3 Towards the Acknowledged CSMA/CD Protocol
3.3.1 The Operation of the Acknowledged CSMA/CD Protocol
3.3.2 An Approximate Mean Packet Delay Expression of the Acknowledged CSMA/CD Protocol
3.3.3 Simulation Model
3.4 Opting for the Prioritised CSMA/CD Protocol
3.4.1 Description of the two Random Protocols
3.4.2 Description of the two Hybrid Protocols
3.4.3 Simulation Results
3.5 Combining the Acknowledged and the Prioritised CSMA/CD Protocol
3.6 Conclusion
Chapter 4: Simulating and Modelling the 2-Dimensional Structure
4.1 Introduction
4.2 Network Construction
4.2.1 Routing Packets in an Unreliable Environment
4.2.2 Congestion Avoidance in the Network
4.3 Expansion of the 2D Structure
4.3.1 Transforming a Single Bus into a 2D Structure
4.3.2 Broadening the 2D Structure by Varying its Physical Parameters (G,P)
4.4 The Queuing Model of the 2D Structure
4.4.1 The Balanced Network
4.4.2 The Stability Region of the Network
4.4.3 The Optimal Network
4.5 The Number of Transputers Accommodated by our Balanced 2D Network
4.6 Conclusion
Chapter 5: Network Interfacing: Transputer as a Computing Node
5.1 Introduction
5.2 A Transputer Interface Proposal
5.3 Development of the Gap Equations
5.4 Finalising the Interface Model
5.5 Overlapping Part of the Software Protocol
5.5.1 Transmitter Side
5.5.2 Receiver Side
5.6 An Optimal Pipelined Transmission
5.6.1 Splitting Messages into Smaller Sub-packets
5.6.2 Comparing the Obtained Communication Latency to that of a Linear Transputer Array
5.7 Performance of the Network Under Event Parallelism Tasks
5.7.1 Speed up Factor and Efficiency
5.7.2 Case Study of a Processor Farm
5.7.3 Simulation Results
5.8 Summary
Chapter 6: Stations and Gateways Design Proposal
6.1 Introduction
6.2 Hardware Level
6.2.1 Overview of the Hardware Level Blocks
6.2.2 Synchronisation Field
6.2.3 Buffer Management at the SIMP Hardware Level
6.2.4 Buffer Management at the Gateway Level
6.2.5 Generating Acknowledgements
6.2.6 Flow Control at the Network Level
6.3 Software Level
6.3.1 Sequence Variables
6.3.2 Commands Packets
6.3.3 Descriptor Field
6.3.4 Providing Physical Addresses for Stations
6.4 User Level
6.5 OCCAM Simulation of the SIMP Protocols
6.6 Summary
Chapter 7: Conclusion
7.1 Summary of the Thesis and Achievements
7.2 Future Orientations
References
Appendices
A. The Acknowledged CSMA/CD Mean Packet Delay
B. Scheduling Function of Demand Access Protocols
C. Applying the Gap Equations
D. Initial Conditions for Transparent Transmission
E. Simulation Programs
E.1 Single Bus Hybrid CSMA/CD Communication Model
E.2 2D Network and its Communication Protocols
E.3 Processor Farming in the 2D Network
E.4 Physical Address Assignment Algorithm
E.5 OCCAM Simulator
Chapter 1
INTRODUCTION
This thesis presents the design and development of a bus-based message passing
interconnection scheme which can be used to join a large number of INMOS transputers via
their serial links [Inm 89]. The main feature of this architecture is that it avoids the
communication overhead which occurs in systems where processing nodes relay communications
to their neighbours. The essential outcome of the research is the production of a flexible and
scalable machine with an adequate bandwidth whose attractive characteristics are its
simplicity and low latency for large configurations.
1.1 Background
The primary challenge facing computer architects is to find ways for interconnecting
processors so that they cooperate effectively to achieve a common goal. This problem is
ever pressing given that the developments in Very Large Scale Integrated (VLSI), Ultra Large
Scale Integrated (ULSI) and Wafer Scale Integrated (WSI) technologies [Lea 87] have
achieved a cheap realisation of the most complex processing elements and changed the basic
framework in which computer systems are designed and constructed [Inm 89, Pas 88].
The two main classes of architecture which dominate the parallel processing market
are shared and distributed memory systems. The shared memory architectures are
characterised by a shared address space between the processing elements. This is usually provided
by sharing a global memory [Ful 78, Got 83] or, alternatively, by combining the local
memories of each processing unit [Sto 78, Cha 79]. Distributed memory architectures, by
contrast, involve no shared memories. Each processor in the system has its own local storage
and communicates with other processing elements through message passing [Har 87, Pas 88].
Even though the shared memory scheme has limited performance scalability [Ott 89],
it is appropriate for small granularity parallel applications, provided that not many processing
elements share the communication medium (e.g. a shared bus) and the global memory. The
distributed memory approach, on the other hand, performs well provided that the system
operates under low communication loads or on coarse grained problems [Kal 88]. One of
the dominant problems faced by these distributed memory structures is how to decompose
the data to balance the load between the processing elements [Ott 89]. Load
imbalance ultimately reduces the overall system's efficiency. Despite these drawbacks,
message passing systems are achieving respectable performance over a wide range of
applications (e.g. physics, engineering, image processing, graphical imaging and
simulations).
Applications containing inherent parallelism may be partitioned into a number of
concurrent processes which are joined together by channels of communication, thus forming
process-structured graphs or logical networks. The implementation of such applications on
distributed memory (or message passing) architectures requires processors on which these
processes are executed and interconnections along which the communications take place.
The major problem of these distributed memory schemes is how to map a particular logical
network onto the hardware provided - i.e. the physical network, in order to balance the load
on each processor and to minimise the amount of traffic on the inter-processor connections
[Ste 91].
In the field of transputer networks, this problem has been tackled with different message 
passing approaches, to realise acceptable topologies for most applications.
Ideally, it would be desirable to allocate each process to exactly one transputer and to
place each logical channel onto one physical inter-transputer link. Of course, this would be
possible only if the transputer was provided with an infinitely large number of links and, in
this case, a fully connected static topology would satisfy all our demands. However, there
are physical limits to the number of links available to each transputer and attendant problems
such as electrical noise on signals together with physical wiring difficulties between
geographically remote processors in a large network which would make this strategy impractical.
In reality, the problem is to map the logical network onto a physical network of
transputers. In some cases, the logical network fits the optimum physical one exactly (e.g.
a matrix multiplication problem maps directly onto a mesh topology). If for any given
application there exists a matching physical network of transputers, then a reconfigurable
architecture acting on physical links is sufficient to deal with all problems. This is not always
the case though, since some reconfigurable machines do not permit all possible configurations
to be achieved. Nor is this surprising, given that in machines with many processors the
interconnection structure needed would be very large. For instance, although De Groot et al.
designed their systolic processor with a reconfigurable interconnection network of transputers
based on memory mapped 4 x 8 cross-bar switches connected in a 6-dimensional hypercube
[DeG 87], only a limited range of topologies can directly be arranged - e.g. a 6-level binary
tree, an 8 x 8 square mesh, a 10 x 10 triangular mesh and a 64-node shuffle exchange
configuration. Another example of a reconfigurable architecture is PARSIFAL [Kno
87a, Jin 87], which is constructed from a set of INMOS C004 cross-bar switches [Inm 87a].
The major limit of this system is its dependence on the Hamiltonian cycle - i.e. a path which
visits all the vertices of the logical interconnection graph once. It is well known that there
are many logical networks that do not possess Hamiltonian cycles. Besides, it is a
computationally difficult task to determine whether an arbitrary network has one or more
Hamiltonian cycles. To provide a universal reconfigurable topology, the Super-Node architecture
[Nic 88, Hey 88] uses a two sided cross-bar switch. Conceptually, by applying the Eulerian
cycle - i.e. a path which traverses all arcs of a graph once, it is always possible to sub-divide
a logical network of degree four into two separate ones. The resulting sub-networks contain
the same number of processes having only two communication links each and, therefore,
they can both be directly mapped into each side of the reconfigurable network. Despite the
wide range of topologies that can be arranged using this scheme, all are dependent on the
degree of their nodes being equal to four (i.e. each transputer has four links).
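The contrast between the two kinds of cycle can be made concrete with a small sketch. It is illustrative only and is not taken from the thesis: the function names and the example graph are assumptions. It uses the standard result that a connected undirected graph has an Eulerian cycle exactly when every vertex has even degree, which is cheap to test, whereas the Hamiltonian test shown here simply tries every vertex ordering and so grows factorially with the number of vertices.

```python
from itertools import permutations


def has_eulerian_cycle(adj):
    """True if a closed walk uses every edge exactly once.

    adj maps each vertex to the set of its neighbours (simple undirected graph).
    """
    with_edges = [v for v in adj if adj[v]]
    if not with_edges:
        return False
    # Condition 1: every vertex has even degree.
    if any(len(adj[v]) % 2 for v in adj):
        return False
    # Condition 2: all vertices that carry edges are mutually reachable.
    seen, stack = {with_edges[0]}, [with_edges[0]]
    while stack:
        for w in adj[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return all(v in seen for v in with_edges)


def has_hamiltonian_cycle(adj):
    """Brute-force search for a cycle visiting every vertex exactly once."""
    vertices = list(adj)
    if len(vertices) < 3:
        return False
    first, rest = vertices[0], vertices[1:]
    for order in permutations(rest):
        path = (first,) + order + (first,)
        if all(b in adj[a] for a, b in zip(path, path[1:])):
            return True
    return False


# Hypothetical example: two triangles sharing vertex 2 (a "bowtie").
bowtie = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3, 4}, 3: {2, 4}, 4: {2, 3}}
print(has_eulerian_cycle(bowtie), has_hamiltonian_cycle(bowtie))
```

The bowtie graph has an Eulerian cycle but no Hamiltonian one, since any cycle through all five vertices would have to pass through the shared vertex twice.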
To overcom e the drawbacks of these statically reconfigurable architectures and to 
provide high connectivity between processors of the physical network, these switching 
systems can in general be operated in two other modes: quasi-static and dynamically 
reconfigurable modes.
In quasi-static operation [Fay 87], the switch is reconfigured at pre-determined
synchronisation points and therefore, at run time, various topologies can be arranged. This is
often possible since separate temporal phases of a computation use different channels.
Unfortunately, there are limited classes of application, such as the solution of differential
equations for fluid dynamics modelling and some image processing, which exhibit this
behaviour.
The other alternative is to provide dynamically reconfigurable architectures whose
processors communicate by requesting a central controller to establish their communication
paths at run time. This is commonly called the circuit switching technique. Jones [Jon 88]
implemented such a system using the PARSIFAL switching mechanism. In his approach,
the monitoring bus plays the major role in passing commands between the working processors
and the controller which changes the setting of the switches to create the communication
paths between any communicating processors. Another dynamically reconfigurable
architecture is based on a multi-layered approach [Tud 90]. In this configuration, three layers are
implemented. The working transputer layer consists of 32 processors arranged in groups of
eight and interconnected to four independent INMOS C004 cross-bar switches. The control
layer comprises control transputers which execute the global switching control primitives
and the system processor layer performs the overall system functions. Besides having a very
poor expansion capability, this strategy has a complex and slow control mechanism. It is
noted that in these dynamic circuit switching systems the improved connectivity is obtained
at the expense of poor hardware scalability and its related problems (bottlenecks, long
response times, potential livelock and inefficient use of the available bandwidth). In addition,
the cost and latency increase linearly with the number of cascaded switches.
A totally different approach employs a fixed physical network to support each
application and provides an additional routing to fulfil the communication requirement on a
per-message basis - a packet switching network.
One way to transport messages in these packet switching networks is to use software
on each processor to route messages; this runs in parallel with the processor's main
computational activities. There are several works which discuss this principle using specific static
topologies [Arr 90, Car 89, Cos 87]. The communication in such physically distributed
systems is costly and slow, because the relay code consumes a considerable memory
bandwidth in each intermediate node, occupies its processor cycles, introduces an extra
communication overhead and limits the point-to-point bandwidth. Deadlock may be avoided in
these schemes depending on the structure of their topologies. Knowles and Kanchev [Kno
89], for example, described a method to prevent deadlock within networks exhibiting
Hamiltonian paths. Their proposal consists of restricting the flow of packets to the ring
formed by the Hamiltonian cycle, when the traffic increases within the network. Besides the
inefficiency in adopting the software router, this restriction further reduces the performance
of the network to that of a simple ring.
A hardware message router is another transport mechanism which can carry data
between non-adjacent communicating host processors. In contrast to the software router,
the host memory bandwidth is totally dedicated to processing the application and to injecting
or absorbing messages from the routers. If this method is to be successfully implemented,
these routers have to pass the information as fast as possible using addresses provided within
each data packet.
The idea of using these transport mechanisms has been explored in some distributed
memory architectures, examples of which are:
* The ring architecture, used in the Meiko Computing Surface computer [Wei 89], is
based on dedicated T212 transputer routers separated from the host transputers. The
code involved to route messages and the store-and-forward routing mechanism [Mer
80] itself induce high latencies. Deadlock is avoided in this architecture by limiting
the number of packets circulating in the ring.
* The Transtech high performance computer architecture [Bra 90] provides wormhole
based communication interfaces which link together processors into a ring. Since
these interfaces use a very high asynchronous transmission rate and adopt the
wormhole routing technique [Dal 87], the system provides very low latency (10ns)
between adjacent nodes. It is however not clear how deadlock is avoided when many
rings are joined together.
* The new generation of H1 transputers [Inm 91, Ros 91] are supported by an efficient
IMS C104 routing chip that uses a wormhole mechanism to fulfil all non-adjacent
inter-processor communications. When a given network is built from such elements
a high bandwidth and low latency will be ensured. Obviously, the issue of deadlock
will be treated according to the shape of the configurations involved.
* Another attractive approach is the Mad-Postman MP1 [Mil 91]. A synchronous one
bit hardware message router and four virtual network planes are the key elements of
this architecture. Each plane is a 2D-mesh which provides deadlock-free
interconnections under an adaptive routing mechanism. However, an additional
communication overhead could result whenever the truncated headers of the packets are directed
in the wrong direction.
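The latency difference between store-and-forward and wormhole routing mentioned in these examples can be illustrated with a first-order textbook model. The sketch below is an assumption-laden illustration rather than a model from the thesis: it ignores routing decision time, contention and propagation delay, and the packet size, header size and link rate are hypothetical values chosen only for the example.

```python
def store_and_forward_latency(hops, packet_bits, link_rate_bps):
    """Each intermediate node receives the whole packet before forwarding it."""
    return hops * packet_bits / link_rate_bps


def cut_through_latency(hops, packet_bits, header_bits, link_rate_bps):
    """Only the header is delayed at each hop; the body streams behind it.

    This unloaded-network expression applies equally to wormhole and
    virtual cut-through routing.
    """
    return (packet_bits + (hops - 1) * header_bits) / link_rate_bps


# Hypothetical figures: 1024-bit packet, 16-bit header, 10 Mbit/s links, 8 hops.
saf = store_and_forward_latency(8, 1024, 10_000_000)
ct = cut_through_latency(8, 1024, 16, 10_000_000)
print(f"store-and-forward: {saf * 1e6:.1f} us, cut-through: {ct * 1e6:.1f} us")
```

Under these assumptions the store-and-forward latency grows in proportion to the hop count, while the cut-through latency is dominated by a single packet transmission time.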
1.2 Aims of the Project
The major aim of our project is to design an efficient interconnection network which
will link together a large number of transputers and provide a dynamic architecture which
will accommodate a wide range of regular and irregular configurations.
In general, as discussed in the previous section, the design criteria comprise a number
of objectives and constraints to which the network must conform.
Our primary objective is to target applications which demand high connectivity between
processors for which the number of links is a limit, as well as those requiring a low
communication latency and those exhibiting frequent and irregular changes in their
communication patterns. There is a wide range of applications having these features - e.g. artificial
intelligence problems, graphical imaging (by the use of ray-tracing techniques), biological
problems (DNA sequence comparison), computationally intensive VLSI simulations and
Monte Carlo simulations applied to some problems in physics.
Of course, constraining applications which require very high throughput, such as data
parallel tasks, are outside the scope of our architecture. In general, these applications are
better served by point-to-point topologies (e.g. Mad-Postman, PARSIFAL etc.), although
the very low latency design presented in this thesis may have some attractive properties.
In order to justify that our transputer interconnection network meets these objectives,
a number of fundamental issues are explored in this thesis. Using packet switching principles
based on hardware routers, we investigate the means of providing a very low latency, high
bandwidth, expandable, reliable and deadlock-free system.
1.3 The Approach to the Problem
Our strategy relies basically on the design of an improved packet switching scheme
which must deliver messages to their destinations with a minimum communication latency.
As a prerequisite to implementing such a system, fast hardware routers that support transputers
within the network with efficient communication protocols are to be proposed. The
development of this architecture is motivated by the serial bus organisation.
The serial bus was initially selected for its simplicity, its variety of implementations
(e.g. copper, optical fibres or infrared transmission), its reduction in wire density (i.e. a single
bus carries data, control and acknowledgement) and its compatibility in connecting
transputers via their serial links. Because of its limited bandwidth, only a small number of
transputers can be interconnected by a single serial bus and hence it is realistic to restrict its
length to 1.5 metres, in this application. In addition, a capacity of 40-50 Mbits/sec is
obtainable, since shorter CSMA/CD buses tolerate higher transmission rates (Figure 1.1a
shows a one-dimensional CSMA/CD bus).
Although this initial system achieves a low communication latency for any number of
processors, provided that only one is requesting a communication at any time, it is inadequate
to meet all our objectives. In particular, when more transputers are added to the network
with each one demanding a frequent use of the bus, their sharing of the bus bandwidth becomes
a serious obstacle and, hence, limits the range of applications handled by the network. The
multiple bus strategy transcends this inadequacy by reproducing many similar buses which
hold a limited number of processors. These are then interconnected through other orthogonal
branches by means of routing gateways (Figure 1.1b shows a 2-dimensional structure). This
expansion therefore increases the total bandwidth and the processor scalability of the system.
By processor scalability we refer to the number of processors that may be added to the system
while conserving their required bandwidth.
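A rough back-of-envelope calculation shows why a single bus saturates so quickly. The sketch below is only an upper-bound estimate under stated assumptions, not a result from the thesis: it combines the 40-50 Mbits/sec bus capacity mentioned above with the 0.8 Mbytes/sec per-link demand quoted in section 1.4, takes one Mbyte as 10^6 bytes, and ignores CSMA/CD contention, packet headers, acknowledgements and gateway traffic. The function name is hypothetical.

```python
def links_supported(bus_bits_per_s, link_bytes_per_s):
    """Upper bound on the number of links one bus can feed at full rate.

    Assumes the whole raw bus capacity is available as payload bandwidth.
    """
    return bus_bits_per_s // (link_bytes_per_s * 8)


LINK_DEMAND = 800_000  # bytes/sec per transputer link (figure from section 1.4)

for bus_rate in (40_000_000, 50_000_000):  # raw bus capacity in bits/sec
    print(bus_rate, links_supported(bus_rate, LINK_DEMAND))
```

Even on these optimistic assumptions a single bus can sustain only a handful of fully loaded links, which is the motivation for replicating buses and joining them through gateways.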
Even though this new arrangement provides the necessary bandwidth for the type of 
applications mentioned in the objectives, it involves further considerations in order to reach 
all the prescribed targets.
For instance, to reduce the latency of this multiple bus proposal to an acceptable
minimum, a semi-adaptive algorithm which makes good use of the available bandwidth by
selecting those branches with the lowest traffic load is developed. In addition, it is essential
to adopt efficient routing mechanisms such as the virtual cut-through method which permits
packets to flow fast through the network.
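The selection step at the heart of such a semi-adaptive scheme can be sketched in a few lines. This is only an illustration of choosing the least loaded branch among those that lead to the destination; it is not the routing algorithm developed in chapter 2, and the branch names, the load measure and the tie-breaking rule are all assumptions.

```python
def pick_branch(candidates, load):
    """Return the candidate branch with the lowest recorded traffic load.

    candidates: branches that can carry the packet towards its destination.
    load: mapping from branch name to a load estimate (e.g. queued packets).
    Ties go to the branch listed first in candidates.
    """
    return min(candidates, key=lambda branch: load[branch])


# Hypothetical snapshot of three routing branches.
load = {"r0": 5, "r1": 2, "r2": 2}
print(pick_branch(["r0", "r1", "r2"], load))
```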
Regarding deadlock in this new multiple bus configuration, the
approach pursued is systematically centred around the prevention of un-granted request
cycles. In the 2-dimensional structure, this is realised by allowing processors (or stations)
to occupy only specific buses, which we will later call injection branches, so that the
flow of packets does not generate any closed paths.
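The "no closed paths" property can be checked mechanically once the possible waits between resources are written down as a directed graph. The sketch below is a generic depth-first cycle test, offered as an illustration of the idea rather than as the deadlock argument used in the thesis; the resource names are hypothetical and do not describe the actual 2D structure.

```python
def has_closed_path(waits_for):
    """Return True if the directed wait-for graph contains a cycle.

    waits_for maps a resource to the resources a packet holding it may
    have to wait for next.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for nxt in waits_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:  # reached a node already on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))


# Hypothetical examples: a chain of waits is safe, a ring of waits is not.
chain = {"a": ["b"], "b": ["c"], "c": []}
ring = {"a": ["b"], "b": ["c"], "c": ["a"]}
print(has_closed_path(chain), has_closed_path(ring))
```

A network whose wait-for graph passes this test cannot contain a cycle of un-granted requests.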
Obviously, the high transmission rate and the nature of the bus are totally different from 
those o f a transputer link. So to maintain full compatibility between the serial bus and a 
transputer, an interface has to be designed.
To complete the strategies discussed above, one needs some tools to verify their 
self-consistency and to measure their performance. The tools we shall utilise involve 
simulation, modelling, formal analysis and informal OCCAM emulations.
Figure 1.1 A single bus holding a set of stations (a) has been partitioned into a
2-dimensional multiple bus configuration joined by gateways (b).
1.4 Main Results
Our new multiple bus interconnection network contributes to the implementation of an
efficient parallel machine in various ways.
We have been able to model our proposal by a mathematical representation. The
resulting topology, which is based on gateways incorporated between buses, was shown to
be more expandable and to exhibit higher processor scalability and bandwidth than most
other bus-oriented topologies.
Through several simulation models and analyses, we have demonstrated that the
particular 2D-network coupled with an efficient CSMA/CD arbitration protocol has many
advantages. Firstly, its throughput can be maximised by choosing the number of injection
branches to be twice the number of routing branches. Secondly, a balanced 2D network
constructed from 512 gateways can accommodate up to 200 T414 transputers - i.e. it can
supply each of their links with their required maximum bidirectional bandwidth of
0.8 Mbytes/sec. This is an important achievement, and it is discussed further in section 4.5.
It shows that our routing structure can support a large number of processors; indeed far more
processors than other bus-oriented schemes. Thirdly, comparative analyses with a linear
array of transputers, which use store-and-forward and pipelined transmissions through
software routers, have shown that our system operates with much smaller communication
latency. In addition, this latency is independent of the location of each processor within our
multiple bus network. Finally, we have compared the performance of the processor farm
application obtained by Green [Gre 88] with that achieved using our 1D network. Green
connected 15 processors in a tree and used a software harness to pass messages between them,
consuming 9.4% of their processing effort. In our 1D network, simulations show that for
the same number of processors, only 0.3% degradation would be suffered. This degradation
does not increase significantly even with far more processors - see section 5.7.
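To put the two overhead figures side by side, they can be converted into an equivalent number of fully productive processors. This is a simple illustrative calculation and not a result stated in the thesis: it assumes the quoted percentage is the fraction of each processor's effort lost to communication and that no other losses occur. The function name is hypothetical.

```python
def effective_processors(n, overhead_fraction):
    """Number of fully productive processors that n processors are worth
    when each loses overhead_fraction of its effort to message handling."""
    return n * (1.0 - overhead_fraction)


tree = effective_processors(15, 0.094)  # software harness, figure from [Gre 88]
bus = effective_processors(15, 0.003)   # 1D bus network, simulated figure
print(f"tree: {tree:.2f}, 1D bus: {bus:.2f}")
```

On these assumptions the 15-processor tree is worth about 13.6 productive processors, against just under 15 for the 1D bus network.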
1.5 Structure of the Thesis
We begin, in chapter 2, by introducing and constructing the novel n-dimensional
multiple bus configuration. Its most important aspects of low latency, high bandwidth, ease
of expansion and reliability are addressed. The issue of deadlock is only considered
for the particular 2D network as it is the one we shall adopt in this project.
The objective of chapter 3 is to justify the choice of the most suitable arbitration
mechanism that efficiently and reliably resolves the contention between processors and
gateways sharing the same bus. As a result, it will be proved by simulations and analysis
that the prioritised and acknowledged CSMA/CD protocol is the desired dynamic arbitration
mechanism.
In chapter 4, the 2D multiple bus structure adopted in this project has been exhaustively 
analysed and simulated. The results conclusively demonstrate the high performance and 
features of this 2D network, and provide an initial step for its design decisions.
In chapter 5, we propose an appropriate serial interface that couples any four transputers 
to any branch of our 2D network. The gap equations developed will be used as a tool to 
enhance the communication latency of the 2D interconnection and to optimise the structure 
of this interface. It will be shown that an interface which consists of a double buffer and 
adopts the simple send-and-wait transmission scheme is convenient for our system. Using 
this interface and splitting long messages into smaller blocks will be shown to improve the 
communication latency of the network.
The design of the hardware for the system is mainly focused on the stations, in particular 
their interfaces to the transputer hosts, and on the gateway routers. This is discussed in chapter 
6. Alongside, various suggestions and proposals that embody many of the concepts acquired 
throughout the duration of this research are also illustrated. In particular, we will suggest an 
efficient algorithm which, based on the collision principle, assigns a unique address to stations 
sharing the same bus. We will also propose a dual port FIFO model which not only stores 
messages and handles error management, but also implements an effective mechanism to control 
the flow of packets within the network. Finally, we will verify the correctness of the 
whole system protocols using an OCCAM-language parallel simulation.
Chapter 7 is the conclusion which consists of a general summary of the thesis, indicates 
the most important results obtained and suggests some future directions.
A number of appendices are also included. In appendix A, an expression of the mean 
packet delay for the CSMA/CD random access protocol is given. The derivation of the 
scheduling function for virtual token passing schemes is contained in appendix B. Appendix 
C is devoted to the full applications of the gap equations related to processor interfacing in 
the 2D structure. Further derivations of the initial conditions for what we call transparent 
transmission are presented in appendix D. Finally, appendix E contains the listings of the 
simulation programs.
chapter 2
A PROPOSAL FOR A MULTIPLE BUS STRUCTURE
2.1 Introduction
In order to construct successful general-purpose highly concurrent systems with a large 
number of processors, their communication networks have to be as universal as possible and 
must provide adequate connectivity for all applications. As a pre-requisite to fulfil these 
requirements, these interconnection networks should exhibit an ease of expansion, high 
bandwidth, low latency, deadlock freedom and a tolerable degree of reliability.
A system that can be extended in a natural way, without major upheaval to its elements, 
is always preferable. Its bandwidth and latency are extremely important parameters which 
can be used to determine the optimum granularity of a parallel task. Deadlock freedom is a 
necessity if the system is to operate properly, and it has to be realised in the interconnection 
network as well as at the processor levels.
The framework we have adopted to fulfil these requirements and, hence, to provide a 
dynamic interconnection for a wide range of applications is the common bus organisation 
which is the least complex and the easiest to reconfigure. However, the constraints imposed 
by the implementation technology on the speed of the bus, and therefore on the processor 
scalability of the system, led us to consider a multiple bus approach. This enhanced configuration, 
which is based on the concepts of packet switching techniques and hardware routers, consists 
of a number of buses joined by gateways yielding a well-structured grid. The 
gateways are the switching elements of this topology (see figure 1.1). They receive, queue 
and route packets through the network with a minimum communication latency. In addition, 
each processor is connected to the bus via an efficient interface message processor which 
will be thoroughly discussed in chapter 5.
Our n-dimensional architecture is defined and constructed in section 2.2 where its ease 
of expansion is highlighted. In section 2.3, we review some systems which, from a topological 
point of view, seem to have a close resemblance to our proposal. An emphasis is subsequently 
given to the spanning bus based architectures in order to determine their capabilities in terms 
of bandwidth and, then, compare them to our multiple bus interconnection network. The 
achievable bandwidth and processor scalability of our configuration are addressed in section
2.4 and compared to the ones of the spanning bus-oriented topologies in table 2.1. The 
requirement of a low latency is partially covered in section 2.5 where we introduce an 
appropriate semi-adaptive routing algorithm that adapts to the traffic load, selects the shortest 
path for messages and maintains a tolerated degree of reliability. The aspects of reliability 
and latency will also be discussed in chapters 3 and 4 respectively. The former is maintained 
by the adoption of a strong arbitration scheme (i.e. CSMA/CD) and the latter is ensured by 
the use of an efficient routing mechanism (i.e. virtual cut-through). Finally, in section 2.6, 
we address the deadlock issue associated with our 2-dimensional multiple bus topology and 
propose a simple and effective technique to prevent it. This method is based on allocating 
processors in specific branches.
2.2 Construction of our n-dimensional Structure
Before introducing our multiple bus configuration which originates from the expansion 
of a single bus, it is instructive to examine the impact of the expansion in general on the 
bandwidth, the latency and the degree of reliability of an interconnection network for a general 
purpose parallel machine.
It is preferable to employ an interconnection network which scales favourably with the 
number of processors without a major upheaval to its elements and provides them with 
adequate bandwidth. Furthermore, although this interconnection network may be expanded, 
it is still desirable to preserve a low communication latency between communicating 
processors in the enlarged system. It is also worth having a topology with high connectivity to 
increase the potential level of fault tolerance.
However, it is well known that because these characteristics are interrelated, they cannot 
coexist in most networks. For instance, expansion of the cube connected cycle networks [Pre 
81] is not easy and the system must be completely restructured. In fully connected and alpha 
[Bhu 84] networks, the expansion is also not straightforward as the number of connections on each 
node (i.e. the degree per node or the number of communication ports) is dependent on the 
number of processors in the network. On the other hand, ring and mesh networks are obviously 
the simplest to expand. Although their degree per node is always fixed (2 and 4 respectively), the linear 
relationship between the size of the topology and its average diameter in terms of hops makes 
the communication latency unacceptably large as the network grows. Besides, the reliability 
of the ring structure is questionable. In a bidirectional ring the failure of only two nodes 
renders the entire system nonfunctional, and in a unidirectional ring no node failure can be 
tolerated.
Fortunately, it is often possible to find a compromise between these criteria in order to 
satisfy a particular range of applications. For example Benes [Ben 65] has shown that, despite 
the complexity of expanding a single cross-bar switch, it is not always too difficult to 
decompose it into smaller switching elements which are then connected in multiple stages 
to build the required networks. However, these multiple stage networks [Bat 68, Law 75, 
Lan 76] are not easily expandable, since each stage must be completely reconfigured as extra 
processors are added. In addition, there is no improvement on the reliability of the 
interconnection network as long as a unique path exists between any pair of processors. 
Furthermore, the communication latency increases with the number of stages involved.
On the other hand, the single broadcast bus has a limited expansion capability, as its 
bandwidth is not scalable with the number of processors sharing it. Various bus-oriented 
networks [Hwa 85, Ino 88, Rob 91] have been introduced to overcome this constraint - some 
will be discussed in the next section. However, to a large extent, these topologies cannot 
tolerate the addition of more processors as their major obstacle is still the lack of bandwidth.
Our multiple bus structure, as will be shown in table 2.1, goes beyond this limitation 
by including gateways as routing resources at the intersection of the orthogonal buses, as 
shown in figure 1.1b. As with any bus-oriented configuration, however, extending 
a single bus (figure 1.1a) into multiple buses increases the number 
of connections (C) of each gateway (i.e. the number of its communication ports, its valence, 
its degree or its number of links). The network diameter also grows; this can be quantified 
by the maximum number of hops (H) a message from any processor has to travel to its farthest 
destination. These quantities, for our proposal, can be evaluated as C = n and H = n + 1, 
where n is the dimension of the multiple bus structure.
We will show, in section 2.4, that this expansion is beneficial in providing an adequate 
bandwidth for all processors of the network. However, the extra cost of connections per 
gateway and the diameter enlargement over the single bus can be compensated for by adopting 
low dimensional structures such as the 2D network and using efficient routing mechanisms 
like the virtual cut-through method that transports packets through the network with minimum 
delay.
2.3 Related Architectures
In order to highlight particular features of our proposed architecture, we selectively 
discuss some structures which, even if different in construction and goals, resemble our 
suggested one, to a certain extent, from a topological point of view.
The Wisconsin Multi-cube and the Aquarius Multi-Multi architectures [Rob 91], 
developed at the Universities of Wisconsin-Madison and California at Berkeley respectively, 
consist of a symmetric grid of buses with processing nodes located at each intersection. In 
the Wisconsin approach, a memory module is connected to each column bus whereas in the 
Berkeley model each node has its own semi-private memory, much like the Spanning bus 
network described by Wittie [Wit 81].
Another related system, the Hector architecture, consists of a hierarchy of two levels 
of bit parallel rings (one global and many locals) that link smaller buses to each other [Rob 
91]. Each bus contains a set of processing modules which form a station. Many stations, 
through special link controllers, are connected to their local rings which are themselves linked 
to the global one via inter-ring interfaces. Although this architecture uses the same principle 
of bus hierarchy as our multiple bus configuration, its operation is totally different. In fact, 
all buses are joined to their local rings by point-to-point links, hence paying a heavier price 
on the non-local communication latency.
These architectures are intended to achieve highly scalable shared-memory systems. 
Topological similarities, however, are not confined solely to these approaches. In the domain 
of distributed memory systems, Jones [Jon 91] has proposed a hyper-dynamic architecture 
that is built up by a symmetric grid of switches with transputers located at their intersections. 
This structure has been introduced to utilise the strength of circuit switching while overcoming 
the weakness of packet switching based on software routers. It seems that this architecture 
requires sophisticated switches, such as the ones described by Whobrey [Who 89], and 
controllers to update the dynamic activities of the system, making it quite complicated to 
realise.
Unlike all the above mentioned parallel architectures, our proposed configuration is 
based on gateways at each bus intersection. Thus the processors do not form the network 
but are attached to it. As a consequence, their insertion or removal conserves the symmetric 
and regular properties of the topology. Some local area networks already use the concept of 
passive gateways for their sub-network connections.
For example, the Farmer loop [Far 69], which appears to be a particular case of our 
topology, consists of a number of local loops connected to a single major one. This concept 
is similar to the Hector one except that the processing modules in Farmer's loops are connected 
like a broadcast bus, rather than point-to-point, to their local rings.
Other irregular structures such as the Ethernet highways [Ben 87] are based on gateways 
or bridges which link several LAN networks. Unlike our proposal that connects processors 
over small distances in a regular way, the size of the Ethernet highways increases with the 
spatial location of their sub-networks in a rather irregular shape.
Finally, there are also some hierarchical networks operating on the principle of loops 
(rings) and utilising gateways to connect them. Each ring contains a set of stations and is 
connected to other loops via gateways to form a hierarchical network of loops which may 
grow irregularly in shape. One such example is Pierce's loops [Pie 72]. Whilst differing 
from our topology, it nonetheless operates on the same principle of linking sub-networks 
through gateways.
Overall, it can be concluded that the closest architectures to our proposal are the 
Wisconsin Multi-cube and the Aquarius Multi-Multi topologies which are based on the 
spanning bus concept. The difference centres on the construction rather than on the shape. 
Our multiple bus network is constructed from gateways which link several buses in a regular 
form with processors then added to the resulting structure. By contrast, the spanning 
bus-oriented architectures involve the computing resources or processors themselves to link 
their buses [Wit 81, Ino 88, Rob 91]. We will show in the next section that our method 
improves considerably the bandwidth per processor by scaling in proportion to the number 
of gateways and the dimension of the network.
2.4 The Achieved Bandwidth and Processor Scalability of our 
Structure
One of the main features of our proposed multiple bus topology is the separation of the 
routing (gateways) from the processing resources. The network is solely constructed from 
gateways which allows it to be developed independently from its attached processors. 
Therefore, it permits the designer the freedom to allocate processors anywhere within the 
network. In order to express this allocation strategy, we define the function
\[ f_i = \begin{cases} 1 & \text{if stations belong to the $i$th-dimension,} \\ 0 & \text{if stations do not belong to the $i$th-dimension.} \end{cases} \]
According to this definition, we can distinguish between the following two sets of 
branches (or buses) in the network: the injection or absorption branches, from which packets 
enter or leave the network through the stations, and routing branches that contain only 
gateways to route messages.
Moreover, if we denote by p_i the number of gateways along each branch in the 
ith-dimension, the total number of gateways that construct our n-dimensional multiple bus 
structure can then be calculated as the product of the numbers of gateways in all dimensions, 
G = p_1 p_2 ... p_n, given that a gateway is always present at each intersection of orthogonal buses. 
As a consequence, the allocation function f and the total number of gateways (G) are 
the fundamental parameters for providing the processor scalability of the network. In this 
context, the processor scalability refers to the number of processors (N) that may be 
accommodated by the topology whilst preserving their required bandwidth.
In order to evaluate this quantity and the total bandwidth of our proposal, we conceive 
the n-dimensional structure as being constructed by replicating the (n-1)th-dimensional 
structure p_n times and joining their corresponding gateways by p_1 p_2 ... p_{n-1} 
branches. As illustrated in figures 2.1 and 2.2 below, the 3-dimensional topology can be 
built by replicating the 2-dimensional one three times (i.e. p_3 = 3) and joining their 
corresponding gateways by nine branches (i.e. p_1 p_2 = 9).
Figure 2.1 A 2-dimensional multiple bus network having three gateways in the 1st and 
2nd dimensions (p_1 = 3 and p_2 = 3) yields nine total gateways (p_1 p_2 = 9). All 
stations are allocated in the 2nd-dimension. The diameter is equal to three 
(H = 3) and each gateway has two connections (C = 2). The couples (X_2, X_1), 
as will be discussed in section 2.5, represent the address of a gateway in the 
1st-dimension (X_1) and in the 2nd-dimension (X_2).
Figure 2.2 This is a 3-dimensional multiple bus topology. It is constructed by replicating 
the 2-dimensional one three times (p_3 = 3) yielding three planes, and then 
joining the corresponding gateways by nine branches (p_1 p_2 = 9). The total 
number of gateways is therefore equal to 27 (i.e. p_1 p_2 p_3 = 27). Stations are 
allocated in the 3rd-dimension.
Using this concept, we can express the total number of injection (I) and routing (J) 
branches, which form our n-dimensional multiple bus topology, as
\[ I = G \sum_{i=1}^{n} \frac{f_i}{p_i} \tag{2.1} \]
\[ J = G \sum_{i=1}^{n} \frac{1 - f_i}{p_i} \tag{2.2} \]
The total bandwidth achieved by our n-dimensional multiple bus network is determined 
by multiplying the capacity of each bus (R Mbits/sec) by the sum of the two equations 
above (i.e. B = R(I + J)). It is clear that this bandwidth increases with the total number of 
gateways (G) and the dimension of the network (n).
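As an illustrative sketch (not part of the original analysis), the branch counts of equations 2.1 and 2.2 and the total bandwidth B = R(I + J) can be evaluated numerically. The function names and the use of exact fractions are assumptions made for this example; the two topologies evaluated are those of figures 2.1 and 2.2.

```python
from fractions import Fraction
from math import prod


def branch_counts(p, f):
    """Return (I, J), the numbers of injection and routing branches.

    p[i] is the number of gateways along each branch of dimension i+1,
    f[i] is 1 if stations are allocated in that dimension and 0 otherwise.
    """
    G = prod(p)  # one gateway at every intersection of orthogonal buses
    I = G * sum(Fraction(fi, pi) for fi, pi in zip(f, p))      # equation 2.1
    J = G * sum(Fraction(1 - fi, pi) for fi, pi in zip(f, p))  # equation 2.2
    return int(I), int(J)


def total_bandwidth(p, f, R):
    """Total bandwidth B = R (I + J), in the same units as R."""
    I, J = branch_counts(p, f)
    return R * (I + J)


print(branch_counts([3, 3], [0, 1]))        # topology of figure 2.1
print(branch_counts([3, 3, 3], [0, 0, 1]))  # topology of figure 2.2
```

For the 3 x 3 network of figure 2.1 this gives three injection and three routing branches, which agrees with a grid of three row buses and three column buses.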
Moreover, the bandwidth available continuously to each processor (b_proc) can be 
calculated as the ratio of the bandwidth provided by the total number of injection branches 
to the total number of processors within the system. To evaluate this ratio, we assume that
all processors are allocated evenly along the nth-dimension, which means f_n = 1 and f_i = 0 for i ≠ n, 
and the number of gateways along each bus in the ith-dimension is equal - i.e. a structure 
where all p_i = p and therefore G = p^n. By using these assumptions in equation 2.1, the 
bandwidth available continuously to each processor can be expressed as
\[ b_{proc} = \frac{R\,G^{(n-1)/n}}{2N} \tag{2.3} \]
Here, R (in Mbits/sec units) is the capacity of each bus, N is the number of processors 
accommodated by the topology and G is the total number of gateways forming this 
configuration. Without the factor 2 in the denominator of equation 2.3, it would just model the 
bandwidth available continuously for transmission by each processor within an isolated 
injection branch. In order to allow for traffic entering from the routing branches via the 
gateways, consider that in the worst case, every packet generated within an injection branch 
is sent to a station on another injection branch. This increases the traffic at the destination 
branch by the factor of 2 shown, assuming an even distribution of these received messages.
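For completeness, the step from equation 2.1 to the per-processor bandwidth can be written out under the two assumptions stated above (stations only in the nth-dimension, and the same number p of gateways along every bus). This is a reconstruction of the intermediate algebra, not a quotation from the original derivation:

```latex
\[
I = G \sum_{i=1}^{n} \frac{f_i}{p_i} = \frac{G}{p} = p^{\,n-1} = G^{(n-1)/n},
\qquad
b_{proc} = \frac{R\,I}{2N} = \frac{R\,G^{(n-1)/n}}{2N}.
\]
```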
It will be shown in chapter 4 that this processor bandwidth can be improved further if 
a balanced 2-dimensional structure is constructed that distributes the traffic load equally 
among all its branches (i.e. routing and injection branches) within the network. Improvements 
from higher dimensional structures such as the 3-dimensional one are part of our future work.
To illustrate the concept behind these expressions, which demonstrate the high 
performance of our structure, we present the total and processor bandwidths of our network for 
various dimensions in table 2.1. These figures are compared to the ones associated with the 
spanning bus topology. The spanning bus is chosen for our comparison since it is similar in 
shape to our proposal and is among the best and most popular bus-oriented topologies. It 
also has a high processor scalability and a similar ease of expansion. Its total and processor 
bandwidths are reported to be n N^((n-1)/n) and n / N^(1/n) respectively [Rob 91]. Evidently, these figures 
can also be obtained using equations 2.1 and 2.2, by replacing the number of gateways, G, 
by the number of processors N.
Notice in table 2.1 that the total bandwidth provided by the spanning bus topology is 
dependent upon the number of processors, whilst in the multiple bus architecture the total 
bandwidth is only dependent on the number of gateways.
       Our multiple bus network                  The spanning bus network
n      total bandwidth   processor bandwidth     total bandwidth   processor bandwidth
2      2 G^(1/2)         G^(1/2) / (2N)          2 N^(1/2)         2 / N^(1/2)
3      3 G^(2/3)         G^(2/3) / (2N)          3 N^(2/3)         3 / N^(1/3)
4      4 G^(3/4)         G^(3/4) / (2N)          4 N^(3/4)         4 / N^(1/4)
5      5 G^(4/5)         G^(4/5) / (2N)          5 N^(4/5)         5 / N^(1/5)
6      6 G^(5/6)         G^(5/6) / (2N)          6 N^(5/6)         6 / N^(1/6)
Table 2.1 This table shows the calculated total and processor bandwidths of our proposal 
and the spanning bus having the same number of processors (N) for various 
dimensions (n). These figures are expressed as a multiple of the capacity of 
a single bus (R Mbits/sec).
Table 2.1 raises a number of notable points:
1- The total bandwidth of both topologies scales with the size of the network - i.e. 
its dimension.
2- The total bandwidth of our multiple bus network can be increased by adding 
more gateways (column 2). In contrast, in the spanning bus this operation 
requires the addition of more processors (column 4) which will eventually reduce 
their processor bandwidth (column 5).
3- Our proposal is more scalable than the spanning bus. It is always possible to 
insert a large number of processors (N) into the network, if a sufficient number 
of gateways (G) is provided to supply them with the required bandwidth (column 
3). This is not true with the spanning bus, where the addition of more processors 
reduces their continuously available bandwidth (column 5). For instance, the 
multi-vector processor VPP [Ino 88] based on the spanning bus can only connect 
up to 64 processors, after which the bandwidth continuously available to each 
processor becomes a serious limitation. It will be explained shortly that our 
proposal can accommodate far more transputers.
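The entries of table 2.1 can be reproduced with a short script. This is a sketch only: the function names are hypothetical, all bandwidths are expressed as multiples of the single-bus capacity R, and the sample values G = 1024 and N = 200 are chosen for illustration.

```python
def multiple_bus(n, G, N):
    """(total, per-processor) bandwidth of the multiple bus network, in units of R."""
    injection = G ** ((n - 1) / n)  # number of injection branches when all p_i are equal
    return n * injection, injection / (2 * N)


def spanning_bus(n, N):
    """(total, per-processor) bandwidth of the spanning bus network, in units of R."""
    return n * N ** ((n - 1) / n), n / N ** (1 / n)


if __name__ == "__main__":
    G, N = 1024, 200  # illustrative values only
    for n in range(2, 7):
        ours_total, ours_proc = multiple_bus(n, G, N)
        span_total, span_proc = spanning_bus(n, N)
        print(f"n={n}: ours {ours_total:9.2f} {ours_proc:7.4f}   "
              f"spanning {span_total:9.2f} {span_proc:7.4f}")
```

The design point made in the list above shows up directly in the signatures: the per-processor bandwidth of the multiple bus network takes G as a free parameter, whereas for the spanning bus it depends on N alone.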
It will be deduced in chapter 4 that the balanced 2-dimensional network, which has the 
smallest maximum number of hops (H=3) and the least number of connections per gateway 
(C=2), provides an adequate bandwidth for a large number of transputers connected via their 
serial links. To give an example for the table above, let us consider a bus having a transmission 
rate of 40 Mbits/sec (e.g. a serial communication channel). Equation 2.3 shows that a structure 
having p_2 = 32 and p_1 = 32 gateways in each dimension requires 1024 gateways and that this 
can support up to 200 T414 processors transmitting continuously at 0.4 Mbytes/sec and 
receiving at the same rate. This equation may be refined to approximately halve the number
of gateways required - see section 4.5 for a balanced network. It must be remembered that 
this architecture is not solely designed for very high throughputs, but also for low-latency 
communication, in which it out-performs other topologies by a large margin. In addition, 
the bandwidth available to each processor scales linearly with the capacity of the serial channel 
and thus could be considerably enhanced from 40 Mbits/sec to accommodate faster processors 
(e.g. T800) or support extra stations.
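The arithmetic of this example can be checked with the sketch below, which applies equation 2.3 to a structure with p gateways along every bus (so that G^((n-1)/n) = p^(n-1)). The function names and the use of exact fractions are assumptions introduced for the illustration.

```python
from fractions import Fraction
from math import floor


def processor_bandwidth_mbit(R, p, n, N):
    """Equation 2.3 for G = p**n: bandwidth per processor in Mbits/sec."""
    return Fraction(R * p ** (n - 1), 2 * N)


def max_processors(R, p, n, b_required_mbit):
    """Largest N for which equation 2.3 still supplies b_required_mbit per processor."""
    return floor(Fraction(R * p ** (n - 1)) / (2 * Fraction(b_required_mbit)))


b = processor_bandwidth_mbit(R=40, p=32, n=2, N=200)
print(float(b), "Mbits/sec =", float(b / 8), "Mbytes/sec")
print(max_processors(R=40, p=32, n=2, b_required_mbit="3.2"))
```

With R = 40 Mbits/sec and a 32 x 32 network, the 200 processors each obtain 3.2 Mbits/sec, i.e. the 0.4 Mbytes/sec quoted in the text.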
In conclusion, our proposed multiple bus structure can achieve a scalable processor 
bandwidth which is necessary to build highly concurrent systems with a large number of 
processors. We now focus on the requirement for minimal latency by investigating simple 
and efficient routing algorithms which will partially adapt to the traffic and select the shortest 
path to deliver messages to their destinations.
2.5 A Proposal for a Semi-adaptive Routing Algorithm
Once a message has been constructed it needs to be routed from one processor to another. 
This is achieved by a routing algorithm implemented within each gateway. Ideally, it would 
be desirable to provide a simple routing algorithm which requires no knowledge of the entire 
topology and could route messages along the shortest path whilst adapting to the traffic.
The proposed network can be viewed as a generalised hypercube topology [Bhu 84] 
and hence the address of a gateway node can be represented by the n-tuple
\[ (X_n, X_{n-1}, \ldots, X_i, \ldots, X_1) \tag{2.4} \]
where 0 ≤ X_i ≤ p_i - 1, p_i is the number of gateways along the ith-dimension and n is the 
dimension of the topology.
This representation, which requires packets to be routed towards the directions differing 
in the X_i-th coordinate, was used by Wittie [Wit 81] to route messages in the spanning bus 
network. It can also be used to forward packets in our multiple bus configuration with a 
slight modification. Once the destination gateway is reached, a final broadcast in the 
absorption branch will be needed to pass the packet to its destination station (ds).
To illustrate this concept, for instance, let us take the 3-dimensional network of figure 
2.2. Suppose station (A) wants to send a message to station (B) on the bus with gateway 
address (0,0,2). The address headers of the message are therefore (X_3 = 0, X_2 = 0, X_1 = 2) and 
(ds). (A) broadcasts this message into its injection branch where the gateway of address 
(0,2,0) picks it because its third coordinate (X_3 = 0) matches that of the message. 
Subsequently, this gateway directs the message towards the bus holding the next differing
coordinate (X_2 = 2 ≠ 0) where the gateway of address (0,0,0) receives the message since its 
second coordinate (X_2 = 0) matches that of the message. This in turn sends the message 
towards the differing 1st coordinate (X_1 = 2 ≠ 0) where the destination gateway (0,0,2) is 
reached. Finally, the message will be broadcast into the absorption branch containing the 
destination station (B) with address ds.
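The coordinate-by-coordinate routing of this example can be sketched as follows. The function name is hypothetical, and the order in which coordinates are corrected (from the highest dimension down to the 1st) is inferred from the worked example rather than stated as a general rule in the text.

```python
def coordinate_route(src, dst):
    """Gateways visited when each differing coordinate is corrected in turn.

    Addresses are tuples (X_n, ..., X_1); index 0 holds the highest dimension.
    """
    if len(src) != len(dst):
        raise ValueError("addresses must have the same dimension")
    current = list(src)
    path = [tuple(current)]
    for k in range(len(current)):
        if current[k] != dst[k]:
            current[k] = dst[k]  # one hop along the bus of this dimension
            path.append(tuple(current))
    return path


# Gateway (0,2,0) has picked up the message from station (A);
# the destination gateway is (0,0,2).
print(coordinate_route((0, 2, 0), (0, 0, 2)))
```

The final broadcast from the destination gateway to station (ds) is not modelled here.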
This algorithm requires the reception of the whole address field before the station 
decides the direction of the message. Therefore, a higher communication latency results 
especially with serial buses and, in addition, more hardware will be required to buffer all the 
address fields.
The structure of our topology, however, overcomes this drawback by identifying each 
hop rather than the addresses of each intermediate gateway by the full n + 1 tuples,
\[ ((X_n, L_n), (X_{n-1}, L_{n-1}), \ldots, (X_i, L_i), \ldots, (X_1, L_1))\ (ds) \tag{2.5} \]
where L_i is the address of a link or a communication port within a gateway having C 
connections (0 ≤ L_i ≤ C - 1) and X_i is the physical address of a gateway along the ith-dimension 
(0 ≤ X_i ≤ p_i - 1 with p_i being the number of gateways along the ith-dimension). Therefore, 
all couples (X_i, L_i) represent one step in the routes that packets follow to reach their designated 
stations (ds). To distinguish between routes and stations, the most significant bit of each 
address field is set to 1 or 0 accordingly; for more explanations see section 6.2.4.1 of chapter 
6. In this routing algorithm, it is only necessary to buffer each address field (X_i, L_i), route 
the message toward its appropriate dimension and thereafter, remove this used field.
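A minimal sketch of this field-by-field handling is given below. It assumes, for illustration only, that a header is a list of (X, L) couples and that a gateway accepts a packet when the X of the leading couple equals its own address on the current bus; the function name and data layout are hypothetical and do not describe the hardware of chapter 6.

```python
def gateway_step(my_x, header):
    """Process the leading address field of a packet at one gateway.

    Returns (out_link, remaining_header) if this gateway accepts the packet,
    or None if the leading field is addressed to another gateway.
    """
    if not header:
        return None  # no route field left: the packet is for a station
    x, link = header[0]
    if x != my_x:
        return None
    # Only the leading field has been buffered; it is now used and removed.
    return link, header[1:]


header = [(2, 1), (0, 0)]  # two hops, then delivery to station ds
print(gateway_step(1, header))     # a gateway that is not addressed
print(gateway_step(2, header))     # the addressed gateway strips its field
print(gateway_step(0, [(0, 0)]))   # last hop: no route fields remain
```

Because a gateway only inspects the first couple, it can start forwarding as soon as that field has arrived, which is the latency advantage claimed over buffering the whole address.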
Although this addressing scheme seems quite complicated in its general form, any 
station in the generalised multiple bus structure can be reached by any message addressed 
to it. In the case of the 2-dimensional topology which has only two connections for each 
gateway, this representation can be simplified to ((X_2, L_2), (X_1, L_1)) (ds) where L_1 = 0 or L_1 = 1 
and L_2 = 0 or L_2 = 1. A further reduction can be made, if we assume that packets will never 
return to the branches they come from. Hence, the final representation of addresses in the 
2D network can be written as (X_2, X_1) (ds).
Therefore, we can design a routing algorithm which allows any initial routing branch 
to be selected arbitrarily, in such a way that the address header X_2 of any packet can randomly 
be chosen from the interval [0, p_2 - 1]. Such a characteristic permits the avoidance of heavily 
loaded routing branches and faulty buses or gateways. For this reason we refer to this routing
as a semi-adaptive routing algorithm. We only need to make such a choice once as messages 
cross only one routing branch. In the injection or the absorption branches, packets are buffered 
while waiting to be transmitted.
To explain this algorithm for a 2-dimensional class, let us suppose that a message 
originates at an injection branch allocated along the 2nd-dimension as in figure 2.1. The X_2 
part of the address of the next intermediate gateway may be selected from the range [0, p_2 - 1] 
according to the traffic (in this example 0 ≤ X_2 ≤ 2). Once the X_2 gateway has accepted the 
message, the used address field X_2 is removed and the message sent in the direction of the 
X_1 gateway (in this case X_1 = 2). At this final gateway the address field is removed and the message is 
broadcast in the absorption branch towards the destination station (ds).
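The choice of X_2 can be sketched as below. The text only requires that X_2 may be picked freely from [0, p_2 - 1]; selecting the least-loaded branch that is not marked faulty is one possible policy assumed here for illustration, and the names `choose_routing_branch`, `build_header`, `loads` and `faulty` are hypothetical.

```python
def choose_routing_branch(loads, faulty=frozenset()):
    """Pick X_2: the least-loaded routing branch that is not marked faulty.

    loads[x] is some measure of the traffic on routing branch x (0 <= x <= p_2 - 1).
    """
    candidates = [x for x in range(len(loads)) if x not in faulty]
    if not candidates:
        raise RuntimeError("no usable routing branch")
    return min(candidates, key=lambda x: loads[x])


def build_header(x1, ds, loads, faulty=frozenset()):
    """Header (X_2, X_1) (ds) of the 2D network, with X_2 chosen at the source."""
    return (choose_routing_branch(loads, faulty), x1, ds)


print(build_header(2, "B", loads=[5, 1, 3]))
print(build_header(2, "B", loads=[5, 1, 3], faulty={1}))
```

Since a message crosses only one routing branch, this choice is made once, at the source; X_1 and ds are fixed by the destination.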
The routing algorithm proposed in this section is simple and does not require complex 
hardware in its implementation. It guarantees that the shortest path is always followed and 
partially adapts to the traffic, yielding lower latencies. It also ensures a high degree of 
reliability in the network as all routing branches lead to any destination.
2.6 A Simple and Effective Method for Deadlock Prevention in 
the 2D structure
An interconnection network with high bandwidth and low latency is not sufficient to 
support the design of an efficient distributed memory system. Deadlock freedom is another 
challenging requirement to ensure that a network operates properly. In general, deadlock in 
the interconnection network of concurrent systems occurs when no message can flow toward 
its destination because all the buffers along its path are full. Such a behaviour results from 
a cycle of un-granted requests [Ros 87].
Basically, methods for avoiding such deadlock concentrate around the prevention of 
cyclic requests. These methodologies depend on the structure and the routing mechanisms 
adopted by the network [Dal 87].
In this section, we will illustrate the appearance of the deadlock in our 2-dimensional 
multiple bus structure and present a simple and effective way to avoid it which is based on 
the allocation strategy.
Consider a portion of the 2D network with four gateways using only one buffer per 
communication port and all branches containing stations, see figure 2.3. Suppose now that 
station (A) sends packets to (C), (C) to (A), (B) to (D) and (D) to (B) across gateways g_1, g_2, 
g_3 and g_4 by utilising buffers Q_1(1), Q_2(1), Q_3(1) and Q_4(1). At a certain stage, Q_1(1) buffers 
a packet from (A), Q_2(1) from (B), Q_3(1) from (C) and Q_4(1) from (D). The cycle is then 
blocked because the packet held at Q_4(1) can never find its way through Q_1(1) to reach the 
destination (B).
Figure 2.3 A portion of the 2D network having stations in all its branches.
The straightforward way to prevent this deadlock is to provide two networks or 2D 
planes on top of each other; and then use each one individually to carry a one way 
communication much like the four Mad-Postman routing planes [Yan 89] where each plane allows 
the messages to flow in one direction and, hence, prevents the occurrence of cycles. Despite 
the fact that this approach increases the bandwidth of the network, it adds more complexity 
to the hardware of each gateway as it increases its number of connections, especially for 
higher dimensional topologies. Besides, to transport messages between these two routing 
planes, additional address overheads will be needed to identify their positions.
The other alternative is the use of structured buffer pools and path switching, where
buffers are partitioned into classes and the assignment of buffers to messages is restricted to
define a partial order on buffer classes [Mer 80, Gel 81]. In other words, the number of
buffers must be equal to the number of hops a packet must travel to reach its destination.
Clearly, in the example of figure 2.3, if each gateway gj (j ∈ {1, 2, 3, 4}) uses its two buffers
there will be no deadlock, provided that each buffer Ωj(1) is restricted to copy a packet into
the next buffer Ωj+1(2) along its path; here the addition j+1 is taken cyclically (1, 2, 3, 4, 1, ...).
However, this technique requires many buffers per connection. For an n-dimensional
topology the number of buffers will be n².
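The buffer argument above can be checked mechanically. The following Python sketch is illustrative only and is not part of the original design: it assumes that the blocked situation of figure 2.3 can be modelled as a wait-for graph in which each class-1 buffer waits for the next buffer around the four gateways, and that a class-2 buffer always delivers to a station and therefore waits for nothing. The names `has_cycle`, `single_buffer` and `two_classes`, and the labels `B1(1)` etc., are hypothetical.

```python
from typing import Dict, List


def has_cycle(waits_for: Dict[str, List[str]]) -> bool:
    """Return True if the buffer wait-for graph contains a directed cycle."""
    WHITE, GREY, BLACK = 0, 1, 2          # unvisited, on current path, finished
    colour = {buf: WHITE for buf in waits_for}

    def visit(buf: str) -> bool:
        colour[buf] = GREY
        for nxt in waits_for[buf]:
            if colour[nxt] == GREY:       # back edge: a cycle of ungranted requests
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[buf] = BLACK
        return False

    return any(colour[buf] == WHITE and visit(buf) for buf in waits_for)


# Assumed model of figure 2.3 with one buffer per port:
# the packet held in buffer j(1) waits for buffer j+1(1), taken cyclically.
single_buffer = {f"B{j}(1)": [f"B{j % 4 + 1}(1)"] for j in range(1, 5)}

# Assumed model with two buffer classes: j(1) may only copy into j+1(2),
# and a class-2 buffer delivers to a station, so it waits for nothing.
two_classes = {f"B{j}(1)": [f"B{j % 4 + 1}(2)"] for j in range(1, 5)}
two_classes.update({f"B{j}(2)": [] for j in range(1, 5)})

if __name__ == "__main__":
    print("one buffer per port deadlocks:", has_cycle(single_buffer))
    print("two buffer classes deadlock:", has_cycle(two_classes))
```

Under these assumptions the single-buffer graph contains the four-buffer cycle described in the text, whereas the two-class graph is acyclic because every dependency ends in a class-2 buffer.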
The simplest and most effective way to prevent deadlock is to use an allocation strategy.
Suppose that all stations are placed as shown in figure 2.4. All communications from
(C) or (D) then finish at Ω4(2) and those from (A) or (B) end up at Ω2(2) without closing any cycle.
Notice that the other buffers, labelled Ω4(1), Ω3(2), Ω2(1) and Ω1(2), have not been used at all.
Therefore, they can be removed and only one buffer per connection or port will be enough
to prevent the deadlock. As a consequence, the occurrence of deadlock can be prevented in
the 2-dimensional network by distributing all stations over the injection branches only (e.g.
2nd-dimension with f1 = 0 and f2 = 1), whilst providing a single buffer for each gateway
connection.
Figure 2.4 When all stations belong to 2nd-dimension, the number of buffers
per port can be limited to one without creating a deadlock cycle.
2.7 Conclusion
We have introduced a multiple bus configuration which contributes to the design of an
efficient communication network for highly parallel machines. Our proposal is distinguished from
other bus-oriented topologies by its inclusion of gateways as routing resources between buses
to form a well-structured grid interconnection.
This architecture has four fundamental aspects which make it very appropriate for
connecting a large number of transputers.
Firstly, it is easy to expand without major upheaval to its elements (gateways
and processors). Secondly, it provides a higher bandwidth and processor scalability than the
most powerful bus-oriented topologies, such as the spanning bus. Thirdly, the communication
latency can be considerably reduced by the use of the semi-adaptive routing algorithm which,
besides avoiding heavily loaded routing branches, offers a high degree of reliability, and by the
use of an efficient routing mechanism like the virtual cut-through that will be studied in
chapter 4. Finally, the separation of routing from computing resources permits complete
freedom to migrate processors anywhere within the injection branches of the network. As a
result, in the 2-dimensional multiple bus network, this feature offers a simple and effective
technique to prevent deadlock. This is realised by distributing all processors in injection
branches under the allocation function f1 = 0 and f2 = 1.
chapter 3
ON ADOPTING THE CSMA/CD ARBITRATION PROTOCOL
3.1 Introduction
A mechanism must be provided to resolve contention between the various units (stations
and gateways) competing for access to a shared bus. Having introduced a bus-based
interconnection topology in chapter 2, the main concern of this chapter is therefore the choice of
an appropriate bus and arbitration mechanism which efficiently exploit the reduced size
and the high capacity of the channel.
As far as the proposed architecture is concerned, there is no constraint on the bus type 
that may be used (serial or parallel).
With parallel buses, there are many arbitration methods [Bai 81]: the static priority
algorithm [Bel 78] used in Unibus and VME, the dynamic priority algorithm [Bes 71], the
fixed time slice [Dig 76] used in Digital Equipment Corporation's parallel communication
link, and central polling. Although these single bus organisations are reliable and relatively
inexpensive, they do introduce some critical components to the system. For instance, a
malfunction in any of the bus interfaces may cause an increase in the system failure rate; an
addition to the number of units sharing the bus elicits an increase in arbitration logic.
The best arbitration mechanisms which bypass these problems are the Future-Bus [Tau
84], the MultiBus II (Intel bus) and the NuBus (MIT). Among the three, the Future-Bus may
be more desirable for its asynchronous communication mechanism.
An alternative class of buses is the serial one, in which a single conductor carries both
data and the arbitration protocol. Despite their limited speed compared to parallel buses, they
can be used for fully distributed and asynchronous communications provided that an
appropriate communication protocol is implemented (e.g. Token bus or CSMA/CD). The
serial bus organisation has been chosen in this project for its simplicity, its variety of
implementations (copper, optical fibres [Raw 78, Kel 84] or infrared transmission [Gfe 78]),
its compatibility in interconnecting transputers via their serial links, and its reduction in
wiring as a single line carries data, acknowledgement and control. In particular, each bus of
the 2D structure outlined in chapter 2 should be able to support communication between 32
stations and a number of gateways in the largest network envisaged (see chapter 4). It is
realistic to limit each branch to 1.5 metres in length. Moreover, the assumption of a maximum
data rate of 40-50 Mbits/sec, mainly determined by the hardware components used to build
the system (chapter 6), allows for the buses to be implemented using copper, fibre-optic
cabling or free-space optical techniques.
Section 3.2 is devoted to the comparison of the most popular serial arbitration protocols
(e.g. Token bus, Token ring, Slotted ring). In section 3.3, we study the performance of the
acknowledged CSMA/CD protocol and prove that, over a wide range of loads, it behaves as
an ideal Markovian queue. In section 3.4, we first compare the performance of the Fair and
Prioritised CSMA/CD protocols with proposed hybrid versions. We then show that the
prioritised CSMA/CD protocol is desirable for nodes with a large number of buffers. By
introducing the acknowledgement mechanism, we finally extend this protocol in section 3.5
in order to make it suitable for a bus which connects different units such as gateways and
stations having either multiple-packet or single-packet buffers.
3.2 Selecting the CSMA/CD Communication Protocol
There are many serial communication protocols associated with computer networks.
They may be classified into three broad categories: fixed assignment protocols such as FDMA,
CDMA and TDMA; demand assignment protocols with implicit [Chi 79, Kle 80, Gol 82,
Fra 81, 88, Tob 83, 84] and explicit [Fra 74, Jay 85] token passing; and random assignment
protocols such as ALOHA [Abr 73], CSMA [Kle 75] and CSMA/CD [Hey 82, Apo 86, Sho
79, Tob 80].
The most commonly used communication protocols are the Token Ring [Dix 83], the
Token Bus [Ulu 83], the Slotted Ring [Bux 81] and the CSMA/CD [Lam 80]. Our requirement
in this research is for a protocol which satisfies the following characteristics:
a- high performance (high throughput, low delay),
b- suitable for higher level protocols,
c- reliable and cost-effective,
d- simple to implement.
In order to choose the appropriate communication protocol from those mentioned above
(Token Ring, Token Bus, Slotted Ring, CSMA/CD), we explain each one in turn and
concentrate on the one with the highest performance which is also the simplest to implement.
3.2.1 Token Ring
In this protocol, tokens are passed from station to station around the ring. Stations with
data to transmit capture a free token, change its state to busy and start their transmission.
When its buffer becomes empty, the station releases the token.
It turns out that when the mean packet length is greater than the ring latency, which is
the case for a small bus, only a single token circulates in the ring. Such a restriction gives
an average packet delay of [Ham 86, Hay 84]
$$D = T + \frac{\tau}{2} + \frac{ST}{2(1-S)} + \frac{\tau(1-S/N)}{2(1-S)} + \frac{N T_l (1-S/N)}{2(1-S)}, \qquad (3.1)$$
where T is the packet transmission time, τ the ring latency (end-to-end propagation delay),
N the number of stations, S the throughput (number of packets transmitted per packet time)
and T_l the latency per station.
3.2.2 Token Bus
Token buses are largely the same as token rings since stations are organised in a virtual
ring. However, the topology involved is a bus rather than a ring, and the control token is
longer. The mean packet delay for this protocol can be expressed as [Ham 86]
$$D = T + \frac{\tau}{3} + \frac{ST}{2(1-S)} + \frac{N\tau(1-S/N)}{6(1-S)} + \frac{N T_t (1-S/N)}{2(1-S)}, \qquad (3.2)$$
where T_t is the token transmission time and all other parameters are the same as in
the token ring expression above.
3.2.3 Slotted Ring
Slotted rings are distinguished by the fact that a constant number of bit positions grouped
into fixed-length slots circulate continuously around the ring. To transmit a packet, a ready
station captures an empty slot and fills it with a fixed amount of data. There are header bits
associated with each slot to indicate whether it is empty or full.
The main aspect of this protocol is the overhead per packet since each slot imposes a
fixed length transmission. The average packet delay of this protocol can be written as [Bux
81, Hay 84]
$$D = \frac{2(1+h)T}{1 - S(1+h)} + \frac{\tau}{2}, \qquad (3.3)$$
where h is the ratio of the header to the data bits and τ the ring latency.
3.2.4 CSMA/CD Protocol
Unlike the above protocols, the CSMA/CD scheme uses a random strategy to access
the channel. Stations with a full buffer transmit packets if the channel is sensed to be idle.
Whenever a collision is detected, a rescheduling algorithm is invoked, after which a
retransmission is attempted [Met 76].
More elaborate analyses of this protocol are provided in various publications [Kle 75,
Tob 80, Lam 80, Hay 84, Apo 86]. The approach followed by Lam, leading to an analytical
expression for the mean packet delay, is summarised in Appendix A, from which we deduce
the maximum throughput (S_max), in units of the number of packets transmitted per packet time
(i.e. the channel utilisation), as
$$S_{max} = \frac{1}{1 + 6.44a} \qquad (3.4)$$
(a is the ratio of the end-to-end propagation delay of the bus to the packet transmission time).
3.2.5 Comparison
As previously mentioned in chapter 2, an efficient interconnection network is always
characterised by a low latency and high bandwidth, both of which can be related to the mean
packet delay and the throughput. Our primary choice for a suitable communication protocol
is therefore based on the one which provides the smallest average packet delay and the highest
throughput.
A first consideration of equations (3.1) and (3.2) reveals that the token ring performs
better than the token bus. This is due to the shorter control header of the token ring.
In other words, the token bus scheme involves much longer tokens which carry
addresses and control overheads from station to station.
Equation (3.3) shows that under no load (S → 0 packets per packet time), the delay of
the slotted ring is twice the minimum delay of all the other protocols described. This is not only
due to the large amount of overhead present in each slot, but also to the anti-hogging rule
[Ham 86] that produces empty slots circulating even when a station has data to transmit.
Moreover, when the number of slots increases, their length in bits has to decrease so that the
ring can hold all circulating slots. As a result, the overhead increases and the corresponding
average delay becomes worse than that of the token ring.
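The no-load comparison above can be reproduced numerically. The Python sketch below simply transcribes Equations (3.1) to (3.3) as they are read here; it is an illustration, not the author's code. The function names and the numerical values chosen for N, T_l, T_t and h are assumptions made for the example only.

```python
def token_ring_delay(S, T, tau, N, T_l):
    """Mean packet delay of the token ring, as read from Equation (3.1)."""
    w = 2.0 * (1.0 - S)
    return (T + tau / 2 + S * T / w
            + tau * (1 - S / N) / w
            + N * T_l * (1 - S / N) / w)


def token_bus_delay(S, T, tau, N, T_t):
    """Mean packet delay of the token bus, as read from Equation (3.2)."""
    w = 2.0 * (1.0 - S)
    return (T + tau / 3 + S * T / w
            + N * tau * (1 - S / N) / (3 * w)   # N*tau*(1-S/N) / (6*(1-S))
            + N * T_t * (1 - S / N) / w)


def slotted_ring_delay(S, T, tau, h):
    """Mean packet delay of the slotted ring, as read from Equation (3.3)."""
    return 2 * (1 + h) * T / (1 - S * (1 + h)) + tau / 2


if __name__ == "__main__":
    T, tau, N = 6.4e-6, 5e-9, 32          # packet time, propagation delay, stations
    T_l, T_t, h = 25e-9, 0.4e-6, 0.1      # assumed values, for illustration only
    for S in (0.0, 0.3, 0.6):
        print(S,
              token_ring_delay(S, T, tau, N, T_l) / T,
              token_bus_delay(S, T, tau, N, T_t) / T,
              slotted_ring_delay(S, T, tau, h) / T)
```

With the overheads set to zero (τ = 0, T_l = T_t = 0, h = 0) and S = 0, the token schemes reduce to a delay of T while the slotted ring gives 2T, which is the factor of two quoted in the text.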
Figure 3.1 shows that over a wide range of loads, the CSMA/CD protocol has a smaller
average delay than the token ring. This is not surprising because, with the insignificant
propagation delay of a physically small bus, the collision window (which is equal to twice
the end-to-end propagation delay) is negligible compared to the packet transmission time,
and hence a higher throughput (Equation 3.4) and a reduced average delay are achieved.
Figure 3.1 The calculated normalised mean packet delay (i.e. the ratio of the actual
delay D to the packet transmission time T, D/T) as a function of the channel
utilisation (S, throughput per packet time) for a single CSMA/CD (from
Equation A.1 in appendix A) and Token ring (from Equation 3.1) bus.
In conclusion, the CSMA/CD protocol not only provides higher performance
than the other protocols described for networks with short end-to-end propagation delays, but is
also simple to implement, reliable, robust¹ and suitable for higher level protocols (stations
can be added and removed without altering the progress of the network).
3.2.6 Performance of the CSMA/CD Protocol
The main constraint faced by the CSMA/CD protocol is its sensitivity to the ratio of
the end-to-end propagation delay to the packet transmission time (a = τ/T), which
determines the likelihood of collisions.
In order to reduce the number of collisions, various modifications and versions such
as the p-persistent, the non-persistent [Tob 80], the slotted [Tan 88] and hybrid CSMA/CD
protocols have been devised. The latter, like the HYMAP (hybrid multiple access protocol)
[Rio 85] and the CSMA/CD/DP [Kie 83] with dynamic priority, combine the random and
the collision-free operations to reduce the collision probability.
Another modification, the CSMA/CD/DCR with deterministic contention resolution,
uses the relative position of stations on the bus to resolve the contention [Tak 83]. The
feedback FB/CSMA/CD scheme limits the total time to resolve the contention to a few
propagation delays [Gab 83]. The M/CSMA/RI and the M/CSMA/IC use multiple channels
where each bus is randomly or idly selected to reduce the factor a (the ratio of the end-to-end
propagation delay to the packet transmission time) by the number of available channels [Mar
83].
However, for short buses (i.e. in a multi-processor environment) where the end-to-end
propagation delay is very small (e.g. less than 5 nsec)², these protocols do not provide a
significant performance improvement when compared to a normal CSMA/CD protocol.
Table 3.1 shows that high efficiencies can be obtained even at higher transmission rates. This
is due to the small end-to-end propagation delay which keeps the effect of the ratio of the
end-to-end propagation delay to the packet time (a) insignificant.
¹ The token ring is less robust since a token has to be passed forever, making it susceptible to any loss.
² 1.5 metres divided by the speed of light.
R (Mbits/sec)    Emax (W = 2048 bits) %    Emin (W = 72 bits) %
1                100                       99.96
40               99.94                     98.24
80               99.87                     96.55
100              99.84                     95.72
120              99.81                     94.91
160              99.75                     93.32
200              99.69                     91.79
Table 3.1 Channel utilisation of the CSMA/CD protocol for different transmission
rates using packet lengths of W = 72 and 2048 bits. These
figures have been obtained from equation 3.4 with a = τR/W
(τ = 5 nsec).
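The entries of Table 3.1 follow directly from Equation (3.4). As a minimal sketch, the Python function below evaluates the channel utilisation 1/(1 + 6.44a) with a = τR/W; the function name `channel_utilisation` and its argument names are illustrative choices, not taken from the thesis.

```python
def channel_utilisation(rate_mbps: float, packet_bits: int, tau_ns: float = 5.0) -> float:
    """Maximum channel utilisation in percent, from S_max = 1 / (1 + 6.44 a)."""
    # a = tau * R / W: propagation delay divided by the packet transmission time
    a = (tau_ns * 1e-9) * (rate_mbps * 1e6) / packet_bits
    return 100.0 / (1.0 + 6.44 * a)


if __name__ == "__main__":
    print("R (Mbits/sec)  W=2048 bits  W=72 bits")
    for rate in (1, 40, 80, 100, 120, 160, 200):
        e_max = channel_utilisation(rate, 2048)
        e_min = channel_utilisation(rate, 72)
        print(f"{rate:>13}  {e_max:>11.2f}  {e_min:>9.2f}")
```

Longer packets and lower rates both reduce a, which is why the 2048-bit column stays closer to 100% than the 72-bit column.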
In addition to the sensitivity to a, the other weakness of the CSMA/CD protocol is its
inability to provide built-in acknowledgements as the ring schemes do. Whenever a reliable
communication is desired, an acknowledgement has to fight for the channel just as any other
data packet does, thus reducing the performance of this protocol. To improve the performance
of the CSMA/CD protocol, Tokoro [Tok 77] has proposed a high priority acknowledgement
which is transmitted in the next time slot after each successful data packet transmission. This
is discussed further below.
3.3 Towards the Acknowledged CSMA/CD Scheme
Having focused on the adoption of the CSMA/CD communication protocol as the
main arbitration mechanism for resolving conflicts in our multiple interconnected bus
structure, we explain in this section the operation of the Acknowledged CSMA/CD which
permits an efficient and secure way (i.e. guarantees the recovery of corrupted packets) for
passing messages. We then deduce an expression of the mean packet delay which will be
used for further investigations in subsequent chapters, and show its correctness through a
simulation model. We thereby show that over a wide range of loads this protocol behaves
as an ideal Markovian queue, making it an appropriate scheme for our purpose.
3.3.1 The Operation of the Acknowledged CSMA/CD
In the acknowledged CSMA/CD communication protocol developed by Tokoro [Tok
77], a ready station finding a silent channel waits for a basic time equal to at least twice the
end-to-end propagation delay (2τ) to avoid collisions with acknowledgement packets.
Afterwards, if the channel is sensed idle, the ready station starts its transmission. If a collision
occurs, the station aborts the current transmission with a wasted bandwidth of at most twice
the end-to-end propagation delay (2τ) and reschedules its next retransmission according to
a back-off algorithm [Met 76]. In the case of a successful transmission, the receiving station
returns an acknowledgement immediately within the next slot. Figure 3.2 illustrates these
data and acknowledgement packets travelling along a serial bus. The vertical and horizontal
axes show four equally-spaced stations along the conductor and the time respectively.
Figure 3.2 Station s2 transmits a data packet at time equal 0. Station s4 
acknowledges it upon arrival. Station s3 becomes ready at the point 
A and waits for a basic time (Bwt) and hence it avoids a collision 
with the Ack. When the channel is sensed idle at the point B, the 
station s3 waits a further basic time and starts its transmission at 
point C.
3.3.2 An Approximate Mean Packet Delay Expression for the
Acknowledged CSMA/CD Protocol
The CSMA/CD mean packet delay expression derived by Lam and later modified by
Bux is reported in Appendix A [Lam 80, Bux 81]. We have introduced further simplifications
and modifications to this equation by considering high priority acknowledgement and short
end-to-end propagation delay (τ = (1.5 metres)/(speed of light) = 5 nsec).
We find in appendix A that the approximate expression of the acknowledged CSMA/CD
mean packet delay can be written as
$$D = T + 2.5\tau + \frac{\lambda \overline{X^2}}{2(1 - \lambda \overline{X})} \qquad (3.5)$$
with $\overline{X} = T + T_a + 8.44\tau$ and λ being the input rate to the network in terms of messages per
unit of time. $\overline{X}$ represents the service time, in which the term 8.44τ accounts for the wasted
collision and retransmission slots, T is the time to transmit a packet at the full channel rate and
T_a the time needed to return a high priority acknowledgement. The two limiting cases of this
protocol can be found by noticing that for a very heavily-loaded system (λ → 1/$\overline{X}$) the delay
becomes unbounded as packets tend to spend a longer time in their buffers. At the other
extreme, under no load (λ → 0), the expected delay is D = T + 2τ + τ/2, which includes one
basic time (2τ), a complete packet transmission (T) and a half-way propagation delay (τ/2).
This expression resembles the average response time of an M/G/1 queue with a service
time $\overline{X}$. In particular, with constant packet length ($\overline{X^2} = \overline{X}^2$ [Ham 86]) this expression is
similar to that of an M/D/1 Markovian deterministic queue (FIFO protocol) used as a reference
communication scheme.
To verify that this equation, resembling an ideal protocol, correctly describes the
acknowledged CSMA/CD scheme in short buses, we build a simple simulation model for
the acknowledged CSMA/CD protocol using the SimScript language (see Appendices E1
and E2).
3.3.3 Simulation Model
The simulation model consists of three concurrent processes (Figure 3.3) forming the
main building blocks of stations and the shared resource (channel).
The process generator injects messages according to a given distribution (e.g. Poisson) with
an average processing time Tp. When a message is ready and space is available in the
buffer, a process buffer is activated; otherwise the process generator is suspended.
The process buffer regulates the flow of packets according to its buffer size. Every time a
buffer space becomes available, it reactivates the process generator if it has been suspended.
When a message is ready in the buffer, the process packet is activated by the process buffer,
which suspends itself. During its activity, the process packet goes through two phases:
1- packets wait in the queue of the system while the resource is acquired (the flag
request is set).
2- packets attempt to flow into the channel when the resource is idle (the flag request
is reset), either because there are no transmissions or because the current data or
acknowledgement packet transmission has not reached the station.
At this stage, a ready station initiates a transmission; meanwhile, if packets arrive
during the collision window, the transmission is aborted and the collision counter
is incremented. The retransmission is then rescheduled after a random
number of slot times chosen from the interval [0, 2^c - 1], where c is the number of
accumulated collisions. Otherwise, if no packet arrivals occur during the
collision window, the transmission is successful.
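The rescheduling rule of phase 2 can be sketched in a few lines of Python. This is only an illustration of drawing a retransmission delay uniformly from [0, 2^c - 1]; it is not the SimScript code of the appendices. The function name `backoff_slots` and the optional `cap` argument (which truncates the exponent) are assumptions introduced for the example.

```python
import random
from typing import Optional


def backoff_slots(collisions: int, rng: random.Random,
                  cap: Optional[int] = None) -> int:
    """Draw a retransmission delay, in slot times, uniformly from [0, 2^c - 1].

    `collisions` is the number c of accumulated collisions. If `cap` is given
    (an assumption of this sketch), the exponent is truncated to min(c, cap).
    """
    exponent = collisions if cap is None else min(collisions, cap)
    return rng.randrange(2 ** exponent)   # integers 0 .. 2^exponent - 1


if __name__ == "__main__":
    rng = random.Random(1992)             # fixed seed for a repeatable run
    for c in range(5):
        draws = [backoff_slots(c, rng) for _ in range(8)]
        print(f"c = {c}: interval [0, {2 ** c - 1}], draws {draws}")
```

The interval doubles with every additional collision, so stations that have collided repeatedly spread their retransmissions over a wider range of slots.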
Figure 3.3 Process interaction in the simulation model
To record the simulation statistics of the model, the average delay can be calculated as
$$\text{delay} = \frac{1}{m} \sum_{i=1}^{m} \Delta t_i ,$$
where m is the number of packets crossing the channel during the
simulation and Δt_i the time interval between the instant when packet i enters the
system and the instant it leaves it (final time - initial time). The channel utilisation is the
accumulation of all successful transmissions during the simulation time (S = mean
transmission time / simulation time). Basically, the mean delay, the channel utilisation and other
statistics such as the average system queue and the offered load can be obtained using the
built-in routines of the language.
The simulation parameters used are:
mean processing time Tp = 500 µsec,
mean transmission time T = (32 Bytes × 8)/(40 Mbits/sec) = 6.4 µsec,
acknowledge time Ta = 0.4 µsec,
basic time Bwt = 2τ = 0.01 µsec,
collision window < 5 × 10⁻³ µsec.
The choice of these parameters is arbitrary and design dependent. For
instance, the average processing and transmission times are assumed to be constant at 500
µsec and 6.4 µsec respectively, since the load to the network can easily be varied by changing
the number of stations (S = TN/Tp packets per packet time). On the other hand,
the collision window and the basic time are associated with the length of the bus (1.5 metres)
and the propagation delay. Furthermore, the acknowledge time is equal to the length of the
acknowledgement (2 Bytes) designed in chapter 6 divided by the channel capacity
(40 Mbits/sec).
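As a worked illustration of Equation (3.5) with the parameters listed above, the Python sketch below evaluates the theoretical delay for a given number of stations. It assumes constant packet length (so that the second moment of the service time equals the square of its mean) and takes the input rate as λ = S/T, i.e. N/Tp; both the function names and this mapping from S to λ are assumptions of the sketch rather than statements from the thesis.

```python
def throughput(n_stations: int, T: float = 6.4e-6, T_p: float = 500e-6) -> float:
    """Offered throughput S = T N / Tp, in packets per packet time."""
    return T * n_stations / T_p


def ack_csma_cd_delay(S: float, T: float = 6.4e-6, T_a: float = 0.4e-6,
                      tau: float = 5e-9) -> float:
    """Mean packet delay D in seconds from Equation (3.5), constant packet length."""
    X = T + T_a + 8.44 * tau      # mean service time
    lam = S / T                   # assumed input rate (packets per second)
    rho = lam * X
    if rho >= 1.0:
        raise ValueError("input rate reaches 1/X: the delay is unbounded")
    return T + 2.5 * tau + lam * X * X / (2.0 * (1.0 - rho))


if __name__ == "__main__":
    for n in (1, 20, 40, 60):
        S = throughput(n)
        print(f"N = {n:2d}  S = {S:.3f}  D/T = {ack_csma_cd_delay(S) / 6.4e-6:.3f}")
```

At zero load the function returns T + 2.5τ, the limiting case quoted for Equation (3.5), and it raises an error once the input rate reaches the reciprocal of the service time.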
Figure 3.4 The normalised mean packet delay (i.e. the ratio of the actual delay D to
the packet transmission time T, D/T) as a function of the channel utilisation
(S, throughput per packet time) for the simulated and theoretical single
bus Ack/CSMA/CD (from Equation 3.5) compared to the simulated single
bus ideal M/D/1 protocol.
Figure 3.4 shows the average packet delay as expressed by Equation (3.5) for various
theoretical throughputs (S = TN/Tp) and as obtained by the simulation model. We see that
for a channel utilisation of less than 70%, there is no significant difference between the
two curves. However, for large throughputs, they start diverging, with the theoretical one
tending more rapidly towards infinity. This divergence arises because the simulation model is
constructed for a finite number of stations, whereas the theoretical expression is
derived for an infinite population and hence diverges more rapidly as the throughput
approaches its maximum. Nevertheless, the theoretical expression can be used as an
approximate upper limit for further design and decision making.
For the two simulated protocols, Ack/CSMA/CD and the ideal M/D/1, we observe that,
over a wide range of traffic loads, there is no distinction between them. A slight difference
occurs only when the Ack/CSMA/CD is heavily loaded and hence the number of collisions
increases. It can be concluded that over a wide range of traffic loads, the Ack/CSMA/CD
and the ideal M/D/1 behave similarly. Therefore, the former is suitable as an arbitration
mechanism for our serial bus configuration.
3.4 Opting for the prioritised CSMA/CD Protocol
Previous sections have considered the CSMA/CD protocol as an arbitration mechanism
between single buffer stations. However, some questions arise when this protocol is used to
resolve contention between nodes having a large amount of buffering (i.e. gateways). Is the
prioritised³ or the fair⁴ CSMA/CD more efficient? Are the hybrid protocols, with a low
collision probability [Kie 83, Rio 85], more efficient?
To answer these questions, we consider the CSMA/CD protocol operating in fair and
prioritised manners, in comparison with an efficient hybrid version which has a low collision
probability. Our choice will mainly be based on the smallest average packet delay and
the ease of implementation.
3.4.1 Description of the Two Random Protocols
The finite buffer CSMA/CD protocol with a limited number of stations and a fair access
to the channel has been analysed by Apostolopoulos [Apo 86]. He has shown that the load
to the network increases with the buffer capacity and therefore a high average message delay
results. In this section, we comment on this protocol and the relevant prioritised one by
comparing their performance through simulations.
³ gateways empty the contents of all their buffers when they acquire the channel
⁴ gateways transmit one packet at a time and release the channel
a- Fair CSMA/CD: In this protocol, each gateway sends one packet at a time. Therefore,
all ready packets may be subjected to a collision as they all contend for the channel. As
a consequence, one would expect a limited throughput.
b- Prioritised CSMA/CD: With the prioritised CSMA/CD protocol, an exhaustive
service is achieved. When a ready gateway finds an idle channel, it empties all the
contents of its buffer. This strategy improves the system throughput since only packets
at the head of the queue may be susceptible to collisions. In other words, the number
of successful transmissions dominates the reduced number of contention intervals, and
hence the maximum throughput S_max approaches 1/(1 + a), with a being the ratio of the
end-to-end propagation delay to the packet transmission time.
3.4.2 Description of the Two Hybrid Protocols
Hybrid protocols combine features from the random CSMA/CD and demand access
protocols with implicit token passing [Fra 88 & 81, Tob 83 & 84]. During normal operation,
all stations obey the CSMA/CD rules. When a collision is detected, by the transmitting node
using bit-by-bit comparison and by all other nodes employing their irregularity detection
capabilities (e.g. packet incompleteness, lack of flags, address or control errors etc.), the
whole protocol switches to a collision-free one under the rule of a pre-defined scheduling
function. There are many collision-free protocols: the BRAM [Chi 79] and
the MSAP [Kle 80] do not address the loss of synchronisation; the SOSAM [Gol 83] obtains
better performance than the BRAM by providing more information about the inter-distance
between stations; and the BID [Tob 84] allocates the position of each station on the bus
according to their address order. These improved versions give rise to many efficient
hybrid protocols such as the HYMAP [Rio 85], based on station positions to provide a fair
access to the channel.
However, for short buses where the end-to-end propagation delay is insignificant, as
in our multiple bus network, these particular considerations of station positions or knowledge
of their inter-distance do not furnish much improvement. In fact, simulation has shown
that the proposed hybrid protocol obeying the scheduling function
$$H(i,j) = 3\tau\,[(j - i - u) \bmod N] \qquad (3.6)$$
derived in appendix B approaches an ideal scheme (M/D/1). τ is the propagation delay, N
the number of gateways sharing the channel, i the address of the currently transmitting node,
j the address of the next transmitting node and u determines the priority of the protocol. The
fair (u=1) and the prioritised (u=0) hybrid protocols are explained below and compared to
their analogous CSMA/CD derivatives using simulations.
a- Fair CSMA/CD with virtual token passing: This protocol can be implemented as
follows:
1- a ready station finding an idle channel waits for an inter-spacing time and attempts
to transmit its packet,
2- if the transmission is successful, the station waits until a new packet occupies its
buffer and then repeats the algorithm at step 1,
3- if a collision is detected, the station aborts the transmission and switches to the
collision-free protocol. At this stage, all stations schedule their transmissions with
respect to a reference one (e.g. i = 0) and the scheduling function in Equation (3.6)
becomes H(0, j) = 3τ[(j - 1) mod N], allowing consecutive stations (1, 2, 3, ..., N-1, 0)
to transmit,
4- afterwards, stations follow the general scheduling function of Equation (3.6) for u = 1,
5- a time-out counter is set to T = H(i, i) = 3τ(N - 1) every time an end of carrier is
detected. If T expires, indicating that the load in the network has decreased, the protocol
returns to its CSMA/CD form.
b- Prioritised CSMA/CD with virtual token passing: The contention part of this protocol
is similar to the prioritised CSMA/CD scheme. It is only when a collision is detected
that stations follow the scheduling function of Equation (3.6) for u=0, as explained in
Figure 3.5.
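The effect of the parameter u in Equation (3.6) can be illustrated with a short Python sketch. It evaluates H(i, j) = 3τ[(j - i - u) mod N] for every candidate node j and sorts the nodes by their scheduled delay; the function names `schedule_delay` and `transmission_order` are hypothetical, and the sketch covers only the scheduling function, not the full protocol.

```python
from typing import List


def schedule_delay(i: int, j: int, n_nodes: int, u: int, tau: float = 5e-9) -> float:
    """Scheduling function H(i, j) = 3 tau [(j - i - u) mod N] of Equation (3.6)."""
    # Python's % returns a non-negative result for a positive modulus,
    # which matches the cyclic interpretation of the mod operation.
    return 3 * tau * ((j - i - u) % n_nodes)


def transmission_order(i: int, n_nodes: int, u: int) -> List[int]:
    """Nodes sorted by the delay they wait after node i has transmitted."""
    return sorted(range(n_nodes), key=lambda j: schedule_delay(i, j, n_nodes, u))


if __name__ == "__main__":
    N = 4
    print("fair (u=1), after node 0:       ", transmission_order(0, N, 1))
    print("prioritised (u=0), after node 0:", transmission_order(0, N, 0))
    print("time-out H(i, i) for u=1:       ", schedule_delay(2, 2, N, 1))
```

With u = 1 the node that has just transmitted is scheduled last, whereas with u = 0 it is scheduled first, which corresponds to the exhaustive service of the prioritised scheme.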
Figure 3.5 This flowchart is for the prioritised CSMA/CD with virtual token passing.
It can also be seen as the one for the prioritised CSMA/CD protocol if the
scheduling function H(i, j) is determined randomly in the interval
0 ≤ H(i, j) < 2^min(c,8). EOC and BOC stand for the end and beginning of
carrier respectively.
3.4.3 Simulation Results
Figure 3.6 shows the simulated normalised mean packet delay as a function of the
throughput for the four protocols. We notice that at low load, where the number
of collisions is small, the four protocols behave identically as an ideal queue (M/D/1), whereas
at high load, a slight variation appears between the fair and the prioritised CSMA/CD because
there are more collisions in the former.
On the other hand, the fair and the prioritised CSMA/CD with virtual token passing 
rem ain identical because, during the collision-free operation and under small propagation 
delays, the two protocols converge toward the M/D/1 queue. This result can be verified using 
the expressions developed by Konhein [Kon 74] and Chlamtac [Chi 79] for the prioritised 
and the fair protocols respectively.
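For reference, the ideal M/D/1 behaviour mentioned above can be evaluated directly. The sketch below uses the standard Pollaczek-Khinchine result for a queue with deterministic service, giving the normalised delay $D/T = 1 + S/(2(1 - S))$. It ignores acknowledgement and collision overheads, so it is only the ideal curve against which the simulated protocols are compared; the function name is an assumption.

```python
def md1_normalised_delay(s: float) -> float:
    """Normalised mean delay D/T of an ideal M/D/1 queue at utilisation s.

    The mean waiting time is s/(2(1 - s)) packet times, to which one
    packet transmission time is added.
    """
    if not 0.0 <= s < 1.0:
        raise ValueError("utilisation must satisfy 0 <= s < 1")
    return 1.0 + s / (2.0 * (1.0 - s))


if __name__ == "__main__":
    for s in (0.1, 0.5, 0.8):
        print(f"S = {s:.1f}  D/T = {md1_normalised_delay(s):.3f}")
```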
Finally, this small difference between the two prioritised schemes which occurs for 
intensive traffic should not matter when the robustness and the simplicity (no scheduling 
function to compute) are provided by the prioritised CSMA/CD protocol.
[Figure 3.6 plot: normalised mean packet delay D/T versus single bus throughput S (number of packets per packet time).]
Figure 3.6 The simulated mean packet delay (i.e. the ratio of the actual delay D to the 
packet transmission time T, D/T) as a function of the channel utilisation (S, 
throughput per packet time) for the four inset protocols. Each gateway 
has a large buffer (>10).
3.5 Combining the Acknowledged and the Prioritised CSMA/CD 
Protocol
We have proved in section 3.4 that the acknowledged CSMA/CD protocol is an efficient 
arbitration mechanism for resolving the conflict between single buffer stations. In the same 
perspective, we have verified that the prioritised CSMA/CD protocol that arbitrates between 
gateways of large buffers is desired for its practical implementation and its performance
approaching the low collision hybrid one. In this section, we combine the acknowledged 
and the prioritised CSMA/CD schemes to provide a realistic protocol that sustains both 
stations and gateways on the same channel.
Stations may have one or two buffers depending on the pipelined transmission scheme 
used (see chapter 5). For a single node buffer, the fair and the prioritised protocols provide 
similar services since the unique packet in each node buffer will be transmitted at a low 
priority. To implement the exhaustive service for large buffer nodes, the packet at the head 
of each node queue is transmitted at the low priority U0, whereas all subsequent ones 
are assigned the higher priority U1.
Moreover, the introduction of the acknowledgement requires the use of an extra high 
priority U2. One way to implement these priorities U0, U1 and U2 is by considering the 
waiting times before attempting a transmission. In other words, a station which, after finding 
an empty channel, sends its packet without waiting has the highest priority. As shown in 
Figure 3.7, these priorities can be defined in terms of waiting times as $U2 = 0$, $U1 = g$ and 
$U0 = 2\tau + g$, with $g$ and $\tau$ being the inter-packet space and the propagation delay respectively.
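A minimal Python sketch of this waiting-time scheme is given below, assuming the three levels are $U2 = 0$, $U1 = g$ and $U0 = 2\tau + g$. The unit that waits least after an idle channel is detected transmits first. The function names, the unit labels and the value of $g$ are hypothetical and chosen only for illustration.

```python
def waiting_times(g: float, tau: float) -> dict[str, float]:
    """Waiting time before a transmission attempt for each priority level.

    g is the inter-packet space and tau the propagation delay."""
    return {"U2": 0.0, "U1": g, "U0": 2.0 * tau + g}


def first_to_transmit(contenders: dict[str, str], g: float, tau: float) -> str:
    """Return the contender whose priority implies the shortest wait.

    contenders maps a unit name to its priority label (U0, U1 or U2)."""
    waits = waiting_times(g, tau)
    return min(contenders, key=lambda name: waits[contenders[name]])


if __name__ == "__main__":
    # Hypothetical values in microseconds: g = 0.1, tau = 0.005.
    units = {"st3": "U0", "gt1": "U1", "gt2": "U2"}
    print(waiting_times(0.1, 0.005))
    print(first_to_transmit(units, 0.1, 0.005))
```

In this example the unit holding an acknowledgement (priority U2) seizes the channel before any data packet.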
Figure 3.7 Station st2 broadcasts a single packet of duration T. When this whole 
packet is received by gateway gt2, an ack is returned immediately 
at the high priority U2. After transmitting the acknowledgement, gt2 
is able to broadcast the first packet waiting in its transmit queue to 
st3 after a delay corresponding to priority U0. st3 then acknowledges 
this packet, whereupon gt2 transmits its remaining packets to st3 at 
priority U1, waiting for an acknowledgement between each.
3.6 Conclusion
As it is necessary to provide an arbitration mechanism for each branch of our multiple 
bus network, we have opted for serial protocol schemes, mainly because they can effectively 
realise a fully distributed and asynchronous communication. They are also simple to 
implement, they allow a reduction of the network wiring density and they are compatible with the 
transputer serial links.
In particular, we have chosen the CSMA/CD protocol from the other popular ones for 
its simplicity, robustness and high performance on short buses.
We have adopted the acknowledged CSMA/CD scheme which provides a secure way 
of passing messages between nodes. This protocol has an improved performance over the 
normal CSMA/CD one, where the acknowledgement would have to fight for the channel as 
a normal data packet would. An equation of the mean packet delay has been derived, and a 
simulation model verifies that the acknowledged CSMA/CD scheme behaves over a wide 
range of loads as an ideal M/D/1 queue or a FIFO transmission.
Further comparison of the prioritised CSMA/CD protocol with the low collision 
probability ones has revealed that in short buses the CSMA/CD with an exhaustive service 
has an acceptable performance and a high throughput. We have also included in this protocol 
the acknowledgement through the use of three priority levels to produce an acknowledged 
prioritised CSMA/CD protocol which resolves contention between different units (stations 
and gateways) belonging to the same channel. This resulting protocol, like the original 
CSMA/CD protocol, is easy to implement, robust in terms of fault tolerance (if one or 
many stations fail, the communication protocol still functions properly) and offers the 
possibility of adding and removing stations without affecting the correct functioning of the 
network and, thus, permits a high degree of network maintainability.
Such features make this protocol the most suitable arbitration mechanism for 
our serial multiple interconnected buses network, and hence it will be fully exploited within 
the 2D structure in the subsequent chapters.
chapter 4
SIMULATING AND MODELLING THE 2 DIMENSIONAL 
STRUCTURE
4.1 Introduction
The 2D structure shown in figure 4.1 is a particular class of the generalised multiple 
bus networks that we introduced in chapter 2. It is special in the sense that by allocating 
stations in specific branches (say in the 2nd-dimension), the system becomes deadlock free 
provided that packets flow in a fixed direction. Our attraction towards this topology is 
motivated by its small maximum number of hops ($H = 3$), which minimises the 
communication latency, and its low number of connections per gateway ($C = 2$), which simplifies the 
hardware and reduces the cost of the network in terms of wiring density. Besides, the configuration
of this network allows the implementation of the proposed routing algorithm with a minimal 
packet addressing overhead, and hence reduces the complexity of each gateway.
The study of the properties and features of the 2D structure is therefore our main 
interest in this chapter, as this network represents the ultimate building block for our new 
generalised multiple bus configuration and is the framework for this research.
Having used the acknowledged CSMA/CD communication protocol [Tok 77] as the 
main arbitration mechanism to resolve contention between various units sharing the same 
channel (bus or branch), we are now in a position to describe, simulate and analyse our 2D 
structure.
This begins in section 4.2 where we first compare some well known routing 
mechanisms, and simulate the effect of the store-and-forward [Mer 80, Gel 81] and the virtual 
cut-through [Ker 79] strategies on the mean packet delay under different communication services
(end-to-end and hop-by-hop). We then address the problem of congestion and propose a 
feasible solution to control the flow of packets inside the network. We also simulate in section 
4.3 the effect of adding more injection and routing branches on the mean packet delay and 
the system throughput. Section 4.4 investigates a queuing model of the 2D structure from 
which we analyse the load balancing of the traffic loads, the stability region of the network 
and the optimal structure that gives a minimum average packet delay for a fixed number of 
gateways, supporting our results by simulations. We will demonstrate that, under the Poisson 
distribution input load, a balanced 2D network which has the same traffic load in all its 
branches can be constructed by making the number of injection branches twice the number
of routing branches. Furthermore, we show that this network has the widest stability region 
and approaches an optimal structure. Finally, in section 4.5 we determine the relationship 
between the network capacity, the delay of the interface and the application running in it 
using a Poisson process model. This relationship confirms that the 2D structure based on 
40Mbits/sec serial buses using only 512 gateways can afford an adequate bandwidth for 800 
T414 transputers.
[Figure 4.1 diagram: panel (a) shows a single bus connecting host 1 to host n and their IMPs.]
Figure 4.1 A single bus holding N stations may be partitioned into a 2-dimensional 
multiple interconnected bus scheme containing G gateways and P routing 
branches.
4.2 Network Construction
Before studying the performance and the capacity of the 2D structure, we need to define 
the simulated network, to choose an efficient routing mechanism to pass data between 
processors and to solve problems that affect the flow of packets (e.g. congestion, buffer overflow
etc.). This section addresses these issues in turn.
The 2D multiple bus structure can be constructed as a special class of the generalised 
multiple interconnected buses (see chapter 2), where any two orthogonal branches (buses) 
are joined by a single gateway. Each gateway acts as a packet hardware router for the 
configuration. It receives, queues if the channel is busy and transmits packets at the effective 
capacity of the bus. Following this, the transmitted packet is accepted by another gateway 
for a final routing or reaches its destination station. Stations consist of Transputers and their 
interface message processor IMP which will be studied in the next chapter. The IMP 
decouples the communication between any transputer link and the bus of the 2D structure. 
The complete stations are allocated in special branches called "injection or absorption
branches" because they can only inject or absorb packets to or from the network. On the 
other hand, routing branches (containing only gateways) are responsible for the flow of 
packets (see figure 4.1b).
To build the simulation model, we have considered the network as composed of N 
stations and G gateways according to the following arrangement. Given P routing branches 
that each contain G/P gateways, we can equally partition the N stations into G/P injection branches, 
i.e. each injection branch contains NP/G stations (see figure 4.1b). The simulations carried 
out in this chapter use the following parameters:
T (mean transmission time) = 6.4 μsec,
Ta (acknowledge time) = 1.4 μsec,
Tp (mean processing time) = 500 μsec,
τ (end-to-end propagation delay) = 0.005 μsec,
collision limit = 8,
total number of exchanged messages = 10000.
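The arrangement described above can be written down as a small data structure. The following Python sketch is only an illustration of the partitioning rule (G/P injection branches, NP/G stations per injection branch, G/P gateways per routing branch); the class and attribute names are assumptions and are not part of the simulator used in this work.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Structure2D:
    """Shape of a 2D multiple bus network (hypothetical helper)."""
    n_stations: int   # N
    n_gateways: int   # G
    n_routing: int    # P

    def __post_init__(self) -> None:
        if self.n_gateways % self.n_routing != 0:
            raise ValueError("G must be a multiple of P")
        if self.n_stations % self.injection_branches != 0:
            raise ValueError("N must divide equally among the injection branches")

    @property
    def injection_branches(self) -> int:
        """Number of injection branches, G/P."""
        return self.n_gateways // self.n_routing

    @property
    def gateways_per_routing_branch(self) -> int:
        """Each routing branch holds G/P gateways."""
        return self.n_gateways // self.n_routing

    @property
    def stations_per_injection_branch(self) -> int:
        """Each injection branch holds NP/G stations."""
        return self.n_stations * self.n_routing // self.n_gateways


if __name__ == "__main__":
    net = Structure2D(n_stations=128, n_gateways=8, n_routing=2)
    print(net.injection_branches, net.stations_per_injection_branch)
```

For N = 128, G = 8 and P = 2 this gives 4 injection branches of 32 stations each.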
As mentioned in chapter 3, the average processing and transmission time have been 
chosen arbitrarily since the load to the network can be imposed by varying the number of 
stations, whereas the end-to-end propagation delay, the acknowledgement time and the 
collision limit are design dependent. Here, the acknowledgement time is assumed to be 1.4 
μsec in order to simulate the end-to-end services which require larger acknowledgements. 
Obviously, for any simulated situation which represents a single point in any figure, the 
consideration of a large number of messages (e.g. at least 10000) gives accurate results. 
Finally, the structure can be shaped by varying the total number of gateways G and arranging 
the corresponding number of routing branches P within each particular run of the simulation 
depending on the purpose of the experiment.
4.2.1 Routing Packets in an Unreliable Environment
Within the simulated model, packets travel from one station to another, crossing a fixed 
number of gateways and branches. In an unreliable network there is no guarantee that they 
will successfully reach their destination.
Among the obstacles faced are packet loss due to errors introduced by the channel, 
collisions induced by the protocol, timing problems at receivers and buffer overflow in 
intermediate nodes.
One obvious way to indicate a successful transmission is by the provision of an 
acknowledgement packet which has already been introduced to the CSMA/CD protocol [Tok 
77]. Such an acknowledgement can be exploited efficiently through various services.
For instance, with the hop-by-hop service an acknowledgement is required in every 
bus crossed by each packet, increasing the message arrival rate in each branch. The end-to-end 
service which provides complete flow control between the two communicating entities allows 
the acknowledgements to be issued from destination stations and therefore, requires more 
knowledge about the origin of messages. Packets in the end-to-end service must carry more 
information (e.g. the addresses of the intermediate stations), which has to be handled by 
more complex hardware.
On the other hand, the transport of messages from one place in the network to another 
whilst limiting the latency of delivery is another challenging issue.
For a general purpose distributed computer, the network is required to support arbitrary 
and dynamic forms of inter-processor communication. If circuit switching techniques are 
used, the considerable amount of static bandwidth allocated to each connection will be wasted 
during the connection and disconnection phases. Simulation has revealed that for a 2D torus 
of 256 nodes, the recoiling circuit switching communication network outperforms 
packet-switching over an average distance of 5 hops and message length in excess of 64 bytes 
[Who 89]. However, the packet-switching technique simulated was based on the inefficient 
store-and-forward routing mechanism [Mer 80, Gel 81, Tan 88] which is no longer considered 
an optimum routing method due to its extremely high latency, especially for long messages 
and for routes with a large number of hops. Without loss of generality, its latency for a 
1 Mbit/sec transmission rate is expressed as $SF = (W + V) \times H$, where W is the message length, 
V its addressing overhead and H the distance travelled; it thus depends on the product of 
the packet transmission time and the distance travelled.
Fortunately, in computer networks, there are other ways messages can be routed bit 
serially to their destinations:
Virtual Cut-through and Wormhole
The virtual cut-through routing technique [Lau 91, Ker 79] behaves in the same way as 
wormhole routing [Dal 87], in that the head of the packet (V) can reach its destination 
station before the whole packet has left the source station. Contrary to the store-and-forward 
technique, in these schemes only the fraction of each packet on which the routing decision is 
made is held in each intermediate node. Their latency is therefore dependent on the sum, 
rather than the product, of the packet transmission time and the travelled distance 
($VCT = V \times H + W$). Unlike wormhole routing, the virtual cut-through scheme obtains even 
better performance by removing blocked packets from the network and buffering them and, 
hence, improves the hot-spot performance of the system and increases its throughput.
Mad-Postman
As each message is delayed in each crossed node by just one bit, the Mad-Postman 
routing mechanism [Yan 89] considerably reduces the latency. In general, the latency of this 
routing mechanism can be expressed as $MP = k(H - 1) + V + W$. Although the research group 
claimed that k could be made smaller than one if other means of communication are 
implemented (e.g. optical or asynchronous), it has only built synchronous Mad-Postman 
hardware, for which $k = 1$.
In principle, for a system of channel width equal to the length of the address headers 
(V), there is no difference in communication latency between the virtual cut-through and the 
mad postman routing schemes.
Street Sign
In this routing technique [Bor 88, Kun 89], address headers are constructed as a function 
of the distance travelled by each packet. Each partial header identifies an output channel 
from any node it visits. As a packet progresses through the network, the header is 
progressively discarded until the destination processor is reached. It has been shown that the 
latency obtained in a 2D mesh network is $SS = 3 \times H + W$. The connection 
latency per node (= 3) consists of two cycles needed to receive the leading header, and another 
cycle to obtain and then output the beginning of the next. This routing is therefore equivalent 
to the virtual cut-through when the address overhead (V) is equal to three bits. Furthermore, 
the street sign routing exhibits smaller latency than the mad postman and the virtual 
cut-through for short distances.
The street sign cannot be used for our 2D network, because each gateway has one 
outgoing channel and has to be identified by the whole address header (V). The mad postman, 
on the other hand, will generate approximately $P \times G$ (P is the number of routing buses and G 
the total number of gateways) dead addresses for every packet. Such unwanted 
traffic will probably increase the load on the network. Fortunately, the latency of the virtual cut-through 
routing applied to a fixed number of hops reduces to $3V + W$. This virtual 
cut-through routing therefore exhibits smaller latency than the mad postman and the street sign 
when their packets travel distances greater than $2V + 1$ and $V$ respectively.
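The latency expressions quoted in this section can be compared numerically. The Python sketch below simply evaluates $SF = (W + V) \times H$, $VCT = V \times H + W$, $MP = k(H - 1) + V + W$ and $SS = 3 \times H + W$ in bit times. The example values (W = 256 bits, V = 8 bits) are assumptions for illustration, and the function names are not taken from any of the cited works.

```python
def store_and_forward(w: int, v: int, h: int) -> int:
    """SF = (W + V) * H: the whole packet is stored at every hop."""
    return (w + v) * h


def virtual_cut_through(w: int, v: int, h: int) -> int:
    """VCT = V * H + W: only the header is held at every hop."""
    return v * h + w


def mad_postman(w: int, v: int, h: int, k: int = 1) -> int:
    """MP = k(H - 1) + V + W: k bit times of delay per crossed node."""
    return k * (h - 1) + v + w


def street_sign(w: int, h: int) -> int:
    """SS = 3 * H + W: three cycles of connection latency per node."""
    return 3 * h + w


if __name__ == "__main__":
    W, V = 256, 8                        # assumed message and header lengths (bits)
    vct_2d = virtual_cut_through(W, V, 3)  # fixed 3 hops in the 2D structure
    print("VCT over 3 hops:", vct_2d)
    # Distances beyond which the other schemes exceed 3V + W.
    print("MP exceeds it for H >", 2 * V + 1, "->", mad_postman(W, V, 2 * V + 2) > vct_2d)
    print("SS exceeds it for H >", V, "->", street_sign(W, V + 1) > vct_2d)
```

The two comparisons at the end reproduce the crossover distances $2V + 1$ and $V$ stated in the text for a fixed three-hop virtual cut-through route.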
The Simulation of the Routing Mechanisms in our 2D Network
Typically, the addressing overhead (V) for our 2D structure is 8 bits. Our simulation 
results, in addition to the previous parameters, have been obtained by varying the number of
stations (N) accommodated by the network comprising 8 gateways and 2 routing branches 
($0 < N < 128$). Because each station injects on average 2000 messages/sec (i.e. $\lambda = 1/T_p$), 
the load to the network can be varied accordingly.
As shown in figure 4.2, the virtual cut-through scheme with any service (hop-by-hop, 
end-to-end, no acknowledgement) is superior to the store-and-forward method. It is only at 
high loads, where most of the packets cannot cut through gateways and get entirely buffered, 
that the two routing techniques start exhibiting the same mean packet delay. It is clear that 
when the system is subjected to a load of 40%, the network latency is about 1.5 times the 
packet transmission time (e.g. 9.6 μsec for a packet of length 32 bytes transmitted at 
40Mbits/sec). Therefore, the virtual cut-through routing mechanism and our proposed 2D 
structure contribute to the design of an efficient communication network for highly parallel 
machines.
Another distinction, illustrated by the simulation results for the virtual cut-through 
scheme in figure 4.3, is that gateways with a limited number of buffers (e.g. 2) 
restrict the total traffic in the network by causing packets to be buffered at their source 
stations. Hence, a smaller mean packet delay is achieved at the network-to-network 
level.
[Figure 4.2 plot: normalised delay D/T versus total throughput S (number of packets per packet time).]
Figure 4.2 The simulated normalised mean packet delay (actual delay over the packet 
transmission time, D/T) as a function of the total channel utilisation 
(throughput per packet time) for the 2D structure composed of 8 gateways 
and 2 routing branches. The graphs of the store-and-forward (SF) and the 
virtual cut-through (TT) routing mechanism performance under the no 
acknowledgement (noACK), hop-by-hop (HHACK) and end-to-end 
(EE.ACK) services have been obtained by varying the load with the number 
of stations (1 < N < 128).
[Figure 4.3 plot: normalised delay D/T versus total throughput S (number of packets per packet time).]
Figure 4.3 The simulated normalised mean packet delay (actual delay over the packet 
transmission time, D/T) as a function of the total channel utilisation 
(throughput per packet time) for the 2D structure composed of 8 gateways 
and 2 routing branches. The load to the network is varied by changing the 
number of stations (1 < N < 128) and each gateway is assigned two and 20 
buffers to generate the two graphs respectively.
4.2.2 Congestion Avoidance in the Network
When too many packets are present in all or part of the 2D network, the performance 
degrades. This situation is called congestion [Tan 88]. The branches become congested 
because of the traffic introduced through gateways even if the traffic generated within 
themselves is not heavy. In addition, an overloaded branch due to the traffic generated 
internally becomes more congested from other traffic coming into it through its gateways.
As a result, buffers in the paths of the congested traffic flow become full and start losing 
packets. Obviously, transmitting stations or gateways repeatedly retransmit lost packets 
resulting in the back-pressure effect.
Mechanisms for controlling the congestion in computer networks (e.g. Arpanet) are 
described in the reference [Hea 88, Tan 88]. In particular, Bux [Bux 84] has used flow control 
in his Token Ring network proposal. Nishida [Nis 86] has utilised emigrant suppressing, 
together with p-persistent CSMA/CD to deal with congestion in two sub-networks connected 
by a gateway.
In all situations, congestion occurs because parts of the network are overloaded. 
Although it is application-dependent, our network reduces the congestion effect by 
distributing the load equally over all branches. This is possible by adopting the semi-adaptive 
routing algorithm, arranging stations in specific branches, using gateways of a fixed buffer 
size that limits the flow in the network (Figure 4.3) and by relying on the process-to-process 
flow control to limit the flow of data into the network.
In order to reduce this back-pressure effect and to prevent the reduction in channel 
utilisation, we require that a jamming signal sent by a receiving node having a full buffer 
should collide with the incoming packet. Thus, it informs the sender of a full buffer before 
it completes its whole packet transmission, more specifically, when only the address header 
has been identified by the receiver. The FIFO model that we propose in chapter 6 permits 
an efficient implementation of this flow control at the network level.
4.3 Expansion of the 2D Structure
To provide a system which can handle high traffic loads and accommodate a large 
number of processors, it is necessary to provide more bandwidth for the structure. The 
purpose of this expansion, besides using the full potential of the virtual cut-through routing 
to obtain minimum latency, is to reduce the total average packet delay and to increase the 
system throughput.
4.3.1 Transforming a Single Bus into a 2D Structure
Obviously, by decomposing the single bus (Figure 4.1) into smaller interconnected 
branches, the performance in terms of the mean packet delay and the throughput is improved.
In figure 4.4, for instance, the simulations show that this single bus saturates at a load 
of 20.5%, whereas the 2D structure, which adopts the two routing mechanisms, copes with 
higher traffic loads.
However, regarding the communication latency for a lightly loaded network, the single 
bus has smaller mean delay up to loads of 16% and 7.5% respectively compared to the 2D 
structure which uses the store-and-forward and the virtual cut-through routing mechanisms. 
This difference is due to packets being completely stored in their intermediate nodes for the 
former and to the latency of the headers of the packets, on which gateways have to make 
routing decisions, for the latter. The difference in mean packet delay at smaller traffic load 
(<0.3 packets per packet time corresponding to a load of 7.5%) existing between the 2D 
structure adopting the virtual cut-through and the single bus is not worth considering as the 
system is meant to hold a large number of processors where high bandwidth is required.
[Figure 4.4 plot: normalised delay D/T versus total throughput S (number of packets per packet time).]
Figure 4.4 The simulated normalised mean packet delay (actual delay over the packet 
transmission time, D/T) as a function of the total channel utilisation 
(throughput per packet time) for a single bus and the 2D structure composed 
of 8 gateways and 2 routing branches. In all situations, the load to the network 
is varied by changing the number of stations (1 < N < 128). For the 2D 
structure, both routing mechanisms, store-and-forward (SF) and virtual 
cut-through (TT), are used under the same hop-by-hop service (HH).
4.3.2 Broadening the 2D Structure by Varying its Physical Parameters (G,P)
In order to simulate and study the mean packet delay and the throughput of the 2D 
structure under the impact of varying the number of injection and routing branches, we adjust 
the input load of the network to 0.718 messages per packet time. This load is generated by 
64 stations which, after each average processing time (Tp) of 570 μsec, inject a message of 
length 32 bytes ($S = NT/T_p$). It will be shown in section 4.4.3 that the optimisation of the 
number of injection and routing branches is independent of the routing mechanism used 
(store-and-forward or virtual cut-through). Therefore, the utilisation of the store-and-forward 
routing scheme during the following simulations does not affect the results obtained. All 
simulated results are also supported by the analytical ones using equations 4.1, 4.2 and 4.5.
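As a quick check of the input load used in these simulations, the following Python sketch evaluates the packet time of a 32-byte message on a 40 Mbit/sec bus and the offered load $S = NT/T_p$ for 64 stations with $T_p$ = 570 μsec. The function names are assumptions introduced for this illustration.

```python
def packet_time_us(length_bytes: int, rate_mbit_s: float) -> float:
    """Transmission time in microseconds of a packet on a serial bus."""
    return length_bytes * 8 / rate_mbit_s


def offered_load(n_stations: int, t_packet_us: float, t_processing_us: float) -> float:
    """Offered load S = N * T / Tp, in packets per packet time."""
    return n_stations * t_packet_us / t_processing_us


if __name__ == "__main__":
    t = packet_time_us(32, 40.0)       # 32 bytes at 40 Mbit/sec
    s = offered_load(64, t, 570.0)     # 64 stations, Tp = 570 microseconds
    print(f"T = {t:.1f} us, S = {s:.4f} packets per packet time")
```

The result is a packet time of 6.4 μsec and a load of about 0.7186 packets per packet time, in agreement with the value of 0.718 quoted above.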
If the number of injection branches (G/P) is kept constant while more routing branches 
(P) are added by increasing both G and P accordingly, the traffic load in the injection branches 
containing stations and gateways stays constant. Meanwhile, the traffic load in the routing 
branches that hold only gateways decreases as the adopted semi-adaptive routing algorithm 
distributes the load equally among them (see the branch throughput in figure 4.5(b)). As 
more bandwidth is being supplied as a result of this expansion, the mean delay decreases to 
a limit determined by the traffic existing in the injection branches only (see the normalised 
delay in figure 4.5(a)).
[Figure 4.5 plots: (a) normalised delay D/T and (b) branch throughput of the injection and routing branches, each versus the number of routing branches P.]
Figure 4.5 The normalised mean packet delay (a) (actual delay over the packet 
transmission time, D/T) and the branch throughputs (in packets per packet time) 
(b) as a function of the number of routing branches in the 2D network having 
4 injection buses. The continuous and dashed lines represent the calculated 
results, whereas the discrete points (*,+) are obtained from simulation.
On the other hand, if the number of gateways (G) is increased while the same number 
of routing branches (P) are retained, the traffic load in the injection branches decreases leaving 
that in the routing branches constant. Therefore, as previously observed, the mean packet 
delay will be bounded by the traffic load in these routing branches only (Figure 4.6).
[Figure 4.6 plots: (a) normalised delay D/T and (b) branch throughput, each versus the total number of gateways G.]
Figure 4.6 The normalised mean packet delay (a) (actual delay over the packet 
transmission time, D/T) and the branch throughputs (in packets per packet time) 
(b) as a function of the total number of gateways in the 2D network having a 
fixed number of routing branches (inset). The continuous and dashed lines 
represent the calculated results, whereas the discrete points (*,+) are 
obtained from simulation.
Finally, if more injection and routing branches are added as a way of increasing the 
two physical parameters of the network (G,P), the mean packet delay can be reduced 
until it reaches the latency of the store-and-forward or the virtual cut-through routing 
techniques in a lightly loaded network. This procedure shows that it is possible to provide 
more bandwidth, which helps in decreasing the communication message latency, by increasing 
the number of both injection and routing branches.
By investigating the queuing model of the 2D structure under a Poisson process 
probability distribution in the next section, we will establish the relation by which both injection 
and routing branches can be adjoined so that higher throughput and lower mean packet delay 
can be achieved.
4.4 The Queuing Model of the 2D Structure
From a mathematical point of view, the 2D structure can be viewed as a collection of 
queues interacting with each other, where packets enter the system at various points and 
queue for services. Upon departure from a given channel (or queue), they proceed to other 
queues to receive additional services (Figure 4.7).
Figure 4.7 The queueing model for the 2D structure
In order to study such a queuing system, we apply the well known theorems of Burke 
and Jackson. At steady state, the output of a stable M/M/m queue with input parameter $\lambda$ 
and service time parameter $\mu$ for each of the m channels is in fact a Poisson process at the 
same rate $\lambda$ [Bur 56]. Furthermore, in an arbitrary network of queues, the total packet rate 
of a given queue is the sum of all inputs arriving at it. As a result, each queue can be studied 
independently as an M/M/m Markovian queue provided that the system is stable with 
Markovian inter-arrival and service times.
The theorem of Jackson [Jac 57] points out that networks with feedback are such that 
the individual nodes behave as if they were fed totally by Poisson arrivals, where in fact they 
are not. This theory has been applied by Kleinrock to study computer networks where he 
made the "independence assumption", which allows each node of a computer network to be 
studied independently as a separate single queue [Kle 64]. In our particular network the 
independence between input rate at each queue level, which is effectively the output of some 
others, is restored by merging many inputs and considering collisions and retransmissions. 
Hence, the theorem of Jackson supported by simulations can be applied to analyse and study 
the behaviour of our network for different pertinent situations. As a result, the total traffic 
load in each injection and routing branch can be expressed as
$$\lambda_s(i) = \lambda_I(i) + \frac{vP}{G} \sum_{l=1}^{G/P} \lambda_I(l) \qquad (4.1)$$

$$\lambda_g(j) = \lambda_g = \frac{v}{P} \sum_{i=1}^{G/P} \lambda_I(i) \qquad (4.2)$$

where $\lambda_I(i) = \lambda N_s$ is the message input rate per unit of time to each injection branch having
$N_s$ stations, $v$ the probability by which messages migrate to other injection branches, $G$ the 
total number of gateways, $P$ the number of routing branches and $\lambda_s(i)$ and $\lambda_g(j)$ are the traffic 
loads in a given injection ($i$) or routing ($j$) branch. It is noticed from equation 4.2 that all 
routing branches have an equal traffic load ($\lambda_g(j) = \lambda_g$ for all $1 \le j \le P$). This is guaranteed 
by the semi-adaptive routing algorithm introduced in chapter 2 that avoids heavily loaded 
branches and, hence, through a random selection distributes the traffic load equally among 
all the routing branches.
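The two traffic-load expressions can be evaluated with a short Python sketch. It assumes the forms $\lambda_s(i) = \lambda_I(i) + (vP/G)\sum_l \lambda_I(l)$ and $\lambda_g = (v/P)\sum_i \lambda_I(i)$, with the sums taken over the G/P injection branches, as reconstructed from the surrounding definitions. The function names and the numerical example are assumptions for illustration.

```python
def injection_load(i: int, inputs: list[float], v: float) -> float:
    """Traffic load of injection branch i (equation 4.1).

    inputs[l] is the message input rate of injection branch l; there are
    G/P such branches, so vP/G equals v / len(inputs)."""
    return inputs[i] + (v / len(inputs)) * sum(inputs)


def routing_load(inputs: list[float], v: float, p: int) -> float:
    """Traffic load of any routing branch (equation 4.2)."""
    return (v / p) * sum(inputs)


if __name__ == "__main__":
    # Hypothetical case: 4 injection branches with unit input rate, v = 1.
    inputs = [1.0] * 4
    for p in (2, 4, 8):
        print(p, injection_load(0, inputs, 1.0), routing_load(inputs, 1.0, p))
```

With the number of injection branches held at 4, the injection load stays at 2.0 while the routing load falls from 2.0 to 1.0 to 0.5 as P grows from 2 to 8, which is the behaviour observed in section 4.3.2.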
4.4.1 The Balanced Network
We define a balanced 2D structure as a network having the same traffic load in all its 
branches. Formally, for any injection branches $i$ and $k$ we can write $\lambda_s(i) = \lambda_s(k) = \lambda_g$ 
where $i \neq k$. In other words, the traffic load in all the injection and routing branches is the 
same.
By using equations (4.1) and (4.2) with the above relations, we can express the condition 
for balancing the 2-dimensional network as

$$G = P^2 (1 + v^{-1}), \qquad (4.3)$$

with $v \neq 0$ indicating that not all generated messages should be consumed in their local 
branches. Since each injection branch contains more traffic load (generated from itself and 
arrived from others) than the routing ones, one would expect that the number of injection 
branches ($G/P$) be twice that of the routing ones ($P$) so that the traffic becomes evenly 
distributed over all branches.
The useful conclusion to be extracted from equation (4.3) is that, provided the rate of 
message arrival in the system obeys a Poisson distribution and the stations are equally 
partitioned into their injection branches, the 2-Dimensional network can be balanced using 
only the structure parameters (G and P). In the extreme case, when all the traffic is non-local 
(v=1), the optimal number of gateways constructing the network is related to the 
routing branches by G = 2P².
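The step from equations (4.1) and (4.2) to (4.3) can be spelled out as follows. This is a sketch under the equal-partition assumption used in the text, so that every injection branch holds N_s = NP/G stations and λ_1(i) = λNP/G for all i; N denotes the total number of stations.

```latex
\begin{align*}
\lambda_s &= \lambda_1 + \frac{vP}{G}\cdot\frac{G}{P}\,\lambda_1
           = (1+v)\,\frac{\lambda N P}{G},\\
\lambda_g &= \frac{v}{P}\cdot\frac{G}{P}\cdot\frac{\lambda N P}{G}
           = \frac{v\,\lambda N}{P},\\
\lambda_s = \lambda_g
  \;&\Longrightarrow\; (1+v)\,\frac{P}{G} = \frac{v}{P}
  \;\Longrightarrow\; G = P^2\,\frac{1+v}{v} = P^2\,(1+v^{-1}).
\end{align*}
```

Setting v = 1 gives G = 2P², that is G/P = 2P injection branches for P routing branches.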
The importance of balancing the network is to give a fair access to all the routing 
resources and to reduce the congestion in all or part of the system. As a consequence, packets 
will be circulating in the network with a finite average delay.
4.4.2 The Stability Region of the Network
Merakos [Mer 87] has studied the stability criteria for the interconnection of two LANs 
by a bridge (gateway). In our case, we have an arbitrary number of nodes which form a 
different but regular structure. Hence, using the regularity and the symmetric properties of 
our topology, we can study the stability of the system by considering each sub-system (or 
branch) alone. By definition, a system is stable if all its sub-systems are stable [Mer 87]. 
Thus, the 2-dimensional network is stable if the traffic offered to each branch does not 
exceed its maximum throughput.
In chapter 3 section 5, we have shown that for small end-to-end propagation delays, 
the collision window is very small and hence the maximum throughput or the channel 
utilisation of each branch can be approximated by S_max = 1/(1 + β), where β is the ratio of the 
acknowledgement to the data packet time. Using equations 4.1 and 4.2 for an equal partition, 
that is each injection branch contains the same number of stations (λ_s(i) = λ_s for any 
1 ≤ i ≤ G/P), the stability condition of the network can be stated as
$$\lambda N T < \min\left( \frac{G}{P(1+v)(1+\beta)},\ \frac{P}{(1+\beta)v} \right) \tag{4.4}$$
The total message input rate λN to the system coming from N stations is maximised 
when the two quantities G/((1 + v)(1 + β)P) and P/((1 + β)v) are equal. This is obviously the 
balanced equation of the network having an equal distribution of traffic loads in all its branches 
under the stability condition λNT < P/((1 + β)v). Theoretically, the total message input rate
is limited by the number of routing branches P determined by the total number of gateways 
(√(G/2)). It is clear that, in the worst case, such a restriction imposes a limit on the total 
message input rate and hence on the bandwidth provided for each processor within the network 
(0.7√G/N). Despite this, our 2D network shows a considerable improvement in processor 
bandwidth over the single bus (1/N), the multi-bus (M/N, where M is the number of lines 
connected to each processor [Hwa 85]) and the spanning bus (2a/N) [Wit 81]. The maximum 
achievable throughput or channel utilisation of our network is therefore S_max = 3P/(1 + β).
This last result shows that it is always possible to increase the maximum throughput 
and hence the stability region by providing more buses or by reducing the communication 
protocol overhead (e.g. the acknowledgement length and packet header size).
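The stability bound of equation 4.4 is easy to explore numerically. The sketch below is illustrative only; the function name is not from the thesis, and the bound is expressed in messages per packet time (λNT).

```python
def max_input_rate(G, P, v, beta):
    """Upper bound on lambda*N*T given by equation 4.4 (illustrative sketch).

    G     total number of gateways
    P     number of routing branches
    v     probability that a message is non-local (v > 0)
    beta  ratio of acknowledgement time to data packet time
    """
    injection_limit = G / (P * (1 + v) * (1 + beta))  # injection branches saturate
    routing_limit = P / ((1 + beta) * v)              # routing branches saturate
    return min(injection_limit, routing_limit)


if __name__ == "__main__":
    # Hypothetical comparison for G = 512 gateways, v = 1 and a negligible acknowledgement.
    for P in (8, 16, 32):
        print(P, max_input_rate(512, P, v=1.0, beta=0.0))
```

For G = 512 and v = 1 the bound peaks at P = 16, which is the balanced choice G = 2P²; with too few routing branches the routing limit binds, and with too many the injection limit binds.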
4.4.3 The Optimal Network
We define an optimal 2D structure as the network in which packets flow with a minimum 
average delay. In our topology, it is possible to add more stations, a corresponding number 
of gateways and routing branches without affecting the traffic load in each individual branch. 
As a consequence, the mean packet delay stays constant even if the system holds a large 
number of stations. By using equations 4.1-4.3, it is clear that the traffic load of each branch, 
expressed as λvN/P = λN√(v(1 + v)/G), can be kept constant by a corresponding increase in the 
total number of gateways even though more stations could be added to the network.
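The scaling argument can be checked with a few lines of code. This is a minimal sketch that solves equation 4.3 for P and substitutes it into equation 4.2 for equally loaded branches; the function name and the sample sizes are hypothetical.

```python
import math


def balanced_branch_load(lam, N, G, v=1.0):
    """Per-branch traffic load of a balanced network (illustrative sketch).

    Equation 4.3 gives P = sqrt(G * v / (1 + v)); equation 4.2 with equal
    injection branches then gives a load of v * lam * N / P per branch.
    """
    P = math.sqrt(G * v / (1 + v))
    return v * lam * N / P


if __name__ == "__main__":
    # Doubling the stations while quadrupling the gateways keeps the load unchanged.
    print(balanced_branch_load(0.001, 200, 512))
    print(balanced_branch_load(0.001, 400, 2048))
```

Because the load varies as N/√G, the number of gateways has to grow with the square of the number of stations to hold the per-branch load constant.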
The idea of an optimal structure is different; given a fixed number of gateways G and 
a fixed number of stations N , we want to find the number of routing branches P so that the 
total average delay is minimised. The fulfilment of this requirement can be obtained by 
finding an expression of the mean packet delay and differentiating it with respect to the 
parameter P (dD/dP = 0). We will show that the optimal 2-Dimensional structure is 
independent of the routing mechanisms (i.e. store-and-forward or virtual cut-through). It does 
however depend slightly on the traffic load imposed on the system and strongly on the physical 
parameters forming the network (i.e. G and P). In particular, this optimised structure can be 
approximated by the balanced network having the same traffic in all branches.
In general, when packets are generated by stations in a given injection branch, they 
may locally stay in it with a probability (1-v) or migrate to other branches with a probability 
v by crossing three consecutive buses. Therefore, the total mean packet delay for an equal 
distribution of stations - i.e. all injection branches contain the same number of stations - can 
be expressed as
$$D = (1 - v)\, d_s + v\,(2 d_s + d_g) \tag{4.5}$$
where d_s and d_g are the mean packet delays of the injection and routing buses respectively 
under the CSMA/CD arbitration protocol. If v=0, the packet remains in its local injection 
branch for a time delay of D = d_s, whereas if v=1, the packet travels to another injection 
branch crossing a routing bus in a total time delay of D = 2d_s + d_g.
The expressions of both d_s and d_g can directly be evaluated from equation 3.5 of chapter
3, using the two relations 4.1 and 4.2. It is noticed that, unlike the store-and-forward routing 
mechanism, which uses these delays in their entirety, the virtual cut-through reduces them 
by a constant factor (1 - t/T), where t is the time to transmit the header of the packet. After 
substituting these expressions for both routing mechanisms into equation 4.5, it can be shown 
that its differentiation with respect to P gives the result
$$P_{opt} = \frac{\lambda N v X + v\sqrt{G}/(1+v)}{1 + \lambda N v X/\sqrt{G}} \tag{4.6}$$
where X is the service time of each bus. It is thus concluded that the optimal number of 
routing branches which minimises the mean packet delay in the 2D structure is independent 
of the type of routing mechanism used. As shown in figure 4.8, this optimal number of 
routing branches is slightly dependent on the total throughput of the system in the stability 
region. As expected, more routing buses are required to cope with intensive traffic. In 
practice, the optimised 2-Dimensional structure can be constructed approximately as a 
balanced network whose shape is solely determined by the physical parameters of the network 
(G, P).
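The optimum of equation 4.6 can be compared with a direct numerical minimisation of equation 4.5. The sketch below does not use equation 3.5 of chapter 3, which is not reproduced in this section; it assumes instead a simple queueing form d = X/(1 - ρ) for each bus, with ρ the branch load times the service time X. All function names are illustrative.

```python
import math


def mean_delay(P, G, v, load, X=1.0):
    """Equation 4.5 under an ASSUMED bus delay d = X / (1 - rho).

    load is the total input rate lambda*N; X is the bus service time.
    Returns infinity outside the stability region.
    """
    rho_s = (1 + v) * load * X * P / G   # injection branch load, from eq. 4.1
    rho_g = v * load * X / P             # routing branch load, from eq. 4.2
    if rho_s >= 1 or rho_g >= 1:
        return math.inf
    d_s = X / (1 - rho_s)
    d_g = X / (1 - rho_g)
    return (1 - v) * d_s + v * (2 * d_s + d_g)


def p_opt(G, v, load, X=1.0):
    """Closed form of equation 4.6 as reconstructed in the text."""
    b = load * v * X
    return (b + v * math.sqrt(G) / (1 + v)) / (1 + b / math.sqrt(G))


if __name__ == "__main__":
    G, v, load = 512, 1.0, 4.0                       # hypothetical operating point
    grid = [5 + k * 0.001 for k in range(25001)]     # P from 5 to 30
    best = min(grid, key=lambda P: mean_delay(P, G, v, load))
    print(best, p_opt(G, v, load), math.sqrt(G / 2))
```

Under the assumed delay model the grid minimum coincides with the closed form, and both lie close to the balanced value √(G/2), in line with the remark that the optimal structure can be approximated by the balanced one. P is treated as a continuous quantity here; a real network would round it to an integer.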
Figure 4.8 The calculated optimal number of routing branches P_opt, as a function of the 
number of gateways constructing the 2D structure for different channel 
utilisations (throughput per packet time). The continuous line represents the 
balanced network.
Furthermore, the results of the simulations reported in figure 4.9 confirm the existence 
of an optimal network.
This behaviour can be explained by the traffic load graph in figure 4.9(b). When there 
are only a few routing branches (P is small), each is congested by the non-local traffic flow 
resulting in a large average packet delay. When the number of routing branches increases 
(P is large), the injection branches hold a large number of stations and hence the average 
packet delay increases as the traffic load in these branches grows (figure 4.9(a)). The optimal 
arrangement is therefore somewhere between small and large values of P as given by equation 
(4.6) above.
[Figure 4.9, panels (a) and (b): the horizontal axis of both panels is the number of routing branches P.]
Figure 4.9 The normalised mean packet delay (a) (actual delay over the packet 
transmission time D/T) and the branch throughputs (in packets per packet time) 
(b) for a fixed number of gateways as a function of the number of routing 
branches. The continuous and dashed lines represent the calculated results, 
whereas the discrete points (*,+) are obtained from simulation.
4.5 Number of Transputers Accommodated by our Balanced 2D Network
The objective of this section is to study the inter-relation between the message injection 
rate from processors to the network and the bandwidth provided by the 2D-structure. This 
relationship will be expressed in terms of the number of transputers accommodated by our 
2D network.
Using equations 4.1 and 4.3, the total message input rate to the balanced 2D network 
can be deduced as √2 N/(√G (T_p + W/r)), where T_p is the average processing time of each 
transputer between attempted inter-processor communications, W the length of a message, 
G the total number of gateways constructing the balanced 2D structure and r the effective 
link rate of the transputer. Here, the interface latency is mainly dominated by the 
communication time through the transputer link (W/r). We will study in the next chapter the 
effect of the software protocol of the interfacing on the communication latency. As will be 
shown shortly, this can be made independent of the number of transputers.
To operate within the stability region of the network (see equation 4.4), the message 
input rate, normalised by the packet time (W/R, where R is the capacity of the bus), has to 
be limited by the effective capacity afforded by each branch. Quantitatively, it should be 
limited by the maximum achievable throughput. Moreover, as has been confirmed through 
simulations in section 3.4, the packet transmission time in the high priority CSMA/CD 
environment dominates the collision intervals, and smaller acknowledgements will also be 
designed in chapter 6. Therefore, the maximum throughput or the channel utilisation of a 
single branch approaches unity. Besides, each station can hold four fully connected 
transputers (see section 5.2), which yields
$$N < \frac{R\sqrt{8G}}{r}\left(1 + \frac{r\,T_p}{W}\right) \tag{4.7}$$
Many useful characteristics can be deduced from this equation. Firstly, with the low 
latency interfaces exhibited, for instance, by parallel I/O ports connected directly to the host 
bus such as the Mad-Postman interface, the effective input rate will be larger than the capacity 
of a single bus. As a consequence, the number of processors supported by the network will 
be dramatically reduced. Secondly, computationally intensive applications (T_p ≫ 1) allow the 
network to accommodate far more processors. Finally, the most important result that can be 
deduced from this equation is useful in a design process. Given that the capacity of each bus 
is larger than the effective link rate of the transputers involved (R>r), it is always possible 
to connect a large number of these processors - determined only by N < R√(8G)/r - to our 2D
network regardless of the application (T_p = 0) or the message length used. To illustrate this 
last result, let us consider an example. Suppose that we construct a 2D network from G=512 
gateways with each bus having a capacity of 40 Mbits/sec. We can accommodate within this 
system 200 T414 transputers distributed evenly across the 32 injection branches, which can 
each transmit and receive at 0.4 Mbytes/sec (implying a bidirectional link rate of 0.8 
Mbytes/sec). Of course, this total traffic could be shared amongst more transputers with 
correspondingly lower communication requirements - see equation 4.7.
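The bound of equation 4.7 can be evaluated directly. The sketch below assumes the reading N < (R√(8G)/r)(1 + rT_p/W), which reduces to N < R√(8G)/r when T_p = 0; the function name, the unit choices and the sample values of T_p and W are illustrative assumptions.

```python
import math


def max_transputers(G, R, r, T_p=0.0, W=1.0):
    """Bound on the number of transputers from equation 4.7 (illustrative sketch).

    G    total number of gateways of the balanced 2D network
    R    bus capacity and r effective link rate, in the same units (e.g. Mbit/s)
    T_p  mean processing time between communications, in seconds
    W    message length, in units matching r (e.g. Mbit)
    """
    return (R * math.sqrt(8 * G) / r) * (1 + r * T_p / W)


if __name__ == "__main__":
    # G = 512 gateways, 40 Mbit/s buses, 0.4 Mbytes/s = 3.2 Mbit/s per link direction.
    print(max_transputers(512, R=40.0, r=3.2))
    # A hypothetical compute-bound case: 1 ms of processing per 1 kbit message.
    print(max_transputers(512, R=40.0, r=3.2, T_p=1e-3, W=1e-3))
```

With T_p = 0 and the unidirectional rate of 3.2 Mbit/s the bound is about 800, the population quoted in the conclusion of this chapter; using the bidirectional rate of 6.4 Mbit/s halves it, and any non-zero processing time raises it.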
4.6 Conclusion
The 2D structure, which is a particular class of the generalised multiple interconnected 
bus networks, has been investigated in this chapter. We have shown that the virtual 
cut-through routing mechanism with the hop-by-hop acknowledgement service allows packets 
to travel faster through the network with a minimum average delay time than the more 
complicated store-and-forward, street sign and mad postman schemes.
When we have to add more bandwidth to the system, we have verified that both the 
number of injection and routing branches have to be increased. Based on a Poisson process, 
such increases can be arranged so that the network becomes balanced - i.e. all the traffic 
loads of the branches are the same provided that the number of injection branches is twice 
that of the routing ones. The balanced network that can be constructed using only these 
physical parameters of the structure has the highest throughput and can be considered as the 
optimal configuration so that the mean delay time of any packet is minimised. In general, 
to handle any parallel application, the system relies on the application and the network level 
flow control mechanisms, the limited buffer size of each gateway and the routing algorithm 
which distributes the traffic load within each routing branch. As we will discuss in chapter 
6, the flow control of packets can be maintained effectively at the network level using our 
proposed FIFO model.
Finally, we have confirmed that each serial bus of the 2D structure limited to a capacity 
of 40Mbits/sec can support a moderate number of transputers when they are connected by 
serial interfaces, and that the routing buses can, together, support the evenly-distributed loads 
generated by a population of up to 800 transputers, using only 512 gateways. Of course, 
more transputers could be accommodated by our structure if more gateways were 
interconnected.
chapter 5
NETWORK INTERFACING: 
THE TRANSPUTER AS A COMPUTING NODE
5.1 Introduction
The design of an efficient interconnection network is not enough to exploit a high degree 
of parallelism. What is also needed is a fast and reliable interface between the processor and 
the network which, by using the best characteristics of the network, ensures that low latencies 
and high throughputs are achieved from application to application.
In many early message-passing concurrent computers, the overheads of 
processor/network interfacing, which supports the injection and absorption of packets, were 
swamped by the high latency of routing packets through the network. Evidently, such systems 
are inefficient for large-scale concurrent computers.
However, the virtual cut-through routing technique has reduced the network latency to 
an acceptable level over the store-and-forward routing mechanism and the improved network 
channel capacity has speeded up packet transmission to a degree which is now limited by 
the technology and hardware design. Consequently, the latency of the processor-to-network 
interface has become heavily exposed, and is now the major concern in our 2D-structure.
It is suggested that the best approach for model and process independent communication 
resources is to separate the issue of interfacing from that of routing. This allows networks 
to be developed independently by providing their own high bandwidth and low latency 
configuration to be exploited by various off-the-shelf microprocessors, equipped with a 
suitable interface device, or custom, model-specific processors designed to interface directly 
to the network.
In particular, the approach adopted in this thesis is to physically separate processors 
and their interfaces from the proposed hardware routers (gateways). This strategy, in addition, 
permits the accommodation of various word-length or types of processor and interfaces 
without requiring any compatibility with the 2D structure or its hardware routers, and offers 
a high degree of freedom in system design and integration.
The essential concern of this chapter, however, is a proposal of the transputer serial 
interface (SIMP) which can be used efficiently with the serial bus organisation adopted in
this project to facilitate the inter-process communication through the network, and we will 
study its limitations and the associated problems of passing messages from the processor to 
the network and vice-versa. This chapter is organised as follows.
In section 5.2 we suggest a possible interface to support the transputer on the 
interconnection network. We introduce our gap theory in section 5.3, with which we analyse the 
effect of the interface structure and its software protocol on the communication overhead. 
Within the framework of this theory, we also demonstrate in section 5.4 that a double buffer 
with a send-and-wait scheme [Hal 85] is appropriate for our interface. Section 5.5 is devoted 
to the investigation of the possibilities of reducing the effect of the software protocol, used 
to ensure a reliable communication between communicating entities, by overlapping it with 
the reception of packets via the channel. In section 5.6, we reduce the communication 
overhead introduced by the SIMP structure on large messages by splitting them into smaller 
sub-packets using pipelined transmission, and compare the total time to other known methods 
applied to a linear array of transputers. Finally, section 5.7 is devoted to the study of the 
performance of the whole system including network, processors and interfaces under an 
event-parallel application where we show a higher efficiency than that obtained from a 
directly-wired tree of transputers running the same processor-farm application.
5.2 A Transputer Interface
Early hypercube machines such as the Cosmic Cube [Sei 85], the Intel iPSC [Intel 85] 
and the Ametek system 14 [Ame 86] used standard off-the-shelf microprocessors and standard 
communication devices to fulfil the message-passing requirement in the system. 
Unfortunately, the software overhead associated with the events of receiving and transmitting 
messages via the DMA controller is extremely high [Ibb 89]. The approach of using custom 
processors that incorporate the communication resources within the processor chip has 
reduced the interface overhead to an acceptable level. The chip prototype of the Connection 
Machine [Hil 85] is such an example. This custom device contains 16 processors/memory 
cells and one router unit (for its packet-switched communication network) to link the chips 
to each other. The full integration of communication with processor design can also be found 
in both Inmos Transputers [Inm 91] and the Message-Driven processors (MDP) of the 
J-Machine [Dal 89]. The transputer supports a static, synchronous model of programming 
based on CSP [Hoa 78], and the MDP of the J-Machine sustains a dynamic, asynchronous 
model based on ACTORS [Agh 85].
The Mad-Postman interface, which contains two 8x32 memory mapped buffers, is used 
to support fast processors such as the transputer or the ARAM RISC device and preserve 
their speed by providing separate message routers. Upon receiving segments of data, the 
interface injects them into one of its associated Mad-Postman routers in a 3-bit wide 
format. Although this scheme seems to be feasible, it has been claimed that the latency of 
this interface is high since block moves in transputers are inefficient and injection/reception 
cannot be overlapped [Mil 91]. Furthermore, the interface supporting the Mad-Postman 
routers does not suit our serial bus configuration because the rate of injection of parallel data 
(32 bits) does not match the serial bus speed, and hence limits the number of stations 
competing for it (see equation 4.7 in section 5 of chapter 4). In addition, the unreliability of the 
serial communication channel and its collision-based protocol requires the use of an extra 
high level protocol and memory for error management which undoubtedly should not be the 
responsibility of the application processors. As a consequence, a serial and intelligent 
interface is required, which matches the speed of the bus, handles error management and 
provides a virtual circuit service to the application processes.
Since we are investigating the possibilities of interconnecting transputers, it is appealing 
to maintain compatibility with Inmos serial links by providing another Inmos transputer as 
a unique support device [Inm 87b]. For example, a 16-bit processor is desirable for its low 
cost and hardware simplicity (e.g. T212, T222).
Our Serial Interface Message Processor (SIMP) will therefore consist of an intelligent 
processor (e.g. T212 or T222), a memory to hold the code responsible for error management 
and message formatting to provide compatibility with the network and its communication 
protocol, buffering to hold packets ready for transmission and reception, and additional 
hardware to couple the SIMP to the bus. Three possibilities are discussed subsequently.
i- drawbacks with DMA controllers (Figure 5.1):
1- the cost of the SIMP,
2- the hardware complexity,
3- the software overhead (i.e. time to set up the DMA),
4- the overhead per packet, which includes two access times to the memory (i.e. read 
and write),
5- the suspension of the T2 processor during DMA activities.
ii- drawbacks with buffered links (Figure 5.2):
1- extra hardware for driving the link,
2- consumption of one link for the network interface,
3- more overhead per packet, consisting of two access times to the memory.
iii- desirable memory mapped buffer (Figure 5.3)
This structure incorporates a buffer (FIFO) or a parallel memory sharing the same 
address space with the main memory of the T2 Transputer. This approach, compared to the 
others, requires less hardware and offers less overhead time per packet because just one access 
time for retrieving or storing data to or from the interface buffer is needed. In other words, 
the total communication latency is dominated by the message transmission through the 
transputer link which preserves the reduction of interface overhead gained by Transputers. 
In addition, the T2 Transputer has full control of the bus and its elements and is totally 
separated from the lower levels.
A final possible structure of a station is given in figure 5.4 where four hosts are fully 
connected and supported by a single processor (T2) forming the heart of the interface. This 
configuration shows that it is possible for each SIMP to serve up to four processors, and 
hence minimises the amount of interface hardware per host.
Figure 5.1 A SIMP based on a DMA controller. [diagram label: 4 host transputers]
Figure 5.2 A SIMP based on linked buffer. [diagram label: 3 host transputers]
Figure 5.3 A SIMP based on memory mapped FIFO buffers.
Figure 5.4 A single SIMP supporting four hosts, each of which is directly connected 
and has a connection (via the SIMP) to the 2D network.
5.3 Development of the Gap Equations
Before finalising the structure of the Serial Interface Message Processor used to 
interconnect transputers to the buses of the 2D structure, it is necessary to build a mathematical 
model, by which the impact of buffering and software protocols on the total communication 
latency can be analysed.
These expressions called gap-equations can generally be applied to any sequence of 
pipelined actions or a chain of nodes, where gaps or idle states result as a consequence of 
process (or action) synchronisations.
To understand this concept, suppose that at time zero two processes P_1 and P_2 have
been assigned tasks T_1 and T_2 respectively. If the two processes are synchronised at the 
completion of their actions (execution of the task), depending on the speed of execution one 
must wait for the other, resulting in a gap g that can be written as g = Δ u(Δ)¹ where Δ = T_2 - T_1 
or Δ = T_1 - T_2. Obviously, one notices that if the two actions terminate at the same time, 
which is not always the case, there will be no gap (g=0); also for autonomous processes, the 
gap is conventionally denoted as g = ® (Figure 5.5).
[Figure 5.5: timing diagrams (a) and (b) of two processes and their actions.]
Figure 5.5 A gap or idle state of time resulting between two actions executed 
by two processors.
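The two-process case can be written out in a few lines. This is a minimal sketch of g = Δ u(Δ) using the unit step of the footnote; the function names are illustrative, and the task times in the example are arbitrary.

```python
def unit_step(x):
    """Unit step u(x): 1 for positive x, 0 otherwise."""
    return 1 if x > 0 else 0


def synchronisation_gaps(T1, T2):
    """Idle time of each process when P1 and P2 synchronise after tasks T1 and T2.

    Returns (gap of P1, gap of P2); at most one of the two is non-zero.
    """
    delta_1 = T2 - T1          # P1 waits when P2 finishes later
    delta_2 = T1 - T2          # P2 waits when P1 finishes later
    return delta_1 * unit_step(delta_1), delta_2 * unit_step(delta_2)


if __name__ == "__main__":
    print(synchronisation_gaps(3, 5))   # P1 is faster and waits
    print(synchronisation_gaps(4, 4))   # equal tasks, no gap
```

The unit step simply discards negative differences, so only the faster process accumulates idle time.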
In general, for n pipelined processes forming a chain of nodes, each executing m_i
sub-tasks of a total task m = m_1 + m_2 + ... + m_n, for any 1 ≤ i ≤ n and 1 ≤ j ≤ m_i, there will be 
as many gaps (g_{i,j}) as there are synchronised actions (Figure 5.6). Any gap of time between 
concurrent actions can be written recursively in terms of previous idle states and accomplished 
actions (tasks) in the form
SiJ
Qi (2,-i
£ 7).u + £ Si-uKfe-,.*)k=i k- 1 £ Tit+1
J
1Jt = l (5.1)
1 u is the unit step function defined as u(x) = 1 for x > 0 and u(x) = 0 for x < 0.
where T_{i,j} is the action executed by the process i at the position j in time (m_i = Σ_j T_{i,j}) 
and Q_i the protocol parameter specifying quantitatively how these actions are executed at the 
level of each processor i (i.e. each processor can execute for instance one or more actions then 
wait for further ones). Basically, these actions represent any useful processor or process 
execution time including computations or communications.
[Figure 5.6: timelines of processes P1, P2, ..., Pn along the time axis, each alternating actions T_{i,j} and gaps g_{i,j}.]
Figure 5.6 Gaps resulting between n pipelined processes.
Based on these gap equations, various useful figures can be obtained. For instance, the 
total time to accomplish the execution of the whole task by n processes, specifically defined 
as T = Max(T_i) over 1 ≤ i ≤ n, is constrained by the slowest processor in the pipe, where
$$T_i = \sum_{j=1}^{m_i} T_{i,j} + \sum_{j=1}^{m_i - 1} g_{i,j}\, u(g_{i,j}). \tag{5.2}$$
Also the utilisation of each processor i can be defined as
$$U_i = \frac{\sum_{j=1}^{m_i} T_{i,j}}{\sum_{j=1}^{m_i} T_{i,j} + \sum_{j=1}^{m_i - 1} g_{i,j}\, u(g_{i,j})} \tag{5.3}$$
and the total utilisation of the parallel system as U_T = (1/n) Σ_{i=1}^{n} U_i. Finally, a good load 
balancing of the parallel system results if all processors are made equal use of (U_i = U_k ∀ i, k).
The above equations will be used throughout the remainder of this chapter to analyse 
any pipelined situation that may occur from processor-to-processor via the SIMP interfaces 
and the 2D interconnection network.
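Equations 5.2 and 5.3 translate directly into code once the actions and gaps of each process are known. The sketch below takes the gaps as given inputs rather than computing them from the recursive equation 5.1; the function names and the two-process example are hypothetical.

```python
def unit_step(x):
    """Unit step u(x): 1 for positive x, 0 otherwise."""
    return 1 if x > 0 else 0


def process_time(actions, gaps):
    """T_i of equation 5.2: all action times plus the positive gaps between them."""
    return sum(actions) + sum(g * unit_step(g) for g in gaps)


def utilisation(actions, gaps):
    """U_i of equation 5.3: useful time over useful time plus idle time."""
    return sum(actions) / process_time(actions, gaps)


def system_summary(all_actions, all_gaps):
    """Total time T = max(T_i), per-process utilisations U_i and their mean U_T."""
    times = [process_time(a, g) for a, g in zip(all_actions, all_gaps)]
    utils = [utilisation(a, g) for a, g in zip(all_actions, all_gaps)]
    return max(times), utils, sum(utils) / len(utils)


if __name__ == "__main__":
    # Two processes with three actions each; the first idles for one unit twice.
    actions = [[2, 2, 2], [3, 3, 3]]
    gaps = [[1, 1], [0, 0]]
    print(system_summary(actions, gaps))
```

Each gap list holds m_i - 1 entries, matching the upper limit of the second sum in equation 5.2, and the slower process sets the total time T.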
5.4 Finalising the Interface Model
The model shown in figure 5.7 consists of a p* number of buffers and uses an arbitrary
window size q* to limit and regulate packet transmission. It will only become clear later in 
this section how many buffers and what strategy (i.e. send-and-wait, go-back n or selective 
transmission [Hal 85]) the system should adopt so as to achieve a minimum communication 
overhead between processors.
In this section, therefore, we are interested in finding the optimal structure of the SIMP 
model (Figure 5.7) and the appropriate transmission strategy. For that, we use the theory 
developed previously to analyse the host-to-host performance and to study the impact of the 
software protocol used by each interface on the packet transmission time. The main ideas 
are outlined here and further details can be found in Appendix C.
A message generated by a source processor in the model of figure 5.7 has to cross 
several processes before reaching its destination. The route of each message can thus be 
decomposed into host-SIMP interaction where the communication latency is imposed by the 
message transmission through the transputer link (r MBytes/sec) and the sending software 
protocol of the SIMP (b μsec). The total number of packets that could be stored inside each 
SIMP buffer (p*) depends on the number of outstanding packets waiting for their 
acknowledgements which is related to the window size (q*) utilised by the communication 
protocol in each interface. When each packet leaves its SIMP at the effective capacity of the 
channel (R Mbits/sec), it crosses a limited number of gateways (=2) resulting in a set of 
interactions or actions at the network level between SIMP-gateways. Finally, the packet 
reaches the destination station where it will be submitted to further processing by the receiver 
software protocol (a μsec) and then it is transmitted to the destination host via the transputer 
communication link (r MBytes/sec) (Figure 5.7).
Figure 5.7 The complete structure of the SIMP, consisting of p* buffers and using 
a transmission strategy of window q*, connected to the network.
These interactions have been thoroughly analysed in Appendix C for a large number 
of packets m under various communication services (e.g. End-to-end and hop-by-hop 
acknowledgement) and routing mechanisms (e.g. Store-and-forward and virtual cut-through). 
In particular, the following assumptions, adopted for our system, are made to carry out the 
comparison between different types of transmission strategies (q* ≥ 1) and SIMP structures 
(p* ≥ 1).
1 - the communication service is hop-by-hop,
2 - the cut-through routing mechanism is adopted,
3- the network is balanced (i.e. the same average queuing delay is assumed in any 
branch),
4- originally, the host provides its SIMP with equal sub-messages to be transmitted 
as equal sub-packets,
5- channels are free from errors, although without this assumption the analysis does 
not lose generality,
6- premature reception is implemented, that is each destination SIMP starts initial 
processing on the header of the received message before the formatted packet 
is completely stored in its SIMP receiving buffer (section 5.5),
7- high priority acknowledgement is used with the CSMA/CD arbitration protocol,
8- for any buffer size (p*) and window size (q*), the sequence of transmission is
stationary; it has a constant protocol parameter Q (see equation 5.2).
Assumptions 1 to 5 simplify the gap equations in which the packets from each host-SIMP 
are physically propagated to the destination SIMP without any alteration. Hence, without 
loss of generality, the number of hops crossed by each message can be fixed to one (all 
packets stay in the branches where they were generated). It will be shown in the 
next section that assumption 6 is the most feasible way of using the software protocol of each 
SIMP to receive messages. The most influential parameters to the gap equations are 
assumptions 7 and 8 which formulate the whole behaviour of the system and impose the 
majority of the communication latency.
The process of transmission occurs as follows. A transmitter host stores messages into 
its SIMP’s buffers. When no buffer is available, gaps or idle periods denoted by α result, 
which represent the time spent by the host waiting for an empty buffer. Gaps denoted by β 
also occur if there are no transmissions in the network channel due to collisions or errors. 
These gaps also appear when messages arriving from the host are held up by the interface 
processing speed, hence delaying transmissions through the channel. Finally, gaps may also 
appear at the receiving interface as the arrival of consecutive packets may be delayed by the 
network, and messages are only delivered to their destination host after being completely 
stored inside the SIMP receiving buffer.
We are mainly interested in gaps formed at the transmitter and the channel, since the 
receiver contains a single buffer FIFO for which no optimisation can be done, as illustrated 
in figure 5.7. It will however be shown in section 5.5 that the software protocol needs to be 
optimised at the receiver, if the communication latency of the system is to be minimised.
Therefore the utilisation of the channel (U_β) and of the transmitting interface (U_α) can
be deduced from equation 5.3 as U_gap = useful_time/(useful_time + gap). The expressions 
of the gaps behind U_β and U_α mentioned above are given in appendix C. By substituting them 
in equations 5.2 and 5.3, we can calculate the total transfer time (T) for passing an arbitrary 
number of packets between two physically separated processors within the 2D network, the 
corresponding channel utilisation and the interface utilisation. This is shown in table 5.1 for a 
hypothetical case using a T212 transputer as an interface processor [Inm 87, 88] to pass 
groups of 50 data packets of different lengths. Both the number of buffers available (p*) and 
the window size (q*) of the interface are arbitrarily chosen as reported in the table.
             (a) p* = 1, q* = 1            (b) p* = 2, q* = 1
W (bytes)    T (msec)   Uβ %    Uα %       T (msec)   Uβ %    Uα %
   32          5.60     7.40   100.0         5.60     7.40    100
   64          9.68     7.66   100.0         9.68     7.66    100
   96         13.76     7.68    99.0        13.76     7.76    100
  128         18.71     7.61    97.4        17.35     7.82    100
  160         23.12     7.57    96.4        21.44     7.85    100
  192         27.52     7.54    95.8        25.52     7.88    100
  224         31.93     7.52    95.3        29.61     7.89    100
  256         36.34     7.51    95.0        33.70     7.90    100

             (c) p* = 2, q* = 2            (d) p* = 4, q* = 1
W (bytes)    T (msec)   Uβ %    Uα %       T (msec)   Uβ %    Uα %
   32         10.22     3.90    100          5.60     7.40    100
   64         14.27     5.07    100          9.68     7.66    100
   96         18.35     5.72    100         13.76     7.76    100
  128         22.44     6.13    100         17.35     7.82    100
  160         26.52     6.41    100         21.44     7.85    100
  192         30.61     6.62    100         25.52     7.88    100
  224         34.69     6.78    100         29.61     7.89    100
  256         38.78     6.91    100         33.70     7.90    100

             (e) p* = 4, q* = 2            (f) p* = 4, q* = 4
W (bytes)    T (msec)   Uβ %    Uα %       T (msec)   Uβ %    Uα %
   32         10.75     3.71    100         10.41     3.60    100
   64         14.79     4.90    100         14.49     4.87    100
   96         18.88     5.55    100         18.57     5.55    100
  128         22.97     5.98    100         22.66     5.98    100
  160         27.05     6.28    100         26.75     6.29    100
  192         31.14     6.51    100         30.83     6.51    100
  224         35.22     6.68    100         34.92     6.68    100
  256         39.31     6.81    100         39.00     6.82    100
Table 5.1. The calculated communication latency and the utilisation o f the channel 
Up, and that o f the transmitting interface Ua is shown for various packet 
lengths (in bytes) when 50 packets are exchanged between two processors 
belonging to the 2D network.
From the table, it can be concluded that the software protocol associated with a large
window transmission (q* > 1) is more complex and involves a larger execution time both at
the receiver and the transmitter SIMP (tables c, e, f). For this reason, all transmissions with
a large window (q* > 1) have a higher communication overhead time when transferring any
number of packets and a smaller channel utilisation (Up%) than the single window case. A
single window strategy, on the other hand, (tables a, b, d) has a better performance because
the high priority acknowledgement incorporated within the CSMA/CD protocol returns fast
enough to release buffers from their outstanding packets. In all cases, structures with a higher
number of buffers (p* > 1) have a superior performance (tables b, c, d, e, f), because their
transmitting interfaces are fully utilised by the hosts (Ua = 100%). However, the single buffer
interface exhibits similar performance for small packet lengths (<= 64 Bytes) or for high
transmission rates (table a). The channel utilisation is reduced for smaller packet lengths
due to the dominating idle states of the bus and, also, for larger packet lengths where a
considerable time is needed to store the longer messages in a single buffer. This situation
does not arise with an interface containing many buffers, where the efficiency keeps
increasing with the length of packets due to the complete overlapping of packet transmissions
via the channel with their storage inside the SIMP buffers. In other words, an interface
consisting of many buffers is fully utilised by its host, and achieves the minimum
communication latency. From this point of view, it appears that the SIMP structure having a
double buffer (p* = 2) and utilising the send-and-wait transmission strategy (q* = 1) is the
most effective in cost and performance and, therefore, it is the one adopted for our SIMP.
5.5 Overlapping Part of the Software Protocol
We have shown in the previous section that the double buffer with the send-and-wait 
scheme [Hal 85] is desirable for providing a low communication latency between any pro­
cessors in the network. To exploit this characteristic further, we investigate in this section 
the possibility of overlapping the software managing the communication protocol with the 
transmission or reception of packets at the sender and receiver SIMP.
Before addressing the two sides of the communication, transmitter and receiver, we
consider packets as represented in figure 5.8 where V is the location of the address and control
headers, Wc the CRC field check-sum and d the useful data. Packets can be considered as
divided into slots or slices (possibly of different sizes) labelled W1, W2, ..., Wn. Although
these slices have no physical meaning, to a certain extent, the first slice W1 can be thought
of as the packet header used in the virtual cut-through routing decisions.
[Figure: a packet made of the header V, the data d and the CRC field Wc, divided into slices W1, ..., Wn.]
Figure 5.8 Packet format divided into slices.
5.5.1 Transmitter Side (Figure 5.9)
At the SIMP transmitter, the procedure of formatting messages into packets occurs 
before the complete message flow from the host to the transmitting buffer is terminated. 
Hence, there is no way to overlap the transmitting software protocol, and it would only be 
possible to overlap the message flow with the transmissions via the channel.
In general, when a message is ready to be passed from the transmitting host to its SIMP,
the transmission via the network channels will not take place until a gap of time g0 has elapsed.
According to figure 5.9, this time can be expressed as g0 = b + W1/r, with b being the software
protocol time for formatting messages into packets and enabling their transmission through
the network bus. W1/r represents the time to actually send the slice W1 at the bit rate r through
the transputer communication link.
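[Figure: time line of a packet transmission, showing the link intervals W1/r and (W-W1)/r, the software protocol time and the channel interval (W+Wc)/R.]
Figure 5.9 Transmission of a packet through the SIMP.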
If this transmission process is executed in store-and-forward manner (st), the whole
packet would have been stored inside the transmitting buffer of the interface before its
transmission via the network channel is issued. Therefore, the whole packet length W is
represented by the virtual slice W1. On the other hand, if the flow of messages is passing
transparently through the transmitting buffer toward the channel (similar to the virtual cut- 
through but without routing decisions) the virtual slice includes only the header portion of 
the packet expressed as [Appendix D]
1Yi = W u (l~ X )+ L u x-l- br L + b lr u(x - 1 ) - u x
' - f
(5.4)
where b1 is the time to prepare the transmission of the next slices, L the word length in bytes
(e.g. 2 Bytes for a T212), n the number of virtual slices (expressed explicitly in Appendix
D), x the ratio of the transputer link rate to the channel capacity (r/R) and u() is the unit step
function.
5.5.2 Receiver Side (Figure 5.10)
On the receiver side, however, part of the software protocol in addition to the message
flow from the receiver buffer to the host can be overlapped. Three separate possibilities arise.
In delayed reception (dr), the SIMP stores the complete packet inside the receiver
buffer, initiates the software protocol processing for a duration a µsec to interpret the
information carried by the slices and delivers error-free messages to the host. Such operations
introduce to the reception of a packet an idle time of duration g = g0 = (W + Wc)/R (as shown
in figure 5.10 along the time axis (1)).
In premature reception (pr), the SIMP overlaps part of its software protocol processing,
which consists of decoding the destination host (a1 µsec) and exploring the control
information to determine packet duplications (a2 µsec), with the current reception of slices.
Afterwards, the SIMP waits for the completion of the packet reception and initiates the
message block transfer (a3 µsec). The gaps of time g1 = (L/R) - a1 and
g2 = (W - L)/R - (a1 + a2 + g1 u(g1)) yield a total idle time of g = g1 u(g1) + g2 u(g2) during
which the receiver must wait for packets to be buffered (as shown in figure 5.10 along the
time axis (2)).
In transparent reception (tr), the SIMP starts its software protocol processing and directs
messages to its hosts word-by-word as they are received. The amount of idle time resulting
is therefore
g3 = (W - L)/R - [a + g1 u(g1) + (W - L)/r], g4 = (W - L)/R - [a + g1 u(g1) + g3 u(g3) + W/r]
which yields a total gap of time g = g1 u(g1) + g3 u(g3) + g4 u(g4) (as shown in figure 5.10 along
the time axis (3)).
Figure 5.10 Reception of a packet by the SIMP.
These partial operations can be combined into six methods of transmission labelled 
according to their transmission and reception types as stdr, stpr, sttr, ttdr, ttpr and tttr.
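The six labels are simply the pairings of the two transmission types with the three reception types. A minimal sketch of that enumeration (the variable names are illustrative only):

```python
from itertools import product

# Transmission types: store-and-forward (st) and transparent (tt).
TRANSMISSION_TYPES = ("st", "tt")
# Reception types: delayed (dr), premature (pr) and transparent (tr).
RECEPTION_TYPES = ("dr", "pr", "tr")

# Each method label concatenates a transmission type and a reception type.
methods = [tx + rx for tx, rx in product(TRANSMISSION_TYPES, RECEPTION_TYPES)]
print(methods)
```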
The total communication time or latency to exchange a single packet between two 
communicating processors can be deduced from equation (5.2) by replacing the appropriate 
gap expressions above to obtain
T = g a+ g + — + a + W lr .  (5.5)
Notice that, as far as the comparison is concerned, the inclusion of the network queuing
delay is unnecessary, knowing that the same amount of time will be added to the six methods.
The following values of the parameters used to plot the curves in figure 5.11 are obtained
for the T212 transputer from [Inm 88]:
L = 2 Bytes (T2 Transputer),
b = 20 µsec, b1 = 2.5 µsec,
a = 30 µsec decomposed into a1 + a2 = 23 µsec, a3 = 7 µsec and a1 = 8 µsec,
Wc = 2 Bytes,
r = 0.4 MBytes/sec.
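As an illustration, the gap expressions of sections 5.5.1 and 5.5.2 can be evaluated with these parameters. The sketch below rests on several assumptions that are not stated in the parameter list: the channel capacity is taken as R = 40 Mbit/sec (5 bytes/µsec), the 8 µsec value is attributed to a1, the unit step is taken as zero for non-positive arguments, and the last term of g4 is read as W/r. All function and constant names are hypothetical.

```python
def unit_step(x):
    """Unit step u(x): 1 for a positive argument, 0 otherwise (assumed convention)."""
    return 1.0 if x > 0 else 0.0


# Times in microseconds, lengths in bytes, rates in bytes per microsecond.
L = 2                  # transputer word length
WC = 2                 # CRC field length
A = 30.0               # total receiver software protocol time a
A1 = 8.0               # assumed to be a1 (destination decoding)
A12 = 23.0             # a1 + a2
LINK_RATE = 0.4        # r = 0.4 MBytes/sec
CHANNEL_RATE = 5.0     # R: assumed 40 Mbit/sec, not part of the list above


def transmit_gap_store_and_forward(W, b=20.0, r=LINK_RATE):
    """g0 = b + W1/r with W1 = W, i.e. the whole packet is stored first."""
    return b + W / r


def reception_gap(mode, W, R=CHANNEL_RATE, r=LINK_RATE):
    """Idle time g at the receiver for delayed (dr), premature (pr) or
    transparent (tr) reception of a packet of W bytes."""
    g1 = L / R - A1
    if mode == "dr":
        return (W + WC) / R
    if mode == "pr":
        g2 = (W - L) / R - (A12 + g1 * unit_step(g1))
        return g1 * unit_step(g1) + g2 * unit_step(g2)
    if mode == "tr":
        g3 = (W - L) / R - (A + g1 * unit_step(g1) + (W - L) / r)
        g4 = (W - L) / R - (A + g1 * unit_step(g1) + g3 * unit_step(g3) + W / r)
        return g1 * unit_step(g1) + g3 * unit_step(g3) + g4 * unit_step(g4)
    raise ValueError("mode must be 'dr', 'pr' or 'tr'")
```

Under these assumptions a 32-byte packet arrives faster than the receiver can process its header, so premature reception leaves no idle time, whereas a 256-byte packet makes the receiver wait for the remaining slices.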
In figure 5.11, although Transparent Transmission (tt) and Transparent Reception (tr) 
achieve the smallest communication time by overcoming the interface latency, their imple­
mentation is inadvisable. Firstly, the transparent transmission scheme (tt) tries to send bits 
not yet provided by the host. More importantly, however, the unpredictable behaviour of 
each transputer that may execute several concurrent processes imposes a lower bound on its 
effective transmission rate which physically is not synchronised with the network channel 
capacity. Secondly, when corrupted packets are received with the transparent reception
scheme (tr), they will be directly delivered to the host. This step eventually requires the host 
to check for errors, and therefore increases the inter-process communication latency and host 
software overhead.
One encouraging fact is that, for small packet lengths (e.g. < 32 Bytes), all methods
exhibit practically the same transfer time. As a consequence, single store-and-forward
transmissions with premature reception (stpr) can be used effectively and securely in this
range without the extra concern of higher-level protocol complexity.
[Figure: latency (µsec) plotted against the packet length W (Bytes) for the six transmission methods.]
Figure 5.11 The calculated latency of the system (in µsec) to transfer a single packet
of length (W Bytes) from host-to-host using the six transmission methods
(inset). The scale of the latency axis is divided by 50. x is the ratio of the
transputer link rate to the channel capacity.
5.6 An Optimal Pipelined Transmission
Although packets may travel inside the network by virtually cutting through gateways, 
we have shown in the previous section that the SIMP senders and receivers impose the 
store-and-forward transmission and reception method on messages. We have also illustrated 
that the proposed transmission (stpr) can only be used efficiently for small packet lengths. 
If longer messages are forwarded in their entirety, communication delays would tend to be 
high. The obvious way of improving this situation is by dividing large messages into smaller 
sub-packets and using pipelined transmission to achieve concurrent communication [Har 86, 
Cok 91].
In this section, we investigate these possibilities, study their impact on our network and 
compare the communication overhead time obtained with a linear array of interconnected 
transputers.
There are two ways by which a large message can be converted into smaller sub-packets: 
a- the host sends the whole message to its SIMP, and afterwards, it is the duty of 
the latter to partition it into smaller blocks, 
b- originally, the host provides its SIMP with smaller sub-messages to be sent as 
sub-packets.
In both methods, we require the use of a virtual circuit service [Tan 88] to ensure that blocks 
will be received in the order of their transmission. To avoid complex re-assembly of these 
blocks, we force the group of sub-packets to follow the same path in the network. For 
convenience, we assume that the second possibility (b) is chosen.
5.6.1 Splitting Messages into Smaller Sub-packets
One way to split the message of figure 5.8 into m separate sub-packets is to copy the
same addressing overhead bits into each slice s such that s = V + d/m. Undoubtedly, the
resulting concatenation of all slices provides a much greater overhead ms = (m - 1)V + W
than pipelined transmission applied between a set of transputers interconnected via their
communication links, which uses one addressing overhead only.
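The overhead expression follows in one line from the definition of the slice. The step below assumes W = V + d, which is what the stated identity implies (the check-sum field is treated separately as sc in the later expressions):

$$
m\,s \;=\; m\left(V + \frac{d}{m}\right) \;=\; mV + d \;=\; (m-1)\,V + (V + d) \;=\; (m-1)\,V + W .
$$

In other words, each additional sub-packet costs one extra copy of the header V.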
Inauspiciously, there are no alternatives to avoid these addressing overheads since each 
sub-packet has to travel inside the 2D network as an independent entity self-identifying the 
destination stations. Fortunately, there are two hops through which messages enter and leave 
the network, respectively, excluding the network routes crossed transparently by each sub­
packet (Figure 5.7). The two external hops are included at the transmitter and the receiver 
interfaces (SIMP). This situation however relieves the effect imposed by the
store-and-forward transmission at the interface level on communication latency provided
that the length of each sub-packet is properly chosen to dominate the addressing overhead
and the software protocol time (a and b µsec).
Given that each processor interface consists of double buffers, the gap expressions 
obtained in appendix C can be substituted into the total transfer time of equation 5.2 to yield
$$ T = m\left(b + \frac{s}{r}\right) + (H-1)(\gamma + Q) + \left(Q + \frac{L}{R}\right) + \varepsilon\,u(\varepsilon) + a + \frac{s}{r} \tag{5.6} $$
where L is the transputer word length (2 Bytes), H the number of hops crossed by the m
sub-packets inside the interconnection network, Q the queuing time at each gateway,
ε = (s + sc)/R - (a1 + a2) the gap resulting at the receiver to wait for the complete arrival of a
sub-packet (see premature reception above) and γ explicitly determined according to the
type of routing mechanism:
$$ \gamma = \begin{cases} (s + s_c)/R & \text{for store-and-forward routing} \\ L/R & \text{for virtual cut-through routing.} \end{cases} $$
Notice that equation 5.6 is valid only if the network is not heavily loaded. As the
transmission via the channel is six to twelve times faster than the one through the transputer
link interface (depending on the type of the interface processor used, e.g. 10 Mbits/sec for
T212 with non-overlapped acknowledgements or 20 Mbits/sec for T222/225 with overlapped
acknowledgement), typically, any sub-packet must not be queued for a time greater than
6(s/R) µsec or 12(s/R) µsec, in which case all sub-packets will move according to the branch
traffic. Basically, for moderate and light traffic load, the interface communication latency
dominates that of the network, and the optimal value of the number of sub-packets m =
(W-V)/(s-V) which minimises the total communication time for transferring a message of
length W between two physically distributed processors belonging to the 2D structure can
be evaluated as
$$ m = \sqrt{\frac{(W - V)\,(z/R + 1/r)}{b + V/r}} \tag{5.7} $$
The parameter z given by
$$ z = \begin{cases} 1, & s \ge 113 \text{ bytes, for virtual cut-through} \\ 0, & s < 113 \text{ bytes, for virtual cut-through} \\ H, & s \ge 113 \text{ bytes, for store-and-forward} \\ H - 1, & s < 113 \text{ bytes, for store-and-forward} \end{cases} $$
is related to the type of routing mechanism used (i.e. store-and-forward or virtual cut-through)
and the length of each sub-packet. The length of the block s (113 bytes) is derived from the
unit step function in equation 5.6 (u(ε)).
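The 113-byte threshold can be checked by setting the receiver gap to zero. The worked step below assumes a channel capacity of R = 40 Mbit/sec (5 bytes/µsec, the value used in table 5.2), a1 + a2 = 23 µsec and a 2-byte check-sum (sc = Wc = 2 bytes):

$$
\varepsilon = \frac{s + s_c}{R} - (a_1 + a_2) = 0
\;\;\Longrightarrow\;\;
s = R\,(a_1 + a_2) - s_c = 5 \times 23 - 2 = 113 \text{ bytes}.
$$

For shorter sub-packets the receiver is still busy with a1 + a2 when the block has fully arrived, so the step function removes the corresponding s/R term from the transfer time.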
Evidently, in the case of store-and-forward routing, the optimised value of m slightly
depends on the number of hops crossed by each sub-packet (s), whereas, unlike the other
pipelined transmissions [Har 86], it is encouraging that with virtual cut-through routing the
number of sub-packets which minimises the latency to transfer a message W is independent
of the distance of the destination station. It is only specified by the known physical quantities
(W, V, R, b and r).
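A small numerical sketch can make the optimum concrete. It assumes that equation 5.6 has the form T = m(b + s/r) + (H - 1)(γ + Q) + (Q + L/R) + ε u(ε) + a + s/r with s = V + (W - V)/m, and that equation 5.7 reads m = sqrt((W - V)(z/R + 1/r)/(b + V/r)). The header length V = 4 bytes and all function names are hypothetical choices for illustration only; in practice m would be rounded to an integer.

```python
import math


def unit_step(x):
    """Unit step u(x): 1 for a positive argument, 0 otherwise (assumed convention)."""
    return 1.0 if x > 0 else 0.0


def optimal_subpackets(W, V, R, r, b, z):
    """Number of sub-packets given by the assumed form of equation 5.7."""
    return math.sqrt((W - V) * (z / R + 1.0 / r) / (b + V / r))


def transfer_time(m, W, V, R, r, b, a, a12, sc, L, H=1, Q=0.0):
    """Assumed form of equation 5.6 with virtual cut-through routing (gamma = L/R).

    Times in microseconds, lengths in bytes, rates in bytes per microsecond.
    """
    s = V + (W - V) / m                  # sub-packet length, header included
    eps = (s + sc) / R - a12             # receiver gap of premature reception
    return (m * (b + s / r) + (H - 1) * (L / R + Q) + (Q + L / R)
            + eps * unit_step(eps) + a + s / r)


# Hypothetical case: 10000-byte message, 4-byte header, R = 40 Mbit/sec,
# r = 1.8 MBytes/sec, b = 20 microseconds.
params = dict(W=10000.0, V=4.0, R=5.0, r=1.8, b=20.0)
receiver = dict(a=30.0, a12=23.0, sc=2.0, L=2.0)

m_opt = optimal_subpackets(z=1, **params)
t_opt = transfer_time(m_opt, **params, **receiver)
```

With these figures the sub-packets are several hundred bytes long, so z = 1 is the consistent choice, and moving m away from m_opt in either direction increases the computed transfer time.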
It is also important to notice that for smaller messages (W < 32 bytes) there is no need 
to sub-divide them into smaller data blocks. The sub-division technique is only beneficial 
when large amounts of data are passed between processors.
Finally, for a heavily loaded 2D network all sub-packets will be travelling according 
to the traffic load imposed to each crossed branch, and even if the optimal number of sub­
packets m is determined as in equation 5.7, it will not produce a significant improvement on 
the performance of the network.
5.6.2 Comparing the Obtained Communication Latency to that of a Linear Transputer Array
Now, the primary emphasis is to compare the obtained communication latency of the 
2D network developed in this thesis to the well known store-and-forward, pipelined and 
virtual cut-through transmissions when applied to a linear array of interconnected transputers.
In a linear transputer array, messages of length W are split into small blocks s and routed
bit-serially a distance of up to (N-1) hops along a one dimensional pipeline sequence of N
processors. By applying equation 5.2 to this particular situation, the transfer time for the
pipelined transmission which depends on the distance (N-1) between transputers can be
written as
$$ T = (m + N - 1)\left(a + \frac{s}{r}\right) \tag{5.8} $$
Harp [Har 86] has deduced the same equation using the following arguments. To
transfer the first sub-packet (s) through N nodes requires a time N(a + s/r) where a is the
software protocol processing time at each transputer and r is its effective link rate. The
remaining (m-1) sub-blocks arrive subsequently at the last transputer during a time (m-1)(a
+ s/r). Therefore, these quantities sum up to provide equation (5.8). Furthermore, Harp has
minimised the transfer time by choosing an optimal value of the number of blocks (m). This
minimum transfer time is given by
$$ T = (N-1)\,a + \frac{W}{r} + 2\sqrt{\frac{(N-1)\,a\,W}{r}} \tag{5.9} $$
Such a figure can only be achieved using two buffers per node. In the same circum­
stances, a single buffer will impose the store-and-forward transmission where an entire packet 
is consumed by the node before retransmission occurs. This results in an extremely high 
latency, dependent on the product of the message transmission and the distance to be travelled. 
As shown in appendix C and equation 5.2, the total transfer time is expressed as
$$ T = (N-1)\left(a + \frac{W}{r}\right) \tag{5.10} $$
Despite its high latency, this technique can be found in many machines such as the 
Cosmic Cube [Sei 85], the Ametek/14 multicomputer [Ame 86] and the Meiko ring-trans- 
puters [Wei 89].
Unlike the above transmission schemes, the virtual cut-through routing technique has 
an extremely low latency, dependent on the routing decision per node (L) for the whole 
message (W).
$$ T = (N-1)\,\frac{L}{r} + \frac{W}{r}. \tag{5.11} $$
Although this method seems quite efficient, it turns out that, only networks based on 
hardware routers, such as our 2D structure or the Mad-Postman, can adopt it. An attempt by 
Lau to implement this technique on a directly-connected network of transputers using a 
common buffer was reported in [Lau 91]. However, as criticised by Peel [Pee 92], Lau’s 
proposed method has several drawbacks. It is not constructed in a legal OCCAM since the
common buffer is shared by two concurrent processes. In addition, the technique works 
provided that the rate of data arriving to a transputer input link is similar or faster than its 
output link can pass it on. Probably the most crucial point is the choice of the delay which 
assumes a constant speed of each transputer link.
Fortunately, the virtual cut-through routing transmission adopted by our 2D structure
can effortlessly be implemented within each gateway router; and obviously, the network
latency would be insignificant compared to the SIMP interface latency which is imposed by
the store-and-forward pipelined transmission through two hop transputer links (1 link for
transmitter & 1 for receiver). Table 5.2 shows the total communication time for transferring
10 KBytes of data across a set of linearly-connected T800 processors. The calculations
obtained from equations 5.9 and 5.10 assume that each processing node has a software
protocol processing time and an effective communication link bandwidth of a = 5 µsec and
r = 1.8 MBytes/sec respectively. Within the same circumstances, a comparison is made to our
2D structure with its particular network parameters (equation 5.6).
        Linear transputer array        2D network with virtual cut-through (V.C.T.)
                                       for different parameters
Hops    Store and     Optimal          b = 20 µsec,       b = 0,             b = 20 µsec,
        forward       pipelined        R = 40 Mbit/sec    R = 40 Mbit/sec    R = 100 Mbit/sec
2       11121         6036.12          6384.096           5823.42            6315.56
3       16681         6147.91          same               same               same
4       22242.2       6241.39          same               same               same
5       27802.7       6325.9           same               same               same
6       33363.3       6401.2           same               same               same

Table 5.2 Comparison of the communication latency (µsec) obtained by our
2D network and a linear transputer array when passing 10 KBytes
of data between two processors. These figures are derived from their
corresponding equations 5.6, 5.9 and 5.10.
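The linear-array columns can be approximated with a short script. The sketch assumes that the "hops" column corresponds to (N - 1), that 10 KBytes means W = 10000 bytes, that the store-and-forward time is (N - 1)(a + W/r) and that the optimal pipelined time is the minimum of (m + N - 1)(a + W/(m r)) over m. Function names are illustrative.

```python
import math


def store_and_forward_time(hops, W, a, r):
    """Whole message retransmitted at every hop: hops * (a + W/r)."""
    return hops * (a + W / r)


def pipelined_time(m, hops, W, a, r):
    """Pipelined transfer of m equal blocks of W/m bytes (equation 5.8 form)."""
    return (m + hops) * (a + W / (m * r))


def optimal_blocks(hops, W, a, r):
    """Block count minimising pipelined_time (obtained by setting dT/dm = 0)."""
    return math.sqrt(hops * W / (a * r))


def optimal_pipelined_time(hops, W, a, r):
    """Minimum of pipelined_time over m."""
    return hops * a + W / r + 2.0 * math.sqrt(hops * a * W / r)


def cut_through_time(hops, W, L, r):
    """Virtual cut-through: one word of L bytes per hop, then the message."""
    return hops * L / r + W / r


# T800 figures from the text: a = 5 microseconds, r = 1.8 bytes per microsecond.
for hops in range(2, 7):
    sf = store_and_forward_time(hops, 10000.0, 5.0, 1.8)
    op = optimal_pipelined_time(hops, 10000.0, 5.0, 1.8)
    print(f"{hops} hops: store-and-forward {sf:.1f}, optimal pipelined {op:.2f}")
```

Under these assumptions the computed values stay within about 1 µsec of the first two columns of table 5.2.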
Table 5.2 reveals that the transfer time in our 2D structure is independent of the number
of processors. As previously mentioned, this is because the latency is mainly dependent on
the two hop transputer links and the software protocol time for which a reduction from b =
20 µsec to 0 µsec accordingly improves the communication time by a factor of 5 transputer
hops compared to the optimal pipelined transmission. Higher transmission rates show an
insignificant improvement on the inter-process communication latency (column 6), which is
mainly dominated by the interface delay. Higher rates are only desirable to provide higher
bandwidth when the whole network operates under heavy traffic loads. In optimal pipelined 
transmission, the software protocol processing, which includes the context switching at each 
transputer node, must be invoked for each block of data. As a consequence, the transfer time 
depends on both the distance a message has to travel and the number of data blocks sent. In 
particular, this optimal pipelined transmission becomes equivalent to the virtual cut-through 
scheme for a zero protocol processing load (a=0). In general, therefore, when the traffic in
the transputer array network increases, the performance of both the pipelined and virtual
cut-through transmissions approaches that of the store-and-forward scheme where the whole
message travels at the effective transputer link rate. In our network, fortunately, all sub-packets
will cross at most two hops at the network effective speed which is, in practice, higher than the
transputer link rate. Consequently, a higher performance is expected than for directly-connected
transputers.
5.7 Performance of the Network under Event Parallelism Tasks
It is now important to consider the application-to-application response time and evaluate 
the performance of the whole system. Such parameters are directly related to the network 
characteristics in terms of bandwidth and latency, and the type of interface used to connect 
processors to the network.
Event parallelism [May 87] is chosen to determine the speed up factor and the efficiency 
of the system because it allows automatic load balancing and process synchronisation.
5.7.1 Speed up Factor and Efficiency
In general, the efficiency of an algorithm implemented on a parallel machine of N
processors is defined as E = (time on 1 processor)/(N × time on N processors). Additionally,
on a transputer network where a node is seen as doing some computations (Tp), setting up a
communication (Ts) and sending or receiving messages (Tc), the set up and the communication
times are undesirable, but they are a fact of the inter-process communication.
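The definition translates directly into code. This is a minimal sketch with illustrative function names; the timings used in the example are hypothetical.

```python
def speed_up(time_one, time_n):
    """Speed up factor: time on 1 processor divided by time on N processors."""
    return time_one / time_n


def efficiency(time_one, time_n, n):
    """Efficiency E = (time on 1 processor) / (N x time on N processors)."""
    return time_one / (n * time_n)


# Hypothetical timings: 120 s on one processor, 24 s on 6 processors.
example_speed_up = speed_up(120.0, 24.0)          # 5.0
example_efficiency = efficiency(120.0, 24.0, 6)   # 5/6, about 83%
```

Equivalently, the efficiency is the speed up factor divided by N, so a speed up of 5.78 on 6 processors corresponds to an efficiency of about 96.3%.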
Fortunately, transputers are designed to overlap their communication time (Tc) with the
computational one (Tp), from which Glendinning and Hey [Gle 87] have determined the 
efficiency for a transputer array to be
Typically, the set up time for transputers is about 2.6 µsec. In the case of a T800-20
using internal RAM, communication and computation times are 0.56 µsec per byte and 0.66
µsec per flop respectively. Therefore, the ratio of the computational time in flops to the
communication in bytes is the crucial point in determining the performance of the whole
system. High performance with a linear speed-up factor can only be achieved with the
provision of adequate bandwidth and low communication latency between any processing
nodes.
5.7.2 Case Study of a Processor Farm
In principle, the model used for a processor farm (or an event-parallel application) is 
a single processor acting as a controller for one or more workers. Such a system provides 
an automatic mechanism for dynamic load balancing and synchronisation as hungry workers 
always demand more data or tasks.
A processor farm exhibits high performance only when the amount of communication
overhead is insignificant compared to the time for computations. For problems involving
small data sets, a linear farm of processors with 10 to 100 transputers attains a near-optimum
speed up [May 87, Pac 87]. However, for more complex problems such as ray-tracing
algorithms, the databases which describe the scene of an object may be too large to fit in
each worker memory. Therefore, each processor must fetch the data items from a controller.
As a result, a heavy overhead on the communication (i.e. O(N)) occurs, especially for long
linear farms. In contrast, a k-tree farm of processors [Gre 88] improves the effective
bandwidth, and reduces the communication overhead (i.e. O(log_k N)) by allocating a sub-set
of the database to each node of the configuration. This is achieved by a more sophisticated
routing mechanism at the cost of a greater overhead for processing messages.
In the case of our 2D network, we have shown that the communication is independent
of the location of processors (i.e. O(2)) and, therefore, does not involve any additional routing.
In fact, sub-sets of the database can be allocated evenly to any processor in the network
without affecting the communication overhead. The next simulation results justify this
statement.
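The three growth rates can be compared with a rough hop count. The sketch below is only an illustration of O(N), O(log_k N) and O(2): it counts worst-case transputer-link hops between the controller and a worker, assumes the workers of the tree farm fill complete k-ary levels below the controller, and uses hypothetical function names. It does not model queuing or protocol times.

```python
def linear_farm_hops(n_workers):
    """Worst case in a linear farm: the last worker is n_workers links away."""
    return n_workers


def tree_farm_hops(n_workers, k):
    """Depth of a k-ary tree (k >= 2) of workers hanging below the controller."""
    depth, capacity, level_size = 0, 0, 1
    while capacity < n_workers:
        level_size *= k          # number of workers on the next level
        capacity += level_size   # workers accommodated so far
        depth += 1
    return depth


def bus_network_hops(n_workers):
    """2D network: one link into the network and one link out of it."""
    return 2 if n_workers > 0 else 0


for n in (6, 14, 30):
    print(n, linear_farm_hops(n), tree_farm_hops(n, 2), bus_network_hops(n))
```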
5.7.3 Simulation Results
From an application point of view, it is hard to evaluate the exact performance of the 
system which includes the 2D network, the interfaces and the transputers, because neither 
the internal operations of the transputer (i.e. context switching between processes, effective
processor speed etc.) nor the exact execution of each task can be predicted in advance. 
Nonetheless, the simulation model written in SIMSCRIPT (Appendix E.3 ) gives an 
approximate performance of a real system based on the 2D network.
Within this model, each interface is represented by a single or double buffer as shown
in figure 5.7 of section 5.5. The CSMA/CD network protocol used, on the other hand, is
derived from the simulation model of chapter 3. To this model, the software protocol times
of the transmitter (b = 20 µsec) and the receiver (a = 30 µsec) interface have been included.
Using this model, two examples of event-parallel applications have been simulated on 
the 2D structure, one of which is compared to the processor farm free interconnection network.
In the first one, 1000 arbitrary tasks with an exponentially distributed processing time
of average 1800 µsec have been generated. Each portion of the problem which corresponds to
a set of tasks has been attributed to a processor in the network. Results and data are carried
in an 18 byte packet. Table 5.3 (columns 2,3) shows that for small average processing times,
the efficiency, as expected, decreases rapidly with the number of processors to which the
tasks have been submitted, because the initial communications to load each processor in turn
dominate the computations achieved per worker and hence saturate the speed up factor. As
far as this efficiency is concerned, it does not matter how the initial communication has been
carried out. For instance, fully or partially loading each processor in turn while responding
to the recently accomplished task, or loading all processors sequentially and then answering
the arriving tasks, does not provide any difference in the efficiency since all processors are
equally distributed from the controller and exhibit a symmetric behaviour. For the same task
executed by 1 to 31 processors, table 5.3 (columns 4,5) shows that double buffer stations
provide higher overall efficiencies than single buffer ones since, in the former, more
communication can be overlapped.
In the second example, a ray-tracing algorithm that generates a graphic ball is considered
[Hai 87]. In this experiment 262144 pixels, each assigned an average computational time of
14.90 msec and a message length of 18 bytes corresponding to a point on a 2 dimensional
surface, have been simulated. As shown in table 5.3 (columns 6,7), we have obtained an
efficiency of 99% with 31 processors belonging to the same branch. This is quite an
improvement over the processor farm free interconnection network [Gre 88], which in the
same circumstances attains an efficiency of 90.66% with 15 processors.
              single buffer              double buffer              single buffer
Number of     Speed up    Efficiency     Speed up    Efficiency     Speed up    Efficiency
processors    factor      (%)            factor      (%)            factor      (%)
 1             0.96       96.75           1          100             0.99       99.72
 6             5.78       96.33           6          100             5.98       99.72
11            10.40       94.56          11          100            10.96       99.72
16            14.38       89.87          15           93.75         15.95       99.71
21            15.15       72.16          18           85.71         20.94       99.71
26            15.10       58.11          21           80.76         25.92       99.71
31            15.14       48.85          21           67.74         30.91       99.71

Table 5.3 The speed up factor and the efficiency for N processors sharing the
same channel in a 1D network for an event-parallel arbitrary
application (columns 2,3,4,5) and ray-tracing algorithm (columns
6,7). Stations having single and double buffers were considered in
section 5.4.
5.8 Summary
Development of efficient distributed computers clearly requires low latency network
interfacing along with optimal point-to-point network routing. The semi-adaptive routing
algorithm and the virtual cut-through technique enhance the performance of the 2D structure,
and make it a convenient interconnection network for many computational models.
Although the routing resources (e.g. gateways) have been physically separated from 
the computational models and model-specific processor architectures, processors where 
special hardware supports the communication instructions and message handling (e.g. the 
transputer) can reduce their interface overhead, and hence benefit from the features provided 
by the network.
In particular, a specific interface called the SIMP has been developed in this research 
to support transputers via their serial communication links and to provide a software protocol 
to ensure a secure communication protocol within the network without extra hardware cost.
This Serial Interface Message Processor (SIMP), consisting of a double buffer and a 
simple software protocol, adopts the send-and-wait scheme with high priority acknowl­
edgement, supports high transmission rates, overlaps part of its software protocol with the 
reception of packets (stpr), pipelines the transmission of large messages and can adequately 
be used to connect transputers to the serial channels (buses) of our topology. Furthermore, 
this interface can operate and transfer data efficiently if the majority of the function protocols 
are implemented in hardware so as to minimise the software protocol times.
The whole system including the network, the processor and the interface has been 
simulated for an event-parallel task. The results of the simulation show that high efficiency 
is achieved with interfaces of double buffers in the SIMP. Moreover, it has been shown that 
with the graphic ball ray-tracing example, our topology achieves higher performances than 
a tree of 15 transputers implementing the processor farm directly. This is because the 2D 
structure emulates the whole system as a star network having a total communication latency 
imposed by the transmitting and the receiving transputer links only.
chapter 6
SYSTEM DESIGN PROPOSALS
6.1 Introduction
The purpose of this chapter is to complete the design of the 2D structure by focusing 
on the functions of gateways and stations. The gateways, which are responsible for routing 
messages, represent the ultimate building blocks of the configuration, provide the appropriate 
bandwidth and impose the reduced latency. The stations on the other hand are the only 
elements that inject and absorb messages to or from the network. They consist of at most 
four Transputers supported by a single serial interface processor (SIMP). This interface 
consists of a hardware part that is attached to the physical network and a software part that 
allows host processors to communicate across the 2D structure, ensuring the correctness of 
their communications by the use of robust protocols.
The separation of the routing elements (gateways) from the interface ones (stations) 
not only endows greater design freedom, but also permits the whole architectural structure 
of the system to be addressed separately and independently. As new ideas emerged 
continuously throughout this research, each development could be fitted into the 
corresponding part of the architecture. What makes such decisions feasible is viewing 
the whole system structure in terms of the hierarchical 7-layer standards (ISO and 
IEEE) [Kno 87b, Ham 86, Tan 88, Smy 90]. This multi-layered architecture makes the design 
more structured and eases further development.
It is important to realise that the ISO and the IEEE standards are just models. In fact, 
very few networks adhere strictly to the 7-layer format. In some cases, layers may be missing 
because they are not needed. In others, functions normally associated with a particular layer 
may be implemented in different ones [Rya 81, And 82, Enn 83].
In our system, the whole collection of SIMPs and the network is considered as the 
transmission system identified by the network access protocols. The services associated with 
host transputers on the other hand are defined by the end-to-end protocols (Figure 6.1). 
Therefore, to make the system operate properly, three major layers can be distinguished: the 
user layer which is associated with the application services provided by user processes, the 
software layer which comes at an intermediate level between the host and the hardware
components of the SIMP and the hardware layer which consists of all physical blocks 
necessary to maintain proper interfacing to the bus. By analogy with the ISO and the IEEE 
standards, the model of chapter 5 section 4 can be compared with figure 6.1 as follows: 
User Layer = (Application + ... + Transport Layer) = High Level Protocol,
Software Layer = (part of the Data Link Layer) = Logical Link Control,
Hardware Layer = (part of the Data Link Layer and the Physical layer ) = Medium 
Access Control + Physical Layer.
In particular, the flow control and routing, which are normally part of the network layer 
in the ISO standard, are handled in our configuration at the user and the hardware level 
respectively.
ISO            (this system)                  IEEE
-----------    ---------------------------    ------------------------
application    end-to-end protocol            high level protocols
presentation   (User layer)
session
transport
network        network access protocol
data link      (software) and (hardware)      logical link control and
               layers                         medium access
physical                                      physical
Figure 6.1 Identification of the ISO and IEEE layers.
The important issues to emphasise are the services provided by each layer and the packet 
format which carries information and control commands between different parts of the 
architecture, see figure 6.2 .
Starting from an application point of view, the user layer must perform three fundamental 
activities: process identification, process multiplexing and flow control. Every process 
running in a given transputer may send data to other processes allocated to the same or 
different hosts. This requires that each message must, in addition to the length and data fields, 
carry the address of the destination processes. Also, for any arbitrary application, it is most 
likely that several processes share the same communication link on a single transputer. As 
a consequence, a multiplexor process is included, which by relying on the process identifi­
cations fairly serves each application process in turn. Finally, flow control between physically 
distributed processes is essential to support the inter-process communication. It contributes 
to regulating the flow of packets inside the network as well as to providing a deadlock free 
operating system.
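The multiplexing activity described above can be illustrated with a minimal sketch. It assumes a simple round-robin policy over per-process queues; the function name `multiplex` and the queue representation are hypothetical and are not taken from the thesis, which does not specify the scheduling discipline beyond serving each process fairly in turn.

```python
from collections import deque


def multiplex(queues):
    """Serve per-process message queues in round-robin order onto one link.

    queues: dict mapping a process identifier to a deque of pending messages
    (hypothetical representation). Yields (process_id, message) pairs.
    """
    order = deque(queues)              # rotation of process identifiers
    while order:
        pid = order.popleft()
        pending = queues[pid]
        if pending:
            # Tag the message with its process identification and send it.
            yield pid, pending.popleft()
            order.append(pid)          # this process waits for its next turn
        # A process with nothing left to send simply leaves the rotation.
```

Each process gets at most one message per turn, so a process with a long queue cannot monopolise the shared link.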
[Figure 6.2 layout, as far as it is legible:
 User layer:  Dest. host (8 or 16 bits) | Length | Dest. process (*) | Data
 Soft. layer: Route (8 bits) | Dest. host (8 or 16 bits) | Src. host (16 bits) | Descriptor
 Hard. layer: SYNC | F2 | F1 | CRC]
Figure 6.2 Message and Packet format generated by the 
protocol layers. (*) destination process identification 
is application dependent. It is the address 
of the virtual channels if they exist.
The software layer, on the other hand, upon receiving a message from the host in the 
form shown in figure 6.2, converts it into a packet by appending three distinct fields: a source 
host, a descriptor and a route for the packet. The source host field in conjunction with the 
descriptor field determines the sequence variables of each host connected to the 2D network 
to be used during error recovery and duplicate packet detection. The descriptor field also 
differentiates between data and command packets. The latter are only issued for network 
management; in particular, during the address assignment procedure that distributes physical 
addresses to all stations within the network (see section 6.3.4). The route field, which identifies 
a routing branch, can be chosen arbitrarily and appended to any inter-network packet. As illustrated in 
chapter 2, this feature allows packets to avoid routing branches with higher traffic loads.
Finally, in order to securely sustain packet flow within the network, the hardware layer 
surrounds the leading and trailing edges of the software-level packet with synchronisation 
(SYNC) and check-sum (CRC) fields. The former are essentially used to trigger the receiving 
hardware circuits since the data and their clock are merged into the same bit stream. The 
latter is dedicated to the detection of errors occurring within the channel. In addition, at the 
network level, the hardware layer provides necessary link-layer functions, among which are 
retransmissions, collision detection and resolution, acknowledgement and buffering man­
agement, and flow control.
The purpose of this chapter is to convey the many design ideas acquired during this 
research that enable gateways and stations to work in a CSMA/CD environment using 
our prescribed layer model, and to define the whole system from the application process to 
the physical level on a peer-layer basis.
This begins in the next section, which is devoted to the design and description of the 
hardware layer and is mostly concerned with the building blocks of the SIMP and gateways. 
In section 6.3, the services provided by the software layer are explained and possible 
solutions are proposed. Furthermore, in the same section, an algorithm which allows a dynamic 
attribution of physical addresses to stations sharing the same branch is suggested. The user 
layer relating the multiplexing and the synchronisation of the physically distributed processes 
is addressed in section 6.4. Using the OCCAM language as a simulation tool, the operation 
of the three layers that compose the whole system is verified in section 6.5.
6.2 Hardware Layer
The main purpose of this hardware layer is to provide services to both the network and 
the software layer. Such services consist of passing frames on a peer-layer basis, handling 
retransmissions when the timer expires, managing the transmission and reception of 
acknowledgements, resolving collisions, overcoming buffer overflows, maintaining flow 
control, generating and testing CRC parity checks and detecting other errors.
There is a trade-off between the services performed at each layer. For instance, the services performed 
at the hardware layer described above will be used by the software layer through their service 
access points (i.e. control, status and flag read/writes) to accomplish the communication 
process. As a result, the tasks handled by the software protocol become relatively simple. 
However, this reduction in software services will undoubtedly increase the cost of the 
hardware modules.
6.2.1 Overview of the Hardware Level Blocks
Basically, all the hardware blocks shown in figure 6.3 are essential to allow stations 
and gateways to operate within a CSMA/CD environment, and to generate the proper packet 
format. The function of each block can briefly be described as follows:
a- 8-bit transmitter, 8-bit receiver and 1-bit status buffers hold data and status flags 
for the software level operation.
b- 16-bit serial-in parallel-out receiver and 8-bit parallel-in serial-out transmitter 
registers decouple the communication between the data buffers and the channel 
(serial bus).
c- A control and status register passes commands between the hardware and the 
software layers.
d- Transmission and reception controllers manage the whole operation of the 
blocks.
e- A 5-bit address filtering unit accepts or rejects incoming packets by comparing 
their address header to the station local address. This address is acquired by 
means of the algorithm described in section 6.3.4 every time the network is 
initiated.
f- A 16-bit CRC unit generates and detects a polynomial code for each packet that 
passes (e.g. X^16 + X^12 + X^5 + 1 defined by CCITT).
g- Manchester encoder and decoder units maintain bit synchronisation between the 
transmitter and the receiver by merging the data bits with their clock.
h- Collision detection and rescheduling units handle the CSMA/CD protocol 
mechanism.
i- A 16-bit acknowledge register constructs an acknowledge packet every time a 
successful reception occurs.
Figure 6.3 The block diagram of the SIMP structure.
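As an illustration of the CRC unit in the list above, the following is a minimal software sketch of a 16-bit CRC using the CCITT generator X^16 + X^12 + X^5 + 1 (0x1021). The initial register value of zero, the most-significant-bit-first bit order and the function names are assumptions made for this example; the thesis does not state which variant the hardware unit uses.

```python
def crc16_ccitt(data: bytes, init: int = 0x0000) -> int:
    """Bitwise CRC-16 with generator x^16 + x^12 + x^5 + 1, MSB first.

    The zero initial value is an assumption (the variant often called XMODEM).
    """
    crc = init
    for byte in data:
        crc ^= byte << 8                       # bring next byte into the top
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def append_crc(payload: bytes) -> bytes:
    """Transmitter side: append the 16-bit check-sum, high byte first."""
    return payload + crc16_ccitt(payload).to_bytes(2, "big")


def frame_is_clean(frame: bytes) -> bool:
    """Receiver side: a frame with a valid trailing CRC leaves a zero remainder."""
    return crc16_ccitt(frame) == 0
```

With this variant the receiver does not need to separate the check-sum from the data: running the same division over the whole frame and testing for zero is sufficient, which suits a unit that reports only success or failure at the end of reception.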
Most of these units are standard circuits found in all types of CSMA/CD communication 
designs. This chapter is therefore concerned with the ones we have proposed in this project, 
which complement the existing protocols in order to build our required system.
6.2.2 Synchronisation Field
Although the design of synchronous systems that demand the distribution of a global 
clock [Mil 90] is generally far easier than the design of asynchronous circuits, the latter has 
more advantages. Firstly, it overcomes the necessity of distributing a global clock signal to 
all the nodes of the system, which becomes more difficult as the network size increases. 
Secondly, asynchronous networks provide optimal performance as packets cannot propagate 
across the network within one clock cycle.
Conceptually, the CSMA/CD protocol profits from an asynchronous system by use of 
the preamble or synchronisation field, and the Manchester encoding that merges the bits to 
be sent with their clock.
The synchronisation field comes at the header of any packet as a stream of alternating 
1's and 0's, to synchronise the receiver clock to the transitions. This preamble ends with a 
starting bit and then the information of the packet itself.
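The merging of data bits with their clock can be sketched as follows. The polarity chosen here (0 sent as high-low, 1 sent as low-high, as in the IEEE 802.3 convention) is an assumption for illustration, as are the function names; the thesis does not state the polarity used on the serial bus. The decoder also reports a code violation, i.e. a bit cell without a mid-cell transition.

```python
def manchester_encode(bits):
    """Encode each data bit as two half-bit levels with a mid-cell transition.

    Assumed polarity: 0 -> (1, 0), 1 -> (0, 1).
    """
    levels = []
    for bit in bits:
        levels.extend((0, 1) if bit else (1, 0))
    return levels


def manchester_decode(levels):
    """Decode half-bit level pairs back into bits.

    Returns (bits, violation), where violation is True if a cell with no
    transition (high-high or low-low) was met; decoding stops at that cell.
    """
    bits = []
    for i in range(0, len(levels) - 1, 2):
        cell = (levels[i], levels[i + 1])
        if cell == (0, 1):
            bits.append(1)
        elif cell == (1, 0):
            bits.append(0)
        else:
            return bits, True          # code violation: no mid-cell transition
    return bits, False
```

Because every valid cell contains a transition, the receiver can recover the clock from the stream itself, and the absence of a transition is available as an out-of-band signal.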
6.2.3 Buffer Management at the SIMP Hardware Level
The most important part in the interaction of the software and hardware layers is the 
service provided by the buffers which form an interface point between the host and the 
communication medium. The buffering model on which the design and 
verification of the SIMP operation have focused is the AMD 67C450X, a deep first-in-first-out CMOS 
memory [AMD 88].
This FIFO is a RAM-based device 256, 512 or 1K words deep with a 9-bit word length. 
By cascading, it is possible to expand it to any width and/or depth to create much larger 
FIFOs. Typically, two 32-byte-deep FIFOs, each 8 bits wide, are enough for our SIMP 
as the maximum packet length efficiently handled by the 2D structure is limited to 32 bytes, 
and two buffers are required to achieve an efficient communication (see chapter 5 section 
5). Furthermore, the 9th bit provided with this model will be used to determine the boundaries 
of a packet stored inside the FIFO. The status flags, which signify empty, full and half-full 
conditions of the device, can be controlled by the T2 Transputer through simple logic in order 
to have full control of the FIFO buffer and its environment. The FIFO also has the ability 
to store and output the data packets simultaneously and asynchronously, and to retransmit a 
message when requested.
6.2.3.1 Packet Boundaries
The hardware layer accepts a raw bit stream and attempts to deliver it to the software 
layer through its service access points. The storage of the stream, like its transmission, is 
straightforward; after every 8-bit word is serially shifted into the receiving shift register 
(Figure 6.3), a counter generates a write signal to the FIFO which stores one byte at a time. 
However, due to errors, the number of bits received may be less than, equal to, or more than 
the number of bits transmitted. There are four commonly used methods to delimit the 
boundary of a frame [Tan 88].
1 - character count,
2 - starting and ending characters with character stuffing,
3- starting and ending flags with bit stuffing,
4- physical layer coding violations.
The first framing method uses a field in the header to specify the number of characters 
in the frame. If this count should ever be corrupted by transmission errors, the protocol can 
get out of synchronisation and lose track of the frame boundaries. To minimise the chance 
of a bad count field, DDCMP (Digital Data Communication Message Protocol) uses an extra 
check-sum for the headers.
The second framing method has been used by IBM’s BISYNC and the ARPANET 
IMP-IMP protocol. This method, where frames are bounded by using a character delimiter, 
is closely tied to 8-bit characters (i.e. ASCII codes) and hence limits the data transparency.
The third approach is more advantageous than the second one. It is used by 
HDLC (High-level Data Link Control), which derives from IBM's SDLC, in which frames are delimited using a special flag.
The last method is only applicable to a network in which encoding on the physical 
medium contains some redundancy such as the Manchester encoding which works on 
transitions (e.g. IEEE 802 standard). The absence of transitions results in a code violation 
(i.e. high-high or low-low) and thus stops reception. The framing boundary technique used 
in our protocol works as follows:
a- The physical layer uses the Manchester code violation to detect an end of carrier 
as a normal CSMA/CD protocol does, 
b- The software layer uses the length field found in each message (Figure 6.2) to 
extract the packet from the receiving FIFO when no errors are reported by the 
physical layer. However, in the case of errors that may also have changed the 
length field, the 9th bit in each receiving FIFO, which is set to one when transitions 
vanish from the channel, will be used by the software protocol to find the 
boundary and discard packets.
c- The user layer receives only correctly framed messages and the length field 
already present within the message is then used to pass messages to the host 
transputer.
6.2.3.2 Detecting Errors
Errors in received packets may be detected using the CRC logic to calculate the 
check-sum [Pet 61, Fon 61, Kas 63, Bur 72, Ben 84]. Given that the software protocol starts 
to process the header of a packet immediately when it is received (chapter 5 section 5) and 
that the CRC test only reports a success or failure at the end of the packet reception, an 
additional one bit status FIFO is required to hold the CRC result (e.g. 1 if an error is detected 
and 0 otherwise) of all incoming packets regardless of the speed of the software layer.
Based on the result of this status FIFO, successful packets are efficiently removed from 
the receiving FIFO and directed towards their destination hosts using the length field and the 
SIMP Transputers’ block move instructions [May 88], whereas, corrupted ones should be 
discarded up to the 9th bit boundary marker in a rather less efficient way (e.g. an iterative 
while loop).
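The interplay between the status FIFO, the length field and the 9th-bit marker can be sketched as below. This is only an illustrative model: the receiving FIFO is represented as a queue of (byte, marker) pairs, the length field is assumed to be the first stored byte and to count every stored byte of the packet, and the name `drain` is hypothetical.

```python
from collections import deque


def drain(rx_fifo, status_fifo):
    """Remove all completed packets from the receiving FIFO.

    rx_fifo:     deque of (byte, marker) words; marker is the 9th bit, set on
                 the last word stored before the carrier vanished.
    status_fifo: deque of CRC results, one per packet (1 = error, 0 = good).
    Returns the list of good packets as bytes objects.
    """
    good = []
    while status_fifo:
        crc_error = status_fifo.popleft()
        if not crc_error:
            # Fast path: the length field can be trusted, so the packet is
            # moved as one block (the analogue of a block move instruction).
            length = rx_fifo[0][0]
            good.append(bytes(rx_fifo.popleft()[0] for _ in range(length)))
        else:
            # Slow path: the length field may be corrupted, so words are
            # discarded one at a time up to and including the marked word.
            while rx_fifo:
                _, marker = rx_fifo.popleft()
                if marker:
                    break
    return good
```

The corrupted packet costs one test per word, which is why the text describes this path as the less efficient one.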
6.2.4 Buffer Management at the Gateway Level
One of the most important components of the gateway structure is its buffering because 
otherwise the gateway transmitter and receiver units are similar to the SIMP ones, see figures
6.3 and 6.4.
A gateway must be able to distinguish between intra-network and inter-network packets, 
in order to know which packets to forward over the bridge link, unless full broadcasting is 
desired. Both receiving shift register units R1 and R2 in figure 6.4 contain address filtering 
which is used in conjunction with each incoming packet to decide whether to accept or reject 
it. As in the SIMP, the receiving shift registers convert a serial stream of 
data into a word equal to the width of the FIFOs, and the transmitting shift registers retrieve 
words from the FIFOs and send them bit-serially. Associated with each of these registers is a separate 
counter that monitors the number of bits received or transmitted, before issuing a store or 
read signal to the respective FIFO.
Figure 6.4 Gateway structure.
Because packets may arrive at the gateway faster than they can be retransmitted, 
buffering is necessary. The buffering management will depend on the type of service offered 
by the network (i.e. hop-by-hop or end-to-end).
If the end-to-end service is provided and error recoveries are done at each station, 
gateways will consist of a simple FIFO of the type AMD 67C450X. When packets arrive 
at a full buffer, they will be discarded and their source stations guarantee their retransmission.
However, with the hop-by-hop service where error management and retransmissions 
are processed at each intermediate gateway, the AMD 67C450X model appears insufficient. 
When the buffer contains many packets at once, the 9th bit of the FIFO can still be used to 
delimit them, and hence can be used to create inter-frame gaps between their transmissions. 
If one of these packets is required for a retransmission as a result of collision, error or buffer 
overflow at the next receiving node, the AMD 67C450X is designed to do so and automatically 
resets its read counter to zero. As a consequence, all packets previously sent will be 
retransmitted. Of course, the protocol associated with the software layer at the receiving 
station is strong enough to detect duplications. Regardless, an enormous bandwidth will be 
wasted for every retransmission. There are two alternatives to overcome this problem.
The use of two 67C450X FIFOs: one holds all arriving packets and the other the currently 
transmitted packet. The main FIFO finishes its internal transfer towards the retransmission 
FIFO, even if the original transfer to the channel was prematurely aborted. Afterwards, 
retransmissions will only be issued from the second FIFO, which contains the single packet, 
until an acknowledgement erases it; the main one will then be considered for transmission 
again (Figure 6.5).
Figure 6.5 Two 67C450X FIFO cooperate to accomplish the retransmission.
The use of a cyclic dual-port memory: Such a model is an enhanced version of the AMD 
67C450X. It incorporates all necessary functions to satisfy the operation of our protocol 
with a minimum of additional logic. Figure 6.6 shows the proposed memory model and the 
flowchart in figure 6.7 explains its operation. This dual-port memory contains registers that 
point to the next memory words to be stored and read out, and extra registers which can be 
utilised to save copies of the first two.
The write pointer increments upon the write signal wt being activated every time a word 
(8 or 16 bits) is stored in the memory. The read pointer on the other hand increments upon 
the read signal rd being activated every time a word is retrieved from the memory. The 
detection of a carrier by the receiver asserts the signal cw (carrier write) which saves the 
current contents of the write pointer wp into the intermediate write pointer iw. The clear 
signal cl, activated every time a CRC error is detected by the receiver, restores the old content 
of the write pointer which was in the intermediate write register. Hence, (wp-iw) words 
which have already been stored will be discarded. Similarly, the starting of a transmission 
activates the signal cr which saves the current content of the read pointer rp into the inter­
mediate read register ir. When a retransmission is required, the signal (rt) restores the old 
value of the read pointer from the intermediate read register. Therefore, (rp-ir) already 
transmitted words will be recovered. In all the cases the 9th bit flag can be used as a packet 
delimiter.
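The pointer operations just described can be modelled in a few lines. The sketch below uses the signal names of the text (wt, rd, cw, cl, cr, rt) as method names; the class name and the omission of full and empty detection are simplifications assumed for this example, not part of the proposed memory.

```python
class CyclicDualPortMemory:
    """Behavioural model of the cyclic buffer with saved read/write pointers.

    Full/empty detection and the 9th-bit delimiter are left out for brevity.
    """

    def __init__(self, depth):
        self.depth = depth
        self.mem = [None] * depth
        self.wp = 0   # write pointer
        self.rp = 0   # read pointer
        self.iw = 0   # intermediate write pointer (copy of wp)
        self.ir = 0   # intermediate read pointer (copy of rp)

    def cw(self):
        """Carrier detected: remember where the incoming packet starts."""
        self.iw = self.wp

    def wt(self, word):
        """Store one word and advance the write pointer cyclically."""
        self.mem[self.wp] = word
        self.wp = (self.wp + 1) % self.depth

    def cl(self):
        """CRC error: discard the (wp - iw) words of the corrupted packet."""
        self.wp = self.iw

    def cr(self):
        """Transmission starts: remember where the outgoing packet starts."""
        self.ir = self.rp

    def rd(self):
        """Read one word and advance the read pointer cyclically."""
        word = self.mem[self.rp]
        self.rp = (self.rp + 1) % self.depth
        return word

    def rt(self):
        """Retransmission: recover the (rp - ir) words already sent."""
        self.rp = self.ir
```

Unlike the retransmit function of the plain FIFO, which returns the read counter to zero, `rt` here rewinds only to the start of the packet currently being sent, so earlier packets are not sent again.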
In particular, during the store-and-forward operation or when packets are momentarily 
blocked by the traffic, the corrupted ones can easily be erased from the FIFO by activating 
the signal clear (cl). However, during the cut-through progress a wrong CRC field will be 
generated at the tail of the flying packet and somewhere it will be discarded when blocked 
at a gateway or when it reaches its destination station.
Figure 6.6 The cyclic dual-port memory that extends the 67C450X model.
[Flowchart: a retransmit request (rt) restores Rp := Ir, recovering (Rp - Ir) words; 
a clear request (cl) restores Wp := Iw, discarding (Wp - Iw) words; storing a word in the 
FIFO advances Wp := (Wp + 1) mod max; removing a word from the FIFO advances 
Rp := (Rp + 1) mod max.]
Figure 6.7 Operation of the cyclic dual-port buffer.
6.2.4.1 Manipulating Addresses at the Network Level
Although the 2D structure is intended to link transputers that operate with the OCCAM 
programming language [May 88] under the CSP formalism [Hoa 78] and, hence, only 
point-to-point communication is required, broadcasting and multicasting operations are 
necessary for network management (see section 6.3.2 for more details). Therefore, two types 
of addresses can be found within the packet format: point-to-point and broadcasting or 
multicasting addresses.
The 2D architecture as defined in chapter 4 has a simple addressing format as a limited 
number of hops (=3) have to be crossed. In order to pass a packet through three hops with 
two gateways and one destination station, three addresses in the destination fields are needed, 
with the most significant bit being set to 1 or 0 to distinguish between gateways and stations 
respectively (Figure 6.8). Notice that the route is also an address of the gateway through 
which the packet enters its next routing branch. Furthermore, when a packet stays in its local 
branch, the route and the gateway fields are unused, and the destination field contains only 
an 8-bit station address (ds). These notations are similar to the ones used in chapter 2 section 
5. Using an 8-bit length format, in principle, it is possible to address up to (2^7 - 2)^2 = 15876 
transputers, even though, in practice, smaller numbers are required in each branch to achieve 
a good performance. This number is calculated from the fields in figure 6.8, where two 
addresses designating the broadcast and reserved addresses (for the latter see section 6.3.4) 
have been dismissed. The route field does not contribute to the addressing of processors, 
because all the gateways on each injection branch provide routes along different routing 
branches to the same gateways on the absorption branches.
<--- | 1 | route (7 bits) | 1 | gateway (7 bits) | 0 | station (5 bits) | x x |
     |        X2          |         X1           |           ds               |
Figure 6.8 Format of the route and point-to-point destination 
addresses shown in figure 6.2. Here, xx determines 
one of 4 hosts supported by a single station and the 
arrow points to the direction of the flow.
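The address format and the capacity figure can be checked with a short sketch. The field widths follow figure 6.8; the placement of the two host bits in the least significant positions of ds, and the function names, are assumptions made for this example.

```python
def pack_destination(route, gateway, station, host):
    """Build the 24-bit header X2, X1, ds of figure 6.8 as three bytes.

    Assumed bit placement: flag in bit 7; ds = 0 | station (5 bits) | host (2 bits).
    """
    if not (0 <= route < 128 and 0 <= gateway < 128):
        raise ValueError("route and gateway are 7-bit fields")
    if not (0 <= station < 32 and 0 <= host < 4):
        raise ValueError("station is a 5-bit field, host a 2-bit field")
    x2 = 0x80 | route                  # leading 1 marks a gateway-level field
    x1 = 0x80 | gateway
    ds = (station << 2) | host         # leading 0 marks the station field
    return bytes([x2, x1, ds])


def unpack_destination(header):
    """Inverse of pack_destination: returns (route, gateway, station, host)."""
    x2, x1, ds = header
    return x2 & 0x7F, x1 & 0x7F, (ds >> 2) & 0x1F, ds & 0x03


# Two of the 2^7 codes in each 7-bit field are set aside (broadcast and
# reserved), giving the addressing capacity quoted in the text.
CAPACITY = (2**7 - 2) ** 2
```

The route field is deliberately left out of the capacity calculation, since it selects a path rather than a destination.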
On the other hand, broadcast or multicast address headers generated only by the software 
layer carry commands or special services to all or groups of stations (Figures 6.9, 6.10). Any 
gateways in the routing branch keep monitoring all headers of any received packet until the 
most significant bit of the received field is zero. This declares the end of all gateway addresses
and the beginning of the local station address. During this process, any gateway recognising 
the broadcast or its unique physical address, used at the network level, forwards the packet 
towards the attached injection branch where all stations of that group receive the message.
<--- | 1 route | 1 1111111 | 0 1111111 |
Figure 6.9 Format of the broadcast destination 
address header.
<--- | 1 route | 1 gateway 1 | 1 gateway 2 | ... | 1 gateway n | 0 1111111 |
Figure 6.10 Format of the multicast destination address 
header.
The strategy we undertake to pass packets through the network is as follows. An 
inter-network packet carrying a 24-bit destination address (Figure 6.8) crosses two con­
secutive gateways. Every time a gateway is crossed, the matching physical address field is 
removed until the destination station is reached.
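The strategy can be illustrated by a minimal sketch of the point-to-point case, in which each gateway strips its own address field from the head of the packet. The function name, the byte-level packet representation and the example addresses are hypothetical; broadcast and multicast handling is omitted.

```python
def gateway_forward(packet: bytes, my_address: int):
    """Accept a packet addressed to this gateway and strip the matched field.

    Returns the packet to forward (without its leading address byte), or None
    if the leading field is a station address or names another gateway.
    """
    first = packet[0]
    if not first & 0x80:               # most significant bit 0: station field
        return None
    if (first & 0x7F) != my_address:   # addressed to a different gateway
        return None
    return packet[1:]


# Example: route/gateway 5, then gateway 9, then station field 0x46.
packet = bytes([0x85, 0x89, 0x46]) + b"data"
after_first = gateway_forward(packet, 5)
after_second = gateway_forward(after_first, 9)
```

After two gateways only the station field remains at the head of the packet, so the station's address filter always examines the first byte it receives.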
There are many ways to implement this operation depending on the width of the FIFO 
buffer and the length of the receiver shift register, and whether the address header is removed 
by the receiving or the transmitting part of the gateways. Obviously, the length of the receiving 
register and the procedure of discarding packet headers play a major role in determining the 
latency per node.
Since each packet is composed of a multiple of 8-bit fields, the FIFO and the transmitting 
shift register are organised with 8-bit words, thereby avoiding the complexity of re-packetising 
frames. Accordingly, if the receiving shift register is chosen to be 8 bits long, the latency 
to decode the headers of a packet before storing it inside the FIFO buffer would be 0.4 μsec 
(i.e. 8 bits to receive the preamble plus 8 bits to detect the address matching at a rate of 40 
Mbits/sec). With a 16-bit shift register, it is also possible to accept or reject 
an arriving packet within the same time, provided the address matching test takes place when the 
address header occupies the first half of the shift register (i.e. first 8 bits). However, eight 
additional shifts (0.2 μsec in total) would then be needed to line up the header of the packet over 
the register (Figure 6.4). Finally, after a successful address matching, each consecutive eight 
shifts will generate a signal to store the most significant byte of the register inside the FIFO.
With both register lengths, if the used address header (that of the current gateway) is 
removed at the receiver side of the node, eight more shifts will be required to move this field 
out of the register, hence increasing the latency (e.g. 0.6 μsec for an 8-bit or 0.8 μsec for a 16-bit 
register).
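The latency figures quoted above are all multiples of the time needed to shift eight bits at 40 Mbits/sec. The short calculation below makes this explicit; the constant and function names are chosen for this example only.

```python
BIT_RATE_MBITS = 40        # channel rate in Mbit/s, as quoted in the text


def shift_time_us(bits):
    """Time in microseconds to shift the given number of bits at the channel rate."""
    return bits / BIT_RATE_MBITS


byte_time = shift_time_us(8)             # one group of eight shifts
address_match = shift_time_us(8 + 8)     # preamble plus address field

# Removing the used address at the receiver costs one extra byte time on top
# of whatever the register already needs before the first store.
receiver_removal_8bit = address_match + byte_time
receiver_removal_16bit = address_match + 2 * byte_time
```

Here one byte time is 0.2 μsec, so the address match takes two byte times and the two receiver-side removal figures correspond to three and four byte times respectively.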
It is therefore advisable to remove the used addresses at the transmitting side of the 
nodes. With this operation, two advantages can be noticed. Firstly, at the gateways, packets 
can be stored on the fly inside the FIFO, and their first address headers substituted directly 
by the preamble (Figure 6.4). Secondly, at the station, even though the destination station 
address has been partially utilised by the hardware layer protocols, it is also prerequisite for 
the software layer to identify one of the four attached hosts. Its removal is therefore only 
permitted at the software layer side (i.e. in this context, it is equivalent to the transmitting 
side of a gateway).
Although an 8-bit receiving shift register introduces the smallest possible node latency, 
it has been decided to utilise a 16-bit one for our 2D structure, which yields a latency of 
0.8 μsec per node, including the address matching time of 0.4 μsec. This choice will be 
justified in the next section where we explain that the length of the receiving shift register is 
related to the length of the CRC check-sum field.
6.2.4.2 CRC and the Latency per Gateway
Since the address fields used are removed in each intermediate gateway, appending a 
new CRC for each packet is necessary. However, depending on the lengths of the receiving 
shift register and the CRC check-sum field, two situations arise. Firstly, if the check-sum 
length is larger than the width of the receiving shift register, part of the CRC will be stored 
inside the FIFO, while the other part will be discarded. As a result, the newly-generated 
CRC check-sum covers the whole frame including a fraction of the old CRC recently stored. 
Therefore, the packet length increases from hop to hop as redundancies are added. Although 
this operation allows a large length CRC check-sum and hence reduces the probability of 
un-detected errors, it necessitates additional software complexity at the SIMP receiver to 
count and discard all CRC’s appended to the tail of each packet. Secondly, if the CRC 
check-sum is chosen to be equal to the receiving shift register length, in this case 16 bits, the 
end of the carrier signal prevents it from being stored inside the FIFO. With such an 
implementation, the probability of an un-detected error would be acceptably low (at 1.5 x 10^-15 
[Hop 86]) and therefore this last suggestion is adopted for our configuration.
6.2.5 Generating Acknowledgements
In a hop-by-hop service, an acknowledgement must be returned to inform the sender 
that the next node in the path of each packet has successfully accepted it. However, this 
acknowledgement must be returned to the sender and this is possible only if the data frame 
contains the source address field.
The strategy of addressing and moving packets through the 2D structure requires the 
removal of the destination fields used within each gateway crossed by the packet. Hence, 
the destination header fields vary across the network. Incorporating the source 
addresses, usually stored after the destination ones, within the acknowledgement format 
therefore requires complex buffering.
If the end-to-end service with acknowledgements managed in software is adopted, 
complexities in hardware can be avoided. Even though each destination field may be removed 
at each gateway crossed, the unique source station field, which is located at the end of all 
destination ones, can only be processed at the receiving station. It has been shown that this 
end-to-end service acknowledgement is inefficient to incorporate within the CSMA/CD 
protocol [Tok 77]. In addition, if transmitting buffers are used to hold both data and 
acknowledgement packets, a livelock situation may result.
An efficient and simple alternative that can be used with the hop-by-hop service is to 
broadcast an acknowledgement locally in each branch where the message has been received 
successfully. Basically, each node, when a transmission is completed within a branch, is in 
one of two states:
1- waiting for an acknowledgement if it has issued a transmission,
2- trying to transmit or receive data.
Any node in state 2 ignores the reception of any acknowledgement. As long as the channel 
is shared among transmissions using the properties of the CSMA/CD bus, there will be only 
one transaction at a time, and thus only one station may be found in state 1. Hence, each 
generated acknowledgement will only be received by the unique node that has sent that data 
packet.
To do this, a node must distinguish between data, command and acknowledgement 
packets by the use of bits in the descriptor field which specify the type of packets received. 
However, the position of this descriptor field is a crucial trade-off between latency and hardware 
complexity. If it is placed at the tail of all the address headers, more buffering will be required 
to buffer all these address headers before taking any decision, and hence the latency at each 
node increases. If it is placed at the head of the address header, address removal by gateways 
will be more complex.
We suggest that the descriptor field (section 6.3.3) will only be utilised for data and 
command frames, and independently the acknowledgement packet will be identified by its 
fixed length. The overhead associated with this format is insignificant (2 Bytes) compared 
to the data packet length. In its special and unique format, the acknowledgement may be 
hard-wired at each node and broadcast from a separate acknowledgement register during the 
next slot of time following a successful packet reception (Figures 6.11 and 6.3).
SYNC 11111111
Figure 6.11 A 16-bit acknowledgement packet format.
However, this frame may be corrupted because
1- it is truncated to less than 16 bits, due to noise or failure of synchronisation,
2- it is expanded to more than 16 bits for the same reasons mentioned above,
3- a bit is inverted in the address field which comes after the SYNC.
In case 1, the frame will not be accepted because the counter used to count the incoming bits 
of each packet, and then issue a write to the FIFO after every 8 pulses, did not even reach its 
minimal count (8). Case 2 will be reported as an error once it has been stored inside the 
buffer. In case 3, the value of the address is of no importance since it has only been used to 
complete the format, and acknowledgements are determined by their fixed length as shown 
in figure 6.11 above. As a result, the inclusion of the CRC field is irrelevant.
6.2.6 Flow Control at the Network Level
It is always possible to provide a window mechanism and to manage acknowledgements 
at the software layer so as to regulate the space available in the FIFO receiving buffer and, 
hence, implement a secure flow control strategy. Unfortunately, this operation is very 
inefficient in terms of communication overhead and system performance. Since we have 
utilised the hop-by-hop fast acknowledgement services, the receiving buffers on the path of 
the data flow fill up rapidly as more packets arrive from the network. Eventually, as pointed 
out in chapter 4, a full buffer rejects packets and the sender retransmits them creating, 
therefore, a "backpressure" effect. In this situation, to prevent the waste of bandwidth during 
the retransmission process, we have suggested that a jamming signal interrupts the sender 
every time a buffer overflow condition occurs. Afterwards, the packet will be re-sent after 
a pre-defined time sufficient for the receiver to provide empty spaces within its buffer.
In this section, we investigate the possibilities of using our FIFO buffer model to 
implement these strategies efficiently. We assume that the AMD 67C450X FIFO model is 
partitioned into blocks of length 32 bytes each, corresponding to the maximum packet length 
handled by the network. From a design point of view, this can be realised by cascading a 
given number of 32 bytes deep FIFOs, and "ORing" their status lines which signal their empty 
conditions. The resulting status line will ultimately indicate an empty space, if at least one 
block has no data in it. It is also possible to incorporate this function within the dual-port 
memory model introduced previously. Instead of ignoring receptions when the whole buffer 
is full (i.e. |wp - rp| = length.of.buffer), the full condition is reported whenever fewer than 32 
byte locations are available (i.e. |wp - rp| = length.of.buffer - 32). If this full condition 
occurs during reception, the current packet will be entirely stored within the FIFO, since the 
last 32 byte margin always exists, before further receptions are ignored. From this partitioning 
proposal, various advantages emerge.
Firstly, the FIFO receiver contains only complete packets (that is, no data fragments), because 
the last block always offers space for a complete frame before reception is disabled. This 
contrasts with the case of a continuously storing buffer, i.e. one where the FIFO accepts packets 
continuously until it becomes full, possibly in the middle of the last packet, so that the last data 
fragment written into the FIFO would lead to a severe degradation in the system performance and 
probably involve a livelock situation (verified by simulation in section 6.5). Secondly, when 
the last block of the buffer contains data, any further transmission towards this full buffer 
will be interrupted by the jamming signal within 0.4 µsec, that is, the latency of each 
network node in detecting a matching address in the incoming packets. Therefore, the waste of 
bandwidth is greatly reduced (i.e. only 0.16% of the bandwidth of a branch is lost in 
every trial during full buffer conditions). Thirdly, by choosing the retransmission period 
equal to the time needed to transmit the largest packet (i.e. 32 bytes), we can increase the 
probability of finding an empty space during the next retransmission; typically, these times 
can be evaluated as 6.4 µsec and 100 µsec for gateways and stations respectively. Finally, 
when the FIFO buffer is full, all senders will be synchronised with the receiver and hence 
the flow of data is controlled.
However, even though not particularly important, when the network exchanges smaller 
messages using this partitioned FIFO scheme, there may be unused space in the last 32 byte 
block.
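The partitioned full condition described above can be sketched as follows. This is a minimal illustration in Python, not the hardware design: the class name `PartitionedFifo`, its methods and the byte-count bookkeeping are hypothetical, and the jamming signal is represented simply by `receive` returning `False`.

```python
from collections import deque

MAX_PACKET = 32  # bytes; maximum packet length handled by the network


class PartitionedFifo:
    """Receive buffer that reports 'full' while fewer than MAX_PACKET bytes are free."""

    def __init__(self, capacity):
        if capacity < MAX_PACKET:
            raise ValueError("capacity must hold at least one maximum-length packet")
        self.capacity = capacity
        self.packets = deque()
        self.used = 0

    def is_full(self):
        # Full is reported early, as soon as the 32-byte margin is no longer free.
        return self.capacity - self.used < MAX_PACKET

    def receive(self, packet):
        """Store a whole packet, or return False (the sender would be jammed)."""
        if len(packet) > MAX_PACKET:
            raise ValueError("packet longer than the network maximum")
        if self.is_full():
            return False
        # The margin guarantees that an accepted packet always fits entirely.
        self.packets.append(packet)
        self.used += len(packet)
        return True

    def read(self):
        """Remove and return the oldest complete packet."""
        packet = self.packets.popleft()
        self.used -= len(packet)
        return packet
```

Because a packet is only accepted while at least 32 bytes are free, the buffer never holds a fragment; the cost is the unused space that may remain in the last block when short messages are exchanged.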
6.3 Software Layer
The software layer depends on the service provided by the hardware layer through its 
service access points. If the lower layer protocols provide a virtual circuit service, and 
guarantee that messages are delivered in order from the sender to the receiver without errors, 
losses or duplications, the software layer becomes relatively simple. However, if the lowest 
layer protocols furnish datagram services, it is up to the software layer to ensure that packets 
are correctly delivered.
For instance, if the extended FIFO model is used, more functions are accomplished 
at the hardware level, where only error-free packets are passed to the layer of interest with 
minimum overhead. Nonetheless, the sequencing, duplication management, packetising 
and network monitoring are part of the software layer.
6.3.1 Sequence Variables
In general, a full duplex transmission on a simplex channel is achieved by using an 
acknowledgement mechanism that allows up to q* frames to be outstanding at any time. This 
number is called the "window size". As new information frames are received and 
acknowledged on the receiver side, the window is advanced and the sender is allowed to 
transmit new frames. Every station maintains sent V(S) and expected V(R) sequence variables 
associated with the data frame to be transmitted or received respectively. There are two basic 
strategies which are commonly used in communication protocols [Hal 85, Tan 88]: 
send-and-wait and continuous repeat request. Each is a trade-off between 
buffering and channel utilisation, one being improved at the expense of the other.
With the send-and-wait control scheme, the sender and the receiver windows both have 
a length equal to one frame (q* = 1), and hence only two identifiers are sufficient for the 
receiver to determine whether a particular frame received is new or a duplicated copy of the 
last frame correctly received. Typically, the two identifiers would be 0 and 1, requiring just 
a single binary digit for their implementation.
Within the continuous repeat request, two operations are distinguished: go back n, 
characterised by a window size greater than one at the sender and equal to one at the receiver, 
and selective retransmission, characterised by a window size greater than one both at the 
sender and the receiver. In all cases, the size of the receiver window is always fixed and it 
is allowed to slide through the sequence numbers every time a correct message is received.
In addition to the simplicity of the send-and-wait strategy, it has been shown [chapter 
5 section 3] that when the round-trip acknowledgement delay is very small (hop-by-hop service 
with high priority CSMA/CD acknowledge), a window size greater than one adds more 
complexity than its performance gain justifies. Therefore, the send-and-wait scheme can 
efficiently be used to overcome duplications.
However, since acknowledgements are issued in each branch and packets may flow 
along different paths, the receiving stations may discard correct packets received out of 
sequence. This critical situation is illustrated by an example in figure 6.12. Suppose that 
both stations A and B start with an initial sequence variable equal to zero (V(S) = V(R) = 0). 
A consecutively sends two packets in its injection branch with the sequence variables V(S) 
= 0 and V(S) = 1 respectively. Now, suppose that the packet V(S) = 1 arrives before the packet 
V(S) = 0 because the traffic in the routing branch (1) is heavy. The receiver B simply discards 
the packet as it is expecting the one with the sequence variable V(R) = 0. On the other hand, 
if packet V(S) = 0 arrives before packet V(S) = 1, but its acknowledgement is lost, the duplicated 
copy of the packet with V(S) = 0 retransmitted later will be accepted, since by then the expected 
sequence variable has cycled back to V(R) = 0. Hence, in both situations the protocol fails.
Figure 6.12 Correct packet received out of sequence with a window equal to one.
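The two failure cases can be reproduced with a small model of the receiving station. This is only a sketch under simplifying assumptions: the class `SendAndWaitReceiver` is hypothetical, it models the one-bit expected sequence variable V(R) of station B and nothing else (no gateways, no acknowledgement traffic), and the payload names are illustrative.

```python
class SendAndWaitReceiver:
    """Receiver with a window of one frame and a one-bit V(R)."""

    def __init__(self):
        self.vr = 0            # expected sequence variable V(R)
        self.delivered = []    # payloads passed up to the host

    def receive(self, seq, payload):
        """Accept the packet only if its sequence bit equals V(R)."""
        if seq != self.vr:
            return False       # taken for a duplicate and discarded
        self.delivered.append(payload)
        self.vr ^= 1           # toggle between 0 and 1
        return True


# Case 1: the packet with V(S) = 1 overtakes the packet with V(S) = 0.
b = SendAndWaitReceiver()
overtaking_accepted = b.receive(1, "second")   # correct packet, yet discarded
b.receive(0, "first")
case1_delivered = list(b.delivered)

# Case 2: both packets arrive in order, then a copy of packet 0 is retransmitted
# because its acknowledgement was lost.
b = SendAndWaitReceiver()
b.receive(0, "first")
b.receive(1, "second")
duplicate_accepted = b.receive(0, "first")     # V(R) is back at 0, so it is accepted
case2_delivered = list(b.delivered)
```

In the first case a correct packet is lost; in the second a duplicate is delivered to the host. Both arise because a single bit cannot distinguish reordering from retransmission.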
One way to overcome this problem is to assume that the network provides a datagram 
service ensuring only the error-free delivery of messages. Such an assumption requires the 
use of the window scheme at the user level to recover, assemble and handle messages. 
Obviously, no one would agree to include such a heavy task at the application process level.
The simplest approach is to assign a unique route or path to each host and hence all 
packets arrive in the order of their transmission. The route can be initiated by having the 
software layer protocol establish new sequence variables to both sending and receiving 
stations involved in the communication, and can be changed later, if an intensive number of 
jamming signals arrive.
Implementing these Sequence Variables in the SIMP Memory
The send-and-wait transmission strategy requires only two bits (1 and 0) to specify the 
current sent and expected sequence variables. Since OCCAM philosophy does not allow 
shared variables, it is therefore important to separate the locations of the two sequence 
variables in the memory as they are accessed by two concurrent processes: transmitter and 
receiver.
If we assume that each reserved memory word of the SIMP holds the sent or expected 
sequence variables of M hosts (M ≤ 16 for T2 Transputers), a total number of Transputers N 
requires 2N/M bytes of memory space. To access a particular cell or bit, M̄ = M/2 
shifts are needed on average. These relations give a trade-off between the average number 
of shifts, which typically affects the latency of the interface, and the amount of storage, which 
determines the cost of each SIMP. Table 6.1 reveals that the greater the average number of 
shifts that one can tolerate, the smaller the amount of storage that is needed to retain all 
sequence variables of the system hosts.
An appropriate choice would be 2 KBytes of memory, which could be provided by the 
internal RAM of the T2 Transputer, to store the sequence variables of 15876 transputers; 
this involves an average of 8 shifts, or 0.4 µsec latency with a 20 MBits/sec processor bandwidth. 
Obviously, a simple look-up table will be required to store and retrieve these variables 
(see Appendix E 5).
L (KBytes)    M̄ (average shifts)
32            0.5
16            1
8             2
4             4
2             8
Table 6.1 Memory length versus the number of shifts.
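The trade-off in Table 6.1 follows directly from the two relations given in the text (2N/M bytes of storage and M/2 shifts on average). The sketch below evaluates them for N = 15876; the function names are hypothetical, and the time per shift (one bit time at 20 MBits/sec) is an assumption inferred from the quoted figure of 0.4 µsec for 8 shifts.

```python
N_HOSTS = 15876          # number of transputers quoted in the text
SHIFT_TIME = 1 / 20e6    # seconds per shift; assumed to be one bit time at 20 MBits/sec


def memory_bytes(n_hosts, m):
    """Storage needed when each memory word holds the variables of m hosts."""
    return 2 * n_hosts / m


def average_shifts(m):
    """Average number of shifts needed to reach one bit within a word."""
    return m / 2


def table_rows(n_hosts=N_HOSTS):
    """Return (m, bytes, average shifts, latency in seconds) for m = 1, 2, 4, 8, 16."""
    rows = []
    for m in (1, 2, 4, 8, 16):
        shifts = average_shifts(m)
        rows.append((m, memory_bytes(n_hosts, m), shifts, shifts * SHIFT_TIME))
    return rows


if __name__ == "__main__":
    for m, size, shifts, latency in table_rows():
        print(f"M={m:2d}  {size:8.1f} bytes  {shifts:4.1f} shifts  {latency * 1e6:.3f} usec")
```

For M = 16 the storage comes to 1984.5 bytes, which is the "2 KBytes" row of the table; the other rows are close to the rounded values listed there.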
6.3.2 Command Packets
In contrast to the data packets that are consumed by application processes using a 
point-to-point routing operation, commands are handled by the software layer protocol to 
monitor and manage the network. Among these commands, the most common ones are: the 
commands to initialise the sequence variables of two SIMPs, those to reset all attached hosts
or load them with program code, those to signal an error on some part of the network, and 
those to perform the physical addressing assignment operation (to be presented in section 
6.3.4).
In some cases, these commands are point-to-point communications requiring 
acknowledgement where one or more packets are exchanged between two physically dis­
tributed stations or gateways. In others, the command may be broadcast or multicast packets 
sent to all or groups of stations. As it is a problem to generate multiple acknowledgements 
in a serial bus, we suggest that all broadcast or multicast commands are not acknowledged. 
Their correct reception is therefore inferred from the service they have provided [Geh 84]. 
Although inefficient, these commands can still be accomplished by a series of point-to-point 
routing operations.
6.3.3 Descriptor Field
As shown in figure 6.2, the descriptor field is generated by the software layer and 
appended to the message received from the host. Basically, this field contains the sequence 
variables by which the protocol overcomes packet duplications when their acknowledgements 
are lost, and bits to distinguish between data and commands. Table 6.2 summarises the format 
of the descriptor field. It turns out that only 2 bits are needed to implement this field. 
Therefore, knowing that the maximum packet length handled by the network is 32 bytes, which 
necessitates only 5 bits, we can combine, without any extra consideration, the length of the 
data received from the host and the descriptor generated by the SIMP into a single field as 
shown in figure 6.13. In the next section, we will see how these command fields can be used 
to implement an address assignment algorithm.
6-bit field          b1   b0    Description of the field
length               0    1/0   data packet with sequence variable 1 or 0.
type (64 commands)   1    1/0   command packet with sequence number 1 or 0.
Table 6.2 The descriptor field format.
| 6-bit type   | 1 | 1/0 |
| 6-bit length | 0 | 1/0 |
Figure 6.13 The first field is the descriptor field of a command, where 
the length is irrelevant. The second field contains the 
length of the data packet and the two bits of the 
descriptor field chained to it as in figure 6.2.
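Combining the 6-bit length or type with the two descriptor bits gives a single byte, which can be packed and unpacked as sketched below. The bit order is an assumption (6-bit field in the most significant bits, then b1, then b0, following the left-to-right order of figure 6.13), and the function names are hypothetical.

```python
def pack_descriptor(field6, is_command, seq):
    """Pack a 6-bit length/type, the data/command bit b1 and the sequence bit b0."""
    if not 0 <= field6 < 64:
        raise ValueError("length or type must fit in 6 bits")
    if seq not in (0, 1):
        raise ValueError("sequence variable must be 0 or 1")
    return (field6 << 2) | (int(is_command) << 1) | seq


def unpack_descriptor(byte):
    """Return (field6, is_command, seq) from a packed descriptor byte."""
    if not 0 <= byte < 256:
        raise ValueError("descriptor must be a single byte")
    return byte >> 2, bool((byte >> 1) & 1), byte & 1


# A 32-byte data packet carrying sequence variable 1.
data_descriptor = pack_descriptor(32, False, 1)
# Command type 000010 carrying sequence number 0.
command_descriptor = pack_descriptor(0b000010, True, 0)
```

A receiver would test b1 first: when it is 0 the upper six bits give the number of data bytes to read, and when it is 1 they select one of the 64 commands.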
6.3.4 Providing Physical Addresses for Stations
Running a program on the 2D network involves discovering the configuration of the 
structure, allocating processes to transputers found, instructing SIMPs about their injection 
branch location and finally loading the code for each host processor. This task is performed 
by a master process running in one of the SIMPs.
Each SIMP and gateway node in the 2D network must therefore be uniquely identified 
by a pair of addresses. A SIMP needs a physical address (ds) on the injection branch to which 
it is attached, and a branch number (X1) of that injection branch. In the case of gateways, 
two port addresses, which represent the routing and the injection branch to which a gateway is 
connected, are to be provided (see figure 6.8).
In direct transputer interconnection networks such as the hypercube or the torus, these 
addresses are defined by the physical connection of each processor. However, when processors 
are indirectly connected to their physical network via special hardware support, such 
as the MadPostman router [Mil 90] or our dedicated interface message processor, physical 
addresses that uniquely identify the whole station (processor and interface or router) must 
be provided within these hardware devices.
There are different approaches to assign physical addresses to stations.
The straightforward way is to assign globally a unique address to each station at the 
building stage (hard-wired address). The side effect of this approach involves long addresses 
in every packet on the network even if the latter is using fewer stations. The other drawback 
is the limitation in migrating stations between branches or adding stations to the network.
A simpler alternative would be to change the address of each device with DIP switches 
or with an address plug. In this scheme, one must maintain a static administrative structure 
to determine which addresses are currently used so that new stations could be added at unused 
addresses. Because these addresses are set up by the user, it would be hard work to remember 
all the address assignments, especially when the network contains a large number of stations. 
The scheme would require the user to have far more knowledge of the system than seems 
reasonable.
Although complex, the most flexible approach, which we shall follow in this project, is to 
assign addresses dynamically to every active station in the network. The word dynamic here 
means that no user or manual intervention is required, as each station relies only on the existence 
of a pre-defined algorithm that ensures the assignment, uniqueness and consistency of 
addresses. The complexity of this protocol is a function of the network type (e.g. CSMA/CD, 
Token ring, connected mesh etc.) and the environment where processors operate.
A simple dynamic algorithm for distributing local addresses to MadPostman nodes in 
a 2D array is explained by Miller1. Based on a broadcast pattern, each node uses the content 
of the received message from the appropriate dimension as its unique local address and passes 
the increment to the next neighbour. This algorithm is only valid for 2D array networks; 
besides, it assumes that messages are transmitted with a zero error-probability.
In broadcast networks, where all communications between stations attached to the 
network take place through a single shared transmission medium, the situation is different. 
Gopal and Segal [Gop 84] have proposed a generalised protocol for use on any broadcast 
network. However, each active station is forced to assume some form of pre-assigned long 
and unique identifiers. The proposal of Loucks et al. [Lou 86] does not have such a restriction. 
Nevertheless, to operate properly, their dynamic address assignment scheme places an 
additional hardware constraint, i.e. that there must be some way to distinguish one station 
from the rest. It turns out that only Tornet-type networks with piggy-backed acknowledgement 
fulfil this requirement. In addition, because the controller starts the address 
assignment procedure, only its failure will involve manual intervention and replacement.
The algorithm we propose in this thesis does not have the restrictions of these long 
identifiers or distinguished stations. It is based on the following broadcast bus characteristics.
The high-priority acknowledged CSMA/CD protocol [Tok 77] resolves conflicts 
between any nodes accessing the bus. In addition, each node solely accepts messages with 
the broadcast address, and does not acknowledge the sender but rather responds to the 
requested task within the received frame. Moreover, each un-assigned node that seeks an 
address uses a reserved address (e.g. 011110xx in 8-bit format for stations or 11111110 for 
gateways) for all the messages it exchanges with the node responsible for the address assignment 
(i.e. the primary master, that is the root SIMP, or a secondary master of a branch, with 
a zero address). This reserved address is very useful when new nodes (stations only) 
join the network.
In general, even though it is easy to allocate the two port addresses of each gateway 
using a simple DIP switch, the algorithm described below can discover the configuration of 
the whole network and then, based on the same principle, assign physical addresses (ds) to 
stations. It should first be run on the injection branch containing the master SIMPs, thus 
discovering all gateways on that branch. Each gateway thereafter passes its assigned address 
(X1) to all other gateways located on the routing branch to which it belongs. This operation 
therefore identifies all injection branches of the network (X1). It should then be run on a 
routing branch having a secondary master gateway (i.e. a selected gateway with an address of 
zero) and, in a similar way, each gateway passes its acquired address (X2) to all other adjacent 
gateways in the injection branch that supports it. As a result, the whole configuration of the 2D 
structure is specified. Finally, provided that each injection branch has a unique secondary 
master station (i.e. initially a SIMP set to zero), all injection branches concurrently run the 
algorithm that assigns physical addresses (ds) to any active station. The partial configuration 
of any injection branch is held by each secondary master, which can pass it to the root SIMP 
for program configuration.
1 P. Miller, private communication.
In the case of stations, the algorithm needs two communicating entities: a secondary 
master called the address distributor and the un-assigned stations. Three special commands, 
as shown in figure 6.14, are also required.
| SYNC | 0 i | 0 k | type 1 x | j | CRC |
Figure 6.14 The command frame for the address 
assignment operation. The type field selects 
one of the commands used and x is the don't 
care condition.
If the type of this frame is equal to "000000", the command is "request an address", 
denoted by r.F. This command will only be issued by a station which has not yet been 
assigned an address. In this command, the destination address field i is that of the master 
(i.e. i = 00000xx, where xx represents the address of the hosts supported by the station, which 
is irrelevant in this operation), the source address k holds the reserved address to be used by 
the master (i.e. k = 11110xx) and the identifier j is unused.
On the other hand, if the type field is equal to "000001", the command is "selected 
address", represented by s.Fj. This command is broadcast by the requesting station after 
receiving an acknowledgement that its r.F frame has been received at the secondary master. 
This frame contains the broadcast address (i.e. i = 11111xx) and the reserved address (i.e. k = 
11110xx). In addition, the value of j is incremented every time the station sends the frame 
r.F, to monitor the load on the bus.
Finally, if the type is equal to "000010", the command is "assign the address", denoted 
by a.Fj. This command is sent from the secondary master back to the requesting station and 
carries the address assignment. Here, the destination and source address fields will be those 
of the reserved station and the master respectively (i.e. i = 11110xx, k = 00000xx), and the 
identifier j holds the unique physical address to be assigned to the requesting station.
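The three command frames can be written down as simple records. The sketch below is illustrative only: it models the five-bit station part of each address and ignores the two host bits (xx), the SYNC and the CRC; the names `CommandFrame`, `request_address`, `selected_address` and `assign_address` are hypothetical, and setting the unused identifier of the request to zero is an assumption.

```python
from typing import NamedTuple

# Five-bit station parts of the addresses given in the text.
MASTER, RESERVED, BROADCAST = 0b00000, 0b11110, 0b11111
# Six-bit type codes of the three commands.
REQUEST, SELECTED, ASSIGN = 0b000000, 0b000001, 0b000010


class CommandFrame(NamedTuple):
    dest: int     # destination address field i
    source: int   # source address field k
    kind: int     # type field
    ident: int    # identifier j


def request_address():
    """'Request an address': un-assigned station to the master; j is unused."""
    return CommandFrame(MASTER, RESERVED, REQUEST, 0)


def selected_address(requests_sent):
    """'Selected address': broadcast by the requesting station; j counts its requests."""
    return CommandFrame(BROADCAST, RESERVED, SELECTED, requests_sent)


def assign_address(new_address):
    """'Assign the address': master to the reserved address; j is the new address."""
    return CommandFrame(RESERVED, MASTER, ASSIGN, new_address)
```

Only the first and last of these are point-to-point frames; the broadcast frame is the one that goes unacknowledged.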
Using these commands, the algorithm operates as follows. Initially, the whole system 
is reset by a broadcast command issued by the root transputer or upon power-on. Each 
injection branch or bus has its own secondary master with address zero. Each un-assigned 
station waits a random time and tries to access the CSMA/CD channel. The successful one 
issues the point-to-point frame r.F to the address distributor. Of course, the correct reception 
of this frame is notified by the high priority acknowledgement, after which the master disables 
reception of any other frame and, hence, avoids address confusions. Following the successful 
transmission of the r.F command, the requesting station also broadcasts the s.Fj frame, which 
stops the other stations still competing for the bus, thus allowing them to update their waiting 
interval time I = random [0, Max - j]. All unsuccessful stations retry in the next assignment 
phase, when their waiting time expires. Meanwhile, the master chooses a free address and 
passes it to the requesting station, which temporarily holds the reserved address, through the 
command a.Fj. Finally, following a successful transmission of this command, the master 
prepares itself to receive further frames during the next assignment phase.
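The assignment phases described above can be imitated with a very coarse round-based model. This is a sketch under strong assumptions, not the SIMSCRIPT or OCCAM simulation of the thesis: the bus is reduced to "the station with the shortest waiting time wins, equal shortest times collide", frames are error-free, and the names `assign_addresses` and `max_wait` are hypothetical.

```python
import random


def assign_addresses(n_stations, max_wait=64, seed=0):
    """Return {station label: assigned address} after repeated assignment phases."""
    if not 0 < n_stations <= max_wait:
        raise ValueError("this sketch assumes 1 <= n_stations <= max_wait")
    rng = random.Random(seed)
    unassigned = set(range(n_stations))  # labels for bookkeeping only, not addresses
    assigned = {}
    next_address = 1                     # address zero belongs to the secondary master
    served = 0                           # plays the role of j in random[0, Max - j]
    while unassigned:
        upper = max(1, max_wait - served)
        waits = {s: rng.randint(0, upper) for s in unassigned}
        shortest = min(waits.values())
        winners = [s for s, w in waits.items() if w == shortest]
        if len(winners) > 1:
            continue                     # collision: every station draws a new delay
        winner = winners[0]              # its request reaches the master alone
        assigned[winner] = next_address  # the master hands out the next free address
        next_address += 1
        served += 1
        unassigned.remove(winner)
    return assigned
```

Since the master serves one request per phase and increments its counter after each assignment, every station ends with a distinct non-zero address regardless of the order in which stations win the channel.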
To verify that the features of this algorithm can be realised in practice, a simulation 
program in SIMSCRIPT was developed (appendix E4). The algorithm was verified to work 
properly in all situations, even under selective errors which cause the frames to be received 
in error by some stations and correctly by others. Because both r.F and a.Fj are point-to-point 
packets, they are acknowledged and therefore protected from errors through repeated 
retransmissions. However, the frame s.Fj is a broadcast one which is not acknowledged. If 
it is lost, then the only consequence would be that the response from the master will be delayed 
as it is forced to compete for the channel with other stations not receiving the s.Fj frame. An 
OCCAM version of this algorithm was also built from the informal description of the SIMSCRIPT 
program of appendix E4, which simulates the behaviour of the CSMA/CD bus in detail. In the 
OCCAM program, the CSMA/CD bus is simulated by a simple OCCAM multiplexor process (see 
figure 6.17). The correctness of the address assignment algorithm, described above, is verified 
by the processes listed in figures 6.15 and 6.16.
PROC station(VAL INT i)
  ... variable declarations
  SEQ
    j := 0
    un-assigned := TRUE
    delay := random(0, Max - j)
    WHILE un-assigned
      ALT
        timer ? AFTER time.now PLUS (delay + Constant)
          SEQ
            bus.out[i] ! r.F
            bus.in[i] ? FRAME
            IF
              FRAME = s.Fj  -- unlucky station
                delay := random(0, Max - address.of(FRAME))
              FRAME = a.Fj
                -- the acknowledgement is lost but the station is assigned
                -- an address.
                SEQ
                  station.address := address.of(FRAME)
                  un-assigned := FALSE
              TRUE
                SEQ
                  j := j + 1
                  broadcast(s.Fj)
                  station.address := reserved.address
                  bus.in[i] ? FRAME
                  station.address := address.of(FRAME)
                  un-assigned := FALSE
        bus.in[i] ? FRAME
          delay := random(0, Max - address.of(FRAME))
Figure 6.15 Code run by an un-assigned station. Here, the broadcast routine is 
implemented by a series of point-to-point operations through a 
multiplexor process which only passes the commands to un-assigned 
stations. The "Constant" variable can be adjusted to cover the 
processing time of the master so that the load on the channel can be 
reduced.
PROC master()
  SEQ
    j := 1
    WHILE running
      SEQ
        bus.in[0] ? r.F
        ignore.commands()
        bus.out[0] ! a.Fj
        -- the address of the frame a.Fj which is j varies.
        j := j + 1
        accept.commands()
Figure 6.16 The address distributor code.
Figure 6.17 Multiplexor process abstracts the behaviour of the CSMA/CD 
bus and allows the above codes to be executed concurrently.
6.4 User Layer
The network-access protocol built up from software and hardware layers provides a 
virtual circuit service which ensures that messages are delivered in the order of their trans­
mission, free from errors and duplication. Therefore, the application level software in the 
hosts, consisting of a set of logically interconnected processes working to achieve a specific 
task, is concerned with the identification, multiplexing and synchronisation of physically 
distributed processes.
In general, in a concurrent system, the inter-process communication can be synchronous 
or asynchronous. When the communication between a sender and a receiver is indirect and 
requires decoupling buffers, it is called asynchronous communication (e.g. single or multi-slot 
buffers, pipelined buffers etc.). The sender may deposit data in the buffer and continue 
processing provided that there is space available in this buffer. Therefore, the sender and 
the receiver become synchronised only when the buffer is full. When the communication 
between a sender and a receiver is direct (i.e. through the standard OCCAM channel com­
munication mechanism), it is called synchronous communication. Both entities are held in 
synchronisation until the data communication has taken place.
Basically, the communication between processes in our distributed network can be 
accomplished asynchronously or synchronously. However, a naive consideration of buffers 
at the sender and receiver SIMP in an asynchronous communication can critically affect the 
behaviour of the application [Wes 86].
Although this view can vary from application to application, the general form of figure 
6.18 is sufficient. Essentially, in this system, three fundamental operations are needed at the 
user layer. Firstly, given that the number of application processes may exceed the number of 
transputers available in our 2D configuration, so that several processes may have to run on 
each processor, multiplexor and demultiplexor processes are necessary to provide 
fair access to their shared communication link [Pee 89, Jon 89, Sha 90]. Secondly, in 
addition to the physical address of any station/host that is appended to each message, in some 
applications, the identity of the process virtual channel is also required (see figure 6.2 at the 
user layer). This address is part of the application protocol agreement between communi­
cating entities. It is hidden from the SIMP protocols by its inclusion within the data field. 
Thirdly, the secure way of maintaining correct operation between physically distributed 
processes is to use a handshaking protocol that ensures the synchronisation between any 
processes (Figure 6.19). The two front-end processes called the sink and the source run in 
parallel with the sender and receiver application and provide the request mechanism across 
the shared channel. They ensure, in all cases, that the multiplexor and demultiplexor will 
not block, because the sink and the source processes can always buffer data or request. The 
application processes are interfaced to both sink and source processes using unidirectional 
channel interfaces making the synchronisation protocol transparent to the application [Wei 
89]. In general, each producer and consumer application may be written as
PROC producer(VAL INT i)
  SEQ
    ... produce data
    ch.producer[i] ! message
PROC consumer(VAL INT i)
  SEQ
    ch.consumer[i] ? message
    ... consume data
Basically, two types of synchronisation scheme can be distinguished.
a- Forward request
PROC source(VAL INT i)
  SEQ
    ch.producer[i] ? message
    ch.MUX.source[i] ! message
    ch.DMX.source[i] ? request
PROC sink(VAL INT i)
  SEQ
    ch.DMX.sink[i] ? message
    ch.consumer[i] ! message
    ch.MUX.sink[i] ! request
Figure 6.18 A general model of a user layer.
b- Backward request
PROC source(VAL INT i)
  SEQ
    ch.DMX.source[i] ? request
    ch.producer[i] ? message
    ch.MUX.source[i] ! message
PROC sink(VAL INT i)
  SEQ
    ch.MUX.sink[i] ! request
    ch.DMX.sink[i] ? message
    ch.consumer[i] ! message
Despite the extra buffering introduced by the sink and the source processes in addition 
to the multiplexors, the backward and forward synchronisation schemes implement a general 
model with a high degree of concurrency. Depending on the applications, it is usually possible 
to incorporate the sink or the source within the consumer or producer processes respectively 
and, hence, reduce the amount of buffering. Buffering is not a serious problem because many 
applications do not rely on the fact that they have direct synchronisation with their target. 
Moreover, when a distributed pair of processes inserts its own handshaking mode, the sink 
or the source processes are not necessary. The most important point is that the multiplexors 
will never block and hence the system is safe from deadlock. In the forward request, a 
message is always accepted by its sink process because the return of an acknowledgement
not only guarantees that the consumer has used the message but also there is a free space in 
the receiving buffer. In the backward synchronisation, a request is always accepted by the 
source process which guarantees that the consumer is ready to accept messages.
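The forward request scheme can be imitated with two threads and two queues. This is a rough sketch rather than a translation of the OCCAM processes: Python queues are buffered, unlike OCCAM channels, the multiplexor, network and demultiplexor are collapsed into one queue per direction, the source and sink loop over a list of messages instead of running once, and the function name `run_forward_request` is hypothetical.

```python
import queue
import threading


def run_forward_request(messages):
    """Send messages from a source to a sink, one at a time, and return what was consumed."""
    to_sink = queue.Queue()     # stands in for ch.MUX.source -> network -> ch.DMX.sink
    to_source = queue.Queue()   # stands in for ch.MUX.sink -> network -> ch.DMX.source
    consumed = []
    in_flight_seen = []         # messages still queued each time the sink takes one

    def source():
        for message in messages:
            to_sink.put(message)            # ch.MUX.source[i] ! message
            to_source.get()                 # ch.DMX.source[i] ? request

    def sink():
        for _ in messages:
            message = to_sink.get()         # ch.DMX.sink[i] ? message
            in_flight_seen.append(to_sink.qsize())
            consumed.append(message)        # ch.consumer[i] ! message
            to_source.put("request")        # ch.MUX.sink[i] ! request

    threads = [threading.Thread(target=source), threading.Thread(target=sink)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return consumed, in_flight_seen
```

Because the source waits for the returned request before sending again, at most one message is ever outstanding, which is why the shared link is never held by a blocked transfer. The backward scheme is the mirror image, with the request travelling first.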
6.5 OCCAM Simulation of the SIMP Protocols
We end this chapter by using a specific language to verify the operation of the system 
protocols. Specification and description languages such as SDL, LOTOS, ESTELLE [CCI 
88] and CSP [Hoa 78] can be utilised for this purpose. In particular, the OCCAM language 
[May 88] has been chosen as a tool to accomplish and verify the correctness of the
communication protocol for the following reasons:
- it is the main parallel programming language of our system,
- it can be transformed to CSP,
- it can be executed on a single transputer or a network of transputers.
Basically, the whole system can be seen as a collection of concurrent processes, see 
figure 6.19 and appendix E5 for the simulation code.
Figure 6.19 Process interactions of the system
In the hardware layer, we assume the existence of three processes: the transmitter, receiver
and acknowledgement hardware processes (THP, RHP and AHP). When a packet is
successfully received, the RHP triggers the AHP which constructs an acknowledgement in the
form described in section 6.2.5 and transmits it to the process channel. If the receiver buffer 
is full, the AHP generates the jamming signal. In addition to the low-level functions such 
as CRC generation, SYNC production, and collision and buffer overflow detection, the THP 
deals with all operations associated with the FIFO transmitter. It stores word-by-word data 
received from the software layer, starts the transmission and returns the status of the FIFO 
to the higher level protocols. The RHP only accepts packets with a proper matching address, 
monitors the buffer length, checks for CRC errors, reports the status of each received packet, 
interrupts the software layer and delivers the data word-by-word upon request.
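As an illustration of the CRC generation and checking performed by the THP and RHP, the following Python sketch computes a 16-bit check-sum bit by bit. The generator polynomial is not stated in this section, so the CCITT polynomial 0x1021 with an initial value of 0xFFFF is assumed here purely for the example; the function names are hypothetical.

```python
def crc16_ccitt(data: bytes, crc: int = 0xFFFF) -> int:
    """Bitwise CRC-16 using the (assumed) CCITT polynomial x^16 + x^12 + x^5 + 1."""
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def append_crc(payload: bytes) -> bytes:
    """Transmitter side: append the check-sum, most significant byte first."""
    return payload + crc16_ccitt(payload).to_bytes(2, "big")


def frame_is_valid(frame: bytes) -> bool:
    """Receiver side: a frame followed by its own CRC leaves a zero remainder."""
    return crc16_ccitt(frame) == 0
```

A receiver built this way needs no knowledge of the packet layout to detect an error: it simply runs every received bit through the same register and tests for zero at the end of the frame.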
In the software layer, three concurrent processes can also be identified: the interface 
transmission process, the interface reception process and the interface manager process (ITP, 
IRP and IMP). The ITP is responsible for packetising messages and constructing the source 
addresses and the descriptor fields. When a message arrives from one of the four host links, 
the ITP process searches the appropriate destination station sequence number from an array 
VS, formats the message into a packet and then waits for the transmitting buffer to become 
free to store it. The communication processes from link to memory are implemented using
block move instructions. In order to use the block move between a communication link and 
a block memory, the latter has to be spread over consecutive locations. Our FIFO is a first 
in first out buffer occupying only one location. It is necessary therefore to map the FIFO 
into a number of consecutive memory locations. The IRP simply discards corrupted and 
duplicated packets by examining the array VR of the expected sequence number and passes 
data to one of its four hosts. Addresses and the descriptor field are accessed by the IRP 
process concurrently with the reception of the remainder of the packet. Therefore, an extra 
check to the flag of the receiving buffer is needed to signal the end of corrupted packets that 
have been captured by the station. When the receiving FIFO is empty, interrupts draw the 
attention of the IRP to receive packets. Then, the IRP disables any further interrupts and 
continues reading until there are no messages available. Finally, besides dealing with the 
command packets that are used for network management, the interface management process 
also manages excessive collisions, buffer overflows and failure of the channels.
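The role of the arrays VS and VR can be sketched as follows. This is a minimal Python illustration, not the simulation code of appendix E5: the class name, the packet fields and the sequence modulus of 2 are assumptions made for the example.

```python
class Station:
    """Per-station sequence bookkeeping, loosely modelled on the VS and VR arrays."""

    def __init__(self, n_stations, modulus=2):
        self.modulus = modulus          # assumed sequence-number range
        self.vs = [0] * n_stations      # next sequence number to send, per destination
        self.vr = [0] * n_stations      # next sequence number expected, per source

    def make_packet(self, dest, src, payload):
        """ITP side: stamp the packet with the current send sequence number."""
        return {"dest": dest, "src": src, "seq": self.vs[dest], "payload": payload}

    def acknowledged(self, dest):
        """Advance the send sequence number once the packet has been acknowledged."""
        self.vs[dest] = (self.vs[dest] + 1) % self.modulus

    def receive(self, packet, crc_ok=True):
        """IRP side: return the payload, or None for a corrupted or duplicated packet."""
        if not crc_ok:
            return None
        src = packet["src"]
        if packet["seq"] != self.vr[src]:
            return None                 # unexpected sequence number: duplicate
        self.vr[src] = (self.vr[src] + 1) % self.modulus
        return packet["payload"]
```

If an acknowledgement is lost and the sender retransmits, the second copy carries a sequence number the receiver no longer expects and is silently dropped.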
The simulation test is based on this algorithm. It has been carried out using sink and 
source processes at the user layer which generate and consume infinite streams of data at full 
speed.
Although the channel process of the 2D structure can be developed to include collisions 
and error generation, for the sake of simplicity, we have simulated it as a multiplexor which
transparently passes data and acknowledgements through two distinct paths. In reality, the 
system uses only a single line to carry data and acknowledgements. Moreover, errors that 
corrupt the reception of packets may be due to frame truncation when the buffer is full.
In this simulation model, using a single station which sends data to itself with a unique 
buffer, we have been able to pass a stream of data from transmitter to receiver at their full 
speed. This assumption does not contradict the system behaviour as each node in the network 
can be seen as two separate entities, the receiving and transmitting sides. Even though the
data has been generated at its full speed, our buffer regulates its flow efficiently.
One last point to notice is that the sender and the receiver at the network level become 
synchronised when the receiving FIFO is full. However, if the RHP is very slow to pass 
messages to the sink process, during a buffer full situation the FIFO will fill up with broken 
frames and, therefore, a livelock situation may occur. Should this arise, the interface
management process IMP resets the FIFO after a certain number of broken packets have been
received. Although the livelock occurrence can be prevented, this affects the system
performance as most of the transmitted packets will be discarded. Our proposed FIFO model
with the block partitions efficiently avoids this problem.
6.6 Summary
It is apparent that the visualisation of the system as separate protocol layers endows 
great design and development freedom. The functions of the whole architecture, although 
complex, have been divided into three major levels which interact with each other providing 
the necessary services to accomplish a particular task within the network.
Primarily, based on the standard building blocks of the CSMA/CD communication 
protocol consisting of the CRC generator and detector, Manchester encoder and decoder, and 
collision detector and rescheduling units, we have introduced various proposals which 
complete the design of our system. The hardware design has been simplified by the adoption 
of four fundamental factors.
The broadcasted acknowledgement basically does not rely on any addressing form apart 
from a broadcasting address. As such the receiving node does not have to know much about 
the information provided with the packet (e.g. source addresses etc.). This reduces the latency 
and the hardware complexity of each node.
The hierarchical way of addressing stations also permits gateways to effortlessly remove 
used address fields. Hence, it offers a great flexibility in hardware design, since stations can 
be accommodated by a network of any dimension, and placed in any injection branch without 
altering their hardware structure. Besides, the removal strategy which occurs at the sending 
side of each gateway provides an extremely low latency per node of 0.8 µsec for a 16-bit
receiving shift register.
The adoption of a 16-bit CRC check-sum not only provides a high probability of 
detecting errors, but also simplifies the structure of the shift register associated with each 
FIFO buffer.
The most important parts of the hardware layer are the FIFO buffers. An enhanced 
version of the AMD 67C450X has been proposed which provides most of the services 
indispensable for the proper operation of the network. Firstly, it has the ability to delimit 
the packet boundaries. Secondly, it can manage, without any other layer interventions, 
retransmissions and error recoveries, and it preserves the feature of the virtual cut-through 
routing mechanism. Thirdly, the most distinctive feature is that by partitioning this FIFO 
into blocks of 32 bytes we can implement, at the network level, an efficient and secure flow 
control strategy.
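The idea of block-level flow control can be sketched in Python as below. This is a simplified model under stated assumptions, not the proposed hardware: each block holds at most one packet, a single status flag per block marks it as occupied, and the class and method names are hypothetical. A packet is admitted only when the next block is free, so the buffer never holds a truncated frame.

```python
BLOCK_SIZE = 32  # bytes per block, the maximum packet length given in the text


class BlockFifo:
    """Ring of fixed-size blocks, each guarded by one status flag."""

    def __init__(self, n_blocks):
        self.data = [b""] * n_blocks
        self.full = [False] * n_blocks   # status flag of each block
        self.head = 0                    # next block to read
        self.tail = 0                    # next block to write

    def can_accept(self):
        return not self.full[self.tail]

    def write_packet(self, packet, crc_ok=True):
        """Return True if the packet was stored, False if it was refused or erased."""
        if len(packet) > BLOCK_SIZE:
            raise ValueError("packet exceeds one block")
        if not self.can_accept():
            return False                 # no free block: the sender must retry
        if not crc_ok:
            return False                 # flag stays clear, so the block is reused
        self.data[self.tail] = packet
        self.full[self.tail] = True
        self.tail = (self.tail + 1) % len(self.full)
        return True

    def read_packet(self):
        """Return the oldest stored packet, or None when the FIFO is empty."""
        if not self.full[self.head]:
            return None
        packet = self.data[self.head]
        self.full[self.head] = False     # releasing the block re-enables the sender
        self.head = (self.head + 1) % len(self.full)
        return packet
```

Refusing a packet as a whole, rather than accepting part of it, is what prevents the buffer from filling with broken frames when the reader is slow.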
Additional services, which seem extremely expensive to incorporate at the hardware 
layer, have been achieved at the software level. To each message is appended a descriptor 
field which not only differentiates between data and command packets, but also holds the 
sequence variables of all the transputers of the system to overcome duplications. The fact 
that special commands can be handled by the software layer gives a variety of services for 
the network management. In particular, the most important task is to automatically provide 
physical addresses to every station without user intervention. We have developed an
algorithm that assigns a unique address dynamically to each station of the network, and verified
its correctness using a simulation model.
Although the user layer is application dependent, a general view of what a process in 
the CSP philosophy looks like has been presented. Based on a virtual circuit network that 
delivers messages in the order of their transmission, free from error and duplications, an 
application layer must support its concurrent processes with three fundamental behaviours. 
Each message sent by any process must contain the full identification of the destination host 
process. Multiplexors must be provided to share a single transputer link fairly between a set 
of processes running on each host. Finally, despite the amount of buffering associated with 
the sink and source processes, they must establish non-blocking operation with secure flow 
control. These multiplexor, source and sink processes may be transparently and automatically 
added to any OCCAM program to allow it to be run on the 2D network.
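A fair multiplexor of this kind can be sketched as a round-robin merge. The Python below is an illustration only; the function names and the use of an integer as the process identifier are assumptions. Each message placed on the shared link is tagged with the identifier of its destination process so that the demultiplexor at the far end can deliver it.

```python
from collections import deque


def multiplex(outgoing):
    """Merge per-process message queues onto one link in round-robin order.

    `outgoing` maps a destination process identifier to its queued messages.
    """
    queues = {pid: deque(msgs) for pid, msgs in outgoing.items()}
    link = []
    while any(queues.values()):
        for pid in sorted(queues):       # visit every process once per round
            if queues[pid]:
                link.append((pid, queues[pid].popleft()))
    return link


def demultiplex(link):
    """Deliver each tagged message to the queue of its destination process."""
    delivered = {}
    for pid, msg in link:
        delivered.setdefault(pid, []).append(msg)
    return delivered
```

Taking one message per process per round means that a process with a long backlog cannot starve the others, and the order of messages within each process is preserved.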
chapter 7
CONCLUSION
7.1 Summary of the Thesis and Achievements
If high concurrency is to be successfully harnessed in computation systems based on 
a network of distributed processors, deadlock freedom, high bandwidth, low latency and 
expandable interconnections must be provided. Whatever the means used to fully exploit 
these features, communications must be minimised with respect to processing if scalable 
performance is to be achieved.
In most distributed transputer networks, this optimum scalability cannot be realised, 
because, as elaborated in the introduction of this thesis, some systems cannot physically 
accommodate the addition of more processors or, in others, the efficiency of computation 
decreases as more transputers are inserted.
To go beyond the limitations of the current hardware and performance scalability, the 
purpose of our project was to interconnect a large number of transputers via their serial 
communication links, and this has been achieved using a 2-Dimensional grid of 
CSMA/CD-style broadcast buses joined by gateways. The resulting topology is able to 
configure any application which could be built from networks of hard-wired transputers, plus 
others where the four links on each processor are the limitation. It is especially suitable for
the ones demanding high connectivities between processors and those where it is necessary 
to pass messages with the minimum communication delay (i.e. imposed only by the transputer 
link).
Of course, as a prerequisite to achieve this aim, the interconnection network, in which 
all processors of the system have the illusion of being joined directly through their
communication links, has to be as transparent as possible. This transparency-based strategy gives
any two processors the full autonomous freedom to communicate along low-latency paths.
The high performance of our Multiple Bus configuration which achieves this
transparency rests upon four fundamental characteristics: its high bandwidth, its extremely low
latency, its freedom to migrate processors within the structure and its physical separation of 
routing from computing resources.
Unlike other classical bus topologies (i.e. single, multiple and spanning bus), our 
multiple bus configuration obtains its high bandwidth by increasing the number of routers 
(gateways). For instance, the VPP architecture based on a spanning bus principle can only 
connect up to 64 processors [Ino 88]. Undoubtedly, as demonstrated in chapter 2, our proposal 
can accommodate far more transputers because of its high processor scalability. Further 
improvements on the bandwidth could also be achieved by implementing higher dimensional 
structures or augmenting the capacity of each bus of the network.
In our network, the latency is minimised by the provision of three complementary 
factors. Because the configuration is bus-based, it has a small maximum number of hops 
between any two processors, and hence packets cross only a few buses before reaching their 
destinations. In addition, the virtual cut-through routing mechanism adopted requires only 
the header of the packets to be delayed for routing decisions. Typically, the resulting latency 
was found to be 1.8 µsec, which is determined by the time taken to move the header of a
packet from source to destination in an empty network. Furthermore, the semi-adaptive 
routing algorithm we have developed selects branches containing the lowest traffic loads 
and, therefore, yields a higher throughput by distributing the load equally across the network. 
It also allows a high degree of reliability when faulty gateways or routing branches can be 
avoided.
The possibility of migrating stations or processors anywhere in the network offers us 
a simple and effective technique for preventing deadlock in the 2D multiple bus topology 
without imposing the restriction that packets must follow a fixed path between sources and 
destinations. We have achieved this by means of the semi-adaptive routing algorithm that 
always selects the shortest path, without introducing any cycles, and by the allocation of 
stations in pre-defined buses called injection branches.
The best approach for model and process independent communication resources is to 
separate the issue of processor-to-network interfacing from that of routing. Clearly, various 
advantages emerge from the physical separation exhibited in our interconnection network. 
Firstly, it allows the network to be constructed independently by providing its own higher 
bandwidth and lower latency. Secondly, it offers a high degree of freedom in the system 
design and integration. Stations have the full freedom to be placed within specific branches 
of the network hence breaking the possibilities of any deadlock cycles. Besides, each
processor can be added, removed or reset individually without altering either the progress or the
regularities of the configuration. Thirdly, it supports any type of processor. Indeed, custom
processors and off-the-shelf microprocessors can be linked efficiently to our network
provided that their interfaces do not swamp its overall performance.
The above mentioned features of the generalised multiple bus network represent
significant progress in the communication performance of transputer systems. In this project,
we have given emphasis to the 2 dimensional structure, mainly because the hardware cost 
of each router (gateway) is relatively cheap. In addition, the 2D configuration provides an 
adequate bandwidth for a large number of transputers, it permits a simplified implementation 
of the semi-adaptive routing algorithm as the addressing overheads are reduced and it offers 
a simple method to prevent deadlock. Furthermore, the small number of hops between any 
two processors in this class (i.e. a maximum of three hops between any pair of stations) keeps 
the message latency low (e.g. 1.8 µsec).
Traditionally, bus-based architectures must adopt an arbitration mechanism. The one 
we have chosen for our 2D structure is founded on the single line CSMA/CD communication 
protocol for the following reasons. Firstly, besides dynamically sharing the available 
bandwidth, it provides an adequate one for an acceptable number of transputers (e.g. 200
T414s with an effective bidirectional link rate of 0.8 MBytes/sec). Secondly, it permits a variety of
implementations ranging from a single copper conductor to optical fibre and free space 
infrared transmission. Thirdly, because each conductor carries data, acknowledgements and 
control based on slotted time, the design and integration of the system elements is simplified 
by reducing the number of wires (conductors). Fourthly, it obviates the need to distribute a 
global clock as a self-timed system can be implemented through the Manchester encoding 
technique. Finally, it is a more robust protocol and it preserves the network characteristic 
and versatility (moving stations and addressing them).
We have demonstrated, by means of simulation models and comparative analysis, that 
the enhanced serial prioritised and acknowledged CSMA/CD protocol is an efficient and 
relevant arbitration mechanism needed to resolve the conflict in each bus of the 2D structure. 
In order to enhance the performance and optimise the features of this 2D structure based on 
the single channel CSMA/CD protocol, we have constructed further simulation models. 
Through these simulations and subsequent analyses we have verified the superior
performance of the virtual cut-through routing mechanism applied to a limited number of hops. We
have also demonstrated that the equal distribution of the traffic loads in all branches of the 
configuration rests only upon the choice of the physical network parameters - i.e. the number 
of injection branches must be equal to twice the number of routing branches. This resulting 
balanced 2D network achieves the highest throughput and almost the smallest mean packet 
delay compared to other possible arrangements of the 2D structure itself - i.e. choosing 
different numbers of injection and routing branches.
The development of the high performance distributed transputer system is not solely 
confined to the transparency of this balanced 2D structure, but also to how efficiently 
transputers can exploit and benefit from the high communication characteristics that it offers.
We have proposed a Serial Interface Message Processor which directly couples 
transputers via their serial links to the injection branches of the 2D interconnection network. 
So as to understand the interactions and the communication latency between transputers, 
interfaces and the interconnection network, we developed the gap equations from which 
various guide-lines for designing an efficient interface were inferred. Because the prioritised 
and acknowledged CSMA/CD allows acknowledgements to be returned very fast to the 
sending nodes, interfaces consisting of double buffering and adopting the simple 
send-and-wait transmission strategy are demonstrably most efficient. This has been verified 
for the processor farm application. Green [Gre 88] connected 15 processors in a tree and 
used a software harness to pass messages between them, consuming 9.4% of their processing 
effort. In our 1D network, simulations show that for 15 processors, only 0.3% degradation
would be suffered. This degradation does not increase significantly even with far more
processors.
Furthermore, since the network transmits packets faster than it can receive them from 
a host, the SIMP interface stores messages completely before sending them through the 
channel. Such operation - store-and-forward transmission - induces unacceptable latency on 
large messages. We have shown that it is always possible to overcome this drawback by 
splitting longer messages into smaller sub-blocks and adopting pipelined transmission in 
which the communication of these blocks can be overlapped with each other. As a
consequence, comparing this technique to other pipelined transmissions applied to a large linear
transputer array, ours exhibits a much lower communication latency. We have also shown 
that this latency is independent of the distance, in terms of number of hops, separating any 
two communicating processors.
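The benefit of splitting a message into sub-blocks can be seen from a simple two-stage timing model. The Python sketch below is an idealisation and not the gap equations of this thesis: it ignores headers, acknowledgements and contention, and the function names and rate parameters are hypothetical. The first stage is the transfer of a block from the host over its link, and the second is the transmission of that block over the network.

```python
def store_and_forward_latency(msg_bytes, host_rate, net_rate):
    """Whole message is received from the host before any of it is transmitted."""
    return msg_bytes / host_rate + msg_bytes / net_rate


def pipelined_latency(msg_bytes, block_bytes, host_rate, net_rate):
    """Each sub-block is transmitted while the next one is still arriving from the host."""
    host_done = 0.0   # time at which the current block has fully arrived from the host
    net_done = 0.0    # time at which the network has finished sending the previous block
    remaining = msg_bytes
    while remaining > 0:
        size = min(block_bytes, remaining)
        host_done += size / host_rate
        net_done = max(net_done, host_done) + size / net_rate
        remaining -= size
    return net_done
```

When the network is faster than the host link, the pipelined latency in this model reduces to the host transfer time of the whole message plus the network time of the last sub-block only, so the store-and-forward penalty shrinks with the block size.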
At the design level of the whole system, we have accomplished further important issues 
that complement the features of the proposed interconnection network. Firstly, as stations 
have to be identified by physical addresses within the network, we suggested an efficient 
algorithm, based on the collision principle, which assigns a unique address to each node 
sharing the same channel, and therefore allows good maintainability of the system’s stations. 
Secondly, we proposed a complete dual-port FIFO model that allows the retransmission
strategy to be effortlessly implemented, easily erasing corrupted packets and preserving the 
virtual cut-through notion by means of its asynchronous behaviour. The most important issue 
provided by this buffering model is the flow control at the network level which is obtained 
by adding a status flag to each 32 byte long block, corresponding to the maximum packet
length handled by the network. Likewise, each transmitted packet is protected against errors 
introduced by the channels, collisions or loss due to receivers’ buffer overflow by a CRC 
check-sum. Thirdly, although it is application dependent, the user level, in general, must 
perform three fundamental tasks: to provide addressing for each message, to multiplex the 
virtual channels of processes and to regulate the flow control between them.
We have simulated the interactions of the system layers by considering an arbitrary 
application which consists of two processes. The first one injects and the second one absorbs 
data at their full speed. The generated messages flow through all concerned layers and are 
subjected to their protocols until they reach their destination. It was shown that the proposed 
protocols are strong enough to pass any form of data through the network even under 
pathological situations (i.e. errors or buffer overflow).
7.2 Future Orientations
There are various views of what a process is and what a concurrent system comprises.
For instance, much like CSP [Hoa 78] and OCCAM [May 88], ADA [Geh 84b] and ACTOR
[Agh 85] view the system as a collection of processes with a set of synchronous or
asynchronous communications involving two participants only. On the other hand, BSP [Geh 84a]
extends the CSP notion by providing multicast primitives to enable a process to send a data 
value to a set of processes. Furthermore, programming languages such as CIRCAL [Mil 85] 
implement a full broadcasting model. If our 2D structure or our generalised multiple 
interconnected buses configuration, initially intended to link transputers, is to be successfully 
exploited by many concurrent programming languages and distributed operating systems, in 
addition to the issue of point-to-point communication, those of broadcasting, multicasting 
and distributed synchronisation will have to be directly supported in hardware.
The point-to-point routing operation of the scheme described in this thesis has been 
realised by using high priority acknowledgement CSMA/CD, a hop-by-hop service and a 
descriptor field associated with every packet to identify its type. Whilst virtually based on 
this principle, any required multicasting or broadcasting operation of a packet can be achieved 
through a series of single point-to-point routing operations that subsequently send the packet 
to a group or all stations. Consequently, the latency incurred in broadcasting or multicasting
this single packet will undoubtedly be unacceptable. In order to implement all these
communications within our 2D topology, the framework in which the serial bus carries data and
acknowledgement will have to be changed.
There is a considerable scope for extending or changing the functionality of the single 
line CSMA/CD communication protocol. This can be achieved either by simply increasing 
the number of wires to hold data and controls separately or by restructuring the basic 
framework of the collision protocol by other kinds of arbitration mechanisms such as seen 
in the future-bus [Tau 84]. Such an extension could have various useful implications on the 
system design and performance. Firstly, the acknowledgements and error signals issued 
during broadcast or multicast communications will be efficiently and separately handled. 
Secondly, as the bandwidth is linearly proportional to the number of lines each bus holds, 
increasing the capacity of each branch yields a higher bandwidth interconnection structure.
Thirdly, because each bus can carry the whole data word for every clock cycle, the latency 
of the network will be reduced to an order of 0.1 µsec from source to destination. Such a
figure allows the network to compete with the most effective architectures (e.g. Mad-Postman 
and iWARP) at a moderate traffic load. Finally, one of the most important consequences 
will be the banishment of the interface software layer services, which seem mostly responsible 
for swamping the performance of the network. Such services were provided to secure the 
communication in a single line CSMA/CD network. Most error flags and acknowledgement 
pulses will be implemented through the control lines hence bypassing the software services. 
Undoubtedly, error management will still be handled at the hardware level, particularly using 
our proposed FIFO buffer model. As a result, the interfaces of transputer to this
interconnection network will be selected from the cheapest and simplest ones (e.g. IMS C011). The
store-and-forward transmission imposed at each interface could be substituted by the efficient 
virtual cut-through scheme as asynchronous signals and handshaking lines may be maintained 
between all communicating nodes.
Of course, the price of these performance improvements would be paid by the extra 
hardware complexity involving more wires which has been avoided during the period of this 
research.
It is also highly likely that a 3-Dimensional structure that provides a higher bandwidth 
than the 2D network could be considered as the interconnection network for a highly
concurrent machine. This would obviously require a further consideration of the semi-adaptive
routing algorithm at the implementation level, of the buffering utilisation at each gateway, 
and of deadlock avoidance.
REFERENCES
[Abr 73] Abramson N. & Kuo F., "The ALOHA System," In Computer Communication
Networks, Prentice Hall, Englewood Cliffs, New Jersey, 1973.
[Agh 85] Agha G.A., "Actors: A Model of Concurrent Computation in Distributed
Systems," MIT, Artificial Intelligence Laboratory, Technical Report 844, June 
1985.
[Agr 86] Agrawal D.P. & Janakiram V.K., "Evaluating the Performance of Multicomputer
Configurations," Computer, Vol 19, No. 5, May 1986, pp. 23-37.
[Amd 88] AMD, "Special Memory Products," Data Book, 1988.
[Ame 86] Ametek Computer Research Division, "Ametek System 14 User’s Guide,"
Arcadia, California, USA, 1986.
[And 82] Andrew D.W. & Schultz G.D., "A Token-ring Architecture for Local-area
Networks and Update," Proceeding of CompCon, Fall 1982, pp. 615-624.
[Apo 86] Apostolopoulos K.T. & Protonotarios E.N., "Queuing Analysis of Buffered
CSMA/CD Protocols," IEEE Trans. on Comm., C-34(9), 1986, pp. 898-905.
[Are 90] Arruabareena A. et al., "An Optimal Topology for Multicomputer Systems 
Based on a Mesh of Transputers," Occam User Group Newsletter, No. 12,
January 1990.
[Bai 81] Bain W.L. & Ahuja S.R., "Performance Analysis of High-speed Digital Buses
for Multiprocessing Systems," Proc. 8th Annual Symposium on Computer
Architecture, Minneapolis, MN, USA, 12-14 May 1981, pp. 107-133.
[Bat 68] Batcher K.E., "Sorting Networks and their Applications" Proc. AFIPS FJCC, 
Vol. 32, 1968, pp. 307-314.
[Bel 78] Bell C. et al., "Computer Engineering - A DEC View of Hardware Systems
Design," Max. Digital Press, Bedford, 1978, pp. 280-286.
[Ben 65] Benes V.E., "Mathematical Theory of Connecting Networks and Telephone
Traffic" Academic Press, New York, 1965.
[Ben 84] Benice R.J. & Frey A.H., "Comparisons of Error Control Techniques," IEEE
Trans. Commun. Technology, Dec. 1984, pp. 146-154.
[Ben 87] Benhamou, E. et al. "Practical Considerations in Building Large
Internetworks," Proc. 12th Conf. on Local Computer Networks, Minneapolis, MN,
USA, 5-7 Oct. 1987, pp. 23-39.
[Bes 71] Best D. & Watson W., "Distributed Priority of Access to Computer Unit,"
U.S. Patent 3,573,856, April 1971.
[Bhu 84] Bhuyan N.L., Yang Q. & Agrawal P.D., "Generalised Hypercube and 
Hyperbus Structures for Computer Networks," IEEE Trans. on Computers,
Vol. 33, No. 4, Apr. 1984, pp. 323-333.
[Bor 88] Borkan S. et al. "iWARP: An Integrated Solution to High Speed Parallel 
Computing," Proc. Supercomputing Conference, Orlando, Florida, Nov. 1988, 
pp. 330-339.
[Bra 90] Bradshaw S.J., "A High Performance Scalable Computer Architecture," 
Transtech Note 3, Nov. 1990.
[Bur 56] Burke P.J., "The Output of a Queuing System," Operational Research, Dec.
1956, Vol. 4, pp. 699-704.
[Bur 72] Burton H.O. & Sullivan D.D., "Errors and Control," Proceedings of the IEEE,
60(11), Nov. 1972, pp. 1293-1301.
[Bux 81] Bux W., "Local Area Subnetworks: Performance Comparison," IEEE Trans.
on Commun., C-29(10), Oct. 1981, pp. 1465-1473.
[Bux 84] Bux W. & Grillo D., "End-to-End Performance in Local Area Networks of
Interconnected Rings," Proceedings of the IEEE INFOCOM 84, San
Francisco, CA, USA, 9-12 April 84, pp. 60-68.
[Car 89] Carre F. & Vidal-Naquet G., "Topologies for Large Transputer Networks: 
Theoretical Aspects and Experimental Approach," Proceedings of the 10th 
Occam User Group, Technical Meeting, Enschede, Netherlands, 3-5 April 
1989, pp. 198-212.
[Cci 88] CCITT Manual, Guide-lines for the Application of Estelle, LOTOS and SDL,
COM-X-R29-30, Published by ITU, General Secretariat-sale, Place des
Nations, CH-1211 Geneva 20, Switzerland, 1988.
[Cha 79] Chang S. "A Model for Distributed Computer System Design," IEEE Trans. 
System, Man and Cybers, Vol. Smc-5, No. 6, 1979, pp. 344-359.
[Chi 79] Chlamac I., Franta W.R. & Levin D. "BRAM: The Broadcast Recognizing 
Access Method," IEEE Trans. on Commun., C-27, Aug. 1979, pp. 1183-1190.
[Cok 91] Cok R.S., "Parallel Programs for Transputers," Prentice Hall, 1991.
[Cos 87] Cosnuau A. & Poirel O., "Some Numerical Experiments on Transputer
Networks," Proceedings of the 7th Technical Meeting of Occam User Group,
Grenoble, Sep. 1987.
[Dal 87] Dally W.J. & Seitz C.L., "Deadlock-Free Message Routing in Multiprocessor
Interconnection Networks," IEEE Trans. on Computers, C-26(5), May 1987,
pp. 547-553.
[Dal 89] Dally W.J. et al., "The J-Machine: A Fine-Grain Concurrent Computer," 
Information Processing 89, Elsevier, 1989, pp. 1147-1153.
[DeG 87] De Groot A.J., Johansson E.M., Fitch J.P., Grant C.W. & Parker S.R., "Sprint -
The Systolic Processor with Reconfigurable Interconnection Network of
Transputers," IEEE Trans. Nuclear Science, Ns-34(4), 1987, pp. 873-877.
[Dig 76] Digital Equipment Corporation Special Systems, PCL11-A Option
Description (Document#YC-C000C), Nov. 1976.
[Dix 83] Dixon R.C. et al., "A Token-ring Network for Local Data Communication," 
IBM Systems Journal, 22(1/2), 1983, pp. 47-62.
[Enn 83] Ennis G. & Filice P., "Overview of a Broad-band Local Area Network
Architecture," IEEE Journal on Selected Area in Communications, Sac-I, 
1983, pp. 832-841.
[Far 69] Farmer W.D. & Newhall E.E., "An Experiment Distributed Switching System 
to Handle Bursty Computer Traffic," Proc. ACM Symposium on Problems in 
the Optimisation of Data Commun., Oct. 1969, pp. 1-33.
[Fay 87] Fay D.Q.M. & Das P.K., "Hardware Reconfiguration of Transputer Networks
for Distributed Object-Oriented Programming," Microprocessing and 
Microprogramming, 21(1987), pp. 623-628.
[Fon 61] Fontain A.B., "Error Statistics and Coding for Binary Transmission over
Telephone Circuits," Proceedings of the IRE, June 1961, Vol. 49, No. 6, pp. 
1059-1065.
[Fra 75] Fraser A.G., "Loops for Data Communication," Computing Science Tech. 
Rep. No. 24, Bell Laboratories, Murray Hill, USA, Dec. 1974.
[Fra 81] Fratta L., Borgonovo F. & Tobagi F.A., "The Expressnet: A Local Area 
Communication Network Integration Voice and Data," In Proc. Int. Conf. 
Performance Data Communication Systems, their Applications, Paris 
(France), Sept. 1981, pp. 14-16.
[Fra 88] Fratta L. & Wozniak J., "PR-Express: Collision-Free Access Protocol for
Packet Radio Network," Computer Networks and ISDN Systems, 16(3), 
1988/89, pp. 229-242.
[Ful 78] FullerS. etal., "Multi-Microprocessors: An Overview and Working Example," 
Proc. IEEE, 66(2), 1978, pp. 216-228.
[Gab 83] Gable M.G. & Sherman R.H., "Carrier Sense Multiple Access with Feedback," 
Local Networks and Distributed Office systems, Vol I: Network System 
Development, 1983, pp. 199-214.
[Geh 84a] Gehani N., "Broadcasting Sequential Processes (BSP)," IEEE Trans. On Soft. 
Eng., SE-10(4), 1984, pp. 343-351.
[Geh 84b] Gehani N., "Ada Concurrent Programming," Prentice-Hall, Englewood Cliffs 
N.J., 1984.
[Gel 81] Gelernter D., "A DAG-Based Algorithm for Prevention of Store-and-Forward 
Deadlock," IEEE Trans. on Computers, C-30(10), Oct. 1981, pp. 709-715.
[Gfe 78] Gfeller F.R. et al., "Infrared Communication for In-House Applications," 
Proceedings of COMPCOM 78F, Computer Communication Networks, 
Washington DC, USA, 5-8 Sept. 1978, pp. 132-138.
[Gle 87] Glendinning I. & Hey A., "Transputer Arrays as Fortran Farms for Particle 
Physics," Computer Physics Communications, 45(1987), pp. 367-371.
[Gol 82] Gold Y.I. & Franta R.W., "An Efficient Scheduling Function for Distributed 
Multiplexing of a Communication Bus Shared by a Large Number of Users," 
In Proc. Int. Conf. Commun., Philadelphia, 13-17 June 1982.
[Gol 83] Gold Y.I. & Franta R.W., "An Efficient Collision-Free Protocol for Prioritized 
Access Control of Cable or Radio Channels," Computer Networks, Vol. 7, No. 
2, 1983, pp. 83-98.
[Gop 84] Gopal S.I. & Segall A., "Dynamic Address Assignment Protocols," 
Proceedings of the IEEE INFOCOM 84, San Francisco, CA, USA, 9-12 April 84,
pp. 120-128.
[Got 83] Gottlieb A. et al., "The NYU Ultracomputer: Designing a MIMD Shared 
Memory Parallel Computer," IEEE Trans. on Computers, C-32(2), 1983, pp. 
175-189.
[Gre 88] Green S.A. & Paddon D.J., "An Extension of the Processor Farm Using a Tree 
Architecture," Proceedings of the 9th Occam User Group Technical Meeting, 
Southampton, U.K., 19-21 Sept. 1988, pp. 53-69.
[Hai 87] Haines E., "A Proposal for Standard Graphics Environments," IEEE 
Computer Graphics and Applications, 7(11), Nov. 1987, pp. 3-5.
[Hal 85] Halsall F., "Introduction to Data Communication and Computer Networks," 
Addison Wesley, Electronic Systems Engineering Series, 1985.
[Ham 86] Hammond J.L. & O'Reilly J.P., "Performance Analysis of Local Computer 
Networks," Addison Wesley, 1986.
[Har 86] Harp J.G., Roberts J.B.G. & Ward J.S., "Signal Processing with Transputer 
Arrays (TRAP)," Computer Physics Communications, 37(1985), pp.77-86.
[Har 87] Harp J.G., "Phase 2 of the Reconfigurable Transputer Project - P1085," 
Proceedings of Esprit'87 Conference - Achievement and Impact, PT85, 1987, 
pp. 583-591.
[Hay 84] Hayes J.E., "Modeling and Analysis of Computer Communication Network," 
Plenum Press, New York, 1984.
[Hea 88] Heath J.R., "Analysis of Gateways Congestion in Interconnected High-Speed 
Local Networks," IEEE Trans. on Commun., C-36(8), 1988, pp. 986-989.
[Hey 82] Heyman D.P., "An Analysis of the Carrier Sense Multiple Access Protocol," 
Bell System Technical Journal, 61(8), Oct. 1982, pp. 2023-2051.
[Hey 88] Hey A.J.G., "Reconfigurable Transputer Networks: Practical Computation," 
Phil. Trans. R. Soc. Lond., A326, 1988, pp. 395-410.
[Hil 85] Hillis W.D., "The Connection Machine," MIT Press, Cambridge, MA, USA, 
1985.
[Hoa 78] Hoare C.A.R., "Communicating Sequential Processes," Communications of 
the ACM, 21(8), 1978, pp. 666-677.
[Hop 86] Hopper A., Temple S. & Williamson R., "Local Area Networks Design," 
Addison-Wesley, International Computer Series, 1986.
[Hwa 85] Hwang K. & Briggs F.A., "Computer Architecture and Parallel Processing," 
McGraw-Hill, International Editions, 1985.
[Ibb 89] Ibbett R.N. & Topham N.P., "Architecture of High Performance Computers," 
Vol. II, Macmillan Computer Science Series, 1989.
[Inm 87a] INMOS, "Designs and Application for C004," Technical Notes, No. 19, 1987.
[Inm 87b] INMOS, "IMS T212 Transputer," Engineering Data Book, Nov. 1987.
[Inm 88] INMOS, "Transputer Instruction Set -A Compiler Writer's guide," Prentice 
Hall International, 1988.
[Inm 89] INMOS, "The Transputer Data Book," Second Edition 1989.
[Inm 91] INMOS, "The T9000 Transputer Products," First Edition 1991.
[Ino 88] Inoue A. & Maeda A., "The Architecture of a Multi-Vector Processor System, 
VPP," Parallel Computing, 8(1988), pp. 185-193.
[Int 85] INTEL Scientific Computer, iPSC User’s Guide, Aug. 1985.
[Jac 57] Jackson J.R., "Networks of Waiting Lines," Operations Research, Vol. 5, 
Aug. 1957, pp. 518-521.
[Jay 85] Jayasumana A.P. & David P., "The Token-Skipping Channel Access Scheme 
for Bus Networks," Computer Networks and ISDN Systems, 9(1985), pp. 
201-208.
[Jin 87] Jinks P.J., "PARSIFAL- Hardware for Mapping Arbitrary Occam Networks 
onto Transputers," Proceedings of the 6th Occam User Group, Technical 
Meeting, Surrey, UK, 1987.
[Jon 88] Jones P., "Support for Occam Channel Via Dynamic Switching in 
Multi-Transputer Machines," Proceedings of the 9th Occam User Group, 
Technical Meeting, Southampton, 19-21 Sept. 88, pp. 101-110.
[Jon 89] Jones G., "Carefully Scheduled Selection with ALT," Occam User Group 
Newsletter No. 10, January 1989.
[Jon 91] Jones P. & Cha H., "Towards a Hybrid Message Passing Regime for Large 
Multi-processor Machines," Proceedings of the World Transputer User Group 
(WOTUG), Conference, Sunnyvale, USA, 22-26 April 1991.
[Kal 88] Kallstrom M. & Thakkar S.S., "Programming Three Parallel Computers," 
IEEE Software 5(1), 1988, pp. 11-22.
[Kas 63] Kasami T., "Optimum Shortened Cyclic Code for Burst-Error Correction," 
IEEE Trans. on Information Theory, T-9(1), Apr. 1963, pp. 105-109.
[Kel 84] Kelley R.P. et al., "Transceiver Design and Implementation Experience in An 
ETHERNET-Compatible Fiber Optic Local Area Network," Proceedings of 
the IEEE INFOCOM 84, San Francisco, CA, USA, 9-12 April 84, pp. 2-7.
[Ker 79] Kermani P. & Kleinrock L., "Virtual Cut-Through: a New Computer Switching 
Technique," Computer Networks, Vol. 3, No. 4, Sept. 1979, pp. 267-286.
[Kie 83] Kiesel W.M. & Kuehn P.J., "A New CSMA/CD Protocol for Local Area 
Networks with Dynamic Priorities and Low Collision Probability," IEEE 
Journal on Selected Areas in Communications, SAC-1(5), Nov. 1983, pp. 
869-876.
[Kle 64] Kleinrock L., "Communication Nets: Stochastic Message Flow and Delay," 
McGraw-Hill, New York: Dover, 1964.
[Kle 75] Kleinrock L. & Tobagi F.A., "Packet Switching in Radio Channels: Part I - 
Carrier Sense Multiple Access Modes and their Throughput-Delay 
Characteristics," IEEE Trans. on Commun., C-23(12), Dec. 1975, pp. 1400-1416.
[Kle 80] Kleinrock L. & Scholl M., "Packet Switching in Radio Channels: New 
Conflict-Free Multiple Access Schemes," IEEE Trans. on Commun., C-28, 
July 1980, pp. 1015-1029.
[Kno 87a] Knowles A.E., "PARSIFAL- A Parallel Simulation Facility Based on 
Transputer," Presented at the School on High Performance Architectures and 
Algorithms, Primorsko, Bulgaria, 17-23 May 1987.
[Kno 87b] Knowles T., Larmouth J. & Knightson K.G., "Standards for Open Systems 
Interconnection," BSP Professional Books, 1987.
[Kno 89] Knowles A.E. & Kanchev T., "The Support of the Occam Model of 
Communication via Message Passing in a Network of Transputers," In 
Microprocessors and Microsystems, 13(2), March 1989, pp. 113-123.
[Kon 74] Konheim A.G. & Meister B., "Waiting Lines and Times in a System with 
Polling," Journal of the Association of Computing Machinery, 21(3), July 
1974, pp. 470-490.
[Kun 89] Kung H.T., "iWARP, Systolic Array Processors," Prentice Hall, International, 
1989.
[Lam 80] Lam S.S., "A Carrier Sense Multiple Access Protocol for Local Area 
Networks," Computer Networks, 4(1980), pp. 21-32.
[Lan 76] Lang T. & Stone H.S., "A Shuffle-Exchange Network with Simplified Control," 
IEEE Trans. Comput., Vol. C-25, No. 1, Jan. 1976, pp. 55-65.
[Lau 91] Lau S.W. & Lau F.C.M., "A Simple Virtual Cut-Through Router," WOTUG 
Newsletter, No. 15, July 1991.
[Law 75] Lawrie D.H., "Access and Alignment of Data in an Array Processor," IEEE 
Trans. Comput., Vol. C-24, No. 12, Dec. 1975, pp. 1145-1155.
[Lea 87] Lea R.M., "An Overview of the Influence of Technology on Parallelism," Major 
Advances in Parallel Processing, London, UK, Dec. 1987, pp. 3-12.
[Lou 86] Loucks W.M., Kwak W.I. & Vranesic Z.C., "Implementation of Dynamic 
Address Assignment Protocol in a Local Area Network," Computer Networks 
& ISDN Systems, 11(2), Feb. 1986, pp. 133-146.
[Mar 83] Marsan M.A. & Roffinella D., "Multichannel Local Area Network Protocols," 
IEEE Journal on Selected Areas in Communications, Vol. SAC-1, No. 5, 
November 1983, pp. 885-897.
[May 87] May D. & Shepherd R., "Communicating Process Computers," INMOS 
Technical Note 22, 1987.
[May 88] May D., "OCCAM Reference Manual," Prentice Hall International, London, 
1988.
[Mer 87] Merakos L.F. et al., "Interconnection of CSMA/CD Local Area Networks: the 
Frequency Division Approach," IEEE Trans. on Commun., C-35(7), 1987, pp. 
730-738.
[Mer 80] Merlin P.M. & Schweitzer P.J., "Deadlock Avoidance in Store-and-Forward 
Networks - I: Store-and-Forward Deadlock," IEEE Trans. on Commun., 
C-28(3), Mar. 1980, pp. 345-354.
[Met 76] Metcalfe R.M. & Boggs D.R., "Ethernet: Distributed Packet Switching for Local 
Area Networks," Communications of the ACM, Vol. 19, No. 7, July 1976, pp. 
395-404.
[Mil 85] Milne G., "Circal and the Representation of Communication, Concurrency 
and Time," ACM TOPLAS, Vol. 7, No. 2, Apr. 1985.
[Mil 90] Miller P.R. & Yantchev J.T., "Developing Powerful Communication 
Mechanisms for Distributed Memory Computers From Simple and Efficient 
Message Routing," Proc. 5th Distributed Memory Computing Conference, 
Charleston, USA, 1990.
[Mil 91] Miller P.R., Jesshope C.R. & Yantchev J.T., "The Mad-Postman Network 
Chip," Proceedings of the World Transputer User Group (WOTUG), 
Conference, Sunnyvale, USA, 22-26 April 1991.
[Nic 88] Nicole D. A. et al., "Switching Networks for Transputer Links," In Proceedings 
of the 8th Technical Meeting of the Occam User Group, Ed. by Kerridge Jon, 
March 1988, pp. 147-165.
[Nis 86] Nishida T. et al., "Congestion Control in Interconnected LAN," Local Area 
and Multiple Access Networks, R.L. Pickholtz, Ed., Rockville, MD, USA: 
Computer Science Press, 1986, pp. 107-136.
[Ott 89] Otto S.W., "Shared-Memory Versus Distributed-Memory Half-time Score," 
Computer Physics Communications (Netherlands), Vol. 57, No. 1-3, Dec. 
1989, pp. 95-100.
[Pac 87] Packer J., "Exploiting Concurrency; A Ray-Tracing Example," INMOS 
Technical Note 7, Inmos Ltd., Bristol, 1987.
[Pas 88] Pase D.M. & Larrabee A.R., "Intel iPSC Concurrent Computer," In 
Programming Parallel Processors, Babb R.G. (Ed.), Addison-Wesley, 
Massachusetts, 1988, pp. 93-104.
[Pee 89] Peel R.M.A., "Issues Raised While Implementing a Layered Protocol Using 
Occam, and Transputer," In Proceedings of the Occam User Group, 10th 
Technical Meeting, Ed. Andre Bakkers, Applying Transputer Based Parallel 
Machines, 1989.
[Pee 92] Peel R.M.A., "Virtual Cut-Through Routing, Direct Memory Access and the 
IMS B407 Ethernet TRAM," WOTUG Newsletter, No. 16, January 1992.
[Pet 61] Peterson W.W. & Brown D.T., "Cyclic Codes for Error Detection," 
Proceedings of the IRE, Jan. 1961, pp. 229-235.
[Pie 72] Pierce J., "How Far Can Data Loops Go?," IEEE Trans. on Commun., C-20(3), 
1972, pp. 527-530.
[Pre 81] Preparata F.P. & Vuillemin J.V., "The Cube Connected Cycles: A Versatile 
Network for Parallel Computers," CACM, Vol. 24, No. 5, May 1981, pp. 
300-309.
[Raw 78] Rawson E.G. & Metcalfe R.M., "Fibernet: Multimode Optical Fiber for Local 
Computer Networks," IEEE Trans. on Commun., Vol. C-26, No. 7, 1978, pp. 
983-990.
[Rio 85] Rios M. & Georganas N.D., "A Hybrid Multiple Access Protocol for Data and 
Voice Packet over Local Area Network," IEEE Trans. on Computers, C-34(1), 
Jan. 1985, pp. 91-94.
[Rob 91] Robinson M., "Popular and Parallel," Byte, Vol. 16, No. 6, June 1991, pp. 
219-228.
[Ros 87] Roscoe A.W., "Routing Messages Through Networks: An Exercise in 
Deadlock Avoidance," Oxford University Computing Laboratory Report, 1987.
[Ros 91] Roscoe A.W. et al., "Formal Methods in The Development of the H1 
Transputer," Proceedings of the World Transputer User Group (WOTUG), 
Conference, Sunnyvale, USA, 22-26 April 1991.
[Rya 81] Ryan R. et al., "INTEL Local Network Architecture," IEEE Micro, Vol. 1, No. 
4, Nov. 1981, pp. 26-41.
[Sei 85] Seitz C.L. , "The Cosmic Cube," Communication of the ACM, 28(1), January 
1985, pp. 22-23.
[Sha 90] Shallow P.A., "Really Efficient Multiple Buffering in Occam and Efficient 
Fair ALT," Occam User Group Newsletter No. 12 January 1990.
[Sho 79] Shoch J.F. & Hupp J.A., "Performance of an Ethernet Local Network - A 
Preliminary Report," Local Area Communication Networks Symposium, 
Boston, May 1979, pp. 318-322.
[Smy 90] Smythe C., "Networks and their Protocols," Electronics Communication 
Engineering Journal, Vol. 2, No. 1, Feb 1990, pp.27-34.
[Ste 91] Stein R.M., "Scaling Up: Get The Message?," Byte, Vol. 16, No. 6, June 1991, 
pp. 231-240.
[Sto 78] Stone H.S., "Parallel Computers: An Introduction to Computer Architecture," 
Science Research Associates, Chicago, 1978.
[Tak 83] Takagi A., Yamada S. & Sugawara S., "CSMA/CD with Deterministic 
Contention Resolution," IEEE Journal on Selected Areas in Communications, 
SAC-1(5), Nov. 1983, pp. 877-884.
[Tan 88] Tanenbaum A.S., "Computer Networks," Prentice Hall International, 
Englewood Cliffs, New Jersey, 1988.
[Tau 84] Taub D.M., "Arbitration and Control Acquisition in the Proposed IEEE 896 
FutureBus," IEEE Micro, Aug. 1984, pp. 28-41.
[Tob 80] Tobagi F.A. & Hunt B.V., "Performance Analysis of Carrier Sense Multiple 
Access with Collision Detection," Proceedings of the Local Area 
Communication Networks, Boston, MA, USA, 7-9 May 1980, pp. 217-245.
[Tob 83] Tobagi F.A. & Fine M., "Performance of Unidirectional Broadcast Local 
Area Networks: Expressnet and Fastnet," IEEE Journal on Selected Areas in 
Communications, SAC-1(5), Nov. 1983, pp. 913-926.
[Tob 84] Tobagi F.A. & Fine M., "Demand Assignment Multiple Access Schemes in 
Broadcast Bus Local Area Networks," IEEE Trans. on Computers, C-33(12), 
Dec. 1984, pp. 1130-1159.
[Tok 77] Tokoro M. & Tamaru K., "Acknowledging Ethernet," 15th IEEE Computer 
Society International Conference, Washington D.C., USA, 6-9 Sept. 1977, 
pp. 320-325.
[Tud 90] Tudruj M. & Thor M., "A Multi-layer Dynamically Reconfigurable Transputer 
System,” 5th Int. Workshop on Parallel Processing by Cellular Automata and 
Arrays, 17-21 Sept. 1990.
[Ulu 83] Ulug M.E., "Calculation of Waiting Time for Real Time Token Passing Bus," 
General Electric Corporation Research and Development, 1983, pp. 1-15.
[Wel 89] Welch P.H., "TRANSNET - A Transputer Based Communication Service," 
Proceedings of the 10th Occam User Group, Technical Meeting, Enschede, 
Netherlands, 3-5 April 1989, pp. 198-212.
[Wes 86] West A.J., "Naive Insertion of Monitor Processes Alters Occam Semantics," 
Internal Report, Manchester University, Feb. 1986.
[Who 89] Whobrey D., "A Communication Chip for Multiprocessors," In CONPAR 88, 
Ed. by C.R. Jesshope and Reinhartz, Cambridge University Press, 1989, pp. 
464-473.
[Wit 81] Wittie L.D., "Communication Structures for Large Networks of 
Microcomputers," IEEE Trans. on Computers, C-30(4), April 1981, pp. 264-273.
[Yan 89] Yantchev J. & Jesshope C.R., "Adaptive Low-Latency Deadlock Free Packet 
Routing for Networks of Processors," IEE Proc. E, Vol. 136, No. 3, May 1989, 
pp. 178-186.
appendix a
APPENDIX A
A.1 CSMA/CD Average Packet Delay Expression
The embedded Markov chain approach was used by Lam [Lam 80], and later 
repeated by Hayes [Hay 84], to derive the average delay of slotted CSMA/CD 
under the following assumptions:
1- The source of traffic to the channel is an infinite population of users who collectively 
form an independent Poisson process with a mean arrival rate of $\lambda$ messages/second.
2- Each station is allowed to transmit at most one packet when its buffer is full.
3- Following a collision, each station involved reschedules its retransmission using the 
binary exponential back-off algorithm.
Bux has introduced a slight modification concerning the slot time which yields
\[
D = T + \frac{\tau}{2}
  + \frac{\lambda\left\{\overline{m^2} + 2m\,(2\tau/v) + (2\tau/v)^2\,[v^2 + 2(1-v)]\right\}}
         {2\,[1 - \lambda(m + 2\tau/v)]}
\]
\[
\qquad - \frac{1 - \exp(-2\lambda\tau)}{2\,[F^*(\lambda)\,v - (1 - \exp(-2\lambda\tau))]}
  \left(\frac{2}{\lambda} + 2\tau v - 6\tau\right)
\tag{A.1}
\]
The parameter $m = T + \tau$ consists of the packet transmission time plus the end-to-end 
propagation delay, $v$ is the probability of successfully transmitting a packet, and 
$F^*(\lambda)$ is the Laplace transform of the probability density function of $m$, 
expressed as $F^*(\lambda) = \exp[-S(1 + a)]$ for a constant packet length 
($\overline{m^2} = m^2$).
A.2 Sketch of the Proof
Let us assume that the channel consists of idle periods $I$ and contention periods $C$, where 
an immediate successful transmission may result if only one message is sent, and lasts for the 
duration of its transmission and propagation delay $m$. The key element of the embedded 
Markov chain analysis is the message arrival process, which gives the state equation at the 
embedded points
\[ n_{i+1} = n_i + a_{i+1} + b_{i+1} - 1. \]
Here $n_i$ is the number of packets left behind after the departure of the $i$th packet, 
$a_{i+1}$ the number of packets arriving during the period $I_{i+1} + C_{i+1}$, and $b_{i+1}$ 
the number of packets arriving during the transmission of the $(i+1)$th packet, including the 
propagation delay $\tau$ (in the above relation, $-1$ comes from the fact that one message has 
been successfully transmitted). The probabilities of having $a_{i+1}$ and $b_{i+1}$ packets 
can be found by using the geometric distribution and Poisson arrivals during an arbitrarily 
distributed message length. Finally, at the steady state, the expected number of packets in 
the queue of the system can be found using the probability-generating function and applying 
Little's formula, $n = \lambda \times$ (average waiting time).
A.3 Approximate Average Packet Delay for the Acknowledged CSMA/CD Protocol
The probability $v$ of a successful transmission during contention periods is assumed to 
be $1/e$, which is the optimal slotted Aloha throughput rate for an infinite population model 
[Met 76]. Because the multiprocessor buses are very short (1.5 meters), the end-to-end 
propagation delay can be of the order of $\tau = 5$ ns. As the propagation delay $\tau$ 
approaches zero, the expression in the second line of equation A.1, which is always negative, 
approaches zero. Since $\tau$ is very small, neglecting this expression gives an overestimate 
of the result. 
Furthermore, replacing the term (2zef  (e~2 + 2(1 -  e~1)) by ( ze f  yields an error on the mean 
packet delay of the order of (2.96 x KT^/Tusee which is insignificant. Finally in the 
acknowledged CSMA/CD any station having a packet ready in its buffer has to wait for a 
basic time ($2\tau$) before attempting transmission even though the channel is idle. 
This leads us to add $2\tau$ to the average delay, which gives
\[
D = T + 2.5\tau + \frac{\lambda\,(m + 2e\tau)^2}{2\,[1 - \lambda(m + 2e\tau)]}
\tag{A.2}
\]
where $m = T + T_a + 3\tau$, with $T_a$ being the time to transmit an acknowledgement. 
If we let $X = m + 2e\tau$ we can rewrite the above equation as
\[
D = T + 2.5\tau + \frac{\lambda X^2}{2\,(1 - \lambda X)}
\tag{A.3}
\]
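As a numerical illustration, the closed form of equation A.3 can be evaluated directly. The Python sketch below assumes that A.3 reads D = T + 2.5τ + λX²/(2(1 − λX)), with X = m + 2eτ and m = T + T_a + 3τ as defined above. The function and argument names are illustrative and do not come from the thesis.

```python
import math


def acknowledged_csma_cd_delay(lam, t_packet, t_ack, tau):
    """Approximate mean packet delay of acknowledged CSMA/CD (assumed form of A.3).

    lam      -- aggregate Poisson arrival rate (packets per unit time)
    t_packet -- packet transmission time T
    t_ack    -- acknowledgement transmission time T_a
    tau      -- end-to-end propagation delay
    All times must use the same unit.
    """
    m = t_packet + t_ack + 3.0 * tau   # m = T + T_a + 3*tau
    x = m + 2.0 * math.e * tau         # X = m + 2*e*tau
    if lam * x >= 1.0:
        # The queueing term diverges as lam * X approaches 1.
        raise ValueError("unstable load: lam * X must be below 1")
    return t_packet + 2.5 * tau + lam * x ** 2 / (2.0 * (1.0 - lam * x))
```

With `lam = 0` the queueing term vanishes and the delay reduces to T + 2.5τ. The delay grows without bound as λX approaches 1.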
appendix b
APPENDIX B
B.1 Scheduling Function of Demand Access Protocols
In general, we can derive an expression for the scheduling function that allows $N$ competing 
stations to share a common serial bus under a collision-free access strategy. We assume that 
any station has the capability to detect the beginning and the end of a carrier in the non-zero 
time $\delta$. We also denote the interspacing time between packets by $g$.
One wants to express the scheduling function, denoted by $H(i,j)$, as computed at station 
$j$ separated by $k$ logical cyclic distances from a currently transmitting station $i$ 
(i.e. $i+1$, $i+2$, $i+3$, ..., $(i+k) \bmod N$), so that any station between $i$ and $j$ is 
allowed to access the channel without conflict if its scheduling function expires (Figure B.1).
Figure B.1 The evaluation of the $H(i,j)$ scheduling function by all stations when 
station $i$ finishes its current transmission.
For the random assignment protocol, the scheduling function after every collision 
detection is chosen randomly as $0 < H(i,j) < 2^{\min(L,8)}$, where $L$ is the number of 
collisions. In the collision-free protocol, $H(i,j)$ is chosen so that a unique station $j$ 
can access the channel. Specifically, if $H(i,i+p) = 0$ the next station to transmit is 
$j = (i+p) \bmod N$. One notices that the parameter $p$ determines the priority of the scheme. 
For instance, when $p = 0$ the current transmitting station has the highest priority 
($H(i,i) = 0$), whereas for $p = 1$ the station $j = (i+1) \bmod N$ is the next in turn. 
From Figure B.1 we can write the following:
\[
\begin{aligned}
H(i,i+2) &= H(i,i+1) + [\tau_{i,i+1} + \tau_{i+1,i+2} - \tau_{i,i+2}] + (\delta + g) \\
H(i,i+3) &= H(i,i+2) + [\tau_{i,i+2} + \tau_{i+2,i+3} - \tau_{i,i+3}] + (\delta + g) \\
H(i,i+4) &= H(i,i+3) + [\tau_{i,i+3} + \tau_{i+3,i+4} - \tau_{i,i+4}] + (\delta + g) \\
&\;\;\vdots \\
H(i,(i+k) \bmod N) &= H(i,(i+k-1) \bmod N) + [\tau_{i,i+k-1} + \tau_{i+k-1,i+k} - \tau_{i,i+k}] + (\delta + g).
\end{aligned}
\]
By adding these expressions, we end up with
\[
H(i,j) = H(i,i+1) + \sum_{l=0}^{(j-i-1) \bmod N} \tau_{i+l,\,i+l+1} - \tau_{i,j}
       + [(j-i-1) \bmod N]\,(\delta + g),
\tag{B.1}
\]
where $j = (i+k) \bmod N$.
In SOSAM [Gol 83], all the individual $\tau_{i,j}$ are precalculated and stored in the station 
memory, to be used for the evaluation of the $H$-function. In BID [Tob 84] and later in HYMAP 
[Rio 85] the physical ordering of the stations is the same as their logical one, thus 
$\sum_{l=0}^{k-1} \tau_{i+l,\,i+l+1} - \tau_{i,j} = 0$. In this respect, the scheduling 
function can easily be implemented without involving complex mathematics. For instance, in 
HYMAP, starting with $H(0,j) = H(0,1) + (j-1)(\delta + g)$, where $H(0,1) = 0$ for a fair 
protocol, one can load the station counter with the value $(j-1)$ at the speed of 
$(\delta + g)$. After the first cycle, the counter will be maintained at the value 
$H(i,i) = (N-1)(\delta + g) + 2\tau$. In BRAM [Chi 79] the propagation delay was chosen as 
$\tau_{i+l,\,i+l+1} = \tau$, resulting in
\[ H(i,j) = [(j-i-1) \bmod N]\,(\delta + g + \tau) + H(i,i+1). \]
This restriction works only if stations are equally spaced (i.e. star topology).
In general, for any position of the stations on the bus, one can choose 
$\tau_{i+l,\,i+l+1} = 2\tau$. The scheduling function of equation B.1 therefore becomes 
\[ H(i,j) = H(i,i+1) + (2\tau + \delta + g)\,[(j-i-1) \bmod N]. \]
On the other hand, if we let $2\tau + \delta + g = 3\tau$ (i.e. $g = \tau$ and $\delta = 0$), 
the result is still general provided that the time needed to detect a carrier and the 
interspacing time are included in $\tau$. The final scheduling function can then be written as 
\[ H(i,j) = H(i,i+1) + 3\tau\,[(j-i-1) \bmod N]. \tag{B.2} \]
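The final scheduling function is simple enough to evaluate in a few lines. The sketch below is a minimal Python illustration of equation B.2, assuming H(i,i+1) is supplied as a parameter that defaults to 0 (the fair-protocol value quoted above for HYMAP). The function names are hypothetical.

```python
def scheduling_delay(i, j, n_stations, tau, h_next=0.0):
    """H(i, j) from equation B.2: H(i, i+1) + 3*tau*((j - i - 1) mod N).

    i, j       -- indices of the transmitting and the waiting station
    n_stations -- number of stations N on the bus
    tau        -- propagation delay, assumed to absorb detection and gap times
    h_next     -- assumed value of H(i, i+1)
    """
    hops = (j - i - 1) % n_stations   # Python's % is non-negative for N > 0
    return h_next + 3.0 * tau * hops


def transmit_order(i, n_stations, tau):
    """Stations other than i, sorted by the expiry of their scheduling function."""
    others = [j for j in range(n_stations) if j != i]
    return sorted(others, key=lambda j: scheduling_delay(i, j, n_stations, tau))
```

For example, with four stations and station 2 transmitting, `transmit_order(2, 4, 1.0)` gives the cyclic order 3, 0, 1, because each successive station waits a further 3τ.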
appendix c
APPENDIX C
C.1 Applying the Gap Equations
Before determining the expressions of the different gaps that arise at the transmitter, 
channel and receiver interfaces, it is important to define all the parameters involved. By 
$\alpha$, $\beta$, $\theta$, $\epsilon$ and $\sigma$ we label all gaps occurring between the 
activities involved in transmitting $m$ packets from a sender to a receiver processor. A 
transmitter host stores messages into its SIMP's buffers. When no buffer is available, gaps 
or idle periods denoted by $\alpha$ may result, which represent the time spent by the host 
waiting for an empty buffer. Within the network channel, gaps denoted by $\beta$ also occur 
if there are no transmissions due to collisions or errors. In addition, although messages may 
arrive from the host, they will be held up by the interface processing speed, hence delaying 
transmissions through the channel. Gaps denoted by $\epsilon$ and $\sigma$ appear at the 
receiving interface, as the arrival of consecutive packets may be delayed by the network and 
messages are only delivered to their destination host after being completely stored inside the 
receiving buffer. Finally, the $\theta$ gaps occur between transmitted packets and their 
acknowledgements, which may be delayed by the network or the higher-level protocols.
The durations of the activities at the transmitter and the receiver interfaces are equal to 
s f gap) and ^ ^ re sp e c tiv e ly  (gap represents the corresponding parameter a, (3,0, e or a). This 
processing consists of the software protocols and the transmission or reception of the yth 
packet through the transputer communication link. On the other hand, the services associated 
with the interconnection network, composed of the yth packet transmission and its queuing 
delay, and equal to sjfj at the hop labelled k, determine different routing transmissions and 
strategies.
Activities in various stages can interact with each other and have some degree of 
abstraction. For instance, s£j interacts withsj0 o rjjr) and may contains gaps and sub-activities.
In general, as stated in the gap equation of chapter 5, with each gap there is a protocol 
parameter $Q_{gap}$ that establishes the rule of activities. In particular, these parameters 
specify how many messages should be stored inside the buffer before a transmission can be 
issued through the network, and how many packets should be sent before an acknowledgement 
can be received; this is typically the window strategy.
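Let us define the following two functions, which will be used in each gap equation: the unit 
step function
\[ u(x) = \begin{cases} 0 & x \le 0 \\ 1 & x > 0 \end{cases} \]
and an indicator function of the indices $i$ and $j$ that takes the value
\[ \begin{cases} 1 & \text{if } j = k \cdot i \text{ with } k = \{0, 1, 2, \ldots\} \\ 0 & \text{elsewhere.} \end{cases} \]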
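The two helper functions can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the indices are assumed to be non-negative integers. The `positive_part` helper is an added convenience for terms in which a gap is multiplied by its own unit step, so that only positive gaps contribute.

```python
def unit_step(x):
    """u(x): 0 for x <= 0 and 1 for x > 0."""
    return 1 if x > 0 else 0


def multiple_indicator(j, i):
    """1 if j = k * i for some k in {0, 1, 2, ...}, otherwise 0."""
    if i == 0:
        return 1 if j == 0 else 0
    return 1 if (j % i == 0 and j // i >= 0) else 0


def positive_part(gap):
    """gap * u(gap): a gap contributes to a sum only when it is positive."""
    return gap * unit_step(gap)
```

Used this way, `multiple_indicator(j, q)` selects every q-th packet, which matches the window strategy in which an acknowledgement is expected once per window.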
Figure C.1 Storing and transmitting sub-packets at the transmitter.
C.1.1 α Gap Expressions
The a-gap occurs at the transmitter interface when the transmitting buffers are full. 
The last activity after which this event occurs would be Sjla) given that all other activities
X sP have accomplished (i.e. all (j-1 ) packets have been stored or transmitted).< = i
aj ~ C .l
$Q_\alpha$, the protocol parameter, is defined as $p - q^*$, where $p$ and $q^*$ are the 
number of buffers and the window size respectively.
Premature formatting consists of preparing the headers of each message before the 
whole packet is received from the host. In this case, the activity 5(m) is equal to a constant 
or a fraction of the software protocol time.
C.1.2 β Gap Expressions
The P gaps occur at the first channel ( pt j)  as a result of waiting for the current activity 
■s/fi+Qp to be accomplished. The latter consists of storing the packet inside the transmitting
buffer and proceeding the transmission given that all activities X s[° have been finished
-2 + Qp
/ = i
(i.e. formatting, storing previous sub-packets and preparing parameters of the communica­
tion).
Pi.;
Cp+y-2 <2p+y-2
+ Z s f +  Z  a,«(a,)1 1 = 1 i-p
C.2
Pi, i — X 5/1 = 1
(0
where the protocol parameter for (3 is the constant = /z0.
Within the network (3 can be expressed from hop to hop as (Figure C.2)
fty  = T £  t i l l  + P*-w«(P*-W» + 2  .(1)1 + P*.,«(Pw) + et,,M .(o )1 = 1 i = l 4 t = lV q ' C. 3
for 2 < A: < /ip.
If the transmission at each gateway is done in a store-and-forward manner, then we 
would have a complete store activity; therefore,
Pm = Pi,i+ X sj\i and e-j.i=2
On the other hand, if the transmission uses virtual cut-through, only a fraction of the 
store activity will be completed, after which the transmission on the other side of the 
network starts. Therefore,
Pm = kd and e- j - l ,  where d is the flit of the packet.
Figure C.2 Activities through the network channels.
C.1.3 θ Gap Expression
The 0 gap results from the acknowledgement processing where each successfully 
transmitted packet expects an acknowledgement. When a hop-by-hop service is applied, the 
acknowledgement is required in each channel as the packet goes its way, thus QkJ = skGJ. In 
particular, for a high priority acknowledgement this activity is simply equal to a constant that 
is the time to transmit an acknowledgement. When an end-to-end service is applied, the 
acknowledgement has to come from the destination station, where the packet has crossed hp 
channels or hops. The direction of the flow is therefore reversed,
- £  (sg u  + P* + e*.  uM,•(«')] - [  £  (Sg + P,.,« (&.,)> + £  6t,M . ( 0
.< = 1 q J  L,=1 /= i q + C -  CA
C.1.4 ε and σ Gap Expressions
The $\epsilon$ gap results at the receiver interface when the $j$th packet is waited for to 
be completely stored inside the receiving buffer. In other words, $\epsilon$ gaps determine 
the premature or delayed receptions discussed in chapter 5.
e.j=\ X (s$  + $k>iu ((3,,.)) + 0A. _ XM M  -1)1 -  T skj  + X (4rJ + (£,.)) + X a,w(a,) .
L,=1 J L i=i »=o
a  gaps are the idle times between the reception of consecutive packets (Figure C.3)
C. 5
a- = X (41+ CfW (e,) + a f _ xu (a,- _ / )i = i C.6
ao = P*=/, ,i + d. If 0 is related to 8 we have Qk=h j — -cyw(-87) + sjfj.
Figure C.3 The reception o f the sub-packets
In the case of a connected transputer network, an acknowledgement is required from 
node to node, but it is overlapped with the transmission because of the bidirectional link; 
thus $\theta_{k,j} = 0$ and $\beta_{k,j}$ is the protocol time (latency) involved in each 
node for transferring messages. If the nodes have a hardware router, $\beta$ will be a small 
constant depending on the implementation of the router.
appendix d
APPENDIX D
D.1 Initial Conditions for the Transparent Transmission
The effective capacity of the network channel is much higher than that of the 
transputer link. Therefore, each message sent by the host to its SIMP interface has to be 
completely buffered before it is sent through the network. This is the rule of a 
store-and-forward transmission, which occurs initially at each SIMP interface transmitter. 
The transparent transmission described in section 5.5 violates this principle, in the sense 
that a transmission through the network is issued when only a fraction of the message is 
present inside the buffer. To do this, the slices $W_1, W_2, \ldots, W_n$ of a packet $W$ 
have to be properly chosen with respect to the first one, $W_1$, so that no gap can be 
induced within the transmission of the packet through the network.
a)- $X = r/R < 1$, the ratio of the transputer effective link rate to the channel capacity: 
Let us assume that we have virtually divided a packet of length $W$ into $n$ slices of length 
$W_1, W_2, \ldots, W_n$ respectively. The process of transmitting $W$ is as follows: when the 
first slice $W_1$ is stored inside the transmitting buffer, the transmission through the 
channel takes place after a queuing time $d_w$. In a contiguous transmission (i.e. there are 
no gaps between slices) the following conditions must be fulfilled,
\[
\begin{aligned}
W_2/r + b_1 &< d_w + W_1/R \\
W_3/r &< W_2/R \\
W_4/r &< W_3/R \\
&\;\;\vdots \\
W_n/r &< W_{n-1}/R
\end{aligned}
\]
by adding the above relations up to a rank $j$ we can write,
\[ W_1 > (b_1 - d_w)R + \left(\frac{R}{r}\right)^{j-1} W_j. \tag{D.1} \]
One can find a direct relation between the first slice $W_1$ and the original packet length 
$W$. By solving Equation D.1 for $W_j$ in terms of $W_1$, with $\sum_{j=1}^{n} W_j = W$ and 
$X = r/R$, the first slice length can be expressed as
\[ W_1 = \frac{(W - P)(1 - X)}{1 - X^n}. \tag{D.2} \]
An expression for any slice $W_j$ in terms of the original packet length $W$ can also be 
found by substituting equation D.2 into D.1:
\[ W_j = \frac{(W - P)\,X^{j-1}\,(1 - X)}{1 - X^n} \tag{D.3} \]
with $P = (b_1 - d_w)R$.
To minimise the communication overhead per transmission, it is evident that the first 
slice has to be as small as possible. Hence, the number of slices $n$ must be larger. To 
ensure that no gap exists between slices, $W_2$ must be completely stored inside the T-buffer 
before $W_1$ has been transmitted, $W_3$ before $W_2$, and so on. The last slice $W_n$ has to 
be at least equal to the machine word $L$ (e.g. 2 bytes for the T212 transputer). By solving 
Equation D.3 for $j = n$ and $W_n = L$ we get the following condition on the number of slices,
^ log[l + (WX -  pr) (1 -X)ILX2] 
log(X)
Therefore, to ensure no gaps within the transmission of a packet, the number of slices
$n$ and the first slice $W_1$ have to be chosen as determined by Equations D.4 and D.2 respectively.
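As an illustration of the slicing rule for $X < 1$, the following Python sketch computes the slice lengths in the limiting (equality) case of the no-gap conditions. It assumes that the first slice is $\beta + (W-\beta)(1-X)/(1-X^n)$ and that slice $j \ge 2$ has length $(W_1-\beta)X^{j-1}$, with $\beta = (b_1 - d_w)R$. The function names `slice_lengths` and `max_slices` are hypothetical and are not part of the thesis software.

```python
import math


def slice_lengths(W, n, X, beta=0.0):
    """Return [W_1, ..., W_n] for a packet of length W split into n slices.

    Assumption (equality case of the no-gap conditions, X = r/R < 1):
      W_1 = beta + (W - beta)(1 - X) / (1 - X**n)
      W_j = (W_1 - beta) * X**(j - 1)   for j >= 2
    """
    if not 0.0 < X < 1.0:
        raise ValueError("this sketch only covers 0 < X < 1")
    scale = (W - beta) * (1.0 - X) / (1.0 - X ** n)
    return [beta + scale] + [scale * X ** (j - 1) for j in range(2, n + 1)]


def max_slices(W, X, L, beta=0.0):
    """Largest n for which the last slice is still at least one word L."""
    bound = math.log(1.0 + (W - beta) * (1.0 - X) / (L * X)) / math.log(1.0 / X)
    return math.floor(bound)


if __name__ == "__main__":
    n = max_slices(W=1024, X=0.5, L=2)
    print(n, slice_lengths(1024, n, 0.5))
```

By construction the slices sum to the packet length, so the sketch can be used to check a candidate pair $(n, W_1)$ before it is used in a simulation run.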
For the sake of generality, we can also derive an expression for the first slice $W_1$ when
the transputer link rate is greater than the network bus capacity.
b)- $X > 1$: The first slice has to be chosen so that the second slice, delayed by the
protocol time $b_1$, has enough time to be stored inside the buffer. Thus $W_1 = L$ for
$X > 1 + b_1 r/L$, and $W_1 = (L + b_1 r)/X$ for $1 < X < 1 + b_1 r/L$. Combining
these equations with the unit step function $u$, we end up with the required equation in chapter
4.
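The two branches for $X > 1$ can be checked numerically with a short Python sketch. The function name `first_slice_fast_link` is hypothetical; the sketch only restates the piecewise rule above and does not reproduce the chapter 4 equation.

```python
def first_slice_fast_link(X, L, b1, r):
    """First slice length W_1 when the link is faster than the bus (X > 1).

    X  : ratio r/R of link rate to channel capacity
    L  : machine word length (e.g. 2 bytes on a T212)
    b1 : protocol time
    r  : effective link rate (b1 * r is then a length)
    """
    if X <= 1.0:
        raise ValueError("this sketch only covers X > 1")
    threshold = 1.0 + b1 * r / L
    if X >= threshold:
        return L                   # one word is already enough
    return (L + b1 * r) / X        # larger first slice hides the protocol time


if __name__ == "__main__":
    for X in (1.5, 2.0, 3.0):
        print(X, first_slice_fast_link(X, L=2.0, b1=1.0, r=2.0))
```

The two branches meet at $X = 1 + b_1 r/L$, where both give $W_1 = L$.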
appendix E.1
SIMULATION PROGRAMS
All read and print routines have been excluded from the SIMSCRIPT code.
E1: Single bus hybrid CSMA/CD protocol
preamble
normally mode is integer 
processes include PROTOCOL.SWITCH 
every TRANSPUTER has a RES.BUFFER 
every STATION has a SOURCE.TRANSPUTER, 
a STATION.TYPE, 
and an INDEX.BUFFER 
every ACKNOWLEDGE has an IDF.ACK 
every PACKET has a SOURCE.STATION, 
an IDF.STATION, 
and a PACKET.TYPE 
define SOURCE.STATION, SOURCE.TRANSPUTER,STATION.TYPE, 
RES.BUFFER, INDEX.BUFFER,IDF.ACK, and PACKET.TYPE as integer variables 
resources include CHANNEL,GATE, and BUFFER
define MEAN.TRANSMISSION.TIME, MEAN.PROCESSING.TIME, FOR.ACK, 
DELAY.TIME, WAITING.TIME, and PROPAGATION.DELAY as real variables 
define INC.NUM.STATIONS, MAX.STATIONS, MIN.STATIONS,
NUM.PACKETS.COMPLETED, NUM.PACKETS.DESIRED, OFFERED.LOAD, TRANS.ADR,
NUM.TRANSPUTERS.PER.STATION, BUFFER.LIMIT, NUM.PACKETS.SENT, and
NUM.STATIONS as integer variables
define .MICROSECONDS to mean units
define IS.WAITING.TRANSMISSION to mean 1
define HAS.COMPLETED.TRANSMISSION to mean 0
define SLOT.TIME to mean 2*PROPAGATION.DELAY
define YES to mean 1
define DATA to mean 1
define ACK to mean 0
define UNLOCKED to mean 1
define LOCKED to mean 0
define DEMAND to mean 1
define CONTENTION to mean 0
define SEED1 and SEED2 as double variables
define ACK.FLAG, DELAY.FLAG, TRANSMISSION.FLAG, GATE.IS,
CHANNEL.STATE, and TRANSMISSION.STATE as integer variables
define X as 1-dimensional array
tally MEAN.DELAY.TIME as the mean of DELAY.TIME
tally MEAN.WAITING.TIME as the mean of WAITING.TIME
accumulate AVG.NUM.PACKETS.IN.QUEUE as the average of N.Q.CHANNEL
accumulate AVG.BACKLOGGED.STATIONS as the average of OFFERED.LOAD
accumulate UTIL.CHANNEL as the average of TRANSMISSION.FLAG
end ''preamble
main
call read.data
for NUM.STATIONS=MIN.STATIONS TO MAX.STATIONS BY INC.NUM.STATIONS
do
call initialize 
start simulation 
loop 
end
process ACKNOWLEDGE giving IDF.ACK
wait 3*MEAN.TRANSMISSION.TIME .MICROSECONDS
'' on the average until the ack is returned from the corresponding
'' subnetwork.
create a STATION
STATION.TYPE(STATION) = ACK
INDEX.BUFFER(STATION) = IDF.ACK
activate this STATION now
end
routine initialize
define I,K and J as integer variables 
let time.v = 0
let NUM.PACKETS.COMPLETED = 0
let NUM.PACKETS.SENT = 0
let DELAY.FLAG = 0
let ACK.FLAG = 0
let GATE.IS = UNLOCKED
let K = NUM.TRANSPUTERS.PER.STATION
reset totals of TRANSMISSION.FLAG, OFFERED.LOAD, DELAY.TIME, 
and WAITING.TIME 
let seed.v(1) = SEED1
let seed.v(2) = SEED2
reserve X(*) as K*MAX.STATIONS
for I=1 to NUM.STATIONS
do
X(I) = BUFFER.LIMIT
for J=1 to NUM.TRANSPUTERS.PER.STATION 
do
activate a TRANSPUTER giving I now 
loop 
loop 
end
process packet
define QUEUE.WAS, CURRENT.ADR, HD,STATE, and CHANNEL.WAS.IDLE as 
integer variables
define START.TIME as real variable 
let START.TIME = time.v 
let STATE = IS.WAITING.TRANSMISSION 
let CHANNEL.WAS.IDLE = 1 
add 1 to OFFERED.LOAD 
while STATE = IS.WAITING.TRANSMISSION 
do
if CHANNEL.STATE = DEMAND
'wait.in.queue' request 1 gate(1)
relinquish 1 gate(1)
CURRENT.ADR = TRANS.ADR
HD = mod.f(IDF.STATION(PACKET)-CURRENT.ADR+NUM.STATIONS-1,NUM.STATIONS)
'' each ready station senses the channel every slot time
for i=1 to HD
do
wait PROPAGATION.DELAY .MICROSECONDS 
if GATE.IS = LOCKED 
go to wait.in.queue 
always
loop
request 1 channel(1)
request 1 gate(1)
GATE.IS = LOCKED
TRANS.ADR = IDF.STATION(PACKET)
wait PROPAGATION.DELAY .MICROSECONDS
call TRANSMISSION giving PACKET.TYPE(PACKET)
STATE = HAS.COMPLETED.TRANSMISSION 
GATE.IS = UNLOCKED 
relinquish 1 gate(1)
activate a PROTOCOL.SWITCH now 
else '' channel is in the contention mode
CHANNEL.WAS.IDLE = U.CHANNEL(1)
QUEUE.WAS = N.Q.CHANNEL(1)
request 1 CHANNEL(1)
if (CHANNEL.WAS.IDLE = YES or QUEUE.WAS = 0) and (CHANNEL.STATE = CONTENTION)
wait PROPAGATION.DELAY .MICROSECONDS
if N.Q.CHANNEL(1) = 0
call TRANSMISSION giving PACKET.TYPE(PACKET)
STATE = HAS.COMPLETED.TRANSMISSION 
else ''collision has occurred
wait 2*PROPAGATION.DELAY .MICROSECONDS
CHANNEL.STATE = DEMAND ''switch to the reservation mode
TRANS.ADR = NUM.STATIONS ''schedule according to a fixed station
always 
always 
always
relinquish 1 channel(1)
loop
reactivate the STATION called SOURCE.STATION(PACKET) now 
if PACKET.TYPE(PACKET) = DATA 
let DELAY.TIME = time.v - START.TIME 
add 1 to NUM.PACKETS.COMPLETED 
if NUM.PACKETS.COMPLETED = NUM.PACKETS.DESIRED 
call report 
always 
always
end ''packet process
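The waiting rule used in the demand (reservation) branch of the packet process above can be paraphrased in Python. This is a sketch only: the function names `hop_distance` and `service_order` are hypothetical, and station addresses are assumed to run from 1 to `num_stations` as in the listing.

```python
def hop_distance(station_id, current_adr, num_stations):
    """Number of propagation delays a ready station waits in demand mode.

    Mirrors HD = mod.f(IDF.STATION - CURRENT.ADR + NUM.STATIONS - 1, NUM.STATIONS):
    the station just after the last transmitter waits 0 delays, and the last
    transmitter itself waits num_stations - 1 delays.
    """
    return (station_id - current_adr + num_stations - 1) % num_stations


def service_order(ready_stations, current_adr, num_stations):
    """Order in which ready stations would seize the channel (assumed reading)."""
    return sorted(ready_stations,
                  key=lambda s: hop_distance(s, current_adr, num_stations))


if __name__ == "__main__":
    print(service_order([4, 1, 3], current_adr=2, num_stations=4))
```

Because the last transmitter receives the largest wait, the rule behaves like a round-robin schedule starting from the station that follows it.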
process PROTOCOL.SWITCH
'' this is a timer which updates activities on channel
'' every slot time, when there is carrier the gate is locked to
'' prevent any other station from transmission
for i=1 to NUM.STATIONS
do
if GATE.IS = UNLOCKED
wait 2*PROPAGATION.DELAY .MICROSECONDS 
else
go to stop.timer 
always 
loop
CHANNEL.STATE = CONTENTION
'stop.timer' end
process STATION
define INIT.WAIT.TIME as real variable
INIT.WAIT.TIME = time.v
if STATION.TYPE(STATION) = DATA 
subtract 1 from X(INDEX.BUFFER(STATION)) 
if X(INDEX.BUFFER(STATION)) > 0
reactivate the TRANSPUTER called SOURCE.TRANSPUTER(STATION) now 
always 
always
request 1 BUFFER(INDEX.BUFFER(STATION))
if STATION.TYPE(STATION) = DATA 
WAITING.TIME = time.v - INIT.WAIT.TIME 
always
create a PACKET
let SOURCE.STATION(PACKET) = STATION
let PACKET.TYPE(PACKET) = STATION.TYPE(STATION)
let IDF.STATION(PACKET) = INDEX.BUFFER(STATION) 
activate this PACKET now
suspend ''block the buffer until a successful transmission...
relinquish 1 BUFFER(INDEX.BUFFER(STATION))
if STATION.TYPE(STATION) = DATA
if X(INDEX.BUFFER(STATION)) <= 0
reactivate the TRANSPUTER called SOURCE.TRANSPUTER(STATION) now 
always
add 1 to X(INDEX.BUFFER(STATION)) 
always 
end
routine TRANSMISSION giving IDENTITY 
define IDENTITY as an integer variable 
if IDENTITY = DATA 
TRANSMISSION.FLAG = 1
wait MEAN.TRANSMISSION.TIME .MICROSECONDS 
TRANSMISSION.FLAG = 0
'' activate an ACKNOWLEDGE giving randi.f(1,NUM.STATIONS,1) now
else ''the packet is an ack
wait FOR.ACK .MICROSECONDS 
always
subtract 1 from OFFERED.LOAD 
end
process TRANSPUTER given IDF.BUFFER
define IDF.BUFFER as integer variable
until NUM.PACKETS.SENT >= NUM.PACKETS.DESIRED
do
wait exponential.f(MEAN.PROCESSING.TIME,1) .MICROSECONDS
add 1 to NUM.PACKETS.SENT 
if NUM.PACKETS.SENT <= NUM.PACKETS.DESIRED 
create a STATION
let SOURCE.TRANSPUTER(STATION)=TRANSPUTER 
let INDEX.BUFFER(STATION) = IDF.BUFFER 
let STATION.TYPE(STATION) = DATA 
activate this STATION now 
suspend 
always 
loop 
end
appendix E.2
E2: 2D Network and its Communication Protocols
preamble
normally mode is integer 
processes
every TRANSPUTER has a XT, 
and a YT
every STATION has a SRC.TRANSPUTER, 
a XS, 
and a YS
every INTRA.PACKET has a SRC.STATION, 
a SRC.OGP, 
a CHANG.ORG, 
a ORG.IGP, 
a ADR.TRANS.NODE, 
a XP1, 
a YP1,
an INTRAP.TIME, 
an INTRAP.ORG, 
and an INIT.INTRAP.TIME 
every INTER.PACKET has a SRC.GATEWAY, 
a INIT.INTERP.TIME,
an INTERP.TIME,
a XP2,
a YP2,
a DEST.CHAN, 
and a SRC.IGP 
every INP.GATEWAY has a INIT.IG.TIME, 
a DEST.IG, 
a DEST.CHAN.IG, 
a XIG, 
and a YIG
every OUT.GATEWAY has a INIT.OG.TIME, 
a DEST.CHAN.OG, 
a SRC.CHAN, 
a DEST.OG, 
a SRC.OG, 
a XOG, 
and a YOG
define INIT.INTRAP.TIME, INIT.IG.TIME, INIT.OG.TIME, INTRAP.TIME, 
INIT.INTERP.TIME and INTERP.TIME as real variables 
resources include ST.RES, INTRA.CHAN, INTER.CHAN, IG.RES 
and OG.RES 
temporary entities
every INTRA.SUSP.PACKET has an IDF.STATION 
every INTER.SUSP.PACKET has an IDF.GATEWAY 
define MEAN.TRANSMISSION.TIME, MEAN.PROCESSING.TIME, FOR.ACK,
PROB,
FOR.TIME.OUT, DELAY.TIME, INTERP.RETRANSMISSION,
INTRAP.RETRANSMISSION and PROPAGATION.DELAY as real variables
define INC.NUM.STATIONS, MAX.STATIONS, MIN.STATIONS,
NUM.PACKETS.COMPLETED,
NUM.PACKETS.DESIRED, NUM.TRANSPUTERS.PER.STATION,
ST.BUFFER.LIMIT,
NUM.PACKETS.SENT, P, G, GT.BUFFER.LIMIT, COLLISION.LIMIT,
INIT.G, INIT.P, STEP.P, STEP.G, MAX.G, MAX.P, SERVICE.TYPE, EXTER.DATA,
SERVICE.TIME, SERVICE.ROUT and NUM.STATIONS as integer variables
define .MICROSECONDS to mean units 
define IS.WAITING.TRANSMISSION to mean 1 
define HAS.COMPLETED.TRANSMISSION to mean 0 
define SLOT.TIME to mean 2*PROPAGATION.DELAY 
define YES to mean 1 
define NO to mean 0 
define DATA to mean 1 
define ACK to mean 0 
define NACK to mean -1 
define NODE to mean 1 
define GATEWAY to mean 0 
define SEED1 and SEED2 as double variables
define INTER.DELAY.FLAG, INTER.TRANSMISSION.FLAG, INTER.ACK.FLAG,
INTRA.DELAY.FLAG, INTRA.TRANSMISSION.FLAG, INTRA.ACK.FLAG,
BUFFER.ST, BUFFER.IG, BUFFER.OG,
INTRA.PREVIOUS.TIME, INTER.PREVIOUS.TIME,
INTER.LAST.ADR and INTRA.LAST.ADR as integer 1-dimensional array
define INTER.DELAY, INTRA.DELAY, INTER.OFFERED.LOAD,
ACCUM.INTRA.OFFERED.LOAD,
ACCUM.INTER.OFFERED.LOAD and INTRA.OFFERED.LOAD
as real 1-dimensional array
define BIG.SUSP.PACKET and BOG.SUSP.PACKET as an integer 2-dimensional array 
define INTRA.RD.POINTER, INTRA.WT.POINTER, INTER.RD.POINTER 
and INTER.WT.POINTER as integer 1-dimensional array 
tally MEAN.DELAY.TIME as the mean of DELAY.TIME 
tally AVG.IG.BUFFER as the mean of N.Q.IG.RES 
tally MAX.IG.BUFFER as the maximum of N.Q.IG.RES 
tally AVG.OG.BUFFER as the mean of N.Q.OG.RES 
tally MAX.OG.BUFFER as the maximum of N.Q.OG.RES 
end ''preamble
main
call read.data
for NUM.STATIONS=MIN.STATIONS TO MAX.STATIONS BY INC.NUM.STATIONS
do
if EXTER.DATA = NO 
for G=INIT.G TO INIT.G BY STEP.G
do 
if G=1 
MAX.P = 1 
else
MAX.P = int.f(G/2)
always
for P=INIT.P TO INIT.P BY STEP.P '' max.p=1
do
if G=1 and P=1
PROB = 0
always
if int.f(G/P)=G/P and int.f(P*NUM.STATIONS/G)=P*NUM.STATIONS/G
''regular structure
call INIT.ARRAYS 
always 
loop 
loop 
else 
G=INIT.G 
P=INIT.P
if int.f(G/P)=G/P and int.f(P*NUM.STATIONS/G)=P*NUM.STATIONS/G
''regular structure
call INIT.ARRAYS 
always 
always 
loop 
end
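The "regular structure" test applied in the main routine above can be explored with a small Python sketch. Reading from the listing, G appears to be the number of gateways and P the number of inter-subnetwork channels, so that G/P is the number of intra-subnetwork channels and P*NUM.STATIONS/G the number of stations per subnetwork; this interpretation, the function name `regular_configs`, and the full sweep over G and P are assumptions made for illustration.

```python
def regular_configs(num_stations, max_g):
    """List (G, P) pairs that pass the regular-structure test of the listing.

    A pair is accepted when G is a multiple of P and P * num_stations is a
    multiple of G, i.e. both G/P and P*num_stations/G are whole numbers.
    As in the listing, P is limited to 1 when G = 1 and to G div 2 otherwise.
    """
    accepted = []
    for g in range(1, max_g + 1):
        max_p = 1 if g == 1 else g // 2
        for p in range(1, max_p + 1):
            if g % p == 0 and (p * num_stations) % g == 0:
                accepted.append((g, p))
    return accepted


if __name__ == "__main__":
    for g, p in regular_configs(num_stations=16, max_g=8):
        print(f"G={g} P={p} subnetworks={g // p} stations/subnet={p * 16 // g}")
```

Pairs that fail the test are simply skipped by the simulation, so a list of this kind shows which points of a parameter sweep actually produce a run.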
routine CREATE.INP.GATEWAY giving I, J 
create an INP.GATEWAY
let YIG(INP.GATEWAY) = YP1(INTRA.PACKET)
let INIT.IG.TIME(INP.GATEWAY) = INTRAP.TIME(INTRA.PACKET)
let DEST.IG(INP.GATEWAY) = J
let DEST.CHAN.IG(INP.GATEWAY) = I
let XIG(INP.GATEWAY) = XP1(INTRA.PACKET)
activate this INP.GATEWAY now
end
routine CREATE.OUT.GATEWAY giving S, Z, I, J
create an OUT.GATEWAY
let XOG(OUT.GATEWAY) = XP2(INTER.PACKET)
let YOG(OUT.GATEWAY) = YP2(INTER.PACKET)
let INIT.OG.TIME(OUT.GATEWAY) = INIT.INTERP.TIME(INTER.PACKET)
let DEST.OG(OUT.GATEWAY)=J
let DEST.CHAN.OG(OUT.GATEWAY)=I
let SRC.CHAN(OUT.GATEWAY)=Z
let SRC.OG(OUT.GATEWAY)=S
activate this OUT.GATEWAY now
end
routine INIT.ARRAYS
release BUFFER.ST(*),BUFFER.IG(*),BUFFER.OG(*),INTER.LAST.ADR(*),
INTRA.LAST.ADR(*), INTER.DELAY.FLAG(*),INTRA.DELAY.FLAG(*),
INTER.ACK.FLAG(*),INTRA.ACK.FLAG(*), INTER.TRANSMISSION.FLAG(*),
INTRA.TRANSMISSION.FLAG(*), INTER.DELAY(*), INTRA.DELAY(*),
INTER.OFFERED.LOAD(*),INTRA.OFFERED.LOAD(*),
ACCUM.INTRA.OFFERED.LOAD(*),ACCUM.INTER.OFFERED.LOAD(*),
INTRA.PREVIOUS.TIME(*), INTER.PREVIOUS.TIME(*),
BIG.SUSP.PACKET(*,*), BOG.SUSP.PACKET(*,*), INTRA.RD.POINTER(*),
INTRA.WT.POINTER(*),INTER.RD.POINTER(*),INTER.WT.POINTER(*)
always
reserve BUFFER.ST(*) as (G/P)*int.f(NUM.STATIONS*P/G)+l 
reserve BUFFER.IG(*) and BUFFER.OG(*) as G 
reserve BIG.SUSP.PACKET(*,*) as G by int.f(NUM.STATIONS*P/G)+l 
reserve BOG.SUSP.PACKET(*,*) as G by int.f(G/P)+l 
reserve INTRA.RD.POINTER(*),INTRA.WT.POINTER(*) as G 
reserve INTER.RD.POINTER(*),INTER.WT.POINTER(*) as G 
reserve INTER.LAST.ADR(*), INTER.DELAY.FLAG(*), INTER.ACK.FLAG(*),
and INTER.TRANSMISSION.FLAG(*) as P 
reserve INTRA.LAST.ADR(*), INTRA.DELAY.FLAG(*), INTRA.ACK.FLAG(*), 
and INTRA.TRANSMISSION.FLAG(*) as int.f(G/P)+l 
reserve INTER.DELAY(*) and INTER.OFFERED.LOAD(*) as P 
reserve INTRA.DELAY(*) and INTRA.OFFERED.LOAD(*) as int.f(G/P)+l 
reserve ACCUM.INTRA.OFFERED.LOAD(*) and INTRA.PREVIOUS.TIME(*) as 
int.f(G/P)+l
reserve ACCUM.INTER.OFFERED.LOAD(*) and INTER.PREVIOUS.TIME(*) as P 
destroy each inter.chan 
destroy each intra.chan 
destroy each st.res 
destroy each ig.res 
destroy each og.res 
let N.INTRA.CHAN = int.f(G/P) 
create every INTRA.CHAN 
for each INTRA.CHAN 
let U.INTRA.CHAN(INTRA.CHAN) = 1 
let N.INTER.CHAN = P
create every INTER.CHAN 
for each INTER.CHAN 
let U.INTER.CHAN(INTER.CHAN)= 1 
let N.ST.RES = NUM.STATIONS 
create every ST.RES 
for each ST.RES 
let U.ST.RES(ST.RES) =1 
let N.IG.RES = G 
create every IG.RES 
for each IG.RES 
let U.IG.RES(IG.RES) = 1 
let N.OG.RES = G 
create every OG.RES 
for each OG.RES 
let U.OG.RES(OG.RES) = 1
call initialize 
start simulation
end
routine initialize
define I,K and J as integer variables 
let time.v = 0
let NUM.PACKETS.COMPLETED = 0
let NUM.PACKETS.SENT = 0
reset totals of DELAY.TIME
for each IG.RES
reset totals of N.Q.IG.RES
for each OG.RES
reset totals of N.Q.OG.RES
let seed.v(1) = SEED1
let seed.v(2) = SEED2
for I=1 to G
do
BUFFER.IG(I) = GT.BUFFER.LIMIT 
BUFFER.OG(I) = GT.BUFFER.LIMIT 
loop
for I=1 to int.f(G/P)
do
for J=1 to int.f(P*NUM.STATIONS/G)
do
BUFFER.ST(int.f(P*NUM.STATIONS/G)*(I-1)+J) = ST.BUFFER.LIMIT
for K=1 to NUM.TRANSPUTERS.PER.STATION
do
activate a TRANSPUTER giving I and int.f(P*NUM.STATIONS/G)*(I-1)+J now
loop 
loop 
loop 
end
process INP.GATEWAY
define INIT.WAIT.TIME as real variable
INIT.WAIT.TIME = time.v
subtract 1 from BUFFER.IG(DEST.IG(INP.GATEWAY))
ACCUM.INTER.OFFERED.LOAD(DEST.CHAN.IG(INP.GATEWAY))=
ACCUM.INTER.OFFERED.LOAD(DEST.CHAN.IG(INP.GATEWAY))
+(time.v-INTER.PREVIOUS.TIME(DEST.CHAN.IG(INP.GATEWAY)))*
INTER.OFFERED.LOAD(DEST.CHAN.IG(INP.GATEWAY))
let INTER.PREVIOUS.TIME(DEST.CHAN.IG(INP.GATEWAY)) = time.v
add 1 to INTER.OFFERED.LOAD(DEST.CHAN.IG(INP.GATEWAY))
request 1 IG.RES(DEST.IG(INP.GATEWAY))
create an INTER.PACKET
let INTERP.TIME(INTER.PACKET) = INIT.WAIT.TIME
let SRC.GATEWAY(INTER.PACKET) = INP.GATEWAY
let INIT.INTERP.TIME(INTER.PACKET) = INIT.IG.TIME(INP.GATEWAY)
let XP2(INTER.PACKET) = XIG(INP.GATEWAY)
let YP2(INTER.PACKET) = YIG(INP.GATEWAY)
let DEST.CHAN(INTER.PACKET) = DEST.CHAN.IG(INP.GATEWAY)
let SRC.IGP(INTER.PACKET) = DEST.IG(INP.GATEWAY)
activate this INTER.PACKET now
suspend ''block the buffer until a successful transmission...
relinquish 1 IG.RES(DEST.IG(INP.GATEWAY)) 
add 1 to BUFFER.IG(DEST.IG(INP.GATEWAY)) 
end
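The gateway process above keeps a running time integral of the offered load on each channel (the ACCUM.*.OFFERED.LOAD and *.PREVIOUS.TIME arrays). The same bookkeeping for a single channel can be sketched in Python as follows. The class name `OfferedLoad` is hypothetical, and updating the integral on every change of the load, decrements included, is an assumption of this sketch; the listing updates it where the load is incremented.

```python
class OfferedLoad:
    """Time-weighted accumulator for the number of packets offered to a channel."""

    def __init__(self):
        self.load = 0             # current number of offered packets
        self.accum = 0.0          # integral of load over time so far
        self.previous_time = 0.0  # time of the last update

    def change(self, now, delta):
        """Close the interval since the last update, then apply the change."""
        self.accum += (now - self.previous_time) * self.load
        self.previous_time = now
        self.load += delta

    def time_average(self, now):
        """Average offered load over [0, now]."""
        if now <= 0.0:
            return 0.0
        return (self.accum + (now - self.previous_time) * self.load) / now


if __name__ == "__main__":
    channel = OfferedLoad()
    channel.change(1.0, +1)   # a packet arrives at t = 1
    channel.change(3.0, +1)   # a second packet arrives at t = 3
    channel.change(4.0, -1)   # one packet completes at t = 4
    print(channel.time_average(5.0))
```

Dividing the accumulated integral by the elapsed time gives the same kind of statistic that the `accumulate` statement produces for a single global variable in listing E1.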
process INTER.PACKET
define RESCHEDULE, CHANNEL.WAS.IN.DELAY.STATE, STATE,DEST1, 
COLLISION.COUNTER and CHANNEL.WAS.IDLE as integer variables 
define DELTA as real variable 
let COLLISION.COUNTER =0 
let STATE = IS.WAITING.TRANSMISSION 
let CHANNEL.WAS.IDLE = 1 
DEST1 = randi.f(1,int.f(G/P),1)
if DEST1=XP2(INTER.PACKET)
DEST1=mod.f(XP2(INTER.PACKET),int.f(G/P))+1
always
while STATE = IS.WAITING.TRANSMISSION
do
if INTER.LAST.ADR(DEST.CHAN(INTER.PACKET)) = SRC.IGP(INTER.PACKET)
wait 2*PROPAGATION.DELAY .MICROSECONDS 
else
if U.INTER.CHAN(DEST.CHAN(INTER.PACKET)) = 1 or CHANNEL.WAS.IDLE = 0 
or
INTER.ACK.FLAG(DEST.CHAN(INTER.PACKET))=1 or 
INTER.DELAY.FLAG(DEST.CHAN(INTER.PACKET))=1 
wait 4*PROPAGATION.DELAY .MICROSECONDS 
always
CHANNEL.WAS.IDLE = U.INTER.CHAN(DEST.CHAN(INTER.PACKET))
CHANNEL.WAS.IN.DELAY.STATE = INTER.DELAY.FLAG(DEST.CHAN(INTER.PACKET))
always
request 1 INTER.CHAN(DEST.CHAN(INTER.PACKET))
if CHANNEL.WAS.IDLE = YES
INTER.DELAY.FLAG(DEST.CHAN(INTER.PACKET)) = 1
wait uniform.f(0,PROPAGATION.DELAY,3) .MICROSECONDS 
INTER.DELAY.FLAG(DEST.CHAN(INTER.PACKET)) = 0 
if N.Q.INTER.CHAN(DEST.CHAN(INTER.PACKET)) = 0 
call INTER.TRANSMISSION giving DEST1 yielding STATE
else
wait 2*PROPAGATION.DELAY .MICROSECONDS
''the channel is wasted for that max time
relinquish 1 INTER.CHAN(DEST.CHAN(INTER.PACKET)) 
add 1 to COLLISION.COUNTER 
if COLLISION.COUNTER <= COLLISION.LIMIT 
let RESCHEDULE = randi.f(0,2**COLLISION.COUNTER,1)
DELTA = (RESCHEDULE*SLOT.TIME)
else
COLLISION.COUNTER = 0
DELTA = (2**COLLISION.LIMIT)*SLOT.TIME 
always
wait DELTA .MICROSECONDS 
always 
else
relinquish 1 INTER.CHAN(DEST.CHAN(INTER.PACKET)) 
if CHANNEL.WAS.IN.DELAY.STATE = 1 ''COLLISION
add 1 to COLLISION.COUNTER
if COLLISION.COUNTER <= COLLISION.LIMIT
let RESCHEDULE = randi.f(0,2**COLLISION.COUNTER,1)
DELTA = (RESCHEDULE*SLOT.TIME)
else
COLLISION.COUNTER = 0 
DELTA = (2**COLLISION.LIMIT)*SLOT.TIME 
always
wait DELTA .MICROSECONDS 
always 
always 
loop
reactivate the INP.GATEWAY called SRC.GATEWAY(INTER.PACKET) now
let INTER.DELAY(DEST.CHAN(INTER.PACKET)) = 
INTER.DELAY(DEST.CHAN(INTER.PACKET))+ 
time.v - INTERP.TIME(INTER.PACKET) - FOR.ACK
'' each station has only one buffer,so it is blocked until
'' a successful transmission is completed
end ''packet process
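Both collision branches of the INTER.PACKET process above apply the same rescheduling rule, which can be sketched in Python as below. The function name `backoff_delay` is hypothetical, and the sketch assumes that `randi.f(0, 2**k, 1)` draws an integer uniformly with both end points included.

```python
import random


def backoff_delay(collision_counter, collision_limit, slot_time, rng):
    """Return (new_counter, delay) after one more collision.

    While the counter stays within the limit, the delay is a random number
    of slots drawn from 0 .. 2**counter. Once the limit is exceeded, the
    counter is reset and the maximum delay 2**limit slots is used.
    """
    collision_counter += 1
    if collision_counter <= collision_limit:
        reschedule = rng.randint(0, 2 ** collision_counter)  # inclusive bounds
        delay = reschedule * slot_time
    else:
        collision_counter = 0
        delay = (2 ** collision_limit) * slot_time
    return collision_counter, delay


if __name__ == "__main__":
    rng = random.Random(1)
    counter = 0
    for _ in range(5):
        counter, delay = backoff_delay(counter, collision_limit=3,
                                       slot_time=2.0, rng=rng)
        print(counter, delay)
```

Note that the listing draws from 0 to 2**k slots inclusive, which differs slightly from the usual Ethernet range of 0 to 2**k - 1; the sketch keeps the range used in the listing.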
routine INTER.TRANSMISSION given DEST1 yielding I 
INTER.LAST.ADR(DEST.CHAN(INTER.PACKET)) = SRC.IGP(INTER.PACKET) 
DEST2 = P*(DEST1-1) + DEST.CHAN(INTER.PACKET)
if BUFFER.OG(DEST2) <= 0
wait FOR.ACK .MICROSECONDS ''6 bytes to detect collision
relinquish 1 INTER.CHAN(DEST.CHAN(INTER.PACKET)) 
I=IS.WAITING.TRANSMISSION 
add 1 to INTERP.RETRANSMISSION
create an INTER.SUSP.PACKET called BOG.SUSP.PACKET(DEST2,
INTER.WT.POINTER(DEST2)+1)
IDF.GATEWAY(BOG.SUSP.PACKET(DEST2,INTER.WT.POINTER(DEST2)+1))=
process.v
INTER.WT.POINTER(DEST2)=mod.f(INTER.WT.POINTER(DEST2)+1,int.f(G/P)+1)
SUSPEND
else
add 1 to INTER.TRANSMISSION.FLAG(DEST.CHAN(INTER.PACKET)) 
wait (MEAN.TRANSMISSION.TIME-FOR.ACK) .MICROSECONDS 
INTER.ACK.FLAG(DEST.CHAN(INTER.PACKET)) = 1 
wait uniform.f(0,PROPAGATION.DELAY,3) .MICROSECONDS 
INTER.ACK.FLAG(DEST.CHAN(INTER.PACKET)) = 0
wait FOR.ACK .MICROSECONDS
subtract 1 from INTER.OFFERED.LOAD(DEST.CHAN(INTER.PACKET))
call CREATE.OUT.GATEWAY giving SRC.IGP(INTER.PACKET), 
DEST.CHAN(INTER.PACKET),DEST1, DEST2 
relinquish 1 INTER.CHAN(DEST.CHAN(INTER.PACKET)) 
wait FOR.ACK .MICROSECONDS
if INTRA.WT.POINTER(SRC.IGP(INTER.PACKET))-INTRA.RD.POINTER(
SRC.IGP(INTER.PACKET))<>0
'' some packets are waiting for buffer
reactivate the INTRA.PACKET called IDF.STATION(BIG.SUSP.PACKET(
SRC.IGP(INTER.PACKET),INTRA.RD.POINTER(SRC.IGP(INTER.PACKET))+1))
now
destroy the INTRA.SUSP.PACKET called BIG.SUSP.PACKET(
SRC.IGP(INTER.PACKET),INTRA.RD.POINTER(SRC.IGP(INTER.PACKET))+1)
INTRA.RD.POINTER(SRC.IGP(INTER.PACKET))=mod.f(INTRA.RD.POINTER(
SRC.IGP(INTER.PACKET))+1,int.f(NUM.STATIONS*P/G)+1)
always
if N.Q.IG.RES(SRC.IGP(INTER.PACKET)) =0
'' there is no packets in the buffer
INTER.LAST.ADR(DEST.CHAN(INTER.PACKET)) = -1
always
I = HAS.COMPLETED.TRANSMISSION
always 
end
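The routine above parks a packet process that finds the destination gateway buffer full, and later wakes the oldest parked process, using a pair of write and read pointers that wrap around modulo the array size. A minimal Python sketch of that ring is given below. The class name `SuspendedRing` is hypothetical, slots are numbered from 0 rather than from 1 as in the listing, and, like the listing, the sketch does not guard against overflow.

```python
class SuspendedRing:
    """Fixed-size ring of suspended processes addressed by two pointers."""

    def __init__(self, size):
        self.slots = [None] * size
        self.wt = 0   # write pointer: next free slot
        self.rd = 0   # read pointer: oldest parked process

    def is_empty(self):
        # The listing tests WT.POINTER - RD.POINTER <> 0 for "some waiting".
        return self.wt == self.rd

    def park(self, process):
        """Remember a process that must wait for buffer space."""
        self.slots[self.wt] = process
        self.wt = (self.wt + 1) % len(self.slots)

    def wake(self):
        """Return the oldest parked process, or None if nobody is waiting."""
        if self.is_empty():
            return None
        process = self.slots[self.rd]
        self.slots[self.rd] = None
        self.rd = (self.rd + 1) % len(self.slots)
        return process


if __name__ == "__main__":
    ring = SuspendedRing(size=3)
    ring.park("packet A")
    ring.park("packet B")
    print(ring.wake(), ring.wake(), ring.wake())
```

Because an empty ring is recognised by equal pointers, at most size - 1 processes can be parked at once, which is presumably why the arrays in the listing are reserved with one extra element.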
process INTRA.PACKET
define RESCHEDULE, CHANNEL.WAS.IN.DELAY.STATE, STATE, LOCAL.PACKET, MULT,
COLLISION.COUNTER, DEST1 and CHANNEL.WAS.IDLE as integer variables
define DELTA as real variable
let COLLISION.COUNTER =0
'arriving.packet' let STATE = IS.WAITING.TRANSMISSION
let CHANNEL.WAS.IDLE = 1
DEST1 = randi.f(1,P,1)
while STATE = IS.WAITING.TRANSMISSION 
do
if INTRAP.ORG(INTRA.PACKET) = GATEWAY and INTRA.LAST.ADR(XP1(INTRA.PACKET))
= SRC.OGP(INTRA.PACKET)
wait 2*PROPAGATION.DELAY .MICROSECONDS
else
if U.INTRA.CHAN(XP1(INTRA.PACKET)) = 1 or CHANNEL.WAS.IDLE = 0 or
INTRA.ACK.FLAG(XP1(INTRA.PACKET)) = 1 or
INTRA.DELAY.FLAG(XP1(INTRA.PACKET)) = 1
wait 4*PROPAGATION.DELAY .MICROSECONDS
always
CHANNEL.WAS.IDLE = U.INTRA.CHAN(XP1(INTRA.PACKET))
CHANNEL.WAS.IN.DELAY.STATE = INTRA.DELAY.FLAG(XP1(INTRA.PACKET))
always
request 1 INTRA.CHAN(XP1(INTRA.PACKET))
if CHANNEL.WAS.IDLE = YES
INTRA.DELAY.FLAG(XP1(INTRA.PACKET)) = 1
wait uniform.f(0,PROPAGATION.DELAY,3) .MICROSECONDS
INTRA.DELAY.FLAG(XP1(INTRA.PACKET)) = 0
if N.Q.INTRA.CHAN(XP1(INTRA.PACKET)) = 0
call INTRA.TRANSMISSION given DEST1 yielding 
LOCAL.PACKET and STATE
else
wait 2*PROPAGATION.DELAY .MICROSECONDS
''the channel is wasted for that max time
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
add 1 to COLLISION.COUNTER
if COLLISION.COUNTER <= COLLISION.LIMIT
let RESCHEDULE = randi.f(0,2**COLLISION.COUNTER,1)
DELTA = (RESCHEDULE*SLOT.TIME)
else
COLLISION.COUNTER = 0 
DELTA = (2**COLLISION.LIMIT)*SLOT.TIME 
always
wait DELTA .MICROSECONDS 
go to arriving.packet
always
else
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
if CHANNEL.WAS.IN.DELAY.STATE = 1 ''COLLISION
add 1 to COLLISION.COUNTER
if COLLISION.COUNTER <= COLLISION.LIMIT
let RESCHEDULE = randi.f(0,2**COLLISION.COUNTER,1)
DELTA = (RESCHEDULE*SLOT.TIME)
else
COLLISION.COUNTER = 0 
DELTA = (2**COLLISION.LIMIT)*SLOT.TIME 
always
wait DELTA .MICROSECONDS 
go to arriving.packet 
always 
always 
loop
if INTRAP.ORG(INTRA.PACKET) = NODE 
reactivate the STATION called SRC.STATION(INTRA.PACKET) now 
else
reactivate the OUT.GATEWAY called SRC.STATION(INTRA.PACKET) now 
always
if (LOCAL.PACKET = YES or INTRAP.ORG(INTRA.PACKET) = GATEWAY) 
if LOCAL.PACKET = YES 
MULT=1 
else 
MULT = 3 
always
let DELAY.TIME = time.v - INIT.INTRAP.TIME(INTRA.PACKET)- 
MULT*FOR.ACK 
add 1 to NUM.PACKETS.COMPLETED
if NUM.PACKETS.COMPLETED = NUM.PACKETS.DESIRED 
call report 
always 
always
'' each station has only one buffer,so it is blocked until
'' a successful transmission is completed
end ''packet process
routine intra.packet.sender giving U
define U as a real variable
''U is the detection of the collision due to the buffer full
'' it is 0 in case of local or buffer empty
add 1 to INTRA.TRANSMISSION.FLAG(XP1(INTRA.PACKET))
wait (MEAN.TRANSMISSION.TIME-U) .MICROSECONDS 
INTRA.ACK.FLAG(XP1(INTRA.PACKET))=1 
wait uniform.f(0,PROPAGATION.DELAY,3) .MICROSECONDS 
INTRA.ACK.FLAG(XP1(INTRA.PACKET))=0 
wait FOR.ACK .MICROSECONDS 
end
- hop-by-hop transmission process
routine INTRA.TRANSMISSION giving DEST1 yielding J and I
define DEST1, DEST2, I and J as integer variables
let J = NO
if INTRAP.ORG(INTRA.PACKET) = NODE ''packet comes from station
if random.f(1) < PROB ''probability of non-local transmission
DEST2 = P*(XP1(INTRA.PACKET)-1) + DEST1
if BUFFER.IG(DEST2) <=0
wait FOR.ACK .MICROSECONDS ''6 bytes to detect collision
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET)) 
I=IS.WAITING.TRANSMISSION 
add 1 to INTRAP.RETRANSMISSION 
create an INTRA.SUSP.PACKET called BIG.SUSP.PACKET(DEST2,
INTRA.WT.POINTER(DEST2)+1)
IDF.STATION(BIG.SUSP.PACKET(DEST2,INTRA.WT.POINTER(DEST2)+1))=
process.v
INTRA.WT.POINTER(DEST2)=mod.f(INTRA.WT.POINTER(DEST2)+1,
int.f(NUM.STATIONS*P/G)+1)
SUSPEND
else
call intra.packet.sender giving 0.0
call CREATE.INP.GATEWAY giving DEST1 and DEST2 
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
INTRA.DELAY(XP1(INTRA.PACKET)) = INTRA.DELAY(XP1(INTRA.PACKET))+
time.v-INTRAP.TIME(INTRA.PACKET) -FOR.ACK
subtract 1 from INTRA.OFFERED.LOAD(XP1(INTRA.PACKET))
I=HAS.COMPLETED.TRANSMISSION 
always 
else '' packet is local
call intra.packet.sender giving 0.0 '' no collision due to buffer full
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
INTRA.DELAY(XP1(INTRA.PACKET)) = INTRA.DELAY(XP1(INTRA.PACKET))+
time.v-INTRAP.TIME(INTRA.PACKET) -FOR.ACK
subtract 1 from INTRA.OFFERED.LOAD(XP1(INTRA.PACKET))
I=HAS.COMPLETED.TRANSMISSION 
let J=YES 
always
else '' packet comes from a gateway
call intra.packet.sender giving 0.0
INTRA.LAST.ADR(XP1(INTRA.PACKET)) = SRC.OGP(INTRA.PACKET)
subtract 1 from INTRA.OFFERED.LOAD(XP1(INTRA.PACKET))
INTRA.DELAY(XP1(INTRA.PACKET)) = INTRA.DELAY(XP1(INTRA.PACKET))+
time.v - INTRAP.TIME(INTRA.PACKET)-FOR.ACK
wait FOR.ACK .MICROSECONDS
if INTER.WT.POINTER(SRC.OGP(INTRA.PACKET))-INTER.RD.POINTER(
SRC.OGP(INTRA.PACKET))<>0
'' some packets are waiting for buffer
reactivate the INTER.PACKET called IDF.GATEWAY(BOG.SUSP.PACKET(
SRC.OGP(INTRA.PACKET),INTER.RD.POINTER(SRC.OGP(INTRA.PACKET))+1)) now
destroy the INTER.SUSP.PACKET called BOG.SUSP.PACKET(
SRC.OGP(INTRA.PACKET),INTER.RD.POINTER(SRC.OGP(INTRA.PACKET))+1)
INTER.RD.POINTER(SRC.OGP(INTRA.PACKET))=mod.f(INTER.RD.POINTER(
SRC.OGP(INTRA.PACKET))+1,int.f(G/P)+1)
always
if N.Q.OG.RES(SRC.OGP(INTRA.PACKET)) =0
'' there is no packets in the buffer
INTRA.LAST.ADR(XP1(INTRA.PACKET)) = -1
always
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
I = HAS.COMPLETED.TRANSMISSION 
always 
end
process OUT.GATEWAY
define INIT.WAIT.TIME as real variable 
INIT.WAIT.TIME = time.v
subtract 1 from BUFFER.OG(DEST.OG(OUT.GATEWAY))
ACCUM.INTRA.OFFERED.LOAD(DEST.CHAN.OG(OUT.GATEWAY))=
ACCUM.INTRA.OFFERED.LOAD(DEST.CHAN.OG(OUT.GATEWAY))
+(time.v-INTRA.PREVIOUS.TIME(DEST.CHAN.OG(OUT.GATEWAY)))*
INTRA.OFFERED.LOAD(DEST.CHAN.OG(OUT.GATEWAY))
let INTRA.PREVIOUS.TIME(DEST.CHAN.OG(OUT.GATEWAY)) = time.v
add 1 to INTRA.OFFERED.LOAD(DEST.CHAN.OG(OUT.GATEWAY))
request 1 OG.RES(DEST.OG(OUT.GATEWAY))
create an INTRA.PACKET
let INTRAP.TIME(INTRA.PACKET) = INIT.WAIT.TIME
let INIT.INTRAP.TIME(INTRA.PACKET) = INIT.OG.TIME(OUT.GATEWAY)
let SRC.STATION(INTRA.PACKET) = OUT.GATEWAY
let INTRAP.ORG(INTRA.PACKET) = GATEWAY
let SRC.OGP(INTRA.PACKET) = DEST.OG(OUT.GATEWAY)
let XP1(INTRA.PACKET) = DEST.CHAN.OG(OUT.GATEWAY)
let YP1(INTRA.PACKET) = YOG(OUT.GATEWAY)
let CHANG.ORG(INTRA.PACKET)=SRC.CHAN(OUT.GATEWAY)
let ORG.IGP(INTRA.PACKET)=SRC.OG(OUT.GATEWAY)
activate this INTRA.PACKET now
suspend ''block the buffer until a successful transmission...
relinquish 1 OG.RES(DEST.OG(OUT.GATEWAY))
add 1 to BUFFER.OG(DEST.OG(OUT.GATEWAY))
end
process STATION
define INIT.WAIT.TIME as real variable
INIT.WAIT.TIME = time.v
ACCUM.INTRA.OFFERED.LOAD(XS(STATION))=ACCUM.INTRA.OFFERED.LOAD(XS(STATION))
+(time.v-INTRA.PREVIOUS.TIME(XS(STATION)))*INTRA.OFFERED.LOAD(XS(STATION))
let INTRA.PREVIOUS.TIME(XS(STATION)) = time.v
subtract 1 from BUFFER.ST(YS(STATION))
if BUFFER.ST(YS(STATION)) > 0
reactivate the TRANSPUTER called SRC.TRANSPUTER(STATION) now 
always
add 1 to INTRA.OFFERED.LOAD(XS(STATION)) 
request 1 ST.RES(YS(STATION)) 
create an INTRA.PACKET
let INTRAP.TIME(INTRA.PACKET) = INIT.WAIT.TIME 
let INIT.INTRAP.TIME(INTRA.PACKET) = INIT.WAIT.TIME 
let SRC.STATION(INTRA.PACKET) = STATION 
let XP1(INTRA.PACKET) = XS(STATION)
let YP1(INTRA.PACKET) = YS(STATION)
let INTRAP.ORG(INTRA.PACKET) = NODE
activate this INTRA.PACKET now
suspend ''block the buffer until a successful transmission...
relinquish 1 ST.RES(YS(STATION))
if BUFFER.ST(YS(STATION)) <= 0 
reactivate the TRANSPUTER called SRC.TRANSPUTER(STATION) now 
always
add 1 to BUFFER.ST(YS(STATION)) 
end
process TRANSPUTER given X,Y
until NUM.PACKETS.SENT >= NUM.PACKETS.DESIRED 
do
wait exponential.f(MEAN.PROCESSING.TIME,1) .MICROSECONDS
add 1 to NUM.PACKETS.SENT
if NUM.PACKETS.SENT <= NUM.PACKETS.DESIRED
create a STATION
let SRC.TRANSPUTER(STATION)=TRANSPUTER 
let XS(STATION) = X 
let YS(STATION) = Y 
activate this STATION now 
suspend 
always 
loop 
end
process STATION
define INIT.WAIT.TIME as real variable
INIT.WAIT.TIME = time.v
ACCUM.INTRA.OFFERED.LOAD(XS(STATION))=ACCUM.INTRA.OFFERED.LOAD(XS(STATION))
+(time.v-INTRA.PREVIOUS.TIME(XS(STATION)))*INTRA.OFFERED.LOAD(XS(STATION))
let INTRA.PREVIOUS.TIME(XS(STATION)) = time.v
if STP.TYPE(STATION) = DATA
subtract 1 from BUFFER.ST(YS(STATION))
if BUFFER.ST(YS(STATION)) > 0
reactivate the TRANSPUTER called SRC.TRANSPUTER(STATION) now 
always
add 1 to INTRA.OFFERED.LOAD(XS(STATION)) 
always
request 1 ST.RES(YS(STATION)) 
create an INTRA.PACKET
let INTRAP.TIME(INTRA.PACKET) = INIT.WAIT.TIME
let INIT.INTRAP.TIME(INTRA.PACKET) = INIT.WAIT.TIME
let SRC.STATION(INTRA.PACKET) = STATION
let XP1(INTRA.PACKET) = XS(STATION)
let YP1(INTRA.PACKET) = YS(STATION)
let INTRAP.TYPE(INTRA.PACKET) = STP.TYPE(STATION)
let INTRAP.ORG(INTRA.PACKET) = NODE
activate this INTRA.PACKET now
suspend ''block the buffer until a successful transmission...
relinquish 1 ST.RES(YS(STATION))
if STP.TYPE(STATION) = DATA
if BUFFER.ST(YS(STATION)) <= 0 
reactivate the TRANSPUTER called SRC.TRANSPUTER(STATION) now 
always
add 1 to BUFFER.ST(YS(STATION)) 
always 
end
process TRANSPUTER given X,Y
until NUM.PACKETS.SENT >= NUM.PACKETS.DESIRED 
do
wait exponential.f(MEAN.PROCESSING.TIME,1) .MICROSECONDS
add 1 to NUM.PACKETS.SENT
if NUM.PACKETS.SENT <= NUM.PACKETS.DESIRED
create a STATION
let SRC.TRANSPUTER(STATION)=TRANSPUTER
let STP.TYPE(STATION) = DATA
let XS(STATION) = X
let YS(STATION) = Y
activate this STATION now 
suspend 
always 
loop 
end
- end-to-end transmission process
routine INTRA.TRANSMISSION yielding J and I
define DEST1, DEST2, I and J as integer variables
let J = NO
DEST1 = randi.f(1,P,1)
if INTRAP.ORG(INTRA.PACKET) = NODE ''packet comes from station
add 1 to INTRA.TRANSMISSION.FLAG(XP1(INTRA.PACKET))
wait MEAN.TRANSMISSION.TIME .MICROSECONDS 
INTRA.ACK.FLAG(XP1(INTRA.PACKET)) = 1
wait uniform.f(0,PROPAGATION.DELAY,3) .MICROSECONDS
INTRA.ACK.FLAG(XP1(INTRA.PACKET))=0
wait FOR.ACK .MICROSECONDS
if random.f(1) < PROB ''probability of non-local transmission
DEST2 = P*(XP1(INTRA.PACKET)-1) + DEST1
if BUFFER.IG(DEST2) <= 0 '' destination buffer full
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
wait FOR.TIME.OUT/P .MICROSECONDS
'' for large p there is many possibilities to send the packet
'' thus wait small time.
I = IS.WAITING.TRANSMISSION
else '' there is space in the buffer
call CREATE.INP.GATEWAY giving DEST1 and DEST2
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
INTRA.DELAY(XP1(INTRA.PACKET)) = INTRA.DELAY(XP1(INTRA.PACKET))+
time.v-INTRAP.TIME(INTRA.PACKET) - FOR.ACK
subtract 1 from INTRA.OFFERED.LOAD(XP1(INTRA.PACKET))
create a TIME.OUT called WAIT.FOR.ACK(YP1(INTRA.PACKET))
let IDF.TIME.OUT(WAIT.FOR.ACK(YP1(INTRA.PACKET))) = process.v
suspend
if INDICATION.PACKET(YP1(INTRA.PACKET)) = NACK
wait FOR.TIME.OUT/P .MICROSECONDS
I = IS.WAITING.TRANSMISSION
subtract 1 from INTRA.TRANSMISSION.FLAG(XP1(INTRA.PACKET))
'' because a retransmission is required
else
I=HAS.COMPLETED.TRANSMISSION
always
always 
else ''local packet
subtract 1 from INTRA.OFFERED.LOAD(XP1(INTRA.PACKET))
INTRA.DELAY(XP1(INTRA.PACKET)) = INTRA.DELAY(XP1(INTRA.PACKET))
+time.v -INTRAP.TIME(INTRA.PACKET)-FOR.ACK
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
I = HAS.COMPLETED.TRANSMISSION 
J=YES 
always
else '' packet comes from a gateway
INTRA.LAST.ADR(XP1(INTRA.PACKET)) = SRC.OGP(INTRA.PACKET)
DEST2 = P*(ADR.TRANS.NODE(INTRA.PACKET)-1) + DEST1
if INTRAP.TYPE(INTRA.PACKET) = DATA
add 1 to INTRA.TRANSMISSION.FLAG(XP1(INTRA.PACKET))
wait MEAN.TRANSMISSION.TIME .MICROSECONDS
INTRA.ACK.FLAG(XP1(INTRA.PACKET)) = 1
wait uniform.f(0,PROPAGATION.DELAY,3) .MICROSECONDS
INTRA.ACK.FLAG(XP1(INTRA.PACKET))=0
wait FOR.ACK .MICROSECONDS
'' get the subnet where the ack has to go
call CREATE.INP.GATEWAY giving DEST1 and DEST2
subtract 1 from INTRA.OFFERED.LOAD(XP1(INTRA.PACKET))
INTRA.DELAY(XP1(INTRA.PACKET)) = INTRA.DELAY(XP1(INTRA.PACKET))
+time.v - INTRAP.TIME(INTRA.PACKET)-FOR.ACK
else '' it is pack or nack
INTRA.ACK.FLAG(XP1(INTRA.PACKET)) = 1
wait uniform.f(0,PROPAGATION.DELAY,3) .MICROSECONDS
INTRA.ACK.FLAG(XP1(INTRA.PACKET))=0
wait FOR.ACK .MICROSECONDS
reactivate the INTRA.PACKET called
IDF.TIME.OUT(WAIT.FOR.ACK(YP1(INTRA.PACKET))) now
destroy the TIME.OUT called WAIT.FOR.ACK(YP1(INTRA.PACKET))
INDICATION.PACKET(YP1(INTRA.PACKET)) = INTRAP.TYPE(INTRA.PACKET)
always
if N.Q.OG.RES(SRC.OGP(INTRA.PACKET)) =0
'' there is no packets in the buffer
INTRA.LAST.ADR(XP1(INTRA.PACKET)) = -1
always
relinquish 1 INTRA.CHAN(XP1(INTRA.PACKET))
I = HAS.COMPLETED.TRANSMISSION 
always 
end
appendix E.3
E3: Processor Farming in the 2D Network
preamble
normally,mode is integer 
processes
every MASTER.TRANSPUTER has a PROCESSOR.MASTER, 
and a STATION.MASTER 
every T.STATION has an IDF.TYPE.T, 
a DEST.TRANSPUTER, 
and a DEST.STATION 
every R.STATION has a R.RESP.TIME,
an IDF.TYPE.R, 
a SRC.TRANSPUTER, 
and a SRC.STATION 
every SLAVE.TRANSPUTER has a PROCESSOR.SLAVE,
and a STATION.SLAVE 
every PACKET has a SOURCE.STATION, 
an IDF.TYPE.P, 
an ADDR.TRANSPUTER, 
an ADDR.STATION, 
a RESP.TIME, 
a COLLISION.COUNTER, 
a START.TIME, 
and a STATE
define SRC.TRANSPUTER, SRC.STATION, DEST.TRANSPUTER,IDF.TYPE.T, 
DEST.STATION, SOURCE.STATION, ADDR.TRANSPUTER,IDF.TYPE.R, 
ADDR.STATION, PROCESSOR.SLAVE, STATION.SLAVE, IDF.TYPE.P, 
PROCESSOR.MASTER, STATION.MASTER,
COLLISION.COUNTER, and STATE as integer variables 
define START.TIME,R.RESP.TIME, DETECT.TIME, 
and RESP.TIME as real variables 
resources include GATE, LINK, MUX,
T.BUFFER, R.BUFFER,MASTER.R.BUFFER, 
and CHANNEL  
temporary entities
every IDLE has an IDF.PROCESSOR 
define IDF.PROCESSOR as an integer variable
define MEAN.TRANSMISSION.TIME, MEAN.PROCESSING.TIME, FOR.ACK, 
DELAY.TIME, WAITING.TIME, RESPONSE.TIME, IDLE.TIME,
MASTER.WAITING.TIME, and PROPAGATION.DELAY as real variables 
define BWT,T.SETUP,CHAN.RATIO, R.PROTOCOL.TIME 
and T.PROTOCOL.TIME as real variables 
define INC.NUM.STATIONS, MAX.STATIONS, MIN.STATIONS, NUM.STATIONS, 
NUM.PACKETS.COMPLETED, NUM.PACKETS.DESIRED, NUM.PACKETS.SENT,
BACKLOGGED.STATIONS, NUM.TRANSPUTERS.PER.STATION,
H, NUM.PACKETS.RECEIVED,CHANNEL.ACQUIRE,COLLISION.LIMIT, 
INIT.LOAD.TASK and LIMIT.JOB.PROCESS as integer variables 
define SPEED.UP.FACTOR and TO.PROC as integer variables 
define .MICROSECONDS to mean units 
define IS.WAITING.TRANSMISSION to mean 1 
define HAS.COMPLETED.TRANSMISSION to mean 0
define YES to mean 1 
define MASTER to mean 1 
define SLAVE to mean 0 
define INITIATOR to mean 2
define LINK.TIME to mean MEAN.TRANSMISSION.TIME*CHAN.RATIO
define R.IMP.TIME to mean R.PROTOCOL.TIME+ LINK.TIME
define T.IMP.TIME to mean T.PROTOCOL.TIME+ LINK.TIME
define CHAN.TIME to mean MEAN.TRANSMISSION.TIME+FOR.ACK+PROPAGATION.DELAY
define SLOT.TIME to mean 2*PROPAGATION.DELAY 
define SEED1 and SEED2 as double variables
define ACK.FLAG, DELAY.FLAG, TRANSMISSION.FLAG,INIT.TYPE, 
and TRANSMISSION.STATE as integer variables 
define TIME.FIRST.ARRIVAL as a real variable 
define JOB.ARRAY and SUPD.ARRAY as 1-dimensional arrays 
tally MEAN.DELAY.TIME as the mean of DELAY.TIME 
tally MEAN.WAITING.TIME as the mean of WAITING.TIME 
tally MEAN.RESPONSE.TIME as the mean of RESPONSE.TIME 
tally MEAN.IDLE.TIME as the mean of IDLE.TIME
accumulate MAX.PACKET.BUFFERED as the maximum of N.Q.MASTER.R.BUFFER 
accumulate AVG.BACKLOGGED.STATIONS as the average of BACKLOGGED.STATIONS
accumulate UTIL.CHANNEL as the average of TRANSMISSION.FLAG 
end ''preamble
main
call read.data
for NUM.STATIONS = MIN.STATIONS to MAX.STATIONS by INC.NUM.STATIONS
do
call initialize 
start simulation 
loop 
end
routine initialize
define I,J,K and P as integer variables 
let time.v = 0
let NUM.PACKETS.COMPLETED = 0 
let NUM.PACKETS.SENT = 0 
let NUM.PACKETS.RECEIVED = 0 
let DELAY.FLAG = 0 
let ACK.FLAG = 0
let P = NUM.TRANSPUTERS.PER.STATION
let H  = NUM.PACKETS.DESIRED - LIMIT.JOB.PROCESS*P*NUM.STATIONS 
let H  = H - INIT.LOAD.TASK*P*NUM.STATIONS 
let TO.PROC = P*NUM.STATIONS
reset totals of TRANSMISSION.FLAG, BACKLOGGED.STATIONS, DELAY.TIME,
WAITING.TIME, IDLE.TIME, N.Q.MASTER.R.BUFFER(1) and RESPONSE.TIME
let seed.v(1) = SEED1
let seed.v(2) = SEED2
reserve JOB.ARRAY(*) and SUPD.ARRAY(*) as 4*MAX.STATIONS+1
for I = 2 to NUM.STATIONS+1
do
for J = 2 to P+1
do
JOB.ARRAY(P*(I-2)+J)=INIT.LOAD.TASK 
activate a SLAVE.TRANSPUTER giving P*(I-2)+J and I now 
loop 
loop
select case INIT.TYPE
case 1,2
for I = 2 to NUM.STATIONS+1
do
for J = 2 to P+1
do
for K = 1 to LIMIT.JOB.PROCESS
do
create a T.STATION
IDF.TYPE.T(T.STATION) = INITIATOR
DEST.TRANSPUTER(T.STATION) = P*(I-2)+J
DEST.STATION(T.STATION) = I
activate this T.STATION now
loop
loop
loop
case 3,4
for K = 1 to LIMIT.JOB.PROCESS
do
for I = 2 to NUM.STATIONS+1
do
for J = 2 to P+1
do
create a T.STATION
IDF.TYPE.T(T.STATION) = INITIATOR
DEST.TRANSPUTER(T.STATION) = P*(I-2)+J
DEST.STATION(T.STATION) = I
activate this T.STATION now
loop
loop
loop
endselect
end
process master.transputer 
if H > 0
wait BWT .MICROSECONDS
create a T.STATION
IDF.TYPE.T(T.STATION) = MASTER
DEST.TRANSPUTER(T.STATION) = PROCESSOR.MASTER(MASTER.TRANSPUTER)
DEST.STATION(T.STATION) = STATION.MASTER(MASTER.TRANSPUTER)
activate this T.STATION now
always
subtract 1 from H
add 1 to NUM.PACKETS.RECEIVED
if NUM.PACKETS.RECEIVED = NUM.PACKETS.DESIRED
call single.processor
call REPORT 1 
always 
end
process packet
define RESCHEDULE, QUEUE.WAS, CHANNEL.WAS.IN.DELAY.STATE,
and CHANNEL.WAS.IDLE as integer variables
define DELTA as a real variable
let START.TIME(PACKET) = time.v 
let COLLISION.COUNTER(PACKET) =0 
let STATE(PACKET) = IS.WAITING.TRANSMISSION 
let CHANNEL.WAS.IDLE = 1 
add 1 to OFFERED.LOAD
while STATE(PACKET) = IS.WAITING.TRANSMISSION 
do
if CHANNEL.WAS.IDLE = 0 or U.CHANNEL(1) = 1 or ACK.FLAG = 1 or
DELAY.FLAG = 1
wait 2*PROPAGATION.DELAY .MICROSECONDS
always
CHANNEL.WAS.IDLE = U.CHANNEL(1)
QUEUE.WAS = N.Q.CHANNEL(1)
CHANNEL.WAS.IN.DELAY.STATE = DELAY.FLAG
request 1 CHANNEL(1)
if CHANNEL.WAS.IDLE = 1
DELAY.FLAG = 1
wait PROPAGATION.DELAY .MICROSECONDS
DELAY.FLAG = 0
if N.Q.CHANNEL(1) = 0 and QUEUE.WAS = 0
create a R.STATION
IDF.TYPE.R(R.STATION) = IDF.TYPE.P(PACKET)
SRC.TRANSPUTER(R.STATION) = ADDR.TRANSPUTER(PACKET)
SRC.STATION(R.STATION) = ADDR.STATION(PACKET)
R.RESP.TIME(R.STATION) = RESP.TIME(PACKET)
activate this R.STATION now
TRANSMISSION.FLAG = 1
wait MEAN.TRANSMISSION.TIME .MICROSECONDS
TRANSMISSION.FLAG = 0
ACK.FLAG = 1
wait PROPAGATION.DELAY .MICROSECONDS
ACK.FLAG = 0
wait FOR.ACK .MICROSECONDS
subtract 1 from OFFERED.LOAD
relinquish 1 CHANNEL(1)
reactivate the T.STATION called SOURCE.STATION(PACKET) now
STATE(PACKET) = HAS.COMPLETED.TRANSMISSION
else
wait 2*PROPAGATION.DELAY .MICROSECONDS
'' the channel is wasted for that max time
relinquish 1 CHANNEL(1)
add 1 to COLLISION.COUNTER(PACKET)
CHANNEL.WAS.IDLE = YES
if COLLISION.COUNTER(PACKET) <= COLLISION.LIMIT
let RESCHEDULE = randi.f(0,2**COLLISION.COUNTER(PACKET),1)
DELTA = (RESCHEDULE*SLOT.TIME)
else
COLLISION.COUNTER(PACKET) = 0 
DELTA = (2**COLLISION.LIMIT)*SLOT.TIME 
always
wait DELTA .MICROSECONDS 
always 
else
relinquish 1 CHANNEL(1)
if CHANNEL.WAS.IN.DELAY.STATE = 1 '' COLLISION
add 1 to COLLISION.COUNTER(PACKET)
CHANNEL.WAS.IDLE = YES
if COLLISION.COUNTER(PACKET) <= COLLISION.LIMIT
let RESCHEDULE = randi.f(0,2**COLLISION.COUNTER(PACKET),1)
DELTA= (RESCHEDULE *SLOT.TIME) 
else
COLLISION.COUNTER(PACKET) = 0 
DELTA = (2**COLLISION.LIMIT)*SLOT.TIME 
always
wait DELTA .MICROSECONDS
always
always
loop
let DELAY.TIME = time.v - START.TIME(PACKET)-FOR.ACK- 
PROPAGATION.DELAY 
if IDF.TYPE.P(PACKET) = SLAVE
add 1 to NUM.PACKETS.COMPLETED
always
if NUM.PACKETS.COMPLETED = NUM.PACKETS.DESIRED 
call REPORT 
always 
end
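
The retry rule used twice in the packet process above is a truncated binary exponential backoff. The following Python sketch isolates just that rule; it is an illustration, not part of the thesis listing. The name `backoff_delay` and its arguments are hypothetical, and `random.randint` is assumed to play the role of `randi.f` (both draw an integer with inclusive bounds).

```python
import random

def backoff_delay(collision_counter, collision_limit, slot_time, rng=random):
    """Return (delay, new_counter) after one more collision.

    Mirrors the listing: the counter is incremented first; while it stays
    within the limit, the delay is a random number of slots drawn from
    0..2**counter. Otherwise the counter is reset to zero and the fixed
    maximum delay of 2**limit slots is used.
    """
    collision_counter += 1
    if collision_counter <= collision_limit:
        reschedule = rng.randint(0, 2 ** collision_counter)  # inclusive bounds
        return reschedule * slot_time, collision_counter
    # limit exceeded: restart the counter and wait the maximum time
    return (2 ** collision_limit) * slot_time, 0
```

With `SLOT.TIME` defined as `2*PROPAGATION.DELAY`, the first collision therefore delays a packet by 0, 1 or 2 slots, and the spread doubles with each further collision until the limit is reached.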
process r.station
define INDEX1 and INDEX2 as integer variables
if IDF.TYPE.R(R.STATION) = SLAVE
INDEX1 = MASTER
INDEX2 = MASTER
wait DETECT.TIME .MICROSECONDS
request 1 MASTER.R.BUFFER(INDEX2)
wait R.IMP.TIME .MICROSECONDS
create a MASTER.TRANSPUTER
PROCESSOR.MASTER(MASTER.TRANSPUTER) = SRC.TRANSPUTER(R.STATION)
STATION.MASTER(MASTER.TRANSPUTER) = SRC.STATION(R.STATION)
activate this MASTER.TRANSPUTER now
relinquish 1 MASTER.R.BUFFER(INDEX2)
else
INDEX1 = SRC.TRANSPUTER(R.STATION)
INDEX2 = SRC.STATION(R.STATION)
wait DETECT.TIME .MICROSECONDS 
request 1 R.BUFFER(INDEX2) 
wait R.IMP.TIME .MICROSECONDS
if JOB.ARRAY(INDEX1) = 0
reactivate the SLAVE.TRANSPUTER called
IDF.PROCESSOR(SUPD.ARRAY(INDEX1)) now
destroy the IDLE called SUPD.ARRAY(INDEX1)
always
add 1 to JOB.ARRAY(INDEX1)
relinquish 1 R.BUFFER(INDEX2) 
always
RESPONSE.TIME = time.v - R.RESP.TIME(R.STATION) 
end
process SLAVE.TRANSPUTER given PROC.ADDR and IDF.BUFFER 
define PROC.ADDR and IDF.BUFFER as integer variables
until NUM.PACKETS.SENT >= NUM.PACKETS.DESIRED 
do
INIT.IDLE.TIME = time.v
if JOB.ARRAY(PROC.ADDR) = 0
create an IDLE called SUPD.ARRAY(PROC.ADDR)
let IDF.PROCESSOR(SUPD.ARRAY(PROC.ADDR)) = process.v
suspend
always
IDLE.TIME = time.v - INIT.IDLE.TIME 
if NUM.PACKETS.SENT <= NUM.PACKETS.DESIRED 
add 1 to NUM.PACKETS.SENT 
wait exponential.f(MEAN.PROCESSING.TIME,1) .MICROSECONDS
wait T.SETUP .MICROSECONDS
create a T.STATION
IDF.TYPE.T(T.STATION) = SLAVE
DEST.TRANSPUTER(T.STATION) = PROC.ADDR
DEST.STATION(T.STATION) = IDF.BUFFER
activate this T.STATION now
subtract 1 from JOB.ARRAY(PROC.ADDR)
always 
loop 
end
process t.station
define INIT.TIME as a real variable
define INDEX1 and INDEX2 as integer variables
if IDF.TYPE.T(T.STATION) = SLAVE
INDEX1 = DEST.TRANSPUTER(T.STATION)
INDEX2 = DEST.STATION(T.STATION)
else
INDEX1 = MASTER
INDEX2 = MASTER
always
if ((INIT.TYPE = 2 or INIT.TYPE = 4) or (IDF.TYPE.T(T.STATION) = INITIATOR))
and IDF.TYPE.T(T.STATION) not equal to SLAVE
request 1 GATE(1) '' to assign priority
always
INIT.TIME = time.v
request 1 LINK(INDEX1)
wait T.SETUP .MICROSECONDS 
request 1 T.BUFFER(INDEX2)
WAITING.TIME = time.v - INIT.TIME 
INIT.TIME = time.v
wait T.IMP.TIME .MICROSECONDS
relinquish 1 LINK(INDEX1)
if ((INIT.TYPE = 2 or INIT.TYPE = 4) or (IDF.TYPE.T(T.STATION) = INITIATOR))
and IDF.TYPE.T(T.STATION) not equal to SLAVE
relinquish 1 GATE(1)
always
request 1 MUX(INDEX2)
create a PACKET
SOURCE.STATION(PACKET) = T.STATION
IDF.TYPE.P(PACKET) = IDF.TYPE.T(T.STATION)
ADDR.TRANSPUTER(PACKET) = DEST.TRANSPUTER(T.STATION) 
ADDR.STATION(PACKET) = DEST.STATION(T.STATION) 
RESP.TIME(PACKET) = INIT.TIME 
activate this PACKET now 
suspend 
relinquish 1 MUX(INDEX2) 
relinquish 1 T.BUFFER(INDEX2) 
end
appendix E.4
E4: Physical Address Assignment Algorithm
preamble
normally,mode is integer 
processes
every SLAVE has an IDF, 
and an ADR 
every MASTER has an ADR, 
and an IDF 
every PACKET has an ADR,
a SOURCE.SLAVE, 
a SOURCE.MASTER, 
an IDF, 
a FRAME.T, 
a STATE,
and a COLLISION.COUNTER 
resources include CHANNEL
define N.STATIONS, MAX.STATIONS, LOCK, J, and TYPE.FRAME as integer variables
define CH.ERROR as a real variable 
define .MICROSECONDS to mean units 
define SEED1 and SEED2 as double variables
define BUFFER.SLAVE and BUFFER.ADDRESS as 1-dimensional arrays
define HAS.COMPLETED.TRANSMISSION to mean 1
define IS.WAITING.TRANSMISSION to mean 0
define COLLISION.LIMIT to mean 8
define YES to mean 1
define NO to mean 0
define EMPTY to mean -1000
define c.Fj to mean -1
define a.Fj to mean 0
define r.Fj to mean 1
define ERROR to mean 1
define NO.ERROR to mean 0
define SLOT.TIME to mean .01
end ''preamble
main
call read.data 
call initialize 
start simulation 
end
routine ERROR.GENERATOR giving TYPE.F, IDF yielding STATE.ERROR 
STATE.ERROR = NO.ERROR 
if TYPE.F = r.Fj
if random.f(1) < CH.ERROR or LOCK = YES
STATE.ERROR = ERROR
else
activate a MASTER giving SOURCE.SLAVE(PACKET), IDF now
for I = 1 to MAX.STATIONS
do
if random.f(1) > CH.ERROR
BUFFER.SLAVE(I) = a.Fj
BUFFER.ADDRESS(I) = ADR(PACKET)
always 
loop 
always 
else
if random.f(1) < CH.ERROR
STATE.ERROR = ERROR
else
BUFFER.SLAVE(IDF) = c.Fj
BUFFER.ADDRESS(IDF) = ADR(PACKET)
always 
always 
end
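
The channel error model in ERROR.GENERATOR can be restated compactly. The Python sketch below is illustrative only: the function name, the return convention `(error, start_master)` and the use of Python lists for BUFFER.SLAVE and BUFFER.ADDRESS are assumptions, and `rng.random()` is assumed to stand in for `random.f`. The frame-type constants take the values defined in the preamble.

```python
import random

C_FJ, A_FJ, R_FJ = -1, 0, 1   # frame types, values as in the preamble
EMPTY = -1000

def error_generator(frame_type, idf, adr, lock, ch_error,
                    buffer_slave, buffer_address, rng=random):
    """Return (error, start_master) for one transmitted frame.

    buffer_slave and buffer_address are lists indexed 1..MAX.STATIONS
    (index 0 unused) and are updated in place.
    """
    if frame_type == R_FJ:
        # an r.Fj frame fails on a channel error or while LOCK is set
        if rng.random() < ch_error or lock:
            return True, False
        # otherwise a master is started, and every station that receives
        # the frame without error records a.Fj and the packet address
        for i in range(1, len(buffer_slave)):
            if rng.random() > ch_error:
                buffer_slave[i] = A_FJ
                buffer_address[i] = adr
        return False, True
    # any other frame: only the addressed station is updated
    if rng.random() < ch_error:
        return True, False
    buffer_slave[idf] = C_FJ
    buffer_address[idf] = adr
    return False, False
```

Setting `ch_error` to 0 makes the model deterministic, which is a convenient way to check the protocol logic before studying error rates.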
routine initialize
T S T
N.STATIONS = 0
reserve BUFFER.SLAVE(*) and BUFFER.ADDRESS(*) as MAX.STATIONS+1
for I = 1 to MAX.STATIONS
do
activate a SLAVE giving I now
BUFFER.SLAVE(I) = EMPTY
BUFFER.ADDRESS(I) = 1
loop 
end
process MASTER given SUSP.SLAVE and IDent.F
LOCK = YES
wait 10 .MICROSECONDS 
create a PACKET
let SOURCE.MASTER(PACKET) = MASTER 
let SOURCE.SLAVE(PACKET) = SUSP.SLAVE 
let IDF(PACKET) = IDent.F 
let ADR(PACKET) = J 
let FRAME.T(PACKET) = c.Fj 
activate this PACKET now 
J = J+1 
suspend 
LOCK = NO
end
process PACKET
define RESCHEDULE, QUEUE.WAS, CHANNEL.WAS.IN.DELAY.STATE, STATE.ERROR,
and CHANNEL.WAS.IDLE as integer variables
define DELTA as a real variable 
let COLLISION.COUNTER(PACKET) = 0 
let STATE(PACKET) = IS.WAITING.TRANSMISSION 
let CHANNEL.WAS.IDLE = YES 
while STATE(PACKET) = IS.WAITING.TRANSMISSION 
do
if CHANNEL.WAS.IDLE =NO or U.CHANNEL(1)=1 or ACK.FLAG =1 or 
DELAY.FLAG =1
wait .01 .MICROSECONDS 
always
CHANNEL.WAS.IDLE = U.CHANNEL(1)
QUEUE.WAS = N.Q.CHANNEL(1)
CHANNEL.WAS.IN.DELAY.STATE = DELAY.FLAG
if FRAME.T(PACKET) = r.Fj and BUFFER.SLAVE(IDF(PACKET)) <> EMPTY
STATE(PACKET) = HAS.COMPLETED.TRANSMISSION
reactivate the SLAVE called SOURCE.SLAVE(PACKET) now
else
request 1 CHANNEL(1)
if CHANNEL.WAS.IDLE = YES 
DELAY.FLAG = 1 
wait .005 .MICROSECONDS 
DELAY.FLAG = 0
if N.Q.CHANNEL(1) = 0 and QUEUE.WAS = 0
TRANSMISSION.FLAG = 1 
wait 1 .MICROSECONDS 
TRANSMISSION.FLAG = 0
call ERROR.GENERATOR giving FRAME.T(PACKET), IDF(PACKET) yielding 
STATE.ERROR 
wait .005 .MICROSECONDS 
ACK.FLAG = 0 
wait .4 .MICROSECONDS 
relinquish 1 CHANNEL(1)
if STATE.ERROR = NO.ERROR
STATE(PACKET) = HAS.COMPLETED.TRANSMISSION
always
else
wait .01 .MICROSECONDS
'' the channel is wasted for that max time
relinquish 1 CHANNEL(1)
add 1 to COLLISION.COUNTER(PACKET)
CHANNEL.WAS.IDLE = YES
if COLLISION.COUNTER(PACKET) <= COLLISION.LIMIT
let RESCHEDULE = randi.f(0,2**COLLISION.COUNTER(PACKET),1)
DELTA= (RESCHEDULE *SLOT.TIME) 
else
COLLISION.COUNTER(PACKET) = 0
DELTA = (2**COLLISION.LIMIT)*SLOT.TIME 
always
wait DELTA .MICROSECONDS 
always 
else
relinquish 1 CHANNEL(1)
if CHANNEL.WAS.IN.DELAY.STATE = 1 '' COLLISION
add 1 to COLLISION.COUNTER(PACKET)
CHANNEL.WAS.IDLE = YES
if COLLISION.COUNTER(PACKET) <= COLLISION.LIMIT
let RESCHEDULE = randi.f(0,2**COLLISION.COUNTER(PACKET),1)
DELTA= (RESCHEDULE *SLOT.TIME) 
else
COLLISION.COUNTER(PACKET) = 0 
DELTA = (2**COLLISION.LIMIT)*SLOT.TIME 
always
wait DELTA .MICROSECONDS
always
always
always
loop
if FRAME.T(PACKET) = c.Fj '' master packet
reactivate the MASTER called SOURCE.MASTER(PACKET) now
reactivate the SLAVE called SOURCE.SLAVE(PACKET) now
always
end '' packet process
routine read.data
print 1 line thus
enter the number of stations
read MAX.STATIONS
print 1 line thus
enter the channel error probability
read CH.ERROR
let SEED1 = seed.v(1)
let SEED2 = seed.v(2)
create every CHANNEL(1)
let U.CHANNEL(1) = 1
end ''read.data
process SLAVE given IDent.F
'' the temporary address is only used for dynamic traffic load balancing;
'' if omitted, the protocol is not affected.
UNASSIGNED = YES
TEMPORARY.ADDRESS = BUFFER.ADDRESS(IDent.F)
while UNASSIGNED = YES
do
v = randi.f(0,32-TEMPORARY.ADDRESS,1)
'' the waiting time is relative to the master response to reduce the load
'' on the bus. The proper operation of the protocol does not depend on it.
for I = 1 to (v + 1000)
do
wait SLOT.TIME .MICROSECONDS
if BUFFER.SLAVE(IDent.F) <> EMPTY
BUFFER.SLAVE(IDent.F) = EMPTY
TEMPORARY.ADDRESS = BUFFER.ADDRESS(IDent.F)
go to 'end'
always 
loop
create a PACKET
let SOURCE.SLAVE(PACKET) = SLAVE
let IDF(PACKET) = IDent.F
let FRAME.T(PACKET) = r.Fj
let ADR(PACKET) = TEMPORARY.ADDRESS+1
activate this packet now
suspend
if BUFFER.SLAVE(IDent.F) = c.Fj 
PERMANENT.ADR = BUFFER.ADDRESS(IDent.F)
print 1 line with PERMANENT.ADR, IDent.F thus
ADDRESS = *** idf=***
UNASSIGNED = NO
N.STATIONS = N.STATIONS + 1
always
BUFFER.SLAVE(IDent.F) = EMPTY
TEMPORARY.ADDRESS = BUFFER.ADDRESS(IDent.F)
wait 2 .MICROSECONDS
'' this time corresponds to the real execution of the software code
'' in each interface.
if N.STATIONS = MAX.STATIONS
print 1 line with time.v thus 
simul (irri© ^  
always 
'end' loop
end
appendix E.5
E.5: OCCAM simulator
**Listed On 26-11-91 0:11 **
**List of Fold** "simulato.tsr"
**List of File** "simulaOO.tsr"
**File Last Modified 7-04-91 1:42 
**List all lines with Fold Headers 
**Excluding : NO LIST folds
#USE userio 
#USE t4math 
--{{{ constant
VAL N IS 4: -- number of stations
VAL sync IS 3:
VAL EOC IS 0:
--}}}
--{{{ channel declarations
[N]CHAN OF ANY ch.th.data, ch.th.ack, ch.rh.data, ch.rh.ack:
[4]CHAN OF ANY ch.host.out, ch.host.in:
--}}}
PROC imp([]CHAN OF ANY ch.host.in, ch.host.out, CHAN OF ANY ch.th.data,
ch.th.ack, ch.rh.ack, ch.rh.data, VAL INT local.adr)
--{{{ local channel declarations
TIMER clock:
CHAN OF ANY ch.enab.trans, ch.enab.retrans, ch.event:
CHAN OF ANY ch.buffer.reqst, ch.buffer.reply, ch.ack:
CHAN OF ANY ch.itp, ch.irp.data, ch.data.reply:
CHAN OF ANY ch.irp.status, ch.status.reply, ch.command:
CHAN OF ANY ch.irp.count, ch.interupt.enab, ch.count.reply:
CHAN OF ANY ch.th.error, ch.rh.error, ch.reset.fifo:
--}}}
PAR
--{{{ imp hardware layer
--{{{ process acknowledgement
INT any:
WHILE TRUE
SEQ
ch.ack ? any
IF
any = 1
--{{{ send a positive ack
ch.rh.ack ! sync; #EF; EOC
--}}}
TRUE -- jam the channel
--{{{ jam the channel because the buffer is full
ch.rh.ack ! EOC
--}}}
--}}}
--{{{ imp receiver hardware
--{{{ declare local variables
[256]INT R.fifo:
[128]INT S.fifo:
BOOL BOC, packet.accept, interrupt.enab:
INT data, R.reg, crc.out, revc.adr:
INT count1, count, wt, rd, any, wt1, rd1, old.count:
--}}}
SEQ
--{{{ initialise local variables
wt := 0
rd := 0
BOC := TRUE
packet.accept := FALSE
wt1 := 0
rd1 := 0
count := 0
count1 := 0
old.count := 0
interrupt.enab := TRUE
--}}}
--{{{ start the reception of packets
WHILE TRUE
ALT
(count > 0) & ch.irp.data ? any
--{{{ irp is requesting data
SEQ
ch.data.reply ! R.fifo[rd]
rd := (rd+1) REM (SIZE R.fifo)
count := count-1
--}}}
(count1 > 0) & ch.irp.status ? any
--{{{ irp is requesting status
SEQ
ch.status.reply ! S.fifo[rd1]
rd1 := (rd1+1) REM (SIZE S.fifo)
count1 := count1-1
--}}}
ch.irp.count ? any
ch.count.reply ! count
ch.interupt.enab ? any
interrupt.enab := TRUE
ch.rh.data ? data
--{{{ beginning of the carrier
IF
BOC -- beginning of the carrier
--{{{ gets the header of the packet
SEQ
ch.rh.data ? data
IF
data <> EOC
--{{{ a packet is being received
SEQ
BOC := FALSE
revc.adr := data -- received address
R.reg := data -- push the data into the receiver register
--{{{ decode the address
IF
((R.reg >> 2) = local.adr) OR (R.reg = #EF) -- #EF is the broadcasting address
--{{{ accept the packet
SEQ
packet.accept := TRUE
crc.out := 0
old.count := wt -- monitors the writing to the fifo
--}}}
TRUE
--{{{ refuse the packet
SKIP
--}}}
TRUE
--{{{ rubbish is detected
SKIP
--}}}
--}}}
TRUE
--{{{ reception in progress
IF
packet.accept
--{{{ packet has been accepted
IF
(data <> EOC) AND (count < ((SIZE R.fifo)-1))
--{{{ write the packet to the R.fifo
SEQ
R.fifo[wt] := R.reg
count := count+1
wt := (wt+1) REM (SIZE R.fifo)
crc.out := CRCWORD(R.reg, crc.out, #10210000)
R.reg := data
IF
interrupt.enab
SEQ
interrupt.enab := FALSE
ch.event ! any
TRUE
SKIP
--}}}
TRUE
--{{{ end of carrier or buffer overflow
SEQ
packet.accept := FALSE
BOC := TRUE
IF
old.count = wt
--{{{ there is no writing to the R.fifo
IF
data = EOC
--{{{ short packet (ignored ack)
SKIP
--}}}
TRUE
--{{{ buffer already full
SEQ
ch.ack ! EOC
WHILE data <> EOC
ch.rh.data ? data
--{{{ comments and explanations
-- throw away all the remaining portions of incoming data because
-- the fifo flag is set to full until the end of the current carrier
-- this happens when the station noticed a late jamming signal
-- broadcast packets are discarded when the buffers are full
-- in the real system all branch stations are involved in the collision
--}}}
--}}}
--}}}
TRUE
--{{{ there is writing to R.fifo
SEQ
R.fifo[wt] := EOC
IF
data = EOC
--{{{ report status error
SEQ
S.fifo[wt1] := crc.out >< R.reg
IF
(S.fifo[wt1] = 0) AND (revc.adr <> #EF)
-- broadcast packets are acknowledged by their services
-- or high level protocols
ch.ack ! 1
-- packet acknowledgement
TRUE
--{{{ packet in error
SKIP
--}}}
--}}}
TRUE
--{{{ broken packet is written into the R.fifo
SEQ
S.fifo[wt1] := 1
ch.ack ! 0 -- jamming signal
WHILE data <> EOC
ch.rh.data ? data
--}}}
count := count+1
count1 := count1+1
wt := (wt+1) REM (SIZE R.fifo)
wt1 := (wt1+1) REM (SIZE S.fifo)
--}}}
--}}}
--}}}
TRUE -- packet is not intended for this station
--{{{ do not do any operation on the packet
IF
data <> EOC
SKIP
TRUE
BOC := TRUE
--}}}
--}}}
--}}}
ch.reset.fifo ? any
--{{{ reset all the fifos
SEQ
count := 0
count1 := 0
rd := 0
wt := 0
rd1 := 0
wt1 := 0
--}}}
--}}}
--{{{ imp transmission hardware
--{{{ constant
VAL time.out IS 10000:
VAL delay.bf IS 100000:
VAL max.bf IS 128:
VAL max.c IS 16:
VAL max.col IS 8:
--}}}
--{{{ declaration of variables
[256]INT T.fifo:
INT crc.in, crc.out, time.now, T.reg, ack, data, collision.limit:
INT any, signal, wt, rd, empty.flag, collision.counter, buffer.flow:
--}}}
SEQ
--{{{ initialisation of the parameters
signal := 0
wt := 0
rd := 0
empty.flag := 0
collision.counter := 0
collision.limit := 0
buffer.flow := 0
--}}}
--{{{ hardware management
WHILE TRUE
SEQ
PRI ALT
ch.itp ? T.fifo[wt]
--{{{ write to tfifo
wt := wt+1 -- packets < 256 words
--}}}
ch.enab.trans ? any
--{{{ enable the transmission of the current packet
signal := 1
--}}}
ch.enab.retrans ? any
--{{{ enable the retransmission of the current packet
signal := 1
--}}}
ch.buffer.reqst ? any
--{{{ return the status flag of the fifo to the itp process
ch.buffer.reply ! empty.flag
--}}}
--{{{ start the transmission of the packet in the tfifo
WHILE signal = 1
SEQ
--{{{ initialise the transmission circuitry
SEQ
rd := 0
crc.in := 0
--}}}
--{{{ send the sync character
ch.th.data ! sync
-- ch.th.ack ? any
any := 0 -- no collision is tested
--}}}
--{{{ detect any collision
IF
any = 1
SEQ
collision.counter := collision.counter+1
--{{{ collision management
IF
collision.limit > max.c
--{{{ interrupt the IRP protocol
SEQ
ch.th.error ! 0
signal := 0
collision.counter := 0
collision.limit := 0
--}}}
TRUE
--{{{ retry a retransmission
REAL32 x, y:
INT32 z:
SEQ
signal := 1
IF
collision.counter >= max.col
--{{{ wait a fixed time and retransmit
SEQ
collision.limit := collision.limit+1
clock ? any
x := REAL32 ROUND collision.counter
y := POWER(2.0 (REAL32), x)
data := (INT ROUND y)*time.out
clock ? AFTER any PLUS data
--}}}
TRUE
--{{{ wait a random time
SEQ
collision.counter := collision.counter+1
clock ? any
x := REAL32 ROUND collision.counter
y := POWER(2.0 (REAL32), x)
x, z := RAN(INT32 any)
data := ((INT z) >> 1) REM (INT ROUND y)
clock ? AFTER any PLUS (data*time.out)
--}}}
--}}}
--}}}
TRUE
SEQ
collision.counter := 0
--{{{ carry on the transmission of the remaining bits
--{{{ detect buffer overflow
empty.flag := wt - rd
WHILE empty.flag <> 0
SEQ
PRI ALT
ch.th.ack ? any
--{{{ buffer overflow is detected
SEQ
ch.th.data ! EOC -- finish the last 1 to 0 transition of the carrier
empty.flag := 0 -- and stop the current transmission
buffer.flow := buffer.flow+1
--{{{ buffer overflow management
IF
buffer.flow > max.bf
--{{{ interrupt the IRP protocol
SEQ
ch.th.error ! 1
signal := 0
buffer.flow := 0
--}}}
TRUE
--{{{ wait for determined time
SEQ
clock ? time.now
clock ? AFTER time.now PLUS delay.bf
--}}}
TRUE & SKIP
--{{{ no buffer overflow carry on the transmission
SEQ
--{{{ transmit and compute the crc for each word
T.reg := T.fifo[rd]
rd := rd + 1
empty.flag := wt - rd
crc.in := CRCWORD(T.reg, crc.in, #10210000)
ch.th.data ! T.reg
--}}}
IF
empty.flag = 0
--{{{ end of transmission
SEQ
ch.th.data ! crc.in ; EOC
buffer.flow := 0
clock ? time.now
--{{{ start time out mechanism
ALT
ch.th.ack ? data
IF
data = sync -- it is a positive ack
--{{{ reset the T.fifo
SEQ
ch.th.ack ? data; data
wt := 0
rd := 0
empty.flag := 0
signal := 0
--}}}
TRUE
--{{{ buffer overflow is detected
SEQ
buffer.flow := buffer.flow+1
--{{{ buffer overflow management
IF
buffer.flow > max.bf
--{{{ interrupt the IRP protocol
SEQ
ch.th.error ! 1
signal := 0
buffer.flow := 0
--}}}
TRUE
--{{{ wait for determined time
SEQ
clock ? time.now
clock ? AFTER time.now PLUS delay.bf
--}}}
--}}}
--}}}
clock ? AFTER time.now PLUS time.out
--{{{ start retransmission
SKIP
--}}}
--}}}
--}}}
TRUE
SKIP
--}}}
--}}}
--}}}
--}}}
--}}}
--}}}
--}}}
--{{{ imp software layer
--{{{ ITP: interface transmission process
--{{{ declarations
INT dst.adr, len, len.cont.field, src.adr:
INT data, T.buffer, location:
[32]INT VS:
--}}}
SEQ
src.adr := local.adr
SEQ i = 0 FOR SIZE(VS)
VS[i] := 0
--{{{ packet.format.process
WHILE TRUE
ALT j = 0 FOR 4
ch.host.out[j] ? dst.adr
SEQ
T.buffer := 1 -- assume transmitting FIFO full
IF
dst.adr < 0
--{{{ the message is an internetwork packet
SKIP
--}}}
TRUE
--{{{ the message is an intranetwork packet
SEQ
--{{{ prepare the control and length field
ch.host.out[j] ? len
location := dst.adr >> 4
len.cont.field := (len << 2) \/ ((VS[location] >> (dst.adr /\ 15)) /\ 1)
--}}}
--{{{ wait until a buffer becomes ready
WHILE T.buffer <> 0
SEQ
ch.buffer.reqst ! 0
ch.buffer.reply ? T.buffer
--}}}
PAR
SEQ
--{{{ write the packet to the T.FIFO
ch.itp ! dst.adr; (src.adr << 2) \/ j; len.cont.field
SEQ i = 0 FOR len
SEQ
ch.host.out[j] ? data
ch.itp ! data
--}}}
--{{{ enable the transmission of T.FIFO
ch.enab.trans ! 1
T.buffer := 1
--}}}
--{{{ get the next send sequence number for the next message.
VS[location] := VS[location] >< (1 << (dst.adr /\ 15))
--}}}
--{{{ IRP: interface receiver process
--{{{ declaration of local variables
[32]INT VR:
INT any, src.adr, dst.adr, link.host.adr, location:
INT error.counter, exps, len, len.cont.field, status, count, count1, data:
--}}}
--{{{ constant
VAL error.max IS 128:
--}}}
SEQ
--{{{ initialise
error.counter := 0
SEQ i = 0 FOR SIZE VR
VR[i] := 0
--}}}
WHILE TRUE
SEQ
ch.event ? any
count := 1 -- there is at least one arrival
WHILE count > 0
SEQ
--{{{ proceed with the reception of packets
ch.irp.data ! 1 -- read signal is issued to the r.fifo
ch.data.reply ? dst.adr -- read the header data
IF
dst.adr <> EOC
--{{{ not a broken packet
SEQ
link.host.adr := dst.adr /\ 3
ch.irp.data ! 1 -- read signal
ch.data.reply ? src.adr
location := src.adr >> 4
IF
src.adr <> EOC
--{{{ not a broken packet
IF
src.adr < 0
--{{{ internetwork packet
SKIP
--}}}
TRUE
--{{{ intranetwork packet
SEQ
exps := (VR[location] >> (src.adr /\ 15)) /\ 1
ch.irp.data ! 1
ch.data.reply ? len.cont.field
len := len.cont.field >> 2
IF
len.cont.field <> EOC
--{{{ not a broken packet
IF
(len.cont.field /\ 2) = 0
--{{{ it is a data packet
SEQ
ch.irp.status ! 0 -- read the status of the packet
ch.status.reply ? status
IF
(exps = (len.cont.field /\ 1)) AND (status = 0)
--{{{ there are no errors or duplications; packet may be accepted
PAR
--{{{ to host send the message
SEQ
error.counter := 0
ch.host.in[link.host.adr] ! len
SEQ i = 0 FOR len
SEQ
ch.irp.data ! 1
ch.data.reply ? data
ch.host.in[link.host.adr] ! data
ch.irp.data ! 1
ch.data.reply ? any -- discard the EOC
--}}}
--{{{ update the next expected sequence number
VR[location] := VR[location] >< (1 << (src.adr /\ 15))
--}}}
--}}}
TRUE
--{{{ packet has to be discarded
SEQ
WHILE len <> EOC
SEQ
ch.irp.data ! 1
ch.data.reply ? len
-- len.control.field is used as a data variable
IF
error.counter = error.max
SEQ
error.counter := 0
ch.rh.error ! any
TRUE
error.counter := error.counter+1
--}}}
--}}}
TRUE
--{{{ it is a command packet
SEQ
ch.irp.data ! 1 -- read the end of carrier
ch.data.reply ? any
ch.command ! len.cont.field
-- the command may be a single command for a host(s) and an imp(s)
-- or a broadcast packet to hosts or imps; the interpretation
-- of the command can be obtained from the len and the control
-- broadcast packets or commands are acknowledged by the service or
-- the high-level protocols
--}}}
--}}}
TRUE
--{{{ broken packet
SKIP
--}}}
--}}}
--}}}
TRUE
--{{{ a broken packet as a result of buffer full
SKIP
--}}}
--}}}
TRUE
--{{{ broken packet
SKIP
--}}}
--}}}
ch.irp.count ! any
ch.count.reply ? count
ch.interupt.enab ! any
--}}}
--{{{ IMP: interface manager process
INT long.time, any:
SEQ
--{{{ initialise
long.time := 10000000
--}}}
WHILE TRUE
ALT
ch.th.error ? any
--{{{ transmission error
SEQ
--{{{ errormanager(any,long.time)
SKIP
--}}}
clock ? any
clock ? AFTER any PLUS long.time
ch.enab.retrans ! any
--}}}
ch.rh.error ? any
--{{{ reception error
ch.reset.fifo ! any
--}}}
ch.command ? any
--{{{ command packet to be interpreted by the system
SKIP
--}}}
--}}}
--}}}
PAR
--{{{ source.process.test
WHILE TRUE
SEQ j = 0 FOR 4
SEQ
ch.host.out[j] ! j+8
ch.host.out[j] ! 3
SEQ i = 0 FOR 3
ch.host.out[j] ! j+1
--}}}
--{{{ sink.process.test
WHILE TRUE
PAR j = 0 FOR 4
INT data, len:
SEQ
ch.host.in[j] ? len
SEQ i = 0 FOR len
SEQ
ch.host.in[j] ? data
--}}}
imp(ch.host.in, ch.host.out, ch.th.data[2], ch.th.ack[2], ch.rh.ack[2],
ch.rh.data[2], 2)
--{{{ channel process
PAR
INT data:
WHILE TRUE
SEQ
ch.rh.ack[2] ? data
ch.th.ack[2] ! data
INT data:
WHILE TRUE
SEQ
ch.th.data[2] ? data
ch.rh.data[2] ! data
--{{{ display
SEQ
write.int(screen, data, 0)
write.full.string(screen, " ")
IF
data = EOC
newline(screen)
TRUE
SKIP
--}}}
--}}}
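
The ITP and IRP processes in the listing above guard against duplicate packets with a one-bit sequence number per peer, held in the bit arrays VS (sender) and VR (receiver). The Python sketch below illustrates that bookkeeping only; it is not a translation of the simulator. The function names are hypothetical, and it assumes both sides index the bit arrays with the same address value, whereas the listing indexes by destination word on the sender and by source word on the receiver.

```python
def make_len_cont_field(vs, dst_adr, length):
    """Sender side: build the length/control word, then toggle the bit.

    Bit 0 of the word carries the current send sequence bit; the length
    occupies the bits from position 2 upwards.
    """
    location, bit = dst_adr >> 4, dst_adr & 15
    seq = (vs[location] >> bit) & 1
    field = (length << 2) | seq
    vs[location] ^= 1 << bit      # next send sequence number
    return field

def accept_data_packet(vr, src_adr, len_cont_field, status):
    """Receiver side: accept only an error-free packet carrying the
    expected sequence bit, and toggle the expectation on acceptance."""
    location, bit = src_adr >> 4, src_adr & 15
    expected = (vr[location] >> bit) & 1
    if expected == (len_cont_field & 1) and status == 0:
        vr[location] ^= 1 << bit  # next expected sequence number
        return True
    return False

# Usage: a repeated copy of the same packet is rejected as a duplicate.
vs, vr = [0] * 32, [0] * 32
first = make_len_cont_field(vs, 9, 3)
print(accept_data_packet(vr, 9, first, 0))   # accepted
print(accept_data_packet(vr, 9, first, 0))   # duplicate, rejected
```

Because the receiver toggles its bit only on acceptance, a retransmission of an already delivered packet no longer matches the expected bit and is discarded.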
