Reconfigurable FPGA-based message routing for embedded real-time parallel processing applications. by Frost, Graham.
University Of Surrey 
Department of Electrical and Electronic Engineering
Reconfigurable FPGA-Based Message Routing 
for Embedded Real-Time Parallel Processing 
Applications
Graham Frost M.Sc. B.Sc.(Hons)
A thesis submitted in partial fulfilment of the requirements for the 
higher degree of Doctor of Philosophy
March 1998
ABSTRACT
To meet the requirements of high performance embedded applications, this
thesis proposes an architecture in which multiple general-purpose processors are
connected through a communication network to form a parallel processing system.
Two aspects of this approach are considered.
A network for connecting multiple processors must provide high speed, low
latency communication to ensure that inter-processor communications do not degrade
the overall computation performance. A proposal to use programmable-logic-based
message routing nodes to construct such a network is investigated and the advantages
of this approach are explored. A network design, produced using the hardware
description language VHDL, is described and its verification by simulation is
presented. Routing mechanisms incorporated in the design to improve network
utilisation include multicast, adaption and encapsulation. The construction of a small
network and its verification using four processors to generate messages and verify
message delivery is described. The successful use of programmable logic provides a
foundation for developing networks in which the specific router design is based on
matching the resources of the network to those required by the application using a
library of routing functions.
An investigation into the processing requirements of a system which uses video
cameras on a railway carriage to measure the relative position of station platforms and
the rails is presented. The implementation of two algorithms to provide the platform
position in real-time is investigated, and the generation of a database of real sensor
data for further off-line algorithm development is described. The results obtained
show that the requirements of the video measurement application can be met with a
reasonable number of processors, and provide a metric for estimating the processor
requirements of future systems. A new technique is presented which uses image
processing to calibrate the video cameras so that optical alignment of the cameras is
no longer critical.
ACKNOWLEDGMENTS
I would like to thank my supervisors, Chris Jesshope and Neil MacCuaig, for
their advice and guidance, Roger Peel for taking over the supervision of this research
from Chris during the closing stages, and for his guidance in the preparation of this
thesis, and Bob Crocker, the ROLIN project manager, for enthusiastically supporting
this work and for lifting it from its darkest days.
For their support, in the form of funding for the ROLIN project, I would like to
thank Railtest and the DTI. I am grateful to my employer, SMIS Ltd., for supporting
this work financially and allowing time for this project during normal working hours. I
would also like to thank XILINX Ltd. for supplying the programmable logic devices
used in the demonstration network and Transtech Ltd for the loan of some of the
equipment used during final verification of the network.
This work is dedicated to my wife Theresa, and my children Lucy and Graham.
Their understanding, patience, support and encouragement throughout this lengthy
endeavour has been unwavering. I would also like to include my daughter Amy in this
dedication. She was born as this thesis was nearing completion and will thankfully
never hear the “Sorry, I’ve got to work on my PhD” response that her siblings are so
familiar with. We can now do all the things that were put on hold for so long...
© Graham Frost 1998
CONTENTS
1. Introduction
1.1 Motivation for the Research
1.2 Key Applications
1.2.1 Fast Medical Imaging
1.2.2 Video Gauging
1.3 Research Objectives
1.3.1 Evaluating the Required Performance
1.3.2 Implementing a Message Routing Network
1.4 Thesis Outline
2. Parallelism in Computers
2.1 The Exploitation of Parallelism in Computer Engineering
2.2 Computer Architecture Classification
2.2.1 Multiprocessors
2.2.2 Multicomputers
2.2.3 Dataflow and Reduction Systems
2.2.4 Heterogeneous Computing Systems
2.3 Communication Networks for Concurrent Computers
2.3.1 Indirect Networks
2.3.2 Direct Networks
2.3.3 Switching Mechanisms
2.3.4 Important Network Properties
2.4 The Packet-Switched Direct Network Approach
2.4.1 Packet Routing Schemes
2.4.2 Deadlock Freedom
2.4.3 Multicast Routing
2.5 The Implementation of an FPGA-Based Network Routing Node
3. Programmable Logic
3.1 Devices
3.1.1 Historical Development
3.1.2 Complex Programmable Logic Devices
3.1.3 Field Programmable Gate Arrays
3.1.4 Device Programmability
3.2 Application Areas for FPGAs
3.2.1 Hardware/Software Codesign
3.2.2 Reconfigurable Computing
3.2.3 Genetic Algorithms
3.2.4 Intellectual Property
3.3 FPGA-Based Message Routers
4. The ROLIN Project
4.1 Key Project Themes
4.1.1 Positioning
4.1.2 Video Processing
4.1.3 Database Management
4.2 The Technology Demonstrator
4.2.1 System Overview
4.2.2 Data Acquisition Board
4.2.3 Generation of Platform Edge Co-ordinate Measurements
4.2.4 Calibration Procedure
4.2.5 Platform Edge Gauging Algorithms
4.2.6 Results
5. An FPGA-Based Message Router
5.1 Routing Node Configurability
5.1.1 An Example Application
5.1.2 Towards a Virtual Message Routing Network
5.1.3 The Routing Node Configuration Used in this Research
5.2 Network Topology
5.3 Physical Implementation
5.3.1 Asynchronous Communication Links
5.3.2 Centralising the Network Fabric
5.4 Routing Node Architecture
5.4.1 Number of Routing Channels
5.4.2 Internal Architecture
5.4.3 The Two-Channel Router Engine
5.4.4 Routing Channel Flow Control
5.4.5 Processor Interface
5.5 Message Routing
5.5.1 Data Driven Wormhole Routing
5.5.2 Eager Inertial Routing
5.5.3 Message Types
5.5.4 Node Addressing
5.5.5 Adaption
5.6 Deadlock and Livelock
5.6.1 Multicast Collision Deadlock
5.6.2 Multicast Contention Deadlock
5.6.3 Multicast Adaption Deadlock
5.7 Software Support
5.7.1 Application Configuration
5.7.2 Processor Farms
5.7.3 Micro-Kernel
5.7.4 Communication using DMA
6. Prototype Network Implementation and Verification
6.1 Implementation Details
6.1.1 Design Goals
6.1.2 The VHDL Approach to Digital Design
6.2 Routing Node Design
6.2.1 Channel Data Flow
6.2.2 Dualbus Router
6.2.3 Processor Interface
6.2.4 FPGA Design Considerations
6.3 Verification - Simulation
6.3.1 The VHDL Test Bench Approach
6.3.2 Verification of the Network Node Components
6.3.3 Verification of the Complete Network Design
6.3.4 Simulation Results
6.4 Verification - Demonstration Hardware
6.4.1 The Demonstration Network Hardware
6.4.2 Demonstration Network Application
6.4.3 Verification Tasks
6.4.4 Different Network Configurations
6.4.5 Verification Results
7. Conclusions and Further Work
7.1 The ROLIN Project
7.2 FPGA-Based Message Router
7.3 Summary
8. References
9. Appendix A - VHDL Hierarchy and VHDL Source Code
10. Appendix B - Schematics
Figures
Figure 1 - False Nearest Approach Result
Figure 2 - Platform Edge Image
Figure 3 - Wet Platform Edge Image
Figure 4 - Flynn's Taxonomy
Figure 5 - Bell's Taxonomy of MIMD Computers
Figure 6 - Processor to Memory Architecture
Figure 7 - PE-to-PE Architecture
Figure 8 - A Simple Data Flow Graph
Figure 9 - Shuffle Exchange Network
Figure 10 - Direct Network Topologies
Figure 11 - Processor Deadlock
Figure 12 - PAL Sum of Products Structure
Figure 13 - FLEXLogic Product Borrowing
Figure 14 - MAX9000 Architecture
Figure 15 - Simplified Block Diagram of XC-4000 CLB
Figure 16 - ROLIN Technology Demonstrator
Figure 17 - Data Acquisition Board Block Diagram
Figure 18 - Typical Platform Edge Image
Figure 19 - Calibration Processing
Figure 20 - Platform Edge Processing
Figure 21 - Calibration Board Marker Layout
Figure 22 - Captured Image of Calibration Board
Figure 23 - Boxes Marking Calibration Regions
Figure 24 - DET3 Object Identification
Figure 25 - DET3 Wire Frame Construction
Figure 26 - Manual vs. Video-Based Platform Edge Measurements
Figure 27 - Run-on-Run Error Distribution
Figure 28 - Example Application Process Mapping
Figure 29 - Example Application Network
Figure 30 - Deadlock Free Linear Array
Figure 31 - Message Progression
Figure 32 - Deadlock in a Ring Topology
Figure 33 - MP1 Comparison
Figure 34 - Processing Element
Figure 35 - Internal Architecture of Network Routing Node
Figure 36 - Two-channel Router Engine
Figure 37 - Routing Engine and Flit Buffer Logic
Figure 38 - Flit Buffering
Figure 39 - Routing Channel Signal Definition
Figure 40 - Message Transfer
Figure 41 - Store and Forward Routing Versus Wormhole Routing
Figure 42 - Blockage Characteristics
Figure 43 - Combinatorial Address Comparison
Figure 44 - Registered Address Comparison
Figure 45 - Encapsulated Multicast
Figure 46 - Hierarchical Network Expansion
Figure 47 - Global Encapsulation Deadlock Freedom
Figure 48 - Multicast Collision Deadlock
Figure 49 - Multicast Contention Deadlock
Figure 50 - Outbound Channel Architecture
Figure 51 - Multicast Adaption Deadlock
Figure 52 - Processor Farm
Figure 53 - Processor Farm Network
Figure 54 - Four-Plane Farm Router Node
Figure 55 - Multi-Plane Farm Controller Processes
Figure 56 - Internal VHDL Architecture
Figure 57 - Injection Controller
Figure 58 - Flit Buffering
Figure 59 - Flow Controller
Figure 60 - C40 Link Signalling
Figure 61 - Injection Message Format
Figure 62 - C40INJ Controller
Figure 63 - C40 Reception Controller
Figure 64 - Conventional and One-Hot State Machine Encoding
Figure 65 - Verification Application
Figure 66 - TX Message Format
Figure 67 - Ping Injection Message Format (node 8)
Figure 68 - Ping Network Message (node 8 to node 1)
Figure 69 - Ping Injection Message Format (node 1)
Figure 70 - Index of VHDL Descriptions
Figure 71 - Index of Schematics
Photographs
Photograph 1 - Light Fan-Beam From the Structure Gauging Train
Photograph 2 - SGT Acquisition Board
Photograph 3 - Demonstration Network Board
Chapter 1 
Introduction
1. Introduction
The research described in this thesis has been undertaken by the author in a
collaborative programme between the Computer Systems Research Group in the
Department of Electrical and Electronic Engineering at the University of Surrey, and
Surrey Medical Imaging Systems Ltd. (SMIS).
SMIS is an international company specialising in Magnetic Resonance Imaging
(MRI) systems and, through a subsidiary company, non-destructive testing and
inspection equipment. SMIS design and construct medical imaging systems for
research and clinical applications, and industrial inspection and measurement systems
using ultrasound and video techniques, predominantly for the railway industry.
Part of the research described in this thesis is linked with another research
project (ROLIN) jointly funded by the DTI and Railtest. The ROLIN project
investigated three areas of technology required to implement future test trains:
positioning, video processing and database management. The video processing aspect
of the ROLIN project, with which this research is linked, is concerned with real-time
video processing for several key application areas, including real-time compression of
video data and the identification of features from video images for measurement
purposes.
1.1 Motivation for the Research
Real-time digital signal and image processing are the core processing functions
of both the medical imaging systems and the non-destructive inspection systems
produced by SMIS. To date, one or two general purpose Digital Signal Processor
(DSP) boards have been used in SMIS medical systems to successfully meet these
data processing requirements. The DSP boards are PC AT plug-in cards, used as
accelerator units by programs running on the PC processor. Signal processing tasks
are loaded to the board across the PC AT bus, and the results are read back when the
DSP board has completed the processing. Two new applications, fast MRI imaging
and video gauging, require a level of performance beyond that achievable with this
basic approach.
Fast MRI imaging techniques for medical imaging, which can produce data
acquisition rates up to a hundred times faster than previous methods, are now
realisable through advances in magnet technology, improvements in the electronics
associated with the magnet, and refinement of the measurement technique. To fully
exploit the potential of fast MRI imaging, it is essential that the performance of the
signal processing and image reconstruction tasks is increased to match that of the
higher data acquisition rates.
Video gauging is implemented on an existing test train run by Railtest. The test
train measures the position of structures surrounding the track relative to the rails
using six video camera pairs to ensure that rolling stock has a safe working clearance
(clearance gauge). Operational difficulties caused by the simple video processing
currently implemented on the test train and its restricted field of view limit the
commercial exploitation of the vehicle. To be commercially successful, the processing
of the video images must be improved to make the clearance measurements reliable
and immune to spurious signals, the processing must operate in real-time to provide
on-line annunciation of exceedances, and more cameras must be used to provide a
wider view around the vehicle.
It is clear that the processing demand of these two embedded application areas
is very high. One or two DSPs are needed for the current medical imaging systems,
but new techniques produce much higher rates of data. The extraction of clearance
measurements in real time from more than six video camera pairs will also require
many processors. A single processor will not meet these embedded signal processing
demands; a number of processors connected in a parallel processing system are
required. With a variety of existing and potential applications imposing different
requirements on the processing system, a flexible approach to embedded parallel
processing is required. It is highly desirable to have a ‘building block’ approach to
eliminate the hardware and software engineering costs associated with designing a
different system for each new application.
The processes in these applications will be diverse and an efficient
implementation will require processors with different capabilities for each process. A
Digital Signal Processor (DSP), for example, is optimised for signal processing and, in
general, will not perform character processing well. A video graphic processor will
perform image manipulation faster than a general purpose microprocessor. The ability
to use different processors, and to mix these processors in the same network to
provide a heterogeneous system, should ensure that a good match can be made
between each process and its host processor. A heterogeneous approach should also
enable future processors to be used on the same network with minimum effort.
A system consisting of standard processors connected through a shared bus is a
common approach to providing an embedded parallel processing system. This
multiprocessor technique is usually implemented through the use of bus standards like
VME and FutureBus+ which are widely supported by manufacturers of processor,
memory and I/O board products. However, in larger systems consisting of many
processors, the shared bus becomes a bottleneck which limits the performance of the
system, and extension of the bus is limited by electrical signalling considerations,
including bus driver capacity and signal integrity.
An alternative approach has been supported by a number of manufacturers in
the form of processors with point-to-point inter-processor communication links, most
notable of which is the Inmos Transputer, the first commercial processor to provide
this type of facility. A collection of such processors are connected together in a
network, with communication between processors performed only through the links.
With the use of communication links, the bottleneck of a shared bus is removed and
extension of the system is not limited by electrical signalling issues. Networks of any
size can be constructed from these devices. However, in larger networks there is an
unacceptably high overhead with the point-to-point communication link approach.
When a processor communicates with another processor which is not directly
connected to it via a link, the message must be passed through intermediate
processors. The processors at these intermediate nodes within the network must take
time out from performing useful tasks to accept and forward messages. A separate
message routing network, through which any processor can communicate with any
other, provides a solution to this problem by taking the burden of message routing
away from the processor. This approach has not been widely supported in the
commercial sector with the provision of routing devices, except in the case of the
Transputer, which has message routing chips for both the first and second generation
processor families.
The Transputer family provides processors with inter-processor communication
links and message routing devices, which are the building blocks required to put
together the type of network of processors required for future SMIS products.
However, the manufacturer of the latest processor of the Transputer family, the
T9000 [Shepherd92], [IEE94], experienced significant delays in producing a working
device, with subsequent late release of the product into the marketplace. This period
of delay has seen other manufacturers providing devices with inter-processor link
capabilities which have significantly higher performance than the T9000. These
problems eliminate it as a commercially credible contender. The older T800
generation processors in the Transputer family [Homewood87] have poor
performance compared to the latest generation of general purpose processors. In the
signal processing employed by SMIS, the floating point performance is the key
parameter. The quoted performance of the 30MHz T800 Transputer processor is
typically 4.3 Million Floating Point Operations per second (MFLOPs) for summation
and 1.9 MFLOPs for multiplication; the 50MHz TMS320C40 achieves 50 MFLOPs
for either operation. The problem with utilising the higher performance DSPs or
general processors is that none are supported with message routing devices, although
the two dominant high performance DSP devices, the Texas Instruments TMS320C40
and the Analog Devices SHARC, do provide on-chip serial links for building parallel
processing systems. The Transputer message routing chips are processor specific,
which makes their use with other processors difficult and potentially inefficient.
To utilise a number of high performance ‘commodity’ processors in a
heterogeneous system requires that a message routing network be implemented in a
processor-independent way. The engineering costs of designing semi-custom or
full-custom devices for this function are high, and with the low volume requirements of
SMIS, the part cost would set the overall signal processing sub-system cost at too
high a level for most customers. Re-programmable logic offers an alternative, which is
both cheaper in engineering cost and more accessible through the use of PC-based
development tools. In addition, its re-programmability introduces the option of
altering the function of the network router component during operation. An important
consideration with the use of re-programmable logic, however, is that it may not offer
the same performance as that achievable with custom logic.
The research undertaken by the author and described in this thesis addresses
two aspects of the future processing requirements in SMIS systems. The major part of
the work implements a message router device in a reprogrammable logic device to
determine what can be achieved in terms of functionality and performance. In parallel
with this, a video based measurement system is implemented to provide more detailed
information about one of the most demanding target applications and its processing
requirements.
1.2 Key Applications
The processing requirements for two key applications, fast medical imaging
and video gauging, provide the motivation for this research. These application areas
are described in detail below.
1.2.1 Fast Medical Imaging
Medical images can be produced by the interaction of biological tissue with a
number of different types of electromagnetic radiation. X-ray techniques produce a
shadow image resulting from the attenuation of the X-ray photons by the body,
relying on the contrast differences produced by variations in tissue density. Images
are produced using ultrasound by measuring the relative amounts of backscattered
signal from the subject. The Magnetic Resonance Imaging (MRI) technique employed
in SMIS systems uses the relative response of specific nuclei to absorbed radio
frequency energy. MRI is non-invasive, has a higher resolution than ultrasound
techniques and, unlike X-ray imaging, does not employ potentially hazardous ionising
radiation. The principles of MRI are presented in “Proton NMR tomography”
[Locher83].
The MRI technique is based on measuring the response of nuclei within the
subject which have an uneven atomic mass or uneven atomic number. These nuclei
possess an angular momentum or spin. The spin characteristic of the nucleus, a
charged particle, induces a magnetic field with an axis coincident with the axis of spin,
and a magnitude and direction represented by a magnetic moment. Normally, the
magnetic moments in a collection of nuclei will be randomly oriented, as dictated by
the principles of Brownian motion. When a static magnetic field is applied, these
magnetic dipoles tend to assume discrete orientations (parallel or anti-parallel) relative
to the applied magnetic field. The alignment of the magnetic moments of the nuclei
with the applied magnetic field is not perfect, resulting in the nuclei precessing around
the axis of the applied field with a precise frequency, the Larmor frequency. The net
magnetisation vector resulting from all nuclei in the sample is in an equilibrium state.
In order to measure information from the spins of the nuclei, they must be perturbed
or excited. The application of a short burst of radio frequency energy at the Larmor
frequency causes the nuclei to be deflected from their equilibrium orientation. As they
decay back to the equilibrium state, the energy released induces an RF signal which
can be detected by an RF receiver coil placed near to the sample.
Production of an MRI image relies on the use of a static magnetic field to align
the nuclei, a variable magnetic field gradient to encode spatial information on the
nuclei within the sample, and pulses of RF energy to stimulate the resonance of the
nuclei. The signal generated by the nuclei is detected by an RF receiver system and
passed through a digital signal processing system to produce an image. The static
magnetic field is produced with resistive, permanent or superconducting magnets. The
magnetic field gradients for spatial encoding are produced by three orthogonally
positioned gradient coils.
Spin Echo imaging [Kumar75] and Gradient Echo imaging [Haase86],
[Frahm86] are the two most common forms of conventional MRI used today. These
MRI methodologies are used as standard protocols for imaging of many pathologies
and body parts. Echo Planar Imaging (EPI) is a fast MRI technique which has several
advantages over the conventional spin echo and gradient echo techniques. EPI was
first presented in 1977 [Mansfield77] and is now an established clinical technique
offered by most of the major manufacturers (Philips, GE, Siemens and Picker). A
review article of this NMR methodology may be found in “Imaging by nuclear
magnetic resonance” [Mansfield88].
The EPI technique is capable of capturing a 2D planar image in as little as
20msec with a resolution approaching 2mm [Weisskoff90]. The basic philosophy is to
capture all the necessary data for the reconstruction of an image within a single
‘pulse-acquire’ experiment. This imposes severe demands on the gradient technology
in that high magnetic field gradient strengths must be achieved in very short times
(100-200µsecs). Innovations in gradient design have overcome these problems
[Mansfield86], [Mansfield87] to make EPI a valuable clinical tool.
When performing MRI using conventional imaging techniques, which require
several seconds to acquire an image, subject motion, involuntary or respiratory, can
cause severe artefacts (erroneous features) in the form of blurring within the image. In
the same way a high shutter speed on a camera is required to freeze fast motion, a
high speed MRI technique is required to overcome these problems. The condition for
elimination of these artefacts is that the time of acquisition must be much less than the
period of the unwanted motion. Conventional techniques can synchronise to periodic motion such
as the cardiac cycle and build up an image over many cycles by capturing data from
the same point in the cycle each time. However, this technique is subject to artefacts if
the cycle is not truly repetitive. This can occur in the heart, for example, where
arrhythmic beats are common. EPI, with its high image acquisition rate, eliminates
these issues and has been used successfully for many cardiac applications.
A relatively new application of MRI is the mapping of human brain function
[Turner95]. EPI is an ideal MRI investigative method for studies of human brain
activity because of its inherent high speed characteristics. However, studies presented
so far [Turner93] have been restricted to two-dimensional slices through the brain. A
variant of EPI, termed Echo Volumar Imaging (EVI) [Harvey96], acquires a
three-dimensional data set in the order of 100msecs. Capturing brain activity from a volume
rather than a plane has exciting prospects for human brain function studies, as the
entire brain may be scanned in a single experiment [Mansfield95].
Future exploitation of this technique will make high demands on the signal
processing system. The signal processing of the received MRI signal for
two-dimensional imaging consists of three tasks: filtering, two-dimensional Fast Fourier
Transform (2DFFT) and image scaling for display. SMIS currently implements all of
this processing for EPI using a single Intel i860 processor [Intel89] at a rate of 10
images/sec for a 128x128 image. The execution time of the 2DFFT processing
[Brigham74] performed on an $n \times n$ image is proportional to $2n^2 \log_2 n$; the other
tasks each take time proportional to $n^2$.
In general, the processing required for an $n \times n$ EPI image can be expressed as
$$\mathrm{EPI}_{n \times n} \propto 2n^2 \log_2 n + 2n^2$$
$$\therefore \mathrm{EPI}_{128 \times 128} = P \times 2^{18}$$
where $P$ is a constant relating to the performance of the processor and the data
acquisition rate.
The minimum EVI volume size required for practical brain function studies is
16x64x64. For a general EVI volume $m \times n^2$, the processing required can be
expressed as
$$\mathrm{EVI}_{m \times n \times n} \propto 2mn^2 \log_2 n + mn^2 \log_2 m + 2mn^2$$
$$\therefore \mathrm{EVI}_{16 \times 64 \times 64} = P \times 4.5 \times 2^{18}$$
Given recent advances in gradient technology, it is not unreasonable to assume
that a 64 x 64 x 64 array size would be used in the future. Peripheral nerve stimulation
caused by rapid gradient switching may impose a limit on the maximum allowable
array size for this type of experiment, but this has yet to be determined. For a
64 x 64 x 64 EVI image, the processing required can be expressed as
$$\mathrm{EVI}_{64 \times 64 \times 64} \propto 2 \cdot 64^3 \log_2 64 + 64^3 \log_2 64 + 2 \cdot 64^3$$
$$\therefore \mathrm{EVI}_{64 \times 64 \times 64} = P \times 20 \times 2^{18}$$
From this analysis, it can be seen that, for the same image update rate, using
identical processors in all cases and assuming that there is a linear scaling of
performance with number of processors, EVI currently requires over four times the
number of processors needed for EPI and in the future is likely to require up to
twenty times the number of processors.
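The scaling figures quoted in this analysis can be checked numerically. The following is a minimal sketch rather than code from the thesis: the function names `epi_ops` and `evi_ops` and the constant `UNIT` are illustrative assumptions, and the processor constant P is omitted because it cancels when ratios are taken.

```python
from math import log2


def epi_ops(n):
    """Relative work for one n x n EPI image (2DFFT plus filtering and scaling)."""
    return 2 * n**2 * log2(n) + 2 * n**2


def evi_ops(m, n):
    """Relative work for one m x n x n EVI volume, using the expression in the text."""
    return 2 * m * n**2 * log2(n) + m * n**2 * log2(m) + 2 * m * n**2


# Work for one 128 x 128 EPI image, used as the unit of comparison (assumed name).
UNIT = epi_ops(128)

print(UNIT == 2**18)               # True
print(evi_ops(16, 64) / UNIT)      # 4.5
print(evi_ops(64, 64) / UNIT)      # 20.0
```

Under the stated assumption of linear scaling with processor count, these ratios correspond directly to the "over four times" and "twenty times" processor estimates.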
1.2.2 Video Gauging
Railtest run several test vehicles to monitor the infrastructure of the railway,
including the Structure Gauging Train (SGT) which measures the closest approach of
structures to the vehicle [Edworthy86]. This data is used in conjunction with the
known kinematic envelope of a vehicle to ensure that there is a safe/operational
clearance of the vehicle along any particular route.
The measurement of the distance of the structures from the train is achieved
with the use of a fan-beam of light which is orthogonal to the longitudinal axis of the
vehicle. The SGT is operated at night to ensure adequate contrast between the light
from the fan-beam and other sources of light. Calibrated video cameras observe where
the fan-beam hits a structure, from which the nearest approach distance is obtained
using geometric expressions (Photograph 1*).
Photograph 1 - Light Fan-Beam From the Structure Gauging Train
The processing of the video data is performed in simple dedicated hardware. Six
camera pairs provide coverage all around the vehicle, with restricted resolution in the
roof area. Each pair of cameras is coaxial and views the same section of the fan-beam
from opposite sides of the beam. The video image from each camera is passed
through a hardware threshold process to produce an image with two states, black and
white, which are represented by logic levels 0 and 1 respectively. The threshold level
is set to ensure that the light reflected from the fan-beam is above the threshold,
producing white areas in the image, and background sources of light are below the
threshold level to produce black areas in the image. The threshold process includes
hysteresis to reduce noise at the transitions between black and white.
The threshold-processed images from each camera pair are logically combined 
with an AND operation to form the final image from which clearance measurements 
are made. To appear in the final image, any areas o f white light must be in the same
1 Photograph by kind permission o f Serco Railtest
10
position in both cameras. This processing eliminates the majority o f  spurious light 
sources like, for example, a street lamp which cannot be seen by both cameras 
simultaneously.
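The threshold and AND stages described above are implemented in hardware on the SGT; the following Python sketch is only an illustration of the same two steps. The list-of-lists image representation, the function names and the two threshold values `low` and `high` are assumptions made for the example, and the two camera images are assumed to be already registered pixel for pixel.

```python
def threshold_with_hysteresis(line, low, high):
    """Binarise one video line of grey levels into 0 (black) and 1 (white).

    The output switches to white only when the level reaches `high` and
    back to black only when it falls below `low`, so small fluctuations
    around a single threshold do not toggle the output.
    """
    out = []
    state = 0  # assume each line starts in the black state
    for level in line:
        if state == 0 and level >= high:
            state = 1
        elif state == 1 and level < low:
            state = 0
        out.append(state)
    return out


def combine_pair(image_a, image_b):
    """Logical AND of two binary images of equal size."""
    return [[a & b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(image_a, image_b)]


if __name__ == "__main__":
    print(threshold_with_hysteresis([10, 60, 45, 45, 30, 70], low=40, high=50))
    print(combine_pair([[0, 1, 1, 0]], [[0, 1, 0, 1]]))
```

In the second call only the pixel that is white in both views survives, which is the property the text relies on to reject a lamp seen by one camera alone.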
Camera line scanning is arranged such that, for all cameras, each line starts
near the vehicle and scans outwards from it. With this arrangement, the
first illuminated pixel on each video line corresponds to the nearest illuminated
structure to the train.
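As a minimal sketch of this step, assuming a binary image held as a list of rows in which index 0 is the pixel nearest the vehicle, the first illuminated pixel on each line can be found as follows. The function name and the use of `None` for a line with no illuminated pixel are choices made for the illustration, not details taken from the SGT hardware.

```python
def nearest_pixel_per_line(binary_image):
    """For each video line, return the index of the first pixel set to 1.

    Lines with no illuminated pixel yield None. Converting a pixel index
    into a distance needs the camera calibration, which is not modelled.
    """
    result = []
    for row in binary_image:
        result.append(row.index(1) if 1 in row else None)
    return result


if __name__ == "__main__":
    image = [
        [0, 0, 1, 1],
        [0, 0, 0, 0],
        [1, 0, 1, 0],
    ]
    print(nearest_pixel_per_line(image))
```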
Measurements must be referenced to the rails, which form the common reference
frame used for all vehicle gauging. The cameras are mounted on a sprung vehicle,
which requires that vehicle attitude corrections are applied to the measurements using
data from sensors attached to the suspension system.
The processing, as described above, is very simple and is implemented in
hardware, but there are problems associated with some of the operational conditions
commonly encountered on the railway network. The logical AND process of the
images from pairs of cameras eliminates most spurious sources, but there are
situations where two independent sources of light can appear at the same position in
both cameras. Over-bridges are common at the ends of stations and it is not unusual
to have signal lamps beyond the over-bridge. The station lighting will be seen by one
camera, and the signal lamp will be seen by the other, when the test train is under the
bridge. These spurious light sources can lead to false nearest approach measurements
for the bridge (Figure 1).
[Figure: plan view of the track; camera 1 view (with signal lamps); camera 2 view (with station lamp); ANDed image; nearest approach profile]
Figure 1 - False Nearest Approach Result
One of the structures which gets closest to the train is a platform, and therefore
an accurate measurement of its position is essential. To measure the position of the
platform, a separate camera is used, which views the area in which the light beam will
hit the platform. The processing of the image is similar to that used for the main
gauging function. It is known that the edge of the platform will appear within a small
range of video lines. The video line in this range which has the minimum time to the
first ‘on’ pixel, or leftmost pixel, is used for calculating the nearest approach of the
platform. Figure 2 shows the video image obtained from the fan-beam of light hitting
a platform.
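The selection rule just described can be sketched as follows. This is an illustration only: the image is assumed to be a list of binary rows, the range of candidate video lines is passed in as the hypothetical parameters `first_line` and `last_line`, and pixel index stands in for the time to the first ‘on’ pixel.

```python
def platform_edge_line(binary_image, first_line, last_line):
    """Pick the video line whose first 'on' pixel is leftmost.

    Only lines first_line..last_line (inclusive) are examined. Returns a
    (line_index, pixel_index) pair, or None if no line in the range
    contains an 'on' pixel.
    """
    best = None
    for idx in range(first_line, last_line + 1):
        row = binary_image[idx]
        if 1 not in row:
            continue
        col = row.index(1)
        if best is None or col < best[1]:
            best = (idx, col)
    return best


if __name__ == "__main__":
    image = [
        [0, 0, 0, 0, 1],
        [0, 0, 0, 1, 1],
        [0, 1, 1, 1, 1],
        [0, 0, 1, 1, 1],
        [1, 1, 1, 1, 1],  # outside the expected range, so ignored
    ]
    print(platform_edge_line(image, 1, 3))
```

Restricting the search to the expected range is what keeps line 4 in the example from being chosen, even though its first ‘on’ pixel is further left.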
Figure 2 - Platform Edge Image
The reflection from the surface of the platform forms most of the white area in
the image. The vertical face of the platform edge, which is close to the train, reflects
the light beam to produce a discernible ‘knee’ in the image, seen at the left of the
white area.
One camera is used on each side of the vehicle. The camera is angled down to
view the platform, which eliminates the problem of spurious light sources. A problem
arises with this system, however, when a wet platform is well illuminated by station
lamps. This situation produces images which are incorrectly interpreted by the simple
processing. Figure 3 shows that the normal camera image, as illustrated in Figure 2, is
grossly distorted by the light reflected off the wet platform surface. In this image the
leftmost pixel is no longer the edge of the platform.
To overcome the problems encountered with the simple video processing
currently implemented on the SGT, both the shape and the motion of objects within
the image must be used as rejection criteria for eliminating spurious light sources,
and platform flare must be recognised so that the platform edge detection algorithm
can be altered accordingly.
Figure 3 - Wet Platform Edge Image
Data recording is restricted on the SGT to storage of the worst case nearest
approach for each 5m segment of track. The data is available at 50 frames per second,
the frame rate of the video cameras. At 40mph, the maximum measuring speed of the
vehicle, this corresponds to data at intervals of approximately 350mm. This
data, currently thrown away within each 5m segment, is a valuable source of
information for maintenance purposes. Data stored at frame rate from several
recording runs throughout the year would provide civil engineers with a means of
determining if a structure is moving or collapsing. In addition to maintenance issues, a
single spurious light source appearing closer than the true position of a structure will
mask the true data from a whole 5m section of track.
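The spacing quoted above follows directly from the speed and frame rate, and the short calculation below reproduces it. The only value not taken from the text is the conversion factor of 1609.344 m per mile; the function names are illustrative.

```python
MILE_M = 1609.344  # metres per statute mile


def frame_spacing_mm(speed_mph, frames_per_second):
    """Distance travelled along the track between successive video frames."""
    speed_m_per_s = speed_mph * MILE_M / 3600.0
    return speed_m_per_s / frames_per_second * 1000.0


def frames_per_segment(segment_m, speed_mph, frames_per_second):
    """Number of frames captured while the vehicle covers one segment."""
    return segment_m * 1000.0 / frame_spacing_mm(speed_mph, frames_per_second)


if __name__ == "__main__":
    print(round(frame_spacing_mm(40, 50), 1))       # spacing in mm at 40mph
    print(round(frames_per_segment(5, 40, 50), 1))  # frames per 5m segment
```

At 40mph the spacing is a little under 360mm, so roughly fourteen frames fall within each 5m segment, of which the present system retains only the worst case.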
New requirements for gauging which cannot be met by the current SGT systems
restrict the commercial exploitation of this proven measurement technique. Larger
vehicles, like Channel Tunnel Freight, are now in service with associated structures
which are outside the measurement range of the SGT. The resolution of the current
measurement system is also too low in some areas around the vehicle. Increasing the
measurement range and providing a uniform resolution around the vehicle will require
more cameras. The real-time processing of data is highly desirable for safety reasons.
Exceedances of safe clearance limits can be annunciated to the SGT staff and
appropriate action taken immediately to close the line.
The SGT was developed in the early 1980s and came into service in 1986
[Crocker93]. The electronics and computers used are obsolete and must be updated
to ensure that the vehicle remains operational. To implement the image
processing necessary to eliminate the majority of spurious light sources, in real-time,
with additional cameras, and to store the data at frame rate will require a high
performance embedded parallel processing system.
1.3 Research Objectives
The purpose of this research has been to investigate a proposal made by the
author to implement a high performance, heterogeneous, embedded parallel
processing environment which can meet the demands of a number of real-time
application areas, including those described above. The proposed system is based on
the message-passing paradigm of parallel processing, utilising general purpose
processors connected through a separate message-passing network which is
implemented using programmable logic devices.
Two key issues which establish the validity of the proposal, and which have
prompted this research, are whether the real-time performance from a number of
general purpose processors is sufficient to meet the processing demands of the target
applications, and whether the use of programmable logic to implement the
building-block inter-processor communication network is feasible.
1.3.1 Evaluating the Required Performance
A significant problem in evaluating the expected performance of the proposed
system was that none of the real-time applications for which the system is targeted had
been developed at the start of the project. Algorithm requirements were therefore not
available a priori to enable realistic estimates of processor performance requirements
to be calculated. To this end, the first objective of this research was to gain more
knowledge of the primary applications in a practical way by developing one of the
image processing algorithms, running it in real-time on a number of processors, and
providing a means of recording raw data. This provided a means of measuring the
performance of the system under realistic conditions, and the raw data necessary to
undertake algorithm development.
The implementation of this application required the construction of an
embedded parallel processing system, which had the secondary benefit of providing a
test platform for future investigations of architectural, algorithmic and inter-processor
communication network issues.
1.3.2 Implementing a Message Routing Network
Provision of a message routing network is an essential element of the embedded
parallel processing system. It has been proposed by the author that the message
routing network be implemented with programmable logic devices, which have
several advantages over a custom silicon approach, including lower engineering costs,
lower part cost for small volumes, and the feature of reprogrammability.
Reprogrammability offers the potential to implement a building block approach
to network design to provide application-specific network topologies within an
unchanged hardware configuration. In the building block approach to network design,
the message router is constructed from a library of standard routing functions.
Physical connections of processors and message routers within the network are fixed,
but the internal configuration of the message routers is tailored for each application
using routing functions from the library. The configuration of the network nodes is
loaded into the programmable logic at application run time and can be reloaded with a
different configuration for each application. The idea of using a library of parts for
routing functions can be extended to the use of a library of processor interfaces for a
range of processors, providing a route to the implementation of heterogeneous
networks.
The second objective of this research has been to demonstrate that a high
performance message routing network can be implemented using programmable logic
devices. To achieve this objective, the development of a demonstration network has
been undertaken to establish whether the device logic density and the achievable
system speed are sufficient to implement a high performance network. The building
block approach to network design has also been explored.
1.4 Thesis Outline
Chapters 2 and 3 introduce message routing networks and programmable logic
respectively to provide a foundation for describing the research undertaken by the
author. Chapter 2 looks at parallelism in computer systems and parallel architectures.
Communication networks for parallel computers are discussed with particular
reference to the packet-switched direct network approach adopted for the message
routing node design described in this thesis. Chapter 3 describes programmable logic,
including the techniques used to achieve programmability. The major application areas
of these devices are highlighted, including current research activity, and the use of
programmable logic for message routers is discussed.
The ROLIN programme is a joint DTI/Railtest funded project concerned with
positioning, processing and database issues encountered in the monitoring of railway
infrastructure. Part of the research undertaken by the author is linked with the video
processing topic of the ROLIN programme. Chapter 4 describes the development of a
real-time video processing application and its associated processing platform, which
includes a data acquisition system developed by the author. Processing of images from
cameras arbitrarily aligned to the vehicle is explored in a proposal made by the author
to eliminate the critical camera alignment procedure currently necessary to calibrate
the system.
Chapter 5 provides a high level description of the architecture and operation of
the message-passing network developed by the author. The flexibility of
programmable logic is discussed and reconfigurable networks based on a building
block approach are proposed. Deadlock and livelock issues for the verification of the
network router design are dealt with in this chapter and the software support for the
verification process is described.
Chapter 6 gives a detailed description of the network routing node design.
Verification of the correct operation of the network router node was achieved
through simulation of the design in a fully populated network configuration. This was
followed by the implementation of an eight node network in hardware. A full
description of the hardware and software used in the process of checking the
operation of the eight node network is provided. Simulation and experimental results
from the verification process are presented.
Chapter 7 summarises the results of the research described in this thesis and
suggests some avenues for further work.
Chapter 2 
Parallelism in Computers
2. Parallelism in Computers
This chapter provides a brief introduction to the use of parallelism in computers,
looks at a classification scheme for parallel processing system architectures and
describes the most common approaches used to implement concurrent computers.
A major part of the research described in this thesis is concerned with the
implementation of a communication network using FPGAs for use in a parallel
processing system. A description of the major features of communication networks
for concurrent computer systems is presented.
2.1 The Exploitation of Parallelism in Computer Engineering
Parallelism has been utilised in all generations of electronic computers since
their birth a little over 50 years ago. Parallelism manifests itself in many forms, from
the operations performed on all of the bits of an operand simultaneously, to the use of
many processors in the construction of Massively Parallel Processing systems. In
processor design, provision of multiple functional units is a common form of
parallelism used to improve performance. This can either be in the form of pipelining,
exploiting temporal parallelism, or in the provision of separate functional units, which
takes advantage of spatial parallelism.
Pipelining is used in processing vector quantities where the same set of
operations is performed on every element of the vector. In many DSP algorithms, for
example, a multiply/accumulate (MAC) function is performed across the whole of a
vector. With a multiplier and an arithmetic unit arranged in a pipeline, the
performance of the MAC operation on large vectors can be improved
to almost double for the same clock rate and the same
memory bandwidth. The actual performance achieved will be less than double because,
in a pipelined processor, cycles are required to fill and empty the pipeline. Unless
these cycles can utilise idle periods on the bus, the performance of the pipeline will be
degraded. To ensure that maximum benefit is obtained from the pipeline approach, it
is therefore necessary to use vectors which are much longer than the pipeline,
reducing the significance of the pipeline fill/empty latency, and the operations
performed must use all of the functional units in the pipeline. The pipelining technique
is also generally used in the design of the control unit of a processor to pipeline the
typical phases of each instruction: instruction fetch, decode, execute and write back.
The principle of pipelining can be extended further to provide superpipelined
processors in which pipelines operate at a sub-multiple of the base cycle time of the
processor. Examples of pipelined processors include the AT&T DSP32C [AT&T89],
the Intel i860 [Intel89] which uses a pipelined control unit, the CRAY 1 [Russell78]
which utilises pipelines in each of its twelve functional units, and the FPS AP120
[Hockney88a] attached processor which uses multiple pipelined arithmetic units.
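The effect of the fill and empty cycles on the two-stage MAC pipeline discussed above can be illustrated with a simple idealised model. It assumes that every stage takes one cycle, that an unpipelined unit needs one cycle per stage for each element, and that no cycles are lost to memory or bus contention; these are assumptions of the sketch rather than measurements of any of the processors cited.

```python
def pipeline_speedup(n_elements, n_stages):
    """Ideal speedup of a pipeline over performing the stages in sequence.

    Sequential execution takes n_elements * n_stages cycles. The pipeline
    needs n_stages cycles for the first result and one further cycle for
    each remaining element.
    """
    sequential = n_elements * n_stages
    pipelined = n_stages + n_elements - 1
    return sequential / pipelined


if __name__ == "__main__":
    for n in (1, 9, 999):
        print(n, pipeline_speedup(n, 2))
```

Under these assumptions a single element gains nothing, a vector of 9 elements reaches a factor of 1.8, and a vector of 999 elements reaches 1.998, which shows why vectors much longer than the pipeline are needed to approach the factor of two.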
For scalar processing, the technique of providing multiple functional units can
also be employed to improve performance. These superscalar processors are designed
to exploit instruction level parallelism. Multiple instruction streams are used to issue
an instruction to each functional unit on every processor cycle. The control unit for
this type of processor is more complex than for the pipelined processor and requires
careful code generation to ensure that optimum use is made of the separate units. The
Motorola DSP96002 processor [Motorola90] demonstrates the potential of this
approach; in a single instruction cycle, a floating point multiply, a floating point
addition and a floating point subtraction can be performed. In addition to these
mathematical operations, the control unit can, during the same instruction cycle,
decode the next instruction and transfer two register values. The DEC Alpha
processor [DEC92] and MIPS R4000 [NEC91] are two recent examples of
superpipelined superscalar processors, combining both the superpipeline and
superscalar techniques in the design of the processor to provide very high
performance.
An alternative approach taken to exploit instruction level parallelism is
implemented in Very Long Instruction Word (VLIW) processors [Ebcioglu89]. With
this technique, multiple instructions are issued in each instruction cycle using a very
long instruction word. In VLIW processors, the instruction level parallelism is
detected at compilation time, rather than at execution time. The primary advantage of
this technique is that the compiler can operate over the entire program, given an
appropriate approach such as trace-scheduling [Nicolau84], and therefore exploit
more parallelism than the superscalar approach, which must detect parallelism
on-the-fly over a limited window on the program. The Texas Instruments TMS320C6x
family of DSPs uses the VLIW technique to provide a processor with six arithmetic
units and two multiplier units which achieve 1.6 billion operations per second at
200MHz [TI97].
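The quoted peak figure is consistent with every functional unit completing one operation per clock cycle. The check below makes that assumption explicit; it is an arithmetic illustration, not a statement about sustained performance on real code.

```python
def peak_ops_per_second(functional_units, clock_hz):
    """Peak rate if every functional unit completes one operation per cycle."""
    return functional_units * clock_hz


if __name__ == "__main__":
    arithmetic_units = 6
    multiplier_units = 2
    print(peak_ops_per_second(arithmetic_units + multiplier_units, 200_000_000))
```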
The operation of processors can be further enhanced with the use of other
techniques. Fast memory in the form of data and instruction caches hides the slower
access times of main memory from the processor. Branch prediction, where the result
of an up-coming branch instruction is predicted and the code from the predicted path
pre-fetched [Lilja88], is used to keep pipelined control units active and prevent
pipeline stalls. Out-of-order instruction scheduling techniques are used to hide
latencies associated with instruction dependencies and keep all of the functional units
within a processor busy. The use of multiple-context, or multithreaded
[Saavedra-Barrera90], processors, which have a mechanism for rapidly switching between the
contexts of concurrently executing threads, has been used as a means of hiding the
latency of any accesses to memory which produce a cache-miss [Dennis94].
Despite the performance improvements that can be made to a single processor
with the use of parallelism and the continued improvement in process geometries
which provide ever denser and faster devices, there remain applications which require
performance beyond that available from a single processor. For these applications, it is
necessary to replicate processors in some form of parallel processing system to
obtain the required performance. A computer system classification scheme is
introduced in the following section to assist in the description of parallel architectures.
2.2 Computer Architecture Classification
Flynn introduced a classification system [Flynn72] for computer architectures
using the concept of defining the architecture based on the number of instruction and
data streams operational in the system. This approach, although broad in its
distinctions between different architectural groups compared to other classification
schemes [Anderson75], [Hockney88], provides a useful starting point for describing
computer architectures and is cited extensively in the literature.
There are four classes of computer architecture defined in Flynn’s classification
scheme, as depicted in Figure 4.
[Figure: two-by-two grid of the classes SISD, SIMD, MISD and MIMD]
Figure 4 - Flynn’s Taxonomy
Computers in the Single Instruction Single Data (SISD) architectural class have
a single execution unit and a single memory unit. Conventional microprocessor
devices such as the Intel 8086 [Rector80] fall into this category. Parallelism is
exploitable within the processor of this class of computer, through the use of
pipelining for example, but SISD computers are the only machines in Flynn’s
taxonomy which do not represent what are generally referred to as parallel
computers.
Single Instruction Multiple Data (SIMD) computers use a single control unit
which issues instructions simultaneously to an array of processing elements operating
in lock-step. The array processing elements may share memory, as in the BSP machine
[Kuck82], or, more commonly, implement a distributed memory model with each
processing element having local memory, as in the Illiac IV [Bouknight72], the
Goodyear MPP [Batcher80] and the Connection Machine CM2 [Tucker88] machines.
SIMD machines are specifically designed to perform the same operation over all
elements of a vector quantity and are therefore not generally considered suitable for
general purpose applications. A special case of the SIMD architecture is the
algorithm-specific systolic array [Kung82], [McCabe87], which consists of a number
of nearest neighbour connected processors, each of which performs the same
operation. The array of processors is arranged in a manner which matches the
algorithm, and data flows through the structure from input at one or more edges to
output at another edge in a pipelined fashion; the term systolic is derived from the
analogy of blood flowing through the heart. The Configurable, Highly Parallel
computer (CHiP) design [Snyder82] extends the systolic array principle of
algorithmically specialised interconnection by using configurable switches for the
interconnect to provide a more flexible machine.
The Multiple Instruction Single Data (MISD) classification is derived as one of
the four permutations of the criteria selected by Flynn. No system architecture falls
obviously into this classification, although pipelines and systolic arrays are
conceptually close to MISD.
Most general purpose parallel processors fall into the Multiple Instruction
Multiple Data (MIMD) architectural category. The collection of many different
architectures into a single grouping is the main criticism of Flynn’s scheme. Bell has
provided a hierarchical taxonomy of MIMD computers [Bell92] to aid in the
distinction between the diverse architectural styles within the MIMD group. Bell’s
classification of MIMD computers (Figure 5) is based on how the memory is
organised: into a single address space or as multiple address spaces. This leads to two
major classes of MIMD machines: multiprocessors, which have a single address space
and implement shared memory computation, and multicomputers, which have multiple
address spaces and perform message-passing computation.
[Figure: taxonomy tree. MIMD divides into multiprocessors (single address space, shared memory computation) and multicomputers (multiple address space, message-passing computation). Multiprocessors divide into distributed memory multiprocessors (scalable) and central memory multiprocessors (not scalable); multicomputers divide into distributed multicomputers (scalable) and central multicomputers.]
Figure 5 - Bell’s Taxonomy of MIMD Computers
Each of these classes is further sub-divided into machines which have central
memory and those which have distributed memory. This division is important for the
multiprocessor class, as it distinguishes scalable machines from those which are not
scalable, but is less useful for multicomputers which, to date, are exclusively in the
distributed memory class of machines.
2.2.1 Multiprocessors
Multiprocessor systems are constructed from a number of processors connected
to shared memory through an interconnection network (Figure 6). The memory
appears as a (logically) single memory address space to all processors.
Figure 6 - Processor to Memory Architecture
The interconnection of multiple processors to shared memory can be achieved in
several ways. The simplest approach is to use a bus-based interconnection scheme
where all memory accesses are performed over the same bus. A significant limitation
of bus-based systems is their lack of scalability; as the number of processors increases
beyond the capacity of the bus, the bus becomes saturated, introducing a significant
bottleneck into the system. Multiple-bus solutions have been adopted to overcome
this limitation, as, for example, in the Stanford DASH architecture [Lenoski92] which
uses clusters of processors. Each cluster consists of a small number of processors and
a local memory which share a bus. Multiple clusters connect together via a higher
level system interconnect.
Crossbar interconnection schemes, where n processors are connected to m
memories via an n × m crossbar switch, provide high communications performance, but
are relatively expensive and are typically limited to a small number of processors. The
C.mmp computer developed at Carnegie-Mellon University [Wulf72] was an early
system exploiting crossbars, and comprised 16 DEC PDP-11 minicomputers
connected to 16 memory modules by a 16 × 16 crossbar switch.
Multistage interconnection network (MIN) architectures use multiple stages of
switches to form a pathway between processor modules and memory modules. A
large number of MINs have been studied and several commercial systems utilise this
technique. Representative systems include the NYU Ultracomputer [Gottlieb83] and
the Cedar system [Konicek91], which both use Omega networks, and the Texas
Reconfigurable Array Computer (TRAC), which uses a Banyan network
[Sejnowski80].
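The cost difference between the crossbar and multistage approaches can be illustrated by counting switching elements. The sketch below assumes the usual textbook construction of an Omega network from 2 × 2 switches, with log2(N) stages of N/2 switches for N ports; that construction is not spelled out in the text above, and the function names are illustrative.

```python
def crossbar_crosspoints(n_processors, n_memories):
    """Number of crosspoints in an n x m crossbar switch."""
    return n_processors * n_memories


def omega_switches(n_ports):
    """Stages, 2x2 switches per stage and total 2x2 switches.

    Assumes an Omega network built from 2x2 switches, which requires the
    number of ports to be a power of two.
    """
    if n_ports < 2 or n_ports & (n_ports - 1) != 0:
        raise ValueError("n_ports must be a power of two, at least 2")
    stages = n_ports.bit_length() - 1  # exact log2 for a power of two
    per_stage = n_ports // 2
    return stages, per_stage, stages * per_stage


if __name__ == "__main__":
    print(crossbar_crosspoints(16, 16))  # the C.mmp arrangement
    print(omega_switches(16))
```

For 16 ports the crossbar needs 256 crosspoints, whereas the Omega network under this assumption needs 32 two-by-two switches arranged in four stages, at the price of a path that passes through every stage.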
Another technique, the Cache-only architecture, is implemented in the Data
Diffusion Machine (DDM) [Hagersten92]. The DDM has no physically shared
memory and implements the distributed system memory in a manner similar to large
second-level caches which are attached to each processor. As a consequence of the
cache-only memory, the partitioning of the data within the memory of the system is, in
general, not static. The implementation of the cache-only memory in the DDM
attracts the data used by each processor to the attached second-level cache memory,
which, because of this action, has been called the attraction memory. The Kendall
Square Research KSR1 machine [Gottlieb92] also has no physically shared memory,
instead featuring an ALLCACHE memory, consisting of large second-level caches
backed up by disks.
Providing access to shared memory for many processors introduces some
significant challenges to the system designer [Dubois88]. Shared memory
architectures must provide access synchronisation mechanisms between processors to
ensure that corruption does not occur as a result of simultaneous access to the same
memory location. In addition, where memory access performance is improved with
the use of cache memory, it is important to ensure that copies of the same data object
in multiple caches are properly updated using a suitable cache coherency policy
[Stenstrom90]. Both hardware-based and software-based strategies have been
proposed to solve the cache coherency problem. The hardware-based snooping cache
coherency technique is commonly used for smaller bus-based systems, where every
cache controller and the main memory controller listen in to cache consistency
commands which are broadcast on a common bus. For larger systems, this approach
can lead to saturation of the communication infrastructure. This can be overcome
with the other commonly used technique, the directory-based coherency
scheme, where only the affected caches are involved in the cache-consistency
communications. Software-based cache-coherency strategies generally fall into one of
two main categories: static schemes, where the coherency is managed at compile time,
and dynamic schemes, in which consistency of caches is maintained at run-time.
2.2.2 Multicomputers
The multicomputer paradigm utilises multiple computers or nodes
interconnected through a message-passing network (Figure 7).
Figure 7 - PE-to-PE Architecture
Each node, commonly referred to as a Processing Element (PE), is an
autonomous unit consisting of a processor, memory and in some cases devices such as
a disk drive or Input/Output interfaces. The node will also have an interface to the
message-passing network, which may be integrated onto the same die as the
processor or implemented in a separate device.
The inter-processor connection scheme can either consist of simple
point-to-point links between nodes, which require processes running at each node to route
messages, or can be implemented as a separate hardware-based message routing
network. Hardware support for inter-processor communication removes the burden of
message routing from the processors and leads to reduced communication latency.
Reducing the latency of communication sufficiently enables a multicomputer to
exploit fine-grained parallelism, in which processors work with just a few instructions
at a time from each task. The advantage of this is that the maximum available
parallelism may be exposed. In the pioneering Cosmic Cube multicomputer [Seitz85],
messages were transported between nodes using store-and-forward routing (Section
2.4.1) along the point-to-point links of a hypercube. Second generation
multicomputers, such as the Intel iPSC, introduced hardware support for message
routing [Trew91] to significantly reduce the communication latency. The performance
of the message-passing in these systems dictated that they were only able to exploit
medium-grained parallelism [Athas88]. In third generation machines like the
J-machine [Dally92], communication overheads are reduced further to enable
fine-grained parallelism to be exploited and to expose the maximum available parallelism.
A number of commercial computer systems are based on the multicomputer
paradigm, including the SPARC and µVP based Meiko Computing Surface CS-2
[Meiko93], the Transputer based Parsytec GC-5 [Langhammer92] and the DEC
Alpha based Cray T3D [Koninger92]. The Transputer has been instrumental in
promoting the use of parallel processing in embedded environments. The Transputer,
being a single, inexpensive processor with integrated communication engine, makes
the construction of embedded parallel systems a relatively simple task. The success of
the Transputer has also prompted Texas Instruments and Analog Devices to produce
a range of signal processing devices, the C40 and SHARC families respectively, with
integrated processor and communication link engines for use in the construction of
embedded parallel digital signal processing systems. There are a large number of
vendors with modular products available to construct embedded systems from these
processor families.
2.2.3 Dataflow and Reduction Systems
Two notable classes of computer systems do not fit well into the computer
architectural classification schemes presented in the previous sections although, as
Duncan states [Duncan90], they follow the principles of MIMD operation, having
multiple instruction and data streams. Dataflow computers take a data-driven
approach to achieve fine-grain computation and Reduction computers implement
either a data-driven or demand-driven approach.
The dataflow paradigm [Gurd86] is based on executing an instruction when the
data required for the operation becomes available, departing from the conventional
control-flow approach in which instructions are fetched from a sequence of
instructions in memory under control of an instruction counter. The control-flow class
of computers includes those machines which perform out-of-order execution, branch
prediction and speculative execution to improve performance, but none of these
techniques alters the fundamental control-flow approach. In the dataflow approach,
the execution of an instruction depends solely on the availability of the data required
for the operation. The operation of the machine can be visualised with the aid of a
data flow graph (DFG) consisting of nodes, which are actors representing functions,
and arcs, which connect the nodes and indicate data dependency between functions. A
simple DFG for the function z = (a + b) × (c − d) is shown in Figure 8.
a
Figure 8 - A Simple Data Flow Graph
The presence of a value on an arc is indicated by the presence of a token. When
an actor has a token on each of its input arcs, it executes or fires, consuming the input
tokens and producing a token on its output arc. Early work [Dennis80] was based on
the static dataflow model in which only a single token can occupy each arc on the
DFG at any instant, and an actor is prevented from firing if a token exists on its
output arc. This restriction can limit the amount of parallelism achievable when a
subgraph of the DFG is used as a procedure which is called from several places
[Arvind91]. In the dynamic dataflow model, there can be more than one token on an
arc at any one time. Multiple instantiations of a subgraph will generate a set of tokens
on each arc. It is necessary to tag tokens to provide a mechanism for tokens from the
same instantiation to be matched and fire the actor. An example of this tagged-token
dynamic dataflow approach is the Manchester Dataflow Machine [Gurd85]. The
complexity of matching these tagged tokens is one of the main criticisms of the
dynamic dataflow approach [Gajski82]. The Explicit Token Store extension of the
MIT Tagged-Token Dataflow Architecture has addressed this problem with the use of
explicitly addressed token matching locations or rendezvous points in memory
controlled by a presence/empty bit [Papadopoulos91]. The rendezvous points for a
code block are grouped together in an Activation Frame, pointed to by the token tag
and managed by the compiler. The effect of the activation frame is to eliminate the
need to check every token against all other waiting tokens for a match to determine if
an actor should fire.
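The static firing rule described above can be sketched in a few lines of Python. This is an illustrative model only, not a description of any of the cited machines: the actor table, the arc names t1 and t2, and the function run_static_dataflow are assumptions introduced here to encode the graph for z = (a + b) × (c - d).

```python
import operator

# Hypothetical encoding of the graph for z = (a + b) * (c - d).
# Each actor maps to (function, input arcs, output arc).
ACTORS = {
    "add": (operator.add, ("a", "b"), "t1"),
    "sub": (operator.sub, ("c", "d"), "t2"),
    "mul": (operator.mul, ("t1", "t2"), "z"),
}


def run_static_dataflow(initial_tokens):
    """Fire actors until none is enabled; each arc holds at most one token."""
    arcs = dict(initial_tokens)  # arc name -> token value
    fired = True
    while fired:
        fired = False
        for fn, inputs, output in ACTORS.values():
            inputs_ready = all(arc in arcs for arc in inputs)
            output_free = output not in arcs  # static-model restriction
            if inputs_ready and output_free:
                operands = [arcs.pop(arc) for arc in inputs]  # consume tokens
                arcs[output] = fn(*operands)  # produce the output token
                fired = True
    return arcs


print(run_static_dataflow({"a": 1, "b": 2, "c": 7, "d": 3}))  # {'z': 12}
```

The order in which enabled actors fire is irrelevant to the result; only the availability of tokens matters, which is the property that distinguishes this model from control flow.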
In the reduction machine paradigm [Treleaven82] nested expressions define the
program. A machine which evaluates the program from the outermost level by forcing
the execution of further instructions or procedures as required is a demand-driven
reduction machine. The advantage of this approach is that instructions are only
executed as they are required. A machine which evaluates the program from the
innermost level will have all expressions evaluated whether they are required or not,
exactly as occurs in dataflow machines; such machines are thus termed data-driven
reduction machines.
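The difference between the two evaluation orders can be illustrated with a small Python sketch. It assumes a conditional expression whose sub-expressions are wrapped as callables; the helper names (make_leaf, demand_driven, data_driven, trace) are hypothetical and serve only to record which sub-expressions are actually reduced.

```python
def make_leaf(name, value, log):
    """Wrap a value as a sub-expression that records when it is reduced."""
    def evaluate():
        log.append(name)
        return value
    return evaluate


def demand_driven(cond, then, otherwise):
    """Outermost-first: reduce only the branch the result depends on."""
    return then() if cond() else otherwise()


def data_driven(cond, then, otherwise):
    """Innermost-first: reduce every sub-expression, then select."""
    c, t, o = cond(), then(), otherwise()
    return t if c else o


def trace(strategy):
    """Evaluate 'if True then 1 else 2' and report what was reduced."""
    log = []
    result = strategy(
        make_leaf("cond", True, log),
        make_leaf("then", 1, log),
        make_leaf("otherwise", 2, log),
    )
    return result, log


print(trace(demand_driven))  # (1, ['cond', 'then'])
print(trace(data_driven))    # (1, ['cond', 'then', 'otherwise'])
```

Both strategies return the same value, but the data-driven version also reduces the branch that is never used.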
2.2.4 Heterogeneous Computing Systems
Heterogeneous systems mix architectural styles to produce High Performance
Computing (HPC) systems. Heterogeneous Computing (HC) is a special form of
parallel and distributed computing which allows discrimination of algorithms to
optimise the matching of tasks or sub-tasks to a choice of underlying machine
architectures [Khokhar93]. The approach uses either a single machine capable of
operating in both SIMD and MIMD modes, or a number of autonomous computers of
different type connected together. In Eshaghian's taxonomy [Eshaghian96], these two
main classes are referred to as System Heterogeneous Computing and Network
Heterogeneous Computing respectively. The use of opportunistic load-balancing on
clusters of workstations does not represent what is generally considered to be a
heterogeneous machine [Freund93]; in a truly heterogeneous environment, the
characteristics of the task are matched to machine features to ensure that the task
executes on the most suitable architecture.
Machines which provide both SIMD and MIMD modes of operation fall into
two main categories: mixed-mode machines which can rapidly switch between SIMD
and MIMD modes and multi-mode machines which provide both forms of parallelism
simultaneously. The Partitionable-SIMD/MIMD (PASM) system [Siegel81] is a
mixed-mode machine which can be dynamically partitioned to produce multiple
independent mixed-mode machines, providing spatial as well as temporal
heterogeneity. The Image Understanding Architecture (IUA) [Weems92] is an example
of a multi-mode machine consisting of three levels of processors operating in SIMD at
the lowest level and MIMD at the other levels.
2.3 Communication Networks for Concurrent Computers
An important and continuing area of research in the field of computer science is
in the provision of efficient communication mechanisms between processors in
concurrent computers. In a communication network for a concurrent computer, the
processors and memories or computational resources are linked through Switching
Elements (SEs). The SEs provide a means of modifying the connection between the
computational resources to accommodate the changing communication patterns of an
application. Two fundamental categories of interconnection network are encountered
in concurrent computers:
Indirect networks in which the SEs are connected together, with the
computational resources connected only to the boundary SEs of the network.
Direct networks which have a computational resource connected to every
SE.
Hybrid systems have been explored [Padmanabhan90] where direct and indirect
networks are mixed, but in general one or other method is utilised.
2.3.1 Indirect Networks
The majority of indirect networks are based on a 2-input/2-output exchange cell
and utilise a number of exchange cells connected through an interconnection region.
Figure 9 illustrates an example of this format; the single stage shuffle exchange
network [Stone71].
Figure 9 - Shuffle Exchange Network (shuffle interconnection, exchange cells and exchange cell states)
Data is routed to its destination by repeated passes through the network. An
alternative approach, the Multistage Indirect Network, provides several stages of
exchange/interchange to reduce the number of passes required through the network.
Many forms of indirect network have been explored [Feng81], [Siegel79], including
the indirect binary cube [Pease77], the baseline network [Wu80] and the Reverse-Exchange
Network [Wu80a] as well as those already mentioned in Section 2.2.1: the
Omega, Butterfly and Banyan networks.
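To make the idea of routing by repeated passes concrete, the following Python sketch models a single stage shuffle exchange network with 2^n ports. It assumes the usual formulation in which the shuffle rotates the n-bit port address left by one position and each exchange cell either passes its pair straight through or swaps it (flipping the least significant address bit); the function names and the destination-tag control scheme are illustrative assumptions, not taken from [Stone71].

```python
def shuffle(addr, n_bits):
    """Perfect shuffle: rotate the n-bit address left by one position."""
    msb = (addr >> (n_bits - 1)) & 1
    return ((addr << 1) & ((1 << n_bits) - 1)) | msb


def route(source, dest, n_bits):
    """Return the port address held after each shuffle-exchange pass."""
    addr, trace = source, [source]
    for i in reversed(range(n_bits)):  # destination bits, MSB first
        addr = shuffle(addr, n_bits)
        wanted = (dest >> i) & 1
        if (addr & 1) != wanted:
            addr ^= 1  # exchange cell set to 'swap'
        trace.append(addr)
    return trace


print(route(0b101, 0b010, 3))  # [5, 2, 5, 2]
```

Under these assumptions any destination is reached in n passes, since each pass shifts one source bit out of the address and one destination bit in.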
Indirect networks have historically been used in processor-to-memory
architectures in which processors are physically separated from the memory through
an interconnection network. In this type of architecture, the interconnection network
is very critical in determining the performance of the system as every memory access
must traverse the full extent of the network. If memory access latency exceeds an
instruction cycle time, then processors will be forced into an idle state waiting for the
access to complete. One method of hiding this latency is to implement cache memory
at each of the processors which overcomes the interconnection network latency but
introduces the problem of cache coherency (page 25).
Indirect networks suffer from a number of drawbacks. All communications
between resources must traverse every stage of the network which means that
neighbouring resources cannot exploit any locality of communication. The network
extensibility is poor, as the network must be re-wired for any changes in network size.
Finally, network connectivity is extremely low, with a single path between a given
source-destination pair, leading to poor fault tolerance and a highly blocking network.
Multipath Multi-stage Interconnection Networks (MMINs) have been developed to
overcome these single path and fault tolerance problems [Chin84], [Padmanabhan83],
[Adams82].
2.3.2 Direct Networks
In direct networks, the computational resources are embedded into the nodes of
the communication network. There are many forms of direct network [Wittie81],
ranging from the bus to the fully connected system (Figure 10).
Figure 10 - Direct Network Topologies (bus, ring, binary tree, 2D mesh, fully connected system)
The hypercube [Bhuyan84] is a popular topology and has been used in both 
research machines like the Cosmic Cube [Seitz85], and commercial machines like the 
Ncube [Hayes86], the Thinking Machines CM2 [Tucker88] and the Intel iPSC series
[Wiley87]. Communication patterns like the mesh, tree and ring are realisable within
the hypercube topology [Reed87] and therefore the hypercube, in this respect, is a
good choice for general-purpose machines. The major drawbacks with the hypercube,
however, are that the number of nodes can only be increased in factors of two and the
number of connections to each node increases as the network grows. As an
alternative, low dimension meshes provide good extensibility, and extensions to the
network can be made without altering the number of connections to each node.
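Two of the hypercube properties mentioned above, the growth of node degree with network size and the ability to realise a ring, can be checked with a short Python sketch. The use of a reflected Gray code to place the ring is a standard construction chosen here for illustration; it is not claimed to be the method of [Reed87], and the function names are hypothetical.

```python
def neighbours(node, dim):
    """Nodes of a dim-dimensional hypercube differing from node in one bit."""
    return [node ^ (1 << k) for k in range(dim)]


def gray_ring(dim):
    """Visit all 2**dim nodes in reflected Gray code order."""
    return [i ^ (i >> 1) for i in range(1 << dim)]


def is_ring_embedded(dim):
    """True if consecutive ring members (with wrap-around) are hypercube neighbours."""
    ring = gray_ring(dim)
    return all(
        ring[(i + 1) % len(ring)] in neighbours(ring[i], dim)
        for i in range(len(ring))
    )


for dim in (2, 3, 4):
    print(dim, 1 << dim, len(neighbours(0, dim)), is_ring_embedded(dim))
```

Each line of output gives the dimension, the number of nodes, the node degree and whether the ring maps onto hypercube links. Doubling the number of nodes adds one link to every node, which is the extensibility drawback noted in the text.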
Direct networks have generally been used for PE-to-PE architectures in which
each processor possesses its own locally accessed memory and communicates with
other PEs through a communication network. Locality of communication is
exploitable in direct networks due to the short links between neighbouring resources
which lead to a lower overhead for nearest-neighbour communications. Direct
networks generally have high connectivity and therefore can be designed to provide
effective fault tolerance schemes. The PE-to-PE architecture lends itself to a modular
approach in which each module contains a processor, local memory and a network
communication engine. The use of identical modules in the network reduces costs and
provides an opportunity to place the network engine on the same die as the processor
to reduce costs further.
2.3.3 Switching Mechanisms
The methods by which connections are made and broken within a network fall
into two major categories: circuit-switching and message-switching [Kermani80].
In the circuit-switching approach, a route for communication is established, 
then the message is sent uninterrupted over the established pathway. There will be no 
further delay once a path has been established, except signal propagation delay time, 
and therefore circuit switching is particularly suitable for transmitting long messages.
The circuit switching technique is most easily implemented on machines running in
lock-step which communicate through an indirect network. The establishment of the
required communication pattern is synchronised with the instruction stream, and the
network switched as a whole. For a general purpose machine where arbitrary and
dynamic connectivity is required, the network connections must be established with
request signals propagating through the network. A major drawback with this scheme
is that network bandwidth is wasted waiting for the reply to the request signal.
At each node in a message-switched network, information in the message is 
used to establish the connection; the source node o f the message sends the message 
without waiting for the route to be established. This scheme eliminates the network
bandwidth wasted in the circuit switching approach that is used to establish a link
before the message is sent. A problem with the message-switched network is that the
progress of the message can be blocked in the network and a storage buffer must be
provided at each node to prevent loss of message data. A variant of the message
switching approach, packet switching, uses the same approach for establishing a path
as message switching, but a message is divided into packets and sent one packet at a
time. The advantage of this approach is that the size of the buffer at each of the nodes
is reduced.
2.3.4 Important Network Properties
There is a wide range of properties which characterise a network, amongst the
most important of which are diameter and average distance, communication latency,
node degree, connectivity and extensibility [Lyuu92]. The diameter of a network is
the maximum number of links along a shortest path that must be traversed in order for
a message to reach its destination. The average distance is a measure of the average
number of links that must be traversed for two nodes to communicate, which depends
on the communication patterns of the application as well as the network topology.
The communication latency refers to the time taken for a message to pass through the
network which is influenced by the distance travelled and by the message routing
algorithms in message-switched networks. The number of channels entering or leaving
a node is termed the node degree which is usually constrained by the technology of
the switching elements. Connectivity is a measure of the fault tolerance of the
network and refers to the minimum number of failed links that result in a disconnected
or disjoint network. Extensibility corresponds to the ease with which small increments
in network size can be made.
A network should ideally have low diameter and average distance, high
connectivity, low communication latency and be extensible without a change of node
degree.
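As an illustration of how these properties compare across topologies, the following Python sketch computes node degree, diameter and average distance by breadth-first search. It assumes bidirectional links and, for the average distance, uniform traffic between all pairs of distinct nodes; since the average distance in practice depends on the application's communication pattern, that figure is only indicative. All function names are introduced here for the example.

```python
from collections import deque
from itertools import product


def ring(n):
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def mesh2d(w, h):
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {
        (x, y): [(x + dx, y + dy) for dx, dy in steps
                 if 0 <= x + dx < w and 0 <= y + dy < h]
        for x, y in product(range(w), range(h))
    }


def hypercube(dim):
    return {i: [i ^ (1 << k) for k in range(dim)] for i in range(1 << dim)}


def hop_counts(graph, start):
    """Shortest-path link counts from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def properties(graph):
    hops = [d for s in graph for t, d in hop_counts(graph, s).items() if t != s]
    return {
        "degree": max(len(links) for links in graph.values()),
        "diameter": max(hops),
        "average_distance": sum(hops) / len(hops),  # uniform traffic assumed
    }


for name, net in (("ring", ring(16)), ("mesh", mesh2d(4, 4)), ("cube", hypercube(4))):
    print(name, properties(net))
```

For 16 nodes the ring has the lowest degree but the largest diameter, while the 4 x 4 mesh and the 4-dimensional hypercube both have a maximum degree of four, with the hypercube achieving the smaller diameter.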
2.4 The Packet-Switched Direct Network Approach
The implementation of an FPGA-based network message routing device, using
the packet-switched direct network approach to interprocessor communications, is
investigated by the research described in this thesis. Three issues relating to the design
of the routing device are explored in this section: packet routing schemes, deadlock
freedom and multicast routing.
2.4.1 Packet Routing Schemes
In a packet-switched network, packet routing mechanisms must be devised to
transport messages between source and destination. The transport mechanism should
provide arbitrary and dynamic connectivity, where every processor in the network is
able to communicate with every other. In early network designs, the store-and-forward
routing technique [Tanenbaum81] was used in which each intermediate node
on the path of the message must receive the complete message before sending it on to
the next node. This technique has a very high message transmission time, proportional
to the product of the number of intermediate nodes traversed and the message length.
The virtual cut-through [Kermani79] and wormhole routing [Ni93] strategies reduce
the message transport time to a figure proportional to the sum of the number of nodes
traversed and the message length, by allowing the message to advance out of a node
before the complete message has arrived. In both techniques, the head of the message,
which contains the destination address, is examined as soon as it arrives. When the
exit route from the node has been determined based on the destination address, the
message is forwarded out of the node along the required route, assuming that the
route is free. For a message that is not blocked, there is no requirement to buffer a
complete message; only a few flow control digits or flits need to be stored, where a
flit is the smallest unit of information that a channel can accept or refuse. In virtual
cut-through routing, each node will store a complete message if the exit route is
blocked. In wormhole routing, only a few flits are buffered at each node when a
blockage occurs, and under these conditions, the message backs up through preceding
nodes. More routing resource is used up by a blocked message in wormhole routing
because of this action, and therefore virtual cut-through routing has a better hot-spot
performance than wormhole routing, at the expense of larger buffers at each node.
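The product-versus-sum relationship described above can be made concrete with a simple, contention-free cycle count. The sketch below is an idealised model and rests on assumptions that are not taken from the cited papers: every link transfers one flit per cycle, routing decisions take no time, no message is ever blocked, and distance is counted in links (hops) rather than intermediate nodes.

```python
def store_and_forward_cycles(hops, flits):
    """Each hop must receive the whole message before forwarding it."""
    return hops * flits


def cut_through_cycles(hops, flits):
    """The header advances one hop per cycle; the remaining flits follow
    in a pipeline, so the tail arrives flits - 1 cycles after the header."""
    return hops + flits - 1


for hops in (2, 8, 32):
    print(hops, store_and_forward_cycles(hops, 64), cut_through_cycles(hops, 64))
```

For a 64-flit message over 8 hops the model gives 512 cycles for store-and-forward against 71 for an unblocked cut-through or wormhole transfer, which shows why the latter is relatively insensitive to routing distance.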
2.4.2 Deadlock Freedom
An important issue in the operation of parallel computers is deadlock, a
condition in which the allocation of shared resources leads to a situation where no
progress can be made due to a cycle of ungrantable requests between the users of the
shared resources. A simple example is illustrated in Figure 11 which shows two
processors that are in deadlock. Each processor is running a task in which a document
on disk is to be printed. The printer and the disk are shared resources which can only
be allocated to one processor at a time.
Figure 11 - Processor Deadlock (one processor waiting for the disk, the other waiting for the printer)
The situation depicted in Figure 11 arises if one processor has been granted the 
printer whilst the other has been granted the disk. Both processors are waiting for a 
resource which has been allocated to the other. This situation will persist, where the
existing allocation of shared resources prevents further progress. Both processors are
indefinitely locked in a deadly embrace or deadlock condition.
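The cycle of ungrantable requests in this example can be detected mechanically. The Python sketch below builds a wait-for graph from the current allocations and outstanding requests and reports any processors that lie on a cycle. It assumes each processor has at most one outstanding request; the processor names P1 and P2 and the function names are hypothetical.

```python
def wait_for_graph(held_by, requests):
    """Map each blocked processor to the processor holding what it requests."""
    return {
        proc: held_by[res]
        for proc, res in requests.items()
        if res in held_by and held_by[res] != proc
    }


def deadlocked(held_by, requests):
    """Return the set of processors on a cycle of ungrantable requests."""
    graph = wait_for_graph(held_by, requests)
    stuck = set()
    for start in graph:
        seen, node = [], start
        while node in graph and node not in seen:
            seen.append(node)
            node = graph[node]
        if node in seen:  # walked back onto the path: a cycle exists
            stuck.update(seen[seen.index(node):])
    return stuck


held_by = {"printer": "P1", "disk": "P2"}   # current allocations
requests = {"P1": "disk", "P2": "printer"}  # outstanding requests
print(sorted(deadlocked(held_by, requests)))  # ['P1', 'P2']
```

If only one of the two resources had been allocated, the wait-for graph would contain no cycle and the function would return an empty set.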
In the design of a message routing device we are particularly interested in
potential deadlocks that could occur within the message routing device. Deadlock can,
of course, occur in other situations, including in the allocation of resources by an
operating system [Habermann69], [Coffman71], [Singhal89], in distributed databases
[Isloor80] and in the communication between processes in a concurrent application
[Roscoe87].
There are two broad approaches to the provision of deadlock freedom in a
packet routing network: detection of a possible deadlock situation followed by some
recovery action, and deadlock freedom by design. Deadlock detection and recovery
can be achieved in a number of ways including the use of acknowledgement packets and
time-out mechanisms. Compressionless Routing, for example, is a protocol that has
been proposed [Kim94] in which deadlock is allowed to occur and messages are
removed from the network if a sufficiently long period has elapsed since progress was
last made. Another approach, double-buffering [Whobrey88] or the method of nosey
worms, buffers a message at the sending node and attempts to establish a connection
with the destination by routing the message. If a blockage is encountered, the message
recoils, and another path is tested after some delay.
Deadlock prevention schemes in Store-and-Forward network routing algorithms
have been developed based on the concept of a structured buffer pool [Raubold76],
[Gerla80]. The buffers at each node are partitioned into classes; a 0-hop class through
to a K-hop class, where K is the length of the longest path in the network. The buffers
of the i-hop class at any node may only be granted to packets which are i hops or
more removed from their origin. This restricted assignment of buffers to messages
imposes a partial ordering of buffers to prevent deadlock [Gunther81]. The same
technique has been used in [Giessler81] and [Gopal85] and extended in
[Merlin80a/Merlin80b] and [Toeug79]. These techniques are not directly applicable to
wormhole routing schemes, therefore virtual channels for message routing were
proposed by Dally and Seitz [Dally87] to prevent deadlock by removing cyclic
dependencies from the network. To implement these, each physical channel is time-shared
to provide a number of virtual channels [Dally92a]. This methodology has been
used in the design of routing chips such as the Torus Routing Chip [Dally86] and in
the iWarp integrated processor [Borkar88]. Jesshope and Yantchev have used the
similar concept of virtual networks in the Mad Postman routing device [Yantchev89].
Here, cyclic dependencies are eliminated with the use of a number of independent
acyclic network planes, with minimal adaptive routing permitted in each of the planes.
Although originally proposed for deterministic or non-adaptive routing, the
virtual channel technique has also been applied to adaptive routing. Adaptive routing
is desirable both to provide fault tolerance and to improve performance by utilisation
of all available paths. Minimal adaptive routing, which restricts messages to routes
such that they always get nearer to their destinations, cannot route messages around
faults when an increase in path length is necessary. Non-minimal adaptive routing
mechanisms have been proposed which overcome this limitation. Dally and Aoki
[Dally93], for example, used two classes of routing in their dynamic Dimension
Reversal Routing algorithm, adaptive and deterministic, which were implemented on
separate sets of virtual channels. Packets were injected into the adaptive channel set,
and permitted to route in any direction. Dimension Reversal (DR) numbers in each
packet header recorded the number of times a packet routed from a channel in one
dimension to a channel in a lower dimension. To prevent deadlock, a packet was not
permitted to wait for a channel held by a packet with a lower DR number; packet
routing was switched to the deterministic routing channel set if this situation
occurred. The ability for the packet to route away from its destination node
introduced the possibility of livelock, a condition in which a packet never reaches its
destination. To ensure that each packet eventually reached its destination in the
dimension reversal routing network, a limit was set on the number of times a packet
was permitted to route away from its destination. Duato [Duato93] also proposed
using two virtual channel sets, one for adaptive routing, and the other acyclic one for
non-adaptive routing. Boppana and Chalasani [Boppana95] implemented a block-fault
model in which fault-free rings or chains were constructed around each fault region
using virtual channels to form a fully adaptive fault-tolerant routing algorithm. In an
approach similar to that taken by Yantchev and Jesshope [Yantchev89], Linder and
Harden [Linder91] created an independent virtual network for each combination of
directions that packets can be routed in and divided the physical channels into as many
virtual channels as were necessary for the mapping of the virtual network onto the
physical network. Gaughan and Yalamanchili introduced Pipelined Circuit Switching
[Gaughan95] in which the header flit was first routed to the destination, reserving a
path for the remainder of the packet. If the header could not progress, it was
permitted to backtrack, and adaption was possible. When a path to the destination
has been set up, the body of the packet is pipelined over the path as in wormhole
routing.
An alternative approach to virtual channels, proposed by Glass and Ni
[Glass92], is the Turn Model of routing. In this technique, the directions in which
packets can turn within the network, and the cycles that the turns can form, are
analysed. The minimum number of turns is then eliminated, consistent with preventing
livelock and ensuring that any cycles which can cause deadlock are not present.
Partial or fully adaptive routing algorithms are possible within the turn model of
routing.
In the class of deadlock-free-by-design techniques, inherently deadlock-free
networks can also be defined. A 2-dimensional grid, for example, is inherently
deadlock-free if messages are always fully routed in one dimension before routing
commences in the second dimension. The E-cube routing algorithm [Sullivan77], used
in hypercube structures, also routes in dimension order to provide an inherently
deadlock free network. Another approach is logically to map a tree onto a network, a
structure which cannot include cycles of dependency as there are no loops in the
structure, and is therefore inherently deadlock-free. Cube Connected Mobius Ladders
have been proposed by Pritchard and Nicole [Pritchard93], as an inherently deadlock-free
network based on the cube connected cycle [Preparata81] and the use of a
deadlock-free interval routing algorithm [VanLeeuwen87].
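Dimension-order routing is simple enough to sketch directly. The Python functions below are illustrative only: xy_route fully routes a message in the first dimension of a 2-dimensional grid before moving in the second, and ecube_route corrects the differing address bits of a hypercube in a fixed order. The choice of lowest dimension first in ecube_route is an assumption made for the example; any fixed order gives the same property.

```python
def xy_route(src, dst):
    """Route in x until the column matches, then in y (2-D grid)."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def ecube_route(src, dst, dim):
    """Correct differing address bits in a fixed dimension order (hypercube)."""
    path, node = [src], src
    for k in range(dim):  # lowest dimension first (assumed order)
        if (node ^ dst) & (1 << k):
            node ^= 1 << k
            path.append(node)
    return path


print(xy_route((0, 0), (2, 1)))      # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(ecube_route(0b000, 0b101, 3))  # [0, 1, 5]
```

Because a message never returns to a dimension it has already finished with, the channel dependencies cannot form a cycle, which is the basis of the deadlock freedom claimed for these schemes.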
2.4.3 Multicast Routing
Multicast operations [McKinley95] are important in a number of aspects of
parallel computing, including numerical applications and memory invalidation in
distributed shared-memory systems [Robinson95]. The importance of hardware
support for multicast operations was reported by Ni [Ni95] and several techniques
have been proposed and explored to implement hardware-based multicasting. Unicast-based
multicasting [McKinley94] schemes utilise existing one-to-one routing
functions with additional software support to implement one-to-many communication
patterns. Tree-based multicasting schemes rely on finding a spanning tree rooted at
the source node. The routing mechanism proposed by Malumbres et al
[Malumbres96], for example, uses tree-based multicast for short invalidation and
update messages in shared-memory multiprocessors. A path-based multicasting
scheme proposed by Lin et al [Lin94], is based on finding a Hamiltonian path in the
network. A Hamiltonian path is one which visits every node in the network exactly
once, and is therefore acyclic and deadlock free. The Trip-based model [Tseng96]
generalises the Hamiltonian path-based technique to provide adaptive, deadlock-free
multicast on networks of arbitrary topology.
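A Hamiltonian path is easy to construct for a 2-dimensional mesh, and the Python sketch below shows one such construction together with a way of ordering multicast destinations along it. This is an illustration under stated assumptions, not a reproduction of the scheme in [Lin94]: the row-by-row "snake" labelling, the split of destinations into those labelled above and below the source, and all function names are choices made for the example.

```python
def snake_label(x, y, width):
    """Position of node (x, y) along a row-by-row snake through the mesh."""
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def hamiltonian_path(width, height):
    """All mesh nodes in the order the snake visits them."""
    nodes = [(x, y) for y in range(height) for x in range(width)]
    return sorted(nodes, key=lambda n: snake_label(n[0], n[1], width))


def split_destinations(source, dests, width):
    """Order destinations along the path, upwards and downwards from source."""
    def label(node):
        return snake_label(node[0], node[1], width)
    s = label(source)
    up = sorted((d for d in dests if label(d) > s), key=label)
    down = sorted((d for d in dests if label(d) < s), key=label, reverse=True)
    return up, down


path = hamiltonian_path(4, 3)
print(path[:6])
print(split_destinations((1, 1), [(0, 0), (3, 1), (2, 2), (0, 1)], 4))
```

Consecutive nodes on the path are always mesh neighbours, and a message that visits its destinations in strictly increasing (or strictly decreasing) label order never revisits a node.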
2.5 The Implementation of an FPGA-Based Network Routing Node
The network routing node developed in this research uses wormhole routing
and a linear topology. Wormhole routing has the desirable characteristics of low
communication latency and relative insensitivity to routing distance in the absence of
blockages. However, in situations of high network traffic, the network can become
highly blocked. Two mechanisms have been implemented in the design of the routing
device to alleviate this situation. An adaption mechanism is implemented between
pairs of channels to increase the availability of routing channels in the network, and
collective communication is supported to reduce the number of messages injected into
the network with the provision of hardware support for a multicast mechanism. The
multicast mechanism is not universal in that the extent of the multicast is limited to a
contiguous region of nodes. A multicast message can, however, be encapsulated in
another message to provide multicast to a remote but contiguous region of nodes.
The linear network topology of the FPGA-based network is acyclic and is therefore
inherently deadlock-free. Despite this, the allocation of resources within each network
node, especially under multicast conditions, can lead to deadlock situations. Section
5.6 addresses these conditions and their elimination. Virtual channels are not used in
the design; deadlock-freedom is achieved through the choice of an inherently
deadlock-free network topology and some restriction on the permitted use of
multicast routing.
Chapter 5 and Chapter 6 provide a full description of the FPGA-based network
routing node and its implementation of an adaptive wormhole routing mechanism with
multicast capabilities.
Chapter 3 
Programmable Logic
3. Programmable Logic
The primary investigation undertaken in this research is in the use of
programmable logic devices to provide a message routing network for embedded
parallel processing applications. Programmable logic is introduced in this section with
a brief description of the historical development of these devices. The two major types
of programmable logic are illustrated with reference to existing devices, and the
techniques used to implement programmability are described. The major application
areas for programmable logic devices are highlighted and FPGA-based router designs
are discussed.
3.1 Devices
3.1.1 Historical Development
The general structure of all programmable logic devices is the same; they have a
fixed internal architecture consisting of a number of logic blocks connected together
through a signal routing matrix. The pins of the device connect to the logic blocks
through dedicated I/O blocks. The function of the logic blocks, the operation of the
I/O blocks and the routing between blocks are all programmable.
First generation programmable logic devices, pioneered by Monolithic
Memories Incorporated [MMI82], were one-time programmable and provided the
equivalent of a few hundred gates per device. The I/O structure was fixed, but fed
with a programmable logic block consisting of a sum-of-products or AND/OR array
(Figure 12).
Figure 12 - PAL Sum of Products Structure (inputs x and y, fuses, output z)
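The behaviour of a sum-of-products array can be modelled in a few lines of Python. The sketch assumes a simplified fuse model in which each product term lists, for every input it uses, whether the true or the complemented signal remains connected; the representation and the names pal_output and XOR_FUSES are hypothetical and do not correspond to any particular device's programming format.

```python
def pal_output(inputs, product_terms):
    """Evaluate a sum-of-products array.

    product_terms is a list of dicts mapping an input name to the polarity
    (True = true input, False = complemented input) left connected by an
    intact fuse. The AND of each term feeds a fixed OR gate.
    """
    return any(
        all(inputs[name] == polarity for name, polarity in term.items())
        for term in product_terms
    )


# Assumed example: z = x AND (NOT y)  OR  (NOT x) AND y, i.e. exclusive-OR.
XOR_FUSES = [{"x": True, "y": False}, {"x": False, "y": True}]

for x in (False, True):
    for y in (False, True):
        print(x, y, pal_output({"x": x, "y": y}, XOR_FUSES))
```

Changing the function implemented by the block only requires a different fuse pattern; the AND/OR structure itself is fixed.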
The thirty or so different devices in the first generation family were rationalised
into two parts (GAL16V8, GAL20V8) for the second generation family which were
developed by Lattice Semiconductor [Lattice87]. All functions of the first generation
were implementable in one or other of these two new parts. The devices were
reprogrammable, providing a means of rapidly fixing bugs or implementing design
changes and also allowing devices to be 100% factory tested. Other parts were also
introduced in the family to provide devices with more routing, logic and I/O
resources.
The current generation of programmable logic devices consists of two distinct
types of parts, which are distinguished by the granularity of their logic blocks.
Complex Programmable Logic Devices (CPLDs) are an extension of the large logic
block approach used in the first and second generations of programmable logic parts.
Field Programmable Gate Arrays (FPGAs) provide many simple logic blocks with a
more complex interconnect structure.
3.1.2 Complex Programmable Logic Devices
CPLDs are coarse-grained devices. They have large complex logic cells with a 
wide fan-in and simple interconnecting paths.
The Intel FLEXlogic™ family of devices [Intel92], now taken over by Altera
and renamed FLASHLogic™, have a number of logic array blocks, each with 10
macrocells, linked through a fully connected Programmable Interconnect Array (PIA).
The macrocells each have a register and 24 inputs from the PIA which feed through a
logic array to a product term allocation circuit. This allows product terms from
adjacent macrocells to be borrowed (Figure 13). The family ranges in density from
1600 to 3200 usable gates, and 64 to 120 I/O pins. The devices are unique in that they
have volatile memory cells to define the logic function for development purposes but
also have non-volatile cells for use when the design is fixed.
Figure 13 - FLEXLogic Product Borrowing (24 PIA signals feeding each macrocell; product terms passed to/from adjacent macrocells)
The Cypress FLASH370 series [Cypress95] have a very similar structure to the 
Altera parts, with a product term allocator between a central routing pool and the
logic cell to provide a wide product term.
The MAX 9000 series [Altera96] of Electrically Erasable PROM-based devices
provide the largest Altera CPLD parts with a range between 320 and 560 macrocells
and up to 216 I/O pins. The macrocells are grouped into Logic Array Blocks (LABs)
containing 16 macrocells. Each macrocell has a programmable AND, fixed OR array
and a configurable register with independently programmable clock, clock enable,
clear and preset functions. Product term sharing, as used in the Intel FLEXLogic, is
also implemented in the MAX 9000 series to provide up to 32 product terms per
macrocell. Connections between macrocells and I/O pins are provided by feedback
local to each LAB and by the FastTrack Interconnect, a series of continuous vertical
and horizontal routing channels that traverse the whole device (Figure 14). The
device-wide routing provides predictable routing performance, even in complex
designs.
Figure 14 - MAX9000 Architecture
Generally, operational frequency and pin-to-pin combinatorial delays of CPLDs
are deterministic parameters for each device due to the fixed and predictable paths
taken by signals within the devices. This makes the estimation of system operating
speed very simple. The complexity of the logic blocks limits the number of registers
available, but the high number of terms available at the input of each register means
that CPLDs are well suited to fast implementation of state machines and control logic
which generally produce wide product terms.
3.1.3 Field Programmable Gate Arrays
FPGAs are medium-grained to fine-grained devices, consisting of many simple
logic cells. This architecture increases the number of connections between cells, and
therefore a complex, usually hierarchical, interconnection matrix is required to achieve
good routability. The fixed regular structure of the FPGA simplifies the process of
partitioning and placing a design to fit a particular part.
The Actel ACT™ family of Antifuse FPGAs [Actel92] have two simple cell
types implementing combinatorial and sequential functions and have a
two-dimensional routing mesh. The ATMEL AT6000 family of devices [Atmel94] has a
simple 4 gate, 1 register logic cell and a hierarchical interconnection scheme
consisting of local and global routing connections. An interesting feature of this family
is that it has the facility to allow reconfiguration of part of the device whilst the rest of
the device remains operational, providing a mechanism for adaptive hardware
(Section 3.2).
The XILINX XC4000 family of programmable logic devices used in the
message router node design have a medium grain structure [Xilinx94] and are
therefore classified as FPGAs. Internally, the devices have a two-dimensional array of
interconnected Configurable Logic Blocks (CLBs). Each CLB provides a pair of
registers and a pair of configurable combinatorial logic blocks. The registers have
individually programmed rising or falling edged clocks. A common Clock Enable can
be used by each register if required. The combinatorial blocks can be used together to
form some combinatorial terms of 9 inputs (Figure 15). The hierarchical
interconnection provided between CLBs is programmable and consists of direct
interconnects which provide nearest neighbour connection, general interconnects and
long lines for other connections, and low skew clock nets for clock signal
distribution. The device also provides internal RAM facilities, a number of wide
decoder blocks for applications like address decoding, and dedicated fast carry logic
for arithmetic functions.
Figure 15 - Simplified Block Diagram of XC4000 CLB
FPGAs have many registers and are therefore good for designs with data paths,
like signal processing functions, for example. However, the operational frequency and
combinatorial delays of an FPGA-based design are not known until after the design
has been completed. Logic placement and signal path lengths cannot be determined a
priori; they are dependent on the placement and routing of the design which may be
constrained by a number of inter-related requirements such as pin-to-pin timing limits
and area restrictions. This makes system performance estimation and device speed
selection difficult. In addition to the initial uncertainty in performance, design changes
to a committed design can cause difficulty if the internal routing resources are too
limited. The early Xilinx design tools and devices, for example, could provide an
initial fit for a design, given the freedom to assign the pins arbitrarily, and then be
unable to re-fit a slightly modified version of the design in the same device with
pin-locked constraints. This problem has been overcome with better routing resources
within the devices and improved placement and routing tools. The limited fan-in of the
logic blocks in an FPGA compared to CPLD devices requires that special techniques
are employed to obtain good performance from FPGAs for certain logic functions
such as counters and state machines (Section 6.2.4).
3.1.4 Device Programmability
Programmability of devices is achieved using one of three techniques [Rose93]:
fuse technology, volatile static RAM (SRAM) memory and non-volatile ROM
technology.
Devices based on fuse programming are fabricated with all possible connections
in place, the connections being made by silicon fuses. During programming, unwanted
connections are broken by electrically blowing the relevant fuses. Anti-fuse
technology is also used, in which connections are made as required.
The state of memory cells is used in SRAM-based devices to control
connections. Small blocks of memory are used as look-up tables to implement
combinatorial logic functions. Memory blocks used for configuration can also be used
to implement internal volatile memory. These devices are programmed after power up
using configuration data stored in non-volatile memory external to the device. They
can also be reprogrammed whilst power is still applied.
Devices based on non-volatile memory use one-time programmable or
re-programmable memory elements to control connections and logic functions in a
similar way to SRAM-based devices.
Each approach to programmability has its merits. Fuse technology is small in
feature size on the silicon die, and therefore provides more usable gates per unit area
than the other techniques. However, fusible devices are One-Time Programmable
(OTP), and once blown cannot be altered. SRAM devices can be re-programmed
during operation, and in some cases small amounts of SRAM can be made available
for operational use, but the feature size of the memory cell is the largest of the three
techniques. The non-volatile approach provides a compromise between fuse and
SRAM techniques, with the feature size falling between these two. With the
implementation of In-System Programmability (ISP) to program the non-volatile
device memory, as implemented in some of the latest families of parts, the devices
are as easy to use as SRAM devices in terms of in-system design changes. In addition,
external non-volatile memory devices are not required to hold configuration data, thus
reducing the board area required. However, the volatile SRAM-based devices are
faster to reconfigure than the non-volatile parts, which require erase cycles.
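The look-up table mechanism used by SRAM-based devices can be illustrated with a minimal Python sketch. It is not taken from any vendor's tools: the function names `make_lut` and `lut_read`, the 4-input width and the two example logic functions are assumptions chosen purely for illustration. The stored table plays the role of the configuration memory, and rewriting it corresponds to reprogramming the cell.

```python
def make_lut(func, n_inputs):
    """Precompute the truth table of func as a list of 2**n_inputs bits.

    The list stands in for the SRAM cells that configure one logic block.
    """
    table = []
    for address in range(2 ** n_inputs):
        # Bit i of the address is the value presented on input i.
        bits = [(address >> i) & 1 for i in range(n_inputs)]
        table.append(int(bool(func(*bits))))
    return table


def lut_read(table, *inputs):
    """Evaluate the logic function by using the inputs as a memory address."""
    address = 0
    for i, bit in enumerate(inputs):
        address |= (bit & 1) << i
    return table[address]


# "Configuration": an arbitrary 4-input function (hypothetical example).
config = make_lut(lambda a, b, c, d: (a & b) ^ (c | d), 4)

# "Reconfiguration": the same cell now implements 4-input parity.
config2 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)
```

Any function of the inputs costs the same sixteen memory bits, which is why the logic function of such a cell is limited by its fan-in rather than by the complexity of the function.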
3.2 Application Areas for FPGAs
FPGA devices are conventionally used to replace discrete devices for cost and
board area savings, but they are also being used in several areas of research, including
hardware/software codesign and the development of reconfigurable computer
systems.
3.2.1 Hardware/Software Codesign
Hardware/software codesign breaks from the traditional approach to system
design in which hardware and software are developed independently, often with little
interaction until system integration time [Page96]. In the codesign approach, the
allocation of functionality between hardware and software becomes an interactive part
of the design process [Kumar93], offering the ability to explore hardware/software
trade-offs. In addition to performance and cost analysis, system reliability and
time-to-market can also be improved with a unified view of a product [Roman84],
[Franke91]. Implementation of the codesign paradigm, however, is a complex
problem because of the wide spectrum of applications, design styles and underlying
technologies, and research in this field has produced several methodologies including
an approach based on an extension of finite state machines [Chiodo94],
object-orientated functional specifications [Woo94] and HardwareC, an extension to the
standard C language [Gupta93]. Wolf [Wolf94] explores the problems and challenges
of codesign using the example of embedded computer system design. The availability
of inexpensive, high density, reprogrammable FPGA devices has aided research into
complex codesign tasks like the Parameterised Processors from Oxford University
[Page94], and has also led to the introduction of commercial FPGA-based
hardware/software codesign systems [Conner96], [Schewel95].
3.2.2 Reconfigurable Computing
Reconfigurable computing [Seaman94] utilises FPGAs to implement a spectrum
of computing functions from co-processors at one extreme to SIMD or systolic
systems at the other. The PRISM-I system from Brown University [Athanas93]
consists of a conventional processor connected to an FPGA-based coprocessor. The
compiler for the system accepts a program as its input and produces both a hardware
and a software image. The hardware image defines the function of the FPGA
coprocessor, the software image defines the operation of the conventional processor.
A similar coprocessor approach is taken in the Chameleon system [Heeb92], although
it is necessary in this case to develop the hardware descriptions explicitly using a
separate hardware description language. Foulk [Foulk94] investigates the use of
different FPGAs in a textbase coprocessor for operations on textual databases. The
AnyBoard system [VanDenBout92] provides flexible FPGA-based rapid prototyping
hardware in the form of a PC-based board. The board has five Xilinx devices
connected in a ring topology with local busses, and three banks of RAM for memory
intensive designs. A global bus and system interface connect to each of the Xilinx
devices for expansion and connections to other components of the system. The
AnyBoard system has been used in a number of applications including neural networks
and systolic convolvers. In a similar approach, Acock and Dimond [Acock96]
implement a modular system to allow algorithms to be mapped onto a number of
FPGAs. A linear array of FPGAs is utilised in the SPLASH design [Gokhale91],
which was conceived to execute a one-dimensional systolic algorithm in DNA pattern
matching. The bit-serial REMAP system [Linde92] implements multiple linear-array
SIMD elements constructed from FPGAs, primarily for use in Artificial Neural
Network (ANN) applications. The simple and regular structure of neural networks is
well matched to the fine-grained architecture of FPGAs [Salapura94]. GANGLION
[Cox92] implements fast ANN-based pattern recognition with FPGAs for use in
industrial inspection applications. The reprogrammability of the FPGAs is used to
provide custom hardware for each application. Bit-level processing for automatic
target recognition in Synthetic Aperture Radar images has been implemented in
FPGAs at the University of California [Villasenor96].
Where reprogrammability is required within a limited application area, custom
FPGAs can provide better performance than general purpose devices. The Teramac
Configurable Custom Computer [Amerson95] is fabricated with custom designed
FPGAs to produce a one million gate capacity machine. Custom FPGAs were used
for a number of reasons including support for multi-ported registers and reduced
compilation time, an important consideration in a fully populated system consisting of
1728 FPGAs. A custom FPGA, PROTEUS, has also been used in an experimental
reconfigurable signal transport system [Hayashi95] where the internal architecture of
the device is matched to the pipelined processes of the application.
WILDFIRE is an example of a commercial reconfigurable computer
[McHenry95], consisting of a host computer connected to a number of array cards,
each of which has sixteen FPGA processing elements with nearest-neighbour and
crossbar interconnects and a bus connection to a control element.
3.2.3 Genetic Algorithms
Genetic algorithms are probabilistic search techniques used for optimisation and
learning problems. They model biological evolution mechanisms such as mutation,
inheritance and selection. An initial population of solutions to the problem being
solved is repeatedly processed using the evolutionary processes to create new
generations of solutions. FPGAs have been used to implement these processes by
Graham and Nelson [Graham96], Sitkoff, et al [Sitkoff95] and Scott, et al [Scott95] to
provide a significant performance improvement over a purely software-based solution.
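The generational loop described above can be sketched in software as follows. This is an illustrative Python outline only, not the method of any of the cited FPGA implementations: the bit-string encoding, truncation selection, one-point crossover and all parameter values are assumptions.

```python
import random


def evolve(fitness, n_bits=16, pop_size=20, generations=40, p_mut=0.05, seed=0):
    """Evolve a population of bit strings and return the fittest individual."""
    rng = random.Random(seed)
    # Initial population of random candidate solutions.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the fitter half of the population survives as parents.
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            # Inheritance: one-point crossover of two parents.
            cut = rng.randrange(1, n_bits)
            child = a[:cut] + b[cut:]
            # Mutation: each bit is flipped with probability p_mut.
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)


# Toy problem (hypothetical): maximise the number of 1 bits in the string.
best = evolve(sum)
```

Because the parents are carried over unchanged, the best fitness in the population never decreases from one generation to the next. The inner loop is dominated by fitness evaluation and bit-level manipulation, which is the kind of work that maps naturally onto FPGA logic.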
3.2.4 Intellectual Property
A rapidly expanding field in the commercial reprogrammable logic community is
Intellectual Property (IP). Standard system functions ranging from simple logic
functions such as multiplexers to DSP functions and complex bus interfaces are
being made available as parameterised IP cores by many programmable logic vendors.
IP cores are verified function blocks which are re-usable and save significant design
and verification effort.
3.3 FPGA-Based Message Routers
Message routing devices for use in the construction of embedded parallel
multicomputer machines are available in custom and semi-custom technologies, but
there are few examples of the use of FPGAs for this function.
A number of device designs exist for the Transputer family of processors, which
can partially be attributed to their popularity in the construction of embedded systems.
The C004 and C104 custom devices from INMOS support the T8/4/2 and T9000
Transputers respectively. The C004 is a simple 32-way circuit-switched crossbar
switch, controlled by a Transputer through a configuration link [Inmos89a]. The
device provides programmable connectivity which is generally static throughout a
program. Quasi-static connectivity is possible, in which different phases of a program
have different switch settings, but synchronisation of the switch actions with message
boundaries is the responsibility of the programmer and is not supported by the device.
The C104 wormhole-routed packet routing switch developed under the Esprit PUMA
project [Hey92] also incorporates a 32-way non-blocking crossbar switch, with direct
connection to the Data/Strobe (DS) protocol of the T9000 [Inmos93]. The routing
algorithm in the C104 uses interval labelling, a technique whereby each output link
from a device is assigned a range of addresses or an interval. The number of
addresses in the specified range for each link corresponds to the number of devices
that are accessible on that link. Intervals are contiguous and non-overlapping. The
head of each packet sent through a C104 contains an address which is compared with
all intervals within the device, and the appropriate exit link selected. The message
header is consumed by the device, so that hierarchical networks of C104 devices can
be constructed; multiple addresses at the head of each message are sequentially
consumed by each C104 on the route to the destination. The C104 implements group
adaptive routing, in which a group of links is defined for instances where multiple
paths exist between processors. A message may then take a route to its destination
using any of the links in the relevant group.
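Interval labelling and header consumption, as described above, can be sketched in a few lines of Python. This is a simplified model and not the C104 implementation: the half-open interval convention, the dictionary layout and the node names `S0`, `S1`, `S2` and `P0` to `P4` are all assumptions made for illustration.

```python
def select_link(intervals, address):
    """Return the output link whose interval contains the address.

    intervals is a list of (low, high, link) tuples covering contiguous,
    non-overlapping half-open ranges low <= address < high (an assumed
    convention).
    """
    for low, high, link in intervals:
        if low <= address < high:
            return link
    raise ValueError(f"address {address} is outside every interval")


def route(switches, start, header):
    """Follow a packet through a hierarchy of switches.

    Each switch consumes the leading address of the header and uses it to
    select an exit link; the remaining addresses travel on with the packet.
    """
    node, path, header = start, [], list(header)
    while header and node in switches:
        address = header.pop(0)  # header address consumed by this switch
        link = select_link(switches[node]["intervals"], address)
        path.append((node, link))
        node = switches[node]["links"][link]
    return node, path


# Hypothetical two-level network: S0 reaches two processors via S1 and
# three via S2, so its intervals span 2 and 3 addresses respectively.
switches = {
    "S0": {"intervals": [(0, 2, 0), (2, 5, 1)], "links": {0: "S1", 1: "S2"}},
    "S1": {"intervals": [(0, 1, 0), (1, 2, 1)], "links": {0: "P0", 1: "P1"}},
    "S2": {"intervals": [(0, 1, 0), (1, 2, 1), (2, 3, 2)],
           "links": {0: "P2", 1: "P3", 2: "P4"}},
}

destination, path = route(switches, "S0", [3, 1])
```

Here the first address, 3, falls in the interval served by link 1 of `S0`, and the second address, 1, selects link 1 of `S2`. Group adaptive routing would replace the single link in each tuple with a set of equivalent links, any free member of which may be used.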
A 72x36 custom cross-bar switch device [Nicole88], developed under the Esprit
P1085 project, provides the reconfigurable interconnection required to construct
Transputer clusters or Supernodes, consisting of 16 worker Transputers, a link
control Transputer, and two additional disk server/large cache Transputers [Harp89].
Static or quasi-static connectivity is provided by this device in a similar fashion to the
C004. Hierarchical construction of networks is supported by the architecture with the
use of further switching devices.
The NTR08-30 Routing Device from Nottingham Trent University [Ellis91] is
an eight channel semi-custom router device which directly interfaces with standard
Transputer OS communication links. It has an eight way non-blocking crossbar,
controlled by header bytes in each message. Links may be grouped together to
provide group adaptive routing for improving hotspot performance. Virtual
communication is supported in the device by providing a mechanism for adding tags
to the headers of messages which are ignored by the routing algorithm. These
features, for use in T8/4/2 systems, are very similar to those provided by the C104 for
the T9000.
The Global Link Adaptor (GLA) design of Hossack and Guy [Hossack94] is an
example of an FPGA-based Transputer communication device. It supports up to forty
transputers in a five-arm star topology. In each of the arms, the eight transputers
communicate over one of four links, which are allocated to the transputers using a
token passing system. For communication between transputers in different arms, there
is only a single link. Limited fault tolerance is provided by connecting each of the four
global links in a ring and providing a mechanism for enabling and disabling each
connection in the ring in order to isolate a faulty node. A broadcast mechanism is also
implemented.
The processor-independent MP1 design [Miller91] is a packet routing
semi-custom device, primarily for use in a two-dimensional mesh topology. The interface
between the processor and the MP1 device is implemented in a separate device to
provide processor-independence and ensure that the MP1 network can be used to
construct heterogeneous systems. For each processor type which is to be connected
to the network, it is necessary to design an interface device [Miller]. The MP1 design
implements the Mad Postman routing strategy in which the head of a packet is routed
out of a node along the same direction as its arrival before any routing decision is
made within the node. This speculative action is effective because it is wrong only
twice in the routing of a message, and has the advantage of reducing the packet
transport latency to an absolute minimum. The device provides minimal adaptive
routing and has an area-based multicast capability, although it is referred to as
broadcasting by the authors. A message encapsulation mechanism is included to
support both hierarchical networks and the routing of packets to avoid faults.
Deadlock freedom is achieved with the use of four separate routing planes which are
each acyclic (Section 2.4.2).
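The claim that the speculative forwarding is wrong only twice per message can be checked with a simple model. The Python sketch below is an assumption-laden illustration, not the MP1 logic: it assumes a route that travels along X and then along Y on an unbounded mesh, ignores the adaptive and multicast features, and counts a wrong guess whenever the head arrives at a node where it must turn or be delivered rather than continue straight on.

```python
def mad_postman_trace(src, dst):
    """Return (hops, wrong_guesses) for an X-then-Y route from src to dst.

    At every node the head is assumed to have been forwarded straight on
    before the routing decision is made, so the guess is wrong exactly when
    the current leg of the journey ends at that node.
    """
    pos = list(src)
    hops = 0
    wrong_guesses = 0
    for axis in (0, 1):  # X leg first, then Y leg
        step = 1 if dst[axis] > pos[axis] else -1
        while pos[axis] != dst[axis]:
            pos[axis] += step
            hops += 1
            if pos[axis] == dst[axis]:
                # The leg ends here: the packet turns or is delivered,
                # so the straight-on speculation was wrong.
                wrong_guesses += 1
    return hops, wrong_guesses


hops, wrong = mad_postman_trace((0, 0), (3, 2))
```

Under these assumptions the number of wrong guesses is at most two, one at the turn and one at the destination, regardless of how far the message travels, while every other hop proceeds without waiting for a routing decision.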
Hartley and Harvey [Hartley93] have simulated a communication device in a
network of TMS320C40 (C40) processors, but no evidence has been found that such
a device has been implemented. They conclude from their simulations that the DMA
facility of the C40 is unable to fully utilise the communication links when four or more
links are being used; to overcome this, they recommend the implementation of virtual
channels.
Brunvand [Brunvand94] has implemented a self-timed [Payne95]
two-dimensional mesh router design in FPGAs for use in the Windchime Parallel Processor
system. This design, which requires two devices for each node, implements
deterministic dimension-order wormhole routing. It does not include the adaption,
encapsulation or multicast facilities included by the author in the router device
described in this thesis (Section 5.5) and it has only a single routing channel for each
direction. Although FPGA-based, the building-block concept for network design
using the reconfigurability of these devices has not been recognised.
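Deterministic dimension-order routing reduces the decision made at each node to a comparison of coordinates. The following Python sketch shows the general technique only; it is not Brunvand's design, and the port names and the choice of resolving X before Y are assumptions.

```python
def next_port(node, dest):
    """Choose the output port at node for a packet addressed to dest.

    The X offset is removed completely before any move is made in Y, so
    every packet between the same pair of nodes follows the same path.
    """
    (x, y), (dest_x, dest_y) = node, dest
    if x != dest_x:
        return "E" if dest_x > x else "W"
    if y != dest_y:
        return "N" if dest_y > y else "S"
    return "LOCAL"  # deliver to the attached processor


def path(src, dest):
    """List the ports taken from src to dest on a two-dimensional mesh."""
    moves = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    node, ports = src, []
    while (port := next_port(node, dest)) != "LOCAL":
        ports.append(port)
        node = (node[0] + moves[port][0], node[1] + moves[port][1])
    return ports
```

For example, `path((0, 0), (2, 1))` yields two moves east followed by one move north. The fixed ordering of dimensions is what makes the scheme deterministic, and it is also why it cannot route around congestion or faults without additional mechanisms such as adaption.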
A circuit-switched multicomputer network has been implemented by Yeh, et al
[Yeh95] using a mixture of FPGAs, GALs and Dual-Ported memory. The architecture
of the system is based on clusters of processors, each connected via a single link to
FPGA-based inter-cluster routers. Processors within a cluster use a common Time
Division Multiple Access bus for all communications, implemented within a single
FPGA. The use of a single shared bus within each cluster and single channel
inter-cluster links introduces potential bottlenecks into the system, even for near-neighbour
communications and provides poor fault tolerance.
Although the favourable characteristics of FPGAs have been exploited in other
application areas, the potential advantages of an FPGA-based approach to the
implementation of configurable message routing devices (Section 5.1) for embedded
systems have not been recognised or explored. The research described in this thesis
addresses this situation with a proof-of-concept design and experiments to establish
the feasibility of a building-block approach to network design.
Chapter 4 
The ROLIN Project
4. The ROLIN Project
The ROLIN project is a research programme concerned with the development
and implementation of on-line measurement and inspection techniques for high speed
monitoring of railway track and its infrastructure. The ROLIN project is the first stage
in a larger project which will lead to the exploitation of the ROLIN research in the
implementation of new infrastructure testing vehicles.
A number of test vehicles currently in service measure different aspects of the
track and infrastructure, including track geometry, ride quality, position of trackside
structures and overhead line condition. These vehicles provide measurement at high
speed and eliminate the need for railway personnel to be on the track. However, the
ageing technology of the sensors and signal processing of these vehicles means that
they are unable to provide the accuracy, reliability and data integrity needed by the
infrastructure engineers. The low level of on-line processing of existing measurement
systems leads to extensive post-processing of recorded data, especially for the
structure gauging train which utilises video cameras and simple real-time processing
to measure the clearance between the vehicle and surrounding structures. The post
processing activity introduces unacceptable delays between making the measurements
and providing the results to the customer.
There is no central repository of measurement data which is accessible by the
engineering groups within the company which require the information. The
information format and storage media for data from the various test trains are diverse
and information presentation is generally with the use of paper printout. To provide a
more effective tool for managing the railway network, it has been proposed that a
central integrated database of information be generated, containing all of the data
relating to the track and infrastructure. All information in the database should be
stored and accessed by reference to its geographical position in the railway network.
The ROLIN project focuses on three key themes - positioning, database
management, and video processing - which relate to the implementation of the
integrated database and also to the provision of the necessary processing performance
for the most demanding of the measurement sub-systems. These areas of research are
required to ensure that
• the position of any measurement can be consistently determined to the
required accuracy
• the integrated database approach is viable given the massive amounts of
data which will be generated to fully describe the condition of the railway
network
• adequate on-line processing is available for enhancing the processing of
video sensor data in order to reduce the post-processing burden and expand
the scope of the current measurements.
The author is primarily concerned with the video processing aspects of the
ROLIN project; positioning and database issues are briefly described in the following
sections to provide a complete picture of the work.
4.1 Key Project Themes
4.1.1 Positioning
This aspect of the ROLIN project investigates the means by which train position
can be determined in real-time with a circle of uncertainty of ±1 metre for 95% of the
time. Historically, position in the railway network has been determined by measuring
the linear distance along the track from some fixed reference point, usually a platform
or junction. Measurements from the existing test vehicles are given a linear position,
generated by a position processor on each of the test trains which counts pulses from
a tachometer connected to one of the wheels of the vehicle. Manual correction of the
calculated position is used to make adjustments for cumulative errors introduced by
the difference between the actual wheel size and the value used to calculate distance
travelled, and for errors caused by wheel slippage. The correction is made when the
operator presses a push-button to register that the train is at a designated milepost.
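The cumulative error caused by a mismatch between the assumed and actual wheel size can be estimated with a short calculation. The Python sketch below is illustrative only: the pulse count per revolution and the two wheel diameters are hypothetical values, not figures from the test trains, and wheel slippage is ignored.

```python
import math


def tacho_distance(pulses, pulses_per_rev, wheel_diameter_m):
    """Distance in metres implied by a tachometer pulse count."""
    return pulses / pulses_per_rev * math.pi * wheel_diameter_m


def drift_per_km(assumed_diameter_m, actual_diameter_m):
    """Position error in metres accumulated per kilometre actually travelled.

    The wheel turns 1000 / (pi * actual) times per kilometre, and each turn
    is credited with pi * assumed metres by the position processor.
    """
    return 1000.0 * (assumed_diameter_m / actual_diameter_m - 1.0)


# Hypothetical example: a nominal 0.920 m wheel that has worn to 0.915 m.
error = drift_per_km(0.920, 0.915)
```

With these assumed figures the computed position runs ahead of the true position by roughly 5.5 metres for every kilometre travelled, so a ±1 metre requirement would be exceeded within about 200 metres of the last manual correction.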
The linear positioning system has several drawbacks:
• Maintenance crews sent via the road network to a reported fault need both
the linear railway maps and conventional geographical maps to determine
where to gain entry to the railway network near to the fault. Once on the
track, pin pointing the exact location of the fault is difficult.
• The positions given in the linear map for railway assets, like signal gantries
and tunnels, are frequently incorrect².
• The method of manually correcting tachometer derived distance on the test
trains is inconsistent and error-prone.
• Test train recording runs are based on pre-planned routes. For each test
run, a Route Setting Tape (RST) file is generated. This file contains the
relevant part of the linear map and is used by the position processor on
each train. If a test train is diverted during recording, then the RST no
longer describes the route of the test train.
British Rail has decided that position information for measurements from future
test train systems will be based on a two-dimensional map system, with test trains
determining position in real time during recording. This decision is based on the
problems associated with the current system and the fact that there is a wide range of
commercial Geographical Information System (GIS) products for manipulation and
presentation of data from two-dimensional map-based databases.
The ability to pin-point the current position of a test vehicle is essential in
providing an integrated infrastructure database. The accuracy of the measurement
must be ±1 metre to ensure that adjacent tracks, which have a nominal spacing
between nearest rails of six feet, can be distinguished during operation to guarantee
that measurements are assigned to the correct track. Test vehicles travel through a
wide variety of environments, from open countryside to urban canyons and tunnels.
To obtain the required accuracy under these changing conditions requires the use of a
multi-modal positioning system which utilises inertial navigation systems,
tachometers, satellite positioning systems and land based radio positioning systems. In
addition to sensor data, it is possible to use historical data from previous runs over the
same track to aid the position solution.
² It has been claimed that the gangs of men who originally laid the railway track were paid by the
length of track laid and that unscrupulous workers removed links from the chains that were used to
measure the amount of track laid. It is this deception, so the story goes, which has led to the errors
in the positioning of mileposts, and therefore the assets, within the network. Whatever the cause, it is
certainly true that the positions of the assets on the linear map are not always accurate.
The results from this aspect of the ROLIN project have demonstrated that the
required accuracy in position measurement, given the variable environmental
conditions of the railway network, can only be achieved through the use of a
multi-modal positioning system [Dale96].
4.1.2 Video Processing
A study undertaken by Railtest, which provided the motivation for the ROLIN
project, analysed the capabilities of the current suite of test trains and considered
future infrastructure and track measurement requirements. Video processing, one of
the key themes identified by this study, relates to the requirement for complex
real-time video processing systems in future test trains. Applications include compression
of video from forward looking cameras on test trains and enhancements to the
structure gauging train processing.
Video from forward looking cameras on the test trains is currently recorded on
standard VHS video tape. A long term goal for the integrated infrastructure database
is to have recordings from these forward-looking cameras available as database
entities which can be accessed in the same way as other data within the database. This
feature will require digital storage of the video signal to ensure that database access is
responsive. To reduce the storage requirements of the video data, compression of the
video signal is necessary. The compression must be loss-less to ensure that
measurements can be made from any single frame of the recorded data. Real-time
hardware implementations of standard video compression algorithms are becoming
available, but these concentrate on lossy compression as required for the domestic
market. It is believed by the author that dedicated algorithms, which take advantage
of the restricted and predictable motion of the train to provide accurate pixel
translation vectors, have the potential for higher loss-less compression ratios than
standard techniques. Investigation of this proposition requires a flexible and powerful
video processing architecture to provide a mechanism for implementing and testing a
range of solutions.
The most demanding measurements on the current suite of test trains are those
which utilise the processing of video camera images to determine the closest approach
of structures surrounding the vehicle. The present implementation of this closest
approach function is based on a dedicated hardware solution. Simple hardware
processing and optical filtering are used to reduce the number of spurious signals
reported, but this processing is inadequate in many cases, leading to erroneous results
which must be manually detected and removed from the data.
In addition to problems introduced by the limited processing of the data, there
are other issues relating to the current implementation which must be addressed for
future systems. Calibration of the cameras is difficult, taking in the order of two
weeks to complete. Data storage frequency is low, with only the worst case clearance
over successive 5 metre track segments being stored, and there is no alignment of the
segment start positions between recordings over the same section of track. A higher
frequency of data recording and better correlation between different runs past the
same structure will enable the data to be used by civil engineers to build up an
historical picture of the movement of structures over an extended period of time. This
information can be used to schedule maintenance and pin-point structures which are
subject to movement.
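The effect of storing only the worst case clearance per 5 metre segment, and of unaligned segment starts, can be seen in a small example. The Python sketch below is illustrative: the function name, the sample positions and the clearance values (taken to be in millimetres) are all hypothetical.

```python
import math


def worst_case_per_segment(samples, segment_m=5.0, start_m=0.0):
    """Reduce (position_m, clearance) samples to one minimum per segment.

    Segments are segment_m long and are counted from start_m, so a different
    start position groups the same samples differently.
    """
    worst = {}
    for position, clearance in samples:
        index = math.floor((position - start_m) / segment_m)
        worst[index] = min(clearance, worst.get(index, math.inf))
    return worst


# Hypothetical clearances measured on one pass of a structure.
samples = [(1.0, 300), (4.0, 250), (6.0, 400), (9.0, 120), (11.0, 350)]

run_a = worst_case_per_segment(samples, start_m=0.0)
run_b = worst_case_per_segment(samples, start_m=2.0)
```

With segments starting at 0 m the stored values are 250, 120 and 350, whereas a run whose segments start 2 m later stores 300, 250 and 120 for the same measurements. The two records cannot be compared segment by segment, which illustrates why aligned and more frequent recording is needed before the data can reveal movement of a structure over time.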
The Channel Tunnel has opened up a major opportunity to expand the
commercial exploitation of the Structure Gauging Train. Although access to the
European network for test trains is now straightforward, the operating clearances of
European trains are larger than those used in the UK. The Structure Gauging Train
cameras were designed for measuring the UK network only, and are not capable of
viewing larger clearances. To accommodate European networks, more cameras are
required to increase the field of view of the vehicle.
It has been demonstrated over the operational life of the Structure Gauging
Train that the basic measurement technique, utilising video cameras and a beam of
light emitted from the vehicle, provides a workable solution to the problem of
measuring the nearest approach of structures to the vehicle at high speed. Proposals
have also been made to use the same technique for measuring the distance to the
adjacent track and the spread of the running rails under the load of the train. It is clear
that video processing for the Structure Gauging Train requires improvement to fully
exploit the potential of the technique.
The algorithms required to improve the existing processing and the processing
requirements of the new applications are not known in any detail. For this reason,
development of application specific devices to implement the complex real-time video
processing functions for the structure gauging function is not an option. Video
processing on future test trains must be based on utilising general purpose,
commercially available processors with a software implementation of the required
algorithms. The development of a specification for such a system requires knowledge
of the processing demand imposed by the algorithms and the performance obtainable
from each processor. The video processing sub-project investigates these two issues
through the implementation of a simple gauging algorithm on a commercial Digital
Signal Processor and the provision of a method to record compressed video data and
vehicle motion sensor data.
Execution time of the gauging algorithm will provide a measure of the
performance obtainable from a commercial processor in a video application, and the
recorded data will enable off-line development and evaluation of robust measurement
algorithms. The research also provides a means of testing a proposal for video
processing put forward by the author which eliminates camera alignment issues.
Alignment of the cameras accounts for the majority of the time taken to calibrate the
current system. Elimination of this problem will allow more frequent calibrations to be
performed and will lead to reduced system down time when failed cameras are
replaced and need calibration. Details of this proposal are provided in Section 4.2.3.
4.1.3 Database Management
This aspect of the ROLIN project is concerned with the implementation of a
large map-based database and the mechanism for tagging measurement data with
positional information prior to storage.
The potential size of a database containing all of the data from the
measurements made on the railway network in the United Kingdom is massive. There
are approximately 20,000 kilometres of track in the passenger transport network. The
target level of monitoring for track with this class of usage is an inspection at least
every twelve months. Track geometry data has the highest data content of the
processed data measurements, consisting of 15 parameters stored at 2.5 cm intervals.
This generates approximately 24 Gbytes of data per annum for storage in the
database. Storage of other measurements and access to historical data will increase
the database size to several hundred gigabytes and with the addition of video data, the
database size will be considerably larger.
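The annual track geometry figure can be checked with a short calculation. The sketch below is illustrative only: the text does not state the storage size of each parameter, so a value of 2 bytes per parameter is assumed here because it reproduces the quoted figure of approximately 24 Gbytes, and one inspection per year is taken from the stated monitoring target.

```python
# Back-of-envelope check of the annual track geometry data volume.
TRACK_KM = 20_000            # passenger network length quoted in the text
SAMPLE_INTERVAL_CM = 2.5     # spacing between stored samples
PARAMETERS_PER_SAMPLE = 15   # track geometry parameters stored per sample
BYTES_PER_PARAMETER = 2      # ASSUMPTION: 16-bit values, not stated in the text
INSPECTIONS_PER_YEAR = 1     # target of one inspection every twelve months


def annual_geometry_bytes() -> int:
    """Return the estimated number of bytes of geometry data stored per year."""
    track_cm = TRACK_KM * 1000 * 100
    samples = track_cm / SAMPLE_INTERVAL_CM
    total = samples * PARAMETERS_PER_SAMPLE * BYTES_PER_PARAMETER
    return int(total * INSPECTIONS_PER_YEAR)


if __name__ == "__main__":
    print(annual_geometry_bytes() / 1e9, "Gbytes per annum")
```

Under these assumptions the network yields 8 x 10^8 sample points per inspection, which at 30 bytes per sample point gives the 24 Gbyte figure.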
Telecommunications companies, and the utility companies responsible for
supplying water, electricity and gas, have a similar requirement for massive map-based
databases to provide asset management facilities. A commercial product, Smallworld
[Batty], is used by a number of companies within these sectors to implement their
large databases. A Canadian railway freight company also uses the product to
implement an asset management facility.
The Smallworld product meets many of the requirements of the integrated
database, including the ability to deal with massive amounts of data. Distribution of
the database across geographically separate locations is supported using a persistent
cache approach. Typically, remote sites only need to access a subset of the whole
database. A local cache disk will be automatically populated with the required subset
of data from the central database to form a local slave database, the contents of which
will depend on the access requirements of the site. All workers at the remote site
have access to the local slave database. This approach reduces inter-site
communication demands by providing a slave database at each remote site to which
most accesses are directed, and reduces the load on the central database server.
A Transputer-based position processing system, developed under this section of
the ROLIN project, manages the real-time tagging of data from the measurement
system with position information provided by the positioning system. It also provides
real-time exceedance checking and a real-time display of train position on a map
background.
The assignment of measurement data to positions on the railway map requires
accuracy in the map and position of ±1 m. For the demonstration, a map with this
accuracy was not available. To provide reference points to which the recorded data
can be attached in the database, the Automatic Warning System (AWS) magnets on
the track near to the platforms on the trial route were accurately surveyed prior to the
trials. The AWS magnets are part of a safety system which ensures that the driver
remains alert. As the train passes over the AWS magnet, a braking system on the train
is primed and an alarm is sounded in the driver's cab. The brakes are automatically
applied by the braking system if the driver fails to cancel the alarm within a preset
time.
Retrospective positional manipulation is applied to the recorded data to fit it to
the database map which consists of the AWS positions. A manually digitised map is
also used as a backdrop to the displayed data and the AWS location markers to
demonstrate how the display will look with a real map background.
4.2 The Technology Demonstrator
The work from the three key areas of the ROLIN project is brought together in
a technology demonstrator which implements a real-time video-based measurement,
tags the measurement with position derived from a multi-sensor positioning system
and stores the results in a two-dimensional map-based database.
4.2.1 System Overview
The demonstrator system consists of three sub-systems: the video processing
system, the geographical positioning system and the database system, as illustrated in
Figure 16.
The video processing implemented in the demonstrator is based on the standard
structure gauging measurement technique of observing where a light beam, which is
orthogonal to the longitudinal axis of the vehicle, illuminates a structure using a video
camera as the sensor. Existing Structure Gauging Train vehicle attitude sensors and
light beam generators are used in conjunction with an additional camera mounted on
the optical car which views the reflections of the light beam from the platform. The
measurement derived by the system is the minimum distance to the platform edge
from the centre line of the tracks and the height of the platform above the rails.
A standard camera on the Structure Gauging Train is used to determine the
position of the running rail nearest to the platform. This information is required to
compensate for the centre throw of the optical car which occurs when the vehicle is
on a curved section of track.
[Figure: block diagram; surviving labels: GPS receiver, FM band corrections, tacho, inertial platform, video data, vehicle attitude sensors]
Figure 16 - ROLIN Technology Demonstrator
A measurement of the position of the edge of the platform from the running
rails is produced in real-time using an algorithm running on a general purpose Digital
Signal Processor. A data acquisition board, described in Section 4.2.2, captures data
from the cameras and the motion sensors and feeds this data directly to the video
processing system. Sensor and video data is also stored on a hard disk connected to a
second Digital Signal Processor.
The co-ordinates of the platform edge, which are derived for every frame of
video data, are transmitted to the position processor through a C011 Transputer
Over-Sampled (OS) link adaptor device [Inmos89]. The position processor controls
the transmission of the platform edge co-ordinate and initialisation of the video
processing sub-system via a second Transputer OS link.
The position of the vehicle is derived from a combination of the tachometer, an
inertial navigation system, a Global Positioning System (GPS) receiver and
Differential GPS corrections received from Classic FM transmitters which broadcast
the corrections on the stereo pilot tone. The data from the different position sensors is
combined in a Kalman filter [Quest96], which uses a least squares criterion in
conjunction with a model of the behaviour of the system to determine the position of
the vehicle.
Platform edge co-ordinates, generated at 50 Hz from the video processing
system, and train position generated at 200 Hz from the multi-modal positioning
system, are gathered by the Transputer-based position processor. This processor
combines the data sources to generate position-tagged platform edge co-ordinates
which are stored in a file. This file is post-processed before the data is stored in the
Smallworld database for presentation and manipulation.
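The combination of the 50 Hz and 200 Hz streams can be sketched as follows. The text does not describe how the position processor pairs the two data sources, so this example simply assumes that both streams carry timestamps from a common clock and tags each frame with the position sample nearest in time. The function name, the tuple layouts and the use of integer milliseconds and millimetres are all hypothetical.

```python
from bisect import bisect_left


def tag_with_position(frames, positions):
    """Tag platform edge co-ordinates with the nearest-in-time train position.

    frames:    list of (time_ms, x, y) produced at 50 Hz
    positions: list of (time_ms, distance_mm) produced at 200 Hz, sorted by time
    Returns a list of (distance_mm, x, y).
    ASSUMPTION: nearest-sample pairing; the real system may interpolate.
    """
    times = [t for t, _ in positions]
    tagged = []
    for t, x, y in frames:
        i = bisect_left(times, t)
        # The nearest sample is one of the two either side of the insertion point.
        candidates = [j for j in (i - 1, i) if 0 <= j < len(times)]
        nearest = min(candidates, key=lambda j: abs(times[j] - t))
        tagged.append((positions[nearest][1], x, y))
    return tagged


if __name__ == "__main__":
    positions = [(k * 5, k * 22) for k in range(9)]   # one sample every 5 ms
    frames = [(0, 700, 900), (20, 701, 901)]          # one frame every 20 ms
    print(tag_with_position(frames, positions))
```

With synchronised clocks, every fourth position sample coincides with a video frame, so nearest-sample pairing introduces no additional timing error in that case.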
4.2.2 Data Acquisition Board
A data acquisition board to capture video and sensor data on the Structure
Gauging Train has been developed by the author using medium density programmable
logic to process raw video and sensor data for transmission to the video processing
system (Photograph 2). The board is implemented as a PC plug-in card conforming to
the ISA bus standard [Solari91].
Data acquired by the board is fed to a byte-wide communication channel
conforming to the Texas Instruments TMS320C40 communication link (C40 link)
standard [TIUG91]. Control inputs to the board are generated by programs running
on the PC which communicate through an I/O mapped register set. The acquisition
board also provides link adaptor hardware to convert between the C40 link and
Transputer OS link standards to allow the video processing sub-system to
communicate with the position processor sub-system.
Photograph 2 - SGT Acquisition Board
Inputs to the acquisition board from the instrumentation consist of two video
signals, four analogue inputs and one pulse input (Figure 17). One video input to the
data acquisition board is connected to the camera which views the light beam
reflections from the platform. The other is attached to the standard gauging camera
which views the light beam reflections from the nearest running rail to the platform.
These video inputs are fed through an analogue threshold function which includes
hysteresis, to produce a two state video signal. Threshold and hysteresis levels are set
by the operator prior to data recording via the PC ISA bus. Digital to Analogue
Converters (DACs) on the acquisition board produce voltages corresponding to the
threshold and hysteresis levels which are fed to the analogue video circuitry.
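The behaviour of a threshold with hysteresis can be illustrated with a small software model. This is only a sketch of the general technique: the board performs the function in analogue circuitry, and the assumption made here that the hysteresis band is centred on the threshold level is not taken from the text.

```python
def threshold_with_hysteresis(samples, threshold, hysteresis):
    """Convert sampled video levels into a two state signal (0 = black, 1 = white).

    ASSUMPTION: the hysteresis band is centred on the threshold, so the output
    turns white above threshold + hysteresis/2 and black below
    threshold - hysteresis/2. Between the two levels the previous state is held.
    """
    upper = threshold + hysteresis / 2
    lower = threshold - hysteresis / 2
    state = False
    output = []
    for level in samples:
        if not state and level > upper:
            state = True
        elif state and level < lower:
            state = False
        output.append(1 if state else 0)
    return output


if __name__ == "__main__":
    levels = [0.1, 0.55, 0.65, 0.55, 0.45, 0.35, 0.1]
    print(threshold_with_hysteresis(levels, threshold=0.5, hysteresis=0.2))
```

Holding the previous state inside the band prevents noise around the threshold level from producing a rapid succession of short black and white segments.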
Figure 17 - Data Acquisition Board Block Diagram
The resultant two state video signal is run-length encoded using a 50 MHz pixel
clock to produce low-rate image data. The run-length encoding typically produces
only a few codes per video line for the processed images encountered by the Structure
Gauging Train. A typical example of a processed image from the platform edge
camera is given in Figure 18, from which it can be seen that, in the absence of any
spurious signals, the lines containing the light reflected from the platform edge
typically consist of only three segments. There is a black segment from the start of the
line to the start of the reflection from the platform, a white segment which is the
reflected light from the platform and a black segment to the end of the video line. Run
length encoding of a three segment video line requires only two codes: the length
from the start of the video line to the start of the white segment and the length of the
white segment. The run length encoding of the video frame produces a very high
loss-less compression of the data. With a 50 MHz pixel clock there are 2600 active pixels
per video line, producing a compression ratio of 1300:1 for the example cited.
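The two-code encoding of a three segment line can be sketched in software. The actual code format produced by the programmable logic is not given in the text; this example assumes that every line starts in the black state and that the final run is omitted because it is implied by the fixed line length. The ratio computed is pixels per code, which is the sense in which the 1300:1 figure is quoted.

```python
PIXELS_PER_LINE = 2600  # active pixels per line with a 50 MHz pixel clock


def run_length_encode(line):
    """Encode a two state video line (0 = black, 1 = white) as run lengths.

    ASSUMPTIONS: the first run is black (a zero-length run is emitted if the
    line starts white) and the last run is dropped because the line length is
    fixed and known to the decoder.
    """
    runs = []
    current = 0
    count = 0
    for pixel in line:
        if pixel == current:
            count += 1
        else:
            runs.append(count)
            current = pixel
            count = 1
    return runs  # the trailing run is deliberately not appended


if __name__ == "__main__":
    # Black up to pixel 1200, a 40 pixel reflection, then black to the end.
    line = [0] * 1200 + [1] * 40 + [0] * (PIXELS_PER_LINE - 1240)
    codes = run_length_encode(line)
    print(codes)                          # [1200, 40]
    print(PIXELS_PER_LINE / len(codes))   # 1300.0
```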
Figure 18 - Typical Platform Edge Image
The four analogue inputs to the data acquisition board are fed by four Linear
Variable Differential Transducers (LVDTs) which measure the attitude of the optical
car relative to the running rails. The optical car has an LVDT mounted over each of
its four wheels. Each LVDT is attached at one end to a bogie of the optical car,
which has a fixed position relative to the running rails, and at the other end to the
suspended body of the optical car. The LVDT inputs to the acquisition board are fed
to Analogue to Digital Converters (ADCs) to produce digital representations of the
position of the LVDTs. These measurements are used to produce values for the
vertical deflection at the centre of the optical car and its roll attitude.
The pulse input to the data acquisition board is fed by the tachometer of the
Structure Gauging Train. The time between pulses is measured to produce a digital
value representing distance travelled since the last pulse.
The run length encoded video, the LVDT values and the tachometer
measurements are combined into a single data stream which is fed out of the
acquisition board via a C40 link. The implementation of this link standard allows
acquired data to be fed directly from the acquisition board into a TMS320C40
processor. The acquisition board employs a small First-In-First-Out (FIFO) memory
to provide a buffer between the real-time measurement data and the C40 link to
ensure that no data is lost when the processor briefly suspends link transfers. This
situation occurs when re-initialising the internal Direct Memory Access (DMA)
controller of the TMS320C40 which is used to transfer data internally between the
communication link buffer and memory.
4.2.3 Generation of Platform Edge Co-ordinate Measurements
With the current structure gauging cameras, alignment of the cameras is critical.
The positional measurements for main gauging rely on camera pairs being
optically aligned with each other and on cameras being orthogonal to the plane of the
light beam, and also require that the camera line scan and frame scan are linear and
matched between camera pairs. Adjustment of the camera pairs to achieve the
required alignment is a laborious task. In addition, the requirement for the cameras to
be orthogonal to the plane of the light beam, which simplifies the video processing, sets
the cameras such that approximately half of the field of each camera is wasted viewing
the body of the optical car.
The author proposed that gauging measurements can be obtained from cameras
which are not optically aligned or electronically adjusted in any way. Camera sweep
non-linearities and camera pair misalignments are removed as part of the processing of
the video images. The technique relies on the use of a calibration board with markers
at known positions relative to the rails. The markers divide the area of measurement
into small square regions, or tiles, over which it is assumed that the translation is
linear between camera image plane co-ordinates and a rail-referenced co-ordinate
system (Figure 19). A parameterised set of linear equations defines the translation
between co-ordinate systems (see below). The parameter values of these equations for
each tile in the calibration image are derived during the calibration procedure.
Define: $(x_i, y_i)$ as the pixel coordinates in the camera coordinate system (CCS)
$(x_c, y_c)$ as the pixel coordinates on the calibration plane in the world coordinate system (WCS)
$z_c$ as the z coordinate of the calibration plane in the WCS
$O = (x_o, y_o, z_o)^T$ as the origin of the camera plane in the WCS
$X = (x_x, y_x, z_x)^T$ and $Y = (x_y, y_y, z_y)^T$ as the unit vectors of the camera X and Y axes in the WCS

Translating pixel coordinates $(x_i, y_i)$ into the WCS: $W = O + x_i X + y_i Y$, where $W = (x_w, y_w, z_w)^T$

Project these coordinates onto the calibration plane using similar triangles. E.g., for the x coordinate:
$$\frac{x_c}{z_c} = \frac{x_w}{z_w}$$

Substituting from $W$ and rearranging:
$$x_c = \frac{p_1 + x_i p_2 + y_i p_3}{1 + x_i p_4 + y_i p_5}$$

There are five unknown parameters $p_1 \ldots p_5$, therefore we need five sets of coordinates $(x_c, x_i, y_i)$ to
produce a solution (similarly for $y_c$). One coordinate set comes from each of the four corners of the
tile. The fifth coordinate set is derived by joining opposite corners of each tile to construct the centre
coordinate of the tile.
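The derivation of the five tile parameters can be sketched as a small linear solve. Multiplying the relation x_c = (p1 + x_i p2 + y_i p3) / (1 + x_i p4 + y_i p5) through by its denominator gives an equation that is linear in p1 to p5, so five coordinate sets yield a 5 x 5 linear system. The thesis does not state which numerical method was used; the Gaussian elimination below, the function names and the synthetic test values are assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_tile(points):
    """points: five (x_i, y_i, x_c) triples (four corners plus the centre).

    Rearranged relation: p1 + x_i*p2 + y_i*p3 - x_c*x_i*p4 - x_c*y_i*p5 = x_c
    Returns [p1, p2, p3, p4, p5].
    """
    rows = [[1.0, xi, yi, -xc * xi, -xc * yi] for xi, yi, xc in points]
    rhs = [xc for _, _, xc in points]
    return solve_linear(rows, rhs)


def to_calibration_plane(p, xi, yi):
    """Translate a pixel coordinate using the parameters of its tile."""
    return (p[0] + xi * p[1] + yi * p[2]) / (1.0 + xi * p[3] + yi * p[4])


if __name__ == "__main__":
    true_p = [100.0, 2.0, 0.5, 0.001, 0.002]         # synthetic parameters
    pixels = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
    points = [(x, y, to_calibration_plane(true_p, x, y)) for x, y in pixels]
    print(fit_tile(points))
```

The same procedure applies to the y coordinate with its own set of five parameters.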
During normal operation, a pixel in the image is translated to the rail referenced
co-ordinate system using the linear equations with the parameters of the tile
surrounding the pixel. The technique is extensible to cameras which are angled
relative to the plane of the light beam. Perspective distortion of an angled camera
introduces linear transformations to the system which are compensated for in the same
way as other distortions.
[Figure: four panels: calibration board; camera view of the calibration board with barrel distortions; tile definitions derived from calibration image; expanded tile]
Figure 19 - Calibration Processing
The platform edge measurement is made using a single camera view of the 
platform edge. The camera is angled down from the horizontal plane of the optical car 
and outwards from the lateral vertical plane. This arrangement eliminates extraneous 
light sources and reduces the spurious images produced by wet platforms illuminated 
by platform lighting (Section 1.2.2). With this arrangement, the camera is at an angle 
relative to the light beam, which also provides an opportunity to test the perspective 
distortion compensation technique proposed by the author.
Generation of the rail referenced platform edge co-ordinates from camera image 
co-ordinates requires the processing steps illustrated in Figure 20.
[Figure: flow diagram of the following processing steps]
1. Extract platform edge co-ordinates from the platform edge camera image, giving Xc,Yc.
2. Calculate optical car referenced co-ordinates using the calibration parameters, giving Xo,Yo.
3. Apply roll correction to the co-ordinates, using the optical car roll calculated from the LVDTs, giving Xor,Yor.
4. Apply optical car vertical displacement correction, using the vertical displacement calculated from the LVDTs, giving Xor,Ype.
5. Apply optical car horizontal displacement correction, using the rail horizontal displacement extracted from the rail camera image, giving Xpe,Ype (co-ordinates relative to the running rails).
Figure 20 - Platform Edge Processing
4.2.4 Calibration Procedure
The calibration board used to derive the parameters of the transformation
equations is black with white markers placed in a 100mm grid pattern covering a
region of 500mm-900mm horizontally and 600mm-1200mm vertically. One marker is
different in appearance to provide a target for aligning the camera during the
calibration procedure.
Figure 21 - Calibration Board Marker Layout
The board is mounted coplanar with the light beam at a known position relative
to the rails. The image of the calibration board obtained from the platform edge
viewing camera is illustrated in Figure 22.
Figure 22 - Captured Image of Calibration Board
To generate calibration parameters, data from the cameras and LVDTs is
recorded into a reference file. This file is accessed off-line with a program that
provides mouse controlled boxes which are manually placed over the areas of the
image containing the white markers and the rail (Figure 23). The parameter values for
the tiles are then automatically extracted by the utility using the known positions of
the markers relative to the rails.
Figure 23 - Boxes Marking Calibration Regions
Calibration of the LVDT sensors is performed using a calibration rig which
provides fixed extensions and compressions of the LVDT. The values generated by
the ADCs for each LVDT with -1 inch, 0 inch and +1 inch extensions are measured,
from which gain and offset parameters for each LVDT are derived. (The SGT system
uses a mixture of imperial and metric units, including the measurement of distance in
Miles and Chains, for historical reasons only.)
The measurement of the position of the rail uses a calibrated main gauging
camera. Pixel positions from the camera are directly proportional to horizontal
deflections of the rail; therefore, no calibration has been implemented for the
demonstrator. In the final system, the calibration procedure for this camera will be as
described above for the platform edge camera.
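The derivation of LVDT gain and offset from the three rig readings can be sketched as follows. The text does not say how the parameters were computed; this example assumes a linear model, ADC value = gain x extension + offset, fitted by least squares to the readings at -1, 0 and +1 inch. For these three symmetric points the fit reduces to the closed-form expressions used below. The function names and ADC values are hypothetical.

```python
def lvdt_calibration(adc_minus, adc_zero, adc_plus):
    """Fit a line through ADC readings taken at -1, 0 and +1 inch extension.

    ASSUMPTION: least-squares straight line. For x = -1, 0, +1 the slope is
    (y_plus - y_minus) / 2 and the intercept is the mean of the readings.
    Returns (gain in counts per inch, offset in counts).
    """
    gain = (adc_plus - adc_minus) / 2.0
    offset = (adc_minus + adc_zero + adc_plus) / 3.0
    return gain, offset


def adc_to_inches(adc, gain, offset):
    """Convert a raw ADC reading to an LVDT extension in inches."""
    return (adc - offset) / gain


if __name__ == "__main__":
    gain, offset = lvdt_calibration(1000, 2048, 3096)   # hypothetical readings
    print(gain, offset, adc_to_inches(2572, gain, offset))
```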
4.2.5 Platform Edge Gauging Algorithms
The video picture to be processed is a two state image. Reflections from
track-side structures in the path of the fan-beam of light produce white objects in the video
picture, consisting of segments of light spread over a number of video lines. The
platform edge gauging algorithms extract the co-ordinates of the platform edge from
these images.
Two platform edge gauging algorithms, LEFTMOST and DET3, have been
implemented on the video processing system using Parallel C, which is described
separately in Section 5.7.
The existing hardware implementation of the gauging algorithm is replicated in
software for the LEFTMOST algorithm. All cameras on the structure gauging train
are set up to scan such that the first illuminated pixel encountered on each video line
(i.e. the leftmost pixel) represents where the light beam is reflecting from the part of
the structure nearest to the train. The LEFTMOST algorithm simply finds the video
line which has the leftmost pixel illuminated. The centre point of the segment attached
to this pixel is taken as the platform edge co-ordinate pixel. The screen co-ordinate of
this pixel is converted to the rail-referenced co-ordinate system using the process
described in Section 4.2.3.
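A minimal sketch of the LEFTMOST selection is given below. It is written in Python for brevity rather than the Parallel C used on the DSP, and it assumes that each video line has already been decoded into a list of white segments given as (start, length) pairs ordered from left to right. The data layout, the tie-breaking rule (first line wins) and the integer centre calculation are assumptions.

```python
def leftmost(frame):
    """Return (x, line_number) of the platform edge pixel, or None.

    frame: list of video lines; each line is a list of (start, length)
    white segments ordered left to right (ASSUMED representation).
    """
    best = None
    for line_number, segments in enumerate(frame):
        if not segments:
            continue                      # nothing illuminated on this line
        start, length = segments[0]       # first illuminated pixel on the line
        if best is None or start < best[0]:
            best = (start, length, line_number)
    if best is None:
        return None
    start, length, line_number = best
    # The centre of the segment attached to the leftmost pixel is reported.
    return (start + length // 2, line_number)


if __name__ == "__main__":
    frame = [[], [(1300, 20)], [(1250, 30)], [(1280, 10), (2000, 5)]]
    print(leftmost(frame))   # (1265, 2)
```

The returned screen co-ordinate would then be converted to rail-referenced co-ordinates using the tile parameters from the calibration procedure.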
The second algorithm, DET3, is a more complex algorithm which takes account
of the structure of objects within the image. Spurious objects are rejected by the
algorithm and the object representing the reflections from the platform is identified
using a best fit criterion. The first process of the DET3 algorithm eliminates very
short segments consisting of a few pixels which are too short to be reflections from
the platform edge. The second process identifies objects within the image. For each
segment remaining after the first process, the lines immediately above and below are
checked for segments which overlap horizontally. Those segments with a horizontal
overlap on adjacent lines and which overlap by a prescribed amount are taken to be
part of a single object within the image. The output from this processing step is a
group of arrays, each of which defines an object that has been identified in the image.
The content of each array is a list of the centre points of each of the segments which
make up the object. Arrays which have only a few entries represent objects that are
too small to be reflections from the platform edge and these are eliminated.
The first two processing steps are illustrated in Figure 24.
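The first two DET3 processes can be sketched as follows. This is an illustrative simplification rather than the implemented algorithm: the threshold values min_length, min_overlap and min_segments are hypothetical (the text only says "a few pixels", "a prescribed amount" and "a few entries"), segments are assumed to be (start, length) pairs per video line, and a segment is attached to the first overlapping object on the line above without merging objects that are bridged by a single segment.

```python
def identify_objects(frame, min_length=4, min_overlap=2, min_segments=5):
    """Group white segments on adjacent lines into objects.

    frame: list of video lines, each a list of (start, length) segments.
    Returns a list of objects; each object is a list of (centre_x, line_number).
    All three thresholds are HYPOTHETICAL values chosen for illustration.
    """
    # Process 1: discard segments that are too short to be platform reflections.
    kept = [[(s, n) for s, n in line if n >= min_length] for line in frame]

    # Process 2: link each segment to an overlapping segment on the line above.
    objects = []
    previous = []                          # (start, length, object_index)
    for line_number, segments in enumerate(kept):
        current = []
        for start, length in segments:
            owner = None
            for p_start, p_length, p_object in previous:
                overlap = (min(start + length, p_start + p_length)
                           - max(start, p_start))
                if overlap >= min_overlap:
                    owner = p_object
                    break
            if owner is None:              # no overlap: start a new object
                objects.append([])
                owner = len(objects) - 1
            objects[owner].append((start + length // 2, line_number))
            current.append((start, length, owner))
        previous = current

    # Objects with only a few entries are too small and are eliminated.
    return [obj for obj in objects if len(obj) >= min_segments]


if __name__ == "__main__":
    frame = [
        [(100, 10)],
        [(98, 10), (300, 2)],      # (300, 2) is too short and is discarded
        [(96, 10)],
        [(94, 10), (500, 8)],
        [(96, 10), (502, 8)],      # the object near 500 has only two entries
        [(98, 10)],
    ]
    print(identify_objects(frame))
```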
[Figure: example image; surviving labels: group #1, group #2, group #3 (too small), segment too small, ungrouped]
Figure 24 - DET3 Object Identification
Reflections from the platform edge produce an object with a specific shape,
which resembles a non-symmetrical left pointing chevron (group #1 in Figure 24).
The arrays remaining from the object identification process are passed through a
process which identifies those objects to which a left pointing chevron can be fitted.
To achieve this, a wire-frame is constructed for every entry within each object array.
The wire frame consists of three nodes joined by two wires. Construction of the wire
frame begins with the central node which is set to the value for the array entry of
interest. The two remaining outer nodes are obtained by stepping through the array an
equal amount in opposite directions from the central node. The starting condition for
the two outer nodes is set to five array entries away from the central node. If at any
stage in the process of stepping through the array, the wire frame is not a left pointing
chevron or the limit of the array is reached, the process is stopped and the last valid
chevron is reported. The wire frame construction process is repeated for every entry
within the object array, producing a set of left pointing chevrons for each object. The
chevron with the largest offset between the outer nodes is then selected to represent
the object. Figure 25 illustrates the elements of the wire frame construction. Example
1 shows an array element which results in a two step construction, example 2 has a
five step construction.
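The wire frame construction can be sketched as shown below. Several details are assumptions, since the text does not define them precisely: a wire frame is treated as a left pointing chevron when the central node lies to the left of both outer nodes, the outer nodes advance one array entry at a time, and the "offset between the outer nodes" is measured as the number of array entries stepped from the centre.

```python
def fit_chevron(points, centre, start_step=5):
    """Return the largest valid step for a chevron centred on points[centre].

    points: object array of (x, line_number) entries ordered by line.
    Returns None if no left pointing chevron can be fitted.
    ASSUMPTION: 'left pointing' means the centre x is smaller than both
    outer node x values.
    """
    best = None
    step = start_step
    while centre - step >= 0 and centre + step < len(points):
        centre_x = points[centre][0]
        upper_x = points[centre - step][0]
        lower_x = points[centre + step][0]
        if not (centre_x < upper_x and centre_x < lower_x):
            break                     # no longer a chevron: keep last valid one
        best = step
        step += 1
    return best


def best_chevron(points, start_step=5):
    """Return (centre_index, step) of the widest chevron in the object, or None."""
    best = None
    for index in range(len(points)):
        step = fit_chevron(points, index, start_step)
        if step is not None and (best is None or step > best[1]):
            best = (index, step)
    return best


if __name__ == "__main__":
    # Synthetic V-shaped object whose apex (smallest x) is at array entry 10.
    points = [(100 + abs(k - 10), k) for k in range(21)]
    print(best_chevron(points))   # (10, 10)
```

The value of the selected central node, here points[10], would be taken as the platform edge co-ordinate for the object.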
[Figure: wire frame construction examples; surviving labels: Example 1, step 1, step 2]
Figure 25 - DET3 Wire Frame Construction
The object with the largest chevron is selected as representing the reflection from the
platform. The platform edge co-ordinate is the value of the central node of the
selected chevron. The screen co-ordinate of the central node is converted to the
rail-referenced co-ordinate system using the process described in Section 4.2.3.
4.2.6 Results
The technology demonstrator brought together all aspects of the ROLIN
project in a single application, the measurement of the platform edge co-ordinate
which is tagged with its geographical position and stored in a geographical database.
Under the video processing theme, a data acquisition unit has been developed to
interface between the system sensors and the signal processing sub-system, a
mechanism has been implemented to record minimally-processed data from the
Structure Gauging Train for use in developing algorithms off-line, and a large set of
recorded video and sensor data now exists. An important consequence of this work is
that algorithm development can progress using the data recorded during the
demonstration and other recording runs made on the SGT. Further data can also be
gathered if required, using the data acquisition board developed as part of the ROLIN
project.
Two video processing algorithms have been successfully implemented using
Parallel C on commercial DSP products. The LEFTMOST algorithm runs in real time
at 50 frames/sec on a single DSP; the DET3 algorithm runs at 25 frames/sec on a
single DSP. Complex processing of images from a single camera in real time will
require at least two processors, based on the single-processor execution time of
DET3. From these results, the processing requirements for a full system can be more
accurately estimated. It is envisaged that, in future, DET3 and other algorithms will be
manually decomposed into a number of processes for execution on multiple devices,
based on the communicating sequential processes approach, in line with the
implementation of the software for the technology demonstrator. An alternative
approach, discussed in Section 5.7.2 and supported in the Parallel C environment, is
to use the processor farm paradigm, which is well matched to the task of video image
processing [Tregidgo92].
The performance of the measurement system was analysed for a single platform
using results from the video processing system and corresponding manual platform
edge measurements. To ensure that the vehicle attitude and centre throw corrections
are working correctly, the chosen platform is curved and the track is canted (banked).
The error distributions for manual measurements vs. video measurements are plotted
in Figure 26, where the standard deviations and medians of the errors are displayed.
[Figure: plot titled "X and Y error distribution : manual vs average measured - 10mph"; horizontal axis: Error (mm)]
Figure 26 - Manual vs. Video-Based Platform Edge Measurements
The standard deviation is within acceptable limits for both X and Y, but the
offsets in the error distributions for X and Y are unacceptable. These errors have
subsequently been found to be due to misalignment of the calibration board with the
plane of the light fan-beam. The camera used for this set of data was disturbed before
a re-calibration could be undertaken, but more recent results using an improved
method of aligning the calibration board indicate that the offsets have been reduced to
acceptable limits. Repeated measurements from the video processing system past the
same platform were also compared with each other. Figure 27 shows the distribution
of the magnitude of the maximum difference in platform edge co-ordinate between
any of the runs.
[Figure: plot titled "Max Run on run X and Y difference - distribution - 10mph"; horizontal axis: Max Difference (mm)]
Figure 27 - Run-on-Run Error Distribution
The measurements from four passes of the same platform are very well
correlated and within acceptable limits.
There are a number of factors which contribute to the spread of data seen in the
above results. Misalignment of the manual and video-based measurements can occur
due to the problem of correlating video data with the physical position of the
platform; currently, alignment is performed manually using the top of the platform
ramp as a reference point in both data sets. The weight of the train deflects the rails
vertically, therefore manual measurements of the Y platform co-ordinate will be
smaller than the video-based results. However, the video-based measurements
represent a more accurate picture of the position of the platform under normal loaded
track conditions. The use of four LVDTs to measure the position of the optical car
produces a good first-order estimate of the attitude of the vehicle, but does not take
all of the degrees of freedom of the car into account.
The ROLIN video processing research work has demonstrated that general
purpose DSPs can be used to implement the real-time video based measurement
algorithms required for the SGT. The work has also shown that unaligned cameras,
which are angled relative to the plane of the light beam, can be used with the
application of a set of linear transformation equations generated by a calibration
procedure. This is an important result for operational reasons, in that calibration times
can be reduced from weeks to hours. In addition, the positioning of the cameras can
be adjusted such that the available field of view of each camera is fully utilised
(Section 4.2.3).
Chapter 5 
An FPGA-Based Message Router
5. An FPGA-Based Message Router
This chapter describes the implementation of a message routing device using a
programmable logic device. The use of devices which are programmable introduces a
degree of flexibility for implementing communication networks which is not possible
with custom devices. This feature is explored in the first section of this chapter with
the aid of an example application.
The design for a network node to assess the performance of an FPGA-based
router node is developed in this chapter. The node is used for verification of the
message routing mechanisms and to assess the potential for utilising the
programmable nature of the FPGA devices used.
The choice of network topology, the internal architecture of the routing node,
and the message routing mechanisms implemented in the network are described in
detail. The issues of deadlock, livelock and software support for the message routing
network are discussed.
5.1 Routing Node Configurability
The use of reprogrammable parts for the routing node provides an opportunity
to implement a new building block approach to the design of communication
networks for parallel processing systems. The proposal made by the author is that the
network for a particular application be built up from message routing nodes which,
because of their programmability, are not constrained to have the same functionality
at every node, as would be the case for a design using custom devices. Each node
uses routing functions from a library of pre-designed and pre-tested functions. The
utilisation of the available physical channels at each node in the network is optimised
to match the communication pattern requirements of the application by the selection
of an appropriate internal router architecture. In the routing function library, standard
inter-node interface protocols are used for every function to ensure that any routing
function from the library can be used with any other routing function used in an
adjacent node. The configuration for each node in the network is simply constructed
from a number of library routing functions.
5.1.1 An Example Application
To illustrate the building block approach, consider an example application
consisting of a data acquisition sub-system which feeds data to, and receives control
information from, a network of eight processors. The mapping of the processes in the
application to the Processor Elements (PEs) in the network is shown in Figure 28.
[Diagram labels: acquired data, control, output 1, output 2]
Figure 28 - Example Application Process Mapping
The network to support this application is constructed from a number of routing
nodes as illustrated in Figure 29, which includes details of the three different routing
node architectures used.
The inter-processor communication network used for this application has a
linear array or one-dimensional mesh topology. The physical connection between
router nodes consists of four independent channels, each of which can be configured
to route messages in an easterly direction or in a westerly direction.
[Diagram: PE0 to PE7 attached to a row of message routing nodes fed by the data
acquisition sub-system; the message routing node details show one channel routers,
a four channel router and the processor interface]
Figure 29 - Example Application Network
The eight processors are split into two groups consisting of three processors
which receive and pre-process raw data from the data acquisition sub-system, and five
processors which perform the processing necessary to generate the required outputs
from the system. Control signals fed back to the data acquisition sub-system are
generated by processors in both groups.
The first three routing nodes, connected to PE0, PE1 and PE2, utilise both one-
and two-channel routers. Data acquired by the data acquisition sub-system is fed to
the first three processor nodes using a dedicated channel which is serviced by a
one-channel router in each of the three nodes. The same acquired data is sent to all
three processors simultaneously, using a multicast message (one-to-many). The other
one-channel router in each of these routing nodes provides a path back to the data
acquisition sub-system for control messages. The two-channel routers within nodes 0,
1 and 2 provide a path to communicate the processed data to the remaining five
processors. Each two-channel router has the facility for adaptive message routing
between its two channels to improve the blocking characteristics of the network.
Three of the five remaining processors connect together with routing nodes
which have a two-channel adaptive router in each direction. The final two nodes,
which require only an Easterly message routing path, utilise a unidirectional
four-channel adaptive router.
This example illustrates the potential of using mixed routing node designs which
are constructed from a library of routing functions. The communication bandwidth
available from the four physical channels between each node is matched to the
requirements of the application. In the pre-processing group of processors (PE0 to
PE2), a channel is dedicated to the raw acquired data, and two channels provide the
communication bandwidth required to transfer the processed raw data to the main
processing group (PE3 to PE7). For the first three nodes in the main group, a
communication bandwidth which is evenly balanced between the Westerly and
Easterly routing directions is required. This is provided by routing nodes with two
routing channels assigned to each direction. The final two nodes require only Easterly
message transfers, therefore all four channels are made available to the Easterly
direction using a four-channel router function.
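The channel allocation in this example can be written down as a small data structure. The Python below is only an illustrative sketch: the library entry names, the `NETWORK` table and both helper functions are hypothetical, and the placement of the data acquisition sub-system at the western end of the array is an assumption. It checks that every node configuration uses exactly the four physical channels available.

```python
# Hypothetical routing-function library: name -> physical channels occupied.
LIBRARY = {
    "one_channel": 1,
    "two_channel_adaptive": 2,
    "four_channel_adaptive": 4,
}

CHANNELS_PER_LINK = 4  # four independent channels between adjacent nodes

# Each configuration is a list of (library function, direction) pairs.
# Assumption: the data acquisition sub-system sits at the western end,
# so acquired data flows east and control messages flow west.
PRE = [
    ("one_channel", "east"),           # raw acquired data (multicast)
    ("one_channel", "west"),           # control back to data acquisition
    ("two_channel_adaptive", "east"),  # processed data to the main group
]
BALANCED = [("two_channel_adaptive", "east"), ("two_channel_adaptive", "west")]
EAST_ONLY = [("four_channel_adaptive", "east")]

NETWORK = {f"PE{i}": PRE for i in range(3)}
NETWORK.update({f"PE{i}": BALANCED for i in range(3, 6)})
NETWORK.update({f"PE{i}": EAST_ONLY for i in range(6, 8)})


def channels_used(config):
    """Total physical channels consumed by a node configuration."""
    return sum(LIBRARY[name] for name, _ in config)


def bandwidth_by_direction(config):
    """Number of channels assigned to each routing direction."""
    totals = {"east": 0, "west": 0}
    for name, direction in config:
        totals[direction] += LIBRARY[name]
    return totals


for node, config in NETWORK.items():
    assert channels_used(config) == CHANNELS_PER_LINK, node
    print(node, bandwidth_by_direction(config))
```

A different application on the same hardware would change only the `NETWORK` table, which is the point of the building block approach.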
For another application run on the same hardware, a different network
configuration could be adopted. For example, in an application which required a
balanced communication bandwidth between all processors, each node would be
configured to have a two-channel router for each direction.
With only four routing paths between each node, the number of possible router
combinations is limited. However, with multiple routing devices at each node or the
use of higher density parts to provide more routing channels per device, the number of
choices at each node is increased.
5.1.2 Towards a Virtual Message Routing Network
In principle, the modification of the routing architecture at each node could be
undertaken at any time, including during periods of activity, to provide a virtual
message routing network for parallel processing systems. At the simplest level,
alteration of the network architecture would be synchronised to the completion of
each application run on the network. Extending the technique further, the nodes could
be reconfigured rapidly to switch between the network configurations required for
different sections of the same application. This would require very rapid switching of
the configuration state to ensure that minimum network bandwidth is lost during each
change-over. Rapid switching of the logic function in an FPGA is not possible with
the devices available to the author for this research work. However, both the Atmel
AT6000 family [Rosenburg95] and the Xilinx XC6200 family [Fawcett95] implement
a mechanism for rapid configuration changes so that the notion of switching the
network rapidly can be realised. The fast re-configuration time of these devices is now
being exploited in re-configurable computing research (Section 3.2).
Before these ideas can be explored, it is first necessary to establish that the use
of FPGA devices to implement a message routing network is feasible and that the
resulting network produces acceptable performance. These questions are addressed
with this research work.
5.1.3 The Routing Node Configuration Used in this Research
A single network router node configuration is described in this thesis and is
referred to as the verification node design. Simulation and verification tasks are
performed using only this configuration.
The verification network routing node is constructed from a two-channel
routing function and a processor-specific interface. The development of other routing
functions and configurations is beyond the scope of this research. However,
experiments have been performed by the author using the two-channel routing
function, in which the direction of routing on the physical channels is modified from
that used in the verification configuration. The objectives of these experiments were to
establish that the building block approach could be realised, and that different internal
configurations could be implemented in a pin-locked device without loss of
performance. The results of these experiments are presented in Section 6.4.4.
5.2 Network Topology
In an FPGA implementation of a message routing node, the network topology
and physical interface between network nodes is determined by a number of factors
relating to the architecture and density of the FPGA. The I/O resource of the chosen
FPGA sets a limit on the number of routing channels of a given channel width that can
be implemented. The logic resource available within the device determines the
complexity of the routing mechanisms that can be implemented and the number of
routing channels that can be supported.
The topology chosen for the verification network is a linear array. This maps
well to a physical implementation and simplifies the design of the routing engine when
compared to the routing engine required for more complex multi-dimensional
structures.
The network diameter for a linear array is equal to N-1, where N is the number
of nodes. The network diameter is long for large N, which is undesirable. To improve
the performance of the network, a reduction in network diameter to N/2 can be made
by joining the ends of a linear array network to form a ring, but the ring topology
introduces the possibility of channel deadlock. To illustrate this, consider a five node
linear array network in which a message from each processor⁴ is initiated
simultaneously as depicted in Figure 30.
Figure 30 - Deadlock Free Linear Array
There is one routing channel in each direction between nodes of the network,
and therefore three of the messages (M02, M13 and M41) will experience blockages
that will eventually clear to allow all messages to reach their destination (Figure 31).
⁴ Msd is a message from source node s to destination node d
[Diagram: message progression over time: M02, M13 and M41 blocked; then M13 and
M41 routing with M02 blocked; then M02 routing]
Figure 31 - Message Progression
In a ring topology for the same application, assuming that the direction of
message injection is selected based on shortest distance to the destination node, a
deadlock situation would occur immediately (Figure 32).
Figure 32 - Deadlock in a Ring Topology
All messages, being initiated simultaneously and requiring to pass through an
intermediate node, will be immediately blocked at the next node.
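The difference between the two topologies can be checked with a small wait-for analysis. The following Python is a minimal sketch, not the simulation used in this work. It assumes that the five messages are M02, M13, M24, M30 and M41 (a set chosen here because it reproduces the three blocked messages named in the text), that each message has acquired its first channel, and that it is requesting its second. The function names are hypothetical.

```python
def route(src, dst, n, ring):
    """Directed channels (u, v) used by a message from src to dst."""
    if ring:
        clockwise = (dst - src) % n
        step = 1 if clockwise <= n - clockwise else -1  # shortest way round
    else:
        step = 1 if dst > src else -1
    path, node = [], src
    while node != dst:
        nxt = (node + step) % n if ring else node + step
        path.append((node, nxt))
        node = nxt
    return path


def snapshot(messages, n, ring):
    """Return (blocked messages, deadlocked?) just after injection.

    Each message owns its first channel and waits for its second. A cycle
    in the resulting wait-for relation means no message can ever advance.
    """
    paths = {m: route(m[0], m[1], n, ring) for m in messages}
    owner = {p[0]: m for m, p in paths.items()}
    waits_for = {m: owner[p[1]] for m, p in paths.items()
                 if len(p) > 1 and p[1] in owner}

    def reaches_cycle(start):
        seen, m = set(), start
        while m in waits_for:
            if m in seen:
                return True
            seen.add(m)
            m = waits_for[m]
        return False

    return sorted(waits_for), any(reaches_cycle(m) for m in waits_for)


MESSAGES = [(0, 2), (1, 3), (2, 4), (3, 0), (4, 1)]  # assumed message set
print(snapshot(MESSAGES, 5, ring=False))
print(snapshot(MESSAGES, 5, ring=True))
```

In the linear array the wait-for chains end at a message that is free to move, so the blockages clear. In the ring every message waits on its clockwise neighbour and the chain closes on itself.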
A similar technique to reduce the network diameter can be used for
multi-dimensional meshes. The edges of the network are connected with wrap-around
links. In the case of a two-dimensional mesh, a torus is formed, although other
topologies are also possible, including the midimew⁵ [Arruabarrena90].
The verification network routing node design described in this thesis does not
support the connection of end nodes to form a ring topology. To improve the
diameter of the network for large arrays, message routing mechanisms for
implementing hierarchical networks are provided.
The primary goal of this research has been to establish whether an FPGA can be
used to implement a message routing network. Given that FPGA design fit and
performance are difficult to assess without producing a completed design, a simple
topology was chosen and a small number of message routing mechanisms
implemented to ensure that the research could reach a successful conclusion. Higher
dimensional networks, more complex routing mechanisms and exploration of virtual
network schemes will be investigated by research following the work described in this
thesis. It is interesting to note, however, that the internal architecture of the dual
channel router produced in this research is very similar to the two-dimensional routing
element of the MP1 Mad Postman routing device [Jesshope93], as illustrated in
Figure 33.
⁵ minimum distance mesh with wrap-around links
[Diagram: MP1 channel architecture compared with the Xilinx dual bus architecture]
Figure 33 - MP1 Comparison
The MP1 has two additional channels, PIGin and PIGout, for network
expansion which are not included in the verification design. The message
encapsulation and multicast features provided in the MP1 device are also implemented
in the verification network node design. Message adaption in the MP1 design is a
two-dimensional scheme where destination addresses at the head of the message are
swapped. A one-dimensional adaption mechanism is implemented in the verification
design. Replication of the MP1 adaption mechanism would incur some additional
demand on device capacity.
These similarities in design indicate that the implementation of a
two-dimensional FPGA-based network node would require only minimal additional
internal FPGA resources to that required for the verification routing node design
described in this thesis.
5.3 Physical Implementation
The physical implementation chosen for the prototype parallel processing
system was to use separate circuit boards for the processors and for the message
routing devices. Commercial modular processor products were utilised for the
processing elements at each network node. This minimised the hardware development
tasks associated with the implementation of the complete parallel processing system
required for verification of the network routing node design. The hardware design
task with this approach required only the development of a simple network routing
circuit board. The alternative approach, of developing a combined processor and
network router module or Processing Element (PE), and its associated network
interconnection mechanism would be a significantly longer and more complex task.
5.3.1 Asynchronous Communication Links
Connection of the processors to the routing nodes is via asynchronous
communication links in the verification network node design. An advantage of using
asynchronous communication links between the processors and the message routing
network is that signalling issues relating to the length of the communication path are
simplified. For asynchronous communication links, long connections require only a
simple buffer arrangement at each end of the link.
Synchronous communication links with signalling in both directions have a
maximum connection length set by the system clock period and the set up and hold
requirements of the communicating devices. To extend synchronous links beyond this
limit requires the use of additional hardware in the form of synchronous repeaters, the
effect of which is to introduce a pipeline delay into the communication path. In
addition, the synchronous repeaters have a limited span and for large distances,
multiple repeaters must be employed.
5.3.2 Centralising the Network Fabric
The physical implementation of multicomputer systems with a large number of
processors introduces problems associated with packaging and interconnection
lengths. The commonly used Processing Element (PE) module approach, although
attractive for the simplicity of the interconnection backplane, produces a module with
a relatively large number of components. Each PE module typically consists of a
processor, a number of memory devices, a network router device and miscellaneous
interface logic (Figure 34).
Figure 34 - Processing Element
The packing density of a system using this approach is restricted by the size of
the PE module. The physical separation of the network routing nodes imposed by the
low packing density reduces the maximum speed that can be achieved for the
network. This is particularly relevant where the number of processing elements
exceeds the physical confines of a particular enclosure. Extension of the network
connections beyond this limitation leads to performance reductions associated with
buffering the connections between enclosures and providing a global low skew system
clock.
Placing the message routing devices on the backplane offers a potential
improvement in the network performance, but PE module size reduction is minimal
and the problems with spanning enclosures remain.
Physical separation of the message routing network from the processors,
combined with the use of buffered asynchronous communication links, allows the
network routing devices to be placed in close proximity to neighbouring routing
devices to provide maximum network performance. Network node packing density
with this approach is high. Processors are physically remote from the network,
connected to the central message router via asynchronous links of any length. The size
of the processor module is not restricted by the need to pack modules closely together
to achieve good network performance. The use of asynchronous links minimises the
delays caused by the extra wire lengths between the processor and the network
routing device compared to a synchronous approach, where delays are in multiples of
the clock period.
5.4 Routing Node Architecture
5.4.1 Number of Routing Channels
The number of channels available to each node is determined by the number of
pins on the device and the width of the channels. In the verification network design,
four byte-wide routing channels are provided between each node to minimise the
control logic of the routing engine. The use of sub-byte channels would require more
complexity in the router state machines to handle the time-multiplexed input data.
Two byte-wide C40 link channels are provided to the attached processor at
each node, one for injection and one for reception.
5.4.2 Internal Architecture
The internal architecture of the router node implemented in the verification
network (Figure 35) consists of a two-channel router engine for each direction
(dualbus), and a processor interface with separate injection and reception channels
(C40inj and C40rec). The processor interface is shared between the two-channel
routers.
[Diagram annotations: processor connection, each channel has two control signals
plus an 8-bit data bus; network routing channels, each channel has three control
signals plus an 8-bit data bus]
Figure 35 - Internal Architecture of Network Routing Node
The sharing of processor interface channels is necessary to reduce the size of
the control logic for the processor interface, to ensure that the design fits within the
available Xilinx part. This compromise reduces the processor injection and reception
bandwidth, but more significantly, introduces the possibility of deadlock within the
network. A full explanation of this potential deadlock situation and the action taken to
ensure that it does not occur during network operation is deferred until Section 5.6,
which discusses deadlock and livelock issues for the demonstrator network.
5.4.3 The Two-Channel Router Engine
The internal architecture of the two-channel router engine is illustrated in Figure
36. The router has two inbound network channels and one local processor message
injection channel. Selection of the inbound source for each of the routing channels is
controlled by an input multiplexer. Two message routing engines within the node are
attached to flit buffering logic which feeds two outbound network channels and a
local processor message reception channel.
Figure 36 - Two-channel Router Engine
All channels within the network are subject to temporary blockages during
normal operation due to routing resource conflicts. Hold signals, which are an integral
part of each network channel, indicate when a network channel is blocked. The
channel flow control mechanism includes a Strobe signal on each network channel
indicating when a valid flit exists on the data lines of the channel. With the use of a
strobe signal, flits of a message can be separated by periods of no activity.
Full details of the network channel flow control mechanism are given in Section
5.4.4.
5.4.3.1 Data Path Connections
Input multiplexers on the input of each router select between the network
connection and the processor injection channel. Input multiplexer selection is under
control of the processor injection logic. The normal state of the input multiplexer
selects the network input connection.
The processor injection control logic detects when the processor is ready to
send a message. The interface to the processor eliminates metastability issues (Section
6.2.3) and a request for channel ownership, synchronised to the system clock, is sent
to both routers when a message is ready to be injected. When one of the routers is
free, channel ownership is granted to the injection channel and the multiplexer is
switched to connect the processor injection channel to the appropriate router. If both
channels are granted simultaneously, one is selected and the other is released.
For each channel, the routing engine attached to the output of the flit buffering
logic selects the required output connections using requests to the output multiplexer
control logic. The cross connection of output channels provides the data paths
necessary for adaption. A routing engine which encounters a blockage on the inertial
channel will request ownership of the other channel via the output multiplexer logic.
The first output channel to become free will be used; the inertial channel is used in
preference to the adaption channel. Adaption introduces a single clock cycle delay in
the routing of the message for the case when the adaption channel is available
immediately.
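The preference for the inertial channel can be expressed as a short selection rule. The Python below is a behavioural sketch only, with a hypothetical function name and a simplified timing model: the adaption channel is assumed to be requested one clock after the blockage on the inertial channel is first seen, which reproduces the single clock cycle delay described above.

```python
def select_output(inertial_free, adaption_free):
    """Choose the outbound channel for a message header.

    inertial_free / adaption_free: per-clock booleans giving the
    availability of each outbound channel, starting at the clock in which
    the header reaches the routing engine.

    Returns (channel, clock) for the first grant, or (None, None) if
    neither channel becomes free within the supplied clocks.
    """
    for clock, (inertial, adaption) in enumerate(zip(inertial_free, adaption_free)):
        if inertial:
            return "inertial", clock   # inertial channel always preferred
        if clock >= 1 and adaption:
            return "adaption", clock   # requested only after a blockage is seen
    return None, None


print(select_output([True], [True]))
print(select_output([False, False], [True, True]))
print(select_output([False, True], [True, True]))
```

When both channels become free in the same clock, the rule returns the inertial channel, matching the stated preference.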
The outbound network channel multiplexers favour the inertial connection and
maintain the inertial connection unless a request from the other channel occurs for the
purposes of adaption. The processor reception channel multiplexer alternates priority
between each router until a request for the reception channel occurs. At any time, the
routing engine with lower priority can be immediately granted the reception channel if
the routing engine with the higher priority is not requesting the reception channel.
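The reception channel arbitration can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the implemented logic: the class name is hypothetical, priority is assumed to alternate on every idle clock, and the holding of the channel until the tail of the message has passed is not modelled.

```python
class ReceptionArbiter:
    """Alternating-priority arbiter for the shared reception channel."""

    def __init__(self):
        self.priority = 0  # router (0 or 1) that currently has priority

    def clock(self, requests):
        """requests: pair of booleans, one per router.

        Returns the router granted the reception channel, or None.
        """
        if not any(requests):
            self.priority ^= 1         # idle: priority alternates
            return None
        if requests[self.priority]:
            return self.priority       # higher-priority router wins
        return self.priority ^ 1       # lower priority granted immediately


arbiter = ReceptionArbiter()
print(arbiter.clock((False, False)))
print(arbiter.clock((True, True)))
print(arbiter.clock((True, False)))
```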
5.4.3.2 Routing Engine
Figure 37 shows the detail of the router logic and the flit buffering logic.
The routing engine consists of two state machines. The router monitors the
output from the flit buffer and controls the outbound path of the message dependent
on the message header byte and the availability of outbound channels. The flow
control state machine monitors outbound channel hold signals and controls the flit
buffering and the inbound channel hold signalling.
The flit buffering logic has a capacity to store two flits. In general, message flits
pass through one flit buffer in each node. The second flit buffer is necessary because
the inbound channel Hold signal to the preceding node is a registered output. It is
possible for the preceding node to transmit a flit, which must be accepted, at the same
time as the Hold signal is asserted. The second flit buffer is used to store this flit until
the blockage is removed from the node.
Registering the Hold signal within each node is essential to prevent a potentially
long combinatorial path in the hold control logic which, in the limit, would span all
nodes of the network. The Hold signal generated by each node for its inbound channel
flow control depends on the Hold signals entering the node on the outbound
channel(s) associated with the route of the message through the node. The length of a
purely combinatorial path for the Hold signal would therefore be dependent on how
many nodes the message traverses.
The flit buffering logic consists of two flit registers with two multiplexers as
illustrated in Figure 37. The flit buffer can be in one of four states as detailed in Figure
38, dependent on the state of the outbound channels.
State 1: There are no blockages in the output channel. Message data passes through
the primary flit register.
State 2: The output channel is blocked. The primary flit register holds the blocked
message flit. The secondary flit register is configured to accept another flit if it
arrives.
State 3: A second flit has arrived during the blockage period. This is stored in the
secondary flit register.
State 4: When the output channel is no longer blocked, the flit registers are emptied in
turn.
Figure 38 - Flit Buffering
If in State 2 a second flit arrives, the flit buffering control logic will enter State
3. If another flit has not arrived before the output blockage is cleared, a return is made
to State 1. This situation is possible due to the Strobe signalling on the routing
channel, which allows gaps in the message. In State 4, the hold signal to the preceding
node is released at the same time as the data from the secondary flit register is
emptied.
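The four states can be modelled behaviourally as follows. This Python sketch is an interpretation of the state descriptions, not a transcription of the VHDL design, and it is not cycle-accurate. The class and method names are hypothetical. The Hold value returned by each step is assumed to be the value the preceding node acts upon in the following clock, and a renewed blockage during State 4 simply pauses the emptying.

```python
class FlitBuffer:
    """Behavioural sketch of the two-register flit buffer."""

    def __init__(self):
        self.state = 1
        self.primary = None
        self.secondary = None

    def step(self, flit_in, out_blocked):
        """Advance one clock; return (flit_out, hold_to_preceding_node)."""
        flit_out = None
        # Transitions driven by the Hold signal on the outbound channel.
        if self.state == 1 and out_blocked and self.primary is not None:
            self.state = 2
        elif self.state in (2, 3) and not out_blocked:
            self.state = 4 if self.state == 3 else 1

        if self.state == 1:
            if out_blocked:
                # Gap in the primary register: absorb the flit, no Hold yet.
                self.primary = flit_in
            else:
                flit_out, self.primary = self.primary, flit_in
        elif self.state == 2:
            if flit_in is not None:    # flit already in flight when Hold rose
                self.secondary = flit_in
                self.state = 3
        else:                          # States 3 and 4 cannot accept a flit
            if flit_in is not None:
                raise OverflowError("preceding node ignored Hold")
            if self.state == 4 and not out_blocked:
                if self.primary is not None:
                    flit_out, self.primary = self.primary, None
                else:
                    flit_out, self.secondary = self.secondary, None
                    self.state = 1     # Hold released as secondary empties
        return flit_out, self.state != 1


buf = FlitBuffer()
inputs = [("A", False), ("B", False), ("C", True),
          (None, True), (None, False), (None, False)]
print([buf.step(flit, blocked) for flit, blocked in inputs])
```

In this trace flit C arrives in the same clock that the blockage is detected, is parked in the secondary register, and leaves after B once the blockage clears, so flit order is preserved.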
A side-effect of the auxiliary flit buffer is that blocked messages compress
spatially in the network because each node in the path of the blocked message has the
capacity to store two flits.
In addition, when a blockage occurs, it is possible that gaps in the message will
be eliminated. This effect occurs if a blockage is signalled to a node when it has a gap
in the primary flit buffer. The preceding node will not be informed of the blockage
condition until the primary flit buffer holds a flit from the message. This has the effect
of eliminating gaps from the message which allows the message to spatially compress
in the network.
It is possible, given these spatial compression effects, that the tail of a message
will be able to clear the source node whilst the message is blocked further along its
route. This progression of the tail will release the routing resources closest to the
source of the message earlier than would be the case with single flit buffering in each
node.
5.4.4 Routing Channel Flow Control
Messages are routed between network nodes over physical channels. A physical
channel in the prototype network consists of an 8-bit data bus and three control
signals, Strobe, Sync and Hold (Figure 39). The network operates synchronously with
a common clock to all nodes. Each of the control signals and the data bus are updated
on every clock.
[Diagram: DATA (8 bits), STROBE, SYNC and HOLD signals between NODE N and
NODE N+1]
Figure 39 - Routing Channel Signal Definition
The Sync signal indicates to the receiving node when the start and end of each
message occurs. The Sync signal is asserted by the transmitting node on every byte of
a message except the last one.
The minimum message length of 2 bytes is set by the need for at least one high
period on Sync to indicate the start of a message and one low period on Sync to
indicate the end of the message. There is no maximum message length limitation set
by the routing channel flow control protocol.
The availability of each byte of the message is signalled with the Strobe signal.
The Strobe signal is asserted by the transmitting node whenever Data and Sync are
valid. A node that receives Data and Sync with an asserted Strobe must consume the
data.
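The Sync and Strobe rules amount to a simple framing scheme, sketched below in Python. The functions `frame` and `deframe` and the `gap_before` argument are hypothetical names introduced for illustration, the Hold signal is ignored, and idle clocks are represented by an all-zero tuple purely for convenience.

```python
def frame(message, gap_before=()):
    """Return one (data, sync, strobe) tuple per clock for a message.

    gap_before: indexes of bytes that are preceded by one idle clock.
    """
    if len(message) < 2:
        raise ValueError("minimum message length is 2 bytes")
    clocks = []
    last = len(message) - 1
    for i, byte in enumerate(message):
        if i in gap_before:
            clocks.append((0, 0, 0))   # Strobe low: Data and Sync ignored
        clocks.append((byte, int(i != last), 1))  # Sync low on last byte only
    return clocks


def deframe(clocks):
    """Recover the messages from a stream of (data, sync, strobe) tuples."""
    messages, current = [], []
    for data, sync, strobe in clocks:
        if not strobe:
            continue                   # gap in transmission
        current.append(data)
        if not sync:                   # strobed byte with Sync low ends message
            messages.append(bytes(current))
            current = []
    return messages


stream = frame(b"\x25hello", gap_before={2}) + frame(b"\x01\x02")
print(deframe(stream))
```

A one-byte message cannot be framed because there would be no clock in which Sync is high, which is the reason for the 2 byte minimum.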
The use of a strobe signal allows the transmitting node to have gaps in its
transmission and removes the need to have a data buffer between the network and the
attached processor. This reduces the silicon required to implement the processor and
injection logic and the message launch latency associated with having to store all of
the data in a buffer before transmission is eliminated. To maintain the performance of
the network with this approach, it is important that the attached processor can
normally supply the message data to the network at high speed with few gaps. A large
number of gaps between message bytes will waste network bandwidth. The Texas
Instruments TMS320C40 processor used in the prototype has Direct Memory Access
(DMA) capabilities on each of the connection channels which can sustain high data
transfer rates of the message bytes.
The Strobe signalling scheme is particularly useful where the network is
interfaced to asynchronous devices or devices running with a slower clock than the
network clock. Re-timed signals from these types of systems will naturally lead to
gaps in data flow in a synchronous system.
The Hold signal provides the mechanism for a receiving node to signal to the
transmitting node that message transmission must be temporarily suspended due to a
blockage. A channel blockage can occur for one of two reasons; the processor at the
next node is already using the requested inbound channel for the injection of its own
message into the network, or the routing engine in the next node is blocked.
A blockage caused by an occupied inbound channel in the next node will only be
encountered during the transmission of the first byte of a message. During the body of
a message, the injection channel is locked out and therefore cannot cause a blockage.
A routing engine blockage is caused either by the attached processor signalling a busy
condition, or a blockage on an outbound channel from the node.
An example of the typical signal activity of a routing channel is illustrated in
Figure 40.
[Timing diagram of a complete message: SYNC, STRB, DATA (flit 0 to flit 5) and HOLD,
annotated with a gap in the message data, hold asserted by the receptor, and the last
flit]
Figure 40 - Message Transfer
The diagram shows both a gap during transmission of the message and a
blockage occurring in the message reception channel.
5.4.5 Processor Interface
The Texas Instruments TMS320C40 Digital Signal Processor is used in the
prototype router. This device has six, byte-wide, bi-directional, asynchronous
communications ports, each of which has a dedicated DMA controller. The maximum
data transfer rate of these ports when connected directly to the port of another C40
processor is 20 Mbytes/sec. This speed is degraded if bi-directional communications
occur due to the requirement to pass a link ownership token between processors for
master/slave arbitration and the sharing of the link for the bi-directional traffic.
Separate communication channels are used for injection and reception in the
verification network node design to simplify the processor interface. Both channels
are used in uni-directional mode which eliminates the requirement to pass an
ownership token.
5.5 Message Routing
In general, a multicomputer message routing network consists of a number of
routing nodes, each with an attached processor, connected together in a suitable
topology. Internally, each routing node has a number of data paths and a routing
engine which directs inbound messages to the appropriate outbound channels.
Inbound and outbound channels connect to other nodes within the network and to the
attached processor.
5.5.1 Data Driven Wormhole Routing
In the verification network, the routing of messages through the network is data
driven. Each message injected into the network has a header byte appended to the
beginning of the message by the sending process. This byte contains the destination
address of the message and a message type, defining how the message should be
routed. The routing engine within each node checks the header byte of a message as it
arrives and routes the message according to this information. When a message arrives
at an inbound channel, one or more outbound channels from the node are reserved by
the routing engine dependent on the requirements of the message. Channels reserved
by the routing engine are not released for use by other messages until the tail of the
message passes through the node. The header is discarded by the routing engine when
the message reaches the destination node, leaving the body of the message to go to
the processor.
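The data-driven decision made at each node can be illustrated with a short Python sketch. The header layout used here (a 5-bit destination address in the low bits and the message type in the remaining high bits) is purely hypothetical, as are the function names and the convention that node numbers increase in the easterly direction. Only point-to-point delivery in a linear array is modelled; the message type is decoded but not acted upon.

```python
ADDR_BITS = 5  # hypothetical field width, not taken from the design


def parse_header(header):
    """Split a header byte into (message type, destination address)."""
    return header >> ADDR_BITS, header & ((1 << ADDR_BITS) - 1)


def route_message(node_id, message):
    """Decide where a message arriving at node_id should go.

    Returns (output, bytes forwarded). The header byte is discarded only
    when the message has reached its destination node.
    """
    _msg_type, dest = parse_header(message[0])
    if dest == node_id:
        return "processor", message[1:]
    # Assumption: higher node numbers lie to the east.
    return ("east" if dest > node_id else "west"), message


msg = bytes([5, 0x10, 0x20])  # type 0, destination node 5, two body bytes
for node in (3, 4, 5):
    print(node, route_message(node, msg))
```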
The routing device utilises wormhole routing, a technique which reduces the
message transport latency associated with earlier store and forward techniques
(Section 2.4.1). In a store and forward routing network, the message is split into a
number of packets prior to injection into the network. The routing engine at each
node must receive and store the whole of a packet in a packet buffer before it is
forwarded to the next node. In wormhole routing, each node buffers one flit and will
transfer this flit to the next node without requiring the complete packet to be received
from the previous node. Figure 41 shows a comparison of message transport latency
times between the store and forward and the wormhole routing techniques for a 9
byte message sent from node 3 to node 5.
[Timing diagram: channel activity at node 3, node 4 and node 5 against time, with the
transport latency marked, for Store and Forward routing and for Wormhole routing]
Figure 41 - Store and Forward Routing Versus Wormhole Routing
For a message of length L traversing N nodes in a network which transfers each
flit in time T, the transport latencies for the two schemes, assuming no blockages, can
be expressed as
Store and Forward
    transport latency = N × L × T
Wormhole routing
    transport latency = (N + L) × T
It can be seen that the wormhole routing network has significantly lower
transport latency than the Store and Forward routing scheme.
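The two expressions can be compared numerically. The following is a minimal sketch only: the function names are hypothetical, and the values of N and T in the example are illustrative assumptions rather than measurements from the verification network.

```python
def store_and_forward_latency(n_nodes, length_flits, flit_time):
    """N x L x T: each node receives the whole message before forwarding it."""
    return n_nodes * length_flits * flit_time


def wormhole_latency(n_nodes, length_flits, flit_time):
    """(N + L) x T: flits are pipelined through the nodes."""
    return (n_nodes + length_flits) * flit_time


if __name__ == "__main__":
    # A 9-flit message as in Figure 41; N = 3 and T = 1 time unit are assumed values.
    n, length, t = 3, 9, 1
    print("store and forward:", store_and_forward_latency(n, length, t))
    print("wormhole:         ", wormhole_latency(n, length, t))
```

With these assumed values the store and forward latency grows with the product of N and L, whereas the wormhole latency grows only with their sum.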
Each node of a store and forward routing network must provide the memory for
buffering a packet, which requires more silicon at each node than that required by the
wormhole routing technique. In addition, the wormhole routing technique does not
impose the requirement to split the message into packets. This reduces the amount of
pre-processing performed on messages prior to injection by the sending process.
However, splitting a message into packets does ensure that the network routing
bandwidth is shared fairly between competing processors.
Under some circumstances, dependent on the length of the message and the
timing of the blockage, a store and forward network provides better blocking
characteristics than wormhole routing. More of the message is stored within the
network of a Store and Forward routing network. Consequently, routing resources of
the nodes closest to the source of the message may be released earlier during a
blockage situation. An example of this is shown in Figure 42.
[Timing diagram: channel activity at node 3, node 4 and node 5 against time, with a
blockage at node 5, for Store and Forward routing and for Wormhole routing]
Figure 42 - Blockage Characteristics
This clearly shows that the blockage on node 5 inhibits the release of node 4 and
node 3 in the case of Wormhole routing. For Store and Forward, the release of node 3
and node 4 is unaffected by the blockage, allowing other messages to use the
resources of these nodes earlier than is the case for wormhole routing. This effect is
most significant for short messages which travel a long distance in the network.
5.5.2 Eager Inertial Routing
In the verification network design, a message header flit arriving at a node is
unconditionally transmitted to the next node in the inertial direction on the next clock
cycle. The advantage of the eager inertial routing approach is that the clocking speed
of the routing engine, and therefore the network, is increased. The routing engine
decides on the outbound channel requirements for the message not on the arrival of
the message header byte, but on the clock cycle following its arrival. The effect of this
is to reduce the depth of the combinatorial logic within the routing engine, which
enables the system to be operated at a higher speed.
To select the appropriate outbound channels from the node, the routing engine
within each node must compare the address of the incoming message with the address
of the node.
[Block diagram: flit buffer feeding the outbound channel multiplexer]
Figure 43 - Combinatorial Address Comparison
In the first case, illustrated in Figure 43, the outbound channel selection is
determined on the arrival of the header byte through a purely combinatorial path. The
system timing for this case is
$T_{ck} \geq T_{pd} + T_{ac} + T_{os} + T_{mx} + T_{m}$
where
$T_{ck}$ = system clock period
$T_{pd}$ = clock to data propagation delay of the flit buffer
$T_{ac}$ = address comparison logic delay
$T_{os}$ = outbound channel selection logic delay
$T_{mx}$ = propagation delay of outbound multiplexer
$T_{m}$ = data set-up time to clock of the adjacent node
With the addition of a register between the address comparison and the
outbound channel selection logic, as illustrated in Figure 44, the system clock
frequency is increased.
[Block diagram: flit buffer feeding the outbound channel multiplexer]
Figure 44 - Registered Address Comparison
The system timing for this case is determined by two timing formulae
$T_{ck} \geq T_{pd} + T_{ac} + T_{su}$
$T_{ck} \geq T_{pd} + T_{os} + T_{mx} + T_{su}$
where
$T_{su}$ = data set-up time for an internal register
Other parameters are as defined above for the combinatorial case.
These two timing requirements for the registered approach must both be
satisfied, but both set a minimum clock period which is less than that required for the
combinatorial approach.
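The comparison between the two cases can be sketched numerically. The delay values below are purely illustrative assumptions (they are not figures for any particular FPGA), the function names are hypothetical, and the final term of the second registered constraint is taken here to be the register set-up time.

```python
def combinatorial_min_period(t_pd, t_ac, t_os, t_mx, t_m):
    """Minimum clock period when the whole path is combinatorial."""
    return t_pd + t_ac + t_os + t_mx + t_m


def registered_min_period(t_pd, t_ac, t_os, t_mx, t_su):
    """Minimum clock period with a register after the address comparison.

    Both constraints must hold, so the longer of the two paths sets the period.
    """
    compare_path = t_pd + t_ac + t_su
    select_path = t_pd + t_os + t_mx + t_su
    return max(compare_path, select_path)


if __name__ == "__main__":
    # Assumed delays in nanoseconds, chosen only to illustrate the comparison.
    delays = dict(t_pd=3, t_ac=6, t_os=5, t_mx=4)
    print("combinatorial:", combinatorial_min_period(t_m=2, **delays), "ns")
    print("registered:   ", registered_min_period(t_su=2, **delays), "ns")
```

Under these assumed delays the registered design tolerates a shorter clock period because neither of its two paths contains both the address comparison and the channel selection logic.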
A side-effect of the eager transmission of the header flit is that a dead header
flit is generated when a message has reached its destination; at the destination node of
a message, the header byte will have been eagerly transmitted before it is known that
the message has reached its destination. The body of a message is not sent beyond the
destination node; it is directed out of the network to the destination processor. Dead
header flits serve no purpose and will propagate to the edge of the network. Eager
transmission of the header flit is only incorrect once per message and therefore the
small amount of routing channel bandwidth which is wasted propagating the dead
header flits is an acceptable penalty for the increase in overall system speed. If a dead
header flit encounters a blockage on its path out of the network, it will be removed by
the blocking node.
5.5.3 Message Types
Four message types are implemented in the prototype message router: Point-to-Point,
Multicast, Local Encapsulation and Global Encapsulation. The message type
is sent as part of the information within the header byte of each message. The routing
engine uses this and the address field, also in the header, to determine which outbound
channels the message requires.
5.5.3.1 Point-to-Point Messages
The Point-to-Point message is the simplest of the message types, carrying a
message from the initiating node to one other node. The message header contains the
address of the destination node. The message always takes the shortest route to the
destination node.
5.5.3.2 Multicast Messages
A message can be sent to two or more nodes using the Multicast message type.
A single message sent by the initiating node will be received by the initiating node and
every node up to and including the destination node as defined by the address in the
message header flit. The provision of a multicast message routing mechanism reduces
the routing bandwidth required when it is necessary to send the same message to
more than one node. A single message sent by the originating node is distributed to
multiple nodes by the multicast routing mechanism implemented in the hardware of
the routing nodes. There is a requirement with this mechanism that all nodes are
simultaneously able to receive a flit if stalls are to be avoided. However, blockages
should only occur whilst the head of the multicast message has not been received by
all relevant nodes. Once all of the destination nodes have received the head of the
message and provided sufficient local buffer space for the message, there should be no
further blockages to reception of the complete message.
In two and three-dimensional mesh topologies, the extent of a multicast
message is an area and a volume respectively. Similar hardware multicast mechanisms
can be devised for these topologies; the MP1 device implements a two-dimensional 
multicast routing mechanism [Jesshope93].
5.5.3.3 Local Encapsulation
Local Encapsulation provides a means of carrying a message to an intermediate
node within the network for re-injection into the network.
When a Local Encapsulation header reaches its destination, the header byte is
stripped off by the routing engine to reveal a new message header. This message is
effectively re-injected into the network by the routing engine as if it had started from
the intermediate node. The only restriction on this message type is that the direction
of the encapsulated message must be the same as that of the message which carried it. The
processor at the intermediate node plays no part in the re-injection of the encapsulated
message; this process is dealt with entirely by the routing engine.
The encapsulation feature is of most use in a linear array for Multicast messages
where the extent of the multicast does not include the originating node. Figure 45
illustrates this, showing a remote multicast to nodes 5, 6, 7, and 8. A multicast
message injected by the processor at node 1 is carried to node 5 by a local
encapsulation header. The local encapsulation header is stripped off by the routing
engine at node 5 to reveal a multicast message with a destination address of node 8.
The resulting extent of the multicast message is from node 5 to node 8.
Figure 45 - Encapsulated Multicast
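The behaviour of the point-to-point, multicast and local encapsulation headers in a linear array can be summarised in a short behavioural model. This is a sketch under stated assumptions, not the routing engine logic: the type names, the `receivers` function and the representation of a message as a list of (type, destination) pairs are all hypothetical.

```python
P2P = "point-to-point"
MULTICAST = "multicast"
LOCAL_ENCAP = "local-encapsulation"


def receivers(source, headers):
    """Return the nodes whose processors receive the message body.

    `headers` is a list of (message type, destination address) pairs in the
    order in which they are stripped from the front of the message.
    """
    node, direction = source, 0
    for msg_type, dest in headers:
        step = (dest > node) - (dest < node)  # +1, -1, or 0 if already there
        if direction and step and step != direction:
            raise ValueError("encapsulated message must keep the carrier's direction")
        direction = direction or step
        if msg_type == LOCAL_ENCAP:
            node = dest  # header stripped here; the next header takes over
        elif msg_type == P2P:
            return [dest]
        elif msg_type == MULTICAST:
            low, high = sorted((node, dest))
            return list(range(low, high + 1))
        else:
            raise ValueError("unknown message type: %r" % (msg_type,))
    raise ValueError("message has no terminating header")


if __name__ == "__main__":
    # The remote multicast of Figure 45: node 1 sends to nodes 5 to 8.
    print(receivers(1, [(LOCAL_ENCAP, 5), (MULTICAST, 8)]))
```

In this model a plain multicast from node 1 to node 4 is received by nodes 1 to 4, whereas wrapping a multicast in a local encapsulation header moves the start of its extent to the intermediate node.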
Although Local Encapsulation of messages is not restricted to a single level, it is
not clear that multiple levels are useful in a single dimension network. In a two-dimensional
grid network, encapsulation can be used to carry a message to its
destination and in the process specifically avoid certain nodes. This technique is useful
for routing around known hotspots or avoiding faulty nodes and could lead to the use
of multiple levels of local encapsulation. Care must be taken, as it is for all routing
mechanisms, to ensure that the possibility of deadlock in the network is not
introduced. In the Mad Postman network, which includes a mechanism for
encapsulated multicast messages, deadlock in the network is prevented with the use of
separate routing planes, one for each of the four possible two-dimensional routing
directions (NE, SE, SW and NW).
5.5.3.4 Global Encapsulation
Global Encapsulation provides a means of carrying a message out of the local
network. The distinction between the Local and Global Encapsulation message types
is that the Local Encapsulation header carries a message to a node which is
addressable within the immediate, or local, network. Global Encapsulation headers
are used for hierarchical networks and carry messages out of the local network to a
different level of the hierarchy. The body of the message carried by a Global
Encapsulation header is not received by any processor on its journey to the edge of
the local network.
The Global Encapsulation message type is provided in the design of the message
router to support expansion of the network. An example of the use of this feature is in
the formation of a hierarchical 2D mesh network as shown in Figure 46.
[Diagram: row sub-networks joined by a column sub-network; the legend marks the
sending node and the receiving nodes]
Figure 46 - Hierarchical Network Expansion
This example network has 10 rows of linear sub-networks, connected through a
single column sub-network at one end (shown on the right hand side of the diagram)
which utilises a router based on the design described in this thesis.
Local and Global Encapsulation techniques are used to route messages through
the hierarchy. A message destined for a node on another row will have three header
fields. The first is a global encapsulation message header byte to get the message into
the column router. The second field will contain the routing information for the
column router. This may consist of two header bytes if a multicast is being performed
which does not include the initiating row. The third field is the header information for
routing the message within the destination row. This field may also contain two
header bytes if a multicast is being performed which does not include the node directly
connected to the column router.
Two examples of message routing are shown in Figure 46. A point-to-point
message from node 5,9 to node 7,8 which has the following message structure

    Header field #1    Global Encapsulation
    Header field #2    Column point-to-point, dest 8
    Header field #3    Row point-to-point, dest 7
    Body of Message

and an encapsulated multicast from node 3,0 to a region spanning node 4,1 to node 7,5
with the following message structure

    Header field #1    Global Encapsulation
    Header field #2    Column Local Encapsulation, dest 1
                       Column Multicast, dest 5
    Header field #3    Row Local Encapsulation, dest 7
                       Row Multicast, dest 4
    Body of Message
In addition to Global Encapsulation messages, the column routers will receive 
dead header flits from the row sub-networks which are ignored.
Consideration must be given with this scheme to ensure that there is no
possibility of deadlock for the case when the extent of a multi-row multicast includes
the row containing the initiating node. The simple diagram (Figure 46) of message
routing does not show the actual message flow through the routing channels,
therefore a more detailed diagram is required to illustrate the channel usage in a
multi-row multicast.
[Diagram: routing channels used in the row sub-networks and the column sub-network;
the legend marks the sending node and the receiving nodes]
Figure 47 - Global Encapsulation Deadlock Freedom
Figure 47 shows details of the channels used for routing a multi-row multicast
message with an extent which includes the source node. The source node receives the
multicast message on its return journey from the column router, not as a standard
multicast message.
Deadlock freedom is guaranteed in this situation because the routing of the
message from the source node to the column router uses a different set of routing
channels to the routing of the message from the column router to the destination
nodes. There is no cycle in the channel dependency graph and therefore no possibility
of deadlock.
5.5.4 Node Addressing
The maximum address capacity is set by the number of unused bits in the header
byte. With two bits of the header byte assigned for message type, six free bits provide
addressing capacity for 64 nodes.
The verification network implements 4-bit addresses, providing a maximum
network size of 16 nodes. Nodes are numbered in order along the network, from 0 at
one end to 15 at the other. The use of all six available address bits is not necessary to
prove the principle and would lead to additional levels of logic to decode the address
due to the limited input width of the logic blocks.
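The division of the header byte into a two-bit message type and a six-bit address can be illustrated as follows. This is a sketch only: the position of the type field within the byte and the numeric code assigned to each message type are assumptions made for the example, and the function names are hypothetical.

```python
ADDRESS_BITS = 6                       # six address bits give 64 node addresses
ADDRESS_MASK = (1 << ADDRESS_BITS) - 1

# Hypothetical two-bit type codes; the actual encoding is not specified here.
MSG_TYPES = {
    "point_to_point": 0,
    "multicast": 1,
    "local_encapsulation": 2,
    "global_encapsulation": 3,
}


def pack_header(msg_type, address):
    """Build a header byte, assuming the type occupies the top two bits."""
    if not 0 <= address <= ADDRESS_MASK:
        raise ValueError("address does not fit in the address field")
    return (MSG_TYPES[msg_type] << ADDRESS_BITS) | address


def unpack_header(header):
    """Split a header byte back into (message type name, address)."""
    code = (header >> ADDRESS_BITS) & 0b11
    name = next(n for n, c in MSG_TYPES.items() if c == code)
    return name, header & ADDRESS_MASK


if __name__ == "__main__":
    byte = pack_header("multicast", 8)
    print(format(byte, "08b"), unpack_header(byte))
```

A 4-bit address, as used in the verification network, would simply leave the upper two address bits of this assumed layout unused.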
Message routing decisions at each node compare the destination address held in
the header byte with the fixed address of the node. The body of a message is
transported no further than the destination node. The body will be transferred to the
processors at intermediate nodes in the case of a multicast message, or just to the
destination node for a point-to-point message. The header byte will carry the message
body to its destination and will then be transported to the end of the network as a
dead header flit as described in Section 5.5.2, unless it encounters a blockage, in
which case it is discarded.
5.5.5 Adaption
A mechanism for adaption is included in the router design. At any intermediate
node in the verification network on the route taken by a message, the path of the
message can be switched to an alternative free channel if the inertial channel is
blocked. This adaption mechanism improves the blocking characteristics of the
network by allowing blocked messages to take advantage of unused channels. The
adaption mechanism takes one additional clock period for the case when the adaption
channel is immediately free.
5.6 Deadlock and Livelock
Freedom from deadlock and livelock is essential to the correct operation of
any message routing network. Livelock can occur when adaption mechanisms are
implemented for routing messages through a network. If the adaption of a message
causes it to move further away from its destination, then it is possible that the
message will never arrive at its destination. In the verification network design,
message adaption occurs such that a message is always nearer to its destination
following an adaption, therefore a livelock condition cannot occur. The situation
where messages are permitted to take a path to their destination which is not minimal
can also lead to deadlock since cycles can be introduced by this mechanism.
The subject of deadlock has been discussed in Section 2.4.2. The strategy taken
to provide deadlock freedom in the verification design is to implement separate
routing channels for the Easterly messages and for the Westerly messages. It has been
shown [Yantchev89] that an acyclic network is deadlock free, therefore the
verification network is deadlock free by design. There were, however, a number of
potential deadlock mechanisms identified and eliminated during the design process.
All relate to multicast messages and the use of a single processor reception channel.
• Multicast Collision Deadlock can occur when multicast messages from
  opposite directions in the network arrive simultaneously at a destination
  node.
• Multicast Contention Deadlock can occur as two multicast messages
  traverse the network in the same direction at the same time if care is not
  taken in the allocation of routing channels.
• Multicast Adaption Deadlock can occur when two multicast messages are
  blocked at a node simultaneously and the adaption mechanism is
  operational.
These sources of deadlock have been addressed to ensure that they do not occur
in the verification network.
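The acyclic-network argument for separate Easterly and Westerly routing channels can be checked mechanically on a small model. The sketch below is illustrative only: it assumes one Easterly and one Westerly channel between each pair of adjacent nodes in a linear array, ignores the shared reception channel that gives rise to the multicast cases listed above, and uses hypothetical function names.

```python
def linear_array_dependencies(n_nodes):
    """Channel dependency graph: channel -> channels it may wait for."""
    deps = {}
    for i in range(n_nodes - 1):
        east = ("E", i, i + 1)
        west = ("W", i + 1, i)
        deps.setdefault(east, [])
        deps.setdefault(west, [])
        if i + 2 < n_nodes:
            # A message holding E(i, i+1) may wait for E(i+1, i+2).
            deps[east].append(("E", i + 1, i + 2))
        if i > 0:
            # A message holding W(i+1, i) may wait for W(i, i-1).
            deps[west].append(("W", i, i - 1))
    return deps


def has_cycle(deps):
    """Depth-first search for a cycle in the dependency graph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {channel: WHITE for channel in deps}

    def visit(channel):
        colour[channel] = GREY
        for nxt in deps[channel]:
            if colour[nxt] == GREY:
                return True
            if colour[nxt] == WHITE and visit(nxt):
                return True
        colour[channel] = BLACK
        return False

    return any(colour[c] == WHITE and visit(c) for c in deps)


if __name__ == "__main__":
    print("16-node array has a cycle:", has_cycle(linear_array_dependencies(16)))
```

If messages were allowed to turn around at both ends of the array, the Easterly and Westerly chains would join into a ring and the same check would report a cycle.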
5.6.1 Multicast Collision Deadlock
Multicast collision deadlock is possible when multicast messages arrive at
adjacent nodes from opposite directions, as illustrated in Figure 48. The deadlock is
caused by the sharing of the processor reception channel between the Easterly and
Westerly routing channels.
[Diagram: Node N and Node N+1, each with a reception channel shared between the
Easterly channel and the Westerly channel]
Figure 48 - Multicast Collision Deadlock
The diagram shows two adjacent nodes and their respective reception channels,
including the multiplexer used to share the reception channel between the Easterly and
Westerly routing channels. Two multicast messages arrive at the pair of network
nodes at the same time from opposite directions. It can be seen in this instance that
the two nodes have allocated their reception channels to different messages which are
travelling in opposite directions. The Easterly multicast message has the reception
channel for node N, the Westerly multicast message has the reception channel for
node N+1. Both messages are blocked waiting for the reception channel of the
following node. The allocation of a processor resource channel is dependent on
reception of the body of the message. Therefore, the eager routing of the header flit
will not cause further deadlocks in the network.
This problem is introduced by sharing of the processor reception channel, which
is necessary due to internal resource constraints of the FPGA devices available for this
research. Deadlock is avoided in the network verification procedures by restricting
multicast messages to one direction only. In future implementations of the routing
node, additional logic and I/O resources will be necessary to provide a processor
reception channel for each direction of routing to overcome this problem. Larger
devices are now available in the Xilinx XC4000 family which can provide additional
logic and I/O resources to meet these requirements. The reception processes within
the attached processor must be independent and not able to block each other for
deadlock freedom to be guaranteed.
An alternative solution is to provide a buffer to hold a received message from
each direction. This solution would only be possible if the maximum message size is
limited but this may be considered to be too restrictive. In addition, a large area of
memory would be required in the device which is generally not possible in FPGA
devices.
5.6.2 Multicast Contention Deadlock
Blockages on multicast messages travelling in the same direction require careful
consideration to avoid introducing the possibility of multicast contention deadlock.
The potential problem is created as a direct result of the eager inertial routing of the
header byte. For all message types, it is possible for the header byte to advance
through the network whilst the body of the message is blocked. Figure 49 below
shows a sequence which demonstrates how deadlock can occur if the header of a
multicast message is permitted to reserve the reception channel as it progresses
through the network. The injection channels for each node are omitted for clarity.
[Diagram: Node n and Node n+1 at Time T and at Time T+ΔT]
Figure 49 - Multicast Contention deadlock
Note that the two multicast messages arrive at subsequent nodes along the
network simultaneously from the same direction and that initially the reception
channel multiplexers for Node n and Node n+1 are in phase, both connected to the
upper network channel. At time T, the multicast messages on the two network
channels arrive simultaneously at Node n, and the reception channel is granted to the
upper channel. The body of the multicast message arriving at the lower channel is
blocked; however, the header of the message continues to the next node on the next
cycle due to the eager routing of header flits.
At time T+ΔT, Node n+1 receives the multicast headers on both channels
simultaneously. The reception multiplexer now favours the lower channel, and
therefore the lower channel is granted the reception channel, blocking the body of the
message on the upper channel. A deadlock situation now exists with the upper
channel blocked at Node n+1, and the lower channel blocked at Node n.
To avoid this deadlock situation, it is necessary to ensure that the reception
channel is not granted to a multicast until the first byte after the header of the message
has been received. The result of this is that the router of a multicast message which is
blocked in a node because it has not been granted the reception channel will not allow
the first byte after the header through to the following node. This mechanism has been
implemented in the verification design and guarantees that a node will only grant its
reception channel to a multicast message which has already been granted the reception
channel in the preceding node.
5.6.3 Multicast Adaption Deadlock
Multicast adaption deadlock is possible if care is not taken in the adaption mechanism
when the heads of two multicast messages arriving at a node simultaneously are
blocked by the reception channel at the node. If the adaption mechanism is permitted
to operate in this situation, it is possible for the granting of the reception channel to be
locked in anti-phase to the granting of the outbound routing channels to each of the
routers, leading to a deadlock condition. Figure 50 shows part of the internal
architecture of the two-channel router element.
[Diagram: two outbound network channels and the reception channel]
Figure 50 - Outbound Channel Architecture
When two multicast headers arrive simultaneously at the node where the
reception channel is blocked, the routing engines will both request all of the outbound
channels. This is because the routing engine is only notified that a blockage exists, not
the source of the blockage; to modify the channel request depending on the channel
block signals would require extra complexity in the logic, reducing the operational
speed of the network. The blockage may be caused by the inertial channel, therefore
the adaption channel as well as the inertial and reception channels are requested when
a blockage is detected.
For the multicast message to proceed, both a network outbound channel and the
reception channel must be granted. In the situation where the reception channel is
blocked, channels are granted to the routers but not used. The reception channel
multiplexer and the two network outbound channel multiplexers will switch between
selecting the upper and lower channels on alternate clock cycles. It is possible that the
phase of these switching cycles will prevent both channels from ever being granted the
reception and an outbound channel simultaneously.
Figure 51 illustrates this situation and shows the two states that the outbound
channel connections oscillate between in the deadlock situation.
Figure 51 - Multicast Adaption Deadlock
To avoid this situation, it is simply necessary to prevent the adaption mechanism 
operating when both channels are requesting an outbound channel.
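The anti-phase condition can be reproduced with a very small model. This is a toy sketch, not the router logic: it assumes that each multiplexer alternates strictly between the upper router (0) and the lower router (1) on successive clock cycles, that a router can proceed only in a cycle in which it holds the reception channel and at least one outbound channel, and that the starting phases are free parameters. The function name is hypothetical.

```python
def ever_proceeds(reception_phase, outbound_phases, cycles=100):
    """Return True if some router is granted the reception channel and an
    outbound channel in the same clock cycle within `cycles` cycles."""
    for t in range(cycles):
        reception_owner = (t + reception_phase) % 2
        outbound_owners = {(t + phase) % 2 for phase in outbound_phases}
        if reception_owner in outbound_owners:
            return True
    return False


if __name__ == "__main__":
    # Both outbound multiplexers in anti-phase with the reception multiplexer.
    print("anti-phase:", ever_proceeds(0, (1, 1)))
    # One outbound multiplexer in phase with the reception multiplexer.
    print("in phase:  ", ever_proceeds(0, (0, 1)))
```

In the anti-phase case the grants alternate indefinitely without ever coinciding, which corresponds to the oscillation between two states described for Figure 51.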
5.7 Software Support
The software to support the verification network node design is provided by a
commercial product, Parallel C [3L95]. Parallel C is based on the Communicating
Sequential Processes (CSP) paradigm of parallel processing [Hoare78]. In this
model, an application consists of a number of concurrently active processes which
communicate with each other via uni-directional channels. The processes 
communicate only over channels, and a channel connects one process to one other 
process.
Standard ANSI C is used to code the processes in the Parallel C application,
with inter-process communication supported by library functions which implement the
channels.
Channels either connect processes running on the same processor using a
software implementation of a communication channel, or across processor boundaries
via a hardware communication channel.
5.7.1 Application Configuration
The Parallel C development system provides a mechanism for arranging the
processes of an application onto different configurations of processors. The choice of
software or hardware inter-process communication channel is made at application link
time to match the architecture of the target hardware and the selected distribution of
processes across the available processors. This is achieved with a pre-execution
configurer utility which is controlled with a user generated textual configuration file.
The configuration file defines the physical links between processors, the channel
connections between processes and the allocation of processes to processors.
The same application can be run on different processor configurations using 
unmodified source code, by simply running a different configuration file through the 
configurer utility.
5.7.2 Processor Farms
The processor farm paradigm, described by May and Shepherd [May87], Hey
[Hey88] and Pritchard, et al [Pritchard87], is also supported by the Parallel C toolset.
A processor farm in the Parallel C environment consists of a single processing element
running a master process and a number of other processing elements, each of which
runs an identical instance of a worker process. The master process breaks down the
application into a number of small independent work packets which are automatically
distributed over an arbitrary network of worker processors by special routing
software.
Many applications are suited to the processor farm technique, including ray
tracing [Lomax91], real-time control [Irwin92] and video coding [Sava96]. The
processor farm is of particular interest to SMIS as the software for the signal
processing in the current medical imaging system has a structure very similar to this
approach. The signal processing is split into a number of small tasks, which are
distributed to the available DSPs in the system.
The main advantages of the technique are that applications can be run faster
with the addition of more processors given certain conditions, that the processing load
is automatically balanced across the available processors and that processors can be
added without recompilation or reconfiguration. Obtaining a linear speed-up depends
on having a sufficient supply of work packets to keep all of the workers busy, and on
hiding the cost of communicating the task and results messages through the processor
network with, for example, the use of task message buffers in the worker processors.
One particular topology commonly used for implementing a processor farm,
where workers are connected to the system controller in a chain, has a similar
structure to that of the verification network topology (Figure 52).
Figure 52 - Processor Farm
The existing verification network design could be used in such an application to
speed up the transfer of tasks and results between controller and worker processes by
replacing the task router and result router processes depicted in Figure 52 with
hardware message routing mechanisms. The system would not strictly replicate the
operation of the processor farm of Figure 52 because of the use of addressing in the
messages. It would also have a limited extensibility due to the limited address field in
the message routing header. The implementation of a variant of the verification router
node design, which replicates the action of the task router and result router processes
at each node, would eliminate these issues. The modified router node would have a
behaviour such that a task message arriving from the system controller would be
directed to the attached processor only if a result message had been sent by the
attached processor since the last task message arrival. To prime the system from
start-up, each worker could send a dummy result message to the system controller. All
router nodes would then be set in a condition to direct a task message arriving from
the system controller to the attached processor. The action of the system controller
process would remain unchanged; it would simply send a task from the task pool each
time a result message was received from a worker. The task message would be
directed to the first worker in the chain which was ready to receive a task. The system
would be self-sustaining, with the result message from each processor triggering both
the system controller to send another task, and the router to direct the next task
message to the processor. Figure 53 shows a schematic representation of the
progression of task messages in the network following start-up and in particular, the
action of the task message routing mechanism.
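The behaviour proposed for the modified router node can be expressed as a short behavioural model. This is a sketch under stated assumptions rather than a hardware description: the class and function names are hypothetical, and a node's ready state is assumed to be cleared only when a task is actually delivered to its processor.

```python
class FarmRouterNode:
    """One router node in the chain, with its task channel switch."""

    def __init__(self):
        self.ready = False  # set once the attached processor has sent a result

    def on_result_from_processor(self):
        """A result (or dummy start-up result) passes towards the controller."""
        self.ready = True

    def on_task_from_controller(self):
        """Return True if the task is delivered here, False if it is passed on."""
        if self.ready:
            self.ready = False
            return True
        return False


def route_task(chain):
    """Send one task down the chain; return the index of the worker that
    receives it, or None if no worker along the chain is ready."""
    for index, node in enumerate(chain):
        if node.on_task_from_controller():
            return index
    return None


if __name__ == "__main__":
    chain = [FarmRouterNode() for _ in range(3)]
    for node in chain:
        node.on_result_from_processor()  # dummy results prime the system
    print([route_task(chain) for _ in range(3)])
    chain[1].on_result_from_processor()  # worker 1 finishes its task
    print(route_task(chain))
```

In this model the controller needs no knowledge of worker addresses: each task is taken by the first ready worker it meets, and the result from that worker re-arms its router node.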
[Diagram: progression of task messages through the routing nodes against time]
Figure 53 - Processor Farm Network
Messages containing results from the worker processes are routed in the
network to the controller process using a point-to-point routing technique; no
adaption, encapsulation or multicast mechanisms are required. The routing of task
messages from the controller process to the worker processes is represented
schematically in Figure 53 by switches within each routing node. These are set
dependent on the state of the message router used for routing results messages.
This hardware approach to the routing of messages eliminates the
communication overhead in each worker associated with the through-routing of work
messages and result messages. In addition, the lack of address information in either
the task messages or the result messages, and the simplified mechanisms required to
route messages, would enable the implementation of the router node in an FPGA to
operate at a higher system speed than the verification router design.
Taking the technique one step further, the progress of messages between
controller and worker processes could be improved with the use of multiple routing
channels in the network. To orchestrate this, it would be necessary to pair result and
task routing channels in the routing node. The control of the task channel switch
would be the same as in the single channel router above; whenever the
attached processor sends a result message into a plane, the corresponding task
channel is switched to the processor. Effectively, the node design described above is
replicated to produce a multi-plane network. Additional logic at each node allows the
worker processor to select the first free channel from those available in the network.
Figure 54 shows a four-plane farm router schematic.
[Schematic: four-plane network routing node]
Figure 54 - Four-Plane Farm Router Node
The controller task attached to a multi-plane farm network must ensure that
when it receives a result message, a new task is sent down the paired task channel. If
the controller process were simply to send a new task down an arbitrary channel, this
could result in the new task being sent down a channel on which there are no free
processors. In this situation, the task message would fall off the end of the network.
To maintain the routing channel pairing, a simple partitioning of the controller process
can be made such that each channel pair has a link process with its own task buffer
(Figure 55).
[Process diagram: a task pool controller feeds link controllers for links 1 to n;
results pass to a results co-ordinator]
Figure 55 - Multi-Plane Farm Controller Processes
The task buffers for all links are fed by a task pool controller process which
ensures that tasks are distributed to links as required. With the use of DMA channels
for message transportation, the link controller process is a simple DMA initiator
triggered by the arrival of a result message. The link controller processes and task
pool controller process can be implemented in separate processors if necessary to
ensure that task distribution and result collection do not become a bottleneck in the
system.
The above scheme is able to supply new task messages to more than one
processor at a time, and more than one result message at a time is routed to the
controller processor. This reduces the average latency seen by each worker process
between sending a result and receiving the next task.
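
The pairing of result and task channels in the controller can be illustrated with a small software model. The C sketch below is an assumption-laden illustration rather than the proposed implementation: the names LinkProcess, link_feed and link_on_result are hypothetical, the buffer depth is arbitrary, and task identifiers are assumed to be non-negative so that -1 can signal an empty buffer.

```c
#include <stddef.h>

#define PLANES   4   /* four-plane network, as in Figure 54 */
#define LINK_BUF 8   /* hypothetical per-link task buffer depth */

/* One link process: a small FIFO of task identifiers for one plane. */
typedef struct {
    int    tasks[LINK_BUF];
    size_t head;
    size_t count;
} LinkProcess;

/* Task pool controller side: queue a task on a link.
   Returns 0 on success, -1 if the link's buffer is full. */
int link_feed(LinkProcess *l, int task)
{
    if (l->count == LINK_BUF)
        return -1;
    l->tasks[(l->head + l->count) % LINK_BUF] = task;
    l->count++;
    return 0;
}

/* A result arrived on plane p: reply with the next task on the SAME
   plane, never an arbitrary one. Returns the task identifier, or -1 if
   this link's buffer is empty. */
int link_on_result(LinkProcess links[PLANES], int p)
{
    LinkProcess *l = &links[p];
    if (l->count == 0)
        return -1;
    int task = l->tasks[l->head];
    l->head = (l->head + 1) % LINK_BUF;
    l->count--;
    return task;
}
```

Because a task is only ever issued on the plane whose result triggered it, the task is guaranteed to find a node on that plane whose switch has just been set.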
The time available for this research has not permitted the author to investigate
these proposals. However, as noted earlier, the SMIS signal processing software has a
structure suitable for execution on a processor farm architecture. Provision of
hardware support for the communications between farmer and workers will be
investigated as part of a future programme of research (Chapter 7).
5.7.3 Micro-Kernel
The Parallel C environment implements a micro-kernel on each processor in the
system. The micro-kernel is responsible for scheduling all of the executing threads on
the processor, including the interrupt handling, and provides control for the
inter-processor channels, semaphores and the timers.
Many tasks can be run on the same processor, with each task consisting of one
or more concurrently executing threads. To ensure that all threads have use of the
processor, a scheduling scheme is implemented in the micro-kernel. Each thread is
assigned a priority which is used by the scheduler to determine which thread to run.
The highest priority thread, termed an urgent thread, is only stopped if it does
something to cause it to stall, such as sending a message or waiting on a semaphore.
Lower priority threads will be stopped, or descheduled, for the same reasons as
urgent threads. In addition, they will also be descheduled if a higher priority thread is
ready to execute (pre-emption) or if the thread has been running for longer than a
preset period of time (time-slicing). Prioritisation and descheduling ensure that all
threads have access to the processor resource.
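
The descheduling rules described above can be summarised as a single decision function. The following C sketch is a simplified model and does not reproduce the Parallel C micro-kernel; the types, the function sched_decide and the convention that a larger number means a higher priority are all assumptions made for illustration.

```c
#include <stdbool.h>

typedef enum { RUN_ON, STALLED, PREEMPTED, TIME_SLICED } SchedDecision;

typedef struct {
    bool     urgent;     /* highest-priority ("urgent") thread */
    int      priority;   /* larger value = higher priority (assumption) */
    bool     stalled;    /* waiting on a message or a semaphore */
    unsigned ran_ticks;  /* time run since the thread was last scheduled */
} Thread;

/* Decide what happens to the running thread t.
   best_ready_priority: priority of the best thread ready to run,
                        or a value below every priority if none is ready.
   slice_ticks:         the preset time-slice period. */
SchedDecision sched_decide(const Thread *t, int best_ready_priority,
                           unsigned slice_ticks)
{
    if (t->stalled)
        return STALLED;        /* applies to urgent and non-urgent alike */
    if (t->urgent)
        return RUN_ON;         /* urgent threads stop only when they stall */
    if (best_ready_priority > t->priority)
        return PREEMPTED;      /* pre-emption by a higher priority thread */
    if (t->ran_ticks > slice_ticks)
        return TIME_SLICED;    /* ran for longer than the preset period */
    return RUN_ON;
}
```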
5.7.4 Communication using DMA
The communication primitives supplied with Parallel C for the communication
links use programmed transfers. The verification node design connects the processor
communication channel directly to the network. It is therefore essential that the
transfer of data between processor and network node is fast. Direct Memory Access
(DMA) transfers provide the highest transfer rates.
The author has implemented DMA-based communication primitives for use in
the network verification activity described in Section 6.4. The DMA processor has an
auto-initialisation facility in which completion of a DMA transfer can trigger the
loading of a new DMA transfer. This is particularly useful for message reception
where the length of the message is sent as the first item of the message. The first
DMA reception is set up to receive the fixed size message length parameter and place
it in the length parameter of the descriptor for the auto-initialised second DMA
reception. In this way the variable length of the received message is automatically
managed by the DMA auto-initialisation facility and no processor intervention is
required.
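
The chained reception described above can be modelled in software. The C sketch below is a stand-in for the hardware and does not use the C40 DMA register layout; the DmaDesc structure, the limit field (a guard added for the model) and the assumption that the length is expressed in 32-bit words are all hypothetical.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical DMA descriptor. */
typedef struct DmaDesc {
    uint32_t       *dst;    /* destination of the transfer */
    uint32_t        count;  /* number of words to transfer */
    uint32_t        limit;  /* model-only guard: largest count allowed */
    struct DmaDesc *next;   /* descriptor auto-loaded on completion */
} DmaDesc;

/* Software stand-in for the DMA engine: runs a descriptor chain against
   a word stream and returns the number of words consumed. */
size_t dma_run(DmaDesc *d, const uint32_t *stream, size_t avail)
{
    size_t used = 0;
    while (d != NULL && d->count <= d->limit && d->count <= avail - used) {
        memcpy(d->dst, stream + used, d->count * sizeof(uint32_t));
        used += d->count;
        d = d->next;        /* auto-initialisation: load the next transfer */
    }
    return used;
}

/* Receive one length-prefixed message into buf (capacity cap words).
   Returns the body length in words, or -1 if the message did not fit
   or the stream was too short. */
long receive_message(const uint32_t *stream, size_t avail,
                     uint32_t *buf, uint32_t cap)
{
    DmaDesc body = { buf, 0, cap, NULL };
    /* The first transfer writes the count field of the second descriptor. */
    DmaDesc len  = { &body.count, 1, 1, &body };
    size_t used = dma_run(&len, stream, avail);
    return (used == (size_t)body.count + 1) ? (long)body.count : -1L;
}
```

The point of the sketch is that the processor sets up both descriptors once; the length received by the first transfer configures the second without any further software involvement.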
The Send and Receive primitives implemented by the author both utilise thread
descheduling functions to ensure that the processor is made available to other threads
whenever DMA transfer completion is being tested. Failing to do this would cause the
processor to waste time in a tight loop testing for DMA completion. A more
significant consequence of locking a processor into a DMA polling loop is that the
response of the processor to other messages would be significantly degraded, although
the time-slicing micro-kernel would still ensure that a process deadlock did not
occur.
SetUpDMA();
while( !DMAComplete(channel) )
    thread_deschedule();
This approach was adopted because the end-of-DMA interrupt facility
supported by the toolset did not function reliably. Polling to test for completion of the
DMA did not cause any problems in the verification application as the processors are
very lightly loaded, needing only to initialise the DMA transfer for each message sent
over the network and to check received messages.
Chapter 6
Prototype Network Implementation and Verification
6. Prototype Network Implementation and Verification
This chapter begins with a section outlining a number of general implementation
details for the prototype message routing device, followed by a section which
provides a detailed description of the network routing node design developed by the
author to implement a demonstration network. High level textual descriptions are
utilised to describe the majority of the network node design in combination with
schematic diagrams for key areas.
The operation of the network node in a complete network design was first
verified using simulation of the textual source code. Any schematic diagrams in the
design are replaced with equivalent textual descriptions for the purposes of this
exercise. The simulation techniques used in this process are briefly described.
Further verification of the network node design has been obtained using
hardware in the form of a demonstration network designed and developed by the
author. Implementation details of the demonstration network design, and the
verification process undertaken with the demonstration hardware, are provided in the
final section of this chapter.
6.1 Implementation Details
6.1.1 Design Goals
Prior to designing the network node for the demonstration network, a number
of design goals were set.
• The complete design of the network routing node, including the processor
interface, must fit within a single Xilinx FPGA device. This minimises the cost
of the network routing node and simplifies the circuit layout for the
demonstration network hardware.
• The processor interface is through a pair of Texas Instruments TMS320C40
(C40) asynchronous communication links, one for message injection, one for
message reception. The network node must interface directly with the C40
processor. This simplifies the circuit for the demonstration router hardware
and minimises the network cost per node.
• The operational frequency of the network routing node must be 20 MHz or
greater. This figure is derived from the specification of the C40 processor links
which have a maximum data transfer rate of 20 Mbytes/sec.
These design goals affect the implementation of the network routing node.
Estimates were made by the author, prior to implementation, to establish that devices
were available with both the speed and the capacity required by the network node
[Frost94].
6.1.2 The VHDL Approach to Digital Design
The router device is implemented using VHDL, an industry standard high level
textual language for describing hardware. VHDL was developed from a workshop in
1981 sponsored by the American Department of Defense as part of the Very High
Speed Integrated Circuits (VHSIC) programme. The VHSIC Hardware Description
Language (VHDL) was standardised by the IEEE in 1987; this version of the
language is referred to as IEEE Std 1076-1987 or VHDL’87. A further revision of
the language, IEEE Std 1076-1993 or VHDL’93, was made to eliminate errors and
ambiguities in the original Language Reference Manual (LRM), although most tools
do not support the full VHDL’93 standard.
The VHDL language has constructs to express both the sequential and the
concurrent behaviours found in digital systems, at many levels of abstraction from
gate-level to the algorithmic level. A system can be modelled as a set of
interconnected components in a hierarchical structure and the timing properties of the
system can be described.
6.1.2.1 VHDL Design Methodology
The VHDL hardware design methodology improves the hardware design task in
several areas.
• Design entry time is reduced. The laborious schematic capture task is
eliminated, replaced with textual entry. Syntax-directed editors which
automatically generate skeleton code and have syntax colouring are available
to reduce the design capture time further.
VHDL component descriptions are concise, as demonstrated by the
following example of a generic two input bus multiplexer.
library IEEE;
use IEEE.std_logic_1164.all;

entity BusMux is
  generic (N : natural);  -- bus width in bits
  port (BusA    : in  std_logic_vector(N-1 downto 0);
        BusB    : in  std_logic_vector(N-1 downto 0);
        SelectA : in  std_logic;
        OutBus  : out std_logic_vector(N-1 downto 0) );
end BusMux;

architecture behavioural of BusMux is
begin
  -- BusA drives the output when SelectA is high, otherwise BusB
  OutBus <= BusA when SelectA = '1' else BusB;
end;
A separate circuit diagram to represent this function would be required for
each bus size used in the design, with the larger multiplexers occupying
several sheets. With the use of a generic parameter N, the same VHDL
description is suitable for any bus size.
• Design parameters supplied to the tool which converts the VHDL code to
the target device gate-level description can be changed and alternative design
scenarios tested rapidly. Parameters typically include optimisation criteria
(speed, area or both), optimisation effort, the state machine encoding scheme
(e.g. fully encoded, Gray, Johnson, One-Hot) and the target logic family.
• Design time is reduced with the re-use of parameterised components.
• State machine implementation is simplified. Enumerated states hide the
details of the state encoding and allow the selection of the state coding
scheme to be optimised. State transition logic and the logic to generate the
state machine outputs are produced automatically.
Additional benefits of the VHDL design methodology include target technology
independence, the ability to use a top-down, bottom-up or a mixed development
approach, and the portability of designs between different vendors’ tool sets.
6.1.2.2 Synthesis and Device Fitting
Commercially available tools convert the textual VHDL description of the
hardware to produce the equivalent logic cell netlist for the target device in a process
called synthesis [Hafer82], [Nakamura87]. The synthesis tool uses knowledge of the
architectural features present in the target device to ensure that the results produce a
design which fits the device structure and exploits special features like fast carry logic
and clock enables on the device registers.
A further process of logic placement and routing following synthesis is required
for FPGA devices, which is performed with tools supplied by the FPGA device
vendor. This process optimises the placement of the logic within the device to obtain
the best performance and then routes the signals between the logic blocks.
6.1.2.3 Verification
Tools for formal verification were not available to the author and, in any case,
such tools are not yet mature for FPGA design. Design verification of the VHDL
hardware descriptions was performed using a process of extensive simulation before
commitment to hardware, which was then followed by verification of the design in the
target hardware.
VHDL source level debugging tools are available which provide standard debug
facilities such as breakpoints and single stepping for the purposes of simulation. Two
stages of simulation are employed: functional simulation, in which timing parameters
of the design are set to unity, and timing simulation, in which the timing parameters are
set according to the characteristics derived from the fitment of the design into the
target device.
Functional verification of the device constitutes the major activity in the
simulation process; only when the functional behaviour of the design is correct are the
actual device timing parameters used. For the purposes of functional simulation the
synthesis and device fitting processes are not necessary, which reduces the cycle time
of making corrections and re-verifying the design.
The process of timing simulation is used to verify device timing parameters. The
functionally correct design is synthesised to target device logic cells and fitted into the
device by the vendor’s place and route tools. From the fitted design, timing
parameters for all the nodes and cells in the design are derived and fed into the
simulation tool. Simulation using the actual timing parameters of the device verifies
that there are no timing violations in the design. The timing simulation process may be
repeated several times with different synthesis parameters in order to obtain the
required performance from the design. It may also be necessary to rework parts of the
design if performance parameters are not met or parts of the design are not
synthesizable.
It is considered good practice by the author to implement a single test bench
suite for both functional and timing simulation. To this end, the timing of the stimuli
and the timing for the verification of the device outputs are set to represent the actual
timing constraints of the design. These constraints are not required for functional
simulation and are ignored, but providing a single test bench for both simulation
processes reduces the time spent designing verification code and ensures that
functional and timing verification use exactly the same criteria for acceptance.
6.2 Routing Node Design
The internal hierarchical architecture of the routing device is illustrated in Figure
56. The labelled boxes surrounding each part of the design indicate the corresponding
name of the VHDL entity/architecture pair describing the enclosed circuit. The
hierarchy of the VHDL routing node design and the VHDL source code descriptions
for all modules are provided in Appendix A.
The overall design is encapsulated in a top level VHDL description
RCORE.VHD. This entity uses the dualbus component (DUALBUS.VHD) in two
different connections, one to route in an Easterly direction, the other to route in a
Westerly direction. The dualbus component is a message router which utilises two
channels in order to provide adaption facilities for reducing the effects of channel
blockages.
Figure 56 - Internal VHDL Architecture
The Texas Instruments TMS320C40 (C40) reception controller C40REC.VHD
and the C40 injection controller C40INJ.VHD provide the connection between the
network routing functions and the attached C40 processor.
6.2.1 Channel Data Flow
The flow of data along channels through the routing node can be easily
identified from the architectural diagram above. Each network routing node has four
inbound and four outbound network channels. A processor injection and a processor
reception channel are shared between the four routing channels. In the demonstration
node design, the four available routing channels are split into two groups. One group,
consisting of two channels, routes messages in the Easterly direction; the other group,
also with two channels, routes messages in the Westerly direction. The Easterly and
Westerly routers both use the Dualbus module and are identical except for the sense
of the address comparison logic within the Dualbus component. The selection
between the two senses of the address comparison logic is made in the Dualbus router
module using a generic parameter, Easterly Routing. This parameter alters the address
comparison logic such that for the Easterly router a message has not reached its
destination if the address in the message is less than the node address. For the
Westerly router, the address comparison sense is such that a message has not reached
its destination if the address in the message is greater than the node address.
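
The effect of the generic parameter on the comparison sense can be expressed as a short predicate. The C fragment below is an illustrative software model, not the VHDL of the Dualbus component; the function name and the use of an 8-bit type for addresses are assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if a message has not yet reached its destination.
   easterly_routing plays the role of the Dualbus generic parameter:
   true selects the Easterly comparison sense, false the Westerly one. */
bool not_yet_at_destination(bool easterly_routing,
                            uint8_t msg_addr, uint8_t node_addr)
{
    return easterly_routing ? (msg_addr < node_addr)
                            : (msg_addr > node_addr);
}
```

In the hardware the parameter is fixed when each Dualbus instance is elaborated, so only one of the two comparisons is built into each router.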
6.2.2 Dualbus Router
6.2.2.1 Inbound Channels
The input multiplexer (IPMUX.VHD) selects either the network input channel
or the processor injection channel as the source for the flit buffering circuit, under
control from the input multiplexer controller (IMUX.VHD) component which has
two states. The Direct state selects the connection to the network input channel.
During injection, the Inject state is entered under control of the injection controller
(INJ.VHD). In this state, the injection channel is selected for input to the flit buffering
circuit.
In the idle condition, the injection controller switches between two states,
FavourA and FavourB, which dictate the channel used for injection. These two states
ensure that the injection load is balanced across the two available routing channels
(Figure 57).
[State diagram. States: FavourA (entered at PowerUp), FavourB, RequestA, RequestB,
InjectA, InjectB. Transition labels: Inject & AUseable, Inject & BUseable, Header,
NoHeader, LastFlit, else.]
Figure 57 - Injection Controller
The injection controller requests the injection channel input to the flit buffering
circuit when a message injection request from the attached C40 processor is detected.
The injection controller detects an injection channel request when the SYNC and
STRB signals from the C40INJ component are both asserted. This signalling is
identical to the network channel start-of-message signalling.
Selection of the injection channel by the IMUX controller requires that the
network channel is not in use. This criterion ensures that the routing of messages
through a node is not interrupted unnecessarily for the purposes of injecting new
messages into the network. If the network channel is free, it is first blocked by
asserting the Hold signal to the preceding network node (RequestA/RequestB states).
The reason for the need to assert the Hold signal is that it is possible for a
message header flit to arrive on the cycle following assertion of the Hold signal
because the Hold is a registered signal to the preceding node (Section 5.4.4).
If a message header byte has not arrived on the cycle following the Hold, the
request for the channel is granted and the message injection proceeds (InjectA/InjectB
states).
If a message header byte has arrived on the cycle following the Hold, the Hold
is released and the message on the network channel is accepted and routed. The
injection controller re-enters the dual idle state mode (FavourA/FavourB) waiting for
a routing channel to become free, at which time the injection request is retried.
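
The injection controller can be approximated by a next-state function. The C sketch below is a reading of the text and of the layout of Figure 57, not a transcription of INJ.VHD. In particular, the choice of idle state entered after a Header or LastFlit event, and the order in which the two channels are tried from each idle state, are assumptions; the type and function names are hypothetical.

```c
#include <stdbool.h>

typedef enum {
    FAVOUR_A, FAVOUR_B,    /* idle states, alternating */
    REQUEST_A, REQUEST_B,  /* Hold asserted, waiting one cycle */
    INJECT_A, INJECT_B     /* injection in progress */
} InjState;

typedef struct {
    bool inject;     /* injection request from the C40INJ component */
    bool a_useable;  /* network channel A is not in use */
    bool b_useable;  /* network channel B is not in use */
    bool header;     /* a header flit arrived on the cycle after Hold */
    bool last_flit;  /* final flit of the injected message */
} InjInputs;

InjState inj_next(InjState s, InjInputs in)
{
    switch (s) {
    case FAVOUR_A:
        if (in.inject && in.a_useable) return REQUEST_A;
        if (in.inject && in.b_useable) return REQUEST_B;
        return FAVOUR_B;                 /* "else": alternate when idle */
    case FAVOUR_B:
        if (in.inject && in.b_useable) return REQUEST_B;
        if (in.inject && in.a_useable) return REQUEST_A;
        return FAVOUR_A;
    case REQUEST_A: return in.header ? FAVOUR_B : INJECT_A;
    case REQUEST_B: return in.header ? FAVOUR_A : INJECT_B;
    case INJECT_A:  return in.last_flit ? FAVOUR_B : INJECT_A;
    case INJECT_B:  return in.last_flit ? FAVOUR_A : INJECT_B;
    }
    return FAVOUR_A;                     /* power-up state */
}
```

Returning to the opposite idle state after each injection is what balances the injection load across the two routing channels in this model.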
6.2.2.2 Flit Buffering
The flit buffering circuit is controlled by the FLOW state machine
(FLOW.VHD). This monitors the Hold signals of the outbound channels in use and
the current state of the input Strobe signal to control the flit buffer logic and set the
inbound Hold signal as appropriate.
The flit buffering circuit (Figure 58) includes an input multiplexer, address
comparison logic, message type decoder and the two flit buffers controlled by the
FLOW controller.
Figure 58 - Flit Buffering
The input multiplexer selects between data from the network channel and data
from the injection channel. The selected data is fed to the flit buffering circuit. The
message type decoder and the address comparison logic use data from the input to the
primary flit buffer. The message type decoder and address comparison logic perform a
look-ahead function to produce a set of registered signals for use in the router state
machine. This approach is necessary to reduce the depth of the combinatorial logic in
the state transition logic.
The FLOW controller (Figure 59) is in the NotStopped state from power up.
[State diagram. States: NotStopped (entered at PowerUp), Stopped, Holding, Recover.
Transition conditions: PathBlocked, !PathBlocked, SecondaryBufferFull,
!SecondaryBufferFull.]
Figure 59 - Flow Controller
The action of the FLOW controller depends on two factors: the ability of the
requested outbound channel(s) to accept flits, and the arrival of a flit from the inbound
channels.
In normal operation with no blockages, flits are directed through the primary flit
buffer and the Hold signal to the preceding node is not asserted.
When an outbound channel blockage occurs, the Hold signal to the preceding
channel is asserted, the updating of the primary flit buffer is suspended and the
inbound channel signals are directed to the secondary flit buffer. The FLOW state
machine is then in the Stopped state. Exit from this state is dependent on both the
outbound channel blockage condition and whether a second flit arrives on the next
clock cycle. A second flit can arrive from the previous node only on the clock cycle
following the assertion of the Hold signal. Thereafter, it is not possible for the
preceding node to produce any flits whilst the Hold signal remains asserted. A second
flit arriving at the node is stored in the secondary flit buffer.
The action of the FLOW controller in the Stopped state depends on whether a
second flit has arrived from the preceding node.
• If a second flit has not arrived:
if the blockage still exists, FLOW remains in the Stopped state, otherwise
FLOW returns to the NotStopped state and flits are directed back to the
primary flit buffer.
• If a second flit has arrived:
if the blockage still exists, FLOW enters the Holding state, otherwise
FLOW enters the Recover state.
In the Holding state, both the primary and secondary flit buffers hold flits and
the Hold signal to the preceding node is asserted. Exit from the Holding state requires
that the outbound channel blockage is cleared. On exit, FLOW enters the Recover
state which routes the output of the secondary flit buffer to the input of the primary
flit buffer. This mechanism allows retrieval of the flit stored in the secondary flit
buffer.
It is possible that the FLOW controller will re-enter the Stopped state from the
Recover state if the routing channel becomes blocked before the secondary flit buffer
is flushed.
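
The transitions of the FLOW controller described above can be collected into a next-state function. The C sketch below models only the state transitions given in the text; it is not derived from FLOW.VHD, and the transition from Recover to NotStopped when no blockage is present is an assumption. The names are hypothetical.

```c
#include <stdbool.h>

typedef enum { NOT_STOPPED, STOPPED, HOLDING, RECOVER } FlowState;

/* path_blocked: the requested outbound channel(s) cannot accept a flit.
   second_flit:  a flit arrived on the cycle after Hold was asserted and
                 now occupies the secondary flit buffer. */
FlowState flow_next(FlowState s, bool path_blocked, bool second_flit)
{
    switch (s) {
    case NOT_STOPPED:
        return path_blocked ? STOPPED : NOT_STOPPED;
    case STOPPED:
        if (second_flit)
            return path_blocked ? HOLDING : RECOVER;
        return path_blocked ? STOPPED : NOT_STOPPED;
    case HOLDING:
        return path_blocked ? HOLDING : RECOVER;
    case RECOVER:
        /* a new blockage before the secondary buffer is flushed */
        return path_blocked ? STOPPED : NOT_STOPPED;
    }
    return NOT_STOPPED;   /* power-up state */
}
```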
6.2.2.3 Message Routing
The routing of messages is controlled by the ROUTER state machine
(ROUTER.VHD). A change of state in the state machine, from the normal IDLE
state, is triggered by the arrival of a message header flit. The header flit contains a
message type field and an address field. The address field is compared with the node
address by the address comparator circuit (IPMUXCMP.VHD) which is part of the
flit buffering circuit. Three comparison signals are produced: less-than (LT), equal-to
(EQ) and greater-than (GT). The result of the address comparison and the message
type are used to determine which outbound channels are required. At this stage, either
the inertial network channel is required, the reception channel is required, or both the
inertial network channel and the reception channel are required. If the inertial channel
is required and it is blocked, the router will also request the other network outbound
channel on the next clock cycle. The first network outbound channel to become
available after this will be used. If both outbound channels become available
simultaneously, it is possible for the header flit to be sent erroneously over both
channels. To avoid this, an extra clock period is introduced during which the request
to the inertial channel is held and the request to the other channel removed.
The attempt to use the alternative network outbound channel if a blockage
occurs on the inertial outbound channel in this process of adaption will take place for
all message types.
Dead header flits are detected by recognising that the address contained in the
header indicates that the message has already passed its destination. Dead header flits
that encounter a blockage at a node will be removed from the network.
6.2.2.4 Outbound Channels
The outbound channels consist of two network channels, each controlled by an
OMUX controller (OMUX.VHD), and the reception channel to the processor
controlled by the RMUX controller (RMUX.VHD).
The OMUX controller normally connects the output of the flit buffering circuit
to the corresponding inertial (direct) network channel. The multiplexer is only
switched to the alternative channel (adapt) if an adaption has been requested by its
router, if the direct channel is not in use and if the adaption is not blocked. The
adaption mechanism of an OMUX controller is blocked if its direct connection router
is requesting adaption. This is necessary in order to prevent Multicast Adaption
Deadlock as described in Section 5.6.3.
In the idle condition, where neither routing channel is requesting the processor
reception channel, the RMUX controller alternately switches priority between routing
channels. If both routing channels require the reception channel simultaneously, the
channel with the high priority is granted the reception channel. If only one channel is
requesting the reception channel, it is immediately granted to the requesting channel.
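
The arbitration performed by the RMUX controller can be illustrated with a small model. The C sketch below is an assumption-based illustration and does not reproduce RMUX.VHD; the structure Rmux, the numbering of the routing channels as 0 and 1, and the function name are hypothetical, and the sketch deliberately ignores what happens once a grant is in progress.

```c
#include <stdbool.h>

typedef struct {
    int priority;  /* routing channel (0 or 1) that wins a simultaneous request */
} Rmux;

/* One arbitration step. Returns the routing channel granted the
   reception channel (0 or 1), or -1 if neither is requesting it. */
int rmux_arbitrate(Rmux *r, bool req0, bool req1)
{
    if (!req0 && !req1) {
        r->priority ^= 1;      /* idle: alternate the priority */
        return -1;
    }
    if (req0 && req1)
        return r->priority;    /* contention: the high priority channel wins */
    return req0 ? 0 : 1;       /* a single requester is granted at once */
}
```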
6.2.3 Processor Interface
The processor chosen for verification of the network routing node design is the
Texas Instruments TMS320C40 Digital Signal Processor (C40).
The C40 processor provides six byte-wide serial communication links for use in
connecting C40 processors together in a point-to-point topology. In the
demonstration network, two of these links are used to transfer messages between the
attached processors and the network.
Messages are transferred over the C40 communications links in multiples of
four-byte words only. The communication links are bi-directional, with a master-slave
arrangement controlled by ownership of a token. The token is passed between ends of
the link only in the period between word transfers. In the demonstration network, the
bi-directional feature of the C40 links is not used. One C40 link is used to inject
messages into the network, and another separate link is used to receive messages from
the network. A simpler interface is derived as a result, but separate links are strictly
necessary if the network-processor interface is byte oriented to avoid deadlock. This
is due to the disparity between the word-oriented transfers of the C40 link and the
byte-oriented transfers of the network. The change of direction in the C40 link can
only occur when a word transfer is completed, but in the network, blockages can
occur on byte boundaries. If a blockage occurs in the middle of a word transfer it
would be possible for the C40 to enter a deadlock situation where it is unable to
complete the current transfer and unable to switch the direction of the link to continue
transfers in the other direction.
Each message is routed within the network using one or more header bytes
which are stripped off by the router at the destination node before transmission of the
body of the message to the associated processor. The processor reception channel of
each network node is directly connected to a C40 communications link and must
therefore receive messages which are a multiple of four bytes in length. Similarly, the
processor injection channel of each network node must produce messages which are
multiples of four bytes.
To summarise, injected messages must be a multiple of four bytes, received
messages must also be a multiple of four bytes, but the message routing mechanism
consumes at least one byte of each message. These apparently contradictory
requirements are met with the use of an injection message which has an additional
control word preceding the body of the message. The control word is four bytes long
and contains the header byte used to route the message within the network, an
optional second header byte and two bytes containing additional information about
the message for use by the injection controller within the routing node.
6.2.3.1 C40 Communication Links
The C40 links consist of a byte-wide data bus and four control signals. Two of
the control signals (C40REQ and C40ACK) are used for passing the link ownership
token. In this application, links are used in a uni-directional mode, therefore these
signals are unused.
The data transmission for the C40 communication links is based on the two
remaining control signals, C40STRB and C40RDY (Figure 60).
[Timing diagram: C40DATA (don't care, valid, don't care, next byte), C40STRB, C40RDY]
Figure 60 - C40 Link Signalling
The C40STRB signal is asserted by the data initiator when a data byte is ready
for transmission. The data is held active until the target signals its consumption of the
data by asserting the C40RDY signal.
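
The strobe and ready handshake can be sketched as a three-step software model. The C fragment below works with logical assertion only and does not model signal polarity or timing; the structure C40Link and the functions are hypothetical, and the sequence used to release the two signals after a byte is consumed is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint8_t data;  /* C40DATA */
    bool    strb;  /* C40STRB, asserted by the data initiator */
    bool    rdy;   /* C40RDY, asserted by the target */
} C40Link;

/* Initiator: present a byte. Fails while the previous byte is still
   being held on the link. */
bool link_send(C40Link *l, uint8_t byte)
{
    if (l->strb)
        return false;
    l->data = byte;
    l->strb = true;
    return true;
}

/* Target: consume the byte on offer and signal this with RDY. */
bool link_receive(C40Link *l, uint8_t *out)
{
    if (!l->strb || l->rdy)
        return false;
    *out = l->data;
    l->rdy = true;
    return true;
}

/* Initiator sees RDY and releases the strobe; the target then releases
   RDY, leaving the link free for the next byte. */
void link_complete(C40Link *l)
{
    if (l->strb && l->rdy) {
        l->strb = false;
        l->rdy  = false;
    }
}
```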
6.2.3.2 Injection Message Format
To provide a high degree of flexibility, the C40 processor injection interface in
the network router is designed to accept messages divided into segments. Each
segment is headed by a control word (4 bytes) which specifies the number of message
words in the segment and, in the case of the first segment, the content of the message
header byte. The control word is followed by the data words of the segment (Figure
61).
Segment 1   control: Length1 = i | (enc Header) | Header | Flags1
            data1[0]
            data1[1]
            ...
            data1[i-1]
Segment 2   control: Length2 = j | null | null | Flags2
            data2[0]
            data2[1]
            ...
            data2[j-1]
Segment s   control: Lengths = k | null | null | Flagss
            datas[0]
            datas[1]
            ...
            datas[k-1]
Figure 61 - Injection Message Format
For encapsulated messages, it is possible to include the content of the message
header byte for the first level of encapsulation within the control word of the first
segment. This is sufficient for intra-network encapsulation. For the deeper levels of
encapsulation required for inter-network encapsulation (Section 5.5.3.4), the
encapsulation headers must be placed at the head of the data words in the first
segment. For the processor interface to the network to continue to function correctly,
the additional encapsulation headers must be padded to four-byte words and these
padding bytes must be consumed by router nodes in the network. This can be
achieved by utilising the existing local encapsulation message type with a destination
address which matches the address of the node at which it is revealed. The existing
router design will consume and effectively ignore such a message header flit.
The control word includes a Flags byte, which has three active bits. The Last
Segment (LS) bit specifies if another segment from the same message follows. For the
first segment of a message, the Direction (DIR) bit specifies the direction of travel of
the message (Easterly or Westerly) and, if an encapsulated header byte is present in the
control word, the Encapsulated (ENC) bit is set.
The length and flag fields of the control word of each segment are consumed by
the C40 injection controller and do not form any part of the message injected into the
network.
The body of the segment, which carries the data of the message, is limited to
256 words (1024 bytes). There is no limit to the number of segments in a message.
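The segment layout described above can be sketched in software as follows. This is a minimal Python model and not the injection software used with the demonstration network: the function name, the bit positions of LS, DIR and ENC within the Flags byte, the polarity of LS (set on the final segment), the sense of DIR and the encoding of the Length field as the word count minus one (so that 256 words fit in one byte) are all assumptions made for illustration. The control word bytes are simply listed in the order drawn in Figure 61.

```python
# Illustrative model of the injection message format (Section 6.2.3.2).
# ASSUMPTIONS (not from the source): flag bit positions, LS = 1 on the final
# segment, DIR = 1 for Easterly, Length stored as (word count - 1).
LS, DIR, ENC = 0x01, 0x02, 0x04   # hypothetical bit positions
MAX_WORDS = 256                   # segment body limit stated in the text


def build_injection_message(header, words, easterly=True, enc_header=None):
    """Return the list of 4-byte control and data words for one message."""
    segments = [words[i:i + MAX_WORDS] for i in range(0, len(words), MAX_WORDS)]
    out = []
    for n, body in enumerate(segments):
        first, last = n == 0, n == len(segments) - 1
        flags = LS if last else 0
        hdr = enc = 0                       # "null" bytes in later segments
        if first:
            hdr = header
            if easterly:
                flags |= DIR
            if enc_header is not None:
                enc, flags = enc_header, flags | ENC
        # Control word, bytes listed in the order drawn in Figure 61.
        out.append(bytes([len(body) - 1, enc, hdr, flags]))
        out.extend(w.to_bytes(4, "big") for w in body)
    return out


msg = build_injection_message(0x25, list(range(300)))
print(len(msg))   # 2 control words + 300 data words = 302
```

A 300-word body is split into a 256-word segment and a 44-word segment; only the first control word carries the header byte, and only the second has LS set.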
6.2.3.3 Message Injection Controller
The processor message injection channel is controlled with the C40INJ
controller (C40INJ.VHD). The input strobe signal to the controller logic, C40STRB,
is an asynchronous signal which is registered within the C40INJ controller to provide
immunity to metastability problems. The C40STRB signal is registered on entry to the
device using the falling edge of the system clock and then used internally with the
rising edge of the clock to reduce the number of clock cycles introduced by the
metastability guard circuitry. The metastability characteristics of the Xilinx device
provided in an application note [Alfke94] indicate that no metastable events occurred
with a device clocked below 25 MHz when using both the rising and falling edges of
the clock. Data is only supplied for Xilinx XC3000 devices, but the XC4000 devices
are faster and have better metastability resolution delays, therefore the approach taken
will successfully guard against metastability problems.
The state machine controlling the injection of messages consists of eleven states
(Figure 62).
Figure 62 - C40INJ Controller
The states of the C40INJ controller are named to describe the next flit expected
from the processor. The normal idle state is FlagByte, the byte containing the flags
being the first flit expected from the C40 processor. The remaining states and the
transitions between them can be seen to match the format of the injection messages as
detailed in Section 6.2.3.2. The contents of the Flags and the Length fields in the
control word of each segment control the transitions taken in the C40INJ controller.
A presettable counter, with carry look-ahead to improve speed of operation, is used to
determine when all words of a segment have been received by the C40INJ controller.
The counter is loaded with the Length field from the control word of each segment.
For speed considerations, the logic for the carry look-ahead and the zero count
detection circuit operate over several system clock cycles, taking advantage of the
fact that the count is a word count, and at least four clock cycles are required to
receive the four bytes of each word.
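The behaviour of the controller can be illustrated with a small software model. The sketch below is not derived from C40INJ.VHD. It assumes that the control word bytes arrive in the order Flags, Header, Encapsulated Header, Length (the text only states that the Flags byte arrives first), that the header byte is forwarded before the encapsulated header byte, and it reuses hypothetical flag bit positions and a Length field holding the word count minus one.

```python
# Behavioural sketch of the C40INJ controller (Section 6.2.3.3).
# ASSUMPTIONS: control-word byte order on the link, order of the two header
# flits, flag bit positions, DIR = 1 for Easterly, Length = (words - 1).
LS, DIR, ENC = 0x01, 0x02, 0x04


def inject(byte_stream):
    """Strip the control words and return (direction, flits) per message."""
    it = iter(byte_stream)
    messages, flits, direction, first = [], [], None, True
    for flags in it:                       # idle state: expect the Flags byte
        header, enc, length = next(it), next(it), next(it)
        if first:                          # header bytes only in segment 1
            direction = "E" if flags & DIR else "W"
            flits.append(header)
            if flags & ENC:
                flits.append(enc)
        word_count = length + 1            # presettable word counter
        while word_count:                  # four bytes per counted word
            flits.extend([next(it) for _ in range(4)])
            word_count -= 1
        first = False
        if flags & LS:                     # last segment: message complete
            messages.append((direction, flits))
            flits, first = [], True
    return messages


stream = [DIR, 0x25, 0, 0, 1, 2, 3, 4,
          LS, 0, 0, 1, 5, 6, 7, 8, 9, 10, 11, 12]
print(inject(stream))
```

The Length and Flags bytes never appear in the output, matching the statement that these fields are consumed by the injection controller.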
6.2.3.4 Reception Message Format
The C40 processor reception interface in the network router produces message
data only; there are no additional control words for message identification or
delineation. For this reason, it is necessary to send the length of a message as part of
the data within the message to ensure that the message reception task in the target
processor allocates sufficient buffer memory for the message and can determine when
a message is complete. In the application for the demonstration network, the length of
each message is sent as the first word of the message. This technique is commonly
used in the message-passing mechanisms for point-to-point links of both the C40 and
the Transputer processors.
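A reception task for length-prefixed messages can be sketched as below. The function name is hypothetical, and the sketch assumes that the first word gives the number of words that follow it; the text states only that the length is sent as the first word.

```python
# Sketch of a reception task for length-prefixed messages (Section 6.2.3.4).
# ASSUMPTION: the length word counts the words that follow it.
def split_messages(word_stream):
    """Split a flat stream of received words into message bodies."""
    it = iter(word_stream)
    messages = []
    for length in it:                      # first word of each message
        body = [next(it) for _ in range(length)]
        messages.append(body)
    return messages


received = [3, 10, 11, 12, 1, 99]
print(split_messages(received))            # [[10, 11, 12], [99]]
```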
A potential deadlock situation exists with this technique if the length field of the
message is received and insufficient message buffer pool memory is available to
allocate a buffer for the remainder of the message. Once the header of a message has
been received, the remainder of the message must be accepted before the routing
channel used by the message is released. The commitment of routing resources and the
allocation of message buffer pool memory can lead to cyclic dependencies which result
in deadlock. In the application for the demonstration network, maximum message
sizes are pre-defined and are small, and therefore message buffers large enough for all
possible messages in transit may be statically allocated at compile time to avoid this
potential deadlock situation.
6.2.3.5 Message Reception Controller
The processor reception channel is controlled with the C40REC controller
(C40REC.VHD). The reception channel to the processor is shared between the two
dualbus router components of the demonstration network router, which are referred
to in this description as channel A and channel B. The C40REC controller has six
states as illustrated in Figure 63.
[Figure 63: state diagram of the C40REC controller; legible labels include
FavourB, BStrobe, BRdy='1' and LastStrobe='1']
Figure 63 - C40 Reception Controller
In the idle condition, the reception controller alternately favours channel A then
channel B to ensure the fair distribution of the reception channel bandwidth.
Whenever data is available from a channel for the attached C40 processor, a Ready
flag is asserted; ARdy and BRdy for channel A and channel B respectively.
When a Ready has been asserted, the C40REC controller enters a two-state
Strobe/Wait cycle for the appropriate channel until the message for the channel is
complete.
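The alternating-favour arbitration can be modelled at message level as follows. This is a simplified sketch under stated assumptions: the queues stand in for the ARdy and BRdy flags, a whole message is served once a channel wins, and the favoured channel is swapped after every served message. The real controller alternates while idle, and its exact behaviour is defined by C40REC.VHD, which is not reproduced here.

```python
# Message-level sketch of the C40REC arbitration (Section 6.2.3.5).
# ASSUMPTION: favour swaps after each served message; names are illustrative.
from collections import deque


def arbitrate(chan_a, chan_b):
    """Serve whole messages from two channels, alternating the favoured one."""
    queues = {"A": deque(chan_a), "B": deque(chan_b)}
    favour, served = "A", []
    while queues["A"] or queues["B"]:
        other = "B" if favour == "A" else "A"
        # The favoured channel wins if it is ready, otherwise the other one.
        winner = favour if queues[favour] else other
        served.append((winner, queues[winner].popleft()))  # Strobe/Wait cycle
        favour = other                     # favour the opposite channel next
    return served


order = [ch for ch, _ in arbitrate(["a1", "a2", "a3"], ["b1"])]
print(order)                               # ['A', 'B', 'A', 'A']
```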
6.2.3.6 Ensuring Fair Access to and from the Network
The processor reception controller (C40REC) ensures that the Easterly and the
Westerly routing channel groups have equal access to the single processor reception
channel. Within each group, an equal opportunity is also given to each channel in the
group to use the reception channel (RMUX). With these two mechanisms, messages
received on any of the four network routing channels are guaranteed to have an equal
opportunity to use the reception channel of the attached processor.
Fair access to the network by the injection channel must be guaranteed to
ensure that traffic on the network inbound channels for a node does not consistently
block a processor from injecting a message into the network.
The channel into the routing node is biased in its idle state to the inbound
network routing channel (IMUX). To gain access to the network, a processor wishing
to inject a message must have the opportunity to block the inbound channel when the
inbound channel is in an idle condition (INJ). In a lightly loaded network, where gaps
naturally occur between messages, fair access to the network from the injection
channel is not a problem.
For a heavily loaded network, the issue of fairness requires further
consideration. Ignoring, for the moment, the complication of the routing channel
adaption mechanism and the spatial compression caused by channel blockages, the
injection of a message automatically leads to an idle period on the inertial outbound
routing channel. This is due to the requirement to have an idle period on the inbound
channel of the node where the injection is occurring, ensuring that the block signal
produced by the injection request is recognised (Section 6.2.2.1). This inherent idle
period between messages will allow other processors fair access to inject messages
into the network. As a result of the implementation of the adaption mechanism, an
adaption always incurs a minimum of one additional clock period on a channel. This
leads to at least a one clock cycle gap between messages after an adaption has
occurred. Finally, the spatial compression of the gaps between messages when
blockages occur never results in the gap being reduced to zero. This is due to the fact
that the node preceding the end of the blocked message will enter the adaption
mechanism as a result of seeing a Hold signal at its outbound channel. As already
noted, the adaption always introduces at least one additional clock period.
In summary, for all scenarios there is fair access to the network for the
injection channel as a result of the inherent gaps that exist between messages within
the network.
6.2.3.7 Processor Interface Limitations
The Injection and Reception controllers both require two clock cycles for each
byte transfer, restricting the maximum message transfer rate to half the network clock
speed. This restriction is due to the metastability guard technique of using only signals
that have been registered through a flip-flop. The 10 Mbyte/s transfer rate achieved
with these interfaces is, however, the same rate achieved in each direction over a C40
link which is being used in a bi-directional mode.
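The quoted transfer rate follows directly from the clock rate and the two cycles needed per byte. The short calculation below uses the 20 MHz network clock given elsewhere in this section; the variable names are illustrative only.

```python
# Worked arithmetic for the interface transfer rate (Section 6.2.3.7).
# The 20 MHz clock and two cycles per byte are taken from the text.
NETWORK_CLOCK_HZ = 20_000_000
CYCLES_PER_BYTE = 2                        # stated for both controllers

bytes_per_second = NETWORK_CLOCK_HZ // CYCLES_PER_BYTE
words_per_second = bytes_per_second // 4   # four bytes per C40 word
print(bytes_per_second)                    # 10000000, i.e. 10 Mbyte/s
print(words_per_second)                    # 2500000 words per second
```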
A simple solution to the removal of this restriction is to run the C40 interface
circuits at double the clock rate of the network router circuitry. The injection and
reception state machines have only a few states compared to the largest state machine,
the router engine. The maximum operating frequency of the design will be set by the
largest state machine, and therefore the C40 interface designs will probably run at twice
the current clock rate without modification. A double frequency clock must be
distributed to all network nodes either as well as the current 20 MHz clock, or in place
of it, in which case the clock for the network router section of the design must be
derived from an internal clock divider. The Xilinx device supports up to four internal
clocks, and therefore the use of two internal clocks to accommodate this modification
is not a problem.
An alternative approach is to utilise a new feature of the Xilinx devices added to
the XC4000E series since the completion of the design for the demonstration
network. The internal memory facility of the new Xilinx family of devices includes a
dual port RAM configuration. Using this option to pass data between a processor and
its network routing node, the strobes of the attached C40 processor could be isolated
to a higher degree from the network routing engine clock. The interface would
provide buffering sufficient to implement four-byte burst transfers, the basic word
transfer size of the C40 processor. Synchronisation between network and C40 link
would occur only on these four-byte boundaries to eliminate three out of the four
synchronisation events currently required per C40 word. This enhancement would
also allow bi-directional transfers to be used on each link if required, by eliminating
the potential C40 link deadlock situation discussed in the introduction to Section
6.2.3.
6.2.4 FPGA Design Considerations
6.2.4.1 FPGA Performance Optimisation
Designs which ignore the internal architecture of an FPGA device can give poor
performance. A good example of this is in the implementation of state machines. An
encoded state machine, where each state is represented by a unique value of a state
register, is economical on registers, but the complexity of the logic used to control
each state register flip-flop is increased as a result of the encoding scheme. The
complex control logic typically requires a large number of input terms to each of the
state registers. In FPGA devices, wide combinatorial terms require multiple levels of
logic due to the small fan-in of FPGA logic blocks. These multiple levels of logic
reduce the maximum clock rate achievable for a particular design due to the additional
signal routing delays between the logic blocks and the propagation delays of the
additional logic blocks. The use of a One-Hot Encoding scheme [Alfke94a], where a
flip-flop is assigned to each state, improves performance by reducing the complexity
of the control logic (Figure 64).
[Figure 64: a conventional (encoded) state machine compared with a one-hot state
machine that has one flip-flop for each of state 0 to state 3]
Figure 64 - Conventional and One-Hot State Machine Encoding
More registers are required to represent the current state compared to the
encoded state method, but in the register-rich architecture of the FPGA, this is not a
problem.
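The register cost of the two encodings can be compared with a few lines of Python. This is an illustrative calculation only, with hypothetical function names; it counts flip-flops and shows a one-hot state value, and does not model the next-state logic of the router.

```python
# Register cost of encoded versus one-hot state machines (Section 6.2.4.1).
import math


def flip_flops(num_states):
    """Flip-flops needed by binary-encoded and one-hot state registers."""
    return {"encoded": max(1, math.ceil(math.log2(num_states))),
            "one_hot": num_states}


def one_hot(state_index, num_states):
    """State register value with exactly one flip-flop set."""
    return format(1 << state_index, f"0{num_states}b")


print(flip_flops(11))     # eleven-state C40INJ controller: 4 versus 11
print(one_hot(2, 4))      # 0100
```

In the one-hot form, the input to each state flip-flop depends only on the few predecessor states and conditions that lead to it, which is what keeps the logic within a small number of FPGA logic block inputs.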
Large binary counters have a similar problem to the state machine in that the
combinatorial logic of the higher order counter registers requires many input terms.
One approach to overcome this problem is to replace a binary counter with a Linear
Feedback Shift Register (LFSR), which is a shift register with the input fed by a small
number of feedback terms exclusively ORed together [Klien94][AT&T95]. With the
appropriate choice of feedback terms, an N-bit LFSR will produce 2^N - 1 unique binary
values. This is similar to an N-bit binary counter, which produces 2^N ordered binary
values, except that the LFSR does not produce ordered count values; the count values
are pseudo random. In the case of the address generator for a FIFO, for example, the
random addresses generated by the LFSR are not a problem, so long as read and write
address ordering is the same. The fact that the length of the FIFO is not a power of
two is also generally irrelevant. What is gained by the use of an LFSR address
generator for the FIFO is that its simpler control logic runs significantly faster than a
conventional linearly addressed FIFO.
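A four-bit example makes the 2^N - 1 property concrete. The sketch below is a generic Fibonacci LFSR with feedback taken from the two most significant bits (polynomial x^4 + x^3 + 1); the tap choice and function name are illustrative and are not taken from the cited application notes.

```python
# Four-bit maximal-length LFSR (illustrative taps, polynomial x^4 + x^3 + 1).
def lfsr4_sequence(seed=0b0001):
    """Return the states visited before the register returns to the seed."""
    state, seen = seed, []
    while True:
        seen.append(state)
        feedback = ((state >> 3) ^ (state >> 2)) & 1   # XOR of the two taps
        state = ((state << 1) | feedback) & 0xF        # shift in the feedback
        if state == seed:
            return seen


seq = lfsr4_sequence()
print(len(seq), len(set(seq)))   # 15 15
print(seq[:5])                   # [1, 2, 4, 9, 3]
```

The all-zero state never occurs, so a FIFO addressed this way has 15 usable locations rather than 16; read and write pointers simply step through the same sequence.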
Multiplication, which is required for many signal processing applications, is
another example of a function which requires careful consideration. For larger
operand sizes, a parallel implementation requires a large amount of logic and it may be
necessary to serialise the function, or utilise a mixture of parallel and serial techniques.
With the use of small internal look-up tables for partial-product generation,
multiplication times can be reduced significantly over purely serial implementations
[Chapman95].
The state machine encoding scheme used in the demonstration network is One-
Hot encoded to reduce the combinatorial next-state logic and increase system
operating frequency. The word counter for the processor injection interface is the only
counter in the design. A binary counter is used in this application in preference to the
LFSR technique to simplify the message injection software running on the attached
processor. It would be possible to utilise an LFSR counter and use a look-up table in
the software to convert between a binary count value and the corresponding LFSR
count value. The word counter is not in the critical timing path of the routing node
design, and therefore the application does not warrant the additional complication of a
look-up table in software.
The architecture chosen for the router has been selected to fit closely with the
internal CLB-based architecture of the Xilinx 4000 series of devices. The flit buffering
arrangement is designed to fit in a single CLB to minimise its impact on the operating
speed of the network. The node address range is based on using a single stage address
comparison circuit for the same reason. This limits the address field to four bits to
match the four input terms available in the combinatorial section of each Xilinx CLB.
6.2.4.2 Synthesis Issues
For the purposes of simulation, all parts of the design are described in VHDL.
Several issues discovered during the synthesis process force the use of schematics for
time critical parts of the design. The synthesis tool is unable to identify that the time
critical flit buffering circuit fits within a single CLB. Synthesis of a textual
representation of the circuit produced a solution occupying two CLBs, reducing the
operational frequency of the network. To overcome this problem, schematic diagrams
of the flit buffering circuit were used (Appendix B). The circuit diagrams were treated
as black boxes by the synthesis tool and were fed unchanged to the Xilinx tool set.
Using constraint commands set at the schematic level, the Xilinx tools were forced to
fit the circuit inside a single CLB.
The use of device specific schematics is contrary to the ideal device independent
VHDL approach, but in this case it is the synthesis tool which forces this compromise.
A test carried out on an alternative and more expensive synthesis tool, on short loan to
the author, showed that it was able to produce the desired single CLB fit without the
need to resort to schematic diagrams.
6.2.4.3 The Address Comparison Circuitry
Each node in the network has a unique address which it uses to make message
routing decisions. The routing logic requires an equality, a greater-than and a less-
than comparison to be made between the address contained in the header of each
message and the address of the node. The implementation of the less-than and
greater-than functions in the Xilinx 4000 series on address fields wider than two bits
requires more than one level of CLB, which would reduce the operational speed of the
router. During network operation, the address of each node is a fixed value and
therefore the use of logic signals to define its value is unnecessary. It is more efficient
to use a look-up table accessed by the address within the header byte of the message
to generate the required comparison signals.
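The content of such a look-up table is easy to generate in software. The sketch below builds, for a given node address, a table indexed by the four-bit header address that returns the three comparison results. The function name and the tuple ordering are hypothetical, the comparison is assumed to be header address relative to node address, and the way the table is mapped onto CLB function generators is not modelled.

```python
# Look-up table contents for the address comparison (Section 6.2.4.3).
# ASSUMPTIONS: tuple order (equal, greater, less); header compared to node.
def comparison_lut(node_address, address_bits=4):
    """Table indexed by header address giving (equal, greater, less)."""
    return [(addr == node_address, addr > node_address, addr < node_address)
            for addr in range(1 << address_bits)]


luts = {node: comparison_lut(node) for node in range(16)}  # one table per node
print(luts[5][5])    # (True, False, False)
print(luts[5][9])    # (False, True, False)
```

Because the node address is fixed, only the table contents differ between nodes; the surrounding logic is identical.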
The use of a look-up table introduces an implementation problem. It is
necessary to change the contents of the look-up table to implement any node address
within the network. It is undesirable to synthesise and fit the VHDL design for every
node in the network, using a different address look-up table to produce a separate
Xilinx configuration file for each of the nodes. Apart from the time taken to process
the design, with the synthesis and the placement and routing processes taking in the
order of fifteen hours for each node, the designs produced would have different timing
parameters. Ideally, identically placed and routed designs would be used in all nodes
of the network to eliminate possible sources of error.
It is possible to use a black box approach at the synthesis level to stop the
synthesis process minimising the address comparison logic, but the Xilinx place and
route process performs a global logic minimisation on the design before placement.
These tool specific problems have been overcome in the demonstration network
node design using a combination of the black box approach at the VHDL level and a
schematic representation of the black box logic, which includes a process control flag
to prevent minimisation of the logic by the Xilinx software.
A C program running under DOS was developed by the author, which is run
after the Xilinx place and route process. It identifies the unchanged address
comparison circuit within the place and route output file and modifies the address
comparison logic within the file to produce sixteen output files, one for each of the
possible nodes in the network. This same utility also adds the optional internal pull-up
resistors on all of the input pins to the device node to provide a well defined state for
any unconnected inputs. The pull-up facilities on the I/O cells of the Xilinx device are
not accessible from the VHDL level.
6.2.4.4 Device Dependent VHDL Design
The close correlation between design decisions made for the network router and
the internal architecture of the Xilinx XC4000 device family, described in Section
6.2.4.1, appears to defeat one of the objectives of VHDL design, which is to have
high level design descriptions that are independent of target device considerations.
This objective can be realised in many designs, but those which require the highest
performance that is available from the device must inevitably take a pragmatic
approach and consider the finer details of the target device architecture. This
compromise is analogous to the high level language/assembler trade-offs made in
software engineering. It is sometimes necessary to implement the most time critical
parts of the design in assembler to take full advantage of the capabilities of the
underlying processors and to obtain the required performance.
Special features are specifically included in programmable logic devices by the
device vendor to improve performance in certain key areas, for example, in the
provision of fast carry logic for the implementation of wide adders. Many synthesis
tools are able to recognise commonly used constructs at the VHDL level and utilise
special features of the target device in the implementation of the design, but the
VHDL code must be written in a specific way for this recognition to occur.
The objective of using VHDL descriptions in this research is less ambitious:
taking advantage of high level design techniques to improve the design process and
to provide a design which can be easily ported to faster and denser FPGA devices as
they become available in the future. The demonstration router node design will be
easily fitted in other devices of the same Xilinx family, which is continually being
improved, both in terms of operational speed and packing density. The latest addition
to the family, the XC4062E, has four times the number of CLBs compared to the
device used for the demonstration design, and a higher speed grade for all parts in the
family is now available.
Synthesis tools for FPGAs are still maturing. It is anticipated by the author that
they will follow a similar progression to C language compilers for DSPs and
embedded processors. Machine code produced by the latest-generation products is
comparable in performance with hand-coded assembler. In the long term, the FPGA
vendors and synthesis tool designers will produce products which approach the
performance of semi-custom devices without the need to tailor the design for specific
devices.
6.3 Verification - Simulation
The generation of a set of test vectors to provide full coverage of all paths in the
logic within the complete routing node design is not a practical approach to
verification. To cover all possible conditions in a design that has independent
network-inbound, network-outbound and processor channels and a wide variation in
message format and content would require a very large test vector set.
An alternative approach is taken, in which individual components of the design
are tested using VHDL Test Benches, then a complete network is simulated with a
large number of random messages. The individual component testing verifies the
operation of the lower levels of the hierarchical design. This requires the generation of
only a small number of test vector sets, each of which covers a small part of the
design. The simulation of the complete network provides a means of verifying that the
component parts of the design work together correctly under a limited number of
conditions. The messages used are random in nature, both in length and content, to
provide the best chance of revealing problems in the design.
6.3.1 The VHDL Test Bench Approach
Verification of VHDL is achieved with the use of a Test Bench, which is simply
a set of components, written in VHDL code, which produce stimuli to exercise the
component under test, and capture the resulting outputs from the component. The
analogy of this approach, from which the name of the technique is derived, is in the
use of signal generators and waveform analysers to test hardware on an engineer's
bench.
For simple repetitive signals like system clocks, a simple VHDL description can
be used to produce the necessary stimuli.
clockgen : process
begin
    wait for CLOCK_PERIOD/2;
    Clock <= not Clock;
end process;
For more complex signal waveforms, the stimuli can be produced with a series
of VHDL statements which have the stimuli hard-coded into them.
stimuli : process
begin
    wait for CLOCK_PERIOD/2;
    Signal1 <= '1'; Signal2 <= '0'; Signal3 <= '0';
    wait for CLOCK_PERIOD/2;
    Signal1 <= '0'; Signal2 <= '1'; Signal3 <= '0';
    wait for CLOCK_PERIOD*6;
    Signal1 <= '1'; Signal2 <= '0'; Signal3 <= '1';
    wait;
end process;
In a more flexible approach, textual input files are used to define signal states.
The file contains a list of entries, each of which defines the state of every stimulus
signal. The file may also include a field in each entry which defines the time to the
next step in the file.
The output from the component under test can be viewed as signal traces which
resemble those from a logic analyser. The process of checking results which are in the
form of signal traces is laborious and error prone for all but the simplest of tests. The
use of textual files to describe the expected outputs at specific times simplifies and
speeds up the process of checking the outputs from the component during
verification.
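The file-driven approach can be illustrated outside the simulator with a short Python sketch. The file layout used here (one entry per line, signal values followed by the time to the next entry, with `#` comments) is a hypothetical format chosen for the example; the format actually used with the test benches is not given in the text.

```python
# Sketch of file-driven stimuli and output checking (Section 6.3.1).
# ASSUMPTION: each line is '<bit> <bit> ... <time to next entry>'.
def parse_vectors(text):
    """Convert entries into (absolute time, signal values) tuples."""
    time, schedule = 0, []
    for line in text.splitlines():
        line = line.split("#")[0].strip()      # allow comments and blank lines
        if not line:
            continue
        *bits, delay = line.split()
        schedule.append((time, tuple(int(b) for b in bits)))
        time += int(delay)                     # time to the next step
    return schedule


def mismatches(expected, captured):
    """Return the times at which captured outputs differ from expected."""
    got = dict(captured)
    return [t for t, bits in expected if got.get(t) != bits]


stimuli = parse_vectors("1 0 0 25\n0 1 0 300\n1 0 1 0\n")
print(stimuli)   # [(0, (1, 0, 0)), (25, (0, 1, 0)), (325, (1, 0, 1))]
```

An empty list from `mismatches` corresponds to a passing test, which replaces the manual inspection of signal traces.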
6.3.2 Verification of the Network Node Components
Design errors were eliminated from the lowest level components with the use of
a VHDL source level debugging tool. Each VHDL component was tested with its
own small test bench using simple sequential stimulus statements.
The complex interactions between state machines within the dualbus component
meant that most verification time was spent on the complete dualbus component. The
technique of using textual files to describe both the input stimuli and the resulting
component outputs was used for this verification process.
Messages were injected into the dualbus component to check a small subset of
the possible message arrival and channel blockage scenarios.
6.3.3 Verification of the Complete Network Design
The simulation to verify the complete node design uses a VHDL test bench
representing a sixteen-node network. To implement this, it was necessary to develop a
simple simulation of a C40 processor communications port that generates and
receives messages. Each C40 communication port is set up to generate messages of
random length separated by a random number of network cycles. Random numbers
are generated with maximal length pseudo random number generators [Mutagi96].
The first value in the body of the message is the source node, which is used by the
receiving node for data logging purposes. The remainder of the data in the body of the
message is the sequence of data from the pseudo random data generator. All random
data in the message can be verified at the destination node or nodes using a seed value
in an identical pseudo random number generator at the receiving processor.
Messages generated by each simulated C40 communications processor are
logged in textual form in a Transmitted Messages file. For each message, the source
node, destination node, message length, message type and the random number seed
for the data in the message body are stored. To enable the correct distribution of
Multicast messages to be verified, it is necessary to produce an entry in the
Transmitted Messages file for each of the destination nodes of a multicast message.
Messages received by each C40 communications processor are logged in a
Received Messages file. The network routing channels at the ends of the network are
monitored. Dead header flits are ignored, and other messages from each network
channel are logged in an End Messages file. The format of the logged data for both of
these received messages files is identical to the format used in the Transmitted
Messages file.
Verification of the network node design consists of running the simulation with
a specified number of messages injected at each node. Following completion of the
simulation, the log files are processed and compared to check for correct network
operation. The Transmitted Messages files from each node are combined into a single
file, which is then sorted. Received Messages files from each node and the End
Messages files are combined into a single file, which is then also sorted. The two
sorted files are then compared. A match indicates that all transmitted messages were
received at the correct nodes and that there were no spurious messages transmitted or
received. Mismatches between the two files indicate incorrect network operation.
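The log comparison can be sketched as follows. This is a minimal Python model with hypothetical record layouts and function names; Python's `random` module stands in for the maximal length pseudo random generators used in the test bench, and in-memory lists stand in for the log files.

```python
# Sketch of the log-file comparison (Section 6.3.3).
# ASSUMPTIONS: record layout (source, destination, length, type, seed) and
# Python's random module as a stand-in for the LFSR-based generators.
import random


def body_from_seed(seed, length):
    """Regenerate a message body from its logged seed."""
    rng = random.Random(seed)
    return [rng.randrange(256) for _ in range(length)]


def expand(transmitted):
    """One log entry per destination, as required for multicast messages."""
    return [(src, dst, length, mtype, seed)
            for src, dsts, length, mtype, seed in transmitted for dst in dsts]


def network_ok(transmitted, received, end_messages):
    """Compare the sorted transmitted log with the sorted received logs."""
    return sorted(expand(transmitted)) == sorted(received + end_messages)


tx = [(0, [3], 8, "unicast", 17), (2, [1, 4], 4, "multicast", 99)]
rx = [(2, 4, 4, "multicast", 99), (0, 3, 8, "unicast", 17),
      (2, 1, 4, "multicast", 99)]
print(network_ok(tx, rx, []))        # True
print(network_ok(tx, rx[:-1], []))   # False: one multicast copy is missing
```

Because each record carries the seed, a receiver can call `body_from_seed` with the logged seed and length to check the message data as well as its routing.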
6.3.4 Simulation Results
The simulation of the sixteen-node network takes in the order of one hour per
ten thousand messages generated. The final verification of the VHDL design,
undertaken before committing the design to a circuit board for use in the
demonstration network, consisted of a simulation in which a total of one million
messages were injected into each network node. The result from this process was a
complete match between the messages injected into the network and the messages
received by the processors at each node and at the ends of the network.
This result indicated, within the limitations of the test, that the design is
functionally correct. Further verification of the design using demonstration hardware,
which exercises the network at a much higher rate and therefore can verify many more
network traffic scenarios, is necessary to provide a higher level of confidence in the
design.
6.4 Verification - Demonstration Hardware
The hardware verification process follows the approach taken to confirm the
operation of the complete network during the VHDL verification process described
above. Random messages are injected into the network and verified on arrival. In
addition to these messages, a short message is sent through the network periodically
and its arrival checked to confirm that the network has not failed and is not
deadlocked or otherwise in an inoperative state.
The demonstration network hardware consisted of a PC card with eight
network nodes, each implemented in a single Xilinx XC4013 device.
The C40 processors used are in standard TIM format, mounted on a PC AT
card which houses up to four TIM modules. The C40 processor modules and the
motherboard are commercially available products from Transtech Parallel Systems
Ltd.
The configuration of the routing network Xilinx devices is achieved through an
I/O mapped port to the PC. This task is performed under command from the operator
after a power cycle of the PC.
6.4.1 The Demonstration Network Hardware
The demonstration network consists of eight routing nodes assembled on a four
layer PC format board designed by the author (Photograph 3).
Photograph 3 - Demonstration Network Board
The network is designed to interface directly with a PC AT card manufactured
by Transtech Parallel Systems, which has four daughter board sites for TMS320C40
processor modules. The daughter board module sites conform to the Texas
Instruments TIM industry standard which follows the TRAM concept familiar to
Transputer users.
The Xilinx FPGA routing devices are arranged in a linear array on the circuit
board, with the two processor connectors of each node situated at the top edge of the
board. The connectors match those used on the Transtech TMS320C40 TIM module
motherboard, which brings some of the communication ports of the TIM modules to
connectors at the top of the board. Other communications ports are wired on the
motherboard to connect the module processors together.
The processors connect to the message routing network with the use of short
cables between the routing network board and the Transtech motherboard. This
arrangement offers flexibility in how the processors are connected to the network.
The inter-node network connections for the two end nodes of the demonstration
network are made available at 50 way connectors on the board. These connectors can
be used to connect to other network router cards to expand the network.
All control signals on the network interfaces and the C40 processor interfaces
are active low. Input control signals on the ports to the network and C40 are tied to
logic high through resistors within the Xilinx devices, ensuring that unused ports are
guaranteed to be inactive.
The configuration of the Xilinx devices is performed with the use of a PC I/O
mapped port. A DOS utility implemented by the author interprets the configuration
bit-stream file generated by the Xilinx place and route software to generate the
appropriate accesses to the I/O port of the demonstration network board to configure
each of the eight Xilinx devices.
The configuration ports of the Xilinx devices are connected in a daisy chain
arrangement. This simplifies the circuit board layout, but extends the overall time
required to configure all of the Xilinx devices. Each Xilinx device in the daisy chain is
configured in turn, starting with the device nearest to the data source. A wired-OR
signal holds all devices in a non-operational tristate condition until configuration of all
devices is complete. The period taken to reconfigure the whole demonstration
network is 12 s for the eight devices. It is possible to configure all devices of the
network in parallel with a change to the circuit and the Xilinx configuration utility to
reduce the configuration time to 1.5 s. These configuration times are set by the access
period of the PC bus; the theoretical minimum configuration times are 200 ms and
25 ms for daisy chained and parallel configuration respectively.
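The relationship between the quoted configuration times can be checked with a short calculation. The sketch below assumes, as the figures suggest, that parallel configuration takes one eighth of the daisy-chain time because all eight devices are loaded at once. The function and variable names are illustrative only and do not come from the configuration utility.

```python
# Illustrative check of the configuration times quoted in the text.
# Assumption: parallel loading takes 1/N of the daisy-chain time for N devices.

NUM_DEVICES = 8


def parallel_time(daisy_chain_time, num_devices=NUM_DEVICES):
    """Time to configure all devices at once, given the daisy-chain time."""
    return daisy_chain_time / num_devices


# Times measured over the PC bus, in milliseconds: 12 s daisy chained.
measured_daisy_ms = 12000
measured_parallel_ms = parallel_time(measured_daisy_ms)

# Theoretical minimum, in milliseconds: 200 ms daisy chained.
theoretical_daisy_ms = 200
theoretical_parallel_ms = parallel_time(theoretical_daisy_ms)

# Factor by which the PC bus access period stretches the configuration time.
pc_bus_factor = measured_daisy_ms / theoretical_daisy_ms

print(measured_parallel_ms, theoretical_parallel_ms, pc_bus_factor)
```

Under this assumption the PC bus access period accounts for a factor of 60 between the achieved and theoretical configuration times in both arrangements.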
A low skew clock generation device produces the clock signals for the routing
nodes. The clock generator is driven either by a 20 MHz crystal oscillator on the
network board, or by an external signal. The external signal is used in multi-board
networks to ensure that all network router clocks are synchronised and have low
timing skew. The circuit board trace for each of the clock signals to the routing nodes
is set such that a maximum clock skew of 250 ps is attained across all devices of
the network. At each routing node clock input, an impedance matching circuit is
provided to maintain clock signal integrity. A reset signal, controlled by the PC I/O
port, connects to all routing devices. The reset within the Xilinx routing node is
implemented as an asynchronous reset. A similar mechanism is available for the C40
processor board. A reset to the C40 processor is guaranteed to place the
communications links into an idle condition. Assertion of both network and processor
resets simultaneously, and the removal of the network reset before the processor
reset, provides a repeatable system reset mechanism.
6.4.2 Demonstration Network Application
The application run on the demonstration hardware to verify the operation of
the network routing node design is shown in Figure 65, which includes details of the
tasks running on each processor.
Six of the eight available nodes in the demonstration network (nodes 2 to 7) are
used for message generation and verification. The two remaining nodes, which are the
end nodes of the network, are used in the process of periodically confirming that the
network is active.
At network node 4, the C40 processor injection interface of the network node
chip suffered damage due to a solder bridge connecting one of the data pins to Vcc.
The I/O driver for this data signal was internally damaged by this fault. The injection
channel for this node is left unconnected, and the spare processor time on processor 3
freed up by the lack of a TX4 task is used in performing the TX2 task for node 2.
This explains the irregular connection seen for network node 2 in Figure 65.
Figure 65 - Verification Application
All processor injection channels to the network have a pull-up resistor on the
active low C40STRB strobe signal which ensures that unconnected injection channels
are always forced into an inactive condition.
At network nodes 5 and 7, where there are no C40 processor links available for
connection to the reception channel, the reception channels are configured to accept
the data unconditionally by connecting the C40STRB strobe signal to the C40RDY
ready signal. With this connection, data is always accepted immediately at the
reception channel.
At node 1 in the network, the processor channels are connected together such
that any received message is re-injected into the network. This feature is used by the
SendPing and Status tasks (Section 6.4.3) on node 8 to periodically bounce a message
through node 1 to check that the network is active and has not been corrupted by the
incorrect routing of a message or failure of the network to run at the design speed.
The ping message also provides a mechanism for checking that deadlock has not
occurred, although with the dual routing channels in each direction through the
network, two separate deadlocks in the same direction would be required to block the
ping message. However, if one deadlock has occurred due to a design fault, then a
second occurrence of the deadlock is also likely to happen.
6.4.3 Verification Tasks
The limited number of processors available for network verification requires
that each of the processors runs several tasks in order to exercise all of the network
routing nodes effectively. The Parallel C micro-kernel running on each processor
provides a multi-tasking environment that ensures that each task has access to the
processor. The priorities of all tasks in the application are equal, ensuring that every
task has an equal share of the processing resource.
Message injection and reception over the C40 communication links is
implemented using DMA transfers. When any of the tasks in the application is waiting
for a DMA communication transfer to complete, the task de-schedules itself to ensure
that other tasks on the same processor can fully utilise the processing resource.
Four tasks are used in the complete application: TX to inject messages into the
network, RX to verify messages received from the network, Status to report the
integrity of the network and SendPing to periodically send a message through the
network. Multiple instances of the TX and RX tasks are used in the application to
service different network node processor channels.
6.4.3.1 TX Task
The TX task generates messages for verifying all of the message routing
features of the network. Several measures are taken to ensure that a wide variety of
communication patterns are present during verification, including the use of random
message lengths and transmission intervals.
All message types are generated for verification purposes: point-to-point,
multicast, local encapsulation and global encapsulation. Multicast messages are sent in
one direction only to avoid Multicast Collision deadlock (Section 5.6.1).
The message type and extent of verification messages generated by the TX task
are determined at run time using a set of counters. The counter values are combined
to produce messages which repeat with a periodicity of 60. The extent calculated for
each message is checked and truncated if necessary, to ensure that the limits of the
demonstration network are not exceeded.
The length of each message is determined at run time with a counter which has
a periodicity dependent on the source node address. The maximum length of any
message is 20 4-byte words.
The period between the transmission of each message is determined using a
random delay. The resolution of the built-in timers is based on the micro-kernel
pre-emption timer, which has a resolution of 1 ms. This period is too long for generating
random intervals between messages, therefore a software loop is used to generate the
delay, producing delay times of between approximately 1 µs and 64 ms.
Data in the body of each message is generated at application start-up time using
the network node address as the seed for a random number generator. The message
data for each node is not changed from the initial values during the verification
process.
Verification messages generated by the TX task consist of two segments. The
first segment contains two message words (8 bytes), the total length of the message
and the node address of the message source. The second segment carries the
remainder of the message which is random data (Figure 66).
Segment 1 control: Length=2 | (enc Header) | Header | Flags: LS,DIR,ENC
  Total length of message
  Source node address
Segment 2 control: Length=j | null | null | Flags: LS
  data[0]
  data[1]
  ...
  data[j-1]
Figure 66 - TX Message Format
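The two-segment layout of Figure 66 can be modelled in a few lines. This is a minimal sketch rather than the Parallel C implementation: the control words are represented as dictionaries because their bit layout is not given here, the header string `"PP(3,6)"` is a placeholder, and it is assumed that the total length counts the two words of the first segment plus the data words. All function names are hypothetical.

```python
import random

MAX_WORDS = 20  # maximum message length quoted in the text, in 4-byte words


def make_node_data(node_address, count=MAX_WORDS):
    """Fixed pseudo-random payload for one node, seeded with its node address."""
    rng = random.Random(node_address)
    return [rng.getrandbits(32) for _ in range(count)]


def build_tx_message(node_address, header, flags, data_words):
    """Return the two segments of a verification message as (control, words) pairs."""
    # Assumption: the total length counts the two words of segment 1 plus the data.
    total_length = 2 + len(data_words)
    segment1 = ({"length": 2, "header": header, "flags": set(flags)},
                [total_length, node_address])
    # Segment 2 carries no header and has only the LS (last segment) flag set.
    segment2 = ({"length": len(data_words), "header": None, "flags": {"LS"}},
                list(data_words))
    return [segment1, segment2]


payload = make_node_data(3)[:5]
message = build_tx_message(3, "PP(3,6)", {"DIR"}, payload)
```

Seeding the generator with the node address means that a receiver which knows the source address can regenerate the expected data without any further communication.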
For every one million messages generated and injected by the TX task, a short
message is sent to the Status task residing on node 8, which is used to report to the
operator the total number of messages injected into the network by each node.
The TX task generates a single word message for transmission to the Status task on
node 8 using a single message segment. The value of the word in the message is set to
the network node address.
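The way in which several free-running counters combine to give the repeat period of 60 described for the TX task can be illustrated with a small model. The counter periods used below (3, 4 and 5) are assumptions chosen only because their least common multiple is 60; the actual counters are not specified in the text. The extent truncation is likewise a sketch that assumes nodes numbered 1 to 8 in a linear array and an extent measured in hops.

```python
from functools import reduce
from itertools import islice
from math import gcd

# Hypothetical counter periods; the real values are not given in the text.
PERIODS = (3, 4, 5)


def repeat_period(periods):
    """Least common multiple of the counter periods."""
    return reduce(lambda a, b: a * b // gcd(a, b), periods)


def counter_states(periods):
    """Yield the combined counter values for successive messages."""
    n = 0
    while True:
        yield tuple(n % p for p in periods)
        n += 1


def truncate_extent(source, extent, direction, first_node=1, last_node=8):
    """Limit a hop count so that the message stays within the linear array."""
    room = last_node - source if direction > 0 else source - first_node
    return min(extent, room)


states = list(islice(counter_states(PERIODS), 120))
```

With these periods every one of the 60 counter combinations occurs exactly once before the sequence repeats, so each combination of message parameters is exercised equally often.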
6.4.3.2 RX Task
The RX task receives messages from the network reception channel. The first
word of each message is the total length of the message. In the idle state, the RX task
sets up a single word DMA to receive the first word of the message. When a word
arrives, the RX task sets up a second DMA, using the first word as the length
parameter for the DMA. When the second DMA is finished, the complete message
has been received and the first and last words of the message data are verified. To
ensure that the RX task uses as little of the processor time as possible, the other data
words in the message are not verified. Checking of the first and last words of the
message, which share the same path through the network, is considered to be a
sufficiently thorough test to detect any possible fault in the transmission of the
message.
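The two-stage receive sequence described above can be sketched as follows. The model replaces the C40 DMA transfers with reads from a Python iterator, and makes three assumptions that are not confirmed by the text: the length word counts the words that follow it, the first of those words is the source node address, and the data carried is the leading part of the source node's seeded random data. The names `dma_read`, `node_data` and `receive_and_check` are hypothetical.

```python
import random


def node_data(node_address, count=20):
    """Regenerate a node's fixed payload from its address (the seed)."""
    rng = random.Random(node_address)
    return [rng.getrandbits(32) for _ in range(count)]


def dma_read(link, count):
    """Stand-in for a DMA transfer: take `count` words from the link."""
    return [next(link) for _ in range(count)]


def receive_and_check(link):
    """Receive one message and verify its first and last data words.

    Assumes the message carries at least one data word.
    """
    (length,) = dma_read(link, 1)   # first DMA: the single length word
    body = dma_read(link, length)   # second DMA: sized by that word
    source, data = body[0], body[1:]
    expected = node_data(source)
    ok = data[0] == expected[0] and data[-1] == expected[len(data) - 1]
    return source, ok


good = node_data(5)[:6]
bad_last = good[:-1] + [good[-1] ^ 1]
bad_middle = good[:2] + [good[2] ^ 1] + good[3:]
```

A corrupted final word is detected, whereas a corrupted middle word is not; this is the trade-off accepted in the text in order to keep the processor load of the RX task low.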
6.4.3.3 SendPing Task
The purpose of the SendPing task is to periodically send a short message
through the network which is checked by the Status task to ensure that the network is
active.
The SendPing task is connected to the processor injection channel at node 8. It
sends a point-to-point message out of node 8 of the network to node 1 of the network
every minute. Node 1 of the network has a special connection, with the processor
reception channel connected directly to the processor injection channel. The SendPing
task is the only task to use Node 1 as a destination for a message. The effect of the
special processor connection is that a message received at Node 1 is immediately
re-injected into the network, minus the header flit which carried it. This is a similar
action to the encapsulation message type where a message is carried to an
intermediate node to be re-injected into the network, except that the message
re-injection at Node 1 is achieved at the processor interface not the network router level.
The use of processor connection to provide a re-injection mechanism also allows the
message to be re-injected in the opposite direction to its arrival, which is not possible
with the encapsulated message type. The re-injected message is set up as a
point-to-point message to the originator of the Ping message, Node 8.
The format of the message injected by the SendPing task at node 8 is shown in
Figure 67.
Length=2 | 00h | Header=PP81 | Dir=1,LS=1
Length=1 | 00h | Header=PP18 | Dir=0,LS=1
FFFFFFFFh
Figure 67 - Ping Injection Message Format (node 8)
The control byte defines a message segment of two words, the header and DIR
flag to carry the message between node 8 and node 1 as a point-to-point message, and
the LS flag is set to indicate that the segment is the last segment of the message.
The first word of the message is a control byte for the processor injection
controller at node 1. This will be revealed when the message is received at node 1.
The value of the data word in the message is FFFFFFFFh.
The content of the message which traverses the network from node 8 to node 1
is shown in Figure 68.
PP81
Dir=0,LS=1
PP18
00h
01h
FFh
FFh
FFh
FFh
Figure 68 - Ping Network Message (node 8 to node 1)
The PP81 header of this message carries the 8 bytes of message data to node 1. At
node 1, the routing engine strips off the header of the message. The body of the
message is received by the processor over the reception channel. The reception
channel at node 1 is connected directly to the injection channel. The format of the
message seen by the injection channel of node 1 is shown in Figure 69.
Length=1 | null | Header=PP18 | Dir=0,LS=1
FFFFFFFFh
Figure 69 - Ping Injection Message Format (node 1)
It can be seen that this message will result in a single word point-to-point message
being transmitted to node 8.
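The ping round trip of Figures 67 to 69 can be traced with a small model. This is a sketch under stated assumptions: control words are shown as dictionaries instead of packed bytes, the header names and flag values are those read from the figures, and `inject` and `deliver` are hypothetical stand-ins for the injection channel and the destination routing engine.

```python
PING_DATA = 0xFFFFFFFF


def inject(words):
    """Injection channel model: the first word is the control word,
    the remaining words form the segment it describes."""
    control, payload = words[0], words[1:]
    assert control["length"] == len(payload)
    # The network carries the routing header followed by the payload.
    return {"header": control["header"], "dir": control["dir"], "body": payload}


def deliver(network_message):
    """Destination router model: strip the header, pass on the body."""
    return network_message["body"]


# Control word that node 1 will see once the outer header has been stripped.
return_control = {"length": 1, "header": "PP18", "dir": 0, "ls": 1}

# Node 8: SendPing injects a two-word segment addressed to node 1.
outbound = inject([{"length": 2, "header": "PP81", "dir": 1, "ls": 1},
                   return_control, PING_DATA])

# Node 1: the reception channel is wired straight to the injection channel.
looped_back = deliver(outbound)
inbound = inject(looped_back)

# Node 8: the Status task receives what is left.
received_at_node8 = deliver(inbound)
```

The outer control word is consumed at node 8 and the inner one at node 1, so only the single data word returns to the Status task.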
6.4.3.4 Status Task
The Status task is connected to the processor reception channel at node 8, from
which it receives both the messages sent by the TX tasks to indicate the number of
messages injected into the network and the ping messages sent by the SendPing task.
The processor attached to node 8 of the network is also connected through a
C40 link interface to the PC processor. The C40 processor library provides standard
console Input/Output functions which allow a task on the C40 processor to use the
PC keyboard and monitor. The printf function from the library is used by the Status
task to display information to the operator.
When a message is received from a TX task it contains the node address of the
originating node. The message indicates that the specified node has injected one
million more messages into the network since its last communication with the Status
task. The Status task keeps a running total of millions of messages injected by each
node.
The periodic arrival of the Ping message from the SendPing task triggers the
Status task to display the number of messages injected by each node and the total
number of injected messages on the PC console display using the printf function. The
elapsed simulation time, which is derived from the timer functions of the C40 library is
also displayed.
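The bookkeeping performed by the Status task can be sketched as a simple tally. The sketch assumes that the task distinguishes the two kinds of single-word message by value, treating FFFFFFFFh as the ping and any other word as a node address; the text does not state how the distinction is made. The class name and report format are illustrative.

```python
from collections import Counter

PING_WORD = 0xFFFFFFFF


class StatusTally:
    """Running total of millions of messages injected by each node."""

    def __init__(self):
        self.millions = Counter()

    def on_word(self, word):
        """Handle one received word; return a report when a ping arrives."""
        if word == PING_WORD:
            return self.report()
        # A TX task sends its node address once per million messages.
        self.millions[word] += 1
        return None

    def report(self):
        lines = [f"node {node}: {count} million"
                 for node, count in sorted(self.millions.items())]
        lines.append(f"total: {sum(self.millions.values())} million")
        return "\n".join(lines)


tally = StatusTally()
for word in (2, 3, 2):
    tally.on_word(word)
report = tally.on_word(PING_WORD)
```

Because the TX tasks report only once per million messages, the traffic added by the status messages is negligible compared with the verification traffic.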
6.4.4 Different Network Configurations
The placement and routing of a design which uses the pin constraints from a
substantially different design places a high demand on the routing structure within the
device. The demand is significantly higher for designs like the router in which there
are bus structures internal to the device. The design of the verification network used
93% of the logic resource blocks within the device. Some of these are used for
routing slow signals, but the high utilisation makes the fitment of many different
designs into a single pin-out configuration potentially very difficult. This, however, is
exactly the requirement of the proposed library-based building block approach to
network design, so it is important to establish that the fitment of multiple designs will
not cause a problem.
Checks were made with the router components available from the verification
network design to provide evidence that the building block approach to network
design is a viable proposition. The design of the verification routing node used two
instantiations of the dualbus router block, one for Easterly messages, the other for
Westerly messages (Section 6.2). There are four through-routing channels in each
node, providing four channel groups on each side of the device, each with a fixed pin
assignment. The directions of the pins of each channel group are uncommitted, so any
channel group can be used for either Easterly or Westerly routing. The maximum
number of different configurations that can be derived using only two dualbus
components can be determined as follows:
There are six possible combinations of channels on each side of the device for
the case of one Easterly and one Westerly dualbus component: EEWW, WWEE,
EWEW, WEWE, WEEW, EWWE. This leads to 36 possible channel assignments
given that it is possible to have a different combination on each side of the device. For
each channel assignment, there are 16 internal orientations of the two dualbus
components. This leads to 576 possible configurations for a device with one Easterly
and one Westerly dualbus component. There are also 24 configurations for a device
with two Easterly dualbus components and similarly 24 configurations for a device
with two Westerly dualbus components. Therefore, in total, there are 624 possible
configurations for a 4-channel device utilising two dualbus components. There was
not enough time to try all of these configurations given that it took over 4 hours to
place and route each design. Restricting the configurations to have the same channel
assignment on both sides of the device and allowing only one orientation of the
dualbus components per channel assignment reduces the number of configurations to
eight; designating the verification network design channel assignment as EEWW, the
eight configurations are:
EEWW (verification assignment)
EWEW
EWWE
WWEE
WEWE
WEEW
EEEE
WWWW
All eight of these configurations were tested with the channel pin allocations
from the EEWW design by using different pin constraints files and passing the design
through the Xilinx Place and Route utility for each of the configurations. The result of
this test was that all of the channel configurations fitted in the device and achieved the
design speed without any problems.
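The counting argument for the number of possible configurations can be reproduced by enumeration. In the sketch below the six per-side arrangements are generated directly; the figures of 16 internal orientations and 24 configurations for the same-direction cases are taken from the text as given and are not derived. The estimate of the total place and route time uses the quoted lower bound of 4 hours per design.

```python
from itertools import permutations

# Distinct ways to order two Easterly and two Westerly channel groups on one side.
side_arrangements = sorted(set("".join(p) for p in permutations("EEWW")))

# Each side of the device may use a different arrangement.
channel_assignments = len(side_arrangements) ** 2

ORIENTATIONS_PER_ASSIGNMENT = 16  # quoted in the text
SAME_DIRECTION_CONFIGS = 24       # quoted in the text, for EEEE and for WWWW

mixed_configs = channel_assignments * ORIENTATIONS_PER_ASSIGNMENT
total_configs = mixed_configs + 2 * SAME_DIRECTION_CONFIGS

# Lower bound on the time needed to place and route every configuration.
HOURS_PER_DESIGN = 4
total_hours = total_configs * HOURS_PER_DESIGN

print(side_arrangements, channel_assignments, mixed_configs, total_configs, total_hours)
```

At more than 4 hours per design, the full set of 624 configurations would have required over 2,496 hours of place and route time, which is why the check was restricted to eight configurations.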
6.4.5 Verification Results
The result from running the hardware verification application for 120 hours is
that over 10,000 million messages of all types successfully reached their destination
with no reported message errors and no deadlock or network failures detected by the
Ping/Status checking. In addition to the checking of message delivery and freedom
from deadlock with the verification application, a logic analyser was used to monitor
the processor interface control signal activity for all nodes in the network. This
monitoring was used to ensure that the application produced the desired network
traffic consisting of simultaneous message transfers. These results provide very strong
evidence that the network design is functionally correct and that it runs without error
at the design speed under representative network traffic conditions.
The successful fitment of multiple configurations within a single device pin
assignment supports the proposition that a library-based building block approach to
network design can be realised using FPGA devices.
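The quoted totals imply an average aggregate message rate, which can be estimated as follows. Since the text states that over 10,000 million messages were delivered, the result is a lower bound for the network as a whole; it says nothing about the rate at any individual node.

```python
# Lower bound on the average aggregate message rate during hardware verification.
MESSAGES_DELIVERED = 10_000 * 10**6   # "over 10,000 million messages"
RUN_HOURS = 120

run_seconds = RUN_HOURS * 3600
average_rate = MESSAGES_DELIVERED / run_seconds   # messages per second

print(run_seconds, round(average_rate))
```

This corresponds to an average of at least 23,000 messages delivered per second over the whole run.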
Chapter 7 
Conclusions and Further Work
7. Conclusions and Further Work
The research described in this thesis has looked at two aspects of a proposal
made by the author to implement a message-passing multicomputer system.
• The use of general-purpose Digital Signal Processor (DSP) devices for
processing video-based data.
• The use of FPGAs to implement a message-passing routing device for
inter-processor communication.
A target video-based application has been implemented and a demonstration
message routing network has been developed as part of this research.
7.1 The ROLIN Project
It has been shown, through the implementation of a demonstration platform-edge
measurement system on the Structure Gauging Train, that it is feasible to use
general-purpose DSPs for processing the video-based sensor data from this system.
From the evidence obtained during experimentation with the system, which provided
information about the real performance of the processors, several DSP devices will be
required per camera to implement the complex algorithms for providing a system
immune from all spurious sources of data.
An important aspect of this part of the research has been the successful
demonstration of the application of linear transformations to the video images in
real-time. Using a number of calibrated tile regions in the image, the linear transformations
correct for arbitrary alignment of the camera position relative to the vehicle. The use
of arbitrarily aligned cameras leads to significantly reduced system down-time for
calibration purposes, a very important consideration for a revenue-earning facility like
the Structure Gauging Train. It also broadens the application of the vehicle to
European railway networks by allowing a wider region around the vehicle to be
measured using the existing cameras, adjusted to use their full field of view. Another
direct benefit from these results is that the measurement of platform edge position can
be made from other test vehicles. With the fitment of cameras and a simple fan-beam
light generator to illuminate the platform edge region, all vehicles in the test fleet will
be able to collect platform data, maximising the return from each run made by a
vehicle. Through the research undertaken by the author, the techniques for providing
the real-time processing of the video data from the Structure Gauging Train, which
are required for enhancing the measurement system, are now proven. The
demonstration platform edge measurement system has recently been enhanced with
new user interface software and improved platform edge detection algorithms, based
on the DET3A algorithm developed in this research. Using the calibrated tile region
approach, system calibration was achieved in one day, a significant reduction over the
two week period required for the main gauging cameras on the SGT. The system has
been used in extensive runs of the Structure Gauging Train to determine the platform
clearance for a new type of vehicle over a major part of the existing railway network.
Manual gauging of such a large part of the network could not have been achieved in
the time available and would have cost significantly more than the SGT approach.
This revenue earning opportunity was a direct result of the research carried out by the
author.
A proposal to use the platform edge measurement system in daylight has been
made recently. This will require further research to investigate and characterise
additional sources of light in the image, and to determine the algorithms required to
cope with them. The simple LEFTMOST algorithm and the shape-based DET3
platform edge detection algorithm provide a starting point for developing these
algorithms, and a large database of raw data now exists for this development as a
direct result of the experiments undertaken by the author.
7.2 FPGA-Based Message Router
The feasibility of implementing a message routing device in an FPGA has been
investigated in this research through the design and verification of a proof-of-concept
network routing device. An FPGA-based message routing device has been designed
using VHDL, a high level hardware description language, and functionally simulated
at both component and system level. An eight-node PC-based verification network
was designed and fabricated based on the FPGA message router and a Parallel C
application was developed to produce messages to random destinations, at random
rates, and with random content. The verification network has been shown to operate
correctly at the design speed with a large number of messages using a number of
processors over an extended period of operation.
The primary motivation for the use of reprogrammable logic to implement a
message routing device is the potential for providing a building block approach to
network design. The same underlying network hardware structure can be configured
to provide different routing resource allocations and routing mechanisms dependent
on each application. Experiments undertaken by the author with a small number of
different network routing node configurations indicate that the building block
approach to the design of the network node is a viable proposition.
The success of the FPGA-based routing node design provides strong evidence
that such devices can indeed be utilised in this way to provide a configurable network
for embedded systems. Further research is justified on this basis, and is necessary to
explore those issues which could not be covered in the time available. It was not
possible, for example, to experiment with different base-topologies for the network;
the linear-array topology of the verification network, although suitable for some signal
processing applications and the processor farm paradigm, is not suitable for all
applications. Further research is also needed to provide other building blocks in the
routing function library, including functions for multi-dimensional routing and
unrestricted multicast, and functions to explore the proposed implementation of a
message router for processor farming.
7.3 Summary
The two key issues raised by a proposal to implement a message-passing
multicomputer system for a number of real-time applications have been addressed and
answered by this research; general purpose processors provide sufficient performance
for the target applications and FPGAs can be used to provide a message-passing
network for the routing of messages between processors. The research has also
proved that two new concepts proposed by the author, a library-based approach to
network node design and the use of linear corrections on camera images for
correcting perspective distortions, are feasible. A number of interesting prospects for
future research have been proposed, including the implementation of a routing node
design tailored to processor farming.
In summary, the research described in this thesis provides a solid foundation for
the development of heterogeneous parallel processing systems for video-based
measurement systems and other embedded applications, using general-purpose
processors and reconfigurable FPGA-based message routing networks.
References
8. References
3L95
Acock96
Actel92
Adams82
Alfke94
Alfke94a
Altera96
Amerson95
Anderson75 
Arruabarrena90 
A rvind91
3L Ltd. “Parallel C User Guide : Texas Instruments
TM S320C 40” . 3L Ltd. 1995
Acock S J B, Dimond K R. “H ardw are Implementations o f  
Algorithms on N etw orks o f FPGA Processors” . IEE Colloquium 
on Digital System Design Using Synthesis Techniques. Digest No 
96/029. p p 3 .1-3.6. February 1996
Actel Corporation. “ ACT™ Family Field Programmable Gate 
Array D A TABOO K” . 5172020-2. 1992
Adams G B, Siegel H  J. “The Extra Stage Cube: A Fault-tolerant 
Interconnection N etw ork for Supersystems” . IEEE Transactions 
on Computers. Vol. C-31, No 5. pp443-454. M ay 1982
Alfke P, N ew  B. “Additional XC3000 D ata” . Xilinx Data Book
1994 (Third Edition). Application note XAPP 024.000. Xilinx 
Inc. pp 8 .11-8.20. 1994
Alfke P, New B. “Implementing State machines in LCA devices” . 
Xilinx D ata B ook 1994 (Third Edition). Application Note XAPP 
027.001. p p 8 .169-8.172. 1994
Altera Corporation. “Altera D ataBook” . A -D B -0696-01. 1996
Amerson R, et al. “Teram ac - Configurable Custom  Computing” . 
IEEE Symposium on FPGAs for Custom Computing Machines 
1995. IEEE CS Press. ISBN 0-8186-7086-X . pp32-38. April
1995
Anderson G A, Jensen E D. “C om puter Interconnection 
Structures: Taxonomy, Characteristics, and Examples". ACM  
Computing Surveys. Vol. 7. No 4. pp 197-213. December 1975.
Arruabarrena A, et al. “An Optimal Topology for M ulticomputer 
Systems based on a M esh o f Transputers” . Occam User Group 
Newsletter. No. 12. pp31-39, January 1990
Arvind, et al. “Evolution o f Data-Flow Com puters” in Advanced 
Topics in Data-Flow Computing. Gaudiot J-L, Bic L (Eds). 
Prentice Hall. ISBN 0-13-006503-X. pp 3-33. 1991
178
AT&T95
Athanas93
Athas88
Atmel94 
Batcher 80
Batty
Bell92
Bhuyan84
Boppana95
Borkar88
Bouknight72
AT&T89
Brigham74
AT&T. “W E DSP32C Digital Signal Processor Information 
M anual” . AT&T M N89-010DM OS. 1989
AT&T M icroelectronics. “AT&T Field Programmable Gate 
Arrays Data B ook” . (Application N otes : ‘Designing High speed 
counters using the LFSR technique’). M N95-001FPG A. 1995
Athanas P M, Silverman H  F. “Processor Reconfiguration 
Through Instruction-Set M etam orphosis” . IEEE Computer. Vol. 
26, No 3. pp l 1-18. M arch 1993
Athas W C, Seitz C L. “M ulticom puters: M essage-Passing 
Concurrent Com puters” . IEEE Computer Vol. 21, No 8. pp9-24. 
August 1988.
Atmel Corporation. “ATM EL Configurable Logic, Design and 
Application book” . 0266A-9/93/15M . 1993/94
Batcher K E. “Design o f  a Massively Parallel Processor” . IEEE 
Transactions on Computers. Vol. C29, No 9. pp836-840. 
Septem ber 1980.
P M Batty, R G Newell. “SmallWorld GIS: GIS Databases are 
different” . Smallworld Technical Paper N o 8
Bell G. “U ltracom puters : A Teraflop Before its Time” . 
Communications of the ACM. Vol. 35, No 8. pp27-47. 1992
Bhuyan L N, Agrawal D P. “Generalized Hypercube and 
Hyperbus Structures for a Com puter N etw ork” . IEEE 
Transactions on Computers. Vol. C33, No 4. pp323-333. April 
1984
B oppana R V, Chalasani S. “Fault-Tolerant W ormhole Routing 
Algorithms for M esh N etw orks” . IEEE Transactions on 
Computers. Vol. 44, No 7. pp848-864. July 1995.
Borkar S, et al. “iWarp: An Integrated Solution to high-speed 
Parallel Com puting” . Proceedings of Supercomputing '88. 
pp330-338. N ovem ber 1988
Bouknight W J, et al. “The Illiac IV System” . Proceedings of the 
IEEE. Vol. 60, No 4. pp369-388. April 1972
Brigham E 0 . “The fast Fourier transform ” . Prentice Hall. ISBN 
0133075052. 1974
179
Brunvand94 Brunvand E. “Windchime: an FPGA-based Self-timed Parallel
Processor”. More FPGAs. 1993 International Workshop on Field
Programmable Logic and Applications. W R Moore and W Luk
(eds). Abingdon EE&CS books. ISBN 0-9518453-1-4. pp365-376. 1994
Chaiken90 Chaiken D, et al. “Directory-Based Cache Coherence in
Large-Scale Multiprocessors”. IEEE Computer. Vol. 23, No 6.
pp49-58. 1990
Chapman95 Chapman K. “On-Chip Memory - The Key to Multiplication in an
FPGA”. 5th Annual Advanced PLD and FPGA Conference.
pp149-155. 1995
Cheong90 Cheong H, Veidenbaum A V. “Compiler-Directed Cache
Management in Multiprocessors”. IEEE Computer. Vol. 23, No 6.
pp39-47. 1990
Chin84 Chin C-Y, Hwang K. “Packet Switching Networks for
Multiprocessors and Data Flow Computers”. IEEE Transactions
on Computers. Vol. C33, No 11. pp991-1003. November 1984
Chiodo94 Chiodo M, et al. “Hardware-Software Co-Design of Embedded
Systems”. IEEE Micro. Vol. 14, No 4. pp10-16. August 1994
Coffman71 Coffman E G, et al. “System Deadlocks”. ACM Computing
Surveys. Vol. 3, No 2. pp67-78. June 1971
Conner96 Conner D. “Reconfigurable Logic: Hardware Speed with
Software Flexibility”. EDN Europe, pp15-23. July 1996.
Cox92 Cox C E, Blanz W E. “GANGLION - A Fast Field-Programmable
Gate Array Implementation of a Connectionist
Classifier”. IEEE Journal of Solid-State Circuits. Vol 27, No 3.
pp288-299. March 1992
Crocker93 Crocker R L. “Technical Audit of the British Rail Structure
Gauging Unit”. High Profile Ultrasonics Ltd. April 1993
Cypress95 Cypress Semiconductor Corporation. “Programmable Logic Data
Book”. 1-894PLDB 40000. 1994/95
Dale96 G Dale. “Final Report for the Vehicle Location Study in the
ROLIN Project”. SMIS Ltd. 1996
Dally86 Dally W J, Seitz C L. “The Torus routing chip”. Distributed
Computing. Vol. 1. pp187-196. 1986
Dally87 Dally W J, Seitz C L. “Deadlock-Free Message Routing in
Multiprocessor Interconnection Networks”. IEEE Transactions
on Computers. Vol. C-36, No 5. pp547-553. May 1987.
Dally92 Dally W J, et al. “The Message Driven Processor: A
multicomputer processing node with efficient mechanisms”. IEEE
Micro. pp23-38. April 1992
Dally92a Dally W J. “Virtual-Channel Flow Control”. IEEE Transactions
on Parallel and Distributed Systems. Vol. 3, No 2. pp194-205.
March 1992
Dally93 Dally W J, Aoki H. “Deadlock-free Adaptive Routing in
Multicomputer Networks using Virtual Channels”. IEEE
Transactions on Parallel and Distributed Systems. Vol. 4, No 4.
pp466-475. April 1993
DEC92 Digital Equipment Corporation. “The Alpha AXP Architecture
Handbook”. Digital Equipment Corporation. 1992.
Dennis80 Dennis J B. “Data Flow Supercomputers”. IEEE Computer. Vol.
13, No 11. pp48-56. November 1980
Dennis94 Dennis J B, Gao G R. “Multithreaded Architectures: Principles,
Projects and Issues” in Multithreaded Computer Architecture: A
Summary of the State of the Art. Iannucci R A (Ed). Kluwer.
ISBN 0-7923-9477-1. pp1-72. 1994.
Duato93 Duato J. “A New Theory of Deadlock-Free Adaptive Routing in
Wormhole Networks”. IEEE Transactions on Parallel and
Distributed Systems. Vol. 4, No 12. pp1320-1331. December 1993
Dubois88 Dubois M, et al. “Synchronization, Coherence, and Event
Ordering in Multiprocessors”. IEEE Computer. Vol. 21, No 2.
pp9-21. February 1988.
Duncan90 Duncan R. “A Survey of Parallel Computer Architectures”. IEEE
Computer. Vol. 23, No 2. pp5-16. February 1990.
Ebcioglu89 Ebcioglu K. “A Wide Instruction Word Architecture for
Fine-grain Parallelism”. CONPAR 88. University Press. Cambridge.
ISBN 0 521 37177 5. pp424-437. 1989
Edworthy86 Edworthy M. “Television measurement for railway structure
gauging”. Automatic Optical Inspection. SPIE Vol. 654. pp35-42. 1986
Ellis91 Ellis J W, et al. “Enhanced Communications for Transputer
Arrays”. Applications of Transputers 3. Proceedings of the Third
International Conference on Applications of Transputers. Vol.
II. IOS Press. ISBN 90 5199 064 2. pp475-480. 1991
Eshaghian96 Eshaghian M M. “Heterogeneous Computing”. Artech House
Publishers. ISBN 0-89006-552-7. 1996.
Fawcett95 Fawcett B K. “Field Programmable Gate Arrays and
Reconfigurable Computing”. Proceedings of Field
Programmable Gate Arrays (FPGAs) for Fast Board
Development and Reconfigurable Computing. J Schewel (Ed).
SPIE Vol 2607. ISBN 0-8194-1971-0. pp155-166. October 1995
Feng81 Feng T-Y. “A survey of Interconnection Networks”. IEEE
Computer. Vol. 14, No 12. pp12-27. 1981
Flynn72 Flynn M J. “Some computer organisations and their
effectiveness”. IEEE Transactions on Computers. Vol C21, No 9.
pp948-960. 1972
Foulk94 Foulk P. “Reconfigurable Computing with SRAM Programmable
Gate Arrays”. More FPGAs. 1993 International Workshop on
Field Programmable Logic and Applications. W R Moore and W
Luk (eds). Abingdon EE&CS books. ISBN 0-9518453-1-4.
pp70-81. 1994
Frahm86 Frahm J, Haase A, Matthaei D. “Rapid NMR imaging of dynamic
processes using the FLASH technique”. Magnetic Resonance in
Medicine. Vol. 3. pp321-327. 1986
Franke91 Franke D W, Purvis M K. “Hardware/Software CoDesign: A
Perspective”. Proceedings of the 13th International Conference
on Software Engineering. IEEE CS Press. Order No. 2140-02.
pp344-352. 1991
Freund93 Freund R F, Siegel H J. “Heterogeneous Processing”. IEEE
Computer. Vol. 26, No 6. pp13-17. June 1993
Frost94 Frost G M. “Image Reconstruction on a network of processors”.
University Of Surrey Transfer Report. Appendix C. Sept 1994.
Gajski82 Gajski D D, et al. “A Second Opinion on Data Flow Machines
and Languages”. IEEE Computer. Vol. 15, No 2. pp58-69.
February 1982
Gaughan95 Gaughan P T, Yalamanchili S. “A Family of Fault-Tolerant
Routing Protocols for Direct Multiprocessor Networks”. IEEE
Transactions on Parallel and Distributed Systems. Vol. 6, No 5.
pp482-497. May 1995
Gerla80 Gerla M, Kleinrock L. “Flow Control: A Comparative Survey”.
IEEE Transactions on Communications. Vol. COM-28, No 4.
pp553-574. April 1980
Giessler81 Giessler A, et al. “Flow Control Based on Buffer Classes”. IEEE
Transactions on Communications. Vol. COM-29, No 4. pp436-443.
April 1981
Glass92 Glass C, Ni L M. “The Turn Model for Adaptive Routing”.
Proceedings of the International Symposium on Computer
Architectures. pp278-287. 1992
Gokhale91 Gokhale M, et al. “Building and Using a Highly Parallel
Programmable Logic Array”. IEEE Computer. Vol. 24, No 1.
pp81-89. January 1991
Gopal85 Gopal I S. “Prevention of Store-and-Forward Deadlock in
Computer Networks”. IEEE Transactions on Communications.
Vol. COM-33, No 12. pp1258-1264. December 1985
Gottlieb83 Gottlieb A, et al. “The NYU Ultracomputer - Designing an
MIMD Shared Memory Parallel Computer”. IEEE Transactions
on Computers. Vol. C32, No 2. pp175-189. Feb. 1983
Gottlieb92 Gottlieb A. “Architectures for Parallel Supercomputing”. Parallel
Computing and Transputer Applications. M Valero, et al (Eds).
IOS Press. pp39-47. 1992
Graham96 Graham P, Nelson B. “Genetic Algorithms in Software and In
Hardware - A Performance Analysis of Workstation and Custom
Computing Machine Implementations”. IEEE Symposium on
FPGAs for Custom Computing Machines 1996. IEEE CS Press.
ISBN 0-8186-7548-9. pp216-225. 1996
Gunther81 Gunther K D. “Prevention of Deadlocks in Packet-Switched Data
Transport Mechanisms”. IEEE Transactions on Communications.
Vol. COM-29, No 4. pp512-524. April 1981.
Gupta93 Gupta R K, De Micheli G. “Hardware-Software Cosynthesis for
Digital Systems”. IEEE Design and Test of Computers. pp29-41.
September 1993
Gurd85 Gurd J R, et al. “The Manchester Prototype Dataflow Computer”.
Communications of the ACM. Vol. 28, No 1. pp34-52. January
1985
Gurd86 Gurd J R, et al. “Fine-Grain Parallel Computing: The Dataflow
Approach”. Future Parallel Computers - Lecture notes in
Computer Science. Springer Verlag. pp82-152. 1986
Haase86 Haase A, et al. “FLASH Imaging. Rapid NMR Imaging Using
Low Flip-Angle Pulses”. Journal of Magnetic Resonance. Vol.
67. pp258-266. 1986
Habermann69 Habermann A N. “Prevention of System Deadlocks”.
Communications of the ACM. Vol. 12, No 7. pp373-377. July
1969
Hafer82 Hafer L J, Parker A C. “Automated Synthesis of Digital
Hardware”. IEEE Transactions on Computers. Vol. C31, No 2.
pp93-109. February 1982
Hagersten92 Hagersten E, et al. “DDM - A Cache-Only Memory
Architecture”. IEEE Computer. Vol. 25, No 9. pp44-54.
September 1992
Harp89 Harp G. “Future Developments” in Transputer Applications. G
Harp (Ed). Pitman. ISBN 0-273-02852-9. pp258-263. 1989
Hartley93 Hartley D A, Harvey D M. “Enhanced inter-processor
communication strategies for parallel TMS320C40 digital signal
processor systems”. Advanced Signal Processing, Algorithms,
Architectures and Implementations IV. SPIE Vol. 2027. pp495-507.
1993
Harvey96 Harvey P R, Mansfield P. “Echo-Volumar Imaging (EVI) at 0.5T:
First Whole-body volunteer studies”. Magnetic Resonance in
Medicine. 35:80-88. 1996
Hayashi95 Hayashi K, et al. “Reconfigurable Real-Time Signal Transport
System using Custom FPGAs”. IEEE Symposium on FPGAs for
Custom Computing Machines 1995. IEEE CS Press. ISBN
0-8186-7086-X. pp68-75. April 1995
Hayes86 Hayes J P, et al. “A Microprocessor-based Hypercube
Supercomputer”. IEEE Micro. Vol. 6, No 5. pp6-17. October
1986
Heeb92 Heeb B, Pfister C. “Chameleon: A Workstation of a Different
Colour”. Field-Programmable Gate Arrays: Architectures and
Tools for Rapid Prototyping. Second International Workshop on
Field-Programmable Logic and Applications. Springer Verlag.
ISBN 3-540-57091-8. pp152-161. Sept. 1992
Hey88 Hey A J G. “Reconfigurable Transputer networks: practical
concurrent computation”. Scientific Applications of
Multiprocessors. Edited by R Elliott and C A R Hoare. Prentice
Hall. ISBN 0-13-795774-2. pp39-54. 1988
Hey92 Hey A J G. “Applications on Parallel Universal Message-Passing
Machines” in Transputer Applications - Progress and Prospects:
Proceedings of the Closing Symposium of the SERC DTI
Initiative in the Engineering Applications of Transputers. M Jane
et al (Eds). IOS Press. ISBN 90-5199-079-0. pp54-76. 1992
Hoare78 C A R Hoare. “Communicating Sequential Processes”.
Communications of the ACM. Vol 21, No 8. pp666-677. Aug
1978.
Hockney88 Hockney R W, Jesshope C R. “Parallel Computers 2”. Adam
Hilger. ISBN 0-85274-812-4. pp60-81. 1988
Hockney88a Hockney R W, Jesshope C R. “Parallel Computers 2”. Adam
Hilger. ISBN 0-85274-812-4. pp206-244. 1988
Homewood87 Homewood M, et al. “The IMS T800 Transputer”. IEEE Micro.
pp10-26. October 1987.
Hossack94 Hossack C J, Guy C G. “Fully Interconnected Fault-Tolerant
Networks Using Global Link Adaptors”. Transputer Applications
and Systems '94. A. De Gloria et al (eds). IOS Press. pp489-496.
1994
IEE94 IEE. “The T9000 Transputer”. IEE Computing and Control
Division Colloquium. Digest No: 1994/208. 1994
Inmos89 Inmos. “The Transputer Databook”. Inmos Document Number 72
TRN 203 00. pp389-412. 1989
Inmos89a Inmos. “The Transputer Applications Notebook: Systems and
Performance”. Inmos. Document Number 72-TRN-205-00. pp66-91.
1989.
Inmos93 Inmos. “The T9000 Transputer Products Overview Manual”.
Inmos. Document Number 72-TRN-228-00. pp139-162. 1993.
Intel89 Intel Corporation. “i860 64-Bit Microprocessor”. Intel
Corporation. 240296-003. 1989
Intel92 Intel Corporation. “Programmable Logic”. Intel pub.
296083-008. pp2.3-2.26. 1992
Irwin92 Irwin G W, Fleming P J. “Real-time Control Applications of
Transputers”. Transputer Applications - Progress and Prospects.
Proceedings of the Closing Symposium of the SERC DTI
Initiative in the Engineering Applications of Transputers. M Jane
et al (Eds). IOS Press. ISBN 9051990790. pp26-41. 1992.
Isloor80 Isloor S S, Marsland T A. “The Deadlock Problem: An
Overview”. IEEE Computer. Vol. 13, No 9. pp58-78. September
1980
Jesshope93 Jesshope C R, Izu C. “The MP1 Network Chip and its
Application to Parallel Computers”. The Computer Journal. Vol.
36, No 8. pp763-777. 1993
Kermani79 Kermani P, Kleinrock L. “Virtual Cut-through: A New Computer
Communication Switching Technique”. Computer Networks 3.
North Holland Publishing Company. pp267-286. 1979
Kermani80 Kermani P, Kleinrock L. “A Tradeoff Study of Switching Systems
in Computer Communication Networks”. IEEE Transactions on
Computers. Vol. C29, No 12. pp1052-1060. December 1980
Khokhar93 Khokhar A A, et al. “Heterogeneous Computing: Challenges and
Opportunities”. IEEE Computer. Vol. 26, No 6. pp18-27. June
1993.
Kim94 Kim J H, et al. “Compressionless Routing: A Framework for
Adaptive and Fault Tolerant Routing”. Proceedings of the 21st
Annual International Symposium on Computer Architecture.
pp289-300. April 1994
Klien94 Klien B. “Use LFSR to build fast FPGA-based counters”.
Electronic Design. 42, (6), pp87-100. March 1994
Konicek91 Konicek J, et al. “The Organization of the Cedar System”.
International Conference on Parallel Processing, Vol. 1:
Architecture. CRC Press. pp49-56. 1991
Koninger92 Koninger R K. “MPP Directions at Cray Research, Inc.”. Parallel
Computing and Transputer Applications. M Valero, et al (Eds).
IOS Press. pp80-90. 1992
Kuck82 Kuck D J, Stokes R A. “The Burroughs Scientific Processor
(BSP)”. IEEE Transactions on Computers. Vol C31, No 5.
pp363-376. May 1982.
Kumar75 Kumar A, Welti D, Ernst R R. “NMR Fourier Zeugmatography”.
Journal of Magnetic Resonance. Vol. 18. pp69-83. 1975
Kumar93 Kumar S, Aylor J H, Johnson B W, Wulf W A. “A Framework for
Hardware/Software Codesign”. IEEE Computer. Vol 26, No 12.
pp81-89. December 1993.
Kung82 Kung H T. “Why Systolic Architectures?”. IEEE Computer. Vol.
15, No 1. pp37-46. January 1982.
Langhammer92 Langhammer F. “Second Generation and Teraflops Parallel
Computers”. Parallel Computing and Transputer Applications.
M Valero, et al (Eds). IOS Press. pp62-79. 1992
Lattice87 Lattice Semiconductor. “Data Catalog High Speed E2CMOS
PLDs”. Fall 1987
Lenoski92 Lenoski D, et al. “The Stanford Dash Multiprocessor”. IEEE
Computer. Vol. 25, No 3. pp63-79. March 1992
Lilja88 Lilja D J. “Reducing the Branch Penalty in Pipelined Processors”.
IEEE Computer. Vol. 21, No 7. pp47-55. 1988
Lin94 Lin X, et al. “Deadlock-Free Multicast Wormhole Routing in 2-D
Mesh Multicomputers”. IEEE Transactions on Parallel and
Distributed Systems. Vol. 5, No 8. pp793-804. August 1994
Linde92 Linde A, et al. “Using FPGAs to Implement a Reconfigurable
Highly Parallel Computer”. Field-Programmable Gate Arrays:
Architectures and Tools for Rapid Prototyping. Second
International Workshop on Field-Programmable Logic and
Applications. Springer Verlag. ISBN 3-540-57091-8. pp199-210.
Sept. 1992
Linder91 Linder D H, Harden J C. “An Adaptive and Fault Tolerant
Wormhole Routing Strategy for k-ary n-cubes”. IEEE
Transactions on Computers. Vol. 40, No 1. pp2-12. January 1991
Locher83 Locher P R. “Proton NMR tomography”. Philips Technical
Review. Vol. 41, No 3. pp73-88. 1983
Lomax91 Lomax A J, Ross P G B, Undrill P E. “Parallel Processing
Techniques for the Interactive Display of Volumetric Data Sets”.
Applications of Transputers 3. Proceedings of the Third
International Conference on Applications of Transputers. IOS
Press. ISBN 90 5199 064 2. pp136-141. 1991
Lyuu92 Lyuu Y-D. “Information Dispersal and Parallel Computation”.
Cambridge International Series on Parallel Computation: 3.
Cambridge University Press. Chapter 3: Interconnection
Networks. 1992.
Malumbres96 Malumbres M P, Duato J, Torrellas J. “An Efficient
Implementation of Tree-Based Multicast Routing for Distributed
Shared-Memory Multiprocessors”. Proceedings of the 8th IEEE
Symposium on Parallel and Distributed Processing 1996.
pp186-189. 1996.
Mansfield77 Mansfield P. “Multi-Planar image formation using NMR spin
echoes”. Physics Vol C10 L55. 1977
Mansfield86 Mansfield P, Chapman B. “Active Magnetic Screening of
Gradient Coils in NMR imaging”. Journal of Magnetic
Resonance. Vol 66. pp573-576. 1986
Mansfield87 Mansfield P, Chapman B. “Multishield Active Magnetic
Screening of Coil Structures in NMR”. Journal of Magnetic
Resonance. Vol 72. pp211-223. 1987
Mansfield88 Mansfield P. “Imaging by nuclear magnetic resonance”. Journal
of Physics E: Scientific Instrumentation. Vol 21. pp18-30. 1988
Mansfield95 Mansfield P, Coxon R, Hykin J. “EVI of the brain at 3.0T: First
normal volunteer and functional imaging results”. Journal of
Computer Assisted Tomography. Vol 19, pp846-852. 1995
May87 M D May, R Shepherd. “Communicating Process Computers”.
Inmos technical note 22. Inmos Ltd. 1987
McCabe87 McCabe A P H. “Systolic Arrays and Special Purpose Silicon”.
Major Advances in Parallel Processing. The Technical Press -
UNICOM series. pp13-32. 1987
McHenry95 McHenry J T, Donaldson R L. “The WILDFIRE Custom
Configurable Computer”. Proceedings of Field Programmable
Gate Arrays (FPGAs) for Fast Board Development and
Reconfigurable Computing. J Schewel (Ed). SPIE Vol 2607.
ISBN 0-8194-1971-0. pp189-200. October 1995
McKinley94 McKinley P K, et al. “Unicast-Based Multicast Communication in
Wormhole-Routed Networks”. IEEE Transactions on Parallel
and Distributed Systems. Vol. 5, No 12. pp1252-1265. December
1994.
McKinley95 McKinley P K, et al. “Collective Communication in
Wormhole-routed Massively Parallel Computers”. IEEE Computer.
Vol. 28, No 12. pp39-50. December 1995
Meiko93 Meiko. “Computing Surface CS-2 Product Description”. Meiko
Scientific Ltd. Pub. No. X 0126-00B 100-01. 1993
Merlin80a Merlin P M, Schweitzer P J. “Deadlock Avoidance in
Store-and-Forward Networks - I: Store-and-Forward Deadlock”. IEEE
Transactions on Communications. Vol. C-28, No 3. pp345-354.
March 1980
Merlin80b Merlin P M, Schweitzer P J. “Deadlock Avoidance in
Store-and-Forward Networks - II: Other Deadlock types”. IEEE
Transactions on Communications. Vol. C-28, No 3. pp355-360.
March 1980
Miller Miller P R, Huang C. “The Mad-Postman Network Chip
Interface”. Internal Report. Dept. Electronic and Electrical
Engineering. University Of Surrey.
Miller91 Miller P R, et al. “The Mad-Postman Network Chip”.
Proceedings of Transputing '91. Vol. 2. IOS Press. pp517-536.
1991.
MMI82 Monolithic Memories Incorporated. “Bipolar LSI Databook”.
1982
Motorola90 Motorola Inc. “DSP96002 Technical Data”. Motorola Inc.
DSP96002/D. 1990
Mutagi96 Mutagi R N. “Pseudo noise sequences for engineers”. IEE
Electronics & Communication Engineering Journal. pp79-87.
April 1996
Nakamura87 Nakamura Y. “An Integrated Logic Design Environment Based
on Behavioural Description”. IEEE Transactions on Computer
Aided Design. CAD-6, No 3. pp322-336. May 1987
NEC91 NEC. “RISC Microprocessors Vr4000/Vr3600” Users Manual
Hardware. UM-VR-HARD091V10. 1991
Ni93 Ni L M, McKinley K. “A Survey of Wormhole Routing
Techniques in Direct Networks”. IEEE Computer. Vol. 26, No 2.
pp62-76. February 1993
Ni95 Ni L M. “Should Scalable Parallel Computers Support Efficient
Hardware Multicast?”. Proceedings of the 1995 ICPP Workshop
on Challenges for Parallel Processing. Agrawal D P (Ed). CRC
Press. ISBN 0-8493-2618-4. pp2-7. 1995
Nicolau84 Nicolau A, Fisher J A. “Measuring the Parallelism Available for
Very Long Instruction Word Architectures”. IEEE Transactions
on Computers. Vol. C33, No 11. pp968-976. 1984
Nicole88 Nicole D A, et al. “Switching Networks for Transputer Links” in
Developments Using Occam: Proceedings of the 8th Technical
Meeting of the Occam User Group. IOS Press. ISBN
90-5199-002-4. pp147-165. April 1988
Padmanabhan83 Padmanabhan K, Lawrie D H. “A Class of Redundant Path
Multistage Interconnection Networks”. IEEE Transactions on
Computers. Vol. C-32, No 12. pp1099-1108. December 1983
Padmanabhan90 Padmanabhan K. “Cube Structures for Multiprocessors”.
Communications of the ACM. Vol. 33, No 1. pp43-52. January
1990.
Page94 Page I. “Parameterised Processor Generation”. More FPGAs.
1993 International Workshop on Field Programmable Logic and
Applications. W R Moore and W Luk (eds). Abingdon EE&CS
books. ISBN 0-9518453-1-4. pp225-237. 1994
Page96 Page I. “Constructing Hardware-Software Systems from a single
Description”. Journal of VLSI Signal Processing. Vol. 12, No 1.
pp87-107. January 1996
Papadopoulos91 Papadopoulos G M. “Implementation of a General-Purpose
Dataflow Multiprocessor”. Research Monographs in Parallel
and Distributed Computing. Pitman. ISBN 0-273-08835-1. 1991.
Payne95 Payne R. “Self-Timed FPGA Systems”. FPL '95.
Field-Programmable Logic and Applications. Proceedings of the 5th
International Workshop. pp21-35. August 1995
Pease77 Pease M C. “The Indirect Binary n-Cube Microprocessor Array”.
IEEE Transactions on Computers. Vol. 26, No 5. pp458-473.
May 1977
Preparata81 Preparata F P, Vuillemin J. “The Cube-Connected Cycles: A
Versatile Network for Parallel Computation”. Communications of
the ACM. Vol. 24, No 5. pp300-309. May 1981
Pritchard87 Pritchard D J, et al. “Practical Parallelism Using Transputer
Arrays”. PARLE 1987. Vol. 1. pp278-294. 1987
Pritchard93 Pritchard D J, Nicole D A. “Cube Connected Mobius Ladders:
An Inherently Deadlock-Free Fixed Degree Network”. IEEE
Transactions on Parallel and Distributed Systems. Vol. 4, No 1.
pp111-117. January 1993
Quest96 Quest, University of Newcastle upon Tyne. “Report to Surrey
Medical Imaging Systems: Initial Kalman Filter Software for the
ROLIN project”. SMIS Ltd. Mar 1996
Raubold76 Raubold E, Haenle J. “A Method of Deadlock-Free Resource
Allocation and Flow Control in Packet Networks”. Proceedings
of the ICCC. pp483-487. 1976
Rector80 Rector R, Alexy G. “The 8086 Book”. Osborne/McGraw Hill.
ISBN 0-931988-29-2. 1980
Reed87 Reed D A, Fujimoto R M. “Multicomputer Networks:
Message-based Parallel Processing”. MIT Press. Chapter 1. 1987
Robinson95 Robinson D F, et al. “Optimal Multicast Communication in
Wormhole-Routed Torus Networks”. IEEE Transactions on
Parallel and Distributed Systems. Vol. 6, No 10. pp1029-1042.
October 1995.
Roman84 Roman G-C, et al. “A Total System Design Framework”. IEEE
Computer. Vol. 15, No 5. pp15-26. May 1984
Roscoe87 Roscoe A W. “Routing Messages through Networks: An Exercise
in Deadlock Avoidance” in Parallel Programming of Transputer
Based Machines (Proceedings of the 7th technical meeting of the
Occam User Group). IOS Press. ISBN 90-5199-007-3. 1988
Rose93 Rose J, et al. “Architecture of Field-Programmable Gate Arrays”.
Proceedings of the IEEE. Vol 81, No 7. pp1013-1028. July 1993
Rosenburg95 Rosenburg J. “DSP Acceleration using Cache Logic FPGAs”.
Proceedings of Field Programmable Gate Arrays (FPGAs) for
Fast Board Development and Reconfigurable Computing. J
Schewel (Ed). SPIE Vol 2607. ISBN 0-8194-1971-0. pp54-59.
October 1995
Russell78 Russell R M. “The CRAY-1 Computer System”.
Communications of the ACM. Vol. 21, No 1. pp63-72. 1978
Saavedra-Barrera90 Saavedra-Barrera R H, et al. “Analysis of Multithreaded
Architectures for Parallel Computing”. Proceedings of the ACM
Symposium on Parallel Algorithms and Architecture. pp169-178.
July 1990.
Salapura94 Salapura V, et al. “A Fast FPGA Implementation of a General
Purpose Neuron”. FPL '94. Field Programmable Logic:
Architectures, Synthesis and Applications. Proceedings of the 4th
International Workshop on Field-Programmable Logic and
Applications. pp175-182. September 1994
Sava96 Sava H, Fleury M, Downton A C, Clark A F. “A Case Study in
Pipeline Processor Farming: Parallelising the H.263 Encoder”.
UK Parallel '96. Proceedings of the BCS PPSG Annual
Conference. Springer-Verlag. ISBN 3-540-76068-7. pp196-205.
1996
Schewel95 Schewel J, et al. “Transformable Computers and Hardware Object
Technology”. Proceedings 9th International Parallel Processing
Symposium. IEEE CS Press. ISBN 0-8186-7074-6. pp518-522.
April 1995
Scott95 Scott S D, et al. “HGA: A Hardware-Based Genetic Algorithm”.
FPGA '95. 1995 ACM Third International Symposium on
Field-Programmable Gate Arrays. pp53-59. February 1995
Seaman94 Seaman G. “Dynamically Reprogrammable FPGAs and Parallel
Computing”. Parallel Update. BCS PPSG Newsletter no. 18.
November 1994.
Seitz85 Seitz C L. “The Cosmic Cube”. Communications of the ACM.
Vol. 28, No 1. pp22-33. January 1985.
Sejnowski80 Sejnowski M C, et al. “An Overview of the Texas Reconfigurable
Array Computer”. AFIPS National Computer Conference. ACM.
pp631-641. 1980.
Shepherd92 Shepherd R. “The T9000”. Transputer Applications. M. Jane et
al. editors. IOS Press. pp77-81. 1992.
Siegel79 Siegel H J. “Interconnection Networks for SIMD machines”.
IEEE Computer. Vol. 12, No 6. pp57-65. 1979
Siegel81 Siegel H J, et al. “PASM: A Partitionable SIMD/MIMD System
for Image Processing and Pattern Recognition”. IEEE
Transactions on Computers. Vol C-30, No 12. pp934-947.
December 1981
Singhal89 Singhal M. “Deadlock Detection in Distributed Systems”. IEEE
Computer. Vol. 22, No 11. pp37-48. November 1989
Sitkoff95 Sitkoff N, et al. “Implementing a Genetic Algorithm on a Parallel
Custom Computing Machine”. IEEE Symposium on FPGAs for
Custom Computing Machines 1995. IEEE CS Press. ISBN
0-8186-7086-X. pp180-187. April 1995
Snyder82 Snyder L. “Introduction to the Configurable Highly Parallel
Computer”. IEEE Computer. Vol. 15, No 1. pp47-56. January
1982.
Solari91 Edward Solari. “AT Bus Design. IEEE 996 Compatible”.
Annabooks. 1991. ISBN 0-929392-08-6
Stenstrom90 Stenstrom P. “A survey of Cache Coherence Schemes for
Multiprocessors”. IEEE Computer. Vol. 23, No 6. pp12-24. 1990
Stone71 Stone H S. “Parallel Processing with the Perfect Shuffle”. IEEE
Transactions on Computers. Vol. C20, No 2. pp153-161.
February 1971.
Sullivan77 Sullivan H, Bashkow T R. “A Large-Scale, Homogeneous, Fully
Distributed, Parallel Machine”. Proceedings of 4th Symposium on
Computer Architectures. Vol. 5. pp105-124. March 1977
Tanenbaum81 Tanenbaum A S. “Computer Networks”. Prentice Hall. 1981.
TI97 Texas Instruments. “TMS320C6201 Digital Signal Processor”.
Product Bulletin SPRT142. 1997
TIUG91 Texas Instruments. “TMS320C4x Users Guide”. Chapter 8
Communication Ports. Texas Instruments SPRU063A
Toeug79 Toeug S, Ullman J D. “Deadlock-Free Packet Switching
Networks”. Proceedings of the ACM Symposium on the Theory
of Computers. pp89-98. 1979.
Tregidgo92 Tregidgo R W S, et al. “Scalable Parallel Systems Design:
Embedded Applications and General-Purpose Systems”. Parallel
Computing and Transputer Applications. IOS Press, pages
unnumbered. 1992
Treleaven82 Treleaven P C, Brownbridge D R, Hopkins R P. “Data driven and
demand driven computer architecture”. Communications of the
ACM. Vol 14, No 1. pp95-143. March 1982
Trew91 Trew A, Wilson G. “Past, Present, Parallel. A Survey of Available
Parallel Computing Systems”. Springer Verlag. ISBN
0-387-19664-1. Section 4.1. pp126-136. 1991
Tseng96 Tseng Y-C, et al. “A Trip-Based Multicasting Model in
Wormhole-Routed Networks with Virtual Channels”. IEEE
Transactions on Parallel and Distributed Systems. Vol. 7, No 2.
pp138-150. February 1996.
Tucker88 Tucker L W, Robertson G G. “Architecture and Applications of
the Connection Machine”. IEEE Computer. Vol. 21, No 8. pp26-38.
August 1988.
Turner93 Turner R, et al. “Functional mapping of the human visual cortex
at 4 and 1.4 Tesla using deoxygenation contrast EPI”. Magnetic
Resonance in Medicine. Vol 29, pp277-279. 1993
Turner95 Turner R. “Functional mapping of the human brain with magnetic
resonance imaging”. Seminars in the Neurosciences. Vol. 7,
pp179-194. 1995.
VanDenBout92 Van Den Bout D E, et al. “AnyBoard: An FPGA-based
Reconfigurable System”. IEEE Design and Test of Computers.
pp21-30. September 1992.
VanLeeuwen87 Van Leeuwen J, Tan R B. “Interval Routing”. The Computer
Journal. Vol. 30, No 4. pp298-307. 1987
Villasenor96 Villasenor J, et al. “Configurable Computing Solutions for
Automatic Target Recognition”. IEEE Symposium on FPGAs for
Custom Computing Machines 1996. IEEE CS Press. ISBN
0-8186-7548-9. pp70-79. 1996
Weems92 Weems C C, et al. “Image Understanding Architecture:
Exploiting Potential Parallelism in Machine Vision”. IEEE
Computer. Vol. 25, No 2. pp65-68. February 1992
Weisskoff90 Weisskoff R M, Crawley A P, Weeden V. “Flow Sensitivity and
flow compensation in instant imaging”. SMRM 9th Annual
Meeting - Book of Abstracts. Vol. 1. pp398. New York. 1990
Whobrey88 Whobrey D. “A Communications Chip for Multiprocessors”.
CONPAR 88. University Press, Cambridge. ISBN 0 521 37177 5.
pp464-473. 1988
Wiley87 Wiley P. “A Parallel Architecture comes of age at last”. IEEE
Spectrum. Vol. 24, No 6. pp46-50. June 1987
Wittie81 Wittie L D. “Communication Structures for Large Networks of
Microcomputers”. IEEE Transactions on Computers. Vol. C30,
No 4. pp264-273. April 1981
Wolf94 Wolf W H. “Hardware-Software Co-Design of Embedded
Systems”. Proceedings of the IEEE. Vol. 82, No 7. pp967-989.
July 1994
Woo94 Woo N S, et al. “Codesign from Cospecification”. IEEE
Computer. Vol. 27, No 1. pp42-47. January 1994
Wu80 Wu C-L, Feng T-Y. “On a class of Multistage Interconnection
Networks”. IEEE Transactions on Computers. Vol C29, No 8.
pp694-702. August 1980
Wu80a Wu C-L, Feng T-Y. “The Reverse-Exchange Interconnection
Network”. IEEE Transactions on Computers. Vol. C-29, No 9.
pp801-811. September 1980.
Wulf72 Wulf W A, Bell C G. “C.mmp - A multi-mini-processor”.
Proceedings of the AFIPS Fall Joint Computer Conference.
pp765-777. 1972
Xilinx94 XILINX Inc. “The Programmable Logic Data Book”.
PN 0401224. 1994
Yantchev89 Yantchev J, Jesshope C R. “Adaptive, low latency, deadlock-free
packet routing for networks of processors”. IEE Proceedings.
Vol. 136, Part E, No 3. pp178-186. May 1989.
Yeh95 Yeh C-C, et al. “Design and Implementation of a Multicomputer
Interconnection Network using FPGAs”. IEEE Symposium on
FPGAs for Custom Computing Machines 1995. IEEE CS Press.
ISBN 0-8186-7086-X. pp56-60. April 1995
Appendices
9. Appendix A - VHDL Hierarchy and VHDL Source Code
Source File     Description                                    Page
RCORE.VHD       Top level description                          199
C40INJ.VHD      C40 injection controller                       202
C40REC.VHD      C40 reception controller                       207
DUALBUS.VHD     Dual bus routing function                      212
IPMUX.VHD       Input multiplexer and flit buffer              216
IPCELL.VHD      Flit buffer for data inputs                    220
CIPCELL.VHD     Flit buffer for control inputs                 221
IPMUXCMP.VHD    Address comparator                             222
OPMUX.VHD       Output multiplexer                             224
OPCELL.VHD      Output multiplexer cell for data               225
OPCELLE.VHD     Output multiplexer for strobed signal output   226
VCONTROL.VHD    VHDL controller collection                     227
IMUX.VHD        Input Multiplexer controller                   231
INJ.VHD         Injection controller                           233
FLOW.VHD        Flit buffer flow controller                    238
ROUTER.VHD      Routing controller                             240
OMUX.VHD        Output multiplexer controller                  255
RMUX.VHD        Reception channel multiplexer controller       257
Figure 70 - Index of VHDL Descriptions
-- rcore.VHD - top level VHDL router description combining vhdl
-- descriptions into a routing 'core' with two dualbuses
library IEEE;
use IEEE.std_logic_1164.all;
entity ROUTERCORE is
  generic( constant NODE : integer := 0);
  port( signal CLOCK           :in  std_logic;
        signal RESET           :in  std_logic;
        signal WADATAO         :out std_logic_vector(7 downto 0);
        signal WASYNCO,WASTRBO :out std_logic;
        signal WAHOLDO         :in  std_logic;
        signal EADATAI         :in  std_logic_vector(7 downto 0);
        signal EASYNCI,EASTRBI :in  std_logic;
        signal EAHOLDI         :out std_logic;
        signal WBDATAO         :out std_logic_vector(7 downto 0);
        signal WBSYNCO,WBSTRBO :out std_logic;
        signal WBHOLDO         :in  std_logic;
        signal EBDATAI         :in  std_logic_vector(7 downto 0);
        signal EBSYNCI,EBSTRBI :in  std_logic;
        signal EBHOLDI         :out std_logic;
        signal WADATAI         :in  std_logic_vector(7 downto 0);
        signal WASYNCI,WASTRBI :in  std_logic;
        signal WAHOLDI         :out std_logic;
        signal EADATAO         :out std_logic_vector(7 downto 0);
        signal EASYNCO,EASTRBO :out std_logic;
        signal EAHOLDO         :in  std_logic;
        signal WBDATAI         :in  std_logic_vector(7 downto 0);
        signal WBSYNCI,WBSTRBI :in  std_logic;
        signal WBHOLDI         :out std_logic;
        signal EBDATAO         :out std_logic_vector(7 downto 0);
        signal EBSYNCO,EBSTRBO :out std_logic;
        signal EBHOLDO         :in  std_logic;
        signal C40STRBO        :out std_logic;
        signal C40DATAO        :out std_logic_vector(7 downto 0);
        signal C40RDYO         :in  std_logic;
        signal C40STRBI        :in  std_logic;
        signal C40DATAI        :in  std_logic_vector(7 downto 0);
        signal C40RDYI         :out std_logic
      );
end ROUTERCORE;
architecture ROUTERDEVICE of ROUTERCORE is
component C40REC
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal C40STRBO :out std_logic;
        signal C40DATAO :out std_logic_vector(7 downto 0);
        signal C40RDYO :in std_logic;
        signal ARHOLDO :inout std_logic;
        signal BRHOLDO :inout std_logic;
        signal ARDATAO :in std_logic_vector(7 downto 0);
        signal BRDATAO :in std_logic_vector(7 downto 0);
        signal ARSTRBO :in std_logic;
        signal BRSTRBO :in std_logic;
        signal ARSYNCO :in std_logic;
        signal BRSYNCO :in std_logic
  );
end component;
for all: C40REC use entity work.C40REC(fsm);
component C40INJ
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal C40STRBI :in std_logic;
        signal C40DATAI :in std_logic_vector(7 downto 0);
        signal C40RDYI :out std_logic;
        signal ARHOLDI :in std_logic;
        signal BRHOLDI :in std_logic;
        signal RDATAI :buffer std_logic_vector(7 downto 0);
        signal ARSTRB :out std_logic;
        signal BRSTRB :out std_logic;
        signal RSYNC :out std_logic
  );
end component;
for all: C40INJ use entity work.C40INJ(fsm);
component dualbus
  generic( constant EasterlyRouting : std_logic := '0';
           constant NODE : integer := 0);
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal ADATAI :in std_logic_vector(7 downto 0);
        signal ASYNCI,ASTRBI :in std_logic;
        signal AHOLDI :out std_logic;
        signal BDATAI :in std_logic_vector(7 downto 0);
        signal BSYNCI,BSTRBI :in std_logic;
        signal BHOLDI :out std_logic;
        signal RDATAI :in std_logic_vector(7 downto 0);
        signal RSYNCI,RSTRBI :in std_logic;
        signal RHOLDI :out std_logic;
        signal ASYNCO,ASTRBO :out std_logic;
        signal ADATAO :out std_logic_vector(7 downto 0);
        signal AHOLDO :in std_logic;
        signal BSYNCO,BSTRBO :out std_logic;
        signal BDATAO :out std_logic_vector(7 downto 0);
        signal BHOLDO :in std_logic;
        signal RSYNCO,RSTRBO :out std_logic;
        signal RDATAO :out std_logic_vector(7 downto 0);
        signal RHOLDO :in std_logic
  );
end component;
for all: dualbus use entity work.dualbus(toplevel);
signal WRSTRBI : std_logic;
signal ERSTRBI : std_logic;
signal RSYNCI : std_logic;
signal WRHOLDI : std_logic;
signal ERHOLDI : std_logic;
signal RDATAI : std_logic_vector(7 downto 0);
signal WRSYNCO,WRSTRBO : std_logic;
signal WRHOLDO : std_logic;
signal ERSYNCO,ERSTRBO : std_logic;
signal ERHOLDO : std_logic;
signal WRDATAO,ERDATAO : std_logic_vector(7 downto 0);
signal NODE_ADDRESS : std_logic_vector(3 downto 0);
begin
routerw: dualbus
  generic map ( '0', NODE )
  port map ( CLOCK, RESET,
             WADATAI, WASYNCI, WASTRBI, WAHOLDI,
             WBDATAI, WBSYNCI, WBSTRBI, WBHOLDI,
             RDATAI, RSYNCI, WRSTRBI, WRHOLDI,
             WASYNCO, WASTRBO, WADATAO, WAHOLDO,
             WBSYNCO, WBSTRBO, WBDATAO, WBHOLDO,
             WRSYNCO, WRSTRBO, WRDATAO, WRHOLDO
  );
routere: dualbus
  generic map ( '1', NODE )
  port map ( CLOCK, RESET,
             EADATAI, EASYNCI, EASTRBI, EAHOLDI,
             EBDATAI, EBSYNCI, EBSTRBI, EBHOLDI,
             RDATAI, RSYNCI, ERSTRBI, ERHOLDI,
             EASYNCO, EASTRBO, EADATAO, EAHOLDO,
             EBSYNCO, EBSTRBO, EBDATAO, EBHOLDO,
             ERSYNCO, ERSTRBO, ERDATAO, ERHOLDO
  );
c40out: C40REC
  port map ( CLOCK, RESET,
             C40STRBO, C40DATAO, C40RDYO,
             WRHOLDO, ERHOLDO,
             WRDATAO, ERDATAO,
             WRSTRBO, ERSTRBO,
             WRSYNCO, ERSYNCO
  );
c40in : C40INJ
  port map ( CLOCK, RESET,
             C40STRBI, C40DATAI, C40RDYI,
             WRHOLDI, ERHOLDI,
             RDATAI,
             WRSTRBI, ERSTRBI,
             RSYNCI
  );
end ROUTERDEVICE;
-- c40inj.VHD - input injection state machine for c40 comm port
library IEEE;
use IEEE.std_logic_1164.all;
entity C40INJ is
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal C40STRBI :in std_logic;
        signal C40DATAI :in std_logic_vector(7 downto 0);
        signal C40RDYI :out std_logic;
        signal ARHOLDI :in std_logic;
        signal BRHOLDI :in std_logic;
        signal RDATAI :buffer std_logic_vector(7 downto 0);
        signal ARSTRB :out std_logic;
        signal BRSTRB :out std_logic;
        signal RSYNC :out std_logic
  );
end C40INJ;
architecture fsm of C40INJ is
type C40InjStates is ( ControlByte,
                       NullControl,
                       Header1,
                       Header2,
                       NullHeader1,
                       NullHeader2,
                       WordCount,
                       Byte0,
                       Byte1,
                       Byte2,
                       Byte3);
signal State : C40InjStates := ControlByte;
signal Count : std_logic_vector(7 downto 0) := (others => '0');
signal LastBlock : std_logic := '0';
signal SelectA : std_logic := '0';
signal Carry : std_logic_vector(7 downto 0) := (others => '0');
signal Blocked : std_logic := '1';
signal SRHOLDI : std_logic := '0';
signal Encapsulation : std_logic := '0';
signal rC40STRBI : std_logic := '0';
signal rC40DATAI : std_logic_vector(7 downto 0);
signal NetByte : std_logic := '0';
signal CountIsZero : std_logic := '0';
signal CE : std_logic := '0';
signal DataValid : std_logic := '0';
signal RDY : std_logic := '1';
begin
STRBREG : process (CLOCK,RESET,C40STRBI)
begin
  if RESET='1' then
    rC40STRBI <= '1';
  elsif C40STRBI='1' then
    rC40STRBI <= '1';
  elsif CLOCK'event and CLOCK='0' and DataValid='0' then
    rC40STRBI <= C40STRBI;
  else
    null;
  end if;
end process;
C40READY : process (CLOCK,RESET,rC40STRBI)
begin
  if RESET='1' then
    RDY <= '1';
  elsif rC40STRBI='1' then
    RDY <= '1';
  elsif CLOCK'event and CLOCK='1' then
    if rC40STRBI='0' then
      RDY <= '0';
    else
      RDY <= '1';
    end if;
  else
    null;
  end if;
end process;
C40RDYI <= RDY;
DataValidREG : process (CLOCK,RESET)
begin
  if RESET='1' then
    DataValid <= '0';
  elsif CLOCK'event and CLOCK='1' then
    if (rC40STRBI='0' and RDY='1')
       or (DataValid='1' and Blocked='1') then
      DataValid <= '1';
    else
      DataValid <= '0';
    end if;
  else
    null;
  end if;
end process;
C40DATA : process (CLOCK,RESET,rC40STRBI,DataValid)
begin
  if RESET='1' then
    rC40DATAI <= "00000000";
  elsif CLOCK'event and CLOCK='1' then
    if rC40STRBI='0' and DataValid='0' then
      rC40DATAI <= C40DATAI;
    else
      null;
    end if;
  else
    null;
  end if;
end process;
RDATAI <= rC40DATAI;
CE <= '1' when DataValid='1' and Blocked='0' else '0';
flags : process (CLOCK,RESET,CE)
begin
  if RESET='1' then
    LastBlock <= '0';
    SelectA <= '0';
    Encapsulation <= '0';
  elsif CLOCK'event and CLOCK='1' and CE='1' then
    if ( State=ControlByte or State=NullControl) then
      LastBlock <= rC40DATAI(0);
    end if;
    if State=ControlByte then
      SelectA <= rC40DATAI(1);
    end if;
    if State=ControlByte then
      Encapsulation <= rC40DATAI(2);
    end if;
  else
    null;
  end if;
end process;
C40InjFsm : process (CLOCK,RESET,CE)
begin
  if RESET='1' then
    State <= ControlByte;
    NetByte <= '0';
  elsif CLOCK'event and CLOCK='1' and CE='1' then
    case State is
      when ControlByte =>
        State <= Header1;
        NetByte <= '1';
      when NullControl =>
        State <= NullHeader1;
        NetByte <= '0';
      when Header1 =>
        if Encapsulation='1' then
          State <= Header2;
          NetByte <= '1';
        else
          State <= NullHeader2;
          NetByte <= '0';
        end if;
      when NullHeader1 =>
        State <= NullHeader2;
        NetByte <= '0';
      when Header2 =>
        State <= WordCount;
        NetByte <= '0';
      when NullHeader2 =>
        State <= WordCount;
        NetByte <= '0';
      when WordCount =>
        State <= Byte0;
        NetByte <= '1';
      when Byte0 =>
        State <= Byte1;
        NetByte <= '1';
      when Byte1 =>
        State <= Byte2;
        NetByte <= '1';
      when Byte2 =>
        State <= Byte3;
        NetByte <= '1';
      when Byte3 =>
        if CountIsZero='1' then
          if LastBlock='1' then
            State <= ControlByte;
          else
            State <= NullControl;
          end if;
          NetByte <= '0';
        else
          State <= Byte0;
          NetByte <= '1';
        end if;
    end case;
  else
    null;
  end if;
end process C40InjFsm;
CarryLogic : process (CLOCK,RESET)
begin
  if RESET='1' then
    Carry <= "00000000";
  elsif CLOCK'event and CLOCK='1' and CE='1' then
    if State=Byte0 then
      Carry(0) <= '0';
      Carry(1) <= Count(0);
      Carry(2) <= Count(1) or Count(0);
      Carry(3) <= Count(2) or Count(1) or Count(0);
      Carry(4) <= Count(3) or Count(2) or Count(1) or Count(0);
    elsif State=Byte1 then
      Carry(5) <= Count(4) or Carry(4);
      Carry(6) <= Count(5) or Count(4) or Carry(4);
      Carry(7) <= Count(6) or Count(5) or Count(4) or Carry(4);
    else
      null;
    end if;
  else
    null;
  end if;
end process;
Counter : process (CLOCK,RESET)
begin
  if RESET='1' then
    Count <= "00000000";
  elsif CLOCK'event and CLOCK='1' and CE='1' then
    if State=WordCount then
      Count <= rC40DATAI;
    elsif State=Byte2 then
      Count(0) <= not( Count(0) xor Carry(0));
      Count(1) <= not( Count(1) xor Carry(1));
      Count(2) <= not( Count(2) xor Carry(2));
      Count(3) <= not( Count(3) xor Carry(3));
      Count(4) <= not( Count(4) xor Carry(4));
      Count(5) <= not( Count(5) xor Carry(5));
      Count(6) <= not( Count(6) xor Carry(6));
      Count(7) <= not( Count(7) xor Carry(7));
    else
      null;
    end if;
  else
    null;
  end if;
end process;
CounterZero : process (CLOCK,RESET)
begin
  if RESET='1' then
    CountIsZero <= '0';
  elsif CLOCK'event and CLOCK='1' and CE='1' then
    if State=Byte2 then
      if Count="00000001" then
        CountIsZero <= '1';
      else
        CountIsZero <= '0';
      end if;
    else
      null;
    end if;
  else
    null;
  end if;
end process;
ARSTRB <= '1' when ( SelectA='1' and NetByte='1' and DataValid='1')
          else '0';
BRSTRB <= '1' when ( SelectA='0' and NetByte='1' and DataValid='1')
          else '0';
SRHOLDI <= ARHOLDI when SelectA='1' else BRHOLDI;
Blocked <= '1' when ( NetByte='1' and SRHOLDI='1') else '0';
RSYNC <= '0' when ( State=Byte3
                    and CountIsZero='1'
                    and LastBlock='1')
         else '1';
end fsm;
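
The CarryLogic and Counter processes above appear to implement an 8-bit decrement of the word count: each Carry bit is the OR of all lower Count bits, and each Count bit is replaced by the XNOR of itself and its Carry bit. The self-checking testbench below is a minimal sketch of that reading, not part of the original design. The entity name DECREMENT_TB and the function decrement are hypothetical, the use of IEEE.numeric_std and a VHDL-93 simulator is an assumption, and the registered two-stage carry of the original is collapsed into one combinational function for clarity.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.numeric_std.all;   -- assumption: used only for checking, not in the design

entity DECREMENT_TB is      -- hypothetical testbench name
end DECREMENT_TB;

architecture sim of DECREMENT_TB is
  -- Combinational restatement of the CarryLogic/Counter equations:
  -- carry(i) is the OR of c(0) .. c(i-1); result(i) = not (c(i) xor carry(i)).
  function decrement(c : std_logic_vector(7 downto 0))
    return std_logic_vector is
    variable carry : std_logic := '0';
    variable r     : std_logic_vector(7 downto 0);
  begin
    for i in 0 to 7 loop
      r(i)  := not (c(i) xor carry);
      carry := carry or c(i);
    end loop;
    return r;
  end function;
begin
  check : process
    variable v : std_logic_vector(7 downto 0);
  begin
    -- Compare against modulo-256 subtraction for every 8-bit value.
    for n in 0 to 255 loop
      v := std_logic_vector(to_unsigned(n, 8));
      assert unsigned(decrement(v)) = to_unsigned((n + 255) mod 256, 8)
        report "decrement mismatch" severity error;
    end loop;
    wait;
  end process;
end sim;
```

Under this reading, testing Count against "00000001" in state Byte2 detects the final word before the decrement is applied, which is consistent with CountIsZero being sampled in Byte3.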
-- c40rec.VHD - RECEPTION state machine for c40 comm port
library IEEE;
use IEEE.std_logic_1164.all;
entity C40REC is
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal C40STRBO :out std_logic;
        signal C40DATAO :out std_logic_vector(7 downto 0);
        signal C40RDYO :in std_logic;
        signal ARHOLDO :inout std_logic;
        signal BRHOLDO :inout std_logic;
        signal ARDATAO :in std_logic_vector(7 downto 0);
        signal BRDATAO :in std_logic_vector(7 downto 0);
        signal ARSTRBO :in std_logic;
        signal BRSTRBO :in std_logic;
        signal ARSYNCO :in std_logic;
        signal BRSYNCO :in std_logic
  );
end C40REC;
architecture fsm of C40REC is
type C40OutStates is ( FAVOURA,
                       USEASTROBE,
                       USEARDY,
                       USEAWAIT,
                       FAVOURB,
                       USEBSTROBE,
                       USEBRDY,
                       USEBWAIT);
signal C40OutState : C40OutStates := FAVOURA;
signal LastStrobe : std_logic := '0';
signal LastAByte : std_logic := '0';
signal LastBByte : std_logic := '0';
signal ADataHold : std_logic := '0';
signal BDataHold : std_logic := '0';
signal ARdy : std_logic := '0';
signal BRdy : std_logic := '0';
signal rARSTRBO : std_logic;
signal rBRSTRBO : std_logic;
signal rARSYNCO : std_logic;
signal rBRSYNCO : std_logic;
signal C40ADATAO : std_logic_vector(7 downto 0);
signal C40BDATAO : std_logic_vector(7 downto 0);
signal rC40RDYO : std_logic;
begin
REGADATA : process (CLOCK,RESET)
begin
  if RESET='1' then
    C40ADATAO <= "00000000";
  elsif CLOCK'event and CLOCK='1' and ARdy='0' then
    C40ADATAO <= ARDATAO;
  else
    null;
  end if;
end process;
REGBDATA : process (CLOCK,RESET)
begin
  if RESET='1' then
    C40BDATAO <= "00000000";
  elsif CLOCK'event and CLOCK='1' and BRdy='0' then
    C40BDATAO <= BRDATAO;
  else
    null;
  end if;
end process;
REGRDY : process (CLOCK,RESET)
begin
  if RESET='1' then
    rC40RDYO <= '1';
  elsif CLOCK'event and CLOCK='0' then
    rC40RDYO <= C40RDYO;
  else
    null;
  end if;
end process;
LASTA : process (CLOCK,RESET)
begin
  if RESET='1' then
    LastAByte <= '0';
  elsif ( CLOCK'event and CLOCK='1' and ARHOLDO='0' and ARSTRBO='1') then
    if ARSYNCO='0' then
      LastAByte <= '1';
    else
      LastAByte <= '0';
    end if;
  else
    null;
  end if;
end process;
LASTB : process (CLOCK,RESET)
begin
  if RESET='1' then
    LastBByte <= '0';
  elsif ( CLOCK'event and CLOCK='1' and BRHOLDO='0' and BRSTRBO='1') then
    if BRSYNCO='0' then
      LastBByte <= '1';
    else
      LastBByte <= '0';
    end if;
  else
    null;
  end if;
end process;
REGSIGS : process (CLOCK,RESET)
begin
  if RESET='1' then
    rARSTRBO <= '0';
    rBRSTRBO <= '0';
    rARSYNCO <= '0';
    rBRSYNCO <= '0';
  elsif CLOCK'event and CLOCK='1' then
    rARSTRBO <= ARSTRBO;
    rBRSTRBO <= BRSTRBO;
    rARSYNCO <= ARSYNCO;
    rBRSYNCO <= BRSYNCO;
  else
    null;
  end if;
end process;
C40DATAO <= C40ADATAO when ( C40OutState=FAVOURA or
C40OutState=USEASTROBE or 
C40OutState=USEAWAIT)
else C40BDATAO;
RDYREG : process (CLOCK,RESET)
begin
  if RESET='1' then
    ARdy<='0';
    BRdy<='0';
  elsif CLOCK'event and CLOCK='1' then
    if (ARSTRBO='1' and ARHOLDO='0')
       or (ARdy='1' and not (rC40RDYO='0' and C40OutState=USEASTROBE)) then
      ARdy<='1';
    else
      ARdy<='0';
    end if;
    if (BRSTRBO='1' and BRHOLDO='0')
       or (BRdy='1' and not (rC40RDYO='0' and C40OutState=USEBSTROBE)) then
      BRdy<='1';
    else
      BRdy<='0';
    end if;
  else
    null;
  end if;
end process;
ARHOLDO <= ARdy;
BRHOLDO <= BRdy;
C40IN : process (CLOCK,RESET)
begin
  if RESET='1' then
    C40OutState <= FAVOURA;
    C40STRBO<='1';
    LastStrobe <= '0';
  elsif CLOCK'event and CLOCK='1' then
    case C40OutState is
      when FAVOURA =>
        if ARdy='1' then
          C40OutState <= USEASTROBE;
          C40STRBO<='0';
        else
          C40OutState <= FAVOURB;
          C40STRBO<='1';
        end if;
        LastStrobe <= '0';
      when FAVOURB =>
        if BRdy='1' then
          C40OutState <= USEBSTROBE;
          C40STRBO<='0';
        else
          C40OutState <= FAVOURA;
          C40STRBO<='1';
        end if;
        LastStrobe <= '0';
      when USEASTROBE =>
        if rC40RDYO='0' then
          C40OutState <= USEARDY;
          C40STRBO<='1';
        else
          C40OutState <= USEASTROBE;
          C40STRBO<='0';
        end if;
      when USEBSTROBE =>
        if rC40RDYO='0' then
          C40OutState <= USEBRDY;
          C40STRBO<='1';
        else
          C40OutState <= USEBSTROBE;
          C40STRBO<='0';
        end if;
      when USEARDY =>
        if rC40RDYO='1' then
          C40OutState <= USEAWAIT;
        else
          C40OutState <= USEARDY;
        end if;
        C40STRBO<='1';
      when USEBRDY =>
        if rC40RDYO='1' then
          C40OutState <= USEBWAIT;
        else
          C40OutState <= USEBRDY;
        end if;
        C40STRBO<='1';
      when USEAWAIT =>
        if LastStrobe = '1' then
          C40OutState <= FAVOURB;
          C40STRBO<='1';
        else
          if ARdy='1' then
            if LastAByte='1' then
              LastStrobe <= '1';
            else
              LastStrobe <= '0';
            end if;
            C40OutState <= USEASTROBE;
            C40STRBO<='0';
          else
            C40OutState <= USEAWAIT;
            C40STRBO<='1';
          end if;
        end if;
      when USEBWAIT =>
        if LastStrobe = '1' then
          C40OutState <= FAVOURA;
          C40STRBO<='1';
        else
          if BRdy='1' then
            if LastBByte='1' then
              LastStrobe <= '1';
            else
              LastStrobe <= '0';
            end if;
            C40OutState <= USEBSTROBE;
            C40STRBO<='0';
          else
            C40OutState <= USEBWAIT;
            C40STRBO<='1';
          end if;
        end if;
    end case;
  else
    null;
  end if;
end process;
end fsm;
-- dualbus.VHD - top level VHDL description combining all vhdl descriptions
library IEEE;
use IEEE.std_logic_1164.all;
entity DUALBUS is
  generic( constant EasterlyRouting : std_logic := '0';
           constant NODE : integer := 0);
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal ADATAI :in std_logic_vector(7 downto 0);
        signal ASYNCI,ASTRBI :in std_logic;
        signal AHOLDI :out std_logic;
        signal BDATAI :in std_logic_vector(7 downto 0);
        signal BSYNCI,BSTRBI :in std_logic;
        signal BHOLDI :out std_logic;
        signal RDATAI :in std_logic_vector(7 downto 0);
        signal RSYNCI,RSTRBI :in std_logic;
        signal RHOLDI :out std_logic;
        signal ASYNCO,ASTRBO :out std_logic;
        signal ADATAO :out std_logic_vector(7 downto 0);
        signal AHOLDO :in std_logic;
        signal BSYNCO,BSTRBO :out std_logic;
        signal BDATAO :out std_logic_vector(7 downto 0);
        signal BHOLDO :in std_logic;
        signal RSYNCO,RSTRBO :out std_logic;
        signal RDATAO :out std_logic_vector(7 downto 0);
        signal RHOLDO :in std_logic
  );
end DUALBUS;
architecture TOPLEVEL of DUALBUS is
component IPMUX
  generic( constant EasterlyRouting : std_logic := '0';
           constant NODE : integer := 0);
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal CLOCKEN :in std_logic;
        signal DIRECTSYNC :in std_logic;
        signal DIRECTSTRB :in std_logic;
        signal DIRECTDATA :in std_logic_vector(7 downto 0);
        signal INJECTSYNC :in std_logic;
        signal INJECTSTRB :in std_logic;
        signal INJECTDATA :in std_logic_vector(7 downto 0);
        signal ISEL :in std_logic;
        signal HSEL :in std_logic;
        signal InjTest :in std_logic;
        signal GT,GE,LE,PP,BC :out std_logic;
        signal EQ,LT,EOM :out std_logic;
        signal STRBH :out std_logic;
        signal OUTDATA :out std_logic_vector(7 downto 0);
        signal OUTSYNC :out std_logic;
        signal OUTSTRB :out std_logic
  );
end component;
for all: IPMUX use entity work.ipmux(input_mux);
component VCONTROL
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal AHOLDO :in std_logic;
        signal ASTRB :in std_logic;
        signal BHOLDO :in std_logic;
        signal BSTRB :in std_logic;
        signal RHOLDO :in std_logic;
        signal RSTRB :in std_logic;
        signal RSYNC :in std_logic;
        signal ASTRBH,BSTRBH :in std_logic;
        signal AEQ,BEQ :in std_logic;
        signal ALT,BLT :in std_logic;
        signal AEOM,BEOM :in std_logic;
        signal AGT,AGE,ALE,APP,ABC :in std_logic;
        signal BGT,BGE,BLE,BPP,BBC :in std_logic;
        signal AInjTest :out std_logic;
        signal BInjTest :out std_logic;
        signal AAROUTER_OEN :out std_logic;
        signal ABROUTER_OEN :out std_logic;
        signal ARROUTER_OEN :out std_logic;
        signal BBROUTER_OEN :out std_logic;
        signal BAROUTER_OEN :out std_logic;
        signal BRROUTER_OEN :out std_logic;
        signal AISEL,AHSEL,ACE,ADSEL :out std_logic;
        signal BISEL,BHSEL,BCE,BDSEL :out std_logic;
        signal AHOLDI :out std_logic;
        signal BHOLDI :out std_logic;
        signal RHOLDI :out std_logic;
        signal RASEL :out std_logic
  );
end component;
for all: VCONTROL use entity work.VCONTROL(VHDL_CONTROL);
component OPMUX
  port( signal DIRECTSYNC :in std_logic;
        signal DIRECTSTRB :in std_logic;
        signal DIRECTDATA :in std_logic_vector(7 downto 0);
        signal ADAPTSYNC :in std_logic;
        signal ADAPTSTRB :in std_logic;
        signal ADAPTDATA :in std_logic_vector(7 downto 0);
        signal SELECTD :in std_logic;
        signal DROUTER_OEN :in std_logic;
        signal AROUTER_OEN :in std_logic;
        signal OUTDATA :out std_logic_vector(7 downto 0);
        signal OUTSYNC :out std_logic;
        signal OUTSTRB :out std_logic
  );
end component;
for all: OPMUX use entity work.OPMUX(OUTPUT_MUX);
signal AISEL : std_logic := '0';
signal AHSEL : std_logic := '0';
signal ACE : std_logic := '0';
signal BISEL : std_logic := '0';
signal BHSEL : std_logic := '0';
signal BCE : std_logic := '0';
signal ADSEL : std_logic := '0';
signal BDSEL : std_logic := '0';
signal RASEL : std_logic := '0';
signal ADATA : std_logic_vector(7 downto 0) := "00000000";
signal ASYNC : std_logic := '0';
signal ASTRB : std_logic := '0';
signal BDATA : std_logic_vector(7 downto 0) := "00000000";
signal BSYNC : std_logic := '0';
signal BSTRB : std_logic := '0';
signal ASTRBH : std_logic := '0';
signal BSTRBH : std_logic := '0';
signal AAROUTER_OEN : std_logic := '0';
signal ABROUTER_OEN : std_logic := '0';
signal ARROUTER_OEN : std_logic := '0';
signal BBROUTER_OEN : std_logic := '0';
signal BAROUTER_OEN : std_logic := '0';
signal BRROUTER_OEN : std_logic := '0';
signal AEQ : std_logic := '0';
signal ALT : std_logic := '0';
signal AEOM : std_logic := '0';
signal BEQ : std_logic := '0';
signal BLT : std_logic := '0';
signal BEOM : std_logic := '0';
signal AGT : std_logic := '0';
signal AGE : std_logic := '0';
signal ALE : std_logic := '0';
signal APP : std_logic := '0';
signal ABC : std_logic := '0';
signal BGE : std_logic := '0';
signal BGT : std_logic := '0';
signal BLE : std_logic := '0';
signal BPP : std_logic := '0';
signal BBC : std_logic := '0';
signal AInjTest : std_logic := '0';
signal BInjTest : std_logic := '0';
begin
A_IPMUX: IPMUX
  generic map ( EasterlyRouting, NODE )
  port map ( CLOCK, RESET,
             ACE,
             ASYNCI, ASTRBI, ADATAI,
             RSYNCI, RSTRBI, RDATAI,
             AISEL, AHSEL,
             AInjTest,
             AGT,AGE,ALE,APP,ABC,
             AEQ,ALT,AEOM,
             ASTRBH, ADATA, ASYNC, ASTRB );
B_IPMUX: IPMUX
  generic map ( EasterlyRouting, NODE )
  port map ( CLOCK, RESET,
             BCE,
             BSYNCI, BSTRBI, BDATAI,
             RSYNCI, RSTRBI, RDATAI,
             BISEL, BHSEL,
             BInjTest,
             BGT,BGE,BLE,BPP,BBC,
             BEQ,BLT,BEOM,
             BSTRBH, BDATA, BSYNC, BSTRB );
CONTROL: VCONTROL
  port map ( CLOCK, RESET,
             AHOLDO, ASTRB,
             BHOLDO, BSTRB,
             RHOLDO, RSTRBI, RSYNCI,
             ASTRBH, BSTRBH,
             AEQ, BEQ, ALT,BLT, AEOM,BEOM,
             AGT,AGE,ALE,APP,ABC,
             BGT,BGE,BLE,BPP,BBC,
             AInjTest, BInjTest,
             AAROUTER_OEN,ABROUTER_OEN,
             ARROUTER_OEN,BBROUTER_OEN,
             BAROUTER_OEN,BRROUTER_OEN,
             AISEL, AHSEL, ACE, ADSEL,
             BISEL, BHSEL, BCE, BDSEL,
             AHOLDI, BHOLDI, RHOLDI, RASEL );
A_OPMUX: OPMUX
  port map ( ASYNC, ASTRB, ADATA,
             BSYNC, BSTRB, BDATA,
             ADSEL,
             AAROUTER_OEN,BAROUTER_OEN,
             ADATAO, ASYNCO, ASTRBO );
B_OPMUX: OPMUX
  port map ( BSYNC, BSTRB, BDATA,
             ASYNC, ASTRB, ADATA,
             BDSEL,
             BBROUTER_OEN,ABROUTER_OEN,
             BDATAO, BSYNCO, BSTRBO );
R_OPMUX: OPMUX
  port map ( ASYNC, ASTRB, ADATA,
             BSYNC, BSTRB, BDATA,
             RASEL,
             ARROUTER_OEN,BRROUTER_OEN,
             RDATAO, RSYNCO, RSTRBO );
end TOPLEVEL;
-- ipmux.VHD - input mux circuitry
library IEEE;
use IEEE.std_logic_1164.all;
entity IPMUX is
  generic( constant EasterlyRouting : std_logic := '0';
           constant NODE : integer := 0);
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal CLOCKEN :in std_logic;
        signal DIRECTSYNC :in std_logic;
        signal DIRECTSTRB :in std_logic;
        signal DIRECTDATA :in std_logic_vector(7 downto 0);
        signal INJECTSYNC :in std_logic;
        signal INJECTSTRB :in std_logic;
        signal INJECTDATA :in std_logic_vector(7 downto 0);
        signal ISEL :in std_logic;
        signal HSEL :in std_logic;
        signal InjTest :in std_logic;
        signal GT,GE,LE,PP,BC :out std_logic;
        signal EQ,LT,EOM :out std_logic;
        signal STRBH :out std_logic;
        signal OUTDATA :out std_logic_vector(7 downto 0);
        signal OUTSYNC :out std_logic;
        signal OUTSTRB :out std_logic
  );
end IPMUX;
architecture INPUT_MUX of IPMUX is
component ipcell
  port ( signal CLK :in std_logic;
         signal RST :in std_logic;
         signal CE :in std_logic;
         signal ID :in std_logic;
         signal DD :in std_logic;
         signal ISEL :in std_logic;
         signal HSEL :in std_logic;
         signal Y :buffer std_logic;
         signal H :buffer std_logic;
         signal Q :out std_logic
  );
end component;
component cipcell
  port ( signal CLK :in std_logic;
         signal RST :in std_logic;
         signal CE :in std_logic;
         signal ID :in std_logic;
         signal DD :in std_logic;
         signal ISEL :in std_logic;
         signal HSEL :in std_logic;
         signal INJTST :in std_logic;
         signal Y :buffer std_logic;
         signal H :buffer std_logic;
         signal Q :out std_logic
  );
end component;
component ipmuxcmp
  generic( constant EasterlyRouting : std_logic := '0';
           constant NODE : integer := 0);
  port( signal CLOCK :in std_logic;
        signal RST :in std_logic;
        signal SELECTEDSYNC :in std_logic;
        signal SELECTEDSTRB :in std_logic;
        signal SELECTEDADDR :in std_logic_vector(3 downto 0);
        signal GT :out std_logic;
        signal EQ :out std_logic;
        signal LT :out std_logic
  );
end component;
signal SELECTEDDATA : std_logic_vector(7 downto 0) := "00000000";
signal SELECTEDSYNC : std_logic := '0';
signal SELECTEDSTRB : std_logic := '0';
signal HOLDDATA : std_logic_vector(7 downto 0) := "00000000";
signal HOLDSYNC : std_logic := '0';
signal HOLDSTRB : std_logic := '0';
constant POINT_TO_POINT : std_logic_vector(1 downto 0) := "00";
constant BROADCAST : std_logic_vector(1 downto 0) := "01";
constant LOCAL_ENCAPSULATION : std_logic_vector(1 downto 0) := "10";
constant GLOBAL_ENCAPSULATION : std_logic_vector(1 downto 0) := "11";
begin
acmp: ipmuxcmp
generic map ( EasterlyRouting, NODE ) 
port map ( CLOCK, RESET,
SELECTEDSYNC, SELECTEDSTRB,
SELECTEDDATA(5 downto 2),
GT,EQ,LT);
dipmux: for N in 0 to 7 generate
  dipcell: ipcell
    port map ( CLOCK, RESET, CLOCKEN,
               INJECTDATA(N), DIRECTDATA(N),
               ISEL, HSEL,
               HOLDDATA(N),
               SELECTEDDATA(N),
               OUTDATA(N));
end generate;
strbipcell: cipcell
  port map ( CLOCK, RESET, CLOCKEN,
             INJECTSTRB, DIRECTSTRB,
             ISEL, HSEL, InjTest,
             HOLDSTRB,
             SELECTEDSTRB,
             OUTSTRB );
syncipcell: cipcell
  port map ( CLOCK, RESET, CLOCKEN,
             INJECTSYNC, DIRECTSYNC,
             ISEL, HSEL, InjTest,
             HOLDSYNC,
             SELECTEDSYNC,
             OUTSYNC );
STRBH <= HOLDSTRB;
LOOKAHEAD:process (CLOCK,RESET)
begin
  if RESET='1' then
    GE <='0';
    LE <='0';
    PP <='0';
    BC <='0';
  elsif CLOCK'event and CLOCK='1' then
    if (SELECTEDSYNC='1' and SELECTEDSTRB='1')
       and (SELECTEDDATA(1 downto 0)=GLOBAL_ENCAPSULATION) then
      GE <='1';
    else
      GE <='0';
    end if;
    if (SELECTEDSYNC='1' and SELECTEDSTRB='1')
       and (SELECTEDDATA(1 downto 0)=LOCAL_ENCAPSULATION) then
      LE <='1';
    else
      LE <='0';
    end if;
    if (SELECTEDSYNC='1' and SELECTEDSTRB='1')
       and (SELECTEDDATA(1 downto 0)=POINT_TO_POINT) then
      PP <='1';
    else
      PP <='0';
    end if;
    if (SELECTEDSYNC='1' and SELECTEDSTRB='1')
       and (SELECTEDDATA(1 downto 0)=BROADCAST) then
      BC <='1';
    else
      BC <='0';
    end if;
  else
    null;
  end if;
end process;
EOMLOOKAHEAD:process (CLOCK,RESET)
begin
  if RESET='1' then
    EOM <='0';
  elsif CLOCK'event and CLOCK='1' then
    if CLOCKEN='1' then
      if SELECTEDSYNC='0' and SELECTEDSTRB='1' then
        EOM <='1';
      else
        EOM <='0';
      end if;
    end if;
  else
    null;
  end if;
end process;
END;

-- ipcell.VHD - input mux cell circuitry
library IEEE;
use IEEE.std_logic_1164.all;
entity IPCELL is
  port( signal CLK :in std_logic;
        signal RST :in std_logic;
        signal CE :in std_logic;
        signal ID :in std_logic;
        signal DD :in std_logic;
        signal ISEL :in std_logic;
        signal HSEL :in std_logic;
        signal Y :buffer std_logic;
        signal H :buffer std_logic;
        signal Q :out std_logic
  );
end IPCELL;
architecture behav of IPCELL is
begin
cell:process (CLK,RST)
begin
  if RST='1' then
    Q <= '0';
  elsif CLK'event and CLK='1' then
    if CE='1' then
      Q <= H;
    end if;
  else
    null;
  end if;
end process;
storage:process (CLK,RST)
begin
  if RST='1' then
    Y <= '0';
  elsif CLK'event and CLK='1' then
    Y <= H;
  else
    null;
  end if;
end process storage;
mux:process ( DD,ID,Y,HSEL,ISEL )
begin
  if HSEL='1' then
    H <= Y;
  elsif ISEL='1' then
    H <= DD;
  else
    H <= ID;
  end if;
end process mux;
END;
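
The flit buffer cell above selects between the injected input (ID), the direct input (DD) and the held value (Y), with HSEL taking priority over ISEL. The following testbench is a minimal sketch for exercising that priority in simulation. The testbench name IPCELL_TB, the timing values and the use of VHDL-93 direct entity instantiation are assumptions, and it presumes the IPCELL entity has been compiled into library work with the port names listed above.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

entity IPCELL_TB is          -- hypothetical testbench name
end IPCELL_TB;

architecture sim of IPCELL_TB is
  signal CLK, RST, CE, ID, DD, ISEL, HSEL : std_logic := '0';
  signal Y, H, Q : std_logic;
begin
  uut : entity work.IPCELL(behav)
    port map ( CLK => CLK, RST => RST, CE => CE, ID => ID, DD => DD,
               ISEL => ISEL, HSEL => HSEL, Y => Y, H => H, Q => Q );

  stimulus : process
  begin
    -- Asynchronous reset clears the storage register Y.
    RST <= '1';
    wait for 10 ns;
    RST <= '0';

    -- With HSEL low and ISEL high, the direct input is selected.
    DD <= '1'; ID <= '0'; ISEL <= '1'; HSEL <= '0';
    wait for 10 ns;
    assert H = '1' report "direct input not selected" severity error;

    -- One rising clock edge copies H into Y.
    CLK <= '1';
    wait for 10 ns;
    CLK <= '0';

    -- With HSEL high, the stored value is held even though DD changes.
    HSEL <= '1'; DD <= '0';
    wait for 10 ns;
    assert H = '1' report "held value lost" severity error;
    wait;
  end process;
end sim;
```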
-- cipcell.VHD - input mux cell circuitry for control lines
library IEEE;
use IEEE.std_logic_1164.all;
entity CIPCELL is
  port( signal CLK :in std_logic;
        signal RST :in std_logic;
        signal CE :in std_logic;
        signal ID :in std_logic;
        signal DD :in std_logic;
        signal ISEL :in std_logic;
        signal HSEL :in std_logic;
        signal INJTST :in std_logic;
        signal Y :buffer std_logic;
        signal H :buffer std_logic;
        signal Q :out std_logic
  );
end CIPCELL;
architecture behav of CIPCELL is
begin
cell:process (CLK,RST)
begin
  if RST='1' then
    Q <= '0';
  elsif CLK'event and CLK='1' then
    if CE='1' then
      Q <= H;
    end if;
  end if;
end process;
storage:process (CLK,RST)
begin
  if RST='1' then
    Y <= '0';
  elsif CLK'event and CLK='1' then
    Y <= H;
  end if;
end process storage;
mux:process ( DD,ID,Y,HSEL,ISEL,INJTST )
begin
  if HSEL='1' then
    H <= Y;
  elsif INJTST='1' then
    H <= '0';
  elsif ISEL='1' then
    H <= DD;
  else
    H <= ID;
  end if;
end process mux;
END;
-- ipmuxcmp.VHD - input mux address compare circuitry
library IEEE;
use IEEE.std_logic_1164.all;
entity IPMUXCMP is
  generic( constant EasterlyRouting : std_logic := '0';
           constant NODE : integer := 0);
  port( signal CLOCK :in std_logic;
        signal RST :in std_logic;
        signal SELECTEDSYNC :in std_logic;
        signal SELECTEDSTRB :in std_logic;
        signal SELECTEDADDR :in std_logic_vector(3 downto 0);
        signal GT :out std_logic;
        signal EQ :out std_logic;
        signal LT :out std_logic
  );
end IPMUXCMP;
architecture behav of IPMUXCMP is
component addrgen
  generic ( constant NODE : integer := 0);
  port ( signal NODE_ADDRESS :out std_logic_vector(3 downto 0));
end component;
constant POINT_TO_POINT : std_logic_vector(1 downto 0) := "00";
constant BROADCAST : std_logic_vector(1 downto 0) := "01";
constant LOCAL_ENCAPSULATION : std_logic_vector(1 downto 0) := "10";
constant GLOBAL_ENCAPSULATION : std_logic_vector(1 downto 0) := "11";
signal NODE_ADDRESS : std_logic_vector(3 downto 0);
begin
nag : addrgen
  generic map( NODE )
  port map ( NODE_ADDRESS );
process (CLOCK,RST)
begin
  if RST='1' then
    LT <= '0';
    GT <= '0';
    EQ <= '0';
  elsif CLOCK'event and CLOCK='1' then
    if EasterlyRouting = '1' then
      if (SELECTEDSYNC='1' and SELECTEDSTRB='1')
         and (SELECTEDADDR < NODE_ADDRESS) then
        LT <= '1';
      else
        LT <= '0';
      end if;
      if (SELECTEDSYNC='1' and SELECTEDSTRB='1')
         and (SELECTEDADDR > NODE_ADDRESS) then
        GT <= '1';
      else
        GT <= '0';
      end if;
    else
      if (SELECTEDSYNC='1' and SELECTEDSTRB='1')
         and (SELECTEDADDR < NODE_ADDRESS) then
        GT <= '1';
      else
        GT <= '0';
      end if;
      if (SELECTEDSYNC='1' and SELECTEDSTRB='1')
         and (SELECTEDADDR > NODE_ADDRESS) then
        LT <= '1';
      else
        LT <= '0';
      end if;
    end if;
    if SELECTEDADDR = NODE_ADDRESS then
      EQ <= '1';
    else
      EQ <= '0';
    end if;
  else
    null;
  end if;
end process;
END;
-- opmux.VHD - output mux circuitry
library IEEE;
use IEEE.std_logic_1164.all;
entity OPMUX is
  port( signal DIRECTSYNC :in std_logic;
        signal DIRECTSTRB :in std_logic;
        signal DIRECTDATA :in std_logic_vector(7 downto 0);
        signal ADAPTSYNC :in std_logic;
        signal ADAPTSTRB :in std_logic;
        signal ADAPTDATA :in std_logic_vector(7 downto 0);
        signal SELECTD :in std_logic;
        signal DROUTER_OEN :in std_logic;
        signal AROUTER_OEN :in std_logic;
        signal OUTDATA :out std_logic_vector(7 downto 0);
        signal OUTSYNC :out std_logic;
        signal OUTSTRB :out std_logic
  );
end OPMUX;
architecture OUTPUT_MUX of OPMUX is
component opcell
  port ( signal DD :in std_logic;
         signal AD :in std_logic;
         signal SEL :in std_logic;
         signal O :out std_logic
  );
end component;
component opcelle
  port ( signal DD :in std_logic;
         signal AD :in std_logic;
         signal SEL :in std_logic;
         signal DOEN :in std_logic;
         signal AOEN :in std_logic;
         signal O :out std_logic
  );
end component;
begin
dataopmux: for N in 0 to 7 generate
  dopcell: opcell
    port map ( DIRECTDATA(N), ADAPTDATA(N), SELECTD, OUTDATA(N));
end generate;
syncopcell: opcell
  port map ( DIRECTSYNC, ADAPTSYNC, SELECTD,
             OUTSYNC );
strbopcell: opcelle
  port map ( DIRECTSTRB, ADAPTSTRB, SELECTD,
             DROUTER_OEN, AROUTER_OEN,
             OUTSTRB);
END;
-- opCELL.VHD - output mux circuitry
library IEEE;
use IEEE.std_logic_1164.all;
entity OPCELL is
  port( signal DD :in std_logic;
        signal AD :in std_logic;
        signal SEL :in std_logic;
        signal O :out std_logic
  );
end OPCELL;
architecture OMUXCELL of OPCELL is
begin
O <= AD when SEL='1' else DD;
END;
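
The output multiplexer cell above is a two-way selector: SEL high passes the adaptive input AD, SEL low passes the direct input DD. A minimal simulation sketch of that behaviour follows. The testbench name OPCELL_TB and the 10 ns delays are assumptions, and it presumes the OPCELL entity and its OMUXCELL architecture have been compiled into library work with a VHDL-93 simulator.

```vhdl
library IEEE;
use IEEE.std_logic_1164.all;

entity OPCELL_TB is          -- hypothetical testbench name
end OPCELL_TB;

architecture sim of OPCELL_TB is
  signal DD, AD, SEL : std_logic := '0';
  signal O : std_logic;
begin
  uut : entity work.OPCELL(OMUXCELL)
    port map ( DD => DD, AD => AD, SEL => SEL, O => O );

  stimulus : process
  begin
    -- SEL low: the direct input drives the output.
    DD <= '1'; AD <= '0'; SEL <= '0';
    wait for 10 ns;
    assert O = '1' report "direct input not selected" severity error;

    -- SEL high: the adaptive input drives the output.
    SEL <= '1';
    wait for 10 ns;
    assert O = '0' report "adaptive input not selected" severity error;
    wait;
  end process;
end sim;
```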
-- opCELLe.VHD - output mux circuitry
library IEEE;
use IEEE.std_logic_1164.all;
entity OPCELLE is
  port( signal DD :in std_logic;
        signal AD :in std_logic;
        signal SEL :in std_logic;
        signal DOEN :in std_logic;
        signal AOEN :in std_logic;
        signal O :out std_logic
  );
end OPCELLE;
architecture OMUXCELLE of OPCELLE is
begin
eomux : process ( SEL, DOEN, AOEN, DD, AD )
begin
  if SEL='1' then
    if AOEN='1' then
      O <= AD;
    else
      O <= '0';
    end if;
  else
    if DOEN='1' then
      O <= DD;
    else
      O <= '0';
    end if;
  end if;
end process;
END;
-- vcontrol.VHD - top level VHDL description combining all state machines
library IEEE;
use IEEE.std_logic_1164.all;
entity VCONTROL is
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal AHOLDO :in std_logic;
        signal ASTRB :in std_logic;
        signal BHOLDO :in std_logic;
        signal BSTRB :in std_logic;
        signal RHOLDO :in std_logic;
        signal RSTRB :in std_logic;
        signal RSYNC :in std_logic;
        signal ASTRBH,BSTRBH :in std_logic;
        signal AEQ,BEQ :in std_logic;
        signal ALT,BLT :in std_logic;
        signal AEOM :in std_logic;
        signal BEOM :in std_logic;
        signal AGT,AGE,ALE,APP,ABC :in std_logic;
        signal BGT,BGE,BLE,BPP,BBC :in std_logic;
        signal AInjTest :out std_logic;
        signal BInjTest :out std_logic;
        signal AAROUTER_OEN :out std_logic;
        signal ABROUTER_OEN :out std_logic;
        signal ARROUTER_OEN :out std_logic;
        signal BBROUTER_OEN :out std_logic;
        signal BAROUTER_OEN :out std_logic;
        signal BRROUTER_OEN :out std_logic;
        signal AISEL,AHSEL,ACE,ADSEL :out std_logic;
        signal BISEL,BHSEL,BCE,BDSEL :out std_logic;
        signal AHOLDI :out std_logic;
        signal BHOLDI :out std_logic;
        signal RHOLDI :out std_logic;
        signal RASEL :out std_logic
  );
end VCONTROL;
architecture VHDL_CONTROL of VCONTROL is
component OMUX
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal USED :in std_logic;
        signal USEA :in std_logic;
        signal ReqAdapt :in std_logic;
        signal BlockAdapt :in std_logic;
        signal GNTD :out std_logic;
        signal GNTA :out std_logic;
        signal SELD :out std_logic
  );
end component;
for all: OMUX use entity work.OMUX(fsm);
component RMUX
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal AUSE :in std_logic;
        signal BUSE :in std_logic;
        signal GNTA :out std_logic;
        signal GNTB :out std_logic;
        signal SELA :out std_logic
  );
end component;
for all: RMUX use entity work.RMUX(fsm);
component ROUTER
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal GNTD,GNTA,GNTR :in std_logic;
        signal EQ,LT :in std_logic;
        signal DHOLDO,AHOLDO,RHOLDO :in std_logic;
        signal EOM :in std_logic;
        signal STRB :in std_logic;
        signal GT,GE,LE,PP,BC :in std_logic;
        signal ReqAdapt :out std_logic;
        signal RouterIdle :out std_logic;
        signal DROUTEROEN :out std_logic;
        signal AROUTEROEN :out std_logic;
        signal RROUTEROEN :out std_logic;
        signal PathBlocked :out std_logic;
        signal USE_D,USE_A,USE_R :out std_logic
  );
end component;
for all: ROUTER use entity work.ROUTER(fsm);
component FLOW
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal PathBlocked :in std_logic;
        signal RREQ :in std_logic;
        signal STRBH :in std_logic;
        signal FlitBufferEmpty :out std_logic;
        signal HOLDI :out std_logic;
        signal HSEL :out std_logic;
        signal CE :out std_logic
  );
end component;
for all: FLOW use entity work.FLOW(fsm);
component IMUX
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal InjReq :in std_logic;
        signal ISEL :out std_logic
  );
end component;
for all: IMUX use entity work.IMUX(fsm);
component INJ
  port( signal CLOCK :in std_logic;
        signal RESET :in std_logic;
        signal RSTRB :in std_logic;
        signal RSYNC :in std_logic;
        signal ASTRB :in std_logic;
        signal BSTRB :in std_logic;
        signal APathBlocked :in std_logic;
        signal BPathBlocked :in std_logic;
        signal ARouterIdle :in std_logic;
        signal BRouterIdle :in std_logic;
        signal AFlitBufferEmpty :in std_logic;
        signal BFlitBufferEmpty :in std_logic;
        signal ATestInj :out std_logic;
        signal BTestInj :out std_logic;
        signal AInjReq :out std_logic;
        signal BInjReq :out std_logic;
        signal RHOLDI :out std_logic
  );
end component;
for all: INJ use entity work.INJ(fsm);
  signal AUSEA : std_logic := '0';
  signal AUSEB : std_logic := '0';
  signal AUSER : std_logic := '0';
  signal BUSEA : std_logic := '0';
  signal BUSEB : std_logic := '0';
  signal BUSER : std_logic := '0';
  signal AGNTA : std_logic := '0';
  signal AGNTB : std_logic := '0';
  signal BGNTA : std_logic := '0';
  signal BGNTB : std_logic := '0';
  signal RGNTA : std_logic := '0';
  signal RGNTB : std_logic := '0';
  signal APathBlocked : std_logic := '0';
  signal BPathBlocked : std_logic := '0';
  signal ASHOLD : std_logic := '0';
  signal BSHOLD : std_logic := '0';
  signal AInjReq : std_logic := '0';
  signal BInjReq : std_logic := '0';
  signal AInjUse : std_logic := '0';
  signal BInjUse : std_logic := '0';
  signal AInjFree : std_logic := '0';
  signal BInjFree : std_logic := '0';
  signal ARouterIdle : std_logic := '0';
  signal BRouterIdle : std_logic := '0';
  signal AFlitBufferEmpty : std_logic := '0';
  signal BFlitBufferEmpty : std_logic := '0';
  signal AReqAdapt : std_logic := '0';
  signal BReqAdapt : std_logic := '0';
begin

  A_OMUX: OMUX
    port map ( CLOCK, RESET, AUSEA, BUSEA,
               BReqAdapt, AReqAdapt, AGNTA, AGNTB, ADSEL );

  B_OMUX: OMUX
    port map ( CLOCK, RESET, BUSEB, AUSEB,
               AReqAdapt, BReqAdapt, BGNTB, BGNTA, BDSEL );

  C_RMUX: RMUX
    port map ( CLOCK, RESET, AUSER, BUSER, RGNTA, RGNTB, RASEL );

  A_ROUTER: ROUTER
    port map ( CLOCK, RESET,
               AGNTA, BGNTA, RGNTA,
               AEQ, ALT, AHOLDO, BHOLDO, RHOLDO,
               AEOM, ASTRB, AGT, AGE, ALE, APP, ABC,
               AReqAdapt, ARouterIdle,
               AAROUTER_OEN, ABROUTER_OEN, ARROUTER_OEN,
               APathBlocked, AUSEA, AUSEB, AUSER );

  B_ROUTER: ROUTER
    port map ( CLOCK, RESET,
               BGNTB, AGNTB, RGNTB,
               BEQ, BLT, BHOLDO, AHOLDO, RHOLDO,
               BEOM, BSTRB, BGT, BGE, BLE, BPP, BBC,
               BReqAdapt, BRouterIdle,
               BBROUTER_OEN, BAROUTER_OEN, BRROUTER_OEN,
               BPathBlocked, BUSEB, BUSEA, BUSER );

  A_FLOW: FLOW
    port map ( CLOCK, RESET,
               APathBlocked, AInjReq, ASTRBH,
               AFlitBufferEmpty, AHOLDI, AHSEL, ACE );

  B_FLOW: FLOW
    port map ( CLOCK, RESET,
               BPathBlocked, BInjReq, BSTRBH,
               BFlitBufferEmpty, BHOLDI, BHSEL, BCE );

  A_IMUX: IMUX
    port map ( CLOCK, RESET, AInjReq, AISEL );

  B_IMUX: IMUX
    port map ( CLOCK, RESET, BInjReq, BISEL );

  INJECT: INJ
    port map ( CLOCK, RESET,
               RSTRB, RSYNC, ASTRB, BSTRB,
               APathBlocked, BPathBlocked,
               ARouterIdle, BRouterIdle,
               AFlitBufferEmpty, BFlitBufferEmpty,
               AInjTest, BInjTest,
               AInjReq, BInjReq,
               RHOLDI );

end VHDL_CONTROL;
-- IMUX.VHD - input mux state machine
library IEEE;
use IEEE.std_logic_1164.all;

entity IMUX is
  port( signal CLOCK  : in  std_logic;
        signal RESET  : in  std_logic;
        signal InjReq : in  std_logic;
        signal ISEL   : out std_logic
      );
end IMUX;

architecture fsm of IMUX is
  type ImuxStates is ( Direct, Inject );
  signal State : ImuxStates := Direct;
begin
  IMuxFsm : process (CLOCK,RESET)
  begin
    if RESET='1' then
      State <= Direct;
    elsif CLOCK'event and CLOCK='1' then
      case State is
        when Direct =>
          if InjReq='1' then
            State <= Inject;
          else
            State <= Direct;
          end if;
        when Inject =>
          if InjReq='0' then
            State <= Direct;
          else
            State <= Inject;
          end if;
      end case;
    else
      null;
    end if;
  end process IMuxFsm;

  process (CLOCK,RESET)
  begin
    if RESET='1' then
      ISEL <= '1';
    elsif CLOCK'event and CLOCK='1' then
      if State=Inject and InjReq='1' then
        ISEL <= '0';
      else
        ISEL <= '1';
      end if;
    else
      null;
    end if;
  end process;
end fsm;
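
The IMUX listing above registers ISEL, so the output is expected to switch to the injection path only on the second clock edge after InjReq is raised. A minimal testbench sketch that checks this is given below. It is not part of the original design files: the entity name IMUX_TB, the 20 ns clock period and the stimulus sequence are assumptions made for illustration, and the direct entity instantiation assumes a VHDL-93 simulator.

```vhdl
-- IMUX_TB.VHD - illustrative testbench sketch (hypothetical, not from the design)
library IEEE;
use IEEE.std_logic_1164.all;

entity IMUX_TB is
end IMUX_TB;

architecture sketch of IMUX_TB is
  constant PERIOD : time := 20 ns;          -- assumed clock period
  signal CLOCK  : std_logic := '0';
  signal RESET  : std_logic := '1';
  signal InjReq : std_logic := '0';
  signal ISEL   : std_logic;
  signal Done   : boolean := false;
begin
  DUT : entity work.IMUX(fsm)
    port map ( CLOCK => CLOCK, RESET => RESET,
               InjReq => InjReq, ISEL => ISEL );

  -- free-running clock, stopped once the stimulus has finished
  ClockGen : process
  begin
    while not Done loop
      CLOCK <= '0'; wait for PERIOD/2;
      CLOCK <= '1'; wait for PERIOD/2;
    end loop;
    wait;
  end process ClockGen;

  Stimulus : process
  begin
    wait for PERIOD;                        -- hold reset over the first rising edge
    RESET  <= '0';
    InjReq <= '1';
    -- first edge: State moves Direct -> Inject, ISEL still selects the direct path
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert ISEL='1' report "ISEL dropped one cycle early" severity error;
    -- second edge: State=Inject and InjReq='1', so ISEL goes low
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert ISEL='0' report "ISEL did not select injection" severity error;
    -- request removed: ISEL returns high on the next edge
    InjReq <= '0';
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert ISEL='1' report "ISEL did not return to direct" severity error;
    Done <= true;
    wait;
  end process Stimulus;
end sketch;
```

Sampling 1 ns after each rising edge is an arbitrary choice that keeps the checks clear of delta-cycle ordering.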
-- INJ.VHD - input injection state machine
library IEEE;
use IEEE.std_logic_1164.all;

entity INJ is
  port( signal CLOCK : in std_logic;
        signal RESET : in std_logic;
        signal RSTRB : in std_logic;
        signal RSYNC : in std_logic;
        signal ASTRB : in std_logic;
        signal BSTRB : in std_logic;
        signal APathBlocked : in std_logic;
        signal BPathBlocked : in std_logic;
        signal ARouterIdle : in std_logic;
        signal BRouterIdle : in std_logic;
        signal AFlitBufferEmpty : in std_logic;
        signal BFlitBufferEmpty : in std_logic;
        signal ATestInj : out std_logic;
        signal BTestInj : out std_logic;
        signal AInjReq : out std_logic;
        signal BInjReq : out std_logic;
        signal RHOLDI : out std_logic
      );
end INJ;
architecture fsm of INJ is
  type InjStates is ( FavourA,
                      RequestA,
                      InjectA,
                      FavourB,
                      RequestB,
                      InjectB );
  signal State : InjStates := FavourA;
  signal RHOLD : std_logic := '0';
  signal iRHOLDI : std_logic := '0';
  signal InjReqA : std_logic := '0';
  signal InjReqB : std_logic := '0';
  signal TestInjA : std_logic := '0';
  signal TestInjB : std_logic := '0';
begin
  InjFsm : process (CLOCK,RESET)
  begin
    if RESET='1' then
      State <= FavourA;
    elsif CLOCK'event and CLOCK='1' then
      case State is
        when FavourA =>
          if RSYNC='1' and RSTRB='1' then
            if ARouterIdle='1'
               and AFlitBufferEmpty='1'
               and ASTRB='0' then
              State <= RequestA;
            elsif BRouterIdle='1'
               and BFlitBufferEmpty='1'
               and BSTRB='0' then
              State <= RequestB;
            else
              State <= FavourB;
            end if;
          else
            State <= FavourB;
          end if;
        when FavourB =>
          if RSYNC='1' and RSTRB='1' then
            if BRouterIdle='1'
               and BFlitBufferEmpty='1'
               and BSTRB='0' then
              State <= RequestB;
            elsif ARouterIdle='1'
               and AFlitBufferEmpty='1'
               and ASTRB='0' then
              State <= RequestA;
            else
              State <= FavourA;
            end if;
          else
            State <= FavourA;
          end if;
        when RequestA =>
          if ASTRB='0' then
            State <= InjectA;
          else
            State <= FavourB;
          end if;
        when RequestB =>
          if BSTRB='0' then
            State <= InjectB;
          else
            State <= FavourA;
          end if;
        when InjectA =>
          if RSYNC='0' and RSTRB='1' and iRHOLDI='0' then
            State <= FavourB;
          else
            State <= InjectA;
          end if;
        when InjectB =>
          if RSYNC='0' and RSTRB='1' and iRHOLDI='0' then
            State <= FavourA;
          else
            State <= InjectB;
          end if;
      end case;
    end if;
  end process InjFsm;

  GrantRequestLogic: process ( State,RSTRB,RSYNC,iRHOLDI,
                               ASTRB,BSTRB,
                               AFlitBufferEmpty,BFlitBufferEmpty,
                               ARouterIdle,BRouterIdle,
                               APathBlocked,BPathBlocked )
  begin
    case State is
      when FavourA =>
        if RSYNC='1' and RSTRB='1' then
          if ARouterIdle='1'
             and AFlitBufferEmpty='1'
             and ASTRB='0' then
            InjReqA <= '1';
            InjReqB <= '0';
            TestInjA <= '0';
            TestInjB <= '0';
          elsif BRouterIdle='1'
             and BFlitBufferEmpty='1'
             and BSTRB='0' then
            InjReqA <= '0';
            InjReqB <= '1';
            TestInjA <= '0';
            TestInjB <= '0';
          else
            InjReqA <= '0';
            InjReqB <= '0';
            TestInjA <= '0';
            TestInjB <= '0';
          end if;
        else
          InjReqA <= '0';
          InjReqB <= '0';
          TestInjA <= '0';
          TestInjB <= '0';
        end if;
        RHOLD <= '1';
      when FavourB =>
        if RSYNC='1' and RSTRB='1' then
          if BRouterIdle='1'
             and BFlitBufferEmpty='1'
             and BSTRB='0' then
            InjReqB <= '1';
            InjReqA <= '0';
            TestInjB <= '0';
            TestInjA <= '0';
          elsif ARouterIdle='1'
             and AFlitBufferEmpty='1'
             and ASTRB='0' then
            InjReqB <= '0';
            InjReqA <= '1';
            TestInjB <= '0';
            TestInjA <= '0';
          else
            InjReqB <= '0';
            InjReqA <= '0';
            TestInjB <= '0';
            TestInjA <= '0';
          end if;
        else
          InjReqB <= '0';
          InjReqA <= '0';
          TestInjB <= '0';
          TestInjA <= '0';
        end if;
        RHOLD <= '1';
      when RequestA =>
        if ASTRB='0' then
          InjReqA <= '1';
          InjReqB <= '0';
          RHOLD <= '0';
        else
          InjReqA <= '0';
          InjReqB <= '0';
          RHOLD <= '1';
        end if;
        TestInjA <= '1';
        TestInjB <= '0';
      when RequestB =>
        if BSTRB='0' then
          InjReqA <= '0';
          InjReqB <= '1';
          RHOLD <= '0';
        else
          InjReqA <= '0';
          InjReqB <= '0';
          RHOLD <= '1';
        end if;
        TestInjA <= '0';
        TestInjB <= '1';
      when InjectA =>
        TestInjA <= '0';
        TestInjB <= '0';
        if RSYNC='0' and RSTRB='1' and iRHOLDI='0' then
          InjReqA <= '0';
          InjReqB <= '0';
          RHOLD <= '1';
        else
          InjReqA <= '1';
          InjReqB <= '0';
          RHOLD <= APathBlocked;
        end if;
      when InjectB =>
        TestInjA <= '0';
        TestInjB <= '0';
        if RSYNC='0' and RSTRB='1' and iRHOLDI='0' then
          InjReqA <= '0';
          InjReqB <= '0';
          RHOLD <= '1';
        else
          InjReqA <= '0';
          InjReqB <= '1';
          RHOLD <= BPathBlocked;
        end if;
    end case;
  end process;

  process (CLOCK,RESET)
  begin
    if RESET='1' then
      iRHOLDI <= '0';
      RHOLDI <= '0';
    elsif CLOCK'event and CLOCK='1' then
      iRHOLDI <= RHOLD;
      RHOLDI <= RHOLD;
    else
      null;
    end if;
  end process;

  ATestInj <= TestInjA;
  BTestInj <= TestInjB;
  AInjReq <= InjReqA;
  BInjReq <= InjReqB;
end fsm;
-- FLOW.VHD - flow control state machine
library IEEE;
use IEEE.std_logic_1164.all;

entity FLOW is
  port( signal CLOCK : in std_logic;
        signal RESET : in std_logic;
        signal PathBlocked : in std_logic;
        signal RREQ : in std_logic;
        signal STRBH : in std_logic;
        signal FlitBufferEmpty : out std_logic;
        signal HOLDI : out std_logic;
        signal HSEL : out std_logic;
        signal CE : out std_logic
      );
end FLOW;
architecture fsm of FLOW is
  type FLOW_STATES is (NOTSTOPPED,STOPPED,RECOVER,HOLDING);
  signal STATE : FLOW_STATES := NOTSTOPPED;
begin
  FLOW_FSM : process (CLOCK,RESET)
  begin
    if RESET='1' then
      STATE <= NOTSTOPPED;
    elsif CLOCK'event and CLOCK='1' then
      case STATE is
        when NOTSTOPPED =>
          if PathBlocked='1' then
            STATE <= STOPPED;
          else
            STATE <= NOTSTOPPED;
          end if;
        when STOPPED =>
          if STRBH='1' then
            if PathBlocked='1' then
              STATE <= HOLDING;
            else
              STATE <= RECOVER;
            end if;
          else
            if PathBlocked='1' then
              STATE <= STOPPED;
            else
              STATE <= NOTSTOPPED;
            end if;
          end if;
        when HOLDING =>
          if PathBlocked='1' then
            STATE <= HOLDING;
          else
            STATE <= RECOVER;
          end if;
        when RECOVER =>
          if PathBlocked='1' then
            STATE <= STOPPED;
          else
            STATE <= NOTSTOPPED;
          end if;
      end case;
    else
      null;
    end if;
  end process FLOW_FSM;

  HOLDI_REG : process (CLOCK,RESET)
  begin
    if RESET='1' then
      HOLDI <= '0';
    elsif CLOCK'event and CLOCK='1' then
      if PathBlocked='1' or RREQ='1' then
        HOLDI <= '1';
      else
        HOLDI <= '0';
      end if;
    else
      null;
    end if;
  end process HOLDI_REG;

  CE <= '1' when PathBlocked='0' else '0';
  FlitBufferEmpty <= '1' when STATE=NOTSTOPPED else '0';

  HSEL_LG : process ( STATE )
  begin
    if STATE=STOPPED
       or STATE=HOLDING then
      HSEL <= '1';
    else
      HSEL <= '0';
    end if;
  end process;
end fsm;
-- ROUTER.VHD - data router state machine - with go faster stripes
library IEEE;
use IEEE.std_logic_1164.all;

entity ROUTER is
  port( signal CLOCK : in std_logic;
        signal RESET : in std_logic;
        signal GNTD,GNTA,GNTR : in std_logic;
        signal EQ,LT : in std_logic;
        signal DHOLDO,AHOLDO,RHOLDO : in std_logic;
        signal EOM : in std_logic;
        signal STRB : in std_logic;
        signal GT,GE,LE,PP,BC : in std_logic;
        signal ReqAdapt : out std_logic;
        signal RouterIdle : out std_logic;
        signal DROUTEROEN : out std_logic;
        signal AROUTEROEN : out std_logic;
        signal RROUTEROEN : out std_logic;
        signal PathBlocked : out std_logic;
        signal USE_D,USE_A,USE_R : out std_logic
      );
end ROUTER;
architecture fsm of ROUTER is
  type ROUTER_STATES is ( IDLE,
                          DIRECT,
                          ADAPT,
                          THRUBLOCK,
                          RXHDR,
                          RX,
                          BCSTBLOCK,
                          DIRCTBCSTHDR,
                          ADAPTBCSTHDR,
                          DIRCTBCSTBODY,
                          ADAPTBCSTBODY,
                          DIRCTBCST,
                          ADAPTBCST,
                          DIRCTBCSTDBLOCK,
                          ADAPTBCSTDBLOCK,
                          DIRCTBCSTRBLOCK,
                          ADAPTBCSTRBLOCK );
  signal STATE : ROUTER_STATES := IDLE;
  signal DROUTER_OEN : std_logic;
  signal AROUTER_OEN : std_logic := '0';
  signal RROUTER_OEN : std_logic := '0';
  signal DOPEN : std_logic := '1';
  signal AOPEN : std_logic := '0';
  signal ROPEN : std_logic := '0';
begin
  RouterIdle <= '1' when STATE=IDLE else '0';
  DOPEN <= '1' when GNTD='1' and DHOLDO='0' else '0';
  AOPEN <= '1' when GNTA='1' and AHOLDO='0' else '0';
  ROPEN <= '1' when GNTR='1' and RHOLDO='0' else '0';
  ROUTER_FSM : process (CLOCK,RESET)
  begin
    if RESET='1' then
      STATE <= IDLE;
    elsif CLOCK'event and CLOCK='1' then
      if STRB='1' then
        case STATE is
          when IDLE =>
            if GE='1' then
              if DOPEN='1' then
                STATE <= DIRECT;
              else
                STATE <= THRUBLOCK;
              end if;
            elsif LT='1' then -- inertial header flit
              STATE <= IDLE;
            elsif PP='1' then
              if EQ='1' then -- PP at its destination
                STATE <= RXHDR;
              elsif DOPEN='1' then
                STATE <= DIRECT;
              else
                STATE <= THRUBLOCK;
              end if;
            elsif BC='1' then
              if EQ='1' then -- BCST at its extent, same as PP
                STATE <= RXHDR;
              else -- BCST through routing
                if DOPEN='1' then
                  STATE <= DIRCTBCSTBODY;
                else
                  STATE <= BCSTBLOCK;
                end if;
              end if;
            elsif LE='1' then
              if EQ='1' then -- LE reached extent
                STATE <= IDLE;
              elsif DOPEN='1' then -- LE thru routing
                STATE <= DIRECT;
              else
                STATE <= THRUBLOCK;
              end if;
            else
              STATE <= IDLE;
            end if;
          when THRUBLOCK =>
            if DOPEN='1' then
              STATE <= DIRECT;
            elsif AOPEN='1' then
              STATE <= ADAPT;
            else
              STATE <= THRUBLOCK;
            end if;
          when BCSTBLOCK =>
            if DOPEN='1' then
              STATE <= DIRCTBCSTHDR;
            elsif AOPEN='1' then
              STATE <= ADAPTBCSTHDR;
            else
              STATE <= BCSTBLOCK;
            end if;
          when DIRCTBCSTHDR =>
            if DOPEN='1' then
              STATE <= DIRCTBCSTBODY;
            else
              STATE <= DIRCTBCSTHDR;
            end if;
          when ADAPTBCSTHDR =>
            if AOPEN='1' then
              STATE <= ADAPTBCSTBODY;
            else
              STATE <= ADAPTBCSTHDR;
            end if;
          when RXHDR =>
            if ROPEN='1' then
              STATE <= RX;
            else
              STATE <= RXHDR;
            end if;
          when RX =>
            if ROPEN='1' and EOM='1' then
              STATE <= IDLE;
            else
              STATE <= RX;
            end if;
          when DIRECT =>
            if DOPEN='1' and EOM='1' then
              STATE <= IDLE;
            else
              STATE <= DIRECT;
            end if;
          when ADAPT =>
            if AOPEN='1' and EOM='1' then
              STATE <= IDLE;
            else
              STATE <= ADAPT;
            end if;
          when ADAPTBCSTBODY =>
            if AOPEN='1' and ROPEN='1' then
              STATE <= ADAPTBCST;
            else
              STATE <= ADAPTBCSTBODY;
            end if;
          when DIRCTBCSTBODY =>
            if DOPEN='1' and ROPEN='1' then
              STATE <= DIRCTBCST;
            else
              STATE <= DIRCTBCSTBODY;
            end if;
          when DIRCTBCST =>
            if DOPEN='1' and ROPEN='1' and EOM='1' then
              STATE <= IDLE;
            elsif DOPEN='1' and ROPEN='1' then
              STATE <= DIRCTBCST;
            elsif DOPEN='1' then
              STATE <= DIRCTBCSTRBLOCK;
            elsif ROPEN='1' then
              STATE <= DIRCTBCSTDBLOCK;
            else
              STATE <= DIRCTBCST;
            end if;
          when ADAPTBCST =>
            if AOPEN='1' and ROPEN='1' and EOM='1' then
              STATE <= IDLE;
            elsif AOPEN='1' and ROPEN='1' then
              STATE <= ADAPTBCST;
            elsif AOPEN='1' then
              STATE <= ADAPTBCSTRBLOCK;
            elsif ROPEN='1' then
              STATE <= ADAPTBCSTDBLOCK;
            else
              STATE <= ADAPTBCST;
            end if;
          when DIRCTBCSTDBLOCK =>
            -- reception channel has had the data
            if DOPEN='1' and EOM='1' then
              STATE <= IDLE;
            elsif DOPEN='1' then
              STATE <= DIRCTBCST;
            else
              STATE <= DIRCTBCSTDBLOCK;
            end if;
          when DIRCTBCSTRBLOCK =>
            -- direct channel has had the data
            if ROPEN='1' and EOM='1' then
              STATE <= IDLE;
            elsif ROPEN='1' then
              STATE <= DIRCTBCST;
            else
              STATE <= DIRCTBCSTRBLOCK;
            end if;
          when ADAPTBCSTDBLOCK =>
            -- reception channel has had the data
            if AOPEN='1' and EOM='1' then
              STATE <= IDLE;
            elsif AOPEN='1' then
              STATE <= ADAPTBCST;
            else
              STATE <= ADAPTBCSTDBLOCK;
            end if;
          when ADAPTBCSTRBLOCK =>
            -- direct channel has had the data
            if ROPEN='1' and EOM='1' then
              STATE <= IDLE;
            elsif ROPEN='1' then
              STATE <= ADAPTBCST;
            else
              STATE <= ADAPTBCSTRBLOCK;
            end if;
        end case;
      end if;
    else
      null;
    end if;
  end process ROUTER_FSM;
  ROUTER_LG : process ( STATE,EOM,EQ,LT,STRB,GE,PP,BC,LE,
                        DOPEN,AOPEN,ROPEN,GNTD,GNTA,GNTR )
  begin
    case STATE is
      when IDLE =>
        AROUTER_OEN <= '0';
        if GE='1' then
          if DOPEN='1' then
            -- STATE <= DIRECT;
            ReqAdapt <= '0';
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '0';
          else
            -- STATE <= THRUBLOCK;
            ReqAdapt <= '1';
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
          end if;
        elsif LT='1' then -- inertial header flit
          -- STATE <= IDLE;
          ReqAdapt <= '0';
          DROUTER_OEN <= '1';
          RROUTER_OEN <= '0';
        elsif PP='1' then
          if EQ='1' then -- PP at its destination
            -- STATE <= RXHDR;
            ReqAdapt <= '0';
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '1';
          elsif DOPEN='1' then
            -- STATE <= DIRECT;
            ReqAdapt <= '0';
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '0';
          else
            -- STATE <= THRUBLOCK;
            ReqAdapt <= '1';
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
          end if;
        elsif BC='1' then
          if EQ='1' then -- BCST at its extent, same as PP RX
            -- STATE <= RXHDR;
            ReqAdapt <= '0';
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '1';
          else -- BCST through routing
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
            if DOPEN='1' then
              -- STATE <= DIRCTBCSTBODY;
              ReqAdapt <= '0';
            else
              -- STATE <= BCSTBLOCK;
              ReqAdapt <= '1';
            end if;
          end if;
        elsif LE='1' then
          if EQ='1' then -- LE reached extent
            -- STATE <= IDLE;
            ReqAdapt <= '0';
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '0';
          elsif DOPEN='1' then
            -- STATE <= DIRECT;
            ReqAdapt <= '0';
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '0';
          else
            -- STATE <= THRUBLOCK;
            ReqAdapt <= '1';
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
          end if;
        else
          -- STATE <= IDLE;
          ReqAdapt <= '0';
          DROUTER_OEN <= '1';
          RROUTER_OEN <= '0';
        end if;
      when THRUBLOCK =>
        ReqAdapt <= '1';
        RROUTER_OEN <= '0';
        if DOPEN='1' then
          -- STATE <= DIRECT;
          if STRB='1' then
            DROUTER_OEN <= '1';
            AROUTER_OEN <= '0';
          else
            DROUTER_OEN <= '0';
            AROUTER_OEN <= '0';
          end if;
        elsif AOPEN='1' then
          -- STATE <= ADAPT;
          if STRB='1' then
            DROUTER_OEN <= '0';
            AROUTER_OEN <= '1';
          else
            DROUTER_OEN <= '0';
            AROUTER_OEN <= '0';
          end if;
        else
          -- STATE <= THRUBLOCK;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '0';
        end if;
      when BCSTBLOCK =>
        ReqAdapt <= '1';
        RROUTER_OEN <= '0';
        if DOPEN='1' then
          -- STATE <= DIRCTBCSTHDR;
          DROUTER_OEN <= '1';
          AROUTER_OEN <= '0';
        elsif AOPEN='1' then
          -- STATE <= ADAPTBCSTHDR;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '1';
        else
          -- STATE <= BCSTBLOCK;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '0';
        end if;
      when DIRCTBCSTHDR =>
        AROUTER_OEN <= '0';
        RROUTER_OEN <= '0';
        ReqAdapt <= '0';
        if DOPEN='1' then
          -- STATE <= DIRCTBCSTBODY;
          DROUTER_OEN <= '0';
        else
          -- STATE <= DIRCTBCSTHDR;
          DROUTER_OEN <= '1';
        end if;
      when ADAPTBCSTHDR =>
        -- STATE <= ADAPTBCSTBODY;
        ReqAdapt <= '0';
        DROUTER_OEN <= '0';
        RROUTER_OEN <= '0';
        if AOPEN='1' then
          -- STATE <= ADAPTBCSTBODY;
          AROUTER_OEN <= '0';
        else
          -- STATE <= ADAPTBCSTHDR;
          AROUTER_OEN <= '1';
        end if;
      when RXHDR =>
        ReqAdapt <= '0';
        DROUTER_OEN <= '0';
        AROUTER_OEN <= '0';
        if ROPEN='1' then
          -- STATE <= RX;
          if STRB='1' then
            RROUTER_OEN <= '1';
          else
            RROUTER_OEN <= '1';
          end if;
        else
          -- STATE <= RXHDR;
          RROUTER_OEN <= '1';
        end if;
      when RX =>
        ReqAdapt <= '0';
        AROUTER_OEN <= '0';
        if ROPEN='1' and EOM='1' then
          -- STATE <= IDLE;
          if STRB='1' then
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '0';
          else
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '1';
          end if;
        else
          -- STATE <= RX;
          DROUTER_OEN <= '0';
          RROUTER_OEN <= '1';
        end if;
      when DIRECT =>
        ReqAdapt <= '0';
        AROUTER_OEN <= '0';
        RROUTER_OEN <= '0';
        DROUTER_OEN <= '1';
      when ADAPT =>
        ReqAdapt <= '0';
        RROUTER_OEN <= '0';
        if AOPEN='1' and EOM='1' then
          -- STATE <= IDLE;
          if STRB='1' then
            DROUTER_OEN <= '1';
            AROUTER_OEN <= '0';
          else
            DROUTER_OEN <= '0';
            AROUTER_OEN <= '1';
          end if;
        else
          -- STATE <= ADAPT;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '1';
        end if;
      when DIRCTBCSTBODY =>
        ReqAdapt <= '0';
        AROUTER_OEN <= '0';
        if DOPEN='1' and ROPEN='1' then
          -- STATE <= DIRCTBCST;
          if STRB='1' then
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '1';
          else
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
          end if;
        else
          -- STATE <= DIRCTBCSTBODY;
          DROUTER_OEN <= '0';
          RROUTER_OEN <= '0';
        end if;
      when ADAPTBCSTBODY =>
        ReqAdapt <= '0';
        DROUTER_OEN <= '0';
        if AOPEN='1' and ROPEN='1' then
          -- STATE <= ADAPTBCST;
          if STRB='1' then
            AROUTER_OEN <= '1';
            RROUTER_OEN <= '1';
          else
            AROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
          end if;
        else
          -- STATE <= ADAPTBCSTBODY;
          AROUTER_OEN <= '0';
          RROUTER_OEN <= '0';
        end if;
      when DIRCTBCST =>
        ReqAdapt <= '0';
        AROUTER_OEN <= '0';
        if DOPEN='1' and ROPEN='1' and EOM='1' then
          -- STATE <= IDLE;
          if STRB='1' then
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '0';
          else
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '1';
          end if;
        elsif DOPEN='1' and ROPEN='1' then
          -- STATE <= DIRCTBCST;
          DROUTER_OEN <= '1';
          RROUTER_OEN <= '1';
        elsif DOPEN='1' then
          -- STATE <= DIRCTBCSTRBLOCK;
          DROUTER_OEN <= '0';
          RROUTER_OEN <= '1';
        elsif ROPEN='1' then
          -- STATE <= DIRCTBCSTDBLOCK;
          DROUTER_OEN <= '1';
          RROUTER_OEN <= '0';
        else
          -- STATE <= DIRCTBCST;
          DROUTER_OEN <= '1';
          RROUTER_OEN <= '1';
        end if;
      when ADAPTBCST =>
        ReqAdapt <= '0';
        if AOPEN='1' and ROPEN='1' and EOM='1' then
          -- STATE <= IDLE;
          if STRB='1' then
            DROUTER_OEN <= '1';
            AROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
          else
            DROUTER_OEN <= '0';
            AROUTER_OEN <= '1';
            RROUTER_OEN <= '1';
          end if;
        elsif AOPEN='1' and ROPEN='1' then
          -- STATE <= ADAPTBCST;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '1';
          RROUTER_OEN <= '1';
        elsif AOPEN='1' then
          -- STATE <= ADAPTBCSTRBLOCK;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '0';
          RROUTER_OEN <= '1';
        elsif ROPEN='1' then
          -- STATE <= ADAPTBCSTDBLOCK;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '1';
          RROUTER_OEN <= '0';
        else
          -- STATE <= ADAPTBCST;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '1';
          RROUTER_OEN <= '1';
        end if;
      when DIRCTBCSTDBLOCK =>
        ReqAdapt <= '0';
        DROUTER_OEN <= '1';
        AROUTER_OEN <= '0';
        -- reception channel has had the data
        if DOPEN='1' and EOM='1' then
          -- STATE <= IDLE;
          if STRB='1' then
            RROUTER_OEN <= '0';
          else
            RROUTER_OEN <= '0';
          end if;
        elsif DOPEN='1' then
          -- STATE <= DIRCTBCST;
          RROUTER_OEN <= '1';
        else
          -- STATE <= DIRCTBCSTDBLOCK;
          RROUTER_OEN <= '0';
        end if;
      when DIRCTBCSTRBLOCK =>
        ReqAdapt <= '0';
        AROUTER_OEN <= '0';
        -- direct channel has had the data
        if ROPEN='1' and EOM='1' then
          -- STATE <= IDLE;
          if STRB='1' then
            DROUTER_OEN <= '1';
            RROUTER_OEN <= '0';
          else
            DROUTER_OEN <= '0';
            RROUTER_OEN <= '1';
          end if;
        elsif ROPEN='1' then
          -- STATE <= DIRCTBCST;
          DROUTER_OEN <= '1';
          RROUTER_OEN <= '1';
        else
          -- STATE <= DIRCTBCSTRBLOCK;
          DROUTER_OEN <= '0';
          RROUTER_OEN <= '1';
        end if;
      when ADAPTBCSTDBLOCK =>
        ReqAdapt <= '0';
        -- reception channel has had the data
        if AOPEN='1' and EOM='1' then
          -- STATE <= IDLE;
          if STRB='1' then
            DROUTER_OEN <= '1';
            AROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
          else
            DROUTER_OEN <= '0';
            AROUTER_OEN <= '1';
            RROUTER_OEN <= '0';
          end if;
        elsif AOPEN='1' then
          -- STATE <= ADAPTBCST;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '1';
          RROUTER_OEN <= '1';
        else
          -- STATE <= ADAPTBCSTDBLOCK;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '1';
          RROUTER_OEN <= '0';
        end if;
      when ADAPTBCSTRBLOCK =>
        ReqAdapt <= '0';
        -- direct channel has had the data
        if ROPEN='1' and EOM='1' then
          -- STATE <= IDLE;
          if STRB='1' then
            DROUTER_OEN <= '1';
            AROUTER_OEN <= '0';
            RROUTER_OEN <= '0';
          else
            DROUTER_OEN <= '0';
            AROUTER_OEN <= '0';
            RROUTER_OEN <= '1';
          end if;
        elsif ROPEN='1' then
          -- STATE <= ADAPTBCST;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '1';
          RROUTER_OEN <= '1';
        else
          -- STATE <= ADAPTBCSTRBLOCK;
          DROUTER_OEN <= '0';
          AROUTER_OEN <= '0';
          RROUTER_OEN <= '1';
        end if;
    end case;
  end process ROUTER_LG;
  BLK_LG : process ( STATE,STRB,GE,GT,DOPEN,AOPEN,ROPEN )
  begin
    case STATE is
      when IDLE =>
        if GE='1' or GT='1' then
          if DOPEN='1' then
            PathBlocked <= '0';
          else
            PathBlocked <= '1';
          end if;
        else
          PathBlocked <= '0';
        end if;
      when THRUBLOCK =>
        PathBlocked <= '1';
      when BCSTBLOCK =>
        PathBlocked <= '1';
      when DIRCTBCSTHDR =>
        if DOPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
      when DIRCTBCSTDBLOCK =>
        if DOPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
      when DIRECT =>
        if DOPEN='0' then
          PathBlocked <= '1';
        else
          PathBlocked <= '0';
        end if;
      when ADAPTBCSTHDR =>
        if AOPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
      when ADAPT =>
        if AOPEN='0' then
          PathBlocked <= '1';
        else
          PathBlocked <= '0';
        end if;
      when ADAPTBCSTDBLOCK =>
        if AOPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
      when RXHDR =>
        if ROPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
      when RX =>
        if ROPEN='0' then
          PathBlocked <= '1';
        else
          PathBlocked <= '0';
        end if;
      when DIRCTBCSTRBLOCK =>
        if ROPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
      when ADAPTBCSTRBLOCK =>
        if ROPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
      when DIRCTBCSTBODY =>
        if STRB='1' then
          PathBlocked <= '1';
        else
          PathBlocked <= '0';
        end if;
      when ADAPTBCSTBODY =>
        if STRB='1' then
          PathBlocked <= '1';
        else
          PathBlocked <= '0';
        end if;
      when DIRCTBCST =>
        if DOPEN='1' and ROPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
      when ADAPTBCST =>
        if AOPEN='1' and ROPEN='1' then
          PathBlocked <= '0';
        else
          PathBlocked <= '1';
        end if;
    end case;
  end process BLK_LG;
  USE_LG : process ( STATE,STRB )
  begin
    case STATE is
      when IDLE =>
        USE_A <= '0';
        USE_R <= '0';
        if STRB='1' then
          USE_D <= '1';
        else
          USE_D <= '0';
        end if;
      when THRUBLOCK =>
        USE_R <= '0';
        USE_D <= '1';
        USE_A <= '1';
      when BCSTBLOCK =>
        USE_R <= '0';
        USE_D <= '1';
        USE_A <= '1';
      when DIRCTBCSTHDR =>
        USE_D <= '1';
        USE_A <= '0';
        USE_R <= '0';
      when ADAPTBCSTHDR =>
        USE_D <= '0';
        USE_A <= '1';
        USE_R <= '0';
      when RXHDR =>
        USE_D <= '0';
        USE_A <= '0';
        if STRB='1' then
          USE_R <= '1';
        else
          USE_R <= '0';
        end if;
      when RX =>
        USE_D <= '0';
        USE_A <= '0';
        USE_R <= '1';
      when DIRECT =>
        USE_A <= '0';
        USE_R <= '0';
        USE_D <= '1';
      when ADAPT =>
        USE_D <= '0';
        USE_R <= '0';
        USE_A <= '1';
      when DIRCTBCSTBODY =>
        USE_D <= '1';
        USE_A <= '0';
        if STRB='1' then
          USE_R <= '1';
        else
          USE_R <= '0';
        end if;
      when ADAPTBCSTBODY =>
        USE_D <= '0';
        USE_A <= '1';
        if STRB='1' then
          USE_R <= '1';
        else
          USE_R <= '0';
        end if;
      when DIRCTBCST =>
        USE_A <= '0';
        USE_D <= '1';
        USE_R <= '1';
      when ADAPTBCST =>
        USE_D <= '0';
        USE_A <= '1';
        USE_R <= '1';
      when DIRCTBCSTDBLOCK =>
        USE_A <= '0';
        USE_D <= '1';
        USE_R <= '1';
      when DIRCTBCSTRBLOCK =>
        USE_A <= '0';
        USE_D <= '1';
        USE_R <= '1';
      when ADAPTBCSTDBLOCK =>
        USE_D <= '0';
        USE_A <= '1';
        USE_R <= '1';
      when ADAPTBCSTRBLOCK =>
        USE_D <= '0';
        USE_A <= '1';
        USE_R <= '1';
    end case;
  end process USE_LG;

  process (CLOCK,RESET)
  begin
    if RESET='1' then
      DROUTEROEN <= '1';
      AROUTEROEN <= '0';
      RROUTEROEN <= '0';
    elsif CLOCK'event and CLOCK='1' then
      DROUTEROEN <= DROUTER_OEN;
      AROUTEROEN <= AROUTER_OEN;
      RROUTEROEN <= RROUTER_OEN;
    else
      null;
    end if;
  end process;
end fsm;
-- OMUX.VHD - output channel mux state machine
library IEEE;
use IEEE.std_logic_1164.all;

entity OMUX is
  port( signal CLOCK      : in  std_logic;
        signal RESET      : in  std_logic;
        signal USED       : in  std_logic;
        signal USEA       : in  std_logic;
        signal ReqAdapt   : in  std_logic;
        signal BlockAdapt : in  std_logic;
        signal GNTD       : out std_logic;
        signal GNTA       : out std_logic;
        signal SELD       : out std_logic
      );
end OMUX;

architecture fsm of OMUX is
  type OMUX_STATES is (DIRECT,ADAPT);
  signal STATE : OMUX_STATES := DIRECT;
begin
  OMUX_FSM : process (CLOCK,RESET)
  begin
    if RESET='1' then
      STATE <= DIRECT;
    elsif CLOCK'event and CLOCK='1' then
      case STATE is
        when DIRECT =>
          if USED='1' then
            STATE <= DIRECT;
          elsif ReqAdapt='1' and BlockAdapt='0' then
            STATE <= ADAPT;
          else
            STATE <= DIRECT;
          end if;
        when ADAPT =>
          if USEA='0' then
            STATE <= DIRECT;
          else
            STATE <= ADAPT;
          end if;
      end case;
    else
      null;
    end if;
  end process OMUX_FSM;

  GNTD <= '1' when STATE=DIRECT else '0';
  GNTA <= '1' when STATE=ADAPT else '0';
  SELD <= '1' when STATE=ADAPT else '0';
end fsm;
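
The arbitration in the OMUX listing above gives the direct input priority: the adaptive request is granted only while USED is low, and the grant is held until USEA falls. The sketch below walks through that sequence. It is an illustration only; the entity name OMUX_TB, the 20 ns clock period and the stimulus values are assumptions, and the direct entity instantiation assumes a VHDL-93 simulator.

```vhdl
-- OMUX_TB.VHD - illustrative testbench sketch (hypothetical, not from the design)
library IEEE;
use IEEE.std_logic_1164.all;

entity OMUX_TB is
end OMUX_TB;

architecture sketch of OMUX_TB is
  constant PERIOD : time := 20 ns;          -- assumed clock period
  signal CLOCK, USED, USEA, ReqAdapt, BlockAdapt : std_logic := '0';
  signal RESET : std_logic := '1';
  signal GNTD, GNTA, SELD : std_logic;
  signal Done : boolean := false;
begin
  DUT : entity work.OMUX(fsm)
    port map ( CLOCK => CLOCK, RESET => RESET,
               USED => USED, USEA => USEA,
               ReqAdapt => ReqAdapt, BlockAdapt => BlockAdapt,
               GNTD => GNTD, GNTA => GNTA, SELD => SELD );

  ClockGen : process
  begin
    while not Done loop
      CLOCK <= '0'; wait for PERIOD/2;
      CLOCK <= '1'; wait for PERIOD/2;
    end loop;
    wait;
  end process ClockGen;

  Stimulus : process
  begin
    wait for PERIOD;                        -- hold reset over the first rising edge
    RESET <= '0';
    -- direct input busy: an adaptive request must not be granted
    USED <= '1'; ReqAdapt <= '1';
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert GNTD='1' and GNTA='0'
      report "adaptive grant given while direct input in use" severity error;
    -- direct input released: the adaptive request is granted
    USED <= '0'; USEA <= '1';
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert GNTA='1' and SELD='1' and GNTD='0'
      report "adaptive request not granted" severity error;
    -- adaptive input finished: grant returns to the direct input
    USEA <= '0'; ReqAdapt <= '0';
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert GNTD='1' and GNTA='0'
      report "grant did not return to direct" severity error;
    Done <= true;
    wait;
  end process Stimulus;
end sketch;
```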
-- RMUX.VHD - receive mux state machine
library IEEE;
use IEEE.std_logic_1164.all;

entity RMUX is
  port( signal CLOCK : in  std_logic;
        signal RESET : in  std_logic;
        signal AUSE  : in  std_logic;
        signal BUSE  : in  std_logic;
        signal GNTA  : out std_logic;
        signal GNTB  : out std_logic;
        signal SELA  : out std_logic
      );
end RMUX;

architecture fsm of RMUX is
  type RMUX_STATES is (GRANT_A,GRANT_B);
  signal STATE : RMUX_STATES := GRANT_A;
begin
  RMUX_FSM : process (CLOCK,RESET)
  begin
    if RESET='1' then
      STATE <= GRANT_A;
    elsif CLOCK'event and CLOCK='1' then
      case STATE is
        when GRANT_A =>
          if AUSE='1' then
            STATE <= GRANT_A;
          else
            STATE <= GRANT_B;
          end if;
        when GRANT_B =>
          if BUSE='1' then
            STATE <= GRANT_B;
          else
            STATE <= GRANT_A;
          end if;
      end case;
    else
      null;
    end if;
  end process RMUX_FSM;

  GNTA <= '1' when STATE=GRANT_A else '0';
  GNTB <= '1' when STATE=GRANT_B else '0';
  SELA <= '0' when STATE=GRANT_A else '1';
end fsm;
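
In the RMUX listing above the grant alternates between the two inputs whenever the current holder is not using the reception channel, and stays put while it is. A short testbench sketch of that behaviour follows. The entity name RMUX_TB, the 20 ns clock period and the stimulus are assumptions for illustration rather than part of the original design, and the direct entity instantiation assumes a VHDL-93 simulator.

```vhdl
-- RMUX_TB.VHD - illustrative testbench sketch (hypothetical, not from the design)
library IEEE;
use IEEE.std_logic_1164.all;

entity RMUX_TB is
end RMUX_TB;

architecture sketch of RMUX_TB is
  constant PERIOD : time := 20 ns;          -- assumed clock period
  signal CLOCK, AUSE, BUSE : std_logic := '0';
  signal RESET : std_logic := '1';
  signal GNTA, GNTB, SELA : std_logic;
  signal Done : boolean := false;
begin
  DUT : entity work.RMUX(fsm)
    port map ( CLOCK => CLOCK, RESET => RESET,
               AUSE => AUSE, BUSE => BUSE,
               GNTA => GNTA, GNTB => GNTB, SELA => SELA );

  ClockGen : process
  begin
    while not Done loop
      CLOCK <= '0'; wait for PERIOD/2;
      CLOCK <= '1'; wait for PERIOD/2;
    end loop;
    wait;
  end process ClockGen;

  Stimulus : process
  begin
    wait for PERIOD;                        -- hold reset over the first rising edge
    RESET <= '0';
    -- A is using the channel: the grant stays with A
    AUSE <= '1';
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert GNTA='1' and GNTB='0' and SELA='0'
      report "grant left A while A in use" severity error;
    -- A releases, B takes over and holds the grant while BUSE is high
    AUSE <= '0'; BUSE <= '1';
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert GNTB='1' and GNTA='0' and SELA='1'
      report "grant did not move to B" severity error;
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert GNTB='1' report "grant left B while B in use" severity error;
    -- B releases: the grant returns to A
    BUSE <= '0';
    wait until CLOCK'event and CLOCK='1';
    wait for 1 ns;
    assert GNTA='1' and SELA='0'
      report "grant did not return to A" severity error;
    Done <= true;
    wait;
  end process Stimulus;
end sketch;
```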
10. Appendix B - Schematics
Schematic   Description
IPMUX       Input multiplexer and flit buffer
IPCELL      Flit buffer for data inputs
CIPCELL     Flit buffer for control inputs
IPMUXCMP    Address comparator
OPMUX       Output multiplexer
OPCELL      Output multiplexer cell for data
OPCELLE     Output multiplexer for strobed signal output
Figure 71 - Index of Schematics