Platforms for artificial neural networks : neurosimulators and performance prediction of MIMD-parallel systems by Vuurpijl, L.G.
PDF hosted at the Radboud Repository of the Radboud University
Nijmegen
 
 
 
 
The following full text is a publisher's version.
 
 
For additional information about this publication click this link.
http://hdl.handle.net/2066/18657
 
 
 
Please be advised that this information was generated on 2017-12-05 and may be subject to
change.
Platforms for artificial neural networks
Neurosimulators
and
Performance prediction of MIMD-Parallel Systems
Louis Vuurpijl

Platforms for artificial neural networks
Neurosimulators
and
Performance prediction of MIMD-Parallel Systems
an academic essay in the field of
Mathematics and Computer Science
Doctoral thesis
to obtain the degree of doctor
from the Katholieke Universiteit Nijmegen,
by decision of the College van Decanen (Council of Deans),
to be defended in public on
Friday, 6 February 1998,
at 3:30 pm precisely
by
Louis Gerard Vuurpijl
born on 4 April 1964 in Gorinchem
Promotor: Prof. Dr. Ir. Jan Vytopil
Co-promotor: Dr. Th. E. Schouten
Manuscript committee:
Prof. Dr. Ir. F.C.A. Groen (UvA)
Prof. Dr. P.T.W. Hudson (RUL)
Dr. L.E.B. Schomaker
CIP DATA KONINKLIJKE BIBLIOTHEEK, DEN HAAG
Vuurpijl, Louis Gerard
Platforms for artificial neural networks :
Neurosimulators and Performance prediction of
MIMD-parallel systems /
Louis Vuurpijl.
Doctoral thesis Nijmegen. - With bibliography. - With
summary in Dutch
ISBN 90-9011347-9
NUGI 831
Keywords: neural networks / computer engineering
For Ron

Contents
Contents
Preface
1 Introduction
2 Platforms for artificial neural networks
2.1 Artificial neural networks
2.2 Neural networks and execution platforms
2.3 Suitability of MIMD processor systems
2.3.1 Means of estimating performance
2.3.2 A new, combined approach
2.3.3 Examining the suitability of execution platforms
2.4 Neurocomputing and Neurosimulators
2.4.1 Users in the world of neurocomputing
2.4.2 The neurocomputing life cycle
2.4.3 An engineering approach to neurocomputing
2.4.4 Towards an action-oriented neurosimulator
2.4.5 The program description
2.4.6 CONVIS, the user-interface for control and visualization
2.4.7 Applications of PREENS
3 Parallelism and the transputer
3.1 Parallelism in computer systems
3.2 MIMD parallel processor systems
3.2.1 Shared memory systems
3.2.2 Distributed memory MIMD systems
3.3 The Inmos transputer
3.3.1 Process scheduling
3.3.2 Internal and external inter-processor communications
3.4 Transputer networks
3.4.1 Hierarchical architecture of the NSC
3.4.2 Hierarchical architecture of the GCel
3.4.3 Configuring a transputer network
3.5 New transputer systems
3.5.1 T9000 based systems
3.5.2 The PowerXPlorer
3.6 Programming environments for the transputer
3.6.1 Native systems
3.6.2 Task distribution and execution
3.6.3 Setting up communication channels
3.6.4 Inter-processor communication and routing
4 A Scalable Performance Prediction Model
4.1 Performance Benchmarking
4.1.1 Synthetic benchmarks
4.1.2 Kernel benchmarks
4.1.3 Algorithm benchmarks
4.1.4 Application benchmarks
4.1.5 Suitability of performance benchmarking
4.2 Performance Modeling and neural networks
4.2.1 Communication costs for neural networks
4.2.2 Computation costs for neural networks
4.2.2.1 The Kohonen SOM
4.2.2.2 The backpropagation neural network
4.2.3 Pitfalls when using arithmetic timings
4.3 Combining Benchmarking and Modeling
4.3.1 Identification of function kernels
4.3.2 Consequences for parallel programs
5 A point-to-point communication layer
5.1 Decomposition techniques
5.1.1 Job-level decomposition
5.1.2 Dataset decomposition
5.1.3 Neural network decomposition
5.2 Synchronization and communication
5.2.1 Communication networks
5.2.2 Communication requirements
5.3 Communication paths in trees and grids
5.3.1 Setting up the communication in a grid
5.3.1.1 Setting up the communication in Helios
5.3.1.2 Setting up the communication in Parix
5.3.2 Setting up the communication in a tree
5.3.3 Communicating in a tree and grid
5.4 Broadcast and gather routines
5.4.1 Broadcasting
5.4.2 Gathering
5.5 Gather, accumulate, and broadcast
5.5.1 Setup for the experiments
5.5.2 GAB() on the NSC
5.5.3 GAB() on the GCel-512
5.5.4 GAB() on the PowerXPlorer
5.6 Conclusions
6 Dataset Decomposition
6.1 A general dataset decomposition algorithm
6.2 Backpropagation dataset decomposition
6.2.1 Measurements for function kernels
6.2.2 A problem with small neural networks
6.2.3 Performance of backpropagation
6.2.4 Results for GCel
6.2.5 Results for PX
6.2.6 Results for NSC grids and trees
6.3 Discussion of the results
6.4 Dataset decomposition for Kohonen
6.4.1 Measurements for function kernels
6.4.2 Results for KSOM
6.5 Fixed-size speedup
6.5.1 The speedup limit
6.5.2 Fixed-size speedups for backpropagation
6.5.3 Fixed-size speedups for Kohonen networks
6.6 Scalability and efficiency
6.7 Two applications
6.7.1 Dataset decomposition and Nettalk
6.7.2 Dataset decomposition and satellite data
6.8 Conclusion
7 Network Decomposition
7.1 The backpropagation network
7.1.1 Implementation aspects of the forward pass
7.1.2 Implementation aspects of the backward pass
7.2 A new gathering technique
7.2.1 The store-and-forward technique for grids
7.2.2 The pipeline techniques for grids
7.3 Backpropagation communication costs
7.4 Backpropagation on a tree
7.4.1 Communication time for the forward pass
7.5 A comparison between transputer grids and trees
7.6 Performance
7.7 Speedup, scalability and efficiency
7.8 Expected results for Nettalk
7.9 The Kohonen neural network
7.9.1 Finding the winning neuron
7.10 Performance and speedup
7.11 Network decomposed Satdat
7.12 Conclusions
8 Neurosimulators
8.1 The neurocomputing environment
8.1.1 Environments: user perspective
8.1.2 Environments: neurosimulator perspective
8.2 Features of neurosimulators
8.3 The traditional neurosimulator engine
8.3.1 The general neural network datastructure
8.3.2 Access and control of neural network data
8.3.3 The neural network description language
8.3.4 The graphical user-interface
8.3.5 I/O and neurosimulators
8.4 Conclusions
9 An action-oriented neurosimulator
9.1 Objects and attributes of actions
9.2 Actions and program descriptions
9.2.1 Parameters
9.2.2 Variables
9.2.3 Data
9.2.4 Options and settings
9.2.5 An example: specification of an action learn
9.3 PREENS interface definitions
9.3.1 Interface between CONVIS and a simulation program
9.3.2 The action control protocol
9.3.3 Accessing components of an action
9.3.4 Accessing data
9.3.5 Exotic or distributed data
9.3.6 Interface between CONVIS and tools
9.4 An example: training remotely sensed data
9.4.1 Initiation phase
9.4.2 Tuning and testing phases
9.4.3 Classification
9.4.4 Conclusions
10 Conclusions
Bibliography
Samenvatting (Summary in Dutch)
Curriculum Vitae
Preface
When I started my PhD project and talked to Ron about all the things I would like to
examine, and all the things that I would discover, he immediately reacted very
enthusiastically. It was how he always reacted, and as he was also doing his PhD, we planned to
work together on a paper about neural networks and satellite image classification. Though
he was working at the Joint Research Center in Italy, we were always in contact, by phone
and through the Internet via email or using talk. We had the same promotor and
supervisor, so Ron often had to come to Nijmegen to discuss chapters of his thesis with Jan and
Theo. On such occasions, we would go out to bars like de Kluis and de Fuik to meet the
numerous friends he had here in Holland. One time Jan Vytopil, Theo Schouten, Harry
Duys and I visited the JRC, where we met Ron and his supervisors Graeme Wilkinson
and Jacques Stakenborg. It was a great visit, because in the weekend we went into the
mountains to go skiing in Cervinia and Zermatt, near the Matterhorn.
I will never ski with Ron again, because he tragically died in a car accident on August 1,
1995. It was about one month before he would get his PhD title, which he eventually
received posthumously. As always, Ron would reach his goals, so we succeeded in finishing our
joint paper. In the summer of 1995, Ron presented our paper. It was the first and also the
last of the many papers we had planned. This thesis is dedicated to Ron Schoenmakers.
Ron, my boy, thank you!
In fact, the paper I wrote with Ron was only part of the work I would have to do. In
January 1990, the PREENS project (Parallel Research Execution Environment for Neural
Systems) was launched. It was part of the Dutch national SPIN project (Stimulerings
Plan Informatica Nederland) and SNN (Stichting Neurale Netwerken), a collaboration of
four universities and several Dutch commercial firms. The project was also funded by the
Esprit Parallel Computing Action (PCA 4106). Funding was available for a period of four
man-years and made it possible to acquire a parallel processor system with 64 transputers.
The project was carried out at the Department of Experimental Informatics for Technical
Applications, supervised by Professor Jan Vytopil and guided by Dr. Theo Schouten.
Within the PREENS project, the goal was to examine the suitability of transputer systems as
an execution platform for artificial neural network simulations. Another goal was to develop
an execution environment for simulating neural networks, an environment which is also
known as a neurosimulator. This thesis will take the reader through several aspects of the world
of neurocomputing. In particular, it will explain how MIMD-parallel computers (like
transputer systems) can be used for simulating neural networks, and how such systems
can be evaluated in terms of the performance, speedup, scalability and efficiency that
can be achieved. The reader will become acquainted with neurosimulators and the world of
neurocomputing, and I hope to explain the design principles of the neurosimulator PREENS
that I developed.
During the first years of my research I shared my office with Peter Snel, whom I came to like
because of his nice character. In those days, everybody at our department had an Apple
Macintosh on his desk and Peter was ’the mac-man’ (of course, for the rest of my life I will
use Unix workstations). We had a nice group of people working together, and had regular
coffee meetings with Jan, Theo, Peter, Mirese, Martin, Hanno and also a lot of graduate
students. I would like to thank them all for being such pleasant company, and also for
helping and stimulating me with my work. Especially Theo Schouten, my supervisor. Not
only could I always enter his room, for a chat or some serious stuff, he also invited me to
his home to enjoy the fabulous company and cooking of his wife Riet. And you too Jan, for
helping me out with my “vervangende dienst” (alternative civilian service), offering me a job
as AiO, and guiding me through the years of my research and writing.
Though now I am an ultra-experienced programmer and Unix user, by the time I started
my research I knew nothing about the ins and outs of C. Fortunately, the “student room”
was next to mine and I could always drop in to come up with famous questions like “Why
doesn’t this do what I want it to do?”
void make_hello_world (char *ptr){
    ptr = strdup("Hello world!");}
int main (int argc, char *argv[]){
    char *hello;
    make_hello_world(hello);
    printf ("%s\n",hello);}
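For readers who share the question: the snippet fails because C passes the pointer by value, so the assignment inside the function changes only a local copy and the caller's `hello` stays uninitialized. A minimal corrected sketch follows, assuming a POSIX system for `strdup`. The `char **` signature is one common fix and is not part of the original text.

```c
#define _POSIX_C_SOURCE 200809L  /* makes strdup() visible under strict ISO modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Store a freshly allocated copy of the greeting in *ptr.
 * Taking a char ** lets the function change the caller's pointer.
 * The caller owns the string and must free() it.
 * *ptr is NULL afterwards if strdup() could not allocate memory. */
void make_hello_world(char **ptr)
{
    assert(ptr != NULL);
    *ptr = strdup("Hello world!");
}
```

A caller would declare `char *hello = NULL;`, call `make_hello_world(&hello);`, check the result for `NULL`, print it, and finally release it with `free(hello)`.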
I guided a number of students along their (graduate) projects, whom I would like to thank
for being there, doing part of my work and solving stupid questions like the one above.
Albertr, Reinoud, Tricky, Silvio, Hans, Rons, Ronl, Parcival, Eric, Rene, Charles, Chris,
Albertk, Albertb, Paul, Jan Willem, ..., thank you all.
During the last years of my PhD research, I shared my room with Maurice klein Gebbinck. I
came to know and like Maurice as one of the nicest people in the world. We developed
our own communication methods, and the two of us must have been a terrible nuisance for
anyone sitting in a room closer than 100 meters. We sang songs on Monday morning, having
heard them at “De Hollandse Avond in De Fuik” on Sundays. If I tapped the rhythm of
a song with my fingers, Maurice would join me. Or vice versa. And of course we always
spoke to our computers, cursing because they always responded too slowly, something which
really became apparent with the advent of the Internet and WWW. Maurice, thank you, and
do finish your PhD soon.
Since 1995, I have been working with the group of Lambert Schomaker on a seemingly totally
different subject, handwriting recognition. He also helped me finish my thesis. He
used to let me take days off for working on it (and this year even two months). Lambert,
thank you for all your advice and wise lessons. At the NICI, I shared my room with Janek
Mackowiak. Janek, thanks for the cooperation and for being there.
Despite the existence of the “Verbond van Slechte Mannen” (the alliance of bad men),
of which Ron is an honorary member, I managed to finish this thesis. Ron, Olaf, Jack, Steef,
Marc, Johan, Joost and also Renze, thank you for your friendship over the past years.
Without you it would have been finished much sooner.
Last but not least, I would like to thank my brother Gerrit and my mother Truus for
always supporting and believing in me. Now, they are proud of me. I always was proud of
the two of you, and I always will be.
1 Introduction
Outline
In this thesis, two platforms for simulating artificial neural networks are
discussed:
◦ MIMD-parallel processor systems as an execution platform
◦ neurosimulators as a research and development platform
Because of the parallelism encountered in neural networks, distributed
processor systems seem to provide a proper underlying execution platform.
The suitability of the class of MIMD-parallel computer platforms
(in particular multi-transputer systems) for neural network simulation
programs is discussed in this thesis. In order to evaluate the suitability
of such systems, a new performance prediction method is presented.
An introduction to the chapters discussing this method is given in this
chapter.
Neurosimulators provide a platform for simulating, developing, evaluating
and executing neural network models. In the last two chapters of
this thesis, neurosimulators are examined: environments for the development
and simulation of artificial neural networks. By considering their
common features, and the requirements of their users, the design criteria
for a new neurosimulator are specified. The design, implementation and
evaluation of PREENS, an action-oriented neurosimulator, is presented in
the final chapter.
Parallelism and parallel neural network simulations
The occurrence of parallelism can be observed in all aspects of life. For instance, all
creatures live their lives simultaneously. Often, they also cooperate to accomplish
one or more tasks, like ants building an ant hill, or lions chasing prey. In production
plants, many machines may operate simultaneously on identical tasks, distinct tasks, or
parts of tasks. Parallelism can be observed in the construction of houses or roads, in cars
and pedestrians at a busy junction, and even in tasks such as doing household chores.
Artificial neural networks (ANNs) also feature parallelism, on several levels of detail. They
contain a large number of processing elements, neurons, connected via an even larger
number of weights (modeling axons and synapses). In biology, each neuron, axon and
synapse operates in parallel. At a more coarse-grained level of detail, multiple modules or
layers of neurons and weights can be identified, all operating simultaneously. In Chapter 2,
a brief introduction to the topic of neural networks is given. That chapter also serves
as a further introduction to this thesis, introducing the concept of neurosimulators and of
performance modeling for the class of multiple instruction, multiple data (MIMD) parallel
computer systems.
The goal of exploiting parallel computing is to finish a task in a smaller amount of time, or
to handle more (distinct) tasks in the same amount of time. In Chapter 3, the parallelism
encountered in computer systems is discussed. Examples of coarse-grained multi-processor
systems are networks of workstations or transputer systems. On the other hand, computer
systems like the Connection Machine, or array processors, feature parallelism at a finer level
of detail.
When considering the parallelism occurring in neural networks, parallel processor systems
seem to provide a natural underlying hardware platform for implementing them. The
advantage of having such execution platforms is that parallel implementations run faster than
sequential ones. Disadvantages are that parallel implementations are harder to program
and more dedicated to one specific neural network model or application. A large number
of parallel implementations of neural networks on several levels of detail are reported in
the literature. The lowest level represents a one-to-one mapping of weights and neurons
onto analog or digital circuits. An overview of such implementations can be found in
[45, 79, 103]. A wide range of other parallel execution platforms like the Connection
Machine [94], GF11 [124], MasPar [17] and transputer systems [81, 82, 84, 104, 107] are
proposed for higher (more abstract) levels of parallelism. These machines can be massively
parallel, i.e. contain a large number of relatively simple parallel processors (> 1000), or
they can consist of a smaller number of more general purpose nodes.
In Chapter 3, multi-transputer systems are discussed. They are used as an execution
platform for parallel neural network simulations in this thesis. The main reason why
transputers were chosen for this PhD study is that the transputer is a general purpose computer,
making it suitable for a large variety of neural network models. Whereas massively
parallel systems contain many nodes, the number of nodes for MIMD-processor systems is
limited (in practice < 1000). As a consequence, the number of neurons and weights
exceeds the number of processors. So when implementing a neural network on a generic
MIMD execution platform, the neural network model has to be decomposed over
the available processors, with groups of neurons and weights placed on different
processors. Techniques for implementing neural networks on MIMD-processor systems like the
transputer are described in, e.g., [38, 100, 110]. These techniques are discussed in detail in
Chapters 5, 6 and 7. Two issues are of importance for parallel neural network
decompositions: a) make sure that each processor has an equal amount of work to do (load balance),
and b) make sure that the amount of synchronization and communication overhead is
kept as low as possible.
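Requirement a) can be illustrated with a plain block partition, in which N neurons are spread over P processors so that the per-processor counts differ by at most one. The sketch below is only an illustration under that assumption. The function names `block_size` and `block_start` are hypothetical and are not taken from the thesis, whose actual decomposition techniques are the subject of Chapters 5, 6 and 7.

```c
#include <assert.h>

/* Number of neurons assigned to processor `rank` when `n` neurons are
 * spread over `p` processors as evenly as possible. The first n % p
 * processors each receive one extra neuron. */
int block_size(int n, int p, int rank)
{
    assert(p > 0 && rank >= 0 && rank < p && n >= 0);
    return n / p + (rank < n % p ? 1 : 0);
}

/* Index of the first neuron owned by processor `rank`. */
int block_start(int n, int p, int rank)
{
    int extra = n % p;
    assert(p > 0 && rank >= 0 && rank < p && n >= 0);
    return rank * (n / p) + (rank < extra ? rank : extra);
}
```

With 10 neurons on 3 processors this yields blocks of 4, 3 and 3 neurons starting at indices 0, 4 and 7. Requirement b) is not addressed by this sketch, since the communication cost depends on how the weights between the blocks are placed.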
Suitability of MIMD-processor platforms for ANNs
A large part of this thesis is dedicated to a method for determining whether an execution
platform is well-suited for a certain application. For such an evaluation, several questions
can be raised [113, 114]. The first question is what execution performance can be achieved
for a given machine configuration and type and dimension of application. The second
question is whether by increasing the amount of computing and communication resources,
the total execution time for a given application can be reduced, i.e., what level of speedup
can be achieved. A third question concerns the scalability of the platform and application,
i.e., does the execution time stay constant if the size of both the application and platform is
scaled up? In Chapter 4, a method is introduced for predicting the performance of
MIMD-processor systems for ANNs. For a given processor architecture, the method models the
calculation time required for executing the application, and it models the time required for
communication. Based on measurements of the calculation time on one processor and the
communication time between two processors, predictions can be made for larger processor
systems. Using this method, the three questions raised above can be answered.
As examined in [110, 113], parallel neural network simulations require several kinds of
communication. In Chapter 5, a communication layer implementing the typical communication
requirements of distributed neural network implementations is presented. A model for the
required communication time is introduced and evaluated for three different transputer
systems. The first is the Nijmegen Super Cluster, a system containing 64 T800 transputers.
The second is the GCEL-512, containing 512 T805 processors. And the third is the
PowerXPlorer, a system containing 32 PowerPC 601 nodes. The latter two machines are
located at the University of Amsterdam.
In Chapters 6 and 7, two different decomposition techniques are described: data set
decomposition and network decomposition. Both techniques are applied to two popular neural
network models, the multi-layered perceptron [83] and the Kohonen self-organizing map
[58]. The parallel implementations for both neural networks use the communication layer
introduced in Chapter 5. The performance prediction method discussed in Chapter 4 is
evaluated both for data set decomposition and network decomposition techniques. It will be
shown that using the method, the suitability of the class of MIMD-parallel processor
systems for artificial neural network simulations can be predicted accurately.
Neurosimulators
The final two chapters of this thesis are concerned with platforms dedicated to artificial
neural network simulations, called neurosimulators. A neurosimulator is defined as a
set of software and/or hardware components that can operate together to support the
construction, manipulation, visualization or (fast) execution of neural network simulations
[116, 112]. This thesis presents a new kind of neurosimulator, called PREENS, a parallel
execution environment for neural systems.
In Chapter 2, an introduction is made to the world of neurocomputing. The different user
groups involved in using, developing, or implementing neural networks are identified.
Furthermore, the neurocomputing life-cycle is presented. This life-cycle contains four phases:
Initiation, Tuning, Testing, and Operation. In the Initiation phase, the task to be
performed and its preconditions are determined. In the Tuning phase, a chosen neural network
model is tailored for the task determined in the Initiation phase. In the Testing phase, the
performance of the resulting neural network is evaluated. Performance in this context may
involve execution speed, recognition accuracy, or reliability. In the final phase (Operation),
the resulting optimized and tested neural network is used in an actual application.
Existing neurosimulators and the features they exhibit are reviewed in Chapter 8.
Following Recce et al. [80], neurosimulators can be divided into application-oriented,
algorithm-oriented, and general programming systems. Features they may share
are a graphical user interface, an algorithm library containing a set of implemented neural
network models, support for building new models, application-specific tools, dedicated
hardware accelerators, etcetera.
Based on the observations made when considering these features and on the requirements
from users in the world of neurocomputing, the final chapter of this thesis
(Chapter 9) presents the design and implementation of PREENS. PREENS comprises
a neural network algorithm library, a set of tools, and a manager called CONVIS for
controlling tools and simulation programs. Tools, CONVIS and a simulation program can
run as separate processes in a heterogeneous computer network. This, and a new concept
called action-oriented program descriptions, form the main differences between PREENS and
existing neurosimulators.
The concept of actions and their associated components is explained in Chapter 9. Based
on this, a set of interface definitions is specified via which new tools or neural network
simulation programs can be integrated in PREENS relatively easily. PREENS will be evaluated
on a real-world application, the classification of remotely sensed (satellite) images.
2 Platforms for artificial neural networks
Outline
This chapter serves as an introduction to two concepts: a) performance
prediction of MIMD-parallel execution platforms for artificial neural
networks, and b) the world of neurocomputing and neurosimulators.
Artificial neural networks can be considered as computer programs
inspired by the processes taking place in the human nervous system. In
this chapter, a brief introduction to artificial neural networks is given.
Because of the parallelism observed in biological neural networks, parallel
processor systems like the transputer are believed to provide a natural
and efficient platform for running artificial neural networks. The
suitability of such a platform can be expressed in terms of the execution
performance, speedup and scalability that can be achieved. A method
for predicting these parameters for parallel neural network simulations is
introduced in this chapter.
Neurosimulators are a collection of software and hardware tools for
simulating artificial neural networks. They provide a platform for users
involved in applying or developing neural networks: users "doing
neurocomputing". In this chapter, the world of neurocomputing, a taxonomy
of users doing neurocomputing, and the neurocomputing life-cycle are
presented. An introduction will be given to the design of a new
type of neurosimulator, called PREENS.
2.1 Artificial neural networks
We human beings are capable of performing difficult tasks seemingly effortlessly; tasks
that are not handled very well by computers. For example, recognition of speech and
visual stimuli in complex scenes, performing sports, and our ability to constantly learn by
experience are still very difficult for a machine. Since the beginning of
the computer era, people have been wondering how to incorporate knowledge of our
brain and the nervous system into mathematical models and computer simulations, and
how to use these biologically inspired models for a specific application area.
A large number of distinct neural network models exist, and within each model there exist
a large number of variations. Well-known neural network models are backpropagation or
multi-layered perceptrons [83], Kohonen networks [58], Hopfield networks [49],
counterpropagation networks [44], ART networks [13, 12, 14, 15], Boltzmann networks [92], etc.
On a regular basis, new publications appear that report minor or significant changes
in these paradigms, so the number of models is still increasing rapidly. However, all models
share a number of important features: they exchange inputs and outputs (modeled by
scalar numbers) with their environment, they consist of a large number of processing
elements (neurons), they have connections between the neurons, neurons receive inputs from
neighboring neurons, compute their state of activation based on this information,
propagate the activation following an activation function, and they all have the ability to learn
by changing their connections following a certain learning mechanism. A particular
connection topology, activation mechanism and learning mechanism together form the basic
ingredients of a particular artificial neural network model.¹
A general neuron model
Each neuron $j$ has an activation value $a_j$ and a threshold $\theta_j$. Connections between two
neurons $i$ and $j$ are modeled by a weight value $w_{i,j}$. A neuron receives activation values
from its input neurons, computes the weighted sum of its inputs and weights, and outputs
an activation value which is computed by the activation function $f_a$.
Figure 2.1: A simple neuron model, receiving inputs $a_i$ from other neurons.
¹ For an overview of the different neural network paradigms, I refer to the frequently asked questions
(FAQ) of the Usenet newsgroup comp.ai.neural-nets. It contains an up-to-date list of references to
books and journals, with reviews and a division into, e.g., introductory, business, intermediate and advanced
categories. An excellent book is Neural Networks for Pattern Recognition by C.M. Bishop [8].
(2.1)
The activation function is usually of a simple form and can, for example, consist of a
threshold function or a sigmoid.
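The neuron model described above can be sketched in a few lines. This is an illustration under stated assumptions, not the thesis's own formulation: the threshold is assumed to be subtracted from the weighted sum, the step function is assumed to switch at zero, and the names `neuron_activation`, `threshold_fn` and `sigmoid` are hypothetical.

```python
import math


def threshold_fn(x):
    """Step activation: 1 if the net input is non-negative, else 0 (assumed convention)."""
    return 1.0 if x >= 0.0 else 0.0


def sigmoid(x):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-x))


def neuron_activation(inputs, weights, theta, f_a=sigmoid):
    """Weighted sum of the input activations minus the threshold, passed through f_a.
    Subtracting theta (rather than adding it) is an assumption."""
    net = sum(w * a for w, a in zip(weights, inputs)) - theta
    return f_a(net)


# two input neurons with activations 1.0, weights 0.5 and 0.25, threshold 0.25
a_j = neuron_activation([1.0, 1.0], [0.5, 0.25], theta=0.25)
a_step = neuron_activation([1.0, 1.0], [0.5, 0.25], theta=0.25, f_a=threshold_fn)
```

Here the net input is 0.5, so the step function fires while the sigmoid yields a graded value between 0.5 and 1.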
Throughout this thesis, two popular artificial neural networks will be used as running
examples, the backpropagation network [83] and the Kohonen self-organizing feature map
(SOM) [58]. Details of their connection architecture, training algorithms and activation
functions, and details of different implementations on sequential and parallel machines will
be discussed when required in chapters 4, 6 and 7.
The Kohonen SOM
The Self-Organizing Feature Map (SOM) [58] is, besides backpropagation, the most
well-known neural network paradigm. The neural network is arranged in an X-dimensional
grid of neurons (usually X = 2), the feature map (see Figure 2.2). Each neuron is fully
connected to an input layer which represents a feature vector of the input data.
Figure 2.2: Architecture of the self-organizing feature map. The winning (black) neuron is
excited the most; the "bubble" of neurons in its neighborhood is also excited.
As with most neural networks, the SOM distinguishes two phases, an activation (or recall)
phase and a training phase. During activation, an input vector is clamped on the input
layer and each neuron computes its match with the input. Two methods are often used,
computing the Euclidean distance or taking the inner product. The latter requires that
all weights and input data are normalized. The neuron with the best match is called
the winner and its weights form an internal representation of the input feature. During
training, the winning neuron and the neurons lying within its neighborhood update their
weights. A complete data set can be learned by the SOM by repeatedly clamping an input
vector, activating the winning neuron and its neighbors and updating their weights.
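The activation and training phases described above can be sketched as follows. This is a minimal illustration, assuming Euclidean matching, a square neighborhood ("bubble") of a given radius and the common update rule that moves each weight a fraction `rate` towards the input. The names `find_winner` and `train_step` and all parameter values are hypothetical.

```python
def find_winner(weights, x):
    """Activation phase: return the grid position whose weight vector has
    the smallest squared Euclidean distance to the input vector x."""
    def dist2(w):
        return sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return min(weights, key=lambda pos: dist2(weights[pos]))


def train_step(weights, x, radius=1, rate=0.5):
    """Training phase: move the winner and all neurons within `radius` grid
    steps of it towards x (square neighborhood; an assumed rule)."""
    wr, wc = find_winner(weights, x)
    for (r, c), w in list(weights.items()):
        if abs(r - wr) <= radius and abs(c - wc) <= radius:
            weights[(r, c)] = [wi + rate * (xi - wi) for wi, xi in zip(w, x)]
    return (wr, wc)


# 2x2 feature map with 2-dimensional weight vectors
som = {(0, 0): [0.0, 0.0], (0, 1): [0.0, 1.0],
       (1, 0): [1.0, 0.0], (1, 1): [1.0, 1.0]}
winner = train_step(som, [0.9, 0.1], radius=0, rate=0.5)
```

With radius 0 only the winning neuron at grid position (1, 0) is updated; its weights move halfway towards the input vector.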
The backpropagation network
The implementations of the backpropagation network described in this thesis all use the
algorithm described by Rumelhart in [83]. Among all the different variations on the original
algorithm, this is probably the most widely used. The backpropagation network has a
layered architecture of one input layer, one or more hidden layers and one output layer.
Figure 2.3: Architecture of a three-layered 4x6x3 backpropagation network.
Neurons in two subsequent layers are fully interconnected. During activation of the
network (the forward pass), information consisting of neuron activations flows through the
connections from the input layer through the hidden layers to the output layer. Each input
pattern is clamped on the input layer. The activation of a neuron in a hidden or output
layer is computed based on the activation of its input neurons and the strength of the
corresponding connections, as in Equation 2.1. During training (the backward pass), the
information flow is reversed, and each neuron propagates the error values it has produced
back. The error values are computed based on the difference between target patterns and
computed outputs. Based on the error values, each neuron can compute the contribution
of its input weights to the error, the so-called delta. Using its delta, learning rate and
other parameters, each neuron updates its weight values.
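The forward and backward pass can be illustrated with a small sketch. It assumes sigmoid activations, a squared-error criterion, plain gradient descent without momentum, and no thresholds. These simplifications are assumptions for illustration and do not reproduce the exact algorithm of [83]. All names and numerical values are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(layers, pattern):
    """Forward pass: return the activations of every layer, input layer first."""
    acts = [list(pattern)]
    for w in layers:                      # w[j][i]: weight from input i to neuron j
        prev = acts[-1]
        acts.append([sigmoid(sum(wji * ai for wji, ai in zip(row, prev)))
                     for row in w])
    return acts


def backward(layers, acts, target, rate=0.5):
    """Backward pass: compute deltas layer by layer and update weights in place."""
    out = acts[-1]
    deltas = [(t - o) * o * (1.0 - o) for t, o in zip(target, out)]
    for l in range(len(layers) - 1, -1, -1):
        w, prev = layers[l], acts[l]
        # deltas for the layer below, computed before these weights are changed
        below = [a * (1.0 - a) * sum(w[j][i] * deltas[j] for j in range(len(w)))
                 for i, a in enumerate(prev)]
        for j, row in enumerate(w):
            for i in range(len(row)):
                row[i] += rate * deltas[j] * prev[i]
        deltas = below


random.seed(0)
sizes = [4, 6, 3]                         # the 4x6x3 network of Figure 2.3
net = [[[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
       for n_in, n_out in zip(sizes, sizes[1:])]
pattern, target = [1.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0]


def error(network):
    out = forward(network, pattern)[-1]
    return sum((t - o) ** 2 for t, o in zip(target, out))


before = error(net)
for _ in range(50):
    backward(net, forward(net, pattern), target)
after = error(net)
```

Repeating the two passes on a single pattern reduces the squared output error, which is the behavior the description above implies.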
2.2 Neural networks and execution platforms
With the increasing engagement of research and development institutions, and academic,
industrial and commercial users in the area of neurocomputing, the demand for high
performance implementations of neural networks has arisen. Requirements concerning the
amount of data that has to be handled, the size of the neural networks involved and the
response times that have to be met have become more and more important. This
especially holds for application areas like vision, pattern recognition and database mining.
In order to fulfill these requirements, throughout the last decade a large number of high
performance execution platforms have been proposed for implementing and simulating
neural networks. Parallel processor systems have particularly been of interest because they
are likely to provide a proper execution platform for exploiting the intrinsic parallelism
present in neural networks. Several levels of parallelism can be distinguished, and for each
level some execution platforms may be more qualified than others. In this thesis the focus
will be on message passing multi-processor systems, and in particular on multi-transputer
systems.
The transputer is a microprocessor with processing units, control logic, private local
memory and four communication links on one single VLSI device [65, 67]. The transputer can
be used in a single processor fashion, but also in multi-processor configurations for building
a high performance parallel execution platform. In such a transputer network, the
programmer has to determine carefully how an application is decomposed over
the available processors, and how the processors are connected with their four links.
The two processor topologies used for the artificial neural network simulations in this thesis
are a grid and a tree:
Figure 2.4: A multi-transputer (a) grid and (b) tree topology. A transputer is depicted as a square
box with four lines representing the links in the north, east, south and west direction.
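How four links per transputer map onto a grid can be sketched as follows. The sketch assumes a non-wrapping rectangular mesh in which border processors leave some links unused; the helper name `grid_links` and the (row, column) addressing are hypothetical.

```python
def grid_links(rows, cols):
    """Map each processor (r, c) to its neighbors over the north, east,
    south and west links of a non-wrapping grid (assumed layout)."""
    offsets = {"north": (-1, 0), "east": (0, 1), "south": (1, 0), "west": (0, -1)}
    links = {}
    for r in range(rows):
        for c in range(cols):
            links[(r, c)] = {
                d: (r + dr, c + dc)
                for d, (dr, dc) in offsets.items()
                if 0 <= r + dr < rows and 0 <= c + dc < cols
            }
    return links


links = grid_links(3, 3)
```

In a 3x3 grid only the center processor uses all four links, while a corner processor uses two.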
In the literature, a wide variety of execution platforms have been reported for implementing
and simulating neural networks. These range from sequential personal computers,
workstations, powerful supercomputers, SIMD and MIMD parallel processors like the
Connection Machine and transputer arrays, to special purpose neuro-hardware. Large scale
SIMD parallel processors have been used, like the IBM GF11 for implementing
backpropagation [124] and the DAP for, amongst others, Hopfield networks [36]. Obermayer et al. [74]
describe the parallel implementation of Kohonen self-organizing feature maps, comparing
the performance on a transputer array and the Connection Machine CM-2. The latter
platform has also been used by Levin for simulating Hopfield-like networks [64] and by Singer,
who achieved 1.3 giga-interconnects per second for backpropagation networks [94]. MIMD
parallel processors have been used, like the Intel iPSC hypercube for experiments with
backpropagation networks [29]. Feldman et al. describe the parallel implementation of
their Rochester Connectionist Simulator on the BBN Butterfly [34], and transputer arrays
are used for a whole range of neural network simulations in [16, 20, 81, 82, 100, 110, 120].
Finally, a lot of results have been published concerning the design and implementation of
general or special purpose neural hardware. For example, this was done by using digital
signal processors [28, 68] or designing VLSI implementations [27, 43].
In chapter 3, an overview of different parallel processor systems is given, and the
transputer architecture is discussed in detail.
2.3 Suitability of MIMD processor systems
When discussing the suitability of an execution platform for neural networks, several
questions arise. The first is: which platform is best qualified for this application area? In
order to answer this question, different platforms have to be compared to each
other. This can be done via performance benchmarking, where one or more benchmarks
are run on the different platforms and the resulting performance measures are compared.
Performance is defined as the number of operations related to the problem domain that
are computed in one time step. The execution platform with the best performance would
thus be the most suitable. However, a further question that arises is: would this execution
platform still be the best if the size of the target architecture or problem domain changes?
When adding more processing, communication or memory resources, the platform is
considered suitable if a proportional reduction in execution time (speedup) is achieved. The
third question is closely related to the former one and concerns the scalability issue. The
platform and application are scalable if by increasing the processing and memory resources,
a comparably larger problem can be solved in the same execution time. The final question
that is considered here is what efficiencies are achieved when using a certain number of
processors and communication resources. The efficiency for a given speedup is the speedup
divided by the number of processors, thus indicating what fraction of the available resources
is used productively. In the subsequent chapters, the implementations of
parallel neural network simulations (PNNS) will all be evaluated from the perspectives of
performance, speedup, scalability and efficiency.
The performance of an execution platform for neural network simulations is often
expressed in the number of (million) connection updates per second, or (M)CUPS.
Clearly, MCUPS is a relative term which depends on a combination of a large number of
factors, such as the neural network algorithm, its implementation, its size, the computer
architecture, available memory, processing and communication resources, etc. This implies
that when comparing MCUPS on different execution platforms, all these parameters have
to be taken into account. Otherwise, MCUPS would be a meaningless
performance scale, in analogy with MIPS and MFLOPS as discussed by Dongarra and
Gentzsch [25].
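As a small illustration of how an MCUPS figure is obtained, the sketch below assumes that every connection is updated exactly once per presented pattern. The function name `mcups` and the numbers are hypothetical and are not measurements from this thesis.

```python
def mcups(n_connections, n_patterns, seconds):
    """Million connection updates per second for one pass over n_patterns,
    assuming one update per connection per pattern."""
    return n_connections * n_patterns / seconds / 1e6


# a fully connected 4x6x3 network has 4*6 + 6*3 = 42 connections
n_conn = 4 * 6 + 6 * 3
rate = mcups(n_connections=n_conn, n_patterns=100_000, seconds=2.0)
```

The same network timed on another machine, or with a different implementation or size, gives a different figure, which is why the factors listed above have to accompany any reported MCUPS value.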
2.3.1 Means of estimating performance
There are two ways of estimating the performance of an execution platform for an
application, which are described in more detail in chapter 4: performance benchmarking and
performance modeling.
Performance benchmarks comprise one or more complete programs or program parts
representing one or more classes of applications. By timing a benchmark on an execution
platform, its performance for the class of applications can be determined. In Chapter 4 it
will be made clear that the use of this method for predicting the performance of parallel
neural network simulations has only limited potential. First, benchmarks do not exist
for all neural network paradigms, and benchmark results for different applications cannot be
compared very easily. Second, a large number of factors determine the performance, such
as the hardware and software environment, the application, its size and the way it is
implemented. This implies that a very careful examination of both the benchmark and the
application, and of the conditions under which the benchmark is run, is required.
Furthermore, this method cannot be used for answering questions regarding speedup and
scalability.
Performance modeling uses an analysis of the complexity of the execution platform and
application in terms of, e.g., memory and communication requirements, arithmetic
operations and size of the application. Using the machine specifications comprising individual
timings for arithmetic operations, memory accesses and communication, an overall
performance measure can be determined. However, again a large number of factors determine
the arithmetic performance of a machine and therefore, the resulting performance estimate
will not be very precise. Furthermore, the complexity analysis and corresponding
evaluation of the factors that determine the performance is an intricate matter that has to be
carried out for each new execution platform and application.
2.3.2 A new, combined approach
In this thesis, a method is presented that combines performance benchmarking and
modeling. To estimate the performance of a MIMD platform for a neural network application,
some calculation and communication benchmarks have to be measured. Using a model
of the communication and calculation costs, the overall time can then be determined. In
general, the overall time for training or recall of a neural network application of $n$ neurons,
$w$ connections and using $p$ patterns running on a platform of $P$ processors is modeled as
the sum of the calculation and communication times:
$$T(P,n,w,p) = T_{calc}(P,n,w,p) + T_{comm}(P,n,w,p) \tag{2.2}$$
The calculation time $T_{calc}$ can be modeled as consisting of the times required for executing
a small number of kernel functions (e.g., compute a weight change, update a connection,
update a neuron). These are executed a large number of times (e.g., per connection or per
neuron):
$$T_{calc}(P,n,w,p) = \frac{1}{P} \sum_i N_i(n,w,p) \cdot t_i \tag{2.3}$$
In (2.3), $t_i$ is the execution time for computing function $i$ and $N_i(n,w,p)$ denotes the
number of times the function is computed. Note that a perfect load balance can be assumed,
as either $n$, $w$ or $p$ is large compared to $P$.
The communication time is modeled as the number of times a single information unit
(e.g., a connection or activation value) must be sent over a physical communication link
($C(P,n,w,p)$), multiplied by the time required for communicating one value:
$$T_{comm}(P,n,w,p) = C(P,n,w,p) \cdot t_{comm} \tag{2.4}$$
If required by the hardware resources or communication patterns of the neural network, a
more complicated model for $T_{comm}$ can be used, e.g., by taking setup times into account.
This communication model suffices for transputer networks, but a more elaborate model
may be required for, e.g., a parallel system containing workstations in a local area network.
The benchmarks needed in this method are restricted to measuring the execution time of a
small number of kernel functions on one processor and the time needed to communicate a
single information unit between two processors. This approach can be classified as kernel
benchmarking [5, 46].
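The combined model can be sketched as a few small functions. The sketch assumes that the kernel work is divided evenly over the P processors (the perfect load balance mentioned above) and that the number of communicated units has already been derived for the chosen decomposition. All function names and all numbers are hypothetical and are not measurements from this thesis.

```python
def t_calc(P, kernels):
    """Calculation time: sum over kernels of (count * time per call),
    divided evenly over P processors (assumes perfect load balance)."""
    return sum(count * t for count, t in kernels) / P


def t_comm(n_units, t_unit):
    """Communication time: number of information units sent over a link
    times the measured time for communicating one unit."""
    return n_units * t_unit


def t_total(P, kernels, n_units, t_unit):
    """Overall time as the sum of calculation and communication time."""
    return t_calc(P, kernels) + t_comm(n_units, t_unit)


# Hypothetical kernel benchmarks: (number of calls, seconds per call)
kernels = [(1_000_000, 2e-6),   # e.g. connection updates
           (10_000, 5e-6)]      # e.g. neuron updates
T = t_total(P=16, kernels=kernels, n_units=20_000, t_unit=4e-6)
```

Only the per-call kernel times and the per-unit communication time have to be measured; the call counts and the number of communicated units follow from the analysis of the application and its decomposition.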
2.3.3 Examining the suitability of execution platforms
By predicting the performance for different numbers of processors $P_1$ and $P_2$, the suitability
of the platform in terms of speedup, efficiency and scalability can be examined. The
plain performance number can be used to find out how closely the peak performance of the
platform can be approached. This gives some qualitative statements about the efficiency of the
implementation and use of the available resources. The speedup for $P$ processors is defined
as the time it requires to compute a problem on 1 processor divided by the time it takes
to compute it on $P$ processors (Eq. 2.5). The parallel speedup for $P_1$ and $P_2$ processors
can be defined as (Eq. 2.6):
$$S(P,(n,w,p)) = \frac{T(1,n,w,p)}{T(P,n,w,p)} \tag{2.5}$$
$$S(P_1,P_2,(n,w,p)) = \frac{T(P_1,n,w,p)}{T(P_2,n,w,p)} \tag{2.6}$$
If the speedup shows a linear characteristic, the target platform can be used efficiently
for the problem. If the speedup only increases slowly with the number of processors, or
if it even drops, this usually means that the communication overheads become too severe.
This can be translated into the statement that the available communication resources are
insufficient for the problem or that the communication/computation ratio is too high. The
latter means that the amount of work each processor has to do is too low compared to the
required communications. It will be shown in the subsequent chapters that indeed, when
increasing the problem size, the achieved speedups increase too.
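The speedup and efficiency definitions translate directly into code. The sketch below uses hypothetical predicted execution times, and the function names are illustrative only.

```python
def speedup(t_one, t_p):
    """Time on 1 processor divided by time on P processors."""
    return t_one / t_p


def parallel_speedup(t_p1, t_p2):
    """Time on P1 processors divided by time on P2 processors."""
    return t_p1 / t_p2


def efficiency(t_one, t_p, P):
    """Speedup divided by the number of processors."""
    return speedup(t_one, t_p) / P


# Hypothetical predicted times in seconds for 1, 8 and 16 processors
times = {1: 64.0, 8: 10.0, 16: 8.0}
s8 = speedup(times[1], times[8])             # 6.4
s16 = speedup(times[1], times[16])           # 8.0
s8_16 = parallel_speedup(times[8], times[16])
e16 = efficiency(times[1], times[16], 16)    # 0.5
```

In this made-up example, doubling the processors from 8 to 16 raises the speedup only from 6.4 to 8, and the efficiency drops to one half, the pattern that points to growing communication overhead.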
The speedup as defined above is also known as fixed-size speedup. The problem size is
kept constant and the number of processors is increased. Following Amdahl's law [1], there
is always an upper limit $1/s$ on the speedup, where $s$ represents the sequential part of
the parallel algorithm. If, as is often the case in simulation studies, researchers want to
increase their problem size (e.g., grid size in weather forecasting or image processing, or
the number of patterns or connections in artificial neural networks), they want to know
whether their application is scalable on a specific machine architecture. When examining
the scalability of a parallel program, the problem size is not kept fixed but instead
is increased proportionally with the number of processors. A parallel problem is called
scalable if, when increasing both the problem size and the number of processors, the total
execution time of the parallel program stays constant. This only holds if what we call the
scalability factor equals one (2.7). The problem size depends on the number of neurons
and connections in the neural network and on the number of patterns. This is denoted as
$(n,w,p)$:
$$f_{scal}(k,P,(n,w,p)) = \frac{T(P,(n,w,p))}{T(kP,\,k \cdot (n,w,p))} \tag{2.7}$$
In [42], the so-called scaled speedup is introduced in order to overcome the problems
imposed by Amdahl's law. According to Gustafson, "speedup should be measured by
scaling the problem to the number of processors, not by fixing the problem size". In
[35], the definition of scaled speedup is given, which exactly matches our definition of the
scalability factor times $k$:
$$S_{scal}(k,P,(n,w,p)) = k \cdot f_{scal}(k,P,(n,w,p)) \tag{2.8}$$
In the sequel, the term scalability will be used to mean scaled speedup and the term
speedup will indicate fixed-size speedup. In general, scalability will result in higher
numbers than speedup. Therefore, using speedups to examine the suitability of an execution
platform for an application could give the indication that the platform is not well suited,
whereas using scalability would indicate a better suitability of the platform. In this thesis,
both scalability and (fixed-size) speedup are considered.
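The difference between fixed-size and scaled speedup can be made concrete with a short sketch. It uses the standard form of Amdahl's law, 1 / (s + (1 - s) / P), and takes the scalability factor as the ratio of the execution time of the original problem to that of the k-times larger problem on k times as many processors. The function names and the example times are hypothetical.

```python
def amdahl_speedup(s, P):
    """Fixed-size speedup for sequential fraction s on P processors
    (Amdahl's law); bounded above by 1/s."""
    return 1.0 / (s + (1.0 - s) / P)


def scalability_factor(t_small, t_scaled):
    """T(P, size) / T(k*P, k*size); equals 1 when the execution time
    stays constant under scaling."""
    return t_small / t_scaled


def scaled_speedup(k, t_small, t_scaled):
    """Scaled speedup: k times the scalability factor."""
    return k * scalability_factor(t_small, t_scaled)


bound = amdahl_speedup(0.1, 1_000_000)   # approaches, but never reaches, 1/s = 10
f = scalability_factor(t_small=2.0, t_scaled=2.5)
s_scal = scaled_speedup(k=4, t_small=2.0, t_scaled=2.5)
```

With a sequential fraction of 10 percent the fixed-size speedup cannot exceed 10 however many processors are added, whereas the scaled speedup in this made-up example grows with k as long as the scalability factor stays near one.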
The method presented here will be validated quantitatively for different decomposition
methods and processor topologies (e.g., grid, tree) in chapters 6 and 7. Depending on the
exploited decomposition techniques and processor topologies, the modeled $T_{calc}$ and $T_{comm}$
may differ strongly. Therefore, the method requires that for each topology and application
used, an analysis is made of the calculation and communication times. Using the
analysis and the times associated with the measured kernel benchmarks, the method can
be used to predict the performance of any MIMD multi-processor system. The method
is a scalable performance prediction method, because based on the times measured on
one processor and one communication link, predictions can be made for larger processor
systems. Furthermore, for each neural network application, the analysis of its calculation
times and the identification of the corresponding kernel functions has to be performed
only once. Its calculation times for new target execution platforms can subsequently be
estimated by measuring its kernels. This also holds for the communication times on a
specific execution platform. Once its communication times are estimated, they can be used
to predict the performance for any neural network application. Therefore, the method is
called a scalable, general purpose performance prediction method.
This combined method will be discussed in chapter 4. It is targeted at MIMD parallel
processor systems that communicate with each other through message passing. An overview
of MIMD parallel processor systems is given in chapter 3. Chapter 5 covers the design,
implementation and performance evaluation of a message passing communication layer
tailored for distributed neural network simulations. In chapters 6 and 7, the performance
prediction model will be applied to two popular neural network models, the multi-layered
perceptron (backpropagation) network [83] and the Kohonen self-organizing feature map
[58]. Three multi-transputer systems are used for the quantitative validation of the
models. The system located in Nijmegen contains 64 T805 transputers. Two other systems
were made available by the University of Amsterdam for experimentation. The GCEL-512
containing 512 T805 nodes and the PowerXPlorer containing 32 PowerPCs allowed for
the examination of the performance, speedup, scalability and efficiency issues for a larger
number of processors and for processor nodes with higher computing power.
2.4 Neurocomputing and Neurosimulators
Neural networks are being used in many commercial and research applications. People
using or developing artificial neural networks are involved in the world of neurocomputing.
Despite the growing popularity of neural networks, there have always been a number of
objections to using them. Researchers in the theoretical field are not pleased with
the fact that they do not really know what is going on 'inside' the network: what do values
of parameters like "weights" and "biases" mean when they are tuned for an application?
Potential neural network users working in control or industry complain about the
reliability and precision of neural networks. If hard real-time responses are required and
a neural network hesitates or gives an inaccurate reaction, disastrous failures may occur.
Finally, the long training times required for tuning a network are a nuisance and may
justify the use of high-performance target hardware like the transputer systems described
in this thesis.
In order to a) examine the inner workings of a neural network, b) increase its generalization
precision through experimentation, c) support the tuning of its parameters, or
d) provide fast execution platforms, powerful tools are required. Such tools, which assist
a user in the world of neurocomputing, are called neurosimulators. We define the term
neurosimulator as any set of software and/or hardware components dedicated to designing,
controlling, monitoring, manipulating, or (fast) executing neural network simulations. In
the FAQ of comp.ai.neural-nets, an elaborate list of neurosimulators is available. Most
neurosimulators have emerged from the need for support during experimentation. Existing
neurosimulators like the RCS [40], Aspirin-Migraines [63], NNSIM [106], MetaNet [70]
or Genesis [123] were all built by research teams who were already doing neural network
research before starting to build their tools. As the number of people using
neural networks has increased, a potential market for neurosimulators has emerged. This
has led to the situation in which not only traditional neural network researchers, but also
electrical or chemical engineers, computer scientists, and all kinds of other academic or
commercial institutes are confronted with the problems of neural network design.
Commercially available neurosimulators are, for example, Mimenice [86], NeuralWorks [72] and
BrainMaker [97]. Pygmalion [2], SNNS [30] and PREENS [112] are examples of neurosimulators
built by computer scientists. In this thesis, a new approach to neurosimulators
is given. The justification, design and implementation of what we call an action-oriented
neurosimulator is presented.
In this introduction, a taxonomy of the users involved in neurocomputing is given. Based
on the different ways they work with neural networks, and on the typical actions that are
carried out when doing neurocomputing, a conceptual model of the neurocomputing life cycle
will be introduced. By examining the neurocomputing life cycle, the requirements can
be identified concerning which actions have to be supported by a neurosimulator. In Chapter 8,
the neurocomputing environment will be discussed. Also, an overview of existing
neurosimulators is given. By examining neurocomputing environments, the requirements
on how the typical actions have to be supported by a neurosimulator can be distinguished.
2.4.1 Users in the world of neurocomputing
Four classes or groups of users associated with neural networks can be identified: 1) model
builders, 2) tool builders, 3) applied researchers, and 4) end users.
o Model builders are people involved in basic or applied neural network research. They
can be found in research departments like cognitive science or biophysics. Their goal
is to obtain insight into the functioning of (parts of) the biological brain, or to build
artificial neural networks that are suited for a specific application range. Though on many
occasions the models they come up with are not simulated on a computer because they
exist merely as mathematical formulas, computer simulations are often used to validate
the behavior of the model. Of particular interest is that in such cases these users tend
not to use neurosimulators for implementing and simulating their neural network model,
but rather program it from scratch in a language like C [56]. The main reasons for
this are that most neurosimulators run slowly compared to tailor-made programs, and
furthermore that model builders are not willing to comply with the requirements of the
neurosimulator for adding or changing a (new) neural network model.
o Tool builders are people who implement neurosimulators. Their interest in building them
can originate from a need that came up during experimentation. Therefore,
tool builders can very well be people who belong to classes 1 and 3. Furthermore, tool
builders can have as a goal to do research in building neurosimulators or to come up
with commercially interesting packages.
o Applied researchers can belong to any group of people who are using neural networks
for some application. Typically, the more one is involved in training a neural network
and tweaking its parameters and architecture, the more one is willing to try out tools
and/or neurosimulators. Eventually, this may lead to building a neurosimulator which
is fine-tuned for the application. As a final stage of tuning the application, typically the
neural network part of it is integrated in the ultimate product.
o End users are people who use such an end product. The product is tailor-made for the
application the end user is interested in. The final end product may be the result of a
combined effort of the end users, model builders, tool builders and applied researchers.
Normally, neurosimulators are not used by this group for experimentation but for production
purposes only, as the end user has very limited experience with neural networks
and merely wants to use them to solve his particular problems.
People belonging to the model builders or applied researchers impose the highest requirements
on neurosimulators, especially with respect to extensibility and flexibility concerning new
models, new pattern formats and new (graphical) tools. They want to be able to implement
and validate new models, to change existing models or to combine several models
into a new, heterogeneous architecture.
2.4.2 The neurocomputing life cycle
The neurocomputing life cycle is considered as a multiple feedback loop over several stages.
In each stage a certain action is performed and one can jump from each stage to any other.
The stages in the neurocomputing life cycle can be grouped into four categories: initiation,
tuning, testing and operation (see Figure 2.5).
[Figure 2.5 groups the stages into four phases. Initiation: task identification, environment
specification, model selection, data acquisition. Tuning: model implementation, architecture
adjustment, parameter adjustment, rule adjustment. Testing: recall, architecture optimization,
test set validation. Operation: code optimization, code integration, production.]
Figure 2.5: The neurocomputing life cycle.
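The grouping of stages into phases can also be written down as a small data structure. The following Python sketch is only an illustration of the life cycle as listed in Figure 2.5; the names LIFE_CYCLE, phase_of and feedback_targets are hypothetical and do not refer to any neurosimulator discussed in this thesis.

```python
# Phases and stages of the neurocomputing life cycle, as listed in Figure 2.5.
# All identifiers here are illustrative assumptions.
LIFE_CYCLE = {
    "initiation": ["task identification", "environment specification",
                   "model selection", "data acquisition"],
    "tuning": ["model implementation", "architecture adjustment",
               "parameter adjustment", "rule adjustment"],
    "testing": ["recall", "architecture optimization", "test set validation"],
    "operation": ["code optimization", "code integration", "production"],
}


def phase_of(stage):
    """Return the phase that contains the given stage."""
    for phase, stages in LIFE_CYCLE.items():
        if stage in stages:
            return phase
    raise KeyError(stage)


def feedback_targets(stage):
    """The life cycle is a multiple feedback loop: from any stage one may
    jump to every other stage."""
    return [s for stages in LIFE_CYCLE.values() for s in stages if s != stage]


total_stages = sum(len(stages) for stages in LIFE_CYCLE.values())
```

Under these assumptions, a feedback jump from the testing phase back to the initiation phase is simply a move from, say, "test set validation" to "data acquisition".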
Initiation
In the initiation phase it is decided what task has to be performed. This can be the
development of a new neural network architecture (typically done by model builders), or
any application in which a suitable neural network has to be found, tuned for a certain
dataset and integrated within a certain environment. Stages in the initiation phase are:
- Task identification, which reflects the aim of using neural networks.
- Environment specification, which states the requirements concerning input and output
streams, execution speed and accuracy of the neural network to be used.
- Model selection, which depends heavily on the task to be performed.
- Data acquisition, which is partly ruled by the environment and has constraints imposed
by the neural network.
Note that the neural network or the application may require that the dataset is formatted
or preprocessed in a specific manner. In the initiation phase it is decided how to integrate
environment, neural network and dataset.
Tuning
After the initiation phase, tuning may involve finding the optimal neural network architecture,
parameter settings and activation and training rules for a given dataset. If the
selected task cannot be accomplished within certain limitations, the conclusion of the
tuning phase may even be that a different task has to be selected, or that the
task cannot be accomplished. Stages in the tuning phase are:
- Model implementation. In the initiation phase, the neural network model to be
used is selected. If a new model has to be developed, or if the chosen model is not
supported by available neurosimulators, implementation of the model is required.
- Architecture adjustment, changing number of neurons, weights, thresholds, etc.
- Parameter adjustment, changing learning rate, weight values, thresholds, etc.
- Rule adjustment, changing activation or learning mechanisms.
At any moment during tuning, a different neural network architecture or model may be
selected, or a different dataset may be constructed. This means that a feedback jump is
made to the initiation phase.
Testing
Once a neural network is tuned for an application, the testing phase is carried out. Its
generalization capabilities are tested on a dataset different from the one the network was
tuned with. The test data will have to contain sufficient relevant samples of the eventual
application that is executed in the operational phase. In order to arrive at an optimal
network architecture, often many feedback loops between the tuning and testing phases are
performed. Stages in the testing phase are:
- Recall, which is used to compute the network-generated output for a given input
pattern.
- Architecture optimization. For example, a technique called pruning can be used to find
a minimum number of hidden layers and nodes for which a backpropagation neural
network is still capable of performing well on a test set.
- Test set validation, which examines the performance of the neural network for the
test data. The performance may represent execution speed, accuracy or reliability.
After the test phase, too, it may be concluded that the neural network is not able to perform
well enough within the given environment, for the given dataset, network architecture, and
tuned parameters. This conclusion may lead to feedback jumps to the previous phases.
Operation
Once the neural network architecture and parameters are tuned and tested, the network is
ready for the eventual operational phase. Further optimizations are possible (e.g., extracting
only relevant program parts from the neural network simulation program, optimizing
the code, or implementing the network in hardware), but in general no changes are made
to the architecture and parameters themselves. Especially if the neural network is used as one
step in a range of processing stages, in many cases the neural network code is integrated
in the eventual application. The stages in the operational phase are:
- Code optimization, for example using in-line assembler primitives, loop unrolling and
other techniques, or using hardware accelerators.
- Code integration. The full end-application may already have the neural network
integrated in its environment. However, especially when using neurosimulators to
tune and test the network, this stage may involve extracting a stand-alone program
and integrating it within the environment.
- Production, the final stage during which the network is used to accomplish the
overall task.
2.4.3 An engineering approach to neurocomputing
Another approach to this life cycle is given by Whittington and Spracklen in [119]. Their
paper is one of a very limited number of publications which discuss the neurocomputing
life cycle. It is an engineering approach, in which six phases are distinguished in the development
of a neural network application: assessment, specification, design, implementation,
evaluation and delivery. In the assessment phase, the aim is to find out whether neural
networks are a feasible or appropriate technology for the task to be performed. Limitations of
neural networks like the ones mentioned at the beginning of this chapter have to be considered.
The specification phase has to consider several issues, which are the nature and
type of the application, the I/O representation of the data, and its availability. The nature
of the application can be either stand-alone or embedded, where the latter means that in
later phases the neural network has to be coupled to other components of a larger system.
The I/O representation is specified by the system, whereas the representation required by
the neural network is only specified during the design stage. The availability of the data
is of importance during the specification phase because on many occasions, data acquisition
may be a time- or money-consuming process. At this moment during the life cycle
the decision may be made to choose, e.g., a self-organizing neural network because of the
unavailability of supervised data. In the design phase, the neural network model, possible
pre- and post-processing of the data, the runtime requirements and I/O requirements of
the neural network are identified. After the specification and design, the neural network
is implemented during the implementation phase. In software engineering it is commonly
assumed that once specification and design are correctly specified and validated, the implementation
is a straightforward issue. If the implemented system passes the testing and
evaluation tests, it is ready for acceptance. However, Whittington and Spracklen mention
that the key issue in the neurocomputing implementation phase is the validation of the
implementation for its correctness, as implementation errors will often not be observed because
of the adaptive nature of neural systems. Because of this feature, a neural network
will still continue its operation despite being incorrectly implemented. Indeed, many
discussions in the newsgroup comp.ai.neural-nets indicate that users have incorrectly
implemented their neural network and wonder why it produces strange or partly
correct results. The evaluation phase is carried out as depicted in Figure 2.6. The final
phase is the delivery phase during which the implemented system is installed within its
environment.
Figure 2.6: Evaluation phase in the methodology of Whittington and Spracklen. This figure
is drawn from Figure 3 in [119]. The evaluation phase corresponds with the tuning and
testing phases in Figure 2.5.
When comparing the neurocomputing life cycle as depicted in Figure 2.5 and the development
cycle of Whittington and Spracklen, many similarities can be found. Note that in
Figure 2.5 fourteen different stages are identified. To compare these with the six phases of
Whittington and Spracklen, consider Table 2.1:
phases          stages in Figure 2.5
assessment      task identification
specification   data acquisition, environment specification
design          model selection, environment specification
implementation  model implementation
evaluation      the stages associated with tuning and testing
delivery        code integration, production
Table 2.1: Similarity between stages listed in Figure 2.5 and the methodology of
Whittington and Spracklen.
The assessment, specification and design phases of Whittington and Spracklen are performed
in the initiation phase. The implementation and some of the evaluation phase are
contained in the tuning phase, whereas the rest of the evaluation phase occurs in the
testing phase. Because the development cycle of Whittington and Spracklen is specifically
targeted at the use of neural networks to perform a specific task within an application, their
most important phase is the implementation phase. This is emphasized in [119]. However,
depending on what the user of neural networks wants to achieve, in many cases the implementation
phase is completely superfluous, especially if the selected neural network model
(in the initiation phase) is already supported by an available neurosimulator.
Considering the four groups of potential neural network users and the neurocomputing life
cycle discussed in this section, the respective important phases are listed in Table 2.2:
user class           interest
model builders       model selection, model implementation
tool builders        model implementation
applied researchers  task identification, data acquisition, environment specification,
                     tuning, testing and operation
end users            operation
Table 2.2: Phases and stages depending on user interests.
2.4.4 Towards an action-oriented neurosimulator.
In Chapter 8, several existing neurosimulators are discussed, like Pygmalion [2], the Rochester
Connectionist Simulator [40], Aspirin/MIGRAINES [63] and Genesis [122]. They are
considered from two perspectives: the perspective of the user and its environment, and
the perspective of the neurosimulator and its environment. Neurosimulators can be divided
into three categories: application-oriented systems, algorithm-oriented systems and
general programming systems. The characteristic features that neurosimulators offer are
listed in, for example, [77], [80], [109] and [116]. Neurosimulators can have a graphical
user-interface, an algorithm library, and support for building new models. They can contain
visualization and monitoring tools, neural network description languages, application-specific
tools, and dedicated hardware accelerators. Based on the examination of the environment
in which neurosimulators are used, potential user requirements, and the pros and
cons of features that existing neurosimulators contain, a new neurosimulator is proposed
in this thesis. It is called PREENS, a parallel research execution environment for neural
systems. The requirements that PREENS has to fulfill are:
1. Provide a general purpose user-interface suited for the monitoring, visualization and
control of any running neural network simulation program.
2. Provide an interface definition via which new or existing simulations can be coupled
to the user-interface without much effort.
3. Provide a communication and control interface via which both distributed and sequential
simulation programs can be controlled.
Three components of PREENS were developed to fulfill these requirements.
The first is a small specification language for specifying action-oriented program descriptions.
The second is a set of interface definitions that use an action-oriented program
description to access pieces of neural network simulation data and code. The third is a
graphical user-interface called CONVIS via which programs and associated tools can be
controlled.
2.4.5 The program description
The major insight that led to a system meeting these requirements was that most neural
network simulations only implement a limited number of actions, being the loading and
saving of neural network data or stimuli, the initiation and control of a training or recall
session, and the final operation of the tuned simulation program. By specifying this limited
set of actions, and providing an interface that can use the specification for control,
visualization and operation of the program, the requirements listed above can be met. This
implies that, whereas most other neurosimulators are based on a general description of the
neural network, PREENS is based on a description of the neural network program. This
program description uses what I call an action-oriented model.
In such a model, each program can be described by specifying the actions it implements.
Each action can have a number of objects associated with it, e.g., parameters, variables,
data, options and settings. Figure 2.7 depicts the conceptual model of a PREENS action.
A program description contains specifications of a number of such actions. Each action has
parameters, options and settings for installing its initial configuration, and it has variables
and data whose values can change during execution of the action. The action-oriented
model and program description are explained in Chapter 9.
Figure 2.7: Conceptual model of the PREENS action.
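To make the action-oriented model more concrete, the sketch below represents an action and a program description as plain Python data classes. It is not the PREENS specification language, which is introduced in Chapter 9; the class names, field layout, action names and example values (such as the file name "net.wts") are all hypothetical and serve only to show how actions group their associated objects.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    """One action implemented by a simulation program (hypothetical layout)."""
    name: str
    # Objects that install the initial configuration of the action.
    parameters: dict = field(default_factory=dict)
    options: dict = field(default_factory=dict)
    settings: dict = field(default_factory=dict)
    # Objects whose values may change while the action executes.
    variables: dict = field(default_factory=dict)
    data: dict = field(default_factory=dict)


@dataclass
class ProgramDescription:
    """A program is described by the actions it implements."""
    program: str
    actions: dict = field(default_factory=dict)

    def add(self, action):
        self.actions[action.name] = action

    def monitorable(self, action_name):
        """Names that a user-interface could display during execution."""
        action = self.actions[action_name]
        return sorted(list(action.variables) + list(action.data))


description = ProgramDescription("backprop")
description.add(Action("load_weights", parameters={"file": "net.wts"}))
description.add(Action(
    "train",
    parameters={"learning_rate": 0.1, "epochs": 100},
    variables={"epoch": 0, "error": None},
    data={"weights": []},
))
```

In this sketch a user-interface needs no knowledge of the network itself: it only inspects the description, for example through description.monitorable("train"), to find out what can be controlled or visualized.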
2.4.6 CONVIS, the user-interface for control and visualization
CONVIS [116] is the general purpose user-interface that manages a running neural network
simulation program and any number of associated tools. It provides an environment which
takes care of the control of actions while making minimal assumptions about the way in
which they are implemented. It uses a set of interface definitions, based on a program description,
for data exchange with and control of running neural network simulation programs
and associated tools. The environment of PREENS is discussed in Chapter 9, and consists
of the manager CONVIS and a set of neural network programs and tools (see Figure 2.8).
Figure 2.8: The PREENS neurosimulator environment.
2.4.7 Applications of PREENS
Each of the neural network simulation programs contained in the PREENS algorithm library
is tested on a number of data sets. For example, sets containing echographic liver data
and sets containing printed characters were examined. The most important application
considered is the classification of ground cover classes using remotely sensed imagery. In
collaboration with Ron Schoenmakers of the Joint Research Centre at Ispra, Italy, this
work resulted in a paper for the IGARSS '95 conference held in Firenze, Italy [87]. This
application will be used as a running example in the last chapters of this thesis.
3 Parallelism and the transputer
Outline
This chapter briefly describes the parallelism that can be observed in
computer systems. The class of MIMD parallel processors to which transputer
systems belong is discussed in more detail. The architecture of the
T8xx transputer is presented and the differences with its successor, the
T9000, are listed. Three transputer systems are described: the NSC, containing
64 T800 processors, the GCEL, containing 512 T805 processors,
and the PX, a system containing 32 PowerPCs. Furthermore, the most
significant features of how to use the operating systems Helios and Parix
are described.
3.1 Parallelism in computer systems
The overall goal of exploiting parallelism is to finish a job in a smaller amount of time, 
or to handle more (distinct) jobs in the same amount of time. This goal is achieved in 
current computer systems by exploiting parallelism on several levels of detail. These can 
be classified in increasing level as: (1) bit level, (2) instruction level, (3) function unit level, 
(4) function level, (5) CPU level, (6) data level, (7) program level and (8) computer level.
On the lowest level - the bit level - the width of the data bus (e.g., 8, 16, 32, or 64
bits) determines how many bits of information can be transferred between processor
units in parallel within one clock tick. Similarly, the ALU or FPU can be, e.g., 16, 32 or
64 bits wide, which means that that many bits of information are processed within one clock
tick. I/O can be performed via parallel ports, transferring a number of bits in parallel.
On an even finer level of detail, the smallest components of a computer system all work
in parallel. Current VLSI or WSI technology allows a very
high number of components packed on a small area of silicon to operate simultaneously. For
example, several special purpose hardware computers have been designed that implement
parallel neural networks in neuro-ASICs (application-specific integrated circuits).
On the instruction level, parallelism is exploited by parallelizing the instruction cycle. A
well-known method is instruction pipelining, which overlaps the instruction fetch, instruction
decode, operand fetch, execution and result store stages of successive instructions. A series
of operations can be computed significantly faster in this way. An example instruction
pipeline is depicted in Figure 3.1.
[Figure 3.1 shows the stages ifetch, idecode, ofetch, execute and store, once executed as a
normal instruction cycle and once as a pipeline.]
Figure 3.1: Instruction pipelining versus normal instruction cycle, time in clock ticks.
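The gain from pipelining can be estimated with a simple count of clock ticks. The sketch below assumes an idealized pipeline in which each of the five stages takes exactly one tick and no stalls occur; the function names and the choice of ten instructions are assumptions made for illustration. Without pipelining, n instructions need n times 5 ticks; with pipelining, the first instruction needs 5 ticks and each further instruction completes one tick later.

```python
# The five stages named in the text; one clock tick per stage is assumed.
STAGES = ["ifetch", "idecode", "ofetch", "execute", "store"]


def sequential_ticks(n_instructions, n_stages=len(STAGES)):
    """Normal instruction cycle: every instruction runs all stages alone."""
    return n_instructions * n_stages


def pipelined_ticks(n_instructions, n_stages=len(STAGES)):
    """Ideal pipeline: fill once, then one instruction completes per tick."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


def pipeline_schedule(n_instructions):
    """Map each clock tick to the (instruction, stage) pairs active in it."""
    schedule = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            schedule.setdefault(i + s, []).append((i, stage))
    return schedule


ticks_seq = sequential_ticks(10)    # 50 ticks
ticks_pipe = pipelined_ticks(10)    # 14 ticks
speedup = ticks_seq / ticks_pipe
```

For long instruction sequences the speedup in this idealized model approaches the number of stages, which is why keeping the pipeline supplied with instructions and operands matters so much in practice.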
On the function unit level, parallelism exists in the occurrence of multiple functional units,
e.g., one or more ALUs or FPUs. Furthermore, the FPU, for example, can be subdivided further
into, e.g., a floating point adder/subtracter and a floating point multiplier/divider. By
simply replicating functional units and using pipelining, multiple arithmetic operations can
be performed in parallel. One of the key issues in efficient pipelining
is how to provide the pipeline with data (instructions, operands). The memory bandwidth
(the average number of information units that is accessed per time unit) and the instruction 
and data access characteristics highly determine the performance of a pipeline processor.
The function level can be characterized by the existence of several dedicated hardware
components that each perform a specific task, like disk I/O, inter-processor I/O (communication),
memory interfaces and arithmetic units.
More and more often, current computer systems contain several CPUs (e.g., Sun Ultra, IBM
ES6000 cluster). This results in parallelism on the CPU level. The CPUs share multiple
resources like shared memory, disk drives and communication hardware, and therefore the
performance of such computer systems is highly dependent on the managing operating
system. The operating system decides whether shared resources may be accessed in parallel
or not. Applications that require many non-shared system calls (e.g. file I/O) may suffer
enormously from contention in resource accesses, whereas applications that are dominated by
computation may run highly efficiently.
Data level parallelism is observed in computer systems that can compute
many (simple) operations on individual data elements in parallel. These computer
systems are known as SIMD (single instruction multiple data) parallel processors. Array
processors like the massively parallel processor (MPP), and multi-processor systems like
the Connection Machine are SIMD processors. These systems feature a large number of
relatively small processing elements and an interconnection communication network for
routing of data and instructions. Each processing element executes the same set of instructions
in a lock-step fashion, where the instructions are broadcast by a centralized
control unit. A typical SIMD parallel processor architecture is depicted in Figure 3.2.
Figure 3.2: A general SIMD parallel processor architecture.
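The lock-step execution of broadcast instructions can be mimicked in a few lines of Python. The sketch below is a sequential simulation, not a parallel program: the class ProcessingElement, the function broadcast and the two-instruction program are assumptions made for illustration. Each processing element holds one local data element, and the control unit issues every instruction to all elements before moving on to the next instruction.

```python
import operator


class ProcessingElement:
    """Holds one local data element and executes whatever is broadcast."""

    def __init__(self, value):
        self.value = value

    def execute(self, op, operand):
        self.value = op(self.value, operand)


def broadcast(program, elements):
    """Central control unit: issue each instruction to all elements, so that
    no element starts instruction k+1 before all have finished instruction k."""
    for op, operand in program:
        for pe in elements:  # conceptually simultaneous on SIMD hardware
            pe.execute(op, operand)


elements = [ProcessingElement(v) for v in [1, 2, 3, 4]]
# The same instruction stream for every element: y = 2 * x + 1.
program = [(operator.mul, 2), (operator.add, 1)]
broadcast(program, elements)
result = [pe.value for pe in elements]
```

On real SIMD hardware the inner loop disappears: all elements apply the broadcast instruction to their own data in the same clock period.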
Another class of computer systems that show data parallelism are vector supercomputers
like the Cray-1, CDC STAR-100 and Cyber-205. Besides providing scalar operations, such
systems are designed to operate on data arranged in a regular, homogeneous structure like
in matrix or vector formats. Vector operations are performed in a pipelined fashion on
subsequent vector elements (see Figure 3.3) and there may exist multiple vector pipelines.
Figure 3.3: Simplified depiction of a pipelined vector computation.
Furthermore, vector pipelines may be chained, thus holding intermediate results ready for
execution before storing them back into memory. The concept of pipeline vector processors
can be classified as MISD (multiple instruction single data), as different stages in the
pipeline perform different operations, but operate on only one scalar or vector data element.
Vector processors are characterized by their very high instruction throughput of hundreds
of MFLOPS. A typical vector processor architecture is depicted in Figure 3.4.
Figure 3.4: Typical architecture of vector processors.
Program level parallelism can be observed in multi-tasking and/or multi-user environments, 
where several independent tasks run via time-sharing on one or more processors. In case 
of several processors, programs can run in parallel. In case of one processor, it is the task 
of the operating system to schedule tasks and take care of swapping code to and from 
memory. Program parallelism is the most commonly used type of parallelism, which is 
exploited in almost all workstation and personal computer systems available today.
The highest level of parallelism is on the computer level. The class of computer systems
that consist of several processors, each having its own CPU(s), arithmetic units, I/O components,
etc., is called MIMD (multiple instruction multiple data). These systems are
considered general purpose, as each processor in the parallel processor network may execute
a separate program, different from programs running on other processors (i.e. multiple
instructions), on a distinct set of data (multiple data). The Inmos transputer can be
used as a processing element for building MIMD parallel processors. As our target hardware
contains transputer systems, in the rest of this chapter the attention is focussed on
computer systems that belong to the class of MIMD parallel processors.
3.2 MIMD parallel processor systems
A typical MIMD processor network consists of a set of processing elements and an interconnection
network for transferring data between processing elements and external memory.
Basically, two kinds of MIMD parallel processors exist: shared memory and distributed
memory systems.
3.2.1 Shared memory systems
Like any other MIMD architecture, shared memory systems contain a number of computation
elements (CPUs). A salient feature is that there exists some means via which different
CPUs can map into some global address space (shared memory). This can be done via
some common communication path, which can be a memory bus, some (multi-stage) interconnection
network or a combination of a bus and interconnection network. The available
memory can be completely global (Figure 3.5(a)), each processor can have its own local
memory while sharing a global memory pool (Figure 3.5(b)), or processors can somehow
share certain parts of each processor's local memory (Figure 3.5(c)). Some combinations
of these architectures are also possible.
Figure 3.5: Shared memory processor architectures. (a) Shared memory without local memory.
(b) Shared memory with local memory (LM). (c) Shared local memory.
Bus-based shared memory systems use, for example, standard buses like the VME bus,
the NuBus and Multibus. Each processor requests read/write operations via a bus
controller. In order to prevent two processors from writing to the same memory location,
a means of determining the availability of the bus is required. Several algorithms exist
for scheduling the bus access over the processors, such as daisy chaining, round robin,
FIFO, LIFO and polling. Essential in these systems is the ratio of local cache memory to
global memory. If the number of processors is large compared to the bandwidth of the
bus, shared memory systems suffer from bus contention, resulting in severe performance
degradation. One solution to bus contention is the use of hierarchical multiple buses,
resulting in a network of bus communication layers. Whereas a single bus prevents shared
memory systems from being scalable, a multi-bus design allows for scalability.
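As an illustration of one of the scheduling algorithms named above, the following Python sketch simulates round-robin bus arbitration. It is a simplified model under stated assumptions: the bus serves exactly one request per bus cycle, processors are numbered 0 to n_processors - 1, and the function name round_robin_arbiter and the request pattern are hypothetical.

```python
def round_robin_arbiter(requests, n_processors, start=0):
    """Grant the bus to one requesting processor per bus cycle.

    requests[t] is the set of processors (numbered 0..n_processors-1) that
    issue a new request at cycle t; ungranted requests stay pending.
    Returns the processor granted in each cycle (None if the bus is idle)."""
    pending = set()
    grants = []
    last = start - 1
    t = 0
    while t < len(requests) or pending:
        if t < len(requests):
            pending |= requests[t]
        grant = None
        # Search in circular order, starting just after the last grant.
        for offset in range(1, n_processors + 1):
            candidate = (last + offset) % n_processors
            if candidate in pending:
                grant = candidate
                break
        if grant is not None:
            pending.discard(grant)
            last = grant
        grants.append(grant)
        t += 1
    return grants


# Four processors request the bus at once; processor 0 asks again one cycle later.
grants = round_robin_arbiter([{0, 1, 2, 3}, {0}, set()], n_processors=4)
```

In this example five requests occupy five consecutive bus cycles, and the second request of processor 0 has to wait until all other processors have been served. This serialization is the bus contention described above: the more processors share one bus, the longer each request waits.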
Shared memory systems with local memory per CPU are characterized by an interconnection
network, a virtual or real shared memory part, a global shared memory part and a number
of processing elements that contain private local memory. The interconnection network
(typically a network based on crossbar switches) allows individual processing elements
to access the local memories of other processing elements and the global memory. A
crossbar switch uses several components to provide all-to-all communication between a
number of processing elements. By combining crossbar switches, multi-stage interconnection
networks can be built to couple clusters of processing elements, resulting in scalable
parallel processor architectures. An example of a system with shared and local memory is
the BBN Butterfly.
3.2.2 Distributed memory MIMD systems
The typical architecture of a distributed memory system consists of a number of
processing elements connected via some communication medium. Each processing element can
be considered as a stand-alone computer which has its own CPU, memory and communication
ports (see Figure 3.6(a)). The Intel iPSC/1 and iPSC/2, the NCube and transputer
systems are examples of parallel processor networks with a number of processing elements,
each representing a stand-alone computer. The processing elements are connected via a
communication network and are equipped with one or more communication processors or
link interfaces. The Intel hypercubes consist of processing elements with standard 80286
(iPSC/1) or 80386 (iPSC/2) processors. Each processing element contains a direct connect
routing module (DSC) which has 8 communication links, one of which is reserved for
external I/O. By chaining several DSCs, it is possible to dynamically establish a physical
communication path between any two processors. Intel hypercubes have dimensions of 3
(P=8) up to 7 (P=128), delivering peak performances of 4 up to 1280 MFLOPS
depending on the availability of scalar or vector co-processors. The NCube, developed by
the NCube Corporation, covers a range of machines from 16 processors (NCube/4) and
1 I/O system up to 1024 processors (NCube/10) and 8 I/O systems. Each processing
element contains a 32-bit processor with local memory and communication links to other
processing elements. The I/O systems provide a very high potential bandwidth for
transferring code and data between external devices and the hypercube network. Each node
has 22 communication channels, 20 of which are used to connect to other processing
elements and 2 of which are used for system I/O. Transputer systems are extensively
discussed in the rest of this chapter.
(a) Distributed memory. (b) Connected via LANs or WANs.
Figure 3.6: MIMD parallel processor architectures.
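The hypercube figures quoted above (dimension 3 gives P=8, dimension 7 gives P=128) follow from the usual binary labeling of hypercube nodes. The sketch below assumes that labeling: two nodes are neighbors when their node numbers differ in exactly one bit. The function names are hypothetical and are not taken from any Intel or NCube software.

```c
#include <assert.h>

/* Number of nodes in a hypercube of dimension d (P = 2^d). */
unsigned hypercube_nodes(unsigned d)
{
    assert(d < 32);
    return 1u << d;
}

/* Neighbor of `node` along dimension k: flip bit k of the node number. */
unsigned hypercube_neighbor(unsigned node, unsigned k)
{
    assert(k < 32);
    return node ^ (1u << k);
}

/* Minimal number of hops between two nodes: the number of bits in
   which their node numbers differ (Hamming distance). */
unsigned hypercube_hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;
    unsigned hops = 0;
    while (diff != 0) {
        hops += diff & 1u;
        diff >>= 1;
    }
    return hops;
}
```

Under this labeling, a message never needs more than d hops in a hypercube of dimension d, which is one reason why chaining a few routing modules suffices to connect any two processors.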
Remote MIMD systems are essentially stand-alone computer systems that communicate via
local or wide area networks, such as clusters of workstations (see Figure 3.6(b)).
Currently, many operating systems designed for parallel computer architectures are being
ported to these machines. The advantages concerning portability, prototyping, debugging,
cross-compilation and heterogeneous network resource utilization are numerous compared to
what is currently available on most native operating (or runtime) systems that run on
parallel processor systems belonging to the other classes listed above. Parallel operating
systems like CS Tools, Express, Helios and Parix, and message passing standards
like PVM and MPI, are successfully used in many high performance applications, and it is
believed that developing parallel applications on workstation clusters requires less effort
than on native systems.
3.3 The Inmos transputer
The Inmos transputer [65, 67] has processing units, control logic, private local memory, and
four communication links on one single VLSI device (see Figure 3.7). Transputers can be
used in a single-processor fashion for applications like control, but also in multi-processor
configurations for building a high performance parallel execution platform for large scale
applications.
Figure 3.7: The T800, a 32-bit microprocessor with 64-bit floating point unit. Transputer
Reference Manual, Inmos Ltd. [65].
The T414, T800 and T805 transputers contain several components connected via a 32-bit
bus. Via the bus the transputer can access internal memory (2 Kb or 4 Kb, depending on
the type), and using the external memory interface the external RAM (typically 1, 4, 8
or 16 Mb) can be accessed. The experiments discussed in this thesis have been performed on
three transputer systems: the Nijmegen SuperCluster (NSC), the GCel-512 (GCEL) and the
PowerXPlorer (PX). The latter two systems are located at the University of Amsterdam.
The NSC consists of 64 T800 transputers with 4 Mb external RAM and a clock speed of
25 MHz each. The transputers in the GCEL are T805 transputers and have the same amount
of memory but a clock speed of 30 MHz. The speed of each of the four communication links
is 20 Mbits/second, delivering a total bandwidth of 80 Mbits/second per transputer. The
PX is described later in this chapter.
A transputer can address 4 Gb of consecutive memory, part of which contains the on-chip
RAM. The CPU is a RISC processor using only six registers for sequential processing.
Three registers (A, B, C) form an evaluation stack used for integer and address
arithmetic. The other three registers are the workspace pointer, which points to a memory
area where local variables are kept, the instruction pointer, which points to the next
instruction, and the operand register, which is used in the formation of instruction
operands (see Figure 3.8). The CPU and FPU can operate in parallel: for example, the CPU
may perform some address calculations while the FPU executes a floating point operation
(the FPU has some additional registers).
Figure 3.8: The transputer registers. Transputer Reference Manual, Inmos Ltd. [65].
The design of the transputer is based on the parallel programming language OCCAM [51],
which is an implementation of the language CSP (communicating sequential processes)
developed by Hoare [47]. In the OCCAM model, an application is described by multiple
concurrent processes that communicate via communication channels. Processes can run
concurrently on one processor or in parallel on more than one processor. Conceptually, the
same OCCAM program can run on one or on several processors. Communication between
processes can be done via local (internal) channels implemented through local memory, or
via remote (external) channels between neighboring processors. Remote channels are
implemented via link interfaces. Both the scheduling of concurrent processes on one
processor and the communication via link interfaces are implemented in the transputer
microcode.
3.3.1 Process scheduling
The transputer can run multiple processes concurrently at two priority levels (high priority
and low priority), where each process has its own set of registers. Time sharing of low
priority processes is performed by the microcoded scheduler and only occurs after certain
instructions (descheduling points) which leave the A, B, C and FPU registers undefined.
Because scheduling is performed in hardware and the transputer has only a small number
of registers that have to be saved, context switches between processes can be executed
extremely fast (in the order of microseconds). Processes running concurrently on one
processor can be either active or inactive. Active processes are either being executed or
are in a list of processes waiting to be executed. Inactive processes are in a list of
processes waiting for some I/O or timer event to occur, and do not consume any
computation time.
High priority processes are typically interrupt service routines. They are descheduled
only upon termination or when they enter an inactive state. Low priority processes are
scheduled only if no high priority process is active. For both priorities a queue of active
processes waiting to be executed is maintained. The queue is a linked list of process
workspaces, implemented using two registers, one of which points to the first process in
the list (front), the other to the last (back). The workspaces contain a process's local
variables and some status information. For example, consider four processes, P, Q, R and
S, running concurrently. The running process is S and the first process to become
scheduled is P. When switching from S to P, the instruction pointer of S is saved in its
local workspace, the workspace and next instruction registers are loaded from P's
workspace, and the front and back pointers are adjusted (see Figure 3.9).
Figure 3.9: Linked process workspace list. Transputer Reference Manual, Inmos Ltd. [65].
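The front/back queue of process workspaces described above can be mimicked in software. The following is only a simplified sketch: on the transputer this bookkeeping is done in microcode, and the `Workspace` and `RunQueue` types and the function names are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified process workspace: only the fields needed for scheduling. */
typedef struct Workspace {
    char name;                 /* process label, e.g. 'P' */
    unsigned saved_iptr;       /* instruction pointer saved at deschedule */
    struct Workspace *next;    /* link to the next waiting process */
} Workspace;

/* Queue of active processes for one priority level. */
typedef struct {
    Workspace *front;          /* first process waiting to be executed */
    Workspace *back;           /* last process waiting to be executed */
} RunQueue;

/* Append a process to the back of the queue. */
void enqueue(RunQueue *q, Workspace *w)
{
    w->next = NULL;
    if (q->front == NULL)
        q->front = w;
    else
        q->back->next = w;
    q->back = w;
}

/* Remove and return the process at the front, or NULL if none waits. */
Workspace *dequeue(RunQueue *q)
{
    Workspace *w = q->front;
    if (w != NULL) {
        q->front = w->next;
        if (q->front == NULL)
            q->back = NULL;
        w->next = NULL;
    }
    return w;
}

/* Deschedule `running` (saving its instruction pointer) and return the
   next process to run. If no other process waits, `running` continues. */
Workspace *context_switch(RunQueue *q, Workspace *running, unsigned iptr)
{
    running->saved_iptr = iptr;
    enqueue(q, running);
    return dequeue(q);
}
```

With P, Q and R waiting and S running, a call to `context_switch` saves the instruction pointer of S, appends S at the back and returns P, leaving Q at the front of the queue. This corresponds to the switch from S to P in the example above.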
3.3.2 Internal and external inter-processor communications
The transputer uses the channel concept for inter-process communication. A channel is a
memory location either containing EMPTY, indicating that no process uses the channel for
communication, or containing a pointer to the workspace of the process that is using it
for communication. Assume that two processes A and B wish to communicate with each
other through a channel. Each local I/O request results in loading the number of bytes
to be transmitted (count), loading the pointer to the data that has to be transmitted,
and checking the channel location, which initially is empty. The first process performing
an I/O request (say process A) finds the channel location empty and stores its identity
(pointer to process A) in it (Figure 3.10).
Figure 3.10: Initiating an internal I/O request. Process A is about to become inactive
and wait for process B's I/O request.
Then, process A becomes inactive as it waits for I/O and is subsequently descheduled,
which results in storing the data pointer, count and process context in its workspace.
When the second process (B) is scheduled and performs its I/O request, it finds the
corresponding channel location occupied by the pointer to process A and is able to access
A's data address through the previously stored data pointer (Figure 3.11). The actual
communication is then performed via a block move, and the inactive process A is added to
the active process workspace list. Finally, the channel location is reset to its empty
state. For this procedure, it does not matter which process performs the I/O request first.
Figure 3.11: Internal communication when process B has become scheduled.
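The rendezvous over an internal channel can be sketched in a few lines of C. This is a simulation of the protocol described above, not transputer microcode: the `Process` and `Channel` types, the `is_sender` flag and the function `channel_io` are hypothetical names introduced for this example, and EMPTY is represented by a NULL pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct Process {
    void  *data;       /* saved data pointer while waiting on a channel */
    size_t count;      /* saved byte count while waiting on a channel */
    int    is_sender;  /* 1 if the waiting request is an output */
    int    active;     /* 1 if on the active list, 0 if waiting for I/O */
} Process;

/* A channel is one word: NULL (EMPTY) or a pointer to the waiting process. */
typedef Process *Channel;

/* Perform one side of an internal communication.
   Returns 0 if the caller must wait (it arrived first),
   or 1 if the transfer completed (it arrived second). */
int channel_io(Channel *chan, Process *self, void *data, size_t count,
               int is_sender)
{
    if (*chan == NULL) {
        /* First to arrive: record the request and become inactive. */
        self->data = data;
        self->count = count;
        self->is_sender = is_sender;
        self->active = 0;
        *chan = self;
        return 0;
    }

    /* Second to arrive: the partner's data pointer is already stored. */
    Process *partner = *chan;
    assert(partner->is_sender != is_sender);
    assert(partner->count == count);
    if (is_sender)
        memcpy(partner->data, data, count);   /* block move to receiver */
    else
        memcpy(data, partner->data, count);   /* block move from sender */

    partner->active = 1;   /* partner rejoins the active process list */
    *chan = NULL;          /* channel returns to the EMPTY state */
    return 1;
}
```

The same function serves both the sending and the receiving side, which reflects the observation that it does not matter which of the two processes performs its I/O request first.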
If a process wishes to communicate with another process running on a different processor
over an external channel (i.e. a transputer link), the scheduler recognizes the
corresponding I/O request and delegates the task of communication to the corresponding
link interface, while descheduling the process. The link interface keeps track of the
descheduled process and records the starting address and count of the data to be
transmitted. Each of the four link interfaces has three registers to store this
information (see Figure 3.12). Furthermore, each link interface has DMA access to the
transputer memory, so the CPU is not needed to copy the data. As in the internal case
described above, processes initialize their corresponding channel for external
communication, but in this case the channel is implemented through the link interface.
When the link interfaces on both processors have initiated the transmission, the data is
transmitted over the transputer link. When the data transfer is completed, each link
interface places the corresponding process at the end of the active process list. Again,
it does not matter which of the processes performs the I/O request first.
Figure 3.12: External communication over a transputer link.
3.4 Transputer networks
Transputer networks contain a number of transputers connected in a network via transputer
links. As each processor only has four links, the connectivity of the network is restricted
to four. The major advantage of using processors like the transputer as a building block
for large parallel systems is that scaling up the number of transputers not only increases
the computation power, but also increases the total communication bandwidth of the system,
as for each transputer an extra four communication links are added. This means that
conceptually no global communication contention or bottlenecks occur, as is known from
shared memory or shared buses. Depending on what topology suits an application best,
several multi-transputer networks can be made, like a pipeline, grid, torus, ring and tree:
(a) pipeline (b) grid (c) torus (d) ring (e) tree
Figure 3.13: Some multi-transputer topologies. A transputer is depicted as a square box
with four lines representing the links in the north, east, south and west direction.
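To make the difference between a grid and a torus concrete, the sketch below computes the processor reached over each of the four links. It assumes processors are numbered row by row, which is a choice made for this example only; the `neighbor` function and the `Direction` type are hypothetical names.

```c
#include <assert.h>

enum Direction { NORTH, EAST, SOUTH, WEST };

/* Neighbor of processor `id` in a width x height network numbered row by
   row. With wrap != 0 the network is a torus; otherwise it is a grid and
   -1 is returned where a link is left unconnected at the border. */
int neighbor(int id, int width, int height, enum Direction dir, int wrap)
{
    assert(width > 0 && height > 0 && id >= 0 && id < width * height);
    int x = id % width;
    int y = id / width;

    switch (dir) {
    case NORTH: y -= 1; break;
    case EAST:  x += 1; break;
    case SOUTH: y += 1; break;
    case WEST:  x -= 1; break;
    }

    if (wrap) {
        x = (x + width) % width;
        y = (y + height) % height;
    } else if (x < 0 || x >= width || y < 0 || y >= height) {
        return -1;
    }
    return y * width + x;
}
```

In a torus every processor uses all four links, whereas in a grid the border processors leave one or two links free, which can then be used for I/O or for connections to other units.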
Connecting transputers via their links can be done in a fixed or a reconfigurable topology.
The latter is implemented by using programmable crossbar switches that allow the
transputer network to be configured in any desired topology. Note that for fixed topologies
the transputers are directly coupled via a transputer link, whereas this is not the case
when crossbar switches are used. The quantitative implications this has for inter-processor
communication delays will be discussed in a later section. The GCel-512 is configured in a
16x32 grid with fixed topology. The NSC (Nijmegen SuperCluster) uses Inmos C004
crossbar switches for implementing a reconfigurable system (Figure 3.14). A C004 crossbar
switch contains 32 demultiplexers that are able to couple any of the input links to one
of the output links. C004 chips can be cascaded to connect multi-transputer units with a
higher degree of connectivity. The next section gives an overview of the architecture
of the NSC, consisting of 64 transputers and several communication units consisting of
C004s and control logic.
Figure 3.14: The C004 provides a switch between 32 input and 32 output links.
3.4.1 Hierarchical architecture of the NSC
The SuperCluster [39] architecture has a hierarchical design consisting of basic building
blocks, multi-transputer modules, network configuration units, computing clusters and
multiple superclusters. The basic processing element in the NSC consists of a T800
transputer, 4 Mb of RAM and some control logic. Any processing element can connect to
four neighboring processing elements. A multi-transputer module contains four of these
processing elements (see Figure 3.15).
Figure 3.15: The multi-transputer module MTM-EDC.
A computing cluster contains four multi-transputer modules (MTMs) and a so-called network
configuration unit (NCU), which connects all 4x4 transputer links of each of the four
MTMs (so the 64 links of the processing elements in a computing cluster connect to its
NCU). Each NCU can switch 96 transputer links. Since 64 of these are taken by the
cluster's own processing elements, each computing cluster has 32 transputer links left
that can connect to other computing clusters or processing elements (see Figure 3.16).
Figure 3.16: The architecture of a computing cluster.
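The link budget of a computing cluster follows directly from the numbers given above. The small sketch below merely restates that arithmetic; the constant and function names are introduced for this illustration only.

```c
#include <assert.h>

enum {
    LINKS_PER_TRANSPUTER = 4,   /* each transputer has four links */
    TRANSPUTERS_PER_MTM  = 4,   /* processing elements per MTM */
    MTMS_PER_CLUSTER     = 4,   /* MTMs per computing cluster */
    NCU_LINKS            = 96   /* links one NCU can switch */
};

/* Processing elements in one computing cluster. */
int cluster_transputers(void)
{
    return TRANSPUTERS_PER_MTM * MTMS_PER_CLUSTER;
}

/* Links that the cluster's own processing elements bring to the NCU. */
int cluster_internal_links(void)
{
    return cluster_transputers() * LINKS_PER_TRANSPUTER;
}

/* NCU links left over for connections outside the cluster. */
int cluster_external_links(void)
{
    return NCU_LINKS - cluster_internal_links();
}

/* Processing elements in a supercluster of one to four clusters. */
int supercluster_transputers(int clusters)
{
    assert(clusters >= 1 && clusters <= 4);
    return clusters * cluster_transputers();
}
```

A supercluster with four computing clusters thus contains 64 transputers, which matches the size of the NSC.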
At the top level of the hierarchy, the supercluster consists of one to four computing
clusters. The computing clusters are connected and configured via 2 extra network
configuration units. Via the latter, multiple superclusters can be combined into
multi-transputer systems of, e.g., 256, 512 or 1024 processors.
3.4.2 Hierarchical architecture of the GCel
The GCel is one of Parsytec's GC family of transputer systems. GC stands for Grand
Challenge, GigaComputer or GigaCluster (as opposed to SuperCluster). The architecture of
the GC family is designed with the goal of building highly scalable massively parallel
transputer systems from building blocks of one GC containing 64 processors. Similar to the
supercluster architecture, a GC contains 4 clusters which each contain 16 processing
elements. Each processing element can be any Txxx transputer, but the main design goal
was to build a highly scalable parallel machine based on the T9000 transputer with the
C104 dynamic crossbar switch. Unfortunately, Inmos had serious problems with building
these components and therefore current GC architectures are based on the T805 processor
and C004 crossbar switch. These systems are called GCel.
As with the supercluster, the GC has a highly modular architecture. What is called
a computing cluster in the supercluster architecture corresponds to a cluster in the GC.
Each cluster has 16+1 processing elements; the 17th is used for redundancy purposes.
Whenever one of the 16 processors crashes, the 17th takes over its role. Every four
clusters together form a cube, which is the basic element of larger machines. The GCel-512
consists of 8 cubes, providing a total of 512 processors.
The cluster architecture resembles that of the computing cluster; however, instead of one
NCU, four C104 chips are used. The main reason for using more crossbar switches is
that the eventual goal of GC machines is to scale up to 16K processors, which requires
a very high global connectivity. Each cluster in a large GC network is connected to its six
neighbors in a three-dimensional grid via 8 links in each direction. Via the links, three
communication networks are realized in a GC system. The data network provides the
communication pathways for applications and the operating system. The control network
connects the control processors of each cube. Via the control network, processors can be
initialized or booted, routing chips can be programmed, and the hardware status of the
cube resources can be monitored. The third network is the I/O network, which connects
transputers with I/O devices, like a host machine and memory storage devices.
3.4.3 Configuring a transputer network
For the GCel machines, a user or programmer need not be concerned with how to
connect the various transputer links, as they are "hardwired" in a grid. However, the
machines can be configured in different ways by a system administrator. A user can only
allocate partitions of a GCel which the administrator has installed. For the NSC, a
configuration procedure is defined that allows a user to configure a transputer network in
any topology that can be made with the four links each processor has. Using the operating
system Helios (see Section 3.6), each user has to initiate a login procedure which performs
password authentication and subsequently reads a resource map defining the topology and
resources of the transputer network the user wishes to claim. If the required resources are
available, Helios boots each transputer in the specified network and loads a copy of the
distributed operating system or, if the user has marked a transputer to run in NATIVE
mode, leaves the processor empty. This allows native programs or runtime systems to be
loaded onto the processor without having Helios running on it. Each resource map consists
of a number of terminals, where each terminal can specify a transputer or any hardware
device like a frame grabber or graphics device. For each communication link of a processor
it can be specified to which processor it has to be connected. As an example, consider the
specification of a ternary tree of processors of depth 3:
subnet /Cluster {
    CONTROL Rst_Anl [/Cluster/00];
    terminal 00 { ~IO, , ~01; SYSTEM; ptype T800; Mnode Rst_Anl [pa • d]; }
    terminal 01 { ~00, ~02, ~03, ~04; HELIOS; ptype T800 }
    terminal 02 { ~01, ~05, ~06, ~07; HELIOS; ptype T800 }
    terminal 03 { ~01, ~08, ~09, ~10; HELIOS; ptype T800 }
    terminal 04 { ~01, ~11, ~12, ~13; HELIOS; ptype T800 }
    terminal 05 { ~02, , , ; HELIOS; ptype T800 }
    terminal 06 { ~02, , , ; HELIOS; ptype T800 }
    terminal 07 { ~02, , , ; HELIOS; ptype T800 }
    terminal 08 { ~03, , , ; HELIOS; ptype T800 }
    terminal 09 { ~03, , , ; HELIOS; ptype T800 }
    terminal 10 { ~03, , , ; HELIOS; ptype T800 }
    terminal 11 { ~04, , , ; HELIOS; ptype T800 }
    terminal 12 { ~04, , , ; HELIOS; ptype T800 }
    terminal 13 { ~04, , , ; HELIOS; ptype T800 }
    terminal IO { ~00; IO; }
}
Algorithm 3.1: Resource map specification for a ternary tree of depth 3. Each processor
must run Helios and must be a T800 transputer.
During the boot phase, the reconfigurable crossbar switches are programmed such that the
required link configuration is installed.
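The link table of the ternary tree follows a simple numbering rule: with the root as node 1, node n has children 3n-1, 3n and 3n+1, and its parent is (n+1)/3 using integer division. The sketch below generates terminal lines from that rule. It is not a Helios tool; the function names are hypothetical and the output format is only loosely modeled on the resource map above.

```c
#include <assert.h>
#include <stdio.h>

#define TREE_NODES 13   /* 1 + 3 + 9 processors, numbered 01..13 */

/* Parent of tree node n (root is node 1, which attaches to terminal 00). */
int tree_parent(int n)
{
    assert(n >= 1 && n <= TREE_NODES);
    return (n == 1) ? 0 : (n + 1) / 3;
}

/* k-th child (k = 0, 1, 2) of node n, or -1 if n is a leaf. */
int tree_child(int n, int k)
{
    assert(n >= 1 && n <= TREE_NODES && k >= 0 && k < 3);
    int c = 3 * n - 1 + k;
    return (c <= TREE_NODES) ? c : -1;
}

/* Print one terminal line per tree node in the style of a resource map. */
void print_tree_terminals(FILE *out)
{
    for (int n = 1; n <= TREE_NODES; n++) {
        fprintf(out, "terminal %02d { ~%02d", n, tree_parent(n));
        for (int k = 0; k < 3; k++) {
            int c = tree_child(n, k);
            if (c > 0)
                fprintf(out, ", ~%02d", c);
            else
                fprintf(out, ", ");
        }
        fprintf(out, "; HELIOS; ptype T800 }\n");
    }
}
```

Calling `print_tree_terminals(stdout)` lists the thirteen tree nodes. One link of every node goes to its parent and, for the four inner nodes, the remaining three links go to the children, so the tree uses exactly the four links each transputer offers.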
3.5 New transputer systems
A wide variety of MIMD systems that communicate via a message passing communication
network exists. Some of these contain processing nodes that provide a performance similar
to that of the transputer, others have more up-to-date nodes with more computational
power. Current transputer systems based on the T4 or T8 series have proven their value in
parallel research, applications and industry. However, compared to a current
state-of-the-art workstation or PC, the transputer has become very slow. In many cases, it
is simply not efficient enough to use a transputer system. In this section, the efforts of
Inmos and Parsytec to boost the performance of a processing node or to build new
communication chips for speeding up the communication network are briefly discussed.
3.5.1 T9000 based systems
For several years, Inmos has been announcing the arrival of a new transputer, the H1 or
T9000. This processor would work in conjunction with a new communication chip, the C104.
According to the specifications of the T9000 [67], it provides a performance of 200 MIPS
and 25 MFLOPS, runs at 50 MHz, and each of its transputer links has a communication
bandwidth of 20 Mb/second. This means that the computational and communication
performance is increased by a factor of 6 to 10 compared with current transputer systems.
Apart from its increased performance, the main differences with previous transputers are:
o 16 Kb cache memory, which allows applications to make better use of fast internal
memory.
o Pipelined, superscalar architecture, which allows several instructions to be executed
simultaneously. The architecture also features an instruction grouper which recognizes
instruction sequences that the processor can execute effectively.
o Better error handling and protection, which allows the code and data of each process
to be protected locally.
o Memory management, which allows all memory accesses to be checked. Furthermore, it can
be used to implement swapping between memory and disk.
o Virtual hardware channels, which multiplex a physical transputer link so that it can be
shared by multiple processes. Messages are divided into packets with a header containing
information about the destination process.
o C104 routing chip, which allows for virtual point-to-point connections between processors
that are not directly interconnected. The C104 uses an extra destination field in the
header of a packet to determine whether it is destined for the local processor or has to
be routed through to some other C104. A C104 contains a 32x32 crossbar, two control links
and a local processor to execute routing tasks.
The C104 uses worm-hole routing to minimize routing delays. In contrast to a store and
forward mechanism, worm-hole routing makes it possible to output a packet while it is
still being input. Upon receiving a packet's header, it is decided to which link the
packet has to be routed through (see Figure 3.17(a)). After setting up the crossbar
switch, the data is directly transmitted over the switch until the end of the packet is
detected, after which the switch is cleared (see Figure 3.17(b)). Note that with this
system, routing becomes transparent for the programmer and, furthermore, the routing
administration no longer has to be executed by the transputer, as this is performed by
the C104.
(a) Determination of output switch. (b) Transmission.
Figure 3.17: Expected and measured times for GAB() on different GCel grids.
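The benefit of worm-hole routing over store and forward can be illustrated with a first-order latency model. The formulas below are a common textbook idealization that ignores switch set-up delays and link contention; they are not taken from the C104 specification, and the function and parameter names are assumptions made for this sketch.

```c
#include <assert.h>

/* Store-and-forward: every hop receives the whole packet before
   forwarding it, so the full transfer time is paid on each link. */
double store_and_forward_time(int hops, double header_bits,
                              double payload_bits, double bits_per_second)
{
    assert(hops >= 1 && bits_per_second > 0.0);
    return hops * (header_bits + payload_bits) / bits_per_second;
}

/* Worm-hole routing: only the header is delayed at each hop; once the
   path is set up, the payload streams through all switches behind it. */
double wormhole_time(int hops, double header_bits,
                     double payload_bits, double bits_per_second)
{
    assert(hops >= 1 && bits_per_second > 0.0);
    return (hops * header_bits + payload_bits) / bits_per_second;
}
```

For a single hop both models give the same time. The difference grows with the number of hops and with the ratio of payload to header size, which is why worm-hole routing pays off most for long messages routed over many switches.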
In [114], it has been discussed what implications new transputer systems based on the
T9000 and C104 communication chip could have on applications like parallel neural network
simulations. The main idea is that, because the performance model splits computation
and communication times, the new overall time to compute an application would be the
sum of the computation time on a T8 transputer divided by the increased computational
power and the communication time divided by the increased communication power. We
were not able to quantitatively validate this expectation because no systems using T9000
transputers are available. Similarly, the C104 chip has not been released until now. There
exist T9000/3 transputers which operate at a lower clock rate (10 MHz) and which have been
tested [3], but because of the late arrival of the T9000, users and manufacturers are
looking at alternatives, like the PowerPC.
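The expectation described above amounts to scaling the two terms of the performance model separately. The sketch below writes this out; the function and parameter names are hypothetical, and the speed-up factors are inputs that would have to be measured or taken from specifications.

```c
#include <assert.h>

/* Predicted run time on a faster system, given the measured computation
   and communication times on the current system and the factors by which
   computation and communication become faster. */
double predicted_time(double t_comp, double t_comm,
                      double speedup_comp, double speedup_comm)
{
    assert(speedup_comp > 0.0 && speedup_comm > 0.0);
    return t_comp / speedup_comp + t_comm / speedup_comm;
}
```

For example, an application that spends 60 time units computing and 40 time units communicating would take 60/10 + 40/5 = 14 time units if computation became ten times and communication five times faster. The example also shows that the slower-improving term quickly dominates the total.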
3.5.2 The PowerXPlorer
The PowerXPlorer is a MIMD system shipped by Parsytec, based on the PowerPC 601
[50]. The system uses transputers and their communication links to build a communication
network. The PowerPC is a state-of-the-art RISC processor developed by IBM, Motorola
and Apple. It is a superscalar processor executing three instructions per clock tick (via
integer, branch and floating point units). It contains a 32 Kb cache and an on-chip memory
management unit. Via its system interface (32-bit address bus, 64-bit data bus and 52
control and information signals), it can exchange data with other devices through shared
memory.
The PowerXPlorer contains up to 64 PowerPC 601 nodes, each equipped with a T805
processor (1 Mb memory) dedicated to communication over its 4 links. Operating at 80 MHz,
each node gives a peak performance of 80 MFLOPS (double precision), and via the
communication network each node has a communication throughput of 80 Mbits/second.
Each pair of PowerPC and T805 communicates via an internal memory bus through shared
memory, and can be considered as a single processing element. The system can be managed
by the Parix operating system, thus allowing fast ports of applications from the GCel to
the PowerXPlorer.
We were able to use the PowerXPlorer at the University of Amsterdam to examine the
performance of parallel neural network simulation programs. This system contains 32
nodes, configured in a 2D grid topology. Users can claim networks of sizes 4x2, 4x4,
6x4 and 8x4 processors. Each node contains 32 Mb of memory. The system architecture
comprises several units containing 2 boards with 2 pairs of T805 and PowerPC each. The
32 processors are contained in 8 units of 4 processing elements configured in a 2x2 grid.
Parsytec is also incorporating this technology in the GC series, building large systems with
tremendous compute power. However, especially in communication-intensive applications
like neural networks, it will become clear that the communication performance should
be increased too. Using a communication network based on C104 routing chips would
facilitate this.
3.6 Programming environments for the transputer
Like any other computer, the transputer can be programmed at different levels of detail.
Whereas most computers use an operating system to manage I/O devices, disks and screens
and to provide a shell within which a user can issue commands to be executed, for many
applications the transputer is used in so-called native mode. In this mode, it is ready
to accept any executable code to be loaded and run, which is supported by a number
of machine language constructs. The software that runs on a transputer and forms an
intermediary between a user and the transputer system can be distinguished at three levels
of detail: 1) low level application code, 2) runtime systems and 3) operating systems.
Software development and compilation can be done on a host system, in which case the
executable code is downloaded from the host onto the transputer system. Alternatively,
software can be developed on the transputer itself, in which case code compilation and
linking are done on the transputer.
3.6.1 Native systems
A transputer runs in native mode when software of the first two levels is concerned. This
means that a programmer has to take care of problems like (global) inter-processor
communication, task scheduling and task distribution. This may be supported via several
software layers providing specific algorithm libraries. This software can provide a
programmer with various software development features, as listed in Table 3.1.
A wide range of programming languages has been developed for the transputer, like ANSI
C, various versions of C with parallel language constructs, Occam, C++, Ada, Fortran, etc.
In general, the parallel language constructs they offer are PAR and select statements and
some CHANNEL data type, which represents a communication medium between tasks. The
PAR construct allows a programmer to start up concurrent processes on one processor
(like fork/exec in Unix).
Using the CHANNEL data structure, point-to-point communication between tasks can be
established. If two communicating tasks run on distinct processors that are not directly
connected, the message to be communicated has to be routed through the transputer
network, as is explained below in Section 3.6.4. The select statement is used to determine
which of a number of communication channels becomes active. In general, this is very useful
for asynchronous applications like a farm construct. In a farm, a master process distributes
the available work over a number of slaves. The work is divided into packets and initially
each slave gets one packet. The moment a slave becomes ready, it sends a message to
the master requesting more work. In order for the master to determine which slave
becomes ready, the select statement can be used. A simple way to program such a farm
is depicted in Algorithm 3.2.
CHANNEL master, slave[NSLAVES];

void master (int id)
{
}
(a) Master.

void slave (int id)
{
    while (1) {
        ReceiveNewPacket();
        HandlePacket();
        send (master, READY);
    }
}
(b) Slave.
Algorithm 3.2: Farm implementation (ParC syntax)

parallel languages   parallel language constructs;
                     compiler, linker, libraries
tools                tools for creating and loading parallel programs;
                     profiling/monitoring tools;
                     multi-tasking/multi-processor debuggers
libraries            vector/matrix operations;
                     routing and communication routines;
                     signal processing, image processing;
                     graphics libraries;
                     string operation libraries
embedded systems     routines for DMA access, interrupt routines
Table 3.1: Native software development features.
3.6.2 Task distribution and execution
Note that in Algorithm 3.2, both the slave process and the master process could run on the
same processor, or several slaves could run on one processor and others on distinct
processors. However, in general it would be best if each task ran on a separate processor,
as in that case all tasks can really be executed in parallel. For the farm problem, at
first sight it does not matter on which processor a slave is running. As all packets have
the same size and all slaves perform the same computation, the time for each slave to
compute one packet is equal, provided that all processors provide the same computational
power. For problems that do not have such a regular structure, it can become important to
map each task explicitly onto a specific processor. There are three ways in which a task
can be loaded onto a processor:
1. Each task is programmed and compiled into a separate piece of code. The code is
explicitly loaded onto a specific processor. The Helios operating system, for example,
supports this way of task distribution.
2. Each program contains the code for all tasks and is loaded onto every processor. It
is decided at runtime which piece of the code has to be executed. On many native
systems, for example, tasks are loaded onto a processor via this mechanism.
3. A small piece of code is loaded onto each processor. On each processor it is decided at
runtime which task has to be executed, after which the separately compiled code of
the task is loaded. The Parix operating system, for example, supports this way of task
distribution.
Note that with operating systems like Helios and Parix, the second method of loading tasks
onto a processor can also be used. In the second and third method, the code running on a
processor decides which task has to be executed. This is done by examining a
data structure that specifies the transputer configuration and the transputer on which the
code is running. For example, using the Parix RootProc_t structure, each processor can
determine the grid size of the GCel and its own position in the grid.
A very common use of such a structure is in master-slave programs, where
the master process must reside on a specific processor and the slave processes on the other
processors:
typedef struct {
    int MyProcID;                    /* own processor number */
    int MyX; int MyY; int MyZ;       /* (x,y,z) position */
    int nProcs;                      /* DimX * DimY * DimZ */
    int DimX; int DimY; int DimZ;    /* (x,y,z) dimension */
} RootProc_t;

if (rootproc->MyProcID == 0)
    master();
else
    slave();
Using an operating system provides a relatively comfortable way of programming, task
distribution, scheduling, routing and debugging. Furthermore, an operating system supports
features similar to those listed in Table 3.1. Many operating systems have been developed
or ported for the transputer, such as CSTools, Express, Idris, Mach, and Chorus. For
the experiments discussed in this thesis, the operating systems Helios and Parix were used.
The way in which tasks are distributed when using Parix is discussed above. Helios
provides another mechanism based on CDL scripts. CDL (Component Distribution Language)
[108] is a Unix shell-like language that is used to specify a so-called taskforce: a
description of a parallel program consisting of multiple tasks (components) which
communicate with each other by message passing. The communication network can be specified
via predefined communication constructs or by explicitly specifying each pair of inter-task
connections. When a correctly specified CDL taskforce is run, the components
are mapped onto the required processors. By annotating a component with the
identification of a processor, the component can be required to run on that specific
processor. Similarly, other requirements, like the memory that must be available per
processor, can be specified in a CDL script.
The following CDL script specifies three components A, B and C, which are connected in a
pipeline. When this script is run, component A runs the code a on processor 0 and sends
its output to component B on processor 1, which in turn sends its output to component C.
component A { code a; puid 00; }
component B { code b; puid 01; }
component C { code c; puid 02; }
A | B | C
Algorithm 3.3: CDL script containing three components.
Conceptually, the Helios philosophy works well. When developing a parallel program
consisting of multiple tasks, it is very helpful to be able to specify a taskforce, and in
particular its communication structure, using CDL scripts. The mapping of tasks onto
processors can also be handled on this level. However, the communication channels used
by each component must conform to the CDL script. How this can be realized
is discussed in the sections below.
3.6.3 Setting up communication channels
Both Helios and Parix provide several levels at which communication channels between
tasks can be set up. In Helios the channels can be set up via a CDL specification (which
implements channels as POSIX file descriptors). Each task program should know the meaning
of each file descriptor, i.e. to which task it is connected and what amount of data must be
communicated over it at a certain moment in execution. Figure 3.18 depicts an example
assignment of file descriptors to channels following a CDL script.
Figure 3.18: Assignment of file descriptors for A <> B.
By using POSIX read and write calls, two connected tasks can communicate with each other.
Note that the POSIX channels are so-called virtual channels: two communicating tasks
could be running on any pair of processors, and their communication path could thus
consist of many physical inter-processor communications. This means that with CDL scripts
it is possible to write parallel software independent of the underlying target hardware,
with the routing of messages performed by Helios.
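Communication over such descriptor-based channels can be sketched with two small helper routines. This is a minimal illustration in POSIX C, not Helios code: the names send_all and recv_all are hypothetical, and on an ordinary Unix system a descriptor pair obtained from pipe() can stand in for the descriptors that a CDL script would assign. The loops are needed because read and write may transfer fewer bytes than requested.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical helper: send exactly len bytes over descriptor fd.
 * Returns 0 on success and -1 on error. */
int send_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* may write fewer bytes than asked */
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Hypothetical helper: receive exactly len bytes from descriptor fd.
 * Returns 0 on success and -1 on error or premature end of stream. */
int recv_all(int fd, void *buf, size_t len)
{
    char *p = buf;

    assert(buf != NULL);
    while (len > 0) {
        ssize_t n = read(fd, p, len);    /* may read fewer bytes than asked */
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```

Both sides must agree on how many bytes are exchanged at each point in the program, which is the requirement stated above that every task knows the meaning of its file descriptors.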
Parix provides a set of communication libraries for setting up virtual communication
topologies like a ring, grid, tree or torus. For example, the routine Make2DGrid returns a
structure containing the four communication channels in the NORTH, EAST, SOUTH and WEST
directions. After determining its position in the topology, each process can access the
corresponding channel by indicating the direction it is interested in. Similarly, processes
can communicate in, for example, a ring by indicating the FORWARD and BACKWARD directions.
As with Helios, this way of using virtual channels allows software to be written
independently of the hardware topology. However, especially for applications that involve a
lot of communication, a severe cost penalty can arise if the hardware topology does not
match that of the virtual communication network. For example, mapping a ternary tree
of depth 3 (40 processors) on a 5x8 grid results in an inefficient organization. Therefore,
for the experiments in this thesis, a communication library was designed and implemented
that explicitly maps a tree onto a tree topology and a grid onto a grid (see Chapter
5). This software layer uses point-to-point communications between processes, where each
process communicates over physical communication links with its neighboring processes.
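The processor count in the tree example can be checked by summing the nodes per level, assuming the root is counted as depth 0 so that a tree of depth 3 has four levels:

```latex
\[
  \sum_{k=0}^{3} 3^{k} \;=\; 1 + 3 + 9 + 27 \;=\; \frac{3^{4} - 1}{2} \;=\; 40 \;=\; 5 \times 8 .
\]
```

The tree thus fills the 5x8 grid exactly, but every inner node of the tree needs links to one parent and three children, which do not in general coincide with its grid neighbors.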
3.6.4 Inter-processor communication and routing
After booting, any distributed operating system like Helios or Parix runs a daemon process
on each processor. This process, which in Helios is called the nucleus, loads and
schedules tasks on a processor and enables (inter-processor) communications between tasks.
Communication over processor links is handled by a router mechanism, which monitors
the incoming links of each processor and determines whether an incoming message is
destined for that processor or has to be routed through to other processors.
In general, routing is performed in two steps. In an initial step, the processor network is
explored and a so-called routing table is created for each processor. In the second step,
while the system is running, the router process uses this table to determine via which
link a message has to be forwarded towards its destination processor.
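The two routing steps can be sketched as follows. This is an illustrative C sketch, not the Helios or Parix router: the names build_routing_table and route, the adjacency array link, and the bound MAXP are assumptions. The first step uses a breadth-first search from the local processor, so that each table entry holds the outgoing link that starts a shortest path to the destination; the second step is a simple table lookup.

```c
#include <assert.h>

#define MAXP  64   /* assumed upper bound on the number of processors */
#define LINKS 4    /* a transputer has four links */

/* link[p][l] is the processor reached from p via link l, or -1 if unused.
 * Step 1: fill table[dst] with the link of processor `self` that starts a
 * shortest path to dst (-1 for self and for unreachable processors). */
void build_routing_table(int link[][LINKS], int nprocs, int self, int table[])
{
    int queue[MAXP];
    int head = 0, tail = 0;

    assert(nprocs > 0 && nprocs <= MAXP);
    assert(self >= 0 && self < nprocs);

    for (int p = 0; p < nprocs; p++)
        table[p] = -1;

    /* Seed the search with the direct neighbours of self. */
    for (int l = 0; l < LINKS; l++) {
        int q = link[self][l];
        if (q >= 0 && q != self && table[q] < 0) {
            table[q] = l;
            queue[tail++] = q;
        }
    }

    /* Breadth-first search: a newly found processor inherits the first
     * link of the processor through which it was discovered. */
    while (head < tail) {
        int p = queue[head++];
        for (int l = 0; l < LINKS; l++) {
            int q = link[p][l];
            if (q >= 0 && q != self && table[q] < 0) {
                table[q] = table[p];
                queue[tail++] = q;
            }
        }
    }
}

/* Step 2: at run time, look up the outgoing link for a message to dst.
 * -1 means the message is for this processor (or dst is unreachable). */
int route(const int table[], int dst)
{
    return table[dst];
}
```

For a chain of four processors 0-1-2-3, the table built on processor 1 sends messages for processor 0 over link 0 and messages for processors 2 and 3 over link 1.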
Whereas an operating system provides an implicit routing mechanism (through virtual
channels), on native systems routers have to be implemented explicitly. For the ParC runtime
system, a general router was designed in our department [90], which supports broadcast
and gather operations like the ones used in parallel neural network simulations. Similar
routers have been designed for other systems, like Tiny for Occam [19]. In general,
software routers incur some overhead because they must implement a general
mechanism, including the routing tables and the buffering of messages. If the communication
patterns of a particular application or range of applications are well known, it is better
to use a communication layer that is tuned for the application and that implements global
communications as efficiently as possible. In Chapter 5, the design and implementation of
such a communication layer is discussed.
4 A Scalable Performance Prediction Model
Outline
A method to determine the suitability of transputer systems for parallel
neural network simulations is introduced in this chapter. By predicting
the performance of such systems, suitability questions, like how fast the neural
network can be run and how efficiently the execution platform is used, can
be answered. Two existing methods for predicting the performance of an
execution platform are discussed: performance benchmarking and performance
modeling. It will be made clear that the first method can only be
used to give rough estimates and that questions regarding speedup and
scalability cannot be answered with it. Examples of the second method will
indicate that if an application is modeled on too fine-grained a level (i.e., on
the level of arithmetic operations), the resulting performance timings are
not precise. A combination of performance modeling and performance
benchmarking on the level of kernel benchmarks will be presented in this
chapter. The new combined method achieves much better prediction
results.
The performance of an execution platform for neural networks is often expressed in MCUPS
or GCUPS (Million or Giga Connection Updates Per Second) and MICPS or GICPS
(Million or Giga Interconnects Per Second). The achieved performances reported in the
literature differ enormously, ranging from a couple of thousand ICPS for workstations to
more than two GCUPS for VLSI hardware [43, 78]. But what does such a performance
scale mean to the practical user? Does it provide any help when deciding which execution
platform would be the best for a specific application? Do MCUPS say anything about
the quality of the implementations? Are the efforts for programming and debugging the
machines mentioned? Or can the performance of another platform for the neural network,
or the performance of the platform for another neural network algorithm, be predicted
based on this information? In fact, none of these questions can be answered by only using
a performance scale like MCUPS. What is needed is a means by which a computer system
can be evaluated for a specific application.
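For reference, an MCUPS figure is simply a count of connection updates divided by the elapsed time. The sketch below assumes a training run in which each of the w connections is updated once per pattern presentation; the function name mcups and this counting convention are assumptions, and reported figures may count updates differently.

```c
#include <assert.h>

/* Hypothetical helper: million connection updates per second for a training
 * run, assuming every one of the w connections is updated once per pattern
 * presentation. */
double mcups(double w, double patterns, double epochs, double seconds)
{
    assert(seconds > 0.0);
    return w * patterns * epochs / seconds / 1e6;
}
```

The number says nothing about how expensive a single update is, which is exactly why the scale cannot answer the questions raised above.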
The evaluation of computer systems for a certain problem can be performed from four
different points of view [54]: (1) how fast can a given problem be solved using a system
(responsiveness), (2) how efficiently is the system exploited when solving the problem
(usage level), (3) how well can the system deal with failures and other unexpected events
(mission ability and dependability), and (4) what support does the system offer to a user or
programmer for realizing the problem (productivity). The first and second points of view
form the main contents of this part of the thesis. It will appear that by answering the
first question, we can also answer the second: by computing the performance of a system
for an application, the efficiency and speedup can be computed as well. Three methods
for performance evaluation will be discussed in this chapter: performance benchmarking,
performance modeling, and a hybrid combination of benchmarking and modeling.
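The link between execution time, speedup and efficiency can be made concrete with the usual definitions, speedup S(P) = t(1)/t(P) and efficiency E(P) = S(P)/P. These definitions are assumed here as the standard ones; the function names are hypothetical.

```c
#include <assert.h>

/* Speedup on P processors: time on one processor over time on P. */
double speedup(double t_1, double t_P)
{
    assert(t_P > 0.0);
    return t_1 / t_P;
}

/* Efficiency: the fraction of the ideal speedup P that is achieved. */
double efficiency(double t_1, double t_P, int P)
{
    assert(P > 0);
    return speedup(t_1, t_P) / (double)P;
}
```

A run that takes 100 time units on one processor and 25 on eight processors has a speedup of 4 and an efficiency of 0.5.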
4.1 Performance Benchmarking
There are two major reasons why benchmarks are important. First, manufacturers can run the
same benchmark on different machines to compare their CPU, memory or communication
performance. Second, benchmarks can be run on one specific machine to quantify its
individual performance for e.g. floating-point operations, memory accesses, numerical
libraries or application areas. To find out how well an execution platform is
suited for a particular application, one could select the benchmark which best matches the
application and consider its results for that specific platform. Benchmarking methods can
be divided into a number of categories [5, 118]: synthetic, kernel, algorithm and
application benchmarks.
4.1.1 Synthetic benchmarks
Synthetic benchmarks are not representative of any real computation but exercise various
basic operations, such as memory latency, library routines, arithmetic operations, etc.
Benchmarks that fall into this category are the well-known Dhrystone [117] and Whetstone
[21] benchmarks. The Dhrystone benchmark contains a collection of statements from
non-numeric, system-type programs, such as user interfaces, operating systems, compilers and
editors. The performance of Dhrystone depends on cache size, but the code fits in the
cache of most modern machines. Furthermore, the implementation of the string functions
influences its performance significantly. The Whetstone benchmark comprises language
statements occurring in numeric application programs. These contain more loops, more
floating-point operations, more numeric library functions, fewer procedure calls and fewer
conditional statements. The benchmark contains a number of small loops with a high
code density, resulting in nearly 100% cache hits, which obviously does not hold for real
applications.
This highlights the major disadvantage of synthetic benchmarks: they do not mirror
real applications, for two reasons. First, they do not really test the memory system,
because in general the code and data sizes are very small compared to real-sized problems.
This means that small synthetic benchmarks can be loaded in fast internal memory
or on-chip cache, whereas real applications cannot. Second, compiler writers tend to tune
their compilers for the typical operations contained in e.g. the
Whetstone or Dhrystone benchmarks, such as numerical library routines and string
operations. As a result, the compiled code for real applications may be less efficient than
the benchmark figures suggest.
4.1.2 Kernel benchmarks
Kernel benchmarks eliminate these problems, as they are more representative of real
applications. These benchmarks consist of program parts that embody the salient features
of compute-, communication- or memory-intensive portions of actual applications. Kernel
benchmarks can be easily ported to and measured on other target platforms, as they are
compact programs compared to large benchmarks or full applications. Examples of kernel
benchmarks are discussed by Berry et al. [5], who mention the Livermore
Fortran Kernels and the NAS kernel benchmark program. The main idea behind kernel
benchmarking is to use the performance figures of the kernel that has similar memory usage
and uses the same library routines and typical numerical operations as the application
whose performance has to be estimated. Typically, the benchmarks can be scaled to reproduce
the memory access behavior imposed by the application. Kernel benchmarks
contain complete routines that are stripped from real applications or numerical libraries,
unlike synthetic benchmarks, which merely contain statements or pieces of such packages.
4.1.3 Algorithm benchmarks
Algorithm benchmarks contain sub-programs that implement familiar algorithms well known
from the literature, as found in image processing, numerical, statistical or other
application areas. Examples are algorithms found in Maple [31], Matlab [66], FFT libraries and
numerical algebra libraries. As with kernel benchmarks, comparing an application with
algorithm benchmarks that match its costs indicates the performance range that can be
achieved. An example of an algorithm benchmark is Linpack [26]. In general, algorithm
benchmarks can be scaled by increasing or decreasing the problem size, e.g. the size of the
matrix on which Linpack operates. Algorithm benchmarks for scientific applications are
characterized by a high number of floating-point operations and loops, and by a high
locality of code compared to the locality of data. The latter means that target
architectures with fast memory access and an instruction cache are favored.
4.1.4 Application benchmarks.
Although kernel and algorithm benchmarks try to match a wide range of application
areas, they are still hindered by the effects of cache size, memory access times, compiler
optimizations and implementations of library routines. A number of authors
[5, 24, 105, 118] have argued that small benchmarks are not sufficient to reflect the
performance of real applications. Rather than quantifying the performance of individual
system components, a wide range of application-oriented benchmarks is required, which give
a cost profile that resembles real applications more closely than kernel or algorithm
benchmarks do. Examples of application benchmarks are the SPEC [24], Perfect, Genesis [46]
and EuroBen [105] benchmarks. Characteristic of these benchmarks is that each was initiated
by a collaboration between research and commercial organizations, with the goal of arriving
at a standardized set of application programs to be used as benchmarks for a wide range of
target platforms.
4.1.5 Suitability of performance benchmarking
Our goal is to arrive at a benchmarking method for predicting the performance range of 
MIMD parallel processor systems for parallel neural network simulations. Two approaches 
can be considered for arriving at such a method. The first is to examine whether existing 
benchmarks like the ones described in the preceding sections can be used to quantify 
the expected performance range of new applications. The other approach is to find out 
whether it is feasible to develop synthetic, kernel, algorithm or application benchmarks 
that are scalable as well as general purpose. The second approach will be discussed in
Section 4.3.
Considering the first approach, it must be concluded that existing benchmarks are
perfectly suited to measuring the performance range of a target machine for the benchmark
involved. Measuring the same benchmark on different platforms can be used to compare
their performance. As manufacturers of new machines are likely to port most
well-known benchmarks in order to present performance characteristics, in most cases the
numbers listed in their specifications can be used for the comparison. However,
they will publish only those numbers that favor the machine, and will probably
use all their skills to boost the performance of the benchmarks. Many factors
determine the performance of a benchmark, and they must all be taken into account
before benchmark results can be used as a performance indication for other applications
like parallel neural network simulations:
1. The hardware platform, characterized by the number of processors, processor
architecture, memory and communication system, and peripherals.
2. The software environment, comprising operating system, programming languages,
compilers and libraries.
3. The size and complexity of the application.
4. The implementation of the application and optimization tricks used.
5. Communication and synchronization requirements.
Consider that the number of operations required for updating a connection depends on
the complexity of the neural network model. For example, comparing MCUPS measured
for a Hopfield network [49] with those measured for a backpropagation network [83] will
show significant performance differences, as updating a connection for the latter network
not only requires more operations, but also uses expensive library functions such as exp().
This is one of the reasons why we state that MCUPS is a meaningless performance scale. It
would be more appropriate to express the performance in Hopfield MCUPS or backprop MCUPS.
Furthermore, if variations of an algorithm exist, these should also be indicated when
stating performance, as in [Rumelhart, momentum, batch update] backprop MCUPS.
To my knowledge there are no performance benchmarks containing neural network codes.
Furthermore, even if such benchmarks existed, they could not cover all the variations in
neural network models that exist. This means that such benchmarks can give only
rough estimates of the performance that can be expected. So if one wants
to predict the performance of a particular application, and to use the performance number
of the benchmark that best matches the application as an indication, a number of criteria
have to be taken into account in order to make this number a reliable estimate. These
criteria imply that the selected benchmarks have to be examined in great detail:
1. Make sure that the benchmark has been measured with the same compiler and
optimization settings that will be used for the application.
2. Make sure that the same programming language and library routines are used.
3. Make sure that the benchmark is programmed in the same programming style in which the
application will be implemented.
4. Take the performance numbers for benchmarks that have the same problem size as
the application.
5. Consider the amount of optimization effort that went into porting the benchmark.
6. Make sure that the benchmark requires the same amount of synchronization and
communication overhead.
4.2 Performance Modeling and neural networks
Another approach to predicting the performance of a parallel neural network implementation
is to analyze its complexity in terms of the required communications, data accesses and
arithmetic operations. Given the costs of these basic operations, quantitative estimates
of the achievable performance can be given. Most performance analyses
reported in the literature either analyze a particular neural network algorithm and target
machine architecture [18, 124, 125] or use some general model characterizing the
architecture and functioning of a whole range of algorithms and platforms [38, 69, 61]. For
the discussion of performance analysis models below, the performance parameters
characterizing a neural network and target platform are listed in Table 4.1.
  $n$            total #neurons                    $t_{la}$       time for local memory access
  $w$            total #connections                $t_{ga}$       time for global memory access
  $P$            total #processors                 $L$            #communication links per processor
  $P_i$          processor #i                      $B_l$          bandwidth per communication link
  $C_i$          neural network part on $P_i$      $t_t$          transfer time for communication
  $S_{ij}$       j-th sub-part of $C_i$            $t_s$          setup time for communication
  $\lambda_a$    #local/#global memory accesses    $\lambda_c$    #local/#global communications

Table 4.1: List of parameters often used in performance analysis models.
Most of these parameters have to be known in order to determine the total costs for
a parallel implementation. The overall cost for a MIMD parallel implementation for a
network specified by NET decomposed over $P$ processors is defined as:

$t(P, NET) = t_{calc}(P, NET) + t_{comm}(P, NET)$

In the following discussions, a further analysis of $t_{comm}$ and $t_{calc}$ is given for
MIMD parallel processor systems.
4.2.1 Communication costs for neural networks.
When decomposing a neural network over a processor network, the goal is to ensure that
each processor has an equal amount of work to do, and that the amount of inter-processor
communication is minimized. Let the network be decomposed into $P$ components
$C_1 \cdots C_P$, and let each component $C_i$ consist of $N_i$ sub-components
$S_{i1} \cdots S_{iN_i}$. It will be evident that finding partitions with equal work loads
is a major problem if the neural network is structured in modules with highly different
computational requirements. However, assuming that the load is well balanced, the
calculation time on $P$ processors can be defined as $t_{calc}(P) = t_{calc}(1)/P$.
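The cost model so far can be written down directly. The sketch below combines the overall cost $t = t_{calc} + t_{comm}$ with the perfect load-balance assumption $t_{calc}(P) = t_{calc}(1)/P$; the function names are hypothetical, and the communication time is passed in as a measured or separately modeled value rather than computed here.

```c
#include <assert.h>

/* Calculation time on P processors, assuming a perfectly balanced load. */
double t_calc(double t_calc_1, int P)
{
    assert(P > 0);
    return t_calc_1 / (double)P;
}

/* Overall cost: calculation time plus communication time on P processors.
 * t_comm_P must be supplied by the caller (measured or modeled). */
double t_total(double t_calc_1, double t_comm_P, int P)
{
    return t_calc(t_calc_1, P) + t_comm_P;
}
```

For a calculation time of 8 units on one processor and a communication time of 0.5 units on four processors, the predicted total is 2.5 units, against an ideal of 2.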
The following discussion presents a model that quantifies the communication requirements
of a given neural network on a MIMD machine architecture, adapted from a
model presented by Ghosh and Wang [38]. Given a partitioning:

$(P, NET) = \{\; \{C_1 = \{S_{11} \cdots S_{1N_1}\}\},$
$\qquad\qquad\;\; \{C_2 = \{S_{21} \cdots S_{2N_2}\}\},$
$\qquad\qquad\;\; \ldots$
$\qquad\qquad\;\; \{C_P = \{S_{P1} \cdots S_{PN_P}\}\} \;\}$
Furthermore, assume that the basic amount of information that is transmitted is
proportional to the size of a sub-component $S_{ki}$. Let $c_{ij}$ denote the number of
sub-components in $C_i$ that have to exchange values with sub-components in $C_j$, and let
$d_{ij}$ denote the number of inter-processor communications (hops) required to exchange
information between processors $i$ and $j$. The total number of hops can then be expressed
as in Equation (4.1), where $c_{ij}$ can be estimated as [expression illegible in the
source] if a random distribution of connections $(C_i, C_j)$ is assumed:

[Equation (4.1): the total number of hops, written as double sums over $i = 1, \ldots, P$ and $j = 1, \ldots, P$; the summands are illegible in the source.]
For MIMD processor systems, each inter-processor communication has to be initiated by
fetching the data to be transmitted from memory, scheduling communicating processes,
determining the output communication medium, etc. The costs required for initiating a
communication are defined as the setup time $t_s$. Depending on the communication medium,
the costs for transferring data are defined as the transfer time $t_t$. The total
communication costs can be estimated from (4.1), where it is assumed that the basic amount
of information that is communicated is determined by the largest sub-component as $B$ bytes:

[Equation (4.2): a double sum over $i = 1, \ldots, P$ and $j = 1, \ldots, P$ involving $\lambda_c$; the remaining terms are illegible in the source.]
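For homogeneous neural networks, it can be assumed that all neurons are equally divided
over the available processors, so all $n_i$ equal $n/P$. Equation (4.1) can then be
rewritten as:

[Equation (4.3): the number of hops as a double sum over $i = 1, \ldots, P$ and $j = 1, \ldots, P$; the summand is illegible in the source.]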
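Note that in this equation each value is communicated individually, which in general is
not done. Instead, all values from a component $C_i$ are sent in one packet, and thus the
communication costs can be expressed as:

[Equation (4.4): $t_{comm}$ as a double sum over $i = 1, \ldots, P$ and $j = 1, \ldots, P$; the summand is illegible in the source.]

If we let all $d_{ij}$ equal the maximal path that has to be traveled in a processor
network, i.e. the diameter $d$ of the processor network, (4.4) can be estimated by the
upper bound:

[Equation (4.5): an upper bound on $t_{comm}$; the right-hand side is illegible in the source.]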
Note that for a given number of processors, the communication time depends linearly on
the number of neurons, whereas for a given number of neurons, the time is proportional to
the square root of the number of processors. This is because the maximal path length on
a grid is estimated as $d = \sqrt{P}$. For a ternary tree, the maximal path length is 2
times the depth of the tree, so that $d$ is proportional to $\log_3 P$.
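The two path-length estimates can be computed as follows. This is a sketch under stated assumptions: the function names are hypothetical, the grid estimate rounds $\sqrt{P}$ up to an integer, and the tree is taken to be the smallest complete ternary tree (root at depth 0) that holds at least P processors.

```c
#include <assert.h>

/* Estimated maximal path length on a grid of P processors: d = sqrt(P),
 * rounded up to the next integer. */
int grid_diameter(int P)
{
    int d = 0;

    assert(P > 0);
    while (d * d < P)
        d++;
    return d;
}

/* Depth of the smallest complete ternary tree (root at depth 0) with at
 * least P nodes. */
int ternary_tree_depth(int P)
{
    int depth = 0;
    int nodes = 1;   /* nodes in the tree so far */
    int level = 1;   /* nodes on the deepest level so far */

    assert(P > 0);
    while (nodes < P) {
        level *= 3;
        nodes += level;
        depth++;
    }
    return depth;
}

/* Maximal path length in the tree: leaf to root and down to another leaf. */
int tree_diameter(int P)
{
    return 2 * ternary_tree_depth(P);
}
```

For 40 processors the tree has depth 3 and a maximal path length of 6, while the grid estimate gives 7.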
As will be explained in this thesis, homogeneous neural networks in general require
all-to-all broadcasts. These networks have a regular architecture of one or more
fully connected layers of neurons, so in general all neurons residing on a processor have
to exchange their values with all neurons residing on the other processors. This means
that all $\lambda_c$ are 1. In [69], it is explained that when fully connected layered
networks are not assumed, but random connections between different network components are
used instead, all-to-all broadcasts are still required. Assume two components $C_i$ and
$C_j$ having equal sizes $z$ and a connectivity $\lambda_c$, i.e. $\lambda_c \cdot z$
neurons in $C_i$ and $C_j$ need to communicate information. The chance that a given node in
$C_i$ has no connections to $C_j$ is $(1 - \lambda_c)^z$. This means that even for
relatively small $z$ (say 100) and small connectivity $\lambda_c$ (say 0.05), the
chance that no communication is required between processors $i$ and $j$ approximates
zero. Consequently, randomly connected neural networks always need
all-to-all communications. The time required for these communications can be bounded
by multiplying Equation (4.5) by the number of processors $P$.
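Filling in the example values makes the argument concrete. For a single node, with $z = 100$ and $\lambda_c = 0.05$:

```latex
\[
  (1 - \lambda_c)^{z} \;=\; 0.95^{100} \;=\; e^{100 \ln 0.95} \;\approx\; e^{-5.13} \;\approx\; 0.0059 .
\]
% If the z nodes of C_i are treated as independent (an assumption made here),
% the chance that none of them has a connection to C_j is
\[
  \left( 0.95^{100} \right)^{100} \;=\; 0.95^{10000} \;\approx\; e^{-513} ,
\]
% which is zero for all practical purposes.
```

So fewer than 1 in 100 nodes lacks a connection to the other component, and the chance that an entire component needs no communication with another is negligible.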
The general model presented here, and the ones described by Ghosh and Wang
[38] and Murre [69], specify the communication requirements for parallel neural networks
in general terms. When applying such a general model to a concrete neural network
simulation, the corresponding parameters have to be estimated, analyzed or measured.
Also, the general model must often be translated into models that match the application
more specifically. For example, Equation (4.5) specifies the maximal time required for a
one-to-all communication. The parameters $B$, $t_s$ and $t_t$ have to be determined and,
furthermore, the actual amount of inter-processor communication of a given
parallel neural network simulation has to be considered in order to give a precise
estimate. In the subsequent chapters several more specific models will be given that
specify the required communications more precisely.
4.2.2 Computation costs for neural networks.
A number of efforts have been made to implement neural network simulations on parallel
hardware and to model their performance. These are all based on a careful examination
and complexity analysis of the resulting simulation programs. In this section, several
examples are given of methods that use such an analysis to model the performance of a
simulation program in terms of the required arithmetic operations, memory accesses and
communication primitives. For the discussion given below, the following parameters are
used, where it is assumed that the time for each arithmetic operation includes the time
needed for memory load and store operations:
  $t_{add}$   addition +           $t_{sub}$   subtraction -
  $t_{mul}$   multiplication *     $t_{div}$   division /
  $t_{sig}$   sigmoid $\frac{1}{1 + \exp(-x)}$

Table 4.2: List of parameters used for arithmetic operations.
4.2.2.1 The Kohonen SOM.
Remember from the introduction that the Kohonen SOM finds the best match between
the neurons in the feature map and each input pattern $x$. The “winning” neuron is the
one with the best match, and the SOM is trained for this pattern by updating the
weights of the winner and its neighboring neurons. The operation of the SOM activation
and training phases is depicted in Algorithm 4.1.
load_data();
initialize_network();
while (err_crit > error && nepochs--) {
    for (all patterns p) {
        c = find_winner(p);              /* activation phase */
        for (all neighbours i of c)      /* training of weights */
            update_weight(i, p);
    }
}
Algorithm 4.1: Algorithm for the Kohonen self-organizing feature map.
Complexity of the SOM activation phase.
Given a pattern $x$ with dimension $N$, the Euclidean distance has to be computed for
all $n$ neurons, following Equation (4.6):

$\| x - w_i \| = \sqrt{\sum_{d=1}^{N} (x_d - w_{id})^2}$    (4.6)

Algorithm 4.2 depicts how the winning neuron is computed:
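int find_winner(int pat)
{
    float *x, *w, dist;
    float min = MAXFLOAT;
    int i, j, winner = 0;

    for (i = 0; i < n; i++) {
        x = data[pat];                 /* input pattern */
        w = weights[i];                /* weight vector of neuron i */
        dist = 0;
        for (j = 0; j < N; j++)        /* squared Euclidean distance */
            dist += (x[j] - w[j]) * (x[j] - w[j]);
        if (dist < min) {
            min = dist;                /* smallest distance so far */
            winner = i;
        }
    }
    return winner;
}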
Algorithm 4.2: The routine find_winner. In [58], finding the winning neuron is described
through a process called relaxation. In this process, all neurons are connected via
inhibiting lateral connections, and each neuron excites itself. Via relaxation, after a
number of iterations, one neuron will become active (the winner), while the others remain
inactive. Such a network is also known as a winner-takes-all network.
Note that for finding the winner, no computation of the square root is necessary. Computing 
the distances in find_winner requires $n \cdot (N \cdot (2 \cdot t_{sub} + t_{mul}))$. So the time for 
computing the activation phase for all p patterns can be estimated as:
$$t_{act} = p \cdot n \cdot N \cdot (2 \cdot t_{sub} + t_{mul}) \qquad (4.7)$$
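As an illustration, the estimate of Equation (4.7) can be evaluated directly once the arithmetic timings are known. The sketch below is only a minimal example: the struct `arith_timings` and the function `som_activation_time` are hypothetical names introduced here, and the timings are assumed to be given in one consistent unit (for example microseconds).

```c
#include <assert.h>

/* Time per arithmetic operation, e.g. in microseconds (measured elsewhere).
   The struct and its field names are illustrative assumptions. */
typedef struct {
    double t_add, t_sub, t_mul, t_div;
} arith_timings;

/* Equation (4.7): p patterns, n neurons, N dimensions per pattern. */
double som_activation_time(int p, int n, int N, const arith_timings *t)
{
    assert(p >= 0 && n >= 0 && N >= 0);
    return (double)p * n * N * (2.0 * t->t_sub + t->t_mul);
}
```

With, say, 10 patterns, 4 neurons and 3 dimensions, the estimate is 120 times the cost of two subtractions and one multiplication.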
Complexity of the SOM training phase.
For a pattern x and winner c, all neurons i lying within the neighborhood of c are updated 
following:
$$w_i \mathrel{+}= \alpha \cdot (x - w_i) \cdot \eta(\sigma, c, i) \qquad (4.8)$$
The weights are updated using the learning rate $\alpha$ and the neighborhood function $\eta(\sigma, c, i)$¹, 
which is determined by a Gaussian function whose strength is based on the distance between 
the winner c and a neuron i and the width $\sigma$ (see Figure 4.1).
Figure 4.1: Neighborhood function $\eta(\sigma, c, i) = \exp(-\ldots)$; in this case $\sigma = 2.5$, the winner 
c = (5,5) and the neurons i are located within the 10x10 Kohonen map.
During training, the learning rate $\alpha$ and neighborhood parameter $\sigma$ are decreased gradually. 
Effectively this means that the width and height of the Gauss function reduce during 
training. Though there exist many ways to perform this, in general for each iteration 
the values of $\alpha$ and $\sigma$ are multiplied by a factor k < 1. Furthermore, the neighborhood 
range, which determines the number of neurons that are updated, is decreased from an 
initial value (often $\sqrt{n}$) to zero. In general this is done by maintaining 
the range for a number of iterations, and then decrementing it. In practice this means 
that the neighborhood administration only has to be recomputed when the 
neighborhood range changes, so that the check whether neurons lie within each 
other's neighborhood boils down to a lookup rather than a computation. This kind of 
programming trick is known as the time/space trade-off. In order to save execution
1 There also exist variations of the algorithm that only use the learning rate.
time, it is often better to compute some results, store them in memory and use them 
during computation instead of computing them each time they are needed. Similarly, the 
results for the Gauss values have to be recomputed only when $\sigma$ changes.
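The time/space trade-off described above can be sketched for the neighborhood check. The example below is a hedged illustration, not the thesis implementation: it assumes a 10x10 map stored in row-major order and a circular neighborhood tested on the squared distance, and the names `set_range`, `is_neighbour`, `MAP_W` and `MAP_H` are hypothetical. The table is rebuilt only when the range changes, so every later check is a plain lookup.

```c
#include <assert.h>
#include <stdlib.h>

#define MAP_W 10   /* assumed map width  */
#define MAP_H 10   /* assumed map height */

/* in_range[dy][dx] is 1 when a neuron at offset (dx, dy) from the winner
   lies within the current neighborhood range. */
static unsigned char in_range[MAP_H][MAP_W];
static int table_range = -1;       /* range for which the table is valid */

/* Recompute the table only when the neighborhood range changes. */
void set_range(int r)
{
    int dx, dy;
    assert(r >= 0);
    if (r == table_range)
        return;                    /* table is still valid: no work */
    for (dy = 0; dy < MAP_H; dy++)
        for (dx = 0; dx < MAP_W; dx++)
            in_range[dy][dx] = (unsigned char)(dx * dx + dy * dy <= r * r);
    table_range = r;
}

/* Lookup instead of computation: is neuron i a neighbour of winner c? */
int is_neighbour(int c, int i)
{
    int dx = abs(c % MAP_W - i % MAP_W);
    int dy = abs(c / MAP_W - i / MAP_W);
    return in_range[dy][dx];
}
```

The same idea applies to the Gauss values: a table indexed by the two offsets would have to be refilled only when the width parameter changes.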
Analyzing the complexity of an algorithm where the computational load is not constant 
requires an estimation of the load. In this case, the load varies with the neighborhood range, 
as only the neurons that lie within this range have to update their weights. Furthermore, 
the load depends on where the winning neuron is located in the Kohonen map. If it is a 
border neuron, the number of neighbors is obviously smaller than if the winner is at the 
center of the Kohonen map. Of course, it is possible to give a worst case for the complexity, 
but for performance prediction this could result in estimations with large deviations.
Wu, Hodges and Wang present a complexity analysis of the Kohonen SOM in [125]. They 
approximate the number of nodes in the neighborhood of the winner in the k-th iteration 
as:
$$n_k = \ldots \cdot (2 \cdot r_k + 1)^2$$
The time to update the weights of all neurons lying in the neighborhood following 
Equation (4.8) would then sum up to $n_k \cdot (2 \cdot t_{mul} + 2 \cdot N \cdot (t_{add} + t_{sub}))$. In their implementation, 
only the learning rate $\alpha$ is used, so the computation of the Gauss values is not performed. 
Furthermore, the problem of computing whether two neurons lie within each other's 
neighborhood is solved by just computing their distance, which for each pattern and each neuron 
i requires the computation of $(c_x - i_x)^2 + (c_y - i_y)^2$ and checking whether this is within the 
square of the neighborhood range². This requires for each pattern $n \cdot (2 \cdot (t_{sub} + t_{mul}) + t_{add})$ 
computations. The total time for the k-th iteration of the training phase would then sum 
up to:
$$t_{train} = p \cdot (n_k \cdot (2 \cdot t_{mul} + 2 \cdot N \cdot (t_{add} + t_{sub})) + n \cdot (2 \cdot (t_{sub} + t_{mul}) + t_{add})) \qquad (4.9)$$
and the total time for the training phase can be determined by integrating this expression 
over k.
4.2.2.2 The backpropagation neural network.
Algorithm 4.3 depicts the general operation of the backpropagation network.
2 In [125], the square root of these values is computed, which is not needed.
load_data();
initialize_network();
while (err_crit > error && nepochs--) {
    for (all patterns p) {
        /* activation phase */
        clamp_input();
        for (l = 1; l < L; l++)
            compute_activations(l);
        /* training phase */
        compute_output_deltas();
        for (l = L-1; l > 0; l--) {
            propagate_deltas_back();
            change_weights();
        }
    }
}
Algorithm 4.3: The backpropagation algorithm (L = number of layers).
Complexity of the backpropagation activation phase.
For each pattern x of dimension N, each of the N input neurons i is activated with the 
value $x_i$. The activations of the $n_l$ neurons in subsequent layers l are formed by the sigmoid 
of their net input $\xi_i$:
$$\xi_i = \sum_{j=0}^{n_{l-1}-1} (a_j \cdot w_{ij})$$
$$a_i = \frac{1}{1 + \exp(-\xi_i)}$$
The total time required for computing the state of activation of the network consisting of 
L layers is:
$$t_{act} = \sum_{l=1}^{L-1} \big(n_l \cdot (n_{l-1} \cdot (t_{add} + t_{mul}) + t_{sig} + t_{sub})\big) = W \cdot (t_{add} + t_{mul}) + (n - n_0) \cdot (t_{sig} + t_{sub}) \qquad (4.10)$$
Complexity of the backpropagation training phase.
During training, each output neuron computes its error as the difference between the 
target and computed output value:
$$e_i = t_i - a_i$$
Furthermore, it computes its so-called delta as the product of the error and the derivative 
with respect to the activation. The derivative of the sigmoid activation function $f'(a_i)$ 
equals $a_i \cdot (1 - a_i)$:
$$\delta_i = e_i \cdot f'(a_i) = e_i \cdot a_i \cdot (1 - a_i)$$
The computation of the deltas of neurons in the output layer L - 1 can be done in time:
$$2 \cdot n_{L-1} \cdot (t_{sub} + t_{mul}) \qquad (4.11)$$
After computing the deltas of neurons in the output layer, each neuron in the previous 
layer computes its error as the inner product of its output weights and the corresponding 
delta values of its output neurons (see Figure 4.2):
$$e_i = \sum_{k=0}^{n_l - 1} (\delta_k \cdot w_{ik})$$
$$\delta_i = e_i \cdot f'(a_i) = e_i \cdot a_i \cdot (1 - a_i)$$
Figure 4.2: Computation of the hidden errors following $e_i = \sum_k \delta_k \cdot w_{ik}$; the corresponding 
deltas are computed as $\delta_i = e_i \cdot f'(a_i)$.
The "backpropagation" of the errors is performed between each pair of layers l and l + 1 
where $l \in [1 \cdots L-1]$. This implies that no computation of errors and deltas is required 
for the input layer. The computation of the error and delta values costs time:
$$\sum_{l=2}^{L-1} (n_{l-1} \cdot ((n_l + 2) \cdot t_{mul} + t_{sub})) \qquad (4.12)$$
Once the delta values are computed for each neuron, the weight changes are computed as 
given by Equation (4.13), after which the weights are updated:
$$\Delta w_{ij}(t+1) = \eta \cdot a_j \cdot \delta_i + \alpha \cdot \Delta w_{ij}(t) \qquad (4.13)$$
$$w_{ij}(t+1) \mathrel{+}= \Delta w_{ij}(t+1)$$
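The delta and weight-change computations described above can be written down compactly. The following is a minimal sketch under stated assumptions, not the simulator's own code: the function names `output_delta`, `hidden_delta` and `bp_update_weight` are hypothetical, the activations are assumed to have been computed already with a sigmoid, and `eta` and `alpha` stand for the learning rate and momentum factor of Equation (4.13).

```c
#include <assert.h>

/* Delta of an output neuron: error times sigmoid derivative a * (1 - a). */
double output_delta(double target, double a)
{
    double e = target - a;
    return e * a * (1.0 - a);
}

/* Delta of a hidden neuron: inner product of its outgoing weights with
   the deltas of the next layer, times the sigmoid derivative.
   w_out[k] is the weight towards neuron k of the next layer. */
double hidden_delta(const double *w_out, const double *delta_next,
                    int n_next, double a)
{
    double e = 0.0;
    int k;
    assert(n_next >= 0);
    for (k = 0; k < n_next; k++)
        e += delta_next[k] * w_out[k];
    return e * a * (1.0 - a);
}

/* Equation (4.13): weight change with learning rate eta and momentum
   alpha, followed by the weight update itself. */
void bp_update_weight(double *w, double *dw, double eta, double alpha,
                      double a_j, double delta_i)
{
    *dw = eta * a_j * delta_i + alpha * (*dw);
    *w += *dw;
}
```

Counting the operations in `hidden_delta` gives one multiplication per outgoing weight plus two multiplications and one subtraction for the derivative, which corresponds to the per-neuron cost used in Equation (4.12).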
The total time required for the training phase comprises the time for computing the errors 
and deltas on the output layer (4.11), the time to compute the errors and deltas for each 
hidden layer (4.12), the time required for computing the weight changes (Equation (4.13)), 
and the time for updating the weights:
$$2 \cdot n_{L-1} \cdot (t_{sub} + t_{mul}) + \sum_{l=2}^{L-1} (n_{l-1} \cdot ((n_l + 2) \cdot t_{mul} + t_{sub})) + W \cdot (2 \cdot t_{add} + 3 \cdot t_{mul}) \qquad (4.14)$$
4.2.3 Pitfalls when using arithmetic timings.
The performance of various computer systems, and of the code produced by several compilers 
with different compiler options, was measured for the arithmetic operations add (+), sub 
(-), mul (*) and div (/). The algorithm to measure the timings used three arrays of 
data a, b, c of varying size (see Algorithm 4.4). Though trivial, this code mirrors the usage 
of arithmetic operations in neural network simulation programs fairly representatively.
float a[datasize], b[datasize], c[datasize];

float time_op()
{
    int i, j;
    clock_t t1, t2;                         /* requires <time.h> */

    t1 = clock();
    for (i = 0; i < niterations; i++)
        for (j = 0; j < datasize; j++)
            a[j] = b[j] op c[j];            /* op is one of +, -, *, / */
    t2 = clock();
    return (float)(t2 - t1) / datasize / niterations;
}
Algorithm 4.4: Algorithm for measuring the arithmetic performance.
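A compilable variant of this measurement loop might look as follows. This is a sketch under assumptions rather than the code used for the thesis measurements: the macro `DEFINE_TIME_OP`, the fixed array size `DATASIZE` and the conversion to microseconds are choices made here for illustration, and the result array is declared `volatile` so that an optimizing compiler cannot remove the stores.

```c
#include <assert.h>
#include <time.h>

#define DATASIZE 1024               /* assumed array size */

static volatile float a[DATASIZE];  /* volatile: keep the stores */
static float b[DATASIZE], c[DATASIZE];

/* Generates one timing routine per operator. Each routine returns the
   average time per operation in microseconds, including loop, load and
   store overhead. */
#define DEFINE_TIME_OP(name, op)                                    \
    double name(int niterations)                                    \
    {                                                               \
        int i, j;                                                   \
        clock_t t1, t2;                                             \
        assert(niterations > 0);                                    \
        for (j = 0; j < DATASIZE; j++) {                            \
            b[j] = (float)(j + 1);                                  \
            c[j] = (float)(j + 2);   /* never zero: safe divisor */ \
        }                                                           \
        t1 = clock();                                               \
        for (i = 0; i < niterations; i++)                           \
            for (j = 0; j < DATASIZE; j++)                          \
                a[j] = b[j] op c[j];                                \
        t2 = clock();                                               \
        return 1e6 * (double)(t2 - t1) / CLOCKS_PER_SEC             \
               / DATASIZE / niterations;                            \
    }

DEFINE_TIME_OP(time_add, +)
DEFINE_TIME_OP(time_sub, -)
DEFINE_TIME_OP(time_mul, *)
DEFINE_TIME_OP(time_div, /)
```

Calling, for example, `time_mul(1000)` returns one estimate of the multiplication time; repeating the call for several array sizes and compiler options shows how much such a figure can vary.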
As explained in Section 4.1.5, there exist numerous factors that determine and influence 
this performance. This can be observed when examining for example Figure 4.3, which 
depicts the time required for the four arithmetic operations on a Sun SPARCstation 10 
model 30 (32 Mb, 36 MHz) with the GNU gcc compiler, using either no compiler options or the 
option -O2. Note that for small data sizes the time for an arithmetic operation is lower 
than that for larger data sizes. This can be accounted for by the use of internal cache 
memory. Furthermore, note that when using just the plain gcc compiler, the times show a 
much smoother plot than when using the -O2 option.
[Figure 4.3: two plots of time against operation (add, sub, mul, div) and datasize. (a) Compiled with gcc. (b) Compiled with gcc -O2.]
Figure 4.3: Time of arithmetic operations add, sub, mul, div, measured on a SPARCstation 10, using 
Algorithm 4.4 compiled with gcc.
Note that the deviations from some average time for an arithmetic operation are large, 
especially in Figure 4.3(b). If the basic timings show such large variations, it can be 
assumed that a performance model that uses them will be very inaccurate. This can be 
further observed when measuring for example the SOM neural network simulation program 
analyzed in the previous section. The time required for activating and training a network of 
n neurons, N dimensions and p patterns is given by Equations (4.7) and (4.9) respectively. 
Plotting this expected computation time against the measured time gives Figure 4.4. 
The time was measured as a full application benchmark. Apparently, the 
expected times are too high, which can be explained by the opportunities the compiler 
finds to optimize the code.
Figure 4.4: Expected and measured computation times for Kohonen SOM. The solid lines 
represent the expected times based on the arithmetic timings. The dotted lines represent the
measured timings. For this figure, the number of neurons is varied and two experiments 
with (n,p) =  (10,20), (5,10) are depicted. Similar results are obtained for other settings of 
these parameters.
What can be concluded from performance modeling on the arithmetic level as described 
in this section is that using the timings for individual arithmetic operations results 
in bad predictions. In the next section, a method is introduced that models the performance 
of an application in terms of function kernels, which gives much better results.
4.3 Combining Benchmarking and Modeling.
The performance prediction method that is introduced in this thesis tries to eliminate 
the problems that occur when modeling an application at too fine a level of detail. By 
defining function kernels which represent the salient features of a simulation program, and 
measuring these for several problem sizes, the expectation is that the corresponding 
computation times can be extrapolated with higher precision towards the expected computation 
times for other problem sizes.
4.3.1 Identification of function kernels.
When considering neural network simulation programs, a hierarchy of function kernels can 
be identified (see Table 4.3): (1) program level, (2) routine level, (3) subroutine level and 
(4) code fragment level. Though there does not exist a heuristic to choose the proper 
level on which function kernels have to be identified, the suitable function kernels can 
be chosen by analyzing the code of a simulation program, as was done for example in 
Sections 4.2.2.1 and 4.2.2.2.
program level         measure a complete program
routine level         measure time for routines
subroutine level      measure time for subroutines
code fragment level   identify parts of a subroutine and measure their time
Table 4.3: Hierarchy of function kernels.
For a given number of function kernels $n_f$, the total computation time can be expressed as 
the sum over the number of times each function is called ($N_i$), times the time it takes to 
compute each function ($t_i$):
$$t_{calc} = \sum_{i=1}^{n_f} N_i \cdot t_i \qquad (4.15)$$
Function kernels on the program level contain the complete program. Typically, these 
kernels are used when measuring the performance for a full application. Therefore measuring 
the performance on this level can be compared to application benchmarking discussed in 
Section 4.1. The program level is used in this thesis to compare the predicted times for a 
given application with the times actually measured for the complete program. In Equation 
(4.15), $n_f = 1$, $N_1 = 1$, and $t_1$ is the measured overall computation time.
Function kernels on the routine level typically measure the time for routines that are 
called from within the main body of a program. Consider for example Algorithm 4.1. This 
program contains 4 function kernels: load_data, initialize_network, find_winner, and 
update_weights. For a large number of patterns or large neural networks, the time for 
loading data and initializing the network can be ignored. The total execution time expressed 
in routine kernels would then sum up to:
$$p \cdot (w \cdot t_{find\_winner} + n_k \cdot N \cdot t_{update\_weight})$$
Using the same settings of compiler, compiler options and target machine as used in Section 
4.2.3, the time for two routine kernels (computing the winner and updating the weights) 
was measured for different sizes of the Kohonen SOM neural network. The results are 
depicted in Figures 4.5(a) and 4.5(b).
[Figure 4.5: two plots of time in microseconds against the number of weights w; legend entries: w*1.25, N=5, N=10, N=15, N=20, N=25. (a) Time for find_winner. (b) Time for update_weights.]
Figure 4.5: Time for function routine kernels
As can be observed in Figure 4.5(a), there exists a linear relationship between the number 
of weights and the time to find the winner, $w \cdot t_{find\_winner}$, with $t_{find\_winner} = 1.25$ microseconds. 
However, for Figure 4.5(b), more information is required, as no such relationship 
seems to exist. To find this information, the function kernel at the routine level has to 
be further subdivided, which can be done by identifying subroutines or program parts that 
represent significant portions of the execution time. Consider the routine update_weights, 
depicted in Algorithm 4.5.
#define X(c,i) (abs((c)%width - (i)%width))     /* x (column) offset between c and i */
#define Y(c,i) (abs((c)/width - (i)/width))     /* y (row) offset, row-major map layout */

void update_weights(int N, int c, int i, int p)
{
    int dx = X(c,i); int dy = Y(c,i);
    float mult = lrate * gauss[dy][dx];         /* looked-up Gauss value */
    float *x, *w;
    int j;

    x = patterns[p];
    w = weights[i];
    for (j = 0; j < N; j++, w++, x++)
        *w += mult * (*x - *w);
}
Algorithm 4.5: Algorithm for updating weights of the SOM.
For each of the $n_k$ neighbors of the winner c, the Gauss value with respect to c has to be 
'looked up', which must be done by computing its x and y offset from c in the Kohonen 
map. Apparently, measuring only the routine kernel update_weights is not sufficient; it 
should be subdivided into code fragment kernels that measure the computation of the x and y 
offsets and the actual computation of the new weights. The time for updating the weights 
can thus be modeled as:
$$t_{update\_weights}(n_k, N) = n_k \cdot (t_{offsets} + N \cdot t_{update})$$
[Figure 4.6: two plots of time in microseconds. (a) Time for computing x and y offsets, against the number of neurons n. (b) Time for updating weights; legend entries: w*.9, N=5, N=10, N=15, N=20, N=25.]
Figure 4.6: Time for code fragment kernels
Figures 4.6(a) and 4.6(b) depict the time measured for respectively computing the x and y 
offsets and updating the weights. From these it can be derived that, using the compiler and 
machine settings as described above, $t_{offsets} = 8.25$ microseconds and $t_{update} = 0.9$ microseconds.
Figure 4.7 depicts the measured time for the program kernel kohonen.c and the expected 
time computed via:
$$t_{calc}(n_k, N) = t_{update\_weights}(n_k, N) + w \cdot t_{find\_winner} = n_k \cdot (t_{offsets} + N \cdot t_{update}) + w \cdot t_{find\_winner}$$
[Figure 4.7: plot of time in microseconds against the number of weights w.]
Figure 4.7: Expected and measured time for the Kohonen SOM; expectations have an 
accuracy within [0 ... 5%].
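The kernel-based model can be evaluated with the measured kernel times quoted in this section (1.25, 8.25 and 0.9 microseconds). The sketch below is illustrative only: the struct `kernel_times` and the function names are hypothetical, and the arguments are passed as doubles so that the number of neighbors may be an average rather than an integer.

```c
#include <assert.h>

/* Measured kernel times in microseconds. */
typedef struct {
    double t_find_winner;   /* per weight, routine kernel         */
    double t_offsets;       /* per neighbor, code fragment kernel */
    double t_update;        /* per weight of a neighbor           */
} kernel_times;

/* Time for updating the weights of n_k neighbors of dimension N. */
double t_update_weights(const kernel_times *k, double n_k, double N)
{
    assert(n_k >= 0.0 && N >= 0.0);
    return n_k * (k->t_offsets + N * k->t_update);
}

/* Expected calculation time for w weights in total. */
double t_calc(const kernel_times *k, double n_k, double N, double w)
{
    return t_update_weights(k, n_k, N) + w * k->t_find_winner;
}
```

With `kernel_times k = {1.25, 8.25, 0.9}`, updating 100 neurons of dimension 10 is modeled as more expensive than updating 40 neurons of dimension 25, although both cases involve 1000 weights; the difference comes entirely from the offset computation.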
Note that in Figure 4.7, it is possible to predict the calculation time with high precision 
using the time for routine kernels and code fragment kernels. Furthermore it is interesting 
to note that using the combination of kernel benchmarking and program modeling, it can 
be deduced what parts of the program are really compute intensive. In this case the time 
to compute the x and y offsets causes the computation time for N = 10 to be larger 
than for N = 25 for the same number of weights w, as in the latter case the number of 
neurons is smaller. Based on this observation the decision could be made to allocate a 
table with 2D-Gaussian coefficients for the complete Kohonen map instead of the $n_k \times n_k$ 
neighborhood, if the available memory resources allow this³. The computation of x and y 
offsets would then be superfluous.
4.3.2 Consequences for parallel programs.
When predicting the execution time of parallel implementations of neural network 
simulation programs, two new problems are introduced: (1) load balancing and (2) communication. 
In the introduction, the first problem was solved by assuming a perfect load balance. This 
implies that when using P processors it can be assumed that the calculation time is 
reduced by a factor P. This assumption can safely be made for decomposition techniques 
that equally divide a network over the available processors. If the size of a neural
3 The implementation of the Kohonen SOM within the PREENS algorithm library uses this feature.
network is relatively large compared to the number of processors, the load 
will be perfectly balanced. The second problem must be solved by analyzing the patterns 
of communication in the parallel simulation program. A general approach to analyzing the 
communication complexity is given in Section 4.2. For any parallel simulation program 
where it is assumed that the load is well balanced, an upper bound on the communication 
costs can be given via Equation (4.5) on page 54. However, in practice, each individual 
implementation has to be analyzed in order to arrive at a precise model of the communication 
costs, instead of such an upper bound.
In the models used in this thesis, a perfect load balance is assumed and the costs for the 
typical patterns of communication are specified in a precise manner. For decomposition 
techniques where the load is not equally balanced, the consequence is 
that the total calculation time can no longer be divided by the number of processors. 
However, if it is possible to find out how large the largest component to be computed on 
a processor is, an upper bound can be given for the calculation time (which is the time to 
compute the largest component).
Another assumption that is made is that the calculation and communication times can 
be summed up in order to arrive at the total execution time. For parallel programs that 
implement an overlap between communication and computation, this no longer holds. 
However, in such a case the model can still be used as an upper bound.
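The two assumptions of this section, a perfect load balance and calculation and communication times that simply add up, can be summarized in a few lines. The sketch below uses hypothetical function names and assumes that all times are expressed in the same unit; the second function gives the upper bound for an unequal decomposition, where the largest component determines the calculation time.

```c
#include <assert.h>

/* Perfect load balance: the calculation time is divided by the number
   of processors P, and the communication time is added (no overlap). */
double t_parallel_balanced(double t_calc, double t_comm, int P)
{
    assert(P > 0);
    return t_calc / P + t_comm;
}

/* Unequal load: the largest component bounds the calculation time.
   t_component[q] is the calculation time of the part on processor q. */
double t_parallel_upper_bound(const double *t_component, int P,
                              double t_comm)
{
    double largest = 0.0;
    int q;
    assert(P > 0);
    for (q = 0; q < P; q++)
        if (t_component[q] > largest)
            largest = t_component[q];
    return largest + t_comm;
}
```

For a total calculation time of 100 units on 4 processors with 5 units of communication, the balanced model predicts 30 units, whereas components of 10, 40, 25 and 25 units give an upper bound of 45 units.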
5 A point-to-point communication layer
Outline
In this chapter the design, implementation and performance of a point-to-point 
communication layer are discussed. The communication layer is 
suited for MIMD processor systems that communicate via message passing. 
It is particularly equipped with means to efficiently implement the 
typical communication paths required for distributed neural network simulations. 
Before the communication layer is introduced, first an overview 
of different decomposition techniques is given, pointing out the typical 
communication requirements for parallel neural network simulations. 
The implementation and communication costs for broadcast and gather 
communications are discussed for all three execution platforms used in 
this thesis.
5.1 Decomposition techniques
The problem of decomposing a given neural network over a parallel processor system with 
given topology has often been addressed in the literature. Chu and Wall [18], Cosnard et 
al. [20], Witbrock and Zagha [124], and various other authors have discussed the 
implementation of backpropagation networks on diverse parallel processor systems. Similar efforts 
have been reported for implementing Kohonen networks by for example Obermayer et al. [74], 
Wu [125] and Ultsch [104]. Several other neural networks like Hopfield networks [22, 4, 36], 
EDANN networks [102], ART networks [120] etc. have also been implemented on parallel 
architectures. In the sequel, the term networks refers to neural networks 
(simulations), and the term processor networks is used when referring to multi-processor 
systems.
Most of these parallel implementations were ad hoc: the PNNS were specifically 
decomposed based on careful examinations of both the neural network architecture and the 
characteristics and configuration of the target platforms involved. In general, this 
way of implementation gives good results. A number of efforts have been made to 
implement tools for automatically mapping a given neural network architecture onto a given 
processor system [18, 101]. Although the mapping problem has been acknowledged to 
be an NP-hard problem, the resulting mappings can give reasonable efficiencies and the 
tools relieve the user of the need to be occupied with 
parallel programming. On the other hand, in most cases the resulting implementations 
are not as efficient as would be possible with dedicated decompositions. Furthermore it 
has appeared that, at least for certain classes of neural networks, implementing them is 
relatively easy to do. In [38, 100, 110], several methods for decomposing neural networks 
on MIMD multi-processor systems are discussed. They can be distinguished in three levels 
of decomposition: job-level decomposition, dataset decomposition and what we call network 
decomposition. The first two techniques are coarse grained decomposition methods which 
are well suited for MIMD parallel computer architectures, as the job sizes are large and 
the amount of inter-processor communications is small. The latter technique is defined as 
the collection of decomposition methods that can be used to divide a given neural network 
over a given processor system. In general, these techniques are more fine grained than the 
former ones and thus require more inter-processor communications.
5.1.1 Job-level decomposition
Job-level decomposition is a coarse grained parallelization method which places complete 
copies of a neural network (the jobs) on different processors. In general, each job is 
initiated with different parameter or architectural settings. The technique is often used by 
researchers looking for the proper initial values for which a neural network is able to perform 
well for a particular application. They just start up a number of copies of the neural 
network program on different machines (workstations) and evaluate their performance based 
on convergence and generalization criteria. In [76], this method has evolved into a tool which
automatically selects a proper architecture and set of learning parameters for the 
backpropagation neural network. Each processor runs a complete copy of the neural network, 
together with an evaluator process which extracts information for generating evaluation 
statistics from the neural network simulation. Based on the evaluation of a neural network 
residing on a specific processor, the decision can be made to quit the evaluation and use 
that processor for evaluating a neural network started with another set of initial values. 
Note that for applications which require very large networks or large sets of data, this 
method cannot be used directly as it is restricted by the amount of memory available per 
processor. On the other hand, by combining this approach with other techniques in which 
each job is decomposed over a number of processors, larger sized networks can be handled. 
In this thesis, job-level decomposition will not be discussed. Instead, the attention will be 
focused on the other two techniques.
5.1.2 Dataset decomposition
Dataset decomposition is a special kind of job-level decomposition. Each job is initiated 
with the same parameters and architecture, so the same neural network is present on every 
processor. The parallelism that is exploited in this technique stems from the dataset. If we 
consider a neural network as operating in two phases, a training and a recall phase, an 
efficient decomposition can be obtained for both by dividing the dataset in a number 
of parts and computing each part in parallel. For the recall phase, the state of activation 
$A_i$ is computed based on the network (weight) state W(t) and the input pattern $x_i$:
$$A_i(t) = \mathrm{recall}(W(t), x_i)$$
The computation of all activations for all p input patterns can be performed via:
$$A_{(0 \ldots p-1)}(t) = \bigcup_{i=0}^{p-1} A_i(t)$$
This can be done in parallel on P processors via Equation (5.1):
$$A_{(0 \ldots p-1)}(t) = \bigcup_{proc=0}^{P-1} \; \bigcup_{i=proc \cdot p/P}^{((proc+1) \cdot p/P)-1} A_{(proc,i)}(t) \qquad (5.1)$$
Similarly for the training phase, the change of a weight variable of a network is computed 
based on its current weight state and a new training pattern i as:
$$\Delta W_i(t) = \mathrm{train}(W(t), x_i)$$
If a neural network can be trained using epoch or batch learning, 
the weight changes for all patterns can be computed by summing up the individual 
weight changes per pattern as:
$$\Delta W_{(0 \ldots p-1)}(t) = \sum_{i=0}^{p-1} \Delta W_i(t)$$
This can be performed in parallel via:
$$\Delta W_{(0 \ldots p-1)}(t) = \sum_{proc=0}^{P-1} \; \sum_{i=proc \cdot p/P}^{((proc+1) \cdot p/P)-1} \Delta W_{(proc,i)}(t) \qquad (5.2)$$
Especially if the datasets are large, dataset decomposition is a highly efficient parallelization 
technique, as during the calculation of the sub-terms in (5.1) and (5.2) no communication is 
necessary.
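The partitioning of the pattern index range used in Equation (5.2) can be illustrated with a small sequential simulation. This is a sketch under assumptions: the per-pattern rule `train_pattern` is a toy stand-in for train(W(t), x_i) and is not taken from the thesis, the processors are simulated by a loop, and all names are hypothetical. The point is only that the partial sums per processor need no communication until they are combined.

```c
#include <assert.h>

#define NWEIGHTS 3

/* Toy stand-in for train(W(t), x_i): weight change for one pattern.
   This rule is purely illustrative. */
static void train_pattern(const double *W, double x, double *dW)
{
    int j;
    for (j = 0; j < NWEIGHTS; j++)
        dW[j] = x - W[j];
}

/* Add the weight changes of patterns first .. last-1 to acc. */
static void accumulate(const double *W, const double *x,
                       int first, int last, double *acc)
{
    double dW[NWEIGHTS];
    int i, j;
    for (i = first; i < last; i++) {
        train_pattern(W, x[i], dW);
        for (j = 0; j < NWEIGHTS; j++)
            acc[j] += dW[j];
    }
}

/* Sequential batch update: sum over all p patterns. */
void batch_sequential(const double *W, const double *x, int p,
                      double *total)
{
    int j;
    for (j = 0; j < NWEIGHTS; j++)
        total[j] = 0.0;
    accumulate(W, x, 0, p, total);
}

/* Dataset decomposition: processor proc handles the patterns
   proc*p/P .. (proc+1)*p/P - 1; partial sums are combined at the end. */
void batch_decomposed(const double *W, const double *x, int p, int P,
                      double *total)
{
    int proc, j;
    assert(P > 0);
    for (j = 0; j < NWEIGHTS; j++)
        total[j] = 0.0;
    for (proc = 0; proc < P; proc++) {
        double partial[NWEIGHTS] = {0.0};
        accumulate(W, x, proc * p / P, (proc + 1) * p / P, partial);
        for (j = 0; j < NWEIGHTS; j++)   /* the only communication step */
            total[j] += partial[j];
    }
}
```

The integer divisions make the ranges cover every pattern exactly once, also when p is not a multiple of P; with p = 10 and P = 3 the three parts contain 3, 3 and 4 patterns.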
5.1.3 Neural network decomposition
Job-level and dataset decomposition result in coarse grained (and thus efficient) 
decompositions of the problem domain. However, they cannot be exploited if the neural networks 
are too large to fit on one processor. Allocating one job per processor is only feasible if the 
processor's resources are sufficient. For jobs representing neural networks which are too 
large to fit on one processor, each job has to be divided into a number of components 
running on several processors. The collection of parallelization methods via which this can be 
performed is called network decomposition. In order to efficiently use a processor network 
for speeding up a specific problem by dividing it over the available processors, two items 
are of major importance: load balance, and synchronization and communication [6, 37, 48]. 
A proper load balance ensures that each processor in the network has an equal amount of 
work to do. If this were not the case, some processors would still be working while 
others are ready to proceed but have no job to do. By dividing the network in components 
with equal work loads, the load can be balanced properly. Considering the architecture of 
a neural network and the load of its individual components, several ways of decomposing it 
into a number of parts exist, each on a certain level of parallelism or grain size. Apart from 
job-level and dataset parallelism as discussed above, two more levels can be distinguished.
The first is on the level of the neurons and their connections. It seems a straightforward 
way to decompose neural networks by considering the network architecture and 
performing a one-to-one mapping of neurons onto processors and connections onto processor links. 
Some processor architectures have been proposed where neural networks are implemented 
in hardware or on special purpose neurocomputers or coprocessors that function as neural 
accelerator boards [41, 43, 79, 98]. Furthermore, fine grained implementations have been 
reported of Kohonen [75] and backpropagation networks [95, 127] on the Connection 
Machine. However, as this thesis is concerned with coarse or medium grained general purpose 
machines where the number of processors is much smaller than the number of neurons in 
the neural network, this level of parallelism will not be discussed in great detail. Rather 
than looking at the fine grained parallelism as observed on the level of neuron operations, 
larger jobs have to be identified like the ones described in the previous sections.
The second kind of parallelism is on the level of clusters of neurons and their connections. 
Decomposing and mapping neural networks on this level of parallelization has been 
discussed in for example [38, 94, 100, 126]. Based on the connectivity and load of the different
clusters, the decision can be made to group them into one component and place this on a 
processor. Considering their architecture and learning mechanisms, four classes of neural 
networks can be distinguished, which are described below. Each has specific features that 
require different decomposition strategies in order to maintain a good load balance while 
reducing synchronization and communication times. The four classes are homogeneous 
networks, layered networks, heterogeneous networks and dynamic networks (see Figure 
5.1).
[Figure 5.1, panels: (a) Homogeneous. (b) Layered.]
Figure 5.1: Neural network architectures. Dynamic networks can have any architecture, 
where one or more nodes "grow" during a training phase.
Homogeneous networks
In case of homogeneous networks such as the Hopfield or Kohonen networks, the network 
can be divided into a number of components via geometric decomposition [100]. Each part 
runs as one separate task on a processor. If several subtasks would run on the same 
processor, unnecessary overhead regarding scheduling and intertask communications 
would be introduced. As will be pointed out in Chapter 7, for certain neural networks 
geometric decomposition results in a bad load balance. This is especially the case when 
different parts of a network have different computational loads. A technique that can be
used to avoid this problem is scattered decomposition [100], which distributes parts of the 
network randomly over the available processors.
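The effect of the two decompositions on the load balance can be made concrete with a small calculation. The sketch below is an illustration under assumptions: it treats the network as a one-dimensional array of neurons with a known load per neuron, and it uses a cyclic (round-robin) mapping as a deterministic stand-in for the random placement of scattered decomposition. All function names are hypothetical.

```c
#include <assert.h>

#define MAX_PROCS 64

/* Geometric decomposition: neuron i of n goes to one of P contiguous
   blocks. */
int owner_geometric(int i, int n, int P)
{
    return (int)((long)i * P / n);
}

/* Cyclic mapping, used here as a deterministic stand-in for the random
   placement of scattered decomposition. */
int owner_cyclic(int i, int P)
{
    return i % P;
}

/* Load of the busiest processor; load[i] is the work of neuron i. */
double max_load(const double *load, int n, int P, int cyclic)
{
    double per_proc[MAX_PROCS] = {0.0};
    double worst = 0.0;
    int i, q;
    assert(P > 0 && P <= MAX_PROCS && n > 0);
    for (i = 0; i < n; i++) {
        q = cyclic ? owner_cyclic(i, P) : owner_geometric(i, n, P);
        per_proc[q] += load[i];
    }
    for (q = 0; q < P; q++)
        if (per_proc[q] > worst)
            worst = per_proc[q];
    return worst;
}
```

For eight neurons with loads 4, 4, 4, 4, 1, 1, 1, 1 on two processors, the block mapping puts 16 units of work on one processor, whereas the cyclic mapping gives each processor 10 units.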
Layered networks
When layered networks like the backpropagation model are used, each layer can be 
subdivided over all processors. Each processor thus has control over a subpart of each layer. 
Note that in this case all-to-all communications are required as every two subsequent layers 
are fully interconnected. In [94], it is discussed that the different layers can be placed on 
a number of processors, using a special case of pipelined parallelism as discussed in [121]. 
In Chapter 7, this technique will be further explained.
Heterogeneous networks
If modular, heterogeneous networks are considered, each module should be placed on one 
processor. This is because in general each module consists of densely interconnected 
clusters of neurons, whereas the connectivity between different modules is low. A general 
method to decompose these networks is the following. If a module is too large to fit on 
one processor, or if two modules are connected, a number of adjacent processors should be 
used where the number of inter-processor communications should be minimized [38].
Dynamic networks
In some cases, neural networks have a dynamic architecture, i.e. they grow during the 
training process. Examples of these networks are the (recurrent) cascade-correlation 
neural networks [33, 32], CALM networks [71], ART networks [13, 12, 14, 15] or the grownet 
algorithm [96]. These networks impose a completely different requirement on the 
decomposition problem, i.e. dynamic load balancing. If at a certain state during training the 
load is well balanced, growing one or more neural network components could result in a 
poorly balanced load. To solve this problem, the load has to be somehow dynamically 
re-balanced during the training phase.
5.2 Synchronization and communication
Synchronization and communication problems are typical for distributed neural network 
simulations. Each component calculates its new state, after which it has to exchange this 
information with the components it is connected to. The communication overheads are 
determined by the communication resources of the target execution platform and the kind, 
size and number of required communications.
5.2.1 Communication networks
For each parallel processor system where data has to be communicated between processors,
some kind of communication network is required. Some architectures provide special, highly
optimized communication routines that can be exploited, as described by Zhang [127] for
the Connection Machine. Other architectures like the DAP or MPP have high speed
communication mechanisms that shift data across rows or columns of the processor arrays.
For shared memory multi-processor systems, data is communicated via a data medium
(bus) that is shared by the different processors. For MIMD processors like transputer
systems or the Intel hypercube, the basic communication primitives are point-to-point
data exchanges between neighboring processors that are connected to each other via some
communication link.
Typically, these systems consist of a set of processing elements (PEs) and a communication
network which connects the processing elements with each other. For transputer systems,
each processing element has 4 communication links via which it can connect to other PEs.
The implementations discussed in this thesis all run on transputer systems configured in a 
grid or tree topology.
5.2.2 Communication requirements
Each of the parallel neural network simulations discussed in this thesis is implemented via
a master process and several slave processes. The master process performs the I/O with the
user-interface and file system. It also sends commands and distributes the data to the slave
processes. Each slave process hosts a part of the neural network, or part of the data. They 
communicate with each other and with the master process. For parallel neural network 
simulations, a master-slave implementation operates in two phases, a calculation phase and 
a synchronization phase. During calculation, each process operates independently of the 
others. After each calculation step, processes enter the synchronization phase during which 
information is communicated. In [110, 111], the typical patterns of communication that are 
needed for parallel neural network implementations on MIMD computers are distinguished. 
Broadcast communications are needed for making information globally available to every 
slave process. In cases where distributed information has to be somehow accumulated, 
gathering communications are needed that collect the information from slave processes to 
the master.
Broadcasting
For master-slave implementations, typical broadcast communications contain data that 
has to be available on every processor, or commands that rule a program’s flow of control. 
Algorithm 5.1 illustrates a characteristic main loop of a slave program:
    while (broadcast_command(&command)) {
        switch (command) {
            /* handle the received command */
        }
    }
Algorithm 5.1: Typical slave main loop.
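To make the control flow of such a slave loop concrete, the following sketch fills in the loop with a stand-in for the broadcast layer. The command codes CMD_STOP, CMD_INIT and CMD_TRAIN, the array-based broadcast_command() stub and the function slave_main() are all hypothetical; the actual commands of the simulators are not listed here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical command codes, chosen for this illustration only. */
enum { CMD_STOP = 0, CMD_INIT = 1, CMD_TRAIN = 2 };

/* Stand-in for the broadcast layer: replays commands from an array
 * instead of receiving them from the master process. */
static const int *pending = NULL;

int broadcast_command(int *command)
{
    assert(pending != NULL);
    *command = *pending++;
    return *command != CMD_STOP;   /* zero ends the main loop */
}

/* Slave main loop; returns the number of training steps executed. */
int slave_main(const int *commands)
{
    int command, steps = 0;

    pending = commands;
    while (broadcast_command(&command)) {
        switch (command) {
        case CMD_INIT:  steps = 0; break;
        case CMD_TRAIN: steps++;   break;
        default:        break;     /* unknown commands are ignored */
        }
    }
    return steps;
}
```

The point of the pattern is that every slave executes the same loop and the master steers all of them by broadcasting one command at a time.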
Depending on the topology of a processor network, various implementations of a broadcast
operation can exist. However, in all cases, broadcasting data is performed from the
master process to the slave processes. The master process typically resides on the root
(host) processor of the processor network. Figures 5.2(a) and 5.2(b) depict how
data is broadcast on a tree and a grid topology.
(a) Broadcasting on a tree (depth=2). (b) Broadcasting on a 5x3 grid.
Figure 5.2: Broadcasting messages on tree and grid topologies. The marked nodes are the 
root processors, which host the master process.
From Figure 5.2(a) it can be deduced that broadcasting data can be done in parallel on
every node at the same depth of the tree. For a depth of 1, the master sends the data
to each of its sons. As each processor node in a transputer tree has at most three links
available to connect to son processors, this requires at most 3 communications. For each
depth greater than 1, the processes residing on processors at this depth can receive the
data from their father processor and send it on to their sons. This can be done in
parallel for each processor. The total time for broadcasting data of size s on a processor
tree of P processors can be estimated via (5.3), where the depth of a tree of P processors
can be estimated by $\log_3 P$.
$$T^{\mathrm{tree\text{-}broadcast}}(P, s) = 3 \cdot \mathrm{depth}(P) \cdot s \cdot t_{\mathrm{comm}} \qquad (5.3)$$
Considering Figure 5.2(b), broadcasting data is done by the master process by sending 
it to its right and down neighbors. Furthermore, each processor in the leftmost column
receives the data from the processor above it and sends it to its right and down neighbors.
Each processor which does not reside in the leftmost column receives the data from its left
neighbor and transmits it on to its right neighbor. For each row, this can be done
in parallel. For a W x H grid, H - 1 communications are required for sending the data
downwards. Furthermore, W - 1 communications are required for sending data from left
to right in each row. The total time for broadcasting data of size s on a processor grid of
P processors can thus be estimated as:
$$T^{\mathrm{grid\text{-}broadcast}}(P, s) = (\mathrm{width}(P) + \mathrm{height}(P) - 2) \cdot s \cdot t_{\mathrm{comm}} \qquad (5.4)$$
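The two broadcast estimates can be evaluated with a few lines of C. The sketch below is only an illustration: the function names are assumptions, and tree_depth() computes the exact depth of a completely filled ternary tree rather than the logarithmic estimate used in the text.

```c
#include <assert.h>

/* Depth of a ternary tree that holds P processors when every level is
 * filled completely (root at depth 0). */
int tree_depth(int P)
{
    int depth = 0;
    int capacity = 1;   /* processors that fit in a tree of this depth */
    int level = 1;      /* processors on the deepest level */

    assert(P >= 1);
    while (capacity < P) {
        level *= 3;
        capacity += level;
        depth++;
    }
    return depth;
}

/* Equation (5.3): s is the message size, t_comm the time per data item. */
double t_tree_broadcast(int P, double s, double t_comm)
{
    return 3.0 * tree_depth(P) * s * t_comm;
}

/* Equation (5.4) for a W x H grid. */
double t_grid_broadcast(int W, int H, double s, double t_comm)
{
    assert(W >= 1 && H >= 1);
    return (W + H - 2) * s * t_comm;
}
```

For the 5x3 grid of Figure 5.2(b), the factor width + height - 2 equals 6, so a message crosses at most six links before it reaches the processor that is farthest from the master.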
Gathering, accumulating and collecting
Gathering information in transputer grid or tree topologies is performed from the leaves of
the processor network up to the root processor. The number of inter-processor communications
required for performing gathering operations is thus the same as given in Equations
(5.3) and (5.4). One important issue with gathering is to minimize the number of
subsequent inter-processor communications. Furthermore, where possible, any required
computations on the data should be performed during the gathering phase, as this means
that the computations can be performed in parallel.
(a) Gathering on a tree. (b) Gathering on a grid.
Figure 5.3: Gathering weight changes on tree and grid topologies.
Considering Equation (5.2), the weight changes not only have to be gathered, but
also have to be summed. This can be done in parallel on each node, as illustrated in
Figure 5.3. Each node receives the weight changes from its neighbors, accumulates these
with its own weight changes and sends the accumulated sum to its father process. Note
that with this technique, not only the required communications are performed in parallel,
but also the needed accumulations. For a neural network containing w connections, the
total time for gathering and accumulating the weight changes can be estimated as below,
where $t_{\mathrm{acc}}$ represents the time for accumulating two weight values.
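$$T^{\mathrm{tree\text{-}gather\text{-}accumulate}}(P, w) = 3 \cdot \mathrm{depth}(P) \cdot w \cdot (t_{\mathrm{comm}} + t_{\mathrm{acc}}) \qquad (5.5)$$
$$T^{\mathrm{grid\text{-}gather\text{-}accumulate}}(P, w) = (\mathrm{width}(P) + \mathrm{height}(P) - 2) \cdot w \cdot (t_{\mathrm{comm}} + t_{\mathrm{acc}}) \qquad (5.6)$$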
For gathering the required information in case of network decomposition, in most cases such
simple gathering strategies as described above do not suffice. For dataset decomposition,
the only communications required are the gathering and accumulation of weight changes.
As their data have identical sizes on each process, the administration involved is rather
simple. For network decomposition, on the other hand, each process in general manages a
different part of the neural network components. Their corresponding data
are therefore not identical and could even have different sizes in case the load is not uniformly
distributed. The latter is mostly the case, as the load can only be distributed equally over
the processors if the size of all neural network components is divisible by the number of
processors. For example, for a backpropagation neural network containing $n_i$ neurons in
one of its layers, the load is only uniformly distributed if $n_i \bmod P$ equals zero.
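The effect of the remainder can be made explicit with a small sketch. It assumes one common convention, namely that the first n mod P processors each host one extra neuron; the function names and this convention are assumptions and need not match the administration used in Chapter 7.

```c
#include <assert.h>

/* Number of neurons of a layer with n neurons hosted by processor p. */
int local_count(int n, int P, int p)
{
    assert(P > 0 && p >= 0 && p < P && n >= 0);
    return n / P + (p < n % P ? 1 : 0);
}

/* Index of the first neuron hosted by processor p. */
int local_offset(int n, int P, int p)
{
    int r;

    assert(P > 0 && p >= 0 && p < P && n >= 0);
    r = n % P;
    return p * (n / P) + (p < r ? p : r);
}

/* The load is uniform exactly when n mod P equals zero. */
int is_uniform(int n, int P)
{
    assert(P > 0);
    return n % P == 0;
}
```

For a layer of 10 neurons on 4 processors this gives the counts 3, 3, 2 and 2, so the messages exchanged during gathering have different sizes on different processors.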
One solution to this problem is to use virtual point-to-point connections, where each
slave process sends its data directly to the master. Most operating systems, like Helios,
Parix or CS-tools, offer this possibility. Other (runtime) systems like ParC require
user-developed routers that perform this task. However, it will appear in the subsequent
chapters that this solution introduces too many subsequent inter-processor communications.
Rather than using virtual point-to-point communications, it is more efficient to
communicate locally. However, solutions using this method require that slave processes
know about the data that they are going to communicate with neighboring slaves. Some
administration of the distribution of data is needed that has to be present on every slave
process. In Chapter 7, such an implementation is discussed.
Based on the typical patterns of communication observed in neural network simulations, a 
set of broadcast and gather operations was designed and implemented, specifically tuned 
for parallel neural network applications running on tree and grid topologies. On the GCel 
and PX only grid topologies are used. This is because no physical tree topology could 
be installed as the processors are arranged in a grid. Furthermore, mapping virtual tree 
topologies on a grid resulted in unpredictable and inefficient communication performance, 
as will be discussed below.
5.3 Communication paths in trees and grids
Three kinds of communication routines are contained in the communication layer: 1)
routines for setting up the communication on a grid or tree topology, 2) routines for broadcasting
information and 3) routines for gathering information. The latter two kinds of routines are
introduced in the next section. The routine query_grid() finds out the dimensions of the
physical grid topology and arranges for each process the communication channels NORTH,
EAST, SOUTH and WEST. The routine query_tree() finds out the physical tree topology and
arranges for each process the communication channels TOP, LEFT, DOWN and RIGHT. Broadcast
routines send messages from a master processor (root) to all other processors. The
master process sends a message to each neighboring slave process. Each subsequent slave
process transmits the message to its neighbors further up the path. Gather routines receive
information from all processors and deliver it to a master processor. Each slave process receives
information from its neighbors further up the path, does some processing on the information
and transmits it in the direction of the master.
5.3.1 Setting up the communication in a grid
As explained in the previous chapter, the way in which communication channels are set up 
when using Helios or Parix differs significantly. However, for the applications used in this 
thesis the following assumptions are made:
1. Each master process runs on processor (0,0) in the W * H grid.
2. Each slave process i runs on processor ((i + 1) mod W, (i + 1) / W) in the grid, where
the division is an integer division. Slave processes are numbered from [0 ... nslaves-1].
3. Each process on the left-most column of the grid is connected to its NORTH, SOUTH
and EAST neighbor.
4. The other processes are connected to their EAST and WEST neighbors.
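Assumptions 1 and 2 define a row-major numbering of the grid in which the master takes the first position. A minimal sketch of the mapping and its inverse is given below; the function names and the constant MASTER_ID are chosen for this illustration only.

```c
#include <assert.h>

#define MASTER_ID (-1)   /* the master is identified by -1 */

/* Grid position of a process; the master (-1) maps onto (0,0). */
void slave_position(int slave_id, int W, int *x, int *y)
{
    int task = slave_id + 1;   /* the master occupies position 0 */

    assert(W > 0 && slave_id >= MASTER_ID);
    *x = task % W;
    *y = task / W;             /* integer division */
}

/* Inverse mapping: identification of the process at position (x, y). */
int position_to_id(int x, int y, int W)
{
    assert(W > 0 && x >= 0 && x < W && y >= 0);
    return y * W + x - 1;
}
```

On a grid of width 5, slave 4 is therefore the first process of the second row, directly below the master.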
After setting up the communication channels, a number of global variables are set via 
which a process can find out how many neighbors it has, whether it has neighbors to its 
east or south side, what communication channels are available, etc. In particular, a process’
identification slave_id is set to either -1, indicating that the process is the master, or a
number i within [0 ... nslaves-1]. Figure 5.4 depicts the communication channels that
are set up, how the master and slave processes are identified via their identification and
how the processes are partitioned on a physical grid.
Figure 5.4: Situation after calling query_grid().
The routine query_grid() gets as parameters the width and height of the taskforce grid
(i.e. the tasks that are assigned to processors in the grid). In its main routine, each
program calls query_grid() to check whether the required taskforce matches the physical
topology, and if so, to set up the communication channels and initialize the variables used
for broadcast, gather and point-to-point communications.
5.3.1.1 Setting up the communication in Helios
The Helios Resource Management Library [23] contains a number of routines that can be
used to investigate the physical processor network. Using the routine RmGetNetwork(), a
data structure representing the status of the transputer network is filled in. In query_grid(),
this structure is examined for the number of processors and the connections between them. For
each processor, its identification can be requested via RmGetProcessorID(). Based on the
id, its position in the grid can be determined and the required identification of its neighbors
can be computed. For each of the directions in which the processor must be connected, it
is checked whether it is connected to the required processor. This is performed via the routine
RmFollowLink(), which returns the identification of the processor that is connected
via a specific link. If the physical configuration matches the topology required by a grid,
the identification of the communication channels (file descriptors) is computed. Otherwise
an error message is returned. The assignment of file descriptors to a process’ specific
communication channels in its NORTH, EAST, SOUTH and WEST directions is done as depicted in
Figure 5.5.
Figure 5.5: Assigned posix file descriptors after calling query_grid().
Note that the master is connected via virtual channels to all slaves. The channels are 
identified via the following file descriptors:
#define from_slave(i) (4+i+i)
#define to_slave(i) (5+i+i)
The SOUTH channels of the master are connected to slave W-1, and each pair of channels is
connected to the file descriptors 0 and 1 of each corresponding slave. Based on its slave_id,
the routine query_grid() can determine which file descriptors must be valid for a certain
task. Using the Helios routine fdstream(), it can be checked whether the corresponding
channel is valid or not. If all required file descriptors are valid and the physical configuration
is correct, this means that the loader of the CDL script has correctly loaded each task on
the corresponding processor and that all communication channels are set up. Subsequently,
query_grid() returns true and an application may be started that uses point-to-point
communications between tasks running on neighboring processors.
5.3.1.2 Setting up the communication in Parix
In order to write portable code, the communication layer has to be independent of the
application that runs on it. Therefore, the Parix implementation of query_grid() has the
same semantics as the Helios implementation described above. As discussed before, one of
the main differences between Parix and Helios is that with the latter system, a user has to
claim a resource map in order to allocate a transputer network, after which the CDL
mechanism must be used to specify and partition a task force. In Parix, this login procedure
is not necessary. Instead, a program is loaded onto a transputer system directly from the
host operating system. The same program is loaded onto each processor and therefore it
has to be decided at runtime which slave_id a task has. As we use a one-to-one mapping
of tasks onto processors, the processor on which a task runs can be determined using the
Parix call GET_ROOT(), which returns a local and global description of the processor
network. If this processor is not part of the W*H grid, the task exits, as depicted in the code
below:
    procID = GET_ROOT()->ProcRoot->MyProcID;
    x = GET_ROOT()->ProcRoot->MyX;
    y = GET_ROOT()->ProcRoot->MyY;
    slave_id = y * W + x - 1;        /* the master at (0,0) gets slave_id -1 */
    if (x >= W || y >= H) exit(0);   /* processor is not part of the W*H grid */
The communication channels are dynamically created via ConnectLink(), which allocates
a channel to a specified processor. Note that when using Helios, it is checked whether
the physical topology is correct and whether the communication channels are all validly
allocated via the CDL mechanism. Using Parix, a previously installed partition of processors
is allocated. If this succeeds, it is guaranteed that the partitions are configured in a
certain way (i.e. grid). Allocating partitions and loading tasks is done via the Parix run
command:
run -ppartition task_code [task_arguments]
The check whether the communication channels are valid is implicitly performed when
using ConnectLink(). If each ConnectLink() succeeds, the corresponding channels are
valid and query_grid() returns true.
5.3.2 Setting up the communication in a tree
As mentioned above, mapping a virtual tree topology onto a physical grid does not make
much sense. Experiments with using Parix within the CAMPP’93 programme [111] showed
that the communication performance suffers significantly from the through-routing of messages.
Furthermore, two other factors determine the suitability of a virtual tree topology.
The first is the mapping algorithm that is used. We have used the standard MakeTree utility
from Parix, which tries to create a proper virtual tree topology on a given physical grid.
The other is the size of the grid onto which the tree is mapped. One would expect that
the larger the grid onto which a tree is mapped, the better the mapping result. However,
consider for example the timings of gather-accumulate-broadcast operations of a virtual
tree of 40 processors mapped onto three different physical grids:
Figure 5.6: Timings of GAB on a 40-ternary virtual tree (time in msecs against nfloats).
As can be expected, the timings differ. But it also appears that the times for a 16x16
grid are considerably larger than for smaller grids, which is contrary to expectation. Based on
these observations, no further experiments with Parix and virtual trees on the GCel were
performed. On the NSC, physical tree configurations can be specified via a resource map.
Similar to the way in which tasks set up the communication channels in a grid configuration
using query_grid(), each task explicitly checks on which processor it is running. Note
that in Helios a programmer has to specify the resource map and CDL script. For tree
topologies, some utilities were designed and implemented to support a programmer with
these tasks. The utilities use a simple configuration algorithm to specify a tree topology.
If this algorithm is used, for each processor (or process), the following hold:
1. Each processor is connected to a father processor via its TOP link; the master processor
has no father.
2. Each processor is father of one to three subtrees, where the number of processors
contained in a subtree N(subtree) > 0 (a processor with no subtrees is a leaf).
3. If a processor has one or more subtrees, it is connected to them via its LEFT, DOWN, or RIGHT
link.
4. The difference between all N(subtree) on a certain depth of the tree never exceeds
1, i.e. the processor tree is balanced.
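One way to obtain subtrees whose sizes differ by at most one is to divide the processors below a node as evenly as possible over its three links. The sketch below shows this for a single node; it is an assumption about how such a configuration algorithm could work, not the utility that was actually used, and the function name and the order LEFT, DOWN, RIGHT of the result are chosen for this illustration.

```c
#include <assert.h>

/* Split the n processors of a (sub)tree into one root plus up to three
 * subtrees. sizes[0..2] receive the subtree sizes for the LEFT, DOWN and
 * RIGHT link; the return value is the number of links in use. */
int split_subtrees(int n, int sizes[3])
{
    int rest, k, used = 0;

    assert(n >= 1);
    rest = n - 1;                          /* processors below the root */
    for (k = 0; k < 3; k++) {
        sizes[k] = rest / 3 + (k < rest % 3 ? 1 : 0);
        if (sizes[k] > 0)
            used++;
    }
    return used;
}
```

Applying the same split recursively to every subtree yields the whole tree; a tree of 40 processors, for example, gets three subtrees of 13 processors each below the master.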
The same algorithm is used to specify a CDL task force description, which implies that a
direct mapping of tasks onto processors is established. When running a task force, each
task must call the routine query_tree(), which finds out whether the processor on which
the task runs is physically connected to the proper processors in the tree. This routine
also uses the tree configuration algorithm, so the same algorithm is used on three levels: 1)
to create a resource map, 2) to create the corresponding CDL task force, 3) to ensure that
each task runs on the appropriate processor. The latter is done, similar to query_grid(), by
using the Helios routines contained in the Helios library RmLib. Furthermore, it is checked
by using fdstream() whether each task has valid file descriptors for its TOP, LEFT, DOWN and
RIGHT links. After calling query_tree(), the communication channels are assigned to posix
file descriptors as depicted in Figure 5.7.
Figure 5.7: Assigned posix file descriptors after calling query_tree().
5.3.3 Communicating in a tree and grid
Once the communication channels are set up correctly, neighboring tasks can communicate
with each other using the channels. In Helios, this is performed via posix read() and write()
calls, whereas when using Parix, this is done via the routines RecvLink() and SendLink().
Each of these routines has three important parameters: the direction to or from which to
communicate, the memory location where the data is located, and the number of bytes the
data occupies. By using the following declarations and macros, the same application code
can run on both Helios and Parix (where l represents the link number [0-3], b represents
the address where the data is located and n represents the number of bytes):
#ifdef parix
extern LinkCB_t *links[4];
#define rec(l,b,n) RecvLink(links[l],b,n)
#define snd(l,b,n) SendLink(links[l],b,n)
#endif
#ifdef helios
extern int links[4];
#define rec(l,b,n) read(links[l],b,n)
#define snd(l,b,n) write(links[l]+1,b,n)   /* write descriptor follows the read descriptor */
#endif
The routines query_grid() and query_tree() set a number of variables on each processor,
via which it can find out with which neighbors it has to communicate. For both
a tree and a grid, these are called has_left, has_top, has_right and has_down. Using
these variables and the macros described above, point-to-point communications can be
used between neighboring processors.
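For a grid, the values of these variables follow from the position of a process and the wiring assumed in Section 5.3.1: only the left-most column has vertical links. The sketch below derives them for a completely filled W x H grid. The struct and the function name are hypothetical, and the real query_grid() obtains this information from the operating system rather than from coordinates alone.

```c
#include <assert.h>

/* Hypothetical container for the four neighbor flags of one process. */
struct grid_flags {
    int has_left, has_top, has_right, has_down;
};

/* Flags of the process at position (x, y) in a full W x H grid in which
 * only the left-most column (x == 0) is connected vertically. */
struct grid_flags grid_neighbors(int x, int y, int W, int H)
{
    struct grid_flags f;

    assert(x >= 0 && x < W && y >= 0 && y < H);
    f.has_left  = (x > 0);
    f.has_right = (x < W - 1);
    f.has_top   = (x == 0 && y > 0);
    f.has_down  = (x == 0 && y < H - 1);
    return f;
}
```

The master at (0,0) thus only has a right and a down neighbor, which is why it never receives anything during a broadcast.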
Unfortunately, it is impossible to write applications using the grid or tree point-to-point
neighboring communication schemes in a transparent way, independent of the underlying
topology. If processes on a grid have to communicate with each other, they have to talk to
their respective NORTH, EAST, SOUTH and WEST neighbors. If the processes lie on a tree, they
have to know whether to talk to their TOP, LEFT, DOWN and RIGHT neighbors. However, for
many applications some typical communication patterns are required. By examining the
patterns of communication, a software layer can be designed on top of the point-to-point
layer that implements these for e.g. a grid or tree topology. For the parallel neural network
simulations described in this thesis, such a software layer was made. The layer contains
communication routines for one-to-all broadcasts and all-to-one gather operations.
5.4 Broadcast and gather routines
Broadcast and gather routines differ in the direction of the flow of information in a processor 
network. Using broadcasts, information is distributed over the processor network from 
one processor to all other processors. Using gathering, information is collected from all
processors to one processor. With broadcasting, for each processor the data to receive from
and to transmit to neighboring processors is the same. This means that the amount and
type of data is equal for all processors. With gathering, in general the data to collect is
distributed over the processor network. This means that depending on the distribution
of the data, the amount and even the type of the data to be communicated may differ
for distinct processors. Depending on which processor a process is running on, and what
neighbors it is connected to, it has to decide via which links to receive or send messages.
Several possibilities to realize this exist:
1. The first is to compile different programs for each of the cases, and make sure to load
the proper program on the corresponding processor. For example, this may involve
that a program master is loaded on the root processor, a program slave_column on
the left-most column, and a program slave_row on each row of a grid.
2. The second is to assign the proper routines to the general communication routines
during the examination of the network topology. For example, this can be realized
for broadcasts by using a function pointer that represents a row_broadcast,
col_broadcast, or root_broadcast.
3. The third option is to determine the receipts and sends dynamically by examining
some global variables set by the topology querying routines query_grid() and
query_tree().
The first option is rejected because it requires a large number of different programs
and a lot of extra administration for task distribution and execution. The second option
could be used, but can only be exploited for general routines like broadcast and
gather-accumulate operations. However, parallel neural network applications decomposed via
network decomposition (see Chapter 7) also require more application-specific communication
schemes. This means that the global variables mentioned above have to be used
anyway. As the performance gain of the second option compared to the third is furthermore
minimal (only a small number of comparisons is saved), option three has been chosen.
5.4.1 Broadcasting
The variables that are examined when broadcasting in a grid are has_left, has_top,
has_right and has_down. Algorithms 5.2(a) and 5.2(b) depict a simplified version (showing
no error detection or reporting) of the code used for broadcasting on grids and trees.
    int broadcast(char *buf, int size)
    {
        /* return value and error handling omitted in this simplified version */
        if (!has_left) {
            if (has_top)
                rec(NORTH, buf, size);
            if (has_down)
                snd(SOUTH, buf, size);
        }
        else
            rec(WEST, buf, size);
        if (has_right)
            snd(EAST, buf, size);
    }
(a) Broadcast on grids.

    int broadcast(char *buf, int size)
    {
        /* return value and error handling omitted in this simplified version */
        if (has_top)
            rec(TOP, buf, size);
        if (has_left)
            snd(LEFT, buf, size);
        if (has_down)
            snd(DOWN, buf, size);
        if (has_right)
            snd(RIGHT, buf, size);
    }
(b) Broadcast on trees.
Algorithm 5.2: Broadcasts on grids and trees.
5.4.2 Gathering
For many applications where the decomposition of a program is of a homogeneous nature,
gathering data becomes relatively easy to accomplish. Examples are the accumulation of
vectors or the computation of a global maximum or minimum value. Both are used in the
applications discussed in this thesis. The algorithm to gather and accumulate a number of
vectors of equal length is depicted in Algorithm 5.3.
    int Lgather(float *dst, float *buf, int n)
    {
        int i, size = n * sizeof(float);

        if (has_right) {
            rec(EAST, (char *)buf, size);
            for (i = 0; i < n; i++)
                dst[i] += buf[i];
        }
        if (!has_left) {
            if (has_down) {
                rec(SOUTH, (char *)buf, size);
                for (i = 0; i < n; i++)
                    dst[i] += buf[i];
            }
            if (has_top)
                snd(NORTH, (char *)dst, size);
        }
        else
            snd(WEST, (char *)dst, size);
    }
(a) GA() on grids.

    int tgather(float *dst, float *buf, int n)
    {
        int i, size = n * sizeof(float);

        if (has_left) {
            rec(LEFT, (char *)buf, size);
            for (i = 0; i < n; i++)
                dst[i] += buf[i];
        }
        if (has_down) {
            rec(DOWN, (char *)buf, size);
            for (i = 0; i < n; i++)
                dst[i] += buf[i];
        }
        if (has_right) {
            rec(RIGHT, (char *)buf, size);
            for (i = 0; i < n; i++)
                dst[i] += buf[i];
        }
        if (has_top)
            snd(TOP, (char *)dst, size);
    }
(b) GA() on trees.
Algorithm 5.3: Gathering and accumulation on grids and trees.
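The data flow of the grid variant can be checked without a parallel machine by replaying it sequentially on an array that holds the vectors of all processes. The sketch below is such a replay and not the parallel routine itself: the function name and the flat array layout are assumptions, and the message passing is replaced by direct additions.

```c
#include <assert.h>

/* v holds W * H vectors of n floats; the vector of the process at (x, y)
 * starts at v[(y * W + x) * n]. On return, the vector of the master at
 * (0,0) contains the element-wise sum of all vectors. */
void simulate_grid_gather(float *v, int W, int H, int n)
{
    int x, y, i;

    assert(v != 0 && W >= 1 && H >= 1 && n >= 0);
    /* Row phase: each process adds what arrives from EAST and sends the
     * result WEST, so every row is summed in its left-most process. */
    for (y = 0; y < H; y++)
        for (x = W - 1; x > 0; x--)
            for (i = 0; i < n; i++)
                v[(y * W + x - 1) * n + i] += v[(y * W + x) * n + i];
    /* Column phase: the left-most column adds what arrives from SOUTH
     * and sends the result NORTH, up to the master. */
    for (y = H - 1; y > 0; y--)
        for (i = 0; i < n; i++)
            v[((y - 1) * W) * n + i] += v[(y * W) * n + i];
}
```

On the real machine the row phase runs concurrently in all rows, which is where the factor width + height - 2 in the timing model comes from.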
5.5 Gather, accumulate, and broadcast
As mentioned before, all-to-all broadcasts are implemented using a subsequent gathering
and broadcasting of information. In Chapter 7, all-to-all broadcasts for non-uniform data
are discussed. In this section it is assumed that the data to be gathered is equal on every
processor and that during the gathering, all data elements have to be summed up. This
is exactly the case when using dataset decomposition, where on each processor the weight
changes are computed, after which they have to be accumulated and made available to
every other processor. The routine that subsequently gathers, accumulates and broadcasts
information is called GAB(). A number of experiments were carried out to quantitatively
validate the times required for performing GAB() on diverse tree and grid topologies.
5.5.1 Setup for the experiments
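The time to perform GAB() for a vector of size s is (respectively for a grid and tree):
$$T^{\mathrm{GAB\text{-}grid}}(P, s) = (\mathrm{width}(P) + \mathrm{height}(P) - 2) \cdot s \cdot (2 \cdot t_{\mathrm{comm}} + t_{\mathrm{acc}}) \qquad (5.7)$$
$$T^{\mathrm{GAB\text{-}tree}}(P, s) = 3 \cdot \mathrm{depth}(P) \cdot s \cdot (2 \cdot t_{\mathrm{comm}} + t_{\mathrm{acc}}) \qquad (5.8)$$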
When considering Equations (5.7) and (5.8), two function kernels required to quantitatively
model the times for GAB() can be identified. The first is the well-known ping_pong
kernel, which measures the time $t_{\mathrm{comm}}$ for transmitting a value between two neighboring
processors over one physical communication link. This benchmark starts a timer, transmits
a message over a link, and subsequently receives the same message and halts the timer.
The communication time is the time elapsed between starting and halting the timer, divided
by two. The second kernel to be benchmarked is the accumulation kernel, which measures the time to
accumulate two vectors. As mentioned in Chapter 4, many factors may determine the
performance of such a kernel. Therefore, the kernel we used for measuring the accumulation
times was contained in a larger piece of code mimicking a typical neural network
simulation program. The next table lists the results for the ping_pong and accumulation
kernels:
Machine   t_acc (μseconds/float)   t_comm (μseconds/float)
NSC       6.00                     3.00
GCEL      4.99                     3.59
PX        0.31                     3.79
Table 5.1: Times for ping-pong and accumulation kernels.
Using these timings for $t_{\mathrm{comm}}$ and $t_{\mathrm{acc}}$, the expected times for Equations (5.7) and (5.8)
can be computed. The next sections present the measured and expected times for GAB(),
measured on different processor topologies and network sizes on the Nijmegen Super Cluster
(NSC), the GCel-512 (GCEL) and the PowerXPlorer (PX). For each of the experiments,
vectors of size [10 ... 90], [100 ... 900], [1000 ... 9000] and [10000 ... 100000] floats were
used in order to examine the model.
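As a worked example of how the expected times follow from the kernel timings, the sketch below evaluates Equations (5.7) and (5.8) with the values of Table 5.1. The struct, the array and the function names are assumptions made for this illustration; the tree variant takes the depth as a parameter instead of deriving it from P.

```c
#include <assert.h>

/* Kernel timings in microseconds per float, copied from Table 5.1. */
struct machine {
    const char *name;
    double t_acc;
    double t_comm;
};

const struct machine machines[3] = {
    { "NSC",  6.00, 3.00 },
    { "GCEL", 4.99, 3.59 },
    { "PX",   0.31, 3.79 }
};

/* Equation (5.7): expected GAB() time in microseconds on a W x H grid. */
double t_gab_grid(int W, int H, double s, const struct machine *m)
{
    assert(W >= 1 && H >= 1 && m != 0);
    return (W + H - 2) * s * (2.0 * m->t_comm + m->t_acc);
}

/* Equation (5.8): expected GAB() time on a ternary tree of given depth. */
double t_gab_tree(int depth, double s, const struct machine *m)
{
    assert(depth >= 0 && m != 0);
    return 3.0 * depth * s * (2.0 * m->t_comm + m->t_acc);
}
```

For an 8x8 NSC grid and a vector of 1000 floats, the model gives 14 · 1000 · (2 · 3.00 + 6.00) = 168000 microseconds, i.e. 168 msecs.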
5.5.2 GAB() on the NSC
One of the main problems with the Helios operating system is that it is not suited for
very large processor systems. Especially if the number of communication channels is large,
Helios often crashes. It has appeared, for example, that implementing all-to-all
communications between processes using virtual communication channels, where each process
receives and sends its information directly from and to all other processes, is not feasible using
Helios (version 1.22). Applications running more than 36 fully connected tasks just died.
Using the local communication harness described in the previous chapter, applications
can be scaled up to the full 64 processors of the NSC. Furthermore, this solution has the
advantage that, especially for GAB(), local communications can be performed in
parallel (communication as well as accumulation).
GAB() on NSC grids
Figure 5.8 depicts the measured and expected times for GAB() on NSC grids.
Figure 5.8: Expected and measured times for GAB() on NSC grids.
The expectations are modeled following Equation (5.7), where for $t_{\mathrm{comm}}$ and $t_{\mathrm{acc}}$ the times
listed in Table 5.1 are filled in. It can be observed that the expected times are modeled
relatively well for large messages (< 6%), whereas for small messages the deviations become
much higher (up to 10%). This behavior is depicted in Figures 5.9(a) to 5.9(c).
(a) Large sizes (1000 ... 100000). (b) Medium sizes (0 ... 1000). (c) Small sizes (0 ... 100).
Figure 5.9: Deviations between measured and expected times (in %), plotted against nweights.
The routine GAB() is used in parallel neural network simulation programs to accumulate
a number of weight vectors. Recall that our goal is to find out the suitability of transputer
systems for a neural network simulation by predicting the performance (and thus the
execution times) for a given simulation on a given machine. This is done by modeling the
calculation and communication times. According to the results presented above, relatively
good predictions on NSC grids can be made for neural networks containing more than
400 weights. For a backpropagation network, this means an architecture of, say, 10 inputs,
20 hidden and 10 output neurons. For a Kohonen network, an architecture of 10 inputs
and 5x8 neurons contains 400 weights. Both are moderately sized networks. For really
small networks, it can be concluded from these results that the predictions will probably
not be very precise. On the other hand, if a relatively large number of communications
is required, transputer systems are only suited for relatively large neural networks. The
suitability of transputer systems for parallel neural network simulations will be discussed
in the subsequent chapters.
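The weight counts quoted above can be checked with a small sketch. It assumes fully connected layers and does not count biases; the function names are illustrative and do not come from the simulation programs.

```c
#include <assert.h>

/* Weights of a fully connected feed-forward network with one hidden layer.
   ASSUMPTION: biases are not counted. */
int backprop_weights(int n_in, int n_hidden, int n_out)
{
    assert(n_in > 0 && n_hidden > 0 && n_out > 0);
    return n_in * n_hidden + n_hidden * n_out;
}

/* Weights of a Kohonen map: every map neuron holds one weight per input. */
int kohonen_weights(int n_in, int rows, int cols)
{
    assert(n_in > 0 && rows > 0 && cols > 0);
    return n_in * rows * cols;
}
```

With these definitions, backprop_weights(10, 20, 10) and kohonen_weights(10, 5, 8) both yield 400.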
GAB() on NSC trees
The expected times for GAB() on a ternary tree are given by Equation (5.8). For each
depth of the tree, 3 subsequent communications and accumulations are required. Note
that for trees that are not completely filled, this can be considered an upper bound on
the number required. For example, in a tree containing 64 processors, the first 40 nodes
form a completely filled tree of depth 3, where 24 of the 27 nodes lying on depth 3 are
connected to a leaf node. This means that instead of 3 subsequent communications and
accumulations between depth 3 and 4, only 1 is executed. In Figure 5.10, the expected and
measured times are depicted for trees of 13, 40 and 64 nodes. Note the difference between
the dashed line above tree64 and the dash-dotted line exp64. The first is the expected
time for 64 nodes following (5.8), whereas the second is the expectation where only one
communication and accumulation is counted between depth 3 and 4.
Figure 5.10: Expected and measured times for GAB() on NSC trees.
Note that the expectations for 64 nodes following (5.8) are indeed too high, whereas exp64
provides an accurate expectation, which is given by:
$$t_{exp64}(s) = T_{GAB\text{-}tree}(40, s) + s \cdot (2 \cdot t_{comm} + t_{acc})$$
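The two expectations for the 64-node tree can be compared in a short sketch. Equation (5.8) is not repeated here, so gab_tree_full() is only an assumed stand-in that charges three communication and accumulation steps per tree level, as described above; all function names are hypothetical.

```c
#include <assert.h>

/* Cost in microseconds of one gather/accumulate/broadcast step for a vector
   of s floats: two transfers plus one accumulation per float. */
double gab_step(double s, double t_comm, double t_acc)
{
    return s * (2.0 * t_comm + t_acc);
}

/* Number of nodes in a completely filled ternary tree of the given depth. */
int full_ternary_nodes(int depth)
{
    int nodes = 0, level = 1;
    assert(depth >= 0);
    for (int d = 0; d <= depth; d++) {
        nodes += level;
        level *= 3;
    }
    return nodes;
}

/* ASSUMPTION: stand-in for Equation (5.8), three steps per tree level. */
double gab_tree_full(int depth, double s, double t_comm, double t_acc)
{
    return 3.0 * depth * gab_step(s, t_comm, t_acc);
}

/* Expectation exp64: the filled 40-node tree (depth 3) plus a single
   extra step between depth 3 and 4. */
double exp64(double s, double t_comm, double t_acc)
{
    return gab_tree_full(3, s, t_comm, t_acc) + gab_step(s, t_comm, t_acc);
}
```

Under this assumption, exp64() is always lower than the expectation for a completely filled tree of depth 4, which matches the gap between the two curves described in the text.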
The next figures depict the deviations for large, medium and small message sizes.
(a) Large sizes (1000 ... 100000). (b) Medium sizes (0 ... 1000). (c) Small sizes (0 ... 100).
Figure 5.11: Deviations between measured and expected times (in %).
This shows that, similar to the results for NSC grids, the deviations for relatively small
messages can be as large as 10%. There can be several reasons for the deviations, all of which
are caused by the simplicity of the model. For example, issues like the setup time, the
amount of internal versus external memory that is used, the number of crossbar switches
that are traversed during communication, or synchronization failures (some processes
are ready while others are still computing) are not captured in the model. The latter issue
can be illustrated as follows. Assume some process on a processor is gathering information
from its neighbors (i.e. its sons in a tree, or its EAST or DOWN neighbors in a grid). Following
the model described here, it is assumed for each processor that at a specific time t all data
to be gathered is available and can be sent to its neighbor. However, if for some reason
the data is available at a later moment t + δ, it can occur that the actual time after which
gathering has finished is higher than expected. Another effect may cause the measured
time to be smaller than the expected time. Consider for example a node a as depicted in
Figure 5.12.
Figure 5.12: Situation in which the expected times are lower than the measured times.
In this situation, the model predicts the time for GAB() as 6 · s · (2 · t_comm + t_acc), which
is modeled more precisely as exp5 = 4 · s · (2 · t_comm + t_acc). However, when considering
Algorithm 5.3(b), node a first gathers the data from its LEFT and DOWN sons, after gathering
it from its RIGHT son. In the situation of Figure 5.12, this means that even exp5 is not good
enough: the expected time should be 3 · s · (2 · t_comm + t_acc), because during the gathering
of data from processors b and c, processor d has already gathered its data from e. On
the other hand, the expected time following our model will still give an upper bound on
the execution time, which can be used to assure that at most a certain maximum time is
required for GAB().
5.5.3 GAB() on the GCel-512
The same set of experiments was carried out on the GCel machine, which enabled the
validation of the model for significantly larger grid topologies. The program to measure
GAB() was ported to Parix, and t_comm and t_acc were measured as 3.59 and 4.99 µseconds
per float respectively. Note that a node on the NSC is a T800 transputer running at 25
MHz, and a node on the GCel is a T805 running at 30 MHz. Surprisingly, the naive
expectation that the accumulation time would thus be a factor 25/30 lower holds:
6.00 · 25/30 = 5.00 µseconds per float, against 4.99 measured. The
difference in communication times is not caused by a difference in link speed, but may be
caused by a difference in the system-level implementation of communication, or by
differences between the crossbar switches used. Figures 5.13(a) and 5.13(b) depict
the measured and expected times for GAB() on GCel grids of topology 4x4 to 16x32.
(a) For 16x16 and 16x32 grids. (b) For 4x4, 8x4 and 8x16 grids.
Figure 5.13: Expected and measured times for GAB() on different GCel grids.
Again, it appears that the model is able to predict the measured times accurately. This
can also be observed by considering Figures 5.14(a), 5.14(b) and 5.14(c), which depict the
deviations for large, medium and small message sizes.
(c) Small sizes (0 ... 100).
Figure 5.14: Deviations between measured and expected times (in %), plotted against vector size.
For relatively large vectors, the deviations are within 4.3%. Notably, for
small messages the deviations are not very high either, in contrast to the situation on the
NSC.
5.5.4 GAB() on the PowerXPlorer
A node on the PowerXPlorer (PX) is a PowerPC with a peak performance of 80 MFLOPS.
The T805 offers 4.3 MFLOPS, so, again naively, we may expect the accumulation time
to be 4.3 · 4.99 / 80 ≈ 0.27 µseconds/float. The measured t_acc amounts to 0.31 µseconds/float,
which is surprisingly close to the expected time. The communication time t_comm is 3.79 µseconds/float.
This implies that the ping-pong times for the PX are comparable to those of the GCel.
Consider that the underlying communication network is implemented through a transputer
system, and that for each inter-processor communication an extra data transfer is
required through the shared memory of the PowerPC and the communication transputer. At
first one would expect the communication performance to suffer from
this overhead. However, the implementation of shared memory transfers is very fast and
furthermore, each transputer can be fully dedicated to communication. Therefore,
each transputer is equipped with an efficient communication micro kernel, which incurs no
scheduling or calculation overhead. The result is that one node provides high
computational power and a communication performance comparable to that of the GCel
and NSC. In Figures 5.15(a) and 5.15(b) the expected and measured times for GAB() on
2x2, 4x2, 4x4 and 8x4 grids are depicted.
(a) For 2x2 and 4x2 grids. (b) For 4x4 and 8x4 grids.
Figure 5.15: Expected and measured times (in milliseconds, plotted against nweights) for GAB() on different PX grids.
The deviations are depicted in Figures 5.16(a), 5.16(b) and 5.16(c).
(a) Large sizes (1000 ... 100000). (b) Medium sizes (0 ... 1000). (c) Small sizes (0 ... 100).
Figure 5.16: Deviations between measured and expected times (in %).
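The naive estimates used in Sections 5.5.3 and 5.5.4 both scale a measured accumulation time by a ratio of processor speeds. A minimal sketch of that arithmetic follows; the function name is illustrative, and the speeds may be clock rates or peak MFLOPS as long as both arguments use the same unit.

```c
#include <assert.h>

/* Naive estimate: scale a measured per-float time by the ratio of the
   reference speed to the target speed (clock rate or peak MFLOPS). */
double scale_time(double t_ref, double speed_ref, double speed_target)
{
    assert(speed_ref > 0.0 && speed_target > 0.0);
    return t_ref * speed_ref / speed_target;
}
```

scale_time(6.00, 25.0, 30.0) reproduces the GCel estimate from the NSC timing and clock rates, and scale_time(4.99, 4.3, 80.0) reproduces the PX estimate from the GCel timing and the MFLOPS ratings.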
5.6 Conclusions
Broadcast and gather communication routines have been introduced in this chapter for
tree and grid topologies. Following the communication paths that messages take when using
these routines, a model was defined that predicts the communication times for small and
large messages. For all three systems (PX, GCEL and NSC) and for NSC trees, the model
was validated quantitatively. This was done by measuring the communication times for
various sizes of the processor network and a large range of neural network sizes. The
communication model predicted the measured timings with an average deviation of about 5%;
all predictions were accurate to within ±10%.
With these communication routines, a foundation is laid for implementing parallel applications.
In the next chapter, the communication layer defined here will be used to
implement two neural network paradigms, the backpropagation and the Kohonen SOM neural
network.
6 Dataset Decomposition
Outline
Using the combined performance prediction model introduced in the previous
chapters, the performance, speedup, efficiency and scalability of a
large number of backpropagation and Kohonen neural networks are determined.
Kernel benchmarks are identified and measured on the Nijmegen
Super Cluster (NSC), and on the GCel-512 (GCel) and PowerXPlorer
(PX) located at the University of Amsterdam. The predicted performance
measures are subsequently compared to full program kernels, i.e.,
actual parallel implementations of the two neural networks. It will appear
that the combined method can be used to predict the performance for
any neural network size on any size of processor network rather accurately,
based on kernel benchmarks measured on one processor, and communication
kernels measured between only two processors.
6.1 A general dataset decomposition algorithm
When using dataset decomposition for exploiting parallelism in neural network simulations,
a set of patterns is equally distributed among a number of identical processes. After
this distribution, all processes compute the results for their local patterns in parallel.
Subsequently, the results of all processes are gathered and broadcast. This series of
actions may be repeated iteratively. Note that, as described in the previous sections,
a subsequent gather and broadcast communication is performed to emulate an all-to-all
broadcast. A general algorithm for dataset decomposition is depicted below:
In case of neural networks, two phases can be identified in which dataset decomposition can
be exploited: a recall phase and a training phase. In the recall phase, the state of
activation of the neural network is computed for each local pattern, after which all computed
activations are gathered and broadcast. In the training phase, the computed network
status is used for each pattern to compute the change of neural network variables (e.g. weights,
biases, learning rate). After all network changes are computed, they are gathered, accumulated,
broadcast and used to update each copy of the neural network.
load_data();
initialize_network();
distribute_patterns();
while (!ready) {
    for (p = 0; p < nlocal_patterns; p++) {
        compute_pattern(p);
    }
    /* gather and broadcast the results of all processes */
}
Algorithm 6.1: General algorithm for dataset decomposition.
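(a) Recall phase.
while (!ready) {
    for (p = 0; p < nlocal_patterns; p++) {
        compute_net_status(p);
    }
    gather_net_status();
    broadcast_net_status();
}
(b) Training phase.
while (!ready) {
    for (p = 0; p < nlocal_patterns; p++) {
        compute_net_status(p);
        compute_net_changes(p);
    }
    GAB_net_changes();
    update_net_changes();
}
Algorithm 6.2: General algorithms for recall and training phase.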
This code runs on each processor in the transputer network. For each neural network,
the computation of its network status, the computation of its network changes and the
administration of its changes are different. Furthermore, these computations are independent of the underlying
communication mechanism. The gathering, accumulation and broadcasting of the network
changes are very similar for different neural network models, but do depend on the communication
primitives and the underlying processor topology used. Therefore it will appear
in the next sections that the difference in the performance model between dataset decomposed
backpropagation and Kohonen networks mainly shows in the calculation model. This
is in contrast to other decomposition methods (e.g. network decomposition), where the
patterns of communication differ significantly between different neural networks.
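The training scheme of this section can be emulated sequentially to see that the accumulated result does not depend on the number of processes. The sketch below is an illustration only: the per-pattern kernel is a hypothetical placeholder, the processes run one after another in a loop, and the broadcast is implicit in the single return value.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROCS 64

/* Hypothetical per-pattern kernel: the "network change" contributed by one
   pattern. Here it is simply the pattern value itself. */
double compute_pattern_change(double pattern)
{
    return pattern;
}

/* Emulates one epoch of dataset decomposition on nprocs processes,
   executed sequentially. Returns the accumulated change that every
   process would hold after gather, accumulate and broadcast. */
double epoch_dataset_decomposition(const double *patterns, size_t npatterns,
                                   size_t nprocs)
{
    double local[MAX_PROCS];
    double total = 0.0;
    size_t nlocal;

    assert(patterns != NULL);
    assert(nprocs > 0 && nprocs <= MAX_PROCS);
    assert(npatterns % nprocs == 0);   /* perfect load balance */
    nlocal = npatterns / nprocs;       /* nlocal_patterns = p / P */

    /* Each process computes the changes for its own block of patterns. */
    for (size_t proc = 0; proc < nprocs; proc++) {
        local[proc] = 0.0;
        for (size_t p = 0; p < nlocal; p++)
            local[proc] += compute_pattern_change(patterns[proc * nlocal + p]);
    }

    /* Gather and accumulate the local results. */
    for (size_t proc = 0; proc < nprocs; proc++)
        total += local[proc];
    return total;
}
```

Calling the function on the same pattern set with 1, 2 or 4 processes gives the same total, which is the property that allows the sequential kernels to be reused unchanged in the parallel program.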
In Equation (2.3), the overall calculation and communication time for a parallel application
is expressed as:
To find out the calculation time for a dataset decomposed neural network, the N_i(n, w, p)
and corresponding times t_i have to be found. This has to be done by carefully examining
the code of the simulation program and by identifying and measuring a properly selected
set of function kernels. How this can be done is described in Chapter 4, Section 4.3.
To find out the overall communication time, the number of required communications for a
given processor network with P processors and a given neural network application (n, w, p)
has to be determined. Similar to modeling the calculation time, a set of communication
kernels has to be identified to determine the number of communications C(P, n, w, p).
This amount can be expressed as the sum of the required communications for a number of
communication kernels:
The function and communication kernels on the routine level in Algorithms 6.2(a) and
6.2(b) are listed below:
function kernels          communication kernels
compute_net_status()      gather_net_status()
compute_net_changes()     broadcast_net_status()
update_net_changes()      GAB_net_changes()
Table 6.1: Function and communication kernels for dataset decomposition.
If all patterns are equally divided over the available processors, the number of patterns
each processor has to compute is nlocal_patterns = p/P. Using this perfect load balance
property, the total calculation time for training one epoch can be expressed as:
$$T_{comm}(P, n, w, p) = C(P, n, w, p) \cdot t_{comm} \qquad (6.1)$$
where the indices cns, etcetera stand for compute_net_status() etcetera. The communication
time is the time required for gathering, accumulating and broadcasting the network
change information.
The next sections describe the implementation, performance model, and predicted and
measured performance for the training phase of dataset decomposed backpropagation and
Kohonen neural networks with several architectures. The recall phase is not discussed,
because (see Algorithm 6.2(a)) this phase also incorporates N_cns(n, w) and t_cns. In Sections
6.5 and 6.6 the simulations are discussed with respect to the achieved speedup, scalability
and efficiency.
6.2 Backpropagation dataset decomposition
When implementing a backpropagation network using dataset decomposition, multiple
copies of a sequential algorithm are in fact used to compute the local network status and
network changes. Therefore, a similar analysis of the required function kernels can be
applied as described for the sequential Algorithm 4.3. Note that in that algorithm, the
weight changes are applied directly instead of being gathered, accumulated, and broadcast
before the updating takes place. Algorithm 6.3 depicts the dataset decomposition version of
Algorithm 4.3.
for (epoch = 0; error > errcrit && epoch < nepochs; epoch++) {
    error = 0.0;
    for (pattern = 0; pattern < npatterns; pattern++) {
        ComputeActivations(pattern);    /* compute_net_status()  */
        ComputeErrors();                /* compute_net_changes() */
        ComputeWeds();                  /* ...                   */
        error += ComputeError();        /* ...                   */
    }
    GABWeightsAndBiases();              /* GAB_net_changes()     */
    error /= npatterns;                 /* ...                   */
    ChangeWeightsAndBiases();           /* update_net_changes()  */
}
Algorithm 6.3: Algorithm for backprop dataset decomposition.
When analyzing this algorithm, the following function and communication kernels can be 
identified on the routine level:
kernel                      time required
ComputeActivations()        t_act
ComputeErrors()             t_errs
ComputeError()              t_err
ComputeWeds()               t_wed
ChangeWeightsAndBiases()    t_chg
GABWeightsAndBiases()       T_GAB(P, nweights + nbiases)
Table 6.2: Function and communication kernels for backpropagation neural networks
decomposed via dataset decomposition.
Note that the first four routines are computed for each pattern, whereas the gathering and
updating of the weight and bias changes only happen once per epoch. Let T_pat and T_epo
represent the time required per pattern and per epoch:
$$T_{pat}(n, w) = t_{act}(n, w) + t_{errs}(n, w) + t_{err}(n, w) + t_{wed}(n, w)$$
$$T_{epo}(P, n, w) = t_{chg}(n, w) + T_{GAB}(P, n + w)$$
The total time for computing p patterns on P processors would then amount to:
$$T_{backprop}(P, n, w, p) = \frac{p}{P} \cdot T_{pat}(n, w) + T_{epo}(P, n, w)$$
which, for a large number of patterns, can be estimated via two constants t_pat and t_chg
as:
$$T_{backprop}(P, n, w, p) = w \cdot \left(\frac{p}{P} \cdot t_{pat} + t_{chg}\right) + T_{GAB}(P, n + w)$$
During this time, on each of the P processors, for each of the p/P patterns per processor,
w weights are computed, so the performance can be expressed in MCUPS as:
$$\mathcal{P}_{backprop}(P, n, w, p) = \frac{p \cdot w}{T_{backprop}(P, n, w, p)} \qquad (6.2)$$
6.2.1 Measurements for function kernels
For each of the machines on which the experiments were carried out, the same programs
were used. Because the routines used in the underlying communication layer have the same
syntax and semantics for both Helios and Parix, porting the code to any of the NSC, GCel
or PX machines introduced no difficulties. The results for the measured times t_pat and t_chg
are depicted in Figures 6.1(a), 6.1(b) and 6.1(c).
(a) Kernels on NSC. (b) Kernels on GCel. (c) Kernels on PX.
Figure 6.1: Measured and fitted times for function kernels t_pat and t_chg in milliseconds.
Note that the measured data points fit the plotted lines. This means that there exists a
linear relationship between the number of weights and t_pat and t_chg. From Figure 6.1 and
Equation (6.2), the following expressions can be derived:
machine   T_backprop(P, n, w, p) = w · (p/P · t_pat + t_chg) + T_GAB(P, n + w)
NSC       w · (p/P · 15.35 + 10.80) + T_GAB(P, n + w) µseconds
GCel      w · (p/P · 16.66 + 11.25) + T_GAB(P, n + w) µseconds
PX        w · (p/P · 1.012 + 0.0787) + T_GAB(P, n + w) µseconds
Table 6.3: Expected time for backprop dataset decomposition.
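The expressions of Table 6.3 and Equation (6.2) translate directly into code. The sketch below is a hedged illustration: the struct and function names are invented here, and the expected GAB() time is passed in as a parameter because it depends on the communication model of Chapter 5.

```c
#include <assert.h>

/* Fitted kernel constants as in Table 6.3, in microseconds per weight. */
typedef struct {
    double t_pat;   /* per pattern and weight */
    double t_chg;   /* per epoch and weight   */
} kernel_times;

/* Expected time of one training epoch in microseconds.
   t_gab is the expected GAB() time for n + w floats on the chosen
   topology; it is an input here (ASSUMPTION: computed elsewhere). */
double t_backprop(kernel_times k, double w, double p, double P, double t_gab)
{
    assert(w > 0.0 && p > 0.0 && P > 0.0 && t_gab >= 0.0);
    return w * (p / P * k.t_pat + k.t_chg) + t_gab;
}

/* Performance in MCUPS: connection updates per microsecond. */
double mcups(kernel_times k, double w, double p, double P, double t_gab)
{
    return p * w / t_backprop(k, w, p, P, t_gab);
}
```

For the NSC constants {15.35, 10.80}, a network of 1000 weights and 16 processors, mcups() grows with the number of patterns but stays below 16 / 15.35, which illustrates why the per-epoch terms lose influence for large pattern sets.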
6.2.2 A problem with small neural networks
Using this model to predict the performance for backpropagation neural networks decomposed
via dataset decomposition results in good predictions, as will be discussed in the next
sections. However, at first the deviations between measured and predicted results
for small neural networks were very high. These could amount to more than a factor 4 for
networks of 10 weights (a 2x2x3 network). The reason for these large deviations is that the
times for function kernels computed for small networks differ significantly from the fitted
times. When zooming in on the plotted times in Figure 6.1 on the interval [0 ... 1K], it
appears that the smaller the neural network, the higher the time per weight t_pat. This can
be observed in Table 6.4.
nweights   NSC    GCel   PX
10         55     73     4
50         31     39     2.3
100        25     31     1.8
500        19     73     1.3
1K         17     73     1.2
5K         15.8   17.1   1.1
10K        15.4   16.7   1.0
Table 6.4: Measured time for t_pat in µsecs/weight.
From this table it can be deduced that only for networks > 5K weights the function kernels
for t_pat show the predicted behavior following Table 6.3. The reason for this problem
is that each of the times t_act, t_err and t_wed depends not only on the number of weights,
but also on the number of neurons. For large networks, the effect can be ignored and the
execution time can be modeled via Table 6.3. For small networks, however, the number
of neurons also has to be taken into account. There are two ways to solve this problem. The
first is to use a more detailed model, i.e. to identify and measure kernels on the level of
code fragments (see Chapter 4, Section 4.3.1). Unfortunately, when considering the code
of backpropagation on this level, many individual kernels can be identified, which makes it
hard to examine. Furthermore, it was discussed in Chapter 4 that too low a level of detail can
result in bad predictions. The second method that can be used is to fit t_pat as:
$$t_{pat} = w \cdot t_{weight} + n \cdot t_{neuron}$$
Using Matlab [66], t_pat was fit on the measured function kernels (which had already been
measured to determine the results of Table 6.3). This results in the following expectation
for the execution times:
$$T_{backprop}(P, n, w, p) = \frac{p}{P} \cdot (w \cdot t_{weight} + n \cdot t_{neuron}) + t_{epo}(P, n, w)$$
where t_weight and t_neuron for the three target platforms are listed below:
machine   t_weight   t_neuron
NSC       15.2       42.5
GCel      16.3       66.9
PX        1.01       3.9
Table 6.5: Fitted times for t_weight and t_neuron in µseconds.
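A fit of the form t_pat = w · t_weight + n · t_neuron can be obtained with ordinary least squares. The following sketch solves the 2x2 normal equations directly; it is not the Matlab procedure used for Table 6.5, whose details are not given here, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares fit of t[i] ~ w[i]*t_weight + n[i]*t_neuron (no intercept),
   solved through the 2x2 normal equations. Returns 0 on success, -1 if the
   samples do not determine both coefficients. */
int fit_tpat(const double *w, const double *n, const double *t, size_t m,
             double *t_weight, double *t_neuron)
{
    double sww = 0.0, swn = 0.0, snn = 0.0, swt = 0.0, snt = 0.0, det;

    assert(w && n && t && t_weight && t_neuron);
    for (size_t i = 0; i < m; i++) {
        sww += w[i] * w[i];
        swn += w[i] * n[i];
        snn += n[i] * n[i];
        swt += w[i] * t[i];
        snt += n[i] * t[i];
    }
    det = sww * snn - swn * swn;
    if (det == 0.0)
        return -1;
    *t_weight = (swt * snn - snt * swn) / det;
    *t_neuron = (sww * snt - swn * swt) / det;
    return 0;
}
```

Given per-pattern times measured for networks with different weight and neuron counts, the function returns the two coefficients. The samples must not all have the same ratio of weights to neurons, otherwise the system is singular and -1 is returned.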
6.2.3 Performance of backpropagation
The parallel backpropagation simulations decomposed via dataset decomposition were measured
as program kernels, i.e. the full application was timed. The measurements all concern
the training times. Measurements were made for neural networks with a size in the range
of 10 to 200K weights. For the large networks (> 5K), the predictions are very good, in
general within a deviation of 0% ... 8%. As an example, consider Figures 6.2(a) to 6.2(d),
which depict the measured and predicted performance for neural networks of size 100K
and 200K weights, running on GCEL grids.
(a) 100K networks on small grids. (b) 100K networks on large grids.
(c) 200K networks on small grids. (d) 200K networks on large grids.
Figure 6.2: Measured and expected performance for network decomposed backpropagation
networks on GCel grids. The npatterns represent the number of patterns per processor.
For all experiments using dataset decomposition, the plotted performance shows the characteristics
depicted in Figure 6.2. The larger the number of patterns that are computed
during one epoch, the less influence the communication, gathering and update time represented
by t_epo has on the performance. In the discussion below, several tables are
presented containing the performance achieved for neural networks of diverse sizes
running on different processor systems. It will appear that not only the number of patterns,
but also the size of the neural network determines the performance that is reached.
This can be explained in a straightforward manner: the larger the neural network,
the more time a processor spends on useful computations and the smaller the influence of
the communication overhead. Both the number of patterns and the size of the neural
network contribute to the problem size.
When considering the computation/communication ratio it is obvious that the more work
each processor has to do, the more efficiently the parallel program runs and thus the higher
the performance that is reached. Using dataset decomposition, the performance for a
backpropagation network is modeled via Equation (6.2). An upper bound for the maximal
performance that can be reached on P processors is found by taking the limit for p → ∞ of
Equation (6.2):
$$\mathcal{P}_{max}(P, n, w, p) = \lim_{p \to \infty} \frac{p \cdot w}{\frac{p}{P} \cdot T_{pat}(n, w) + T_{epo}(P, n, w)}$$
Using the definitions of t_pat and t_epo, this can be rewritten to Equation (6.3), which for
large w approximates to (6.4):
$$\mathcal{P}_{max}(P, n, w, p) = \lim_{p \to \infty} \frac{p \cdot w}{\frac{p}{P} \cdot (w \cdot t_{weight} + n \cdot t_{neuron}) + (w + n_{bias}) \cdot t_{epo}} \qquad (6.3)$$
$$\mathcal{P}_{max}(P) \approx \frac{P}{t_{weight}} \qquad (6.4)$$
The expected values of P_max for the different target platforms are given below:
machine   P_max(P) = P / t_weight in MCUPS
          2x2     2x4     4x4     8x4     8x8     16x8     16x16    16x32
NSC       0.26    0.53    1.05    2.11    4.21    8.42     16.84    33.68
GCEL      0.25    0.49    0.98    1.96    3.93    7.85     15.71    31.41
PX        3.96    7.92    15.84   31.68   63.37   126.73   253.47   506.93
Table 6.6: Maximum performance for backprop dataset decomposition.
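The entries of Table 6.6 follow from Equation (6.4) and the fitted t_weight values of Table 6.5. A minimal sketch, with an illustrative function name:

```c
#include <assert.h>

/* Upper bound on the performance in MCUPS for nprocs processors, following
   Equation (6.4); t_weight is given in microseconds per weight. */
double p_max(int nprocs, double t_weight)
{
    assert(nprocs > 0 && t_weight > 0.0);
    return nprocs / t_weight;
}
```

For example, p_max(512, 15.2) gives the NSC entry for a 16x32 grid and p_max(4, 1.01) the PX entry for a 2x2 grid.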
The next three subsections present the results of performance predictions achieved on
respectively the GCel, PowerXPlorer and NSC.
6.2.4 Results for GCel
The results are presented in tables depicting the maximal performance in MCUPS and the
minimum and maximum deviations in percent between measured and predicted performance.
The achieved performances for the GCEL are depicted in Table 6.7. These were
measured for networks of different sizes and a number of patterns varying over [1 ... 1000]
patterns per processor. As mentioned before, all performances obtained for a certain neural
network show the behavior depicted in Figure 6.2. This means that the performance
increases quickly in the domain of [1 ... 100] patterns per processor, after which it slowly
increases towards its maximum for a larger number of patterns. The performance depicted
in Table 6.7 almost reaches the predicted maximal performance shown in Table 6.6.
nweights   grid size
           4x4    8x4    8x8    16x8   16x16   16x32
10         0.2    0.4    0.8    1.7    3.3     6.6
50         0.4    0.8    1.7    3.3    6.6     13.2
100        0.5    1.0    2.0    4.0    8.0     15.9
500        0.7    1.4    2.7    5.5    10.9    21.6
1K         0.8    1.5    3.0    6.0    11.9    23.6
5K         0.9    1.7    3.4    6.8    13.4    26.6
10K        0.9    1.8    3.5    7.0    13.8    27.4
25K        0.9    1.8    3.6    7.1    14.2    28.0
50K        0.9    1.8    3.6    7.2    14.4    28.4
75K        0.9    1.8    3.7    7.3    14.5    28.6
100K       0.9    1.8    3.7    7.3    14.5    28.7
125K       0.9    1.9    3.7    7.3    14.6    28.8
150K       0.9    1.9    3.7    7.4    14.6    28.9
175K       0.9    1.9    3.7    7.4    14.6    28.9
200K       0.9    1.9    3.7    7.4    14.6    28.9
Table 6.7: Performance in MCUPS for GCel.
The deviations between measured and predicted results are depicted in Table 6.8. In the
remainder of this thesis, all deviations between expected values e_i and measured values m_i
are given as percentages, (e_i - m_i) / e_i · 100%.
nweights   grid size
           4x4        8x4        8x8        16x8       16x16      16x32
10         7.7-16.9   3.3-16.9   0.1-16.8   2.3-16.9   0.9-16.5   1.4-16.1
50         0.1-2.4    0.4-2.4    0.2-3.6    0.6-6.9    0.0-8.5    0.1-10.9
100        1.4-1.5    1.4-3.3    1.4-4.3    1.5-6.3    1.2-7.2    1.3-8.9
500        0.3-1.7    1.1-1.8    1.8-1.8    1.8-3.3    1.9-3.9    1.9-5.1
1K         0.5-2.3    0.8-2.3    1.3-2.3    2.3-2.7    2.3-3.2    2.4-4.3
5K         0.7-2.8    0.4-2.8    0.8-2.8    2.0-2.8    2.4-2.9    2.9-3.3
10K        0.7-2.9    0.3-2.9    0.7-2.9    1.8-2.9    2.8-2.9    3.0-3.7
25K        1.9-5.2    2.4-5.1    2.9-5.4    3.4-5.4    3.8-5.6    4.2-5.6
50K        0.9-3.9    1.8-4.1    2.2-4.2    2.8-4.2    3.3-4.4    3.9-4.4
75K        0.4-3.2    1.2-3.3    1.7-3.3    2.5-3.4    2.9-3.6    3.6-3.6
100K       0.2-2.9    1.1-3.0    1.6-3.1    2.3-3.1    2.8-3.3    3.4-3.5
125K       0.0-2.6    0.9-2.7    1.4-2.7    2.2-2.9    2.7-3.0    3.0-3.4
150K       0.1-2.5    0.8-2.5    1.3-2.6    2.1-2.6    2.6-2.8    2.8-3.3
175K       0.2-2.4    0.7-2.5    1.3-2.5    2.1-2.6    2.6-2.7    2.7-3.3
200K       0.3-2.2    0.6-2.3    1.2-2.4    2.0-2.4    2.5-2.5    2.6-3.2
Table 6.8: Min and max deviations for GCel (in %).
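The deviation measure used in the tables of this chapter can be written as a one-line function. The sketch follows the definition given above and keeps the sign, so a negative value means that the measured value exceeds the expected one. Whether the tables list signed values or magnitudes is not stated, and the function name is illustrative.

```c
#include <assert.h>

/* Deviation between an expected and a measured value, in percent of the
   expected value. */
double deviation_percent(double expected, double measured)
{
    assert(expected != 0.0);
    return (expected - measured) / expected * 100.0;
}
```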
6.2.5 Results for PX
The same set of experiments was carried out on the PowerXPlorer. The measured maximum
performances and minimum and maximum deviations are depicted in Table 6.9.
nweights   Performance                 Deviations
           2x2   4x2   4x4    4x8     2x2        4x2         4x4         4x8
10         0.9   1.7   3.4    6.5     9.7-25.1   10.1-27.1   9.1-22.0    9.7-23.0
50         1.8   3.5   7.0    13.5    8.6-14.3   7.9-14.2    13.4-18.3   14.0-19.2
100        2.2   4.3   8.6    16.6    0.3-5.8    0.6-8.8     0.1-14.3    0.3-16.4
500        3.1   6.0   11.9   23.0    0.1-9.1    0.0-10.5    0.5-5.0     1.0-5.9
1K         3.4   6.7   13.1   25.3    0.1-11.0   0.0-13.4    1.0-4.3     1.2-5.0
5K         3.8   7.4   14.4   27.9    0.5-10.2   0.8-15.0    3.5-7.1     3.7-7.2
10K        3.9   7.6   14.8   28.7    0.0-9.8    1.5-14.8    3.6-8.1     3.8-7.9
25K        3.9   7.7   15.2   29.5    0.2-1.7    0.1-1.2     0.2-1.0     0.0-1.7
50K        3.9   7.9   15.4   30.0    0.9-3.1    0.3-2.6     0.1-2.4     0.2-2.2
75K        3.9   7.9   15.6   30.2    1.3-3.9    0.6-3.4     0.2-3.2     0.0-3.1
100K       3.9   7.9   15.6   30.3    1.4-4.4    0.7-3.9     0.2-3.7     0.2-3.5
125K       3.9   7.9   15.7   30.4    1.6-4.7    0.9-4.2     0.1-4.0     0.1-3.8
150K       3.9   7.9   15.7   30.5    1.6-4.9    0.9-4.3     0.0-4.2     0.0-4.0
175K       3.9   7.9   15.7   30.5    1.7-4.9    0.9-4.4     0.0-4.3     0.0-4.1
200K       3.9   7.9   15.7   30.5    1.8-5.0    1.0-4.5     0.1-4.4     0.1-4.1
Table 6.9: Performance (MCUPS) and deviation range (%) for PX.
6.2.6 Results for NSC grids and trees
The backpropagation neural networks were also measured on NSC grids and trees. The
same programs were used as for the GCel and PX grids. For tree topologies, however,
another communication layer implementing the gathering and broadcasting
of data was used, as explained in Chapter 5.
nweights   Performance        Deviations
           4x4   8x4   8x8    4x4        8x4         8x8
10         0.3   0.5   1.0    9.9-11.1   14.3-14.6   13.8-14.1
50         0.5   1.0   2.0    2.6-19.8   7.9-25.5    7.9-26.4
100        0.7   1.2   2.4    1.4-13.4   0.0-21.0    0.5-22.8
500        0.9   1.6   3.2    3.8-6.2    8.6-12.3    8.7-12.4
1K         0.9   1.7   3.5    0.1-2.3    2.8-6.7     2.7-7.0
5K         1.0   1.9   3.8    3.9-4.0    1.3-2.4     1.3-1.8
10K        1.0   2.0   3.9    3.8-4.6    1.4-1.7     0.9-1.3
25K        1.0   1.9   3.9    3.2-7.1    1.2-4.3     0.0-0.9
50K        0.9   1.8   4.0    2.0-9.6    2.3-6.9     0.5-3.3
75K        0.9   1.9   3.7    0.8-4.5    0.7-4.6     3.6-5.7
100K       0.9   1.9   3.9    0.9-4.6    0.2-1.9     0.0-0.8
125K       0.9   1.8   3.6    0.4-6.4    4.4-9.1     4.2-8.0
150K       0.9   1.7   3.6    0.4-6.5    5.5-10.9    3.6-7.2
175K       0.9   1.8   3.8    0.2-3.3    3.4-7.4     1.2-2.4
200K       0.9   1.9   3.9    0.3-3.2    0.5-4.2     0.2-0.5
Table 6.10: Performance (MCUPS) and deviation range (%) for NSC grids.
nweights   Performance             Deviations
           4     13    40    63    4          13         40        63
10         0.1   0.2   0.6   1.0   0.6-16.1   0.0-10.3   0.1-7.5   0.1-9.5
50         0.1   0.4   1.3   2.1   5.2-11.8   2.6-7.2    2.1-5.4   4.1-7.4
100        0.1   0.5   1.5   2.4   0.2-4.7    0.0-4.2    0.0-7.1   0.0-5.7
500        0.2   0.6   2.0   3.2   2.9-8.0    1.9-8.6    0.8-8.4   1.0-8.3
1K         0.2   0.7   2.1   3.4   0.9-2.0    0.2-2.7    0.1-3.2   0.0-4.2
5K         0.2   0.7   2.3   3.8   0.6-2.1    1.2-2.9    1.2-3.1   1.3-1.9
10K        0.2   0.7   2.4   3.9   0.6-2.5    1.3-3.7    1.3-4.2   1.4-3.0
25K        0.2   0.8   2.5   4.0   2.1-4.2    2.7-4.9    2.8-6.5   2.9-6.9
50K        0.2   0.8   2.5   4.0   1.1-3.5    1.8-4.3    1.8-6.0   1.9-6.4
75K        0.2   0.8   2.5   4.0   0.5-3.0    1.1-4.0    1.2-5.6   1.4-6.1
100K       0.2   0.8   2.5   4.0   ?-2.8      1.0-3.6    1.0-5.5   1.1-6.0
125K       0.2   0.8   2.5   4.0   0.1-2.5    0.7-3.6    ?-5.3     0.9-5.7
150K       0.2   0.8   2.5   4.0   ?-2.4      0.5-3.4    0.6-5.3   0.8-5.4
175K       0.2   0.8   2.5   4.0   ?-2.2      0.5-3.2    0.6-5.0   0.6-4.2
200K       0.2   0.8   2.5   4.0   ?-2.3      0.4-3.1    0.5-4.9   0.5-3.8
Table 6.11: Performance (MCUPS) and deviation range (%) for NSC trees.
6.3 Discussion of the results
The tables above present the achieved performances and the deviations between the
measured and predicted performance. For all results it appears that the predicted maximal
performance is reached for large problem sizes. It can also be observed that in general the
larger the problem size, the better the predictions. Note that not only the number of
weights, but also the number of patterns contributes to the problem size. The typical
behavior of the deviations for a given neural network in relation to the number of patterns is
depicted below in Figure 6.3.
(a) Deviations fo r  100 weights. (b )  D eviations fo r  100K weights.
Figure 6,3: Typical deviations on different GCel grids (in %).
This behavior can be explained as follows. It was noted that, similar to the communication
time for small messages, the calculation times depend heavily on the size of the
neural network. Using the performance prediction model for dataset decomposition, two
parameters are estimated: the times $t_{pat}$ and $t_{epo}$. For neural networks > 10K weights, these
times are estimated using Table 6.5. For smaller networks, a fit of the measured function
kernels was required in order to arrive at proper performance predictions. Consider the
performance model, given an expected execution time for a certain neural network on a
certain processor topology $t_{exp} = p \cdot t_{pat} + t_{epo}$, and assuming that the measured time can
be modeled via two quality factors $(f_1, f_2)$ such that $t_{meas} = p \cdot t_{pat} \cdot f_1 + t_{epo} \cdot f_2$. The
deviation (fraction) is given as:

$$ dev(f_1, f_2) = \frac{t_{exp} - t_{meas}}{t_{exp}} = \frac{(1 - f_1) \cdot p \cdot t_{pat} + (1 - f_2) \cdot t_{epo}}{p \cdot t_{pat} + t_{epo}} $$

The derivative of $dev(f_1, f_2)$ with respect to the number of patterns $p$ equals:

$$ dev'(f_1, f_2) = \frac{t_{pat} \cdot t_{epo} \cdot (f_2 - f_1)}{(p \cdot t_{pat} + t_{epo})^2} $$

If $f_1$ equals $f_2$, the deviation has a constant value (its derivative is zero). Note that
$p$, $t_{pat}$ and $t_{epo}$ are all greater than zero. As the sign of the derivative determines the slope of the
deviation, by considering the plotted deviations as in Figure 6.3 it can be determined whether
$f_1 > f_2$ or vice versa. Evaluating all deviations of the results, it appears that for networks <
5K weights, the expected values for $t_{comm}$ have a larger deviation than the expected values
for $t_{pat}$. This can be derived from the fact that the derivative is negative, so $f_1 > f_2$.¹
In other words, it can be concluded that for small neural networks communication not
only forms a bottleneck for the performance, but is also the main source of the
deviations of the performance prediction model. For large neural networks, the expected
maximum performance is reached and the predictions are accurate to within a few percent.
As for these networks the communication becomes less important, possible differences in
the calculation times rule the shape of the deviations.
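The sign argument above can be checked numerically. The following is a minimal sketch, not the code used for the measurements: the function names `deviation` and `deviation_slope_sign` are hypothetical, and the sample values in the usage note are made up for illustration.

```c
#include <assert.h>

/* Relative deviation between expected and measured execution time,
 *   dev = (t_exp - t_meas) / t_exp, with
 *   t_exp  = p * t_pat + t_epo
 *   t_meas = p * t_pat * f1 + t_epo * f2
 * All times must use the same unit; f1 and f2 are the quality factors. */
double deviation(double p, double t_pat, double t_epo, double f1, double f2)
{
    double t_exp = p * t_pat + t_epo;
    double t_meas = p * t_pat * f1 + t_epo * f2;
    assert(t_exp > 0.0);
    return (t_exp - t_meas) / t_exp;
}

/* Sign of the slope of dev with respect to p: -1 falling, 0 flat, +1 rising.
 * Only (f2 - f1) matters, because t_pat, t_epo and the squared
 * denominator of the derivative are all positive. */
int deviation_slope_sign(double f1, double f2)
{
    if (f2 > f1) return 1;
    if (f2 < f1) return -1;
    return 0;
}
```

With illustrative values t_pat = 2, t_epo = 5, f1 = 0.95 and f2 = 0.80, the deviation falls from 0.08 at p = 10 to about 0.054 at p = 100, which is the falling curve that indicates f1 > f2.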
6.4 Dataset decomposition for Kohonen
The sequential KSOM (Kohonen Self-Organizing feature Map) algorithm and the identification
of its function kernels is described in Chapter 4, Section 4.2.2.1. Similar to the
dataset decomposed version of the backpropagation algorithm, the version of the KSOM
algorithm differs from its sequential equivalent in the way in which the weights are updated.
During one epoch, for each of the patterns stored locally on a processor, the weight
changes are computed. Subsequently, they are gathered, accumulated and broadcast, after
¹ This holds for $f_1, f_2 < 1$. For other cases a similar consideration can be made.
which the weight updating takes place. The dataset decomposed version of the KSOM
algorithm is depicted in Algorithm 6.4:
for (epoch=0;error>errcrit&&epoch<nepochs;epoch++) {
    for (p=0;p<npatterns;p++) {
        c = FindWinner(p);                   /* compute_net_status() */
        for (i=0;i<nneurons;i++)             /* compute_net_changes() */
            if (Neighbour(c,i))              /* ... */
                ComputeWeightChanges(p,c,i); /* ... */
    }
    GAB_WeightChanges();
    UpdateWeightChanges();
}
Algorithm 6.4: Algorithm for KSOM dataset decomposition.
From this algorithm, the function kernels can be identified as in Table 6.12.
kernel                    time required
FindWinner()              t_fnd
Neighbour()               t_nei
ComputeWeightChanges()    t_dw
GABWeightsChanges()       T_GAB(P, n_weights)
UpdateWeightChanges()     t_chw
Table 6.12: Function and communication kernels for Kohonen decomposed via dataset
decomposition.
In Section 4.2.2.1 it is discussed that for determining the total execution time, the times for
the routines FindWinner() and ComputeWeightChanges() must be multiplied by the number
of weights. The same holds for GABWeightsChanges() and UpdateWeightChanges(),
but the execution time for Neighbour() is related to the number of neurons in the Kohonen
map. Given a winning neuron, for all neurons that lie within its neighborhood,
the weight changes have to be computed. In our implementation of the KSOM neural
network, it is decided whether two neurons are neighbors by checking if their Euclidean
distance is within a certain range $r$. The upper bound on the number of neurons that have
to compute their weight changes is given by $\pi \cdot r^2$. Initially, $r = \sqrt{n}/4$, where $n$ is the
number of neurons. The total calculation time per pattern can thus be bounded by $t_{pat}$,
which is given in Equation (6.5). Note that the number of weights $w = N \cdot n$, where $N$ is
the dimension of the input space. The extra time required per epoch is given in (6.6):
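$$
\begin{aligned}
t_{pat} &= n \cdot t_{nei} + w \cdot t_{fnd} + N \cdot \pi \cdot (\sqrt{n}/4)^2 \cdot t_{dw} \\
        &= n \cdot t_{nei} + w \cdot t_{fnd} + N \cdot \pi \cdot n/16 \cdot t_{dw} \\
        &= n \cdot t_{nei} + w \cdot (t_{fnd} + \pi/16 \cdot t_{dw}) \qquad (6.5)
\end{aligned}
$$

$$ t_{epo} = w \cdot t_{chw} + T_{GAB}(P, w) \qquad (6.6) $$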
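The neighbourhood test that leads to the π·r² term can be sketched as follows. This is an illustration only, assuming a square map with neurons on integer grid positions. The function name `count_neighbours` and the map layout are assumptions, not taken from the implementation described here.

```c
#include <assert.h>

/* Counts the neurons of a side x side Kohonen map whose Euclidean distance
 * to the winner at (cx, cy) is at most r. Squared distances are compared,
 * so no square root is needed. The winner itself is included. */
int count_neighbours(int side, int cx, int cy, double r)
{
    int count = 0;
    assert(side > 0 && r >= 0.0);
    for (int x = 0; x < side; x++) {
        for (int y = 0; y < side; y++) {
            double dx = (double)(x - cx);
            double dy = (double)(y - cy);
            if (dx * dx + dy * dy <= r * r)
                count++;
        }
    }
    return count;
}
```

For a 50x50 map (n = 2500) the initial range is r = √2500/4 = 12.5, so π·r² ≈ 490.9. A winner in the middle of the map then has 489 neurons within range, and a winner near the border has fewer, which is why π·r² serves as the bound in Equation (6.5).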
6.4.1 Measurements for function kernels
The function kernels listed in Table 6.12 were measured on the three target platforms.
The different values for $t_{nei}$, $t_{fnd}$, etcetera were determined as described in Chapter 4. The
results were measured for different network settings $(N, n)$, where each result represents
the outcome of a sequential program kernel measurement, i.e. for a certain network,
Algorithm 6.4 was measured without executing GABWeightsChanges(). In contrast to the
backpropagation network, it is possible to identify only one code fragment kernel (the
computation of the neighbors). Therefore, Equation (6.5) can also be used for small neural
networks and no fitting is required. The measured times for the function kernels are
listed in Table 6.13.
kernel   NSC     GCEL    PX
t_nei    24.95   20.77   3.37
t_fnd    8.13    6.77    .474
t_dw     10.01   8.34    .520
t_chw    6.40    5.33    .335
Table 6.13: Times for KSOM function kernels (µseconds).
Using these results and Equations (6.5) and (6.6), the maximum performance for KSOM
networks can be estimated as Equation (6.7), for which the computed values are given in
Table 6.14.
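$$
\begin{aligned}
V_{max}(P, n, w, p) &= \lim_{p \to \infty} \frac{p \cdot n \cdot N}{p \cdot n / P \cdot (t_{nei} + N \cdot (t_{fnd} + \pi/16 \cdot t_{dw})) + t_{epo}(P, n, w)} \\
                    &= \frac{P \cdot N}{t_{nei} + N \cdot (t_{fnd} + \pi/16 \cdot t_{dw})} \qquad (6.7)
\end{aligned}
$$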
machine   V_max in MCUPS
          2x2    2x4    4x4    8x4    8x8     16x8    16x16   16x32
NSC       0.38   0.77   1.54   3.07   6.15    12.30   24.60   49.20
GCEL      0.46   0.92   1.85   3.69   7.38    14.77   29.54   59.07
PX        6.47   12.94  25.88  51.76  103.52  207.04  414.09  828.18
Table 6.14: Maximum performance for KSOM dataset decomposition.
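Equations (6.5) and (6.7) translate directly into code. The sketch below is a plain transcription under the assumption that all kernel times are given in µseconds, so that the performance comes out in MCUPS. The function names are hypothetical, and the input dimension N behind the values of Table 6.14 is not stated here, so the sketch does not claim to reproduce that table.

```c
#include <assert.h>

#define KSOM_PI 3.14159265358979323846

/* Equation (6.5): bound on the calculation time per pattern (microseconds)
 * for a map of n neurons with N inputs each, so w = N * n weights.
 * t_nei, t_fnd and t_dw are the kernel times of Table 6.13. */
double ksom_t_pat(double t_nei, double t_fnd, double t_dw, double n, double N)
{
    double w = N * n;
    assert(n > 0.0 && N > 0.0);
    return n * t_nei + w * (t_fnd + KSOM_PI / 16.0 * t_dw);
}

/* Equation (6.7): maximum performance on P processors. With times in
 * microseconds the result is in connection updates per microsecond (MCUPS). */
double ksom_v_max(double t_nei, double t_fnd, double t_dw, double P, double N)
{
    assert(P > 0.0 && N > 0.0);
    return P * N / (t_nei + N * (t_fnd + KSOM_PI / 16.0 * t_dw));
}
```

With the GCEL times of Table 6.13, a map of n = 2500 neurons and N = 6 inputs gives a per-pattern bound of roughly 178 000 µseconds. Because P only appears as a factor in Equation (6.7), doubling the number of processors doubles the maximum performance, which is the pattern visible in each row of Table 6.14.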
6.4.2 Results for KSOM
Below, the measured and maximum performance and deviations for the KSOM networks 
decomposed via dataset decomposition are given. The size of the networks is varied from 
10 to 200K weights. The number of patterns was limited to 1000 per processor. Full 
program kernels were measured, similar to the backpropagation networks.
n weights   grid size
            4x4 8x4 8x8 16x8 16x16 16x32
10 0.6 1.2 2.5 4.9 9.8 19.3
50 0.7 1.3 2.7 5.3 10.6 21.1
100 1.0 2.1 4.2 8.3 16.4 32.4
500 1.1 2.3 4.5 9.0 17.8 35.2
1K 1.4 2.8 5.7 11.2 22.3 43.9
5K 1.4 2.8 5.5 11.0 21.8 42.9
10K 1.6 3.2 6.3 12.6 24.9 48.9
25K 1.5 3.1 6.6 12.1 22.3 45.4
50K 1.6 3.1 6.3 12.4 24.6 48.3
75K 1.7 3.3 6.6 13.0 25.8 50.6
100K 1.7 3.4 6.7 13.4 26.4 51.9
125K 1.7 3.4 6.9 13.6 26.8 52.7
150K 1.8 3.5 6.9 13.7 27.1 53.2
175K 1.8 3.5 7.0 13.8 27.3 53.6
200K 1.8 3.5 7.0 13.9 27.5 53.9
Table 6.15: Performance (MCUPS) for GCel.
n weights   grid size
            4x4 8x4 8x8 16x8 16x16 16x32
10 5.9-19.3 1.6-19.2 1.1-19.0 1.1-18.8 1.6-18.5 0.4-18.1
50 1.0-1.2 1.0-1.6 1.0-1.7 1.0-2.1 1.0-2.1 0.7-1.0
100 1.7-2.4 2.0-2.4 1.8-2.4 2.0-2.4 1.9-2.4 0.1-2.3
500 1.6-2.9 1.8-2.9 1.5-2.9 1.7-2.9 1.4-2.9 0.1-2.8
1K 2.1-4.2 2.2-4.2 1.8-4.2 1.9-4.2 1.6-4.1 0.2-4.0
5K 2.0-4.1 2.1-4.1 1.7-4.1 1.8-4.1 1.5-4.1 0.2-3.9
10K 2.3-4.9 2.3-4.9 1.9-4.9 1.9-4.8 1.6-4.8 0.2-4.6
25K 0.1-0.5 0.0-0.1 2.9-8.3 0.0-0.5 2.2-7.0 0.0-3.5
50K 3.8-7.5 3.5-7.5 2.8-7.5 2.6-7.5 2.1-7.4 0.1-7.2
75K 3.1-6.4 2.9-6.4 2.3-6.3 2.2-6.3 1.8-6.3 0.3-6.1
100K 2.7-5.8 2.6-5.8 2.1-5.8 2.0-5.8 1.7-5.7 0.3-5.5
125K 2.5-5.5 2.4-5.4 1.9-5.4 1.9-5.4 1.6-5.3 0.2-5.1
150K 2.3-5.2 2.3-5.2 1.8-5.2 1.8-5.1 1.5-5.1 0.2-4.9
175K 2.2-5.0 2.2-5.0 1.7-5.0 1.8-4.9 1.5-4.9 0.1-4.7
200K 2.1-4.9 2.2-4.8 1.7-4.9 1.7-4.8 1.5-4.8 0.1-4.6
Table 6.16: Min and max deviations (%) for GCel.
n weights   performance          deviations
            2x2 4x2 4x4 4x8      2x2 4x2 4x4 4x8
10 1.5 2.9 5.7 10.8 0.7-12.8 1.3-15.9 1.7-16.2 2.5-16.8
50 1.6 3.2 6.2 12.3 3.0-4.7 2.1-4.6 1.1-4.8 0.3-4.5
100 2.8 5.6 11.1 21.5 1.9-4.4 0.4-4.3 0.2-4.0 0.1-4.0
500 2.9 5.8 11.4 22.3 2.0-3.9 0.3-3.7 0.2-3.7 0.1-3.6
1K 4.2 8.2 16.1 31.2 1.8-3.6 0.1-3.6 0.2-3.7 0.1-3.5
5K 4.3 8.4 16.5 32.0 2.0-3.8 0.4-3.6 0.1-3.7 0.0-3.5
10K 5.3 10.3 20.2 39.0 1.7-3.4 0.1-3.7 0.1-3.6 0.1-3.3
25K 4.1 8.1 15.8 30.6 2.5-5.7 0.4-5.0 0.0-6.3 0.1-6.1
50K 5.0 9.9 19.7 38.4 2.3-5.9 0.2-5.1 0.0-3.9 0.4-2.3
75K 5.6 10.9 20.9 41.2 1.2-2.7 0.2-3.1 0.3-5.3 0.4-2.4
100K 5.8 11.4 22.0 42.9 1.6-4.1 0.2-3.1 0.1-4.2 0.2-2.2
125K 5.8 11.5 22.8 44.0 2.0-5.5 0.1-4.1 0.1-2.8 0.1-2.0
150K 6.1 12.0 22.7 44.5 1.0-2.4 0.1-1.9 0.0-4.8 0.3-2.5
175K 6.2 11.8 23.4 44.2 1.0-2.3 0.0-4.7 0.0-3.3 0.1-4.3
200K 6.2 12.0 23.4 45.6 1.4-3.6 0.2-4.0 0.2-4.1 0.1-2.2
Table 6.17: Performance (MCUPS) and deviation range (%) for PX.
n weights   performance (grid size)   deviations (grid size)
            4x4  8x4  8x8             4x4      8x4      8x8
10          0.6  1.1  2.1             0.2-9.0  0.0-9.3  0.0-7.9
50          0.6  1.3  2.5             2.7-5.8  2.9-6.1  2.9-5.7
100         1.0  2.0  4.0             0.3-5.2  0.4-5.9  0.4-5.7
500         1.1  2.2  4.5             0.0-4.5  0.0-4.8  0.0-3.6
1K          1.4  2.9  5.7             0.1-3.0  0.3-3.8  0.1-1.9
5K          1.4  2.8  5.6             0.0-2.0  0.1-2.5  0.0-1.4
10K         1.6  3.2  6.4             0.4-3.1  0.4-3.1  0.3-3.1
25K         1.2  2.5  4.8             3.9-6.5  0.1-4.0  2.7-5.3
50K         1.4  2.7  5.3             2.8-5.0  0.9-5.7  3.1-6.1
75K         1.4  2.8  5.6             2.0-3.8  0.5-?    2.6-5.5
100K        1.4  2.9  5.8             3.0-5.5  ?-4.3    1.4-2.7
125K        1.5  3.0  5.8             1.9-3.7  0.2-2.5  2.1-3.9
150K        1.5  3.0  5.8             1.0-2.3  0.0-3.3  2.5-5.3
175K        1.5  3.0  5.8             2.6-4.9  0.6-3.8  2.6-5.5
200K        1.5  3.0  6.0             2.1-4.1  0.7-4.6  1.7-2.8
Table 6.18: Performance (MCUPS) and deviation range (%) for NSC.
6.5 Fixed-size speedup
The previous sections show that for dataset decomposition, the performance prediction
model is able to predict the execution times for a given neural network architecture and
processor topology. Especially for larger problem sizes, the accuracy of the predictions is
good. Using the predicted execution times, the speedup, efficiency and scalability can be
predicted within the same accuracy. As can already be observed in the tables listed above,
for large problems linear speedups can be reached. Consider for example any of the tables
that depict the maximum performance. The bottom rows of each table represent the larger
neural networks. Following the values of a row from left to right, the performance increases
each step by a factor of two. As the number of processors also increases by a factor
of two, the speedup is linear. As can also be observed, the speedups for smaller problems
(small number of patterns p) are not linear. This is because smaller problems suffer from
communication overheads. The speedups mentioned here represent the scaled speedup,
as the problem size for a given network is ruled by the number of patterns per processor.
Though, as expected, the achieved scalabilities are high, the question that is discussed in
this section is at which point the (fixed-size) speedup drops, i.e. for which
problem size and for which number of processors the speedup no longer increases but
rather decreases. This point is known as the speedup limit. In the next section, this issue is
discussed for scaled speedups, i.e. the scalability limit is searched. Plotting the fixed-size
speedup for different neural network sizes and numbers of patterns gives similar results for
all three configurations (NSC, GCEL and PX). For example, consider the speedups depicted
below. For backpropagation networks of size 10 and 200K weights, for a varying number
of patterns, the expected speedup is depicted for grid and tree topologies.
(a) GCEL grids, w = 10. (b) GCEL grids, w = 200K.
(c) PX grids, w = 10. (d) PX grids, w = 200K.
Figure 6.4: Speedups for GCEL and PX grid topologies.
These plots are computed using the definition of fixed-size speedup:
$$ S(P, n, w, p) = \frac{T(1, n, w, p)}{T(P, n, w, p)} $$

which is defined for dataset decomposition as:

$$ S(P, n, w, p) = \frac{p \cdot t_{pat}(n, w) + t_{chg}(n, w)}{p/P \cdot t_{pat}(n, w) + t_{chg}(n, w) + T_{GAB}(P, n, w)} $$

where the times for communication for respectively a grid and tree topology are:

$$ T^{grid}_{GAB}(P, n, w) = 2 \cdot (\sqrt{P} - 1) \cdot (2 \cdot t_{comm}(n, w) + t_{acc}(n, w)) $$

$$ T^{tree}_{GAB}(P, n, w) = 3 \cdot \log_3(P) \cdot (2 \cdot t_{comm}(n, w) + t_{acc}(n, w)) $$
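To see where the fixed-size speedup on a grid stops growing, the definition can be evaluated for increasing grid sizes. The sketch below assumes a square s x s grid, so that √P = s, and searches the best side by brute force. The function names and the sample times in the usage note are illustrative and not measured values.

```c
#include <assert.h>

/* Fixed-size speedup for dataset decomposition on a square s x s grid
 * (P = s * s processors, so sqrt(P) = s and no square root is needed).
 * p is the total number of patterns; t_pat, t_chg, t_comm and t_acc are
 * the kernel times for the network at hand, all in the same unit. */
double speedup_grid(int s, double p, double t_pat, double t_chg,
                    double t_comm, double t_acc)
{
    double P, t_gab;
    assert(s >= 1);
    P = (double)s * (double)s;
    t_gab = 2.0 * (s - 1) * (2.0 * t_comm + t_acc);
    return (p * t_pat + t_chg) / (p / P * t_pat + t_chg + t_gab);
}

/* Brute-force search for the grid side (1..max_side) with the highest speedup. */
int best_grid_side(int max_side, double p, double t_pat, double t_chg,
                   double t_comm, double t_acc)
{
    int best = 1;
    for (int s = 2; s <= max_side; s++) {
        if (speedup_grid(s, p, t_pat, t_chg, t_comm, t_acc) >
            speedup_grid(best, p, t_pat, t_chg, t_comm, t_acc))
            best = s;
    }
    return best;
}
```

With the made-up values p = 5120, t_pat = 1, t_chg = 0.5, t_comm = 4 and t_acc = 2, the speedup peaks on an 8x8 grid (about 23 on 64 processors) and drops again for larger grids.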
(a) GCEL trees, w = 10. (b) GCEL trees, w = 200K.
(c) PX trees, w = 10. (d) PX trees, w = 200K.
Figure 6.5: Speedups for GCEL and PX tree topologies.
As can be observed in these figures, for both small and large neural networks, and for all
patterns in the range [1 ... 1000] per processor, there is a point at which the speedup
reaches the speedup limit. Furthermore it is noted that for most cases the speedup as well
as the speedup limit for a tree are higher than those for a grid. Only for larger networks
on the PX, the grid seems to be favorable over a tree topology (compare Figures 6.4(c) and
6.5(c)). On the other hand, these figures show that for larger processor systems, the
speedup is higher for a tree.
6.5.1 The speedup limit
If the speedup limit is reached for a certain number of processors, this number can be
found at the point where the derivative of the speedup equals zero (see also [114]). Taking
the derivative $\frac{dS}{dP}$ and computing $P$ for $\frac{dS}{dP} = 0$ gives for a grid:

$$ \frac{dS}{dP} = 0 \;\Rightarrow\; \frac{2 \cdot t_{comm}(n, w) + t_{acc}(n, w)}{\sqrt{P}} - \frac{p \cdot t_{pat}(n, w)}{P^2} = 0 $$

$$ P^{grid}_{max} = \left( \frac{p \cdot t_{pat}(n, w)}{2 \cdot t_{comm}(n, w) + t_{acc}(n, w)} \right)^{2/3} \qquad (6.8) $$

and for a tree, where we estimate the depth of a tree with $\log_3(P)$:

$$ \frac{dS}{dP} = 0 \;\Rightarrow\; \frac{3 \cdot (2 \cdot t_{comm}(n, w) + t_{acc}(n, w))}{P \cdot \ln(3)} - \frac{p \cdot t_{pat}(n, w)}{P^2} = 0 $$

$$ P^{tree}_{max} = \frac{\ln(3) \cdot p \cdot t_{pat}(n, w)}{3 \cdot (2 \cdot t_{comm}(n, w) + t_{acc}(n, w))} \qquad (6.9) $$
Intuitively, such a relation between the number of patterns, the calculation and the communication
times could be expected. Indeed, if the communication times are high compared
to the calculation times, the simulations do not run efficiently and the speedup limit is
reached relatively soon. Similarly, if the number of patterns increases, the calculation time
becomes more important and the speedup limit rises. For a small number of patterns,
the network size is of importance for the speedup limit. For example, if every processor
computes only one pattern, communication is required after the computation of each pattern.
For small networks, especially for a fast machine like the PX, the calculation times
are relatively low compared to the communication times. These principles can be observed
in the discussions about Kohonen and backpropagation networks given below.
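The two closed-form speedup limits for a grid and a tree can be written down as small functions. This is a sketch with hypothetical names. The cube root is computed with a short Newton iteration only to keep the example free of the math library; `cbrt()` from `<math.h>` would serve equally well.

```c
#include <assert.h>

/* Cube root by Newton iteration, used only to avoid depending on the
 * math library. Converges for any x > 0. */
static double cube_root(double x)
{
    double y = x > 1.0 ? x : 1.0;
    assert(x > 0.0);
    for (int i = 0; i < 200; i++)
        y = (2.0 * y + x / (y * y)) / 3.0;
    return y;
}

/* Speedup limit on a grid: (p * t_pat / (2 * t_comm + t_acc))^(2/3). */
double p_max_grid(double p, double t_pat, double t_comm, double t_acc)
{
    double y = cube_root(p * t_pat / (2.0 * t_comm + t_acc));
    return y * y;
}

/* Speedup limit on a tree: ln(3) * p * t_pat / (3 * (2 * t_comm + t_acc)). */
double p_max_tree(double p, double t_pat, double t_comm, double t_acc)
{
    const double ln3 = 1.0986122886681098;
    return ln3 * p * t_pat / (3.0 * (2.0 * t_comm + t_acc));
}
```

For the illustrative values p = 5120, t_pat = 1, t_comm = 4 and t_acc = 2 (not measured times), the grid limit is 64 processors and the tree limit is about 187. Both grow with the number of patterns and shrink when communication becomes more expensive.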
6.5.2 Fixed-size speedups for backpropagation
Using Equations (6.8) and (6.9), for a given neural network and a given number of patterns,
the speedup limit can be computed. In order to determine the effect of network
size and number of patterns on the speedup limit, consider the figures below. Plotted is
the maximum number of processors after which the speedup drops, computed via these
equations.
(c) Speedup limit for PX.
Figure 6.6: Speedup limit for backpropagation on transputer grids. The y-axis gives the
network size in terms of number of neurons, the x-axis represents the number of patterns
per processor. This means that for the GCEL, p = 5 means that the number of patterns
in the processor network is 5 · 512, and similarly for the NSC and PX, that this number is
5 · 64 and 5 · 32.
Considering these plots, it shows that the most important parameter is the number of
patterns. The larger this number, the higher the speedup limit. To compare this to the
speedup limit for tree topologies, consider Figure 6.7. As for trees the speedup limit depends
linearly on the number of patterns p and on the times for calculation and communication,
compared to the power 2/3 for grids, it is obvious that trees provide a more efficient target
platform for dataset decomposed neural networks than grids.
(a) Speedup limit for NSC. (b) Speedup limit for GCEL.
(c) Speedup limit for PX.
Figure 6.7: Speedup limit for backpropagation on transputer trees.
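When considering these plots, it is noted that for larger neural networks, the speedup limit
approximates some asymptote:

$$ \lim_{n \to \infty} P_{max} = \left( \frac{p \cdot t_{weight}}{2 \cdot t_{comm} + t_{acc}} \right)^{2/3} \quad \text{for grids} \qquad (6.10) $$

or

$$ \lim_{n \to \infty} P_{max} = \frac{\ln(3) \cdot p \cdot t_{weight}}{3 \cdot (2 \cdot t_{comm} + t_{acc})} \quad \text{for trees} \qquad (6.11) $$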
For respectively the NSC, GCEL and PX systems, the speedup limit for 5 patterns per
processor computed from these equations becomes 55, 227 and 7 (and 148, 1256, 8 for trees),
which corresponds with the figures depicted above. Considering that these numbers are
met relatively soon, this means that even for neural networks with moderate size (n = 100),
the speedup limit can be computed in this way. Furthermore, this means that for a given
number of patterns, the speedup limit decreases to (6.10) or (6.11) when increasing the
network size.
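As a consistency check, the grid and tree asymptotes share the same ratio of calculation to communication time, so one limit can be expressed in the other. The relation below is not stated in the text; it follows from eliminating that ratio between the two expressions, assuming both limits are evaluated for the same p, t_weight, t_comm and t_acc.

```latex
% Let x = p * t_weight / (2 t_comm + t_acc). Then
%   P_grid = x^{2/3}   and   P_tree = (ln 3 / 3) * x,
% so that
\[
  P^{tree}_{max} \;=\; \frac{\ln 3}{3}\,\bigl(P^{grid}_{max}\bigr)^{3/2}
  \;\approx\; 0.366\,\bigl(P^{grid}_{max}\bigr)^{3/2}
\]
% Applied to the reported grid limits for 5 patterns per processor:
%   NSC : 0.366 * 55^{3/2}  \approx 149    (reported tree limit: 148)
%   GCEL: 0.366 * 227^{3/2} \approx 1252   (reported tree limit: 1256)
%   PX  : 0.366 * 7^{3/2}   \approx 6.8    (reported tree limit: 8)
```

The NSC and GCEL values agree to within the rounding of the grid limits. The PX value is the most sensitive to rounding because its grid limit is a small integer.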
6.5.3 Fixed-size speedups for Kohonen networks
Equations (6.8) and (6.9) also hold for Kohonen neural networks. Substituting $t_{pat}$, etcetera
for Kohonen, and dividing numerator and denominator by the number of neurons $n$ (see
Equations (6.5) and (6.6)), the speedup limit becomes:

$$ P^{grid}_{max} = \left( \frac{p \cdot (t_{nei} + N \cdot (t_{fnd} + \pi/16 \cdot t_{dw}))}{2 \cdot t_{comm}(N) + t_{acc}(N)} \right)^{2/3} $$

$$ P^{tree}_{max} = \frac{\ln(3) \cdot p \cdot (t_{nei} + N \cdot (t_{fnd} + \pi/16 \cdot t_{dw}))}{3 \cdot (2 \cdot t_{comm}(N) + t_{acc}(N))} $$
This means that the speedup limit for Kohonen networks depends on the number of inputs
$N$ and the number of patterns. Similar plots can be made as depicted above for backpropagation.
When increasing the network size for small networks (small $N$), the drop in speedup
limit that was observed for backpropagation is even more dramatic. Again, by taking the
limit of the network size, $P_{max}$ can be found as:

$$ \lim_{N \to \infty} P^{grid}_{max} = \left( \frac{p \cdot (t_{fnd} + \pi/16 \cdot t_{dw})}{2 \cdot t_{comm} + t_{acc}} \right)^{2/3} $$

$$ \lim_{N \to \infty} P^{tree}_{max} = \frac{p \cdot \ln(3) \cdot (t_{fnd} + \pi/16 \cdot t_{dw})}{3 \cdot (2 \cdot t_{comm} + t_{acc})} \qquad (6.12) $$
For respectively the NSC, GCEL and PX systems, the speedup limit for 5 patterns per
processor becomes 35, 146 and 6 (and 74, 648, 6 for trees). Note that these numbers
are somewhat lower than the speedup limits found for backpropagation in the previous
section. This is caused by the fact that for Kohonen, for the same number of weights, the
calculation times are less than for backpropagation.
6.6 Scalability and efficiency
The scalability issue is important if a user scales up a problem by a certain factor $k$
and hopes to solve it in the same time as the original problem was solved on $P$ processors
by increasing $P$ to $k \cdot P$. The scalability is $k$ if both problems indeed can be solved in the
same time, or more formally:

$$ S_{scal}(k, P, (n, w, p)) = k \cdot f_{scal}(k, P, (n, w, p)) \quad \text{where} \quad f_{scal}(k, P, (n, w, p)) = \frac{T(P, (n, w, p))}{T(k \cdot P, k \cdot (n, w, p))} \qquad (6.13) $$
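The scalability factor can be evaluated with the execution-time model for dataset decomposition. The sketch below is an assumption-laden illustration: it scales only the number of patterns by k (the network size stays fixed), it takes the communication times for P and k·P processors as inputs, and all function names and sample times are hypothetical.

```c
#include <assert.h>

/* Gather-accumulate-broadcast time on a square s x s grid. */
double t_gab_grid(int s, double t_comm, double t_acc)
{
    assert(s >= 1);
    return 2.0 * (s - 1) * (2.0 * t_comm + t_acc);
}

/* Execution time of one epoch: p patterns spread over P processors. */
double epoch_time(double p, double P, double t_pat, double t_chg, double t_gab)
{
    assert(P > 0.0);
    return p / P * t_pat + t_chg + t_gab;
}

/* Scalability factor when both the number of patterns and the number of
 * processors grow by k: f_scal = T(P, p) / T(k * P, k * p).
 * t_gab_P and t_gab_kP are the communication times on P and k * P processors. */
double scal_factor(double k, double p, double P, double t_pat, double t_chg,
                   double t_gab_P, double t_gab_kP)
{
    return epoch_time(p, P, t_pat, t_chg, t_gab_P) /
           epoch_time(k * p, k * P, t_pat, t_chg, t_gab_kP);
}
```

Because the patterns per processor stay constant, the calculation term is identical in numerator and denominator, and only the growth of the communication time lowers the factor. Going from a 2x2 to a 4x4 grid (k = 4) with p = 1000, t_pat = 1, t_chg = 0.5, t_comm = 4 and t_acc = 2 gives a factor of about 0.87, that is a scalability of about 3.5.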
Scaling up the problem via $k \cdot (n, w, p)$ can be done by either increasing the network size
or by using more patterns. For example, consider Figure 6.8. The plots represent the
measured scalability for small and large neural networks, for a varying number of patterns.
The execution time for a problem with 10 or 200K weights computing $p$ patterns was
measured on one processor. Furthermore, the same networks computing $k \cdot p$ patterns were
measured on $k$ processors. Plotted are the divided execution times multiplied by $k$, which
equals the scalability.
(a) Scalability for 10 weights (GCEL). (b) Scalability for 200K weights (GCEL).
(c) Scalability for 10 weights (PX). (d) Scalability for 200K weights (PX).
Figure 6.8: Scalability for backpropagation networks on GCEL and PX grids.
As can be observed, the scalability limit is not reached, neither for small networks of 10
weights nor for large networks of 200K weights. Similarly to what is described in the previous
section, this limit can be found by finding the number of processors for which $\frac{dS_{scal}}{dP}$ equals
zero. The expectation is that, if such a limit exists, it must be much higher than
the fixed-size speedup limit, as the problem size is scaled linearly with the number of
processors. Computing $\frac{dS_{scal}}{dP}$ gives no reasonable $P$ for which the derivative equals zero.
Considering Figure 6.8 and plotting the scalability for varying $k$, $n$, $p$, and $P$, it can be
concluded that for dataset decomposition no scalability limit exists. However, for large
$k$, the scalability factor approaches zero, in particular if $p$ is small compared to $k$. This
means that a large number of processors are added without any significant gain in execution
time; a large amount of resources is wasted. The part of the processors that is being used
efficiently is defined by the efficiency. The definition of efficiency is speedup divided by the
number of processors. This means that the higher the speedup, the higher the efficiency.
Linear speedups result in an efficiency of 1. For scalability, we define the scalability factor
as efficiency measure. Using the definitions of speedup and scalability factor, for a given
problem size it can be decided to add more processors based on speedup and efficiency
characteristics.
6.7 Two applications
To conclude the discussion about dataset decomposition, the performance, speedup,
scalability and efficiency that can be expected are presented for two relevant applications. The
two applications are Nettalk [93] and Satdat, a satellite data classification problem [89]. For
the Nettalk dataset, a backpropagation neural network with 203 inputs, a varying number
of hidden neurons and 26 outputs is used. The dataset and some additional information
can be obtained via anonymous ftp to ftp.idiap.ch. The dataset contains about 20K
training patterns and is contained in file /pub/benchmarks/neural/nettalk.tar.Z. The
satellite data from Satdat contains 8047 patterns with 6 inputs (bands) and 16 classes
(ground cover). For this set, a 50x50 Kohonen network is used.
6.7.1 Dataset decomposition and Nettalk
In their paper [93], Sejnowski and Rosenberg describe a dataset and neural network
architecture used to generate speech. Standard backpropagation was used. The input to the
neural network represents one letter from a word to be pronounced, plus its three preceding
and succeeding letters. Each letter has 29 features representing the 26 letters in English
and three punctuation characters. So in total, each input pattern has 203 features. The 26
output features represent articulatory features such as voicing and vowel height, and stress
and syllable boundaries. In their experiments, Sejnowski and Rosenberg vary the number
of hidden neurons.
Considering the size of both dataset and neural network, Nettalk can be considered as a
significant application. For the experiment depicted below, a hidden layer containing 71
neurons was chosen, so the problem size (n, w, p) is about (300, 16K, 20K). Figure 6.9
depicts the expected performance, speedup and efficiency for running the application on
the PX, NSC and GCEL.
(a) Performance. (b) Speedups.
(c) Efficiency.
Figure 6.9: Performance, speedup and efficiency for Nettalk.
These figures are computed based on the execution time required for running one epoch, i.e.
training all 20K patterns. Note that the PX is a much more powerful execution platform
than the GCEL and NSC, but because of its less favorable computation/communication
ratio, it is less efficient for larger processor systems. The speedup limit for PX, NSC
and GCEL is respectively 194, 881 and 929 for a grid, which can be observed in Figure
6.9(b). This limit also corresponds to the plotted performance in Figure 6.9(a). For a
tree, the speedup limits are much higher (respectively 986, 9579 and 10361). The expected
maximum performance as predicted in Table 6.6 is not reached for large processor systems,
because for large machines it only holds for much larger p. Figure 6.10 plots the expected
scalability factor, for k between [0,500] and for P in {2,16,32}.
(a) Scalability factor, P=2. (b) Scalability factor, P=16.
(c) Scalability factor, P=32.
Figure 6.10: Scalability factors for Nettalk, varying P.
From these results it can be concluded that for a large scalability range, for any of the
three execution platforms, scaling up the problem size and the number of processors results
in about the same execution times; the problem is highly scalable. Especially because in
practical situations k will not be very large², the scalability factor will be much higher
than 0.9. Note that for larger P, $f_{scal}$ decreases significantly. This is because the larger
P, the more $f_{scal}$ will approach $1/\sqrt{P}$ for a grid, and $\log(P)/\log(k \cdot P)$ for a tree.
² Even k = 10 can be considered high, as this means that P is increased by a factor of 10.
6.7.2 Dataset decomposition and satellite data
Ron Schoenmakers, Graeme Wilkinson and Theo Schouten describe in [89] a hybrid
segmentation method for classifying remotely sensed imagery. This satellite data covers the
Lisbon area of Portugal, recorded with the Landsat-TM, containing 1953 lines by 1801
columns by 6 channels. Each pixel is thus represented by 6 features, and the number of
ground cover classes is 16. A training set containing ground truth from a part of the image
is used, in total 8047 pixels. The neural network used in [89] is a backpropagation neural
network. For the purpose of this chapter, a Kohonen SOM will be used. The number of
weights w in the 6x50x50 KSOM is 15K, which is somewhat smaller than the backpropagation
network discussed in the previous section. Also the number of patterns for this
application is smaller. Using the prediction model, the performance, speedup and efficiency
are computed for this application.
(a) Performance. (b) Speedups.
(c) Efficiency.
Figure 6.11: Performance, speedup and efficiency for Satdat.
Again, the PX offers far more performance than the GCEL and NSC. The speedup limit
for PX, NSC and GCEL is respectively 110, 395 and 450 (425, 2874 and 3500 for a tree), which for
grids can be observed in Figure 6.11(b). Note that these figures are very similar to
Figure 6.9; however, for Kohonen the GCEL is more efficient than the NSC. This is caused
by the difference in computation time for the neural networks.
The scalability factors for k between [0,500] and P in {2,16,32} are depicted in Figure 6.12.
The Satdat problem appears highly scalable for the GCEL and NSC. For modest k, this
also holds for the PX.
(a) Scalability factor, P=2. (b) Scalability factor, P=16.
(c) Scalability factor, P=32.
Figure 6.12: Scalability factors for Satdat, varying P.
When comparing these results to those achieved in Section 6.7.1, it appears that the
performances are lower. However, when comparing the maximum performances that can be
achieved for backpropagation and Kohonen networks in Tables 6.6 and 6.14, the opposite
result would be expected. The reason for this is that the efficiency of the dataset
decomposition technique depends on the size of the neural network and the number of patterns.
The sizes of the neural networks are about the same, but the number of patterns is higher
in the first application. Therefore, its results are better.
6.8 Conclusion
Using the performance prediction model and a relatively small number of timings, it was 
shown how predictions can be made about performance, speedup, efficiency and scalability. 
By computing the point at which the derivative of the speedup becomes zero, the speedup 
limit could be computed. The model was validated quantitatively for Kohonen and back­
propagation neural networks, resulting in accurate results within a small deviation range. 
It was noted that for small neural networks the predictions are less precise, but that a
more elaborate modeling of kernels depending on both the number of neurons and number 
of weights solves this problem.
Evaluating a MIMD system can be done based on any of the predictions for performance, 
speedup, efficiency and scalability, or can be based on a combination of them. Depending 
on what question has to be answered, using the model, a decision can be made to buy more 
processing elements, more memory per processor or perhaps to use the available resources 
if extending them makes no sense. In most cases, scaling up the application is very well
feasible as the scalability factors are very high. The critical consideration occurs when
the problem size stays constant, because in such a case the speedup limit can be reached.
For the applications considered in the previous section, which have a reasonably large
problem size, it shows that current transputer systems are very well equipped for neural
network simulations. As for the fast PX communication becomes a bottleneck if the
number of processors increases, a faster communication network would be advantageous.
Furthermore, both the GCEL and the PX are 'hard-wired' in a grid topology, whereas it is shown
that a tree provides a more efficient processor topology.
7
Network Decomposition
Outline
Network decomposition is defined as the collection of methods that can
be used to decompose a neural network by dividing it over a number
of processors. In this chapter the network decomposition of the Kohonen
Self-Organizing Feature Map (KSOM) and the backpropagation
neural network are discussed. It will become clear that the implementation
of the Kohonen SOM is not very difficult, and therefore the focus
is on the implementation aspects of backpropagation. A new communication
strategy is introduced to minimize the gathering of connection
updates. Furthermore, the performance prediction method described in
the previous chapters will be further evaluated for network decomposition
techniques.
In the previous chapter, the dataset decomposition technique is explained. If a neural
network model does not accommodate epoch learning or if it does not fit on one processor,
dataset decomposition cannot be exploited. The first case holds for networks that require
updates of the weights immediately after a training sample is presented, such as
Hopfield [49] or the ART networks [13]. The second case is obvious: if a neural network
requires more memory for its connections or data than there is available on one processor,
more processors have to be used. In these cases, the neural network has to be decomposed
via another technique. Techniques where not the data, but the neural network is decomposed
over the available processors are called network decomposition techniques. A general
parallel algorithm for network-decomposed neural networks is given below in Algorithm 7.1:
load_data();
initialize();
distribute_patterns();
while (!ready) {
    for (p=0;p<npatterns;p++) {
        compute_pattern(p);
        gather_results();
        broadcast_results();
        process_results();
    }
}
Algorithm 7.1: General algorithm for network decomposition.
This code runs on each of the nodes in the processor network. When comparing this
algorithm to Algorithm 6.1, it can be noted that instead of communicating after processing
all $p/P$ patterns, here communication is required for each of the $p$ patterns. Therefore,
it can be expected that network decomposition suffers more than dataset decomposition
from communication overheads. In this chapter it will be shown that especially for
backpropagation networks, the potential communication overheads are severe and because of
this, new distributed gather routines are developed for minimizing communication. The
attention in this chapter is focused on the new gathering technique and its performance
model. The concept of identifying function kernels was discussed in detail in the preceding
chapters.
7.1 The backpropagation network
The network architecture of the backpropagation neural network requires that for the
computation of the activation of a single unit, all activations of its incoming units are known.
Each neuron in a layer is fully connected to the neurons in its input layer. The general way
to decompose backpropagation networks over transputer systems is to divide each layer
equally over the available processors. Consequently, all-to-all communications are necessary
to provide all inputs to each neuron in a layer. Similarly, for the backward pass of the
backpropagation network, all-to-all communications are required for computing each
neuron's delta values. The code for the network-decomposed implementations of
parallel backpropagation networks discussed here is depicted in Algorithm 7.2. The code
represents one epoch during training:
for (p=0;p<npatterns;p++) {
    for (l=0;l<L-1;l++) { /* forward pass */
        broadcast_activations(l);
        compute_activations(l+1);
        gather_activations(l+1);
    }
    broadcast_activations(l);
    for (l=L-1;l>0;l--) { /* backward pass */
        get_errors_and_compute_deltas(l);
        if (l>1) {
            compute_errors(l-1);
            gather_accumulate_errors(l-1);
        }
        change_weights(l);
    }
}
Algorithm 7.2: Algorithm for network-decomposed backpropagation.
7.1.1 Implementation aspects of the forward pass
As with the dataset decomposition techniques discussed in the previous chapter, all
implementations have one master process and $P - 1$ slave processes. The master loads
patterns from disk, gets initial parameters and distributes this information to the slaves.
Both the master and slave processes host part of the neural network. After the network
initialization, each layer $l$ contains $n_l$ neurons, and each node in the processor network
hosts (is responsible for) $n_l/P$ neurons and $n_l \cdot n_{l-1}/P$ connections per layer $l$, $l > 0$. The
organization of neurons and weights is depicted in Figure 7.1.
Figure 7.1: Decomposition of layers and connections over processors; each arrow represents
all connections between neurons residing on two (distinct or the same) processors.
Note that for this decomposition method it is required that all $n$ activations and "delta
values" can be stored on each processor, which will be explained below. This will only work
if the available memory is sufficient for storing $2 \cdot n$ neuron values and $w/P$ connections on
each processor. If the resources allow the storage of all patterns on each node, broadcasting
the activations of layer 0 (the input layer) can be done more efficiently by just broadcasting
the pattern number $p$. In the implementations discussed here, each input layer is broadcast
entirely.
Consider Algorithm 7.2. In the routine broadcast_activations(), for layer $l = 0$, the
master stores the input pattern $p$ in its input layer. For all layers $l > 0$, first a broadcast
is performed of the activations in its previous layer. For $l = 1$, which is the first hidden
layer, these activations are copied from the input pattern. If all activations are available,
the activations of neurons in the current layer $l$ can be computed, after which they have
to be gathered and broadcast. Note that after gathering the activations of a layer $l$, all
$n_l$ activations are available at the master processor. After broadcasting the $n_l$ values, all
activations of this layer are available to all processors, so the local activations of the next
layer can be computed. For all layers, all activations are kept in memory on each processor,
which is required because of the way the backward pass is implemented. The gathering
techniques discussed in Chapters 5 and 6 were tailored for simultaneously gathering one
amount of information and accumulating it: the gather-accumulate technique. Here, when
gathering the activations, there are three major differences. First, the amount of information
residing on each processor may differ if $\mathrm{mod}(n_l, P) \neq 0$, i.e. there is no perfect load
balance. Second, no accumulation is required. The third difference is that the structure
of the information differs. Whereas with dataset decomposition all weights are gathered
and accumulated, in this case the activations residing on different processors are stored
in different memory locations. In Section 7.2, a technique tailored for the gathering of
activations or delta values required for the backward pass is introduced.
The function kernel required for the forward pass is compute_activations, for which the
time can be modeled as $w \cdot t_{act}$ for large networks, but (as explained in the previous chapter)
for small networks it must be modeled as $n \cdot t^n_{act} + w \cdot t^w_{act}$. Using Matlab [66], for the measured
pairs $A = \{(n_i, w_i)\}$ and the measured times $t = \{t_i\}$, the fitted times $f = \{t^n_{act}, t^w_{act}\}$ can
be determined via $f = A \backslash t$. The time to compute $n$ activation values is given by:
$$t_{act}(n, w) = n \cdot t^n_{act} + w \cdot t^w_{act} \tag{7.1}$$
and these were measured in microseconds as:
machine   $t^n_{act}$   $t^w_{act}$
NSC       34.9          3.2
GCEL      18.3          5.9
PX        1.7           0.3
Table 7.1: Timings for function kernel compute_activations, per neuron or weight.
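The fit $f = A \backslash t$ is a linear least-squares problem with two unknowns. As a minimal sketch, the C function below solves it through the 2x2 normal equations. The name fit_kernel_times and its argument layout are assumptions made for this illustration and are not taken from the measurement code used for the table.

```c
#include <assert.h>
#include <stddef.h>

/* Least-squares fit of t[i] ~ n[i] * tn + w[i] * tw over m measurements,
 * solved via the 2x2 normal equations (hypothetical helper).
 * Returns 0 on success and -1 if the system is singular. */
int fit_kernel_times(const double *n, const double *w, const double *t,
                     size_t m, double *tn, double *tw)
{
    double snn = 0.0, snw = 0.0, sww = 0.0, snt = 0.0, swt = 0.0;
    double det;
    size_t i;

    assert(n && w && t && tn && tw);
    for (i = 0; i < m; i++) {
        snn += n[i] * n[i];
        snw += n[i] * w[i];
        sww += w[i] * w[i];
        snt += n[i] * t[i];
        swt += w[i] * t[i];
    }
    det = snn * sww - snw * snw;
    if (det == 0.0)
        return -1;
    /* Cramer's rule on [snn snw; snw sww] * [tn; tw] = [snt; swt] */
    *tn = (snt * sww - swt * snw) / det;
    *tw = (swt * snn - snt * snw) / det;
    return 0;
}
```

Measurements taken for networks whose neuron and weight counts are not proportional to each other keep the system well conditioned.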
7.1.2 Implementation aspects of the backward pass
Though comprising different computations, the backward pass can be considered as the 
inverted forward pass. The direction of the weights is changed and the error values for
each neuron $j$ in a layer $l - 1$ are computed as the inner product of the delta values of the
neurons $i$ in $l$ and the corresponding weight values $w_{ij}$. By allocating memory for each
layer's complete error vector, each process can compute its local contribution to the error
vector as depicted in Figure 7.2:
[Figure 7.2 labels: $e[j] = \sum_i \delta[i] \cdot w[i][j]$; $e[i] = t[i] - a[i]$; $\delta[i] = e[i] \cdot a'[i]$]
Figure 7.2: Backpropagation of the error values by computing the contribution of all local
neurons (marked black) to the errors of their input neurons.
The algorithm given below is a more detailed description of the backward pass as introduced
in Algorithm 7.2:
1. The errors $e_i$ are computed for the output layer as the difference between target and
computed activation (which was kept in memory).
2. For each neuron in the output layer, its delta value $\delta_i$ is computed as the product of
the error and the derivative of the activation function. This is done on all processors.
3. For each individual processor, the contributions of its local neurons to the error vector
of layer $l - 1$ are computed. Note that the number of these contributions corresponds with the
number of weights stored locally. For each local neuron $i$ in layer $l$, the contributions
to each (global) neuron $j$ in layer $l - 1$ are added to the error vector of layer $l - 1$
as $e_j = e_j + \delta_i \cdot w_{ij}$. This is the reason for the storage requirements of $2 \cdot n$ neuron
values.
4. After computing all local contributions to the error vector of layer $l - 1$, the overall
vector is computed using the gather-accumulate-broadcast technique discussed
before.
5. The local delta values for $l - 1$ can be computed using the generalized delta rule.
6. Change the incoming weights for layer $l$.
7. Go to step 3 if $l$ is not the first hidden layer.
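Step 3 can be sketched in C as follows. The function name and the row-major weight layout are assumptions made for this illustration; only the update $e_j = e_j + \delta_i \cdot w_{ij}$ is taken from the text.

```c
#include <assert.h>
#include <stddef.h>

/* Adds the contributions of the n_local neurons hosted on this processor
 * (layer l) to the complete error vector of layer l-1.
 * Assumed layout: w_local[i * n_prev + j] is the weight between neuron j
 * of layer l-1 and local neuron i of layer l. */
void add_local_error_contributions(const double *delta_local,
                                   const double *w_local,
                                   size_t n_local, size_t n_prev,
                                   double *err_prev)
{
    size_t i, j;

    assert(delta_local && w_local && err_prev);
    for (i = 0; i < n_local; i++)
        for (j = 0; j < n_prev; j++)
            err_prev[j] += delta_local[i] * w_local[i * n_prev + j];
}
```

If every processor runs this on a zeroed error vector, the gather-accumulate-broadcast of step 4 only has to sum the partial vectors.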
Steps 1 and 2 are only computed for the output layer. Note that changing the weights
is only allowed after computing step 3. The function kernels required for the backward
pass are: get_errors_and_compute_deltas() (steps 1-2 or 5), compute_errors() (step
3), gather_accumulate_errors() (step 4) and change_weights() (step 6), costing
respectively $t_{gecd}$, $t^n_{err}$ and $t^w_{err}$, $t_{gae}$ and $t_{chw}$, where $t_{gae}$ is just $T^{GAB}(P, m)$.
The time to compute each neuron's delta values depends on the number of neurons $n$.
Computing the errors in compute_errors() has a complexity similar to that of computing the
activation values and is thus modeled by two factors, $t^n_{err}$ and $t^w_{err}$. The time to adjust the
weights depends on the number of weights $w$ only. The function kernels were measured for
different network sizes on one node of the NSC, PX and GCEL. The resulting timings are
listed in Table 7.2 below (microseconds per neuron or per weight):
machine   $t_{gecd}$   $t^n_{err}$   $t^w_{err}$   $t_{chw}$
NSC       6.2          2.7           1.8           7.5
GCEL      8.1          3.5           2.6           9.8
PX        0.4          0.1           0.1           0.5
Table 7.2: Timings for function kernels of network-decomposed backpropagation.
The total calculation time for training one pattern of the backpropagation network is
modeled by Equation (7.2). For one epoch, this time must be multiplied by the number of
patterns.
$$T_{calc}(P, n, w) = \frac{1}{P} \cdot \left( n \cdot (t^n_{act} + t_{gecd} + t^n_{err}) + w \cdot (t^w_{act} + t^w_{err} + t_{chw}) \right) \tag{7.2}$$
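As a worked illustration, the sketch below evaluates the calculation-time model in C. It assumes that Equation (7.2) divides the single-node cost, built from the per-neuron and per-weight kernel timings of Tables 7.1 and 7.2, by $P$. The struct kernel_times and the function t_calc are hypothetical names.

```c
#include <assert.h>

/* Kernel timings in microseconds (hypothetical container type). */
typedef struct {
    double t_act_n, t_act_w;   /* compute_activations, per neuron / weight */
    double t_gecd;             /* get_errors_and_compute_deltas, per neuron */
    double t_err_n, t_err_w;   /* compute_errors, per neuron / weight */
    double t_chw;              /* change_weights, per weight */
} kernel_times;

/* Calculation time for one pattern on P processors, assuming the
 * n neurons and w weights are divided equally over the processors. */
double t_calc(int P, double n, double w, const kernel_times *k)
{
    assert(P > 0 && k);
    return (n * (k->t_act_n + k->t_gecd + k->t_err_n)
          + w * (k->t_act_w + k->t_err_w + k->t_chw)) / P;
}
```

With the NSC timings, a network of 100 neurons and 1000 weights gives $100 \cdot 43.8 + 1000 \cdot 12.5 = 16880$ microseconds per pattern on one node.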
The calculation times modeled by Equation (7.2) were measured on the three transputer
platforms. From Figure 7.3 it can be observed that for larger neural networks, the deviation
between measured and predicted times is very small. Furthermore, even for small networks,
the deviations are within 8%.
Figure 7.3: Deviations in % for Equation (7.2) and all three machines for different network
sizes.
7.2 A new gathering technique
In this section a new gathering technique is introduced which is tailored for gathering an
amount of information which is distributed over a processor network. This is different
from the gather-accumulate-broadcast technique discussed before. In the latter technique,
several different instances of the same information are located on distinct processors. After
gathering, accumulating and broadcasting, the updated information is available to all
processors. In the new technique, the information to be updated can be considered as
a vector of $n$ elements, where each processor holds $n/P$ elements. After computing $n/P$
new element values, the goal is to make all new values available to all processors. Two
techniques to perform this operation are compared below.
7.2.1 The store-and-forward technique for grids
In a first attempt, the problem was attacked as depicted in Figure 7.4. In a horizontal
phase, after $W - 1$ communication steps, all "row" information is available at the left-most
column of a grid. Similarly in the vertical phase, after $H - 1$ steps, all $n$ elements are
available at the root processor.
Figure 7.4: Gathering using the store-and-forward technique.
For a $\sqrt{P} \times \sqrt{P}$ grid, the total amount of transmitted elements $N_h$ for the horizontal phase
is given in Equation (7.3). The amount of elements gathered at each processor of the left-most
column is $\sqrt{P} \cdot n/P$. This means that for the vertical phase $N_v = \sqrt{P} \cdot N_h$, so the total
amount $N_{total}$ can be given by Equation (7.4):
$$N_h = \frac{n}{P} + 2 \cdot \frac{n}{P} + 3 \cdot \frac{n}{P} + \cdots + (\sqrt{P} - 1) \cdot \frac{n}{P} = \frac{n}{2 \cdot P} \cdot \sqrt{P} \cdot (\sqrt{P} - 1) \tag{7.3}$$
$$N_{total} = N_h + N_v = \frac{n}{2 \cdot P} \cdot \sqrt{P} \cdot (\sqrt{P} + 1) \cdot (\sqrt{P} - 1) = \frac{n}{2 \cdot P} \cdot \sqrt{P} \cdot (P - 1) \tag{7.4}$$
This technique is called store-and-forward, because a processor sends all information
gathered so far to its parent processor only after it has received all information from
the neighbors situated more "inside" the processor network. For example, in the horizontal phase
of this technique, all processors in the left-most column have to wait $\sqrt{P} - 2$ steps before
they may receive information.
7.2.2 The pipeline techniques for grids
The new distributed gathering technique introduced here makes better use of the available
resources. Each processor first sends its information to its neighbors, before receiving
information from "deeper-lying" processors. As can be observed in Figure 7.5, also after
$W - 1$ steps, all "row" information is available at the left-most processors. During each step,
2 communications are performed, i.e., sending to a neighbor and subsequently receiving
from a deeper node. However, on the transputer, inter-link communication can be done
in parallel and a write operation does not have to wait for receipt. Therefore, the two
communication steps can be done in one pass, where the setup time is ignored.
Figure 7.5: Gathering using the pipeline technique.
At step $i$, each processor $P_j$ has gathered the information of processors $P_{j+1}, \ldots, P_{j+i}$. This
means for a $\sqrt{P} \times \sqrt{P}$ grid that after step $\sqrt{P} - 1$, all left-most processors have gathered
all information in a row. Similar considerations can be made for the vertical phase, so the
total amount of elements transmitted can be derived as specified below:
$$N_h = \frac{n}{P} \cdot (\sqrt{P} - 1)$$
$$N_v = \sqrt{P} \cdot N_h = \frac{n}{\sqrt{P}} \cdot (\sqrt{P} - 1)$$
$$N_{total} = \frac{n}{P} \cdot (1 + \sqrt{P}) \cdot (\sqrt{P} - 1) = \frac{n}{P} \cdot (P - 1) \tag{7.5}$$
From Equations (7.4) and (7.5) it can be deduced that the pipelined technique performs
significantly (a factor $\sqrt{P}/2$) faster than the store-and-forward technique. However, the pipeline
technique is not capable of speeding up the GAB techniques described in Chapters 5 and
6. At first sight it seems that it could, because store-and-forward principles are also used
in the GAB techniques. However, when gathering one amount of information $s$
residing at each processor, the traffic is the same for the store-and-forward technique and
the pipeline technique:
$$N_h = s \cdot (\sqrt{P} - 1)$$
$$N_v = s \cdot (\sqrt{P} - 1)$$
$$N_{tot} = N_h + N_v = s \cdot (2 \cdot \sqrt{P} - 2)$$
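The counts behind Equations (7.4) and (7.5) can be checked numerically. The following C sketch follows the counting used in the text (a horizontal total $N_h$ and a vertical total $N_v = \sqrt{P} \cdot N_h$) for a $W \times W$ grid with $W = \sqrt{P}$. The function names are assumptions made for this illustration, and chunk stands for $n/P$.

```c
#include <assert.h>

/* Elements transmitted by store-and-forward on a W x W grid when every
 * processor holds `chunk` elements (chunk = n / P, P = W * W). */
long store_and_forward_total(long W, long chunk)
{
    long k, nh = 0;

    assert(W >= 1 && chunk >= 0);
    for (k = 1; k < W; k++)      /* the k-th hop forwards k chunks */
        nh += k * chunk;
    return nh + W * nh;          /* horizontal plus vertical phase */
}

/* Same count for the pipeline technique: W - 1 single-chunk steps. */
long pipeline_total(long W, long chunk)
{
    long nh;

    assert(W >= 1 && chunk >= 0);
    nh = (W - 1) * chunk;
    return nh + W * nh;          /* horizontal plus vertical phase */
}
```

For $P = 16$ and one element per processor this gives 30 against 15 elements, i.e. the factor $\sqrt{P}/2 = 2$.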
Below, an algorithm for gathering the activations of a layer $l$ using the pipeline technique
is given.
int offst = lindex[pid][l];
int n = nneurons_in[l]/W/H; /* W x H grid */
int i;

/* horizontal phase */
if (has_left)
    snd(left, &activations[l][offst], n);
for (i=0;i<nodesright;i++) {
    offst += n;
    rec(right, &activations[l][offst], n);
    if (has_left)
        snd(left, &activations[l][offst], n);
}
(a) Initialization and horizontal phase.

/* vertical phase */
offst = lindex[pid][l];
if (!has_left) { /* the left-most column */
    if (has_top)
        snd(top, &activations[l][offst], n*W);
    for (i=0;i<nodesdown;i++) {
        offst += n*W;
        rec(down, &activations[l][offst], n*W);
        if (has_top)
            snd(top, &activations[l][offst], n*W);
    }
}
(b) Vertical phase.
Algorithm 7.3: Algorithm for distributed gathering.
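The data movement of the horizontal phase can be traced with a small sequential simulation. This is a sketch only: it replaces the snd and rec calls by memory copies between per-processor buffers, and the function name simulate_row_gather and the flat buffer layout are assumptions made for this illustration.

```c
#include <assert.h>
#include <string.h>

/* Simulates the horizontal phase on one row of W processors.
 * buf holds W buffers of W * n doubles each; buffer j belongs to
 * processor j (0 is the left-most) and initially contains only its own
 * chunk at offset j * n. Returns the number of communication steps. */
int simulate_row_gather(int W, int n, double *buf)
{
    int len = W * n;
    int steps = 0;
    int s, j;

    assert(W >= 1 && n >= 1 && buf);
    for (s = 1; s < W; s++) {
        /* In step s, processor j receives the chunk of processor j + s
         * from its right neighbor j + 1, which obtained that chunk in
         * step s - 1 (or owns it when s == 1). */
        for (j = 0; j + s < W; j++)
            memcpy(&buf[j * len + (j + s) * n],
                   &buf[(j + 1) * len + (j + s) * n],
                   (size_t)n * sizeof(double));
        steps++;
    }
    return steps;
}
```

After the call, buffer 0 holds all $W$ chunks in order, while the other buffers hold only the chunks of the processors to their right.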
In Algorithm 7.3, two simplifications are made. First, an equal load balance is assumed,
where each processor has the same amount of $n_l/P$ activations, and $n_l/P$ is an integer.
Second, the distribution of a layer over processors is done as depicted in Figure 7.6, which
allows for the simple administration of the data structures used. Each processor "knows"
that the index to the data from its right neighbor is just its own offset plus $n_l/P$. In the actual
implementations, during an initial decomposition phase, all processors get a data structure
containing information about the number of neurons on each other processor. Using this
data structure, at each communication step it is known how much information has to be
communicated, and where it has to be stored.
[Figure 7.6 shows a 4 x 4 processor grid numbered row by row, from 0 to 3 in the first row to 12 to 15 in the last row.]
Figure 7.6: Gathering using the pipeline technique.
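The bookkeeping described above for the actual implementations, where $n_l/P$ need not be an integer, can be sketched as a table of counts and offsets per processor. The policy of giving the first $\mathrm{mod}(n_l, P)$ processors one extra neuron and the name layer_layout are assumptions made for this illustration; the text does not specify how the remainder is distributed.

```c
#include <assert.h>

/* Fills count[p] and offset[p] for n neurons spread over P processors.
 * Assumed policy: the first n % P processors host one extra neuron. */
void layer_layout(int n, int P, int *count, int *offset)
{
    int p, next = 0;

    assert(n >= 0 && P >= 1 && count && offset);
    for (p = 0; p < P; p++) {
        count[p] = n / P + (p < n % P ? 1 : 0);
        offset[p] = next;        /* where the chunk of p starts */
        next += count[p];
    }
}
```

With such a table, a processor that receives the chunk of processor p knows both its length, count[p], and its destination, offset[p].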
7.3 Backpropagation communication costs
For the forward pass of backpropagation, the time for broadcasting and subsequently
distributed gathering of the activations for each layer $l \in [0 \cdots L - 1]$ can be estimated as:
$$T^{DGB}_{grid}(P, n, w) = \sum_{l=0}^{L-1} T^{brd}_{grid}(P, n_l) + \sum_{l=1}^{L-1} T^{DG}_{grid}(P, n_l)$$
$$= (n_0 + n_1 + n_2) \cdot (2 \cdot \sqrt{P} - 2) \cdot t_{comm} + \frac{n_1 + n_2}{P} \cdot (P - 1) \cdot t_{comm}$$
$$= t_{comm} \cdot \left( n \cdot (2 \cdot \sqrt{P} - 2) + \frac{n - n_0}{P} \cdot (P - 1) \right) \tag{7.6}$$
For the backward pass, as explained in Section 7.1.2, it is assumed that each processor stores
all activations computed during the forward pass. In this way, the required communications
can be limited to gathering, accumulating and broadcasting the error vector:
$$T^{GAB}_{grid}(P, n - n_0)$$
The time required for the forward pass was measured on each of the three platforms NSC,
GCEL and PX. Measurements were carried out using the "ping-pong" technique, i.e. the
root processor starts the timer, first broadcasts activations from a layer $l$ and stops the
timer after receiving all activations from $l - 1$.
(a) NSC (b) GCEL (c) PX
Figure 7.7: Measured and expected communication times for NSC, GCEL and PX.
As can be observed in Figure 7.7, deviations are relatively small. For all three platforms,
deviations are within 4%-15%. The total execution time is derived from Equations (7.2),
(7.6) and $T^{GAB}_{grid}(P, n - n_0)$ as:
$$T_{grid}(P, n, w) = T_{calc}(P, n, w) + T^{DGB}_{grid}(P, n, w) + T^{GAB}_{grid}(P, n - n_0)$$
$$= T_{calc}(P, n, w) + t_{comm} \cdot \left( n \cdot (2 \cdot \sqrt{P} - 2) + \frac{n - n_0}{P} \cdot (P - 1) \right) + (2 \cdot t_{comm} + t_{acc}) \cdot (n - n_0) \cdot (2 \cdot \sqrt{P} - 2) \tag{7.7}$$
7.4 Backpropagation on a tree
In [110] and in Chapter 5, it was concluded that the tree topology offers an efficient way of
performing gather-accumulate-broadcast operations. The major point in this technique is
that accumulations are performed in parallel during communication. From Algorithm 7.2
it becomes clear that the GAB technique is used for accumulating the neuron error vectors
during the backward pass. During the forward pass however, no accumulation is required.
In this section, the communication time for the network decomposed backpropagation
model on trees is derived. As the time required for the backward pass is $T^{GAB}_{tree}(P, n - n_0)$,
the attention is focused on the forward pass.
7.4.1 Communication time for the forward pass
Again, an equal load balance is assumed, so each processor hosts $n_l/P$ neurons per layer $l$.
Two techniques are compared for gathering distributed information: the store-and-forward
technique and the pipelined technique. In the first technique, each processor on a certain
level of the tree first receives all activation values of its sub-trees, and then transmits these
together with its own values to its parent. For a tree of depth 2 (i.e., the root with three
sub-nodes), the number of communications is $3 \cdot 1$. For a tree of depth 3, this number is
$3 \cdot (3 + 1)$, etcetera. For a depth of $d$, the number of required communications for gathering
distributed information is as depicted in Figure 7.8:
d=4 -> 3*(3*(3+1)+1) = 3^3 + 3^2 + 3^1
d=3 -> 3*(3+1) = 3^2 + 3^1
d=2 -> 3*1 = 3^1
d=1 -> 0
Figure 7.8: Number of communications for gathering via the store-and-forward technique.
The total time for gathering and subsequently broadcasting $n_l/P$ neurons can be derived
as Equation (7.8):
$$T^{DGB}_{tree} = t_{comm} \cdot \frac{n_l}{P} \cdot (3^1 + 3^2 + \cdots + 3^{d-1}) + T^{brd}_{tree}(P, n_l)$$
$$= t_{comm} \cdot \frac{n_l}{P} \cdot \sum_{i=1}^{d-1} 3^i + T^{brd}_{tree}(P, n_l)$$
$$\approx t_{comm} \cdot \frac{n_l}{P} \cdot \left( \frac{3^d}{2} - 1 \right) + T^{brd}_{tree}(P, n_l) \tag{7.8}$$
Using the pipelined technique, each processor sends its local information to its parent node,
and subsequently receives and transmits the information from each of the nodes in its three
sub-trees. It will be shown below by induction that the required number of
communications for a tree of depth $d$ with $N(d)$ nodes equals $N(d) - 1$.
d=1 There are zero leaves, so the number of communications required is zero.
d=2 There are four nodes and three leaves, so the number of communications required is three.
d=k Each of the three nodes below the root has a depth $k - 1$. Each of these has gathered all
information in its subtree in $N(k - 1) - 1$ steps (through induction). Furthermore, the root
has already received $N(k - 1) - 1$ packets from one of the three nodes (say its left-down
neighbor). This means that another $1 + 2 \cdot N(k - 1)$ communications are required, which
in total amounts to $3 \cdot N(k - 1) = N(k) - 1$.
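The claim that both techniques need $N(d) - 1$ communications can be checked numerically. The C sketch below assumes a complete ternary tree in which depth 1 is the root alone, so that $N(d) = 3 \cdot N(d - 1) + 1$, and evaluates the nested store-and-forward expression $3 \cdot (3 \cdot (\ldots) + 1)$ of Figure 7.8. Both function names are assumptions made for this illustration.

```c
#include <assert.h>

/* Number of nodes N(d) in a complete ternary tree of depth d
 * (d = 1: the root only). */
long tree_nodes(int d)
{
    long n = 0;
    int i;

    assert(d >= 1);
    for (i = 0; i < d; i++)
        n = 3 * n + 1;
    return n;
}

/* Store-and-forward count of Figure 7.8: 0, 3*1, 3*(3+1), ... */
long store_and_forward_comms(int d)
{
    long c = 0;
    int i;

    assert(d >= 1);
    for (i = 1; i < d; i++)
        c = 3 * (c + 1);
    return c;
}
```

For $d = 4$ both counts give 39 communications for the 40 nodes of the tree.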
Based on these observations, it is concluded that the total communication time for a tree
of depth $d$ equals Equation (7.9), which is equal to the gathering term of Equation (7.8). This implies that there
is no difference between the store-and-forward and pipelined techniques for trees.
$$T^{DG}_{tree} = t_{comm} \cdot \frac{n_l}{P} \cdot \sum_{i=1}^{d-1} 3^i \approx t_{comm} \cdot \frac{n_l}{P} \cdot \left( \frac{3^d}{2} - 1 \right) \tag{7.9}$$
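The total communication time required for the forward pass is given below:
$$T^{DGB}_{tree}(P, n, w) = \sum_{l=0}^{L-1} T^{brd}_{tree}(P, n_l) + \sum_{l=1}^{L-1} T^{DG}_{tree}(P, n_l)$$
$$= (n_0 + n_1 + n_2) \cdot 3 \cdot depth(P) \cdot t_{comm} + \frac{n_1 + n_2}{P} \cdot \left( \frac{3^d}{2} - 1 \right) \cdot t_{comm}$$
$$= t_{comm} \cdot \left( n \cdot 3 \cdot depth(P) + \frac{n - n_0}{P} \cdot \left( \frac{3^d}{2} - 1 \right) \right) \tag{7.10}$$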
The total execution time for running network decomposed backpropagation networks on a
tree is derived from Equations (7.2), (7.10) and $T^{GAB}_{tree}(P, n - n_0)$ as:
$$T_{tree}(P, n, w) = T_{calc}(P, n, w) + T^{DGB}_{tree}(P, n, w) + T^{GAB}_{tree}(P, n - n_0)$$
$$= T_{calc}(P, n, w) + t_{comm} \cdot \left( n \cdot 3 \cdot depth(P) + \frac{n - n_0}{P} \cdot \left( \frac{3^d}{2} - 1 \right) \right) + (2 \cdot t_{comm} + t_{acc}) \cdot (n - n_0) \cdot 3 \cdot depth(P) \tag{7.11}$$
7.5 A comparison between transputer grids and trees
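To determine whether a tree or a grid topology is more suited for network decomposed
backpropagation networks, the difference between Equations (7.7) and (7.11) is considered.
$$T_{grid}(P, n, w) - T_{tree}(P, n, w) = t_{comm} \cdot \left( n \cdot (2 \cdot \sqrt{P} - 2) + \frac{n - n_0}{P} \cdot (P - 1) \right) + (2 \cdot t_{comm} + t_{acc}) \cdot (n - n_0) \cdot (2 \cdot \sqrt{P} - 2)$$
$$- t_{comm} \cdot \left( n \cdot 3 \cdot depth(P) + \frac{n - n_0}{P} \cdot \left( \frac{3^d}{2} - 1 \right) \right) - (2 \cdot t_{comm} + t_{acc}) \cdot (n - n_0) \cdot 3 \cdot depth(P) \tag{7.12}$$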
Below, the expected differences between the communication times required for grids and
trees are depicted.
(a) Differences for NSC (b) Differences for GCEL (c) Differences for PX
Figure 7.9: Expected difference in communication times between grids and trees.
In Figure 7.9, it can be observed that for a larger number of processors, a tree topology is
more efficient than a grid. For a smaller number of processors ($< 80$), a grid proves to be
more efficient. This effect can also be determined by considering Equation (7.12), which
can be rewritten to:
$$t_{comm} \cdot n \cdot (2 \cdot \sqrt{P} - 2 - 3 \cdot depth(P)) + t_{comm} \cdot \frac{n - n_0}{P} \cdot \left( P - \frac{3^d}{2} \right) + (n - n_0) \cdot (2 \cdot t_{comm} + t_{acc}) \cdot (2 \cdot \sqrt{P} - 2 - 3 \cdot depth(P))$$
If the depth of a ternary tree is estimated as $\log_3(P)$, the term $2 \cdot \sqrt{P} - 3 \cdot \log_3(P)$ dominates
Equation (7.12) for larger $P$. In the discussions below, only results for a grid are considered.
This is because of the following considerations:
1. Current transputer systems are all hard-wired in a grid topology; only virtual (and
thus non-efficient) tree-topologies can be made.
2. Developments in computer hardware have shown that large MIMD parallel processor
systems in general cannot keep up with the pace with which processors are accelerated.
This was in particular true for the T8xx transputer systems, which were in fact
already out-dated by the time they were released.
3. On the other hand, smaller systems are now being released, like the PowerXPlorer.
These machines contain nodes with more state-of-the-art performance.
4. For decomposition strategies like network decomposition which require relatively
much communication overhead, such machines provide a more efficient platform
than large machines containing many, less-powerful nodes.
The latter argument will be explained in more detail below. It will be shown that especially
for network decomposition, smaller systems offer a higher efficiency than larger ones.
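The sign of the term $2 \cdot \sqrt{P} - 2 - 3 \cdot depth(P)$ from the comparison of grids and trees in Section 7.5 can be evaluated with integer arithmetic only. The sketch below assumes that $depth(P)$ is the smallest depth of a complete ternary tree (depth 1 being the root alone) that holds at least $P$ nodes; this definition and both function names are assumptions made for this illustration and may differ from the definition used in Chapter 5.

```c
#include <assert.h>

/* Smallest depth d (d = 1: root only) of a complete ternary tree with
 * at least P nodes. This definition of depth(P) is an assumption. */
int ternary_depth(long P)
{
    long nodes = 1;
    int d = 1;

    assert(P >= 1);
    while (nodes < P) {
        nodes = 3 * nodes + 1;
        d++;
    }
    return d;
}

/* Returns 1 if 2*sqrt(P) - 2 - 3*depth(P) > 0, i.e. if this term favors
 * the tree, and 0 otherwise. Uses 2*sqrt(P) > x <=> 4*P > x*x for x > 0,
 * which avoids floating-point square roots. */
int tree_favored(long P)
{
    long x = 2 + 3L * ternary_depth(P);

    assert(P >= 1);
    return 4 * P > x * x;
}
```

Under this assumed depth definition the term changes sign between $P = 72$ and $P = 73$, which is of the same order as the roughly 80 processors read from Figure 7.9.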
7.6 Performance
The performance of backpropagation networks parallelized via network decomposition is:
$$\mathcal{P}(P, n, w) = \frac{w}{T_{grid}(P, n, w)} \text{ MCUPS}$$
The measured and expected performance are depicted in Figure 7.10:
(a) Performance for NSC (b) Performance for small GCEL
(c) Performance for large GCEL (d) Performance for PX
Figure 7.10: Expected and measured performance in MCUPS.
Compared to the performance achieved with the dataset decomposition techniques
explained in the previous chapter, these performances are significantly lower. This was
already expected and is due to the increased communication overheads. Note that for smaller
neural networks, this effect means that the performance is lower on larger grids. In
particular consider Figure 7.10(c), where 512 processors only reach a higher performance than
128 processors for problems larger than 85M weights. The deviations between measured
and expected performance are all within 15%. In Chapter 6, the maximum performance
that can be achieved was computed by taking the limit for the problem size. The problem
size can be enlarged by increasing the number of patterns or by increasing the size of the
neural network. In this case, the execution time increases linearly with the number of
patterns, and the size of the neural network is limited by the amount of available memory.
Therefore, for network decomposed backpropagation networks, the maximum performance
cannot be computed.
7.7 Speedup, scalability and efficiency
The (fixed-size) speedup is computed as:
$$S(P, n, w) = \frac{T_{calc}(1, n, w)}{T_{grid}(P, n, w)} \tag{7.13}$$
Below, the measured and expected speedup are depicted.
(a) Speedup for NSC (b) Speedup for GCEL (c) Speedup for PX
Figure 7.11: Expected and measured speedup in MCUPS.
When considering the fixed-size speedup, it is noted that:
1. The achieved speedups are lower than those achieved with dataset decomposition.
2. For smaller neural networks, the speedup limit is reached relatively soon.
The speedup limit for a given neural network can be computed by taking the derivative
of Equation (7.13) with respect to $P$ and setting it equal to zero. Let:
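$$a = T_{calc}(1, n, w), \quad b = t_{comm} \cdot n \cdot 2, \quad c = (n - n_0) \cdot t_{comm}, \quad d = (2 \cdot t_{comm} + t_{acc}) \cdot 2 \cdot (n - n_0)$$
$$S(P, n, w) = \frac{T_{calc}(1, n, w)}{T_{grid}(P, n, w)} = \frac{a}{\frac{a}{P} + b \cdot \sqrt{P} - b + c - \frac{c}{P} + d \cdot \sqrt{P} - d}$$
$$\frac{\partial S}{\partial P} = 0 \;\Rightarrow\; -\frac{a}{P^2} + \frac{c}{P^2} + \frac{b + d}{2 \cdot \sqrt{P}} = 0$$
$$P = \left( \frac{2 \cdot (a - c)}{b + d} \right)^{2/3} \tag{7.14}$$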
The following table depicts the computed speedup limit and corresponding maximal speedup
following Equation (7.14). In Figure 7.11, it can be observed that these numbers match
the points at which the speedup drops.
Mweights   NSC: P (S(P))   GCEL: P (S(P))   PX: P (S(P))
0.38M      48 (18)         59 (22)          9 (4)
1.50M      76 (27)         94 (34)          15 (6)
3.00M      95 (34)         118 (42)         19 (7)
6.00M      120 (43)        148 (52)         24 (9)
12.00M     151 (53)        187 (66)         30 (11)
18.00M     173 (61)        214 (75)         34 (13)
24.00M     190 (67)        235 (82)         38 (14)
48.00M     240 (84)        297 (103)        47 (17)
60.00M     258 (90)        320 (111)        51 (19)
96.00M     302 (105)       374 (129)        60 (22)
Table 7.3: Speedup limits following Equation (7.14).
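The speedup limit can also be located by a direct search over square grids. The C sketch below evaluates the execution-time model as $a/P + (b + d) \cdot (\sqrt{P} - 1) + c \cdot (1 - 1/P)$ with $P = W^2$, where $a = T_{calc}(1, n, w)$, $b = 2 \cdot t_{comm} \cdot n$, $c = (n - n_0) \cdot t_{comm}$ and $d = 2 \cdot (2 \cdot t_{comm} + t_{acc}) \cdot (n - n_0)$. The function names and the numeric values in the example are assumptions chosen for illustration; they are not measured values.

```c
#include <assert.h>

/* Modeled execution time on a W x W grid (P = W * W) for the
 * coefficients a, b, c and d of the speedup derivation. */
double t_grid(double a, double b, double c, double d, int W)
{
    double P = (double)W * W;

    assert(W >= 1);
    return a / P + (b + d) * (W - 1) + c * (1.0 - 1.0 / P);
}

/* Grid side W in [1, max_w] with the smallest modeled execution time,
 * i.e. the largest fixed-size speedup a / t_grid(...). */
int best_grid_side(double a, double b, double c, double d, int max_w)
{
    int W, best = 1;

    assert(max_w >= 1);
    for (W = 2; W <= max_w; W++)
        if (t_grid(a, b, c, d, W) < t_grid(a, b, c, d, best))
            best = W;
    return best;
}
```

For example, with $a = 7000$, $c = 88$ and $b + d = 1$, setting the derivative to zero gives $P^{3/2} = 2 \cdot (a - c)/(b + d) = 13824$, so $P = 576$, and the search returns the matching grid side $W = 24$.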
In the tables below, some computed scalability factors for network decomposed
backpropagation networks are given, for each of the three target platforms. The scalability
factor is computed as:
$$f^{scal}(k, P, (n, w)) = \frac{T(P, n, w)}{T(k \cdot P, k \cdot (n, w))}$$
k    (.4M,P=1)    (1.6M,P=4)   (.4M,P=8)   (3M,P=8)
1    1.00 (1)     1.00         1.00        1.00
2    1.00 (2)     0.98         0.90        0.96
4    0.99 (4)     0.95         0.75        0.89
8    0.98 (7)     0.87         0.56        0.77
16   0.93 (15)    0.76
32   0.87 (28)
64   0.75 (48)
Table 7.4: Scalability factors for NSC (scalability in parentheses).
k    (.4M,P=1)    (1.6M,P=4)   (.4M,P=8)   (3M,P=8)
1    1.00 (1)     1.00         1.00        1.00
2    1.00 (1)     0.99         0.93        0.92
4    0.99 (4)     0.96         0.80        0.82
8    0.98 (7)     0.90         0.63        0.67
16   0.95 (15)    0.81         0.43        0.50
32   0.90 (29)    0.67         0.27        0.32
64   0.80 (51)    0.49         0.15
128  0.66 (84)    0.32
256  0.49 (125)
512  0.32 (164)
Table 7.5: Scalability factors for GCEL (scalability in parentheses).
k    (.4M,P=1)    (1.6M,P=4)   (.4M,P=8)   (3M,P=8)
1    1.00 (1)     1.00         1.00        1.00
2    0.96 (2)     0.84         0.60        0.73
4    0.87 (3)     0.62         0.33        0.48
8    0.73 (6)     0.40
16   0.54 (9)
32   0.35 (11)
Table 7.6: Scalability factors for PX (scalability in parentheses).
Similar to the dataset decomposition technique described in Chapter 6, it is shown here that
for larger $k$, the scalability factor drops. Also note that still no scalability limit is found.
7.8 Expected results for Nettalk
The expected performance, speedup and efficiency for the application Nettalk described in 
the previous chapter are computed here.
       Performance (MCUPS)     Speedup                 Efficiency
P      NSC    GCEL   PX        NSC    GCEL   PX        NSC    GCEL   PX
1      0.08   0.05   1.06      1.00   1.00   1.00      1.00   1.00   1.00
2      0.15   0.10   1.73      1.97   1.97   1.63      0.98   0.98   0.81
4      0.28   0.20   2.06      3.70   3.77   1.93      0.93   0.94   0.48
8      0.47   0.35   1.77      6.21   6.54   1.66      0.78   0.82   0.21
16     0.62   0.49   1.28      8.26   9.28   1.20      0.52   0.58   0.07
32     0.62   0.53   0.88      8.24   9.91   0.82      0.26   0.31   0.03
64     0.50   0.44   0.60      6.64   8.34   0.56      0.10   0.13   0.01
128    0.36   0.33   0.41      4.86   6.22   0.38      0.04   0.05   0.00
256    0.26   0.23   0.28      3.43   4.42   0.27      0.01   0.02   0.00
512    0.18   0.16   0.20      2.40   3.11   0.18      0.00   0.01   0.00
Table 7.7: Performance, speedup and efficiency for Nettalk.
Note that the performance appears to be very low compared with the results depicted in
Figure 6.9. The reasons for this result are, again, the communication overheads. With
dataset decomposition, communication only has to take place after computing the weight
changes for 20K patterns. Here, the overheads are so severe that for this application
only small processor networks are suited. For larger systems, the performance drops.
Using Equation (7.14), the number of processors at which this occurs can be computed. The
maximal speedup for the NSC, GCEL and PX is found at respectively $P = 22$,
$P = 26$ and $P = 4$:
machine   P    speedup S(P)   performance $\mathcal{P}(P)$ (MCUPS)   efficiency E(P)
NSC       22   8.6            0.64                                   0.39
GCEL      26   10.1           0.53                                   0.39
PX        4    2.0            2.00                                   0.5
Table 7.8: Maximum speedup and corresponding performance and efficiency for Nettalk.
7.9 The Kohonen neural network
In Section 4.2.2.1, the Kohonen Self-Organizing Feature Map (KSOM) algorithm was
explained. Here, the parallel implementation of the KSOM via network decomposition is
discussed. For network decomposition, when distributing the neurons of the KSOM, two
requirements must be met [7, 104]:
1. Divide all neurons equally over the available processors, where each neuron “hosts” its 
weights locally. This requirement assures that during the recall phase (finding the winning 
neuron), the computational load is well-balanced.
2. Make sure that neighboring neurons are placed at distinct processors. This assures that 
during the training phase (changing the weights of neurons in a certain neighborhood of 
the winning neuron), the load is well-balanced.
The architecture of the KSOM is a grid with width Kw and height Kh, which has to
be decomposed over P processors. A simple algorithm which ensures that no neighboring
neurons lie on the same processor and which divides the neurons equally over the
available processors is a round-robin decomposer. On each processor, an array global_pos
is available pointing at positions in the Kohonen map. The distribution is performed by
the master process, and it is assumed that it can send information to a particular
processor using a send command:
    void distribute_network (int Kw, int Kh, int P)
    {
        int i, p;

        /* nlocal_neurons[] is assumed to be zero on entry */
        for (i = 0; i < Kw*Kh; i++) {
            p = i % P;                              /* round-robin owner */
            global_pos[p][nlocal_neurons[p]] = i;   /* map position of the next local neuron */
            nlocal_neurons[p] += 1;
        }
        for (i = 0; i < P; i++) {
            send(i, (char *)&nlocal_neurons[i], sizeof(int));
            send(i, (char *)global_pos[i], nlocal_neurons[i]*sizeof(int));
        }
    }
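The round-robin assignment can be tried out without any message passing. The following is a minimal sketch, not the thesis implementation: the function names and the flat owner array are assumptions made for illustration. The second routine counts adjacent neurons that end up on the same processor, which is worth checking because neurons i and i + Kw map to the same processor whenever Kw is a multiple of P.

```c
/* Assign neuron i of a Kw x Kh map to processor i % P (round-robin).
 * owner must hold Kw*Kh entries, count must hold P entries. */
static void round_robin(int Kw, int Kh, int P, int *owner, int *count)
{
    for (int p = 0; p < P; p++)
        count[p] = 0;
    for (int i = 0; i < Kw * Kh; i++) {
        owner[i] = i % P;
        count[owner[i]]++;
    }
}

/* Count horizontally or vertically adjacent neuron pairs
 * that are placed on the same processor. */
static int shared_neighbour_pairs(int Kw, int Kh, const int *owner)
{
    int pairs = 0;
    for (int y = 0; y < Kh; y++) {
        for (int x = 0; x < Kw; x++) {
            int i = y * Kw + x;
            if (x + 1 < Kw && owner[i] == owner[i + 1])
                pairs++;
            if (y + 1 < Kh && owner[i] == owner[i + Kw])
                pairs++;
        }
    }
    return pairs;
}
```

For a 5 x 4 map on 4 processors, each processor receives 5 neurons and no adjacent pair shares a processor; for a 4 x 4 map on 4 processors, all vertical neighbours share one.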
7.9.1 Finding the winning neuron
After this initial decomposition phase, communication is only required for the distribution 
of input patterns (implemented by a broadcast operation), and for the distributed compu­
tation of the winning neuron. The implementation of the latter computation is very similar 
to the GAB technique described before. Instead of accumulating weights, now only two 
values are communicated: the index of the winning neuron and its distance to the input 
pattern. On each processor, the local “winner” and its distance to the input is computed.
Using the distributed gathering technique, the global winner is determined. Below, this 
algorithm is given for a tree and grid processor topology.
(a) Finding the winner on grids:

    void gather_winner (int *winner, double *mydist)
    {
        int w;
        double d;

        if (has_right) {
            rec (EAST, (char *) &w, sizeof(int));
            rec (EAST, (char *) &d, sizeof(double));
            if (*mydist > d) {
                *mydist = d;
                *winner = w;
            }
        }
        if (!has_left) {   /* leftmost column */
            if (has_down) {
                rec (SOUTH, (char *) &w, sizeof(int));
                rec (SOUTH, (char *) &d, sizeof(double));
                if (*mydist > d) {
                    *mydist = d;
                    *winner = w;
                }
            }
            if (has_top) {
                snd (NORTH, (char *) winner, sizeof(int));
                snd (NORTH, (char *) mydist, sizeof(double));
            }
        }
        else {
            snd (WEST, (char *) winner, sizeof(int));
            snd (WEST, (char *) mydist, sizeof(double));
        }
        broadcast ((char *) winner, sizeof(int));
        broadcast ((char *) mydist, sizeof(double));
    }

(b) Finding the winner on trees:

    void gather_winner (int *winner, double *mydist)
    {
        int w;
        double d;

        if (has_left) {
            rec (LEFT, (char *) &w, sizeof(int));
            rec (LEFT, (char *) &d, sizeof(double));
            if (*mydist > d) {
                *mydist = d;
                *winner = w;
            }
        }
        if (has_down) {
            rec (DOWN, (char *) &w, sizeof(int));
            rec (DOWN, (char *) &d, sizeof(double));
            if (*mydist > d) {
                *mydist = d;
                *winner = w;
            }
        }
        if (has_right) {
            rec (RIGHT, (char *) &w, sizeof(int));
            rec (RIGHT, (char *) &d, sizeof(double));
            if (*mydist > d) {
                *mydist = d;
                *winner = w;
            }
        }
        if (has_top) {
            snd (TOP, (char *) winner, sizeof(int));
            snd (TOP, (char *) mydist, sizeof(double));
        }
        broadcast ((char *) winner, sizeof(int));
        broadcast ((char *) mydist, sizeof(double));
    }

Algorithm 7.4: Finding the winner on grids and trees.
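The gathering step amounts to a minimum reduction over (index, distance) pairs. The sketch below simulates it sequentially for P processors arranged as an implicit binary tree, in which node k passes its candidate to node (k - 1) / 2. The type and function names, and the tree numbering, are assumptions made for illustration; no rec, snd or broadcast calls are involved.

```c
/* A candidate winner: global neuron index and its distance to the input. */
typedef struct {
    int    winner;
    double dist;
} Candidate;

/* Keep the candidate with the smaller distance; on a tie the current one
 * is kept, matching the test (*mydist > d) in gather_winner. */
static Candidate better(Candidate current, Candidate received)
{
    return (current.dist > received.dist) ? received : current;
}

/* Simulated gathering: local[k] is the local winner of processor k.
 * Walking from the last node down to node 1 guarantees that every node
 * has absorbed its children before it reports to its parent.
 * Node 0 (the root) ends up with the global winner. */
static Candidate gather_winner_sim(Candidate *local, int P)
{
    for (int k = P - 1; k > 0; k--) {
        int parent = (k - 1) / 2;
        local[parent] = better(local[parent], local[k]);
    }
    return local[0];
}
```

With local winners at distances 0.9, 0.4, 0.7, 0.1 and 0.5 on five processors, the root ends up with the candidate of the fourth processor.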
After gathering the winner, on each processor it is known which neuron is the winning one. 
Based on its index, the position in the KSOM can be computed as (idx%Kw,idx/Kw), 
and for all neurons on a processor, it can be decided whether their weights have to be 
updated by considering whether they are in the neighborhood of the winner. The network- 
decomposed KSOM algorithm is depicted below:
    void ksom ()
    {
        int i, winner;

        distribute_network(Kw, Kh, P);
        for (i = 0; i < p; i++) {      /* loop over the input patterns */
            broadcast ((char *)pattern, ninputs*sizeof(float));
            winner = FindWinner(pattern);
            ChangeWeights(pattern, winner);
        }
    }

    void ChangeWeights (float *pat, int w)
    {
        int i, j;
        float gauss_value;

        for (i = 0; i < nlocal_neurons; i++) {
            /* neighbourhood value of local neuron i with respect to winner w */
            gauss_value = in_neighbourhood(global_pos[i], w);
            if (gauss_value > 0.0)
                for (j = 0; j < ninputs; j++)
                    weights[i][j] += lrate*gauss_value*(pat[j]-weights[i][j]);
        }
    }

Algorithm 7.5: General algorithm for Kohonen network decomposition.
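The neighbourhood test used in ChangeWeights is not listed in this section. As a rough illustration, the sketch below converts global indices to map positions with (idx%Kw, idx/Kw) and applies a step ("bubble") neighbourhood. This is a deliberately simpler stand-in for the Gaussian value used in Algorithm 7.5; the function names and the radius parameter are assumptions.

```c
/* Map position of a neuron from its global index on a Kw-wide grid. */
static void map_position(int idx, int Kw, int *x, int *y)
{
    *x = idx % Kw;
    *y = idx / Kw;
}

/* Step neighbourhood (stand-in for the Gaussian): returns 1.0 if the
 * neuron lies within `radius` grid units (Euclidean) of the winner,
 * and 0.0 otherwise. */
static float in_neighbourhood_step(int idx, int winner, int Kw, int radius)
{
    int x, y, wx, wy;

    map_position(idx, Kw, &x, &y);
    map_position(winner, Kw, &wx, &wy);
    int dx = x - wx;
    int dy = y - wy;
    return (dx * dx + dy * dy <= radius * radius) ? 1.0f : 0.0f;
}
```

On a map of width 50, index 125 corresponds to position (25, 2); with radius 2, its direct neighbour 126 is inside the neighbourhood and neuron 275, three rows further down, is not.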
The required function kernels for the network-decomposed KSOM are similar to those
described in Section 6.4 and listed in Table 6.13. For dataset decomposition, the required
weight changes are computed and stored in a temporary array, costing time t_dw. The only
difference in Algorithm 7.5 is that the weight changes are applied directly, so
ChangeWeights costs t_cw = t_dw. The function kernels are therefore:
 kernel     NSC     GCEL      PX
 t_nei     24.95    20.77    3.37
 t_fnd      8.13     6.77    0.474
 t_cw      10.01     8.34    0.520
Table 7.9: Times for Kohonen function kernels (µseconds).
and the expected calculation time for the KSOM per pattern is based on Equations (6.5)
and (6.6):
    T_{calc}(P,n,w) = \frac{1}{P} \cdot \left( n \cdot t_{nei} + w \cdot \left( t_{fnd} + \frac{\pi}{16} \cdot t_{cw} \right) \right)    (7.15)
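Equation (7.15) can be evaluated directly from the kernel times of Table 7.9. The sketch below is an illustration under stated assumptions: the struct and function names are invented here, n is taken to be the number of neurons and w the number of weights, and the performance figure neglects communication entirely by dividing w connection updates by the calculation time in microseconds.

```c
/* Kernel times in microseconds (one column of Table 7.9). */
typedef struct {
    double t_nei;
    double t_fnd;
    double t_cw;
} Kernels;

/* Equation (7.15): calculation time per pattern (microseconds) on
 * P processors for a KSOM with n neurons and w weights. */
static double t_calc(int P, double n, double w, Kernels k)
{
    const double pi = 3.14159265358979323846;
    return (n * k.t_nei + w * (k.t_fnd + pi / 16.0 * k.t_cw)) / P;
}

/* Performance in MCUPS when communication is neglected
 * (an assumption of this sketch, not a formula from the text). */
static double mcups_no_comm(int P, double n, double w, Kernels k)
{
    return w / t_calc(P, n, w, k);
}
```

With the NSC column and a map of n = 2500 neurons and w = 15000 weights (the size obtained for a 6x50x50 network), t_calc for P = 1 is about 2.14e5 microseconds, and mcups_no_comm for P = 4 is roughly 0.28.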
7.10 Performance and speedup
As can be expected, the performance of transputer networks is high for Kohonen neural 
networks. The only communications are gathering of the winning neuron and broadcasting 
new patterns. The latter communications can be ignored if the number of patterns is small 
enough to store all patterns on each processor. The communication time per pattern is:
    T_{comm}(P,n,w) = T_{gab}(P,2) + T_{b}(N)    (7.16)
For a wide range of synthetic KSOM neural network sizes, measurements were taken on
the NSC, GCEL and PX. In all cases the model was accurate to within 10%, which is due
to the small amount of communication overhead. Below, the expected performance of
various KSOMs is depicted.
[Plots: panels (a) nsc, (b) gcel and (c) power; horizontal axis: nweights.]
Figure 7.12: Expected performance for NSC, GCEL and PX.
Note that these plots have characteristics similar to the performance plots depicted in
Chapter 6: a certain performance limit is reached at the point where the problem size is
big enough to neglect communication.
Table 7.10 below contains the speedup for NSC, GCEL and PX grids.
        NSC                                 GCEL                              PX
  Mw    P=4    P=8    P=16   P=32   P=64    P=64    P=128   P=256   P=512    P=4    P=8    P=16   P=32
 0.5    4.00   7.99  15.92  31.53  61.25   60.17  107.77  165.59  198.88    3.97   7.77  14.57  24.54
   1    4.00   7.99  15.95  31.70  62.25   61.55  114.54  190.80  257.88    3.98   7.85  15.06  26.82
   2    4.00   8.00  15.97  31.81  62.90   62.44  119.25  210.96  316.90    3.99   7.91  15.40  28.52
   3    4.00   8.00  15.98  31.86  63.16   62.81  121.22  220.16  348.45    3.99   7.93  15.53  29.26
   4    4.00   8.00  15.98  31.88  63.31   63.02  122.40  225.91  369.93    3.99   7.94  15.61  29.72
   5    4.00   8.00  15.98  31.90  63.40   63.15  123.13  229.61  384.55    3.99   7.95  15.66  30.01
   6    4.00   8.00  15.99  31.91  63.47   63.25  123.69  232.45  396.24    3.99   7.95  15.70  30.23
   7    4.00   8.00  15.99  31.92  63.52   63.32  124.08  234.46  404.76    3.99   7.96  15.73  30.38
   8    4.00   8.00  15.99  31.93  63.57   63.38  124.42  236.22  412.42    3.99   7.96  15.75  30.52
Table 7.10: Speedups.
Near linear speedups are achieved for the KSOM. Only for smaller neural network sizes or
for the PX are lower speedups estimated. This is because the computation/communication
ratio becomes smaller in these situations. Scalability and efficiency are not considered
here, because these speedup rates make it obvious that both are high.
7.11 Network decomposed Satdat
The neural network architecture for Satdat is a 6x50x50 KSOM network. The number of
connections required for this application is 15000, which is much smaller than the networks
used for the experiments described above. However, the performance and speedup estimated
with Equations (7.15) and (7.16) are still relatively good for smaller processor
networks:
 P                      4      8      16      32      64     128     256     512
 performance NSC      0.28   0.56    1.10    2.12    3.82    5.91    7.16    6.71
 speedup NSC          3.99   7.95   15.70   30.23   54.42   84.28  102.05   95.69
 performance GCEL     0.34   0.67    1.32    2.50    4.38    6.42    7.24    6.43
 speedup GCEL         3.99   7.94   15.61   29.72   51.99   76.17   85.93   76.34
 performance PX       3.45   6.60   11.56   16.81   18.61   16.17   12.27    8.83
 speedup PX           3.93   7.50   13.15   19.13   21.18   18.39   13.96   10.04
Table 7.11: Performance and speedup for Satdat.
For larger processor networks, the speedup limit is reached. The number of processors at
which the speedup limit is reached can be found, similarly to Equation (7.14), as:
a
b
P
n ■ Tnei + w ■ (tfnd + 7t/ 1 6  • tcw
2 *  '(2 t c o m m  ta c e )  2 -)- 2 • t c
■ a '
■N
(7.17)
giving the following results:
          P    speedup S(P)    performance (MCUPS)    efficiency E(P)
 NSC    295       103                7.2                   0.35
 GCEL   246        86                7.3                   0.35
 PX      58        21               21.2                   0.36
Table 7.12: Maximum speedup and corresponding performance and efficiency for Satdat.
7.12 Conclusions
The implementation aspects and measured performance results of two distinct neural
network algorithms are discussed in this chapter. It is pointed out that, in general,
network decomposition introduces communication overheads. Therefore, care has to be
taken to come up with an implementation that reduces communication. For backpropagation,
two techniques are introduced for this goal: 1) a new distributed gathering technique
using pipelining, and 2) an implementation of the backward pass which requires
communications in the order of O(n).
For the KSOM, the communication overhead is relatively small, resulting in near linear
speedup and high performance for large processor networks. For both backpropagation
and the KSOM it is observed that for smaller processor networks and smaller neural
networks, the performance drops and the efficiency is low. Because real applications
are, in general, of moderate size, it must be concluded that large processor networks
should not be used for network decomposition for such problems, in particular because
modern transputer systems (PX) have a low computation/communication ratio.
For the two real applications Satdat and Nettalk, it appears that the maximum performance
that can be reached is well within the available number of processors of each of the three
target platforms. Using Equations (7.17) and (7.14), this performance can be computed,
resulting in efficiencies between 0.3 and 0.5.
8 Neurosimulators
Outline
In the introduction of this thesis, a classification of users of neurosimulators
is given. Furthermore, the neurocomputing environment is introduced and
the activities with which users are occupied when doing neurocomputing
are identified. A neurosimulator is a toolbox containing tools for developing,
monitoring, manipulating, controlling and executing neural network simulation
programs. This chapter gives an overview of the features that most
neurosimulators have in common for providing these tools. The way in which
neurosimulators access data from a running neural network simulation program
is explained, and it is argued that implementations based on features such as
a network description language and hierarchical neural network data structures
are less efficient than what can be achieved with dedicated, tailor-made
implementations.
8.1 The neurocomputing environment
In order to arrive at a general classification of neurosimulators and to examine the support
that they offer for neurocomputing, this chapter lists the specific features that are
encountered in environments that use neural networks. Figure 8.1 depicts such a
neurosimulator environment.
Figure 8.1: A general neurosimulator environment.
The neurosimulator can receive input from devices like a CCD camera with frame grabber,
database management systems, a data generation program, or from file. It can send its
outputs to data post-processing tools, output devices like a robot controller or graphical
display, or to file. The neurosimulator can be controlled via a user-interface. The
neurosimulator can be either explicit, i.e., running as one program coupled to its surrounding
environment, or implicit, i.e., running as part of a larger program. Some explicit
neurosimulators offer possibilities to dump the code of a tuned neural network as a
stand-alone program, which can then be integrated as an implicit program part.
8.1.1 Environments: user perspective
The user groups involved in neurocomputing and the neurocomputing life cycle, as identified
in the introduction of this thesis, are listed in Table 8.1:
 user groups            phases in life cycle
 model builders         initiation
 tool builders          tuning
 applied researchers    testing
 end users              operation
Table 8.1: User groups and life cycle of neurocomputing.
A neural network is often considered a black box that receives inputs, adapts itself based
on features in the inputs, and produces outputs. In an operational phase, the black box
model is acceptable for users, in particular for end users and applied researchers. For a
neural network in operation, the user is not interested in what happens inside the black
box; his main effort goes into feeding data into the black box and retrieving data from it.
Important features of the neural network simulation in such a setup are that it is
integrated with data preprocessing, post-processing and/or data evaluation tools. For model
builders, however, what goes on inside the box may be extremely interesting, as one of
their goals is to understand the functioning of the brain. For applied researchers in the
tuning and testing phases, during which the adaptation process of the neural network takes
place, the black box concept is not sufficient either. Although there exist neural networks
which are more or less self-organizing, like ART [12], fieldnet [85] or Kohonen [58],
monitoring and controlling the adaptation process requires the ability to halt a running
session and to edit or view the network parameters.
When considering these aspects and the different user groups for neurosimulators, three 
environments can be distinguished:
1. Production environments. Such an environment in most cases contains implicit
neurosimulators. The neurosimulator is integrated in a larger workbench and carries
out one step in a range of processing stages. Note that before the end application is
implemented, the complete neurocomputing life cycle may have been run through. This
can be done by hired experts or by employees from the research department of
the company. In either case, if neurosimulators are used in production environments,
they must have a means to be integrated in the environment. This means
that it must be possible to input and output data in certain formats, or to extract
stand-alone neural network code which can be used in the end application.
2. Model research environments. Model builders impose more requirements on
neurosimulators. They have a lot of experience in building, tuning, evaluating and
applying neural networks, and want a neurosimulator to support them with these
tasks. Whereas in production environments dynamic I/O is one of the salient
features, in this environment inputs and outputs are obtained from relevant training
and testing datasets, or from artificial data. Knowledge from the data (through data
processing and statistical evaluation) can be built into the models or can be used in
some preprocessing stage.
3. Application research environments. These environments comprise the former two, as
the eventual goal is to exploit neural networks for a specific application, and before
this is achieved, a lot of experimentation has to be done.
Whereas in the first environment there will in general not be much experience with using
neural networks, in the other two environments there often is. Researchers may even
already have a number of existing implementations of a set of neural network
algorithms in their department. In order for them to accept a new neurosimulator, they
will require that their existing code can be incorporated in the new system
without too much effort.
8.1.2 Environments: neurosimulator perspective
In [80], Recce, Rocha and Treleaven give a taxonomy for neurosimulators. Three
classes of neurosimulators are distinguished based on the support that they offer:
application-oriented, algorithm-oriented and general programming systems. The
characteristics of each of the classes are determined by the intended user and scope of
the application.
Application-oriented systems are targeted at end products. They are used in production
environments, and often have the neurosimulator integrated as an implicit version into the
application. The end product is developed based on requirements specified by the end
user. The neural network part of the application is either implemented from scratch, by an
application-specific neurosimulator, or by a general-purpose neurosimulator. An
application-specific neurosimulator is developed specifically for the end user or for the
application area he is interested in. This means, for example, that it is equipped with
means for loading and storing data according to the formats used in the target environment.
General-purpose neurosimulators are commercially available packages, generally targeted at
PCs, like NeuralWorks, Nestor, and BrainMaker. Such neurosimulators provide an easy-to-use
user-interface via which a range of neural network models can be controlled.
Algorithm-oriented systems are neurosimulators dedicated to one specific neural network
model or neurosimulators containing algorithm libraries. Many algorithm-oriented
neurosimulators are targeted at multi-layered perceptrons¹. Others are targeted at, for
example, Kohonen networks [60, 59]. Because these systems are tailor-made, they generally
contain highly efficient implementations of the model or application that is supported.
Algorithm-oriented systems are used if a user knows or expects that his application can be
solved by a particular neural network. The disadvantage is that they are not very flexible;
difficulties may arise if they have to be integrated in the application's environment,
or if the user wants to make changes to the neural network algorithm.
¹ Even many general-purpose neurosimulators, though offering features to incorporate other
models, have only implemented the backpropagation network.
General programming systems have the advantage over application-oriented and
algorithm-oriented systems that they offer possibilities for researchers to implement new
models. They can be divided into development programming systems, open systems and
hardware-oriented systems. Neurosimulators belonging to the first category provide means to
construct neural networks using sets of library routines or using a neural network
description language (NDL). Programs using the library routines have to be linked with the
neurosimulator; programs specified in the NDL have to be interpreted, or compiled and run.
Both methods have the advantage that the neurosimulator may safely assume that the neural
network model is specified in a prescribed manner. This means that it can access the
datastructures, provide monitoring and manipulation tools, and use its own
model of execution and control. The second category, open programming systems, differs
from the previous one by offering the facility to be customized, for the benefit of the
user or of the application. Changing the environment can result in a customized
user-interface and/or integration with user-defined tools. The third category consists of
hardware-oriented systems, which are able to map a given neural network specification onto
a given hardware configuration. A comparison between three neurosimulators belonging to the
third class is given by Plonski in [77], where the Rochester Connectionist Simulator,
Genesis and Sfinx are compared. Plonski also distinguishes three classes of general
programming systems. The first class contains demonstration programs, which are dedicated
to one particular neural network paradigm to (briefly) demonstrate its behavior. The second
class consists of special purpose neurosimulators, intended for a broader range of
experimentation with a set of network models. The third class consists of simulation
environments, which differ from the previous two by supporting a number of network models
and by providing facilities to construct new models. In his paper, Plonski discusses the
typical features neurosimulators exhibit. An overview of these features is given in the
next section.
8.2 Features of neurosimulators
As stated before, a neurosimulator must be able to aid the user with the construction,
tuning, testing, integration in an environment, and execution of a neural network.
Depending on which class of neurosimulator is used, some of these tasks may be supported
more than others. The characteristic features that neurosimulators offer are listed in,
for example, [77], [80], [109] and [116]. Neurosimulators can have a graphical
user-interface, an algorithm library, support for building new models, application specific
tools, dedicated hardware accelerators, etc. In Figure 8.2, a general purpose neurosimulator
is depicted. It consists of tools to construct, visualize and manipulate, control and
execute a neural network executable. The executable may be connected to different tools
through a data interface, or it can be controlled by a (graphical) user-interface. The
following components can be identified:
o User-interface. Via a graphical user-interface, or sometimes a character based
command-line interface, the neurosimulator can be controlled. The user-interface supports
starting up actions such as learning or recalling a set of patterns. Furthermore, it
supports the visualization of neural network specific data, by starting up graphical
monitor or plot tools. In many cases, user commands issued via the user-interface can be
logged to a file, which can later be loaded to perform some kind of batch processing.
o Algorithm library. Neurosimulators often contain a parameterizable algorithm library,
consisting of various implementations of neural network models. These can be called from
the user-interface, but can also be used to develop application specific user programs.
Typical parameters for initiating the algorithms are architectural settings determining
the structure of the neural network, control parameters such as the number of iterations
to use for training, and learning parameters like the learning rate, determining the
speed of learning. Integrating algorithms from the library as stand-alone programs can
be supported by neurosimulators. In most cases this is done by dumping the code of
a tuned neural network simulation program, with static values assigned to the weights,
thresholds and other parameters, into a stand-alone program.
Figure 8.2: A general purpose neurosimulator.
o Support for building new neural network models. New neural network models can be
specified via library routines using a hierarchical neural network datastructure, or by
using a neural network description language. As each different implementation uses the
same kind of datastructures, the graphical user-interface is able to connect to objects
hidden inside the network. In this way, objects can be changed or monitored. Some
neurosimulators support graphical construction of neural networks by drawing icons
representing neurons or clusters, and arrows for connections between them.
o Network monitoring tools. Graphical monitoring tools initiated by the user-interface
allow the visualization of all kinds of information from the neural network. This could
be the weight values, neuron activations, parameter values like learning rates, etc. Often,
graphical plot tools are provided to monitor the network error during
training. Many neurosimulators offer the possibility to dump neural network data in
a specified format, which the user can transform to the formats required
by his specific tools. Others can dump data into standard formats
like PostScript², Matlab [66] or gnuplot [57].
o Dedicated target platforms. In order to cope with the processing and memory
requirements of today's neural network applications, a wide range of high performance
target platforms have been proposed and are being used. Some neurosimulators use parallel
processor systems (Tollenaere [100], Richards [82], Recce [80], Goddard [40]), some are
targeted at supercomputers (Ghosh and Wang [38]), others at dedicated neuro ASICs
(Theeten [99], Han [43]) or use heterogeneous architectures (Duranton [27]).
o Application specific tools and devices. For most experiments using neural networks,
the information produced by the tools mentioned above is adequate for making general
statements about how the network is performing during training. However, in order to
find out how well a network actually behaves after training, it has to be used in
production. Often, this involves coupling the network simulator to some I/O device or
controller, or feeding it with large amounts of data. As the data can have any
representation required by the application at hand, it has to be converted to the formats
used by the neurosimulator. Image processing applications, for example, require the
simulator to be coupled to graphical display programs. For information systems, it is
coupled to database management systems. For real-time applications, it has to be coupled to
the corresponding devices. Figure 8.1 in Section 8.1 depicts a number of possible tools
and devices to which a neurosimulator can be connected. Most neurosimulators have no
support for dynamic I/O, i.e., they have no means to receive inputs from preprocessing
tools or to send outputs to post-processing tools online. Integration in such cases means
that an explicit neurosimulator has to be transformed into an implicit one by extracting
stand-alone neural network code.
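The extraction of stand-alone code mentioned in the components above can be pictured with a small sketch of what such dumped code might look like: the tuned weights and thresholds are frozen as constants and only the recall pass remains. The network size, the values and all names below are invented for illustration and do not come from any particular neurosimulator.

```c
/* Hypothetical result of a "dump" step: a tuned network with its
 * weights and thresholds frozen as static constants. */
#define N_IN  2
#define N_OUT 1

static const double W[N_OUT][N_IN]   = { { 1.0, 1.0 } };
static const double THRESHOLD[N_OUT] = { 1.5 };

/* Recall only: the stand-alone program contains no training code.
 * Each output is 1 if its weighted input sum exceeds the threshold. */
static void net_recall(const double in[N_IN], int out[N_OUT])
{
    for (int o = 0; o < N_OUT; o++) {
        double sum = 0.0;
        for (int i = 0; i < N_IN; i++)
            sum += W[o][i] * in[i];
        out[o] = (sum > THRESHOLD[o]) ? 1 : 0;
    }
}
```

With these example values the single output unit fires only when both inputs are 1. An end application would call net_recall directly, without linking the neurosimulator.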
8.3 The traditional neurosimulator engine
Finding out how neural networks can be constructed and run, how data processing and
visualization tools can access neural network specific data, and how this data can be
manipulated, boils down to finding out about the internal datastructures and flow of control
of the neural network simulation program. In the traditional approach, there are three
ways in which neural network objects contained in a program are accessed. In the first, the
objects are completely hidden from the neural network experimenter. Via an NDL or a script
language, he is able to add, delete or manipulate neural network objects. To enable this,
the simulation kernel is built using a hierarchical datastructure, e.g. comprising
structures or classes like network, neuron, connections and activation functions. The second
is a more open approach, where the user may program his own applications, using library
routines to manipulate the neural network datastructures. These routines (or macro
definitions) provide an encapsulation of the general neural network datastructures for the
user. The last is a completely open approach, where a neural network programmer has full
access to the datastructures.
² PostScript is a trademark of Adobe Systems, Inc.
8.3.1 The general neural network datastructure
In the black-box model, a neural network transforms inputs into outputs, possibly while
adapting itself based on the inputs. Black boxes can be nested, i.e. a black box may
contain several other black boxes, each representing a specific part of the neural network.
For example, a three-layer backpropagation network could be described by a hierarchical
structure of a network, three layers, several neurons and connections. The black box network
contains three black boxes, each representing a layer. Each layer may contain an arbitrary
number of neuron black boxes, and connections may be defined between layers and between
individual neurons. Note that in this hierarchical approach, the concept of connecting, for
example, two layers may be implemented by fully connecting the outputs of the first layer to
the inputs of the second layer. In a simulation of such a conceptual neural network model,
each black box definition can be a structure (or class), containing local and global function
pointers to routines which implement its behavior, and global and local data pointers and
variables for storing and retrieving information.
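    typedef struct {
        type_1 variable_1;      /* local variables */
        ...
        type_m variable_m;
        type_1 data_1;          /* local data pointers */
        ...
        type_n data_n;
        type_1 function_1;      /* local function pointers */
        ...
        type_o function_o;
    } BlackBox;
    (external arrows in the figure: global data, variables and functions)
Figure 8.3: The general black box concept.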
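The nested black box idea can be made concrete with a small C sketch. It is an illustration only: the field names, the two behaviour routines and the way children are chained are assumptions made here, not the layout of any of the neurosimulators discussed.

```c
#include <stddef.h>

typedef struct BlackBox BlackBox;

struct BlackBox {
    double     output;      /* local variable */
    void      *data;        /* local data pointer (user-defined contents) */
    double   (*behave)(BlackBox *self, double input);  /* local function pointer */
    BlackBox **children;    /* nested black boxes, if any */
    size_t     nchildren;
};

/* Behaviour of a leaf box: multiply the input by a gain stored in data. */
static double scale_by_data(BlackBox *self, double input)
{
    double gain = *(double *)self->data;
    self->output = gain * input;
    return self->output;
}

/* Behaviour of a composite box: pass the input through all children in order. */
static double run_children(BlackBox *self, double input)
{
    double x = input;
    for (size_t i = 0; i < self->nchildren; i++)
        x = self->children[i]->behave(self->children[i], x);
    self->output = x;
    return x;
}
```

A composite box holding two leaves with gains 2 and 3 maps an input of 1.5 to 9. Every call goes through a function pointer and every data access through a void pointer, which is the price paid for generality.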
For example, consider the neuron model of the Rochester Connectionist Simulator [40].
Each neuron is called a unit. Units can have a number of sites, each site handling a number
of incoming links. An excerpt of the unit model, its datastructures and structure
specifications (from [40]) is depicted below³.
³ The fields in the datastructure do not precisely reflect the names specified in [40].
[Diagram: units UNIT 1 and UNIT 2, sites Site 1 to Site 5, and links L1 to L6.]
    typedef struct {
        Potential;
        State;
        Output;
        *Data;
        *Neuron_Function;
        *Site_List;
    } Unit;
Figure 8.4: The RCS unit model and unit datastructure.
Units are accessed via an index in an array of units. A neuron's input is determined based
on the values of its sites. Each unit has a linked list of sites. Incoming connections (links)
are implemented as a linked list of links, which is stored by each individual site. The
RCS provides a set of command interface routines with which a programmer can build a
neural network from the command line, like:
    MakeUnit type function
    AddSite index type function
    MakeLink index_from index_to value data function
However, this is a time-consuming process, and it is advised to build neural networks in a
C program. For this purpose, an extensive library of routines that allocate, build and connect
neural network objects is provided. Routines with similar syntax like the ones above exist. 
Furthermore, a number of syntactic convenience routines can be used to access the neural 
network. Given a unit with index idx, for example its output, state or data fields can be 
requested or updated via:
GetOutput(idx); GetState(idx); GetData(idx);
SetOutput(idx,value); SetState(idx,value); SetData(idx,value);
A new neuron, site or link can be constructed by implementing programmer-defined
functions. Each of these 'black boxes' has a field Data, which is in fact a void pointer
and can be used to contain any information. This implies that if newly defined functions
use this data, the data for the unit also has to be allocated when the black box is
allocated (e.g. via MakeUnit).
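The unit, site and link organisation described above can be sketched as follows. This is not the RCS datastructure or API: the type names, the fields and the choice to sum the link contributions per site and then over the sites are assumptions made for illustration.

```c
#include <stddef.h>

/* One incoming connection: weight and index of the sending unit. */
typedef struct Link {
    double       weight;
    int          from;
    struct Link *next;
} Link;

/* A site owns a linked list of incoming links. */
typedef struct Site {
    Link        *links;
    struct Site *next;
} Site;

/* A unit owns a linked list of sites; units live in an array. */
typedef struct Unit {
    double output;
    Site  *sites;
} Unit;

/* Value of one site: weighted sum over its incoming links (assumed rule). */
static double site_value(const Site *s, const Unit *units)
{
    double sum = 0.0;
    for (const Link *l = s->links; l != NULL; l = l->next)
        sum += l->weight * units[l->from].output;
    return sum;
}

/* Net input of unit idx: sum of the values of all its sites (assumed rule). */
static double unit_input(const Unit *units, int idx)
{
    double sum = 0.0;
    for (const Site *s = units[idx].sites; s != NULL; s = s->next)
        sum += site_value(s, units);
    return sum;
}
```

Computing one input value thus means walking two nested linked lists and dereferencing the unit array for every link, which illustrates why such general structures cost more than a plain weight matrix.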
The EENS [62] neurosimulator developed at our department has a conceptual model similar
to that of the RCS. Its neuron black box is depicted in Figure 8.5.
[Diagram: a neuron with global input parameters, global neuron parameters and global
output parameters, each accompanied by local parameters.]
Figure 8.5: The EENS neuron model.
The Pygmalion [2] neurosimulator and many others, like Aspirin/MIGRAINES [63]
and Genesis [122], are also built on a general hierarchical neural network datastructure.
Pygmalion datastructures contain objects like system, network, layer, cluster, neuron and
connections. Macros are provided for accessing data, like n_0_2_3_6.0 , which produces a
neuron's state value:
    sys->net[0]->layer[2]->cluster[3]->neuron[6]->state[0]
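To show what such an access path involves, the sketch below builds a comparable hierarchy in C and wraps the pointer chain in a macro. The struct layouts and the macro name NEURON_STATE are hypothetical; they mimic the shape of the expression above rather than Pygmalion's actual definitions.

```c
/* Hypothetical hierarchy mirroring system/network/layer/cluster/neuron. */
typedef struct { double   *state;   } Neuron;
typedef struct { Neuron  **neuron;  } Cluster;
typedef struct { Cluster **cluster; } Layer;
typedef struct { Layer   **layer;   } Network;
typedef struct { Network **net;     } System;

/* Access macro: expands to the full pointer chain, so every use
 * performs one dependent memory load per level of the hierarchy. */
#define NEURON_STATE(sys, n, l, c, u, s) \
    ((sys)->net[n]->layer[l]->cluster[c]->neuron[u]->state[s])
```

The macro can be used both to read and to assign a state value, but each use follows five pointers before the value itself is reached.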
Genesis has a biologically very plausible model of neural networks. Consider for example 
how some of the fields of a unit in Genesis are defined:
typedef struct {
    char *name;
    int index;
    struct object_type *object;
    Element *parent;
    Element *child;
    Element *next;
    double state;
    float Rm;
    float Cm;
    float Em;
    float inject;
} unit_type;
In Aspirin/MIGRAINES, a neural network is constructed from what is actually called a 
black box. A black box can be defined by a DefineBlackBox statement, which requires a 
description of its constituents:
DefineBlackBox <name> {
    OutputLayer     /* where to put output data */
    InputFilter     /* C function producing net input */
    OutputFilter    /* C function producing output */
    Components      /* list of layers containing neurons with connections */
}
8.3.2 Access and control of neural network data
Because each neurosimulator is built on this concept of hierarchical datastructures, 
accessing a value of a neural network boils down to following the pointers in the structures. 
The main advantage of the black box models described above is that they are general 
purpose. Pointers to attach any user-defined data or functions are available. Furthermore, 
as the same datastructures are used for each neural network, the neurosimulator is able 
to access them and, for example, start up visualization tools for monitoring (parts of) the 
network. The disadvantages are that programmers have to stick to the syntax of the 
datastructures and associated macro definitions and library routines, and that this model 
uses a rather inefficient flow of control. For example, consider Algorithm 8.1. It implements 
a simplified neuron black box model and training algorithm for the backpropagation 
network. Each neuron object has a number of properties, defined in a C structure. It 
contains variables maintaining its status, like delta and activation. It contains parameters 
denoting the number of inputs and outputs, and the indexes of the first neuron from which 
it receives inputs and the first neuron to which it sends outputs. Finally, it holds pointers 
to its incoming connections and to three functions which implement the computation of a 
neuron's input and activation values and the training mechanism.
typedef struct {
    float netinp;
    float bias;
    float activation;
    float delta;
    int ninputs;
    int first_input;
    int noutputs;
    int first_output;
    float *connections;
    void (*input_function)();
    void (*output_function)();
    void (*train_function)();
} Neuron;
(a) The neuron object properties.
Pattern *pat;
Neuron *neurons;
int npatterns, nneurons;

void epoch() {
    int i, p;
    for (p = 0; p < npatterns; p++) {
        for (i = 0; i < nneurons; i++) { /* forward pass */
            neurons[i].input_function(&neurons[i], pat[p]);
            neurons[i].output_function(&neurons[i], pat[p]);
        }
        for (i = nneurons - 1; i > 0; i--) /* backward pass */
            neurons[i].train_function(&neurons[i], pat[p]);
    }
}
(b) The basic training algorithm.
Algorithm 8.1: Backpropagation using hierarchical datastructure.
In this black box model, the flow of control is ruled by the routine epoch(), which knows 
the number of neurons and which functions to call. Running this algorithm is in most 
cases less efficient than running a dedicated implementation of the backpropagation 
neural network. The first reason is the flow of control. For each neuron, 
a number of function calls are required. Especially for a large number of neurons, this 
introduces overhead compared to running dedicated in-line code. Note that this overhead 
increases for more complex models like the RCS or EENS model described above. The 
second reason is the way in which data is accessed. The more references in a hierarchical 
datastructure have to be followed to access, e.g., an activation or weight value, the more 
time is required.
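The two styles can be contrasted in a small C sketch. SketchNeuron, linear_output, forward_generic and forward_dedicated are hypothetical names, and the trivial linear activation is an assumption made only to keep the example short. Both routines compute the same values; the generic one needs one indirect call and one structure access per neuron, whereas the dedicated one is a plain loop over flat arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced neuron black box. */
typedef struct SketchNeuron SketchNeuron;
struct SketchNeuron {
    float netinp;
    float activation;
    void (*output_function)(SketchNeuron *);
};

/* Placeholder activation, assumed here for brevity. */
static void linear_output(SketchNeuron *n)
{
    n->activation = 2.0f * n->netinp;
}

/* Generic engine: one indirect function call per neuron. */
static void forward_generic(SketchNeuron *neurons, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        assert(neurons[i].output_function != NULL);
        neurons[i].output_function(&neurons[i]);
    }
}

/* Dedicated code: the same computation in-line on flat arrays. */
static void forward_dedicated(const float *netinp, float *activation, size_t n)
{
    for (size_t i = 0; i < n; i++)
        activation[i] = 2.0f * netinp[i];
}
```

The sketch makes no claim about the size of the difference; that depends on the compiler, the machine and the complexity of the black box model.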
8.3.3 The neural network description language
Neurosimulators like Pygmalion, Genesis and Aspirin/MIGRAINES provide an NDL for 
specifying neural networks. Pygmalion provides the high-level language nC and an 
intermediate-level language nC-code. Aspirin/MIGRAINES has defined the language 
Aspirin, which can be used in close combination with the MIGRAINES interface. Genesis 
provides shell-like commands via which neural network objects can be defined, connected 
and executed. The commands can be stored in a shell script which can be run in batch mode.
A network description language offers flexibility combined with encapsulation of 
programming efforts. With the NDL, new neural network models can be built from 
predefined modules or customized library routines (e.g. programmed in C) contained in an 
algorithm library. An NDL script can either be interpreted (Genesis) or be compiled into 
executable code (Pygmalion, Aspirin/MIGRAINES). The former method results in slow 
execution. The latter offers the advantage that several compilers can build code for diverse 
target platforms. For example, the goal of Pygmalion was to have compilers that, based on 
an NDL and a description of the target machine configuration, produce code for 
transputer-based multi-processor systems.
The underlying datastructures are the same as discussed in section 8.3.1. This implies that 
using an NDL is not as efficient as using dedicated code. This holds especially if the goal is 
to automatically decompose a neural network program over a parallel processor architecture. 
This problem can be translated to the problem of mapping two connected 
graphs onto each other, for which no general optimal scheme exists.
8.3.4 The graphical user-interface
As mentioned before, via the user-interface the user can control a running neural network 
simulation program. This comprises initiating, halting and continuing actions like loading 
patterns, training and recall. It also includes monitoring and changing parameters 
associated with these actions.
An important observation on existing neurosimulators and their environments was made 
in [116]. The amount of code actually needed to implement a neural 
network's dynamics is far less than the amount needed for implementing the user-interface 
mechanism for controlling the network. In some cases the code required for the 
user-interface may take up more than 90% of the total code. This holds especially if the 
neurosimulator also provides graphical monitoring and debugging tools.
8.3.5 I/O and neurosimulators
A final common feature of the neurosimulators discussed here is I/O. A running neural 
network simulation can read its input data from, and write its output data to, files or 
dynamic I/O channels. For example, consider Figure 8.1, which depicts such a setup. 
Most neurosimulators have defined one or more pattern formats for loading and storing 
data to be processed. In general, such a format contains a header describing the number of 
patterns, the number of input features per pattern and, if supervised data is concerned, 
the number of output classes. The data is generally organized in input/target pairs. 
For example, the patterns for the XOR problem would be specified as:
4 2 1
0.1 0.1 0.1
0.1 0.9 0.9
0.9 0.1 0.9
0.9 0.9 0.1
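A pattern format of this kind can be read with a few lines of C. The following is a minimal sketch under stated assumptions: PatternSet, parse_patterns, read_floats, free_patterns and the limit SKETCH_MAX_COUNT are hypothetical names, the header is assumed to be "number of patterns, number of inputs, number of outputs", and the text is parsed from a memory buffer rather than from a file to keep the example self-contained.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MAX_COUNT 4096L   /* arbitrary sanity limit for this sketch */

typedef struct {
    long npatterns, ninputs, noutputs;
    float *inputs;    /* npatterns * ninputs values, pattern by pattern */
    float *targets;   /* npatterns * noutputs values, pattern by pattern */
} PatternSet;

static void free_patterns(PatternSet *ps)
{
    free(ps->inputs);
    free(ps->targets);
    ps->inputs = ps->targets = NULL;
}

/* Read n whitespace-separated numbers from *text; returns 0 on success. */
static int read_floats(const char **text, float *dst, long n)
{
    for (long i = 0; i < n; i++) {
        char *end;
        dst[i] = (float)strtod(*text, &end);
        if (end == *text)
            return -1;            /* no number found: truncated input */
        *text = end;
    }
    return 0;
}

/* Parse the header followed by npatterns input/target pairs. */
static int parse_patterns(const char *text, PatternSet *ps)
{
    long *hdr[3];
    assert(text != NULL && ps != NULL);
    hdr[0] = &ps->npatterns;
    hdr[1] = &ps->ninputs;
    hdr[2] = &ps->noutputs;
    ps->inputs = ps->targets = NULL;
    for (int k = 0; k < 3; k++) {
        char *end;
        *hdr[k] = strtol(text, &end, 10);
        if (end == text || *hdr[k] <= 0 || *hdr[k] > SKETCH_MAX_COUNT)
            return -1;
        text = end;
    }
    ps->inputs = malloc(sizeof(float) * (size_t)ps->npatterns * (size_t)ps->ninputs);
    ps->targets = malloc(sizeof(float) * (size_t)ps->npatterns * (size_t)ps->noutputs);
    if (ps->inputs == NULL || ps->targets == NULL) {
        free_patterns(ps);
        return -1;
    }
    for (long p = 0; p < ps->npatterns; p++) {
        if (read_floats(&text, ps->inputs + p * ps->ninputs, ps->ninputs) != 0 ||
            read_floats(&text, ps->targets + p * ps->noutputs, ps->noutputs) != 0) {
            free_patterns(ps);
            return -1;
        }
    }
    return 0;
}
```

Storing inputs and targets in two flat arrays is a design choice of the sketch, not a property of any particular neurosimulator.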
For certain applications, specific requirements may exist which cannot be met via such 
pattern files. In particular, applications in sensor/motor fusion and vision require the 
neural network to input data from, or output data to, hardware devices. Current 
neurosimulators offer the possibility to add hooks to so-called input functions or input 
routines, as explained in the previous sections. Programmers can add their 
application-specific I/O routines, written in, e.g., C, to implement the I/O. For each 
execution of an input or output routine, the execution mechanism of the neurosimulator 
checks whether some programmer-defined routine has to be called. Upon calling such a 
routine, the neural network is supplied with the corresponding input data or outputs the 
corresponding output data.
Similar considerations apply to routines for control and visualization of a neural 
network simulation. General purpose neurosimulators have utilities to view the contents 
of neural network datastructures such as the activation of units and the weight values. If 
the visualization of data requires a different layout, specialized visualization routines can 
be attached to the corresponding data structures.
8.4 Conclusions
Considering the observations discussed in the previous sections, it can be concluded that:
1. General purpose neurosimulators are built on the concept of a hierarchical data 
structure.
2. Using this structure, a user/programmer can exploit algorithm libraries, an NDL, or 
dedicated (C) code to add or modify part of a neural network simulation program.
3. This means that a user/programmer has to adhere to a predefined syntax and data 
structures, which in general is not as efficient as customized code.
4. Application-specific I/O interfaces and neural network simulation code may also suffer 
from this situation.
5. On the other hand, general purpose tools may be designed based on these data 
structures.
6. Research and applications of neural network simulations require intensive access 
to the simulation code. Changing the model or adding I/O-specific routines is an 
activity with which users of neurosimulators are heavily occupied.
7. It is also noted that the control over a neural network simulation is relatively constant. 
A limited set of actions, such as loading a network and data and starting up training 
sessions, is to be controlled via the user-interface.
8. Furthermore, the vast majority of the code required for neurosimulators is occupied 
by the code required for the user-interface.
9. And finally, neurosimulators require a means to be coupled to the outside world: to 
a user-interface for controlling the simulation, to input data provided by input tools, 
and to output data needed by output tools.
9 An action-oriented neurosimulator
Outline
In this chapter, the design and implementation of a so-called action-oriented 
neurosimulator is discussed. The notion of actions and associated objects and 
attributes is presented and, based on this, the idea of program descriptions is 
introduced. Using these concepts, a set of interface definitions is specified via 
which the manager of the neurosimulator environment (called CONVIS) can access 
data and can rule the flow of control of a running neural network simulation 
program. The neurosimulator PREENS consists of this manager CONVIS and a set 
of implemented tools and neural network algorithms. PREENS is introduced here 
and, as an example, its application to remotely sensed satellite imagery is 
described.
9.1 Objects and attributes of actions
It was argued in Chapter 8 that only a limited number of actions are carried out when 
using a neurosimulator for simulating artificial neural networks. The actions 
are concerned with loading data, with training, recall, and operation. Actions contain 
components, which can be objects or attributes [116]. These can be further divided 
into parameters, options and settings for installing the initial configuration of an action, 
and variables and data whose values can change during execution of an action.
o Objects associated with an action represent parameters, variables and data that exist 
in the implementation of an artificial neural network. In most cases they represent 
real objects of the neural network, such as activations, thresholds and weights. They 
may also represent data structures required for implementing the neural network, 
like input/target patterns and derived components like a confusion matrix¹ or 
a classification result. In other cases, objects may be more related to the program 
simulating the neural network. Examples of the latter case are parameters like the 
name of a file in which training patterns or weights and activations of a neural network 
are stored, or program variables like the current pattern or iteration number.
o Attributes are settings and options. They are required for specifying the flow of 
control of a neural network simulation program. Settings can be considered as radio 
buttons or boolean values. Whether a setting is on or off indicates whether a part of 
the program has to be executed or not. Options are program variables representing 
one out of several alternatives. They can be considered as switch buttons switching 
between a number of states of which only one may be active.
Consider, for example, a training session of the multi-layered perceptron with momentum 
learning rule and sigmoid activation function [83]. Such a session would be implemented 
in an action learn. Parameters that rule the computation of weight error derivatives 
are the learning rate η and the momentum α. Other parameters can be the number of 
iterations or the error criterion after which training has to stop. The error generated by 
the neural network at a certain time can be expressed as the sum of squared differences 
between target and output activations. This error is a scalar variable that can change 
over time. Non-scalar values that can change over time are called data. Data contained 
in the neural network are, for example, activation and threshold values of the neurons, and 
the values of weights. During training, certain modes can be on or off. For example, it 
can be decided whether the neural network components have to be saved during training 
or not. The attribute that represents such a decision is called a setting. Giving 
the setting autosave the value on indicates that saving the net during training is 
required. Finally, certain options can exist that have one out of several values. Activating 
a neuron can be performed via, e.g., a binary threshold function, the sigmoid function, or 
the tanh function. This can be specified by an option called activation_function.
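The parameters and the scalar error variable of such a learn action can be related to code as follows. This is a minimal sketch that assumes the common textbook form of the momentum rule, Δw(t) = −η ∂E/∂w + α Δw(t−1); the exact formulation used in [83] may differ in scaling. The function names sum_squared_error and momentum_update are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Variable 'error': sum of squared differences between target and output. */
static double sum_squared_error(const double *target, const double *output, size_t n)
{
    double error = 0.0;
    for (size_t i = 0; i < n; i++) {
        double diff = target[i] - output[i];
        error += diff * diff;
    }
    return error;
}

/*
 * One weight update with momentum (assumed textbook form).
 * w       : data 'weights', updated in place
 * dw_prev : previous weight changes, updated in place
 * grad    : weight error derivatives dE/dw
 * lrate   : parameter 'learning rate' (eta)
 * momentum: parameter 'momentum' (alpha)
 */
static void momentum_update(double *w, double *dw_prev, const double *grad,
                            size_t n, double lrate, double momentum)
{
    assert(lrate >= 0.0 && momentum >= 0.0);
    for (size_t i = 0; i < n; i++) {
        double dw = -lrate * grad[i] + momentum * dw_prev[i];
        w[i] += dw;
        dw_prev[i] = dw;
    }
}
```

In the terminology of this section, lrate and momentum are parameters, the returned error is a variable, and the arrays w and dw_prev are data.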
¹ Confusion matrices will be explained in more detail later in this chapter.
For the objects and attributes of the action learn described here, the following identities 
can be listed:
parameters   learning rate, momentum, nr of iterations, error criterion
variables    net error, current pattern, current epoch
data         activations, thresholds, weights
options      activation function
settings     epoch training, autosave
Table 9.1: Sample objects and attributes for an action learn.
As another example, consider an action load_network. A parameter for initiating this 
action can be the name of a file from which to load a network's weights, thresholds or 
activation values. If no previously stored neural network has to be loaded, but a new 
network has to be initialized, parameters can be the minimum and maximum values between 
which weights have to be initialized. The option indicating whether a new or an existing 
network has to be loaded could be called new/from file. Parameters determining the 
architecture of the network are the number of layers and, for each layer, its number of 
neurons. Data are the weights, activations and thresholds. Note that for this action, no 
variables or settings are required.
Throughout this chapter, further definitions and examples of actions will be given. On 
some occasions, these will be accompanied by explanations of how they can be controlled 
via the graphical user-interface of CONVIS. The CONVIS main window is depicted in 
Figure 9.1. Actions are divided into the categories I/O (e.g., load_network, save_network, 
load_patterns), learn, recall and classify. The latter actions correspond with the 
tuning, testing and operation phases in the neurocomputing life cycle. For actions that do 
not fit in this concept, the miscellaneous category is provided.
Figure 9.1: CONVIS main window. It contains a pull-down menu called I/O, and it contains 
buttons (learn, recall, classify, miscellaneous, info) via which control for an action can be 
activated. Upon activating an action, the so-called action control window pops up (see 
Figure 9.2).
9.2 Actions and program descriptions
The observations made in the previous section hold for all neural network simulation 
programs. Each action implemented for a program can be described via parameters, variables,
data, settings, and options. By describing each program's actions, the complete 
program can be described. Different neural network models may have different objects and 
attributes for an action. The momentum parameter for backpropagation does not exist 
in the Kohonen neural network. Vice versa, the range of the neighborhood function is a 
parameter that is not used in the backpropagation neural network. So whereas different 
neural network programs implement the same actions, it is the specification of their actions 
that may differ.
A program description describes the actions implemented for a certain neural network 
simulation program. Each action is described by specifying its components; its objects and 
attributes. The syntax of a program description is depicted below:
program_description: actions
actions:             empty | action; actions
action:              'action' name runmode components
components:          empty | component; components
component:           object | attribute
object:              parameter | variable | data
attribute:           option | setting
In PREENS, each neural network simulation program, say sim.c, has to be specified by a 
program description, say sim.descr. Section 9.3 describes how a running simulation 
program from the PREENS algorithm library (say sim) and CONVIS interact. In 
an initialization phase, CONVIS interrogates sim for the actions it contains. The simulation 
program parses its program description and builds up a data structure containing the 
specification of its actions. When parsing a program description, for each action 
encountered, an entry in an array of CAction structures is allocated. Such an entry contains 
the following fields:
typedef struct {
    char name[C_NAMELENGTH];  /* name of the action, e.g., 'learn', 'recall' */
    int id;                   /* identifying the action, index in array */
    int runmode;              /* tells if action is interruptible or not */
    int nparameters;
    CParameter *parameters;
    int nvariables;
    CVariable *variables;
    int ndata;
    CData *data;
    int noptions;
    COption *options;
    int nsettings;
    CSettings *settings;
} CAction;
Control over an action can be instantiated via the CONVIS main window by clicking on the 
action or selecting it from the pull-down menu (see Figure 9.1). The action control window 
is then activated, via which the components of an action can be selected and the execution 
of an action can be controlled.
[Action control window 'learn': component buttons (parameters) (variables) (data) (options) 
(settings); control buttons (run) (halt) (continue) (cancel) (quit)]
Figure 9.2: Action control window for the action learn. It contains buttons via which 
control can be initiated for its components. And it may contain control buttons via which 
an action can be started, halted, continued, canceled and quit.
An action is interruptible if it can be halted during its execution. For some 
actions it does not make sense to interrupt them, for example for an action save_network. 
In the program description of an action, runmode specifies whether the action is 
interruptible or not. For actions that are not interruptible, only two control 
buttons are generated: the run and quit buttons. Note that CONVIS dynamically 
installs itself based on the program description. This feature will be described in 
more detail later in this chapter.
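How a user-interface could derive its control buttons from runmode can be sketched in C. The function control_buttons is hypothetical and is not taken from CONVIS; it only encodes the rule stated above: five control buttons for an interruptible action, and only run and quit otherwise.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: fills 'buttons' with the labels of the control
 * buttons to generate for an action and returns how many were written
 * (never more than 'cap').
 */
static size_t control_buttons(int interruptible, const char **buttons, size_t cap)
{
    static const char *const all[] = { "run", "halt", "continue", "cancel", "quit" };
    static const char *const minimal[] = { "run", "quit" };
    const char *const *src = interruptible ? all : minimal;
    size_t n = interruptible ? 5 : 2;

    assert(buttons != NULL);
    if (n > cap)
        n = cap;
    for (size_t i = 0; i < n; i++)
        buttons[i] = src[i];
    return n;
}
```

Because the button set is computed from the parsed description rather than hard-coded per program, the same user-interface code can serve every simulation program.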
In the next sections, the data structures of all components are defined, and it is explained 
how a component must be specified in a program description.
9.2.1 Parameters
In a program description, any number of parameters can be specified for an action. For 
each parameter, an entry in the array parameters is allocated for storing the parameter 
descriptions. The data structure for a parameter is given below:
typedef struct {
    char name[C_NAMELENGTH];      /* e.g. 'niterations', 'lrate' */
    int id;                       /* index in array 'parameters' */
    int type;                     /* e.g., int, float, double, string */
    double value;
    double min;
    double max;
    int editable;                 /* hint for CONVIS */
    char string[C_STRINGLENGTH];  /* if type is string, contains the string */
    int size;                     /* # of bytes the parameter occupies */
} CParameter;
When clicking the button parameters in the action control window, the parameter control 
window pops up:
parameters 'learn'
    nr epochs          1000       (edit) (set)
    error criterium    0.050000   (edit) (set)
    noise percentage   0          (edit) (set)
    save file          save       (edit) (set)
    save cycles        1000       (edit) (set)
    (cancel) (ok)
Figure 9.3: Parameter control window of CONVIS.
Via this window, values of parameters can be edited and viewed. When clicking the 
edit button, a window pops up via which more fields of the corresponding parameter can 
be viewed or edited (see Figure 9.4). For each field, it can be specified in the program 
description whether it can be edited (like the min or max fields) or just viewed (like the 
type, name or identification field). The syntax for specifying a parameter in a program 
description is given below.
parameter:    parameter name type value min max editable string_value
name:         string
type:         C-type | file | string
value:        C-type
min:          C-type
max:          C-type
editable:     boolean
string_value: string
A C-type is any type in the C programming language, such as short, float and double. 
Furthermore, the types bit and byte are contained in a C-type. A string can be any 
sequence of characters between quotes. The difference between type string and type file is 
that for the latter, CONVIS adds a mechanism to start up a file browser, as depicted in 
Figure 9.4. For other types, this is not the case:
[edit 'save file': name = save file; identification = 3; type = FILE_TYPE; string = save; 
buttons (select file) (cancel) (ok)]
[edit 'nr epochs': name = nr epochs; identification = 0; type = INT_TYPE; value = 1000; 
minimum = 0; maximum = 10000000; buttons (cancel) (ok)]
Figure 9.4: Edit parameter windows of CONVIS. When clicking select file, a file browser 
pops up.
9.2.2 Variables
Variables and data may change in time, and their corresponding data structures contain 
fields for maintaining information associated with monitoring the changing values. 
This is explained in detail in section 9.3. The items that have to be specified for a variable 
in a program description are similar to those for a parameter, but variables cannot have 
type string or file:
variable: variable name type value min max editable time_id
The time_id field is a string (which may be empty) indicating the unit in which the 
change of the variable is measured, e.g., second or iteration number. The specification of, 
e.g., a variable called error following this syntax is:
variable { "error" double 0.0 0.0 MAXDOUBLE 1 "epoch" }
9.2.3 Data
Variables can hold only one value, as they represent scalar objects. If more than one value 
is used within an object, it is considered as data. The CData structure that defines the 
data concept in PREENS is given below:
typedef struct {
void *dataptr; /* pointer to place data is stored */
CData_Description description; /* describing the data */
int allocated; /* is data allocated or not? */
} CData;
Data can be structured in any way a programmer wants. However, it was argued that the 
more data is structured in, e.g., linked lists or hierarchical data structures, the more time 
is required for accessing its individual elements. Especially in neural network simulations, 
a high number of relatively simple operations have to be performed on vector or matrix 
elements. Based on these observations, the CData structure was developed. It contains a 
pointer to the place where the data is stored. Furthermore, it contains a description that 
specifies how the data is structured. The data description allows data to be structured in 
multi-dimensional arrays, thus supporting vectors and matrices in particular. Up to 
four-dimensional arrays of data vectors are allowed, covering the majority of data structures 
encountered in neural network simulation programs. Fields in the CData_Description 
structure that are used to describe the structure of data are:
typedef struct {
    int dimension;  /* dimension of the data, either 1,2,3 or 4 */
    int depth[4];   /* for each dimension, the nr of data elements */
    int length;     /* the number of features per data element */
    int flat;       /* indicates whether data is allocated consecutive, or
                       more-dimensional */
} CData_Description;
For example, consider a set of n input patterns, where each pattern has m elements. 
Two ways of arranging a data structure containing the n×m elements are depicted in 
Figures 9.5(a) and 9.5(b). A two-dimensional array containing element vectors of length l 
is depicted in Figure 9.5(c):
(a) One-dimensional flat n×m array. (b) One-dimensional non-flat n×m array. 
(c) Two-dimensional n×m array with elements of length l.
Figure 9.5: Arrangement of three data structures.
In Table 9.2, the way such data structures are described is given:
field       Figure 9.5(a)   Figure 9.5(b)   Figure 9.5(c)
dimension   1               1               2
depth[0]    n               n               n
depth[1]    -               -               m
length      m               m               l
flat        yes             no              no
Table 9.2: Description of data depicted in Figure 9.5.
In general, data is allocated dynamically. Before running an action, all the data it uses has 
to be allocated. As a program description is a static description of a program, in general 
the structural dimensions of data are not specified.
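To make the role of the description fields concrete, the following C sketch computes the total number of values and the position of one value in a flat (consecutively allocated) block. The helpers cdata_count and cdata_flat_offset are hypothetical, and the row-major ordering of the dimensions is an assumption of this sketch; the structure itself repeats the fields listed above.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int dimension;  /* dimension of the data, either 1, 2, 3 or 4 */
    int depth[4];   /* for each dimension, the number of data elements */
    int length;     /* the number of features per data element */
    int flat;       /* non-zero if the data is allocated consecutively */
} CData_Description;

/* Total number of scalar values described (hypothetical helper). */
static size_t cdata_count(const CData_Description *d)
{
    size_t n = (size_t)d->length;
    assert(d->dimension >= 1 && d->dimension <= 4);
    for (int i = 0; i < d->dimension; i++)
        n *= (size_t)d->depth[i];
    return n;
}

/*
 * Offset of feature 'feature' of the element at index idx[0..dimension-1]
 * in a flat block, assuming row-major order (hypothetical helper).
 */
static size_t cdata_flat_offset(const CData_Description *d, const int *idx, int feature)
{
    size_t off = 0;
    assert(d->flat);
    assert(d->dimension >= 1 && d->dimension <= 4);
    assert(feature >= 0 && feature < d->length);
    for (int i = 0; i < d->dimension; i++) {
        assert(idx[i] >= 0 && idx[i] < d->depth[i]);
        off = off * (size_t)d->depth[i] + (size_t)idx[i];
    }
    return off * (size_t)d->length + (size_t)feature;
}
```

With a flat layout, a single multiplication and addition per dimension replaces the chain of pointer dereferences that a hierarchical data structure would need.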
data:         data name type struct_descr monitorable time_id
struct_descr: dim '[' depth ']' length flat
The data from Figure 9.5(c) could be specified as:
data { "input pattern" float 2 [m n] 1 0 "epoch" } or 
data { "input pattern" float [] 0 "epoch" }
In the second specification, the description of its structural dimensions (the dim, depth, 
length and flat fields) is determined at runtime.
9.2.4 Options and settings
Options must be used if one out of a number of possibilities has to be selected. For example, 
if an action learn has implemented several learning methods, one of these can be chosen. 
Or if an action load_patterns allows loading of either training data, data for testing, or 
data to be classified, an option which patterns can be defined, which has state values 
{0, 1, 2}. These values could be represented at the user-interface via the strings "load 
training patterns", "load test patterns", "load classify patterns". Such an 
option is used to indicate which out of three possible pattern sets has to be loaded. The 
syntax for specifying an option in a program description is:
option: option name nstates state state_names
In this definition, state_names contains, for each of the nstates states, the string 
representing that state. For the example of determining which set of patterns has to be 
loaded, the following is a valid specification:
option { "which patterns" 3 0 "load training patterns",
         "load test patterns", "load classify patterns" }
The option control window from CONVIS is depicted in Figure 9.6:
[options 'load patterns': pull-down menu "which patterns" with the entries load training 
patterns (selected), load test patterns, load classify patterns; buttons (cancel) (ok)]
Figure 9.6: Option control window of CONVIS.
For many of the actions that exist in neural network simulation programs, initiation often 
not only involves initializing some parameter values, but it can also require setting some 
boolean values. In CONVIS, these are called settings. The specification of a setting is done 
via the following syntax:
setting: setting name state
9.2.5 An example: specification of an action learn
As an example, consider the specification of an action called learn. In this example, the 
action implements a backpropagation training algorithm.
action "learn" { 1
    #           name            type    value  min  max   editable  string_value
    parameter { "nr epochs"     int     1000   0    MAX   1         "" }
    parameter { "error_crit"    double  0.05   0    1000  1         "" }
    parameter { "save file"     file    4      0    0     1         "bp.net" }
    parameter { "save cycles"   int     1000   0    MAX   1         "" }
    parameter { "momentum"      double  0.1    0    10    1         "" }
    parameter { "lrate"         double  0.05   0    1     1         "" }
    variable  { "pattern"       int     0      0    MAX   1         "epoch" }
    variable  { "epoch"         int     0      0    MAX   1         "epoch" }
    variable  { "error"         double  1      0    1     1         "epoch" }
    #      name                      type   dim  monitorable  timeid
    data { "weights"                 float  []   1            "" }
    data { "activations"             float  []   1            "" }
    data { "biases"                  float  []   1            "epoch" }
    data { "train input patterns"    float  []   1            "" }
    data { "train target patterns"   float  []   1            "" }
    data { "test input patterns"     float  []   1            "" }
    data { "test target patterns"    float  []   1            "" }
    data { "learn classification"    int    []   1            "epoch" }
    data { "recall classification"   int    []   1            "epoch" }
    data { "learn confusion"         int    []   1            "" }
    data { "recall confusion"        int    []   1            "" }
    option { "update" 2 0 "every pattern" "every epoch" }
    option { "pattern sequence" 2 0 "random" "round robin" }
    option { "activation function" 2 0 "sigmoid" "tanh" }
    setting { "classify while learning" 0 }
    setting { "show confusion matrix" 0 }
    setting { "autosave" 0 }
}
Parameters for this action are the error criterion and the number of epochs after which 
training has to stop, the name of the file to use for saving a network after every save 
cycles epochs, the momentum determining the influence of previous weight updates and 
the learning rate determining the speed of learning. Variables are the current pattern and 
epoch number, and the error generated by the network for all patterns.
Each implemented neural network from the PREENS algorithm library is able to classify a 
train or test data set during the training process. Datastructures to store the associated 
data are the input and target patterns, the resulting classifications and confusion matrices. 
Via the options, it can be decided to perform weight updates after each pattern, or once 
per epoch. Patterns can be submitted in random order or in a cyclic way. And one out 
of two activation functions may be selected. The booleans indicating whether the network 
has to be saved, whether classification of train or test patterns has to be performed or 
whether a confusion matrix has to be computed can be set via the settings.
When parsing such a specification of an action, a neural network simulation program builds 
up the CAction data structure. In the next section it will be explained how CONVIS and 
associated tools use this concept to request and update objects and attributes 
from a running simulation program.
9.3 PREENS interface definitions
PREENS is a neurosimulator comprising three components:
1. An (expandable) set of executable neural network simulation programs, using the 
concept of actions and program descriptions explained in the previous section.
2. An (expandable) set of tools for visualization and monitoring of neural network objects, 
and for dynamic I/O.
3. The graphical user-interface CONVIS, managing a running neural network simulation 
program and a set of running tools. Via CONVIS, it is possible to request and update 
objects and attributes from the simulation program.
PREENS was introduced in the introduction of this thesis. For convenience, the global 
architecture of PREENS is depicted again here:
Figure 9.7: The PREENS neurosimulator environment.
In this environment, a simulation program, CONVIS, and tools run as separate processes
coupled via a communication interface using TCP/IP. Note that any of these can be run
on different machines, with a different architecture. For example, CONVIS can run on a 
X-Windows workstation, the simulation program on a parallel processor system, and tools 
on other machines in a heterogeneous computer network. Values of parameters, settings 
and options can be requested and updated by CONVIS (if they are specified in the program 
description). The same can be done with values of data and variables, but furthermore, 
data and variables can also be “coupled” to input and output tools. As an example, 
consider the monitoring of the error generated by the neural network for a test set during 
training.
Via the “edit variable window”, a monitoring tool can be coupled to the variable called 
error as follows (see also Figure 9.8):
edit 'error'
name            error
identification  5
type            DOUBLE_TYPE
value           1.000000
default         1.000000
minimum         0.000000
maximum         1.000000
time            0.000000
previousupdate  0.000000
delta           0.000000
do monitor      OFF
monitor         -1
input tool      NOT CONNECTED
timeid          epoch
(connect) (cancel) (ok)

select variable monitor
variable monitor  monitor_variable
user              louis
host              poseidon
(manual) (cancel) (ok)
Figure 9.8: The edit variable window and monitor variable selection window. Via the
pull-down menu button connect, the latter window can be activated. In this window, the tool
and the machine on which it has to be run can be selected.
When a tool is coupled to a variable or data object, some fields in the corresponding data 
structure are set. These fields are:
double prev_update; /* time of previous update of the object */
double delta_t; /* time amount after which object has to be updated */
int inputtool; /* identification of input tool coupled to it */
int monitor; /* identification of output tool coupled to it */
The field delta_t indicates after how many time steps the object has to be monitored.
For example, if a large test set has to be classified during training, this will take a
considerable amount of time, whereas it is often not required to compute the classification at
each time step. Setting delta_t to, say, 10 indicates that the classification has to be
computed after every 10 steps.
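The effect of the update interval can be sketched in a few lines of C. This is only an illustration: the struct name MonitoredObject and the helper functions are assumptions, and only the field names are taken from the listing above.

```c
#include <assert.h>

/* Hypothetical subset of the fields listed above; the struct name is an
   assumption made for this sketch. */
typedef struct {
    double time;        /* current time of the object, e.g. the epoch number */
    double prev_update; /* time of previous update of the object */
    double delta_t;     /* time amount after which object has to be updated */
    int    do_monitor;  /* nonzero if an output tool is coupled to the object */
} MonitoredObject;

/* Returns 1 if the object is due for monitoring at its current time. */
int monitor_due(const MonitoredObject *o)
{
    return o->do_monitor && (o->time - o->prev_update >= o->delta_t);
}

/* Advances the object over nsteps time steps of size 1 and counts how
   often it would be sent to its output tool. */
int count_updates(MonitoredObject *o, int nsteps)
{
    int step, sent = 0;

    assert(nsteps >= 0);
    for (step = 0; step < nsteps; step++) {
        o->time += 1.0;
        if (monitor_due(o)) {
            o->prev_update = o->time;  /* remember when it was last sent */
            sent++;
        }
    }
    return sent;
}
```

With delta_t set to 10, an object that is advanced over 100 time steps is sent to its tool 10 times instead of 100 times.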
CONVIS maintains an internal process administration of all running tools. Via the fields
inputtool and monitor (which is an output tool), it can get to the required items like
the communication channel via which the tool is connected, and the machine type of the
machine it is running on.
9.3.1 Interface between CONVIS and a simulation program
In the subsequent text, we call a programmer someone who implements neural network
simulation programs or tools and wants to interface them to CONVIS. A user is someone
who uses PREENS. In order to couple a simulation program to CONVIS, a programmer has
to perform the following steps:
1. Specify the program description of the program.
2. Write a main program sim.c:
/* specification of actions here */
int main (int argc, char **argv)
{
    FillInSimActions(argc,argv,actions);
    SetUpCommunications();
    ConvisMainLoop(argc,argv);
    return 0;
}
The array actions contains a specification of the actions implemented by the program.
This array must match the program description.
3. For each action in the program description, implement it following the instructions
below.
In the ConvisMainLoop, sim waits for commands from CONVIS. Four kinds of commands
can be issued: 1) request, 2) update, 3) control and 4) interrupt. CONVIS can send
requests for the number of actions and their names, the number of parameters or variables
of an action, and for each component of an action, its constituents. For each component
of an action, CONVIS can send updates. Control commands are RUN_ACTION and
CONTINUE_ACTION. Interrupt commands are issued if an event at CONVIS has occurred during
the execution of an action. Events are that a user has hit a halt or cancel button, or that
data or a variable from an input tool was received by CONVIS. The simulation program
can also generate interrupts, i.e. it can indicate that some data or a variable is ready to
be monitored, or that an action has finished.
9.3.2 The action control protocol
Associated with any command are the identification of the action that is controlled, the
identification of the component of an action, and information about the contents of any data to
follow. In the interface between CONVIS and a simulation program, a command is defined
as:
typedef struct {
int command; /* request, update, control, interrupt */
int action_id; /* action identification */
int component_id; /* component identification */
int size; /* size of any data to follow */
double value; /* to be used for various goals */
} CCommand;
The simulation program can be in two states. When it is running, only interrupt commands
are sent by CONVIS. In this state, a RUN or CONTINUE command was previously received
from CONVIS, and the ConvisMainLoop determined the action_id to start up the
associated action. If an action is not running, the simulation program is in ConvisMainLoop,
waiting for request, update, or control commands.
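The two states can be summarized in a small sketch. The enum names and values, and the functions command_expected and next_state, are assumptions introduced here for illustration; only the CCommand layout is taken from the text. For simplicity, the sketch treats every interrupt from CONVIS as a halt, although an interrupt may also announce data from an input tool.

```c
#include <assert.h>

/* Command kinds named in the text; the numeric values are assumptions. */
enum { CMD_REQUEST, CMD_UPDATE, CMD_CONTROL, CMD_INTERRUPT };

/* The two states of the simulation program described above. */
enum { STATE_WAITING, STATE_RUNNING };

typedef struct {
    int command;       /* request, update, control, interrupt */
    int action_id;     /* action identification */
    int component_id;  /* component identification */
    int size;          /* size of any data to follow */
    double value;      /* to be used for various goals */
} CCommand;

/* Returns 1 if a command of this kind is expected in the given state. */
int command_expected(int state, const CCommand *c)
{
    assert(c != 0);
    if (state == STATE_RUNNING)
        return c->command == CMD_INTERRUPT;
    return c->command == CMD_REQUEST
        || c->command == CMD_UPDATE
        || c->command == CMD_CONTROL;
}

/* Returns the state after handling a command. A control command starts an
   action; an interrupt is assumed here to be a halt that ends it.
   Unexpected commands leave the state unchanged. */
int next_state(int state, const CCommand *c)
{
    if (!command_expected(state, c))
        return state;
    if (c->command == CMD_CONTROL)
        return STATE_RUNNING;
    if (c->command == CMD_INTERRUPT)
        return STATE_WAITING;
    return state;
}
```

In this sketch, a request that arrives while an action is running is simply not expected, which mirrors the rule that CONVIS only sends interrupts in that state.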
The routine action_control() performs four tasks: 1) it detects whether any I/O
interrupts arrive, 2) it detects whether it was interrupted by a halt or cancel command, 3)
it adjusts the time of each of its variables and data objects, and 4) it checks whether any
variables or data have to be monitored.
9.3.3 Accessing components of an action
Until now, interfacing a simulation program with CONVIS has been very simple. Very
little has to be changed in the code of existing programs. If programs are written from
scratch, the specification of actions with corresponding components can even help in the
process of software design. However, the most difficult part of the interfacing procedure is
to allow CONVIS to access the neural network attributes. As mentioned before, the actions
are contained in an array of CAction. For each request or update, a component can be
accessed as:
actions[action_id].component[component_id].value
However, in most neural network simulation programs, objects are accessed by name. 
Consider for example changing the learning rate, or some weight value:
lrate = new_value;
weight[i][j] += lrate*delta[i]*activation[j];
This would be implemented as:
actions[LEARN].parameters[LRATE].value = new_value;
((double **)actions[LEARN].data[WEIGHT].dataptr)[i][j] +=
    actions[LEARN].parameters[LRATE].value
    * ((double *)actions[LEARN].data[DELTA].dataptr)[i]
    * ((double *)actions[LEARN].data[ACTIVATION].dataptr)[j];
Obviously, this straightforward implementation introduces a lot of overhead. Selecting
and indexing in hierarchical data structures can be expensive, as stated before. To
allow a programmer to write his programs as he is used to, and to restrict the
effort required to interface a program to CONVIS, some means has to be defined
which maps a name to some data or variable in the array of actions.
A set of convenience macro definitions is made available for these purposes:
#define p_val(type,id,ix) ((type)actions[id].parameters[ix].value)
#define v_val(type,id,ix) ((type)actions[id].variables[ix].value)
#define v_adr(id,ix) (actions[id].variables[ix].value)
#define o_val(id,ix) (actions[id].options[ix].state)
#define s_val(id,ix) (actions[id].settings.states[ix])
The code below depicts how these macros can be used in an action learn:
int learn (CAction *action, int run)
{
    int p;
    int epoch = v_val(int,LEARN,EPOCH);
    double error = v_val(double,LEARN,ERROR);
    int nepochs = p_val(int,LEARN,NEPOCHS);
    int npatterns = p_val(int,LEARN,NPATTERNS);
    double errcrit = p_val(double,LEARN,ERRCRIT);
    if (!start(run,0,action))
        return 0;
    while (action_control(action) && epoch<nepochs && error>errcrit) {
        error = 0.0;
        for (p=0;p<npatterns;p++)
            error += train_pattern(...);
        v_adr(LEARN,EPOCH) = ++epoch; /* make new values available for CONVIS */
        v_adr(LEARN,ERROR) = error;
    }
    return finish(action,error<=errcrit);
}
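To see the macros at work without the rest of PREENS, the following self-contained sketch uses a strongly reduced stand-in for CAction. The struct layout, the index names and the function toy_learn are assumptions made for illustration; the dummy loop simply halves the error each epoch instead of training a network.

```c
#include <assert.h>

/* Minimal stand-ins for the PREENS structures. The real CAction has more
   fields; this layout is an assumption made for illustration. */
typedef struct { double value; } CComponent;
typedef struct {
    CComponent parameters[4];
    CComponent variables[4];
} CAction;

enum { LEARN };             /* action identifications */
enum { NEPOCHS, ERRCRIT };  /* parameter indices */
enum { EPOCH, ERROR };      /* variable indices */

static CAction actions[1];

#define p_val(type,id,ix) ((type)actions[id].parameters[ix].value)
#define v_val(type,id,ix) ((type)actions[id].variables[ix].value)
#define v_adr(id,ix)      (actions[id].variables[ix].value)

/* Dummy learning loop: the error halves every epoch. Returns the number
   of epochs after which the loop stopped. */
int toy_learn(void)
{
    int    epoch   = v_val(int, LEARN, EPOCH);
    double error   = v_val(double, LEARN, ERROR);
    int    nepochs = p_val(int, LEARN, NEPOCHS);
    double errcrit = p_val(double, LEARN, ERRCRIT);

    assert(nepochs >= 0);
    while (epoch < nepochs && error > errcrit) {
        error *= 0.5;
        v_adr(LEARN, EPOCH) = ++epoch;  /* publish the new values */
        v_adr(LEARN, ERROR) = error;
    }
    return epoch;
}
```

Note that v_adr expands to an lvalue, which is why it can appear on the left-hand side of an assignment, whereas p_val and v_val include a cast and are only used for reading.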
9.3.4 Accessing data
Data can be accessed via the field dataptr. This is a void pointer to any data defined
by a programmer. Using the CData_Description, it can be found out how the data is
arranged. A convenience routine AssignData() is provided for a programmer to specify
how his data is arranged:
int AssignData (void *dataptr, int id, int ix, int dim
    , int d0, int d1, int d2, int d3, int length, int flat)
{
    CData_Description *d = &actions[id].data[ix].description;
    actions[id].data[ix].dataptr = dataptr;
    d->dim = dim;
    d->d[0] = d0; d->d[1] = d1; d->d[2] = d2; d->d[3] = d3;
    d->length = length;
    d->flat = flat;
    return 0;
}
As an example of how programmer-defined data can be described, consider the computation
of a confusion matrix. During recall, a neural network computes a certain output for each
input it receives. A confusion matrix is a way of validating how well the network performs
on a set of so-called supervised data, i.e. a set of patterns for which it is known in advance
which output the network has to produce. Each row and column of the matrix contains
one entry for each of the possible classes contained in the data set. The percentage of
patterns belonging to class i which the network has classified as class j is contained in
matrix[i][j]. In order for a tool to compute the confusion matrix, it has to know for each
pattern p the known class class_p and the computed class computed_p. Assume that a
programmer wants to equip his program with a data structure confusion_matrix, which
contains two arrays of integers, one containing class_p, the other computed_p. In his
program, both arrays are computed as:
int confusion_matrix[NPATTERNS][2];
for (p=0;p<NPATTERNS;p++) {
confusion_matrix[p][0] = targets [p]; 
confusion_matrix[p][1] = compute_class(inputs[p]);
}
In order to tell CONVIS how the data is structured, the programmer has to specify the
confusion matrix in the program description, and use AssignData() to specify how the
data is arranged:
data { "confusion_matrix" int [] 1 "epoch" } /* in program description */
AssignData((void *) confusion_matrix, LEARN, CONFUSION_MATRIX, 1
    ,2,0,0,0,NPATTERNS,0);
Once this is done, the programmer can refer to the confusion matrix by name, while at
any moment, the data can be accessed as specified by its CData_Description.
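As an illustration of what a tool might do with the (class, computed) pairs once it has received them, the sketch below turns them into a matrix of percentages. It is not the code of the PREENS confusion matrix tool: the function name, the fixed NCLASSES, and the choice to normalize each row by the number of patterns in that class are assumptions.

```c
#include <assert.h>

#define NCLASSES 3   /* hypothetical number of classes */

/* Fills matrix[i][j] with the percentage of patterns of class i that were
   classified as class j. pairs[p][0] holds the known class of pattern p,
   pairs[p][1] the class computed by the network. Rows of classes without
   patterns are set to zero. */
void confusion_percentages(int pairs[][2], int npatterns,
                           double matrix[NCLASSES][NCLASSES])
{
    int count[NCLASSES][NCLASSES] = {{0}};
    int total[NCLASSES] = {0};
    int p, i, j;

    for (p = 0; p < npatterns; p++) {
        int target = pairs[p][0];
        int computed = pairs[p][1];
        assert(target >= 0 && target < NCLASSES);
        assert(computed >= 0 && computed < NCLASSES);
        count[target][computed]++;
        total[target]++;
    }
    for (i = 0; i < NCLASSES; i++)
        for (j = 0; j < NCLASSES; j++)
            matrix[i][j] = total[i] ? 100.0 * count[i][j] / total[i] : 0.0;
}
```

The diagonal of the resulting matrix holds the percentage of correctly classified patterns per class.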
9.3.5 Exotic or distributed data
If a program contains data structured other than multi-dimensional arrays, the mechanism 
described above cannot be used. This also holds for data distributed over e.g., a network 
of workstations or a transputer network. This is because the CData_Description is not 
suited for specifying such data. A method is designed to be able to cope with this problem.
The description of data contains a field data_format, which can indicate 1) the PREENS
format, 2) USER_DEFINED and 3) DISTRIBUTED. In the first case, the interface as described
above can be used. In the other two cases, where the data has some "exotic" format, the
data has to be transformed to the PREENS format and vice versa. Figure 9.9 depicts how this
is done.
Figure 9.9: Transforming exotic data to PREENS and vice versa.
If data has to be monitored by some output tool, the simulation program sim first checks
if the data has to be transformed to the PREENS format. After transforming, it can be
communicated with CONVIS via the communication interface. If an output tool receives data
from CONVIS, it checks if the data must be transformed to its exotic format.
In the graduation work of Eric Boon [9], this concept is used for the implementation of 
an interface to a multi-transputer system. The communication primitives discussed in 
Chapter 5 are used as a foundation for this parallel interface. For dataset decomposed 
neural networks, no transformation is required, as on each processor the same copy of the 
neural network is present. After each epoch, an updated neural network is available at 
the master processor, so request commands can be satisfied through communication to 
the master only. Interrupts, control, and update commands are broadcast through the 
transputer network.
For network decomposed neural networks, two situations may exist: 1) the amount of data
requested or to be updated is small enough to be held on one processor, or 2) the amount
cannot be stored on one processor. In the first case, for request commands, a distributed
data structure is gathered at the master processor, after which it is sent to CONVIS using the
normal procedure. For updates, the data is sent to the master processor and subsequently
distributed over the processor network.
For the second case, in [9] some modifications are made to the interface between CONVIS
and the parallel interface. To be able to select part of a large amount of data, a module
called distribute is added. This module builds up an administration of the distributed
data, based on the PREENS format. This means that given a CData_Description and its
distributed administration, the module is able to determine where a certain part of the data
is located. Using the concept worked out by Boon, instead of requesting or updating a
complete data structure, now only part of the data can be accessed using indices and offsets
in the struct descr field of the CData_Description. However, only preliminary tests have
been carried out, and this aspect of PREENS must be developed and tested more elaborately.
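The kind of bookkeeping such an administration needs can be illustrated with a small sketch. It is not the distribute module of [9]: the Location type, the function locate, and the assumption that the data is split into contiguous blocks of nearly equal size are all hypothetical.

```c
#include <assert.h>

/* Location of one element of a distributed array (hypothetical type). */
typedef struct {
    int processor;  /* processor that holds the element */
    int offset;     /* index of the element in that processor's local part */
} Location;

/* Assumes that length elements are split over nproc processors in
   contiguous blocks, where the first (length % nproc) processors hold one
   element more than the others. */
Location locate(int index, int length, int nproc)
{
    Location loc;
    int base, extra, boundary;

    assert(nproc > 0 && index >= 0 && index < length);
    base = length / nproc;          /* minimum block size */
    extra = length % nproc;         /* number of blocks with base + 1 elements */
    boundary = extra * (base + 1);  /* first index held by a smaller block */
    if (index < boundary) {
        loc.processor = index / (base + 1);
        loc.offset = index % (base + 1);
    } else {
        loc.processor = extra + (index - boundary) / base;
        loc.offset = (index - boundary) % base;
    }
    return loc;
}
```

For 10 elements on 4 processors, the blocks hold elements 0-2, 3-5, 6-7 and 8-9, so element 9 is found on processor 3 at local offset 1.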
9.3.6 Interface between CONVIS and tools
Tools can be coupled to variables and data, using the mechanism as depicted in Figure 9.8.
The process administration maintained by CONVIS contains for each tool the
communication channels via which it is connected to CONVIS. The structure describing a
variable or data contains the identification of the tool which is coupled to it. As explained
in the action control protocol (Section 9.3.2), the simulation program sim checks for each
variable and data object whether it has to be monitored:
if (v[i].do_monitor && v[i].time-v[i].prev_update >= v[i].delta_t)
    SendVariable(&v[i]);
CONVIS installs an interrupt handler and detects that a variable has to be monitored. 
Subsequently, it sends the variable to the tool determined from its process administration. 
All output tools can be implemented using the following set up:
#include <preens_convis.h>
int main (int argc, char *argv[])
{
    fd = setup_client_connection(argv[1],atoi(argv[2]));
    ReceiveVariable(&v,fd);
    InitializeTool(&v);
    while (ReceiveVariable(&v,fd))
        HandleVariable(&v);
}
The current set of tools implemented is given in Table 9.3. They are described elaborately
in [115].
confusion_matrix_plotter - For graphical visualization of confusion matrices.
convis_matlab - Interface between CONVIS and matlab [66].
convis_xv - Interface between CONVIS and xv [10].
data_tool - A general data evaluation tool, containing basic statistical
analysis features, like standard deviation, min, mean, max.
kohonen_projection - For visualizing the classification results of a Kohonen map.
monitor_variable - Monitoring values of a variable per time step.
monitor_data - Monitoring values of individual data elements per time step.
plot_image - Visualizing images.
save_data - Tool to save data in PREENS format.
Table 9.3: List of tools contained in PREENS.
9.4 An example: training remotely sensed data
In [89, 87], the application of neural networks to the classification of remotely sensed
imagery is introduced. At the Institute for Remote Sensing Applications of the JRC
in Ispra, Italy, research is carried out on the automated classification of satellite images.
Various types of imagery are available, like radar (SAR), Landsat-TM and Spot. The goal
is to use these images to classify each pixel's ground cover class. The work described here
was performed in close cooperation with Ron Schoenmakers from the JRC.
9.4.1 Initiation phase
In [87], the classification performance of a neural network for combined six-band
Landsat-TM and one-band ERS-1/SAR PRI imagery from the same scene is presented. From
different combinations of the data, either raw, segmented or filtered, training and test sets
are created using the available ground truth polygons. The training sets are used for learning
while the test sets are used for verification of the network. The combination of two types
of imagery offers the possibility to test their influence on the classification accuracy. From
the scene used, the Lisbon area in Portugal, two images are available: one Landsat-TM image
recorded June 24, 1991 and one ERS-1/SAR PRI image from March 21, 1992. The imagery
is geo-referenced and re-sampled into pixels with 25 meter ground resolution. The error
caused by the geo-referencing is within one pixel. The image size is 2518 columns by 2363
rows. The original 16-bit per pixel ERS-1 image is scaled to one byte per pixel. From the
Lisbon area, ground truth data (collected in June 1991) are available for 9 classes:
class                  train   test
1 sand                   825    798
2 grassland             3358   3165
3 water                 4726   3878
4 cereals                369     73
5 forest                 719     71
6 urban                  353    298
7 vineyard               168     33
8 marshland              356    123
9 aquatic vegetation     262    259
Table 9.4: The 9 ground truth classes, with the number of pixels per class for
the train set and test set.
The ground truth data is divided into a training set (11163 pixels) and a test
set (8698 pixels). The satellite images contain a number of bands, each band
representing a range in the visible, reflected infra-red (IR), thermal IR or microwave portions
of the electro-magnetic spectrum. Using the raw one-band radar, the six-band optical and
the seven-band combined data, processed by segmentation and filtering techniques, eight
different data sets are produced:
1 1-band SAR, un-filtered
2 1-band SAR, filtered
3 6-band optical, raw
4 6-band optical, segmented
5 7-band optical + SAR, raw
6 7-band optical + SAR, filtered
7 7-band optical + SAR, segmented
8 7-band optical + SAR, filtered and segmented
Table 9.5: The eight data sets used for this experiment.
The segmentation method used [88] is a combination of an improved edge detection method
and a region growing method. It is beyond the scope of this thesis to explain more about this
process, but the idea is that through segmentation, neighboring pixels having similar spectral
values are clustered in one segment. In the segmented data sets used here, the average of each
band of a segment is assigned to all pixels it contains. So all pixels in one segment have
identical spectral values. The data extracted from the SAR image is speckle filtered by the
method described in [73]. For the 1-band SAR data no segmentation was performed, as
the average classification performance (ACP) was low for the speckle filtered image, and
segmentation was not expected to improve the ACP significantly.
In the initiation phase, the task to be performed, the data sets, and the neural network
model to be used are identified. The first two items are described above. Using PREENS,
several tests were performed with neural networks from the PREENS algorithm library:
artmap - Implementation of several ART networks by Parcival Willems [120].
backprop - Implementation of backpropagation neural network with momentum training.
boltzmann - Implementation of a variant of the Boltzmann neural network as described in [55].
fieldnet - Implementation of the fieldnet neural network as described in [85].
kohonen - Implementation of the Kohonen SOM.
counterprop - Implementation of the counterpropagation neural network [44].
Table 9.6: Neural networks from the PREENS algorithm library.
From both the training sets and the test sets, random selections were made of 500 patterns.
Using these sets, PREENS was used to quickly evaluate the classification performance of each
of the neural networks in the algorithm library. The backpropagation neural network
outperformed the rest. It was already shown in [53, 52] that the multi-layered perceptron can
be used for this application. The tool that was particularly used for this initial examination
was the confusion matrix monitoring tool. Both the classification performance for the
training set and the test set can be monitored during training using this tool (see Figure
9.10).
Figure 9.10: Values of diagonal of confusion matrix with average classification results. The
left image is the diagonal for the training set, the right image is the diagonal for the test
set.
In general, classification performance in remote sensing is expressed as the number of pixels
correctly classified, divided by the total number of pixels [11]. In [87], for the evaluation of
classification performance, the averages over classes are used. This was done to eliminate 
differences between the number of pixels available per class. For example, a neural network 
may classify a scene of pixels containing water with high percentage correct, whereas it is 
unable to classify vineyards. Because of the unequal distribution of pixels between these 
classes (see Table 9.4), the overall average classification performance would be biased by 
the large amount of pixels in the class of water.
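The difference between the two measures can be made concrete with a short sketch. The function names are assumptions; the two definitions follow the text: overall accuracy divides all correctly classified pixels by all pixels, whereas the average over classes first computes the accuracy per class.

```c
#include <assert.h>

/* Overall accuracy: correctly classified pixels divided by all pixels. */
double overall_accuracy(const int *correct, const int *total, int nclasses)
{
    long sum_correct = 0, sum_total = 0;
    int c;

    for (c = 0; c < nclasses; c++) {
        assert(correct[c] >= 0 && correct[c] <= total[c]);
        sum_correct += correct[c];
        sum_total += total[c];
    }
    return sum_total ? (double)sum_correct / sum_total : 0.0;
}

/* Average over classes: mean of the per-class accuracies, so each class
   counts equally. Classes without pixels are skipped. */
double average_class_performance(const int *correct, const int *total,
                                 int nclasses)
{
    double sum = 0.0;
    int c, used = 0;

    for (c = 0; c < nclasses; c++) {
        if (total[c] > 0) {
            sum += (double)correct[c] / total[c];
            used++;
        }
    }
    return used ? sum / used : 0.0;
}
```

As a hypothetical case, take only the water and vineyard test pixels from Table 9.4 (3878 and 33) and a network that classifies all water pixels correctly and no vineyard pixel: the overall accuracy is still above 99%, whereas the average over the two classes is 50%.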
9.4.2 Tuning and testing phases
In the tuning phase, the optimal initial parameters for the backpropagation neural network
(backprop), and its architecture are determined. Numerous experiments using PREENS
were performed, training neural networks with the different training sets and comparing 
the classification results of the test sets. The following observations were made:
1. Small initial weight values are required, with $w_{ij} \in [-0.15, 0.15]$.
2. The learning rate must be low, $\epsilon = 0.001$, which conforms to the rule of thumb
$\epsilon \approx 10/p$, with p the number of patterns. For higher learning rates, the network
fluctuates heavily between different weight states.
3. Training must be online, i.e. for each pattern, the weights of backprop are updated.
For off-line training, no networks were able to learn the training set. Apparently,
updating weights per epoch averages the weight changes and the network gets easily
trapped in local minima. With online training, this effect is avoided.
4. The tanh activation function gives better results than the traditional sigmoid.
Furthermore, the networks converge faster. The reason why this is the case is not known.
5. Scale the patterns within a small interval, say [-0.9, 0.9]. For both the sigmoid and
tanh activation functions, the active domain is near zero. This means that in that
range, small changes in the net-input will result in relatively large changes in the
activation output. If the patterns are not scaled, i.e., they contain pixel values in the
range [0, 255], the net-input of the activation functions is in the domain in
which they are not very active.
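Observations 2 and 5 translate directly into a few lines of C. The function names are assumptions made for this sketch; the interval [-0.9, 0.9], the pixel range [0, 255] and the rule of thumb 10/p are taken from the list above.

```c
#include <assert.h>

/* Linearly maps a value from [in_min, in_max] to [out_min, out_max]. */
double scale(double x, double in_min, double in_max,
             double out_min, double out_max)
{
    assert(in_max > in_min);
    return out_min + (x - in_min) * (out_max - out_min) / (in_max - in_min);
}

/* Scales an 8-bit pixel value to the interval [-0.9, 0.9]. */
double scale_pixel(int pixel)
{
    return scale((double)pixel, 0.0, 255.0, -0.9, 0.9);
}

/* Rule of thumb from observation 2: a learning rate of about 10/p,
   with p the number of patterns. */
double rule_of_thumb_rate(int npatterns)
{
    assert(npatterns > 0);
    return 10.0 / npatterns;
}
```

For the training set of 11163 pixels, the rule of thumb gives a learning rate of roughly 0.0009, close to the value of 0.001 that was found experimentally.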
For training the data sets, two 4-layered neural network architectures were used: a 6x26x18x9
and a 7x30x20x9 network. Monitoring the ACP during training showed that even
after many training iterations (> 5000), for most data sets, no loss in performance was
observed. However, for the segmented data sets such a loss was observed. This phenomenon
is called over-training: the neural network becomes too specialized for the
training set. For smaller sized neural networks, or a smaller number of iterations, a better
ACP was achieved. This indicates that the segmented data is easier to learn. Results for
the 8 data sets are depicted in Table 9.7:
            raw    fil    seg    fil+seg
SAR         28.7   34.3   -      -
optical     92.2   -      90.8   -
combined    92.1   93.0   92.0   82.6
Table 9.7: ACP (in %) for each of the 8 experiments.
It appears that the SAR data alone is not enough to distinguish correctly between different
ground cover classes. On the other hand, it seems that using both the optical and SAR
imagery in combination with segmentation and filtering techniques produces no significant
increase in ACP. Combining filtering and segmentation results in a drop in ACP.
To gain insight into how the different imagery, filtering and segmentation influence the
classification performance, a further investigation of how the different ground cover classes
are recognized is required. Table 9.8 depicts for each experiment the ACP for each of the
9 classes.
class  6raw   6seg   7raw   7fil   7seg   7fil+seg
1      91.6   93.2   92.4   92.5   94.5   92.0
2      84.8   96.7   85.4   85.7   84.0   88.4
3      100    100    100    100    100    99.8
4      86.3   69.4   90.4   97.3   98.6   94.5
5      76.1   76.1   78.9   80.3   67.6   99.0
6      99.7   99.0   90.3   92.6   99.0   86.9
7      97.0   100    97.0   93.9   87.9   6.1
8      98.4   100    98.4   98.4   100    100
9      100    100    100    100    100    99.2
avg.   92.2   90.8   92.1   93.0   92.0   82.6
Table 9.8: ACP (%) for 6 and 7 band imagery.
9.4.3 Classification
For testing PREENS in an operational phase, CONVIS was coupled to an input tool, reading
images in .lan format and producing data in the form of an input pattern containing 6
bands:
data { "classify patterns" float 1 [width*height] 6 0 "epoch" }
Furthermore, CONVIS was coupled to the output tool convis_xv. This tool starts up the
image visualization program xv, with the option to poll for creation or update events of a
file. For each classified image CONVIS receives from the simulation program, it transfers
the data to the output tool. This tool overwrites the file, after which xv displays the new
image.
Figure 9.11: Set up of the classification phase.
For each image to be classified, the classification results are stored in a file. The
corresponding parameter savefile is set via the action control menu of the action "classify".
It was discussed that in an operational phase, the neural network part of the neurosimulator
may be extracted for a stand-alone end-application running without the neurosimulator.
In such a set up, the input tool, the tuned neural network, and the code for saving the
classification results would be extracted and integrated into the end-application. In the
work of Schreurs and Hendriks [91], it was examined how PREENS neural network code can
be extracted and embedded into a stand-alone application. Their conclusion was that it
is relatively easy to do this. The actions load_network and classify must be extracted,
and a set of macro-definitions must be used which mirror the way the components of the
neural network are accessed in CONVIS.
9.4.4 Conclusions
Based on the observations made when considering features of existing neurosimulators, and
considering the requirements specified by users and environments in the world of
neurocomputing, three design criteria for PREENS were identified. Following these criteria, the
design, implementation and application of PREENS are presented in this chapter. Using the
concept of action-oriented program descriptions, CONVIS can be interfaced relatively easily
with neural network simulation programs. Using the communication and data interface,
tools can be added to the PREENS tool library. As few assumptions are made about how a
neural network is implemented, the network simulation code can be tailored to the needs
of a specific application or neural network model, and can thus be highly efficient and
compact. CONVIS can operate in parallel with tools and the simulation program it is
connected to. Because of this exploitation of the computing resources in a processor network,
PREENS runs faster and can handle a larger amount of data than current neurosimulators.
The design of PREENS makes it suitable for appliers of neural networks to use it
as an efficient platform to solve their applications. It can also very well be used as an
extendible general purpose neurosimulator for controlling simulations written by neural
network programmers.
PREENS was tested with neural network simulation programs running on various execution
platforms. For controlling dataset parallel implementations of simulations running on
MIMD-parallel systems, CONVIS is very well suited. For network decomposed simulations,
a preliminary interface was tested using an administration of distributed data structures.
10 Conclusions
Outline
This chapter summarizes the findings about platforms for artificial neural
network simulations. In the first part of this thesis, the suitability
of MIMD-parallel execution platforms for artificial neural networks is
examined. In particular, a performance prediction method is introduced
and evaluated for transputer-based implementations of backpropagation
and Kohonen neural networks.
In the second part of this thesis, neurosimulators as a platform for
simulating neural networks are discussed. The features of the new,
action-oriented neurosimulator called PREENS are discussed here.
Introduction: platforms for artificial neural networks
In this thesis, the reader is made familiar with several operational aspects from the world of
neurocomputing. In Chapter 2, an introduction is given to these aspects. Artificial neural
networks, high-performance platforms for executing them, and a performance prediction
method for examining the suitability of these platforms are introduced. Furthermore, a
taxonomy of users "doing neurocomputing" is given, the typical phases occurring in the
neurocomputing life-cycle are identified, and an introduction to PREENS, a platform for
simulating neural networks, is given. In this chapter, conclusions about these aspects are
drawn and some final remarks are made.
MIMD-execution platforms and the transputer
In Chapter 3, the architecture of MIMD-parallel computers and in particular
multi-transputer systems is presented. The processor architecture and communication networks
of three multi-transputer systems, the Nijmegen Super Cluster (NSC), the GCEL-512 (GCEL)
and the PowerXPlorer (PX) are discussed. Aspects of how to program these systems using
the operating systems Helios (for the NSC) and Parix (for the GCEL and PX) are mentioned.
It was concluded that at present, the T8xxx transputer is more or less outdated because of
its low processor performance. Current parallel processor systems like the PowerXPlorer
tend to use faster, more up-to-date CPUs as the basic processing element. However, the
techniques discussed in this thesis are not restricted to transputer systems only; they are
valid for any MIMD-architecture where there exist a number of nodes connected via a
communication network.
Performance prediction
In Chapter 4, it is concluded that there are a number of disadvantages when using the
techniques of performance benchmarking and performance modeling. With performance
benchmarking, a large number of criteria have to be taken into account before a
benchmark can be used to predict the performance of an execution platform for an application.
It was argued that performance measures like MCUPS are meaningless as a
performance scale unless it is stated exactly what such a parameter implies. For example,
[Rumelhart, momentum, batch-update, 6x20x30x9] backprop MCUPS is a more descriptive
performance measure than just plain MCUPS. Considering performance modeling, it was also
concluded that a large number of parameters influence the performance. For example, the
hardware architecture, operating system, compilers, compiler options, run-time libraries,
implementation aspects, and the size and complexity of the application are factors that
have an effect on the performance that can be achieved. If all of these factors are kept
constant, it may be possible to come up with reliable estimates. However, if any of them
is changed, a new estimation has to be made. Because of the problems with performance
benchmarking and modeling, in this thesis, a new technique is introduced which combines
the two techniques.
The combined method of performance modeling and kernel benchmarking presented in this
thesis defines the total required calculation and communication time as Equation (10.1):
\begin{align*}
T_{calc}(P,n,w,p) &= \sum_{i=1}^{n_f} N_i(n,w,p) \cdot t_i \\
T_{comm}(P,n,w,p) &= C(P,n,w,p) \cdot t_{comm} \\
T_{total}(P,n,w,p) &= T_{calc}(P,n,w,p) + T_{comm}(P,n,w,p) \tag{10.1}
\end{align*}
In this method, a program is modeled in terms of a number n_f of function kernels. Kernel
benchmarks are executed on one single processor to measure the time t_i required for
executing each kernel i. N_i(n,w,p) represents the number of times each kernel is called
and is based on the number of neurons n, the number of weights w and the number of
patterns p. The communication time is modeled as the number of times a single information
unit (e.g., a connection or activation value) must be sent over a physical communication
link, C(P,n,w,p), multiplied by the time required for communicating one value, t_comm.
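As an illustration, the model of Equation (10.1) can be written down in a few lines of
Python. This is a minimal sketch, not the implementation used in the thesis: the names
`Kernel` and `predict_total_time`, the two example kernels, their timings and the
communication count are all hypothetical placeholders chosen only to show how the terms
combine.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Kernel:
    """One function kernel: measured time per call and a call-count model."""
    name: str
    t: float                                 # t_i, seconds per call on one processor
    calls: Callable[[int, int, int], float]  # N_i(n, w, p)


def predict_total_time(
    kernels: Sequence[Kernel],
    comm_count: Callable[[int, int, int, int], float],  # C(P, n, w, p)
    t_comm: float,                                       # seconds per value per link
    P: int, n: int, w: int, p: int,
) -> Tuple[float, float, float]:
    """Return (T_calc, T_comm, T_total) following Equation (10.1)."""
    T_calc = sum(k.calls(n, w, p) * k.t for k in kernels)
    T_comm = comm_count(P, n, w, p) * t_comm
    return T_calc, T_comm, T_calc + T_comm


# Hypothetical example values (assumptions, not measurements from the thesis).
example_kernels = [
    Kernel("forward", 2e-6, lambda n, w, p: w * p),
    Kernel("backward", 3e-6, lambda n, w, p: w * p),
]
example_comm = lambda P, n, w, p: 2 * (P - 1) * w

T_calc, T_comm, T_total = predict_total_time(
    example_kernels, example_comm, t_comm=1e-5, P=4, n=10, w=100, p=50
)
print(f"T_calc={T_calc:.4f}s  T_comm={T_comm:.4f}s  T_total={T_total:.4f}s")
```

Only the per-kernel times and the link time have to be measured; the count functions
encode how the amount of work grows with n, w and p.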
A communication layer
In Chapter 5, a communication layer implementing the typical patterns of communication
required for MIMD-parallel neural network simulations is presented and analyzed.
Algorithms for broadcast, gather and accumulate are discussed for MIMD-computer networks
organized in grid and tree architectures. The communication complexity for these
algorithms is given by:
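\begin{align*}
T^{grid\text{-}broadcast}(P,s) &= (width(P) + height(P) - 2) \cdot s \cdot t_{comm} \\
T^{tree\text{-}broadcast}(P,s) &= \ldots \cdot depth(P) \cdot s \cdot t_{comm} \\
T^{grid\text{-}gather\text{-}accumulate}(P,w) &= (width(P) + height(P) - 2) \cdot w \cdot (t_{comm} + t_{acc}) \\
T^{tree\text{-}gather\text{-}accumulate}(P,w) &= 3 \cdot depth(P) \cdot w \cdot (t_{comm} + t_{acc})
\end{align*}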
Using these models, the communication time T_comm(P, s) can be predicted for any number
of processors P; it is only required to measure the time t_comm over one communication
link and the accumulation time t_acc on one processor. For all three transputer systems
used, the NSC, the GCEL and the PX, this method is validated quantitatively. For a varying
number of processors P and several message sizes s, measurements and predictions using
the method are made. The predictions have an average accuracy within 5%, and all are
accurate within ±10%.
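The grid and tree models are simple enough to evaluate directly. The following Python
sketch is an illustration only: it assumes that the grid term counts the link hops between
opposite corners of a width-by-height grid, takes the factor 3 in the tree
gather-accumulate model from the formula given above, and uses invented values for
t_comm, t_acc and the "measured" time. The helper `relative_error` is a hypothetical name
for comparing a prediction with a measurement.

```python
def grid_broadcast_time(width: int, height: int, s: int, t_comm: float) -> float:
    """Predicted time to broadcast s values over a width x height grid."""
    return (width + height - 2) * s * t_comm


def grid_gather_accumulate_time(width: int, height: int, w: int,
                                t_comm: float, t_acc: float) -> float:
    """Predicted time to gather and accumulate w values over a grid."""
    return (width + height - 2) * w * (t_comm + t_acc)


def tree_gather_accumulate_time(depth: int, w: int,
                                t_comm: float, t_acc: float) -> float:
    """Predicted time to gather and accumulate w values over a tree."""
    return 3 * depth * w * (t_comm + t_acc)


def relative_error(predicted: float, measured: float) -> float:
    """Signed relative deviation of a prediction from a measurement."""
    return (predicted - measured) / measured


# Invented example values: one link time and one accumulation time.
t_comm, t_acc = 1e-6, 5e-7
predicted = grid_broadcast_time(4, 4, 100, t_comm)
measured = 6.2e-4  # hypothetical measurement, for illustration only
print(f"predicted={predicted:.2e}s  error={relative_error(predicted, measured):+.1%}")
```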
Dataset decomposition
When using dataset decomposition for parallel neural network simulations, a set of p
patterns is equally distributed over the available P processors. A complete copy of the
neural network runs on each processor. In Chapter 6, this technique is presented and
analyzed for the transputer-based implementation of two popular neural network models,
backpropagation and Kohonen neural networks. For both neural networks, the kernel
functions are determined and measured on one processor. The communication requirements
are also given. Based on these, predictions are made which all have an accuracy within a
few percent.
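The idea of dataset decomposition can be sketched in plain Python. This is a toy
illustration under stated assumptions, not the transputer implementation from Chapter 6:
the "network" is a single linear neuron trained with a batch delta rule, the processors
are simulated by a loop, and all function names are hypothetical.

```python
def split_patterns(patterns, P):
    """Distribute the patterns as evenly as possible over P processors."""
    base, extra = divmod(len(patterns), P)
    chunks, start = [], 0
    for rank in range(P):
        size = base + (1 if rank < extra else 0)
        chunks.append(patterns[start:start + size])
        start += size
    return chunks


def local_weight_changes(weights, chunk, lr=0.1):
    """Summed delta-rule weight changes for one linear neuron on one chunk."""
    delta = [0.0] * len(weights)
    for x, target in chunk:
        out = sum(wi * xi for wi, xi in zip(weights, x))
        err = target - out
        for j, xj in enumerate(x):
            delta[j] += lr * err * xj
    return delta


def accumulate(deltas):
    """Element-wise sum of the per-processor weight changes."""
    return [sum(column) for column in zip(*deltas)]


def parallel_batch_step(weights, patterns, P):
    """One batch update: every 'processor' holds a full copy of the weights."""
    chunks = split_patterns(patterns, P)
    deltas = [local_weight_changes(weights, c) for c in chunks]
    total = accumulate(deltas)  # corresponds to the gather-accumulate step
    return [wi + di for wi, di in zip(weights, total)]


patterns = [([1.0, 0.0], 1.0), ([0.0, 1.0], 0.0),
            ([1.0, 1.0], 1.0), ([0.5, 0.5], 0.5)]
sequential = parallel_batch_step([0.0, 0.0], patterns, P=1)
decomposed = parallel_batch_step([0.0, 0.0], patterns, P=2)
print(sequential, decomposed)
```

With batch updating, the decomposed step yields the same weights as the sequential one up
to floating-point rounding, which is why only the accumulated weight changes need to be
communicated.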
Using the performance prediction method, the suitability of MIMD systems like the
transputer can be expressed in terms of MCUPS or ICPS, speedup, efficiency and
scalability. It appears that dataset decomposition is a highly efficient technique. This
can be concluded from the following observations:
1. The maximal performance as expressed by Equations (6.4) and (6.7) is almost reached.
2. Though the speedup limit is reached for small problem sizes, in general, near-linear
speedups are reached. Furthermore, for two real applications (see Section 6.7), the
speedup limit is reached far beyond the available number of processors (e.g., 194
(PX), 881 (NSC) and 929 (GCEL) for Nettalk on a grid).
3. High scalabilities are achieved and the scalability limit is not reached.
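To make the notions of speedup, efficiency and speedup limit concrete, the following
Python sketch evaluates them for an assumed time model. The usual definitions are assumed
here (speedup as single-processor time divided by P-processor time, efficiency as speedup
divided by P). The time model itself (calculation time divided by P, plus one
gather-accumulate and one broadcast of w values on a square grid) and all numbers are
illustrative assumptions, not results from the thesis.

```python
def predicted_epoch_time(side, t_calc_1, w, t_comm, t_acc):
    """Assumed model for a side x side grid, i.e. P = side * side processors."""
    P = side * side
    hops = 2 * side - 2                    # width + height - 2
    gather = hops * w * (t_comm + t_acc)   # accumulate the weight changes
    broadcast = hops * w * t_comm          # redistribute the updated weights
    return t_calc_1 / P + gather + broadcast


def speedup(t_1, t_P):
    return t_1 / t_P


def efficiency(t_1, t_P, P):
    return speedup(t_1, t_P) / P


def speedup_limit(t_calc_1, w, t_comm, t_acc, max_side=64):
    """Return (P, speedup) for the grid size with the highest predicted speedup."""
    best_P, best_S = 1, 1.0
    for side in range(1, max_side + 1):
        t_P = predicted_epoch_time(side, t_calc_1, w, t_comm, t_acc)
        S = speedup(t_calc_1, t_P)
        if S > best_S:
            best_P, best_S = side * side, S
    return best_P, best_S


# Invented numbers: 10 s of calculation on one processor, 1000 weights.
best_P, best_S = speedup_limit(t_calc_1=10.0, w=1000, t_comm=1e-6, t_acc=5e-7)
print(best_P, round(best_S, 1), round(best_S / best_P, 2))
```

Beyond the speedup limit, adding processors increases the communication time faster than
it reduces the calculation time, so the predicted speedup drops again.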
Network decomposition
In Chapter 7, these issues are considered for network decomposition, the collection of
decomposition methods for distributing one single neural network over a number of
processors. The implementation of backpropagation and Kohonen neural networks via network
decomposition is presented and analyzed. A new communication method is described for
gathering the distributed activations of the backpropagation neural network. Again,
function kernels are determined and measured on one processor. Using the model for
calculation and communication times, performance predictions are made with an accuracy
within ±15% for backpropagation and ±10% for Kohonen. Compared to the results achieved
with the dataset decomposition technique, it appears that network decomposition is less
suitable, in particular for backpropagation. This can be concluded based on the following
considerations:
1. The achieved performances are lower; consider for example Figure 6.9 and Table 7.7
for the Nettalk problem and Figure 6.11 and Table 7.11 for Satdat.
2. The speedup limit is reached relatively soon, as can be seen in Table 7.3.
3. The scalability and efficiency are low; consider Section 7.7.
For the Kohonen SOM, these considerations do not show such a dramatic drop in
suitability. The reason is that for KSOMs, the communication requirements are far lower
than for network decomposed backpropagation networks.
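The following Python sketch illustrates why network decomposed backpropagation needs so
much communication. It is a simplified illustration, not the method of Chapter 7: the
neurons of one layer are split into contiguous blocks, each simulated "processor" computes
the activations of its own block, and the blocks are then gathered so that every
processor has the full activation vector for the next layer. All names are hypothetical.

```python
import math


def partition(n_neurons, P):
    """Contiguous index ranges of the layer's neurons, one range per processor."""
    base, extra = divmod(n_neurons, P)
    ranges, start = [], 0
    for rank in range(P):
        size = base + (1 if rank < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def local_forward(weight_rows, inputs, my_neurons):
    """Activations of the neurons owned by one processor."""
    return [
        sigmoid(sum(w * x for w, x in zip(weight_rows[j], inputs)))
        for j in my_neurons
    ]


def decomposed_forward(weight_rows, inputs, P):
    """Forward pass of one layer distributed over P simulated processors."""
    parts = [local_forward(weight_rows, inputs, r)
             for r in partition(len(weight_rows), P)]
    # Gather step: every activation must reach every processor before the
    # next layer can be computed, for every pattern that is presented.
    return [a for part in parts for a in part]


weights = [[0.1, -0.2], [0.4, 0.3], [-0.5, 0.2], [0.0, 0.7], [0.6, -0.1]]
full = decomposed_forward(weights, [1.0, 0.5], P=1)
split = decomposed_forward(weights, [1.0, 0.5], P=3)
print(split)
```

In contrast to dataset decomposition, the gather step here is needed for each layer and
each pattern, not once per batch.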
Neurosimulators
Three classes of neurosimulators are distinguished in Chapter 8: application-oriented,
algorithm-oriented, and general programming systems. Users in the world of neurocomputing
are: model builders, tool builders, applied researchers and end-users. For some users,
one class of neurosimulators may be more suited than for others. In Chapter 8, it is
concluded that current neurosimulators are based on a general (hierarchical) neural
network datastructure or some neural network description language. This approach requires
neural networks to be specified in a prescribed standard manner, which usually means that
their implementations are not as efficient as would be the case with implementations
specifically designed for a (neural network and) target execution platform. It also forces
neural network programmers to program their applications following the NDL syntax or
using some set of neural network library routines. Furthermore, it appears that neural
network simulation programs all have a similar structure, implementing a limited set of
actions. Finally, in many cases the neural network code in neurosimulators represents
only a small part of the total code required. The larger part is required for implementing
the user interface and I/O.
PREENS, an action-oriented neurosimulator
Based on these observations, in Chapter 9, a new kind of neurosimulator is presented.
PREENS uses a conceptual model of programs rather than of neural networks. In this model,
a program is described via the actions that are implemented. In PREENS, the user-interface
CONVIS and neural network programs are loosely coupled. Via a program description and
a communication interface, CONVIS can control a running simulation program and access
its components. It can also exchange components of the neural network with tools from the
PREENS algorithm library. As programmers do not have to conform to some general neural
network datastructure, the neural network code can be tailored specifically to an
application. Note that PREENS fully exploits the availability of workstations in a
computer network. Because tools, CONVIS and a neural network simulation program can each
be run on any machine, the environment can be executed in parallel. Therefore, it can
operate faster and handle larger amounts of data than current neurosimulators.
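The general idea of describing a program by its actions can be illustrated with a small
Python sketch. This is not the PREENS or CONVIS interface, which is not reproduced here;
the classes `SimulationProgram` and `Controller`, the action names and the placeholder
training rule are all hypothetical and only show how a controller can drive a program
while knowing nothing but the names of its actions.

```python
class SimulationProgram:
    """Hypothetical simulation program that exposes its actions by name."""

    def __init__(self):
        self.weights = [0.0, 0.0]
        self.epochs_done = 0
        # The "program description": action name -> implementation.
        self._actions = {
            "init": self._init,
            "train": self._train,
            "get_weights": self._get_weights,
        }

    def describe(self):
        """Names of the actions this program implements."""
        return sorted(self._actions)

    def perform(self, action, *args):
        if action not in self._actions:
            raise KeyError(f"unknown action: {action}")
        return self._actions[action](*args)

    def _init(self):
        self.weights = [0.0, 0.0]
        self.epochs_done = 0

    def _train(self, epochs):
        # Placeholder for network-specific training code.
        for _ in range(epochs):
            self.weights = [w + 0.1 for w in self.weights]
            self.epochs_done += 1

    def _get_weights(self):
        return list(self.weights)


class Controller:
    """Stands in for a user interface that only knows the action names."""

    def __init__(self, program):
        self.program = program

    def run(self, script):
        return [self.program.perform(name, *args) for name, args in script]


program = SimulationProgram()
results = Controller(program).run(
    [("init", ()), ("train", (3,)), ("get_weights", ())]
)
print(program.describe(), results[-1])
```

Because the controller depends only on the action names, the code behind each action can
use whatever data structures suit the application.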
Currently, PREENS can handle tools and simulation programs running on workstations and
transputer systems. It has successfully been tested on heterogeneous processor networks
consisting of DEC, Sun and HP workstations and a Parsytec transputer system. For dataset
decomposed neural network simulation programs, the communication libraries of PREENS
operate satisfactorily. Preliminary tests indicate that the concept of transforming
“exotic” data to and from PREENS format also shows promise.
Final considerations
The performance prediction method described and used in this thesis can be used to
evaluate the suitability of MIMD-execution platforms for artificial neural network
simulations. Note that this method is a scalable performance prediction method: based on
measurements of the calculation time on one processor and of the communication time over
one communication link, the performance can be predicted for larger-scaled processor
networks. Using the method, predictions are made for the performance of backpropagation
and Kohonen neural network implementations. Whereas the predictions are good, in general
with an accuracy within ±10%, for network decomposed backpropagation networks
it appears that transputers do not provide a very efficient execution platform. However,
for dataset decomposed implementations, and also for the network decomposed Kohonen
network, good suitability criteria are achieved.
This indicates that for MIMD platforms, when implementing a neural network, it should
be considered whether dataset decomposition is feasible. If this is not the case, only
small processor networks, possibly with a large amount of memory per node (e.g., 64 Mb
compared to 4 Mb for the T8xxx transputer), may be used to implement a neural network.
This observation matches current trends in MIMD-processor technology, where individual
processing elements are becoming more powerful.
The performance prediction method and the concept of action-oriented program
descriptions described in this thesis could also be used for other application areas:
o For any MIMD-parallel implementation of an application, the total execution time can
be expressed as the sum of the times required for calculation and for communication.
To make this feasible, it is required to identify the kernel benchmarks, and to measure
these and the basic communication time t_comm. Equation (10.1) can then be used to
determine the total execution time, where instead of the number of neurons, weights and
patterns (n,w,p), the parameters corresponding to the application have to be filled in.
Note that for implementations that exploit techniques for overlapping communication
and calculation, a more complicated model is required.
o The concept of action-oriented program descriptions could also facilitate the design
of tool boxes for application areas such as image processing. As with neural network
applications, in image processing a relatively small set of actions can be identified,
such as loading and storing the image, filtering operations, edge detection,
segmentation, etcetera. The possibility of incorporating these within one tool-set and
exploiting the resources in a distributed environment by executing them on different
processors would justify the use of an approach like the one sketched in this thesis for
PREENS.
Bibliography
[1] G.M. Amdahl. Validity of the single processor approach to achieving large scale computer 
capabilities. In Proc. AFIPS Spring Joint Comp. Conf. 30, pages 483-485, Atlantic City 
(NJ), 1967.
[2] M. Azema-Barac, M. Hewetson, M. Recce, J. Taylor, P. Treleavan, and M. Vellasco. Pyg­
malion Neural Network Programming Environment. In B. Angeniol and B. Widrow, editors, 
International Neural Network Conference, pages 1237-1244, Paris, July 1990. Kluwer Aca­
demic Publishers.
[3] G. Bader and B. Przywara. T9000 - A Preliminary Evaluation of Arithmetic Performance. 
Brief Note, Interdisziplinäres Zentrum fur Wissenschaftliches Rechnen, Universität Hei­
delberg, Im Neuenheimer Feld 368, D-6900 Heidelberg, iwrl.iwr.Uni-Heidelberg.de, May
1993.
[4] V. Barbosa and P.M. Lima. On the Distributed Parallel Simulation of Hopfield’s Neural 
Networks. Software-Practice and Experience, 20(10):967-983, October 1990.
[5] M. Berry, G. Cybenko, and J. Larson. Scientific Benchmark Characterizations. Parallel 
Computing, 17(2): 1173^1194, 1991.
[6] D.P. Bertsekas and J.N. Tsitsklis. Parallel and distributed computation. Prentice Hall, 1989.
[7] Silvio Bierman and Hans van Hooft. Kohosim, a parallel simulation environment for koho­
nen neural networks. Under graduate project, 1996.
[8] C.M. Bishop. Neural networks for pattern recognition. Clarendon Press, Oxford, 1995.
[9] Eric Boon. The P in preens, A dataset parallel interface for convis and experiences with a 
parallel counterpropagation simulator. Master’s thesis, University of Nijmegen, Computer 
Science, June 1997. thesisnr 415.
[ 101 John Bradley. Xv, an interactive image manipulation program for the x windows system,
1994. Version 3.10a.
[11] J.B. Campbell, introduction to remote sensing. The Guilford Press, New York, 1996.
[12] G.A. Carpenter and S. Grossberg. Art 2: Self-organization of stable category recognition 
codes for analog input patterns. Applied Optics, 26(23):4919-4930, December 1987.
197
198 Bibliography
[13] G.A. Carpenter and S. Grossberg. A massively parallel architecture for a self-organizing 
neural pattern recognition machine. Computer Vision, Graphics, and Image Processing, 
37:54-115, 1987.
[14] G.A. Carpenter, S. Grossberg, and J.H. Reynolds. Artmap: Supervised real-time learn­
ing and classification of nonstationary data by a self-organizing neural network. Neural 
Networks, 4:565-588, November 1991.
[15] G.A. Carpenter, S. Grossberg, and D.B. Rosen. Fuzzy art: Fast stable learning and catego­
rization of analog patterns of an adaptive resonance system. Neural Networks, 4:759-771, 
June 1991.
[16] Edingburgh Parallel Computing Centre. Annual Report and Project Directory, 1990-1991.
[17] G. Chinn, K.A. Grasjki, C. Chen, C. Kuszmaul, and S. Tomboulian. Systolic array imple­
mentation of neural nets on the maspar mp-1 massively parallel processor. In Proceedings 
of the IJCNN-90 II, pages 169-173, San Diego, 1990.
[18] L-C. Chu and B.W. Wah. Optimal Mapping of Neural Network Learning on Message- 
Passing Multicomputers. Journal of Parallel and Distributed Computing, 14:319-339, 1992.
[19] L.J. Clarke. Tiny version 1.0, discussion and user guide. Technical report, Edinburgh 
Parallel Computing Centre, The King’s Buildings, Mayfield Road, Edinburgh EH9 3JZ, 
May 1989.
[20] M. Cosnard, J.C. Mignot, and H. Paugam-Moisy. Implementations of Multilayer Neural 
Networks on Parallel Architectures. In 2nd International Specialist Seminar on Parallel 
Digital Processors, Lisbonne, Portugal, april 1991.
[21] H.J. Curnow and B.A. Wichmann. A Synthetic Benchmark. Comput. J., 19(1):43—49, 1991.
[22] A. d’Acierno, R. Del Balio, and R. Vaccaro. Fully connected neural networks: simulation 
on massively parallel computers. Artificial Neural Networks, pages 1489-1500, 1991.
[23] Perihelion Software Ltd. (Ian Davies). The Helios Parallel Operating System. Prentice Hall, 
1991.
[24] K.M. Dixit. The SPEC Benchmarks. Parallel Computing, 17(2):1195-1209, 1991.
[25] J. Dongarra and W. Gentzsch. Benchmarking of High Performance Computers. Parallel 
Computing, 17(2):1067-1069, 1991.
[26] J. J. Dongarra. Performance of Various Computers Using Standard Linear Equations Soft­
ware. Technical Report CS-89-85, University of Tennessee, Computer Science Department, 
Knoxville, TN 37996-1301, September 1992.
[27] M. Duranton, F. Aglan, and N. Mauduit. Hardware Accelerators for Neural Networks: 
Simulations in Parallel Machines. In D. Heidrich and J.C. Grossetie, editors, Comput­
ing with T-Node Parallel Architecture, pages 235-264. ECSC, EEC, EAEC, Brussels and 
Luxembourg, 1991.
Bibliography 199
[28] W. Eppler, M. Rinderspacher, and M. Rudolph. A Digital Signal Processor for Simulating 
Backpropagation Neural Networks. In T. Kohonen, K. Makisara, O. Simula, and J. Kangas, 
editors, Artificial Neural Networks, pages 1565-1568. Elsevier Science Publishers (North- 
Holland), 1991.
[29] D. Erco§kun and K. Oflazer. Experiments with Parallel Backpropagation on a Hypercube 
Parallel Processor System. In T. Kohonen, K. Makisara, O. Simula, and J. Kangas, editors, 
Artificial Neural Networks, pages 1465-1468. Elsevier Science Publishers (North-Holland),
1991.
[30] A. Zell et al. SNNS Stuttgart Neural Network Simulator. University of Stuttgart, IPVR, 
Breitwiesenstrasse 20-22, 70565 Stuttgart Germany, 1994. User Manual, Version 3.3 Report 
3/94 (revised).
[31] B.W. Char et al. Maple V Language Reference Manual. Springer Verlag, 1991.
[32] S.E. Fahlman. The Recurrent Cascade-Correlation Architecture. Technical Report CMU- 
CS-91-100, Carnegie Mellon University, School of Computer Science, Pittsburgh, PA 15213, 
May 1991.
[33] S.E. Fahlman and C. Lebiere. The Cascade-Correlation Learning Architecture. In D.S. 
Touretzky, editor, Advances in Neural Information Processing Systems 2. Morgan Kauf­
mann, 1990.
[34] J.A. Feldman, M.A. Fanty, N.H. Goddard, and K.J. Lynne. Computing with Structured 
Connectionist Networks. Communications of the ACM, 31(2):170—187, Februari 1988.
[35] H.P. Flatt and K. Kennedy. Performance of Parallel Processors. Parallel Computing, 12:1— 
20, 1989.
[36] B.M. Forrest, D. Roweth, N. Stroud, D.J. Wallace, and G.V. Wilson. Implementing Neural 
Network Models on Parallel Computers. The Computer Journal, 30(5):413-419, 1987.
[37] G.Fox, M.Johnson, G.Lyzenga, S.Otto, J.Salmon, and D.Walker. Solving Problems on 
Concurrent Processors, volume 1. Prentice Hall, New Jersey, 1988.
[38] J. Ghosh and K. Hwang. Mapping Neural Networks onto Message-Passing Multicomputers. 
Journal of Parallel and Distributed Computing, 6:291-330, 1989.
[39] Parsytec Gmbh. Supercluster technical documentation. Parsytec Gmbh, Juelicher Strafie 
338 D-5100 Aachen Germany, April 1989.
[40] N.H. Goddard, K.J. Lynne, T. Mintz, and L. Bukys. Rochester Connectionist Simulator. 
Technical Report 233 (revised), University of Rochester, Computer Science Department, 
Rochester, New York 14627, October 1989.
[41] K.F. Goser. Challenge of ANN to Microelectronics. In S. Gielen and B. Kappen, editors, 
ICANN ’93, pages 1025-1029. Springer-Verlag, 1993.
[42] J.L. Gustafson. Reevaluating Amdahl’s Law. Communications of the ACM, 31(5):532—533, 
May 1988.
200 Bibliography
[43] I.S. Han, K.H. Ahn, T.H. Park, and K.H. Jun. Adaptable VLSI Neural Network of Tens of 
Thousand Connections. In I. Aleksander and J. Taylor, editors, Artificial Neural Networks
2, pages 1423-1426. Elsevier Science Publishers (North-Holland), 1992.
[44] R. Hecht-Nielsen. Applications of Counterpropagation Networks. Neural Networks, 1:131— 
139, 1988.
[45] J.N.H. Heemskerk. Neurocomputers for Brain-Style Processing. Design, Implementation, 
and Application. PhD thesis, Leiden University, Dept, of Experimental Psychology, 1995.
[46] A.J.G. Hey. The Genesis Distributed Memory Benchmarks. Parallel Computing, 
17 (2): 1275—1283, 1991.
[47] C.A.R. Hoare. Communicating sequential processes. Communications of the ACM, 21(8), 
August 1978.
[48] R.W. Hockney and C.R. Jesshope. Parallel computers : architecture, programming and 
algorithms. Hilger, 1988.
[49] J.J. Hopfield. Networks and Physical Systems with Emergent Collective Computational 
Abilities. In Proc. Natl. Acad. Sei. USA 79, pages 2554-2558, 1982.
[50] Motorola Inc. IBM Microelectronics. PowerPC 601, RISC Microprocessor User’s Man­
ual. IBM Microelectronics, Motorola Inc., IBM Microelectronics, Mail Stop A25/862-1, 
PowerPC Marketing, 1000 River Street, Essex Junction, VT 05452-4299, 1993.
[51] INMOS International. OCCAM 2 reference manual. Prentice Hall Internationa, 1988.
[52] I. Kanellopoulos, A. Varfis, G.G. Wilkinson, and J. Megier. Land cover discrimination 
in spot hrv imagery using an artificial neural network. International Journal of Remote 
Sensing, 13(5) :917-924, 1992.
[53] I. Kanellopoulos and G.G. Wilkinson. Experiments with backpropagation neural networks 
for image classification. Technical Report 1.91.119, JRC, Image Processing Lab, 1-21020, 
Ispra (VA) Italy, 1990.
[54] K. Kant. Introduction to Computer System Performance Evaluation. Computer Science 
Series. McGraw-Hill International Editions, 1992.
[55] B. Kappen. Using Boltzmann Machines for probability estimation. In S. Gielen and B. Kap­
pen, editors, ICANN ’93, pages 521-526. Springer-Verlag, September 1993.
[56] Bert Kappen. Personal communication, during several meetings at the Biophysiscs Lab in 
Nijmegen.
[57] C. Kelley and T. Williams. Gnuplot Unix version 3.5, online manual, info-gnuplot dart- 
mouth.edu, 1986-1993.
[58] T. Kohonen. Self-Organization and Associative Memory. Springer Verlag, Berlin, second 
edition, 1988.
Bibliography 201
[59] T. Kohonen, J. Kangas, and J. Laaksonen. Som pak, the self-organizing program package. 
Technical report, Laboratory of Computer and Information Science, Rakentajanaukio 2 C, 
SF-o2150 Espoo, Finland, November 1992. Version 1.2.
[60] T. Kohonen, J. Kangas, J. Laaksonen, and K. Torkkola. Lvq pak, the learning vector 
quantization package. Technical report, Laboratory of Computer and Information Science, 
Rakentajanaukio 2 C, SF-o2150 Espoo, Finland, October 1992. Version 2.1.
[61] M. Korsloot, A.J. Klaassen, and J.M. Mulder. The Suitability of Transputer Networks 
for Various Classes of Algorithms. In M. Reeve, editor, Parallel Processing and Artificial 
Intelligence, chapter 15, pages 275-291. Steven Ericsson Zenith., July 1989.
[62] R. Leenders. EENS, an Execution Environment for Neural Systems. Master’s thesis, Nr. 
139, University of Nijmegen, Faculty of Mathematics and Informatics, Toernooiveld 1, 6525 
ED Nijmegen, The Netherlands, January 1990.
[63] R.R. Leighton. The aspirin/migraines neural network software. Technical Report MP- 
91W00050, MITRE Washington Neural Network Group, The MITRE Corporation Wash­
ington C3I Division 7525 Colshire Drive McLean, Virginia 22102, October 1992. User’s 
Manual Release V6.0.
[64] B. Levin. Implementation of the SANS Algorithm on the CM2. In T. Kohonen, K. Makisara,
O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 1473-1476. Elsevier 
Science Publishers (North-Holland), 1991.
[65] Inmos Limited. Transputer Reference Manual. Prentice Hall, 1988.
[6 6] Mathworks, Cochituate Place 24, Prime Pathway, Natic, Mas 01760. Matlab, High Perfor­
mance Numeric Computation and Visualization Software, 1992. Version 4.0 User’s Guide.
[67] Inmos/SGS-Thomson Microelectronics. The T9000 Transputer Product Overview, 1991.
[6 8] N. Morgan, J. Beck, P. Kohn, J. Bilmes, E. Allman, and J. Beer. The Ring Array Proces­
sor: a Multiprocessing Peripheral for Connectionist Applications. Journal of Parallel and 
Distributed Computing, 14:248-259, 1992.
[69] J.M.J. Murre. Transputer Implementations of Neural Networks: an Analysis. In T. Ko­
honen, K. Makisara, O. Simula, and J. Kangas, editors, Artificial Neural Networks, pages 
1537-1540. Elsevier Science Publishers (North-Holland), 1991.
[70] J.M.J. Murre and S.E. Kleynenberg. The metanet network environment for the develop­
ment of modular neural networks. In B. Angeniol and B. Widrow, editors, Proceedings 
of the International Neural Network Conference, pages 717-720, Paris, July 1990. Kluwer 
Academic Publishers.
[71] J.M.J. Murre, R.H. Phaf, and G. Wolters. CALM networks: a modular approach to super­
vised and unsupervised learning. In Proceedings of the IJCNN, pages 649-656, Washington, 
1989.
[72] Inc. NeuralwareWare. Neuralworks professional ii. 202 Park West Drive, Pittsburgh PA 
15275.
202 Bibliography
[73] E. Nezry, H-G. Kohl, and H. de Groof. Restoration of Textural Properties in SAR Images 
using Second Order Statistics. In Proceedings of the International Geoscience and Remote 
Sensing Symposium (IGARSS94), pages 2165-2167, Pasadena, California, USA, August 
8-12 1994.
[74] K. Obermayer, H.Heller, H. Ritter, and K. Schulten. Simulation of Self-Organizing Neural 
Nets: a Comparison between a Transputer Ring and a Connection Machine C.Y1-2. In 
Proceedings of the Third Conference of NATUG, Sunnyvale, CA, 1990.
[75] K. Obermayer, H. Ritter, and K. Schulten. Large-Scale Simulations of Selforganizing Neural 
Networks on Parallel Computers: Application to Biological Modelling. Parallel Computing, 
14:381-404, 1990.
[76] Paugam-Moisy. A Spy of Parallel Neural Networks. Technical Report TR 90-27, Ecole 
Normale Supérieure de Lyon, IMAG, Lyon, France, 1990.
[77] M. Plonski. Res, genesis, and sfinx:three public domain simulators for neural net­
works. Neural Network Review 4, 1990. paper obtained via Neural Network Archive 
http://www.lpac.ac.uk/SEL-HPC/Articles/NeuralArchive.html.
[78] AFCEA International Press. DARPA Neural Network Study, November 1988.
[79] U. Ramacher, W. Raan, J. Anlauf, J. Beichter, U. Hachmann, N. Brühls, M. Wefieling, and 
E.Sicheneder. Multiprocessor and Memory Architecture of the Neurocomputer SYNAPSE-
1. In S. Gielen and B. Kappen, editors, ICANN ’93, pages 1034-1039. Springer-Verlag,
1993.
[80] M.L. Recce, P.V. Rocha, and P.C. Treleaven. Neural Network Programming Environments. 
In I. Aleksander and J. Taylor, editors, Artificial Neural Networks 2, pages 1237-1244. 
Elsevier Science Publishers, September 1992.
[81] G.D. Richards. The Implementation of Backpropagation on a Transputer Array. In J. Ker- 
ridge, editor, Developments using Occam, pages 173-179, 1988.
[82] G.D. Richards and T. Tollenaere. Documentation for Rhwydwaith Version 2.1. Technical 
Report ECSP-UG-7, University of Edinburgh, July 1989.
[83] D.E. Rumelhart and J.L. McClelland. Parallel Distributed Processing: Explorations in the 
Microstructure of Cognition, volume 1. MIT Press, 1986.
[84] A.J.M. Russel. Simulating Neural Networks on a Multi-Transputer System. Master’s thesis, 
Nr. 175, University of Nijmegen, Faculty of Mathematics and Informatics, Toernooiveld 1, 
6525 ED Nijmegen, The Netherlands, January 1991.
[85] A.J.M. Russel and Th.E. Schouten. FIELDNET, a Dynamic Network for Pattern Classifi­
cation. In S. Gielen and B. Kappen, editors, ICANN ’93, pages 456-459. Springer-Verlag, 
September 1993.
[8 6] Mimetics S.A. Mimenice artificial neural networks simulator, user manual. Avenue Sully- 
Prudhomme, 92298 Chatenay-Malabry, France.
Bibliography 203
[87] Ron Schoenmakers and Louis Vuurpijl. Segmentation and classification of combined optical 
and radar imagery. In Proceedings of IG A R SS’95, Florence, September 1995.
[8 8 ] R.P.H.M. Schoenmakers. Segmentation of Remotely Sensed Imagery. PhD thesis, Informat­
ica, Katholieke Universiteit Nijmegen, Toernooiveld 1, 6525 ED Nijmegen, The Netherlands, 
September 1995.
[89] R.P.H.M. Schoenmakers, G.G. Wilkinson, and Th.E. Schouten. Results of a hybrid seg­
mentation method. In SPIE , Rome, September 1994. SPIE.
[90] Th.E. Schouten. Internal report. Faculty of Mathematics and Informatics University of 
Nijmegen The Netherlands, 1991.
[91] R. Schreurs and C. Hendriks. Extracting PREENS Neural Network Code and Embedding it 
into a Stand-alone Simulation Program. Technical report, University of Nijmegen, Faculty 
of Mathematics and Informatics, Toernooiveld 1, 6525 ED Nijmegen, The Netherlands, 
October 1993.
[92] T.J. Sejnowski, P.K. Kienker, and G.E. Hinton. Learning Symmetry Groups with Hidden 
Units: Beyond the Perceptron. Physica, 22l):26(l 275. 1986.
[93] T.J. Sejnowski and C. Rosenberg. Parallel networks that learn to pronounce english text. 
Complex Systems, pages 145-168, 1987.
[94] A. Singer. Exploiting the Inherent Parallelism of Artificial Neural Networks to Achieve 
1300 Million Interconnects per Second. In B. Angeniol and B. Widrow, editors, Proceedings 
of the International Neural Network Conference, pages 656-660, Paris, July 1990. Kluwer 
Academic Publishers.
[95] A. Singer. Implementations of Artificial Neural Networks on the Connection Machine. 
Parallel Computing, 14:305-315, 1990.
[96] G. Smith. Backpropagation with Dynamic Topology and Simple Activation Functions. 
Technical Report TR 90-12, School of Information Science and Technology, Discipline of 
Computer Science Flinders University of South Australia PO Box 2100, Adelaide, SA, 5001,
1992.
[97] California Scientific Software. Brainmaker. 10024 Newtown Road, Nevada City, CA 95959.
[98] H. Speckmann, P. Thole, and W. Rosenstiel. COKOS: a Coprocessor for KOhonen’s 
Selforganizing Map. In S. Gielen and B. Kappen, editors, IC A N N  ’93, pages 1040-1044. 
Springer-Verlag, 1993.
[99] J.B. Theeten, M. Duranton, N. Maudit, and J.A. Sirat. The LNeuro-Chip: A Digital VLSI 
with On-Chip Learning Mechanism. In B. Angeniol and B. Widrow, editors, Proceedings of 
the International Neural Network Conference, pages 593-596. Kluwer Academic Publishers, 
July 1990.
[100] T. Tollenaere and G.A. Orban. Simulating Modular Neural Networks on Message-Passing 
Multiprocessors. Parallel Computing, 17(1):361—379, 1991.
204 Bibliography
[101] T. Tollenaere and G.A. Orban. Transparent Problem Decomposition and Mapping - a 
CSTools based Implementation. In P. Welch, editor, Transputing ’91, pages 107-123, 1992.
[102] T. Tollenaere, M.M. van Hulle, and G.A. Orban. Parallel Implementation and Capabil­
ities of Entropy Driven Artificial Neural Networks. Journal o f Parallel and Distributed 
Computing, March 1992. excact reference to be found (pp).
[103] P.C. Treleaven. Neurocomputers. International Journal of Neurocomputing, 1:4-31, 1989.
[104] A. Ultsch and H.P. Siemon. Exploratory Data Analysis: Using Kohonen Networks on Trans­
puters. Technical Report Bericht Nr. 329, University of Dortmund, Fachbereich Informatik, 
Postfach 500500, D-4600 Dortmund 50, December 1989.
[105] A.J. van der Steen. The Benchmark of the EuroBen Group. Parallel Computing, 17(2): 1211— 
1221, 1991.
[106] F.A. van Schaik and J.A.G. Nijhuis. Introduction to neural network design with nnsim 3.0. 
Technical Report IMS-TB-05/89, Institution for Microelectronics, Allmandring 30, 7000 
Stuttgart 80, Germany, November 1989.
[107] J. Vanhala and K. Kaski. Simulating Neural Networks in Distributed Environments. In 
J. Wexler, editor, Developing Transputer Applications (Proceedings OUG 11), pages 129— 
141, Edinburgh, Scotland, September 1989. IOS.
[108] Bart Veer. The CDL Guide. Helios Technical Guides. Distributed Software Ltd., 670 Aztec 
West, Bristol BS12 4SD, UK, January 1990.
[109] L.G. Vuurpijl and Th.E. Schouten. Control and Visualization of Neural Networks in the 
PREENS Project. In Proceedings of the third workshop of the Esprit Parallel Computing 
Action, Bonn, May 1991.
[110] L.G. Vuurpijl and Th.E. Schouten. Suitability of Transputers for Neural Network Simula­
tions. In W. Joosen and E. Milgrom, editors, Parallel Computing: From Theory to Sound 
Practice, pages 528-537. IOS Press, 1992.
[111] L.G. Vuurpijl and Th.E. Schouten. Performance of the GCel-512 for Parallel Neural Net­
work Simulations. In End report o f the CAMPP ’93 programme (University o f Amsterdam  
and Parsytec Gmbh). University of Nijmegen, 1993.
[112] L.G. Vuurpijl and Th.E. Schouten. PREENS, a Parallel Research Execution Environment 
for Neural Systems. In High Performance Computing and Networking ’94, München, April
1994.
[113] L.G. Vuurpijl and Th.E. Schouten. A Scalable Performance Prediction Model for
Parallel Neural Network Simulations. In High Performance Computing and Networking ’94,
München, April 1994.
[114] L.G. Vuurpijl, Th.E. Schouten, and J. Vytopil. Performance Prediction of Large MIMD
Systems for Parallel Neural Network Simulations. Future Generation Computing Systems,
11(2):221-232, 1995.
[115] Louis Vuurpijl. Preens tutorial, how to use tools and nn simulations. Technical report,
University of Nijmegen, November 1995.
[116] Louis Vuurpijl, Theo Schouten, and Jan Vytopil. Convis: Action oriented control and 
visualization of neural networks. Technical report, University of Nijmegen, Faculty of 
Mathematics and Informatics, Informatics for Technical Applications, Toernooiveld 1, 6525 
ED Nijmegen, The Netherlands, November 1994.
[117] R. Weicker. Dhrystone: A Synthetic Systems Programming Benchmark. Communications
of the ACM, 27(10):1013-1023, October 1984.
[118] R.P. Weicker. A Detailed Look at some Popular Benchmarks. Parallel Computing,
17(2):1153-1172, 1991.
[119] G. Whittington and C.T. Spracklen. A structured design, development and integration 
methodology for real-world applications of artificial neural networks. In I. Aleksander 
and J. Taylor, editors, Artificial Neural Networks 2, pages 1245-1251. Elsevier Science 
Publishers, September 1992.
[120] P. Willems. The ART Neural Networks Enlightened: Implementation on Sequential and
Parallel Computer Systems. Master’s thesis, University of Nijmegen, Faculty of
Mathematics and Informatics, Toernooiveld 1, 6525 ED Nijmegen, The Netherlands, September
1993.
[121] S.A. Williams. Programming Models for Parallel Systems. John Wiley & Sons Ltd., 1990.
[122] M. Wilson and J.M. Bower et al. Genesis and Xodus Documentation. Caltech,
ftp://ftp.bbb.caltech.edu, February 1991.
[123] M.A. Wilson, U.S. Bhalla, J.D. Uhley, and J.M. Bower. Genesis: A System for Simulating 
Neural Networks. In D.S. Touretzky, editor, Advances in Neural and Information Processing 
Systems, pages 485-492. Morgan Kaufmann, 1989.
[124] M. Witbrock and M. Zagha. An Implementation of Backpropagation Learning on GF11, a
Large SIMD Parallel Computer. Parallel Computing, 14:329-346, 1990.
[125] C-H. Wu, R.E. Hodges, and C-J. Wang. Parallelizing the Self-Organizing Feature Map on
Multiprocessor Systems. Parallel Computing, 17(1):821-832, 1991.
[126] H. Yoon and J.H. Nang. Multilayer Neural Networks on Distributed Memory
Multiprocessors. In B. Angeniol and B. Widrow, editors, Proceedings of the International Neural
Network Conference, pages 669-672, Paris, July 1990. Kluwer Academic Publishers.
[127] X. Zhang, M. McKenna, J.P. Mesirov, and D.L. Waltz. The Backpropagation Algorithm 
on Grid and Hypercube Architectures. Parallel Computing, 14:317-327, 1990.
Summary
This dissertation is the result of the PREENS research project, carried out in the EITA
research line of the Faculty of Mathematics and Informatics at the Catholic University of
Nijmegen. It examines two environments that are used for artificial neural networks:
neurosimulators, that is, software environments for simulating neural networks, and
MIMD-parallel computers, powerful machines for the fast execution of the computations in
neurosimulators.
Parallelism and neural networks
Parallelism is found everywhere. In nature, where for example countless ants simultaneously
build a termite mound. But also in daily life, where machines work on a task in parallel,
where construction workers build a house together, and where vehicles take part in traffic
at the same time.
Artificial neural networks also exhibit parallelism at several levels. At a detailed level
they contain a large number of processing units, the neurons, which are connected to each
other through an even larger number of axons and synapses (modelled by weights). In biology,
neurons, axons and synapses operate in parallel in time. At a higher level there are various
modules of layers of neurons and weights, all of which operate in parallel with respect to
each other.
Parallel computers seem an obvious platform on which to implement artificial neural
networks, because of the parallelism these networks contain. The advantage is that such
implementations can run faster than on sequential computers. The disadvantage is that
parallel systems are harder to program and are usually geared specifically to one neural
network model or one particular application.
The literature contains a large number of parallel implementations of artificial neural
networks. At a low level of parallelism, one-to-one mappings of weights and neurons are built
on analogue or digital chips. An overview of such "neuro-ASICs"¹ is given in [45, 79, 103].
A wide spectrum of other parallel platforms, such as the Connection Machine [94], GF11 [124],
MasPar [17] and transputer systems [81, 82, 84, 104, 107], has been used for higher (more
abstract) levels of parallelism. These machines are either massively parallel (containing
many, relatively simple, processors) or contain a smaller number of more general-purpose
processors.
¹ ASIC = application-specific integrated circuit
In this thesis, transputer systems are used to implement neural networks. In particular, a
model is introduced with which the suitability of the transputer for neural networks can be
determined. A transputer system belongs to the class of MIMD² systems: computers consisting
of multiple processors connected in a communication network. In contrast to massively
parallel computers, MIMD systems contain only a limited number of processors. As a
consequence, the number of neurons and weights far exceeds the number of processors. So when
a neural network is implemented on a MIMD system, the required memory and the computational
tasks must be distributed over the available processors. Techniques for partitioning a
neural network on MIMD systems such as the transputer are described in, for example,
[38, 100, 110] and are treated in detail in chapters 5, 6 and 7. Two principles are
important here: make sure that every processor has the same amount of work to do, and make
sure that the amount of work spent on information exchange, synchronisation and
communication is kept as small as possible.
Chapter 3 examines the architecture of MIMD systems, and in particular of three transputer
systems: the Nijmegen SuperCluster, the GCel-512 and the PowerXplorer. It discusses how
these systems can be programmed and how the Helios and Parix operating systems can be used.
The techniques described in this thesis can be used for any MIMD architecture, not only for
transputer systems.
Predicting performance
When determining the suitability of a computer system for a particular application, several
questions can be asked. The first is what computing speed can be achieved, given a particular
machine configuration and the type and size of the application. The second is whether the
total computing time can be reduced by increasing the communication and computing capacity,
in other words what performance improvement or "speedup" can be achieved. The third concerns
the scalability of the computer system and the application: does the computing time remain
constant when the size of both the computer system and the application is scaled up evenly?
Chapter 4 introduces a method that can predict the performance of MIMD systems for
artificial neural networks. In this method, the computation time and the communication time
are modelled for a given system. Based on the computation time measured on one processor and
the communication time measured between two processors, predictions can be made for larger
processor systems. With this method, the three questions posed above can be answered very
well.
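
The three questions can be made concrete with a small sketch. It assumes, purely for illustration, that the computation measured on one processor divides evenly over P processors and that the communication time for that P is already known; the function names and this simple additive form are hypothetical and are not taken from chapter 4.

```python
def predicted_time(t_comp_1, t_comm, n_proc):
    """Predicted time on n_proc processors.

    t_comp_1: computation time measured on ONE processor (seconds)
    t_comm:   communication time for this processor count (seconds)
    Assumption (not from the thesis): computation divides evenly.
    """
    if n_proc < 1:
        raise ValueError("n_proc must be >= 1")
    return t_comp_1 / n_proc + t_comm


def speedup(t_comp_1, t_comm, n_proc):
    """Question 2: how much faster than the one-processor run?"""
    return t_comp_1 / predicted_time(t_comp_1, t_comm, n_proc)


def efficiency(t_comp_1, t_comm, n_proc):
    """Speedup per processor; 1.0 would be ideal."""
    return speedup(t_comp_1, t_comm, n_proc) / n_proc


def scaled_time(t_comp_1, t_comm, n_proc):
    """Question 3: time when the problem grows with the machine,
    i.e. n_proc times the work on n_proc processors."""
    return predicted_time(t_comp_1 * n_proc, t_comm, n_proc)


if __name__ == "__main__":
    print(predicted_time(8.0, 1.0, 4))
    print(speedup(8.0, 1.0, 4))
    print(efficiency(8.0, 1.0, 4))
    print(scaled_time(8.0, 1.0, 4))
```

In this toy model the scaled time stays close to the one-processor time only as long as the communication term does not grow with the number of processors, which is exactly what the third question asks about.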
² MIMD stands for "multiple instruction multiple data". There are also SIMD systems, where
the "S" stands for "single".
As described in [110, 113], neural network implementations on parallel systems require
several typical kinds of communication. Chapter 5 introduces a communication layer that
implements these kinds. Algorithms for broadcast, gather and accumulate communications are
discussed, and the corresponding models for the required communication time are worked out
in that chapter. With these models, the communication time T_comm(P, s) can be predicted for
any number of processors P and any message size s. For the three transputer systems used,
the predicted values turn out to have a precision of around 5% and to be accurate to within
10%.
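
One common way to model a quantity like T_comm(P, s), shown here only as an illustration, is a start-up cost plus a per-byte cost for each message, with a tree-shaped broadcast that needs ceil(log2 P) steps. The constants, the function names and the tree assumption are hypothetical; they are not the models derived in chapter 5.

```python
import math


def t_message(size_bytes, t_startup, t_per_byte):
    """Time for one point-to-point message under a linear cost model."""
    return t_startup + size_bytes * t_per_byte


def t_broadcast(n_proc, size_bytes, t_startup, t_per_byte):
    """Tree broadcast (assumed): the number of processors that hold the
    message doubles in every step, so ceil(log2(n_proc)) steps suffice."""
    if n_proc < 1:
        raise ValueError("n_proc must be >= 1")
    steps = math.ceil(math.log2(n_proc))
    return steps * t_message(size_bytes, t_startup, t_per_byte)


def within_tolerance(predicted, measured, tolerance=0.10):
    """Check a prediction against a measurement, e.g. 'within 10%'."""
    return abs(predicted - measured) <= tolerance * measured


if __name__ == "__main__":
    # Made-up constants, in arbitrary time units.
    print(t_broadcast(8, 100, t_startup=2.0, t_per_byte=0.5))
```

Fitting the two constants from measurements between two processors and then evaluating the model for larger P mirrors the prediction step described above.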
The computation time is modelled with a new method of kernel benchmarking and performance
modelling. Chapter 4 lists a number of drawbacks of classical performance benchmarking and
performance modelling methods. A performance measure such as MCUPS is useless unless it is
stated precisely what contributed to the figure, as in "[Rumelhart, momentum, batch-update,
6x20x30x9] backprop MCUPS". In the new method, the computation time is modelled as:
$$T_{comp} = \sum_{i=1}^{n_f} T_{f_i} \qquad (10.2)$$
The computation time of a program is thus modelled as the sum of the computation times
required for n_f computing functions, each of which is measured on only one processor. We
also call this method "kernel benchmarking".
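
The sum of per-function times described above can be sketched as follows. The two kernels are made-up stand-ins for parts of a training step, and the best-of-several timing strategy is an assumption of this sketch, not the measurement procedure of the thesis.

```python
import time


def benchmark(kernel, repeats=5):
    """Best-of-`repeats` wall-clock time of one call to `kernel`,
    measured on the single processor this script runs on."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        kernel()
        best = min(best, time.perf_counter() - start)
    return best


def predicted_compute_time(kernel_times):
    """Sum of the measured times of the n_f kernels."""
    return sum(kernel_times)


# Hypothetical kernels (not from the thesis).
def forward_kernel():
    return sum(w * x for w, x in zip(range(1000), range(1000)))


def update_kernel():
    return [w + 0.1 for w in range(1000)]


if __name__ == "__main__":
    times = [benchmark(forward_kernel), benchmark(update_kernel)]
    print(predicted_compute_time(times))
```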
Dataset decomposition and network decomposition
The method described above for predicting the performance of a parallel system has been
applied in this thesis to two techniques, dataset decomposition and network decomposition.
In the first technique, a set of p patterns to be learned by the neural network is
distributed evenly over P processors. Each processor runs an identical copy of the neural
network. Chapter 6 evaluates this technique for two types of neural network, backpropagation
and Kohonen. For both models, kernel benchmarks have been defined and measured on the three
different transputer systems. Together with the models for the communication time,
predictions can be made of the total computing time on an arbitrary transputer network. The
suitability of a MIMD system such as the transputer can then be expressed in terms of MCUPS
or ICPS, speedup, efficiency and scalability. The results presented in chapter 6 indicate
that dataset decomposition is a particularly suitable technique for MIMD systems.
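
A minimal sketch of dataset decomposition, simulated sequentially: the p patterns are split evenly over P workers, each worker computes a partial batch gradient on its own copy of the weights, and the partial results are summed in an accumulate step. The single linear neuron and all function names are assumptions chosen to keep the example short; the thesis evaluates backpropagation and Kohonen networks.

```python
def split_evenly(patterns, n_proc):
    """Distribute patterns over n_proc workers; sizes differ by at most one."""
    base, extra = divmod(len(patterns), n_proc)
    chunks, start = [], 0
    for rank in range(n_proc):
        size = base + (1 if rank < extra else 0)
        chunks.append(patterns[start:start + size])
        start += size
    return chunks


def local_gradient(weights, chunk):
    """Batch gradient of the squared error for one linear neuron."""
    grad = [0.0] * len(weights)
    for inputs, target in chunk:
        output = sum(w * x for w, x in zip(weights, inputs))
        for j, x in enumerate(inputs):
            grad[j] += (output - target) * x
    return grad


def accumulate(partials):
    """Element-wise sum of the partial gradients of all workers."""
    return [sum(column) for column in zip(*partials)]


def decomposed_gradient(weights, patterns, n_proc):
    """Every worker uses an identical copy of `weights` on its own chunk."""
    chunks = split_evenly(patterns, n_proc)
    return accumulate([local_gradient(weights, chunk) for chunk in chunks])
```

Because only one weight-sized vector per worker has to be exchanged per batch, the communication cost is independent of the number of patterns, which is consistent with the favourable results reported for this technique.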
Chapter 7 deals with network decomposition, a collection of techniques for distributing a
neural network over multiple processors. The prediction method is tested for network
decomposition through the implementation of backpropagation and Kohonen networks. Again,
kernel benchmarks are defined and measured on one processor. Predicted values deviate by at
most 15% for backpropagation and 10% for Kohonen. Compared with the results obtained with
dataset decomposition, network decomposition turns out to be less suitable, in particular
for backpropagation, because of the amount of information exchange required between
processors.
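
To illustrate where that information exchange comes from, the sketch below distributes the neurons of one layer over P workers. Each worker computes only its own slice of the outputs, after which the slices have to be gathered before the next layer can proceed. This row-wise partitioning and the function names are assumptions for illustration; chapter 7 covers a collection of techniques.

```python
def neuron_slices(n_neurons, n_proc):
    """Index ranges (lo, hi) assigning neurons to workers as evenly as possible."""
    base, extra = divmod(n_neurons, n_proc)
    slices, start = [], 0
    for rank in range(n_proc):
        size = base + (1 if rank < extra else 0)
        slices.append((start, start + size))
        start += size
    return slices


def layer_forward_decomposed(weight_rows, inputs, n_proc):
    """weight_rows[k] holds the incoming weights of neuron k."""
    partial_outputs = []
    for lo, hi in neuron_slices(len(weight_rows), n_proc):
        # Local work of one worker: only the neurons in its slice.
        partial_outputs.append(
            [sum(w * x for w, x in zip(row, inputs)) for row in weight_rows[lo:hi]]
        )
    # Gather step: the full output vector is needed by every worker
    # before the next layer can be computed.
    return [y for part in partial_outputs for y in part]
```

The gather is needed once per layer for every pattern presented, so the communication volume grows with the number of patterns, unlike in dataset decomposition.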
Neurosimulators
This thesis gives an overview of the typical users of neurosimulators and the activities
they carry out. There are users who build models, users who build "tools"³, users who apply
neural networks to an application, and end users. Chapter 8 describes the typical components
of neurosimulators. Most neurosimulators are based on a hierarchical data structure or a
network description language. This requires neural networks to be developed in a prescribed
way, which means that the resulting implementations are not as efficient as those developed
specifically for one neural network and/or computer system. Furthermore, most neural network
software turns out to have an identical structure and to support only a limited number of
actions. Finally, the largest part of the code of a neurosimulator turns out to consist of
the user interface and I/O for visualisation and for reading from and writing to the hard
disk.
PREENS
The neurosimulator PREENS introduced in this thesis is based on a conceptual model of
programs rather than of neural networks. In this model, the actions that the program
implements are described by a so-called action-oriented program description. A general user
interface, CONVIS, manages a neural network simulation described in this way, and can also
manipulate the data in the program and exchange it with tools. Because no general data
structure or network description language has to be taken into account, a neural network
program can be implemented tailor-made for a particular application. The set of components
consisting of CONVIS, a neural network program and additional tools can be run on a network
of computers. As a result, the PREENS neurosimulator can be faster and process more data
than other neurosimulators.
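
The idea of describing a program by its actions, rather than by its network structure, can be sketched as a small registry that a generic controller queries and invokes. Everything here (the class, the decorator, the action names) is hypothetical and only illustrates the concept; it is not the PREENS or CONVIS interface.

```python
class ActionProgram:
    """A program described by the named actions it implements."""

    def __init__(self):
        self._actions = {}
        self.data = {}  # program data that a controller or tool may inspect

    def action(self, name):
        """Decorator that registers a function under an action name."""
        def register(func):
            self._actions[name] = func
            return func
        return register

    def list_actions(self):
        return sorted(self._actions)

    def run(self, name, **kwargs):
        if name not in self._actions:
            raise KeyError(f"unknown action: {name}")
        return self._actions[name](self.data, **kwargs)


program = ActionProgram()


@program.action("init")
def init(data, n_weights=3):
    data["weights"] = [0.0] * n_weights


@program.action("train")
def train(data, step=0.1):
    data["weights"] = [w + step for w in data["weights"]]
    return data["weights"]
```

A controller needs to know only the action names and the shared data, so the program behind them is free to use whatever internal data structures suit the application.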
The PREENS concept has been evaluated for several applications, including satellite image
recognition. PREENS has been tested on heterogeneous computer systems consisting of Unix
workstations and transputer networks. For parallel implementations, the PREENS communication
layer suffices in the case of dataset decomposition, while for network decomposition the
concept of transforming "exotic" data to and from the PREENS format appears to be a suitable
solution.
³ Tools are software programs that can be used for a particular purpose, such as
visualising information in an artificial neural network.
Concluding remarks
The method developed to predict the computing performance of MIMD computer systems for an
application has been used to determine the suitability of such systems for artificial neural
networks. The results reported in this thesis indicate that reasonably accurate predictions
(in general within 10%) can be made. For backpropagation neural networks implemented through
network decomposition, transputer systems turn out not to be an efficient execution
platform. But for dataset decomposition, and also for Kohonen networks implemented through
network decomposition, transputer systems do turn out to be suitable.
This indicates that, for the implementation of neural networks on such systems, the
possibility of using dataset decomposition must be examined more closely. If no efficient
methods of dataset decomposition are found, only smaller systems can be used, possibly with
more memory per processor. This observation is in line with current trends in MIMD processor
technology, where individual processors are becoming ever more powerful.
The method described in this thesis for predicting the performance of a MIMD computer system
and the concept of action-based program descriptions can also be used in other application
areas:
o For any MIMD-parallel implementation of an application, the total execution time can be
expressed in terms of computation and communication times. By defining and measuring kernel
benchmarks and the communication time, the suitability of the MIMD computer system can be
predicted in terms of computing power, speedup, efficiency and scalability.
o The concept of action-oriented program descriptions can also be used to develop other
software environments, such as for image processing. As with neural network applications, a
limited number of actions can be identified here, such as loading and saving an image,
filtering, edge detection, segmentation and other image processing operations. The
possibility of uniting these in one software environment and having parts execute on
different processors in a distributed network justifies an approach such as the one
described in this thesis for PREENS.
Curriculum Vitae
Louis Vuurpijl was born in Gorinchem, the Netherlands, on April 4, 1964. In 1982 he passed
his exams for the VWO at the Lorentz Scholengemeenschap in Arnhem, and started to study
computer science in Nijmegen. He received his M.Sc. degree in computer science in 1989 from
the University of Nijmegen.
From 1989 till 1991, Louis worked at the Department of Technical Informatics as a system
manager and programmer, to fulfil his alternative military service. During this work he
started his job as a PhD student. He finished the PREENS project in 1995, and since then he
has been working as a researcher at the NICI, also at the University of Nijmegen. In 1998 he
will start a 3-year post-doctoral research position at the NICI.
