Munster Technological University

SWORD - South West Open Research
Deposit
Theses

Dissertations and Theses

2004

Investigation of Implementation Techniques for Backpropagation
Neural Networks.
John Culloty
Department of Mathematics and Computing, Cork Institute of Technology, Cork, Ireland.

Follow this and additional works at: https://sword.cit.ie/allthe
Part of the Computer Sciences Commons

Recommended Citation
Culloty, John, "Investigation of Implementation Techniques for Backpropagation Neural Networks."
(2004). Theses [online].
Available at: https://sword.cit.ie/allthe/144

This Master Thesis is brought to you for free and open access by the Dissertations and Theses at SWORD - South
West Open Research Deposit. It has been accepted for inclusion in Theses by an authorized administrator of
SWORD - South West Open Research Deposit. For more information, please contact sword@cit.ie.


To the Registrar, Cork Institute of Technology
I, John Culloty, declare that the thesis which I am presenting, “Investigation of
Implementation Techniques for Backpropagation Neural Networks”, is my own work.

John Culloty

Investigation of Implementation
Techniques for Backpropagation Neural
Networks

John Culloty

HETAC
Autumn 2004

Thesis for the Degree of Master of Science

Title

Investigation of Implementation Techniques for
Backpropagation Neural Networks

Submitted By:

John Culloty

Academic Advisor:

Dr. Paul Walsh

Sponsoring Institute:

Department of Mathematics and Computing
Cork Institute of Technology

Submitted to the Higher Education and Training Awards Council, (June) (2004)


Cork Institute of Technology

Abstract
Investigation of Implementation Techniques for Backpropagation
Neural Networks
John Culloty
Artificial neural networks are biologically inspired computational methods that
have the ability to approximate discrete, real and vector valued target functions. For
some problem domains this ability makes neural networks a more appealing solution
than traditional computational techniques. Such problem domains typically contain a
large number of parameters that are interrelated in a complex and often unknown
manner. This makes rule-based solutions all but impossible. However, another feature of
such problem domains is the presence of large volumes of real-world data that can be
used to train a neural network until it learns the solution.
One of the more popular neural network training algorithms is the
backpropagation of error training algorithm. Backpropagation works by compiling a
large set of input/output samples and using these samples to adjust the network structure
until the network encodes a solution to the problem. After the presentation of each
sample to the network its structure is adjusted in such a way as to bring the network
output closer in line with the sample output. This process must be repeatedly performed
several hundred or thousand times on all the samples before the network converges on a
solution.
Hence a major limiting factor of the backpropagation algorithm is the length of
time required to train a network. Therefore the aim of this thesis is to investigate parallel
implementation techniques that reduce this time without altering the underlying
calculations. Methods investigated include the use of SIMD processing to speed up the
underlying operations, the parallel implementation of backpropagation on a dedicated
cluster computer, and its extension to a High Throughput Computing environment.

Acknowledgment
I would like to thank a number of people for their help and time during the
course of the thesis. In particular I would like to thank my project supervisor Dr. Paul
Walsh for his guidance with all aspects of the project, especially his help editing and
proofreading the thesis, without which the thesis would not have been completed. I
would also like to thank Alexey Lastovetsky for taking the time to read this thesis and
act as external reader.
I would also like to thank Paul Rothwell and Dave Simpson for their help on an
earlier project, which, although abandoned, did provide valuable insights into the work
presented in this thesis. Finally I would like to thank Dr. Jeane Stynes for her help and
advice.

Contents
Abstract
Acknowledgment
Contents
List of Tables
List of Graphs
1 Introduction
2 Artificial Neural Networks
2.1 Introduction
2.2 Artificial Neurons
2.3 Network Topology
2.3.1 Feed-Forward Neural Networks
2.4 Training Algorithm
2.5 The Perceptron
2.5.1 Perceptron Transfer Function
2.5.2 Perceptron Training
2.5.3 Limitations of Perceptrons
2.6 The Backpropagation of Error Training Algorithm
2.6.1 Forward Pass
2.6.2 Backward Pass
2.6.3 Backpropagation Transfer Functions
2.6.3.1 Common Transfer Functions
2.6.4 Weight Changes
2.6.5 Network Error
2.6.6 Local Minima
2.6.6.1 Momentum
2.7 Hidden Neurons
2.8 Cross Validation Testing
2.9 Conclusion
3 Parallel Computing
3.1 Introduction
3.2 Parallel Architectures
3.2.1 Single Instruction Stream, Single Data Stream (SISD)
3.2.2 Single Instruction Stream, Multiple Data Stream (SIMD)
3.2.3 Multiple Instruction Stream, Single Data Stream (MISD)
3.2.4 Multiple Instruction Stream, Multiple Data Stream (MIMD)
3.3 Memory Structure
3.4 Parallel Programming Models
3.4.1 MIMD Programming Models
3.5 Interconnection Networks
3.6 Cluster Computing
3.6.1 Requirements of a Cluster Computer
3.6.2 Advantages of Cluster Computing
3.6.3 Types of Cluster Computing
3.6.4 Cluster Computing Software
3.7 Grid Computing
3.8 Granularity
3.9 Load Balancing
3.10 Speedup Calculations
3.10.1 Amdahl's Law
3.10.2 The Gustafson-Barsis Law
3.10.3 Communication Costs
3.11 Parallelisation of the Backpropagation Algorithm
3.11.1 Training Session Parallelism
3.11.2 Training Set Parallelism
3.11.3 Pipelining
3.11.4 Neuron Parallelism
3.11.5 Synapse Parallelism
3.11.6 Vector Processing
3.12 Conclusion
4 Vector Processing for Backpropagation
4.1 Introduction
4.2 SIMD Architecture
4.3 Intel's SIMD Extensions
4.3.1 Packed Registers and Packed Data-Types
4.3.2 SIMD Operations
4.3.4 SIMD Design Considerations
4.3.4.1 Size of Data Elements
4.3.4.2 Alignment of Data
4.3.4.3 Organisation of Data in Memory
4.3.4.4 Padding
4.4 Issuing Intel's SIMD Instructions from C Code
4.4.1 Assembly Language
4.4.2 Compiler Intrinsics
4.4.3 C++ Class Interface
4.4.4 Compiler Vectorisation
4.5 Implementation of Backpropagation using SIMD Processing
4.5.1 Analysis of Backpropagation's Constituent Equations
4.5.1.1 Nodes Input
4.5.1.2 Output Values
4.5.1.3 Output Layer Error
4.5.1.4 Hidden Layer Error
4.5.1.5 Changing Weights
4.5.1.6 Network Error
4.5.2 Implementation methods for Constituent Equations
4.6 Implementation of Backpropagation Algorithm
5 Cluster Computing for Backpropagation
5.1 Introduction
5.2 Cluster Computing
5.3 Message Passing Systems
5.3.1 Point-to-Point Primitives
5.3.2 Group Primitives
5.4 MPI and PVM
5.5 Overview of the PVM System
5.6 Cluster Computing and Backpropagation
5.6.1 Training Session Parallelism
5.6.2 Training Set Parallelism
5.6.3 PVM Implementation of Training Set Parallelism
5.6.3.1 Analysis of Communication Requirements
5.6.3.2 Analysis of Initial Communication Phase
5.6.3.3 Analysis of Weight Update Communication Phase
5.6.3.4 Stopping Criteria
5.6.3.5 Master Task
5.6.3.5 Slave Task
5.6.3.6 Speedup Results
5.7 Conclusion
6 High Throughput Computing
6.1 Introduction
6.2 High Throughput Computing
6.3 Condor
6.3.1 Condor Machines
6.3.2 Condor Jobs
6.3.3 Checkpointing and Migration
6.3.4 Condor Universes
6.4 Condor and Training Session Parallelism
6.5 Condor-PVM
6.6 Condor-PVM and Backpropagation Training
6.6.1 Condor-PVM and Training Session Parallelism
6.6.2 Condor-PVM and Training Set Parallelism
6.6.3 Block Scheduling Algorithm
6.6.3.1 Maintenance Algorithm
6.6.3.2 Master Algorithm
6.6.3.3 Slave Algorithm
6.6.4 Evaluation of System
6.7 Limitations of Condor-PVM based System
6.8 Conclusion
7 Conclusion
8 References
Appendix A Intel's SIMD Instruction Set
A.1 SIMD Registers and Data-Types
A.2 Types of Instructions
A.3 Arithmetic Instructions
A.3.1 Sum of Absolute difference
A.3.2 Integer Addition and Subtraction
A.3.3 Multiplication
A.3.4 ADDSUB* instructions
A.3.5 Horizontal Arithmetic Instructions
A.4 Create Mask of Packed Data
A.5 Shift
A.6 Logical
A.7 Compare
A.8 Unpack
A.9 Shuffle
A.10 Pack
A.11 Conversion
A.12 Data Movement
A.13 State Management Functions
A.13.1 EMMS
A.14 Cacheability Control
A.15 Thread instructions
Appendix B Checking for SIMD Hardware and Software Support
B.1 Testing for the presence of SIMD extensions
B.2 Testing for the presence of a Floating-point unit
B.3 Testing for the FXSAVE and FXRSTOR support
B.4 Testing for Operating System support of SIMD exceptions
Appendix C Differences between PVM and Condor-PVM
C.1 No Multiple Slaves Per Host
C.2 No Multiple Spawns
C.3 PVM Architecture Class Replaced by Condor Machine Class
C.4 New notification events
C.5 Host not added immediately on request
Appendix D Paper Presented to the ParCo2003 Conference on Parallel Computing

List of Figures
Figure 1.1 Sample Neural Network
Figure 1.2 Feed-forward Neural Networks
Figure 2.1 A Generic Artificial Neuron
Figure 2.2 Neural Network Topologies
Figure 2.3 Naming convention for a 3-layered network
Figure 2.4 Decomposition of multi-output Perceptrons
Figure 2.5 The effect of the threshold value on the transfer function
Figure 2.6 Neural network diagram showing biased inputs
Figure 2.7 Linear separability and Boolean functions
Figure 2.8 The error surface for a network, complete with local and global minima
Figure 2.9 The effect of momentum on training
Figure 3.1 SISD Architecture
Figure 3.2 SIMD Architecture
Figure 3.3 MISD Architecture
Figure 3.4 MIMD Architecture
Figure 3.5 Complete Graph Interconnection Network
Figure 3.6 Ring Interconnection Topology
Figure 3.7 2D Mesh with and without wraparound connections
Figure 3.8 Binary Tree Interconnection Network
Figure 3.9 Hypercube Interconnection Network
Figure 4.1 Conditional Branch in a SIMD system
Figure 4.2 SIMD versus serial vector processing
Figure 4.3 Packed 64-bit add instruction operating on four 16-bit Integers
Figure 4.4 Packed Registers
Figure 4.5 Packed Compare Operation
Figure 4.6 Branch Elimination
Figure 4.7 SIMD Array Processing with Packed Registers
Figure 4.8 Location of array elements in memory
Figure 4.9 Organisation of composite data-types in memory
Figure 4.10 Suitable structures for SIMD calculation of node activation
Figure 5.1 Initial Communication Phase
Figure 5.2 Weight Update Communication Phase
Figure 5.3 Tree structure for weight update communication phase
Figure 6.1 Remote System Call
Figure 6.2 Block scheduling architecture
Figure 6.3 Solution for master task employing multiple tasks per host
Figure A.1 Packed and scalar SIMD Instructions
Figure A.2 Sum of Absolute Difference Instruction
Figure A.3 SIMD Integer Multiply Operations
Figure A.4 ADDSUB* Instructions
Figure A.5 Horizontal Arithmetic Instruction
Figure A.6 Create Mask Operation
Figure A.7 Packed Logical Operations
Figure A.8 SIMD Compare Operation
Figure A.9 Unpack Operations
Figure A.10 Shuffle Operations
Figure A.11 Pack Operations
Figure A.12 Conversion operations
Figure A.13 Store Mask Operation
Figure A.14 Thread Synchronisation Instructions

List of Tables
Table 3.1 Strategies for parallelising backpropagation
Table 4.1 SIMD Extensions of popular vendors
Table 4.2 Different coding strategies for the vector operation A = B * C
Table 4.3 Comparison of sequential and SIMD calculation of node activation
Table 4.4 Comparison of sequential and SIMD calculation of layer transfer function
Table 4.5 Comparison of sequential and SIMD calculation of output layer errors
Table 4.6 Suitable structures for SIMD calculation of hidden node error
Table 4.7 Comparison of sequential and SIMD calculation of hidden layer errors
Table 4.8 Comparison of sequential and SIMD weight operations
Table 4.9 Comparison of sequential and SIMD operations for calculating the MSE
Table 4.10 Comparison of SIMD Vs Serial instructions over increasing sizes of N
Table 4.11 Basic SIMD operations required for implementing backpropagation
Table 4.12 Precision and clock cycles for 1/x operations
Table 5.1 Relationship between inputs and network size for the parity problem
Table A.1 SIMD Registers and Packed Data-Types
Table A.2 Relationship between SIMD extensions and Registers for Arithmetic Instructions
Table A.3 SIMD Integer Instructions
Table A.4 SIMD Floating-Point Arithmetic Instructions
Table A.5 Create Mask of Packed Data Instructions
Table A.6 Shift Instructions
Table A.7 Logical Instructions
Table A.8 Floating-Point Compare Instructions
Table A.9 Unpack Instructions
Table A.10 Shuffle Instruction
Table A.11 Pack Instructions
Table A.12 Conversion Instructions
Table A.13 Data Movement Instructions
Table A.14 State Management Instructions
Table A.15 Cacheability Control Instructions

List of Graphs
Graph 1.1 Ideal speedups for SIMD enabled code
Graph 1.2 Ideal HPC speedups
Graph 1.3 Sample HTC Speedups
Graph 2.1 The logistic transfer function
Graph 4.1 Padded Vs Unpadded Array Processing
Graph 4.2 Performance of assembly compared to other implementation methods
Graph 4.3 Comparison of Code generated by Intel and Microsoft Compilers
Graph 4.4 Comparison of SIMD implementation methods
Graph 4.5 Speedups for Arithmetic Operations C code
Graph 4.6 Reciprocal Vs Divide
Graph 4.7 Reciprocal Vs Divide (Small Arrays)
Graph 4.8 Relationship between network size and speedup
Graph 4.9 Effect of padding on speedups
Graph 5.1 Speedup for small Networks
Graph 5.2 Speedups for medium to large sized networks
Graph 5.3 Speedups attained for a ten-node cluster over increasing network sizes
Graph 6.1 Variations in time per epoch
Graph 6.2 System adjusting to loss of hosts
Graph 6.3 System adjusting to gain of hosts
Graph 7.1 SIMD speedups for different network sizes
Graph 7.2 Speedups obtained for PVM system on different network sizes
Graph 7.3 HTC system adjusting to a change in network configuration

Chapter 1

Introduction

1 Introduction
Artificial neural networks are simplified models of neural processing; they draw
their inspiration from the workings of biological neural networks found in the brains of
most animals. However, most artificial neural networks are not biologically plausible,
and merely mimic certain aspects of neural interaction in order to accomplish some
information-processing task. The types of problems to which neural network based
solutions have been applied include:

• Data-mining
• Visualisation
• Clustering
• Unsupervised Clustering
• Pattern Classification
• Time Series Processing
• Prediction
• Signal Processing
• Control Problems
• Pattern Recognition
• Image Processing

Neural network solutions are suitable for applications where a conventional
process is not suitable, cannot easily be defined, or cannot fully capture the complexity
in the data [1]. Applications incorporating neural networks have been developed for a
variety of problem domains, including:

• Telecommunications
• Engineering
• Medicine
• Science
• Marketing
• Stock Market Prediction
• Manufacturing
• Weather Prediction
• Finance
• Robotics

These information-processing tasks and the applications to which they are
applied have different characteristics and requirements, and as such no single neural
network model is sufficient for all these applications. A number of different neural
network models have been developed over the years, each with its own strengths and
weaknesses [2], but in general all neural network models share the following properties:
A set of simple processing units.
These processing elements are known as nodes, units or neurons. They
are comparable to the neurons of the brain, in that they receive input from a
subset of the nodes in the network and transmit output values to a possibly distinct
subset of neurons. The state of each neuron is defined by its activation value
and possibly its error and output.
A set of weighted connections between nodes of the network.
Neural networks are highly parallel; a network of N neurons can contain
up to the order of $N^2$ connections. Although many models will contain fewer connections,
the number of connections typically exceeds the number of neurons. Associated
with each connection between two neurons is a real-valued weight representing
the strength of that connection. These weights represent the synaptic connections
between neurons in biological neural networks, and can increase or decrease the
effect the output of one neuron has on the activation value of another, representing
excitatory or inhibitory synapses.
A set of rules for calculating and updating the state of neurons.
The algorithm must define some method for selecting neurons to update.
This can involve some stochastic, probabilistic or deterministic approach. A rule
must also be defined to calculate a neuron’s activation based on the
output/activation values of its input neurons and the weighted strength of each
connection. Although more complex functions have been proposed, most
algorithms calculate a neuron’s activation as a simple weighted sum of its inputs.
A set of rules for adjusting the connection strengths between neurons.
The function performed by the network is largely controlled by its set of
weights. One of the most appealing properties of neural networks is that they
employ some algorithm for adapting the weights from some initially random
distribution. These algorithms are used to train the network to perform some
function using a set of known examples. Some algorithms require that the
correct output be supplied with each example (supervised training), while others
can be trained using just sample input values (unsupervised training). Note that
not all algorithms employ a learning rule: some networks act as memories, in
which the set of weights is calculated as a function of the patterns to be stored.


Figure 1.1 Sample Neural Network

A simple yet important class of neural network model is the feed-forward network
(or multi-layered Perceptron). These networks consist of a number of distinct layers of
neurons, with connections permitted between neurons of different layers only. Input
patterns are supplied to the first layer from the external environment, and activation
values flow forward through the network to the output layer. The activation values of
the output layer are interpreted as the network's output.

Figure 1.2 Feed-forward Neural Networks

Feed-forward networks are usually, although not always, trained using a
supervised training algorithm, the most popular of which is the backpropagation
algorithm [3] or one of its many variants [4]. The algorithm can learn an arbitrary
mapping between a set of input vectors and a set of output vectors, using the layers of
hidden neurons to perform the mapping. The network is trained by example, using a set
of known input-output pairs as a teacher.
The set of samples used for backpropagation training is referred to as the
training set. While the samples of the training set must be representative of the
underlying task, they do not have to include all cases that the network will
encounter; one useful property of neural processing that is captured by the
backpropagation algorithm is the ability to generalise, that is, to produce the correct
output for previously unseen inputs based on their closeness to the sample inputs used
in training. Proper generalisation requires that the network be trained on a large training
set containing many samples of each class of input pattern. For some applications, such
as Japanese character recognition, training sets containing in the order of 10^ training
samples have been used [5].
Backpropagation training consists of two distinct phases. During the first phase a
sample input pattern is presented to the network and activation values flow forward
through the network to the output layer; this phase is known as the forward pass. The
second phase compares the network output with the correct output for that input pattern;
error values are calculated for all output neurons based on the difference between the
required and actual outputs for each node. As sample values are unavailable for
non-output neurons, a gradient-descent approach is used to calculate their errors based on
how much the neurons influenced the output errors.
Once error values have been calculated, these values are used to adjust weights
in such a manner as to bring the network's output closer to the desired output. As the
network must learn the function inherent in the full training set, weight adjustments for
any one pattern must be relatively small in order to prevent the network from unlearning
previously seen samples. As a result each sample must be repeatedly presented to the
network during training.
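
The two phases and the small weight adjustment described above can be sketched roughly as follows. This is a minimal illustration only, assuming a single hidden layer, a logistic transfer function, no bias inputs and a hypothetical learn rate of 0.1; the function and variable names are illustrative and are not taken from the implementations described later in this thesis.

```python
import math

def logistic(a):
    """Logistic transfer function (assumed here for illustration)."""
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, w_hidden, w_output):
    """Forward pass: activation values flow from the inputs to the output layer."""
    hidden = [logistic(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    output = [logistic(sum(w * h for w, h in zip(row, hidden))) for row in w_output]
    return hidden, output

def train_pattern(x, target, w_hidden, w_output, learn_rate=0.1):
    """Present one pattern, update the weights in place, return the error before the update."""
    hidden, output = forward(x, w_hidden, w_output)
    # Output errors: required minus actual output, scaled by the logistic slope.
    delta_out = [(t - o) * o * (1.0 - o) for t, o in zip(target, output)]
    # Hidden errors: each hidden node is blamed according to how much it
    # influenced the output errors through its outgoing weights.
    delta_hid = [
        h * (1.0 - h) * sum(delta_out[k] * w_output[k][j] for k in range(len(w_output)))
        for j, h in enumerate(hidden)
    ]
    # Small adjustments that move the network output towards the target.
    for k, row in enumerate(w_output):
        for j in range(len(row)):
            row[j] += learn_rate * delta_out[k] * hidden[j]
    for j, row in enumerate(w_hidden):
        for i in range(len(row)):
            row[i] += learn_rate * delta_hid[j] * x[i]
    return sum((t - o) ** 2 for t, o in zip(target, output))

# Hypothetical starting weights for a 2-2-1 network.
w_hidden = [[0.5, -0.4], [0.3, 0.8]]
w_output = [[0.7, -0.6]]
# Repeated presentation of the same pattern; each step is deliberately small.
errors = [train_pattern([1.0, 0.0], [1.0], w_hidden, w_output) for _ in range(50)]
```

Because each adjustment is small, the error for the pattern falls only gradually over the fifty presentations, which is the behaviour that makes repeated presentation necessary.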
Due to the repeated presentation of each sample pattern and the large number of
patterns in the training set, backpropagation training can take in the order of days or
weeks to complete [6]. When considered with the fact that it is often necessary to
experiment with several different network structures before a suitable one is found,
this lengthy training time poses a real problem for the development of
backpropagation solutions.
The aim of this thesis is to investigate implementation methods that will reduce
this training time. Methods investigated are restricted to those that can be implemented
on cheap commodity hardware affordable to most research projects. All methods
investigated involve some form of parallel computing. Parallel computing is when a
program uses concurrency to either:

• Decrease the runtime for the solution to a problem,
• Increase the size of the problem that can be solved.

This is achieved by dividing up the calculation among a number of processing
elements, with each processing element performing a part of the calculation and
operating in parallel. During the parallel computation tasks running on different
processing elements may be required to communicate and synchronise with each other.
This introduces a certain amount of overhead into the system which must be taken into
account when designing the parallel algorithm in order to achieve the full benefit of
parallelising the algorithm.

The first method investigated was the possibility of using Single Instruction
Multiple Data (SIMD) type processing to speed up calculations performed on arrays of
data. SIMD processing is a form of parallel computing where the same instruction is
simultaneously issued to all processing elements in the system¹. Each processing
element has access to its own local data, which it uses as operands for each instruction.
SIMD processing is suitable for some highly regular calculations such as vector
processing.
Many scientific and engineering algorithms (including backpropagation training)
involve complex calculations performed on large arrays of data. Performing these
computations is referred to as vector processing, because the algorithm spends the
majority of its time performing vector calculations. The SIMD model is particularly
well suited to these types of calculations. Using this approach each processing element
performs the same instruction on different elements of the array, allowing the system to
process the array in blocks.
However, such approaches are not limited to larger scale parallel computers; a
number of desktop vendors have recently started introducing SIMD type operations to
their instruction sets. These instructions operate on packed registers that allow a number
of array elements to be processed simultaneously. One such vendor is Intel, whose
SSE extension allows four single-precision floating-point
numbers to be processed in a single instruction. The use of such technology is examined
as a possible method of reducing training times using a single computer.
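
The block-wise access pattern can be illustrated with a small emulation. The sketch below is plain Python and does not use any real packed instructions; it only mimics how an array might be padded and then processed four elements at a time, with one running total per packed element. The function name and the assumption that both arrays have equal length are illustrative.

```python
def packed_multiply_add(weights, inputs, width=4):
    """Weighted sum computed in blocks of `width` elements.

    Emulation of the access pattern only: real packed instructions would
    handle each block with a single multiply and a single add.
    """
    # Pad both arrays with zeros so their length is a multiple of the width.
    pad = (-len(weights)) % width
    w = list(weights) + [0.0] * pad
    x = list(inputs) + [0.0] * pad
    lanes = [0.0] * width  # one running total per packed element
    for start in range(0, len(w), width):
        # Conceptually one packed multiply and one packed add per block.
        for lane in range(width):
            lanes[lane] += w[start + lane] * x[start + lane]
    # Combine the per-lane totals; also report how many blocks were processed.
    return sum(lanes), len(w) // width

total, blocks = packed_multiply_add([1.0, 2.0, 3.0, 4.0, 5.0],
                                    [1.0, 1.0, 1.0, 1.0, 2.0])
```

Five elements are padded to eight, so the emulated loop runs over two blocks rather than five individual elements.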

¹ It is possible to switch off some processing elements that are not to execute the instruction.


The second approach investigates the use of cluster computing techniques to
implement a parallel version of the algorithm. Cluster computing attempts to create a
virtual parallel computer by configuring a number of standard networked machines to
act as a single computer. One of the main motivations behind cluster computing is the
superior price/performance ratio that it offers over traditional purpose-built
high-performance parallel computers. This is due to the fact that desktop computers have
now become mass-produced products, a fact that is reflected in their price.
Cluster computing is well established as a viable alternative to expensive
purpose-built high-performance computers (HPC); this can be demonstrated by the fact
that the third most powerful supercomputer currently in existence is a cluster computer.
In all, over 40% of the top-500 supercomputers are cluster computers, offering almost
50% of the total performance [7]. However, cluster computers are not restricted to just
the high end of the supercomputing market; cluster computers are highly scalable and
have been constructed from anywhere between two and a few thousand processors, allowing
a cluster to be built to meet both the budget and requirements of a project.
“A cluster is a type of parallel or distributed processing system, which consists
of a collection of interconnected stand-alone computers working together as a single
integrated computing resource” [8]. While this definition includes dedicated
high-performance clusters, it also includes another important type of cluster computing -
High Throughput Cluster Computing (HTC).
“A High Throughput Computing environment strives to provide large amounts
of processing capacity to customers over long periods of time by exploiting existing
computing resources on the network” [9]. HTC systems are often referred to as
scavenger systems because they monitor a pool of networked machines, and allocate
work to these machines as they become idle.
The main difference between a high-performance and a high-throughput system
is the motivation behind their development. High-performance clusters aim to build
cheap yet powerful computers to serve some particular task or group. High-throughput
clusters aim to tap into the large number of wasted CPU cycles associated with idle
machines sitting on the network.
In general a HTC application operates in a more unreliable and unpredictable
environment than a HPC application. The resources on which an application runs are
not owned by the users of the application, and as such they have no control over their use.
The fact that each machine retains its local ownership has two consequences for an
application running on them: firstly, the machines may be used for other computations
while also running the application, and secondly, there is an increased likelihood of task
failure caused by the owner reclaiming his/her machine [10].
As a result of this difference in operating environments not all parallel
algorithms are suitable for HTC applications. Indeed in one definition of HTC it is
noted that a HTC approach is “appropriate when the problem can be decomposed into
many (very many) smaller problems that are essentially independent” [11]. This poses a
problem for its use with regard to the backpropagation algorithm.
Of the different strategies for parallelising the algorithm only one involves
completely independent tasks running in parallel. This requires training complete
networks on different machines and selecting the best one. If the training time for a
single network is to be reduced then a certain amount of synchronization is required
between the parallel tasks. This thesis will also examine ways of limiting this problem,
in order to exploit as much as possible the vast amount of untapped computing power
available in idle PCs for the training of backpropagation neural networks.
In summary the goal of this work is to reduce backpropagation training time as
cheaply as possible, by increasing the throughput of the underlying computation. This
increased throughput can be measured in terms of speedup: the performance gain or loss
achieved after modifying an algorithm to incorporate some parallel feature. Speedup is
calculated as:

$$\text{speedup} = \frac{\text{execution time of a serial program}}{\text{execution time of a parallel program}}$$

On a single host machine for which SIMD instructions are available the ideal
speedup would be equal to the number of packed elements processed simultaneously in
the registers. For example, if four packed elements are processed by each SIMD
instruction the ideal speedup should be four, see Graph 1-1.
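
As a small worked illustration of the speedup ratio, the sketch below compares a measured speedup against the ideal value for four packed elements. The timings are hypothetical figures chosen only to make the arithmetic easy to follow; they are not results from this thesis.

```python
def speedup(serial_time, parallel_time):
    """Speedup as the ratio of serial to parallel execution time."""
    return serial_time / parallel_time

# Hypothetical timings in seconds, for illustration only.
serial_seconds = 120.0
simd_seconds = 40.0

measured = speedup(serial_seconds, simd_seconds)
ideal = 4  # four packed elements processed by each SIMD instruction
fraction_of_ideal = measured / ideal
```

A value of `measured` below one would indicate that the modification made the program slower rather than faster.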

[Graph: Sample SIMD Speedups. Speedup plotted against size of problem, with one series for four packed elements and one for two packed elements.]

Graph 1-1 Ideal speedups for SIMD enabled code

On a high performance dedicated cluster the workload is divided between the
machines comprising the cluster. In the simple case where all machines are identical the
workload can be evenly divided among them, and the ideal speedup is directly related to
the number of hosts in the system, see Graph 1-2.

[Graph: Sample HPC Speedups. Speedup plotted against number of hosts.]

Graph 1-2 Ideal HPC speedups

The use of speedups to measure increased throughput in a HTC environment is
more problematic than for the other two parallel methods. This is due to the fact that a
HTC application runs on volunteer hosts, with the number and capacity of the hosts
varying within and between different runs of the system. As a result the speedups
obtained for each run of the system may vary greatly. However, this does not present a
major problem. The motivation behind HTC processing is to utilise what would
otherwise be wasted computing resources, and as a result any increase in throughput is an
advantage.

[Graph: Sample HTC Speedups, plotted over runs of the system.]

Graph 1-3 Sample HTC Speedups

Thesis Structure

This section outlines the thesis structure.

Chapter 2 outlines the theory of artificial neural networks with particular regard
to feed-forward networks and the backpropagation algorithm. Chapter 3 covers the
theory of parallel computing and introduces the current state of the art approaches
towards the parallel implementation of backpropagation. These two chapters represent
the main bulk of background research undertaken as part of this project; they are
intended to provide an overview of the fields of neural networks and parallel computing
and introduce many concepts required to understand later chapters.
Each of the next three chapters deals with a separate aspect of the parallel
implementation. Chapter 4 describes a SIMD enabled implementation of
backpropagation. As the application was developed for Intel's Pentium family of
processors, this chapter first introduces Intel's SIMD architecture and describes a
number of implementation options. These implementation methods were compared
and the results are presented at the end of the chapter.
Chapter 5 describes a HPC implementation of backpropagation on a dedicated
cluster computer. The chapter contrasts two message-passing systems that are often
employed for writing cluster applications: PVM and MPI. As the PVM system was the
one chosen for this implementation it is outlined in greater detail. Of the state-of-the-art
parallel implementation methods described in chapter 3, two are well suited to cluster
computing. These are explained in greater detail than in that earlier
chapter, and their implementation using PVM is outlined. Finally, speedup results
obtained from tests of the system are presented.
Chapter 6 outlines a HTC implementation of backpropagation. The general
requirements of a HTC environment are introduced and a typical resource management
system, Condor, is described. Both approaches to parallel backpropagation examined in
the previous chapter are analysed and their potential limitations for implementation in
a HTC environment are identified. The implementation of both these methods is
outlined, results are presented and a number of shortcomings of the system are
highlighted. As most of the literature surveyed was concerned with HPC
implementations of the algorithm, the work presented in this chapter represents the main
contribution that this thesis has made to the research area.
The final chapter summarises the thesis and the overall findings of this research.

Contributions
The main contributions of this research include:

An evaluation of the different SIMD coding techniques for neural network
simulation. This involves an analysis and study of assembly, intrinsics,
object-oriented classes and compiler based implementation methods, as presented in
chapter 4.
An analysis and study of the requirements of a HTC implementation of the
backpropagation algorithm, as presented in chapter 6. This involves making
adjustments to the basic algorithm in order to allow it to take maximum advantage
of the different environment under which it runs.

Chapter 2

Artificial Neural Networks

2 Artificial Neural Networks

2.1 Introduction
Artificial neural networks (ANNs) are biologically inspired computational methods,
and as their name might suggest they draw their inspiration from the interaction of
neurons within the brain. They are motivated by the theory that all biological
intelligence is based on local interactions between neurons [12], and that its underlying
principles can be captured mathematically and transferred to a computer in order to
serve some useful purpose.
Such networks consist of a number of artificial neurons, an interconnection
strategy for connecting them, as well as rules for calculating their current state and
adapting the network's structure. Numerous models of artificial neural networks have
been developed over the years, each with its own distinct properties and abilities.
Most employ a variant of the same basic generic neuron in their processing. A complete
analysis of these models is beyond the scope of this thesis, which will limit itself to a
single model: backpropagation neural networks. A survey of current neural network
models may be found in [2], and a history of the field may be found in [13].

2.2 Artificial Neurons
The generic artificial neuron is a processing element which receives inputs from
other artificial neurons in the network and produces an output, which either serves as
input to other artificial neurons in the network, forms part of the network output, or
both. Associated with each input to an artificial neuron is a real-valued weight,
representing the strength of the connection between the two neurons. A neuron's input
(or activation value) is calculated by summing its weighted inputs:

$$a_j = \sum_{i=0}^{N} w_{ji}\, O_i$$

where
$a_j$ is the activation value of node $j$,
$w_{ji}$ is the weight associated with the connection from node $i$ to node $j$,
$O_i$ is the output of node $i$,
$N$ is the number of inputs to node $j$.
Alternative activation functions include cubic nodes and sigma-pi units [14].
This activation value is then passed through some transfer function to produce the
neuron's output:

$$O_i = f(a_i)$$

where
$a_i$ is the activation value of node $i$,
$O_i$ is the output value of node $i$,
$f(\cdot)$ is the threshold function.
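
A generic neuron of this kind can be sketched in a few lines. The example below is illustrative only: the weights and input values are hypothetical, and the logistic function is assumed as one possible choice of transfer function.

```python
import math

def activation(weights, outputs):
    """Weighted sum of the outputs of the nodes feeding this neuron."""
    return sum(w * o for w, o in zip(weights, outputs))

def logistic(a):
    """One possible transfer function; assumed here for illustration."""
    return 1.0 / (1.0 + math.exp(-a))

# Hypothetical weights and incoming output values.
weights = [0.5, -1.0, 0.25]
incoming = [1.0, 0.5, 2.0]

a = activation(weights, incoming)  # 0.5 - 0.5 + 0.5
o = logistic(a)                    # the neuron's output
```

Swapping `logistic` for a different function changes the neuron's output behaviour without affecting how the activation value itself is calculated.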

Figure 2.1 A Generic Artificial Neuron

2.3 Network Topology
The manner in which neurons of the network are interconnected is referred to as
the network topology, and has a huge effect on the power and complexity of the
underlying network. Using topology as the classifying criterion, all neural network models
can be divided into two distinct classes: recurrent and non-recurrent networks.
• Recurrent Networks allow for backward connections between neurons, which
result in cycles between two or more neurons. If a cycle forms in the network the
output of any neuron in the cycle is directly or indirectly fed back into that same
neuron as input. These cycles allow the network to retain some memory of past
events by feeding its output back to earlier parts of the network.

• Non-Recurrent Networks are free from such cycles. Non-recurrent networks can
be further divided into feed-forward and non feed-forward neural networks. A
feed-forward neural network restricts all connections to be in a forward direction only,
whereas a non-recurrent non feed-forward network permits lateral and backward
connections, provided no cycles are formed. An equivalent fully connected layered
neural network can be found to represent any non-recurrent neural network.

Because of the presence of cycles between neurons in recurrent neural networks
it may take a number of discrete time steps before activation values stabilise and the
network output can be interpreted. For networks that stabilise in a fixed known number
of time steps it is possible to construct an equivalent feed-forward network, where each
layer of the network represents the complete recurrent network for a particular time
step² [15]. An example of such a network is the backpropagation through-time network
[16].

2.3.1 Feed-Forward Neural Networks
Any network in which all connections are in the same direction is a feed-forward
network. These can be further refined into layered feed-forward and fully connected
layered neural networks.
• Layered Feed-Forward Neural Networks
If the nodes in a feed-forward neural network can be divided into layers where
the nodes in each layer receive input only from the nodes in the previous layer, it is said
to be a layered feed-forward neural network. Layered feed-forward neural networks are
also referred to as multi-layered Perceptrons [17].

• Fully Connected Layered Feed-Forward Neural Networks
A special case of a layered feed-forward network is when each node within a
layer receives input from all nodes in the previous layer, and sends output to all nodes in
the subsequent layer. Such networks are referred to as fully connected layered neural
networks.
² This is not true of all types of recurrent networks.


[Figure: example network topologies, with panels labelled Fully-Connected Network, Recurrent Network, Layered Networks, Feed-Forward Networks and Non-Recurrent Networks. Notes on the panels: a cycle exists between nodes B, C and E; a lateral connection exists between nodes B and C; a backward connection exists between nodes C and E; connection E-A skips a layer; no connections exist between nodes E-B or D-C.]

Figure 2.2 Neural Network Topologies

Fully connected layered feed-forward neural networks form an important class
of neural network models, in part because their structure allows for a neat and efficient
representation using matrix operations and notation. It is worth noting that an equivalent
fully connected layered neural network can be found to represent any non-recurrent
neural network.
The weights between any two layers of fully connected neurons can be stored in
a $(N_L, N_{L-1})$ matrix $W$, where $N_L$ is the number of neurons in layer $L$. Each element
$w_{ji}$ of $W$ represents the weight of the connection between neuron $i$ in layer $L-1$ and
neuron $j$ in layer $L$. Such a notation is convenient because if the inputs to a layer are
stored in a $(N_{L-1}, 1)$ matrix, the activation values for the entire layer $L$ can be
calculated by a single $(N_L, N_{L-1}) \times (N_{L-1}, 1)$ matrix multiply.
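
The matrix form of the calculation can be sketched as below, using plain lists rather than any particular matrix library. The weights and input values are hypothetical; each row of the matrix is assumed to hold the weights feeding one neuron of layer L.

```python
def layer_activations(W, inputs):
    """Multiply an (N_L, N_{L-1}) weight matrix by an (N_{L-1}, 1) input column."""
    # Every row must have one weight per neuron in the previous layer.
    assert all(len(row) == len(inputs) for row in W)
    return [sum(w * x for w, x in zip(row, inputs)) for row in W]

# Hypothetical weights: 2 neurons in layer L, 3 neurons in layer L-1.
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]
previous_layer = [2.0, 4.0, 6.0]

a = layer_activations(W, previous_layer)  # one activation value per neuron in layer L
```

Each entry of the result depends only on one row of the matrix, which is what makes the calculation easy to divide into independent pieces.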
When the neurons of a network are divided into distinct layers, three classes of
neurons can be identified:

Input Neurons
Input neurons are neurons in the first layer of the network. They perform
no processing and merely serve as a bridge passing on the input values from the
external environment to the first layer of the network. By convention the input
layer is not counted as a layer of the network.

Output Neurons
The last layer of the network is known as the output layer and all neurons
in that layer are referred to as output neurons. Output neurons are the only
neurons to propagate values outside the network.

Hidden Neurons
All other neurons have no external connections outside the network and
are referred to as hidden neurons.

Figure 2.3 Naming convention for a 3-layered network

An interesting feature of multi-layer neural networks is that for a three or more
layered network an equivalent two-layered network can be constructed that performs the
same function. This would suggest that a single hidden layer is all that is necessary to
train a network on, but in practice networks with increased number of hidden layers
may train faster than networks with only a single hidden layer.

2.4 Training Algorithm
A second major criterion for classifying neural networks is the manner in which
the network selects a set of weights. With neural networks all the information stored in
the network is encoded in the weights. This information is distributed across all the
weights in the network, and as such the weights control in part what function the
network applies to the input pattern. Changing the set of weights associated with a given
network will change the function performed by that network. But finding the correct set
of weights required for the network to perform a certain function can be a daunting task.
It is impossible to hand-select a correct set of weights for all but the simplest of toy
examples, so all neural network models employ what is referred to as a training
algorithm to set weight values. It is the structure of these algorithms that is compared
when networks are classified according to training algorithm.

The algorithm used to select weights is referred to as a training algorithm
because most networks converge on a suitable set of weights after an iterative process of
gradual weight changes known as training. During training the network is repeatedly
presented with a set of input patterns referred to as the training set. After each
presentation of an input pattern, moderate weight changes are calculated in accordance
with the algorithm’s rules. Training continues in such a manner until a set of weights is
found which is suitable for all the patterns in the training set.

If the training set contains input patterns paired with target output patterns the
algorithm is referred to as a supervised training algorithm. When the algorithm is
required to find a set of weights without the aid of target output values it is an
unsupervised training algorithm. Not all neural network models employ iterative
training algorithms; some such as Hopfield networks [18] calculate their weight matrix
as a function of its inputs after a single pass through the training set.

2.5 The Perceptron
One of the first neural network models to employ a training algorithm for the
purpose of weight selection was the Perceptron [17]. The Perceptron is an example of a
simple single layered fully connected feed-forward neural network, where all the input
neurons are connected directly to all the output neurons. As such its weights can be
represented by an $(O, I)$ matrix $W$, where $O$ equals the number of output neurons and $I$ the
number of input neurons. Each row of $W$ stores the connection strengths for all the
weights connected to a single output neuron, and each output neuron is influenced
solely by the weights stored in one row of $W$ and the value of the current input pattern.
As such any Perceptron of $N$ output neurons (where $N > 1$) could be decomposed into $N$
independent Perceptrons of a single output neuron, by dividing the rows of $W$ between
them. See Figure 2.4.
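
The decomposition can be checked with a short sketch. The weights and the input pattern below are hypothetical, and a zero threshold is assumed, with an activation of exactly zero mapped to +1 purely for the sake of the example.

```python
def perceptron_outputs(W, x):
    """Outputs of a single-layer Perceptron; W holds one row of weights per output neuron."""
    # Zero threshold assumed; mapping an activation of exactly 0 to +1 is an assumption.
    return [1 if sum(w * xi for w, xi in zip(row, x)) >= 0 else -1 for row in W]

# Hypothetical two-output Perceptron with three inputs.
W = [[0.5, -0.5, 1.0],     # weights into output neuron A
     [-1.0, 0.25, 0.25]]   # weights into output neuron B
x = [1, -1, 1]

combined = perceptron_outputs(W, x)
# The same rows treated as two independent one-output Perceptrons.
separate = [perceptron_outputs([row], x)[0] for row in W]
```

Both computations produce the same outputs, since no output neuron ever reads a row of weights other than its own.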

[Figure: a two-output Perceptron and the equivalent pair of one-output Perceptrons, each shown with its weight matrix.]

Figure 2.4 Decomposition of multi-output Perceptrons

The simple case of a Perceptron with one output node will be considered here.
All inputs to the Perceptron are either -1 or 1, with weights acquiring any real valued
number. As with most layered networks, output values for the input neurons are copied
directly from the current training pattern. A generic artificial neuron is employed by the
output node, which calculates its activation value $a$ by summing its weighted input
values:

$$a = \sum_{i=0}^{N} w_i x_i$$

where
$a$ is the activation value of the output node,
$w_i$ is the weight associated with input $i$,
$x_i$ is the current value of input $i$,
$N$ is the number of inputs to the network.

2.5.1 Perceptron Transfer Function
Perceptrons employ a hard-limiting bipolar linear transfer function to produce
an output of either -1 or 1, according to the following rule:

$$f(a) = \begin{cases} +1 & a > t \\ -1 & a < t \end{cases}$$

where $t$ is the threshold value associated with the output neuron. The threshold
value can acquire any real valued number. Its use in the transfer function has the net
effect of shifting the function a distance of $t$ from the origin along the x-axis, as shown
in Figure 2.5. The value of $t$ is dependent on the function that the network is required to
learn, and as such is another parameter that the network must adapt.

[Figure: three plots of the hard-limiting transfer function, each for a different value of the threshold t, with the output stepping from -1 to +1 at the threshold.]

Figure 2.5 The effect of the threshold value on the transfer function
In order to simplify the training algorithm the value of t is generally replaced in
the transfer function with a constant value of zero. As a non zero value of t is required
for the network to learn all but the simplest of functions, an extra input - referred to as a
bias - is added to the output neuron; the weight associated with this input represents the
value of -t, and the input is assumed to be always on (+1). A neuron using a variable
threshold is equivalent to one with a bias input. The use of a bias input allows the
training algorithm to treat the threshold the same as all other weights when calculating
changes to its value. If shown in network diagrams the bias input is symbolised by a
solid black dot connected to the output neuron as shown in Figure 2.6.

Figure 2.6 Neural network diagram showing biased inputs
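
The equivalence between a variable threshold and a bias input can be checked with a short sketch. The function names below are illustrative and do not come from the thesis, and the tie case a = t is mapped to -1 here as an assumption, since the rule above does not state it.

```python
def step(a, t=0.0):
    """Hard-limiting bipolar transfer function."""
    # Assumption: the tie a == t maps to -1; the rule in the text leaves it open.
    return 1 if a > t else -1


def output_with_threshold(weights, inputs, t):
    # Weighted sum compared against an explicit threshold t.
    a = sum(w * x for w, x in zip(weights, inputs))
    return step(a, t)


def output_with_bias(weights, inputs, t):
    # Append an always-on input (+1) whose weight is -t, then threshold at zero.
    a = sum(w * x for w, x in zip(weights + [-t], inputs + [1]))
    return step(a, 0.0)
```

Both functions should return the same output for any bipolar input pattern, which is why the training algorithm can treat the threshold as just another weight.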

2.5.2 Perceptron Training
The Perceptron implements a form of supervised learning in its training
algorithm. A training set consisting of example input patterns together with the target
output is assembled to train the network. Each row in the training set represents a
particular example that the network must learn, and it is the task of the training
algorithm to find a set of weights that produce the desired output for all examples in the
training set. For each input pattern weights are adjusted in proportion to the difference
between desired and actual outputs, according to the following rule:

$$\Delta w_i = \eta (T - O) x_i$$

where
$\Delta w_i$ stores the change to weight i,
T is the target output for the current input pattern,
O is the actual network output,
$x_i$ is the value of the i-th input node for the same pattern.

The term $\eta$ is referred to as the learn rate. It is typically a small constant
between zero and one, the use of which controls the size of weight changes when
training.

As the output of a Perceptron is restricted to a value of either -1 or 1 the term
(T - O) can equate to one of only three values,

• If the output is equal to the target value the term will evaluate to zero, and as a result
  $\Delta w_i$ will be zero for all i. Simply put, if the current input pattern produces the
  correct result no changes are made to the weights.

• If the network produces a negative result but the target requires a positive result
  (T - O) is equal to 2 and $\Delta w_i = 2\eta x_i$ for all i.

• If the network produces a positive result but the target requires a negative result
  (T - O) is equal to -2 and $\Delta w_i = -2\eta x_i$ for all i.

The inputs to a Perceptron are also restricted to values of either -1 or 1, thus
when the network undershoots, producing a false negative, all weights from positive
inputs are increased by $2\eta$ and all weights from negative inputs are decreased by $2\eta$.
When the same pattern is presented to the network again the positive inputs will have a
greater effect on the network's output, and the negative inputs a lesser effect. As a result
the node's activation will become more positive. After repeated presentations of the
training pattern the node's activation becomes greater than 0, and the network produces
the correct positive result. Similarly a false positive result, when the network overshoots,
has the opposite effect on the network weights.
As stated above the learn rate ($\eta$) is used to control the size of weight changes. It
is necessary because the Perceptron is trained over several examples. If the weight
changes after viewing one example are too severe the network will unlearn most of what
it has seen in previous examples. Under such conditions the network will fail to learn a
set of weights that fits all the examples in the training set; instead the weights will
oscillate greatly between each example presented. For this reason it is desirable to select
a small value for $\eta$, resulting in minor weight changes after each presentation and a
gradual convergence of the weights to fit the data. An interesting feature of the
Perceptron training algorithm is that if such a set of weights exists the algorithm is
guaranteed to find it provided that the value of $\eta$ is small enough.
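
The training rule can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the names `train_perceptron` and `predict`, the zero initial weights, the default learn rate of 0.1 and the epoch limit are all assumptions. The threshold is handled as a bias weight stored in the last position.

```python
def train_perceptron(patterns, eta=0.1, max_epochs=100):
    """patterns: list of (inputs, target) pairs with bipolar values.

    Returns the weight list, with the bias weight (-t) in the last position.
    """
    n_inputs = len(patterns[0][0])
    w = [0.0] * (n_inputs + 1)           # assumed starting point: all zeros
    for _ in range(max_epochs):
        errors = 0
        for x, target in patterns:
            xb = list(x) + [1]           # always-on bias input
            a = sum(wi * xi for wi, xi in zip(w, xb))
            out = 1 if a > 0 else -1
            if out != target:
                errors += 1
                # delta w_i = eta * (T - O) * x_i
                w = [wi + eta * (target - out) * xi for wi, xi in zip(w, xb)]
        if errors == 0:                  # every pattern classified correctly
            break
    return w


def predict(w, x):
    a = sum(wi * xi for wi, xi in zip(w, list(x) + [1]))
    return 1 if a > 0 else -1


# Bipolar AND is linearly separable, so training should settle on a solution.
and_patterns = [((-1, -1), -1), ((-1, 1), -1), ((1, -1), -1), ((1, 1), 1)]
and_weights = train_perceptron(and_patterns)
```

Weights are only changed when the output is wrong, which matches the three cases for (T - O) listed above.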

2.5.3 Limitations of Perceptrons
In an extensive study of Perceptrons, Minsky and Papert [19] showed that the
Perceptron is unable to learn a large class of functions, referred to as non-linearly
separable functions. The Perceptron is unable to learn such functions because no set of
weights exists that will produce the correct output for all patterns in the training set.
Perhaps this fact is best conveyed by way of example. Consider the case of a
two input Perceptron; the input space for this device is a 2 unit square centred on the
origin in 2D coordinate space. It is the task of the training algorithm to find a set of
weights that divides the input space into two categories (-1, +1). This can be interpreted
graphically as finding a single straight line to divide the input space into the required
categories. If a function is non-linearly separable no single line can be found to divide the
input space into the required categories, and thus no such set of weights exists. This can be
clearly demonstrated by examining the case of Boolean XOR. For a Perceptron to learn
this function it must find a line that separates the inputs (-1, 1), (1, -1) from (-1, -1), (1,
1), which is clearly impossible.
Interestingly if two lines are used the input space is easily divided into the two
correct categories. This is equivalent to using a two-layer network where hidden nodes
perform the linearly separable functions represented by each line, and an output node
performs a linearly separable function on the hidden nodes, see Figure 2.7. Unfortunately
the Perceptron training algorithm has no method of assigning error values to hidden
nodes, and is therefore unable to compute changes for weights connecting the input to
hidden nodes.

XOR Truth Table

Input 1   Input 2   Output
  -1        -1        -1
  -1         1         1
   1        -1         1
   1         1        -1

Figure 2.7 Linear separability and Boolean functions
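
The two-line solution to XOR can be illustrated with a hand-wired two-layer network. The weights below are chosen by hand for this sketch and are an assumption, not values taken from the thesis or found by training; each hidden node implements one dividing line and the output node combines them.

```python
def step(a):
    # Hard-limiting bipolar transfer function with the threshold folded into a bias.
    return 1 if a > 0 else -1


def xor_two_layer(x1, x2):
    # Hidden node h1 responds only to (1, -1): it lies on one side of x1 - x2 = 1.
    h1 = step(x1 - x2 - 1)
    # Hidden node h2 responds only to (-1, 1): it lies on one side of x2 - x1 = 1.
    h2 = step(x2 - x1 - 1)
    # The output node performs a linearly separable OR of the two hidden nodes.
    return step(h1 + h2 + 1)
```

Each node on its own computes a linearly separable function; it is only their combination that produces XOR.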

2.6 The Backpropagation of Error Training Algorithm
The problem of assigning error values to hidden nodes in multi-layer neural
networks was overcome with the development of the backpropagation of error training
algorithm. Like the Perceptron training algorithm, backpropagation is a supervised
training algorithm, where a training set of input/output pattern pairs is used to train the
network.
On the presentation of an input pattern to the network backpropagation performs
two passes through the network. During the first pass (forward pass) the algorithm
computes the network output using the current set of weights. For the second
(backward) pass an error value is computed for all output nodes, and these errors are
propagated backwards through the network to all hidden nodes. Weight changes are
then calculated based on the error value of each node. It is from the second pass that the
algorithm derives its name, backpropagation of error.

2.6.1 Forward Pass
As part of the forward pass output values propagate from the input layer to the
output layer on a layer-by-layer basis, in a series of time steps, with the nodes in each
layer assumed to calculate their outputs within the same time step. As with the
Perceptron a node's output is calculated based on its activation value, which is simply a
weighted sum of its inputs,

$$a_{j,L} = \sum_{i=0}^{N_{L-1}} w_{ij} x_i$$

where
$a_{j,L}$ is the activation value of node j in layer L,
$w_{ij}$ is the weight connecting node i in layer L-1 with node j in layer L,
$x_i$ is the output of node i in layer L-1,
$N_{L-1}$ is the number of nodes in layer L-1.

Once calculated this activation is converted to the node's output value by putting
it through some transfer function f(x). Backpropagation requires that this transfer
function be both continuous and differentiable. It does not however require that every
node in the network (or indeed layer) utilise the same transfer function, and it is not
uncommon to find that the output layer nodes employ a different transfer function to the
hidden nodes.
Output values for all nodes in a hidden layer serve as input values for all nodes
in the subsequent layer, while output values of output neurons are interpreted as the
network output.
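
The forward pass can be sketched as a loop over layers. The weight layout used here (`weights[layer][j][i]`, with the bias weight last in each row) and the choice of the logistic function as the default transfer function are assumptions made for this illustration.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def forward(weights, inputs, f=logistic):
    """Propagate one input pattern through a layered network.

    weights[layer][j][i] connects node i of the previous layer to node j of
    this layer; the last entry of each row is the bias weight (assumed layout).
    Returns the outputs of every layer, starting with the input layer.
    """
    outputs = [list(inputs)]
    for layer in weights:
        prev = outputs[-1] + [1.0]       # always-on bias input
        activations = [sum(w * x for w, x in zip(row, prev)) for row in layer]
        outputs.append([f(a) for a in activations])
    return outputs
```

Keeping the outputs of every layer, rather than only the final one, is convenient because the backward pass needs them to compute errors and weight changes.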

2.6.2 Backward Pass
The backward pass calculates error values for all nodes. It begins with the output
layer where the target output for each node is known for the current training pattern. In
this layer the rule used to calculate a node's error is similar to that of the Perceptron, except
that for backpropagation the derivative of the transfer function used is also included.
This has the net effect of producing larger errors on nodes where the transfer function's
rate of change is greatest, that is, where a small change in the weights will have most effect,

$$\delta_i = f'(a_i)(T_i - O_i)$$

where
$\delta_i$ is the error for output node i,
$T_i$ is the target output for node i,
$O_i$ is the actual output for node i,
$f'(a_i)$ is the derivative of node i's transfer function.

Hidden nodes have no target output with which to calculate their error, so instead
the error is estimated based on its contribution to the error values of all nodes in the
subsequent layer,

$$\delta_{i,L} = f'(a_i) \sum_{j=0}^{N_{L+1}} \delta_j w_{ij}$$

where
$\delta_{i,L}$ is the error for hidden node i in layer L,
$f'(a_i)$ is the derivative of node i's transfer function,
$\delta_j$ is the error for node j of layer L+1,
$w_{ij}$ is the weight connecting hidden node i in layer L to node j in layer L+1,
$N_{L+1}$ is the number of nodes in layer L+1.
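
The two error rules can be sketched as follows, assuming logistic transfer functions throughout so that the derivative can be written in terms of the node output as O(1 - O). The function names and the `next_weights[j][i]` layout are illustrative assumptions.

```python
def output_deltas(outputs, targets):
    # delta_i = f'(a_i) * (T_i - O_i), with f'(a_i) = O_i * (1 - O_i) for the logistic.
    return [o * (1.0 - o) * (t - o) for o, t in zip(outputs, targets)]


def hidden_deltas(hidden_outputs, next_weights, next_deltas):
    """Errors for one hidden layer.

    next_weights[j][i] connects hidden node i to node j of the following
    layer (assumed layout); next_deltas[j] is that node's error.
    """
    deltas = []
    for i, o in enumerate(hidden_outputs):
        # Sum the errors of the next layer, weighted by the connecting weights.
        back = sum(next_weights[j][i] * next_deltas[j]
                   for j in range(len(next_deltas)))
        deltas.append(o * (1.0 - o) * back)
    return deltas
```

For a network with several hidden layers, `hidden_deltas` would be applied repeatedly, working from the last hidden layer back towards the first.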

2.6.3 Backpropagation Transfer Functions
The transfer function is the function employed by a node to convert its activation
value into an output value. This can be almost any function, and it is not necessary that
all neurons employ the same function. The one actual requirement that backpropagation
places on the transfer function used is that it is differentiable and therefore continuous.
There are however a number of other properties which, although not required, are desirable
in any transfer function. These include the following:

• That it is fast to compute.
  The transfer function must be calculated for all hidden and output
  neurons after the presentation of each training pattern to the network. A
  computationally expensive transfer function would greatly extend training times.

• That its derivative is fast to compute.
  Similarly, as the derivative of the transfer function must be calculated as
  often as the actual transfer function, it is just as important that it be cheap to
  compute.

• That it be non-linear.
  This is really only a requirement for hidden layer neurons; if the hidden
  layer computes a linear function of its input, the network is equivalent to a single
  layered network and therefore unable to compute non-linearly separable functions.

• That its range be of small magnitude.
  Again this depends on the transfer functions used in later layers of the
  network. These functions have high and low saturation points; for all input
  exceeding these points the output will approach a constant high or low value. If
  the function produces a near constant value its rate of change will be close to
  zero. But as weight updates are calculated in proportion to this rate of change
  these too will be close to zero, and the network will be unable to learn as a
  result.
  The inputs to a neuron in layer L are the outputs of neurons in layer L-1;
  if these are large there is a greater likelihood that its activation will exceed one
  of its saturation points. This can be avoided by using transfer functions that have
  a range within an acceptable level, typically [0, 1] or [-1, 1]. For the same reason
  it may also be necessary to normalise the training set.
  It is worth pointing out that there is a similar effect referred to as
  network paralysis [20]. This is caused by some weights acquiring overly large
  values during the course of training, producing a near zero gradient and
  effectively freezing all incoming weights to that neuron.

2.6.3.1 Common Transfer Functions
While in theory almost any function can be used in a backpropagation network,
in practice most networks are trained using only a few types of functions. By far the
most common transfer function used is the standard logistic function,

$$f(a) = \frac{1}{1 + e^{-a}}$$

Graph 2-1 The logistic transfer function.

This function meets all the requirements of the backpropagation algorithm; it is
non-linear and continuous, has a range of [0, 1] and both it and its derivative
$f'(a_i) = O_i(1 - O_i)$ are easy to compute.
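
The convenience of the logistic derivative, and the saturation effect described earlier, can both be seen in a short sketch. The function names are illustrative assumptions.

```python
import math


def logistic(a):
    return 1.0 / (1.0 + math.exp(-a))


def logistic_derivative_from_output(o):
    # The derivative is computed from the node's output alone: O * (1 - O).
    # No further call to exp() is needed once the forward pass has been done.
    return o * (1.0 - o)


# Saturation: for a large activation the output approaches 1 and the
# derivative, and with it any weight change, approaches zero.
saturated_slope = logistic_derivative_from_output(logistic(20.0))
```

The derivative is largest at an activation of zero and falls away on both sides, which is why large inputs or large weights slow learning.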

Other commonly used transfer functions include:

• The Gaussian Function: $f(x) = \exp(-x^2)$

• The Hyperbolic Tangent Function: $f(x) = \tanh(x)$

• The Hyperbolic Secant Function: $f(x) = \operatorname{sech}(x)$

It has been shown that a two layered backpropagation neural network (one
hidden layer and one output layer) employing the logistic transfer function can
approximate any continuous function, given a sufficient number of hidden neurons [21].
Similarly Gaussian based networks are also universal approximators [22]. However for
any given function the minimum number of hidden neurons required to approximate it
using logistic transfer functions will generally differ from the minimum number of
neurons required to approximate it using Gaussian transfer functions. This is due to the
fact that the functions have different boundary regions and one function may be more
efficient at dissecting the problem than the other. As the time required to train a network
is related to the number of hidden neurons, the selection of transfer function is an
important aspect of neural network design, which is viewed by some experts as being
as important as network topology and training algorithm [23].

2.6.4 Weight Changes
Once the algorithm has calculated error values for all output and hidden neurons
it uses those error values to calculate changes for the weights connected to each node's
inputs, using the equation

$$\Delta w_{ij} = \eta \delta_j O_i$$

where
$\Delta w_{ij}$ is the weight change for the weight connecting node i in layer L
with node j in layer L+1,
$\eta$ is the learn rate,
$\delta_j$ is the error value for node j,
$O_i$ is the output of node i.

The learn rate value may be a constant across the network, or it may vary across
the layers or even weights of the network. A number of approaches have also been
attempted which reduce the learn rate as training progresses.

The weight update itself is a simple addition of $\Delta w_{ij}$ and the existing weight
value $w_{ij}$, that is $w_{ij} = w_{ij} + \Delta w_{ij}$. Three update strategies exist:

• Learning by Pattern
  Weights are updated after the presentation of each training pattern.

• Learning by Epoch
  Weights are updated after the presentation of an entire epoch.

• Learning by Block
  Weights are updated after a group of patterns has been seen.

If the weights are not updated after each pattern, a subtotal of $\Delta w_{ij}$ values must be
accumulated until they are updated, $\Delta w_{ij} = \Delta w_{ij} + \eta \delta_j O_i$. It has been shown that as the
interval between weight updates increases, so too does the number of epochs required
to achieve convergence [24].

2.6.5 Network Error
The network error is a measure of how well the network is performing over all
the patterns in the training set. It is calculated after every epoch of training, where an
epoch is defined as the presentation of the entire training set. Its value is calculated by
summing the errors associated with each training pattern for the current epoch. In cases
where the training set contains a large number of patterns it may be advisable to
normalise the error. The two equations are

$$E = \sum_{p=0}^{N} E_p \quad \text{or} \quad E = \frac{\sum_{p=0}^{N} E_p}{N}$$

where
E is the network error for the current epoch,
N is the number of examples in the training set,
$E_p$ is the error for training pattern p.

The error associated with each training pattern is its Mean Square Error (MSE).
The MSE is calculated by summing the squared difference between target and output
value for each node in the output layer, and dividing by two:

$$E = \frac{1}{2} \sum_{i=0}^{N} (T_i - O_i)^2$$

where
E is the mean square error for the current pattern,
N is the number of output neurons,
$T_i$ is the target output for node i,
$O_i$ is the actual output for node i.
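
Both error measures are straightforward to compute. The sketch below uses illustrative function names that are assumptions of this example.

```python
def pattern_error(targets, outputs):
    # Half the sum of squared differences over the output nodes.
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))


def network_error(pattern_errors, normalise=False):
    # Sum of the per-pattern errors for one epoch, optionally divided by
    # the number of patterns in the training set.
    total = sum(pattern_errors)
    return total / len(pattern_errors) if normalise else total
```

The normalised form makes error levels comparable between training sets of different sizes, which is useful when a fixed stopping threshold is applied.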

2.6.6 Local Minima
In essence the backpropagation algorithm reduces the MSE over all the output
nodes for the current pattern. The equations for error assignment and weight adjustment
are derived from trying to minimise this equation [3], and as such backpropagation is a
gradient descent approach. As with all gradient descent algorithms backpropagation is
faced with the problem of local minima in the surface it is trying to minimise. Local
minima are regions of the error surface which are lower than surrounding regions but
higher than the lowest point on the surface, the global minimum.

Local minima pose a threat because gradient descent algorithms are only
allowed to reduce the error and are not permitted to increase it. Once the algorithm is
trapped in a local minimum, it may find that all subsequent weight changes would
increase the error and are thus impossible. In such a case the network will stop learning
without ever finding the optimal solution located in the global minimum.
To date there is no proof of convergence for the backpropagation algorithm. The
lack of such a proof means that when an optimal set of weights exists for any given
network and training set, the backpropagation algorithm is not guaranteed to find it, and
as such there is no way of knowing if the algorithm has settled on a local or global
minimum.
If the network converges on a solution and the network error is above some
acceptable level, the network is assumed to be trapped in a local minimum. This can be
detected when the network error stagnates at some value and remains unchanged after
repeated epochs. In such an event training is simply restarted with a new set of random
weights, in the hope that it will find a deeper minimum.
The set of weights associated with a given network can be interpreted as a point
on the error surface of the function that the algorithm is trying to minimise, with each
unique set of weights representing a single point on that surface. By selecting a random
set of initial weights the algorithm is selecting a random starting point on the error
surface. Due to the gradient descent nature of backpropagation, the algorithm will
descend this surface along the path of greatest slope until it settles in one of the minima
scattered throughout the surface.

2.6.6.1 Momentum
Modifying the weight update rule to incorporate what is referred to as
momentum can reduce the risk of local minima. But even with the use of momentum
there still may exist local minima from which the network is unable to escape. When
momentum is used, a percentage of the previous weight update is included with the
current update. The idea behind this is that if the previous weight changes are in a
certain direction along the error surface, the current weight changes should tend to be in
the same direction. When the network stumbles across a local minimum, momentum may
push it out of the hollow by continuing in the current direction.

Figure 2.9 The effect of momentum on training.

The modified weight update rule now becomes

$$\Delta w_{ij}(n+1) = \eta \delta_j O_i + \alpha \Delta w_{ij}(n)$$

where
$\Delta w_{ij}(n+1)$ is the current weight update,
$\Delta w_{ij}(n)$ is the previous weight update,
$\alpha$ is the momentum factor.

The momentum factor $\alpha$ controls the percentage of the previous weight update
that is included in the present update. Empirical results would suggest using a high
value approaching one, although there is some merit in reducing this over the course of
learning.
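
The momentum rule needs only one extra stored value per weight, the previous change. A minimal sketch for a single weight follows; the function name and the default values of `eta` and `alpha` are assumptions chosen for illustration.

```python
def momentum_update(weight, delta_j, output_i, prev_change, eta=0.1, alpha=0.9):
    """Return the updated weight and the change that was applied.

    change = eta * delta_j * output_i + alpha * prev_change
    The returned change must be kept and passed in as prev_change next time.
    """
    change = eta * delta_j * output_i + alpha * prev_change
    return weight + change, change
```

When successive changes share a sign, the carried-over term lengthens each step; when the gradient term falls to zero, as it does in a shallow hollow, the carried-over term keeps the weight moving for a few more updates.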

2.7 Hidden Neurons
Whereas the number of input and output neurons is fixed, the number of hidden
layers and neurons is not. They are another parameter of the network to be trained. The
number of hidden neurons determines the power of the network, and having either too
few or too many can adversely affect training.


Too Few Hidden Neurons
If the network is required to learn a non-linearly separable function the network
requires a minimum number of hidden nodes in order to dissect the input space
into the required categories. A network that has too few hidden nodes will be
unable to learn the required function.

Too Many Hidden Neurons
If the network is endowed with too many hidden neurons its capacity
will greatly exceed what is required to learn the function. Such networks instead
of learning the function inherent in the training set, may in fact try to memorise
the entire training set. Such a condition is referred to as over-fitting and is
undesirable because the more the network relies on memory the lesser its ability
to generalise to unknown inputs becomes.

As a result judging the number of hidden nodes employed by a network is a
careful balancing act: too few and the network will not be powerful enough, too many
and it will fail to generalise properly. To date there is no predetermined way of
calculating the number of hidden nodes required for a network to learn a given function.
All that exists in the literature are empirical results and rules of thumb, which serve as
rough guidelines.
There have been attempts to either add or remove hidden nodes during training
if required. But even when these methods are applied it is standard practice when
developing a solution using backpropagation to train a number of networks, each with a
different configuration of hidden neurons, and select the one that performs the best.

2.8 Cross Validation Testing
In part due to the lack of proven convergence, but mainly due to the desire for
proper generalisation, backpropagation neural networks are never trained until the
network error is reduced to zero. Instead the network is trained until the error falls
below a certain predefined level. After this the network is tested to check if it has
generalised from or memorised the training set.
In order to test the network a set of examples is compiled separate from the
training set. The examples in this test set are presented to the network and an error is
calculated. If the error over all the samples in the unseen test set is comparable with the
error from the training set on which it was trained, the network has generalised properly
and can now handle unknown input patterns. But if the network performs well on the
training set and poorly on the test set, it has memorised the training set and is unsuitable
for use with unknown inputs. Such a process is known as cross validation testing.
Provided both the training and test sets are sufficiently large, and each contains
the proper proportions of samples from each category of inputs, cross validation testing
provides a practical method for determining that the network has an adequate number of
hidden nodes.
However there still may exist a deeper minimum, which better fits the examples in
both data sets. There is no way of knowing if a solution more optimal than the current
one exists. In order to counteract this problem it is customary to train several networks,
each with the same configuration of neurons but a different set of random weights.
Whichever one of these networks performs best when tested is selected as the best
solution found, and its weights are used when operating the network.

2.9 Conclusion
An artificial neural network is a collection of simple processing elements referred
to as neurons. Each neuron computes a weighted sum of its inputs, which is then put
through some transfer function f(x) to calculate its output. These neurons are highly
interconnected, with each neuron forming connections with up to all other neurons in the
network. The manner in which neurons are arranged and connected is referred to as the
network topology, and has a large bearing on the power of the network.
Commonly used network topologies include the layered feed-forward networks;
neurons in these networks are divided into distinct layers, with activation values
travelling from the first to the last layer on a layer-by-layer basis. These networks
usually employ some form of supervised training rule. Here pairs of sample input/output
vectors are collected, and the network is trained until it starts to produce the correct
output (or close to it) for most of the samples.
One of the first neural networks to employ this form of training was a simple
one-layered network known as the Perceptron. However it was soon found that
Perceptrons were very limited in the type of functions they could represent, and that at
least one hidden layer was required to represent some common classes of functions. But
the learning rule employed required that a target value be known for all nodes; as
suitable values for hidden neurons couldn't be calculated in advance, the algorithm
couldn't be applied to networks with hidden layers.
The backpropagation algorithm was designed to overcome this problem, by
providing a deterministic method for calculating the error associated with hidden
neurons. This is achieved by computing the effect each hidden neuron has on the errors
of all output neurons, and assigning it an error value based on this.
However the training time of backpropagation is a well known limitation of the
algorithm and it is noted in [6] that for practical problems, where training sets are large
[25], [26], [27], training times on serial machines of days and weeks are not uncommon.
Hence it is proposed to address this limitation by exploiting underused computing
resources by the use of parallel processing techniques.


Chapter 3

Parallel Computing

3.1 Introduction
Parallel computing is the use of two or more processors simultaneously to
perform some computationally intensive calculation. The idea is that by dividing the
calculation between a number of processors it can be performed in less time than if a
single processor was required to do all the work. However multi-processor systems are
more complex both to design and to program than single processor ones, and may require
algorithms to be extensively restructured before they can be efficiently implemented.
The purpose of this chapter is to give an overview of parallel computing. The most
common types of parallel computers all involve some method of connecting a number
of Von Neumann type processors together, although other methods exist such as dataflow
machines, reduction machines and hardware implemented neural networks. While
such machines are parallel computational machines the computations they perform are
fundamentally different from the computations performed by traditional Von Neumann
style machines. As such they will not be dealt with in this chapter, which concerns itself
with parallel computers that perform computations in the traditional sense.

3.2 Parallel Architectures
A common method for classifying different parallel architectures was proposed
by Michael Flynn [28]. It involves categorising architectures based on the level of
parallelism in both the instruction and data streams. The instruction stream refers to the
number of instructions that can be executed simultaneously; for each instruction stream
available in a parallel architecture a separate control unit is required to fetch and decode
the next instruction in the stream. Similarly the data stream refers to the number of
distinct data items that can be operated on in parallel. Applying this method yields four
classes of computer architecture,

• Single Instruction Stream, Single Data Stream (SISD)
  Executes a serial program one instruction at a time, with each instruction
  operating on at most a single piece of data.

• Single Instruction Stream, Multiple Data Stream (SIMD)
  Only one instruction is executed at a time, but each instruction is applied
  to multiple data items.

• Multiple Instruction Stream, Single Data Stream (MISD)
  More than one instruction is executed at a time, but each instruction is
  applied to the same piece of data.

• Multiple Instruction Stream, Multiple Data Stream (MIMD)
  More than one instruction is executed at a time, but each instruction is
  applied to its own piece of data.


3.2.1 Single Instruction Stream, Single Data Stream (SISD)
SISD machines are serial machines that implement the Von Neumann model. In
these machines a single processing element executes a program one instruction at a time
in a sequential fashion. A control unit fetches a single instruction to be executed by the
processing element, which may fetch and store a single item of data. Such a system is
ultimately limited by the bandwidth of the bus connecting memory to the CPU. As all
instructions and data read from or written to memory must travel along the same bus,
a bottleneck is created that can reduce processor performance.

Figure 3.1 SISD Architecture


3.2.2 Single Instruction Stream, Multiple Data Stream (SIMD)
With SIMD machines multiple processing elements execute the same
instruction, each on its own piece of data. These machines have a single control unit (or
sequencer) and thus are restricted to running a single program. The sequencer fetches
and decodes instructions to be executed by the processing elements. As the one
sequencer controls all processing elements (PEs), each PE executes the same instruction
at the same time. Thus SIMD programs are synchronised at the instruction level.
Each PE has access to its own local memory, allowing them to operate on
separate data sets. The SIMD approach reduces both hardware and software complexity
over other parallel methods, but is efficient only for certain problems with a high degree
of regularity, such as image processing and some numerical simulations.
Communication between processing elements is achieved either by an
interconnection network or a shared memory bus. As with SISD machines the
communication medium can reduce the performance of the processing elements.

Figure 3.2 SIMD Architecture


3.2.3 Multiple Instruction Stream, Single Data Stream (MISD)
MISD machines are more a theoretical category than a practical architectural
model on which to base the design of parallel machines. According to their definition
MISD machines are capable of executing several different programs on the same datum
simultaneously. But an application for such a machine is hard to conceive; instead the
literature assigns pipelining machines to the MISD class. Such machines divide the
execution of an instruction into a number of distinct stages (fetch, decode, operand
fetch, execution), each of which can operate in parallel. Whereas a serial computer must
complete all stages of an instruction before starting the first stage of the next instruction,
a pipelining machine will start the first stage of an instruction as soon as the previous
instruction has begun its second stage of execution. In reality such machines still rely on
a single processor/memory bus to carry what is in effect a single instruction stream,
which creates a bottleneck problem similar to that faced by SISD machines.

Figure 3.3 MISD Architecture


3.2.4 Multiple Instruction Stream, Multiple Data Stream (MIMD)
MIMD machines are the most general and powerful of all classes of parallel
architectures, which also makes them the most difficult class of machines to program.
These machines allow for multiple programs to execute in parallel, each program
running on its own processor and operating on its own data stream. Generally each
processor will have access to its own local memory and will communicate with other
processors either through message passing (loosely coupled) or via shared memory
(tightly coupled). As the processors are relatively independent of each other MIMD
programs are asynchronous, a fact which reduces the communication bottlenecks
experienced by the other models.

Figure 3.4 MIMD Architecture


3.3 Memory Structure
Flynn's taxonomy [28] classifies parallel systems based on the level of
parallelism in both the data and instruction streams, but of almost equal importance to a
parallel system is the manner in which the processors access memory and
communicate with each other. Memory in a parallel system may be either shared
between or distributed across all the processors. Hybrid systems that combine the two
also exist.
Distributed Memory
In distributed memory architectures each processor has access to its own
local memory only. Communication and synchronisation between processors are
achieved via message passing over an interconnection network. Typically such
systems provide primitives such as SEND and RECEIVE for communication, and
blocking versions of these primitives if synchronisation is also required. As all data
must be exchanged using message passing, data locality is an important
consideration when designing a parallel program for such systems.

Shared Memory
In shared memory architectures all processors access the same common bank
of memory. Communication between processors is achieved by use of shared
variables. Some form of synchronisation mechanism, such as semaphores, must
protect access to these variables. A major problem with shared memory
architectures is that they do not scale well. As more processors are added to
the system a bottleneck develops as all the processors try to access the same block
of memory.

Hybrid Systems
An important class of hybrid system is what is known as virtual shared
memory. In these systems the memory is physically distributed between the
processors, but from a programming point of view it is logically interpreted as a
single block of shared memory. This allows a shared memory algorithm to be
implemented on a distributed architecture. Virtual shared memory is also referred to
as distributed shared memory (DSM).

3.4 Parallel Programming Models
A number of distinct models or paradigms can be identified for programming
parallel systems. These programming models relate to the process of designing and
coding parallel solutions to a given problem and are to a large extent independent of the
underlying architecture of the system. While a given programming model
may be more suited to a particular parallel architecture, this does not prevent it from being
adapted to and used with other architectures. Some common parallel programming
models include:

Shared Memory Model

In the shared memory programming model tasks share a common address space
and data. There is no need for explicit data exchange between processes as all data is
shared between all tasks; communication between tasks is also achieved via the use of
shared variables. However, as all tasks share the same variables, some form of
synchronisation mechanism (e.g. locks or semaphores) must control access to these
variables. The shared memory programming model can be applied to either pure
shared memory systems or virtual shared memory systems.
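
A minimal sketch of the shared memory model follows, using Python threads as stand-ins for tasks. The counter, the lock and the function name are assumptions made for illustration; the point is only that every task updates the same variable and that a lock serialises those updates.

```python
import threading


def shared_counter(n_tasks: int, increments: int) -> int:
    """Let n_tasks tasks each add `increments` to one shared variable."""
    total = 0                   # shared variable visible to every task
    lock = threading.Lock()     # synchronisation mechanism guarding it

    def task() -> None:
        nonlocal total
        for _ in range(increments):
            with lock:          # only one task may update at a time
                total += 1

    threads = [threading.Thread(target=task) for _ in range(n_tasks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                # wait until every task has finished
    return total
```

No data is explicitly exchanged between the tasks; the result is communicated purely through the shared variable.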

Message-Passing Model

In the message-passing programming model there is no common address space
shared between all the tasks in the system. Each task has access to its own local data
only, and the programmer must explicitly code all communication and data exchange.
This is achieved through the sending and receiving of messages between individual
tasks or groups of tasks in the system. This requires cooperation between the producing
and consuming tasks, with the consuming task required to call a matching receive
operation for the producing task's send operation.
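
The matching of send and receive operations can be sketched as follows. This is an illustration only: two threads stand in for tasks on separate processors, and `queue.Queue` objects stand in for the communication channels, so `put` plays the role of SEND and `get` the role of a blocking RECEIVE. The worker touches no data other than what arrives in messages.

```python
import queue
import threading


def squarer(inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Consuming task: one blocking receive for every send by the producer."""
    while True:
        msg = inbox.get()            # RECEIVE (blocks until a message arrives)
        if msg is None:              # sentinel: producer has finished
            outbox.put(None)
            break
        outbox.put(msg * msg)        # SEND the result back


def run_exchange(values: list[int]) -> list[int]:
    """Producing task: send each value, then collect the replies."""
    to_worker: queue.Queue = queue.Queue()
    from_worker: queue.Queue = queue.Queue()
    worker = threading.Thread(target=squarer, args=(to_worker, from_worker))
    worker.start()
    for v in values:
        to_worker.put(v)             # SEND
    to_worker.put(None)
    results = []
    while (reply := from_worker.get()) is not None:
        results.append(reply)
    worker.join()
    return results
```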

Data Parallelism

In the data parallelism programming model the majority of the parallel
work to be performed consists of performing operations on a common data structure,
which is divided among the parallel tasks. Each task essentially performs the same
operation on different portions of the data structure. On shared memory architectures
each task shares a common data structure, but on distributed architectures the data
structure is partitioned and distributed among the tasks.
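
As an illustration of data parallelism, the sketch below partitions a list into contiguous blocks and applies the same operation (a sum of squares, chosen arbitrarily) to each block. The helper names are hypothetical, and a thread pool is used only as a convenient stand-in for a set of parallel tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data: list, n_tasks: int) -> list[list]:
    """Split data into n_tasks contiguous blocks whose sizes differ by at most one."""
    base, extra = divmod(len(data), n_tasks)
    blocks, start = [], 0
    for i in range(n_tasks):
        size = base + (1 if i < extra else 0)
        blocks.append(data[start:start + size])
        start += size
    return blocks


def block_sum_of_squares(block: list) -> int:
    """The single operation that every task applies to its own block."""
    return sum(x * x for x in block)


def parallel_sum_of_squares(data: list, n_tasks: int) -> int:
    """Apply the same operation to each block, then combine the partial results."""
    blocks = partition(data, n_tasks)
    with ThreadPoolExecutor(max_workers=n_tasks) as pool:
        partial = list(pool.map(block_sum_of_squares, blocks))
    return sum(partial)
```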

Threads Model
Threads allow a single process to have a number of simultaneously executing
paths of execution, known as threads. Threads are not independent processes and can
only exist within the parent process, and for the duration of the parent process. Threads
share the entire resources of the parent process but also have access to their own local
address space. Communication is performed via the shared global address space and
requires some form of synchronisation such as locks or semaphores. Threads are not
confined to parallel systems, and have been used in many multitasking serial
environments both to simplify the task of programming and to speed up the finished
application. Two common examples of thread systems are Pthreads (POSIX threads) [29]
and OpenMP [30].

3.4.1 MIMD Programming Models
On MIMD architectures each processor is capable of executing its own unique
program independently of the other processors in the system. Each of these programs
solves a sub-task of the overall problem, and communicates and synchronises with other
programs running on different processors as required. Therefore an MIMD program can
be viewed as a collection of sub-programs running on the various processors of the
MIMD system. Typically an MIMD application is launched by executing a master program
on any one of the processors. The master program then initialises and launches other
sub-programs as required on the remaining processors. Within this framework two distinct
models of program design are possible.

The first model is referred to as the Single Program, Multiple Data (SPMD)
model. In such a model every processor in the system executes a copy of the same
program, although control flow statements allow different processors to execute
different sections of code. Provided that most of the processors execute the same section
of code, albeit on different portions of data, the SPMD approach offers a relatively
simple method of programming MIMD machines.
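
The SPMD idea can be sketched as a single function that every processor runs, with a rank number selecting both the portion of data and any rank-specific section of code. In this illustration threads stand in for processors, and a shared list and barrier stand in for the communication and synchronisation that a real system would perform; `rank`, `size` and the function names are assumptions.

```python
import threading


def spmd_program(rank, size, data, partial, barrier, result):
    """The single program executed by every 'processor'."""
    partial[rank] = sum(data[rank::size])   # same code, different portion of data
    barrier.wait()                          # synchronise before combining
    if rank == 0:                           # control flow selects a different section
        result.append(sum(partial))


def run_spmd(data, size):
    """Launch `size` copies of the same program and return rank 0's result."""
    partial = [0] * size
    result = []
    barrier = threading.Barrier(size)
    threads = [
        threading.Thread(target=spmd_program,
                         args=(rank, size, data, partial, barrier, result))
        for rank in range(size)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result[0]
```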

A more general approach to writing MIMD programs is one that allows the
processors to execute different programs. Such a model is referred to as a Multiple
Program, Multiple Data (MPMD) model. MPMD programs, while being more difficult to code,
offer greater flexibility and modularity than SPMD programs.

3.5 Interconnection Networks
The manner in which individual processing elements are connected together
within a parallel computer is referred to as an interconnection network. The
interconnection network has a large bearing on both the cost and performance of any
parallel system. Theoretically an ideal interconnection network would directly connect
each processing element to all other processing elements in the system. Such a set-up
gives optimal connectivity, resulting in increased performance due to lower
communication overheads, but requires the highest number of connections, resulting in
increased cost and complexity. In fact the number of required connections grows
quadratically as more processors are added, making full connectivity feasible only for
small systems of few processors. In practice most systems have to balance the
communication cost between processors with the cost and complexity of extra
connections. Numerous interconnection topologies exist, a few of which are outlined
below; for a more detailed description see [31] and [32].

• Complete Graph

If the processors are connected by a complete graph, each processor is at
most one communication step away from every other processor. But a network
of P processors requires $P(P - 1)/2$ bidirectional connections. Complete graph topologies
offer the lowest communication costs, but at the price of the highest connection
costs.

Figure 3.5 Complete Graph Interconnection Network


• Ring

In a ring interconnection topology each processor is connected to two
other processors, in such a manner as to form a circuit. Such a configuration can be
constructed with only P connections, but requires between one and P/2
communication steps for communication between any two processors.

Figure 3.6 Ring Interconnection Topology

• Mesh

If the processors are connected to form an i-dimensional grid they are
said to form an i-dimensional mesh. Typically, for ease of construction, the mesh
will be of either two or three dimensions. Interior processors in the mesh form 2i
connections with other processors, while edge processors form fewer
connections. It is possible to construct the mesh in such a way that all processors
form 2i connections by using wraparound connections at the edges, see Figure 3.7.
A major disadvantage of a mesh topology is that the communication cost
between processors varies greatly depending on the proximity of the two
processors concerned, ranging from a single step for immediate neighbours to
a number of steps that grows with P for the most distant processors, where P is the
number of processors in the system.
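
The effect of wraparound connections on communication distance can be sketched with a small helper. It assumes a mesh with the same number of processors, `side`, along every dimension and shortest-path routing; the function name and coordinate convention are illustrative.

```python
def mesh_distance(a: tuple, b: tuple, side: int, wraparound: bool = False) -> int:
    """Minimum number of communication steps between processors a and b.

    a and b are coordinate tuples, one entry per mesh dimension,
    each entry in the range 0 .. side - 1.
    """
    steps = 0
    for x, y in zip(a, b):
        d = abs(x - y)
        if wraparound:
            d = min(d, side - d)   # going around the edge may be shorter
        steps += d
    return steps
```

For opposite corners of a 4 x 4 mesh, for example, this gives six steps without wraparound connections and two steps with them.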

Figure 3.7 2D Mesh with and without wraparound connections

• Binary Tree

With this interconnection topology the processors form a binary tree,
with each processor being connected to at most three other processors: one
parent and two children. Terminal processors, which form the leaves of the tree, are
connected only to a single parent processor, while the root processor is
connected only to its two children. A binary tree topology of depth d contains $2^d - 1$
processors at a cost of $2^d - 2$ connections, and has a maximum communication
distance of the order of $\log_2(P)$ steps. A disadvantage of the binary tree structure
is that a non-leaf
node must channel all communications between its two sub-trees, as well as all
communication between child nodes and other nodes in the tree. Such a situation
leads to bottlenecks and the associated communication delays at these nodes.
This may be overcome by creating "fat trees" that employ switches for the inner
nodes, and restrict processing to the leaf nodes [33].

Figure 3.8 Binary Tree Interconnection Network (a: binary tree of depth four; b: fat binary tree of depth four)

• Hypercube

A hypercube is an i-dimensional object consisting of $2^i$ nodes, where
each node forms i connections to other nodes. Hypercubes are the mathematical
extension of a standard 3-dimensional cube to other dimensions; i-dimensional
hypercubes are formed by linking the corresponding nodes of two (i - 1)-dimensional
hypercubes, as shown in Figure 3.9.
In a hypercube interconnection network the processors are arranged to
form an i-dimensional hypercube; such a network has a maximum
communication step of i. It has been shown that a hypercube balances a
logarithmic maximum communication cost with a logarithmic number of
connections per processor.

Figure 3.9 Hypercube Interconnection Network (panels a to e: dimensions D = 0 to D = 4; legend: centre node, directly connected nodes, other nodes)
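
The structure of a hypercube is conveniently expressed with binary node addresses. The sketch below assumes the common convention, not stated in the text, that the $2^i$ nodes are numbered 0 to $2^i - 1$ and that two nodes are connected when their addresses differ in exactly one bit.

```python
def hypercube_neighbours(node: int, dim: int) -> list[int]:
    """Nodes directly connected to `node`: flip each of the dim address bits."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hypercube_distance(a: int, b: int) -> int:
    """Communication steps between a and b: the number of differing address bits."""
    return bin(a ^ b).count("1")


def hypercube_diameter(dim: int) -> int:
    """Largest distance between any two nodes, found by exhaustive search."""
    nodes = range(2 ** dim)
    return max(hypercube_distance(a, b) for a in nodes for b in nodes)
```

With this numbering each node has exactly `dim` neighbours and the exhaustive search returns `dim` as the maximum distance, in line with the figures quoted above.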

3.6 Cluster Computing
An MIMD system as already described consists of a number of processors
capable of independent operation connected together via some interconnection network.
These processors can either share a common block of memory, or access their own local
block of memory. Such a definition can also be used to describe a number of serial
computers connected via an Ethernet network. As such, a network of serial computers
can be viewed as an example of an MIMD system provided that the processors in each
machine can be configured to perform some calculations in parallel. A network of
computers cooperating to form a parallel computer is referred to as a cluster computer;
each computer in the cluster (a node) consists of a processor and a local block of memory,
i.e. distributed memory.

3.6.1 Requirements of a Cluster Computer

An arbitrary collection of networked computers cannot be viewed as a cluster
computer; the machines in the cluster must be capable of cooperating in such a manner
as to logically form an MIMD system. In order to enable the nodes of a cluster to
cooperate in this manner, they must be configured to meet the following requirements.

• Each node must be capable of executing a program.

• These programs must be able to communicate with other programs running on
other nodes in the network via some communication protocol such as TCP/IP.

• These programs must be able to synchronise operations with other programs
running on other nodes in the network. Synchronisation between processors can
be achieved through communication.

• Programs must have the ability to launch programs on other nodes.

If the machines in a cluster are configured to meet these requirements, they are then
capable of running MIMD programs across the network. Such an MIMD program
consists of independent programs running on independent machines, each solving
sub-tasks of the overall problem and communicating results and intermediate values as
required.

It is worth noting that it is not a requirement that all machines in the cluster be of the
same type. It is possible to develop a cluster out of any collection of networked
machines provided that each machine in the cluster meets the above requirements. Thus
the machines in a cluster may differ in terms of architecture, processor speed, available
memory, input/output capabilities, and data formats. Nor is it a requirement that each
node in the cluster be a serial machine; clusters can be formed from any collection of
correctly configured networked machines. When all the machines in a cluster are
identical it is referred to as a homogeneous cluster; otherwise the cluster is
heterogeneous.

3.6.2 Advantages of Cluster Computing
Cluster computing offers a number of advantages over purpose-built parallel
computers that make it a more appealing option for most small to medium scale
applications. Some of these advantages include:

• Cost

It is considerably cheaper to develop a cluster computer from off-the-shelf
commodity products than to purchase a purpose-built parallel computer.

• Scalability

A cluster computer can be scaled up or down by the addition or removal
of nodes from the network.

• Flexibility

For ease of design and manufacture most parallel machines are
constructed using identical processors, each with an identically sized block of
local memory. Contrast this with a heterogeneous cluster where individual nodes
may differ not only in terms of their underlying architecture, but also in terms of
their computational speed and local memory. Heterogeneous computing offers a
flexible approach to parallel computing, where the abilities of specific nodes are
tailored to meet the requirements of the particular sub-tasks they are to run.


• Ease of Programming

Cluster computers are built from commodity off-the-shelf machines. The
instruction sets, operating systems and high level programming languages
employed by these machines are generally well understood by most
programmers, whereas purpose-built machines are associated with novel
instruction sets and vendor specific operating systems and programming
languages. Similarly, many of the tools used in cluster computing have
already received extensive testing and debugging on serial machines, and may
be less error prone than seldom used proprietary software.

• Maintenance

The fact that cluster computers are constructed using commodity
machines is also an advantage when it comes to maintaining the cluster. Faulty
nodes are easily replaceable, while fault detection is aided by technical
familiarity with the underlying system.

• Shelf Life

One consequence of Moore’s law is that any processor has a very short
shelf life. Within two years of a processor being developed a significantly more
powerful processor will appear on the market. This is true for processors
intended for use in both serial and parallel computers. But it is far cheaper to
upgrade a cluster computer than it is to upgrade a parallel computer, due to the
enormous difference in cost between the two. It is also possible to upgrade the
cluster computer in stages by replacing older nodes with ones containing new,
faster processors.

One of the major disadvantages of cluster computing is the network delay associated
with communication. In a purpose-built parallel machine the processors are physically
closer together, and the communication overheads are lower than those of a collection of
independent machines communicating over a network or via a switch. Such purpose-built
machines will always achieve faster and more reliable communication than
clusters. Due to its high communication overheads, cluster computing is not suitable for
parallel applications with a high communication to computation ratio.


3.6.3 Types of Cluster Computing

The type of computing performed by a cluster is generally categorised as either
High Performance Computing (HPC) or High Throughput Computing (HTC).

• High Performance Computing
High performance computing seeks to create a purpose-built cluster
computer to perform computationally intensive calculations. Each machine in
the cluster is solely allocated to the cluster, which is easily recognisable as a
single computer. Here the use of cluster computing is motivated by the benefits
it offers over traditional HPC solutions, in terms of both cost and flexibility.

• High Throughput Computing
High throughput computing seeks to utilise the unused resources of existing networked
machines. Instead of creating a purpose-built cluster, machines are allocated to the
HTC system when they are not required for another task. This type of computing
is motivated by a desire to avail of the free computational power generated by idle
machines.

3.6.4 Cluster Computing Software
A cluster computer is a collection of independent machines configured to work
together as a single computer, and applications running on the cluster consist of a
number of separate tasks running on different nodes of the cluster. During the course of
the application these tasks may be required to communicate and synchronise with each
other, as well as possibly launching new tasks on other nodes. This section examines a
number of different software approaches available when developing such an
application.

• Low Level Primitive Operations
For a node to participate in the cluster it must be networked and the operating
system used on that node must support a basic suite of low-level networking protocols.
This allows for the possibility of creating a cluster application from scratch using such
low-level facilities as sockets and RPC (Remote Procedure Calls) as a basis.
However such methods do not provide much of the functionality required by a parallel
program, for instance notification of task failure, which may be difficult and
cumbersome to implement. Another disadvantage of such an approach is that the
resulting application may be tied to a particular operating system and thus may be
difficult to port to other systems.
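
To give a sense of how much must be hand-coded at this level, the sketch below exchanges a single message over a raw socket. Even this requires a home-made framing scheme, since a stream socket has no notion of message boundaries. The 4-byte length prefix, the helper names and the use of `socket.socketpair()` to create two connected endpoints in one process are all choices made for the illustration.

```python
import socket
import struct


def send_msg(sock: socket.socket, payload: bytes) -> None:
    """Send one message, prefixed with its length as a 4-byte big-endian integer."""
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes; a single recv() call may return fewer."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf.extend(chunk)
    return bytes(buf)


def recv_msg(sock: socket.socket) -> bytes:
    """Receive one length-prefixed message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)


def demo() -> bytes:
    """Send a small message between two connected endpoints and return it."""
    a, b = socket.socketpair()
    try:
        send_msg(a, b"partial result: 42")
        return recv_msg(b)
    finally:
        a.close()
        b.close()
```

Nothing here addresses task failure, task launching or data conversion between machines, which is the functionality the following approaches aim to supply.
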
• Message Passing Libraries

Tasks in a cluster computing system generally communicate via some form of
message passing. Message passing libraries provide most of the generic functionality
required by parallel programs in a set of easy to use low-level routines. These routines
are a level above system calls such as sockets, on which they are implemented, and
allow a user to send a message to a task without knowing on which host that task
resides. Message passing libraries are generally machine independent, allowing
programs to be easily ported to another system. The two most common message-passing
libraries used in cluster computing are MPI [34] and PVM [35].
• Specific-Function Parallel Libraries

These libraries are not designed for generic parallel programming
but are optimised libraries that perform certain calculations in parallel. They
are similar to the libraries used in serial computations and serve much the same
purpose. Libraries in this category include ScaLAPACK, a linear systems solver [36], and
PETSc, for partial differential equations [37].
• Parallel Skeletons

A problem with message passing libraries is that even though they are at a
higher level than basic system calls, from a programming point of view they are still
relatively low-level primitives. Much of the routine work, such as fault tolerance and
data distribution, is still left up to the application programmer. However many of these
tasks are required by different applications, with the core routines possibly remaining
unchanged. This may be especially true when the applications implement the same
parallel programming paradigm. This would suggest that such routines would be better
placed in some form of library; skeletons address this issue.
“A Skeleton is formally a higher order function taking one or more other
skeletons or portions of sequential code as parameters, and modeling a parallel
computation out of them” [38]. Skeletons are generally designed to aid the development of
specific programming paradigms such as divide-and-conquer or task farming and are
implemented on top of an existing message passing system. Skeletons may be
implemented either as libraries or as explicit programming languages.
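
The notion of a skeleton as a higher order function can be sketched as follows. Both functions are hypothetical illustrations rather than the interface of any particular skeleton library: `farm` hides the distribution of tasks behind a thread pool (a real skeleton would sit on a message passing system), and `divide_and_conquer` shows only the shape of that paradigm, with the recursion run sequentially.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def farm(worker: Callable[[T], R], tasks: Iterable[T], n_workers: int = 4) -> list[R]:
    """Task-farm skeleton: apply the sequential `worker` to every task in parallel."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, tasks))


def divide_and_conquer(problem, is_trivial, solve, divide, combine):
    """Divide-and-conquer skeleton, parameterised by four sequential functions.

    The sub-problems are solved sequentially here; a parallel version
    could hand them to `farm` instead.
    """
    if is_trivial(problem):
        return solve(problem)
    parts = divide(problem)
    return combine([divide_and_conquer(p, is_trivial, solve, divide, combine)
                    for p in parts])
```

The application programmer supplies only the sequential pieces, for example `farm(lambda x: x * x, [1, 2, 3])`, and the skeleton supplies the coordination.
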
• Remote Management Systems
“Cluster Management Software (CMS) - this software is designed to administer
and manage application jobs submitted to workstation clusters. It encompasses the
traditional batch and queuing systems” [39]. These systems are designed to facilitate
high-throughput computing over a network of non-dedicated workstations. In most
systems the owner of a machine retains complete control over that machine and grants
permission to the CMS to use it under certain conditions. For this reason CMS systems
try to minimise the impact that running remote jobs has on the owners of the machines. This
may require suddenly suspending, migrating or terminating remote jobs. CMS systems
may or may not allow the execution of parallel jobs. CMS systems include Condor [40]
and Generic NQS.
• Parallel Languages
Parallel cluster computing programming languages allow the programmer to
specify the level of parallelism in the algorithm using native constructs of the language.
They serve much the same purpose as high-level languages on a standard computer in
that they shield the user from the underlying architecture of the system. In the case of a
cluster computer this means removing the need to explicitly specify certain details such
as the target task when communicating.
The programmer outlines the basic algorithm using the abstract programming
environment of the parallel language. The compiler converts this into code to run under
the underlying message passing system, such as MPI. Two parallel languages that
achieve this using different approaches are Linda [41], [42] and mpC [10].
Linda
Linda is a parallel programming language for virtual shared memory systems;
communication is via shared memory in what is referred to as a tuple space. Tuples are
items of data that must be shared between tasks. Tasks communicate by placing tuples
into, and retrieving tuples from, the tuple space. Communication is asynchronous, meaning
that a task can continue once it has placed a tuple in the space; it need not wait for the
receiving task to remove the tuple. Tuples can contain function names as well as data;
when a function is placed in the tuple space a new task is launched corresponding to
that function. Linda is a base system that has been applied to a number of high-level
languages such as C, Fortran, and Prolog. Note that Linda is frequently implemented as
a precompiler for the underlying language, making it highly portable. A disadvantage of
the Linda system is that it does not scale well to large problems involving many tasks.
This is partly due to the fact that a shared tuple space increases the chances of conflicts
between tuples as the number of tasks in the system increases.
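
A tuple space can be sketched in a few lines to show the style of communication involved. This is a toy, single-process illustration and not Linda itself: the operation names `out`, `in` and `rd` follow Linda's usual vocabulary (`in` is spelt `in_` because it is a Python keyword), `None` is used here as a wildcard in patterns, and the operation that launches new tasks is omitted.

```python
import threading


class TupleSpace:
    """Toy tuple space shared by the threads of one process."""

    def __init__(self) -> None:
        self._tuples: list[tuple] = []
        self._cond = threading.Condition()

    def out(self, *tup) -> None:
        """Place a tuple in the space and return immediately (asynchronous)."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def _match(self, pattern: tuple):
        for tup in self._tuples:
            if len(tup) == len(pattern) and all(
                p is None or p == v for p, v in zip(pattern, tup)
            ):
                return tup
        return None

    def in_(self, *pattern) -> tuple:
        """Remove and return a matching tuple, blocking until one exists."""
        with self._cond:
            while (tup := self._match(pattern)) is None:
                self._cond.wait()
            self._tuples.remove(tup)
            return tup

    def rd(self, *pattern) -> tuple:
        """Return a matching tuple without removing it, blocking until one exists."""
        with self._cond:
            while (tup := self._match(pattern)) is None:
                self._cond.wait()
            return tup


def square_via_tuple_space(n: int) -> int:
    """Two tasks cooperate purely through the tuple space."""
    space = TupleSpace()

    def worker() -> None:
        _, value = space.in_("task", None)
        space.out("result", value * value)

    t = threading.Thread(target=worker)
    t.start()
    space.out("task", n)                    # producer continues without waiting
    _, result = space.in_("result", None)
    t.join()
    return result
```

Neither task names the other; each only describes the tuple it wants, which is what distinguishes this style from explicit message passing.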

mpC

mpC is an extension of ANSI C for parallel computing on a heterogeneous cluster. It is
implemented on top of HMPI (Heterogeneous MPI), an extended version of the MPI
standard that allows the programmer greater scope in specifying which processors are
selected when running a parallel application in a heterogeneous environment. Although mpC
programs employ MPI message-passing routines for communication and
synchronisation, the programmer is shielded from the underlying MPI and can naturally
express the operations required using the constructs of the language. When mpC source
code is compiled using a special mpC compiler, standard ANSI C code is generated
(with MPI calls). This code may then be compiled for any architecture using a
native compiler.

3.7 Grid Computing
Grid computing aims to unite computational resources across several
organisations using the World Wide Web. The grid is a collection of computational
resources that are available to a user for use on a particular problem. These resources
may consist of computers, databases and experimental facilities which are not directly
available to the user locally. One of the main goals of grid computing is to provide a
simple and transparent interface that shields the user from the underlying
structure of the grid, such as where the services are located and the exact technology
on which the services are implemented. This is usually achieved through the use of
grid-enabled middleware such as NetSolve/GridSolve [43] and Globus [44]. Grid
computing is generally used to solve computationally intensive problems such as
climate modelling.


3.8 Granularity
Granularity is a measure of the level of parallelism achieved in a parallel
algorithm. The term is used to compare the amount of computation performed by
parallel segments of code with the amount of communication required to perform that
calculation in parallel. The finer the granularity, the more communication and
synchronisation is required. As there is a correlation between the level of granularity
achieved and the number of communications generated, there is also a relationship
between granularity and an algorithm's suitability for implementation on various
parallel architectures. The higher the communication costs of a parallel
architecture, the coarser the granularity that can be exploited efficiently on that
architecture. In general granularity is a slightly abstract term, with an algorithm being
described as coarse-, medium- or fine-grained.

Fine Granularity
Statement level parallelism: modules of parallel code consist of a few statements
only, and are suitable for implementation on SIMD architectures due to their inbuilt
synchronisation and low communication costs.
Medium Granularity
Procedure level parallelism: suitable for implementation on asynchronous
MIMD architectures. Each processor in the MIMD system executes subroutines
in parallel, performing synchronisation where necessary.
Coarse Granularity
Program level parallelism: the task is divided into a number of sub-programs
that execute relatively independently of each other, requiring little communication
and synchronisation between sub-programs. Coarse granularity is suitable
for use with clusters of workstations communicating over an Ethernet network,
where communication costs are high.


3.9 Load Balancing
Load balancing is an attempt to achieve the maximum amount of processor
usage and thus reduce overall execution time in a parallel system. The idea is that at any
given time as many processors as possible are involved in some form of computation.
Key to load balancing is the concept of workload allocation: tasks should be assigned to
processors in a manner that results in each processor spending the least amount of time
possible waiting for some other processor to finish its task.

The two main considerations when performing load balancing are the abilities of
each processor, and the execution time of the tasks to be performed. If all processors are
identical and each task requires the same amount of time to execute, load balancing is
simply a matter of dividing the tasks evenly between the processors. But in systems
where either of these differs, load balancing is an important consideration when designing a
parallel solution. Two distinct approaches to load balancing are static load balancing
and dynamic load balancing.

Static Load Balancing
Static load balancing assigns tasks to processors before execution of the
program. Using this method load balancing is an optimisation problem, which tries to
minimise processor idle time. Static load balancing works well in cases where the
characteristics of each processor are known in advance and task execution time is
predictable. It is less effective when the execution time of a computation is parameter
dependent and varies greatly across the range of possible inputs.
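
One simple way to approach the static assignment problem is a greedy heuristic: sort the tasks by decreasing execution time and give each in turn to the processor with the least work so far. The sketch below uses this longest-processing-time-first rule as an assumed stand-in for whatever optimisation method a real scheduler might employ, and assumes identical processors and task times that are known in advance.

```python
import heapq


def static_schedule(task_times: list[float], n_procs: int):
    """Greedy longest-processing-time-first assignment.

    Returns (assignment, makespan), where assignment[p] lists the indices
    of the tasks given to processor p and makespan is the finishing time
    of the most heavily loaded processor.
    """
    heap = [(0.0, p) for p in range(n_procs)]   # (current load, processor)
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_procs)]
    order = sorted(range(len(task_times)), key=lambda i: -task_times[i])
    for idx in order:
        load, p = heapq.heappop(heap)           # least loaded processor
        assignment[p].append(idx)
        heapq.heappush(heap, (load + task_times[idx], p))
    makespan = max(load for load, _ in heap)
    return assignment, makespan
```

The heuristic does not guarantee an optimal schedule, but it shows why accurate task times are essential: the whole assignment is fixed before any task has run.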

Dynamic Load Balancing
With dynamic load balancing no pre-computed mapping of tasks to processors
exists prior to program execution. Instead task assignment is calculated on the fly as
part of the parallel computation. A work pool of tasks is
formed, and as processors complete their current tasks they are assigned new tasks from the
pool. Processors are only allowed to be idle after the work pool has been exhausted; as
a result the maximum idle time for any processor is equal to the maximum execution
time of a task. Thus dynamic load balancing works best on fine-grained problems where
the execution times per task are small.
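
The behaviour of a work pool can be examined with a small event-driven simulation. The sketch assumes identical processors, tasks handed out in the order they appear in the pool, and no cost for fetching a task; the function names are illustrative.

```python
import heapq


def simulate_work_pool(task_times: list[float], n_procs: int):
    """Simulate processors drawing tasks from a shared pool.

    Whenever a processor becomes free it takes the next task from the pool.
    Returns (finish, makespan): the time at which each processor completes
    its last task, and the overall completion time.
    """
    free_at = [(0.0, p) for p in range(n_procs)]   # (time processor is free, id)
    heapq.heapify(free_at)
    for t in task_times:
        now, p = heapq.heappop(free_at)            # earliest free processor
        heapq.heappush(free_at, (now + t, p))      # busy until now + t
    finish = [0.0] * n_procs
    for when, p in free_at:
        finish[p] = when
    return finish, max(finish)


def max_idle_time(task_times: list[float], n_procs: int) -> float:
    """Longest time any processor waits at the end for the others to finish."""
    finish, makespan = simulate_work_pool(task_times, n_procs)
    return makespan - min(finish)
```

For the pool `[4, 1, 1, 1, 1, 4]` on two processors, one processor finishes at time 8 and the other at time 4, so the idle time of 4 equals the longest task, matching the bound stated above.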

3.10 Speedup Calculations
The main motivation behind parallel computing systems is the potential for
improved execution times as more processors are added to the system. Such an
improvement in performance is termed speedup, and is defined as

$$S_p = \frac{T_s}{T_p}$$

where
$S_p$ is the speedup achieved using p processors,
$T_s$ is the total execution time for a serial version of the algorithm,
$T_p$ is the total execution time for a parallel algorithm using p processors.

When performing speedup tests it is important to ensure that the serial algorithm
being compared is the most efficient algorithm possible in order to get a true gauge of
the speedup gained. It is also important to ensure, insofar as is possible, that both
versions of the program run on similar processors. In practice this is not always easy to
achieve, so the performance of a serial algorithm is often approximated by that of the
parallel version running on a single processor, without communication.

3.10.1 Amdahl's Law
The speedup achieved by a parallel implementation is limited by a number of
factors, the first and most important of which is the level of parallelism inherent in the
underlying computation. For any given algorithm a certain proportion of the instructions
must be executed sequentially, while the remainder may be executed in parallel.
When executing a sequential block of code there is no performance increase to be
gained by adding extra processors to the system; they are only of benefit while the
system is dealing with parallel sections of code. Therefore it is not beneficial to
parallelise code that contains too many sequential instructions. This relationship is
expressed using Amdahl’s law [45],

$$S_P = \frac{1}{f + \frac{1 - f}{P}}$$

where
$S_P$ is the maximum speedup achieved using P processors,
f is the fraction of instructions executed serially (0 < f < 1),
P is the number of parallel processors.

Amdahl’s law implies that there is an upper bound on speedup that is
independent of the number of processors applied to the problem. As the number of
processors P approaches infinity the speedup is bounded by the term 1/f, which is a
constant, since for any given algorithm f is assumed to be constant. This
would appear to suggest that massively parallel processing involving hundreds or
thousands of processors is inefficient, as the speedup per processor drops drastically as
the number of processors increases. Interestingly, if there is no serial element to the
calculation (f = 0) there is no upper bound on speedup and the maximum speedup
increases linearly with P.
It is worth pointing out that the use of Amdahl’s law is not an exact calculation,
it merely serves as a guide to designers when selecting a suitable number of processors
for a given system. Although the law was originally devised with multiprocessor
systems in mind it has being successfully applied to calculate performance gains arising
from other areas of computer science, such as the use of cache memory [46].
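
The bound described above can be checked numerically. The following is a minimal sketch in C, not part of the original work; the function names amdahl_speedup and amdahl_limit are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Maximum speedup predicted by Amdahl's law for a serial fraction f
   (0 <= f <= 1) on p processors: S_P = 1 / (f + (1 - f) / P). */
double amdahl_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0 && p >= 1);
    return 1.0 / (f + (1.0 - f) / (double)p);
}

/* Upper bound on the speedup as P grows without limit: 1 / f.
   Only defined when some serial work exists (f > 0). */
double amdahl_limit(double f)
{
    assert(f > 0.0 && f <= 1.0);
    return 1.0 / f;
}
```

For example, with f = 0.1 the predicted speedup on 10 processors is roughly 5.26, and no number of processors can raise it beyond 10.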

3.10.2 The Gustafson-Barsis Law
A problem with Amdahl's law is that it assumes the size of the problem is fixed
as the number of processors is increased, and thus that f remains a constant. In
practice, as the power of parallel systems grows (through the addition of extra
processors), so too does the size of the problems they handle, and as the size of the
problem grows the fraction of time spent in serial calculations reduces. An alternative
scaled speedup measure known as the Gustafson-Barsis law [47] captures this idea:

\[ S_P = f + (1 - f) P \]

where
S_P is the maximum speedup achieved using P processors,
f is the fraction of instructions executed serially (0 < f < 1),
P is the number of parallel processors.

Applying the Gustafson-Barsis law, the maximum achievable speedup is no
longer bounded by the serial portion of the code but is linearly related to the number of
processors applied to the problem. If both the problem size and number of processors
are large enough Amdahl's law can be circumvented and substantial speedups achieved.
Under such conditions massively parallel processing becomes a viable option. It has
been shown that both laws are in fact equivalent [48].
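
As an illustration of the scaled speedup measure, the sketch below evaluates S_P = f + (1 - f)P in C. It is only a worked example under the definitions given above; the function name gustafson_speedup is hypothetical.

```c
#include <assert.h>

/* Scaled speedup predicted by the Gustafson-Barsis law:
   S_P = f + (1 - f) * P, where f is the serial fraction and p the
   number of processors. */
double gustafson_speedup(double f, int p)
{
    assert(f >= 0.0 && f <= 1.0 && p >= 1);
    return f + (1.0 - f) * (double)p;
}
```

With f = 0.1 and P = 10 this gives a scaled speedup of about 9.1, compared with roughly 5.26 under Amdahl's law for the same f and P when the problem size is held fixed.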

3.10.3 Communication Costs
Both the above laws assume that if a single processor can perform a parallel
calculation in time t, P processors can perform the same calculation in time t/P. This
ignores the communication and synchronisation costs incurred when performing the
calculations in parallel. When communication is taken into account the time taken by
the parallel section of code is somewhat greater than t/P. Moreover as more
processors are added to the system, the amount of communication required can grow
exponentially. At some stage the system will reach a point where any performance gains
achieved by the inclusion of extra processors are lost, due to the increased
delays incurred in communication between these processors. Once this
point is reached the addition of extra processors will in fact lower speedup.
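
The turning point described above can be illustrated with a simple cost model. The sketch below is an assumption made purely for illustration and is not taken from the source: it charges a fixed cost t_comm for each additional processor, which is a much milder growth than the exponential case mentioned in the text, yet it already produces a peak in speedup. The function name speedup_with_comm and its parameters are hypothetical.

```c
#include <assert.h>

/* Speedup under a hypothetical cost model in which the parallel run time is
       T(P) = t_serial + t_parallel / P + t_comm * (P - 1)
   i.e. every processor beyond the first adds a fixed communication cost. */
double speedup_with_comm(double t_serial, double t_parallel,
                         double t_comm, int p)
{
    assert(p >= 1 && t_serial + t_parallel > 0.0);
    double t1 = t_serial + t_parallel;              /* single-processor time */
    double tp = t_serial + t_parallel / (double)p   /* ideal parallel time   */
              + t_comm * (double)(p - 1);           /* communication penalty */
    return t1 / tp;
}
```

With t_parallel = 100 and t_comm = 1 (arbitrary units, no serial part) this model gives a speedup of about 5.3 on 10 processors but only about 4.2 on 20, so adding processors past the peak lowers the speedup.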
There are however cases where speedups in excess of P have been achieved.
Such speedups are referred to as superlinear speedups. The presence of superlinear
speedups does not disprove the above laws, but rather points to an inequality between
the serial and parallel systems being compared. For example a common cause of
superlinear speedups is that the combined memory of a multiprocessor system exceeds
that of the serial system used in the comparison.

3.11 Parallelisation of the Backpropagation Algorithm
A major limiting factor with the backpropagation algorithm is the length of time
it takes to train a network. Backpropagation learning is computationally expensive and
training can take the order of days or weeks for large training sets [6]. This problem is
compounded by the uncertainty associated with backpropagation design; in order to find a
workable solution using the backpropagation algorithm several networks - each with
different network parameters or initial weight configurations - may need to be trained.
Moreover the number of calculations required to train a network grows exponentially as
the size of the data set increases. But as backpropagation is applied to new problem
domains the size of the data sets required by the algorithm invariably increases.
Currently training sets for applications such as character recognition and speech
recognition can require the order of 10^’ training samples and 10^ network parameters
[5].
The backpropagation algorithm is an ideal candidate for parallelisation; it has a
regular structure and repetitively performs the same calculations on its data.
Furthermore, over 3N_{i-1}N_i + 3N_iN_{i+1} multiplication and addition operations are required per
hidden layer, for every training pattern presented. As a result the computational
requirements of the algorithm grow dramatically, even for modest increases in network
size. This is compounded by the fact that larger networks generally require more
training patterns and a larger number of epochs before they converge on a solution.
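
To give a feel for how quickly this operation count grows, the sketch below evaluates 3N_{i-1}N_i + 3N_iN_{i+1} for one hidden layer. It assumes that N_i denotes the number of neurons in layer i; the function name hidden_layer_ops is hypothetical.

```c
#include <assert.h>

/* Approximate number of multiplication and addition operations per training
   pattern for hidden layer i, taking n_prev, n_cur and n_next as the neuron
   counts of layers i-1, i and i+1 (an assumption about the notation). */
unsigned long hidden_layer_ops(unsigned long n_prev, unsigned long n_cur,
                               unsigned long n_next)
{
    assert(n_cur > 0);
    return 3UL * n_prev * n_cur + 3UL * n_cur * n_next;
}
```

For a 10-20-5 network the hidden layer needs about 900 operations per pattern; doubling the input and hidden layers to 20-40-5 raises this to about 3000.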
A number of different approaches exist for the decomposition of
backpropagation into a parallel algorithm, each of which is briefly outlined below.
These descriptions are in part based on [49], which also outlines implementations of
these methods on various architectures.

3.11.1 Training Session Parallelism
This is by far the simplest method for parallelisation of backpropagation, and
can only be viewed as a parallel solution if the entire training session is considered to be
the entity being parallelised. In an attempt to find an optimal solution the network
designer will typically train several networks each with different initial configurations.
The process of training each of these networks and selecting the best solution is termed
a training session. Each of these networks, which are standalone backpropagation
networks in their own right, can be trained on a separate processor, thus reducing the
overall time taken to find a solution. Here the algorithm itself remains untouched and
parallelism is at the network level. This approach is often referred to as "embarrassingly
parallel" due to the high level of independence and lack of communication between
individual tasks.

3.11.2 Training Set Parallelism
This is a simple yet effective method for parallelising the training of a given
network. Here the training set is partitioned and divided up between the processors.
Each processor stores an identical copy of the network and weight matrix, but calculates
its weight updates using its local subsection of the data set. The weight updates from
each processor must be communicated at the end of each epoch, in order to calculate a
new weight matrix for use by all processors in the next epoch. Here parallelism is at the
data level, as the network itself is not partitioned.
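
The scheme can be sketched for the simplest possible case. The code below is an illustrative assumption rather than the implementation used in this work: it uses a single linear neuron trained with the delta rule, and it simulates the processors sequentially rather than with real message passing. All names (partial_update, train_epoch, N_WEIGHTS) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N_WEIGHTS 2

/* Weight updates accumulated by one "processor" over its slice
   [first, last) of the training set. x is stored row by row. */
void partial_update(const double *x, const double *target,
                    size_t first, size_t last,
                    const double *w, double eta, double *delta)
{
    for (size_t j = 0; j < N_WEIGHTS; j++)
        delta[j] = 0.0;
    for (size_t n = first; n < last; n++) {
        double out = 0.0;
        for (size_t j = 0; j < N_WEIGHTS; j++)
            out += w[j] * x[n * N_WEIGHTS + j];
        double err = target[n] - out;
        for (size_t j = 0; j < N_WEIGHTS; j++)
            delta[j] += eta * err * x[n * N_WEIGHTS + j];
    }
}

/* One epoch of training set parallelism: every worker uses the same copy of
   the weights on its own slice, and the partial updates are combined at the
   end of the epoch to give the weights for the next epoch. */
void train_epoch(const double *x, const double *target, size_t n_patterns,
                 size_t n_workers, double *w, double eta)
{
    assert(n_workers >= 1);
    double total[N_WEIGHTS] = {0.0};
    for (size_t p = 0; p < n_workers; p++) {
        size_t first = p * n_patterns / n_workers;
        size_t last = (p + 1) * n_patterns / n_workers;
        double delta[N_WEIGHTS];
        partial_update(x, target, first, last, w, eta, delta);
        for (size_t j = 0; j < N_WEIGHTS; j++)
            total[j] += delta[j];      /* the communication step */
    }
    for (size_t j = 0; j < N_WEIGHTS; j++)
        w[j] += total[j];
}
```

Because the updates are only applied at the end of the epoch, splitting the training set between two workers produces the same new weights as a single worker processing the whole set, up to floating-point rounding.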

3.11.3 Pipelining
With pipelining the network is dissected horizontally into layers and a layer is
assigned to each processor. This permits training patterns to be pipelined between
layers, by allowing the hidden layer's calculations for pattern P to be performed in
parallel with the output layer's calculations for pattern P - 1. It is also possible to
pipeline the training phases instead of the layers of the network. Here one processor
performs the forward pass for pattern P, while another processor performs the backward
pass for pattern P - 1. In this case parallelism is at the training pattern level.

3.11.4 Neuron Parallelism
Using neuron parallelism the network is vertically sliced and divided between
the processing elements. This allows the neurons within a layer to be computed in
parallel. Each processor holds a subset of the neurons from any given layer, and stores
all incoming weights for each neuron it holds. As the inputs to each neuron - the
outputs of all the neurons in the previous layer - are distributed between the processors,
these values must be communicated to all processors before calculation of the current
layer can begin. A problem with neuron parallelism is that the processors only store the
incoming weights for each neuron held, but their outgoing weights are required when
calculating the error value for a neuron. These values must be either communicated or
duplicated and updated twice.

3.11.5 Synapse Parallelism
While neuron parallelism vertically slices the network and stores the incoming
weights for each neuron, synapse parallelism vertically slices it and stores the outgoing
weights for each neuron. Using this method each processor lacks the information
required to calculate a node's output. Instead a partial output is calculated for each node
and this is communicated between all the processors. The advantage of this method is
that error values can be calculated without duplication or communication.

3.11.6 Vector Processing
Virtually all of the calculations performed while training a backpropagation
network can be represented by matrix operations. These operations perform the same
basic arithmetic operation on all elements of a vector or matrix, and as such are well
suited to implementation on SIMD machines.
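
As a concrete instance, the weighted input sums of one layer form a matrix-vector product. The sketch below is a plain serial version in C, given only to show the regular multiply-accumulate pattern that a SIMD machine could apply to several elements at once; the function name layer_net_input and the row-major weight layout are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* net[j] = sum over i of w[j][i] * x[i], with the n_out x n_in weight
   matrix stored row by row in the flat array w. */
void layer_net_input(const float *w, const float *x, float *net,
                     size_t n_out, size_t n_in)
{
    assert(w != NULL && x != NULL && net != NULL);
    for (size_t j = 0; j < n_out; j++) {
        float sum = 0.0f;
        /* The same multiply and add is applied to every element of the row. */
        for (size_t i = 0; i < n_in; i++)
            sum += w[j * n_in + i] * x[i];
        net[j] = sum;
    }
}
```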

Type of Parallelism    Granularity    Partition
Training Session       Very Coarse    Network
Training Set           Coarse         Data
Pipelining             Medium         Layer
Neuron                 Fine           Vertical Slicing
Synapse                Fine           Vertical Slicing
Vector Processing      Very Fine      Array Element

Table 3.1 Strategies for parallelising backpropagation

3.12 Conclusion
Applying Flynn's taxonomy to parallel computers, four classes of parallel
architectures can be identified: SISD, SIMD, MISD and MIMD. Of these SIMD and
MIMD are the most practical parallel computers. SIMD computers are suitable for
finely grained problems such as vector processing, where each processing element
performs the same instruction on a different element of the array, whereas MIMD
computers are the most general of the parallel architectures and are applicable to more
coarsely grained problems.

With either model, memory may be shared between the processors, or each
processor may access its own local block of memory. Shared memory systems
communicate via shared variables and provide synchronisation through the use of
devices such as semaphores and mutexes. Distributed memory systems employ message
passing to perform both communication and synchronisation.

The manner in which processors are connected together is referred to as the
interconnection network. Ideally the more processors that are connected to each other
the better, as this reduces the number of communication steps required for two
processors to communicate. But as adding connections to the network increases both the
cost and complexity of the system a balance has to be struck between the two.
Hypercubes and meshes offer such a balance.

A viable alternative to expensive purpose-built parallel computers is a cluster of
off-the-shelf machines, configured to work collectively as a virtual MIMD computer.
Cluster computers are not only cheaper, but are also more flexible, scalable, and easier
to maintain and program than standard parallel machines.

Granularity is a measure of the amount of computation versus communication
required to perform a parallel calculation. It is loosely defined as fine, medium or
coarse. In general, the finer the granularity the greater the level of parallelism inherent
in the problem. The granularity associated with a problem has an effect on its suitability
for implementation on different parallel architectures.

Load balancing is an attempt to achieve maximum processor utilisation, by
reducing the amount of processor idle time. Static load balancing attempts to do this prior
to the execution of the program. But for many problems this is all but impossible and
dynamic load balancing should be employed to perform it as part of the parallel
calculation. The lower the granularity of a problem the easier it is to perform load
balancing for that problem.

Speedup is a measure of the performance gains achieved through parallelisation.
It is defined as:

\[ \text{speedup} = \frac{\text{execution time of a serial program on a single computer}}{\text{execution time of a parallel program on a parallel computer}} \]

Amdahl's law suggests that speedup is limited by the amount of serial
calculation required by the parallel solution, and not the number of processors applied
to the problem. But the Gustafson-Barsis law shows that if the problem is large enough
Amdahl's law can be overcome, and speedup becomes linearly related to the number of
processors used.
A number of approaches exist for implementing backpropagation on a parallel
system. Of these, training session and training set parallelism are the coarsest, implying
that they have the lowest communication cost and hence are the most suited to cluster
computing, whereas vector processing can efficiently perform the matrix operations that
underlie neural network calculations, wherever SIMD processing is available.
It is therefore proposed to develop parallel computing techniques for
backpropagation training that exploit widely available and underused computing
resources. The most suitable resources identified in this chapter are:

• SIMD processing extensions for NN vector processing. Many ANN operations
  can be computed in parallel using this technology, if it is available on the host
  processor. Many compilers can miss opportunities for parallelism on these
  processors.

• Clustering of standard Personal Computers, which can be configured and
  programmed to perform training session and training set parallelism in either:
  1. Dedicated cluster mode, where training is performed on a fixed number
     of dedicated machines.
  2. A HTC mode, where training is performed on an arbitrary set of volunteer
     machines.

Implementing a HTC environment for backpropagation training set parallelism
brings its own set of challenges as:

1. Network weights must be synchronised at the end of each training epoch.
2. Machines can enter and leave the set of volunteer machines as they move
between busy and idle states.

These goals and challenges are outlined in the following chapters. In the next
chapter the use of SIMD processor extensions is examined for NN vector processing.

Chapter 4

Vector Processing for Backpropagation

4.1 Introduction
SIMD (Single Instruction, Multiple Data) refers to a parallel programming
paradigm where a single instruction is issued to several processing elements, each of
which executes the same instruction on a separate piece of data. One of the more
appealing aspects of the SIMD model is the natural way in which it handles calculations
performed on arrays of data, with each processor simultaneously performing the
operation on individual array elements, offering a potential linear speedup. This fact has
prompted many PC vendors to include SIMD type operations in their instruction sets.
This chapter investigates issues arising out of the use of such SIMD features in an
implementation of the backpropagation algorithm. Much of the discussion below is
specific to Intel's Pentium family of processors; however the principles may be
generalised to almost any processor with similar SIMD abilities.
While there are also a number of library routines for efficiently implementing
operations on arrays in vector architectures, with languages such as C, FORTRAN 77
[10] and Java [50], the main focus here is a direct examination of low level SIMD
operations available on common architectures, such as Intel, Motorola and AMD
processors.

4.2 SIMD Architecture
SIMD is a style of parallel programming where a single program executes across
all processors in the system. The processors share a common control unit (sequencer),
and are thus restricted to executing each instruction of the program in a lock step fashion,
with all processors issuing the same instruction at the same time. However processors
have independent access to memory, enabling them to select different operands for each
instruction issued.

Using the SIMD model it is difficult to write programs in the traditional sense.
Each processor is restricted to executing the same sequence of instructions, regardless
of any differences in their input data. Constructions such as conditional branches can be
facilitated in SIMD architectures by switching off any processors not executing that
branch, see Figure 4.1. However their use is inefficient for programs that execute long
branch sequences, as all processors not participating in the branch must remain idle
while it executes. As a result SIMD processing is most appropriate for highly regular
calculations such as those involved in array and matrix calculations.

Code:
    B <- 2 * A
    if B < 7
        C <- B * A
        C <- C + A
    else
        C <- B + A
    D <- C + 4

Instruction Sequence    PE 1      PE 2      PE 3      PE 4
(initial value)         A = 2     A = 3     A = 4     A = 5
B <- 2 * A              B = 4     B = 6     B = 8     B = 10
C <- B * A              C = 8     C = 18    Off       Off
C <- C + A              C = 10    C = 21    Off       Off
C <- B + A              Off       Off       C = 12    C = 15
D <- C + 4              D = 14    D = 25    D = 16    D = 19

Figure 4.1 Conditional Branch in a SIMD system
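
The switching off of processing elements during a branch can be simulated serially. The sketch below is an illustration only: it models four PEs with an array and a per-PE activity flag, and uses the values of Figure 4.1. The names simd_branch and N_PE are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define N_PE 4

/* Lock-step execution of
       B = 2 * A;
       if (B < 7) { C = B * A; C = C + A; } else { C = B + A; }
       D = C + 4;
   Every instruction is issued to all PEs; a PE whose active flag is
   clear is switched off and ignores the instruction. */
void simd_branch(const int *a, int *d)
{
    int b[N_PE], c[N_PE] = {0}, active[N_PE];
    size_t pe;

    assert(a != NULL && d != NULL);
    for (pe = 0; pe < N_PE; pe++) b[pe] = 2 * a[pe];

    /* "if" part: only PEs with B < 7 remain switched on */
    for (pe = 0; pe < N_PE; pe++) active[pe] = (b[pe] < 7);
    for (pe = 0; pe < N_PE; pe++) if (active[pe]) c[pe] = b[pe] * a[pe];
    for (pe = 0; pe < N_PE; pe++) if (active[pe]) c[pe] = c[pe] + a[pe];

    /* "else" part: the activity flags are inverted */
    for (pe = 0; pe < N_PE; pe++) active[pe] = !active[pe];
    for (pe = 0; pe < N_PE; pe++) if (active[pe]) c[pe] = b[pe] + a[pe];

    /* after the branch all PEs are switched on again */
    for (pe = 0; pe < N_PE; pe++) d[pe] = c[pe] + 4;
}
```

Note that both arms of the branch are issued in sequence, so the total time is the sum of the two arms even though each PE only takes part in one of them.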

Using a serial model the same operation must be repeatedly performed on each
individual array element, whereas in the SIMD model the array is distributed between
the processors for simultaneous processing, see Figure 4.2. As each processor is only
required to perform simple calculations, the PEs in vector processing machines
generally contain little more than an Arithmetic Logic Unit (ALU). To date most
commercial SIMD machines are vector processors.

[Figure: a) SIMD vector processing; b) Serial (SISD) vector processing; horizontal axis: Current Clock Cycle]

Figure 4.2 SIMD versus serial vector processing

The benefits of SIMD processing are not limited to large scientific applications
running on commercial vector processing machines. Any application requiring repeated
calculations on arrays of data would profit from the SIMD approach. Such applications
include the games and multimedia programs commonly found running on many
personal computers. With this in mind many personal computer vendors now
incorporate some SIMD features on their chipsets.

Vendor      Extension
Intel       MMX, SSE, SSE2, SSE3
AMD         3DNow!, Enhanced 3DNow!, 3DNow! Professional
Motorola    AltiVec

Table 4.1 SIMD Extensions of popular vendors

In general these SIMD features are intended to speed up applications that deal
with large arrays of data and as such only provide a limited set of SIMD operations to
aid vector processing. Central to such SIMD capabilities is the idea of a packed data
type and packed operations. A packed data-type is an X-bit data-type that is logically
composed of Y independent elements. Operations performed on these data-types are
simultaneously performed on each of the individual packed elements, thus allowing a
single instruction to effectively perform Y independent calculations in parallel, with a
potential Y-times speedup over standard serial operations.

[Figure: reg3 = add (reg1, reg2), where each 64-bit register holds four 16-bit elements that are added independently]

Figure 4.3 Packed 64-bit add instruction operating on four 16-bit Integers
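
The effect of a packed add can be emulated in portable C on an ordinary 64-bit integer. The sketch below is not how the hardware instruction is issued; it only illustrates that the four 16-bit elements are added independently, with no carry passing from one element to the next. The function names pack4x16, packed_add16 and lane16 are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 16-bit elements into one 64-bit word, element 0 in the low bits. */
uint64_t pack4x16(const uint16_t *e)
{
    return (uint64_t)e[0] | ((uint64_t)e[1] << 16) |
           ((uint64_t)e[2] << 32) | ((uint64_t)e[3] << 48);
}

/* Element-wise wrap-around addition of four packed 16-bit elements. */
uint64_t packed_add16(uint64_t a, uint64_t b)
{
    const uint64_t high = 0x8000800080008000ULL;   /* top bit of each element */
    /* Add the low 15 bits of every element, then fold in the top bits with
       XOR so that a carry never crosses into the neighbouring element. */
    return ((a & ~high) + (b & ~high)) ^ ((a ^ b) & high);
}

/* Extract element i (0 to 3) from a packed word. */
uint16_t lane16(uint64_t v, int i)
{
    assert(i >= 0 && i < 4);
    return (uint16_t)(v >> (16 * i));
}
```

Adding the packed elements {1, 2, 65535, 7} and {6, 5, 1, 12} in this way gives {7, 7, 0, 19}: the third element wraps around without disturbing the fourth.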

4.3 Intel's SIMD Extensions
Intel introduced SIMD features to its IA32 architecture in a number of stages,
beginning with the MMX [51] extension to certain Pentium processors; this was later
extended by SSE [52] for the Pentium III, SSE2 for the Pentium 4 and SSE3. All
extensions are fully backward compatible, allowing a Pentium 4 to issue MMX and SSE
instructions. The main features added to the IA32 architecture by these extensions
include new packed data-types, packed registers and instructions to operate on them.

4.3.1 Packed Registers and Packed Data-Types
Two new sets of eight packed registers were added by the extensions - the MM
and XMM registers. Associated with these registers are a number of packed data-types.

[Figure: the 64-bit MM registers, overlaid on the low 64 bits of the 80-bit Floating Point Registers, and the 128-bit XMM registers XMM0 to XMM7]

Figure 4.4 Packed Registers

64-bit MM registers (MM0 - MM7)
With the MMX extension the MM registers operate on integer values. The data
stored in MM registers may be interpreted as

• 8 Packed 8-bit Byte Integers (Signed\Unsigned)
• 4 Packed 16-bit Word Integers (Signed\Unsigned)
• 2 Packed 32-bit DoubleWord Integers (Signed\Unsigned)
• A single 64-bit QuadWord Integer (Signed\Unsigned)

While the MM registers are logically independent registers they are in fact
aliases to the registers in the FPU data stack. As a result a special EMMS instruction is
required to empty the MMX state when switching from MMX to floating point routines.

128-bit XMM registers (XMM0 - XMM7)
Introduced as part of the SSE extension the XMM registers originally only
operated on single-precision floating-point values. However SSE2 extends their use to
operations involving integer and double-precision floating-point values. Data stored in
XMM registers can be interpreted as one of the following data-types

• 4 Packed Single-precision floating-point (32-bit)
• 2 Packed Double-precision floating-point (64-bit)
• Integers (Signed\Unsigned)
    16 Packed Bytes (8-bit)
    8 Packed Words (16-bit)
    4 Packed DoubleWords (32-bit)
    2 Packed QuadWords (64-bit)
    A single non-packed Double QuadWord (128-bit)

The XMM registers are real physical registers and are not aliases onto any other
registers in the system.

MXCSR register
With the introduction of SSE and SIMD floating-point calculations also came
the addition of the MXCSR register. It performs a task similar to the FPU Status and
Control words, in that it both flags and masks exceptions. All floating-point exceptions
that occur when operating on XMM registers are flagged using the MXCSR register¹.

4.3.2 SIMD Operations
Intel's SIMD instruction set includes all instructions added by the MMX, SSE,
SSE2 and SSE3 extensions. These instructions are presented in summary form in
Appendix A and will not be discussed in any great depth in this section, which will limit
itself to discussing the kinds of processing the instructions allow.

SIMD processing using packed registers is specifically designed to speed up
calculations on horizontal arrays of data. The packed elements of a register represent
consecutive elements of the array, allowing them to be loaded and processed in blocks,
see Figure 4.7. A number of arithmetic, logical and shift operations can be performed on
these blocks of packed data.
Such methods are not however so apt when it comes to matrix calculations.
While the rows of a matrix may be processed in an identical manner to that of an array,
columnar calculations must transpose the matrix and process it as a row. A number of
shuffle-type instructions are provided to assist in this task; these instructions allow
packed elements to be swapped within or between registers, and can be used to
implement SIMD matrix transposition².

Although there are no branch statements in the instruction set there are a number
of comparison operations. These work much like standard comparison statements in that
they evaluate an expression and return a True\False value as a result. The difference
is that they perform the comparison on all packed elements of the register and return
a result for each. These results take the form of a packed bit-mask with a mask of ones
for all elements that evaluated to true and a mask of zeros for false elements, see Figure 4.5.
There are two approaches to implementing conditional statements based on these
results: branch elimination or serial processing.

¹ The FISTTP (Store Integer With Truncation) instruction is the only exception; however this instruction operates on
the floating point stack and not the SIMD registers.
² The Intel compiler provides optimised macros that perform shuffle operations requiring a number of
SIMD instructions.

      4       1       5       9
 ==   3       1       8       9
  =  0...0   1...1   0...0   1...1

Figure 4.5 Packed Compare Operation

• Branch Elimination

With branch elimination the instructions in a conditional block are applied to
all packed elements, regardless of whether they evaluated to true or false in
preceding comparisons. However elements for which the comparison evaluated
to false should be unaffected by these instructions and must be restored to their
original values using logical operations and the bit mask, see Figure 4.6.

Sequence of operations: Comparison, Logical, Arithmetic

    // Code containing conditional selection statement
    if (A[i] > B[i])
        E[i] := C[i] + D[i]

    // Code after branch elimination
    Mask[i] := A[i] > B[i]
    C'[i]   := C[i] AND Mask[i]
    E[i]    := C'[i] + D[i]

Figure 4.6 Branch Elimination

• Serial Processing

In cases where branch elimination is impractical the array will have to be
processed serially. This can be aided through the use of one of the create bit
mask instructions, which condense the most significant bits of each packed
element into a single value.
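
The branch elimination approach described above can be sketched in scalar C. This is an illustrative assumption rather than the SIMD code itself: each loop iteration stands in for one packed element, and the mask is built with ordinary integer arithmetic instead of a packed compare instruction. Unlike the pseudocode of Figure 4.6, this version restores the original value of E for elements where the comparison is false, as the text describes. The function name masked_add is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch-free form of:  if (a[i] > b[i]) e[i] = c[i] + d[i];
   Elements whose comparison is false keep their original value of e[i]. */
void masked_add(const int32_t *a, const int32_t *b, const int32_t *c,
                const int32_t *d, int32_t *e, size_t n)
{
    assert(a && b && c && d && e);
    for (size_t i = 0; i < n; i++) {
        /* All ones when the comparison is true, all zeros otherwise,
           mirroring the bit-mask returned by a packed compare. */
        uint32_t mask = 0u - (uint32_t)(a[i] > b[i]);
        /* The arithmetic is carried out for every element. */
        uint32_t sum = (uint32_t)c[i] + (uint32_t)d[i];
        /* Keep the new value where the mask is set, the old one elsewhere. */
        e[i] = (int32_t)((sum & mask) | ((uint32_t)e[i] & ~mask));
    }
}
```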

[Figure: a) Vector Operation A = B + C; b) Scalar Operation A = xB, using a register packed with the scalar value; c) Unary Operation c = Σa, using running totals in a packed register followed by serial calculations on the partial results]

Figure 4.7 SIMD Array Processing with Packed Registers
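
The unary (summation) case can be sketched in scalar C. The code below is an illustration under the assumption of four elements per packed register: it keeps four running totals while whole blocks are processed, then combines the partial results serially. The function name blocked_sum is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of an array using four running totals, one per packed element. */
float blocked_sum(const float *a, size_t n)
{
    assert(a != NULL || n == 0);
    float partial[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    /* Each pass over k stands in for one packed add on a block of four. */
    for (; i + 4 <= n; i += 4)
        for (size_t k = 0; k < 4; k++)
            partial[k] += a[i + k];

    /* Serial calculations: combine the partial results... */
    float total = (partial[0] + partial[1]) + (partial[2] + partial[3]);
    /* ...and add any elements left over after the last full block. */
    for (; i < n; i++)
        total += a[i];
    return total;
}
```

Because the additions are performed in a different order from a simple serial loop, the floating-point result may differ slightly in the last bits for general data.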

4.3.4 SIMD Design Considerations
Using SIMD instructions can greatly enhance the performance of applications
that process large arrays of data. But in order to receive the full benefit from such
instructions the design of the application may have to be altered to take account of
certain considerations. The following points are based in part on [53].

4.3.4.1 Size of Data Elements
SIMD instructions speed up execution by performing the same operation on
all packed elements in a register simultaneously. The number of elements packed into
the registers therefore limits the maximum possible speedup attainable. This depends on
the size of the elements; by choosing smaller sized elements more of them may be
processed in a single instruction, although it also has the effect of reducing the range
and precision of values that can be represented by each element.
In order to achieve the maximum speedup the smallest suitable data-type should
be selected for all arrays intended for use with SIMD instructions. This may involve a
trade off between speed and precision; for backpropagation training 32 bits of precision
are generally sufficient [54], allowing the application to work with four single-precision
values in each XMM register.

4.3.4.2 Alignment of Data
Ideally data destined for use with the MM and XMM registers should be aligned
on 8 and 16 byte boundaries respectively. Aligned data can be loaded to and from these
registers in a single memory access as opposed to the two required for unaligned data.
Furthermore a number of instructions that operate on the XMM registers require
aligned data as operands, and will generate a general protection fault if the data is not
aligned on a 16-byte boundary.

Statically declared variables can be aligned using compiler-specific
directives such as __declspec() in Visual C++, while dynamically allocated memory
can be aligned using the _mm_malloc() function. Memory allocated using
_mm_malloc() must be freed using the corresponding _mm_free() function.

// Static Allocation
__declspec (align (16)) int x;
__declspec (align (16)) int y [10];

// Dynamic Allocation
ptr = _mm_malloc (size, 16);
_mm_free (ptr);

Note that Intel's cache lines are 32-byte aligned; for the application to make
efficient use of caching data should be aligned on 32-byte and not 16-byte boundaries.
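
Where a compiler-specific allocator is not available, an aligned block can be built on top of the standard malloc(). The sketch below is a portable illustration of the idea and is not the implementation of _mm_malloc(); the names aligned_alloc_sketch, aligned_free_sketch and is_aligned are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate size bytes on an alignment-byte boundary (a power of two).
   Extra space is requested so the start can be rounded up, and the pointer
   returned by malloc is stored just before the aligned block. */
void *aligned_alloc_sketch(size_t size, size_t alignment)
{
    assert(alignment >= sizeof(void *) &&
           (alignment & (alignment - 1)) == 0);
    void *raw = malloc(size + alignment - 1 + sizeof(void *));
    if (raw == NULL)
        return NULL;
    uintptr_t start = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + alignment - 1) & ~(uintptr_t)(alignment - 1);
    ((void **)aligned)[-1] = raw;    /* remembered for the matching free */
    return (void *)aligned;
}

/* Release a block obtained from aligned_alloc_sketch(). */
void aligned_free_sketch(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}

/* Non-zero when p lies on the given power-of-two byte boundary. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

As with _mm_malloc() and _mm_free(), a block allocated this way must be released with its matching free routine and never with plain free().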

4.3.4.3 Organisation of Data in Memory
For packed data to be moved between memory and the MM and XMM registers,
the individual elements must be in consecutive memory locations. If the underlying
data structures are not declared with this in mind then additional move instructions will
be required to reorganise the data, which could make the use of SIMD operations
inefficient. There are two main areas of concern with regard to this.

Declaring of Arrays of Data

With simple arrays of data the elements are guaranteed to be consecutive. But care
must be taken when declaring multi-dimensional arrays, as the elements will be
arranged consecutively for only one dimension of the array, see Figure 4.8. It is important
to ensure that this dimension is the one that will most benefit from SIMD processing.

[Figure: layout in memory of the data structures declared below]

    int array [8];
    int matrix [4] [6];
    int cube [3] [4] [2];

Figure 4.8 Location of array elements in memory
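
In C, multi-dimensional arrays are stored row by row, so it is the last index that steps through consecutive memory locations. The sketch below computes the position of an element from its indices; the function names offset2d and offset3d are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Position of matrix[row][col] in a 2-D array with n_cols columns,
   counted in elements from the start of the array. */
size_t offset2d(size_t row, size_t col, size_t n_cols)
{
    assert(col < n_cols);
    return row * n_cols + col;
}

/* Position of cube[i][j][k] in a 3-D array whose last two dimensions
   are n_j and n_k. Only k moves through consecutive locations. */
size_t offset3d(size_t i, size_t j, size_t k, size_t n_j, size_t n_k)
{
    assert(j < n_j && k < n_k);
    return (i * n_j + j) * n_k + k;
}
```

For int matrix [4] [6], the elements matrix[2][0] to matrix[2][5] therefore occupy positions 12 to 17 and can be loaded as packed blocks, whereas the elements of a column are six positions apart.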

Declaring Composite Data-Types

Attention must be paid to the design of the structural hierarchy of data in the
program. If variables are distributed between instances of different structures, so too
will be their memory locations. For variables that are to be used with SIMD processing,
structures of arrays are preferable to arrays of structures.

[Figure: layout in memory of a Structure of Arrays compared with an Array of Structures]

Figure 4.9 Organisation of composite data-types in memory
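
The difference between the two layouts can be made concrete with a pair of declarations. The sketch below is illustrative only; the type names point and point_arrays, the member names x, y and z, and the helper functions are assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define N_POINTS 8

/* Array of structures: declared as struct point p[N_POINTS]. The x values
   of neighbouring elements are separated by the y and z members. */
struct point {
    float x, y, z;
};

/* Structure of arrays: all x values are consecutive in memory and can be
   loaded into a packed register in blocks. */
struct point_arrays {
    float x[N_POINTS];
    float y[N_POINTS];
    float z[N_POINTS];
};

/* Distance in bytes between the x members of elements 0 and 1. */
size_t aos_x_stride(const struct point *p)
{
    assert(p != NULL);
    return (size_t)((const char *)&p[1].x - (const char *)&p[0].x);
}

size_t soa_x_stride(const struct point_arrays *s)
{
    assert(s != NULL);
    return (size_t)((const char *)&s->x[1] - (const char *)&s->x[0]);
}
```

In the structure of arrays the stride between successive x values equals the size of one float, which is the condition needed for a packed load.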

It should be noted that the above explanation is not strictly speaking true when
dealing with large structures of data. If X, Y and Z are sufficiently large then the
corresponding elements of each array may be located on different pages in memory,
necessitating that the memory sub-system swap pages for each packed operation.
In this case the best approach is to employ a hybrid system where the data is
stored in an array of structures, but each structure holds a sub array of the total. This
way the sub arrays can be processed using packed instructions, while at the same time
keeping the corresponding elements of each array on the same memory page. For
greater efficiency the size of each sub array should be a multiple of the 32-byte cache
line.
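
A hybrid layout of this kind might be declared as follows. This is a sketch under the assumption of 4-byte floats, so that a sub array of eight elements fills one 32-byte cache line; the names xyz_block, SUB, add_xy and x_element are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define SUB 8   /* 8 floats = 32 bytes, assuming 4-byte floats */

/* One structure holds a sub array of each of X, Y and Z, so corresponding
   elements stay close together in memory. */
struct xyz_block {
    float x[SUB];
    float y[SUB];
    float z[SUB];
};

/* z = x + y over an array of blocks. Each inner loop covers one sub array,
   which could be handled by packed instructions. */
void add_xy(struct xyz_block *blocks, size_t n_blocks)
{
    assert(blocks != NULL || n_blocks == 0);
    for (size_t b = 0; b < n_blocks; b++)
        for (size_t k = 0; k < SUB; k++)
            blocks[b].z[k] = blocks[b].x[k] + blocks[b].y[k];
}

/* Logical element i of X lives in block i / SUB at position i % SUB. */
float *x_element(struct xyz_block *blocks, size_t i)
{
    return &blocks[i / SUB].x[i % SUB];
}
```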

4.3.4.4 Padding
In order to avail of SIMD processing the elements of an array must be loaded
into the SIMD registers in fixed sized blocks. It is only when the size of the array is a
multiple of the number of elements packed into these blocks that the entire array can be
processed using SIMD instructions. In all other cases an extra clean-up loop will be
required to process the leftover elements.

This clean-up loop is serial and on average will require more instructions and
take more time than if the remaining elements were processed in a single SIMD
instruction, see Graph 4-1. The clean-up loop can be avoided by padding out all arrays
to an appropriate length. While such an approach will achieve only slight gains on each
array processed, in applications where the calculations are repetitively performed such
gains may be noticeable.

When padding is to be used, the algorithm should be carefully analysed to
ensure that the padded elements neither interfere with the underlying calculation, nor
acquire some value that is invalid for the operations performed on them. Padding may
also be required to maintain alignment for members of a structure or rows of a matrix.
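
The two alternatives can be sketched in C as follows. The code is illustrative only and assumes four floats per packed register; the inner loop over k stands in for a single packed instruction. The function names padded_length and add_with_cleanup are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of lanes that is not less than n: the length an array
   of n elements must be padded to so that it divides into whole blocks. */
size_t padded_length(size_t n, size_t lanes)
{
    assert(lanes > 0);
    return (n + lanes - 1) / lanes * lanes;
}

/* A = B + C on unpadded arrays: whole blocks first, then a serial clean-up
   loop for the leftover elements. Returns the number of leftovers. */
size_t add_with_cleanup(float *a, const float *b, const float *c, size_t n)
{
    const size_t lanes = 4;              /* four floats per 128-bit register */
    size_t i = 0;

    for (; i + lanes <= n; i += lanes)   /* one packed add per block */
        for (size_t k = 0; k < lanes; k++)
            a[i + k] = b[i + k] + c[i + k];

    size_t leftover = n - i;
    for (; i < n; i++)                   /* serial clean-up loop */
        a[i] = b[i] + c[i];
    return leftover;
}
```

An array of 10 floats leaves 2 elements for the clean-up loop; padding it to padded_length(10, 4) = 12 elements removes the clean-up loop at the cost of two unused elements.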

[Graph: Padded vs Unpadded Arrays; series: Non-Padded SSE, Padded SSE; horizontal axis: Length of Array (1 to 20)]

Graph 4-1 Padded vs Unpadded Array Processing

4.4 Issuing Intel's SIMD Instructions from C Code
There are four methods of writing SIMD enabled code for use on Intel's IA32
processors: issuing the instructions directly using assembly language, using intrinsics or
a class interface to mask the underlying register allocation, or writing unmodified serial
code and allowing a parallelising compiler to convert it to SIMD enabled code. Of the
four methods assembly code offers the greatest rewards in speed, while automatic
vectorising is possibly the easiest.

Implementation Method   Code                                      Type of Operand

Assembly                for_count:                                XMM Register
                          movaps xmm0, [ebx + edi]
                          movaps xmm1, [eax + edi]
                          mulps  xmm0, xmm1
                          movaps [esi + edi], xmm0
                          add    edi, 16
                          loop   for_count

Intrinsics              for (i = 0; i < size; i++) {              __m128 *
                            A[i] = _mm_mul_ps (B[i], C[i]);
                        }

Class                   for (i = 0; i < size; i++) {              F32vec4 *
                            A[i] = B[i] * C[i];
                        }

C                       for (i = 0; i < size; i++) {              float *
                            A[i] = B[i] * C[i];
                        }

Table 4.2 Different coding strategies for the vector operation A = B * C

4.4.1

Assembly Language
The main advantage of programming in assembly language is that it offers the
most rewards in terms of the efficiency of the code produced. In cases where the
compiler employed by the core application doesn't support SIMD instructions it may
also be the only way to write SIMD code; the routines may be assembled separately and
linked to the core application later. This is true regardless of the high-level language
used.
However there are many disadvantages to assembly language coding. Writing in
assembly language is generally slower and more error prone than writing in a high-level
language. It may also be difficult to optimise the assembly code as well as the
compiler does. This is due to the complex micro-architecture employed by modern
microcomputers, a detailed knowledge of which may be required in order to fully
optimise the assembly.
This point is clearly seen in Graph 4-2, which shows the number of clock cycles
required to perform the addition operations of Table 4.2. Cycles were measured using
the RDTSC instruction, and all data was preloaded into the L1 cache to eliminate caching
effects [55]. Of the four methods assembly language fared the worst.

[Graph 4-2: clock cycles against Size of Array; series: Class, Asm, Intrinsic.]

Graph 4-2 Performance of assembly compared to other implementation methods

In order to achieve a better performance than the other methods the loop has to
be unrolled three times to exploit the parallelism available due to Intel's superscalar
architecture, as shown below.
    unroll   = (len - (len % 3)) / 3;
    leftover = (len % 3);

    __asm {
        lea    esi, A
        lea    ebx, B
        lea    eax, C
        mov    edi, 0
        mov    ecx, unroll
        jecxz  start_leftover
    for_count:
        movaps xmm0, XMMWORD PTR [ebx + edi]
        mulps  xmm0, XMMWORD PTR [eax + edi]
        movaps XMMWORD PTR [esi + edi], xmm0
        movaps xmm1, XMMWORD PTR [ebx + edi + 16]
        mulps  xmm1, XMMWORD PTR [eax + edi + 16]
        movaps XMMWORD PTR [esi + edi + 16], xmm1
        movaps xmm2, XMMWORD PTR [ebx + edi + 32]
        mulps  xmm2, XMMWORD PTR [eax + edi + 32]
        movaps XMMWORD PTR [esi + edi + 32], xmm2
        add    edi, 48
        loop   for_count
    start_leftover:
        mov    ecx, leftover
        jecxz  end_prog
    leftover_count:
        movaps xmm1, XMMWORD PTR [ebx + edi]
        mulps  xmm1, XMMWORD PTR [eax + edi]
        movaps XMMWORD PTR [esi + edi], xmm1
        add    edi, 16
        loop   leftover_count
    end_prog:
    }

Another difficulty with assembly is its lack of portability; this can be seen at a
number of levels. Firstly, assembly language instructions will only operate on one target
architecture. Secondly, new instructions are not backward compatible with earlier
members of the same family, and techniques for optimising on one generation of
processor may result in poor performance when the code is ported over to the next
generation. This is true of the Pentium III and 4, where different optimisation
techniques are proposed for each processor. Finally, assembly routines may prevent code
being ported to different operating systems running on the same processor; Windows and
Linux employ different syntax for assembly language.
Microsoft's Macro Assembler (Version 6.11d) supports Intel's SIMD
instructions, but their Visual C++ Compiler 6.0 requires the following patches if it is to
recognise SIMD instructions:

•  Visual C++ 6.0 Processor Pack

•  Service Pack 4 or 5

4.4.2

Compiler Intrinsics
Coding in assembly can be slow and error prone, and for many simple routines
the end result is no more efficient than compiler-optimised code. For applications coded
in C/C++, SIMD routines can be coded without the need to worry about register
assignment, by using Intel's C/C++ compiler intrinsics. These are C macro functions
that perform SIMD operations on enumerated data, representing packed data-types. The
operations performed by an intrinsic may involve one or more SIMD instructions.
Intrinsics operate on values of type __m128, a 128-bit data-type used to represent
a 128-bit packed float; it is defined as typedef long long __m128. They have the
advantage of being more portable than assembly and easier to implement, but may suffer
from some performance penalty. The files mmintrin.h, xmmintrin.h and emmintrin.h
contain the intrinsics required for MMX, SSE and SSE2 respectively.
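As an illustration of the intrinsics approach, the vector operation A = B * C might be written as below. This is a sketch rather than the code used in the study: the function name mul_arrays_sse is hypothetical, and the unaligned load and store intrinsics are used so that the routine does not assume 16-byte aligned data (aligned data would allow _mm_load_ps and _mm_store_ps instead).

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_mul_ps, ... */

/* a[i] = b[i] * c[i] for n floats; n need not be a multiple of 4. */
void mul_arrays_sse(float *a, const float *b, const float *c, size_t n)
{
    size_t i = 0;

    /* Packed part: four single-precision products per iteration. */
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(a + i, _mm_mul_ps(vb, vc));
    }

    /* Serial clean-up loop for the leftover elements. */
    for (; i < n; i++)
        a[i] = b[i] * c[i];
}
```

No registers are named anywhere in the routine; register allocation is left to the compiler.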

4.4.3

C++ Class Interface
If the application is to be written in C++ rather than C it may be more
convenient to use Intel's C++ class interface rather than compiler intrinsics. The class
interface provides a more natural programming medium when operating in an
object-oriented environment. It represents packed data-types as objects, and defines
overloaded functions to operate on them.
Single precision floating-point values are represented using the F32vec4 class,
which is simply a class operating on the __m128 data-type with overloaded operators for
most operations. The files ivec.h, fvec.h and dvec.h contain the class interfaces for
MMX, SSE and SSE2 respectively.

4.4.4

Compiler Vectorisation
The final way in which code can be SIMD enabled is to allow the compiler to
detect and automatically vectorise any operations that could be performed using SIMD
instructions. This way the program can be written using the standard constructs of a
high-level language, without the aid of any libraries or extensions. One of the main
advantages of this method is that it allows the code to be ported to non-Intel systems,
and recompiled to take advantage of their SIMD features. Intel C++ Compiler 8.0 can
automatically vectorise standard C/C++ or Fortran code.
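A loop of the following shape is the kind of code a vectorising compiler is generally able to convert. The sketch is illustrative only: the function name is hypothetical, the C99 restrict qualifier is an added assumption that the three arrays never overlap, and whether a given compiler actually vectorises the loop depends on the compiler and the options used.

```c
#include <stddef.h>

/* Plain serial C: a[i] = b[i] * c[i].
   restrict promises the compiler that a, b and c do not alias,
   which removes one common obstacle to automatic vectorisation. */
void mul_arrays_plain(float *restrict a, const float *restrict b,
                      const float *restrict c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}
```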

4.5

Implementation of Backpropagation using SIMD Processing
Neural networks can be efficiently represented using matrix operations and
notation, which is precisely the type of processing Intel's SIMD extensions are designed
to accelerate. It therefore stands to reason that the backpropagation algorithm could
benefit from incorporating these SIMD features wherever vector processing is involved.
This section investigates the implementation of backpropagation using packed SIMD
instructions.
Where used in this section:

N indicates the number of elements in an array,
P the number of simultaneously processed elements,
M the number of rows in the training set.

For simplicity it is assumed that P is a factor of N. When operations are
performed on the packed single-precision floating-point values of the XMM registers,
P = 4. As with standard mathematical notation, vector variables are capitalised and
scalar variables are lowercase. Packed operations performed on all P elements of the
XMM registers are overlined, while operations performed on the general-purpose
registers are not.
In order to keep the algorithms outlined in the tables below as concise as
possible the Σ symbol is used to represent loops. It is hoped that this will not lead to
any confusion, as it is also used in its mathematical context to describe the function
being implemented.

4.5.1

Analysis of Backpropagation's Constituent Equations
Most backpropagation calculations are performed on arrays/matrices of data, and
so should see some benefit from packed SIMD processing. This section examines the
equations of the backpropagation algorithm with regard to their potential
implementation using SIMD instructions.
4.5.1.1

Nodes Input
The input to node i in layer L of the network is given by the equation:

$$a_i^L = \sum_{j=0}^{N} w_{ij} x_j$$

This operation basically equates to the summed product of two arrays, and
can be more generally defined as

$$a = \sum_{i}^{N} B_i C_i$$

where
a is a scalar,
and B and C are vectors of length N.

To perform this calculation sequentially requires N multiply and N addition
operations. Using SIMD instructions just over N/P operations of each are required,
provided the summation is partially calculated using packed operations (Table 4.3).

Activation Function

Mathematical

    $a = \sum_{i}^{N} B_i C_i$

Sequential                          SIMD

a ← 0                               A ← 0                    (packed)
Σ(i, N) {                           Σ(i, N/P) {
    t ← B[i] * C[i]                     T ← B[i] * C[i]      (packed)
    a ← a + t                           A ← A + T            (packed)
}                                   }
                                    a ← 0
                                    Σ(i, P) { a ← a + A[i] }

Operations Per Node:

2N + 1                              2N/P + P + 2

Table 4.3 Comparison of sequential and SIMD calculation of node activation.

In order to avail of the packed multiply instruction B and C must be arrays of
contiguous data. This implies that the weight matrix is in column order, and that the
outputs of the previous layer are stored in an array. Thus the input for each node in a
layer may be calculated mostly in parallel, and this calculation repeated for each
successive node in the layer.
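The partially packed summation can be sketched with SSE intrinsics as follows. The function name node_input_sse is hypothetical, and the sketch assumes that n is a multiple of 4 (arrays padded with zeros where necessary) and that b and c each hold n contiguous floats.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Returns the sum of b[i] * c[i] over n elements (n a multiple of 4). */
float node_input_sse(const float *b, const float *c, size_t n)
{
    __m128 acc = _mm_setzero_ps();              /* packed accumulator, A = 0 */

    for (size_t i = 0; i < n; i += 4) {
        __m128 t = _mm_mul_ps(_mm_loadu_ps(b + i), _mm_loadu_ps(c + i));
        acc = _mm_add_ps(acc, t);               /* four partial sums at once */
    }

    /* Final step: add the P = 4 partial sums with scalar additions. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);

    float a = 0.0f;
    for (int k = 0; k < 4; k++)
        a += lanes[k];
    return a;
}
```

The loop body performs one packed multiply and one packed add per four elements, and the closing scalar loop accounts for the extra P additions in the operation count.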

[Figure 4.10: the activation values of the previous layer stored as an array, alongside
the weights for the layer shown in row order and in column order.]

Figure 4.10 Suitable structures for SIMD calculation of node activation.

4.5.1.2

Output Values
A node's output is calculated based on a single value (its activation value) and
so cannot benefit from SIMD processing. However, provided that all nodes in the layer
employ the same transfer function, it may be possible to calculate the layer in parallel,
at least in part. Take for example the logistic transfer function:

$$f(a) = \frac{1}{1 + e^{-a}}$$

Intel provides no SIMD instruction to calculate exponentials, so each
$e^{-a}$ calculation must be performed sequentially; however the remaining calculations
(addition and reciprocal^) can be performed in parallel. Note that Intel provides software
emulation of many packed mathematical functions. If the operations required by the
transfer function are included in this set then the entire function may be calculated
using SIMD operations, at least in software.

^ Intel provides a packed reciprocal instruction. This is faster than the packed divide
instruction but uses an approximation which is accurate to 11 bits of precision. If
greater accuracy is required the Newton-Raphson method can be used to increase it.

Transfer Function

Mathematical

    $\frac{1}{1 + e^{-a}}$

Sequential                          SIMD

Σ(i, N) {                           Σ(i, N/P) { T ← 0 - A[i] }       (packed)
    t ← 0 - a                       Σ(i, N)   { t ← pow(e, t) }
    t ← pow(e, t)                   Σ(i, N/P) {
    t ← t + 1                           T ← T + 1                    (packed)
    t ← 1 / t                           T ← 1 / T                    (packed)
}                                   }

Operations Per Layer:

4N                                  3N/P + N

Table 4.4 Comparison of sequential and SIMD calculation of layer transfer function

4.5.1.3 Output Layer Error
Error values for output layer nodes are calculated as $\delta_j = (T_j - O_j) f'()$. The
term $f'()$ is the derivative of the transfer function used by that node. Depending on the
operations required, the calculation of this derivative might require serial processing.
Again assuming the use of the standard logistic function, the equation becomes
$\delta_j = (T_j - O_j) O_j (1 - O_j)$. All terms in this equation are scalar values, thus
the error for all output layer nodes may be calculated in parallel using SIMD instructions
simply by vectorising the equation.

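Output Layer Error

Mathematical

    $\delta_j = (T_j - O_j) O_j (1 - O_j)$

Sequential                          SIMD

Σ(i, N) {                           Σ(i, N/P) {
    t0 ← D[i] - O[i]                    T0 ← D[i] - O[i]     (packed)
    t0 ← t0 * O[i]                      T0 ← T0 * O[i]       (packed)
    t1 ← 1 - O[i]                       T1 ← 1 - O[i]        (packed)
    δ[i] ← t0 * t1                      δ[i] ← T0 * T1       (packed)
}                                   }

Operations Per Layer:

4N                                  4N/P

Table 4.5 Comparison of sequential and SIMD calculation of output layer errors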
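A vectorised form of the output layer error for the logistic function might look as follows. It is a sketch under stated assumptions: the function and parameter names are hypothetical, n is taken to be a multiple of 4, and the arrays hold the desired outputs, the actual outputs and the resulting errors for one layer.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* delta[i] = (target[i] - out[i]) * out[i] * (1 - out[i]), n a multiple of 4. */
void output_error_sse(float *delta, const float *target,
                      const float *out, size_t n)
{
    const __m128 one = _mm_set1_ps(1.0f);

    for (size_t i = 0; i < n; i += 4) {
        __m128 o  = _mm_loadu_ps(out + i);
        __m128 t0 = _mm_sub_ps(_mm_loadu_ps(target + i), o);  /* D - O       */
        t0 = _mm_mul_ps(t0, o);                               /* (D - O) * O */
        __m128 t1 = _mm_sub_ps(one, o);                       /* 1 - O       */
        _mm_storeu_ps(delta + i, _mm_mul_ps(t0, t1));
    }
}
```

Each iteration issues four packed arithmetic operations for four nodes, in line with the 4N/P count for the SIMD version.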

4.5.1.4

Hidden Layer Error
Things are not so simple when calculating the error of hidden nodes. The
equation

$$\delta_i = O_i (1 - O_i) \sum_{j=0} \delta_j w_{ij}$$

includes both scalar and vector terms. The vector part must be performed
separately for each node in the layer. However it produces a scalar result, allowing the
remainder of the calculation to be performed in parallel across all nodes of the layer^.

The vector factor in the equation has the form

$$a = \sum_{i} B_i C_i$$

and at first glance this appears to be a simple array multiplication and
summation similar to the calculation of a node's activation value. The problem is that C
is not a contiguous array of elements: for C to be an array the weight matrix would have
to be in row order, as opposed to the column order required by the activation function.
This would appear to suggest that the weight matrix must be logically transposed to
perform the operation.
[Figure: the previous layer's error values and the weights for the previous layer, shown
in row order and in column order, producing the current layer's error values.]

Table 4.6 Suitable structures for SIMD calculation of hidden node error

^ Note the use of $O_i (1 - O_i)$ implies that the standard logistic function is used.

However this is not the case. A backpropagation layer is highly parallel: logically,
errors are calculated for all neurons within a layer simultaneously. As a result there is
no need to transpose the weight matrix if the error is calculated in parallel across the
layer.

Hidden Layer Error

Mathematical

    $\delta_i = O_i (1 - O_i) \, f()$,  where  $f() = \sum_{j} \delta_j w_{ij}$

Scalar Calculation

Sequential                          SIMD

Σ(i, N) {                           Σ(i, N/P) {
    t ← 1 - O[i]                        T ← 1 - O[i]         (packed)
    t ← t * O[i]                        T ← T * O[i]         (packed)
    δ[i] ← t * f()                      δ[i] ← T * f()       (packed)
}                                   }

Operations Per Layer:

3N                                  [not legible in the source]

Vector Calculation - f()

Sequential                          SIMD

a ← 0                               Σ(i, N) {
Σ(i, N) {                               T1 ← Bcast(B[i])
    t ← B[i] * C[i + k]                 Σ(k, N/P) {
    a ← a + t                               T2 ← T1 * C[k]       (packed)
}                                           A[k] ← A[k] + T2     (packed)
                                        }
                                    }

Operations Per Node:

2N_{l+1} + 1                        *

Operations Per Layer:

[not legible in the source]

* Not applicable as calculation is performed across the layer.
Table 4.7 Comparison of sequential and SIMD calculation of hidden layer errors
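The idea of calculating the hidden layer error across the layer, without transposing the weight matrix, can be sketched as follows. The layout is an assumption made for illustration: row j of w holds the n_cur weights that feed node j of the following layer, so the same storage serves the forward pass. The names are hypothetical and n_cur is taken to be a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* delta[k] = out[k] * (1 - out[k]) * sum_j delta_next[j] * w[j * n_cur + k] */
void hidden_error_sse(float *delta, const float *out, const float *delta_next,
                      const float *w, size_t n_cur, size_t n_next)
{
    for (size_t k = 0; k < n_cur; k += 4)
        _mm_storeu_ps(delta + k, _mm_setzero_ps());

    /* Broadcast one error value of the following layer and sweep along
       its weight row, accumulating into every node of this layer. */
    for (size_t j = 0; j < n_next; j++) {
        __m128 d = _mm_set1_ps(delta_next[j]);
        const float *row = w + j * n_cur;
        for (size_t k = 0; k < n_cur; k += 4) {
            __m128 acc = _mm_loadu_ps(delta + k);
            acc = _mm_add_ps(acc, _mm_mul_ps(d, _mm_loadu_ps(row + k)));
            _mm_storeu_ps(delta + k, acc);
        }
    }

    /* Scalar factor O(1 - O), also applied four nodes at a time. */
    const __m128 one = _mm_set1_ps(1.0f);
    for (size_t k = 0; k < n_cur; k += 4) {
        __m128 o = _mm_loadu_ps(out + k);
        __m128 s = _mm_loadu_ps(delta + k);
        _mm_storeu_ps(delta + k, _mm_mul_ps(_mm_mul_ps(o, _mm_sub_ps(one, o)), s));
    }
}
```

Every memory access in the inner loop is to contiguous elements of a row, which is why no transposition is needed.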

4.5.1.5

Changing Weights
The change to a weight is calculated as $\Delta w_{ij} = \eta \delta_j O_i$, or
$\Delta w_{ij}(n + 1) = \eta \delta_j O_i + \alpha \Delta w_{ij}(n)$ if the momentum term
is used. Both these equations contain entirely scalar terms and thus require P times
fewer instructions using SIMD. Note that depending on the update rule employed, weight
updates may be either performed directly or calculated for a later update, but this does
not affect the use of SIMD operations.
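A packed weight update with momentum could be sketched as below for the weights feeding a single node. This is an assumption-laden illustration, not the routine used in the study: the names are hypothetical, the weights are updated directly rather than stored for a later update, and n is taken to be a multiple of 4.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* For the n weights feeding one node j:
     dw[i] = eta * delta_j * out_prev[i] + alpha * dw[i]
     w[i]  = w[i] + dw[i]                                   */
void update_weights_sse(float *w, float *dw, const float *out_prev,
                        float delta_j, float eta, float alpha, size_t n)
{
    const __m128 scale = _mm_set1_ps(eta * delta_j);  /* same for all weights */
    const __m128 mom   = _mm_set1_ps(alpha);

    for (size_t i = 0; i < n; i += 4) {
        __m128 step = _mm_mul_ps(scale, _mm_loadu_ps(out_prev + i));
        step = _mm_add_ps(step, _mm_mul_ps(mom, _mm_loadu_ps(dw + i)));
        _mm_storeu_ps(dw + i, step);                  /* kept for next step */
        _mm_storeu_ps(w + i, _mm_add_ps(_mm_loadu_ps(w + i), step));
    }
}
```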

Calculation of Weight Updates

Mathematical

    $\Delta w_{ij} = \Delta w_{ij} + \eta \delta_j O_i$

Operations Per Weight:

    Sequential: $2 N N_{l-1} + N_{l-1}$

Updating Weights

Mathematical

    $w_{ij} = w_{ij} + \Delta w_{ij}$

Operations Per Weight:

    Sequential: $N N_{l-1}$          SIMD: $N N_{l-1} / P$

[The sequential and SIMD pseudocode columns of this table are not legible in the source.]

Table 4.8 Comparison of sequential and SIMD weight operations.

4.5.1.6

Network Error
The network error cannot be calculated using SIMD operations, as it is a
normalised sum of temporal values. However these temporal values, the MSE for each
training pattern, include vector operations and may be partially converted to SIMD.

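Mean Square Error

[The mathematical definition and most of the pseudocode in this table are not legible in
the source. The sequential version initialises an error total to 0; the SIMD version
initialises a packed accumulator to 0 and adds packed partial results to it.]

Operations Per Pattern:

    Sequential: 2N + 1               SIMD: 2N/P + P + 1

Table 4.9 Comparison of sequential and SIMD operations for calculating the MSE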

4.5.2

Implementation methods for Constituent Equations
The previous section discussed the use of packed SIMD instructions for each of

the basic backpropagation equations. Table 4.10 compares the SIMD and sequential
versions of some of these routines over increasing values of N. It would appear that all
operations would benefit from SIMD processing.

N                       4    8   12   16   32   60  100  140  200   400   800

Activation Function
  Sequential 2N+1       9   17   25   33   65  121  201  281  401   801  1601
  SIMD 2N/P+P+2         8   10   12   14   22   36   56   76  106   206   406

Transfer Function
  Sequential 4N        16   32   48   64  128  240  400  560  800  1600  3200
  SIMD 3N/P+N           7   14   21   28   56  105  175  245  350   700  1400

Output Layer Error
  Sequential 4N        16   32   48   64  128  240  400  560  800  1600  3200
  SIMD 4N/P             4    8   12   16   32   60  100  140  200   400   800

Table 4.10 Comparison of SIMD V's Serial instructions over increasing sizes of N.

^ Where P is the number of processing elements.

As the backpropagation algorithm is increasingly being applied to larger
networks and training sets, the size of N in each of the equations above increases
dramatically, allowing the potential speedup of SIMD operations to grow linearly.
Furthermore many of these calculations have to be repeated for every node in the
network, and all have to be performed at least once per training pattern. So even in cases
where N is small, the minor performance gains achieved through SIMD processing will
be amplified over several thousand training epochs to noticeable effect.
However it has to be pointed out that the above comparisons are relatively crude:
they only consider the actual operation being performed and ignore any auxiliary
instructions required to carry it out. Take for example the multiplication of two arrays
A = B * C. Auxiliary instructions here include the movement of data to and from
registers and incrementing the array index counter. Each of these instructions will have
to be performed fewer times using SIMD processing, as shown in the following code
sections:

// SSE loop (100 iterations)
    MOV ECX, 100                // LOOP counts down in ECX
SSE_LOOP:
    MOVAPS XMM0, [EBX + EDI]
    MOVAPS XMM1, [EAX + EDI]
    MULPS  XMM0, XMM1
    MOVAPS [ESI + EDI], XMM0
    ADD EDI, 16                 // Bytes
    LOOP SSE_LOOP

// Serial loop (400 iterations)
    MOV ECX, 400                // LOOP counts down in ECX
SERIAL_LOOP:
    FLD  [ESI + EDI]
    FLD  [EAX + EDI]
    FMUL
    FSTP [EBX + EDI]
    ADD EDI, 4                  // Bytes
    LOOP SERIAL_LOOP

With this in mind the set of basic SIMD operations required to implement
backpropagation was compiled from the algorithms presented in the previous section.
These primitive packed operations are outlined in Table 4.11, and were used as a basis
to compare the different implementation options. Routines to perform each primitive on
arrays of size N were written in assembly code, C, intrinsics and classes.
The number of clock cycles required for each routine was recorded as the size
of N increased from 1 to 1360^. The C code was compiled for serial execution and
SIMD execution; the same unaltered code was used in both versions. In order to draw a
fair comparison all data was 16-byte aligned when declared. All code was generated for
execution on the Microsoft Windows operating system, using the Visual Studio IDE.
Where suitable both the Microsoft and Intel compilers were used to generate the
code.

Operation     Name          Description

A ← b         Set Scalar    Assign all elements of an array the same scalar value.
A ← B         Set Packed    Copying an array.
A ← -B        Negate        Negating an array.
A ← B++       Plus Plus     Adding one to all elements of an array.
A ← 1/B       Reciprocal    Calculating the reciprocal for each element of an array.
A ← B + C     Plus          Adding two arrays.
A ← B - C     Minus         Subtracting two arrays.
A ← B * C     Multiply      Multiplying two arrays.

Table 4.11 Basic SIMD operations required for implementing backpropagation.

These tests revealed a number of points:
^ This size was chosen in order to keep the contents of the arrays in the L1 cache and
eliminate any effects caused by caching.

•  The Intel Compiler Produced the Fastest Executables
For most operations C code generated using the Intel compiler 7.0 outperformed that
of the Microsoft Visual C++ Compiler 6.0, indicating that the Intel compiler is better
at optimising generated code^. However for code written using either classes or
intrinsics the improvement is less pronounced, with the Microsoft compiler performing
better for more operations.

[Graph 4-3: Visual C++/Intel Speedups (C code); series: Add, Divide, Copy Scalar,
Copy Packed, Multiply, Negate, Plus Plus, Subtract.]

Graph 4-3 Comparison of Code generated by Intel and Microsoft Compilers. The graph shows the
speedup of Visual C++ versus Intel C compiled code. For example the green line shows that
Intel compiled code is over 2 times faster than Visual C++ compiled code for the Plus Plus
operation.

•  Code written in Intrinsics was faster than other Implementation Methods
For many operations there was little difference between the implementation
methods. However for any operations that did differ, code written in intrinsics produced
faster executables than the other two methods. Note that assembly language routines
were not included in this comparison. This is due to the difficulty in optimising
instruction scheduling for the microarchitecture of the Pentium III. A number of
operations were however selected for hand optimisation and the resulting assembly code
outperformed intrinsics in every case.

^ The exceptions being the assign and copy operations.

[Graph 4-4: Comparison of Implementation Methods. Clock cycles against Size of Array;
series: Classes, Intrinsics, Automatic Vectorisation.]

Graph 4-4 Comparison of SIMD implementation methods. The graph shows the number of clock
cycles required to perform the same operation on arrays of increasing size using classes,
intrinsics and automatic vectorisation. In this graph code written using intrinsics
performed the best and code written using the class interface performed the worst.

•  Memory Access Limits SIMD Speedup
For most arithmetic operations (plus, multiply, minus, plus plus, negate) the SSE
code approached a speedup of 3 in comparison to the Microsoft compiler, but only a
speedup of 2 in comparison to the Intel compiler. As the Intel compiler's non-SSE code
is also faster than the Microsoft compiler's, it would appear that the true speedup from
using SSE operations is 2. However this can be explained by the fact that only 32 bits
can be read from memory at a time, thus SIMD move instructions are only twice as fast as
standard move instructions. This would suggest that in order to achieve the maximum
performance registers should be reused as much as possible.
[Graph 4-5: speedup (0 to 4) against size of array; series: Intel / Intel (SSE),
VC / Intel (SSE), VC / Intel.]

Graph 4-5 Speedups for Arithmetic Operations (C code). For non-SSE code the Intel compiler
produced code 1.5 times faster than the Visual C++ compiler (yellow); however when the Intel
compiler generated SSE-enabled code it was almost 3 times faster than the Microsoft compiler
(purple). The speedup obtained from the Intel compiler's SSE-enabled code compared with
non-SSE-enabled code was only 2 (blue).

•  Reciprocal is faster than Divide
When performing the 1/(1 + exp(-a)) operation using SIMD instructions either the
divide (DIVPS) or reciprocal (RCPPS) instruction can be used. The reciprocal
instruction is up to four times faster than divide, but uses an approximation and is less
precise. The accuracy of the result can be improved by applying the Newton-Raphson
approximation algorithm:

Result = 2 * RCPPS(x) - x * RCPPS(x) * RCPPS(x)

While this is slower than reciprocal it is still faster than divide, and could possibly
be used instead, especially in the earlier stages of training when the network is less
sensitive.
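The refinement step above maps directly onto SSE intrinsics. The following is a minimal sketch in which the function name rcp_nr_ps is hypothetical; _mm_rcp_ps is the intrinsic corresponding to the RCPPS instruction.

```c
#include <xmmintrin.h>

/* Packed 1/x: RCPPS estimate followed by one Newton-Raphson step,
   result = 2 * r - x * r * r, where r = RCPPS(x). */
__m128 rcp_nr_ps(__m128 x)
{
    __m128 r     = _mm_rcp_ps(x);                    /* fast approximation */
    __m128 two_r = _mm_add_ps(r, r);                 /* 2 * r              */
    __m128 xr2   = _mm_mul_ps(x, _mm_mul_ps(r, r));  /* x * r * r          */
    return _mm_sub_ps(two_r, xr2);
}
```

One refinement step costs an add, two multiplies and a subtract on top of the estimate, which is the extra work that places it between the plain reciprocal and the divide in cost.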
Method
Reciprocal
Reciprocal(Newton-Raphson)
Divide

Precision
11-bits
22-bits
32-bits

Clock Cycles
1
13

Table 4.12 Precision and clock cycles for 1/x operations

[Graph 4-6: Reciprocal Vs Divide. Clock cycles (0 to 6000) against Size of Array;
series: Rcp, RcpNR, Div.]

Graph 4-6 Reciprocal Vs Divide

•

Optimisation strategies vary depending on the array size
The results presented here represent the general trend that emerged during

testing. However it should be pointed out that for some operations the results varied
depending on the size of the array. This can be demonstrated by re-examining the
results for reciprocal, while the general trend shows that the Newton-Raphson
approximation method is faster than divide, the opposite is true for arrays of 60
elements or less. This would suggest that the size of arrays being processed be
considered when designing the code.

[Graph 4-7: Reciprocal Vs Divide (Small Arrays); series: Rcp, RcpNR, Div.]

Graph 4-7 Reciprocal Vs Divide (Small Arrays)

4.6

Implementation of Backpropagation Algorithm
The program was restructured to implement the algorithms outlined in Section
4.4. Wherever possible the routines were combined in an attempt to reuse values in
registers and improve the speedup obtained. However when this code was compiled
using the Intel compiler for automatic vectorisation, only one loop was vectorised and
the resulting code performed little better than standard C code. This was due in part to
the dynamic nature of the code. All structures were generated dynamically and
referenced using pointers, although for loops and array notation were used.

Of the other three methods (Intrinsics, Class Interface, or Assembly language),
Intrinsics were chosen as the implementation method as they offered a potentially more
optimal solution than the other two^.
The resulting program was compared with non-SIMD code and speedups
obtained for a number of network sizes. While an ideal speedup of 4 was expected for
these tests (due to the fact that four 32-bit operations can be performed at once using
128-bit SIMD registers), a number of results achieved superlinear speedups in excess of
the ideal. This is due to the fact that the SSE code required extensive alterations to be
made to the memory structures employed by the system, in line with the considerations
outlined in Section 4.3.4. From these tests the following conclusions can be drawn.

•  Larger sized networks produced the greatest speedups

In general the larger the network size the greater the speedup obtained. This is
demonstrated in Graph 4-8, where network sizes are presented in terms of the number of
input, hidden and output neurons.

[Graph 4-8: speedup (0 to 6) for the networks tested, including 240-120-24 and 24-80-4.]

Graph 4-8 Relationship between network size and speedup. The graph shows that in general larger
networks produced a greater speedup than smaller networks.

^ While assembly could be optimised more, it is more difficult to program and less portable
due to the superscalar architecture employed by Intel machines.

•  Padded networks performed worse than unpadded networks

When the number of columns in the weight matrix was not an even multiple of four
a number of padded columns were required in order to keep the start of each row
aligned on a 16-byte boundary. As a result networks that required padding suffered a
penalty in comparison to similar sized networks that didn't require any padding, but for
larger networks this effect is less noticeable.

[Graph 4-9: speedup (0 to 6) for the networks tested: 251-121-25, 240-120-24, 241-121-25,
24-80-1, 24-80-4.]

Graph 4-9 Effect of padding on speedups. The graph shows that networks without padding
performed better than slightly larger networks with padding. Padded networks are shown in red
and unpadded networks are shown in blue.

However there are other factors affecting the use of padding and not just the raw
network size. What appears to have most effect is the number of padded elements in the
matrix. The more rows the matrix contains the more padded elements in the matrix and
thus the more wasted instructions^. For certain network configurations these wasted
instructions can actually result in poorer overall performance in comparison to the serial
version of the code.

^ Note it might be possible to store the matrix in whichever order reduces the amount of
padded elements, and allow the application to reverse the logic for the forward and
backward passes.
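The relationship between the number of rows and the number of padded elements can be made concrete with a small calculation. The sketch below is illustrative: the function names are hypothetical, and it assumes one matrix row per node with each row padded to a multiple of four floats so that every row starts on a 16-byte boundary.

```c
#include <stddef.h>

/* Row length in floats after padding to a multiple of four. */
size_t row_stride(size_t cols)
{
    return (cols + 3) / 4 * 4;
}

/* Total number of padded (wasted) elements in a rows x cols matrix. */
size_t padded_elements(size_t rows, size_t cols)
{
    return rows * (row_stride(cols) - cols);
}
```

Under these assumptions a 121 x 241 weight matrix carries 121 * 3 = 363 padded elements, whereas a 120 x 240 matrix carries none, so the waste grows with the number of rows rather than with the size of a single row.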

4.7

Conclusion
This chapter investigated the use of Intel's SSE extension to speed up
backpropagation training on a single desktop computer. The SSE extension provides
single-precision floating-point SIMD capabilities to the Pentium III and higher range of
processors. Intel computers were deemed suitable for the study, as they are the most
common type of desktop computer in use today. The SSE extension allows a limited
number of arithmetic instructions to be performed on four packed elements in one of the
8 XMM registers.

SSE instructions can be issued in one of four ways:

Assembly Language
Offers the best performance; however it is difficult to program and optimise, and
optimised solutions may be less portable than other methods.
Compiler Intrinsics
Second most optimal and relatively easy to program with; however data structures
may require alteration to incorporate the __m128 data type.
Class Interface
Less optimal than compiler intrinsics but may be more convenient in a C++
environment.
Automatic Vectorisation
Easiest to implement; code is unaltered and thus may be ported to other
architectures. However this method is the least optimal as the compiler may miss
many potential SIMD operations.

Of these methods compiler intrinsics was considered the best implementation
method. An SSE-enabled version of the program was written using compiler intrinsics
and speedups were calculated for a number of network sizes. The results show that for
most network configurations there is some benefit to be gained from using SIMD
instructions.

Chapter 5

Cluster Computing for Backpropagation

Cluster Computing

5.1

Introduction
The aim of this chapter is to investigate cluster-computing methods for speeding
up backpropagation training times. Cluster computing is the use of two or more
independent machines as a single computer. Perhaps the biggest advantage of
cluster computing is that it offers the possibility of creating enough computing power to
meet the requirements of almost any project at an affordable price. The price of a large
supercomputing cluster is an order of magnitude lower than that of a purpose-built
parallel computer of the same capacity.
However cluster computing is not just limited to large scale computing projects;
a modest size cluster of just a few machines can easily be constructed to serve the needs
of backpropagation research. This chapter examines such an approach.

5.2

Cluster Computing
Cluster computing creates a virtual parallel MIMD computer, using a network of
commodity machines as constituent components. Each machine in the cluster, referred
to as a node, runs its own operating system and is capable of independent operation. In
such an environment a parallel program consists of a number of sub-programs, referred
to as tasks, running on the different nodes of the system. In order to co-operate
collectively as a virtual computer each node in the cluster must communicate over the
network, typically using some form of message passing.

Message passing is usually provided through the use of some middleware
software, which forms an interface between the tasks running on the cluster and the
operating systems of the constituent nodes. The use of middleware software also
permits a cluster to be constructed using nodes of different operating systems and/or
processors; they may even be SIMD or MIMD machines. Such clusters are referred to
as heterogeneous clusters.
5.3

Message Passing Systems
Message passing is a mechanism by which processors in a parallel system can

communicate and synchronise with each other, without the use of a shared block of
memory. Message passing systems are not limited to cluster computers; they are also
employed in many commercial MIMD and SIMD machines. In a cluster computing
environment message passing is achieved through the use of specific libraries. These
libraries support parallel programming by providing low-level communication routines,
as well as routines for configuring and controlling the cluster.

Message passing systems employ two distinct classes of communication
routines: point-to-point and group. Point-to-point communication allows for direct
communication between two tasks in the system, whereas group communication
provides a means of collective communication within a sub-set of tasks in the system.

5.3.1 Point-to-Point Primitives

Point-to-point communication is performed through the use of SEND and
RECEIVE primitives. These primitives can be classified as synchronous, blocking or
non-blocking.

• Synchronous Primitives

Synchronous primitives overlap communication with synchronisation,
and ensure that the message is delivered before either process continues. This
requires some form of acknowledgement protocol between the sending and
receiving tasks. Such operations are referred to as rendezvous.

• Blocking Primitives

Blocking primitives return control to the calling process as soon as the
buffers used can be safely overwritten. For example, a blocking send will
suspend the calling task until the message has been transmitted and the send
buffer is cleared, regardless of whether or not the receiving task is ready to
accept the message. A blocking receive suspends the calling task until the
message arrives in the receive buffer.

• Non-Blocking Primitives

Non-blocking primitives return control to the calling process
immediately. In the case of a send operation this is as soon as the message is
copied to the send buffer, but prior to the message being transmitted. A
non-blocking receive either returns with the message or a code indicating that the
message has yet to arrive.
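
The behaviour of the non-blocking primitives can be modelled inside a single process. The following is a minimal sketch only: the `Mailbox` type and the `mailbox_try_send` and `mailbox_try_recv` names are hypothetical stand-ins for a one-message buffer and are not part of any message-passing library. It shows a send that returns as soon as the message is copied into the buffer, and a receive that returns either the message or a "not yet arrived" code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A one-slot mailbox standing in for a message buffer (hypothetical model). */
typedef struct {
    bool full;   /* true while an undelivered message is waiting */
    int  value;  /* the message payload */
} Mailbox;

/* Non-blocking send: copy the message into the buffer and return at once.
   Returns false, without waiting, if the slot is still occupied. */
bool mailbox_try_send(Mailbox *mb, int value)
{
    assert(mb != NULL);
    if (mb->full)
        return false;
    mb->value = value;
    mb->full = true;
    return true;
}

/* Non-blocking receive: return true and the message if one has arrived,
   otherwise return false immediately instead of suspending the caller. */
bool mailbox_try_recv(Mailbox *mb, int *out)
{
    assert(mb != NULL && out != NULL);
    if (!mb->full)
        return false;
    *out = mb->value;
    mb->full = false;
    return true;
}
```

A blocking receive would correspond to calling `mailbox_try_recv` repeatedly, or sleeping, until it returns true; the non-blocking form leaves that decision to the caller, which can carry on with other work in the meantime.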

5.3.2 Group Primitives

The SEND and RECEIVE primitives provide an efficient way for two tasks to
communicate and synchronise. But in many parallel applications these operations are
more appropriately defined between groups of tasks. Message passing systems provide
a range of routines to cater for such circumstances. These include:
• Broadcast / Multicast

Both of these operations transmit a single message from the source
task, which issued the call, to a set of target tasks. In the case of broadcast
the message is transmitted to all members of the group, whereas with
multicast a set of target tasks must be specified. In both cases target tasks
accept the message by issuing a standard receive routine.

• Scatter

Broadcast and multicast routines transmit an identical message to all
target tasks, but many applications require that the source task
distribute different data among a number of targets. Typically this data is stored
in arrays, with each target task being assigned a separate portion of the array to
process. In such circumstances the scatter routine is used to dissect the array
and send the resulting portions to each target task.

• Gather

The gather operation is the opposite of scatter. In this case there are a
number of source tasks and only one target. Each source task sends its data to
the target using a standard send primitive, and the target receives all of these
messages using a single gather call. The gather routine unpacks the contents of
each message and stores them in an array, ordered by source task.

• Reduce

A reduce operation is much like a gather, but instead of compiling
the received data into an array, reduce performs some operation on the
individual data items received. Typical predefined operations include min,
max, average and addition. Reduce also supports user-defined operations.

• Barrier

Synchronisation between groups of tasks is typically performed via
the barrier operation. A call to barrier suspends the calling task until such
time as all the tasks in the group have issued a barrier call.
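
The scatter and reduce operations described above can be modelled in a single process as plain array manipulation. The sketch below is an illustration only, assuming that tasks are identified by array indices; the names `scatter_counts` and `reduce_sum` are hypothetical and do not belong to any message-passing library.

```c
#include <assert.h>
#include <stddef.h>

/* Scatter: split n items among ntasks targets as evenly as possible.
   counts[t] receives the number of items for task t and offsets[t] the
   index of its first item. Hypothetical helper, not a library routine. */
void scatter_counts(size_t n, size_t ntasks, size_t *counts, size_t *offsets)
{
    assert(ntasks > 0);
    size_t base = n / ntasks;   /* every task gets at least this many */
    size_t extra = n % ntasks;  /* the first 'extra' tasks get one more */
    size_t next = 0;
    for (size_t t = 0; t < ntasks; t++) {
        counts[t] = base + (t < extra ? 1 : 0);
        offsets[t] = next;
        next += counts[t];
    }
}

/* Reduce (addition): combine one partial result per task into one value. */
double reduce_sum(const double *partial, size_t ntasks)
{
    double total = 0.0;
    for (size_t t = 0; t < ntasks; t++)
        total += partial[t];
    return total;
}
```

In a real message-passing system each portion would be sent to its target task and each partial result would arrive as a message; only the partitioning and combining logic is shown here.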

5.4 MPI and PVM

The two most popular message-passing libraries employed in cluster computing
are:

MPI (Message Passing Interface)
MPI is a standard defined for message passing systems. Its main aim is
to aid the development of portable parallel programs, by providing a standard
interface across all multi-processor message-passing systems. While it was
primarily intended for use on massively parallel processing (MPP) systems, it is now
commonly used in cluster computing environments as well. The first version of the
standard (MPI-1) was introduced in 1994, and was greatly extended with the
introduction of the second version (MPI-2) in 1997. A detailed overview of MPI-1
and MPI-2 can be found in [58] and [34] respectively.

PVM (Parallel Virtual Machine)
PVM is an integrated set of software tools and libraries designed to aid
the development of parallel systems running over a network of heterogeneous
computers. It was developed as a result of research carried out at the Oak Ridge
National Laboratory into the area of distributed heterogeneous computing, but has
since been implemented on many MPP systems. A complete description of the
PVM system is given in [35].

MPI and PVM have similar functionality and target domains; as a result the
choice of one over the other is very often decided by the application designer's
experience and preferences. They are nonetheless separate systems, and each has some
advantages and disadvantages in comparison with the other. While a comprehensive list
of differences can be found in [59] and [60], it is useful here to point out the main
differences between the two.

Perhaps the main difference between the two systems lies in their design
goals, and as a result in their underlying implementation. MPI is a
standard: it defines a set of functions, and requirements for how these functions should
operate. It is left up to the vendor to provide the specific implementation; moreover in
circumstances where the standard does not clearly specify how a feature should operate,
the vendor has complete freedom in its implementation.

For example, MPI does not specify how tasks should be assigned to processors,
so this may differ from one implementation to the next. As a result MPI programs are
fully portable, but may not be interoperable; that is, an MPI program may be compiled
and run on any supported architecture without code modification, but such programs
running in a heterogeneous environment may not be able to communicate with each
other.

PVM on the other hand is a software system which supports distributed
computing. PVM programs communicate with each other indirectly through the use of
demons running on each host. Provided both hosts run the same version of the PVM
demons, programs running on these hosts will be able to communicate and cooperate
when performing some parallel calculation. Thus PVM programs are both portable and
interoperable. However providing interoperability requires a certain amount of overhead
and as a result MPI programs are generally more efficient than PVM programs.

In general PVM is more dynamic than MPI: PVM permits a program to
reconfigure the cluster by adding or deleting nodes, to swap tasks between processors,
and to create and alter groups dynamically. MPI is more suited to static dedicated
environments, with MPI-1 providing none of dynamic reconfiguration, task swapping or
dynamic groups. While some of these issues are addressed by extensions added by the
MPI-2 standard, it is still less dynamic than PVM.
A final difference of note between the two systems is the range of features
provided by each. MPI has by far the richer set of features, providing blocking,
non-blocking and synchronous versions of the send primitive, whereas PVM provides a
blocking send only.

However, the main consideration for the work proposed in this thesis was the
ability to create tasks dynamically. Hence it was decided to use the PVM
message-passing libraries for this work, although there is a wide range of choices for
implementing parallel systems, each with its own characteristics.

5.5 Overview of the PVM System

PVM is a software system that allows the development of parallel applications
running on a cluster of heterogeneous machines. It is composed of three main
components.

• PVM Demons

Conceptually the PVM demons form a layer between the operating
system of the underlying nodes and the PVM tasks running on those nodes.
Logically the PVM tasks run under these demons, which handle all system
calls relating to the cluster or to other tasks running on it. The pvmd demon is
the main demon and is required by all nodes participating in the cluster, but a
group server is also required on nodes that utilise group operations.
• PVM Libraries

While the pvmd demons allow the creation of a cluster computer, it
is the PVM library that allows for the creation of PVM applications. A PVM
application consists of a number of tasks running across the nodes of the
virtual machine, cooperating to perform a parallel calculation. Each task is
simply a program that issues PVM function calls and is compiled and linked
with the PVM libraries. PVM tasks can be written using the C, C++ and
FORTRAN languages. C/C++ tasks are linked to the libpvm library;
FORTRAN tasks are linked with the libfpvm library. Tasks that utilise group
operations are also required to link to the PVM group library.

• Console Application

The final component of the PVM system is a console application,
which allows a user to set up, configure and halt a virtual machine. The
console application may be used at any time - prior to, after or during the
execution of a PVM application - to probe or alter the state of the virtual
machine. Note that while the console is a useful tool, its use is not required
to run a PVM application, as the PVM library contains all the functionality
of the console. All that is required to run a PVM application is the presence
of a pvmd demon on the initial host.

Virtual Machine

Key to the workings of PVM is the concept of a parallel virtual machine, which
is in essence a cluster computer configured to run PVM applications. But a virtual
machine is an abstraction of a cluster computer in that:

• it may be formed from a subset of machines in the cluster,
• a number of virtual machines may operate on the same cluster,
• and they may overlap, with two virtual machines utilising the same node.

In order to participate in the virtual machine a node must be correctly configured
and run the pvmd demon. This demon forms the core of the PVM system; all PVM calls
are directed towards the local pvmd demon, which if required forwards the call to
the pvmd of another node. (The pvm_setopt() call can be used to override this default
situation and allow two tasks to communicate with each other directly.) The use of
pvmd demons allows PVM to interoperate. Provided two nodes are running the same
version of pvmd, different implementations of PVM can participate in the same virtual
machine, as the pvmds of each implementation are guaranteed to use the same protocol.

A virtual machine is formed by one or more pvmds that logically link together.
These pvmds are organised in a master/slave fashion. A master pvmd is one that is
launched by hand; it creates a single-host virtual machine consisting of the node on
which it resides. All subsequent hosts are added to the virtual machine through the
master pvmd, and serve as slave pvmds. PVM allows a host to be added to the virtual
machine in one of two ways:

• from the console, using the add command;
• from any task running on the virtual machine, using the pvm_addhosts()
function call.

Regardless of which method is used the request is routed to the master pvmd,
which is then required to launch a program (the slave pvmd) on a remote host, implying
that the master pvmd - or rather the owner of it - must have permission to execute
commands on the nodes that form part of the virtual machine. PVM uses either rsh or
rexec() to start slave pvmds on remote hosts; rexec() requires the user to enter a
password for each host added, while rsh relies on a file-based trust system between the
two hosts. By applying the concept of users and ownership PVM permits virtual machines
to overlap, by allowing more than one pvmd to run on a host, provided that each pvmd is
owned by a different user. Note that while an owner is not permitted to run two virtual
machines on the one host, there is no limit to the number of applications running on a
virtual machine.

The master pvmd is solely responsible for the state of the virtual machine; all
requests to probe or alter its state are handled by the master. In the event of the master
pvmd failing, the virtual machine can no longer function, and all slave pvmds will
eventually time out. If a slave fails the master will detect it and update the state of the
virtual machine. In all other respects master and slave pvmds are considered equal.

Once established the virtual machine is used to run parallel applications,
consisting of a number of tasks running under the local pvmds of each host. Each task is
a fully compiled executable that communicates with other tasks via PVM library calls.
Part of the power of PVM is that it allows tasks to reconfigure the virtual machine
through the use of the pvm_addhosts() and pvm_delhosts() routines.

Each task in the virtual machine is assigned an integer task identifier (TID),
which is used by other tasks when communicating with it. TIDs are assigned by the local
pvmd without communication with the master and are unique across the virtual machine;
a portion of the TID is reserved to indicate the host on which the task is running. This
implies that TID values do not reflect the order in which tasks were spawned. The
pvm_mytid() and pvm_parent() calls return a task's TID and the TID of its parent
respectively.

There is no security within the virtual machine; any task can spawn or kill
another task at any time (using pvm_spawn() and pvm_kill() respectively). When
spawning tasks PVM uses its own load balancing algorithm to assign the task to a host,
if one is not specified in the function call. PVM supports both the SPMD and MPMD
models of parallel programming.

Groups in PVM are not handled by the pvmds; instead a separate group server is
launched if a task on that node issues a group call. Apart from group maintenance
functions, PVM does not provide a rich set of group functions: barrier, broadcast
and reduce are the only group operations defined. However, these are sufficient for many
parallel calculations involving groups. PVM groups are both dynamic and flexible;
tasks may

• be members of several groups,
• join or leave a group at any time,
• perform group functions for groups they are not members of.

PVM does not provide fault tolerance automatically, but rather provides the
pvm_notify() routine to allow a task to detect both host and task failure. It is up to the
application programmer to implement fault tolerance and recovery around this
routine. pvm_notify() can also be used to notify a task when hosts are added to the
virtual machine.

PVM supports heterogeneous computing; different versions of each task may be
compiled for each supported architecture, and the appropriate version selected when
spawning a task on any given node. But in a heterogeneous environment nodes may also
employ different data encoding formats (little endian or big endian). Data packed into a
message on one node may be unreadable when unpacked by another. By default PVM
converts all messages to XDR encoding when packing and back to native encoding
when unpacking. In cases where the communicating nodes use the same data format this
overhead can be avoided by selecting the PvmDataRaw option when the message
buffer is created.
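
The byte-order problem described above can be illustrated without any PVM calls. XDR represents integers with the most significant byte first, so a sender and receiver that both convert to and from that order will agree regardless of their native formats. The sketch below shows only this byte-order step for a 32-bit value; the function names are hypothetical and it is not PVM's XDR implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode a 32-bit value into 4 bytes, most significant byte first
   (big-endian order), independent of the host's native byte order. */
void encode_u32_be(uint32_t v, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

/* Decode 4 big-endian bytes back into a native 32-bit value. */
uint32_t decode_u32_be(const unsigned char in[4])
{
    assert(in != NULL);
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Because the shifts operate on values rather than on memory layout, the same code produces the same byte sequence on little-endian and big-endian hosts. Skipping this conversion, as the raw option does, saves the work but is only safe when both ends share a format.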

In order for PVM to run correctly, a number of operating system environment
variables must be set to appropriate values. Some of the most important of these are
outlined below.

PVM_ROOT

Defines the location of the directory that contains the installation of PVM; it is
required by both the console and pvmds on remote hosts when launching the local
pvmd.

PVM_ARCH

Defines the machine/operating system pairing of the local host; in a
heterogeneous environment it is used to select the correct executable when launching a
task on the host. It is also used to select a host when a task requires a specific
architecture to run on.

PVM_DPATH

Points to the location of task executables on the local host; when a task is
spawned on the local host the executable for that task is assumed to be located in the
PVM_ARCH sub-directory of PVM_DPATH. It is possible to avoid copying
executables to every host in the cluster by setting PVM_DPATH - for each host - to
point to some shared file system, such as NFS. In a heterogeneous cluster a separate
executable must be compiled for each supported architecture and stored in a
sub-directory named after the corresponding PVM_ARCH value.

PVM_TMP

Points to the location of the directory used to store temporary files required by the
pvmd demon when in operation. All temporary files for virtual machines running on the
host are stored in the PVM_TMP directory; each file name carries a <uid> suffix to
indicate the owner of the virtual machine to which it refers. The main files of note are
the pvml.* log files. In the event of a crash or host failure, leftover pvml.* files may
prevent the owner from starting a pvmd on the host; in such a case they can simply be
deleted.
5.6 Cluster Computing and Backpropagation

When operating over a network of independent machines, the efficiency of
message passing systems is severely limited by the latency of the underlying network.
As a result only algorithms with a high ratio of calculation to communication can be
efficiently implemented using such systems. With specific regard to the
backpropagation algorithm this means that training session parallelism and training set
parallelism are the only suitable approaches for parallelising the algorithm on such
systems.

5.6.1 Training Session Parallelism

Training session parallelism involves simultaneously training a number of
independent networks to perform the same task, and selecting the best one. Each
network trained will differ from the others in terms of hidden layer nodes, initial weight
values, and perhaps even inputs and outputs. Training session parallelism is efficient to
implement in a cluster-computing environment, as each machine in the cluster trains a
complete network, and requires no communication with any other machine during the
training process.

Training session parallelism can be implemented using the master/slave model.
In applying this approach a single master task maintains a pool of network
configurations to train, which it distributes to slave tasks. Each slave task then trains
and evaluates its assigned network, before returning the resulting evaluation to the
master.

Once a network is trained and evaluated the slave submits the resulting
evaluation (network error) to the master, and waits for a reply. The master replies by
either accepting or rejecting the trained network. If the network is accepted the final set
of weights is returned to the master. In either case the slave then terminates.

One problem with this solution is that a slave task terminates after training a
single network. But in cases where the pool of network definitions exceeds the available
nodes in the cluster, a new slave task may need to be spawned as soon as one
terminates. Each new slave spawned requires the transmission of the entire data set; thus
the efficiency of the algorithm can be improved by exploiting data locality. In this
scenario only one slave is assigned per node, but each slave is allowed to train multiple
networks.
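
The effect of letting one slave per node train several networks from the pool can be explored with a small simulation. The sketch below is hypothetical throughout: it assumes the training time of each network configuration is known in advance (which it would not be in practice), and the name `pool_finish_time` and the `MAX_SLAVES` limit are introduced purely for illustration. The master hands the next configuration to whichever slave becomes idle first.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLAVES 64

/* Simulate a master handing out training jobs to whichever slave becomes
   free first. cost[j] is the (assumed known) training time of job j.
   Returns the time at which the last slave finishes. */
double pool_finish_time(const double *cost, size_t njobs, size_t nslaves)
{
    assert(nslaves > 0 && nslaves <= MAX_SLAVES);
    double busy_until[MAX_SLAVES] = {0.0};
    for (size_t j = 0; j < njobs; j++) {
        size_t first = 0;               /* slave that becomes idle earliest */
        for (size_t s = 1; s < nslaves; s++)
            if (busy_until[s] < busy_until[first])
                first = s;
        busy_until[first] += cost[j];   /* it trains the next network */
    }
    double finish = 0.0;
    for (size_t s = 0; s < nslaves; s++)
        if (busy_until[s] > finish)
            finish = busy_until[s];
    return finish;
}
```

Since no slave terminates between jobs, the data set is transmitted to each node only once, which is the data locality benefit referred to above.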

5.6.2 Training Set Parallelism

With training set parallelism each node also holds a complete copy of the
network. But under this approach only one network is trained, with all nodes
participating in the parallel calculation. This is achieved by distributing the training set
between all nodes in the cluster, with each node training its copy of the network on a
distinct subset of the training set. To maintain the integrity of the algorithm, all nodes
must employ the same set of weights during training. Thus communication and
synchronisation between nodes is required during the training process in order to ensure
that the weights are consistent across all nodes.

Training set parallelism is efficient to implement in a cluster-computing
environment only if this communication can be kept to a minimum. This is achieved
through the use of a batch or epoch update rule. Nodes accumulate weight changes for
the current batch/epoch independently of each other, and communicate these changes
between all nodes at the end of each batch/epoch. In essence this communication
requires transmitting the entire set of weights at least twice for every node in the cluster.
So it is important to ensure that the batch size is sufficiently large in proportion to the
size of the weight matrices, which is typical in many applications [25], [26], [27].
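
The reason a batch or epoch update rule keeps the parallel algorithm equivalent to the sequential one can be written out explicitly. The notation here is introduced only for illustration: $\Delta w_p$ is the weight change contributed by training pattern $p$ under the current weights $W$, $B$ is the set of patterns in the batch, and $B_1, \dots, B_S$ are the disjoint subsets of $B$ held by the $S$ nodes.

```latex
\Delta W \;=\; \sum_{p \in B} \Delta w_p
        \;=\; \sum_{s=1}^{S} \Bigl( \sum_{p \in B_s} \Delta w_p \Bigr),
\qquad
W_{\text{new}} = W + \Delta W .
```

Because every $\Delta w_p$ is computed from the same $W$, each node can form its inner sum independently, and only the $S$ partial sums need to be exchanged at the end of the batch or epoch.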

Like training session parallelism, training set parallelism is also typically
implemented using a master/slave model. In this case the master transmits an identical
set of weights, and a subset of the training set, to each node in the cluster prior to
commencing training. The master then waits for each slave to complete the current
batch/epoch and return its set of weight changes. When these are all returned, the
master updates the global set of weights, and transmits new weight values for all slaves
to use on the next batch/epoch. Synchronisation occurs at this point, as all slaves must
complete the current batch/epoch before a new set of weights can be calculated and
broadcast.

5.6.3 PVM Implementation of Training Set Parallelism

The PVM system was used to implement a parallel version of backpropagation.
The algorithm was parallelised by dividing the training set between the slaves.

5.6.3.1 Analysis of Communication Requirements
This section analyses the communication requirements of the algorithm; the
main issues of concern here are:

• the volume of data that must be transmitted,
• the frequency of transmission,
• the delay associated with the communication phase,
• the effect of a communication strategy on network convergence.

There are two main communication phases required by the algorithm. The first is
the initial communication phase, where the master distributes the training set amongst
the slaves and assigns them initial random weight settings. This phase is performed only
once per network trained. The second is the weight update communication phase:
during this phase the weight changes accumulated by each slave must be combined
to produce a new set of weights that will be used by all slaves during the next
batch/epoch. This communication must be performed for every batch/epoch of training.
A third and final phase may also be identified; this occurs only once per network trained
and is required to shut down the application running on the virtual machine. As this
phase simply involves the master sending a single kill message per slave it is relatively
insignificant and will not be considered further here.

5.6.3.2 Analysis of Initial Communication Phase
Figure 5.1 shows the initial communication phase. As can be seen, it is
completely one-directional, with the master only sending data and the slaves only
receiving data. If the master task is assumed not to take part in the computation of
weight changes then the entire training set must be transmitted to the slaves. This
involves transmitting T bytes in S messages, where S is the number of slave tasks and T
is the total size of the training set in bytes,

$$T = (N_{in} + N_{out}) \cdot P \cdot D$$

where
$N_{in}$ is the number of neurons in the input layer,
$N_{out}$ is the number of neurons in the output layer,
P is the number of training patterns,
D is the size in bytes of each input/output value; when the network is being
trained, D is typically a 32-bit float, or 4 bytes [54].
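
As a worked example of the training set size, consider the largest parity network listed in Table 5.1 (12 input neurons, 1 output neuron and 4096 training patterns), stored as 4-byte values. The choice of eight slaves is purely illustrative, made so that the patterns divide evenly; it is not a configuration taken from the text.

```latex
T = (12 + 1) \times 4096 \times 4 = 212\,992 \ \text{bytes},
\qquad
\frac{T}{S} = \frac{212\,992}{8} = 26\,624 \ \text{bytes per slave}.
```

Even for the largest network considered, the one-off cost of distributing the training set is therefore only a few hundred kilobytes.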

In a homogeneous environment where the training set is partitioned evenly
between the slaves, each slave receives T/S bytes of data. The master must also ensure
that each slave starts with an identical set of weights, of which the master must also hold
a copy. The size of the weight matrix is

$$W = \left( \sum_{i=1}^{L-1} N_i N_{i+1} + \sum_{i=2}^{L} N_i \right) D$$

where $N_i$ is the number of neurons in layer i and L is the number of layers; the second
summation represents the bias weights associated with all hidden and output neurons.
However, it is only sometimes necessary to broadcast the entire weight matrix to the
slaves; for most training sessions the master randomly generates the initial set of
weights, in which case only the seed used need be broadcast.
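
The size of the weight set can be computed directly from the layer sizes. The following is a minimal sketch, assuming a fully connected layered network in which every hidden and output neuron has one weight per neuron in the preceding layer plus one bias weight; the function names `weight_count` and `weight_bytes` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of trainable weights in a fully connected feed-forward network.
   layers[i] is the neuron count of layer i (layer 0 is the input layer).
   Each non-input neuron has one weight per neuron in the previous layer
   plus one bias weight. */
size_t weight_count(const size_t *layers, size_t nlayers)
{
    assert(nlayers >= 2);
    size_t total = 0;
    for (size_t i = 1; i < nlayers; i++)
        total += (layers[i - 1] + 1) * layers[i];
    return total;
}

/* Size in bytes of the weight set when each weight occupies bytes_each. */
size_t weight_bytes(const size_t *layers, size_t nlayers, size_t bytes_each)
{
    return weight_count(layers, nlayers) * bytes_each;
}
```

For a 12-2048-1 network with 4-byte weights this gives 28,673 weights, or 114,692 bytes, compared with a handful of bytes if only the random seed is broadcast.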

As all communication is in a single direction there are no waiting delays
incurred by the master; communication can begin as soon as the slaves are spawned and
the training set partitioned. Once a slave receives its data and weights it can
immediately begin computations; it is not required to wait until the end of the
communication phase.

5.6.3.3 Analysis of Weight Update Communication Phase
Figure 5.2 shows the weight update communication phase; this phase is required
at the end of every batch/epoch. During this communication phase the accumulated
weight change matrices held by each slave must be summed and added to the current
weight matrix. At the end of this phase all tasks must hold a copy of the updated weight
matrix.

Communication in this phase is bi-directional, with each slave sending its ΔW
matrix to the master and receiving a new weight matrix W when the master has
calculated the weight updates. Synchronisation occurs at this point as each slave is
required to return its ΔW matrix before the next batch/epoch can continue.

Ideally in a well-balanced system all slaves should return these matrices at
approximately the same time, in order to reduce slave idle time. In large systems
consisting of many slaves this poses a problem, with the master becoming swamped
with S messages at the end of every batch/epoch. This potential bottleneck can be
removed by employing a tree structure, with intermediate nodes acting as servers for
their lower branches, collecting and accumulating ΔW values which are then transmitted
to higher-level nodes.

Note that in a heterogeneous system, unless the load balancing can be performed
perfectly, the slaves returning their results at slightly different times will offset the
bottleneck effect. Here the delay is caused by the time taken to transmit the results of
the last slave to complete to the master; in a hierarchical structure this involves one
communication and one accumulation for each level between the slave and the master.
In this case increasing the depth of the tree may increase the overall length of the delay.

Figure 5.3 Tree structure for weight update communication phase.
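
The tree-structured accumulation can be sketched as a pairwise reduction over the per-slave weight change arrays. This is an illustration under stated assumptions: a binary tree is used (the text does not fix the fan-out), all slaves are simulated within one process, and the name `tree_accumulate` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Pairwise tree accumulation of per-slave weight changes.
   delta holds nslaves rows of nweights values each (row-major).
   On return, row 0 contains the element-wise sum of all rows.
   Returns the number of tree levels, i.e. the number of
   communication-plus-accumulation steps on the longest path. */
size_t tree_accumulate(double *delta, size_t nslaves, size_t nweights)
{
    assert(nslaves > 0);
    size_t levels = 0;
    for (size_t stride = 1; stride < nslaves; stride *= 2) {
        /* At this level, node i absorbs the partial sum held by node i+stride. */
        for (size_t i = 0; i + stride < nslaves; i += 2 * stride)
            for (size_t w = 0; w < nweights; w++)
                delta[i * nweights + w] += delta[(i + stride) * nweights + w];
        levels++;
    }
    return levels;
}
```

With this arrangement no node receives more than one message per level, at the price of a number of levels that grows with the logarithm of the number of slaves; this is the trade-off between bottleneck and delay noted above.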

Regardless of the slave structure the weight update communication phase requires
each slave to transmit and receive a full weight matrix. The total amount of data
transferred is therefore equal to 2SW, where S is the number of slaves and W the size of
the weight matrix in bytes. This is potentially a substantial amount of data, and the
amount transferred per update is not affected by the frequency of weight updates;
network traffic can therefore be minimised by performing this communication phase as
seldom as possible, that is, at the end of each epoch. As there is some idle slave time
associated with the communication phase, this should also have the effect of reducing
the overall time per epoch.
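
As an illustration of the 2SW figure, consider a fully connected 12-2048-1 network (the largest parity network in Table 5.1) with one bias weight per hidden and output neuron and 4-byte weights. The eight slaves are an assumed value chosen for the example, not a configuration reported in the text.

```latex
W = \bigl(12 \cdot 2048 + 2048 \cdot 1 + 2048 + 1\bigr) \times 4
  = 28\,673 \times 4 = 114\,692 \ \text{bytes},
\qquad
2SW = 2 \times 8 \times 114\,692 = 1\,835\,072 \ \text{bytes per update}.
```

With one update per epoch this is roughly 1.8 MB of traffic per epoch; updating k times per epoch multiplies the traffic by k.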

However, it should be pointed out that increasing the update frequency can reduce
the overall number of epochs required for the network to converge [24], at the cost of
increasing the time per epoch and the network load.

5.6.3.4 Stopping Criteria

The algorithm must define some stopping criterion; the standard criterion used is
that the network error falls below some specified value. As the network error is
calculated from the errors associated with each training pattern, it is more efficient
for the slaves to calculate the error associated with their patterns as they perform the
backward pass. The slaves must return this partial error (a single value) with the weight
updates. These partial errors are simply summed and normalised by the master process
to calculate the global network error.
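
The combination of partial errors can be sketched as follows. The normalisation mirrors the one used in the master's ReceiveUpdates routine given later in this chapter (sum, multiply by two, divide by the number of patterns). The interpretation in the comment, that each partial error is half a sum of squared errors, is an assumption made here to explain the factor of two, and the name `global_error` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Combine the partial errors returned by the slaves into a global error.
   Assumes each partial error is half the sum of squared output errors over
   that slave's patterns, so doubling and dividing by the pattern count
   gives a mean squared error. */
double global_error(const double *partial, size_t nslaves, size_t patterns)
{
    assert(patterns > 0);
    double sum = 0.0;
    for (size_t s = 0; s < nslaves; s++)
        sum += partial[s];
    return (sum * 2.0) / (double)patterns;
}
```

The master would then compare the returned value against the stopping threshold to decide whether another epoch is needed.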
5.6.3.5 Master Task

The master algorithm is outlined below. It is assumed that the virtual machine is
established prior to the launch of the master task, which reads the number of slaves to
spawn as a command line argument. The algorithm was designed to be implemented on
a dedicated cluster of homogeneous Linux machines. As all machines were assumed to
be equal the data set was evenly divided between the slaves. The implementation
employs a learn-by-epoch update strategy.

    // Initialisation Phase
    PartitionTrainingSet (NoSlaves);
    Tids = SpawnSlaves (NoSlaves);

    /***************** Start of Initial Communication Phase *****************/
    ScatterData ();
    BroadcastSeed ();
    /***************** End of Initial Communication Phase *****************/

    RandWeights (Seed);

    // Training Phase
    do {
        /************* Start of Weight Update Communication Phase *************/
        // Blocking receive function
        // Blocks until all updates are received
        ReceiveUpdates ();
        BroadcastCurrentWeights ();
        /************* End of Weight Update Communication Phase *************/
    } while (NetworkError > StopError);

    // Termination Phase
    for (i = 0; i < NoSlaves; i++)
        pvm_kill (Tids[i]);

All communication was via the pvm_send() and pvm_recv() primitives; group
constructs were not used in the implementation. The ScatterData() function is outlined
below.

    void ScatterData ()
    {
        for (i = 0; i < NoSlaves; i++)
        {
            // Pack and send a different partition to each slave
            PackData (Partition[i]);
            pvm_send (Tids[i], DataTag);
        }
    }

Broadcast operations were implemented in a similar manner, except that the data
was only packed once. They have the general form,

    void Broadcast* ()
    {
        // Pack once, then send the same buffer to every slave
        PackData (DataToBroadcast);
        for (i = 0; i < NoSlaves; i++)
        {
            pvm_send (Tids[i], *Tag);
        }
    }

The ReceiveUpdates() function receives the accumulated weight updates from
all slaves, and calculates the current set of weights. Synchronisation occurs within this
function, as all slaves must wait until it completes before commencing the next epoch.
As the partial error is associated with the weight updates it is efficient to receive these in
a single function. The function is outlined below.

    void ReceiveUpdates ()
    {
        // Must receive one message from each slave
        // to exit this loop
        NetworkError = 0;
        for (i = 0; i < NoSlaves; i++)
        {
            // Blocking receive
            // The order in which updates arrive is not important
            pvm_recv (-1, UpdateTag);

            // Unpack and sum weights
            pvm_upkfloat (&DeltaWeights);
            CurrentWeights = CurrentWeights + DeltaWeights;

            // Unpack and sum errors
            pvm_upkfloat (&Error);
            NetworkError = NetworkError + Error;
        }

        // Normalise network error
        NetworkError = (NetworkError * 2) / Patterns;
    }

5.6.3.5 Slave Task

The corresponding slave training algorithm is given below. Each slave task performs a
complete forward and backward pass on its local training set, transmits the weight
updates to the master and waits until it receives a new weight matrix on which to train
the next epoch.

    // Initialisation Phase
    // Get Tid of master process
    ParentTid = pvm_parent ();

    /***************** Start of Initial Communication Phase *****************/
    Data = ReceiveData (ParentTid);
    Seed = ReceiveSeed (ParentTid);
    /***************** End of Initial Communication Phase *****************/

    RandWeights (Seed);

    // Training Phase
    do {
        TrainNetwork (Data);

        /************* Start of Weight Update Communication Phase *************/
        SendWeightUpdates ();
        // Blocking receive function
        // Slave must wait until the master is ready
        // to send the current set of weights
        ReceiveCurrentWeights ();
        /************* End of Weight Update Communication Phase *************/
    } while (True);

5.6.3.6 Speedup Results

In order to evaluate the parallel algorithm, the system was used to train several
different networks on the parity problem, a standard neural network benchmark
commonly used to evaluate the performance of networks. Inputs to the network
represent a binary string, which the network must learn to classify as being of either odd
or even parity. This problem is an extended case of the XOR problem where the number
of inputs is greater than or equal to two, and as such is also a non-linearly separable
problem.

As this is a basic binary function the training set used by the network is simply the
truth table of the function. If N is the number of inputs to the network, $2^N$ training
patterns are required; using a single hidden layer, a minimum of $2^{N-1}$ hidden neurons
are required.
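
Generating the parity training set amounts to enumerating the truth table. The sketch below is a minimal illustration; the function name `make_parity_set` and the encoding of the target as 1.0 for odd parity and 0.0 for even parity are assumptions made here and are not taken from the implementation described in this chapter.

```c
#include <assert.h>
#include <stddef.h>

/* Fill the truth table of the n-input parity function.
   inputs  must hold (1 << n) * n values, one row of n bits per pattern.
   targets must hold (1 << n) values: 1.0 for odd parity, 0.0 for even.
   Returns the number of patterns generated, 2^n. */
size_t make_parity_set(unsigned n, float *inputs, float *targets)
{
    assert(n >= 1 && n < 32);
    size_t patterns = (size_t)1 << n;
    for (size_t p = 0; p < patterns; p++) {
        unsigned ones = 0;
        for (unsigned b = 0; b < n; b++) {
            unsigned bit = (unsigned)((p >> b) & 1u);  /* bit b of pattern p */
            inputs[p * n + b] = (float)bit;
            ones += bit;
        }
        targets[p] = (float)(ones & 1u);
    }
    return patterns;
}
```

For n = 2 the targets are 0, 1, 1, 0, which is the XOR function; each additional input doubles the number of patterns.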

String    Input    Hidden    Output    Training
Length    Nodes    Nodes     Nodes     Patterns
2         2        2         1         4
3         3        4         1         8
4         4        8         1         16
5         5        16        1         32
6         6        32        1         64
7         7        64        1         128
8         8        128       1         256
9         9        256       1         512
10        10       512       1         1024
11        11       1024      1         2048
12        12       2048      1         4096

Table 5.1 Relationship between inputs and network size for the parity problem.

Table 5.1 outlines the neural network parameters required for the parity problem
using a number of different length binary strings. Each of these networks was trained
using both the parallel system and a sequential version of the program. In order to keep
the comparison as fair as possible both versions implemented the same routines during
network training. The parallel versions were trained using different numbers of slaves.
All networks were trained for a fixed number of epochs (1000).

Small Networks

Graph 5-1 shows the speedup attained using the system on small network sizes.
Values below one indicate a drop in performance in comparison to the sequential
version. Note that the sequential version of the program is one where there is no use of
any communication or task management primitives. As can be seen there is little benefit
to be gained from parallelising small networks, with most networks experiencing a
substantial drop in performance. This is due to the communication required between the
slaves. For such networks training session parallelism may offer a better alternative, as
no communication is required during training.
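The speedup figure plotted in these graphs can be computed as below. This is a minimal sketch assuming the conventional definition (sequential time divided by parallel time), which is consistent with values below one indicating a slowdown; the efficiency helper and all names are illustrative additions.

```c
#include <assert.h>

/* Speedup of a parallel run relative to the sequential program.
   A value below 1.0 means the parallel version was slower. */
double speedup(double t_sequential, double t_parallel)
{
    assert(t_sequential > 0.0 && t_parallel > 0.0);
    return t_sequential / t_parallel;
}

/* Efficiency: speedup per slave, where 1.0 corresponds to linear speedup. */
double efficiency(double t_sequential, double t_parallel, int slaves)
{
    assert(slaves > 0);
    return speedup(t_sequential, t_parallel) / (double)slaves;
}
```

Both times should be measured over the same fixed number of epochs so that the comparison is like for like.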

Graph 5-1 Speedup for small Networks

Graph 5-2 shows the speedups obtained for medium to large sized networks. As
can be seen, all networks showed some improvement from parallelisation, and this
improvement becomes more pronounced as the size of the network increases. This is
demonstrated more clearly in Graph 5-3, where the size of the cluster was held constant
and the size of the network was varied: as the network size increased so too did
the speedup obtained.

Graph 5-2 Speedups for medium to large sized networks
[Legend: 7 Inputs, 8 Inputs, 9 Inputs, 12 Inputs]

Graph 5-3 Speedups attained for a ten-node cluster over increasing network sizes

5.7 Conclusion

Cluster computing addresses the need of many organisations to acquire extra
computational resources at an affordable price. This is achieved through the use of
cheap mass-produced products over expensive purpose built ones. The basic
components of a cluster computer are standard PCs and workstations that are
networked together.
Applications running on a cluster consist of a number of processes that are
distributed across the nodes of the cluster, and communicate and synchronise using
message passing primitives. The two most common message-passing libraries used in
cluster computing are MPI and PVM; of the two, MPI offers more functionality and
PVM greater flexibility.
A parallel version of backpropagation was implemented using the training set
parallelism method. The results show a near-linear speedup for large sized networks, but
suggest that smaller networks should employ a form of training session parallelism
instead.
In summary the goal outlined in Chapter 1 has been addressed by increasing
speedup/throughput as shown in Graph 5-2 and Graph 5-3.


Chapter 6

High Throughput Computing

6.1 Introduction

Cluster computing has established the fact that powerful computing systems can
be built from a network of standard computers. When viewed in this light the pool of
networked machines within any organisation could potentially be turned into a single
computer, in order to perform some computationally intensive calculation. However
various groups or individuals within the organisation own these machines and require
them for their own purposes. Nevertheless they are frequently idle and the pool of idle
machines across the network remains a free resource.
Computer usage within most organisations is very inefficient. PCs and
workstations have a relatively short boot-up time yet they are rarely powered down at
night or over the weekend, when nobody is using them and no background tasks are
running. The same is true of lunchtime and large parts of the working day when other
non computer related tasks are being performed. As a result a machine could spend a
significant part of its powered up time performing no useful task. When the financial
and environmental cost of running these machines is taken into account this idle time
should be seen as a wasted resource that could be better spent performing any useful
task.
A machine currently in use may also waste resources; the power of a machine is
generally dictated by the requirements of its more computationally intensive tasks. Yet
the machine may spend the majority of its time performing some low intensity task such
as email or word-processing. During this time the CPU may also spend a significant
amount of time idle.
High Throughput Computing (HTC) is an attempt to tap into these wasted
resources, by allocating some useful work to the CPU whenever it is deemed to be idle.
In such systems the owner of the machine is generally allowed to specify under what
conditions it is to be considered idle. This is to reduce any negative impact that the
system may have on the owner's work, with the aim of encouraging more owners to
participate.
This chapter investigates the use of a HTC system for the implementation of the
backpropagation algorithm. Such a system is of interest because of the experimental
nature of backpropagation training, and of neural networks in general. There is a large
element of trial and error involved in selecting network parameters for any task. The
search space is of such complexity that genetic algorithms are increasingly being used
to reduce it; in such an environment any solution that helps search this space is of
benefit.
As the training of a particular network can be viewed as evaluating a point in this
search space, the use of a HTC training system will broaden the size of the search
regardless of how slow it is in comparison to other methods applied.

6.2 High Throughput Computing

High Throughput Computing (HTC) is closely related to High Performance
Computing (HPC) in that both paradigms typically run large-scale scientific or
engineering applications. However, whereas HPC is concerned with achieving the fastest
possible execution times for these large computationally intensive jobs, HTC is
concerned with achieving the maximum utilisation of available resources [40].
HPC applications are run on either supercomputers or dedicated clusters of
workstations acting as a single machine. In such an environment the underlying speed
of the machine has a direct influence on the time required to complete the application.
As a result a major motivating factor behind HPC is the creation of faster and more
powerful computers, whereas the major motivating factor behind HTC is the utilisation
of idle CPU cycles throughout an organisation. If a machine anywhere on the network is
not operating at full capacity, its spare capacity could be allocated to the HTC system.
In HTC the applications don't run on what could be viewed as a single machine. The
application is offered access to machines as they become free, but must relinquish that
access as soon as the owners require their machines. During the life of an application it
might run on several different machines, with the number of machines participating in
the calculation changing.
In such an environment the time required to complete the application is related
to the availability of machines during the life of the application. While this time may
vary greatly with different runs of the application, it should always be quicker than if
the application were run on a single machine without the use of these extra resources. In
fact in some organisations HTC may make possible computations previously
unaffordable in which case any result is a bonus regardless of the time.

6.3 Condor

Condor is a software system that allows for the creation of a High Throughput
Computing environment out of a pool of networked machines. In essence Condor is a
batch scheduler that implements queuing and scheduling algorithms to allow the system
to claim unutilised CPU cycles throughout the pool of networked machines [40]. Users
submit jobs from their local machine, which are queued by Condor and executed on a
remote machine when one becomes available.
Each machine in the pool is locally owned, with its owner retaining complete
control over the machine. The owner of a machine specifies under what conditions a
remote job may execute on that machine, and it is Condor's task to match these with the
requirements of jobs submitted by users. This is achieved through the ClassAd system.
All machines and jobs in the pool advertise their requirements. Periodically Condor
collects these ads and assigns jobs to machines.

6.3.1 Condor Machines

A Condor pool is a collection of heterogeneous machines configured to run the
Condor software. Machines can be configured to play a number of different roles in the
pool; these include:

• Central Manager
  It is the role of the central manager to monitor all machines in the pool and
  to match the requirements of requests made by submit machines with the
  resources and policies of execute machines.

• Submit Machine
  A submit machine is any machine from which jobs may be submitted.

• Execute Machine
  An execute machine is any machine on which jobs may be executed.

Only one machine in the pool is configured to act as a central manager, which is
essentially a pool server to which the other machines connect. The use of a single
central manager could create a potential bottleneck in large pools; this can be avoided
through the use of flocking [61], a system where the central manager from one pool may
be configured to run jobs on another pool.
All machines in the pool (central manager included) can be configured to submit
and/or execute jobs. Each machine in the pool runs various Condor daemons depending
on the role it is to play in the pool. The owner of each local machine can control the
policies of the daemons that run on it through the use of locally stored configuration
files.
On execute machines the configuration file controlling the Startd daemon can be
used to implement policies controlling remote jobs running on the machine. The most
important expression in this file is the START expression, which must evaluate to true
before a remote job can start on the machine. The START expression can use attributes
of both the machine (KeyboardIdle, LoadAvg, ClockMin etc.) and the potential job
(ImageSize, ExecutableSize, Owner etc.) in its evaluation, as well as system and user
defined macros. This allows the machine owner to specify policies such as:

• Don't start jobs during office hours (8:00 am to 5:00 pm, Mon-Fri).
  Weekend = (ClockDay == 0 || ClockDay == 6)
  START = ( (ClockMin < 480 || ClockMin > 1020) || $(Weekend))

• Only start jobs if the keyboard is idle for ten minutes or more.
  START = (KeyboardIdle > 10 * $(MINUTE))

• Only start jobs if the CPU is idle.
  START = (LoadAvg < 0.3)

• Only execute jobs owned by Smith.
  START = (Owner == "smith")

Similar expressions can be defined to control when jobs are suspended, resumed,
pre-empted or killed.
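The logic of the first three example policies can be modelled in C as shown below. This is only a sketch of what the expressions test and is not Condor's expression evaluator; the struct, the function names and the decision to combine the three policies with a logical AND are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define MINUTE 60 /* seconds, mirroring the $(MINUTE) macro */

/* Hypothetical snapshot of the machine attributes used by the policies. */
struct machine_state {
    int clock_min;     /* minutes since midnight */
    int clock_day;     /* 0 = Sunday ... 6 = Saturday */
    int keyboard_idle; /* seconds since last keyboard activity */
    double load_avg;   /* current load average */
};

/* Outside 8:00 am to 5:00 pm, or at the weekend. */
bool outside_office_hours(const struct machine_state *m)
{
    bool weekend = (m->clock_day == 0 || m->clock_day == 6);
    return (m->clock_min < 480 || m->clock_min > 1020) || weekend;
}

/* Keyboard untouched for more than ten minutes. */
bool keyboard_idle_enough(const struct machine_state *m)
{
    return m->keyboard_idle > 10 * MINUTE;
}

/* CPU considered idle. */
bool cpu_idle(const struct machine_state *m)
{
    return m->load_avg < 0.3;
}

/* Assumed combined policy: every individual condition must hold. */
bool start_allowed(const struct machine_state *m)
{
    return outside_office_hours(m) && keyboard_idle_enough(m) && cpu_idle(m);
}
```

An owner who only cares about one condition would use the corresponding predicate alone, just as each START line in the list stands on its own.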

6.3.2 Condor Jobs

Jobs are submitted to Condor using the condor_submit command, which reads a
supplied submit file and queues the job(s) until they can be executed remotely on an
available machine. Only one executable can be referred to per submit file, but that
executable may be queued to run a number of times. In heterogeneous environments
different versions of the same executable may be included, with Condor selecting the
correct one to match the remote machine. It is possible to supply each run of the
executable with different inputs and command line arguments; outputs for each run can
be stored in separate files or directories.

The Requirements expression in the submit file is used to specify required
attributes of any potential host machine, such as

Requirements = Memory >= 32
Requirements = OpSys == "LINUX" && Arch == "INTEL"

Similarly the Rank expression can be used to specify the job's preferences
towards a host machine. The expressions

Requirements = Memory >= 32
Rank = Memory >= 64

inform Condor that the job requires at least 32 megabytes of memory, but would
prefer 64 if it is available on another machine.
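The interplay between Requirements and Rank can be illustrated with the following C sketch, in which Requirements acts as a filter and Rank as a preference among the machines that pass it. The struct, the function names and the merging of the two example Requirements lines into one test are assumptions; Condor's actual matchmaking is performed on ClassAds by the central manager.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical subset of a machine's advertised attributes. */
struct machine_ad {
    const char *opsys;
    const char *arch;
    int memory_mb;
};

/* Requirements: Memory >= 32 && OpSys == "LINUX" && Arch == "INTEL" */
static int meets_requirements(const struct machine_ad *m)
{
    return m->memory_mb >= 32 &&
           strcmp(m->opsys, "LINUX") == 0 &&
           strcmp(m->arch, "INTEL") == 0;
}

/* Rank = Memory >= 64, which evaluates to 1 (preferred) or 0. */
static int job_rank(const struct machine_ad *m)
{
    return m->memory_mb >= 64;
}

/* Return the index of the highest-ranked machine that satisfies the
   requirements, or -1 if no machine in the pool qualifies. */
int match_machine(const struct machine_ad *pool, size_t n)
{
    int best = -1;
    int best_rank = -1;
    for (size_t i = 0; i < n; i++) {
        if (!meets_requirements(&pool[i]))
            continue;
        int r = job_rank(&pool[i]);
        if (r > best_rank) {
            best_rank = r;
            best = (int)i;
        }
    }
    return best;
}
```

A machine with 32 megabytes is therefore acceptable, but is passed over whenever a qualifying machine with 64 megabytes or more is also available.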
Condor jobs execute in the background on remote machines; if the program
requires interactive input this is redirected from an input file containing the correct
keystrokes. All output, log and error streams are also redirected to files, the names of
which must be specified in the submit file.

6.3.3 Checkpointing and Migration

In a HTC environment such as Condor the owner of a machine can reclaim the
machine at any time, evicting all jobs from it and forcing the system to terminate each
job and restart it on another machine when one becomes free. In such a scenario, all work
performed on the vacated machine would be lost. Some of this work may be saved if the
job/machine supports checkpointing.
Under checkpointing a snapshot of the job is taken periodically while it
executes. If the job has to vacate the machine then this checkpoint is used to migrate it
to a new machine. As the owner is entitled to reclaim his entire machine, and
considering it may take some time for a machine to become free, the checkpoint is
stored on the submit machine and not the execute one. It is also possible to assign one
machine as a checkpoint server, in which case all checkpoint images for jobs running
in the pool are stored on it.
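Condor's checkpointing snapshots the whole process and is transparent to the program. For comparison, the sketch below shows an application-level analogue in C, in which a training program periodically saves its own epoch counter and weight vector and can resume from them. The file layout and function names are assumptions for illustration and are not part of Condor.

```c
#include <assert.h>
#include <stdio.h>

/* Write the epoch counter and weight vector to a binary file.
   Returns 0 on success, -1 on failure. */
int save_checkpoint(const char *path, int epoch, const double *weights, int n)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    int ok = fwrite(&epoch, sizeof epoch, 1, f) == 1
          && fwrite(&n, sizeof n, 1, f) == 1
          && fwrite(weights, sizeof *weights, (size_t)n, f) == (size_t)n;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read a checkpoint back. Returns the number of weights restored,
   or -1 if the file is missing, truncated or larger than max_n. */
int load_checkpoint(const char *path, int *epoch, double *weights, int max_n)
{
    FILE *f = fopen(path, "rb");
    int n = 0;
    if (!f)
        return -1;
    int ok = fread(epoch, sizeof *epoch, 1, f) == 1
          && fread(&n, sizeof n, 1, f) == 1
          && n >= 0 && n <= max_n
          && fread(weights, sizeof *weights, (size_t)n, f) == (size_t)n;
    fclose(f);
    return ok ? n : -1;
}
```

A restarted job would call load_checkpoint first and fall back to epoch zero with freshly randomised weights if it returns -1.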

6.3.4 Condor Universes

Condor uses the term universe to describe the execution environment under
which a submitted job is to run. The default universe is Standard. For jobs running
under this universe the execution environment of the submit machine is preserved
through the use of Remote System Calls. Any attempted file operation on the remote
host is redirected to a shadow process on the submit machine, which performs the actual
operation (Figure 6.1).

[Figure 6.1 diagram. Labels: Remote Host; Local Host.]

Figure 6.1 Remote System Call

This requires that the job be re-linked with the Condor libraries using the
condor_compile command. The Standard universe is also the only universe to support
checkpointing. Jobs that avail of both these features are subject to a number of
restrictions; these include:

• Jobs must contain a single process only.

• Jobs must not communicate with other processes.

• Jobs must not use the SIGUSR2 and SIGTSTP signals.

• Jobs must not utilise multiple kernel-level threads.

• Jobs must not use memory-mapped files.

Jobs that cannot be re-linked or are restricted from running in the Standard
universe can use the Vanilla universe, which cannot avail of checkpointing. Instead jobs
are suspended in the hope that the machine will become available again soon and it can
resume. Failing this the job must be restarted on another machine.
Parallel applications written in PVM or MPI are run under the PVM and MPI
universes respectively. Condor-MPI applications require a fixed number of dedicated
nodes while the application is running, whereas Condor-PVM applications can avail of
Condor's opportunistic environment to grow/shrink the size of the virtual machine as
nodes become available/unavailable. The PVM universe requires an extra module to be
installed on all hosts that will take part in the virtual machine.
Jobs written in Java can avail of the Java universe to run under the Java Virtual
Machine execution environment. The Globus universe is used to provide a Condor
interface for launching Globus jobs. With the exception of the Standard universe, all
submit files must name the universe under which the job(s) will run.

Footnote: In UNIX-like systems the SIGUSR2 signal is application defined. Condor reserves its use for
instructing a remote job to perform a periodic checkpoint.
Footnote: In UNIX-like systems the SIGTSTP signal is used for job control; it instructs the receiving task
to perform an interactive stop. Condor uses the signal to cause a remote job to checkpoint and terminate.

6.4 Condor and Training Session Parallelism

Training session parallelism involves trying a number of different network
configurations and selecting the best one. Training session parallelism is required due to
the problem of local minima; when a network converges on a solution there is no way
of knowing if it is optimal. As a result a number of networks may need to be trained
before a suitable solution is found. As each network trained is completely independent
of the other networks, no communication or synchronisation is required between them.
This can easily be implemented using the Condor system by submitting a sequential
backpropagation program and queuing it to run a number of times. Each run of the
program will use a different input file and thus train a different network. If the program
is adjusted to write the final network evaluation to file, the files from each run can be
compared to automatically select the best network once all networks are trained.
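The final comparison step can be sketched in C as below. It assumes that the evaluation written by each run has already been read into memory as a run identifier and a final network error, and that the lowest error marks the best network; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one queued training run. */
struct run_result {
    int run_id;         /* index of the run in the Condor queue */
    double final_error; /* network error reported at the end of training */
};

/* Return the run_id of the run with the lowest final error,
   or -1 if there are no results to compare. */
int best_run(const struct run_result *runs, size_t n)
{
    if (n == 0)
        return -1;
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (runs[i].final_error < runs[best].final_error)
            best = i;
    return runs[best].run_id;
}
```

Because the runs never communicate, this comparison is the only point at which their results meet.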
A problem with opportunistic environments such as Condor is that particular
machines are available for an unknown amount of time and may be reclaimed without
any warning. The longer it takes a job to complete the greater the likelihood of the
machine being reclaimed before it does so. For jobs that take a long time it is imperative
to ensure that the job is not continuously pre-empted and restarted on a new machine.
This is true of backpropagation training where it may take days or weeks to train a
complete network. One way of avoiding this problem is to re-link the program for
use in the Standard universe, allowing it to make use of Condor's checkpointing and
migration features.

6.5 Condor-PVM

PVM's message-passing routines provide the application programmer with a
powerful tool for developing robust dynamic parallel applications. However one
disadvantage of the system is that when running in a non-dedicated environment a PVM
program cannot easily probe the state of machines not currently included in the virtual
machine. While it is possible to write routines within a PVM program instructing it
when to add or remove hosts (that may be required for other work during the day) such
routines are likely to be complex and error prone.
The Condor system when incorporated with PVM provides the parallel
application with the ability to handle the dynamics of a non-dedicated environment in a
complex manner without introducing such complexities into the PVM code itself.
Condor serves as a resource manager in the system. It monitors all hosts in the pool and
notifies PVM whenever a host becomes available or is reclaimed by its owner. PVM is
used for its message-passing interface between the tasks on each host.
Footnote: Condor's checkpointing and migration features are only available in the Standard universe.

Condor-PVM is designed to run applications in the master/worker model [62],
which although similar is nevertheless slightly different from the master/slave
model. A problem with such models of parallel programs is that different sources use
different names for the same model, or perhaps even the same name for different
models. A survey of various naming conventions can be found in [8]; it is however
useful to clarify the distinction between the two models in relation to the context
in which they are used here.

• Master Worker Model
  In the master worker model a single master task holds a pool of tasks that
  must be completed. These tasks are highly independent of each other and
  are fanned out to workers as they become available. Due to the independence of
  the tasks there is no communication or synchronisation required between the
  workers; should a worker task be lost it can simply be restarted without affecting
  the remaining workers. There is also a high degree of data independence between
  the workers, making it efficient to terminate a worker as soon as it returns
  results. Condor suggests that Condor-PVM applications be implemented using
  its Master Worker class templates which enforce this model [63]. Such a model
  is practical in a HTC environment.

• Master Slave Model
  The master slave model also employs a single master task that controls a
  group of subordinate tasks. But in this case the slaves are not as independent as
  the workers. The master task performs some repetitive calculation during which
  the slaves may need to communicate or synchronise with each other. The
  lifespan of a slave exceeds that of a worker, with a slave generally being
  retained after it has returned results in order to reuse its local data and continue
  its participation in the parallel calculation. In such a system the loss of a slave
  may have a negative impact on all other slaves, making it less efficient to
  implement in a HTC environment.

The master task is specified in the submit file and resides on the submit machine
throughout the life of the program. All worker tasks are launched using the pvm_spawn ()
routine, and are restricted to one worker task per remote host. While Condor ensures
that the master task is never pre-empted, the same is not true of worker tasks running
on remote hosts, which may be lost at any time without warning.

This is perhaps the main difference between Condor-PVM and standard PVM
applications. Standard PVM applications run in a relatively stable dedicated environment;
while it is still good practice to implement fault tolerance routines, failures are
considered rare and therefore have little impact on the design of the algorithm.
Condor-PVM by its very nature is an unreliable environment, a fact that has to be taken
into account when designing algorithms to run under it.
The master has to adjust to a constantly shifting number of workers and cope
with the loss of up to all its workers. The master program should strive to achieve some
work at all times regardless of the available resources. In a HTC environment the most
important concern is that the job eventually finishes, and not how fast individual
sections complete. Ideally the master program should be able to perform all of the work
on its own and farm it out to workers as and if they become available.
When submitting a Condor-PVM job it is possible to specify a minimum and
preferred number of hosts in the submit file through the use of the machine_count
command. Condor will not start the master program until a virtual machine with at least
the minimum number of hosts has been formed. The remaining hosts will be added
later as they become available.

6.6 Condor-PVM and Backpropagation Training

The previous chapter examined the implementation of parallel backpropagation
on a dedicated cluster using PVM. The two methods investigated were training session
parallelism and training set parallelism. This section will examine the suitability of each
method for implementation in a HTC environment using Condor-PVM.

6.6.1 Condor-PVM and Training Session Parallelism

Section 6.4 outlined a training session scheme using Condor and a sequential
backpropagation program. With training session parallelism the slaves/jobs are
completely independent of each other; there is therefore little benefit to be gained
through the use of PVM's communication routines. In the algorithm outlined in the last
chapter PVM was used as a delivery system to distribute nearly sequential slave tasks to
remote hosts. Condor can just as easily be used to distribute completely sequential jobs
to the same remote hosts. The use of Condor alone does however offer a number of
advantages over a Condor-PVM based version; these include:

• Jobs do not require PVM to be installed on the remote machines.

• Jobs do not incur any extra overhead through running the PVM system.

• Jobs can be run under the Standard universe and therefore be migrated.
  As a result jobs are not required to regularly save their weights to the
  master process to offset against any loss of work due to sudden
  termination.

Footnote: If required, jobs may still save their weights to a file on the submit machine through the use of
remote system calls.

6.6.2 Condor-PVM and Training Set Parallelism

Training set parallelism can be efficiently implemented using the master slave
model; this poses a number of problems in a HTC environment, which is more suited to
the master worker model. The master task divides the training set between the slaves,
which perform the actual calculation. In a dedicated environment the number of slaves
is fixed and known in advance, allowing the system to perform static load balancing
when partitioning the data. But in a HTC environment the number of slaves is not
constant and may vary over the course of training. With no advance knowledge of the
number of slaves available the master is forced to perform dynamic load balancing as
part of the parallel calculation. This can be achieved through the logical division of the
training set into a number of fixed sized blocks, and the distribution of these blocks
between the slaves. A number of issues arise from the use of such a load balancing
strategy; these include:

Granularity
The grain size of each block is important, as when running in a HTC
environment a host may be lost at any moment with a complete loss of any partially
completed but not yet returned results. The finer the granularity the less work is lost
when a host is reclaimed. On the other hand the smaller the grain size the more network
traffic is generated, as more sets of updates must be returned.

Load Balancing
Load balancing and granularity are interrelated; the finer the granularity the
greater the scope for load balancing. In a heterogeneous environment a smaller block
size affords the master task greater precision in its effort to match the relative speed of
each processor with the total number of blocks it possesses. This is important, as an
epoch cannot complete until the last slave has returned results; if the system is not well
balanced the majority of slaves may become idle while a few slaves process the last
remaining blocks. Load balancing is especially important in a dynamic environment
where the slaves are constantly changing.

Data Locality
As backpropagation training sets can be quite large, it is important to ensure that
the system isn't transmitting large blocks of data over the network each epoch. Caching
the blocks on slave nodes and allowing slaves to reuse local blocks in successive
training epochs can achieve this.

Redundancy
The greater the redundancy in the system the easier it is to perform load
balancing and error recovery. But as redundancy involves transmitting and caching each
block of data on a number of slaves, redundancy increases both network traffic and the
memory requirements of slave tasks.
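The logical division of the training set into blocks can be sketched in C as follows. The names are hypothetical, and allowing the final block to be shorter when the grain size does not divide the number of patterns evenly is an assumption, since the text only states that blocks are of fixed size.

```c
#include <assert.h>

/* A block is described by the index of its first pattern and its length,
   so the master can refer to it without copying any data. */
struct block {
    int first;
    int length;
};

/* Number of blocks needed to cover all patterns at the given grain size. */
int block_count(int patterns, int grain)
{
    assert(patterns >= 0 && grain > 0);
    return (patterns + grain - 1) / grain;
}

/* Pattern range covered by block number id (0-based). */
struct block block_range(int patterns, int grain, int id)
{
    struct block b;
    assert(id >= 0 && id < block_count(patterns, grain));
    b.first = id * grain;
    b.length = (b.first + grain <= patterns) ? grain : patterns - b.first;
    return b;
}
```

Halving the grain size doubles the number of blocks, which illustrates the trade-off described under Granularity: less work is lost per reclaimed host, but more sets of updates are returned per epoch.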

6.6.3 Block Scheduling Algorithm

To optimise the assignment of blocks to slaves the master implements a
scheduling algorithm with the following aims.

1. To reduce the idle time where some slaves are waiting for others to finish
   before the start of the next epoch.
2. To reduce the number of copies of a block held by each slave.
3. To reduce network load by transmitting as few blocks of data as possible.
4. To reduce the recovery time when a task or host fails.

[Figure 6.2 diagram. Labels: Submit Machine (Data, Pointers, Status, Slave List); PVM Message
Passing and Slave Spawning; Parallel Virtual Machine; Claimed Hosts (No. Blocks, Blocks of Data);
Unclaimed Hosts; Condor Pool.]

Figure 6.2 Block scheduling architecture

An overview of the scheduling architecture is shown in Figure 6.2. The master
stores the complete training set and logically divides this into a number of blocks by
creating pointers into the data, whereas the slaves store each block as a separate
structure to allow for ease of maintenance. Each block is referenced by a unique ID,
which is shared between the master and all the slaves.
For each block the master maintains its status for the current epoch; this can have
one of three values:

• Unassigned (0).

• Processed and returned (-1).

• Currently being processed (Tid of the processing slave).

The master also records which blocks each slave has a copy of, and when each of these
slaves last processed the block. When a slave finishes processing its current block, it
must be assigned a new one from the unassigned pool. The master selects blocks from
the unassigned pool using the following priorities.

1. The slave has a copy of the block. (Aims 2 and 3)

   If possible the algorithm will always select a block that the slave has
   cached in its local memory. This policy not only reduces redundancy and
   network load but also increases the efficiency of the training algorithm, as
   the slave can begin processing after receiving a brief notification message;
   it is not required to wait for the transmission of an entire block.

2. A block which is shared by as few slaves as possible. (Aims 2 and 4)

   If the slave doesn't hold a copy of any block in the unassigned pool, a
   block must be selected from the pool for transmission to the slave. In this
   case the system shows a preference for blocks that are held by the fewest
   slaves. This increases the redundancy of the individual block,
   which in turn reduces the chances that it will have to be reissued to a new
   slave at some later point in training.

This policy means that no slave is left idle while there are unassigned blocks to
process (Aim 1). The longest time a slave should have to wait is the length of time it
takes to process a block. But in a HTC environment it is possible that the slave
processing this last block is terminated before it can return results. In such an event the
system must reassign this block to another slave, thus increasing the overall idle time.
A small grain size can help alleviate this problem.
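The two selection priorities can be sketched in C as below. The status codes follow the three values listed earlier; the flat holds array, the function names and the tie-breaking rule (lowest block ID wins) are assumptions introduced for illustration.

```c
#include <assert.h>

#define BLOCK_UNASSIGNED 0 /* not yet issued this epoch */
#define BLOCK_DONE (-1)    /* processed and returned */
/* Any other status value is the Tid of the slave processing the block. */

/* Count how many slaves hold a copy of the given block.
   holds is a row-major [n_slaves][n_blocks] array of 0/1 flags. */
static int copies_of(const unsigned char *holds, int n_slaves, int n_blocks,
                     int block)
{
    int c = 0;
    for (int s = 0; s < n_slaves; s++)
        c += holds[s * n_blocks + block] ? 1 : 0;
    return c;
}

/* Choose the next block for a slave, or return -1 if none is unassigned. */
int select_block(const int *status, const unsigned char *holds,
                 int n_slaves, int n_blocks, int slave)
{
    int best = -1;
    int best_copies = 0;
    assert(slave >= 0 && slave < n_slaves);

    /* Priority 1: an unassigned block the slave already has cached. */
    for (int b = 0; b < n_blocks; b++)
        if (status[b] == BLOCK_UNASSIGNED && holds[slave * n_blocks + b])
            return b;

    /* Priority 2: the unassigned block held by the fewest slaves. */
    for (int b = 0; b < n_blocks; b++) {
        if (status[b] != BLOCK_UNASSIGNED)
            continue;
        int c = copies_of(holds, n_slaves, n_blocks, b);
        if (best < 0 || c < best_copies) {
            best = b;
            best_copies = c;
        }
    }
    return best;
}
```

A return value of -1 means the slave must wait for the remaining blocks of the epoch to be returned by other slaves.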

6.6.3.1 Maintenance Algorithm
The system needs to keep some check on the number of blocks held by each
slave. Unchecked, a slave could acquire a large number of blocks over the course of
training, the majority of which are redundant copies that are now processed by other
slaves. But each slave executes on a remote host and may have access to a limited
amount of memory. It is therefore necessary to implement a maintenance policy which
that deletes blocks from overcrowded slaves.
141

High Throughput Computing
A pool wide policy can be implemented by capping the number of blocks per
slave. If a slave requests a new block when it has already reached this limit, one of its
existing blocks must be selected for deletion. This selection process can be performed
locally by the slave or globally by the master.
Allowing the slave to select which block to delete has the advantage of
simplifying the master-scheduling algorithm, but has the disadvantage that each slave
has no knowledge of the global distribution of the blocks. The only information a slave
maintains concerning its cached blocks of data, is the epoch each block was last
processed by that slave. This allows the slave to select a redundant block whose
processing has since being taken over by another slave.
One disadvantage of local selection is that the block selected for deletion might
be the only redundant copy of that block in the system. Should the slave containing the
active copy of that block terminate at some later point in training, the block will have to
be retransmitted to another slave. This situation can be avoided by allowing the master
to select the block for deletion. The master policy is as follows:

• Select an old block that hasn't recently been used.
  Ideally the block deleted from a slave should be the one that is least
  likely to be required again by that slave. If all of a slave's cached blocks have been
  accessed recently, the best course of action is to delete the block with the oldest
  access time (epoch).

• Select the most common block.
  Of the blocks that are considered non-recent, the one with the greatest
  number of copies throughout the whole system is selected for deletion.
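
The two rules of the master policy can be sketched in C as follows. This is only an illustration under stated assumptions: the CachedBlock record, the select_victim() name and the recent_window threshold that separates recent from non-recent blocks are all hypothetical and are not taken from the thesis implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record the master keeps for each block cached on one slave. */
typedef struct {
    int block_id;    /* index of the block in the partitioned training set */
    int last_epoch;  /* epoch in which this slave last processed the block */
    int copies;      /* copies of this block held across the whole pool    */
} CachedBlock;

/*
 * Choose which cached block to delete from an overcrowded slave.
 * A block is treated as non-recent when it was last processed more than
 * recent_window epochs ago (an assumed threshold). Among non-recent
 * blocks the one with the most copies in the pool is chosen. If every
 * block is recent, the one with the oldest access epoch is chosen.
 * Returns an index into cache.
 */
size_t select_victim(const CachedBlock *cache, size_t n,
                     int current_epoch, int recent_window)
{
    assert(cache != NULL && n > 0);

    size_t victim = 0;
    int found_non_recent = 0;

    for (size_t i = 0; i < n; i++) {
        int non_recent =
            (current_epoch - cache[i].last_epoch) > recent_window;

        if (non_recent) {
            /* Rule 2: prefer the non-recent block with the most copies. */
            if (!found_non_recent ||
                cache[i].copies > cache[victim].copies) {
                victim = i;
                found_non_recent = 1;
            }
        } else if (!found_non_recent &&
                   cache[i].last_epoch < cache[victim].last_epoch) {
            /* Rule 1 fallback: oldest access time among recent blocks. */
            victim = i;
        }
    }
    return victim;
}
```

Deleting the block with the most copies keeps the chance low that the only redundant copy of a block is lost, which is the weakness of local selection described above.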

6.6.3.2 Master Algorithm
The master algorithm is outlined below. When it is launched it makes no
assumptions about the state of the virtual machine and instead uses the pvm_config()
routine to probe its state and return the number of remote hosts available. For simplicity
it is assumed here that the master task takes no part in the parallel calculation, and that
the ProbeVirtualMachine() routine blocks until at least one host has been added.

// Initialisation Phase

// GrainSize = No. of patterns per block
PartitionTrainingSet(GrainSize);

// Probe the state of the virtual machine to see if any
// remote hosts exist. Function blocks until at
// least one host is added.
NoHosts = ProbeVirtualMachine();

// Can only spawn one slave at a time
for (i = 0; i < NoHosts; i++)
    Tids[i] = SpawnSlave();

// Add notification requests for slaves and hosts
AddNotificationRequests();

// Create an index of slaves for the scheduling
// algorithm
AddSlavesToTaskList(Tids);
RandWeights(Seed);

// Training Phase
do {
    BroadcastWeights();
    // Each slave is assigned the first block to
    // process, in accordance with the scheduling
    // policy
    AssignInitialBlockToSlaves();
    // Complete one epoch of training and
    // update the current set of weights
    TrainEpoch();
    UpdateWeights();
} while (NetworkError > StopError);

// Termination Phase
for (i = 0; i < NoSlaves; i++)
    pvm_kill(Tids[i]);

The main function of interest in the master routine is the TrainEpoch() function.
Once the initialisation phase is completed this function performs the majority of the
work carried out by the master task. It is in this function that the scheduling and
maintenance algorithms are implemented. The TrainEpoch() function is outlined below.

void TrainEpoch()
{
    do {
        if (received weight update message)
            get block id
            mark block returned
            assign free block to slave
            unpack and store weights
        if (received new host message)
            spawn slave on host
            assign free block to slave
        if (received task fail message)
            remove task from list
            mark block free
    } until (all blocks returned)
}
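
The block bookkeeping implied by this loop can be sketched as a small state machine. This is a minimal illustration, not the thesis code: message receipt is replaced by a plain function call, and the Epoch structure, the enum names and the first-free assignment rule are assumptions made for the example (the real scheduler also prefers cached blocks).

```c
#include <assert.h>

#define MAX_BLOCKS 64

enum BlockState { BLOCK_FREE, BLOCK_ASSIGNED, BLOCK_RETURNED };
enum MsgKind { MSG_WEIGHT_UPDATE, MSG_NEW_HOST, MSG_TASK_FAIL };

/* Hypothetical per-epoch bookkeeping held by the master. */
typedef struct {
    enum BlockState state[MAX_BLOCKS];
    int owner[MAX_BLOCKS];  /* slave processing the block, or -1 */
    int n_blocks;
    int n_returned;
} Epoch;

void epoch_begin(Epoch *e, int n_blocks)
{
    assert(n_blocks > 0 && n_blocks <= MAX_BLOCKS);
    e->n_blocks = n_blocks;
    e->n_returned = 0;
    for (int b = 0; b < n_blocks; b++) {
        e->state[b] = BLOCK_FREE;
        e->owner[b] = -1;
    }
}

/* Give the first free block to slave; returns its id, or -1 if none. */
int assign_free_block(Epoch *e, int slave)
{
    for (int b = 0; b < e->n_blocks; b++) {
        if (e->state[b] == BLOCK_FREE) {
            e->state[b] = BLOCK_ASSIGNED;
            e->owner[b] = slave;
            return b;
        }
    }
    return -1;
}

/*
 * Apply one message to the epoch state. block is only used for
 * MSG_WEIGHT_UPDATE. Returns 1 once all blocks have been returned.
 */
int handle_message(Epoch *e, enum MsgKind kind, int slave, int block)
{
    switch (kind) {
    case MSG_WEIGHT_UPDATE:
        assert(block >= 0 && block < e->n_blocks);
        if (e->state[block] == BLOCK_ASSIGNED && e->owner[block] == slave) {
            e->state[block] = BLOCK_RETURNED;
            e->owner[block] = -1;
            e->n_returned++;
        }
        assign_free_block(e, slave);  /* keep the slave busy */
        break;
    case MSG_NEW_HOST:
        assign_free_block(e, slave);  /* slave spawned on the new host */
        break;
    case MSG_TASK_FAIL:
        /* Blocks held by the failed task become free again. */
        for (int b = 0; b < e->n_blocks; b++) {
            if (e->state[b] == BLOCK_ASSIGNED && e->owner[b] == slave) {
                e->state[b] = BLOCK_FREE;
                e->owner[b] = -1;
            }
        }
        break;
    }
    return e->n_returned == e->n_blocks;
}
```

Because a failed task only moves its blocks back to the free state, the epoch cannot finish until some other slave has picked them up, which is the source of the idle time discussed earlier.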

6.6.3.3 Slave Algorithm
The slave-training algorithm is outlined below. It implements a simple loop
in which it waits for and processes instructions from the master. Note that the slave
will update its weights whenever requested to by the master.

void SlaveTrain()
{
    do {
        if (received weight update message)
            Unpack and update weights
        if (received new block assignment message)
            Unpack and store new data block
        if (received block process message)
            Unpack block ID
            Current block = block ID
            Train network on block
            Send weight updates
        if (received delete block message)
            Unpack block ID
            Delete block
    } while (true);
}

6.6.4 Evaluation of System
The system was evaluated on a cluster of twelve homogeneous PCs. Each PC in
the cluster contained an Intel Pentium III processor and 125MB of RAM. For these tests
using the Condor system the PCs were running the RedHat 8.1 Operating System. The
machines were networked using a Cisco Systems Catalyst 2900 XL series switch.
Graph 6-1 shows the variations in time per epoch for runs using a constant
number of hosts. It can be seen from the graph that while the average time per epoch for
each test is roughly the same, times can vary greatly from one epoch to the next. This
is due to the manner in which the algorithm assigns blocks to hosts, resulting in some
hosts performing more processing than others in certain epochs. To account for this
variation and to produce clearer results in tests it was decided that two hosts would be
added or removed from the system at the same time.

Graph 6-1 Variations in time per epoch (series: Test 1, Test 2)

Loss of hosts from the system
Graph 6-2 shows the effect losing two hosts has on a typical run of the system.
This graph shows the average time per epoch for a run comprising 10 hosts, a run comprising 8 hosts
and a run where the number of hosts drops from 10 to 8 on the seventh epoch.
It can be seen from the graph that after an initial surge in time on the eighth epoch, where the system reconfigures itself, the average time per epoch returns to a level
comparable with that of the 8 host test. This surge can be explained by the fact that the
system had to resend all blocks of data that were stored only on the lost hosts.
Graph 6-2 System adjusting to loss of hosts (x-axis: Epoch; series: 8 Hosts, 10 Hosts, 10 Hosts lose 2)

Addition of Hosts to the system
Similarly, Graph 6-3 shows the effect gaining two hosts has on the system. In
this case it can be seen that the system adapts to the change almost immediately,
with the time per epoch dropping from the 8 host level to the 10 host level in a single epoch and no surge in time for the post-change
epoch. The lack of a surge can be explained by the fact that the two new hosts
immediately joined in the processing of blocks, reducing the load on the previously
employed hosts.
Graph 6-3 System adjusting to gain of hosts (x-axis: Epoch; series: 8 Hosts, 10 Hosts, 8 Hosts Gain 2)

General Speedup obtained by the system
Graph 6-2 and Graph 6-3 show that the system can adjust to the loss or gain of
hosts and quickly reconfigure itself to perform at a level comparable to a system
running the same number of hosts which has not been reconfigured. These graphs
do not however indicate the overall performance of the system in comparison to a
dedicated cluster or the ideal speedup, where a ten host system runs ten times faster
than the best sequential version of the algorithm running on a single computer. This
general speedup information is presented in Graph 6.4, where the performance of the
system (without reconfigurations) is compared with the ideal speedup as the number of
hosts varies.
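
For reference, the comparison against the ideal can be expressed with the usual definitions of speedup (sequential time divided by parallel time) and efficiency (speedup divided by the number of hosts). The short C sketch below assumes these standard definitions; the function names are illustrative only and no measured thesis timings are used.

```c
#include <assert.h>

/* Speedup: time of the best sequential run divided by the parallel time. */
double speedup(double t_sequential, double t_parallel)
{
    assert(t_parallel > 0.0);
    return t_sequential / t_parallel;
}

/* Efficiency: fraction of the ideal speedup (equal to the host count)
 * that is actually achieved. A value of 1.0 means ideal speedup. */
double efficiency(double t_sequential, double t_parallel, int hosts)
{
    assert(hosts > 0);
    return speedup(t_sequential, t_parallel) / (double)hosts;
}
```

With purely hypothetical figures, a sequential epoch of 1000 time units that takes 125 units on ten hosts gives a speedup of 8 against an ideal of 10, that is an efficiency of 0.8.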

Graph 6.4 Speedup of system (x-axis: Processors; series: Speedup, Ideal)

6.7 Limitations of Condor-PVM based System
The results above demonstrate that Condor-PVM can be used to
implement a high throughput implementation of backpropagation, with the system
able to readjust itself quickly after a change in configuration. It should be pointed out, however,
that the use of Condor-PVM is far from an ideal solution. There are a number of
limitations associated with Condor-PVM:

Reduced PVM functionality
PVM and Condor-PVM have similar but not identical functionality; a list of the
differences may be found in Appendix C. While many of these differences are slight in
nature they do require minor changes to the code, which is contrary to the misleading
claims in the Condor documentation. There is however one major change
which greatly reduces the power of any Condor-PVM solution in comparison to a
standard PVM one: Condor limits the number of tasks per host to one. However,
many applications can best be implemented using a modular design, where the task is
decomposed into a number of processes each performing a specific task.
Figure 6.3 demonstrates this using the master task outlined above as an
example¹⁷. Ideally the master task should strive to process some work at all times, and
so should process the data itself when there are no slaves. But while processing this data
the master must also listen for notification messages. If this is performed frequently, data
processing will suffer due to the large number of expensive system calls¹⁸. But if it is
performed infrequently, slaves may sit idle for some time waiting for the master to receive their
messages.
[Diagram: a Submit Machine containing a Local Slave, a Master Scheduler and a Master Maintenance Task, with two Shared Memory regions labelled "Full Training Set, Number of Rows for Local Slave to process" and "Block Index, Semaphores for Synchronizing weight updates"]
Figure 6.3 Solution for master task employing multiple tasks per host

By applying a modular design, the master scheduler residing in a separate task
issues a single blocking receive, leaving the backpropagation task to continue serial
processing until a new host is added. Such an approach allows more complex algorithms
to be implemented by each module with the minimum negative impact on other modules.

¹⁷ Communication between modules in the diagram is via shared memory because PVM library functions
are not reentrant.
¹⁸ The non-blocking receive pvm_nrecv()
Cumbersome to program
Condor is a batch-scheduling environment. It provides a suite of command line
programs for processing and managing jobs in the Condor system. However this
functionality is not available through any library or API calls, making it difficult to
access from within an application running on the system. The same is true of the
file based system information, which if it were more easily accessible could be used to
estimate how long a host will be available for¹⁹, and thus the benefit of including it in
the parallel calculation.
Overhead of Condor functionality
Much of Condor's functionality is unavailable in the PVM universe. Yet it is
required that the full Condor system be installed when running Condor-PVM
applications. This can be viewed as introducing unnecessary overhead into the system.
Not Compatible with MS Windows
This chapter investigates the implementation of backpropagation in a high
throughput computing environment. As stated in the introduction, the motivation behind
this investigation is the utilisation of wasted CPU cycles across a network of machines
to aid neural network research. As the most common operating system in use today is
MS Windows, it stands to reason that most of these wasted cycles will be on such
machines. The fact that Condor-PVM is not currently available for the Windows
operating system severely limits the scope of any applications designed using it.

¹⁹ Based on the owner's policies in the configuration files.

6.8 Conclusion
The aim of this chapter was to investigate the possibility of applying high
throughput computing techniques to backpropagation training. Backpropagation poses a
problem for HTC environments because of the requirement that all slave processes
synchronise at least once per epoch in order to perform weight updates. The loss of a
single task just prior to this update will force all other tasks to wait until the work can be
reassigned and completed by another task. In an environment as unreliable as a HTC
system this could lead to lengthy delays, if the final task is repeatedly terminated before
it can complete.
Reducing the granularity and assigning each task only a few rows of training
data to process best avoids this situation. However this does not fully solve the problem.
Unlike many tasks that are suitable for HTC environments, the data used by each task is
not one-off data. That is, each row of the training set is required for every epoch of
training and cannot be simply discarded after a task has used it in the current
calculation.
It is important that tasks retain and reuse these rows, in order to avoid the
overhead of transmitting the entire training set over a shared network for every epoch of
training. In order to achieve this a simple caching algorithm was implemented, whereby
the master task shows a preference for assigning rows to tasks that already possess
copies of those rows. As can be seen from the results, the system quickly settles into a
stable state while the number of nodes remains constant, and when the number of
nodes changes the system quickly recovers and returns to a stable state. Moreover the
speedup achieved realises the goal outlined in Chapter 1: increased throughput for
backpropagation training using cheap computing resources.


Chapter 7

7 Conclusion
Neural networks are biologically inspired information processing tools that have
found widespread application in areas where traditional rule based systems have
encountered little success. This is in part due to the fact that some neural based
solutions can be trained to perform the task using a set of examples. One such
solution is the backpropagation of error training algorithm; however backpropagation
training is computationally intensive and may take days or weeks to complete.
The aim of this thesis was to investigate methods for reducing backpropagation
training times using standard off the shelf commodity hardware. The first method
examined was to exploit the parallelism available within the micro-architecture of a
modern microcomputer. This involves using the packed SIMD instructions and packed
registers provided by many computer vendors. These instructions allow a number of
array elements to be processed in a single instruction, offering a potential speedup for
applications that rely heavily on array and matrix operations.
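
As an illustration of the idea, the weighted sum computed by a neuron can be written with SSE intrinsics so that four single-precision products are accumulated per instruction. This is a minimal sketch and not the code measured in the thesis: the dot_sse() name is hypothetical, unaligned loads are used for simplicity, and the array length is assumed to have been padded with zeros to a multiple of four.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics */

/*
 * Dot product of inputs x and weights w using 128-bit packed
 * single-precision operations. n must be a multiple of 4; shorter
 * vectors are assumed to be padded with zeros by the caller.
 */
float dot_sse(const float *x, const float *w, size_t n)
{
    assert(n % 4 == 0);

    __m128 acc = _mm_setzero_ps();              /* four partial sums */
    for (size_t i = 0; i < n; i += 4) {
        __m128 xv = _mm_loadu_ps(x + i);        /* load 4 inputs     */
        __m128 wv = _mm_loadu_ps(w + i);        /* load 4 weights    */
        acc = _mm_add_ps(acc, _mm_mul_ps(xv, wv));
    }

    /* Reduce the four partial sums to a single scalar. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

The zero padding is what allows a layer whose width is not a multiple of four to use the packed loop, at the cost of some wasted arithmetic on the padded elements.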
As shown in Graph 7-1, all network sizes tested showed some performance
increase from using SIMD instructions, although the increase was smaller for small
networks or networks that required padding.

Graph 7-1 SIMD speedups for different network sizes

The second method employed was the use of cluster computing techniques to
implement a parallel version of the algorithm, using the PVM library for message
passing. The algorithm was parallelised by dividing the training set between the hosts,
and allowing each host to train a network independently on its portion of the data (training set parallelism).
Using this method the networks are not completely independent of each other, as
each network must employ the same set of weights during training. This requires
that the networks periodically exchange weights, which is usually performed at the
end of each epoch, although it is possible to employ a shorter update period.
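
One common way to realise this exchange is for each host to accumulate the weight changes produced by its share of the training set and for these changes to be summed into the shared weight vector at the end of the epoch. The sketch below assumes that scheme; the function name and the host-major layout of the deltas array are illustrative choices and are not taken from the thesis code.

```c
#include <assert.h>
#include <stddef.h>

/*
 * End-of-epoch update for training set parallelism (assumed scheme).
 * deltas holds the accumulated weight changes of every host, stored
 * host-major: deltas[h * n_weights + j] is host h's change to weight j.
 * The changes are summed and applied to the shared weight vector, which
 * would then be broadcast to all hosts for the next epoch.
 */
void apply_epoch_updates(float *weights, const float *deltas,
                         size_t n_hosts, size_t n_weights)
{
    assert(weights != NULL && deltas != NULL);

    for (size_t j = 0; j < n_weights; j++) {
        float sum = 0.0f;
        for (size_t h = 0; h < n_hosts; h++)
            sum += deltas[h * n_weights + j];
        weights[j] += sum;
    }
}
```

Since every host must contribute before the sum is complete, this step is the synchronisation point at which all hosts wait for the slowest one.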

Graph 7-2 shows some speedups obtained for networks using the PVM system. As
can be seen, for small networks the PVM system performs worse than the sequential
version. This is due to the large communication overhead in comparison to the
relatively short calculation periods required for smaller networks. For such networks the
alternative training session parallelism approach might provide better performance.

Graph 7-2 Speedups obtained for PVM system on different network sizes (series: 3, 5, 6, 7, 8, 9 and 12 inputs)

The final approach investigated was the use of high-throughput computing
techniques to train a backpropagation network over a non-dedicated network of shared
machines. This method also employed training set parallelism but the new environment
introduced a number of difficulties that were not present for the PVM system. These
include the need for dynamic load balancing and greater fault tolerance due to the
unreliable nature of a HTC environment.

To overcome these problems the granularity was reduced; slaves were assigned
smaller blocks of data to work on and a caching and scheduling algorithm was
implemented. The algorithm has the following aims:

1. To reduce the idle time where some slaves are waiting for others to finish
before the start of the next epoch.
2. To reduce the number of copies of a block held by each slave.
3. To reduce network load by transmitting as few blocks of data as possible.
4. To reduce the recovery time when a task or host fails.

The algorithm essentially keeps track of which blocks each slave contains, and
when selecting a block for an idle slave to process, tries to assign it a block that it
already has a copy of. This reduces both network traffic and slave waiting times. A
number of tests were performed on the system to see if it stabilised and recovered after
the network was reconfigured.
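
The preference for cached blocks can be sketched as follows. This is a simplified illustration under stated assumptions: the BlockPool and SlaveCache structures and the pick_block() function are hypothetical names, and the fallback rule of taking the first free uncached block is an assumption rather than the policy used in the thesis.

```c
#include <assert.h>

#define MAX_BLOCKS 64

/* Hypothetical view of the blocks still to be processed this epoch. */
typedef struct {
    int n_blocks;
    int is_free[MAX_BLOCKS];  /* 1 if the block is not yet assigned */
} BlockPool;

/* Hypothetical record of which blocks one slave already holds. */
typedef struct {
    int cached[MAX_BLOCKS];   /* 1 if the slave has a copy of the block */
} SlaveCache;

/*
 * Choose a block for an idle slave. A free block that the slave already
 * caches is preferred, so no data needs to be sent. Otherwise the first
 * free block is chosen and *needs_transfer is set to 1.
 * Returns the block id, or -1 when no free block remains.
 */
int pick_block(BlockPool *pool, SlaveCache *slave, int *needs_transfer)
{
    assert(pool != 0 && slave != 0 && needs_transfer != 0);

    int fallback = -1;
    *needs_transfer = 0;

    for (int b = 0; b < pool->n_blocks; b++) {
        if (!pool->is_free[b])
            continue;
        if (slave->cached[b]) {       /* cache hit: nothing to send */
            pool->is_free[b] = 0;
            return b;
        }
        if (fallback < 0)
            fallback = b;             /* remember first uncached block */
    }

    if (fallback >= 0) {              /* cache miss: block must be sent */
        pool->is_free[fallback] = 0;
        slave->cached[fallback] = 1;
        *needs_transfer = 1;
    }
    return fallback;
}
```

Under this scheme a block is transmitted only when the slave holds none of the remaining free blocks, which is how the caching reduces both network load and waiting time.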
When the network configuration was held static the system was found to stabilise,
with successive epochs recording times within a similar range; and after the removal or
addition of hosts the system quickly settled within the range of the new configuration
(Graph 7-3).
Graph 7-3 HTC system adjusting to a change in network configuration (panel title: System Adjusting to Gain of Hosts; x-axis: Epoch; series: 8 Hosts, 10 Hosts, 8 Hosts Gain 2)

This thesis demonstrates that exploiting parallel processing can reduce the training
times for backpropagation training. The findings of this research are summarised below.
Each of the three methods examined showed some benefit from the parallelisation
process. Furthermore the three methods are not mutually exclusive: provided that the
machines used in the cluster (both HPC and HTC) support SIMD operations, each slave
task could utilise these instructions to reduce the time required for it to process a block
of data. It is also possible for a dedicated HPC cluster to utilise additional resources
provided via the HTC system to reduce the workload of each dedicated host.

The main findings of this thesis are as follows:

• Artificial neural networks are biologically inspired computational methods that
  have the ability to approximate discrete, real and vector valued target functions,
  see literature survey Chapter 2.

• A major limiting factor of the backpropagation algorithm is the length of time
  required to train a network, see literature survey Chapter 2.

• This time can be reduced by exploiting parallel processing techniques, see
  literature survey Section 3.9.

• One such technique investigated is the use of SIMD processing within a desktop
  computer, such as SSE on Intel processors, see implementation Section 4.5.

• There are four implementation methods for writing SSE enabled code:
  Assembly, Intrinsics, Classes and Automatic Vectorisation, see implementation
  Section 4.4.

• Assembly language routines require additional hand optimisation in order to
  yield similar performance levels to compiler-optimised code. This is due to the
  complex micro-architecture of modern microcomputers, see implementation
  Section 4.4.1.

• Automatic vectorisation of serial C code may miss opportunities to parallelise
  certain loops due to the problems of dependency analysis, see results Section
  4.5.2.

• Of the four implementation methods for SSE enabled code, Intrinsics offers the
  best compromise between ease of use, performance and portability, see results
  Section 4.5.2.

• The Intel C++ compiler generates more optimal code than the Microsoft Visual
  C++ compiler, see results Section 4.5.2.

• It may be possible to use the faster but less accurate reciprocal instruction
  instead of divide, especially in the earlier stages of training, see results Section
  4.5.2.

• The use of padding has a negative effect on speedup in comparison to non-padded
  networks, see results Section 4.6.

• Large networks may be trained in parallel using cluster-computing techniques
  by applying training set parallelism, see literature survey Chapter 5.

• If the network is not of sufficient size it is best to apply training session
  parallelism, see results Section 5.6.3.6.

• It is possible to overcome the limitations of a HTC environment for
  backpropagation training by employing a simple scheduling and caching
  algorithm, see implementation Section 6.3.

Of these findings the main contributions to the research area made by the author are:

• An investigation of the different SIMD coding techniques for neural network
  simulation. A number of techniques were examined (assembly, intrinsics, object
  orientated classes and compiler based implementation), and the most suitable method
  was determined by experiment, as described in Chapter 4.

• The development of a backpropagation algorithm suited to operation in a non-dedicated
  HTC environment. This involved the development of scheduling and
  maintenance algorithms that increase throughput by caching blocks of
  data locally on remote hosts for later use by the same slave, as presented in Chapter
  6.

8 References

[1] G. Dorffner, "Internal Report for NEuroNet", 1999, Weblink
http://www.kcl.ac.uk/neuronet

[2] Jiri Sima and Pekka Orponen, "General-purpose computation with neural
networks: A survey of complexity theoretic results", Neural Computation,
Vol. 15, No. 12, pp. 2727-2778, 2003.

[3] D. E. Rumelhart, J. L. McClelland and the PDP research group, "Parallel
Distributed Processing", Vol. 1, Chapter 8, pp. 322-328, MIT Press, 1986.

[4] Jondar Gibb, "Back Propagation Family Album", Technical Report C/TR96-05,
Department of Computing, Macquarie University, Australia, August 1996.

[5] Douglas Aberdeen, Jonathan Baxter, and R. Edwards, "92¢/MFlop/s, Ultra-Large-Scale
Neural-Network training on a PIII cluster", Proceedings of
Super Computing 2000, Dallas, TX, November 2000. SC2000 CDROM.

[6] N. Sundararajan and P. Saratchandran, "Parallel Architectures for Artificial
Neural Networks Paradigms and Implementations", Chapter 1, IEEE
Computer Society Press, 1998.

[7] Top 500 Supercomputers List November 2003, Weblink
http://www.top500.org/list/2003/11/

[8] Rajkumar Buyya, Editor, "High Performance Cluster Computing: Volume 2
Programming and Applications", Chapter 1, Prentice Hall, 1999.

[9] Rajkumar Buyya, Editor, "High Performance Cluster Computing: Volume 2
Architectures and Systems", Chapter 5, Prentice Hall, 1999.

[10] Alexey L. Lastovetsky, "Parallel Computing on Heterogeneous Networks",
Wiley-Interscience, 2003.

[11] Les Robertson, "What is High Throughput Distributed Computing?", CERN
Computing Summer School 2001, Weblink
http://les.home.cern.ch/les/csc01/lectures.ppt

[12] D. O. Hebb, "The Organization of Behavior", Wiley: New York, 1949.

[13] D. A. Medler, "A Brief History of Connectionism", Neural Computing
Survey 1, pp. 61-101, 1998.

[14] Kevin Gurney, "An Introduction to Neural Networks", Chapter 7, CRC
Press, 1997.

[15] Mikael Bodén, "A Guide to recurrent neural networks and
backpropagation", 2001, Weblink
http://www.itee.uq.edu.au/~mikael/papers/rn_dallas.pdf

[16] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal
representations by back-propagating errors", Nature, Vol. 323, pp. 533-536,
1986.

[17] F. Rosenblatt, "The perceptron: A probabilistic model for information
storage and organization in the brain", Psychological Review, Vol. 65, pp.
386-408, 1958.

[18] J. J. Hopfield, "Neural Networks and physical systems with emergent
collective computational abilities", Proceedings of the National Academy of
Sciences.

[19] M. Minsky and S. Papert, "Perceptrons: An Introduction to Computational
Geometry", MIT Press, 1969.

[20] Philip D. Wasserman, "Neural Computing Theory and Practice", Van
Nostrand Reinhold, 1989.

[21] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward
networks are universal approximators", Neural Networks, Vol. 2, No. 5, pp.
359-366, 1989.

[22] E. J. Hartman, J. D. Keeler, and J. M. Kowalski, "Layered neural networks
with gaussian hidden units as universal approximations", Neural
Computation, Vol. 2, No. 2, pp. 210-215, 1990.

[23] W. Duch and N. Jankowski, "Survey of Neural Transfer Functions", Neural
Computing Survey 2, pp. 163-212, 1999.

[24] Jim Torresen, Shinji Tomita and Olav Landsverk, "The relation of Weight
Update Frequency to Convergence of BP", World Conference on Neural
Networks, 1995.

[25] H. Drucker and Y. Le Cun, "Improving generalization performance using
double backpropagation", IEEE Transactions on Neural Networks, Vol. 3, No.
6, pp. 991-997, 1992.
[26] A. Waibel, "Consonant recognition by modular construction of large
phonetic time-delay neural networks", Neural Network, Vol. 10, No. 2, pp.
243-256, 1997.

[27] Y. Le Cun, et al., "Backpropagation applied to handwritten zip code
recognition", Neural Computation, Vol. 1, No. 4, pp. 541-551, 1989.

[28] M. J. Flynn, "Very high-speed computing systems", Proceedings of the
IEEE, Vol. 5, No. 12, pp. 1901-1909, December 1966.

[29] The Open Group Base Specifications Issue 6, Weblink
http://www.opengroup.org/onlinepubs/009695399/

[30] Official OpenMP Specifications, Weblink http://www.openmp.org/specs/

[31] Seyed H. Roosta, "Parallel Processing and Parallel Algorithms - Theory and
Computation", Springer-Verlag, Chapter 2, pp. 66-82, 1999.

[32] The Computational Science Education Project, "Computer Architectures",
Weblink http://csep1.phy.ornl.gov/ca/ca.html

[33] Charles E. Leiserson, "Fat-Trees: Universal Networks for Hardware
Efficient Supercomputing", IEEE Transactions on Computers, C-34(10):892-901, October 1985.

[34] "MPI-2: Extensions to the Message-Passing Interface", Message Passing
Interface Forum, 1997, Weblink http://www-unix.mcs.anl.gov/mpi/

[35] Al Geist, Adam Beguelin, Jack Dongarra, "PVM: Parallel Virtual Machine:
A Users' Guide and Tutorial for Network Parallel Computing (Scientific and
Engineering Computation)", MIT Press, 1994.

[36] The ScaLAPACK Project, Weblink http://www.netlib.org/scalapack/

[37] Portable, Extensible Toolkit for Scientific Computation, Weblink
http://www-unix.mcs.anl.gov/petsc/petsc-2/

[38] M. Aldinucci and M. Danelutto, "An operational semantics for skeletons",
Weblink
http://citeseer.ist.psu.edu/cis?q=An+operational+semantics+for+skeletons&cs=1, 2002.

[39] Mark A. Baker, Geoffrey C. Fox and Hon W. Yau, "A Review of
Commercial and Research Cluster Management Software", 1996.

[40] Condor Team, "Condor Version 6.6.5 Manual", University of Wisconsin-Madison,
2003, Weblink http://www.cs.wisc.edu/condor

[41] S. Ahuja, N. Carriero, and D. Gelernter, "Linda and Friends", IEEE
Computers, Vol. 19(8), pp. 26-34, August 1986.

[42] N. Carriero and D. Gelernter, "Technical Correspondence on Linda in
Context", Communications of the ACM, Vol. 32(10), pp. 1255-1258,
October 1989.

[43] NetSolve Homepage, Weblink http://icl.cs.utk.edu/netsolve/index.html

[44] Globus Homepage, Weblink http://www.globus.org/

[45] G. M. Amdahl, "Validity of the single-processor approach to achieving
large scale computing capabilities", Proceedings of AFIPS Conference,
1967, pp. 483-485.

[46] David O Neal, "On Microprocessors, memory hierarchies and Amdahl's
law", Proceedings of the DoD HPCMP Users Group Conference, Monterey,
CA, June 1999.

[47] J. Gustafson, "The Scaled-Sized Model: A Revision of Amdahl's Law",
Proceedings of the Third International Conference on Supercomputing, May
1988.

[48] Y. Shi, "Reevaluating Amdahl's Law and Gustafson's Law", Computer and
Information Sciences Department, Temple University, Weblink
http://joda.cis.temple.edu/~shi/docs/amdahl/amdahl.html, October 1996.

[49] N. Sundararajan and P. Saratchandran, "Parallel Architectures for Artificial
Neural Networks Paradigms and Implementations", Chapter 2, IEEE
Computer Society Press, 1998.

[50] Roldan Pozo, "Draft Proposal for Java BLAS Interface", Weblink
http://math.nist.gov/javanumerics/blast.html

[51] Millind Mittal, Alex Peleg, and Uri Weiser, "MMX Technology
Architecture Overview", Intel Technology Journal, 3rd Quarter, 1997.

[52] Shreekant Thakkar, Tom Huff, "The Internet Streaming SIMD Extensions",
Intel Technology Journal, 2nd Quarter, 1999.

[53] James Abel, Kumar Balasubramanian, Mike Bargeron, Tom Craver, Mike
Phlipot, "Applications Tuning for Streaming SIMD Extensions", Intel
Technology Journal, 2nd Quarter, 1999.

[54] N. Sundararajan and P. Saratchandran, "Parallel Architectures for Artificial
Neural Networks Paradigms and Implementations", Chapter 10, IEEE
Computer Society Press, 1998.

[55] "Using the RDTSC Instruction for Performance Monitoring", 1998.

[56] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal
representations by error propagation", Nature, Vol. 323, pp. 533-536, 1986.

[57] Y. Le Cun, et al., "Backpropagation applied to handwritten zip code
recognition", Neural Computation, Vol. 1, No. 4, pp. 541-551, 1989.

[58] "MPI: A Message-Passing Interface Standard", Message Passing Interface
Forum, 1994, Weblink http://www-unix.mcs.anl.gov/mpi/

[59] William Gropp, Ewing Lusk, "PVM and MPI are Completely Different"

[60] G. A. Geist, J. A. Kohl, P. M. Papadopoulos, "PVM and MPI: a Comparison
of Features"

[61] D. H. J. Epema, M. Livny, R. van Dantzig, X. Evers, and J. Pruyne, "A
Worldwide Flock of Condors: Load Sharing among Workstation Clusters",
Journal on Future Generations of Computer Systems, Vol. 12, 1996.

[62] The Condor-PVM Homepage, Weblink http://www.cs.wisc.edu/condor/pvm/

[63] MW Overview, Weblink http://www.cs.wisc.edu/condor/mw/overview.html

[64] IA-32 Intel Architecture Software Developers Manual Volume 1: Basic
Architecture, Weblink http://www.intel.com

[65] IA-32 Intel Architecture Software Developers Manual Volume 2A:
Instruction Set Reference, A-M, Weblink http://www.intel.com

[66] IA-32 Intel Architecture Software Developers Manual Volume 2A:
Instruction Set Reference, N-Z, Weblink http://www.intel.com

[67] The IA-32 Intel Architecture Software Developer's Manual, Volume 3:
System Programming Guide, Weblink http://www.intel.com

Appendices

Appendix A

Intel's SIMD Instruction Set

Thus far Intel has introduced four SIMD extensions, each with its own
instruction set. These extensions are fully documented in the different volumes of the
Intel Software Developers Manual [64], [65], [66], [67]. Although many of the
instructions in these sets perform practically identical functions (except on different
data types and sizes), they are documented by extension and are thus dispersed throughout
the manuals, making it difficult to get a clear picture of Intel's SIMD capabilities.
This situation is compounded by the fact that later extensions expand the range of
some earlier instructions, while at the same time adding new instructions that
logically belong to a previous extension. For this reason it was deemed useful to
present a summary of Intel's full SIMD instruction set in one document. Further
information on any of the topics or instructions discussed may be found in the Intel manuals.

A.1 SIMD Registers and Data-Types
For reference the SIMD registers and data types are briefly summarised in
Table A.1. Note that while neither SSE nor SSE 2 extends the range of data-types that
can be stored in the MM registers, they both introduce new instructions to
operate on the existing data-types. The situation is similar with SSE 3, which adds new
instructions that operate on existing 128-bit XMM data-types.

Extension  New Packed        Register  Bits  Size of Packed  No. of Packed  Year
           Data-Type                         Element         Elements       Introduced
---------  ----------------  --------  ----  --------------  -------------  ----------
MMX        Integer           MM        64    Byte            8              1997
                                             Word            4
                                             DoubleWord      2
SSE        Single-Precision  XMM       128   DoubleWord      4              1999
SSE 2      Integer           XMM       128   Byte            16             2000
                                             Word            8
                                             DoubleWord      4
                                             QuadWord        2
           Double-Precision                  QuadWord        2
SSE 3      None              N/A       N/A   N/A             N/A            2004²⁰

Table A.1 SIMD Registers and Packed Data-Types

²⁰ Note: SSE 3 was introduced too late to be incorporated into the project; it is presented here just for
completeness.

A.2 Types of Instructions
Not all instructions in the SIMD set can be described as truly packed
instructions. A number of instructions perform auxiliary functions and do not operate
on any registers or data; still more treat blocks of data as single entities and are
unaffected by their packed interpretation. Only the remaining instructions, which
operate on packed data and interpret it as such, can be considered SIMD instructions.

Of these a distinction can be made between some of the 128-bit XMM
registers instructions^' as they can be classed as either packed or scalar. Figure A.l.

Packed Instructions
Packed operations are performed on all the values in the registers
simultaneously and can be regarded as SIMD instructions. Note that
not all packed instructions are true SIMD instructions; SSE 3
introduces two instructions that perform different operations on odd
and even packed elements.
Scalar Instructions
Scalar operations are performed on the single value stored in
the bottom element of the register, and are serial instructions. Note
that scalar operations can only be performed on floating-point data²².
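
The difference between the two classes can be sketched in portable C. This is only a scalar model of the behaviour described above, not the instructions themselves; the type vec4f and the function names are hypothetical, and the pass-through of the upper elements in the scalar case follows the behaviour of ADDSS.

```c
#include <stddef.h>

/* Four single-precision values, modelling one 128-bit XMM register
   (hypothetical type, element 0 is the bottom element). */
typedef struct { float e[4]; } vec4f;

/* Packed add, in the manner of ADDPS: every element pair is added. */
vec4f packed_add(vec4f a, vec4f b) {
    vec4f c;
    for (size_t i = 0; i < 4; ++i)
        c.e[i] = a.e[i] + b.e[i];
    return c;
}

/* Scalar add, in the manner of ADDSS: only the bottom element is added;
   the upper three elements are passed through from the first operand. */
vec4f scalar_add(vec4f a, vec4f b) {
    vec4f c = a;
    c.e[0] = a.e[0] + b.e[0];
    return c;
}
```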

Figure A.1 Packed and scalar SIMD Instructions: a) Packed SIMD operation, b) Scalar SIMD operation

²¹ This distinction is not applicable to all types of instructions.
²² With the exception of conversion instructions.

A.3 Arithmetic Instructions
The SIMD instruction set includes a number of basic integer and floating-point
operations, listed in Table A.3 and Table A.4 respectively. The relationship between
arithmetic instructions, registers and each extension is shown in Table A.2. As can be
seen, SSE and SSE 2 increased the number of integer operations on the MM registers.
However, care should be taken with their use, as they require the later extension and
will not work on MMX-only machines. The same is true of SSE 3 and single- and
double-precision values in the XMM registers.

Register   MMX        SSE                SSE 2               SSE 3
MM         Integers   Integers           Integers
XMM                   Single-precision   Integers            Single-precision
                                         Double-precision    Double-precision

Table A.2 Relationship between SIMD extensions and Registers for Arithmetic Instructions

The majority of arithmetic instructions perform standard packed or scalar
operations as outlined in Figure A.1. There are however a few notable exceptions,
mostly regarding integer arithmetic, these being:

• the special sum of absolute difference instruction, PSADBW,
• integer addition and subtraction,
• multiplication,
• the ADDSUB* instructions,
• horizontal arithmetic instructions.

A.3.1 Sum of Absolute Difference

This instruction performs a packed absolute difference operation on 8 integer
bytes and returns the sum of these as a single value. This value is stored in the lower
part of the destination register, Figure A.2. Note that the relationship between the
128-bit and 64-bit versions of this instruction is not as straightforward as for other
instructions. Instead of computing the sum of absolute differences of 16 bytes or 8
words, the 128-bit version returns two values, one for each of the low and high 64 bits
of the register.
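
As an illustration, the operation can be modelled in scalar C as follows. This is a sketch of the behaviour described above rather than of the hardware; the function names sad8 and sad16 are hypothetical, and the bytes are treated as unsigned, as PSADBW does.

```c
#include <stdint.h>

/* Scalar model of the 64-bit form: the absolute differences of eight
   unsigned byte pairs are accumulated into a single value. */
uint16_t sad8(const uint8_t a[8], const uint8_t b[8]) {
    uint16_t sum = 0;
    for (int i = 0; i < 8; ++i)
        sum = (uint16_t)(sum + (a[i] > b[i] ? a[i] - b[i] : b[i] - a[i]));
    return sum;
}

/* Model of the 128-bit form: two independent sums, one for the low
   and one for the high 64 bits of the register. */
void sad16(const uint8_t a[16], const uint8_t b[16], uint16_t out[2]) {
    out[0] = sad8(a, b);         /* low 64 bits  */
    out[1] = sad8(a + 8, b + 8); /* high 64 bits */
}
```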

    Source 1          4    8    7    6    3    1    2    4
    Source 2          6    5    3    7    6    2    4    3
    Difference       -2    3    4   -1   -3   -1   -2    1
    Absolute value    2    3    4    1    3    1    2    1
    Sum              17

Figure A.2 Sum of Absolute Difference Instruction

A.3.2 Integer Addition and Subtraction

When adding or subtracting two integers, overflow occurs if the result of the
operation is too large to be stored in the destination; similarly, underflow occurs when
the result is too small to be stored. The standard method of dealing with both of these
events is to flag the occurrence and return as much of the result as can be stored in the
destination. It is up to the application to detect the event and correct the result if
required.

SIMD integer instructions do not set flags, so there is no way to determine if
overflow or underflow has occurred. Two types of addition and subtraction instructions
are provided, allowing the programmer a choice of how to deal with both of these events.

Wraparound Arithmetic
Truncates and stores the result, ignoring any bits that will not fit in the
destination element. The stored value is the true result modulo 2^n for an n-bit
element, and so may bear no obvious relation to the expected result.

Saturation Arithmetic
Offers a fast and effective alternative by returning the smallest
representable value when it detects underflow, and the largest for overflow.
As this value depends on the interpretation of the data, there are both signed and
unsigned versions of each instruction.
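
The two behaviours can be compared on a single byte element with the following sketch. It is a scalar model only (the packed instructions apply the same rule to every element), and the function names are hypothetical.

```c
#include <stdint.h>

/* Wraparound addition on one signed byte, as PADDB does per element:
   only the low 8 bits of the true sum are kept. */
int8_t add_wrap_s8(int8_t a, int8_t b) {
    int s = a + b;              /* exact sum in a wider type */
    if (s > INT8_MAX) s -= 256; /* discard the bits that do not fit */
    if (s < INT8_MIN) s += 256;
    return (int8_t)s;
}

/* Signed saturation, as PADDSB does per element: clamp to [-128, 127]. */
int8_t add_sat_s8(int8_t a, int8_t b) {
    int s = a + b;
    if (s > INT8_MAX) return INT8_MAX;
    if (s < INT8_MIN) return INT8_MIN;
    return (int8_t)s;
}

/* Unsigned saturation, as PADDUSB does per element: clamp to [0, 255]. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    unsigned s = (unsigned)a + b;
    return s > UINT8_MAX ? (uint8_t)UINT8_MAX : (uint8_t)s;
}
```

Adding 100 to 100 as signed bytes therefore gives -56 with wraparound but 127 with signed saturation.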

A.3.3 Multiplication

A problem with integer multiplication is that the result requires twice as many
bits as each of the two source operands. This poses a problem when the values being
multiplied are in packed data-types and the results must be repacked after the
operation. There are three types of multiplication instructions, each offering a
different solution to the problem.

• Multiply High \ Low
Multiplies all packed elements but only stores the high or low part of
the result for each element.

• Multiply Overwrite
Multiplies all odd elements (numbering the elements from 1, as in Figure A.3)
and stores the full result. Even elements are effectively overwritten in the
destination register.

• Multiply and Add
Multiplies all elements and adds every pair of adjacent results; these paired
sums are stored, effectively combining every second data element.
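
Two of these strategies can be sketched in scalar C on 16-bit words. The function names are hypothetical; the high and low halves are returned as raw bit patterns, and the multiply-and-add model ignores the single overflowing case in which all four inputs of a pair are -32768.

```c
#include <stdint.h>

/* Multiply Low, in the manner of PMULLW: keep the low 16 bits of the
   32-bit product of one word pair. */
uint16_t mul_low16(int16_t a, int16_t b) {
    return (uint16_t)((uint32_t)((int32_t)a * b) & 0xFFFFu);
}

/* Multiply High, in the manner of PMULHW: keep the high 16 bits of the
   32-bit product (returned here as a bit pattern). */
uint16_t mul_high16(int16_t a, int16_t b) {
    return (uint16_t)((uint32_t)((int32_t)a * b) >> 16);
}

/* Multiply and Add, in the manner of PMADDWD on four word pairs:
   adjacent 32-bit products are summed, giving two 32-bit results. */
void mul_add4(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    out[0] = (int32_t)a[0] * b[0] + (int32_t)a[1] * b[1];
    out[1] = (int32_t)a[2] * b[2] + (int32_t)a[3] * b[3];
}
```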

Figure A.3 SIMD Integer Multiply Operations: a) Multiply and Add, b) High \ Low Multiply,
c) Multiply Overwrite

A.3.4 ADDSUB* instructions
While the ADDSUB* instructions operate on all the packed elements in the
XMM registers they are not true SIMD instructions. The SIMD model applies the
same operation to all data elements, whereas these instructions add the corresponding
odd elements of two packed registers, while subtracting the even elements.
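
A scalar sketch of the single-precision form, with elements numbered from 0 at the bottom of the register, is given below. The function name addsub4 is hypothetical.

```c
/* Scalar model of ADDSUBPS on four single-precision elements:
   even-indexed elements are subtracted, odd-indexed elements are added. */
void addsub4(const float a[4], const float b[4], float out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = (i % 2 == 0) ? a[i] - b[i] : a[i] + b[i];
}
```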

Figure A.4 ADDSUB* Instructions (result elements, from high to low: a3 + b3, a2 - b2,
a1 + b1, a0 - b0)

A.3.5 Horizontal Arithmetic Instructions

The horizontal arithmetic instructions are SIMD instructions in that the same
operation is applied to all data elements. However, unlike most packed instructions,
which operate vertically between corresponding elements of different registers, these
instructions operate on pairs of adjacent elements within a register, see Figure A.5.
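
The horizontal add can be modelled as follows. The sketch assumes the element ordering of HADDPS, in which the sums from the first operand fill the low half of the result and the sums from the second operand fill the high half; the function name hadd4 is hypothetical.

```c
/* Scalar model of HADDPS: adjacent pairs within each operand are summed.
   Element 0 is the bottom element of the register. */
void hadd4(const float a[4], const float b[4], float out[4]) {
    out[0] = a[0] + a[1];
    out[1] = a[2] + a[3];
    out[2] = b[0] + b[1];
    out[3] = b[2] + b[3];
}
```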

Figure A.5 Horizontal Arithmetic Instruction

                             Wraparound             Signed            Unsigned           Introduced
                                                    Saturation        Saturation
Addition                     PADDB, PADDW, PADDD    PADDSB, PADDSW    PADDUSB, PADDUSW   MMX*
                             PADDQ                                                       SSE 2
Subtraction                  PSUBB, PSUBW, PSUBD    PSUBSB, PSUBSW    PSUBUSB, PSUBUSW   MMX
                             PSUBQ                                                       SSE 2

                             Signed                 Unsigned          Introduced
Multiply Low \ High          PMULLW, PMULHW                           MMX
                                                    PMULHUW           SSE
Multiply Overwrite                                  PMULUDQ           SSE 2
Multiply and Add             PMADDWD                                  MMX
Average                                             PAVGB, PAVGW      SSE
Maximum                      PMAXSW                 PMAXUB            SSE
Minimum                      PMINSW                 PMINUB            SSE
Sum of Absolute Difference                          PSADBW            SSE

*All 64-bit MM register arithmetic instructions introduced by the MMX and SSE extensions were
extended to 128-bit XMM operations by SSE 2
Table A.3 SIMD Integer Instructions

                    SSE                   SSE 2               SSE 3
                    Packed     Scalar     Packed    Scalar    ADDSUB*     Horizontal
Addition            ADDPS      ADDSS      ADDPD     ADDSD     ADDSUBPS    HADDPS
                                                              ADDSUBPD    HADDPD
Subtraction         SUBPS      SUBSS      SUBPD     SUBSD                 HSUBPS
                                                                          HSUBPD
Multiplication      MULPS      MULSS      MULPD     MULSD
Division            DIVPS      DIVSS      DIVPD     DIVSD
Square Root         SQRTPS     SQRTSS     SQRTPD    SQRTSD
Reciprocal*         RCPPS      RCPSS
Reciprocal of       RSQRTPS    RSQRTSS
Square Root*
Maximum             MAXPS      MAXSS      MAXPD     MAXSD
Minimum             MINPS      MINSS      MINPD     MINSD

*Uses an approximation.
Table A.4 SIMD Floating-Point Arithmetic Instructions

A.4 Create Mask of Packed Data

The create mask of packed data instructions work by selecting the most
significant bit of each packed element in the source register and returning them as a
single value. This value may later be examined to perform branches and other
conditional operations based on the packed values.
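
For the single-precision case the operation can be sketched as below. This is a scalar model of MOVMSKPS under the assumption that a float is a 32-bit IEEE 754 value; the function name movemask4 is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of MOVMSKPS: bit i of the result is the most significant
   (sign) bit of element i. */
unsigned movemask4(const float v[4]) {
    unsigned mask = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits); /* reinterpret the bit pattern */
        mask |= (unsigned)(bits >> 31) << i;
    }
    return mask;
}
```

A result of zero then means that no element has its most significant bit set, which can be tested with a single ordinary branch.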
Figure A.6 Create Mask Operation

                   Instruction   Registers   Extension
Integer            PMOVMSKB      MM, XMM     SSE \ SSE 2
Single-Precision   MOVMSKPS      XMM         SSE
Double-Precision   MOVMSKPD      XMM         SSE 2

Table A.5 Create Mask of Packed Data Instructions

A.5 Shift

Shift instructions perform packed and register-wide shift operations on the
contents of the MM or XMM registers. They are designed for use on integer data
types only, although no type checking is performed to enforce this. It is not possible
to shift individual packed elements by different amounts. There are no scalar
versions of the shift instructions.

                   MMX \ SSE 2                        SSE 2
                   Word     DoubleWord   QuadWord     Double QuadWord
Left Logical       PSLLW    PSLLD        PSLLQ        PSLLDQ
Right Logical      PSRLW    PSRLD        PSRLQ        PSRLDQ
Right Arithmetic   PSRAW    PSRAD

Table A.6 Shift Instructions
A.6 Logical

Logical instructions perform a bitwise logical function on the entire contents
of two packed registers. Although they are not, strictly speaking, packed instructions,
they can be used to perform packed operations through the use of bit masks, see
Figure A.7.
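
The negate operation of Figure A.7 can be sketched in scalar C as follows. The sketch assumes 32-bit IEEE 754 floats, where the sign is the most significant bit; the function name negate4 is hypothetical and the loop stands in for a single XORPS with a mask register.

```c
#include <stdint.h>
#include <string.h>

/* Negate four floats by XOR-ing each with a mask in which only the
   sign bit is set. */
void negate4(float v[4]) {
    const uint32_t sign_mask = 0x80000000u;
    for (int i = 0; i < 4; ++i) {
        uint32_t bits;
        memcpy(&bits, &v[i], sizeof bits); /* float -> bit pattern */
        bits ^= sign_mask;                 /* flip the sign bit only */
        memcpy(&v[i], &bits, sizeof bits); /* bit pattern -> float */
    }
}
```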

Figure A.7 Packed Logical Operations: a) Single-Precision Negate, b) Double-Precision Negate
(each element is XORed with a mask in which only the sign bit is set)

          MMX      SSE       SSE 2
AND       PAND     ANDPS     ANDPD
OR        POR      ORPS      ORPD
XOR       PXOR     XORPS     XORPD
AND NOT   PANDN    ANDNPS    ANDNPD

Table A.7 Logical Instructions

Note that usage of the P* MMX instructions has been extended to 128-bit
operations and as a result all three versions of each function perform identical
operations and can be used on any 128-bit data-type.

A.7 Compare

Packed comparison operations compare corresponding elements of two packed
data-types and return a True or False value for each element compared, Figure A.8.
True \ False values take the form of bit-masks, with a mask of all 0's indicating False
and a mask of all 1's indicating True. If required, it is possible to condense the results
of all packed comparisons by using one of the create mask instructions of Section A.4.
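
The following sketch models a greater-than comparison on four signed words and shows one common use of the resulting masks, a branch-free element-wise maximum built from AND, AND NOT and OR. It is a scalar illustration only; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of PCMPGTW on four signed words: each result element is
   all 1's for True and all 0's for False. */
void cmpgt4(const int16_t a[4], const int16_t b[4], uint16_t mask[4]) {
    for (int i = 0; i < 4; ++i)
        mask[i] = (a[i] > b[i]) ? 0xFFFFu : 0x0000u;
}

/* Element-wise maximum selected with the masks: (a AND m) OR (b AND NOT m). */
void max4(const int16_t a[4], const int16_t b[4], int16_t out[4]) {
    uint16_t m[4];
    cmpgt4(a, b, m);
    for (int i = 0; i < 4; ++i) {
        uint16_t r = (uint16_t)(((uint16_t)a[i] & m[i]) |
                                ((uint16_t)b[i] & (uint16_t)~m[i]));
        memcpy(&out[i], &r, sizeof r); /* store the selected bit pattern */
    }
}
```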

Figure A.8 SIMD Compare Operation

               MMX \ SSE 2
               Byte       Word       DoubleWord
Greater Than   PCMPGTB    PCMPGTW    PCMPGTD
Equal To       PCMPEQB    PCMPEQW    PCMPEQD

Table A.8 Integer Compare Instructions

Floating-point comparison instructions accept an immediate as a third
operand that is used to control which comparison is applied to the data. The following
comparisons are implemented in hardware:

• Equal To
• Less Than
• Less Than or Equal To
• Unordered
• Not Equal To
• Not Less Than
• Not Less Than or Equal To
• Ordered

Note that greater than is not included in the list. All greater-than type
comparisons are implemented in software, by swapping the source and destination
operands and performing the appropriate less-than operation. They are therefore less
efficient to implement.
Flags are not affected by most of these comparisons, the obvious exceptions
being the instructions that are specifically designed to set the flags of the EFLAGS
register. These instructions compare the single lower values only and set flags as would
a standard serial compare.

                           Packed   Scalar   Set Flags*
Single-Precision (SSE)     CMPPS    CMPSS    COMISS, UCOMISS
Double-Precision (SSE 2)   CMPPD    CMPSD    COMISD, UCOMISD

*COMIS instructions generate an exception if the source is a QNaN; UCOMIS instructions do not.
Table A.8 Floating-Point Compare Instructions

A.8 Unpack

Unpack instructions can be used to combine the elements of two matching
packed data-types. Each instruction can be classed as either high or low: high unpack
instructions interleave elements from the top half of each register to produce a result,
while low unpack instructions perform the same operation but use the low half of each
register instead, Figure A.9. Note that although Intel classifies unpack instructions
according to the data-types they operate on (integer, single- or double-precision), each
instruction merely moves fixed-size blocks of data in a predetermined way and is
unaffected by the underlying values or interpretation of the data.
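
The interleaving can be modelled on four 16-bit words as below. The sketch follows the ordering of PUNPCKLWD and PUNPCKHWD on MM registers, with the first operand supplying the lower element of each interleaved pair; the function names are hypothetical.

```c
#include <stdint.h>

/* Low unpack: interleave the low halves of a and b. */
void unpack_low4(const uint16_t a[4], const uint16_t b[4], uint16_t out[4]) {
    out[0] = a[0]; out[1] = b[0];
    out[2] = a[1]; out[3] = b[1];
}

/* High unpack: interleave the high halves of a and b. */
void unpack_high4(const uint16_t a[4], const uint16_t b[4], uint16_t out[4]) {
    out[0] = a[2]; out[1] = b[2];
    out[2] = a[3]; out[3] = b[3];
}
```

Unpacking against a register of zeros is a common way of widening unsigned elements, since each original element is then followed by a zero upper half.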

Figure A.9 Unpack Operations

             High          Low           Extension
Byte         PUNPCKHBW     PUNPCKLBW     MMX \ SSE 2
Word         PUNPCKHWD     PUNPCKLWD     MMX \ SSE 2
DoubleWord   PUNPCKHDQ     PUNPCKLDQ     MMX \ SSE 2
             UNPCKHPS      UNPCKLPS      SSE
QuadWord     PUNPCKHQDQ    PUNPCKLQDQ    SSE 2
             UNPCKHPD      UNPCKLPD      SSE 2

Table A.9 Unpack Instructions

A.9 Shuffle

Shuffle instructions are similar to unpack instructions in that they also rearrange
packed elements. The major difference between the two types of instructions is that
shuffle instructions allow the programmer to choose which elements are rearranged,
whereas with unpack instructions this is fixed. However, this choice is somewhat
restricted and not all permutations of the packed elements in the two registers are
possible.

There are two main types of shuffle instructions:

• Insert
These instructions completely overwrite the destination register with
elements from the source. A sub-set of these are the PSHUFHW and
PSHUFLW instructions, which perform a standard insert shuffle but on the high or
low QuadWord of the XMM register only.

• Mix
These instructions mix the elements of two registers, but confine the
selected elements from each register to either the high or low half of the result.
Within this half any permutation of that register's elements is possible.
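
Both types can be sketched in scalar C. The models follow the selector encoding of PSHUFD and SHUFPS, in which an 8-bit immediate holds four 2-bit element indices, lowest result element first; the function names are hypothetical and the output array must not overlap the inputs.

```c
#include <stdint.h>

/* Insert shuffle, in the manner of PSHUFD: every result element is
   chosen freely from the four elements of the single source. */
void shuffle_insert4(const uint32_t src[4], uint8_t sel, uint32_t out[4]) {
    for (int i = 0; i < 4; ++i)
        out[i] = src[(sel >> (2 * i)) & 3];
}

/* Mix shuffle, in the manner of SHUFPS: the low half of the result is
   chosen from a, the high half from b. */
void shuffle_mix4(const float a[4], const float b[4], uint8_t sel, float out[4]) {
    out[0] = a[sel & 3];
    out[1] = a[(sel >> 2) & 3];
    out[2] = b[(sel >> 4) & 3];
    out[3] = b[(sel >> 6) & 3];
}
```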

Type                Element      Instruction         Register   Extension
Insert              Word         PSHUFW*             MM         SSE
Insert              DoubleWord   PSHUFD              XMM        SSE 2
Insert High \ Low   Word         PSHUFHW, PSHUFLW    XMM        SSE 2
Mix                 DoubleWord   SHUFPS              XMM        SSE
Mix                 QuadWord     SHUFPD              XMM        SSE 2

*Not extended by SSE 2
Table A.10 Shuffle Instructions

Figure A.10 Shuffle Operations: a) Insert Shuffle, b) High \ Low Insert Shuffle, c) Mix Shuffle

A.10 Pack

Pack instructions also combine the elements of two matching packed data-types.
But in this case all the elements of each data-type are represented in the result; this is
achieved by narrowing each element to half its size through the use of saturation. As
saturation is only applicable to integers, the pack operations are defined for integer
(signed \ unsigned) data types only.
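
A scalar sketch of the signed word-to-byte case is given below. It follows the ordering of PACKSSWB on MM registers, where the first operand fills the low half of the result; the function names are hypothetical.

```c
#include <stdint.h>

/* Narrow one signed word to a signed byte using saturation. */
static int8_t sat_s8(int16_t x) {
    if (x > INT8_MAX) return INT8_MAX;
    if (x < INT8_MIN) return INT8_MIN;
    return (int8_t)x;
}

/* Scalar model of PACKSSWB: the four words of a fill the low four bytes
   of the result and the four words of b fill the high four bytes. */
void pack_signed4(const int16_t a[4], const int16_t b[4], int8_t out[8]) {
    for (int i = 0; i < 4; ++i) {
        out[i]     = sat_s8(a[i]);
        out[i + 4] = sat_s8(b[i]);
    }
}
```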

Figure A.11 Pack Operations

             Signed      Unsigned    Extension
Word         PACKSSWB    PACKUSWB    MMX \ SSE 2
DoubleWord   PACKSSDW                MMX \ SSE 2

Table A.11 Pack Instructions
A.11 Conversion

Both SSE and SSE 2 provide a number of instructions that convert data between
the following packed data-types:

• 32-bit DoubleWord Integer
• 32-bit Single-Precision Float
• 64-bit Double-Precision Float

SSE 3 also introduces a conversion instruction, FISTTP. The instruction converts
the floating-point value on top of the ST stack into an integer and stores it in a memory
location; the floating-point value is popped off the stack. This instruction is included in
the SIMD instruction sets because it uses a rounding technique (truncation) introduced
by earlier extensions.
When converting from a floating-point value to an integer the fractional part of
the number may be truncated or rounded. In cases where the target data-type is smaller
than the source, the result is stored in the lower half of the target register; when the
target is larger, only the low half of the source register is converted, see Figure A.12.
Note that some conversion instructions convert data between XMM and MM registers.

Conversion         Rounding    Packed        Scalar       Extension
Int to Single      Rounded     *CVTPI2PS     CVTSI2SS     SSE
                               CVTDQ2PS                   SSE 2
Single to Int      Rounded     *CVTPS2PI     CVTSS2SI     SSE
                               CVTPS2DQ                   SSE 2
                   Truncated   *CVTTPS2PI    CVTTSS2SI    SSE
                               CVTTPS2DQ                  SSE 2
Int to Double      Rounded     *CVTPI2PD     CVTSI2SD     SSE 2
                               CVTDQ2PD
Double to Int      Rounded     *CVTPD2PI     CVTSD2SI     SSE 2
                               CVTPD2DQ
                   Truncated   *CVTTPD2PI    CVTTSD2SI    SSE 2
                               CVTTPD2DQ
Double to Single   Rounded     CVTPD2PS      CVTSD2SS     SSE 2
Single to Double   Rounded     CVTPS2PD      CVTSS2SD     SSE 2
Float to Int       Truncated                 FISTTP**     SSE 3

*Converts between MM and XMM registers.
**Generates floating-point exceptions, not SIMD floating-point exceptions.

Table A.12 Conversion Instructions

Figure A.12 Conversion operations: a) Convert to smaller data-type, b) Convert to larger
data-type, c) Convert to same size data-type

A.12 Data Movement

Data movement instructions perform the standard task of transferring data
between registers and memory. They operate on fixed-size blocks of data and do not
affect its underlying packed interpretation; as a result many of the SSE 2 instructions are
identical to their SSE equivalents even though they are defined for different data types.
Move instructions may operate on the whole of the register or a portion of it.
When operating on a portion of the register, data may be transferred from the high or
low end of the register, but never the middle. The exceptions are the PINSRW and
PEXTRW instructions, which are used to insert or extract any packed word of the
register. The duplicate move instructions copy only every second element of the source
register to the destination register, but duplicate every element copied to fill the register.
MOVDQ2Q and MOVQ2DQ are the only move instructions permitted to
transfer data between the MM and XMM registers. When moving data from an MM
register to an XMM register the high-order QuadWord of the XMM register is cleared.
Note that there are both aligned and unaligned move instructions; aligned move
instructions generate exceptions if used on unaligned data. While unaligned move
instructions can operate on aligned data, they must still perform alignment checking, and
thus aligned moves should always be used if the data is aligned. SSE 3 introduces a
specialised unaligned move instruction, LDDQU, that is designed to avoid cache line
splits.

                    Instruction(s)                   Registers   Extension
Move Data between Registers and Memory
                    MOVD                             MM, XMM     MMX / SSE 2
                    MOVQ                             MM, XMM     MMX / SSE 2
Move High           MOVHPS                           XMM         SSE
                    MOVHPD                           XMM         SSE 2
Move Low            MOVLPS                           XMM         SSE
                    MOVLPD                           XMM         SSE 2
Move Low / High     MOVHLPS, MOVLHPS                 XMM         SSE
Aligned Move        MOVAPS                           XMM         SSE
                    MOVDQA, MOVAPD                   XMM         SSE 2
Unaligned Move      MOVUPS                           XMM         SSE
                    MOVDQU, MOVUPD                   XMM         SSE 2
                    LDDQU                            XMM         SSE 3
Scalar Move         MOVSS                            XMM         SSE
                    MOVSD                            XMM         SSE 2
Insert \ Extract    PEXTRW, PINSRW                   MM, XMM     SSE \ SSE 2
Duplicate           MOVSHDUP, MOVSLDUP, MOVDDUP      XMM         SSE 3
Move Data between MM and XMM registers
                    MOVQ2DQ                          MM, XMM     SSE 2
                    MOVDQ2Q                          MM, XMM     SSE 2

Table A.13 Data Movement Instructions

A.13 State Management Functions

On MMX-only machines the SIMD execution environment is defined by the
state of the MM registers. These are aliased onto the FPU registers, so saving the FPU
state has the net effect of saving the MMX state as well. The FXSAVE and FXRSTOR
instructions are used to save and restore the FPU, and thus the MMX, state.
With the introduction of SSE the SIMD execution environment was expanded to
include the XMM registers and the MXCSR register. To accommodate this expanded
state both instructions were extended to save the entire SIMD state. However, this
expansion is not considered part of the SSE extension and must be tested for separately
if the application is to use the XMM registers.
As well as saving the entire SIMD state, it is also possible to save and restore the
contents of the MXCSR control register independently, using the STMXCSR and
LDMXCSR instructions.

A.13.1 EMMS

A result of the fact that the MM registers are aliased onto the FPU registers is
that whenever data is written to an MM register it has the following effects on the
FPU state:

• The lower 64 bits of the corresponding FPU register are overwritten with the
contents of the MM register.

• The upper 16 bits of the corresponding FPU register are set to 1. (The
floating-point interpretation of the number is now a QNaN.)

• Tag Word bits for all FPU/MM registers are set to 00.

The FPU tag word indicates the types of values that are stored in the FPU
registers, using two bits per register. Four states are possible: Valid (0 0), Zero (0 1),
Invalid (1 0), and Empty (1 1). MMX instructions mark all registers as valid even
though the number stored is an invalid QNaN.

This may cause problems whenever the application switches from MMX to FPU
routines, with subsequent FPU instructions possibly producing invalid results or
generating exceptions. The EMMS instruction protects against this by marking all FPU
registers empty. Note that all MM register instructions alter the Tag Word, not just
instructions that write to the registers. MM register instructions added by the
SSE \ SSE 2 extensions are also included.

                  Save       Restore     Extension
SIMD State        FXSAVE     FXRSTOR     SSE
MXCSR Register    STMXCSR    LDMXCSR     SSE
Empty MMX State   EMMS                   MMX

Table A.14 State Management Instructions

A.14 Cacheability Control

Included in the SIMD instruction set are a number of instructions designed to
improve the efficiency of applications by providing control over the caching of data. Of
these only two types operate on SIMD registers and will be considered here: the store
and store mask instructions. The others, although part of the SIMD instruction set,
perform basic caching functions and are not discussed here.

• Store Instructions
The store instructions copy non-temporal data from the SIMD registers
directly to memory, bypassing the cache. They should be used to save any data
that will not be needed by the application for some time.

• Store Mask Instructions
These instructions are similar to the store instructions in that they bypass
the cache and store data directly to memory. The major difference is that
store mask instructions use a byte mask to select which elements of the packed
register will be saved. Only selected elements are written to memory; these are
stored relative to their location in the register, and are not reorganised to fill any
gaps created by unselected elements. If an element is not selected its
corresponding memory location is not overwritten; thus store mask instructions
can be used to intermix arrays.
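
The selection behaviour of a store mask can be sketched as follows. The model assumes the MASKMOVQ convention, in which the most significant bit of each mask byte selects the corresponding source byte; it does not model the cache bypass, and the function name store_masked8 is hypothetical.

```c
#include <stdint.h>

/* Scalar model of a 64-bit store mask: a source byte is written to
   memory only if the top bit of the matching mask byte is set. */
void store_masked8(uint8_t dst[8], const uint8_t src[8], const uint8_t mask[8]) {
    for (int i = 0; i < 8; ++i)
        if (mask[i] & 0x80u)
            dst[i] = src[i]; /* unselected locations are left unchanged */
}
```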

Figure A.13 Store Mask Operation

             SSE                     SSE 2
             MM          XMM         XMM                 General Purpose Register
Store        MOVNTQ      MOVNTPS     MOVNTPD, MOVNTDQ    MOVNTI
Store Mask   MASKMOVQ                MASKMOVDQU
Fence        SFENCE                  LFENCE, MFENCE
Prefetch     PREFETCHh
Flush                                CLFLUSH*

*Implemented separately from SSE 2 in hardware
Table A.15 Cacheability Control Instructions

A.15 Thread Instructions

The instruction sets also include instructions for dealing with threads. These
instructions can be used both with threads in a multiprocessor system and with threads
running on a single processor using Hyper-Threading technology. The instructions are
non-SIMD and are only presented here for completeness.

Instruction   Extension
PAUSE         SSE 2
MONITOR       SSE 3
MWAIT         SSE 3

Figure A.14 Thread Synchronisation Instructions

Appendix B

Checking for SIMD Hardware and Software Support

Before any attempt is made to issue SIMD instructions it is important to ensure
that the underlying hardware and software can support such instructions. Depending on
the extension and type of instructions used, the application should perform a number of
the following tests before issuing any SIMD instructions.

B.1 Testing for the presence of SIMD extensions

Intel allows software to test for SIMD extensions through use of the CPUID
instruction. If the instruction is executed when the value of EAX is 1, bits 23, 25 and 26
of the EDX register indicate the presence of MMX, SSE, and SSE 2 respectively; bit 0
of ECX indicates the presence of SSE 3.

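    MOV    EAX, 1          ; select the feature information function
    CPUID                  ; feature flags are returned in EDX and ECX
    TEST   EDX, ????       ; mask selecting the feature bit of interest
    JNZ    JUMP LABEL      ; taken if the feature bit is set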
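
Once the register values are available, decoding them is a matter of testing single bits. The sketch below assumes the EDX and ECX values have already been obtained from CPUID with EAX set to 1 (how this is done is compiler-specific and is not shown); the structure and function names are hypothetical. The SSE 3 flag is taken from bit 0 of ECX, following the Intel manuals, and the FXSAVE/FXRSTOR flag from bit 24 of EDX.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature flags decoded from CPUID function 1 (hypothetical type). */
typedef struct {
    bool mmx;  /* EDX bit 23 */
    bool fxsr; /* EDX bit 24: FXSAVE / FXRSTOR */
    bool sse;  /* EDX bit 25 */
    bool sse2; /* EDX bit 26 */
    bool sse3; /* ECX bit 0  */
} simd_features;

/* Decode register values previously returned by CPUID with EAX = 1. */
simd_features decode_features(uint32_t edx, uint32_t ecx) {
    simd_features f;
    f.mmx  = ((edx >> 23) & 1u) != 0;
    f.fxsr = ((edx >> 24) & 1u) != 0;
    f.sse  = ((edx >> 25) & 1u) != 0;
    f.sse2 = ((edx >> 26) & 1u) != 0;
    f.sse3 = (ecx & 1u) != 0;
    return f;
}
```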

Note SSE extends the set of packed integer operations that can be performed on
the MM registers; as a result some 64-bit packed integer instructions are not compatible
with machines that contain the MMX extension only.

B.2 Testing for the presence of a Floating-point unit

Bit 2, the emulation bit (EM bit), of the CR0 control register is used to indicate
the presence of a floating-point unit. If it is set to 1 the unit is absent and all
floating-point instructions are handled through software. In that event all MMX and most
SSE \ SSE 2 instructions will generate exceptions, leaving only a handful of caching
instructions that are safe to use.

B.3 Testing for FXSAVE and FXRSTOR support

The FXSAVE and FXRSTOR instructions provide fast save and restore
operations for the FPU register state. With the introduction of SSE on the Pentium III
their use was extended to include the XMM state. However, this extension is not
considered part of SSE or SSE 2 and it is therefore good practice to test for its presence.
This is achieved by use of the CPUID instruction, where bit 24 of EDX indicates the
presence of the extension.

More importantly, the operating system must support their use in
order to preserve the state of the XMM registers during a context switch. This support
can be tested for by examining bit 9 of the control register CR4; if it is set to 1 the
operating system supports both these extended instructions. Most SSE \ SSE 2
instructions will generate exceptions in the absence of proper operating system support.
Note that as the MM registers are aliases of the FPU registers they are
automatically saved with them. They therefore do not require this extended support and
as a result instructions operating on them are not subject to this condition.

B.4 Testing for Operating System support of SIMD exceptions

Instructions operating on SIMD registers can generate two types of exceptions,
non-numeric and floating-point.

Non-Numeric Exceptions
These exceptions include the system exceptions described above
and exceptions relating to memory access, such as stack segment and page faults. In
general such exceptions are similar to existing IA32 instruction exceptions and can be
handled in the usual manner without the need for additional support.

SIMD Floating-Point Exceptions
There are six floating-point exceptions that can be generated when performing
floating-point operations within the IA32 architecture:

• Precision
• Underflow
• Overflow
• Zero Divide
• Denormal Operand
• Invalid Operand

Each of these exceptions may be masked or unmasked by setting the appropriate
mask bits of the FPU control word (standard floating-point exceptions) or the MXCSR
register (SIMD floating-point exceptions).

Masked Exceptions
The processor traps masked exceptions and returns a predefined value to
the destination operand, based on the type of exception trapped. Their use
provides a fast and predictable method of recovering from the underlying error.

Unmasked Exceptions
When unmasked exceptions are detected a software handler is invoked to
deal with the exception. This provides a more fine-tuned approach to exception
handling by allowing a more appropriate result to be calculated in software.

Unmasked SIMD exceptions require operating system support when invoking
the exception handler. The presence of this support can be detected by examining bit 10
of the CR4 control register. If it is not set an invalid opcode exception is generated
whenever an unmasked SIMD exception is detected.

Note that this is mainly due to the fact that different registers are used to indicate
SIMD and standard floating-point exceptions (the MXCSR and FPU status registers
respectively) and not to any fundamental difference in the way exceptions are reported.
Specifically, SIMD operations do not report exceptions for individual packed elements,
but rather that the exception occurred for at least one of the elements. Masked SIMD
exceptions are unaffected by the absence of such software support; so too are integer
SIMD operations, as they don't generate any exceptions.
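
The masking of the six SIMD floating-point exceptions can be illustrated by manipulating a copy of the MXCSR value. The sketch assumes the register layout given in the Intel manuals, with the six exception flags in bits 0 to 5, the six corresponding mask bits in bits 7 to 12, and a power-up value of 0x1F80 (all exceptions masked); the enumeration and function names are hypothetical, and reading or writing the real register (STMXCSR and LDMXCSR) is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* The six exceptions, in the order of their flag bits (0 to 5). */
enum simd_exception {
    EXC_INVALID = 0, EXC_DENORMAL = 1, EXC_ZERO_DIVIDE = 2,
    EXC_OVERFLOW = 3, EXC_UNDERFLOW = 4, EXC_PRECISION = 5
};

/* True if the exception is masked (mask bits occupy bits 7 to 12). */
bool mxcsr_is_masked(uint32_t mxcsr, enum simd_exception e) {
    return ((mxcsr >> (7 + (int)e)) & 1u) != 0;
}

/* Return a copy of the value with the given exception unmasked. */
uint32_t mxcsr_unmask(uint32_t mxcsr, enum simd_exception e) {
    return mxcsr & ~(1u << (7 + (int)e));
}

/* True if the status flag for the exception has been raised. */
bool mxcsr_flag_set(uint32_t mxcsr, enum simd_exception e) {
    return ((mxcsr >> (int)e) & 1u) != 0;
}
```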
Appendix C

Differences between PVM and Condor-PVM

HTC and HPC are different computing environments, with HTC applications
generally requiring more fault tolerance and error recovery than HPC applications; as a
result extra functionality may be required when converting an application from PVM to
Condor-PVM. There are also a number of minor changes to the workings of PVM
routines when operating in the Condor environment, which may need to be addressed
when porting the code. These include:

C.1 No Multiple Slaves Per Host

Condor-PVM only permits one slave task to be launched on each remote host.
This is in contrast to standard PVM, where a host may contain a number of different
slave processes cooperating to accomplish the task.

C.2 No Multiple Spawns

Standard PVM uses the ntask argument of the pvm_spawn() routine to specify
how many copies of the task to spawn. In Condor-PVM this argument must always be
set to one, as Condor only allows a single task to be spawned per call; multiple spawns
must be processed in a loop.

C.3 PVM Architecture Class Replaced by Condor Machine Class

When a task is spawned in PVM it is possible to specify the architecture on
which to spawn the slave, by setting the flag argument to PvmTaskArch and the where
argument to the PVM architecture class required (e.g. "LINUX"). However, when the
same operation is performed in Condor-PVM the Condor machine class must be used
instead.
The machine class is a number assigned by Condor to each type of architecture,
and is not directly related to the PVM architecture class. In a homogeneous environment
the machine class is always "0" regardless of the underlying architecture. In a
heterogeneous environment the machine class is directly related to the ordering in the
submit file.

C.4 New notification events

Slave tasks can't be check-pointed or migrated; when a machine is reclaimed the
slave is usually just terminated. However, the policies of the machine owner might allow
the slave to be suspended for a time in the hope that it can be resumed later. This adds
two new events (PvmHostSuspend, PvmHostResume) to the three existing PVM
notification events (PvmHostAdd, PvmHostDelete, PvmTaskExit).
A Condor-PVM application can request notification for any of these five events.
However, the use of the new notifications requires the PVM source code to be
modified and recompiled. The file include/pvm3.h must be edited to define the new
events, and the pvm_notify() function in src/lpvmgen.c requires modification in order
to handle them.

C.5 Host not added immediately on request

The pvm_addhosts() routine is used to add one or more hosts to the virtual
machine. Standard PVM will attempt to add the hosts before returning a status flag. But
in Condor-PVM a call to pvm_addhosts() returns immediately; the hosts will be added
later if they become available. In this environment pvm_addhosts() is more of a request
for extra resources than a specific instruction to actually add the new hosts. The
pvm_notify() routine should be used to request notification of when the host is actually
added.

Appendix D
Paper Presented to the ParCo2003
Conference on Parallel Computing

