A strategy for mapping unstructured mesh computational mechanics programs onto distributed memory parallel architectures by McManus, Kevin
Greenwich Academic Literature Archive (GALA)
– the University of Greenwich open access repository
http://gala.gre.ac.uk
__________________________________________________________________________________________
Citation:
McManus, Kevin (1996) A strategy for mapping unstructured mesh computational mechanics 
programs onto distributed memory parallel architectures. PhD thesis, University of Greenwich.
__________________________________________________________________________________________
Please note that the full text version provided on GALA is the final published version awarded 
by the university. “I certify that this work has not been accepted in substance for any degree, 
and is not concurrently being submitted for any degree other than that of (name of research 
degree) being studied at the University of Greenwich. I also declare that this work is the result 
of my own investigations except where otherwise identified by references and that I have not 
plagiarised the work of others”.
McManus, Kevin (1996) A strategy for mapping unstructured mesh computational mechanics  
programs onto distributed memory parallel architectures. ##thesis  _type##  ,  ##institution##  
Available at: http://gala.gre.ac.uk/6249/
__________________________________________________________________________________________
Contact: gala@gre.ac.uk
H919S2
A Strategy for Mapping Unstructured Mesh
Computational Mechanics Programs onto
Distributed Memory Parallel Architectures
Kevin McManus
A thesis submitted in partial fulfilment of the
requirements of the University of Greenwich
for the Degree of Doctor of Philosophy
25th September 1995 
Revised 22nd February 1996
Centre for Numerical Modelling and Process Analysis
School of Computing and Mathematical Science
University of Greenwich
London, UK
To Libby
Acknowledgements
There are a number of people who I would like to thank for their help during the time 
that it has taken me to write this thesis.
My supervisors, Professor Mark Cross and Doctor Steve Johnson for their invaluable 
support and guidance.
My colleagues, Chris Bailey, Peter Chow, Nick Croft, Emyr Evans, John Ewer, Yvonne 
Fryer, Cos lerotheou, Peter Lawrence, Peter Leggett, Miltos Petridis and Chris Walshaw, 
for their help and patience in assisting me to write this thesis.
The staff and researchers at the School of Computing and Mathematical Science for 
providing a pleasant working environment.
The Engineering and Physical Science Research Council for supplying the funding that 
allowed me to escape from the pressures of industry and rediscover the world of academia.
n
Abstract
The motivation of this thesis was to develop strategies that would enable unstruc- 
tured mesh based computational mechanics codes to exploit the computational advan- 
tages offered by distributed memory parallel processors. Strategies that successfully 
map structured mesh codes onto parallel machines have been developed over the pre- 
vious decade and used to build a toolkit for automation of the parallelisation process. 
Extension of the capabilities of this toolkit to include unstructured mesh codes requires 
new strategies to be developed.
This thesis examines the method of parallelisation by geometric domain decomposi- 
tion using the single program multi data programming paradigm with explicit message 
passing. This technique involves splitting (decomposing) the problem definition into P 
parts that may be distributed over P processors in a parallel machine. Each processor 
runs the same program and operates only on its part of the problem. Messages passed 
between the processors allow data exchange to maintain consistency with the original 
algorithm
The strategies developed to parallelise unstructured mesh codes should meet a num- 
ber of requirements:
The algorithms are faithfully reproduced in parallel.
The code is largely unaltered in the parallel version.
The parallel efficiency is maximised.
The techniques should scale to highly parallel systems.
The parallelisation process should become automated.
Techniques and strategies that meet these requirements are developed and tested in this 
dissertation using a state of the art integrated computational fluid dynamics and solid 
mechanics code. The results presented demonstrate the importance of the problem par- 
tition in the definition of inter-processor communication and hence parallel performance.
The classical measure of partition quality based on the number of cut edges in the
111
mesh partition can be inadequate for real parallel machines. Consideration of the topol- 
ogy of the parallel machine in the mesh partition is demonstrated to be a more significant 
factor than the number of cut edges in the achieved parallel efficiency. It is shown to be 
advantageous to allow an increase in the volume of communication in order to achieve 
an efficient mapping dominated by localised communications. The limitation to parallel 
performance resulting from communication startup latency is clearly revealed together 
with strategies to minimise the effect.
The generic application of the techniques to other unstructured mesh codes is dis- 
cussed in the context of automation of the parallelisation process. Automation of par- 
allelisation based on the developed strategies is presented as possible through the use 
of run time inspector loops to accurately determine the dependencies that define the 
necessary inter-processor communication.
IV
Contents
1 Introduction 2
1.1 The Nature of a Parallel Machine ....................... 2
1.2 The Nature of an Unstructured Mesh Code ................. 5
1.3 Objectives of Parallelisation .......................... 7
1.4 Parallelisation Strategies ............................ 9
1.5 Parallelisation by Domain Decomposition .................. 11
2 Parallel Processing 13
2.1 Processor Interconnection ........................... 14
2.2 Inter-Processor Communication ........................ 15
2.3 Communication Model ............................. 18
2.3.1 Shared Memory............................. 18
2.3.2 Message Passing ............................ 18
2.4 Code Structure ................................. 19
2.4.1 Parallel Utility Library ........................ 21
2.4.2 Parallel Communication Library ................... 22
2.4.3 Communication Harness ........................ 22
3 Domain Decomposition 25
3.1 Representation of an Unstructured Mesh ................... 26
3.2 Mesh Partitioning ............................... 28
3.2.1 Load Balance .............................. 29
CONTENTS
3.2.2 Communication Balance ........................ 30
3.2.3 Processor Topology Mapping ..................... 31
3.2.4 Partitioning Algorithms ........................ 34
3.2.5 Parallel Partitioning .......................... 39
3.3 Mesh Decomposition .............................. 40
3.3.1 Derive Secondary Partitions ...................... 41
3.3.2 Overlap Construction ......................... 43
3.3.3 Parallel Execution Control and Renumbering ............ 46
3.3.4 Overlap Communication ........................ 51
4 Algorithm Decomposition 57
4.1 UIFS - Unstructured Incompressible Flow and Stress ............ 58
4.1.1 The FV Fluid Dynamics Scheme ................... 58
4.1.2 The FV Solid Mechanics Scheme ................... 61
4.1.3 Integration within UIFS ........................ 66
4.2 Parallelisation of UIFS ............................. 68
4.2.1 Partitioning ............................... 69
4.2.2 Renumbering .............................. 70
4.2.3 Communication ............................. 70
4.2.4 Parallel Utilities ............................ 71
4.3 Matrix Decomposition ............................. 72
4.4 Iterative Methods ................................ 75
4.4.1 Jacobi Method ............................. 76
4.4.2 Gauss-Seidel SOR ........................... 79
4.4.3 Conjugate Gradient .......................... 81
4.4.4 Summary ................................ 83
5 Performance of the Parallel Code 85
5.1 Measuring Performance ............................ 86
5.1.1 Speed-up ................................ 87
VI
CONTENTS
5.1.2 Parallel Efficiency ........................... 88
5.1.3 Scalability ................................ 88
5.2 Irregular Shape Test Case ........................... 90
5.2.1 Fluid Dynamic Test Case ....................... 94
5.2.2 Solid Mechanics Test Case ....................... 94
5.2.3 Solidification Test Case ........................ 95
5.3 Performance on the Transtech Paramid ................... 96
5.3.1 Fluid dynamic test case ........................ 100
5.3.2 Solid mechanics test case ....................... 103
5.3.3 Solidification test case ......................... 106
5.4 Improving Performance ............................ 109
5.4.1 Latency Reduction ........................... 109
5.4.2 Flow and Heat Solvers ......................... 109
5.4.3 Solid Mechanics Solver ......................... Ill
5.4.4 The Effect of Optimised Solvers on the Solidification Test Case . . 114
5.4.5 Asynchronous Communication .................... 114
5.5 Summary .................................... 120
6 Automation of Parallelisation 122
6.1 Computer Aided Parallelisation Tools .................... 122
6.1.1 Dependence Analysis .......................... 123
6.1.2 Data Partitioning ............................ 124
6.1.3 Execution Control ........................... 125
6.1.4 Communication ............................. 125
6.2 Generic Parallelisation Methods for Unstructured Mesh Codes ...... 126
6.2.1 Application of CAPTools Structured Mesh Techniques to Unstruc- 
tured Mesh Codes ........................... 128
6.2.2 Data Structures for an Unstructured Mesh ............. 129
6.2.3 Inspector Loops ............................ 131
vn
CONTENTS
6.2.4 Partitioning ............................... 132
6.2.5 Communication Generation ...................... 133
6.2.6 Renumbering .............................. 133
6.3 Summary .................................... 136
7 Other Parallel Issues 137
7.1 Are Further Improvements Possible? ..................... 137
7.1.1 Layered Overlaps ............................ 138
7.1.2 Machine Topology Profile ....................... 138
7.1.3 Dynamic Load Balance ........................ 139
7.1.4 Other Communication Schemes .................... 140
7.2 Difficult Problems ............................... 141
7.2.1 Inhomogeneous Problems ....................... 141
7.2.2 Adaptive Meshing ........................... 142
7.2.3 Long Range Dependencies ....................... 142
7.3 Are there any alternatives? .......................... 143
7.3.1 Parallel Mesh Generation ....................... 143
7.3.2 Parallel Visualisation .......................... 144
7.3.3 Virtual Shared Memory ........................ 144
8 Conclusions 147
8.1 Were the Objectives Met? ........................... 147
8.1.1 Objective (i) Minimise the Changes to the Original Algorithm . . 147
8.1.2 Objective (ii) Minimise the Visibility of the Parallel Code ..... 148
8.1.3 Objective (iii) Maximise Parallel Efficiency ............. 150
8.1.4 Objective (iv) Portability to Most DM MIMD Platforms ...... 151
8.1.5 Objective (v) Scalability of Computation ..............151
8.1.6 Objective (vi) Scalability of Memory ................. 152
8.1.7 Objective (vii) Automate the Parallelisation Process ........ 152
8.2 Summary .................................... 152
vin
CONTENTS
A Parallel Utilities 154
A.I Parallel Included Declarations ......................... 154
A.2 Parallel Utility Library ............................. 156
B Partition List 159
C Parallel Iterative Solvers 160
C.I Jacobi Solver .................................. 161
C.2 Gauss-Seidel Solver ............................... 166
C.3 Diagonally Preconditioned Conjugate Gradient Solver ........... 168
D Modified Parallel Iterative Solvers 172
D.I Modified Jacobi Solver ............................. 172
D.2 Modified Diagonally Preconditioned Conjugate Gradient Solver ...... 175
E Asynchronous Parallel Iterative Solvers 179
E.I Asynchronous Jacobi Solver .......................... 179
E.2 Asynchronous Diagonally Preconditioned Conjugate Gradient Solver ... 182
IX
List of Figures
1.1 Four mesh categories. ............................. 5
1.2 Automatically generated three dimensional unstructured mesh. ...... 6
1.3 Possible data dependency stencils over an unstructured mesh. ....... 7
2.1 Shell structure of the parallel code. ...................... 20
3.1 Entity relationship diagram for a three dimensional unstructured mesh. . 28
3.2 Example run times for two possible partitions over 5 processors. ..... 30
3.3 Processor interconnection mapped to a pipe mesh partition. ........ 32
3.4 Partitions of a 2D mesh into (a) ID, (b) 2D and (c) uniform topologies
with the corresponding sub-domain connectivity graphs. .......... 33
3.5 Mesh partitioned into three parts with overlap elements applied. ..... 40
3.6 A mesh of 28 triangles divided into two sub-domains with the overlaps
required for the flow scheme. ......................... 44
3.7 A mesh of 28 triangles divided into two sub-domains with the overlaps
required for the stress scheme. ......................... 45
3.8 A mesh of 28 triangles divided into two sub-domains showing the renum- 
bering of grid points from global to local numbering. ............ 50
3.9 A mesh of 28 triangles divided into two sub-domains showing the renum- 
bering of elements from global to local numbering. ............. 50
3.10 Overlap update communication scheme. ................... 52
3.11 Mesh of 42 triangular elements. ........................ 54
LIST OF FIGURES
3.12 Mesh of 42 triangular elements partitioned into three renumbered sub- 
domains. ..................................... 55
4.1 Formation of a control volume from sub-control volumes around point P. . 63
4.2 Mapping of a finite volume element to a reference element. ........ 64
4.3 Flowchart for UIFS. .............................. 67
4.4 Matrix form for a five point element stencil over a 4 x 4 regular mesh. . . 73
4.5 4x4 mesh operated on as 2 sub-domains showing the transfer of data into
the overlaps on each renumbered sub-domain. ................ 74
4.6 Mesh of 42 triangular elements. ........................ 74
4.7 Mesh of 42 triangular elements partitioned into three renumbered sub- 
domains. ..................................... 75
4.8 Matrix for the 42 triangle mesh. ....................... 76
4.9 Matrices for the 42 triangle mesh partitioned into three sub-domains. ... 77
5.1 The number of cut edges against the number of partitions for a range of
partition strategies on the 3,034 triangle irregular shape mesh. ...... 91
5.2 The number of cut edges against the number of partitions for a range of
partition strategies on the 10,027 triangle irregular shape mesh. ...... 91
5.3 The number of cut edges against the number of partitions for a range of
partition strategies on the 30,064 triangle irregular shape mesh. ...... 92
5.4 The number of cut edges against the number of partitions for a range of
partition strategies on the 60,005 triangle irregular shape mesh. ...... 92
5.5 The number of cut edges against the number of partitions for a range of
partition strategies on the 119,822 triangle irregular shape mesh. ..... 93
5.6 Flow vectors for the fluid dynamic test case. ................. 94
5.7 Mesh displacement for the solid mechanics test case. ............ 95
5.8 Residual stress contours and flow vectors for the solidification test case. . 96
5.9 Speed-up for the fluid dynamic test case against the number of processors
for a range of partition strategies using a 3,034 triangle mesh. ....... 100
XI
LIST OF FIGURES
5.10 Speed-up for the fluid dynamic test case against the number of processors
for a range of partition strategies using a 10,027 triangle mesh. ...... 100
5.11 Speed-up for the fluid dynamic test case against the number of processors
for a range of partition strategies using a 30,064 triangle mesh. ...... 101
5.12 Speed-up for the fluid dynamic test case against the number of processors
for a range of partition strategies using a 60,005 triangle mesh. ...... 101
5.13 Speed-up for the fluid dynamic test case against the number of processors
for a range of partition strategies using a 119,822 triangle mesh. ..... 102
5.14 Best speed-up obtained for the fluid dynamic test case against the number
of processors for a range of mesh sizes. .................... 102
5.15 Graph of speed-up for the solid mechanics test case against the number
of processors for a range of partition strategies using a 3,034 triangle mesh. 103
5.16 Speed-up for the solid mechanics test case against the number of proces- 
sors for a range of partition strategies using a 10,027 triangle mesh. .... 103
5.17 Speed-up for the solid mechanics test case against the number of proces- 
sors for a range of partition strategies using a 30,064 triangle mesh. .... 104
5.18 Speed-up for the solid mechanics test case against the number of proces- 
sors for a range of partition strategies using a 60,005 triangle mesh. .... 104
5.19 Speed-up for the solid mechanics test case against the number of proces- 
sors for a range of partition strategies using a 119,822 triangle mesh. . . . 105
5.20 Best speed-up obtained for the solid mechanics test case against the num- 
ber of processors for a range of mesh sizes. .................. 105
5.21 Speed-up for the solidification test case against the number of processors
for a range of partition strategies using a 3,034 triangle mesh. ....... 106
5.22 Speed-up for the solidification test case against the number of processors
for a range of partition strategies using a 10,027 triangle mesh. ...... 106
5.23 Speed-up for the solidification test case against the number of processors
for a range of partition strategies using a 30,064 triangle mesh. ...... 107
xn
LIST OF FIGURES
5.24 Speed-up for the solidification test case against the number of processors
for a range of partition strategies using a 60,005 triangle mesh. ...... 107
5.25 Speed-up for the solidification test case against the number of processors
for a range of partition strategies using a 119,822 triangle mesh. ..... 108
5.26 Best speed-up obtained for the solidification test case against the number
of processors for a range of mesh sizes. .................... 108
5.27 Speed-up obtained with the optimised (solid lines) and unoptimised (dashed 
lines) Jacobi solver for the fluid dynamics test case with a range of mesh 
sizes. ....................................... 110
5.28 Graph of speed-up obtained with the optimised (solid lines) and unopti- 
mised (dashed lines) conjugate gradient solver for the solid mechanics test 
case with a range of mesh sizes. ........................ 112
5.29 Speed-up obtained with the optimised conjugate gradient solver using a 
hypercube (solid lines) and a pipeline (dashed lines) global commutative 
for the solid mechanics test case with a range of mesh sizes. ........ 113
5.30 Speed-up obtained with the optimised solvers for the solidification test
case with a range of partition strategies using a 60,005 triangle mesh. . . 115
5.31 Mesh of 42 triangular elements partitioned into three sub-domains renum- 
bered for asynchronous communication. ................... 116
5.32 Matrices for the 42 element mesh partitioned into three sub-domains
renumbered for asynchronous communication. ................117
5.33 Speed-up obtained with the asynchronous (solid lines) and synchronous 
(dashed lines) optimised solvers for the fluid dynamic test case with a 
range of mesh sizes. .............................. 118
5.34 Speed-up obtained with the asynchronous (solid lines) and synchronous 
(dashed lines) optimised solvers for the solid mechanics test case with a 
range of mesh sizes. .............................. 119
xin
LIST OF FIGURES
5.35 Speed-up obtained with the asynchronous optimised solvers for the so- 
lidification test case with a range of partition strategies using a 60,005 
triangle mesh. .................................. 120
6.1 Four element mesh. ............................... 129
7.1 Foil mesh partitioned over four processors. .................. 142
7.2 Foil mesh partition with solver balancing. .................. 142
xiv
List of Tables
3.1 Partition mapping strategies provided by JOSTLE ............. 39
3.2 Element indirection pointer arrays for the partition illustrated in Fig- 
ure 3.9 ..................................... 51
3.3 Communication operations required for a simple chain of processors ... 53
Chapter 1
Introduction
1.1 The Nature of a Parallel Machine
The quest for greater performance has driven the development of computer technology at 
an exponential rate. Clock speeds and bus widths continue to increase while low power 
semiconductor technologies now permit Very Large Scale Integration (VLSI) to shrink 
the Central Processing Unit (CPU) of a 64bit computer onto a single silicon substrate. 
It has long been assumed that there is a fundamental limit to the performance that may 
be achieved by a single processor. How small can semiconductor features be made? How 
fast can a semiconductor switch operate? When does the technology reach a fundamental 
limit? [MF95]
Since the 1960's pipelined or vector processors have been at the heart of many su- 
percomputers. Rather than operating upon a single variable at a time, these machines 
increase their computational performance by allowing a vector of data to be operated 
upon simultaneously [HJ81]. The achievable performance depends upon successfully 
loading the appropriate vector operands from memory [Rod82, Ier90]. Initially the vec- 
torisation of code was an optimisation for the code author to implement. Subsequent 
development led to the vectorising compiler which could automatically extract the vector 
parallelism from the source code [DLD93].
An extrapolation of this concept led to the development of the array structured Sin-
CHAPTER 1. INTRODUCTION
gle Instruction, Multiple Data (SIMD) [Fly72] parallel machines in which whole fields 
of a variable could be subjected to the same operation in parallel [HB84]. These ma- 
chines possessed large numbers of small processors (64 in Illiac-IV circa 1970, 65536 in 
DAP circa 1980) and gave rise to the description Massively Parallel Processing (MPP). 
SIMD machines have changed little since their conception and can still sustain a credible 
throughput in comparison with more modern architectures. Like the vector machines, 
they rely on running a code which maps well to the machine [Par82]. In this case a reg- 
ularly structured code containing few inherently serial operations is required. Unlike the 
vector machines, automatic compilation of serial code for SIMD processing has not been 
possible. Mapping of irregular problems to efficiently utilise the power offered by SIMD 
machines has consequently been the focus of much research [Far89, FFL93, Wil91j. The 
difficulties encountered in successfully programming for SIMD has contributed to the 
architecture falling from popularity.
The notion that it may be more worthwhile to build a number of modest individual 
computers rather than one large one is not new. Many such parallel machines have 
now been successfully built, used and become obsolete [TW91]. Such machines are 
categorised as Multi Instruction, Multi Data (MIMD) [Fly72], of which there are two 
main variants: Distributed Memory (DM), in which each processor is equipped with its 
own private memory and Shared Memory (SM), where the memory is common to all 
processors [AG94]. Now that integration density can place what was until very recently 
considered a supercomputer onto a single chip, and furnish it with a quantity of memory 
in a similarly small space, with sufficiently low energy requirements to allow the intimate 
connection of many processing elements, this makes highly parallel MIMD the probable 
architecture for the next generations of supercomputers [FWM94].
The von Neumann programming model of a computer has not changed during these 
developments [vN66]. Programs continue to be written as a series of instructions to 
be executed in sequence. Indeed many algorithms depend upon the sequential order 
of variable evaluation. A diversity of new languages and paradigms have consequently 
been developed that attempt to express and exploit parallelism with concepts such as
CHAPTER 1. INTRODUCTION
Communicating Sequential Processes [Hoa86], tasks (Ada, Occam), data flow [DeC89] 
and data parallelism (FortranD, HPF) [vH92, Ric95]. There exists, however, not only 
a legacy of software that has been written in a simple sequential procedural manner, 
but also a large base of program developers who have no interest in parallel processing. 
Program developers are content with the von Neumann model as a means of algorith- 
mic expression and want nothing more than a larger, faster serial processor. A means 
of efficiently mapping existing and future software onto DM MIMD platforms is there- 
fore required. The success of the vectorising compilers has led to an expectation that 
parallelising compilers will eventually be produced [ZC90, CBB+94]. Success has been 
shown with automatic parallelism for shared memory parallel MIMD systems with small 
numbers of processors (Cray Y-MP, C90 (actually shared memory vector parallel), SGI 
Power Challenge, Sun Sparc20MP, Digital 8400) [Sun94]. But shared memory is unlikely 
to be feasible for large numbers of processors as the memory bandwidth does not scale 
with the number of processors. Virtual shared memory systems that allow distributed 
memory to appear as shared memory have shown some limited success (Kendall Square 
KSR1, Cray T3D) but fail to reach the potential peak machine performance largely as 
a consequence of the high degree of inter-processor communication required to sustain 
memory/cache coherence [Bom93]. The advantage of distributed memory is freedom 
from the SM bandwidth problem as the DM bandwidth scales automatically with the 
number of processors. This is seen to outweigh the disadvantage of having to explicitly 
express the distribution, communication and synchronisation of data between processors. 
The argument for DM MIMD is essentially an economic one. An enormous amount of 
development is directed towards the cost-effective high-performance workstation market. 
No matter how powerful these machines become there will always exist users who seek 
greater processing power. The simple interconnection of workstation technology allows 
the DM MIMD parallel machine to capitalise on the economy of scale of workstation 
development and provide the required power at a cost which is highly competitive in 
comparison with other High Performance Computing (HPC) technologies [Smi90]. The 
number of floating point operations (Flops) per dollar has become a new yardstick for
CHAFTER 1. INTRODUCTION
the performance measurement of HPC.
1.2 The Nature of an Unstructured Mesh Code
Computational Mechanics (CM) may be applied to the modelling of diverse physical 
systems (structural mechanics, structural dynamics, fluid dynamics, electromagnetics, 
magnetohydrodynamics, etc.). The technique of applying a system of equations over a 
discretised domain leads inevitably to the concept of a mesh or grid. A mesh describes 
the spatial nature of a discretisation. Wherever possible this thesis will deal with 3 
dimensional space, this is however not always convenient for the purposes of illustration 
or example, where 2 dimensional space will normally be used for clarity.
Figure 1.1: Four mesh categories.
The complexity of a computational mesh ranges from the simple regular structured 
to fully unstructured. Structured grids, suitable for transport phenomena modelling, 
were widely used in the development of Finite Volume (FV) (finite difference / control 
volume) schemes for Computational Fluid Dynamics (CFD) [PatSO]. Irregular and block
CHAPTER 1. INTRODUCTION
structured grids were introduced to allow FV schemes to work with complex geometries 
and a deformable mesh. The Finite Element (FE) method for structural and thermal 
analysis introduced an unstructured mesh to represent arbitrarily complex geometries 
[Zie77]. The desire to analyse flow in complex three dimensional geometries motivated the 
development of FE-CFD codes [MSSP88]. Difficulties with continuity and convergence 
in FE-CFD [Che91] led to recent work extending FV methods to unstructured grids 
[Cho93] and solid mechanics [FBCL91, CBCP92]. Unstructured mesh codes are unlikely 
to offer the computational efficiency of structured mesh codes. The implicit nature of a 
structured mesh avoids the need for indirection in variable addressing and allows great 
efficiency of coding, cache utilisation and vectorisation. But unstructured meshes provide 
a far greater flexibility for the modelling of complex geometries and avoid the need for 
the complexity of a block structured code. Now that automatic generation of complex 
unstructured meshes has become readily available [Law94] the focus of development is 
towards unstructured mesh codes.
Figure 1.2: Automatically generated three dimensional unstructured mesh. 
In parallelising a program the concern is not so much with the nature of the algo-
CHAPTER 1. INTRODUCTION
rithms or intentions of the program but rather the nature of the data dependency. The 
data dependency for a CM code stems from the integration stencil required for solution 
of the mesh based discretisation of Partial Differential Equations (PDE's). For example, 
the value of pressure in an element may be calculated from the pressure in all adjacent 
elements with a four point integration stencil as in Figure 1.3a. Temperature at a node 
may be expressed in terms of the temperature at all connected nodes (Figure 1.3b). 
The stencil may be deeper than nearest neighbour and extend to next neighbours (Fig- 
ure 1.3c). Additionally the data dependency may be more extensive than simply the 
integration stencil, for instance the contribution from adjacent elements may need to be 
evaluated in terms of some node based value (Figure 1.3d).
Figure 1.3: Possible data dependency stencils over an unstructured mesh.
1.3 Objectives of Parallelisation
There are a number of rudimentary objectives that whilst not mandatory would certainly 
be desirable outcomes from a parallelisation strategy.
CHAPTER 1. INTRODUCTION
i) Minimise the changes to the original algorithm:
The parallel code should ideally produce identical results to the original serial 
code. This can be a necessary requirement for acceptance by code users who are 
familiar with the serial code and require confidence that the results generated by 
the parallel code execution are every bit as reliable as those produced by the serial 
code.
ii) Minimise the visibility of the parallel code:
The parallel code should be hidden from both the serial code developers and the 
parallel code users. This permits transparent maintenance of the parallel code 
alongside the serial code by the serial code developers. In addition this avoids 
deterring users from the parallel code. Code developers and users may be safely 
assumed to have no interest in parallelism and a significant interest in rapid code 
execution.
iii) Maximise parallel efficiency:
The parallel code must show significant speed-up over the serial code. The primary 
motivation for parallelisation is to reduce the code run-time. The parallel code 
must therefore use the parallel machine efficiently, otherwise the time and money 
expended on a parallel machine would be better invested on one or more serial 
machines.
iv) Portability to most DM MIMD platforms:
Parallel code needs to make good use of most currently available hardware, the DM 
MIMD model provides an efficient lowest common denominator hardware model. A 
programming model is therefore also required to necessitate only the most primitive 
platform support without loss of efficiency.
v) Scalability of computation:
DM MIMD Massively Parallel Processing (MPP) is the direction in which the high 
Flop per Dollar supercomputers are being developed. Although there continues 
to be much discussion concerning the implementational details of such MPP's, the
CHAPTER 1. INTRODUCTION
development of high performance, highly integrated serial processors will inevitably 
lead to the interconnection of increasing numbers of such processors (Cray T3D, 
Intel Paragon, IBM SP2, TMC CM5). To take advantage of the full power of MPP's 
the performance of a parallel code needs to be able to scale with the number of 
available processors. Doubling the number of processors should ideally halve the 
run-time.
vi) Scalability of memory:
Larger machines allow larger problems to be solved. To make full use of the 
distributed memory a parallel code must be able to distribute a problem over 
the DM machine. Globally dimensioned data items (data objects that are not 
distributed) must therefore be avoided.
vii) Automate the parallelisation process:
The human effort required to parallelise a CM code is significant. The majority 
of this effort is demonstrably automatable for structured mesh codes [JICL94, 
CIJL94]. A strategy is required which can minimise human intervention in the 
process of parallelising unstructured mesh based codes.
1.4 Parallelisation Strategies
Why use a parallel processor? Why not simply use many serial processors? There are 
two significant reasons; one is to provide a machine which can sustain a problem size 
that is too large to fit onto a serial processor, an other is to reduce the critical path to 
a solution. Given a set of interrelated tasks, a task interaction graph can be produced 
to describe the operations required to find the solution. Tasks may be carried out in 
sequence, one after the other, or some tasks may be executed in parallel as concurrent 
processes. The greater the level of concurrency that can be employed the less time is 
required to achieve the solution. Parallelism in computation exists in many forms and 
many different approaches have been used to exploit the parallelism that can be found 
in CM codes.
CHAPTER 1. INTRODUCTION
Task farming, for example, has the advantage of potentially high parallel efficiency 
by keeping all processors busy. As soon as a processor completes one task another 
is initiated. The technique is however, only suited to problems which present a large 
number of unrelated tasks such as Monte Carlo techniques. To achieve any efficiency 
the amount of data to be sent to and returned from each task must be insignificant in 
comparison to the task computation, which for a CM code is unlikely.
Algorithmic parallelisation involves each processor operating on different parts of a 
algorithm. For example solving flow in three dimensions could be achieved by solving for 
each dimension on differing processors. Taking the example further other computed vari- 
ables could be distributed over a set of processors. Each processor calculates its variable 
and hands the problem to the processor computing the next stage in the algorithm. This 
scheme has little to commend it as it suffers from a high communication requirement 
and poor efficiency as each stage in the calculation will take a different amount of time 
leaving most of the processors waiting for data.
Geometric decomposition partitions the problem space over a set of processors. Each 
processor executes the same algorithm on their own section of the problem. This method 
has the advantage of flexibility to allow variations on the decomposition strategy to be 
used to minimise the communication and maximise processor utilisation. Partitioning 
may be based on the mesh geometry or topology, or on the distribution of computational 
effort within the algorithms used in the code. For example computational partitioning of 
a CM code based around a direct solver may be dominated by the solver which dictates 
the decomposition of the problem. Often a wraparound partition of a matrix (i.e. with 
processors, processor owns matrix rows  ) may be required to 
keep the processors busy in the solver. This can also determine how other parts of the 
mesh are to be distributed. For example in the FAMCALC parallelisation [JAC92] the 
finite elements are distributed in a wraparound fashion according to their inclusion in 
the system matrix. In this case a large communication overhead is incurred to allow 
satisfactory processor utilisation. As is often the case with CM codes based on short 
range interactions communication can be minimised and processor utilisation maximised
10
CHAPTER 1. INTRODUCTION
by a domain decomposition based on the geometry (topology) of the mesh.
1.5 Parallelisation by Domain Decomposition
Domain Decomposition (DD) is a generic name given to a variety of computational 
activities which involve the division of a problem space into two or more parts that 
may be operated on separately to some advantage. Such is the interest in DD that 
there is an annual conference devoted to domain decomposition methods in all their 
diversity [KX93]. Originally developed as a means of solving engineering problems that 
were too large to fit into machine memory [Kro63], there has been a revival of interest 
in domain decomposition as a means of mapping CM codes onto parallel computers 
[Wil90, BCG93]. Parallelisation by DD is a divide and conquer strategy in which a 
problem domain is decomposed into a set of sub-domains which can then be operated 
on in parallel. Attempts have been made at new parallel algorithms which seek to find a 
partial solution for each sub-domain and then reconcile the partial solutions across the 
sub-domain interfaces [FXR92, Lai95j. This runs contrary to the strategies discussed in 
this thesis which should meet objective (i) (and (ii)) and maintain as far as possible the 
integrity of the original algorithm across the partitioned domain. This thesis is concerned 
only with geometric DD as a method for the direct parallelisation of unstructured mesh 
based CM codes for DM MIMD computers. This is a technique that is well suited to the 
short range dependence typical of a CM iterative method (Section 1.2).
The initial step in applying DD to an unstructured mesh based code is to obtain a 
partition of the mesh that allows the problem to be distributed amongst the available 
processors in such a way as to equally apportion the computation time on each of 
processors. If this process is 100% efficient then the processing time for a problem may 
be divided by To achieve a high parallel efficiency with a large has consequently 
become the subject of much research. Much success has been shown with the paral- 
lelisation of structured grid codes using DD with message passing [JC91, GCC+ 93] , 
wherein the partition of the mesh is closely mapped onto the processor interconnection
11
CHAPTER 1. INTRODUCTION
topology in order to minimise the inter-processor communication. Some work on un- 
structured mesh codes following the same topology mapping principle has shown success 
[RL90]. A generic method that can provide good performance without requiring an ab- 
solute adherence to the processor topology is needed to allow automated decomposition 
of unstructured meshes with scalability and efficient portability.
A number of languages and environments have been developed for the generation of 
code which may be automatically parallel. Parallel languages have much to offer, but 
are of limited use for 'dusty deck' codes and more importantly of little interest to serial 
code developers. It is simply not acceptable to require code authors to learn new skills 
in order to be able to use parallel machines. It is a hard enough task to author a CM 
code in the first instance without having to spend more time and effort in persuading 
the code to run on a parallel machine. Environments and libraries for parallelisation 
may point the way for development of parallel code that is transparent to both the code 
developers and the code users, but they fall a long way short of addressing the entire 
parallelisation problem. The Computer Aided Parallelisation Tools project (CAPTools) 
at the University of Greenwich [JICL94, CIJL94] seeks to resolve the parallelisation of 
structured mesh Fortran codes through the use of an toolkit based on highly 
sophisticated interprocedural dependence analysis. It is hoped that the strategies devel- 
oped in this thesis will extend scope of the CAPTools package towards the parallelisation 
of unstructured mesh codes.
12
Chapter 2
Parallel Processing
A Distributed Memory Multi-Instruction Multi-Data (DM-MIMD) parallel computer 
is, in the simplest of terms, a number of interconnected processors, each of which is 
equipped with a quantity of memory. The combination of processor and memory is re- 
ferred to as a Processor Element (PE). Programs (processes) running on the processors 
can communicate with each other in what has been described and formalised as concur- 
rent communicating sequential processes [Hoa86]. In this way the processors operate in 
unison to provide a high overall rate of computation.
Many different approaches to programming for a DM-MIMD parallel machine have 
been explored [Kri89, LC90]. The parallel programming strategy used in this thesis is a 
Single Program Multi Data (SPMD) message passing paradigm. Each processor runs the 
same program (process) on its part of the data set communicating with other processors 
through the exchange of messages. The terms processor and process for the purposes 
of this thesis are consequently interchangeable. This strategy has similarities with the 
data parallel strategy [Hil94] but uses an explicit derivation the data partition based on 
the mesh. The strategy is actually a master slave scheme during input/output processes 
in that one processor is the designated master simply because it has control of the i/o 
processes. Parallel i/o hardware is still uncommon and any dependency on such platform 
specific features would pose a significant barrier to portability.
Any time spent in communication between the processors is an overhead not incurred
13
CHAPTER 2. PARALLEL PROCESSING
with serial processing and so to use a parallel machine efficiently the inter-processor com- 
munication must be minimised. Successful inter-processor communication requires a high 
degree of synchronisation between the processes [Val90]. Successful parallel processing 
requires that no processor needs to idle whilst waiting to synchronise with other pro- 
cessors. To achieve an efficient parallel implementation the workload must therefore be 
balanced amongst the processors.
2.1 Processor Interconnection
There are many varied and novel methods by which processing elements may be inter- 
connected. The relative merits of the differing interconnection strategies are discussed 
at length by several authors [TW91, AG94, FWM94]. A number of interconnection 
topologies have been tried. The richly connected hypercube (nCUBE 2s), two and three 
dimensional arrays, often looped into a ring or torus connection (Intel Paragon, Cray 
T3D) and other connections such as fat trees (Thinking Machines CMS) have also been 
used [vanderSteen94]. The advent of the INMOS transputer [Inm89c, Inm89a] with four 
high speed serial communication ports integrated into a single chip CPU popularised the 
scheme of a simple interconnected mesh of relatively low cost, highly integrated PE's 
[HJ88]. The companion chip to the transputer family, the Inmos C004 32-way crossbar 
switch [Inm89c, Inm89b] provides at low cost a means of reconfiguring the intercon- 
nection topology of an array of transputers. This model has persisted into many new 
designs, most probably as a result of the low cost of implementation coupled with a po- 
tentially high performance. Different switching technologies have been employed (IBM 
SP2, NEC Cenju-3, Meiko Computing Surface) but the reconfigurable interconnection 
model remains largely similar. Consequently this is the model of PE interconnection that 
this thesis will focus upon. Because this model of a parallel machine relys upon no special 
features the concepts discussed will be applicable to the majority of DM-MIMD plat- 
forms. Highly sophisticated and complex processor interconnections suffer significantly 
from the high cost of implementation. To remain cost effective the interconnection cost
14
CHAPTER 2. PARALLEL PROCESSING
must be small in comparison with the PE cost. Additionally the reliance upon machine 
specific features in programming may provide a good performance on one platform but 
can result in restricted portability. Advanced interconnection features may be imple- 
mented on simple platforms through the use of a software communication harness, but 
with consequent performance degradation. To achieve a cost effective parallel machine 
the investment in processor interconnection must result in a well balanced ratio between 
the communication performance and the calculation performance of the individual PE's.
2.2 Inter-Processor Communication
The key parameters for communication between processors are the bandwidth of the 
communication channels and the startup latency time to send a message.
The bandwidth is the rate at which a data packet of length may be transferred 
between two processors, normally measured in millions of bytes (Megabytes) per second 
(MBs" 1 ). Typical bandwidths may be l.TMBs" 1 per connection for the T800 trans- 
puter up to ITOMBs" 1 per connection in the Intel Paragon. For clusters of workstations 
connected by ethernet TCP/IP the bandwidth is more like O.QMBs" 1 [DD95]. This 
bandwidth cannot however be shared simultaneously by all of the processors as they all 
share the same ethernet connection. A more meaningful measure of interconnect band- 
width may be to divide the sum of the bandwidth of all interconnects in the machine 
by the number of PE's to give the bandwidth per processor. Clearly the bandwidths 
provided by different parallel systems ranges dramatically over two orders of magnitude. 
This spread in performance is even wider if the bandwidth per processor is considered.
The definition of latency varies but should give some measure of the time that it 
takes for a communication or message to begin transmission [CDJ95]. Latency is usually 
measured in microseconds (//s)and varys markedly from around 3/^s in the Cray T3D up 
to 900//S for ATM-100 TCP/IP [DD95].
Measurement of the peak achievable communication performance for a platform can 
be misleading. The nature of a parallel code is that execution is synchronised in data
15
CHAPTER 2. PARALLEL PROCESSING
exchanges [Val90]. Ergo the critical communication is not with one individual message 
in the machine but with every processor involved in communication. The effect of this on 
the actual communication performance is highly dependent upon the machine hardware 
implementation. None of the DM machines offer a totally interconnected processor 
network and hence the interconnection bandwidth is shared amongst the processors. A 
more meaningful measure of latency and bandwidth can be obtained with the processor 
interconnects saturated as this reflects more accurately the communication of a typical 
code execution [MWC+ 95]. It is possible to saturate the interconnects with either local 
(near neighbour) or distant (non adjacent) traffic which will give differing measures of 
communication performance. The degree to which this will affect measurement is of 
course system dependent.
The number of processors (hops) between the source of a message and its destination 
affects the time for a message to complete. Jack Dongarra [DD95] considers the per hop 
delay to be a linear function of distance and so gives a model of the time required to 
transmit bytes of data as:
(2.1)
With start up time (latency) a, per byte time /?, per hop delay 7 and number of hops 
The bandwidth of the system can therefore be expressed as:
Hence the peak bandwidth of a system is therefore expressable as:
roc = i (2.3)
A popular measure of the communication performance that combines latency with 
bandwidth is the bisection bandwidth denned as the message length at which half 
of the peak bandwidth is reached (perhaps better described as the bisection message 
length). For a single hop message this reduces to being simply the ratio of latency to 
peak bandwidth:
" = (2.4)
16
CHAPTER 2. PARALLEL PROCESSING
It can be useful to consider whether bandwidth or latency is the bound on the per- 
formance of a code on a particular platform. The latency is often large in comparison 
with the time to transmit an individual data item. Given that the most obvious op- 
timisation is to communicate only the data that is absolutely necessary, the next step 
is to minimise the number of transmissions that need to be made. Bundling the data 
to be communicated into large packets that require infrequent transmission reduces the 
latency overhead but incurs the overhead of copying data into buffer space. The extent 
to which communication may be buffered depends upon the individual code.
A parallel machine may be characterised by the communication to calculation ratio. 
This is sometimes given as the ratio of the time to send a one word message to the time 
for a floating point operation [FJL+88]. The notion being that a machine is well bal- 
anced if this ratio is less than unity. The actual MFlop performance is seldom maximal. 
As processor clock speeds increase to rates well beyond the access times for Dynamic 
Random Access Memory (DRAM) cache success rate begins to dominate the returned 
processing speed. Communication performance is both code and problem dependent as 
to whether latency or bandwidth form the limit. The computation to communication 
ratio is consequently somewhat arbitrary and subjective but if considered carefully can 
give a reasonably meaningful comparison of machine performance [AG94, FWM94]. A 
high ratio is likely to give poor parallel performance, the inter processor communication 
causing a processing bottleneck. A very low ratio would suggest that the investment 
in communication outweighs the investment in processing. Isolated consideration of the 
achievable parallel efficiency or speed-up of an application may give a misleading im- 
pression of the machine performance. The users (purchasers) viewpoint is usually more 
pragmatic involving wall-clock and dollars [FJL+88j.
17
CHAPTER 2. PARALLEL PROCESSING
2.3 Communication Model
2.3.1 Shared Memory
Prom a programming viewpoint the simplest communication model is the shared memory 
model in which the entire machine memory is considered to be shared by all processors. 
For a DM-MIMD machine this leads to a locality dependent Non-Uniform Memory Ac- 
cess (NUMA) which can be handled to a some extent by advanced compiler techniques 
[LP92]. Whilst this presents an attractive model for programming and is amenable to 
automatic parallelisation it is an inefficient model for communication, giving rise to many 
small communications and hence tending to be latency bound. Nevertheless this can be a 
moderately successful communication model for small to medium scale parallelism (2-16 
processors) and low latency platforms.
2.3.2 Message Passing
Message passing provides an explicit control of the inter-processor communication in 
which data to be transmitted is considered to be a messsage sent to a destination pro- 
cessor. This allows greater optimisation of the inter-processor communication and con- 
sequently is the communication model adopted in this thesis.
A communication harness of some description is normally used to implement mes- 
sage passing. At its most primitive the harness allows message passing between directly 
connected processors. More usually some form of 'wormhole' routing is provided that 
allows messages to be sent from any processor to any other processor hiding the under- 
lying processor interconnection from the programmer [NM93]. A per-hop cost penalty 
on non local message passing as discussed in Section 2.1 means that messages should be 
wherever possible nearest neighbour (localised) to maximise efficiency. Implementational 
details of the message passing paradigm vary greatly but may be contrived to provide a 
uniform view of the parallel machine across a wide range of platforms (Section 2.4.2). It 
is now widely accepted that shared memory offers a simple port to serial codes to attract 
code developers and users to parallel processing but cost effective efficiency can only be
18
CHAPTER 2. PARALLEL PROCESSING
obtained from low latency, high bandwidth, localised message passing.
2.4 Code Structure
Implementation of a message passing parallelisation into an unstructured mesh code 
must be largely hidden in order to comply with objective (ii). A structured approach to 
the parallel implementation can go a long way towards achieving this aim. The SPMD 
paradigm is used in this thesis as it allows a single source code parallel program to be 
developed which may be maintained as a serial code by the original code authors. The 
DD method adopted requires extension of existing data structures and additional data 
structures to define the mesh decomposition and inter-processor communication. These 
additional data structures need to circumvent the subroutine parameter lists to remain 
hidden. Include files containing common data areas provide a reasonably convenient 
way to manage these variables. Mapping of the partitioned mesh to the original mesh 
(required to rebuild partioned data for output) requires a global sized data structure 
that has to be distributed among the processors in order to remain scalable (objective
In this parallelisation strategy a shell structure illustrated in Figure 2.1 has been 
used to build layers of (in) visibility within the code. Around the outside of the shell are 
the majority of the original routines which remain unchanged.
At the next level in are the routines from the original code that have been modified to 
function in parallel. Most of these routines are changed only slightly in that additional 
subroutine calls have been included and some array dimensions and loop lengths are 
changed. The i/o routines unfortunately require extensive modification and remain a 
difficult area of code to successfully parallelise. Parallel i/o hardware is uncommon and 
so a serial pipelined approach has been adopted.
The visible parallel routines are provided by a parallel utilities library which provides 
routines that are locationless and directionless and so form a barrier to the visibility of 
the parallel implementation. At this level there is no concept of master or slave processor
19
CHAPTER 2. PARALLEL PROCESSING
or indeed processor number, position or communication channel. It is felt that the serial 
code developers should have no problem with this view of parallelism.
The communication library provides a barrier to the visibility of the parallel machine. 
The communication library consists a very simple set of communication routines used by 
the utility library to present a uniform functionality on all machines. This layer provides 
a portability interface and provides similar functionality to the many popular high level 
parallel communication harness' such as PVM or MPI.
The innermost level is the native communication harness provided for the parallel 
machine. Only the most primitive send and receive functions are necessary at this level 
thereby guaranteeing portability to most hardware platforms. Higher level communica- 
tions at this level may however be used to simplify or improve the implementation of the 
communication library.
Figure 2.1: Shell structure of the parallel code.
CHAPTER 2. PARALLEL PROCESSING
2.4.1 Parallel Utility Library
Routines in the utility library are visible at the serial code level and must attempt 
to hide the parallel implementation whilst providing a parallel functionality which is 
conceptually straightforward. Simplicity of calling is of paramount importance in the 
library routines to achieve objective (ii). The routines in the library are described in 
Appendix A along with the parallel data declarations. The library is currently written 
in terms of the data structures used by the code being parallelised and hence is specific 
to that code. This library could however be made general purpose by adoption of a 
generic data structure for the utilities, this is discussed further in Chapter 6. The 
mesh decomposition routines at this level require extensive data structures and globally 
dimensioned variables. Embedding of these routines in the parallel code is not always 
possible, mainly due to memory restrictions. In which case they may be used to pre- 
process the serial problem files into a domain decomposed problem file that can then be 
used by the parallel program in place of the original problem specification. This process 
can be made reasonable seamless from the viewpoint of a code user.
Similar functionality has been developed for the Bulk Synchronous Parallel (BSP) 
[MR93] package and the Oplus package both from The Oxford Parallel group at the Ox- 
ford Computer Laboratory, LOCO from Katholieke Universiteit Leuven, PLUMP from 
CSCS in Switzerland [CDE+94] and DIME from Caltech [FWM94]. These packages offer 
a range of attractive features for portability, adaptive gridding and dynamic load balanc- 
ing. The significant difference between their work and the work presented in this thesis 
is that they provide an environment and data structure that supports the of 
codes to handle irregular problems so that parallelisation of the code becomes more or 
less automatic. CM programmers cannot be expected to take on-board the overhead of 
authoring parallel code. This thesis therefore attempts a strategy for the parallelisation 
of codes for irregular problems with the intention of developing a methodology 
for automation of the parallelisation of old and new codes.
21
CHAPTER 2. PARALLEL PROCESSING
2.4.2 Parallel Communication Library
The parallel communication library imparts portability to the code by providing an in- 
terface between the parallel utility library and the machines' communication harness. 
Porting the parallel code to a new platform (harness) requires re-writing only the com- 
munication library. The library used for this thesis is the CAPLib library developed as 
part of the Computer Aided Parallelisation Tools project (CAPTools) at the University 
of Greenwich [CIJL94]. This library is constructed in two layers; CAPLib for high level 
routines and CAPLow for the low level portability shell. This further simplifies the porta- 
bility of code using the CAP library system as only CAPLow requires porting. CAPLib is 
currently available for C Toolset on the Transtech Paramid, 3L Fortran on transputers, 
PVM2, PVM3 and MPI with Cray shared memory under development.
2.4.3 Communication Harness
A communication harness is in many ways analogous to an operating system in that it 
provides a means of loading an executable code onto the processors with a number of 
system facilities such as input/output. Most notably a parallel communication harness 
provides a means of inter-processor (inter-process) communication. Some manufacturers 
refer to their harness as a parallel operating system (Helios, Genesys, Parix) whilst oth- 
ers describe it more in terms of a loader or server program. In actuality it is usually a bit 
of both. Networks of workstations running UNIX can be configured as a Parallel Virtual 
Machine by using the popular PVM package or one of the more recently developed Mes- 
sage Passing Interface (MPI) packages. Some of the larger parallel machines use UNIX 
as the communication harness which then provides direct support for communication 
packages such as PVM or MPI but at the cost of a memory and processing overhead.
Communication Packages
The communication harness in Figure 2.1 may be implemented as any of a wide range of 
communication packages. There are almost as many different communication packages 
as there are parallel machines. An incomplete list of some of the most popular and
CHAPTER 2. PARALLEL PROCESSING
persistent of the packages is given here:
C Toolset - Inmos [Inm92]
PVM - Parallel Virtual Machine - Oak Ridge National Laboratory. [GBD+94]
MPI - Message Passing Interface - An international consortium coordinated through the 
University of Tennessee, Knoxville. [For94]
Parmacs - Parallel Macros for Fortran - Argonne/GMD. [Hem91]
CHIMP - Common High-level Interface for Message Passing - Edinburgh Parallel Com- 
puting Centre. [CTHW91]
PICL - Portable Instrumented Communication Library - Oak Ridge National Labora- 
tory. [GHPW90]
Express - ParaSoft Corporation. [Par92]
MPL - Message Passing Library for the IBM SP2.
At the most fundamental level these packages provide a means of explicitly sending 
a message from one process (processor) to another. This simple message passing is all 
that is necessary for CAPLib to be ported to a communication package. Many of the 
packages provide more sophisticated features such as global commutative operations and 
asynchronous communications. Such features often rely on hardware specific calls for 
their successful implementation. Where available such features can be used directly by 
CAPLib to provide the functionality with consequent improved performance.
23
CHAPTER 2. PARALLEL PROCESSING
Communication Primitives
To achieve parallel message passing only a small number of communication primitives 
are required from the communication harness. Only Initialise, Send and Receive are ac- 
tually required to implement a usable communication library. High level communication 
routines such as broadcast and global commutative operations can be built from these 
simple primitives. More efficient implementations of higher functions may be provided 
as primitives on some platforms and harness'. Some of the more sophisticated functions 
such as asynchronous communication must however be supported as primitives and can- 
not be built from synchronous communications. Primitive calls provided by the harness 
take many varied forms, some of the terms used to describe the routines are outlined 
below.
  synchronous (blocking) communication: returns when the operation is complete 
and data resources used in the call are available for re-use.
  asynchronous (non-blocking) communication: returns before the operation is com- 
plete and data resources used in the call are not available for re-use.
  broadcast: sends a data item to all processes
  reduction: performs a commutative arithmetic or logical operation on all processes.
  scatter: distribute a data item amongst the processes.
  gather: rebuild a data item using components from many processes.
24
Chapter 3
Domain Decomposition
Decomposition of a mesh based domain into a set of 5 sub-domains that may be allocated 
to a set of processors involves finding a partition of the mesh so that the amount 
of compute time on each processor is very nearly equal. Two schemes are popularly 
used. One is to divide the problem into as many sub-domains as there are processors, 
i.e. = P, so that each processor is allocated one sub-domain. The other scheme is to 
divide the problem into more sub-domains than there are processors, P, so that each 
processor operates on one or more sub-domains. This latter scheme has some advantages 
for targeting an inhomogeneous compute platform such as a network of workstations, in 
which the PE's are workstations which may have not only differing characteristics, but 
may also be subject to other workloads. Such a scheme can provide an effective coarse 
grained dynamic load balancing mechanism necessary for successful use of shared facility 
networks [MJ95]. Such networks tend to be reasonably small scale (~ 32), in which 
case the overhead of dynamic sub-domain allocation may allow an effective speed-up. 
This thesis attempts to propose a scheme which will scale to a highly parallel (~ 64) 
homogeneous DM MIMD processor array and so the former scheme is advocated. 
The simpler scheme carries a lower sub-domain allocation overhead and so may 
achieve a greater overall efficiency. Also there is an overhead incurred for each cut edge 
of the mesh which is minimised by keeping Edge is used here in a graphical sense 
meaning a relationship between mesh entities that is cut if the entities are in different
25
CHAPTER 3. DOMAIN DECOMPOSITION
sub-domains. Dynamic load balancing schemes may still be implemented as fine grained 
migration of the mesh entities between the sub-domains.
Partitioning of a mesh is a reasonably straightforward procedure of cutting 
the mesh along the grid lines (2D) or planes (3D) [JC91]. Achieving a precise load balance 
in this instance requires that the mesh size along the partitioned axis is a multiple of the 
required number of partitions. Obtaining a balanced partition of an unstructured mesh 
is potentially a more complex problem and the focus of considerable research.
In order to solve for the nodes and elements around the edge of each sub-domain 
data is required from the neighbouring sub-domains according to the stencil of data 
dependency as discussed in Section 1.2. This data may be communicated as required 
from the processor on which the neighbouring domain is calculated, but this can lead 
to an unnecessarily large number of small communications. The strategy adopted in 
this thesis is to extended each sub-domain to overlap its adjacent sub-domains. This is 
discussed in more detail in Section 3.3. Each processor can then solve for the problem 
inside its sub-domain using the variables held in the overlap layer. Variables in the 
overlaps are updated from variables calculated on other processors to maintain a solution 
consistent with the original serial code.
3.1 Representation of an Unstructured Mesh
An unstructured mesh is specified as a hierarchy of components or mesh entities, each 
of which may be regarded as a data object or structure which can be used to provide 
a spatial, geometric or topological reference to the variables used in a computational 
mechanics code.
The definition of an unstructured mesh begins with a set of grid points or nodes, each 
of which is defined by set of spatial coordinates. The grid points describe the geometric 
shape and physical size of the mesh. Points are also convenient to provide a spatial 
reference for dimensionally independent variables such as temperature or pressure.
Points can be connected to form a set of edges, faces or both edges and faces. In
CHAPTER 3. DOMAIN DECOMPOSITION
three dimensions edges can be connected to form a set of faces. Edges in 2D and faces in 
3D may be used to provide a spatial reference for flux variables such as current density.
The space enclosed by a set of edges or faces describes an element. Elements have 
a volume and may be used as a spatial reference for volumetric entities such as mass or 
heat.
The perimeter or surface of a mesh defines a boundary which can be usefully asso- 
ciated with some boundary condition. Boundaries may also be defined internally to a 
mesh.
A defined volume or area within the mesh can be defined as a domain which is subject 
to certain conditions such as being of a material with specified physical characteristics.
The entity relationship diagram for a three dimensional unstructured mesh as shown 
in Figure 3.1 has only these few components and yet the web of relationships is highly 
interconnected. In two dimensions there is no definition of a face and so the relationships 
are a little more straightforward. Not all of the entities or the relationships are mandatory 
and the relationships may be explicit or implicit. The actual entities and relationships 
used varies from code to code.
The connectivity or topology of the mesh is explicitly expressed as relationships 
between like or differing mesh entities. For example the elements may be described in 
terms of their nodes as a list of node numbers for each element. From this information 
the element connectivity (adjacency) may be derived as a list of element numbers for 
each adjacent element. There is a trade off to be made between the memory used 
for the storage of these relationships against the ease of calculation required within 
the code. The nature of the integration employed by CM codes is nearest neighbour. 
Evaluation of an element based variable may for example require the variable values for 
all neighbouring elements and the coordinates of the points that comprise those elements 
(see Figure 1.3 d). This example would require the element to element connectivity to 
find the neighbouring elements and the element to node relationship to find the nodes 
of the adjacent elements.
27
CHAPTER 3. DOMAIN DECOMPOSITION
Figure 3.1: Entity relationship diagram for a three dimensional unstructured mesh.
3.2 Mesh Partitioning
The problem of partitioning an unstructured mesh has attracted the imaginations of 
many workers for more than twenty years [KL70] [PSL89] [BS93]. It is after all an 
interesting problem and one which at first sight at least seems well defined and self 
contained. A good mesh partition is one which divides the computational load equally 
amongst the sub-domains and minimises the amount of communication required between 
sub-domains. For many meshes it can be computationally prohibitive to find an optimal 
partition and computationally expensive to find a near optimal partition. On the other 
hand a reasonable partition may be calculated with little effort. The search for the 'best' 
partitioning algorithm has led to exploration of the middle ground, trading partition 
quality with the order of the partitioning routine.
Partitioning may be based on any of the mesh entities, usually either the elements or 
nodes of the mesh. A sensible choice is to partition according to the structure associated 
with the greatest amount of computation in the computational mechanics code. For
CHAPTER 3. DOMAIN DECOMPOSITION
example a flow code dealing with element based variables would be partitioned according 
to elements whereas a stress code using node based variables would be partitioned as 
grid points. In actuality an element based code integrates over each face of each element 
and so a face based partition may be more appropriate. Similarly a node based code 
may integrate over each edge of the mesh and so an edge based partition may be more 
appropriate. The actual basis for partition chosen is not however of great consequence 
providing that the resulting mesh partition is balanced. This thesis will for simplicity 
normally refer to an element based partition. A mesh partition may be expressed in any 
of a number of ways, the method adopted is a simple list of the partition number for 
each element (entity). (Appendix B)
3.2.1 Load Balance
A fundamental objective in finding a partition is to balance the computational effort or 
load required in each sub-domain. The simplest approach is to assume that the load 
per element is homogeneous throughout the mesh. In this case the partition should 
have as near equal numbers of elements per partition as possible. Should the load be 
inhomogeneous then a weight or cost function may be applied to the elements to achieve a 
cost balanced partition. For example the computational effort required for each element 
may be proportional to the number of faces the element possesses. So tetrahedra will 
incur a cost of 4, bricks a cost of six and so on. An important consideration in load 
balancing is that it is not so much essential to achieve a totally uniform balance of load 
but rather that no one processor should have significantly more than average load. Any 
processor with an exceptional work load will cause all other processors to incur idle time 
with resultingly poor parallel performance. Should any one processor have too little work 
this will not hold up any other processors and have a correspondingly less detrimental 
effect on overall performance. This is illustrated in Figure 3.2 where the overall run time 
for partition A is longer than the overall run time for partition B despite the greater 
imbalance between the individual processor run times for partition B. The definition of 
a good load balance must reflect this effect. What is required is not a small deviation of
CHAPTER 3. DOMAIN DECOMPOSITION
any load from the average load. Nor a small maximum to minimum load difference, but 
a small maximum to average difference.
9-
Processorl Processor2 Processors Processor4 Processors
Figure 3.2: Example run times for two possible partitions over 5 processors.
3.2.2 Communication Balance
The perimeter interfaces between the sub-domains should be as short as possible to re- 
duce the communication overhead between the sub-domains. Again an optimal solution 
is expensive to compute and a near optimal solution is sufficiently good. Reducing the 
number of adjacent sub-domains reduces the amount of messages that require trans- 
mission again reducing the communication overhead. It is also important to have some 
degree of balance in the communication, especially that no one sub-domain interface is 
unduly larger than the average. Again any exceptionally large interface will delay the 
overall parallel execution. These requirements paint a picture of partitions that are low 
order, to reduce the number of interfaces and reasonably regular, to present uniform 
smooth perimeters.
CHAPTER 3. DOMAIN DECOMPOSITION
3.2.3 Processor Topology Mapping
The complexity and therefore the cost of building a totally interconnected non-blocking 
processor array is significant and so some form of interconnection map is generally 
favoured. As discussed in Section 2.1 this may be anything from a simple ID or 2D 
array up to a 3D torus array or a fat tree structure. Many transputer based systems 
employ the Inmos C004 32 channel crossbar switch programmable link router chip al- 
lowing reconfigurable topologies to be constructed from a set of compute nodes. The 
IBM SP2 and the NEC Cenju3 use 4x4 switches to similar effect. A more detailed de- 
scription of a number of popular and esoteric hardware architectures may be found in 
[vdS94, TW91]. In spite of what hardware manufacturers may claim there will always 
be a distance related communication cost. This cost becomes more significant as the 
number of processors increases. No matter how the processor interconnection is realised, 
a parallel processor platform will incur some form of topological communication cost. It 
is inevitable that it is more efficient to communicate with neighbouring processors than 
with distant processors. Robinson and Lonsdale [RL90] suggest that communication 
costs may be reduced by interconnecting the processors to reflect the mesh partition 
as illustrated in Figure 3.3. It may not however be possible or practical to reconfig- 
ure a processor array to suit a given partition. A more generic, flexible and scalable 
scheme is to consider the processor topology to be fixed as, for example, a 2D or 3D 
grid. This processor interconnection topology can then be reflected in the mesh par- 
tition. A transputer based platform, for example, would require the partition to limit 
the number of adjacent sub-domains to four (a 2D grid or 4 dimensional hypercube), 
as this is the number of communication links on each transputer. To this end weights 
can be applied to the partition to discourage the separation of neighbouring elements 
onto non-neighbouring processors [Jon94, Wal95j. In practice it can prove impossible 
to force a partition to adhere to a processor map, but the closer the partition reflects 
the processor map the greater the potential efficiency of the partition. A number of 
workers attempt to incorporate the underlying machine topology into the partitioning 
process in order to produce a partition that can provide improved parallel performance
31
CHAPTER 3. DOMAIN DECOMPOSITION
Figure 3.3: Processor interconnection mapped to a pipe mesh partition.
[Far89, WCE+95, Har94, MWC+95]. Figure 3.4 shows a mesh partitioned (using the 
JOSTLE code discussed in Section 3.2.4) into 16 sub-domains using three different par- 
titioning strategies along with the corresponding processor interconnection graphs.
Regardless of how the mesh partition is calculated one is faced with the problem 
of mapping partitions onto processors ( ) [SE87, SER90, BA92, HS92]. 
If is small then all combinations may be tried to rind the optimal mapping, that 
is the mapping which minimises the number of partition boundaries that do not align 
with processors interconnections. The combinations of mappings increase as factorial 
which makes this impractical for even modest sizes of A simple scheme to obtain 
a mapping for little cost is to loop over all partitions in an initially arbitrary mapping 
looking for a partition which can be swapped so that communication cost reduction is 
maximised. This loop is iterated until no further cost reduction is found. Schemes such 
as this are prone to local minima traps but can give a useful mapping with little overhead 
[WCE+95].
CHAPTER 3. DOMAIN DECOMPOSITION
Figure 3.4: Partitions of a 2D mesh into (a) ID, (b) 2D and (c) uniform topologies with 
the corresponding sub-domain connectivity graphs.
CHAPTER 3. DOMAIN DECOMPOSITION
3.2.4 Partitioning Algorithms
Some partitioning algorithms operate on the geometric mesh coordinates. Others treat 
the mesh as a graph of nodes and edges. Graph based techniques have the 
advantages of dimensional independence and a true representation of the connectivity 
of the mesh in the partioning process. This is demonstrated by Nick Floros and Jeff 
Reeve to be of particular importance when partitioning highly complex shapes [FR94]. 
The graph to be partitioned may be simply the grid points (nodes) of the mesh or a 
dual graph of the mesh with the graph nodes representing for example elements and the 
graph edges representing the element adjacency. If the graph is based on elements of 
the same shape then the node degree (number of edges on each node) in the graph is 
more or less constant (nodes at the boundaries are of reduced degree). Partitioning to 
achieve an equal number of nodes in each sub-domain may achieve a good load balance. 
If however the graph is based on grid points, or the mesh is of mixed element shapes the 
node degree in the graph is variable. Partitioning a graph to achieve an equal number 
of edges (rather than nodes) in each partition may, in some cases, be more appropriate 
for load balance. Other factors may affect the computational load at each node of 
the graph, perhaps different materials, or phases for instance are associated with each 
node. Applying a weight to the nodes (perhaps based upon the number of connected 
elements and/or some other parameter) and then partitioning the weighted list can give 
an improved load balance. In practice it can prove difficult to accurately predict the 
computational load in each sub-domain.
Many of the schemes involve recursive bisections, variations on the bisection schemes 
involve cutting the mesh into more than two partitions at each step. This allows the 
algorithms to provide numbers of partitions other than 2n .
What is required of a mesh partitioning algorithm is a high quality of partition 
at a low cost. The time required to calculate the partition must be insignificant in 
proportion to the time for the CM code to execute. High quality means a balanced load, 
short interfaces and a small number of interfaces. This paints a picture of partitions as 
uniform packed bubbles, shapes of minimum surface energy. Much of the current research
CHAPTER 3. DOMAIN DECOMPOSITION
centres on hybrid approaches with graph reduction techniques and multilevel schemes 
to reduce the order of the problem [Jon94, WCE+95, HL93, VK95, DMM95, KK95]. 
A good but incomplete review of partitioning algorithms has been compiled by Chris 
Geenough [GF94] and Dirk Roose [RVD93]. A number of the algorithms have been 
collected into a package called RalPar [FG94]. Some of the more important techniques 
are covered in detail by Beryl Jones in her thesis [Jon94]. There follows a brief summary 
of many of the better known algorithms.
Recursive coordinate bisection
Recursive Coordinate Bisection (RGB) [Fox88] is a simple geometric scheme in which the 
grid points of the mesh are sorted into order along one axis (normally the longest) and 
then bisected. This process is repeated recursively on each partition until the required 
number of partitions is obtained. This gives rise to thin strip partitions with long 
interfaces. A variant of the scheme is Orthogonal Coordinate Bisection (OCB) in which 
the sort axis is alternated at each recursion. The resulting partitions are consequently 
more checkerboard in shape. An improvement is to bisect each partition along longest 
axis, which is not necessarily the same for each partition.
Recursive inertial bisection
Recursive Inertial Bisection (RIB) is similar to RGB but bisects the geometric coordinates 
along the line of principal inertia [RVD93]. It can be expected that the line of principal 
inertia is aligned with the length of the mesh and the narrowest part of the mesh will be 
orthogonal to it. Whilst RIB is more expensive than RGB or OCB it is still a 'cheap' 
method and gives better results with concave geometries. RIB is still popularly used as 
it is fast and reliable.
Greedy
The greedy method is a graph based technique which begins with a node of minimum 
degree (minimum number of connected edges) and 'bites' level sets from the graph [Far
35
CHAPTER 3. DOMAIN DECOMPOSITION
until the appropriate number of nodes (^) have been 'eaten'. This process is repeated 
on the remaining graph until all of the graph has been consumed. This is an extremely 
cheap method (O(AT)) which produces mostly good partitions but is liable to leave some 
disconnected partitions (i.e. partitions that are split into two or more pieces).
MINCUT
MINCUT [KL70] employs heuristics to optimise a partition by swapping vertices of the 
graph between partitions to find the swap that minimises cost. "The general idea is to 
perturb the locally optimal solution in what we hope is an enlightened manner, so that 
an iteration of the process on the perturbed solution will yield a further reduction in the 
total cost." A logical exchange of all vertex pairs in the graph is performed and the effect 
of each exchange on the partition cost calculated. All exchanges up to the exchange that 
produces the minimum cost are then committed as actual exchanges. This process is 
repeated until no reduction in cost is obtained. This method attempts to climb out of a 
local mina trap but is not always successful.
Recursive graph bisection
Recursive Graph Bisection (RGB) [Sim91] is similar to RGB and RIB but operates on 
the graph of the mesh. A diameter of the graph is found and starting from one end 
of the diameter level sets are removed from the graph until the graph is bisected. The 
process is repeated recursively on each partition.
Recursive spectral bisection
Recursive Spectral Bisection (RSB) [PSL89] represents the graph with its Laplacian 
matrix L. The method recursively partitions the graph by finding x which minimises 
xTLx. The eigenvector that corresponds to the second smallest eigenvalue (the first 
eigenvalue is trivial) is sorted and bisected to give a partition of the graph. This is 
a sophisticated and expensive method that provides a high quality partition that is 
especially suitable for complex geometries. Hendrickson and Leland [HL92] extended
CHAPTER 3. DOMAIN DECOMPOSITION
the method to allow weighting of the nodes and edges and cutting into more than two 
partitions at each step. Multilevel Recursive Spectral Bisection (MRSB) dramatically 
speeds up the algorithm by coarsening the graph with clustering and using RSB on the 
coarsened graph [BS93]. This is a highly elaborate technique that provides the high 
partition quality of RSB at less cost.
Tabu search
Tabu search (TS) [Glo89, Glo90] is a combinatorial optimisation based iterative im- 
provement technique that tries to avoid local minima traps by temporarily accepting 
unprofitable changes to the partition. Cycling in the search trajectory is avoided by 
keeping a history of the most recent changes, making further changes of the most re- 
cently moved nodes 'taboo'. Some open problems of TS are the determination of an 
appropriate 'prohibition period' and the robustness of the technique for a wide range 
of different problems. Some of the limitations of TS have been overcome in Reactive 
Tabu Search (RTS) [BT94] in which the appropriate size of the prohibition list is learned 
automatically by reacting to the occurrence of cycles.
Simulated annealing
Simulated Annealing (SA) is a generalised optimisation method that borrows ideas from 
a statistical mechanics approach to annealing in a cooling solid [KJV83, vLA87]. A 
parameter analogous to temperature is reduced during the course of the calculation. 
For each temperature a number of modifications to the current solution are tested. If 
a modification reduces the cost function the modification is accepted, otherwise the 
modification is accepted according to a probability function based on the exponent of 
the ratio of cost function to temperature. As the temperature cools the algorithm is 
less likely to accept a change that increases the cost. With a slow 'cooling' rate this 
method can produce good partitions but is computationally expensive. Developments of 
the basic ideas of SA have led to Mean Field Annealing (MFA) which combines SA type 
strategies with Neural Network techniques[BA92].
CHAPTER 3. DOMAIN DECOMPOSITION
JOSTLE
JOSTLE [Wal95, WCE+95, MWC+95] is the code used to produce the partitions used in 
this thesis. The JOSTLE strategy is to derive an initial partition as quickly and cheaply 
as possible and then use optimisation techniques to improve the quality of the partition. 
Two alternative methods are provided to produce the initial partition. One method is a 
variation of the Greedy algorithm, in this case a graph based variant on the original mesh 
based algorithm proposed by Charbel Farhat [Far88]. The other method is geometric 
sorting which operates in a similar manner to OCB. This method provides a crude map- 
ping to a x processor grid The nodes are sorted on the longest axis and split 
into sets of The nodes in these sets are then sorted in the orthogonal axis and split 
into sets of Having used one of the above methods to obtain an initial partition 
one of two optimisation methods can be applied to the improve the partition. Uniform 
optimisation is a technique in which each partition attempts to minimise its own surface 
energy analogous to the way that bubbles pack together. The technique works by calcu- 
lating the centre of each partition in a graphical sense and determining the radial distance 
of each node from the centre. Nodes that are most distant from the centre can then be 
migrated between neighbouring partitions. Grid optimisation is a similar technique to 
uniform optimisation except that nodes are allowed to migrate only between neighbours 
in the processor grid. Four partitioning (mapping) strategies are provided by JOSTLE. 
partitioning ignores the processor interconnection topology throughout the 
entire partitioning process. A partition is an unmapped partition that has 
been mapped to the processor topology with a simple mapping algorithm applied post 
partitioning. The partition begins with a partition that is crudely mapped to 
the processor topology and then is optimised ignoring the processor topology to minimise 
the number of cut edges. The partition acknowledges the processor topology 
throughout the partitioning process. Some partitions produced by JOSTLE can be seen 
in Figure 3.4.
CHAPTER 3. DOMAIN DECOMPOSITION
Strategy
Unmapped
Postmapped
Premapped
Mapped
Initial partition
Greedy
Greedy
Geometric sort
Geometric sort
Optimisation
Uniform
Uniform
Uniform
Grid
Processor allocation
No
Yes
No
No
Table 3.1: Partition mapping strategies provided by JOSTLE
3.2.5 Parallel Partitioning
Ideally the partition of the mesh should be carried out at run time in parallel. As and 
increase an partitioning algorithm may become unacceptable for a solver running 
at Few of the available partitioning algorithms are suitable for parallel 
implementation. The work of Chris Walshaw [Wal95] and Ralf Diekmann [DMM95] 
aims to provide paralleliseable routines that can be used to partition and also re-partition 
meshes in a dynamic load balancing scheme. This strategy relies on obtaining a rapid 
initial mesh partition to crudely distribute the mesh across the processors and then 
operate on the partitions in parallel to optimise the partitions. Difficulties arise when 
the size of the mesh becomes too great to fit onto one processor. This is a natural 
consequence of massively parallel processing where the capacity of the whole machine 
may be orders of magnitude greater than the capacity of a single node. In such an 
instance the partitioning algorithm may have to begin by taking an arbitrary partition 
of the mesh as it is read in from file and distributed in sequence to a number (not 
necessarily all) of the processors. A high level of communication will then be required to 
re-distribute the mesh amongst all of the processors to provide a crude initial partition. 
If the partitioning strategy is, for example, to be the mapped JOSTLE scheme this will 
be a reasonably successful process. Geometric sorting will be a reasonably simple and 
cheap algorithm to implement as a parallel initial partition scheme.
39
CHAPTER 3. DOMAIN DECOMPOSITION
3.3 Mesh Decomposition
Having obtained a partition of the mesh into parts the partition is used to decompose 
the mesh into sub-domains that can be allocated one per processor. The elements, 
nodes and faces that are allocated uniquely to a processor are referred to in this thesis 
as the core mesh components. These components are said to be 'owned' by a processor. 
Each sub-domain is extended with a layer of points and elements which overlap the sub- 
domains along the inter-processor boundaries as illustrated in Figure 3.5. These overlap 
or halo regions carry variable values from neighbouring sub-domains that are required 
for the solution of variables inside the sub-domain.
Figure 3.5: Mesh partitioned into three parts with overlap elements applied.
Decomposition of the mesh into a set of extended sub-meshes consists of five essential 
steps;
i) Find a partition of the mesh (primary). 
ii) Derive secondary partitions from the primary partition.
CHAPTER 3. DOMAIN DECOMPOSITION
iii) Determine the mesh overlaps to the neighbouring sub-domains.
iv) Re-number the mesh in each sub-domain.
v) Construct a template for overlap data exchange.
3.3.1 Derive Secondary Partitions
As mentioned in Section 3.2 the mesh entity that provides the dominant spatial reference 
used by the code to be parallelised is ordinarily chosen as a basis for mesh partitioning. 
This partition is referred to as the primary partition. Secondary partitions may be 
derived from the primary partition for the other mesh entities used in the code. The 
compute time for a CM code is dominated by the time spent in the solution of an 
equation of the form Ax = b. It is consequently important for load balance to obtain 
an equal number rows and an equal number of coefficients in each of the distributed A 
matrices. This inevitably results in some compromise. With an element based x for 
example, a primary partition based on elements will keep the vector length and hence 
number of rows in A balanced across each sub-domain. But the number of off diagonal 
coefficients in each A depends upon the number of internal faces in the sub-domain. 
Balancing elements will not necessarily balance matrix coefficients In the case of the two 
dimensional flow code used in this thesis the primary partition is based on elements and 
there is only one secondary partition, that being for grid points. For reasons of clarity 
the following discussion is based on an element based primary partition. The discussion 
is nonetheless applicable to other mesh entity partitioning orders.
Secondary partitions are inherited from the primary partition in accordance with the 
connectivity between the entities. For example, each node is connected to a number 
of elements, each of which belongs exclusively to one sub-domain. This provides a 
basis for the allocation of the node to a sub-domain. The most obvious and simple 
partition inheritance scheme is to allocate the node to the sub-domain which owns the 
majority of the connected elements. In the case of an equal number of connected elements 
being owned by two or more sub-domains, the node is allocated to the domain which
41
CHAPTER 3. DOMAIN DECOMPOSITION
owns the least number of nodes. This simple, inexpensive scheme gives a good match 
between the primary and secondary partitions, but can lead to an unnecessarily high load 
imbalance in the secondary partition. It does not follow that two unstructured meshes 
with equal numbers of elements will have the same number of nodes, indeed there may 
be a large discrepancy between the two node counts. When the two meshes are sub- 
domains to be operated on in parallel this can produce an unacceptably high degree of 
load imbalance for element based matrix computations as discussed earlier and possibly 
even greater imbalance for node based calculations. If however the node allocation 
between the sub-domains is forced to be balanced the element and node partition may 
not be well matched which can result in an undesirably large and imbalanced overlap 
layer. This will consequently lead to large and unbalanced communications between the 
sub-domains. The comments about load and communication imbalance in sections 3.2.1 
and 3.2.2 should be borne in mind at this point.
The load imbalance may be redressed to an extent by the use of more elaborate 
schemes to derive secondary partitions. A possibly superior partition inheritance scheme 
is to first locate the nodes for which all connected elements lie in one partition and for 
each node found, allocate the node to that partition. The remaining nodes are then 
allocated in turn to the least loaded domain beginning with the node which has the 
greatest connectivity to that domain.
It is conceivable that the nodal imbalance may become unmanageably large, in which 
case some nodes may require allocating to sub-domains that own none of the connected 
elements in order to redress the balance. This will result in a communication imbalance 
which may or may not be significant depending upon the characteristics of the hardware 
platform. The quality of the secondary partitions then becomes a platform dependent 
optimisation issue.
These schemes may be seen as an attempt at solving a graph problem by the applica- 
tion of simple heuristics. It may therefore be worthwhile to use graph based techniques 
to derive the secondary partitions. A possible scheme is to produce a weighted graph 
of the nodes which clusters the nodes for which all connected elements lie on one parti-
42
CHAPTER 3. DOMAIN DECOMPOSITION
tion. This graph may then be partitioned using one of the graph partitioning algorithms 
developed for obtaining primary partitions. The work of Chris Walshaw [Wal95] is of 
interest here. The amount of effort that it is worthwhile devoting to the derivation of a 
secondary partition is problem dependent. Like the search for a primary partition there 
may be no singular optimal solution and a near optimal solution will in the majority of 
cases provide a sufficiently good solution.
3.3.2 Overlap Construction
The overlaps between the sub-domains are determined in accordance with the data de- 
pendency required by the code as discussed in section 1.2. For example, if the solution 
for an element based variable requires the values in all adjacent elements as illustrated in 
Figure 1.3a then the adjacent elements that lie in neighbouring sub-domains are added as 
overlaps to the list of elements. Similarly if the nodes that compose the overlap elements 
are also required as in Figure 1.3d then they too are added to the list of overlap nodes. 
In this way the description of the mesh for each sub-domain is extended to include all 
data that are required for solution of the sub-domain. The utility used to construct 
overlaps for the codes parallelised in this thesis uses a simple set of rules to determine 
the elements and nodes which are to be included in the overlaps (Appendix A).
When using only the element based flow and heat code;
Overlap elements are denned as:-
All elements that are adjacent to a core element. 
Overlap nodes are defined as:-
Nodes of all elements including overlaps that are not core nodes.
However the node based stress code involves a more extensive data dependency 
and the required overlap layers become deeper so that;
Additional overlap elements are defined as:-
CHAPTER 3. DOMAIN DECOMPOSITION
Elements that contain at least one core node. 
Additional overlap nodes are defined as:-
Nodes that are connected to core nodes.
An example of the overlaps required for the flow code is shown in Figure 3.6. The same 
mesh is shown in Figure 3.7 with the additional elements and nodes in the overlaps 
required for the stress code .
Figure 3.6: A mesh of 28 triangles divided into two sub-domains with the overlaps 
required for the flow scheme.
Providing that the mesh data structures are either one dimensional linked or indexed 
lists, or stored as multi dimensional arrays in which the number of entities is the highest 
index (last in F77, first in C) then the overlaps may be stored as extensions to existing 
data structures which allows them to be passed to subroutines and addressed in the 
parallel code in the same manner as the original data structures. This hides the paral- 
lelism and results in only small source file changes being required to extend mesh as it
CHAPTER 3. DOMAIN DECOMPOSITION
Figure 3.7: A mesh of 28 triangles divided into two sub-domains with the overlaps 
required for the stress scheme.
is implemented in the serial code. For example the array of grid points in Fortran may 
be declared as;
INTEGER DIMENSION, NO_OF_GRID_POINTS
INTEGER GRID_POINTS(1:DIMENSION, 1:NO_OF_GRID_POINTS)..
This array may be easily extended to include overlaps as;
INTEGER DIMENSION, EXTD_NO_OF_GRID_POINTS
INTEGER GRID_POINTS(1:DIMENSION, 1:EXTD_NO_OF_GRID_POINTS)
Clearly this structure will still be correctly declared in all subsequent subroutines calls 
without any code modification. Subroutines may be called with either the original or the 
extended point count and the declaration will remain consistent. If however the array of 
grid points is declared as;
INTEGER GRID_POINTS(1:NO_OF_GRID_POINTS, 1:DIMENSION)
CHAPTER 3. DOMAIN DECOMPOSITION
Then the array may also be extended as;
INTEGER GRID_POINTS(1:EXTD_NO_OF_GRID_POINTS, 1:DIMENSION)
But now each subroutine must declare grid points to the extended size in order to 
remain consistent. It may prove less invasive to change the serial code to reverse such 
declarations and subsequently all occurrences of the variable. Apart from cache effects 
such a modification will not affect the serial code and unlikely to raise objections from 
the serial code authors.
3.3.3 Parallel Execution Control and Renumbering
Consider the following code fragment that loops over each grid point in each element.
INTEGER NUMBER_OF_GP_IN_ELEMENT(1: NUMBER, OF.ELEMENTS)
INTEGER GP_IN_ELEMENT (1: MAX_NUM_GP_PER_ELE, 1: NUMBER_OF_ELEMENTS)
REAL XELE(1:NUMBER.OF.ELEMENTS)
REAL YGP(1:NUMBER_OF_GRID_POINTS)
DO I = 1, NUMBER_OF_ELEMENTS
DO J = 1, NUMBER_OF_GP_IN_ELEMENT(I)
XELE(I) = XELE(I) + YGP(GP_IN_ELEMENT(J,I))
END DO 
END DO
Two arrays are used in this example to describe the element topology;
NUMBER_OF_GP_IN_ELEMENT is a vector that contains the number of grid points that are 
in each element.
GP_IN_ELEMENT is a two dimensional array that contains the grid point number for each 
grid point in each element.
Two data items are involved; an element based variable XELE and a grid point based 
variable YGP . This code fragment can be implemented in parallel by using 'owner com- 
putes' execution control masks which are conditionals to control the scope of operations
CHAPTER 3. DOMAIN DECOMPOSITION
for each processor. In this example the execution control mask is implemented with a 
function OWNER.OF.ELEMENT that returns true only if the argument is an element number 
that is owned by the processor, the computation only being performed if this is the case.
DO I = 1, NUMBER_OF_ELEMENTS
IF ( OWNER_OF_ELEMENT(I) ) THEN
DO J = 1, NUMBER_OF_GP_IN_ELEMENT(I))
XELE(I) = XELE(I) + YGP(GP_IN_ELEMENT(J,D) 
END DO 
END IF 
END DO
However in order to achieve scalability of memory each processor can store only its own 
sub-domain. In this example the most fundamental mesh entity, the grid point, described 
as a set of coordinates, will renumber itself through the simple process of being packed 
into memory as a consecutive list of coordinates for each grid point in the sub-domain. So 
the core grid points are packed into the first 1 to LOCAL_NUMBER_OF_GRID_POINTS locations 
and the overlap grid points as LOCAL_NUMBER_OF_GRID_POINTS+1 to EXT_LOC_NUMBER_OF_GRID_POINTS. 
Where LOCAL_NUMBER_OF_GRID_POINTS is the number of grid points in the sub-domain core 
and EXT_LOC_NUM_OF_GRID_POINTS is the number of grid points in the entire sub-domain. 
Similarly extracting and storing (packing) only the local entries for the variables XELE, 
YGP and NUMBER.OF_GP_IN_ELEMENT is straightforward. Other mesh entities are however 
described as relationships or 'pointers' between entities. So packing GP_IN_ELEMENT re- 
sults in a list of global node numbers for each locally numbered element. To allow for 
this pointer arrays must be embedded into the code in order that each time the code 
refers to a grid point of an element the pointer array indirectly addresses a grid point in 
the local numbering scheme.
INTEGER NUMBER_OF_GP_IN_ELEMENT(1 :EXT_LOC_NUM_OF_ELEMENTS)
INTEGER GP_IN_ELEMENT(1 :MAX_NUM_GP_PER_ELE, 1:EXT_LOC_NDM_OF_ELEMENTS)
INTEGER PTR.ELE(1:NUMBER.OF.ELEMENTS)
INTEGER PTR.GP(1:NUMBER_OF_GRID_POINTS)
REAL XELE( 1:EXT_LOC_NUM_OF_ELEMENTS)
REAL YGP(1: EXT_LOC_NUM_OF_GRID_POINTS)
DO I = 1, NUMBER.OF.ELEMENTS
47
CHAPTER 3. DOMAIN DECOMPOSITION
IF ( OWNER_OF_ELEMENT(I) ) THEN
DO J = 1, NUMBER_OF_GP_IN_ELEMENT(PTR_ELE(I))
XELE(PTR_ELE(I)) = XELE(PTR_ELE(I)) + 
i- YGP (PTR.GP (GP_IN_ELEMENT (J ,PTR_ELE (I))))
END DO 
END IF 
END DO
Here two indirection pointer arrays are used PTR_ELE and PTR.GP which store the local 
element and grid point numbers respectively. For example if element number 28 is 
local element number 14 then PTR_ELE(28) has the value 14. The code still uses global 
numbers, only the addresses are indirected. A simple optimisation here is to move the 
element indirection upwards.
DO II = 1, NUMBER_OF_ELEMENTS
IF ( OWNER_OF_ELEMENT(II) ) THEN 
I = PTR.ELE(II) 
DO J = 1, NUMBER_OF_GP_IN_ELEMENT(I)
XELE(I) = XELE(I) + YGP(PTR_GP(GP_IN_ELEMENT(J,I))) 
END DO 
END IF 
END DO
These pointers will need to be globally sized and so do not scale in memory. Also the loop 
still increments over the global number of elements and so does not scale in processing. 
Execution of the control mask for every element can be a significant operation. Since 
PTR.ELE now represents the local renumbering implied by the array packing, the local 
element numbers in the above loop when the execution control mask is true will run from 
1 to LOCAL_NUMBER_OF_ELEMENTS. Therefore a further optimisation is possible by changing 
the loop limits to local numbering.
DO I = 1, LOCAL_NTJMBER_OF_ELEMENTS
DO J = 1, NUMBER.OF_GP_IN_ELEMENT(I)
XELE(I) = XELE(I) + YGP(PTR_GP(GP_IN_ELEMENT(J,I)))
END DO 
END DO
Now only one pointer is required but it remains globally sized and so is still not scalable. 
If all uses of GP_IN_ELEMENT throughout the code are as the index of the array PTR.GP
CHAPTER 3. DOMAIN DECOMPOSITION
then this indirection can be propagated upwards to the highest level where PTR_GP is 
used to renumber the contents of GP_IN_ELEMENT to a local grid point numbering scheme. 
The example now becomes
DO I = 1, LOCAL_NUMBER_OF_ELEMENTS
DO J = 1, NUMBER_OF_GP_IN_ELEMENT(I)
XELE(I) = XELE(I) + YGP(GP_IN_ELEMENT(J,I))
END DO 
END DO
If this code fragment exists inside a subroutine where NUMBER.OF.ELEMENTS is passed 
into the subroutine as an argument then the calling routine can be modified to call the 
subroutine with LOCAL_NUMBER_OF_ELEMENTS so that 
This thesis follows the option of re-numbering each entire sub-domain to a local 
numbering scheme as this has been shown above to be consistent with objectives (ii) and 
(iii). Each processor 'sees' its renumbered sub-domain as a complete mesh consisting 
of 1 to elements and 1 to grid points where and are the local number of 
elements and grid points respectively. This can be carried out at the highest possible 
level in the code, that is where the problem specification is read from file. A record of the 
global (serial) numbers for each local mesh entity (referred to as a decomposition index) 
is stored on each processor in order to allow reconstruction of data back into its original 
global form. Translation back from local to global numbering using this record is only 
required as an i/o process when writing variables to file. Rebuilding of global variables is 
carried out by the i/o (master) processor and so this is the only processor that requires 
the decomposition indices, however the indices are distributed with the sub-domains to 
maintain scalability of memory. This scheme can encounter difficulty when the problem 
size increases to the point at which the geometry description will no longer fit into the 
memory of the master processor. This is not however insurmountable and is discussed 
further in Section 4.2 and Chapter 7. The effect of renumbering is illustrated in Figures 
3.8 and 3.9. Consider the element partition in Figure 3.9 The partition list of 
processor numbers that own each element as returned from the partitioner utility is as 
follows;
1111112222222111111112222222
49
CHAPTER 3. DOMAIN DECOMPOSITION
Figure 3.8: A mesh of 28 triangles divided into two sub-domains showing the renumbering 
of grid points from global to local numbering.
~,~ * 
Figure 3.9: A mesh of 28 triangles divided into two sub-domains showing the renumbering 
of elements from global to local numbering.
50
CHAPTER 3. DOMAIN DECOMPOSITION
The resulting element renumbering as stored in PTR_ELE is listed in Table 3.2. The
Global
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
Processor 1
1
2
3
4
5
6
15
0
0
0
0
0
16
7
8
9
10
11
12
13
14
17
18
0
0
0
0
0
Processor2
0
0
0
0
0
15
1
2
3
4
5
6
7
16
17
0
0
0
0
0
18
8
9
10
11
12
13
14
Table 3.2: Element indirection pointer arrays for the partition illustrated in Figure 3.9
renumbering has maintained the core elements as the first 14 elements in each partition 
allowing the transformation to local loop limits. The implications of renumbering are 
discussed further in Section 4.3.
3.3.4 Overlap Communication
The notion of the mesh overlaps is that each processor calculates only the values of core 
variables. That is variables that are associated with mesh entities within its own domain,
51
CHAPTER 3. DOMAIN DECOMPOSITION
no computation being performed on the overlaps. Variable values are then swapped into 
the overlap from the processors on which the variables are calculated, as shown in Fig- 
ure 3.10. This is a one way communication process between all adjacent sub-domains. 
Data travels only from the core of the sub-domains (where it is calculated) into the 
overlaps of adjacent sub-domains (where it is used). There are however some rather 
obvious exceptions, where data operations are so trivial that it is faster to perform the 
operation locally on the overlap than to import the new values from a neighbour (see 
Jacobi example in Appendix C. For example, setting a variable to a fixed value, zero 
for instance, requires a processor only to write a register to memory. This will undoubt- 
edly be faster than reading data from the communication port and writing the data 
back out to memory. Implementation of such exceptions may be seen as an optimisa- 
tion of the parallelisation. Indeed such optimisations may produce an improvement in 
performance on some platforms and not others. Overlap values are generally exchanged 
between processors as soon as practically possible, usually whenever a variable has been 
fully updated, for example, at each iteration of a solver. Asynchronous communication 
schemes may be used to improve the parallel performance by overlapping communication 
with calculation. This is discussed further in Chapter 5. The coordination of overlap
Figure 3.10: Overlap update communication scheme.
data exchange requires a communication template for each sub-domain which holds the 
mesh entity numbers to be sent and the processor number to which they are to be trans-
CHAPTER 3. DOMAIN DECOMPOSITION
mitted. A corresponding template records the entity numbers to be received and the 
processor number from which they will arrive. These templates must be matched across 
each sub-domain boundary so that the data sent from one sub-domain is received in 
the anticipated order in the adjacent sub-domain. This is achieved by preserving the 
global ordering of the elements. For a simple processor interconnection topology such as 
a pipeline (a one dimensional chain), where the partition can guarantee mapping to the 
processor topology, the template becomes reasonably straightforward. Exchange of data 
between processors can be synchronised by the template on an odd-even alternate pair 
basis. This is a four cycle process described in the following table.
Processor Number Odd Even
Send right Receive left
Receive right Send left
Send left Receive right
Receive left Send right
Table 3.3: Communication operations required for a simple chain of processors
This simple scheme enables the exchange to be carried out as a parallel process. 
More elaborate processor topologies can be handled with variations on such a scheme. 
Regular two dimensional processor arrays can for instance use red - black checkerboard 
type schemes. It cannot however be assumed that the mesh can be partitioned in such 
a way as to map perfectly to the processor interconnection topology (Section 3.2.3). 
A scheme is required which can cope efficiently with an unstructured partition of an 
unstructured mesh mapped imperfectly to an array of processors. This is a scheduling 
problem of the type familiar to operational research [Wil84].
The scheme adopted involves constructing the graph G(P, of processors and 
sub-domain (processor) interconnections and attaching weights to the interconnects 
according to the size of the interface. This graph is initially sorted by weight with the 
processor pair having the largest amount of data to communicate being first. The graph 
is then scheduled to provide a sequence in which exchanges occur as a parallel process
53
CHAPTER 3. DOMAIN DECOMPOSITION
with the largest exchanges first. Starting with the heaviest node pair, the processor 
numbers are recorded. The graph is then searched for the next heaviest weight that does 
not use one of the already recorded processors. When found this processor pair is sorted 
to be the next entry in the graph. This operation is carried out until either all processors 
are involved in communication or an unrecorded processor pair is no longer available for 
scheduling. If there are still entries in the graph that have not been scheduled the list 
of recorded processors is cleared and the process repeated until all processor pairs have 
been scheduled. This results in a layering of exchange communication processes which 
should be (but is not guaranteed to be) no deeper than the maximum node degree of 
the processor graph 
Consider the mesh illustrated in Figure 3.11 decomposed into three renumbered sub- 
domains in Figure 3.12 Here the overlap renumbering has followed the original global
Figure 3.11: Mesh of 42 triangular elements.
numbering scheme Processor (a) must receive data for overlap elements 17 and 18 from 
processor (b) where they are numbered 6 and 9 respectively. Similarly processor (b) must 
receive data for overlap elements 15 and 16 from processor (a) where they are numbered 
3 and 8 respectively. The communications for this example may be carried out in six 
stages as follows:
54
CHAPTER 3. DOMAIN DECOMPOSITION
'. 16 .--' \
Figure 3.12: Mesh of 42 triangular elements partitioned into three renumbered sub- 
domains.
Processor (a)
1 Sending to processor (b) elements 3 and 8
2 Receiving from processor (b) elements 17 and 18
3 Sending to processor (c) elements 9 and 12
4 Receiving from processor (c) elements 15 and 16
Processor (b)
1 Receiving from processor (a) elements 15 and 16
2 Sending to processor (a) elements 6 and 9
5 Sending to processor (c) elements 5, 7, 10, and 13
6 Receiving from processor (c) elements 17, 18, 19 and 20
Processor (c)
55
CHAPTER 3. DOMAIN DECOMPOSITION
3 Receiving from processor (a) elements 15 and 19
4 Sending to processor (a) elements 5 and 6
5 Receiving from processor (b) elements 16, 17, 18 and 20
6 Sending to processor (b) elements 6, 8, 10 and 14
Note that these element numbers are always in increasing order both globally and 
locally. The sending is always carried out first to allow parallelism in packing.
Data that is to be transmitted from a sub-domain core is collected into a data buffer 
which allows one transmission and therefore only one latency to complete the transfer. 
Unpacking of data from a buffer is an overhead that is not necessary for data reception. 
Data is only ever received into an overlap, so arranging for the overlap renumbering 
scheme to consecutively number overlap entities that are owned by the same processor 
allows incoming data to be received directly into the overlap memory. So the global 
number ordering is preserved for each interface to other processors, but not throughout 
the overlap. In the above example elements 15 and 19 on processor (c) are in the core 
of processor (a) and so should be numbered consecutively. This involves renumbering 
overlap element 19 on processor (c) to be 16 and then overlap elements 16, 17, 18 and 
20 to be 17, 18, 19 and 20 respectively.
Chapter 4
Algorithm Decomposition
The algorithms employed in unstructured mesh codes have invariably been developed us- 
ing the traditional Von Neumann programming model of sequential instruction execution. 
The conversion of these serial algorithms into parallel algorithms may be straightforward, 
or may be very involved. Parallelism exists in many forms with a CM code. Having 
chosen a geometric (topologic) DD strategy, decomposition of the algorithms to concur- 
rently operate locally within each sub-domain whilst performing the same operations as 
the original serial algorithm becomes a largely automatic process of communicating data 
as and when required. Profiling CM execution shows that the majority of run time is 
spent within the matrix equation solvers. It is these solvers that are subjected to close 
scrutiny to extract the maximum possible parallel efficiency Ideally we should be able 
to meet objective (i) and produce results from the parallel code that identically match 
the results produced by the serial code. This may not however, be either practical or 
possible. A variation between the serial and parallel code is sometimes inevitable. There 
are instances where it may be more important for example to meet objective (iii) and 
produce a highly efficient parallel code at the expense of failing to precisely meet ob- 
jective (i). Again it will usually be a case of having to make an intelligent decision as 
to which is the overridingly important criteria. Often there is little choice but to either 
modify the algorithm or else suffer unacceptable inefficiency.
CHAPTER 4. ALGORITHM DECOMPOSITION
4.1 UIFS - Unstructured Incompressible Flow and Stress
The code used as a vehicle for developing the parallel strategies used in this thesis is 
known as UIFS. Developed for the purpose of modelling the processes involved in metals 
casting UIFS is a 2D unstructured mesh code for solving the Navier Stokes equations 
for transient and steady state flow problems with solidification [Cho93] along with the 
elastic stress-strain equations [FBCL91, CBCP92]. The techniques developed for UIFS 
have led to the development of the 3D code PHYSICA which provides even greater 
modelling flexibility for multi-physical processes.
4.1.1 The FV Fluid Dynamics Scheme
The Finite Volume (FV) (irregular control volume) fluid dynamics scheme in UIFS solves 
for flow on a single unstructured mesh using a modification of the SIMPLE algorithm 
of Patanker and Spalding [PatSO]. This is a cell centred scheme in which the control 
volume is formed by the elements of the mesh which may be any arbitrary shape. The 
definition of a staggered grid as used by Patanker is not clear for an unstructured 
mesh. So the scheme uses a co-located grid with the Rhie and Chow [RC82] pressure 
weighted interpolation method to suppress pressure oscillation. The solidification scheme 
uses the Voller and Cross enthalpy method [VCM87] to model the velocity correction 
and latent heat release during phase change. The dependency required by the solvers in 
this element centred finite volume scheme is simple nearest neighbour as illustrated in 
Figure 1.3(a). However in order to evaluate the cell volumes for the displaced grid the 
grid point dependency as shown in Figure 1.3(d) is also required. Hence the definition 
of the overlap mesh entities as given in Section 3.3.2. The scheme produces a sparse 
irregular diagonally dominant system matrix which may be solved using either Jacobi or 
Gauss Seidel SOR iterative methods. The fluid dynamics loop is illustrated in Figure 4.3. 
The number of iterations for each of the momentum, pressure and heat solvers are set 
at run time along with the maximum and minimum number of sweeps around the fluid 
dynamics loop. Convergence is based on the residuals of all of momentum, heat and
58
CHAPTER 4. ALGORITHM DECOMPOSITION
pressure variables.
Momentum Equations
The equations governing the conservation of momentum for an incompressible fluid in a 
cartesian system of coordinates may be expressed as:
+ V   V   (AtViii) (4.1) 
Here is the momentum in the axis, similar equations govern the momentum in the 
other axis. The other terms are; the density the resultant velocity, v, the viscosity //, 
the pressure p, the face normal component and the momentum source in the axis 
The momentum source term includes the buoyancy source and the Darcy source 
terms which couple the momentum equation to the energy equation.
= ^>6j > "^boundary ^other 
Continuity Equation
Then continuity equation governing mass conservation can be expressed as:
|£ + V   (4.3) 
Here is the mass source.
Energy Equation
Conservation of energy can be written as:
dt + V   V   (fcVT) + (4.4)
Where is the specific enthalpy, is the thermal conductivity, is the temperature and 
s^ is the volumetric source for heat. This equation may be expressed solely in terms of 
temperature using where is the specific heat.
59
CHAPTER 4. ALGORITHM DECOMPOSITION
Buoyancy Source
The source terms sUt in Equation 4.1 couples into the energy equation through the 
buoyancy terms. Two alternative buoyancy terms are available in UIFS; constant and 
variable density. The constant density approximation Boussinesq source s^ in the 
direction can be expressed as
(4.5)
Where pref is the constant density, is the volumetric coefficient of thermal expansion, 
is the temperature, Tref is the reference temperature (temperature for pref) and is the 
acceleration due to gravity in the direction. Density may be more accurately expressed 
as a function of temperature so the buoyancy source becomes
(4.6)
Solidification Sources
For a system undergoing a change of phase from liquid to solid (or solid to liquid) the 
total enthalpy can be expressed as the sum of the 'sensible' enthalpy and the latent 
heat A#
. (4.7)
Latent heat will be some function of temperature
= (4.8)
which may be written in terms of the latent heat of solidification and liquid fraction 
(ratio of liquid to solid) 
(4.9)
Combining this with Equation 4.4 gives the enthalpy source due to the latent heat of 
solidification as
(4.10)
CHAPTER 4. ALGORITHM DECOMPOSITION
Velocity correction for changes in material properties during phase transition uses the 
Darcy source term
(4.11)
where is the viscosity and is the permeability. Little data is available for the viscosity 
and permeabilities of materials undergoing phase transition so a simple approximation 
involving the liquid fraction is used
= (4.12) 
where is an empirical constant.
4.1.2 The FV Solid Mechanics Scheme
The grid point (vertex) based solid mechanics code uses the finite volume unstructured 
mesh procedure of Fryer [FBCL91, Fry93] for the solution of the elastic stress-strain 
equations for bodies undergoing thermal or mechanical loads.
Governing Equations
The general equilibrium equations governing the conservation of force on a static body 
are
(4 - 13)
Where and are the components of normal stress, shear stress and body forces 
acting in direction In matrix form the above equations become
cr = Ds 
where the stress vector is = <Jzy ) T and the elastic strains are = 
£xy) T - The matrix holds the material elastic properties; Youngs modulus
61
CHAPTER 4. ALGORITHM DECOMPOSITION
and Poissons ratio // where for plane strain
(1 - /Li2 )
1
0
1
0
0
0
9 -
(4.15)
The total strain is related to displacement by
<T> Ld (4.16)
Where the displacement vector d = represents displacement in the and 
directions and L holds the differential operators
L-
d ^\ o
(4.17)
Thermal strains are given by
(4.18)
where a is the coefficient of thermal expansion, AT is the temperature change and
Discretisation of the Solution Domain
This scheme forms a control volume around each grid point with contributions to the 
control volume from each of the surrounding mesh elements as illustrated in Figure 4.1. 
Here the sub control volumes in each surrounding element are formed by connecting 
the element centres to the face centres. Temperature and displacement variables are 
stored at the grid points and the material properties, Youngs modulus, Poisson ratio, 
etc., are associated with the elements. The equilibrium equations are integrated over the 
control volumes where the divergence theorem is used to transform the area integrals
CHAPTER 4. ALGORITHM DECOMPOSITION
Figure 4.1: Formation of a control volume from sub-control volumes around point P.
into line integrals which enables the stresses to be approximated at the integration points 
on the surface of the control volume. The discretisation uses reference elements to 
represent the mesh elements in a local coordinate system in a manner similar to the 
Finite Element (FE) method [SR87] (Figure 4.2). This is a computationally efficient 
scheme which obtains approximations to the derivatives in the equilibrium equations in 
local coordinates and uses a Jacobian matrix to map the approximations back to global 
coordinates. A variable 0 and its derivatives can be approximated anywhere within an 
element of grid points using Equations 4.19 and 4.20.
(4.19)
1=1
= 
The shape functions for a bilinear quadrilateral are
= 0.25(1
CHAPTER 4. ALGORITHM DECOMPOSITION
Figure 4.2: Mapping of a finite volume element to a reference element.
= 0.25(1 
0.25(1 
= 0.25(1+ s)(l-t)
The Jacobian matrix in Equation 4.21 is used to map the derivatives of the shape function 
from local to global coordinates.
. L -I
r i 
(4.21)
Discretisation of the equilibrium equations
cr DLNu 
/ (DBu) n f 
-i
\
4.2 Parallelisation of UIFS
IF ( MASTER ) THEN
4.2.1 Partitioning
4.2.2 Renumbering
4.2.3 Communication

4.3 Matrix Decomposition














_____________ CHAPTER 5. PERFORMANCE OF THE PARALLEL CODE
_____________ CHAPTER 5. PERFORMANCE OF THE PARALLEL CODE
_____________CHAPTER 5. PERFORMANCE OF THE PARALLEL CODE

CHAPTER 5. PERFORMANCE OF THE PARALLEL CODE

C0 3500 
CD 
.g> 3000










~O 


























I 


- 




























n 




\s~
C
r






\j
























