Parallel solution of power system linear equations by Grey, David John
Durham E-Theses
Parallel solution of power system linear equations
Grey, David John
How to cite:
Grey, David John (1995) Parallel solution of power system linear equations, Durham theses, Durham
University. Available at Durham E-Theses Online: http://etheses.dur.ac.uk/5429/
Use policy
The full-text may be used and/or reproduced, and given to third parties in any format or medium, without prior permission or
charge, for personal research or study, educational, or not-for-profit purposes provided that:
• a full bibliographic reference is made to the original source
• a link is made to the metadata record in Durham E-Theses
• the full-text is not changed in any way
The full-text must not be sold in any format or medium without the formal permission of the copyright holders.
Please consult the full Durham E-Theses policy for further details.
Academic Support Office, Durham University, University Office, Old Elvet, Durham DH1 3HP
e-mail: e-theses.admin@dur.ac.uk Tel: +44 0191 334 6107
http://etheses.dur.ac.uk
The copyright of this thesis rests with the author. 
No quotation from it should be published without
his prior written consent and information derived
from it should be acknowledged.
Parallel Solution of Power 
System Linear Equations 
David John Grey 
B.Eng. (York) 
School of Engineering and Computer Science 
University of Durham 
A thesis submitted in partial fulfilment of the requirements 
of the Council of the University of Durham for the Degree 
of Doctor of Philosophy (Ph.D.).
February 1995 
Abstract 
At the heart of many power system computations lies the solution of a large sparse
set of linear equations. These equations arise from the modelling of the network and
are the cause of a computational bottleneck in power system analysis applications.
Efficient sequential techniques have been developed to solve these equations but
the solution is still too slow for applications such as real-time dynamic simulation
and on-line security analysis. Parallel computing techniques have been explored in
the attempt to find faster solutions but the methods developed to date have not
efficiently exploited the full power of parallel processing.
This thesis considers the solution of the linear network equations encountered in
power system computations. Based on the insight provided by the elimination tree,
it is proposed that a novel matrix structure is adopted to allow the exploitation of
parallelism which exists within the cutset of a typical parallel solution. Using this
matrix structure it is possible to reduce the size of the sequential part of the problem
and to increase the speed and efficiency of typical LU-based parallel solution. A
method for transforming the admittance matrix into the required form is presented
along with network partitioning and load balancing techniques.
Sequential solution techniques are considered and existing parallel methods are
surveyed to determine their strengths and weaknesses. Combining the benefits of
existing solutions with the new matrix structure allows an improved LU-based parallel
solution to be derived. A simulation of the improved LU solution is used to show the
improvements in performance over a standard LU-based solution that result from
the adoption of the new techniques. The results of a multiprocessor implementation
of the method are presented and the new method is shown to have a better
performance than existing methods for distributed memory multiprocessors.
Declaration 
I hereby declare that this thesis is a record of work undertaken by myself, that it has not
been the subject of any previous application for a degree, and that all sources of information
have been duly acknowledged.
(c) Copyright 1994, David John Grey
The copyright of this thesis rests with the author. No quotation from it should be published
without his written consent, and information derived from it should be acknowledged.
This thesis is dedicated to Anna, my wife and best friend.
Acknowledgments 
The following people have been vital to the production of this work, either in their direct
advice and input or just in putting up with me whilst I was going crazy writing it.
• To my wife, Anna - for her love and support.
• To my supervisor, Doctor Janusz Bialek of the University of Durham - for his direction
and advice.
• To Kelvey Marden of the University of Durham - for friendship.
• To Alan, Alex, John, Raghu, Jeremy, Juliette, Sue, Matt, Chris, Phil, Howard and
Hayley - for companionship and a good laugh when in need.
• To Alan - with grateful thanks for all the proof reading.
The following trademarks are acknowledged: IMS, INMOS, TRAM and occam are
trademarks of Inmos Limited; I.B.M. and P.C./A.T. are trademarks of International Business
Machines Corp.; Unix is a trademark of AT&T.
List of Abbreviations 
BBDF       Bordered Block Diagonal Form
CEGB       Central Electricity Generating Board (now the National Grid Company)
CSP        Communicating Sequential Processes
FPU        Floating Point Unit
IBM        International Business Machines
IEEE       Institute of Electrical & Electronic Engineers
I/O        Input and Output
MD         Minimum Degree
MDML       Minimum Degree Minimum Length
MDMLLRU    Minimum Degree Minimum Length Least Recently Used
MFLOPS     Million Floating point Operations per Second
MIMD       Multiple Instruction stream Multiple Data stream
MISD       Multiple Instruction stream Single Data stream
ML         Minimum Length
MLMD       Minimum Length Minimum Degree
RAM        Random Access Memory
RBBDF      Recursive Bordered Block Diagonal Form
RISC       Reduced Instruction Set Computer
RP         Recursively Parallel
SIMD       Single Instruction stream Multiple Data stream
SISD       Single Instruction stream Single Data stream
TRAM       Transputer Application Module
Contents
1 Introduction
1.1 The Components of a Power System
1.2 Power System Analysis
1.2.1 The Power Flow Problem
1.2.2 Power System Simulation
1.2.3 Power System Security
1.2.4 System Planning
1.2.5 Operator Training
1.3 Power Systems Analysis and Computer Architectures
1.4 Parallel Processing and Parallel Architectures
1.4.1 Classification of Computer Architectures
1.4.2 SIMD Architectures
1.4.3 MIMD Architectures
1.4.4 Interconnection Networks for MIMD Architectures
1.4.5 The INMOS Transputer
1.4.6 Bounds on Multiprocessor Performance
1.5 Parallel Processing in Power System Analysis Problems
1.6 Summary
1.7 Outline of Thesis
2 Solving the Network Equations
2.1 Modeling the Power System
2.1.1 The Generator Model
2.1.2 The Load Model
2.1.3 The Transmission Line Model
2.1.4 The Transformer Model
2.2 Formalizing the Problem
2.3 Linear Equations, Matrices and Sparsity
2.4 Direct Solution of the Linear Equations
2.4.1 Gaussian Elimination and Fill-Ins
2.4.2 LU Decomposition
2.4.3 LDU Decomposition
2.4.4 Bifactorisation
2.5 Pivotal Ordering
2.5.1 Pre-Ordering
2.5.2 Dynamic Ordering
2.6 Elimination Trees
2.7 Near Optimal Ordering Strategies
2.7.1 The Minimum Degree Algorithm
2.7.2 The Minimum Length Algorithm
2.7.3 The Minimum Degree Minimum Length Algorithm
2.7.4 The Minimum Length Minimum Degree Algorithm
2.7.5 The Minimum Degree Minimum Length Least Recently Used Algorithm
2.7.6 Comparative Analysis of the Ordering Methods
2.7.7 Deriving the Elimination Tree
2.8 Implementing a Sequential Solution of the Network Equations
2.8.1 Storage of Sparse Matrices
2.8.2 Determination of Elimination Ordering
2.8.3 Coefficient Matrix Factorisation Using Bifactorisation
2.9 Summary
3 Parallel Methods of Solving the Network Equations
3.1 Introduction
3.2 Iterative Methods for Solving Linear Equations
3.2.1 The Jacobi Method
3.2.2 The Gauss-Seidel Method
3.2.3 The Conjugate Gradient Method
3.3 Direct vs Iterative Methods
3.4 Parallel Algorithms for Direct Solution
3.4.1 Granularity of Solution
3.4.2 Task Mapping and Load Balancing
3.4.3 Ordering Strategies for Parallel Solutions
3.5 Diakoptical Based Solution Methods
3.5.1 The Method of Diakoptics
3.6 The Multiple Factoring Method
3.7 Parallel LU Decomposition Techniques
3.7.1 Chan's Method
3.7.2 The W-matrix Method
3.8 Cholesky Factorisation Techniques
3.8.1 The Parallel Fan-In Algorithm
3.8.2 The Parallel Fan-Out Algorithm
3.8.3 Frontal Methods
3.9 Summary
4 Elimination Trees, Network Partitioning and Load Balancing
4.1 Introduction
4.2 Balancing the Computational Load
4.2.1 The Two Approaches to Load Balancing
4.2.2 Load Balancing Methodologies Adopted by Other Parallel Solutions
4.3 The Elimination Tree and Parallel Processing
4.3.1 The Elimination Tree and Network Partitioning
4.3.2 Using the Elimination Tree to Achieve Load Balancing
4.3.3 Advantages of the Tree-based Approach
4.3.4 Performance of the Tree-based Load Balancing
4.4 Summary
5 An Improved Parallel Factorisation
5.1 Introduction
5.2 Development of the Recursively Parallel Method
5.2.1 Identifying the Potential Parallelism
5.2.2 The Recursive Bordered Block Diagonal Form
5.2.3 Balancing the Load
5.2.4 Reducing the Sequential Part of the Method
5.3 A Simulation of the Recursively Parallel Method
5.3.1 Implementation
5.3.2 Results of the Simulation
5.4 Summary
6 Issues of Parallel Implementation
6.1 Introduction
6.2 Algorithmic Issues
6.2.1 Program Structure and Task Design
6.2.2 Data Storage and Data Structures
6.2.3 Reducing the Communication Overhead
6.3 Architectural Issues
6.3.1 The Software Architecture
6.3.2 The Hardware Architecture
6.4 Performance of the Recursively Parallel Solution
6.5 Summary
7 Further Work
7.1 Automatic Network Partitioning
7.2 The Search for an Optimal Ordering
7.3 Block-oriented Solution and Vector Processing
7.4 Summary
8 Conclusions
8.1 Conclusions
A The INMOS Transputer
A.1 The Architecture of the Transputer
A.1.1 The T2 Family
A.1.2 The T4 Family
A.1.3 The T8 Family
A.2 Programming the Transputer
A.2.1 Tasks and Channels
A.2.2 Programming Languages
A.3 Building Parallel Systems with the Transputer
A.3.1 The TRAM Standard
A.3.2 The Experimental Setup
B Derivation Of The Models of Power System Elements
B.1 The Generator Model
B.2 The Transmission Line Model
B.2.1 Short Lines
B.2.2 Medium Length Lines
B.2.3 Long Lines
B.3 The Transformer Model
B.4 The Load Model
C Deriving the Bus Admittance Matrix
D Network Partitioning and Diakoptics
D.1 Node Tearing
D.2 Branch Cutting
E Proof of Liu's Tree Theorems
E.1 Notation
E.2 Other Theorems Required
E.3 Proof of the Tree Theorems
F Reducing the Length of Intertask Messages
G Monitoring the Performance of the Parallel Solution
H Test Systems
List of Figures
1.1 The components of a modern power system
1.2 A model of a synchronous generator connected to the transmission network
1.3 The basic von Neumann machine
1.4 Instruction and data streams in the von Neumann machine
1.5 Functional design of a MIMD computer
1.6 Static interconnection network topologies
1.7 Conceptual overview of the configuration of the Transputer based parallel machine
1.8 Various bounds on parallel performance
1.9 Speed-up predicted by Amdahl's Law
2.1 Transmission line equivalent π circuit model
2.2 Equivalent circuit of a practical single phase transformer
2.3 General n-port representation of a power system
2.4 A simple 10 node graph and its associated matrix
2.5 Filled graph and associated matrix for the simple 10 node example
2.6 The elimination tree of the 10 node example
2.7 The effect of storage scheme on nodal degree
2.8 Storage Of Sparse Matrix Rows In Linked Lists
2.9 Storage Of Extra Information About The Matrix
2.10 The information array used for updating
3.1 An elimination tree and the wrap mapping strategy
3.2 Structure of conventional parallel solution algorithm
3.3 Simple BBDF factorisation example, showing independence between operations
3.4 Algorithm structure of Chan's method, using duplicate cutset computation
3.5 The three flavours of Cholesky factorisation
4.1 Simple load balancing example
4.2 Graphical depiction of the execution of the example program
4.3 Three task implementation of the LU-based solution
4.4 A supervisor/worker approach to parallel bifactorisation
4.5 Partitioning of the elimination tree and corresponding network partitions
4.6 Treating the lower portion of the tree as a separate subnetwork
4.7 Geist & Ng's partitioning method applied to a simple tree
5.1 The simple 12 node example system
5.2 Partitioned topologically reordered network
5.3 Subnetworks constrained to a binary tree connection structure
5.4 The interconnections giving rise to RBBDF
5.5 The constrained subnetwork interconnections
5.6 Partitioning of the reduced CEGB 734 node system for solution on 8 processors
5.7 The block oriented data structure
5.8 The row oriented data structure
5.9 Algorithm structure for the Recursively Parallel method
5.10 Algorithm structure for repeated substitution with multiple right hand sides
5.11 Overall speed-up results of simulated solution of the four test systems
5.12 Factorisation speed-up results of simulated solution of the four test systems
5.13 Substitution speed-up results of simulated solution of the four test systems
6.1 Task structures with different granularity
6.2 Intertask communications in a 7 subnetwork solution
6.3 Intertask communications in 3 and 15 subnetwork solutions
6.4 The generic task of the Recursively Parallel solution
6.5 RBBDF matrix structure showing 'r' segments
6.6 Portion of the coefficient matrix stored by a single task
6.7 Storage techniques used by the Recursively Parallel method
6.8 Assessing the density of the coefficient matrix
6.9 Variation of speed-up with location of changeover point in hybrid storage, for US 1624 node system
6.10 Modified intertask communications for a 7 subnetwork system
6.11 A many to one task allocation strategy
6.12 The acyclic graphs used in determining task allocations
6.13 Contraction mapping for Recursively Parallel task allocation
6.14 Task execution, showing the effect of multitasking
6.15 Direct and indirect communication
6.16 The Mutated Tree interconnection network
6.17 Performance curves for the Recursively Parallel method, with 2-1 task allocation
6.18 Speed-up against uniprocessor for the Recursively Parallel method, with 2-1 task allocation
6.19 RP method compared to predicted performance
6.20 RP method compared to simulated performance
6.21 RP method compared to Padhila's W-matrix method
6.22 RP method compared to Lau's method
6.23 RP method compared to Chan's method
7.1 Conceptual view of a simple four subnetwork system and its BBDF matrix
List of Tables
1.1 Flynn's Taxonomy
2.1 Statistical performance of the ordering algorithms
2.2 The effect of elimination ordering on speed-up
3.1 Operation counts for direct and iterative solution schemes
4.1 Speed-ups for the load balancing example
4.2 Effect of load balancing on speed-up for an LU-based solution
4.3 The effect of load balancing on speed-up
5.1 Results of the simulated solution of the four test systems
5.2 Simulated RP solution vs best sequential solution
6.1 The effect of storage scheme on speed-up
6.2 Effect of load imbalance on performance
6.3 Comparison of performance using pipeline and hypercube architectures
6.4 Characteristics of the test systems
6.5 Performance of the best sequential algorithm
6.6 Performance of the Recursively Parallel solution of the test systems
6.7 Performance of the Recursively Parallel solution of the test systems, using 1-1 task allocation
6.8 Performance of the Recursively Parallel solution of the test systems - speed-up over uniprocessor
6.9 Simulated, predicted and observed factorisation speed-ups for the RP solution of the test systems
Chapter 1 
Introduction 
Modern electrical power systems represent some of the largest non-linear systems in
existence. Today's power systems are complex interconnected networks consisting of
many thousands of nodes, and analysing these systems and planning their expansion and
performance is no simple task. The modern power engineer is forced to rely upon powerful
software tools to enable him to get the best out of the system. As systems continue to
grow more complex and engineers wish to gain deeper insight into their workings there is
a continually growing demand for faster and more powerful software analysis tools. Many
such tools have been developed over the past four decades but there is always more that
can be done. This thesis focuses on the solution of one of the problems which currently
limits the performance of many power system analysis applications. Solving this problem
surmounts an important obstacle in the creation of more efficient analysis tools.
1.1 The Components of a Power System 
Gross [1] defines an electrical power system as 
"a network of interconnected components designed to convert non-electrical energy con-
tinuously into the electrical form; transport the energy over potentially large distances; trans-
form the electrical energy into a specific form subject to close tolerances; and convert the 
electrical energy into a usable non-electrical form." 
This definition shows that the power system consists of three basic components:
Figure 1.1: The components of a modern power system. The figure labels five stages
(generation, transmission, sub-transmission, distribution and use) and three conversion
steps (non-electrical to electrical conversion, electrical energy transmission, and
electrical to non-electrical conversion).
• Non-electrical to electrical energy conversion 
• Electrical energy transmission 
• Electrical to non-electrical energy conversion
In fact this can be subdivided further and the modern power system thus has the five 
facets shown in Figure 1.1.
The generation process turns non-electrical kinetic energy of a rotating shaft into electri-
cal energy. Kinetic energy is imparted to the shaft by some form of turbine. Steam, gas and 
hydro turbines are the most common prime movers, although wind turbines are making an 
increasing contribution to electricity generation. The generator itself makes use of the prin-
ciple of electromagnetic induction to generate three sinusoidal alternating voltages which 
differ in phase by 120°. This is the most efficient form for electrical energy transmission
as it makes best use of the transmission cables and gives constant power in balanced loads.
The aim of generation is simply to pump enough electrical energy into the system to satisfy 
the demand of the end users and to account for the losses which occur in transmission. 
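The constant-power property of a balanced three-phase supply can be checked with a short worked equation. This is only a sketch: the symbols $V_m$ and $I_m$ (peak phase voltage and current), $\omega$ (angular frequency) and $\varphi$ (load phase angle) are introduced here for illustration and do not appear in the thesis, and a perfectly balanced load is assumed.

$$
\begin{aligned}
p(t) &= \sum_{k=0}^{2} V_m \cos\!\left(\omega t - \tfrac{2\pi k}{3}\right)\, I_m \cos\!\left(\omega t - \tfrac{2\pi k}{3} - \varphi\right) \\
     &= \frac{V_m I_m}{2} \sum_{k=0}^{2} \left[\cos\varphi + \cos\!\left(2\omega t - \tfrac{4\pi k}{3} - \varphi\right)\right] \\
     &= \frac{3}{2}\, V_m I_m \cos\varphi
\end{aligned}
$$

The three double-frequency terms are mutually displaced by 120° and sum to zero, so the total instantaneous power does not vary with time, whereas the power in a single phase pulsates at twice the supply frequency.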
Strictly speaking the transmission network takes power from substations at the gener-
ating sites and delivers it to substations in the load centres. The generation sites are often 
remote from areas of load for many reasons. Originally generators were located where there
was a ready supply of fuel, e.g. near coal fields, but such areas are seldom areas of electricity
demand. As it is cheaper and easier to transport the electrical energy to areas of demand,
power stations have tended to be sited near the fuel supplies. In some cases this is essential 
as the 'fuel' cannot be moved e.g. hydropower and wind power. Government policy in 
the UK has also affected the location of power generation plant with current legislation 
dictating that all fossil fuel burning power plant have to be sited outside urban areas. Con-
sequently there is a need for transmitting electrical energy from generators to loads in the 
most efficient way possible. The UK national grid is a typical transmission system con-
sisting of transmission lines with a total length exceeding 7,500 kilometres. This network 
of lines provides a high degree of interconnectivity between the sites of generation and the 
major load centres. It also ties the UK system into other continental systems to increase
the reliability of the system and the availability of power. The transmission system has to
be able to deliver large amounts of power to the loads and to ensure reliability it has to be
able to deliver this power by a number of alternative routes. The process of transmitting 
electrical energy is not 100% efficient and the losses are proportional to the square of the 
current, making it more efficient to transmit power at high voltage and low current. The 
UK grid developed in three distinct phases and it now operates at three different voltage 
levels of 132 kV, 275 kV and 400 kV. Transmission voltage levels vary throughout the world 
but they all lie in the range 100 kV to 1 MV.
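The benefit of high-voltage transmission follows from a one-line calculation. The sketch below makes simplifying assumptions that are not in the thesis: a single conductor of resistance $R$, a fixed delivered power $P$ at line voltage $V$, and unity power factor.

$$
I = \frac{P}{V}, \qquad P_{\text{loss}} = I^{2} R = \frac{P^{2} R}{V^{2}}
$$

Under these assumptions the loss falls with the square of the voltage. Delivering the same power over the same conductor at 400 kV rather than 132 kV would reduce the resistive loss by a factor of $(400/132)^2 \approx 9.2$.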
The distinction between subtransmission and distribution is not altogether clear but 
is dependent upon voltage levels and geographical extent. Subtransmission systems take 
electrical power from the point of arrival in the load area and deliver it to a number of
distribution substations throughout this area. Subtransmission operates at much lower
voltages than transmission, with 11 kV and 33 kV being the typical voltages used in the UK.
As well as the lower voltages, subtransmission systems also have a much smaller
geographical extent. Transmission systems extend to cover countries and whole continents whereas
subtransmission systems are usually limited to the extent of a given urban area.
The distribution network is the final link in the chain between the generator and the user.
Distribution is distinguished from subtransmission by its lower voltage levels with 12kV 
down to 2.4kV being common. The distribution system takes power from the distribution 
substation and forwards it to individual users who are located within a short distance ( 2km) 
of the substation. 
The purpose of transmitting electrical power through the various stages of the network 
is to supply the end user with electrical energy. Upon receipt of this energy the user 
converts it back into some more usable non-electrical form using any one of the electrical 
appliances available to him. To ensure that his appliances work correctly the user expects
his electricity supply to have constant voltage magnitude, constant voltage frequency and
an ideal sinusoidal waveform. To meet these criteria the power companies specify three
performance measures to which they must adhere:
• Voltage regulation - the deviation of voltage magnitude as load varies. Typically this 
is around 5%. 
• Frequency regulation - the deviation of system frequency from the nominal value. For 
a 50Hz system this is typically ±0.1Hz. 
• Harmonic content - The ideal supply has only a single sinusoidal component. 
There is a complex control system at the heart of every power system which allows 
the system operators to control these three parameters so that the user's demands on the 
quality of his electricity are satisfied. Controlling such a large and complicated system is 
no simple problem and to do so efficiently requires a detailed understanding of the system 
and how its constituent parts interact with one another. 
1.2 Power System Analysis 
Modern power systems are extremely large and complex entities. Planning, maintaining 
and operating such a system would be difficult if it were not for the wide range of analytical
methods available to help the power engineer. Today there is a wealth of software analysis 
tools to help in understanding any conceivable aspect of the power system. Prior to 1940 
there were very few interconnected systems of any complexity and analysis methods existed 
only for dealing with generators, transformers and transmission lines. With the advent of
interconnected systems came the development of techniques which enable the engineer to
determine the electrical state of the network and how it would respond to given disturbances
(i.e. load flow, stability analysis). Further work has led to the development of power flow
programs, contingency analysis tools and economic analysis packages. Requirements for
low operating and construction costs and the ever increasing complexity of systems have 
led to a constant demand for powerful new automated analysis methods. Software tools are 
available to aid analysis in the areas of operations planning, systems planning, contingency 
and security analysis, reliability and economics. Several of these areas are worth considering
in more detail but first the power flow problem must be examined. 
1.2.1 The Power Flow Problem
The applications outlined in the previous section are all limited in their performance by
the time taken to solve the power flow equations for the system under consideration. At
the heart of the power flow solution lies a large sparse set of linear network equations. It is
the solution of these equations which is so time consuming and over the last 35 years much
research effort has been expended on improving and accelerating this solution. The set of
algebraic equations is of the form

$$ \mathbf{A}\mathbf{x} = \mathbf{b} \tag{1.1} $$

where A is the matrix of coefficients of the set of equations, b is a vector of known values
whilst the elements of vector x are the unknowns in the equations. One of the earliest and
most successful improvements came through the application of sparse matrix techniques.
The matrix A has few non-zero elements and many zero elements; often more than 95%
of the matrix elements are zero. It is wasteful both of memory and solution time to store
the zero elements of this sparsely populated matrix. Sparse matrix techniques are used to
store and process only the non-zero coefficients and this gives a significant increase in the
speed of solution [2, 3].
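The idea of storing and processing only the non-zero coefficients can be illustrated with a minimal sketch. It uses compressed sparse row (CSR) arrays as a stand-in; this is an assumption chosen for brevity and is not the storage scheme of the thesis, whose list of figures points to linked-list row storage. The function names and the 5-node chain matrix are hypothetical.

```python
def to_csr(dense):
    """Compress a dense matrix into CSR arrays, keeping the non-zeros only."""
    values, col_index, row_start = [], [], [0]
    for row in dense:
        for j, a in enumerate(row):
            if a != 0:
                values.append(a)
                col_index.append(j)
        row_start.append(len(values))  # where the next row begins
    return values, col_index, row_start


def csr_matvec(csr, x):
    """Compute A x while touching only the stored non-zero coefficients."""
    values, col_index, row_start = csr
    y = []
    for i in range(len(row_start) - 1):
        s = 0.0
        for k in range(row_start[i], row_start[i + 1]):
            s += values[k] * x[col_index[k]]
        y.append(s)
    return y


# Hypothetical coefficient matrix for a 5-node chain network (illustration only).
A = [
    [2.0, -1.0, 0.0, 0.0, 0.0],
    [-1.0, 2.0, -1.0, 0.0, 0.0],
    [0.0, -1.0, 2.0, -1.0, 0.0],
    [0.0, 0.0, -1.0, 2.0, -1.0],
    [0.0, 0.0, 0.0, -1.0, 2.0],
]
csr = to_csr(A)
stored = len(csr[0])  # number of coefficients actually held in memory
b = csr_matvec(csr, [1.0, 2.0, 3.0, 4.0, 5.0])
```

Even in this tiny example only 13 of the 25 coefficients are stored. For a network of thousands of nodes, where the text notes that more than 95% of the entries are zero, the saving in memory and arithmetic is proportionally far larger.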
1.2.2 Power System Simulation
Like many other simulations, power system simulation involves an iterative process of solving
the equations which model the system. On each iteration the simulation time is incremented
and the system equations must be solved again for this new time step. If the time for one
iteration is less than the time constant of the fast dynamics of the system then the simulation
can be considered real-time. The integration time step of a typical power system simulation
is of the order of one second. This is sufficient to allow all but the very fast (sub-transient)
dynamics to be modeled. If the simulation is to operate in real-time it is necessary to solve
all the mathematical equations which model the system more than once a second.

Figure 1.2: A model of a synchronous generator connected to the transmission network.
The blocks are the excitation control system equations, the turbine governor equations,
the machine rotor electrical equations, the machine rotor mechanical equations, the stator
equations and the interconnected transmission system. The diagram is divided into a
differential equations part and an algebraic equations part.
A power system simulation consists of two distinct sets of equations relating to the differ-
ent parts of the system. Firstly there is the network model which describes the transmission 
network of the system and the stators of all the machines connected to the system. The 
network model consists entirely of algebraic equations. Secondly there is a set of non-linear
machine models which may be broken down into first order non-linear differential equations
that describe all the generators and loads connected to the network. On each iteration the
differential equations must be solved to determine the currents which are injected into the
transmission network. The linear equations must then be solved to determine the power
flows around the system and the voltage at each system bus. Figure 1.2 depicts a single 
machine connected to the transmission network and it shows how the model is divided into 
a set of differential equations and a set of algebraic equations. 
The differential equations are of the form

$$ \dot{\mathbf{y}} = f(\mathbf{y}, \mathbf{u}) \tag{1.2} $$

and this set of equations contains the differential equations of every machine connected to 
the network. Each machine in the system is only coupled to other machines through the 
transmission network and, as the network is treated separately, the differential equations 
are a collection of uncoupled sets of equations, one set for each machine. The equations 
may be represented as [4] 

$$ \dot{\mathbf{y}} = f(\mathbf{y}, \mathbf{u}) = \mathbf{A}\mathbf{y} + \mathbf{B}\mathbf{u} \tag{1.3} $$

where A is a sparse, square, block diagonal matrix and B is a rectangular sparse matrix 
with a block structure. When the effects of saturation are neglected A and B become 
constant in most models. Chapter 2 and Appendix B discuss the mathematical models of 
machines in detail. 
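Because A is block diagonal, each machine's block of (1.3) can be advanced in time without reference to any other machine. The following minimal sketch illustrates this with two hypothetical machines. The numerical values, the function names and the use of an explicit Euler step are all assumptions made for illustration; the thesis does not prescribe this integration method.

```python
def mat_vec(M, v):
    """Dense matrix-vector product for small illustrative matrices."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def euler_step(A, B, y, u, dt):
    """One explicit Euler step of y' = A y + B u (assumed integrator)."""
    Ay = mat_vec(A, y)
    Bu = mat_vec(B, u)
    return [yi + dt * (a + b) for yi, a, b in zip(y, Ay, Bu)]


# Two hypothetical machines, each with two states and one input.
A1, B1 = [[-1.0, 0.5], [0.0, -2.0]], [[1.0], [0.0]]
A2, B2 = [[-0.5, 0.0], [1.0, -1.0]], [[0.0], [2.0]]

# Full system: A is block diagonal, B is rectangular with a block structure.
A = [A1[0] + [0.0, 0.0], A1[1] + [0.0, 0.0],
     [0.0, 0.0] + A2[0], [0.0, 0.0] + A2[1]]
B = [B1[0] + [0.0], B1[1] + [0.0],
     [0.0] + B2[0], [0.0] + B2[1]]

y1, y2 = [1.0, 2.0], [3.0, 4.0]
u1, u2 = [1.0], [0.5]
dt = 0.01

# Stepping the whole system gives the same result as stepping each machine
# on its own, because the off-diagonal blocks of A and B are zero.
whole = euler_step(A, B, y1 + y2, u1 + u2, dt)
per_machine = euler_step(A1, B1, y1, u1, dt) + euler_step(A2, B2, y2, u2, dt)
```

The two results agree, which is why the machine equations parallelise trivially: the only coupling between machines enters through the network equations.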
The algebraic equations are of the form

$$ \mathbf{y} = g(\mathbf{x}, \mathbf{u}) \tag{1.4} $$

where u is the vector of stator voltages and is the input to the equations whilst y is a vector
of currents. The equations may be separated into two parts

$$ \mathbf{I}(\mathbf{E}, \mathbf{V}) = \mathbf{Y}\mathbf{V} \tag{1.5} $$

and

$$ \mathbf{u} = \mathbf{u}(\mathbf{E}, \mathbf{V}) \tag{1.6} $$

where Y is the bus admittance matrix, V is the vector of terminal voltages at load busses
and E is the vector of stator voltages at generation busses. I is the vector of bus injection
currents. Injection current at a generator bus is a function of stator voltage whilst the
injection current at a load bus is a function of the terminal voltage. It is important to
note that coefficients of Y depend on the topology of the network and these coefficients
may vary. If the topology of the transmission network changes due to equipment outages
then the values in the matrix Y are changed. When auto-tap changing transformers are
represented Y can change frequently.
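The dependence of Y on network topology can be seen in a minimal sketch. It assembles Y with the standard nodal rule (each branch admittance is added to the two diagonal entries of its end busses and subtracted from the two corresponding off-diagonal entries) and ignores shunt elements and transformer taps. The three-bus network, its admittance values and the function name are hypothetical.

```python
def build_ybus(n_bus, branches):
    """Assemble a bus admittance matrix from (from_bus, to_bus, admittance) branches.

    Shunt elements and transformer taps are ignored in this sketch.
    """
    Y = [[0j] * n_bus for _ in range(n_bus)]
    for i, j, y in branches:
        Y[i][i] += y
        Y[j][j] += y
        Y[i][j] -= y
        Y[j][i] -= y
    return Y


# Hypothetical three-bus network; admittances are illustrative values only.
branches = [
    (0, 1, 1 - 5j),
    (1, 2, 2 - 8j),
    (0, 2, 1 - 4j),
]

Y_intact = build_ybus(3, branches)
Y_outage = build_ybus(3, branches[:2])  # the line between busses 0 and 2 is out

# Entries of Y that differ between the intact and outaged networks.
changed = [(i, j) for i in range(3) for j in range(3)
           if Y_intact[i][j] != Y_outage[i][j]]
```

A single line outage alters only the four entries associated with its two end busses, but any factorisation of Y computed beforehand no longer corresponds to the network.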
For power system simulation the problem is to solve the differential and algebraic equa-
tions simultaneously. The conventional approach is to solve (1.3) separately by integration 
to yield values for y. Equation (1.5) is then solved and the solutions are alternated in some 
manner. Another approach is to solve (1.3) and (1.5) simultaneously using the implicit
trapezoidal rule. The solution of (1.5) is the bottleneck in the solution process and three 
methods of solution exist. 
1. Gauss-Seidel - This method is simple to program and easily accommodates changes 
in Y as the method operates directly on admittance matrix values [4]. The method 
is iterative and convergence to acceptable accuracy varies from problem to problem. 
Simple problems may converge in 2 or 3 iterations whereas difficult problems may 
require hundreds of iterations. 
2. Factorisation of Y - Triangular factorisation of Y is a direct method for solving the 
algebraic equations. LU decomposition is used to yield the factored form of Y and 
$$I = LUV \qquad (1.7)$$
where L is a lower triangular matrix and U is an upper triangular matrix. Forward 
and backward substitution with I is used to provide a solution for V. If Y does not 
change then L and U remain constant and V may be obtained by repeat solution of 
(1.7). Any change in Y requires a complete refactorisation to yield new values of 
L and U, before (1.7) can be solved for V. It is observed that factorisation takes 
up to six times as long as forward/backward substitution [4], but substitution only 
takes about 1.5 times as long as a single iteration of the Gauss-Seidel method, so 
it is difficult for Gauss-Seidel to be competitive. 
3. Newton's Method - Newton's Method is also an iterative method for solving systems of 
non-linear equations. Equation (1.5) becomes non-linear when non-impedance loads 
are connected to the network. Under these conditions the relationship between node 
voltage and bus current is non-linear and is defined by the impedance characteristics 
of the load. Newton's method linearises the equations using a truncated Taylor series 
approximation and forms the Jacobian matrix J [1]. Equation (1.5) can be rewritten 
as 
$$F = I - YV \qquad (1.8)$$
and F = 0 when the correct solution of V is obtained. The solution is obtained by 
iterative correction and each iteration requires the solution of the Jacobian matrix 
equation 
$$F = -J \Delta V \qquad (1.9)$$
LU factorisation and triangular substitution may be used to give a direct solution to 
this equation on each iteration. A strict implementation of Newton's method requires 
J to be updated and factorised for each iteration but this is computationally intensive. 
Faster solutions are achieved by allowing the triangular factors of J to be used for 
several consecutive iterations. 
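The trade-off between the first two methods can be made concrete with a small sketch. The code below is illustrative only: the 3-bus matrix is made up, real-valued and dense (a practical Y is complex and sparse), the factorisation omits pivoting and sparsity ordering, and all function names are hypothetical. It shows the triangular factors of Y being computed once and reused for several injection vectors, alongside a plain Gauss-Seidel sweep on the same matrix.

```python
# Hypothetical 3-bus admittance matrix (illustrative values, real-valued for brevity).
Y = [[4.0, -1.0, -1.0],
     [-1.0, 3.0, -1.0],
     [-1.0, -1.0, 5.0]]


def lu_factorise(a):
    """Doolittle factorisation a = L U, without pivoting (assumes non-zero pivots)."""
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def lu_solve(L, U, b):
    """Forward substitution with L, then backward substitution with U."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = b[i] - sum(L[i][j] * z[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


def gauss_seidel(a, b, tol=1e-10, max_iter=500):
    """Gauss-Seidel sweeps directly on the matrix values; returns (x, sweeps used)."""
    n = len(b)
    x = [0.0] * n
    for sweep in range(1, max_iter + 1):
        change = 0.0
        for i in range(n):
            s = sum(a[i][j] * x[j] for j in range(n) if j != i)
            new = (b[i] - s) / a[i][i]
            change = max(change, abs(new - x[i]))
            x[i] = new
        if change < tol:
            return x, sweep
    return x, max_iter


# Factorise once, then reuse the factors for each new injection vector I.
L, U = lu_factorise(Y)
injection_vectors = [[1.0, 0.0, 0.0], [0.0, 2.0, 1.0]]
direct_solutions = [lu_solve(L, U, I) for I in injection_vectors]

# The iterative alternative needs no factorisation but several sweeps per solve.
gs_solution, gs_sweeps = gauss_seidel(Y, injection_vectors[0])
```

While Y is unchanged only `lu_solve` is repeated, which is why the direct method is attractive when many solutions are needed between topology changes.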
The choice of the models used in the simulation has implications for the method chosen 
to solve the network equations. If there is no saliency, saturation or non-impedance loads 
in the model then the injection currents, I, are a function of stator voltage, E, only. An 
exact solution for the node voltages, V, can rapidly be obtained using (1.7). Introducing 
machine saliency leads to a constant term being introduced into the admittance matrix 
Y. To compensate for this a corrective term is added to V and this term is a function 
of I. It is necessary to iterate repeat solutions of (1.7) until V achieves convergence. 
The representation of non-impedance loads similarly leads to a portion of each load being 
represented by a shunt admittance in Y. The remainder of the load appears as a non-linear 
function of I in V. Again it is necessary to iterate repeat solutions of (1.7) until V achieves 
convergence. 
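A minimal sketch of such an iteration is given below. It is not the formulation used in this thesis: it assumes a real-valued two-bus network, a constant-power load at bus 2, and the common current-injection form in which the nominal part of the load sits in Y as a shunt admittance while the non-linear remainder is a voltage-dependent correction to the injection vector. All numerical values and names are hypothetical. The point illustrated is that Y is factorised once and only the substitution step is repeated.

```python
def lu_factorise(a):
    """Doolittle factorisation a = L U, without pivoting (assumes non-zero pivots)."""
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            L[i][k] = U[i][k] / U[k][k]
            for j in range(k, n):
                U[i][j] -= L[i][k] * U[k][j]
    return L, U


def lu_solve(L, U, b):
    """Forward then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = b[i] - sum(L[i][j] * z[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x


# Hypothetical per-unit data: generator admittance, line admittance, stator voltage, load power.
y_g, y_12 = 10.0, 5.0
E = 1.0
P = 0.5
y_s = P / 1.0 ** 2          # shunt admittance equivalent to the load at nominal voltage

# Y contains the constant shunt part of the load; it is factorised only once.
Y = [[y_g + y_12, -y_12],
     [-y_12, y_12 + y_s]]
L, U = lu_factorise(Y)


def injections(V):
    """Injection vector: generator term depends on E only, load term is non-linear in V."""
    return [y_g * E, y_s * V[1] - P / V[1]]


V = [1.0, 1.0]              # flat start
for iteration in range(1, 101):
    V_new = lu_solve(L, U, injections(V))     # repeat solution with constant factors
    step = max(abs(a - b) for a, b in zip(V_new, V))
    V = V_new
    if step < 1e-12:
        break
```

At convergence the current delivered over the line, `y_12 * (V[0] - V[1])`, equals the constant-power load current `P / V[1]`, so the shunt term and its correction cancel as intended.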
1.2.3 Power System Security 
In power system engineering terms security is defined as the 
'ability of the system to withstand any one of a predefined list of possible contingencies 
without serious consequences' 
where a contingency is an interruption in the normal functioning of the network caused 
by objects in the environment. The objective of power system operators is to maintain 
system voltages and power flows within defined limits regardless of changes in generation 
and load. Operating interconnected systems requires strict control over synchronisation; any 
loss of synchronisation may be catastrophic. Equipment outages themselves seldom cause 
much damage but the readjustment of voltages and power flows throughout the system may 
lead to a dangerous cascade of overloads causing large sections of the system to be switched 
out or damaged. In the past system security has been assured through the construction of 
robust systems. Rising costs and environmental concerns have meant that it is no longer 
economically or environmentally feasible to build extremely robust systems. As a result 
systems are being operated closer to their limits with smaller safety margins leading to 
a greater exposure to unsatisfactory recovery following disturbances. System security is 
no longer seen as a systems planning concern but as an exercise in risk aversion which is 
controlled by the system operators. There is therefore a pressing need for operators to keep 
a close eye on the security of the system. 
Assessment of system security is performed by determining the probability that the 
system will move from its normal operating conditions into an abnormal, or emergency 
state. These calculations are based upon a knowledge of the current state of the system, 
the conditions at the time and a forecast of load demand. In order to determine the response 
of the system it is necessary to apply these data to a model of the system and this is achieved 
using computer simulation. Contingency analysis packages attempt to determine what the 
response of the system will be to any of a list of possible contingencies, based upon a 
knowledge of current system state. A full analysis requires a simulation of the system for 
each combination of the contingencies in the list. 
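The growth in the number of cases can be illustrated with a short calculation. The sketch assumes that "each combination" means every non-empty subset of the contingency list, which gives 2^n - 1 cases for n contingencies; the contingency names and the helper functions are hypothetical.

```python
from itertools import combinations
from math import comb

# Hypothetical contingency list, for illustration only.
contingencies = ["line A-B out", "line B-C out", "generator G1 trip", "transformer T1 out"]


def all_cases(items):
    """Enumerate every non-empty combination of contingencies."""
    return [case
            for order in range(1, len(items) + 1)
            for case in combinations(items, order)]


def case_count(n, max_order=None):
    """Number of cases for n contingencies, optionally limited to a maximum order."""
    top = n if max_order is None else max_order
    return sum(comb(n, order) for order in range(1, top + 1))


cases = all_cases(contingencies)           # explicit enumeration for the small list
full_count_20 = case_count(20)             # all combinations of 20 contingencies
pairs_only_100 = case_count(100, 2)        # single and double outages of 100 contingencies
```

Restricting the study to low-order combinations, as in `case_count(100, 2)`, is one simple way of seeing why reducing the list makes on-line use more practical.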
Many utilities operate their control systems in conjunction with on-line security 
monitoring and contingency analysis tools. Operators can continually monitor the state of 
system security and how that might be changed by future events (the contingencies). Using 
contingency analysis on this basis allows operators to determine what is likely to happen 
to the state of the system and take the appropriate preventative, or corrective, action as 
necessary. The many thousands of cases that have to be considered for a full contingency 
analysis make it prohibitively slow for on-line usage. Many techniques have been developed 
which reduce the list of contingencies to include only those which are most likely to occur. 
This reduces the time needed to perform an analysis and makes on-line contingency analysis 
possible. Such packages are still slow, requiring between 5 and 15 minutes [5] to generate 
their results. If the contingency analysis package takes more than 15 minutes to complete 
its calculations then it is not worth using, as its predictions are based upon a model of the 
system which was updated too long ago for it to be accurate. Due to the high volume of 
calculations and the speed required, contingency analysis tools usually have to be executed 
on expensive, dedicated, high performance computers. 
Many of the current contingency analysis tools only consider the steady state system 
and ignore the dynamic response of the system to transients. Dynamic analysis is possible 
but this requires many more calculations than the steady state analysis and as yet is too 
slow for on-line operation. Future developments would ideally make real-time dynamic 
analysis packages available to operators giving them much faster response to contingencies 
and improving standards of system security. With the increasing speed and capacity of 
modern computers it should soon be possible to develop such tools using the techniques 
of parallel and distributed computing in conjunction with the latest generation of high 
performance processors. 
1.2.4 System Planning 
System planners are continually seeking ways to improve and expand the current system. 
Analysis tools can be useful in this work by allowing the designer to perform 'what-if' studies 
on various designs. Such studies are performed using computer models of the systems to 
simulate what happens to the systems when various changes are made. As system planning 
is not an on-line application there is no pressing need for real-time simulation but continued 
improvements in the speed of the tools used by system planners will manifest themselves in 
a reduction in the design to construction time of system improvements. As a final stage in 
the design assessment it may be necessary to undertake a real-time dynamic simulation 
to determine the performance of the new design. 
1.2.5 Operator Training 
The flow of power in the power system, or part of the system, is controlled from a 
central control room by trained operators known as dispatchers. The dispatchers meet the 
demand from users by controlling the flow of power around the network and have to deal 
with emergencies such as sudden changes in load demanded or equipment outages. New 
dispatchers must be trained to operate the control room mechanisms and to handle the 
various emergency situations that may arise. Trained dispatchers need to continually 
improve their skills, especially those relating to potentially catastrophic emergencies which 
seldom occur. It would be foolhardy to let trainees practise on the real system as an error 
in their responses could place the system into a potentially unstable state. Recent advances 
in reliability have meant that it is difficult to acquire the skills needed to cope with all 
conceivable operating conditions within the actual operating environment [6]. Some form 
of simulated environment is the most effective method of providing the necessary training. 
Training with a simulator gives the dispatcher greater confidence to take the correct actions 
when faced with the same situation in the actual system. To be most effective it is necessary 
for the simulator to provide the same interfaces to the system as the control room provides. 
This requires the development of a control room mock-up in which all interaction with the 
simulator is made through the same computers, screens and other hardware used in the real 
control room. Due to their effectiveness dispatcher training simulators are now an integral 
part of the energy management systems of many power utilities [7]. 
All dispatcher training simulators have two main components: a control centre model 
and a power system model. The control centre model provides all the main functions of a real 
dispatch control centre including data acquisition, supervisory control, system monitoring 
and man-machine interfaces. The system model provides an equivalent model of the system 
being controlled which must be sophisticated enough to realistically reproduce the responses 
of the actual system. This requires the model to accurately simulate the dynamics of the 
system. Two distinct components of the system model can be identified in all dispatcher 
training simulator designs [6, 7, 8] - the static model and the dynamic model. 
• Dynamic model - Models the system components to provide dynamic simulation of 
generators, loads, prime mover systems, protective relays, substation controls etc. 
• Static model - Models the power system network and provides the power flow solution, 
network topology analysis, system frequency deviation and transient stability 
calculations. 
Biglari [7] and other researchers observe that it is the solution of the power flow equations 
which is the most time consuming stage in the power system model. Furthermore, the 
solution time of the power flow equations directly determines the iteration cycle time of 
the power system model component of the simulator. Although fast decoupled solution 
techniques are employed the time taken to solve the power flow equations is significant. The 
long solution times mean that it is not possible to accurately represent the responses of the 
system's fast dynamics. 
1.3 Power Systems Analysis and Computer Architectures 
Digital computer analysis of power systems developed in the 1960s when the availability 
of large digital computers made it feasible to use such technology as a fast and flexible 
tool for modelling power system behaviour. Early research into computer analysis 
techniques quickly highlighted the disparity between problem size and the capabilities of the 
computer technology available at that time [9]. Researchers concentrated on developing 
highly efficient algorithms which would extract the maximum performance from machines 
which were primitive and limited in their performance. As computers have become more 
powerful, engineers have exploited this increased power by creating analysis tools to run on 
these machines. With the development of extremely high performance supercomputers the 
software designers have naturally looked toward these machines to see what benefits they 
have in store for power engineering applications. Whilst they are extremely efficient and 
offer incredible performance these machines are expensive and beyond the budget of most 
power utilities. 
As the sequential computer nears the limits of its performance much attention has been 
focused on the development of parallel computers which offer higher performance from ex-
isting technology than conventional sequential machines. These computers are built from 
tried and tested processors and achieve their performance by allowing many operations to 
be performed simultaneously, thereby reducing the total amount of time taken to execute 
a given algorithm. Parallel computers are not new; they have been in existence since the 
digital computer was first developed. Given their performance benefits it may seem 
surprising that they have not achieved widespread usage in commercial, scientific and engineering 
applications. This apparent unpopularity is due to the lack of commercial software and 
the difficulties of developing applications software. Whilst it may be argued that 
concurrency is a more natural and logical way of solving problems, a different approach is required 
for the development of parallel programs than is used in developing sequential programs. 
Many programmers used to sequential machines are used to the von Neumann model and 
find it difficult to work with the different architectural models and programming paradigms 
associated with parallel machines. 
In recent years a number of smaller 'turnkey' parallel computing systems have arrived in 
the marketplace (e.g. Intel iPSC, Meiko Computing Surfaces and INMOS Transputer based 
systems). These systems are built from readily available microprocessors which makes them 
cheap and accessible to a wide user base. Even in 1988 for as little as £15,000, the price of 
a mid-range workstation, it was possible to buy an off-the-shelf parallel system with more 
than 8 processors and a performance approaching 100 MFLOPS [10], which exceeded the 
performance of the similarly priced workstation. With the introduction of such systems 
the software developers have seen a market niche opening up. Software manufacturers have 
expanded to fill this niche and there are now a variety of operating systems available for 
these entry level parallel systems, including the ever-popular Unix. The development of 
parallel versions of traditional sequential languages (e.g. C and FORTRAN) along with the 
introduction of new parallel languages (e.g. Occam) has eased the process of developing 
software applications. In the past one criticism that has been levelled at parallel computers 
is that algorithm development and hardware architecture are too deeply intertwined to be 
treated independently. The advent of high-level 'parallel' languages has seen a move toward 
abstracting the logical development of programs away from the architectural details of the 
machine. As these systems fall in price and more software is developed, more details of the 
hardware become hidden from the programmer making the machines easier to program. 
It is a generally held belief that parallel computers will become significant in many 
scientific and engineering applications. It is already proven that using readily available 
technology simple parallel computers can exceed the performance of high performance 
sequential machines. Trew [11] and Sabot [12] observe that parallel computers tend to be 
much more cost effective than serial machines with the same level of performance. With 
the advent of cheap off-the-shelf parallel systems it seems that power engineers may have 
machines which can satisfy the performance requirements of their software applications at 
reasonable cost. Many of the developments in uniprocessor computers that have occurred 
in the last 20 years (e.g. pipelining, coprocessors) have occurred as a result of the 
application of techniques used in parallel computer systems [10, 13, 14, 15]. If research into 
parallel machines continues to increase their performance it is likely that they will satisfy 
the performance requirements of scientific and engineering applications for some time into 
the future. Whilst parallel machines are today primarily the domain of researchers, many 
experts predict that by 1998 they will be serious contenders to the industrial/commercial 
domination of the conventional von Neumann machine. 
1.4 Parallel Processing and Parallel Architectures 
Almasi and Gottlieb [13] define parallel processing as 
'A large collection of processing elements that can communicate and cooperate to solve 
large problems fast' 
Parallel computers achieve high performance by using multiple processors to solve the 
independent parts of a problem concurrently. If correct results are to be produced then the 
individual processors must cooperate with one another by exchanging data and 
synchronizing their operations. An important feature of every parallel machine is the ability of 
processors to communicate with each other and this is provided by an interconnection 
network which links the processors. Many different types of interconnection network exist 
but all involve some trade-off between cost and performance. 
1.4.1 Classification of Computer Architectures 
Before discussing parallel architectures it is worth reviewing the von Neumann model [11] of 
sequential computation as this is the model upon which many of today's computer systems 
are based. In the von Neumann model program instructions and program data are both 
stored in a common memory connected to a single processor, giving the von Neumann 
machine the architecture of Figure 1.3. The flow of instructions and data is shown in Figure 
1.4. Instructions are executed sequentially with instruction operands and data being fetched 
from memory, operated upon by the Arithmetic Logic Unit (ALU) and data returned back 
to the memory. The model employs a single stream of instructions and a single stream of 
data. 
[Figure 1.3: The basic von Neumann machine. Blocks shown: Input/Output, System Memory 
and Central Processing Unit (C.P.U.), linked by a Bus.] 
In 1966 Flynn [16] proposed his scheme for classifying computer architectures which has 
become known as Flynn's Taxonomy. Flynn classifies computer architectures according to 
the number of instruction (I) and data (D) streams that they have and this yields the four 
classes of computer shown in Table 1.1. 
                               Single data stream    Multiple data stream 
Single instruction stream      SISD                  SIMD 
                               (von Neumann)         (Array/Vector processor) 
Multiple instruction stream    MISD                  MIMD 
                                                     (true multiprocessor) 
Table 1.1: Flynn's Taxonomy. 
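The classification rule behind Table 1.1 is simple enough to state as a function. The sketch below is purely illustrative; the function name and the way stream counts are passed in are assumptions made for the example.

```python
def flynn_class(instruction_streams, data_streams):
    """Return the Flynn class for a machine with the given numbers of streams."""
    if instruction_streams < 1 or data_streams < 1:
        raise ValueError("a machine needs at least one instruction and one data stream")
    i = "S" if instruction_streams == 1 else "M"   # single or multiple instruction streams
    d = "S" if data_streams == 1 else "M"          # single or multiple data streams
    return f"{i}I{d}D"


# One entry for each cell of Table 1.1 (stream counts are arbitrary examples).
examples = {
    "sequential machine": flynn_class(1, 1),
    "array processor": flynn_class(1, 64),
    "pipelined processors on one data stream": flynn_class(4, 1),
    "multiprocessor": flynn_class(16, 16),
}
```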
SISD computers are conventional von Neumann, sequential machines such as PC clones 
based on Intel 80x86 processors. This class also includes uniprocessor supercomputers such 
as the Cray 1 which achieves high performance through pipelining of instructions. The 
remaining three categories refer to different types of parallel computer. 
[Figure 1.4: Instruction and data streams in the von Neumann machine. The Control Unit 
and Arithmetic Logic Unit form the Central Processing Unit. IS = Instruction stream, 
DS = Data stream.] 
SIMD computers are parallel computers which consist of multiple processors controlled 
by the same control unit. Each processor receives the same instruction from the controller 
and executes it on a different data stream in synchronous lockstep. The vector processors 
used by vector supercomputers are considered as SIMD devices as the processing of vector 
quantities is performed by multiple ALU's connected in an array processing manner. SIMD 
machines built from multiple CPU's are in existence - the CPU's are interconnected by a 
data routing network in the form of a regular array. Such machines are useful for problems 
which exhibit a high degree of parallelism although they are difficult to program and are 
application specific. 
Flynn's classification of a MISD machine sees autonomous processors executing different 
instructions on the same stream of data. The data flows between the processors in a 
pipelined fashion. No computer has yet been identified as falling into this category and many 
researchers claim that a MISD machine is purely conceptual as it is difficult to visualise an 
application for which it would be useful. 
The MIMD category encapsulates a wide variety of multiprocessor and multicomputer 
systems. Multiple processing elements autonomously execute different instructions on 
different streams of data. The MIMD classification is an extremely wide one covering 
asynchronous arrays of microprocessors through to distributed multicomputer systems. The 
off-the-shelf Transputer systems supplied by companies such as Meiko are prime examples 
of the sort of MIMD machines which are widely used by scientific researchers. 
1.4.2 SIMD Architectures 
The SIMD paradigm [17, 18, 13] consists of interconnected processors which receive their 
instructions from a central control unit. The interconnection network allows for 
communication between individual processors and between processors and local memory. All SIMD 
machines are variants of array processors. 
Vector processors are considered array processors due to the arrays of ALU's used to 
process individual elements of vector operations in parallel. Some degree of pipelining is 
often used in vector processors and this incurs a significant start-up overhead. Due to the 
presence of these pipelines vector processors are only efficient if their pipelines are always 
full. 
Systolic array architectures are processor arrays in which the processing elements are 
extremely simple and perform an invariant sequence of primitive operations. Data is pumped 
through the network from the memory and returns to the memory after processing. Flow 
of data through the network is synchronised by a global clock and data appears to pulse 
through the network in a similar manner to blood flowing through the heart. Systolic 
arrays are well suited to intensive computations on regular data. Suitable applications are 
invariably algorithm specific. 
Despite their lack of generality SIMD architectures have some advantages over more 
flexible MIMD architectures. The synchronous nature of SIMD machines eliminates the 
delays associated with synchronisation and the need to wait for the slowest processor. The 
single instruction stream allows the use of a common instruction memory and does not 
require local replication of parts of the program, as in a MIMD architecture. This gives 
SIMD a much higher memory efficiency compared to MIMD. The single instruction scheme 
also fixes the interleaving of operations, unlike the MIMD paradigm which guarantees some 
interleaving of operations although it is not possible to determine which of the possible 
interleavings will occur. The guaranteed order of instruction execution makes SIMD programs 
much easier to create, debug and maintain. 
[Figure 1.5: Functional design of a MIMD computer, after Hwang and Briggs. Labels shown: 
Shared Memory, I/O Channels, Input-output Interconnection Network, Interprocessor 
Interconnection Network, Interprocessor Interrupt Network. MM: Memory module, 
LM: Local memory, P: Processor.] 
1.4.3 MIMD Architectures 
MIMD architectures come in two distinct flavours, shared memory and distributed memory. 
A third, hybrid flavour, distributed shared memory, is also possible but less popular. Figure 1.5 
shows the functional design of a MIMD computer and how the different elements combine 
to yield the three flavours. 
Shared memory MIMD architectures [17, 18] use some form of bus interconnection 
network to connect all the processors to a common memory bank. Synchronisation and 
communication between processes is performed via the common memory. In order to increase the 
efficiency of the bus, shared memory machines often utilise some form of local cache at each 
processing node to reduce the amount of traffic passing across the bus. There then arises the 
problem of cache coherency [18]; each cache can contain a copy of the same memory data. 
Correct operation requires some mechanism that ensures that all caches contain the same 
values for a given data item. Cache coherency is a major problem in the design of shared 
memory systems. Another problem arises in trying to extend the system to use a larger 
number of processors. Whilst the bandwidth of modern memory systems is sufficiently 
large to allow the connection of multiple processors, bus contention begins to become more 
significant as more processors are added. Bus-based shared memory systems seem to have a 
limit of around 20 processors. Other forms of memory interconnection networks have been 
devised including crossbar switching and various multistage networks. These allow larger 
numbers of processors to be used but impose some penalty on system performance. Even 
with these interconnection systems it is not possible to achieve the easy scalability that is 
possible with distributed memory architectures. 
Distributed memory MIMD architectures [17, 18, 11] have no global memory but each 
processor has its own private memory. Both program code and data are partitioned into 
the local memories of the processors in the system. Processors which wish to synchronise 
their operation or exchange data must do so by passing explicit messages across the com-
munications network interconnecting the processors. Unlike shared memory architectures, 
distributed memory systems can be scaled up to use any number of processors and com-
mercial systems are in existence which use hundreds of processing nodes [11]. Distributed 
memory systems are also easier to design and cheaper to build as there is no complex hard-
ware required to provide access to a global memory. The major disadvantage of distributed 
memory architectures is that delays are associated with the communication of messages, 
especially if messages have to be routed via intermediate processors. These delays can seri-
ously reduce the performance of the system. Another disadvantage of distributed memory 
machines is that their lack of global memory makes it harder to write and debug programs. 
Writing programs for distributed memory architectures requires the programmer to think 
in a distributed manner. The global memory of a shared memory architecture allows the 
programmer to utilise the conventional von Neumann programming paradigm to a certain 
extent. Despite these disadvantages the cheapness and ease with which distributed mem-
ory machines can be built makes the distributed memory architecture ideal for entry level 
parallel computing systems. Indeed many of the entry level parallel systems available today 
are distributed memory MIMD machines. 
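The message passing style described above can be imitated in a few lines. The sketch below is only an analogy: Python threads share one address space, so the private memories are modelled by giving each node function access to its own slice of the data only, and a `queue.Queue` stands in for the communication links into the root node. The function names and the choice of a summation problem are assumptions made for the example.

```python
import queue
import threading


def node(rank, private_data, to_root):
    """Worker node: reduce its own private data, then send one explicit message."""
    partial = sum(private_data)          # works on local memory only
    to_root.put((rank, partial))         # message carrying the partial result


def distributed_sum(values, n_nodes=4):
    """Sum the values by partitioning them over n_nodes and gathering at the root."""
    to_root = queue.Queue()              # stands in for the links into node 0
    chunks = [values[r::n_nodes] for r in range(n_nodes)]   # data partitioning
    workers = [threading.Thread(target=node, args=(r, chunks[r], to_root))
               for r in range(1, n_nodes)]
    for w in workers:
        w.start()
    total = sum(chunks[0])               # the root processes its own share
    for _ in workers:                    # one blocking receive per worker
        _rank, partial = to_root.get()
        total += partial
    for w in workers:
        w.join()
    return total


total = distributed_sum(list(range(1, 101)))
```

The blocking `get` calls are the only synchronisation points, which mirrors the way data exchange and synchronisation are combined in explicit message passing.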
Distributed shared memory architectures [17] are a compromise between the distributed 
and shared memory approaches which attempts to solve the problems associated with both 
of these models. Distributed shared memory machines have both local private memory at 
each processing node and access to a common shared memory. Two interconnection net-
works are used, one to connect all the processors to allow distributed message passing and 
one to connect each processor to the common memory. Data exchange and synchronisation 
can now be performed either via shared memory or via explicit message passing. These 
architectures are more scalable than shared memory architectures and give better 
performance than distributed memory architectures. They are however complicated to build and 
few are commercially available. 
1.4.4 Interconnection Networks for MIMD Architectures 
The preceding discussions have made reference to the interconnection network between pro-
cessors in the case of distributed memory machines, and between processors and memory 
in the case of shared memory machines. The choice of interconnection network can make 
or break the performance of the parallel machine and is therefore of critical importance. 
Choosing an inappropriate interconnection can severely reduce speed-up due to the extra 
communication overheads involved. Often a particular configuration is suitable for one 
particular algorithm but is inappropriate for a different algorithm, making it difficult to find 
an efficient general purpose interconnection strategy. The most efficient general topology to 
date is the hypercube [15, 17, 18, 13, 19, 11] shown in Figure 1.6. Some of the other common 
interconnection strategies are also shown in Figure 1.6. These interconnections are referred 
to as static topologies as they are determined by a physical interwiring of the processors. 
Dynamic interconnection networks are also possible [17, 18, 13] in which interprocessor 
communications are made via a multiple stage switching network. The switching network 
automatically routes each message to the destination processor by configuring the switches 
according to address information contained in the message and operation is similar to that 
of a modern packet-switched telephone exchange. Dynamic multistage networks are highly 
efficient in that they allow an arbitrary input to be connected to an arbitrary output with 
a constant communication delay. Such networks are expensive to implement due to the 
dynamically configurable switching hardware required and are beyond the scope of this thesis. 
Instead this thesis is based upon work performed on a reconfigurable statically connected 
multiprocessor system. In this system physical processor interconnections are made via 
crosspoint switch mechanisms which have to be set up before the system can be used. Once 
set the network retains that topology until it is reset and a new interconnection topology 
defined. The network cannot be reconfigured during the course of a program's execution 
and hence the topology is essentially static. 
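The hypercube named above as the most efficient general static topology has a simple addressing rule. In a d-dimensional hypercube of 2^d nodes, two nodes are linked when their binary addresses differ in exactly one bit, so each node has d links and no message needs more than d hops. The sketch below illustrates this; the function names and the dimension-ordered routing scheme are assumptions for the example and do not describe the routing used in this thesis.

```python
def hypercube_neighbours(node, dim):
    """Addresses of the dim nodes directly linked to `node` (one bit flipped each)."""
    return [node ^ (1 << bit) for bit in range(dim)]


def hop_count(src, dst):
    """Minimum number of links between two nodes: the number of differing address bits."""
    return bin(src ^ dst).count("1")


def route(src, dst, dim):
    """Dimension-ordered route: correct differing bits from lowest to highest."""
    path = [src]
    current = src
    for bit in range(dim):
        if (current ^ dst) & (1 << bit):
            current ^= 1 << bit
            path.append(current)
    return path


# A 4-dimensional hypercube has 16 nodes with four links per node.
DIM = 4
links_of_node_0 = hypercube_neighbours(0, DIM)
worst_case_hops = max(hop_count(0, n) for n in range(2 ** DIM))
example_path = route(5, 10, DIM)
```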
[Figure 1.6: Static interconnection network topologies, after Hwang. Topologies shown: 
pipeline, tree, ring, star, mesh, systolic array, completely connected, chordal ring and 
hypercube.] 
1.4.5 The INMOS Transputer 
The INMOS Transputer is a general purpose reduced instruction set (RISC) processor 
designed specifically for use in parallel computers [20, 21, 22]. Each Transputer consists of a 
fast microprocessor, four serial communication links, fast cache memory, external memory 
interfacing, floating point coprocessor, real-time clocks and a hardware implemented 
multitasking scheduler. An array of Transputers may be created by interconnecting the serial 
links with those of other Transputers in a point to point fashion. Although it is possible 
for an array of Transputers to access a global shared memory the usual configuration of 
Transputer systems is as a distributed memory machine. To enable the easy building of 
scalable parallel systems INMOS have created a modular system for building Transputer 
based machines. This standard is based around the use of Transputer Applications Modules 
(TRAM's) which are small circuit boards measuring 3.6 inches by 1.1 inches. Each TRAM 
hosts a single Transputer, RAM and all necessary interfacing logic and is a complete 
computer in its own right. TRAM's plug into a motherboard which resides in a host PC or 
workstation and the motherboard provides all the power and control signals to each TRAM. 
Two of the serial communication links of each TRAM on the motherboard are hardwired 
into a pipeline configuration. The remaining two links from each Transputer may be 
connected in any desired fashion using the reconfigurable crosspoint switch. In addition the 
interface between the motherboard and the host allows the Transputers to access the disk, 
screen and keyboard I/O systems of the host computer. Figure 1.7 provides a conceptual 
view of the entire parallel machine configuration. Further details about the INMOS 
Transputer are presented in Appendix A. Graham and King [21] provide an excellent overview 
of the Transputer and Transputer-based systems. 
The parallel computing system used throughout the duration of this research project
consisted of 16 INMOS T805 30 MHz Transputers and one INMOS T805 20 MHz Transputer.
Fifteen of the 30 MHz processors were supplied with 1 MB of fast RAM whilst one 30 MHz
processor was equipped with 4 MB of RAM. The 20 MHz processor was supplied with 16
MB of fast RAM and was used as the root processor in the Transputer network. All of the
Transputers were mounted on two INMOS B008 compatible motherboards and hosted by
an IBM PC AT clone. Each motherboard could accommodate up to 10 Transputers and was
equipped with an electronic crosspoint switch which allowed the Transputer interconnection
network to be reconfigured from software.
Figure 1.7: Conceptual overview of the configuration of the Transputer based parallel
machine (host PC and Transputers numbered 1 to 16)
1.4.6 Bounds on Multiprocessor Performance
The speed-up, $S(n)$, that is obtained in using $n$ processors to solve a problem defines how
much quicker that problem is solved by $n$ processors than by one processor. If $t_s$ is the
time taken to execute the algorithm on one processor and $t_p$ is the time taken to execute
the same algorithm on $n$ processors then speed-up is defined as

$$ S(n) = \frac{t_s}{t_p} \qquad (1.10) $$

This is the strict definition of speed-up and it describes the improvement over a uniprocessor
implementation of an identical algorithm. The most efficient sequential algorithm is not
always the best parallel algorithm and a less efficient sequential algorithm will often produce
a better parallel implementation. Due to these possible differences in algorithms the user
is ultimately interested in the speed-up relative to the best sequential algorithm. Equation
(1.10) still defines the speed-up but now $t_s$ is the time to execute the best sequential
algorithm and $t_p$ is the execution time of the parallel algorithm. Both algorithms must be
executed on the same type of processor running at the same clock speed in order to make
the analysis valid. An algorithm which achieves high speed-up but requires a large number
of processors to operate is obviously inefficient. The efficiency of implementing a parallel
algorithm is expressed as the ratio of speed-up to the number of processors required to yield
that speed-up. Hence the parallel efficiency, $E(n)$, is defined as

$$ E(n) = \frac{S(n)}{n} = \frac{\text{speed-up}}{\text{number of processors}} \qquad (1.11) $$
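
As a small numerical illustration of the speed-up and efficiency definitions above, the
following Python sketch evaluates both quantities from measured run times. The function
names and the timings are hypothetical and are chosen only for illustration.

```python
def speed_up(t_s: float, t_p: float) -> float:
    """Speed-up S(n) = t_s / t_p, following equation (1.10)."""
    if t_p <= 0:
        raise ValueError("parallel execution time must be positive")
    return t_s / t_p


def efficiency(t_s: float, t_p: float, n: int) -> float:
    """Parallel efficiency E(n) = S(n) / n."""
    if n < 1:
        raise ValueError("at least one processor is required")
    return speed_up(t_s, t_p) / n


# Hypothetical timings: 12 s on one processor, 3 s on 8 processors.
s = speed_up(12.0, 3.0)        # 4.0
e = efficiency(12.0, 3.0, 8)   # 0.5
print(s, e)
```

With these assumed timings the eight processors deliver a speed-up of 4 at an efficiency of
50%, which is the kind of shortfall from linear speed-up discussed next.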
The maximum speed-up that can be achieved with $n$ processors working simultaneously
is $n$. This ideal case is known as linear speed-up. The speed-up achieved in practice is
often much less than this due to the inability of the algorithm to exploit all the concurrency
in the problem, communication overheads and the time processes spend idling waiting for
synchronisation and/or communication. Minsky's conjecture [23] gives a lower bound on
the performance that can be expected of the $n$ processor system of $S(n) = \log_2 n$ but
this is a rather pessimistic estimate of performance. Hwang [15] gives a more optimistic
estimate of a practical upper bound on speed-up which is based upon statistical analyses
of the performance of real programs and thus takes account of communications overheads
etc. Hwang's calculations show that the upper bound on realistically achievable speed-up is
asymptotic to $n / \ln n$. These calculations were based upon experiments performed in the early
1980s and since that time advances in parallel technology have meant that it is now possible
to achieve speed-ups in excess of those predicted. Figure 1.8 summarises the relationship
between these various predictions of performance and it is obvious from this figure that
actual speed-ups often fall short of the theoretical ideal.
Figure 1.8: Various bounds on parallel performance (speed-up against number of processors,
n, for the Gelenbe, ideal linear, Hwang and Minsky bounds)

Amdahl [24] gives a quantitative analysis of expected speed-up based upon the amount
of parallelism in the problem. In any problem there is a certain amount, $W_p$, which can be
solved in parallel but Amdahl argues that there is always a sequential part of the problem,
$W_s$, which cannot be parallelised. If we define $C_s(n)$ to be the cost of performing a single
sequential operation on an $n$ processor machine and $C_p(n)$ to be the cost of performing $n$
parallel operations, we can define the uniprocessor and multiprocessor execution times, $t_s$
and $t_p$, as

$$ t_s = W_s C_s(1) + W_p C_p(1), \qquad t_p = \frac{W_s C_s(n)}{n} + \frac{W_p C_p(n)}{n} \qquad (1.12) $$

where $W_s + W_p = 1$ as $W_s$ and $W_p$ are normalised. If the cost of a single sequential or
parallel operation takes one unit of processing time then

$$ t_s = W_s + W_p \qquad (1.13) $$

Considering the $n$ processor machine; if $C_s(n) = nC_s(1)$

$$ t_p = \frac{W_s n C_s(1)}{n} + \frac{W_p C_p(n)}{n} = W_s C_s(1) + \frac{W_p C_p(n)}{n} \qquad (1.14) $$

Similarly, if we assume that $C_p(1) = C_p(n) = 1$ then $t_p$ becomes

$$ t_p = W_s + \frac{W_p}{n} \qquad (1.15) $$

Speed-up $S(n)$ is defined as the ratio of uniprocessor execution time to multiprocessor
execution time. Hence

$$ S(n) = \frac{t_s}{t_p} = \frac{W_s + W_p}{W_s + W_p / n} \qquad (1.16) $$
Equation (1.16) is Amdahl's Law and the speed-up it predicts for various numbers of
processors, $n$, and various values of $W_s$, is shown in Figure 1.9. This graph vividly shows
the effect that sequential operations in the parallel algorithm have on speed-up. The greater
the sequential part of the problem, the lower the speed-up and the greater the rate of
speed-up saturation. For example, consider the case with 16 processors. If only 10% of the problem
must be solved sequentially, the speed-up that can be achieved is 6.4, which is less than half
of the ideal linear speed-up of 16.
Amdahl's Law provides an asymptotic upper bound on speed-up of

$$ \lim_{n \to \infty} S(n) = \frac{1}{W_s} \qquad (1.17) $$
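
The saturation described above can be checked numerically. The following Python sketch
evaluates Amdahl's Law under the assumptions already stated in the text (normalised work,
so that the parallel fraction is one minus the sequential fraction, and unit operation
costs). The function name is hypothetical.

```python
def amdahl_speed_up(w_s: float, n: int) -> float:
    """Speed-up predicted by Amdahl's Law for a sequential fraction w_s.

    Assumes normalised work (w_s + w_p = 1) and unit operation costs.
    """
    if not 0.0 <= w_s <= 1.0:
        raise ValueError("w_s must lie in [0, 1]")
    if n < 1:
        raise ValueError("at least one processor is required")
    w_p = 1.0 - w_s
    return 1.0 / (w_s + w_p / n)


# Speed-up for a 10% sequential fraction as processors are added.
for n in (2, 4, 8, 16, 1024):
    print(n, round(amdahl_speed_up(0.10, n), 2))
```

For a 10% sequential fraction the prediction at 16 processors is 6.4, and however many
processors are added the value stays below 1/0.10 = 10.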
Figure 1.9: Speed-up predicted by Amdahl's Law (speed-up against number of processors, n,
for Ws = 0%, 1%, 5%, 10%, 25%, 50% and 75%)

It should be noted that Amdahl's Law focuses only on the computations involved and
does not take account of other aspects of multiprocessor performance such as communication
overheads and cache misses. Amdahl's Law implies that a greater speed-up could be
achieved by partitioning a problem into more parallel parts and executing them on more
processors. However this would require much more interprocessor communication and the
measured speed-up is likely to be significantly less than Amdahl's Law predicts. Gelenbe
[25] has proposed a number of extensions to Amdahl's Law which take interprocessor
communication into account. He notes that as $n$ increases so does the fraction of the execution
time spent in communication, $c(n)$. This makes the communication time the limiting factor
to speed-up for large numbers of processors and the upper bound on speed-up thus becomes

$$ S \le \frac{1}{c(n)} \qquad (1.18) $$

A second extension considers the fact that the parallel program does not make full use of
the $n$ processors and this gives an upper bound on speed-up of

$$ S \le \frac{n}{\log_2 n} \qquad (1.19) $$

Gelenbe's bounds on speed-up are plotted in Figure 1.8.
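
To compare the bounds numerically, the Python sketch below tabulates them for a few
processor counts. The closed forms used for Hwang's bound (n / ln n) and for Gelenbe's
second bound (n / log2 n) are assumptions read from the damaged equations in this copy,
and the function name is hypothetical.

```python
import math


def speed_up_bounds(n: int) -> dict:
    """Return the speed-up bounds discussed in the text for n >= 2 processors."""
    if n < 2:
        raise ValueError("the logarithmic bounds need at least two processors")
    return {
        "ideal linear": float(n),
        "Hwang": n / math.log(n),       # assumed form: n / ln n
        "Gelenbe": n / math.log2(n),    # assumed form: n / log2 n
        "Minsky": math.log2(n),         # log2 n
    }


for n in (2, 4, 8, 16):
    print(n, {name: round(value, 2) for name, value in speed_up_bounds(n).items()})
```

Under these assumed forms the Gelenbe and Minsky values coincide at 16 processors, since
16 / log2 16 = log2 16 = 4.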
1.5 Parallel Processing in Power System Analysis Problems 
Earlier in this chapter the subject of power systems analysis was introduced and standard
problems such as power flow, dynamic simulation, security analysis and operator training
were considered. Many of these, and other power system analysis problems, require the
solution of a set of linear equations. This set of equations is often large and in real-time
analysis software the equations must be solved as quickly as possible. Despite the use of
techniques such as sparse matrix storage it is often not possible to solve these equations as
fast as desired [26, 27, 28, 29, 30, 31, 32, 33, 34, 35].
One of the most promising approaches is to use parallel computers to give a fast solution.
Parallel computers have more than one processor and their high performance results from
their use of these multiple processing units. The basic premise of parallel computing is
that a problem can be solved more quickly if it is split into independent parts which can
be solved simultaneously. With regard to the equations for the power system network, this
involves the use of diakoptical techniques to partition the problem for solution by individual
processors. Several successes have been achieved [36, 37] with three- or fourfold decreases
in solution times recorded by a number of researchers.
The use of parallel computers to solve power system analysis problems is not new and
numerous researchers have developed many different parallel approaches to problems such
as transient stability analysis [38, 39, 40], short circuit analysis [41], state estimation [42]
and simulation of electromagnetic transients [43]. A number of researchers [26, 36, 27, 29,
33, 44, 45, 46, 35] have concentrated solely on the parallel solution of the linear network
equations. Most of the methods developed to date are based on triangular factorisation
and solution although other methods have been attempted, such as the Multiple Factoring
method [46, 29] and an approach based on the use of the Conjugate Gradient method
[28]. The existing methods will be considered more fully in Chapter 3 but the basis of the
triangular decomposition methods is to partition the set of equations into subsets which
may be solved independently. Unfortunately it is not easy to split the equations into
independent subsets and a successful partitioning strategy requires detailed analysis of the
numerical algorithms [29]. Kron's method of diakoptics [47, 48] is often used to decompose
the system into subsets which may be solved concurrently using multiple processors. A
coordination phase is introduced into the solution algorithm to combine and modify the
individual solutions to give the overall solution of the set of equations.
None of the methods developed to date has successfully exploited the full performance
of parallel computers due to the nature of the problem. Speed-up seems to be limited to
about 3 or 4 and the coordination phase is recognized as being a bottleneck in the solution
[49]. Some methods do achieve higher speed-ups [26, 46] but they require a large number of
processors which makes them expensive and inefficient. The authors of [50] note that whilst
algorithm development has produced good theoretical results, little software has actually
been developed for parallel machines.
This thesis discusses the development of a parallel solution which attempts to circumvent
the limitations of existing parallel methods. Whilst the thesis only discusses the technique
in relation to the solution of power system network equations it is also valid when applied
to other similar networks (e.g. telecommunications networks, electronic circuit analysis, gas
and water networks etc). Indeed the technique can be applied to any network which can be
represented by a set of sparse symmetric diagonally dominant linear equations.
1.6 Summary 
This chapter has introduced power systems and parallel computation. The power system
and its components have been discussed and a number of power system analysis applications
have been examined. The linear network equations have been clearly identified as having
a central role to play in these applications. Some of these applications, such as dynamic
simulation and on-line dynamic security assessment, require real-time or faster operation and
in these applications it is essential that the linear equations be solved as fast as possible.
Sequential computers are limited in their ability to perform these computations within the
required time frame and other computer architectures must be considered if the solutions
are to be accelerated.
Parallel computers have been introduced as a different type of computer from the
conventional sequential machine. The operation of parallel computers has been considered
and a number of parallel computer architectures have been examined. Using a parallel
computer it is possible to solve a problem faster than on a sequential machine by dividing
up the problem and solving independent parts concurrently on the multiple processors of
the parallel machine. Parallel processing seems to be an ideal technique for accelerating
the solution of the power system network equations and indeed this approach has already
been used. The equations are solved by partitioning them into independent subsets which
may be solved concurrently by the multiple processors. Unfortunately it is not easy to
split the equations into independent subsets and a coordinating phase must be introduced
into the parallel solution to combine and modify the results for each of the subsets. This
combination phase is a bottleneck in the solution process and limits the speed-up of the
parallel solution to about 3 or 4 regardless of the number of processors used. Some of the
existing methods achieve reasonable speed-ups but require many processors to achieve their
performance [46, 26] and they are therefore rather inefficient.
During the early developments in computer-based power system analysis tools significant
performance enhancements were made through improvements in the algorithms used.
Through these improvements the algorithms for the sequential solution of linear equations
have evolved into the highly efficient state we see today. Whilst developments in computer
hardware will produce faster machines capable of solving the network equations more
quickly than they can be solved at present there is still much that can be done to improve
the algorithms used in parallel solutions. Rather than buying a bigger hammer to crack
the nut it is more profitable to redesign the smaller hammer so that it cracks nuts more
efficiently.
1.7 Outline of Thesis 
The aim of this thesis is to explore the derivation of a technique for efficiently solving
power system linear equations on a distributed memory multiprocessor. This introductory
chapter has discussed what these equations are and where they arise. The concepts of
parallel computing have been introduced and the need for parallel methods of solving linear
equations has been demonstrated.
Chapter 2 will survey sequential methods for solving the linear equations as all parallel
solutions are based around these sequential methods. LU based triangular decomposition is
introduced as the standard technique whilst sparse matrix methods and optimal reordering
are introduced as ways of minimizing computation and processing time. These techniques
will provide a good platform from which to explore parallel solution methodologies. The
elimination tree will be shown to be a powerful tool for providing insight into the solution
process and its introduction is intended to highlight areas of independence and potential
parallelism in the solution of the equations.
Chapter 3 moves on to discuss existing methods for the parallel solution of the
equations. The two flavours of solution, iterative and direct, will be considered with the aim
of determining which approach is most suitable for power system computations. Typical
parallel methods will be considered in some detail for both direct and iterative solutions. It
is intended that this chapter will highlight the beneficial features of these methods so that
they may be used later either to form the basis of a new method or to suggest improvements
to existing techniques.
Chapter 4 returns to the subject of the elimination tree and will show how it can be used
in partitioning the problem for parallel solution. It will also be shown that the elimination
tree is useful in optimizing the assignment of computations to individual processing elements
in a multiprocessor. The insight provided by the elimination tree will be of paramount
importance in improving the amount of parallelism that can be exploited when solving the
equations concurrently.
Chapter 5 will take the insight provided by the elimination tree and combine it with
the beneficial features of existing parallel solutions to produce an improved parallel solution
method. It will be shown that this is not a radical new algorithm but a restructuring of
the problem which allows more of the inherent parallelism to be exploited. The benefits
of the method and improvements in performance will be illustrated with the results of
simulations which solve several systems of equations using both the standard technique and
the improved method.
Chapter 6 considers how the improved method may be implemented on a multiprocessor
array. Some of the techniques for improving the sequential solution will be revisited and
applied to the parallel solution whilst other techniques that can be used to further improve
the implementation of the new parallel solution will also be presented. The performance
of a parallel implementation of the new method will be compared with that of existing
methods and the theoretical predictions of Chapter 5. This chapter aims to show that
the new method gives faster and more efficient solutions than those obtained from existing
parallel solutions.
Chapter 7 will present suggestions for further work on the methods discussed in this
thesis. Having demonstrated the effectiveness of the new approach, Chapter 8 will conclude
the thesis by assessing what has been achieved. These achievements will be compared to
the initial aims and objectives.
Chapter 2 
Solving the Network Equations 
2.1 Modeling the Power System
Modern power systems are complex entities consisting of thousands of interconnected
nodes. To analyse such a system it must be described by an equivalent formal
mathematical model. This requires the use of a mathematical model for each different type
of system component and the interrelationship between all these different models yields the
set of equations which form the basic framework of the analysis. Suitable models for the
main system components described in the previous chapter are now discussed.
2.1.1 The Generator Model
The synchronous generator has two main components: the rotor and the stator. The stator
is a hollow cylindrical structure which provides a housing for the rotor. Wound into slots
along the length of the stator casing are coils which are connected together to form three
separate phase windings. The rotor is a solid cylindrical structure which can rotate freely
about its axis within the stator structure. A coil wound on to the rotor is excited from a DC
source. This winding produces an intense magnetic field which sweeps the stator as the rotor
rotates, inducing a sinusoidal voltage into each of the stator windings. The voltages are
identical in amplitude and frequency but are 120° separated in phase. Mechanical rotation
of the rotor is provided by some form of turbine connected to the rotor shaft, with steam,
hydro, gas and wind turbines being the four main types of rotor prime mover.
The synchronous generator can be described in terms of the real and reactive power it
delivers.

$$ S = P + jQ \qquad (2.1) $$

where

$$ P = \Re[S] = \frac{V E_f}{X_d} \sin\delta \qquad (2.2) $$

$$ Q = \Im[S] = \frac{V E_f}{X_d} \cos\delta - \frac{V^2}{X_d} \qquad (2.3) $$

and
$S$ = complex power delivered
$E_f$ = stator internal voltage
$V$ = stator terminal voltage
$P$ = real power delivered by generator
$Q$ = reactive power delivered by generator
$\delta$ = power angle
$X_d$ = direct axis synchronous reactance
The equivalent circuit model of the synchronous generator and the derivation of these
equations is presented in Appendix B.1. Taken together (2.1) to (2.3) provide a model of
the synchronous generator suitable for use in a real-time dynamic simulation.
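
As an illustration of how such a generator model might be evaluated, the Python sketch
below computes the real and reactive power from the terminal voltage, internal voltage,
synchronous reactance and power angle. It assumes the standard round-rotor expressions
with winding resistance neglected and per unit quantities; the function name and the
operating point are hypothetical.

```python
import math


def generator_power(v: float, e_f: float, x_d: float, delta: float):
    """Return (P, Q) delivered by a round-rotor generator, resistance neglected.

    v, e_f and x_d are per unit values and delta is the power angle in radians.
    """
    if x_d <= 0:
        raise ValueError("synchronous reactance must be positive")
    p = v * e_f / x_d * math.sin(delta)
    q = v * e_f / x_d * math.cos(delta) - v * v / x_d
    return p, q


# Hypothetical operating point: V = 1.0, E_f = 1.2, X_d = 1.5, delta = 30 degrees.
p, q = generator_power(1.0, 1.2, 1.5, math.radians(30.0))
print(round(p, 4), round(q, 4))
```

At a power angle of zero the real power vanishes and only the reactive term remains, which
is a quick sanity check on any implementation of these expressions.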
2.1.2 The Load Model 
In analysing an electrical power system it is necessary to consider the loads connected to 
the system. Loads at each bus are treated as composite loads and may be modeled as either 
constant current sinks or, more usually, as a simple impedance. Appendix B.4 considers 
the characteristics of system loads and methods of modeling them.
2.1.3 The Transmission Line Model 
It is possible to derive a set of complex equations which give a complete mathematical
model of a transmission line. This model has an analogous electrical equivalent circuit
model, shown in Figure 2.1 as the equivalent π circuit model. The model comprises a
series impedance term, $Z$, which accounts for the resistive and inductive losses on the line.
Similarly a shunt admittance term, $Y$, is included to account for the shunt displacement
currents arising from the electric fields between the conductors. It is usual to place half of
this shunt admittance at either end of the line. Appendix B.2 derives the values of the series
impedance and shunt admittance and discusses transmission line modeling in more detail.
In the problem formulation which follows it is assumed that the values of the parameters $Z$
and $Y$ are given for each line.

Figure 2.1: Transmission line equivalent π circuit model
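
To indicate how the line parameters feed into the network equations, the following Python
sketch forms the 2 x 2 nodal admittance block of a single line from its series impedance
and total shunt admittance. This is the usual nodal analysis of a π circuit and is offered
as an assumption rather than as the formulation of Appendix B.2; the function name and
parameter values are hypothetical.

```python
def pi_line_admittance(z: complex, y: complex):
    """Return the 2x2 nodal admittance block of an equivalent pi circuit.

    z is the series impedance and y the total shunt admittance, half of
    which is placed at each end of the line.
    """
    if z == 0:
        raise ValueError("series impedance must be non-zero")
    y_series = 1.0 / z
    y_self = y_series + y / 2.0
    return [[y_self, -y_series],
            [-y_series, y_self]]


# Hypothetical per unit line data.
block = pi_line_admittance(z=0.01 + 0.1j, y=0.02j)
for row in block:
    print(row)
```

The block is symmetric, and each row sums to half of the shunt admittance, reflecting the
half placed at that end of the line.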
2.1.4 The Transformer Model 
The transformer is a constant power device comprised of two or more coils used in electrical
power systems to transform voltage and current levels. The coil connected to the power
source is known as the primary winding and the coil connected to the load is known as the
secondary winding. Assuming the transformer to be ideal, the power input to the primary
winding is equal to the power delivered by the secondary winding. If there are $N_1$ turns
on the primary winding and $N_2$ turns on the secondary winding then the terminal voltages
and currents are related by

$$ \frac{V_1}{V_2} = \frac{N_1}{N_2}, \qquad \frac{I_1}{I_2} = \frac{N_2}{N_1} \qquad (2.4) $$
A practical single phase equivalent circuit model of a two winding transformer is given
in Figure 2.2. The model accounts for finite core permeability, winding resistance,
imperfect flux linkage and eddy current and hysteresis losses. The parameters of the model are
expressed in terms of the series resistance and flux leakage of the primary and secondary
windings. $x_1$ and $x_2$ account for flux leakage in the primary and secondary windings
respectively. Similarly $r_1$ and $r_2$ are the series resistances of the primary and secondary windings.
Appendix B.3 provides a full derivation of the model.
Figure 2.2: Equivalent circuit of a single phase transformer

2.2 Formalizing the Problem
Given a power system we would like to address issues such as
• What are the loads on transformers, lines and generators in the system?
• What are the voltages and currents at each point in the system?
This is the power flow problem and it is concerned with calculating the voltage magnitude
and phase at each bus in the system.
A system bus is defined as a point of physical interconnection of system components
and a power system consists of many buses interconnected by a transmission network. At
each bus there will be three components contributing to the total power delivered at that
point: generation, load and transmission, although either generation or load may be missing.
Generation delivers power into the bus whilst transmission and load extract power from the
bus, i.e.

$$ S_g = S_l + S_t \qquad (2.5) $$

where
$S_g$ = complex power delivered into the bus by the generator
$S_l$ = complex power absorbed by the load
$S_t$ = complex power extracted / delivered by the transmission network
The distribution of power is achieved by the transmission network and it is this network
which must be analysed to determine the voltage characteristics at each system bus.
We can consider the transmission network as an n-port network to which generators and
loads are connected to form the power system, as depicted in Figure 2.3. As generators and
loads are external to the n-port transmission network they need not be considered further.
Should the generator voltages and currents be required they can be calculated using the
equations of Section 2.1.1. For the purposes of this thesis it is assumed that at each system
bus the current $I_i$ is given, where $I_i$ is the current due to generation less load at bus $i$.

Figure 2.3: General n-port representation of a power system

The transmission network is simply a network of impedances as each transmission line in
the network can be replaced by its equivalent π circuit, as described in Appendix B. Given
that the current $I_i$ is known at each bus and that the impedances comprising the network
are known, the Ohmic equation

$$ V = I Z \qquad (2.6) $$

can be used to obtain the voltage characteristics of each bus. It is more usual to consider
the transmission network in terms of its admittance as the impedance matrix, $\mathbf{Z}$, is dense
whereas the admittance matrix, $\mathbf{Y}$, is sparse. This has important consequences for
computational efficiency, as later sections will show. With the network described in terms of its
admittance, the currents injected at each bus become the inputs to the system whilst the
unknown bus voltages become the output of the system. Hence (2.6) becomes

$$ \mathbf{Y}\mathbf{V} = \mathbf{I} \qquad (2.7) $$

where $\mathbf{V}$ is the vector of bus voltages and $\mathbf{I}$ is the vector of bus currents. $\mathbf{Y}$ is known as the
bus admittance matrix and this characterizes the admittance of the transmission network.
The solution of (2.7) is the problem on which the work in this thesis is based. The solution
for the unknown voltage vector is given by

$$ \mathbf{V} = \mathbf{Y}^{-1}\mathbf{I} \qquad (2.8) $$
2.3 Linear Equations, Matrices and Sparsity 
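Many real world systems can be modeled by a set of simultaneous linear equations of the
form

$$ \begin{aligned} a_1 x_1 + a_2 x_2 + a_3 x_3 &= b_1 \\ a_4 x_1 + a_5 x_2 + a_6 x_3 &= b_2 \\ a_7 x_1 + a_8 x_2 + a_9 x_3 &= b_3 \end{aligned} \qquad (2.9) $$

This set of equations can be represented in matrix notation as $\mathbf{A}\mathbf{x} = \mathbf{b}$, which is
expressed in full as

$$ \begin{bmatrix} a_1 & a_2 & a_3 \\ a_4 & a_5 & a_6 \\ a_7 & a_8 & a_9 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \end{bmatrix} \qquad (2.10) $$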
Most real systems have many zero entries in the $\mathbf{A}$ matrix and these systems are known
as sparse systems. The matrix of coefficients, $\mathbf{A}$, is known as a sparse matrix. Such systems
are of special significance in the design of a computer program for solving linear equations
and will be considered shortly.
The values of $\mathbf{x}$ are unknown and a solution for the elements of this vector is required.
The usual method for solving a set of simultaneous equations involves some form of Gaussian
elimination of the set of equations followed by substitution to yield the unknown values.
Given the set of equations it can be seen that the problem of finding a solution to $\mathbf{x}$ requires
determination of the matrix inverse $\mathbf{A}^{-1}$. Once the inverse has been determined the vector
$\mathbf{x}$ is obtained by the simple matrix multiplication

$$ \mathbf{x} = \mathbf{A}^{-1}\mathbf{b} \qquad (2.11) $$
In power systems analysis it is necessary to solve for the voltage at each node in the system
using the equation

$$ \mathbf{Y}\mathbf{V} = \mathbf{I} \qquad (2.12) $$

where
$\mathbf{Y}$ = System nodal admittance matrix
$\mathbf{V}$ = Nodal voltage vector
$\mathbf{I}$ = Nodal current vector
The matrix $\mathbf{Y}$ is derived using nodal admittance analysis of the given power network, as
described in Appendix C, and for systems of any reasonable size this matrix exhibits some
degree of sparsity, i.e. many of the elements in this matrix are zero. This admittance matrix
is symmetrical by virtue of the structure of the network and it is diagonally dominant.
This allows computer solution techniques to achieve greater accuracy through a reduction
in relative numerical error.
As the admittance matrix is sparse only non-zero matrix elements contribute to a
solution and hence only those elements need enter into calculations. This has strong
implications for the design of computer algorithms which store and operate on these matrices.
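
The point about storing and operating only on non-zero elements can be sketched as follows.
This Python fragment is a minimal illustration, assuming a simple row-wise dictionary
storage scheme rather than the storage scheme used in the thesis; the function names and
the matrix values are hypothetical.

```python
def to_sparse(dense):
    """Store only the non-zero entries of a dense matrix, row by row."""
    return [{j: a for j, a in enumerate(row) if a != 0} for row in dense]


def sparse_matvec(rows, x):
    """Multiply a row-wise sparse matrix by a vector, touching non-zeros only."""
    return [sum(a * x[j] for j, a in row.items()) for row in rows]


# Hypothetical symmetric, diagonally dominant matrix with a sparse pattern.
dense = [
    [3.0, -1.0, 0.0, 0.0],
    [-1.0, 4.0, -2.0, 0.0],
    [0.0, -2.0, 5.0, -1.0],
    [0.0, 0.0, -1.0, 2.0],
]
rows = to_sparse(dense)
stored = sum(len(row) for row in rows)   # 10 of the 16 entries are stored
product = sparse_matvec(rows, [1.0, 1.0, 1.0, 1.0])
print(stored, product)
```

Even in this tiny case only 10 of the 16 entries are stored and multiplied; for a network
with thousands of buses and only a few lines per bus the saving is far larger.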
2.4 Direct Solution of the Linear Equations 
The determination of the inverse of $\mathbf{A}$ is an inefficient, computationally expensive procedure.
The key to solving the network equations is to produce the effect of the inverse without
actually calculating the full matrix inverse. The basis of the method is to decompose the
coefficient matrix into a number of factor matrices which are multiplicatively combined to
produce the effect of the inverse. This approach is significantly more efficient, requiring
fewer calculations than the determination of the full inverse and a solution obtained using
this method is termed a direct solution.
Three of the more common methods are now introduced. The method used exclusively
throughout the research work described in this thesis, Zollenkopf's Bifactorisation method,
is described in more detail.
2.4.1 Gaussian Elimination and Fill-Ins 
The set of linear equations $\mathbf{A}\mathbf{x} = \mathbf{b}$ arises directly from the structure of the network they
model. The known vector $\mathbf{b}$ can be considered as the input to the system and the unknown
vector $\mathbf{x}$ corresponds to the system output. The structure of the matrix of coefficients, $\mathbf{A}$, is
directly determined by the topology of the network. Graph theory [51] tells us that there is
a duality between graphs and matrices and that a matrix may be used to describe a graph.
The adjacency matrix associated with a graph is square and has as many rows/columns as
there are nodes in the graph. If nodes $i$ and $j$ of the graph are directly connected then
non-zero entries are inserted into the matrix at elements $(i, j)$ and $(j, i)$¹. The
values added to these elements usually represent some parameter of the network and in the
case of power systems the values represent the admittance connected between nodes $i$ and
$j$. The bus admittance matrix (Appendix C) is thus the adjacency matrix of the power
system network.
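
The correspondence between network connections and matrix entries can be sketched in a few
lines of Python. The example assumes the usual nodal analysis convention, in which each
branch admittance is added to the two diagonal entries and subtracted from the two
off-diagonal entries; this convention, the function name and the branch data are
assumptions for illustration, and shunt elements are omitted for brevity.

```python
def build_bus_admittance(n_bus, branches):
    """Assemble a dense bus admittance matrix from (i, j, y) branch tuples.

    Buses are numbered from 0. Only connected pairs (i, j) and (j, i)
    receive non-zero off-diagonal entries.
    """
    y_bus = [[0j] * n_bus for _ in range(n_bus)]
    for i, j, y in branches:
        y_bus[i][i] += y
        y_bus[j][j] += y
        y_bus[i][j] -= y
        y_bus[j][i] -= y
    return y_bus


# Hypothetical four-bus network with four branches.
branches = [(0, 1, 5 - 15j), (0, 2, 2 - 6j), (1, 2, 1 - 3j), (2, 3, 4 - 12j)]
y_bus = build_bus_admittance(4, branches)
for row in y_bus:
    print(row)
```

Buses 0 and 3 are not directly connected in this assumed network, so the corresponding
matrix entries remain zero, and the matrix is symmetric because every branch is undirected.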
Given $\mathbf{A}$ and $\mathbf{b}$ the problem is to solve the equations to yield $\mathbf{x}$. This is usually achieved
through some form of Gaussian elimination in which the aim is to modify matrix $\mathbf{A}$ using
successive column eliminations to reduce it to upper triangular form. Back substitution with
$\mathbf{b}$ then yields the vector $\mathbf{x}$. Given the graph-matrix duality one would expect operations
on the matrix to manifest themselves in the associated graph. Elimination of columns from
the matrix is equivalent to the elimination of nodes from the graph.
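
The two phases just described, reduction to upper triangular form followed by back
substitution, can be sketched in Python as follows. This is a dense, unpivoted
illustration that assumes no zero pivot arises (as for a diagonally dominant matrix); it
is not the sparse bifactorisation scheme used later in the thesis, and the function name
and test system are hypothetical.

```python
def gaussian_solve(a, b):
    """Solve A x = b by Gaussian elimination without pivoting.

    Assumes every pivot is non-zero. The inputs are copied, not modified.
    """
    n = len(b)
    a = [row[:] for row in a]
    b = b[:]
    # Forward elimination: reduce A to upper triangular form.
    for k in range(n - 1):
        for i in range(k + 1, n):
            if a[i][k] == 0:
                continue  # nothing to eliminate in this row
            m = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
            b[i] -= m * b[k]
    # Back substitution, starting from the last row.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / a[i][i]
    return x


a = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [5.0, 5.0, 3.0]
x = gaussian_solve(a, b)
print(x)
```

The right-hand side was chosen so that every unknown equals one, up to rounding. Note that
rows with a zero in the pivot column are skipped, a first hint of how sparsity reduces the
work.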
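Consider the example graph and adjacency matrix shown below.

[Figure: example six-node graph and its 6 × 6 adjacency matrix. Node 1 is connected to
nodes 2, 3 and 5; the first row of the matrix has diagonal entry 15 and off-diagonal entries 5.]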
The first stage in the Gaussian elimination of this system is to eliminate elements in
column 1 of rows 2 to 6. This is achieved by subtracting multiples of row 1 from rows 2-6 so
as to make the first element of each row equal to zero. For example, column 1 is eliminated
from row 2 by subtracting 1/3 of row 1 from row 2. This modifies the elements of the second
row and the matrix becomes
[Matrix: the coefficient matrix after column 1 has been eliminated from row 2]
¹ This assumes an undirected connection between i and j. If the connection is directed from i to j then
non-zero values will only be inserted in element $a_{ij}$.
Notice that two new elements have appeared in columns 3 and 5 of row 2. The significance
of these elements can be seen by examining the connectivity of the adjacency graph.
Node 2 is indirectly connected to nodes 3 and 5 via node 1. When node 1 is eliminated from
the graph new connections must be inserted between nodes 2 and 3 and between nodes 2 and 5
to preserve the connectivity of the network. These new connections, known as fill-ins, appear
in the matrix as the new elements (2,3) and (2,5). Once column 1 has been eliminated from
rows 2-6 the matrix and the graph have been modified to give
[Figure: the matrix and the graph after the elimination of node 1]
The elimination of node 1 has produced fill-in connections in the graph between 2 & 3, 2
& 5 and 3 & 5. New fill-in elements have been created in the adjacency matrix at elements
(2,3), (2,5), (3,2), (3,5), (5,2) and (5,3). Following the elimination of each node in the
network at successive steps of the Gaussian elimination algorithm, it may be necessary to
introduce fill-ins to preserve the connectivity of the network. Note that even if fill-ins are
not created, the values of existing matrix elements may be modified.
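The graph view of a single elimination step can be sketched in a few lines. This is a minimal illustration rather than code from the thesis: the function name `eliminate_node` and the dictionary-of-sets graph layout are assumptions, and only the connections of node 1 (to nodes 2, 3 and 5) are taken from the text; the remaining branches of the example graph are invented for the purpose of the sketch.

```python
def eliminate_node(adjacency, k):
    """Eliminate node k from the graph; return the fill-in edges that are created."""
    neighbours = sorted(adjacency.pop(k))
    for v in neighbours:
        adjacency[v].discard(k)          # remove the connections to k
    fill_ins = []
    for pos, i in enumerate(neighbours):
        for j in neighbours[pos + 1:]:
            if j not in adjacency[i]:    # i and j were only linked through k
                adjacency[i].add(j)
                adjacency[j].add(i)
                fill_ins.append((i, j))
    return fill_ins


# Node 1 is joined to nodes 2, 3 and 5 as in the text; the other branches are assumed.
graph = {1: {2, 3, 5}, 2: {1}, 3: {1, 4}, 4: {3, 6}, 5: {1, 6}, 6: {4, 5}}
print(eliminate_node(graph, 1))
```

Each fill-in edge (i, j) returned corresponds to the two matrix elements (i, j) and (j, i), so the three edges found here account for the six new elements listed above.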
The significance of fill-ins becomes apparent when the number of operations required
to eliminate all nodes is considered. The original coefficient matrix is sparse. In eliminating
columns from the matrix it is only necessary to operate on non-zero elements. The
introduction of fill-ins increases the number of non-zeros and thus increases the number of
operations required to eliminate all the nodes. The more fill-ins there are, the longer it will
take to complete the elimination. It is possible to reorder the matrix in such a way that the
amount of fill-in is reduced. Reordering the matrix to give the minimum fill-in also gives
rise to the minimum solution time. Section 2.5 and Section 2.7 consider matrix ordering in
more detail.
Fill-ins are also important when the storage requirements for a solution are considered.
With a sparse matrix it is only necessary to store the non-zero elements. Introducing fill-ins
increases the storage required and this could be a problem if memory is limited. Minimum
fill reorderings are useful in that they also minimize the amount of memory required.
2.4.2 LU Decomposition
The LU factorisation technique is one of the more widely used triangular factorisation 
techniques. The coefficient matrix is considered to be the product of two triangular factor 
matrices 
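\[ A = LU \tag{2.13} \]
where
L is a lower triangular matrix in which the leading diagonal elements are unity
U is an upper triangular matrix
Hence
\[
\begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}
=
\begin{bmatrix} 1 & & \\ l_{21} & 1 & \\ l_{31} & l_{32} & 1 \end{bmatrix}
\begin{bmatrix} u_{11} & u_{12} & u_{13} \\ & u_{22} & u_{23} \\ & & u_{33} \end{bmatrix}
\tag{2.14}
\]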
The LU factorisation is used as the first stage in the three stage solution of linear equations
Stage 1 Factorise: $A = LU$
Stage 2 Forward substitute to solve for y: $Ly = b$
Stage 3 Backward substitute to solve for x: $Ux = y$
The vector, y, is a vector of intermediate results. The advantage of this approach is that
operations on the right hand side (stages 2 and 3) may be performed independently of the
factorisation stage. This allows the same system to be solved with multiple right hand side
vectors.
The formulae for generating the elements of L and U are
\[ l_{i,j} = \frac{a_{i,j} - \sum_{k=1}^{j-1} l_{i,k}\,u_{k,j}}{u_{j,j}}, \qquad i = j+1,\ldots,n \tag{2.15} \]
\[ u_{i,j} = a_{i,j} - \sum_{k=1}^{i-1} l_{i,k}\,u_{k,j}, \qquad j = i,\ldots,n \tag{2.16} \]
The coefficients of L and U may be merged and stored in a single (n × n) matrix, $A_F$. $A_F$
is created by overwriting the elements of A as the factorisation progresses. Note that if
A is symmetric, the LU factorisation destroys the symmetry as $A_F$ is not symmetric, i.e.
$l_{i,j} \neq u_{j,i}$.
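By way of example, consider the LU factorisation of the matrix
\[ A = \begin{bmatrix} 16 & 4 & 8 \\ 4 & 5 & -4 \\ 8 & -4 & 22 \end{bmatrix} \]
This results in the factor matrix
\[
A_F = \begin{bmatrix} u_{11} & u_{12} & u_{13} \\ l_{21} & u_{22} & u_{23} \\ l_{31} & l_{32} & u_{33} \end{bmatrix}
    = \begin{bmatrix} 16 & 4 & 8 \\ 0.25 & 4 & -6 \\ 0.5 & -1.5 & 9 \end{bmatrix}
\]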
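The three stage solution can be illustrated with a short dense sketch. It assumes a Doolittle style factorisation (unit diagonal in L) without pivoting, with L and U overwriting a copy of A to form the merged factor matrix; the function names `lu_compact` and `lu_solve` are illustrative only and sparsity is ignored.

```python
def lu_compact(a):
    """Factorise A = LU (unit diagonal in L, no pivoting); return L and U merged."""
    n = len(a)
    af = [row[:] for row in a]                   # work on a copy of A
    for k in range(n):
        for i in range(k + 1, n):
            af[i][k] /= af[k][k]                 # l_ik, stored below the diagonal
            for j in range(k + 1, n):
                af[i][j] -= af[i][k] * af[k][j]  # update the remaining rows
    return af


def lu_solve(af, b):
    """Solve Ax = b from the merged factors: Ly = b, then Ux = y."""
    n = len(af)
    y = list(b)
    for i in range(n):                           # forward substitution
        for k in range(i):
            y[i] -= af[i][k] * y[k]
    x = y[:]
    for i in range(n - 1, -1, -1):               # backward substitution
        for k in range(i + 1, n):
            x[i] -= af[i][k] * x[k]
        x[i] /= af[i][i]
    return x


af = lu_compact([[16, 4, 8], [4, 5, -4], [8, -4, 22]])
print(af)
print(lu_solve(af, [28, 5, 26]))
```

Because the factorisation is separate from the two substitution stages, `lu_solve` can be called repeatedly with different right hand side vectors once `lu_compact` has been run.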
2.4.3 LDU Decomposition
Another triangular factorisation method often used in power system computations is the
LDU factorisation. The coefficient matrix is considered to be the product of three factor
matrices, L', D and U.
Here
L' is a lower triangular matrix with unity elements on the leading diagonal
U is an upper triangular matrix in which the leading diagonal elements are unity
D is a diagonal matrix
The method is similar to LU factorisation and the diagonal elements of the U matrix in
LU factorisation appear as the diagonal elements of the D matrix in LDU factorisation.
The lower triangular matrix of LDU factorisation, L', is obtained from the lower triangular
matrix of LU factorisation, L, by dividing each column of L by the diagonal element of that
column.
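Consider the following matrix by way of an example
\[ A = \begin{bmatrix} 3 & -1 & -1 & \\ -1 & 2 & & \\ -1 & & 2 & -1 \\ & & -1 & 1 \end{bmatrix} \]
LU factorisation of this matrix gives
\[
L = \begin{bmatrix} 1 & & & \\ -\tfrac{1}{3} & 1 & & \\ -\tfrac{1}{3} & -\tfrac{1}{5} & 1 & \\ & & -\tfrac{5}{8} & 1 \end{bmatrix}
\qquad
U = \begin{bmatrix} 3 & -1 & -1 & \\ & \tfrac{5}{3} & -\tfrac{1}{3} & \\ & & \tfrac{8}{5} & -1 \\ & & & \tfrac{3}{8} \end{bmatrix}
\]
LDU factorisation of A yields
\[
L' = \begin{bmatrix} 1 & & & \\ -\tfrac{1}{3} & 1 & & \\ -\tfrac{1}{3} & -\tfrac{1}{5} & 1 & \\ & & -\tfrac{5}{8} & 1 \end{bmatrix}
\qquad
D = \begin{bmatrix} 3 & & & \\ & \tfrac{5}{3} & & \\ & & \tfrac{8}{5} & \\ & & & \tfrac{3}{8} \end{bmatrix}
\qquad
U = \begin{bmatrix} 1 & -\tfrac{1}{3} & -\tfrac{1}{3} & \\ & 1 & -\tfrac{1}{5} & \\ & & 1 & -\tfrac{5}{8} \\ & & & 1 \end{bmatrix}
\]
Notice that if A is symmetric then
\[ U = L'^{T} \]
Hence
\[ A = L' D L'^{T} \]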
This has important consequences for storage of the matrix factors as it is only necessary
to derive and store L' and D. As power system network matrices are often symmetric
this factorisation method finds widespread usage in power system computations. The LDU
factorisation again forms the first stage of a three stage solution process.
Stage 1 Factorise: $A = L'DU = L'DL'^{T}$
Stage 2 Forward substitute to solve for y: $L'Dy = b$
Stage 3 Backward substitute to solve for x: $Ux = y$ ($L'^{T}x = y$)
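For a symmetric matrix the three stages can be sketched as below. This is an illustrative dense implementation in exact rational arithmetic, not the thesis code: the names `ldlt` and `ldlt_solve` are hypothetical, no pivoting is performed, and the small symmetric test matrix is chosen purely for demonstration.

```python
from fractions import Fraction


def ldlt(a):
    """Factorise a symmetric matrix as A = L' D L'^T; return L' and the diagonal of D."""
    n = len(a)
    l = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    d = [Fraction(0)] * n
    for j in range(n):
        d[j] = Fraction(a[j][j]) - sum(l[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            s = sum(l[i][k] * l[j][k] * d[k] for k in range(j))
            l[i][j] = (Fraction(a[i][j]) - s) / d[j]
    return l, d


def ldlt_solve(l, d, b):
    """Stage 2: solve L' D y = b.  Stage 3: solve L'^T x = y."""
    n = len(d)
    z = [Fraction(v) for v in b]
    for i in range(n):                        # forward substitution with L'
        for k in range(i):
            z[i] -= l[i][k] * z[k]
    y = [z[i] / d[i] for i in range(n)]       # divide by the diagonal factor D
    x = y[:]
    for i in range(n - 1, -1, -1):            # backward substitution with L'^T
        for k in range(i + 1, n):
            x[i] -= l[k][i] * x[k]
    return x


A = [[3, -1, -1, 0], [-1, 2, 0, 0], [-1, 0, 2, -1], [0, 0, -1, 1]]
L, D = ldlt(A)
print(D)
print(ldlt_solve(L, D, [1, 0, 0, 0]))
```

Only L' and D are formed and stored; the upper factor is never built because the backward substitution reads L' by columns.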
2.4.4 Bifactorisation 
Bifactorisation is another factorisation technique similar to LU and LDU decomposition.
Given an (n × n) matrix this method splits it into 2n factor matrices, each of order (n × n).
The method produces n left hand factor matrices $L^{(1)}, L^{(2)}, \ldots, L^{(n)}$ and n right hand factor
matrices $R^{(1)}, R^{(2)}, \ldots, R^{(n)}$. The factor matrices satisfy the constraint that
\[ L^{(n)} L^{(n-1)} \cdots L^{(2)} L^{(1)} \, A \, R^{(1)} R^{(2)} \cdots R^{(n-1)} R^{(n)} = I \tag{2.17} \]
where I is the unit matrix of the same order as A. Note that
\[ A^{-1} = R^{(1)} R^{(2)} \cdots R^{(n-1)} R^{(n)} L^{(n)} L^{(n-1)} \cdots L^{(2)} L^{(1)} \tag{2.18} \]
One factor matrix exists for each row and column of the coefficient matrix and each
factor matrix differs from the unit matrix by only one row, if it is a right hand factor, or
one column if it is a left hand factor. Once again it is possible to merge the 2n factor
matrices to create a single (n × n) factored matrix $A_F$.
\[
A_F = \begin{bmatrix}
L_{1,1} & R_{1,2} & R_{1,3} & R_{1,4} \\
L_{2,1} & L_{2,2} & R_{2,3} & R_{2,4} \\
L_{3,1} & L_{3,2} & L_{3,3} & R_{3,4} \\
L_{4,1} & L_{4,2} & L_{4,3} & L_{4,4}
\end{bmatrix}
\tag{2.19}
\]
For a symmetric matrix $L_{i,j} = R_{j,i}$ and hence only the left hand factor elements need to
be derived. A further saving on memory can be made by storing only the left hand factor
elements so that the factored matrix becomes
\[
A_F = \begin{bmatrix}
L_{1,1} & & & \\
L_{2,1} & L_{2,2} & & \\
L_{3,1} & L_{3,2} & L_{3,3} & \\
L_{4,1} & L_{4,2} & L_{4,3} & L_{4,4}
\end{bmatrix}
\tag{2.20}
\]
Consider again equation (2.18). As $x = A^{-1}b$ it is possible to write
\[ x = R^{(1)} R^{(2)} \cdots R^{(n-1)} R^{(n)} L^{(n)} L^{(n-1)} \cdots L^{(2)} L^{(1)} \, b \tag{2.21} \]
Commencing from the left hand end the first operation involves post multiplying $R^{(1)}$ by
$R^{(2)}$ yielding
\[
R^{(1)} R^{(2)} =
\begin{bmatrix} 1 & a & b & c \\ & 1 & & \\ & & 1 & \\ & & & 1 \end{bmatrix}
\begin{bmatrix} 1 & & & \\ & 1 & d & e \\ & & 1 & \\ & & & 1 \end{bmatrix}
=
\begin{bmatrix} 1 & a & ad + b & ae + c \\ & 1 & d & e \\ & & 1 & \\ & & & 1 \end{bmatrix}
\tag{2.22}
\]
where a, b, c, d, e are values produced by the factorisation of the coefficient matrix and
may, or may not, be zero. The result of post multiplying any (n × n) matrix by any other
(n × n) matrix is itself an (n × n) matrix. By the time all the factor matrices have been
multiplied 2n matrix multiplications will have occurred, each of which produces an (n × n)
matrix result. The final operation to obtain the solution vector, x, is the post multiplication
of the product of the matrix factors (i.e. the full matrix inverse) by the vector b.
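\[
x = \begin{bmatrix} k & l & m & n \\ o & p & q & r \\ s & t & u & v \\ w & x & y & z \end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}
=
\begin{bmatrix} k b_1 + l b_2 + m b_3 + n b_4 \\ o b_1 + p b_2 + q b_3 + r b_4 \\ s b_1 + t b_2 + u b_3 + v b_4 \\ w b_1 + x b_2 + y b_3 + z b_4 \end{bmatrix}
\tag{2.23}
\]
which produces an (n × 1) vector result.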
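Now consider performing the same factor multiplications but commencing from the
right hand side. The final result is the same as in the previous case (2.23) but now the
first operation is the post multiplication of $L^{(1)}$ by b, i.e.
\[
L^{(1)} b =
\begin{bmatrix} k & & & \\ a & 1 & & \\ b & & 1 & \\ c & & & 1 \end{bmatrix}
\begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ b_4 \end{bmatrix}
=
\begin{bmatrix} k b_1 \\ a b_1 + b_2 \\ b\,b_1 + b_3 \\ c\,b_1 + b_4 \end{bmatrix}
\tag{2.24}
\]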
and the result is an (n × 1) vector. Subsequent operations involve post multiplying factor
matrices by intermediate vector results to yield an (n × 1) vector result. Note that in
multiplying from right to left no matrix inverse is generated but its effect is obtained.
Tinney [3] notes that this right to left multiplication, when applied to a dense matrix,
requires n additions and $n^2 - n$ multiplication-additions to compute a solution whereas left
to right multiplication requires (n - 1)(2n^ + n) additions and 2n^(n^ + n) multiplications
to yield the same solution. Right to left multiplication is therefore much more efficient and
will yield faster solutions with the added advantage of requiring less intermediate storage.
As with LU and LDU factorisation the factor matrices are derived in n steps. At the
k-th iteration the factor matrices $L^{(k)}$ and $R^{(k)}$ are determined and the coefficient matrix is
updated to produce a new coefficient matrix $A^{(k)}$.
The formulae for determining left and right factors at the k-th iteration step are
\[ L^{(k)}_{k,k} = \frac{1}{a^{(k-1)}_{k,k}}, \qquad R^{(k)}_{k,k} = 1 \tag{2.25} \]
\[ L^{(k)}_{i,k} = -\frac{a^{(k-1)}_{i,k}}{a^{(k-1)}_{k,k}}, \qquad i = k+1,\ldots,n \tag{2.26} \]
\[ R^{(k)}_{k,j} = -\frac{a^{(k-1)}_{k,j}}{a^{(k-1)}_{k,k}}, \qquad j = k+1,\ldots,n \tag{2.27} \]
where $a^{(k-1)}_{k,k}$ is referred to as the pivot or pivotal element.
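It is clear that $L^{(k)}_{i,k} = R^{(k)}_{k,i}$ for a symmetric matrix and as only left hand factors need to
be determined the rules for deriving the factors from a symmetric coefficient matrix become
\[ L^{(k)}_{k,k} = \frac{1}{a^{(k-1)}_{k,k}} \tag{2.28} \]
\[ L^{(k)}_{i,k} = -\frac{a^{(k-1)}_{i,k}}{a^{(k-1)}_{k,k}}, \qquad i = k+1,\ldots,n \tag{2.29} \]
where $L^{(k)}_{i,k}$ is stored in element (i,k) of $A_F$, and $A_F$ stores the compact factored matrix.
The coefficient matrix is updated according to
\[
a^{(k)}_{i,j} = a^{(k-1)}_{i,j} - \frac{a^{(k-1)}_{i,k}\,a^{(k-1)}_{k,j}}{a^{(k-1)}_{k,k}}
             = a^{(k-1)}_{i,j} + a^{(k-1)}_{i,k}\,L^{(k)}_{j,k}, \qquad i,j = k+1,\ldots,n
\tag{2.30}
\]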
Equations (2.29) and (2.30) imply that they must be applied to every matrix element
$a_{ij}$ for which i, j = k+1, ..., n. For a sparse matrix this is clearly inefficient as many of
these elements will contain zeroes. Equations (2.29) and (2.30) only need to be applied to
non-zero elements $a_{ij}$ for which
\[ k + 1 \le i, j \le n \tag{2.31} \]
As an example of the bifactorisation method, consider the factorisation of the matrix
\[ A^{(0)} = \begin{bmatrix} 3 & -1 & -1 & \\ -1 & 2 & & \\ -1 & & 2 & -1 \\ & & -1 & 1 \end{bmatrix} \]
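At the first step k = 1. Applying (2.29) to row 1 of the matrix produces the first left
hand factor matrix $L^{(1)}$.
\[ L^{(1)} = \begin{bmatrix} \tfrac{1}{3} & & & \\ \tfrac{1}{3} & 1 & & \\ \tfrac{1}{3} & & 1 & \\ & & & 1 \end{bmatrix} \]
Applying (2.30) to row 1 of the matrix causes the remaining rows of the matrix to be
modified. The result is
\[ A^{(1)} = \begin{bmatrix} 1 & & & \\ & \tfrac{5}{3} & -\tfrac{1}{3} & \\ & -\tfrac{1}{3} & \tfrac{5}{3} & -1 \\ & & -1 & 1 \end{bmatrix} \]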
Notice that fill-ins occur at (2,3) and (3,2) and that rows 2 and 3 of the matrix have
been updated as nodes 2 and 3 are the neighbours of node 1.
At the second factorisation step k = 2. Applying (2.29) gives
\[ L^{(2)} = \begin{bmatrix} 1 & & & \\ & \tfrac{3}{5} & & \\ & \tfrac{1}{5} & 1 & \\ & & & 1 \end{bmatrix} \]
and using (2.30) to modify rows 3 and 4 produces
\[ A^{(2)} = \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & \tfrac{8}{5} & -1 \\ & & -1 & 1 \end{bmatrix} \]
No fill-ins occur during this step and row 3 is updated as node 3 is the neighbour of
node 2.
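Step 3 (k = 3) results in
\[ L^{(3)} = \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & \tfrac{5}{8} & \\ & & \tfrac{5}{8} & 1 \end{bmatrix} \]
and
\[ A^{(3)} = \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & \tfrac{3}{8} \end{bmatrix} \]
Again no fill-ins occur and row 4 is modified as node 4 is the neighbour of node 3.
The final step (k = 4) simply creates $L^{(4)}$. No updating of the coefficient matrix is
required as all nodes have been eliminated.
\[
L^{(4)} = \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & \tfrac{8}{3} \end{bmatrix}
\qquad
A^{(4)} = \begin{bmatrix} 1 & & & \\ & 1 & & \\ & & 1 & \\ & & & 1 \end{bmatrix}
\]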
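The symmetric factorisation rules and the right to left multiplication of the factors can be sketched as follows. This is a dense illustration in exact rational arithmetic under stated assumptions: the names `bifactorise_symmetric` and `bifactor_solve` are hypothetical, the factored matrix is held as a full list of lists with only its lower triangle used, and no ordering or sparse storage is attempted.

```python
from fractions import Fraction


def bifactorise_symmetric(a):
    """Return the compact factored matrix (lower triangle) of a symmetric matrix."""
    n = len(a)
    a = [[Fraction(v) for v in row] for row in a]      # working copy, A^(k)
    af = [[Fraction(0)] * n for _ in range(n)]
    for k in range(n):
        pivot = a[k][k]
        af[k][k] = 1 / pivot                           # diagonal factor element
        for i in range(k + 1, n):
            af[i][k] = -a[i][k] / pivot                # off-diagonal factor elements
        for i in range(k + 1, n):
            if a[i][k] == 0:
                continue                               # row i is not a neighbour of k
            for j in range(k + 1, n):
                if a[k][j] != 0:
                    a[i][j] -= a[i][k] * a[k][j] / pivot   # update or fill-in
    return af


def bifactor_solve(af, b):
    """Form x = R(1)...R(n) L(n)...L(1) b, working from right to left."""
    n = len(af)
    x = [Fraction(v) for v in b]
    for k in range(n):                    # left hand factors, L(1) first
        xk = x[k]
        x[k] = af[k][k] * xk
        for i in range(k + 1, n):
            x[i] += af[i][k] * xk
    for k in range(n - 1, -1, -1):        # right hand factors, R(n) first
        for j in range(k + 1, n):
            x[k] += af[j][k] * x[j]       # symmetric case: R_kj equals L_jk
    return x


A0 = [[3, -1, -1, 0], [-1, 2, 0, 0], [-1, 0, 2, -1], [0, 0, -1, 1]]
AF = bifactorise_symmetric(A0)
print([AF[k][k] for k in range(4)])
print(bifactor_solve(AF, [1, 0, 0, 0]))
```

Every operation in `bifactor_solve` is a matrix by vector product, so no (n × n) intermediate result and no explicit inverse is ever formed.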
A further point to note about symmetric coefficient matrices and their factored counterparts
is that all the information about the matrix is contained in either the upper or the lower
triangle of the matrix. The rules for factorising and updating the coefficient matrix need
only be applied to this triangular form, reducing the computation time as well as reducing
the amount of memory needed to store the matrix. Given the matrix A in triangular form
the bifactorisation method can be used to produce a triangular factored matrix. After the
n-th iteration step the coefficient matrix, A, has been reduced to a unit matrix and n factors
have been produced and stored in the factored matrix. Note that the $R^{(n)}$ factor is a unit
matrix and can be disregarded as it has no effect in the subsequent multiplications.
Although the admittance matrix, Y, is often symmetrical there are situations under
which it is only incidence symmetric ². These situations arise from the presence of system
components such as quadrature boosters, which are transformers with complex transformation
ratios. Their effect is to introduce unequal admittances into symmetric locations of
the admittance matrix and the techniques for storing and processing symmetric matrices
can no longer be used. The square representation of the admittance matrix must be used
and factorisation generates left and right hand factors for which $L^{(k)}_{i,k} \neq R^{(k)}_{k,i}$. A compact
representation of the factors is no longer possible and both the left and right hand factors
must now be stored explicitly. The formulae for generating factors from symmetric matrices
(equations (2.28)-(2.29)) can no longer be used and equations (2.25)-(2.27) must be used
instead.
The test systems used in this thesis do not include devices which result in incidence symmetry
and the admittance matrices of all the test systems are symmetric. This has allowed
the more efficient symmetric storage and processing techniques to be used throughout.
² An incidence symmetric matrix is only symmetric in terms of element locations, not in terms of element
values. When the element values are ignored the matrix is seen to be symmetric.
2.5 Pivotal Ordering 
When operating on a coefficient matrix to obtain the effect of its inverse it is not necessary
to operate on rows or columns in the natural order in which they occur. It is possible to
process rows and columns in a different order so that a given diagonal element is selected
as the pivot at a given iteration step.
There are three reasons for choosing to operate on a matrix in an order that is not 
necessarily the naturally determined order. 
• increased numerical accuracy due to minimisation of round-off error.
• preservation of matrix sparsity. 
• increase in computational efficiency. 
These desirable properties result from minimizing the number of fill-ins introduced during
elimination. Minimizing the fill-ins reduces the amount of computation needed to yield a
solution, thus increasing computational efficiency. Preserving the sparsity of the matrix
also gives increased numerical accuracy as fewer round-off errors are introduced.
It has been observed that the matrices associated with power systems networks are
diagonally dominant and by determining the elimination ordering based upon an examination
of only the diagonal elements sufficient numerical accuracy is retained for most applications
[2]. This allows the matrix ordering to be chosen so as to preserve sparsity and reduce
memory requirements and computation time.
There are two main forms of ordering strategy used in matrix computations: Pre-Ordering
and Dynamic Ordering.
2.5.1 Pre-Ordering
Pre-ordering strategies are used before processing the matrix and have short execution
times. Such strategies cannot take account of changes in the coefficient matrix due to the
factorisation process and are unlikely to produce the optimal ordering. Despite this,
pre-ordering strategies can be very useful for simple problems although they are often not
very good at preserving sparsity.
As an example of a pre-ordering strategy consider the 'least number of connected
branches' method. This orders matrix rows (columns) for elimination in ascending order of
the number of non-zero off diagonal elements. When two rows (columns) have the same number
of non-zero elements the ordering becomes most efficient when these rows (columns)
are taken in their naturally occurring order. For the matrix below the following ordering is
obtained
       1  2  3  4
  1    *  *  *
  2    *  *
  3    *     *  *
  4          *  *
Natural   Ordered
   1         3
   2         1
   3         4
   4         2
Note that the ordering strategy merely requires a knowledge of the location of non-zero
matrix elements. The actual values of these elements are irrelevant.
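A pre-ordering of this kind amounts to a single stable sort on the off diagonal non-zero count of each row. The sketch below is illustrative only: the function name `pre_order` and the dictionary-of-sets description of the sparsity pattern are assumptions, and ties are resolved by taking rows in their natural order as described above.

```python
def pre_order(pattern):
    """Map each row to its elimination position (least connected branches first).

    pattern: dict mapping a row index to the set of columns holding non-zeros.
    """
    degree = {row: len(cols - {row}) for row, cols in pattern.items()}
    order = sorted(pattern, key=lambda row: (degree[row], row))  # ties: natural order
    return {row: position for position, row in enumerate(order, start=1)}


# Sparsity pattern of a 4 x 4 matrix; element values are not needed.
pattern = {1: {1, 2, 3}, 2: {1, 2}, 3: {1, 3, 4}, 4: {3, 4}}
print(pre_order(pattern))
```

The result maps each naturally numbered row to its position in the elimination sequence, which is the same information as the Natural/Ordered table.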
2.5.2 Dynamic Ordering
Dynamic ordering strategies differ from pre-ordering strategies in that they are used after
each step of the factorisation algorithm to determine the best order of elimination
based upon an examination of the updated coefficient matrix. Such strategies slow down
the computation, which is their main drawback, but they are very good at preserving
sparsity. The simplest dynamic ordering strategy applies the 'least number of connected
branches' algorithm after each iteration and this is the well known Minimum Degree algorithm
[3]. Brameller et al. [2] suggest the use of a semi-optimal ordering strategy. This
has the advantage of off-line usage (like a pre-ordering) which does not slow down the actual
computations yet still retains the sparsity preserving properties of a dynamic ordering
strategy. The ordering is applied prior to processing of the matrix and uses the coefficient
matrix sparsity pattern and a simulation of the factorisation process to determine the order
which introduces least fill-in. This technique is very attractive if several solutions are to be
obtained for matrices which have different numerical values but the same sparsity pattern
as the same elimination order can be used for each matrix.
The semi-dynamic ordering algorithm operates by reading connection information for
the required system from a datafile held on disk. The data in this file is used to establish
linked lists for each row which define the topology of the admittance matrix. These lists
hold only column indices and no element values are stored, as depicted below.
  *  *  *        1 -> 2 -> 3 -> NULL
  *  *           1 -> 2 -> NULL
  *     *  *     1 -> 3 -> 4 -> NULL
        *  *     3 -> 4 -> NULL
A suitable ordering strategy (e.g. Minimum Degree) is applied to determine the first
row/column to be eliminated. The bifactorisation update rule (2.30) then determines where
in the matrix fill-in will occur due to the elimination of this row and column. If a fill-in
occurs in row k column i the topology list for row k is modified by the insertion of an entry
for column i. After all fill-ins have been identified and the topology lists altered the list
of the eliminated row is marked so that it will not be consulted further. Similarly all the
other lists which make reference to the eliminated column have the entry for that column
removed. The ordering strategy is again used to determine the next row and column to
be eliminated and the necessary updating of lists is performed. This continues until all
rows are marked as eliminated. The output of the ordering algorithm is a mapping array
which specifies how the system nodes are to be renumbered so that when the nodes in the
renumbered system are eliminated in ascending order, the optimal elimination ordering is
being followed.
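The simulated elimination described above can be sketched with Python sets standing in for the linked lists. This is a simplified illustration, not the thesis implementation: the function name `minimum_degree_order` is hypothetical, the topology is assumed to be supplied as the full symmetric neighbour sets (diagonal excluded), and ties are broken by lowest node number, which is one arbitrary choice among several.

```python
def minimum_degree_order(adjacency):
    """Simulate elimination on the topology only and return the ordering.

    adjacency: dict mapping each node to the set of its neighbours.
    Returns (elimination order, renumbering map, list of fill-in edges).
    """
    adj = {v: set(nbrs) for v, nbrs in adjacency.items()}   # leave the input intact
    order, fill_ins = [], []
    while adj:
        k = min(adj, key=lambda v: (len(adj[v]), v))   # smallest degree, then index
        nbrs = sorted(adj.pop(k))                      # row k is now marked eliminated
        for v in nbrs:
            adj[v].discard(k)                          # drop references to column k
        for pos, i in enumerate(nbrs):
            for j in nbrs[pos + 1:]:
                if j not in adj[i]:                    # fill-in predicted at (i, j), (j, i)
                    adj[i].add(j)
                    adj[j].add(i)
                    fill_ins.append((i, j))
        order.append(k)
    mapping = {node: pos for pos, node in enumerate(order, start=1)}
    return order, mapping, fill_ins


topology = {1: {2, 3}, 2: {1}, 3: {1, 4}, 4: {3}}
print(minimum_degree_order(topology))
```

Because only the sparsity pattern is consulted, the ordering returned can be reused for any matrix with the same topology, whatever its numerical values.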
In the case of a symmetric matrix only the lower triangle needs to be stored to completely
specify the matrix. Analysis of the ordering algorithms shows that if the linked lists store
only the topological information for a triangular matrix a different ordering is produced
than if the lists contain the topology information of the whole symmetric matrix. It is
observed that the former case does not produce the optimum ordering of system nodes
whilst the latter case does. Hence the ordering algorithm must use a full representation of
the matrix whilst the actual solution for the algebraic network equations only makes use of
a triangular matrix. Creating an ordering using the full matrix ensures that the triangular
matrix used by the factorisation routine will be optimally ordered.
Figure 2.4: A simple 10 node graph (b) and its associated matrix (a)
2.6 Elimination Trees 
The elimination tree is an extremely powerful tool that can be used to describe the
precedence relationships which exist between nodes in the factorisation and substitution phases
of LU decomposition. Liu [52] recognizes the elimination tree as a tool previously used
in many different guises by a number of authors. First formalized by Schreiber [53], the
elimination tree was applied to parallel processing by Jess [54] in the early eighties. Such
trees were first introduced in a power systems context in the mid-eighties [55] and since
that time they have been used extensively in examining the parallelism in the solution of
large sparse sets of linear equations.
The duality between graphs and matrices is well known [51]. For a particular graph,
G(A) = (V(A), E(A)), with vertices V(A) and edges E(A) there exists an associated incidence
matrix. The matrix contains numeric data when the edges E(A) have weights associated
with them. In an electrical network the edge weights are the admittances (or impedances)
associated with each edge, or circuit branch. The matrix then becomes the admittance
(impedance) matrix of the system. Figure 2.4 shows the duality of graphs and matrices
with a simple 10 node network example.
If the matrix, A, is the coefficient matrix of a set of linear equations then the matrix may
be factorised using one of the Gauss-based L(D)U decomposition techniques. As elimination
proceeds fill-ins are generated in the matrix A creating the filled matrix, F. The creation
of F is equivalent to introducing extra connections in the graph, G(A), to create the filled
graph, G(F). Figure 2.5 shows the filled graph, G(F), for the simple 10 node example given
above along with the corresponding filled matrix, F.
Figure 2.5: Filled graph (b) and associated matrix (a) for the simple 10 node example
The elimination tree of a set of n linear equations is a rooted tree with n vertices labeled
1 to n. The labeling of the vertices in the tree corresponds to columns in the filled matrix,
F. The root of the tree is the last column, n. Tracing the paths through the elimination
tree gives the factorisation paths for given nodes. The factorisation path for a node, k, is an
ordered list of nodes starting at k. The list contains the index of the first non-zero element
below the diagonal in column k of the filled matrix F. This element is then taken as a
column and the first non-zero element below the diagonal in this column of the filled matrix
is added to the list. The process is repeated until no more unvisited nodes exist below the
diagonal in column k. The elimination tree is generated by tracing the factorisation path
of the last node, n. A node such as k is referred to as a child or descendant node and the
node corresponding to the first non-zero below the diagonal is referred to as the parent or
ancestor node. The root of the tree is unique in that it has no parent node and leaf nodes
of the tree have no children. Every other node is both a child and a parent of another node
in the tree. The elimination tree for the simple 10 node example system is shown in Figure
2.6. In this tree, node 10 is the root whilst nodes 1, 2, 3, 4 are leaf nodes. Node 6 is a child
node and node 7 is its parent. Node 6 is also the parent of node 1. The factorisation path
for node 6 is the list 6 ,1 . 
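The parent rule (first non-zero below the diagonal in each column of the filled matrix) lends itself to a very short sketch. The names `elimination_tree` and `factorisation_path` are illustrative, the filled pattern is assumed to be given column by column as sets of row indices below the diagonal, and the five node pattern used here is a made-up example rather than the 10 node system of the figures.

```python
def elimination_tree(filled_lower):
    """Return the parent of every node; the root has parent None.

    filled_lower: dict mapping column k to the set of rows i > k that are
    non-zero in column k of the filled matrix F.
    """
    return {k: (min(rows) if rows else None) for k, rows in filled_lower.items()}


def factorisation_path(parent, k):
    """Follow parent links from node k up to the root of the tree."""
    path = [k]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path


# Hypothetical filled pattern for a 5 node system.
F = {1: {3}, 2: {3, 5}, 3: {5}, 4: {5}, 5: set()}
parent = elimination_tree(F)
print(parent)
print(factorisation_path(parent, 1))
```

In this small case nodes 1, 2 and 4 are leaves, node 5 is the root, and the path of node 1 passes through its parent 3 before reaching the root.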
Figure 2.6: The elimination tree of the 10 node example 
2.7 Near Optimal Ordering Strategies 
Much research attention has been given to the subject of near optimal ordering in the last
few decades and this section presents a brief summary of the most popular and effective
methods developed to date. The various ordering schemes can be characterized by the
amount of fill-in they introduce and the effect they have on the shape of the elimination
tree.
2.7.1 The Minimum Degree Algorithm
The degree of a node in the graph G(A) is the number of other nodes to which that node
is connected, i.e. the number of connections branching out from that node. For example,
node 1 of Figure 2.4(b) has degree 2 whilst node 9 has degree 4. Comparing Figure 2.4(b)
with Figure 2.4(a) shows the degree of a node to be equivalent to the number of non-zero
off diagonal entries in the row of the matrix corresponding to that node.
The Minimum Degree algorithm [3, 56, 2] determines the order of elimination based on
an analysis of the degree of the nodes involved. The aim of the algorithm is to minimize the
number of fill-ins introduced as a result of the elimination process as this reduces the amount
of computation required to achieve a solution. At each stage of the elimination process the
algorithm chooses the next node to be eliminated as the one which has the smallest degree.
In the event of two or more nodes having the same, smallest degree the choice of which
node to eliminate is an arbitrary choice from the set of nodes with smallest degree. The
algorithm has to be applied dynamically or semi-dynamically as the degree of the nodes in
G(A) changes as elimination progresses. This is due to the removal of connections due to
elimination and the introduction of new connections due to fill-in. Allowing the algorithm
to take account of the changing degree of the nodes ensures that the minimum fill ordering
is produced.
Row                       1  2  3  4  5  6     7  8  9     10
Degree, (a) triangular    0  0  0  0  1  1     2  2  3(4)  4
Degree, (b) square        2  2  2  2  3  2(3)  2  2  4(5)  4
Figure 2.7: The effect of storage scheme on nodal degree a) triangular storage b) square
storage
A description has already been given of how the sparse coefficient matrix, A (or Y),
may be stored efficiently in memory. The scheme exploits the symmetry of the matrix
by storing only the upper or lower triangle of the matrix as either of these contains all
the numeric information required. When applying an ordering strategy to the coefficient
matrix it is not sufficient to use a triangular representation as the resulting ordering differs
from that generated using the full coefficient matrix. This is due to the observation that
the degree of a node is equivalent to the number of off diagonal non-zero elements in the
corresponding row of the matrix. This observation holds true only for a complete matrix
and does not hold for the triangular representation. Figure 2.7 shows a triangular and
complete representation of the coefficient matrix for the 10 node example system of Figure
2.4.
It is easy to see that row 1 in Figure 2.7(b) has the correct degree of 2 whilst row 1
in Figure 2.7(a) has degree 0. Furthermore, if node 1 is eliminated from the graph G(A)
a fill-in is introduced at elements (6,9) and (9,6) in the complete matrix, increasing the
degree of nodes 9 and 6 by 1. In the triangular matrix fill-in is only introduced at element
(9,6). The degree of node 9 is correctly increased by 1 whereas node 6 incorrectly retains
its original degree.
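The difference between the two storage schemes is easy to reproduce by counting off diagonal entries per row. The sketch below assumes the non-zero positions are held as a list of (row, column) pairs; the name `row_degrees` and the small 4 node pattern are illustrative and are not taken from the 10 node system of Figure 2.7.

```python
def row_degrees(entries, n):
    """Count the off diagonal non-zeros in each of rows 1..n."""
    degree = [0] * (n + 1)
    for i, j in entries:
        if i != j:
            degree[i] += 1
    return degree[1:]


# Lower triangle (including the diagonal) of a symmetric 4 x 4 pattern.
lower = [(1, 1), (2, 1), (2, 2), (3, 1), (3, 3), (4, 3), (4, 4)]
# The full pattern adds the mirror image of every off diagonal entry.
full = lower + [(j, i) for i, j in lower if i != j]

print(row_degrees(lower, 4))   # triangular storage undercounts
print(row_degrees(full, 4))    # square storage gives the true nodal degrees
```

With triangular storage row 1 appears to have degree 0 although node 1 has two neighbours, which is the same effect as noted for row 1 of Figure 2.7(a).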
In creating a computer program to determine the elimination ordering it is easier to use
the square representation of the sparse matrix. It is possible to use the triangular matrix
representation in the ordering algorithm but this requires greater programming effort as an
extra data structure has to be used to store information about the degree of each node. The
only situation where this might be useful would be in processing a very large system on a
computer with a small amount of memory. Given the cheapness and availability of modern
memory chips, and the use of fast, cache supported virtual memory, this situation is unlikely
to occur. The type of storage used by the ordering routine does not create any problems
for the solution algorithm as the ordering routine is applied off-line prior to solution.
2.7.2 The Minimum Length Algorithm
Like the Minimum Degree algorithm the Minimum Length algorithm orders the elimination
of nodes from the network so as to minimize a given constraint, in this case the length of
the elimination tree associated with the network. At each stage of the elimination process,
the next node selected for elimination is the one which has the shortest path length. In the
event of a tie the choice is arbitrary. Here path length is defined to be the length from the
initial (root) node of the tree to the node currently under consideration. All nodes start
with a path length of zero and, at each stage of the elimination, the path lengths of nodes
referenced by the elimination are updated using a simple formula [56]. Suppose node k has
just been eliminated and that row k of the matrix has entries in columns i and j. Clearly
an update or fill-in will be made to elements $a_{ij}$ and $a_{ji}$. $a_{ii}$ will be similarly updated and
the path length of $a_{ii}$, written as $d_{ii}$, is modified according to
\[ d_{ii} = \max[\, d_{kk} + 2,\; d_{ii} + 1 \,] \]
The path length of the last node to be eliminated is equal to the length of the critical path
in the elimination tree.
Betancourt and Taylor [56, 57] both observe that the Minimum Length algorithm should
result in a short critical path length for the elimination tree. As the algorithm does not
consider the degree of the nodes it cannot be expected to have good sparsity preserving
properties. Whilst this algorithm is found to give shorter trees than the Minimum Degree
algorithm, it is also found that it introduces significantly more fill-in.
2.7.3 The Minimum Degree Minimum Length Algorithm
Both of the algorithms described above operate by attempting to minimize a certain
parameter. The decision as to which node to eliminate next is based purely on which node
minimizes the desired parameter. When two equally likely nodes are encountered there is
no protocol for breaking the tie and an arbitrary choice has to be made. Many implementations
simply choose the first or last tied node encountered so as to ease the programming
of the algorithm. It has been found [56, 57] that using a second criterion to resolve conflicts
between tied nodes gives a significant improvement in the resultant ordering. The Minimum
Degree Minimum Length (MDML) algorithm uses path length as the criterion to resolve
tie break situations. At each stage of the elimination, the next node to be eliminated
is chosen to be the one with minimum degree. If more than one node has minimum degree
then the path lengths of the tied nodes are examined. The node with the minimum path
length is chosen as the next to be eliminated. If more than one node has minimum path
length then the choice is again arbitrary. As the primary selection criterion is the degree of
the nodes the MDML algorithm has the sparsity preserving properties of the MD approach
but the secondary selection criterion reduces path lengths. MDML-ordered systems have
similar amounts of fill-in to their MD counterparts but with shorter elimination trees.
2.7.4 The Minimum Length Minimum Degree Algorithm
It is also possible to use path length as the primary selection criterion with the degree
of the nodes used as a secondary selection criterion in the event of a tie. This gives rise
to the Minimum Length Minimum Degree (MLMD) algorithm. This algorithm results in
short path lengths, as in the ML approach, but also reduces the amount of fill-in produced.
However, the fill-in is still significantly greater than that introduced by the Minimum Degree
algorithm.
2.7.5 The Minimum Degree Minimum Length Least Recently Used Algorithm
Taylor [57] introduces a variant of the MDML strategy which he refers to as the Minimum
Degree Minimum Length Least Recently Used (MDMLLRU) algorithm. A third selection 
criterion is used to resolve the conflict between nodes which remain tied after the application 
of the first two selection criteria. This criterion selects the next node for elimination as the 
tied node which has been least recently referenced by the elimination process. Suppose node 
k has just been eliminated and that row k of the matrix contains entries in columns i and j. 
Nodes i and j are thus referenced in updating the matrix A following the elimination of k. 
Associated with each node in the network is a timestamp which indicates when that node 
was last referenced. These timestamps are altered each time a node is referenced during 
an update or fill-in operation. When a tiebreak occurs the timestamps of the tied nodes 
are examined and the one with the smallest timestamp is chosen. If the tie still cannot 
be broken the choice is once again arbitrary. The timestamps used in this method do not 
have to be actual times but may be a simple integer value indicating at which step in the 
elimination the given node was last referenced. 
The use of the Least Recently Used criterion ensures that the MDMLLRU algorithm
does not continue following the same path through the elimination tree when a tie occurs
but allows it to jump to other paths in the tree. Instead of focusing its attention on a
particular part of the tree until that path has been eliminated, the MDMLLRU algorithm
distributes its focus more evenly over the entire tree. This results in elimination trees which
are both short and wide. The use of Minimum Degree as the primary selection criterion
maintains good sparsity preserving properties.
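The three-level selection rule can be sketched in Python as a lexicographic comparison. This is only an illustrative sketch: the function names `select_next` and `touch`, and the dictionaries holding the degree, path length and timestamp of each node, are assumptions introduced here and are not taken from [57].

```python
def select_next(candidates, degree, path_length, last_used):
    """Pick the next node to eliminate.

    Ties on degree are broken by path length, and ties on both are broken
    by the smallest timestamp (least recently referenced). Any remaining
    tie is resolved arbitrarily: min() keeps the first candidate seen.
    """
    return min(
        candidates,
        key=lambda v: (degree[v], path_length[v], last_used[v]),
    )


def touch(last_used, referenced_nodes, step):
    """Record that the given nodes were referenced at elimination step 'step'."""
    for v in referenced_nodes:
        last_used[v] = step


degree = {"a": 2, "b": 2, "c": 2, "d": 3}
path_length = {"a": 1, "b": 0, "c": 0, "d": 0}
last_used = {"a": 0, "b": 5, "c": 2, "d": 0}

first = select_next(degree.keys(), degree, path_length, last_used)
touch(last_used, [first], step=7)   # pretend the chosen node was just referenced
second = select_next(degree.keys(), degree, path_length, last_used)
```

Using an integer step counter as the timestamp follows the remark above that actual times are not needed.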
2.7.6 Comparative Analysis of the Ordering Methods
A simple computer program was written to allow an analysis to be undertaken on the
performance of each of the algorithms described above. The selected ordering technique
was repeatedly applied to the given system, usually the CEGB 734 node system, for 1000
iterations. Before each iteration the nodes in the test network were randomly reordered to
something other than their natural order. The ordering algorithm was applied and the path
length and fill-in resulting from the chosen algorithm were recorded. At the end of the test
run these figures were used to derive a set of statistics which characterized the performance
of the chosen ordering scheme in terms of path length and fill-in. Applying this test to
the same system for each ordering scheme allows the performance of these algorithms to be
compared. Table 2.1 shows the results of this testing.

                       Fill-in              Path length
Ordering Scheme    Minimum   Mean    Minimum   Mean   Maximum
MD                   616      636       28      38      45
MDML                 616      628       23      31      37
MLMD                 877      970       22      26      31
MDMLLRU              617      629       24      30      33
Table 2.1: Statistical performance of the ordering algorithms
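A minimal Python sketch of this kind of test harness is given below. It is not the program used to produce Table 2.1: it applies only a plain Minimum Degree rule, records only fill-in, and models the random initial ordering by letting that ordering decide ties. All function and variable names are hypothetical.

```python
import random
from statistics import mean


def minimum_degree_fill(adj, initial_order):
    """Simulate elimination under Minimum Degree and count fill-in edges.

    adj maps each node to the set of its neighbours; initial_order is the
    (random) initial numbering, used only to break ties between nodes.
    """
    graph = {v: set(nbrs) for v, nbrs in adj.items()}
    rank = {v: i for i, v in enumerate(initial_order)}
    fill = 0
    while graph:
        # Lowest degree first; ties fall back on the initial ordering.
        v = min(graph, key=lambda u: (len(graph[u]), rank[u]))
        nbrs = sorted(graph.pop(v), key=rank.get)
        for a in nbrs:
            graph[a].discard(v)
        # Eliminating v joins all of its remaining neighbours pairwise.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in graph[a]:
                    graph[a].add(b)
                    graph[b].add(a)
                    fill += 1
    return fill


def ordering_statistics(adj, trials=1000, seed=0):
    """Return (minimum, mean) fill-in over randomly reordered trials."""
    rng = random.Random(seed)
    nodes = list(adj)
    fills = []
    for _ in range(trials):
        rng.shuffle(nodes)
        fills.append(minimum_degree_fill(adj, nodes))
    return min(fills), mean(fills)
```

On a real network the minimum and mean would generally differ, which is the spread the table reports; path length could be recorded in the same loop.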
As Table 2.1 shows, the ordering algorithms behave much as expected. The Minimum 
Degree algorithm gives a small amount of fill-in but results in long elimination trees.
Minimum Length reduces the length of the tree at the expense of the amount of fill-in. Minimum
Degree Minimum Length improves on the situation by maintaining the low fill-in of
Minimum Degree whilst reducing the length of the tree. Minimum Length Minimum Degree
has good length qualities but poor fill-in performance. The best ordering algorithm is the
MDMLLRU algorithm, which has short tree lengths and good fill-in performance. As Taylor
predicts, the use of this algorithm gives rise to short, broad trees. Broad trees are desirable 
to facilitate the partitioning of the tree into subtrees for parallel processing whilst the short 
critical path length gives shorter execution times than those produced as a result of other 
ordering algorithms. Table 2.2 shows the effect of the different ordering strategies on the 
speed-up obtained by processing the CEGB network in parallel. The table clearly shows 
the beneficial effect of introducing tie-breaking criteria into the ordering strategy. Each 
result was obtained by partitioning the reordered system into the same number of indepen-
dent parts which were solved using four processors. The actual parallel solution algorithm 
used to achieve the solution is unimportant as all the results were obtained using the same 
solution algorithm. The relative speed-ups show the effect of ordering on speed-up. 
Ordering Scheme    Fill-in   Path Length   Speed-up
MD                   655         35          3.07
MDML                 631         31          3.17
MDMLLRU              634         24          4.52
Table 2.2: The effect of elimination ordering on speed-up

It is observed that in Table 2.1 there are certain shortest paths and minimum fills
encountered which differ considerably from the mean path length and mean fill-in. This phenomenon
merits explanation. Although they are often referred to as optimal ordering strategies, the
ordering schemes outlined above can best be thought of as approximate, near optimal
techniques. Consider a graph with n nodes: there are n! different ways in which the nodes in the
graph can be renumbered (reordered). To find the optimal elimination ordering the ordering
algorithm must examine all n! possible reorderings. When n is of the order of 1000 this
becomes an intractable problem and it is in fact NP-complete [58, 59]. It is not possible to
examine all possible orderings in a reasonable time so the near optimal ordering algorithms
work by optimizing the elimination based upon the initial network ordering provided.
Unfortunately these techniques are sensitive to the initial ordering and certain initial orderings
result in better than average elimination orderings, as characterized by short path lengths
and low fill-in. These solutions cannot in any way be considered optimal unless the whole
solution space has been searched, as it is always possible that a better reordering may be
found in a different area of the solution space. The observed shortest paths and minimum
fill-in of Table 2.1 are a direct result of randomly ordering the network before applying the
chosen ordering algorithm. The initial random ordering causes the program to leap around
the n! search space and it occasionally happens upon a better than average resultant
ordering. A similar argument explains why worse than average orderings are also encountered.
Chapter 7 proposes a technique based on the use of genetic algorithms, which may help in
rapidly locating the better than average orderings which exist within the solution space.
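To give a feel for the size of the n! search space, the short Python sketch below counts the decimal digits of n! without forming the factorial itself. The helper name `factorial_digits` is hypothetical; for n = 1000 the count runs to more than 2,500 digits, which is why exhaustive search is out of the question.

```python
import math


def factorial_digits(n):
    """Number of decimal digits of n!, computed from the log-gamma function."""
    return math.floor(math.lgamma(n + 1) / math.log(10)) + 1


digits_734 = factorial_digits(734)     # size of the search space for a 734 node system
digits_1000 = factorial_digits(1000)
```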
2.7.7 Deriving the Elimination Tree
There are a number of methods available for deriving the elimination tree from the filled
graph of a given system. One technique eliminates all off-diagonal non-zeroes from the filled
graph except for the first non-zero below the diagonal in each column. This matrix, referred
to as $F_t$, has a tree structured graph $G(F_t)$ associated with it. This graph depends entirely
on the structure of the original matrix $A$ and its initial ordering and is the elimination tree,
$T(A)$, for the system described by $A$, $G(A)$ and $F$, $G(F)$.
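The first technique can be sketched in Python as follows. The sketch assumes the filled structure is supplied as a dictionary mapping each column index to the set of row indices holding non-zeroes in that column; this input format, the function names and the convention of counting nodes (rather than edges) along a path are assumptions made for illustration.

```python
def elimination_tree(filled_columns):
    """Return parent[j] for every column j of the filled matrix.

    The parent of column j is the row index of the first non-zero below
    the diagonal in column j; the root has parent None.
    """
    parent = {}
    for j, rows in filled_columns.items():
        below = [i for i in rows if i > j]
        parent[j] = min(below) if below else None
    return parent


def critical_path_length(parent):
    """Number of nodes on the longest leaf-to-root path of the tree."""
    def depth(v):
        count = 1
        while parent[v] is not None:
            v = parent[v]
            count += 1
        return count

    return max(depth(v) for v in parent)


# Filled structure of a small 4 x 4 example (below-diagonal rows per column).
columns = {0: {2}, 1: {2, 3}, 2: {3}, 3: set()}
tree = elimination_tree(columns)
```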
Another method of obtaining the elimination tree is to perform a depth first search using
the data in F. Depth first search [58] is an efficient recursive algorithm for systematically
visiting vertices in a tree. The search starts at the root of the tree and travels to unvisited
nodes along a path until it reaches the end (leaf node) of the path. The search then recurses
back to previously visited nodes and from there travels to unvisited nodes on different paths
until a leaf is reached again. Eventually all the nodes in the tree are visited and the search
recurses back to the root. This approach is equivalent to starting with the last row of the
matrix, $F_t$, corresponding to node n, as the root of the tree. The tree is derived as the
factorisation path of node n as follows. From row n, take the first non-zero to the left of the
diagonal as the first entry in the ordered list. Take this as a row and place the index of its
first entry to the left of the diagonal in the list. Repeat the process until a row is reached
which has no non-zeroes to the left of the diagonal. This corresponds to a leaf node. Now
trace back through the path of visited nodes to the first row which has unvisited nodes to
the left of the diagonal. Place the index of the first unvisited entry to the left of the diagonal
in the list and begin the search again, backtracking as necessary. The search continues until
the list contains n entries. Sedgewick [58] provides a more detailed discussion of the depth
first search algorithm. The algorithm is particularly suitable for computer implementation
and the program which automatically plots the tree diagrams shown in this thesis is based
on the use of a depth first search method.
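A depth first traversal of this kind might be sketched in Python as below. The tree is assumed to be given as a parent dictionary with a single root, and children are visited nearest-the-diagonal first (largest index first); both the input format and that visiting order are assumptions of this sketch rather than details taken from the plotting program.

```python
def dfs_order(parent):
    """Return the nodes of the tree in depth first (pre-order) sequence."""
    children = {v: [] for v in parent}
    root = None
    for v, p in parent.items():
        if p is None:
            root = v
        else:
            children[p].append(v)

    order = []
    stack = [root]
    while stack:
        v = stack.pop()
        order.append(v)
        # Push in ascending order so the largest index is visited first.
        stack.extend(sorted(children[v]))
    return order


# Node 4 is the root; it has two branches, 3 -> 2 and 1 -> 0.
example_parent = {0: 1, 1: 4, 2: 3, 3: 4, 4: None}
visit_sequence = dfs_order(example_parent)
```

An explicit stack is used in place of recursion so that long paths in large networks do not hit Python's recursion limit.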
2.8 Implementing a Sequential Solution of the Network Equations
A program which solves the algebraic network equations has two main components: an
offline ordering routine to determine the optimal elimination order and the triangular
factorisation (e.g. bifactorisation) routine which solves the equations for the unknown values.
This section considers some of the practical aspects of implementing a sequential solution. 
2.8.1 Storage of Sparse Matrices
The usual method of representing a matrix within a computer program is as a two
dimensional array. Given a sparse matrix it is inefficient to store all the zero elements as these do
not contribute to the solution. A better approach is to store only the non-zero matrix
elements [2, 60]. Under this scheme each row of the matrix is stored in memory as a linked list
of the row elements, with only the non-zero elements being stored. Also stored alongside
each element is its column index. The storage method is illustrated graphically in Figure
2.8. This data structure is implemented using a standard form of linked list created and
managed by functions similar to those presented by Kelley & Pohl [61]. In the C language,
each node of the linked list is a structure which has two fields
• val - A double floating point complex variable holding the value of the matrix element
• col_no - Integer variable holding column index of this element
Figure 2.8: Storage Of Sparse Matrix Rows In Linked Lists
Other information about the matrix also needs to be stored and this can be held in a
two dimensional array with n rows, where n is the number of rows in the matrix. This
arrangement, shown in Figure 2.9, allows diagonal elements to be accessed quickly and
provides for easy searching of the row linked lists. The three fields of the array are
• diag - Double floating point complex value holding the diagonal matrix element value
• irap - Pointer to the memory address of the head of the linked list for this row
• noze - Integer variable holding number of non-zero entries in row

Diag          Irap   Noze
3.0 + j4.0    1140    1
2.0 - j1.0    1180    0
2.0 + j5.0    1200    1
1.0 + j0      1240    0
Figure 2.9: Storage Of Extra Information About The Matrix
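The storage scheme can be mirrored in a short Python sketch. The field names `val`, `col_no`, `diag`, `irap` and `noze` follow the text; everything else (the class names, the `next` link, the helper functions, and the choice to count only off-diagonal entries in `noze`) is assumed for illustration and is not the thesis implementation, which is written in C.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    val: complex                      # value of the off-diagonal element
    col_no: int                       # column index of the element
    next: Optional["Element"] = None  # next element in the row, or None


@dataclass
class RowInfo:
    diag: complex = 0j                # diagonal element of the row
    irap: Optional[Element] = None    # head of the row's linked list
    noze: int = 0                     # stored off-diagonal entries (assumed meaning)


def set_entry(rows, i, j, value):
    """Store value at (i, j), creating a list node if none exists yet."""
    row = rows[i]
    if i == j:
        row.diag = value
        return
    node = row.irap
    while node is not None:
        if node.col_no == j:
            node.val = value
            return
        node = node.next
    row.irap = Element(value, j, row.irap)  # insert at the head of the list
    row.noze += 1


def get_entry(rows, i, j):
    """Return the element at (i, j); absent entries are treated as zero."""
    if i == j:
        return rows[i].diag
    node = rows[i].irap
    while node is not None:
        if node.col_no == j:
            return node.val
        node = node.next
    return 0j
```

Keeping the diagonal outside the list means it can be read without a search, which matches the stated purpose of the `diag` field.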
2.8.2 Determination of Elimination Ordering
The ordering algorithm operates by reading network connection data from the data file
and is used prior to setting up the admittance matrix to determine the required order of
elimination. This is used to establish a mapping between physical system node numbers
and the required new node numbers.
Figure 2.10: The information array used for updating
A selection sort algorithm, adapted from one presented by Sedgewick [58], can be used
to examine the topology lists of the matrix and determine the one which has the lowest value
of a given parameter as the next one to be eliminated. If more than one list has the lowest
value then the next row to be eliminated is taken to be the first one occurring in the matrix.
The selection sort routine is employed between simulated elimination steps to identify the
next row/column for elimination. The factorisation routine forms the admittance matrix
by reading in data from the disk file and renumbering each pair of node numbers before the
data associated with them is inserted into the admittance matrix.
2.8.3 Coefficient Matrix Factorisation Using Bifactorisation
Having stored the coefficient matrix as a set of linked lists and determined the order of
elimination of matrix rows, the factored matrix, $A_F$, can be created as described previously.
The formulae for obtaining the elements of the factored matrix are those presented in
Section 2.4.4. Whilst determining the factors of the row being eliminated an array is created
which holds the column index and the value of $A_{F_{kl}}$ for each entry in the row linked
list. This array provides for fast and efficient updating of the matrix elements as entries of
the coefficient matrix are updated according to the entries of this array.
Suppose we are in the $p$th column of the array. $A_{pp}$ is modified by adding to it $A_{pk} * A_{F_{kp}}$
yielding
$$A_{pp} = A_{pp} + A_{pk} * A_{F_{kp}}$$
Then the elements $p + 1, \ldots, n$ are used according to
$$A_{pl} = A_{pl} + A_{pk} * A_{F_{kl}}$$
where $l = p + 1, \ldots, n$. If at any step $A_{pl}$ does not exist in the row linked list for row $p$ it is
assumed to be zero and
$$A_{pl} = A_{pk} * A_{F_{kl}}$$
Once the factored matrix has been derived it is necessary to multiply its entries to
achieve the right to left multiplication of equation (2.18) and hence the direct solution.
Multiplication of x by all n left hand factors is performed followed by multiplication
by all the right hand factors. Throughout the multiplication only non-zero elements are
used and efficient use is made of the fact that multiplication by unity is equivalent to no
multiplication at all. After multiplication by the last factor x contains the solution to the
system of equations.
2.9 Summary 
This chapter has introduced the subject of power system modelling. Basic models of power
system components have been presented and the derivation of these models is given in detail
in Appendix B. Particular attention has been given to the representation of the network
and how this gives rise to the linear equations which have to be solved in many power
system computations. Efficient sequential algorithms for obtaining a direct solution of the
network equations have been presented and sparse matrix techniques were introduced as an
efficient method for representing and processing the coefficient matrix within a computer
program. The operation of these algorithms can be examined by resorting to the elimination
tree, a simple structure which identifies the precedence relationships in the factorisation
and substitution phases of the algorithms. Elimination trees are extremely important in
analysing the behaviour of both parallel and sequential solutions and a simple method
for deriving the elimination tree of any coefficient matrix has been introduced. The use
of near-optimal ordering methods has been presented as a way of reducing the amount
of computation needed to solve the equations and thus minimizing the solution time. In
addition, practical considerations for the implementation of a solution program have been
examined.
The next chapter extends the discussion to focus upon parallel algorithms for solving 
the network equations. Many of the direct parallel methods are based upon the direct 
sequential methods presented in this chapter. Iterative methods, which are of limited use 
for sequential solutions, also provide effective parallel solutions. The relative merits of both 
approaches are considered in respect to the solution of power system network equations. 
Chapter 3 
Parallel Methods of Solving the 
Network Equations 
3.1 Introduction 
The solution of large sparse sets of linear equations is a common computational problem
encountered in many branches of science and engineering. Such systems of equations
appear in fields as diverse as analysis/simulation of electrical networks, finite element
methods, structural analysis and analysis/simulation of hydraulic networks. Sequential methods
for solving these equations have been presented but it is also possible to solve them
using parallel processing techniques. Tylavsky et al. [50] note that parallel dense matrix
algorithms are not competitive with sequential sparse matrix algorithms, but it is possible
to create parallel sparse matrix algorithms by exploiting independences in the equations.
Sparse matrix computations contain more inherent parallelism than their dense matrix
counterparts but due to the irregular pattern of sparsity in power system matrices it has
been difficult to find efficient sparse parallel methods [50].
Two flavours of parallel sparse solution exist - direct and iterative. These two distinct 
methods have different computational characteristics and are best suited to different types 
of problem. Direct methods are more suitable for power system problems [50] and much 
research effort has been concentrated on the development of parallel LU solution methods. 
Outside the power system field much work has been done on parallel implementations of 
the Cholesky factorisation techniques. 
This chapter examines some issues in the solution of sparse sets of linear equations 
on parallel computer architectures. The chapter begins with a discussion of direct and 
iterative methods and assesses which is most suitable for power system problems. Issues 
faced in the design and use of direct methods are then considered before existing techniques 
are presented. Cholesky methods are examined to illustrate the principles of common 
approaches and a summary of the LU techniques used in power systems work is presented. 
3.2 Iterative Methods for Solving Linear Equations 
The fundamental difference between direct and iterative methods is the number of passes 
through the algorithm required to give the complete solution. Direct methods apply heuris-
tic rules to manipulate the equations and achieve an exact solution with only one pass 
through the algorithm. As the name suggests, iterative methods require more than one 
pass through the algorithm and the solution algorithms come in two distinct flavours,
stationary techniques and gradient descent techniques [62]. All iterative methods operate by
choosing trial values for the unknown variables and using iterative correction to improve on 
previous values. The true solution will not be obtained in practice and the iterative method 
must be terminated once a suitably accurate solution has been achieved. Three common 
iterative techniques will now be examined. 
3.2.1 The Jacobi Method
The Jacobi method was one of the first iterative techniques developed and it is a stationary 
iterative technique. As convergence is only guaranteed in the presence of a dominant leading 
diagonal the coefficient matrix is usually scaled to give unit coefficients on the leading 
diagonal. Hence the linear equations A x = b may be written as 
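$$\begin{bmatrix} 1 & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & 1 & a_{23} & \cdots & a_{2n} \\ a_{31} & a_{32} & 1 & \cdots & a_{3n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & a_{n3} & \cdots & 1 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ b_3 \\ \vdots \\ b_n \end{bmatrix} \qquad (3.1)$$
The coefficient matrix may be expressed in terms of an upper and lower triangular
component, U and L respectively, as
$$A = I - L - U \qquad (3.2)$$
where
$$L = \begin{bmatrix} 0 & & & & \\ -a_{21} & 0 & & & \\ -a_{31} & -a_{32} & 0 & & \\ \vdots & \vdots & & \ddots & \\ -a_{n1} & -a_{n2} & \cdots & -a_{n,n-1} & 0 \end{bmatrix}$$
and
$$U = \begin{bmatrix} 0 & -a_{12} & -a_{13} & \cdots & -a_{1n} \\ & 0 & -a_{23} & \cdots & -a_{2n} \\ & & \ddots & & \vdots \\ & & & 0 & -a_{n-1,n} \\ & & & & 0 \end{bmatrix}$$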
To solve the equations using the Jacobi method a trial vector, $x^{(0)}$, is selected. It is
usual to set all elements of this initial vector to zero. A new trial vector is derived at each
iteration by modifying the trial vector of the previous iteration. This process of iterative 
modification continues until the solution converges to within some desired tolerance. The 
benefit of the Jacobi method is that convergence is guaranteed for diagonally dominant 
matrices [63]. 
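At the $k$th iteration the elements of the new trial vector $x^{(k+1)}$ are derived according to
$$x_i^{(k+1)} = b_i - a_{i1} x_1^{(k)} - \cdots - a_{i,i-1} x_{i-1}^{(k)} - a_{i,i+1} x_{i+1}^{(k)} - \cdots - a_{in} x_n^{(k)} \qquad (3.3)$$
The complete iteration step may be expressed in matrix notation as
$$x^{(k+1)} = b + (L + U) x^{(k)} = b + (I - A) x^{(k)} \qquad (3.4)$$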
Defining the residual vector $r^{(k)} = b - A x^{(k)}$ allows (3.3) to be expressed in terms of this
residual [63].
$$x_i^{(k+1)} = x_i^{(k)} + \frac{r_i^{(k)}}{a_{ii}} \qquad (3.5)$$
If the coefficient matrix has been scaled to give unit diagonals
$$x_i^{(k+1)} = r_i^{(k)} + x_i^{(k)} \qquad (3.6)$$
The complete iteration step is expressed in matrix notation as
$$x^{(k+1)} = r^{(k)} + x^{(k)} \qquad (3.7)$$
Equation (3.6) provides the key to the efficient parallel implementation of the Jacobi
method. If the vectors are assigned to processors such that $x_i^{(k)}$ and $r_i^{(k)}$ reside on the same
processor then (3.6) may be computed without the need for interprocessor communication.
If $x^{(k)}$ has n elements then it is possible to calculate all n elements of $x^{(k+1)}$ in parallel using n
processors. Calculation of the residual vector $r^{(k)}$ requires interprocessor communication as
a result of the matrix-vector multiplication $A x^{(k)}$. The decision to terminate the iterations
is based on the residual norm $\|r^{(k)}\|$ and the termination condition is
$$\|r^{(k)}\| < \text{tolerance} \qquad (3.8)$$
The tolerance value is normally set to less than 0.001 to ensure that the solution vector is
correct to at least three significant figures.
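The residual form of the Jacobi iteration can be sketched sequentially in Python as below. This is a minimal dense-matrix sketch under stated assumptions: the function name `jacobi` is hypothetical, the matrix is scaled to unit diagonal inside the function, and the infinity norm is used for the termination test because the text does not say which norm is intended.

```python
def jacobi(A, b, tol=1e-3, max_iter=1000):
    """Solve A x = b by Jacobi iteration; returns (x, iterations used)."""
    n = len(b)
    diag = [A[i][i] for i in range(n)]
    # Scale each row so that the leading diagonal holds unit coefficients.
    As = [[A[i][j] / diag[i] for j in range(n)] for i in range(n)]
    bs = [b[i] / diag[i] for i in range(n)]

    x = [0.0] * n                      # usual choice of initial trial vector
    for k in range(max_iter):
        # Residual r = b - A x; this is the step needing communication.
        r = [bs[i] - sum(As[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(ri) for ri in r) < tol:
            return x, k
        # Each element update is independent of the others, so all n
        # updates could be carried out in parallel.
        x = [x[i] + r[i] for i in range(n)]
    raise RuntimeError("Jacobi iteration did not converge")


solution, iterations = jacobi([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0], tol=1e-9)
```

The example matrix is diagonally dominant, so convergence is expected; for other matrices the loop may run to `max_iter`.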
3.2.2 The Gauss-Seidel Method
The Gauss-Seidel method is similar to the Jacobi method and is also a stationary iterative 
method. The equations are once again expressed as in (3.1) and (3.2). A trial vector is 
selected and the elements of this vector are iteratively modified until the solution converges 
to within an acceptable tolerance. 
At the $k$th iteration step the values of $x_1, x_2, \ldots, x_{i-1}$ of the vector $x^{(k+1)}$ will have been
derived from the previous trial vector $x^{(k)}$ but the values of $x_i, \ldots, x_n$ remain to be
determined. The $i$th element of $x$ is modified according to
$$x_i^{(k+1)} = b_i - a_{i1} x_1^{(k+1)} - \cdots - a_{i,i-1} x_{i-1}^{(k+1)} - a_{i,i+1} x_{i+1}^{(k)} - \cdots - a_{in} x_n^{(k)} \qquad (3.9)$$
Equation (3.9) may be written in matrix notation as
$$(I - L) x^{(k+1)} = b + U x^{(k)} \qquad (3.10)$$
The successive correction procedure is iterated until the error in the solution falls below
some specified tolerance limit. The decision to terminate this iterative process is based on
the calculation of a residual norm. Termination occurs when
$$\|r^{(k)}\| < \text{tolerance} \qquad (3.11)$$
where $r^{(k)}$ is the residual and is calculated as
$$r^{(k)} = b - A x^{(k)} \qquad (3.12)$$
Again the tolerance value is normally set to less than 0.001 to ensure that the solution
vector is correct to at least three significant figures.
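A sequential Gauss-Seidel sweep can be sketched in Python in the same style. The sketch is illustrative only: the name `gauss_seidel` is hypothetical, the division by the diagonal element stands in for the prior scaling to unit diagonal, and the infinity norm of the residual is assumed for the termination test.

```python
def gauss_seidel(A, b, tol=1e-3, max_iter=1000):
    """Solve A x = b by Gauss-Seidel iteration; returns (x, iterations used)."""
    n = len(b)
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        for i in range(n):
            # Elements 0..i-1 of x already hold values from this sweep,
            # while elements i+1..n-1 still hold values from the last one.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            x[i] = (b[i] - s) / A[i][i]
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(ri) for ri in r) < tol:
            return x, k
    raise RuntimeError("Gauss-Seidel iteration did not converge")


gs_solution, gs_iterations = gauss_seidel(
    [[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0], tol=1e-9
)
```

Because each update inside a sweep depends on the ones before it, the inner loop cannot be split across processors in the way the Jacobi update can.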
The Gauss-Seidel algorithm has often been used in sequential power system analysis
programs. With suitable modifications it is also possible to implement a parallel version of
the Gauss-Seidel method. Suppose that the coefficient matrix can be reorganized to give it
a block structure. Each matrix block can then be assigned to an individual processor of a
parallel machine. A central pool of values for $x_i^{(k)}$ is maintained. On each iteration each
processor applies the Gauss-Seidel algorithm to the nodes within its own matrix block and
uses the values available in the pool at the start of the iteration. Upon completion of an
iteration the processors send the modified values of $x_i^{(k+1)}$ back to the pool. The algorithm
is asynchronous and each processor may begin a new iteration once the previous iteration
is complete and values have been sent to the pool. It is likely that the parallel
Gauss-Seidel solution of a given system will require more iterations than a sequential solution of
that system due to the asynchronous nature of the algorithm. Even if the same number
of iterations occur the parallel algorithm will require more total processing time due to the
contention between processors accessing the central pool of $x_i^{(k)}$ values.
3.2.3 The Conjugate Gradient Method
The Conjugate Gradient method is a specific example of a category of iterative solution 
techniques known as gradient descent methods. Gradient descent methods are based on 
the premise that solving a set of n simultaneous equations is equivalent to locating the 
minimum of an error function in n-dimensional space. At each iteration the set of trial 
values for the variables are used to create a new set of values which correspond to a lower 
value of the error function. The location of the global minimum of the error function in the 
n-dimensional space corresponds to the solution of the set of simultaneous equations. 
If $x$ is the vector of trial values then a residual vector, $r$, can be calculated as
$$r = b - A x \qquad (3.13)$$
The error function, $h$, may be defined as
$$h = r^T A^{-1} r \qquad (3.14)$$
If the matrix is positive definite symmetric$^1$ then the error function will have a positive
value for all vectors $x$ except for the correct solution, where $r = 0$ and $h = 0$. The
vector $x_k$ represents a point in the n-dimensional space and the equation
$$x = x_k + \alpha d_k \qquad (3.15)$$
defines a line which passes through $x_k$ with a direction determined by $d_k$. $\alpha$ is a parameter
which is directly proportional to the distance of $x$ from $x_k$. Note that $x_k$ is the value of $x$
obtained at the $k$th iteration. The error function $h$ varies quadratically [65, 62] with $\alpha$ and
has a local minimum at
$$\frac{dh}{d\alpha} = 2 d_k^T [\alpha A d_k - r_k] = 0 \qquad (3.16)$$
All the gradient descent methods use the location of this local minimum to derive the next
value of the trial vector according to
$$\alpha_k = \frac{d_k^T r_k}{d_k^T A d_k}, \qquad x_{k+1} = x_k + \alpha_k d_k \qquad (3.17)$$
The only difference between the various gradient descent methods is the choice of the
direction vectors $d_k$. In the Conjugate Gradient method the direction vectors are chosen
to be a set of vectors $p_0, p_1, \ldots, p_k$ which represent the steepest descent of the points
$x_0, x_1, \ldots, x_k$. Additionally the $p_k$ vectors are chosen to be conjugate (i.e. orthogonal with
respect to $A$). The vectors thus satisfy the condition
$$p_i^T A p_j = 0, \quad i \neq j \qquad (3.18)$$
$^1$ A symmetric matrix is positive definite if all of its eigenvalues are positive [64]. Alternatively, the n x n
symmetric matrix $A$ is positive definite iff $x^T A x > 0$ for every n-dimensional column vector $x \neq 0$. Power
system admittance matrices do not obey these conditions and are not positive definite. In many conditions
they are almost positive definite and these techniques have been used with limited success.
To solve a set of equations using the Conjugate Gradient method initial values ($k = 0$)
must be specified for $p$ according to
$$p_0 = r_0 = b - A x_0 \qquad (3.19)$$
$x_0$ may be initialized to zero. At the $k$th iteration
$$u_k = A p_k \qquad (3.20)$$
$$\alpha_k = \frac{r_k^T r_k}{p_k^T u_k} \qquad (3.21)$$
$$x_{k+1} = x_k + \alpha_k p_k \qquad (3.22)$$
$$r_{k+1} = r_k - \alpha_k u_k \qquad (3.23)$$
$$\beta_k = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k} \qquad (3.24)$$
$$p_{k+1} = r_{k+1} + \beta_k p_k \qquad (3.25)$$
Theoretically the correct solution is obtained after n iterations but if the equations are
ill-conditioned or the matrix is densely populated then it may take more than n iterations
to reach convergence.
Equations (3.20) to (3.25) are calculated on each iteration and the potential parallelism
in the method is visible in equations (3.22) and (3.23). Neither of these equations depends
on values calculated in the other and they can be computed concurrently. If there are
n simultaneous equations and n processors are available in the parallel machine then the
minimum number of processing steps taken on each iteration is $2n + 3[\log_2(n)] + 10$. This is
$3[\log_2(n)] + 10$ steps more than the $2n$ minimum steps required by the Gauss-Seidel method
but the Conjugate Gradient method is likely to achieve convergence significantly quicker.
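The iteration of equations (3.19) to (3.25) can be sketched sequentially in Python as follows. This is the standard Conjugate Gradient recurrence for a symmetric positive definite matrix, written with dense lists for brevity; the function and helper names are hypothetical and the Euclidean norm of the residual is assumed as the stopping test.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A x = b for symmetric positive definite A."""
    n = len(b)
    max_iter = max_iter if max_iter is not None else 10 * n
    x = [0.0] * n
    r = list(b)            # r0 = b - A x0 with x0 = 0
    p = list(r)            # p0 = r0
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:
            break
        u = matvec(A, p)                      # u_k = A p_k
        alpha = rr / dot(p, u)                # step length along p_k
        # The next two updates do not depend on each other and could
        # be computed concurrently.
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * ui for ri, ui in zip(r, u)]
        rr_new = dot(r, r)
        beta = rr_new / rr
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


cg_solution = conjugate_gradient([[4.0, 1.0], [1.0, 3.0]], [6.0, 7.0])
```

For this 2 x 2 example the method reaches the solution in two iterations, up to rounding, in line with the n-iteration property noted above.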
The Conjugate Gradient method is particularly well suited to solving large sparse
systems of equations, such as those occurring in power systems analysis. Unfortunately power
system admittance matrices are not positive definite and the Conjugate Gradient equations
must be changed to a form suitable for symmetric indefinite systems. The error function
changes to become
$$h = r^T r \qquad (3.26)$$
and
$$\alpha_k = \frac{r_k^T A r_k}{u_k^T u_k} \qquad (3.27)$$
$$\beta_k = \frac{r_{k+1}^T A r_{k+1}}{r_k^T A r_k} \qquad (3.28)$$
$$p_i^T A^2 p_j = 0, \quad i \neq j \qquad (3.29)$$
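Decker et al. [28] proposed a parallel method for solving the power system network
equations which combines both the LU decomposition and Conjugate Gradient methods.
Given the set of network equations in Bordered Block Diagonal Form
$$\begin{bmatrix} I_1 \\ I_2 \\ \vdots \\ I_p \\ I_s \end{bmatrix} = \begin{bmatrix} Y_1 & & & & \tilde{Y}_1 \\ & Y_2 & & & \tilde{Y}_2 \\ & & \ddots & & \vdots \\ & & & Y_p & \tilde{Y}_p \\ \tilde{Y}_1^T & \tilde{Y}_2^T & \cdots & \tilde{Y}_p^T & Y_s \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_p \\ V_s \end{bmatrix} \qquad (3.30)$$
Decker notes that Block Gaussian elimination may be used to solve this set of equations in
two stages
1. Step 1: Solve
$$\hat{Y}_s V_s = \hat{I}_s \qquad (3.31)$$
where
$$\hat{Y}_s = Y_s - \sum_{i=1}^{p} \tilde{Y}_i^T Y_i^{-1} \tilde{Y}_i$$
$$\hat{I}_s = I_s - \sum_{i=1}^{p} \tilde{Y}_i^T Y_i^{-1} I_i$$
2. Step 2: For $i = 1, 2, \ldots, p$, solve
$$Y_i V_i = I_i - \tilde{Y}_i V_s \qquad (3.32)$$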
Step 2 is inherently parallel and if p processors are available then the solutions for all p
subnetworks can be obtained simultaneously. Step 1 is inherently serial and although it can
be solved by a parallel direct method Decker does not advocate this approach as "the
parallel implementation of direct methods is not an easy task" [28]. Instead Decker proposes
the use of the Conjugate Gradient method to solve the cutset
equation (3.31). This requires the formation of $\hat{Y}_s V_s^k$, $\hat{I}_s$, $\hat{Y}_s d^k$ and the residual $r^k$ at
each iteration. If the subnetwork equations (3.32) are solved by direct LU factorisation
techniques the factors of $Y_i$, $i = 1, \ldots, p$ can be used to efficiently calculate $\hat{Y}_s V_s^k$, $\hat{I}_s$ and $\hat{Y}_s d^k$
in parallel.
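The two-stage block elimination can be illustrated with the Python sketch below. It is not Decker's method: the cutset system of Step 1 is solved here by a small dense direct solver rather than by the Conjugate Gradient method, real arithmetic is used in the example, and all names (`solve`, `bbdf_solve`, the argument layout) are assumptions made for this sketch. `Y[i]` is the diagonal block of subnetwork i, `C[i]` the border block coupling it to the cutset, and `Ys` the cutset block.

```python
def solve(A, b):
    """Dense Gaussian elimination with partial pivoting (small blocks only)."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def bbdf_solve(Y, C, Ys, I, Is):
    """Solve a bordered block diagonal system; returns (V blocks, Vs)."""
    s = len(Is)
    Ys_hat = [list(row) for row in Ys]
    Is_hat = list(Is)
    # Form the reduced cutset system; each block's contribution is independent.
    for Yi, Ci, Ii in zip(Y, C, I):
        ni = len(Ii)
        w = solve(Yi, Ii)                                   # Yi^-1 Ii
        cols = [solve(Yi, [Ci[r][c] for r in range(ni)]) for c in range(s)]
        for a in range(s):
            Is_hat[a] -= sum(Ci[r][a] * w[r] for r in range(ni))
            for c in range(s):
                Ys_hat[a][c] -= sum(Ci[r][a] * cols[c][r] for r in range(ni))
    Vs = solve(Ys_hat, Is_hat)                              # Step 1 (serial)
    V = []
    for Yi, Ci, Ii in zip(Y, C, I):                         # Step 2 (parallelisable)
        rhs = [Ii[r] - sum(Ci[r][c] * Vs[c] for c in range(s))
               for r in range(len(Ii))]
        V.append(solve(Yi, rhs))
    return V, Vs


# Two 1 x 1 subnetworks joined by a single cutset node.
V, Vs = bbdf_solve([[[2.0]], [[3.0]]], [[[1.0]], [[1.0]]],
                   [[4.0]], [[5.0], [9.0]], [15.0])
```

In a parallel setting each pass of the two block loops would run on its own processor, leaving only the cutset solve as the serial bottleneck.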
Decker [28] formulated this approach as part of a transient stability simulation
implemented on an array of 8 Transputers connected in a hypercube configuration. The solution
of the network equations was actually implemented using a combined LU factorisation and
preconditioned conjugate gradient method. Preconditioning techniques modify the residual
vector formed at each iteration to accelerate convergence. Decker notes that the use of this
conjugate gradient based method produces substantial reductions in total computation time
when compared to the best sequential method, although the sequential method he uses for
comparison utilizes only LU factorisation. Speed-up figures are provided for the complete
transient stability program and these show a speed-up of between 1.2 and 3.9 with 2 and
8 processors respectively. Unfortunately no results are provided for just the network
solution phase of the simulation. The use of preconditioning reduces the number of iterations
required by the conjugate gradient method but does not always produce a similar
reduction in computation time due to the extra overheads introduced by the preconditioning
calculations. The decomposition of the network is significant as it affects the load balancing
of the method. A poor partitioning produces an imbalanced load which adversely affects
performance. The network decomposition also has an effect on the number of iterations
required by the conjugate gradient method. A poor decomposition produces ill-conditioned
equations which take longer to converge, although this can usually be corrected through
the use of suitable preconditioning techniques.
3.3 Direct vs Iterative Methods 
The difference between the direct and iterative approaches can be characterized in terms 
of the number of steps required to yield a solution [62]. Table 3.1 illustrates the computa-
tional requirements of the two approaches for two common algorithms operating on a set 
of equations with n variables. 
                     Number of steps
Method               Factorisation   Substitution   Total                      Iterations
LU Decomposition     3(n - 1)        5n - 4         8n - 7                     1
Gauss-Seidel         -               -              2nK + [log2(n + 1)] - 1    K
Table 3.1: Operation counts for direct and iterative solution schemes
Direct methods often require more operations in total to yield a solution than iterative
techniques but there are two major advantages of the direct methods. Firstly the
factorisation and substitution operations of direct methods (e.g. LU decomposition) may be
performed separately. This allows easy solution of systems with multiple right hand sides as
the coefficient matrix only needs to be factorised once. After factorisation the substitution
operation may be used any number of times to solve for different right hand sides. As the
number of right hand sides becomes large, the average number of steps taken per solution
tends to 5n - 4 and the direct methods become significantly more efficient than
iterative methods. Iterative techniques do not provide separate factorisation and
substitution operations and 2nK + [log2(n + 1)] - 1 operations are required to yield a solution for
each right hand side vector. In a power system simulation much of the time is taken up in
solving the same system of equations with different right hand side vectors and this explains
why direct methods find such widespread usage in the field of power systems analysis.
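The amortisation argument can be checked numerically from the step counts of Table 3.1. The sketch below is illustrative only: the function names, the problem size n = 100 and the iteration count K = 10 are hypothetical, and the square bracket in the table is read as a ceiling, which is an assumption.

```python
import math

def direct_steps(n, m):
    """Steps for m right hand sides: one factorisation plus m substitutions."""
    return 3 * (n - 1) + m * (5 * n - 4)

def gauss_seidel_steps(n, m, K):
    """Steps for m right hand sides when each needs K iterations.
    The bracket in Table 3.1 is taken to be a ceiling (assumption)."""
    return m * (2 * n * K + math.ceil(math.log2(n + 1)) - 1)

n, K = 100, 10   # hypothetical problem size and iteration count
for m in (1, 10, 1000):
    per_rhs_direct = direct_steps(n, m) / m
    per_rhs_iterative = gauss_seidel_steps(n, m, K) / m
    print(m, per_rhs_direct, per_rhs_iterative)
```

For a single right hand side the direct count equals 8n - 7; as m grows the per-solution cost approaches 5n - 4, while the iterative cost per solution stays fixed.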
The second advantage of direct methods is that they do not suffer the convergence 
problems which are prevalent in iterative techniques. The value K in Table 3.1 is the 
number of iterations required for the iterative method to produce a solution which meets 
some predefined tolerance limits. If the tolerance limits are narrow or the equations are
poorly conditioned K can become quite large, significantly increasing the time taken to 
reach a solution. For power system problems it has often been found that iterative methods 
do not converge quickly enough to a solution of sufficient accuracy. If iterative methods are 
employed either approximate solutions or long solution times must be accepted, neither of 
which is appropriate for accurate, real-time simulations. The problems with convergence 
have tended to preclude the use of iterative techniques for the solution of power system 
linear equations. 
This thesis is concerned only with the direct method of solution and iterative methods
will not be considered further. The development of efficient parallel formulations of direct
methods for sparse linear systems is an area of research currently eliciting much interest.
Substantial parallelism exists in direct methods but only limited success has been achieved
in the development of parallel formulations. There are two reasons for this. Firstly the
amount of computation involved is small in relation to the size of the system to be solved.
Even a small amount of interprocessor communication can significantly alter the balance
between computation and communication and this results in poor efficiencies. The second
reason for the inefficient parallel solutions developed to date is that most of them are based
on good sequential algorithms. It is not always the case that parallelising the best sequential
algorithm will give the best parallel algorithm [63, 66, 44]. The goals of a serial algorithm are
often inappropriate in a parallel environment. Numerous parallel direct methods have been
developed and they have all been based upon Gaussian elimination or Cholesky factorisation.
This thesis examines ways to improve upon the existing methods to provide a faster and
more efficient method of solution.
Heath et al. [67] and Kumar et al. [63] observe that a complete direct sequential solution
consists of four phases
• Ordering - Find a good ordering of the matrix such that minimal fill-in is introduced
• Symbolic Factorisation - Determine the structure of the factor matrices and set up
data structures to hold them
• Numeric Factorisation - Compute the factor matrices of the coefficient matrix, A
• Triangular Solution - Using the factor matrices and the right hand side vector, perform
the forward/backward substitution operations to determine the values in the unknown
vector x
In most power system computations the ordering and symbolic factorisation operations
are required infrequently and there is not the same need for high performance which
arises from the frequently used operations. As the admittance matrix of the system seldom
changes, the numerical factorisation operation is also infrequently used. In real-time
simulations it may be necessary to refactorise the admittance matrix in response to a change in
the network topology. Inefficient numerical factorisations have no place in real-time
simulations as they may cause the simulation to leave real-time. Consequently there has been a
significant amount of research into the development of efficient parallel numeric
factorisations [35, 32, 68, 31, 50, 46, 29] although the performance of methods developed to date is
not very impressive. Tylavsky et al. [50] observe that much of the algorithm development
has been purely theoretical and little software has actually been developed for parallel
machines. The software which has been produced [32, 40] shows that a full factorization can
be achieved with a speed-up of about 2, whilst a speed-up of about 10 may be achieved
if factorization is halted before the densest part of the matrix is encountered. The most
frequently used operation in power system computations, and particularly in simulations,
is the triangular solution operation. If power system analysis and simulation algorithms
are to be efficient then efficient parallel triangular solution methods are required. This has
also been the focus of much research attention [31, 32, 26, 37, 36] but again the results are
disappointing with speed-ups seldom exceeding 3 or 4 [69, 70, 57]. The reasons for the poor
performance have been analysed by Bialek [49].
This thesis concentrates on the numeric factorisation and triangular solution operations. 
Symbolic factorisation is not considered as efficient sequential techniques have already been 
developed for this operation [67, 71, 72]. As Heath [67] points out, there is little that can be 
done to parallelize the symbolic factorisation operation and this operation is perhaps best 
performed sequentially on a single processor in a multiprocessor array. Ordering techniques 
are considered here but only in relation to improving the amount of inherent parallelism 
exploited in the numeric factorisation and triangular solution operations. No consideration 
is given to the development of parallel ordering algorithms as highly efficient sequential 
techniques already exist. 
3.4 Parallel Algorithms for Direct Solution 
3.4.1 Granularity of Solution
To solve any problem in parallel requires that problem to be divided into separate tasks 
which can be assigned to individual processors. The solution of power system network 
equations requires the set of equations to be decomposed into independent subsets. Unfor-
tunately the algebraic equations are not easy to decompose as they are global and relate 
to all nodes in the network. Conventional diakoptics techniques make use of either node
tearing or branch cutting to split the network into subnetworks, thus allowing a
partitioning of the equations. Direct methods can then be used to solve individual subnetworks
concurrently with the solutions being combined to give an overall solution for the algebraic 
equations. The number and size of subnetworks required really depends on the capabilities 
of the target parallel machine. Granularity is a qualitative measure of the size of the parallel 
tasks. Three levels of granularity can be identified and Kumar [63] defines them as 
• Fine-grain parallelism - Parallelism at the level of individual floating point operations
• Medium-grain parallelism - Parallelism in performing groups of floating point
operations, for example on an entire matrix row or column
• Coarse-grain parallelism - Parallelism in operating on independent groups of matrix
rows/columns
The exploitation of fine-grain parallelism is not suitable for message passing, distributed
memory machines due to the excessively high amount of interprocessor communication
required. Fine-grain parallel solutions are really only suited to massively parallel computing
platforms containing thousands of relatively simple processing units or to shared memory
architectures. Message passing architectures are better married with the exploitation of
medium and coarse grain parallelism and research has been directed toward developing
solutions at these levels. When designing a parallel algorithm it is not possible to ignore
the architecture of the machine on which that algorithm is to be executed. As Lin and Van
Ness [33] point out
... the architectural differences between the machines is a far more important
factor (in speed-up) than the variations in the way that the algorithm is applied.
Coarse grain parallelism arises as a direct result of the sparsity of the coefficient matrix.
Due to the sparse nature of the matrix it is possible to identify independent groups of rows
and columns which may be processed in parallel. No such groups exist in dense systems
and coarse grain parallelism cannot be exploited. Dense parallel solutions are therefore
restricted to the usage of medium grain parallelism.
For a coarse grain solution the groups of independent matrix rows/columns correspond
to subnetworks within the network. Individual subnetworks are processed by separate
processors with a central coordinating processor being used to combine individual subnetwork
results to give the overall result. The use of the central coordinating processor introduces a
large sequential stage in the parallel algorithm and is a bottleneck in the process, limiting
the efficiency of the solution of the equations. Any method which increases the amount of
parallelism extracted from these equations and reduces the size of the sequential step will
significantly increase the speed of solution.
3.4.2 Task Mapping and Load Balancing
Two of the fundamental issues in the design of any parallel program are those of task
mapping and load balancing. Chapters 4 and 6 consider these issues in more detail but a
brief introduction is now presented.
Figure 3.1: An elimination tree (a) and the wrap mapping strategy (b)
Having partitioned the problem into a number of smaller subproblems, or tasks, it is
necessary to assign processing resources to carry out these tasks. Tasks may be assigned to
processors in any order but it is desirable to minimize the amount of communication between
tasks on different processors. The task mapping operation must find the optimum placement
of tasks which minimizes the delays associated with synchronisation and communication.
One of the most common task mappings used for Cholesky factorisations is known as
the wrap mapping. Consider the elimination tree of Figure 3.1(a) which shows the
precedence relationships between the tasks which make up the problem. Suppose that there are
p processors available and there are i tasks in the problem. Generally p < i and the wrap
mapping assigns the $i^{th}$ task to the processor ((i - 1) mod p) [73]. The tasks in the tree are
wrapped on to the available processors, as shown in Figure 3.1(b). (The numbers in the
diagram refer to the processor to which each task is assigned). The wrap mapping assigns
all potentially concurrent operations to different processors and distributes the
communication load evenly across the processors. The volume of communication associated with this
mapping scheme is high and Chapter 4 will introduce a more efficient strategy.
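The wrap mapping rule is simple enough to state in a few lines. The sketch below assumes 16 tasks and 4 processors, matching the leaf row of Figure 3.1(b); the function name `wrap_map` is hypothetical.

```python
from collections import Counter

def wrap_map(num_tasks, p):
    """Assign task i (numbered from 1) to processor (i - 1) mod p."""
    return {i: (i - 1) % p for i in range(1, num_tasks + 1)}

mapping = wrap_map(16, 4)
print([mapping[i] for i in range(1, 17)])
# prints [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]

# Number of tasks held by each processor.
load = Counter(mapping.values())
print(dict(load))
```

Neighbouring tasks always land on different processors, which is what spreads the potentially concurrent work, but it also means that parent and child tasks in the tree usually sit on different processors and must communicate.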
Load balancing is an issue of critical importance to the efficiency of any parallel
algorithm. Maximum speed-up can only be attained if the computational workload is equally
divided amongst all the processors and these processors are kept constantly busy [19].
Despite this many researchers seem to ignore the load balancing issue. In fact some go further
and claim that load balancing is unimportant [44], a view which challenges the beliefs of the
parallel computing community. Load balancing techniques may have one of two flavours
- static or dynamic. Static approaches are the simplest and balance the load by a careful
division of the problem into tasks and optimum assignment of tasks to processors based
on some a priori knowledge. Dynamic strategies equalize the load whilst the program is
running by moving tasks and data between the processors. Due to the relatively high
communication requirements of dynamic strategies they can be inefficient when applied to
distributed memory machines. The Transputer and its languages do not provide support
for task migration [19] and other dynamic load balancing operations. This thesis considers
only the simpler static load balancing strategies.
3.4.3 Ordering Strategies for Parallel Solutions
Sequential solutions employ ordering techniques to reduce the memory requirements and
computational workload by minimizing the amount of fill-in resulting from eliminations.
The same considerations apply to parallel solutions but there are additional reasons for
reordering the coefficient matrix. Reordering the matrix alters the shape of the elimination
tree associated with that system. In their natural order certain systems appear to have
little exploitable parallelism. For example the elimination tree associated with a tridiagonal
system is simply a linear chain of tasks. Applying an appropriate reordering can modify
this system such that its elimination tree has a more traditional tree-like structure and this
allows some concurrent execution of tasks to occur. One of the main goals of ordering for
parallel solution is to rearrange the system to allow the exploitation of more of the potential
parallelism in the problem.
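The tridiagonal example can be made concrete with a small sketch. It builds the elimination tree of a 7-unknown tridiagonal system under its natural ordering and under an odd-even (nested dissection style) ordering. The ordering, the problem size and all function names are illustrative assumptions rather than choices made in this thesis; the tree construction follows the standard parent-pointer approach without path compression.

```python
def elimination_tree(n, edges):
    """Parent array of the elimination tree of a symmetric sparsity pattern.

    edges holds pairs (u, v) of coupled unknowns numbered 0..n-1 in
    elimination order. No path compression is used, which is fine for small n.
    """
    lower = [[] for _ in range(n)]
    for u, v in edges:
        lower[max(u, v)].append(min(u, v))
    parent = [-1] * n
    for j in range(n):
        for i in lower[j]:
            r = i
            # Climb to the current root of the subtree containing i.
            while parent[r] != -1 and parent[r] != j:
                r = parent[r]
            if parent[r] == -1:
                parent[r] = j
    return parent

def tree_height(parent):
    """Number of tasks on the longest leaf-to-root path."""
    best = 0
    for v in range(len(parent)):
        depth = 1
        while parent[v] != -1:
            v = parent[v]
            depth += 1
        best = max(best, depth)
    return best

def reorder(edges, order):
    """Renumber unknowns so that order[k] is eliminated k-th."""
    pos = {node: k for k, node in enumerate(order)}
    return [(pos[u], pos[v]) for u, v in edges]

n = 7
chain = [(i, i + 1) for i in range(n - 1)]          # tridiagonal coupling
natural = tree_height(elimination_tree(n, chain))
dissected = tree_height(elimination_tree(n, reorder(chain, [0, 2, 4, 6, 1, 5, 3])))
print(natural, dissected)   # prints 7 3
```

Under the natural ordering every task waits for its predecessor, so the tree is a chain of height 7. Eliminating alternate unknowns first gives a tree of height 3 in which the four leaf tasks are mutually independent.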
Another frequently cited goal of ordering for parallel solution is to reduce the height of
the elimination tree. Many authors [67, 69, 54, 74, 57] consider tree height to be critical
in obtaining short execution times. Execution time is proportional to the length of the
critical path through the elimination tree and critical path length is the same as tree height
by definition. Applying orderings which minimize the height of the tree is an attempt to
minimize the execution time of the parallel algorithm although Heath implies that there
is no proof that systems with shorter trees execute quicker than those with long trees.
His view is that those authors who choose to use short trees base their choice on instinct
rather than on theoretical proof. This author believes that in general shorter trees do give
rise to shorter execution times and this view is supported by empirical observations made
throughout the course of this research project.
3.5 Diakoptical Based Solution Methods
3.5.1 The Method of Diakoptics 
The method of diakoptics may be used to divide the network equations into sets of equations 
corresponding to interconnected subnetworks. The techniques of diakoptical analysis [47, 75, 
76, 77, 78] were developed in the 1950's by Gabriel Kron. The word diakoptic is derived from 
the Greek kopto, meaning to break or tear apart [47], and this neatly summarises the whole 
concept of diakoptical analysis. The essence of the technique is to solve a large system by 
tearing it into smaller subsystems which are then solved independently. The solution for the
whole system can be found by combining and modifying the individual subsystem solutions. 
Developed as a method for solving large network problems, diakoptical analysis is an ideal 
approach for the treatment of electrical power systems. In fact the diakoptics method was 
devised by Kron as a solution to a particular power engineering problem [47]. For diakoptical 
solution the power network must be partitioned into subnetworks and the partitioning can 
be based on geographical or political considerations, on ownership of utilities comprising 
the network or on consideration of the complexity of computing a solution. Regardless 
of how the split is obtained, diakoptical analysis gives a mathematical structure which 
is particularly amenable to multiprocessing computer environments using either parallel 
or distributed computing techniques. Combining diakoptical methods with sparse matrix 
techniques and triangular decomposition methods provides a powerful tool for simple, rapid 
solution of large sparse network problems [3, 2, 78]. 
Two approaches exist for partitioning a given network into a set of subnetworks. The 
easiest method to visualise is the branch cutting method. This partitions the network by 
cutting some of the branches which connect the nodes in the network. The branches are 
chosen so that, when cut, they separate the nodes into independent regions, or subnetworks. 
The second partitioning method splits the network into subnetworks by tearing some of 
the network nodes apart. This is known as the node tearing method and the nodes are
chosen such that tearing them apart decomposes the network into distinct subnetworks.
Appendix D provides a more detailed and mathematical treatment of the two methods. 
Partitioning of the network using node tearing has an effect on the system's admittance 
matrix. The process of partitioning transforms the matrix into the Bordered Block Diagonal 
Form (BBDF) shown below. 
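\[
\begin{bmatrix}
Y_{11} & & & & Y_{1c} \\
 & Y_{22} & & & Y_{2c} \\
 & & \ddots & & \vdots \\
 & & & Y_{kk} & Y_{kc} \\
Y_{c1} & Y_{c2} & \cdots & Y_{ck} & Y_{cc}
\end{bmatrix}
\begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_k \\ V_c \end{bmatrix}
=
\begin{bmatrix} I_1 \\ I_2 \\ \vdots \\ I_k \\ I_c \end{bmatrix}
\qquad (3.33)
\]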
The name arises from the fact that the matrix has blocks of non-zero elements along the
leading diagonal and along the lower and right edges. Elsewhere in the matrix the elements
are all zero. If branch cutting is used the Bordered Block Diagonal Form is not produced
directly as extra elements may exist in the matrix between the diagonal blocks and the
borders. Appendix D shows that it is possible to restructure the problem such that the
admittance matrix produced by branch cutting also exhibits BBDF.
The diakoptic method may be married with the triangular decomposition approach [78]
to obtain a solution for network node voltages given the network branch currents. The
BBDF admittance matrix Y is decomposed into a series of triangular factors and forward
and backward substitution with the right hand side yields the solution for the unknown node
voltage vector V. It is observed that the BBDF is maintained in the factored admittance
matrix.
The bordered block diagonal form (3.33) has important consequences for the parallel
solution and it defines the algorithmic structure of conventional parallel solutions for (2.7).
$Y_{11}, Y_{22}, \ldots, Y_{kk}$ are the admittance submatrices corresponding to the k independent
subnetworks created by the removal of the tearing nodes or branches. Similarly $V_k$ and $I_k$
are the node voltages and branch currents for the kth independent subnetwork. As the
subnetworks associated with $Y_{11}, \ldots, Y_{kk}$ are independent it is possible to factorise the entries
of $Y_{ii}$, $Y_{ci}$ and $Y_{ic}$ in parallel, where $i = 1, \ldots, k$. $Y_{cc}$ must be factorised after this is
completed as the presence of $Y_{ci}$ and $Y_{ic}$ results in the triangular decomposition updating the
values of $Y_{cc}$. It is then possible to forward/backward substitute with $I_c$ to yield values
for the cutset voltages $V_c$. Once this has been done, forward/backward substitution can be
performed in parallel for each subnetwork to yield solutions for $V_1, V_2, \ldots, V_k$. Note that
the distributive effect of $Y_{cc}$ through forward and backward substitution prevents the
commencement of parallel substitution for individual subnetworks until the cutset voltages have
been determined.
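The reason the cutset block must wait for every subnetwork can be seen by eliminating the subnetwork voltages from (3.33) by hand. The following is a sketch of that algebra in the notation of (3.33); it assumes each $Y_{ii}$ is non-singular and is offered as an illustration rather than as a statement of how any particular implementation organises the computation.

\[
Y_{ii} V_i + Y_{ic} V_c = I_i
\quad\Rightarrow\quad
V_i = Y_{ii}^{-1} \left( I_i - Y_{ic} V_c \right), \qquad i = 1, \ldots, k
\]
\[
\sum_{i=1}^{k} Y_{ci} V_i + Y_{cc} V_c = I_c
\quad\Rightarrow\quad
\left( Y_{cc} - \sum_{i=1}^{k} Y_{ci} Y_{ii}^{-1} Y_{ic} \right) V_c
= I_c - \sum_{i=1}^{k} Y_{ci} Y_{ii}^{-1} I_i
\]

Each term of the two sums depends on one subnetwork only, so the terms can be formed concurrently, but $V_c$ cannot be computed until all k contributions have been accumulated into the cutset block.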
3.6 The Multiple Factoring Method
The multiple factoring method [46, 29] is aimed at the substitution phase of the solution 
of linear equations and is based on a standard LU decomposition. An improvement in 
the substitution phase is achieved at the expense of factorisation, as extra work has to 
be introduced at the factorisation stage. The efficiency of the method lies in the fact that 
factorisation only needs to be performed infrequently. Multiple factoring is a prime example 
of an inefficient sequential algorithm being used as the basis of an efficient parallel algorithm. 
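Consider the equations $Ax = b$ factored into lower, $L$, and upper, $U$, triangular matrices
such that $LUx = b$. The multiple factoring method requires these matrix factors to be
factored further. The basic scheme, known as diagonal partitioning, is
\[
\begin{bmatrix} I & \\ K & I \end{bmatrix}
\begin{bmatrix} L_{11} & \\ & L_{22} \end{bmatrix}
\begin{bmatrix} U_{11} & \\ & U_{22} \end{bmatrix}
\begin{bmatrix} I & R \\ & I \end{bmatrix}
x = b
\qquad (3.34)
\]
where $L_{ii}$ and $U_{ii}$ are lower and upper triangular matrices respectively, $I$ is the identity
matrix and $K$ and $R$ are rectangular matrices. (3.34) can be easily solved as
\[ \begin{bmatrix} I & \\ K & I \end{bmatrix} w_1 = b \qquad (3.35) \]
\[ \begin{bmatrix} L_{11} & \\ & L_{22} \end{bmatrix} w_2 = w_1 \qquad (3.36) \]
\[ \begin{bmatrix} U_{11} & \\ & U_{22} \end{bmatrix} w_3 = w_2 \qquad (3.37) \]
\[ \begin{bmatrix} I & R \\ & I \end{bmatrix} x = w_3 \qquad (3.38) \]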
Given $b$ the vectors $w_1$, $w_2$, $w_3$ and $x$ may be found successively. In examining the structure
of (3.35) it is observed that the upper part of $w_1$ will be identical to the corresponding part
of $b$. The lower part of $w_1$ may be found by substituting the upper part into successive rows
of the lower part. No precedence relationships exist in solving for the lower part of $w_1$ so
the rows may be processed in parallel and in any order. (3.36) is solved to yield values for
$w_2$ by forward substituting $L_{11}$ and $L_{22}$ with the values obtained for $w_1$. Substitution
with $L_{11}$ can take place immediately as the upper part of $w_1$ is known to be identical
to the corresponding part of $b$. It is not necessary to wait until $w_1$ has been completely
obtained before processing begins with $L_{11}$ and the upper part of $w_2$ may
be computed in parallel with the lower part of $w_1$ if desired. The fact
that no coupling exists between the upper and lower parts of this matrix also allows the
upper and lower parts of $w_2$ to be computed in parallel. (3.37) is similar to (3.36) but
processing cannot commence until $w_2$ is completely determined. Again no coupling exists
between upper and lower parts of the matrix and upper and lower partitions of $w_3$ can be
computed in parallel. Solution of (3.38) is similar to the solution of (3.35) and once again
the upper and lower parts of the vector $x$ may be computed in parallel.
By multiplying together the first two and the last two matrices of (3.34)
\[
\begin{bmatrix} L_{11} & \\ K L_{11} & L_{22} \end{bmatrix}
\begin{bmatrix} U_{11} & U_{11} R \\ & U_{22} \end{bmatrix}
= A
\qquad (3.39)
\]
$L_{11}$, $L_{22}$, $U_{11}$ and $U_{22}$ can be seen to be simple submatrices of the lower and upper
triangular factors of $A$. $K$ and $R$ may be found from the equalities
\[ K L_{11} = L_{21} \qquad (3.40) \]
\[ U_{11} R = U_{12} \qquad (3.41) \]
The generation of all the required factor information is a two step process. Firstly a standard
LU decomposition must be used to determine the lower and upper triangular factors $L$ and
$U$. $L_{11}$, $L_{22}$, $U_{11}$ and $U_{22}$ may then be determined directly and $K$ and $R$ can be found
from (3.40) and (3.41).
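The two step generation of the factors and the four stage solution (3.35) to (3.38) can be traced on a small dense example. The sketch below is a sequential illustration under stated assumptions: a 4 × 4 diagonally dominant matrix with made-up values, a split into two 2 × 2 diagonal blocks, LU decomposition without pivoting, and hypothetical function names throughout. It does not attempt to model the sparse storage or the parallel execution discussed in the text.

```python
def lu(a):
    """Doolittle LU without pivoting: a = L U with unit-diagonal L."""
    n = len(a)
    L = [[float(i == j) for j in range(n)] for i in range(n)]
    U = [[float(v) for v in row] for row in a]
    for c in range(n):
        for r in range(c + 1, n):
            L[r][c] = U[r][c] / U[c][c]
            for k in range(c, n):
                U[r][k] -= L[r][c] * U[c][k]
    return L, U

def fwd(L, b):
    """Solve L y = b for lower triangular L."""
    y = []
    for i, row in enumerate(L):
        y.append((b[i] - sum(row[k] * y[k] for k in range(i))) / row[i])
    return y

def bwd(U, b):
    """Solve U y = b for upper triangular U."""
    n = len(b)
    y = [0.0] * n
    for i in reversed(range(n)):
        y[i] = (b[i] - sum(U[i][k] * y[k] for k in range(i + 1, n))) / U[i][i]
    return y

def sub(m, rows, cols):
    return [[m[i][j] for j in cols] for i in rows]

def matvec(m, x):
    return [sum(a * v for a, v in zip(row, x)) for row in m]

def multiple_factor_solve(a, b, s):
    """Solve a x = b through (3.35)-(3.38), splitting the unknowns at index s."""
    n = len(a)
    top, bot = range(s), range(s, n)
    L, U = lu(a)
    L11, L21, L22 = sub(L, top, top), sub(L, bot, top), sub(L, bot, bot)
    U11, U12, U22 = sub(U, top, top), sub(U, top, bot), sub(U, bot, bot)
    L11t = [list(col) for col in zip(*L11)]
    K = [bwd(L11t, row) for row in L21]                       # K L11 = L21  (3.40)
    R_cols = [bwd(U11, list(col)) for col in zip(*U12)]       # U11 R = U12  (3.41)
    R = [list(row) for row in zip(*R_cols)]
    b1, b2 = b[:s], b[s:]
    w1 = (b1, [v - t for v, t in zip(b2, matvec(K, b1))])     # (3.35)
    w2 = (fwd(L11, w1[0]), fwd(L22, w1[1]))                   # (3.36): halves independent
    w3 = (bwd(U11, w2[0]), bwd(U22, w2[1]))                   # (3.37): halves independent
    x2 = w3[1]                                                # (3.38)
    x1 = [v - t for v, t in zip(w3[0], matvec(R, x2))]
    return x1 + x2

A = [[4.0, 1.0, 0.0, 1.0],
     [1.0, 5.0, 2.0, 0.0],
     [0.0, 2.0, 6.0, 1.0],
     [1.0, 0.0, 1.0, 4.0]]
b = [1.0, 2.0, 3.0, 4.0]
x = multiple_factor_solve(A, b, 2)
print(x)
```

In stages (3.36) and (3.37) the two halves never exchange data, which is the source of the parallelism. The coupling between the halves is confined to the products with `K` and `R`, and the rows of those products are themselves independent.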
The power of the multiple factoring method becomes apparent when the L and U factor 
matrices are subdivided many times. Three possible schemes exist for subdividing the 
matrix factors of A . 
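Scheme 1: Forward factoring
Scheme 1 subdivides the factor matrices according to
\[
L =
\begin{bmatrix} I & & \\ K_{21} & I & \\ K_{31} & & I \end{bmatrix}
\begin{bmatrix} I & & \\ & I & \\ & K_{32} & I \end{bmatrix}
\begin{bmatrix} L_{11} & & \\ & L_{22} & \\ & & L_{33} \end{bmatrix}
\qquad (3.42)
\]
\[
U =
\begin{bmatrix} U_{11} & & \\ & U_{22} & \\ & & U_{33} \end{bmatrix}
\begin{bmatrix} I & & R_{13} \\ & I & R_{23} \\ & & I \end{bmatrix}
\begin{bmatrix} I & R_{12} & \\ & I & \\ & & I \end{bmatrix}
\qquad (3.43)
\]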
A consequence of this scheme is that extra fill-ins, over and above those resulting from
LU decomposition, are introduced into the matrices by the subdivision process. With the 
factors divided in this fashion it can be seen that results from each step must be propagated 
to the next step of the solution process. For example, the computations involving K21 
and K31 can commence immediately with these computations taking place concurrently. 
The computations involving K32 cannot proceed until the computations involving K31 are 
complete as substitution with K32 depends on values which are altered in substituting with 
K31. 
Scheme 2: Backward factoring 
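The second strategy for division partitions the factor matrices according to
\[
L =
\begin{bmatrix} I & & \\ & I & \\ K_{31} & K_{32} & I \end{bmatrix}
\begin{bmatrix} I & & \\ K_{21} & I & \\ & & I \end{bmatrix}
\begin{bmatrix} L_{11} & & \\ & L_{22} & \\ & & L_{33} \end{bmatrix}
\qquad (3.44)
\]
\[
U =
\begin{bmatrix} U_{11} & & \\ & U_{22} & \\ & & U_{33} \end{bmatrix}
\begin{bmatrix} I & R_{12} & R_{13} \\ & I & \\ & & I \end{bmatrix}
\begin{bmatrix} I & & \\ & I & R_{23} \\ & & I \end{bmatrix}
\qquad (3.45)
\]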
Under this scheme there are no precedence relationships between the K (or R) blocks and
this means that there is no propagation of results between the steps of the solution. Hence
the potential for parallelism is great. However Van Ness [46] notes that this partitioning
strategy introduces much more fill-in than the first strategy. As a result he suggests that
the first strategy should be applied to the partitioning of the majority of the matrix, which
is sparse. The lower right hand corner of the matrix is usually fairly densely populated and
here fill-in is of little consequence. Van Ness suggests that this region should be treated using
the second partitioning scheme. The lowest submatrices in the L and U chains are
both fully populated. These submatrices have to be processed consecutively (the sequential
part of the method) and Van Ness advocates representing this fully populated section using
its full matrix inverse.
Scheme 3: Row oriented factoring
Both of Van Ness's partitioning schemes are based on the use of diagonal partitioning. 
Berry et al. [29] have proposed a third partitioning scheme which is row oriented. Under 
this scheme 
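\[
U =
\begin{bmatrix} U_{11} & R_{12} & R_{13} \\ & I & \\ & & I \end{bmatrix}
\begin{bmatrix} I & & \\ & U_{22} & R_{23} \\ & & I \end{bmatrix}
\begin{bmatrix} I & & \\ & I & \\ & & U_{33} \end{bmatrix}
\qquad (3.46)
\]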
This is similar to the column oriented partitioning of L factors under Scheme 2. Solution 
of the equations requires scheme 2 partitioning of the L matrix with scheme 3 partitioning 
used for the U matrix. Berry et al. found that although this approach introduces a large
amount of fill-in, the solution is achieved faster than when using either scheme 1 or scheme
2 factors alone.
Both Van Ness [46] and Berry [29] cite results in their papers but neither really sheds any
light on the speed-up performance of the multiple factoring method. Berry gives absolute
execution times for the solution of a 60 bus test system but gives no reference sequential
execution time from which speed-up may be calculated. Van Ness provides the theoretical
minimum computation time and an 'actual' computation time obtained from a simulation
of the method. As with most parallel simulations no attempt is made to account for the
effects of interprocessor communication or task switching overheads. The absence of a
sequential reference time makes it impossible to calculate the speed-up. When a large
number (50+) of processors are used Van Ness finds the 'actual' computation time to be
similar to the theoretical minimum computation time, indicating that speed-up with a large
number of processors is close to the theoretical predicted maximum speed-up, whatever that
may be. When a more modest number of processors (≈ 20) is used the 'actual' computation
time is three times greater than the theoretical minimum time, indicating that speed-up
can be no more than 1/3 of the theoretical maximum. Evidently a large number of processors
are needed to give maximum speed-up, making the solution both expensive and inefficient.
Reducing the number of processors used to a more economic level has the effect of drastically
reducing the performance.
3.7 Parallel LU Decomposition Techniques
A survey of the available literature reveals a number of publications on the parallel
solution of power system equations by LU based methods. Some of these publications detail
interesting and innovative work but most are nothing more than subtle modifications to
existing methods. This section presents a survey of some of the parallel LU decomposition
methods currently available. All parallel techniques developed to date have a structure
similar to that of sequential LU decomposition algorithms (i.e. factorisation of the
admittance matrix followed by forward and backward substitution). The factorisation phase sees
each subnetwork being factorised in parallel on an individual processor. Once all
subnetworks have been factorised any information relating to the cutset block must be passed
to the processor responsible for that block. The cutset block is then factorised and
forward and backward substitution of this block with the current vector is performed. This
generates information relating to all the other subnetworks and hence the result of this
forward/backward substitution must be sent to all other processors before substitution can be
completed. The parallel algorithm has the structure shown in Figure 3.2. The cutset block
must be processed by a central coordinating processor which receives information from and
passes information back to all the other processors, controlling their operation in a
Master - Slave fashion.
[Figure 3.2: Structure of conventional parallel solution algorithm. Labelled stages:
Factorise (four parallel tasks); Factorise Cutset Block; Forward & Backward Substitute;
Forward Substitute (four parallel tasks); Backward Substitute (four parallel tasks).]
A simple example of a 5 x 5 matrix with Bordered Block Diagonal Form is shown in
Figure 3.3. The matrix consists of two 2 x 2 blocks arranged on the diagonal and a single
element cutset block. Figure 3.3 also shows all the operations that must be
performed to factorise this matrix. The independence of operations on each block is clearly
illustrated in this example. Factorising rows 1 and 2 causes the cutset element in row 5 to
be modified but no modifications to rows 3 or 4 take place. Similarly, factorising rows 3 and
4 results in the cutset element of row 5 being modified but no modifications to rows 1 or
2 take place. The operations on the blocks formed by rows 1 & 2 and rows 3 & 4 can be
performed concurrently, with the cutset block being processed once the processing of the
other two blocks is complete and all modifications have been made to the cutset block.
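The independence described above can be checked numerically. The following is a minimal sketch rather than the thesis implementation: the matrix values are invented for illustration, indices are 0-based (rows 0-1 and 2-3 are the two blocks, row 4 is the cutset), and the helper name `eliminate` is hypothetical. It eliminates each block separately on a copy of the matrix and confirms that the other block is untouched and that the two cutset updates simply add.

```python
from fractions import Fraction

def eliminate(matrix, pivots):
    """Apply right-looking elimination for the given pivot indices to a copy of `matrix`."""
    a = [row[:] for row in matrix]
    n = len(a)
    for k in pivots:
        for i in range(k + 1, n):
            if a[i][k] == 0:
                continue  # structural zero: row i is not coupled to pivot k
            factor = a[i][k] / a[k][k]
            a[i][k] = factor  # keep the multiplier (the L entry) in place
            for j in range(k + 1, n):
                a[i][j] -= factor * a[k][j]
    return a

# Illustrative BBDF matrix (assumed values, not taken from the thesis).
A = [[Fraction(v) for v in row] for row in [
    [4, 1, 0, 0, 1],
    [1, 4, 0, 0, 1],
    [0, 0, 4, 1, 1],
    [0, 0, 1, 4, 1],
    [1, 1, 1, 1, 4],
]]

block_1 = eliminate(A, [0, 1])   # work of the first subnetwork processor
block_2 = eliminate(A, [2, 3])   # work of the second subnetwork processor

# Eliminating block 1 leaves the rows of block 2 exactly as they were.
rows_untouched = block_1[2] == A[2] and block_1[3] == A[3]

# Each block contributes an independent update to the cutset element.
delta_1 = block_1[4][4] - A[4][4]
delta_2 = block_2[4][4] - A[4][4]
sequential = eliminate(A, [0, 1, 2, 3])
cutset_matches = sequential[4][4] == A[4][4] + delta_1 + delta_2
print(rows_untouched, cutset_matches, sequential[4][4])
```

Exact rational arithmetic is used so that the comparison with the sequential elimination is not blurred by rounding.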
Several researchers [79, 57, 80] have noted that when more than about five processors are 
used the benefits of partitioning the network and solving it in parallel tend to saturate - i.e. 
introducing more processors does not decrease the overall solution time. This saturation is 
attributed to the significant size of the cutset block, which must be processed sequentially, 
and the amount of interprocessor communication resulting from the Master - Slave approach 
of the conventional parallel solution. Despite the significant research effort devoted to this 
problem speed-ups for both the factorisation and substitution steps of the LU decomposition 
remain disappointing. The results of Lau's approach [32] to parallelizing the factorisation 
part of the method reveal a maximum speed-up of two even when as many as 32 processors 
are used. Techniques for parallelizing the substitution phase have fared little better. Abur's 
technique [26] requires 57 processors to achieve a speed-up of 3.8 for the substitution stage 
of the solution of the IEEE 118 node test system and as many as 281 processors are required 
to achieve a speed-up of 5 from a system containing only 590 nodes. Whilst this method 
gives a better speed-up than many previous methods it is extremely inefficient due to the 
large number of processors required. 
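The inefficiency can be made concrete with a short calculation. The sketch below assumes the usual definition of parallel efficiency as speed-up divided by the number of processors (an assumption here, since this passage does not restate the definition) and uses only the figures quoted above; the function name is hypothetical.

```python
def efficiency(speed_up, processors):
    """Parallel efficiency, assumed here to be speed-up divided by processor count."""
    return speed_up / processors

# (label, reported speed-up, processors used), as quoted in the text above.
reported = [
    ("Lau, factorisation", 2.0, 32),
    ("Abur, substitution, IEEE 118 node system", 3.8, 57),
    ("Abur, substitution, 590 node system", 5.0, 281),
]

for label, speed_up, processors in reported:
    print(f"{label}: efficiency {efficiency(speed_up, processors):.3f}")
```

On these figures every method runs at well under 10% efficiency, which is the sense in which the large processor counts are wasteful.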
[Figure 3.3: Simple BBDF factorisation example, showing independence between operations.
For each of rows 1 to 5 of the matrix the figure lists a "Calculate" step, followed for
rows 1 to 4 by an "Update" step.]
The fact that most of the existing LU factorisation techniques have very similar
algorithms is not surprising. All of the techniques rely on the use of BBDF matrix structure for
exploiting parallelism. The structure of the algorithm is a function of the BBDF structure,
not a function of the particular LU-based method chosen. The bottleneck introduced by
the cutset solution is also a function of the BBDF matrix topology. This implies that any
improvement over existing parallel methods must begin by improving the structure of the
matrix to allow greater exploitation of parallelism. The actual LU or LDU-based
factorisation chosen will have only a minor effect on performance. The main differences between
existing techniques are the way in which they store data and the architecture of the target
parallel machine.
3.7.1 Chan's Method 
One recently developed technique is that of Berry, Chan and Dunn [36]. Their technique
is designed for MIMD machines but they do not specify in their paper whether shared or
distributed memory machines are the target architecture. A personal communication with
the authors [81] has subsequently revealed that the target architecture was a distributed
shared memory machine. Communication of all data was performed via the global shared
memory and the message passing facilities of the architecture were only used for
synchronizing the operation of the program tasks. The need for a custom parallel machine renders
this approach unsuitable for use with off-the-shelf distributed memory machines but it is
still a useful and interesting method. BBDF is the prominent feature of the approach but the
authors successfully divorced the BBDF structure from the conventional algorithm structure.
Algorithms which use BBDF normally assign the cutset block to a central coordinating
processor. All other processors in the machine must communicate their updates to the
central processor which then becomes the bottleneck in the solution process. Chan dispenses
with the need for the central coordinating processor and the Master-Slave architecture by
assigning a copy of the cutset block to each processor in the network. Instead of
communicating updates to a central processor each processor broadcasts its updates to every
other processor in the network. All the other processors then update their local copy of the
cutset matrix data. When all the processors have performed all the updates supplied by all
the processors they each solve their own local version of the cutset block in parallel before
solving their respective subnetworks. There are two interesting points to this approach:
1. The elimination of the need for a central coordinating processor means that this
method uses one processor less than a more traditional BBDF approach and should
be more efficient.
2. The assignment of a copy of the cutset block to each processor removes a
communication step from the algorithm. It is no longer necessary to broadcast the results
of the cutset solution throughout the network as each processor holds its own local
solution. Provided that each processor maintains a coherent copy of the cutset matrix
data prior to its solution, all the processors should obtain the same answer from the
solution of the cutset block. The extra communication step is eliminated at the
expense of implementing a method for ensuring coherency between local copies of data
structures held by all processors.
That there are differences between this method and the traditional methods is quite
obvious from its algorithm structure, shown in Figure 3.4. The bottleneck of communications
to the central processor is no longer present but has been replaced by a much larger
collection of intertask communications. The explicit communication following cutset solution has
been eliminated. Chan actually found that the duplicated computation of the cutset by all
processors is the dominant feature in the total amount of computation. As the inverse of
the cutset block is not significantly denser than the cutset block itself, Chan found it more
efficient to solve the cutset block by calculating its full inverse using a single processor.
This is carried out after the parallel factorization of subnetwork blocks is complete and
an extra communication step is required to pass the result of inverting the cutset to the
subnetwork processors. The algorithm structure once again becomes the same as that of
the conventional structure (Figure 3.2).
Whilst the method is interesting, its speed-up performance is little better than that of
other approaches. Speed-ups are only marginally improved, if at all, but the benefit of the
approach is its increase in efficiency. Similar speed-ups to other methods can be achieved
using fewer processors.
3.7.2 The W-matrix Method 
The W-matrix method was first proposed by Alvarado [37] and adapted by Padhila and
Morelato [44] in 1992. The approach is based on a sequential solution technique known
as matrix inverse factors, or the W-matrix method. The authors note that the W-matrix
method is not the most efficient solution algorithm for sequential machines but it holds
promise for parallel solutions as the technique gives a degree of independence between
elementary operations.
[Figure 3.4: Algorithm structure of Chan's method, using duplicate cutset computation.
Labelled stages, each shown as four parallel tasks: Factorise; Solve Cutset; Substitute.]
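The W-matrix method is based on LDU factorisation and it decomposes the coefficient
matrix, $\mathbf{A}$, into three matrix factors such that
$$\mathbf{A} = \mathbf{L}\mathbf{D}\mathbf{U} \tag{3.47}$$
The solution to the set of equations is given by
$$\mathbf{x} = \mathbf{U}^{-1}\mathbf{D}^{-1}\mathbf{L}^{-1}\mathbf{b} \tag{3.48}$$
A matrix, $\mathbf{W}$, is defined as
$$\mathbf{W} = \mathbf{L}^{-1} \tag{3.49}$$
and this can be expanded into n separate factors, $\mathbf{W}_1, \mathbf{W}_2, \ldots, \mathbf{W}_n$, where
$\mathbf{W}_i = \mathbf{L}_i^{-1}$ and $\mathbf{L}_i$ is a matrix which differs from the unit matrix in the $i^{th}$ column,
which is the same as the $i^{th}$ column of $\mathbf{L}$. Hence, for a symmetric coefficient matrix
(for which $\mathbf{U} = \mathbf{L}^T$),
$$\mathbf{x} = \mathbf{W}^T\mathbf{D}^{-1}\mathbf{W}\mathbf{b} = \mathbf{W}_1^T\mathbf{W}_2^T \cdots \mathbf{W}_n^T \, \mathbf{D}^{-1} \, \mathbf{W}_n \cdots \mathbf{W}_2\mathbf{W}_1 \mathbf{b} \tag{3.50}$$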
A parallel formulation of the method is achieved by partitioning the W-matrix such that
multiplications between columns of W and elements of the right hand side vector become
independent [44]. The partition is accomplished by horizontal division of the elimination
tree associated with the system and if p partitions are produced then these must be processed
consecutively. The solution of (3.50) can be seen as an ordered sequence of updating tasks
operating on the components of the right hand side vector, b. Each update task is formed
from multiplication and addition operations and is of the form
$$b_j = b_j + w_{k,j} b_k \tag{3.51}$$
The W-matrix method allows all multiplications between W matrix elements and elements
of b to be performed concurrently within each partition. As each multiplication requires
read-only access to one row of W ($\mathbf{W}_k$) and read-only access to the vector element $b_k$,
each row-by-vector multiplication is independent and can be performed concurrently without
introducing any problems of data coherency. Unfortunately the additions cannot all be
performed concurrently as each addition requires the latest value of $b_j$ and this may still
be undergoing processing when it is required for addition. Hence some of the additions (not
previously known [44]) must be performed after the parallel multiplications are complete.
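A small numerical sketch may help to fix the idea of inverse factors. It is not Padhila and Morelato's partitioned algorithm: it treats the whole of W as a single partition, assumes a symmetric matrix so that an LDL^T factorisation can stand in for LDU, and uses invented matrix values, 0-based indices and hypothetical helper names. The matrix-vector product is deliberately split into a multiplication phase, whose products are mutually independent, and a later addition phase.

```python
from fractions import Fraction

def identity(n):
    return [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]

def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
            for i in range(len(x))]

def ldl(a):
    """LDL^T factors of a symmetric matrix (no pivoting): unit lower L and diagonal d."""
    n = len(a)
    L, d = identity(n), [Fraction(0)] * n
    for j in range(n):
        d[j] = a[j][j] - sum(L[j][k] ** 2 * d[k] for k in range(j))
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] * d[k] for k in range(j))) / d[j]
    return L, d

def inverse_factor(L, i):
    """W_i: the unit matrix with column i of L negated below the diagonal."""
    w = identity(len(L))
    for j in range(i + 1, len(L)):
        w[j][i] = -L[j][i]
    return w

def multiply_then_add(w, b):
    """Matrix-vector product split into independent multiplications, then additions."""
    products = [[w[j][k] * b[k] for k in range(len(b))] for j in range(len(w))]
    return [sum(row) for row in products]

# Illustrative symmetric system (assumed values).
A = [[Fraction(v) for v in row] for row in
     [[4, 1, 0, 1], [1, 4, 1, 0], [0, 1, 4, 1], [1, 0, 1, 4]]]
b = [Fraction(v) for v in [1, 2, 3, 4]]

L, d = ldl(A)
W = identity(len(A))
for i in range(len(A)):
    W = matmul(inverse_factor(L, i), W)  # each new factor is applied after the earlier ones

y = multiply_then_add(W, b)                      # W b
z = [y_i / d_i for y_i, d_i in zip(y, d)]        # D^{-1} W b
W_T = [list(column) for column in zip(*W)]
x = multiply_then_add(W_T, z)                    # W^T D^{-1} W b
residual_is_zero = multiply_then_add(A, x) == b
w_is_inverse_of_l = matmul(W, L) == identity(len(A))
print(residual_is_zero, w_is_inverse_of_l)
```

In a real sparse implementation W would be stored sparsely and split into several partitions processed one after another; the dense single-partition form here only illustrates why the multiplications need no ordering among themselves.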
The approach described by Padhila and Morelato is aimed at improving the parallel 
substitution operations required to solve for the unknown vector once the matrix has been 
factorised. The disadvantages of the method are 
• An extra stage is required in the solution process to calculate the W matrices. This
would introduce a large overhead into a one-off solution but in applications such as
power system simulation, where repeated solution with multiple right hand sides is
required, the extra overhead introduced is not all that large.
• Although the method can be used on a message passing machine the best performance 
is obtained when the right hand side vector is stored in common shared memory. 
The method is therefore best suited to distributed shared memory or shared memory 
machines. 
The results quoted for the substitution phase speed-up are better than those for many 
other methods and Padhila's results are better than those obtained by Chan. For similar 
size systems Chan achieves a speed-up of 3.36 using 8 processors whereas Padhila observes 
a speed-up of 5.14 with the same number of processors. In a recent paper [33] Lin and 
Van Ness show that the W-matrix method is mathematically equivalent to the multiple 
factoring method and give performance results for a shared memory implementation of the 
multiple factoring method. These results are very similar to Padhila's results for the W-
matrix method, highlighting the equivalence between the two approaches. Unfortunately 
the results for a distributed memory implementation of the multiple factoring method are 
not nearly as good. The speed-ups produced are less than half of those produced by the 
shared memory implementation. 
3.8 Cholesky Factorisation Techniques 
Cholesky factorisation is a commonly used method for solving sparse symmetric positive
definite linear equations. Power system linear equations are not positive definite and Cholesky
methods may not be used to solve them but it is worth considering these methods for the
sake of completeness. Certain forms of Cholesky factorisation suggest ways in which parallel
LU decomposition based solutions may be improved.
Cholesky factorisation was originally developed as a sequential method of solving
equations. With the advent of high performance computers much research effort has been applied
to the development of parallel Cholesky techniques, particularly within the applied
mathematics community. A number of issues have been confronted in the development of these
parallel techniques which are also encountered with other parallel factorisation methods.
Given a system of linear equations
$$\mathbf{A}\mathbf{x} = \mathbf{b} \tag{3.52}$$
where A is the coefficient matrix, b is the known vector and x is the unknown vector, a
solution for x may be obtained by computing the Cholesky factorisation
$$\mathbf{A} = \mathbf{L}\mathbf{L}^T \tag{3.53}$$
L is the Cholesky factor and it is a lower triangular matrix which has positive entries on
its leading diagonal. The unknown vector x can be easily computed from
$$\mathbf{L}\mathbf{y} = \mathbf{b}, \qquad \mathbf{L}^T\mathbf{x} = \mathbf{y} \tag{3.54}$$
Two substitutions are required to calculate first y and then x. As in LU decomposition, no
pivoting is required to maintain numerical stability and ordering techniques may be applied
to enhance the preservation of sparsity.
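The factorise-then-substitute sequence can be sketched in a few lines. This is a dense illustration under stated assumptions, not a sparse solver: the 3 x 3 matrix and right hand side are invented values, and the function names are hypothetical.

```python
import math

def cholesky(a):
    """Dense Cholesky factor L of a symmetric positive definite matrix, with A = L L^T."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        pivot = a[j][j] - sum(L[j][k] ** 2 for k in range(j))
        L[j][j] = math.sqrt(pivot)  # math.sqrt raises ValueError if the pivot is negative
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
    return L

def forward_substitute(L, b):
    """Solve L y = b for y, working from the first row downwards."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def backward_substitute(L, y):
    """Solve L^T x = y for x, reading L by columns so no transpose is stored."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# Illustrative system (assumed values); b was chosen so that x = [1, 2, 3].
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]
b = [14.0, 21.0, 26.0]

L = cholesky(A)
x = backward_substitute(L, forward_substitute(L, b))
print(L, x)
```

Once `L` has been computed, further right hand sides only require the two substitution calls.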
Cholesky factorisation is a variant of Gaussian elimination, as is LU decomposition, and
is therefore based around the equation
$$\frac{(a_{ik})(a_{kj})}{a_{kk}} = \frac{a_{ik}}{\sqrt{a_{kk}}} \, \frac{a_{kj}}{\sqrt{a_{kk}}} \tag{3.55}$$
The difference between LU decomposition and Cholesky factorisation is that element values
in Cholesky factorisation are normalized by the square root of the pivot. The matrix L
must be real but if A is not symmetric positive definite L will not be real. Hence the
coefficient matrix for Cholesky factorisation must be symmetric positive definite. Jennings
[62] notes that it is possible to apply Cholesky factorisation to a matrix which is symmetric
but not positive definite but the procedure is made significantly more complicated by the
introduction of imaginary elements in L.
Cholesky factorisation is similar to both LU and LDU factorisation. However Cholesky 
factorisation has the disadvantage of requiring the calculation of n square roots. As square 
root calculations are usually the slowest arithmetic operations on a computer Cholesky 
factorisation may be slower than LU factorisation. 
Note that three subscript indices i, j, k exist in equation (3.55) and in a computer 
implementation three nested loops are used to increment these indices to scan through the 
coefficient matrix. Three different Cholesky factorisation algorithms will result depending
upon which of these indices is controlled by the outer loop. 
1. Row Cholesky algorithm - i is controlled by the outer loop and successive rows of the 
matrix L are computed on each iteration. The inner loops solve a triangular system 
in terms of previously computed rows. 
2. Column Cholesky algorithm - j is controlled by the outer loop and successive columns 
of L are computed. The inner loops perform matrix-vector multiplication of previously 
computed columns on the current column. 
3. Submatrix Cholesky algorithm - k is controlled by the outer loop and successive rows 
of the factor matrix are computed on each iteration. The inner loops update the 
submatrix to the right of the current column. 
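The three loop orderings listed above can be written out for a dense matrix as follows. This is a minimal sketch for illustration: real sparse codes operate on compressed storage, the function names are hypothetical, the matrix values are invented, and indices are 0-based. All three functions compute the same factor and differ only in which index drives the outer loop.

```python
import math

def row_cholesky(a):
    """Outer loop over i: row i of L comes from a triangular solve against earlier rows."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):
            L[i][j] = (a[i][j] - sum(L[i][k] * L[j][k] for k in range(j))) / L[j][j]
        L[i][i] = math.sqrt(a[i][i] - sum(L[i][k] ** 2 for k in range(i)))
    return L

def column_cholesky(a):
    """Outer loop over j: earlier columns are applied to column j, which is then scaled."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column = [a[i][j] - sum(L[i][k] * L[j][k] for k in range(j)) for i in range(j, n)]
        L[j][j] = math.sqrt(column[0])
        for i in range(j + 1, n):
            L[i][j] = column[i - j] / L[j][j]
    return L

def submatrix_cholesky(a):
    """Outer loop over k: pivot k is processed, then the remaining submatrix is updated."""
    n = len(a)
    work = [row[:] for row in a]  # partially reduced copy of the lower triangle
    L = [[0.0] * n for _ in range(n)]
    for k in range(n):
        L[k][k] = math.sqrt(work[k][k])
        for i in range(k + 1, n):
            L[i][k] = work[i][k] / L[k][k]
        for j in range(k + 1, n):
            for i in range(j, n):
                work[i][j] -= L[i][k] * L[j][k]
    return L

# Illustrative symmetric positive definite matrix (assumed values).
A = [[4.0, 2.0, 2.0],
     [2.0, 5.0, 3.0],
     [2.0, 3.0, 6.0]]

factors = [row_cholesky(A), column_cholesky(A), submatrix_cholesky(A)]
print(factors[0])
```

The first two functions only read entries that have already been finalised, whereas the third pushes updates forward into entries that have not yet been reached.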
[Figure 3.5: The three flavours of Cholesky factorisation. Panels: Row Cholesky, Column
Cholesky, Submatrix Cholesky. Legend: "Used for modifications", "Modified".]
The row and column Cholesky algorithms are leftward looking algorithms as all the infor-
mation required to modify the current row/column can be found to its left. Submatrix 
Cholesky factorisation is a rightward looking algorithm as all the information in the current 
column is used to modify the submatrix to the right of it. LU decomposition is also a 
rightward looking method and similar in its operation to submatrix Cholesky. The three 
flavours of sequential Cholesky algorithm, illustrated in Figure 3.5, develop directly into 
two parallel Cholesky algorithms. Row/column Cholesky factorisation gives rise to the par-
allel method known as the 'fan-in' algorithm whilst submatrix Cholesky gives rise to the 
'fan-out' algorithm. 
3.8.1 The Parallel Fan-In Algorithm 
The parallel fan-in algorithm, originally proposed by Ashcraft, Eisenstat and Liu [82], is 
based on the sequential column Cholesky method. Columns of the coefficient matrix are 
assigned to processors and each processor performs a local processing and data reduction 
phase before participating in a global data reduction phase. The algorithm is demand driven 
and it requires aggregated update columns to be passed between processors. If the results 
of processing on a given processor relate to columns on other processors then that processor 
sends the aggregated update columns to the appropriate recipient, which then incorporates 
them into its calculations. The fan-in algorithm has a very regular communication struc-
ture and a low volume of communication traffic, making it more efficient than the fan-out 
Cholesky algorithm described in the following section. 
3.8.2 The Parallel Fan-Out Algorithm
The fan-out algorithm, which is based on the sequential submatrix Cholesky algorithm,
assigns columns to processors. Unlike the fan-in algorithm, the fan-out algorithm is data
driven and the data which pass between processors are matrix columns. Each processor
continually checks for incoming columns, which may be used to update the local matrix
data, and when the processor has completed the processing of its own columns it sends the
results to all other processors which may require them.
The fan-out algorithm is not very efficient and this is a function of its poor
communication performance. The algorithm gives rise to a large number of long messages and,
although aggregation techniques have been employed to improve the performance, it is still
less efficient than the fan-in algorithm or multifrontal methods. Complicated message
structures are required and these must be assembled before transmission and unpacked upon
receipt. This, and the searching that is required to allow updating of the matrix, results in
low serial efficiency, $E_s$, where
$$E_s = \frac{t_s}{t_{1p}} \tag{3.56}$$
and $t_s$ is the execution time of the best sequential solution and $t_{1p}$ is the execution
time of the parallel algorithm executed on a single processor.
3.8.3 Frontal Methods
Frontal methods provide another approach to the parallel solution of linear equations. These
methods are sophisticated variations of submatrix Cholesky techniques and are significantly
more complicated than any of the basic Cholesky algorithms. They were originally
developed [67] to make more efficient use of auxiliary storage (i.e. disk storage) in the days
when memory was expensive and limited in size. The essence of the method is that a small
'window' or front of active computation moves through the coefficient matrix and a full
matrix representation is used to store the frontal submatrix in memory. The use of full
matrix representation makes solution more efficient on scalar machines and allows vector
architectures to be used to speed up the computations on the frontal submatrix. Parallel
multifrontal methods have been developed but they appear to be no more efficient than
the fan-in algorithm. The success of frontal methods depends on keeping the active window
as small as possible so as to keep to a minimum the amount of dense storage required. The
main disadvantage of these methods is the complexity of the algorithms. Continual
conversion between sparse and full matrix representations is required and sophisticated data
management methods are needed. However, Chapter 6 will show that the adoption of full
matrix representation for parts of an LU solution can lead to an increase in the efficiency
of parallel LU factorisation.
3.9 Summary 
This chapter has examined existing parallel methods for solving linear equations. Two 
classes of solution have been identified. Jacobi iteration, Gauss-Seidel iteration and the 
Conjugate Gradient method have been introduced as examples of iterative solutions. Meth-
ods in this class of solution make an initial guess at a solution and refine it through a 
procedure of iterative correction until an acceptably accurate result is achieved. Iterative 
techniques suffer from two main drawbacks. Firstly, it is not possible to achieve multiple
solutions of the same system of equations with different right hand side vectors quite as
readily as it is using direct methods. Secondly, iterative techniques may require many
iterations to reach convergence if the equations are ill-conditioned or if a high precision answer
is required. Direct methods, of which LU and Cholesky factorisation are examples, provide 
an exact solution for the equations but may require more total computation than iterative 
methods. Direct methods do not suffer from convergence problems and have the advantage 
that separate factorisation and substitution operations are available. Factorisation of the 
matrix only needs to be performed once and then the substitution operation may be used 
to solve for multiple right hand sides. In applications such as power system simulation, 
where most of the processing involves solving the same set of equations with different right 
hand sides, direct solutions are significantly more efficient than iterative solutions. Iterative 
solutions are used in power systems computations but direct methods hold the greatest 
promise for high performance solutions. 
Existing direct solution methods have been examined to determine the relative
advantages and disadvantages of the various approaches. Cholesky techniques are not suitable
for power system applications as they can only be applied to symmetric positive definite
systems and power system equations do not fall into this category. This leaves parallel LU
and LDU-based methods as the only really suitable solution methods for power system
applications. Numerous different techniques have been attempted but the speed-up resulting
from them is unimpressive and quickly saturates as the number of processors increases. The
inefficiencies are attributed to a sequential bottleneck present in all these methods. The
algorithms for each of the methods are similar and arise as a direct result of the use of the
BBDF structure as a device for exploiting parallelism. It is the BBDF structure and not
the particular flavour of LU-based solution which gives rise to the sequential bottleneck and
performance limitations.
Improving upon existing LU methods requires the BBDF structure to be improved. The 
sequential bottleneck must be eliminated and more parallelism exploited. The review of the 
various existing methods presented in this chapter suggests ways in which improvements 
can be made to this structure. Chapter 5 takes these suggestions and uses them to develop 
an efficient new matrix structure suitable for parallel LU-based solutions. Before examining 
this structure it is necessary to consider the elimination tree again and what implications
it has for network partitioning and balancing the computational load. 
Chapter 4 
Elimination Trees, Network 
Partitioning and Load Balancing 
4.1 Introduction 
On a sequential computer the processor executes a program by operating on the required
data using the sequence of operations prescribed by the program until all operations are
complete. Throughout the lifetime of the program the processor is always busy and spends
most of its time performing useful¹ work. In a parallel computer each task executes on its
own in isolation from the other tasks. When cooperation with other tasks is required some
time may be spent waiting for the other tasks to reach their synchronisation points and
valuable processing time is wasted in these idle wait states. In order to achieve maximum
efficiency it is desirable to have all the processors performing useful work all of the time.
Maximum speed-up can only be achieved if all of the processors are constantly kept busy.
Ensuring that processors are always busy requires the partitioning of the total workload
into equally sized tasks which are then assigned to each available processor. The process of
spreading the workload over the processors is known as load balancing and a balanced load
occurs when equal portions of the workload are assigned to each processor.
¹ 'Useful' in this context means work that is of value to the user and is generally taken to
mean the execution of the user's program. Non-useful work would include such things as
servicing operating system interrupts, hardware interrupts etc.
Consider the computer solution of the linear equations associated with a large network
problem. The workload in this case is the solution of the linear equations and the division
of the workload requires the solution of the equations to be divided equally across the
processors. Splitting the equations up is equivalent to tearing the network into subnetworks
using Kron's method of diakoptics, thereby reducing the load balancing problem to one of
network partitioning. A balanced loading can be seen as the partitioning of a network into
a number of subnetworks which require equal amounts of processing. This partitioning is
seldom easy and ideal load balancing is rarely achieved in all but the most trivial of cases.
Within the general literature of parallel computing much emphasis is placed on the 
importance of achieving an ideal balanced load. Strategies are presented for achieving this 
balance but most authors [19, 25] observe that if the workload is known a priori then near-
optimal load balance can be achieved through a static distribution of workload at compile 
time. A knowledge of the true workload requires a detailed knowledge of the program, the 
operating system and the hardware on which it is executed. Many programmers resort 
to using approximate load balancing techniques. One technique often used when solving 
network equations is to partition the network into subnetworks such that each subnetwork 
contains an equal number of nodes. This method does not guarantee that each subnetwork 
requires an equal amount of processing and this chapter describes the development of a 
better technique which is based upon an analysis of the computational complexity of the 
solution and the use of the elimination tree. 
4.2 Balancing the Computational Load 
In executing any program on a multiprocessor computer the aim is to achieve maximum
speed-up using the available processors. It was stated previously that this is only possible
if all of the available processors are constantly busy. This is achieved through dividing the
processing load into equal portions which are assigned to individual processors (balanced
loading). Careful consideration must be given to the decomposition of the algorithm into
logical tasks and the partitioning of the data into distinct subsets. These two issues are
closely interrelated. Of particular importance to the efficiency of a parallel program is task
synchronisation and the use of interprocessor communication. A well partitioned problem
will require only a minimal amount of interprocessor communication and good speed-ups
can be expected. A poorly partitioned problem is characterized by large amounts of
interprocessor communication and many process synchronisation requests. Interprocessor
communication incurs overheads and if the amount of interprocessor communication is large
then the penalty incurred can seriously limit the speed-up obtained. Poorly partitioned
programs can be less efficient than sequential programs designed to solve the same problem
simply because of the penalty incurred by interprocessor communication. Few problems are
easy to partition well and many problems fall into the 'hard to partition' category. Badly
partitioned programs are more often a function of the problem and its inherent lack of
parallelism than of the designer.
    TASK A                          TASK B
    begin                           begin
      calculate F(Da)                 calculate F(Db)
      send results to task B          get results from task A
      get results from task B         send results to task A
    end                             end
Figure 4.1: Simple load balancing example
As an example of the need to obtain a balanced load consider the simple program
whose algorithm is given in Figure 4.1. This program consists of two tasks with similar
structures which are executed in parallel on independent processors. Each task applies the
function F() to its dataset before passing the results to the other task. Function F() has
the property that its execution time is proportional to the size of the dataset on which it
operates. Suppose also that the dataset for the problem, D, is partitioned into two subsets,
$D_a$ and $D_b$, such that
$$D = D_a \cup D_b \tag{4.1}$$
Data subset $D_a$ is used by task A whilst $D_b$ is used by task B.
Let us assume that D is not partitioned equally and that $D_a$ holds one tenth of the
information from D with the remaining nine tenths being held in $D_b$. As the execution
time of F() is proportional to the size of the data it operates on, $F(D_b)$ will take nine times
longer to execute than $F(D_a)$. Task A will complete the application of F() to $D_a$ and try
to send the results to task B long before B has finished the operation $F(D_b)$. Task A will
then be forced to wait until task B finishes $F(D_b)$ before it can send its information. If
we take the time taken to perform $F(D_a)$ as the basic unit of time, then task
A will waste 8 units of time waiting for task B to finish the operation $F(D_b)$. This is
shown graphically in Figure 4.2(a). Suppose now we alter the partitioning of D such that
$D_a$ and $D_b$ both contain half of the information from D. $F(D_a)$ and $F(D_b)$ will now
execute in identical times of 5 time steps. Task A no longer has to wait before sending its
information and the processor spends all of its time performing useful computations. The
elimination of the idle time reduces the overall execution time for the program to 7 time
steps, as opposed to 11 time steps for the previous case. This is the optimum execution
time and it arises from the equalization of the computational load between the processors.
Any imbalance in the division of the computational load between the processors results in
one of the processors taking longer to execute. The other processor is forced into an idle
wait state and this increases the total execution time of the program. The situation for the
balanced load is shown graphically in Figure 4.2(b).
[Figure 4.2: Graphical depiction of the execution of the example program with (a) imbalanced
load, (b) balanced load. Each panel plots execution time in time steps for tasks A and B,
marking computation time, idle time and the TX/RX communication steps.]
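The timings quoted for the two partitionings can be reproduced with a very small model. The sketch below rests on assumptions that are inferred from the totals of 11 and 7 time steps rather than stated outright: each message transfer takes one time step, a transfer can only start once both sender and receiver have finished computing, and the two transfers happen one after the other. The function name is hypothetical.

```python
def two_task_timeline(work_a, work_b, message_time=1):
    """Return (total time, idle time of A, idle time of B) for the two-task exchange.

    Assumed model: both tasks compute, then A sends to B, then B sends to A.
    A transfer cannot begin until both tasks have finished computing.
    """
    first_transfer_start = max(work_a, work_b)
    idle_a = first_transfer_start - work_a
    idle_b = first_transfer_start - work_b
    total = first_transfer_start + 2 * message_time  # A -> B, then B -> A
    return total, idle_a, idle_b

imbalanced = two_task_timeline(1, 9)  # D_a holds one tenth of D, D_b nine tenths
balanced = two_task_timeline(5, 5)    # D split into equal halves
print(imbalanced, balanced)
```

Under this model the imbalanced split gives 11 time steps with task A idle for 8 of them, and the balanced split gives 7 time steps with no idle time, in agreement with the figures in the text.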
The effect of imbalanced loading manifests itself in the execution time of a program, and
consequently in the speed-up. The speed-up, $S(n)$, is the ratio of the sequential execution
time to the parallel execution time on $n$ processors. For the example above consider only the
computations, that is the operation $F()$, and ignore the interprocessor communication.
Amdahl's Law allows us to predict the maximum speed-up we can hope to obtain provided
we know the sizes of $W_p$, the amount of work which can be performed in parallel, and $W_s$,
the amount of work which cannot be parallelised and has to be processed sequentially. The
simple program of Figure 4.1 has all the work performed in parallel. Hence
\[ W_p = 1 \]

                       Computation Time
Load Condition      Parallel    Sequential    Speed-up
Balanced               5           10         S = 10/5 = 2
Imbalanced             9           10         S = 10/9 = 1.11

Table 4.1: Speed-ups for the load balancing example

Amdahl's Law predicts that, for $W_s = 0$, $S(n) = n$, indicating that linear speed-up is
achievable. As the example program uses two processors the maximum speed-up that can be
achieved is two. Assuming that the function $F()$ obeys the principle of linear superposition
then the time taken to execute $F(D)$ (i.e. the sequential solution) will simply be the sum
of the times taken to execute $F(D_A)$ and $F(D_B)$, which is 10 time steps. This allows
the calculation of the speed-ups for the two loading examples, which are shown in Table
4.1. As Lewis predicts [19], maximum speed-up is achieved under conditions of balanced
loading whilst the imbalanced load case performs little better than the sequential program.
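
The speed-ups in Table 4.1 follow directly from the definition and can be checked with a short
Python sketch. It is illustrative only: the function name `speed_up` is hypothetical,
communication is ignored as in the text, and the imbalanced task times of 1 and 9 time steps
are inferred from the 8-unit wait and the parallel computation time of 9 quoted above.

```python
def speed_up(task_times):
    """Speed-up of a set of tasks run in parallel, one task per processor.

    Assumes linear superposition (sequential time = sum of the task times)
    and ignores interprocessor communication, as in the example in the text.
    """
    sequential = sum(task_times)
    parallel = max(task_times)  # the slowest task sets the parallel time
    return sequential / parallel


balanced = speed_up([5, 5])    # F(D_A) and F(D_B) each take 5 time steps
imbalanced = speed_up([1, 9])  # inferred split: task A takes 1 step, task B takes 9
print(f"balanced: {balanced:.2f}, imbalanced: {imbalanced:.2f}")
```

Because the parallel time is set by the slowest task, any split other than the even one lowers
the ratio, which is the point the table makes.
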
The importance of this result is that in trying to extract the maximum speed-up from
any program it is essential to achieve a load balance which is as near as possible to the
ideal balanced load. If the load balance is ignored then the resulting speed-ups will not
be optimal. Table 4.2 shows similar speed-up figures obtained from an LU-based solution
program implemented as three tasks running on three processors. The data for this test,
the CEGB 734 node system, is similarly divided into three subnetworks. The program
has the structure shown in Figure 4.3. Tasks A and B operate in parallel and pass their
results to task C which then operates sequentially. One subnetwork is processed by each
task and the workload for each task is changed by altering the partitioning of the network
into subnetworks. The amount of the problem which has to be processed sequentially, $W_s$,
is the workload of task C. Table 4.2 shows the effect on speed-up of changing the workloads
assigned to tasks A and B: the more unequal the load, the further the resulting speed-up is
from the ideal speed-up predicted by Amdahl's Law. Note that it is usually not possible to
achieve an exact load balance and the first row of Table 4.2 gives figures for the most equal
loading possible with this data.
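
The predicted speed-ups quoted in Table 4.2 appear consistent with Amdahl's Law in its usual
normalised form, $S(n) = 1/(W_s + W_p/n)$ with $W_p = 1 - W_s$. The sketch below assumes that
form (the formula itself is not reproduced in this section), takes $n = 3$, and uses the Task C
percentages from the table as $W_s$; the function name is hypothetical.

```python
def amdahl_speed_up(w_s, n):
    """Amdahl's Law with the total work normalised to 1, so w_p = 1 - w_s."""
    w_p = 1.0 - w_s
    return 1.0 / (w_s + w_p / n)


# Sequential fractions (workload of task C) taken from Table 4.2, three processors.
for w_s in (0.004, 0.027, 0.04, 0.03):
    print(f"W_s = {w_s:.3f}  predicted S(3) = {amdahl_speed_up(w_s, 3):.2f}")
```

Small differences from the tabulated values are to be expected, since the percentages in the
table are rounded.
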
4.2.1 The Two Approaches to Load Balancing
The previous section has shown the need for, and the benefits of, computational load
balancing but has not discussed the method of achieving a balanced load. Section 4.1 has
indicated that if the structure of the algorithm and the size and nature of the workload
are known a priori then it is possible to achieve static load balancing by assigning parts of
the problem to tasks at compile time. Static load balancing techniques can be categorized
into four distinct approaches, with graph theoretical approaches being the most common.
The technique relies on the use of two graphs: a graph representing the target machine
(i.e. the physical processor interconnections) and a task graph which shows the relationship
between tasks and the communication requirements of each task. The problem of load
balancing reduces to one of mapping the task graph to the hardware graph in a manner
which minimizes interprocessor communication and execution time. This is a graph theory
problem and much work has been undertaken in this area [83, 19, 84]. Dynamic load
balancing, where the distribution of the computational workload changes during the course of
the program's execution, is also possible but harder to achieve. The major difference between
the two approaches is that static load balancing can be performed by the programmer but
dynamic load balancing is controlled by the operating system or applications software and
is beyond the control of the programmer/user.

Figure 4.3: Three task implementation of the LU-based solution

    Percentage of work performed by                    Amdahl's Law
Task A      Task B      Task C      Speed-up      Predicted Speed-up
48.5        50.9        0.4         2.23          2.98
42          56          2.7         1.98          2.85
25          70          4           1.55          2.77
22          75          3           1.48          2.83

Table 4.2: Effect of load balancing on speed-up for an LU-based solution

Distributing the workload throughout the processors in the system can be performed
in either a domain decomposition or a control decomposition based manner [19]. Control
decomposition is concerned with the structure of the algorithm and the way it is split into
logical tasks which can then be assigned to individual processors. One typical approach to
control decomposition is the supervisor/worker approach which is based on the client/server
and processor pool models of distributed computing [85]. There are a number of worker
tasks which all perform the same function. The supervisor is responsible for allocating work
to each worker task. When a worker finishes its work it sends the results back to the
supervisor and if there is still work to be performed the supervisor allocates another job to the
worker. In this manner all the workers are kept busy performing the work scheduled by the
supervisor until all the work has been completed. This approach often has groups of tasks
which perform certain functions, each group consisting of multiple instances of the required
task. This is shown in Figure 4.4 which depicts a possible supervisor/worker approach to a
parallel implementation of bifactorisation. Three groups of tasks exist: one group consists
of tasks to perform factorisation, the second has tasks to perform left multiplication and the
third consists of tasks which perform right multiplication. Note that the supervisor/worker
approach is a dynamic load balancing approach implemented in the applications software.
Domain decomposition on the other hand is concerned with the data to be operated upon
by the program. The domain of the input data is partitioned and the different subsets are
assigned to the available processors.

Figure 4.4: A supervisor/worker approach to parallel bifactorisation
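
The supervisor/worker scheme described above can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not the thesis implementation: Python threads stand
in for Transputer tasks, a shared queue plays the role of the supervisor's list of outstanding
jobs, and the names `supervisor`, `worker` and `work_fn` are hypothetical.

```python
import queue
import threading


def supervisor(jobs, n_workers, work_fn):
    """Hand out jobs one at a time; a worker that finishes asks for the next job."""
    todo = queue.Queue()
    for job in jobs:
        todo.put(job)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job = todo.get_nowait()  # request more work
            except queue.Empty:
                return                   # no work left: the worker stops
            value = work_fn(job)
            with lock:
                results[job] = value     # return the result to the supervisor

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


results = supervisor(range(8), n_workers=3, work_fn=lambda j: j * j)
print(sorted(results.items()))
```

Because each worker only takes a new job when it has finished the previous one, faster workers
naturally process more jobs, which is what makes the scheme a dynamic load balancing strategy.
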
Both control and domain decomposition are used in implementing an LU-based solution.
There is domain decomposition in that the input data (i.e. the network) is partitioned into
subnetworks which are then assigned to individual processors. The program consists of
multiple instances of the same task, one for each subnetwork. This single task performs
all the necessary processing for a subnetwork and each task is assigned to an individual
processor, thus embodying the principle of control decomposition. The implementation is
also loosely based on the supervisor/worker model in that there needs to be a supervisor
task which partitions the data and forwards it to the subnetwork tasks before solution
commences. Each task is assigned only one unit of work throughout the program's lifetime.
Once the data has been distributed the supervisor becomes redundant and the worker
tasks operate autonomously.
The supervisor/worker model is often used as a dynamic load balancing strategy to
allow a supervisor, usually the operating system, to distribute work across the system in
a dynamic fashion according to the resources (i.e. worker tasks) available at the time.
The parallel LU-solution is only loosely based on the supervisor/worker model in that the
partitioning of data is statically determined a priori by the domain decomposition. A
dynamic supervisor/worker scenario has no prior knowledge about the workload and must
adapt the distribution of work as the program executes. A different approach to dynamic
load balancing is based on the use of task migration. Tasks are initially assigned to certain
processors but as the program executes the operating system can reassign tasks and their
data to different processors in order to equalize the computational loading. Suppose two
tasks are assigned to each processor in the array and that on a certain processor, A, these
tasks complete before the corresponding tasks complete on processor B. A is now idle
whilst B is busy. A better computational loading is achieved if one of B's tasks is moved
to processor A and allowed to complete its computation there. Whilst the concept of
task migration is well understood for both parallel and distributed systems, it remains a
largely theoretical concept as few practical implementations exist in commercially available
operating systems [85].
Most off-the-shelf Transputer systems are not supplied with an operating system which
has global control of the Transputer array. Each processor has built-in hardware which is
responsible for providing local process scheduling and facilities for interprocess
communication. This basic microcoded kernel is sufficient to allow most applications to be
implemented with careful programming. It is possible to buy operating systems which give global
control of the network but these are expensive and, given the limited use such a system would
receive, it was decided that the expense was not warranted for this research project.
Consequently the programs developed during the project had to rely on facilities provided by the
Transputer hardware. As the Transputer provides no support for dynamic load balancing a static
load balancing approach with domain decomposition was used throughout. It is the author's view
that the use of dynamic load balancing techniques for this problem would merely provide a
fine tuning mechanism for wringing the last little bit of speed-up out of the system, and that
the benefits of implementing a dynamic load balancing mechanism are probably outweighed by
the complexity of its implementation.
4.2.2 Load Balancing Methodologies Adopted by Other Parallel Solutions
Chapter 3 surveyed some of the existing parallel algorithms for solving sparse sets of linear
equations. The same load balancing problem has been encountered in the development of
these methods and various techniques for achieving approximately balanced loads have been
developed. Many of the researchers avoid the subject of load balancing in their publications
and it is not clear whether they have addressed the issue or simply ignored it. Padhila and
Morelato [44] go one step further and state that
'load imbalancing is not an issue ... The solution with the best load balance is 
not necessarily the fastest.' 
This is certainly a different view from that held by most of the parallel computing
community which believes load balancing to be an issue of critical importance, as the previous
sections have shown. Where load balancing has been considered it is usually performed in a
simple and inefficient way. Many authors [86, 27] simply assign each column of the matrix
to an individual processor but this is expensive in terms of the computing hardware
required and it ignores the fact that columns with different numbers of off-diagonal elements
require different amounts of computation. Columns with many off-diagonals give rise to
a large amount of computation whereas columns with few off-diagonals require relatively
little computation and their processors spend most of their time idling. Whilst this scheme
divides the load across the processors it takes no account of the characteristics of the load
and results in an ill-balanced computational loading. Variations on this scheme exist [86]
which assign multiple columns to processors but these also ignore the characteristics of the
load and fare little better in achieving a balanced load. This is the load balancing strategy
adopted by Padhila and Morelato [44] in their parallel solution, despite their claim that
load balancing is unimportant.
Other researchers [57, 32] assign multiple columns to individual processors in a manner
that tries to equalize the number of non-zero elements processed by each processor. It is
not clear whether it is the number of non-zeroes before or after fill-in that is equalized.
As this scheduling is based on the use of the elimination tree, and the coefficient matrix
must be at least symbolically factorised before the tree can be derived, it is reasonable to
assume that it is the number of non-zeroes after fill-in which is equalized. If the number of
non-zeroes prior to elimination were equalized then the schedule would fail to take account
of fill-ins that occur as a result of elimination. As factorisation proceeded the load across
the network would become significantly imbalanced. This would be especially true for rows
at the lower right of the matrix which become much more densely populated than rows at
the top left of the matrix.
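
To make the idea of equalizing non-zero counts concrete, the sketch below assigns columns to
processors with a simple greedy rule: take the columns in order of decreasing non-zero count
and give each to the currently least loaded processor. This is a generic heuristic chosen for
illustration, not necessarily the scheme used in [57, 32]; the function name and the non-zero
counts (taken to be counts after fill-in) are hypothetical.

```python
import heapq


def assign_columns(nnz_per_column, n_procs):
    """Greedy assignment: heaviest column first, to the least loaded processor."""
    heap = [(0, p) for p in range(n_procs)]  # entries are (current load, processor)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}
    order = sorted(range(len(nnz_per_column)), key=lambda c: -nnz_per_column[c])
    for col in order:
        load, p = heapq.heappop(heap)        # least loaded processor so far
        assignment[p].append(col)
        heapq.heappush(heap, (load + nnz_per_column[col], p))
    loads = {p: sum(nnz_per_column[c] for c in cols) for p, cols in assignment.items()}
    return assignment, loads


# Hypothetical non-zero counts per column after fill-in.
nnz = [2, 2, 3, 3, 4, 6, 8, 9]
assignment, loads = assign_columns(nnz, n_procs=2)
print(assignment, loads)
```

Assigning the same eight columns in equal-sized groups of four by position would give loads of
10 and 27, which illustrates why schemes that ignore the characteristics of the load balance
poorly.
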
Liu [52] proposes a different approach based on the use of elimination trees to determine
the scheduling strategy which gives the best load balance. Geist and Ng [73] also use this
approach to balance the load in the parallel Cholesky factorisation method. This method
gives a reasonably even load balancing and has been adapted for use with the LU-based
solution.
4.3 The Elimination Tree and Parallel Processing 
The factorisation path for each node in the elimination tree determines the precedence 
relationship in the factorisation and substitution phases of the triangular solution of the set 
of equations represented by matrix A. Nodes which do not belong to the same factorisation
path are independent and may be processed simultaneously by assigning them to different 
processors in a multiprocessor array. This is true for both the factorisation and substitution 
operations of the solution. For example, nodes 5 and 6 in the 10 node example of Figure 
2.6 belong to different factorisation paths and may be processed in parallel. 
Figure 4.5: Partitioning of the elimination tree (a) and corresponding network partitions
(b)

The power of the elimination tree lies in the fact that it is a useful tool for visualising
the factorisation/substitution processes and the parallelism inherent in them. Recall that
factorising a column of the coefficient matrix is equivalent to eliminating a node from the
graph $G(A)$ and hence from the tree $T[A]$. It is easy to see that removing all the leaf nodes
from the tree exposes a new layer of leaf nodes. Repeating the process again exposes another
layer of leaf nodes, and so on. As each leaf node belongs to a different factorisation path
the leaf nodes can be eliminated simultaneously. This is true for each exposed layer of leaf
nodes and exemplifies one typical approach to parallelising the triangular solution [73].
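
The layer-by-layer removal of leaf nodes can be sketched as follows. The tree is stored as a
map from each node to its parent; the 10 node tree used here is a hypothetical example chosen
to resemble the partitioning discussed in this chapter, not a reproduction of Figure 2.6, and
the function name is an assumption.

```python
from collections import defaultdict


def leaf_layers(parent):
    """Group nodes into layers; layer k holds the leaves exposed after k removals."""
    children = defaultdict(list)
    for node, par in parent.items():
        if par is not None:
            children[par].append(node)

    layer = {}

    def depth(node):
        # A node becomes a leaf one step after its last child has been removed.
        if node not in layer:
            layer[node] = 1 + max((depth(c) for c in children[node]), default=-1)
        return layer[node]

    for node in parent:
        depth(node)

    layers = defaultdict(list)
    for node, k in layer.items():
        layers[k].append(node)
    return [sorted(layers[k]) for k in sorted(layers)]


# Hypothetical elimination tree: node -> parent (None marks the root).
parent = {1: 5, 2: 5, 5: 7, 3: 6, 4: 6, 6: 8, 7: 9, 8: 9, 9: 10, 10: None}
for step, nodes in enumerate(leaf_layers(parent)):
    print(f"parallel step {step}: eliminate nodes {nodes}")
```

All nodes within one layer lie on different factorisation paths, so each layer corresponds to
one parallel step of this fine grain scheme.
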
A different approach to the exploitation of parallelism is based on an analysis of the
elimination tree. The essence of the method is to group nodes within the tree into subtrees and
this is equivalent to grouping the nodes of the network into subnetworks. Many researchers
who have used the elimination tree in their approach to parallel triangular decomposition
have not explicitly recognized this fact [87]. As the subtrees contain independent
factorisation paths they may be processed in parallel by independent processors. This method
gives a coarse grain parallel solution whereas the wrap mapping method described above
produces a fine grain solution. Figure 4.5(a) shows how the 10 node example tree of Figure
2.6 could be partitioned into subtrees. Subtrees 1 and 2 are independent and may be
processed in parallel. The remainder of the tree corresponds to the cutset in the partitioned
network for this system and it must be processed after the subtrees have been processed.
Figure 4.5(b) shows how the elimination tree partitioning corresponds to the partitioning
of the network into subnetworks.
Within any elimination tree it is always possible to find a longest path through the
tree. This path is referred to as the critical path of the tree and the length of this path
has important consequences for parallel solutions. The longer the critical path the slower
the elimination process will be. As a consequence, systems which have very short critical
paths can be processed much faster than systems with long critical paths. Consider the
fine grain approach which exposes successive layers of leaf nodes. If the critical path has
$p$ nodes on it then $p$ parallel steps will be required to complete the elimination. This can
be verified by assuming that the processing of each node on the critical path is performed
in unit time. Hence a system with $p$ nodes on the critical path will execute in $p$ units of
time and a system with 10 nodes on the critical path will take twice as long to execute as
a system with 5 nodes on the critical path. A similar argument can be made for the coarse
grain approach and the result is the same in both cases.
Unfortunately this assumption is not valid for real triangular solution programs. The 
processing of each node is achieved in a time which is proportional to the number of nonzero 
elements in the row of the matrix corresponding to that node. Consequently a short path 
consisting of nodes with a large computational overhead may take longer to execute than 
a long path consisting of nodes with a small computational overhead. The critical path 
is not necessarily the longest path, but the path with the largest aggregate computational 
overhead. 
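
The distinction between the longest path and the path with the largest aggregate overhead can
be illustrated with a small sketch. The tree, the per-node costs and the function name are
hypothetical; the cost of a node stands for the number of non-zero elements in its row.

```python
def heaviest_path(parent, cost):
    """Return (weight, path) for the leaf-to-root path with the largest total cost."""
    best = (0, [])
    for start in parent:
        path, total = [], 0
        node = start
        while node is not None:  # walk from this node up to the root
            path.append(node)
            total += cost[node]
            node = parent[node]
        if total > best[0]:
            best = (total, path)
    return best


# Hypothetical tree with one long, cheap path (1-2-3-6) and one short, costly path (4-5-6).
parent = {1: 2, 2: 3, 3: 6, 4: 5, 5: 6, 6: None}
unit_cost = {n: 1 for n in parent}
row_cost = {1: 1, 2: 1, 3: 1, 4: 5, 5: 5, 6: 2}

print(heaviest_path(parent, unit_cost))  # longest path when every node costs one unit
print(heaviest_path(parent, row_cost))   # critical path once real costs are considered
```

With unit costs the four node path is critical; with the assumed row costs the three node path
dominates, as the paragraph above argues.
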
4.3.1 The Elimination Tree and Network Partitioning
Section 2.7 states that it is possible to swap rows and columns of a matrix before performing
Gaussian elimination. This is equivalent to reordering the nodes in the network, and hence
in the graph $G(A)$. In his paper on the role of elimination trees Liu [52] introduces the
concepts of topological and equivalent reorderings. A topological reordering of the
elimination tree reorders the nodes such that all child nodes are numbered before their parents.
The consequence of such an ordering is that the last row/column of the matrix always
corresponds to the root of the elimination tree. An equivalent reordering is defined as follows:
if there is a symmetric matrix $A$ and two orderings $P$ and $Q$, then $P$ and $Q$ are said to
be equivalent if the filled graphs of $PAP^{-1}$ and $QAQ^{-1}$ are the same (i.e. isomorphic).
Furthermore $P$ (or $Q$) is said to be an equivalent reordering if the filled graph of $PAP^{-1}$
has the same structure as the filled graph of $A$. The benefit of equivalent reorderings is
that the reordered matrix incurs the same computational and storage costs as the original
matrix and, in terms of performance, the equivalent reordering is every bit as good as the
original ordering. This implies that if the matrix $A$ has been ordered with some form of near
optimal ordering algorithm, applying an equivalent reordering will not destroy the optimal
nature of the solution.
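
One particular topological reordering is a postorder traversal of the elimination tree, which
numbers every child before its parent and places the root last. The sketch below is a minimal
illustration on a hypothetical 10 node tree (not the tree of Figure 2.6); the function name is
an assumption.

```python
def topological_order(parent):
    """Return the nodes ordered so that every child precedes its parent (postorder)."""
    children = {n: [] for n in parent}
    roots = []
    for n, p in parent.items():
        (roots if p is None else children[p]).append(n)

    order = []

    def visit(n):
        for c in children[n]:
            visit(c)
        order.append(n)  # a node is emitted only after all of its descendants

    for r in roots:
        visit(r)
    return order


# Hypothetical elimination tree: node -> parent (None marks the root).
parent = {1: 5, 2: 5, 5: 7, 3: 6, 4: 6, 6: 8, 7: 9, 8: 9, 9: 10, 10: None}
order = topological_order(parent)
new_label = {old: new for new, old in enumerate(order, start=1)}
print(order)
print(new_label)
```

A postorder also numbers the nodes of each subtree consecutively, which is convenient when the
subtrees are later to be treated as separate blocks of the matrix.
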
Liu proves a number of theorems which are needed to explain how elimination tree based
partitioning works. The proofs of these theorems are given in Appendix D.
Theorem 1 For each node $x_j$ in $G(A)$, the subgraph of $G(A)$ (or $G(F)$) which consists of
nodes in the tree $T[x_j]$ is connected, where $T[x_j]$ is the subtree rooted at node $x_j$.
This theorem implies that partitioning the tree into disjoint subtrees is the same as
partitioning the network into subnetworks. Considering Figure 4.5(a), partitioning the tree
into disjoint subtrees $T[7]$ and $T[8]$, rooted at nodes 7 and 8, is the same as clustering the
corresponding nodes in the network into two subnetworks, as shown in Figure 4.5(b).
Theorem 2 Given the matrix $A$ and an equivalent reordering $P$, the filled graphs of
$G(A)$ and $G(PAP^{-1})$ are isomorphic if they are treated as unlabeled structures.
Theorem 2 implies that every topological reordering of $A$ is an equivalent reordering of
the matrix $A$. The corollary to this theorem is that the tree $T[PAP^{-1}]$ is isomorphic to
$T[A]$ if they are treated as unlabeled structures.
Combining the results of these two theorems allows us to perform a partitioning of the
network into subnetworks based upon an inspection of the elimination tree. If the matrix $A$
has been ordered using a near optimal ordering strategy, subtrees can be identified within
$T[A]$ to partition the network into the required number of subnetworks, as in Figure 4.5. At
this stage the choice of subtrees is arbitrary but Section 4.3.2 introduces criteria for selecting
the roots of the subtrees so that the resulting network partition gives an approximately
balanced load across the processors. The results of Theorem 2 allow the optimally ordered
coefficient matrix to be reordered so as to give the coefficient matrix a particular desired
structure. As this reordering is an equivalent and topological reordering no extra fill-ins will
be introduced when processing the matrix and the same number of arithmetic operations
is required to process both this and the original matrix. The importance of this result
cannot be overstated as it is the foundation of the proposed parallel method. Given a
system of equations a near optimal ordering may be applied to minimize the fill-in resulting
from the factorisation of that system. Theorem 2 allows this minimum fill-in ordering
to be ordered again using an equivalent reordering to give some desired matrix structure
without introducing any extra computations or information. Examining the elimination
tree of this system and using Theorem 1 allows independent subtrees to be identified within
the elimination tree. These subtrees may be processed in parallel by assigning them to
different processors in a multiprocessor array. A parallel solution has been created by
simply rearranging the system of equations.
Theorem 1 allows the parallelism inherent in the system to be exploited and Theorem
2 ensures that both sequential and parallel solutions will require the same total amount of
computation. This contrasts with a number of existing parallel solutions [44, 29, 35] which
rely on the introduction of extra information or computations to allow the exploitation of
parallelism. The new approach will be more efficient than existing methods, assuming that
all methods exploit the parallelism in a given problem to the same extent and in the same
way. Note that no reference has been made to the particular triangular decomposition
method adopted. This is because the parallelisation based on Theorems 1 and 2 depends
only on the structure of the coefficient matrix and is independent of the decomposition
technique. Hence a parallel solution can be devised which utilizes any of the available
triangular decomposition methods.
It has been stated that partitioning the elimination tree into subtrees is the same as
partitioning the network into subnetworks. Partitioning the elimination tree has the effect
of giving the coefficient matrix a block based structure, in the same way that partitioning
the network does. Each block in the matrix corresponds to one or more subtrees in
the elimination tree. Allocating the main row/column blocks to individual processors, as
described in Section 3.4.1, is the final stage of the partitioning process.
4.3.2 Using the Elimination Tree to Achieve Load Balancing
Section 4.3 describes how the computational load may be spread across an array of
processors by assigning subtrees of the elimination tree to separate processors. This approach
is based on a method due to George et al. [88] known as the subtree-to-subcube mapping.
George et al. pioneered the method to assign subtrees of the elimination tree onto disjoint
subcubes of a hypercube multiprocessor. Section 4.3 made no attempt to discuss how
suitable subtrees are identified so that there are always exactly the same number of subtrees as
processors, or how to assign subtrees to processors if there are more subtrees than
processors. The problem of dealing with the portion of the tree below the chosen subtrees (e.g.
nodes 9 and 10 in Figure 4.5) has also not been considered.
Geist & Ng [73] observe that the subtree-to-subcube mapping is only efficient when
applied to balanced trees with a regular structure, such as those resulting from grid-based
problems (e.g. finite element problems). Applying the method to imbalanced trees results
in a large increase in the amount of interprocessor communication which degrades the
performance of the solution algorithm. Furthermore the method assumes there to be exactly
the same number of subtrees as there are processors. Geist & Ng discuss ways of treating
the lower portion of the matrix when using the subtree-to-subcube mapping and they put
forward two approaches. The first uses a fine grain mapping to assign alternate nodes
from the lower portion of the tree to the different processors whilst the second approach
considers the entire lower portion of the tree to be a subtree, rooted at the root of the full
tree. The latter technique was adopted here as it is more in keeping with the concept of
minimally interconnected subnetworks. Consider treating nodes 9 and 10 of Figure 4.5(a)
as a third subtree. This causes nodes 9 and 10 in the associated graph (Figure 4.5(b))
to be encapsulated into a third subnetwork lying between the two existing subnetworks
(Figure 4.6). This new subnetwork has the property that, if it and the branches connected
to it are removed, the remaining network is separated into disjoint subnetworks. Recalling
the discussion of the diakoptic method from Section 3.5.1 highlights the fact that this new
subnetwork has all the properties of the cutset block of the diakoptic method, and it is in fact
the cutset of the system. This can easily be verified by considering the elimination tree and
the role played by the third subtree (Figure 4.6(a)). Removing this subtree separates the
tree into two disjoint subtrees. This cutset can be assigned its own processor but it cannot
be processed in parallel with the other subtrees. As the structure of the elimination tree
shows, this cutset subtree must be processed after the other subtrees have been processed
in parallel. This gives rise to the Master-Slave structure of Figure 3.2.
Figure 4.6: Treating the lower portion of the tree as a separate subnetwork (a) partitioned
elimination tree (b) partitioned network

Geist & Ng refine the subtree-to-subcube mapping to create a method which is suitable
for use with unbalanced trees. Given an arbitrary tree and a set of $n$ processors the technique
finds the smallest set of branches in the tree which can be partitioned into exactly $n$ subsets
whose solution requires approximately the same amount of work. The key to the method is
the use of a weighted elimination tree. Each node $i$ in the tree is assigned a weight equal
to the number of operations required to eliminate node $i$ plus the sum of the weights of
its child nodes. The weight of node $i$ corresponds to the number of operations required
to eliminate the subtree rooted at node $i$ and the weight of the root node corresponds to
the total number of operations required to factorise the matrix. An heuristic bin packing
technique [73, 89, 90] is used to partition the tree into $n$ subsets of approximately equal
weighting. The initial step is to select the first $n$ branches and place their weights in the $n$
bins. A breadth first search then scans the tree selecting nodes and adding their weights to
each of the bins in turn. The algorithm continues until the difference between the weights
in the bins falls below some user defined tolerance. The contents of each bin (i.e. the nodes
associated with it) are then assigned to individual processors. The remaining portion of
the tree, referred to as the separator set, can then be treated using a wrapping assignment
or a separate subtree. Figure 4.7 shows the results of this mapping for a simple tree, to be
processed using 4 processors. A wrapping technique is used to assign the members of the
separator set to different processors. Members of the separator set are denoted by double
circles.
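
The weighted tree and the idea of packing subtrees into bins can be illustrated with the sketch
below. It is a simplified heuristic written in the spirit of the description above and should
not be read as Geist & Ng's published algorithm: it repeatedly splits the heaviest candidate
subtree into its children, moving the split node to the separator set, until a greedy packing
of the candidates into $n$ bins is balanced to within a tolerance. The tree, the operation
counts, the function names and the tolerance are all hypothetical, and nodes are assumed to be
numbered so that children precede their parents.

```python
def subtree_weights(parent, ops):
    """weight[i] = ops[i] plus the weights of all children of i."""
    weight = dict(ops)
    for node in sorted(parent):  # children are numbered before their parents
        if parent[node] is not None:
            weight[parent[node]] += weight[node]
    return weight


def partition(parent, ops, n_bins, tol=0.1):
    """Split the tree into subtrees packed into n_bins bins, plus a separator set."""
    weight = subtree_weights(parent, ops)
    children = {n: [] for n in parent}
    for n, p in parent.items():
        if p is not None:
            children[p].append(n)
    root = next(n for n, p in parent.items() if p is None)

    candidates, separator = [root], []
    while True:
        # Greedy packing: heaviest subtree first, into the lightest bin.
        bins = [[] for _ in range(n_bins)]
        loads = [0] * n_bins
        for r in sorted(candidates, key=lambda r: -weight[r]):
            k = loads.index(min(loads))
            bins[k].append(r)
            loads[k] += weight[r]
        heaviest = max(candidates, key=lambda r: weight[r])
        balanced = (len(candidates) >= n_bins
                    and max(loads) - min(loads) <= tol * max(loads))
        if balanced or not children[heaviest]:
            return bins, loads, separator
        # Otherwise split the heaviest subtree: its root joins the separator set.
        candidates.remove(heaviest)
        separator.append(heaviest)
        candidates.extend(children[heaviest])


# Hypothetical tree (node -> parent) and operation counts per node.
parent = {1: 5, 2: 5, 5: 7, 3: 6, 4: 6, 6: 8, 7: 9, 8: 9, 9: 10, 10: None}
ops = {1: 2, 2: 2, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 5, 10: 3}
bins, loads, separator = partition(parent, ops, n_bins=2)
print("subtree roots per bin:", bins)
print("bin loads:", loads)
print("separator set:", separator)
```

For this example the two bins receive the subtrees rooted at nodes 7 and 8, and nodes 9 and 10
form the separator set, which mirrors the partitioning discussed for Figure 4.5.
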
As Figure 4.7 shows, a significant portion of the tree is actually contained within the
separator set. Treating the whole separator set (cutset) as a subtree is inefficient as this
has to be processed sequentially after the other subtrees have been eliminated. The more
nodes there are in the separator set, the more processing has to be performed sequentially,
reducing the amount of potential parallelism in the problem. Geist's technique becomes
inefficient when the cutset is treated as a separate subtree.
4.3.3 Advantages of the Tree-based Approach 
Figure 4.7: Geist & Ng's partitioning method applied to a simple tree (after Geist & Ng)

The use of weighted elimination trees has a number of advantages besides those already
mentioned. It has been suggested that the execution time of the parallel algorithm is
proportional to the length of the critical path. More precisely, the parallel execution time is
directly proportional to the number of operations that have to be performed to eliminate the
nodes on the critical path during the factorisation stage (i.e. the weight of the critical path,
$W_{cp}$). The total number of operations required to factorise the whole matrix is equal to the
weight of the root node, $W_{root}$. Hence the time taken to factorise the matrix sequentially
is proportional to $W_{root}$. In solving any problem on a parallel computer we are interested
in the amount of speed-up which the parallel solution gives us. The speed-up, $S$, is the
ratio of the execution time of the best sequential program, $t_{seq}$, to the execution time of
the parallel program, $t_{par}$. Combining this with the two observations made above allows
the derivation of a simple rule of thumb for estimating the speed-up that can be obtained
for the parallel solution of a set of equations. As
\[ t_{seq} = \alpha_1 W_{root} \tag{4.2} \]
and
\[ t_{par} = \alpha_2 W_{cp} \tag{4.3} \]
where $\alpha_1$, $\alpha_2$ are constants of proportionality, speed-up can be expressed as
\[ S = \frac{t_{seq}}{t_{par}} = \frac{\alpha_1 W_{root}}{\alpha_2 W_{cp}} \tag{4.4} \]
If the sequential and parallel solutions are executed on the same type of processor running
at the same clock speed
\[ \alpha_1 \approx \alpha_2 \tag{4.5} \]
and hence
\[ S \approx \frac{W_{root}}{W_{cp}} \tag{4.6} \]
Overheads and interprocessor communication have been ignored and (4.6) is a simple
rule-of-thumb for estimating the speed-up that can be expected for a given system.
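
The rule of thumb can be evaluated directly from a weighted elimination tree. The sketch below
takes $W_{root}$ as the total operation count and $W_{cp}$ as the largest leaf-to-root sum of
operation counts; the tree, the operation counts and the function name are hypothetical, and
overheads and communication are ignored as in the text.

```python
def estimated_speed_up(parent, ops):
    """Estimate S as W_root / W_cp for a tree given as a node -> parent map."""
    w_root = sum(ops.values())  # total work equals the weight of the root node

    def path_weight(node):
        total = 0
        while node is not None:  # accumulate operation counts up to the root
            total += ops[node]
            node = parent[node]
        return total

    w_cp = max(path_weight(n) for n in parent)  # weight of the critical path
    return w_root / w_cp


# Hypothetical tree (node -> parent) and operation counts per node.
parent = {1: 5, 2: 5, 5: 7, 3: 6, 4: 6, 6: 8, 7: 9, 8: 9, 9: 10, 10: None}
ops = {1: 2, 2: 2, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 5, 10: 3}
print(f"estimated speed-up: {estimated_speed_up(parent, ops):.2f}")
```

For a tree that is a single chain the estimate is 1, reflecting the absence of any parallelism
when every node lies on the critical path.
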
When considering networks with many thousands of nodes the ehmination tree can 
become very large and it is difficult to plot this on a reasonably sized sheet of paper. 
Associating weights with each node in the tree allows the tree to be pruned to a more 
manageable size by replacing entire subtrees with a single node of the same weight. I f a 
toleranced threshold weight is set it is possible to replace aU subtrees with a weight that 
falls inside the tolerance band by single nodes of equivalent weight. During the course of 
this research project a software package was developed to derive and plot the elimination 
tree for any desired system. The toleranced threshold technique was used to automatically 
reduce the elimination tree diagram to an acceptable and easily managed size. The tolerance 
band in the program was set to ±10%. The threshold value is calculated for each system 
according to the number of subnetworks required, n, and the total weight of the elimination 
tree. Hence 
W,,resh = — ± 1 0 % (4.7) 
n 
As the partitioning of the ehmination tree into subtrees is currently performed by inspection 
it is advantageous to have trees of a manageable size'. The reduced trees can be used to 
determine the network partitioning but nodes replacing the pruned subtrees must be clearly 
marked so that when nodes are assigned to processors the pruned subtrees can be expanded 
back into their original form. 
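The toleranced threshold pruning can be sketched as follows. This is a minimal Python illustration and not the plotting package described above: the dictionary representation of the tree, the function name and the example weights are all assumptions. The threshold is taken to be the total tree weight divided by the number of subnetworks, with a tolerance band of 10% either side.

```python
def prune_tree(children, weight, root, n_subnetworks, tolerance=0.10):
    """Collapse every subtree whose weight lies inside the tolerance band.

    children : dict mapping a node to the list of its children
    weight   : dict mapping a node to the weight of the subtree rooted there
    Returns (pruned_children, collapsed), where `collapsed` holds the nodes
    that now stand in for a whole pruned subtree.
    """
    threshold = weight[root] / n_subnetworks
    low, high = threshold * (1 - tolerance), threshold * (1 + tolerance)
    pruned_children, collapsed = {}, set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node != root and low <= weight[node] <= high:
            pruned_children[node] = []   # subtree replaced by a single node
            collapsed.add(node)          # marked so it can be expanded again later
            continue
        pruned_children[node] = list(children.get(node, []))
        stack.extend(pruned_children[node])
    return pruned_children, collapsed


# Hypothetical ten node tree; each weight is the weight of the whole subtree.
children = {"R": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"],
            "b1": ["b11"], "b2": ["b21", "b22"]}
weight = {"R": 100, "a": 26, "a1": 10, "a2": 10, "b": 70,
          "b1": 24, "b11": 5, "b2": 40, "b21": 23, "b22": 10}

pruned, collapsed = prune_tree(children, weight, "R", n_subnetworks=4)
print(sorted(collapsed))
```

The set of collapsed nodes plays the role of the clearly marked replacement nodes: it records which nodes of the reduced tree must be expanded back into their original subtrees when nodes are assigned to processors.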
System size   Subnetworks   Speed-up SA   Speed-up SB   Predicted speed-up
734           3             2.29          2.27          1.94
734           7             5.68          4.00          3.46
734           15            5.52          5.33          5.40
Table 4.3: The effect of load balancing on speed-up
4.3.4 Performance of the Tree-based Load Balancing 
Table 4.3 shows speed-up results for a number of test systems partitioned into different 
numbers of subnetworks. For each case the speed-up resulting from two different load 
balancing techniques is shown. Scheme A equalizes the number of nodes in each subnetwork 
whilst scheme B equalizes the computational complexity of each subnetwork based on the 
weighted elimination tree analysis. SA is the speed-up resulting from the former scheme 
whilst SB is the speed-up resulting from the latter. The speed-up predicted by the rule-of-
thumb of (4.6) is also shown for the sake of comparison. 
Table 4.3 clearly shows the efficacy of the tree-based load balancing strategy, which 
consistently results in higher speed-ups. The speed-ups produced in practice are similar to 
those predicted using (4.6). At present, partitioning of the elimination tree is performed 
by visual inspection. However the process is an heuristic one and it should be possible 
to develop a set of heuristic rules which will form the basis of an automatic partitioning
algorithm. This idea is developed further in Chapter 7. 
4.4 Summary 
This chapter has considered the role played by the elimination tree in the creation of parallel 
solution methods. It has been demonstrated that the elimination tree is an indispensable
aid to be used in meeting the combined goals of network partitioning and load balancing. 
The need for a balanced computational load has been stressed in this and other chapters 
and a method has been presented which allows the network to be partitioned to give the 
most balanced load. The method is based on the use of a weighted elimination tree and 
it partitions the network in a way which tries to equalise the computational requirements 
of each subnetwork. Results from a number of test systems have shown that this method 
produces good parallel performance. A simple technique for estimating the maximum speed-up
that can be expected from the parallel solution of a set of equations has also been
introduced.
The most significant point emerging from this chapter is that it is possible to improve the 
performance of a parallel solution of a given system simply by rearranging the elimination 
tree associated with that system. The concept of an equivalent reordering has been intro-
duced and a theorem has been presented which states that applying an equivalent reordering 
changes the matrix structure and elimination tree of that system but not the amount of
computation required to yield a solution. This allows a parallel solution to be formulated
which requires the same total amount of computation as the best sequential solution and
this is an improvement over existing parallel formulations, many of which introduce extra
computation steps to allow the exploitation of parallelism. The new method is independent
of the solution algorithm chosen as it depends only on the structure of the elimination tree.
Hence it is possible to formulate a parallel solution using any desired triangular solution 
method. The next chapter discusses the design of a Transputer implementation of this new 
parallel formulation, based around the bifactorisation algorithm. 
Chapter 5 
An Improved Parallel Factorisation
5.1 Introduction 
This chapter considers in detail how the insight provided by the elimination tree can
be used to create an improved parallel triangular factorisation and solution. For
power system simulations it is most important to improve the performance of the
forward/backward substitution operations of the algorithm as it is these operations which are
continuously repeated in a dynamic simulation. Factorisation only needs to be performed
once before repeated solution can occur and poorer performance can be tolerated. Any
event which causes a change in the topology of the system network requires a refactorisation
of the matrix before further solutions can be obtained. When the topology does change
it is vital that the refactorisation be performed as quickly as possible so that real-time
solutions remain in soft real-time^. Therefore it is also desirable to improve the performance
of the factorisation part of a triangular method. 'Performance' in this context equates to
the speed-up obtained from the method under consideration. 
The previous chapter has discussed existing parallel Cholesky and LU-based methods 
for the solution of linear equations. In the development of these methods the authors 
have adopted various algorithmic approaches. The method described in this chapter draws 
upon these different approaches and uses a combination of the best techniques to give an
improvement in both the factorisation and substitution steps of a parallel LU-based solution
algorithm.
^Soft real-time systems are those in which response time is important but the system still functions
correctly if some deadlines are missed. More specifically, soft real-time systems are real-time systems which
are tolerant of the occasional missed deadline or deadlines which are not missed by much [91].
Figure 5.1: The simple 12 node example system a) network graph b) coefficient matrix
structure c) elimination tree
5.2 Development of the Recursively Parallel Method
5.2.1 Identifying the Potential Parallelism
The discussion on the identification of potential parallelism is best treated with regard to 
a specific example. The example used here is a simple 12 node system and the network, 
its associated matrix and elimination tree are shown in Figure 5.1. The elimination tree of 
the system, which has already been ordered using an optimal ordering algorithm, shows the 
parallelism existing in the problem. 
To implement a parallel solution the workload must be divided up and assigned to 
the individual processors. Consider the factorisation phase of the solution algorithm (a 
similar argument applies to the substitution phase). The workload in this case is the set 
of nodes which must be eliminated from the network (i.e. the nodes in the elimination 
tree). Dividing the workload is equivalent to assigning nodes to processors for processing, 
or in terms of the elimination tree, dividing up the tree and assigning the partitions to the 
available processors. Two approaches exist for partitioning the tree - the first considers 
each node in the tree individually whilst the second groups nodes into collections of nodes
which are then assigned to individual processors. The first method also clusters tree nodes
into groups but each group contains nodes which are scattered throughout the tree. This
approach is equivalent to a wrap mapping and results in a large volume of interprocessor
communication. As a result this form of partitioning is really only useful for algorithms
implemented on parallel machines with shared memory. The latter method is more suitable
for distributed memory machines as it groups together nodes in the same region of the tree
and results in a lower volume of interprocessor communication.
Figure 5.2: a) Partitioned topologically reordered network, b) admittance matrix structure
and c) effect on elimination tree
Consider the example network - it is possible to identify four subtrees which may be
processed concurrently. To partition the network into four independent subnetworks a
topological reordering must be applied. It is assumed that the network of Figure 5.1 has already
been optimally ordered. The four independent subtrees are {1,6}, {4,11}, {2,5}, {3,12} and
the remainder of the network ({7,8,9,10}) are the cutset. The topological reordering will
reorder the network such that nodes in the independent subtrees are numbered first and
cutset nodes are numbered last. The nodes in the first subtree {1,6} will be renumbered as
{1,2} whilst the nodes in the last subtree will be renumbered as {7,8}. The cutset nodes
will be renumbered as {9,10,11,12}. Figure 5.2(a) shows the renumbered network after
topological reordering and Figure 5.2(c) shows how the elimination tree is divided into the
four independent subtrees which correspond to the four subnetworks. Parallel factorisation
can be accomplished by assigning the four subnetworks to four different processors. The
problem then arises of how to deal with the cutset nodes. These nodes may be assigned
to the four processors along with the subnetworks using a wrap mapping [73] but this
introduces a large volume of communication during the cutset solution. A better approach
[73] is to treat the cutset as a separate subtree and assign it to its own (fifth) processor.
Figure 5.2(c) implies that the cutset must be processed after the processing of the other four
subnetworks is complete. Information from all the other processors must be passed to the
processor which hosts the cutset so the cutset must be processed by a central coordinating
processor. The algorithm structure is that of Figure 3.2.
The problem in treating the cutset as a separate subtree is that it ignores potential
parallelism existing within the cutset. In the cutset of Figure 5.2 nodes 9 and 10 may be
processed simultaneously, as indicated by Figure 5.2(c). Three further subtrees may now be
created: {9}, {10} and {11,12}. These subtrees are derived by partitioning the cutset and
are referred to as minor subtrees, or minor subnetworks. Subtrees {1,2},{3,4},{5,6},{7,8} 
are created by the existence of the cutset and are referred to as major subtrees, or major 
subnetworks. Subtrees {9} and {10} can be assigned to separate processors and factorised 
in parallel whilst subtree {11,12} is factorised after these two have completed. Extra paral-
lelism has been exploited by partitioning of the cutset thus reducing the amount of sequential 
computation. The elimination tree now consists of three distinct levels. Each level contains 
a number of subtrees (subnetworks) and all the subtrees in a level are independent and may 
be processed in parallel. For the example of Figure 5.2 the levels are 
Level 1 {1,2},{3,4},{5,6},{7,8} 
Level 2 {9},{10} 
Level 3 {11,12} 
Although the processing within levels can be performed concurrently, the levels themselves 
must be processed in sequence (i.e. processing of level 2 cannot begin until the processing
of level 1 is complete). 
The scheduling strategy can be used to make this approach more efficient as it is
observed that subtrees {1,2}, {3,4}, {5,6} and {7,8} must be factorised before {9} and {10}
can commence factorisation. The processors which dealt with {1,2}, {3,4}, {5,6}, {7,8} are
lying idle and may be used to factorise {9} and {10}. The same argument can be applied
to subtree {11,12} and only four processors are required to factorise the seven subtrees.
Each subtree constitutes a separate computational task and more than one task is assigned
to each processor. For example, one processor will host the tasks (subtrees) {1,2}, {9},
{11,12} whilst another might host {5,6} and {10}. This is a departure from existing
approaches which divide the problem into computational tasks and assign each task to its own
processor. The essence of the method is to exploit the parallelism which exists within the
cutset by making use of idling processors and the method has been termed the Recursively
Parallel (RP) method.
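The reuse of idle processors can be illustrated with a short Python sketch. The placement rule used here is an assumption made for the illustration, not the scheduling strategy of the implementation: the subtrees of a level are spread evenly over the processors, so the width of the widest level must be a multiple of the width of every other level. The function name and data layout are likewise hypothetical.

```python
def schedule(levels):
    """Assign the subtrees of each level to processors, reusing idle ones.

    levels : list of levels, each a list of subtrees (tuples of node numbers),
             ordered from the first level to be processed to the last.
    Returns a dict mapping a processor index to the subtrees it hosts, in the
    order in which they are processed.
    """
    n_procs = max(len(level) for level in levels)
    hosted = {p: [] for p in range(n_procs)}
    for level in levels:
        stride = n_procs // len(level)   # spread the level over the processors
        for j, subtree in enumerate(level):
            hosted[j * stride].append(subtree)
    return hosted


# The three levels of the 12 node example of Figure 5.2.
levels = [
    [(1, 2), (3, 4), (5, 6), (7, 8)],   # Level 1
    [(9,), (10,)],                      # Level 2
    [(11, 12)],                         # Level 3
]

for processor, tasks in schedule(levels).items():
    print(processor, tasks)
```

With this rule the seven subtrees fit on four processors. One processor hosts {1,2}, {9} and {11,12} and another hosts {5,6} and {10}, which agrees with the example given in the text.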
5.2.2 The Recursive Bordered Block Diagonal Form
If the Recursively Parallel solution is to be efficient it is necessary to ensure that exploiting
parallelism within the cutset actually reduces the volume of communications to the last task.
This has been achieved by constraining the RP method to make use of a particular coefficient
matrix structure known as the Recursive Bordered Block Diagonal Form (RBBDF).
The RBBDF matrix can be derived by considering a graph of the network described
by the linear equations. Normally the network is comprised of a number of subnetworks
connected in some arbitrary fashion. Suppose there exists a system with fifteen subnetworks,
eight major and seven minor, and that the interconnections are constrained such that the
subnetworks are arranged in binary tree structure. Figure 5.3(a) shows the tree structure
and Figure 5.3(b) shows the matrix associated with the network. This matrix structure
is similar to the BBDF structure in that there are blocks along the leading diagonal and
a border region along the bottom and right of the matrix. Most of this border region is
empty and it contains only two lines of blocks which run diagonally across to the lower right
corner of the matrix. There is a significant amount of parallelism available for exploitation
in this structure as the constrained interconnection creates very little dependence between
the matrix blocks. There are no dependencies between the first eight blocks and these may
be processed in parallel. Significantly these first eight blocks are all the subnetworks in the
leaf level (level 0) of the tree-based network (i.e. the major subnetworks). Dependencies
do exist between the major and the minor subnetworks themselves. During factorisation of
blocks 1 to 8 updates will be made to blocks 9 to 12 and hence 9 to 12 cannot be processed
until all processing on blocks 1 to 8 is complete. There are no dependencies between blocks
9 to 12 and these may also be processed in parallel. These blocks together constitute level
1 of the tree-based network graph. Processing of blocks 9 to 12 updates blocks 13 and 14
and the processing of these two blocks must commence after the processing of blocks 9 to
12 has completed. The lack of dependencies between 13 and 14 allows these two blocks
to be processed in parallel and they constitute level 2 of the network graph. Finally block
15 may be processed after 13 and 14 have completed and this block corresponds to the
root node of the tree structured network graph. The four levels of the tree give rise to
four regions of independence within the matrix. All the blocks within each of these regions
may be processed in parallel. However the regions must be dealt with one after another
in a sequential manner. Constraining the network graph to have a binary tree structure
immediately introduces four phases of parallelism into the factorisation and substitution
operations on the coefficient matrix. Unfortunately the network is somewhat artificial and
it would be very difficult to partition a real network, particularly one as complicated as
a power system, such that its constituent subnetworks were connected together to form a
binary tree.
Figure 5.3: Subnetworks constrained to a binary tree connection structure a) elimination
tree b) coefficient matrix structure
Suppose that the strict binary tree connection constraint is relaxed so that an additional 
connection is allowed between every subnetwork and the last subnetwork (i.e. the root) of 
the tree. Adding in the extra connections creates a border in the last row and column of 
the matrix but this does not affect the exploitation of parallelism described in the preceding 
paragraph. Four phases of parallelism still exist corresponding to the four levels in the tree
and all subnetworks which lie in the same level of the tree may be processed in parallel.
The only consequence of allowing the extra connections in the network graph is that extra
updates have to be performed between the parallel phases. For example the processing of
level 0 (i.e. blocks 1 to 8 in parallel) requires updates to blocks 9 to 12 but also to block
15. Relaxing the connection constraints still further allows any subnetwork to be connected
to any other subnetwork below it in the tree. Connections across the tree are not allowed.
Implementing this strategy would allow subnetwork 1 to be connected to subnetworks 9,
13 and 15. This is depicted in Figure 5.4(a) and the associated matrix is given in Figure
5.4(b). This matrix exhibits Recursive Bordered Block Diagonal Form. If the diagonal
blocks corresponding to the subnetworks in level 0 of the tree are considered it can be seen
that the remaining blocks form a border block below and to the right of them. The diagonal 
blocks corresponding to the first two levels (1 to 12) are also bordered below and to the 
right by the remaining network blocks. Similarly the diagonal blocks for the first three levels 
are also bordered below and to the right by the remaining network blocks. The bordered 
block diagonal form is recurrent in the matrix giving rise to the name of Recursive Bordered 
Block Diagonal Form. 
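The block structure described above can be generated with a short Python sketch. It assumes a complete binary tree of subnetworks labelled level by level as in the fifteen subnetwork example (leaves 1 to 8, then 9 to 12, 13 and 14, and the root 15), with every allowable connection present. The function names are hypothetical and the number of leaves is assumed to be a power of two.

```python
def ancestors(block, n_leaves=8):
    """Return the blocks that lie below `block` on its path to the root."""
    if not 1 <= block <= 2 * n_leaves - 1:
        raise ValueError("block label out of range")
    start, width = 1, n_leaves
    while block >= start + width:          # find the level containing the block
        start, width = start + width, width // 2
    path = []
    while width > 1:                       # walk down to the root
        parent = start + width + (block - start) // 2
        path.append(parent)
        block, start, width = parent, start + width, width // 2
    return path


def rbbdf_pattern(n_leaves=8):
    """Block sparsity pattern: True where a block may hold nonzero elements."""
    n_blocks = 2 * n_leaves - 1
    pattern = [[False] * n_blocks for _ in range(n_blocks)]
    for b in range(1, n_blocks + 1):
        pattern[b - 1][b - 1] = True       # diagonal block
        for a in ancestors(b, n_leaves):   # border blocks, stored symmetrically
            pattern[b - 1][a - 1] = pattern[a - 1][b - 1] = True
    return pattern


def render(pattern):
    """Draw the pattern with '*' for populated blocks and '.' for empty ones."""
    return "\n".join("".join("*" if x else "." for x in row) for row in pattern)


print(ancestors(1))
print(render(rbbdf_pattern()))
```

Subnetwork 1 is coupled only to subnetworks 9, 13 and 15, and no two blocks in the same level are coupled to each other, which is the property that allows each level to be processed in parallel.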
Interconnecting the subnetworks such that the associated matrix has RBBDF imposes 
constraints on the network partitioning but these are reasonably loose constraints. Many 
connections are allowable in the network and it is relatively easy to partition any real 
network such that the coefficient matrix is in RBBDF. The structure is even more flexible 
in that some of the connections may be missing. For example if the connection between 
1 and 15 is missing it does not significantly alter the matrix structure and the same four
phases of parallelism can be exploited. In fact any of the connections may be missing and
more than one may be missing simultaneously. The only requirement is that there must
be at least one connection to each subnetwork which is present. An even greater degree
of flexibility is offered when it is recognized that not all of the subnetworks have to be
present either. If insufficient network partitions can be found to give the required number
of subnetworks without violating the connection constraints then some of the subnetworks
can be missed out. For example if only 14 subnetworks can be found, subnetwork 1 in
Figure 5.4(a) can be missed out^ so that only seven subnetworks exist in the first level.
Multiple subnetworks can be missing simultaneously, the only condition being that the root 
of the tree must always exist. 
^The numerical labeling of the subnetworks must be contiguous. In this example, level 0 would consist
of nodes 1 to 7, level 1 consists of nodes 8 to 11, blocks 12 and 13 lie in level 2 and the root of the tree is
block 14.
Figure 5.4: The interconnections giving rise to RBBDF a) elimination tree b) coefficient
matrix structure
The constraints which must be imposed on the network partitioning in order to produce
an RBBDF coefficient matrix may be summarized as
• The interconnection of subnetworks is based on a binary tree with additional connec-
tions 
• There may be up to $2^m - 1$ subnetworks connected in a tree-like fashion, where m is
the number of levels in the tree and indicates the number of parallel phases that can
be exploited in either factorisation or substitution
• The subnetwork which forms the root of the tree must always be present - any other 
subnetworks may be missing 
• The subnetworks which are present must be labeled contiguously 
• Connections may be missing from the tree structure but there must be at least one 
connection to every subnetwork which is present. In other words, subnetworks which 
are present must be connected to the tree and cannot be isolated. 
• Any subnetwork is allowed to connect to any other subnetwork below it in the tree. 
Connections across the tree are not allowed 
Figure 5.5 shows the possible interconnections for networks with 3, 7, 15 and 31 subnetworks. 
Figure 5.5: The constrained subnetwork interconnections
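The connection constraints listed above lend themselves to a simple mechanical check. The following Python sketch is an illustration under stated assumptions: the tree is described by a hypothetical `parent` dictionary covering the subnetworks which are present, connections are given as pairs of subnetwork labels, and the function names are not taken from the thesis software.

```python
def max_subnetworks(levels):
    """Largest number of subnetworks for a tree with the given number of levels."""
    return 2 ** levels - 1


def path_to_root(node, parent):
    """Subnetworks lying below `node` in the tree, ending with the root."""
    path = []
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def check_connections(parent, root, connections):
    """Return a list of constraint violations (empty if there are none)."""
    problems = []
    present = set(parent) | {root}
    connected = set()
    for a, b in connections:
        if b in path_to_root(a, parent) or a in path_to_root(b, parent):
            connected.update((a, b))
        else:
            problems.append(f"connection {a}-{b} runs across the tree")
    if len(present) > 1:
        for s in sorted(present - connected):
            problems.append(f"subnetwork {s} is isolated")
    return problems


# Seven subnetworks: leaves 1 to 4, then 5 and 6, with 7 as the root.
parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 7, 6: 7}
allowed = [(1, 5), (2, 7), (3, 6), (4, 6), (5, 7), (6, 7)]

print([max_subnetworks(m) for m in (2, 3, 4, 5)])
print(check_connections(parent, 7, allowed))
print(check_connections(parent, 7, allowed + [(1, 2)]))
```

A connection is accepted only when one of its two subnetworks lies on the other's path to the root, which is how "below it in the tree" is interpreted here.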
Given the coefficient matrix of a system it may be placed into RBBDF in three simple 
steps 
1. Apply an optimal ordering to this system to ensure minimum fill-in and short path
lengths
2. Determine the elimination tree of the system
3. Using the elimination tree, apply the equivalent reordering which converts the matrix
into RBBDF. 
The equivalent reordering is determined by visual inspection of the elimination tree. The
tree is divided into the required number of subtrees and these subtrees must be arranged
within the tree such that their interconnections do not breach the connectivity constraints
necessary for RBBDF. The individual nodes of the tree must then be renumbered. Suppose
that k subtrees have been identified and that subtree 1 encloses $n_1$ nodes, subtree 2 encloses
$n_2$ etc. The individual tree nodes are renumbered a subtree at a time in ascending subtree
order. The $n_1$ nodes in subtree 1 are renumbered as 1 to $n_1$. The nodes in subtree 2 are then
renumbered as $n_1 + 1$ to $n_1 + n_2$ etc. Once all the nodes in the tree have been renumbered
the matrix associated with the tree will have been restructured to give RBBDF.
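The renumbering step can be sketched in Python using the 12 node example of Section 5.2.1. The function name is hypothetical. The order of the nodes inside each subtree is assumed to be the order in which they are listed, which should preserve the existing elimination order within the subtree.

```python
def renumber(subtrees):
    """Map old node numbers to new ones, one subtree at a time.

    subtrees : list of subtrees in ascending subtree order, each a list of
               the original node numbers it encloses.
    """
    mapping = {}
    next_label = 1
    for subtree in subtrees:
        for old in subtree:
            mapping[old] = next_label
            next_label += 1
    return mapping


# Four independent subtrees followed by the cutset.
subtrees = [[1, 6], [4, 11], [2, 5], [3, 12], [7, 8, 9, 10]]
mapping = renumber(subtrees)
print(mapping)
```

Applying the mapping to the rows and columns of the coefficient matrix is then an ordinary symmetric permutation.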
5.2.3 Balancing the Load
The subtree-to-subcube mapping (Section 4.3.2) is not adequate for partitioning trees for
solution with the Recursively Parallel method. A variation of this technique has been
developed to partition the tree so as to exploit the potential parallelism existing in the
cutset itself. As with subtree-to-subcube mapping, the approach is based on the use of a
weighted elimination tree. For each node i in the tree the number of multiplication-additions
required to eliminate that node from the network is calculated. This is the computational
complexity^ C(i) of eliminating node i and if row i of the matrix contains k non zero elements
C ( 0 = 1 + A: + M ^ (5.1) 
The nodes in the tree are assigned a weight, W(i), defined to be the computational
complexity of node i plus the weight of the descendant nodes. The weight of the root node is
the computational complexity of factorising the whole matrix. The method then assigns
nodes to processors by selecting the correct number of subtrees from the elimination tree.
For example, to partition the network into eight main subnetworks up to fifteen subtrees 
have to be identified - the eight main subtrees and seven minor ones. These subtrees must 
be connected in the manner shown in Figure 5.5. 
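The weights W(i) can be accumulated from the leaves of the tree towards the root. The Python sketch below assumes that the tree is supplied as a hypothetical `parent` dictionary and that the complexity C(i) of each node has already been calculated by whichever formula is preferred. The example values are invented purely for illustration.

```python
def subtree_weights(parent, complexity):
    """Weight of each node: its own complexity plus the weights of its children.

    parent     : dict mapping a node to its parent in the tree (None for the root)
    complexity : dict mapping a node to its computational complexity C(i)
    """
    children = {node: [] for node in parent}
    root = None
    for node, p in parent.items():
        if p is None:
            root = node
        else:
            children[p].append(node)
    weight = {}
    stack = [(root, False)]
    while stack:                            # iterative post-order traversal
        node, children_done = stack.pop()
        if children_done:
            weight[node] = complexity[node] + sum(weight[c] for c in children[node])
        else:
            stack.append((node, True))
            stack.extend((c, False) for c in children[node])
    return weight


# Hypothetical five node tree with invented complexities.
parent = {1: 3, 2: 3, 3: 5, 4: 5, 5: None}
complexity = {1: 2, 2: 3, 3: 4, 4: 1, 5: 5}
print(subtree_weights(parent, complexity))
```

The weight of the root equals the sum of all the complexities, in line with the statement that the root weight is the complexity of factorising the whole matrix.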
When partitioning an elimination tree it is necessary to pick subtrees such that those
subtrees which lie in the same level of the appropriate tree in Figure 5.5 have roughly
equal weights. For example subtrees 1 to 8 lie in the same level and they should have
approximately equal weights. Consider the tree of the reduced CEGB 734 node system
shown in Figure 5.6 and how it is partitioned into subtrees for a solution with eight main
subnetworks on eight processors. The subtree rooted at * has a weight of 1228 whilst the
subtree rooted at ** has a weight of 1246. In real systems it is not usually possible to
identify subtrees with exactly the same weight and some imbalance in the computational
loading has to be accepted. As following chapters will show, this imbalance can be used to
improve speed-up.
^Several other formulae for determining computational complexity may also be used e.g. total number 
of nonzeroes in the corresponding matrix row, total number of machine level arithmetic operations for 
factorisation, total number of machine level arithmetic operations for substitution 
Figure 5.6: Partitioning of the reduced CEGB 734 node system for solution on 8 processors 
5.2.4 Reducing the Sequential Part of the Method
The RBBDF matrix structure may be used to reduce the size of the sequential part of a 
conventional parallel LU solution and can increase performance. Section 5.2.1 showed that 
potential parallelism exists in the processing of the cutset and the use of RBBDF allows 
this parallelism to be exploited. 
Consider the network represented in Figure 5.4(a). The sequential part of the algorithm 
is reduced by partitioning the cutset blocks into minor subnetworks found in level 1 and 
below. In Figure 5.4(a) minor subnetworks 9 to 15 together constitute the cutset block and 
they would be solved sequentially as a single block in a conventional triangular solution. 
Blocks 1 to 8 are the subnetworks in the traditional approach and parallelism is only ex-
ploited in the processing of these subnetworks. Giving RBBDF structure to the coefficient 
matrix and partitioning the cutset into minor subnetworks allows parallelism to be exploited 
in the solution of both main subnetworks and the cutset. Parallelising cutset processing
reduces the size of the sequential part of the method and will result in improved speed-ups. 
For example, consider the system of Figure 5.4(a) and assume that the processing of each 
subtree requires one unit of computation. In a conventional BBDF-based parallel solution 
subtrees 1-8 would be processed in parallel whilst subtrees 9-15 would be aggregated into 
a single subtree and processed sequentially. Processing of 9-15 would take seven units of 
computation whilst parallel processing of 1-8 would take one unit of computation. The 
BBDF-based solution requires eight units of computation time to yield a solution. Now 
consider the RBBDF-based triangular solution. Subtrees 1-8 are again processed in parallel 
requiring one unit of computation. Now 9-12 are also processed in parallel, requiring one 
unit of computation. Processing 13 and 14 concurrently requires one unit of computation 
and processing of 15 requires one further unit of computation. The RBBDF-based solution 
requires only 4 units of computation time to yield a solution, which is a significant improve-
ment over the BBDF-based solution. Note that both solutions require the same amount of 
total (sequential) computation but the RBBDF solution allows idle processors to be used 
to process the cutset in parallel. The exploitation of parallelism within the cutset is made
possible by the use of the RBBDF coefficient matrix. 
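The unit-cost comparison above can be reproduced with a few lines of Python. The sketch assumes, as the example does, that every subtree takes one unit of computation, that there are at least as many processors as subtrees in the widest level and that communication is free. The function names are hypothetical.

```python
def bbdf_units(level_widths):
    """Leaf level in parallel (one unit), remaining subtrees one after another."""
    return 1 + sum(level_widths[1:])


def rbbdf_units(level_widths):
    """Every level in parallel, levels in sequence: one unit per level."""
    return len(level_widths)


# Number of subtrees in levels 0 to 3 of Figure 5.4(a).
widths = [8, 4, 2, 1]
print("total work:", sum(widths))
print("BBDF time: ", bbdf_units(widths))
print("RBBDF time:", rbbdf_units(widths))
```

Both formulations perform fifteen units of work in total. Only the elapsed time differs, because the RBBDF-based solution keeps otherwise idle processors busy during cutset processing.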
5.3 A Simulation of the Recursively Parallel Method 
In order to verify the effectiveness of the Recursively Parallel method a simulation of the 
approach was implemented on an IBM PC-AT clone. A suite of four test systems was 
used in the simulation and these were partitioned into various numbers of subnetworks for 
solution. There were three aims to the simulation 
1. To verify that the RP method worked correctly, particularly in the presence of missing 
connections and subnetworks. 
2. To verify that the RP method, when executed on a parallel machine, would give a 
faster solution than the best sequential method. Also to verify that the total compu-
tation time for the RP method was the same as that of the best sequential method. 
3. To verify that the RP method, executed on a parallel machine, would give a faster 
solution than existing parallel LU based solution techniques. 
This section discusses the implementation of the simulation and the results obtained from 
it.
Figure 5.7: The block oriented data structure
5.3.1 Implementation
The simulation of the RP method is based around Zollenkopf's bifactorisation method.
Bifactorisation is used to decompose the RBBDF coefficient matrix into the relevant factor
matrices and these are multiplied together to yield the solution for the unknown vector. The
use of the RBBDF coefficient matrix allows the factorisation and substitution operations to
be performed in a number of parallel phases, as described in the previous section. Timing 
mechanisms have been built into the simulation to allow the computations to be accurately 
timed. This enables an estimate to be made for the computation time of a multiprocessor 
implementation of the method. The overall execution time may be monitored and it is also 
possible to time the processing of individual regions of the coefficient matrix. To maintain 
maximum accuracy timing is performed using an external digital counter/timer connected 
to the PC's parallel port. This timer has a resolution of ^th. of a millisecond and timing 
is triggered by start and stop signals sent from the PC. The times recorded by the timer 
are manually entered into the simulation by the user. 
The heart of the simulation, and of the multiprocessor implementation, is the data
structure used to store the coefficient matrix. Two data structures were used in the
simulation, each giving rise to a slightly different algorithm. The first structure, depicted in
Figure 5.7, treats the coefficient matrix in a blockwise manner. Each populated region
in the matrix, shown shaded in Figure 5.7, is treated as a separate block and bifactorisation
is accomplished by performing elementary operations on the individual matrix blocks.
Computation time is monitored by timing the elementary operations on individual blocks.
The method is described further in Chapter 8 where it is shown that this block oriented
treatment of the matrix may form the basis of a different type of parallel solution. For
the method described in this chapter the block-oriented approach does not give the most
efficient solution. The second data structure treats the matrix in a row-oriented fashion.
Again the matrix is stored in sections, as in Figure 5.8, and if there are n subnetworks
then there will be n sections in the matrix. Each of the sections extends from the diagonal
to the righthand edge of the matrix and covers both populated (shaded) and unpopulated
(unshaded) regions of the matrix. It is observed that the lower right corner of the matrix is
more densely populated and sections in this region may employ full array storage for matrix
elements whilst those in the rest of the matrix make use of sparse linked list storage. The
location of the changeover between sparse and full storage is specified by the user. For many
of the simulation runs only the last section made use of full array storage. Bifactorisation
is accomplished using the bifactorisation rules of equations (2.28) to (2.30). Timing of
the computation is restricted to monitoring the processing time for each section but this is
sufficient to allow a prediction of parallel computation time.
Figure 5.8: The row oriented data structure
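The layout of the row-oriented sections can be sketched as follows. This Python fragment only describes where each section lies and which kind of storage it would use; it is not the storage scheme of the simulation itself. The function name, the dictionary layout and the zero-based half-open ranges are assumptions, and the block sizes are those of the seven subtrees of the 12 node example.

```python
def make_sections(block_sizes, first_full_section):
    """Describe the row-oriented sections of the coefficient matrix.

    block_sizes        : number of rows in each section, one per subnetwork
    first_full_section : index of the first section held in full array storage
                         (the user-specified changeover point)
    Ranges are zero-based and half-open, i.e. (start, stop).
    """
    n_total = sum(block_sizes)
    sections = []
    start = 0
    for index, size in enumerate(block_sizes):
        sections.append({
            "rows": (start, start + size),
            "cols": (start, n_total),   # from the diagonal to the righthand edge
            "storage": "full" if index >= first_full_section else "sparse",
        })
        start += size
    return sections


# Seven sections; only the last one uses full array storage.
sections = make_sections([2, 2, 2, 2, 1, 1, 2], first_full_section=6)
for section in sections:
    print(section)
```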
During a simulation run a time is returned for the factorisation, left multiplication and 
right multiplication operations on each block. If the system is partitioned into n blocks 
there will be n factorisation times t } [ \ ] . . . t j [ n \ , n left multiplication times t i [ l \ . . .ti[n] and 
(n - 1) right multiplication times tr[l] . . . t r [ n - I ] . These (3n - 1) timing statistics may be 
manipulated to produce a value for the total computation time of the RP method and an 
estimate of the execution time of the RP method on a parallel computer. Total computation 
time is easily found by summing all (3n - 1) times and this should be similar to the total 
computation time for the best sequential method. If this does not hold then the RP method 
will be inherently less efficient than the sequential method. Total computation time may 
be subdivided into total times for each of the three solution operations (i.e. factorisation, 
Figure 5.9: Algorithm structure for the Recursively Parallel method 
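left multiplication and right multiplication). The following relationships will be true if the 
RP method is as efficient as the best sequential method 
$$\sum_{i=1}^{n} t_f[i] \approx t_{SEQ_f} \tag{5.2}$$
$$\sum_{i=1}^{n} t_l[i] \approx t_{SEQ_l} \tag{5.3}$$
$$\sum_{i=1}^{n-1} t_r[i] \approx t_{SEQ_r} \tag{5.4}$$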
The estimated execution time for the parallel RP solution can be determined by consid-
ering the tree-based interconnection of subnetworks. This gives rise directly to the algorithm 
structure of the RP method which is of the form shown in Figure 5.9. Consider the 
factorisation operations. $F_1$, $F_2$, $F_3$ and $F_4$ all commence at the same time. $F_5$ cannot commence 
until both $F_1$ and $F_2$ complete whilst $F_6$ cannot commence until $F_3$ and $F_4$ are complete. 
$F_7$ commences when both $F_5$ and $F_6$ are complete. Suppose that $F_1$ takes much longer to 
perform than $F_2$, $F_3$ and $F_4$. $F_3$ and $F_4$ will complete and $F_6$ will be able to commence its 
operation. $F_2$ will also have completed but $F_5$ is held up awaiting the completion of $F_1$. When 
$F_1$ finally finishes $F_5$ will commence but this will occur some time after the commencement 
of $F_6$. If $F_5$ and $F_6$ take equal times to process then $F_6$ will complete before $F_5$ and $F_7$ will 
be held up awaiting the completion of $F_5$. The long processing of $F_1$ can be seen to ripple 
down through the tree causing other operations to wait and $F_1$ has a direct effect on the 
total time for parallel factorisation. In fact the longest operation in the first level of the 
tree defines a critical path from this operation to the root of the tree. Summing the times of 
factorisation operations on this critical path gives the total time for parallel factorisation, 
$t_{PAR_f}$. Summing the left and right multiplication times along the same path defines the 
total left and right multiplication times, $t_{PAR_l}$ and $t_{PAR_r}$ respectively. 
A cursory examination of the algorithm structure shows that the total parallel execution 
time is not the sum of the total parallel times for each of the three operations. Many of 
the left multiplication operations occur in parallel with the factorisation operations and are 
effectively hidden behind the factorisations. Only one left multiplication operation is not 
performed in parallel with the factorisations and all of the right multiplications have to be 
performed after the left multiplications are complete. The total parallel execution time for 
the RP method is therefore given by 
$$t_{PAR_{total}} = t_{PAR_f} + t_{PAR_r} + t_l[n] \tag{5.5}$$
The speed-ups can be calculated as 
$$S_f = \frac{t_{SEQ_f}}{t_{PAR_f}} \tag{5.6}$$
$$S = \frac{t_{SEQ_{total}}}{t_{PAR_{total}}} \tag{5.7}$$
where $S_f$ is the factorisation speed-up and $S$ is the overall speed-up. Substitution speed-up 
must be calculated in a different manner. In a power system simulation, where the network 
equations are solved for many different right hand sides, factorisation is performed only 
once but substitution is performed repeatedly. Under these conditions the structure of the 
algorithm changes to that shown in Figure 5.10. None of the left multiplications are hidden 
behind other operations and the time for parallel substitution, $t_{PAR_{subst}}$, becomes 
$$t_{PAR_{subst}} = t_{PAR_l} + t_{PAR_r} \tag{5.8}$$
Figure 5.10: Algorithm structure for repeated substitution with multiple right hand sides 
Hence the substitution speed-up, $S_s$, is given by 
$$S_s = \frac{t_{SEQ_l} + t_{SEQ_r}}{t_{PAR_l} + t_{PAR_r}} \tag{5.9}$$
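The critical-path bookkeeping and the speed-up formulae of this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the seven-block tree, the `parent` map and every timing value are invented for the example, the sequential times are taken to equal the sums of the per-block times (so that relationships (5.2) to (5.4) hold exactly), and the overall sequential time is assumed to be the sum of the three sequential times.

```python
# Hypothetical seven-block tree: parent[i] is the block that must wait for block i.
parent = {1: 5, 2: 5, 3: 6, 4: 6, 5: 7, 6: 7, 7: None}
first_level = [1, 2, 3, 4]
root = 7

# Invented timings in arbitrary units; the root block has no right multiplication.
t_f = {1: 9.0, 2: 4.0, 3: 4.0, 4: 5.0, 5: 3.0, 6: 3.0, 7: 2.0}
t_l = {1: 2.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0, 7: 0.5}
t_r = {1: 2.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 1.0}


def critical_path(times, leaves, parent):
    """Path from the slowest first-level operation down to the root."""
    node = max(leaves, key=lambda i: times[i])
    path = []
    while node is not None:
        path.append(node)
        node = parent[node]
    return path


path = critical_path(t_f, first_level, parent)

# Parallel times: sums along the critical path.
t_par_f = sum(t_f[i] for i in path)
t_par_l = sum(t_l[i] for i in path)
t_par_r = sum(t_r.get(i, 0.0) for i in path)
t_par_total = t_par_f + t_par_r + t_l[root]   # equation (5.5)
t_par_subst = t_par_l + t_par_r               # equation (5.8)

# Assumed sequential times: the sums of all per-block times.
t_seq_f = sum(t_f.values())
t_seq_l = sum(t_l.values())
t_seq_r = sum(t_r.values())

s_f = t_seq_f / t_par_f                                # equation (5.6)
s = (t_seq_f + t_seq_l + t_seq_r) / t_par_total        # equation (5.7)
s_s = (t_seq_l + t_seq_r) / t_par_subst                # equation (5.9)

print(path, round(s_f, 3), round(s, 3), round(s_s, 3))
```

With these invented numbers the slow first block places blocks 1, 5 and 7 on the critical path, which is the ripple effect described above.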
5.3.2 Results of the Simulation 
The results of the simulation are encouraging and bear out all the theoretical predictions 
presented earlier in this chapter concerning the performance of the RP method. Table 5.1 
lists the speed-up results for both the RP method and a standard parallel LU method for 
the four test systems. Figure 5.11 to Figure 5.13 show these results graphically. 
It is easy to see from the graphs that the RP method performs better than the existing 
approaches to parallel solution. Higher absolute speed-ups are obtained and as the number 
of processors increases the speed-ups carry on increasing and do not appear to suffer from 
the saturation which affects the standard solution. This does not mean that the RP solution 
never saturates, simply that saturation is not observed within the region of interest. Given 
that the RP method returns higher speed-ups than the standard method from the same 
number of processors it is obviously more efficient. 
The results obtained from the simulation are somewhat optimistic as there are certain 
characteristics of a parallel program that are not included in the simulation. The main 
omission is that of interprocessor communication and the overheads it introduces. A com-
munication between two processors takes a finite amount of time and this depends not only 
Figure 5.11: Overall speed-up results of simulated solution of the four test systems, plotted 
as speed-up against number of processors, n - solid lines correspond to the standard parallel 
method whilst dashed lines correspond to the RP method 
Figure 5.12: Factorisation speed-up results of simulated solution of the four test systems, 
plotted as speed-up against number of processors, n - solid lines correspond to the standard 
parallel method whilst dashed lines correspond to the RP method 
Figure 5.13: Substitution speed-up results of simulated solution of the four test systems, 
plotted as speed-up against number of processors, n - solid lines correspond to the standard 
parallel method whilst dashed lines correspond to the RP method 
                 Recursively Parallel Speed-up        Standard Speed-up
System  CPUs     Overall  Factorise  Substitute       Overall  Factorise  Substitute
118       2      1.50     1.55       1.62             1.50     1.55       1.62
118       4      3.43     2.72       2.72             2.55     2.63       2.42
118       8      5.34     4.41       3.86             3.13     3.28       2.94
629       2      1.94     1.94       1.94             1.94     1.94       1.94
629       4      2.77     2.41       2.49             2.34     2.31       2.41
629       8      4.62     3.96       4.29             3.39     3.24       3.69
629      16      7.12     6.17       6.57             4.29     4.09       4.73
734       2      1.92     2.01       1.87             1.92     2.01       1.87
734       4      4.43     3.34       3.08             3.12     3.14       2.95
734       8      6.12     5.37       4.98             3.76     3.66       3.95
734      14      9.71     6.67       8.14             4.2      3.91       4.95
1624      2      1.85     1.79       1.97             1.85     1.79       1.97
1624      4      3.46     2.41       3.07             3.36     2.34       3.01
1624      8      6.36     5.15       5.58             4.92     3.92       4.65
1624     13      8.00     6.27       8.99             4.5      3.77       5.80
Table 5.1: Results of the simulated solution of the four test systems 
on the length of the message but also on the state of the receiving task. If the intended 
recipient is not ready to receive information the communication will be blocked until the 
receiving task reaches its synchronisation point. Whilst the transfer time of the message is 
proportional to the length of the message the delays due to blocking are non-deterministic 
and difficult to simulate. Accounting for the communication delays would make the simula-
tion significantly more complicated. The results for the standard solution are also obtained 
by simulation and neither set of results includes the effects of communications. Given that 
the volume of communications involved in each method is roughly similar this allows a 
relative comparison of the performance of the two methods. The absolute performance of 
either method will be worse than that given in Table 5.1 due to the delays resulting from 
communication and scheduling. In a multiprocessor environment the communication delays 
will be affected by the target processor topology and communication routing protocols used. 
Omitting communication from the simulation means that it cannot be used to assess the 
suitability of different target architectures for the RP method. 
It was stated earlier that if the RP method of solution is efficient the sum of its computation 
times should be approximately equal to the computation times of the best sequential 
method. Table 5.2 compares the total computation time of the RP solution with that of 
the best sequential solution for each of the test systems. In most cases the results are 
similar, indicating the efficiency of the RP method. However there are a few differences 
which should be noted. In certain cases (e.g. the 118 node system using 8 processors) the 
total computation time of the Recursively Parallel method is greater than that of the best 
sequential method. Where there are differences they are not that large, ranging from 15% 
for the 118 node, 8 processor case down to 6% for the 734 node, 14 processor case. For 
the 1624 node system it is noted that the total computation time of the RP method is 
always less than the total computation time of the best sequential method, by up to 3%. 
All these discrepancies are due to the fact that the data structures used in the two solution 
methods are different. The best sequential method uses a linked list data structure which 
optimises storage and processing for a sequential solution. The RP method uses data 
structures that mimic the data structures which would be used in a parallel implementation. 
This introduces extra overhead into the solution and its effect is more noticeable in the 
solution of smaller systems. The RP method uses a hybrid storage scheme so that some of 
the elements in the dense lower right corner of the matrix are stored in arrays. When the 
lower right corner becomes densely populated, array storage is more efficient and leads to 
faster processing. The effect is more noticeable for larger systems and this explains why 
the total times for the 1624 node RP solution are less than the best sequential times. The 
overhead introduced into the simulated RP solution by the more complex data structures 
still exists but its effect is counteracted by a decrease in execution time due to array 
storage. This effect probably occurs in some of the smaller systems but here it is likely that the 
reduction in execution time produced by hybrid storage is not sufficient to counteract the 
adverse effects of the overhead introduced by the more complex data structures. Hybrid 
storage is not implemented in the best sequential solution but the fact that it reduces the 
execution time of the more complex RP method to below that of the sequential method 
provides further evidence to support the claim that hybrid storage may play a significant 
part in improving existing sequential methods. 
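The size of these discrepancies can be checked directly from Table 5.2. The short Python sketch below recomputes the relative difference for three of the tabulated cases; the function name `overhead_percent` and the choice of rows are illustrative only, and the timing values are copied from the table.

```python
# (system size, processors): (RP total time, best sequential time), from Table 5.2.
table_5_2 = {
    (118, 8): (15.32, 13.24),
    (734, 14): (100.45, 94.89),
    (1624, 8): (268.08, 275.40),
}


def overhead_percent(t_rp, t_seq):
    """Relative difference of the RP total time from the sequential time, in percent."""
    return 100.0 * (t_rp - t_seq) / t_seq


for (system, cpus), (t_rp, t_seq) in table_5_2.items():
    # A positive value means the RP solution performs more work than the sequential one.
    print(f"{system} nodes, {cpus} processors: {overhead_percent(t_rp, t_seq):+.1f}%")
```

A negative result, as for the 1624 node case, corresponds to the situation where the gain from array storage outweighs the overhead of the more complex data structures.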
One of the main causes of inefficiency in a parallel program is the overhead introduced by 
interprocessor communication. If the communication overheads introduced into the parallel 
implementation can be kept to a minimum the performance achieved will be similar to 
that predicted by Table 5.1. Low communication overheads will lead to near-unity serial 
efficiency. As Tylavsky et al. [50] observe, the serial efficiency of a parallel algorithm relates 
to its scalability. An algorithm which has a unity serial efficiency will display increasing 
speed-up as the number of processors used to execute the algorithm increases. An efficiency 
                 Recursively Parallel        Best Sequential
System  CPUs     Total Computation Time      Computation Time
118       2       13.63                       13.24
118       4       14.47                       13.24
118       8       15.32                       13.24
629       2       80.07                       76.49
629       4       82.43                       76.49
629       8       81.91                       76.49
629      16       82.00                       76.49
734       2       95.87                       94.89
734       4       99.55                       94.89
734       8       99.82                       94.89
734      14      100.45                       94.89
1624      2      273.04                      275.40
1624      4      274.32                      275.40
1624      8      268.08                      275.40
1624     13      269.19                      275.40
Table 5.2: Simulated RP solution vs best sequential solution 
greater than unity indicates that the speed-up will saturate as the number of processors 
increases and may eventually fall off. The greater the serial efficiency, the quicker the 
speed-up will saturate. Tylavsky et al. note that an efficiency of less than about 4 or 5 is needed 
in order to achieve a speed-up greater than unity. 
5.4 Summary 
The previous chapter discussed existing parallel methods for solving systems of linear equa-
tions. This chapter has considered how those methods can be modified to create a parallel 
solution which has a better performance. The increase in performance is achieved through 
the use of a particular coefficient matrix structure, the Recursive Bordered Block Diagonal 
Form. This results from constraining the subnetwork interconnections so that the network 
graph has a tree-like structure. Within the RBBDF matrix there are regions of indepen-
dence and all subnetworks contained in these regions may be processed in parallel. Several 
independent regions exist in the matrix and solution involves sequential execution of several 
parallel phases. 
A simulation of the method has been implemented and the results for the solution of four 
test systems have been presented. The simulation results support all the claims made for 
the RP method and show that it offers better speed-up performance than standard methods 
in all phases of the solution. As the total amount of computation involved is the same as in 
the best sequential method it is possible to create a highly efficient parallel implementation 
if the communication overheads can be kept to a minimum. 
The Recursively Parallel method is not a revolutionary new algorithm for the solution 
of linear equations. It is simply a restructuring of the problem to allow existing methods 
to exploit more of the potential parallelism in the problem. This restructuring is achieved 
by constraining the topology of the network interconnections to a particular form. Despite 
the constraints, this form of interconnection is very flexible and should allow the solution 
of any real system. 
Chapter 6 
Issues of Parallel Implementation 
6.1 Introduction 
In the previous chapter the general algorithm of the Recursively Parallel solution method 
was discussed without reference to how it might be implemented on a MIMD computer. 
Chapter 4 considered the partitioning of the problem and argued the case for obtaining 
an equal division of work through a careful analysis of the system to be solved. This 
chapter considers some aspects of implementing the Recursively Parallel solution on a MIMD 
machine, in particular on an array of INMOS Transputers. These issues relate to the 
algorithm (e.g. number of tasks, data structures, methods of communication etc.) and to 
the architecture of the solution, both the physical architecture of the target machine and 
the software architecture of the interconnected tasks and their placement on the available 
processors. 
The INMOS Transputer was designed to provide an implementation of Hoare's Com-
municating Sequential Processes (CSP) paradigm of parallel computation [92, 93]. CSP 
considers a parallel program to be made of a collection of independently executing tasks, 
where each task is an autonomous unit of sequential code which executes in the same 
manner as a normal sequential program. Tasks synchronise their actions and share data through 
explicit interprocess communications along virtual, unidirectional communication channels. 
Many parallel languages (e.g. all the INMOS parallel languages, Ada etc.) are based on 
the CSP paradigm and the comments made in this chapter concerning an INMOS C 
implementation apply equally well to implementations in other CSP-based languages. Certain 
specific issues of a Transputer implementation are considered but it should be noted that 
the Transputer has only been used as a cheap testbed on which the methodologies may be 
proven. Having verified the methods on a Transputer system it is easy to generalize them 
for any MIMD/CSP implementation and also implementations on distributed computer 
networks (e.g. workstation clusters [85]). 
6.2 Algorithmic Issues 
6.2.1 Program Structure and Task Design 
The design of a parallel program requires careful consideration of the algorithm it embodies. 
The algorithm must be decomposed into individual tasks in a manner which maximizes the 
speed-ups that can be obtained. Particularly important is matching the number of tasks 
to the available processing hardware. If there are more tasks than there are processors the 
performance can be degraded due to overheads introduced by multitasking. Hence it is 
important to choose the right grain of parallelism for the implementation. 
The Recursively Parallel program implemented in this research project is based upon 
Zollenkopf's Bifactorisation method of triangular decomposition. Within this method three 
distinct operations can be identified - factorisation of the coefficient matrix, multiplication 
of left hand factors and multiplication of right hand factors. For the RP solution of a given 
system, p distinct subnetworks will be created within the network. The three basic 
operations must be applied to each subnetwork and it is perhaps easiest to consider the parallel 
program to be made of 3p tasks; p factorisation tasks, p left factor multiplication tasks and 
p right factor multiplication tasks. One each of the factorisation, left and right 
multiplication tasks would be associated with each subnetwork. This approach can be seen to be 
inefficient when the bifactorisation process is considered as a whole. First the factorisation 
operation is performed followed by left factor multiplication and then right factor 
multiplication. Left factor multiplication cannot commence until factorisation completes and right 
factor multiplication cannot commence until left factor multiplication completes. Applying 
this to an individual subnetwork we have the same sequence of operations. As neither of 
the multiplications can occur at the same time as factorisation both multiplication tasks 
would be idle whilst the subnetwork is being factorised. A similar argument shows that at 
any point in time only one of the three tasks associated with a given subnetwork is active, 
the other two being idle (Figure 6.1(a)). A single task can be created which performs the 
three operations one after another on a given subnetwork. Now only p tasks are required to 
Figure 6.1: Task structures with different granularity a) fine grain, with separate Factorise, 
Left Multiply and Right Multiply tasks b) coarse grain, with a single task that performs 
Factorise, Left Multiply and Right Multiply in turn 
process the p subnetworks and each of these tasks is busy at all points in time. Referring 
to the inherent sequentiality of the bifactorisation process reveals that there is little point 
using 3p tasks when p tasks will suffice. From the programmer's point of view it is much 
simpler to manage only p tasks. Figure 6.1(b) shows the structure of a task in the p task 
solution. 
Having determined the number and structure of the tasks it is necessary to connect up 
the tasks so that they can cooperate and share data. For the remainder of this discussion it 
is assumed that task $T_i$, where $i = 1 \ldots p$, is responsible for processing the ith subnetwork. 
Consider Figure 6.2 which shows the structure of the coefficient matrix for a system with 
4 main subnetworks and 3 minor ones. The analogy of blocks in the matrix diagram being 
elements in a 7 × 7 matrix with the same structure is used. In factorising the first row 
of the matrix updates will be made to elements (5,5), (5,7) and (7,7) as defined by the 
equations of Section 2.4.4. In terms of the subnetworks in the RP solution, processing of 
the first subnetwork leads to values in the fifth and seventh subnetworks being modified. As 
the three subnetworks involved are processed by separate tasks these tasks must cooperate 
and share the information between themselves. In other words $T_1$ must send information 
on modifications to $T_5$ and $T_7$. An analysis of the communications required for the other 
subnetworks in all three stages of the Recursively Parallel solution yields the dataflow 
diagram of Figure 6.2. Circles in the diagram represent tasks and the arcs linking them 
show the communications between the various tasks. The flow of data is shown by the 
Figure 6.2: Intertask communications in a 7 subnetwork solution a) and corresponding 
coefficient matrix structure b) 
arrowheads. 
If one task is used for each subnetwork it is obvious that the tasks are connected in a 
tree form. Figure 6.3 shows the task intercommunications that are required for solutions 
with 3 and 15 subnetworks. Observe that no connection is ever required between tasks in 
the same level of the tree. These observations will be used in Section 6.2.3 to realize a 
reduction in the number of intertask communications. 
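The tree of intertask connections can be generated mechanically. The following Python sketch is a hedged illustration rather than the program used in this work: it assumes a balanced binary tree whose first level holds m subnetworks (m a power of two) and numbers the tasks level by level from the leaves, which reproduces the numbering of Figure 6.2. The function names `build_task_tree` and `ancestors` are hypothetical.

```python
def build_task_tree(m):
    """Return {task: parent} for m first-level subnetworks (m a power of two).

    Tasks are numbered level by level from the leaves, so the root is task 2m - 1.
    """
    if m < 1 or m & (m - 1):
        raise ValueError("m must be a power of two")
    parent = {}
    level = list(range(1, m + 1))
    next_id = m + 1
    while len(level) > 1:
        new_level = []
        # Pair up neighbouring tasks; each pair feeds one task in the next level.
        for k in range(0, len(level), 2):
            parent[level[k]] = next_id
            parent[level[k + 1]] = next_id
            new_level.append(next_id)
            next_id += 1
        level = new_level
    parent[level[0]] = None  # the root has no parent
    return parent


def ancestors(task, parent):
    """Tasks lower down the tree (towards the root) that receive data from `task`."""
    out = []
    node = parent[task]
    while node is not None:
        out.append(node)
        node = parent[node]
    return out


tree7 = build_task_tree(4)
print(ancestors(1, tree7))   # the tasks that task 1 must send modifications to
```

Because every destination returned by `ancestors` lies in a different level from the sender, the sketch also reflects the observation that tasks in the same level never need a direct connection.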
The basic task structure of Figure 6.1(b) can be refined to take account of the 
communications that occur between tasks in the program. Figure 6.2 shows how data moves in two 
directions through the tree. The factorisation and left factor multiplication operations both 
produce data that flows down the tree from the leaves towards the root. The right factor 
multiplication operation generates data which flows in the opposite direction back towards 
the leaves. After factorisation a task $T_i$ must send any data pertaining to the modification 
of other subnetworks to the tasks responsible for processing those subnetworks. These tasks 
all lie below (i.e. toward the root) task $T_i$ in the tree. Modifications to other subnetworks 
resulting from left multiplication with subnetwork i again require task $T_i$ to send data to 
tasks lower down the tree. Modifications resulting from right multiplication require task 
$T_i$ to send data to tasks higher up the tree (i.e. toward the leaves). When a task is the 
recipient of any of these communications it must add the modifications to its data before it 
performs the relevant operations on its subnetwork. The refined task structure is shown in 
Figure 6.3: Intertask communications in 3 and 15 subnetwork solutions 
Figure 6.4. Note that stages 2 and 12 do not occur for tasks at the leaves of the tree and 
that stages 5, 9 and 10 do not occur for the task at the root of the tree. 
One important issue not yet addressed is where the tasks originally obtain their 
data from. The collection of generic tasks which perform the computation of the solution 
can be viewed as a set of worker tasks in a supervisor / worker scenario (Figure 4.4). 
A supervisor task is needed to take control over the generic workers. This supervisor 
is responsible for passing the appropriate subnetwork data to each worker task. Stage 
1 of the generic task is responsible for accepting the initial subnetwork data from the 
supervisor task. The supervisor is also responsible for gathering the results from each 
worker once computation is complete (stage 13) and assembling them into a single resultant 
vector. For the purposes of monitoring the performance of the Recursively Parallel method, 
the supervisor also synchronizes the start of computation within the set of workers. This 
is necessary to allow an accurate time to be measured for the computation of the solution. 
Appendix G gives further consideration to the role of the supervisor in monitoring the 
performance of the computations. 
1: Receive initial subnetwork data from supervisor task 
2: Receive modification data from all connected tasks higher up the tree 
3: Modify subnetwork data 
4: Factorise the subnetwork 
5: Send modifications to all connected tasks lower down the tree 
6: Receive modifications from all connected tasks higher up the tree 
7: Modify right hand side vector data 
8: Left multiply with the subnetwork 
9: Send modifications to all connected tasks lower down the tree 
10: Receive modifications from all connected tasks lower down the tree 
11: Right multiply with the subnetwork 
12: Send modifications to all connected tasks higher up the tree 
13: Send results to supervisor task 
Figure 6.4: The generic task of the Recursively Parallel solution 
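The message pattern of the factorisation stages of the generic task can be imitated with ordinary threads and queues. The Python sketch below is a loose illustration under several assumptions: it covers only stages 2 to 5 and stage 13, the seven-task tree and the names `PARENT`, `worker` and `supervisor` are hypothetical, the "factorisation" is a placeholder sum, the initial data of stage 1 is replaced by the constant 1, and a buffered `queue.Queue` stands in for a CSP channel (which would normally be synchronous).

```python
import queue
import threading

# Hypothetical 7-task tree numbered as in Figure 6.2: tasks 1-4 are leaves, 7 is the root.
PARENT = {1: 5, 2: 5, 3: 6, 4: 6, 5: 7, 6: 7, 7: None}


def ancestors(task):
    """Connected tasks lower down the tree (towards the root)."""
    out, node = [], PARENT[task]
    while node is not None:
        out.append(node)
        node = PARENT[node]
    return out


def descendants(task):
    """Connected tasks higher up the tree (towards the leaves)."""
    return [t for t in PARENT if task in ancestors(t)]


def worker(task, inbox, results):
    # Stage 2: receive one message from every connected task higher up the tree.
    updates = [inbox[task].get() for _ in descendants(task)]
    # Stages 3-4: apply the modifications, then "factorise" (a placeholder sum here).
    value = 1 + sum(updates)
    # Stage 5: send a modification to every connected task lower down the tree.
    for target in ancestors(task):
        inbox[target].put(1)
    # Stage 13: return the result to the supervisor.
    results.put((task, value))


def supervisor():
    inbox = {t: queue.Queue() for t in PARENT}
    results = queue.Queue()
    threads = [threading.Thread(target=worker, args=(t, inbox, results)) for t in PARENT]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return dict(results.get() for _ in PARENT)


print(supervisor())
```

In this sketch a leaf simply receives from an empty set of senders, so the receive stages that do not apply to it become no-ops. Each task ends up with one plus the number of tasks above it in the tree, which confirms that every modification reached its destination.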
6.2.2 Data Storage and Data Structures 
The worker task must store, modify and process the data associated with the subnetwork 
for which it is responsible and the choice of the correct data structures is crucial to the 
performance of the parallel program. For the solution of linear equations the data is the 
coefficient matrix and the known right hand side vector of the set of equations. Matrices 
are usually stored by computer as a two dimensional array of an appropriate type but 
sparse matrix techniques are a more efficient method of storing the large coefficient matrices 
associated with network problems. These techniques store only the non-zero matrix elements 
and these are held in linear linked lists. The known right hand side vector is stored as a 
one dimensional array as it is usually densely populated, particularly as computation draws 
to a close. In a parallel solution there is no longer a single coefficient matrix structure 
corresponding to a single network. Instead there are a number of subnetworks, which 
correspond to a number of submatrices of the coefficient matrix. Consideration has to be 
given as to what are the best methods of storing these submatrices and the portion of the 
right hand side vector associated with each subnetwork. 
If a system is divided into m subnetworks to be solved using the Recursively Parallel 
approach then the coefficient matrix of the system will have Recursive Bordered Block 
Diagonal Form with p = 2m - 1 'r' shaped segments stacked along the diagonal, as in Figure 
6.5. Each task requires the data held in a single 'r' shaped segment and is responsible for 
storing this segment. For example, the data associated with subnetwork 1 is contained in 
the shaded 'r' shaped segment of Figure 6.5 and task $T_1$ must store this data in an appropriate 
Figure 6.5: RBBDF matrix structure showing 'r' segments 
data structure. Suppose that the ith 'r' segment begins at row a, extends to row b and that 
there are n rows in the matrix. If the traditional two dimensional array is used to store the 
'r' segment for this task then a square array of the same height and width as the 'r' segment 
is required and this is extremely wasteful of memory, especially when the subnetworks of the 
first level of the tree are considered. Within the 'r' segment of these subnetworks there are 
only four possible regions where non-zero elements can be located (e.g. blocks (1,5), (1,7), (5,1) 
and (7,1)). A more efficient storage scheme stores only these non-zeros and this is implemented 
through the use of a sparse matrix linked list representation. Under this scheme (n - a) lists 
will be required for the given subnetwork. Most of these lists will contain few, if any, entries 
and substantial savings on memory can be realized. Further savings result from examining 
the nature of the coefficient matrices for power system problems. As Chapter 2 has shown, 
the coefficient matrix is often symmetric and if this is the case it is not necessary to replicate 
data by storing both arms of the 'r' segment. A single arm contains all the information held 
within the 'r' segment and only (b - a) linked lists are now required. Figure 6.6 shows the 
portion of the coefficient matrix stored in the linked list data structures by task $T_1$. This 
storage scheme proves to be the most efficient for storing the major subnetwork. Inspecting 
the off-diagonal blocks corresponding to the minor subnetworks reveals a different picture. 
These blocks contain many more non-zeros than zeros and in moving from the top left corner 
of the matrix to the bottom right corner the density of the off-diagonal blocks increases. 
Most of the off-diagonal blocks corresponding to the minor (cutset) subnetworks are in fact 
Figure 6.6: Portion of the coefficient matrix stored by a single task 
full of non-zero entries. Brameller et al. [2] observe that there is a threshold beyond which 
it is inefficient to use sparse matrix techniques and in the minor subnetwork off-diagonal 
blocks this threshold is easily exceeded. The empty spaces between the blocks in these 
regions of the matrix are small, accounting for only a small percentage of the total length 
of the row to the right of the diagonal. It is more efficient to process the minor subnetworks 
as two dimensional arrays rather than as sparse matrix linked lists. Unfortunately the 
point at which the density exceeds the threshold value does not usually coincide with the 
boundary between major and minor subnetworks. The changeover point differs from system 
to system and it is often found that it is more efficient to treat the first few minor subnetworks 
using sparse matrix techniques, reserving the use of array storage for the remaining minor 
subnetworks. Figure 6.7 shows the parts of the coefficient matrix stored in the Recursively 
Parallel solution and the type of storage scheme used by each task/subnetwork. 
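The choice between the two storage schemes can be illustrated with a small Python sketch. It is not the data structure used in the program described here: a Python list of (column, value) pairs stands in for a linked list, the function names are hypothetical, and the threshold of 0.5 used in the example is an arbitrary value chosen for illustration.

```python
def to_sparse_row(dense_row):
    """Keep only the non-zero entries of a row as (column, value) pairs."""
    return [(j, v) for j, v in enumerate(dense_row) if v != 0.0]


def density(block):
    """Fraction of non-zero entries in a rectangular block (a list of rows)."""
    total = sum(len(row) for row in block)
    nonzeros = sum(1 for row in block for v in row if v != 0.0)
    return nonzeros / total if total else 0.0


def store_block(block, threshold):
    """Choose array storage for dense blocks and sparse rows otherwise."""
    if density(block) >= threshold:
        return ("array", [list(row) for row in block])
    return ("sparse", [to_sparse_row(row) for row in block])


# A sparsely populated block, typical of a major subnetwork (invented values).
major = [[4.0, 0.0, 0.0, -1.0],
         [0.0, 3.0, 0.0, 0.0]]
# A fully populated block, typical of a minor (cutset) subnetwork (invented values).
minor = [[2.0, 1.0],
         [1.0, 2.0]]

print(store_block(major, threshold=0.5))
print(store_block(minor, threshold=0.5))
```

The sparse form stores three entries instead of eight for the first block, whereas the second block gains nothing from sparse storage and is kept as an array.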
When the optimal ordering routine is applied prior to solution, a simulation of the 
factorisation process is performed to determine the effect of fill-ins. This simulation can 
be adapted so that it assesses the density of the coefficient matrix after the simulated 
factorisation is complete. Starting with the element at the bottom right corner of the 
matrix, the algorithm works back up the diagonal and as it goes it examines the square 
region below and to the right of the diagonal (Figure 6.8). The number of non-zeros in this 
region is counted and the total number of elements enclosed by the region is calculated. If 
the density of the region exceeds the sparsity threshold the search continues moving back 
Figure 6.7: Storage techniques used by the Recursively Parallel method (regions are marked 
as using either sparse linked list storage or 2 dimensional array storage) 
Figure 6.8: Assessing the density of the coefficient matrix 
                      Speed-up with       Speed-up with
System  Subnetworks   original storage    hybrid storage
118        3          1.42                1.42
118        7          1.93                2.08
118       15          2.12                2.47
629        3          2.18                2.23
629        7          2.52                2.70
629       15          3.71                4.04
734        3          2.46                2.46
734        7          3.59                4.10
734       15          4.56                5.83
734       31          4.71                5.98
1624       3          2.64                2.64
1624       7          4.00                4.15
1624      15          6.54                7.55
1624      31          7.26                9.38
Table 6.1: The effect of storage scheme on speed-up 
up the diagonal until the density of the region falls just below the threshold. The location
of this point allows the data structures of the Recursively Parallel program to be tailored
to give the most efficient storage and processing of each individual system. Everything
above and to the left of the changeover point is most efficiently processed using sparse
matrix techniques whilst all rows below and to the right are most efficiently dealt with
when stored in two dimensional arrays. Performing this sort of density test on a system
prior to processing determines which subnetworks should store their submatrix in linked
lists and which should store it in arrays. Future work may allow the changeover point to
occur within a subnetwork but for the present the changeover point is constrained to lie
on a subnetwork boundary. The program implemented for this project allows any of the
minor subnetworks, except the last, to use either sparse matrix or array storage as the need
requires. The last subnetwork is always processed using arrays and the major subnetworks
employ sparse matrix storage techniques.
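The density scan described above can be sketched in a few lines of Python. This is only an illustration: the function names, the boolean list-of-lists representation of the filled matrix, the example threshold of 0.8 and the rule used to snap the result to a subnetwork boundary are all assumptions made for the example, not details of the Recursively Parallel program.

```python
def changeover_index(pattern, threshold):
    """Return the first row/column of the trailing region judged dense.

    pattern   -- square list of lists of booleans (True = non-zero after fill-in)
    threshold -- density at or above which array storage is assumed worthwhile
    """
    n = len(pattern)
    nonzeros = 0
    changeover = n  # n means "no dense region found"
    for k in range(n - 1, -1, -1):
        # Growing the square from (k+1..n-1) to (k..n-1) adds row k and column k.
        nonzeros += sum(1 for j in range(k, n) if pattern[k][j])
        nonzeros += sum(1 for i in range(k + 1, n) if pattern[i][k])
        size = n - k
        if nonzeros / (size * size) < threshold:
            break  # density has fallen below the threshold: stop the search
        changeover = k
    return changeover


def snap_to_boundary(changeover, subnetwork_starts):
    """Move the changeover to the next subnetwork boundary at or after it."""
    starts = sorted(subnetwork_starts)
    for start in starts:
        if start >= changeover:
            return start
    return starts[-1]  # the last subnetwork is always held in arrays


def example_pattern():
    """6x6 pattern: non-zero diagonal plus a full 3x3 block at the bottom right."""
    n = 6
    return [[i == j or (i >= 3 and j >= 3) for j in range(n)] for i in range(n)]
```

The running count is updated incrementally, so each step up the diagonal only inspects one new row and one new column rather than recounting the whole region.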
The benefits of this hybrid storage scheme are evident in the speed-up results. Table 6.1
shows the factorisation speed-ups for a number of systems processed both with and without
hybrid storage. Figure 6.9 shows the variation in speed-up for the 1624 node US power
system as the changeover between sparse and array storage is moved between the minor
subnetworks. Changing the number of subnetworks which are stored using array storage has
a dramatic effect on the speed-up. Notice that it is possible to vary the factorisation
speed-up between 7.3 and 9.5 simply by altering the point at which dense storage is introduced.
Figure 6.9: Variation of speed-up with location of changeover point in hybrid storage, for
US 1624 node system
If little use is made of array storage (changeover point to the right of the graph) then the
factorisation speed-up is poor. Increasing the amount of array storage (changeover point
moving towards the left of the graph) has the effect of increasing the speed-up, although
there is a certain point beyond which it is not efficient to use array storage as the
factorisation speed-up begins to decrease again. Examining a number of graphs like this has revealed
that there is often a clearly defined optimum location for the changeover point which results
in the maximum factorisation speed-up. For the system shown in Figure 6.9 the maximum
factorisation speed-up is obtained when subnetworks 25 to 31 are stored using array storage.
The effect of hybrid storage on the substitution speed-up is somewhat different. It can be
seen from the graph that the substitution speed-up can be varied between 3.3 and 4.3
simply by altering the point at which array storage is first introduced. The graph shows that
increasing the amount of array storage actually reduces the substitution speed-up. Again,
an examination of a number of graphs like this has revealed that maximum substitution
speed-up is obtained when only the last subnetwork is stored using array methods. If any
other subnetworks are stored using arrays the speed-up rapidly drops off. The use of hybrid
storage has different effects on the performance of factorisation and substitution. Using a
significant amount of array storage improves the performance of factorisation as the
factorisation algorithm performs many update operations to the lower right of the matrix. If this
region is stored using linked lists then each update requires a search through the linked
list for the relevant row to find the desired element for updating. If this region is stored
using arrays then it is possible to jump directly to the desired element, thus removing the
need for time-consuming searches. The extra overhead of examining zero elements during
the factorisation of this region is countered by the significant reduction in processing time
produced by the elimination of a large number of linked list searches. The situation is
somewhat different with substitution where increasing the use of array storage has a detrimental
effect on performance. There are two effects which contribute to this phenomenon. Firstly,
substitution simply involves the multiplication of matrix rows or columns by a vector. The
algorithm for performing this operation simply scans through the stored data and performs
element by element arithmetic operations. No lengthy searches are involved when linked
lists are used so the use of array storage would not eliminate any list searches. Secondly,
when array storage is used the multiplication algorithm must perform element by element
arithmetic on all elements of the array, even if they are zero. It is possible to reduce the
amount of work by first examining the current element to check its value. Only if the
element is non-zero is any arithmetic operation involving that element performed. However,
the examination of each element to determine its value still adds a computational overhead.
Linked list storage is more efficient as no zero elements are stored and no examination of
element values needs to be performed. All elements in the list must automatically take part
in multiplication and the algorithm simply performs element by element arithmetic using
all the values in the list. The computational overhead associated with the multiplication
using linked lists is less than that using arrays, making it more efficient for substitution to
store as much of the coefficient matrix as possible using linked lists. The last subnetwork in the
matrix is always stored using arrays as this subnetwork is always fully populated. For this
subnetwork it makes little difference to the amount of computation involved whether linked
list or array storage is used.
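The contrast between the two access patterns can be illustrated with a small Python sketch. Here a "linked list" row is modelled as a column-ordered Python list of [column, value] pairs and an array row as a plain list of floats; both representations, and all function names, are assumptions chosen for illustration rather than the data structures of the thesis program.

```python
def sparse_update(row, col, delta):
    """Add delta to the entry in column col; return how many entries were searched."""
    for steps, entry in enumerate(row, start=1):
        if entry[0] == col:
            entry[1] += delta
            return steps
    raise KeyError(col)  # element not stored (a fill-in would be needed)


def dense_update(row, col, delta):
    """Array storage: jump straight to the element, no search."""
    row[col] += delta
    return 1


def sparse_dot(row, x):
    """Substitution-style product: touches stored non-zeros only, no value tests."""
    return sum(value * x[col] for col, value in row)


def dense_dot(row, x):
    """Array product: every element is examined, zeros are tested and skipped."""
    total = 0.0
    for col, value in enumerate(row):
        if value != 0.0:
            total += value * x[col]
    return total
```

An update of the last stored element in a sparse row costs a scan of the whole row, whereas the array update costs one access; the two dot products return the same value but the array version inspects every position.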
The effect illustrated by Figure 6.9 has profound implications for the parallel solution.
It is possible to maximize speed-up by tailoring the storage scheme to the characteristics
of either the factorisation or substitution phase, but not both. If the storage strategy is
chosen to maximize substitution speed-up then factorisation speed-up is still reasonable.
However if the situation is reversed and the storage strategy is chosen to maximize
factorisation speed-up then the substitution speed-up is poor. The ideal requirement is for a
storage scheme which maximizes both speed-ups, but as Figure 6.9 shows, this is clearly
not possible. In power system simulation the factorisation operation is only required when
the topology of the network changes. This occurs infrequently and it is perhaps possible
to use the optimum storage for factorisation when factorising and then transform the data
storage so that data is then stored in the optimum manner for substitution. This would
maximize the performance of both the factorisation and substitution operations but would
introduce a large overhead when the network topology changes. An alternative is to use
two data storage mechanisms. The solution implemented on the Transputer saves on
memory requirements by overwriting the original coefficient matrix with the factored matrix as
factorisation progresses. If memory size is not a limitation then it would be possible to
store the coefficient matrix in the optimum form for factorisation and to store the factored
matrix separately in the optimum form for substitution. No transformation overheads would be
introduced when the topology changes and both factorisation and substitution would be
able to achieve maximum performance. Some small overhead would be introduced into the
factorisation computations.
It is interesting to note that the hybrid storage scheme can also be used to improve the
performance of sequential solution techniques. With a sequential solution the coefficient
matrix does not need to be arranged into BBDF or RBBDF. Optimal ordering is applied
to minimize the amount of fill-ins that occur. If the filled matrix is examined the increase
in density in moving from the top left to the bottom right of the matrix is again observed.
Usually all the matrix rows are stored in sparse linked lists but in the lower right corner of
the matrix many of the rows may contain only non-zero elements. Once again the threshold
of efficiency for sparse matrix storage is exceeded and it would be more efficient to store
these rows in conventional arrays. Search operations would be performed faster with these
rows stored as arrays and this would decrease the execution time of the sequential solution.
This theory has not been tested in practice and the implementation of hybrid storage in a
sequential solution has been left as an item of further work. However the argument presented
for the sequential solution is identical to that for the parallel case. Having observed the
beneficial effect of hybrid storage on the performance of the parallel solution it is reasonable
to assume that a similar effect would be observed if hybrid storage were to be implemented
for sequential solutions.
Given that it is not necessary for each task to store the entire coefficient matrix, perhaps
similar savings can be achieved in storing the right hand side vector by only storing the
relevant portion of the vector. Unfortunately this does not turn out to be the case, as can
be seen by considering the multiplication stages of the bifactorisation algorithm. Taking
left multiplication as the example, consider the multiplication of the right hand side vector
by the first left factor, $L^{(1)}$. Assume $L^{(1)}$ and the right hand side vector have the form
shown below.
[Equation (6.1): non-zero patterns of $L^{(1)}$ and of the right hand side vector b]
Performing this multiplication results in a new vector, b', which differs from b in elements
1, 5 and 7. Assuming that the resultant vector is generated by successively overwriting the
original right hand side vector, b, left multiplication by the i-th subnetwork could possibly
alter elements a to n of the right hand side vector, where a is the first node in the i-th
subnetwork. Thus it is necessary for the i-th task to store elements a to n of the right
hand side vector, otherwise the result of left factor multiplication within that task will be
incorrect. A similar analysis of right factor multiplication shows that the same range of
RHS vector elements must be stored to obtain correct results from right multiplication. A
simple examination of the RBBDF matrix structure shows that for a given subnetwork
only certain elements of the RHS vector are modified and others are never referenced. For
example, multiplication by subnetwork 1 only ever references elements in blocks 1, 5 and
7 of the RHS vector. An efficient scheme for storing only the RHS vector blocks required
by each task could be devised but the saving on memory does not really warrant the extra
overhead and complexity involved in managing such a structure. The largest test system
used in this research project required less than 12 kB of memory to store the entire right
hand side vector. Given the amount of memory available in modern computing systems
it is simplest for each task to store all the nodes a through to n of the RHS vector even
though some of them are never referenced.
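The observation that left multiplication only alters certain right hand side elements can be illustrated with a short Python sketch. It assumes, as in Zollenkopf-style bifactorisation, that a left factor differs from the identity matrix in a single column k, so that multiplying by it scales element k and adds multiples of the old element k to the rows where that column is non-zero. The function name and the dictionary representation of the column are hypothetical.

```python
def apply_left_factor(b, k, column):
    """Overwrite b with (left factor) * b and return the indices that were touched.

    b      -- right hand side vector (list of floats), modified in place
    k      -- index of the single non-identity column of the factor
    column -- dict mapping row index -> factor entry in column k (must include k)
    """
    pivot_value = b[k]  # taken before b[k] is overwritten
    touched = []
    for i, l_ik in column.items():
        if i == k:
            b[i] = l_ik * pivot_value   # diagonal entry scales element k
        else:
            b[i] += l_ik * pivot_value  # off-diagonal entries update other rows
        touched.append(i)
    return sorted(touched)
```

With non-zeros in rows 1, 5 and 7 of the first column (indices 0, 4 and 6 when counting from zero), only those three elements of the vector change, which mirrors the example in the text.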
Figure 6.10: Modified intertask communications for a 7 subnetwork system 
6.2.3 Reducing the Communication Overhead
Although the basic program structure and intertask communications have been defined there
are several refinements that can be made which significantly improve the performance of the
parallel program. These refinements are based on aggregating the intertask communications.
The first improvement aggregates communication messages so as to reduce the total
number of messages passed between the tasks in the system. Consider the dataflow diagram
of Figure 6.2. Suppose that task T1 has performed its factorisation operation. This results
in data which has to be passed to T5 and T7. The transmission of data to T5 is critical to
the next stage of processing and this communication must be performed first. When this
finishes data can be sent explicitly to task T7 as this communication is not critical to the next
stage of processing. In fact T7 does not require the data from T1 until T5 has completed
its factorisation operation. The use of a suitable store-and-forward aggregation scheme
eliminates the need for the explicit communication between T1 and T7. If T1 aggregates the
data it sends to T5 and T7 into a single message sent to T5, T5 can use the data pertaining
to itself and simply ignore the data relating to T7. When T5 has factorised its submatrix
it adds the data it created for T7 to the stored data generated by T1 and sends the single
message to T7. The explicit communication between T1 and T7 is thus avoided. Applying
this technique across the whole task graph reduces the intertask communications shown in
Figure 6.2 to those of Figure 6.10. The worker tasks which make up the parallel program
are connected in a simple binary tree structure. The use of this aggregation technique
eliminates eight communications in a seven subnetwork solution and forty communications
in a fifteen subnetwork solution, resulting in a significant improvement in efficiency.
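The saving from aggregation can be checked with a small Python sketch that counts task-to-task links before and after. It assumes that, without aggregation, every task sends explicitly to each of its ancestors in the task tree, and that with aggregation each task sends only to its parent; the level-by-level task numbering and the function names are also assumptions for the example.

```python
def parent_map(n_tasks):
    """Parent of each task in a complete binary task tree numbered level by level.

    For 7 tasks: 1,2 -> 5; 3,4 -> 6; 5,6 -> 7.
    """
    parents = {}
    start, width = 1, (n_tasks + 1) // 2
    while width > 1:
        next_start = start + width
        for i in range(width):
            parents[start + i] = next_start + i // 2
        start, width = next_start, width // 2
    return parents


def count_links(n_tasks):
    """Return (links with explicit sends to every ancestor, links in the tree)."""
    parents = parent_map(n_tasks)
    direct = 0
    for task in range(1, n_tasks + 1):
        node = task
        while node in parents:  # walk up to the root, one ancestor at a time
            node = parents[node]
            direct += 1
    return direct, len(parents)
```

Under these assumptions `count_links(7)` gives 10 links reduced to 6 and `count_links(15)` gives 34 reduced to 14, that is, 4 and 20 links removed. The figures of eight and forty quoted in the text are exactly twice these values, which would be consistent with each link being used once in each direction, although the text does not state how the communications were counted.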
The second improvement is at a much lower level and reduces communication times by 
decreasing the length of the intertask messages using a data grouping technique. This
technique is described fully in Appendix F but it basically operates by minimizing the amount of
addressing information needed to uniquely identify an element within the coefficient matrix
by making use of implicit information about the matrix structure.
6.3 Architectural Issues 
In examining the architectural issues concerning parallel program design there are two
aspects which must be considered. The software architecture describes the tasks which make
up the parallel program and how they are interconnected to give the best performance. The
hardware architecture is concerned with the physical processors of the target machine. The
way these processors are connected and the way in which data is routed between them can
have an effect on the performance of the parallel program.
6.3.1 The Software Architecture
A parallel program is made up of a set of autonomous tasks which cooperate through
synchronizing their actions and sharing data using explicit intertask communications [92,
94]. These tasks must be assigned to the available processors. Given a parallel program
consisting of n tasks, these must be placed on the p processors available and two scenarios
can be envisaged:
• n > p: More than one task must be assigned to at least one processor, and support for a
multitasking environment is required to execute the tasks simultaneously
• n ≤ p: Each task in the program can be assigned to its own processor
Whilst the latter scenario provides the easiest implementation it may not be the best
solution. Assigning more than one task to each processor is often most economic as it requires
fewer processors. The efficiency of a parallel program, E, is given by
$$E = \frac{S(n)}{n} \qquad (6.2)$$
where S(n) is the speed-up obtained using n processors,
and (6.2) clearly shows that program efficiency is a direct trade-off between the number 
of processors used and the speed-up obtained. The more processors that are thrown at the 
problem to achieve a desired speed-up the less efficient that solution becomes. Programs 
which assign a number of tasks to each processor generally achieve a higher efficiency at 
the expense of lower speed-up. 
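As a worked instance of (6.2), take two of the balanced-load measurements reported in Table 6.2, using the processor counts that follow from assigning two tasks per processor as in (6.3): four processors for seven subnetworks and eight for fifteen. The pairing of these particular rows with (6.2) is an illustration added here, not a calculation shown in the original text.

```latex
% 118 node system, 7 subnetworks, 4 processors, S = 1.85
E = \frac{1.85}{4} \approx 0.46
% 734 node system, 15 subnetworks, 8 processors, S = 5.15
E = \frac{5.15}{8} \approx 0.64
```

Both values agree with the E(n) column of Table 6.2, and they show the trade-off directly: doubling the processor count raises the speed-up only if the extra processors can be kept busy.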
A simple examination of the structure of the Recursively Parallel program quickly
identifies the optimum number of processors required. Taking a seven subnetwork (four major
and three minor subnetworks) system as an example, the program structure is as shown in
Figure 5.9. Operations with the same subscript are all parts of the same task (e.g. task
T1 comprises the operations F1, L1 and R1). As the factorisation operation is the most
computationally intensive part of the solution only the factorisation operations of Figure
5.9 will be considered in the following argument. At level 1 there are four concurrently
executing factorisation operations and the fastest computation occurs when each operation
is assigned to its own processor. Hence level 1 requires four processors. Level 2 has only two
concurrently executing operations requiring only two processors whilst the final operation
in level 3 requires only a single processor. By virtue of the way in which the program
operates, all the first level operations (i.e. F1, F2, F3, F4) must have completed before the second
level operations (i.e. F5, F6) begin. The four processors which executed F1 ... F4 are thus
idle at the end of level 1 and two of them can be used to perform the operations of level
2. Similarly the two processors responsible for executing level 2 operations are idle at the
end of level 2 and one of them can process level 3. Consequently the maximum number of
processors required is the same as the number of concurrent operations in the first level and
a possible task to processor allocation is shown in Figure 6.11(a). This allocation strategy
results from the fact that operations F1 ... F7 are constituent parts of the tasks T1 ... T7.
Consider the allocation strategy of Figure 6.11(a) from the point of view of the factorisation
operations and recall from the previous discussion of load balancing that maximum
speed-up only occurs when all of the processors are busy all the time. This strategy results
in processors P3 and P4 being unused for more than half of the total computation time
and this is wasteful of processing resources. The efficiency of the task allocation can be
improved by moving task T7 from processor P1 onto one of the 'spare' processors P3 or P4.
The modified strategy is shown in Figure 6.11(b).
The allocation of tasks to processors is often considered using two acyclic directed graphs
[19, 83]. One graph is used to represent the processors of the target machine and their
interconnection. The second graph is the task graph and this describes the interconnection
between the tasks which make up the program. The task graph nodes are often annotated
with computation and communication requirements. Figure 6.12 gives an example of these
Figure 6.11: A many to one task allocation strategy a) initial task allocation b) improved 
task allocation 
two graphs. The task allocation process then becomes one of mapping the task graph to the
target machine graph in the way that minimizes computation and communication
requirements. Berman [83] considers the task allocation process as a 'contraction' of the task graph
to fit the target machine graph. To describe the task allocation strategy for a general
Recursively Parallel solution it is useful to think of the contraction of the factorisation operations
of Figure 5.9. The first step of the process is to squash levels 2 and below into a single level
by superimposing the levels onto one another such that they become interleaved. Vertically
adjacent tasks of level 1 and the new level 2 are then assigned to the same processor. The
contraction for a 15 task system is shown in Figure 6.13. Note that there are never more
than two tasks assigned to each processor but there is always one processor which hosts a
single task. As there is one task for each subnetwork in the system it is easy to verify that
$$p = \frac{n + 1}{2} \qquad (6.3)$$
processors are required for the RP solution of a system with n subnetworks.
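One way to realise the contraction described above is sketched below in Python. It reads "interleaved" as an in-order traversal of the tasks above level 1, and then pairs the k-th level 1 task with the k-th task of that interleaved row. This reading, the level-by-level task numbering and the function name are assumptions; the thesis program may order the pairs differently.

```python
def contraction(n_tasks):
    """Return a list of task groups, one group per processor.

    Tasks are numbered level by level (level 1 first), with n_tasks = 2**k - 1.
    """
    leaves = (n_tasks + 1) // 2
    children = {}
    start, width = 1, leaves
    while width > 1:  # build the binary task tree level by level
        nxt = start + width
        for i in range(0, width, 2):
            children[nxt + i // 2] = (start + i, start + i + 1)
        start, width = nxt, width // 2

    def in_order(task):
        """Interleave the upper levels: left subtree, task, right subtree."""
        if task <= leaves:
            return []  # level 1 tasks are not part of the squashed level
        left, right = children[task]
        return in_order(left) + [task] + in_order(right)

    upper = in_order(n_tasks)
    allocation = []
    for p in range(leaves):
        group = [p + 1]            # one level 1 task per processor
        if p < len(upper):
            group.append(upper[p])  # plus the vertically adjacent upper task
        allocation.append(group)
    return allocation
```

For 15 tasks this produces eight groups, seven holding two tasks and one holding a single task, in agreement with (6.3).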
This argument has so far ignored the effects of left and right factor multiplication on the
task allocation strategy. Returning to the allocation strategy of Figure 6.11(b) we see tasks
T1 and T5 sharing the same processor P1. This allocation was derived on the assumption
that F1 must complete before F5 starts and by making these two tasks share the same
processor each will effectively have the whole processor at its disposal. When F5 is being
executed F1 must have completed and F5 will be the only operation available for processing
on processor P1. Recalling that the task T1 consists of the operations {F1, L1, R1} executed
Figure 6.12: The acyclic graphs used in determining task allocations a) the hardware graph
showing interconnection of processors b) the task graph showing dataflow between tasks
Figure 6.13: Contraction mapping for Recursively Parallel task allocation (contracted task
graph mapped onto the target machine graph)
Figure 6.14: Task execution, showing the effect of multitasking (a) Naive view of operation
execution (b) true execution of individual operations
in sequence and referring to level 2 in Figure 5.9 reveals that this assumption is invalid.
When task T1 completes the operation F1 it immediately commences the operation L1, and
this requires processing at the same time as F5. As both T1 and T5 reside on the same
processor, P1 resorts to multitasking to timeslice execution between T1 (L1) and T5 (F5).
Now neither task has the whole processor at its disposal and the efficiency of the allocation
strategy is reduced. However the left multiplication operation Li is not as computationally
intensive as the factorisation Fk and as a result it only executes for a short time compared
to Fk. Once Li is completed task Ti suspends itself waiting for data to be passed to it
following the operation Ri. Now Fk has the whole processor at its disposal and it executes
Fk to completion without interruption, as shown in Figure 6.14(b). The effect of the left
multiplication operations is to increase the overall execution time of the parallel program
and reduce its efficiency. For the seven subnetwork system of Figure 5.9 execution time is
increased by
$$\max\left(|L_1 + L_5|,\ |L_2 + L_5|,\ |L_3 + L_6|,\ |L_4 + L_6|\right) \qquad (6.4)$$
A method exists for improving the efficiency of the allocation strategy and this relies
upon eliminating the contention between Li and Fk. The contention arises because these
two operations wish to execute concurrently. If it were possible to prevent Fk from executing
until Li completes then the contention could be avoided. Figure 5.9 shows the prerequisite
of F5's operation is that both F1 and F2 have completed. If F2 takes much longer to
                            Balanced Load                     Imbalanced Load
System                      Execution                         Execution
Size      Subnetworks       Time (ms)    S(n)    E(n)         Time (ms)    S(n)    E(n)
118       3                 27.24        1.28    0.64         27.24        1.28    0.64
118       7                 18.88        1.85    0.46         20.22        1.72    0.43
118       15                14.53        2.40    0.30         14.53        2.40    0.30
629       3                 112.81       2.00    1.00         110.06       2.05    1.03
629       7                 88.38        2.55    0.64         83.33        2.71    0.68
629       15                60.22        3.75    0.47         55.1         4.09    0.51
734       3                 129.97       2.16    1.08         127.60       2.20    1.10
734       7                 77.06        3.64    0.91         70.27        3.99    1.00
734       15                54.46        5.15    0.64         50.37        5.57    0.70
Table 6.2: Effect of load imbalance on performance
execute than F1, task T1 can perform operation L1 whilst T5 (F5) is suspended awaiting the
completion of F2. When F2 completes T5 commences its factorisation operation F5 and has
the whole processor at its disposal. For this method to work
$$t(F_2) \geq t(F_1) + t(L_1) \qquad (6.5)$$
where t(x) is the time taken to perform operation x. Obviously if t(F2) is much greater than
t(F1) + t(L1) processor P1 will waste time in idle wait states, thus reducing the efficiency.
Optimum performance occurs when
$$t(F_2) = t(F_1) + t(L_1) \qquad (6.6)$$
To make this method work it is necessary to be able to adjust the execution time of each
of the tasks and their constituent operations. The amount of time taken to complete one
of the three fundamental operations within a task is directly proportional to the amount of
computational work to be performed. Modifying the execution time of the operations can
be achieved by altering the load balancing strategy and assigning a greater proportion of
the workload to T2 than to T1, thus satisfying the inequality of (6.5). For the test systems
used here it has been found that significant performance benefits result from increasing the
workload of some tasks by up to 25% and similarly decreasing the loading of other tasks.
Table 6.2 shows the benefits of this approach over the equally loaded case. Whilst it may
appear that there is an imbalanced load it must be remembered that it is only the workload
of each task that is imbalanced. The loading on each processor will in fact be balanced.
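The effect of conditions (6.5) and (6.6) on processor P1 can be explored with a deliberately simplified timing model in Python. The model assumes ideal timeslicing with no switching overhead, that F5 becomes ready as soon as both F1 and F2 have finished, and that F5 runs for longer than whatever remains of L1; the function name and the example timings are hypothetical.

```python
def f5_finish_time(t_f1, t_l1, t_f2, t_f5):
    """Model processor P1 hosting T1 (F1 then L1) and T5 (F5).

    Returns (finish time of F5, delay caused by contention with L1,
             idle time on P1 while waiting for F2).
    """
    ready = max(t_f1, t_f2)      # F5 needs both F1 and F2 to have completed
    busy_until = t_f1 + t_l1     # when P1 would finish L1 if left undisturbed
    delay = max(0, busy_until - ready)  # L1 work still competing with F5
    idle = max(0, ready - busy_until)   # P1 waiting with nothing to run
    return ready + delay + t_f5, delay, idle
```

With equal loads (`t_f1 = t_f2 = 10`, `t_l1 = 2`, `t_f5 = 8`) the model reports a contention delay of 2; shifting work so that `t_f1 = 9` and `t_f2 = 11` satisfies (6.6) and removes both delay and idle time; shifting too far (`t_f1 = 8`, `t_f2 = 12`) leaves the processor idle for 2 time units instead.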
This technique of imbalancing the load is essentially a latency hiding method. Latency
hiding [95] is a way of using the available multitasking facilities to keep processors busy
during communication events. A program task will block and wait on the arrival of a
message it requires if the sender is not ready to send, thus placing the processor into an
idle state. If other program tasks are available for execution the processor may continue
computing whilst the first task is awaiting the arrival of its message. In this manner the
latency introduced into the first task by the necessity of waiting for a communication is
hidden by the processing of other tasks. In the technique described above, increasing the
execution time of F2 allows task T1 to complete F1 and L1. Once L1 is complete the task
blocks and waits until it can send a message to a subsequent task. This now allows task
T5, which resides on the same processor, to be executed whilst T1 is waiting to send its
message. This ensures that processor P1 is kept continuously busy. The hiding of some
of the latency in the program results in a decrease in execution time and an increase in
speed-up.
6.3.2 The Hardware Architecture 
Given a set of processing elements there are many different ways in which those processors
may be connected. Section 1.4.4 and Figure 1.6 describe some of the standard processor
interconnection topologies and it was observed that different topologies are suited to certain
types of problem. For general parallel problems the hypercube topology often gives the best
performance. Chapter 1 observed that the choice of the interconnection network can make or
break the performance of the parallel machine and the parallel algorithm executed on it. The
general literature of the field of parallel computing observes that parallel computers are often
used to solve a single parallel problem and in this case the hardware architecture becomes
an intrinsic part of the problem and its solution. It is essential to use the interconnection
network that gives the optimum performance and this section considers which of the many
interconnection topologies is most suitable for a dedicated Recursively Parallel solution.
Architecture and Performance 
The hardware architecture of a parallel computer affects performance through its
implications for intertask communication. When the parallel tasks reside on different physical
processors intertask communication then becomes interprocessor communication. Messages
travelling between the tasks flow physically along the wires interconnecting the processors.
Figure 6.15: Direct 1 hop communication (a) and indirect 2 hops communication (b)
There is a finite delay introduced into the execution of the program which is due to the finite
amount of time taken for the message to travel between the two communicating processors.
When tasks on adjacent processors communicate this delay is negligible. If the tasks reside
on non-adjacent processors and communicate indirectly via an intermediate processor then
these delays can become significant. The intermediate processor uses a store-and-forward
protocol to route messages to other processors. It receives incoming messages and stores
them in a buffer. When this processor makes time for communication, and when the
receiving processor is ready, the intermediate processor forwards the message from the buffer to
the receiver. If each message is acknowledged, as in Transputer systems, the original sending
processor cannot continue executing the task which requested the communication until the
acknowledgment from the receiver travels back through the intermediate processors.
Routing through any number of intermediate processors is possible but the more intermediaries
there are, the greater the delays involved in communication.
Two adjacent processors in a network are said to be separated by a distance of one
hop (Figure 6.15(a)) whilst two processors communicating via a single intermediary are
separated by a distance of two hops (Figure 6.15(b)). The concept of distance is important
in understanding the effect of different interconnection topologies on performance. The delay
in communications between two processors is proportional to the distance between them
and the various interconnection networks of Figure 1.6 can be characterized in terms of their
maximum, minimum and mean communication distances. For example, the star network
has a minimum distance of 1, a maximum distance of 2 and a mean distance of slightly less
than 2. These distances are constant irrespective of the number of processors used. The
ring network on the other hand has distances which vary with the number of processors:
the minimum is constant at 1 but the maximum is n/2 for a ring of n processors. These
characteristics of the different
topologies make them suitable for certain applications but not for others. A program based
on the supervisor/worker approach would achieve good performance on the star network
with the supervisor located at the centre and the nodes at the radii. Communication always
takes place between adjacent processors as the only communications are those between the
supervisor and the workers. For other programs communication may have to be routed
through the centre making the central processor a bottleneck in the system which limits
its performance. In general the average communication delay for a given topology will be
proportional to the average communication distance. A topology with a low average distance
will often be the best choice. This is one of the reasons for the hypercube's popularity. Of all
the topologies in Figure 1.6 the hypercube has the lowest average communication distance.
Another explanation of its popularity is that a number of other topologies (e.g. pipeline,
ring, tree) can be embedded into a hypercube. By purchasing a machine configured as a
hypercube users may also make use of any of the interconnection topologies which can be
embedded into a hypercube. The hypercube arrangement is one of considerable flexibility.
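The minimum, maximum and mean hop distances discussed above can be computed for small networks with a breadth-first search. The Python sketch below builds star, ring and hypercube networks as adjacency maps and summarises their distances; the builders, the function names and the choice of eight processors in the usage note are assumptions for illustration, and links are taken to be bidirectional.

```python
from collections import deque


def star(n):
    """Processor 0 at the centre, processors 1..n-1 on the spokes."""
    adj = {i: set() for i in range(n)}
    for i in range(1, n):
        adj[0].add(i)
        adj[i].add(0)
    return adj


def ring(n):
    """Each processor linked to its two neighbours."""
    return {i: {(i - 1) % n, (i + 1) % n} for i in range(n)}


def hypercube(dim):
    """2**dim processors; neighbours differ in exactly one address bit."""
    return {i: {i ^ (1 << b) for b in range(dim)} for i in range(1 << dim)}


def hop_distances(adj, source):
    """Breadth-first search giving the hop count from source to every processor."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def distance_summary(adj):
    """Return (minimum, maximum, mean) hop distance over all ordered pairs."""
    hops = [d for s in adj
            for t, d in hop_distances(adj, s).items() if t != s]
    return min(hops), max(hops), sum(hops) / len(hops)
```

For eight processors, `distance_summary(star(8))` gives a maximum of 2 and a mean of 1.75, `distance_summary(ring(8))` gives a maximum of 4, and `distance_summary(hypercube(3))` gives a maximum of 3.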
The benefits of an efficient communication topology can be undermined if the task to
processor allocation is not undertaken carefully. Assigning two communicating tasks to
opposite ends of a pipeline would be extremely inefficient. The ideal is to place
communicating tasks on adjacent processors, thus minimizing the communication delays. This is
not always possible and a compromise often has to be reached. The best system design is
a close marriage between efficient hardware connection and sensible task allocation.
A Suitable Architecture for the Recursively Parallel Solution 
Processor interconnection networks are often represented as undirected graphs in which
nodes represent processors and edges represent physical bidirectional communication links
between them. If this convention is adopted, Figure 6.2 immediately suggests a possible
interconnection for the processors in a system dedicated to the Recursively Parallel solution,
assuming each task is assigned to an individual processor. As the Transputer has four
bidirectional communication links per processor this topology is only realisable for systems
of seven subnetworks or fewer. For systems with more than seven subnetworks indirect routing
of communications would be required. Assigning two tasks to each processor improves the
situation somewhat and it becomes possible to realize systems with fifteen subnetworks.
Systems with more than fifteen processors again require indirect communications. Early in
the project a novel architecture, the Mutated Tree architecture [96], was proposed based
on an analysis of Figure 6.2. Alterations made to the method since the design of this
architecture have rendered it non-optimal as some of the assumptions on which it was
based are no longer valid. It is worth considering the Mutated Tree architecture for the
168 
6.3 Architectural Issues 
sake of completeness. 
The Mutated Tree architecture was conceived prior to the introduction of the hierarchical 
message aggregation scheme (Section 6.2.3) and its design is based on the assumption that 
explicit communications are allowed to occur between the last task and every other task 
in Figure 6.2. If the adjacency constraints for communicating tasks are to be satisfied it is
clear that every task must be directly connected to the last one in a radial fashion. This 
suggests the use of a star network but such a topology cannot be easily realized with the 
Transputer due to the large number of connections to the central processor. Considering 
how the algorithm works reveals that the communications P1 - P7, P2 - P7, P3 - P7,
P4 - P7 are not critical to the next stage of processing whereas P1 - P5, P2 - P5, P3 - P6,
P4 - P6, P5 - P7, P6 - P7 are. To minimize execution time the critical communications must
be performed without delay whilst the non-critical communications can withstand some 
degree of delay. This observation allows a better interconnection of the processors to be 
achieved by allowing non-critical communications to be routed via intermediate processors 
whilst ensuring that critical communications occur only between adjacent processors. The 
resulting interconnection is the Mutated Tree topology and Figure 6.16(a) shows the 4, 
8 and 16 processor Mutated Tree interconnections. Redrawing Figure 6.16(a) gives the 
diagrams in Figure 6.16(b) and these clearly show a modified form of binary tree, hence 
the name Mutated Tree. If the assumption concerning explicit last task communications is
valid then the Mutated Tree architecture has better communication characteristics than a
hypercube with the same number of processors. In each case the Mutated Tree has a total
communication distance as good as, if not better than, its hypercube counterpart. Unlike
the hypercube the Mutated Tree never requires critical communications to be routed via
intermediate processors. Figure 6.16 reveals another useful property of the Mutated Tree,
that of scalability. The architecture scales easily to 2^p processors, where p is a positive
integer, and never requires more than four interconnections per processor. Large Mutated
Tree structures are therefore realizable using Transputers. The limitation of four links per
processor means that it is not easy to build Transputer-based hypercubes with more than
sixteen processors. 
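The link-count argument can be checked with a short calculation. The following is a minimal sketch in Python (not the INMOS C used for the work reported here). It relies only on the standard property that a hypercube of 2^d processors needs d links at every processor; the function names are illustrative and do not come from the thesis software.

```python
TRANSPUTER_LINKS = 4  # bidirectional links per Transputer, as stated in the text


def hypercube_links_per_node(processors: int) -> int:
    """Links needed at each node of a hypercube with `processors` nodes."""
    if processors < 1 or processors & (processors - 1):
        raise ValueError("a hypercube needs a power-of-two number of processors")
    return processors.bit_length() - 1  # log2(processors)


def largest_hypercube(links_per_node: int) -> int:
    """Largest hypercube that can be wired with the given links per node."""
    return 2 ** links_per_node


for p in (4, 8, 16, 32):
    need = hypercube_links_per_node(p)
    verdict = "fits" if need <= TRANSPUTER_LINKS else "exceeds the available links"
    print(f"{p:2d} processors: {need} links per node ({verdict})")
```

A 32 processor hypercube would need five links per node, one more than the Transputer provides, which is the limit referred to above.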
Figure 6.16: The Mutated Tree interconnection network a) shown as modified pipeline b)
shown as modified binary tree

The use of a hierarchical message aggregation protocol invalidates the assumption
concerning explicit last task communications and the dataflow diagram of Figure 6.2 is
reduced to a simple binary tree. There is no longer any need for the extra connections
provided by the Mutated Tree structure and a binary tree interconnection will suffice. The
simplest topology which satisfies the adjacency constraints is a straightforward pipeline of
processors, as shown in Figure 1.7. This is ideal for Transputer implementation as any
length of pipeline can be constructed. One interesting feature of the pipeline is that it can
be embedded into many of the other structures in Figure 1.6 (e.g. ring, mesh, hypercube,
chordal ring). Most of the results shown in this thesis were taken with the Transputer
network configured as a pipeline, although hypercubes and rings were occasionally used.
The results obtained are independent of topology iff the topology supports the embedding 
of a pipeline. This argument is supported by the results of Table 6.3 which shows the same
systems solved by the RP method using pipeline and hypercube processor interconnection
topologies. The results are identical for both topologies. Results for sixteen processor 
hypercubes have not been obtained as it was not possible to configure the experimental 
hardware as a four dimensional (sixteen processor) hypercube. If the RP method were to
be used to successively solve different systems of equations then the hypercube topology
would probably give the best results. The embedded pipeline within the hypercube satisfies
the requirements for the computation phase whilst the low communication delays of the 
hypercube are beneficial in minimizing the time taken to transfer data from the supervisor 
task to the worker tasks. 
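One standard way to see that a pipeline can be embedded in a hypercube is to visit the hypercube nodes in reflected Gray code order, since consecutive Gray codes differ in exactly one bit and therefore label directly connected processors. The Python sketch below illustrates this construction; it is offered as an illustration only and is not claimed to be the mapping used for the experiments reported here.

```python
def gray_code_path(dimension: int) -> list:
    """Hypercube node labels visited in reflected Gray code order."""
    return [i ^ (i >> 1) for i in range(2 ** dimension)]


def is_pipeline_embedding(path: list) -> bool:
    """True if every consecutive pair of labels differs in exactly one bit,
    i.e. neighbouring pipeline stages sit on directly linked hypercube nodes."""
    return all(bin(a ^ b).count("1") == 1 for a, b in zip(path, path[1:]))


path = gray_code_path(3)
print(path)                         # [0, 1, 3, 2, 6, 7, 5, 4]
print(is_pipeline_embedding(path))  # True
```

Every pipeline stage then has its neighbours one hop away, while the remaining hypercube links stay available for other traffic such as supervisor transfers.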
                      Pipeline                          Hypercube
System  Subnetworks   Overall  Factorise  Substitute    Overall  Factorise  Substitute
118     3             1.42     1.42       1.39          1.42     1.42       1.39
118     7             2.04     2.08       1.89          2.08     2.08       1.84
118     15            2.50     2.47       1.89          2.50     2.49       1.90
629     3             2.13     2.23       1.84          2.13     2.23       1.84
629     7             2.73     2.70       2.17          2.65     2.67       2.19
629     15            3.97     4.04       3.06          3.94     4.05       3.08
734     3             2.20     2.46       1.84          2.29     2.46       1.84
734     7             3.96     4.10       3.00          3.89     4.02       3.01
734     15            5.68     5.83       4.14          5.56     5.78       4.18
734     31            5.90     5.98       3.98          -        -          -
1624    3             2.51     2.64       1.82          2.51     2.64       1.82
1624    7             4.02     4.15       2.88          4.02     4.15       2.88
1624    15            7.31     7.55       4.52          7.31     7.56       4.55
1624    31            8.84     9.38       5.46          -        -          -
Table 6.3: Comparison of performance using pipeline and hypercube architectures 
Consider again the task graph of Figure 6.13 and the intertask communications that are 
required. At some point in the solution tasks 13 and 14 must communicate with task 15. If
the task allocation of Figure 6.13 is adopted and the processors are configured in a pipeline
then tasks 13 and 14 both reside on processors which lie a distance of two hops from the
processor hosting task 15. Communications with task 15 must therefore take place via an
intermediate processor and this will increase the communication time. The communication
time would be reduced if a processor topology was used which directly linked the processors
hosting tasks 13, 14 and 15 and this can be achieved by adding two extra connections to the basic
pipeline. However when this topology is implemented it is observed that it has very little
effect on the speed-up obtained (a difference in speed-up of less than 0.08 for the CEGB 734 
node system). There are two reasons for the apparent ineffectiveness of this modification. 
Firstly, the messages that are passed in the indirect communications are short and the 
transfer time for these communications is negligible. The extra connections do reduce the 
transfer time but as these times are small anyway the effect is not noticeable. Secondly it 
is observed that when a simple pipeline is used the latency of the indirect communications 
can be hidden by the introduction of an imbalance in the computational loading of each 
processor (Section 6.3.1). The pipeline topology is therefore the simplest topology for this 
application. 
In relation to the Recursively Parallel solution there seems to be little point in adding any
extra connections to the basic pipeline as there are only a few indirect communications which
would benefit marginally from their inclusion. A benefit is perceived when the supervisor
task is considered. As this task has to communicate with all the processors the additional
connections can help in reducing the time taken to transfer data before and after solution.
If the Recursively Parallel method were to be applied in a situation where different sets
of equations needed to be solved in succession then the provision of extra connections to
reduce indirect communications would be of paramount importance in reducing the start
up overhead of each solution. Extra connections would also be useful for a full dynamic
simulation in which the RP method is used to solve linear equations. Although the extra
connections have little benefit for the RP solution, they may be necessary for the other
simulation algorithms to achieve maximum speed-up.

System                        Nodes  Non-zeros  Sparsity (%)
IEEE Test System              118    476        3.4
Reduced CEGB System           629    2301       0.58
CEGB System                   734    2696       0.50
Reduced US Eastern Seaboard   1624   6050       0.23
Table 6.4: Characteristics of the test systems
6.4 Performance of the Recursively Parallel Solution 
Having discussed the development of the Recursively Parallel solution this section presents
the results of applying the method to various test systems. Observing the performance of
the program requires the accurate recording of its execution time and this is not
straightforward. Appendix G gives some consideration to the problems of performance monitoring
and visualisation.
The performance of the Recursively Parallel solution has been evaluated using a number
of test systems drawn from power system applications. One of the systems is a standard
test system whilst the others are derived from real power system networks. Table 6.4 
characterizes the members of the test suite. Using the two-to-one task allocation allowed 
each of the test systems, except the IEEE 118 node system, to be partitioned and solved as 
2, 4, 8 or 16 main subnetworks. The IEEE 118 node system is too small to be effectively 
partitioned into 16 main subnetworks but all the other partitions can be achieved.
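The sparsity column of Table 6.4 can be reproduced from the node and non-zero counts. The sketch below assumes that sparsity is the number of non-zero entries expressed as a percentage of the n x n coefficient matrix; this definition is not stated in this section but it agrees with the tabulated values to the precision quoted. The function name is illustrative.

```python
# (nodes, non-zeros) as listed in Table 6.4
TEST_SYSTEMS = {
    "IEEE Test System": (118, 476),
    "Reduced CEGB System": (629, 2301),
    "CEGB System": (734, 2696),
    "Reduced US Eastern Seaboard": (1624, 6050),
}


def sparsity_percent(nodes: int, non_zeros: int) -> float:
    """Non-zero entries as a percentage of an n x n coefficient matrix
    (assumed definition, see the note above)."""
    return 100.0 * non_zeros / (nodes * nodes)


for name, (nodes, non_zeros) in TEST_SYSTEMS.items():
    print(f"{name}: {sparsity_percent(nodes, non_zeros):.2f}%")
```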
In examining the performance of the RP solution the most important result is the speed-up.
Speed-up, S(n), is defined as the ratio of uniprocessor execution time to multiprocessor
execution time for the same algorithm. To adhere to this strict definition requires the
calculation of the ratio of the execution time of the RP solution executed on one processor
to the execution time of the RP solution executed on n processors. This figure will be
referred to as speed-up over uniprocessor. The end user is not interested in this largely
theoretical figure as it does not reflect the realistic benefits of using a parallel solution.
The true benefits of parallel solution are characterized by the speed-up over best sequential
solution. This quantity is defined as the ratio of the execution time of the best sequential
solution to the multiprocessor execution time. Only if the multiprocessor and best sequential
algorithms are identical do the two speed-ups have the same value. In most cases there are
necessary differences in the two algorithms and the two speed-ups are different for the same
system. The speed-up over best sequential method is the important figure as it reflects the
performance benefit of implementing a parallel solution.

System  Solution Time (ms)  Factorisation Time (ms)  Substitution Time (ms)
118     34.895              25.23                    9.67
629     225.619             175.81                   49.81
734     280.725             221.91                   58.81
1624    969.49              825.56                   143.69
Table 6.5: Performance of the best sequential algorithm
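The speed-up definitions given in this section reduce to a simple ratio of execution times. As a worked illustration in Python, the sketch below applies the definition of speed-up over best sequential to the IEEE 118 node system, using the best sequential time from Table 6.5 and the two processor time from Table 6.6. The efficiency is taken to be E(n) = S(n)/n, the usual definition, which is assumed here as it is not stated explicitly in this section; the function names are illustrative.

```python
def speed_up(reference_time_ms: float, parallel_time_ms: float) -> float:
    """Ratio of a reference execution time to the multiprocessor execution time."""
    return reference_time_ms / parallel_time_ms


def efficiency(speed: float, processors: int) -> float:
    """Speed-up per processor, E(n) = S(n) / n (assumed standard definition)."""
    return speed / processors


# IEEE 118 node system, complete solution
best_sequential_ms = 34.895  # Table 6.5
two_processor_ms = 24.574    # Table 6.6, 2 CPUs

s = speed_up(best_sequential_ms, two_processor_ms)
print(f"S(2) = {s:.2f}, E(2) = {efficiency(s, 2):.2f}")  # S(2) = 1.42, E(2) = 0.71
```

Passing the single processor time of the RP program as the reference instead of the best sequential time gives the speed-up over uniprocessor.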
The performance of the best sequential method must be considered before the perfor-
mance of the multiprocessor method can be examined. Chapter 2 described the sequential 
solution and this is implemented in INMOS C on a single Transputer. The partitioning of 
the parallel solution is performed by applying the near optimal MDMLLRU ordering and 
then applying a topological reordering to the system to achieve RBBDF structure. The 
data supplied to the sequential solution also undergoes the MDMLLRU ordering and
topological reordering so that both sequential and parallel programs solve identical systems.
Table 6.5 details the performance of the best sequential solution of the test systems. It 
is found that the topological reordering of the network makes no difference to the sequen-
tial solution times. This is to be expected as a topological reordering is defined as one 
which introduces no extra operations or fill-ins. Factorisation and substitution times are
independently identified for the purposes of later analysis. 
The performance of the multiprocessor solution has been evaluated for each of the test 
systems using 2, 4, 8 and 16 processors. Table 6.6 gives the results for all the test cases. 
               Complete Solution          Factorisation              Substitution
System  CPUs   te        S(n)  E(n)       te        S(n)  E(n)       te       S(n)  E(n)
118     2      24.574    1.42  0.71       17.768    1.42  0.71       6.957    1.39  0.70
118     4      17.105    2.04  0.51       12.130    2.08  0.52       5.116    1.89  0.47
118     8      13.598    2.50  0.31       10.215    2.47  0.31       5.116    1.89  0.24
629     2      105.924   2.13  1.07       78.839    2.23  1.12       26.354   1.84  0.92
629     4      82.644    2.73  0.60       65.115    2.70  0.68       22.954   2.17  0.54
629     8      56.831    3.97  0.53       43.517    4.04  0.51       16.278   3.06  0.38
734     2      122.587   2.29  1.15       90.207    2.46  1.23       31.962   1.84  0.92
734     4      70.890    3.96  0.99       54.124    4.10  1.03       19.603   3.00  0.75
734     8      49.423    5.68  0.68       38.063    5.83  0.73       14.205   4.14  0.52
734     14     47.581    5.90  0.42       37.109    5.98  0.43       14.776   3.98  0.28
1624    2      386.257   2.51  1.26       312.720   2.64  1.32       78.951   1.82  0.91
1624    4      241.167   4.02  1.01       198.930   4.15  1.04       49.892   2.88  0.72
1624    8      132.625   7.31  0.91       109.346   7.55  0.94       31.790   4.52  0.57
1624    13     109.671   8.84  0.68       88.013    9.38  0.72       26.317   5.46  0.42
Table 6.6: Performance of the Recursively Parallel solution of the test systems 
Figure 6.17 offers a graphical interpretation of these results and shows the scalability of the
RP method. The larger the problem (i.e. the more nodes in the network) the greater the
speed-up that can be obtained from the parallel solution. This is because it becomes easier
to partition the system into a suitable number of equally sized subnetworks as the system
gets larger. Obviously there is a limit to the speed-up that can be obtained and this upper
limit is the linear speed-up, n, produced by n processors. In practice the linear speed-up is
unlikely to be achieved and the upper bound on speed-up will be given by Amdahl's Law
for the system under consideration. 
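Amdahl's Law bounds the speed-up by 1 / ((1 - f) + f/n), where f is the fraction of the work that can be executed in parallel and n is the number of processors. The Python sketch below evaluates this bound; the parallel fraction of 0.9 is an arbitrary illustrative value and is not a figure measured for any of the test systems.

```python
def amdahl_speed_up(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speed-up given by Amdahl's Law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


# Illustrative value only: 90% of the work assumed to be parallelisable
for n in (2, 4, 8, 16):
    print(f"n = {n:2d}: bound = {amdahl_speed_up(0.9, n):.2f}")
```

Only when the whole computation is parallelisable (f = 1) does the bound reach the linear speed-up n.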
The factorisation time and substitution time are identified independently in Table 6.6. 
This is made possible by the detailed timing results returned from the performance mon-
itoring schemes described in Appendix G. Simple inspection of the timing results or the 
Gantt chart allows the critical path through the subnetworks to be identified and the
factorisation and substitution times are obtained by summing the relevant values along the
critical path. Note that the execution times for factorisation and substitution in Table 6.6 
do not add together to give the overall execution time. This is because the substitution 
operation is timed as a stand-alone operation (Figure 5.10) and the latency of some of the 
computations is no longer hidden by factorisation computations (Figure 5.9). It should also
be noted that the overall and factorisation execution times were obtained with the storage 
scheme optimized for factorisation. The substitution time was obtained with the storage 
scheme optimized for substitution. Table 6.6 also lists efficiency figures for each phase of
the solution of the test systems.

Figure 6.17: Performance curves for the Recursively Parallel method, with 2-1 task
allocation a) overall b) factorisation c) substitution (each panel is plotted against the
number of processors, n)

               Complete Solution          Factorisation              Substitution
System  CPUs   te        S(n)  E(n)       te        S(n)  E(n)       te       S(n)  E(n)
118     2      24.77     1.41  0.71       17.73     1.42  0.71       7.2      1.35  0.68
118     4      16.96     2.06  0.52       12.22     2.06  0.52       5.63     1.72  0.43
118     8      13.95     2.50  0.31       10.30     2.45  0.31       5.57     1.74  0.22
629     2      106.30    2.12  1.06       80.51     2.18  1.09       27.33    1.82  0.91
629     4      85.57     2.64  0.66       66.11     2.66  0.67       23.32    2.14  0.54
629     8      57.47     3.93  0.49       43.20     4.07  0.51       17.54    2.84  0.36
734     2      122.69    2.29  1.15       90.11     2.46  1.23       32.13    1.83  0.92
734     4      72.96     3.85  0.96       55.23     4.02  1.01       20.10    2.93  0.73
734     8      50.30     5.85  0.73       35.53     5.76  0.72       17.98    3.27  0.41
1624    2      386.43    2.51  1.26       313.10    2.64  1.32       78.98    1.82  0.91
1624    4      241.28    4.02  1.01       198.98    4.15  1.04       50.43    2.85  0.71
1624    8      132.93    7.29  0.91       109.31    7.55  0.94       36.42    3.95  0.49
Table 6.7: Performance of the Recursively Parallel solution of the test systems, using 1-1
task allocation
For the purposes of comparison Table 6.7 details the performance of the multiprocessor 
solution when a one-to-one task allocation is used. The figures show that the one-to-one 
allocation gives greater speed-ups for systems with only a small number of subnetworks. As 
the number of subnetworks increases the speed-up rapidly saturates. The two-to-one task 
allocation gives lower speed-ups for small systems but the speed-up saturates more slowly
as the number of subnetworks increases. 
Note that in Table 6.7, superlinear speed-up is recorded for some of the systems. Superlinear
speed-up is the name given to speed-ups greater than n produced using n processors.
Superlinear speed-up is a rare event and usually only occurs under certain well-defined
algorithmic conditions [97, 98, 99] such as combinatorial implosiveness [100]. In this case
such conditions do not arise and the superlinear speed-up is a function of the differences
between the best sequential and parallel algorithms. The major algorithmic difference relates
to the data structures and storage schemes employed by the algorithms. The parallel
algorithm uses the highly efficient hybrid storage scheme whilst the sequential algorithm
uses less efficient sparse linked list storage. For a true comparison the speed-up figures
should be calculated using the execution times from a sequential method which also makes
use of hybrid storage. As Section 6.2.2 points out, this storage scheme has not yet been
implemented in a sequential solution so the comparison cannot be made here. It is still
reasonable to compare multiprocessor performance to that of the best sequential method
as this sequential method is the current 'industry accepted' standard and it is interesting
to quantify the improvement provided by the multiprocessor solution over the standard
method. Perhaps a fairer analysis of performance can be made by resorting to the speed-
up over uniprocessor figures (Table 6.8 and Figure 6.18). In calculating the speed-up over 
uniprocessor the same program is used for both uniprocessor and multiprocessor solutions 
so there can be no differences in the data structures and algorithms used. Therefore the 
speed-up over uniprocessor reflects more accurately the benefit that can be gained from 
increasing the number of processors used to obtain a solution. In examining the results 
(Table 6.8) it is immediately apparent that superlinear speed-up no longer occurs. Hence it
may be concluded that the superlinear speed-up recorded in Table 6.6 is due entirely to the
different data structures used by the parallel and best sequential algorithms. The speed-up
curves (Figure 6.18) are slightly different to those for the speed-up over the best sequential
solution. The absolute speed-ups are similar for the factorisation, substitution and overall 
solutions. This is different than for the speed-up over the best sequential solution where it 
was observed that factorisation and overall speed-ups were similar but substitution speed-
up was smaller. The curves of Figure 6.18 confirm the expectation that factorisation and 
substitution speed-up should be similar. Another striking feature of these curves is that 
saturation of the speed-up is much less noticeable than in Figure 6.17 particularly for the 
734 node system. This indicates that the speed-up saturation demonstrated in Figure 6.17 
is a function of the differences in data structures between the best sequential and paral-
lel algorithms. The lower rate of saturation in the speed-up over uniprocessor results is 
manifested in Figure 6.18 as 'straighter' curves which are much closer to an ideal straight 
line than those of Figure 6.17. The lower rate of saturation implies that the RP algorithm 
is easily scalable and that few overheads are introduced as the number of processors is
increased. 
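The relationship between the two speed-up measures can be made concrete with the tabulated figures. Since speed-up over uniprocessor is the single processor time divided by the n processor time, an approximate single processor time for the RP program is the n processor time multiplied by the speed-up over uniprocessor. The Python sketch below performs this calculation for the two processor, complete solution results. It assumes that the speed-ups of Table 6.8 correspond to the execution times of Table 6.6, which is not stated explicitly in this section, and the derived times inherit the rounding of the tabulated speed-ups, so they are indicative only.

```python
# system size: (best sequential time in ms from Table 6.5,
#               two processor time in ms from Table 6.6,
#               two processor speed-up over uniprocessor from Table 6.8)
RESULTS = {
    118: (34.895, 24.574, 1.51),
    629: (225.619, 105.924, 1.87),
    734: (280.725, 122.587, 1.90),
    1624: (969.49, 386.257, 1.76),
}


def implied_uniprocessor_time(parallel_time_ms: float, speed_up_over_uniprocessor: float) -> float:
    """Single processor time implied by S(n) = t_1 / t_n (derived, not measured)."""
    return parallel_time_ms * speed_up_over_uniprocessor


for nodes, (t_seq, t_2, s_uni) in RESULTS.items():
    t_1 = implied_uniprocessor_time(t_2, s_uni)
    print(f"{nodes} nodes: best sequential {t_seq:.1f} ms, "
          f"implied RP uniprocessor {t_1:.1f} ms, ratio {t_seq / t_1:.2f}")
```

With these rounded figures the implied single processor time of the RP program is lower than the best sequential time for the 629, 734 and 1624 node systems, which is consistent with attributing the superlinear speed-ups to the difference in data structures.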
In evaluating the performance of the RP solution it must be compared with the
speed-ups obtained in simulation and the speed-up predicted by an analysis of the elimination tree.
To establish the validity of the RP solution it is also necessary to compare the performance
of the RP solution with the performance of other multiprocessor solutions. 
               Complete Solution   Factorisation   Substitution
System  CPUs   S(n)                S(n)            S(n)
118     2      1.51                1.49            1.56
118     4      2.47                2.58            2.40
118     8      3.98                3.98            3.35
629     2      1.87                1.90            1.91
629     4      2.62                2.54            2.39
629     8      4.05                3.96            3.70
734     2      1.90                1.92            1.88
734     4      3.52                3.44            3.21
734     8      5.51                5.36            5.05
734     14     7.08                6.84            6.05
1624    2      1.76                1.69            1.91
1624    4      2.93                2.78            3.23
1624    8      5.48                4.98            5.24
1624    13     7.43                7.04            6.43
Table 6.8: Performance of the Recursively Parallel solution of the test systems - speed-up
over uniprocessor

                                  Predicted  Simulation  Observed Speed-Up
System  Processors  Subnetworks   Speed-Up   Speed-Up    vs Sequential  vs Uniprocessor
118     2           3             1.55       1.55        1.42           1.49
118     4           7             2.93       2.72        2.08           2.58
118     8           15            4.17       4.41        2.47           3.98
629     2           3             1.92       1.94        2.23           1.90
629     4           7             2.38       2.41        2.70           2.54
629     8           15            3.91       3.96        4.04           3.96
734     2           3             1.94       2.01        2.46           1.92
734     4           7             3.46       3.34        4.10           3.44
734     8           15            5.40       5.37        5.83           5.36
734     14          31            7.67       6.67        5.98           6.84
1624    2           3             1.80       1.79        2.64           1.69
1624    4           7             3.07       2.41        4.15           2.78
1624    8           15            5.21       5.15        7.55           4.98
1624    13          31            8.78       6.27        9.38           7.04
Table 6.9: Simulated, predicted and observed factorisation speed-ups for the RP solution
of the test systems

Figure 6.18: Speed-up against uniprocessor for the Recursively Parallel method, with 2-1
task allocation a) overall b) factorisation c) substitution (curves for the 118, 629, 734 and
1624 node systems against the number of processors, n)

Figure 6.19: RP method compared to predicted performance (factorisation only - speed-up
over best sequential; predicted and observed curves for the 118, 629, 734 and 1624 node
systems against the number of processors, n)

Figure 6.20: RP method compared to simulated performance (factorisation only - speed-up
over best sequential; simulated and observed curves for the 118, 629, 734 and 1624 node
systems against the number of processors, n)

Table 6.9 and Figures 6.19 and 6.20 show the theoretically predicted speed-up and the
simulated speed-ups for each of the test systems. It can be seen that the performance of
the Transputer implementation matches, and sometimes exceeds, the predicted performance
of the method. The factorisation speed-up is encouraging as it is often greater than that
predicted or obtained in simulation. The main reason for this is that the simulation uses the
hybrid storage mechanism to a lesser extent and the parallel implementation is therefore
inherently more efficient. As the factorisation requires many search operations to find the 
relevant matrix elements hybrid storage improves the performance by keeping the number 
of less efficient linked list searches to a minimum. The effect is more noticeable on larger 
systems thus explaining why the performance of the 734 and 1624 node systems is better
than predicted. A second effect is also at play in the parallel factorisation which is not
accounted for in the simulation and that is the effect of interprocessor communication. This
has the effect of reducing the speed-up and is particularly noticeable in smaller systems. It
is partly this effect that is responsible for keeping the speed-up of the 118 node system below
that expected of it. In larger systems this effect tends to reduce the benefits produced by
the hybrid storage scheme. The influence of communication becomes greater as the number 
of processors is increased and it is the interprocessor communication which is responsible 
for the tailing off of the speed-up as more processors are introduced [25]. When the speed-
up over uniprocessor results are compared with the predicted speed-ups of Table 6.9 it is 
found that observed speed-ups match very closely to the theoretical predictions. When 
eight (or fewer) processors are used the observed speed-ups are similar to the predicted 
speed-ups and for the 734 node system the two sets of figures are almost identical. As the 
number of processors increases beyond eight the observed speed-up starts to fall away from 
the theoretical predictions and this provides evidence to support the claim that increased 
communication overheads are responsible for the tailing off of speed-ups as the number 
of processors increases. It also confirms that the improved performance offered by the RP
method is due to the improved partitioning of the problem. 
The substitution speed-up is much less encouraging as it seldom comes close to the
predicted or simulated performance. This is due to the inherent difficulties in parallelising
the substitution operation. Highly efficient sequential substitution algorithms exist which
execute extremely rapidly. Because the sequential substitution algorithms are so fast it is
difficult to create parallel versions which do not introduce extra overheads into the
computation and detract from the performance. The interprocessor communication which results
from parallelising the substitution operation introduces the greatest overheads. Although
these communications are responsible for only a very small amount of time in factorisation
they become much more significant in substitution due to its short execution time. This
explains the almost universal difficulty in creating a parallel substitution algorithm which
has a speed-up greater than about 3. Although the actual performance (over the best
sequential method) falls short of the theoretical and simulated performance it is still good
when compared to the performance of other multiprocessor substitution algorithms and 
some relatively high speed-ups have been achieved using only a small number of processors. 
When substitution performance over uniprocessor is compared to predicted performance the 
outcome is much more encouraging. The speed-ups are much closer to those predicted and 
the absolute speed-ups are greater than the speed-ups over best sequential. This indicates 
that the poor speed-ups over best sequential are once again a function of the differences in 
data structures between the best sequential and parallel algorithms. As with the speed-up 
over best sequential, substitution speed-up over uniprocessor falls off as the number of pro-
cessors increases and it falls off at a faster rate than the factorisation or overall speed-up. 
The implication of this is that interprocessor communication and other scaling overheads 
are primarily responsible for saturation of the substitution speed-up. 
Unfortunately it is not very easy to directly compare the performance of the RP method
against the performance of other multiprocessor solutions. Few authors cite actual speed-up
figures in their publications, although there is an argument that it is more valid to present
actual execution times [101]. Unfortunately it is not possible to compare the same method
executed on different architectures using only absolute execution times as the execution 
time depends on characteristics of the machine, such as the instruction set and the clock 
speed. Comparisons are made possible if authors publish absolute execution times for their 
parallel algorithms along with execution times for the best sequential algorithms. Some 
authors do give absolute execution times for their methods but fail to give the execution 
time for the best sequential solution on the same processor. This rules out the possibility
of calculating speed-ups from the quoted timings. Where results are available they seldom 
refer to the same systems as those used here and some interpolation of results is necessary 
to allow comparison. A few authors have published performance figures which do allow 
some comparison to be attempted. In the following pages the performance of other parallel
methods is compared with the speed-ups (over best sequential) produced by the RP
method. 
Padhila & Morelato [44] cite results for three systems similar to those used here. Padhila's
results, shown below, were obtained using both 8 and 16 processors.
         Speed-Up Over Sequential        Efficiency
System   8 Processors  16 Processors     8 Processors  16 Processors
118      4.76          ≈ 8               0.595         0.5
725      5.14          ≈ 9               0.643         0.563
1729     5.95          ≈ 11              0.744         0.688

Figure 6.21: RP method compared to Padhila's W-matrix method
These results, which are for substitution only, appear impressive at first sight. The 8 
processor results for the larger systems are not significantly different from those produced by 
the RP method. The 16 processor results seem to be significantly better than those obtained
from the RP solution. However the 16 processor results are only simulated results and
Padhila points out that the simulation neglects communication and also does not account
for the processing of diagonal elements. It is not possible to compare these figures to the
RP method as actual results are not available but it is suspected that the effect of diagonal
processing and communication will be to reduce the 16 processor performance to something 
more akin to that of the RP method. Another factor which is significant in making Padhila's 
results more impressive is the architecture on which the parallel program was implemented. 
The machine was a distributed shared memory machine in which each processing node 
consisted of an Intel 80286 processor coupled with an Intel 80287 floating point maths 
coprocessor. The presence of the coprocessor introduces parallel processing at each node as 
the coprocessor can perform mathematical operations in parallel with the main processor. 
The fact that a large shared memory was present makes one suspect that communication 
between processors was achieved via the shared memory and few, if any, interprocessor 
communication delays were introduced. 
Lau [32] presents performance figures for systems similar in size to the smaller of the two 
test systems used here. The performance quoted by Lau is for the factorisation operation 
and no results are presented for substitution. Table 6.6 clearly shows the performance of 
the RP method to be far superior to that of Lau's method. The performance of Lau's 
method is surprisingly poor as the method itself is similar to the RP method. Lau bases 
network partitioning on an analysis of the elimination tree and tries to achieve a balanced 
computational loading. The major difference is that Lau's method uses a standard BBDF 
coefficient matrix which requires the cutset to be processed sequentially. Even when hybrid 
storage is not used the RP factorisation speed-ups are significantly better than Lau's. This 
would imply that the improved performance of the RP solution is due mostly to the
exploitation of the extra parallelism introduced by the RBBDF coefficient matrix structure. 
              118 Node System     662 Node System
Processors    S(n)      E(n)      S(n)      E(n)
     1        1         1         1         1
     2        1.6       0.8       1.4       0.7
     4        1.9       0.95      1.7       0.85
     8        1.8       0.9       1.65      0.83
    16        1.4       0.7       1.5       0.75
[Plot: results for the RP method and Lau's method on two test systems against the number of processors, n]
Figure 6.22: RP method compared to Lau's method 
Chan [36] provides results for the CEGB 811 node network. These results, which are for 
substitution, are useful as the CEGB 734 node network used here is a reduced version of 
the 811 node network used by Chan. The results for the RP solution of the CEGB 734 
node network are superior to those obtained by Chan and the results for the US 1624 node 
network, which is more than twice the size of Chan's system, are also superior. Chan's 
results are disappointing when it is considered that they were obtained using a distributed 
shared memory architecture [81]. As all interprocessor communication is performed via the 
shared memory Chan's method should not suffer from the communication delays inherent 
in message passing distributed memory machines. One would therefore hope that Chan's 
method would be more efficient than a distributed memory solution. The fact that the 
results from the RP solution were collected using a distributed memory multiprocessor serves 
to highlight the improvements in performance offered by the RP method. 
Processors    Speed-Up S(n)    Efficiency E(n)
     2            1.77              0.94
     4            2.55              0.78
     8            3.36              0.58
    16            3.65              0.33
[Plot: results for the RP 1624 node and RP 734 node solutions and Chan's 811 node solution against the number of processors, n]
Figure 6.23: RP method compared to Chan's method 
Van Ness [46] provides results for the substitution of a 1723 node system using the multiple 
factoring method. As with many other authors, Van Ness provides only the results 
of a simulation and does not provide real results from a practical implementation of the 
method. Based on the figures given it would appear that the multiple factoring method 
produces a speed-up of 15 using 20 processors and a speed-up of 9.7 using 50 processors. 
The introduction of parallel overheads and interprocessor communication would reduce the 
speed-ups somewhat but they do appear to be greater than those produced by the RP 
method. In a recently published paper [33] Van Ness does provide results from implementations 
of the multiple factoring method on both shared and distributed memory machines. 
Again these results are for a 1723 node system and the results for the distributed memory 
implementation are disappointing, with the highest speed-up being 2.47 using 8 processors. 
If the number of processors is increased further the speed-up drops off and a speed-up of 
2.02 is achieved using 12 processors. The Transputer-based RP solution, which is also a 
distributed memory implementation, fares much better. The 1624 node system is solved 
with a speed-up of 4.52 using 8 processors and a speed-up of 5.46 using 13 processors. Not 
only does the RP method give greater speed-ups but it also has quicker factorisation than 
the multiple factoring method, which requires an extra factorisation stage to partition the 
factored coefficient matrix. The multiple factoring method performs much better when implemented 
on a shared memory machine. For the same 1723 node system the speed-ups are 
now 5.55 with 8 processors, 5.96 with 12 processors and 7.48 with 16 processors. Although 
this performance is almost three times better than that of the distributed memory implementation 
it is still only comparable with that of the distributed memory RP implementation. 
The main drawback of the multiple factoring method is that extra information has to be 
introduced into the problem to enable a parallel solution to be implemented. 
Several other authors also provide performance statistics in their publications but these 
all relate to the IEEE 118 node system which is perhaps too small to provide a valid test of a 
parallel solution. Abur [26] quotes a speed-up of 3.8 for the IEEE 118 system which appears 
encouraging until one realizes that between 48 and 57 processors are required to achieve 
this performance. Not only is Abur's method extremely inefficient (0.067 < E(n) < 0.079) 
it is also extremely expensive when one considers the amount of processing power needed. 
El-Kieb et al. [30] provide results for the performance of their parallel solution but these 
results are for systems that have no more than 30 nodes and again these systems are too 
small to provide a valid test of performance. However El-Kieb does claim a fourfold speed-up 
for the solution of the IEEE 118 node system using 10 processors, although he gives no 
specific results to support this claim. 
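The efficiency figures quoted in these comparisons can be checked with a few lines of code. The sketch below assumes the usual definition E(n) = S(n)/n; the function name `efficiency` is illustrative only and does not come from any of the cited works.

```python
def efficiency(speed_up: float, processors: int) -> float:
    """Parallel efficiency E(n) = S(n) / n (assumed standard definition)."""
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return speed_up / processors


# Speed-ups and processor counts quoted in the text.
abur_best = efficiency(3.8, 48)    # Abur, fewest processors quoted
abur_worst = efficiency(3.8, 57)   # Abur, most processors quoted
el_kieb = efficiency(4.0, 10)      # El-Kieb's claimed fourfold speed-up

print(round(abur_worst, 3), round(abur_best, 3), el_kieb)
```

The two values for Abur's method bracket the efficiency range given above, since the speed-up of 3.8 is divided by between 48 and 57 processors.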
The comparisons drawn have shown the RP method to give a better performance than 
a number of existing LU-based multiprocessor solutions. The RP method produces higher 
speed-ups at greater efficiencies over a range of test systems. An extensive comparison of 
the methods is not possible as insufficient data is available for the methods developed by 
other authors. Similarly, the four systems used in assessing the RP method do not make for 
a comprehensive test. Obtaining the data for real systems is not easy as the required data 
is often commercially sensitive. Using the limited data available one must conclude that 
the RP method offers performance advantages over existing methods. As the performance 
compares closely with theoretical predictions one must assume that the RP method is an 
efficient and effective method for solving linear network equations in parallel. 
6.5 Summary 
This chapter has considered many of the implementation issues encountered in developing 
an RP solution program for a distributed memory MIMD environment. Following from the 
initial specifications and basic algorithms of Chapter 5, refinements have been introduced 
which significantly improve the performance of such an implementation. Results of the 
Transputer-based implementation have been presented and the performance of the method 
has been compared with that of other methods. It has been demonstrated that the RP 
method offers comparable performance to other methods using fewer processors, making 
the RP method more efficient and economically viable. 
Chapter 7 
Further Work 
Whilst the Recursively Parallel method performs almost as well as the theory predicts 
there is always room for improvement. Throughout this discourse various suggestions have 
been made regarding further work which may be undertaken to improve the method and 
make it more amenable to incorporation into power system analysis applications. These 
suggestions are concerned with the partitioning of the network, the selection of an optimal 
ordering and the use of vector processing techniques. Outlines of the problems requiring 
further investigation are given along with suggestions as to how these problems may be 
solved. 
7.1 Automatic Network Partitioning 
Chapter 4 showed how the elimination tree may be used as a tool to assist in the partitioning 
of the network. At present the partitioning is achieved by visual inspection of the elimination 
tree. This is far from ideal and a significant amount of preparation is required prior to the 
solution of a system. A method which automatically partitions a system is more appropriate 
and will allow for faster processing of systems. Fortunately the techniques used in dividing 
the tree are heuristic and a set of rules can be derived which form the basis of an automatic 
partitioning algorithm. The method will be based upon a bin packing approach [89, 90] and 
the heuristic rules are used to select which nodes or subtrees to place in each bin. Whilst 
many of the rules are already known it is unclear how certain parts of the method will be 
implemented and it is these topics which require further investigation. 
The bin packing approach works by designating a number of 'bins' into which parts of 
the problem are placed. For network partitioning the bins correspond to subnetworks and 
subtrees of the elimination tree are assigned into each subnetwork. Hence each subnetwork is 
a collection of subtrees of the elimination tree. It is important to note that the subnetworks 
may consist of disjoint subtrees. A subnetwork can therefore be constructed from a number 
of small subtrees which are not directly interconnected. 
Consider the method for partitioning a system for solution by a typical Master/Slave 
type parallel solution algorithm [87]. A number of subnetworks are to be produced along 
with a single cutset block. The first step in the method is to choose a threshold weight which 
is used in analyzing the weighted elimination tree (Section 4.3.2). If the entire tree has a 
weight of W and m subnetworks are to be created then the ideal threshold weight is 
Taking into account the presence of the cutset block, the threshold weight is given by 
An upper and lower bound on subnetwork (bin) weight is also required and the idea is to 
assign subtrees to a subnetwork until the weight of the subnetwork lies between the 
upper and lower bounds on subnetwork weight. It has been found empirically that setting 
the upper and lower bounds ($W_u$ and $W_l$) to $W_{th} \pm 5\%$ gives good results. Hence 
\[ W_u = 1.05\,W_{th} \qquad W_l = 0.95\,W_{th} \tag{7.2} \]
The partitioning algorithm has six steps. In the following outline algorithm $W_i$ is the 
weight of the $i$th subnetwork and $i = 1 \ldots m$. 
1. Scan the elimination tree, starting from the root, and pick the first nodes encountered 
on each branch which have a weight lower than or equal to $W_{th}$. Order these nodes, 
which are the root nodes of subtrees, in descending order of weight. 
2. Choose the subtree corresponding to the node of the largest weight and assign it to 
subnetwork $i$. 
3. If $W_i < W_l$ try adding other subtrees (in descending weight order) until the weight of 
subnetwork $i$ obeys $W_l < W_i < W_u$. 
If $W_i > W_u$ remove subtrees from the subnetwork until $W_l < W_i < W_u$ and add 
the removed subtrees into other subnetworks so that their weights fall in the desired 
range. 
4. Repeat steps two to four until all the subtrees have been assigned to the subnetworks 
and each of the m subnetworks consists of at least one subtree. 
5. It may be necessary to fine tune the partitioning if there are some subnetworks for 
which $W_i < W_l$. Fine tuning may also be required if some subnetworks have $W_i \approx W_u$ 
whilst others have $W_i \approx W_l$. Fine tuning is accomplished by removing constituent 
subtrees from subnetworks with large weights and assigning them to subnetworks with 
smaller weights in an attempt to balance out the weight of the subnetworks. This stage 
may be repeated until the best balance is found. 
6. All the nodes which lie between the root of the elimination tree and the identified 
subtree roots constitute the tearing nodes and these are grouped together to form the 
last subnetwork (i.e. the cutset). 
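The outline above can be illustrated with a short sketch. It is a simplification under stated assumptions, not the proposed algorithm itself: the elimination tree is assumed to be held as a dictionary of child lists, the threshold weight `w_th` is supplied by the caller, and steps two to five (filling each subnetwork to within the weight window and then fine tuning) are replaced by a plain greedy rule that gives the next heaviest subtree to the currently lightest subnetwork. All function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple


def subtree_weights(children: Dict[str, List[str]],
                    node_weight: Dict[str, float],
                    root: str) -> Dict[str, float]:
    """Weight of a node = its own weight plus that of all its descendants."""
    weights: Dict[str, float] = {}

    def visit(node: str) -> float:
        total = node_weight[node] + sum(visit(c) for c in children.get(node, []))
        weights[node] = total
        return total

    visit(root)  # recursive for brevity; deep trees would need an explicit stack
    return weights


def partition(children: Dict[str, List[str]],
              node_weight: Dict[str, float],
              root: str, m: int, w_th: float
              ) -> Tuple[List[List[str]], List[float], List[str]]:
    weights = subtree_weights(children, node_weight, root)

    # Step 1: walking down from the root, the first node on each branch whose
    # subtree weight is <= w_th becomes a subtree root. Nodes passed on the
    # way down are the tearing nodes that form the cutset (step 6).
    roots: List[str] = []
    cutset: List[str] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if weights[node] <= w_th:
            roots.append(node)
        else:
            cutset.append(node)
            stack.extend(children.get(node, []))
    roots.sort(key=lambda n: weights[n], reverse=True)

    # Simplified stand-in for steps 2 to 5: heaviest subtree first, always
    # into the lightest subnetwork (bin) so far.
    bins: List[List[str]] = [[] for _ in range(m)]
    bin_weight = [0.0] * m
    for r in roots:
        i = bin_weight.index(min(bin_weight))
        bins[i].append(r)
        bin_weight[i] += weights[r]
    return bins, bin_weight, cutset


# Small hypothetical elimination tree with root "r".
children = {"r": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
node_weight = {"r": 1, "a": 1, "b": 1, "a1": 4, "a2": 3, "b1": 2, "b2": 5}
bins, bin_weight, cutset = partition(children, node_weight, "r", m=2, w_th=6)
print(bins, bin_weight, sorted(cutset))
```

Note that a subnetwork produced this way may contain disjoint subtrees, as the text allows. A fuller implementation would also test each subnetwork weight against the bounds of equation (7.2) and move subtrees between subnetworks when a bound is violated.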
This algorithm works well for partitioning a network for solution by a conventional 
parallel approach where m subnetworks and a single cutset are required. The problem 
becomes more complicated when an automatic method of partitioning for RP solution is 
required. With the RP method there are several distinct levels of subnetworks within the 
task graph. All the subnetworks in each level should have roughly the same weight and 
these weights should decrease in moving through the levels from the leaves to the root of 
the task graph tree. If there are $k$ levels in the tree and $WL_j$ is the desired weight of all 
nodes in level $j$ then 
\[ WL_j > WL_{j+1} \qquad j = 1 \ldots k-1 \tag{7.3} \]
Instead of a single upper and lower bound on subnetwork weight an upper and lower bound 
is now required on $WL_j$, where $j = 1 \ldots k$. Again the $\pm 5\%$ tolerance can be applied to 
yield 
\[ WL_{u_j} = 1.05\,WL_j \qquad WL_{l_j} = 0.95\,WL_j \tag{7.4} \]
but some heuristic must be found to determine the relationships between the various values 
of $WL_{l_j}$, $WL_{u_j}$, $WL_j$ and $W_{th}$. At present no heuristic has been derived and further work 
is required. Assuming that the values of WLj can be found the algorithm for partitioning 
for RP solution is 
1. Scan the elimination tree, starting from the root, and pick the first nodes encountered 
on each branch which have a weight lower than or equal to $W_{th}$. Order these nodes, 
which are the root nodes of subtrees, in descending order of weight. 
2. Use steps two to four of the previous algorithm to allocate subtrees to the m subnetworks 
in level $j$, where $j = 1 \ldots k$. 
3. Repeat step two until subtrees have been allocated to all the subnetworks in the k 
levels of the tree. 
4. Fine tune the partitioning to achieve the best inter-level balancing. 
There are variations on this algorithm but these will not be considered further as the 
aim is only to highlight the issue as a topic for further investigation rather than to solve the 
problem. Another complication for RP partitioning also requiring further study is implicit 
in steps three and four of the first algorithm. In partitioning for a standard parallel solution 
the choice of which subtree to add to which subnetwork is not constrained in any way. In 
partitioning for RP solution subtrees must be selected so that they do not violate RBBDF 
structure constraints as well as satisfying the weight constraints. An additional heuristic 
must be derived which checks to see whether the inclusion of a given subtree violates the 
RBBDF constraints. 
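Although no heuristic for choosing the level weights has been derived, a candidate multi-level partitioning can at least be tested against conditions (7.3) and (7.4). The sketch below does only that. It assumes the desired level weights are already known and are listed from the leaf level ($j = 1$) to the root level ($j = k$); the function name and data layout are hypothetical, and the RBBDF structural constraints are not checked.

```python
from typing import List, Tuple


def check_level_weights(level_targets: List[float],
                        level_subnet_weights: List[List[float]],
                        tol: float = 0.05) -> Tuple[bool, List[Tuple[int, float]]]:
    """Check a multi-level partitioning against (7.3) and (7.4).

    level_targets[j-1] is the desired weight WL_j of level j.
    level_subnet_weights[j-1] lists the subnetwork weights found in level j.
    Returns (targets_decrease, offenders), where offenders holds
    (level, weight) pairs lying outside WL_j * (1 +/- tol).
    """
    # (7.3): target weights must strictly decrease from leaves to root.
    decreasing = all(a > b for a, b in zip(level_targets, level_targets[1:]))

    # (7.4): every subnetwork weight must lie within the tolerance band.
    offenders: List[Tuple[int, float]] = []
    for j, (target, weights) in enumerate(
            zip(level_targets, level_subnet_weights), start=1):
        lower, upper = (1 - tol) * target, (1 + tol) * target
        for w in weights:
            if not lower <= w <= upper:
                offenders.append((j, w))
    return decreasing, offenders


# Hypothetical three-level example.
ok, offenders = check_level_weights([100, 40, 10],
                                    [[98, 103, 100], [41, 37], [10]])
print(ok, offenders)
```

In this example the level targets decrease as required, but one subnetwork in level 2 falls below the lower bound and would need fine tuning.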
7.2 The Search for an Optimal Ordering 
Section 2.7.6 compares the performance of a number of near-optimal ordering strategies and 
describes a simple computer program which was written to assess the performance of each 
of these methods. The performance of a given ordering strategy is observed by iteratively 
applying that ordering to a given system. Before each iteration the system is randomly 
reordered and performance is quantified in terms of the fill-in introduced and the length 
of the critical path. Whilst most of the path lengths and fill-ins are close to the mean 
values, certain random reorderings give significantly shorter paths and fewer fills. Other 
random orderings give rise to much higher fill and longer paths. The ordering which gives 
the shortest path and minimum fill-in is of interest as this is the optimum ordering of the 
system. 
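To make the fill-in measure concrete, the following sketch simulates elimination on the network graph and counts the fills created by a given ordering. It is an illustration only and is not the program of Section 2.7.6: the adjacency-dictionary layout and the name `count_fill_in` are assumptions, each new branch is counted once (not once per symmetric matrix element), and the critical path length is not computed.

```python
from typing import Dict, Iterable, List, Set


def count_fill_in(adjacency: Dict[int, Iterable[int]], order: List[int]) -> int:
    """Simulate node elimination in `order`; return the number of fill-ins."""
    graph: Dict[int, Set[int]] = {v: set(nbrs) for v, nbrs in adjacency.items()}
    fills = 0
    for v in order:
        neighbours = sorted(graph.pop(v))      # nodes still connected to v
        for u in neighbours:
            graph[u].discard(v)
        # Eliminating v joins every pair of its remaining neighbours;
        # each pair not already joined is a fill-in.
        for idx, a in enumerate(neighbours):
            for b in neighbours[idx + 1:]:
                if b not in graph[a]:
                    graph[a].add(b)
                    graph[b].add(a)
                    fills += 1
    return fills


# Star network: node 0 is connected to nodes 1, 2 and 3.
star = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
centre_first = count_fill_in(star, [0, 1, 2, 3])
centre_last = count_fill_in(star, [1, 2, 3, 0])
print(centre_first, centre_last)
```

Eliminating the centre of the star first joins all three outer nodes to one another, whereas eliminating it last creates no fill at all, which shows how strongly the ordering affects the result even on a very small system.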
Chapter 2 observed that a graph with n nodes may be reordered in n! different ways, 
some of which are, in some sense, 'more optimal' than others. Finding the optimum 
ordering is an NP-complete problem and when n is of the order of 1000 it is not realistic 
to examine all the possible orderings 
to find the optimum. However the use of the optimum ordering can have a profound effect 
on the performance of the solution algorithm and if it is possible to find this ordering 
without incurring large computational overheads then it is surely worth doing. Existing 
near-optimal strategies go some way to optimizing the elimination but they are sensitive 
to the initial network ordering and the orderings they produce can only be considered as 
locally optimal. 
As the problem is NP-complete it would be inefficient to perform an exhaustive search 
to locate the optimum ordering. Genetic algorithm techniques [102] suggest ways in which 
the space of all possible reorderings may be searched to find the optimum solution. Genetic 
algorithms are based upon the biological processes of evolution and natural selection, often 
referred to as survival of the fittest. Each organism in nature carries a blueprint of itself in 
its genes and each gene consists of smaller segments called chromosomes which define certain 
aspects of the organism and its behaviour. Evolution occurs when the organism reproduces 
and is accomplished through the action of mutation and crossover. In reproduction each 
of the parent organisms passes some of its genetic structure to the child. Crossover is the 
operation which splices together the two parent genes to produce the child's gene which is 
slightly different from that of either parent and this makes each child a unique individual. 
Mutation is the process which may randomly alter some of the child's chromosomes. The 
alteration to genetic structure provided by crossover and mutation gives the child slightly 
different structure and behaviour to its parents and this enables organisms to adapt to their 
environment over a period of several generations. Those organisms which are best adapted 
to their environment reproduce vigorously and produce many offspring with similar genetic 
make up. Those organisms least suited to the environment produce few offspring. Over a 
number of generations the strong genes survive and are spread throughout the population 
whilst the weak genes die out. 
Genetic algorithms (GAs) are a type of search algorithm which 'evolve' toward the 
optimum solution of a given problem. They are especially useful for problems in 
which the solution space is very large or for NP-complete problems. In order to use a 
GA it is necessary to parameterize the characteristics of a solution to the problem. These 
parameters are the 'chromosomes' of the solution and are concatenated to form a gene 
string. An initial population of gene strings is required and it is usual to use a population 
with tens of members. The members of the population initially have their strings initialized 
with random values. Some form of fitness function is required to assess how good each 
194 
7.2 The Search for an Optimal Ordering 
gene string is as a solution to the problem. Those strings which are found unfit are kiUed 
off and reproduction of the remaining strings is used to generate an equivalent number of 
new strings. Crossover is used in reproduction and a small (less than 1%) percentage of 
the strings in the population are mutated. The cycle of fitness testing and reproduction 
continues until the majority of the population converges to an acceptable solution. 
The most difficult part of any genetic algorithm is knowing when to terminate the search. 
The operation of a GA can be seen as a sort of parallel search of the problem's solution space. 
Somewhere within that space lies the optimum solution. Starting from random locations 
within that space the selection, mutation and crossover operations cause the search points 
to jump through the space towards the region where the optimum solution lies. Most of 
the population will eventually hold the gene string representing the optimum solution but 
not all strings will converge to this solution due to the action of the mutation operation, 
which is needed to prevent the search getting trapped by local optima. The search can be 
terminated when a defined number of the population have converged to the same solution. 
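The termination rule described above can be sketched as a simple test on the population. The function name `has_converged` and the default threshold of 80% are assumptions made for illustration; the text only requires that "a defined number" of the population hold the same string.

```python
from collections import Counter
from typing import Hashable, Sequence


def has_converged(population: Sequence[Sequence[Hashable]],
                  fraction: float = 0.8) -> bool:
    """True when at least `fraction` of the population holds the same gene string."""
    if not population:
        return False
    counts = Counter(tuple(gene) for gene in population)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count >= fraction * len(population)


# Eight of ten members share one string: converged at the 80% threshold.
mostly_same = [[2, 0, 1]] * 8 + [[0, 1, 2], [1, 2, 0]]
# Only seven of ten share it: not yet converged.
less_same = [[2, 0, 1]] * 7 + [[0, 1, 2], [1, 2, 0], [1, 0, 2]]
print(has_converged(mostly_same), has_converged(less_same))
```

Because mutation keeps a few members away from the common solution, the threshold is deliberately set below 100%.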
The search for the optimum matrix reordering can be coded as a GA by making each 
gene a string which defines how the naturally ordered network is to be reordered. Each 
gene in the population is initially given a random reordering. The selection, mutation and 
crossover operations are applied to produce new populations. Mutation simply shuffles some 
of the elements in a given gene and effectively produces a new random reordering. Crossover 
splices together two genes to give a child gene which encodes a different reordering. Care 
must be taken to ensure that the ordering specified by the child does not contain any 
repeated entries. The selection operation is the fitness function and this must assess which 
reorderings are good and discard those that are not good. This is the most difficult part of 
the proposed method and it is this aspect of the problem which requires further work. 
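One way of meeting the requirement that a child ordering contains no repeated entries is sketched below. It uses the well-known order crossover operator and a swap mutation, which are assumptions chosen for illustration and not necessarily the operators intended here; the function names are hypothetical. Node labels are assumed never to be `None`, and each ordering is assumed to hold at least two nodes.

```python
import random
from typing import List


def order_crossover(parent_a: List[int], parent_b: List[int],
                    rng: random.Random) -> List[int]:
    """Copy a random slice from parent_a, then fill the gaps in parent_b's order."""
    n = len(parent_a)
    lo, hi = sorted(rng.sample(range(n + 1), 2))   # slice [lo, hi) kept from parent_a
    child: List = [None] * n
    child[lo:hi] = parent_a[lo:hi]
    taken = set(parent_a[lo:hi])
    remaining = (node for node in parent_b if node not in taken)
    for i in range(n):
        if child[i] is None:
            child[i] = next(remaining)
    return child


def swap_mutation(order: List[int], rng: random.Random, swaps: int = 1) -> List[int]:
    """Exchange the positions of randomly chosen pairs of nodes."""
    mutated = list(order)
    for _ in range(swaps):
        i, j = rng.sample(range(len(mutated)), 2)
        mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(0)                  # seeded so that runs are repeatable
parent_a = list(range(8))               # natural ordering
parent_b = list(reversed(parent_a))     # a second, different ordering
child = order_crossover(parent_a, parent_b, rng)
mutant = swap_mutation(child, rng)
print(child, mutant)
```

Both operators return a valid permutation of the node list by construction, so no repair step is needed after reproduction.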
The fitness of a given ordering can be quantified in terms of the fills and critical path 
length, as in the program described in Section 2.7.6. The path length and 
fill-in can only be determined by simulating the elimination of nodes from that system and 
although this is quicker than actual elimination it is still a slow process. The simulated 
elimination must be performed for each gene in the population. Considering that hundreds 
of generations may be required to achieve convergence it is easy to see that this approach will 
lead to long run times and the benefits of using the optimal ordering may be outweighed by 
the computational overhead required to identify that ordering. What is needed is a different 
method of assessing fitness that does not rely on simulated elimination. Any such method 
must be fast so as to minimize the time to convergence. If convergence is still too slow the 
search could be accelerated by using a number of copies of the GA executing on different 
processors in a parallel machine. Any gene which fails the fitness test represents a location 
in the search space at which the optimum solution does not exist. If this information is 
shared between the parallel copies of the algorithm then the search space of each copy can be 
reduced and the optimum solution may be found more quickly. The increase in performance 
is produced by the combinatorial implosiveness [100] of the parallel GA method. 
7.3 Block-oriented Solution and Vector Processing 
Chapter 4 stated that it is possible to achieve a parallel solution of the linear equations in a 
block oriented manner rather than in the rowwise manner which has been adopted throughout 
this thesis. Consider the network of Figure 7.1(a) which consists of four subnetworks 
separated by a cutset. Figure 7.1(b) shows the BBDF coefficient matrix associated with 
this network. 
[Figure 7.1(a): four subnetworks connected through a central CUTSET. Figure 7.1(b): non-zero blocks of the BBDF matrix, as laid out below.]
     1,1                      1,5
          2,2                 2,5
               3,3            3,5
                    4,4       4,5
     5,1  5,2  5,3  5,4       5,5
Figure 7.1: Conceptual view of a simple four subnetwork system (a) and its BBDF matrix 
(b) 
The rowwise method of processing factorises the coefficient matrix into left and right 
hand factor matrices by applying the bifactorisation rules of Section 2.4.4 to individual 
matrix elements. For the matrix in Figure 7.1(b) the bifactorisation rules may be modified 
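such that they act on entire matrix blocks. The solution for the unknown vector, $x$, is still 
given by 
\[ x = R^{1} R^{2} \cdots R^{n} L^{n} \cdots L^{2} L^{1} b \tag{7.5} \]
but the rules for creating $L^{k}$ and $R^{k}$ are 
\[ L^{k}_{kk} = \left(A^{k-1}_{kk}\right)^{-1} \qquad L^{k}_{ik} = -A^{k-1}_{ik}\left(A^{k-1}_{kk}\right)^{-1} \qquad i = k+1 \ldots n \tag{7.6} \]
\[ R^{k}_{kj} = -\left(A^{k-1}_{kk}\right)^{-1} A^{k-1}_{kj} \qquad j = k+1 \ldots n \tag{7.7} \]
\[ A^{k}_{ij} = A^{k-1}_{ij} - A^{k-1}_{ik}\left(A^{k-1}_{kk}\right)^{-1} A^{k-1}_{kj} \qquad i,j = k+1 \ldots n \tag{7.8} \]
For a symmetrical matrix 
\[ R^{k}_{ki} = \left(L^{k}_{ik}\right)^{T} \tag{7.9} \]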
The left and right hand factor matrices can be obtained entirely by operating on the 
matrix at block level. Block manipulations simply require multiplication and additions of 
matrix blocks, which can be treated as matrices in their own right. Only one step is more 
complicated and that is the step which requires a multiplication of the form 
\[ C = A_{kk}^{-1} B \tag{7.10} \]
where $A_{kk}$, $B$ and $C$ are matrix blocks. It is not necessary to obtain the full inverse of $A_{kk}$ 
although it may be more efficient to do so if $A_{kk}$ consists mostly of non-zeros. The effect 
of $A_{kk}^{-1}$ can be obtained by factorising the matrix block $A_{kk}$ to yield the factored form, written here as $\tilde{A}_{kk}$, 
and the multiplication can be performed using the algorithm 
    loop i from 1 to n 
        $C_i = \tilde{A}_{kk} \cdot B_i$ 
    end i loop 
where $C_i$ and $B_i$ are the $i$th columns of $C$ and $B$ respectively. The multiplication by 
$A_{kk}^{-1}$ can thus be reduced to a matrix by matrix multiplication if the full inverse is used, or 
a series of $n$ matrix by vector multiplications if the factored form is used. Again the basic 
operation is that of block multiplications. 
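The column-by-column evaluation of $C = A_{kk}^{-1} B$ can be sketched as follows. This is an illustration under stated assumptions: the diagonal block is factorised with a plain LU decomposition without pivoting (so it is assumed to be well conditioned, for example diagonally dominant), and the factors are applied by forward and back substitution, in place of the bifactorised form used in this thesis. Dense list-of-lists storage and all function names are likewise assumptions.

```python
from typing import List

Matrix = List[List[float]]


def lu_factorise(a: Matrix) -> Matrix:
    """Return L and U packed into one matrix (unit diagonal of L not stored)."""
    n = len(a)
    lu = [row[:] for row in a]
    for k in range(n):
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]                 # multiplier, stored in L
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]  # update the remaining block
    return lu


def solve_column(lu: Matrix, b: List[float]) -> List[float]:
    """Apply the factored form to one column: forward then back substitution."""
    n = len(lu)
    y = b[:]
    for i in range(n):
        for j in range(i):
            y[i] -= lu[i][j] * y[j]
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] -= lu[i][j] * y[j]
        y[i] /= lu[i][i]
    return y


def apply_inverse(lu: Matrix, b_block: Matrix) -> Matrix:
    """C = A_kk^{-1} B, formed one column C_i at a time without inverting A_kk."""
    rows, cols = len(b_block), len(b_block[0])
    c_block = [[0.0] * cols for _ in range(rows)]
    for i in range(cols):                         # "loop i from 1 to n"
        column = [b_block[r][i] for r in range(rows)]
        solved = solve_column(lu, column)
        for r in range(rows):
            c_block[r][i] = solved[r]
    return c_block


a_kk = [[4.0, 1.0], [1.0, 3.0]]
b = [[1.0, 2.0], [0.0, 1.0]]
c = apply_inverse(lu_factorise(a_kk), b)
print(c)
```

Each pass of the loop in `apply_inverse` is independent of the others, so the columns of $C$ could also be computed in parallel or handed to a vector unit.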
The benefit of this method is its suitability for use with vector processors. A vector 
processor achieves high performance through the use of a number of pipelined arithmetic 
units. If there are n arithmetic units then an increase in performance of almost n times 
can be obtained. For example if two vectors are to be multiplied then the pipeline of arithmetic 
units performs the operations on individual vector elements and the multiplication is 
performed almost n times faster than on a non-pipelined (scalar) processor. The advantage 
of vector processors is that they have built in support for vector based operations. A vector 
by vector multiplication can be performed using a single machine instruction and it will be 
performed n times faster than on a scalar processor. If the matrix blocks in Figure 7.1(b) 
are assumed to be densely populated and stored as arrays then the block bifactorisation 
can be performed using only a small number of machine instructions. An increase in 
performance over a scalar processor will be observed due simply to the pipelined vector 
processor. An even bigger increase in performance may be achieved by using vector 
processors as the processing elements in a parallel machine. As the four subnetworks in 
Figure 7.1(b) are independent it is possible to process them in parallel and a parallel vector 
processor with 5 CPUs could be used to give a 3 or 4 fold increase over the performance of 
the single vector processor solution. The number of pipelined arithmetic units, n, depends 
on the processor used but it is typically less than 12 for a simple vector microprocessor. Hence 
a single vector microprocessor could theoretically give up to a 12 fold increase in performance 
over a scalar processor. Using a parallel vector processor could give as much as a 48 fold 
increase in performance over the best sequential method executed on a single processor 
scalar machine. 
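The 48 fold figure follows from multiplying the two upper limits quoted above. Writing $S_{vec}$ for the gain from pipelining and $S_{par}$ for the gain from processing the four subnetworks in parallel (both symbols are introduced here only for this illustration), and taking the most favourable value of each,
\[ S_{vec} \times S_{par} \le 12 \times 4 = 48 \]
This is an upper bound: it assumes that every block is dense enough to keep the pipeline full and that the cutset and interprocessor communication add no overhead.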
The disadvantage of vector processing is that it is only efficient if operations are per-
formed on dense vectors. When sparse matrices and vectors are used scalar processors are 
often found to be more efficient. To enable good speed-ups to be obtained from a paral-
lel vector processor it is necessary to ensure that all matrix blocks are sufficiently dense 
for the vector processor to operate efficiently. This is not a problem with the cutset as it 
generally tends to be densely populated. Unfortunately the matrix blocks associated with 
the subnetworks are more sparsely populated and it may be inefficient to operate on these 
blocks using vector methods. A hybrid parallel machine consisting of both scalar and vector 
processors can be envisaged. The scalar processors may be used to process the subnetwork 
blocks whilst a vector processor may be used to process the cutset blocks, giving a highly 
efficient solution. This approach has benefits for both standard parallel algorithms and for 
the Recursively Parallel algorithm. Under the RP scheme the main subnetworks would be 
processed using sparse matrix techniques on scalar processors whilst the vector processors 
could be used to process the minor subnetworks which constitute the cutset block. Little 
work has been done on using parallel vector machines and hybrid scalar/vector parallel 
machines for solving the linear equations associated with network problems. It would be 
interesting to investigate the performance of both standard and RP solution techniques on 
these architectures as great performance benefits may lie in store. 
7.4 Summary 
Although it has been demonstrated that the Recursively Parallel method exhibits good per-
formance there is always room for improvement in any method. This chapter has considered 
some suggestions for improving on the methods described in this thesis to make the Recur-
sively Parallel method more appropriate for use in power systems analysis applications. 
The method presented for partitioning the network prior to RP solution works well but 
relies on visual inspection by the user. This is time consuming and it would be advantageous 
to have an automatic method for partitioning the system. Based on the existing method 
which utilizes a weighted elimination tree, an heuristic bin packing method has been pro-
posed to partition systems for solution by a conventional parallel method. Unfortunately 
the situation is more complicated for the RP method as the heuristic partitioning must 
cope with multiple levels of parallelism and ensure that the chosen decomposition does not 
violate RBBDF constraints. Further work is required to develop the rules which will allow 
the automatic heuristic partitioning approach to cope with the constraints imposed by the 
RP solution. 
Near-optimal ordering strategies are often used to optimize the elimination process during 
triangular solution but these strategies are sensitive to the initial network ordering and 
the orderings that they produce can only be considered locally optimal. Within the 
space of all n! possible reorderings, some orderings may be found which are 'more optimal' 
than those produced by the near-optimal strategies and these can have a profound effect 
on improving the performance of the multiprocessor solution. It would be advantageous to 
find the (globally) optimum ordering if this does not require the expenditure of too much 
computational effort. An approach has been proposed which uses a genetic algorithm to 
rapidly search the space of all possible reorderings. The problem with the method is that a 
fitness function is required which can be used to assess the optimality of each potential solution 
produced by the genetic algorithm. Optimality is characterized by the path length and 
fill-in produced by the ordering and this can only be determined by performing a simulated 
elimination using that ordering. This is computationally intensive and it would require 
more computational effort to find the optimum ordering than would be saved by using that 
ordering during elimination. Further work needs to be undertaken to find a less intensive 
method of assessing the fitness of potential solutions. If such a method can be found, the 
genetic algorithm-based optimal ordering may be useful in improving the performance of 
all parallel methods. 
The RP method solves the network equations in a row-oriented manner but it is also 
possible to solve them in a block-oriented manner. Block-oriented solution is particularly 
suited to the use of vector processors which use pipelines of multiple arithmetic units to 
provide machine-level instructions for performing vector arithmetic operations. A vector 
microprocessor with a pipeline of n ALUs can, theoretically, show an n-fold increase in 
performance over an ordinary scalar microprocessor. A multiprocessor computer made 
from vector microprocessors could combine the performance benefits of vector computers 
with those of parallel computers to give speed-ups which are not possible from either vector 
or parallel processing alone. Such a computer could be used to step over the current 
performance limitations and to significantly reduce the time taken to solve large sets of 
linear equations. Unfortunately vector processing requires vectors and matrices to be stored 
as arrays and the use of array storage may detract from the possible performance benefits. A 
hybrid parallel scalar-vector computer can be envisaged in which parallel vector processors 
are used to solve the dense cutset block whilst parallel scalar processors are used to solve the 
remainder of the matrix. This type of architecture certainly deserves further investigation. 
Chapter 8 
Conclusions 
8.1 Conclusions 
This thesis has examined how the linear equations associated with network problems 
may be solved using parallel computing techniques. In particular, the solution of the 
linear equations arising from the study of large electrical power systems has been considered. 
For most real-time power system simulations it is necessary to solve these equations as fast 
as possible and the work described here has focussed on the development of a method which 
offers faster and more efficient solutions than existing parallel techniques. 
Parallel methods have been developed for the solution of general sets of linear equations 
and many more methods have been developed to solve the equations which arise from spe-
cific problems. The field of power system analysis is no exception to this and numerous 
researchers have attempted different methods of solving the network equations. Most of 
the methods use some form of Gaussian elimination and diakoptical techniques are used to 
partition the problem into independent parts. The fine details of the Gauss-based algorithm 
and the target parallel architecture are the main differences between the methods. Different 
Gauss methods allow different levels of parallelism to be exploited and the parallel architec-
ture can have an effect on the efficiency of the method. Shared memory machines can give 
better performance due to the lower interprocessor communication overheads but cheap, 
off-the-shelf distributed memory machines are widely available. From the user's point of 
view massively parallel computers are perhaps not appropriate for a power system simula-
tion due to the large capital outlay required in the purchase of such a machine. The aim of 
the work described here has been to develop a parallel solution method suitable for use with 
distributed memory machines. An array of Transputers has been used as the development 
platform but the solution has not been developed specifically for Transputers. This plat-
form has been used to verify the effectiveness of the method but any method which works 
on the Transputer array should also work on other platforms including distributed memory 
multiprocessors, distributed systems of workstations and multitasking sequential machines. 
Existing solutions are inefficient in that speed-up for either factorisation or substitution 
seldom exceeds a value of three or four and a large number of processors are required to 
achieve this speed-up. The aim was to develop a solution technique that would give greater 
speed-ups than existing methods with only a small number of processors. It was hoped 
that this could be achieved through the elimination of the speed-up saturation observed in 
existing solutions due to the existence of a large sequential section in the solution algorithm. 
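The saturation described above can be illustrated with the bound usually known as Amdahl's law. The following is a minimal sketch only: the function name and the sequential fraction of 0.25 are illustrative assumptions and are not measurements taken from this thesis.

```python
def amdahl_speedup(sequential_fraction: float, processors: int) -> float:
    """Upper bound on speed-up when a fixed fraction of the work is sequential.

    sequential_fraction is an assumed, illustrative parameter (0.0 to 1.0).
    """
    if not 0.0 <= sequential_fraction <= 1.0:
        raise ValueError("sequential_fraction must lie between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    parallel_fraction = 1.0 - sequential_fraction
    # The sequential part is not shortened; the parallel part is shared out.
    return 1.0 / (sequential_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # With 25% sequential work the speed-up can never reach 4,
    # however many processors are added.
    for p in (1, 2, 4, 8, 16, 64):
        print(p, round(amdahl_speedup(0.25, p), 2))
```

Under this assumption, adding processors beyond a handful gives rapidly diminishing returns, which is why reducing the size of the sequential section matters more than increasing the processor count.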
The development of the Recursively Parallel method has been described and, like existing 
methods, it is based on diakoptics and Gaussian elimination. The method has been devised 
by considering the structure of the coefficient matrix and the precedence relationships which 
arise from it. Using the insight provided by the elimination tree a novel coefficient matrix 
structure has been created by constraining the interconnection of subnetworks. This struc-
ture introduces greater independence into the treatment of cutset elements and allows the 
cutset to be processed in parallel as a number of subnetworks. The size of the sequential 
part of the method has been significantly reduced and more parallelism has been exploited. 
Partitioning the system for parallel solution is simplified by resorting to the use of the 
elimination tree and analyzing the complexity of processing nodes. It has been demonstrated 
that subtrees are equivalent to subnetworks and selecting subtrees is equivalent to 
partitioning the network into subnetworks. Load balancing has been introduced as an important 
issue in parallel computing and the need for achieving a balanced load has been 
identified. Using the weighted elimination tree reduces the problem of load balancing to one 
of selecting subtrees with appropriate weights. Using an aggregate weighting technique it is 
possible to find subtrees of the elimination tree with nearly equal weights and an ideal balanced 
load is achieved when the subtrees all have equal weights. Examining the partitioned 
elimination tree allows a prediction to be made for the speed-up that can be obtained by 
solving the system in parallel. Using this technique it has been possible to partition all the 
test systems described in this thesis. These test systems are based on real networks and 
the method is powerful enough to allow partitioning of any power system network. 
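The idea of aggregate subtree weights and speed-up prediction can be sketched as follows. This is a simplified illustration and not the algorithm of this thesis: the function names, the example tree and its weights are hypothetical, the chosen subtree roots are assumed to be disjoint, and the nodes left outside the chosen subtrees are assumed to be processed sequentially (the RP method itself processes the cutset in parallel).

```python
from typing import Dict, List, Optional


def aggregate_weights(parent: Dict[int, Optional[int]],
                      weight: Dict[int, float]) -> Dict[int, float]:
    """Aggregate weight of a node = its own weight plus that of all descendants."""
    total = {node: 0.0 for node in parent}
    for node, w in weight.items():
        current: Optional[int] = node
        while current is not None:      # add w to the node and every ancestor
            total[current] += w
            current = parent[current]
    return total


def predicted_speedup(parent: Dict[int, Optional[int]],
                      weight: Dict[int, float],
                      subtree_roots: List[int]) -> float:
    """Estimate speed-up when each chosen subtree runs on its own processor.

    Assumption (simplification): nodes outside the chosen subtrees are
    processed sequentially after the slowest subtree has finished.
    """
    agg = aggregate_weights(parent, weight)
    sequential_time = sum(weight.values())
    subtree_times = [agg[r] for r in subtree_roots]
    remainder = sequential_time - sum(subtree_times)
    parallel_time = max(subtree_times) + remainder
    return sequential_time / parallel_time


if __name__ == "__main__":
    # Hypothetical elimination tree: node 6 is the root, 4 and 5 are its children.
    parent = {0: 4, 1: 4, 2: 5, 3: 5, 4: 6, 5: 6, 6: None}
    weight = {0: 2.0, 1: 2.0, 2: 3.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 2.0}
    print(aggregate_weights(parent, weight))
    print(predicted_speedup(parent, weight, [4, 5]))
```

In this example the two subtrees have equal aggregate weights, so the load is balanced and the predicted speed-up is limited only by the work remaining at the root.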
The performance of the RP method depends heavily on the efficiency of the data struc-
tures and algorithms used in its implementation. Suitable data structures and outline algo-
rithms of efficient parallel tasks have been presented throughout the course of the discussion. 
A hybrid data storage scheme has been proposed and this allows maximum performance 
to be obtained by tailoring the data storage and processing to the characteristics of the 
individual system. This hybrid parallel data structure has suggested a way of improving 
the performance of existing sequential solution methods. In addition it is also possible to 
optimize the network partitions and data structures for either the factorisation or substi-
tution phases of the solution thereby improving the performance of these operations. An 
optimal assignment of tasks to processors has been derived to minimize interprocessor communications 
and the number of processors required whilst still allowing for easy scalability. It 
has also been demonstrated that the performance of the method is independent of the tar-
get architecture if a pipeline can be embedded in that architecture. This is true for most 
topologies apart from trees. A method of visualising the operation of the parallel program 
has also been described along with a strategy for accurately timing the execution of the 
solution and events within the solution. 
The performance of the new method is very encouraging. Simulations of the method 
have shown its effectiveness, producing speed-ups which are better than those of many 
other LU-based solution methods when using the same number of processors. The method 
is effective even with a large number of processors and scales easily from a small number 
of processors to a larger number of processors. An actual parallel implementation on an 
array of Transputers concurs with the results of the simulations. The resulting speed-ups 
are similar to those predicted by both the theory and the simulation. Whilst the overall 
speed-ups and factorisation speed-ups are significantly better than those obtained from 
existing methods, the substitution speed-ups are somewhat disappointing but still show 
an improvement over other methods. This highlights the difficulties of speeding up the 
substitution phase. Highly efficient sequential substitution algorithms already exist and due 
to the short execution times of these algorithms it is difficult to exploit parallelism without 
introducing overheads which detract from performance. When the differences in the data 
structures used by the parallel and sequential solutions are ignored the performance of the 
substitution phase becomes similar to that of the overall and factorisation performance. 
Although saturation is still observed it is less significant than in other methods and is 
primarily due to the overheads associated with interprocessor communication. 
It has been demonstrated that the performance of both the simulated and actual parallel 
implementations is dependent on the load balance. Static load balancing is employed by the 
RP method and the load balance is adjusted by altering the partitioning of the network. The 
ability to predict speed-up based upon an analysis of the partitioned elimination tree allows 
the load balance to be assessed prior to solution. The adjustments to the load balancing 
strategy presented in Chapter 6 allow the optimum network partitioning, and hence load 
balance, to be easily determined. 
The major benefit of the RP method is that it is more efficient than existing solution 
methods as fewer processors are required to yield comparable speed-ups. This property 
of the method offers important economic advantages for the user. The strategy used for 
assigning tasks to processors offers the advantage of easy scalability. The architectural 
independence of the RP method allows it to be implemented on the simplest of architectures, 
or on more complicated ones. The fact that the method works as well on the simple pipeline 
as it does on any other architecture is advantageous as the pipeline topology is cheap and 
easy to implement. Although the RP method works well on a distributed memory machine 
it should be simple to adapt the RP solution to work with shared memory machines. The 
method is also suitable for implementation on a distributed network of workstations. This 
is important when it is considered that the RP method is designed to be the engine for 
solving the linear equations in a power system simulation. Whilst the performance of the 
method on networked workstations will be poorer than on a dedicated multiprocessor, it 
may still be acceptable for simulation applications. If this is the case then the end user of 
the simulation would not necessarily have to invest in a dedicated multiprocessor but could 
make use of existing networked computing facilities. 
Although the RP method performs as expected there is always room for improvement 
and suggestions have been made for further work on the method. If the RP method is to 
be used as a solution engine in power system analysis software then it will be necessary 
to implement an automatic method of partitioning the system. It has been shown that 
the elimination ordering has an effect on the performance of the solution and it would be 
advantageous to find a fast method for locating the optimum ordering for a given system. 
Parallel processing techniques have been successfully applied to speed up the solution 
of the network equations but vector processors also show promise for accelerating solutions. 
Significant increases in the speed of solution could be achieved by combining the techniques 
of parallel processing and vector processing through the use of a parallel vector processor 
architecture and this approach to solution deserves further research. 
In conclusion, the RP method proposed here has proved to be an effective method for 
solving linear network equations in parallel. It is more efficient and produces better speed-ups 
than many other parallel solutions. For the method to be of any use in a real-time 
power system simulation it must solve the network equations as fast as possible on each 
iteration, and each iteration must be completed in a time shorter than the integration step of 
the simulation. Given that the integration step is of the order of 1 second and that the RP 
method solves a 1624 node network in 36 milliseconds, the RP method is certainly suitable 
for use as part of a real-time power system simulator. 
Bibliography 
[1] C.A. Gross, Power Systems Analysis. Wiley, second ed., 1986. 
[2] A. Brameller, R.N. Allan, and Y.M. Hamam, Sparsity - Its practical application to 
systems analysis. Pitman, 1976. 
[3] W.F. Tinney and J.W. Walker, "Direct solutions of sparse network equations by 
optimally ordered triangular factorisation," Proceedings of the IEEE, November 1967. 
[4] B. Stott, "Power system dynamic response calculations," Proceedings of the IEEE, 
pp. 219-241, February 1979. 
[5] Balu et al., "Online power system security analysis," Proceedings of the IEEE, pp. 260-280, 
February 1992. 
[6] I. Susmago et al., "Development of a large scale dispatcher training simulator and 
training results," IEEE Transactions on Power Systems, pp. 67-75, May 1986. 
[7] H. Biglari et al., "A dispatcher training simulator design with multi-purpose interfaces," 
IEEE Transactions on Power Apparatus and Systems, pp. 1276-1280, June 
1985. 
[8] R. Podmore et al., "An advanced dispatcher training simulator," IEEE Transactions 
on Power Apparatus and Systems, pp. 17-25, January 1982. 
[9] J. Arrillaga and C.P. Arnold, Computer Analysis Of Power Systems. Wiley, 1990. 
[10] C. Lazou, Supercomputers And Their Use. Oxford Science Publications, 1988. 
[11] A. Trew and G. Wilson, eds., Past, Present and Parallel: A survey of available parallel 
computing systems. Springer-Verlag, 1991. 
[12] G. Sabot, The Paralation Model - Architecture independent parallel programming. 
MIT Press, 1988. 
[13] G.S. Almasi and A.J. Gottlieb, eds., Highly Parallel Computing. Benjamin/Cummings, 
second ed., 1994. 
[14] J.P. Hayes, Computer Architecture And Organisation. McGraw-Hill, 1988. 
[15] K.H. Hwang and F.A. Briggs, Computer Architecture And Parallel Processing. 
McGraw-Hill, 1984. 
[16] M.J. Flynn, "Very high speed computing systems," Proceedings of the IEEE, pp. 1901-1909, 
1966. 
[17] K.H. Hwang, Advanced Computer Architecture - Parallelism, Scalability, Programmability. 
McGraw-Hill Series In Computer Science, McGraw-Hill, 1993. 
[18] A.L. DeCegama, Parallel Processing Architectures And VLSI Hardware, vol. 1. Prentice 
Hall, 1989. 
[19] T.G. Lewis and H. El-Rewini, Introduction To Parallel Computing. Prentice Hall, 
1992. 
[20] J. Hinton and A. Pinder, Transputer Hardware And System Design. Prentice Hall 
International, 1993. 
[21] I. Graham and T. King, The Transputer Handbook. Prentice Hall, second ed., 1990. 
[22] R. Taylor, Selected Notes On Transputers. University of York, 1991. 
[23] M. Minsky and S. Papert, "On some associative parallel and analog computations," 
in Associative Information Techniques (E.J. Jacks, ed.), Elsevier, 1971. 
[24] G.A. Amdahl, "Limits of expectation," International Journal of Supercomputing, 
pp. 88-94, Spring 1988. 
[25] E. Gelenbe, Multiprocessor Performance. Wiley Series In Parallel Computing, Wiley, 
1989. 
[26] A. Abur, "A parallel scheme for the forward/backward substitutions in solving sparse 
linear equations," IEEE Transactions on Power Systems, pp. 1471-1478, November 
1988. 
[27] G. Cafaro, P. Pugliese, and F. Vacca, "Parallel solution of torn network equations," 
Electrical Power and Energy Systems, pp. 131-138, July 1984. 
[28] I.C. Decker, D.M. Falcão, and E. Kaszkurewicz, "Parallel implementation of a power 
system dynamic simulation methodology using the conjugate gradient method," IEEE 
Transactions on Power Systems, pp. 458-465, February 1992. 
[29] T. Berry, A.R. Daniels, and R.W. Dunn, "Parallel processing of sparse power system 
equations," IEE Proceedings C, pp. 68-74, January 1994. 
[30] A.A. El-Kieb, H. Ding, and D. Maratukulam, "A parallel load flow algorithm," Electric 
Power Systems Research, pp. 203-208, 1994. 
[31] J. Fong and C. Pottle, "Parallel processing of power system analysis problems via 
simple parallel microcomputer structures," IEEE Transactions on Power Apparatus 
and Systems, pp. 1834-1841, September/October 1978. 
[32] K. Lau, D.J. Tylavsky, and A. Bose, "Coarse grain scheduling in parallel triangular 
factorisation and solution of power system matrices," IEEE Transactions on Power 
Systems, pp. 708-714, March 1991. 
[33] S. Lin and J.E. Van Ness, "Parallel solution of sparse algebraic equations," IEEE 
Transactions on Power Systems, vol. 9, pp. 1117-1125, May 1994. 
[34] M. Rafian, M.J.H. Sterling, and M.R. Irving, "Parallel processor algorithm for power 
system simulation," IEE Proceedings Part C, pp. 285-290, July 1988. 
[35] D. Yu and H. Wang, "A new parallel LU decomposition method," IEEE Transactions 
on Power Systems, pp. 303-310, February 1990. 
[36] T. Berry, K.W. Chan, and R.W. Dunn, "A parallel computer algorithm for real-time 
electro-mechanical transient power system simulation," in Proceedings of 27th Universities 
Power Engineering Conference, vol. 1, pp. 32-12, University of Bath, 1992. 
University of Bath, ENGLAND, 23-25 September 1992. 
[37] F.L. Alvarado, D.C. Yu, and R. Betancourt, "Partitioned sparse A^{-1} methods," IEEE 
Transactions on Power Systems, pp. 452-459, May 1990. 
[38] M. La Scala, G. Sblendorio, and R. Sbrizzai, "Parallel in-time implementations of transient 
stability simulations on a Transputer network," IEEE Transactions on Power 
Systems, pp. 1117-1125, May 1994. 
[39] G.P. Granelli, M. Montagna, M. La Scala, and F. Torelli, "Relaxation-Newton methods 
for transient stability analysis on a vector/parallel computer," IEEE Transactions 
on Power Systems, pp. 637-643, May 1994. 
[40] S.Y. Lee, H.D. Chiang, K.G. Lee, and B.Y. Ku, "Parallel power system transient 
stability analysis on hypercube multiprocessors," in Proceedings of the IEEE Power 
Industry Computer Applications Conference, May 1989. 
[41] F. Sato, A.V. Garcia, and A. Monticelli, "Parallel implementation of probabilistic 
short-circuit analysis by the Monte-Carlo approach," IEEE Transactions on Power 
Systems, no. 2, pp. 826-832, 1994. 
[42] A.A. El-Kieb, J. Nieplocha, H. Singh, D.J. Maratukulam, M.K. Celik, and A. Abur, 
"A decomposed state estimation technique suitable for parallel processor implementation," 
IEEE Transactions on Power Systems, no. 3, pp. 1088-1097, 1992. 
[43] D.M. Falcão, E. Kaszkurewicz, and H.L.S. Almeida, "Application of parallel processing 
techniques to the simulation of power-system electromagnetic transients," IEEE 
Transactions on Power Systems, no. 1, pp. 90-96, 1993. 
[44] A. Padilha and A. Morelato, "A W-matrix methodology for solving sparse network 
equations on multiprocessor computers," IEEE Transactions on Power Systems, 
pp. 1023-1036, August 1992. 
[45] D.J. Tylavsky, S. Nagaraj, and P.E. Crouch, "Parallel-vector processing synergy and 
frequency domain transient stability simulations," Electric Power Systems, pp. 89-97, 
1993. 
[46] J.E. VanNess and G. Molina, "The use of multiple factoring in the parallel solution of 
algebraic equations," IEEE Transactions on Power Apparatus and Systems, pp. 3433-3438, 
October 1983. 
[47] H.H. Happ, Piecewise Methods And Applications To Power Systems. Wiley, 1980. 
[48] G. Kron, "A set of principles to interconnect the solutions of a physical system," 
Journal of Applied Physics, pp. 965-980, 1953. 
[49] J. Bialek, "Parallel solution of torn networks for power system simulation," in Pro-
ceedings of 6th International Conference on Present Day Problems In Power Systems, 
vol. 2, pp. 75-82, 1993. Gliwice, Poland. 
[50] IEEE Committee Report by a Task Force of the Computer and Analytical Meth-
ods Subcommittee of the Power Systems Engineering Committee, "Parallel Processing 
in Power Systems Computation," IEEE Transactions on Power Systems, pp. 629-638, 
May 1992. 
[51] B. Carre, Graphs and networks. Oxford applied mathematics and computing science 
series, Oxford University Press, 1979. 
[52] J.W.H. Liu, "The role of elimination trees in sparse factorisation," SIAM Journal of 
Matrix Analysis Applications, pp. 134-172, January 1990. 
[53] R. Schreiber, "A new implementation of sparse Gaussian elimination," ACM Trans-
actions on Mathematical Software, pp. 256-276, 1982. 
[54] J.A. Jess and G.H. Kees, "A data structure for parallel LU decomposition," IEEE 
Transactions on Computing, pp. 231-239, March 1982. 
[55] W.F. Tinney, V. Brandwajn, and S.M. Chan, "Sparse vector methods," IEEE Trans-
actions on Power Apparatus and Systems, pp. 295-301, February 1985. 
[56] R. Betancourt, "Efficient parallel processing algorithm for inverting matrices with 
random sparsity," lEE Proceedings Part E, pp. 235-240, July 1986. 
[57] A.J.E. Taylor, Techniques For Power System Simulation Using Multiple Processors. 
PhD thesis, University of Durham, UK, 1990. 
[58] R. Sedgewick, Algorithms, 2nd edition. Addison-Wesley, 1988. 
[59] I . Peterson, The Mathematical Tourist - Snapshots of modern mathematics. W. H. 
Freeman & Co., 1988. 
[60] I.S. Duff, A.M. Erisman, and J.K. Reid, Direct Methods For Sparse Matrices. Oxford 
Press, 1986. 
[61] A. Kelley and I . Pohl, A Book on C. Benjamin/Cummings, second ed., 1984. 
[62] A. Jennings and J.J. McKeown, Matrix Computations. Wiley, second ed., 1992. 
[63] V. Kumar, A. Grama, A. Gupta, and G. Karypis, Introduction To Parallel Computing 
- Design and analysis of algorithms. Benjamin/Cummings, 1994. 
[64] R.L. Burden and J.D. Faires, Numerical Analysis. PWS-Kent, fourth ed., 1988. 
[65] G.H. Golub and C.F. Van Loan, Matrix Computations. North Oxford Academic, 
first ed., 1983. 
[66] M.J. Quinn, Designing Efficient Algorithms for Parallel Computers. McGraw-Hill 
Series in Supercomputing and Artificial Intelligence, McGraw-Hill, 1987. 
[67] M.T. Heath, E. Ng, and B.W. Peyton, "Parallel algorithms for sparse linear systems," 
SIAM Review, pp. 420-460, September 1991. 
[68] D.J. Tylavsky, "Quadrant interlocking factorization: a form of block LU factoriza-
tion," Proceedings of the IEEE, 1986. 
[69] J. Bialek and D.J. Grey, "The application of clustering and factorisation tree techniques 
for the parallel solution of sparse network equations," IEE Proceedings Part 
C, pp. 609-616, November 1994. 
[70] J.S. Chai and A. Bose, "Bottlenecks in parallel algorithms for power system stability 
analysis," IEEE Transactions on Power Systems, pp. 9-15, February 1993. 
[71] A. George and W.H. Liu, "An optimal algorithm for symbolic factorization of sym-
metric matrices," SIAM Journal of Computing, pp. 583-593, 1980. 
[72] A. George and W.H. Liu, Computer Solution of Large Sparse Positive Definite Sys-
tems. Prentice-Hall, 1981. 
[73] G.A. Geist and E. Ng, "Task scheduling for parallel sparse Cholesky factorisation," 
International Journal of Parallel Processing, no. 4, pp. 291-313, 1989. 
[74] J.W.H. Liu, "Reordering sparse matrices for parallel elimination," Parallel Comput-
ing, pp. 73-91, 1989. 
[75] F.F. Wu, "Solution of large scale networks by tearing," IEEE Transactions On Cir-
cuits and Systems, pp. 706-713, December 1976. 
[76] L.O. Chua and L.K. Chen, "Diakoptic and generalized hybrid analysis," IEEE Trans-
actions on Circuits and Systems, pp. 706-713, December 1976. 
[77] Z. Homing, X. Niande, and S. Wang, "Unified piecewise solution of power system 
networks combining both branch cutting and node tearing," Electrical Power and 
Energy Systems, pp. 238-288, October 1989. 
[78] I.S. Duff, "A survey of sparse matrix research," Proceedings of the IEEE, pp. 500-535, 
April 1977. 
[79] J. Bialek, D.J. Grey, and J.R. Bumby, "Parallel decomposed network solution method 
for power system simulation," in Proceedings of 27th Universities Power Engineering 
Conference, vol. 1, pp. 193-196, University of Bath, 1992. University of Bath, ENG-
LAND, 23-25 September 1992. 
[80] M.R. Irving and M.J.H. Sterling, "Optimal network tearing using simulated annealing," 
IEE Proceedings Part C, pp. 69-72, January 1990. 
[81] R.W. Dunn. Personal Communication, 1993. Stafford. 
[82] C. Ashcraft, S.C. Eisenstat, and J.W.H. Liu, "A fan-in algorithm for distributed 
sparse numerical factorisation," SIAM Journal on Scientific and Statistical Comput-
ing, pp. 593-599, 1990. 
[83] F. Herman and L. Snyder, "On mapping parallel algorithms into parallel architec-
tures," Journal of Parallel and Distributed Computing, pp. 439-458, 1987. 
[84] R.L. Graham, "Bounds on multiprocessor timing anomalies," SIAM Journal of Ap-
plied Mathematics, pp. 416-429, March 1966. 
[85] G.F. Coulouris and J. Dollimore, Distributed Systems. Addison Wesley International 
Computer Science Series, Addison Wesley, 1988. 
[86] T. Oyama, T. Kitshara, and Y. Serizana, "Parallel processing for power system analy-
sis using band matrix," IEEE Transactions on Power Systems, pp. 1010-1016, August 
1990. 
[87] J. Bialek and D.J. Grey, "An automatic clustering algorithm using factorisation tree 
for parallel power system simulation," in Proceedings of MELECON 1994, 1994. 
[88] J.A. George, J.W.H. Liu, and E.G.Y. Ng, "Communication results for parallel sparse 
Cholesky factorisation on a hypercube," Parallel Computing, pp. 287-298, 1989. 
[89] M.A. Weiss, Data Structures And Algorithms. Benjamin/Cummings, 1992. 
[90] H.S. Will, Algorithms And Complexity. Prentice-Hall, 1986. 
[91] A. Burns and A. Wellings, Real-time systems and their programming languages. 
Addison-Wesley International Computer Science Series, Addison-Wesley, 1989. 
[92] C.A.R. Hoare, "Communicating Sequential Processes," Communications of the ACM, 
pp. 666-677, 1978. 
[93] C.A.R. Hoare, Communicating Sequential Processes. Prentice-Hall International ?, 
1980. 
[94] P. Brinch Hansen, "Distributed Processes: A concurrent programming concept," ACM 
Transactions on Programming Languages, no. 4, pp. 405-430, 1978. 
[95] K.M. Chandy and S. Taylor, An introduction to parallel programming. Jones and 
Bartlett Publishers, 1992. 
[96] J. Bialek and D.J. Grey, "A mutated tree architecture for real-time parallel power 
system simulation," in Proceedings of 28th Universities Power Engineering Conference, 
vol. 2, pp. 458-461, University of Staffordshire, 1993. University of Staffordshire, 
ENGLAND, 21-23 September 1993. 
[97] D.P. Helmbold and C.E. McDowell, "Modeling Speedup (n) greater than n," IEEE 
Transactions on Parallel and Distributed Systems, pp. 250-256, April 1990. 
[98] D. Parkinson, "Parallel efficiency can be greater than unity," Parallel Computing, 
pp. 261-262, 1986. 
[99] B.W. Weide, "Modeling unusual behaviour of parallel algorithms," IEEE Transactions 
on Computers, pp. 1126-1130, November 1982. 
[100] W.A. Kornfeld, "Combinatorially implosive algorithms," Communications of the 
ACM, pp. 734-738, October 1982. 
[101] L.A. Crowl, "How to measure, present and compare parallel performance," IEEE 
Parallel and Distributed Technology, pp. 9-25, Spring 1994. 
[102] D.E. Goldberg, Genetic Algorithms. Addison-Wesley, 1989. 
[103] Inmos, "IMS D7314A ANSI C Compiler Language Reference Manual," 1992. 
[104] W.D. Stephenson, Elements Of Power System Analysis. McGraw-Hill, third ed., 1975. 
[105] B.M. Weedy, Electric Power Systems. Wiley, third ed., 1987. 
[106] H. El-Rewini and T.G. Lewis, "Scheduling parallel program tasks onto arbitrary target 
machines," Journal of Parallel And Distributed Computing, pp. 138-153, 1990. 
[107] N.T. Karonis, "Timing parallel programs that use message passing," Journal of Par-
allel And Distributed Computing, pp. 29-36, 1992. 
Appendix A 
The INMOS Transputer 
The INMOS Transputer first entered the microprocessor marketplace in 1982 with the 
release of the T212 Transputer. It was designed to be a general purpose microprocessor 
which could be used as a building block for creating high performance parallel processing 
systems. Nowadays the transputer is a cheap component that can be connected into large 
arrays to form a high performance machine. Taylor [22] notes that the following three main 
features can be identified in the design of the transputer: 
• Support for multitasking so that many logically concurrent tasks may reside on the 
same processor. 
• Support for synchronous communications so that tasks on the same or different pro-
cessors may communicate efficiently with one another. 
• Modular design and support for the construction of systems of processors. 
An additional feature not explicitly noted by Taylor is that of creating a hardware platform 
for an implementation of Hoare's [92, 93] Communicating Sequential Processes paradigm of 
parallel computing. 
The architecture of the transputer is such that it allows the size of a system to be 
easily scaled. The multi-tasking scheduler built into the chip allows multiple tasks to be 
executed on a single transputer. Large parallel programs may be executed on one or many 
transputers. 
The communication facilities that are provided allow synchronous communication between 
tasks on the same or different processors. The synchronous nature of the communication 
primitives requires messages to be acknowledged and this gives a reliable protocol 
for data transfer. Complete multiprocessor systems are built simply by interconnecting 
the communications links of the constituent transputers. Communications are point-to-
point and each serial link consists of two unidirectional communication channels, one in 
each direction. The links operate concurrently with the processor in order to maximize 
performance and the maximum data transfer rate is 20 Mbit/sec. 
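The acknowledged, synchronous style of communication described above can be sketched in software. The example below is an illustration only: it uses Python threads as a stand-in for tasks, the class name SyncChannel is hypothetical, and it is not transputer code. The sender does not continue until the receiver has taken the message.

```python
import threading


class SyncChannel:
    """Hypothetical point-to-point channel with rendezvous semantics."""

    def __init__(self) -> None:
        self._send_lock = threading.Lock()          # one sender at a time
        self._item_ready = threading.Semaphore(0)   # signals "message present"
        self._item_taken = threading.Semaphore(0)   # the acknowledgement
        self._item = None

    def send(self, value) -> None:
        with self._send_lock:
            self._item = value
            self._item_ready.release()
            # Block until the receiver acknowledges that it has the message.
            self._item_taken.acquire()

    def receive(self):
        self._item_ready.acquire()    # wait for a sender
        value = self._item
        self._item_taken.release()    # acknowledge, releasing the sender
        return value


def demo() -> list:
    """Send three values from one thread and receive them in another."""
    channel = SyncChannel()
    sender = threading.Thread(
        target=lambda: [channel.send(i) for i in range(3)])
    sender.start()
    received = [channel.receive() for _ in range(3)]
    sender.join()
    return received


if __name__ == "__main__":
    print(demo())
```

Because each send waits for its acknowledgement, no message can be overwritten or lost, which is the sense in which the protocol is reliable.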
To address the modular design issue the transputer incorporates many of the supporting 
systems required by other microprocessors onto the same piece of silicon as the processor 
itself [20]. Each transputer consists of a microprocessor, serial communication links, fast 
cache memory, external memory interfacing, floating point coprocessor, real-time clocks 
and a hardware implemented multi-tasking scheduler. As so much is implemented on the 
transputer itself, systems can be built with only a few external components. Printed circuit 
boards can be small and power consumption is lower than in many comparable designs 
which use other processors. All the resources required by a transputer were designed to be 
local and not globally shared, eliminating the need for complicated bus interfaces. In fact 
the only external signals a transputer needs to enable it to boot and run a program from an 
attached ROM are the power, ground and clock signals. This makes the building of large 
systems straightforward. 
A.1 The Architecture of the Transputer 
The first generation of transputers, known as the Txxx family of processors, are essentially 
similar in their design. Although three distinct generations exist with slightly different char-
acteristics, all generations share the same basic architecture and communication interfaces 
and have similar instruction sets. Figure A.1 illustrates the basic architecture of the Txxx 
series. In recent months the first of the second generation transputers, the T9000, has been 
released to market. This has a different architecture to the Txxx family and is not directly 
compatible with the earlier processors. It claims to give an order of magnitude greater 
performance over the Txxx family but problems are being experienced with this processor 
and it does not yet perform as well as expected. As it has only just been released it was 
not available for use in this research project and will not be considered further. 
Figure A.1: Basic internal architecture of a Transputer (blocks shown: RAM, Interface, Processor and four Link Interfaces) 
The Transputer processors are based around the RISC philosophy. Each instruction is 
8 bits long, consisting of a 4 bit instruction code and a 4 bit operand. This allows 78% 
of all instructions to be coded in a single byte [22] and each memory access fetches 3 or 4 
instructions. This makes the transputer's instruction set extremely efficient. The transputer 
also has a minimal register set which speeds up the context switching that occurs during 
multitasking. 
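The single-byte instruction format can be illustrated by splitting a byte into its two 4 bit fields. This is a sketch under stated assumptions: the function names are hypothetical, and placing the instruction code in the upper four bits and unpacking a 32 bit word in little-endian byte order are assumptions that are not stated in the text above.

```python
from typing import List, Tuple


def decode_instruction(byte: int) -> Tuple[int, int]:
    """Split one 8 bit instruction into (instruction code, operand).

    Assumption: the code occupies the upper 4 bits, the operand the lower 4.
    """
    if not 0 <= byte <= 0xFF:
        raise ValueError("an instruction must fit in one byte")
    return byte >> 4, byte & 0x0F


def decode_word(word: int) -> List[Tuple[int, int]]:
    """Decode the four instruction bytes held in one 32 bit word.

    Assumption: bytes are taken in little-endian order.
    """
    return [decode_instruction(b) for b in word.to_bytes(4, "little")]


if __name__ == "__main__":
    print(decode_instruction(0x4A))
    print(decode_word(0x12345678))
```

A single 32 bit memory access therefore delivers up to four complete instructions, which is the source of the efficiency noted above.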
Depending upon the processor, 2, 4 or 8 KBytes of memory are included onboard the 
transputer. This memory is single cycle static memory and access to it is extremely fast. 
External memory addresses are contiguous with internal addresses and to improve per-
formance the external memory interface is only utilized when an invalid internal memory 
address is generated. The external memory interface is designed such that external mem-
ory systems may be connected with the minimum of extra components. Given the compact 
instruction set and fast internal memory it is possible to use transputers without external 
memory, although this severely limits program size and is insufficient for many applications. 
Each transputer has four serial communications links and each link may be connected
to any link of any other processor. Communication channels (Section A.2.1) between
processors are implemented across these links. The hardware links themselves are implemented
using direct memory access to allow high speed data transfer concurrently with the
operation of the processor. Communication between tasks on the same processor simply requires
the copying of information held in the memory. This is implemented using an efficient block
move instruction in the instruction set.
Most computer systems provide multitasking through the use of a software scheduling 
kernel. The transputer implements this multitasking scheduler in hardware for maximum 
efficiency. Executable tasks are arranged into two process queues; one for low priority pro-
cesses and one for high priority processes. Low priority tasks are given a 1024 microsecond 
timeslice whilst high priority processes are allowed to execute until they become blocked. 
Tasks may block if they are waiting for a communication, waiting on a timer or waiting 
on an interrupt. When a task blocks, or its timeslice expires, it is placed at the bottom of
its process queue and the task at the head of the process queue then executes. If the high 
priority process queue becomes non-empty at any time then the low priority task currently 
executing is pre-empted and the high priority task is executed until it blocks. If any further 
high priority tasks are available they will run until the high priority queue is empty. At this 
point the processor returns to handling the low priority tasks.
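The scheduling rules described above can be illustrated with a small simulation. This is only a sketch under stated assumptions and is not the transputer's hardware algorithm: the function name `simulate`, the task tuple layout and the choice to resume a pre-empted low priority task ahead of the other low priority tasks are all hypothetical, and a task is taken to block only once, when its work is finished.

```python
from collections import deque

TIMESLICE_US = 1024  # low priority timeslice quoted in the text


def simulate(tasks):
    """Return a list of (name, start_us, end_us) execution intervals.

    tasks: list of (name, priority, arrival_us, work_us), with priority
    either "high" or "low". A task blocks for good once its work is done.
    """
    pending = sorted(tasks, key=lambda t: t[2])      # not yet arrived
    queues = {"high": deque(), "low": deque()}       # ready queues
    remaining = {name: work for name, _, _, work in tasks}
    trace, now = [], 0

    def admit():
        # Move every task that has arrived by `now` to the back of its queue.
        while pending and pending[0][2] <= now:
            name, priority, _, _ = pending.pop(0)
            queues[priority].append(name)

    while pending or queues["high"] or queues["low"]:
        admit()
        preempted = False
        if queues["high"]:
            # High priority: run until the task blocks (here: finishes).
            name = queues["high"].popleft()
            run = remaining[name]
        elif queues["low"]:
            # Low priority: run for one timeslice at most, or less if a
            # high priority task arrives first and pre-empts it.
            name = queues["low"].popleft()
            run = min(remaining[name], TIMESLICE_US)
            next_high = [t[2] for t in pending if t[1] == "high"]
            if next_high and next_high[0] - now < run:
                run = next_high[0] - now
                preempted = True
        else:
            now = pending[0][2]                      # idle until next arrival
            continue
        trace.append((name, now, now + run))
        now += run
        remaining[name] -= run
        if remaining[name] > 0:
            if preempted:
                queues["low"].appendleft(name)       # assumption: resumes next
            else:
                admit()
                queues["low"].append(name)           # timeslice expired
    return trace
```

Calling `simulate` with two low priority tasks and a high priority task that arrives part way through a timeslice shows the low priority task being interrupted and then resumed once the high priority queue is empty.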
The transputer has two hardware timers onboard which can be accessed by the pro-
grammer. These timers are autonomous and run concurrently with the processor. The high 
priority timer ticks every 64 microseconds whilst the low priority timer ticks every 1024 
microseconds. These timer speeds are independent of the transputer's clock speed. If a 
high priority task makes a call to the timer then it accesses the high priority timer whilst 
low priority tasks access the low priority timer. The timers are used by the scheduler to 
control timeslicing and may be used by the programmer to time events or implement delays.
A.1.1 The T2 Family 
The T2 family of Transputers consists of the T212, T222 and T225 processors. The basic
architecture is that of Figure A.1 and all members of this family are 16 bit processors. The
T212 was the first Transputer released and has 2KB onboard RAM, 10 MBit/sec link speed
and 20 MHz clock speed. The T222 has 4KB onboard RAM, 20 MHz clock speed and 20
MBit/sec link speed. The T225 processor was the last in this family and is identical to the
T222 but with a clock speed of 30 MHz. This gave the T225 a bidirectional data transfer
rate across its links of 2.4 MByte/sec and the processor can achieve 0.06 MFLOPS.
Figure A.2: Basic internal architecture of the T8 Transputer family
[Block diagram: floating-point unit, 32-bit processor, system services, timers, 4 Kbytes of on-chip RAM, external memory interface, link services, four link interfaces and event channel]
A.1.2 The T4 Family
The T4 family consists of the T414 and T425 processors. The basic architecture is again
similar to that of Figure A.1 but the T4 family are 32 bit processors. The instruction set
was modified to improve floating point performance. The T414 runs at a clock speed of
20 MHz and incorporates 2KB of onboard RAM. The communication links operate at 20
MBit/sec. The T425 is capable of running at 30 MHz and has 4KB of onboard RAM.
Enhanced links are incorporated which are capable of a bidirectional transfer rate of 2.4
MByte/sec. An enhanced instruction set improves error checking. The T425 is capable of
30 MIPS and 0.13 MFLOPS.
A.1.3 The T8 Family
The 32-bit T8 family is the last of the first generation of Transputers and includes the T800,
T801 and T805 processors. The architecture of this family includes a 64 bit floating point
coprocessor unit and the basic architecture is shown in Figure A.2. The floating point unit
(FPU) gives single and double precision floating point operations to the ANSI-IEEE 754
standard. The addition of the FPU makes this family of processors well suited to
computationally intensive scientific computing applications.
The T800 was the first processor of the T8 family and is basically a T425 with the
addition of the FPU and an expanded instruction set. Clock speed is 30 MHz and 2.4
MByte/sec link transfer rate is achievable.
The T805 is an improved T800 offering an enhanced memory interface and improved
interrupt handling. The instruction set is also enhanced to facilitate debugging. Running
at 30 MHz, the T805 is capable of 30 MIPS and 4.3 MFLOPS peak. The sustained rate for
floating point operations is 3.3 MFLOPS, making the T805 a powerful processor in its own
right.
Both the T800 and T805 have an external memory access rate limited to 40 MByte/sec
due to their multiplexed data and address busses. The T801 is a repackaged T805 which
has the address bus and data bus separated allowing an external memory access rate of 60
MByte/sec.
A.2 Programming the Transputer 
A.2.1 Tasks and Channels 
According to the CSP paradigm [92, 93], parallel programs are made up of sequential
modules called tasks. Each task is an autonomously executing unit but two tasks may
synchronise their operation or share data through communication. An explicit message passed
between the two tasks is used to transfer data and the communication automatically forces
them to synchronise their operations as neither task can continue until the communication
is complete. Tasks can run concurrently on separate processors or may reside on the same
processor and execute through timeshared multitasking. A typical transputer program
consists of both concurrent and timeshared tasks. A simple configuration file is used to describe
how tasks are placed on the available processors.
Communications between tasks are performed using channels. A channel is a logical,
unidirectional communication link which exists between two tasks. In an intertask
communication the sending task inserts data into the channel at one end and the receiving
task removes the data at the other end. Channels can exist between tasks on two different
processors or between tasks on the same processor. A channel between tasks on the same
processor is known as a soft channel and is implemented using memory copy instructions.
If the tasks reside on different processors then the channel is assigned to the physical
connection between the processors and is known as a hard channel. Each Transputer link can
accommodate two channels, one in each direction. As there are only four links per processor
it would appear that only four pairs of hard channels may be assigned by each processor. It
is possible to increase this number through the use of channel multiplexing and virtual
channels. As far as the programmer is concerned virtual channels are identical to hard channels
in their use and operation. Many virtual channels may be assigned by each processor and
more than two virtual channels may be assigned to each Transputer link. This is achieved
by multiplexing the data from the virtual channel communications and passing it through
the two hard channels which are assigned to the link. Demultiplexing at the receiving end
splits the data up again and maintains the appearance of virtual channels passing across
the link. When two processors which are not directly connected need to communicate, a
virtual channel between the two processors may be used. The virtual channel is
implemented by passing messages from the sender to the receiver via intermediate processors.
The latest versions of the INMOS Toolsets [103] incorporate virtual channels and provide
routing software which automatically handles communications involving processors which
are not directly linked.
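The multiplexing of several virtual channels on to one hard channel can be sketched as follows. This is an illustration of the idea only and not the INMOS Toolset implementation: the packet format (a channel identifier paired with each message), the round-robin service order and the function names `multiplex` and `demultiplex` are assumptions made for the example.

```python
from collections import defaultdict, deque


def multiplex(virtual_channels):
    """Interleave messages from several virtual channels onto one hard channel.

    virtual_channels maps a channel id to the list of messages queued on it.
    Each packet placed on the link is a (channel_id, message) pair.
    """
    queues = {cid: deque(msgs) for cid, msgs in virtual_channels.items()}
    link = []
    while any(queues.values()):
        for cid in sorted(queues):          # simple round-robin service
            if queues[cid]:
                link.append((cid, queues[cid].popleft()))
    return link


def demultiplex(link):
    """Split tagged packets back into per-channel message lists."""
    received = defaultdict(list)
    for cid, message in link:
        received[cid].append(message)
    return dict(received)
```

Because every packet carries its channel identifier, the receiving end recovers each virtual channel's messages in their original order, which is what preserves the appearance of separate channels across the single link.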
A.2.2 Programming Languages 
Occam was developed by INMOS as the language for programming the Transputer. It
provides a strict implementation of Hoare's Communicating Sequential Processes (CSP)
paradigm [92, 93] and allows all aspects of parallelism to be easily expressed. Occam
is a small and somewhat terse language that allows the parallelism in a problem to be
expressed in terms of jobs which must be performed in parallel and jobs which must be
performed sequentially. The concept of a task still exists in Occam but fine grain parallelism
is also possible as the language allows parallel execution of individual statements. In fact
Occam views a program as a hierarchical arrangement of tasks and communication between
tasks is synchronous via channels. Guarded commands are the other main feature of the
language and these are implemented through the ALT construct. A number of C and
Fortran compilers are available and compilers for other languages such as Ada, Modula-2,
Lisp and Parlog are available commercially, although these are not very widely used. All
the C and Fortran compilers provide extensions to standard versions of these languages
in terms of functions for exploiting parallelism and performing communications. Parallel
implementations of C are based on extensions to standard ANSI C whilst parallel versions
of Fortran are usually based on the Fortran 77 standard.
The INMOS C Toolset was used throughout this project and all the code written for
the transputer system was written in INMOS' implementation of parallel C. This is an
extended ANSI C which incorporates channel communication functions, task creation and
management functions, functions to control the scheduling of tasks and functions to access
the other Transputer features such as event handling and the hardware timers. Each parallel
task can be written as a separate C main() routine and compiled individually into binary
units. A configuration language is also provided to create the configuration files which
describe how the tasks in the system are connected to form the complete program. Once
the program has been configured the individual task binaries are linked to produce a single
binary program file which may be loaded on to the transputer array. The benefit of using the
C language is that existing sequential code written in C can easily be ported to the parallel
environment, providing a much quicker route to software implementation than through the
use of Occam. The latest version of the C Toolset [103] was used throughout this project
and this provided support for virtual channels and channel routing.
A.3 Building Parallel Systems with the Transputer 
A.3.1 The TRAM Standard
To enable the easy building of scaleable parallel systems INMOS have created a modular
system for building Transputer based machines. This standard is based around the use
of Transputer Applications Modules (TRAMs) and TRAM motherboards. The TRAM
is a small circuit board measuring 3.6 inches by 1.1 inches. Each TRAM hosts a single
Transputer, RAM and interfacing logic and is a complete computer in its own right. All
TRAMs have a simple 16 pin interface which allows them to be connected to a
motherboard and the power and control signals from the motherboard are passed through this 16
pin interface. Processors on different TRAMs are connected through their links via the
motherboard. Many motherboards configure two links of each processor into a pipeline and
take the remaining links to an external patch board. Different interconnections are built
using jumper leads plugged into the patch board. Some motherboards provide an additional
C004 reconfigurable electronic crosspoint switch to which the spare links are connected. The
settings of this switch can be controlled using the vendor-supplied software and this allows
different interconnection networks to be established without the need to physically rewire
the machine. The C004 switch is a static switching device and interconnection topologies
cannot be changed whilst a program is running.
A TRAM motherboard must be interfaced to a host machine. Motherboard cards are
usually designed so that they may reside within the host computer enclosure. PCs and
Sun workstations are the usual hosts although other hosts may be used. Connecting the
TRAMs to the host via the motherboard-host interface allows the transputers to utilise the
facilities offered by the host in terms of disk storage, screen output and keyboard input. All
I/O communications between the transputers and the host must be performed via the first
processor in the network as this is the only processor which connects directly to the host.
Any other processor that wishes to communicate with the host must use the first processor
as an intermediary.
A.3.2 The Experimental Setup 
The parallel computing system used throughout the duration of this research project
consisted of 16 INMOS T805 30 MHz Transputers and one INMOS T805 20 MHz Transputer.
The 20 MHz processor was supplied with 16 MB of fast RAM and was used as the root
processor in the Transputer network. Fifteen of the 30 MHz processors were supplied with
1 MB of fast RAM whilst the other 30 MHz processor was equipped with 4 MB of RAM.
All of the Transputers were mounted on two INMOS B008 compatible motherboards and
hosted by an IBM PC AT clone. Each motherboard could accommodate up to 10
Transputers and was equipped with an electronic crosspoint switch which allowed the Transputer
interconnection network to be reconfigured from software. An overview of the machine used
for the experiments described in this thesis is shown in Figure A.3.
Figure A.3: Overview of the experimental Transputer-based parallel machine
Appendix B 
Derivation Of The Models of 
Power System Elements 
B.1 The Generator Model
Consider a non-salient two pole machine which has the rotor field winding supplied by a
constant current source. If magnetic saturation is ignored and the rotor spins at a constant
velocity, balanced three phase sinusoidal voltages will be induced in the stator windings.
These voltages are independent of the stator currents as the rotor field current is constant,
hence they can be modelled as ideal voltage sources. The stator windings, separated
spatially by 120°, are inductively coupled with equal mutual impedances, $Z_m$. The
self-impedances, $Z_s$, of the stator windings are also equal. Under these conditions the generator
can be modelled as an equivalent circuit consisting of an ideal voltage source driving an
impedance, one for each phase. Figure B.1 shows the equivalent circuit model of the
generator, consisting of the three ideal voltage sources connected through the self impedances,
$Z_s$, and coupled by the mutual impedances, $Z_m$.
Assuming balanced three phase steady state conditions, only positive sequence networks
[104] are involved. This allows circuit resistance and other machine losses to be ignored for
the sake of clarity and the system can now be modelled by the equivalent circuit of Figure
B.2.
Resorting to Kirchhoff's Laws
$$E = jX_d I + V \tag{B.1}$$
where
$V = V\angle 0^\circ$ = generator terminal voltage
$E = E\angle\delta$ = stator voltage
$\delta$ = power angle
$X_d$ = direct axis synchronous reactance; this is the reactive
component of the stator self-impedance [1]
Figure B.1: Equivalent circuit model of the synchronous generator
Figure B.2: Positive sequence equivalent circuit model of the synchronous generator
The complex power for the system is given by
$$S = V I^* \tag{B.2}$$
Substituting from (B.1)
$$S = V \left( \frac{E - V}{jX_d} \right)^* \tag{B.3}$$
Simplifying yields
$$S = \frac{VE}{X_d} \sin\delta + j \left( \frac{VE}{X_d} \cos\delta - \frac{V^2}{X_d} \right) \tag{B.4}$$
The real power, P, is the real part of (B.4)
$$P = \Re[S] = \frac{VE}{X_d} \sin\delta \tag{B.5}$$
and reactive power, Q, is the imaginary part of (B.4)
$$Q = \Im[S] = \frac{VE}{X_d} \cos\delta - \frac{V^2}{X_d} \tag{B.6}$$
Taken together (B.1) to (B.6) provide a mathematical description of the synchronous
generator which allows the determination of real and reactive power delivered and the
voltages and currents which can be measured both within the machine and at its terminals.
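As a numerical check on the generator model, the closed forms for P and Q can be compared with a direct phasor evaluation of S = VI*. The sketch below is illustrative only; the function names and the per-unit values used in any call are assumptions and are not taken from the thesis.

```python
import cmath
import math


def generator_power(V, E, delta, Xd):
    """Real and reactive power from (B.5) and (B.6); delta in radians."""
    P = V * E / Xd * math.sin(delta)
    Q = V * E / Xd * math.cos(delta) - V ** 2 / Xd
    return P, Q


def generator_power_phasor(V, E, delta, Xd):
    """The same quantities computed directly from S = V I* with I from (B.1)."""
    V_ph = complex(V, 0.0)                 # terminal voltage, reference phasor
    E_ph = cmath.rect(E, delta)            # stator voltage E at angle delta
    I = (E_ph - V_ph) / (1j * Xd)          # (B.1) rearranged for the current
    S = V_ph * I.conjugate()               # (B.2)
    return S.real, S.imag
```

For example, with V = 1.0, E = 1.2, a power angle of 30 degrees and Xd = 1.5 (all per unit, chosen arbitrarily), both functions return the same P and Q to within rounding error.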
B.2 The Transmission Line Model 
Three different equivalent circuit models of a transmission line may be derived depending
upon the length of the line. This section discusses these models in detail. In discussing
transmission lines it should be noted that parameters such as resistance and impedance are
distributed throughout the length of the line. The models can be derived by considering
these parameters to be uniformly distributed along the line although it is more usual to
make use of lumped parameter models [104]. These models concentrate the resistance and
inductance of the line into single parameters in the equivalent circuit model.
Figure B.3: Single phase equivalent circuit model of a short transmission line
[Series impedance $Z = R + j\omega L$]
B.2.1 Short Lines
Lines which are less than 80 km in length are considered to be short lines. For a line of this
length the shunt capacitance between the line and the neutral return is negligible. Only
the series resistance, R, and the series inductance, L, of the line need to be considered and
a lumped parameter model can be devised. The single phase equivalent circuit is shown in
Figure B.3. The current is the same at the sending and receiving ends of the line and thus
$$I_S = I_R, \qquad V_S = V_R + I_R Z \tag{B.7}$$
B.2.2 Medium Length Lines
Lines which are over 80 km in length and less than 240 km are defined as medium length
lines. It is possible to represent a medium length line with a lumped parameter model
using line series resistance and line series inductance. However the shunt admittance can
no longer be neglected and it is included as a lumped parameter Y. The shunt admittance
is usually purely capacitive and it is customary to place half of the admittance at each end
of the line. The equivalent circuit, shown in Figure B.4, is referred to as the nominal-π
equivalent circuit model.
Figure B.4: Single phase equivalent circuit model of a medium length transmission line
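From simple circuit theory
$$I_{CR} = V_R \frac{Y}{2} \tag{B.8}$$
The current flowing in the series arm of the circuit is $I_R + V_R \frac{Y}{2}$ and thus
$$V_S = \left( V_R \frac{Y}{2} + I_R \right) Z + V_R = \left( \frac{ZY}{2} + 1 \right) V_R + Z I_R \tag{B.9}$$
Looking at the sending end gives
$$I_{CS} = V_S \frac{Y}{2} \tag{B.10}$$
and the sending end current is
$$I_S = V_S \frac{Y}{2} + V_R \frac{Y}{2} + I_R \tag{B.11}$$
Substituting (B.9) into (B.11) yields
$$I_S = V_R Y \left( 1 + \frac{ZY}{4} \right) + \left( \frac{ZY}{2} + 1 \right) I_R \tag{B.12}$$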
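The nominal-π relations can be checked by evaluating the closed forms (B.9) and (B.12) against a branch by branch calculation that follows (B.8) to (B.11). The function names and any line data used with them are assumptions made purely for illustration.

```python
def nominal_pi(Z, Y, VR, IR):
    """Sending end voltage and current of a nominal-pi line, (B.9) and (B.12)."""
    VS = (Z * Y / 2 + 1) * VR + Z * IR
    IS = VR * Y * (1 + Z * Y / 4) + (Z * Y / 2 + 1) * IR
    return VS, IS


def nominal_pi_stepwise(Z, Y, VR, IR):
    """The same result built up branch by branch as in (B.8) to (B.11)."""
    ICR = VR * Y / 2              # current in the receiving end shunt, (B.8)
    I_series = IR + ICR           # current in the series arm
    VS = VR + I_series * Z        # (B.9)
    ICS = VS * Y / 2              # current in the sending end shunt, (B.10)
    IS = ICS + I_series           # (B.11)
    return VS, IS
```

Setting Y to zero in either function reduces the model to the short line case, in which the sending end voltage is simply the receiving end voltage plus the drop across Z.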
Figure B.5: Single phase representation of a long transmission line
B.2.3 Long Lines
Long lines are defined as those lines whose length exceeds 240 km. The mathematical
description of a long line must account for the fact that the series resistance, series inductance
and shunt admittance are distributed throughout the line rather than being lumped
together. A long line may be represented by a single phase circuit of the form of Figure B.5.
Series resistance, series inductance and shunt admittance are assumed to be uniformly
distributed along the length of the line. Consider the voltage and current differences between
the ends of the line element of length $\Delta x$, where $x$ is the distance of the element from the
receiving end. The end of the element nearest the receiving end of the line is referred to as
the receiving end of the element whilst the end closest to the sending end of the line is the
sending end of the element. The series impedance and shunt admittance of the element are
$z\Delta x$ and $y\Delta x$ respectively, where
$z$ = series impedance per unit length
$y$ = shunt admittance to neutral per unit length
If $V$ is the voltage of the element at the receiving end then the voltage at the sending end
is $V + \Delta V$ as the voltage increases by $\Delta V$ over the length of the element. If $I$ is the current
flowing at the receiving end of the element then
$$\frac{\Delta V}{\Delta x} = Iz \tag{B.13}$$
As $\Delta x \to 0$
$$\frac{dV}{dx} = Iz \tag{B.14}$$
The current entering the sending end of the element is $I + \Delta I$, where $\Delta I = Vy\Delta x$. Hence
$$\frac{dI}{dx} = Vy \tag{B.15}$$
as $\Delta x \to 0$. Resorting to calculus, it is possible to prove [104] that
$$V = \frac{V_R + I_R Z_c}{2} e^{\gamma x} + \frac{V_R - I_R Z_c}{2} e^{-\gamma x} \tag{B.16}$$
$$I = \frac{V_R/Z_c + I_R}{2} e^{\gamma x} - \frac{V_R/Z_c - I_R}{2} e^{-\gamma x} \tag{B.17}$$
where
$Z_c = \sqrt{z/y}$ = characteristic impedance of the line
$\gamma = \sqrt{yz}$ = propagation constant of the line
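Recalling that
$$\sinh\theta = \frac{e^{\theta} - e^{-\theta}}{2} \tag{B.18}$$
$$\cosh\theta = \frac{e^{\theta} + e^{-\theta}}{2} \tag{B.19}$$
it is possible to rewrite (B.16) and (B.17) in hyperbolic form as
$$V = V_R \cosh\gamma x + I_R Z_c \sinh\gamma x \tag{B.20}$$
$$I = I_R \cosh\gamma x + \frac{V_R}{Z_c} \sinh\gamma x \tag{B.21}$$
Setting $x = l$, where $l$ is the length of the line, gives
$$V_S = V_R \cosh\gamma l + I_R Z_c \sinh\gamma l \tag{B.22}$$
$$I_S = I_R \cosh\gamma l + \frac{V_R}{Z_c} \sinh\gamma l \tag{B.23}$$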
A lumped parameter equivalent circuit can be obtained for the long line. Consider the
π equivalent circuit of Figure B.6. Substituting the lumped parameters into (B.9) yields
$$V_S = \left( \frac{Z'Y'}{2} + 1 \right) V_R + Z' I_R \tag{B.24}$$
To make the equivalent circuit model the transmission line accurately the coefficients of
$V_R$ and $I_R$ in (B.22) must equal the coefficients of $V_R$ and $I_R$ in (B.24). Equating the
coefficients of $I_R$ yields
$$Z' = Z_c \sinh\gamma l = \sqrt{\frac{z}{y}} \sinh\gamma l = zl \, \frac{\sinh\gamma l}{\sqrt{zy}\, l} \tag{B.25}$$
The total series impedance of the line, $Z$, is given by $Z = zl$ and thus
$$Z' = Z \, \frac{\sinh\gamma l}{\gamma l} \tag{B.26}$$
Equating the coefficients of $V_R$ yields
$$\frac{Z'Y'}{2} + 1 = \cosh\gamma l \tag{B.27}$$
Substituting from (B.25)
$$\frac{Y' Z_c \sinh\gamma l}{2} + 1 = \cosh\gamma l \tag{B.28}$$
Hence
$$\frac{Y'}{2} = \frac{1}{Z_c} \, \frac{\cosh\gamma l - 1}{\sinh\gamma l} \tag{B.29}$$
Figure B.6: Single phase π-equivalent circuit of a long transmission line
Using the identity
$$\tanh\frac{\theta}{2} = \frac{\cosh\theta - 1}{\sinh\theta} \tag{B.30}$$
yields
$$\frac{Y'}{2} = \frac{Y}{2} \, \frac{\tanh(\gamma l/2)}{\gamma l/2} \tag{B.31}$$
where $Y = yl$ is the total shunt admittance of the line.
Figure B.7: Two winding transformer and schematic
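The equivalent-π parameters of the long line, (B.26) and (B.31), can be verified numerically against the hyperbolic solution (B.22) and (B.23). The sketch below is illustrative only: the function names are hypothetical, and any per-kilometre values of z and y used with it are arbitrary assumptions rather than data from the thesis.

```python
import cmath


def equivalent_pi(z, y, length):
    """Return (Z', Y') of the equivalent-pi circuit from (B.26) and (B.31).

    z and y are the series impedance and shunt admittance per unit length.
    """
    gamma_l = cmath.sqrt(z * y) * length          # propagation constant times l
    Z = z * length                                # total series impedance
    Y = y * length                                # total shunt admittance
    Z_eq = Z * cmath.sinh(gamma_l) / gamma_l
    Y_eq = Y * cmath.tanh(gamma_l / 2) / (gamma_l / 2)
    return Z_eq, Y_eq


def long_line(z, y, length, VR, IR):
    """Sending end voltage and current from (B.22) and (B.23)."""
    Zc = cmath.sqrt(z / y)                        # characteristic impedance
    gamma_l = cmath.sqrt(z * y) * length
    VS = VR * cmath.cosh(gamma_l) + IR * Zc * cmath.sinh(gamma_l)
    IS = IR * cmath.cosh(gamma_l) + VR / Zc * cmath.sinh(gamma_l)
    return VS, IS
```

Substituting Z' and Y' into the nominal-π expressions (B.9) and (B.12) reproduces the values returned by `long_line`, and for a short length Z' and Y' approach the total series impedance and shunt admittance, so the equivalent-π circuit reduces to the nominal-π circuit.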
B.3 The Transformer Model 
A transformer consists of two or more coils placed such that they are linked by the same
flux. The coils are usually wound on to an iron core in order to confine the flux to the coils.
The coil which is connected to the load is known as the secondary winding and the other
coil is known as the primary winding.
Consider the two winding transformer and its schematic shown in Figure B.7. Assuming
the transformer to be ideal (i.e. the permeability of the iron core is infinite and the resistance
of the windings is zero) the terminal voltages are related by
$$\frac{V_1}{V_2} = \frac{N_1}{N_2} \tag{B.32}$$
Similarly the terminal currents are related by
$$\frac{I_1}{I_2} = \frac{N_2}{N_1} \tag{B.33}$$
The full derivation of these relationships relies on the use of Faraday's Law and Ampere's
Law and is given in [104].
Due to the principle of conservation of energy, the power input to the primary winding
must be equal to the power output from the secondary winding.
$$S = V_1 I_1^* = V_2 I_2^* \tag{B.34}$$
If an impedance, $Z_2$, is connected to the secondary winding then
$$Z_2 = \frac{V_2}{I_2} \tag{B.35}$$
The effective impedance seen at the terminals of the primary winding is
$$Z_2' = \frac{V_1}{I_1} = \frac{(N_1/N_2) V_2}{(N_2/N_1) I_2} = \left( \frac{N_1}{N_2} \right)^2 Z_2 \tag{B.36}$$
Figure B.8: Modified equivalent circuit of an ideal single phase transformer
Practical transformers have finite core permeability, winding resistance, losses in the
core due to eddy currents and hysteresis and imperfect flux linkage to the coils. These
factors must be accounted for in the equivalent circuit of the practical transformer, shown
in Figure B.8. This equivalent circuit is derived by adding extra components to account for
these effects at the primary and secondary windings of the ideal transformer. Applying a
sinusoidal voltage to the primary winding when the secondary winding is open circuit causes
a small current to flow in the primary winding. This is the magnetizing current and it
is accounted for in the equivalent circuit model by the inductive susceptance $B_m$ in parallel
with a conductance $G$. The inductive leakage reactance $x_1$ accounts for flux leakage in the
primary winding and $x_2$ accounts for flux leakage in the secondary winding. $r_1$ and $r_2$ are
the series resistances of the primary and secondary windings respectively. By referring all
quantities to the primary side of the transformer the ideal transformer can be removed from
the equivalent circuit. As the magnetizing current is small compared to the load current it is
often neglected and the equivalent circuit model of a practical single phase transformer is
shown in Figure B.9. If there are $N_1$ turns on the primary and $N_2$ turns on the secondary
it is possible to define $a = N_1/N_2$. The parameters of the model are thus
$$R_1 = r_1 + a^2 r_2 \tag{B.37}$$
$$X_1 = x_1 + a^2 x_2 \tag{B.38}$$
Figure B.9: Equivalent circuit of a practical single phase transformer
Figure B.10: Y-Δ connected three phase transformer equivalent circuit
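Referring secondary quantities to the primary side amounts to scaling by the square of the turns ratio. The following sketch evaluates (B.36), (B.37) and (B.38); the function names and the winding data used in any call are hypothetical and serve only to illustrate the arithmetic.

```python
def refer_to_primary(N1, N2, r1, x1, r2, x2):
    """Series resistance and reactance referred to the primary, (B.37) and (B.38)."""
    a = N1 / N2                      # turns ratio
    R1 = r1 + a ** 2 * r2
    X1 = x1 + a ** 2 * x2
    return R1, X1


def referred_load(N1, N2, Z2):
    """Load impedance Z2 as seen from the primary terminals, (B.36)."""
    return (N1 / N2) ** 2 * Z2
```

With a 10:1 turns ratio, for instance, every secondary impedance appears one hundred times larger when viewed from the primary, so a secondary winding resistance of 0.005 ohms contributes 0.5 ohms to $R_1$.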
A three phase transformer may be created by connecting a bank of single phase
transformers such that the three primary (secondary) windings are Δ connected and the three
secondary (primary) windings are Y connected (Figure B.10). The result is the Y-Δ three
phase transformer. Practical three phase transformers are often constructed by winding
the three phases onto the same iron core. The equivalent circuit model is constructed from
three single phase equivalent circuit models of a practical transformer, connected together
in the appropriate manner.
Figure B.11: P-V and Q-V characteristic for a typical synchronous motor
B.4 The Load Model 
The simulation of a power system requires the loads connected to that system to be
accurately modelled. This requires a consideration of how the power and reactive power
flows of the load vary with voltage. The individual loads at a bus are usually lumped
together to give a composite load for the bus and composite loads typically consist of [105]
Induction motors       50%-70%
Synchronous motors     10%
Heating and lighting   20%-25%
Transmission losses    10%-12%
Heating and lighting loads have well defined characteristics. Lighting consumes no
reactive power and the power consumption varies with $(\text{voltage})^{1.6}$. Heating loads also
consume no reactive power and maintain a constant resistance as voltage varies and the
power consumption varies with $(\text{voltage})^2$.
The power consumed by synchronous motors remains approximately constant. As the
voltage drops the reactive power consumption increases. Figure B.11 shows the variation
of real and reactive power with voltage for a typical synchronous machine.
Induction motors account for the largest proportion of the composite load and the
variation of real and reactive power flow with voltage can be found by considering a
suitable equivalent circuit. Figure B.12 shows the equivalent circuit for an induction motor.
Figure B.12: Equivalent circuit of an induction motor
In this circuit
$X_1$ is the stator leakage reactance
$X_2$ is the rotor leakage reactance
$X_m$ is the magnetizing reactance
$r_2$ is the rotor resistance
$s$ is the rotor slip
Assuming that the mechanical loading on the rotor shaft is constant, the electrical power
delivered to the rotor is
$$P = 3 \frac{I^2 r_2}{s} = \text{constant} \tag{B.39}$$
Reactive power consumption is given by
$$Q = \frac{3V^2}{X_m} + 3I^2 (X_1 + X_2) \tag{B.40}$$
Real power consumption is given by
$$P = 3 \frac{I^2 r_2}{s} = \frac{3V^2}{(r_2/s)^2 + (X_1 + X_2)^2} \, \frac{r_2}{s} = \frac{3V^2 r_2 s}{r_2^2 + (sX)^2} \tag{B.41}$$
where $X = X_1 + X_2$. Figure B.13 shows the variation of real and reactive power flow with
voltage under different mechanical loadings.
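The induction motor expressions can be evaluated directly for a given voltage and slip. The sketch below assumes, as (B.41) implies, that the rotor branch current satisfies $I^2 = V^2 / ((r_2/s)^2 + X^2)$ with the magnetizing branch connected at the terminals; the function names and the machine parameters used in any call are hypothetical.

```python
def induction_motor_power(V, s, r2, X1, X2, Xm):
    """Real and reactive power of the induction motor circuit of Figure B.12.

    V is the phase voltage and s the slip; uses (B.40) and (B.41).
    """
    X = X1 + X2
    I_squared = V ** 2 / ((r2 / s) ** 2 + X ** 2)   # rotor branch current squared
    P = 3 * I_squared * r2 / s
    Q = 3 * V ** 2 / Xm + 3 * I_squared * X
    return P, Q


def real_power_closed_form(V, s, r2, X):
    """Right-hand form of (B.41)."""
    return 3 * V ** 2 * r2 * s / (r2 ** 2 + (s * X) ** 2)
```

Evaluating both functions at the same operating point gives the same real power, which confirms that the two forms of (B.41) are algebraically equivalent.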
Figure B.13: P-V and Q-V characteristics of an induction motor
From the point of view of a simulation it is the characteristics of the composite load
which are of interest rather than the characteristics of its constituents. If the P-V, Q-V
characteristics have been measured for each substation then these may be used to model
the loads. Unfortunately these characteristics are not readily available. Many analyses
represent loads by constant impedances [105] and consequently $P \propto V^2$ and $Q \propto V^2$. If the
power consumed by the load is $S = P + jQ$ it is easy to show that, taking $V$ as the reference
phasor, the current in the load is given by
$$I^* = \frac{S}{V} \tag{B.42}$$
$$I = \frac{P - jQ}{V} \tag{B.43}$$
From Ohm's Law we have $V = IZ$ and the impedance, $Z$, used to represent the load is
$$Z = \frac{V}{I} = \frac{V^2}{P - jQ} \tag{B.44}$$
When the network is extensively simplified then the constant impedance model is used 
although the load that this represents seldom occurs in practice. Other methods of load 
modelling represent the load with a constant current sink and this is found to give a good 
approximation to real loads [105]. 
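The constant impedance representation of (B.44) can be illustrated numerically. In this sketch the voltage is taken as a real reference phasor, and the function names and per-unit values used with them are assumptions made for the example.

```python
def constant_impedance(P, Q, V):
    """Impedance representing a load drawing P + jQ at voltage V, (B.44)."""
    return V ** 2 / complex(P, -Q)


def load_power(Z, V):
    """Complex power S = P + jQ absorbed by impedance Z at voltage V."""
    I = V / Z
    return V * I.conjugate()
```

An impedance fitted at nominal voltage returns the original P and Q at that voltage, whilst at 0.9 of nominal voltage both P and Q fall to 0.81 of their nominal values, which is the $V^2$ dependence noted in the text.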
Appendix C 
Deriving the Bus Admittance 
Matrix 
It is necessary to derive the bus admittance matrix for a given system before attempting to
calculate the currents and voltages in that system. The bus admittance matrix is determined
by performing nodal analysis on the system. Consider the example system of Figure C.1.
The nodal admittance analysis method is based simply on Kirchhoff's Current Law. The
consequence of this is that each node, $k$, in the system obeys the relationship
$$I_k = \sum_{i=1}^{n} y_{ki} (V_k - V_i) \tag{C.1}$$
where $n$ is the number of branches from this node and $y_{ki}$ is the admittance of the branch
connecting node $k$ and node $i$. As this is true for each node in the system we can write the
matrix equation of (2.7) where
$$[Y] = \begin{bmatrix} y_{1,1} & y_{1,2} & \cdots & y_{1,n-1} \\ y_{2,1} & y_{2,2} & \cdots & y_{2,n-1} \\ \vdots & \vdots & & \vdots \\ y_{n-1,1} & y_{n-1,2} & \cdots & y_{n-1,n-1} \end{bmatrix} \tag{C.2}$$
The diagonal term $y_{k,k}$ is the sum of all the admittances connected to node $k$ whilst the
off-diagonal term $y_{k,i}$ is the negative of the sum of all the admittances between node $k$ and
node $i$. For the example of Figure C.1
the matrix [Y] is
$$[Y] = \begin{bmatrix} 1 - j5 & j5 & -1 \\ j5 & 4 - j5 & -4 \\ -1 & -4 & 1 + 4 + j2 \end{bmatrix} \tag{C.3}$$
Figure C.1: Example circuit for admittance analysis
Note that node 4 does not appear in [Y] as one node in the system has to be chosen 
as a reference node to prevent the creation of a set of dependent equations. Eliminating 
the equations relating to one node ensures that the remaining set of equations is linearly 
independent. 
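The construction of [Y] by nodal analysis can be sketched in a few lines. The branch list used in the usage example is an assumption: it has been read off from the entries of (C.3) rather than from Figure C.1 itself, and the function name `nodal_admittance` is hypothetical.

```python
def nodal_admittance(n_nodes, branches, reference):
    """Build [Y] by nodal analysis, omitting the reference node.

    branches: list of (node_a, node_b, admittance) with 1-based node numbers.
    """
    kept = [k for k in range(1, n_nodes + 1) if k != reference]
    index = {node: i for i, node in enumerate(kept)}
    Y = [[0j] * len(kept) for _ in kept]
    for a, b, y in branches:
        for node in (a, b):
            if node in index:
                Y[index[node]][index[node]] += y      # diagonal: sum of admittances
        if a in index and b in index:
            Y[index[a]][index[b]] -= y                # off-diagonal: minus the
            Y[index[b]][index[a]] -= y                # admittance between a and b
    return Y
```

With branches (1, 2, -5j), (1, 3, 1), (2, 3, 4) and (3, 4, 2j) and node 4 as the reference, the function returns the three by three matrix of (C.3).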
When considering power systems and the power flow problem, the bus admittance
matrix, [Y], is simply the nodal admittance matrix of the transmission network and can be
derived in the same way using nodal admittance analysis. It is usual to select the system
slack bus as the reference node.
Consider the simple four bus system shown in Figure C.2.
Figure C.2: Simple four bus example system
Each line in the system is a transmission line which has a series impedance of $z$ and
a shunt admittance of $y/2$ connected to each end of the line. Considering the line between
buses 1 and 2, the series impedance and shunt admittance make contributions to the $y_{11}$
and $y_{22}$ terms of the admittance matrix, according to
$$y_{11} = y_{11} + \frac{1}{z} + \frac{y}{2}, \qquad y_{22} = y_{22} + \frac{1}{z} + \frac{y}{2} \tag{C.4}$$
The off-diagonal terms account only for the series impedance between buses 1 and 2 and
hence
$$y_{12} = y_{12} - \frac{1}{z}, \qquad y_{21} = y_{21} - \frac{1}{z} \tag{C.5}$$
Using the same technique and considering each line in the system in turn, it is possible
to derive the complete bus admittance matrix for the system. All that is required is a
knowledge of the values of $z$ and $y$ for each line.
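The line by line assembly described by (C.4) and (C.5) can be written as a short loop over the line data. This is a sketch only: the function name is hypothetical, every bus is retained in the matrix, and any line data supplied to it (including the topology of the four bus system, which cannot be recovered from Figure C.2 here) are assumptions.

```python
def bus_admittance(n_buses, lines):
    """Assemble the bus admittance matrix from nominal-pi line data.

    lines: list of (bus_a, bus_b, z, y) with 1-based bus numbers, series
    impedance z and total shunt admittance y (y / 2 is placed at each end).
    """
    Y = [[0j] * n_buses for _ in range(n_buses)]
    for a, b, z, y in lines:
        i, j = a - 1, b - 1
        Y[i][i] += 1 / z + y / 2        # (C.4)
        Y[j][j] += 1 / z + y / 2
        Y[i][j] -= 1 / z                # (C.5)
        Y[j][i] -= 1 / z
    return Y
```

The resulting matrix is symmetric, and each row sums to the total shunt admittance connected at that bus because the series contributions cancel between the diagonal and off-diagonal terms.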
Appendix D 
Network Partitioning and 
Diakoptics 
In order to solve the network equations in parallel it is necessary to partition the system
network into several smaller, independent subnetworks. The large network is divided by
'tearing' it apart using Kron's method of diakoptics [48]. Each subnetwork is solved
independently and the independent solutions are appropriately modified to give the correct
solution.
Two methods exist for tearing the network into subnetworks: branch cutting and node
tearing. The branch cutting method operates by cutting some of the branches which
interconnect the nodes, thereby separating the network into independent subnetworks. The
node tearing approach operates by tearing some of the nodes in half to separate the network
into a number of subnetworks. The cut branches or the torn nodes give rise to coordination
variables in the diakoptic solution. Once the subnetworks have been solved it is these
coordination variables which are used to modify the individual subnetwork solutions to give the
overall solution. The two methods differ not only in the way that they partition the network
but also in the way the coordination variables are chosen. The coordination variables used
by the branch cutting method are the currents flowing in the branches that are cut. The
node tearing method selects the voltages at the torn nodes as the coordination variables.
Node tearing usually introduces fewer coordination variables than branch cutting [77], thus
making it more computationally efficient. The branch cutting method can be useful when
information about the currents flowing in the circuit branches is required. The following
sections present the branch cutting and node tearing methods in more detail. The information
presented here is based on that given by Boming et al. [77].
D.1 Node Tearing
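Given a network represented by the set of equations YV = I, it is possible to tear the
network into a number of independent subnetworks by choosing appropriate tearing nodes.
If these nodes are given larger node numbers (i.e. ordered last) then the system of equations
will have BBDF form, as below
\[
\begin{bmatrix}
Y_{11} &        &        & Y_{1c} \\
       & \ddots &        & \vdots \\
       &        & Y_{kk} & Y_{kc} \\
Y_{c1} & \cdots & Y_{ck} & Y_{cc}
\end{bmatrix}
\begin{bmatrix} V_1 \\ \vdots \\ V_k \\ V_c \end{bmatrix}
=
\begin{bmatrix} I_1 \\ \vdots \\ I_k \\ I_c \end{bmatrix}
\tag{D.1}
\]
The voltages, $V_c$, at the tearing nodes are the coordination variables for this approach
and may be solved for by
\[
\left( Y_{cc} - \sum_{i=1}^{k} Y_{ci} Y_{ii}^{-1} Y_{ic} \right) V_c = I_c - \sum_{i=1}^{k} Y_{ci} Y_{ii}^{-1} I_i \tag{D.2}
\]
The solution for the individual subnetworks is given by
\[
Y_{ii} V_i = I_i - Y_{ic} V_c, \qquad i = 1, \ldots, k \tag{D.3}
\]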
Figure D.1 provides a conceptualization of the node tearing method. The approach can
be thought of as tearing nodes apart into pieces. A piece of each node is then connected to
the subnetwork to which that node was attached. The voltages, $V_c$, are the node voltages
associated with the tearing nodes but they may be thought of as voltage sources connected
with the tearing nodes. In the same way that a part of the tearing node is attached to the
subnetwork, a voltage source of the same magnitude as the tearing node voltage is connected
to each subnetwork to which that tearing node was attached. As Figure D.1 shows, once
the voltage sources (i.e. tearing node voltages) are known then the individual subnetworks
may be solved independently and in parallel.
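As a hedged illustration of the procedure, the sketch below solves a small bordered block diagonal system in which every subnetwork is a single node, so that each diagonal block is a scalar. It uses the usual elimination for such systems: each subnetwork is reduced onto the tearing node, the tearing node voltage is found, and the subnetwork voltages follow by back-substitution. The function name and all numerical values are assumptions chosen for the example.

```python
def solve_by_node_tearing(Y_ii, Y_ic, Y_ci, Y_cc, I_sub, I_c):
    """Solve a BBDF system whose diagonal blocks are scalars (one node each)."""
    k = len(Y_ii)
    # Reduce every subnetwork onto the tearing node. Each term depends on
    # one subnetwork only, so the terms could be computed in parallel.
    y_red = Y_cc - sum(Y_ci[i] * Y_ic[i] / Y_ii[i] for i in range(k))
    i_red = I_c - sum(Y_ci[i] * I_sub[i] / Y_ii[i] for i in range(k))
    V_c = i_red / y_red  # coordination variable (tearing node voltage)
    # Back-substitute into each subnetwork, again independently.
    V_sub = [(I_sub[i] - Y_ic[i] * V_c) / Y_ii[i] for i in range(k)]
    return V_sub, V_c


# Hypothetical star network: three leaf nodes joined to one tearing node.
Y_ii = [3.0, 5.0, 6.0]      # leaf self-admittances
Y_ic = [-2.0, -4.0, -5.0]   # coupling of each leaf to the tearing node
Y_ci = [-2.0, -4.0, -5.0]
Y_cc = 12.0                 # tearing node self-admittance
I_sub = [1.0, 0.0, 2.0]
I_c = 0.5

V_sub, V_c = solve_by_node_tearing(Y_ii, Y_ic, Y_ci, Y_cc, I_sub, I_c)
```

With full matrix blocks the divisions become solves with the factorised subnetwork matrices, but the order of operations is unchanged.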
Figure D.1: A conceptual view of node tearing, from Boming et al. a) applying equivalent
voltage sources b) tearing the node apart
D.2 Branch Cutting 
Given an electrical network consisting of nodes connected by branches, Figure D.2(a), it is
possible to replace a number of these branches, L, by current sources. These current sources
have the same magnitude as the currents originally flowing in the branches and the network
with current sources is electrically equivalent to the original network. The branches which
have been replaced by current sources are known as cut branches. Figure D.2(b) and Figure
D.2(c) show how the equivalent current sources may be exploited to partition the network
into independent subnetworks. Let $Y$ be the admittance matrix of the network and $Y_a$ be
the admittance matrix for the network when the cut branches have been removed. Let
$i_L$ be the currents flowing in the cut branches. Removing the cut branches partitions the
network into pieces and the admittance matrix $Y_a$ becomes block diagonal. Now
\[
Y = Y_a + Y_c \tag{D.4}
\]
where $Y_c$ is a matrix corresponding to the cut branches
\[
Y_c = P y P^T \tag{D.5}
\]
where
• $P$ is a matrix consisting of columns of the incidence matrix of the original admittance
matrix which correspond to the cut branches. The elements of each column of this
matrix are zero except at the terminal nodes of each branch where the values are ±1.
• $y$ is an $L \times L$ diagonal matrix. The elements of this matrix are the admittances of
the cut branches.
Substituting (D.4) and (D.5) into YV = I yields
\[
(Y_a + P y P^T) V = I \tag{D.6}
\]
As $Y_c$ represents the admittances of the cut branches and $i_L$ are the currents flowing in
them, (D.6) may be rewritten as
\[
Y_a V + P i_L = I \tag{D.7}
\]
by setting $i_L = y P^T V$. (D.6) and (D.7) can thus be expressed in matrix notation as
\[
\begin{bmatrix} Y_a & P \\ P^T & -y^{-1} \end{bmatrix}
\begin{bmatrix} V \\ i_L \end{bmatrix}
=
\begin{bmatrix} I \\ 0 \end{bmatrix}
\tag{D.8}
\]
The coefficient matrix of (D.8) again exhibits BBDF and parallel processing is possible. One
additional complication of the branch cutting method is that the admittance submatrices
are overdetermined and singular. This problem has to be overcome by defining a reference
node in each subnetwork and the effect of this is to remove the overdeterminism from the
equations, preventing $Y_a$ from being singular.
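The decomposition of (D.4) and (D.5) can be checked on a small case. The sketch below is illustrative only: it assumes a hypothetical four node network made of two two-node subnetworks joined by a single cut branch, with shunt admittances at every node so that no reference node needs to be removed. The helper name `cut_branch_matrix` and all values are assumptions.

```python
def cut_branch_matrix(P, y):
    """Return Y_c = P y P^T for incidence columns P (n x L) and the
    admittances y (length L) of the L cut branches."""
    n, L = len(P), len(y)
    return [[sum(P[r][k] * y[k] * P[c][k] for k in range(L))
             for c in range(n)] for r in range(n)]


# Block diagonal Y_a: nodes 1-2 form one subnetwork, nodes 3-4 the other.
Y_a = [[3.0, -2.0, 0.0, 0.0],
       [-2.0, 3.0, 0.0, 0.0],
       [0.0, 0.0, 4.0, -3.0],
       [0.0, 0.0, -3.0, 4.0]]

# One cut branch between nodes 2 and 3 with admittance 5.
P = [[0], [1], [-1], [0]]
y = [5.0]

Y_c = cut_branch_matrix(P, y)
# Full admittance matrix of the original network, as in (D.4)
Y = [[Y_a[r][c] + Y_c[r][c] for c in range(4)] for r in range(4)]
```

Only the rows and columns of the terminal nodes of the cut branch differ between Y and Y_a, which is why removing the branch leaves Y_a block diagonal.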
Figure D.2: A conceptual view of branch cutting, from Boming et al. a) original network
b) applying equivalent current sources c) cutting the branches
Appendix E 
Proof of Liu's Tree Theorems 
In Section 4.3.1 two theorems relating to the properties of elimination trees were quoted.
These theorems are taken from Liu [52] and the proofs of these theorems are presented
below. The proofs are also taken from Liu.
E.1 Notation
The following notation is used in the proofs
• $G(A)$ is the graph of the matrix $A$
• $T[A]$ is the elimination tree of the matrix $A$
• $T[x_j]$ denotes the subtree of the elimination tree rooted at node $x_j$
• $Adj(v)$ denotes the set of nodes adjacent to $v$ in the graph
• $\ell_{ij}$ denotes the length of the path connecting nodes $x_i$ and $x_j$
E.2 Other Theorems Required 
Before proving the theorems of Section 4.3.1 it is necessary to state a number of other
theorems. The proofs of these theorems are not given but may be found in [52].
Theorem 3 Let $i > j$. Then $\ell_{ij} \neq 0$ if and only if there exists a path
\[
x_i, x_{p_1}, \ldots, x_{p_t}, x_j
\]
in the graph $G(A)$ such that $\{x_{p_1}, \ldots, x_{p_t}\} \subseteq T[x_j]$.
Theorem 4 Let $i > j$. If $\ell_{ij} \neq 0$, then the node $x_i$ is an ancestor of $x_j$ in the elimination
tree.
Theorem 5 Let $i > j$. Then $\ell_{ij} \neq 0$ if and only if there exists a path
\[
x_i, x_{p_1}, \ldots, x_{p_t}, x_j
\]
in the graph $G(A)$ such that all subscripts in $\{p_1, \ldots, p_t\}$ are less than $j$.
E.3 Proof of the Tree Theorems 
The first theorem used by Liu states that
Theorem 1 For each node $x_j$ in $G(A)$, the subgraph of $G(A)$ (or $G(F)$) which consists of
nodes in the tree $T[x_j]$ is connected, where $T[x_j]$ is the subtree rooted at node $x_j$.
The proof of this theorem is derived by induction on the number of nodes $t$ in $T[x_j]$. If
there is only one node in $T[x_j]$ then the theorem is obviously true. To prove the general case,
assume that the result is true for all subtrees of size less than $t$, and $t > 1$. Let $x_{s_1}, \ldots, x_{s_m}$
be the child nodes of $x_j$. Following from the inductive assumption, each subgraph consisting
of nodes in $T[x_{s_k}]$, for $1 \le k \le m$, has fewer nodes than $t$ and is connected in the graph
$G(A)$. For each $k$, $\{x_j, x_{s_k}\}$ is an edge in the filled graph $G(F)$. By Theorem (3) there
exists a path from $x_{s_k}$ to $x_j$ through nodes in $T[x_{s_k}]$. This proves that the subset $T[x_j]$ is
a connected subgraph in $G(A)$. Since $G(F)$ is a supergraph of $G(A)$, $T[x_j]$ must also be a
connected subgraph in $G(F)$.
Corollary 1 For each node $x_j$, the set of nodes in $T[x_j]$ forms a connected component in
the subgraph of $G(A)$ $\{G(F)\}$ consisting of all nodes except those in $Adj(T[x_j])$.
This corollary implies that partitioning the tree into disjoint subtrees is the same as
partitioning the network (graph) into subnetworks (subgraphs). A second corollary to this theorem is
Corollary 2 For each node $x_j$, the set of nodes in $T[x_j]$ forms a connected component in
the subgraph of $G(A)$ $\{G(F)\}$ consisting of all nodes except proper ancestors of $x_j$.
The second proof derived and used by Liu refers to matrix reorderings. If $A$ is a given
symmetric matrix, the two orderings $P$ and $Q$ are said to be equivalent if the structures of the
filled graphs of $PAP^T$ and $QAQ^T$ are isomorphic. $P$ is referred to as an equivalent reordering
of the matrix $A$ if the filled graph of $A$ and the filled graph of $PAP^T$ are isomorphic. Given
an initial ordering of the matrix $A$ and its corresponding elimination tree $T(A)$, let $P$ be
a permutation matrix for $A$ that corresponds to a topological reordering¹ of the nodes in
$T(A)$.
Theorem 2 Given the matrix, $A$, and an equivalent reordering, $P$, the filled graphs of
$G(A)$ and $G(PAP^T)$ are isomorphic if they are treated as unlabeled structures.
The proof of this theorem is derived by letting $F$ be the filled matrix of $A$. Set
$\bar{A} = PAP^T$ and let $\bar{F}$ be the corresponding filled matrix of $\bar{A}$. To prove the theorem
it must be shown that $G(F)$ and $G(\bar{F})$ are structurally identical. Let $x_1, x_2, \ldots, x_n$ be the
sequence of node eliminations for the matrix $A$ and let $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$ be the sequence of
node eliminations for the matrix $\bar{A}$. It is sufficient to show that $\{x_i, x_j\}$ is an edge in the
filled graph $G(F)$ if and only if $\{\bar{x}_f, \bar{x}_s\}$ is an edge in $G(\bar{F})$, where $\bar{x}_f = x_i$ and $\bar{x}_s = x_j$.
Assume that $\{x_i, x_j\}$ is an edge in $G(F)$ and $i > j$. By Theorem (4), $x_i$ is a proper
ancestor of $x_j$ in the elimination tree $T[A]$. As $P$ is a topological ordering, the node $\bar{x}_f$
is labeled after $\bar{x}_s$ so that $f > s$. By Theorem (3), there exists a path in the graph $G(A)$
\[
\bar{x}_f = x_i, x_{p_1}, \ldots, x_{p_t}, x_j = \bar{x}_s
\]
such that $\{x_{p_1}, \ldots, x_{p_t}\} \subseteq T[x_j]$. Due to the property of
the topological ordering $P$, these nodes $x_{p_1}, \ldots, x_{p_t}$ are labeled before $\bar{x}_s$ in the matrix $\bar{A}$.
Therefore, by Theorem (5), $\{\bar{x}_f, \bar{x}_s\}$ is also an edge in the filled graph $G(\bar{F})$.
Conversely, let $\{\bar{x}_f, \bar{x}_s\}$ be an edge in $G(\bar{F})$. For the sake of definiteness, let $f > s$. Note
that $x_i$ does not belong to the subtree $T[x_j]$, otherwise it contradicts the topological ordering
property of $P$. By Theorem (5), there exists a path in $G(A)$
\[
\bar{x}_f, \bar{x}_{p_1}, \ldots, \bar{x}_{p_t}, \bar{x}_s
\]
such that all subscripts $p_1, \ldots, p_t$ are less than $s$. By the property of the topological reordering
of $P$, the nodes in $\{\bar{x}_{p_1}, \ldots, \bar{x}_{p_t}\}$ cannot be ancestors of $x_j$. All the nodes on the path
$\bar{x}_f, \bar{x}_{p_1}, \ldots, \bar{x}_{p_t}, \bar{x}_s$ belong to the connected component containing the node $x_j$ in the
subgraph of $G(A)$ excluding the set of proper ancestors of $x_j$ in the tree. By Corollary (2)
these nodes all belong to the subtree $T[x_j]$. Thus there exists a path in $G(A)$ from $\bar{x}_s = x_j$ to
$\bar{x}_f = x_i$ through nodes in the subtree $T[x_j]$ and $x_i$ is outside of $T[x_j]$. Again by Corollary (2),
this means that $x_i$ is a proper ancestor of $x_j$ so that $i > j$. Therefore, using Theorem (3),
$\{x_i, x_j\}$ is also an edge in the filled graph $G(F)$.
Theorem 2 implies that every topological reordering of $A$ is an equivalent reordering of
the matrix $A$. The corollary to this theorem is that the tree $T[PAP^T]$ is isomorphic to
$T[A]$ if they are treated as unlabeled structures.
¹ A topological ordering of a rooted tree is one that numbers the child nodes before their parent nodes,
i.e. the leaves are numbered first and the root is numbered last.
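The elimination tree and the connectivity property of Theorem 1 can be explored numerically. The following is a minimal sketch, assuming that nodes are numbered 0 to n-1 in elimination order and that the graph is supplied as an edge list. It uses naive symbolic elimination rather than any efficient algorithm from [52], and the function names and the example graph are hypothetical.

```python
def elimination_tree(n, edges):
    """Return parent[] of the elimination tree for elimination order 0..n-1.

    Eliminating node j joins its higher-numbered neighbours pairwise
    (fill-in); parent[j] is the lowest-numbered of those neighbours.
    """
    adj = [set() for _ in range(n)]
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    parent = [None] * n
    for j in range(n):
        higher = sorted(v for v in adj[j] if v > j)
        if higher:
            parent[j] = higher[0]
            for a in higher:              # add fill edges
                for b in higher:
                    if a != b:
                        adj[a].add(b)
    return parent


def subtree(parent, root):
    """Nodes of the subtree rooted at `root` (children precede parents)."""
    nodes = {root}
    for v in range(root - 1, -1, -1):
        if parent[v] in nodes:
            nodes.add(v)
    return nodes


def is_connected(nodes, edges):
    """True if `nodes` induce a connected subgraph of the original graph."""
    nodes = set(nodes)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == u and y in nodes and y not in seen:
                    seen.add(y)
                    stack.append(y)
    return seen == nodes


edges = [(0, 3), (1, 3), (2, 4), (3, 4)]   # hypothetical 5 node graph
parent = elimination_tree(5, edges)
theorem_1_holds = all(is_connected(subtree(parent, j), edges) for j in range(5))
```

Disjoint subtrees of `parent` therefore correspond to subnetworks that can be processed independently, which is the property exploited when partitioning.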
Appendix F 
Reducing the Length of Intertask 
Messages 
Consider the communication of update values resulting from factorisation from a sending
worker task, $T_s$, to a receiving worker task, $T_r$. Before the communication occurs $T_s$ must
generate the message it wishes to send as an array of bytes. The Transputer communication
primitives send a specified number of bytes, starting at a given address, to the receiving task
[103]. Hence the data which makes up the message must lie contiguously in memory. $T_s$
holds the data it wishes to send in the form of sparse matrix linked lists and it must convert
this into an array representation before transmission as the list elements are likely to be
scattered anywhere in the task's memory space. Recall from Chapter 2 that there is a separate
linked list for each row in the matrix. Knowledge of which list is being used is sufficient to
uniquely identify the corresponding row in the coefficient matrix. Each list element has three
parameters corresponding to column index, real part and imaginary part of the complex
value respectively. Knowing the column index and which list is being consulted gives enough
information to uniquely identify a single element in the coefficient matrix. These four
parameters of each element to be updated by $T_r$ need to be inserted into the message transmitted
by $T_s$. The simplest algorithm for correctly establishing the message array is shown below.
set pointer message_ptr to point at start of message
loop i over range of rows which reference subnetwork data held by Tr
    loop j from 1 to length of row i
        if an update in Tr results from element i,j
            Place { i, j, a_ij, jb_ij } into message at position indicated by message_ptr
            Increment message_ptr by four
    end loop j
end loop i
message_length = message_ptr - address of start of message
Algorithm for generating intertask messages
$T_s$ sends the message by instructing the Transputer to send message_length bytes of data
located at address start_of_message to $T_r$. $T_r$ receives the message and stores it in an array
located at address head_of_message. It must then decode the message and add the values
contained in the message to the linked list representation of its submatrix. The following
algorithm performs the decoding and updating function
loop posn from head_of_message to head_of_message + message_length in steps of 4
    extract { row, column, a, jb } from the location posn in the message
    locate linked list for given row
    search for element corresponding to column
    if this element is found
        Add a + jb to the value of this element
    else
        insert a new list element with column index = column, value = a + jb
end posn loop
Algorithm for decoding intertask messages
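The two algorithms can be sketched in Python as follows. This is not the Transputer implementation: a flat Python list stands in for the contiguous message array, a dictionary of dictionaries stands in for the per-row linked lists, and the names `encode_updates` and `decode_updates` and the sample values are assumptions made for illustration.

```python
def encode_updates(updates):
    """Flatten {(row, col): complex} into [row, col, re, im, row, col, ...]."""
    message = []
    for (row, col), value in sorted(updates.items()):
        message.extend([row, col, value.real, value.imag])
    return message


def decode_updates(message, matrix):
    """Add every (row, col, re, im) record to a {row: {col: value}} matrix."""
    for posn in range(0, len(message), 4):
        row, col, re, im = message[posn:posn + 4]
        row_list = matrix.setdefault(row, {})   # locate the list for this row
        # Add to an existing element, or insert a new one if none is found
        row_list[col] = row_list.get(col, 0j) + complex(re, im)
    return matrix


# Three updates to row 7, as in the three element example
updates = {(7, 2): 1 + 2j, (7, 5): 3 - 1j, (7, 9): 0.5j}
message = encode_updates(updates)
matrix = decode_updates(message, {7: {2: 1 + 0j}})
```

Each update occupies four entries, so the row index 7 is stored once per element even though all three updates belong to the same row.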
This method of message generation and decoding has the advantage that it is easy to
implement. However it is a rather inefficient method as it unnecessarily replicates
information about row addresses within the messages. Consider a row i in the coefficient matrix of
$T_s$ which has entries in columns k, l, m. Updates will be required to elements (i, k), (i, l) and
(i, m). Suppose that these elements lie in $T_r$'s submatrix. A message must be sent from $T_s$
to $T_r$ which contains the values associated with these updates and the message has the
contents shown in Figure F.1.
i  k  a_ik  jb_ik  i  l  a_il  jb_il  i  m  a_im  jb_im
Figure F.1: The example three element message
Notice that the row address i appears in the message three times
and unnecessarily increases the length of the message. Suppose that the location of the first
value referencing row i is known (START), as is the location of the last value referencing
row i (END). Suppose also that all values referencing row i lie contiguously between these
two locations. It is then no longer necessary to store the row parameter for each element
and the message can be reduced to three quarters of its original length. The locations of
START and END for each row can be stored in the message header as a partitioning table
which shows how to partition the message into its constituent row information. Note that
the END of row i is immediately adjacent to the START of row i + 1. Hence the values
for row i must be located in the range of entries START(i) to START(i+1)-1. Consequently
the partitioning table only needs to store the value of START for each row. Provided that
the range of rows to be updated is known, the receiving task $T_r$ can update its coefficient
matrix using the following algorithm
loop i over range of rows to be updated
    loop j from START(i) to START(i+1)-1
        extract { column, a, jb } from position j in the message
        locate linked list for row i
        search for element corresponding to column
        if this element is found
            Add a + jb to the value of this element
        else
            insert a new list element with column index = column, value = a + jb
    end j loop
end i loop
Modified algorithm for decoding intertask messages
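A possible rendering of the modified format is sketched below. It is an assumption-laden illustration rather than the thesis code: the partitioning table is returned separately from the body instead of being packed into a header, an extra sentinel entry closes the last row, and the function names and values are hypothetical.

```python
def encode_partitioned(updates, rows):
    """Return (start_table, body) for updates given as {row: {col: value}}.

    body holds [col, re, im] triples grouped by row. start_table[n] is the
    index of the first triple of rows[n]; a final sentinel marks the end.
    """
    start_table, body = [], []
    for row in rows:
        start_table.append(len(body) // 3)
        for col, value in sorted(updates.get(row, {}).items()):
            body.extend([col, value.real, value.imag])
    start_table.append(len(body) // 3)
    return start_table, body


def decode_partitioned(start_table, body, rows, matrix):
    """Apply the updates to a {row: {col: value}} matrix."""
    for n, row in enumerate(rows):
        row_list = matrix.setdefault(row, {})   # located once per row
        for j in range(start_table[n], start_table[n + 1]):
            col, re, im = body[3 * j:3 * j + 3]
            row_list[col] = row_list.get(col, 0j) + complex(re, im)
    return matrix


updates = {7: {2: 1 + 2j, 5: 3 - 1j, 9: 0.5j}, 8: {1: 4 + 0j}}
rows = [7, 8]
start_table, body = encode_partitioned(updates, rows)
decoded = decode_partitioned(start_table, body, rows, {})
```

Besides shortening the message, this layout lets the receiver locate the list for each row once rather than once per element.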
The format of the modified message is shown in Figure F.2(a) whilst Figure F.2(b)
shows the contents of the message from the simple three element example introduced in
Figure F.1.
Figure F.2: Modified message structure (a) and three element example message (b)
The modified message format does yield significant savings in message length for the
lengthy update messages of real systems. The reduction in communication time resulting
from this reduction in message length is negligible but the reduction in the amount of
time taken to generate and decode messages is quite considerable. A significant increase
in performance was observed when the modified message format was implemented in the
Transputer-based RP solution. For example, when split into four major and three minor
subnetworks, the CEGB 734 node system was processed in 95ms using the modified message
format as opposed to 175ms for the unmodified case.
Appendix G 
Monitoring the Performance of 
the Parallel Solution 
One of the simplest methods for accurately determining the execution time of a program
is to use some external timer hardware that has the desired timing resolution and can be
triggered by instructions in the program to be monitored. This sort of external hardware is
not sufficiently flexible to allow the computer to poll it for values, making it difficult for the
monitored program to associate an accurate timestamp with each event in the program's
execution. Fortunately many high level languages provide software implemented timing
facilities and these can be used to monitor execution time and provide timestamps. In DOS
and Unix environments these timers are usually defined to have a resolution of 1ms, as
the languages define a constant 1000 clock ticks per second. Unfortunately this does not
portray the true resolution of the timer as timers implemented in software are interrupt
driven and the timer value is only incremented every 50ms or so. In practice this means
that any action which is performed in less time than the update interval is recorded as
taking zero time. For very fast programs this degree of accuracy is insufficient, hence the
need to use external timing hardware.
The Transputer programmer is lucky in that two timers are implemented in the hardware
of each Transputer and these timers may be polled from software. As they are free running
hardware timers they are highly accurate and the slowest of the two timers has a resolution
of 1µs [22], making it useful in almost all applications. A program running on a Transputer
may monitor its execution time by polling the timer at the first and last instruction of
the program and taking the difference between the two times. A specific action or event
within the program may be timestamped by polling the timer immediately prior to that
event occurring. The INMOS C language provides numerous functions for performing such
temporal manipulation and comparison.
Figure G.1: A Gantt chart used to visualise program operation
Timestamping the actions of a program can provide information about the efficiency
of the components of the program. When parallel programs are considered, timing the
execution of each task and timestamping actions within tasks provides comprehensive
information illustrating how tasks interact and how the processing resources are used during
the execution of the program. Of particular interest is the identification of inefficient idle
states. Poring over the numeric results is one way of identifying what is happening within
the program but the whole process becomes much more intuitive if some form of
visualisation technique is used. The Gantt chart [19, 106] is a useful tool for visualising parallel
programs. The chart consists of a list of all the processors in the parallel machine. A
list of tasks is allocated to each processor and these tasks are ordered by their start and
finish times. Figure G.1 shows a typical Gantt chart. White areas in the chart represent
times when the processors are busy performing useful work, grey areas represent times when
they are idle. In implementing the Transputer-based RP program the need for a method
of monitoring and visualising program performance was identified. The lack of a suitable
development environment made it hard to accurately time the execution of the program
and quantify the effect of program modifications. A performance monitoring method was
developed out of necessity and its main features are now discussed.
The aim of modifying a program is to refine its operation and alter those parts which
will yield the maximum improvement in performance. Identification of the inefficient
components is needed and this is where the timestamp information comes in useful [107]. If two
actions, A and B, occur sequentially within a program, where A commences at time $t_A$ and
B commences at time $t_B$, the execution time of A is clearly
\[
t_{eA} = t_B - t_A \tag{G.1}
\]
If the commencement of each major program action is timestamped then the difference
between adjacent entries in the list of ordered timestamps gives a list of execution times
for each of the major program events. Once again it is more intuitive if the information is
displayed in a pictorial manner. Adopting the Gantt chart gives a suitable diagrammatic
representation. The chart comprises a list of all the tasks in the program. Each task is
allocated a list of all the major actions in that task and these are ordered by their start and
finish times. Suitable positioning of timestamping within each task allows the idle states
to be identified. Idle states are shown as white regions in the columns of the chart, shaded
regions correspond to busy (useful) states. Different shadings correspond to different actions
which occur in each task. Figure G.2 shows the chart for a fifteen task implementation of
the Recursively Parallel method.
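The step from ordered timestamps to execution times, as in (G.1), can be illustrated with a short sketch. The function name, the action labels and the tick values below are hypothetical; the final timestamp is assumed to mark the end of the task rather than the start of a further action.

```python
def execution_times(timestamps):
    """Convert [(action, start_time), ...] into [(action, duration), ...].

    The duration of each action is the start time of the next entry
    minus its own start time; the last entry only closes the list.
    """
    ordered = sorted(timestamps, key=lambda entry: entry[1])
    return [(name, nxt[1] - start)
            for (name, start), nxt in zip(ordered, ordered[1:])]


# Hypothetical timestamps (in timer ticks) recorded by one worker task
stamps = [("factorise", 0), ("idle", 40), ("forward substitute", 55), ("end", 90)]
profile = execution_times(stamps)
```

One such list per task supplies the bar lengths for a column of the chart.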
Generating the timing information needed to derive the execution profile charts is not
altogether straightforward. The hardware timers on each Transputer are not synchronized
and the lack of a global clock makes it difficult to locate actions in absolute time. The
solution is to use one task as a time reference and to synchronise the other tasks to this
reference. The supervisor task residing on the root processor is the most logical choice of
reference task. The simplest method of synchronizing worker task execution with that of
the supervisor is to make the first instruction of the worker one which blocks and waits
on a start signal from the supervisor task. Polling the hardware timer to record a start
time is the second instruction executed by a worker. As Figure G.3 shows there is an
inherent problem in this approach. Due to the lack of message broadcasting facilities on
the Transputer, each worker is started in turn. This causes the starting of the worker tasks
to be staggered in time with respect to the absolute time reference and this gives rise to
inaccurate timing results. Furthermore the supervisor cannot 'start' until the last worker
has been activated. Although the supervisor does no useful processing it can be used to
take a stopwatch timing of the solution to validate other timing information. The staggered
starting causes the stopwatch timer to be started too late and incorrect results are returned.
Figure G.2: Visualisation of the RP program execution profile (legend: Idle, Factorise,
Forward Substitute)
This problem has been surmounted by using the block-and-wait strategy in conjunction
with variable delay methods. At some point, $t_0$, the reference task obtains the time from
its local hardware timer. It then picks a start time, $t_s$, where
\[
t_s = t_0 + \delta t \tag{G.2}
\]
where $\delta t$ is of the order of 2 seconds. Before $t_0$ all the worker tasks have blocked and are
awaiting communication with the reference (supervisor) task. This task considers all the
workers in turn and before communicating with the $i$th task the reference task polls its local
timer to read the time $t_i$, where
\[
t_0 < t_i < t_s \tag{G.3}
\]
A delay, $\delta t_i$, is calculated for the $i$th task
\[
\delta t_i = t_s - t_i \tag{G.4}
\]
and this value is communicated to task i. Upon receipt of this delay value task i polls its
local timer to obtain the time $t'_i$. The delay is added to this time to give the start time
\[
t_{si} = t'_i + \delta t_i \tag{G.5}
\]
and processing cannot commence until after this time. Task i then suspends itself until
the local timer value exceeds $t_{si}$. The approach is shown diagrammatically in Figure G.4.
Whilst it does not guarantee that each task will commence execution at exactly the same
instant, $t_s$, in global time it does guarantee that every task will commence within a few
processor clock cycles of this time and this is sufficiently accurate.
Figure G.3: The staggering effect of block-and-wait synchronisation
A staggering in time can still occur due to the delay introduced in sending the value $\delta t_i$
across the network to the $i$th task. This can be circumvented by modifying (G.4) such that
\[
\delta t_i = t_s - t_i - C(\mathrm{ref} \leftarrow i) \tag{G.6}
\]
where $C(\mathrm{ref} \leftarrow i)$ is the average time taken to transfer a message of the same size as $\delta t_i$
between the reference (supervisor) task and task i. Values of $C(\mathrm{ref} \leftarrow i)$ can be calculated
for $i = 1, \ldots, n$ by sending 1000 fixed length messages to each task and monitoring the
total transfer time from the sending end. Dividing this by 1000 gives the mean transfer
time $C(\mathrm{ref} \leftarrow i)$. This analysis of the communication network characteristics is performed
before $t_0$ and before the worker tasks block-and-wait on the reference task. The algorithms
below show the implementation of the timing mechanism in the supervisor and worker tasks.
Figure G.4: Synchronisation through variable delays
(a) Supervisor task
for i = 1 to n
    for j = 1 to 1000
        send fixed length message to task i
        store transfer time
    end j
    C(ref <- i) = total transfer time / 1000
end i
t_0 = result of timer poll
δt = 2 seconds
t_s = t_0 + δt
for i = 1 to n
    t_i = result of timer poll
    δt_i = t_s - t_i - C(ref <- i)
    send δt_i to task i
end i
suspend until t_s

(b) Worker task
for j = 1 to 1000
    receive fixed length message
end j
receive δt_i from supervisor
t'_i = result of timer poll
t_si = t'_i + δt_i
suspend until t_si
commence RP solution
Timing mechanism in the supervisor (a) and worker (b) tasks
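The effect of the variable delay scheme can be checked with a small simulation. The sketch below is a model, not Transputer code: it assumes that each worker clock differs from the supervisor clock by a fixed offset, that each delay message is sent at the instant $t_i$ is polled, and that the measured mean transfer time equals the true transfer time. All names and numbers are hypothetical.

```python
def synchronised_starts(clock_offsets, latencies, poll_times, t0=0.0, delta_t=2.0):
    """Return each worker's start time expressed in supervisor time.

    clock_offsets[i]: worker i's local clock minus the supervisor clock
    latencies[i]:     transfer time of the delay message to worker i
    poll_times[i]:    supervisor time t_i at which the delay is computed
    """
    t_s = t0 + delta_t                          # (G.2)
    starts = []
    for offset, latency, t_i in zip(clock_offsets, latencies, poll_times):
        delay = t_s - t_i - latency             # (G.6)
        arrival = t_i + latency                 # message reaches the worker
        t_local = arrival + offset              # worker polls its own timer
        t_start_local = t_local + delay         # (G.5)
        starts.append(t_start_local - offset)   # back to supervisor time
    return starts


# Three workers with unsynchronised clocks and different link delays
starts = synchronised_starts(
    clock_offsets=[5.0, -3.2, 100.0],
    latencies=[0.001, 0.002, 0.003],
    poll_times=[0.1, 0.2, 0.3],
)
```

In this idealised model every worker starts at $t_s$ regardless of its clock offset; in practice the residual error is the difference between the measured mean transfer time and the actual transfer time of the delay message.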
The impact of these visualisation and performance monitoring techniques on the op-
timization of the Transputer-based RP solution cannot be stressed enough. Most of the 
improvements to the program code arose as direct results of observations on charts of the 
form of Figure G.2. Accurate calculations of speed-up were only possible after the imple-
mentation of the timing and synchronous starting techniques. 
Appendix H 
Test Systems 
The following pages give tree diagrams for the IEEE 118 node network and the reduced 
CEGB 629 and CEGB 734 node networks used as test systems for the research work de-
scribed in this thesis. Tree diagrams for the 1624 node representation of the Eastern U.S 
power system could not be incorporated as they are too large to be printed. However
coefficient matrix sparsity plots are given for all four systems. The sparsity plots show the
structure of the matrix for a given system after optimal ordering and again after partition-
ing into the required number of subnetworks. The RBBDF matrix structure is apparent in 
the sparsity plots for the partitioned systems. 
Figure H.1: Weighted elimination tree for MDMLLRU ordered IEEE 118 Node System
Figure H.2: Weighted elimination tree for MDMLLRU ordered CEGB 629 Node System
Figure H.3: Weighted elimination tree for MDMLLRU ordered CEGB 734 Node System
Figure H.4: MDMLLRU Ordered IEEE 118 Node System (nz = 476)
Figure H.5: MDMLLRU Ordered IEEE 118 Node System, partitioned into 2 major and 1
minor subnetworks (nz = 476)
Figure H.6: MDMLLRU Ordered IEEE 118 Node System, partitioned into 4 major and 3
minor subnetworks (nz = 476)
Figure H.7: MDMLLRU Ordered IEEE 118 Node System, partitioned into 8 major and 7
minor subnetworks (nz = 476)
Figure H.8: MDMLLRU Ordered CEGB 629 Node System (nz = 2301)
Figure H.9: MDMLLRU Ordered CEGB 629 Node System, partitioned into 2 major and 1
minor subnetworks (nz = 2301)
Figure H.10: MDMLLRU Ordered CEGB 629 Node System, partitioned into 4 major and
3 minor subnetworks (nz = 2301)
Figure H.11: MDMLLRU Ordered CEGB 629 Node System, partitioned into 8 major and
7 minor subnetworks (nz = 2301)
Figure H.12: MDMLLRU Ordered CEGB 734 Node System (nz = 2696)
Figure H.13: MDMLLRU Ordered CEGB 734 Node System, partitioned into 2 major and
1 minor subnetworks (nz = 2696)
Figure H.14: MDMLLRU Ordered CEGB 734 Node System, partitioned into 4 major and
3 minor subnetworks (nz = 2696)
Figure H.15: MDMLLRU Ordered CEGB 734 Node System, partitioned into 8 major and
7 minor subnetworks (nz = 2696)
Figure H.16: MDMLLRU Ordered CEGB 734 Node System, partitioned into 16 major and
15 minor subnetworks (nz = 2696)
Figure H.17: MDMLLRU Ordered US 1624 Node System (nz = 6050)
Figure H.18: MDMLLRU Ordered US 1624 Node System, partitioned into 2 major and 1
minor subnetworks (nz = 6050)
Figure H.19: MDMLLRU Ordered US 1624 Node System, partitioned into 4 major and 3
minor subnetworks (nz = 6050)
Figure H.20: MDMLLRU Ordered US 1624 Node System, partitioned into 8 major and 7
minor subnetworks (nz = 6050)
Figure H.21: MDMLLRU Ordered US 1624 Node System, partitioned into 16 major and
15 minor subnetworks (nz = 6050)
